In [1]:
# imports and initialization

from IPython.display import HTML

# Sentiment Analysis
Sentiment analysis is a particularly interesting application of deep learning in the field of *Natural Language Processing* (NLP). In sentiment analysis, we take a passage of text as input to our network and output the sentiment it expresses. The output can be coarse, such as positive or negative, or more specific, such as happy, confused, or angry.

The primary problem here is that the input to the network is words. Neural Networks do not operate on words directly, so before we can pass our data through a Neural Network we must first transform the text into a format the network can understand: numbers.

This video introduces the problem and describes what we hope to achieve in this notebook.

In [2]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/da1I0mea1jQ?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

Before we create a Neural Network, we need to validate our theory about which data representation will provide a *signal* that correlates the input data with the output sentiment. When taking text as input, the most common way of representing the data to generate this signal is a *bag-of-words* representation, where each review is represented by the number of times each word is used. This should help us identify sentiment, as words such as 'terrible', 'fantastic', 'awful' and 'excellent' are likely to occur more often in reviews of one sentiment than the other.
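As a minimal sketch of the bag-of-words idea, the snippet below counts word occurrences in a single review. The function name `bag_of_words`, the lowercasing, and the whitespace tokenisation are assumptions made for illustration; the lab may tokenise differently.

```python
from collections import Counter

def bag_of_words(review):
    """Count how many times each whitespace-separated word occurs in a review."""
    # Lowercasing and splitting on whitespace are simplifying assumptions.
    return Counter(review.lower().split())

review = "The acting was excellent and the plot was excellent"
counts = bag_of_words(review)

print(counts["excellent"])  # number of times 'excellent' appears
print(counts["terrible"])   # a Counter returns 0 for words it has not seen
```

A `Counter` is convenient here because missing words default to a count of zero, which matches the idea that every word in the vocabulary has a count for every review.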

In [4]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/IsTOnkAKaJw?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

Note that all the lab work can be found here, and this contains many of the important notes for the lesson:

Link to [Sentiment Analysis Lab](Labs/sentiment-analysis/Sentiment_Classification_Projects.ipynb)

### Part 1: Verifying the signal
Initially our intuition said that the count of each word would give us a good correlation between the input and the output sentiment. However, if we were to simply take a raw count of all of the words, the signal gets lost in a lot of noise, as there are a very large number of neutral words that occur in large quantities in both positive and negative samples (words such as `a`, `the`, `.`, `in`, `and` etc.). So instead, a better representation is:

    log(positive_count / negative_count)
    
By taking the ratio of a word's count in the positive reviews to its count in the negative reviews, we can see which words occur more often in positive text than in negative text. Taking the log of this ratio centers the scale: a word that is equally common in both has a ratio of 1, which maps to 0, so neutral words are approximately 0, negative words have a negative value and positive words have a positive value. This allows us to easily identify which words contribute to the signal that the neural network will try to learn.

Note that this is only used for verifying the signal exists in the representation of a paragraph as a count of individual words. We will next use this knowledge to define our method for converting the paragraph into a numerical input ready for a Neural Network.
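The log ratio described above can be sketched as follows. The function name, the `"POSITIVE"`/`"NEGATIVE"` label strings, and the add-one smoothing are assumptions for illustration: the smoothing is not part of the formula above, and is only there so that a word missing from one class does not cause a division by zero or `log(0)`.

```python
import math
from collections import Counter

def positive_to_negative_log_ratios(reviews, labels):
    """Return log((positive_count + 1) / (negative_count + 1)) for every word."""
    positive_counts, negative_counts = Counter(), Counter()
    for review, label in zip(reviews, labels):
        # Label strings are an assumption; adapt to the dataset's own labels.
        target = positive_counts if label == "POSITIVE" else negative_counts
        target.update(review.lower().split())

    ratios = {}
    for word in set(positive_counts) | set(negative_counts):
        # Add-one smoothing (an assumption) keeps the ratio finite and non-zero.
        ratio = (positive_counts[word] + 1) / (negative_counts[word] + 1)
        ratios[word] = math.log(ratio)
    return ratios

reviews = ["an excellent and fantastic film", "a terrible and awful film"]
labels = ["POSITIVE", "NEGATIVE"]
ratios = positive_to_negative_log_ratios(reviews, labels)

print(ratios["film"])       # appears equally in both classes, so the log ratio is 0.0
print(ratios["excellent"])  # only in the positive review, so the value is positive
print(ratios["terrible"])   # only in the negative review, so the value is negative
```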

In [6]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/l4r5l0HvHRI?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

### Part 2: Transforming the Input Data
Now that we have identified the pattern, we need to define our input data conversion and our expected output data format. This is relatively simple; we will just count the number of times each word occurs in our review and use this as input to the network. The output will be a single binary output, where 0 represents negative and 1 represents positive.

See the associated lab for more details.
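A rough sketch of this conversion is shown below. The helper names (`build_word_index`, `review_to_counts`, `label_to_target`) and the `"POSITIVE"` label string are hypothetical and chosen for illustration; see the lab for the version actually used.

```python
import numpy as np

def build_word_index(reviews):
    """Map every distinct word in the corpus to a fixed input position."""
    vocab = sorted({word for review in reviews for word in review.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def review_to_counts(review, word2index):
    """Turn a review into a vector of word counts, one slot per vocabulary word."""
    layer_0 = np.zeros(len(word2index))
    for word in review.lower().split():
        if word in word2index:  # words outside the vocabulary are ignored
            layer_0[word2index[word]] += 1
    return layer_0

def label_to_target(label):
    """Encode the label as a single binary output: 1 = positive, 0 = negative."""
    return 1 if label == "POSITIVE" else 0

reviews = ["good good film", "bad film"]
word2index = build_word_index(reviews)

print(word2index)
print(review_to_counts(reviews[0], word2index))
print(label_to_target("POSITIVE"), label_to_target("NEGATIVE"))
```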

In [7]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/7rHBU5cbePE?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

In [10]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/45ihpPaeO8E?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

### Part 3: Building a Neural Network
Now that we have our data in a format we can use as input to the Neural Network, it is time to build the architecture. Again, see the lab for the full implementation. Using a simple network with one input node for every word in our vocabulary (~74000), 10 hidden units and 1 output node, we are able to show that the network starts to learn something that allows us to predict sentiment (though not very well). However, there are other techniques we can employ to start improving the performance of the network.
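To make the shape of this architecture concrete, here is a small NumPy sketch of a network with one input node per vocabulary word, 10 hidden units and 1 output node. It is not the lab's implementation: the class name, the linear hidden layer, the sigmoid output, the squared-error gradient, and the weight initialisation are all assumptions chosen to keep the example short.

```python
import numpy as np

class TinySentimentNetwork:
    """Illustrative input -> hidden -> output network (not the lab's code)."""

    def __init__(self, input_nodes, hidden_nodes=10, learning_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed initialisation: zeros for input->hidden, small noise for hidden->output.
        self.weights_0_1 = np.zeros((input_nodes, hidden_nodes))
        self.weights_1_2 = rng.normal(0.0, hidden_nodes ** -0.5, (hidden_nodes, 1))
        self.learning_rate = learning_rate

    @staticmethod
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(self, layer_0):
        layer_1 = layer_0 @ self.weights_0_1                 # linear hidden layer
        layer_2 = self.sigmoid(layer_1 @ self.weights_1_2)   # probability-like output
        return layer_1, layer_2

    def train_step(self, layer_0, target):
        layer_1, layer_2 = self.forward(layer_0)
        # Gradient of squared error through the sigmoid output.
        layer_2_delta = (layer_2 - target) * layer_2 * (1 - layer_2)
        layer_1_delta = layer_2_delta @ self.weights_1_2.T
        self.weights_1_2 -= self.learning_rate * np.outer(layer_1, layer_2_delta)
        self.weights_0_1 -= self.learning_rate * np.outer(layer_0, layer_1_delta)
        return float(layer_2[0])

# Toy usage: a 3-word vocabulary instead of ~74000 words.
network = TinySentimentNetwork(input_nodes=3)
positive_input = np.array([1.0, 0.0, 0.0])
negative_input = np.array([0.0, 1.0, 0.0])
for _ in range(200):
    network.train_step(positive_input, 1)
    network.train_step(negative_input, 0)

print(network.forward(positive_input)[1])  # should have moved above 0.5
print(network.forward(negative_input)[1])  # should have moved below 0.5
```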

In [11]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/imnxzCev4SI?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

### Part 4: Understanding Neural Noise
In our Neural Network, we noticed that it took a long time to train, and the final accuracy was still quite poor. When this happens, we need to think about what the signal is in our data and what is causing noise on that signal, making it difficult for the network to learn the correct patterns. By improving the signal-to-noise ratio, whether by tweaking the network architecture, the network parameters, the data format, or any number of other things, we can drastically improve the network's performance.

One of the major sources of noise in the current methodology is the prominence of filler words in the reviews. We initially chose to use the counts of each word in the review as the input to the network. However, in most reviews there is a large number of irrelevant words such as `the`, `a` and `.` which dominate the input, making it very hard for the neural net to find the signal from the important words like `brilliant` or `offensive`. To remove this noise, instead of using the *counts* of each word in the review, we can use a binary switch: if a word is present in the review, the node takes a value of 1, otherwise it takes a value of 0. This drastically reduces the noise, and also drastically improves the training process.
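The difference between the two encodings can be sketched on a toy example. The vocabulary, the review text, and the function names below are made up for illustration; the only change between the two functions is `+= 1` versus `= 1`.

```python
import numpy as np

def review_to_counts(review, word2index):
    """Count-based input: frequent filler words get large values."""
    layer_0 = np.zeros(len(word2index))
    for word in review.lower().split():
        if word in word2index:
            layer_0[word2index[word]] += 1
    return layer_0

def review_to_presence(review, word2index):
    """Binary input: 1 if the word occurs at all, 0 otherwise."""
    layer_0 = np.zeros(len(word2index))
    for word in review.lower().split():
        if word in word2index:
            layer_0[word2index[word]] = 1  # assignment instead of accumulation
    return layer_0

# Hypothetical three-word vocabulary and review.
word2index = {"the": 0, "film": 1, "brilliant": 2}
review = "the film the cast the score the brilliant ending"

count_input = review_to_counts(review, word2index)
presence_input = review_to_presence(review, word2index)
print(count_input)     # 'the' dominates with a value of 4
print(presence_input)  # every present word contributes equally
```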

In [12]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ubqhh4Iv7O4?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

### Part 5: Analyzing Inefficiencies in Our Network
Another source of inefficiency is in the actual computational expense required to do a forward pass and backprop pass on a network with 74000 input nodes. Because our input data is very sparse (meaning that the majority of the input values are 0), we are actually spending a lot of computational effort performing operations that make no difference to the actual training process (as it involves summing something multiplied by 0). Doing this tens of thousands of times per training step seems very inefficient, and so we can tweak our training process to skip these unnecessary operations and speed things up.
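One way to skip the multiplications by zero is sketched below, assuming the binary inputs described earlier: with inputs that are only 0 or 1, the hidden layer is simply the sum of the weight rows for the words that are present. The variable names, the random weights, and the word indices are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_nodes = 74000, 10
weights_0_1 = rng.normal(0.0, 0.1, (vocab_size, hidden_nodes))

# Indices of the words present in one review (placeholder values).
present_indices = [12, 4057, 39118]

# Dense forward pass: multiplies all 74000 inputs, almost all of them zero.
layer_0 = np.zeros(vocab_size)
layer_0[present_indices] = 1
dense_layer_1 = layer_0 @ weights_0_1

# Sparse forward pass: only touch the rows for words that are present.
sparse_layer_1 = weights_0_1[present_indices].sum(axis=0)

print(np.allclose(dense_layer_1, sparse_layer_1))  # True

# The same idea applies to the weight update: only rows for present words change.
learning_rate = 0.1
layer_1_delta = rng.normal(0.0, 0.01, hidden_nodes)  # placeholder gradient
weights_0_1[present_indices] -= learning_rate * layer_1_delta
```

This relies on each index appearing at most once in `present_indices`, which holds when the list is built from the set of distinct words in a review.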

In [13]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/4MuS-6ATxCU?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

In [14]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Hv86B_jjWTI?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

Another way to reduce the noise and complexity of the system is to reduce the vocabulary and only focus on the words that actually provide a signal. We did this earlier when we were validating the correlation between the words and the sentiment by calculating the ratio between the number of times a word appears in positive reviews and the number of times it appears in negative reviews. By removing the words that are generally neutral, we can drastically reduce the number of inputs to the network and make the signal much more obvious to the neural network. It is also common to remove words that have a very low frequency, as it is hard to learn a general pattern from something that is only seen once or twice. Finally, it is common to remove the *most* frequent words, as these tend to be filler words such as `the` or `and`.
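These three filters can be sketched as a single vocabulary-selection step. The function name, the parameter names (`min_count`, `max_count`, `polarity_cutoff`), and the toy numbers are hypothetical; suitable thresholds depend on the dataset and are not specified here.

```python
def select_vocabulary(total_counts, log_ratios, min_count=2,
                      max_count=None, polarity_cutoff=0.5):
    """Keep words that are frequent enough, not too frequent, and not neutral."""
    vocab = set()
    for word, count in total_counts.items():
        if count < min_count:
            continue  # too rare to learn a general pattern from
        if max_count is not None and count > max_count:
            continue  # so frequent that it is probably a filler word
        if abs(log_ratios.get(word, 0.0)) < polarity_cutoff:
            continue  # roughly neutral: little signal about sentiment
        vocab.add(word)
    return vocab

# Toy counts and positive-to-negative log ratios, made up for illustration.
total_counts = {"the": 1000, "excellent": 40, "terrible": 35, "film": 300, "zyx": 1}
log_ratios = {"the": 0.02, "excellent": 1.4, "terrible": -1.6, "film": 0.1, "zyx": 0.7}

vocab = select_vocabulary(total_counts, log_ratios, min_count=2, max_count=500)
print(sorted(vocab))  # ['excellent', 'terrible']
```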

In [15]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Kl3hWxizKVg?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

In [16]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ji0famK7gOQ?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

### Visualizing the Learning
As the network learns to identify positive or negative sentiment from the input words, it is also learning a vector space in which negative words are clustered together and positive words are clustered together. By looking at the weights of the input -> hidden part of the network, we can see that negative words have similar vectors to each other, and likewise for the positive words. This can be visualized by performing a dimensionality reduction and plotting the vector space of the positive and negative words, as shown in this video.
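One simple way to check that words with similar sentiment have similar vectors is to compare rows of the input -> hidden weight matrix with cosine similarity. The sketch below uses a tiny hand-written weight matrix purely for illustration (these are not trained weights), and the function name `most_similar` is an assumption; it also assumes no row is all zeros.

```python
import numpy as np

def most_similar(word, word2index, weights_0_1, top_n=3):
    """Rank the other words by cosine similarity of their input->hidden weight rows."""
    # Normalise each row to unit length (assumes no row is all zeros).
    vectors = weights_0_1 / np.linalg.norm(weights_0_1, axis=1, keepdims=True)
    similarities = vectors @ vectors[word2index[word]]
    ranked = sorted(word2index, key=lambda w: similarities[word2index[w]], reverse=True)
    return [w for w in ranked if w != word][:top_n]

# Illustrative 2-dimensional "weights": one row per word, not real training output.
word2index = {"excellent": 0, "fantastic": 1, "terrible": 2, "awful": 3}
weights_0_1 = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [-0.7, 0.1],
    [-0.9, 0.2],
])

print(most_similar("excellent", word2index, weights_0_1, top_n=1))  # ['fantastic']
print(most_similar("awful", word2index, weights_0_1, top_n=1))      # ['terrible']
```

With real trained weights, which have 10 dimensions per word here, a dimensionality reduction step would be needed before plotting the vectors in 2D.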

In [17]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/UHsT35pbpcE?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')