# Sentiment Analysis Overview

This is an overview of results produced in the first semester of 2018 from the NLP - Sentiment Analysis group.
Feel free to direct any questions regarding this to Steph Garland (me) at garlasl1@student.op.ac.nz.

![Divider Line](http://www.nationalwalleyetour.com/wp-content/uploads/2012/12/dividerline-transparent.png)
Broadly speaking, the goal of sentiment analysis is to determine how a subject feels about something. Whenever we need to make a decision we frequently seek out the opinions of others. This might be going to a festival because we heard it was amazing last year, buying something based on reviews we have read, or it could be a company using focus groups, surveys and opinion polls to shape the development of a product.

We also have available to us huge volumes of opinionated data recorded in digital form. Social media, blogs, forum posts, YouTube comments - opinion data has never been so publicly and abundantly available. So abundant, that we could use a little automated help with its analysis. The current favoured approach to automated sentiment analysis is to use deep learning techniques.

## Neural Networks and Deep learning:
**A common workflow in traditional programming is:**
* You receive input
* You have a rule set about what to do with that input
* You use the rule set to calculate an output.

For example, if you had a summing function that receives a 1 and a 3, in that function you have written a rule that those numbers are to be added together. Your super clever machine figures out that the output is 4.

**In supervised machine learning, this process works differently.**
* You receive lots of inputs.
* You receive what the output of each input should be.
* You use this to figure out what the rules are.
* You can now use this rule set to figure out the output of new inputs that don't already have outputs associated with them.

For example, 
![brain teaser](https://4.bp.blogspot.com/-Kyi_a2YFmWI/UbNjly69uyI/AAAAAAAAGh8/My4eIhQUsrU/s1600/Very-Easy-Number-Sequence-Puzzle.jpg)

In the above brain-teaser, if it wasn't very late and if we weren't very tired, we could figure out the rule from the first two pictures where we are given the inputs and the expected output. 

SPOILER ALERT: (8+3)*4=44 and (9+7)*2=32

We've figured out the rule with our amazing AI brains. We can now apply the rule to figure out that the third output is 5.

### Perceptrons/Neurons:
Neural networks are networks of really simple information processing units called neurons. A perceptron is a neuron where the input is limited to a 0 or 1 value. If you were making a decision about whether or not to buy a new hat, you would likely consider multiple factors, for example:

* Is it on sale? y/n
* Is my head very cold? y/n
* Does it have good reviews on rateMyHat.com? y/n

Here, the answers to the multiple questions are inputs, and the hat purchase or lack thereof is the output. Maybe you're much more concerned about being thrifty and warm than you are about what other people say about your hat. On a scale of 1-10 importance, say you rank the inputs thusly:

* Is it on sale? 5/10
* Is my head very cold? 6/10
* Does it have good reviews on rateMyHat.com? 3/10

Those are the weights associated with your inputs. Maybe your own personal hat-buying rule is that if you can accumulate 7/10, you buy a hat! This is the basic architecture of a perceptron. Each input is a 1 or 0 based on whether you answered yes or no to the question. You multiply these inputs by the weights. The weights are your rankings of how important you consider the question. The weighted sum is the combined total, and in the diagram below, the last step represents whether or not the perceptron was activated -> if the weighted sum met or exceeded your 7/10 threshold, the output is 1, the perceptron is activated, and you buy a new hat!

![Perceptron](http://ataspinar.com/wp-content/uploads/2016/11/perceptron_schematic_overview.png)
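
The hat-buying perceptron can be sketched in a few lines of plain Python. This is only an illustration of the idea above: the function name `perceptron` and the constant names are my own, and the weights and threshold are the made-up rankings from the example.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0

# Weights from the hat example: on sale, cold head, good reviews
HAT_WEIGHTS = [5, 6, 3]
HAT_THRESHOLD = 7

# On sale and well reviewed, but a warm head: 5 + 3 = 8, so we buy the hat
print(perceptron([1, 0, 1], HAT_WEIGHTS, HAT_THRESHOLD))
# Only a cold head: 6 is below the threshold of 7, so no hat
print(perceptron([0, 1, 0], HAT_WEIGHTS, HAT_THRESHOLD))
```

Swapping the 0/1 inputs for values anywhere between 0 and 1 turns this into the more general neuron described next.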

A perceptron accepts a 0 or 1 as input, whereas a neuron accepts a number between 0 and 1. This lets us be a lot more descriptive with the answers to input questions. For example, instead of answering 'Is my head very cold?' with a yes or no, '1' could represent a very cold head indeed, '0' an adequate temperature, and any answer in between can be represented on a 0-1 scale. 

You can imagine that if an outsider had enough examples of your hat buying inputs and outputs, with a bit of experimentation, they could probably work out the threshold and the hidden weight values of the neuron. 
**Deep learning** refers to this same problem solving by using neural networks that have multiple layers. We'll have thousands of inputs and our networks will contain vast amounts of neurons chained together over multiple layers, but in each neuron **we're simply trying to automate finding the truth of the hidden weights and biases.** Our deep learning model does this by trial and erroring hidden weight values to see what gets it closest to the target classification output. 

In the diagram below you can see an example of a deep learning neural network. Inputs travel through a neuron where they are multiplied by its weights and an output is calculated. This output becomes an input of the next layer of the network, and so on, allowing for increasingly complex problem solving. 
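
As a rough sketch of that layer-to-layer hand-off, here is a tiny forward pass in NumPy. The layer sizes, the random weights and the name `layer_forward` are all assumptions for illustration, and the weights are untrained, so the output itself means nothing.

```python
import numpy as np

def layer_forward(inputs, weights, biases):
    """One layer: a weighted sum per neuron, then a simple ReLU activation."""
    return np.maximum(0.0, inputs @ weights + biases)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up sizes: 3 inputs -> 4 neurons -> 2 neurons -> 1 output
w1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 2)), np.zeros(2)
w3, b3 = rng.normal(size=(2, 1)), np.zeros(1)

hat_inputs = np.array([1.0, 0.8, 0.0])       # on sale, fairly cold head, poor reviews
hidden1 = layer_forward(hat_inputs, w1, b1)  # the output of layer 1 ...
hidden2 = layer_forward(hidden1, w2, b2)     # ... becomes the input of layer 2
output = layer_forward(hidden2, w3, b3)
print(hidden1.shape, hidden2.shape, output.shape)
```

Training is the process of nudging `w1`, `w2` and `w3` until the final output matches the known labels.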

![Neural Network](https://ds055uzetaobb.cloudfront.net/image_optimizer/42f14c313680eea5b5abcd08813074f09625a70b.png)
![Divider Line](http://www.nationalwalleyetour.com/wp-content/uploads/2012/12/dividerline-transparent.png)

# Using deep learning for Sentiment Analysis:
This semester we focused on categorising opinions expressed in pieces of text. We want to be able to determine whether the writer's attitude towards the subject of any given text-snippet is positive or negative. 
In other words, our inputs were blurbs of opinionated text, and our output was a 0 or 1 based on whether the opinion was negative or positive. The goal is to build a deep learning model to figure out the hidden weights and biases of what makes an opinion get classified either way. Once a model knows the hidden rule set, it has a good shot at correctly classifying new opinions.


## Dependencies:
In order to run the code in this notebook, it will need to be opened on a Jupyter Notebook server that is running in a TensorFlow environment and has the Keras package installed. By default, notebook servers run on your local machine. At the time of writing this, the OP-VR machine has a Jupyter Notebook server running with the required dependencies. It can be connected to from within the Otago Polytech network (not wifi though). I have written a brief notebook on how to start the server from the OP-VR machine, and how to connect to it from another computer within the network.

### Python

* Python is the most widely used language for machine learning and has a huge set of libraries which can be easily used (e.g. NumPy, SciPy, scikit-learn) and are designed to optimise the heavy kind of data manipulation we require.

### TensorFlow and Keras

* [TensorFlow is an open source software library for high performance numerical computation.](https://www.tensorflow.org/) The code in this notebook uses a TensorFlow back-end. I found TensorFlow tricky to use, so I also used a library called [Keras](https://keras.io/) that runs over top of TensorFlow, making the code much easier to understand. Both have detailed installation guides on their websites.

## Not dependencies for the notebook, but recommendations for beyond notebook work:
### Anaconda

* Anaconda Navigator is an open source distribution of Python for machine learning related applications. It simplifies package management and you can create environments (containment areas so your ML Python packages don't clash with pre-existing packages). Occasionally I had trouble downloading packages. My best tip is to make sure Anaconda and all existing packages are up to date using the command **conda update -n root conda**, and then **conda update --all** in Anaconda Prompt before adding anything new. Also, make sure when updating that all applications running through Anaconda Navigator are closed (e.g. Spyder or Jupyter Notebook). You won't get an error message -> they just won't be updated.

### Save/load model

* Training models really quickly becomes time consuming. I've written a separate notebook on how to save and load a model so that hopefully you can learn how to minimise how often you're doing this. 

### Running on GPU

* Using the GPU for your machine learning computations is soo much faster. When I first switched over, the model I was working with was taking around an hour and a half to train on my laptop's CPU. It trained in 6 minutes on the OP-VR Machine's GPU. I've written a separate notebook on how to set up on GPU.

![Divider Line](http://www.nationalwalleyetour.com/wp-content/uploads/2012/12/dividerline-transparent.png)

# Training a model:
The basic work flow for solving a problem using deep learning is:
* **Find or put together a dataset** that has the inputs and outputs of the problem you're trying to solve (e.g. a good basic sentiment analysis dataset would have thousands of opinions (for the inputs), labeled correctly as either positive or negative). The larger the dataset the better.
* Put together the architecture of the model (e.g. number and type of layers)
* **Train the model on the data set.** This is where the model tries to figure out all the hidden weights so it can generate a good general rule set to use in future.
* **Test the model**'s rule set to see how accurately it calculates the output on data it hasn't been exposed to before. Typically a subset of the labelled dataset is held aside for this purpose so that the real outputs are available to compare the predicted outputs against.


## The imdb dataset:
For training this semester's models we used the [IMDB movie review dataset](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification). This is a set of 25,000 movie reviews, labeled by sentiment (positive = 1/negative = 0). One of our points of interest was in discovering how well a model trained on a movie review dataset could generalise to successfully predict non-movie related sentiment. You would expect, for example, a model trained on movie reviews to perform well on a statement like "Her acting was terrible, and the dialogue was shallow", but ideally it would also be able to classify a statement like "The way the sun feels on my face makes me dream of a simpler time".

**Note:** Below is a Jupyter Notebook code cell. To run code a cell at a time, click in the cell and push shift+enter. Depending on the nature of the code there will not always be an output, but I have included a print statement in the first to demonstrate that the code has run.

In [24]:
from keras.datasets import imdb
print("IMDB dataset has been imported")

IMDB dataset has been imported


For every model we train we use most of the dataset for training, but we also set aside some for testing afterwards to see how accurate it is at predicting unseen data. In both the training and test set we also separate out the data (the reviews) from the labels (the sentiment classification). 

**Note:**
The reviews of the imdb dataset have been preprocessed - each encoded as a sequence of word indexes (integers). So instead of each word being a string, it is represented by a number. Words are indexed by overall frequency in the dataset, so that for instance the integer "49" encodes the 49th most frequent word in the data. This is handy, because it means we can filter out uncommon words.

Below we separate the imdb dataset into training and test sets, and specify that we only want the word indexes for the most frequently used words. This is a trade off. Using more words gives a performance accuracy boost, but is computationally more expensive. [A native English speaker's vocabulary is around 20,000 to 35,000 words](http://testyourvocab.com/blog/2013-05-10-Summary-of-results). Here I've chosen to use 5000 of the most common, sacrificing accuracy for speed.

In [30]:
#TOP_WORDS is the top n most frequently used words in the dataset. 
#More uncommon words are tagged differently, essentially removing them from later computations.
TOP_WORDS = 5000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=TOP_WORDS)
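
To make the effect of the word cap concrete, here is a small sketch of what happens to word indexes that fall outside the chosen vocabulary. It imitates the behaviour described above and is not Keras's own code; the function name `cap_vocabulary` and the toy review are made up.

```python
def cap_vocabulary(encoded_review, num_words, oov_index=2):
    """Replace every index at or above num_words with the 'unknown' index."""
    return [i if i < num_words else oov_index for i in encoded_review]

# A made-up encoded review, not taken from the dataset
toy_review = [1, 14, 22, 7500, 43, 5000, 4999]
print(cap_vocabulary(toy_review, num_words=5000))
# -> [1, 14, 22, 2, 43, 2, 4999]
```

This is why a 2 turns up so often in the encoded reviews: every rare word collapses into the same "unknown" marker.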

![Divider Line](http://www.nationalwalleyetour.com/wp-content/uploads/2012/12/dividerline-transparent.png)


# A simple sentiment analysis model:

### Preparing the train/test data:
Let's look at what we've got from imdb. By looking at the shape we can see that we have 25,000 reviews. By looking at the first review, we can see that each review is stored as a list where each word is represented by a number. 

In [38]:
#Print the shape of the pre-vectorised training data
print("PRE-VECTORISED SHAPE: ", train_data.shape)
#Print the review as encoded from IMDB. Each word is mapped to a number.
print("PRE-VECTORISED WORD-ENCODINGS: ", train_data[0])

PRE-VECTORISED SHAPE:  (25000,)
PRE-VECTORISED WORD-ENCODINGS:  [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]


We can't feed a list of integers straight into our network as an input -> we need to re-package it a little. Machine Learning algorithms require all input variables and output variables to be numeric. This can be done in two steps. 

The first is integer encoding, where strings are encoded as integers. In this case, this was done for us by imdb!
When we use integer encoding, it is assumed by our algorithm that the specific integer we used for any given word matters. For example, if 'banana' was encoded as 4, and 'cat' was encoded as 5, it would be assumed that these words have a closer relationship with each other than to the word 'chair', encoded at '29'. 

To ensure we don't imply prior relationships between words, we use a second step. [One-hot-encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f) is when we create a category for every word in our vocabulary. For each category, we mark the corresponding word with a 1, and all other words as 0. This way, every word in our vocabulary has an equal ordinal relationship. For example, if the word 'Rome' was in our vocabulary, it would have its own category where its encoding was marked as a 1, and all other words marked as 0. This is demonstrated in the below diagram:
![One hot encoding](https://cdn-images-1.medium.com/max/674/1*YEJf9BQQh0ma1ECs6x_7yQ.png)

Our network expects a [tensor](https://hackernoon.com/learning-ai-if-you-suck-at-math-p4-tensors-illustrated-with-cats-27f0002c9b32). Tensors are similar to arrays and indeed are stored in NumPy arrays, so we'll import NumPy.

In [52]:
import numpy as np

#Method for transforming the imdb reviews into one-hot encodings:
def vectorize_sequences(sequences, dimension=TOP_WORDS):
    #Create an all zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences),dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1. #set specific indices of results[i] to 1s
    return results

#Transform the training data with our one-hot-encoding/vectorising method
x_train = vectorize_sequences(train_data)
#Transform the test set too
x_test = vectorize_sequences(test_data)
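
To see what this transformation does on a small scale, here is the same helper run on two made-up "reviews" with a vocabulary of only 8 word indexes. The toy data is an assumption for illustration; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Same idea as the helper above: one row per review, one column per word index."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# Two made-up reviews containing the same words in a different order
toy_reviews = [[1, 3, 3, 5], [5, 3, 1]]
toy_vectors = vectorize_sequences(toy_reviews, dimension=8)
print(toy_vectors)
```

Both rows come out identical. Each review becomes a single vector that records which words are present, so word order and repeated words are lost at this step.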

And we can inspect the vectorised data to verify the transformation. We still have 25,000 reviews, and now we have a 5000 category one-hot-encoding.

In [50]:
print("VECTORISED SHAPE: ", x_train.shape)
print("VECTORISED WORD-ENCODINGS: ", x_train[0])

VECTORISED SHAPE:  (25000, 5000)
VECTORISED WORD-ENCODINGS:  [ 0.  1.  1. ...,  0.  0.  0.]


Finally, we also re-house our label data into NumPy arrays. Here, each 1 or 0 represents the positive or negative output label of a review:

In [51]:
#Transformation of training labels
print("ORIGINAL SEQUENCE: ",train_labels)
y_train = np.asarray(train_labels).astype('float32')
print("TRANSFORMATION: ",y_train)

#Transformation of test labels
y_test =np.asarray(test_labels).astype('float32')

ORIGINAL SEQUENCE:  [1 0 0 ..., 0 1 0]
TRANSFORMATION:  [ 1.  0.  0. ...,  0.  1.  0.]


### Building a basic sequential model:
[There are two ways to build a model with Keras](https://jovianlin.io/keras-models-sequential-vs-functional/), sequentially and functionally. Sequential models are a linear stack of layers where each layer is connected to (at most) the layer directly previous and the layer directly after itself. All of the neural network examples we've seen so far have had these linear connections. For example:
![Linear layers](https://images.xenonstack.com/blog/Artificial-Neural-Network-Architecture.jpg)

To distinguish, the layers in a functionally built model can be connected to any other layer. This is helpful for some tasks, for example, [Siamese Neural Networks](https://hackernoon.com/one-shot-learning-with-siamese-networks-in-pytorch-8ddaab10340e) are built functionally in Keras and can be used in problems where the goal is to measure the relationship between two comparable inputs.

For sentiment analysis we're not comparing anything, and we don't need to add any layers out of order, so building sequentially is fine.

In [21]:
from keras import models
from keras import layers
#Specify that we're building a sequential model rather than functional:
model = models.Sequential()

We'll start with a fully-connected network of dense layers. In code, we'll pass each layer a 'hidden unit' parameter and an activation function.

**A dense layer** is simply a layer where each unit or neuron is connected to each neuron in the next layer, as pictured in the previous neural network diagram.
At each layer in our model, each neuron will calculate an output and feed it forward as an input to every neuron of the next layer. 

We pass each layer an argument referring to the number of **hidden units** it should have. Having a given number of hidden units means that the layer's weight matrix will have the shape (inputs, hidden units).

>"You can intuitively understand the dimensionality of your relu representation space as "how much freedom you are allowing the network to have when learning internal representations". Having more hidden units (a higher-dimensional representation space) allows your network to learn more complex representations, but it makes your network more computationally expensive and may lead to learning unwanted patterns (patterns that will improve performance on the training data but not on the test data)." 
>- *Chapter 3.4.3, Deep Learning with Python - Francois Chollet*

The output vector of this layer will have the same amount of dimensions as hidden units. 
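
To put numbers on those shapes, here is a small sketch that works out the weight-matrix shape and parameter count of each dense layer. It assumes the layer sizes used in this notebook (5000 inputs, then 16, 16 and 1 units) and one bias per unit; the helper name `dense_shapes` is made up.

```python
def dense_shapes(n_inputs, hidden_units):
    """Return the weight shape, bias shape and parameter count of one dense layer."""
    weight_shape = (n_inputs, hidden_units)
    bias_shape = (hidden_units,)
    n_params = n_inputs * hidden_units + hidden_units
    return weight_shape, bias_shape, n_params

# Layer sizes assumed from this notebook: 5000 inputs -> 16 -> 16 -> 1
layer_sizes = [5000, 16, 16, 1]
total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    w_shape, b_shape, n_params = dense_shapes(n_in, n_out)
    total += n_params
    print(w_shape, b_shape, n_params)
print("total trainable parameters:", total)
```

Almost all of the weights sit in the first layer, because that is where the 5000-wide input meets the 16 hidden units.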

**Activations** are functions that take the weighted sum of a neuron and decide whether or not it should fire. In our buying a hat example from earlier, we multiplied our yes/no inputs by weights that represented how important that input was to our decision-making. We then combined them into a weighted sum. We decided that if the weighted sum was 7/10 or more we would buy a hat. That setting up of a threshold and classifying either side is one way an [activation function ](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0) can work. 

The specific activation function to use varies depending on the goal of the layer. **A sigmoid function works well for a classifier**, so we'll use that in our final layer when we want to make a definitive decision between a positive or negative classification. **We'll use a ReLU (rectified linear unit) on our first two layers** because it is computationally inexpensive and is a good general approximator.
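
Both activation functions are short enough to write out by hand. This is a NumPy sketch of the standard definitions, not the code Keras runs internally, and the sample weighted sums are made up.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values become 0, positive values pass through."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

weighted_sums = np.array([-2.0, 0.0, 3.0])
print(relu(weighted_sums))
print(sigmoid(weighted_sums))
```

The sigmoid output can be read as "how positive is this review", which is why it suits the final layer of a two-class classifier.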

In [None]:
#Build a basic sequential model:
model.add(layers.Dense(16, activation='relu', input_shape=(TOP_WORDS,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

### Compile and Train:
A loss function takes the current output and compares it with the expected/true result. The less the loss, the closer your results are to the expected. The [optimiser](https://keras.io/optimizers/) function helps to minimise this loss.

A [metric](https://keras.io/metrics/) is a function that is used to judge the performance of your model.
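
To show the difference between a loss and a metric, here is a hand-rolled sketch of binary cross-entropy and accuracy. It follows the standard formulas rather than Keras's implementation, and the labels and predictions are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average penalty for predictions that are far from the true 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    predicted_labels = (np.asarray(y_pred) >= threshold).astype(float)
    return float(np.mean(predicted_labels == np.asarray(y_true, dtype=float)))

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
unsure = [0.6, 0.4, 0.55, 0.45]
print(binary_crossentropy(labels, confident), accuracy(labels, confident))
print(binary_crossentropy(labels, unsure), accuracy(labels, unsure))
```

Both sets of predictions score 100% accuracy, but the unsure set has a higher loss. The loss rewards confidence in the right answer, while accuracy only counts which side of 0.5 each prediction falls on.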


In [22]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy',metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x12a1a477a58>

### Evaluate:
We get around 88% accuracy on the test data! That's so good! That means that the rules that our model came up with when training were used to correctly predict the sentiment for 88% of the test reviews it hadn't seen before.

In [6]:
model.evaluate(x_test, y_test)



[0.2949141507101059, 0.88375999999999999]

## Predicting on our own blurb:
Testing our model on sample sentences is a quick way to feel confident that it is indeed performing well. Because the imdb data comes pre-processed (each review is downloaded as a set of integers), there are a couple of steps we need to complete before we can get a prediction from a string.

Imdb provides a key to allow us to figure out the integer value used for each word. There are three reserved characters in this key. 
* 0 is used for padding
* 1 is used to mark the start of a new review
* 2 is used for unknown/ignored words

We're going to input a sentence of our own devising, and split it into an array of words:

In [7]:
OFFSET = 3 #number of reserved characters
word_index = imdb.get_word_index() #get key for mapping word->integer

my_blurb = "The spider is ugly"
listOfWords = my_blurb.split()

We start our sentence encoding with 1 to mark it as the start of a new review, and then word by word we map our sentence to imdb's integer representation. If the word is not in the key, or if it isn't one of the top n most frequent words, we mark it as unknown with a 2.

In [8]:
encoded_review = [1]
for word in listOfWords:
    if word in word_index and word_index[word] < TOP_WORDS:
        index = word_index[word] + OFFSET
    else:
        index = 2

    encoded_review.append(index)

print(encoded_review)

[1, 2, 5072, 9, 1558]


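We can see that 'the' was marked as a 2. This isn't because it is a stopword: the keys in the imdb word_index are all lowercase, so our capitalised 'The' simply isn't found, whereas 'is' is. Lowercasing the sentence before splitting it would avoid this. Note also that 'spider' was encoded as 5072, which is above TOP_WORDS, because the code checks the word's rank before adding OFFSET rather than after.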
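
Here is a sketch of a slightly more careful encoder that lowercases the text and checks the shifted index against the vocabulary size. The function name `encode_text` and the miniature `toy_word_index` are assumptions for illustration; in the notebook the key comes from `imdb.get_word_index()`.

```python
OFFSET = 3             # indexes 0, 1 and 2 are reserved
START, UNKNOWN = 1, 2  # start-of-review marker and unknown-word marker

def encode_text(text, word_index, top_words):
    """Encode a sentence as integers, lowercasing it and range-checking each index."""
    encoded = [START]
    for word in text.lower().split():
        rank = word_index.get(word)
        # Keep the word only if it is in the key and its shifted index fits the vocabulary
        if rank is not None and rank + OFFSET < top_words:
            encoded.append(rank + OFFSET)
        else:
            encoded.append(UNKNOWN)
    return encoded

# Hypothetical miniature key, standing in for the real word index
toy_word_index = {"the": 1, "is": 6, "ugly": 1555, "spider": 5069}
print(encode_text("The spider is ugly", toy_word_index, top_words=5000))
# -> [1, 4, 2, 9, 1558]
```

With this version 'the' is found, and any word whose shifted index would fall outside the 5000-word vocabulary is marked as unknown.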

Finally, we reshape our array and get our model to make a prediction!
For my test sentence of "The spider is ugly", we get around 0.35 - that's a negative classification, so good job model buddy!

In [9]:
data = np.array(encoded_review)  
data.shape = [1,len(encoded_review)]
myInput = vectorize_sequences(data)

print(model.predict(myInput, batch_size=None, verbose=0, steps=None))

[[ 0.3491801]]


If you try out different sentences you'll soon see that a lot of predictions it makes are around the 0.5 mark. It often gets a correct classification, but it's rarely very confident about it. It also has some quirks. The word hate seems to have positive connotations. In order to improve the confidence and not terrify the public with a hate-bot, we should try to improve.

![Divider Line](http://www.nationalwalleyetour.com/wp-content/uploads/2012/12/dividerline-transparent.png)

# A less simple sentiment analysis model:

In the previous model, we used one-hot encoding to help reshape our data. Here, we'll use a different method called **sequence padding**. Remembering that our input layer requires that each training example is the same size, we can simply pad out the short ones with 0's. This is illustrated below, where the red is padding. 


![Sequence Padding](https://d3ansictanv2wj.cloudfront.net/img05-531269967c702e2bc6f49455d9bdcd84.png)


There are a couple of things to consider when using this method of reshaping: 
1. All reviews are padded out to the length of the longest review. 
This means that if there were 24,999 reviews of roughly 200 words each, and 1 review that was 3000 words long, they would all be padded to 3000 words. When input sequences are likely to span a wide range of lengths, like in reviews, this is a bit of a waste of computation time.
    
2. The network has to learn to ignore the padding. 
Our reviews are already encoded as integers, and the imdb key has designated '0' for padding. Even so, our model has to discover that ignoring '0's is part of the learning problem.
    

In Keras, there is already a sequence padding method defined that we can import and use. It accepts a parameter called **maxlen**, which we can use to help address the wasted computation introduced by padding everything to the length of the longest sequence. All reviews will instead be either padded or truncated to the value of maxlen.
This will speed up computation, and you can easily make an assumption that after a certain amount of words the overall sentiment of a review will not drastically change. I've chosen quite a low value of 100 words here, meaning that any words over this limit will not be considered when analysing sentiment (because I'm currently running this notebook on CPU and it's going to take forever otherwise).
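
Padding and truncating to maxlen can be sketched in plain Python. This assumes the Keras defaults, where padding is added at the front and long sequences lose words from the front; the function name `pad_or_truncate` and the toy sequences are made up.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Left-pad short sequences; keep only the last maxlen items of long ones."""
    if len(sequence) >= maxlen:
        return sequence[len(sequence) - maxlen:]
    return [pad_value] * (maxlen - len(sequence)) + sequence

print(pad_or_truncate([1, 14, 22], maxlen=5))
# -> [0, 0, 1, 14, 22]
print(pad_or_truncate([1, 14, 22, 16, 43, 530], maxlen=5))
# -> [14, 22, 16, 43, 530]
```

Under these defaults it is the end of a long review that survives, which is worth keeping in mind when choosing a small maxlen.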

In [10]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = TOP_WORDS)
from keras.preprocessing import sequence
MAX_LEN = 100 #Reviews will be padded or truncated to this wordcount
x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN) #Pad training reviews
x_test = sequence.pad_sequences(x_test, maxlen=MAX_LEN) #Pad testing reviews

### Building an LSTM model:

Once again we will be building this model sequentially. For the input layer, we'll use [Word Embeddings](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/). Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. The output of a word embedding layer is a 2D array: one vector for each word in the input sequence. The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. Words with similar meanings will be closer in the vector space than those that have little shared meaning/use. For example:

![Example Word Embedding](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/word-vector-space-similar-words.png)

It is possible to use pre-trained word-embeddings when training a new model. A word embedding trained on a Wikipedia dump, for example, would get a pretty comprehensive language map. This was on the TO-DO list for this semester, but unfortunately I didn't quite get to it. Instead, I created word embeddings from the imdb dataset. This would be a really interesting area to look into further - is it better to use a giant, generalised word embedding, or is it better to use a word embedding trained on text specific to the type of problem you're trying to solve?

When training our own embedding layer, we pass the keras Embedding class TOP_WORDS, and DIMENSIONS. TOP_WORDS is the number of total words in our vocabulary. DIMENSIONS is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. 
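
Under the hood, an embedding layer behaves like a lookup table. The sketch below imitates that with a random NumPy matrix; it is an untrained stand-in rather than the Keras layer itself, and the padded review is made up.

```python
import numpy as np

TOP_WORDS = 5000   # vocabulary size used in this notebook
DIMENSIONS = 32    # size of each word vector

rng = np.random.default_rng(0)
# An untrained embedding is just a table of random numbers: one row per word index
embedding_table = rng.normal(size=(TOP_WORDS, DIMENSIONS))

encoded_review = np.array([0, 0, 1, 14, 22])       # a made-up padded review
embedded_review = embedding_table[encoded_review]  # look up one row per word
print(embedded_review.shape)  # -> (5, 32)
```

Training adjusts the rows of the table so that words used in similar contexts end up with similar vectors.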


In [None]:
from keras.layers import Dense, Embedding, Bidirectional
from keras.models import Sequential

DIMENSIONS = 32
model = Sequential()
model.add(Embedding(TOP_WORDS, DIMENSIONS))

Next, let's try out a Long Short Term Memory (LSTM) layer. LSTM is a type of recurrent network. Recurrent Neural Networks (RNNs) are useful for learning sequential data (e.g. text). Our previous model was a feed-forward network, and assumed all input and output were independent of each other. But when using text, most of the time the current word should be considered in context with the words that came before it.
A recurrent network has an additional weight matrix that connects back to itself for every element of a sequence, allowing each output to be informed by previous computations. This gives them a kind of memory, allowing them to be used in problems where the output of one element of a sequence is altered by others in that same sequence. Text prediction (e.g. predicting the last word in 'The dog has a loud bark') could be accomplished using an RNN.
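
Here is a minimal sketch of that recurrence in NumPy. It is a plain recurrent layer with a tanh activation, not an LSTM, and the sizes, random weights and function name `simple_rnn` are assumptions for illustration.

```python
import numpy as np

def simple_rnn(inputs, w_input, w_recurrent, bias):
    """Run a plain recurrent layer over a sequence and return the final hidden state."""
    hidden = np.zeros(w_recurrent.shape[0])
    for x_t in inputs:  # one step per element of the sequence
        # The new state depends on the current input AND the previous state
        hidden = np.tanh(x_t @ w_input + hidden @ w_recurrent + bias)
    return hidden

rng = np.random.default_rng(0)
n_features, n_hidden = 4, 3  # made-up sizes
w_input = rng.normal(size=(n_features, n_hidden))
w_recurrent = rng.normal(size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)

sequence = rng.normal(size=(5, n_features))  # 5 steps, e.g. 5 embedded words
forward = simple_rnn(sequence, w_input, w_recurrent, bias)
backward = simple_rnn(sequence[::-1], w_input, w_recurrent, bias)
print(forward)
print(backward)
```

Feeding the same elements in reverse order gives a different final state, which is exactly the order-sensitivity that the one-hot model from earlier lacked.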

One of the limitations of a simple RNN is that it has trouble remembering long-range dependencies. For example, it would perform poorly if we had a sentence that went 'I'm allergic to shellfish' and then 1000 words about various foods later 'I would die if I ate a prawn'. The further we progress in the sequence, the less significance our model places on earlier dependencies.

[An LSTM cell](https://www.youtube.com/watch?v=9zhrxE5PQgY) is used to be more selective about what to forget and what to remember, instead of using long term memory for everything all the time.

**Disclaimer: I decided to use LSTM largely because I wasn't sure what to use, and it sounded interesting to learn about. In hindsight, it's not that useful for sentiment analysis. Reviews/opinion text tends to be quite short, and long-range dependencies don't crop up often. Generally, a person's sentiment within a piece of text remains more or less the same for the duration of the text. Nevertheless, without the benefit of hindsight, and being keen just to jump in and try something, it's what I used.**

To add an LSTM cell as our second layer we import the Keras LSTM class and add it to our model, passing in the number of units, which sets the size of this layer's output (here we reuse DIMENSIONS). Bidirectional is a Keras wrapper for recurrent layers that runs the wrapped layer over the sequence both forwards and backwards and combines the two results.
For the final layer, we once again use the sigmoid activation function for classification.

In [11]:
from keras.layers import LSTM

model.add(Bidirectional(LSTM(DIMENSIONS)))
model.add(Dense(1, activation='sigmoid'))

### Compile and Train:
**EPOCHS:**
One of the main goals of our network is to find the most representative weight values. It does this iteratively - by trying out values and slowly adjusting them to get closer and closer to the target output. In order to accomplish this, a full dataset can be passed multiple times to the same neural network. One pass is called an epoch. As the number of epochs increases, the more the weights of a network have a chance to change. Too few epochs and the model is underfit - it performs poorly on any data. Too many and it overfits - it performs really well on the training data, but isn’t general enough to predict anything new.

**BATCH_SIZE:**
That said, unless it’s quite small to begin with, you can’t pass an entire dataset into a network at once. Large datasets are trained in batches. The size and number of batches also depends on the dataset, but you’d expect to see sizes somewhere between 200 and 600.

**VAL_SPLIT:**
The validation split is a fraction between 0 and 1. That fraction of the training data is held out from training and used as validation data at the end of each epoch. This keeps our test data unseen until our model has finished training and is evaluated.
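
The way these three settings interact is simple arithmetic. The sketch below works it out for the values used in the next cell; the helper name `training_plan` is made up, and it assumes the validation samples are removed before batching.

```python
import math

def training_plan(n_samples, val_split, batch_size, epochs):
    """Return (training samples, validation samples, batches per epoch, total batches)."""
    n_val = int(n_samples * val_split)
    n_train = n_samples - n_val
    batches_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_val, batches_per_epoch, batches_per_epoch * epochs

print(training_plan(n_samples=25000, val_split=0.1, batch_size=200, epochs=4))
# -> (22500, 2500, 113, 452)
```

So with these settings the weights get adjusted 452 times in total, and the last batch of each epoch is only half full.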

We compile and fit this model just as we did the last.

In [12]:
EPOCHS = 4
BATCH_SIZE = 200
VAL_SPLIT = 0.1
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc']) 
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_split=VAL_SPLIT)

Train on 22500 samples, validate on 2500 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x12a1a3e4e80>

### Evaluate:
In our simple model from earlier, we scored 88% accuracy. Now, even after using more sophisticated techniques, like word embeddings and a recurrent neural network, we get an accuracy of about 85%. What's our fancy AI up to?!

In [13]:
print(model.evaluate(x_test, y_test, batch_size=BATCH_SIZE))

[0.34869458270072939, 0.84640000009536742]


## Predicting on our own blurb:
Again, we do a sanity check by testing on brand new test data from our brains:

In [15]:
my_blurb = "The spider is ugly"
listOfWords = my_blurb.split()

encoded_review = [1]
for word in listOfWords:
    if word in word_index and word_index[word] < TOP_WORDS:
        index = word_index[word] + OFFSET
    else:
        index = 2

    encoded_review.append(index)

data = np.array(encoded_review)  
data.shape = [1,len(encoded_review)]
x_custom_test = sequence.pad_sequences(data, maxlen=MAX_LEN)                                                       
print(model.predict(x_custom_test))

[[ 0.08695981]]


Again, we've got a correctly negative classification, but look how much more confident it is about it! Before it was 0.35, and now it's absolutely convinced "The spider is ugly" is a negative sentiment with 0.09.

This model is getting slightly worse accuracy ratings than the quicker, dumber model, but it’s so much more confident in its predictions. When predicting, the models output a number that could be thought of as a confidence ranking. The more polar the number, the more sure it is that it has classified the sentiment correctly. So if it classified something as 0.01, it would be so so sure it was negative, whereas a 0.51 ranking would represent it having no blimin’ clue, really. Could be positive or negative, but it’s slightly more towards positive, so it’ll classify it that way.
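
One way to make "how polar is the number" concrete is to turn each score into a label plus a rough confidence value. The measure below is my own illustrative choice (0 at the 0.5 boundary, 1 at either extreme), not something Keras reports, and the function name `interpret` is made up.

```python
def interpret(score):
    """Turn a sigmoid score into a label and a rough 0-1 confidence value."""
    label = "positive" if score >= 0.5 else "negative"
    # Hypothetical confidence measure: distance from 0.5, rescaled to run from 0 to 1
    confidence = round(abs(score - 0.5) * 2, 2)
    return label, confidence

print(interpret(0.35))  # roughly the simple model's score for "The spider is ugly"
print(interpret(0.09))  # roughly the LSTM model's score for the same sentence
print(interpret(0.51))  # barely positive
```

Both models call the spider sentence negative, but by this measure the second model is well over twice as sure of itself.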

I’m wary of using this analogy before getting my marks back, but I think of it as being the difference between a C and an A student. They both pass the same tests, but in the coming semester, as the material gets more complex, the A student is probably going to respond better.