# Lab assignment: analyzing movie reviews with Recurrent Neural Networks

<img src="img/cinemaReviews.png" style="width:600px;">

In this assignment we will analyze the sentiment, positive or negative, expressed in a set of IMDB movie reviews. To do so we will make use of word embeddings and recurrent neural networks.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed.</td></tr>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td style="text-align:left">This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment.</td></tr>
 <tr><td width="80"><img src="img/pro.png" style="width:auto;height:auto"></td><td style="text-align:left">This is an advanced and voluntary exercise that can help you gain a deeper knowledge of the topic. Good luck!</td></tr>
</table>

During the assignment you will make use of several Python packages that might not be installed on your machine. If that is the case, you can install new Python packages with

    conda install PACKAGENAME
    
if you are using Python Anaconda. Otherwise, you should use

    pip install PACKAGENAME

You will need the following packages for this particular assignment. Make sure they are available before proceeding:

* **numpy**
* **keras**
* **matplotlib**

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-out with related documentation. This will only work inside code cells.

Let's go!

## The Keras library

In this lab we will make use of the <a href=http://keras.io/>keras</a> Deep Learning library for Python. This library allows building several kinds of shallow and deep networks, following either a sequential or a graph architecture.

## Data loading

We will use part of the IMDB database of movie reviews. IMDB rates movies with a score ranging 0-10, but for simplicity we will consider a dataset of good and bad reviews, where a review is labelled bad if its score is smaller than 4 and good if its score is larger than 7. The data is available under the *data* folder.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Load the data into two variables, a list **texts** with each of the movie reviews and a list **y** of the class labels.
 </td></tr>
</table>

In [2]:
####### INSERT YOUR CODE HERE
import pandas as pd
#data = pd.read_csv("./data/datafull.csv", sep='\t')
data = pd.read_csv("./data/data.csv", sep='\t')
texts = data["text"]
y = data["sentiment"].values

For convenience in what follows we will also split the data into training and test subsets.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Split the list of texts into **texts_train** and **texts_test** lists, keeping 25% of the texts for test. Split in the same way the labels, obtaining lists **y_train** and **y_test**.
 </td></tr>
</table>

In [3]:
####### INSERT YOUR CODE HERE
from sklearn.model_selection import train_test_split
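# Keep 25% of the reviews for testing; stratify so both subsets keep the same class balance
texts_train, texts_test, y_train, y_test = train_test_split(texts, y, test_size=0.25, stratify=y)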
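To make the idea of a stratified split concrete, here is a minimal pure-Python sketch. The helper `stratified_split` and the toy data are hypothetical names introduced for illustration only; this is not how scikit-learn implements `train_test_split`, just one way to keep the class proportions equal in both subsets.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, test_fraction=0.25, seed=0):
    """Split items and labels so each class keeps the same share in both parts."""
    rng = random.Random(seed)
    # Group the positions of the examples by class label
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])   # first n_test shuffled examples go to test
        train_idx.extend(indices[n_test:])  # the rest go to training

    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Toy data: 8 negative and 8 positive "reviews"
toy_texts = ["review %d" % i for i in range(16)]
toy_labels = [0] * 8 + [1] * 8
tr_x, te_x, tr_y, te_y = stratified_split(toy_texts, toy_labels)
print(Counter(tr_y), Counter(te_y))
```

With 8 examples per class and a test fraction of 0.25, each class contributes 2 examples to the test part and 6 to the training part.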

## Data processing

We can't introduce text directly into the network, so we will have to transform it to a vector representation. To do so, we will first **tokenize** the text into words (or tokens), and assign a unique identifier to each word found in the text. Doing this will allow us to perform the encoding. We can do this easily by making use of the **Tokenizer** class in keras:

In [4]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


A Tokenizer offers convenient methods to split texts down to tokens. At construction time we need to supply the Tokenizer with the maximum number of different words we are willing to represent. If our texts have greater word variety than this number, the least frequent words will be discarded. We will choose a number large enough for our purpose.
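The effect of the word limit can be illustrated with a small pure-Python sketch. The functions `fit_word_index` and `to_sequences` are hypothetical helpers, not part of keras; the regular expression and the alphabetical tie-breaking are simplifying assumptions, and the cutoff rule (keep only words whose index is below the limit, with index 0 left free for padding) reflects my understanding of how the keras Tokenizer behaves.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenization: lowercase, keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Ties are broken alphabetically here (an assumption made for determinism)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words at or beyond the limit."""
    return [[word_index[t] for t in tokenize(text)
             if t in word_index and word_index[t] < num_words]
            for text in texts]

toy_texts = ["the film was good", "the film was bad", "the plot was the problem"]
word_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, word_index, num_words=4)
print(word_index)
print(toy_sequences)
```

In this toy corpus only the three most frequent words ("the", "was", "film") survive a limit of 4, so rare words such as "good" or "problem" vanish from the encoded sequences.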

In [5]:
maxwords = 1000
tokenizer = Tokenizer(num_words = maxwords)

We now need to **fit** the Tokenizer to the training texts.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Find in the keras documentation the appropriate Tokenizer method to fit the tokenizer on a list of texts, then use it to fit the tokenizer on the training data.
 </td></tr>
</table>

In [6]:
####### INSERT YOUR CODE HERE
tokenizer.fit_on_texts(texts_train)

If done correctly, the following should show the number of times the tokenizer has found each word in the input texts.

In [7]:
tokenizer.word_counts

OrderedDict([('of', 11163),
             ('life', 491),
             ('in', 7015),
             ('some', 1249),
             ('colleges', 3),
             ('course', 215),
             ('there', 1211),
             ('were', 798),
             ('artistic', 39),
             ('licenses', 1),
             ('taken', 88),
             ('but', 3267),
             ('what', 1163),
             ('you', 2289),
             ('saw', 251),
             ('this', 5849),
             ('film', 2998),
             ('go', 417),
             ('on', 2554),
             ('br', 8026),
             ('i', 5840),
             ('went', 101),
             ('to', 10573),
             ('southern', 10),
             ('california', 23),
             ('where', 480),
             ('the', 25672),
             ('races', 7),
             ('pretty', 255),
             ('much', 715),
             ('hang', 14),
             ('around', 306),
             ('with', 3297),
             ('their', 815),
             ('own', 254),


Now that we have trained the tokenizer we can use it to vectorize the texts. In particular, we would like to transform the texts to sequences of word indexes.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Find in the keras documentation the appropriate Tokenizer method to transform a list of texts to a list of sequences. Apply it to both the training and test data to obtain the lists of sequences **X_train** and **X_test**.
 </td></tr>
</table>

In [8]:
####### INSERT YOUR CODE HERE
X_train = tokenizer.texts_to_sequences(texts_train)
X_test = tokenizer.texts_to_sequences(texts_test)

We can see now how a text has been transformed to a list of word indexes.

In [9]:
X_train[0]

[4,
 112,
 8,
 45,
 4,
 238,
 47,
 69,
 532,
 17,
 45,
 4,
 48,
 21,
 200,
 8,
 10,
 19,
 129,
 20,
 8,
 45,
 7,
 7,
 11,
 472,
 5,
 8,
 115,
 1,
 196,
 73,
 167,
 16,
 67,
 197,
 44,
 165,
 79,
 128,
 22,
 12,
 198,
 576,
 3,
 11,
 65,
 134,
 12,
 44,
 47,
 17,
 1,
 162,
 6,
 51,
 608,
 40,
 39,
 51,
 488,
 43,
 40,
 15,
 608,
 35,
 382,
 5,
 43,
 167,
 16,
 76,
 4,
 67,
 197,
 39,
 6,
 12,
 72,
 23,
 61,
 289,
 768,
 2,
 552,
 4,
 17,
 33,
 1,
 579,
 4,
 28,
 4,
 1,
 11,
 259,
 41,
 12,
 71,
 133,
 30,
 320,
 5,
 167,
 16,
 4,
 83,
 3,
 320,
 5,
 113,
 98,
 855,
 326,
 132,
 197,
 7,
 7,
 430,
 8,
 28,
 4,
 142,
 11,
 28,
 54,
 300,
 291,
 167,
 689,
 41,
 1,
 4,
 1,
 3,
 68,
 1,
 4,
 45,
 4,
 142,
 145,
 703,
 33,
 1,
 8,
 1,
 18,
 115,
 1,
 179,
 127,
 472,
 121,
 3,
 553,
 29,
 2,
 7,
 7,
 11,
 65,
 64,
 427,
 3,
 12,
 145,
 33,
 12,
 123,
 84,
 527,
 7,
 7,
 37,
 6,
 856,
 272,
 6,
 9,
 90,
 121,
 727,
 9,
 20,
 115,
 21,
 472,
 5,
 39,
 129,
 5,
 579,
 1,
 162,
 115,
 1,
 43,
 1

This is enough to train a sequential network. However, for efficiency reasons it is recommended that all sequences in the data have the same number of elements. Since this is not the case for our data, we should **pad** the sequences to ensure the same length. The padding procedure adds a special *null* symbol to short sequences and clips out parts of long sequences, thus enforcing a common size.
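The padding procedure described above can be sketched in a few lines of plain Python. The helper `pad_or_truncate` is a hypothetical name; it pads and clips at the start of each sequence with the value 0, which is what I understand the keras defaults to be, but check the documentation before relying on that.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Force every sequence to exactly maxlen elements."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]            # long sequences: keep the last maxlen items
        missing = maxlen - len(seq)          # short sequences: how many nulls are needed
        padded.append([value] * missing + seq)
    return padded

toy_sequences = [[1, 2], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_or_truncate(toy_sequences, maxlen=4)
print(toy_padded)  # [[0, 0, 1, 2], [3, 4, 5, 6]]
```

After this step every row has the same length, so the data can be stored as a single rectangular matrix.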

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
 
Find in the keras documentation the appropriate preprocessing method to pad sequences. Then pad or clip all sequences to a common length of 300 words, both in the training and test data.
 </td></tr>
</table>

In [10]:
####### INSERT YOUR CODE HERE
from keras.preprocessing.sequence import pad_sequences 
maxsequence = 300
X_train = pad_sequences(X_train, maxlen=maxsequence)
X_test = pad_sequences(X_test, maxlen=maxsequence)

## Simple LSTM network with Embedding

To transform the word indexes into something more amenable to a network we will use an <a href=https://keras.io/layers/embeddings/>**Embedding**</a> layer at the very beginning of the network. This layer will transform word indexes to a vector representation that is learned together with the rest of the network weights. After this transformation we will make use of an <a href=https://keras.io/layers/recurrent/#lstm>**LSTM**</a> layer to analyze the whole sequence, and then a final layer that makes the decision of the network.
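Conceptually, an embedding is a lookup table with one row per word index. The following numpy sketch illustrates only the lookup; the matrix is random here, whereas in the network its values are learned during training. The variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# One 64-element vector per word index (random stand-in for learned weights)
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

# A toy batch of 2 padded sequences with 5 word indexes each
batch = np.array([[0, 0, 4, 112, 8],
                  [1, 19, 129, 20, 8]])

# Indexing the matrix with the batch replaces every index by its vector
vectors = embedding_matrix[batch]
print(vectors.shape)  # (2, 5, 64)
```

The result has one extra dimension compared to the input: each sequence of word indexes becomes a sequence of vectors, which is the kind of input a recurrent layer consumes.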

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Build, compile and train a keras network with the following structure:
<ul>
 <li>Embedding layer producing a vector representation of 64 elements</li>
 <li>LSTM layer of 32 units</li>
 <li>Dropout of 0.9</li>
 <li>Dense layer of 1 unit with sigmoid activation</li>
</ul>
Note that the Embedding layer requires specifying as first argument the maximum number of words we chose for the tokenizer. Also, the LSTM layer requires setting the **input_shape** parameter as a tuple including the number of elements in the input sequences. 
Use the binary crossentropy loss function for training, together with the adam optimizer. Train for 10 epochs. After training, measure the accuracy on the test set.
 </td></tr>
</table>
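As a reference for what the final layer and the loss compute, here is a small numpy sketch of the sigmoid activation, the binary cross-entropy loss and the accuracy obtained by thresholding at 0.5. The function names and toy values are assumptions made for illustration; keras has its own implementations of these.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = (y_prob > threshold).astype(int)
    return float(np.mean(predictions == y_true))

logits = np.array([2.0, -1.0, 0.5, -3.0])  # raw outputs of the last Dense unit
labels = np.array([1, 0, 0, 0])            # 1 = good review, 0 = bad review

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(acc)  # 0.75
```

Only the third example is misclassified (its probability is above 0.5 but its label is 0), which gives 3 correct predictions out of 4.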

In [11]:
####### INSERT YOUR CODE HERE
# All layers are available from the public keras.layers namespace
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM
from keras.models import Sequential

model = Sequential()
model.add(Embedding(maxwords, 64))
model.add(LSTM(32, input_shape=(maxsequence,)))
model.add(Dropout(0.9))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()

model.fit(X_train, y_train, batch_size=64, epochs=10, validation_data=(X_test, y_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 64)          64000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 76,449
Trainable params: 76,449
Non-trainable params: 0
_________________________________________________________________
Train on 1875 samples, validate on 625 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epo

<keras.callbacks.History at 0x7efbc5fbff28>
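The parameter counts printed in the model summary above can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the helper names are hypothetical and exist only for this illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of embedding_dim weights per word index
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 blocks (input, forget and output gates plus the cell candidate), each with
    # an input kernel, a recurrent kernel and a bias vector
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(1000, 64),
    "lstm": lstm_params(64, 32),
    "dense": dense_params(32, 1),
}
total = sum(counts.values())
print(counts, total)
```

This yields 64000, 12416 and 33 parameters respectively, for a total of 76449, matching the summary. Dropout and Activation layers have no weights, so they contribute nothing.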

## Stacked LSTMs

Much like other neural layers, LSTM layers can be stacked on top of each other to produce more complex models. Care must be taken, however, that the LSTM layers before the last one generate a whole sequence of outputs for the following LSTM to process.
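The difference between returning the last output and returning the whole sequence is easiest to see in the shapes. The following numpy sketch implements a plain LSTM forward pass with random weights; it is an illustrative assumption of how such a layer works, not the keras implementation, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, kernel, recurrent_kernel, bias, return_sequences=False):
    """x has shape (batch, time, input_dim); weights hold the 4 gate blocks side by side."""
    batch, time, _ = x.shape
    units = recurrent_kernel.shape[0]
    h = np.zeros((batch, units))  # hidden state (the layer output at each step)
    c = np.zeros((batch, units))  # cell state (the internal memory)
    outputs = []
    for t in range(time):
        z = x[:, t, :] @ kernel + h @ recurrent_kernel + bias
        i, f, g, o = np.split(z, 4, axis=1)   # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)      # (batch, time, units)
    return h                                  # (batch, units): last step only

def random_lstm_weights(rng, input_dim, units):
    return (rng.normal(scale=0.1, size=(input_dim, 4 * units)),
            rng.normal(scale=0.1, size=(units, 4 * units)),
            np.zeros(4 * units))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 64))               # 2 sequences, 5 steps, 64 features
first = random_lstm_weights(rng, 64, 32)
second = random_lstm_weights(rng, 32, 32)

seq_out = lstm_forward(x, *first, return_sequences=True)
last_out = lstm_forward(seq_out, *second)
print(seq_out.shape, last_out.shape)  # (2, 5, 32) (2, 32)
```

The second layer needs a three-dimensional input with a time axis, which is why the first layer must return the whole sequence; without it, the first layer would emit only a `(batch, units)` matrix and there would be nothing left to iterate over.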

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td style="text-align:left">
Repeat the training of the previous network, but using 2 LSTM layers. Make sure to configure the first LSTM layer so that it outputs a whole sequence for the next layer.
 </td></tr>
</table>

In [12]:
####### INSERT YOUR CODE HERE
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(maxwords, 64))
model.add(LSTM(32, input_shape=(maxsequence,), return_sequences=True))
model.add(Dropout(0.9))
model.add(LSTM(32, input_shape=(maxsequence,)))
model.add(Dropout(0.9))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()

model.fit(X_train, y_train, batch_size=64, epochs=10, validation_data=(X_test, y_test))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 64)          64000     
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 32)          12416     
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 32)          0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total para

<keras.callbacks.History at 0x7efb6fb63d30>