# Lab assignment: analyzing movie reviews with Recurrent Neural Networks

<img src="img/cinemaReviews.png" style="width:600px;">

In this assignment we will analyze the sentiment, positive or negative, expressed in a set of movie reviews from IMDB. To do so we will make use of word embeddings and recurrent neural networks.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed.</td></tr>
 <tr><td width="80"><img src="img/exclamation.png" style="width:auto;height:auto"></td><td>This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to it to understand the assignment.</td></tr>
 <tr><td width="80"><img src="img/pro.png" style="width:auto;height:auto"></td><td>This is an advanced and voluntary exercise that can help you gain deeper knowledge of the topic. Good luck!</td></tr>
</table>

During the assignment you will make use of several Python packages that might not be installed on your machine. If that is the case, you can install new Python packages with

    conda install PACKAGENAME
    
if you are using Anaconda Python. Otherwise you should use

    pip install PACKAGENAME

You will need the following packages for this particular assignment. Make sure they are available before proceeding:

* **numpy**
* **keras**
* **matplotlib**
* **pandas**
* **scikit-learn**
* **nltk**

The following code will embed any plots into the notebook instead of generating a new window:

In [1]:
#Importing required packages
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
import nltk
from nltk.corpus import stopwords
from keras.preprocessing import sequence

Using TensorFlow backend.


Lastly, if you need any help on the usage of a Python function you can place the text cursor over its name and press Shift+Tab to produce a pop-up with related documentation. This will only work inside code cells.

Let's go!

## The Keras library

In this lab we will make use of the <a href=http://keras.io/>keras</a> Deep Learning library for Python. This library allows building several kinds of shallow and deep networks, following either a sequential or a graph architecture.

## Data loading

We will make use of a part of the IMDB database of movie reviews. IMDB rates movies with a score ranging 0-10, but for simplicity we will consider a dataset of good and bad reviews, where a review is considered bad if its score is smaller than 4, and good if its score is larger than 7. The data is available under the *data* folder.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Load the data into two variables, a list **text** with each of the movie reviews and a list **y** of the class labels.
 </td></tr>
</table>

In [2]:
IMDB=pd.read_csv("C:\\Users\\raul_\\JupiterNotebooks\\MasterBigData\\data\\datafull.csv",sep='\t',header=0)


In [3]:
# Remove English stopwords using NLTK (a set makes the membership test fast)
stop = set(stopwords.words('english'))
IMDB['text'] = IMDB['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

In [4]:
print(IMDB.shape)
IMDB.head()

(25000, 2)


Unnamed: 0,sentiment,text
0,0,I simply cant understand relics Ceausescu era ...
1,1,Director Raoul Walsh like Michael Bay '40's ye...
2,1,"It could better film. It drag points, central ..."
3,1,It hard rate film. As entertainment value 21st...
4,1,"I've read terrible things film, I prepared wor..."


In [5]:
X=IMDB["text"]
y=IMDB["sentiment"]

In [6]:
# The dataset is balanced: both classes have the same number of reviews
y.value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

For convenience in what follows we will also split the data into training and test subsets.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Split the list of texts into **texts_train** and **texts_test** lists, keeping 25% of the texts for test. Split in the same way the labels, obtaining lists **y_train** and **y_test**.
 </td></tr>
</table>
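A stratified split keeps the proportion of good and bad reviews the same in both subsets. The following is a minimal pure-Python sketch of that idea on toy labels; the helper name `stratified_split` and the toy data are assumptions made for illustration, not part of the assignment or of scikit-learn.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) keeping the class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                         # shuffle inside each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])            # same fraction taken from every class
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0, 1] * 8                              # hypothetical: 8 bad and 8 good reviews
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))                 # sizes of the two subsets
print(sum(toy_labels[i] for i in test_idx))          # number of good reviews in the test part
```

With 25% kept for test, each class contributes the same share of its reviews to the test subset, so a balanced dataset stays balanced after the split.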

In [7]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [8]:
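# Stratified split keeping 25% of the reviews for test.
# Note: with n_splits=3 the loop overwrites the variables on every iteration,
# so only the last of the three splits is kept.
split_IMDB = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=124)

for train_index, test_index in split_IMDB.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]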

In [9]:
print(X_train.head())
print(y_train.head())

12933    ...but working I surprised see many people con...
21192    [***POSSIBLE SPOILERS***] This movie's reputat...
20160    Even 1942 standards movie-making setup HER CAR...
18376    I've never huge fan Mormon films. Being Mormon...
Name: text, dtype: object
12933    0
8666     1
21192    0
20160    0
18376    0
Name: sentiment, dtype: int64


## Data processing

We can't feed text directly into the network, so we will have to transform it to a vector representation. To do so, we will first **tokenize** the text into words (or tokens), and assign a unique identifier to each word found in the text. Doing this will allow us to perform the encoding. We can do this easily by making use of the **Tokenizer** class in keras:

In [10]:
from keras.preprocessing.text import Tokenizer

A Tokenizer offers convenient methods to split texts down to tokens. At construction time we need to supply the Tokenizer the maximum number of different words we are willing to represent. If our texts have greater word variety than this number, the least frequent words will be discarded. We will choose a number large enough for our purpose.
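To make the idea concrete, here is a rough pure-Python sketch of frequency-ranked word indexes with a vocabulary cap. The helpers `fit_word_index` and `to_sequence` are hypothetical names for this illustration and are not the Keras implementation; in particular, ties between equally frequent words are broken alphabetically here, which is an assumption of the sketch.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Encode a text, dropping unknown words and words whose index is not below num_words."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

toy_texts = ["good movie good cast", "bad movie", "good plot"]   # toy data
word_index = fit_word_index(toy_texts)
print(word_index)
print(to_sequence("good bad plot", word_index, num_words=3))     # rare words are discarded
```

Note that the full word index still contains every word seen during fitting; the cap only takes effect when texts are converted to sequences.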

In [11]:
maxwords = 1000
tokenizer = Tokenizer(num_words = maxwords, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)

We now need to **fit** the Tokenizer to the training texts.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Find in the keras documentation the appropriate Tokenizer method to fit the tokenizer on a list of texts, then use it to fit it on the training data.
 </td></tr>
</table>

In [12]:
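tokenizer.fit_on_texts(X_train)
# NOTE: the exercise asks to fit on the training texts only; fitting on the
# test texts as well lets test-set vocabulary statistics leak into the tokenizer.
tokenizer.fit_on_texts(X_test)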

If done correctly, the following should show the number of times the tokenizer has found each word in the input texts.

In [13]:
tokenizer.word_counts

OrderedDict([('but', 8813),
             ('working', 792),
             ('i', 72637),
             ('surprised', 802),
             ('see', 11461),
             ('many', 6672),
             ('people', 9103),
             ('consider', 512),
             ('good', 15100),
             ('on', 3422),
             ('grounds', 62),
             ('there', 6700),
             ('loose', 277),
             ('hints', 103),
             ('whole', 3078),
             ('material', 759),
             ('self', 1184),
             ('indulgent', 80),
             ('unconvincing', 186),
             ("lynch's", 46),
             ('movies', 7649),
             ('generally', 468),
             ('intriguing', 301),
             ('generate', 50),
             ('sense', 2322),
             ('confusion', 163),
             ('yet', 2748),
             ('playful', 40),
             ('that', 5232),
             ('visual', 522),
             ('subplots', 87),
             ('characters', 7055),
             ('ideas'

Now that we have fitted the tokenizer we can use it to vectorize the texts. In particular, we would like to transform the texts to sequences of word indexes.

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Find in the keras documentation the appropriate Tokenizer method to transform a list of texts to a sequence. Apply it to both the training and test data to obtain matrices **X_train** and **X_test**.
 </td></tr>
</table>

In [14]:
X_train=tokenizer.texts_to_sequences(X_train)    #####  Texts -> sequences of integers
X_test=tokenizer.texts_to_sequences(X_test)

We can see now how a text has been transformed to a list of word indexes.

In [15]:
X_train[0]

[26,
 723,
 2,
 714,
 16,
 42,
 23,
 10,
 117,
 40,
 40,
 146,
 761,
 470,
 33,
 208,
 167,
 63,
 40,
 208,
 35,
 957,
 460,
 26,
 697,
 354,
 828,
 788,
 276,
 327,
 276,
 150,
 261,
 520,
 47,
 84,
 8,
 9,
 229,
 80]

This is enough to train a sequential network. However, for efficiency reasons it is recommended that all sequences in the data have the same number of elements. Since this is not the case for our data, we should **pad** the sequences to ensure the same length. The padding procedure adds a special *null* symbol to short sequences, and clips out parts of long sequences, thus enforcing a common size.
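The padding and clipping behaviour can be pictured with a small pure-Python sketch. The function `pad_sequence` below is a hypothetical helper written for this illustration; it mimics the default behaviour of the Keras padding utility (zeros added at the front, long sequences clipped at the front), but it is not the library code.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    seq = list(seq)[-maxlen:]                      # clip: drop items from the front
    return [value] * (maxlen - len(seq)) + seq     # pad: add the null symbol at the front

print(pad_sequence([5, 8, 2], maxlen=5))               # short sequence gets padded
print(pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # long sequence gets clipped
```

Index 0 works as the *null* symbol because the tokenizer never assigns it to a real word.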

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Find in the keras documentation the appropriate text preprocessing method to pad a sequence. Then pad all sequences to have a maximum of 300 words, both in the training and test data.
 </td></tr>
</table>

In [16]:
X_train = sequence.pad_sequences(X_train, maxlen=300)  ###padding/truncating
X_test = sequence.pad_sequences(X_test, maxlen=300)    ### 300 words

## Simple LSTM network with Embedding

To transform the word indices into something more amenable for a network we will use an <a href=https://keras.io/layers/embeddings/>**Embedding**</a> layer at the very beginning of the network. This layer will transform word indexes to a vector representation that is learned with the model together with the rest of network weights. After this transformation we will make use of an <a href=https://keras.io/layers/recurrent/#lstm>**LSTM**</a> layer to analyze the whole sequence, and then a final layer taking the decision of the network.
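Conceptually, an embedding layer is a lookup table: each word index selects one row of a weight matrix. The NumPy sketch below illustrates only the lookup and the resulting shapes, using a random matrix in place of learned weights; the sizes and the toy batch are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64             # toy sizes: 1000 word indexes, 64-element vectors
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))   # stands in for learned weights

# Two toy padded sequences of 4 word indexes each (0 is the padding symbol).
batch = np.array([[0, 0, 26, 723],
                  [2, 714, 16, 42]])

vectors = embedding_matrix[batch]                # row lookup for every index in the batch
print(vectors.shape)                             # (2, 4, 64)
```

In the real layer the matrix rows are updated by backpropagation together with the rest of the network weights.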

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Build, compile and train a keras network with the following structure:
<ul>
 <li>Embedding layer producing a vector representation of 64 elements</li>
 <li>LSTM layer of 32 units</li>
 <li>Dropout of 0.9</li>
 <li>Dense layer of 1 unit with sigmoid activation</li>
</ul>
Note that the Embedding layer requires specifying as first argument the maximum number of words we chose for the tokenizer. Also, the **input_length** parameter must be set to the number of elements in the input sequences; it is given to the first layer of the model, which here is the Embedding layer.
Use the binary crossentropy loss function for training, together with the adam optimizer. Train for 10 epochs. After training, measure the accuracy on the test set.
 </td></tr>
</table>
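As a reminder of what the loss and the metric measure, here is a small pure-Python sketch of binary cross-entropy and accuracy on sigmoid outputs. The function names, the clipping constant `eps` and the toy predictions are assumptions for illustration; Keras computes these quantities internally.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)            # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]                            # toy labels
y_prob = [0.9, 0.2, 0.4, 0.6]                    # toy sigmoid outputs
print(binary_crossentropy(y_true, y_prob))
print(accuracy(y_true, y_prob))                  # 0.5: two of the four predictions are right
```

Confident wrong predictions are penalised much more heavily by the loss than by the accuracy, which is why the two can move in different directions during training.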

In [17]:
vocab_size=len(tokenizer.word_index)
vocab_size

88073

In [18]:
from keras.models import Sequential
from keras.layers import Embedding, GRU, LSTM, GlobalMaxPool1D
from keras.layers import Activation, Dropout, Flatten, Dense



model = Sequential()
model.add(Embedding(vocab_size+1, 64,input_length=300))
# The model takes as input an integer matrix of size (batch, input_length) = (batch, 300).
# The tokenizer was built with num_words=1000, so the word indexes in the input are no larger than 999.
# Now model.output_shape == (None, 300, 64), where None is the batch dimension.
model.add(LSTM(32))
model.add(Dropout(0.50))

model.add(Dense(1))
model.add(Activation('sigmoid'))

In [19]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(
  X_train, # Training data
  y_train, # Labels of training data
  batch_size=128, # Batch size for the optimizer algorithm
  epochs=10, # Number of epochs to run the optimizer algorithm
  verbose=2, # Level of verbosity of the log messages
  validation_data=(X_test,y_test)
)

Train on 18750 samples, validate on 6250 samples
Epoch 1/10
 - 53s - loss: 0.5806 - acc: 0.7078 - val_loss: 0.3823 - val_acc: 0.8440
Epoch 2/10
 - 51s - loss: 0.3535 - acc: 0.8546 - val_loss: 0.3348 - val_acc: 0.8622
Epoch 3/10
 - 50s - loss: 0.3174 - acc: 0.8726 - val_loss: 0.3397 - val_acc: 0.8579
Epoch 4/10
 - 52s - loss: 0.3029 - acc: 0.8780 - val_loss: 0.3297 - val_acc: 0.8590
Epoch 5/10
 - 52s - loss: 0.2938 - acc: 0.8822 - val_loss: 0.3472 - val_acc: 0.8581
Epoch 6/10
 - 50s - loss: 0.2828 - acc: 0.8881 - val_loss: 0.3343 - val_acc: 0.8558
Epoch 7/10
 - 50s - loss: 0.2677 - acc: 0.8922 - val_loss: 0.3379 - val_acc: 0.8565
Epoch 8/10
 - 49s - loss: 0.2573 - acc: 0.8969 - val_loss: 0.3458 - val_acc: 0.8571
Epoch 9/10
 - 50s - loss: 0.2516 - acc: 0.8979 - val_loss: 0.3671 - val_acc: 0.8539
Epoch 10/10
 - 49s - loss: 0.2422 - acc: 0.9028 - val_loss: 0.3651 - val_acc: 0.8526


<keras.callbacks.History at 0x25561514d30>

test_score=0.8590

## Stacked LSTMs

Much like other neural layers, LSTM layers can be stacked on top of each other to produce more complex models. Care must be taken, however, that the LSTM layers before the last one generate a whole sequence of outputs for the following LSTM to process.
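The shape requirement can be illustrated with a minimal NumPy forward pass of an LSTM using random weights. This is only a sketch: the function `lstm_forward`, its gate ordering and its initialisation are assumptions made for the example and do not reproduce the Keras layer. It shows that a layer returning the whole sequence yields one output per timestep, which is what the next LSTM needs as input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(inputs, units, rng, return_sequences=False):
    """Run a randomly initialised LSTM over inputs of shape (timesteps, features)."""
    timesteps, features = inputs.shape
    # One weight block per gate: input, forget, cell candidate, output.
    W = rng.normal(scale=0.1, size=(features, 4 * units))
    U = rng.normal(scale=0.1, size=(units, 4 * units))
    b = np.zeros(4 * units)
    h = np.zeros(units)                          # hidden state
    c = np.zeros(units)                          # cell state
    outputs = []
    for t in range(timesteps):
        z = inputs[t] @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Whole sequence of hidden states, or only the last one.
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
embedded = rng.normal(size=(300, 64))            # one toy review: 300 timesteps of 64-element vectors
seq = lstm_forward(embedded, 128, rng, return_sequences=True)
last = lstm_forward(seq, 64, rng)                # second layer consumes the full sequence
print(seq.shape, last.shape)                     # (300, 128) (64,)
```

Without `return_sequences=True` the first layer would return a single vector of 128 elements, which has no time dimension for the second layer to iterate over.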

<table>
 <tr><td width="80"><img src="img/question.png" style="width:auto;height:auto"></td><td>
Repeat the training of the previous network, but using 2 LSTM layers. Make sure to configure the first LSTM layer in a way that it outputs a whole sequence for the next layer.
 </td></tr>
</table>

In [20]:
model = Sequential()
model.add(Embedding(vocab_size+1, 64,input_length=300))
# The model takes as input an integer matrix of size (batch, input_length) = (batch, 300).
# The tokenizer was built with num_words=1000, so the word indexes in the input are no larger than 999.
# Now model.output_shape == (None, 300, 64), where None is the batch dimension.
model.add(LSTM(128,return_sequences=True))
model.add(LSTM(64))
model.add(Dropout(0.50))

model.add(Dense(1))
model.add(Activation('sigmoid'))

In [21]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(
  X_train, # Training data
  y_train, # Labels of training data
  batch_size=128, # Batch size for the optimizer algorithm
  epochs=20, # Number of epochs to run the optimizer algorithm
  verbose=2, # Level of verbosity of the log messages
  validation_data=(X_test,y_test)
)

Train on 18750 samples, validate on 6250 samples
Epoch 1/20
 - 277s - loss: 0.4929 - acc: 0.7537 - val_loss: 0.3903 - val_acc: 0.8274
Epoch 2/20
 - 285s - loss: 0.3411 - acc: 0.8575 - val_loss: 0.3342 - val_acc: 0.8589
Epoch 3/20
 - 286s - loss: 0.3296 - acc: 0.8665 - val_loss: 0.3376 - val_acc: 0.8547
Epoch 4/20
 - 289s - loss: 0.3066 - acc: 0.8746 - val_loss: 0.3493 - val_acc: 0.8538
Epoch 5/20
 - 291s - loss: 0.2907 - acc: 0.8812 - val_loss: 0.3413 - val_acc: 0.8576
Epoch 6/20
 - 293s - loss: 0.2741 - acc: 0.8903 - val_loss: 0.3435 - val_acc: 0.8520
Epoch 7/20
 - 295s - loss: 0.2617 - acc: 0.8941 - val_loss: 0.3672 - val_acc: 0.8530
Epoch 8/20
 - 297s - loss: 0.2529 - acc: 0.8979 - val_loss: 0.3531 - val_acc: 0.8494
Epoch 9/20
 - 297s - loss: 0.2490 - acc: 0.9017 - val_loss: 0.3639 - val_acc: 0.8469
Epoch 10/20
 - 295s - loss: 0.2391 - acc: 0.9041 - val_loss: 0.3815 - val_acc: 0.8490
Epoch 11/20
 - 298s - loss: 0.2263 - acc: 0.9097 - val_loss: 0.4204 - val_acc: 0.8456
Epoch 12/20
 -

<keras.callbacks.History at 0x255533c1550>

test_score=0.8589