# IMDB movie review sentiment classification with RNNs

In this notebook, we'll train a recurrent neural network (RNN) for sentiment classification using **TensorFlow** (version $\ge$ 2.0 required) with the **Keras API**. This notebook is largely based on the notebook [Understanding recurrent neural networks](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.2-understanding-recurrent-neural-networks.ipynb) by François Chollet.

First, the needed imports.

In [None]:
%matplotlib inline

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import plot_model

from packaging.version import Version as LV

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using Tensorflow version: {}, and Keras version: {}.'.format(tf.__version__, tf.keras.__version__))
assert(LV(tf.__version__) >= LV("2.0.0"))

## IMDB data set

Next we'll load the IMDB data set. The first time we run this, the data may have to be downloaded, which can take a while.

The dataset contains 50000 movie reviews from the Internet Movie Database, split into 25000 reviews for training and 25000 reviews for testing. Half of the reviews are positive (1) and half are negative (0).

The dataset has already been preprocessed, and each word has been replaced by an integer index.
The reviews are thus represented as varying-length sequences of integers.
(Word indices are offset by 3, because the first indices are reserved: "1" is used to mark the start of a review and "2" represents all out-of-vocabulary words. "0" will be used later to pad shorter reviews to a fixed size.)
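
The index scheme can be illustrated with a small sketch. The vocabulary `toy_word_index` and the helpers `encode` and `decode` are hypothetical and written only for this example (the real vocabulary comes from `imdb.get_word_index()`); the offset of 3 and the special indices follow the description above.

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

START, OOV, OFFSET = 1, 2, 3  # 0 is reserved for padding


def encode(text, word_index, num_words):
    """Encode a text as integers, keeping only indices below num_words."""
    ids = [START]
    for word in text.split():
        idx = word_index.get(word, 0) + OFFSET
        # unknown or too rare words map to the out-of-vocabulary index
        ids.append(idx if word in word_index and idx < num_words else OOV)
    return ids


def decode(ids, word_index):
    """Map integers back to words; special indices are shown as '?'."""
    reverse = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i - OFFSET, "?") for i in ids)


encoded = encode("the movie was surprisingly great", toy_word_index, num_words=10000)
print(encoded)                          # [1, 4, 5, 6, 2, 7]
print(decode(encoded, toy_word_index))  # ? the movie was ? great
```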

In [None]:
# number of most-frequent words to use
nb_words = 10000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=nb_words)
print('x_train:', x_train.shape)
print('x_test:', x_test.shape)
print()

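Let's truncate the reviews to `maxlen` words and pad any shorter reviews with zeros. Note that with its default settings `pad_sequences()` keeps the *last* `maxlen` words of a longer review and adds the zeros at the *beginning* of a shorter one.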
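
To make the effect concrete, here is a minimal NumPy sketch of what `sequence.pad_sequences()` does with its default settings (`padding='pre'`, `truncating='pre'`). The helper `pad_pre` is hypothetical and written for this illustration only; it is not part of Keras.

```python
import numpy as np


def pad_pre(seqs, maxlen, value=0):
    """Keep the last `maxlen` items of each sequence and left-pad shorter ones."""
    out = np.full((len(seqs), maxlen), value, dtype=int)
    for row, seq in enumerate(seqs):
        trunc = seq[-maxlen:]               # drop items from the beginning
        if trunc:
            out[row, -len(trunc):] = trunc  # padding stays at the beginning
    return out


padded = pad_pre([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
# [[0 5 6 7]
#  [3 4 5 6]]
```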

In [None]:
# keep at most this many words of each review
maxlen = 80

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

The first movie review in the training set:

In [None]:
print("First review in the training set:\n", x_train[0], "length:", len(x_train[0]), "class:", y_train[0])

As a sanity check, we can convert the review back to text:

In [None]:
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train[0]])
print(decoded_review)

## RNN model

### Initialization

Let's create an RNN model that contains an LSTM layer. The first layer in the network is an *Embedding* layer that converts integer indices to dense vectors of length `embedding_dims`. The output layer contains a single neuron and a *sigmoid* non-linearity to match the binary ground truth (`y_train`).
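
Conceptually, an embedding layer can be thought of as a lookup table with one row per word index. The NumPy sketch below only illustrates the resulting shapes; the sizes are toy values and the table is random, whereas in the Keras layer the table is learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dims = 10, 4   # toy sizes, much smaller than in the notebook
embedding_table = rng.normal(size=(vocab_size, embedding_dims))  # one row per index

batch = np.array([[0, 0, 1, 7],      # two padded "reviews" of four indices each
                  [1, 3, 3, 2]])

embedded = embedding_table[batch]    # row lookup for every index
print(embedded.shape)                # (2, 4, 4): (samples, time steps, embedding_dims)
```

Equal indices always map to the same vector, so `embedded[1, 1]` and `embedded[1, 2]` are identical here.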

Finally, we `compile()` the model, using *binary crossentropy* as the loss function and [*RMSprop*](https://keras.io/optimizers/#rmsprop) as the optimizer.
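
As a rough sketch of what the sigmoid output and the binary crossentropy loss compute, the NumPy code below evaluates both for three made-up pre-activation values and labels. The function names and the clipping constant `eps` are choices made for this illustration, not taken from the Keras source.

```python
import numpy as np


def sigmoid(z):
    """Squash real values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean negative log-likelihood of the labels under probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


logits = np.array([2.0, -1.0, 0.0])   # made-up outputs before the sigmoid
y_true = np.array([1.0, 0.0, 1.0])    # made-up labels

probs = sigmoid(logits)
loss = binary_crossentropy(y_true, probs)
print(probs, loss)                    # the loss is about 0.378
```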

In [None]:
# model parameters:
embedding_dims = 50
lstm_units = 32

inputs = keras.Input(shape=(None,), dtype="int64")

x = layers.Embedding(input_dim=nb_words, 
                     output_dim=embedding_dims)(inputs)
x = layers.Dropout(0.2)(x)

x = layers.LSTM(lstm_units)(x)

outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs,
                    name="rnn_model")

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # prints the summary itself; wrapping it in print() adds a stray "None"

In [None]:
plot_model(model, show_shapes=True)

### Learning

Now we are ready to train our model. An *epoch* means one pass through the whole training data. Note also that we set aside a fraction of the training data (`validation_split`) as our validation set.

Note that LSTMs are rather slow to train.
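
The bookkeeping behind these settings can be worked out by hand. The sketch below assumes, as described in the Keras documentation for array inputs, that the validation samples are taken from the end of the training data before shuffling; the batch size of 128 is the value passed to `fit()` in this notebook.

```python
import math

n_train = 25000            # reviews in the training set
validation_split = 0.2
batch_size = 128

# The last 20 % of the samples are held out for validation
split_at = int(n_train * (1.0 - validation_split))
n_fit, n_val = split_at, n_train - split_at

# One epoch is one pass over the n_fit samples, in mini-batches
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)   # 20000 5000 157
```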

In [None]:
%%time
epochs = 5
validation_split = 0.2

history = model.fit(x_train, y_train, batch_size=128,
                    epochs=epochs,
                    validation_split=validation_split)


Let's plot the data to see how the training progressed. A big gap between training and validation accuracies would suggest overfitting.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,3))

ax1.plot(history.epoch,history.history['loss'], label='training')
ax1.plot(history.epoch,history.history['val_loss'], label='validation')
ax1.set_title('loss')
ax1.set_xlabel('epoch')
ax1.legend(loc='best')

ax2.plot(history.epoch,history.history['accuracy'], label='training')
ax2.plot(history.epoch,history.history['val_accuracy'], label='validation')
ax2.set_title('accuracy')
ax2.set_xlabel('epoch')
ax2.legend(loc='best');

### Inference

For a better measure of the quality of the model, let's see the model accuracy for the test data. 
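
For a binary classifier with a sigmoid output, accuracy is the fraction of samples whose thresholded prediction matches the label. A minimal sketch with made-up probabilities and labels (not actual model output), using a threshold of 0.5:

```python
import numpy as np

# Made-up predicted probabilities and labels, for illustration only
probs = np.array([0.91, 0.20, 0.55, 0.48, 0.05])
labels = np.array([1, 0, 0, 1, 0])

predicted = (probs > 0.5).astype(int)           # threshold at 0.5
accuracy = float(np.mean(predicted == labels))  # fraction of correct predictions

print(predicted, accuracy)                      # [1 0 1 0 0] 0.6
```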

In [None]:
scores = model.evaluate(x_test, y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

We can also use the learned model to predict sentiments for new reviews:

In [None]:
myreviewtext = 'this movie was the worst i have ever seen and the actors were horrible'
#myreviewtext = 'this movie is great and i madly love the plot from beginning to end'

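# encode the review like the training data:
# 1 marks the start, 2 marks out-of-vocabulary words, word indices are offset by 3
myreview_ids = [1]

for w in myreviewtext.split():
    if w in word_index and word_index[w]+3<nb_words:
        myreview_ids.append(word_index[w]+3)
    else:
        print('word not in vocabulary:', w)
        myreview_ids.append(2)

# pad (or truncate) with the same settings as were used for the training data
myreview = sequence.pad_sequences([myreview_ids], maxlen=maxlen)

print(myreview, "shape:", myreview.shape)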

p = model.predict(myreview, batch_size=1) # values close to "0" mean negative, close to "1" positive
print('Predicted sentiment: {}TIVE ({:.4f})'.format("POSI" if p[0,0]>0.5 else "NEGA", p[0,0]))

## Task 1: Two LSTM layers

Create a model with two LSTM layers. Optionally, you can also use [Bidirectional layers](https://keras.io/api/layers/recurrent_layers/bidirectional/).

The code below is missing the model definition. You can copy any suitable layers from the example above.

In [None]:
ex1_inputs = keras.Input(shape=(None,), dtype="int64")

# WRITE YOUR CODE HERE

ex1_outputs = None

Execute the cell below to see an [example answer](https://raw.githubusercontent.com/csc-training/intro-to-dl/master/day1/solutions/tf2-imdb-rnn-example-answer.py).

In [None]:
%load solutions/tf2-imdb-rnn-example-answer.py

In [None]:
assert ex1_outputs is not None, "You need to write the missing model definition"

ex1_model = keras.Model(inputs=ex1_inputs, outputs=ex1_outputs,
                    name="rnn_model_with_two_lstm_layers")

ex1_model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
ex1_model.summary()  # prints the summary itself; wrapping it in print() adds a stray "None"

In [None]:
%%time
ex1_epochs = 5
ex1_validation_split = 0.2

ex1_history = ex1_model.fit(x_train, y_train, batch_size=128,
                            epochs=ex1_epochs,
                            validation_split=ex1_validation_split)

Let's plot the data to see how the training progressed. A big gap between training and validation accuracies would suggest overfitting.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,3))

ax1.plot(ex1_history.epoch,ex1_history.history['loss'], label='training')
ax1.plot(ex1_history.epoch,ex1_history.history['val_loss'], label='validation')
ax1.set_title('Task 1: loss')
ax1.set_xlabel('epoch')
ax1.legend(loc='best')

ax2.plot(ex1_history.epoch,ex1_history.history['accuracy'], label='training')
ax2.plot(ex1_history.epoch,ex1_history.history['val_accuracy'], label='validation')
ax2.set_title('Task 1: accuracy')
ax2.set_xlabel('epoch')
ax2.legend(loc='best');

In [None]:
ex1_scores = ex1_model.evaluate(x_test, y_test, verbose=2)
print("%s: %.2f%%" % (ex1_model.metrics_names[1], ex1_scores[1]*100))

## Task 2: Model tuning

Modify the model further.  Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters.

To combat overfitting, you can for example add dropout. For [LSTMs](https://keras.io/api/layers/recurrent_layers/lstm/), dropout for the inputs and the recurrent states can be set with the `dropout` and `recurrent_dropout` arguments:

    x = layers.LSTM(lstm_units, dropout=0.2,
                    recurrent_dropout=0.3)(x)
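
The general idea of dropout can be sketched in NumPy as follows. This shows the commonly used "inverted dropout" formulation applied during training; how exactly Keras applies `dropout` and `recurrent_dropout` inside the LSTM is not shown here, and the function below is a hypothetical helper for illustration.

```python
import numpy as np


def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of the values and rescale the rest."""
    keep = rng.random(x.shape) >= rate
    return np.where(keep, x / (1.0 - rate), 0.0)


rng = np.random.default_rng(0)
x = np.ones((1000, 32))
y = dropout(x, rate=0.2, rng=rng)

print((y == 0).mean())   # close to 0.2
print(y.mean())          # close to 1.0, thanks to the rescaling
```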

Another option is to use regularization, for example with the `kernel_regularizer` and/or `recurrent_regularizer` arguments:

    x = layers.LSTM(lstm_units, kernel_regularizer='l2',
                    recurrent_regularizer='l2')(x)
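
An L2 regularizer adds a penalty proportional to the sum of the squared weights to the loss. A minimal sketch, with a made-up weight matrix; the factor 0.01 is assumed here to correspond to the default used when the regularizer is given as the string `'l2'`:

```python
import numpy as np


def l2_penalty(weights, factor=0.01):
    """Penalty term added to the loss: factor * sum of squared weights."""
    return factor * float(np.sum(np.square(weights)))


w = np.array([[0.5, -1.0],
              [2.0, 0.0]])   # made-up weights

penalty = l2_penalty(w)
print(penalty)               # 0.01 * (0.25 + 1.0 + 4.0), about 0.0525
```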

A third option is to use early stopping in training. It can be implemented with the [EarlyStopping](https://keras.io/api/callbacks/early_stopping/) callback, for example:

    callbacks = [keras.callbacks.EarlyStopping(
                 monitor="val_loss", patience=1,
                 restore_best_weights=True)]

The callback then needs to be added as an argument to the [`model.fit()`](https://keras.io/api/models/model_training_apis/#fit-method) method.
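
The patience logic behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name `early_stopping_epoch` and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses, one per epoch
val_losses = [0.50, 0.40, 0.42, 0.38, 0.45]

print(early_stopping_epoch(val_losses, patience=1))  # (2, 1)
print(early_stopping_epoch(val_losses, patience=2))  # (4, 3)
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch (the second value), not those from the last epoch that was run.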

You can also consult the Keras documentation at https://keras.io/. 

---
*Run this notebook in Google Colaboratory using [this link](https://colab.research.google.com/github/csc-training/intro-to-dl/blob/master/day1/04-tf2-imdb-rnn.ipynb).*