# IMDB movie review sentiment classification with RNNs (aclImdb version)

In this notebook, we'll train a recurrent neural network (RNN) for sentiment classification using **TensorFlow** with the **Keras API**.

First, the needed imports.

In [None]:
%matplotlib inline

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using Tensorflow version: {}, and Keras version: {}.'.format(tf.__version__, tf.keras.__version__))

Let's check whether a GPU is available.

In [None]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 0:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    from tensorflow.python.client import device_lib
    for d in device_lib.list_local_devices():
        if d.device_type == 'GPU':
            print('GPU', d.physical_device_desc)
else:
    print('No GPU, using CPU instead.')

## IMDB data set

Next we'll load the IMDB data set. The first time, we may have to download the data, which can take a while.

The dataset contains 50,000 movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. Half of the reviews are positive (1) and half are negative (0).

The dataset consists of movie reviews stored as text files in a directory hierarchy, and we use the `text_dataset_from_directory()` function to create a `tf.data.Dataset` from the text files.

In [None]:
DATADIR = "/media/gpu-data/imdb"
BATCH_SIZE = 16
validation_split = 0.2
seed = 42

In [None]:
print('Train:')
imdb_train = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=validation_split,
    subset='training',
    seed=seed
)
print('\nValidation:')
imdb_valid = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=validation_split,
    subset='validation',
    seed=seed
)
print('\nTest:')
imdb_test = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/test",
    batch_size=BATCH_SIZE,
)
print('\nAn example review:')
print(imdb_train.unbatch().take(1).get_single_element())
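
The training and validation datasets above are both created from `aclImdb/train` with the same `seed` and `validation_split`, which is what keeps the two subsets disjoint. The following plain-Python sketch illustrates the idea only; the helper `split_indices` is hypothetical, and the shuffling Keras performs internally will not produce the same order.

```python
import random

def split_indices(n_samples, validation_split, seed):
    """Shuffle sample indices with a fixed seed and cut off the tail for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(validation_split * n_samples)
    return indices[:-n_valid], indices[-n_valid:]

# 25000 labelled reviews are available for training
train_idx, valid_idx = split_indices(25000, validation_split=0.2, seed=42)
print(len(train_idx), len(valid_idx))        # 20000 5000
print(set(train_idx).isdisjoint(valid_idx))  # True
```

Because the shuffle is seeded, calling the function twice with the same arguments yields the same split, so the `'training'` and `'validation'` subsets never overlap.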

## RNN model

### Initialization

Let's create an RNN model that contains an LSTM layer. The model takes raw review strings as input: a *TextVectorization* layer first converts each review to a fixed-length sequence of `maxlen` integer word indices, and an *Embedding* layer then converts the integer indices to dense vectors of length `embedding_dims`. The output layer contains a single neuron with a *sigmoid* non-linearity to match the binary labels (0 or 1).

Finally, we `compile()` the model, using *binary crossentropy* as the loss function and [*RMSprop*](https://keras.io/optimizers/#rmsprop) as the optimizer.

In [None]:
nb_words = 10000
maxlen = 80

vectorization_layer = layers.TextVectorization(max_tokens=nb_words,
                             output_mode='int',
                             output_sequence_length=maxlen)
train_text = imdb_train.map(lambda text, labels: text)
vectorization_layer.adapt(train_text)
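
To get a feeling for what the vectorization step produces, here is a rough plain-Python sketch. It assumes the layer's default behaviour (lowercasing, stripping punctuation, splitting on whitespace, index 0 for padding and index 1 for out-of-vocabulary words). The function names are hypothetical, and details such as tie-breaking between equally frequent words may differ from Keras.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def vectorize(text, vocabulary, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    ids = [vocabulary.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_reviews = ["A great movie, great plot!", "A terrible movie."]
vocab = build_vocabulary(toy_reviews, max_tokens=5)
print(vocab)
print(vectorize("A great unseen movie", vocab, sequence_length=6))  # [2, 3, 1, 4, 0, 0]
```

In the notebook, `max_tokens` corresponds to `nb_words` and `sequence_length` to `maxlen`, so every review becomes exactly 80 integers regardless of its original length.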


In [None]:
# model parameters:
embedding_dims = 50
lstm_units = 32

inputs = keras.Input(shape=(1,), dtype=tf.string)

x = vectorization_layer(inputs)
x = layers.Embedding(input_dim=nb_words, 
                     output_dim=embedding_dims)(x)
x = layers.Dropout(0.2)(x)

x = layers.LSTM(lstm_units)(x)

outputs = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs=inputs, outputs=outputs,
                    name="rnn_model")

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None
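
The parameter counts reported in the model summary can be checked by hand. The sketch below assumes the layer sizes used above (`nb_words=10000`, `embedding_dims=50`, `lstm_units=32`) and the standard LSTM formulation with four gates and one bias vector per gate; the helper functions are hypothetical.

```python
def embedding_params(vocab_size, embedding_dims):
    # One trainable vector of length embedding_dims per vocabulary index
    return vocab_size * embedding_dims

def lstm_params(input_dims, units):
    # Four gates (input, forget, cell, output), each with an input kernel,
    # a recurrent kernel and a bias vector
    return 4 * (input_dims * units + units * units + units)

def dense_params(input_dims, units):
    # Weight matrix plus one bias per output neuron
    return input_dims * units + units

counts = {
    'embedding': embedding_params(10000, 50),
    'lstm': lstm_params(50, 32),
    'dense': dense_params(32, 1),
}
print(counts)                 # {'embedding': 500000, 'lstm': 10624, 'dense': 33}
print(sum(counts.values()))   # 510657
```

The text vectorization and dropout layers add no trainable parameters, so almost all of the model's weights sit in the embedding layer.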

In [None]:
plot_model(model, show_shapes=True)
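
As a reminder of what the sigmoid output and the binary crossentropy loss compute, here is a small worked example in plain Python. It is a sketch for a single example with a hand-picked score; Keras averages the loss over the batch and clips probabilities for numerical stability.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(f"score {z:+.1f} -> p = {p:.3f}, "
          f"loss if positive = {binary_crossentropy(1, p):.3f}, "
          f"loss if negative = {binary_crossentropy(0, p):.3f}")
```

A confident prediction is cheap when it is right and expensive when it is wrong, while an undecided output of 0.5 always costs log(2), roughly 0.693.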

### Learning

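Now we are ready to train our model. An *epoch* means one pass through the whole training data. Note also that 20% of the reviews in the training directory were set aside as the validation set when we created the datasets (`validation_split=0.2`), so the model trains on the remaining 80%.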

Note that LSTMs are rather slow to train.

In [None]:
%%time
epochs = 5

history = model.fit(imdb_train, validation_data=imdb_valid, epochs=epochs)
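
To relate the progress bar to the data, the number of batches per epoch can be worked out from the settings used earlier. This is simple arithmetic under the assumption that the training subset holds 80% of the 25000 training reviews.

```python
import math

n_train_reviews = 20000   # 80% of the 25000 reviews in aclImdb/train
batch_size = 16           # BATCH_SIZE used when the datasets were created
epochs = 5
sequence_length = 80      # maxlen: time steps the LSTM processes one after another

steps_per_epoch = math.ceil(n_train_reviews / batch_size)
print(steps_per_epoch)            # 1250 weight updates per epoch
print(steps_per_epoch * epochs)   # 6250 weight updates in total
```

Within every one of these batches the LSTM has to walk through the 80 time steps in order, which is one reason recurrent layers train slowly.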

Let's plot the data to see how the training progressed. A big gap between training and validation accuracies would suggest overfitting.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,3))

ax1.plot(history.epoch,history.history['loss'], label='training')
ax1.plot(history.epoch,history.history['val_loss'], label='validation')
ax1.set_title('loss')
ax1.set_xlabel('epoch')
ax1.legend(loc='best')

ax2.plot(history.epoch,history.history['accuracy'], label='training')
ax2.plot(history.epoch,history.history['val_accuracy'], label='validation')
ax2.set_title('accuracy')
ax2.set_xlabel('epoch')
ax2.legend(loc='best');
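
The same information can be read off numerically. The sketch below uses a hypothetical helper, `summarize_history`, and made-up numbers that are not results from this notebook; with a real run you would pass `history.history` instead.

```python
def summarize_history(history):
    """Find the epoch with the lowest validation loss and the accuracy gap there."""
    best_epoch = min(range(len(history['val_loss'])),
                     key=lambda epoch: history['val_loss'][epoch])
    gap = history['accuracy'][best_epoch] - history['val_accuracy'][best_epoch]
    return best_epoch, gap

# Made-up numbers for illustration only, not results from this notebook
toy_history = {
    'loss':         [0.60, 0.42, 0.33, 0.26, 0.20],
    'val_loss':     [0.50, 0.43, 0.41, 0.44, 0.49],
    'accuracy':     [0.68, 0.82, 0.87, 0.90, 0.93],
    'val_accuracy': [0.77, 0.81, 0.82, 0.82, 0.81],
}
best_epoch, gap = summarize_history(toy_history)
print(f"lowest val_loss at epoch {best_epoch}, accuracy gap there {gap:.2f}")
```

In the toy numbers the validation loss starts rising after epoch 2 while the training loss keeps falling, which is the typical signature of overfitting.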

### Inference

For a better measure of the quality of the model, let's evaluate its accuracy on the test data.

In [None]:
scores = model.evaluate(imdb_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

We can also use the learned model to predict sentiments for new reviews:

In [None]:
myreview = 'This movie was the worst I have ever seen and the actors were horrible.'
#myreview = 'This movie is great and I madly love the plot from beginning to end.'

p = model.predict([myreview], batch_size=1)
print('Predicted sentiment: {}TIVE ({:.4f})'.format("POSI" if p[0,0]>0.5 else "NEGA", p[0,0]))

## Task 1: Two LSTM layers

Create a model with two LSTM layers. Optionally, you can also use [Bidirectional layers](https://keras.io/api/layers/recurrent_layers/bidirectional/).
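
When stacking recurrent layers it helps to keep track of tensor shapes: an LSTM layer expects a three-dimensional input of shape (batch, timesteps, features), but by default it returns only the output of the last time step. The following sketch uses a hypothetical helper that mimics the shape rules, assuming the default `merge_mode='concat'` for Bidirectional layers.

```python
def lstm_output_shape(input_shape, units, return_sequences=False, bidirectional=False):
    """Output shape of an LSTM layer for input (batch, timesteps, features)."""
    batch, timesteps, _ = input_shape
    # Bidirectional concatenates the forward and backward outputs by default
    width = 2 * units if bidirectional else units
    if return_sequences:
        return (batch, timesteps, width)   # one output vector per time step
    return (batch, width)                  # only the last time step

embedded = (None, 80, 50)   # (batch, maxlen, embedding_dims)
print(lstm_output_shape(embedded, 32))                                            # (None, 32)
print(lstm_output_shape(embedded, 32, return_sequences=True))                     # (None, 80, 32)
print(lstm_output_shape(embedded, 32, return_sequences=True, bidirectional=True)) # (None, 80, 64)
```

Only the outputs with three dimensions can be fed into another LSTM layer.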

The code below is missing the model definition. You can copy any suitable layers from the example above.

In [None]:
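# The datasets yield raw review strings, so the input is a string tensor as in
# the example above; pass it through vectorization_layer before the Embedding.
ex1_inputs = keras.Input(shape=(1,), dtype=tf.string)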

# WRITE YOUR CODE HERE

ex1_outputs = None

Example answer:

In [None]:
%load tf2-aclImdb-rnn-example-answer.py

In [None]:
assert ex1_outputs is not None, "You need to write the missing model definition"

ex1_model = keras.Model(inputs=ex1_inputs, outputs=ex1_outputs,
                    name="rnn_model_with_two_lstm_layers")

ex1_model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
ex1_model.summary()  # summary() prints the table itself and returns None

In [None]:
%%time
ex1_epochs = 5

ex1_history = ex1_model.fit(imdb_train,
                            validation_data=imdb_valid,
                            epochs=ex1_epochs)

Let's plot the data to see how the training progressed. A big gap between training and validation accuracies would suggest overfitting.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,3))

ax1.plot(ex1_history.epoch,ex1_history.history['loss'], label='training')
ax1.plot(ex1_history.epoch,ex1_history.history['val_loss'], label='validation')
ax1.set_title('Task 1: loss')
ax1.set_xlabel('epoch')
ax1.legend(loc='best')

ax2.plot(ex1_history.epoch,ex1_history.history['accuracy'], label='training')
ax2.plot(ex1_history.epoch,ex1_history.history['val_accuracy'], label='validation')
ax2.set_title('Task 1: accuracy')
ax2.set_xlabel('epoch')
ax2.legend(loc='best');

In [None]:
ex1_scores = ex1_model.evaluate(imdb_test, verbose=2)
print("%s: %.2f%%" % (ex1_model.metrics_names[1], ex1_scores[1]*100))

## Task 2: Model tuning

Modify the model further. Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters.

To combat overfitting, you can, for example, add dropout. For [LSTMs](https://keras.io/api/layers/recurrent_layers/lstm/), dropout for inputs and the recurrent states can be set with the `dropout` and `recurrent_dropout` arguments:

    x = layers.LSTM(lstm_units, dropout=0.2,
                    recurrent_dropout=0.3)(x)
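
As a reminder of what dropout does during training, here is a minimal plain-Python sketch of the usual "inverted dropout" scheme: each value is zeroed with probability `rate`, and the survivors are scaled up so that the expected sum stays the same. The function is hypothetical and operates on a plain list; at inference time dropout is simply switched off.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`, scale survivors by 1/(1-rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)

print(sum(1 for v in dropped if v == 0.0) / len(dropped))  # close to 0.2
print(sum(dropped) / len(dropped))                          # close to 1.0
```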

Another option is to use early stopping in training. It can be implemented with the [EarlyStopping](https://keras.io/api/callbacks/early_stopping/) callback, for example:

    callbacks = [keras.callbacks.EarlyStopping(
                 monitor="val_loss", patience=1,
                 restore_best_weights=True)]

The callback then needs to be added as an argument to the [`model.fit()`](https://keras.io/api/models/model_training_apis/#fit-method) method.
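
The logic behind early stopping with `patience` and `restore_best_weights` can be sketched in a few lines of plain Python. The function below is a hypothetical illustration of the idea, not the Keras implementation, and the validation losses are made up.

```python
def early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch whose weights are restored)."""
    best_loss, best_epoch, wait = float('inf'), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses for illustration only
toy_val_losses = [0.50, 0.43, 0.41, 0.44, 0.49]
print(early_stopping(toy_val_losses, patience=1))  # (3, 2)
print(early_stopping(toy_val_losses, patience=3))  # (4, 2)
```

With `patience=1`, training stops at the first epoch that fails to improve the validation loss, and the weights from the best epoch so far are kept.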

You can also consult the Keras documentation at https://keras.io/. 

---
*Run this notebook in Google Colaboratory using [this link](https://colab.research.google.com/github/csc-training/intro-to-dl/blob/master/day1/04-tf2-imdb-rnn.ipynb).*