# IMDB movie review sentiment classification with CNNs

In this notebook, we'll train a convolutional neural network (CNN, ConvNet) for sentiment classification using **TensorFlow** (version $\ge$ 2.0 required) with the **Keras API**.  This notebook is largely based on the [`imdb_cnn.py` script](https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py) in the Keras examples.

First, the needed imports.

In [None]:
%matplotlib inline

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import plot_model

# distutils was removed in Python 3.12; packaging provides version comparison
from packaging.version import Version

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using TensorFlow version: {}, and Keras version: {}.'.format(tf.__version__, tf.keras.__version__))
assert Version(tf.__version__) >= Version("2.0.0")

## IMDB data set

Next we'll load the IMDB data set. The first time, we may have to download the data, which can take a while.

The dataset contains 50,000 movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. Half of the reviews are positive (1) and half are negative (0).

The dataset has already been preprocessed, and each word has been replaced by an integer index.
The reviews are thus represented as varying-length sequences of integers.
(Word indices are offset by 3: "1" marks the start of a review, "2" represents all out-of-vocabulary words, and actual words are stored as their frequency rank plus 3. "0" will be used later to pad shorter reviews to a fixed size.)
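
The index convention can be illustrated with a toy vocabulary. This is only a minimal sketch: `toy_word_index`, `encode` and `decode` are hypothetical names, and the four-word vocabulary stands in for the real one returned by `imdb.get_word_index()`, where each word maps to its frequency rank (1 = most frequent) and that rank is shifted by 3 in the data.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "great": 3, "was": 4}

PAD, START, OOV = 0, 1, 2   # reserved indices
INDEX_FROM = 3              # offset added to every word rank

def encode(text, word_index, num_words):
    """Encode a text with the same convention as the preprocessed reviews (sketch)."""
    encoded = [START]
    for word in text.lower().split():
        idx = word_index.get(word, 0) + INDEX_FROM
        # words missing from the vocabulary, or too rare for num_words, become OOV
        if word not in word_index or idx >= num_words:
            encoded.append(OOV)
        else:
            encoded.append(idx)
    return encoded

def decode(indices, word_index):
    """Map indices back to words; reserved indices are shown as '?'."""
    reverse = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i - INDEX_FROM, "?") for i in indices)

encoded = encode("the movie was great fun", toy_word_index, num_words=7)
decoded = decode(encoded, toy_word_index)
print(encoded)
print(decoded)
```

Here "was" is in the toy vocabulary but falls outside the `num_words=7` limit, so it is encoded as out-of-vocabulary just like the unknown word "fun".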

In [None]:
# number of most-frequent words to use
nb_words = 10000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=nb_words)
word_index = imdb.get_word_index()

print('IMDB data loaded:')
print('x_train:', x_train.shape)
print('y_train:', y_train.shape, 'positive:', np.sum(y_train))
print('x_test:', x_test.shape)
print('y_test:', y_test.shape, 'positive:', np.sum(y_test))

The first movie review in the training set:

In [None]:
print("First review in the training set:\n", x_train[0], "length:", len(x_train[0]), "class:", y_train[0])

As a sanity check, we can convert the review back to text:

In [None]:
# invert the mapping: index -> word
reverse_word_index = {value: key for (key, value) in word_index.items()}
# indices are offset by 3; reserved indices (0, 1, 2) are shown as '?'
decoded_review = ' '.join(reverse_word_index.get(i - 3, '?') for i in x_train[0])
print(decoded_review)

The training data consists of lists of word indices of varying length.  Let's inspect the distribution of review lengths in the training set:

In [None]:
lengths = [len(review) for review in x_train]
plt.figure()
plt.title('Length of training reviews')
plt.hist(lengths, 100);

Let's truncate the reviews to the first `maxlen` words, and pad any shorter reviews with zeros at the end.

In [None]:
maxlen = 400

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen,
                                 padding='post', truncating='post')
x_test = sequence.pad_sequences(x_test, maxlen=maxlen, 
                                padding='post', truncating='post')
print('x_train:', x_train.shape)
print('x_test:', x_test.shape)

print("First review in the training set:\n", x_train[0], 'length:', len(x_train[0]))

# after padding, every review has exactly maxlen entries
lengths = [len(review) for review in x_train]
plt.figure()
plt.title('Length of training reviews after padding')
plt.hist(lengths, 100);
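
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small NumPy sketch of the same operation. The helper `pad_post` is a hypothetical name written for illustration; it is not how `pad_sequences` is implemented, only a description of the result for these two settings.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Keep the first `maxlen` items of each sequence; fill the remainder with `value`."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        trimmed = seq[:maxlen]              # truncating='post': drop items from the end
        out[row, :len(trimmed)] = trimmed   # padding='post': padding stays at the end
    return out

toy = [[1, 14, 22], [1, 7, 8, 9, 10, 11]]
padded = pad_post(toy, maxlen=4)
print(padded)
```

Note that `pad_sequences` defaults to `'pre'` for both options, which would instead pad at the beginning and keep the last `maxlen` words.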

## Initialization

Let's create a 1D CNN model that has one (or optionally two) convolutional layers with *relu* as the activation function, followed by a *Dense* layer.  The first layer in the network is an *Embedding* layer that converts integer indices to dense vectors of length `embedding_dims`.  Dropout is applied after the embedding and dense layers, and max pooling after the convolutional layers. The output layer contains a single neuron with a *sigmoid* non-linearity to match the binary ground truth (`y_train`).

Finally, we `compile()` the model, using *binary crossentropy* as the loss function and [*RMSprop*](https://keras.io/optimizers/#rmsprop) as the optimizer.
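
As a rough illustration of what the sigmoid output and the binary crossentropy loss compute, here is a NumPy sketch. The function names and the clipping constant `eps` are choices made for this example, not taken from the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0])                        # one positive, one negative review
confident_right = sigmoid(np.array([4.0, -4.0]))     # scores that agree with the labels
confident_wrong = sigmoid(np.array([-4.0, 4.0]))     # scores that contradict the labels

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

The loss is small when confident predictions match the labels and grows quickly when they contradict them, which is what pushes the network towards correct, well-calibrated outputs.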

In [None]:
# model parameters:
embedding_dims = 50
cnn_filters = 100
cnn_kernel_size = 5
dense_hidden_dims = 200

inputs = keras.Input(shape=(None,), dtype="int64")

x = layers.Embedding(input_dim=nb_words, 
                     output_dim=embedding_dims)(inputs)
x = layers.Dropout(0.2)(x)

## uncomment these if you want to use two convolutional layers:
#x = layers.Conv1D(cnn_filters, cnn_kernel_size,
#                  padding='valid', activation='relu')(x)
#x = layers.MaxPooling1D(5)(x)

x = layers.Conv1D(cnn_filters, cnn_kernel_size,
                  padding='valid', activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)

x = layers.Dense(dense_hidden_dims, activation='relu')(x)
x = layers.Dropout(0.2)(x)

outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs,
                    name="cnn_model")
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.summary()  # prints the summary itself; wrapping it in print() would also print "None"

In [None]:
plot_model(model, show_shapes=True)
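
The shapes and parameter counts can also be worked out by hand. The sketch below assumes the single-convolution variant of the model and inputs padded to `maxlen`; the totals should agree with the summary above, but treat them as a cross-check rather than a replacement for it.

```python
# Hyperparameters copied from the cells above
nb_words, maxlen = 10000, 400
embedding_dims, cnn_filters, cnn_kernel_size, dense_hidden_dims = 50, 100, 5, 200

# padding='valid': the kernel must fit entirely inside the sequence
conv_steps = maxlen - cnn_kernel_size + 1

params = {
    "embedding": nb_words * embedding_dims,                                 # one vector per word index
    "conv1d": cnn_kernel_size * embedding_dims * cnn_filters + cnn_filters, # weights + biases
    "dense": cnn_filters * dense_hidden_dims + dense_hidden_dims,           # weights + biases
    "output": dense_hidden_dims * 1 + 1,                                    # weights + bias
}
total_params = sum(params.values())

print("conv output per review:", (conv_steps, cnn_filters))
print("after global max pooling:", (cnn_filters,))
for name, count in params.items():
    print(f"{name:>10}: {count:,}")
print(f"{'total':>10}: {total_params:,}")
```

Global max pooling keeps only the strongest response of each filter over the whole review, so the dense layer sees a fixed-size vector of length `cnn_filters` regardless of the review length.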

## Learning

Now we are ready to train our model.  An *epoch* means one pass through the whole training data. Note also that we are using a fraction of the training data as our validation set.

In [None]:
%%time
epochs = 10
validation_split = 0.2

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=epochs,
                    validation_split=validation_split)
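
The numbers behind one epoch can be checked with a little arithmetic. This sketch assumes the 25,000 training reviews loaded above; Keras takes the validation samples from the end of the provided data, before shuffling.

```python
import math

n_samples = 25000          # size of x_train in this notebook
validation_split = 0.2
batch_size = 128

# the last `validation_split` fraction of the samples is held out for validation
split_at = int(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# number of weight updates per epoch (the last batch may be smaller)
steps_per_epoch = math.ceil(n_train / batch_size)

print("training samples:", n_train)
print("validation samples:", n_val)
print("batches per epoch:", steps_per_epoch)
```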

Let's plot the data to see how the training progressed. A big gap between training and validation accuracies would suggest overfitting.

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'], label='training')
plt.plot(history.epoch,history.history['val_loss'], label='validation')
plt.title('loss')
plt.legend(loc='best')

plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['accuracy'], label='training')
plt.plot(history.epoch,history.history['val_accuracy'], label='validation')
plt.title('accuracy')
plt.legend(loc='best');
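
Besides looking at the plots, the same information can be read off numerically. The sketch below uses a hypothetical helper `best_epoch` and made-up illustrative numbers in `toy_history`; they are not results from this notebook.

```python
def best_epoch(history_dict, monitor="val_loss"):
    """Return (epoch index, value) where the monitored quantity is lowest."""
    values = history_dict[monitor]
    best = min(range(len(values)), key=values.__getitem__)
    return best, values[best]

# Made-up illustrative numbers, NOT results from this notebook
toy_history = {
    "loss":     [0.60, 0.35, 0.25, 0.18, 0.12],
    "val_loss": [0.45, 0.32, 0.30, 0.33, 0.38],
}

epoch, value = best_epoch(toy_history)
final_gap = toy_history["val_loss"][-1] - toy_history["loss"][-1]

print("lowest validation loss at epoch index", epoch, "with value", value)
print("final gap between validation and training loss:", round(final_gap, 2))
```

With a real run, `history.history` can be passed in place of `toy_history`. If the validation loss starts rising while the training loss keeps falling, stopping earlier (for example with the `keras.callbacks.EarlyStopping` callback) is one option worth trying.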

## Inference

For a better measure of the quality of the model, let's evaluate its accuracy on the test data, which the model has not seen during training.

In [None]:
scores = model.evaluate(x_test, y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

We can also use the learned model to predict sentiments for new reviews:

In [None]:
myreviewtext = 'this movie was the worst i have ever seen and the actors were horrible'
#myreviewtext = 'this movie is great and i madly love the plot from beginning to end'

myreview = np.zeros((1,maxlen), dtype=int)
myreview[0, 0] = 1

for i, w in enumerate(myreviewtext.split()):
    if w in word_index and word_index[w]+3<nb_words:
        myreview[0, i+1] = word_index[w]+3
    else:
        print('word not in vocabulary:', w)
        myreview[0, i+1] = 2

print(myreview, "shape:", myreview.shape)

p = model.predict(myreview, batch_size=1) # values close to "0" mean negative, close to "1" positive
print('Predicted sentiment: {:.10f}'.format(p[0,0]))
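
The encoding steps above can be wrapped in a reusable function. This is a sketch under assumptions: `encode_review` and `to_label` are hypothetical helper names, the 0.5 threshold is a common default rather than something tuned for this model, and the toy vocabulary stands in for the real `word_index`.

```python
import numpy as np

def encode_review(text, word_index, num_words, maxlen):
    """Encode a raw text as a (1, maxlen) array using the IMDB index convention."""
    review = np.zeros((1, maxlen), dtype=int)    # 0 = padding
    review[0, 0] = 1                             # 1 = start-of-review marker
    words = text.lower().split()[:maxlen - 1]    # leave room for the start marker
    for i, w in enumerate(words):
        idx = word_index.get(w)
        if idx is not None and idx + 3 < num_words:
            review[0, i + 1] = idx + 3           # word ranks are offset by 3
        else:
            review[0, i + 1] = 2                 # 2 = out-of-vocabulary
    return review

def to_label(p, threshold=0.5):
    """Turn a sigmoid output into a class label."""
    return "positive" if p >= threshold else "negative"

# Toy stand-in for imdb.get_word_index()
toy_word_index = {"this": 1, "movie": 2, "was": 3, "great": 4}
encoded = encode_review("This movie was great fun", toy_word_index,
                        num_words=10000, maxlen=8)
print(encoded)
print(to_label(0.93), to_label(0.02))
```

In the notebook itself, the call would look like `encode_review(myreviewtext, word_index, nb_words, maxlen)`, and the result can be passed to `model.predict` as before. Unlike the loop above, this version also truncates texts longer than `maxlen - 1` words instead of raising an `IndexError`.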

## Model tuning

Modify the model.  Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters.  

You can also consult the Keras documentation at https://keras.io/. 

---
*Run this notebook in Google Colaboratory using [this link](https://colab.research.google.com/github/csc-training/intro-to-dl/blob/master/day1/04a-tf2-imdb-cnn.ipynb).*