# IMDB movie review sentiment classification with MLPs

In this notebook, we'll train a multi-layer perceptron model to classify IMDB movie reviews using **Keras** (version $\ge$ 2 required). This notebook is largely based on the [Classifying movie reviews notebook](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.5-classifying-movie-reviews.ipynb) by François Chollet.

First, the needed imports. Keras tells us which backend (Theano, TensorFlow, CNTK) it will be using.

In [None]:
%matplotlib inline

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
#from keras.utils import np_utils
from keras import backend as K

from distutils.version import LooseVersion as LV
from keras import __version__

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print('Using Keras version:', __version__, 'backend:', K.backend())
assert(LV(__version__) >= LV("2.0.0"))

## IMDB data set

Next we'll load the IMDB dataset. The first time, we may need to download the data, which can take a while.

The dataset has already been preprocessed, and each word has been replaced by an integer index.
The reviews are thus represented as varying-length sequences of integers.
(Word indices are offset by 3: "0" is reserved for padding, "1" marks the start of a review, and "2" represents all out-of-vocabulary words.)

The ground truth consists of binary sentiments for each review: positive (1) or negative (0).
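To make the index convention concrete, here is a minimal sketch that encodes and decodes a review by hand. The four-word vocabulary `toy_word_index` and the helper names `encode` and `decode` are made up for illustration; the offset of 3 is assumed to match the default `index_from=3` of `imdb.load_data`.

```python
# Hypothetical toy vocabulary: maps each word to its frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

INDEX_FROM = 3      # assumed default offset added to every word rank
START, OOV = 1, 2   # reserved indices: start-of-review marker, out-of-vocabulary word


def encode(text, word_index, num_words):
    """Encode a lower-case, space-separated review as a list of integers."""
    encoded = [START]
    for word in text.split():
        rank = word_index.get(word)
        if rank is not None and rank + INDEX_FROM < num_words:
            encoded.append(rank + INDEX_FROM)
        else:
            encoded.append(OOV)  # unknown or too-rare word
    return encoded


def decode(sequence, word_index):
    """Map integers back to words; reserved indices become '?'."""
    reverse = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i - INDEX_FROM, "?") for i in sequence)


encoded = encode("the movie was great fun", toy_word_index, num_words=10000)
decoded = decode(encoded, toy_word_index)
print(encoded)   # [1, 4, 5, 6, 7, 2]
print(decoded)   # ? the movie was great ?
```

The start marker and the unknown word "fun" both decode to "?", which is why question marks also show up when a real review is converted back to text.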

In [None]:
from keras.datasets import imdb

# number of most-frequent words 
nb_words = 10000

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=nb_words)
word_index = imdb.get_word_index()

print('IMDB data loaded:')
print('x_train:', x_train.shape)
print('y_train:', y_train.shape)
print('x_test:', x_test.shape)
print('y_test:', y_test.shape)

The first review in the training set: 

In [None]:
print(x_train[0], "length:", len(x_train[0]), "class:", y_train[0])

As a sanity check, we can convert the review back to text:

In [None]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train[0]])
print(decoded_review)

### Bag-of-words

An MLP network expects input data to be of fixed size, so we convert the integer sequences into vectors of length `nb_words`.  Each element in the vectors corresponds to a specific word in the vocabulary: the element is "1" if the word appears in the review and "0" otherwise.  

In [None]:
def vectorize_sequences(sequences, dimension=nb_words):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Convert training data to bag-of-words:
X_train = vectorize_sequences(x_train)
X_test = vectorize_sequences(x_test)

# Convert labels from integers to floats:
y_train = np.asarray(y_train).astype('float32')
y_test = np.asarray(y_test).astype('float32')

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

The first training review now looks like this: 

In [None]:
print(X_train[0], "length:", len(X_train[0]), "class:", y_train[0])
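The effect of the bag-of-words conversion is easier to see on a tiny input. The sketch below reuses the same logic on two made-up sequences with a vocabulary of only 8 indices (`toy_sequences` and the dimension are illustrative values, not taken from the dataset).

```python
import numpy as np


def vectorize_sequences(sequences, dimension):
    """Multi-hot encode integer sequences into fixed-length 0/1 vectors."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # sets every listed position in row i at once
    return results


# Two toy "reviews" of different lengths; index 4 occurs twice in the first one.
toy_sequences = [[1, 4, 4, 6], [1, 2]]
toy_vectors = vectorize_sequences(toy_sequences, dimension=8)
print(toy_vectors)
# [[0. 1. 0. 0. 1. 0. 1. 0.]
#  [0. 1. 1. 0. 0. 0. 0. 0.]]
```

Both rows have the same length regardless of the review length, and the repeated index 4 still yields a single "1": word order and word counts are discarded by this representation.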

## Multi-layer perceptron (MLP) network

### Initialization

Let's create a three-layer MLP model that has two hidden dense layers with *relu* activation functions.  The output layer contains a single neuron with a *sigmoid* non-linearity to match the ground truth (`y_train`). 

Finally, we `compile()` the model, using *binary crossentropy* as the loss function and [*RMSprop*](https://keras.io/optimizers/#rmsprop) as the optimizer.

In [None]:
nb_units = 16

model = Sequential()
model.add(Dense(units=nb_units, activation='relu', input_shape=(nb_words,)))
model.add(Dense(units=nb_units, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"
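The parameter counts reported in the model summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A small sketch, assuming the layer sizes used above (10000 inputs, two layers of 16 units, one output); the helper `dense_params` is a name introduced here for illustration.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a dense layer: weight matrix plus bias vector."""
    return n_inputs * n_units + n_units


layer_sizes = [10000, 16, 16, 1]  # input dimension followed by units per layer
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [160016, 272, 17] 160305
```

Almost all parameters sit in the first layer, because it connects every vocabulary entry to every hidden unit.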

In [None]:
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))
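For intuition about the *binary crossentropy* loss chosen above, here is a NumPy sketch of the formula, the mean of $-[y \log p + (1-y) \log(1-p)]$ over the samples. It is a plain re-implementation for illustration, not the Keras code; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


labels = [1, 0]
loss_confident_right = binary_crossentropy(labels, [0.9, 0.1])
loss_confident_wrong = binary_crossentropy(labels, [0.1, 0.9])
print(loss_confident_right, loss_confident_wrong)
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalized heavily.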

### Learning

Now we are ready to train our model. An *epoch* means one pass through the whole training data. 

In [None]:
%%time
epochs = 10

history = model.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=512,
                    verbose=2)
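To relate epochs to the number of weight updates, the following sketch counts the mini-batches processed during training. It assumes the 25000 reviews of the IMDB training split and the `batch_size` and `epochs` values used above.

```python
import math

n_train = 25000   # reviews in the IMDB training split
batch_size = 512
epochs = 10

steps_per_epoch = math.ceil(n_train / batch_size)          # last batch is smaller
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch_size, total_updates)     # 49 424 490
```

A smaller batch size gives more, but noisier, updates per epoch.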

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'])
plt.title('training loss')

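# The accuracy key is 'acc' in older Keras releases and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history[acc_key])
plt.title('training accuracy');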

### Inference

For a better measure of the quality of the model, let's see the model accuracy for the test reviews. 

If accuracy on the test set is notably worse than on the training set, the model has likely overfitted to the training samples.

In [None]:
%%time
scores = model.evaluate(X_test, y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
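One common way to watch for overfitting during training, rather than only afterwards, is to hold out part of the training data as a validation set. The sketch below shows the split on toy arrays; `hold_out`, `X_toy` and `y_toy` are hypothetical names, and the split assumes the samples are already in random order.

```python
import numpy as np


def hold_out(X, y, n_val):
    """Reserve the first n_val samples for validation; the rest stay for training."""
    return (X[n_val:], y[n_val:]), (X[:n_val], y[:n_val])


X_toy = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y_toy = np.arange(10) % 2              # alternating toy labels
(X_part, y_part), (X_val, y_val) = hold_out(X_toy, y_toy, n_val=3)
print(X_part.shape, X_val.shape)       # (7, 2) (3, 2)

# With the real data, the held-out arrays could be passed to training as
# model.fit(..., validation_data=(X_val, y_val)) to report validation metrics
# after every epoch.
```

If the validation loss starts rising while the training loss keeps falling, training for fewer epochs is usually a reasonable first remedy.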

We can also use the learned model to predict sentiments for new reviews:

In [None]:
myreviewtext = 'this movie was the worst i have ever seen and the actors were horrible'

myreview = np.zeros((1,nb_words))
myreview[0, 1] = 1.0

for w in myreviewtext.split():
    if w in word_index and word_index[w]+3<nb_words:
        myreview[0, word_index[w]+3] = 1.0
    else:
        print('word not in vocabulary:', w)
        myreview[0, 2] = 1.0
        
print(myreview, "shape:", myreview.shape)

In [None]:
model.predict(myreview, batch_size=1) # values close to "0" mean negative, close to "1" positive
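The value returned by the sigmoid output can be turned into a class label by thresholding. A minimal sketch, assuming a cut-off of 0.5; `to_label` is a helper introduced here, and the probabilities in the example are made-up values rather than real model outputs.

```python
import numpy as np


def to_label(probabilities, threshold=0.5):
    """Convert sigmoid outputs of shape (n, 1) or (n,) into sentiment labels."""
    flat = np.asarray(probabilities, dtype=float).ravel()
    return ["positive" if p >= threshold else "negative" for p in flat]


example_outputs = [[0.03], [0.97]]   # illustrative values only
predicted_labels = to_label(example_outputs)
print(predicted_labels)              # ['negative', 'positive']
```

With the trained model, the same helper could be applied to the array returned by `model.predict(myreview)`.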

## Model tuning

Modify the model.  Try to improve the classification accuracy on the test set, or experiment with the effects of different parameters (e.g. number of layers, neurons in each layer, different activation functions).  

You can also consult the Keras documentation at https://keras.io/.  For example, the Dense layer is documented at https://keras.io/layers/core/.