# IMDB movie review sentiment classification with CNNs

In this notebook, we'll train a convolutional neural network (CNN, ConvNet) for sentiment classification using Keras.  Keras version $\ge$ 2 is required.  This notebook is largely based on the [`imdb_cnn.py` script](https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py) in the Keras examples.

First, the needed imports. Keras tells us which backend (Theano, TensorFlow, CNTK) it will be using.

In [None]:
%matplotlib inline

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D

from distutils.version import LooseVersion as LV
from keras import __version__
from keras import backend as K

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print('Using Keras version:', __version__, 'backend:', K.backend())
assert(LV(__version__) >= LV("2.0.0"))

## IMDB data set

Next we'll load the IMDB data set. The first time we run this, the data may have to be downloaded, which can take a while.

In [None]:
from keras.datasets import imdb

# number of most-frequent words to use
nb_words = 10000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=nb_words)

print('IMDB data loaded:')
print('x_train:', x_train.shape)
print('y_train:', y_train.shape)
print('x_test:', x_test.shape)
print('y_test:', y_test.shape)

The first movie review in the training set:

In [None]:
print(x_train[0], "length:", len(x_train[0]), "class:", y_train[0])
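
The integers above are word indices, not text. As a minimal sketch of how a review could be mapped back to words, we assume the default conventions of `imdb.load_data` (0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and all real word indices shifted by 3). The `toy_word_index` dictionary and `toy_review` list are made-up stand-ins; the real mapping would come from `imdb.get_word_index()`.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer word indices back to a space-separated string."""
    # Invert the word -> rank mapping and apply the offset used by load_data.
    index_to_word = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved indices (assumed load_data defaults).
    index_to_word.update({0: '<pad>', 1: '<start>', 2: '<oov>'})
    return ' '.join(index_to_word.get(i, '<oov>') for i in encoded)

# Hypothetical toy vocabulary, for illustration only.
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
toy_review = [1, 4, 5, 6, 7, 2]

decoded = decode_review(toy_review, toy_word_index)
print(decoded)  # <start> the movie was great <oov>
```

With the real data, the call would look like `decode_review(x_train[0], imdb.get_word_index())`, applied before the reviews are padded.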

The training data consists of lists of word indices of varying length.  Let's inspect the distribution of review lengths in the training set:

In [None]:
# number of words in each training review
lengths = [len(review) for review in x_train]
plt.hist(lengths, 100);

Let's truncate the reviews to their first `maxlen` words, and pad any shorter reviews with zeros at the end.

In [None]:
maxlen = 400

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen,
                                 padding='post', truncating='post')
x_test = sequence.pad_sequences(x_test, maxlen=maxlen, 
                                padding='post', truncating='post')
print('x_train:', x_train.shape)
print('x_test:', x_test.shape)

print(x_train[0], 'length:', len(x_train[0]))

# sanity check: after padding, every review has length maxlen
lengths = [len(review) for review in x_train]
plt.hist(lengths, 100);
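
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small NumPy sketch that mimics this behaviour on toy data. The helper `pad_post` is a hypothetical name used for illustration only; it is not how `pad_sequences` is implemented.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Keep the first maxlen items of each sequence and pad shorter ones at the end."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]            # truncating='post': drop the tail
        out[row, :len(kept)] = kept    # padding='post': fill values follow the data
    return out

toy = [[5, 6, 7, 8, 9], [3, 4]]
padded = pad_post(toy, maxlen=4)
print(padded)
# [[5 6 7 8]
#  [3 4 0 0]]
```

Note that `pad_sequences` defaults to `'pre'` for both arguments, which is why the cell above sets them explicitly.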

## Initialization

In [None]:
# model parameters:
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250

In [None]:
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(nb_words,
                    embedding_dims,
                    input_length=maxlen))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters`
# word-group filters, each spanning kernel_size words:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())
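
To see what the `Conv1D` and `GlobalMaxPooling1D` layers compute for a single review, here is a rough NumPy sketch. The random arrays are stand-ins for one embedded review and for the learned weights, and `conv1d_valid_relu` is a hypothetical helper written for clarity, not speed.

```python
import numpy as np

def conv1d_valid_relu(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters); bias: (filters,)."""
    kernel_size, _, n_filters = kernels.shape
    out_steps = x.shape[0] - kernel_size + 1      # 'valid' padding, stride 1
    out = np.empty((out_steps, n_filters))
    for t in range(out_steps):
        window = x[t:t + kernel_size]             # kernel_size consecutive word vectors
        # Contract the window against all filters at once.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(400, 50))            # one review: maxlen x embedding_dims
kernels = rng.normal(size=(3, 50, 250))   # kernel_size x embedding_dims x filters
bias = np.zeros(250)

features = conv1d_valid_relu(x, kernels, bias)
pooled = features.max(axis=0)             # global max pooling over the word axis
print(features.shape, pooled.shape)       # (398, 250) (250,)
```

Each of the 250 filters slides over every window of three consecutive words, and the pooling step keeps only its strongest response, so the output size no longer depends on the review length.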

In [None]:
SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))
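
The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below repeats the hyperparameter values set earlier in the notebook; the dropout, pooling and activation layers have no weights, so they are left out.

```python
# Hyperparameters as defined earlier in the notebook.
nb_words, embedding_dims, maxlen = 10000, 50, 400
filters, kernel_size, hidden_dims = 250, 3, 250

# Number of positions the convolution visits ('valid' padding, stride 1).
conv_steps = maxlen - kernel_size + 1

params = {
    'embedding': nb_words * embedding_dims,                  # one vector per word
    'conv1d': (kernel_size * embedding_dims + 1) * filters,  # weights + bias per filter
    'dense_hidden': (filters + 1) * hidden_dims,             # weights + bias per unit
    'dense_output': hidden_dims + 1,                         # single sigmoid unit
}
total = sum(params.values())

print(conv_steps)  # 398
print(params)
print(total)       # 600751
```

The embedding layer accounts for most of the weights, so the vocabulary size `nb_words` dominates the size of the model.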

## Learning

In [None]:
%%time
batch_size = 32
epochs = 2

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs)

In [None]:
plt.figure(figsize=(5,3))
plt.plot(history.epoch,history.history['loss'])
plt.title('training loss')

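plt.figure(figsize=(5,3))
# the history key is 'acc' in older Keras 2.x releases and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.epoch, history.history[acc_key])
plt.title('training accuracy');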
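
The loss plotted above is the binary cross-entropy that we passed to `model.compile`. As a rough sketch of what that number measures, here it is computed with NumPy on made-up labels and probabilities. The clipping constant `eps` is an assumption chosen to avoid `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)   # keep log() finite
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Hypothetical labels and sigmoid outputs, for illustration only.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4])

loss = binary_crossentropy(y_true, y_prob)
chance = binary_crossentropy(np.array([1, 0]), np.array([0.5, 0.5]))
print(round(loss, 4), round(chance, 4))
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so a training loss well below that value indicates the network is doing better than guessing.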

## Inference

In [None]:
scores = model.evaluate(x_test, y_test, verbose=2)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
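
The accuracy above comes from thresholding the sigmoid outputs. Here is a minimal sketch of that step, assuming a threshold of 0.5; the probabilities and labels are toy values shaped like the output of `model.predict`, and `accuracy_from_probabilities` is a hypothetical helper.

```python
import numpy as np

def accuracy_from_probabilities(y_prob, y_true, threshold=0.5):
    """Turn sigmoid outputs into 0/1 predictions and compare them with the labels."""
    y_pred = (np.ravel(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.ravel(y_true)))

# Toy values for illustration: one probability per review, in a column.
toy_probs = np.array([[0.92], [0.13], [0.55], [0.48]])
toy_labels = np.array([1, 0, 0, 0])

acc = accuracy_from_probabilities(toy_probs, toy_labels)
print(acc)  # 0.75
```

With the trained model, the same check would be `accuracy_from_probabilities(model.predict(x_test), y_test)`.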