# Example 1 - IMDB Sentiment Classification

This example trains a long short-term memory (LSTM) network on the IMDB sentiment classification task.
The dataset consists of 50,000 movie reviews from IMDB, labeled by sentiment (positive/negative). The reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
This notebook is based on:
- https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
- https://github.com/fchollet/keras/blob/master/examples/imdb_bidirectional_lstm.py
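
To make the frequency-rank encoding described above concrete, here is a minimal sketch on a toy corpus. The corpus, the `encode` helper, and the alphabetical tie-breaking are assumptions made for illustration; the real dataset ships already encoded, and its exact preprocessing may differ.

```python
from collections import Counter

# Toy corpus (hypothetical); the real IMDB data is already encoded.
reviews = [
    "the movie was great",
    "the plot was thin but the cast was great",
]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency; ties are broken alphabetically
# (an arbitrary choice for this sketch). Rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, word_index, unknown=0):
    """Map each word to its frequency rank, or to `unknown` if unseen."""
    return [word_index.get(word, unknown) for word in review.split()]

encoded = encode("the cast was awful", word_index)
print(encoded)  # [1, 5, 2, 0]
```

Here "awful" never appears in the toy corpus, so it is mapped to 0, following the convention stated above.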

First, we load general dependencies from NumPy and Keras:

In [1]:
from __future__ import print_function
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb

Using TensorFlow backend.


We limit the vocabulary to the `max_features` most frequent words and the length of each review to `maxlen` words:

In [2]:
max_features = 20000
# cut texts after this number of words
# (among top max_features most common words)
maxlen = 100
batch_size = 32
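
As a rough illustration of what limiting the vocabulary means for frequency-ranked indexes, the sketch below replaces every index at or above the limit with a single placeholder. The helper name `cap_vocabulary` and the placeholder value 2 are assumptions for this sketch; the placeholder the loader actually uses depends on its `oov_char` argument.

```python
def cap_vocabulary(sequence, num_words, oov_index):
    """Replace every word index >= num_words with oov_index."""
    return [i if i < num_words else oov_index for i in sequence]

# Hypothetical encoded review containing two rare words (25000 and 20000).
review = [4, 12, 25000, 7, 19999, 20000]
capped = cap_vocabulary(review, num_words=20000, oov_index=2)
print(capped)  # [4, 12, 2, 7, 19999, 2]
```

Because indexes are frequency ranks, capping at 20000 keeps the most common words and collapses all rarer ones into one token.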

Next, we load the IMDB dataset and print the number of documents.

In [10]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print()
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...



We pad or truncate every review to a fixed length of `maxlen` words, so that all sequences fit into a single array of shape (samples, time).

In [4]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
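
The padding step can be sketched in plain NumPy. This is not the Keras implementation; it is a simplified stand-in that assumes the default behavior of `pad_sequences`: zeros are added at the front of short sequences, and long sequences lose their earliest words.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen, value=0):
    """Return an int array of shape (len(seqs), maxlen).

    Short sequences are left-padded with `value`; long sequences keep
    only their last `maxlen` entries.
    """
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        trimmed = seq[-maxlen:]              # drop the earliest words if too long
        out[row, -len(trimmed):] = trimmed   # right-align the kept words
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_or_truncate(toy, maxlen=4)
print(padded.tolist())  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```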


We define the network architecture: an embedding layer, an LSTM layer with dropout on its inputs and on its recurrent connections to reduce overfitting, and a single sigmoid output unit.

In [5]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

Build model...
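
To get a feel for the size of this model, the trainable parameters can be counted by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for illustration.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word index.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias.
    return input_dim * units + units

emb = embedding_params(20000, 128)   # 2,560,000
lstm = lstm_params(128, 128)         # 131,584
dense = dense_params(128, 1)         # 129
total = emb + lstm + dense
print(total)  # 2691713
```

Almost all of the parameters sit in the embedding layer, which is one reason why capping the vocabulary with `max_features` matters.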


Compile the model...

In [6]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
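
For reference, the loss and metric named in the compile step can be written out in NumPy. This is a minimal sketch, not the Keras code: it assumes the usual definition of binary cross-entropy and a 0.5 decision threshold for accuracy, and the labels and probabilities are made-up values.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = (y_prob > threshold).astype(int)
    return float(np.mean(predictions == y_true))

# Hypothetical labels and sigmoid outputs for four reviews.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), acc)  # 0.5403 0.5
```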

...and train the model!

In [7]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
Test score: 0.397236787615
Test accuracy: 0.84064


Test accuracy after 2 epochs is approximately 0.841. Training for more epochs may increase accuracy further, although the gain is not guaranteed, since the model can start to overfit.
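
To put the number of epochs in perspective, the following arithmetic shows how many weight updates the run above performs. It assumes one update per mini-batch and that the last, smaller batch is also used.

```python
import math

n_train = 25000   # training reviews
batch_size = 32
epochs = 2

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch, total_updates)  # 782 8 1564
```

Each additional epoch therefore adds another 782 updates over the same 25,000 reviews.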

We can experiment with different neural net architectures in order to improve our results. For example, we can replace the LSTM with a Bidirectional LSTM:

In [8]:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
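
The idea behind the bidirectional wrapper can be sketched with a deliberately simplified recurrent cell. The code below uses a plain tanh RNN rather than an LSTM, with random weights and random inputs, purely to show the data flow: one pass reads the sequence forward, a second pass with its own weights reads it backward, and the two final states are concatenated (the wrapper's default merge mode). All names here are illustrative.

```python
import numpy as np

def simple_rnn_last_state(inputs, w_in, w_rec):
    """Run a plain tanh RNN over inputs of shape (time, features); return the final state."""
    state = np.zeros(w_rec.shape[0])
    for x_t in inputs:
        state = np.tanh(x_t @ w_in + state @ w_rec)
    return state

rng = np.random.default_rng(0)
timesteps, features, units = 100, 128, 64
inputs = rng.normal(size=(timesteps, features))  # stand-in for embedded words

# Each direction has its own, independent weights.
fw_in = rng.normal(scale=0.1, size=(features, units))
fw_rec = rng.normal(scale=0.1, size=(units, units))
bw_in = rng.normal(scale=0.1, size=(features, units))
bw_rec = rng.normal(scale=0.1, size=(units, units))

forward = simple_rnn_last_state(inputs, fw_in, fw_rec)         # reads first to last
backward = simple_rnn_last_state(inputs[::-1], bw_in, bw_rec)  # reads last to first
merged = np.concatenate([forward, backward])
print(merged.shape)  # (128,)
```

With 64 units per direction, the concatenated vector has 128 entries, so the final dense layer sees the same input width as in the unidirectional model.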

Again, we compile and train our model:

In [9]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
Test score: 0.352414491043
Test accuracy: 0.84708


Test accuracy after 2 epochs is approximately 0.847, slightly above the 0.841 of the unidirectional LSTM. As before, training for more epochs may increase accuracy further.