## Using a Recurrent Neural Network to classify sentiment on IMDB data
The IMDB data consists of 25,000 training sequences and 25,000 test sequences. The outcome is binary (positive/negative), and the two classes are equally represented in both the training set and the test set.

Word embedding is a technique in which words are encoded as real-valued vectors in a high-dimensional space, so that words with similar meanings lie close together in that space.  
The embedding layer takes arguments that define this mapping. The first is the vocabulary size: the maximum number of distinct words expected, i.e. one more than the largest integer index that appears in the input. The second is the output dimension: the dimensionality of each word vector.
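
To make the mapping concrete, here is a minimal NumPy sketch of what an embedding lookup computes. The vocabulary size of 10 and the 4-dimensional vectors are toy values chosen for illustration, not the ones used later in this notebook, and the table is filled with random numbers, whereas a real embedding layer learns its rows during training.

```python
import numpy as np

vocab_size = 10   # illustrative: word ids run from 0 to 9
output_dim = 4    # illustrative: length of each word vector

rng = np.random.default_rng(0)
# One row per word id; a trainable layer would update these rows during training.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, output_dim))

# A batch of 2 sequences, each 3 word ids long.
word_ids = np.array([[1, 5, 9],
                     [0, 5, 2]])

# The lookup replaces every id with the corresponding row of the table.
embedded = embedding_table[word_ids]

print(embedded.shape)  # (2, 3, 4): (batch, sequence length, output_dim)
```

The same word id always maps to the same vector, so the two occurrences of id 5 above produce identical rows.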

## Load libraries and data

In [1]:
from __future__ import print_function  # __future__ imports must come before any other statement

import warnings
warnings.filterwarnings('ignore')

import keras
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, SimpleRNN
from keras.datasets import imdb
from keras import initializers

Using TensorFlow backend.


In [2]:
max_features = 5000  # This is used in loading the data, picks the most common (max_features) words
maxlen = 500  # maximum length of a sequence - truncate after this
batch_size = 64

In [3]:
# Load the data. Each review is already encoded as a list of integer word indices.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)  # keep only the 5,000 most frequent words
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

25000 train sequences
25000 test sequences


In [4]:
# Summarize number of words
print("Number of words:", len(np.unique(np.hstack(X))))
# Summarize review length
result = [len(x) for x in X]
print("Review length: " + "mean %.2f words (%f)" % (np.mean(result), np.std(result)))

Number of words: 4998
Review length: mean 234.76 words (172.911495)


In [5]:
# Pad with zeros (or truncate) so that every sequence has exactly maxlen entries
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('x_train shape:', X_train.shape)
print('x_test shape:', X_test.shape)

x_train shape: (25000, 500)
x_test shape: (25000, 500)
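
By default `pad_sequences` pads and truncates at the start of each sequence, which is why short reviews begin with a run of zeros after this step. The pure-Python sketch below mimics that default behaviour; `pad_pre` is a hypothetical helper written for illustration and is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                             # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded

toy = [[1, 14, 22], [1, 4, 7, 9, 11, 13]]   # toy sequences, not IMDB data
print(pad_pre(toy, maxlen=4))
# [[0, 1, 14, 22], [7, 9, 11, 13]]
```

Keeping the end of a long review rather than its beginning means the most recent words are the ones the RNN sees last.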


In [6]:
X_train[24000,:]  #Here's what an example sequence looks like

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          1,   14,  201,  100,   55,   73,   30,    4,  118,    2,  126,
          5,   15,    9,  660,    6,   87,  855, 1069,    4,    2,    2,
          2,   52,    2,    8,  403,   43,  107,   10,   10,   51,   93,
          2,   38, 1731,   60,    8,    4,  118,   

## Keras layers for (Vanilla) RNNs

### Embedding Layer
`keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)`

- This layer maps each integer into a distinct (dense) word vector of length `output_dim`.
- You can think of this as learning a word embedding "on the fly" rather than using an existing mapping (like GloVe).
- The `input_dim` should be the size of the vocabulary.
- The `input_length` specifies the length of the sequences that the network expects.

### SimpleRNN Layer
`keras.layers.recurrent.SimpleRNN(units, activation='tanh', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)`

- This is the basic RNN, where the output is also fed back as the "hidden state" to the next iteration.
- The parameter `units` gives the dimensionality of the output (and therefore the hidden state).  Note that typically there will be another layer after the RNN mapping the (RNN) output to the network output.  So we should think of this value as the desired dimensionality of the hidden state and not necessarily the desired output of the network.
- Recall that there are two sets of weights: the "kernel" weights, applied to the current input, and the "recurrent" weights, applied to the previous hidden state.  These can be configured separately in terms of their initialization, regularization, etc.
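
The recurrence described in these bullets can be sketched in a few lines of NumPy. This is an illustrative re-implementation under toy sizes, not the Keras source: at each time step the new hidden state is `activation(x_t @ kernel + h_prev @ recurrent_kernel + bias)`, and the state after the last step is returned as the layer output. The function name and all sizes below are assumptions made for the example.

```python
import numpy as np

def simple_rnn_forward(inputs, kernel, recurrent_kernel, bias, activation=np.tanh):
    """Run a basic RNN over one sequence and return the final hidden state.

    inputs: (timesteps, input_dim), kernel: (input_dim, units),
    recurrent_kernel: (units, units), bias: (units,)
    """
    hidden = np.zeros(recurrent_kernel.shape[0])  # the hidden state starts at zero
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        hidden = activation(x_t @ kernel + hidden @ recurrent_kernel + bias)
    return hidden

rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 3, 2            # toy sizes chosen for illustration
toy_sequence = rng.normal(size=(timesteps, input_dim))
kernel = rng.normal(scale=0.1, size=(input_dim, units))
recurrent_kernel = np.eye(units)                 # identity matrix as one possible choice
bias = np.zeros(units)

final_state = simple_rnn_forward(toy_sequence, kernel, recurrent_kernel, bias)
print(final_state.shape)  # (2,)
```

The output has length `units` regardless of how long the input sequence is, which is why a further dense layer is usually needed to map it to the network output.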






In [7]:
np.random.seed(7)

In [9]:
## Let's build an RNN
rnn_hidden_dim = 5
word_embedding_dim = 75
model = Sequential()
model.add(Embedding(max_features, word_embedding_dim))  # This layer takes each integer in the sequence and embeds it in a 75-dimensional vector
model.add(Dropout(0.25))
model.add(SimpleRNN(rnn_hidden_dim,
                    kernel_initializer=initializers.RandomNormal(stddev=0.001),
                    recurrent_initializer=initializers.Identity(gain=1.0),
                    activation='relu',
                    input_shape=X_train.shape[1:]))
model.add(Dense(256))
model.add(Dropout(0.25))
model.add(Activation('relu'))

model.add(Dense(1))
model.add(Activation('sigmoid'))

rmsprop = keras.optimizers.RMSprop(lr = .0001)
model.compile(loss='binary_crossentropy', optimizer=rmsprop, metrics=['accuracy'])

In [11]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 75)          375000    
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 75)          0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 5)                 405       
_________________________________________________________________
dense_1 (Dense)              (None, 256)               1536      
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
__________
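
The `Param #` column can be reproduced by hand from the layer sizes. The sketch below uses hypothetical helper functions (not part of Keras) together with the values from this notebook: a vocabulary of 5000 words, 75-dimensional embeddings, 5 recurrent units, and dense layers of 256 and 1 units.

```python
def embedding_params(input_dim, output_dim):
    # One vector of length output_dim per vocabulary entry.
    return input_dim * output_dim

def simple_rnn_params(input_dim, units):
    # Kernel (input_dim x units) + recurrent kernel (units x units) + bias (units).
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

print(embedding_params(5000, 75))  # 375000
print(simple_rnn_params(75, 5))    # 405
print(dense_params(5, 256))        # 1536
print(dense_params(256, 1))        # 257
```

Almost all of the parameters sit in the embedding table; the recurrent layer itself is tiny by comparison.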

In [12]:
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(X_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x252e5d45c50>

In [13]:
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.2766432267475128
Test accuracy: 0.8876800000190734
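
The two values returned by `model.evaluate` are the loss the model was compiled with (binary cross-entropy, printed above as the "Test score") and the accuracy metric. Below is a minimal pure-Python sketch of how such values can be computed from predicted probabilities. The helper names and the toy data are illustrative, and the 0.5 threshold is the usual convention, assumed here rather than stated in this notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]            # toy labels, not IMDB data
probs = [0.9, 0.2, 0.4, 0.1]     # toy sigmoid outputs
print(binary_crossentropy(labels, probs))
print(accuracy(labels, probs))   # 0.75
```

A lower loss means the predicted probabilities are closer to the true labels, while accuracy only counts which side of the threshold each prediction falls on.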
