# IMDB RNN - Sentiment Analysis
This notebook shows my approach to sentiment analysis with a recurrent neural network (an LSTM). We will be using the IMDB dataset provided by Keras. Since the reviews have already been tokenised and encoded as integers, there is no need to sanitise the data.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences




In [2]:
data = keras.datasets.imdb


Below we set the maximum number of words and the variable "maxlen" used for the padding later. "max_words" caps the vocabulary: only roughly the 10,240 most frequent words keep their own index, and rarer words are replaced by an "unknown" marker. Without a limit, the vocabulary (and with it the embedding layer) would become unnecessarily large.

"maxlen" is the fixed length that every review in the X variables will be padded or truncated to.


In [3]:
max_words = 10240
maxlen = 500
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_words)
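
To make the effect of the vocabulary cap concrete, here is a small pure-Python sketch of what happens to word indices at or above "max_words". It is an illustration rather than the Keras implementation: the helper name "cap_vocabulary" and the sample review are made up, and the replacement index 2 is assumed to match the Keras default for unknown words.

```python
# Pure-Python sketch of the vocabulary cap applied by imdb.load_data(num_words=...).
# OOV_INDEX = 2 is assumed to match the Keras default `oov_char`.
OOV_INDEX = 2

def cap_vocabulary(sequence, num_words, oov_index=OOV_INDEX):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]

# Hypothetical encoded review containing two rare words (10240 and 48213)
sample_review = [1, 14, 22, 10239, 10240, 48213, 9]
capped = cap_vocabulary(sample_review, num_words=10240)
print(capped)  # [1, 14, 22, 10239, 2, 2, 9]
```

This is why the value 2 appears so often in the encoded reviews: it stands in for every word that falls outside the vocabulary.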


Below we see the review at index 6 of X_train (the seventh review). It is stored as a list of integers because the model works on numbers rather than text. Each integer identifies a word, and words are indexed by how often they occur in the dataset, so smaller numbers correspond to more common words. The label 1 means that the review is positive.

In [4]:
print('----Review----')
print(X_train[6])
print('----Label----')
print(y_train[6])


----Review----
[1, 6740, 365, 1234, 5, 1156, 354, 11, 14, 5327, 6638, 7, 1016, 2, 5940, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 9363, 1117, 1831, 7485, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 8564, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 7175, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 5390, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
----Label----
1


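As mentioned above, each number stands for a word. Here the numbers are converted back to words, and you can see that the wording used to review this movie is strongly positive. The code subtracts 3 from every index because Keras reserves the indices 0, 1 and 2 for padding, the start-of-review marker and unknown words. Those reserved values have no entry in the word index, so they are printed as "?". Common words like "the" and "and" are kept.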

In [5]:
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

review = [reverse_word_index.get(i-3, "?") for i in X_train[6]]
print(review)


['?', 'lavish', 'production', 'values', 'and', 'solid', 'performances', 'in', 'this', 'straightforward', 'adaption', 'of', 'jane', '?', 'satirical', 'classic', 'about', 'the', 'marriage', 'game', 'within', 'and', 'between', 'the', 'classes', 'in', '?', '18th', 'century', 'england', 'northam', 'and', 'paltrow', 'are', 'a', '?', 'mixture', 'as', 'friends', 'who', 'must', 'pass', 'through', '?', 'and', 'lies', 'to', 'discover', 'that', 'they', 'love', 'each', 'other', 'good', 'humor', 'is', 'a', '?', 'virtue', 'which', 'goes', 'a', 'long', 'way', 'towards', 'explaining', 'the', '?', 'of', 'the', 'aged', 'source', 'material', 'which', 'has', 'been', 'toned', 'down', 'a', 'bit', 'in', 'its', 'harsh', '?', 'i', 'liked', 'the', 'look', 'of', 'the', 'film', 'and', 'how', 'shots', 'were', 'set', 'up', 'and', 'i', 'thought', 'it', "didn't", 'rely', 'too', 'much', 'on', '?', 'of', 'head', 'shots', 'like', 'most', 'other', 'films', 'of', 'the', '80s', 'and', '90s', 'do', 'very', 'good', 'results']
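
The index shift can be sketched with a toy word index. The four ranks below are consistent with the output above (6740 decodes to "lavish", 22 to "film", 5 to "and" and 4 to "the"), but the dictionary is only a tiny hand-made subset, and the marker names "<PAD>", "<START>" and "<UNK>" are labels chosen for this illustration.

```python
# Toy word index: word -> frequency rank, in the style of imdb.get_word_index().
toy_word_index = {"the": 1, "and": 2, "film": 19, "lavish": 6737}

INDEX_OFFSET = 3  # encoded reviews shift every rank up by 3
# Illustrative names for the three reserved indices
RESERVED = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}

# Invert the mapping so we can look words up by rank
rank_to_word = {rank: word for word, rank in toy_word_index.items()}

def decode(sequence):
    """Turn a list of encoded indices back into words."""
    words = []
    for i in sequence:
        if i in RESERVED:
            words.append(RESERVED[i])
        else:
            words.append(rank_to_word.get(i - INDEX_OFFSET, "?"))
    return words

encoded = [0, 1, 6740, 22, 2, 5, 4]
decoded = decode(encoded)
print(decoded)
# ['<PAD>', '<START>', 'lavish', 'film', '<UNK>', 'and', 'the']
```

Naming the reserved indices explicitly makes it easier to tell a padded position from a word that was dropped by the vocabulary cap, which the plain "?" output cannot do.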

Below the padding is applied. Reviews vary in length, so pad_sequences adds zeros to the front of reviews shorter than "maxlen" and truncates reviews that are longer. This is so the model receives inputs of a uniform length. This process is done for both train and test data so an equal comparison can be performed.

In [6]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
print(X_train)


[[   0    0    0 ...   19  178   32]
 [   0    0    0 ...   16  145   95]
 [   0    0    0 ...    7  129  113]
 ...
 [   0    0    0 ...    4 3586    2]
 [   0    0    0 ...   12    9   23]
 [   0    0    0 ...  204  131    9]]


In [7]:
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)
print(X_test)


[[   0    0    0 ...   14    6  717]
 [   0    0    0 ...  125    4 3077]
 [  33    6   58 ...    9   57  975]
 ...
 [   0    0    0 ...   21  846 5518]
 [   0    0    0 ... 2302    7  470]
 [   0    0    0 ...   34 2005 2643]]
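
The padding behaviour seen in the two outputs above can be sketched in a few lines of plain Python. This is a simplified stand-in for pad_sequences with its default settings (zeros added at the front, long sequences cut from the front), not the Keras code itself; the helper name "pad_pre" is made up for this example.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long.

    Assumes maxlen is a positive integer.
    """
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 8, 13], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 13]
print(long_)  # [3, 4, 5, 6, 7]
```

The third row of the X_test output starts with non-zero values, which suggests that this review had at least 500 tokens and therefore received no padding.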


Below are the first ten values of "y_train". As mentioned earlier, 1 represents a positive review and 0 represents a negative one.

In [8]:
print(y_train[:10])


[1 0 0 1 0 0 1 0 1 0]


## Modelling

We are going to try 2 different models and approaches. The first has fewer LSTM units and uses the "rmsprop" optimizer. The second has more LSTM units, uses the "Adamax" optimizer and trains with a larger batch size. Both are trained for the same number of epochs.

In [9]:
embedding_vector_length = 64

model = Sequential()
model.add(Embedding(max_words, embedding_vector_length, input_length=maxlen))
model.add(LSTM(256))
model.add(Dense(1, activation='sigmoid'))


In [10]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
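
For intuition about the "binary_crossentropy" loss used in the compile step, here is a minimal pure-Python version of the formula. It is a sketch only: the function, the clipping value "eps" and the example labels and probabilities are made up for illustration and are not taken from the training run.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and sigmoid outputs
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.5]
loss = binary_crossentropy(labels, probs)

# A confident wrong prediction costs far more than a confident right one
loss_right = binary_crossentropy([1], [0.99])
loss_wrong = binary_crossentropy([1], [0.01])
print(loss, loss_right, loss_wrong)
```

The loss rewards probabilities that are close to the true label, which is why it pairs naturally with the single sigmoid output of the model.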


In [11]:
model.fit(X_train, y_train,
          batch_size=256,
          epochs=10,
          validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f3d6bc546a0>

In [12]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))


Model Accuracy: 86.87%


In [20]:
embedding_vector_length = 64

model2 = Sequential()
model2.add(Embedding(max_words, embedding_vector_length, input_length=maxlen))
model2.add(LSTM(512))
model2.add(Dense(1, activation='sigmoid'))
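
The two architectures differ only in the width of the LSTM layer, and a quick parameter count shows how much that changes the model size. The sketch below uses the standard formulas for an embedding, an LSTM with bias terms and a dense layer; the helper names are made up, and the result should match what "model.summary()" reports only if the layers are used with their default settings.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

def total_params(vocab_size, embed_dim, lstm_units):
    return (embedding_params(vocab_size, embed_dim)
            + lstm_params(embed_dim, lstm_units)
            + dense_params(lstm_units, 1))

model_1_params = total_params(10240, 64, 256)
model_2_params = total_params(10240, 64, 512)
print(model_1_params)  # 984321
print(model_2_params)  # 1837569
```

Doubling the LSTM width from 256 to 512 units nearly doubles the total parameter count, because the recurrent weights grow with the square of the number of units.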


In [21]:
model2.compile(optimizer='Adamax',
               loss='binary_crossentropy',
               metrics=['acc'])


In [22]:
model2.fit(X_train,
           y_train,
           batch_size=512,
           epochs=10,
           validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f3d6480e6a0>
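
Because the two models use different batch sizes, they also perform a different number of weight updates over the same 10 epochs. The arithmetic below is a sketch that assumes the standard Keras IMDB training split of 25,000 reviews; the helper name is made up.

```python
import math

TRAIN_SAMPLES = 25_000  # assumed size of the Keras IMDB training split
EPOCHS = 10

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

steps_model_1 = steps_per_epoch(TRAIN_SAMPLES, 256)
steps_model_2 = steps_per_epoch(TRAIN_SAMPLES, 512)

print(steps_model_1, steps_model_1 * EPOCHS)  # 98 980
print(steps_model_2, steps_model_2 * EPOCHS)  # 49 490
```

The second model therefore gets half as many updates as the first, so the comparison between the two changes the optimizer, the layer width and the batch size at the same time.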

The second model is the winner, although only by a small margin: 87.40% test accuracy against 86.87% for the first model. We will save and test it in the following steps.

In [19]:
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))


Model Accuracy: 87.40%


In [23]:
model2.save('sentiment_analysis.h5')
saved_model = keras.models.load_model('sentiment_analysis.h5')


Let's look at the first 5 predicted probabilities alongside their true labels.

In [28]:
predictions = saved_model.predict(X_test)

# Print each predicted probability next to its true label
for i in range(5):
    print(predictions[i], y_test[i])


[0.0390748] 0
[0.9862282] 1
[0.9715433] 1
[0.4201262] 0
[0.99675435] 1

Rounding each probability at 0.5 gives 0, 1, 1, 0, 1, which matches the true labels below, so the model predicted all five correctly.

In [27]:
print(y_test[:5])

[0 1 1 0 1]
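
The comparison can also be done in code rather than by eye. The sketch below copies the five probabilities and labels printed above and applies a 0.5 threshold, which is the usual cut-off for a sigmoid output; the variable names are chosen for this example.

```python
# Values copied from the prediction output above
probabilities = [0.0390748, 0.9862282, 0.9715433, 0.4201262, 0.99675435]
labels = [0, 1, 1, 0, 1]

THRESHOLD = 0.5  # probabilities above this count as a positive review

predicted = [1 if p > THRESHOLD else 0 for p in probabilities]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(predicted)  # [0, 1, 1, 0, 1]
print(accuracy)   # 1.0
```

Note that the fourth review (0.42) sits fairly close to the threshold, so the model is much less certain about it than about the other four.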


## Final evaluation of the model
All five answers above are correct, but five reviews are only a tiny sample of the test set; they simply show that the model works for the most part. All in all, the model could perhaps be improved with more epochs or more layers. That would need more compute, which can be expensive, though larger companies that can afford it train much bigger models of this kind and reach higher accuracy.

In [25]:

scores = saved_model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))


Accuracy: 87.35%
