## Using a recurrent neural network in Keras to classify movie reviews

In this project, sentiment analysis (NLP) is performed with a recurrent neural network in Keras to classify a movie review as either positive or negative. The task has the following steps:
1. Get the dataset
2. Preprocess the data
3. Build the model
4. Train the model
5. Test the model
6. Make a prediction

Keras is used to load its built-in IMDb movie review dataset. The dataset contains 50,000 reviews from IMDb, split evenly between positive and negative reviews. A negative review has a score ≤ 4 out of 10 (label = 0), and a positive review has a score ≥ 7 out of 10 (label = 1). Neutral reviews are not included in the dataset.
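The labelling rule described above can be written as a tiny function. This is only an illustrative sketch: `label_from_score` is a hypothetical helper, not part of Keras, and the dataset already ships with the labels applied.

```python
def label_from_score(score):
    """Map a 1-10 IMDb rating to the binary label described above.

    Returns 0 for a negative review (score <= 4), 1 for a positive review
    (score >= 7), and None for neutral scores, which the dataset leaves out.
    """
    if score <= 4:
        return 0
    if score >= 7:
        return 1
    return None


print([label_from_score(s) for s in (2, 5, 9)])
```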

In [20]:
# Import libraries
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math

from keras.datasets import imdb
from keras.utils import to_categorical
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import LSTM, Dense, Dropout, Embedding

from sklearn.metrics import mean_squared_error
import string


## 1 Get the Dataset

In [21]:
# Limit the vocabulary to the most frequent words: the dataset is large and
# words that aren't commonly used aren't necessary
max_unique_words = 5000

# Load data
dataset = imdb.load_data(num_words=max_unique_words)
(X_train, y_train), (X_test, y_test) = dataset
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

# data size
print("Data size: ")
print(X.shape)
print(y.shape)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Data size: 
(50000,)
(50000,)


In [24]:
# Look at data
print('---Review---')
print(X_train[6])
print('---Label---')
print(y_train[6])


---Review---
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---Label---
1


In [23]:
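# Look at data by mapping back to words
# Note: load_data() shifts every word index by 3 by default (index_from=3;
# 0 = padding, 1 = start marker, 2 = out-of-vocabulary), so this direct lookup
# does not recover the original wording of the review.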
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
---review with words---
['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', "god's", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'and', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'and', 'things', 'is', 'far', 'this', 'make', 'mistakes', 'and', 'was', "couldn't", 'of', 'few', 'br', 'of', 'you', 'to', "don't", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'and', 'movies', 'get', 'are', 'and', 'br', 'yes'
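The decoded text above is hard to read because `imdb.load_data()` reserves ids 0, 1 and 2 (padding, start marker, out-of-vocabulary) and, with its default `index_from=3`, shifts every word index by 3. A minimal sketch of a decoder that accounts for this follows. `decode_review` and the toy word index are hypothetical names and values used only for illustration.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer ids back into a readable string.

    Assumes the ids were produced by load_data() with its default
    start_char=1, oov_char=2 and index_from=3.
    """
    # Shift the word index so it lines up with the encoded reviews
    id_to_word = {i + index_from: w for w, i in word_index.items()}
    # Reserved ids that do not correspond to real words
    id_to_word.update({0: '<pad>', 1: '<start>', 2: '<oov>'})
    return ' '.join(id_to_word.get(i, '<oov>') for i in encoded)


# Toy word index standing in for imdb.get_word_index()
toy_index = {'the': 1, 'movie': 17, 'great': 84}
print(decode_review([1, 4, 20, 2, 87], toy_index))
# <start> the movie <oov> great
```

With the real data, the call would be `decode_review(X_train[6], word2id)`.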

## 2 Preprocessing the Data

In [25]:
# Trimming length to 500 or padding if less than 500
max_num_words = 500
X_train = sequence.pad_sequences(X_train,
                                 maxlen=max_num_words)
X_test = sequence.pad_sequences(X_test,
                                maxlen=max_num_words)

# Look at review length to confirm all at 500
print("---Review length---")
result = [len(x) for x in X_train]
print("Mean %.2f words, standard dev of word length %.2f in train sample" %
      (np.mean(result), np.std(result)))
result = [len(i) for i in X_test]
print("Mean %.2f words, standard dev of word length %.2f in test sample" %
      (np.mean(result), np.std(result)))

X_train[6]

---Review length---
Mean 500.00 words, standard dev of word length 0.00 in train sample
Mean 500.00 words, standard dev of word length 0.00 in test sample


array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   
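The leading zeros in the array above come from the default behaviour of `pad_sequences`, which pads at the front (`padding='pre'`) and also truncates from the front (`truncating='pre'`). The pure-Python sketch below mimics that behaviour for a single sequence. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Because truncation also happens at the front, reviews longer than 500 tokens keep their final 500 tokens rather than their opening ones.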

In [26]:
# Checking test and train data
# data size
print("Data size: ")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


Data size: 
(25000, 500)
(25000,)
(25000, 500)
(25000,)


## 3 Build the Model

In [27]:
# Create a sequential model
model = Sequential()

# Embedding layer (first layer): maps each word id to a 32-dimensional vector
model.add(Embedding(max_unique_words+1, 32, input_length=max_num_words))
# LSTM layer
model.add(LSTM(128))
# Output layer
model.add(Dense(1, activation='sigmoid'))

model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 32)           160032    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               82432     
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 242,593
Trainable params: 242,593
Non-trainable params: 0
_________________________________________________________________
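The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The variable names are chosen here for illustration.

```python
vocab_rows = 5000 + 1  # max_unique_words + 1, as passed to Embedding
embed_dim = 32         # embedding output size
lstm_units = 128       # LSTM hidden size

# One 32-dimensional vector per vocabulary row
embedding_params = vocab_rows * embed_dim

# Four gates, each with (input + recurrent) weights plus a bias vector
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# One output unit: 128 weights plus 1 bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160032 82432 129 242593
```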


## 4 Train the model

In [28]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

# Fit the model
num_epochs = 5
batch_size = 100

model.fit(X_train, y_train, epochs=num_epochs, batch_size=batch_size)

print('Training complete')


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training complete


## 5 Test the Model

In [29]:
# Final evaluation of the model
test_loss, test_acc = model.evaluate(X_test, y_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {} %'.format(test_acc * 100))


Test Loss: 0.31646719574928284
Test Accuracy: 87.5440001487732 %
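To make the two reported numbers concrete, the sketch below computes binary cross-entropy and accuracy for a handful of made-up predictions. The functions and the sample values are hypothetical and only approximate what Keras does internally. In particular, the clipping constant `eps` and the 0.5 threshold are assumptions of this sketch.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


# Made-up labels and predicted probabilities
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(binary_crossentropy(y_true, y_prob))
print(accuracy(y_true, y_prob))  # 0.5
```

A confident wrong prediction raises the loss sharply even though it counts as only one miss for accuracy, which is why both numbers are worth reporting.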


## 6 Prediction

In [30]:
# Predict on a sample text with padding.


def sample_predict(sample_sentence):
    """
    Convert a sentence into lowercase tokens with punctuation removed, map
    each token to its integer using the IMDb word-to-integer index, pad the
    sequence, and predict whether the review is positive or negative.
    """
    remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))

    # Tokenize: strip punctuation and lowercase each whitespace-separated word
    word_list = [word.translate(remove).lower()
                 for word in sample_sentence.split()]

    # Map each token to its integer id. This uses the raw word index (without
    # the offset of 3 that imdb.load_data() applies by default) and raises
    # KeyError for words that are not in the index.
    converted_tokens = [word2id[word] for word in word_list]

    converted_tokens_padded = sequence.pad_sequences([converted_tokens],
                                                     maxlen=max_num_words)
    predict = model.predict(converted_tokens_padded)
    # predict_classes() is deprecated; threshold the sigmoid output instead
    predict_class = (predict > 0.5).astype("int32")

    sentiment = 'positive' if predict_class == 1 else 'negative'
    return (f'label is {predict_class}, thus the review is {sentiment}. '
            f'The prediction value is {predict}')

sample_sentence = 'The movie was impressive. I really liked the acting.'
print(sample_predict(sample_sentence))


label is [[1]], thus the review is positive. The prediction value is [[0.8957933]]
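For new text to be encoded the same way as the training data, it helps to mirror what `imdb.load_data()` does by default: prepend the start marker (1), shift each word index by 3, and replace words that fall outside the vocabulary with the out-of-vocabulary id (2). The following is a minimal sketch of that idea. `encode_sentence` and the toy word index are hypothetical and used for illustration only.

```python
import string


def encode_sentence(sentence, word_index, num_words=5000,
                    index_from=3, start_char=1, oov_char=2):
    """Encode raw text the way load_data() encodes reviews (default settings)."""
    strip = str.maketrans('', '', string.punctuation)
    ids = [start_char]
    for raw in sentence.split():
        word = raw.translate(strip).lower()
        if not word:
            continue  # token was pure punctuation
        idx = word_index.get(word)
        if idx is None or idx + index_from >= num_words:
            ids.append(oov_char)  # unknown or too rare for the vocabulary
        else:
            ids.append(idx + index_from)
    return ids


# Toy word index standing in for imdb.get_word_index()
toy_index = {'the': 1, 'movie': 17, 'was': 13, 'impressive': 1156,
             'zyzzyva': 80000}
print(encode_sentence('The movie was impressive, zyzzyva! Unknownword.',
                      toy_index))
# [1, 4, 20, 16, 1159, 2, 2]
```

The resulting list can then be padded and passed to the model in the same way as `converted_tokens` in the cell above. Unlike the direct `word2id[word]` lookup, unknown words become the out-of-vocabulary id instead of raising `KeyError`.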
