In [1]:
# DS LT3T20
# Capstone III 

# IMDB Sentiment Analysis 

# NLP using RNN

# Sentiment Analysis using RNN

Sentiment analysis aims to determine the attitude, or sentiment. For example, a speaker or writer with respect to a document, interaction, or event. It is a natural language processing problem in which text needs to be understood to predict the underlying intent.

In [2]:
# Import libraries
# Keras LSTM IBDM dataset
from keras.datasets import imdb

Dataset contains a collection of 50,000 reviews from IMDB and contains an even number of positive and negative reviews. A negative review has a score of ≤ 4 out of 10 and a positive review has a score of ≥ 7 out of 10. Neutral reviews are not included in the dataset.

The dataset is divided into training and test sets. The training set is the same 25,000 labelled reviews.

# Pre Processing 

1. Set the vocabulary size and load in training and test data

In [3]:
# Set vocab size
vocabulary_size = 5000

#  Create Train Test split
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print("Loaded dataset with {} training samples, {} test samples".format(len(X_train), len(X_test)))

Loaded dataset with 25000 training samples, 25000 test samples


All words have been mapped to integers. The reviews are preprocessed and each one is encoded as a sequence of word indexes in the form of integers.

2. Inspect a sample review and its label

In [4]:
# Review in numbers
print("---Review---")
print(X_train[6])

print("---Label---")
print(y_train[6])

---Review---
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---Label---
1


The integers represent the words sorted by their frequency (most used word). The number 1 represents the start marker, 2 is for an unknown word and 0 is for padding. The label is also an integer (0 for negative, 1 for positive)

3. Use the dictionary to map the review back to the original words

In [5]:
# Review in words
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}

print("---Review with words---")
print([id2word.get(i, ' ') for i in X_train[6]])
print("---Label---")
print(y_train[6])

---Review with words---
['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', "god's", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'and', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'and', 'things', 'is', 'far', 'this', 'make', 'mistakes', 'and', 'was', "couldn't", 'of', 'few', 'br', 'of', 'you', 'to', "don't", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'and', 'movies', 'get', 'are', 'and', 'br', 'yes', 'female', 'just', 'its', 'because', 'many', 'br', 'of', 'overly', 'to', 'descent', 'people', 'time', 

In order to feed this data into our RNN, all input documents must have the same length. Since the reviews differ heavily in terms of lengths we need to trim each review to its first 500 words. If reviews are shorter than 500 words we will need to pad them with zeros.

4. Pad Sequences

Limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value.

4a. Maximum review length

In [6]:
print("Maximum review length: {}".format(
len(max((X_train + X_test), key=len))))

id2word = {i: word for word, i in word2id.items()}

id2word

Maximum review length: 2697


{34701: 'fawn',
 52006: 'tsukino',
 52007: 'nunnery',
 16816: 'sonja',
 63951: 'vani',
 1408: 'woods',
 16115: 'spiders',
 2345: 'hanging',
 2289: 'woody',
 52008: 'trawling',
 52009: "hold's",
 11307: 'comically',
 40830: 'localized',
 30568: 'disobeying',
 52010: "'royale",
 40831: "harpo's",
 52011: 'canet',
 19313: 'aileen',
 52012: 'acurately',
 52013: "diplomat's",
 25242: 'rickman',
 6746: 'arranged',
 52014: 'rumbustious',
 52015: 'familiarness',
 52016: "spider'",
 68804: 'hahahah',
 52017: "wood'",
 40833: 'transvestism',
 34702: "hangin'",
 2338: 'bringing',
 40834: 'seamier',
 34703: 'wooded',
 52018: 'bravora',
 16817: 'grueling',
 1636: 'wooden',
 16818: 'wednesday',
 52019: "'prix",
 34704: 'altagracia',
 52020: 'circuitry',
 11585: 'crotch',
 57766: 'busybody',
 52021: "tart'n'tangy",
 14129: 'burgade',
 52023: 'thrace',
 11038: "tom's",
 52025: 'snuggles',
 29114: 'francesco',
 52027: 'complainers',
 52125: 'templarios',
 40835: '272',
 52028: '273',
 52130: 'zaniacs',

4b. Minimum review length

In [7]:
print("Minimum review length: {}".format(
len(min((X_test + X_test), key=len))))

Minimum review length: 14


4c. Pad sequences

In [8]:
# Pad_sequences() function in Keras, set max_words to 500
from keras.preprocessing import sequence

max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

# Build a RNN Model for Sentiment Analysis

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. 
Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.

The position of a word within the vector space is learned from the text and is based on the words that surround the word when it is used. The position of a word in the learned vector space is referred to as its embedding.
Keras offers an embedding layer , used for neural networks on text data, requires that the input data be integer encoded, so each word is represented by a unique integer (Brownlee, 2017)

5. Define an RNN model for sentiment analysis

In [9]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation="sigmoid"))

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 32)           160000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               53200     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


RNN model with 1 embedding, 1 LSTM and 1 dense layers. 213,301 parameters in total need to be trained

# Train the model

6. Train and evaluate the model

Specify the loss function and optimizer to use while training

In [10]:
model.compile(loss="binary_crossentropy", 
             optimizer="adam", 
             metrics=["accuracy"])

In [11]:
batch_size = 32
num_epochs = 10

X_valid, y_valid = X_train[ :batch_size], y_train[ :batch_size]
X_train2, y_train2 = X_train[batch_size: ], y_train[batch_size: ]

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1dfabd7e088>

# Test the Model

7. Evaluate the performance on test data

In [39]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: ", scores[1])

Test accuracy:  0.8537999987602234


#  Making Predictions

A prediction is an array of 10 numbers. They represent the model's "confidence".

In [47]:
# Predict test data
model.predict(X_test)

array([[0.03026474],
       [0.99997437],
       [0.1591807 ],
       ...,
       [0.21113205],
       [0.68249816],
       [0.9973984 ]], dtype=float32)

# Application

8. Translate the sentence into the relevant integers and pad. This will allow us to put it into our model and see whether it predicts if we will like or dislike the movie.

In [48]:
import string

def pre_process(review):
  return review.translate(str.maketrans(" ", " ", string.punctuation)).lower()

In [49]:
# Predict sentiment function
word2id = {i: word for i, word in word2id.items()}

def predict_sentiment(review):
  sentiment = 0

  review = pre_process(review)

  #  Dictionary to map words back into
  # id2word

  list_encoded = []

  for word in review.split():
    list_encoded.append(word2id.get(word))

  max_words = 500
  list_encoded = sequence.pad_sequences([list_encoded], maxlen=max_words)
  
  sentiment = model.predict(list_encoded)
  return sentiment

In [59]:
# Using a review to classify
value = predict_sentiment("This movie is not a good movie")

if value >= 0.5:
  print("The movie was classified as positive")

else:
  print("The movie was classified as negative")

The movie was classified as negative
