CAPSTONE PROJECT III: Recurrent Neural Networks for sentiment analysis

In [1]:
# Import dependencies
import numpy as np
import pandas as pd 
import tensorflow as tf
from tensorflow import keras
from keras import layers
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Flatten, Embedding

This project uses the Large Movie Review Dataset of 50,000 reviews from IMDb. It contains an equal number of positive
and negative reviews. Keras ships with this dataset built in, and we import it below.

In [2]:
# Get the dataset
from keras.datasets import imdb

In [3]:
# Set the number of words and the maximum length of a review
num_words = 5000
maxlen = 500

In [4]:
# Load the data
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [5]:
# View a review and its label
print('+++++++++Review+++++++++')
print(X_train[6])
print('+++++++++Labels+++++++++')
print(y_train[6])

+++++++++Review+++++++++
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
+++++++++Labels+++++++++
1


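The data is preprocessed: each word has been mapped to an integer according to its frequency rank, so smaller integers stand for more common words. Three values are reserved: 0 is used for padding, 1 marks the start of a review and 2 replaces any word that falls outside the num_words vocabulary (which is why 2 appears so often in the review above). The label is either 0 (negative) or 1 (positive).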
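
To make the encoding concrete, here is a minimal sketch of how a review could be decoded back into words. It assumes the defaults of imdb.load_data (1 marks the start of a review, 2 replaces out-of-vocabulary words, and word ranks are shifted up by 3). The helper decode_review and the toy word index are hypothetical; in the notebook the dictionary would come from imdb.get_word_index().

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integers back to words (hypothetical helper)."""
    # Invert the word -> rank dictionary and apply the offset used by load_data
    index_word = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved values: 0 = padding, 1 = start of review, 2 = unknown word
    index_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})
    return " ".join(index_word.get(i, "<unk>") for i in encoded)


# Toy word index for illustration only, not the real IMDb dictionary
toy_word_index = {"the": 1, "and": 2, "film": 19}
decoded = decode_review([0, 1, 4, 22, 2, 5], toy_word_index)
print(decoded)  # <pad> <start> the film <unk> and
```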

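All reviews need to be the same length, so reviews shorter than 500 words are padded with zeros and longer reviews are truncated to 500 words. By default pad_sequences adds the zeros at the front of a short review and drops words from the front of a long one.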

In [6]:
# Process the data to ensure all inputs are 500 values using pad_sequences
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
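
As a rough illustration of what this step does to a single review, the sketch below mimics the default behaviour of pad_sequences (padding and truncating at the front) in plain Python. The helper pad_or_truncate is hypothetical and only handles one sequence; it is not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Pure-Python sketch of pre-padding and pre-truncation (hypothetical helper)."""
    if len(seq) >= maxlen:
        # Too long: keep only the last maxlen values
        return list(seq[len(seq) - maxlen:])
    # Too short: put the padding value in front
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = pad_or_truncate([1, 14, 22], maxlen=5)
long_review = pad_or_truncate([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short_review)  # [0, 0, 1, 14, 22]
print(long_review)   # [22, 16, 43, 530, 973]
```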

BUILDING THE MODEL: 
The Embedding layer is the first layer of the network. Its first argument is input_dim, the size of the vocabulary in the text data, which is 5000 here. The second is output_dim, the size of the vector that the layer outputs for each word. We use 32, as it was found to work best for this problem. The last argument is input_length, the length of the input sequences, which is 500 for every review after padding.

In [7]:
# Build the model
model = Sequential()

# The embedding layer (input dim=5000, output dim=32 and input length=500)
model.add(Embedding(input_dim=num_words, output_dim=32, input_length=maxlen))
model.add(Dropout(0.4))

# The LSTM layer
model.add(LSTM(64))

# The dense layer
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 32)           160000    
_________________________________________________________________
dropout (Dropout)            (None, 500, 32)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 184,897
Trainable params: 184,897
Non-trainable params: 0
_________________________________________________________________
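
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above; the variable names are only for illustration.

```python
vocab_size = 5000   # input_dim of the Embedding layer
embed_dim = 32      # output_dim of the Embedding layer
lstm_units = 64     # units in the LSTM layer

# Embedding: one vector of embed_dim weights per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 weight sets (input, forget and output gates plus the cell candidate),
# each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 160000 24832 65 184897
```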


In [8]:
# Fit the model using 4 epochs, a batch size of 100 and a verbose value of 2
model.fit(X_train, y_train, epochs=4, batch_size=100, verbose=2)

Epoch 1/4
250/250 - 96s - loss: 0.4768 - accuracy: 0.7729
Epoch 2/4
250/250 - 96s - loss: 0.2826 - accuracy: 0.8891
Epoch 3/4
250/250 - 95s - loss: 0.2440 - accuracy: 0.9048
Epoch 4/4
250/250 - 95s - loss: 0.2239 - accuracy: 0.9140


<tensorflow.python.keras.callbacks.History at 0x7ff66720da90>

In [9]:
# Test the model
scores = model.evaluate(X_test, y_test, verbose=2)
print("Accuracy: ", (scores[1]*100),"%")

782/782 - 36s - loss: 0.2929 - accuracy: 0.8774
Accuracy:  87.73999810218811 %
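
The step counts in the logs (250 during training, 782 during evaluation) follow from the batch sizes. The short check below assumes the standard 25,000/25,000 train/test split of this dataset and the default batch size of 32 that evaluate uses when none is given.

```python
import math

n_train = 25000  # training reviews
n_test = 25000   # test reviews

train_steps = math.ceil(n_train / 100)  # batch_size=100 passed to fit
test_steps = math.ceil(n_test / 32)     # evaluate falls back to batch_size=32
print(train_steps, test_steps)  # 250 782
```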


PREDICT SOMETHING: 
Here we define a function that takes in text, cleans it and converts it to integers the model can interpret.
We then use the outcome to say whether or not the review was positive.

In [10]:
import re

def clean(txt):
  # Remove punctuation, return a list of the words
  list_words = []
  txt = txt.split()

  for x in txt:
    z = re.findall("[a-zA-Z]", x)
    list_words.append(''.join(z).lower())

  return list_words

clean('This is a sentence!')


['this', 'is', 'a', 'sentence']

In [11]:
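def predict_sentiment(sentence, model):
  # Generate a sentiment value for a review
  word_indexes = [1]  # 1 marks the start of a review, as in imdb.load_data
  list_words = clean(sentence)
  print(list_words)

  # Load the word index once rather than once per word
  word_index = imdb.get_word_index()

  for word in list_words:
    # Convert words to integers: load_data shifts every index by 3 and
    # replaces unknown or out-of-vocabulary words with 2
    index = word_index.get(word)
    if index is None or index + 3 >= num_words:
      word_indexes.append(2)
    else:
      word_indexes.append(index + 3)

  # Ensure all inputs are 500 values using pad_sequences
  padded_indexes = sequence.pad_sequences([word_indexes], maxlen=maxlen)

  # Output the outcome
  outcome = model.predict(padded_indexes)
  return outcome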

In [12]:
# Hypothetical review and outcome
sentiment = predict_sentiment("The film was unoriginal and boring. I would rather watch something else.", model)

if sentiment >= 0.5:
  print("The review was positive")
else:
  print("The review was negative")


['the', 'film', 'was', 'unoriginal', 'and', 'boring', 'i', 'would', 'rather', 'watch', 'something', 'else']
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
The review was negative
