# Bag of Words Meets Bag of Popcorn: Sentiment Analysis using LSTM

In this notebook, we will compete in the <a href="https://www.kaggle.com/c/word2vec-nlp-tutorial">Kaggle</a> competition using a special type of recurrent network architecture called the Long Short-Term Memory (LSTM) network.

## Quick Introduction to RNNs

Most of the information used in this introduction was found <a href="https://ayearofai.com/rohan-lenny-3-recurrent-neural-networks-10300100899b">here.</a>

A recurrent neural network (RNN) is a special kind of neural network that can perform tasks that a traditional fully connected NN or CNN cannot easily handle. One main advantage of RNNs is that they can take inputs of variable size. Another is that they have 'memory', meaning that their outputs depend on past inputs. For these reasons, RNNs are well suited to NLP tasks, since the inputs often vary in length and the 'memory' helps keep track of a sentence's meaning.
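
To make the two properties above concrete, here is a minimal sketch of a one-unit tanh RNN in plain Python. The function name and the weight values (`w_in`, `w_rec`) are made up for illustration and are not taken from any trained model.

```python
import math

def rnn_final_state(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit tanh RNN over a sequence and return the final hidden state."""
    h = 0.0  # the hidden state ("memory") starts at zero
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        h = math.tanh(w_in * x + w_rec * h + bias)
    return h

# Sequences of different lengths pass through the same weights.
shorter = rnn_final_state([1.0, -1.0])
longer = rnn_final_state([1.0, 1.0, 1.0, -1.0])

# Both sequences end with the same input (-1.0), yet the final states
# differ because of what came before.
print(shorter, longer)
```

The same loop handles a sequence of any length, and the final state carries a trace of the earlier inputs, which is the 'memory' described above.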

RNNs can be arranged in several input/output configurations. I will describe each one with an NLP example:
- One to One: An example is determining the sentiment of a single word (good = 1, bad = 0, horrible = 0, great = 1)
- One to Many: An example would be getting the definition of one word (Tiger = An animal belonging to the feline family)
- Many to One: An example is finding the word that fits a certain definition (To be heavily intoxicated = drunk)
- Many to Many: An example would be translating English to Chinese

<img src="./Pics/RNN_Models.jpg">

Despite these advantages, plain RNNs are rarely used in practice. They suffer heavily from the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing gradient problem</a>, which makes it very hard for them to learn dependencies that span many time steps. For this reason, people often use LSTMs, whose gating mechanism largely mitigates this issue.
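
As a rough illustration of the vanishing gradient problem, the sketch below tracks how much the final hidden state of a one-unit tanh RNN depends on its initial state. The recurrent weight and the constant input are arbitrary values chosen for the example.

```python
import math

def gradient_wrt_initial_state(steps, w_rec=0.9, x=0.5):
    """Return d h_T / d h_0 for a one-unit tanh RNN fed a constant input."""
    h = 0.0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w_rec * h + x)
        # Chain rule: d h_t / d h_(t-1) = w_rec * (1 - h_t^2)
        grad *= w_rec * (1.0 - h * h)
    return grad

for steps in (1, 5, 20, 50):
    print(steps, gradient_wrt_initial_state(steps))
```

Every factor in the product is smaller than 1 here, so the gradient shrinks roughly exponentially with the number of time steps, and early words end up with almost no influence on the weight updates.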

So now that we have a little background on what exactly an RNN, and more specifically an LSTM, is, let us implement a sentiment analysis classifier using an LSTM.

In [71]:
#Start by importing libraries
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences

from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

In [72]:
#Read train and test data
train = pd.read_csv("./Data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("./Data/testData.tsv", header=0, delimiter="\t", quoting=3)

## Step 0: Preprocessing

The first step is to clean up the text. For this challenge, we will remove the HTML tags, remove numbers and punctuation, and remove all the stop words.

In [73]:
def preprocessing(raw_text):
    #Remove the HTML Tags
    text = BeautifulSoup(raw_text, "html.parser").get_text()
    
    #Remove Numbers and Punctuation
    text = re.sub("[^a-zA-Z]", " ",text)
    
    #Lowercase all words and then split them into list of words
    text = text.lower().split()
    
    #Create a set of stopwords to search
    stops = set(stopwords.words("english"))
    
    #Remove the stopwords
    text = [w for w in text if w not in stops]
    text = " ".join(text)
    
    return text

In [74]:
print(train['review'][0])
print(preprocessing(train['review'][0]))

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [75]:
#Preprocess all of the training data
cleaned_text = []
for i in range(train['review'].size):
    cleaned_text.append(preprocessing(train['review'][i]))
    if i % 5000 == 0 and i != 0:
        print("Completed {} Reviews".format(i))
print("Finished Preprocessing")

Completed 5000 Reviews
Completed 10000 Reviews
Completed 15000 Reviews
Completed 20000 Reviews
Finished Preprocessing


## Step 1: Word Vectors

Word vectors are a way of representing a single word as an array of numbers. A word vector attempts to capture the meaning behind the word, so words with similar meanings will have word vectors that are close together.
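
"Close together" is commonly measured with cosine similarity. The sketch below uses tiny 3-dimensional vectors invented for illustration (the real vectors in this notebook have 300 dimensions), so the numbers mean nothing beyond showing the computation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example only.
toy_vectors = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "bad":   [-0.7, -0.9, 0.1],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_bad = cosine_similarity(toy_vectors["good"], toy_vectors["bad"])
print(sim_good_great, sim_good_bad)
```

On a trained gensim model, the same kind of comparison is available through `model.wv.similarity` and `model.wv.most_similar`.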

In [76]:
#Load the word vectors that we trained beforehand
from gensim.models import Word2Vec

model = Word2Vec.load("300features_40minwords_10context")
print(model.wv.vectors.shape)

(16490, 300)


We previously trained the word vectors on the unlabeled dataset provided by the Kaggle competition and saved the resulting model. Each word is represented by a 300-dimensional vector, and there are 16490 word vectors in our model. Below we list the words in the model's vocabulary.

In [77]:
model.wv.index2word

['the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'it',
 'in',
 'i',
 'this',
 'that',
 's',
 'was',
 'as',
 'with',
 'for',
 'movie',
 'but',
 'film',
 'you',
 't',
 'on',
 'not',
 'he',
 'are',
 'his',
 'have',
 'be',
 'one',
 'all',
 'at',
 'they',
 'by',
 'who',
 'an',
 'from',
 'so',
 'like',
 'there',
 'her',
 'or',
 'just',
 'about',
 'out',
 'has',
 'if',
 'what',
 'some',
 'good',
 'can',
 'more',
 'when',
 'very',
 'she',
 'up',
 'no',
 'time',
 'even',
 'would',
 'my',
 'which',
 'their',
 'story',
 'only',
 'really',
 'see',
 'had',
 'were',
 'well',
 'we',
 'me',
 'than',
 'much',
 'bad',
 'get',
 'been',
 'people',
 'also',
 'into',
 'do',
 'great',
 'other',
 'will',
 'first',
 'because',
 'him',
 'how',
 'most',
 'don',
 'them',
 'made',
 'its',
 'make',
 'then',
 'way',
 'could',
 'too',
 'movies',
 'after',
 'any',
 'characters',
 'character',
 'think',
 'films',
 'two',
 'watch',
 'being',
 'many',
 'plot',
 'seen',
 'never',
 'where',
 'love',
 'life',
 'little',
 'acting

## Step 2: Embedding Layer

Now that we have our word vectors, we need a way to map our words to the corresponding word vectors so that our LSTM can do the computation. This is where the embedding layer comes in.

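First we have to tokenize our reviews, representing each word as an integer 'token'. The Keras Tokenizer counts how often each word occurs in our dataset, much like Bag-of-Words, and assigns lower indices to more frequent words. With num_words=6000, only the most frequent words are kept when the reviews are converted to sequences; rarer words are dropped.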

In [78]:
#Find all of the individual words in the dataset (Tokenization)
tokenizer = Tokenizer(num_words=6000)
tokenizer.fit_on_texts(cleaned_text)
sequences = tokenizer.texts_to_sequences(cleaned_text)
word_index = tokenizer.word_index

Here, the variable **sequences** is a list of lists in which each inner list is the numerical representation of one cleaned review. The mapping from word to number is stored in **word_index**.
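
To make the mapping concrete, here is a pure-Python sketch of roughly what the tokenizer does. The helper names `fit_word_index` and `to_sequences` are hypothetical, and the sketch ignores the filtering and lowercasing options of the real Keras `Tokenizer`.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left free so it can be used for padding later.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown or too-rare words."""
    result = []
    for text in texts:
        seq = []
        for word in text.split():
            idx = word_index.get(word)
            # Like Keras, keep only indices strictly below num_words.
            if idx is not None and (num_words is None or idx < num_words):
                seq.append(idx)
        result.append(seq)
    return result

toy_reviews = ["great movie great acting", "bad movie", "great"]
toy_index = fit_word_index(toy_reviews)
toy_sequences = to_sequences(toy_reviews, toy_index, num_words=3)
print(toy_index)
print(toy_sequences)
```

Note that the word index still contains every word that was seen, while the sequences only contain the most frequent ones. The same holds in this notebook, where **word_index** has far more entries than the 6000-word limit.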

In [52]:
#Find the maximum number of words in a review
max_sequence = 0
for i in range(len(sequences)):
    max_sequence = max(max_sequence,len(sequences[i]))
print("Max number of words in a review is {}".format(max_sequence))

Max number of words in a review is 986


In [41]:
print("Tokenizer found {} unique words".format(len(word_index)))

Tokenizer found 74065 unique words


So, after tokenization, we can see that we have 74065 unique words in our entire dataset and the maximum number of words within a review is 986 words.

Since all of the reviews are of variable length, we must zero pad the sequences arrays so that they are all of the same length.

In [53]:
#Left pad (or truncate) every sequence with zeros so that all have length 1000
sequences = pad_sequences(sequences, maxlen=1000)
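
By default, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, drops items from the front. A minimal pure-Python sketch of that behaviour for a single sequence (the helper name `left_pad` is hypothetical):

```python
def left_pad(seq, maxlen, pad_value=0):
    """Pad on the left with pad_value; if too long, keep the last maxlen items."""
    trimmed = seq[-maxlen:]
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

padded_short = left_pad([5, 8, 2], maxlen=5)
padded_long = left_pad([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(padded_short)  # [0, 0, 5, 8, 2]
print(padded_long)   # [3, 4, 5, 6, 7]
```

Since the longest review here has 986 tokens, a `maxlen` of 1000 means no review is truncated.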

Next, we have to create an embedding matrix that stores our word vectors.

In [61]:
#Create an embedding matrix
embedding_matrix = np.zeros(((len(word_index))+1, 300))

for word, i in word_index.items():
    try:
        embedding_vector = model.wv[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        pass
print("Shape of embedding matrix = {}".format(embedding_matrix.shape))


  


Shape of embedding matrix = (74066, 300)
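
Because the Word2Vec model only has 16490 vectors while the tokenizer found 74065 words, most rows of this matrix stay all-zero. The sketch below shows one way to measure that coverage; it runs on toy data invented for the example rather than on the real model.

```python
# Toy stand-ins for word_index and the pretrained vectors (hypothetical values).
toy_word_index = {"movie": 1, "great": 2, "zzyzx": 3}
toy_vectors = {"movie": [0.1, 0.3], "great": [0.8, 0.9]}  # no entry for "zzyzx"

dim = 2
# Row 0 is reserved for padding, hence the +1.
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    if word in toy_vectors:
        toy_matrix[i] = list(toy_vectors[word])
    else:
        missing.append(word)  # this row stays all-zero

coverage = 1 - len(missing) / len(toy_word_index)
print("{:.0%} of words have a pretrained vector; missing: {}".format(coverage, missing))
```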


Finally, we create our neural network architecture, in which the first layer is an Embedding layer. An Embedding layer turns non-negative integers (word indexes) into dense vectors (word vectors) of a fixed size. In other words, it uses each index in our padded sequences to look up the corresponding row of the embedding matrix, which contains our word vectors.
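
Conceptually, the forward pass of an Embedding layer is just a row lookup. A minimal sketch with a toy 2-dimensional matrix (the values are invented for illustration):

```python
# Row i holds the vector for word index i; row 0 is the padding row.
toy_embedding_matrix = [
    [0.0, 0.0],  # index 0: padding
    [0.9, 0.1],  # index 1
    [0.2, 0.7],  # index 2
]

def embed(sequence, matrix):
    """Look up one row per token index."""
    return [matrix[idx] for idx in sequence]

# A left-padded sequence of length 4 becomes a 4 x 2 list of vectors.
embedded = embed([0, 0, 2, 1], toy_embedding_matrix)
print(embedded)
```

In the real model each review becomes a 1000 x 300 array, which is what the LSTM consumes one row at a time.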

We want to use our pretrained word vectors as they are, so we freeze the embedding layer (trainable=False). We then add an LSTM layer with 100 hidden units, a dropout layer with a rate of 0.2, and a final fully connected layer with a sigmoid activation that outputs our binary sentiment prediction.

In [64]:
my_model = Sequential()
my_model.add(Embedding(
    input_dim=len(word_index)+1,
    output_dim=300,
    weights=[embedding_matrix],
    input_length=1000,
    mask_zero=True,
    trainable=False
))
my_model.add(LSTM(100))
my_model.add(Dropout(0.2))
my_model.add(Dense(1, activation='sigmoid'))
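
As a sanity check on the architecture, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM layer (four gates, one bias vector per gate). Its results should agree with what `my_model.summary()` reports, although that output is not shown in this notebook.

```python
vocab_rows, embed_dim, units = 74066, 300, 100

# Embedding: one 300-dim row per index; frozen, so not trained.
embedding_params = vocab_rows * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense: one weight per LSTM unit plus a bias. Dropout has no parameters.
dense_params = units * 1 + 1

print("non-trainable:", embedding_params)
print("trainable:", lstm_params + dense_params)
```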

In [69]:
my_model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

In [70]:
my_model.fit(sequences,train['sentiment'],batch_size=32,epochs=5,validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a2b91beb8>

Now that we have our model, let's use it to predict sentiment on our testing data. First, we have to preprocess the reviews and tokenize them. Second, we will use our model to make predictions on the sequences of our testing dataset.

In [115]:
#Preprocess all of the testing data
clean_test_text = []
for i in range(test['review'].size):
    clean_test_text.append(preprocessing(test['review'][i]))
    if i % 5000 == 0 and i != 0:
        print("Completed {} Reviews".format(i))
print("Finished Preprocessing")

Completed 5000 Reviews
Completed 10000 Reviews
Completed 15000 Reviews
Completed 20000 Reviews
Finished Preprocessing


In [118]:
#Tokenize our testing data
sequences_test = tokenizer.texts_to_sequences(clean_test_text)
word_index_test = tokenizer.word_index

#Left Pad our testing data
sequences_test = pad_sequences(sequences_test, maxlen=1000)

In [119]:
predictions = my_model.predict(sequences_test)

In [122]:
#Threshold the predicted probabilities at 0.5 to get binary labels
predictions_binary = (predictions >= 0.5).astype(int).ravel().tolist()

In [124]:
len(predictions_binary)

25000

In [126]:
# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":predictions_binary})
output.to_csv( "BagOfLSTM.csv", index=False, quoting=3 )

With our LSTM model, we were able to achieve an accuracy of 0.87428, which is roughly 3% better than our random forest model. To improve our accuracy, we could train on the entire dataset instead of setting aside 20% of it for validation.
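
For reference, Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The sketch below shows which samples that holds out; the helper name `split_indices` is hypothetical.

```python
import math

def split_indices(n_samples, validation_split):
    """Return (train_range, val_range); the validation part is the tail."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return range(0, split_at), range(split_at, n_samples)

train_idx, val_idx = split_indices(25000, 0.2)
print(len(train_idx), len(val_idx))
```

With 25000 labeled reviews this gives 20000 training and 5000 validation samples, matching the training log above. Setting `validation_split` to 0 would let the model see all 25000 reviews, at the cost of no longer having a held-out estimate of its accuracy.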