# Prerequisites

While most required packages ship with Anaconda, TensorFlow, Keras and Gensim must be installed separately, on top of an Anaconda distribution with Python 3.

To check, run the following cells in this section. If an import statement fails, install the corresponding package with pip.

In [1]:
import tensorflow

In [2]:
# if you don't have TF on your machine, run the following command in Anaconda's terminal
# pip install tensorflow

In [3]:
import keras

Using TensorFlow backend.


In [4]:
# if you don't have Keras on your machine, run the following command in Anaconda's terminal
# pip install keras

In [5]:
import gensim

In [6]:
# if you don't have Gensim on your machine, run the following command in Anaconda's terminal
# pip install gensim
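
As an alternative to running the imports one by one, the check described above could be sketched in a single cell. This is only a minimal sketch: `missing_packages` is a hypothetical helper (not part of the lab), and it only tests whether each package can be located, not whether it imports cleanly.

```python
import importlib.util

def missing_packages(names):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# package names taken from this section
required = ["tensorflow", "keras", "gensim"]
for name in missing_packages(required):
    print("missing: {} (try: pip install {})".format(name, name))
```

If nothing is printed, all three packages were found.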

# IMDB Dataset

The IMDB dataset comprises 50,000 highly polarized reviews from the Internet Movie Database. Keras has already split them into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

In [7]:
from keras.datasets import imdb

For this lab, we'll limit the vocabulary to the 2,000 most frequent words. Rarer words are dropped from the vocabulary and appear in the data as the reserved "unknown" index 2.

In [21]:
num_words = 2000 # limit the number of words to most frequent <num_words>
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=num_words)

Now we can take a look at what each review looks like! 

For example, here is the first review in the training set:

In [9]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 1920, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]


And here is the category of the first training instance (1=Positive, 0=Negative):

In [22]:
print(train_labels[0])

1


Since we keep only the 2,000 most frequent words, no word index exceeds 1,999 in any review:

In [11]:
print(max([max(sequence) for sequence in train_data]))

1999
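
To make the effect of the vocabulary limit concrete, here is a toy sketch of what the cap amounts to. The helper `cap_vocabulary` and the toy review are made up for illustration; it assumes, as the reserved indices in this notebook suggest, that out-of-vocabulary words are mapped to index 2.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index >= num_words with the reserved "unknown" index."""
    return [i if i < num_words else oov_index for i in sequence]

# made-up encoded review containing two indices outside the top 2,000
toy_review = [1, 14, 2500, 22, 1999, 2000]
capped = cap_vocabulary(toy_review, num_words=2000)
print(capped)       # [1, 14, 2, 22, 1999, 2]
print(max(capped))  # 1999
```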


We can decode the word indices back into words if needed. Here is one quick way to do this:

First, build a dictionary mapping each index to its word (index:word).

In [12]:
word_index = imdb.get_word_index()  # maps word -> index
# invert the mapping so that a word can be looked up by its index
reverse_word_index = {index: word for (word, index) in word_index.items()}

Second, obtain the actual words from the indices.

Note that we need to subtract 3 from each index, because 0, 1 and 2 are reserved indices for "padding", "start of sequence" and "unknown".

We use a question mark to denote anything that cannot be found in the dictionary.

In [13]:
decoded_review = ' '.join([reverse_word_index.get(i-3, '?') for i in train_data[0]])
print("Here is the first review:\n", decoded_review)

Here is the first review:
 ? this film was just brilliant casting location scenery story direction ? really ? the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same ? island as myself so i loved the fact there was a real ? with this film the witty ? throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the ? ? was amazing really ? at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little ? that played the ? of ? and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all ? up are such a big ? for the whole film but these children are amazing and should be ? for what they have done don't you think the whole story was so lovely because it wa
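
The offset of 3 is easier to see on a small example. The sketch below uses a hypothetical three-word index (not the real IMDB one) and a made-up encoded sequence; `decode` is an illustrative helper.

```python
# hypothetical miniature word index (word -> rank), for illustration only
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_reverse = {index: word for word, index in toy_word_index.items()}

def decode(sequence, reverse_index, offset=3):
    """Map encoded indices back to words; anything unknown becomes '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in sequence)

# 1 = start of sequence, 2 = unknown, 4/5/6 = the/film/great shifted by 3
encoded = [1, 5, 2, 4, 6]
print(decode(encoded, toy_reverse))  # ? film ? the great
```

The start marker and the unknown word both decode to a question mark because subtracting 3 from them gives an index that is not in the dictionary.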

# Pre-Processing

While the first review appears to be complete (with the punctuation removed), it still contains stop words that should be removed before training/testing.

A common way to remove stop words is through the Natural Language Toolkit (NLTK) package. Find out more about this open source package at www.nltk.org.

In [14]:
import nltk
from nltk.corpus import stopwords

In [15]:
nltk.download('stopwords')
STOP_WORDS = set(stopwords.words('english'))
print(STOP_WORDS)

{'can', 'hasn', 'how', 'doesn', "didn't", 'it', 'is', 'our', 'few', 'on', "you'd", 'will', 'be', 'or', 'his', 're', "mustn't", 'which', 'there', 'shouldn', 'they', 'needn', "shan't", 'themselves', 'hadn', 'you', 'up', 've', "she's", 't', "doesn't", 'i', "isn't", 'too', 'was', 'than', 'such', 'll', 'against', 'between', 'my', 'having', 'the', 'here', 'most', 'them', 'and', 'of', "couldn't", 'she', 'down', 'mightn', 'any', 'mustn', 'm', 'your', 'to', 'at', 'as', 'aren', "weren't", 'we', 'while', 'off', 'each', 'if', 's', 'who', 'this', 'are', 'don', 'isn', 'have', 'below', 'then', 'what', 'won', "won't", 'into', 'ma', 'am', 'ain', 'has', 'by', "wasn't", 'should', 'ourselves', 'ours', "you're", 'no', 'being', 'those', 'that', 'yours', 'own', 'her', 'out', 'over', 'wouldn', 'did', 'before', 'after', "you've", 'where', 'above', 'do', 'an', 'during', 'haven', 'because', 'under', 'not', "that'll", "it's", "mightn't", 'o', 'whom', "wouldn't", 'does', 'more', "you'll", 'all', 'now', 'y', "hasn'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fatajadd\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


With the downloaded NLTK stop words, let's remove them from the list of reviews.

In [16]:
word_index = imdb.get_word_index()

# shift every index by 3 to match the encoded reviews, and skip stop words
word_to_id = {word:(index+3) for word,index in word_index.items() if word not in STOP_WORDS}
#word_to_id["<PAD>"] = 0
#word_to_id["<START>"] = 1
#word_to_id["<UNK>"] = 2
#word_to_id["<UNUSED>"] = 3

id_to_word = {index:word for word,index in word_to_id.items()}

# keep only ids present in id_to_word: this drops stop words and the reserved ids (start, unknown)
train_data = [[wordid for wordid in review if wordid in id_to_word] for review in train_data]
test_data = [[wordid for wordid in review if wordid in id_to_word] for review in test_data]

decoded_review = ' '.join([id_to_word.get(i, '?') for i in train_data[0]])
print("Here is the first review, with stop words removed:\n", decoded_review)

Here is the first review, with stop words removed:
 film brilliant casting location scenery story direction really part played could imagine robert amazing actor director father came island loved fact real film witty throughout film great brilliant much bought film soon released would recommend everyone watch amazing really end sad know say cry film must good definitely also two little played paul brilliant children often left list think stars play big whole film children amazing done think whole story lovely true life us
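
The same filtering can be traced by hand on a toy example. The stop-word set, word index and encoded review below are all made up for illustration, so the sketch runs without downloading anything from NLTK.

```python
# made-up stand-ins for STOP_WORDS and imdb.get_word_index()
mini_stop_words = {"the", "was", "a"}
mini_word_index = {"the": 1, "was": 2, "film": 3, "a": 4, "brilliant": 5}

# same construction as above: shift by 3 and leave out stop words
mini_word_to_id = {w: i + 3 for w, i in mini_word_index.items() if w not in mini_stop_words}
mini_id_to_word = {i: w for w, i in mini_word_to_id.items()}

# <start> the film was a <unknown> brilliant
encoded = [1, 4, 6, 5, 7, 2, 8]
filtered = [i for i in encoded if i in mini_id_to_word]
print(filtered)                                        # [6, 8]
print(" ".join(mini_id_to_word[i] for i in filtered))  # film brilliant
```

Note that the reserved indices 1 and 2 are filtered out as well, since they never appear in the id-to-word dictionary.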


# Training Random Forest with One-hot Encoded Vectors

In order to train a classification model, we need to convert these lists of integers into vectors. Visually, each dataset becomes a matrix, with one row per review and one column per word index.

In [27]:
import numpy as np

def vectorize_sequences(sequences, dimension = num_words):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

x_train_1hot = vectorize_sequences(train_data)
print("train matrix size:",x_train_1hot.shape)
x_test_1hot = vectorize_sequences(test_data)
print("test matrix size:",x_test_1hot.shape)

train matrix size: (25000, 2000)
test matrix size: (25000, 2000)
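
To see what the encoding does to individual reviews, here is a toy run of the same idea on two made-up sequences with a vocabulary of 5. The function name `toy_one_hot` is only for illustration.

```python
import numpy as np

def toy_one_hot(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # index with the whole list: every listed column in row i is set to 1
        results[i, sequence] = 1
    return results

toy = [[0, 2, 2], [1, 3]]
print(toy_one_hot(toy, dimension=5))
# [[1. 0. 1. 0. 0.]
#  [0. 1. 0. 1. 0.]]
```

Index 2 occurs twice in the first sequence, but its column still holds 1: this encoding records only whether a word is present, not how often.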


Let's try training a model using a popular ML algorithm: Random Forests. An implementation of the algorithm is available in the scikit-learn library.

In [28]:
#Random Forest Model
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=100, max_depth=8,random_state=0)
RF.fit(x_train_1hot, train_labels) 
accuracy_train = RF.score(x_train_1hot, train_labels)
print("accuracy on train set = {:.3f}%".format(accuracy_train*100))
accuracy_test = RF.score(x_test_1hot,test_labels)
print("accuracy on test set = {:.3f}%".format(accuracy_test*100))

accuracy on train set = 84.168%
accuracy on test set = 81.616%
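
For reference, the accuracy reported by `score` on a classifier is the fraction of predictions that match the true labels. A minimal sketch of that calculation, using made-up predictions and labels rather than the model above:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# made-up values: three of the four predictions are correct
toy_predictions = [1, 0, 1, 1]
toy_labels = [1, 0, 0, 1]
print("accuracy = {:.3f}%".format(accuracy(toy_predictions, toy_labels) * 100))  # accuracy = 75.000%
```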


# Training Random Forest with "Bag of Words" Vectors

Building on one-hot encoding, we can use the "Bag of Words" method to count the number of occurrences of each word in the dictionary, and use these counts as features.

In [29]:
def vectorize_sequences(sequences, dimension = 2000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            results[i,j] += 1
    return results

x_train_BOW = vectorize_sequences(train_data)
print("train matrix size:",x_train_BOW.shape)
x_test_BOW = vectorize_sequences(test_data)
print("test matrix size:",x_test_BOW.shape)


train matrix size: (25000, 2000)
test matrix size: (25000, 2000)
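
The matrix shapes are the same as before, so the difference lies in the cell values. A toy run on two made-up sequences with a vocabulary of 5 shows it; `toy_bag_of_words` is an illustrative name for the same counting loop.

```python
import numpy as np

def toy_bag_of_words(sequences, dimension):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            # add one per occurrence, so repeated words accumulate
            results[i, j] += 1
    return results

toy = [[0, 2, 2], [1, 3]]
print(toy_bag_of_words(toy, dimension=5))
# [[1. 0. 2. 0. 0.]
#  [0. 1. 0. 1. 0.]]
```

Index 2 appears twice in the first sequence, and its column now holds 2 where a presence-only encoding would hold 1.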


Let's train the same Random Forest configuration on the Bag of Words features, again using the scikit-learn implementation.

In [31]:
#Random Forest Model
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=100, max_depth=8,random_state=0)
RF.fit(x_train_BOW, train_labels) 
accuracy_train = RF.score(x_train_BOW, train_labels)
print("accuracy on train set = {:.3f}%".format(accuracy_train*100))
accuracy_test = RF.score(x_test_BOW,test_labels)
print("accuracy on test set = {:.3f}%".format(accuracy_test*100))

accuracy on train set = 84.664%
accuracy on test set = 81.812%
