## The IMDB movie review dataset
Keras provides a **one-dimensional convolutional net** example that examines the **IMDB movie review dataset**. Each data point is prelabeled with a `0` (**negative sentiment**) or a `1` (**positive sentiment**). However, we are going to swap out that preprocessed version of the dataset for one in raw text, so we can get our hands dirty with the preprocessing of the text as well. We’ll then use the trained model to classify novel review text it has never seen before.

This raw dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification.
The core dataset contains `50,000` reviews split evenly into `25,000` train and `25,000` test sets. The overall distribution of labels is balanced (`25,000` pos and `25,000` neg). In addition to the review text files, the maintainers of the dataset include already-tokenized bag-of-words (BoW) features that were used in their experiments (we are not going to use the BoW features but will prepare our own from the raw dataset). *See the dataset maintainers' README file contained in the release for more details*.

Link to dataset: https://ai.stanford.edu/%7eamaas/data/sentiment/ (introduced in the paper **Learning Word Vectors for Sentiment Analysis**)

## Load and Preprocess the data
### Import Required Modules

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import glob
import os
import numpy as np
from random import shuffle
from nltk.tokenize import TreebankWordTokenizer
from gensim.models.keyedvectors import KeyedVectors

## Helper functions
### Load data

In [3]:
def pre_process_data(filepath):
    """This function shall be used to load and preprocess both train and test datasets"""
    positive_path = os.path.join(filepath, 'pos')
    negative_path = os.path.join(filepath, 'neg')
    pos_label = 1
    neg_label = 0
    dataset = []
    
    for filename in glob.glob(os.path.join(positive_path, '*.txt')):
        with open(filename, 'r', encoding = 'utf-8') as f:
            dataset.append((pos_label, f.read()))
            
    for filename in glob.glob(os.path.join(negative_path, '*.txt')):
        with open(filename, 'r', encoding = 'utf-8') as f:
            dataset.append((neg_label, f.read()))
    shuffle(dataset)
    return dataset
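
The loader relies on the layout of the extracted archive: a directory with `pos` and `neg` subfolders, each holding one `.txt` file per review. The sketch below illustrates that contract on a throwaway directory. The file names and review strings are made up for the illustration, and `load_labeled_reviews` is a condensed, hypothetical stand-in for `pre_process_data`, not a replacement for it.

```python
import glob
import os
import tempfile
from random import shuffle


def load_labeled_reviews(filepath):
    """Condensed stand-in for pre_process_data: returns shuffled (label, text) pairs."""
    dataset = []
    for label, folder in ((1, 'pos'), (0, 'neg')):
        for filename in glob.glob(os.path.join(filepath, folder, '*.txt')):
            with open(filename, 'r', encoding='utf-8') as f:
                dataset.append((label, f.read()))
    shuffle(dataset)
    return dataset


# Build a tiny, hypothetical directory tree that mirrors the train folder layout.
with tempfile.TemporaryDirectory() as root:
    for folder, name, text in (('pos', '0_9.txt', 'A lovely film.'),
                               ('neg', '1_2.txt', 'A dull film.')):
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        with open(os.path.join(root, folder, name), 'w', encoding='utf-8') as f:
            f.write(text)
    toy_dataset = load_labeled_reviews(root)

# Sorting undoes the shuffle so the printed order is stable.
print(sorted(toy_dataset))
```

Each element is a `(label, text)` tuple, which is why the later helpers read the label from `sample[0]` and the text from `sample[1]`.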

### Target labels
We can extract the *target labels* from the loaded datasets.

In [4]:
def collect_labels(dataset):
    """ Extract the target labels from the dataset """
    target_labels = []
    for sample in dataset:
        target_labels.append(sample[0])
    return target_labels

### Data Tokenizer and Vectorizer 

For our *feature engineering*, we are going to employ Google's **Word2vec** model, developed by *Tomas Mikolov and team* in 2013, to generate Word2vec embeddings. The word vector representation from Word2vec **captures much more specific and more precise meaning or semantics of the target word** than the word-topic vectors generated by **Latent Semantic Analysis (LSA)** and **Latent Dirichlet Allocation (LDiA)**.

We are limiting our vocab to just `500,000` words due to lack of sufficient memory. This means our *Google Word2vec* word vectors will not contain all the words in our dataset. Words without a matching vector are skipped during vectorization, so the resulting `KeyError` is bypassed and processing continues with the rest of the words.
Google's Word2vec binary file source: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
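
A rough back-of-the-envelope estimate shows why the vocabulary limit helps. The sketch assumes the vectors are held as 32-bit floats and that the full GoogleNews model contains about 3 million words and phrases. It counts only the embedding matrix, not the vocabulary index or any other overhead.

```python
# Rough memory estimate for the embedding matrix alone (float32 = 4 bytes).
BYTES_PER_FLOAT32 = 4
VECTOR_DIMS = 300            # dimensionality of the GoogleNews vectors
FULL_VOCAB = 3_000_000       # assumed size of the full GoogleNews vocabulary
LIMITED_VOCAB = 500_000      # the `limit` passed to load_word2vec_format


def embedding_gigabytes(vocab_size, dims=VECTOR_DIMS):
    """Size of a vocab_size x dims float32 matrix in gigabytes (10**9 bytes)."""
    return vocab_size * dims * BYTES_PER_FLOAT32 / 1e9


print(embedding_gigabytes(FULL_VOCAB))     # full vocabulary
print(embedding_gigabytes(LIMITED_VOCAB))  # limited vocabulary
```

Under these assumptions the limit cuts the matrix from roughly 3.6 GB to roughly 0.6 GB.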

In [5]:
word2vec_file = '../word2vec-GoogleNews-vectors/GoogleNews-vectors-negative300.bin'

In [6]:
word_vectors = KeyedVectors.load_word2vec_format(word2vec_file, binary = True, limit = 500000)

In [7]:
def tokenize_and_vectorize(dataset):
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []
    for sample in dataset:
        tokens = tokenizer.tokenize(sample[1])
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])
            except KeyError:
                pass                                   # if no matching token in the Google w2v vocab
        vectorized_data.append(sample_vecs)
    return vectorized_data
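
To make the skipping behaviour concrete, here is a minimal sketch with a toy vocabulary. The three-dimensional vectors are invented for the illustration, a plain dictionary stands in for the loaded `KeyedVectors` object, and `str.split` stands in for `TreebankWordTokenizer`.

```python
# Hypothetical 3-dimensional "embeddings"; the real model uses 300 dimensions.
toy_vectors = {
    'great': [0.9, 0.1, 0.0],
    'movie': [0.2, 0.7, 0.1],
    'boring': [-0.8, 0.0, 0.3],
}


def vectorize_tokens(tokens, vectors):
    """Look up each token and silently skip those missing from the vocabulary."""
    return [vectors[token] for token in tokens if token in vectors]


tokens = 'a great movie , not boring'.split()   # stand-in tokenizer
sample_vecs = vectorize_tokens(tokens, toy_vectors)

# Six tokens go in, but only the three in-vocabulary tokens produce vectors.
print(len(tokens), len(sample_vecs))
```

One consequence worth keeping in mind: because unmatched tokens are dropped, the vectorized sequence can be shorter than the tokenized review, and any length statistics computed afterwards count only in-vocabulary tokens.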

### Optimize the thought vector size
The size of our thought vector is very important because it determines the number of time steps and the number of weights in the feed forward layer to train. But most importantly, the size of our thought vector determines the *distance* the **backpropagation has to travel** each time.

In [8]:
def test_len(data, maxlen):
    total_len = truncated = exact = padded = 0
    for sample in data:
        total_len += len(sample)
        if len(sample) > maxlen:
            truncated += 1
        elif len(sample) < maxlen:
            padded += 1
        else:
            exact += 1
    print('Padded: {}'.format(padded))
    print('Equal: {}'.format(exact))
    print('Truncated: {}'.format(truncated))
    print('Avg length: {}'.format(total_len/len(data)))

## Load train data

In [9]:
train_file_path = '../aclImdb/train'

In [10]:
train_dataset = pre_process_data(train_file_path)

In [11]:
len(train_dataset)

25000

In [12]:
train_dataset[2]

(0,
 '...am i missing something here??? "unexpected plot developments"? "plot twisting with subversive glee"? are these viewers watching the same Arquette vehicle to which i just subjected myself (in an now-obvious sub(un)conscious bout of sadomasochism)...I just joined this site simply to make sure that no one else ever rents this stinker...this movie was an embarrassment to every single person involved...quick question: did Sir Stevie read the script before he gave the thumbs-up to Kate C.? if so, then it must be the same Spielberg who greenlighted "howard the duck"...don\'t give me that, "it was a hit play" crap--i\'m guessing Mssr. Reddin ain\'t too pleased ...the DVD cover promised "surprising corners" and a "twisted story..." Story!!Story?? It\'s crap like this that make old Bobby McKee and his wandering band of Structuralists sound like geniuses...Sundance??Berlin??Toronto?? I have a home video of my cat farting that evokes more interest than Arquette\'s negatively-dimensional p

In [13]:
vectorized_train_data = tokenize_and_vectorize(train_dataset)

In [14]:
target_train_labels = collect_labels(train_dataset)

## Load test data

In [15]:
test_file_path = '../aclImdb/test'

In [16]:
test_dataset = pre_process_data(test_file_path)

In [17]:
len(test_dataset)

25000

In [18]:
test_dataset[1]

(0,
 "I was really disappointed by this movie. Great actors in it, and potentially a great plot, but it just seemed to limp along.<br /><br />Charlize Theron was masterful in her role and beautiful, but it seemed like 90% of her on-screen work was in car chases done with Austin Minis. Product placement gone wrong, so very wrong.<br /><br />The direction seemed off, too. Edward Norton is the bad guy, and it was so obvious right from the start. Every time the camera would pass over him, it would linger too long and Norton would grimace or something. C'mon, Hollywood, give us a little credit! It's okay to surprise us with a plot twist without having to telegraph it.<br /><br />Sorry, but this movie was just below average. I have always been one to appreciate the work and talent that goes into a movie, but this one just didn't have it.")

We have `25,000` training samples and `25,000` test samples as expected. The next step is to *tokenize* and *vectorize* the data.

In [19]:
vectorized_test_data = tokenize_and_vectorize(test_dataset)

In [20]:
target_test_labels = collect_labels(test_dataset)

We shall pad/truncate our train and test data and convert them to *numpy arrays*, as required by Keras for its optimized vectorized operations. The result is a *tensor* with the shape (**number of samples**, **sequence length**, **word vector length**) that we need for our GRU model. **We won’t usually need to pad or truncate with RNNs (LSTMs, GRUs), because they can handle input sequences of variable length**. 

In [22]:
test_len(vectorized_train_data, 400)

Padded: 22506
Equal: 12
Truncated: 2482
Avg length: 203.8464
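
`test_len` reports counts for one candidate length at a time. A complementary check, sketched below, asks what share of samples would fit within several candidate lengths without truncation. The token counts are invented for the illustration and `coverage` is a hypothetical helper, not part of the notebook.

```python
def coverage(lengths, maxlen):
    """Fraction of samples whose length is at most maxlen (no truncation needed)."""
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)


# Hypothetical token counts for eight reviews.
toy_lengths = [35, 80, 120, 150, 200, 240, 410, 900]

for candidate in (100, 200, 400):
    print(candidate, coverage(toy_lengths, candidate))
```

On real data the lengths would come from `[len(sample) for sample in vectorized_train_data]`, and the same check can be repeated for whichever value is eventually assigned to `maxlen`.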


In [23]:
maxlen = 200
batch_size = 32
embedding_dims = 300
num_neurons = 50
epochs = 2

### Helper function to pad tokens
Keras has a preprocessing helper method, **pad_sequences**, that in theory could be used to pad our input data. Unfortunately, *it works only with sequences of scalars*, whereas we have *sequences of vectors*. Let’s write a helper function to pad our input sequences of vectors.

In [24]:
def pad_trunc(data, maxlen):
    """
    For a given dataset pad with zero vectors or truncate to maxlen
    """
    # This one-liner can accomplish the same task!
    # return [sample[:maxlen] + [[0.] * embedding_dims] * (maxlen - len(sample)) for sample in data]
    
    new_data = []
    # Create a vector of 0s the length of our word vectors
    zero_vector = [0.0] * len(data[0][0])
    for sample in data:
        if len(sample) > maxlen:
            temp = sample[:maxlen]
        elif len(sample) < maxlen:
            # Copy the sample so the caller's list is not modified in place,
            # then append the appropriate number of 0 vectors
            additional_elems = maxlen - len(sample)
            temp = list(sample) + [zero_vector] * additional_elems
        else:
            temp = sample
        new_data.append(temp)
    return new_data
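
A quick way to sanity-check padding logic is to run it on a toy input. The sketch below uses a compact variant with the hypothetical name `pad_trunc_copy`, which takes the vector length as an explicit `dims` argument instead of reading it from the data and always builds new lists. The two-dimensional vectors are made up.

```python
def pad_trunc_copy(data, maxlen, dims):
    """Pad with zero vectors or truncate to maxlen, leaving the input untouched."""
    zero_vector = [0.0] * dims
    return [list(sample[:maxlen]) + [zero_vector] * max(0, maxlen - len(sample))
            for sample in data]


# Two hypothetical samples of 2-dimensional vectors: one too short, one too long.
toy_data = [
    [[1.0, 1.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
]
padded = pad_trunc_copy(toy_data, maxlen=3, dims=2)

print(padded[0])         # short sample padded with zero vectors
print(padded[1])         # long sample truncated to the first three vectors
print(len(toy_data[0]))  # the input sample still has its original length
```

Passing `dims` explicitly also means the helper does not depend on the first sample being non-empty, which can matter when a single short review is vectorized for prediction.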

In [25]:
X_train = pad_trunc(vectorized_train_data, maxlen)
X_test = pad_trunc(vectorized_test_data, maxlen)

In [26]:
X_train = np.reshape(X_train, (len(X_train), maxlen, embedding_dims))
y_train = np.array(target_train_labels)

In [27]:
X_test = np.reshape(X_test, (len(X_test), maxlen, embedding_dims))
y_test = np.array(target_test_labels)

### Train data stats

In [28]:
X_train[0]

array([[ 0.07910156, -0.0050354 ,  0.11181641, ..., -0.0067749 ,
         0.04272461, -0.10351562],
       [ 0.09667969, -0.07080078, -0.06933594, ...,  0.0189209 ,
         0.13574219,  0.19140625],
       [-0.06640625,  0.19921875, -0.22460938, ...,  0.13476562,
         0.22070312, -0.26757812],
       ...,
       [ 0.22851562,  0.04516602,  0.09521484, ..., -0.08056641,
        -0.08398438,  0.01611328],
       [-0.03442383,  0.10351562,  0.02160645, ...,  0.07324219,
         0.03320312,  0.03833008],
       [-0.02490234,  0.02197266, -0.03540039, ...,  0.01080322,
        -0.01879883, -0.06884766]])

In [29]:
X_train[0].shape

(200, 300)

In [30]:
y_train[:10]

array([1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

In [31]:
X_train.shape

(25000, 200, 300)

In [32]:
y_train.shape

(25000,)

### Test data stats

In [33]:
X_test[0]

array([[ 0.07910156, -0.0050354 ,  0.11181641, ..., -0.0067749 ,
         0.04272461, -0.10351562],
       [ 0.19335938, -0.07128906,  0.10839844, ...,  0.0480957 ,
         0.16503906,  0.04418945],
       [ 0.12597656,  0.19042969,  0.06982422, ...,  0.0612793 ,
         0.17285156, -0.07861328],
       ...,
       [-0.03369141,  0.05151367,  0.02368164, ...,  0.27148438,
         0.01324463, -0.19140625],
       [-0.02368164,  0.10791016, -0.13574219, ..., -0.21386719,
        -0.08251953, -0.0168457 ],
       [-0.6015625 , -0.08398438,  0.05395508, ...,  0.10107422,
         0.05688477, -0.20898438]])

In [34]:
X_test[0].shape

(200, 300)

In [35]:
y_test[:10]

array([0, 0, 1, 0, 0, 1, 0, 1, 0, 1])

In [36]:
X_test.shape

(25000, 200, 300)

In [37]:
y_test.shape

(25000,)

## Build the GRU model
### Import required packages and modules 

In [38]:
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)

1.14.0


## Initialize model and add a GRU layer

In [39]:
model = keras.models.Sequential()
model.add(keras.layers.GRU(
    num_neurons, 
    return_sequences = True,
    input_shape = (maxlen, embedding_dims)))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Because our sequences are `200` (maxlen) tokens long and we’re using `50` hidden **neurons**, the output from this layer is a sequence of `200` elements, one per token. Each of those elements is a vector `50` elements long, with **one output for each of the neurons**: every token is processed by the `50` neurons to produce a vector of length `50`, and the `200` resulting vectors are stacked into a `200 x 50` tensor.
The keyword argument `return_sequences` tells the network to return its output at each *time step*, hence the `200` vectors, each `50` elements long. If `return_sequences` is set to *False*, the layer is still recurrent, but it returns only the output of the last *time step*: a single vector `50` elements long.
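
The effect of `return_sequences` on the output shape can be illustrated with a deliberately simplified recurrent loop. This is not a GRU: it has no gates, and the update rule and its constants are invented purely to show which states are returned. The function name `toy_recurrent_layer` is hypothetical.

```python
import math


def toy_recurrent_layer(sequence, num_neurons, return_sequences):
    """Run a gate-free recurrent update and return all states or only the last."""
    state = [0.0] * num_neurons
    outputs = []
    for token_vector in sequence:
        token_sum = sum(token_vector)
        # Each neuron mixes the current input with its own previous state.
        state = [math.tanh(0.1 * token_sum + 0.5 * s) for s in state]
        outputs.append(state)
    return outputs if return_sequences else state


# A hypothetical sample: 200 time steps of 300-dimensional vectors.
sample = [[0.01] * 300 for _ in range(200)]

all_steps = toy_recurrent_layer(sample, num_neurons=50, return_sequences=True)
last_step = toy_recurrent_layer(sample, num_neurons=50, return_sequences=False)

print(len(all_steps), len(all_steps[0]))  # one 50-element vector per time step
print(len(last_step))                     # a single 50-element vector
```

In both cases the state is carried from one time step to the next; the flag only controls how much of that history is handed to the following layer.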

## Add a dropout and output layers
To prevent *overfitting* we add a **Dropout layer** that zeroes out `20%` of its inputs, randomly chosen on each input example during training. Finally, we add a classifier. In this case, we have a **binary classification** task: *Yes or Positive Sentiment* is labeled `1` and *No or Negative Sentiment* is labeled `0`. So we choose a layer with one neuron (`Dense(1)`) and a `sigmoid` activation function. But a Dense layer expects a “*flat*” vector of **n elements** (each element a float) as input, and the data coming out of the *GRU* is a *tensor* `200` elements long, each of which is `50` elements long. So we use a `Flatten()` layer to flatten the input from a `200 x 50` tensor to a vector `10,000` elements long.

In [40]:
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(1, activation = 'sigmoid'))

## Compile the RNN model

In [41]:
model.compile('rmsprop', 'binary_crossentropy', metrics = ['accuracy'])
model.summary()

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru (GRU)                    (None, 200, 50)           52650     
_________________________________________________________________
dropout (Dropout)            (None, 200, 50)           0         
_________________________________________________________________
flatten (Flatten)            (None, 10000)             0         
_________________________________________________________________
dense (Dense)                (None, 1)                 10001     
Total params: 62,651
Trainable params: 62,651
Non-trainable params: 0
_________________________________________________________________
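
The parameter counts in the summary can be reproduced by hand. The sketch below assumes a GRU with three weight sets (update gate, reset gate and candidate state), each consisting of an input kernel, a recurrent kernel and a single bias vector, which is consistent with the `52650` reported above. GRU variants that keep separate input and recurrent biases would give a larger count.

```python
def gru_param_count(input_dim, units):
    """Parameters of a GRU layer with one bias vector per weight set.

    Three weight sets (update gate, reset gate, candidate state), each with an
    input kernel, a recurrent kernel, and a bias.
    """
    return 3 * (units * input_dim + units * units + units)


def dense_param_count(input_dim, units):
    """Weights plus one bias per output unit."""
    return input_dim * units + units


gru_params = gru_param_count(input_dim=300, units=50)
dense_params = dense_param_count(input_dim=200 * 50, units=1)

print(gru_params, dense_params, gru_params + dense_params)
```

Note that the GRU count does not depend on `maxlen`, because the same weights are reused at every time step, whereas the Dense layer grows with the flattened `200 x 50` input.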


## Train and save the model

In [42]:
history = model.fit(X_train, y_train,
                    batch_size = batch_size,
                    epochs = epochs,
                    validation_split = 0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


## Evaluate the model

In [43]:
print(history.history.keys())

dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])


In [44]:
train_acc = history.history['acc']
val_acc = history.history['val_acc']
train_loss = history.history['loss']
val_loss = history.history['val_loss']

In [49]:
train_loss, train_accuracy = model.evaluate(X_train, y_train)



In [50]:
print('Training Accuracy: {}%'.format(round(float(train_accuracy) * 100, 2)))

Training Accuracy: 86.82%


In [46]:
print('Validation Accuracy: {}%'.format(round(sum(val_acc) / len(val_acc) * 100, 2)))

Validation Accuracy: 82.24%


In [47]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)



In [48]:
print('Test Accuracy: {}%'.format(round(float(test_accuracy) * 100, 2)))

Test Accuracy: 83.8%


The validation accuracy (averaged over the two epochs) and the test accuracy are comparable, at about `82%` and `84%` respectively. This is an indication that *overfitting* of the training data is not a huge problem. However, with a training accuracy of more than `86%` and a test accuracy of about `84%`, the model can still be tweaked. Swapping in an *LSTM layer* might yield a further `2%` - `3%` increase in test score, although that has not been measured here.
### Save the model
Saving both the model architecture and its weights will allow it to be reloaded and trained from that point on if necessary.

In [53]:
model_structure = model.to_json()
with open("./model/gru_model.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("./model/gru_weights.h5")

## Reload model for Prediction

In [54]:
with open("./model/gru_model.json", "r") as json_file:
    json_string = json_file.read()

In [55]:
loaded_model = keras.models.model_from_json(json_string)
loaded_model.load_weights('./model/gru_weights.h5')

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


## Prepare Test example
Get some sample reviews from the internet and use the model to predict the reviewers' sentiments regarding the movie (make sure to replace any sensitive details, such as names, with made-up ones).

In [59]:
review_text = ["Loved the film. I wasn’t sure at the start but it was lovely "+ \
               "to see all the characters from the small screen arrive in the "+ \
               "cinema as old friends, and I laughed and cried. This is a great "+ \
               "film and I really hope they make a sequel. "+ \
               "To the person who gave this film one star you should have reviewed "+ \
               "the film, not the taxi driver and as you didn’t see the first 30 minutes "+ \
               "you aren’t in a position to comment on the entire film anyway.", 
               
               "I have no doubt that Uptown fans will support this film. We have every "+ \
               "episode on DVD, so it is with something of a heavy heart to give this film "+ \
               "such a low rating. As a stand-alone film (or if you have never seen the TV series), "+ \
               "all you get are lavish scenery and costumes. However, the characters appear shallow "+ \
               "and the plot flimsy. As a Uptown fan, yes – the pleasant and familiar characters are "+ \
               "there for you to enjoy in their familiar costumes. However, that is not enough. "+ \
               "Soon into the film, we found that the depth of our characters was not there. "+ \
               "I can only allude to metaphors. It was like watching a Formula One race run at "+ \
               "20 mph – where was the excitement? It was like watching a Weakenhand ruby game of "+ \
               "touch rugby – all spectacle but no impact. It was like being forced to lie in a bubble "+ \
               "bath of lukewarm water for too long. Even a couple of people around gave up and stated "+ \
              "playing with their iPhone, with mutterings as we left at it was far too long. "+ \
               "Maybe it was our local cinema’s projection but even the film quality was nothing "+ \
               "like my Blu-ray at home, let along 4K. So, all in all, this is best seen as a light touch "+ \
               "homage to the TV series. Bearing in mind the trouble taken to assemble the actors in one "+ \
               "place at one time to make this movie, this was a wasted opportunity to create a real Uptown epic. "+ \
               "I hope they do not make a sequel."
              ]

In [60]:
sample_1 = review_text[0]

In [61]:
vec_sample_1 = tokenize_and_vectorize([(1, sample_1)])                                            # tokenize and vectorize
test_vec_sample_1 = pad_trunc(vec_sample_1, maxlen)                                               # padding / truncate
test_vec1 = np.reshape(test_vec_sample_1, (len(test_vec_sample_1), maxlen, embedding_dims))       # reshape

In [62]:
print('Sentiment class: {}'.format(loaded_model.predict_classes(test_vec1)))

Sentiment class: [[1]]


In [63]:
sample_2 = review_text[1]

In [64]:
vec_sample_2 = tokenize_and_vectorize([(1, sample_2)])                                           
test_vec_sample_2 = pad_trunc(vec_sample_2, maxlen)                                               
test_vec2 = np.reshape(test_vec_sample_2, (len(test_vec_sample_2), maxlen, embedding_dims))        

In [65]:
print('Sentiment class: {}'.format(loaded_model.predict_classes(test_vec2)))

Sentiment class: [[1]]


In [68]:
sample_3 = 'The story has no center; the duck is not likable, and the costly, overwrought, laser-filled special effects '+ \
'that conclude the movie are less impressive than a sparkler on a birthday cake. James ‘Star Wars’ Luke supervised the ' + \
'production of this film, and maybe it’s time he went back to making low-budget films like his best picture'

In [69]:
vec_sample_3 = tokenize_and_vectorize([(1, sample_3)])                                           
test_vec_sample_3 = pad_trunc(vec_sample_3, maxlen)                                               
test_vec3 = np.reshape(test_vec_sample_3, (len(test_vec_sample_3), maxlen, embedding_dims))        

In [70]:
print('Sentiment class: {}'.format(loaded_model.predict_classes(test_vec3)))

Sentiment class: [[0]]


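Our model gets two of the three samples right: the first and third reviews are classified as expected, but the second review, which gives the film a low rating, is labeled positive (`1`).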

The `predict_classes()` method gives the expected `0` or `1` for a binary classification task. The `.predict()` method reveals the raw `sigmoid` activation function output (a continuous value between `0` and `1`) before thresholding. Anything **above** `0.5` will be classified as positive (`1`) and **below** `0.5` will be negative (`0`).

In [71]:
print("Raw output of sigmoid function for sample_1: {}".format(loaded_model.predict(test_vec1)))

Raw output of sigmoid function for sample_1: [[0.8725347]]


In [72]:
print("Raw output of sigmoid function for sample_2: {}".format(loaded_model.predict(test_vec2)))

Raw output of sigmoid function for sample_2: [[0.6017147]]


In [73]:
print("Raw output of sigmoid function for sample_3: {}".format(loaded_model.predict(test_vec3)))

Raw output of sigmoid function for sample_3: [[0.43969184]]
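
The thresholding step can also be applied by hand to the raw outputs, which is useful in TensorFlow releases where `predict_classes()` is not available. A minimal sketch, using the three sigmoid values printed above; `to_class` is a hypothetical helper name.

```python
def to_class(probability, threshold=0.5):
    """Map a sigmoid output to a class label: 1 (positive) or 0 (negative)."""
    return int(probability > threshold)


# Raw sigmoid outputs printed above for sample_1, sample_2 and sample_3.
raw_outputs = [0.8725347, 0.6017147, 0.43969184]

print([to_class(p) for p in raw_outputs])
```

The raw values also show how confident each prediction is: the output for sample_2 sits much closer to the `0.5` threshold than the output for sample_1.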
