<b><h1><center>Recurrent Neural Network (using Dataset = imdb)</h1>
<b><h2><center>Natural Language Processing - Sentiment Analysis

This tutorial covers Sentiment Analysis, a basic form of Natural Language Processing. The dataset consists of movie reviews, and a Neural Network is trained to classify each review as positive or negative.
Neural Networks operate on numbers rather than text, but the dataset used in this tutorial contains text. Therefore, the text data is first converted into numbers.
The reviews in the dataset have different lengths. Hence, this tutorial uses a type of Neural Network that can work on both long and short sequences of text.
The prerequisites for this tutorial are basic knowledge of Linear Algebra, Machine Learning and Classification, Recurrent Neural Networks and Layers, the Python programming language and the Jupyter Notebook editor.

To perform Sentiment Analysis on a dataset composed of text sequences, the following steps are needed:
-  Convert the words into integers, also called tokens, which serve as indices into the vocabulary list.
-  Convert the tokens into real-valued vectors called embeddings, whose mapping is trained along with the neural network.
-  Feed the embedding vectors into the Recurrent Neural Network (RNN), which takes sequences of arbitrary length as input and outputs a summary of the input.
-  Apply the Sigmoid function to the Neural Network output to get a value between 0.0 and 1.0, where values near 0.0 indicate negative sentiment and values near 1.0 indicate positive sentiment.
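
The four steps can be traced on a toy example. The sketch below is not the model built in this tutorial: the vocabulary, the 2-dimensional embeddings and the "summary" (a plain sum standing in for the RNN) are all made up for illustration.

```python
import math

# Hypothetical toy vocabulary; token 0 is reserved for padding.
toy_vocabulary = {"good": 1, "bad": 2, "movie": 3}

# Hypothetical 2-dimensional embeddings; in the tutorial these are learned.
toy_embeddings = {
    0: [0.0, 0.0],
    1: [0.9, 0.1],
    2: [-0.8, 0.2],
    3: [0.0, 0.3],
}


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def toy_sentiment(text):
    # Step 1: words -> integer tokens (unknown words are dropped).
    tokens = [toy_vocabulary[w] for w in text.lower().split() if w in toy_vocabulary]
    # Step 2: tokens -> embedding vectors.
    vectors = [toy_embeddings[t] for t in tokens]
    # Step 3: summarise the sequence. A real RNN learns this summary;
    # summing the first coordinate is only a stand-in.
    summary = sum(v[0] for v in vectors)
    # Step 4: the sigmoid maps the summary to a score between 0.0 and 1.0.
    return sigmoid(summary)


print(toy_sentiment("good movie"))  # above 0.5
print(toy_sentiment("bad movie"))   # below 0.5
```

The rest of the tutorial replaces each hand-written piece with a learned or library-provided component.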

<b><h2>Importing Libraries

The following libraries are used throughout the tutorial.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

<b><h2>Importing Data

The dataset used in this tutorial consists of 50,000 IMDB movie reviews, each to be classified as either positive or negative sentiment. That means the dataset has 2 classes. The dataset can be downloaded and extracted automatically, or it can be downloaded manually from an online source and then loaded.

In [2]:
import imdb

In [3]:
imdb.maybe_download_and_extract()

Data has apparently already been downloaded and unpacked.


<b><h2>Loading Training and Testing Datasets

The dataset is already divided into 2 sets, the training set and the testing set, each containing 25,000 reviews. The two sets are loaded into separate variables as follows.

In [4]:
dataset_text_for_training, outputs_for_training = imdb.load_data(train=True)
dataset_text_for_testing, outputs_for_testing = imdb.load_data(train=False)

Checking out the Sizes of Training and Testing datasets

In [5]:
print("Size of Training Dataset: ", len(dataset_text_for_training))
print("Size of Testing Dataset :  ", len(dataset_text_for_testing))

Size of Training Dataset:  25000
Size of Testing Dataset :   25000


Combining Training and Testing Dataset for some uses in the tutorial.

In [6]:
data_text_in_total = dataset_text_for_training + dataset_text_for_testing

<b><h2>Tokenizer

Neural Networks do not work directly with text data. Therefore, the texts in the dataset are converted into integers, also called tokens, so that they can be fed to the Neural Network later.
The Tokenizer can be instructed to use only a given number of the most frequent words. In this tutorial, the tokenizer is instructed to use the 10,000 most frequent words in the dataset, as specified below.

In [7]:
take_number_words = 10000

In [8]:
tokenizer = Tokenizer(num_words=take_number_words)

The Tokenizer is then fitted to the dataset: all the text is scanned, converted to lower case and stripped of unwanted characters such as punctuation.
The Tokenizer is fitted on the entire dataset, including both the training and testing sets, but this only builds the vocabulary of unique words. The model itself is trained only on the training set.

In [9]:
tokenizer.fit_on_texts(data_text_in_total)

The Tokenizer then converts the texts in the training and testing sets into integers called tokens.

In [10]:
tokens_training = tokenizer.texts_to_sequences(dataset_text_for_training)

In [11]:
tokens_testing = tokenizer.texts_to_sequences(dataset_text_for_testing)
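
To make the tokenization step concrete, here is a minimal pure-Python sketch of the same idea. It is a simplification, not the Keras implementation: the function names are hypothetical, the word-splitting rule is an assumption, and the `index < num_words` cut-off reflects how the Keras limit is understood to behave rather than anything shown in this tutorial.

```python
from collections import Counter
import re


def fit_vocabulary(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_tokens(text, word_index, num_words):
    """Map words to indices, dropping unknown and low-ranked words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[w] for w in words
            if w in word_index and word_index[w] < num_words]


toy_index = fit_vocabulary(["the movie was good",
                            "the movie was bad",
                            "the plot"])
print(toy_index)
print(to_tokens("The movie was good!", toy_index, num_words=4))
```

Words that fall outside the limit are simply dropped from the token sequence, which is why a translated review can be shorter than the original.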

A Recurrent Neural Network can take input sequences of arbitrary length, but all the sequences within a batch of data must have the same length.
One solution is to choose a sequence length that covers most of the sequences in the dataset. Longer sequences are then truncated and shorter ones are padded.

The number of tokens is calculated for each sequence.

In [12]:
tokens_number_totals = np.array([len(tokens) for tokens in tokens_training + tokens_testing])

The maximum number of tokens allowed in a sequence is set to the average plus two standard deviations, which covers around 95% of the dataset.

In [13]:
maximum_sequence_length = int(np.mean(tokens_number_totals) + 2 * np.std(tokens_number_totals))
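
The coverage of such a limit can be checked directly rather than assumed. The sketch below uses made-up sequence lengths (the tutorial computes the real ones from the tokenized dataset) and the standard library instead of NumPy; `statistics.pstdev` is the population standard deviation, which is also the default of `np.std`.

```python
import statistics

# Hypothetical token counts per review, for illustration only.
toy_lengths = [120, 95, 230, 180, 60, 310, 150, 1400, 200, 175]

mean_length = statistics.fmean(toy_lengths)
std_length = statistics.pstdev(toy_lengths)  # population std, like np.std
max_length = int(mean_length + 2 * std_length)

# Fraction of sequences that fit without truncation.
covered = sum(1 for n in toy_lengths if n <= max_length) / len(toy_lengths)
print(max_length, covered)
```

With a skewed distribution like this toy one, the fraction covered can differ noticeably from 95%, so it is worth printing the same ratio for the real dataset.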

Sequences shorter than this limit are padded (zeros are added until the maximum sequence length is reached) and longer ones are truncated (part of the text is thrown away). It has to be specified whether padding and truncating are done at the start ('pre') or at the end ('post') of a sequence. Here 'pre' is used for both.

In [14]:
training_data_padded_tokens = pad_sequences(tokens_training, maxlen=maximum_sequence_length, padding='pre', truncating='pre')

In [15]:
testing_data_padded_tokens = pad_sequences(tokens_testing, maxlen=maximum_sequence_length, padding='pre', truncating='pre')
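
The effect of 'pre' versus 'post' is easiest to see on short lists. The helper below is a hypothetical pure-Python sketch of the idea, not the Keras function; it uses one mode for both padding and truncating, as the tutorial does.

```python
def pad_or_truncate(tokens, maxlen, mode="pre"):
    """Return a list of exactly `maxlen` tokens, using 0 as padding."""
    if mode not in ("pre", "post"):
        raise ValueError("mode must be 'pre' or 'post'")
    if len(tokens) >= maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        if mode == "pre":
            return tokens[len(tokens) - maxlen:]
        return tokens[:maxlen]
    padding = [0] * (maxlen - len(tokens))
    # 'pre' puts the zeros in front, 'post' puts them behind.
    return padding + tokens if mode == "pre" else tokens + padding


print(pad_or_truncate([1, 2, 3], 5))                # [0, 0, 1, 2, 3]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))       # [3, 4, 5, 6]
print(pad_or_truncate([1, 2, 3], 5, mode="post"))   # [1, 2, 3, 0, 0]
```

With 'pre', the real tokens always sit at the end of the sequence, so they are the last thing the RNN processes.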

<b><h2>Tokenizer Inverse Map

The Keras Tokenizer used here does not provide a mapping from integer tokens back to words, so the mapping is built manually.

In [16]:
inverse_mapping_indices = tokenizer.word_index
inverse_map = dict(zip(inverse_mapping_indices.values(), inverse_mapping_indices.keys()))

Helper Function to Convert Tokens into Words

In [17]:
def translate(input_tokens):
    # Look up the word for every token, skipping the padding token 0.
    output_words = [inverse_map[token] for token in input_tokens if token != 0]
    converted_string = " ".join(output_words)
    return converted_string
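
The inverse mapping can be tried on a tiny made-up vocabulary. The names `toy_word_index`, `toy_inverse_map` and `toy_translate` are hypothetical and exist only for this sketch.

```python
# Hypothetical word-to-token mapping, in the same shape as tokenizer.word_index.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Swap keys and values to get a token-to-word mapping.
toy_inverse_map = {token: word for word, token in toy_word_index.items()}


def toy_translate(tokens):
    """Join the words for all non-zero tokens; 0 is the padding token."""
    return " ".join(toy_inverse_map[t] for t in tokens if t != 0)


print(toy_translate([0, 0, 1, 2, 3]))  # the movie great
```

Skipping token 0 matters for padded sequences, because 0 has no entry in the vocabulary.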

For example, below is the original text of one review from the training dataset.

In [18]:
dataset_text_for_training[10]

'When I first read Armistead Maupins story I was taken in by the human drama displayed by Gabriel No one and those he cares about and loves. That being said, we have now been given the film version of an excellent story and are expected to see past the gloss of Hollywood...<br /><br />Writer Armistead Maupin and director Patrick Stettner have truly succeeded! <br /><br />With just the right amount of restraint Robin Williams captures the fragile essence of Gabriel and lets us see his struggle with issues of trust both in his personnel life(Jess) and the world around him(Donna).<br /><br />As we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives irrevocably. The request to review a book written by a young man turns into a life changing event that helps Gabriel find the strength within himself to carry on and move forward.<br /><br />It\'s to bad that most people will avoid this film. I only say th

Below is the array of tokens for the same review.

In [19]:
np.array(tokens_training[10])

array([  50,   10,   86,  339,   64,   10,   13,  607,    8,   31,    1,
        395,  449, 4380,   31, 5078,   54,   27,    2,  143,   28, 2150,
         42,    2, 1358,   12,  109,  301,   73,   25,  146,   75,  358,
          1,   19,  318,    4,   32,  320,   64,    2,   23,  862,    5,
         63,  509,    1,    4,  369,    7,    7,  563,    2,  164, 2372,
         25,  371, 4163,    7,    7,   16,   39,    1,  203, 1120,    4,
       8163, 1943, 1858, 2446,    1, 7019, 3258,    4, 5078,    2, 1573,
        176,   63,   24, 1708,   16, 1262,    4, 1740,  195,    8,   24,
        114, 6641,    2,    1,  181,  183,   87, 5050,    7,    7,   14,
         73,   23, 1705,    5,    1, 1887,    8,   11,  449,   73,   23,
       1568,   12,  161,    6,  123,   14,    9,  184,    2,   12,    1,
       8927, 1560,   67,  665,  260,  468,    1, 7954,    5,  715,    3,
        275,  407,   31,    3,  186,  128,  511,   82,    3,  114, 2538,
       1560,   12, 1526, 5078,  165,    1, 2095,  7

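This is what the text looks like when the tokens are converted back to words. Comparing it with the original shows that the punctuation is gone, the text is in lower case and words outside the 10,000-word vocabulary, such as 'Armistead', are missing.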

In [20]:
translate(tokens_training[10])

"when i first read story i was taken in by the human drama displayed by gabriel no one and those he cares about and loves that being said we have now been given the film version of an excellent story and are expected to see past the of hollywood br br writer and director patrick have truly succeeded br br with just the right amount of restraint robin williams captures the fragile essence of gabriel and lets us see his struggle with issues of trust both in his life jess and the world around him donna br br as we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives the request to review a book written by a young man turns into a life changing event that helps gabriel find the strength within himself to carry on and move forward br br it's to bad that most people will avoid this film i only say that because the average american will probably think robin williams in a serious role that didn't work befo

<b><h2>Creating Recurrent Neural Network

The Recurrent Neural Network is created using the Keras API for simplicity.

In [21]:
selected_model = Sequential()

The first layer of the RNN is the Embedding Layer, which converts each integer token into a vector of values. This is needed because the integer tokens take on values between 0 and 10000 for a vocabulary of 10000 words, and the RNN cannot work on such a wide range of values.
The size of the embedding vector has to be defined. For Sentiment Analysis a fairly small embedding vector tends to be sufficient, and in this tutorial it is set to 8, so each token is converted into a vector of length 8. The embedding vector values lie roughly between -1.0 and 1.0.
A few other things need to be specified when adding an Embedding Layer to the model:
-  the total number of words in the vocabulary.
-  the maximum sequence length.

In [22]:
size_for_embedding = 8
selected_model.add(Embedding(input_dim=take_number_words,
                             output_dim=size_for_embedding,
                             input_length=maximum_sequence_length,
                             name='layer_embedding'))
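
Conceptually, an embedding layer is a lookup table with one row per token. The sketch below illustrates the lookup with plain Python lists and a hypothetical vocabulary of 10 tokens; the random initial values stand in for weights that training would adjust.

```python
import random

vocabulary_size = 10   # hypothetical; the tutorial uses 10000
embedding_size = 8     # same as size_for_embedding in the tutorial

random.seed(0)
# One row per token; values start random and are adjusted during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]


def embed(tokens):
    """Replace each integer token with its row of the table."""
    return [embedding_table[t] for t in tokens]


embedded = embed([0, 0, 3, 7])
print(len(embedded), len(embedded[0]))  # 4 8
```

A sequence of 4 tokens becomes 4 vectors of length 8. The table itself holds vocabulary size times embedding size values, which for the tutorial's settings is 10000 x 8 = 80000 trainable weights.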

After the Embedding Layer, three Gated Recurrent Unit (GRU) Layers are added, each with its number of outputs. The output of the 1st GRU Layer is fed as input to the 2nd GRU Layer, and the output of the 2nd GRU Layer is fed as input to the 3rd GRU Layer.
The first two GRU Layers need to return sequences of data, because the next GRU Layer expects a sequence as its input. The last GRU Layer only needs to give its final output, which is fed to the Dense Layer.

In [23]:
selected_model.add(GRU(units=16, return_sequences=True))
selected_model.add(GRU(units=8, return_sequences=True))
selected_model.add(GRU(units=4))
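
The difference between returning a sequence and returning only the final output can be shown with a toy recurrence. This is not a GRU: the running average below is a made-up stand-in with a single number as its state, and it assumes a non-empty input.

```python
def toy_recurrent_layer(inputs, return_sequences=False):
    """Run a running-average recurrence over a non-empty list of numbers."""
    state = 0.0
    states = []
    for x in inputs:
        # The new state mixes the previous state with the current input.
        state = 0.5 * state + 0.5 * x
        states.append(state)
    # Either every intermediate state, or only the last one.
    return states if return_sequences else states[-1]


sequence = [1.0, 1.0, 1.0]
all_states = toy_recurrent_layer(sequence, return_sequences=True)
final_state = toy_recurrent_layer(all_states)  # stacked second layer
print(all_states, final_state)
```

The second call only works because the first one returned a full sequence, which mirrors why every GRU layer except the last is created with `return_sequences=True`.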

The last layer of the model is the Fully Connected Dense Layer that will compute the final output as a value between 0.0 and 1.0 which is used for classification.

In [24]:
selected_model.add(Dense(1, activation='sigmoid'))

The model is compiled with the binary cross-entropy loss and the Adam optimizer with a learning rate of 1e-3. Accuracy is tracked as a metric.

In [25]:
selected_model.compile(loss='binary_crossentropy',
                       optimizer=Adam(lr=1e-3),
                       metrics=['accuracy'])
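
For reference, binary cross-entropy can be written out in a few lines. The function below is a hand-written sketch of the standard formula, not the Keras implementation; the clipping constant `eps` is an assumption chosen to avoid taking the logarithm of zero.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predictions."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.6, 0.4])
print(confident, unsure)
```

Confident correct predictions give a lower loss than hesitant ones, and a prediction of exactly 0.5 costs log(2) regardless of the true label.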

The command below prints the model summary, which lists the layers of the compiled model with their output shapes and parameter counts.

In [26]:
selected_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru_1 (GRU)                  (None, None, 16)          1200      
_________________________________________________________________
gru_2 (GRU)                  (None, None, 8)           600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
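
The parameter counts in the summary can be reproduced by hand. The formula below assumes the classic GRU formulation with three gate blocks and one bias vector per block, which matches the numbers printed here; other GRU variants may count biases differently.

```python
def gru_params(input_dim, units):
    # Three gate blocks, each with input weights, recurrent weights and a bias.
    return 3 * (units * (input_dim + units) + units)


embedding_params = 10000 * 8          # vocabulary size x embedding size
gru_1 = gru_params(input_dim=8, units=16)
gru_2 = gru_params(input_dim=16, units=8)
gru_3 = gru_params(input_dim=8, units=4)
dense_params = 4 * 1 + 1              # four weights plus one bias

total = embedding_params + gru_1 + gru_2 + gru_3 + dense_params
print(gru_1, gru_2, gru_3, total)  # 1200 600 156 81961
```

Almost all of the 81,961 parameters sit in the embedding layer; the three GRU layers together contribute fewer than 2,000.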


<b><h2>Training the Recurrent Neural Network

The model is then trained on the padded sequences of the training set. 5% of the training dataset is held out as a validation set to get a rough estimate of how well the model is doing and whether it is overfitting.

In [27]:
%%time
selected_model.fit(training_data_padded_tokens,
                   outputs_for_training,
                   validation_split=0.05,
                   epochs=3,
                   batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/3

Epoch 2/3

Epoch 3/3

Wall time: 22min 22s


<tensorflow.python.keras._impl.keras.callbacks.History at 0x23db1f79940>
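
The 23750 / 1250 split follows directly from holding out 5% of 25,000 samples. The sketch below assumes the held-out samples are taken from the end of the data, which is how the Keras `validation_split` option is understood to work; the helper name is hypothetical.

```python
def split_for_validation(samples, fraction):
    """Hold out the last `fraction` of the samples for validation."""
    n_validation = round(len(samples) * fraction)
    split_at = len(samples) - n_validation
    return samples[:split_at], samples[split_at:]


train_part, validation_part = split_for_validation(list(range(25000)), 0.05)
print(len(train_part), len(validation_part))  # 23750 1250
```

If the data were sorted by class, a split taken from the end would contain only one class, so it is worth checking how the dataset is ordered before relying on this option.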

<b><h2>Performance on Test-Set

After training the model, its accuracy on the test set is calculated.

In [28]:
evaluation = selected_model.evaluate(testing_data_padded_tokens, outputs_for_testing)




In [29]:
print("Accuracy: {0:.2%}".format(evaluation[1]))

Accuracy: 81.18%


<b><h2>Mis-Classified Text Example

To show a Mis-Classified Text example, the predicted sentiment for the first 1000 texts in the testing set is calculated.

In [30]:
predicted_test_set = selected_model.predict(x=testing_data_padded_tokens[0:1000])
predicted_test_set = predicted_test_set.T[0]

The predicted values range between 0.0 and 1.0. A threshold/cutoff is used to predict the class as either 1.0 or 0.0. The values above 0.5 are taken as 1.0 and the below ones are taken as 0.0.

In [31]:
predicted_test_set_class = np.array([1.0 if p>0.5 else 0.0 for p in predicted_test_set])

The true classes of the first 1000 texts in the testing set are also needed, to compare them with the predicted classes.

In [32]:
true_test_set_class = np.array(outputs_for_testing[0:1000])

By comparing the predicted and true classes of the test set, the indices of all the incorrectly classified texts can be collected.

In [33]:
incorrect = np.where(predicted_test_set_class != true_test_set_class)
incorrect = incorrect[0]
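
The thresholding and comparison steps can be followed on a handful of made-up values. The scores and labels below are hypothetical, and plain lists are used in place of NumPy arrays.

```python
toy_scores = [0.91, 0.12, 0.55, 0.40, 0.73]   # hypothetical model outputs
toy_labels = [1.0, 0.0, 0.0, 1.0, 1.0]        # hypothetical true classes

# Scores above the 0.5 cutoff become class 1.0, the rest class 0.0.
toy_classes = [1.0 if s > 0.5 else 0.0 for s in toy_scores]

# Indices where the predicted class differs from the true class.
toy_incorrect = [i for i, (p, t) in enumerate(zip(toy_classes, toy_labels))
                 if p != t]

toy_accuracy = 1.0 - len(toy_incorrect) / len(toy_labels)
print(toy_incorrect, toy_accuracy)
```

The same list of indices also gives the accuracy on the inspected subset, which can be compared with the accuracy reported for the full test set.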

In [34]:
index = incorrect[0]
index

32

Below are the predicted and true classes for this mis-classified text.

In [35]:
predicted_test_set_class[index]

0.0

In [36]:
true_test_set_class[index]

1.0

<b><h2>New Texts

New texts can be added to see the results.

In [37]:
NewText1 = "This movie is great! I like it because it is very good!"
NewText2 = "Fantastic movie!"
NewText3 = "Maybe I do not like this movie."
NewText4 = "Meh ..."
NewText5 = "If I were a drunk teenager then this movie might be good."
NewText6 = "Worst movie!"
NewText7 = "Not a nice movie!"
NewText8 = "This movie really sucks! Can I get my money back please?"
Texts = [NewText1, NewText2, NewText3, NewText4, NewText5, NewText6, NewText7, NewText8]

Converting New Texts into Tokens

In [38]:
tokens_NewText = tokenizer.texts_to_sequences(Texts)

In [39]:
tokens_NewText

[[11, 17, 6, 78, 10, 37, 9, 84, 9, 6, 52, 49],
 [799, 17],
 [273, 10, 77, 21, 37, 11, 17],
 [],
 [43, 10, 70, 3, 1942, 2188, 91, 11, 17, 233, 26, 49],
 [246, 17],
 [21, 3, 331, 17],
 [11, 17, 62, 1692, 67, 10, 76, 56, 290, 141, 591]]

Padding Token Sequences of NewTexts

In [40]:
tokens_padded_NewText = pad_sequences(tokens_NewText, maxlen = maximum_sequence_length, padding='pre', truncating='pre')

In [41]:
tokens_padded_NewText

array([[  0,   0,   0, ...,   6,  52,  49],
       [  0,   0,   0, ...,   0, 799,  17],
       [  0,   0,   0, ...,  37,  11,  17],
       ...,
       [  0,   0,   0, ...,   0, 246,  17],
       [  0,   0,   0, ...,   3, 331,  17],
       [  0,   0,   0, ..., 290, 141, 591]])

In [42]:
tokens_padded_NewText.shape

(8, 544)

Prediction on New Texts Using the Trained Model

In [43]:
selected_model.predict(tokens_padded_NewText)

array([[0.92998534],
       [0.7742285 ],
       [0.6559391 ],
       [0.5634046 ],
       [0.75763464],
       [0.33406493],
       [0.76830375],
       [0.48407638]], dtype=float32)
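
The raw scores are easier to read next to their texts. The sketch below copies the texts and the scores from the output above and applies the same 0.5 cutoff used earlier; nothing is recomputed by the model.

```python
texts = [
    "This movie is great! I like it because it is very good!",
    "Fantastic movie!",
    "Maybe I do not like this movie.",
    "Meh ...",
    "If I were a drunk teenager then this movie might be good.",
    "Worst movie!",
    "Not a nice movie!",
    "This movie really sucks! Can I get my money back please?",
]
# Scores copied from the prediction output above.
scores = [0.92998534, 0.7742285, 0.6559391, 0.5634046,
          0.75763464, 0.33406493, 0.76830375, 0.48407638]

labels = ["positive" if s > 0.5 else "negative" for s in scores]
for text, score, label in zip(texts, scores, labels):
    print(f"{score:.2f}  {label:8}  {text}")
```

Two things stand out. "Maybe I do not like this movie." and "Not a nice movie!" both score above 0.5 although they read as negative, and "Meh ..." produced an empty token list, so its score comes from an input made up entirely of padding.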

<b><h2>Getting Embedded Layer and its Weights

To check out the weights in the embedding layer, get the embedding layer from the model.

In [44]:
Embedding_Layer = selected_model.get_layer('layer_embedding')

Get weights of the embedding layer

In [45]:
Embedding_Weights = Embedding_Layer.get_weights()[0]

Checking out the shape of Embedding Weights

In [47]:
Embedding_Weights.shape

(10000, 8)

To see the weights for a few examples, get tokens of a few words

In [48]:
token_for_word_nice = tokenizer.word_index['nice']
token_for_word_nice

331

In [49]:
token_for_word_terrific = tokenizer.word_index['terrific']
token_for_word_terrific

1322

Display Embedding weights of the corresponding tokens

In [50]:
Embedding_Weights[token_for_word_nice]

array([ 0.3455799 ,  0.31407207, -0.09928934,  0.15568393,  0.13026911,
        0.15672676,  0.9624171 ,  0.21042493], dtype=float32)

In [51]:
Embedding_Weights[token_for_word_terrific]

array([0.7768304 , 0.5274931 , 0.34701896, 1.1513089 , 0.8382184 ,
       0.83672684, 0.3298065 , 0.5625317 ], dtype=float32)

<b><h1><center>!---The End---!