<b><h1><center>Recurrent Neural Network (Using the IMDB Dataset)</h1>
<b><h2><center>Natural Language Processing - Sentiment Analysis

<b><h2>Importing Libraries

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Embedding, GRU, Dense
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

<b><h2>Importing Data

In [4]:
import imdb

In [5]:
imdb.maybe_download_and_extract()

Data has apparently already been downloaded and unpacked.


<b><h2>Loading Training and Testing Datasets Separately

In [6]:
dataset_text_for_training, outputs_for_training = imdb.load_data(train=True)

In [7]:
dataset_text_for_testing, outputs_for_testing = imdb.load_data(train=False)

<b><h2>Total Data

In [8]:
data_text_in_total = dataset_text_for_training + dataset_text_for_testing

<b><h2>Setting the Number of Unique Words Used by the Tokenizer

In [9]:
take_number_words = 10000

In [10]:
tokenizer = Tokenizer(num_words=take_number_words)

<b><h2>Fitting the Tokenizer on the Total Dataset

In [11]:
tokenizer.fit_on_texts(data_text_in_total)

<b><h2>Converting Texts of Training and Test Sets into Integer Tokens

In [12]:
tokens_training = tokenizer.texts_to_sequences(dataset_text_for_training)

In [13]:
tokens_testing = tokenizer.texts_to_sequences(dataset_text_for_testing)
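
The two steps above (fitting the tokenizer, then converting texts) can be illustrated with a simplified pure-Python sketch. It assumes words are ranked by frequency, with index 1 for the most frequent word, and that words ranked at or beyond the limit are silently dropped. The helper names `build_word_index` and `to_sequence` are hypothetical, and the real Tokenizer also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here; that choice is an assumption of this sketch.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

toy_texts = ["good movie", "bad movie", "good good plot"]
toy_index = build_word_index(toy_texts)
toy_sequence = to_sequence("good unknown plot movie", toy_index, num_words=3)
print(toy_index)
print(toy_sequence)
```

In this toy run, "unknown" is dropped because it was never seen and "plot" is dropped because its rank falls outside the limit, which is the same reason a text can end up with very few or no tokens.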

<b><h2>Calculating the Number of Integer Tokens in Each Text

In [14]:
tokens_number_totals = np.array([len(tokens) for tokens in tokens_training + tokens_testing])

<b><h2>Calculating the Maximum Number of Tokens for Each Sequence

In [15]:
maximum_sequence_length = int(np.mean(tokens_number_totals) + 2 * np.std(tokens_number_totals))
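
The mean plus two standard deviations is a heuristic cutoff, so it can help to check what fraction of texts fit under it without truncation. The sketch below uses synthetic lengths (the log-normal shape and all `toy_` names are assumptions); with the real data, `tokens_number_totals` and `maximum_sequence_length` would take their place.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical review lengths; the log-normal shape is only an assumption.
toy_lengths = rng.lognormal(mean=5.0, sigma=0.6, size=10_000).astype(int)

# Same rule as above: mean plus two standard deviations, truncated to an int.
toy_max_len = int(np.mean(toy_lengths) + 2 * np.std(toy_lengths))

# Fraction of texts that are no longer than the cutoff.
coverage = np.mean(toy_lengths <= toy_max_len)
print(toy_max_len, coverage)
```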

<b><h2>Padding the Token Sequences to have the Same Length

In [16]:
training_data_padded_tokens = pad_sequences(tokens_training, maxlen=maximum_sequence_length, padding='pre', truncating='pre')

In [17]:
testing_data_padded_tokens = pad_sequences(tokens_testing, maxlen=maximum_sequence_length, padding='pre', truncating='pre')
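
With `padding='pre'` and `truncating='pre'`, zeros are added on the left of short sequences and long sequences lose tokens from their start. The following numpy sketch imitates that behavior on toy data; `pad_pre` is a hypothetical helper written for illustration, not the library function, and it assumes `maxlen` is positive.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen tokens of long sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                 # 'pre' truncation drops tokens from the start
        if kept:                             # an empty sequence stays all zeros
            padded[row, -len(kept):] = kept  # 'pre' padding leaves the zeros on the left
    return padded

toy_padded = pad_pre([[5, 6], [1, 2, 3, 4, 5], []], maxlen=4)
print(toy_padded)
```

Padding on the left keeps the actual words next to the final time step, which is the state the last recurrent layer passes on.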

<b><h2>Inverse Mapping of Tokens to Convert them Back into Words

In [18]:
inverse_mapping_indices = tokenizer.word_index

In [19]:
inverse_map = dict(zip(inverse_mapping_indices.values(), inverse_mapping_indices.keys()))

<b><h2>Helper Function to Convert Tokens into Words

In [20]:
def translate(input_tokens):
    # Skip the padding value 0, which has no entry in the word index.
    output_words = [inverse_map[token] for token in input_tokens if token != 0]
    converted_string = " ".join(output_words)
    return converted_string
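
Padded sequences contain the value 0, which has no entry in the word index, so each individual token has to be compared with 0 before the lookup. A minimal sketch with a made-up three-word vocabulary (`toy_inverse_map` and `toy_translate` are hypothetical names):

```python
# Hypothetical inverse map from token to word.
toy_inverse_map = {1: "the", 2: "movie", 3: "great"}

def toy_translate(input_tokens):
    """Join the words of all non-zero tokens into one string."""
    return " ".join(toy_inverse_map[token] for token in input_tokens if token != 0)

toy_sentence = toy_translate([0, 0, 1, 2, 3])
print(toy_sentence)
```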

<b><h2>An Example of Comparing Original Text to the Text Translated from the Tokens

In [21]:
dataset_text_for_training[10]

'When I first read Armistead Maupins story I was taken in by the human drama displayed by Gabriel No one and those he cares about and loves. That being said, we have now been given the film version of an excellent story and are expected to see past the gloss of Hollywood...<br /><br />Writer Armistead Maupin and director Patrick Stettner have truly succeeded! <br /><br />With just the right amount of restraint Robin Williams captures the fragile essence of Gabriel and lets us see his struggle with issues of trust both in his personnel life(Jess) and the world around him(Donna).<br /><br />As we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives irrevocably. The request to review a book written by a young man turns into a life changing event that helps Gabriel find the strength within himself to carry on and move forward.<br /><br />It\'s to bad that most people will avoid this film. I only say th

In [22]:
np.array(tokens_training[10])

array([  50,   10,   86,  339,   64,   10,   13,  607,    8,   31,    1,
        395,  449, 4380,   31, 5078,   54,   27,    2,  143,   28, 2150,
         42,    2, 1358,   12,  109,  301,   73,   25,  146,   75,  358,
          1,   19,  318,    4,   32,  320,   64,    2,   23,  862,    5,
         63,  509,    1,    4,  369,    7,    7,  563,    2,  164, 2372,
         25,  371, 4163,    7,    7,   16,   39,    1,  203, 1120,    4,
       8163, 1943, 1858, 2446,    1, 7019, 3258,    4, 5078,    2, 1573,
        176,   63,   24, 1708,   16, 1262,    4, 1740,  195,    8,   24,
        114, 6641,    2,    1,  181,  183,   87, 5050,    7,    7,   14,
         73,   23, 1705,    5,    1, 1887,    8,   11,  449,   73,   23,
       1568,   12,  161,    6,  123,   14,    9,  184,    2,   12,    1,
       8927, 1560,   67,  665,  260,  468,    1, 7954,    5,  715,    3,
        275,  407,   31,    3,  186,  128,  511,   82,    3,  114, 2538,
       1560,   12, 1526, 5078,  165,    1, 2095,  7

In [23]:
translate(tokens_training[10])

"when i first read story i was taken in by the human drama displayed by gabriel no one and those he cares about and loves that being said we have now been given the film version of an excellent story and are expected to see past the of hollywood br br writer and director patrick have truly succeeded br br with just the right amount of restraint robin williams captures the fragile essence of gabriel and lets us see his struggle with issues of trust both in his life jess and the world around him donna br br as we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives the request to review a book written by a young man turns into a life changing event that helps gabriel find the strength within himself to carry on and move forward br br it's to bad that most people will avoid this film i only say that because the average american will probably think robin williams in a serious role that didn't work befo

<b><h2>Model Selection

In [24]:
selected_model = Sequential()

<b><h2>Embedding Tokens to feed the GRU

In [25]:
size_for_embedding = 8
selected_model.add(Embedding(input_dim=take_number_words,
                             output_dim=size_for_embedding,
                             input_length=maximum_sequence_length,
                             name='layer_embedding'))
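
Conceptually, an embedding layer is a lookup table: each integer token selects one row of a weight matrix. The numpy sketch below shows the resulting shapes on toy sizes; the random weights and all `toy_` names are assumptions for illustration, since the real weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_size = 10, 8
# One row of weights per token in the vocabulary.
toy_weights = rng.normal(size=(toy_vocab_size, toy_embedding_size))

# A batch of 2 padded sequences, each 4 tokens long.
toy_batch = np.array([[0, 0, 3, 7], [0, 1, 2, 9]])

# Integer indexing replaces every token with its embedding vector.
toy_embedded = toy_weights[toy_batch]
print(toy_embedded.shape)
```

The output shape is (batch, sequence length, embedding size), which is the three-dimensional input a recurrent layer expects.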

<b><h2>Addition of GRU Layers

In [26]:
selected_model.add(GRU(units=16, return_sequences=True))
selected_model.add(GRU(units=8, return_sequences=True))
selected_model.add(GRU(units=4))

<b><h2>Addition of a Dense Layer

In [27]:
selected_model.add(Dense(1, activation='sigmoid'))

<b><h2>Compiling Model for Training

In [28]:
selected_model.compile(loss='binary_crossentropy',
                       optimizer=Adam(lr=1e-3),
                       metrics=['accuracy'])
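
Binary cross-entropy penalizes confident wrong predictions much more than confident correct ones. A small numpy sketch of the formula follows; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```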

<b><h2>Model Summary

In [29]:
selected_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru_1 (GRU)                  (None, None, 16)          1200      
_________________________________________________________________
gru_2 (GRU)                  (None, None, 8)           600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
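
The parameter counts in the summary can be reproduced by hand. The sketch below assumes a GRU with three gates, each holding input weights, recurrent weights, and a single bias vector; this matches the numbers printed above, while GRU variants that keep a second bias vector would give larger counts. `gru_params` is a hypothetical helper.

```python
def gru_params(input_dim, units):
    """3 gates, each with input weights, recurrent weights, and one bias vector."""
    return 3 * (units * (input_dim + units) + units)

embedding_params = 10000 * 8          # vocabulary size times embedding size
gru_param_counts = [gru_params(8, 16), gru_params(16, 8), gru_params(8, 4)]
dense_params = 4 * 1 + 1              # 4 weights plus 1 bias
total_params = embedding_params + sum(gru_param_counts) + dense_params
print(gru_param_counts, total_params)
```

Almost all of the parameters sit in the embedding table; the three GRU layers and the dense layer together account for fewer than 2,000.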


<b><h2>Training the Model

In [30]:
%%time
selected_model.fit(training_data_padded_tokens,
                   outputs_for_training,
                   validation_split=0.05,
                   epochs=3,
                   batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/3

Epoch 2/3

Epoch 3/3

Wall time: 23min 42s


<tensorflow.python.keras._impl.keras.callbacks.History at 0x1ee0ed6d5f8>
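
The sample counts in the training log follow from `validation_split=0.05`. Keras takes the validation samples from the end of the arrays, before any shuffling, so if the loader returns reviews grouped by class, the validation set could be dominated by one class. The sketch below reproduces the arithmetic on a stand-in array; `toy_x` is a hypothetical placeholder for the padded training rows.

```python
import math
import numpy as np

n_samples, validation_split, batch_size = 25000, 0.05, 64
toy_x = np.arange(n_samples)  # stand-in for the padded training rows

# The last 5% of the rows are held out for validation.
split_at = n_samples - round(n_samples * validation_split)
toy_train, toy_validation = toy_x[:split_at], toy_x[split_at:]

# Number of weight updates per epoch; the last batch may be smaller.
batches_per_epoch = math.ceil(len(toy_train) / batch_size)
print(len(toy_train), len(toy_validation), batches_per_epoch)
```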

<b><h2>Evaluating the Model

In [31]:
evaluation = selected_model.evaluate(testing_data_padded_tokens, outputs_for_testing)




<b><h2>Printing Accuracy

In [32]:
print("Accuracy: {0:.2%}".format(evaluation[1]))

Accuracy: 87.12%


<b><h2>Model Prediction

In [33]:
predicted_test_set = selected_model.predict(x=testing_data_padded_tokens)

In [34]:
predicted_test_set = predicted_test_set.T[0]

In [35]:
predicted_test_set_class = np.array([1.0 if p>0.5 else 0.0 for p in predicted_test_set])
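
The same thresholding can be written as one vectorized numpy expression, which avoids the Python loop. A sketch on made-up probabilities (`toy_probabilities` is hypothetical):

```python
import numpy as np

toy_probabilities = np.array([0.93, 0.50, 0.28, 0.51])

# True where the probability exceeds 0.5, then converted to 1.0 / 0.0.
toy_classes = (toy_probabilities > 0.5).astype(float)
print(toy_classes)
```

Note that a probability of exactly 0.5 is assigned to class 0.0, as in the list comprehension above.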

<b><h2>True Classes of the Test Set

In [36]:
true_test_set_class = np.array(outputs_for_testing)

<b><h2>Comparing Predicted and True Classes of the Test Set to Find Incorrect Predictions

In [37]:
incorrect = np.where(predicted_test_set_class != true_test_set_class)
incorrect = incorrect[0]

In [38]:
predicted_test_set_class[379]

1.0

In [39]:
true_test_set_class[379]

1.0
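
Position 379 shows a predicted class equal to the true class. To look at an actual misclassification, one option is to take a position from `incorrect`, for example `incorrect[0]`. The sketch below shows the idea on toy arrays (all `toy_` names are hypothetical) and also derives the accuracy from the number of errors.

```python
import numpy as np

toy_predicted = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
toy_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Positions where the predicted class differs from the true class.
toy_incorrect = np.where(toy_predicted != toy_true)[0]

# Accuracy is the share of positions that are not in the error list.
toy_accuracy = 1.0 - len(toy_incorrect) / len(toy_true)

first_error = toy_incorrect[0]
print(toy_incorrect, toy_accuracy, toy_predicted[first_error], toy_true[first_error])
```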

<b><h2>Adding New Texts

In [40]:
NewText1 = "This movie is great! I like it because it is very good!"
NewText2 = "Fantastic movie!"
NewText3 = "Maybe I do not like this movie."
NewText4 = "Meh ..."
NewText5 = "If I were a drunk teenager then this movie might be good."
NewText6 = "Worst movie!"
NewText7 = "Not a nice movie!"
NewText8 = "This movie really sucks! Can I get my money back please?"
Texts = [NewText1, NewText2, NewText3, NewText4, NewText5, NewText6, NewText7, NewText8]

<b><h2>Converting New Texts into Tokens

In [41]:
tokens_NewText = tokenizer.texts_to_sequences(Texts)

In [42]:
tokens_NewText

[[11, 17, 6, 78, 10, 37, 9, 84, 9, 6, 52, 49],
 [799, 17],
 [273, 10, 77, 21, 37, 11, 17],
 [],
 [43, 10, 70, 3, 1942, 2188, 91, 11, 17, 233, 26, 49],
 [246, 17],
 [21, 3, 331, 17],
 [11, 17, 62, 1692, 67, 10, 76, 56, 290, 141, 591]]

<b><h2>Padding Token Sequences of NewTexts

In [43]:
tokens_padded_NewText = pad_sequences(tokens_NewText, maxlen=maximum_sequence_length, padding='pre', truncating='pre')

In [44]:
tokens_padded_NewText

array([[  0,   0,   0, ...,   6,  52,  49],
       [  0,   0,   0, ...,   0, 799,  17],
       [  0,   0,   0, ...,  37,  11,  17],
       ...,
       [  0,   0,   0, ...,   0, 246,  17],
       [  0,   0,   0, ...,   3, 331,  17],
       [  0,   0,   0, ..., 290, 141, 591]])

In [45]:
tokens_padded_NewText.shape

(8, 544)

<b><h2>Prediction on New Texts Using the Trained Model

In [46]:
selected_model.predict(tokens_padded_NewText)

array([[0.9365637 ],
       [0.8564043 ],
       [0.2860732 ],
       [0.81650347],
       [0.23368984],
       [0.30014473],
       [0.73337275],
       [0.09402247]], dtype=float32)
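
Pairing each text with its score makes the output easier to read. The sketch below copies the texts and scores printed above and applies the same 0.5 threshold used for the test set; it does not call the model.

```python
import numpy as np

texts = [
    "This movie is great! I like it because it is very good!",
    "Fantastic movie!",
    "Maybe I do not like this movie.",
    "Meh ...",
    "If I were a drunk teenager then this movie might be good.",
    "Worst movie!",
    "Not a nice movie!",
    "This movie really sucks! Can I get my money back please?",
]
# Scores copied from the prediction output above.
scores = np.array([0.9365637, 0.8564043, 0.2860732, 0.81650347,
                   0.23368984, 0.30014473, 0.73337275, 0.09402247])

labels = ["positive" if s > 0.5 else "negative" for s in scores]
for text, score, label in zip(texts, scores, labels):
    print(f"{score:.3f}  {label:8s}  {text}")
```

Two results deserve caution. "Meh ..." produced an empty token list, so its score of about 0.82 is the model's response to an all-zero input. "Not a nice movie!" is scored as positive, which suggests the model did not pick up the negation.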

<b><h2>Getting the Embedding Layer and its Weights

In [47]:
Embedding_Layer = selected_model.get_layer('layer_embedding')

In [48]:
Embedding_Weights = Embedding_Layer.get_weights()[0]

In [49]:
Embedding_Weights.shape

(10000, 8)

<b><h2>Tokens and Embedding Weights for a Few Words

In [50]:
token_for_word_nice = tokenizer.word_index['nice']
token_for_word_nice

331

In [51]:
token_for_word_terrific = tokenizer.word_index['terrific']
token_for_word_terrific

1322

In [52]:
Embedding_Weights[token_for_word_nice]

array([0.36549115, 0.77727985, 0.25779524, 0.24881174, 0.75840694,
       0.5090445 , 0.8220042 , 0.69435525], dtype=float32)

In [53]:
Embedding_Weights[token_for_word_terrific]

array([-0.06922172,  0.7642933 ,  0.13405271,  0.9164007 ,  0.5210624 ,
       -0.22038034,  0.22925146,  0.95140827], dtype=float32)
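
One way to compare two embedding vectors is the cosine distance, where 0 means the vectors point in the same direction and larger values mean less similar directions. The notebook imports `cdist` from SciPy but never calls it; the sketch below uses plain numpy instead and copies the two vectors printed above.

```python
import numpy as np

# Embedding vectors copied from the outputs above.
nice = np.array([0.36549115, 0.77727985, 0.25779524, 0.24881174,
                 0.75840694, 0.5090445, 0.8220042, 0.69435525])
terrific = np.array([-0.06922172, 0.7642933, 0.13405271, 0.9164007,
                     0.5210624, -0.22038034, 0.22925146, 0.95140827])

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

distance_nice_terrific = cosine_distance(nice, terrific)
print(distance_nice_terrific)
```

A single pair is hard to interpret on its own; the distance becomes meaningful when compared with the distances to other words, such as clearly negative ones.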
