# Positive-Negative Sentiments

**Are your customers happy?**

Consider the statement "This movie is not very good". The sentence ends with the words "very good", which indicate a very positive sentiment, but they are negated by the preceding word "not", so this text should be classified as having a negative sentiment.

Using this notebook as a guide, can you build a sentiment analysis classifier for KejaBot and have it used in Chatyfy - https://bit.ly/Chatyfy ? Pull this notebook, scrape some reviews from the websites and build your classifier.

Alternatively, download Chatyfy to access all the data from KejaBot and build a classifier that improves on this one. You will get a unique code to play around with KejaBot Customer Reviews Data on Real-Estates around Nairobi.

**Processing Steps**

1. Convert raw text into tokens, which are integers - These are indices into a list of the entire vocabulary.
2. Convert the tokens into embeddings - These are real-valued vectors whose mapping is trained along with the neural network, so that words with similar meanings are mapped to similar embedding-vectors, i.e. vectors that are close together in the embedding space (for example as measured by cosine similarity).
3. Feed the embedding-vectors into an RNN - The RNN takes sequences of arbitrary length as input and outputs a kind of summary of what it has seen in the input.
4. Squash the output of the RNN using a sigmoid function - This gives a prediction value between 0.0 and 1.0, where 0.0 means a negative sentiment and 1.0 means a positive sentiment.

**RNN Re-Cap**

Types of recurrent units within an RNN:
1. LSTM - Long Short-Term Memory
2. GRU - Gated Recurrent Unit
 
A gated recurrent unit has an internal state that is updated every time the unit receives a new input. This state acts as a kind of memory: it stores floating-point values that are read and written using matrix operations, so that all the operations are differentiable. This means the memory-state can store arbitrary floating-point values (although typically limited to between -1.0 and 1.0) and the network can therefore be trained like a normal neural network using Gradient Descent.
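
To make the idea of a differentiable memory-state concrete, here is a minimal NumPy sketch of a single GRU time-step. It assumes one common formulation of the GRU (update gate, reset gate, candidate state); the names `gru_step` and `init_params`, the layer sizes and the random weights are all illustrative assumptions, not what Keras does internally.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One time-step of a GRU in one common formulation (an assumption here)."""
    # Update gate: how much of the old state to keep.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])
    # Reset gate: how much of the old state feeds the candidate state.
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])
    # Candidate state, limited to (-1, 1) by tanh.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    # New state: a gate-weighted blend of the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_cand

def init_params(input_dim, units, rng):
    """Random, untrained weights purely for illustration."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(scale=0.1, size=(units, input_dim))
        p["U" + gate] = rng.normal(scale=0.1, size=(units, units))
        p["b" + gate] = np.zeros(units)
    return p

rng = np.random.RandomState(0)
params = init_params(input_dim=8, units=4, rng=rng)

h = np.zeros(4)            # the memory-state starts at zero
x = rng.normal(size=8)     # stand-in for one embedding-vector
h = gru_step(x, h, params)
print(h)                   # every value lies strictly between -1.0 and 1.0
```

Every operation in `gru_step` is a matrix product or an element-wise smooth function, which is what allows gradients to flow through the unit during training.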

![title](images/gated-recurrent-unit.png)



**Un-Rolled Network**
![title](images/unrolled-gated-recurrent-unit.png)

**Steps in an Un-Rolled Network - How the sequence of words is processed**
1. The initial memory-state of the recurrent unit is set to zero internally by Keras / TensorFlow every time a sequence begins.
2. In the first time-step, the word "this" is input to the recurrent unit, which uses its internal state (initialized to zero) and its gate to calculate the new state.
3. In the second time-step, the word "is" is input to the recurrent unit, which now uses the internal state that was just updated from seeing the previous word "this". At this point the recurrent unit does not need to save anything in its internal state, since not much can be deduced from the words "this is".
4. In the third time-step, the word "not" is input. The recurrent unit will have learned that this word may be important for determining the overall sentiment of the input-text, so it needs to be stored in the memory-state of the recurrent unit, where it can be used later when the recurrent unit sees the word "good" in time-step 6.
5. When the entire sequence of words has been processed, the recurrent unit outputs a vector of values that summarizes what it has seen in the input sequence. A fully connected layer is then used with a sigmoid function to produce a single value between 0.0 and 1.0, which we interpret as the sentiment being either negative or positive.
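
The steps above can be sketched as a plain loop over the words. This is only an illustration: the example sentence is an assumption chosen so that "good" arrives at time-step 6, the recurrent unit is simplified to a single `tanh` update rather than a real GRU, and the weights are random and untrained, so the printed output has no meaning as a sentiment.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Assumed example sentence; index 0 is reserved for padding.
words = "this is not a very good movie".split()
vocab = {word: i + 1 for i, word in enumerate(words)}
tokens = [vocab[w] for w in words]

rng = np.random.RandomState(0)
embedding_size, state_size = 8, 4
embedding = rng.normal(scale=0.1, size=(len(vocab) + 1, embedding_size))
W = rng.normal(scale=0.1, size=(state_size, embedding_size))  # input weights
U = rng.normal(scale=0.1, size=(state_size, state_size))      # state weights
w_out = rng.normal(scale=0.1, size=state_size)                # dense-layer weights

state = np.zeros(state_size)                 # step 1: the state starts at zero
for step, token in enumerate(tokens, start=1):
    x = embedding[token]                     # embedding-vector of the current word
    state = np.tanh(W @ x + U @ state)       # new state from the input and the old state
    print(step, words[step - 1], np.round(state, 3))

probability = sigmoid(w_out @ state)         # step 5: squash the summary to (0, 1)
print("output:", probability)
```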

**Layered Un-Rolled Network**

![title](images/layered-unrolled-gated-recurrent-unit.png)

**How the layered un-rolled RNN works**

Assuming we have a three-layered unrolled neural network as depicted in the above image, the first layer works much like the unrolled figure above for a single-layer RNN. First the recurrent unit RU1 has its internal state initialized to zero by Keras / TensorFlow. Then the word "this" is input to RU1 and it updates its internal state. Then it processes the next word "is", and so forth. But instead of outputting a single summary value at the end of the sequence, we use the output of RU1 for every time-step. This creates a new sequence that can then be used as input for the next recurrent unit RU2. The same process is repeated for the second layer, and this creates a new output sequence which is then input to the third layer's recurrent unit RU3, whose final output is passed to a fully-connected Sigmoid layer that outputs a value between 0.0 (negative sentiment) and 1.0 (positive sentiment).
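
A rough NumPy sketch of this stacking is shown below. It assumes a simplified `tanh` recurrent layer in place of a real GRU, random untrained weights, and layer sizes of 16, 8 and 4 units to mirror the model built later in this notebook; `run_layer` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def run_layer(inputs, units, rng):
    """Run a simplified tanh recurrent layer and return its state at every time-step."""
    input_dim = inputs.shape[1]
    W = rng.normal(scale=0.1, size=(units, input_dim))  # input weights
    U = rng.normal(scale=0.1, size=(units, units))      # state weights
    state = np.zeros(units)                             # state starts at zero
    outputs = []
    for x in inputs:
        state = np.tanh(W @ x + U @ state)
        outputs.append(state)
    return np.stack(outputs)                            # shape: (time_steps, units)

rng = np.random.RandomState(0)
time_steps, embedding_size = 7, 8
embedded = rng.normal(size=(time_steps, embedding_size))  # stand-in for embedded words

seq1 = run_layer(embedded, 16, rng)   # RU1 outputs a full sequence
seq2 = run_layer(seq1, 8, rng)        # RU2 consumes RU1's sequence
seq3 = run_layer(seq2, 4, rng)        # RU3 consumes RU2's sequence
summary = seq3[-1]                    # only RU3's final output is used

w_out = rng.normal(scale=0.1, size=4)
probability = sigmoid(w_out @ summary)
print(seq1.shape, seq2.shape, summary.shape, probability)
```

Keeping the output of every time-step for the first two layers corresponds to `return_sequences=True` in the Keras model further down.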

**NB:**
- The new state value depends on the old state value as well as the input value. For example, if the internal state has memorized that we have recently seen the word "not" and the current input is "good", then we need to store a new state value that memorizes "not good", which indicates a negative sentiment.
- The part of the recurrent unit that is responsible for mapping old state values and inputs to the new state value is called a gate, which is just a type of matrix operation.
- There is also a gate that is responsible for calculating the output of the recurrent unit.
- In order to train the recurrent unit, we must gradually change the weight-matrices of the gates so that the recurrent unit gives the desired output for an input sequence. This is done automatically by TensorFlow.
- When unrolled, the recurrent unit shows how the sequence of words is processed in an RNN, one time-step at a time.

**Loss function, Exploding & Vanishing Gradients**

In order to train the weights for the gates inside the recurrent unit, we need to minimize some loss function, which is a measure of the difference between the actual output of the network and the desired output.
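
As an illustration of such a loss, the model later in this notebook is compiled with `loss='binary_crossentropy'`. The following is a minimal NumPy sketch of that formula, not the Keras implementation; the function name and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean()

# Confident and correct predictions give a small loss ...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
# ... while confident and wrong predictions give a large loss.
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```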

From the "unrolled" figures above we see that the recurrent units are applied recursively for each word in the input sequence. This means the recurrent gate is applied once for each time-step. The gradient-signals have to flow back (back-propagation) from the loss-function all the way to the first time the recurrent gate is used. If the gradient of the recurrent gate is multiplicative, then we essentially have an exponential function of the number of time-steps.

In this tutorial we will use texts that have more than 500 words. This means the RU's gate for updating its internal memory-state is applied recursively more than 500 times. 

If a gradient of just 1.01 is multiplied with itself 500 times then it gives a value of about 145. 

If a gradient of just 0.99 is multiplied with itself 500 times then it gives a value of about 0.007. 

These are called exploding and vanishing gradients. The only gradients that can survive recurrent multiplication are 0 and 1.
To avoid these so-called exploding and vanishing gradients, care must be taken when designing the recurrent unit and its gates.

That is why the actual implementation of the GRU is more complicated, because it tries to send the gradient back through the gates without this distortion.
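
The arithmetic behind the exploding and vanishing gradients can be checked directly. This small sketch simply raises a per-step factor to the power of 500; the variable names are only for illustration.

```python
steps = 500

exploding = 1.01 ** steps   # a factor slightly above 1 grows very large
stable = 1.0 ** steps       # a factor of exactly 1 is unchanged
vanishing = 0.99 ** steps   # a factor slightly below 1 shrinks towards 0

print("1.01 ** 500 =", exploding)
print("1.00 ** 500 =", stable)
print("0.99 ** 500 =", vanishing)
```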

# Workbench

In [85]:
# Import the standard libraries
import numpy as np
import pandas as pd

# Visualisation Imports
import matplotlib.pyplot as plt
from matplotlib import gridspec
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns

# import tensorflow
import tensorflow as tf

# import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# import scipy
from scipy.spatial.distance import cdist

# Enable logging and warnings
import logging
import warnings

# Import the IMDB Data
import imdb

# Import SKLearn Libraries
from sklearn.preprocessing import LabelBinarizer

import operator

In [64]:
# Setting the configurations
pd.set_option('display.max_columns',100) #Displays all the columns in a dataframe
pd.set_option('display.max_colwidth',10000) #Display all the text in a dataframe column

# Set the imdb download folder
imdb.data_dir = "data/IMDB/"

%matplotlib inline
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
warnings.filterwarnings("ignore",category=DeprecationWarning)
sns.set()

**Check the Versions of the libraries**

In [65]:
print("Tensorflow version : {} ".format(tf.__version__))

Tensorflow version : 1.14.0 


In [66]:
print("Keras Version : {} ".format(tf.keras.__version__))

Keras Version : 2.2.4-tf 


**Load the data**

In [67]:
#importing the training data
imdb_data=pd.read_csv('data/IMDB/IMDBDataset.csv',engine="python")
print(imdb_data.shape)
imdb_data.head(5)

(50000, 2)


Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive


In [68]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,"Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.",negative
freq,5,25000


In [69]:
# Label the sentiment data by making the target variable either 1 or 0
lb=LabelBinarizer()
imdb_data['sentiment']=lb.fit_transform(imdb_data['sentiment'])

In [70]:
#sentiment count
imdb_data['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In [71]:
#split the dataset 

#train dataset
x_train_text=imdb_data.review[:40000]
y_train=imdb_data.sentiment[:40000]

#test dataset
x_test_text=imdb_data.review[40000:]
y_test=imdb_data.sentiment[40000:]

In [72]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))
print("=========================================")
print("The train dataset has {} rows of text and {} rows of sentiment classes. ".format(x_train_text.shape[0],y_train.shape[0]))
print("The test dataset has {} rows of text and {} rows of sentiment classes. ".format(x_test_text.shape[0],y_test.shape[0]))

Train-set size:  40000
Test-set size:   10000
The train dataset has 40000 rows of text and 40000 rows of sentiment classes. 
The test dataset has 10000 rows of text and 10000 rows of sentiment classes. 


In [88]:
# Combine the dataset for easier processing
data_text = np.concatenate((x_train_text,x_test_text),axis=0)  

In [110]:
# Visualize a sample training set text just to confirm that it is text as expected
x_train_text[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [91]:
# Visualize a sample training set class just to confirm that it is a class as expected
y_train[0]

1

## Tokenize the data

We do this in two steps:
1. Convert the words into integers - done on the dataset before input into the NN
2. Embedding Layer

In [59]:
# Instruct the tokenizer to use 10000 most popular words from the data-set
num_words = 10000
tokenizer = Tokenizer(num_words=num_words)

**Scan through the text, strip it of unwanted characters such as punctuation, and convert it to lowercase.
The tokenizer then builds a vocabulary of all unique words, along with various data structures for accessing the data.**
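
To show roughly what happens during this scan, here is a pure-Python sketch of a frequency-ranked tokenizer. It is not the Keras `Tokenizer` itself: the `FILTERS` string, the helper names and the handling of `num_words` are assumptions made for illustration.

```python
# Characters treated as separators in this sketch (an assumption for illustration).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase the text, turn filtered characters into spaces and split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = {}
    for text in texts:
        for word in split_words(text):
            counts[word] = counts.get(word, 0) + 1
    # Sort by frequency, highest first; ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Convert texts to lists of integer tokens, dropping unknown or rare words."""
    sequences = []
    for text in texts:
        seq = []
        for word in split_words(text):
            idx = word_index.get(word)
            if idx is None:
                continue                      # word was never seen while fitting
            if num_words is not None and idx >= num_words:
                continue                      # word is outside the kept vocabulary
            seq.append(idx)
        sequences.append(seq)
    return sequences

texts = ["This movie is NOT very good!", "This movie is good."]
word_index = build_word_index(texts)
print(word_index)
print(texts_to_sequences(["Not good, not good at all"], word_index))
```

Note how markup such as `<br />` is reduced to the word `br` once the punctuation is stripped, which is why `br` shows up among the most frequent words in the vocabulary of this dataset.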

In [92]:
%%time
tokenizer.fit_on_texts(data_text)

Wall time: 10.6 s


In [93]:
# Inspect the vocabulary that has been gathered by the tokenizer
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

In [94]:
# We use the tokenizer to convert the text in the training set to lists of these tokens
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)

In [95]:
# Look at the raw text of one review in the training set (index 1) before viewing its tokens
x_train_text[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [96]:
# The equivalent in tokens is
np.array(x_train_tokens[1])

array([   3,  393,  120,  353,    7,    7,    1, 1385, 2977,    6,   52,
         52,  155,   55, 2381, 1582,    2,  411,    3,    2,  530,  282,
          4, 1847,    5,    1,  438,  412,    7,    7,    1,  150,   23,
        568,   69, 2274,  498, 4571,   21,   61,   45,  189,   29,    1,
         18,   28,   45,   29,    1, 2294,  175, 3336,   96,   22,   67,
        371,   63,    1,  791, 9719,   31,    1, 1825,    5, 7366, 6594,
         21,   61,    6,    9,   69,  278,    1,  147,   18,    9,    6,
          3,  407,    2, 2406,  412,    3, 4339,  353,   42,   27,    4,
          1,   78,    4,  202,    2,   24,  114,    7,    7,    1, 1847,
         62,  270,  344,   16,    1,  120,  177,    1, 1029,    4,    1,
       2924,   60,  248,   71,  356,    1, 2206, 3127, 1289, 1192,   91,
       4911,    9,  297,   20,  260, 1830,    2,  260, 4592,  583,   16,
          1,  134, 3690,    2,    2,    1,  730,  583,    4,   65, 1054,
         16,  170, 2297,   23, 1977,   69,  221])

In [97]:
# Tokenize the test text in the same way
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)

**Padding or Truncating the Data**

Since we feed the data in whole batches, the sequences need to have the same length, which can be achieved in either of the following ways:
1. Ensure all sequences in the entire dataset have the same length by padding to the length of the longest sequence - Simpler, but takes a lot of memory.
2. Pad only within each batch, so that all sequences in the same batch have the same length.

A compromise is to choose a sequence length that covers most sequences in the data-set, then truncate the longer sequences and pad the shorter ones.
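
The effect of this compromise on a single sequence can be sketched in plain Python. The helper below is hypothetical and only mimics the 'pre' mode that this notebook selects for both padding and truncating; it is not the Keras `pad_sequences` function.

```python
def pad_or_truncate_pre(tokens, maxlen, value=0):
    """Keep the last `maxlen` tokens, then left-pad with `value` up to `maxlen`."""
    kept = tokens[-maxlen:]                        # 'pre' truncation drops tokens from the start
    return [value] * (maxlen - len(kept)) + kept   # 'pre' padding adds zeros at the start

print(pad_or_truncate_pre([3, 393, 120], 5))       # short sequence gets zeros in front
print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6], 4))  # long sequence loses its first tokens
```

With 'pre' padding the real words sit at the end of the sequence, so they are the last thing the recurrent unit sees before producing its output.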

In [112]:
# Count the number of tokens in all the sequences in the dataset
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

In [113]:
# The average number of tokens in the sequence is :
np.mean(num_tokens)

221.27714

In [114]:
# The maximum number of tokens in a sequence is : 
np.max(num_tokens)

2209

In [115]:
# The sequence length we shall use is the average plus two standard deviations
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [117]:
# This will cover about 95% of the entire dataset
np.sum(num_tokens < max_tokens) / len(num_tokens)*100

94.53

In [118]:
# Select the preferred padding method, either 'pre' or 'post'
pad = 'pre'

In [121]:
# Pad the training dataset
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)

In [122]:
# View the shape of the new training dataset matrix
x_train_pad.shape

(40000, 544)

In [120]:
# Pad the testing dataset
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

In [123]:
# View the shape of the new testing dataset matrix
x_test_pad.shape

(10000, 544)

In [124]:
# An example before padding
np.array(x_train_tokens[1])

array([   3,  393,  120,  353,    7,    7,    1, 1385, 2977,    6,   52,
         52,  155,   55, 2381, 1582,    2,  411,    3,    2,  530,  282,
          4, 1847,    5,    1,  438,  412,    7,    7,    1,  150,   23,
        568,   69, 2274,  498, 4571,   21,   61,   45,  189,   29,    1,
         18,   28,   45,   29,    1, 2294,  175, 3336,   96,   22,   67,
        371,   63,    1,  791, 9719,   31,    1, 1825,    5, 7366, 6594,
         21,   61,    6,    9,   69,  278,    1,  147,   18,    9,    6,
          3,  407,    2, 2406,  412,    3, 4339,  353,   42,   27,    4,
          1,   78,    4,  202,    2,   24,  114,    7,    7,    1, 1847,
         62,  270,  344,   16,    1,  120,  177,    1, 1029,    4,    1,
       2924,   60,  248,   71,  356,    1, 2206, 3127, 1289, 1192,   91,
       4911,    9,  297,   20,  260, 1830,    2,  260, 4592,  583,   16,
          1,  134, 3690,    2,    2,    1,  730,  583,    4,   65, 1054,
         16,  170, 2297,   23, 1977,   69,  221])

In [125]:
# The same text after padding
x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

## Inverse Token

Older versions of Keras had no function for converting tokens back to words (more recent versions of the `Tokenizer` provide `sequences_to_texts`), so here we build our own inverter from tokens back to strings.

In [126]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

# Build the helper function
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text

In [127]:
# If we pick an original text from the dataset
x_train_text[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [128]:
# Converting its equivalent tokens back to a string would yield
tokens_to_string(x_train_tokens[1])

'a wonderful little production br br the filming technique is very very old time bbc fashion and gives a and sometimes sense of realism to the entire piece br br the actors are extremely well chosen michael sheen not only has got all the but he has all the voices down pat too you can truly see the editing guided by the references to diary entries not only is it well worth the watching but it is a written and performed piece a masterful production about one of the great of comedy and his life br br the realism really comes home with the little things the fantasy of the guard which rather than use the traditional techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning and and the sets particularly of their flat with every surface are terribly well done'

## The RNN

In [129]:
# Create the model
model = Sequential()

The mapping of integer tokens to real-valued vectors is called an embedding. The first layer in the RNN is the Embedding Layer, which converts the integer tokens derived above into vectors of real values. This is needed mainly because:
1. Integer tokens may take on values between 0 and 10000 for a vocabulary of 10000 words, and the RNN cannot work with values over such a wide range.
2. The embedding layer learns to map words with similar semantic meanings to similar embedding-vectors.

The embedding is stored as a matrix with one row per token, so we can quickly look up the vector for each integer-token by simply using the token as an index into the matrix.
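
The lookup itself can be sketched with NumPy fancy indexing. The matrix below is filled with random values as a stand-in for trained weights, and the sample tokens are taken from the vocabulary printed earlier (for example 11 for "this" and 49 for "good"), with two leading zeros standing for padding.

```python
import numpy as np

num_words = 10000      # vocabulary size, as used for the tokenizer
embedding_size = 8     # length of each embedding-vector

# Random stand-in for the trained embedding matrix: one row per integer token.
rng = np.random.RandomState(0)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(num_words, embedding_size))

# A short padded sequence of integer tokens.
tokens = np.array([0, 0, 11, 17, 6, 21, 52, 49])

# Using the tokens as row indices returns one embedding-vector per token.
vectors = embedding_matrix[tokens]
print(vectors.shape)
```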

In [130]:
# Define the size of the embedding vector for each integer token.
# The values in the embedding vectors generally fall between -1.0 and 1.0.
# The size of the embedding vectors is typically selected between 100-300, but small values seem to work reasonably well
# for sentiment analysis
embedding_size = 8

Besides the embedding size, the embedding layer will also need:
1. Number of words in the vocabulary - num_words
2. Length of the padded tokens - max_tokens which is the chosen custom length of the sequence as derived above
3. A Name - The layer will also need a name because we shall need to retrieve the weights at later stages

In [131]:
# Add the embedding layer
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


In [132]:
# Add the first Gated Recurrent Unit with 16 output units. Since we are going to add another GRU layer, its output
# should be a sequence, which is what the next GRU expects
model.add(GRU(units=16, return_sequences=True))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


In [133]:
# Add the second GRU layer with eight output units and since the output of this GRU forms the input of the next GRU then
# this GRU needs to return sequences
model.add(GRU(units=8, return_sequences=True))

In [134]:
# Add the final GRU layer with 4 output units. It feeds into a dense layer, so it forwards only its final output
# and NOT a sequence as the previous GRU layers do
model.add(GRU(units=4))

In [135]:
# Add a fully connected layer i.e. A dense layer that will output a value between 0.0 and 1.0 that will be used as the 
# classification output. This Dense layer uses the sigmoid activation function
model.add(Dense(1, activation='sigmoid'))

In [136]:
# Create the Adam optimizer with the given learning rate
optimizer = Adam(lr=1e-3)

In [137]:
# Compile the model so that it is ready for training
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
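
To make the `binary_crossentropy` loss used in `model.compile` more concrete, here is a minimal NumPy sketch of the textbook formula. It is not the Keras implementation, and the clipping constant `eps` is an assumption chosen only to avoid `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the probabilities so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean()

# The same two labels, predicted confidently right and confidently wrong.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)  # roughly 0.105 and 2.303
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is what pushes the sigmoid output towards the true label during training.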


In [138]:
# View the model Summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1200      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
=================================================================
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
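
The parameter counts in the summary can be checked by hand. The sketch below is a plain-Python calculation, not a Keras call. The helper names are hypothetical, and the GRU formula assumes one bias vector per gate, which is what the counts in this summary correspond to.

```python
def embedding_params(vocab_size, embedding_size):
    # One embedding vector per word in the vocabulary.
    return vocab_size * embedding_size

def gru_params(input_dim, units, bias_vectors=1):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights and `bias_vectors` bias vectors.
    return 3 * (units * (input_dim + units) + bias_vectors * units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(10000, 8),  # layer_embedding
    gru_params(8, 16),           # gru
    gru_params(16, 8),           # gru_1
    gru_params(8, 4),            # gru_2
    dense_params(4, 1),          # dense
]
print(counts, sum(counts))  # [80000, 1200, 600, 156, 5] 81961
```

Almost all parameters sit in the embedding layer. Note that newer Keras versions default to `reset_after=True` for GRU layers, which, as far as we know, keeps two bias vectors per gate, so the same architecture may report slightly larger GRU counts there (`bias_vectors=2` in this sketch).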


**Training the RNN**

**NB:**

- We are using the training dataset with the padded sequence
- We use 5% of the training set as the validation set, to get a rough idea of whether the model generalizes well or is over-fitting the training set
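
Keras documents `validation_split` as holding out the last fraction of the samples, taken before any shuffling. The sketch below mimics that behaviour on placeholder arrays. `split_tail` is a hypothetical helper, and the 40000 samples are only stand-ins with the same count as the training run in this notebook.

```python
import numpy as np

def split_tail(x, y, validation_split):
    """Hold out the last fraction of the samples, taken before any shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Placeholder data with as many samples as the training set used here.
x_demo = np.arange(40000)
y_demo = np.arange(40000) % 2
(train_x, train_y), (val_x, val_y) = split_tail(x_demo, y_demo, 0.05)
print(len(train_x), len(val_x))  # 38000 2000
```

Because the held-out part is simply the tail of the array, it is worth checking that the training data is not sorted by class before relying on this split.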

In [139]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)

Train on 38000 samples, validate on 2000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Wall time: 18min 10s


<tensorflow.python.keras.callbacks.History at 0x1df59ea04a8>

**How does the model perform on the test data?**

In [140]:
%%time
result = model.evaluate(x_test_pad, y_test)

Wall time: 47.6 s


In [141]:
# Print the accuracy achieved on the test data
print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 89.02%


**View mis-classified text**

Steps to show mis-classified texts:
1. Calculate the predicted sentiment for the first 1000 texts in the test-set
2. Apply a threshold / cut-off to the predictions to get a predicted class
3. Get the true classes for the same 1000 texts
4. Find the texts among those 1000 that were mis-classified
5. Display the mis-classified text
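
The steps above can be sketched end to end on toy data. Everything in this example is made up for illustration: the probabilities stand in for `model.predict(...)` output, and `texts_demo` stands in for the raw test texts.

```python
import numpy as np

# Toy stand-ins for the raw texts, the predicted probabilities and the true classes.
texts_demo = ["loved it", "quite good", "dull plot", "awful", "worth watching"]
y_pred_demo = np.array([0.92, 0.35, 0.61, 0.08, 0.50])
y_true_demo = np.array([1.0, 1.0, 0.0, 0.0, 1.0])

# Step 2: probabilities above 0.5 count as positive (1.0), the rest as negative (0.0).
cls_pred_demo = (y_pred_demo > 0.5).astype(float)

# Step 4: positions where the predicted and true classes disagree.
incorrect_demo = np.where(cls_pred_demo != y_true_demo)[0]
print(incorrect_demo)  # [1 2 4]

# Step 5: show each mis-classified text with its score and true class.
for i in incorrect_demo:
    print("{0:.2f} (true class {1:.0f}): {2}".format(y_pred_demo[i], y_true_demo[i], texts_demo[i]))
```

A prediction of exactly 0.5 falls on the negative side with this cut-off, as the last toy example shows.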

In [143]:
%%time
# Calculate the predicted sentiment for the first 1000 texts in the test set
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

Wall time: 4.5 s


In [144]:
# Define the threshold because the predicted values fall between 0.0 and 1.0
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])

In [145]:
# Define the true class and get the values for the true classes within the range
cls_true = np.array(y_test[0:1000])

In [146]:
# Get the indices of all the texts that were incorrectly classified
incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

In [147]:
# Out of the first 1000 texts, how many were mis-classified?
len(incorrect)

103

In [150]:
# Get the index of the second mis-classified text (incorrect[0] would be the first)
idx = incorrect[1]
idx

18

**How does the model perform out-of-sample?**

In [160]:
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

In [161]:
# Convert the texts to arrays of integer-tokens because that is needed by the model
tokens = tokenizer.texts_to_sequences(texts)

In [162]:
# Pad and truncate the token sequences to the same length used for training
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 544)

In [163]:
# Do the predictions
model.predict(tokens_pad)

array([[0.91632265],
       [0.68180573],
       [0.4200648 ],
       [0.3542525 ],
       [0.34074825],
       [0.18598664],
       [0.6484486 ],
       [0.12222669]], dtype=float32)
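
The raw array is easier to read when each score is lined up with its text. The sketch below copies the eight texts and the scores printed above and applies the same 0.5 cut-off used earlier. The names `demo_texts`, `demo_scores` and `demo_labels` are only for this illustration.

```python
demo_texts = [
    "This movie is fantastic! I really like it because it is so good!",
    "Good movie!",
    "Maybe I like this movie.",
    "Meh ...",
    "If I were a drunk teenager then this movie might be good.",
    "Bad movie!",
    "Not a good movie!",
    "This movie really sucks! Can I get my money back please?",
]
# Scores copied from the prediction output of this particular training run.
demo_scores = [0.91632265, 0.68180573, 0.4200648, 0.3542525,
               0.34074825, 0.18598664, 0.6484486, 0.12222669]

demo_labels = ["positive" if s > 0.5 else "negative" for s in demo_scores]
for text, score, label in zip(demo_texts, demo_scores, demo_labels):
    print("{0:.3f}  {1:8s}  {2}".format(score, label, text))
```

In this run "Not a good movie!" scores about 0.65 and lands on the positive side, so the model has not fully learned the negation case described at the start of the notebook.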

**Embeddings**

The embeddings are learned along with the rest of the model during training. Ideally the embedding would learn a mapping where words that are similar in meaning also have similar embedding-values.

In [164]:
# Get the embedding layer of the model
layer_embedding = model.get_layer('layer_embedding')

In [165]:
# Get the weights used for the mapping done by the embedding
weights_embedding = layer_embedding.get_weights()[0]

In [166]:
# Display the weights - this is a single matrix with one row per word in the vocabulary and one column per
# embedding dimension, i.e. a lookup table indexed by the integer token
weights_embedding

array([[-0.07242753, -0.11667494, -0.05893565, ..., -0.00454331,
        -0.03069109, -0.02718002],
       [-0.03614771,  0.02394958, -0.01193809, ..., -0.01257587,
         0.02597064, -0.02292734],
       [-0.0534177 ,  0.0277775 , -0.02637164, ...,  0.02117256,
         0.03115044, -0.03246065],
       ...,
       [-0.00314166,  0.04078488, -0.03987049, ..., -0.00361507,
         0.04128303,  0.00029174],
       [ 0.01949737,  0.0014849 , -0.04249571, ...,  0.03466958,
        -0.03937452, -0.00418026],
       [ 0.07824972,  0.01078728, -0.04018669, ..., -0.02829858,
        -0.0210399 , -0.03831206]], dtype=float32)

In [167]:
weights_embedding.shape

(10000, 8)
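
Since the weights form a lookup table, embedding a sequence amounts to picking rows by token ID. The following NumPy sketch shows that idea on a tiny made-up matrix. `toy_weights` and `toy_tokens` are hypothetical and much smaller than the real (10000, 8) matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy lookup table: a vocabulary of 6 tokens, each mapped to a vector of length 3.
toy_weights = rng.normal(scale=0.05, size=(6, 3))

# A padded toy sequence; token 0 is the padding value.
toy_tokens = np.array([0, 0, 4, 2, 4])

# Row indexing performs the lookup: one embedding vector per token.
toy_embedded = toy_weights[toy_tokens]
print(toy_embedded.shape)  # (5, 3)
```

The same token always maps to the same row, so positions 2 and 4 in this toy sequence get identical vectors.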

**Sorted words**

Within the embedding space, can we identify the similarity of words? We want to see if words that have similar embedding-vectors also have similar meanings. Similarity of embedding-vectors can be measured by different metrics, e.g. Euclidean distance or cosine distance.
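
To show how the two metrics differ, here is a minimal NumPy sketch using small made-up vectors. The cosine distance follows the usual definition of one minus the cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction) and ignores vector length, while the Euclidean distance does not.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between the two vectors."""
    return np.linalg.norm(a - b)

v = np.array([1.0, 2.0])
same_direction = 3.0 * v           # longer vector, same direction
orthogonal = np.array([-2.0, 1.0])
opposite = -v

for name, other in [("same direction", same_direction),
                    ("orthogonal", orthogonal),
                    ("opposite", opposite)]:
    print("{0:15s} cosine={1:.3f} euclidean={2:.3f}".format(
        name, cosine_distance(v, other), euclidean_distance(v, other)))
```

This is why cosine distances close to 2 indicate embedding vectors that point in nearly opposite directions.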

In [168]:
# Helper function that ranks the vocabulary by embedding-distance (cosine by default) to a given word
def print_sorted_words(word, metric='cosine'):
    """
    Print the words in the vocabulary sorted according to their
    embedding-distance to the given word.
    Different metrics can be used, e.g. 'cosine' or 'euclidean'.
    """

    # Get the token (i.e. integer ID) for the given word.
    token = tokenizer.word_index[word]

    # Get the embedding for the given word. Note that the
    # embedding-weight-matrix is indexed by the word-tokens
    # which are integer IDs.
    embedding = weights_embedding[token]

    # Calculate the distance between the embeddings for
    # this word and all other words in the vocabulary.
    distances = cdist(weights_embedding, [embedding],
                      metric=metric).T[0]
    
    # Get an index sorted according to the embedding-distances.
    # These are the tokens (integer IDs) for words in the vocabulary.
    sorted_index = np.argsort(distances)

    # Drop token 0, which is reserved for padding and has no word,
    # so that the words and distances below stay aligned.
    sorted_index = sorted_index[sorted_index != 0]
    
    # Sort the embedding-distances.
    sorted_distances = distances[sorted_index]
    
    # Sort all the words in the vocabulary according to their
    # embedding-distance. This is a bit excessive because we
    # will only print the top and bottom words.
    sorted_words = [inverse_map[token] for token in sorted_index]

    # Helper-function for printing words and embedding-distances.
    def _print_words(words, distances):
        for word, distance in zip(words, distances):
            print("{0:.3f} - {1}".format(distance, word))

    # Number of words to print from the top and bottom of the list.
    k = 10

    print("Distance from '{0}':".format(word))

    # Print the words with smallest embedding-distance.
    _print_words(sorted_words[0:k], sorted_distances[0:k])

    print("...")

    # Print the words with highest embedding-distance.
    _print_words(sorted_words[-k:], sorted_distances[-k:])

**Let us print the words that are near and far from the word 'great' in terms of their vector-embeddings. Note that these may change each time you train the model.**

In [169]:
print_sorted_words('great', metric='cosine')

Distance from 'great':
0.000 - great
0.007 - touch
0.008 - complain
0.009 - dwarf
0.009 - businessman
0.015 - greatest
0.015 - refreshing
0.015 - denver
0.016 - theatres
0.016 - wonderfully
...
1.978 - fiancée
1.979 - lame
1.979 - robbery
1.980 - supposed
1.981 - carole
1.981 - threw
1.982 - wasted
1.982 - brainless
1.987 - weirdness
1.987 - cheap


**Tips on how to improve this kind of model when using Keras/TensorFlow**
1. Run more training-epochs. Does it improve performance?
2. If the model overfits the training-data, try using dropout-layers and dropout inside the GRU.
3. Increase or decrease the number of words in the vocabulary. This is done when the Tokenizer is initialized. Does it affect performance?
4. Increase the size of the embedding-vectors to e.g. 200. Does it affect performance?
5. Try varying all the different hyper-parameters for the Recurrent Neural Network.
6. Use 'post' for padding and truncating in pad_sequences(). Does it affect the performance?
7. Use individual characters instead of tokenized words as the vocabulary. You can then use one-hot encoded vectors for each character instead of using the embedding-layer.
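8. Feed the model from your own data-generator, which creates a batch of data using a random subset of x_train_tokens. Use model.fit_generator() instead of model.fit() in older versions of Keras, or pass the generator to model.fit() in newer versions, where fit_generator() is deprecated. The sequences in each batch must be padded so they all match the length of the longest sequence in that batch.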
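
As a starting point for tip 8, the sketch below shows one way such a data-generator could look. It is a NumPy-only illustration under stated assumptions: `batch_generator` is a hypothetical helper, the toy token lists stand in for `x_train_tokens`, and zeros are placed in front of the tokens ('pre' padding), which tip 6 suggests is the current setting.

```python
import numpy as np

def batch_generator(token_lists, labels, batch_size, seed=0):
    """Yield (padded_tokens, labels) batches drawn from random subsets, forever."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=np.float32)
    while True:
        # Pick a random subset of the training examples for this batch.
        idx = rng.choice(len(token_lists), size=batch_size, replace=False)
        batch = [token_lists[i] for i in idx]
        # Pad only up to the longest sequence in this batch.
        longest = max(len(seq) for seq in batch)
        padded = np.zeros((batch_size, longest), dtype=np.int32)
        for row, seq in enumerate(batch):
            if seq:
                # 'pre' padding: zeros first, tokens right-aligned.
                padded[row, -len(seq):] = seq
        yield padded, labels[idx]

# Toy stand-ins for x_train_tokens and y_train.
toy_tokens = [[5, 2], [7], [1, 9, 4, 3], [8, 8, 6]]
toy_labels = [1, 0, 1, 0]
x_batch, y_batch = next(batch_generator(toy_tokens, toy_labels, batch_size=2))
print(x_batch.shape, y_batch.shape)
```

Padding per batch instead of to a global `max_tokens` keeps short batches small, which can reduce the time spent processing padding.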
