# Recurrent Neural Networks

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
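
To make the idea of an internal state concrete, here is a minimal NumPy sketch of the recurrence that a simple RNN applies at each time step. The function name `rnn_forward`, the weight names, and the toy sizes are assumptions chosen for illustration; this is not how Keras implements its layers internally.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over one sequence of shape (timesteps, input_dim).

    Returns the hidden state at every step, shape (timesteps, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim)              # initial internal state (memory)
    states = []
    for x_t in inputs:                    # the loop works for any sequence length
        # The new state depends on the current input and the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4              # illustrative sizes, not from the tutorial
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The same weights process a 2-step and a 7-step sequence
short_states = rnn_forward(rng.normal(size=(2, input_dim)), W_x, W_h, b)
long_states = rnn_forward(rng.normal(size=(7, input_dim)), W_x, W_h, b)
print(short_states.shape, long_states.shape)
```

Because the weights are shared across time steps, the sequence length does not affect the number of parameters.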

The Keras API provides easy-to-use implementations of RNN layers. We'll go over them as we train a model for sentiment analysis.

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import model_selection
# We'll import the layers directly for easier model definition
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, Flatten, SimpleRNN, LSTM, GRU, Bidirectional

In [2]:
tf.random.set_seed(42)

In [3]:
# Below are some parameters that we'll need later; don't worry about them for now

vocab_size = 8000
max_sequence = 128
embeddings_dims = 100

## Download Data From Kaggle

We'll be downloading the dataset from Kaggle. This requires using their download API, so we'll go over the steps to do it.

First, we'll upload `kaggle.json` into the `~/.kaggle` directory that we'll create. This will enable us to download datasets directly from Kaggle. More info on the process can be found here: https://github.com/Kaggle/kaggle-api

In [4]:
# First we'll create a new folder to put kaggle.json into
!mkdir /root/.kaggle

# Let's confirm that the directory is created
!cd /root/ && ls -la

total 64
drwx------ 1 root root 4096 Feb  3 02:12 .
drwxr-xr-x 1 root root 4096 Feb  3 01:46 ..
-r-xr-xr-x 1 root root 1169 Jan  1  2000 .bashrc
drwxr-xr-x 1 root root 4096 Feb  1 17:56 .cache
drwxr-xr-x 1 root root 4096 Feb  1 17:54 .config
drwxr-xr-x 3 root root 4096 Feb  1 17:28 .gsutil
drwxr-xr-x 5 root root 4096 Feb  1 17:54 .ipython
drwx------ 2 root root 4096 Feb  1 17:54 .jupyter
drwxr-xr-x 2 root root 4096 Feb  3 02:12 .kaggle
drwxr-xr-x 2 root root 4096 Feb  3 01:46 .keras
drwx------ 1 root root 4096 Feb  1 17:54 .local
drwxr-xr-x 4 root root 4096 Feb  1 17:54 .npm
-rw-r--r-- 1 root root  148 Aug 17  2015 .profile
-r-xr-xr-x 1 root root  254 Jan  1  2000 .tmux.conf


Now that the `.kaggle` directory is set up, we'll need to upload the `kaggle.json` file. For this you'll need a Kaggle account. You can obtain the file from `https://www.kaggle.com/<username>/account` (make sure to replace `<username>` with your actual username).

More info here: https://github.com/Kaggle/kaggle-api

In [5]:
# Import colab's files module
from google.colab import files

# Start the upload, this will open the upload prompt below
uploaded = files.upload()

# Confirm that we've uploaded the kaggle.json file
print("Uploaded File:", list(uploaded.keys())[0])

Saving kaggle.json to kaggle.json
Uploaded File: kaggle.json


Now that we have the `kaggle.json` file uploaded, we'll need to move it to the `.kaggle` directory.

In [6]:
# Move kaggle.json to .kaggle directory
!mv kaggle.json /root/.kaggle/kaggle.json

# Restrict the file's permissions so that only the owner can read and write it
!chmod 600 /root/.kaggle/kaggle.json

# List files inside .kaggle to confirm that the file is moved
!cd /root/.kaggle && ls -la

total 16
drwxr-xr-x 2 root root 4096 Feb  3 02:12 .
drwx------ 1 root root 4096 Feb  3 02:12 ..
-rw------- 1 root root   64 Feb  3 02:12 kaggle.json


And finally, we can download the dataset directly from Kaggle using their command-line tool (note that you may need to run `!pip install kaggle` first).

In [7]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 35% 9.00M/25.7M [00:00<00:00, 24.2MB/s]
100% 25.7M/25.7M [00:00<00:00, 52.3MB/s]


And now, let's unzip the downloaded file

In [8]:
!unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


## Load & Preprocess Data

First, we'll load the extracted csv file into a dataframe to examine the content.

In [9]:
df = pd.read_csv('IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


We can see that we have two columns: `review`, which holds the review text that will be used as the input for the model, and `sentiment`, which is what the model will try to predict. Let's split them into X's and y's.

In [10]:
x = df.review
y = df.sentiment

x.shape, y.shape

((50000,), (50000,))

### Tokenization

Since neural networks don't work with raw text, we'll need a way to prepare the review text to be consumed by the network.

Keras provides APIs for preparing text that can be fit and reused to prepare multiple text documents. This may be the preferred approach for large projects.

We'll first use the `Tokenizer` class from the `tf.keras.preprocessing.text` module to convert tokens (i.e. words, symbols, numbers, etc.) into numbers that can be consumed by neural networks.

You can read the documentation of the `Tokenizer` class here: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [11]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=vocab_size, # Maximum number of tokens to include, we'll use vocab_size that we defined earlier
    oov_token='<OOV>', # A token that will replace words that will not be in the limited vocabulary set by vocab_size  
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n' # Symbols that will be removed from the texts
)

# Now we will train the tokenizer on our datasets, this allows the tokenizer to learn the most frequent words and create an index for them

tokenizer.fit_on_texts(x)

Now we can use the tokenizer to convert texts into sequences of numbers. Each token/word has its own unique index, which we can inspect through `word_index`.

In [12]:
tokenizer.word_index

{'<OOV>': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 'on': 21,
 'not': 22,
 'you': 23,
 'are': 24,
 'his': 25,
 'have': 26,
 'be': 27,
 'one': 28,
 'he': 29,
 'all': 30,
 'at': 31,
 'by': 32,
 'an': 33,
 'they': 34,
 'so': 35,
 'who': 36,
 'from': 37,
 'like': 38,
 'or': 39,
 'just': 40,
 'her': 41,
 'out': 42,
 'about': 43,
 'if': 44,
 "it's": 45,
 'has': 46,
 'there': 47,
 'some': 48,
 'what': 49,
 'good': 50,
 'when': 51,
 'more': 52,
 'very': 53,
 'up': 54,
 'no': 55,
 'time': 56,
 'my': 57,
 'even': 58,
 'would': 59,
 'she': 60,
 'which': 61,
 'only': 62,
 'really': 63,
 'see': 64,
 'story': 65,
 'their': 66,
 'had': 67,
 'can': 68,
 'me': 69,
 'well': 70,
 'were': 71,
 'than': 72,
 'much': 73,
 'we': 74,
 'bad': 75,
 'been': 76,
 'get': 77,
 'do': 78,
 'great': 79,
 'other': 80,
 'will': 81,
 'also': 82,
 ...}

In [13]:
x_tokenized = tokenizer.texts_to_sequences(x)

# Let's print a string before and after tokenization and examine the differences
print(x[0])
print(x_tokenized[0])

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac
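
To show what the fit and convert steps do conceptually, here is a simplified pure-Python sketch. It assumes a frequency-ranked index with `1` reserved for the out-of-vocabulary token and replaces any word whose index is not below `num_words`. The helper names `fit_word_index` and `to_sequence` and the regular expression used for splitting are hypothetical simplifications, not the Keras implementation.

```python
import re
from collections import Counter

def split_words(text):
    # Simplified splitting: lowercase, keep letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is the OOV token, 0 is left for padding."""
    counts = Counter(word for text in texts for word in split_words(text))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become the OOV index."""
    oov = word_index[oov_token]
    sequence = []
    for word in split_words(text):
        idx = word_index.get(word, oov)
        sequence.append(idx if idx < num_words else oov)
    return sequence

toy_texts = ["the movie was good", "the movie was bad", "the plot was thin"]
toy_index = fit_word_index(toy_texts)
toy_seq = to_sequence("the acting was good", toy_index, num_words=5)
print(toy_index)
print(toy_seq)
```

In this toy run, "acting" was never seen and "good" falls outside the 5-index limit, so both are replaced by the OOV index.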

### Padding Sequences

Let's look at the lengths of the different tokenized reviews



In [14]:
print("Length of Review 1:", len(x_tokenized[0]))
print("Length of Review 10:", len(x_tokenized[9]))
print("Length of Review 1000:", len(x_tokenized[999]))

Length of Review 1: 314
Length of Review 10: 34
Length of Review 1000: 620


We can see that the tokenized reviews have different lengths. Since they will be used as input to the model, which expects a fixed shape, an extra preprocessing step is required: `pad_sequences`, available in `tf.keras.preprocessing.sequence`. This function ensures that all sequences have the same length by either clipping longer sequences or padding shorter ones with zeros.

In [15]:
x_padded = tf.keras.preprocessing.sequence.pad_sequences(
    x_tokenized, # The sequences that will be padded/clipped 
    maxlen=max_sequence, # The maximum length of the sequence using max_sequence that's defined earlier
    padding='post', # Where we'll add zeros if the sequence is shorter than the maximum length; 'post' adds them to the end of the sequence
)

# Let's print out the length of some padded sequences  
print("Length of Review 1:", len(x_padded[0]))
print("Length of Review 10:", len(x_padded[9]))
print("Length of Review 1000:", len(x_padded[999]))

Length of Review 1: 128
Length of Review 10: 128
Length of Review 1000: 128
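
The padding and clipping logic can be sketched in a few lines of plain Python. The helper `pad_or_clip` below is hypothetical and only mirrors the behavior for a single sequence. Its `truncating="pre"` default follows the documented default of `pad_sequences`, which means that with the call above, reviews longer than `max_sequence` keep their last 128 tokens rather than their first.

```python
def pad_or_clip(seq, maxlen, padding="post", truncating="pre", value=0):
    """Return a copy of seq with exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' prepends it
    return list(seq) + pad if padding == "post" else pad + list(seq)

short_padded = pad_or_clip([5, 6, 7], maxlen=5)
long_clipped = pad_or_clip([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_padded)
print(long_clipped)
```

If the opening of a review matters more than its ending for your task, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.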


### Preprocess Targets
Now that the inputs are all set up, let's work on the targets. Specifically, let's change positive/negative into 1/0.

In [16]:
# map returns a new Series, which avoids modifying a column of df in place
y = y.map({'positive': 1, 'negative': 0})
y

0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64

### Train/Test Split
Now that everything is set up, all we need to do is create our training/testing split using scikit-learn's `model_selection.train_test_split`.

In [17]:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x_padded, y, test_size = 0.05, random_state=42, stratify=y)

x_train.shape, y_train.shape, x_test.shape, y_test.shape

((47500, 128), (47500,), (2500, 128), (2500,))

### Create TF Data Pipeline

In [18]:
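def dataset_creator(x, y):
    data = tf.data.Dataset.from_tensor_slices((x, y))  # Pair each padded review with its label
    data = data.shuffle(50000)  # The buffer covers the whole dataset, so the shuffle is complete
    data = data.batch(64)  # Group examples into batches of 64
    data = data.prefetch(tf.data.experimental.AUTOTUNE)  # Prepare upcoming batches while the model trains
    return data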

train_dataset = dataset_creator(x_train,y_train)
test_dataset = dataset_creator(x_test,y_test)
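
As a quick sanity check on the pipeline, the number of batches per epoch follows from the split sizes and the batch size. The sketch below is plain arithmetic and assumes that `batch` keeps the final partial batch, which is its default behavior.

```python
import math

train_examples, test_examples = 47500, 2500   # sizes from the train/test split
batch_size = 64                                # same value used in dataset_creator

train_batches = math.ceil(train_examples / batch_size)
test_batches = math.ceil(test_examples / batch_size)

# Size of the final, partial training batch
last_train_batch = train_examples - (train_batches - 1) * batch_size
print(train_batches, test_batches, last_train_batch)
```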

## Models


### Model 1 - Fully Connected Neural Network

We'll start by training a fully connected network. Since Dense layers don't work well with sequence data, we can expect this model to perform poorly.

In [19]:
model_fcnn = tf.keras.Sequential([
      Input([max_sequence]), # Input shape is equal to the padded sequences maximum length (i.e. max_sequence)
      Dense(units=128,activation='relu'),
      Dropout(0.3),
      Dense(units=1,activation='sigmoid'),
])

model_fcnn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 16,641
Trainable params: 16,641
Non-trainable params: 0
_________________________________________________________________


In [20]:
model_fcnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [21]:
model_fcnn.fit(train_dataset, epochs=5, validation_data=test_dataset)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f7836308e80>

A binary accuracy of around 50% is no better than a random classifier, which means our model isn't really able to learn. Let's add another component to the network that might help it perform better.

### Model 2 - Fully Connected Neural Network with Embedding Layer

Word embeddings take in words and convert them to feature vectors that represent each word with more data points than a single number. Word embeddings learn the relationships between words and provide more information about each word's meaning.

You can read more about Word Embeddings here: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

Keras offers an Embedding layer that can be used for neural networks on text data. It is a flexible layer that can be used in a variety of ways, such as:

*   It can be used alone to learn a word embedding that can be saved and used in another model later.
*   It can be used as part of a deep learning model where the embedding is learned along with the model itself.
*   It can be used to load a pre-trained word embedding model, a type of transfer learning.


The Embedding layer is typically the first hidden layer of a network.

It takes two required arguments, plus an optional third:

`input_dim`: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.

`output_dim`: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.

`input_length`: This is the length of the input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents consist of 1000 words, this would be 1000. The models below leave it out because the `Input` layer already fixes the sequence length.
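
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below illustrates the resulting shapes with toy sizes; the random table stands in for weights that Keras would learn during training, and the variable names are illustrative.

```python
import numpy as np

input_dim, output_dim = 11, 4         # toy sizes: token ids 0..10, 4-dimensional vectors
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(input_dim, output_dim))  # stands in for learned weights

# A batch of 2 integer-encoded sequences, each of length 5 (0 is padding)
batch = np.array([[3, 7, 1, 0, 0],
                  [10, 2, 2, 5, 0]])

# Each token id selects one row of the table
embedded = embedding_table[batch]
print(embedded.shape)
```

The output has shape `(batch, sequence length, output_dim)`, and every occurrence of the same token id maps to the same vector.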


In [22]:
model_fcnn_with_embeddings = tf.keras.Sequential([
      Input([max_sequence]), # Input shape is equal to the padded sequences maximum length (i.e. max_sequence). Alternatively, this can be defined as a part of the Embedding layer
      Embedding(vocab_size+1, 100, mask_zero=True), # Embedding layer with input dim of vocab_size + 1 (to account for paddings)
      Flatten(),
      Dense(units=128, activation='relu'),
      Dropout(0.3),
      Dense(units=1,activation='sigmoid'),
])

model_fcnn_with_embeddings.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 128, 100)          800100    
_________________________________________________________________
flatten (Flatten)            (None, 12800)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               1638528   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 129       
Total params: 2,438,757
Trainable params: 2,438,757
Non-trainable params: 0
_________________________________________________________________
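
The parameter counts in this summary can be reproduced by hand. The sketch below is plain arithmetic using the values defined earlier in the notebook; `embedding_dim` is a local name for the `100` passed to the `Embedding` layer.

```python
vocab_size, embedding_dim, max_sequence = 8000, 100, 128

embedding_params = (vocab_size + 1) * embedding_dim   # one vector per token id
flattened = max_sequence * embedding_dim               # values per review after Flatten
dense_params = flattened * 128 + 128                   # weights + biases of the hidden layer
output_params = 128 * 1 + 1                            # weights + bias of the sigmoid unit

total = embedding_params + dense_params + output_params
print(embedding_params, flattened, dense_params, output_params, total)
```

Most of the parameters sit in the first Dense layer, because flattening turns every review into a 12,800-value vector.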


In [23]:
model_fcnn_with_embeddings.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [24]:
model_fcnn_with_embeddings.fit(train_dataset, epochs=5, validation_data=test_dataset)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f78353fa8d0>

The inclusion of the `Embedding` layer provided the network with a better word representation, which allowed the model to pick up on important features and use them to classify sentiment. But performance can still be improved once the model is able to understand sequences.

### Model 3 - Recurrent Neural Network

RNNs have an advantage over regular Dense layers: they can hold on to sequential information during prediction. This allows RNNs to perform better than Dense networks on sequence data, but they require more training time because RNN layers are more complex than Dense layers.

In [25]:
model_rnn = tf.keras.Sequential([
      Input([max_sequence]), # Input shape is equal to the padded sequences maximum length (i.e. max_sequence). Alternatively, this can be defined as a part of the Embedding layer
      Embedding(vocab_size+1, 100, mask_zero=True,), # Embedding layer with input dim of vocab_size + 1 (to account for paddings)
      SimpleRNN(128,return_sequences=True),
      SimpleRNN(64),
      Dense(units=1,activation='sigmoid'),                             
])

model_rnn.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 128, 100)          800100    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 128, 128)          29312     
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 64)                12352     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65        
Total params: 841,829
Trainable params: 841,829
Non-trainable params: 0
_________________________________________________________________
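
The `SimpleRNN` rows of this summary follow from the layer's three weight groups: an input kernel, a recurrent kernel, and a bias. The helper `simple_rnn_params` below is a hypothetical name used only for this arithmetic check.

```python
def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return input_dim * units + units * units + units

first_rnn = simple_rnn_params(100, 128)    # fed by 100-dimensional embeddings
second_rnn = simple_rnn_params(128, 64)    # fed by the 128 outputs of the first RNN
dense = 64 * 1 + 1

total = 800100 + first_rnn + second_rnn + dense   # 800100 is the embedding layer
print(first_rnn, second_rnn, dense, total)
```

Note that the counts do not depend on the sequence length of 128, since the same weights are reused at every time step.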


In [26]:
model_rnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [27]:
model_rnn.fit(train_dataset, epochs=4, validation_data=test_dataset)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f7835128be0>

### Model 4 - Long Short Term Memory

LSTMs have an advantage over simple RNNs: they have gates that tell the cell which information to forget or hold on to, instead of simply passing the data to the next cell. This allows the model to selectively remember or forget things depending on their significance.

Read more about RNNs and LSTMs here: https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
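
The gating idea can be sketched as a single LSTM time step in NumPy. This is a textbook-style formulation with illustrative names and toy sizes; the packing of the four gates into one weight matrix and their order are assumptions made for brevity, not a description of Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)                    # four equal slices, one per gate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget and output gates in (0, 1)
    g = np.tanh(g)                                 # candidate values to write
    c_t = f * c_prev + i * g                       # forget part of the old memory, add new
    h_t = o * np.tanh(c_t)                         # expose part of the memory as output
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 4                            # illustrative sizes
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):        # a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, W, U, b)

lstm_weights = W.size + U.size + b.size
print(h.shape, c.shape, lstm_weights)
```

With `input_dim=100` and `units=128`, the same layout gives 4 × (100 × 128 + 128 × 128 + 128) = 117,248 weights, which matches the `lstm` row in the model summary below.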



In [28]:
model_lstm = tf.keras.Sequential([
      Input([max_sequence]), # Input shape is equal to the padded sequences maximum length (i.e. max_sequence). Alternatively, this can be defined as a part of the Embedding layer
      Embedding(vocab_size+1, 100, mask_zero=True), # Embedding layer with input dim of vocab_size + 1 (to account for paddings)
      LSTM(128),
      Dense(units=1,activation='sigmoid'),                             
])

model_lstm.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 128, 100)          800100    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               117248    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
Total params: 917,477
Trainable params: 917,477
Non-trainable params: 0
_________________________________________________________________


In [29]:
model_lstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [30]:
model_lstm.fit(train_dataset, epochs=3, validation_data=test_dataset)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f77c4317d30>

## Evaluation

In [31]:
def predict(text):
  # Tokenize and pad the text the same way the training data was prepared
  tokenized_texts = tokenizer.texts_to_sequences([text])
  padded = tf.keras.preprocessing.sequence.pad_sequences(tokenized_texts, maxlen=max_sequence, padding='post')

  # The model returns one probability per input; take the single value
  output = model_lstm.predict(padded)[0][0]

  print("The Sentence: ", text)

  if output >= 0.5:
    print("Is Positive", output)
  else:
    print("Is Negative", output)

In [32]:
text = "I am sad"  #@param {type: "string"}

predict(text)

The Sentence:  I am sad
Is Negative 0.1679133
