# Recurrent Neural Network

## Background

This notebook uses the **IMDb** (*Internet Movie Database*) movie reviews dataset bundled with **Keras** to perform **binary classification**: determining whether a review is negative or positive.

**RNN**s process sequences of data, such as the text of a sentence. **Recurrent** means the neural network contains *loops* that cause the output of a given layer to become the input to the same layer in the next *time step*. In a time series, a *time step* is the next point in time; in a text sequence, it is the next word. Looping enables **RNN**s to learn and remember relationships among the data in the sequence.

For example, consider the following:
* The movie is not good

* The actor is good

* The actor is great!

The first sentence is clearly negative. The second is positive, but not as positive as the third. The word *good* carries positive sentiment on its own; however, when it follows the word *not*, as in the first sentence, the sentiment becomes negative. **RNNs** take into account the relationships among the earlier and later parts of a sequence. Determining the meaning of text can involve many words, with an unknown number of other words between them. This notebook uses an **LSTM** (*Long Short-Term Memory*) layer to make the network **recurrent** and to optimize learning from sequences like the ones above.
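To make the idea of a loop concrete, here is a minimal sketch of a single recurrent unit in plain Python. It is not an LSTM (which adds gates on top of this idea), and the weights `w_x`, `w_h` and `b` are arbitrary values chosen for illustration rather than learned ones. It only shows that the previous output is fed back in at each time step, so the order of the inputs affects the final state.

```python
import math

def run_recurrent(inputs, w_x=1.5, w_h=-2.0, b=0.0):
    """Feed a sequence through one recurrent unit and return its final state."""
    h = 0.0  # hidden state before the first time step
    for x in inputs:
        # the previous output h is fed back in alongside the new input x
        h = math.tanh(w_x * x + w_h * h + b)
    return h

# The same two values in a different order leave the unit in a different state.
state_a = run_recurrent([1.0, 0.0])
state_b = run_recurrent([0.0, 1.0])
print(state_a, state_b)
```

A model without the feedback term `w_h * h` would only ever see the current input, which is why it could not tell *not good* apart from *good*.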

**Applications:**
* Predictive text
* Sentiment Analysis
* Responding to questions with the predicted best answers from a corpus
* Inter-language Translation
* Automated closed captioning in video

**More Information:**
* __[Overview of Recurrent Neural Networks](https://www.analyticsindiamag.com/overview-of-recurrent-neural-networks-and-their-applications/)__
* __[Applications](https://en.wikipedia.org/wiki/Recurrent_neural_network#Applications)__
* __[Binary Classification](https://docs.aws.amazon.com/machine-learning/latest/dg/binary-classification.html)__


The **IMDb** movie reviews dataset included with **Keras** contains 25,000 training samples and 25,000 testing samples, each labeled *positive* (1) or *negative* (0). There are over 88,000 unique words in the dataset. Because this notebook is intended to run on a CPU rather than a GPU, only the 10,000 most frequently occurring words are loaded for training and testing; loading more words would likely improve results.

In [1]:
# import module to load data
from tensorflow.keras.datasets import imdb

In [2]:
# limit amount of data loaded
num_of_words = 10000

# load the training and testing sets as (samples, labels) tuples
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_of_words)

## Evaluate 
Inspect the dimensions of the training and testing samples.

In [3]:
X_train.shape

(25000,)

In [4]:
y_train.shape

(25000,)

In [5]:
X_test.shape

(25000,)

In [6]:
y_test.shape

(25000,)

The arrays `y_train` and `y_test` are one-dimensional arrays containing 1s and 0s that indicate whether each review is positive or negative. `X_train` and `X_test` also appear one-dimensional; however, each of their elements is a *list* of integers that represents a review's contents. **Keras** deep learning models require *numeric data*, so the **Keras** developers preprocessed the **IMDb** dataset. The **Keras IMDb** dataset provides a dictionary that maps words to their indexes. Each word's corresponding value is its frequency ranking among all the words in the entire set of reviews, so the word with rank **1** is the most frequently occurring word (as calculated by **Keras**). Although the dictionary values begin with **1**, in each encoded review, such as `X_train[123]`, the **ranking values** are *offset by 3*. Any review containing the most frequently occurring word will have the value **4** wherever that word appears. Here's why:

* The value **0** in a review represents *padding*. **Keras** deep learning algorithms expect all the training samples to have the same dimensions. Some reviews may need to be expanded to a given length and some shortened to that length. Those that are expanded will be padded with **0**.

* The value **1** represents a token that **Keras** uses internally to indicate the start of a text sequence for learning purposes

* The value **2** in a review represents an unknown word, usually a word that was not loaded because of the limit passed to `load_data(num_words=num_of_words)`. Any review that contained words with frequency rankings greater than `num_of_words` has those words' numeric values replaced with **2**. This is handled by **Keras**.

We must account for this *offset by 3* when decoding the review.
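The three reserved values and the offset can be sketched with a toy vocabulary. This is a simplified illustration of the scheme described above, not the code **Keras** runs: the function names and `toy_ranks` are hypothetical, and the exact cutoff **Keras** applies at the `num_words` boundary may differ slightly from the rule used here.

```python
PAD, START, UNKNOWN, OFFSET = 0, 1, 2, 3

def encode_review(words, word_to_rank, num_words):
    """Encode a list of words the way the text describes (simplified)."""
    encoded = [START]
    for word in words:
        rank = word_to_rank.get(word)
        if rank is None or rank > num_words:
            encoded.append(UNKNOWN)        # word outside the loaded vocabulary
        else:
            encoded.append(rank + OFFSET)  # shift past the reserved values
    return encoded

def decode_review(encoded, word_to_rank):
    """Reverse the encoding; reserved values decode to '?'."""
    rank_to_word = {rank: word for word, rank in word_to_rank.items()}
    return ' '.join(rank_to_word.get(value - OFFSET, '?') for value in encoded)

toy_ranks = {'the': 1, 'movie': 2, 'good': 3, 'marvelous': 4}
encoded = encode_review(['the', 'movie', 'marvelous'], toy_ranks, num_words=3)
print(encoded)                            # [1, 4, 5, 2]
print(decode_review(encoded, toy_ranks))  # ? the movie ?
```

The most frequent word (`'the'`, rank 1) is stored as 4, and the word that falls outside the three-word limit is stored as 2.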

In [7]:
%pprint

Pretty printing has been turned OFF


In [8]:
X_train[123]

[1, 307, 5, 1301, 20, 1026, 2511, 87, 2775, 52, 116, 5, 31, 7, 4, 91, 1220, 102, 13, 28, 110, 11, 6, 137, 13, 115, 219, 141, 35, 221, 956, 54, 13, 16, 11, 2714, 61, 322, 423, 12, 38, 76, 59, 1803, 72, 8, 2, 23, 5, 967, 12, 38, 85, 62, 358, 99]

## Decode

In [9]:
# get the word-to-index dictionary provided by the
# tensorflow.keras.datasets.imdb module
word_to_index = imdb.get_word_index()

The word *great* is likely to appear in a positive review. We can verify it is in the dictionary.

In [10]:
word_to_index['great']

84

In this case, it is the 84th most frequent word in the dataset.

Reversing the `word_to_index` dictionary produces a mapping from frequency rankings to words, which makes it easy to look up any word by its ranking. This can be achieved with a dictionary comprehension.

In [11]:
# reverse mapping
index_to_word = \
    {index: word for (word, index) in word_to_index.items()}

In [12]:
# get the top 50 words
[index_to_word[i] for i in range(1, 51)]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'are', 'his', 'have', 'he', 'be', 'one', 'all', 'at', 'by', 'an', 'they', 'who', 'so', 'from', 'like', 'her', 'or', 'just', 'about', "it's", 'out', 'has', 'if', 'some', 'there', 'what', 'good', 'more']

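Now the `index_to_word` dictionary can be used to decode the review. Its two-argument `get` method is used rather than the `[]` operator so that a key with no match (such as a reserved value after the offset of 3 is subtracted) yields the default `'?'` instead of raising an error.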

In [13]:
' '.join([index_to_word.get(i - 3, '?') for i in X_train[123]])

'? beautiful and touching movie rich colors great settings good acting and one of the most charming movies i have seen in a while i never saw such an interesting setting when i was in china my wife liked it so much she asked me to ? on and rate it so other would enjoy too'

In [14]:
y_train[123]

1

The value **1** indicates that this review is labeled positive.

## Prepare Data

The number of words per review varies, but **Keras** requires all samples to have the same dimensions. This requires **data preparation**. In this case, it is necessary to restrict every review to the *same* number of words. Some reviews will need to be *padded* with additional data and others may need to be *truncated*. The function `pad_sequences` reshapes the rows of `X_train` to the number of features specified by the `maxlen` argument (**100** here) and returns a two-dimensional array.

In [15]:
# restrict the number of words in each review
words_per_review = 100

In [16]:
# import pad_sequences from keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [17]:
# reshape test and train datasets
X_train = pad_sequences(X_train, maxlen=words_per_review)
X_test = pad_sequences(X_test, maxlen=words_per_review)

In [18]:
# verify reshape
X_train.shape

(25000, 100)

In [19]:
# verify reshape
X_test.shape

(25000, 100)
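To see what happens to an individual review, here is a plain-Python sketch of the default behavior of `pad_sequences`, which pads at the front and truncates from the front (`padding='pre'`, `truncating='pre'`). The helper name `pad_or_truncate` and the sample lists are made up for illustration.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Mimic the default 'pre' padding and 'pre' truncating on one sequence."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen values
    # otherwise put the padding zeros in front of the existing values
    return [pad_value] * (maxlen - len(sequence)) + sequence

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]
print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_or_truncate(long_review, maxlen=5))   # [22, 16, 43, 530, 973]
```

With these defaults, a long review loses its opening words, including the start token **1**, and keeps its ending.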

## Split Data

**The test data must be split for testing and validation**

The 25,000 test samples will be manually split into 15,000 test samples and 10,000 validation samples using **Scikit-learn's** `train_test_split` with the argument `test_size=0.40`. The validation samples will later be passed to the model's `fit` method via the `validation_data` argument.

In [20]:
# split data using sklearn
from sklearn.model_selection import train_test_split
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.40) 

In [21]:
# confirm split
X_test.shape

(15000, 100)

In [22]:
# confirm split
X_val.shape

(10000, 100)
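The arithmetic behind the split can be sketched without **Scikit-learn**. The hypothetical helper below is not how `train_test_split` is implemented; it only illustrates shuffling samples and labels together and holding out 40% of them, using 25 stand-in samples instead of 25,000.

```python
import random

def shuffled_split(samples, labels, test_size, seed=None):
    """Shuffle paired samples and labels, then cut off a held-out fraction."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_held_out = round(len(samples) * test_size)
    keep, held_out = indices[n_held_out:], indices[:n_held_out]
    return ([samples[i] for i in keep], [samples[i] for i in held_out],
            [labels[i] for i in keep], [labels[i] for i in held_out])

samples = list(range(25))          # stand-ins for 25 padded reviews
labels = [i % 2 for i in samples]  # stand-in labels
X_a, X_b, y_a, y_b = shuffled_split(samples, labels, test_size=0.40, seed=11)
print(len(X_a), len(X_b))          # 15 10
```

In the real call, passing a `random_state` value to `train_test_split` plays the role of `seed` here and makes the split reproducible from run to run.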

## Create Neural Network

The **RNN** will now be configured, beginning with a *Sequential* model to which the layers that compose the network are added.

In [23]:
# import sequential model
from tensorflow.keras.models import Sequential

rnn = Sequential()

In [24]:
# import the layers needed for the model
from tensorflow.keras.layers import Dense, LSTM

from tensorflow.keras.layers import Embedding

### Add an Embedding Layer

To *reduce dimensionality*, **RNNs** typically begin with an **embedding layer**, which encodes each word as a dense vector. The vectors produced by the **embedding layer** also capture how a word relates to the words around it, allowing the **RNN** to learn word relationships among the data.

There are popular predefined word embeddings, such as `Word2Vec` and `GloVe`. These can be loaded into neural networks to save time training the model. They can also add basic word relationships to a model when only small amounts of training data are available. This can improve the model's accuracy by allowing it to build upon previously learned word relationships, rather than learning those relationships from insufficient amounts of data.

__[More info on Predefined Embedding](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)__

In [25]:
# create embedding layer
rnn.add(Embedding(input_dim=num_of_words, output_dim=128,
                  input_length=words_per_review))
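Conceptually, an embedding layer is a lookup table with one row of weights per word index. The sketch below uses tiny, made-up sizes and random values in place of learned weights; it is meant only to show why each review of 100 integers becomes a 100 x 128 array and where the layer's parameters come from.

```python
import random

vocab_size, embedding_dim = 10, 4   # the notebook uses 10,000 and 128
rng = random.Random(0)
# one row of weights per word index; these would be learned during training
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(review):
    """Replace each integer word index with its row from the table."""
    return [embedding_table[index] for index in review]

review = [0, 0, 1, 4, 5, 2]         # a padded, encoded review of length 6
vectors = embed(review)
print(len(vectors), len(vectors[0]))  # 6 4
print(vocab_size * embedding_dim)     # 40 weights in this toy table
```

Every occurrence of the same word index maps to the same row, so the layer has `vocab_size * embedding_dim` weights regardless of how long the reviews are.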

### Add LSTM Layer 

**This will allow past information to be reinjected at a later time.**

The arguments are:
* `units` - The number of neurons in the layer. The more neurons, the more the network can remember. It is common to start with a value between the length of the sequences used for processing (100 in this case) and the number of predicted classes (2 in this case). This notebook uses 256 units, which is above that range.

* `dropout` - The fraction of neurons to randomly disable when processing the layer's input and output (0.2 means 20%). **Dropout** is a proven technique that reduces overfitting. **Keras** also provides a separate **Dropout** layer.

* `recurrent_dropout` - The fraction of neurons to randomly disable when the layer's output is fed back into the layer again to allow the network to learn from what was seen previously.

__[More info on Dropout](https://arxiv.org/abs/1512.05287)__

In [26]:
# Add LSTM Layer
rnn.add(LSTM(units=256, dropout=0.2, recurrent_dropout=0.2))
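The general idea behind the two dropout arguments can be sketched in plain Python. This is the common "inverted dropout" formulation, not the **LSTM** layer's internal code, and the helper name is made up: each value is zeroed with probability `rate` during training, and the survivors are scaled up so the expected total stays the same.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)  # a mix of 0.0 and 1.25 values
```

Dropout is only applied while training; when the model evaluates or predicts, all neurons are used.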

### Add Dense Output Layer

The output of the **LSTM layer** is now reduced to a single result indicating whether a review is positive or negative. Here the `sigmoid` activation function is used since it is preferred for **binary classification**. It maps arbitrary values into the range 0.0 to 1.0, producing a probability.

In [27]:
rnn.add(Dense(units=1, activation='sigmoid'))
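For reference, the sigmoid function itself is short enough to write out. This standalone sketch shows how large negative inputs are squashed toward 0, large positive inputs toward 1, and an input of 0 lands exactly in the middle.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 3))
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```

An output near 1 can be read as "probably positive" and an output near 0 as "probably negative".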

### Compile Model, Display Summary

The model will now be compiled. `optimizer` is the optimizer the model uses to adjust the weights throughout the network as it learns; *'adam'* performs well for a variety of models. The loss function `binary_crossentropy` is used since there are only two possible outputs. `metrics` is a list of the *metrics* that the network will produce to help with evaluation; *'accuracy'* is commonly used in classification models. In the summary, the large number of parameters comes primarily from the number of words in the vocabulary (10,000) times the number of neurons in the `Embedding` layer's output (128).


__[More info on Optimizers](https://keras.io/optimizers/)__

__[More info on Adam](https://medium.com/octavian-ai/which-optimizer-and-learning-rate-should-i-use-for-deep-learning-5acb418f9b2)__

__[More info on Metrics](https://keras.io/metrics/)__

In [28]:
rnn.compile(optimizer='adam',
            loss='binary_crossentropy', 
            metrics=['accuracy'])
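As a rough illustration of what the `binary_crossentropy` loss measures, here is the per-sample formula in plain Python. The clipping value `eps` is an assumption added to keep `log` away from zero, and **Keras** averages this quantity over a batch rather than reporting it per sample.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one sample: small when y_prob agrees with the 0/1 label."""
    p = min(max(y_prob, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(round(binary_crossentropy(1, 0.9), 3))  # confident and right: 0.105
print(round(binary_crossentropy(1, 0.1), 3))  # confident and wrong: 2.303
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the optimizer toward confident, correct predictions.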

In [29]:
rnn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 128)          1280000   
_________________________________________________________________
lstm (LSTM)                  (None, 256)               394240    
_________________________________________________________________
dense (Dense)                (None, 1)                 257       
Total params: 1,674,497
Trainable params: 1,674,497
Non-trainable params: 0
_________________________________________________________________
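The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for these layer types: an **LSTM** has four internal weight sets, each with input weights, recurrent weights and one bias per unit.

```python
vocab_size, embedding_dim, lstm_units = 10_000, 128, 256

# one 128-value vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim
# 4 weight sets x (input weights + recurrent weights + bias) per unit
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
# one weight per LSTM output plus a single bias
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)     # 1280000 394240 257
print(embedding_params + lstm_params + dense_params)   # 1674497
```

The embedding layer alone accounts for roughly three quarters of the weights, which is why reducing `num_of_words` shrinks the model so much.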


### Model Training and Evaluation

Now the model will be trained using `rnn.fit`. Each *epoch* takes a significant amount of time to train because of the large number of parameters (*weights*) the **RNN** model needs to learn. The **accuracy** and **validation accuracy** reported for each epoch represent the percentage of training samples and of `validation_data` samples, respectively, that the model predicts correctly.
Results can be evaluated using `rnn.evaluate`.

In [30]:
rnn.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_val, y_val))

Train on 25000 samples, validate on 10000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History object at 0x1485ce2b0>

In [31]:
# get loss and accuracy
results = rnn.evaluate(X_test, y_test)
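With `metrics=['accuracy']`, `rnn.evaluate` should return the loss followed by the accuracy. As a sketch of what that accuracy means for a sigmoid output, the hypothetical helper below thresholds each predicted probability at 0.5 and counts matches with the labels; the sample values are made up and are not results from this model.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, predictions) if t == p)
    return correct / len(y_true)

labels = [1, 0, 1, 1, 0]
probs = [0.91, 0.20, 0.45, 0.77, 0.63]  # made-up sigmoid outputs
print(binary_accuracy(labels, probs))   # 0.6
```

Here three of the five predictions land on the correct side of 0.5, giving an accuracy of 0.6.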



Though the accuracy is only 85%, it is common to find many published results in the high 80s. This small **RNN** of only three layers did reasonably well.