# Workshop DL03: Recurrent Neural Network

## Agenda:
- Introduction to RNN
- Use RNN for text generation

## Motivation: 
From time to time, beside from tabular data or image data we also need to deal with sequential data, for example, sentences, coordinates of a moving object, sound waves... 

Unlike pixel values or data from a table, sequnce-type data can be one of the hardest to deal with, because we need to: 
- handle sequences with varying length,
- track long-term dependencies (e.g. long sentences), 
- preserve order (e.g. "this is good, not bad at all" versus "this is bad, not good at all"), and 
- share parameters across sequnence (so no need to relearn something if they appeared before in the sequence). 

As we talk about RNN we will see how RNN meets these requirements. 


## Types of RNN:

|Type|Input|Output|Examples|
|:--:|:---:|:----:|:-----:|
|**One-to-many**| zero/one input| a sequence| music generation |
|**Many-to-one**| a sequence | one output | sentiment classification  |
|**Many-to-many**| a sequence | a sequence | speech recognition, machine translation|


We will revisit these concepts after we learn more about RNN structure.

## What is a RNN?
A one-sentence description for it could be a loop of neural network layers. We will see more clearly when we discuss each property of RNN.
#### Order Preservation
When dealing with sequence data, we need to preserve the order of the components (e.g. words) in the sequence. Instead of putting everything in the bag and hand it to the model, we need the model to process the data in the right order. So we need to introduce some order or time component to the model, like this:

<img src='./img/ordering.png'>


Let's use an example of name identification, where the model needs to identify whether the words are names or not. When the word is part of the name, the model predicts 1, otherwise 0. Let's say $x_0$ is our first word, $x_1$ is second word ... and so on. $\hat{y_0}$ is the prediction of the first word, $\hat{y_1}$ is the prediction of the second word ... and so on. What's inside the green box is the hidden layers that the model revisits everytime it processes a new input, which is why it's also called **recurrent cell**. So the model can scan through the sentence from left to right by processing the words one at a time.


#### Dependencies 
RNN not only preserves the order but also track dependencies. For each word in the $k^{th}$ position, its activation is not only passed to the $k^{th}$ prediction, but also passed to the ${k+1}^{th}, {k+2}^{th}, ...$ predictions. So the prediction at time $k$ uses the current information as well as the information from earlier sequence. (There is other model that uses information from later sequence, namely Bidirectional RNN.) For example, in the following graph, the second prediction uses all information up to the current one.

<img src = './img/flow.png'>


#### Parameters sharing
RNN also uses the same layers across the sequence, i.e. it shares the same set of parameters/weights thought-out.

<img src = './img/sharing.png'>

As we can see from the graph above, for each timestamp or order, RNN uses the exact same set of parameters and layer structure for prediction as well as activation to pass on to the next cell, i.e.
$$a_t = g(W_{aa}a_{t-1}+W_{ax}x_t+b_a)$$
$$\hat{y_t} = f(W_{ay}a_t+b_y)$$
for $t=1,2,...$ where $g$ is your choice of activation function (e.g. tanh), $f$ is the activation functions for the output (e.g. sigmoid, softmax), and $a_0$ is initialised activations (often initialised with zeros).

**Vectorization version**: 

$$a_t = g(W_{a}[a_{t-1},x_t]+b_a)$$
$$\hat{y_t} = f(W_{ay}a_t+b_y)$$
$where\ W_a $ is horizontal stack of $W_{aa}\ and\ W_{ax}$, $[a_{t-1},x_t]$ is vertical stack of $a_{t-1}\ and\ x_t$ 


## Other representation of RNN
This representation of RNN is also known as the **unrolled** version. Since the green box (recurrent cell) is revisited again and again for the whole sequence, sometime we can see simplified version like the one on the left:

<img src = './img/loop.png'>

If we unroll the internal loop, it's exactly the one on the right.


## How do we train a RNN?
Like other neural network, we train RNN through backpropagation, where we update parameters to minimise loss function by gradient descent. By the use of chain rule, this process is going backwards as the following graph shown. Since RNN has time element, its bakpropagation is also known as **bachpropagation through time** (cool name eh) as the gradient flows from right to left. 

<img src = './img/backprop-through-time.png'>

Images sources: ©MIT 6.S191: Introduction to Deep Learning [introtodeeplearning.com](http://introtodeeplearning.com/)


## Examples of RNN architectures
Now that we have covered the basis of RNN, let's dig a little bit more about different types of RNN as mentioned before.

|one-to-many|many-to-one|many-to-many|
|:---------:|:---------:|:----------:|
|<img src = './img/one-to-many.png'>|<img src = './img/many-to-one.png'>|<img src = './img/many-to-many.png'>|
|music generation (given one note or nothing to generate a melody), text generation | sentiment classification (given comments to classify sentiment), DNA sequence analysis | speech recognition (given sound waves to recognise speech), machine translation |

For one-to-many, usually the output in earlier timestamp is also passed to later timestamp. There is also a type called one-to-one, but nothing too exciting there except that it's a standard or vanilla neural network. 

Perhaps the most interesting one is many-to-many, where the input and output length can vary (e.g. machine translation), so it requires something called "**encoder**" and "**decoder**". <img src='./img/encoder-decoder.png'>There are other RNN structures like attention based architecture that are not captured in these diagrams, which we encourage you to explore it yourself.  


## Deep RNN
Deep RNN is a stack of RNNs. This is an example of a 3-layer RNN:

<img src='./img/deep-rnn.png'>

Source: [deeplearning.ai](https://www.coursera.org/learn/nlp-sequence-models/lecture/ehs0S/deep-rnns)


## Build a RNN
Now let's build a RNN model to generate some headlines!


In [2]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
tf.disable_eager_execution()
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout, Masking
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 

# set seeds for reproducability
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

import pandas as pd
import numpy as np
import os, time, string 


Using TensorFlow backend.


In [3]:
# reading the files
curr_dir = './data/'
corpus = []
for filename in os.listdir(curr_dir):
    article_df = pd.read_csv(curr_dir + filename)
    corpus.extend(list(article_df.headline.values))
    break

corpus = [h for h in corpus if h != "Unknown"]
corpus[:10]

['N.F.L. vs. Politics Has Been Battle All Season Long',
 'Voice. Vice. Veracity.',
 'A Stand-Up’s Downward Slide',
 'New York Today: A Groundhog Has Her Day',
 'A Swimmer’s Communion With the Ocean',
 'Trail Activity',
 'Super Bowl',
 'Trump’s Mexican Shakedown',
 'Pence’s Presidential Pet',
 'Fruit of a Poison Tree']

### step 1: Data prep
To train an RNN we need a large corpus(body or set) of text. Recall that in the classification workshop we have prepared the string-type data by encoding different string values into distint numbers. It works OK if we have small amount of string values, but what if we have a dictionary of words (10,000 to 1M words depending on the dictionary)? We're gonna do something known as **tokenization**. What it really means is that it forms a vocabulary/dictionary of the text and maps the sentence to the individual tokens. 

To tell the model when the sentence ends, we can add an extra token called **EOS** (end of sentence) at the end of the sentence. If some word is not known in our vocabulary, we can replace the word with a **UNK**(unknown) token instead of identifying it as a specific word. 


There is also a trend of tokenizing characters instead of words. [Here](https://www.tensorflow.org/tutorials/sequences/text_generation) is an example.

In [4]:
tokenizer = Tokenizer(num_words=None, 
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                      lower=True, split=' ', char_level=False, 
                      oov_token=None, document_count=0) # these are the defaults
# Keras has inbuilt model for tokenization 
# which can be used to obtain the tokens and their index in the corpus.
tokenizer.fit_on_texts(corpus)
# Updates internal vocabulary based on a list of texts
sequence = tokenizer.texts_to_sequences(corpus)
#Transforms each text in texts to a sequence of integers
print(len(sequence))
sequence[:10]


829


[[224, 159, 160, 123, 97, 78, 677, 678, 40, 27, 225],
 [226, 679, 680],
 [2, 227, 681, 682, 359],
 [11, 26, 28, 2, 683, 78, 161, 98],
 [2, 684, 685, 12, 1, 686],
 [360, 687],
 [228, 361],
 [20, 688, 689],
 [362, 363, 690],
 [691, 4, 2, 364, 692]]

There are many ways to construct RNN for texr generation. Here we show one way: give the network a sequence of words to guess what the next word is. So in terms of supervise models, we will use the fitst $n^{th}$ words as features and the ${n+1}^{th}$ word as label. Then use the $2^{th}$ ~ ${n+1}^{th}$ as features and ${n+2}^{th}$ as label.


In [5]:
def get_features_from_sequence(sequence):
    # determine how many words we use to guess the next word
    training_length = 5 
    
    # initialising arrays with zeros
    features = np.array([np.zeros([training_length])],dtype=np.int8)
    labels = np.zeros([0],dtype=np.int8)
    
    
    # Iterate through all the sequences of tokens
    for seq in sequence:
 
        # Create multiple training examples from each sequence
        for i in range(training_length, len(seq)):
            # extract components in the sequence
            extract = seq[(i - training_length):(i + 1)]  
            
            # Set the features and label
            feature = np.array(extract[:-1]) # take all but the last one
            label = np.array(extract[-1]) # take the last one
            
            # stacking features and labels
            features = np.append(features,[feature],axis=0) 
            labels = np.append(labels,[label],axis=0)
    
    # getting rid of the init zeros
    features = np.array(features[1:])
    labels = np.array(labels[1:])
    return features, labels

features, labels = get_features_from_sequence(sequence)


In [6]:
# print first 10 features and labels
print(np.shape(features));print(np.shape(labels))
print(features[0:10]);print(labels[0:10])

(1726, 5)
(1725,)
[[224 159 160 123  97]
 [159 160 123  97  78]
 [160 123  97  78 677]
 [123  97  78 677 678]
 [ 97  78 677 678  40]
 [ 78 677 678  40  27]
 [ 11  26  28   2 683]
 [ 26  28   2 683  78]
 [ 28   2 683  78 161]
 [  2 684 685  12   1]]
[677 678  40  27 225  78 161  98 686   8]


In [7]:
# number of words in vocabulary
num_words = len(tokenizer.index_word)+1

# initialise array with zeros
label_array = np.zeros((len(features),num_words),dtype=np.int8)

#one hot encoding the labels
for example_index, word_index in enumerate(labels):
    label_array[example_index,word_index] = 1
    
print(label_array.shape)
label_array[0]

(1726, 2351)


array([0, 0, 0, ..., 0, 0, 0], dtype=int8)

Before we put data into the model, we also want to put in an extra layer call **embedding layer** that adds in extra knowledge about each word. Learn more in this [optional workshop](https://github.com/Phoebe0222/MLSA-workshops-2019-student/blob/master/Deep-Learning/workshop-DL04-optional-NLP-and-WordEmbedding.ipynb). 

In [9]:
# Load in embeddings from https://nlp.stanford.edu/projects/glove/
glove_vectors = './glove.6B.50d.txt'
glove = np.loadtxt(glove_vectors, dtype='str', comments=None)

# Extract the vectors and words
vectors = glove[:, 1:].astype('float')
words = glove[:, 0]

# Create lookup of words to vectors
word_lookup = {word: vector for word, vector in zip(words, vectors)}

# New matrix to hold word embeddings
embedding_matrix = np.zeros((num_words, vectors.shape[1]))


In [10]:
for i, word in enumerate(tokenizer.index_word.keys()):
    # Look up the word embedding
    vector = word_lookup.get(word, None)

    # Record in matrix
    if vector is not None:
        embedding_matrix[i + 1, :] = vector
# pretrained embedding for the word president        
word_lookup['president'][:10]

array([-0.11875,  0.6722 ,  0.19444,  0.55269,  0.53698, -0.37237,
       -0.73494, -0.30575, -0.92601, -0.43276])

### step 2: Design RNN
So far we still haven't looked at what's really in the green box or recurrent cell. In a stadard RNN, what's inside is just one layer, but in more complicated models, there can be multiple layers interacting in a strange way. 

The original intention to build such layers is to address the problem known as **vanishing gradient**. It means in a long sentence or sequence, when we backpropagate through time we're multiplying a long chain of gradients and it's very likely that the product become so small that the gradients far down the sequence might have very little effect.

A more intuitive way to interpret it is to look at the sentence. For example, in the sentence "The cat, that already ate ..., was full", "the cat" is singular and the "..." can be arbitrary long, so the RNN might not be able to detect the singular subject and correctly use singular "was". 

What if we can tell the model to hold some important information aside? This is exactly what **Gated Recurrent Unit** (GRU) and **Long Short Term Memory** (LSTM) do.

These models introduce a **memory cell** state the stores important information like subject and pass it through a **gate** to later sequence where it really matters. And in the bakpropagation, we can use the cell state chain to calculate the gradients to avoid long product of gradients. 

Notations:

- a **memory cell** state $c_0, c_1, ...c_t$ is a sequence of vector that contains value between 0 to 1 indicating whether the important word is stored in memory
    - if $c_t = 1 $, it means "keep the word, it's still important"
    - if $c_t = 0 $, it means "forget abbout the word, it's useless now"
    - for example, in the sentence "The cat, that already ate ..., was full", if the "cat" was stored in the memory, when the model reaches the choice between "was" or "were", it remembers that the subject is singular, so it will chooses "was". I.e. $c_0(cat),c_1(cat),c_2(cat),...c_{t-1}(cat),c_t(cat)=0,1,1,...,1,0$

- a **gate** decides whether to let the information through to protect and control the memory cell state; there are:
    - update gate $u$: decides whether to update the cell state value with the canditate value
        - if $u=1$, it means "update the cell"
        - if $u=0$, it means "do nothing"
    - relevant gate $r$: decides how relevant the current activation and new input is regarding to cell state
    - forget gate $f$: decides whether to keep the old cell state value 
    - output gate $o$: controls the output of activation

Let
- $c_{t-1}$ be the current cell value, with the same dimemsion as activation
- $\tilde{c}_t$ be the canditate cell value for $c_t$, 


Then
- **LSTM**: 
    - current cell: $c_{t-1}$
    - canditate cell: $\tilde{c}_t = tanh(W_c[a_{t-1},x_t]+b_c)$
    - update gate: $u = \sigma(W_u[a_{t-1},x_t]+b_u)$
    - forget gate: $f=\sigma(W_f[a_{t-1},x_t]+b_f)$
    - new cell: $c_t = u*\tilde{c}_t + f*c_{t-1}$
    - output gate: $o = \sigma(W_o[a_{t-1},x_t]+b_o) $
    - new activation: $a_t=o*tanh(c_t)$ 

- **GRU**: 
    - current cell: $c_{t-1}  = a_{t-1}$
    - canditate cell: $\tilde{c}_t = tanh(W_c[r*a_{t-1},x_t]+b_c)$
    - update gate: $u = \sigma(W_u[a_{t-1},x_t]+b_u)$
    - relevant gate: $r = \sigma(W_r[a_{t-1},x_t]+b_r)$
    - new cell: $c_t = u*\tilde{c}_t + (1-u)*c_{t-1}$
    - new activation: $a_t = c_t$


A snapshot:
<img src='./img/snapshot.png'>

[This](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) is a really good blog on LSTM, which we highly recommand. 

In [15]:
training_length = 5
model = Sequential()

# Embedding layer
model.add(
    Embedding(input_dim=num_words,
              input_length = training_length,
              output_dim=50,
              weights=[embedding_matrix],
              trainable=False,
              mask_zero=True))

# Masking layer for pre-trained embeddings
model.add(Masking(mask_value=0.0))

# Recurrent layer
model.add(LSTM(64, return_sequences=False, 
               dropout=0.1, recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64,activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 5, 50)             117550    
_________________________________________________________________
masking_3 (Masking)          (None, 5, 50)             0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                29440     
_________________________________________________________________
dense_5 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 2351)              152815    
Total params: 303,965
Trainable params: 186,415
Non-trainable params: 117,550
________________________________________________________________

### step 3: Model training
Like any other model, we need to define the loss function based on prediction and actual output. In the example of language model, suppose we have a sentence consist of words $$y_0, y_1, ... y_t.$$ Then the prediction of each word is the set of the probabilities of the word being classified as all possibble words in the dictionary (e.g. a, aaron, ..., cat, ...UNK, EOS):
$$\hat{y}_t = P(token_i)\ for\ i=1,2,..,$$ 
Then loss function at each timestamp is defined by softmax function:
$$L_t(\hat{y}_t,y_t)=-\sum_i y_t\log{\hat{y}_t}$$
Since we have prepared $y_t$ by one-hot encoding, this equation can be reduced to $$L_t=-\log{P(y_t)}$$
Total loss function is:
$$L=\sum_t L_t$$
Replacing $L_t$: 
$$=\sum_t -\log{P(y_t)}$$
$$=-\log{\prod_t}P(y_t)$$

Often we initialise with zeros, so $x_0 = \vec{0}$, $x_1 = y_0$, $x_2 = y_1$... $x_t=EOS$

<img src='./img/loss-function.png'>


Note that $\prod_tP(y_t)$ in the loss function is also the probabbility of sentence. In a language-based RNN, **Language Modelling** is the fundamental for speech recognition and machine translation, which tells the probabilities of sentences. What it means is that it tells by how much the sentence is "real" comparing to sentences we pick up randomly from, say newspapera, emails, webpages etc. 
For example: $$P("The\ apple\ and\ pear\ salad\ is\ tasty.") = 5.7\times 10^{-10}$$ $$and$$ $$P("The\ apple\ and\ pair\ salad\ is\ tasty.") = 3.2\times 10^{-13}$$

How we calculate the probability of sentence is by multiplying the conditional probabilities of words in the sentence. Let's say $y_1,y_2...y_t$ are the words from a sentence, then the probabbility of sentence is:
$$P(y_1,y_2...y_t)=P(y_1)P(y_2)...P(EOS)$$



In [16]:
history = model.fit(features, label_array, epochs=10, verbose=2)


Epoch 1/10
 - 5s - loss: 7.7495 - acc: 0.0440
Epoch 2/10
 - 1s - loss: 7.7072 - acc: 0.0440
Epoch 3/10
 - 1s - loss: 7.6686 - acc: 0.0440
Epoch 4/10
 - 1s - loss: 7.6311 - acc: 0.0440
Epoch 5/10
 - 1s - loss: 7.5948 - acc: 0.0440
Epoch 6/10
 - 1s - loss: 7.5596 - acc: 0.0440
Epoch 7/10
 - 1s - loss: 7.5255 - acc: 0.0440
Epoch 8/10
 - 1s - loss: 7.4924 - acc: 0.0440
Epoch 9/10
 - 1s - loss: 7.4605 - acc: 0.0440
Epoch 10/10
 - 1s - loss: 7.4295 - acc: 0.0440


In [18]:
# given the seed_text and try to predict the next word
seed_text = 'President Donald Trump was in'
token_list = tokenizer.texts_to_sequences([seed_text])
predicted = model.predict(np.array(token_list))
output_word = ""
for word,index in tokenizer.word_index.items():
    if np.any(index == np.argmax(predicted)):
        output_word = word
        break
seed_text += " "+output_word
seed_text

'President Donald Trump was in the'