### Recurrent Neural Networks
In this section, we train a basic RNN on the IMDB movie reviews dataset. The task is to predict whether a movie review is positive or negative.

In [4]:
%%bash
pip install tensorflow==2.1.0rc1
pip install tensorflow-datasets



In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

tf.compat.v1.enable_eager_execution()

import tensorflow_datasets as tfds

# Downloading the data using tensorflow-datasets
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']


### Embedding the data

Raw text is not a good representation for ML algorithms, since they expect vector inputs. There are multiple ways to transform text into vectors. The simplest is to build a dictionary of all the words in the English language and encode each word as a one-hot vector, with a 1 at its position in that dictionary. However, this results in very sparse matrices, and it cannot represent a word that is not in the dictionary (e.g. hellooo).
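
To make the sparsity and out-of-vocabulary problems concrete, here is a minimal sketch of one-hot encoding over a tiny, made-up dictionary. The word list and the `one_hot` helper are illustrative assumptions and are not part of the dataset pipeline used in this notebook.

```python
# Toy dictionary (an assumption for illustration); a real one would hold
# every word of the language, making each vector extremely sparse.
dictionary = ["hello", "movie", "good", "bad"]
word_to_index = {word: i for i, word in enumerate(dictionary)}


def one_hot(word):
    """Return a one-hot list for `word`, or None if it is not in the dictionary."""
    if word not in word_to_index:
        return None  # out-of-vocabulary: there is no position to set to 1
    vector = [0] * len(dictionary)
    vector[word_to_index[word]] = 1
    return vector


print(one_hot("good"))     # [0, 0, 1, 0]
print(one_hot("hellooo"))  # None
```

Each vector has as many entries as the dictionary has words, and all but one of them are 0.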

In this example, the dataset is already encoded using a variant of "Byte-pair encoding". 

#### Byte-Pair Encodings
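To build a BPE-style subword vocabulary, follow the steps below. (The original BPE algorithm builds its vocabulary iteratively, by repeatedly merging the most frequent pair of adjacent symbols; the steps here are a simplified view.)

Initial vocabulary (word: count): $low: 5$, $lowest: 2$, $newer: 6$, $wider: 3$

1.   Append an end-of-word token $</w>$ to each word
2.   List all of the possible subword n-grams and count their occurrences: $r</w>: 9$, $er</w>: 9$, $lo: 7$, $low: 7$, ...
3.   Keep the N most common subwords
4.   You can now encode a word you have never seen before using those subwords: e.g. $lower: low\_er$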

A very popular subword tokenizer that implements byte-pair encoding is SentencePiece: https://arxiv.org/abs/1808.06226.
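
The counting steps above can be sketched in plain Python. This is a rough illustration of the simplified procedure described in this section, not the encoder shipped with the dataset: the helper names (`count_subwords`, `encode`), the choice `N = 7`, the restriction to n-grams of at least two symbols, and the greedy longest-match encoding are all assumptions made for the example.

```python
from collections import Counter

END = "</w>"  # end-of-word token
vocab = {"low": 5, "lowest": 2, "newer": 6, "wider": 3}  # word -> count


def count_subwords(vocab):
    """Count every n-gram (2 or more symbols) of every word, weighted by word count."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = list(word) + [END]
        for i in range(len(symbols)):
            for j in range(i + 2, len(symbols) + 1):
                counts["".join(symbols[i:j])] += freq
    return counts


def encode(word, subwords):
    """Greedy longest-match segmentation; falls back to single symbols."""
    symbols = list(word) + [END]
    pieces, i = [], 0
    while i < len(symbols):
        for j in range(len(symbols), i, -1):  # try the longest candidate first
            piece = "".join(symbols[i:j])
            if piece in subwords or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


counts = count_subwords(vocab)
print(counts["r</w>"], counts["er</w>"], counts["lo"], counts["low"])  # 9 9 7 7

N = 7  # keep the N most common subwords
subwords = {subword for subword, _ in counts.most_common(N)}
print(encode("lower", subwords))  # ['low', 'er</w>']
```

The word `lower` never appears in the toy vocabulary, yet it is encoded from the two frequent subwords `low` and `er</w>`.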


In [6]:
# The encoder used for this dataset is available in the `info` variable

encoder = info.features['text'].encoder

sample_string = 'Hello, world.'


encoded_string = encoder.encode(sample_string)
original_string = encoder.decode(encoded_string)
print ('The original string: "{}"'.format(original_string))
print ('Encoded string is {}'.format(encoded_string))

# Number of "subwords"
print ('Vocabulary size (number of subwords): {} '.format(encoder.vocab_size))

for index in encoded_string:
  print ('{} ----> {}'.format(index, encoder.decode([index])))

The original string: "Hello, world."
Encoded string is [4025, 8040, 2, 562, 7975]
Vocabulary size (number of subwords): 8185 
4025 ----> Hell
8040 ----> o
2 ----> , 
562 ----> world
7975 ----> .


### What does our data look like?

In [7]:
for data in train_dataset.take(10):
  print(f'Encoded data: {data[0][:10]} (truncated to the first 10 subword tokens)')
  print(f'Decoded data: {encoder.decode(data[0])}')
  print(f'Label: {data[1]}')
  print(f'Shape: {data[0].shape}')

Encoded data: [ 249    4  277  309  560    6 6639 4574    2   12] (truncated to the first 10 subword tokens)
Decoded data: As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.<br /><br />Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a cliché, but he was a people's writer.<br /><br />And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result

### Padding the data

At the moment, each row in our dataset has a different length. To feed rows to the model in batches, all rows in a batch must have the same length. To achieve that, we pad each batch with 0s up to the length of its longest row.
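
As a rough illustration of what padding does to a single batch, here is a plain-Python sketch. The `pad_batch` helper and the token ids are made up for the example; the notebook itself relies on `padded_batch` from `tf.data`.

```python
def pad_batch(rows, pad_value=0):
    """Pad every row with `pad_value` up to the length of the longest row."""
    max_len = max(len(row) for row in rows)
    return [row + [pad_value] * (max_len - len(row)) for row in rows]


batch = [[249, 4, 277], [12], [562, 7975]]  # three encoded reviews of different lengths
padded = pad_batch(batch)
print(padded)  # [[249, 4, 277], [12, 0, 0], [562, 7975, 0]]
```

After padding, the batch can be stored as a rectangular array of shape `(3, 3)`.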


In [0]:
# For an explanation of buffer_size, see
# https://stackoverflow.com/questions/46444018/meaning-of-buffer-size-in-dataset-map-dataset-prefetch-and-dataset-shuffle
# The following code creates a pipeline that feeds our model
# batches of size BATCH_SIZE from the training and testing datasets.

BATCH_SIZE = 64
BUFFER_SIZE = 10000

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_dataset))

### Creating our Model

The model is illustrated in the following picture. We use a bidirectional recurrent neural network to capture information from the whole sentence, because a word can depend on words anywhere in the sentence (not just the ones before it).

![](https://drive.google.com/uc?id=1yZWleHpX7xFJJ-Sd-7CaqRbkE5YYWtnV)
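
To give a feel for what "bidirectional" means, here is a minimal sketch with a scalar RNN. The weights `w` and `u`, the helper names, and the toy inputs are assumptions for illustration; the Keras layer below uses 64-dimensional states and learned weights, and its internals may differ in detail.

```python
import math


def run_rnn(xs, w=0.8, u=0.5):
    """Scalar RNN h_t = tanh(w * x_t + u * h_{t-1}); returns the final state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h


def bidirectional(xs):
    forward = run_rnn(xs)         # reads the sequence first -> last
    backward = run_rnn(xs[::-1])  # reads the sequence last -> first
    return [forward, backward]    # the two summaries are concatenated


features = bidirectional([0.1, -0.4, 0.9])
print(features)
```

The forward state is dominated by the end of the sequence and the backward state by its beginning, so the concatenation carries information from both directions.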


In [9]:
rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

rnn.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = rnn.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = rnn.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
    391/Unknown - 81s 207ms/step - loss: 0.5295 - accuracy: 0.8344
Test Loss: 0.5295045492822862
Test Accuracy: 0.8344399929046631


### Exploding/Vanishing Gradient

Recall that the formula for an RNN is given by:

$h_t = \tanh(W x_t + U h_{t-1})$ and $\hat{y}_t = \mathrm{softmax}(V h_t)$

The loss at step $t$ is given by:

$L(y_t, \hat{y}_t) = - y_t \log(\hat{y}_t)$

For a sequence of $T$ steps, the total loss is then given by the sum:

$L(y, \hat{y}) = \sum_{t=1}^{T} L(y_t, \hat{y}_t)$

When we take the gradient of the loss at the last step, $L = L(y_T, \hat{y}_T)$, with respect to the weights, the gradient contains the following term, which flows all the way back to the first step:

$\frac{\partial L}{\partial \hat{y}_T}\frac{\partial \hat{y}_T}{\partial h_T}\frac{\partial h_T}{\partial h_{T-1}} \cdots \frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W}$

Therefore, when the factors $\frac{\partial h_t}{\partial h_{t-1}}$ are consistently small (or large), this product, and with it our gradient, vanishes (explodes). [This paper](https://arxiv.org/pdf/1211.5063.pdf) gives exact conditions for when this happens.
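
The effect of the repeated factor can be checked numerically with a scalar RNN. This is a minimal sketch under simplifying assumptions: scalar weights `w` and `u`, all inputs equal to 0 and $h_0 = 0$, so that the state stays at 0 and each factor $\frac{\partial h_t}{\partial h_{t-1}} = u (1 - h_t^2)$ reduces to `u`. The `jacobian_product` helper is hypothetical.

```python
import math


def jacobian_product(u, xs, w=1.0, h0=0.0):
    """Product of dh_t/dh_{t-1} = u * (1 - h_t**2) along the sequence `xs`
    for the scalar RNN h_t = tanh(w * x_t + u * h_{t-1})."""
    h, product = h0, 1.0
    for x in xs:
        h = math.tanh(w * x + u * h)
        product *= u * (1.0 - h * h)
    return product


xs = [0.0] * 50                      # 50 time steps, all-zero inputs
small = jacobian_product(0.5, xs)    # every factor is 0.5
large = jacobian_product(1.5, xs)    # every factor is 1.5
print(small, large)
```

After only 50 steps, the product is below $10^{-12}$ in the first case and above $10^{8}$ in the second, which is what vanishing and exploding gradients look like in this toy setting.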

### Long Short-Term Memory (LSTM) Networks

LSTMs have 3 gates:

*   Forget gate: $F_t= \sigma(W^f [h_{t-1}, x_t])$
*   Input gate: $I_t= \sigma(W^i [h_{t-1}, x_t])$
*   Output gate: $O_t= \sigma(W^o [h_{t-1}, x_t])$

A cell state:

*   $\tilde{C}_t = \tanh(W^C [h_{t-1}, x_t])$
*   $C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$

And the outputs, which also serve as the hidden state $h_t$ for the next step, are given by:

*   $o_t = O_t \odot \tanh(C_t)$
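
The gate and cell-state equations above can be sketched as a single scalar LSTM step. This is an illustration only: the `lstm_step` helper, the `weights` dictionary (one `(w_h, w_x)` pair per gate, applied to $[h_{t-1}, x_t]$), and the absence of bias terms are assumptions made to mirror the formulas, not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(h_prev, c_prev, x, weights):
    """One scalar LSTM step; returns (output o_t, new cell state C_t)."""
    def affine(key):
        w_h, w_x = weights[key]
        return w_h * h_prev + w_x * x

    f = sigmoid(affine("f"))          # forget gate F_t
    i = sigmoid(affine("i"))          # input gate I_t
    o_gate = sigmoid(affine("o"))     # output gate O_t
    c_tilde = math.tanh(affine("c"))  # candidate cell state
    c = f * c_prev + i * c_tilde      # C_t = F_t * C_{t-1} + I_t * candidate
    out = o_gate * math.tanh(c)       # o_t = O_t * tanh(C_t)
    return out, c


# With all weights at zero every gate equals 0.5 and the candidate is 0,
# so the result is easy to check by hand: C_t = 0.5 * C_{t-1}.
weights = {"f": (0.0, 0.0), "i": (0.0, 0.0), "o": (0.0, 0.0), "c": (0.0, 0.0)}
out, c = lstm_step(h_prev=0.0, c_prev=2.0, x=1.0, weights=weights)
print(out, c)
```

Note that the previous cell state reaches the new one through a single multiplication by the forget gate, without passing through a squashing function.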



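TL;DR: Since $C_t = F_t C_{t-1} + I_t \tilde{C_t}$, the derivative $\frac{\partial C_t}{\partial C_{t-1}}$ contains the forget gate $F_t$ as a direct term. There is therefore a path through the cell state with no repeated squashing or weight multiplication, and as long as the forget gate stays close to 1, at least this path propagates the gradient.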

https://d-nb.info/1082034037/34

![](https://drive.google.com/uc?id=1T4ea91weNsiNzkYuKMWX5oL10KBtMk0w)





In [10]:
lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

lstm.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])


history = lstm.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = lstm.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
    391/Unknown - 202s 517ms/step - loss: 0.3914 - accuracy: 0.8587
Test Loss: 0.3914450703145903
Test Accuracy: 0.858680009841919


### Attention

Not every word in a text contributes equally to the classification. The intuition behind the attention mechanism is that it provides us with a way to "weight" each word.

The implementation of attention we're using is from [this paper](https://www.cc.gatech.edu/~dyang888/docs/naacl16.pdf).

We add an attention layer on top of the bidirectional LSTM, which computes:

*   A representation of the output of the LSTM at each time step: $u_i = \tanh(W o_i + b)$
*   The weight of each word: $\alpha_i = \frac{\exp(u_i^\top u_w)}{\sum_t \exp(u_t^\top u_w)}$, where $u_w$ is a learned context vector (this is a softmax, so the weights sum to 1)
*   The sentence vector: $s = \sum_i \alpha_i o_i$

In our model, we pass the sentence vector $s$ to the MLP before doing the classification.
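
The attention computation described above can be sketched in plain Python on a three-step toy sequence. The `attention_pool` helper and the values of `W`, `b` and the context vector `u_w` are assumptions chosen for readability; in the model they are learned parameters of the layer.

```python
import math


def attention_pool(outputs, W, b, u_w):
    """outputs: T vectors of size d; W: d x d matrix; b and u_w: vectors of size d.
    Returns the attention weights and the sentence vector."""
    scores = []
    for o in outputs:
        # u_i = tanh(W o_i + b)
        u = [math.tanh(sum(W[r][k] * o[k] for k in range(len(o))) + b[r])
             for r in range(len(b))]
        # unnormalized score: dot product of u_i with the context vector
        scores.append(sum(ui * uw for ui, uw in zip(u, u_w)))
    # softmax over the time steps
    exps = [math.exp(score) for score in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # sentence vector: weighted sum of the LSTM outputs
    d = len(outputs[0])
    s = [sum(a * o[k] for a, o in zip(alphas, outputs)) for k in range(d)]
    return alphas, s


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy LSTM outputs, T=3, d=2
W = [[1.0, 0.0], [0.0, 1.0]]                    # identity, for readability
b = [0.0, 0.0]
u_w = [1.0, 0.0]                                # context vector favouring feature 0
alphas, s = attention_pool(outputs, W, b, u_w)
print(alphas, s)
```

The first and third steps, which have a non-zero first feature, receive equal weights that are larger than the weight of the second step.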


In [11]:
import tensorflow as tf


def dot_product(x, kernel):
    """
    Wrapper around the Keras backend dot product, used to apply the
    layer weights to the input.
    Args:
        x: input tensor
        kernel: weights tensor
    Returns:
        The dot product of `x` and `kernel`.
    """
    return tf.squeeze(tf.keras.backend.dot(x, tf.expand_dims(kernel, axis=-1)), axis=-1)


class AttentionWithContext(tf.keras.layers.Layer):
    """
    Taken from: https://towardsdatascience.com/nlp-learning-series-part-3-attention-cnn-and-what-not-for-text-classification-4313930ed566
    
    Attention operation, with a context/query vector, for temporal data.
    Supports Masking.
    Follows the work of Yang et al. [https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf]
    "Hierarchical Attention Networks for Document Classification"
    by using a context vector to assist the attention
    # Input shape
        3D tensor with shape: `(samples, steps, features)`.
    # Output shape
        2D tensor with shape: `(samples, features)`.
    How to use:
    Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
    The dimensions are inferred based on the output shape of the RNN.
    Note: The layer has been tested with Keras 2.0.6
    Example:
        model.add(LSTM(64, return_sequences=True))
        model.add(AttentionWithContext())
        # next add a Dense layer (for classification/regression) or whatever...
    """

    def __init__(self,
                 W_regularizer=None, u_regularizer=None, b_regularizer=None,
                 W_constraint=None, u_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        self.supports_masking = True
        self.init = tf.keras.initializers.get('glorot_uniform')

        self.W_regularizer = tf.keras.regularizers.get(W_regularizer)
        self.u_regularizer = tf.keras.regularizers.get(u_regularizer)
        self.b_regularizer = tf.keras.regularizers.get(b_regularizer)

        self.W_constraint = tf.keras.constraints.get(W_constraint)
        self.u_constraint = tf.keras.constraints.get(u_constraint)
        self.b_constraint = tf.keras.constraints.get(b_constraint)

        self.bias = bias
        super(AttentionWithContext, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1], input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[-1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)

        self.u = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_u'.format(self.name),
                                 regularizer=self.u_regularizer,
                                 constraint=self.u_constraint)

        super(AttentionWithContext, self).build(input_shape)

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):

        # hidden representation of the output of the LSTM:
        uit = dot_product(x, self.W)

        if self.bias:
            uit += self.b

        uit = tf.tanh(uit)

        # Weights of each word in the sentence \alpha
        ait = dot_product(uit, self.u)

        a = tf.exp(ait)

        # The mask is deliberately ignored here. To honor it, apply it after
        # the exp; the weights are re-normalized just below:
        # if mask is not None:
        #     a *= tf.cast(mask, tf.keras.backend.floatx())

        # In some cases, especially in the early stages of training, the sum may be
        # almost zero, which results in NaNs. A workaround is to add a very small
        # positive number ε to the sum.
        a /= tf.cast(tf.keras.backend.sum(a, axis=1, keepdims=True) + tf.keras.backend.epsilon(), tf.keras.backend.floatx())

        a = tf.expand_dims(a, axis=-1)

        # Sentence representation:
        weighted_input = x * a
        return tf.keras.backend.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0], input_shape[-1]


model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    AttentionWithContext(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])


history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
    391/Unknown - 248s 635ms/step - loss: 0.4419 - accuracy: 0.8568
Test Loss: 0.4419185679282069
Test Accuracy: 0.8567600250244141
