[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Notebooks/Lecture_20_2023_04_11.ipynb)

# Lecture 20: 2023-04-11 - Recurrent Neural Networks

## Lecture Overview

* Word2Vec Assignment (description and questions)
* Word embeddings and Neural Networks (Masked Language Models)
* Recurrent Neural Networks (RNNs)
* Long Short-Term Memory (LSTM) Networks
* Gated Recurrent Units (GRUs)
* Bidirectional Recurrent Neural Networks (BRNNs)
* Attention Mechanisms

## Word2Vec Assignment

* [Description](../assignment_descriptions/08_Word_Embeddings.md)
* [Notebook](../assignment_notebooks/Word_Embeddings.ipynb)

## Word embeddings and Neural Networks

* [Notebook](./Lecture_19_2023_04_06.ipynb)

## Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) are a class of neural networks that are particularly well suited to processing sequential data such as text. They carry a hidden state from one time step to the next, so earlier elements of a sequence can influence later outputs, and they have been applied to tasks such as unsegmented, connected handwriting recognition and speech recognition.

### Sequence data

* Sequence data is data that is ordered in some way. For example, a sequence of words in a sentence, a sequence of characters in a word, a sequence of pixels in an image, a sequence of notes in a song, a sequence of frames in a video, and so on.

* Unlike Bag-of-Words models, sequence models can take into account the order of the words in a sentence. This makes them ideal for tasks such as machine translation, speech recognition, and text summarization.

* We will follow the standard conventions and model sequence data as follows:

$$x^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)})$$

Where $T$ is the length of the sequence and $x_t^{(i)}$ is the $t^{th}$ element of the $i^{th}$ sequence in the training set.
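
As a rough illustration of this notation, the sketch below stores three toy sequences of different lengths $T$ and pads them into a single tensor. The token ids are made up for the example, and padding with 0 is an assumption rather than something the notation requires.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy data: three sequences x^(1), x^(2), x^(3) of token ids
# with lengths T = 4, 2, and 3 (the ids themselves are made up).
sequences = [
    torch.tensor([4, 7, 1, 9]),
    torch.tensor([3, 5]),
    torch.tensor([8, 2, 6]),
]

# T for each sequence
lengths = [len(seq) for seq in sequences]

# Pad with 0 so the sequences stack into one (batch, max_T) tensor
batch = pad_sequence(sequences, batch_first=True, padding_value=0)

print(lengths)
print(batch.shape)
print(batch[1])
```

Here `batch[i, t]` plays the role of $x_{t+1}^{(i+1)}$, since Python indices start at 0.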

### Different categories of sequence models

* one to one - input layer is a single value (vector or scalar), output layer is a single value (vector or scalar). For example, image classification is a one to one model.
* one to many - input layer is a single value (vector or scalar), output layer is a sequence. For example, image captioning is a one to many model.
* many to one - input layer is a sequence, output layer is a single value (vector or scalar). For example, sentiment analysis is a many to one model.
* many to many - input layer is a sequence, output layer is a sequence. For example, machine translation is a many to many model. Some variants of this model depend on the synchronization of the input and output sequences. For example, in video classification, the input and output sequences are synchronized, whereas in machine translation, the input and output sequences are not synchronized.

<center><img src="http://karpathy.github.io/assets/rnn/diags.jpeg" width="800" height="300"></center>

N.B.: a rectangle is a vector and arrows are functions. 

source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
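
The difference between the many to one and the synchronized many to many categories can be sketched with a single PyTorch RNN layer. The tensor sizes below are arbitrary values chosen for illustration, and the variable names are not from the lecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only
batch_size, seq_len, input_size, hidden_size = 2, 6, 4, 3
x = torch.randn(batch_size, seq_len, input_size)

rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
output, h_n = rnn(x)  # output holds the hidden state at every time step

# many to many (synchronized): keep one vector per time step
many_to_many = output

# many to one: keep only the last time step, e.g. as input to a classifier
many_to_one = output[:, -1, :]

print(many_to_many.shape)
print(many_to_one.shape)

# For a single-layer, unidirectional RNN the last output equals h_n
print(torch.allclose(many_to_one, h_n[0]))
```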

In [None]:
%%javascript
IPython.load_ipython_extensions([
  "nb-mermaid/nb-mermaid"
]);

### RNN Architecture

#### Standard feedforward neural network

```mermaid
graph BT
    i[Input] --> h((Hidden Layer))
    h --> o[Output]

```

#### Recurrent Neural Network

```mermaid

graph BT
    i[Input] --> h((Hidden State))
    h --> h
    h --> o[Output]

```

Recall that in a standard feedforward neural network, data is processed by passing the inputs to the hidden layer and then to the output layer. In a recurrent neural network, the hidden layer receives two things at each time step: the input for the current time step and the hidden state from the previous time step. This allows the network to process the data sequentially.

#### Single and Multiple layer RNNs

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_04.png?raw=true" width="800" height="600"></center>

#### Notes

* The hidden state $h^{(t)}$ is the output of the hidden layer at time step $t$. It is also an input to the hidden layer at time step $t+1$.
* With multiple layers, the hidden state of layer 1 is written $h^{(t)}_{1}$, the hidden state of layer 2 is written $h^{(t)}_{2}$, etc.
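
A minimal sketch of a two-layer RNN in PyTorch, assuming the same toy sizes used elsewhere in this lecture (`input_size=5`, `hidden_size=2`). It shows that the layer returns the top layer's hidden state for every time step, plus one final hidden state per layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn_2layer = nn.RNN(input_size=5, hidden_size=2, num_layers=2, batch_first=True)
x = torch.randn(1, 3, 5)  # (batch_size, sequence_length, input_size)

output, h_n = rnn_2layer(x)

# output: hidden states of the top layer at every time step
print(output.shape)
# h_n: final hidden state of each layer, shape (num_layers, batch_size, hidden_size)
print(h_n.shape)

# Layer 2 takes layer 1's hidden state as its input, so its input weights are 2 x 2
print(rnn_2layer.weight_ih_l1.shape)

# The top layer's last output is the same tensor as the last entry of h_n
print(torch.allclose(output[:, -1], h_n[-1]))
```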

### Activations in RNN

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_05.png?raw=true" width="800" height="400"></center>

* $W_{xh}$ is the weight matrix for the input to the hidden layer
* $W_{hh}$ is the weight matrix for the hidden layer to the hidden layer (recurrent edge)
* $W_{ho}$ is the weight matrix for the hidden layer to the output layer
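
With these three matrices, the computation at one time step can be written as below. This is a sketch under some assumptions: the bias vectors $b_h$ and $b_o$ and the output activation $\sigma_o$ are symbols introduced here for illustration, and $\tanh$ is assumed as the hidden activation because it is the default in PyTorch's `nn.RNN`.

$$h^{(t)} = \tanh\left(W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h\right)$$

$$o^{(t)} = \sigma_o\left(W_{ho}\, h^{(t)} + b_o\right)$$

PyTorch stores two separate bias vectors (`bias_ih_l0` and `bias_hh_l0`); their sum corresponds to the single $b_h$ above.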

In [None]:
# code adapted from Raschka, 2020, Deep Learning with PyTorch

import torch
import torch.nn as nn

torch.manual_seed(1)

# our rnn layer
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

# weights
w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0

# biases
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0

print('W_xh shape: ', w_xh.shape)
print('W_hh shape: ', w_hh.shape)
print('b_xh shape: ', b_xh.shape)
print('b_hh shape: ', b_hh.shape)

N.B.: Because the layer was created with `batch_first=True`, the input shape is (batch_size, sequence_length, input_size=5).


In [None]:
# Run a forward pass
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

## output of the RNN layer
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))
print('output shape: ', output.shape)
print('hn shape: ', hn.shape)

In [None]:
## Analyzing the RNN layer in comparison to manual computation
out = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print('    Input      :', xt.numpy())
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print('    Hidden     :', ht.detach().numpy())
    
    if t > 0:
        prev_h = out[t-1]
    else:
        prev_h = torch.zeros((ht.shape))
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out.append(ot)
    print('    Output     :', ot.detach().numpy())
    print('    RNN output :', output[:, t].detach().numpy())

The tensor `ht` holds the input contribution to the hidden state: the product of the input tensor `xt` and the transposed weight matrix `w_xh`, plus the bias term `b_xh`. It is a pre-activation value, not yet the hidden state itself. Calling `detach()` returns a tensor that is cut off from the autograd graph, which is required before it can be converted to a NumPy array with `numpy()`.

If the current time step is greater than 0, the previous hidden state tensor `prev_h` is set to the value of `out[t-1]`. Otherwise, `prev_h` is initialized to a tensor of zeros with the same shape as `ht`.

The tensor `ot` is then computed by adding `ht` to the product of `prev_h` and the transposed weight matrix `w_hh`, plus the bias term `b_hh`, and passing the sum through the `tanh()` activation function. This is the hidden state at time step `t`, which `nn.RNN` returns as its output for that step.

Finally, `ot` is appended to the list `out`, and `ot` is printed next to the corresponding slice of `output` so that the two can be compared.
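
Rather than comparing the printed values by eye, the agreement can also be checked numerically. The following sketch repeats the manual computation in a compact form and compares it with the layer's output using `torch.allclose`; the variable names `manual` and `matches` are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)
x_seq = torch.tensor([[1.0] * 5, [2.0] * 5, [3.0] * 5])

with torch.no_grad():
    # Reference: let PyTorch process the whole sequence
    output, hn = rnn_layer(x_seq.reshape(1, 3, 5))

    # Manual recurrence, starting from a zero hidden state
    h_prev = torch.zeros(1, 2)
    manual = []
    for t in range(3):
        x_t = x_seq[t].reshape(1, 5)
        pre_activation = (
            x_t @ rnn_layer.weight_ih_l0.T + rnn_layer.bias_ih_l0
            + h_prev @ rnn_layer.weight_hh_l0.T + rnn_layer.bias_hh_l0
        )
        h_prev = torch.tanh(pre_activation)
        manual.append(h_prev)

    # Stack the per-step states into shape (batch, time, hidden)
    manual = torch.stack(manual, dim=1)

matches = torch.allclose(manual, output, atol=1e-6)
print(matches)
```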

### Problems with RNNs

* Vanishing gradients
* Exploding gradients
* Long-term dependencies

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_08.png?raw=true" height="400" width="800"></center>

#### Vanishing gradients

As the length of the sequence increases, the gradient of the loss function with respect to early time steps tends to shrink. During backpropagation through time, this gradient is a product of one factor per time step, each involving the recurrent weight matrix $W_{hh}$ and the derivative of the activation function. When these factors are smaller than 1 in magnitude, the product decreases exponentially with the number of steps. As a consequence, early time steps contribute almost nothing to the weight updates, and the network takes a long time to learn from distant context, if it learns from it at all.
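
The shrinking product can be observed on a toy problem. The sketch below is not a full RNN: it assumes a scalar recurrence $h_t = \tanh(w\, h_{t-1})$ with $w = 0.5$, and the helper `grad_through_time` is a hypothetical name introduced for this example. It uses autograd to measure how strongly the final state depends on the initial state.

```python
import torch

def grad_through_time(w_value, num_steps):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    w = torch.tensor(w_value)
    h0 = torch.tensor(0.5, requires_grad=True)
    h = h0
    for _ in range(num_steps):
        h = torch.tanh(w * h)
    h.backward()
    return h0.grad.item()

# The gradient is a product of num_steps factors, each at most |w| = 0.5
grads = {T: grad_through_time(0.5, T) for T in (1, 10, 50)}
for T, g in grads.items():
    print(f"T={T:>2}  gradient={g:.3e}")
```

Each step multiplies the gradient by $w \cdot (1 - \tanh^2(\cdot)) \le 0.5$, so after 50 steps the gradient is bounded by $0.5^{50}$.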


#### Exploding gradients

Gradients can explode when the recurrent weights are large. In RNNs, the gradients are propagated through the same set of weights over multiple time steps, so the chain rule multiplies by a factor involving $W_{hh}$ at every step of backpropagation. When these factors are larger than 1 in magnitude, the product grows exponentially with the length of the sequence, which can lead to very large gradients and unstable weight updates.

##### Gradient clipping

To address the issue of exploding gradients in RNNs, several techniques have been proposed. One such technique is gradient clipping, which involves setting a maximum threshold on the gradient to prevent it from growing too large.
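
In PyTorch, one way to do this is `torch.nn.utils.clip_grad_norm_`, which rescales all gradients in place when their combined norm exceeds a threshold. The sketch below is only an illustration: the loss is deliberately inflated to produce large gradients, and the threshold `max_norm = 1.0` is an arbitrary choice, not a recommended value.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=5, hidden_size=2, batch_first=True)
x = torch.randn(4, 20, 5)  # (batch_size, sequence_length, input_size)

output, _ = rnn(x)
# Deliberately large loss so that the gradients are large as well
loss = (1000.0 * output).pow(2).sum()
loss.backward()

max_norm = 1.0  # illustrative threshold

# Rescales the gradients in place and returns the total norm before clipping
norm_before = torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=max_norm)

# Recompute the total gradient norm after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in rnn.parameters()))

print(f"before: {float(norm_before):.2f}  after: {float(norm_after):.4f}")
```

In a training loop, the clipping call goes between `loss.backward()` and `optimizer.step()`.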

#### Long-term dependencies

RNNs struggle to learn long-term dependencies. The hidden state of the RNN at time step $t$ is computed from the hidden state at time step $t-1$, which is in turn computed from the hidden state at time step $t-2$, and so on. Information from an early time step therefore reaches time step $t$ only after passing through every intermediate update, and the gradient that would teach the network to use that information has to travel back along the same path. Because that gradient tends to vanish over many steps, a plain RNN in practice has difficulty learning dependencies that span long distances.

#### Solutions to the problems with RNNs

* Long short-term memory (LSTM)
* Gated recurrent unit (GRU)

##### Long short-term memory (LSTM)

The LSTM is a type of RNN designed to mitigate the vanishing gradient problem and thereby make long-term dependencies easier to learn. Exploding gradients can still occur and are usually handled with gradient clipping. The LSTM has a memory cell that can store information for long periods of time, and three gates that control the flow of information into and out of the memory cell: the input gate, the forget gate, and the output gate.

<center><img src="https://github.com/rasbt/machine-learning-book/blob/main/ch15/figures/15_09.png?raw=true" height="400" width="800"></center>

* cell state - carried along the recurrent edge; it is the memory of the network
* input gate (i) - controls how much new information flows into the cell state
* output gate (o) - controls how much of the cell state is exposed as the hidden state
* forget gate (f) - controls how much of the previous cell state is kept or discarded

Good news: the LSTM architecture is super easy to implement in either PyTorch or TensorFlow
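
To make the role of the gates concrete, the sketch below computes one LSTM step by hand and compares it with PyTorch's `nn.LSTMCell`. It relies on PyTorch stacking the gate weights in the order input, forget, candidate, output; the names `g` (candidate values) and the toy sizes are choices made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 5, 2
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

with torch.no_grad():
    # All four gate pre-activations are computed in one matrix product
    gates = (
        x_t @ cell.weight_ih.T + cell.bias_ih
        + h_prev @ cell.weight_hh.T + cell.bias_hh
    )
    i, f, g, o = gates.chunk(4, dim=1)

    i = torch.sigmoid(i)  # input gate
    f = torch.sigmoid(f)  # forget gate
    g = torch.tanh(g)     # candidate values for the cell state
    o = torch.sigmoid(o)  # output gate

    c_t = f * c_prev + i * g      # keep part of the old memory, add new information
    h_t = o * torch.tanh(c_t)     # expose part of the memory as the hidden state

    # Reference implementation
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-6))
print(torch.allclose(c_t, c_ref, atol=1e-6))
```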

##### Gated recurrent unit (GRU)

The GRU is a simpler gated RNN that targets the same problems as the LSTM. Unlike the LSTM, it has no separate memory cell: the hidden state itself carries the memory. The GRU has two gates, the update gate and the reset gate. The update gate controls how much of the previous hidden state is carried forward, and the reset gate controls how much of the previous hidden state is used when computing the new candidate state.

* For further information, see Raschka et al. (2022) and Bansal (2021).
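
One practical consequence of the different gate counts is model size. The sketch below counts the parameters of a plain RNN, a GRU, and an LSTM layer in PyTorch, assuming the toy sizes used earlier (`input_size=5`, `hidden_size=2`); the helper `count_params` is a hypothetical name introduced here.

```python
import torch.nn as nn

def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

kwargs = dict(input_size=5, hidden_size=2, batch_first=True)

counts = {
    "RNN": count_params(nn.RNN(**kwargs)),    # one set of weights
    "GRU": count_params(nn.GRU(**kwargs)),    # three sets: reset, update, candidate
    "LSTM": count_params(nn.LSTM(**kwargs)),  # four sets: input, forget, candidate, output
}

print(counts)  # {'RNN': 18, 'GRU': 54, 'LSTM': 72}
```

The plain RNN has $2 \cdot 5 + 2 \cdot 2 + 2 + 2 = 18$ parameters; the GRU and LSTM repeat that block three and four times, respectively.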

### RNN Example in TensorFlow

In [None]:
!pip install tensorflow_text --quiet --exists-action i

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as text
import numpy as np
import pandas as pd

In [None]:
# Load the IMDB reviews dataset
dataset, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

# Split the dataset into train and test
train, validate = dataset['train'], dataset['test']

# Examine the dataset
train.element_spec

In [None]:
# Dataset Info
info

In [None]:
# Examine a review
for eg, label in train.take(1):
  print("text: ", eg.numpy())
  print("label: ", label.numpy())

In [None]:
# Shuffle and batch the data
BUFFER_SIZE = 10_000
BATCH_SIZE = 64

# create a dataset of batches - see https://www.tensorflow.org/guide/data_performance#prefetching
train_dataset = train.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
validate_dataset = validate.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

In [None]:
for eg, label in train_dataset.take(1):
  print("texts: ", eg.numpy()[:3])
  print("labels: ", label.numpy()[:3])

In [None]:
# Set our vocabulary size
VOCAB_SIZE = 1000

# Create a text vectorization layer
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

In [None]:
# Examine the vocabulary
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

In [None]:
# Examine the encoded text
encoder_example = encoder(eg)[:3].numpy()
encoder_example

In [None]:
# compare the original text to the encoded text
for n in range(3):
  print("Original: ", eg[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoder_example[n]]))
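
The round trip above replaces out-of-vocabulary words with `[UNK]`. The pure-Python sketch below approximates what the vectorization layer does by default: lowercase the text, strip punctuation, split on whitespace, and map each token to its index in a frequency-ordered vocabulary, with index 0 reserved for padding and index 1 for unknown tokens. The helper names and the toy texts are invented for this example, and the regular expression is only a rough stand-in for the layer's own standardization.

```python
import re
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation (rough stand-in for the layer's default)
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is unknown."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    index = {tok: i for i, tok in enumerate(vocab)}
    # Tokens missing from the vocabulary map to index 1 ([UNK])
    return [index.get(tok, 1) for tok in standardize(text).split()]

# Hypothetical toy corpus
texts = ["The movie was great!", "The movie was bad.", "Great acting, bad plot."]
toy_vocab = build_vocab(texts, max_tokens=6)
ids = encode("The plot was great", toy_vocab)

print(toy_vocab)
print(ids)
print(" ".join(toy_vocab[i] for i in ids))
```

With only six slots, rare words such as "plot" fall outside the vocabulary and come back as `[UNK]`, which is the same effect seen with `VOCAB_SIZE = 1000` on the IMDB reviews.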

In [None]:
# Create a model
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

In [None]:
# Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.Adam(1e-4), metrics=['accuracy'])

In [None]:
# train the model
history = model.fit(train_dataset, epochs=10, validation_data=validate_dataset, validation_steps=30)

In [8]:
# validate our model
val_loss, val_acc = model.evaluate(validate_dataset)

print('Test Loss:', val_loss)
print('Test Accuracy:', val_acc)


In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

  
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

In [None]:
# test our model
sample_text = ('The movie was a joke. The animation and the graphics '
               'were out of this world, but the acting was horrendous. I would not recommend this movie.')

In [None]:
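# predict the sentiment
# the model ends in Dense(1) without an activation, so the output is a logit:
# values >= 0 lean positive, values < 0 lean negative
prediction = model.predict(tf.constant([sample_text]))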

In [None]:
# Show the results
prediction
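
Because the model was compiled with `from_logits=True`, the value returned by `predict` is a raw logit rather than a probability. A minimal sketch of the conversion, using made-up logits rather than outputs of the trained model:

```python
import math

def sigmoid(logit):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical logits, not outputs of the trained model
for logit in (-2.0, 0.0, 2.0):
    prob = sigmoid(logit)
    label = "positive" if prob >= 0.5 else "negative"
    print(f"logit={logit:+.1f}  p(positive)={prob:.3f}  -> {label}")
```

In the IMDB dataset, label 1 marks a positive review and label 0 a negative one, so a probability above 0.5 corresponds to a positive prediction.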

In [None]:
# Our LSTM model
model.summary()

### BiLSTM Model

In [None]:
# Model
model_bilstm = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
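
A bidirectional layer runs one LSTM over the sequence from left to right and a second one from right to left, then combines their outputs. The sketch below illustrates the effect on tensor shapes using PyTorch's `nn.LSTM` rather than the Keras layer above; the sizes are arbitrary and chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(2, 7, 8)  # (batch_size, sequence_length, features)

uni = nn.LSTM(input_size=8, hidden_size=4, batch_first=True)
bi = nn.LSTM(input_size=8, hidden_size=4, batch_first=True, bidirectional=True)

out_uni, _ = uni(x)
out_bi, (h_n, _) = bi(x)

print(out_uni.shape)  # one direction: hidden_size features per step
print(out_bi.shape)   # two directions: 2 * hidden_size features per step
print(h_n.shape)      # (num_directions, batch_size, hidden_size)

# The forward pass finishes at the last time step, the backward pass at the first
forward_final = out_bi[:, -1, :4]
backward_final = out_bi[:, 0, 4:]

# Concatenating both final states gives one summary vector per sequence
summary = torch.cat([h_n[0], h_n[1]], dim=1)
print(summary.shape)
```

The Keras `Bidirectional` wrapper concatenates the two directions by default as well, so `Bidirectional(LSTM(64))` passes 128 features to the next layer.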

In [None]:
# compile our bidirectional LSTM model
model_bilstm.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.Adam(1e-4), metrics=['accuracy'])

In [None]:
# train our model
history = model_bilstm.fit(train_dataset, epochs=10, validation_data=validate_dataset, validation_steps=30)

In [None]:
# plot_graphs and the matplotlib import are defined in the LSTM section above; reuse them here
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

## Character RNN

[Notebook](https://colab.research.google.com/drive/1Et8IO-BCBdSYkhkTcCbo624gqfJD9H7h#scrollTo=cTqhw4K0qIBx)