In [7]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

import pandas as pd
pd.options.mode.chained_assignment = None

import numpy as np
from sklearn.metrics import accuracy_score
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from jupyterquiz import display_quiz
import plotly.graph_objects as go
import ipywidgets as widgets
from IPython.display import display

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

file_path = 'HP_5_chapters.txt'

with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
input_sequences = []

for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

# Vanilla Recurrent Neural Networks (RNNs)

![meme.png](images/meme.png)

## Overview
Convolutional Neural Networks (CNNs) excel at tasks like image classification, where fixed-size inputs correspond to fixed-size outputs. However, they face challenges with variable-length sequences, such as time series, text sequences, and image sequences. Recurrent Neural Networks (RNNs) come to the forefront as a solution for processing **sequential** data.

RNNs find application in diverse fields such as speech recognition, music generation, sentiment analysis, video processing, and text analysis and translation. Their ability to handle sequences makes them a powerful tool in capturing temporal dependencies.

The term **Vanilla RNN** is often used to refer to the basic form of a recurrent neural network with a single hidden layer and without architectural enhancements. A vanilla RNN has a simple architecture consisting of an **input layer**, a recurrent **hidden layer** and an **output layer**.

A basic RNN processes a time series of input data $\boldsymbol X$ by estimating the output $\boldsymbol o_t$ given the input vector $\boldsymbol x_t$ and the hidden state vector $\boldsymbol h_t$. The hidden state is updated at each time step. It acts as a memory of previous time steps allowing the network to capture sequential patterns.

![RNN structure](images/rnn.png)

For instance, consider a natural language processing task where $\boldsymbol X$ is a sequence of words in a sentence, $\boldsymbol x_t$ is the word at position $t$, and $\boldsymbol o_t$ represents the predicted probability distribution over the vocabulary for the next word in the sequence. The RNN learns from the context of previous words, using the hidden state to generate predictions for the next word in the sentence.


## Forward Pass

![RNN forward](images/forward.png)

Consider a minibatch of inputs $\boldsymbol x_t \in \mathbb{R}^{n \times d}$ at time step $t$. Each row of $\boldsymbol x_t$ corresponds to one example at time step $t$ within a minibatch of $n$ sequence examples. The weight parameter $\boldsymbol W_{xh} \in \mathbb{R}^{d \times h}$ and bias parameter $\boldsymbol b_h \in \mathbb{R}^{1 \times h}$ are applied to the current input. Additionally, let $\boldsymbol h_t \in \mathbb{R}^{n \times h}$ denote the hidden layer output at time step $t$. It is computed as:

$$
    \boldsymbol h_t = \phi({\boldsymbol x_tW_{xh}} + {\boldsymbol h_{t-1}W_{hh}} + {\boldsymbol b_h})
$$

Here, $\phi$ is an **[activation function](https://fedmug.github.io/kbtu-ml-book/mlp/activations.html)** of the hidden layer output. In contrast to [MLP](https://fedmug.github.io/kbtu-ml-book/mlp/layers.html), we preserve the hidden layer output $\boldsymbol h_{t-1}$ from the previous time step. By introducing a new weight parameter $\boldsymbol W_{hh}$ $\in \mathbb{R}^{h \times h}$, we define how to use the hidden layer output from the previous time step in the current time step. Since the hidden state at the current time step uses the same definition as the previous time step, the computation involves *recurrence* which is why the model is called *recurrent neural network*.
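A single recurrent update can be sketched in NumPy as follows. This is only an illustration: the sizes `n`, `d`, `h`, the random initialization, the choice of $\phi = \tanh$ and the helper name `rnn_step` are assumptions made for the example, not part of the chapter's Keras model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 4, 3, 5                          # batch size, input size, hidden size (illustrative)

x_t = rng.normal(size=(n, d))              # minibatch at time step t
h_prev = np.zeros((n, h))                  # h_{t-1}; zeros stand in for the initial state
W_xh = rng.normal(scale=0.1, size=(d, h))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(h, h))  # hidden-to-hidden weights
b_h = np.zeros((1, h))                     # hidden bias, broadcast over the batch


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)


h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
print(h_t.shape)  # (4, 5)
```

Feeding `h_t` back in as `h_prev` for the next input is what makes the computation recurrent.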

For time step $t$, the **output** of the output layer is computed similarly to [MLP](https://fedmug.github.io/kbtu-ml-book/mlp/forward_backward_pass.html):

$$
\boldsymbol o_t = \boldsymbol h_t\boldsymbol W_{hq} + \boldsymbol b_q
$$

Here, $\boldsymbol o_t \in \mathbb{R}^{n \times q}$ represents the output variable, $\boldsymbol W_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\boldsymbol b_q \in \mathbb{R}^{1 \times q}$ is the bias parameter. In the case of a classification problem, the softmax function can be applied to $\boldsymbol o_t$ to compute the probability distribution over the output categories. The hidden state at the current time step, $\boldsymbol h_t$, therefore plays two roles: it feeds into the hidden state at the next time step $t+1$, and it is used to compute the output $\boldsymbol o_t$ at the current time step.
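Putting both equations together, a forward pass over a whole sequence might look like the sketch below. The output is computed from the hidden state, $\boldsymbol o_t = \boldsymbol h_t \boldsymbol W_{hq} + \boldsymbol b_q$, which is what the shape $\boldsymbol W_{hq} \in \mathbb{R}^{h \times q}$ requires. The function names, the time-major layout `(T, n, d)`, the zero initial state and the use of $\tanh$ are assumptions for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rnn_forward(X, W_xh, W_hh, b_h, W_hq, b_q):
    """X has shape (T, n, d). Returns hidden states (T, n, h) and probabilities (T, n, q)."""
    T, n, _ = X.shape
    h_t = np.zeros((n, W_hh.shape[0]))                  # initial hidden state
    hidden, probs = [], []
    for t in range(T):
        h_t = np.tanh(X[t] @ W_xh + h_t @ W_hh + b_h)   # same weights at every step
        o_t = h_t @ W_hq + b_q                          # output from the hidden state
        hidden.append(h_t)
        probs.append(softmax(o_t))
    return np.stack(hidden), np.stack(probs)


rng = np.random.default_rng(1)
T, n, d, h, q = 6, 2, 3, 4, 5                           # illustrative sizes
X = rng.normal(size=(T, n, d))
params = (
    rng.normal(scale=0.1, size=(d, h)),
    rng.normal(scale=0.1, size=(h, h)),
    np.zeros((1, h)),
    rng.normal(scale=0.1, size=(h, q)),
    np.zeros((1, q)),
)
H, P = rnn_forward(X, *params)
print(H.shape, P.shape)  # (6, 2, 4) (6, 2, 5)
```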

```{note}
RNNs use the same set of parameters, $\boldsymbol W_{xh}, \boldsymbol W_{hh}, \boldsymbol W_{hq}, \boldsymbol b_h, \boldsymbol b_q$, at every time step. Because of this parameter sharing, the number of parameters stays **constant**, irrespective of the number of time steps.
```
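To make the note concrete, the sketch below counts the trainable scalars of the five parameters for some illustrative sizes. The helper name `rnn_param_count` and the values of `d`, `h`, `q` are assumptions; the point is only that the sequence length $T$ never enters the formula.

```python
def rnn_param_count(d, h, q):
    """Number of trainable scalars in W_xh (d*h), W_hh (h*h), b_h (h), W_hq (h*q), b_q (q)."""
    return d * h + h * h + h + h * q + q


d, h, q = 5, 8, 3                      # illustrative input, hidden and output sizes
for T in (1, 10, 1000):                # the sequence length does not appear in the count
    print(T, rnn_param_count(d, h, q))
```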

````{important}
While we, like many sources, focus on the many-to-many paradigm, where a sequence is processed and an output is produced at each time step as discussed above, it is important to highlight that RNNs are flexible with respect to input-output configurations. They also support one-to-one, one-to-many and many-to-one setups.
````

<span style="display:none" id="first_q">W3sicXVlc3Rpb24iOiAiV2UgcnVuIFJOTiwgdGhhdCBwcm9jZXNzZXMgYSB0aW1lIHNlcmllcyBvZiBsZW5ndGggJFQgPSAxMCQsIGFuZCBlYWNoIHRpbWUgc3RlcCBoYXMgYW4gaW5wdXQgdmVjdG9yICAgICBvZiBzaXplICQ1JC4gV2hhdCBpcyB0aGUgdG90YWwgbnVtYmVyIG9mIGlucHV0IHBhcmFtZXRlcnMgZm9yIHRoZSBjb25uZWN0aW9ucyBmcm9tIHRoZSBpbnB1dCBsYXllciB0byB0aGUgaGlkZGVuIGxheWVyPyIsICJ0eXBlIjogIm51bWVyaWMiLCAiYW5zd2VycyI6IFt7InR5cGUiOiAidmFsdWUiLCAidmFsdWUiOiA1MCwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiQ29ycmVjdCEifSwgeyJ0eXBlIjogImRlZmF1bHQiLCAiZmVlZGJhY2siOiAiTm8sIHRoYXRzIHdyb25nLiJ9XX1d</span>

In [2]:
display_quiz('#first_q')

## Training Vanilla RNNs

### Data preprocessing

Let's make some predictions. We import and preprocess the IMDB movie review sentiment classification dataset. This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). `num_words` limits the vocabulary to the 30,000 most frequent words. `maxlen` sets the length of each review to 50 tokens; it is applied when the sequences are padded.

In [3]:
num_words = 30000
maxlen = 50
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

Pad the sequences so that they all have the same length, which gives the neural network a consistent input size; reviews longer than `maxlen` are truncated. Also convert the target labels to one-hot encoded format, as expected by the categorical cross-entropy loss used below.

In [4]:
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

y_train = to_categorical(y_train, num_classes=2)
y_test = to_categorical(y_test, num_classes=2)
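What the padding and one-hot steps do to the data can be illustrated without Keras. The helper `pad_post` below is a hypothetical NumPy re-implementation written for this example; it assumes `padding='post'` together with the default `truncating='pre'` of `pad_sequences`, so long reviews keep their last `maxlen` tokens. The toy reviews and labels are made up.

```python
import numpy as np


def pad_post(seqs, maxlen):
    """NumPy sketch of pad_sequences(padding='post') with the default truncating='pre'."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]             # long sequences keep their last maxlen tokens
        out[row, :len(kept)] = kept      # zeros fill the remaining positions at the end
    return out


reviews = [[11, 12], [21, 22, 23, 24, 25, 26]]   # made-up word indexes
padded = pad_post(reviews, maxlen=4)
print(padded)
# [[11 12  0  0]
#  [23 24 25 26]]

labels = np.array([0, 1, 1])
one_hot = np.eye(2)[labels]              # row i is the one-hot vector of labels[i]
print(one_hot)
```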

Finally, let's split the training data into training and validation sets for monitoring model performance during training.

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(17500, 50) (7500, 50) (17500, 2) (7500, 2)


### Model Architecture

We build a sequential neural network model with an embedding layer, a SimpleRNN layer, dropout for regularization, and a dense layer with softmax activation for binary classification.

In [6]:
embedding_dim = 128
model = Sequential()
model.add(Embedding(num_words, embedding_dim, input_length=maxlen))
model.add(SimpleRNN(50, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
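As a sanity check on the architecture, the number of trainable parameters per layer can be worked out by hand. The arithmetic below assumes the standard layout of these layers (an embedding matrix, an RNN kernel, recurrent kernel and bias, and a dense kernel and bias); the totals should agree with what `model.summary()` reports.

```python
num_words, embedding_dim, units, classes = 30000, 128, 50, 2

embedding = num_words * embedding_dim               # one 128-dimensional vector per word index
simple_rnn = units * (embedding_dim + units + 1)    # input weights, recurrent weights, bias
dense = units * classes + classes                   # output weights and bias

print(embedding, simple_rnn, dense, embedding + simple_rnn + dense)
# 3840000 8950 102 3849052
```

Almost all parameters sit in the embedding layer; the recurrent layer itself is small.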

Configure the model for training by specifying the loss function, optimizer, and metrics to monitor.

In [7]:
adam = Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])



Implement early stopping to monitor the validation loss and stop training if it doesn't improve for a certain number of epochs (patience).

In [8]:
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
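The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Keras implementation: the function name `early_stopping_epoch` is hypothetical, the validation losses are made up, and options such as a minimum improvement threshold are ignored.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:                      # improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:                                # no improvement for one more epoch
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # never triggered: ran all epochs


val_losses = [0.60, 0.52, 0.50, 0.53, 0.55, 0.58, 0.61]   # made-up values
print(early_stopping_epoch(val_losses, patience=3))        # (5, 2)
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 2 in this toy run) would be the ones kept.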

### Training the Model
Train the model using the training data. The training is monitored on the validation set, and early stopping is applied.

In [5]:
history = model.fit(X_train, y_train, epochs=10, batch_size=50, verbose=2, validation_data=(X_val, y_val), callbacks=[early_stop])


### Evaluate the Model
Use the trained model to make predictions on the test set and evaluate its accuracy using the ground truth labels.

In [47]:
y_pred = model.predict(X_test, verbose=0)
y_test_ = np.argmax(y_test, axis=1)
y_pred_ = np.argmax(y_pred, axis=1)
accuracy = accuracy_score(y_test_, y_pred_)
print("Test accuracy:", accuracy)

Test accuracy: 0.7788


The job is done. Our model can now predict the sentiment of a movie review with about 78% accuracy on the test set.

## Input-output relations in RNNs

Based on the sizes of the input and the output, input-output relations in RNNs can be classified as follows.

### One-to-one

This is the classic feed-forward neural network architecture, with one input and one expected output. A one-to-one relationship can be formulated as:

$$
f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^C
$$

where $D$ is the size of the input vector and $C$ is the size of the output vector. We usually take MSE as our loss function in such cases:

$$
\mathcal{L} = \mathrm{MSE} = \frac{1}{2} \sum_{i=1}^{C} (y_i - \hat{y}_i)^2
$$

A one-to-one relationship applies to tasks such as classifying images into categories (e.g., cat, dog, bird) or recognizing handwritten digits.

### One-to-many (Vec2Seq, sequence generation)

One-to-many relationship can be formulated like this:

$$
f_\theta : \mathbb{R}^D \rightarrow \mathbb{R}^{N_\infty C}
$$

where $D$ is the size of the input vector, and the output is an arbitrary-length sequence of vectors, each of size $C$ (the subscript in $N_\infty$ indicates that the output length is not fixed in advance). The loss function for a one-to-many relationship can be expressed using the cross-entropy loss, which is commonly used for sequence generation tasks. The total loss for the entire sequence is the sum of the per-step losses over the $T$ generated steps:

$$
\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t, \qquad \mathcal{L}_t = -\sum_{i=1}^{C} y_{t,i} \log \hat{y}_{t,i}
$$
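The sequence loss, a sum of per-step cross-entropies, can be computed as in the sketch below. The function name, the one-hot targets and the predicted probabilities are made up for illustration.

```python
import numpy as np


def sequence_cross_entropy(y_true, y_prob):
    """y_true (one-hot) and y_prob have shape (T, C); returns the total and per-step losses."""
    per_step = -np.sum(y_true * np.log(y_prob), axis=1)   # cross-entropy at each step t
    return per_step.sum(), per_step


y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])                        # T = 3, C = 3
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]])
total, per_step = sequence_cross_entropy(y_true, y_prob)
print(per_step, total)
```

With one-hot targets, each per-step loss reduces to the negative log-probability assigned to the correct class.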

A one-to-many relationship is useful for generating a descriptive caption for an input image, creating a musical composition or converting spoken language into written text with word-level timing information.

### Many-to-many (Seq2Seq, sequence translation)

In this case we consider learning functions of the form 

$$
f_\theta : \mathbb{R}^{T \times D} \rightarrow \mathbb{R}^{T' \times C} 
$$

We consider two cases: one in which $T' = T$, so the input and output sequences have the same length (and hence are aligned), and one in which $T' \neq T$, so the input and output sequences have different lengths.

The loss in this setting is computed at each time step for all training examples and accumulated into the overall loss:

$$
\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{T} \log(\hat{\boldsymbol y}_{i,j,c})  
$$

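Here $m$ is the number of training examples, and $\hat{\boldsymbol y}_{i,j,c}$ is the predicted probability of the correct class $c$ at time step $j$ of example $i$.

Many-to-many relations are most common in machine translation and in generating textual descriptions for video sequences.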

### Many-to-one (Seq2Vec, sequence classification)

Assume that we have a single fixed-length output vector $\boldsymbol y$ we want to predict, given a variable-length sequence as input. Thus we want to learn a function of the form:

$$
f_\theta : \mathbb{R}^{T \times D} \rightarrow \mathbb{R}^{C}
$$

Since this is a classification problem, the cross-entropy loss is again used (for binary sentiment classification, $C = 2$):

$$
\mathcal{L} = -\sum_{i=1}^{C} y_i \log \hat{y}_i
$$

Such a relationship arises when determining the sentiment of a text, or when assigning a document or a sentence to one of several predefined categories.
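The difference between many-to-many and many-to-one comes down to which hidden states are passed on. The sketch below mirrors the role of the `return_sequences` flag used in the `SimpleRNN` layer earlier; the function `run_rnn`, the sizes and the omission of the bias are simplifying assumptions for illustration.

```python
import numpy as np


def run_rnn(X, W_xh, W_hh, return_sequences):
    """X has shape (T, D). Returns all hidden states (T, H) or only the last one (H,)."""
    h_t = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:
        h_t = np.tanh(x_t @ W_xh + h_t @ W_hh)   # bias omitted for brevity
        states.append(h_t)
    return np.stack(states) if return_sequences else h_t


rng = np.random.default_rng(2)
T, D, H = 7, 3, 4                                # illustrative sizes
X = rng.normal(size=(T, D))
W_xh, W_hh = rng.normal(size=(D, H)), rng.normal(size=(H, H))

seq_out = run_rnn(X, W_xh, W_hh, return_sequences=True)     # many-to-many: one state per step
last_out = run_rnn(X, W_xh, W_hh, return_sequences=False)   # many-to-one: final state only
print(seq_out.shape, last_out.shape)  # (7, 4) (4,)
```

A classifier head applied to `last_out` gives a many-to-one model; a head applied to every row of `seq_out` gives an aligned many-to-many model.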

<span style="display:none" id="third_q">W3sicXVlc3Rpb24iOiAiV2hhdCB0eXBlIG9mIHJlbGF0aW9ucyBiZXR3ZWVuIGlucHV0IGFuZCBvdXRwdXQgaXMgbW9zdCBzdWl0YWxiZSBmb3IgcHJldmlvdXMgSU1CRCAgICAgIGRhdGFzZXQgc2VudGltZW50IGNsYXNzaWZpY2F0aW9uIG1vZGVsPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAibWFueS10by1tYW55IiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIk5vISBXZSBnZXQgb25seSBvbmUgd29yZCBmb3Igb3VwdXQhIn0sIHsiYW5zd2VyIjogIm9uZS10by1tYW55IiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIk5vISBUaGUgUk5OIGhhcyBzb21lIHNlcXVlbmNpZXMgdG8gd29yayB3aXRoISJ9LCB7ImFuc3dlciI6ICJtYW55LXRvLW9uZSIsICJjb3JyZWN0IjogdHJ1ZSwgImZlZWRiYWNrIjogIkNvcnJlY3QhIn0sIHsiYW5zd2VyIjogIm9uZS10by1vbmUiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8hIn1dfV0=</span>

In [11]:
display_quiz('#third_q')


## Backpropagation through time (BPTT)

Recurrent neural networks are trained with **backpropagation through time (BPTT)**: a forward pass through the entire sequence computes the **losses**, and a backward pass through the entire sequence computes the **gradients**, which are then used to update the weights.

The input and the hidden state can be concatenated before being multiplied by a single weight variable in the hidden layer. Thus, we use $\boldsymbol w_h$ and $\boldsymbol w_o$ to indicate the weights of the hidden layer and the output layer, respectively. As a result, the hidden states and outputs at each time step are

```{math}
:label: h_t&o_t
\begin{split}\begin{aligned}h_t &= f(x_t, h_{t-1}, w_\textrm{h}),\\o_t &= g(h_t, w_\textrm{o}),\end{aligned}\end{split}
```

where $f$ and $g$ are transformations of the hidden layer and the output layer, respectively. Hence, we have a chain of values $\{\ldots, (x_{t-1}, h_{t-1}, o_{t-1}), (x_{t}, h_{t}, o_{t}), \ldots \}$ that depend on each other via recurrent computation. The forward propagation is fairly straightforward. All we need is to loop through the $(x_{t}, h_{t}, o_{t})$ triples one time step at a time. The discrepancy between the output $\boldsymbol o_t$ and the desired target $\boldsymbol y_t$ is evaluated by an objective function across all the $T$ time steps as:

$$
\mathcal L(x_1, \ldots, x_T, y_1, \ldots, y_T, w_\textrm{h}, w_\textrm{o}) = \frac{1}{T}\sum_{t=1}^T l(y_t, o_t)
$$

![BPTT.png](images/BPTT.png)

## Gradient Calculation

For backpropagation, things are a bit trickier, especially when we calculate gradients with regard to the parameters $w_h$ of the objective function $\mathcal L$ . To be specific, by the chain rule,

```{math}
:label: bptt
\begin{split}\begin{aligned}\frac{\partial{\mathcal L}}{\partial w_\textrm{h}}  & = \frac{1}{T}\sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial w_\textrm{h}}  \\& = \frac{1}{T}\sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial o_t} \frac{\partial g(h_t, w_\textrm{o})}{\partial h_t}  \frac{\partial h_t}{\partial w_\textrm{h}}.\end{aligned}\end{split}
```

The first and the second factors of the product in {eq}`bptt` are easy to compute. The third factor $\partial h_t/\partial w_\textrm{h}$ is where things get tricky, since we need to recurrently compute the effect of the parameter $w_h$ on $h_t$. According to the recurrent computation in {eq}`h_t&o_t`, $h_t$ depends on both $h_{t-1}$ and $w_h$, where computation of $h_{t-1}$ also depends on $w_h$. Thus, evaluating the total derivative of $h_t$ with respect to $w_h$ using the chain rule yields
```{math}
:label: derivative

\frac{\partial h_t}{\partial w_\textrm{h}}= \frac{\partial f(x_{t},h_{t-1},w_\textrm{h})}{\partial w_\textrm{h}} +\frac{\partial f(x_{t},h_{t-1},w_\textrm{h})}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_\textrm{h}}.

```
To derive the above gradient, assume that we have three sequences $\{a_{t}\},\{b_{t}\},\{c_{t}\}$ satisfying $a_{0}=0$ and $a_{t}=b_{t}+c_{t}a_{t-1}$ for $t=1, 2,\ldots$. Then for $t\geq 1$, it is easy to show
```{math}
:label: a_t
a_{t}=b_{t}+\sum_{i=1}^{t-1}\left(\prod_{j=i+1}^{t}c_{j}\right)b_{i}.
```
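The identity can be checked numerically by running the recurrence and comparing it with the closed-form sum. The random sequences and the helper name `closed_form` are assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 8
b = rng.normal(size=T + 1)      # index 0 is unused so that b[t] matches b_t
c = rng.normal(size=T + 1)      # likewise c[t] matches c_t

# Recurrence: a_0 = 0 and a_t = b_t + c_t * a_{t-1}
a = np.zeros(T + 1)
for t in range(1, T + 1):
    a[t] = b[t] + c[t] * a[t - 1]


def closed_form(t):
    """b_t plus the sum over i < t of (c_{i+1} * ... * c_t) * b_i."""
    return b[t] + sum(np.prod(c[i + 1:t + 1]) * b[i] for i in range(1, t))


print(all(np.isclose(a[t], closed_form(t)) for t in range(1, T + 1)))  # True
```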

By substituting $a_t$, $b_t$, and $c_t$ according to

$$
\begin{split}\begin{aligned}a_t &= \frac{\partial h_t}{\partial w_\textrm{h}},\\
b_t &= \frac{\partial f(x_{t},h_{t-1},w_\textrm{h})}{\partial w_\textrm{h}}, \\
c_t &= \frac{\partial f(x_{t},h_{t-1},w_\textrm{h})}{\partial h_{t-1}},\end{aligned}\end{split}
$$

the gradient computation in {eq}`derivative` satisfies $a_{t}=b_{t}+c_{t}a_{t-1}$. Thus, per {eq}`a_t`, we can remove the recurrent computation in {eq}`derivative` with

$$
\frac{\partial h_t}{\partial w_\textrm{h}}=\frac{\partial f(x_{t},h_{t-1},w_\textrm{h})}{\partial w_\textrm{h}}+\sum_{i=1}^{t-1}\left(\prod_{j=i+1}^{t} \frac{\partial f(x_{j},h_{j-1},w_\textrm{h})}{\partial h_{j-1}} \right) \frac{\partial f(x_{i},h_{i-1},w_\textrm{h})}{\partial w_\textrm{h}}.
$$
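For a concrete check of this formula, consider a scalar RNN with $f(x_t, h_{t-1}, w_\textrm{h}) = \tanh(w_\textrm{h} h_{t-1} + x_t)$. This particular $f$, the inputs and the initial state are assumptions chosen for illustration. The sketch evaluates the closed-form sum and compares it with a central finite-difference estimate.

```python
import numpy as np


def hidden_states(xs, w, h0=0.5):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t); returns [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + x))
    return hs


def dh_dw(xs, w, h0=0.5):
    """dh_T/dw evaluated with the closed-form sum over earlier time steps."""
    hs = hidden_states(xs, w, h0)
    T = len(xs)
    # b[t] = partial f / partial w and c[t] = partial f / partial h_{t-1} at step t
    b = [0.0] + [(1 - hs[t] ** 2) * hs[t - 1] for t in range(1, T + 1)]
    c = [0.0] + [(1 - hs[t] ** 2) * w for t in range(1, T + 1)]
    return b[T] + sum(np.prod(c[i + 1:T + 1]) * b[i] for i in range(1, T))


xs, w, eps = [0.3, -0.1, 0.8, 0.2], 0.9, 1e-6     # made-up inputs and weight
analytic = dh_dw(xs, w)
numeric = (hidden_states(xs, w + eps)[-1] - hidden_states(xs, w - eps)[-1]) / (2 * eps)
print(analytic, numeric)                           # the two values should agree closely
```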

While we can use the chain rule to compute $\partial h_t/\partial w_\textrm{h}$ recursively, this chain can get very long whenever $t$ is large. In practice, an approximation called **truncated BPTT** is used: the forward and backward passes run over **chunks of the sequence** instead of the whole sequence, so gradients are propagated back only a limited number of steps.
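One way to see the effect of truncation is to keep only the most recent $k$ terms of the gradient sum. The sketch below does this for a scalar RNN $h_t = \tanh(w h_{t-1} + x_t)$; the model, the inputs and the function name `truncated_dh_dw` are assumptions for illustration, and practical implementations differ in how they split the sequence into chunks.

```python
import numpy as np


def truncated_dh_dw(xs, w, k, h0=0.0):
    """Approximate dh_T/dw for h_t = tanh(w * h_{t-1} + x_t) using only the last k steps."""
    hs = [h0]
    for x in xs:                               # the forward pass still covers the whole sequence
        hs.append(np.tanh(w * hs[-1] + x))
    T = len(xs)
    grad, carry = 0.0, 1.0                     # carry holds the product of dh_j/dh_{j-1} factors
    for t in range(T, max(T - k, 0), -1):      # walk back over at most k time steps
        local = 1 - hs[t] ** 2                 # derivative of tanh at step t
        grad += carry * local * hs[t - 1]      # contribution of step t to the sum
        carry *= local * w                     # extend the product one step further back
    return grad


xs = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2]          # made-up inputs
full = truncated_dh_dw(xs, w=0.8, k=len(xs))   # k = T recovers the full gradient
for k in (1, 2, 4, len(xs)):
    print(k, truncated_dh_dw(xs, w=0.8, k=k), full)
```

Increasing `k` brings the estimate closer to the full gradient at the price of a longer backward pass.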

<span style="display:none" id="second_q">W3sicXVlc3Rpb24iOiAiV2hhdCB0aW1lIGl0IHRha2VzIHRvIGNvbXB1dGUgdGhlIGdyYWRpZW50IG9uIG9uZSBzdGVwIGluIEJQVFQ/IiwgInR5cGUiOiAibWFueV9jaG9pY2UiLCAiYW5zd2VycyI6IFt7ImFuc3dlciI6ICIkTyh7bG9nVH0pJCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJObywgdGhhdCdzIHRvbyBzbWFsbCEifSwgeyJhbnN3ZXIiOiAiJE8oe1R9KSQiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJFeGFjdGx5ISJ9LCB7ImFuc3dlciI6ICIkTyh7VH1eezJ9KSQiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIHRoaXMgaXMgdGhlIHRvdGFsIGNvbXB1dGF0aW9uYWwgdGltZSBmb3IgYWxsIHN0ZXBzLiJ9LCB7ImFuc3dlciI6ICIkTyh7VH1ee1R9KSQiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiTm8sIHRoYXQncyB0b28gYmlnISJ9XX1d</span>

In [14]:
display_quiz('#second_q')

## Vanishing and exploding gradients

Depending on the size of the recurrent weights $\boldsymbol w_{h}$, the gradient can either vanish or explode over time, because it contains long products of the factors $\partial h_j/\partial h_{j-1}$.

For matrix $\boldsymbol w_{h}$:
  - If the largest singular value < 1: vanishing gradients.
  - If the largest singular value > 1: gradients can explode.
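The two regimes can be illustrated with a linear recurrence, where the product of Jacobians is simply a power of the weight matrix. The sketch below is a simplification: it ignores the activation function and uses a scaled orthogonal matrix (an assumption made so that every singular value equals the chosen scale).

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _r = np.linalg.qr(rng.normal(size=(4, 4)))   # Q is orthogonal: all singular values are 1

norms = {}
for scale in (0.9, 1.1):
    W = scale * Q                               # every singular value of W equals scale
    J = np.eye(4)
    for _step in range(50):                     # product of 50 identical Jacobians
        J = J @ W
    norms[scale] = np.linalg.norm(J, 2)         # spectral norm, equal to scale ** 50 up to rounding

print(norms)
```

After 50 steps the norm has shrunk to roughly 0.005 for scale 0.9 and grown to over 100 for scale 1.1.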

To address the exploding gradient problem, a technique called gradient clipping is used. Gradient clipping constrains the magnitude of the gradients so that they do not exceed a predefined threshold. If the L2 norm of the gradients exceeds the threshold, all gradients are scaled down proportionally so that the overall norm is within the specified limit.

$$
\nabla_{\text{clipped}} = \frac{clip\_value}{\max(clip\_value, \lVert \nabla \rVert)} \cdot \nabla
$$

where:
- $\nabla_{\text{clipped}}$ is the clipped gradient vector
- $clip\_value$ is the specified threshold
- $\lVert \nabla \rVert$ is the L2 norm of the gradient vector.
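The clipping rule translates directly into NumPy. The function name `clip_by_norm` and the example gradient are assumptions for illustration.

```python
import numpy as np


def clip_by_norm(grad, clip_value):
    """Rescale grad so that its L2 norm does not exceed clip_value."""
    norm = np.linalg.norm(grad)
    return grad * clip_value / max(clip_value, norm)


g = np.array([3.0, 4.0])                  # L2 norm is 5
print(clip_by_norm(g, clip_value=1.0))    # [0.6 0.8]
print(clip_by_norm(g, clip_value=10.0))   # [3. 4.]
```

Gradients below the threshold pass through unchanged, while larger ones keep their direction and are rescaled to norm `clip_value`. In Keras, a similar effect is available through the `clipnorm` argument of the optimizers.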

Considering more advanced RNN architectures like [Long Short-Term Memory (LSTM)](https://fedmug.github.io/kbtu-ml-book/rnn/lstm.html) or [Gated Recurrent Unit (GRU)](https://fedmug.github.io/kbtu-ml-book/rnn/gru.html) is a common and effective approach to address the vanishing gradient problem in traditional RNNs.

