- Introducing sequential data;
- RNNs for modeling sequences;
- Long short-term memory;
- Truncated backpropagation through time;
- Implementing a multilayer RNN for sequence modeling in PyTorch;
- Project one: RNN sentiment analysis of the IMDb movie review dataset;
- Project two: RNN character-level language modeling with LSTM cells, using text data from a book;
- Using gradient clipping to avoid exploding gradients.

# Introducing sequential data

What makes sequences unique, compared to other types of data, is that elements in a sequence appear in a certain order
and are not independent of each other. Typical machine learning algorithms for supervised learning assume that the input
is independent and identically distributed (IID) data, which means that the training examples are mutually independent
and have the same underlying distribution. Under these constraints, the order of the data is irrelevant.

With sequences this assumption is not valid.

## Sequential data versus time series data

Time series data is a special type of sequential data where each example is associated with a dimension for time. Here
samples are taken at successive timestamps. For example, stock prices and voice or speech records are time series data.

On the other hand, not all sequential data has a time dimension (for example, text or DNA sequences).

## Representing sequences

We need a way to leverage the order information. Throughout this chapter, we will represent sequences as $\langle x^{(1)},
\ldots,x^{(T)}\rangle$, where the length of the sequence is $T$.

Classical NNs, such as MLPs, have no memory of previously seen training examples.

RNNs, by contrast, are designed for modeling sequences and are capable of remembering past information and processing
new events accordingly, which is a clear advantage when working with sequence data.

## The different categories of sequence modeling

Sequence modeling has many fascinating applications, such as language translation, image captioning, and text generation.

There are three different relationship categories between input and output data in RNNs:
- **Many-to-one**: The input data is a sequence, but the output is a fixed-size vector or scalar, not a sequence. In
sentiment analysis the input is text-based and the output is a class label.
- **One-to-many**: The input data is in standard format and not a sequence, but the output is a sequence. In image
captioning, the input is an image and the output is an English phrase summarizing the content of that image.
- **Many-to-many**: Both the input and the output are sequences. This category can be further divided based on whether
the input and output are synchronized.
    - **Synchronized**: In video classification, each frame is labeled.
    - **Delayed**: In language translation, we first read the phrase to be translated, and then we translate.
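
As a rough illustration of the many-to-one and synchronized many-to-many cases, the sketch below runs one `nn.RNN` layer and keeps either every time step or only the last one. The tensor sizes are hypothetical and chosen only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration.
batch_size, seq_len, input_size, hidden_size = 4, 7, 5, 3

rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# `output` holds the hidden state at every time step; `h_n` holds only the last one.
output, h_n = rnn(x)

# Many-to-many (synchronized): keep one vector per time step.
many_to_many = output              # shape: (batch_size, seq_len, hidden_size)

# Many-to-one: keep only the final time step, e.g. as input to a classifier.
many_to_one = output[:, -1, :]     # shape: (batch_size, hidden_size)

print(many_to_many.shape)
print(many_to_one.shape)
```

For a single-layer, unidirectional RNN, the last time step of `output` coincides with `h_n[0]`, so either can feed a classification head.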

# RNNs for modeling sequences

In an RNN, the hidden layer receives its input from both the input layer of the current time step and the hidden layer
from the previous time step. The flow of information in adjacent time steps in the hidden layer allows the network to
have a memory of past events. This flow of information is usually displayed as a loop, also known as a recurrent edge
in graph notation.

Similar to MLPs, RNNs can consist of multiple hidden layers. Note that it's a common convention to refer to RNNs with
one hidden layer as a single layer RNN.

As we know, each hidden unit in a standard NN receives only one input: the net preactivation associated with the input
layer. In contrast, each hidden unit in an RNN receives two distinct sets of input: the preactivation from the input
layer and the activation of the same hidden layer from the previous time step, $t-1$.

At the first time step, $t=0$, the hidden units are initialized to zeros or small random values. Then, at a time step
where $t>0$, the hidden units receive their input from the data point at the current time, $x^{(t)}$, and the previous
values of hidden units at $t-1$, indicated as $h^{(t-1)}$.

Since each recurrent layer must receive a sequence as input, all the recurrent layers except the last one must return
a sequence as output. The behavior of the last recurrent layer depends on the type of problem.
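
A minimal sketch of a stacked RNN, assuming PyTorch's `nn.RNN` with `num_layers=2`; the sizes are illustrative. Inside the module, the first layer's full hidden sequence is passed to the second layer, and the returned `output` is the sequence from the top layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked recurrent layers; sizes are illustrative assumptions.
stacked_rnn = nn.RNN(input_size=5, hidden_size=2, num_layers=2, batch_first=True)

x = torch.randn(1, 3, 5)           # (batch_size, sequence_length, input_size)
output, h_n = stacked_rnn(x)

# `output` is the full sequence produced by the last (top) layer: shape (1, 3, 2).
print(output.shape)
# `h_n` stacks the final hidden state of every layer: shape (2, 1, 2).
print(h_n.shape)
# The second layer consumes the first layer's hidden sequence, so its
# input-to-hidden weights have shape (hidden_size, hidden_size).
print(stacked_rnn.weight_ih_l1.shape)
```

Whether to keep the whole `output` or only its last time step then depends on the task, for example many-to-one versus many-to-many.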

## Computing activations in an RNN

Each directed edge in the representation of an RNN that we just looked at is associated with a weight matrix. Those
weights do not depend on time, $t$; therefore, they are shared across the time axis. The different weight matrices in a
single-layer RNN are as follows:
- $W_{xh}$: The weight matrix between the input and the hidden layer;
- $W_{hh}$: The weight matrix associated with the recurrent edge;
- $W_{ho}$: The weight matrix between the hidden layer and output layer.

Computing the activations is very similar to standard multilayer perceptrons and other types of feedforward NNs. For the
hidden layer (in this example we have just one hidden layer), the net input, $z_h$ (preactivation), is computed through
a linear combination; that is, we compute the sum of the multiplications of the weight matrices with the corresponding
vectors and add the bias unit:
$$z_h^{(t)} = W_{xh}x^{(t)} + W_{hh}h^{(t-1)} + b_h$$

Then, the activations of the hidden units at the time step, $t$, are calculated as follows:
$$h^{(t)} = \sigma_h(z_h^{(t)}) = \sigma_h(W_{xh}x^{(t)} + W_{hh}h^{(t-1)} + b_h)$$

Once the activations of the hidden units at the current time step are computed, then the activations of the output units
will be computed, as follows:
$$o^{(t)} = \sigma_o(W_{ho}h^{(t)}+b_o)$$
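
The three equations above can be sketched directly with tensors. This is an illustration under assumptions rather than a reference implementation: the sizes are made up, the weights are random, and $\sigma_h$ and $\sigma_o$ are taken to be tanh and sigmoid.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 5 input features, 2 hidden units, 1 output unit, 3 time steps.
input_size, hidden_size, output_size, T = 5, 2, 1, 3

# Randomly initialized parameters, shared across all time steps.
W_xh = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
W_ho = torch.randn(output_size, hidden_size)
b_h = torch.zeros(hidden_size)
b_o = torch.zeros(output_size)

x = torch.randn(T, input_size)     # one sequence <x^(1), ..., x^(T)>
h = torch.zeros(hidden_size)       # initial hidden state

outputs = []
for t in range(T):
    z_h = W_xh @ x[t] + W_hh @ h + b_h    # net input of the hidden layer
    h = torch.tanh(z_h)                   # sigma_h = tanh (assumed)
    o = torch.sigmoid(W_ho @ h + b_o)     # sigma_o = sigmoid (assumed)
    outputs.append(o)

outputs = torch.stack(outputs)     # shape: (T, output_size)
print(outputs.shape)
```

The same `W_xh`, `W_hh`, and `W_ho` are reused in every iteration, which is what sharing weights across the time axis means in practice.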

### Backpropagation Through Time (BPTT)

The derivation of the gradients might be a bit complicated, but the basic idea is that the overall loss, $L$, is the sum
of all the loss functions at times $t=1$ to $t=T$:
$$L=\sum_{t=1}^{T}L^{(t)}$$

Since the loss at time $t$ depends on the hidden units at all previous time steps $1:t$, the gradient is
computed as follows:
$$\frac{\partial L^{(t)}}{\partial W_{hh}} = \frac{\partial L^{(t)}}{\partial o^{(t)}}\times\frac{\partial o^{(t)}}
{\partial h^{(t)}}\times\biggl(\sum_{k=1}^t\frac{\partial h^{(t)}}{\partial h^{(k)}}\times\frac{\partial h^{(k)}}
{\partial W_{hh}}\biggr)$$

Here, $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ is computed as a product over adjacent time steps:
$$\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^t\frac{\partial h^{(i)}}{\partial h^{(i-1)}}$$
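
To make the product over adjacent time steps concrete, here is a small sketch for a scalar RNN, $h^{(t)} = \tanh(w\,h^{(t-1)} + u\,x^{(t)})$. The weights `w` and `u`, the inputs, and the helper `run` are hypothetical. The product of adjacent derivatives is cross-checked against a finite-difference estimate.

```python
import math

# Scalar RNN with hypothetical values: h_t = tanh(w * h_{t-1} + u * x_t)
w, u = 0.9, 0.5
xs = [0.3, -0.1, 0.8, 0.2, -0.5]   # x^(1), ..., x^(5)


def run(h_start, inputs):
    """Return the list of hidden states produced from h_start."""
    states = []
    h = h_start
    for x in inputs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states


hs = [0.0] + run(0.0, xs)          # hs[i] is h^(i), with h^(0) = 0
k, t = 1, 5

# Product of adjacent-step derivatives: dh^(i)/dh^(i-1) = (1 - h^(i)^2) * w
chain = 1.0
for i in range(k + 1, t + 1):
    chain *= (1.0 - hs[i] ** 2) * w

# Central finite difference of h^(t) with respect to h^(k) as a cross-check
eps = 1e-6
plus = run(hs[k] + eps, xs[k:])[-1]
minus = run(hs[k] - eps, xs[k:])[-1]
numeric = (plus - minus) / (2 * eps)

print(chain, numeric)
```

Each factor contains the recurrent weight `w`, so the product has $t-k$ copies of it.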

## Hidden recurrence versus output recurrence

Note that there is also a type of model in which the recurrent connection comes from the output layer. In this case, the
net activations from the output layer at the previous time step, $o^{(t-1)}$, can be added in one of two ways:
- To the hidden layer at the current time step, $h^{(t)}$ (output-to-hidden recurrence);
- To the output layer at the current time step, $o^{(t)}$ (output-to-output recurrence).

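The following example uses PyTorch's `nn.RNN`, which implements hidden-to-hidden recurrence, and inspects the shapes of
its weights and biases:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

# Parameters of the first (and only) layer
w_xh = rnn_layer.weight_ih_l0   # input-to-hidden weights
w_hh = rnn_layer.weight_hh_l0   # hidden-to-hidden (recurrent) weights
b_xh = rnn_layer.bias_ih_l0     # input-to-hidden bias
b_hh = rnn_layer.bias_hh_l0     # hidden-to-hidden bias
print("w_xh shape: ", w_xh.shape)
print("w_hh shape: ", w_hh.shape)
print("b_xh shape: ", b_xh.shape)
print("b_hh shape: ", b_hh.shape)
```

```
w_xh shape:  torch.Size([2, 5])
w_hh shape:  torch.Size([2, 2])
b_xh shape:  torch.Size([2])
b_hh shape:  torch.Size([2])
```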


The input shape for this layer is `(batch_size, sequence_length, input_size)`, where the first dimension is the batch
dimension, the second dimension corresponds to the sequence, and the last dimension corresponds to the features (5 in
this example). Notice that the layer outputs a sequence: an input sequence of length 3 results in an output sequence of
length 3.

```python
# A single sequence of three time steps with five features each
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

# Output of the RNN layer for a batch containing this one sequence
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))

# Manually computing the output, one time step at a time
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>\nInput: ', xt.numpy())
    # Input-to-hidden contribution: W_xh x^(t) + b_xh
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print('Hidden: ', ht.detach().numpy())
    # Previous hidden state (zeros at the first time step)
    if t > 0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))
    # Add the recurrent contribution, W_hh h^(t-1) + b_hh, then apply tanh
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print("Output (manual): ", ot.detach().numpy())
    print("RNN output: ", output[:, t].detach().numpy())
    print()
```

```
Time step 0 =>
Input:  [[1. 1. 1. 1. 1.]]
Hidden:  [[-0.47019297  0.58639044]]
Output (manual):  [[-0.35198015  0.52525216]]
RNN output:  [[-0.3519801   0.52525216]]

Time step 1 =>
Input:  [[2. 2. 2. 2. 2.]]
Hidden:  [[-0.8888316  1.2364398]]
Output (manual):  [[-0.68424344  0.76074266]]
RNN output:  [[-0.68424344  0.76074266]]

Time step 2 =>
Input:  [[3. 3. 3. 3. 3.]]
Hidden:  [[-1.3074702  1.8864892]]
Output (manual):  [[-0.8649416  0.9046636]]
RNN output:  [[-0.8649416  0.9046636]]
```



The hyperbolic tangent (tanh) activation function is the default in RNNs; it is also the default `nonlinearity` of `nn.RNN`.

## The challenges of learning long-range interactions

BPTT introduces some new challenges. Because of the multiplicative factor $\frac{\partial h^{(t)}}{\partial h^{(k)}}$
in computing the gradients of a loss function, the so-called vanishing and exploding gradient problems arise.

The factor $\frac{\partial h^{(t)}}{\partial h^{(k)}}$ involves $t-k$ multiplications; therefore, multiplying the weight, $w$,
by itself $t-k$ times results in a factor, $w^{t-k}$. As a result, if $|w|<1$, this factor becomes very small when $t-k$
is large. On the other hand, if the weight of the recurrent edge satisfies $|w|>1$, then $w^{t-k}$ becomes very large when
$t-k$ is large. **Note that a large $t-k$ refers to long-range dependencies.** We can see that a naive solution to avoid
vanishing or exploding gradients is to ensure that $|w|=1$.
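
A quick numeric sketch of the factor $w^{t-k}$ for a few illustrative weights and time gaps. The helper name `recurrent_factor` is hypothetical.

```python
def recurrent_factor(w, gap):
    """Return w ** gap, the factor contributed by gap = t - k time steps."""
    return w ** gap


# Illustrative weights below, at, and above 1, and three time gaps.
for w in (0.5, 1.0, 1.5):
    factors = [recurrent_factor(w, gap) for gap in (1, 10, 50)]
    print(w, factors)
```

With a gap of 50 steps, the factor for $w=0.5$ is below $10^{-10}$, while for $w=1.5$ it exceeds $10^{8}$; only $w=1$ leaves it unchanged.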

There are at least three solutions to this problem:
- Gradient clipping: a threshold value is specified for the gradients, and this cut-off value is assigned to gradient
values that exceed it.
- Truncated backpropagation through time (TBPTT): it limits the number of time steps that the signal can
backpropagate after each forward pass (i.e., even if the sequence has 100 elements, we may only backpropagate through
the most recent 20 time steps).
    - The truncation limits the number of steps that the gradient can effectively flow back and properly update the
    weights.
- Long short-term memory (LSTM).
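
The first two remedies can be sketched together in one toy training loop. Everything here is an assumption made for illustration: the model, the random data, the chunk length of 20, and the clipping threshold. TBPTT is approximated by detaching the hidden state between chunks, and clipping uses `torch.nn.utils.clip_grad_value_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: predict one value per time step from a long sequence.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(2, 100, 4)         # (batch_size, sequence_length, input_size)
y = torch.randn(2, 100, 1)

chunk_len = 20                     # backpropagate through at most 20 steps
clip_value = 1.0                   # cut-off for individual gradient values
h = None                           # nn.RNN starts from zeros when h is None

for start in range(0, x.size(1), chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    # TBPTT: keep the hidden state's value but cut its history so gradients
    # cannot flow into earlier chunks.
    if h is not None:
        h = h.detach()

    output, h = rnn(x_chunk, h)
    loss = loss_fn(head(output), y_chunk)

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: clamp every gradient entry to [-clip_value, clip_value].
    torch.nn.utils.clip_grad_value_(params, clip_value)
    optimizer.step()
```

A common alternative is `torch.nn.utils.clip_grad_norm_`, which rescales the gradients so that their overall norm stays below a threshold instead of clamping individual entries.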

# Long short-term memory (LSTM)

LSTM was introduced to overcome the vanishing gradient problem. The building block of an LSTM is a memory cell, which
replaces the hidden layer of standard RNNs.

In each memory cell, there is a recurrent edge that has the desirable weight, $w=1$, as we discussed, to overcome the
vanishing and exploding gradient problems. The values associated with this recurrent edge are collectively called the
cell state.

![LSTM Cell](./LSTM-cell-architecture.png)

Notice:
 - The cell state from the previous time step, $C^{(t-1)}$, is modified to get the cell state at the current time step,
 $C^{(t)}$, without being multiplied directly by any weight factor;
 - $h^{(t-1)}$ indicates the hidden units at time $t-1$;
 - In an LSTM, there are three different types of gates, known as the forget gate, the input gate, and the output
 gate.

**The forget gate** ($f_t$):

Allows the memory cell to reset the cell state without growing indefinitely. In fact the forget gate decides which
information is allowed to go through and which information to suppress. It wasn't a part of the original LSTM cell but
was added later to improve the original model.
$$f_t = \sigma(W_{xf}x^{(t)}+W_{hf}h^{(t-1)}+b_f)$$

**The input gate** ($i_t$) and the candidate value ($\tilde{C}_t$):

They are responsible for updating the cell state. They are computed as follows:
$$i_t = \sigma(W_{xi}x^{(t)}+W_{hi}h^{(t-1)}+b_i)$$
$$\tilde{C}_t = \tanh(W_{xc}x^{(t)}+W_{hc}h^{(t-1)}+b_c)$$
Then the cell state at time $t$ is computed as follows:
$$C^{(t)}=(C^{(t-1)}\odot f_t)\oplus(i_t\odot\tilde{C}_t)$$

**The output gate** ($o_t$):

It decides how to update the values of hidden units:
$$o_t = \sigma(W_{xo}x^{(t)}+W_{ho}h^{(t-1)}+b_o)$$

Given this, the hidden units at the current time step are computed as follows:
$$h^{(t)}=o_t\odot\tanh(C^{(t)})$$
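
As a sanity check on the gate equations, the sketch below recomputes one step of PyTorch's `nn.LSTMCell` by hand, using tanh for the candidate value. The sizes and random inputs are illustrative. Note that `nn.LSTMCell` stores two bias vectors per gate (`bias_ih` and `bias_hh`), whereas the equations above use a single bias per gate, so the two are summed here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes and random inputs for a single time step.
input_size, hidden_size = 5, 3
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

# Reference result from PyTorch
h_ref, c_ref = cell(x_t, (h_prev, c_prev))

# nn.LSTMCell stacks the four weight blocks in the order:
# input gate, forget gate, candidate value, output gate.
gates = (x_t @ cell.weight_ih.T + cell.bias_ih
         + h_prev @ cell.weight_hh.T + cell.bias_hh)
i_pre, f_pre, g_pre, o_pre = gates.chunk(4, dim=1)

i_t = torch.sigmoid(i_pre)         # input gate
f_t = torch.sigmoid(f_pre)         # forget gate
c_tilde = torch.tanh(g_pre)        # candidate value
o_t = torch.sigmoid(o_pre)         # output gate

c_t = c_prev * f_t + i_t * c_tilde # new cell state
h_t = o_t * torch.tanh(c_t)        # new hidden state

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```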

### How LSTM mitigates the gradient problems

The cell state has a direct path from one time step to the next, meaning that gradients can flow through it with little
modification. This direct path is a key factor that prevents the gradients from vanishing over long sequences. It does
not undergo the same extensive transformations as the hidden state, so its gradients are less likely to vanish as they
pass backward through time.

The forget gate determines how much of the previous cell state should be retained in the current cell state. It outputs
a value between 0 (forget everything) and 1 (keep everything).
- If the forget gate is set to 1, the cell state carries forward information from previous time steps without losing it
(this keeps the gradients from shrinking);
- By controlling how much information is passed forward, the forget gate helps prevent the gradients from growing too
large (avoiding accumulation of unnecessary information).

The input gate decides how much new information from the current input and previous hidden state should be added to the
cell state. It helps prevent the gradient from exploding by limiting the contribution of new information (since large
updates to the cell state are avoided).

The output gate determines how much of the cell state should be exposed as the hidden state at the current time step.
- The output gate ensures that the cell state's information is passed to the hidden state in a controlled manner.
- The output gate can prevent the hidden state from becoming too large by regulating the contribution of the cell state.

The sigmoid function uses its extremes (0 and 1) to control the flow of information, while the tanh is used to squash
the cell state values, limiting them to a range of -1 to 1.
- The outputs of the sigmoid and tanh functions are constrained, meaning that the gradients passed backward through
these functions are also constrained, helping prevent the gradients from exploding.

In summary, the cell state provides a stable path for gradients to propagate backward through many time steps. The
hidden state is more susceptible to vanishing gradients because it involves additional non-linear transformations, but
its gradient flow is still stabilized by the gating mechanisms, in particular the forget and input gates.
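
A small autograd sketch of the direct path through the cell state. It treats the gate values as fixed numbers, which is a simplifying assumption: in a real LSTM the gates depend on $h^{(t-1)}$, which adds indirect gradient paths that are ignored here. Under that assumption, the derivative of $C^{(t)}$ with respect to $C^{(t-1)}$ is simply the forget gate.

```python
import torch

torch.manual_seed(0)

hidden_size = 4
c_prev = torch.randn(hidden_size, requires_grad=True)

# Gate values are treated as fixed numbers here (an illustrative assumption;
# in a real LSTM they also depend on h^(t-1)).
f_t = torch.tensor([1.0, 0.9, 0.5, 0.0])
i_t = torch.rand(hidden_size)
c_tilde = torch.tanh(torch.randn(hidden_size))

# Cell state update: C^(t) = C^(t-1) * f_t + i_t * C~_t (element-wise)
c_t = c_prev * f_t + i_t * c_tilde

# Summing first makes backward() return the element-wise derivative dC^(t)/dC^(t-1).
c_t.sum().backward()
print(c_prev.grad)
```

Units whose forget gate is close to 1 pass the gradient back almost unchanged, while units whose forget gate is close to 0 block it.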


