# **Modeling Sequential Data Using Recurrent Neural Networks (Part 1/3)**


## **Introducing sequential data**

### **Modeling sequential data: order matters**

What makes sequences unique, compared to other types of data, is that elements in a sequence appear in a certain order and are not independent of each other. Typical machine learning algorithms for supervised learning assume that the input is `independent and identically distributed (IID)` data, which means that the training examples are `mutually independent` and have the same underlying distribution. In this regard, based on the mutual independence assumption, the order in which the training examples are given to the model is irrelevant. For example, if we have a sample consisting of n training examples, `x(1), x(2), ..., x(n)`, the order in which we use the data for training our machine learning algorithm does not matter. An example of this scenario would be the Iris dataset that we worked with previously. In the Iris dataset, each flower has been measured independently, and the measurements of one flower do not influence the measurements of another flower.

However, this assumption is not valid when we deal with sequences: by definition, order matters. Predicting the market value of a particular stock would be an example of this scenario. For instance, assume we have a sample of `n` training examples, where each training example represents the market value of a certain stock on a particular day. If our task is to predict the stock market value for the next three days, it would make sense to consider the previous stock prices in date-sorted order to derive trends rather than use these training examples in randomized order.

### **Sequential data versus time series data**

Time series data is a special type of sequential data where each example is associated with a dimension for time. In time series data, samples are taken at successive timestamps, and therefore, the time dimension determines the order among the data points. For example, stock prices and voice or speech records are time series data.

On the other hand, not all sequential data has the time dimension. For example, in text data or DNA sequences, the examples are ordered, but text or DNA does not qualify as time series data. 

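The fundamental characteristic of sequential data is that the order of its elements is significant: it determines the meaning and context of the data. In short:
- Order matters: shuffling the elements changes what the sequence means.
- Contextual understanding: each element is interpreted in light of the elements around it.
- Structure: the elements follow patterns (for example, grammar in text) that a model can learn.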

### **Representing sequences**

![Example of time series data](./figures/15_01.png)

As we have already mentioned, the standard `NN` models that we have covered so far, such as `multilayer perceptrons (MLPs)` and `CNNs` for image data, assume that the training examples are independent of each other and thus do not incorporate ordering information. We can say that such models do not have a memory of previously seen training examples. 

For instance, the samples are passed through the `feedforward` and `backpropagation` steps, and the weights are updated independently of the order in which the training examples are processed.

`RNNs`, by contrast, are designed for modeling sequences and are capable of remembering past information and processing new events accordingly, which is a clear advantage when working with sequence data.

### **The different categories of sequence modeling**

Sequence modeling has many applications, including:

- Language translation
- Image captioning
- Text generation


![Effectiveness of RNNs](./figures/diags.jpeg)


If neither the input nor output data represent sequences, then we are dealing with standard data, and we could simply use a multilayer perceptron to model such data. However, if either the input or output is a sequence, the modeling task likely falls into one of these categories:

- `Many-to-one:` The input data is a sequence, but the output is a fixed-size vector or scalar, not a sequence. For example, in sentiment analysis, the input is text-based (for example, a movie review) and the output is a class label (for example, a label denoting whether a reviewer liked the movie).


- `One-to-many:` The input data is in standard format and not a sequence, but the output is a sequence. An example of this category is image captioning—the input is an image and the output is an English phrase summarizing the content of that image.


- `Many-to-many:` Both the input and output arrays are sequences. This category can be further divided based on whether the input and output are synchronized. An example of a synchronized many-to-many modeling task is video classification, where each frame in a video is labeled. An example of a delayed many-to-many modeling task would be translating one language into another. For instance, an entire English sentence must be read and processed by a machine before its translation into German is produced.
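
To make these categories more concrete, the following is a minimal sketch of how many-to-one and synchronized many-to-many setups might differ when using PyTorch's `nn.RNN`. The tensor sizes and the two `nn.Linear` heads (`many_to_one_head`, `many_to_many_head`) are hypothetical choices made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration.
batch_size, seq_len, n_features, n_hidden = 4, 10, 3, 8

rnn = nn.RNN(input_size=n_features, hidden_size=n_hidden, batch_first=True)
x = torch.randn(batch_size, seq_len, n_features)

# output holds the hidden state at every time step; h_n holds only the last one.
output, h_n = rnn(x)

# Many-to-one: keep only the final hidden state and map it to one value per sequence.
many_to_one_head = nn.Linear(n_hidden, 1)
y_single = many_to_one_head(output[:, -1, :])   # shape: (batch_size, 1)

# Synchronized many-to-many: map every time step to its own prediction.
many_to_many_head = nn.Linear(n_hidden, 2)
y_per_step = many_to_many_head(output)          # shape: (batch_size, seq_len, 2)

print(y_single.shape, y_per_step.shape)
```

The only difference between the two setups in this sketch is which part of the recurrent layer's output is passed on: the last time step alone, or all time steps.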

## **RNNs for modeling sequences**

### **Understanding the dataflow in RNNs**

![The dataflow of a standard feedforward NN and an RNN](./figures/15_03.png)


In a standard feedforward network, information flows from the input to the hidden layer, and then from the hidden layer to the output layer. On the other hand, in an RNN, the hidden layer receives its input from both the input layer of the current time step and the hidden layer from the previous time step.

The flow of information in adjacent time steps in the hidden layer allows the network to have a memory of past events. This flow of information is usually displayed as a loop, also known as a `recurrent edge` in graph notation, which is how this general RNN architecture got its name.

Similar to multilayer perceptrons, RNNs can consist of multiple hidden layers. Note that it’s a common convention to refer to RNNs with one hidden layer as a single-layer RNN, which is not to be confused with single-layer NNs without a hidden layer. 


![An RNN with one and two hidden layers](./figures/15_04.png)


As we know, each hidden unit in a standard NN receives only one input—the net preactivation associated with the input layer. In contrast, each hidden unit in an RNN receives two distinct sets of input—the preactivation from the input layer and the activation of the same hidden layer from the previous time step, `t – 1`.

At the first time step, `t = 0`, the hidden units are initialized to zeros or small random values. Then, at a time step where `t > 0`, the hidden units receive their input from the data point at the current time, `x(t)`, and the previous values of hidden units at `t – 1`, indicated as `h(t–1)`.

Similarly, in the case of a multilayer RNN, we can summarize the information flow as follows:

- `layer = 1`: Here, the hidden layer is represented as $h_{1}^{(t)}$ and it receives its input from the data point, $x^{(t)}$, and the hidden values in the same layer, but at the previous time step, $h_{1}^{(t - 1)}$.

- `layer = 2`: The second hidden layer, $h_{2}^{(t)}$, receives its inputs from the outputs of the layer below at the current time step ($o_{1}^{(t)}$) and its own hidden values from the previous time step $h_{2}^{(t - 1)}$.


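Since, in this case, each recurrent layer must receive a sequence as input, all the recurrent layers except the last one must pass a full sequence of hidden states to the layer above. In PyTorch, `nn.RNN` returns the hidden states for every time step by default, and layers stacked via `num_layers` are connected this way internally (there is no `return_sequences` flag as in Keras).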
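
As a small sketch of the multilayer case, we can stack two recurrent layers with `nn.RNN` and inspect the shapes involved. The layer sizes below are assumptions picked for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked recurrent layers; sizes are illustrative assumptions.
stacked_rnn = nn.RNN(input_size=5, hidden_size=2, num_layers=2, batch_first=True)

x = torch.randn(1, 3, 5)          # (batch_size, sequence_length, input_size)
output, h_n = stacked_rnn(x)

# output: hidden states of the top layer at every time step.
print(output.shape)               # torch.Size([1, 3, 2])
# h_n: final hidden state of each layer, stacked along the first dimension.
print(h_n.shape)                  # torch.Size([2, 1, 2])

# The second layer consumes the first layer's hidden states, so its
# input-to-hidden weight matrix has hidden_size columns rather than input_size.
print(stacked_rnn.weight_ih_l0.shape)   # torch.Size([2, 5])
print(stacked_rnn.weight_ih_l1.shape)   # torch.Size([2, 2])
```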

### **Computing activations in an RNN**

Now that you understand the structure and general flow of information in an `RNN`, let’s get more specific and compute the actual activations of the hidden layers, as well as the output layer. For simplicity, we will consider just a single hidden layer; however, the same concept applies to multilayer RNNs.

Each directed edge (the connections between boxes) in the representation of an `RNN` that we just looked at is associated with a weight matrix. Those weights do not depend on time, `t`; therefore, they are shared across the time axis. The different weight matrices in a single-layer `RNN` are as follows:

- $W_{xh}$: The weight matrix between the input, $x^{(t)}$, and the hidden layer, $h$.
- $W_{hh}$: The weight matrix associated with the recurrent edge.
- $W_{ho}$: The weight matrix between the hidden layer and the output layer.

These weight matrices are depicted in the figure below:

![Applying weights to a single-layer RNN](./figures/15_05.png)


### **Deep Dive into the Mathematics of RNNs**



**1. Core Idea**

RNNs process sequential data by **reusing** a hidden state across time. The hidden state acts as a **memory** that encodes information about all previous steps.

Given an input sequence $`x = (x_1, x_2, \dots, x_T)`$,

the RNN produces hidden states $`h = (h_1, h_2, \dots, h_T)`$

and (optionally) outputs $`y = (y_1, y_2, \dots, y_T)`$.


![Computing the activations](./figures/15_06.png)


**2. Hidden State Update Equation**

The hidden state at time $`t`$ is computed as:

$$h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Where:

| Symbol          | Meaning                                  |
| --------------- | ---------------------------------------- |
| $`x_t`$         | Input vector at time step $`t`$          |
| $`h_t`$         | Hidden state at time step $`t`$          |
| $`W_{xh}`$      | Input-to-hidden weight matrix            |
| $`W_{hh}`$      | Hidden-to-hidden recurrent weight matrix |
| $`b_h`$         | Bias for hidden state                    |
| $`\phi(\cdot)`$ | Nonlinearity (tanh or ReLU)              |

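If the network produces an output at each time step (here $`W_{hy}`$ is the hidden-to-output weight matrix, written $W_{ho}$ earlier, and $`b_y`$ is the output bias):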

$$y_t = W_{hy} h_t + b_y$$



**3. Forward Pass Unrolled in Time**

Unrolling the RNN shows how the state flows:

$`h_1 = \phi(W_{xh} x_1 + W_{hh} h_0 + b_h)`$

$`h_2 = \phi(W_{xh} x_2 + W_{hh} h_1 + b_h)`$

$`\cdots`$

$`h_T = \phi(W_{xh} x_T + W_{hh} h_{T-1} + b_h)`$

This makes it clear that:

**Information flows recursively**, but **gradients must pass through many multiplications of $`W_{hh}`$**.
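
The unrolled recurrence can be sketched as a plain loop. This is a minimal illustration, assuming tanh as the nonlinearity, a zero initial state $`h_0`$, and small random weights; all sizes are hypothetical.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions): 3 input features, 4 hidden units, 5 time steps.
n_in, n_hidden, T = 3, 4, 5

W_xh = torch.randn(n_hidden, n_in) * 0.5
W_hh = torch.randn(n_hidden, n_hidden) * 0.5
b_h = torch.zeros(n_hidden)

x = torch.randn(T, n_in)          # x_1, ..., x_T
h = torch.zeros(n_hidden)         # h_0

hidden_states = []
for t in range(T):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); the same weights are reused at every step.
    h = torch.tanh(W_xh @ x[t] + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)   # shape: (T, n_hidden)
print(hidden_states.shape)
```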



**4. Loss Over the Sequence**

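If the task requires a prediction at every time step, the total loss is the sum of the per-step losses $`\mathcal{L}_t = \mathrm{Loss}(y_t, \hat{y}_t)`$:

$$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t = \sum_{t=1}^{T} \mathrm{Loss}(y_t, \hat{y}_t)$$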
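
As a small sketch of this sum, the snippet below accumulates a per-step loss over a sequence. Mean squared error is an assumed choice of $`\mathrm{Loss}`$, and the tensors `y` and `y_hat` are random stand-ins for its two arguments.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, n_out = 5, 2
y = torch.randn(T, n_out)         # first argument of Loss at each step (random stand-ins)
y_hat = torch.randn(T, n_out)     # second argument of Loss at each step (random stand-ins)

# Per-step loss, here mean squared error as an assumed choice of Loss.
per_step = [F.mse_loss(y[t], y_hat[t]) for t in range(T)]

# Total loss: the sum over all time steps.
total_loss = torch.stack(per_step).sum()
print(total_loss)
```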



**5. Backpropagation Through Time (BPTT)**

To learn parameters, gradients must propagate backward across **time**, not just layers.

Derivative of loss w.r.t. hidden state:

$$
\frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}_t}{\partial h_t} + \frac{\partial \mathcal{L}}{\partial h_{t+1}} \cdot \frac{\partial h_{t+1}}{\partial h_t}
$$

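Where, writing $`a_{t+1} = W_{xh} x_{t+1} + W_{hh} h_t + b_h`$ for the pre-activation (and using the layout in which this factor multiplies the incoming gradient from the left):

$$
\frac{\partial h_{t+1}}{\partial h_t} = W_{hh}^\top \cdot \mathrm{diag}\left(\phi'(a_{t+1})\right)
$$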

So gradients accumulate multiplicatively:

$$
\frac{\partial \mathcal{L}}{\partial h_t} = \sum_{k=t}^{T} \left( \prod_{j=t+1}^{k} W_{hh}^\top \mathrm{diag}(\phi'(a_j)) \right) \frac{\partial \mathcal{L}_k}{\partial h_k}
$$
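
The one-step factor in this product can be checked numerically. The sketch below assumes tanh as $`\phi`$ and random weights, and compares the Jacobian computed by autograd with the closed form; the helper name `step` is hypothetical.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

n_in, n_hidden = 3, 4
W_xh = torch.randn(n_hidden, n_in)
W_hh = torch.randn(n_hidden, n_hidden)
b_h = torch.randn(n_hidden)
x_next = torch.randn(n_in)        # x_{t+1}
h_t = torch.randn(n_hidden)       # h_t

def step(h):
    # h_{t+1} = tanh(a_{t+1}) with a_{t+1} = W_xh x_{t+1} + W_hh h_t + b_h
    return torch.tanh(W_xh @ x_next + W_hh @ h + b_h)

# Autograd Jacobian: J[i, j] = d h_{t+1}[i] / d h_t[j].
J = jacobian(step, h_t)

# Closed form: diag(phi'(a_{t+1})) W_hh, with phi'(a) = 1 - tanh(a)^2.
a_next = W_xh @ x_next + W_hh @ h_t + b_h
D = torch.diag(1 - torch.tanh(a_next) ** 2)
J_closed = D @ W_hh

print(torch.allclose(J, J_closed, atol=1e-6))
```

Note that autograd returns the Jacobian with rows indexed by $`h_{t+1}`$; the factor $`W_{hh}^\top \mathrm{diag}(\phi'(a_{t+1}))`$ written in the text is its transpose.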



**6. Vanishing and Exploding Gradients**

The critical term is:

$$\prod_{j=t+1}^{k} W_{hh}^\top \mathrm{diag}(\phi'(a_j))$$

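Because $`|\phi'| \le 1`$ for tanh, the size of this product is governed largely by the singular values of $`W_{hh}`$. If the largest singular value of $`W_{hh}`$ is:

* **Less than 1** → gradients **shrink** exponentially with the distance $`k - t`$ → **vanishing gradient**.
* **Greater than 1** → gradients **can blow up** → **exploding gradient** (this is a necessary condition rather than a guarantee, since saturated activations damp the product).

This explains why standard RNNs are hard to train on **long sequences**.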
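
A quick numerical sketch illustrates the effect. It makes two simplifying assumptions that are not part of the text above: the activation is the identity (so $`\phi' = 1`$), and $`W_{hh}`$ is a scaled orthogonal matrix, so that all of its singular values equal the chosen scale. The helper `product_norm` is hypothetical.

```python
import torch

torch.manual_seed(0)

n_hidden, n_steps = 4, 20

# Random orthogonal matrix: all of its singular values equal 1.
Q, _ = torch.linalg.qr(torch.randn(n_hidden, n_hidden, dtype=torch.float64))

def product_norm(scale):
    # W_hh = scale * Q has every singular value equal to `scale`.
    W_hh = scale * Q
    P = torch.eye(n_hidden, dtype=torch.float64)
    for _ in range(n_steps):
        P = W_hh.T @ P
    # Spectral norm (largest singular value) of the n_steps-fold product.
    return torch.linalg.matrix_norm(P, ord=2).item()

print(product_norm(0.5))   # about 0.5 ** 20, i.e. vanishing
print(product_norm(1.5))   # about 1.5 ** 20, i.e. exploding
```

With a nonlinear activation and a general weight matrix the behavior is less clean, but the same exponential trend in the number of time steps tends to appear.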



**7. Why LSTM/GRU Fix the Problem**

They introduce **gates** to **control** information flow:

* Additive memory updates reduce multiplicative gradient chains.
* Gradients propagate more stably.

Key structural difference:

Standard RNN memory update (multiplicative):

$$h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1})$$

LSTM memory update (additive):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

This **additive path** helps preserve gradient magnitude as long as the forget gate $`f_t`$ stays close to 1.



**8. Summary Table**

| Model        | Memory Mechanism                | Gradient Behavior          | Suitable For     |
| ------------ | ------------------------------- | -------------------------- | ---------------- |
| Standard RNN | Multiplicative recurrence       | Vanishing/Exploding common | Short sequences  |
| GRU          | Gated recurrence (reset/update) | More stable                | Medium sequences |
| LSTM         | Explicit memory cell + gates    | Most stable                | Long sequences   |

---


### **Hidden-recurrence vs. output-recurrence**

So far, you have seen recurrent networks in which the hidden layer has the recurrent property. However, note that there is an alternative model in which the recurrent connection comes from the output layer. In this case, the net activations from the output layer at the previous time step, $o^{(t-1)}$, can be added in one of two ways:

- To the hidden layer at the current time step, $h^{(t)}$ (output-to-hidden).
- To the output layer at the current time step, $o^{(t)}$ (output-to-output).


![Different recurrent connection models](./figures/15_07.png)


As shown in the figure above, the architectures differ in where the recurrent connection originates and where it ends. Following our notation, the weights associated with the recurrent connection will be denoted for the `hidden-to-hidden` recurrence by $W_{hh}$, for the `output-to-hidden` recurrence by $W_{oh}$, and for the `output-to-output` recurrence by $W_{oo}$. In some articles in the literature, the weights associated with the recurrent connections are also denoted by $W_{rec}$.

- We will now manually compute the forward pass for one of these recurrent types. Using the `torch.nn` module, a recurrent layer can be defined via `RNN`, which is similar to the `hidden-to-hidden` recurrence.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2,
                   num_layers=1, batch_first=True)

# Parameters of the first (and only) recurrent layer
w_xh = rnn_layer.weight_ih_l0   # input-to-hidden weights
w_hh = rnn_layer.weight_hh_l0   # hidden-to-hidden (recurrent) weights
b_xh = rnn_layer.bias_ih_l0     # input-to-hidden bias
b_hh = rnn_layer.bias_hh_l0     # hidden-to-hidden bias

print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)
```

```
W_xh shape: torch.Size([2, 5])
W_hh shape: torch.Size([2, 2])
b_xh shape: torch.Size([2])
b_hh shape: torch.Size([2])
```


- The input shape for this layer is `(batch_size, sequence_length, input_size)`, where:
    - the first dimension is the batch dimension,
    - the second dimension corresponds to the sequence, and
    - the last dimension corresponds to the features.

- We will output a sequence, which, for an input sequence of length `3`, will result in the output sequence $({o^{(0)}, o^{(1)}, o^{(2)}})$.
- You can set `num_layers` to stack multiple RNN layers together to form a stacked RNN.

- Call the forward pass on `rnn_layer`, manually compute the outputs at each time step, and compare them:

```python
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

# Output of the simple RNN:
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))

# Manually computing the output:
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print('   Input           :', xt.numpy())

    # Input contribution to the pre-activation: x_t W_xh^T + b_xh
    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print('   Hidden          :', ht.detach().numpy())

    # Previous hidden state (zeros at the first time step)
    if t > 0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros(ht.shape)

    # Add the recurrent contribution and apply tanh
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print('   Output (manual) :', ot.detach().numpy())
    print('   RNN output      :', output[:, t].detach().numpy())
    print()
```

```
Time step 0 =>
   Input           : [[1. 1. 1. 1. 1.]]
   Hidden          : [[-0.47019297  0.58639044]]
   Output (manual) : [[-0.35198015  0.52525216]]
   RNN output      : [[-0.3519801   0.52525216]]

Time step 1 =>
   Input           : [[2. 2. 2. 2. 2.]]
   Hidden          : [[-0.8888316  1.2364398]]
   Output (manual) : [[-0.68424344  0.76074266]]
   RNN output      : [[-0.68424344  0.76074266]]

Time step 2 =>
   Input           : [[3. 3. 3. 3. 3.]]
   Hidden          : [[-1.3074702  1.8864892]]
   Output (manual) : [[-0.8649416  0.9046636]]
   RNN output      : [[-0.8649416  0.9046636]]
```



```python
x_seq
```

```
tensor([[1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2.],
        [3., 3., 3., 3., 3.]])
```

- We used the `hyperbolic tangent (tanh)` activation function since it is also the default activation in `RNN`.
- Outputs from the manual forward computations match the output of the `RNN` layer at each time step, up to floating-point rounding.

### **The challenges of learning long-range interactions**

`BPTT` introduces some new challenges. Because of the multiplicative factor that appears when computing the gradients of the loss function, the so-called `vanishing` and `exploding` gradient problems arise.


![Problems in computing the gradients of the loss function](./figures/15_08.png)


In practice, there are at least three solutions to this problem:

- `Gradient clipping`
- `Truncated backpropagation through time (TBPTT)`
- `LSTM`


* Using `gradient clipping`, we specify a cut-off or threshold value for the gradients, and we assign this cut-off value to gradient values that exceed it.

* In contrast, `TBPTT` simply limits the number of time steps that the signal can `backpropagate` after each forward pass. For example, even if the sequence has `100` elements or steps, we may only backpropagate the most recent `20` time steps.

* `LSTM` networks have been more successful at handling the vanishing and exploding gradient problems while modeling long-range dependencies through the use of memory cells.
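
The first two remedies can be combined in a short training loop. The following is a minimal sketch on toy random data, not a tuned recipe: the sizes, the linear `head`, the learning rate, and the `clip_value` threshold are all assumptions made for illustration. Truncation is implemented by detaching the hidden state between chunks of 20 steps, and clipping uses `torch.nn.utils.clip_grad_value_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data (an assumption for illustration): one sequence of 100 steps, 5 features,
# with a scalar regression target at every step.
seq_len, chunk_len, n_features, n_hidden = 100, 20, 5, 8
x = torch.randn(1, seq_len, n_features)
y = torch.randn(1, seq_len, 1)

rnn = nn.RNN(input_size=n_features, hidden_size=n_hidden, batch_first=True)
head = nn.Linear(n_hidden, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()
clip_value = 0.5

h = None  # nn.RNN starts from a zero hidden state when none is passed
for start in range(0, seq_len, chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    # TBPTT: carry the hidden state forward but cut the graph, so gradients
    # flow back through at most chunk_len time steps.
    if h is not None:
        h = h.detach()

    out, h = rnn(x_chunk, h)
    loss = loss_fn(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping by value: every gradient entry is limited to
    # the interval [-clip_value, clip_value].
    torch.nn.utils.clip_grad_value_(params, clip_value)
    optimizer.step()
```

Clipping by norm (`torch.nn.utils.clip_grad_norm_`) is a common alternative that rescales the whole gradient vector instead of clamping individual entries.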

---

### **Long short-term memory cells**

`LSTMs` were introduced to overcome the vanishing gradient problem. The building block of an LSTM is a `memory cell`, which essentially represents or replaces the hidden layer of standard `RNNs`.

In each memory cell, there is a recurrent edge that has the desirable weight, $w = 1$, to overcome the `vanishing` and `exploding` gradient problems. The values associated with this recurrent edge are collectively called the `cell state`. 

Below is the unfolded structure of a modern `LSTM` cell:

![The structure of an LSTM cell](./figures/15_09.png)


Notice that the cell state from the previous time step, $C^{(t-1)}$, is modified to get the cell state at the current time step, $C^{(t)}$, without being multiplied directly by any weight factor. The flow of information in this memory cell is controlled by several computation units (often called gates) that will be described here. In the figure, `⨀` refers to the `element-wise product` (element-wise multiplication) and `⨁` means `element-wise summation` (element-wise addition). Furthermore, $x^{(t)}$ refers to the input data at time $t$, and $h^{(t-1)}$ indicates the hidden units at time $t - 1$. Four boxes are indicated with an activation function, either the sigmoid function ($\sigma$) or $\tanh$, and a set of weights; these boxes apply a linear combination by performing matrix-vector multiplications on their inputs (which are $h^{(t-1)}$ and $x^{(t)}$). These units of computation with sigmoid activation functions, whose output units are passed through `⨀`, are called *gates*.


In an LSTM cell, there are three different types of *gates*:
- the forget gate,
- the input gate, and
- the output gate.

- The `forget gate` $(f_{t})$ allows the memory cell to reset the cell state so that it does not grow indefinitely. In fact, the forget gate decides which information is allowed to go through and which information to suppress.

Now, $f_t$ is computed as follows:

$$f_{t} = \sigma(W_{xf}x^{(t)} + W_{hf}h^{(t - 1)} + b_{f})$$


- The `input gate` $(i_{t})$ and `candidate value` $(\tilde{C}_{t})$ are responsible for updating the cell state. They are computed as follows:


$$i_{t} = \sigma(W_{xi}x^{(t)} + W_{hi}h^{(t - 1)} + b_{i})$$

$$\tilde{C}_{t} = \tanh(W_{xc}x^{(t)} + W_{hc}h^{(t - 1)} + b_{c})$$


The cell state at time $t$ is computed as follows:


$$C^{(t)} = (C^{(t - 1)} \odot f_{t}) \oplus (i_{t} \odot \tilde{C}_{t})$$


The `output gate` $(o_{t})$ decides how to update the values of hidden units:

$$o_{t} = \sigma(W_{xo}x^{(t)} + W_{ho}h^{(t - 1)} + b_{o})$$


Given this, the hidden units at the current time step are computed as follows:

$$h^{(t)} = o_{t} \odot \tanh(C^{(t)})$$

The structure of an `LSTM` cell and its underlying computations might seem very complex and hard to implement. However, the good news is that `PyTorch` already provides optimized implementations (`nn.LSTM` and `nn.LSTMCell`), which allow us to define our LSTM cells easily and efficiently.
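
To connect the gate equations to the library, the sketch below recomputes a single step of `nn.LSTMCell` by hand and compares the results. The sizes are illustrative assumptions. One detail to keep in mind is that PyTorch stores the four weight blocks stacked in a single matrix, in the order input gate, forget gate, candidate, output gate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 5, 3
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)        # x^(t)
h_prev = torch.randn(1, hidden_size)    # h^(t-1)
c_prev = torch.randn(1, hidden_size)    # C^(t-1)

# Reference result from PyTorch.
h_ref, c_ref = cell(x_t, (h_prev, c_prev))

# All four pre-activations at once; PyTorch keeps separate input and hidden biases.
pre = x_t @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
# Split into the blocks: input gate, forget gate, candidate, output gate.
i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

i_t = torch.sigmoid(i_pre)
f_t = torch.sigmoid(f_pre)
c_tilde = torch.tanh(g_pre)
o_t = torch.sigmoid(o_pre)

# Cell state and hidden state updates, as in the equations above.
c_t = c_prev * f_t + i_t * c_tilde
h_t = o_t * torch.tanh(c_t)

print(torch.allclose(h_t, h_ref, atol=1e-6), torch.allclose(c_t, c_ref, atol=1e-6))
```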

### **Long Short-Term Memory (LSTM) Networks: Deep Dive into Equations and Gradient Flow**



**1. Motivation**

Standard RNNs suffer from **vanishing/exploding gradients** because their memory update is **multiplicative**:

$$h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Gradients propagate through repeated multiplication by $`W_{hh}`$, which becomes unstable over long time spans.

LSTMs solve this by introducing an **explicit memory cell** with **additive** state updates, enabling stable gradient flow across long sequences.



**2. LSTM Architecture Overview**

At each timestep $`t`$, the LSTM maintains two states:

* **Hidden state** $`h_t`$

* **Cell state** $`c_t`$ (long-term memory)

The cell state allows **additive accumulation** of information, which mitigates vanishing gradients.



**3. LSTM Forward Equations**

Define the input at timestep $`t`$ as $`x_t`$, previous hidden state $`h_{t-1}`$, and previous cell state $`c_{t-1}`$.

Concatenate:

$$z_t = \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}$$

#### Gates (sigmoid activations):

1. **Forget Gate** (controls what to erase)
   $`f_t = \sigma(W_f z_t + b_f)`$

2. **Input Gate** (controls what to write)
   $`i_t = \sigma(W_i z_t + b_i)`$

3. **Output Gate** (controls what to expose)
   $`o_t = \sigma(W_o z_t + b_o)`$

#### Candidate Cell Content (tanh activation):

$$\tilde{c}_t = \tanh(W_c z_t + b_c)$$

#### Cell State Update (additive):

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

#### Hidden State Update:

$$h_t = o_t \odot \tanh(c_t)$$

Where:

| Term              | Meaning                          |
| ----------------- | -------------------------------- |
| $`\sigma(\cdot)`$ | Logistic sigmoid gate activation |
| $`\tanh(\cdot)`$  | Candidate state nonlinearity     |
| $`\odot`$         | Element-wise multiplication      |



**4. Gradient Flow Through the Cell State**

The key innovation is visible in:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

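Differentiate w.r.t. the previous cell state, treating the gate activations as constants (that is, following only the direct path through $`c_{t-1}`$). The derivative is element-wise, so the Jacobian is $`\mathrm{diag}(f_t)`$:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$

Thus, along this path:

$$\frac{\partial c_T}{\partial c_t} = \prod_{k=t+1}^{T} f_k$$

Since gate activations $`f_k \in (0, 1)`$, this product cannot explode, and it decays only as fast as the forget gates dictate: when the network learns $`f_k \approx 1`$, the gradient passes through almost unchanged.

This **additive path** is why LSTMs can retain information over long sequences.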
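
The product formula can be checked with autograd. The sketch below makes a simplifying assumption: the gate activations are fixed random values rather than functions of $`x_t`$ and $`h_{t-1}`$, so only the direct path through the cell state is present.

```python
import torch

torch.manual_seed(0)

n_hidden, T = 4, 6

# Gate activations are fixed random values here (a simplifying assumption);
# in a real LSTM they are computed from x_t and h_{t-1}.
f = torch.rand(T, n_hidden)        # forget gates in [0, 1)
i = torch.rand(T, n_hidden)        # input gates
c_tilde = torch.tanh(torch.randn(T, n_hidden))

c0 = torch.randn(n_hidden, requires_grad=True)
c = c0
for k in range(T):
    # Additive cell state update: c_k = f_k * c_{k-1} + i_k * c_tilde_k
    c = f[k] * c + i[k] * c_tilde[k]

# Gradient of sum(c_T) with respect to c_0, computed by autograd.
(grad,) = torch.autograd.grad(c.sum(), c0)

# Closed form: element-wise product of the forget gates.
print(torch.allclose(grad, f.prod(dim=0)))
```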



**5. Gradient Through the Hidden State**

Hidden state:

$`h_t = o_t \odot \tanh(c_t)`$

Gradient contributes through both $`o_t`$ and $`c_t`$:

$$
\frac{\partial h_t}{\partial c_t} = o_t \odot \left(1 - \tanh^2(c_t)\right)
$$

Combined with the previous gradient:

$$
\frac{\partial \mathcal{L}}{\partial c_t} = \frac{\partial \mathcal{L}}{\partial h_t} \odot o_t \odot \left(1 - \tanh^2(c_t)\right) + \frac{\partial \mathcal{L}}{\partial c_{t+1}} \odot f_{t+1}
$$

**Important Insight**

* The recurrence involves **addition**, not pure multiplication.
* This keeps gradients from collapsing over many time steps, provided the forget gates stay close to 1.



**6. Why LSTMs Avoid Vanishing Gradients**

| Model        | Memory Update          | Gradient Path             | Stability                   |
| ------------ | ---------------------- | ------------------------- | --------------------------- |
| Standard RNN | Multiplicative         | $`\prod W_{hh}`$          | Unstable for long sequences |
| LSTM         | Additive (via $`c_t`$) | $`\prod f_t`$ with gating | Stable long-term behavior   |

The forget gate $`f_t`$ acts as a **learned decay coefficient**, enabling **controlled memory retention**.



**7. Interpretation of Gates**

| Gate            | Interpretation    | Effect                                    |
| --------------- | ----------------- | ----------------------------------------- |
| $`f_t`$         | What to forget    | Controls decay of past memory             |
| $`i_t`$         | What to write now | Controls strength of new information      |
| $`o_t`$         | What to expose    | Controls visible influence on output      |
| $`\tilde{c}_t`$ | Candidate content | The new information proposed to be stored |

This separation allows the network to **store**, **retain**, and **expose** information selectively.



**8. Summary**

* LSTMs introduce a **cell state $`c_t`$** that evolves **additively**, ensuring stable gradient flow.
* Gates regulate information storage and retrieval.
* The key mathematical stability comes from:

$`\frac{\partial c_t}{\partial c_{t-1}} = f_t`$

instead of repeated multiplication by weight matrices.

---
