# **1. Sequence Modelling Using CNNs**



---

## **1.1 What Is a Sequence?**

1. A sequence in deep learning refers to an ordered list of elements written as
$$
(x_1, x_2, \dots, x_T)
$$
where $T$ is the length of the sequence.
2. Each individual element $x_t$ of the sequence may be a scalar value (such as temperature at time $t$), a vector (such as a word embedding in an NLP task), or a multi-channel vector (such as multi-lead ECG data or sensor readings collected simultaneously).
3. Because the order of elements in a sequence carries important information, any model used for sequence tasks must take temporal ordering into account.

Examples:

* Text: sequence of words/tokens.
* ECG / EEG: sequence of time samples or short windows.
* Stock prices: sequence of time-stamped values.
* Audio: sequence of frames.

The key point:

> **Order matters** and often **later values depend on earlier ones**.




---

## **1.2 How Sequences Are Represented in Batches**

1. In practical machine-learning pipelines, we typically process many sequences together in a batch to take advantage of parallel computation on modern hardware.
2. A batch of sequences is commonly represented as a 3-dimensional tensor with shape $(B, T, D)$, where:

   * $B$ is the batch size (the number of sequences),
   * $T$ is the sequence length (the number of time steps),
   * $D$ is the feature dimension (the number of features at each time step).
3. Each individual sequence can be visualized as a 2-dimensional matrix with $T$ rows (corresponding to time steps) and $D$ columns (corresponding to features).
4. When we stack $B$ such matrices on top of each other along a new axis, we obtain the complete batch tensor.
5. This representation is important because nearly all sequence-processing layers in PyTorch, such as RNNs, LSTMs, GRUs, Transformers, and 1D CNNs, expect input tensors in a structure that is compatible with the $(B, T, D)$ arrangement or a simple transposition of it.
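
The stacking described above can be sketched in a few lines. This is a minimal illustration, assuming PyTorch; the sizes `B=4`, `T=10`, `D=3` are arbitrary values chosen for the example and do not come from the text.

```python
import torch

B, T, D = 4, 10, 3  # illustrative sizes (assumed, not from the text)

# Each sequence is a (T, D) matrix: T time steps, D features per step.
sequences = [torch.randn(T, D) for _ in range(B)]

# Stacking along a new leading axis gives the batch tensor (B, T, D).
batch = torch.stack(sequences, dim=0)

# A simple transposition yields the channels-first layout (B, D, T).
batch_channels_first = batch.transpose(1, 2)

print(batch.shape)                 # torch.Size([4, 10, 3])
print(batch_channels_first.shape)  # torch.Size([4, 3, 10])
```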

---

## **1.3 Why Start With CNNs for Sequence Modelling?**

1. Although Recurrent Neural Networks (RNNs) are naturally suited for modelling sequential data, it is helpful to begin with Convolutional Neural Networks (CNNs) because we are already familiar with CNN architectures through image-processing tasks.
2. CNNs can be adapted to work with one-dimensional sequence data by applying 1D convolutions across the time axis, which allows them to detect local temporal patterns in a similar way that 2D convolutions detect spatial patterns in images.

---

## **1.4 How CNNs Operate on Sequences Using 1D Convolutions**

1. PyTorch's `nn.Conv1d` layer expects input tensors in the format $(B, C_{in}, L)$, where $B$ is the batch size, $C_{in}$ is the number of input channels, and $L$ is the length of the sequence along the time dimension.
2. To use 1D convolutions for sequence modelling, we reinterpret the feature dimension $D$ of the original tensor $(B, T, D)$ as the channel dimension $C_{in}$.
3. To achieve this, we simply transpose the tensor to the shape $(B, D, T)$, where:

   * $D$ becomes the number of input channels,
   * $T$ becomes the length of the sequence.
4. Once the sequence is in this form, the convolutional kernel, which has width $K$, slides along the time dimension and computes features based on a window of length $K$ around each time step.
5. Consequently, the output at each time step $t$ is influenced only by a localized neighborhood of inputs centered around that position, making CNNs particularly effective at identifying short-term patterns.
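
The transposition and the locality of the convolution can be checked directly. The following is a small sketch with assumed sizes (`B=2`, `T=20`, `D=5`, `K=3`, 8 output channels): it perturbs the input at a single time step and looks at which output positions change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D = 2, 20, 5      # illustrative sizes (assumed)
K, C_out = 3, 8         # kernel width and output channels (assumed)

x = torch.randn(B, T, D)            # (B, T, D)
x_cf = x.transpose(1, 2)            # (B, D, T): features become channels

# padding=K // 2 keeps the output length equal to T for odd K.
conv = nn.Conv1d(in_channels=D, out_channels=C_out, kernel_size=K, padding=K // 2)

with torch.no_grad():
    y = conv(x_cf)                  # (B, C_out, T)

    # Perturb the input at time step 10 only.
    x_pert = x_cf.clone()
    x_pert[:, :, 10] += 1.0
    diff = (conv(x_pert) - y).abs().sum(dim=(0, 1))   # (T,)

# Only outputs whose window covers step 10 should change.
changed = (diff > 1e-6).nonzero().flatten().tolist()
print(y.shape)
print(changed)
```

With `K=3`, only the outputs at time steps 9, 10 and 11 should be affected, which is the "localized neighborhood" described above.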

---

## **1.5 What CNNs Capture in Sequential Data**

1. CNNs excel at detecting **local temporal patterns**, such as short phrases in a sentence, characteristic shapes in ECG waveforms, or small segments of sensor activity that occur over short time durations.
2. Since convolutions across different time steps are independent of one another, CNNs compute all outputs for all time positions in parallel.
3. This parallel computation makes CNNs highly efficient on modern GPUs, enabling them to process entire sequences significantly faster than sequential models like RNNs.

---

## **1.6 Limitations of CNNs for Long-Range Sequence Modelling**

1. A 1D CNN's receptive field, which determines how much context from the past and future it can use, depends on the kernel size, the number of convolutional layers stacked on top of one another, and the use of techniques like dilation.
2. Capturing very long-range dependencies, for example relationships between events at the very beginning and the very end of a long document, requires either deeper networks or complex dilation patterns.
3. Because CNNs do not maintain an internal memory state that persists over time, they are limited to processing information within a fixed-sized temporal window unless the architecture is manually deepened.
4. This lack of a persistent, dynamic memory makes CNNs less suited for tasks where long-term dependencies are essential, such as language modelling, speech recognition, or long-duration physiological signals.
5. These limitations motivate the transition from CNNs to Recurrent Neural Networks (RNNs), which introduce an explicit hidden state that can, in principle, carry information across arbitrarily many time steps and is therefore a more natural fit for modelling long-term temporal relationships.
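
To make the receptive-field argument concrete, here is a small helper. It is a sketch that assumes stride-1 convolutions, for which each layer with kernel size $K$ and dilation $d$ adds $(K-1)\,d$ time steps to the receptive field; the function name `receptive_field` is made up for this example.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d   # each layer widens the window by (K - 1) * dilation
    return rf

# Two layers with K=3, as in a small two-layer 1D CNN.
print(receptive_field([3, 3]))                                   # 5
# Ten layers with K=3 and no dilation: growth is only linear in depth.
print(receptive_field([3] * 10))                                 # 21
# Ten layers with dilations 1, 2, 4, ..., 512: growth is exponential in depth.
print(receptive_field([3] * 10, [2 ** i for i in range(10)]))    # 2047
```

Without dilation, covering a sequence of a few thousand steps would need on the order of a thousand layers, which is why long-range modelling with plain CNNs becomes impractical.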

---


### **Example of a 1D CNN used for sequence classification in PyTorch.**
The input is assumed to be of shape $(B, T, D)$:


```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN1DClassifier(nn.Module):
    def __init__(self, input_dim, num_classes, hidden_channels=64, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(
            in_channels=input_dim,
            out_channels=hidden_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2
        )
        self.conv2 = nn.Conv1d(
            in_channels=hidden_channels,
            out_channels=hidden_channels,
            kernel_size=kernel_size,
            padding=kernel_size // 2
        )
        self.pool = nn.AdaptiveAvgPool1d(1)  # average over time dimension
        self.fc = nn.Linear(hidden_channels, num_classes)

    def forward(self, x):
        # x: (B, T, D)
        x = x.transpose(1, 2)        # (B, D, T) for Conv1d
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool(x)             # (B, hidden_channels, 1)
        x = x.squeeze(-1)            # (B, hidden_channels)
        logits = self.fc(x)          # (B, num_classes)
        return logits
```


* This model extracts **local temporal patterns** and then averages them over time to get a fixed-size representation for classification.
* It works well for tasks where local context is enough, but it does not explicitly "remember" long-term information in a flexible way.



# **2. Vanilla RNNs: idea, equations**

Recurrent Neural Networks are designed specifically to handle sequences by maintaining an internal state that evolves over time. The key idea is that at each time step $t$, the RNN takes the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$. This hidden state plays the role of a memory that accumulates information about the past.

Conceptually, you can picture an RNN as a small feed-forward network (the RNN cell) that is applied repeatedly along the time axis. When we "unroll" the RNN over time, we get a chain of these cells, one for each time step, all sharing the same parameters.

---
## **2.1 Specific intent of a Vanilla RNN**

1. A Recurrent Neural Network (RNN) is designed to process **sequential data** by maintaining an **internal hidden state** that changes at every time step.
2. The hidden state at time $t$, denoted $h_t$, acts as a **memory vector** that stores information gathered from all inputs seen so far, $x_1, x_2, \dots, x_t$.
3. At each time step, the RNN computes a new hidden state $h_t$ using:

   * the current input $x_t$, and
   * the previous hidden state $h_{t-1}$.
4. The RNN cell is **reused** at every time step and shares the same parameters $W_{xh}$, $W_{hh}$, and $b_h$, which allows it to generalize across arbitrary sequence lengths.

This reuse of the same parameters is what gives RNNs "recurrence."

---


## **2.2 Notation and Dimensions for a Simple (tanh) RNN**

### **Inputs**
* $x_t \in \mathbb{R}^{d_x}$
  "input (vector) at time step $t$"

  * Example: a 300-dimensional word embedding
  * Shape: $(d_x,)$

### **Hidden states**
* $h_t \in \mathbb{R}^{d_h}$
  “hidden state / (memory) vector at time step $t$”
  * Shape: $(d_h,)$

### **Weights for input → hidden**

* $W_{xh} \in \mathbb{R}^{d_h \times d_x}$
  Maps the input vector to the hidden dimension.

### **Weights for hidden → hidden**

* $W_{hh} \in \mathbb{R}^{d_h \times d_h}$
  Maps the previous hidden state to the new hidden dimension.

### **Bias term**

* $b_h \in \mathbb{R}^{d_h}$

### **Activation function**

* $\tanh(\cdot)$ :
  Applied element-wise, outputs values in $(-1, 1)$.
* Prevents uncontrolled growth of hidden states.

### **Output**

* $W_{hy} \in \mathbb{R}^{d_y \times d_h}$: hidden → output
* $b_y \in \mathbb{R}^{d_y}$: output bias
* $y_t \in \mathbb{R}^{d_y}$: output at time $t$

---

## **2.3 Core RNN Equations**

The RNN updates its hidden state using two steps:

---

### **(1) Computation of pre-activation value**

$$
a_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h
$$

### **Explanation**:

* $W_{xh} x_t$: influence of the current input
* $W_{hh} h_{t-1}$: influence of previous hidden state (memory)
* $b_h$: bias term
* $a_t \in \mathbb{R}^{d_h}$: intermediate activation vector

---

### **(2) Apply nonlinearity to get hidden state**

$$
h_t = \tanh(a_t)
$$

* The $\tanh$ nonlinearity allows the RNN to represent complex patterns.
* Each component of $h_t$ lies between $-1$ and $+1$.
* The hidden state now "summarizes" all inputs from $x_1$ to $x_t$.

---

### **Output Layer**

$$
y_t = W_{hy} h_t + b_y
$$

* Produces task-specific output (classification, regression, tagging, etc.).
* If we want an output at *every* time step, we compute this for each $t$.
* If we want a single output for the entire sequence, we use **only $h_T$**.
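
The three equations above can be traced with explicit matrices. This is a minimal sketch, assuming PyTorch, randomly initialised weights, a zero initial state $h_0$, and small illustrative dimensions (`d_x=4`, `d_h=3`, `d_y=2`, `T=5`) that are not taken from the text.

```python
import torch

torch.manual_seed(0)
d_x, d_h, d_y, T = 4, 3, 2, 5   # illustrative sizes (assumed)

# Parameters with the shapes listed in the notation section.
W_xh = torch.randn(d_h, d_x) * 0.1
W_hh = torch.randn(d_h, d_h) * 0.1
b_h = torch.zeros(d_h)
W_hy = torch.randn(d_y, d_h) * 0.1
b_y = torch.zeros(d_y)

xs = torch.randn(T, d_x)   # one sequence x_1, ..., x_T
h = torch.zeros(d_h)       # h_0, assumed to be all zeros
ys = []
for t in range(T):
    a = W_xh @ xs[t] + W_hh @ h + b_h   # (1) pre-activation a_t
    h = torch.tanh(a)                    # (2) hidden state h_t
    ys.append(W_hy @ h + b_y)            # output y_t

ys = torch.stack(ys)                     # (T, d_y): one output per time step
print(h.shape, ys.shape)                 # h is h_T after the loop
```

For a single output per sequence, only the last row `ys[-1]` (computed from $h_T$) would be used.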

---

## **2.4 Why All Time Steps Share the Same Parameters**

### The weights do **not** depend on $t$.

* $W_{xh}, W_{hh}, b_h$ are the **same** at every time step.
* This ensures:

  1. The model can process variable-length sequences.
  2. Knowledge learned from early time steps transfers to later steps.
  3. The number of parameters stays fixed no matter how long the sequence is.

This shared-parameter structure is what gives RNNs their sequential modelling power.
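
Parameter sharing can be observed with PyTorch's built-in `nn.RNN`. In this sketch the sizes (input 4, hidden 8) and the two sequence lengths are assumptions made for illustration. Note that PyTorch stores two bias vectors (`bias_ih` and `bias_hh`) where the equations in this section use a single $b_h$.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)  # sizes assumed

# W_xh: 8*4 = 32, W_hh: 8*8 = 64, two bias vectors: 8 + 8 = 16.
n_params = sum(p.numel() for p in rnn.parameters())
print(n_params)   # 112

# The same module, with the same 112 parameters, handles both lengths.
short = torch.randn(2, 5, 4)    # T = 5
long = torch.randn(2, 50, 4)    # T = 50
out_short, _ = rnn(short)
out_long, _ = rnn(long)
print(out_short.shape, out_long.shape)
```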

---

## **2.5 Unrolling the RNN Over Time**

Even though the RNN is "recurrent," we typically visualize it using *unrolling*:
![rnn.svg](https://d2l.ai/_images/rnn.svg)

Think of this as a **chain of identical mini-networks**, each taking:

* one input $x_t$,
* the previous hidden state $h_{t-1}$,
* producing the next state $h_t$.

---

## **2.6 Different RNN setups used in applications**
![RNN input/output configurations](https://karpathy.github.io/assets/rnn/diags.jpeg)

### **1. Whole-sequence classification**

* Feed the entire sequence into the RNN.
* Use the final hidden state $h_T$ as a "summary."
* Example: sentiment classification of a sentence.

### **2. Sequence tagging (output at each time step)**

* Compute output $y_t$ for every $t$.
* Example: POS tagging, named entity recognition.

### **3. Sequence-to-sequence models (encoder-decoder)**

* One RNN encodes the input into $h_T$.
* Another RNN decodes it into an output sequence.
* Example: machine translation, text-to-speech.
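
The three setups differ mainly in which RNN outputs are used. The sketch below assumes PyTorch's `nn.RNN` with `batch_first=True`; all sizes, the linear heads, and the random decoder inputs are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

B, T, D, H = 2, 6, 4, 8          # batch, length, input and hidden sizes (assumed)
num_classes, num_tags = 3, 5     # assumed

encoder = nn.RNN(input_size=D, hidden_size=H, batch_first=True)
x = torch.randn(B, T, D)
out, h_n = encoder(x)            # out: (B, T, H), h_n: (1, B, H)

# 1. Whole-sequence classification: one prediction from the final state h_T.
seq_logits = nn.Linear(H, num_classes)(h_n[-1])     # (B, num_classes)

# 2. Sequence tagging: one prediction per time step.
tag_logits = nn.Linear(H, num_tags)(out)            # (B, T, num_tags)

# 3. Encoder-decoder: h_T initialises the hidden state of a second RNN.
decoder = nn.RNN(input_size=D, hidden_size=H, batch_first=True)
dec_inputs = torch.randn(B, 4, D)                   # placeholder decoder inputs
dec_out, _ = decoder(dec_inputs, h_n)               # (B, 4, H)

print(seq_logits.shape, tag_logits.shape, dec_out.shape)
```

In a real translation model the decoder inputs would be the previously generated tokens rather than random tensors.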

---

## **2.7 Points to always remember**

1. **RNNs act like "loops" inside neural networks.**
   Instead of processing all time steps independently, the network passes hidden information from step to step.

2. **The hidden state is the RNN's memory.**
   It contains everything the RNN has decided is important from previous inputs.

3. **Each new input modifies the hidden state.**
   The RNN learns *how* to combine the new input and prior memory using the matrices $W_{xh}$ and $W_{hh}$.

4. **The tanh activation controls the hidden state range.**
   Without a bounded activation, the hidden state might explode numerically.

5. **The recurrence allows the RNN to learn temporal patterns.**
   For example, it can learn that the meaning of the current input depends on something that appeared earlier in the sequence.

6. **Vanilla RNNs struggle with very long sequences.**
   Because gradients multiply repeatedly through $W_{hh}$, they often vanish or explode.

7. **This motivates improved architectures like LSTMs and GRUs.**
   These add gating mechanisms that better manage long-term dependencies.

---


### **Summary of update equations**
$$
a_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h
$$
$$
h_t = \tanh(a_t)
$$

where:

* $W_{xh} \in \mathbb{R}^{d_h \times d_x}$ maps inputs to hidden state,
* $W_{hh} \in \mathbb{R}^{d_h \times d_h}$ maps the previous hidden state to the new one,
* $b_h \in \mathbb{R}^{d_h}$ is the bias term,
* and $\tanh$ is applied element-wise.

Optionally, if we want an output $y_t$ at each time step, we can define:
$$
y_t = W_{hy} h_t + b_y
$$
where $W_{hy} \in \mathbb{R}^{d_y \times d_h}$ and $b_y \in \mathbb{R}^{d_y}$.

The important conceptual point is that the same matrices $W_{xh}$, $W_{hh}$, and $b_h$ are used at every time step. This means the RNN performs the same type of computation at each step, but applied to evolving inputs and hidden states.



### **Actual Depiction of the Unrolled RNN**


*[Figure: diagram of the unrolled RNN (image not included)]*


####  **CASE: 3-layer RNN with *different* hidden sizes**

Let’s assume:

* **Input dimension = 4**
* **Layer 1 hidden size = 8**
* **Layer 2 hidden size = 16**
* **Layer 3 hidden size = 32**
* **Sequence length T = 3**

So:

* $x_t \in \mathbb{R}^{4}$
* $h_t^{(1)} \in \mathbb{R}^{8}$
* $h_t^{(2)} \in \mathbb{R}^{16}$
* $h_t^{(3)} \in \mathbb{R}^{32}$

These clearly form a **non-symmetric vertical structure**.

---

#### 1. **Vertical flows (between layers) are *not symmetric***

Input enters **layer 1**, and each layer sends its output upward:

* $x_t\ (4) \to h_t^{(1)}\ (8)$
* $h_t^{(1)}\ (8) \to h_t^{(2)}\ (16)$
* $h_t^{(2)}\ (16) \to h_t^{(3)}\ (32)$

Each vertical arrow uses a matrix:

* $W_{xh}^{(1)} \in \mathbb{R}^{8 \times 4}$
* $W_{xh}^{(2)} \in \mathbb{R}^{16 \times 8}$
* $W_{xh}^{(3)} \in \mathbb{R}^{32 \times 16}$

Thus **vertical edges expand in width**, and the diagram must look asymmetric.

---

#### 2. **Horizontal flows (time recurrence) stay inside a single layer**

Inside each layer:

* Layer 1: $h_{t-1}^{(1)}\ (8) \to h_t^{(1)}\ (8)$
* Layer 2: $h_{t-1}^{(2)}\ (16) \to h_t^{(2)}\ (16)$
* Layer 3: $h_{t-1}^{(3)}\ (32) \to h_t^{(3)}\ (32)$

These use:

* $W_{hh}^{(1)}, W_{hh}^{(2)}, W_{hh}^{(3)}$

All **square matrices**.

---

A stacked RNN is just a sequence of transformations per time step:

$$
x_t\ (4) \to h_t^{(1)}\ (8) \to h_t^{(2)}\ (16) \to h_t^{(3)}\ (32)
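
The weight shapes in this case can be verified in PyTorch. Since `nn.RNN` with `num_layers > 1` uses a single `hidden_size` for every layer, this sketch chains three separate single-layer `nn.RNN` modules instead; that choice is an assumption made for the example, not something prescribed by the text.

```python
import torch
import torch.nn as nn

sizes = [4, 8, 16, 32]   # input dim followed by the three hidden sizes

# One single-layer RNN per level of the stack: 4 -> 8, 8 -> 16, 16 -> 32.
layers = nn.ModuleList([
    nn.RNN(input_size=d_in, hidden_size=d_out, batch_first=True)
    for d_in, d_out in zip(sizes[:-1], sizes[1:])
])

x = torch.randn(1, 3, 4)   # (B=1, T=3, D=4)
for layer in layers:
    x, _ = layer(x)        # the full output sequence feeds the next layer
print(x.shape)             # torch.Size([1, 3, 32])

# Vertical matrices W_xh are rectangular, recurrent matrices W_hh are square.
ih_shapes = [tuple(layer.weight_ih_l0.shape) for layer in layers]
hh_shapes = [tuple(layer.weight_hh_l0.shape) for layer in layers]
print(ih_shapes)   # [(8, 4), (16, 8), (32, 16)]
print(hh_shapes)   # [(8, 8), (16, 16), (32, 32)]
```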
$$



## **Important Reference**

* Read this article before proceeding: [Reference for RNN Basics](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

* To revise the basics of RNNs, see https://d2l.ai/chapter_recurrent-neural-networks/index.html

### **Example of a Simple RNN Cell**

Let us implement a minimal RNN cell in PyTorch and manually unroll it over a sequence. This code explicitly shows that the same `SimpleRNNCell` is used at each time step, and that the hidden state `h` is updated step by step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # W_xh carries the bias b_h; W_hh has no bias of its own.
        self.W_xh = nn.Linear(input_size, hidden_size, bias=True)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        # x_t: (B, input_size)
        # h_prev: (B, hidden_size)
        h_t = torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))
        return h_t

def run_simple_rnn(cell, x):
    # x: (B, T, input_size)
    B, T, _ = x.shape
    # Initial hidden state h_0 = 0, created on the same device/dtype as x.
    h = torch.zeros(B, cell.hidden_size, device=x.device, dtype=x.dtype)
    outputs = []

    for t in range(T):
        x_t = x[:, t, :]      # (B, input_size)
        h = cell(x_t, h)      # update hidden state
        outputs.append(h.unsqueeze(1))

    # outputs: (B, T, hidden_size)
    return torch.cat(outputs, dim=1), h
```


In practice, we usually use PyTorch’s built-in `nn.RNN` module.

* Following is an example of a simple sequence classifier using `nn.RNN`.
* This model reads the whole sequence and then uses the last hidden state to make a prediction for the entire sequence.

```python
class RNNClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,   # (B, T, D)
            nonlinearity='tanh'
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (B, T, D)
        out, h_n = self.rnn(x)
        # out: (B, T, hidden_size)
        # h_n: (num_layers, B, hidden_size)
        last_hidden = out[:, -1, :]       # (B, hidden_size)
        logits = self.fc(last_hidden)     # (B, num_classes)
        return logits
```


---


# **3. Backpropagation Through Time and Gradient Problems**

Training an RNN is done by backpropagation, but because the network is unrolled over time, backpropagation must go through each time step. This process is called Backpropagation Through Time (BPTT).

If we unroll an RNN for $T$ time steps and compute a loss that depends on the final hidden state $h_T$, the gradient of the loss with respect to an earlier hidden state $h_t$ involves a product of many Jacobian matrices. In a simplified form:

$$
\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \cdot
\prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
$$

Each factor $\frac{\partial h_k}{\partial h_{k-1}}$ is related to the derivative of the activation function and the recurrent weight matrix $W_{hh}$. When we multiply many such matrices, two extreme behaviours can occur:

1. **Vanishing gradients:** The gradient becomes very small as it is propagated back through many time steps. This happens when the eigenvalues of the effective Jacobian are mostly less than 1 in magnitude. In practice, the gradient becomes almost zero for early time steps, and the model cannot learn long-range dependencies well.

2. **Exploding gradients:** The gradient becomes very large as it is propagated. This happens when the eigenvalues are mostly greater than 1 in magnitude. In practice, gradients can blow up, leading to numerical instability and `NaN` values during training.

A simple scalar example makes this intuition clearer. Consider a scalar RNN where:
$$
h_t = \tanh(w \cdot h_{t-1})
$$
Then:
$$
\frac{\partial h_t}{\partial h_{t-1}} = (1 - \tanh^2(w h_{t-1})) \cdot w
$$
The derivative of $\tanh$ is at most 1 in magnitude, so if $|w| < 1$, repeated multiplication drives the gradient toward zero as we go back in time. If $|w| > 1$, repeated multiplication can cause the gradients to explode.
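
The scalar case can be checked numerically. The helper below is a sketch (its name, the starting value `h0=0.5`, and the horizon `T=50` are assumptions for illustration). It multiplies the per-step derivatives $(1 - h_t^2)\,w$ along a trajectory, and also includes a purely linear recurrence $h_t = w\,h_{t-1}$ as a comparison case that is not part of the model above.

```python
import math

def grad_through_time(w, T, h0=0.5, use_tanh=True):
    """Return d h_T / d h_0 for h_t = tanh(w * h_{t-1}), or for h_t = w * h_{t-1}."""
    h, grad = h0, 1.0
    for _ in range(T):
        pre = w * h
        if use_tanh:
            h = math.tanh(pre)
            grad *= (1.0 - h * h) * w   # tanh'(pre) * w
        else:
            h = pre
            grad *= w                   # linear recurrence: derivative is just w
    return grad

g_small = grad_through_time(0.5, 50)                    # |w| < 1: vanishes
g_sat = grad_through_time(1.5, 50)                      # |w| > 1 with tanh
g_linear = grad_through_time(1.5, 50, use_tanh=False)   # |w| > 1 without tanh

print(g_small, g_sat, g_linear)
```

For these particular values, the gradient also shrinks for $w = 1.5$ with $\tanh$, because the state settles in a region where $1 - h_t^2$ is small. The linear recurrence, with no saturation, grows as $1.5^{50}$. This is why $|w| > 1$ *can*, but need not, lead to explosion.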

In practice, when training simple RNNs on long sequences, we observe that they tend to remember only short-term information. Long-term dependencies are almost ignored, because the gradients needed to learn them have vanished. This is one of the main reasons why simple RNNs are not sufficient for many realistic sequence tasks.

There are some engineering tricks to mitigate these problems. For exploding gradients, one common method is **gradient clipping**, where we rescale the gradients whenever their norm exceeds a threshold, before updating the parameters:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
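
For context, the clipping call sits between `loss.backward()` and `optimizer.step()`. The following is a minimal sketch of one training step; the model, data, optimizer, and the artificially scaled loss (used only so that clipping is likely to trigger) are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 30, 4)      # placeholder batch (B=2, T=30, D=4)
target = torch.randn(2, 8)     # placeholder target for the final hidden state

optimizer.zero_grad()
_, h_n = model(x)
# Loss scaled by 1000 purely to produce large gradients for the demonstration.
loss = 1000.0 * nn.functional.mse_loss(h_n[-1], target)
loss.backward()

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

optimizer.step()
print(float(norm_before), float(norm_after))
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way but cannot be arbitrarily large.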

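**Truncated BPTT** is a related trick: we backpropagate through only a fixed number of time steps (for example 20 or 50) instead of the entire sequence. This keeps the cost of each update bounded and shortens the chain of Jacobians, but gradients never flow further back than the truncation window, so it does not by itself solve the vanishing-gradient problem. The deeper solution is to change the architecture of the recurrence itself, which leads us to LSTMs and GRUs.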
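
One common way to implement truncated BPTT in PyTorch is to process the sequence in chunks and detach the hidden state between chunks. The sketch below assumes a per-step regression task with random placeholder data; the chunk length of 20 and all other sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D, H, chunk = 2, 100, 4, 8, 20   # illustrative sizes (assumed)

rnn = nn.RNN(input_size=D, hidden_size=H, batch_first=True)
head = nn.Linear(H, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(B, T, D)   # placeholder inputs
y = torch.randn(B, T, 1)   # placeholder per-step targets

h = torch.zeros(1, B, H)   # initial hidden state
num_updates = 0
for start in range(0, T, chunk):
    x_chunk = x[:, start:start + chunk]
    y_chunk = y[:, start:start + chunk]

    # Keep the state's values but cut the graph: gradients stop at the chunk boundary.
    h = h.detach()

    out, h = rnn(x_chunk, h)
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()        # backpropagates through at most `chunk` time steps
    optimizer.step()
    num_updates += 1

print(num_updates)         # 5 updates for T=100 and chunk=20
```

The hidden state still carries information forward across chunks; only the backward pass is truncated.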

---


### **Mathematical Derivation - BPTT**



#### **1. Setup: A simple tanh RNN**

We consider the standard RNN recurrence:

$$
h_t=\tanh(a_t), \qquad
a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h
$$

Where:

* $x_t \in \mathbb{R}^{d_x}$: input at time step $t$
* $h_t \in \mathbb{R}^{d_h}$: hidden state at time step $t$
* $W_{xh} \in \mathbb{R}^{d_h \times d_x}$ — input→hidden weights
* $W_{hh} \in \mathbb{R}^{d_h \times d_h}$ — recurrent weights
* $b_h \in \mathbb{R}^{d_h}$
* We assume the output is computed only at the last time step:

$$
y = W_{hy}h_T,\qquad L = \ell(y, \text{target})
$$

---

#### **2. Why BPTT is different from standard backprop**

In a feed-forward network:

* The computation graph is a straight chain.

In an RNN:

* The same parameters $W_{xh}, W_{hh}$ are reused at **every time step**.
* This creates a computational graph that looks like a chain **over time**.

So the gradient must be propagated **back in time** through all hidden states:

$$
h_T \rightarrow h_{T-1} \rightarrow h_{T-2} \rightarrow \dots \rightarrow h_1
$$

This is the essence of **Backpropagation Through Time (BPTT)**.

---

#### **3. Gradients needed**

For now, assume the loss depends only on the final hidden state $h_T$:

$$
L = L(h_T)
$$

We must compute:

1. $\frac{\partial L}{\partial h_t}$ for all $t$
2. $\frac{\partial L}{\partial W_{xh}}$
3. $\frac{\partial L}{\partial W_{hh}}$
4. $\frac{\partial L}{\partial b_h}$

---

#### **4. Core idea of BPTT**

Because the hidden state recurrence is:

$$
h_t = \tanh ( W_{xh} x_t + W_{hh} h_{t-1} + b_h ),
$$

a change in $h_{t-1}$ affects **all future hidden states**:

$$
h_{t-1} \to h_t \to h_{t+1} \to ... \to h_T.
$$

Thus:

$$
\frac{\partial L}{\partial h_{t-1}}
= \frac{\partial L}{\partial h_t}
\frac{\partial h_t}{\partial h_{t-1}}
$$

---

#### **5. Step-by-step BPTT Derivation**

##### **5.1 Gradient from loss to last hidden state**

We begin with:

$$
\delta h_T = \frac{\partial L}{\partial h_T}.
$$

If $y = W_{hy}h_T$ and $\delta y = \frac{\partial L}{\partial y}$ (for example from a cross-entropy or MSE loss), then:

$$
\delta h_T = W_{hy}^T \delta y.
$$

---

##### **5.2 Backpropagate from $h_t$ to $h_{t-1}$**

Because:

$$
h_t = \tanh(a_t),
$$

$$
\frac{\partial h_t}{\partial a_t}
= \operatorname{diag}(1 - h_t^2).
$$

And:

$$
a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h,
$$

so:

$$
\frac{\partial a_t}{\partial h_{t-1}} = W_{hh}.
$$

Thus:

$$
\frac{\partial h_t}{\partial h_{t-1}}
= \operatorname{diag}(1 - h_t^2)\, W_{hh}.
$$


#### **Important RNN gradient formula**
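Finally, the BPTT recurrence, with $\delta h_t = \frac{\partial L}{\partial h_t}$ written as a column vector and $\odot$ denoting the element-wise product:

$$
\boxed{
\delta h_{t-1}
= W_{hh}^T \left( (1 - h_t^2) \odot \delta h_t \right)
}
$$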

This is how gradients move backward through time.

---

##### **5.3 Compute weight gradients**

**Input→hidden gradient**

$$
\frac{\partial L}{\partial W_{xh}}
= \sum_{t=1}^T \left( (1 - h_t^2) \odot \delta h_t \right) x_t^T
$$

**Hidden→hidden gradient**

$$
\frac{\partial L}{\partial W_{hh}}
= \sum_{t=1}^T \left( (1 - h_t^2) \odot \delta h_t \right) h_{t-1}^T
$$

**Bias gradient**

$$
\frac{\partial L}{\partial b_h}
= \sum_{t=1}^T \left( (1 - h_t^2) \odot \delta h_t \right)
$$

These are summed over **all time steps** because the weights are reused.

---

#### **6. Putting everything together (final BPTT algorithm)**

1. **Forward pass**

   * Compute each $h_t$
   * Compute loss $L$

2. **Backward pass initialization**

   * Compute $\delta h_T = \frac{\partial L}{\partial h_T}$

3. **For t = T … 1 (backwards in time):**

   * Compute local gradient:
     $$
     \delta a_t = (1 - h_t^2) \odot \delta h_t
     $$
   * Accumulate parameter gradients:
     $$
     \Delta W_{xh} \leftarrow \Delta W_{xh} + \delta a_t x_t^T
     $$
     $$
     \Delta W_{hh} \leftarrow \Delta W_{hh} + \delta a_t h_{t-1}^T
     $$
     $$
     \Delta b_h \leftarrow \Delta b_h + \delta a_t
     $$
   * Backpropagate to previous hidden state:
     $$
     \delta h_{t-1} = W_{hh}^T \delta a_t
     $$

4. **Update weights**

   * using SGD/Adam
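
The algorithm above can be checked against automatic differentiation. This sketch implements the backward loop by hand for a single sequence and compares the result with PyTorch's autograd. The dimensions, the zero initial state, and the toy loss $L = \tfrac{1}{2}\lVert h_T \rVert^2$ (chosen so that $\delta h_T = h_T$) are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
d_x, d_h, T = 3, 4, 6   # illustrative sizes (assumed)

W_xh = torch.randn(d_h, d_x, requires_grad=True)
W_hh = (0.5 * torch.randn(d_h, d_h)).requires_grad_()
b_h = torch.zeros(d_h, requires_grad=True)
xs = torch.randn(T, d_x)            # xs[t - 1] holds x_t

# 1. Forward pass: hs[t] holds h_t, with hs[0] = h_0 = 0.
hs = [torch.zeros(d_h)]
for t in range(T):
    hs.append(torch.tanh(W_xh @ xs[t] + W_hh @ hs[-1] + b_h))
loss = 0.5 * (hs[-1] ** 2).sum()    # toy loss, so dL/dh_T = h_T
loss.backward()                     # reference gradients from autograd

# 2.-3. Manual backward pass following the algorithm step by step.
with torch.no_grad():
    dW_xh = torch.zeros_like(W_xh)
    dW_hh = torch.zeros_like(W_hh)
    db_h = torch.zeros_like(b_h)
    delta_h = hs[-1].clone()                          # delta h_T
    for t in range(T, 0, -1):
        delta_a = (1 - hs[t] ** 2) * delta_h          # through tanh
        dW_xh += torch.outer(delta_a, xs[t - 1])      # delta a_t x_t^T
        dW_hh += torch.outer(delta_a, hs[t - 1])      # delta a_t h_{t-1}^T
        db_h += delta_a
        delta_h = W_hh.T @ delta_a                    # delta h_{t-1}

ok_xh = torch.allclose(dW_xh, W_xh.grad, atol=1e-5)
ok_hh = torch.allclose(dW_hh, W_hh.grad, atol=1e-5)
ok_b = torch.allclose(db_h, b_h.grad, atol=1e-5)
print(ok_xh, ok_hh, ok_b)
```

If the three comparisons agree, the hand-written loop reproduces what autograd computes when it backpropagates through the unrolled graph.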

---

#### **7. Why Vanishing/Exploding Gradients Occur**

Observe the recurrence:

$$
\delta h_{t-1}
= W_{hh}^T \left( (1 - h_t^2) \odot \delta h_t \right)
$$

This involves **repeated multiplication** by:

1. The Jacobian of tanh:
   each entry of $(1 - h_t^2)$ lies in $(0,1]$ → tends to shrink gradients → **vanishing**

2. The recurrent matrix $W_{hh}$:
   If its eigenvalues are mostly larger than 1 in magnitude → gradients can explode
   If they are mostly smaller than 1 in magnitude → gradients vanish

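Thus, when the loss depends only on the final hidden state (factors ordered by increasing $i$ from left to right):

$$
\delta h_{t-k}
=
\left[ \prod_{i=t-k+1}^{t} W_{hh}^T \operatorname{diag}(1 - h_i^2) \right]
\delta h_t
$$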

This product can:

* shrink to **zero** → vanishing gradients
* blow up to **infinity** → exploding gradients
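
The effect of the recurrent matrix alone can be isolated in a small experiment. This sketch makes two simplifying assumptions that are not part of the derivation: the $\tanh$ factor is dropped (as if $1 - h_i^2 = 1$), and $W_{hh}$ is a scaled orthogonal matrix, so that each backward step multiplies the gradient norm by exactly the scale.

```python
import torch

torch.manual_seed(0)
d_h, steps = 16, 50   # illustrative sizes (assumed)

# Orthogonal matrix Q from a QR decomposition; scale * Q has all singular values = scale.
Q, _ = torch.linalg.qr(torch.randn(d_h, d_h))

def backprop_norm(scale):
    """Norm of the gradient after `steps` backward steps, ignoring the tanh factor."""
    W_hh = scale * Q
    delta = torch.ones(d_h) / d_h ** 0.5   # unit-norm starting gradient
    for _ in range(steps):
        delta = W_hh.T @ delta             # delta h_{t-1} = W_hh^T delta h_t (tanh omitted)
    return delta.norm().item()

shrunk = backprop_norm(0.9)   # roughly 0.9 ** 50
grown = backprop_norm(1.1)    # roughly 1.1 ** 50
print(shrunk, grown)
```

After only 50 steps the two norms differ by more than four orders of magnitude, even though the scales 0.9 and 1.1 are both close to 1. With the $\tanh$ factor included, the gradient would shrink further in both cases.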

---
