<div style="background-color: black; color: white; padding: 10px;text-align: center;">
  <strong>Date Published:</strong> December 2, 2025 <strong>Author:</strong> Adnan Alaref
</div>

# 🧠 Introduction to Vanilla Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are one of the simplest and most fundamental ways to model **sequential data** such as text, audio, and time-series.

However, many beginners struggle to understand **what actually happens inside an RNN**, especially during **Backpropagation Through Time (BPTT)**.

In this notebook, we will build a **vanilla RNN completely from scratch** using basic PyTorch tensor operations:  
**no `nn.RNN`, no autograd**, only pure math and matrix operations.

This approach will help you build a deep, intuitive understanding of how RNNs work internally.

---

## 🔍 What You Will Learn

### ✔️ Forward Pass
- How a single RNN step computes the next hidden state using:

$$
h_t = \tanh(x_t W_x + h_{t-1} W_h + b)
$$

- How hidden states flow across timesteps in a sequence.

### ✔️ Backward Pass (BPTT)
- How gradients move backward through time.
- How to compute:
  - gradients w.r.t **input**
  - gradients w.r.t **previous hidden state**
  - gradients w.r.t **weights** and **biases**

### ✔️ Concepts & Intuition
- Why **vanishing** and **exploding gradients** happen.
- How deep-learning frameworks compute RNN gradients under the hood.

---

## 🎯 Why This Notebook Is Useful

- Builds **real understanding**, not just API usage.
- Helps you debug **training instabilities** in sequence models.
- Prepares you for more advanced models:
  - LSTM  
  - GRU  
  - Transformers  
- Makes you a stronger ML engineer because you understand **the math behind the frameworks**.

---

Let's get started and open the RNN “black box” together 🚀

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 1: Import Library.</div>

In [1]:
import torch

import warnings
warnings.filterwarnings(action='ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 2: RNN Forward Pass.</div>

### **Forward pass:**
<div style="background:#ffffff; padding:18px; border-radius:6px; box-shadow:0 1px 2px rgba(0,0,0,0.05);">
<pre style="font-family: 'Menlo', 'Courier New', monospace; font-size:14px; line-height:1.3; margin:0;">
x_t --->[Wx]--\
                \
                 +--> z_t = x_t @ Wx + h_{t-1} @ Wh + b --> h_t = tanh(z_t)
h_{t-1}-->[Wh]--/
</pre>
</div>

In [2]:
def rnn_step_forward(x, Wx, prev_h, Wh, b):
  """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Args:
        x: Input data for this timestep, of shape (N, D).
        prev_h: Hidden state from previous timestep, of shape (N, H)
        Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
        Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
        b: Biases, of shape (H,)

    Returns a tuple of:
        next_h: Next hidden state, of shape (N, H)
        cache: Tuple of values needed for the backward pass.
  """
  next_h , cache = None, None

  # h_t = tanh(x_t @ Wx + h_{t-1} @ Wh + b)
  out = x @ Wx + prev_h @ Wh + b
  next_h = torch.tanh(out)

  # store everything needed for backward
  cache = (x, prev_h, Wx, Wh, b, next_h)

  return next_h, cache


def rnn_forward(x, Wx, h0, Wh, b):
  """
    Run a vanilla RNN forward on an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The RNN uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the RNN forward, we return the hidden states for all timesteps.

    Args:
      x: Input data for the entire timeseries, of shape (N, T, D).
      h0: Initial hidden state, of shape (N, H)
      Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
      Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
      b: Biases, of shape (H,)

    Returns a tuple of:
        h: Hidden states for the entire timeseries, of shape (N, T, H).
        cache: Values needed in the backward pass
    """
  h, cache = None, None

  N, T, D = x.shape
  H = h0.shape[1]

  cache, prev_h = [], h0
  h = torch.zeros((N, T, H), dtype=x.dtype, device=x.device)

  for t in range(T):
    xt = x[:,t,:]
    next_h, step_cache = rnn_step_forward(xt, Wx, prev_h, Wh, b)
    prev_h = next_h
    h[:,t,:] = next_h
    cache.append(step_cache)

  return h, cache
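
One way to gain confidence in the forward pass is to compare it with PyTorch's built-in layer. The sketch below is only a sanity check, not part of the from-scratch implementation: it re-declares a compact version of the loop (the name `rnn_forward_sketch` is hypothetical) and assumes that `torch.nn.RNN` stores its weights transposed relative to `Wx` and `Wh` and splits the bias into two vectors, so the second bias is zeroed.

```python
import torch

def rnn_forward_sketch(x, Wx, h0, Wh, b):
    # Same recurrence as the notebook: h_t = tanh(x_t @ Wx + h_{t-1} @ Wh + b)
    prev_h, states = h0, []
    for t in range(x.shape[1]):
        prev_h = torch.tanh(x[:, t, :] @ Wx + prev_h @ Wh + b)
        states.append(prev_h)
    return torch.stack(states, dim=1)  # (N, T, H)

torch.manual_seed(0)
N, T, D, H = 2, 3, 4, 5
x = torch.randn(N, T, D)
h0 = torch.randn(N, H)
Wx, Wh, b = torch.randn(D, H), torch.randn(H, H), torch.randn(H)

# nn.RNN computes h_t = tanh(x_t @ W_ih.T + b_ih + h_{t-1} @ W_hh.T + b_hh),
# so we copy in the transposed weights and put the whole bias into b_ih.
ref = torch.nn.RNN(input_size=D, hidden_size=H, nonlinearity="tanh", batch_first=True)
with torch.no_grad():
    ref.weight_ih_l0.copy_(Wx.T)
    ref.weight_hh_l0.copy_(Wh.T)
    ref.bias_ih_l0.copy_(b)
    ref.bias_hh_l0.zero_()
    ref_h, ref_hn = ref(x, h0.unsqueeze(0))  # initial state must be (num_layers, N, H)

manual_h = rnn_forward_sketch(x, Wx, h0, Wh, b)
max_diff = (manual_h - ref_h).abs().max().item()
print("max abs difference vs nn.RNN:", max_diff)
```

If the two implementations agree, the printed difference should be at the level of float32 rounding error.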

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 3: Explain Forward Pass.</div>

## **1- Why We Loop Over T, Not N, in an RNN**

### Short Answer  
You loop over **T** because **RNNs process sequences over time**, not batches.

- **N** = number of independent examples (batch size)  
- **T** = number of time steps in a sequence  

The RNN moves step-by-step **along the time dimension** → so you must loop over **T**, not N.

---

## **2- Deep Explanation (Clear and Simple)**

`x` has shape **(N, T, D)**

Example:
- **Batch dimension (N)** = 2 → processes two sequences in parallel  
- **Time dimension (T)**  = 3 → iterates over time steps  
- **Feature dimension (D)** = 4 → vector per token/time step
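
To make the roles of **N** and **T** concrete, here is a minimal sketch using the example shapes above. The helper `run` and the 0.5 weight scaling are illustrative assumptions, not part of the notebook's code. It loops over T only and then checks that processing the batch at once gives the same result as processing each sequence on its own.

```python
import torch

torch.manual_seed(0)
N, T, D, H = 2, 3, 4, 5
x = torch.randn(N, T, D)
h0 = torch.zeros(N, H)
Wx, Wh, b = torch.randn(D, H) * 0.5, torch.randn(H, H) * 0.5, torch.zeros(H)

def run(x, h0):
    # Loop over T only; each matmul handles every row of the batch at once.
    h = h0
    for t in range(x.shape[1]):
        x_t = x[:, t, :]                      # (batch, D): one time step for all sequences
        h = torch.tanh(x_t @ Wx + h @ Wh + b)
    return h                                  # (batch, H): final hidden state

batched = run(x, h0)
# Run each sequence alone as a batch of size 1, then stack the results.
one_by_one = torch.cat([run(x[n:n + 1], h0[n:n + 1]) for n in range(N)], dim=0)

print(x[:, 0, :].shape)  # torch.Size([2, 4])
print(torch.allclose(batched, one_by_one, atol=1e-5))
```

No loop over N is needed because the rows of the batch never interact: row `n` of `x_t @ Wx` depends only on row `n` of `x_t`.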

---
## **3- Temporal Recurrence in an RNN**
The RNN computes hidden states **over time**.
This is **temporal recurrence**: each hidden state depends on the previous hidden state, so step `t` cannot start before step `t-1` has finished.

<div style="background:#ffffff; padding:18px; border-radius:6px; box-shadow:0 1px 2px rgba(0,0,0,0.05);"><pre style="font-family: 'Menlo', 'Courier New', monospace; font-size:14px; line-height:1.3; margin:0;">
t = 0 → compute h0 → h1
t = 1 → compute h1 → h2
t = 2 → compute h2 → h3
</pre>
</div>
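
A small experiment can make this dependency visible. In the sketch below (the helper `hidden_states` and the 0.5 weight scaling are assumptions made for illustration), only the input at `t = 1` is perturbed. The hidden state before that step should stay identical, while every later state should change, because the change is carried forward through the recurrence.

```python
import torch

torch.manual_seed(0)
N, T, D, H = 1, 3, 4, 5
Wx, Wh, b = torch.randn(D, H) * 0.5, torch.randn(H, H) * 0.5, torch.zeros(H)

def hidden_states(x, h0):
    # Returns the hidden state after every time step, shape (N, T, H).
    h, out = h0, []
    for t in range(x.shape[1]):
        h = torch.tanh(x[:, t, :] @ Wx + h @ Wh + b)
        out.append(h)
    return torch.stack(out, dim=1)

x = torch.randn(N, T, D)
h0 = torch.zeros(N, H)
base = hidden_states(x, h0)

# Perturb only the middle input (t = 1) and recompute.
x_mod = x.clone()
x_mod[:, 1, :] += 1.0
mod = hidden_states(x_mod, h0)

# For each time step, did the hidden state change?
changed = [not torch.equal(base[:, t], mod[:, t]) for t in range(T)]
print(changed)
```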

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 4: RNN Backward Pass.</div>

In [3]:
def rnn_step_backward(dnext_h, cache):
  """
    Backward pass for a single timestep of a vanilla RNN.

    Args:
        dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
        cache: Cache object from the forward pass

    Returns a tuple of:
        dx: Gradients of input data, of shape (N, D)
        dprev_h: Gradients of previous hidden state, of shape (N, H)
        dWx: Gradients of input-to-hidden weights, of shape (D, H)
        dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
        db: Gradients of bias vector, of shape (H,)
  """
  dx, dprev_h, dWx, dWh, db = None, None, None, None, None
  x, prev_h, Wx, Wh, b, next_h = cache

  # Forward equation: z = x @ Wx + prev_h @ Wh + b, next_h = tanh(z)

  # Step 1: Backprop through tanh, using d/dz tanh(z) = 1 - tanh(z)^2
  dz = dnext_h * (1 - next_h**2)

  # Step 2: Gradients with respect to (w.r.t.) inputs and weights
  dx = dz @ Wx.T
  dWx = x.T @ dz

  # z depends on prev_h through prev_h @ Wh, so the gradient uses Wh transposed
  dprev_h = dz @ Wh.T
  dWh = prev_h.T @ dz

  # Step 3: Gradient w.r.t. bias
  db = dz.sum(dim=0) # sum over batch dimension

  return dx, dprev_h, dWx, dWh, db

def rnn_backward(dh, cache):
  """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Args:
      dh: Upstream gradients of all hidden states, of shape (N, T, H).
      cache: List of per-timestep caches from the forward pass.

    NOTE: 'dh' contains the upstream gradients produced by the
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
      dx: Gradient of inputs, of shape (N, T, D)
      dh0: Gradient of initial hidden state, of shape (N, H)
      dWx: Gradient of input-to-hidden weights, of shape (D, H)
      dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
      db: Gradient of biases, of shape (H,)
  """
  dx, dh0, dWx, dWh, db = None, None, None, None, None

  N, T, H = dh.shape
  x0, _, Wx, Wh, _, _ = cache[0]
  D = x0.shape[1]

  # Initialize gradients
  dx = torch.zeros((N, T, D), dtype=x0.dtype, device=x0.device)
  dprev_h_t = torch.zeros((N, H), dtype=Wx.dtype, device=Wx.device)

  dWx = torch.zeros((D, H), dtype=Wx.dtype, device=Wx.device)
  dWh = torch.zeros((H,H), dtype=Wh.dtype, device=Wh.device)
  db = torch.zeros((H,), dtype=Wx.dtype, device=Wx.device)

  # Backprop through time (reverse order)
  for t in reversed(range(T)):

    # Cache at time step t
    step_cached = cache[t]

    # Total gradient flowing into h_t combines two sources:
    #   loss at time t      → dh[:, t, :]
    #   future timestep t+1 → dprev_h_t
    dnext_h = dh[:,t,:] + dprev_h_t

    # Step backward
    dx_t, dprev_h_t, dWx_t, dWh_t, db_t = rnn_step_backward(dnext_h, step_cached)

    # Store this step's input gradient; accumulate the shared parameter gradients
    dx[:,t,:] = dx_t
    dWx += dWx_t
    dWh += dWh_t
    db += db_t

  # Gradient of initial hidden state at t0
  dh0 = dprev_h_t

  return dx, dh0, dWx, dWh, db
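
The reverse loop hands a hidden-state gradient from each step to the one before it, so the gradient that reaches early timesteps is a product of many per-step factors. The sketch below illustrates why that product can vanish or explode. It rests on simplifying assumptions that are not part of the notebook: `Wh` is a scaled orthogonal matrix, and the inputs and initial state are zero so that `h_t = 0` and the tanh derivative `1 - h_t^2` equals 1. Under those assumptions every backward step multiplies the gradient norm by exactly `scale`. The function name `grad_norms_through_time` is hypothetical.

```python
import torch

def grad_norms_through_time(scale, T=20, H=5, seed=0):
    """Norms of the gradient handed backward at each step, latest step first."""
    g = torch.Generator().manual_seed(seed)
    # Assumption: Wh = scale * Q with Q orthogonal, so multiplying by Wh.T
    # rescales the gradient norm by exactly `scale`.
    Q, _ = torch.linalg.qr(torch.randn(H, H, generator=g))
    Wh = scale * Q
    # Assumption: zero input and zero initial state keep h_t = 0 at every step.
    h = torch.zeros(1, H)
    hs = []
    for _ in range(T):
        h = torch.tanh(h @ Wh)
        hs.append(h)
    dh = torch.randn(1, H, generator=g)   # upstream gradient at the last step only
    norms = []
    for t in reversed(range(T)):
        dz = dh * (1 - hs[t] ** 2)        # back through tanh
        dh = dz @ Wh.T                    # back through h_{t-1} @ Wh
        norms.append(dh.norm().item())
    return norms

for scale in (0.5, 1.0, 1.5):
    norms = grad_norms_through_time(scale)
    print(f"scale={scale}: one step back {norms[0]:.3e}, {len(norms)} steps back {norms[-1]:.3e}")
```

With real, nonzero inputs the factor `1 - h_t^2` is below 1, which shrinks the gradient further. That is one reason vanishing gradients are so common in tanh RNNs.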

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 5: Explain Backward Pass.</div>

**Backward pass (derivatives):**
<div style="background:#ffffff; padding:18px; border-radius:6px; box-shadow:0 1px 2px rgba(0,0,0,0.05);">
<pre style="font-family: 'Menlo', 'Courier New', monospace; font-size:14px; line-height:1.3; margin:0;">
   dh_t
    |
    v
 dz = dh_t * (1 - h_t^2)
    |
  +---+
  |   |
  v   v
dx = dz @ Wx^T      dh_prev = dz @ Wh^T
    |
    v
dWx = x_t^T @ dz    dWh = h_{t-1}^T @ dz
    |
    v
   db = sum(dz)

</pre>
</div>

---

### **Explanation:**
- **Forward**: Compute z_t = x_t Wx + h_{t-1} Wh + b → h_t = tanh(z_t)  
- **Backward**: Start with upstream gradient dh_t  
- **Through tanh**: dz = dh_t * (1 - h_t^2) (element-wise)  
- **Linear layer**: Compute gradients w.r.t inputs and weights:
   - dx_t = dz @ Wx^T  
   - dh_{t-1} = dz @ Wh^T  
   - dWx = x_t^T @ dz  
   - dWh = h_{t-1}^T @ dz  
   - db = dz.sum(axis=0)
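
These formulas can be checked numerically without autograd. The following is a minimal sketch with hypothetical helper names (`step_forward`, `step_backward`, `numeric_grad`): it perturbs each entry of an input by a small `eps`, measures how the scalar `sum(h_t * dh_t)` changes, and compares that central-difference estimate with the analytic gradient. Double precision and `eps = 1e-6` are choices made here to keep rounding error small.

```python
import torch

def step_forward(x, Wx, prev_h, Wh, b):
    return torch.tanh(x @ Wx + prev_h @ Wh + b)

def step_backward(dnext_h, x, Wx, prev_h, Wh, next_h):
    # The formulas listed above, in the same order: dx, dh_prev, dWx, dWh, db.
    dz = dnext_h * (1 - next_h ** 2)
    return dz @ Wx.T, dz @ Wh.T, x.T @ dz, prev_h.T @ dz, dz.sum(dim=0)

def numeric_grad(f, param, dout, eps=1e-6):
    """Central differences of sum(f() * dout) w.r.t. every entry of `param`."""
    grad = torch.zeros_like(param)
    flat, gflat = param.view(-1), grad.view(-1)   # views share memory with the originals
    for i in range(flat.numel()):
        old = flat[i].item()
        flat[i] = old + eps
        plus = (f() * dout).sum()
        flat[i] = old - eps
        minus = (f() * dout).sum()
        flat[i] = old                             # restore the original value
        gflat[i] = (plus - minus) / (2 * eps)
    return grad

torch.manual_seed(0)
N, D, H = 2, 4, 5
kw = dict(dtype=torch.float64)
x, prev_h = torch.randn(N, D, **kw), torch.randn(N, H, **kw)
Wx, Wh, b = torch.randn(D, H, **kw) * 0.5, torch.randn(H, H, **kw) * 0.5, torch.randn(H, **kw) * 0.5
dnext_h = torch.randn(N, H, **kw)

f = lambda: step_forward(x, Wx, prev_h, Wh, b)
analytic = step_backward(dnext_h, x, Wx, prev_h, Wh, f())
errors = {}
for name, p, a in zip(["dx", "dprev_h", "dWx", "dWh", "db"], [x, prev_h, Wx, Wh, b], analytic):
    errors[name] = (numeric_grad(f, p, dnext_h) - a).abs().max().item()
    print(f"{name}: max abs error {errors[name]:.2e}")
```

A check like this is especially useful for `dh_{t-1}`: because `Wh` is square, using `Wh` instead of `Wh^T` still produces the right shape, and only a value comparison reveals the mistake.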

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 6: Evaluate The Code.</div>

In [4]:
# -----------------------------
# Dummy input and parameters
# -----------------------------
N, T, D, H = 2, 3, 4, 5
torch.manual_seed(0)

x = torch.randn(N, T, D)
h0 = torch.randn(N, H)
Wx = torch.randn(D, H)
Wh = torch.randn(H, H)
b = torch.randn(H)

print("=== Testing rnn_step_forward ===")
x_t = x[:, 0, :]
prev_h = h0
next_h, step_cache = rnn_step_forward(x_t, Wx, prev_h, Wh, b)
print("next_h shape:", next_h.shape)
print("next_h Values:", next_h)

print("\n=== Testing rnn_forward ===")
h, cache = rnn_forward(x, Wx, h0, Wh, b)
print("h shape:", h.shape)
print("h Values:", h)

print("\n=== Testing rnn_step_backward ===")
dnext_h = torch.randn_like(next_h)
dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, step_cache)
print("dx shape:", dx.shape)
print("dprev_h shape:", dprev_h.shape)
print("dWx shape:", dWx.shape)
print("dWh shape:", dWh.shape)
print("db shape:", db.shape)

print("\n=== Testing rnn_backward ===")
dh = torch.randn_like(h)
dx, dh0, dWx, dWh, db = rnn_backward(dh, cache)
print("dx shape:", dx.shape)
print("dh0 shape:", dh0.shape)
print("dWx shape:", dWx.shape)
print("dWh shape:", dWh.shape)
print("db shape:", db.shape)

print("\n✅ All tests completed successfully!")

=== Testing rnn_step_forward ===
next_h shape: torch.Size([2, 5])
next_h Values: tensor([[-1.0000, -0.9995, -0.5591, -0.7388,  0.9966],
        [ 0.9997, -0.9578, -0.6461,  0.5296, -0.9957]])

=== Testing rnn_forward ===
h shape: torch.Size([2, 3, 5])
h Values: tensor([[[-1.0000, -0.9995, -0.5591, -0.7388,  0.9966],
         [ 0.8206,  0.9999, -0.7205, -0.9238,  0.9926],
         [ 0.9559,  0.5068, -0.9920,  0.9981, -0.7791]],

        [[ 0.9997, -0.9578, -0.6461,  0.5296, -0.9957],
         [ 1.0000, -0.7078, -0.4450,  1.0000, -1.0000],
         [ 0.6779,  0.6420,  0.9915, -0.9779, -0.9824]]])

=== Testing rnn_step_backward ===
dx shape: torch.Size([2, 4])
dprev_h shape: torch.Size([2, 5])
dWx shape: torch.Size([4, 5])
dWh shape: torch.Size([5, 5])
db shape: torch.Size([5])

=== Testing rnn_backward ===
dx shape: torch.Size([2, 3, 4])
dh0 shape: torch.Size([2, 5])
dWx shape: torch.Size([4, 5])
dWh shape: torch.Size([5, 5])
db shape: torch.Size([5])

✅ All tests completed successfully!
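
The run above confirms shapes only, and matching shapes do not prove that the gradient values are right. As a stronger check, the sketch below compares a hand-written BPTT with `torch.autograd.grad`. Autograd serves here purely as an independent reference, and the compact `forward` and `backward` helpers are illustrative re-declarations of the same math rather than the notebook's functions. Double precision is an assumption chosen so that differences reflect real errors rather than rounding.

```python
import torch

def forward(x, Wx, h0, Wh, b):
    hs, prev_h = [], h0
    for t in range(x.shape[1]):
        prev_h = torch.tanh(x[:, t, :] @ Wx + prev_h @ Wh + b)
        hs.append(prev_h)
    return torch.stack(hs, dim=1)                     # (N, T, H)

def backward(dh, x, Wx, h0, Wh, h):
    N, T, H = dh.shape
    dx = torch.zeros_like(x)
    dWx, dWh = torch.zeros_like(Wx), torch.zeros_like(Wh)
    db = torch.zeros(H, dtype=x.dtype)
    dprev_h = torch.zeros_like(h0)
    for t in reversed(range(T)):
        prev_h = h[:, t - 1, :] if t > 0 else h0
        # Gradient from the loss at step t plus the gradient arriving from step t+1.
        dz = (dh[:, t, :] + dprev_h) * (1 - h[:, t, :] ** 2)
        dx[:, t, :] = dz @ Wx.T
        dprev_h = dz @ Wh.T
        dWx += x[:, t, :].T @ dz
        dWh += prev_h.T @ dz
        db += dz.sum(dim=0)
    return dx, dprev_h, dWx, dWh, db                  # dprev_h is now dh0

torch.manual_seed(0)
N, T, D, H = 2, 3, 4, 5
kw = dict(dtype=torch.float64, requires_grad=True)
x, h0 = torch.randn(N, T, D, **kw), torch.randn(N, H, **kw)
Wx, Wh, b = torch.randn(D, H, **kw), torch.randn(H, H, **kw), torch.randn(H, **kw)

h = forward(x, Wx, h0, Wh, b)
dh = torch.randn(N, T, H, dtype=torch.float64)        # pretend upstream gradients
auto = torch.autograd.grad(h, (x, h0, Wx, Wh, b), grad_outputs=dh)

with torch.no_grad():
    manual = backward(dh, x, Wx, h0, Wh, h)

for name, m, a in zip(["dx", "dh0", "dWx", "dWh", "db"], manual, auto):
    print(f"{name}: max abs difference {(m - a).abs().max().item():.2e}")
```

If the manual formulas are correct, all five differences should sit at the level of floating-point rounding.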

# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Step 7: 🏁 Conclusion.</div>

In this notebook, we opened up the **black box** of Recurrent Neural Networks and built a full **vanilla RNN** from scratch, covering both the **forward pass** and **backpropagation through time (BPTT)**.

By implementing every step manually using only basic tensor operations, you learned:

- How an RNN computes new hidden states across timesteps  
- How gradients flow backward through time  
- How to compute gradients w.r.t inputs, states, weights, and biases  
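- Why vanishing/exploding gradients naturally occur in RNNs: each backward step multiplies the gradient by `(1 - h_t^2)` and by `Wh^T`, so the product can shrink or grow over many timesteps  
- Which chain-rule computations deep-learning frameworks (PyTorch, TensorFlow) automate when they compute RNN gradients  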

Understanding these internals makes you a stronger ML practitioner because you now know **what is happening behind the scenes**, not just how to call `nn.RNN`.

This knowledge prepares you for learning more advanced sequence models such as **LSTM** and **GRU**, which keep the same recurrence but add gating and memory mechanisms, and **Transformers**, which replace recurrence with attention.

If you've reached this point, congratulations 🎉  
You now understand RNNs at a deeper level than API usage alone provides.

Happy learning, and keep experimenting! 🚀


# <a id="Import"></a><div style="background: linear-gradient(to right, #1b5e20, #2e7d32, #388e3c, #43a047, #4caf50); font-family: 'Times New Roman', serif; font-size: 28px; font-weight: bold; text-align: center; border-radius: 15px; padding: 15px; border: 2px solid #ffffff; box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2); -webkit-background-clip: text; -webkit-text-fill-color: transparent;">Thanks & Upvote ❤️</div>