# 🔁 Backpropagation Through Time (BPTT) – In-Depth Explanation

---

## 📌 What Happens During BPTT?

When training an RNN, we perform **forward propagation through time** to calculate the output `ŷ` and compute the **loss**.

Then, to improve the model, we perform **Backpropagation Through Time (BPTT)**: we compute the gradients of the loss **with respect to all parameters** and update the parameters using gradient descent.

---

## 🎯 What Parameters Get Updated?

During BPTT, we update the following:

- **W** → input-to-hidden weights  
- **Wʰ** → hidden-to-hidden weights (shared across time)  
- **V** → hidden-to-output weights  
- **b** → bias for hidden layer  
- **b_output** → bias for output layer

---

## 🧮 Gradient Flow using Chain Rule

Let’s say we process an input sequence across 3 time steps:
- t = 1 → `x₁`
- t = 2 → `x₂`
- t = 3 → `x₃`

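And we compute (where `f` is the hidden-layer activation, e.g. `tanh`, and the initial hidden state is taken to be zero, which is why `h₁` has no `Wʰ` term):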
- `h₁ = f(x₁ ⋅ W + b)`
- `h₂ = f(x₂ ⋅ W + h₁ ⋅ Wʰ + b)`
- `h₃ = f(x₃ ⋅ W + h₂ ⋅ Wʰ + b)`
- `ŷ = softmax(h₃ ⋅ V + b_output)`
- `L = loss(ŷ, y)`
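
The forward equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes `f = tanh`, a cross-entropy loss, a zero initial hidden state, and row-vector inputs. The layer sizes, the random data, and the helper names `softmax` and `forward` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(xs, W, Wh, V, b, b_output):
    """Forward propagation through time; returns [h1, h2, ...] and y_hat."""
    h = np.zeros(Wh.shape[0])          # assumed h0 = 0, so h1 = f(x1·W + b)
    hs = []
    for x in xs:                       # one iteration per time step
        h = np.tanh(x @ W + h @ Wh + b)
        hs.append(h)
    y_hat = softmax(hs[-1] @ V + b_output)
    return hs, y_hat

# Hypothetical sizes and random data, only for illustration.
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3
W = rng.normal(scale=0.1, size=(d_in, d_h))    # input-to-hidden
Wh = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden
V = rng.normal(scale=0.1, size=(d_h, d_out))   # hidden-to-output
b, b_output = np.zeros(d_h), np.zeros(d_out)

xs = rng.normal(size=(3, d_in))        # x1, x2, x3
y = np.array([0.0, 1.0, 0.0])          # one-hot target

hs, y_hat = forward(xs, W, Wh, V, b, b_output)
L = -np.sum(y * np.log(y_hat))         # cross-entropy loss (assumed)
print(len(hs), y_hat.shape, float(L))
```

Keeping every `hₜ` in a list matters because the backward pass needs each hidden state again when it applies the chain rule.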

---

    Time steps →          t = 1              t = 2              t = 3

    Input xₜ →           [ x₁ ]             [ x₂ ]             [ x₃ ]
                     (input vector)     (input vector)     (input vector)
                            │                  │                  │
                           [W]                [W]                [W]
                            │                  │                  │
                         ┌────┐             ┌────┐             ┌────┐
    Hidden state hₜ →    │ h₁ │ ───[Wʰ]───→ │ h₂ │ ───[Wʰ]───→ │ h₃ │
                         └────┘             └────┘             └────┘
                                                                  │
                                                                 [V]
                                                                  │
                                                         Output ŷ (softmax)

---
## 🔧 Step 1: Update Output Layer Weights (V)

`V` is updated with the gradient-descent rule:

`V_new = V_old - η ⋅ ∂L/∂V_old`

where `η` is the **learning rate**:
- It controls **how big a step** we take against the gradient
- A small `η` (e.g., 0.0001) means slow learning; a large `η` can lead to overshooting
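
As a rough sketch of this step, the snippet below computes `∂L/∂V` and applies one update. It assumes the loss is cross-entropy on a softmax output, in which case the gradient with respect to the output scores is `ŷ - y`. The vector `h3`, the sizes, and the value of `eta` are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_fn(h3, V, b_output, y):
    """Cross-entropy loss of the output layer for a fixed final hidden state."""
    return -np.sum(y * np.log(softmax(h3 @ V + b_output)))

# Hypothetical final hidden state, weights, and one-hot target.
rng = np.random.default_rng(0)
h3 = np.tanh(rng.normal(size=5))
V = rng.normal(scale=0.1, size=(5, 3))
b_output = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])
eta = 0.1                               # learning rate

y_hat = softmax(h3 @ V + b_output)
dz = y_hat - y                          # dL/d(scores) for softmax + cross-entropy
dV = np.outer(h3, dz)                   # dL/dV, same shape as V
db_output = dz                          # dL/db_output

loss_before = loss_fn(h3, V, b_output, y)
V_new = V - eta * dV                    # V_new = V_old - eta * dL/dV_old
b_output_new = b_output - eta * db_output
loss_after = loss_fn(h3, V_new, b_output_new, y)
print(loss_before, loss_after)
```

`V` is only used once, at the last time step, so its gradient needs no summation over time.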


---

## 🔧 Step 2: Update Hidden-to-Hidden Weights (Wʰ)

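`Wʰ` is **reused at every time step**, so its gradient is the **sum of the contributions from each time step**. In the 3-step example, `Wʰ` is used to compute `h₂` and `h₃`, so `∂L/∂Wʰ` has one term for each of those uses.

The summed gradient then goes into the usual update:

`Wʰ_new = Wʰ_old - η ⋅ ∂L/∂Wʰ_old`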
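
To make the per-step sum concrete, here is a small sketch that stores each time step's contribution to `∂L/∂Wʰ` separately and then adds them. It assumes `f = tanh`, softmax with cross-entropy, and `h₀ = 0`; the sizes and data are hypothetical, and the names `delta` and `contributions` are introduced only for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3
W = rng.normal(scale=0.5, size=(d_in, d_h))
Wh = rng.normal(scale=0.5, size=(d_h, d_h))
V = rng.normal(scale=0.5, size=(d_h, d_out))
b, b_output = np.zeros(d_h), np.zeros(d_out)
xs = rng.normal(size=(3, d_in))
y = np.array([0.0, 1.0, 0.0])

# Forward pass; hs[0] is h0 = 0 so that hs[t] is the state at time step t.
hs = [np.zeros(d_h)]
for x in xs:
    hs.append(np.tanh(x @ W + hs[-1] @ Wh + b))
y_hat = softmax(hs[3] @ V + b_output)

# Backward pass: walk from t = 3 down to t = 1.
dh = (y_hat - y) @ V.T                  # dL/dh3
contributions = {}
for t in (3, 2, 1):
    delta = dh * (1.0 - hs[t] ** 2)                 # back through tanh
    contributions[t] = np.outer(hs[t - 1], delta)   # step t's share of dL/dWh
    dh = delta @ Wh.T                               # pass the gradient on to h_{t-1}

dWh = sum(contributions.values())       # total gradient = sum over time steps
print({t: float(np.abs(c).sum()) for t, c in contributions.items()})
```

The `t = 1` entry is all zeros here because `h₀ = 0`, which matches the fact that `h₁` has no `Wʰ` term.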


---

## 🔧 Step 3: Update Input-to-Hidden Weights (W)

Like `Wʰ`, `W` is shared across time steps, so `∂L/∂W` sums the contributions from `x₁`, `x₂`, and `x₃`. The update is:

`W_new = W_old - η ⋅ ∂L/∂W_old`
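
Written out with the chain rule for the 3-step example, the gradient for `W` might look like the following. The symbols `a_t` (pre-activation) and `δ_t` (gradient of the loss with respect to `a_t`) are notation introduced here, not part of the original text, and the row-vector convention with `h₀ = 0` is an assumption.

$$
\begin{aligned}
a_t &= x_t W + h_{t-1} W^{h} + b, \qquad h_t = f(a_t), \qquad h_0 = 0 \\
\delta_3 &= \frac{\partial L}{\partial h_3} \odot f'(a_3) \\
\delta_t &= \left( \delta_{t+1} \, (W^{h})^{\top} \right) \odot f'(a_t), \qquad t = 2, 1 \\
\frac{\partial L}{\partial W} &= \sum_{t=1}^{3} x_t^{\top} \delta_t
\end{aligned}
$$

The same `δ_t` terms give the other shared gradients: replacing `x_t` with `h_{t-1}` yields `∂L/∂Wʰ`, and summing the `δ_t` alone yields `∂L/∂b`.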

---

## ✅ Summary of BPTT Steps

1. Perform **forward pass** to compute all `hₜ`, final output `ŷ`, and loss `L`
2. Compute gradients of `L` w.r.t.:
   - Output layer weights `V`
   - Hidden-to-hidden weights `Wʰ` (shared across time)
   - Input-to-hidden weights `W`
   - Biases
3. Use **chain rule** to propagate loss **back through time**
4. Update all weights using **gradient descent**
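
The four steps above can be put together in one function. This is a sketch under stated assumptions (`tanh` hidden activation, softmax with cross-entropy, `h₀ = 0`, plain gradient descent); `bptt_step` and the sizes used below are hypothetical names and values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def bptt_step(params, xs, y, eta=0.1):
    """One forward pass, one backward pass, and one gradient-descent update."""
    W, Wh, V, b, b_output = params
    # 1. Forward pass (hs[0] is h0 = 0).
    hs = [np.zeros(Wh.shape[0])]
    for x in xs:
        hs.append(np.tanh(x @ W + hs[-1] @ Wh + b))
    y_hat = softmax(hs[-1] @ V + b_output)
    loss = -np.sum(y * np.log(y_hat))
    # 2. Output-layer gradients.
    dz = y_hat - y
    dV, db_output = np.outer(hs[-1], dz), dz
    # 3. Chain rule back through time for the shared parameters.
    dW, dWh, db = np.zeros_like(W), np.zeros_like(Wh), np.zeros_like(b)
    dh = dz @ V.T
    for t in range(len(xs), 0, -1):
        delta = dh * (1.0 - hs[t] ** 2)     # back through tanh at step t
        dW += np.outer(xs[t - 1], delta)
        dWh += np.outer(hs[t - 1], delta)
        db += delta
        dh = delta @ Wh.T                   # gradient flowing into h_{t-1}
    # 4. Gradient-descent update of every parameter.
    grads = (dW, dWh, dV, db, db_output)
    new_params = tuple(p - eta * g for p, g in zip(params, grads))
    return new_params, loss, grads

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3
params = (
    rng.normal(scale=0.1, size=(d_in, d_h)),    # W
    rng.normal(scale=0.1, size=(d_h, d_h)),     # Wh
    rng.normal(scale=0.1, size=(d_h, d_out)),   # V
    np.zeros(d_h),                              # b
    np.zeros(d_out),                            # b_output
)
xs = rng.normal(size=(3, d_in))
y = np.array([0.0, 1.0, 0.0])

new_params, loss_before, grads = bptt_step(params, xs, y)
_, loss_after, _ = bptt_step(new_params, xs, y)
print(loss_before, loss_after)
```

Comparing the analytic gradients against finite differences is a common way to check a hand-written backward pass like this one.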

---

## 🔁 Reminder: This Happens for Every Sequence During Training

The above gradient flow happens for **each training sample (sequence)** during every epoch.

Weights get updated → Output improves → Loss reduces → Model learns.
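
A toy training loop illustrating this cycle is sketched below. Everything about the setup is an assumption made for the example: the synthetic dataset, the labelling rule, the layer sizes, the learning rate, the number of epochs, and the helper name `train_on_sequence`. It uses `tanh`, softmax with cross-entropy, and one update per sequence.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def train_on_sequence(p, xs, y, eta):
    """Forward pass, BPTT, and an in-place gradient-descent update for one sequence."""
    hs = [np.zeros(p["Wh"].shape[0])]               # h0 = 0
    for x in xs:
        hs.append(np.tanh(x @ p["W"] + hs[-1] @ p["Wh"] + p["b"]))
    y_hat = softmax(hs[-1] @ p["V"] + p["b_output"])
    loss = -np.sum(y * np.log(y_hat))

    dz = y_hat - y
    g = {k: np.zeros_like(v) for k, v in p.items()}
    g["V"], g["b_output"] = np.outer(hs[-1], dz), dz
    dh = dz @ p["V"].T
    for t in range(len(xs), 0, -1):                 # back through time
        delta = dh * (1.0 - hs[t] ** 2)
        g["W"] += np.outer(xs[t - 1], delta)
        g["Wh"] += np.outer(hs[t - 1], delta)
        g["b"] += delta
        dh = delta @ p["Wh"].T
    for k in p:                                     # weights get updated
        p[k] -= eta * g[k]
    return loss

rng = np.random.default_rng(0)
d_in, d_h, d_out = 2, 8, 2
p = {
    "W": rng.normal(scale=0.1, size=(d_in, d_h)),
    "Wh": rng.normal(scale=0.1, size=(d_h, d_h)),
    "V": rng.normal(scale=0.1, size=(d_h, d_out)),
    "b": np.zeros(d_h),
    "b_output": np.zeros(d_out),
}

# Toy dataset: 20 sequences of 3 steps. The label says whether the first
# feature, summed over the 3 time steps, is positive.
X = rng.normal(size=(20, 3, d_in))
Y = np.eye(d_out)[(X[:, :, 0].sum(axis=1) > 0).astype(int)]

history = []
for epoch in range(100):
    losses = [train_on_sequence(p, xs, y, eta=0.1) for xs, y in zip(X, Y)]
    history.append(float(np.mean(losses)))          # mean loss for this epoch

print(history[0], history[-1])
```

Printing the first and last entries of `history` gives a quick view of whether the mean loss per epoch went down over training.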