# 🔄 Backpropagation Through Time (BPTT) – RNN Training Explained

---

## 📌 What is Backpropagation Through Time?

In a Recurrent Neural Network (RNN), we process input **step by step**, using hidden states to carry forward **contextual memory**.  
After computing the final output and **loss**, the model must **update its parameters** to improve performance — this happens through **Backpropagation Through Time (BPTT)**.

BPTT is an extension of backpropagation used for **sequential models**, where **gradients flow backward not just through layers**, but also **across time steps**.

---

## 🧠 Example Recap: Sentence "I like NLP"

Let’s revisit our toy sentence and its one-hot encoded inputs:

**Sentence:**  
"I like NLP"

**Time steps:**
- t=1 → x₁₁ = "I"
- t=2 → x₁₂ = "like"
- t=3 → x₁₃ = "NLP"
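
The one-hot inputs for these time steps can be sketched as follows. This is a minimal illustration that assumes a vocabulary containing only the three words of the sentence; the names `vocab`, `word_to_index`, and `one_hot` are hypothetical and do not come from the original example.

```python
import numpy as np

# Assumed vocabulary: just the three words of the toy sentence.
vocab = ["I", "like", "NLP"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a 1 at the word's vocabulary index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# One row per time step: x₁₁, x₁₂, x₁₃ in the notation above.
inputs = np.stack([one_hot(w) for w in "I like NLP".split()])
print(inputs)
```

With this vocabulary order, each word maps to a distinct row, so the stacked inputs form a 3×3 identity matrix.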

---

## ⏱ Forward Pass (from before)

At each time step, the RNN computes the hidden state:

- h₁ = f(x₁₁ ⋅ W + b)  ← (the initial hidden state h₀ is taken to be zero, so the h₀ ⋅ Wʰ term drops out)  
- h₂ = f(x₁₂ ⋅ W + h₁ ⋅ Wʰ + b)  
- h₃ = f(x₁₃ ⋅ W + h₂ ⋅ Wʰ + b)

Final output:
- ŷ = softmax(h₃ ⋅ V + b_output)

Loss:
- L = −Σᵢ yᵢ log(ŷᵢ)  ← (cross-entropy between the predicted distribution ŷ and the actual target y; its gradient w.r.t. the pre-softmax scores is ŷ - y)
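
The forward pass above can be sketched in NumPy. This is a rough illustration under several assumptions that the text does not fix: `f` is taken to be `tanh`, the initial hidden state is zero, the loss is cross-entropy, and the layer sizes, random initialization, and target `y` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_classes = 3, 4, 2  # assumed sizes

# Parameters, named after the symbols in the text.
W = rng.normal(scale=0.5, size=(vocab_size, hidden_size))    # input -> hidden
Wh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b = np.zeros(hidden_size)
V = rng.normal(scale=0.5, size=(hidden_size, num_classes))   # hidden -> output
b_output = np.zeros(num_classes)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(xs, y):
    """Run the RNN over all time steps; return (hidden states, y_hat, loss)."""
    h = np.zeros(hidden_size)  # h0, assumed to be all zeros
    hs = [h]
    for x in xs:
        h = np.tanh(x @ W + h @ Wh + b)  # f is assumed to be tanh
        hs.append(h)
    y_hat = softmax(h @ V + b_output)
    loss = -np.sum(y * np.log(y_hat))    # cross-entropy (assumed loss)
    return hs, y_hat, loss

xs = np.eye(vocab_size)    # one-hot rows for "I", "like", "NLP"
y = np.array([1.0, 0.0])   # hypothetical one-hot target
hs, y_hat, loss = forward(xs, y)
print("y_hat:", y_hat, "loss:", loss)
```

The list `hs` keeps every hidden state, including h₀, because the backward pass needs them again when gradients flow back through time.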

---

## 🔁 Backpropagation Through Time Begins

BPTT computes **gradients of the loss with respect to all weights**, including those used **at every time step**.

---

### 🎯 Gradient Flow Overview

We compute partial derivatives and update:

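1. **dL/dŷ** ← gradient of the loss w.r.t. the output; for softmax with cross-entropy, the combined gradient w.r.t. the pre-softmax scores is ŷ - y  
2. **Backprop through output layer**:  
   - Update `V` and `b_output`  
   - Formula (α is the learning rate):  
     - ∂L/∂V = h₃ᵀ × (ŷ - y)  
     - ∂L/∂b_output = ŷ - y  
     - V ← V - α × ∂L/∂V  
     - b_output ← b_output - α × ∂L/∂b_output  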

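3. **Backprop through hidden states** (from h₃ → h₂ → h₁):
   - Pass the gradient from each hidden state back to the previous one through `Wʰ`
   - Sum the per-step gradients for `Wʰ` (shared at all time steps), then update it
   - Sum the per-step gradients for `W` (input-to-hidden for x₁₁, x₁₂, x₁₃), then update it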
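
The three steps above can be sketched end to end in NumPy. As before, this is only an illustration under stated assumptions: `f` is `tanh`, h₀ is zero, the loss is cross-entropy, and the sizes, initialization, target `y`, and learning rate `alpha` are invented for the example. The `params` dictionary and the `forward` and `bptt` helpers are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2  # assumed sizes
params = {
    "W": rng.normal(scale=0.5, size=(n_in, n_hid)),    # input -> hidden
    "Wh": rng.normal(scale=0.5, size=(n_hid, n_hid)),  # hidden -> hidden
    "b": np.zeros(n_hid),
    "V": rng.normal(scale=0.5, size=(n_hid, n_out)),   # hidden -> output
    "b_output": np.zeros(n_out),
}

def forward(p, xs, y):
    hs = [np.zeros(n_hid)]  # h0 assumed zero
    for x in xs:
        hs.append(np.tanh(x @ p["W"] + hs[-1] @ p["Wh"] + p["b"]))
    o = hs[-1] @ p["V"] + p["b_output"]
    e = np.exp(o - o.max())
    y_hat = e / e.sum()
    loss = -np.sum(y * np.log(y_hat))  # cross-entropy (assumed loss)
    return hs, y_hat, loss

def bptt(p, xs, y):
    hs, y_hat, loss = forward(p, xs, y)
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    d_o = y_hat - y                      # step 1: gradient at the output scores
    grads["V"] = np.outer(hs[-1], d_o)   # step 2: h3^T x (y_hat - y)
    grads["b_output"] = d_o
    d_h = d_o @ p["V"].T                 # gradient arriving at h3
    for t in range(len(xs), 0, -1):      # step 3: t = 3, 2, 1
        d_a = d_h * (1.0 - hs[t] ** 2)   # back through tanh
        grads["W"] += np.outer(xs[t - 1], d_a)   # accumulate: W is shared
        grads["Wh"] += np.outer(hs[t - 1], d_a)  # accumulate: Wh is shared
        grads["b"] += d_a
        d_h = d_a @ p["Wh"].T            # hand the gradient to h_{t-1}
    return grads, loss

xs = np.eye(n_in)          # one-hot rows for "I", "like", "NLP"
y = np.array([1.0, 0.0])   # hypothetical one-hot target
grads, loss_before = bptt(params, xs, y)

alpha = 0.1                # assumed learning rate
for name in params:
    params[name] -= alpha * grads[name]
_, _, loss_after = forward(params, xs, y)
print(loss_before, "->", loss_after)
```

The `+=` inside the loop is the key difference from ordinary backpropagation: because `W`, `Wʰ`, and `b` are reused at every time step, each step contributes its own term to the same gradient before a single update is applied.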

---

## 🔁 Parameter Update – Chain Rule and Weight Flow

At each time step (starting from the last):

### 📌 Update `V` (Hidden → Output)
- `ŷ = softmax(o₃)`, where `o₃ = h₃ ⋅ V + b_output`
- Update rule:  
