## 🔁 RNN Forward Propagation Through Time – Conceptual Walkthrough

A **Recurrent Neural Network (RNN)** processes sequential data one step at a time, carrying a hidden state that summarizes the inputs seen so far. This suits NLP tasks where **word order and context matter**, such as text generation, translation, and classification.

---

### 🧠 Example: Three Simple Sentences

We’ll build a vocabulary and simulate RNN forward propagation using the following toy dataset:

Sentence 1: I like NLP  
Sentence 2: You study hard  
Sentence 3: We write code

---

### 📚 Vocabulary

["I", "like", "NLP", "You", "study", "hard", "We", "write", "code"]

Vocabulary size = **9** (Each word is unique)  
We’ll represent each word using **One-Hot Encoding**.

---

### 🔤 One-Hot Encoded Vectors for Sentence 1: "I like NLP"

| Word   | One-Hot Vector             |
|--------|-----------------------------|
| I      | [1 0 0 0 0 0 0 0 0]         |
| like   | [0 1 0 0 0 0 0 0 0]         |
| NLP    | [0 0 1 0 0 0 0 0 0]         |
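
A minimal NumPy sketch of the encoding in the table, assuming a word's index is its position in the vocabulary list above. The names `word_to_index` and `one_hot` are illustrative, not part of any library:

```python
import numpy as np

# Vocabulary in the order listed above; a word's index is its position.
vocab = ["I", "like", "NLP", "You", "study", "hard", "We", "write", "code"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-9 vector with a 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

sentence = ["I", "like", "NLP"]
X = np.stack([one_hot(w) for w in sentence])  # shape (3, 9): one row per time step
print(X)
```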

---

## ⏱ Forward Propagation Through Time (Step-by-Step)

We feed **one word at a time** into the RNN.

Let’s denote:  
- `x11`, `x12`, `x13` as the word vectors at time steps t = 1, 2, 3  
- `w` as the **input-to-hidden** weights  
- `w'` as the **hidden-to-hidden** weights  
- `b` as the bias for hidden state  
- `f()` as an activation function (like tanh or ReLU)  
- `h1`, `h2`, `h3` as the hidden states  
- `v` as the **hidden-to-output** weights and `b_output` as the output bias
- `ŷ` as the final prediction
- `L` as the loss

---

               ┌─────────────┐
    x11 ─────> │  RNN Cell   │
               └──────┬──────┘
                      ↓
                      h1
                      ↓
               ┌──────┴──────┐
    x12 ─────> │  RNN Cell   │
               └──────┬──────┘
                      ↓
                      h2
                      ↓
               ┌──────┴──────┐
    x13 ─────> │  RNN Cell   │
               └──────┬──────┘
                      ↓
                      h3 ─────> ŷ (Prediction)

---

### 🧮 Forward Pass Equations

**Time Step 1 (t = 1)**  
Input: x11 (word "I")

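h1 = f(x11 ⋅ w + b)

There is no earlier hidden state at t = 1. Equivalently, the initial state h0 is taken to be a zero vector, so the `h0 ⋅ w'` term drops out.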

---

**Time Step 2 (t = 2)**  
Input: x12 (word "like")

h2 = f(x12 ⋅ w + h1 ⋅ w' + b)

---

**Time Step 3 (t = 3)**  
Input: x13 (word "NLP")

h3 = f(x13 ⋅ w + h2 ⋅ w' + b)
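
The three steps above are the same update applied in a loop. Here is a rough NumPy sketch, assuming `f` is tanh, a hidden size of 5, a zero initial hidden state, and small random weights (the values are placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
vocab_size, hidden_size = 9, 5   # hidden size is an assumption for illustration

# Randomly initialised parameters (hypothetical values, not trained).
w = rng.normal(scale=0.1, size=(vocab_size, hidden_size))         # input-to-hidden
w_prime = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
b = np.zeros(hidden_size)                                         # hidden bias

# One-hot rows for "I", "like", "NLP" (vocabulary indices 0, 1, 2).
X = np.eye(vocab_size)[[0, 1, 2]]

h = np.zeros(hidden_size)        # h0: no history before the first word
hidden_states = []
for x_t in X:
    # The same w, w' and b are reused at every time step.
    h = np.tanh(x_t @ w + h @ w_prime + b)
    hidden_states.append(h)

h1, h2, h3 = hidden_states
print(h3)
```

Because each input is one-hot, `x_t @ w` just selects one row of `w`, so with a zero initial state `h1` equals `tanh(w[0] + b)`.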

---

### 📤 Output and Loss Computation

ŷ = softmax(h3 ⋅ v + b_output)

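L = −Σᵢ yᵢ log(ŷᵢ)

Here `y` is the one-hot true label and the sum runs over the output classes. This is cross-entropy, the usual loss for a softmax output.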
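
To make the output step concrete, here is a small sketch assuming 3 output classes, cross-entropy as the loss (a common pairing with softmax), and a made-up target label. `h3` below is a random stand-in for the final hidden state, not a value computed from the sentences:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

rng = np.random.default_rng(1)
hidden_size, output_size = 5, 3  # assumed sizes

h3 = np.tanh(rng.normal(size=hidden_size))                  # stand-in hidden state
v = rng.normal(scale=0.1, size=(hidden_size, output_size))  # hidden-to-output weights
b_output = np.zeros(output_size)

y_hat = softmax(h3 @ v + b_output)   # probabilities over the 3 classes
y = np.array([0, 1, 0])              # hypothetical one-hot target
loss = cross_entropy(y, y_hat)
print(y_hat, loss)
```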

---

### 🔁 Backpropagation Through Time (BPTT)

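- The gradient of the loss is propagated backward through all time steps, from t = 3 to t = 1
- Gradients are calculated for all trainable parameters (`w`, `w'`, `v`, `b`, `b_output`); because `w`, `w'`, and `b` are shared across time steps, their gradients are summed over the steps
- Parameters are updated using gradient descent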
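
The bullets above can be sketched in NumPy for the three-step example. This assumes tanh activation, a softmax output with cross-entropy loss, a zero initial hidden state, and a hypothetical target class and learning rate; the helper names (`forward`, `loss_fn`, `bptt`) are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(params, X):
    w, w_prime, v, b, b_output = params
    hs = [np.zeros(b.shape[0])]              # hs[0] is h0 (all zeros)
    for x_t in X:
        hs.append(np.tanh(x_t @ w + hs[-1] @ w_prime + b))
    y_hat = softmax(hs[-1] @ v + b_output)
    return hs, y_hat

def loss_fn(params, X, y):
    _, y_hat = forward(params, X)
    return -np.sum(y * np.log(y_hat))        # cross-entropy

def bptt(params, X, y):
    w, w_prime, v, b, b_output = params
    hs, y_hat = forward(params, X)
    d_logits = y_hat - y                     # gradient of softmax + cross-entropy
    grads = [np.zeros_like(p) for p in params]
    grads[2] = np.outer(hs[-1], d_logits)    # dL/dv
    grads[4] = d_logits                      # dL/db_output
    dh = v @ d_logits                        # gradient flowing into h3
    for t in range(len(X), 0, -1):           # walk back: t = 3, 2, 1
        da = dh * (1.0 - hs[t] ** 2)         # through tanh
        grads[0] += np.outer(X[t - 1], da)   # dL/dw, summed over steps
        grads[1] += np.outer(hs[t - 1], da)  # dL/dw', summed over steps
        grads[3] += da                       # dL/db, summed over steps
        dh = w_prime @ da                    # pass gradient to the previous step
    return grads

rng = np.random.default_rng(0)
shapes = [(9, 5), (5, 5), (5, 3), (5,), (3,)]  # w, w', v, b, b_output
params = [rng.normal(scale=0.1, size=s) for s in shapes]
X = np.eye(9)[[0, 1, 2]]                     # "I like NLP" as one-hot rows
y = np.array([0.0, 1.0, 0.0])                # hypothetical target class

learning_rate = 0.1                          # assumed value
loss_before = loss_fn(params, X, y)
grads = bptt(params, X, y)
params = [p - learning_rate * g for p, g in zip(params, grads)]
loss_after = loss_fn(params, X, y)
print(loss_before, loss_after)
```

A single gradient-descent step like this should lower the loss slightly; real training repeats it over many examples.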

---

### 🔢 Parameter Summary Example

Assume:  
- Input size (vocabulary size) = 9  
- Hidden size = 5  
- Output size = 3  

| Component           | Shape         | Parameters |
|---------------------|---------------|------------|
| Input → Hidden      | 9 × 5         | 45         |
| Hidden → Hidden     | 5 × 5         | 25         |
| Hidden → Output     | 5 × 3         | 15         |
| Biases              | 5 (hidden) + 3 (output) | 8 |
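
The counts in the table follow directly from the layer sizes, which can be tallied with a few lines of Python (variable names are illustrative):

```python
input_size, hidden_size, output_size = 9, 5, 3

param_counts = {
    "input_to_hidden": input_size * hidden_size,    # w
    "hidden_to_hidden": hidden_size * hidden_size,  # w'
    "hidden_to_output": hidden_size * output_size,  # v
    "biases": hidden_size + output_size,            # b and b_output
}
total = sum(param_counts.values())
print(param_counts, total)
```

Sequence length does not appear in the count, because the same weights are reused at every time step.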

*Total Trainable Parameters = 45 + 25 + 15 + 8 = 93*

---