# RNN Forward Propagation, Loss, and Backpropagation Formulae

## Complete Mathematical Derivation with Matrix Dimensions

---

## Forward Pass

### Hidden State Computation

```
Matrix Dimensions: hdx1  hdxhd   hdx1     hdxF   Fx1   hdx1
```

$$z_t = W_{hh} \odot h_{t-1} + W_{xh} \odot x_t + b_h$$

$$h_t = \tanh(z_t)$$
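The hidden-state update can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `rnn_hidden_step`, the sizes `hd = 4` and `F = 3`, and the random initialization are all assumptions made for the example.

```python
import numpy as np

def rnn_hidden_step(W_hh, W_xh, b_h, h_prev, x_t):
    """One hidden-state update with column vectors."""
    z_t = W_hh @ h_prev + W_xh @ x_t + b_h   # (hd, hd)@(hd, 1) + (hd, F)@(F, 1) + (hd, 1)
    h_t = np.tanh(z_t)                        # elementwise tanh, shape (hd, 1)
    return z_t, h_t

# Illustrative sizes and values (assumptions, not from the text).
rng = np.random.default_rng(0)
hd, F = 4, 3
W_hh = rng.normal(scale=0.1, size=(hd, hd))
W_xh = rng.normal(scale=0.1, size=(hd, F))
b_h = np.zeros((hd, 1))
h_prev = np.zeros((hd, 1))
x_t = rng.normal(size=(F, 1))

z_t, h_t = rnn_hidden_step(W_hh, W_xh, b_h, h_prev, x_t)
print(z_t.shape, h_t.shape)  # (4, 1) (4, 1)
```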

### Output Computation

```
Matrix Dimensions: odx1    odxhd   hdx1   odx1
```

$$o_t = W_{hy} \odot h_t + b_y$$

$$y_t = \text{softmax}(o_t) = \frac{\exp(o_t)}{\sum_j \exp(o_t[j])}$$
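A possible NumPy sketch of the output computation is shown below. Subtracting $\max(o_t)$ before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the softmax value unchanged. The helper names and the sizes `od = 3`, `hd = 4` are assumptions for illustration.

```python
import numpy as np

def softmax(o):
    """Softmax over a column vector; the max shift only prevents overflow."""
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def rnn_output(W_hy, b_y, h_t):
    o_t = W_hy @ h_t + b_y    # (od, hd)@(hd, 1) + (od, 1) -> (od, 1)
    y_t = softmax(o_t)        # probabilities, shape (od, 1)
    return o_t, y_t

# Illustrative sizes and values (assumptions, not from the text).
rng = np.random.default_rng(1)
od, hd = 3, 4
W_hy = rng.normal(size=(od, hd))
b_y = np.zeros((od, 1))
h_t = np.tanh(rng.normal(size=(hd, 1)))

o_t, y_t = rnn_output(W_hy, b_y, h_t)
```

The entries of `y_t` are positive and sum to 1, so they can be read as class probabilities.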

### Forward Pass Summary

| Variable | Formula | Dimensions | Description |
|----------|---------|------------|-------------|
| $z_t$ | $W_{hh} \odot h_{t-1} + W_{xh} \odot x_t + b_h$ | $(hd \times 1)$ | Pre-activation hidden state |
| $h_t$ | $\tanh(z_t)$ | $(hd \times 1)$ | Hidden state (activated) |
| $o_t$ | $W_{hy} \odot h_t + b_y$ | $(od \times 1)$ | Output logits |
| $y_t$ | $\text{softmax}(o_t)$ | $(od \times 1)$ | Output probabilities |

**Notation:**
- $hd$ = hidden dimension
- $od$ = output dimension (number of classes)
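- $F$ = feature/input dimension
- $\odot$ = matrix multiplication (@), not the elementwise (Hadamard) product that this symbol often denotes elsewhere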

---

## Compute Loss

### Cross Entropy Loss

$$L_t = -\log(y_t[c])$$

where $c$ is the correct class.

**Note:** The general form is $L_t = -\sum_i \text{target}_t[i] \times \log(y_t[i])$. Because $\text{target}_t$ is one-hot ($1$ for the correct class $c$ and $0$ for every other class), all terms except $i = c$ vanish, which gives the equation above.
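As a quick check that the two forms of the loss agree for a one-hot target, here is a small sketch. The probability vector and the function names are made up for the example.

```python
import numpy as np

def cross_entropy(y_t, c):
    """Short form: negative log-probability of the correct class c."""
    return -np.log(y_t[c, 0])

def cross_entropy_full(y_t, target):
    """General form: sum over all classes, weighted by the target vector."""
    return -np.sum(target * np.log(y_t))

# Illustrative values (assumptions): three classes, correct class c = 0.
y_t = np.array([[0.7], [0.2], [0.1]])
c = 0
target = np.zeros_like(y_t)
target[c, 0] = 1.0            # one-hot target

loss_short = cross_entropy(y_t, c)
loss_full = cross_entropy_full(y_t, target)
```

Both values equal $-\log(0.7)$, since every other term in the sum is multiplied by zero.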

---

## Backpropagation

### Gradient w.r.t. Output Probabilities

```
Matrix Dimensions: ∂L_t/∂y_t has odx1
```

$$\frac{\partial L_t}{\partial y_t[i]} = \begin{cases} 
-\frac{1}{y_t[c]} & \text{if } i = c \text{ (correct class)} \\
0 & \text{if } i \neq c \text{ (other classes)}
\end{cases}$$

### Gradient w.r.t. Logits (Output Layer)

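$$\frac{\partial L_t}{\partial o_t[i]} = \sum_j \frac{\partial L_t}{\partial y_t[j]} \times \frac{\partial y_t[j]}{\partial o_t[i]}, \qquad \frac{\partial y_t[j]}{\partial o_t[i]} = y_t[j] \, (\delta_{ij} - y_t[i])$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise. Only the $j = c$ term of the sum is nonzero, so $\frac{\partial L_t}{\partial o_t[i]} = -\frac{1}{y_t[c]} \times y_t[c] \, (\delta_{ic} - y_t[i]) = y_t[i] - \delta_{ic}$. In vector form:

$$\frac{\partial L_t}{\partial o_t} = y_t - \text{one\_hot}(c)$$

**Componentwise form:**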

$$\frac{\partial L_t}{\partial o_t[i]} = \begin{cases} 
y_t[i] - 1 & \text{if } i = c \text{ (correct class)} \\
y_t[i] & \text{if } i \neq c \text{ (other classes)}
\end{cases}$$
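The result $y_t - \text{one\_hot}(c)$ can be checked numerically against central finite differences of the loss. The sketch below uses made-up logits and a hypothetical helper `loss_from_logits`; it is a sanity check, not part of the derivation.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def loss_from_logits(o, c):
    """Cross-entropy loss as a function of the logits."""
    return -np.log(softmax(o)[c, 0])

# Illustrative logits and correct class (assumptions).
o_t = np.array([[2.0], [-1.0], [0.5]])
c = 2

# Analytic gradient: y_t - one_hot(c).
analytic = softmax(o_t).copy()
analytic[c, 0] -= 1.0

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(o_t)
for i in range(o_t.shape[0]):
    d = np.zeros_like(o_t)
    d[i, 0] = eps
    numeric[i, 0] = (loss_from_logits(o_t + d, c) - loss_from_logits(o_t - d, c)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
```

The components of the analytic gradient sum to zero, because the probabilities sum to 1 and exactly one entry has 1 subtracted.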

### Gradients for Output Layer Weights

$$\frac{\partial L_t}{\partial W_{hy}} = \frac{\partial L_t}{\partial o_t} \times \frac{\partial o_t}{\partial W_{hy}}$$

$$\frac{\partial L_t}{\partial b_y} = \frac{\partial L_t}{\partial o_t} \times \frac{\partial o_t}{\partial b_y} = \frac{\partial L_t}{\partial o_t}$$

where:
- $\frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial y_t} \times \frac{\partial y_t}{\partial o_t} = y_t - \text{one\_hot}(c)$, from the previous section
- $\frac{\partial o_t[i]}{\partial W_{hy}[i,j]} = h_t[j]$, because $o_t[i] = \sum_j W_{hy}[i,j] \, h_t[j] + b_y[i]$; collecting all $(i, j)$ entries gives an outer product with $h_t^T$
- $\frac{\partial o_t[i]}{\partial b_y[i]} = 1$

**Final Results:**

$$\boxed{\frac{\partial L_t}{\partial W_{hy}} = (y_t - \text{one\_hot}(c)) \times h_t^T}$$

$$\boxed{\frac{\partial L_t}{\partial b_y} = (y_t - \text{one\_hot}(c))}$$
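In code, the output-layer gradients reduce to one subtraction and one outer product. The following sketch uses small hand-picked values so the result can be verified by hand; the function name is an assumption.

```python
import numpy as np

def output_layer_grads(y_t, c, h_t):
    """Gradients of L_t with respect to W_hy and b_y for a single timestep."""
    d_o = y_t.copy()
    d_o[c, 0] -= 1.0          # y_t - one_hot(c), shape (od, 1)
    dW_hy = d_o @ h_t.T       # outer product: (od, 1)@(1, hd) -> (od, hd)
    db_y = d_o                # shape (od, 1)
    return dW_hy, db_y

# Illustrative values (assumptions): od = 3, hd = 2, correct class c = 0.
y_t = np.array([[0.7], [0.2], [0.1]])
h_t = np.array([[0.5], [-0.5]])

dW_hy, db_y = output_layer_grads(y_t, 0, h_t)
# db_y  is [[-0.3], [0.2], [0.1]]
# dW_hy is [[-0.15, 0.15], [0.10, -0.10], [0.05, -0.05]]
```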

### Gradient w.r.t. Hidden State

$$\frac{\partial L_t}{\partial h_t} = \left(\frac{\partial o_t}{\partial h_t}\right)^T \times \frac{\partial L_t}{\partial o_t} = W_{hy}^T \times (y_t - \text{one\_hot}(c))$$

where:
- $\frac{\partial o_t}{\partial h_t} = W_{hy}$, the $(od \times hd)$ Jacobian of $o_t = W_{hy} \odot h_t + b_y$
- $\frac{\partial L_t}{\partial o_t} = \frac{\partial L_t}{\partial y_t} \times \frac{\partial y_t}{\partial o_t} = y_t - \text{one\_hot}(c)$
- Dimensions: $(hd \times od) \times (od \times 1) = (hd \times 1)$. With column vectors, $W_{hy}^T$ must multiply from the left; the product $(od \times 1) \times (hd \times od)$ is not defined.
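A short sketch of this step with column vectors is given below. Note the order of the product in the code: with $(od \times 1)$ and $(hd \times 1)$ column vectors, the shapes only line up when `W_hy.T` is applied from the left. The matrix and probabilities are made up so the result is easy to verify by hand.

```python
import numpy as np

def grad_hidden(W_hy, y_t, c):
    """Gradient of L_t with respect to h_t (contribution through o_t only)."""
    d_o = y_t.copy()
    d_o[c, 0] -= 1.0          # y_t - one_hot(c), shape (od, 1)
    return W_hy.T @ d_o       # (hd, od)@(od, 1) -> (hd, 1)

# Illustrative values (assumptions): od = 3, hd = 2, correct class c = 0.
W_hy = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
y_t = np.array([[0.7], [0.2], [0.1]])

d_h = grad_hidden(W_hy, y_t, 0)
# d_o = [-0.3, 0.2, 0.1], so d_h = [-0.3 + 0.1, 0.2 + 0.1] = [-0.2, 0.3]
```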

### Gradient w.r.t. Pre-activation (z_t)

$$\frac{\partial L_t}{\partial z_t} = \frac{\partial L_t}{\partial h_t} \circ \frac{\partial h_t}{\partial z_t}$$

where:
- $\frac{\partial h_t}{\partial z_t} = 1 - \tanh^2(z_t) = 1 - h_t^2$, the derivative of $\tanh$ applied to each component
- $\circ$ is the elementwise (Hadamard) product. Because $\tanh$ acts on each component separately, both factors are $(hd \times 1)$ vectors and are multiplied entry by entry.

Therefore:

$$\frac{\partial L_t}{\partial z_t} = \frac{\partial L_t}{\partial h_t} \circ (1 - h_t^2)$$

$$\frac{\partial L_t}{\partial z_t} = \left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)$$
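The multiplication by $1 - h_t^2$ acts on each component separately, which in NumPy is the `*` operator rather than `@`. The sketch below applies it to a made-up upstream gradient `d_h` and compares the $\tanh$ derivative with a finite-difference estimate; all values and names are assumptions for illustration.

```python
import numpy as np

def grad_preactivation(d_h, h_t):
    """Backpropagate through tanh: elementwise product with 1 - h_t^2."""
    return d_h * (1.0 - h_t ** 2)     # (hd, 1) * (hd, 1) -> (hd, 1)

# Illustrative values (assumptions).
z_t = np.array([[0.3], [-1.2], [2.0]])
h_t = np.tanh(z_t)
d_h = np.array([[1.0], [-2.0], [0.5]])   # stand-in for dL_t/dh_t

d_z = grad_preactivation(d_h, h_t)

# Finite-difference estimate of d tanh(z) / dz for comparison.
eps = 1e-6
tanh_prime_numeric = (np.tanh(z_t + eps) - np.tanh(z_t - eps)) / (2 * eps)
```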

### Gradients for Hidden Layer Weights

$$\frac{\partial L_t}{\partial W_{hh}} = \frac{\partial L_t}{\partial z_t} \times \frac{\partial z_t}{\partial W_{hh}}$$

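where:
- $\frac{\partial z_t}{\partial W_{hh}} = h_{t-1}^T$
- $\frac{\partial z_t}{\partial W_{xh}} = x_t^T$
- $\frac{\partial z_t}{\partial b_h} = 1$

These are single-step derivatives: $h_{t-1}$ is treated as a constant, so the dependence of earlier hidden states on $W_{hh}$, $W_{xh}$, and $b_h$ is not included.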

Therefore:

$$\frac{\partial L_t}{\partial W_{hh}} = \frac{\partial L_t}{\partial z_t} \times h_{t-1}^T$$

$$\frac{\partial L_t}{\partial W_{xh}} = \frac{\partial L_t}{\partial z_t} \times x_t^T$$

$$\frac{\partial L_t}{\partial b_h} = \frac{\partial L_t}{\partial z_t}$$

**Final Results:**

$$\boxed{\frac{\partial L_t}{\partial W_{hh}} = \left[\left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)\right] \times h_{t-1}^T}$$

$$\boxed{\frac{\partial L_t}{\partial W_{xh}} = \left[\left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)\right] \times x_t^T}$$

$$\boxed{\frac{\partial L_t}{\partial b_h} = \left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)}$$

Here $\circ$ denotes the elementwise (Hadamard) product of two $(hd \times 1)$ vectors.
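All five gradients can be checked together against finite differences. The sketch below is one way to do this for a single timestep, with $h_{t-1}$ held fixed as in the derivation above. The dictionary layout, function names, sizes, and random values are assumptions made for the example. The code uses column vectors, so $W_{hy}^T$ multiplies the output error from the left and the $\tanh$ derivative enters as an elementwise product.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def forward(p, h_prev, x_t):
    z_t = p["W_hh"] @ h_prev + p["W_xh"] @ x_t + p["b_h"]
    h_t = np.tanh(z_t)
    y_t = softmax(p["W_hy"] @ h_t + p["b_y"])
    return h_t, y_t

def loss(p, h_prev, x_t, c):
    _, y_t = forward(p, h_prev, x_t)
    return -np.log(y_t[c, 0])

def backward(p, h_prev, x_t, c):
    """Single-step analytic gradients; h_prev is treated as a constant."""
    h_t, y_t = forward(p, h_prev, x_t)
    d_o = y_t.copy()
    d_o[c, 0] -= 1.0                                  # y_t - one_hot(c)
    d_z = (p["W_hy"].T @ d_o) * (1.0 - h_t ** 2)      # elementwise tanh derivative
    return {"W_hy": d_o @ h_t.T, "b_y": d_o,
            "W_hh": d_z @ h_prev.T, "W_xh": d_z @ x_t.T, "b_h": d_z}

# Illustrative sizes and random values (assumptions).
rng = np.random.default_rng(0)
hd, F, od, c = 4, 3, 5, 2
p = {"W_hh": rng.normal(scale=0.5, size=(hd, hd)),
     "W_xh": rng.normal(scale=0.5, size=(hd, F)),
     "b_h": rng.normal(scale=0.5, size=(hd, 1)),
     "W_hy": rng.normal(scale=0.5, size=(od, hd)),
     "b_y": rng.normal(scale=0.5, size=(od, 1))}
h_prev = np.tanh(rng.normal(size=(hd, 1)))
x_t = rng.normal(size=(F, 1))

grads = backward(p, h_prev, x_t, c)

def numeric_grad(name, eps=1e-6):
    """Central finite differences, perturbing one parameter entry at a time."""
    g = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        loss_plus = loss(p, h_prev, x_t, c)
        p[name][idx] = old - eps
        loss_minus = loss(p, h_prev, x_t, c)
        p[name][idx] = old                            # restore the parameter
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

max_err = max(np.max(np.abs(grads[k] - numeric_grad(k))) for k in p)
```

A small `max_err` indicates that the analytic and numerical gradients agree for every parameter.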

---

## Summary

### Forward Pass Formulas

| Variable | Formula | Dimensions |
|----------|---------|------------|
| $z_t$ | $W_{hh} \odot h_{t-1} + W_{xh} \odot x_t + b_h$ | $(hd \times hd) \odot (hd \times 1) + (hd \times F) \odot (F \times 1) + (hd \times 1) = (hd \times 1)$ |
| $h_t$ | $\tanh(z_t)$ | $(hd \times 1)$ |
| $o_t$ | $W_{hy} \odot h_t + b_y$ | $(od \times hd) \odot (hd \times 1) + (od \times 1) = (od \times 1)$ |
| $y_t$ | $\text{softmax}(o_t)$ | $(od \times 1)$ |

### Backpropagation Gradient Formulas

| Parameter | Gradient Formula | Dimensions |
|-----------|------------------|------------|
| $W_{hy}$ | $(y_t - \text{one\_hot}(c)) \times h_t^T$ | $(od \times 1) \times (1 \times hd) = (od \times hd)$ |
| $b_y$ | $(y_t - \text{one\_hot}(c))$ | $(od \times 1)$ |
| $W_{hh}$ | $\left[\left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)\right] \times h_{t-1}^T$ | $(hd \times 1) \times (1 \times hd) = (hd \times hd)$ |
| $W_{xh}$ | $\left[\left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)\right] \times x_t^T$ | $(hd \times 1) \times (1 \times F) = (hd \times F)$ |
| $b_h$ | $\left(W_{hy}^T \times (y_t - \text{one\_hot}(c))\right) \circ (1 - h_t^2)$ | $(hd \times od) \times (od \times 1) \circ (hd \times 1) = (hd \times 1)$ |

**Notation:**
- $hd$ = hidden dimension
- $od$ = output dimension (number of classes)
- $F$ = feature/input dimension
- $\odot$ = matrix multiplication (@)
- $\circ$ = elementwise (Hadamard) product