### **Backpropagation Through Time (BPTT) in RNNs**

| **Aspect**               | **Details**                                                                                                                                                                                                                                                                                                                                                                        |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Definition**           | Backpropagation Through Time (BPTT) is the training algorithm for RNNs where errors are propagated **backward through each time step** to update network weights. It is an extension of standard backpropagation adapted to sequential data.                                                                                                                                       |
| **Concept**              | - The RNN is **unrolled** across all time steps $T$.<br>- Forward propagation calculates hidden states $h_t$ and outputs $y_t$.<br>- The total loss $L$ is computed as the sum of losses across all time steps.<br>- Gradients are computed by applying the chain rule **through time**, moving from the last time step $T$ back to the first.                                     |
| **Loss Function**        | For a sequence of length $T$, the total loss is:<br> $L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t)$, where $y_t$ is the network output and $\hat{y}_t$ the target at step $t$. |
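| **Gradient Calculation** | - The gradient at time $t$ depends on the error at time $t$ and on the errors from all later time steps.<br>- Because the same weights are shared across time steps, their gradients are summed over the sequence:<br> $\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L}{\partial W_t}$, where $W_t$ denotes the copy of $W$ used at step $t$ of the unrolled network. |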
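| **Equations**            | 1. Output error (for a squared-error loss $L_t = \frac{1}{2}\lVert y_t - \hat{y}_t \rVert^2$ with a linear output layer): $\delta y_t = y_t - \hat{y}_t$<br>2. Hidden state error, with pre-activation $a_t = W_x x_t + W_h h_{t-1} + b_h$, $h_t = f(a_t)$, and $\delta h_{T+1} = 0$: $\delta h_t = (W_y^T \delta y_t + W_h^T \delta h_{t+1}) \odot f'(a_t)$. For $f = \tanh$, $f'(a_t) = 1 - h_t^2$.<br>3. Weight gradients:<br> - $\frac{\partial L}{\partial W_y} = \sum_t \delta y_t h_t^T$<br> - $\frac{\partial L}{\partial W_h} = \sum_t \delta h_t h_{t-1}^T$<br> - $\frac{\partial L}{\partial W_x} = \sum_t \delta h_t x_t^T$ |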
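| **Key Issues**           | - **Vanishing Gradient:** The error is multiplied by $W_h^T$ and $f'$ at every step, so it can shrink exponentially with the number of steps it travels, and long-range dependencies are learned poorly.<br>- **Exploding Gradient:** The same repeated product can instead grow exponentially and destabilize training.<br>- **Solutions:** Gradient clipping (for exploding gradients); LSTM, GRU, and residual connections (for vanishing gradients). |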

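The vanishing and exploding behavior named in the **Key Issues** row can be seen in a minimal sketch. It assumes a constant activation derivative and a recurrent matrix that is a scaled identity, both simplifications chosen only so the effect is easy to read. `gradient_norms` and `act_deriv` are hypothetical names, not part of any library.

```python
import numpy as np

def gradient_norms(W_h, steps, act_deriv=1.0):
    """Norm of the hidden-state error after each backward step.

    Hypothetical helper: assumes the activation derivative is the same
    constant (act_deriv) at every step, which a real RNN does not guarantee.
    """
    delta = np.ones((W_h.shape[0], 1))          # error arriving at the last step
    norms = []
    for _ in range(steps):
        delta = (W_h.T @ delta) * act_deriv     # one step back through time
        norms.append(float(np.linalg.norm(delta)))
    return norms

hidden_size = 4
# Recurrent weights scaled below 1: the error halves at every step.
shrinking = gradient_norms(0.5 * np.eye(hidden_size), steps=10)
# Recurrent weights scaled above 1: the error grows by 1.5x at every step.
growing = gradient_norms(1.5 * np.eye(hidden_size), steps=10)

print("shrinking:", shrinking[0], "->", shrinking[-1])
print("growing:  ", growing[0], "->", growing[-1])
```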
---

### **Python Example: BPTT Implementation (Conceptual)**

```python
import numpy as np

np.random.seed(0)  # Fixed seed so the example is reproducible

# Hyperparameters
T = 5                     # Sequence length
hidden_size = 4           # Hidden layer size
input_size = 3            # Input feature size
output_size = 2           # Output size
lr = 0.01                 # Learning rate

# Initialize weights and biases
W_x = np.random.randn(hidden_size, input_size)
W_h = np.random.randn(hidden_size, hidden_size)
W_y = np.random.randn(output_size, hidden_size)
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

# Dummy input sequence and target outputs
X = [np.random.randn(input_size, 1) for _ in range(T)]
Y = [np.random.randn(output_size, 1) for _ in range(T)]

# Forward Pass
h = [np.zeros((hidden_size, 1))]
y = []
for t in range(T):
    h_t = np.tanh(W_x @ X[t] + W_h @ h[-1] + b_h)   # Hidden state
    y_t = W_y @ h_t + b_y                           # Output
    h.append(h_t)
    y.append(y_t)

# Gradient of the squared-error loss L_t = 0.5 * ||y_t - Y_t||^2
# with respect to each output y_t
dL_dy = [y[t] - Y[t] for t in range(T)]

# Initialize gradients
dW_x = np.zeros_like(W_x)
dW_h = np.zeros_like(W_h)
dW_y = np.zeros_like(W_y)
db_h = np.zeros_like(b_h)
db_y = np.zeros_like(b_y)
dh_next = np.zeros((hidden_size, 1))

# Backward Propagation Through Time
for t in reversed(range(T)):
    # Output layer gradient
    dW_y += dL_dy[t] @ h[t+1].T
    db_y += dL_dy[t]

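    # Hidden layer gradient: error from this step's output plus error
    # flowing back from step t+1. h[0] is the initial state, so h[t+1]
    # holds the hidden state produced at step t.
    dh = W_y.T @ dL_dy[t] + dh_next
    dtanh = (1 - h[t+1]**2) * dh   # Back through tanh: tanh'(a) = 1 - tanh(a)^2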

    # Weight gradients
    dW_x += dtanh @ X[t].T
    dW_h += dtanh @ h[t].T
    db_h += dtanh

    # Propagate error to previous time step
    dh_next = W_h.T @ dtanh

# Parameter Update (SGD)
# `-=` modifies each NumPy array in place, so W_x, W_h, W_y, b_h, b_y change
for param, dparam in zip([W_x, W_h, W_y, b_h, b_y],
                         [dW_x, dW_h, dW_y, db_h, db_y]):
    param -= lr * dparam
```
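
A common way to gain confidence in a hand-written backward pass is a numerical gradient check. The sketch below repackages the same forward and backward steps into functions and compares the BPTT gradients with central finite differences. It assumes the squared-error loss `0.5 * ||y_t - Y_t||^2` per step; the function names `forward_loss`, `bptt_grads`, and `numerical_grad`, the step size `eps`, and the 0.5 weight scale are illustrative choices, not part of the example above.

```python
import numpy as np

def forward_loss(params, X, Y):
    """Run the RNN forward; return the total squared-error loss and all hidden states."""
    W_x, W_h, W_y, b_h, b_y = params
    h = [np.zeros((W_h.shape[0], 1))]            # h[0] is the initial state
    loss = 0.0
    for x_t, target in zip(X, Y):
        h.append(np.tanh(W_x @ x_t + W_h @ h[-1] + b_h))
        y_t = W_y @ h[-1] + b_y
        loss += 0.5 * float(np.sum((y_t - target) ** 2))
    return loss, h

def bptt_grads(params, X, Y):
    """Analytic gradients via BPTT, in the same order as params."""
    W_x, W_h, W_y, b_h, b_y = params
    _, h = forward_loss(params, X, Y)
    grads = [np.zeros_like(p) for p in params]
    dh_next = np.zeros_like(h[0])
    for t in reversed(range(len(X))):
        dy = (W_y @ h[t + 1] + b_y) - Y[t]                  # output error
        da = (1 - h[t + 1] ** 2) * (W_y.T @ dy + dh_next)   # pre-activation error
        grads[0] += da @ X[t].T
        grads[1] += da @ h[t].T
        grads[2] += dy @ h[t + 1].T
        grads[3] += da
        grads[4] += dy
        dh_next = W_h.T @ da                                # pass error to step t-1
    return grads

def numerical_grad(params, index, X, Y, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. params[index]."""
    p = params[index]
    grad = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus, _ = forward_loss(params, X, Y)
        p[idx] = old - eps
        loss_minus, _ = forward_loss(params, X, Y)
        p[idx] = old                                        # restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
T, hidden_size, input_size, output_size = 5, 4, 3, 2
params = [
    0.5 * rng.standard_normal((hidden_size, input_size)),   # W_x
    0.5 * rng.standard_normal((hidden_size, hidden_size)),  # W_h
    0.5 * rng.standard_normal((output_size, hidden_size)),  # W_y
    np.zeros((hidden_size, 1)),                             # b_h
    np.zeros((output_size, 1)),                             # b_y
]
X = [rng.standard_normal((input_size, 1)) for _ in range(T)]
Y = [rng.standard_normal((output_size, 1)) for _ in range(T)]

analytic = bptt_grads(params, X, Y)
max_errors = [
    float(np.max(np.abs(a - numerical_grad(params, i, X, Y))))
    for i, a in enumerate(analytic)
]
print("max abs difference per parameter:", max_errors)
```

If the backward pass is correct, every difference should be tiny compared with the gradient values themselves. A large difference for one parameter usually points to an indexing or transpose mistake in that parameter's update line.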

---

| **Advantages**                                                                      | **Limitations**                                                                                                         |
| ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
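| - Captures temporal dependencies in sequences.<br>- Yields exact gradients of the sequence loss with respect to the shared weights, so standard gradient-based optimizers apply. | - Computationally expensive: the cost of one update grows with sequence length, and the backward pass must run step by step.<br>- Suffers from vanishing/exploding gradients.<br>- Memory-intensive for long sequences, since every hidden state must be kept for the backward pass. |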

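Of the limitations listed, exploding gradients are the one most directly handled in code. Gradient clipping rescales the gradients when their combined norm exceeds a threshold, before the parameter update is applied. The following is a minimal sketch of clipping by global norm; `clip_by_global_norm` and the `max_norm` value are hypothetical choices for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # same direction, smaller length
    return grads, total_norm

# Stand-in gradients; in practice these would be dW_x, dW_h, dW_y, db_h, db_y.
grads = [np.full((2, 2), 3.0), np.full((2, 1), 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
print("norm before clipping:", norm_before)
```

Clipping changes only the length of the update, not its direction, which is why it helps with exploding gradients but does nothing for vanishing ones.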