## Backward Propagation (Backpropagation)

---

### 1. Theoretical Intuition
- **Backpropagation** is the algorithm that computes the **gradient of the loss function** with respect to every weight and bias in the network.  
- An optimizer then uses these gradients to **update the weights** during training so that the loss decreases.  
- It works by **propagating the error backward** from the output layer toward the input layer, reusing each layer's result to compute the one before it.  
- It relies on the **chain rule** from calculus.  
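
To make the chain rule concrete, here is a minimal sketch for a single sigmoid neuron. The one-weight model, the helper names `forward` and `backward`, and the sample values are assumptions chosen for illustration; they do not come from the notes above. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, y):
    """Hypothetical one-neuron network: z = w*x, a = sigmoid(z), L = 0.5*(y - a)^2."""
    z = w * x
    a = sigmoid(z)
    loss = 0.5 * (y - a) ** 2
    return z, a, loss

def backward(w, x, y):
    """dL/dw via the chain rule: dL/da * da/dz * dz/dw."""
    _, a, _ = forward(w, x, y)
    dL_da = a - y            # derivative of 0.5*(y - a)^2 with respect to a
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    dz_dw = x                # derivative of w*x with respect to w
    return dL_da * da_dz * dz_dw

# Illustrative values (assumed, not from the notes)
w, x, y = 0.8, 1.5, 1.0
analytic = backward(w, x, y)

# Central finite difference as an independent check
eps = 1e-6
numeric = (forward(w + eps, x, y)[2] - forward(w - eps, x, y)[2]) / (2 * eps)

print(f"chain rule gradient:        {analytic:.6f}")
print(f"finite-difference gradient: {numeric:.6f}")
```

The two numbers should agree to several decimal places; a mismatch usually points to a mistake in one of the local derivatives.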

---

### 2. Key Pointers
- Runs after **forward propagation** and the **loss** computation  
- Computes gradients for all weights and biases **layer by layer**, from the output layer backward  
- Fundamental to training **deep neural networks**  
- The optimizer's **learning rate** scales the resulting weight updates  
- Paired with **gradient descent** or one of its variants (e.g., SGD with momentum, Adam)  
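
The effect of the learning rate on gradient descent can be sketched on a toy problem. The quadratic `f(w) = (w - 3)^2`, the function names, and the three learning rates below are assumptions picked only to show slow convergence, good convergence, and divergence; a real network's loss surface is far less well-behaved.

```python
def grad(w):
    # Derivative of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr, steps):
    """Repeatedly apply the update w := w - lr * df/dw."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Same starting point and step count, three assumed learning rates
w_small = gradient_descent(0.0, lr=0.01, steps=20)  # moves toward 3, but slowly
w_good = gradient_descent(0.0, lr=0.1, steps=20)    # ends close to 3
w_big = gradient_descent(0.0, lr=1.1, steps=20)     # overshoots and diverges

print("lr=0.01:", w_small)
print("lr=0.1: ", w_good)
print("lr=1.1: ", w_big)
```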

---

### 3. Use Cases
- Training any **ANN / MLP**  
- Solving regression or classification problems  
- Fitting a network's weights to a training dataset  

---

### 4. Mathematical Intuition
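For a simple 2-layer network with hidden layer \( z_1 = X \cdot W_1 \), \( a_1 = \sigma(z_1) \) and a linear output \( \hat{y} = a_1 \cdot W_2 \) (biases omitted for brevity):  

1. Compute loss (summed over samples when \( X \) holds a batch): \( L = \frac{1}{2}(y - \hat{y})^2 \)  
2. Output layer gradient:  
\[
\frac{\partial L}{\partial W_2} = a_1^T \cdot (\hat{y} - y)
\]  
3. Hidden layer gradient, where \( \odot \) denotes element-wise multiplication:  
\[
\delta_1 = \left( (\hat{y} - y) \cdot W_2^T \right) \odot \sigma'(z_1)
\]  
\[
\frac{\partial L}{\partial W_1} = X^T \cdot \delta_1
\]  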

- Update rule (SGD): \( W := W - \eta \cdot \frac{\partial L}{\partial W} \)  
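
The gradient formulas in this section can be checked numerically. The following is a minimal NumPy sketch, assuming a sigmoid hidden activation, a linear output, no biases, and a loss summed over samples; the function names, the random seed, and the layer sizes are illustrative choices rather than part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(X, y, W1, W2):
    """L = 0.5 * sum((y - y_hat)^2) for y_hat = sigmoid(X W1) W2."""
    a1 = sigmoid(X @ W1)
    y_hat = a1 @ W2
    return 0.5 * np.sum((y - y_hat) ** 2)

def gradients(X, y, W1, W2):
    """Manual backward pass following the formulas above."""
    z1 = X @ W1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2
    err = y_hat - y                          # (n, 1)
    dW2 = a1.T @ err                         # dL/dW2 = a1^T (y_hat - y)
    delta1 = (err @ W2.T) * a1 * (1.0 - a1)  # element-wise product with sigma'(z1)
    dW1 = X.T @ delta1                       # dL/dW1 = X^T delta1
    return dW1, dW2

rng = np.random.default_rng(0)               # assumed seed, for repeatability
X = np.array([[0.5, 1.0], [1.5, -0.5]])
y = np.array([[1.0], [0.0]])
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 1))

dW1, dW2 = gradients(X, y, W1, W2)

# Finite-difference check on a single entry of W1
eps = 1e-6
W1_plus = W1.copy()
W1_plus[0, 0] += eps
W1_minus = W1.copy()
W1_minus[0, 0] -= eps
numeric = (loss_fn(X, y, W1_plus, W2) - loss_fn(X, y, W1_minus, W2)) / (2 * eps)

print("analytic dL/dW1[0,0]:", dW1[0, 0])
print("numeric  dL/dW1[0,0]:", numeric)

# One SGD step with an assumed learning rate eta
eta = 0.1
W1_new = W1 - eta * dW1
W2_new = W2 - eta * dW2
```

Each gradient has the same shape as the weight matrix it belongs to, which is a quick way to catch a misplaced transpose.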

---

### 5. Interview Q&A

| Question | Answer |
|----------|--------|
| What is backpropagation? | An algorithm that computes the gradients of the loss function w.r.t. the weights and biases, which are then used for training. |
| Why do we use backpropagation? | It efficiently provides the gradients that gradient descent needs to update the weights and minimize the loss. |
| What is the chain rule in backpropagation? | It expresses the derivative of a composite function as a product of the derivatives of its parts; backpropagation applies it layer by layer, from the output back to the input. |
| Does backpropagation work for deep networks? | Yes, it is essential for training deep neural networks. |
| What are the main steps of a training iteration with backpropagation? | 1. Forward pass 2. Compute loss 3. Backward pass (compute gradients) 4. Update weights |

---

### 6. Code Demo: Step-by-Step Backpropagation

```python
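import torch

torch.manual_seed(0)  # fixed seed so the random initial weights (and printed values) are reproducible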

# ----------------------------
# 1️⃣ Input and output
X = torch.tensor([[0.5, 1.0],
                  [1.5, -0.5]], dtype=torch.float32)
y = torch.tensor([[1.0],
                  [0.0]], dtype=torch.float32)

# ----------------------------
# 2️⃣ Initialize weights and biases
W1 = torch.randn((2, 3), requires_grad=True)   # Input -> Hidden
b1 = torch.randn((3,), requires_grad=True)
W2 = torch.randn((3, 1), requires_grad=True)   # Hidden -> Output
b2 = torch.randn((1,), requires_grad=True)

# Learning rate
lr = 0.1

# ----------------------------
# 3️⃣ Forward Propagation
z1 = torch.matmul(X, W1) + b1
a1 = torch.relu(z1)             # Hidden layer activation

z2 = torch.matmul(a1, W2) + b2
y_pred = z2                     # Output layer (regression)

# Compute loss (MSE)
loss = ((y - y_pred)**2).mean()
print("Loss before backpropagation:", loss.item())

# ----------------------------
# 4️⃣ Backward Propagation
loss.backward()   # Compute gradients automatically

# Print gradients
print("\nGradients:")
print("dL/dW2:\n", W2.grad)
print("dL/db2:\n", b2.grad)
print("dL/dW1:\n", W1.grad)
print("dL/db1:\n", b1.grad)

# ----------------------------
# 5️⃣ Update weights manually (SGD)
with torch.no_grad():
    W1 -= lr * W1.grad
    b1 -= lr * b1.grad
    W2 -= lr * W2.grad
    b2 -= lr * b2.grad

    # Zero gradients
    W1.grad.zero_()
    b1.grad.zero_()
    W2.grad.zero_()
    b2.grad.zero_()

# ----------------------------
# 6️⃣ Forward pass after update
a1_new = torch.relu(torch.matmul(X, W1) + b1)
y_pred_new = torch.matmul(a1_new, W2) + b2
loss_new = ((y - y_pred_new)**2).mean()
print("\nLoss after one update:", loss_new.item())
