import numpy as np

# Activation Functions
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)  # Gradient of ReLU

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # Stability trick
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Loss Function (Cross-Entropy)
def cross_entropy_loss(y_pred, y_true):
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + 1e-9)) / m  # Adding small value to avoid log(0)

# One-Hot Encoding for Labels
def one_hot_encode(y, num_classes):
    one_hot = np.zeros((y.size, num_classes))
    one_hot[np.arange(y.size), y] = 1
    return one_hot

# Sample Input Data (3 samples, 2 features)
X = np.array([[0.5, 1.2],
              [1.0, -0.7],
              [-0.3, 0.8]])

# Ground Truth Labels
y = np.array([0, 1, 0])  # Class labels
y_one_hot = one_hot_encode(y, num_classes=2)  # Convert labels to one-hot

# Initialize Network Parameters
np.random.seed(42)
W1 = np.random.randn(2, 3) * 0.01  # Weights for Layer 1 (2 -> 3 neurons)
b1 = np.zeros((1, 3))              # Bias for Layer 1
W2 = np.random.randn(3, 2) * 0.01  # Weights for Layer 2 (3 -> 2 neurons)
b2 = np.zeros((1, 2))              # Bias for Layer 2

learning_rate = 0.01

# Forward Propagation
Z1 = np.dot(X, W1) + b1
A1 = relu(Z1)
Z2 = np.dot(A1, W2) + b2
A2 = softmax(Z2)

# Compute Loss
loss = cross_entropy_loss(A2, y_one_hot)
print("Loss before backpropagation:", loss)

# Backward Propagation
m = X.shape[0]  # Number of samples

# Compute Gradients for Output Layer
dZ2 = A2 - y_one_hot       # Gradient of softmax + cross-entropy
dW2 = np.dot(A1.T, dZ2) / m
db2 = np.sum(dZ2, axis=0, keepdims=True) / m

# Compute Gradients for Hidden Layer
dA1 = np.dot(dZ2, W2.T)    # Backpropagate error to hidden layer
dZ1 = dA1 * relu_derivative(Z1)  # Apply ReLU derivative
dW1 = np.dot(X.T, dZ1) / m
db1 = np.sum(dZ1, axis=0, keepdims=True) / m

# Update Weights using Gradient Descent
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
W2 -= learning_rate * dW2
b2 -= learning_rate * db2

# Print Updated Weights
print("Updated W1:\n", W1)
print("Updated b1:\n", b1)
print("Updated W2:\n", W2)
print("Updated b2:\n", b2)


Loss before backpropagation: 0.6931042210040035
Updated W1:
 [[ 0.00496985 -0.00136578  0.00647687]
 [ 0.01525736 -0.00235334 -0.0023413 ]]
Updated b1:
 [[ 2.70575422e-05  1.68672202e-05 -3.20079660e-12]]
Updated W2:
 [[ 0.01584455  0.00762193]
 [-0.00469517  0.00542603]
 [-0.00464699 -0.00464449]]
Updated b2:
 [[ 0.00166646 -0.00166646]]
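
The `dZ2 = A2 - y_one_hot` step in the code follows from combining softmax with cross-entropy. A short derivation for a single sample, assuming $a_k = e^{z_k} / \sum_j e^{z_j}$, $L = -\sum_k y_k \log a_k$, and a one-hot label $y$ (so $\sum_k y_k = 1$):

$$
\frac{\partial a_k}{\partial z_i} = a_k(\delta_{ki} - a_i), \qquad
\frac{\partial L}{\partial z_i} = -\sum_k \frac{y_k}{a_k}\, a_k(\delta_{ki} - a_i) = -y_i + a_i \sum_k y_k = a_i - y_i
$$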


### **🔹 Backpropagation Interview Questions (With Answers) 🚀**  

#### **Basic Questions**
1️⃣ **What is backpropagation in neural networks?**  
   **Answer:** Backpropagation is the algorithm that computes the gradient of the loss function with respect to every weight and bias in the network. It propagates the error from the output layer back through the layers using the chain rule of calculus; an optimizer such as gradient descent then uses those gradients to update the parameters.  

2️⃣ **Why do we need backpropagation in deep learning?**  
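   **Answer:** Training minimizes a loss function (a measure of the gap between predictions and labels), and gradient-based optimizers need the gradient of that loss with respect to every parameter. Backpropagation computes all of those gradients in a single backward pass by reusing intermediate results, which is far cheaper than perturbing each weight separately.  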

3️⃣ **What are the key steps in backpropagation?**  
   **Answer:**  
   - **Forward pass**: Compute activations and output.  
   - **Compute loss**: Compare predicted output to actual target.  
   - **Backward pass**: Compute gradients of the loss with respect to each weight using the chain rule.  
   - **Update weights**: Adjust weights using gradient descent.  
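
The four steps can be sketched as a training loop. This is a minimal illustration, assuming a single linear layer with a sigmoid output and binary cross-entropy; the toy data and the `lr` and `n_steps` values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                    # toy inputs (illustrative)
y = (X[:, :1] + X[:, 1:] > 0).astype(float)    # toy binary targets, shape (8, 1)

W = np.zeros((2, 1))
b = np.zeros((1, 1))
lr, n_steps = 0.5, 100                         # hypothetical hyperparameters

losses = []
for _ in range(n_steps):
    # 1. Forward pass
    z = X @ W + b
    p = 1.0 / (1.0 + np.exp(-z))
    # 2. Compute loss (binary cross-entropy)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    # 3. Backward pass: for sigmoid + BCE, dL/dz = (p - y) / m
    dz = (p - y) / X.shape[0]
    dW = X.T @ dz
    db = dz.sum(axis=0, keepdims=True)
    # 4. Update weights with gradient descent
    W -= lr * dW
    b -= lr * db

print("first loss:", losses[0], "last loss:", losses[-1])
```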

4️⃣ **What is the chain rule and how is it used in backpropagation?**  
   **Answer:** The chain rule is a fundamental rule in calculus used to compute the derivative of a composition of functions. Backpropagation applies it layer by layer to calculate how a small change in each weight affects the loss.  
   
5️⃣ **Why do we use the chain rule?**  
   **Answer:**  
   - The loss depends on each weight only indirectly, through the weighted input and the **non-linear activation** (ReLU, Sigmoid, Tanh). The chain rule breaks that dependency into a product of simple local derivatives.  
   - Example:  
     $\frac{dL}{dW} = \frac{dL}{dA} \times \frac{dA}{dZ} \times \frac{dZ}{dW}$  
     where:  
     - $L$ is the loss  
     - $A$ is the activation  
     - $Z$ is the weighted input  
     - $W$ is the weight  
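
The three-factor product can be checked numerically. The sketch below assumes a single sigmoid neuron with $Z = W x$ and a squared-error loss $L = \frac{1}{2}(A - y)^2$; the values of `w`, `x`, and `y` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, x, y):
    # L = 0.5 * (A - y)^2 with A = sigmoid(w * x)
    return 0.5 * (sigmoid(w * x) - y) ** 2

w, x, y = 0.8, 1.5, 1.0          # illustrative values

# Chain rule: dL/dW = dL/dA * dA/dZ * dZ/dW
a = sigmoid(w * x)
dL_dA = a - y
dA_dZ = a * (1 - a)
dZ_dW = x
grad_chain = dL_dA * dA_dZ * dZ_dW

# Central finite difference as an independent check
eps = 1e-6
grad_numeric = (loss_fn(w + eps, x, y) - loss_fn(w - eps, x, y)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two values should agree closely
```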

---

#### **Intermediate Questions**

6️⃣ **Why is ReLU preferred over sigmoid/tanh in deep networks?**  
   **Answer:**  
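   - **ReLU mitigates vanishing gradients**: its derivative is exactly 1 for positive inputs, so gradients pass through active units without shrinking (it is 0 for negative inputs, which can leave some units inactive).  
   - **Sigmoid and tanh suffer from saturation**: for inputs of large magnitude their derivatives approach 0 (the sigmoid derivative never exceeds 0.25), so gradients become too small to update weights effectively after passing through many layers.  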
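
A rough way to see the difference is to multiply the activation derivatives along a chain of layers. This sketch deliberately ignores the weight matrices (a simplifying assumption) and only compares the best case for sigmoid with an active ReLU path; `depth` is an illustrative value.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return (z > 0).astype(float)

depth = 10                                   # illustrative number of layers

# Sigmoid's derivative peaks at z = 0, where it equals 0.25
z_best = np.zeros(depth)
best_case_sigmoid = np.prod(sigmoid_derivative(z_best))   # 0.25 ** 10

# ReLU's derivative is 1 wherever the pre-activation is positive
z_active = np.full(depth, 2.0)
active_relu = np.prod(relu_derivative(z_active))          # 1.0

print(best_case_sigmoid, active_relu)
```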

7️⃣ **What happens if we don’t use backpropagation correctly?**  
   **Answer:**  
   - **Incorrect weight updates** → Model doesn’t learn properly.  
   - **Exploding/vanishing gradients** → Either too large or too small weight updates.  
   - **Slow convergence** → Training takes too long.  

8️⃣ **How does weight initialization impact backpropagation?**  
   **Answer:** Poor initialization can cause exploding or vanishing gradients. Popular initialization techniques include:  
   - **Xavier Initialization** (for sigmoid/tanh)  
   - **He Initialization** (for ReLU)  
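
Both schemes scale the random weights by the layer's size. A minimal sketch of the normal-distribution variants, assuming the common formulas `std = sqrt(2 / (fan_in + fan_out))` for Xavier and `std = sqrt(2 / fan_in)` for He; the function names and layer sizes are illustrative.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Xavier/Glorot normal: variance 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    # He normal: variance 2 / fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_init(256, 128, rng)
W_he = he_init(256, 128, rng)

print("Xavier std:", W_xavier.std(), "He std:", W_he.std())
```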

---

#### **Advanced Questions**
9️⃣ **What is the vanishing gradient problem? How do we solve it?**  
   **Answer:**  
   - **Vanishing gradients** occur when gradients shrink toward zero as they are multiplied back through many layers (especially in deep networks), so the early layers barely learn.  
   - **Solutions**:  
     - Use **ReLU** instead of sigmoid/tanh.  
     - Use **Batch Normalization**.  
     - Use **LSTM instead of vanilla RNNs** in sequence models.  

🔟 **What is the exploding gradient problem? How do we solve it?**  
   **Answer:**  
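   - **Exploding gradients** occur when gradients grow very large as they are multiplied back through many layers, producing excessively large weight updates and unstable training.  
   - **Solutions**:  
     - **Gradient Clipping** (cap the gradient value or norm).  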
     - **Use proper weight initialization (Xavier, He)**.  
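
Gradient clipping can be sketched as rescaling all gradients when their combined norm exceeds a threshold. This is one common variant (clipping by global norm); the helper name `clip_by_global_norm` and the example gradients are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Norm over all gradient arrays taken together
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale down only if the norm exceeds the threshold
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([[3.0, 4.0]]), np.array([12.0])]   # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(norm_before, norm_after)
```

Rescaling by the global norm keeps the direction of the update unchanged, whereas clipping each value independently can change it.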

1️⃣1️⃣ **What is the difference between backpropagation and gradient descent?**  
   **Answer:**  
   - **Backpropagation** computes gradients using the chain rule.  
   - **Gradient descent** updates weights using those gradients.  
   - **Backpropagation = Gradient Computation; Gradient Descent = Weight Update**.  
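
The split can be made explicit by keeping the two responsibilities in separate functions. A minimal sketch, assuming a single linear layer with the loss $\frac{1}{2m}\sum (\hat{y} - y)^2$; the function names and data are illustrative.

```python
import numpy as np

def backprop(X, y, W, b):
    """Gradient computation only: returns dL/dW and dL/db."""
    m = X.shape[0]
    err = X @ W + b - y                       # prediction error
    dW = X.T @ err / m
    db = err.sum(axis=0, keepdims=True) / m
    return dW, db

def gradient_descent_step(params, grads, lr):
    """Weight update only: consumes gradients, knows nothing about the model."""
    return [p - lr * g for p, g in zip(params, grads)]

X = np.array([[1.0], [2.0]])
y = np.array([[2.0], [4.0]])
W, b = np.zeros((1, 1)), np.zeros((1, 1))

dW, db = backprop(X, y, W, b)                 # dW = [[-5.]], db = [[-3.]]
W, b = gradient_descent_step([W, b], [dW, db], lr=0.1)
print(W, b)                                   # W = [[0.5]], b = [[0.3]]
```

Because the update step only sees gradients, it could be swapped for another optimizer without touching `backprop`.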

1️⃣2️⃣ **How is backpropagation different in CNNs and RNNs?**  
   **Answer:**  
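   - **CNNs**: Gradients are propagated through convolutional and pooling layers. Because each filter's weights are shared across spatial positions, a filter's gradient is the sum of the contributions from every position where it was applied.  
   - **RNNs**: Use **Backpropagation Through Time (BPTT)**: the network is unrolled across time steps, and the gradients for the shared recurrent weights are summed over those steps.  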
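
BPTT can be illustrated with a scalar RNN. This sketch assumes $h_t = \tanh(w\,h_{t-1} + u\,x_t)$ with $h_0 = 0$ and a loss on the final state only, $L = \frac{1}{2}(h_T - y)^2$; all names and values are illustrative, and a finite-difference check is included to confirm the backward loop.

```python
import numpy as np

def rnn_forward(w, u, xs):
    hs = [0.0]                                # h_0 = 0
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def loss_fn(w, u, xs, y):
    return 0.5 * (rnn_forward(w, u, xs)[-1] - y) ** 2

def bptt(w, u, xs, y):
    hs = rnn_forward(w, u, xs)
    dw = du = 0.0
    dh = hs[-1] - y                           # dL/dh_T
    for t in range(len(xs), 0, -1):           # walk backward through time
        dz = dh * (1.0 - hs[t] ** 2)          # tanh'(z_t) = 1 - h_t^2
        dw += dz * hs[t - 1]                  # w is shared, so contributions add up
        du += dz * xs[t - 1]
        dh = dz * w                           # pass the gradient on to h_{t-1}
    return dw, du

w, u, xs, y = 0.5, 1.0, [0.3, -0.2, 0.8], 0.5
dw, du = bptt(w, u, xs, y)

eps = 1e-6
dw_numeric = (loss_fn(w + eps, u, xs, y) - loss_fn(w - eps, u, xs, y)) / (2 * eps)
du_numeric = (loss_fn(w, u + eps, xs, y) - loss_fn(w, u - eps, xs, y)) / (2 * eps)
print(dw, dw_numeric, du, du_numeric)
```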

1️⃣3️⃣ **How does batch size affect backpropagation?**  
   **Answer:**  
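   - **Small batches** → Noisier gradient estimates, which often help generalization.  
   - **Large batches** → More stable gradient estimates, but they tend to converge to sharper minima that may generalize worse.  
   - **Mini-batch Gradient Descent** (a compromise between single-sample and full-batch updates) is widely used.  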
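
In code, the batch size only changes how many samples feed each gradient computation. A minimal sketch of a shuffled mini-batch iterator; the helper name `iterate_minibatches` and the toy arrays are illustrative.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
batch_sizes = [xb.shape[0] for xb, _ in batches]
print(batch_sizes)   # [4, 4, 2]
```

Each yielded pair would go through one forward pass, one backward pass, and one weight update.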

---

### **🔹 Coding-Based Questions**
1️⃣4️⃣ **Write a Python function to compute the derivative of ReLU.**  
   ```python
   def relu_derivative(x):
       return (x > 0).astype(float)  # Returns 1 if x > 0, else 0
   ```

1️⃣5️⃣ **Modify the following NumPy forward propagation code to include backpropagation.**  
   (They may give a simple forward propagation code and ask you to implement backpropagation.)  
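
One way to sanity-check an answer to this kind of question is a numerical gradient check. The sketch below assumes the same two-layer ReLU/softmax architecture used in the code at the top of this document; the function names (`forward`, `loss_fn`, `backward`) and the toy data are illustrative, and bias gradients are omitted for brevity.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)                              # ReLU
    Z2 = A1 @ W2 + b2
    e = np.exp(Z2 - Z2.max(axis=1, keepdims=True))      # stable softmax
    A2 = e / e.sum(axis=1, keepdims=True)
    return Z1, A1, A2

def loss_fn(X, Y, W1, b1, W2, b2):
    A2 = forward(X, W1, b1, W2, b2)[2]
    return -np.sum(Y * np.log(A2)) / X.shape[0]         # cross-entropy

def backward(X, Y, W1, b1, W2, b2):
    m = X.shape[0]
    Z1, A1, A2 = forward(X, W1, b1, W2, b2)
    dZ2 = (A2 - Y) / m                                  # softmax + cross-entropy
    dW2 = A1.T @ dZ2
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)                       # ReLU derivative
    dW1 = X.T @ dZ1
    return dW1, dW2

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
Y = np.eye(2)[[0, 1, 0, 1]]                             # one-hot labels
W1, b1 = rng.normal(size=(2, 3)), np.full((1, 3), 0.1)
W2, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))

dW1, dW2 = backward(X, Y, W1, b1, W2, b2)

# Central finite difference on a single entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(X, Y, W1_plus, b1, W2, b2)
           - loss_fn(X, Y, W1_minus, b1, W2, b2)) / (2 * eps)

print(dW1[0, 0], numeric)   # should agree closely if backward() is correct
```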

---

### **🔹 Final Tips for Backpropagation Interviews**
✅ **Understand the math**: Derivatives, Chain Rule, Gradient Computation.  
✅ **Explain gradient flow**: How gradients pass through different layers.  
✅ **Code simple neural networks**: Implement forward and backward propagation in NumPy or PyTorch.  
✅ **Know optimization techniques**: Adam, RMSProp, Learning Rate Scheduling.  