# Session 6: Backpropagation – Learning from Mistakes

## Overview
In this session, we explore **backpropagation**, the core algorithm for training neural networks. Backpropagation enables the model to adjust its weights by learning from its errors, making it crucial for improving neural network performance.

Backpropagation uses the **chain rule of calculus** to compute the gradient of the loss with respect to every weight in the network. Gradient descent then uses those gradients to update the weights, and as the error between predicted and actual values shrinks, the network's accuracy improves over time.

## Key Topics
- Understanding Backpropagation
- The Chain Rule and Gradient Calculation
- Backpropagation Algorithm
- How Backpropagation Updates Weights
- Practical Implementation of Backpropagation

---

### 1. Understanding Backpropagation

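Backpropagation is the algorithm used in supervised learning to work out how much each weight in a neural network contributed to the prediction error. It propagates the error backwards from the output layer toward the input layer, so that each weight can be adjusted according to its share of that error.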

In a neural network:
- The **forward pass** calculates output using input and current weights.
- The **backward pass** adjusts weights by calculating the gradient of the loss function with respect to each weight.

---

### 2. The Chain Rule and Gradient Calculation

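A gradient measures how much a function's output changes in response to a small change in its inputs. In backpropagation, the function is the loss and the inputs are the weights, so the gradient tells us in which direction, and by how much, each weight should move to reduce the loss.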

The **chain rule** allows us to compute the derivative of a composite function. Backpropagation applies the chain rule to compute the derivative of the loss function with respect to each weight.

#### Chain Rule Explanation
If the loss function $L$ depends on an intermediate variable $z$, and $z$ depends on a weight $w$, the chain rule states:

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \times \frac{\partial z}{\partial w}$

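Backpropagation applies this rule repeatedly, layer by layer from the output back toward the input, reusing the gradient already computed for one layer to obtain the gradient for the layer before it.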
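
As a quick sanity check of the chain rule, the sketch below compares a gradient built from chained factors against a finite-difference estimate. The single-weight setup (`x = 2.0`, `y_true = 1.0`, a sigmoid activation, a squared-error loss) is an assumption chosen for illustration, and it adds one more intermediate variable (`a`) than the formula above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_from_w(w, x=2.0, y_true=1.0):
    """Composite function: w -> z = w * x -> a = sigmoid(z) -> L = 0.5 * (a - y_true)^2."""
    z = w * x
    a = sigmoid(z)
    return 0.5 * (a - y_true) ** 2

def chain_rule_gradient(w, x=2.0, y_true=1.0):
    """dL/dw as a product of local derivatives: dL/da * da/dz * dz/dw."""
    z = w * x
    a = sigmoid(z)
    dL_da = a - y_true        # derivative of 0.5 * (a - y_true)^2 with respect to a
    da_dz = a * (1.0 - a)     # derivative of the sigmoid, written in terms of its output
    dz_dw = x                 # derivative of w * x with respect to w
    return dL_da * da_dz * dz_dw

def numerical_gradient(w, eps=1e-6):
    """Central finite-difference estimate of dL/dw."""
    return (loss_from_w(w + eps) - loss_from_w(w - eps)) / (2 * eps)

w = 0.5
analytic = chain_rule_gradient(w)
numeric = numerical_gradient(w)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

The two numbers should agree to several decimal places; a mismatch is usually a sign that one of the local derivatives is wrong.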

---

### 3. Backpropagation Algorithm

**Backpropagation** follows these steps:

1. **Forward Pass**: Input is fed through the network to calculate output.
2. **Backward Pass**: Starting from the output layer, compute gradients by applying the chain rule.
3. **Weight Update**: Adjust weights using gradient descent.
4. **Repeat**: Iterate to minimize loss.
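
The four steps can be sketched on the smallest possible model, a single sigmoid neuron. Everything specific here is an assumption for illustration: the logical OR toy data, the learning rate of 0.5, the 2000 steps, and the loss $L = \frac{1}{2}\,\text{mean}((y_{\text{pred}} - y)^2)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: logical OR of two inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 1))  # one weight per input
b = np.zeros((1, 1))                    # bias
learning_rate = 0.5

losses = []
for step in range(2000):                # 4. Repeat
    # 1. Forward pass: compute predictions with the current parameters
    z = X @ w + b
    y_pred = sigmoid(z)
    losses.append(float(0.5 * np.mean((y_pred - y) ** 2)))

    # 2. Backward pass: apply the chain rule from the loss back to w and b
    dL_dy = (y_pred - y) / len(X)             # dL/dy_pred for the mean-based loss
    dL_dz = dL_dy * y_pred * (1.0 - y_pred)   # multiply by the sigmoid slope
    dL_dw = X.T @ dL_dz                       # dz/dw is the input
    dL_db = dL_dz.sum(axis=0, keepdims=True)  # dz/db is 1

    # 3. Weight update: one gradient descent step
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With only one layer there is nothing to propagate through, but the structure of the loop is the same one a multi-layer network uses.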

---

### 4. How Backpropagation Updates Weights

Backpropagation uses **gradient descent** to update weights with this rule:

$w = w - \text{learning rate} \times \frac{\partial L}{\partial w}$

Where:
- $\frac{\partial L}{\partial w}$: Gradient of the loss with respect to the weight.
- **Learning rate**: Controls the step size of each weight adjustment. Too small a value makes training slow; too large a value can overshoot the minimum.
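
To see the update rule and the role of the learning rate in isolation, here is a minimal sketch on a one-dimensional toy loss, $L(w) = (w - 3)^2$. The loss, the starting point, and both learning rates are assumptions picked for illustration.

```python
def gradient_descent(w, learning_rate, steps):
    """Minimize the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - learning_rate * grad  # w = w - learning rate * dL/dw
    return w

w_small = gradient_descent(w=0.0, learning_rate=0.1, steps=50)
w_large = gradient_descent(w=0.0, learning_rate=1.1, steps=50)

print(w_small)  # ends very close to the minimum at w = 3
print(w_large)  # ends far from 3: each step overshoots and the error grows
```

With a learning rate of 0.1 every step shrinks the distance to the minimum by a constant factor, while with 1.1 every step overshoots by more than it corrects, so the iterates diverge.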

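#### Example of Weight Updates:
Given a neural network with one hidden layer and activation function $f$:
1. Calculate the hidden layer output $h = f(W \cdot x + b)$.
2. Calculate the output layer output $y_{\text{pred}} = f(W' \cdot h + b')$.
3. Calculate the loss $L = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2$.
4. Compute the gradient of $L$ with respect to $W'$, $b'$, $W$, and $b$ by propagating it backwards through the network, then adjust each parameter with the update rule above.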
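
The gradients in the last step can be written out explicitly. The derivation below is a sketch under assumed notation that does not appear in the steps above: the network output is written $y_{\text{pred}}$, $z = W \cdot x + b$ and $z' = W' \cdot h + b'$ are the pre-activations, $\delta$ and $\delta'$ are the error terms of the hidden and output layers, $x$ and $h$ are column vectors, and $\odot$ is element-wise multiplication.

Output layer:

$\delta' = \frac{\partial L}{\partial z'} = (y_{\text{pred}} - y_{\text{true}}) \, f'(z')$

$\frac{\partial L}{\partial W'} = \delta' \, h^{\top}, \qquad \frac{\partial L}{\partial b'} = \delta'$

Hidden layer:

$\delta = \frac{\partial L}{\partial z} = \left( (W')^{\top} \delta' \right) \odot f'(z)$

$\frac{\partial L}{\partial W} = \delta \, x^{\top}, \qquad \frac{\partial L}{\partial b} = \delta$

The hidden layer's error term $\delta$ is built from the output layer's $\delta'$, which is exactly the reuse that gives backpropagation its name.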

---

### 5. Practical Implementation of Backpropagation

The following Python example demonstrates backpropagation on a simple neural network.

---

### Code Example: Backpropagation for a Simple Neural Network
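This code trains a small network on the XOR problem: 3 inputs (the two XOR inputs plus a constant 1), 4 sigmoid hidden units, and 1 sigmoid output, using gradient descent on all four examples at once.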

```python
import numpy as np

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x must already be a sigmoid output a = sigmoid(z), because d(sigmoid)/dz = a * (1 - a)
    return x * (1 - x)

# Neural Network structure
input_layer_size = 3
hidden_layer_size = 4
output_layer_size = 1

# Initialize weights and biases
np.random.seed(1)
W1 = np.random.rand(input_layer_size, hidden_layer_size)
b1 = np.random.rand(1, hidden_layer_size)
W2 = np.random.rand(hidden_layer_size, output_layer_size)
b2 = np.random.rand(1, output_layer_size)

# Input data and target output
X = np.array([[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]])  # 4 examples, 3 features (third column is always 1)
y = np.array([[0], [1], [1], [0]])  # XOR of the first two columns

# Hyperparameters
learning_rate = 0.1
iterations = 10000

# Training loop
for i in range(iterations):
    # Forward pass
    hidden_layer_input = np.dot(X, W1) + b1
    hidden_layer_output = sigmoid(hidden_layer_input)
    
    output_layer_input = np.dot(hidden_layer_output, W2) + b2
    output_layer_output = sigmoid(output_layer_input)
    
    # Compute loss (Mean Squared Error)
    loss = np.mean((output_layer_output - y) ** 2)
    
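    # Backward pass
    # dL/d(output) for L = 0.5 * sum((output - y)^2); the MSE reported above differs only by a constant factor
    output_layer_error = output_layer_output - y
    # Error term at the output layer: error times the sigmoid slope
    output_layer_gradient = output_layer_error * sigmoid_derivative(output_layer_output)

    # Propagate the output error term back through W2 to get each hidden unit's share of the error
    hidden_layer_error = output_layer_gradient.dot(W2.T)
    hidden_layer_gradient = hidden_layer_error * sigmoid_derivative(hidden_layer_output)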
    
    # Update weights and biases using gradient descent
    W2 -= learning_rate * hidden_layer_output.T.dot(output_layer_gradient)
    b2 -= learning_rate * np.sum(output_layer_gradient, axis=0, keepdims=True)
    
    W1 -= learning_rate * X.T.dot(hidden_layer_gradient)
    b1 -= learning_rate * np.sum(hidden_layer_gradient, axis=0, keepdims=True)

    # Every 1000 iterations, print the loss
    if i % 1000 == 0:
        print(f"Iteration {i}, Loss: {loss}")

# Final output after training
print("Final output after training:")
print(output_layer_output)
```

Output:

    Iteration 0, Loss: 0.38038349310856134
    Iteration 1000, Loss: 0.24997670387612728
    Iteration 2000, Loss: 0.24972913712906816
    Iteration 3000, Loss: 0.24883556540840868
    Iteration 4000, Loss: 0.2430179563469606
    Iteration 5000, Loss: 0.2023128270301176
    Iteration 6000, Loss: 0.1342851241018072
    Iteration 7000, Loss: 0.03423820536202099
    Iteration 8000, Loss: 0.012399204540152104
    Iteration 9000, Loss: 0.006836030803331477
    Final output after training:
    [[0.07125372]
     [0.93446125]
     [0.93473281]
     [0.06774813]]
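
The final outputs are probabilities rather than hard 0/1 labels. One common way to read them, sketched here, is to threshold at 0.5. The threshold and the names `final_output`, `predictions`, and `accuracy` are assumptions added for illustration; the numbers are copied from the run above.

```python
import numpy as np

# Final network outputs copied from the training run above
final_output = np.array([[0.07125372], [0.93446125], [0.93473281], [0.06774813]])
y = np.array([[0], [1], [1], [0]])  # XOR targets

# Treat outputs above 0.5 as class 1 and the rest as class 0
predictions = (final_output > 0.5).astype(int)
accuracy = float(np.mean(predictions == y))

print(predictions.ravel())  # [0 1 1 0]
print(accuracy)             # 1.0
```

All four thresholded predictions match the XOR targets, which is consistent with the loss falling to roughly 0.007 by iteration 9000.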
