### Day 2: Multilayer Perceptron (MLP) and Backpropagation

Now that the trainees understand the basics of a single perceptron, we'll introduce them to **multilayer neural networks** and dive deeper into backpropagation and optimization. We'll also start working with real-world datasets to see how these concepts come together in practice.

#### 1. **Multilayer Perceptron (MLP)**
A **Multilayer Perceptron (MLP)** consists of an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is connected to every neuron in the next layer. This structure enables the model to learn more complex functions compared to a single-layer perceptron.

##### How it works:
- **Input Layer**: Takes the input features.
- **Hidden Layers**: These layers apply transformations (via neurons and activation functions) to learn intermediate representations.
- **Output Layer**: Produces the final prediction (e.g., for classification, the output could be probabilities for different classes).

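Stacking layers with non-linear activations lets the network compose simple features into more complex ones, so it can represent decision boundaries that a single perceptron cannot. XOR, used later in this section, is the classic example.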

##### Formula for an MLP:
Each layer $ l $ computes a vector of pre-activations from the previous layer's activations:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$

Where:
- $ l $ is the layer index.
- $ W^{(l)} $ are the weights of layer $ l $.
- $ a^{(l-1)} $ are the activations from the previous layer.
- $ b^{(l)} $ are the biases of layer $ l $.

The activations $ a^{(l)} $ are obtained by applying an activation function $ f $ to $ z^{(l)} $:

$$
a^{(l)} = f(z^{(l)})
$$

The activation of the output layer depends on the problem: **sigmoid** for binary classification, **softmax** for multi-class classification, and **linear** (no activation) for regression.
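As a minimal sketch of the two formulas above, the snippet below computes $ z^{(l)} $ and $ a^{(l)} $ for one layer in NumPy. The helper name `layer_forward` and the hand-picked values of `W`, `b`, and `a_prev` are assumptions made for illustration. It follows the column-vector convention of the formula ($ W a $); the from-scratch implementation later in this section stores the weights transposed and computes `inputs · W` instead.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer_forward(W, a_prev, b, f=sigmoid):
    """Return (z, a) for one layer: z = W a_prev + b, a = f(z)."""
    z = W @ a_prev + b
    return z, f(z)


# Hypothetical layer with 3 inputs and 2 neurons (values chosen by hand).
W = np.array([[0.5, -1.0, 0.0],
              [1.0, 1.0, 1.0]])
b = np.array([0.0, -1.0])
a_prev = np.array([1.0, 0.0, 2.0])

z, a = layer_forward(W, a_prev, b)
print(z)  # pre-activations: 0.5 and 2.0
print(a)  # sigmoid applied element-wise to z
```

Stacking several such calls, feeding each `a` into the next layer as `a_prev`, gives the full forward pass of an MLP.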

#### 2. **Backpropagation in MLP**
In an MLP, backpropagation is used to update the weights for every layer in the network, not just the output layer.

##### Steps of backpropagation:
1. **Forward pass**: Calculate the output by passing inputs through all the layers.
2. **Calculate loss**: Compare the network's output to the true output using a loss function.
3. **Backpropagate the error**: Using the chain rule, compute the gradients of the loss function with respect to each weight in the network, working backward from the output layer to the input layer.
4. **Update weights**: Update the weights using gradient descent:

$$
w_i^{(l)} \gets w_i^{(l)} - \eta \frac{\partial L}{\partial w_i^{(l)}}
$$

Where $ \eta $ is the learning rate and $ \frac{\partial L}{\partial w_i^{(l)}} $ is the gradient of the loss with respect to the weight $ w_i^{(l)} $.
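To make the update rule concrete, here is a small sketch that applies it to a single weight. The toy loss $ L(w) = (w - 3)^2 $, the starting value, and the helper name `gradient_descent_step` are assumptions chosen only so the behavior is easy to follow; a real network applies the same rule to every weight using the gradients from backpropagation.

```python
def gradient_descent_step(w, grad, eta):
    """One update: w <- w - eta * dL/dw."""
    return w - eta * grad


w = 0.0    # initial weight (arbitrary)
eta = 0.1  # learning rate

for step in range(50):
    grad = 2 * (w - 3.0)  # dL/dw for the toy loss L(w) = (w - 3)^2
    w = gradient_descent_step(w, grad, eta)

print(w)  # moves toward 3, the minimizer of the toy loss
```

With a larger $ \eta $ each step is bigger and the weight can overshoot the minimum; with a smaller $ \eta $ it approaches more slowly.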


#### 3. **Building an MLP in Code (from Scratch)**
Let's now extend the perceptron code to build a simple MLP using NumPy. The implementation is broken down piece by piece below, and the complete listing follows at the end.

### 1. **Imports**
```python
import numpy as np
```
- **`numpy`** is imported to handle array operations and matrix math efficiently, which is crucial for neural networks.

---

### 2. **Activation Functions**
```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be a sigmoid *output* (an activation), not the raw input
    return x * (1 - x)
```
- **Sigmoid Activation Function**: Converts any input into a value between `0` and `1`. This introduces non-linearity into the neural network.
  - Formula: $ \sigma(x) = \frac{1}{1 + e^{-x}} $
- **Sigmoid Derivative**: Used during backpropagation to determine how much to adjust the weights. The derivative of the sigmoid function is:
  - $ \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) $
  - Note that `sigmoid_derivative` takes the already-activated value $ a = \sigma(x) $ as its argument, which is why the code computes `x * (1 - x)` rather than calling `sigmoid` again.

---

### 3. **MLP Class**
#### 3.1. **Initialization (`__init__` Method)**
```python
class MLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.weights_input_hidden = np.random.rand(input_size, hidden_size)
        self.weights_hidden_output = np.random.rand(hidden_size, output_size)
        self.bias_hidden = np.random.rand(hidden_size)
        self.bias_output = np.random.rand(output_size)
```
- **Weights**: Matrices for the connections between layers, initialized with `np.random.rand` (uniform random values in $ [0, 1) $).
  - **`weights_input_hidden`** connects the input layer to the hidden layer.
  - **`weights_hidden_output`** connects the hidden layer to the output layer.
- **Biases**: One value per neuron in the hidden and output layers, initialized the same way. The bias shifts the input to the activation function.
  - **`bias_hidden`** is for the hidden layer.
  - **`bias_output`** is for the output layer.

For example, if the network has:
- **2 input neurons**, 
- **2 hidden neurons**, and 
- **1 output neuron**, 

then:
- **`weights_input_hidden`** will be a 2x2 matrix.
- **`weights_hidden_output`** will be a 2x1 matrix.
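The shapes can be checked directly. The sketch below builds the same four arrays for a 2-2-1 network and prints their shapes. It uses `np.random.default_rng` with a fixed seed instead of `np.random.rand`; that swap, and the seed value, are assumptions made here so the run is reproducible.

```python
import numpy as np

input_size, hidden_size, output_size = 2, 2, 1

# Seeded generator (an assumption for reproducibility; the class above
# uses np.random.rand, which draws from the same uniform [0, 1) range).
rng = np.random.default_rng(0)

weights_input_hidden = rng.random((input_size, hidden_size))
weights_hidden_output = rng.random((hidden_size, output_size))
bias_hidden = rng.random(hidden_size)
bias_output = rng.random(output_size)

print(weights_input_hidden.shape)   # (2, 2)
print(weights_hidden_output.shape)  # (2, 1)
print(bias_hidden.shape)            # (2,)
print(bias_output.shape)            # (1,)
```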

---

#### 3.2. **Forward Pass (`forward` Method)**
```python
def forward(self, inputs):
    self.hidden_layer_input = np.dot(inputs, self.weights_input_hidden) + self.bias_hidden
    self.hidden_layer_output = sigmoid(self.hidden_layer_input)

    self.output_layer_input = np.dot(self.hidden_layer_output, self.weights_hidden_output) + self.bias_output
    self.output = sigmoid(self.output_layer_input)
    
    return self.output
```
- **Forward pass** calculates the output of the neural network by passing the input data through the layers.

Steps:
1. **Hidden Layer**:
   - Compute the input to the hidden layer: 
     - $ hidden\_layer\_input = inputs \cdot weights\_input\_hidden + bias\_hidden $
   - Apply the sigmoid activation function to get the output of the hidden layer: 
     - $ hidden\_layer\_output = sigmoid(hidden\_layer\_input) $
2. **Output Layer**:
   - Compute the input to the output layer:
     - $ output\_layer\_input = hidden\_layer\_output \cdot weights\_hidden\_output + bias\_output $
   - Apply the sigmoid activation function to get the network's final output: 
     - $ output = sigmoid(output\_layer\_input) $
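To see these steps produce something meaningful before any training happens, the sketch below runs the same forward computation with hand-picked weights. These values are an assumption chosen for illustration, not weights the network learned: the first hidden neuron behaves roughly like OR, the second like NAND, and the output neuron like AND of the two, which together give XOR.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Hand-picked (not learned) parameters for a 2-2-1 network.
weights_input_hidden = np.array([[20.0, -20.0],
                                 [20.0, -20.0]])
bias_hidden = np.array([-10.0, 30.0])
weights_hidden_output = np.array([[20.0],
                                  [20.0]])
bias_output = np.array([-30.0])

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Step 1: hidden layer
hidden_layer_input = inputs @ weights_input_hidden + bias_hidden
hidden_layer_output = sigmoid(hidden_layer_input)

# Step 2: output layer
output_layer_input = hidden_layer_output @ weights_hidden_output + bias_output
output = sigmoid(output_layer_input)

print(np.round(output).ravel())  # [0. 1. 1. 0.]
```

This shows that a 2-2-1 sigmoid network is able to represent XOR; training is the process of finding such weights automatically.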

---

#### 3.3. **Backward Pass (`backward` Method)**
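```python
def backward(self, inputs, actual_output, predicted_output):
    # Error and delta at the output layer
    output_error = actual_output - predicted_output
    output_delta = output_error * sigmoid_derivative(predicted_output)

    # Propagate the delta back through the hidden-to-output weights
    hidden_error = output_delta.dot(self.weights_hidden_output.T)
    hidden_delta = hidden_error * sigmoid_derivative(self.hidden_layer_output)

    # Update weights and biases (implicit learning rate of 1)
    self.weights_hidden_output += self.hidden_layer_output.T.dot(output_delta)
    self.bias_output += np.sum(output_delta, axis=0)

    self.weights_input_hidden += inputs.T.dot(hidden_delta)
    self.bias_hidden += np.sum(hidden_delta, axis=0)
```
- The **backward pass** adjusts the weights and biases using the error between the predicted and actual outputs. This process is called **backpropagation**.
- Because the error is defined as `actual_output - predicted_output`, the deltas already point in the direction that reduces the squared error, so the updates are added with `+=`. No learning rate appears in the code, which is equivalent to $ \eta = 1 $ in the update rule from section 2.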

Steps:
1. **Calculate Output Error**:
   - **`output_error`**: Difference between the **actual output** and the **predicted output** (i.e., the error).
   - **`output_delta`**: Error adjusted by the derivative of the sigmoid function. This helps determine how much each weight contributed to the error.
   
2. **Calculate Hidden Layer Error**:
   - **`hidden_error`**: Propagate the output error back to the hidden layer using the weights connecting the hidden layer to the output.
   - **`hidden_delta`**: Adjust the hidden error using the sigmoid derivative to calculate how much to adjust the weights between the input and hidden layer.

3. **Update Weights and Biases**:
   - Adjust the **weights and biases** by multiplying the deltas (errors) with the input or hidden outputs to update the connections:
     - **`weights_hidden_output`**: Adjusted using the output delta and the hidden layer's output.
     - **`weights_input_hidden`**: Adjusted using the hidden delta and the input values.
     - **Biases** are updated by summing the deltas.
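One way to gain confidence in these formulas is a gradient check: compare the backpropagated updates against a numerical derivative of the loss. The sketch below assumes the loss is half the summed squared error, $ L = \frac{1}{2} \sum (y - \hat{y})^2 $, which is the loss these deltas correspond to. The function names (`forward`, `loss`, `backprop_updates`, `numerical_gradient`) and the seed are hypothetical, introduced only for this check.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def forward(params, inputs):
    W1, b1, W2, b2 = params
    hidden = sigmoid(inputs @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output


def loss(params, inputs, labels):
    # Assumed loss: half the summed squared error.
    _, output = forward(params, inputs)
    return 0.5 * np.sum((labels - output) ** 2)


def backprop_updates(params, inputs, labels):
    """Amounts that backward() would add to W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, inputs)
    output_delta = (labels - output) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
    return [inputs.T @ hidden_delta, hidden_delta.sum(axis=0),
            hidden.T @ output_delta, output_delta.sum(axis=0)]


def numerical_gradient(params, inputs, labels, eps=1e-6):
    """Central-difference estimate of dL/dp for every parameter entry."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss(params, inputs, labels)
            p[idx] = old - eps
            minus = loss(params, inputs, labels)
            p[idx] = old  # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads


rng = np.random.default_rng(0)  # seed chosen arbitrarily
params = [rng.random((2, 2)), rng.random(2), rng.random((2, 1)), rng.random(1)]
inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([[0.0], [1.0], [1.0], [0.0]])

updates = backprop_updates(params, inputs, labels)
grads = numerical_gradient(params, inputs, labels)

# Each update should equal the negative gradient, so update + grad is ~0.
max_diff = max(np.max(np.abs(u + g)) for u, g in zip(updates, grads))
print(max_diff)
```

A very small `max_diff` indicates that adding the updates is the same as taking a gradient descent step with learning rate 1 on the assumed loss.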

---

#### 3.4. **Training (`train` Method)**
```python
def train(self, inputs, labels, epochs=10000):
    for _ in range(epochs):
        predicted_output = self.forward(inputs)
        self.backward(inputs, labels, predicted_output)
```
- **Training** runs the forward pass and backward pass for a given number of **epochs** (iterations).
- During each epoch, the neural network:
  - **Forward pass**: Predicts the output using the current weights.
  - **Backward pass**: Adjusts the weights and biases based on the error.
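It is often useful to watch the loss while training. The following sketch repeats the same forward and backward computations in a standalone function and records the mean squared error at every epoch. The function name `train_with_history`, the seeded generator, and the choice of mean squared error as the monitored quantity are assumptions added for illustration; they are not part of the `MLP` class above.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def train_with_history(inputs, labels, hidden_size=2, epochs=10000, seed=0):
    """Train a one-hidden-layer network and return the per-epoch MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.random((inputs.shape[1], hidden_size))
    b1 = rng.random(hidden_size)
    W2 = rng.random((hidden_size, labels.shape[1]))
    b2 = rng.random(labels.shape[1])

    history = []
    for _ in range(epochs):
        # Forward pass
        hidden = sigmoid(inputs @ W1 + b1)
        output = sigmoid(hidden @ W2 + b2)
        history.append(np.mean((labels - output) ** 2))

        # Backward pass (same deltas and updates as backward())
        output_delta = (labels - output) * output * (1 - output)
        hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
        W2 += hidden.T @ output_delta
        b2 += output_delta.sum(axis=0)
        W1 += inputs.T @ hidden_delta
        b1 += hidden_delta.sum(axis=0)
    return history


inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([[0], [1], [1], [0]])  # XOR gate

history = train_with_history(inputs, labels)
print(history[0], history[-1])  # loss at the first and last epoch
```

If the final loss stays well above zero, the network has likely not learned XOR for that initialization; trying a different seed or more hidden neurons is a reasonable next step.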

---

### 4. **Training Data**
```python
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([[0], [1], [1], [0]])  # XOR gate
```
- **Inputs**: These are the inputs to the neural network. Each input pair represents the possible values of a binary XOR gate.
  - XOR gate: 
    - $ 0 \oplus 0 = 0 $
    - $ 0 \oplus 1 = 1 $
    - $ 1 \oplus 0 = 1 $
    - $ 1 \oplus 1 = 0 $
- **Labels**: These are the expected outputs (or ground truth) for the XOR gate.

---

### 5. **Creating and Training the MLP**
```python
mlp = MLP(input_size=2, hidden_size=2, output_size=1)
mlp.train(inputs, labels)
```
- **Create the MLP** with:
  - **2 input neurons** (because the XOR problem has 2 inputs),
  - **2 hidden neurons**, and
  - **1 output neuron** (since XOR has a binary output).
- **Train the MLP** using the input data and labels.

---

### 6. **Testing the MLP**
```python
print(mlp.forward(np.array([0, 0])))  # Expected output: ~0
print(mlp.forward(np.array([1, 1])))  # Expected output: ~0
```
- After training, test the MLP on the inputs:
  - For input `[0, 0]`, the expected output is approximately `0`.
  - For input `[1, 1]`, the expected output is approximately `0`.

---

### Key Concepts:
- **Multi-Layer Perceptron (MLP)**: A feedforward neural network with one or more hidden layers. It can solve non-linear problems like XOR.
- **Forward Pass**: The process of passing inputs through the network to get the output.
- **Backward Pass**: The process of calculating errors and adjusting weights to minimize them.
- **Training**: Repeatedly running forward and backward passes to make the model learn from data.



The function $ \sigma(x) = \frac{1}{1 + e^{-x}} $ is known as the **sigmoid function**, commonly used in machine learning, especially in neural networks.

---


### Derivative of the Sigmoid Function

To find the derivative $ \frac{d}{dx} \sigma(x) $, let us work through it step by step.

### Step 1: Identify the form of the function
The function $ \sigma(x) = \frac{1}{1 + e^{-x}} $ is a composite function of the form:

$$
\sigma(x) = f(g(x)) \quad \text{where} \quad f(u) = \frac{1}{u} \quad \text{and} \quad g(x) = 1 + e^{-x}.
$$

We'll need to use the **chain rule** to differentiate this.

### Step 2: Differentiate $ f(u) = \frac{1}{u} $
First, find the derivative of $ f(u) = \frac{1}{u} $:

$$
\frac{d}{du} \left( \frac{1}{u} \right) = -\frac{1}{u^2}.
$$

We'll apply this result after differentiating $ g(x) $.

### Step 3: Differentiate $ g(x) = 1 + e^{-x} $
Now, find the derivative of $ g(x) = 1 + e^{-x} $:

$$
\frac{d}{dx} \left( 1 + e^{-x} \right) = 0 + (-e^{-x}) = -e^{-x}.
$$

### Step 4: Apply the chain rule
Evaluating $ \frac{d}{du} f(u) $ at $ u = g(x) = 1 + e^{-x} $, the derivative of $ \sigma(x) $ is:

$$
\frac{d}{dx} \sigma(x) = \frac{d}{du} f(u) \cdot \frac{d}{dx} g(x) = -\frac{1}{(1 + e^{-x})^2} \cdot (-e^{-x}).
$$

### Step 5: Simplify the expression
Now, simplify the result:

$$
\frac{d}{dx} \sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2}.
$$

### Step 6: Express the result in terms of $ \sigma(x) $
Since $ \sigma(x) = \frac{1}{1 + e^{-x}} $, we have:

$$
1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}.
$$

Factoring the result of Step 5 into these two pieces gives:

$$
\frac{d}{dx} \sigma(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) (1 - \sigma(x)).
$$

### Final Result:
The derivative of the sigmoid function is:

$$
\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x)).
$$

This is a key property of the sigmoid function and is commonly used in backpropagation in neural networks.
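The result can be sanity-checked numerically. The sketch below compares $ \sigma(x)(1 - \sigma(x)) $ against a central finite-difference approximation of the derivative; the sample points and the step size `h` are arbitrary choices made for this check.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


x = np.linspace(-5, 5, 11)  # sample points -5, -4, ..., 5
h = 1e-5                    # finite-difference step

numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))

print(np.max(np.abs(numeric - analytic)))  # largest disagreement
print(analytic[5])  # derivative at x = 0, where sigma(0) = 0.5, is 0.25
```

The derivative peaks at $ x = 0 $ with value $ 0.25 $ and shrinks toward zero for large $ |x| $, which is why sigmoid units learn slowly when their inputs are far from zero.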



```python
import numpy as np


# Activation functions and their derivatives
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    # x is expected to be a sigmoid output (an activation), not the raw input
    return x * (1 - x)


# MLP class
class MLP:
    def __init__(self, input_size, hidden_size, output_size):
        self.weights_input_hidden = np.random.rand(input_size, hidden_size)
        self.weights_hidden_output = np.random.rand(hidden_size, output_size)
        self.bias_hidden = np.random.rand(hidden_size)
        self.bias_output = np.random.rand(output_size)

    def forward(self, inputs):
        self.hidden_layer_input = (
            np.dot(inputs, self.weights_input_hidden) + self.bias_hidden
        )
        self.hidden_layer_output = sigmoid(self.hidden_layer_input)

        self.output_layer_input = (
            np.dot(self.hidden_layer_output, self.weights_hidden_output)
            + self.bias_output
        )
        self.output = sigmoid(self.output_layer_input)

        return self.output

    def backward(self, inputs, actual_output, predicted_output):
        # Calculate error
        output_error = actual_output - predicted_output
        output_delta = output_error * sigmoid_derivative(predicted_output)

        hidden_error = output_delta.dot(self.weights_hidden_output.T)
        hidden_delta = hidden_error * sigmoid_derivative(self.hidden_layer_output)

        # Update weights and biases (implicit learning rate of 1)
        self.weights_hidden_output += self.hidden_layer_output.T.dot(output_delta)
        self.bias_output += np.sum(output_delta, axis=0)

        self.weights_input_hidden += inputs.T.dot(hidden_delta)
        self.bias_hidden += np.sum(hidden_delta, axis=0)

    # Note: the default here is 100 epochs; the walkthrough above uses 10000
    def train(self, inputs, labels, epochs=100):
        for _ in range(epochs):
            predicted_output = self.forward(inputs)
            self.backward(inputs, labels, predicted_output)

    def predict(self, inputs):
        return self.forward(inputs)


# Sample training data
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([[0], [1], [1], [0]])  # XOR gate

# Create MLP and train
mlp = MLP(input_size=2, hidden_size=2, output_size=1)
mlp.train(inputs, labels)

# Test the MLP
print(mlp.predict(np.array([0, 0])))  # XOR target: 0
print(mlp.predict(np.array([1, 0])))  # XOR target: 1
```

```
[0.49618505]
[0.50232594]
```



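In this code:
- The network has 2 input neurons, 2 hidden neurons, and 1 output neuron.
- It uses the **sigmoid** activation function in both the hidden and output layers.
- The **backward** function computes the gradients and updates the weights using backpropagation.
- `train` defaults to only 100 epochs here, which is why both recorded predictions are still close to 0.5: the network has not learned XOR yet. The walkthrough above uses `epochs=10000`. More epochs usually help, although with only two hidden neurons the result can depend on the random initialization.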

#### 4. **Working with Real Data (MNIST Example)**
Now that we’ve coded an MLP from scratch, let’s move to a more complex task using real-world data.

We will use **TensorFlow/Keras** to simplify the implementation and train a network to recognize handwritten digits using the **MNIST** dataset.
