# MLP breakdown

## Introduction

#### What is a Multilayer Perceptron (MLP)?

A **Multilayer Perceptron (MLP)** is a class of **feedforward artificial neural network** composed of multiple layers of interconnected neurons. It maps input vectors $\mathbf{x} \in \mathbb{R}^n$ to output predictions $\hat{\mathbf{y}} \in \mathbb{R}^m$ through a sequence of **learned linear transformations** followed by **nonlinear activations**.

An MLP typically consists of:

1. **Input Layer**:  
    Receives the input features. No computations are performed here—this layer simply passes the data to the first hidden layer.
    
2. **One or More Hidden Layers**:  
    Each hidden layer computes a transformation of the form:
    
    $\mathbf{a}^{(l)}=\sigma\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$
    
    where:
    
    - $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the learnable weight matrix and bias vector
        
    - $\sigma(\cdot)$ is a nonlinear activation function (e.g., sigmoid, ReLU)
        
    - $\mathbf{a}^{(l-1)}$ is the activation from the previous layer
        
3. **Output Layer**:  
    Computes the final prediction. The activation here may depend on the task (e.g., softmax for classification, linear for regression).
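
The layer recursion above can be sketched in a few lines of NumPy. This is only an illustration, not the implementation built later in this notebook: the helper name `forward_pass`, the layer sizes (3 → 4 → 2) and the random parameters are assumptions made for the example, and it follows the column-vector convention of the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, params):
    """Apply a^(l) = sigma(W^(l) a^(l-1) + b^(l)) for each (W, b) pair in params."""
    a = x
    for W, b in params:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 2 outputs.
params = [
    (rng.standard_normal((4, 3)), rng.standard_normal((4, 1))),
    (rng.standard_normal((2, 4)), rng.standard_normal((2, 1))),
]
x = rng.standard_normal((3, 1))  # one input as a column vector
y_hat = forward_pass(x, params)
print(y_hat.shape)  # (2, 1)
```

Because every layer here ends in a sigmoid, each output component lies strictly between 0 and 1.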

#### Learning

The MLP is trained to minimize a **loss function** $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ over the training set by adjusting weights and biases using **backpropagation** combined with **gradient descent**. Gradients of the loss are propagated backwards through the network using the chain rule.
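
To make the chain rule concrete, here is a minimal sketch of gradient descent on a single sigmoid neuron. The model $\hat{y} = \sigma(w x)$, the data point, the initial weight and the learning rate are all illustrative assumptions, not values taken from the rest of this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-neuron model: y_hat = sigmoid(w * x), loss = 0.5 * (y_hat - y)^2.
x, y = 2.0, 1.0
w, eta = 0.0, 0.5  # initial weight and learning rate (illustrative values)

losses = []
for _ in range(50):
    z = w * x
    y_hat = sigmoid(z)
    losses.append(0.5 * (y_hat - y) ** 2)
    # Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
    grad_w = (y_hat - y) * y_hat * (1.0 - y_hat) * x
    w -= eta * grad_w  # gradient descent step

print(losses[0], losses[-1])  # the loss starts at 0.125 and shrinks
```

Backpropagation in a full network applies the same product of local derivatives, one layer at a time.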

#### Credits

This Multilayer Perceptron (MLP) implementation is inspired by and based on **Omar Aflak’s** excellent work on building a neural network from scratch in Python. His clear, step-by-step Medium article and accompanying GitHub repository laid the foundation for this simple implementation:

- 📖 [Medium Article: “Neural Network From Scratch in Python”](https://medium.com/data-science/math-neural-network-from-scratch-in-python-d6da9f29ce65)  
- 💻 [GitHub Repo: OmarAflak/Medium-Python-Neural-Network](https://github.com/OmarAflak/Medium-Python-Neural-Network)


## Imports

In [None]:
import numpy as np

## Simple Implementation

### Activation Functions

For the purpose of this notebook, we are going to use the **sigmoid function** as our activation function.

The sigmoid is a smooth, S-shaped function that maps any real-valued number into the range (0, 1).

The formula for the sigmoid function is:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Its derivative, which is needed during backpropagation, is:

$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$

In [None]:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x)*(1.0 - sigmoid(x))
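
A quick way to gain confidence in the derivative is to compare it with a central finite difference. The sketch below repeats the two functions so that it runs on its own; the grid of test points and the step size `h` are arbitrary choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5  # step size for the central difference (an illustrative choice)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(x)))
print(max_abs_diff)  # should be tiny compared with the derivative values
```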

### Loss functions

To measure how well our neural network is performing, we will use the **Mean Squared Error (MSE)** loss function.

The MSE is a standard loss function for regression problems. It computes the average of the squared differences between the predicted and the actual values:

$\text{MSE}(y, \hat{y}) = \frac{1}{2n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$

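We include the factor of 1/2 for mathematical convenience: it cancels the 2 produced by differentiating the square. The exact gradient with respect to each prediction is $\frac{\partial \text{MSE}}{\partial \hat{y}_i} = \frac{1}{n}(\hat{y}_i - y_i)$. The implementation below drops the constant factor $\frac{1}{n}$, which only rescales the learning rate, and uses:

$\frac{\partial \text{MSE}}{\partial \hat{y}} \propto \hat{y} - y$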


In [None]:
def mse(y_true, y_pred):
    return (0.5*(y_true - y_pred)**2).mean()

def mse_prime(y_true, y_pred):
    return y_pred - y_true
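
The following sketch evaluates both functions on a small made-up example and compares `mse_prime` with a finite-difference gradient of `mse`. Since `mse` averages over the $n$ components while `mse_prime` returns the plain difference, the numerical gradient has to be multiplied by $n$ before the two agree. The sample values and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return (0.5 * (y_true - y_pred) ** 2).mean()

def mse_prime(y_true, y_pred):
    return y_pred - y_true

y_true = np.array([[1.0, 0.0, 0.0]])
y_pred = np.array([[0.8, 0.1, 0.3]])
n = y_pred.size

# Central-difference estimate of d(mse)/d(y_pred[i]) for each component.
h = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(n):
    step = np.zeros_like(y_pred)
    step[0, i] = h
    numeric[0, i] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2.0 * h)

loss_value = mse(y_true, y_pred)
print(loss_value)                  # mean of 0.5 * squared differences
print(mse_prime(y_true, y_pred))   # y_pred - y_true
print(n * numeric)                 # agrees with mse_prime up to rounding
```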

### Activation Layers

The **ActivationLayer** applies a non-linear activation function element-wise to its input. In our case, we use the **sigmoid** activation function defined above.

This operation introduces non-linearity into the network, enabling it to model complex, non-linear decision boundaries. Without such functions, any stack of fully connected layers would collapse into a single affine map, because the composition of affine maps is itself affine.

---

#### ⏩ Forward Pass:

For a given input vector $\mathbf{z} \in \mathbb{R}^{1 \times m}$, the activation is applied element-wise:

$\mathbf{a} = \sigma(\mathbf{z})$

where $\mathbf{a} \in \mathbb{R}^{1 \times m}$ is the output activation vector.

---

#### ⏪ Backward Pass:

Given the upstream gradient from the next layer $\frac{\partial \mathcal{L}}{\partial \mathbf{a}}$, we compute the local gradient of the activation:

$\sigma'(z_i) = \sigma(z_i)(1 - \sigma(z_i)) \quad \text{for each component } z_i \in \mathbf{z}$

The resulting gradient with respect to the input $\mathbf{z}$ is:

$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}} \circ \sigma'(\mathbf{z})$

Here, $\circ$ denotes the Hadamard product. This propagates gradients through the activation nonlinearity in the backward pass.


In [None]:
class ActivationLayer:
    def forward(self, input_data):
        self.input = input_data
        return sigmoid(input_data)

    def backward(self, output_error):
        return sigmoid_prime(self.input) * output_error
    
    def step(self, eta):
        return
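
As a small usage sketch, the snippet below pushes one row vector through the layer and back. It repeats the definitions so that it runs on its own; the input values and the all-ones upstream gradient are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

class ActivationLayer:
    def forward(self, input_data):
        self.input = input_data  # cache z for the backward pass
        return sigmoid(input_data)

    def backward(self, output_error):
        return sigmoid_prime(self.input) * output_error

    def step(self, eta):
        return  # no trainable parameters

layer = ActivationLayer()
z = np.array([[-2.0, 0.0, 2.0]])
a = layer.forward(z)
upstream = np.ones_like(z)          # pretend dL/da is 1 for every component
grad_z = layer.backward(upstream)   # equals sigmoid_prime(z) here
print(a)
print(grad_z)
```

The gradient is largest at $z = 0$, where it equals 0.25, and shrinks symmetrically as $|z|$ grows.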

### Fully Connected Layer

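The **FullyConnectedLayer** (dense layer) implements an affine transformation. Unlike the column-vector formula in the introduction, the implementation treats each sample as a row vector, so the weight matrix multiplies from the right: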

$\mathbf{z} = \mathbf{x} \cdot \mathbf{W} + \mathbf{b}$

where:

- $\mathbf{x} \in \mathbb{R}^{1 \times n}$: input row vector
    
- $\mathbf{W} \in \mathbb{R}^{n \times m}$: weight matrix
    
- $\mathbf{b} \in \mathbb{R}^{1 \times m}$: bias vector
    
- $\mathbf{z} \in \mathbb{R}^{1 \times m}$: output of the layer (before activation)
    

This operation maps the input from an $n$-dimensional space to an $m$-dimensional space.

---

#### ⏩ Forward Pass:

Given input vector $\mathbf{x}$, the layer computes:

$\mathbf{z} = \mathbf{x} \cdot \mathbf{W} + \mathbf{b}$

The resulting pre-activation vector $\mathbf{z}$ is then passed to the activation layer that follows.

---

#### ⏪ Backward Pass:

Assuming we are given an upstream gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{z}}$ from the loss $\mathcal{L}$, we compute:

- **Input Gradient** (to propagate to previous layer):
    

$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}} \cdot \mathbf{W}^\top$

- **Weights Gradient** (for update step):
    

$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \mathbf{x}^\top \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{z}}$

- **Bias Gradient**:
    

$\frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}}$

These partial derivatives are accumulated across all samples in the mini-batch. The `step()` method then applies the parameter update via averaged gradients:

$\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \frac{1}{B} \sum_{i=1}^B \frac{\partial \mathcal{L}^{(i)}}{\partial \mathbf{W}}, \quad \mathbf{b} \leftarrow \mathbf{b} - \eta \cdot \frac{1}{B} \sum_{i=1}^B \frac{\partial \mathcal{L}^{(i)}}{\partial \mathbf{b}}$

where:

- $\eta$: learning rate
    
- $B$: batch size

In [1]:
class FullyConnectedLayer:
    def __init__(self, input_size, output_size):
        self.delta_w = np.zeros((input_size, output_size))
        self.delta_b = np.zeros((1,output_size))
        self.passes = 0

        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

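    def forward(self, input_data):
        # Promote 1-D samples of shape (n,) to row vectors of shape (1, n) so that
        # backward() can form the product input.T @ output_error.
        self.input = np.atleast_2d(input_data)
        return np.dot(self.input, self.weights) + self.bias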

    def backward(self, output_error):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)

        self.delta_w += weights_error
        self.delta_b += output_error
        self.passes += 1
        return input_error

    def step(self, eta):
        self.weights -= eta * self.delta_w / self.passes
        self.bias -= eta * self.delta_b / self.passes

        self.delta_w = np.zeros(self.weights.shape)
        self.delta_b = np.zeros(self.bias.shape)
        self.passes = 0
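
The backward-pass formulas can be checked numerically. The sketch below repeats the forward and backward logic of the layer, runs one backward pass for a squared-error loss, and compares the accumulated weight gradient with central finite differences. The loss, the layer sizes, the seed and the step size `h` are assumptions chosen for the check.

```python
import numpy as np

class FullyConnectedLayer:
    def __init__(self, input_size, output_size):
        self.delta_w = np.zeros((input_size, output_size))
        self.delta_b = np.zeros((1, output_size))
        self.passes = 0
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    def forward(self, input_data):
        self.input = input_data
        return np.dot(self.input, self.weights) + self.bias

    def backward(self, output_error):
        input_error = np.dot(output_error, self.weights.T)
        self.delta_w += np.dot(self.input.T, output_error)
        self.delta_b += output_error
        self.passes += 1
        return input_error

np.random.seed(0)
layer = FullyConnectedLayer(3, 2)
x = np.random.rand(1, 3)       # one sample as a row vector
target = np.random.rand(1, 2)  # hypothetical target for the check

def loss():
    # Squared-error loss on the raw layer output; its gradient w.r.t. z is z - target.
    return 0.5 * np.sum((layer.forward(x) - target) ** 2)

# Analytic gradients from a single backward pass.
input_grad = layer.backward(layer.forward(x) - target)

# Central finite differences with respect to each weight.
h = 1e-6
numeric_w = np.zeros_like(layer.weights)
for i in range(3):
    for j in range(2):
        layer.weights[i, j] += h
        plus = loss()
        layer.weights[i, j] -= 2 * h
        minus = loss()
        layer.weights[i, j] += h  # restore the original weight
        numeric_w[i, j] = (plus - minus) / (2 * h)

print(input_grad.shape, layer.delta_w.shape, layer.delta_b.shape)  # (1, 3) (3, 2) (1, 2)
print(np.max(np.abs(numeric_w - layer.delta_w)))
```

The shapes match the formulas: the input gradient has the shape of $\mathbf{x}$, the weight gradient the shape of $\mathbf{W}$, and the bias gradient the shape of $\mathbf{b}$.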


### MLP Network

The `Network` class encapsulates the full structure and training loop of a feedforward neural network. It manages the sequence of layers, orchestrates forward and backward propagation, and handles parameter updates via mini-batch gradient descent.

---

#### 🔮 Predictions

Predictions are generated using the `predict` method, which computes the network output by successively applying the forward operation of each layer.
Given an input sample $\mathbf{x}$, the output is computed as:

$\hat{\mathbf{y}} = L_n \circ L_{n-1} \circ \dots \circ L_1(\mathbf{x}) \equiv \mathbf{a}^{(n)}$

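where $L_k$ is the forward operation of the $k$-th layer and $\mathbf{a}^{(n)}$ denotes the activation of the final layer. Here $\circ$ denotes function composition, not the Hadamard product used in the backward-pass formulas.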

---

#### 🧠 Training

Training is handled by the `fit` method. In each iteration it draws a mini-batch of $B$ samples at random from the training dataset $\mathcal{D}$:

$\mathcal{B} = \text{RandomBatch}(\mathcal{D}, B) = \{ (\mathbf{x}^{(i)}, \mathbf{y}^{(i)}) \}_{i=1}^B$
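
One possible way to write the sampling step is sketched below. The helper name `random_batch` and the toy arrays are assumptions made for illustration. It draws distinct indices and caps the batch at the dataset size, so inputs and labels stay aligned even when the requested batch is larger than the dataset.

```python
import numpy as np

def random_batch(x, y, batch_size, rng):
    """Draw min(batch_size, len(x)) distinct samples uniformly at random."""
    n = x.shape[0]
    idx = rng.permutation(n)[:min(batch_size, n)]
    return x[idx], y[idx]

rng = np.random.default_rng(0)
x = np.arange(20).reshape(10, 1, 2)  # 10 hypothetical samples, each a (1, 2) row vector
y = np.arange(10).reshape(10, 1, 1)  # matching labels
x_batch, y_batch = random_batch(x, y, 4, rng)
print(x_batch.shape, y_batch.shape)  # (4, 1, 2) (4, 1, 1)
```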

For each sample $\mathbf{x}^{(i)}$, a **forward pass** is performed through the network to compute the prediction:

$\hat{\mathbf{y}}^{(i)} = f(\mathbf{x}^{(i)}; \theta)$

where $\theta$ represents the collection of all trainable parameters.

The output is compared with the label $\mathbf{y}^{(i)}$, and the **loss** for the sample is computed using the mean squared error (MSE) function. Once the forward pass is complete, a **backward pass** propagates the error gradients back through the network. For each layer, we compute the partial derivatives: $\frac{\partial \mathcal{L}^{(i)}}{\partial \theta}$, which measure how the loss changes with respect to the layer’s parameters. These gradients are accumulated over the entire mini-batch. 

After all $B$ samples in the mini-batch $\mathcal{B}$ are processed, the parameters are updated by a gradient descent step using the average of the accumulated gradients:

$\theta \leftarrow \theta - \eta \cdot \frac{1}{B} \sum_{i=1}^B \nabla_\theta \mathcal{L}^{(i)}$

where $\eta$ is the learning rate.

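These steps are repeated for the specified number of epochs $E$ (the `epoches` argument of `fit`). Note that in this implementation each epoch processes a single randomly drawn mini-batch rather than a full pass over the dataset.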

In [4]:
class Network:
    def __init__(self, verbose=True):
        self.verbose = verbose
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def predict(self, input_data):
        result = []
        for i in range(input_data.shape[0]):
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward(output)
            result.append(output)
        return result

    def fit(self, x_train, y_train, epoches, learning_rate, batch_size=64):
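        # Cap the batch size at the dataset size; otherwise the sample loop below
        # would index past the end of the batch for small datasets.
        batch_size = min(batch_size, x_train.shape[0])
        for i in range(epoches):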
            err = 0

            idx = np.argsort(np.random.random(x_train.shape[0]))[:batch_size]
            x_batch = x_train[idx]
            y_batch = y_train[idx]

            for j in range(batch_size):
                output = x_batch[j]
                for layer in self.layers:
                    output = layer.forward(output)

                err += mse(y_batch[j], output)

                error = mse_prime(y_batch[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward(error)
            
            for layer in self.layers:
                layer.step(learning_rate)

            if (self.verbose) and ((i%10) == 0):
                err /= batch_size
                print('epoch: %5d/%d   error=%0.9f' % (i, epoches, err))
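
To show how the pieces fit together, here is a hedged end-to-end sketch. It restates the classes from this notebook in condensed form (without the verbose logging, and with the batch size capped at the dataset size) so that it runs on its own. The OR truth table, the 2-3-1 architecture, the seed and the hyperparameters are illustrative assumptions. Each sample is stored as a `(1, n)` row vector, which is the shape the layers expect.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mse(y_true, y_pred):
    return (0.5 * (y_true - y_pred) ** 2).mean()

class ActivationLayer:
    def forward(self, input_data):
        self.input = input_data
        return sigmoid(input_data)

    def backward(self, output_error):
        s = sigmoid(self.input)
        return s * (1.0 - s) * output_error

    def step(self, eta):
        return

class FullyConnectedLayer:
    def __init__(self, input_size, output_size):
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5
        self.delta_w = np.zeros_like(self.weights)
        self.delta_b = np.zeros_like(self.bias)
        self.passes = 0

    def forward(self, input_data):
        self.input = input_data
        return np.dot(self.input, self.weights) + self.bias

    def backward(self, output_error):
        self.delta_w += np.dot(self.input.T, output_error)
        self.delta_b += output_error
        self.passes += 1
        return np.dot(output_error, self.weights.T)

    def step(self, eta):
        self.weights -= eta * self.delta_w / self.passes
        self.bias -= eta * self.delta_b / self.passes
        self.delta_w = np.zeros_like(self.weights)
        self.delta_b = np.zeros_like(self.bias)
        self.passes = 0

class Network:
    def __init__(self):
        self.layers = []

    def add(self, layer):
        self.layers.append(layer)

    def predict(self, input_data):
        result = []
        for sample in input_data:
            output = sample
            for layer in self.layers:
                output = layer.forward(output)
            result.append(output)
        return result

    def fit(self, x_train, y_train, epoches, learning_rate, batch_size=64):
        n = min(batch_size, x_train.shape[0])
        for _ in range(epoches):
            idx = np.random.permutation(x_train.shape[0])[:n]
            for x, y in zip(x_train[idx], y_train[idx]):
                output = x
                for layer in self.layers:
                    output = layer.forward(output)
                error = output - y  # same expression as mse_prime
                for layer in reversed(self.layers):
                    error = layer.backward(error)
            for layer in self.layers:
                layer.step(learning_rate)

np.random.seed(0)
x_train = np.array([[[0, 0]], [[0, 1]], [[1, 0]], [[1, 1]]], dtype=float)
y_train = np.array([[[0]], [[1]], [[1]], [[1]]], dtype=float)  # OR truth table

net = Network()
net.add(FullyConnectedLayer(2, 3))
net.add(ActivationLayer())
net.add(FullyConnectedLayer(3, 1))
net.add(ActivationLayer())

def mean_error(network):
    preds = network.predict(x_train)
    return np.mean([mse(y, p) for y, p in zip(y_train, preds)])

error_before = mean_error(net)
net.fit(x_train, y_train, epoches=500, learning_rate=0.5, batch_size=4)
error_after = mean_error(net)
print(error_before, error_after)  # the error should be lower after training
```

With `batch_size=4` every iteration sees all four samples, so each update is a full-batch gradient descent step on this toy problem.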