# Multi-Layer Perceptron (MLP)

## Libraries

import torch
import torchvision.datasets as datasets
from torchvision.transforms import ToTensor
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

## MLP

<div class="alert alert-block alert-info">
    
The **multi-layer perceptron (MLP)** is a feedforward neural network composed of successive layers (cf. Figure below).

<img src="files/figures/MLP.jpg" width="600px"/>
 
The dynamics of an MLP are given by the following equations (sample and batch versions):


$$
\begin{array}{ll}
\textbf{sample $\boldsymbol{x}$} & \textbf{batch $\boldsymbol{X_i}$} \\
\begin{cases}
\boldsymbol{a^{[0]}} ~=~ \boldsymbol{x} & \\
\boldsymbol{z^{[l]}} ~=~ \boldsymbol{W^{[l]}} \boldsymbol{a^{[l-1]}} + \boldsymbol{b^{[l]}}, & l = 1, \dots, L \\
\boldsymbol{a^{[l]}} ~=~ \boldsymbol{\sigma} \left( \boldsymbol{z^{[l]}} \right), & l = 1, \dots, L
\end{cases}
~&~
\begin{cases}
\boldsymbol{A^{[0]}} ~=~ \boldsymbol{X_i}	\\
\boldsymbol{Z^{[l]}} ~=~ \boldsymbol{W^{[l]}} \boldsymbol{A^{[l-1]}} \oplus \boldsymbol{b^{[l]}}, & l = 1, \dots, L \\
\boldsymbol{A^{[l]}} ~=~ \boldsymbol{\sigma} \big( \boldsymbol{Z^{[l]}} \big), & l = 1, \dots, L
\end{cases}
\end{array}
$$

</div>
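
To make the batch equation concrete, here is a minimal sketch of a single layer update. It assumes that $\oplus$ denotes a broadcast sum (the bias is added to every column) and that the batch stores one sample per column; the sizes `n_in`, `n_out` and `batch` are arbitrary illustration values.

```python
import torch

torch.manual_seed(0)
n_in, n_out, batch = 4, 3, 5  # illustrative sizes, not taken from the exercise

W = torch.normal(0.0, 1.0, size=(n_out, n_in))       # W^{[l]}
b = torch.normal(0.0, 1.0, size=(n_out, 1))          # b^{[l]} stored as a column vector
A_prev = torch.normal(0.0, 1.0, size=(n_in, batch))  # A^{[l-1]}, one sample per column

# Batch version: broadcasting adds b to each column of W @ A_prev
Z = W @ A_prev + b
A = torch.tanh(Z)

# Sample version applied to the first column only
z0 = W @ A_prev[:, 0:1] + b

print(Z.shape)                       # shape (n_out, batch)
print(torch.allclose(Z[:, 0:1], z0)) # the batch result agrees column by column
```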

- Define a class `MLP()` which takes a list `[n1, n2, ..., nL]` as parameter and creates an MLP with $L$ layers of $n_i$ neurons each, for $i= 1, \dots, L$.
- Initialize the weight matrices $\boldsymbol{W^{[l]}}$ and the bias vectors $\boldsymbol{b^{[l]}}$ randomly from a normal distribution $\mathcal{N}(0, 1)$ (`torch.normal()`).
- The first layer is the input layer and thus has no biases.

- Add a method `forward(X)` which takes a batch of vectors `X` as inputs (2D tensor), and computes the forward pass of the network on this batch.
- For the activation function $\sigma$, use `tanh`.
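
One possible way to approach the steps above is sketched below. It is not the reference solution: the attribute names `sizes`, `weights` and `biases` are assumptions made for illustration, and the batch is assumed to be laid out as features $\times$ samples, consistent with the batch equations.

```python
import torch


class MLP:
    """Sketch of an MLP with N(0, 1) initialization and tanh activations."""

    def __init__(self, sizes):
        # sizes = [n1, n2, ..., nL]; n1 is the input layer (no weights or biases)
        self.sizes = list(sizes)
        self.weights = []
        self.biases = []
        for n_in, n_out in zip(self.sizes[:-1], self.sizes[1:]):
            self.weights.append(torch.normal(0.0, 1.0, size=(n_out, n_in)))
            self.biases.append(torch.normal(0.0, 1.0, size=(n_out, 1)))

    def forward(self, X):
        # X has shape (n1, batch_size): one sample per column
        A = X
        for W, b in zip(self.weights, self.biases):
            A = torch.tanh(W @ A + b)  # b is broadcast over the columns
        return A


net = MLP([784, 128, 128, 10])
out = net.forward(torch.rand(784, 64))
print(out.shape)  # one 10-dimensional output per column
```

Storing each bias as an `(n, 1)` column lets broadcasting implement the $\oplus$ of the batch equation without an explicit loop.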

## Application to the MNIST Dataset

The **MNIST dataset** consists of handwritten digits. The MNIST classification problem consists in predicting the correct digit represented on an image.

<img src="files/figures/mnist.png" width="600px"/>

- Load the train and test MNIST datasets using the following commands:
```
train_set = datasets.MNIST(root='./data', train=True, download=True, transform=ToTensor())
test_set = datasets.MNIST(root='./data', train=False, download=True, transform=ToTensor())
```
Each sample consists of a tensor (the image, encoded in grayscale) and a label (the digit that it represents).
- Examine the train and test sets.
- Visualize some data samples (tensors) using `plt.imshow()`.

Each image is a $28 \times 28$ grid of pixels representing a handwritten digit; with `ToTensor()`, it is stored as a $1 \times 28 \times 28$ tensor (one channel). Note that the image can be "flattened" into a 1D vector of size $28 \cdot 28 = 784$ using the method `flatten()`.
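
As a quick illustration of flattening, the sketch below uses a random tensor as a stand-in for one MNIST image, so that it runs without downloading the dataset. The variable names are hypothetical.

```python
import torch

torch.manual_seed(0)

# Stand-in for one image, with the 1 x 28 x 28 layout produced by ToTensor()
image = torch.rand(1, 28, 28)

flat = image.flatten()  # row-major flattening into a 1D vector
print(image.shape, flat.shape)

# Pixel (row 1, column 0) ends up at index 1 * 28 + 0 = 28
print(flat[28] == image[0, 1, 0])

# To display the image, the channel dimension can be dropped first, e.g.:
# plt.imshow(image.squeeze(), cmap="gray")
```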

A **dataloader** creates batches of samples from a dataset so that they can be passed into a model.
- Create a train and test dataloader using the following commands:
```
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=True)
```
- Note that dataloaders are not subscriptable.
- Fetch one batch from the dataloader (for instance by iterating over it) and examine it.
- Write a function that reshapes a batch of size $64 \times 1 \times 28 \times 28$ into a tensor of size $784 \times 64$.<br>
(use `torch.squeeze()`, `torch.reshape()`, `torch.flatten()`, `torch.transpose()`, etc.)
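
The sketch below shows one way to fetch a batch and reshape it. It uses a small random `TensorDataset` as a stand-in for MNIST so that it runs offline; the names `fake_set`, `fake_loader` and `batch_to_columns` are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def batch_to_columns(batch):
    """Reshape a (B, 1, 28, 28) batch into a (784, B) matrix, one sample per column."""
    return torch.flatten(batch, start_dim=1).T


torch.manual_seed(0)
# Random stand-in with the same layout as MNIST samples
fake_set = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
fake_loader = DataLoader(fake_set, batch_size=64, shuffle=True)

# A dataloader cannot be indexed, but it can be iterated
images, labels = next(iter(fake_loader))
X = batch_to_columns(images)
print(images.shape, labels.shape, X.shape)
```

Flattening from `start_dim=1` keeps the batch dimension intact, and the transpose then places each sample in its own column, as the batch equations expect.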

- Instantiate a 4-layer MLP with the following characteristics:
    - Layer 1 (or input layer): size 784
    - Layer 2: size 128
    - Layer 3: size 128
    - Layer 4 (or output layer): size 10

- Pass all train samples through your network batch by batch:<br>
Create a function `process_data(dataloader, network)` that performs this.
- Gather all the outputs into 1 tensor.
- Take the argmax of the outputs to obtain the predictions.
- Get the classification report associated with your predictions and the true labels:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- What can you conclude?

**Obviously, the network is untrained, and thus does not perform better than chance (10%)!**
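
A possible shape for `process_data` is sketched below. To stay self-contained it uses a random dataset and a hypothetical one-layer `RandomNet` in place of MNIST and the `MLP` class; any object with a `forward(X)` method taking a $784 \times B$ batch would work the same way. Returning the labels together with the outputs is a design choice of this sketch: with `shuffle=True`, the order of samples changes at every pass, so the labels have to be collected in the same loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import classification_report


class RandomNet:
    """Hypothetical stand-in for the MLP: a single random tanh layer."""

    def __init__(self):
        self.W = torch.normal(0.0, 1.0, size=(10, 784))

    def forward(self, X):
        return torch.tanh(self.W @ X)


def process_data(dataloader, network):
    """Run every batch through the network; return outputs (10, N) and labels (N,)."""
    outputs, labels = [], []
    for images, targets in dataloader:
        X = torch.flatten(images, start_dim=1).T  # (784, B)
        outputs.append(network.forward(X))        # (10, B)
        labels.append(targets)
    return torch.cat(outputs, dim=1), torch.cat(labels)


torch.manual_seed(0)
fake_set = TensorDataset(torch.rand(200, 1, 28, 28), torch.randint(0, 10, (200,)))
loader = DataLoader(fake_set, batch_size=64, shuffle=True)

outputs, labels = process_data(loader, RandomNet())
preds = outputs.argmax(dim=0)  # predicted class for each column
report = classification_report(labels.numpy(), preds.numpy(), zero_division=0)
print(report)
```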

## Train the MLP via Ridge regression

- Update the **weights of the last layer** only so that they correspond to the solution of a **Ridge regression**.<br>
https://en.wikipedia.org/wiki/Ridge_regression<br>

More precisely:
- Pass the train set through the network and get the outputs (activations) of the penultimate layer<br>
(add a method `forward_penultimate()` in the class `MLP`)
- Compute the closed-form solution of the Ridge regression:

$$
{\displaystyle {\widehat {\beta }}_{\text{ridge}}=(X^{T}X+kI_{p})^{-1}X^{T}y}
$$

where
- $X$ is the <span style="color:blue">row-wise concatenation</span> of the penultimate outputs $\boldsymbol{a_i}^{[L-1]}$, for $i = 1, \dots, N$;
- $I_{p}$ is the identity matrix of dimension $p$, the number of columns of $X$;
- $k > 0$ is a regularization parameter (e.g. $0.1$);
- $y$ is the <span style="color:blue">row-wise concatenation</span> of the 1-hot encoded targets $\boldsymbol{y_i}$, for $i = 1, \dots, N$ (`torch.nn.functional.one_hot()`).
- **Set the weights of the last layer $\boldsymbol{W}^{[L]}$ to the solution of the Ridge regression** (transposed if needed, so that the shape matches that of $\boldsymbol{W}^{[L]}$).
- **Set the bias of the last layer $\boldsymbol{b}^{[L]}$ to $\boldsymbol{0}$.**
- Recompute the predictions associated with the train and test sets.
- Compute the classification reports.
- What can you conclude?

**The results have drastically improved!**
- Note that $\boldsymbol{W}^{[1]}, \boldsymbol{W}^{[2]}$ are kept untrained (randomly initialized).
- Only $\boldsymbol{W}^{[3]}$ is trained via a **Ridge regression**.
- This suffices to drastically improve the results!
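
The closed-form Ridge solution can be sketched as follows. Random activations stand in for the penultimate outputs, and the helper name `ridge_solution` and the sizes `N`, `p`, `c` are assumptions for illustration. The sketch solves the linear system with `torch.linalg.solve` instead of forming the inverse explicitly, which gives the same $\widehat{\beta}_{\text{ridge}}$ and is usually numerically safer.

```python
import torch
import torch.nn.functional as F


def ridge_solution(X, Y, k=0.1):
    """Solve (X^T X + k I_p) beta = X^T Y for X of shape (N, p) and Y of shape (N, c)."""
    p = X.shape[1]
    A = X.T @ X + k * torch.eye(p, dtype=X.dtype)
    return torch.linalg.solve(A, X.T @ Y)  # shape (p, c)


torch.manual_seed(0)
N, p, c = 500, 16, 10  # illustrative sizes

# Stand-in for the penultimate activations, one sample per row
X = torch.tanh(torch.normal(0.0, 1.0, size=(N, p)))
labels = torch.randint(0, c, (N,))
Y = F.one_hot(labels, num_classes=c).to(X.dtype)  # 1-hot targets, one per row

beta = ridge_solution(X, Y, k=0.1)

# With the convention z = W a + b, the last layer uses the transpose of beta
W_last = beta.T               # shape (c, p)
b_last = torch.zeros(c, 1)    # bias set to zero
print(beta.shape, W_last.shape)
```

Since `tanh` is monotonically increasing, applying it to the output layer does not change the argmax, so the predictions coincide with those of the linear Ridge readout.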