# Backpropagation (BP)

## Libraries

import torch
import torch.nn.functional as F
import torchvision.datasets as datasets
from torchvision.transforms import ToTensor
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

## Backpropagation

<div class="alert alert-block alert-info">
    
The **backpropagation algorithm (BP)** is an efficient implementation of (mini-batch) stochastic gradient descent for neural networks.
    
In short, the algorithm consists of:
1. a **forward pass**: computation of the activations;
2. a **backward pass**: computation of the gradients;
3. an **updating step**: updating of the weights and biases using gradients.

<img src="figures/backprop.png" width="800px"/>
    
We will use the following **sigmoid activation function**
    $$
    \sigma(z) = \frac{1}{1 + \exp(-z)}
    $$
whose derivative is given by
    $$
    \sigma'(z) = \sigma(z) \cdot \left( 1 - \sigma(z) \right).
    $$

Furthermore, we will use the following **quadratic loss function**
    $$
    \mathcal{L_k} \left( \Theta \right) := \mathcal{L_k} \left( \Theta, \boldsymbol{a_k^{[L]}}, \boldsymbol{y_k} \right) = \frac{1}{2} \left\| \boldsymbol{a_k^{[L]}} - \boldsymbol{y_k} \right\|^{2}
    $$
whose gradient with respect to $\boldsymbol{a_k^{[L]}}$ is given by
    $$
    \nabla_{\boldsymbol{a_k^{[L]}}} \mathcal{L_k} \left( \Theta \right) = \boldsymbol{a_k^{[L]}} - \boldsymbol{y_k}.
    $$
</div>

**Step 1**
- Implement the activation function ``sigma(z)`` and its derivative ``sigma_prime(z)``.
- Implement the loss function ``loss(a_k, y_k)`` and its gradient ``loss_gradient(a_k, y_k)``.
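
The four functions of Step 1 follow directly from the formulas in the box above. The following is a minimal sketch, assuming that ``a_k`` and ``y_k`` are 1D tensors holding a single sample; the type hints and docstrings are additions for readability.

```python
import torch


def sigma(z: torch.Tensor) -> torch.Tensor:
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + torch.exp(-z))


def sigma_prime(z: torch.Tensor) -> torch.Tensor:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigma(z)
    return s * (1.0 - s)


def loss(a_k: torch.Tensor, y_k: torch.Tensor) -> torch.Tensor:
    """Quadratic loss 0.5 * ||a_k - y_k||^2 for one sample."""
    return 0.5 * torch.sum((a_k - y_k) ** 2)


def loss_gradient(a_k: torch.Tensor, y_k: torch.Tensor) -> torch.Tensor:
    """Gradient of the quadratic loss with respect to a_k."""
    return a_k - y_k


# Small usage example with hand-picked values.
a = torch.tensor([1.0, 2.0])
y = torch.tensor([0.0, 0.0])
print(sigma(torch.tensor(0.0)), loss(a, y), loss_gradient(a, y))
```

One way to gain confidence in ``sigma_prime`` is to compare it with the derivative that PyTorch's autograd computes for ``sigma``.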

**Step 2**
- Create a class `Network` which takes a list `[n0, n1, ..., nL]` as a parameter and creates an MLP with $L+1$ layers, where layer $i$ has $n_i$ neurons, for $i = 0, \dots, L$.
- Initialize the weight matrices $\boldsymbol{W^{[l]}}$ and the bias vectors $\boldsymbol{b^{[l]}}$ randomly from a normal distribution $\mathcal{N}(0, 1)$ (`torch.normal()`), for $l = 1, \dots, L$ (stored at indices $0, \dots, L-1$ of `self.weights` and `self.biases`).<br>
(Note that according to our notation, for $L+1$ layers, there are $L$ weights and $L$ biases.)
- The first layer is the input layer and thus has no biases.
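
A possible constructor for Step 2 is sketched below. It assumes that batches are stored column-wise (one sample per column, as in Step 5), so each weight matrix has shape $(n_l, n_{l-1})$ and each bias has shape $(n_l, 1)$; the attribute name ``sizes`` is an illustrative choice.

```python
import torch


class Network:
    """Parameter container for an MLP with layer sizes [n0, n1, ..., nL]."""

    def __init__(self, sizes):
        self.sizes = list(sizes)
        # W^[l] maps layer l-1 (n_{l-1} neurons) to layer l (n_l neurons).
        # Assumed shape (n_l, n_{l-1}), so that W @ A works on column-wise batches.
        self.weights = [
            torch.normal(mean=0.0, std=1.0, size=(n_out, n_in))
            for n_in, n_out in zip(self.sizes[:-1], self.sizes[1:])
        ]
        # One bias column per non-input layer; shape (n_l, 1) broadcasts over the batch.
        self.biases = [
            torch.normal(mean=0.0, std=1.0, size=(n_out, 1))
            for n_out in self.sizes[1:]
        ]


net = Network([100, 200, 300, 150, 5])
print([tuple(W.shape) for W in net.weights])
print([tuple(b.shape) for b in net.biases])
```

With five layer sizes, the lists ``net.weights`` and ``net.biases`` each hold four tensors, matching the note that $L+1$ layers give $L$ weights and $L$ biases.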

**Step 3**
- Implement a method ``forward_pass(self, X)`` which:
    - takes as inputs a batch of data $\boldsymbol{X}$ (2D tensor)
    - returns as outputs:
        - the list of pre-activations ``Z_l`` = $\left[ \boldsymbol{Z}^{[0]}, \boldsymbol{Z}^{[1]}, \dots, \boldsymbol{Z}^{[L]} \right]$ associated with $\boldsymbol{X}$ (list of 2D tensors);
        - the list of activations ``A_l`` = $\left[ \boldsymbol{A}^{[0]}, \boldsymbol{A}^{[1]}, \dots, \boldsymbol{A}^{[L]} \right]$ associated with $\boldsymbol{X}$ (list of 2D tensors).
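
The forward pass of Step 3 can be sketched as a standalone function; in the class it would be a method reading ``self.weights`` and ``self.biases`` instead of taking them as arguments. Two assumptions are made here: samples are stored as columns of ``X``, and since the input layer has no pre-activation, ``Z_l[0]`` is set to ``X`` only so that both lists have the same length.

```python
import torch


def sigma(z):
    # Built-in sigmoid, the same function as 1 / (1 + exp(-z)).
    return torch.sigmoid(z)


def forward_pass(weights, biases, X):
    """Return (Z_l, A_l) for a batch X of shape (n0, batch_size)."""
    Z_l, A_l = [X], [X]          # assumption: Z^[0] := X, A^[0] := X
    for W, b in zip(weights, biases):
        Z = W @ A_l[-1] + b      # (n_l, batch); b of shape (n_l, 1) broadcasts
        Z_l.append(Z)
        A_l.append(sigma(Z))
    return Z_l, A_l


# Usage with randomly initialized parameters (sizes taken from Step 5).
sizes = [100, 200, 300, 150, 5]
weights = [torch.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [torch.randn(n_out, 1) for n_out in sizes[1:]]
X = torch.randn(100, 32)
Z_l, A_l = forward_pass(weights, biases, X)
print([tuple(A.shape) for A in A_l])
```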

**Step 4**
- Implement a method ``backward_pass(self, Z_l, A_l, Y, eta)`` which:
    - takes as inputs:
        - the list of pre-activations ``Z_l`` = $\left[ \boldsymbol{Z}^{[0]}, \boldsymbol{Z}^{[1]}, \dots, \boldsymbol{Z}^{[L]} \right]$ (list of 2D tensors).
        - the list of activations ``A_l`` = $\left[ \boldsymbol{A}^{[0]}, \boldsymbol{A}^{[1]}, \dots, \boldsymbol{A}^{[L]} \right]$ (list of 2D tensors).
        - the batch of targets ``Y`` (2D tensor);
        - the learning rate ``eta`` (float).
    - updates the attributes ``self.weights`` and ``self.biases`` according to the backward pass.
    
**Remark:** note that ``Z_l`` and ``A_l`` contain one more element than ``self.weights``. Accordingly, indices might get confusing. The following picture clarifies the situation:<br>
```
In algo: z^[0]                z^[1]                z^[2]                 ...
In code: Z_l[0]               Z_l[1]               Z_l[2]                ...
In algo: a^[0]                a^[1]                a^[2]                 ...
In code: A_l[0]               A_l[1]               A_l[2]                ...
In algo:            W^[1]                W^[2]                W^[3]      ...
In code:       self.weights[0]      self.weights[1]      self.weights[2] ...
```
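
The index mapping above can be made concrete with a sketch of the backward pass, written as a standalone function rather than a method. It assumes the usual backpropagation recursions for sigmoid activations and quadratic loss, namely $\delta^{[L]} = (A^{[L]} - Y) \odot \sigma'(Z^{[L]})$ and $\delta^{[l]} = (W^{[l+1]})^{T} \delta^{[l+1]} \odot \sigma'(Z^{[l]})$, and it averages the gradients over the batch. The algorithm in the figure may sum instead of averaging, so treat the division by the batch size as an assumption.

```python
import torch


def sigma_prime(z):
    s = torch.sigmoid(z)
    return s * (1.0 - s)


def backward_pass(weights, biases, Z_l, A_l, Y, eta):
    """One gradient-descent step on the batch-averaged quadratic loss (in place)."""
    batch_size = Y.shape[1]
    L = len(weights)                      # number of weight matrices
    # Output layer error: delta^[L] = (A^[L] - Y) * sigma'(Z^[L])
    delta = (A_l[L] - Y) * sigma_prime(Z_l[L])
    for l in range(L, 0, -1):             # algorithm index l = L, ..., 1
        grad_W = delta @ A_l[l - 1].T / batch_size    # same shape as W^[l]
        grad_b = delta.mean(dim=1, keepdim=True)      # same shape as b^[l]
        if l > 1:
            # Propagate the error with the old W^[l] before updating it.
            delta = (weights[l - 1].T @ delta) * sigma_prime(Z_l[l - 1])
        weights[l - 1] -= eta * grad_W    # W^[l] lives at index l - 1 in code
        biases[l - 1] -= eta * grad_b


# Usage on a tiny random network (hypothetical sizes), with an inline forward pass.
sizes = [4, 3, 2]
weights = [torch.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [torch.randn(n_out, 1) for n_out in sizes[1:]]
X, Y = torch.randn(4, 5), torch.rand(2, 5)
Z_l, A_l = [X], [X]
for W, b in zip(weights, biases):
    Z_l.append(W @ A_l[-1] + b)
    A_l.append(torch.sigmoid(Z_l[-1]))
backward_pass(weights, biases, Z_l, A_l, Y, eta=0.1)
```

Note that the error is propagated through ``weights[l - 1]`` before that matrix is updated; updating first would mix old and new parameters within one step.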

**Step 5**
- Using your class ``Network``, initialize a network as follows:<br>
    ``net = Network([100, 200, 300, 150, 5])``
- Run a forward pass on a random tensor ``X`` of size $100 \times 32$.
- Run a backward pass using a random tensor ``Y`` of size $5 \times 32$.

# Application to the MNIST Dataset

The **MNIST dataset** consists of images of handwritten digits. The MNIST classification problem consists of predicting which digit an image represents.

<img src="files/figures/mnist.png" width="600px"/>

- Load the train and test MNIST datasets using the following commands:
```
train_set = datasets.MNIST(root='./data', train=True, download=True, transform=ToTensor())
test_set = datasets.MNIST(root='./data', train=False, download=True, transform=ToTensor())
```
Each sample consists of a tensor (the image, encoded in grayscale) and a label (the digit that it represents).
- Examine the train and test sets.
- Visualize some data samples (tensors) using `plt.imshow()`.

A **dataloader** creates batches of samples from a dataset so that they can be passed into a model.
- Create train and test dataloaders using the following commands:
```
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=True)
```
- Note that dataloaders are not subscriptable.
- Fetch one batch from the dataloader and examine it.
- Write a function that reshapes a batch of size $32 \times 1 \times 28 \times 28$ into a tensor of size $784 \times 32$.<br>
(use `torch.squeeze()`, `torch.reshape()`, `torch.flatten()`, `torch.transpose()`, etc.)
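
Fetching and reshaping a batch can be sketched as follows. To keep the example runnable without a download, it uses a random ``TensorDataset`` with MNIST-like shapes as a stand-in for the real dataset; the names ``reshape_batch``, ``fake_set`` and ``loader`` are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def reshape_batch(images: torch.Tensor) -> torch.Tensor:
    """Turn a (batch, 1, 28, 28) image batch into a (784, batch) matrix."""
    flat = torch.flatten(images, start_dim=1)   # (batch, 784)
    return flat.T                               # (784, batch): one column per sample


# Stand-in for MNIST (random data, same shapes) so the sketch runs offline.
fake_set = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
loader = DataLoader(fake_set, batch_size=32, shuffle=True)

# Dataloaders are iterables, not sequences: loader[0] raises a TypeError,
# whereas next(iter(loader)) yields the first batch.
images, labels = next(iter(loader))
X = reshape_batch(images)
print(images.shape, labels.shape, X.shape)
```

Flattening from ``start_dim=1`` does not hard-code the batch size, so the same function also handles a final batch with fewer than 32 samples.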

Instantiate a 4-layer MLP with the following characteristics:
- Layer 1 (or input layer): size 784
- Layer 2: size 128
- Layer 3: size 128
- Layer 4 (or output layer): size 10

- Write a ``train()`` function that uses your methods ``forward_pass()`` and ``backward_pass()``.
- Train your network for 40 epochs with a learning rate of $0.1$.
- Compute and store the loss after each batch processing:<br>
    You can one-hot encode the targets and use your function ``loss()``.<br>
    https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html
- Print the loss.
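
The training steps above can be sketched as follows. The code assumes that ``network`` exposes ``forward_pass(X)`` and ``backward_pass(Z_l, A_l, Y, eta)`` as specified in Steps 3 and 4, and that batches are stored column-wise; the helper names ``one_hot_targets`` and ``batch_loss`` are illustrative, and averaging the loss over the batch is one possible choice.

```python
import torch
import torch.nn.functional as F


def one_hot_targets(labels: torch.Tensor, num_classes: int = 10) -> torch.Tensor:
    """Encode integer labels of shape (batch,) as a float (num_classes, batch) matrix."""
    return F.one_hot(labels, num_classes=num_classes).T.float()


def batch_loss(A_out: torch.Tensor, Y: torch.Tensor) -> float:
    """Quadratic loss averaged over the batch (columns are samples)."""
    return (0.5 * ((A_out - Y) ** 2).sum(dim=0)).mean().item()


def train(network, loader, epochs=40, eta=0.1):
    """Mini-batch training loop; returns the list of per-batch losses."""
    losses = []
    for epoch in range(epochs):
        for images, labels in loader:
            X = torch.flatten(images, start_dim=1).T    # (784, batch)
            Y = one_hot_targets(labels)                 # (10, batch)
            Z_l, A_l = network.forward_pass(X)
            losses.append(batch_loss(A_l[-1], Y))       # loss before the update
            network.backward_pass(Z_l, A_l, Y, eta)
        print(f"epoch {epoch + 1}: last batch loss = {losses[-1]:.4f}")
    return losses


# One-hot encoding on three example labels: column j has a single 1 in row labels[j].
Y_example = one_hot_targets(torch.tensor([3, 0, 9]))
print(Y_example.shape)
```

The stored losses can then be plotted with ``plt.plot(losses)`` to check that training makes progress.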

- Write a function ``predict(network, dataloader)`` that returns the targets and predictions of a dataset.
- Compute the classification reports for the train and test sets:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
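
A possible ``predict()`` function is sketched below. It assumes the same ``forward_pass(X)`` interface and column-wise batches as before, and takes the predicted digit to be the index of the largest output activation.

```python
import torch


def predict(network, dataloader):
    """Return (targets, predictions) as lists of ints over a whole dataloader."""
    targets, predictions = [], []
    for images, labels in dataloader:
        X = torch.flatten(images, start_dim=1).T          # (784, batch)
        _, A_l = network.forward_pass(X)
        # Predicted digit = row index of the largest output activation per column.
        predictions.extend(A_l[-1].argmax(dim=0).tolist())
        targets.extend(labels.tolist())
    return targets, predictions
```

The two returned lists can be passed directly to ``classification_report(targets, predictions)``, which expects the true labels first and the predicted labels second.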