## Prerequisites

The MNIST dataset consists of low-resolution grayscale images of digits (0-9) written by humans with a pen.

The task is to predict which digit is written in each of the images.

A pre-processed form of the MNIST dataset is available in Google Drive. Download the `mnist.zip` file and unzip it. Inside you will find:

```
└── mnist
    ├── train_imgs
    ├── test_imgs
    ├── train_labels.npy
    └── test_labels.npy
```

  - `train_imgs`: Contains all train split images.
  - `test_imgs`: Contains all test split images.
  - `train_labels.npy`: Train split labels.
  - `test_labels.npy`: Test split labels.

The `.npy` format is the standard binary file format for numpy for saving arrays on disk. You can load the array from a `.npy` file using `numpy.load`.

Images are 2D (28x28) byte numpy arrays, and labels are 1D byte numpy arrays.
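
As a minimal sketch of the `.npy` round trip, the snippet below saves and reloads synthetic arrays with the shapes and data types described above. The arrays are stand-ins created for illustration, not real MNIST data, and the temporary directory is only there to keep the example self-contained.

```python
import pathlib
import tempfile

import numpy as np

# Synthetic stand-ins for one image and a label array (not real MNIST data).
fake_image = np.zeros((28, 28), dtype=np.uint8)
fake_labels = np.array([5, 0, 4], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = pathlib.Path(tmp_dir)

    # numpy.save writes one array per .npy file.
    np.save(tmp_dir / "img_00000.npy", fake_image)
    np.save(tmp_dir / "train_labels.npy", fake_labels)

    # numpy.load reads the array back with the same shape and dtype.
    image = np.load(tmp_dir / "img_00000.npy")
    labels = np.load(tmp_dir / "train_labels.npy")

print(image.shape, image.dtype)
print(labels.shape, labels.dtype)
```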

```python
# Recommended imports

import os
import pathlib

import tqdm
import math
import random

import numpy as np
import torch

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# %matplotlib inline is a "magic function" which enables inline plot display in notebooks.
# By calling this function, you will see the plots in the notebook rather than in a separate pop-up window.
```

## Exercise 1 - View the MNIST dataset labels

Plot both the train and test split label distribution histograms using `matplotlib`.

A histogram can be drawn as a bar plot. For each of the different labels (0-9), count how many times that label occurs in the train labels and in the test labels.

You must load the labels from the `train_labels.npy` and `test_labels.npy` files.

Observe the bar plots and ask yourself: Is the data in the train and test splits balanced?
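
The counting step can be sketched as follows, assuming a synthetic label array in place of the real `train_labels.npy` contents. `numpy.bincount` is used here as one possible alternative to `numpy.histogram` for small non-negative integer labels.

```python
import numpy as np

# Synthetic labels standing in for the real label array (assumption).
rng = np.random.default_rng(seed=0)
fake_labels = rng.integers(low=0, high=10, size=1000).astype(np.uint8)

# counts[d] is the number of occurrences of digit d.
# minlength=10 guarantees one entry per digit even if a digit never occurs.
counts = np.bincount(fake_labels, minlength=10)

for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} occurrences")
```

The resulting `counts` array can then be passed to `matplotlib.pyplot.bar` together with the digits 0-9 as bar positions.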

**Useful resources**

  - [numpy.histogram documentation](https://numpy.org/doc/2.1/reference/generated/numpy.histogram.html).
  - [matplotlib bar plot example](https://www.w3schools.com/python/matplotlib_bars.asp).

## Exercise 2 - View the MNIST dataset images

Write a function that takes as argument a digit (0-9), the split (train or test), randomly selects one image from that split with that digit and displays it.

Make sure that you show the image filename on top of the image.

Image numbers have the same order as the labels. For example, if the train labels array has these values:

```
array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)
```

Then:

  - `train_imgs/img_00000.npy` is an image of a handwritten "5" digit.
  - `train_imgs/img_00001.npy` is an image of a handwritten "0" digit.
  - `train_imgs/img_00002.npy` is an image of a handwritten "4" digit.
  - ...
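
The mapping from label positions to filenames can be sketched like this. The helper name `random_image_filename` is hypothetical, the label array is a small synthetic example, and the five-digit zero padding is inferred from the filenames listed above.

```python
import numpy as np

# Small synthetic label array (assumption); the real one comes from train_labels.npy.
labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=np.uint8)
rng = np.random.default_rng(seed=0)


def random_image_filename(labels, digit, rng):
    """Pick a random image whose label equals `digit` and return its filename."""
    # Positions in the label array where the label matches the requested digit.
    candidate_indices = np.flatnonzero(labels == digit)
    index = int(rng.choice(candidate_indices))
    # The position in the label array doubles as the image number.
    return f"img_{index:05d}.npy"


filename = random_image_filename(labels, digit=1, rng=rng)
print(filename)
```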

You can go one step further and display more than one image in one single plot.

There exist many different ways to create multiple axes in `matplotlib`, but the simplest one is `matplotlib.pyplot.subplots`.

Try to write a function that takes as argument a digit (0-9), the split (train or test), randomly selects 10 images from that split with that digit and displays them in a 5 x 2 (width x height) grid.

Again, make sure that you show the image filenames on top of each image.

Use these functions and try to find mislabelled or difficult examples. 

**Useful resources**

  - [numpy.argwhere documentation](https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html).
  - [numpy.random.choice documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)
  - [matplotlib image display example](https://matplotlib.org/stable/gallery/images_contours_and_fields/image_demo.html#sphx-glr-gallery-images-contours-and-fields-image-demo-py).

## Exercise 3 - PyTorch datasets

PyTorch provides a standardized interface for datasets: the `torch.utils.data.Dataset` class.

For each dataset that you want to use in PyTorch, you must write a subclass of the `torch.utils.data.Dataset` class implementing the following methods:

  - `__getitem__`.
  - `__len__`.

By implementing the `__getitem__` and `__len__` methods, the dataset becomes an indexable object (like a list) where each item is a data point.

In the case of the MNIST dataset, a data point will be a tuple consisting of:

  - The image (`numpy.ndarray`)
  - The label (`int`)

Create a subclass of `torch.utils.data.Dataset` for your MNIST dataset and test it by indexing it like a list.

Do not create two separate datasets, create only one which contains both the train and test splits.

The train and test splits have 60,000 and 10,000 images, respectively, so your dataset should have length 70,000.
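
To illustrate the interface only, here is a minimal sketch of a `torch.utils.data.Dataset` subclass backed by in-memory arrays. The class name `ToyDigitsDataset` and the synthetic data are assumptions; your MNIST dataset will instead read the image files from disk and cover both splits.

```python
import numpy as np
import torch


class ToyDigitsDataset(torch.utils.data.Dataset):
    """Hypothetical in-memory dataset that only demonstrates the interface."""

    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        # Number of data points in the dataset.
        return len(self.labels)

    def __getitem__(self, index):
        # A data point is an (image, label) tuple.
        return self.images[index], int(self.labels[index])


# Synthetic data: six blank 28 x 28 images with arbitrary labels.
images = np.zeros((6, 28, 28), dtype=np.uint8)
labels = np.array([5, 0, 4, 1, 9, 2], dtype=np.uint8)

dataset = ToyDigitsDataset(images, labels)
image, label = dataset[2]
print(len(dataset), image.shape, label)
```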

**Useful resources**

  - [torch.utils.data.Dataset documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

## Exercise 4 - PyTorch dataloaders

In Deep Learning, forward passes are performed by passing batches of data to the model, rather than data points.

To easily load batches of data points, PyTorch provides the `torch.utils.data.DataLoader` class, which wraps around a `torch.utils.data.Dataset`.

Try it out by wrapping an instance of your MNIST dataset into a `torch.utils.data.DataLoader` with `batch_size` 4.

Iterate through a few batches using `for batch in dataloader:` and inspect the `batch` variable to see what it contains.
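
A minimal sketch of what to expect, assuming a plain list of synthetic `(image, label)` tuples in place of your dataset instance (a list also supports indexing and `len`, which is all the `DataLoader` needs here):

```python
import numpy as np
import torch

# Eight synthetic data points: image i is filled with the value i (assumption).
data_points = [(np.full((28, 28), i, dtype=np.uint8), i % 10) for i in range(8)]

dataloader = torch.utils.data.DataLoader(data_points, batch_size=4)

for images, labels in dataloader:
    # The default collate function stacks the numpy images into one tensor
    # and turns the Python int labels into an int64 tensor.
    print(images.shape, images.dtype, labels.shape, labels.dtype)
```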

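The `torch.utils.data.DataLoader` class does many things under the hood, such as:

  - Automatically converting numpy arrays and numeric values to torch tensors.
  - Optionally (with `pin_memory=True`) copying them into page-locked memory, which speeds up the later transfer onto your devices (GPUs). The transfer itself is not automatic: you still have to call `.to(device)` on each batch.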
  - ...

The `torch.utils.data.DataLoader` internally uses a "collate function" in order to merge multiple data points into one batch.

The input to the collate function is a list of data points coming from your dataset, and the output should be a tuple of torch tensors.

Implement your own `collate_fn` to achieve the same behaviour as the default collate function and pass it to the `torch.utils.data.DataLoader`.

Test it out by again wrapping an instance of your MNIST dataset into a `torch.utils.data.DataLoader` with `batch_size` 4.

Although this is not usual, implementing a custom collate function may be necessary when the data points in a batch have non-uniform sizes.
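
One possible shape for such a function is sketched below, assuming each data point is a `(numpy image, int label)` tuple as in Exercise 3. The synthetic data and the choice of `int64` for the labels are assumptions made for illustration.

```python
import numpy as np
import torch


def collate_fn(batch):
    """Merge a list of (image, label) data points into a tuple of tensors."""
    # Transpose the list of tuples into a tuple of images and a tuple of labels.
    images, labels = zip(*batch)
    # Stack in numpy first, then convert the whole batch to a tensor at once.
    images = torch.from_numpy(np.stack(images))
    labels = torch.tensor(labels, dtype=torch.int64)
    return images, labels


# Eight synthetic data points standing in for the dataset (assumption).
data_points = [(np.full((28, 28), i, dtype=np.uint8), i % 10) for i in range(8)]
dataloader = torch.utils.data.DataLoader(data_points, batch_size=4, collate_fn=collate_fn)

images, labels = next(iter(dataloader))
print(images.shape, images.dtype, labels)
```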

**Useful resources**

  - [torch.utils.data.DataLoader documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

## Exercise 5 - Splitting your dataset

It is standard practice to split your dataset into three partitions:

  - The train split, used to train your model.
  - The validation (often abbreviated as "val") split, used to evaluate your model during training and for hyperparameter search.
  - The test split, used ONLY ONCE at the end of your experiments to evaluate your final model.

Some dataset authors choose to explicitly hide the test split in order to avoid the common malpractice of using it for training and validation.

Datasets often come with only the train and test splits. When this is the case, the val split must be a subset of the train split.

It is common to use split (train:val:test) ratios similar to 80:10:10, although some suggest larger test splits, e.g. 80:5:15 or 75:5:20.

Your dataset class expects integer indices ranging from 0 to 69,999. In our case:

  - Indices from 0 to 59,999 correspond to the train split.
  - Indices from 60,000 to 69,999 correspond to the test split.
  - We will allocate 10% of the train split for the val split.

For each split (train, val, test), compute a `torch.Tensor` (of integer type) containing the indices of the data points.

Using those indices, you can create dataset subsets using `torch.utils.data.Subset`. These will not be copies of your dataset instance, only views.

Make sure to make the val dataset as balanced as possible, without compromising the label distribution of the train split.

Once this has been accomplished, plot and compare the train and val label distribution histograms.
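
One way to read the balance requirement is stratified sampling: take the same fraction of every digit for the val split, so both splits keep the original label distribution. The sketch below follows that interpretation on synthetic labels; the function name `stratified_train_val_indices` and the placeholder dataset are hypothetical.

```python
import numpy as np
import torch


def stratified_train_val_indices(labels, val_fraction, seed=0):
    """Split indices so that every digit contributes the same fraction to val."""
    rng = np.random.default_rng(seed)
    train_parts, val_parts = [], []
    for digit in np.unique(labels):
        # Shuffle the positions of this digit, then cut off the val share.
        digit_indices = rng.permutation(np.flatnonzero(labels == digit))
        num_val = int(round(val_fraction * len(digit_indices)))
        val_parts.append(digit_indices[:num_val])
        train_parts.append(digit_indices[num_val:])
    train_indices = torch.from_numpy(np.sort(np.concatenate(train_parts)))
    val_indices = torch.from_numpy(np.sort(np.concatenate(val_parts)))
    return train_indices, val_indices


# Synthetic train labels: 20 occurrences of each digit (assumption).
labels = np.repeat(np.arange(10, dtype=np.uint8), 20)
train_indices, val_indices = stratified_train_val_indices(labels, val_fraction=0.1)

# Placeholder dataset: a list of (index, label) tuples instead of real images.
dataset = list(zip(range(len(labels)), labels.tolist()))
train_subset = torch.utils.data.Subset(dataset, train_indices.tolist())
val_subset = torch.utils.data.Subset(dataset, val_indices.tolist())
print(len(train_subset), len(val_subset))
```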

**Useful resources**

  - [torch.utils.data.Subset documentation](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.Subset)

## Exercise 6 - PyTorch modules

The PyTorch abstraction for a model is the `torch.nn.Module` class. Your models must be subclasses of this superclass.

It is imperative that you call the `torch.nn.Module` superclass constructor inside the constructor of your model class.

Additionally, you must implement the `forward` method, which models a forward pass of the model, and inputs / outputs torch tensors.

Afterwards, your model instance can be called as a function (the `__call__` method comes implemented and internally calls the `forward` method).

Begin by creating a simple model consisting of the following sequential architecture:

  - A fully connected layer with input size 784 (number of pixels in each image, 28 x 28) and output size 196 (`torch.nn.Linear`).
  - A sigmoid activation layer (`torch.nn.Sigmoid`).
  - A fully connected layer with input size 196 and output size 10 (number of different possible drawn digits) (`torch.nn.Linear`).

The model is supposed to take as input the 28 x 28 image flattened into a 1D tensor of size 784, and output a tensor of size 10 with the digit activations.

Verify this by passing a float tensor of size 784, and checking that the size of the output tensor is 10.

All layers implemented in PyTorch can process data in batches. Pass a float tensor of size 32 x 784 to the model and check the shape of the output tensor.
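
A minimal sketch of such a module, assuming the three-layer architecture listed above with 28 x 28 = 784 inputs. The class name `SimpleClassifier` and the use of `torch.nn.Sequential` to chain the layers are choices made for illustration.

```python
import torch


class SimpleClassifier(torch.nn.Module):
    """Hypothetical two-layer fully connected classifier."""

    def __init__(self, input_size=784, hidden_size=196, num_classes=10):
        # The superclass constructor must run before any layer is registered.
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(input_size, hidden_size),
            torch.nn.Sigmoid(),
            torch.nn.Linear(hidden_size, num_classes),
        )

    def forward(self, x):
        return self.layers(x)


model = SimpleClassifier()

# A single flattened image and a batch of 32 flattened images (random values).
single_output = model(torch.rand(784))
batch_output = model(torch.rand(32, 784))
print(single_output.shape, batch_output.shape)
```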

**Useful resources**

  - [torch.nn.Module documentation](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html)
  - [torch.nn.Linear documentation](https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html)
  - [torch.nn.Sigmoid documentation](https://docs.pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html)

## Exercise 7 - Data pre-processing

Your dataset + dataloader yield batches, which are tuples consisting of:

  - A (B x 28 x 28) byte tensor (images).
  - A (B) byte tensor (labels).

Where B is the batch size. However, your model expects a (B x 784) float tensor as input. We must apply the following pre-processing to each image:

  - Flatten the (28 x 28) image into a (784) vector.
  - Convert the data type from byte to float (cast to float and division by 255.0).

These operations will convert your byte images into vectors with numbers lying in the [0, 1] interval.

This must happen inside your dataset class, and not inside your model, as these are not learnable operations involving model parameters.

You can make your dataset flexible to any pre-processing operations by modifying your dataset class so that it accepts a pre-processing function in the constructor. You can later use this function in the `__getitem__` to return already pre-processed data points.

Implement these changes to your dataset class. You can decide to use numpy for the pre-processing and still return numpy arrays, or to convert them to torch tensors and use torch functions for the pre-processing.

The most efficient option is to use only numpy and let the dataloader make the array-to-tensor conversion. This way we reduce the number of conversions from 1 per data point to 1 per batch.

Additionally, keep in mind that numpy generally is more efficient than torch on the CPU, and that data pre-processing is almost always done on the CPU. However, PyTorch offers image manipulation utils for tensors which are also well optimized on CPU.

Once you have implemented the changes, check the data type and range (use `torch.min` and `torch.max`) of the image and label tensors in a few batches.

Afterwards, modify your pre-processing function so that the range of the image tensors is [-1, 1] rather than [0, 1].
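
The two pre-processing variants can be sketched in numpy as follows. The function names are hypothetical and the test image is synthetic; the functions are meant to be passed to the dataset constructor and applied inside `__getitem__`.

```python
import numpy as np


def preprocess_unit_range(image):
    """Flatten a (28, 28) byte image and scale it to the [0, 1] interval."""
    return image.reshape(-1).astype(np.float32) / 255.0


def preprocess_symmetric_range(image):
    """Flatten a (28, 28) byte image and scale it to the [-1, 1] interval."""
    # Stretch [0, 1] to [0, 2], then shift it down to [-1, 1].
    return preprocess_unit_range(image) * 2.0 - 1.0


# Synthetic image that contains every byte value from 0 to 255 (assumption).
image = (np.arange(784) % 256).astype(np.uint8).reshape(28, 28)

unit = preprocess_unit_range(image)
symmetric = preprocess_symmetric_range(image)
print(unit.shape, unit.dtype, unit.min(), unit.max())
print(symmetric.shape, symmetric.dtype, symmetric.min(), symmetric.max())
```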

## Exercise 8 - The Categorical Cross-Entropy Loss

Your model falls into the category of "classifiers" - models that classify data points into classes. In this case, you are classifying images into the digit written in the image.

The standard loss function for training classifiers is the Categorical Cross-Entropy Loss. Read the [following article](https://www.geeksforgeeks.org/deep-learning/categorical-cross-entropy-in-multi-class-classification/) (or ask ChatGPT) to learn about this loss function.

For each batch:

  - Your model will output a (B x 10) float tensor containing the class activations or logits (NOT PROBABILITIES) of the B images in the batch.
  - The batch will contain a (B) byte tensor with the labels (0-9) of the B images in the batch.

You must write a method that takes as input both tensors and computes the Categorical Cross-Entropy function of the entire batch.

The output of this method must be a (1) float tensor, on which you will call `.backward()` in order to compute the gradients.

You have two ways to go here:

  - Write your own implementation of the Categorical Cross-Entropy Loss.
  - Use [PyTorch's implementation of the Categorical Cross-Entropy Loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).

Use either of the two, but keep in mind that by writing your own implementation, you will learn more from this exercise.

<font color=red>**WARNING**</font>: This is one of the most critical parts of this session, as it is difficult to notice if you make mistakes.

You are just as likely to make mistakes here whether you implement the loss yourself or use the PyTorch built-in implementation.

Therefore, be sure to test this part thoroughly with a few example input tensors from which you know the expected resulting loss.
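
One such check is sketched below, assuming a hand-written loss that averages over the batch (the default reduction of `torch.nn.CrossEntropyLoss`; use a sum instead if you prefer to accumulate totals). With all-zero logits every class has probability 1/10, so the expected loss is ln(10) whatever the labels are. The function name `categorical_cross_entropy` is hypothetical.

```python
import math

import torch


def categorical_cross_entropy(logits, labels):
    """Mean cross-entropy of (B x C) logits against (B) integer labels."""
    # log_softmax turns logits into log-probabilities in a numerically stable way.
    log_probs = torch.log_softmax(logits, dim=1)
    # Byte tensors would be interpreted as masks when indexing, so cast to int64.
    class_indices = labels.long()
    row_indices = torch.arange(logits.shape[0])
    # Pick the log-probability of the correct class for each row.
    picked_log_probs = log_probs[row_indices, class_indices]
    return -picked_log_probs.mean()


labels = torch.tensor([0, 3, 5, 9], dtype=torch.uint8)

# Known case: uniform predictions give a loss of ln(10).
uniform_loss = categorical_cross_entropy(torch.zeros(4, 10), labels)
print(uniform_loss.item(), math.log(10))

# Cross-check against the built-in functional form on random logits.
torch.manual_seed(0)
random_logits = torch.randn(4, 10)
custom_loss = categorical_cross_entropy(random_logits, labels)
reference_loss = torch.nn.functional.cross_entropy(random_logits, labels.long())
print(custom_loss.item(), reference_loss.item())
```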

**Useful resources**

  - [Article on logits vs. probabilities](https://illuri-sandeep5454.medium.com/logits-vs-probabilities-understanding-neural-network-outputs-clearly-0e86a4256a0e)

## Exercise 9 - The training loop

Now is the time to put everything together. The following example showcases the most basic training loop:

---

```python
for epoch_num in range(num_epochs):

    epoch_train_loss = 0
    epoch_val_loss = 0

    # Training

    model.train()

    for batch in train_dataloader:

        images, labels = batch
        pred_labels = model(images)

        loss = compute_categorical_cross_entropy_loss(labels, pred_labels)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

        epoch_train_loss += loss.item()

    # Evaluation

    model.eval()

    for batch in val_dataloader:

        with torch.no_grad():

            images, labels = batch
            pred_labels = model(images)

            loss = compute_categorical_cross_entropy_loss(labels, pred_labels)

        epoch_val_loss += loss.item()

    # LR update

    scheduler.step()
```

---

In this training loop, every epoch:

  - We loop through the entire train dataloader. For every batch we compute the loss, backpropagate it to compute gradients, update the parameters, and reset the gradients. We also accumulate the total train loss.
  - We loop through the entire val dataloader and only accumulate the total val loss.
  - We update the optimizer learning rate using a learning rate scheduler.

Important notes here:

  - Calling `.train()` activates "training mode" on your model, while calling `.eval()` activates "inference mode". Some layers of the model (Dropout, Normalization...) act differently when in training or inference mode. Make sure to call `.train()` whenever you are going to use your model for training and updating parameters, and `.eval()` whenever you are just going to use it for evaluation or inference.
  - `torch.no_grad()` is a context manager that disables gradient calculation: no computation graph is created, and all intermediate tensors in the forward pass are marked as `requires_grad=False`. Make sure to wrap your evaluation code in this context manager so that it runs faster and uses less memory.
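
Both notes can be observed on a tiny example. The sketch below uses a small `torch.nn.Linear` layer and a standalone `torch.nn.Dropout` layer with random inputs, all chosen purely for illustration.

```python
import torch

model = torch.nn.Linear(4, 2)
inputs = torch.rand(3, 4)

# Outside the context manager the output is part of a computation graph.
output_with_grad = model(inputs)

# Inside torch.no_grad() no graph is built for the same forward pass.
with torch.no_grad():
    output_without_grad = model(inputs)

print(output_with_grad.requires_grad, output_without_grad.requires_grad)

# Dropout is one of the layers that behaves differently in inference mode:
# after .eval() it passes its input through unchanged.
dropout = torch.nn.Dropout(p=0.5)
dropout.eval()
eval_output = dropout(inputs)
print(torch.equal(eval_output, inputs))
```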

A (not recommended at all) alternative to the `torch.no_grad()` context manager would be to:
  1. Before the evaluation code, go through all the parameters of your model(s) and set `.requires_grad = False` for all of them.
  2. After the evaluation code, go through all the parameters of your model(s) again and restore the `.requires_grad` property.

You must write your own training loop, expanding on the provided example, where:

  - You use the `SGD` optimizer with default options (aside from the initial learning rate).
  - You use the `ExponentialLR` scheduler with default options (aside from the `gamma` parameter).
  - You collect the training and evaluation loss from each epoch, so that you can plot them later.
  - You showcase a "progress" bar which shows in real-time the progress of iterating the train and val dataloaders (use `tqdm` for this).
  - You plot (every few epochs) the epoch mean train and val loss per data point in one single chart (divide the epoch loss by the number of data points). We recommend logarithmic scale on the Y-axis.
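
As a sketch of how the optimizer and scheduler fit together, the snippet below creates both for a tiny placeholder model and records the learning rate over three epochs. The initial learning rate of 0.1 is an arbitrary placeholder, not a recommendation, and the forward and backward passes are omitted.

```python
import torch

# Tiny placeholder model; any torch.nn.Module with parameters works here.
model = torch.nn.Linear(4, 2)

# lr=0.1 is an arbitrary placeholder value (assumption).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

learning_rates = []
for epoch_num in range(3):
    # The current learning rate is stored in the optimizer's parameter groups.
    learning_rates.append(optimizer.param_groups[0]["lr"])

    # ... the loops over the train and val dataloaders would go here ...
    optimizer.step()

    # Once per epoch, ExponentialLR multiplies the learning rate by gamma.
    scheduler.step()

print(learning_rates)
```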

At the end of your training loop, you must also:

  - Compute the final mean train and val loss per data point (go through the train and val dataloaders in "inference mode" one last time).
  - Compute the overall train and val accuracy of your model (% of correct classifications).
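
The accuracy computation can be sketched as follows: the predicted digit is the index of the largest logit, and correct predictions are counted per batch and divided by the total number of data points at the end. The helper name `count_correct` and the hand-written tensors (three classes for brevity) are assumptions.

```python
import torch


def count_correct(logits, labels):
    """Number of rows whose largest logit sits at the labelled class."""
    predictions = logits.argmax(dim=1)
    return int((predictions == labels.long()).sum().item())


# Hand-written logits for four data points and three classes (assumption).
logits = torch.tensor(
    [
        [2.0, 0.1, 0.3],
        [0.2, 1.5, 0.1],
        [0.3, 0.2, 0.9],
        [1.0, 0.5, 0.2],
    ]
)
labels = torch.tensor([0, 1, 2, 1], dtype=torch.uint8)

num_correct = count_correct(logits, labels)
accuracy = 100.0 * num_correct / len(labels)
print(num_correct, accuracy)
```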

Within this script, for now, fix the following hyperparameters:

  - `num_epochs = 100` - Number of epochs to train for.
  - `batch_size = 32` - DataLoader batch size. If the training is too slow, you can try 64, 128 or 256.
  - `gamma = 0.9` - ExponentialLR scheduler gamma hyperparameter.

Try a few values of the learning rate until you find one that seems to produce stable training. Do not try to optimize the learning rate, just find one that works to verify that your training loop is correctly written.