In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from fastai.vision.all import *
import torch
from torchvision import datasets, transforms
import torch.nn.functional as F  # the code below calls F.cross_entropy and F.relu
import gzip
import numpy as np

## Learning how to build a CNN for the MNIST dataset with Fastai and PyTorch 

Here's what I've learned while building a model that can recognize handwritten digits from 0 to 9 using the MNIST dataset. This was an exercise I assigned to myself while learning the basics of computer vision and the associated engineering concepts.

Below, you'll find some of my notes on the subject. I've learned a lot about the subject recently, and some parts of my notes are extracts of a chat with an LLM.

First, I'll summarize some peculiarities of the code and the challenges required to understand the problem. Then, we'll delve deeper into the code.

I've used the Fastai API for some parts and pure PyTorch for others. 

## Understanding how to create a multi-class classification model
During a Fastai lesson, we created an image classifier with the MNIST dataset. The base exercise was to classify handwritten digits `3` and `7`, an easy enough problem for understanding the basics of computer vision.

The problem was a binary classification (if it's not a `3` then it's a `7`).
We solved it by creating a simple linear neural net with a loss function that looked like this:

````

def mnist_loss(predictions, targets):
    preds = predictions.sigmoid() 
    return torch.where(targets==1, 1-preds, preds).mean()
    
# use sigmoid on prediction logits so that preds will be normalized between 0.0 - 1.0  

````

Here we're applying a sigmoid function to the predictions and then comparing them to the targets. 
As mentioned before, this approach is typically used for binary classification problems.

The function calculates the mean loss for a binary classification problem, where the loss is `1-preds` when the `target` is `1` (indicating that we want the prediction to be closer to 1), and `preds` when the `target` is `0` (indicating that we want the prediction to be closer to 0). In other words, it is the mean distance between each prediction and its target. It plays the same role as binary cross-entropy loss, but without the log function applied, and it only works when the labels are either 0 or 1.

However, the new model I have created is a multi-class classification model (the whole MNIST dataset, which has 10 classes: the digits 0 to 9). 
Therefore, when calculating the `loss`, I should use a `softmax` function instead of a `sigmoid` function. 
For the batch accuracy calculation, I should compare the `argmax` of my predictions to the targets, instead of applying a `sigmoid` and a threshold of `0.5`, which is only appropriate for binary classification.

The `softmax` function is specifically designed to handle multiple classes. It outputs a vector of probabilities, one per class, that sum to 1. It's a way of normalizing the output of a network into a probability distribution over the predicted output classes.
On the other hand, the `sigmoid` function is used for binary classification problems. It squashes its input into a range between 0 and 1, which can be used to represent a probability. However, it treats each output independently, so the values it produces for several classes do not form a probability distribution. 
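
To make the difference concrete, here is a minimal sketch in plain Python (no PyTorch needed). The three logits are made-up values chosen only for illustration, and the helper names `sigmoid` and `softmax` are my own.

```python
import math

def sigmoid(x):
    """Squash a single logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes

independent = [sigmoid(v) for v in logits]  # each value computed on its own
joint = softmax(logits)                     # values depend on each other

print(sum(independent))  # not constrained to 1
print(sum(joint))        # 1.0 up to floating-point error
```

The largest logit still gets the largest probability in both cases; only softmax makes the values compete with each other.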

You'll notice `softmax` isn't visible in my loss function because the `F.cross_entropy` function in PyTorch applies the softmax function internally (as a log-softmax).
When using `F.cross_entropy`, the predictions tensor should have the `shape` `(batch_size, num_classes)` and the targets tensor should have the `shape` `(batch_size,)`. The values in targets should be the class indices, not one-hot encoded vectors.
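
As a small check of the shapes and of the built-in softmax, the sketch below compares `F.cross_entropy` with the two-step version (`F.log_softmax` followed by `F.nll_loss`). The batch of random logits and the target indices are assumptions made up for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # (batch_size, num_classes), raw scores
targets = torch.tensor([3, 7, 0, 9])  # (batch_size,), class indices, not one-hot

loss = F.cross_entropy(logits, targets)

# Same computation spelled out: log-softmax, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss.item(), manual.item())
```

Both values should match, which is why the model itself outputs raw logits and has no softmax layer at the end.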

----

## About the architecture of the neural net

The model created is a convolutional neural network (CNN) and it uses `nn.Conv2d`, a 2D convolutional layer in PyTorch. It's an appropriate choice in several scenarios, particularly when dealing with image data or any kind of grid-like 2D data. 

The model has three convolutional blocks: the first two use 64 filters each and the third uses 10. After the convolutional blocks, an `AdaptiveAvgPool2d` layer averages each feature map down to a size of 1x1, a `Flatten` layer turns the result into a vector, and a final `Linear` layer maps those features to the 10 output classes.

Let's break down the architecture of one block in the CNN, which consists of `Conv2d` + `ReLU` + `Dropout`.

**Conv2d**

The first part of the block is a 2D convolutional layer. In the first block, it takes an input with 1 channel (a grayscale image) and applies 64 filters (or kernels) of size 5x5. The stride of 2 means that the filters move 2 pixels at a time, reducing the size of the output feature maps compared to the input. The padding of 1 adds extra pixels around the input feature map to control the spatial output size.
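
To see how stride and padding shrink the feature maps, here is a small sketch of the usual output-size arithmetic for a convolution (assuming dilation 1), applied three times with the settings used in this notebook (`kernel_size=5`, `stride=2`, `padding=1`). The helper name `conv_out_size` is my own.

```python
def conv_out_size(n, kernel_size, stride, padding):
    """Spatial output size of a convolution (dilation 1) on an n x n input."""
    return (n + 2 * padding - kernel_size) // stride + 1

size = 28  # MNIST images are 28x28
sizes = [size]
for _ in range(3):  # three Conv2d layers with identical settings
    size = conv_out_size(size, kernel_size=5, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [28, 13, 6, 2]
```

The last feature maps are only 2x2, which `AdaptiveAvgPool2d(1)` then averages down to 1x1.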

**ReLU**

`nn.ReLU()` is the activation function. ReLU stands for Rectified Linear Unit. It introduces non-linearity into the model, allowing the network to learn more complex patterns. The function returns 0 for any negative input, and for any positive value x it returns that value unchanged. 

**Dropout**

Dropout is a regularization technique used in neural networks to prevent overfitting. 
During training, dropout randomly sets a fraction of the input units to 0 at each update. The fraction of zeroed units is determined by a hyperparameter, usually denoted as p, which is the dropout rate. For example, if p is set to 0.5, approximately half of the input units will be dropped out, or set to zero.


Dropout is effective because it forces the network to learn redundant representations, and ensures that the output does not rely too heavily on any single neuron. This makes the network more robust and improves generalization.

Remember, dropout is only used during training, not during testing or evaluation of the model.
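
The difference between training and evaluation mode can be checked directly. This is a minimal sketch on a made-up tensor of ones; it also shows that PyTorch scales the surviving values by `1 / (1 - p)` during training.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)  # made-up input, only for illustration

drop.train()
y_train = drop(x)  # about half the entries become 0, the rest are scaled to 2.0

drop.eval()
y_eval = drop(x)   # dropout does nothing in evaluation mode

print((y_train == 0).float().mean().item())  # fraction of zeroed units, close to 0.5
print(y_train.max().item())                  # surviving units are scaled by 1 / (1 - p)
print(torch.equal(y_eval, x))                # input passes through unchanged
```

This is why calling `model.eval()` before inference matters for a model that contains `nn.Dropout`.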


----
## Let's start 
PyTorch has a built-in function (through torchvision) to download the MNIST dataset. Here's the link to all the other datasets available from the vision package: https://pytorch.org/vision/stable/datasets.html

Good to know: I have kept, as commented code, some useful parts for working in different environments.

In [None]:
# Download MNIST dataset 
# When not on Kaggle use below
# mnist = datasets.MNIST('./data', download=True)

In [None]:
ROOT_DIR = os.path.dirname(os.path.abspath('')) 

#VSCODE/IDE 
#DATA_DIR = os.getcwd() + ROOT_DIR + 'mnist-basics/data/MNIST/raw/'
#BROWSER
#DATA_DIR = ROOT_DIR + '/mnist-basics/data/MNIST/raw/'

#Kaggle 
DATA_DIR = '/kaggle/input/mnist-pytorch/'
print(DATA_DIR)

In [None]:
# The Kaggle dataset I use already has the ubyte files ungzipped,
# but I kept this function as a reference. The code for the current Kaggle environment uses the other one below.

# this function is useful when using the gzipped dataset downloaded by PyTorch
def load_mnist_as_gzip(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,'%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,'%s-images-idx3-ubyte.gz'
                               % kind)
    print(labels_path)
    print(images_path)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = torch.from_numpy(np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8).copy())
    print(f'Number of labels: {labels.size(0)}')

    with gzip.open(images_path, 'rb') as imgpath:
        images = torch.from_numpy(np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16).copy().reshape(-1, 784))
    print(f'Number of images: {images.size(0)}')

    assert len(images) == len(labels)

    return images, labels

In [None]:
def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,'%s-labels-idx1-ubyte'
                               % kind)
    images_path = os.path.join(path,'%s-images-idx3-ubyte'
                               % kind)
    print(labels_path)
    print(images_path)
    with open(labels_path, 'rb') as lbpath:
        labels = torch.from_numpy(np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8).copy())
    print(f'Number of labels: {labels.size(0)}')

    with open(images_path, 'rb') as imgpath:
        images = torch.from_numpy(np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16).copy().reshape(-1, 784))
    print(f'Number of images: {images.size(0)}')

    assert len(images) == len(labels)

    return images, labels

## Some details about gzip, buffers & tensors

The dataset downloaded through PyTorch is packed as gzip files, so I had to do some manipulation to return the data as tensors for our model's `train` & `valid` datasets. 

***Details/Notes***

The numpy module is used to convert the binary data into an array of unsigned 8-bit integers (np.uint8). The offset parameter skips the header at the start of each file, which contains metadata about the file format: 8 bytes for the label files and 16 bytes for the image files.

Therefore, the labels variable is an array of unsigned 8-bit integers that represent the labels associated with a dataset. 

We apply `.reshape(-1, 784)` so that each row holds one flattened 28x28 image (784 pixels); the `-1` lets numpy infer the number of images.
Also, we `assert` that the number of images matches the number of labels, so it'll throw an AssertionError if the numbers don't match.

About the warning `The given buffer is not writable, and PyTorch does not support non-writable tensors.`
This warning is raised because PyTorch is trying to create a tensor from a buffer that is not writable. This is not necessarily a problem, but it could potentially lead to unexpected behavior if you try to modify the tensor later on.

To avoid this warning, we create a copy of the buffer before converting it to a tensor. 
 
`.copy()` creates a copy of the numpy array. This is necessary because the original data is in a read-only buffer, but we need a writable copy to convert it to a PyTorch tensor.

`torch.from_numpy()` converts the numpy array to a PyTorch tensor.
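
To illustrate where the offsets of 8 and 16 bytes in the loading functions come from, here is a sketch that reads the header of an IDX file with the standard `struct` module. The byte buffers are synthetic stand-ins built only for this example (not real MNIST data), and `parse_idx_header` is a hypothetical helper, not part of any library.

```python
import struct

def parse_idx_header(buffer):
    """Return (magic, dims) read from the start of an IDX file's raw bytes."""
    magic = struct.unpack(">I", buffer[:4])[0]  # big-endian unsigned 32-bit integer
    ndim = magic & 0xFF                         # the last byte holds the number of dimensions
    dims = struct.unpack(">" + "I" * ndim, buffer[4:4 + 4 * ndim])
    return magic, dims

# Synthetic buffers: 3 labels, and 2 blank 28x28 images
label_bytes = struct.pack(">II", 2049, 3) + bytes([5, 0, 4])
image_bytes = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28)

for name, buf in [("labels", label_bytes), ("images", image_bytes)]:
    magic, dims = parse_idx_header(buf)
    header_size = 4 + 4 * len(dims)  # magic number + one 32-bit integer per dimension
    print(name, magic, dims, header_size)
```

The label header is 8 bytes (magic number and item count) and the image header is 16 bytes (magic number, item count, rows, columns), which matches the `offset` values passed to `np.frombuffer`.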

### Creating the datasets

In [None]:
# usage with the dataset from PyTorch as ubyte
trainImages, trainLabels = load_mnist(path=DATA_DIR, kind='train')
validImages, validLabels = load_mnist(path=DATA_DIR, kind='t10k')

In [None]:
trainImages.shape

In [None]:
trainLabels.shape

The images are uint8 (byte) tensors, while the neural network needs floating-point inputs in order to calculate gradients, so we convert them to float and divide by 255 to get values in [0, 1].
Images are usually represented as arrays of integers, where each integer ranges from 0 to 255 to represent pixel intensity (0 being black and 255 being white, for grayscale images). However, when training a neural network, it's typically beneficial to feed in relatively small input values.

By dividing by 255, you are performing a process called normalization, mapping pixel intensities from the range [0,255] to the range [0,1].

In [None]:
train_x = trainImages.view(-1, 1, 28, 28) # reshape from rank-2 (N, 784) to rank-4 (N, 1, 28, 28): batch, channel, height, width
train_y = trainLabels.unsqueeze(1) # add a dimension at index 1: (N,) -> (N, 1)
train_x = train_x.float()/255 
train_y = train_y.type(torch.LongTensor) # LongTensor is a signed 64-bit integer (long integer)

In [None]:
valid_x = validImages.view(-1, 1, 28, 28)
valid_y = validLabels.unsqueeze(1)
valid_x = valid_x.float()/255
valid_y = valid_y.type(torch.LongTensor)

In [None]:
train_y.shape
valid_y[120].item()

**Notes**: The `reduction` argument in PyTorch loss functions specifies the reduction to apply to the output. It determines how the individual losses calculated across the mini-batch are combined into a single loss value.

If you're implementing a custom loss function and you want it to be compatible with fastai's Learner and ClassificationInterpretation, you should make sure your function accepts a reduction argument and applies the specified reduction to the output.

In most cases, you'll want to use 'mean', because it makes the training process more stable and the loss value easier to interpret. However, there might be cases where 'sum' is more appropriate, depending on your specific use case.
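
The effect of each `reduction` value can be compared on a toy batch. The random logits and target indices below are assumptions made up for this sketch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 10)        # 3 items, 10 classes
targets = torch.tensor([1, 4, 8])  # class indices

per_item = F.cross_entropy(logits, targets, reduction='none')   # one loss per item, shape (3,)
mean_loss = F.cross_entropy(logits, targets, reduction='mean')  # scalar: average of per_item
sum_loss = F.cross_entropy(logits, targets, reduction='sum')    # scalar: sum of per_item

print(per_item)
print(mean_loss.item(), sum_loss.item())
```

With `'none'` you keep one loss value per item, which is what tools that rank the worst predictions need.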

In [None]:
def ce_loss(predictions, targets, reduction='mean'):
    targets = targets.squeeze()
    loss = F.cross_entropy(predictions, targets, reduction=reduction)
    return loss

def batch_accuracy(xb, yb):
    preds = torch.argmax(xb, dim=1)  # Predicted class per item: index of the largest value along dimension 1
    correct = (preds == yb.squeeze())  # Compare with targets; correct is a boolean tensor
    return correct.float().mean()  # Booleans become 0.0 or 1.0, so the mean is the fraction of correct predictions
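
Here is a small worked example of the accuracy calculation, using made-up logits for 4 items and 3 classes. It also shows why the `squeeze()` matters when the targets have shape `(N, 1)`.

```python
import torch

logits = torch.tensor([[0.1, 2.0, 0.3],
                       [1.5, 0.2, 0.1],
                       [0.2, 0.1, 0.9],
                       [0.3, 0.8, 0.4]])
targets = torch.tensor([[1], [0], [2], [2]])  # shape (4, 1), like train_y

preds = torch.argmax(logits, dim=1)   # tensor([1, 0, 2, 1])
correct = preds == targets.squeeze()  # compare shape (4,) with shape (4,)
accuracy = correct.float().mean()
print(accuracy.item())                # 0.75

# Without squeeze(), broadcasting compares every prediction with every target
print((preds == targets).shape)       # torch.Size([4, 4])
```

Three of the four predictions match their target, so the accuracy is 0.75.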

In [None]:
## CNN with 3 Conv2d layers (64, 64 and 10 filters) + Dropout & AdaptiveAvgPool2d

learning_rate = 0.05754399299621582
train_dataset = list(zip(train_x,train_y))
valid_dataset = list(zip(valid_x,valid_y))
list_of_classes = validLabels.unique().tolist()

conv_net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=1), # Convolutional layer
    nn.ReLU(), # Activation function
    nn.Dropout(0.5), # Dropout for regularization
    nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=1), # Convolutional layer
    nn.ReLU(), # Activation function
    nn.Dropout(0.5), # Dropout for regularization
    nn.Conv2d(64, 10, kernel_size=5, stride=2, padding=1), # Convolutional layer
    nn.ReLU(), # Activation function
    nn.Dropout(0.5), # Dropout for regularization
    nn.AdaptiveAvgPool2d(1), # Adaptive average pooling
    Flatten(), # Flatten the tensor for the fully connected layer
    nn.Linear(10,10) # Fully connected layer
)

train_dtl = DataLoader(train_dataset, batch_size=256)
valid_dtl = DataLoader(valid_dataset, batch_size=256)
dls = DataLoaders(train_dtl, valid_dtl)
dls.vocab = list_of_classes
learn = Learner(dls,conv_net,opt_func=Adam,lr=learning_rate, loss_func=ce_loss,metrics=batch_accuracy)

I improved the accuracy by switching the optimizer to Adam, and by replacing the `learn.fit()` method with `learn.fit_one_cycle()`, which also schedules the momentum.

Notes: 
- Learning rate `slice` param in `fit_one_cycle()`: as a rule of thumb, choose a range where the lower value is about 10 times smaller than the higher value. 
- The `moms` param is a tuple (mom1, mom2, mom3) that defines the three momentum values used in the One Cycle Policy.
The One Cycle Policy is a learning rate schedule that trains with an increasing and then decreasing learning rate while the momentum moves the opposite way, decreasing and then increasing. 
    - mom1: The initial momentum at the start of the cycle.
    - mom2: The momentum in the middle of the cycle, when the learning rate is at its highest.
    - mom3: The momentum at the end of the cycle.
    
    Typically, mom1 and mom3 are the same, representing a higher momentum value, and mom2 is a lower momentum value. This is based on the 1cycle policy, which suggests using higher momentum at the start and end of the cycle, and lower momentum in the middle. This policy was proposed by Leslie Smith in his paper "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay".
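
The shape of the momentum schedule can be sketched in plain Python. This is a simplified illustration, not fastai's actual implementation: the cosine interpolation, the `pct_start=0.25` split point, and the helper names `cos_anneal` and `one_cycle_momentum` are assumptions made for this example.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle_momentum(pct, moms=(0.95, 0.85, 0.95), pct_start=0.25):
    """Momentum at training progress pct in [0, 1] (simplified sketch)."""
    mom1, mom2, mom3 = moms
    if pct < pct_start:
        # First phase: momentum goes down while the learning rate goes up
        return cos_anneal(mom1, mom2, pct / pct_start)
    # Second phase: momentum goes back up while the learning rate goes down
    return cos_anneal(mom2, mom3, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.125, 0.25, 0.6, 1.0):
    print(pct, round(one_cycle_momentum(pct), 4))
```

The printed values start at mom1, dip to mom2 at the split point, and climb back to mom3 at the end of training.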

In [None]:
# below is the fastai function to help find an appropriate learning rate
learn.lr_find()
# we could use learn.fit_one_cycle to ...

In [None]:
learn.fit_one_cycle(20, slice(0.002511886414140463/10, 0.002511886414140463), moms=(0.95,0.85, 0.95))

## Model Interpretation

Here we'll look at a few of our predictions.

In [None]:
# Get predictions and targets
preds, targs = learn.get_preds()
# Convert predictions to labels
pred_labels = [learn.dls.vocab[p.argmax()] for p in preds]
# Convert targets to labels
true_labels = [learn.dls.vocab[t] for t in targs]
# Now you can compare pred_labels and true_labels
for i in range(5):
    print(f'True label: {true_labels[i]} \nPredicted label: {pred_labels[i]}\n---')

Below we use per-class metrics with scikit-learn's classification report

In [None]:
from sklearn.metrics import classification_report

# Convert target names to strings
target_names = [str(v) for v in learn.dls.vocab]
# Calculate per-class metrics
report = classification_report(true_labels, pred_labels, target_names=target_names)

print(report)

### Manual (single) inference directly on the model
`INPUT` is our test image.

In [None]:
INPUT = valid_x[101]
img = transforms.ToPILImage()(INPUT) 
img

In [None]:
output = learn.model(INPUT.unsqueeze(0))

# Get the class with the highest probability
_, predicted_class = torch.max(output, 1)

print(predicted_class) 

In [None]:
# we can create a mapping from the predicted class index to a label
class_names = ['zero', 'un', 'deux', 'trois', 'quatre', 'cinq', 'six', 'sept', 'huit', 'neuf']
predicted_class_name = class_names[predicted_class.item()]

print(f"The name of this number in French is: {predicted_class_name}")

In [None]:
learn.save('/kaggle/working/this-is-mnist-model-f1c')

### Export/save & inference
Due to an issue I had with the fastai export and predict methods, I chose to export and run inference with PyTorch (without the save() & predict() API methods from Fastai).

In [None]:
# Save model state dictionary
torch.save(learn.model.state_dict(), Path()/'models/this-is-mnist-model-f1c.pth')

# Load model state dictionary
model_state_dict = torch.load(Path()/'models/this-is-mnist-model-f1c.pth')

In [None]:
# Apply the state dictionary to your model
conv_net.load_state_dict(model_state_dict)

In [None]:
# Ensure the model is in evaluation mode
conv_net.eval()

# Preprocess your input
input_data = INPUT
# Add an extra dimension for batch size
input_data = input_data.unsqueeze(0)

# Make sure the model and the input data are on the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv_net.to(device)
input_data = input_data.to(device)

# Pass the input data to the model
with torch.no_grad():
    output = conv_net(input_data)

# Postprocess the output
probabilities = torch.nn.functional.softmax(output[0], dim=0)
n_predicted_class = probabilities.argmax().item()

print(n_predicted_class)

### Mapping the result
We can create a mapping from the predicted class index to a different label. Below, we've mapped each index to the French word for the digit.

In [None]:
class_names = ['zero', 'un', 'deux', 'trois', 'quatre', 'cinq', 'six', 'sept', 'huit', 'neuf']
predicted_class_name = class_names[n_predicted_class]  # index returned by the PyTorch inference above

print(predicted_class_name)

## Thanks! That's it! (for the basics)
In summary, in addition to the basics of machine learning engineering, I've learned how to use the Fastai API and integrate a custom model with it - a CNN in this case. I've also learned how to create a multi-class classification model, the basics of computer vision, and how to use PyTorch for saving a model and performing inference on it. Lastly, I've learned how to implement the model in a live project and create a demo as a Hugging Face space. You can view the demo on huggingface [here](https://huggingface.co/spaces/mgspl/mnist_basic).
 
Productive comments are welcome.

Thanks for reading!

Oh, and for the demo app as a standalone Python project, see below. 
### More about `app.py`

Some conversion was needed to reproduce the same inference code in a Python file.

***Manually map the state names from conv_net base model***

In `conv_net`, which is an `nn.Sequential` model, the layers are automatically named with increasing integers starting from 0. However, in the `NConvNet` class, the layers are named after the attribute names given to them (conv1, conv2, conv3, fc).
Because the saved state dictionary uses the integer names, loading it into `NConvNet` fails unless the keys are renamed. To resolve this issue, you can manually map the names from `conv_net` to the names in `NConvNet`.
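
The renaming can also be written as a loop over a small lookup table. The sketch below uses two tiny stand-in models (arbitrary sizes, not the MNIST network); the names `NamedNet` and `prefix_map` are hypothetical and exist only for this example.

```python
from collections import OrderedDict
import torch
from torch import nn

# Stand-in for a model built with nn.Sequential: keys are '0.weight', '2.bias', ...
seq_model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

class NamedNet(nn.Module):
    """Same layers, but stored under attribute names: keys are 'fc1.weight', ..."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 3)
        self.fc2 = nn.Linear(3, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Sequential index -> attribute name (ReLU at index 1 has no parameters)
prefix_map = {'0': 'fc1', '2': 'fc2'}

new_state_dict = OrderedDict()
for key, value in seq_model.state_dict().items():
    index, param = key.split('.', 1)  # '0.weight' -> ('0', 'weight')
    new_state_dict[f'{prefix_map[index]}.{param}'] = value

named_model = NamedNet()
named_model.load_state_dict(new_state_dict)

x = torch.randn(5, 4)
print(list(new_state_dict))
print(torch.allclose(seq_model(x), named_model(x)))  # both models now give the same output
```

Layers without parameters (ReLU, Dropout, pooling, Flatten) never appear in the state dictionary, which is why only indices 0, 3, 6 and 11 show up in the mapping for `conv_net`.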

In [None]:
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

class NConvNet(nn.Module):
    def __init__(self):
        super(NConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=1)
        self.conv3 = nn.Conv2d(64, 10, kernel_size=5, stride=2, padding=1)
        self.fc = nn.Linear(10,10)
        self.dropout = nn.Dropout(0.5)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.flatten = Flatten()

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.dropout(x)
        x = F.relu(self.conv2(x))
        x = self.dropout(x)
        x = F.relu(self.conv3(x))
        x = self.dropout(x)
        x = self.avgpool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x

def n_predict(img, withGradio=False):
    if withGradio:
        img = Image.fromarray(img)
    
    transform = transforms.Compose([
        transforms.Resize((28, 28)),
        transforms.Grayscale(),
        transforms.ToTensor(),
    ])

    img_tensor = transform(img)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Add a batch dimension and move the input to the same device as the model
    input_data = img_tensor.unsqueeze(0).to(device)

    # Get the state_dict of conv_net
    state_dict = torch.load('/kaggle/working/models/this-is-mnist-model-f1c.pth', map_location=device) # adjust the path if needed
    
    # Define a new state_dict for NConvNet
    new_state_dict = OrderedDict()

    # Manually map the state names from conv_net base model
    new_state_dict['conv1.weight'] = state_dict['0.weight']
    new_state_dict['conv1.bias'] = state_dict['0.bias']
    new_state_dict['conv2.weight'] = state_dict['3.weight']
    new_state_dict['conv2.bias'] = state_dict['3.bias']
    new_state_dict['conv3.weight'] = state_dict['6.weight']
    new_state_dict['conv3.bias'] = state_dict['6.bias']
    new_state_dict['fc.weight'] = state_dict['11.weight']
    new_state_dict['fc.bias'] = state_dict['11.bias']

    # Load the new_state_dict into NConvNet
    model = NConvNet()
    model.load_state_dict(new_state_dict)
    
    model.to(device)
    model.eval()  # Set the model to evaluation mode
    
    # Pass the input data to the model
    with torch.no_grad():
        output = model(input_data)

    # Postprocess the output
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    n_predicted_class = probabilities.argmax().item()
    return n_predicted_class

In [None]:
n_predict(img)

Notebook on Kaggle: https://www.kaggle.com/code/mindgspl/exercise-mnist
Demo on Huggingface: https://huggingface.co/spaces/mgspl/mnist_basic