# Implementing CNNs for image recognition.

In the last lab, we implemented an MLP to recognise handwritten digits. MLPs are very useful approximators, but they cannot exploit spatial structure: a fully connected layer treats every input the same way, regardless of where it sits relative to the other inputs. Convolutional Neural Networks (CNNs), on the other hand, capture spatial information through convolution. This makes them better suited to structured inputs such as images.

As usual, we will approach this problem in three steps : defining the dataset, defining the model, and performing the optimization.

The dataset is still the handwritten digits dataset used in the last lab: the small MNIST-like set of 8x8 images that scikit-learn provides through `load_digits`.

## Set up the environment

We first set up the environment and the necessary imports, as usual.

In [None]:
!pip install numpy matplotlib scikit-learn

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

## Dataset preparation

As in the last lab, we will prepare the data, preprocess it, and split it into train and test sets.

In [None]:
# We load the dataset using the function load_digits
digits = load_digits()

## check the minimum and maximum value of a pixel in the dataset
print(f"Pixel values are between {np.min(digits.data)} and {np.max(digits.data)}")

processed_data = digits.data / 16

print(f"After normalization, pixel values are between {np.min(processed_data)} and {np.max(processed_data)}")

In [None]:
processed_data.shape

The cell above shows that our digits are actually flattened into vectors of dimension 64. We need 2D images to be able to process them through a CNN. In fact, 2D CNN layers expect inputs of shape (C,H,W), where C is the number of channels, H is the height of the image, and W is the width of the image.
In our case, the digit images are grayscale, so C=1, and each image is 8x8. The shape of each image should therefore be (1,8,8).
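
As a minimal sketch of what this reshape does, here is the operation on a small made-up array (the name `flat_images` and its 5 samples are assumptions for illustration, not the lab's data). NumPy keeps row-major order, so pixel 9 of a flat vector lands at row 1, column 1 of the image.

```python
import numpy as np

# Toy stand-in for the flattened dataset: 5 "images" of 64 pixels each.
flat_images = np.arange(5 * 64, dtype=np.float64).reshape(5, 64)

# -1 lets NumPy infer N; the remaining axes are (C, H, W) = (1, 8, 8).
images = flat_images.reshape(-1, 1, 8, 8)

print(images.shape)  # (5, 1, 8, 8)
# Row-major order is preserved: flat index 9 is row 1, column 1.
print(images[0, 0, 1, 1] == flat_images[0, 9])  # True
```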

In [None]:
## TODO : Reshape the images so that they can be input to a cnn layer.
## The data before reshaping is of shape N,64
## It should be of shape N, 1, 8, 8


print(processed_data.shape) # should be 1797, 1, 8, 8 

We also need a train-test split, with 20% of the data set aside for testing.
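
One possible way to build such a split is to shuffle the indices once and cut them at the 20% mark. The sketch below runs on toy arrays; the helper name `shuffled_split`, its return structure, and the fixed seed are assumptions for illustration, and your own function may be organised differently.

```python
import numpy as np

def shuffled_split(data, targets, holdout_fraction=0.2, seed=0):
    """Hypothetical helper: returns ((kept_x, kept_y), (held_x, held_y))."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))       # shuffle sample indices once
    n_holdout = int(len(data) * holdout_fraction)
    holdout_idx, keep_idx = indices[:n_holdout], indices[n_holdout:]
    return (data[keep_idx], targets[keep_idx]), (data[holdout_idx], targets[holdout_idx])

# Toy data where each sample equals its label, so alignment is easy to check.
toy_x = np.arange(10).reshape(10, 1)
toy_y = np.arange(10)
(train_x, train_y), (test_x, test_y) = shuffled_split(toy_x, toy_y)
print(len(train_x), len(test_x))  # 8 2
```

Indexing data and targets with the same index arrays keeps every sample paired with its label.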

In [None]:
## TODO : implement the train test split
def train_test_split(data, targets):
    train_set = None
    test_set = None
    return train_set, test_set

train_set, test_set = train_test_split(processed_data, digits.target)

### Validation split

The data is already split into a train and a test set. We will now introduce a new set that is also important for training : the validation set.

The validation set is a portion of the train set that we set aside to monitor the model's performance during training. It helps identify overfitting (when the model performs well on training data but poorly on new data) and provides a way to choose the best version of the model before final testing. Because the test set is never used for these decisions, the final test score remains an honest estimate of performance on unseen data.

Typically, we reserve 20% of the train set for validation.

In [None]:
## TODO : split the train set into train-validation.
## Hint : it's very similar to a train-test split
def train_val_split(data, targets):
    pass

train_set, val_set = train_val_split(*train_set)  # assumes train_set is a (data, targets) pair

## Implementation of the convolutional neural network.

### Forward pass

A 2D convolution layer takes a 2D image as input and outputs feature maps by sliding filters over the image. More details and visualizations are available [here](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53).

A conv layer has parameters that govern its behaviour : 
- The kernel size or size of the filter, a standard size is 3x3.
- The number of filters, this will determine the number of output feature maps.
- The stride, which controls how many pixels the filter moves between two consecutive positions.

You can also add padding around the image if you want the output feature map to have the same dimension as the input.

For example, suppose we have a ConvLayer with 10 filters of size $5\times5$ that receives an input with 3 channels : 
- The number of weights per filter is $5 \times 5 \times 3 = 75$ (filter_size $\times$ input_channels), plus one bias.
- The number of output feature maps is $10$, the number of filters.

Each of the 10 filters slides over the input image, computing a dot product between the filter's weights and the corresponding region of the input image.
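
The arithmetic of this example can be checked with two small helper functions. This is only a sketch: the function names and the $32\times32$ input size are assumptions chosen for illustration, and the output-size formula is the standard $\lfloor (H + 2p - k)/s \rfloor + 1$.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Size of one spatial dimension of the output feature map."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def conv_parameter_count(kernel_size, input_channels, num_filters):
    """Total trainable parameters: weights plus one bias per filter."""
    weights_per_filter = kernel_size * kernel_size * input_channels
    return num_filters * (weights_per_filter + 1)

print(conv_parameter_count(5, 3, 10))      # 760 = 10 * (75 + 1)
print(conv_output_size(32, 5))             # 28, no padding
print(conv_output_size(32, 5, padding=2))  # 32, padding keeps the size
print(conv_output_size(8, 3))              # 6, an 8x8 digit with a 3x3 filter
```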

We can write the computation of a convolution layer for an image patch $x$ as : 
$$ y = F \circledast x + b$$

Where $x$ is a patch of the same size as the filter $F$, $y$ is a pixel value and $b$ is the bias of the filter. $\circledast $ is the convolution operation.
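
To make the formula concrete, here is a small sketch that applies it to every patch of a toy image. The function name `correlate_valid` and the toy values are assumptions for illustration. Note that, as in most deep learning libraries, the filter is not flipped, so this is strictly a cross-correlation.

```python
import numpy as np

def correlate_valid(image, kernel, bias=0.0):
    """image: (C, H, W), kernel: (C, k, k) -> (H-k+1, W-k+1), stride 1, no padding."""
    _, height, width = image.shape
    k = kernel.shape[-1]
    out = np.zeros((height - k + 1, width - k + 1))
    for row in range(out.shape[0]):
        for col in range(out.shape[1]):
            patch = image[:, row:row + k, col:col + k]      # the patch x
            out[row, col] = np.sum(patch * kernel) + bias   # y = F * x + b
    return out

image = np.arange(16, dtype=np.float64).reshape(1, 4, 4)  # pixels 0..15
kernel = np.ones((1, 3, 3))                                # sums each 3x3 patch
print(correlate_valid(image, kernel, bias=1.0).tolist())
# [[46.0, 55.0], [82.0, 91.0]]
```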

### Backward pass

We now need to compute the gradients of the convolution operation to update the filter weights and the bias. For this layer we need three gradients, each obtained through the chain rule from $\frac{d\mathcal{L}}{dy}$, the gradient with respect to the output feature map $y$ :
$$ \frac{d\mathcal{L}}{dx} $$
Where $x$ is the input.
$$ \frac{d\mathcal{L}}{dF} $$
Where $F$ is the filter tensor (size filter_size_x $\times$ filter_size_y $\times$ input_channels)
$$ \frac{d\mathcal{L}}{db} $$
Where $b$ is the bias of the filter.



We give the expressions of the gradients here :
$$ \frac{d\mathcal{L}}{db_i} =  \sum_{j}\frac{d\mathcal{L}}{dy_{ij}} $$
Where $b_i$ is the bias of filter $i$ and $y_{i}$ is the $i$-th channel of the output $y$. We sum over all the pixels $j$ of that channel.
$$ \frac{d\mathcal{L}}{dF} = x \circledast \frac{d\mathcal{L}}{dy}$$
$$ \frac{d\mathcal{L}}{dx} = F^* \circledast pad(\frac{d\mathcal{L}}{dy}, x)$$

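Where $F^*$ is the filter $F$ rotated by $180°$ and $pad(a,b)$ is a function that zero-pads $a$ (by kernel_size $- 1$ pixels on each side when the stride is 1) so that the result of the convolution has the same size as $b$.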
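
A useful habit is to check these expressions against finite differences before plugging them into a layer. The sketch below does this for a single channel and a single filter with stride 1. The helper names, the random toy data, and the surrogate loss $\mathcal{L} = \sum y \cdot \frac{d\mathcal{L}}{dy}$ are assumptions made for the check, not part of the lab's code.

```python
import numpy as np

def correlate2d_valid(array, kernel):
    """Slide `kernel` over `array` (stride 1, no padding) without flipping it."""
    k_h, k_w = kernel.shape
    out = np.zeros((array.shape[0] - k_h + 1, array.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(array[i:i + k_h, j:j + k_w] * kernel)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))      # single-channel input
F = rng.normal(size=(3, 3))      # single square filter
b = 0.5
dLdy = rng.normal(size=(3, 3))   # made-up upstream gradient, same shape as y

def loss(x, F, b):
    # Surrogate loss whose gradient with respect to y is exactly dLdy.
    return np.sum((correlate2d_valid(x, F) + b) * dLdy)

# Analytical gradients, following the three formulas.
dLdb = dLdy.sum()
dLdF = correlate2d_valid(x, dLdy)
pad = F.shape[0] - 1             # kernel_size - 1 zeros on each side
dLdx = correlate2d_valid(np.pad(dLdy, pad), np.flip(F))

def numerical_gradient(array, eps=1e-6):
    """Central finite differences of loss(x, F, b) with respect to `array`."""
    grad = np.zeros_like(array)
    for idx in np.ndindex(*array.shape):
        original = array[idx]
        array[idx] = original + eps
        plus = loss(x, F, b)
        array[idx] = original - eps
        minus = loss(x, F, b)
        array[idx] = original    # restore the entry before moving on
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

eps = 1e-6
numerical_dLdb = (loss(x, F, b + eps) - loss(x, F, b - eps)) / (2 * eps)

bias_ok = np.isclose(dLdb, numerical_dLdb, atol=1e-6)
weights_ok = np.allclose(dLdF, numerical_gradient(F), atol=1e-6)
input_ok = np.allclose(dLdx, numerical_gradient(x), atol=1e-6)
print(bias_ok, weights_ok, input_ok)  # all three should be True
```

The same check can be reused on your own `backward` method by comparing its outputs with the numerical gradients.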

Similarly to last time, we will implement the convolutional layer. We provide a structure consistent with the previous lab's implementation, but you can come up with your own.

Convolution is more difficult to implement. You can start by implementing a convolution function that computes one pixel given a window of the same size as the filter, and then apply it to each window.
The stride and padding are optional.

You can also start with one filter, and then apply the same function to every filter to get all the feature maps.

In [None]:
## TODO : implement the pixel convolution operation.
def pixel_convolution(region, filter, bias):
    """This function implements the convolution operation between the filter and the 
    region of the image given as input.
    The region and the filter should have the exact same shape. The bias is a scalar value.
    The output should be a scalar value that represents the pixel being computed."""
    return 0

## TODO : implement the convolution filter. 
class ConvolutionFilter : 
    """This class implements a single convolution filter. 
    It takes as input a feature map of
    size (C, H, W) and runs its filter of size (C, kernel_size, kernel_size) over it.
    The output is of size (1, H-kernel_size+1, W-kernel_size+1)"""

    def __init__(self, input_dim, kernel_size) -> None:
        self.weights = 0 ## filter, what shape should it have ?
        self.bias = 0 ## bias, it should be a scalar
    
    def forward(self, x) : 
        """x is a feature map of size (C, H, W), the convolution filter should slide over the
        image and compute the pixel convolution for the output map.
        The output map should have size 1, H-kernel_size+1, W-kernel_size+1"""
        
        return x

    def backward(self, dLdy, x): 
        """This function takes as input the derivative with respect to the output of this filter,
        and the input of this filter, and performs the backward propagation for this filter.
        It should compute 3 values : 
                - the derivative with respect to the bias.
                - the derivative with respect to the filter.
                - the derivative with respect to the input x.
        You are given indications in comments to help you through it."""

        ## Derivative with respect to bias, it's a scalar (see formula)
        dLdb = 0

        ## Derivative with respect to weights
        ## It should have the same shape as the filter
        dLdw = np.zeros_like(self.weights, dtype=np.float32)
        
        ## TODO : compute the derivative with respect to the weight. 
        # it is essentially a convolution between the input  
        # and the derivative of the output
        


        ## Derivative with respect to input
        # it should have the same shape as the input.
        dLdx = np.zeros_like(x, dtype=np.float32)

        ## TODO : compute the derivative with respect to the input.
        # it is essentially a convolution between the filter rotated by 180 degrees (use np.flip)
        # and the derivative of the output.
        # don't forget to PAD the derivative to get the exact shape. (np.pad)


        ## TODO : save your gradients for step(), then return the derivative wrt the input
        return dLdx
    
    def step(self, lr) :
        pass
    
    def __call__(self, x):
        return self.forward(x)

         

We give below the implementation of a 2D convolution layer using the ConvolutionFilter implemented above.

In [None]:
class Convolution2DLayer :
    """This implements the 2D convolution layer, it takes as input a matrix and runs its filters."""
    def __init__(self, input_dim, kernel_size, num_filters, stride=None, padding=None): # stride and padding are optional
        self.input_dim = input_dim
        self.num_filter = num_filters

        self.filters = []
        for f in range(num_filters): 
            self.filters.append(ConvolutionFilter(input_dim, kernel_size))
        
    def forward(self, x):
        """This is the forward pass, to compute the output y_pred given the input.
        It computes the output of each filter and stacks them together"""
        self.x = x
        y = []
        for f in self.filters : 
            y.append(f(x))
        
        y = np.stack(y)
        return y
    
    def backward(self, gradient):
        """The backward pass allows you to compute the gradient of this layer"""
        dldx = np.zeros_like(self.x, dtype=np.float32)
        for i,f in enumerate(self.filters) :
            dldx += f.backward(gradient[i], self.x) 
        return dldx
    
    def step(self, alpha):
        """Take a gradient descent step"""
        for f in self.filters : 
            f.step(alpha)

    def __call__(self, x) : 
        """To ensure we can call this module."""
        return self.forward(x)

We need an activation function for our convolution layer. Sigmoid is a possible choice, but we will use ReLU (Rectified Linear Unit). The ReLU function is defined as follows : 

$$ ReLU(x) = max(0, x) $$

It's a simple non-linear function that outputs 0 if $x$ is negative, and $x$ otherwise. Its derivative with respect to x is :
$$ ReLU'(x) = \mathbf{1}(x > 0) $$
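
On a small made-up vector, the function and its derivative behave as follows. This is only an illustration of the two formulas; the variable names and the upstream gradient of 3 are assumptions. In the backward pass, the upstream gradient is multiplied by the 0/1 mask.

```python
import numpy as np

values = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

activated = np.maximum(0.0, values)           # ReLU(x) = max(0, x)
mask = (values > 0).astype(values.dtype)      # ReLU'(x) = 1 if x > 0 else 0

upstream = np.full_like(values, 3.0)          # made-up gradient from the next layer
downstream = upstream * mask                  # chain rule

print(activated.tolist())   # [0.0, 0.0, 0.0, 0.5, 2.0]
print(downstream.tolist())  # [0.0, 0.0, 0.0, 3.0, 3.0]
```

Note that the mask depends on the input of the forward pass, so a layer implementation has to remember it.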

In [None]:
## TODO : implement relu function and layer forward/backward
def relu(x):
    return 0

class ReLULayer : 
    """This implements the ReLU layer"""
    
    def __init__(self):
        pass

    def forward(self, x):
        """This is the forward pass, computes the reLU"""
        return 0
    
    def backward(self, gradient):
        """Computes the gradient of relu for backpropagation.
        Hint : there is no parameter, so only the gradient w.r.t the input is necessary"""
        return gradient

    def step(self, alpha):
        pass
    
    def __call__(self, x):
        return self.forward(x)

#### Pooling

The last layer we will implement is a pooling layer. A pooling layer takes as input a feature map and returns a new, downsized feature map, where each pixel is the max (for maxpooling) or the average (for average pooling) of the corresponding window in the input.

For example, if the image is of size $10\times 10$, a pooling layer of window size $2\times2$, moved 2 pixels at a time, will produce an output feature map of size $5\times5$.

Pooling layers are useful for downsampling the feature maps and working with smaller size input, which improves the computational efficiency of the model.
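
Here is a minimal sketch of $2\times2$ max pooling on a toy feature map, together with the way the gradient is routed back. The function name `max_pool_2x2`, the toy values, and the reshape-based approach are assumptions for illustration; an explicit double loop over windows works just as well.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """feature_map: (C, H, W) -> (C, H//2, W//2), window 2, stride 2."""
    channels, height, width = feature_map.shape
    # Drop a trailing row/column if H or W is odd.
    cropped = feature_map[:, :height - height % 2, :width - width % 2]
    windows = cropped.reshape(channels, height // 2, 2, width // 2, 2)
    return windows.max(axis=(2, 4))

feature_map = np.array([[[1, 3, 2, 0],
                         [4, 2, 1, 5],
                         [0, 1, 9, 8],
                         [2, 6, 7, 3]]], dtype=np.float64)
pooled = max_pool_2x2(feature_map)
print(pooled.tolist())  # [[[4.0, 5.0], [6.0, 9.0]]]

# Backward: each upstream gradient goes to the position that held the max,
# every other position gets 0. (With ties, this simple mask would send the
# gradient to every tied position.)
upstream = np.array([[[10.0, 20.0], [30.0, 40.0]]])
expanded_max = pooled.repeat(2, axis=1).repeat(2, axis=2)
expanded_grad = upstream.repeat(2, axis=1).repeat(2, axis=2)
dLdx = (feature_map == expanded_max) * expanded_grad
print(dLdx[0].tolist())
# [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 20.0], [0.0, 0.0, 40.0, 0.0], [0.0, 30.0, 0.0, 0.0]]
```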

In [None]:
## TODO : implement a max pooling layer
class MaxPoolingLayer : 
    """This implements the pooling layer.
    You can focus on pooling layers with a window size of 2, as they
    are the typical values we use. 
    A pooling layer works by sliding its window over the input and taking the 
    max value of the pixels in that window. """
    
    def __init__(self, window_size):
        self.x = None

    def forward(self, x):
        """This is the forward pass, computes the pooling.
        For an input of size (C, H, W), it should (in case of window_size = 2)
        return an output of size (C, H//2, W//2) where each value is the max of a 
        2x2 window moved 2 pixels at a time."""
        self.x = x
        return x
    
    def backward(self, dLdy):
        """Computes the gradient of pooling for backpropagation.
        Hint : there is no parameter, so only the gradient w.r.t the input is necessary."""
        
        ## TODO : Compute the gradient with respect to the input, the size is supposed to be the 
        # size of the input of the forward propagation
        dLdx = np.zeros_like(self.x, dtype=np.float32)

        ## No operation has been done apart from taking the max. This means that for every window,
        # the values that were not selected by the pooling should have a gradient of 0,
        # while the value that was selected should receive the corresponding upstream gradient
        # from dLdy (the local derivative of the max with respect to that value is 1).
        
        return dLdx

    def step(self, alpha):
        pass
    
    def __call__(self, x):
        return self.forward(x)

Since this is a classification problem, the model needs to output a vector containing one probability per class. This is a 1D output, while the feature maps produced by the convolution layers have shape (C,H,W).
We thus need a linear layer (or an MLP) at the end of our model to transform the flattened feature maps into the desired output.

You can reuse your previous lab's implementation here for the linear layer. 

In [None]:
class LinearLayer :
    """Use your previous lab implementation"""

In [None]:
## We give you a flatten layer to be able to pass from a 2D representation to a 1D representation 
# so you can use MLPs after CNN/pooling.
class FlattenLayer :
    def __init__(self):
        self.shape = None
    def forward(self, x): 
        self.shape = x.shape
        x = x.flatten()
        return x
    
    def backward(self, grad) : 
        grad = grad.reshape(*self.shape)
        return grad

    def step(self, lr):
        pass
    def __call__(self, x) :
        return self.forward(x)

Now that we have all the ingredients of our model, we can implement it.
Taking inspiration from the previous lab, implement the CNN class. It should take as input the description of your CNN layers, and the description of your MLP layers. Then it should compute the Convolution part, flatten the input, and compute the MLP part.

The output should be a vector of size 10 (we have 10 classes) exactly like the MLP lab.
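
Before wiring the layers together, it helps to trace the shapes through the network so that the first linear layer gets the right input size. The sketch below assumes $3\times3$ filters with stride 1 and no padding, and $2\times2$ pooling; the helper `trace_shapes` is hypothetical and only does the bookkeeping.

```python
def trace_shapes(input_shape, filters_per_block, kernel_size=3, pool_size=2):
    """Hypothetical helper: lists the shape after each conv and pooling layer."""
    channels, height, width = input_shape
    shapes = [("input", (channels, height, width))]
    for index, num_filters in enumerate(filters_per_block):
        # Convolution: one output channel per filter, size shrinks by k - 1.
        channels = num_filters
        height, width = height - kernel_size + 1, width - kernel_size + 1
        shapes.append((f"conv{index}", (channels, height, width)))
        # Pooling: spatial size is divided by the window size.
        height, width = height // pool_size, width // pool_size
        shapes.append((f"pool{index}", (channels, height, width)))
    shapes.append(("flatten", (channels * height * width,)))
    return shapes

for name, shape in trace_shapes((1, 8, 8), [32]):
    print(name, shape)
# input (1, 8, 8)
# conv0 (32, 6, 6)
# pool0 (32, 3, 3)
# flatten (288,)
```

Under these assumptions, a single block with 32 filters feeds 288 values into the first linear layer. The activation layer is left out because it does not change the shape.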

In [None]:
## TODO : implement the full CNN

class CNN :
    """This implements the CNN, it's a combination of convolution layers 
    and linear layers to output the prediction for each class.
    It should implement CNN blocks, where each block is :
            - A Convolution layer
            - A Pooling Layer
            - An activation layer
    Then after all the blocks are implemented (1 is enough), it should use a flatten layer.
    Then once the data is flattened, you can use linear layers to produce the desired output.
    """
    def __init__(self, input_channel, CNN_layer, FC_layers):
        """CNN_layer is a list like [2, 4, 5] and describe the number of filter in the Convolution layer per block.
        This particular example [2,4,5] means 3 blocks, the first one with 2 filters, second one
        with 4 filters, third one with 5.
        FC layer describes the linear part similar to the MLP class in previous lab."""
    
        

    def forward(self, x):
        """This is the forward pass, to compute the output y_pred given the input. 
        It should pass through the layers of the model"""
        return x
    
    def backward(self, gradient):
        """Computes the gradient of each layer"""
        return gradient

    def step(self, lr) :
        pass
    
    def __call__(self, x):
        return self.forward(x)

### Gradient clipping.

Because you are using ReLU instead of Sigmoid, you may run into exploding gradients. Unlike the sigmoid, ReLU is unbounded, so activations and gradients can keep growing as positive values accumulate through the layers. This can hinder learning, because the resulting huge update steps can overshoot the minimum. 

To alleviate this issue, we introduce gradient clipping: if the norm of the gradient exceeds a set threshold, we rescale the gradient so that its norm equals that threshold. A typical threshold for the gradient norm is 10.
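
A minimal sketch of norm-based clipping, on two made-up gradient vectors, could look like this. The function name `clip_by_norm` is an assumption, chosen so that it does not clash with the `clip` function you are asked to write.

```python
import numpy as np

def clip_by_norm(gradient, threshold=10.0):
    """Rescale `gradient` so its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(gradient)   # 2-norm of the flattened array
    if norm > threshold:
        return gradient * (threshold / norm)
    return gradient

large = np.array([30.0, 40.0])   # norm 50, above the threshold
small = np.array([3.0, 4.0])     # norm 5, left untouched
print(clip_by_norm(large).tolist())  # [6.0, 8.0]
print(clip_by_norm(small).tolist())  # [3.0, 4.0]
```

Rescaling keeps the direction of the gradient and only shortens it, which is the difference with clipping each component independently.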

In [None]:
## TODO : Implement the gradient clipping
def clip(value, clip_threshold=10): 
    
    return value

## TODO : to make it effective, modify the CNN class's backward method
# so that it clips the gradient after every layer.

## Training procedure.

Since we are in a classification problem, the training procedure is the same as the previous lab. We will use cross entropy as our loss function.
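
As a reminder of how the loss is computed, here is a sketch of a numerically stable softmax followed by cross entropy on made-up logits. The function names, the small `eps` inside the logarithm, and the toy values are assumptions for illustration.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax with the max subtracted first to avoid overflow in exp."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy_loss(probabilities, one_hot_target, eps=1e-12):
    """Cross entropy between predicted probabilities and a one-hot target."""
    return -np.sum(one_hot_target * np.log(probabilities + eps))

logits = np.array([2.0, 1.0, 0.1])    # made-up scores for 3 classes
target = np.array([1.0, 0.0, 0.0])    # one-hot label for class 0

probs = stable_softmax(logits)
loss_value = cross_entropy_loss(probs, target)
print(probs, loss_value)

# When softmax is followed by cross entropy, the gradient of the loss
# with respect to the logits simplifies to (probs - target).
grad_logits = probs - target
```

Subtracting the maximum does not change the result of the softmax, but it prevents `np.exp` from overflowing on large logits.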

Implement the training loop and train your model. Your code from the previous lab should work.

There is one modification: you should implement a `validation` loop that makes use of the validation dataset.
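
The bookkeeping for keeping the best model can be sketched independently of the CNN. In the toy example below, `ToyModel`, its fake training update, and `toy_validation_loss` are all hypothetical stand-ins; the point is the use of `copy.deepcopy`, without which `best_model` would keep changing along with the model being trained.

```python
import copy

class ToyModel:
    """Hypothetical stand-in for the CNN: a single scalar weight."""
    def __init__(self):
        self.weight = 0.0

def toy_validation_loss(model):
    # Pretend validation loss, minimal when the weight equals 3.
    return (model.weight - 3.0) ** 2

model = ToyModel()
best_model, best_loss = None, float("inf")
history = []

for epoch in range(10):
    model.weight += 1.0                 # pretend training update, overshoots after epoch 2
    val_loss = toy_validation_loss(model)
    history.append(val_loss)
    if val_loss < best_loss:            # keep a snapshot of the best model so far
        best_loss = val_loss
        best_model = copy.deepcopy(model)

print(best_model.weight, best_loss)  # 3.0 0.0
print(model.weight)                  # 10.0, the live model kept changing
```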

In [None]:
## TODO : implement the softmax function 
def softmax(x) :
    return 0

## TODO : implement the cross entropy
def cross_entropy(y_pred, y_true):
    return 0

## TODO : implement the one hot function
def one_hot(n_classes, y):
    return y

In [None]:
# TODO : implement the training loop with the validation loop
def validation(model, val_data, val_labels):
    """Validation loop : should go over all the validation data and compute
    the loss. The validation loss is the average of the loss over the data."""
    return 0

def train(model, train_data, train_labels, lr, num_epochs, val_set):
    losses = []
    best_model = None
    best_model_loss = float("inf")
    ## TODO : implement the train loop (similar to previous lab)
        ## TODO : every few epochs, call the validation loop and update the best model.
        ## The best model is the one that minimizes the validation loss.
    return losses, best_model

In [None]:
# TODO : train your CNN
model = CNN(1, [32], [30, 10])
## Optimization parameters
lr = 0.01
num_epochs = 50
losses, best_model = train(model, *train_set, lr, num_epochs, val_set)  # assumes train_set is a (data, targets) pair

## Evaluation 

It's time to evaluate your model on your test data. You should compute the accuracy and recall per class. The code from the previous lab should work.
You can also compute and show the confusion matrix.
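
As a sketch of how these metrics relate to each other, the example below builds a confusion matrix from made-up labels and predictions and derives accuracy and per-class recall from it. The function name and the toy arrays are assumptions for illustration; rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i, j] counts samples of true class i predicted as class j."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)   # handles repeated (i, j) pairs
    return matrix

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(matrix) / matrix.sum()           # correct / total
recall = np.diag(matrix) / matrix.sum(axis=1)        # correct / samples of each true class

print(matrix.tolist())  # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
print(accuracy)         # 5 correct out of 7
print(recall)           # one value per class
```

If a class never appears in the test labels, its row sum is 0 and the recall for that class is undefined, so that case needs separate handling.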

In [None]:
## TODO : evaluate the model's accuracy and recall per class on the TEST set. 
## (optional) : compute and plot the confusion matrix
def test(model, test_set) :
    accuracy = 0
    return accuracy
