<a href="https://colab.research.google.com/github/Cognition-And-Vision-Amsterdam-CAVA/UvA2024NeuroAI/blob/main/TutorialDay0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial day 0 - PyTorch and Convolutions - UvA NeuroAI Summerschool 2024
Author: Niklas Müller

---

This Tutorial introduces you to one of the most commonly used computer vision approaches: Convolutional Neural Networks.

We will be working with a Python library called [PyTorch](https://pytorch.org/) that is designed to facilitate work with neural networks. It allows us to use a GPU effectively, both for ordinary numerical computations and for more complex algorithms such as those involved in training a neural network (e.g., backpropagation).

---

This assignment is structured as follows:

1. Build a simple Neural Network model and let it solve a simple task.
2. Showcase the shortcomings of non-convolutional models
3. Build a convolutional neural network and compare it to the simple one.
4. Visualize convolutional kernels to get insights into how CNN solve object classification tasks.
5. Use a pre-trained, powerful convolutional neural network to improve performance.
6. BONUS: Vision Transformers

**Good luck with this assignment!**

---

Deep learning is a data science method that is often applied to tasks that would require a lot of repetitive human effort but are still too complex to solve with straightforward code. Examples of such problems are object classification or detection in images (including medical images, e.g., for tumor detection), language processing (including translation, sentiment analysis, and question answering), and also tasks like stock trading, image generation, or fraud detection.

All of these problem settings consist of three basic parts. First, the problem needs to be defined as a specific task to solve, the so-called objective function. Second, there needs to be a dataset on which this task is to be solved. Third, a model is needed that will attempt to solve the task on the given dataset.

In this assignment, we will focus mostly on the dataset part and will also touch upon the importance of choosing the right model for the right task. We will generally ignore the wide variety of objective functions that can be applied when using the same model or the same dataset; however, we will still clearly define the task and objective function.

## PyTorch

We will be using the [PyTorch](https://pytorch.org/) library, which is "an optimized tensor library for deep learning using GPUs and CPUs". Tensors are a data format that allows us to use the graphics processing unit (GPU). The GPU is the video chip any computer needs in order to display non-textual content. Initially, powerful GPUs were developed mainly for the gaming industry, forming the basis for what has become near-realistic video games. In recent years, however, researchers in AI and other fields have been using GPUs to parallelize computations. This has proven especially fruitful for training Deep Neural Networks, as the computation of gradients, which are needed in order to "teach" the network using its "mistakes", can be done in parallel.
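
To make the idea of tensors and devices a bit more concrete, here is a minimal sketch. It creates a small tensor, moves it to the GPU if one is available (falling back to the CPU otherwise), and performs a matrix multiplication there. The variable names are chosen for illustration only.

```python
import torch

# Pick the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A 2x2 tensor created on the CPU
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Move the tensor to the chosen device and multiply it with itself there
x_dev = x.to(device)
y = (x_dev @ x_dev).cpu()   # bring the result back to the CPU for inspection

print(device, y)
```

The same code runs unchanged with or without a GPU, which is the pattern used throughout this tutorial.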

Even though this is a basic introduction, it outlines the requirements for training a deep learning model for a given problem. We know that we need a dataset on which we can solve a problem, a model that will do this for us and a GPU. Using this Google Colab notebook, we have direct access to a GPU server. We have also learned that in order to use the GPU, our data needs to be in the format of tensors. Let's now start with taking a look at our dataset.

In [None]:
# Imports (Load packages that are needed)
import torch                              # <- PyTorch library for Tensors
from torch.utils.data import DataLoader
from torch import nn
import torchvision                        # Specialization of Torch for Computer vision
from torchvision import transforms

import os
import matplotlib.pyplot as plt           # Plotting library
import numpy as np                        # Numerical computations on CPU

from tqdm.notebook import tqdm            # Library to track progress of processes

# Dataset

Here, we specify the kind of transformations that should be applied to each individual image in the dataset. This could be data augmentation, resizing of images and many other types of transformations. For this tutorial, as stated above, we limit this to transforming the images to the format of tensors. If you want to know more about potential transformations that could be applied, feel free to ask one of the instructors.

In [None]:
transform_list = []
transform_list.append(transforms.ToTensor())
transform = transforms.Compose(transform_list)

We will be using the MNIST dataset, which is an image dataset containing images of handwritten digits. This is a historically important dataset as it perfectly illustrates the kind of problems that people wanted to outsource to AI systems. Imagine you had to go through thousands of images of handwritten digits (think of post codes, marks in education, etc.) and document/digitize them into a database. As this is highly repetitive and, on top of that, cognitively not very challenging, people started to implement systems that would do this for them. The MNIST dataset is publicly available through the torchvision package, which is the Computer Vision extension of PyTorch. We can download the dataset using the command below.

In [None]:
mnist = torchvision.datasets.MNIST(root='.', download=True, transform=transform)

Another feature of PyTorch is the dataloader. Given any kind of dataset (i.e., a Python class that implements a function returning an indexed item), the dataloader allows us to use multiple cores/CPUs to parallelize the loading of items (here, images) from the disk. It also applies the above transformations (here, formatting as a tensor) such that we can directly use the data on our GPU. The command below defines a dataloader using the MNIST dataset that we downloaded and loaded into this notebook. We ask for 8 worker processes so that data is loaded from the disk in parallel. We also tell it the batch size, which is the number of images that it loads from the dataset in one iteration. All of these images will then be processed AT THE SAME TIME by the model when using a GPU. We can see why this speeds up the whole process a lot compared to processing a single image at a time.
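
To see how a dataloader cuts a dataset into batches without downloading anything, here is a minimal sketch on a made-up dataset of 10 blank "images". `TensorDataset` is a small PyTorch helper that wraps tensors as a dataset; the sizes and names are chosen for illustration only, and `num_workers=0` loads the data in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 made-up grayscale "images" of 28x28 pixels with labels 0..9
images = torch.zeros(10, 1, 28, 28)
labels = torch.arange(10)
toy_dataset = TensorDataset(images, labels)

# Batches of 4 items; the last batch holds the 2 remaining items
toy_loader = DataLoader(toy_dataset, batch_size=4, num_workers=0)

batch_shapes = [tuple(batch_images.shape) for batch_images, _ in toy_loader]
print(batch_shapes)
```

Note that the last batch is smaller than the others because 10 is not a multiple of 4. The same happens with MNIST and a batch size of 4096.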

In [None]:
dataloader = DataLoader(dataset=mnist, batch_size=4096, num_workers=8)

We can then iterate over this dataloader (either by using the iter function together with the next function to get the next iteration, or using any loop).

As you can see, the dataloader does not only return an image but also returns the corresponding label. This label tells us what digit is visible in the image. Using this pair of image and label, we can directly define the task that we want the model to solve: given the image as an input, output the digit that is visible in the image. Before we go into how to teach a model to do that, let's have a look at the actual images.

In [None]:
img, label = next(iter(dataloader)) # Create an iterator of the dataloader and get the "next" iteration

In [None]:
img.shape

Even though we only loaded one iteration of the dataloader, we got 4096 pairs of image and label. This is the batch of items that we defined above that will be processed in parallel. We can also see that the images have a size of 28x28 pixels and that they have 1 channel, meaning that the images do not have color but are grayscale only. Let's visualize some of the images. If you execute the below cell multiple times, new random images will be plotted from the above batch.

Do you disagree with any of the labels? Are some images ambiguous?

In [None]:
fig, ax = plt.subplots(1,8, figsize=(10, 5))

for ax_index, index in enumerate(np.random.randint(0, 4096, size=8)):
  ax[ax_index].imshow(transforms.ToPILImage()(img[index]), cmap='gray')
  ax[ax_index].set_title(label[index].numpy())
  ax[ax_index].axis('off')

plt.show()

---
**Exercise 1**

**Take the code from above that specifies the transformations that should be applied to each image and modify it in the cell below. Insert transformations that:**
1. **Flip the image vertically.**
2. **Crop the image to 10x10 pixels.**

**Make edits in the cell below and plot the resulting images.**

In [None]:
transform_list = []

raise NotImplementedError()
transform_list.append(transforms.ToTensor())
transform_augmented = transforms.Compose(transform_list)


mnist_augmented = torchvision.datasets.MNIST(root='.', download=True, transform=transform_augmented)
dataloader_augmented = DataLoader(dataset=mnist_augmented, batch_size=4096, num_workers=8)

In [None]:
img, label = next(iter(dataloader_augmented)) # Create an iterator of the dataloader and get the "next" iteration

In [None]:
fig, ax = plt.subplots(1,8, figsize=(10, 5))

for ax_index, index in enumerate(np.random.randint(0, 4096, size=8)):
  ax[ax_index].imshow(transforms.ToPILImage()(img[index]), cmap='gray')
  ax[ax_index].set_title(label[index].numpy())
  ax[ax_index].axis('off')

plt.show()

---

Now that we have an idea of what the dataset looks like, and we have defined the problem that we want to solve, we can create a model that solves it for us.

# Problem and Models

One of the simplest Neural Networks (that is in fact not a "Deep" network; more on this later) is the Multi-Layer Perceptron or MLP.



The MLP is characterized by 4 core concepts.

The first is **nodes**. A node is the smallest unit of the network and computes a single function on some input. Each of the green dots in the image below is one such unit or node. Nodes either get the original data as input or, more importantly, can get the *output* of another node as *input*. A function, often non-linear, is then computed on this input, and the resulting output is either passed on to the next node or constitutes the output of the entire model, e.g., the digit corresponding to the input image.

The second concept is a **layer**. In the image, we see that this particular MLP has three layers: input layer, hidden layer, and output layer. As may be intuitively clear, the input layer gets the image as input. Each node in the input layer receives the value of exactly one pixel of the image. So in our case we would have 28*28=784 nodes in the input layer. The number of nodes in the hidden layer can be chosen freely by the user but crucially determines (to some extent) whether the model can solve the task at hand or not.

The last layer usually corresponds to the number of output options. In our case, there are 10 digits, therefore the output layer would have 10 nodes. Each of these nodes represents one digit and outputs a value between 0 and 1 which will in turn be interpreted as the probability that that digit is visible in the input image. So contrary to a common belief, the model does not output a single number directly telling us the digit, but instead it outputs a probability for each digit with which it "thinks" that that digit is visible in the image.


![link text](https://media.geeksforgeeks.org/wp-content/uploads/nodeNeural.jpg)

<sub>https://media.geeksforgeeks.org/wp-content/uploads/nodeNeural.jpg</sub>

The third concept is **scalability**. We already heard that the particular MLP that we would use in our case has 784 input nodes, some hidden nodes, and 10 output nodes, and therefore differs from the MLP instance shown in the image above. MLPs can come in any size. Interestingly, this is not limited to the number of nodes per layer but also applies to the number of hidden layers. Every MLP has one input layer and one output layer, but everything in between is "arbitrarily" defined. This is commonly referred to as the "Black Box" when talking about Neural Networks, as it becomes harder and harder (up to impossible) to disentangle what kind of behaviour the MLP uses to solve the given task.

Below is an MLP that illustrates multiple hidden layers and also shows that each layer can have a different number of nodes.

![](https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg)

<sub>https://www.researchgate.net/publication/354817375/figure/fig2/AS:1071622807097344@1632506195651/Multi-layer-perceptron-MLP-NN-basic-Architecture.jpg</sub>


The last and likely most important concept is **learnability**. So far, we have talked about the input and output, the nodes and layers, and that all of that can be scaled up or down. But there are also connections between all the nodes. These symbolize the weights that are assigned between pairs of nodes. Weights are essentially real-valued numbers that are multiplied with the input in order to transform the input of an individual node to the output of that node.

We can see that for all the hidden layers and the output layer in the two images above, the nodes have an arrow, i.e., a weight, coming from EVERY single node in the previous layer. So every node gets input from all nodes in the previous layer. If we imagine that the input is an image, then this means that each node in the (first) hidden layer has access to information from all nodes in the input layer and therefore to information from the WHOLE image. All of that information is then weighted by the values of the connections between the nodes, after which a non-linear function, called the activation function, is applied. So a given node computes the weighted sum of all its inputs (which is a linear computation) followed by a non-linear computation.
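
The computation of a single node can be sketched in a few lines. The input values and weights below are made up for illustration. The bias term is an addition that is not discussed in the text above; it is included here because PyTorch's `nn.Linear` layers add one by default.

```python
import torch

# Made-up outputs of three nodes in the previous layer
inputs = torch.tensor([1.0, -2.0, 0.5])

# Made-up weights on the three incoming connections, plus a bias
weights = torch.tensor([0.4, 0.3, -0.6])
bias = torch.tensor(0.1)

# Linear part: weighted sum of all inputs
# 1.0*0.4 + (-2.0)*0.3 + 0.5*(-0.6) + 0.1 = -0.4
weighted_sum = torch.dot(inputs, weights) + bias

# Non-linear part: ReLU sets negative values to zero
output = torch.relu(weighted_sum)

print(weighted_sum.item(), output.item())
```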

![](https://www.researchgate.net/publication/354817375/figure/fig1/AS:1071622807117824@1632506195623/A-single-neuron-receiving-weighted-inputs.jpg)

<sub>https://www.researchgate.net/publication/354817375/figure/fig1/AS:1071622807117824@1632506195623/A-single-neuron-receiving-weighted-inputs.jpg</sub>


Often the Rectified Linear Unit (ReLU) or the Sigmoid function are used as activation functions.


![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*XxxiA0jJvPrHEJHD4z893g.png)


<sub>https://miro.medium.com/v2/resize:fit:720/format:webp/1*XxxiA0jJvPrHEJHD4z893g.png</sub>
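
The two activation functions mentioned above can be tried out directly. This is a small sketch with made-up input values: ReLU sets negative values to zero and leaves positive values unchanged, while the Sigmoid squashes any value into the range between 0 and 1.

```python
import torch

# Made-up inputs: one negative, zero, one positive
x = torch.tensor([-2.0, 0.0, 3.0])

relu_out = torch.relu(x)        # negative values become 0
sigmoid_out = torch.sigmoid(x)  # values are squashed into (0, 1)

print(relu_out)
print(sigmoid_out)
```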



# Model Implementation

This has been a long but necessary explanation of what a typical MLP model looks like. We now have all the ingredients to build one using PyTorch.

Counterintuitively, a layer in PyTorch (or any other deep learning library) is defined as the set of weights rather than the set of nodes per layer. So a PyTorch layer is defined in terms of the number of input nodes, the number of output nodes, and the type of computation that is applied to the input. In our case we want that computation to be the weighted sum of inputs. This is called a Linear Layer. We want that to be followed by a ReLU operation as the activation function.

Our model needs 784 input nodes, one hidden layer with a high enough number of nodes, and an output layer with 10 nodes.
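
To illustrate that a PyTorch layer is really a set of weights, the following sketch creates a single Linear layer connecting 784 input nodes to 100 nodes (the hidden size of 100 is just an example value) and inspects its parameters. Besides the weights, `nn.Linear` also holds one bias value per output node by default.

```python
from torch import nn

# One Linear layer: 784 input nodes connected to 100 output nodes
layer = nn.Linear(in_features=784, out_features=100)

weight_shape = tuple(layer.weight.shape)   # one weight per (output, input) pair
bias_shape = tuple(layer.bias.shape)       # one bias per output node
n_params = sum(p.numel() for p in layer.parameters())

print(weight_shape, bias_shape, n_params)
```

With these sizes the layer holds 784*100 weights plus 100 biases, so 78500 learnable parameters in total.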

In [None]:
class MLP(nn.Module):       # <- nn.Module builds the basis for any model
    def __init__(self, in_channels, out_channels, hidden_channels):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_channels, hidden_channels),  # Set of weights connecting the number of input nodes (defined by in_channels)
                                                      #       to the number of hidden nodes (defined by hidden_channels)
                                                      #       and the type of computation applied to them (weighted sum = Linear)
            nn.ReLU(),                                # Rectified Linear Unit - Activation function applied to every output (so the weighted sum of inputs) of the previous layer.
            nn.Linear(hidden_channels, out_channels)  # Same as above but now connecting the hidden layer to the output layer
        )


    # The forward function is called when the model receives an input x and defines how the model processes this input.
    # Here, each image is first flattened, going from 3-dimensional (channels, height, width) to 1-dimensional, such that the number of pixels corresponds to the number of input nodes.
    # Then, the input is fed through the network, meaning that all the weighted sums and activations are applied,
    #     yielding one score per possible digit. These raw scores are turned into probabilities by a softmax, which CrossEntropyLoss applies internally.
    def forward(self, x):
        # convert tensor (batch_size, 1, 28, 28) --> (batch_size, 1*28*28)
        x = x.view(x.size(0), -1)   # flatten
        x = self.layers(x)          # feed through network
        return x                    # return output

The above cell only defines the class that we can now create an instance of, specifying the exact number of nodes per layer. As said above, we need 784 input nodes and 10 output nodes. Additionally, we want the hidden layer to have 100 nodes. Feel free to play around with the number of hidden nodes, e.g. make it 10 or 1000 and see how the model's behaviour changes. Please do this after the end of the workshop in the interest of time.

In [None]:
model = MLP(in_channels=28*28, hidden_channels=100, out_channels=10)

Note that this model could already be used to execute the task. However, given that its parameters (i.e., the weights of each pair of nodes) are initialized randomly the model output would also be random.
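
The sketch below shows this with an untrained stand-in model. It is built with `nn.Sequential` and `nn.Flatten` so that the example is self-contained; it mirrors, but is not, the `MLP` class from above. The input images are random noise. The model returns one raw score per digit for each image, and a softmax turns these scores into probabilities that sum to 1.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the example is repeatable

# Stand-in for the MLP above: flatten, Linear, ReLU, Linear
untrained = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)

fake_images = torch.rand(5, 1, 28, 28)    # batch of 5 random "images"

with torch.no_grad():                      # no training, so no gradients needed
    scores = untrained(fake_images)        # raw scores, shape (5, 10)
    probabilities = torch.softmax(scores, dim=1)

print(scores.shape)
print(probabilities[0])
```

Because the weights are random, the probabilities carry no meaningful information about the input yet.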

# Training

Now, we want to change/tune the weights such that the model output corresponds to the label for the given input image. To do this, we need of course the image itself and the corresponding label.

Next, we need a measure of how well the model is doing. This is referred to as the loss function. It tells us how close the output of the model is to the desired output. Remember, the model does not just output the digit itself but a probability distribution over all possible digits. A single correct answer (there is only one digit visible per image) corresponds to a probability distribution that is 0 everywhere except for the node that corresponds to the digit, where it is 1. We can then compute how similar the output distribution of the model is to this desired distribution.

The similarity between these two distributions is given by the CrossEntropyLoss. For a given pair of predicted and target/desired digits, it calculates a single number that is close to 0 if the model assigns a high probability to the target digit and that can become arbitrarily large if the probability assigned to the target digit is close to 0.
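
As a small worked example, the sketch below computes the cross-entropy loss for one made-up set of three scores by hand and compares it with PyTorch's built-in function. For a single correct class, the loss reduces to the negative logarithm of the probability that the model assigns to that class. The scores and the target are invented for illustration.

```python
import torch
import torch.nn.functional as F

# Made-up raw model scores for one item and three classes
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])                 # the correct class is class 0

# Softmax turns the scores into a probability distribution
probs = F.softmax(logits, dim=1)

# Cross-entropy with a single correct class: -log(probability of that class)
manual_loss = -torch.log(probs[0, target[0]])

# PyTorch's version takes the raw scores and applies the softmax internally
builtin_loss = F.cross_entropy(logits, target)

print(probs, manual_loss.item(), builtin_loss.item())
```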

Given this loss function, we can then calculate the gradient for each of the parameters of the model, i.e. the weights, and change or update these weights such that the loss would decrease if the model were to use these weights instead.
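
To illustrate what "calculating the gradient and updating the weights" means, here is a minimal sketch with a single made-up weight and a toy loss (not the CrossEntropyLoss). PyTorch computes the gradient automatically; we then take one small step against the gradient and check that the loss has decreased. The step size of 0.1 is an arbitrary choice for this example.

```python
import torch

# A single made-up weight that PyTorch should track gradients for
w = torch.tensor(3.0, requires_grad=True)

# Toy loss with its minimum at w = 1
loss = (w - 1.0) ** 2

# Compute d(loss)/d(w) = 2 * (w - 1) = 4 at w = 3
loss.backward()
grad = w.grad.item()

# One manual update step against the gradient
with torch.no_grad():
    w_new = w - 0.1 * w.grad
    new_loss = (w_new - 1.0) ** 2

print(grad, w_new.item(), loss.item(), new_loss.item())
```

An optimizer such as Adam performs this kind of update for every weight of the model, with a more elaborate rule for choosing the step.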

This yields the following training paradigm:

1. Start with a pair of **input** (image) and corresponding **label** (digit)
2. Give the input to the model, which feeds it through its layers and nodes and **outputs** a probability distribution over the possible outcomes.
3. Calculate the distance (or **loss**) between the model **output** and the **label** in terms of the CrossEntropyLoss function.
4. Calculate the **gradient** for each parameter/weight given the **loss**.
5. Update the model parameters/weights according to the **gradient**.
6. Repeat for a new pair of **input** and **label**.

In [None]:
def train(model, loader, n_epochs, criterion, optimizer, device):
  model.train(True)

  for epoch in tqdm(range(n_epochs)):

    for item in tqdm(loader, leave=False):
      inputs, labels = item[0], item[1]       # Pair of input and corresponding label

      inputs = inputs.to(device)              # Move to desired device (using a GPU is recommended, as calculation times are drastically longer on a CPU)
      labels = labels.to(device)              # Move to desired device (using a GPU is recommended, as calculation times are drastically longer on a CPU)

      optimizer.zero_grad()                   # Set all gradients to 0, this is a technical step that is needed in order to only consider the gradients of this "round" of input-output-pairs and not the previous ones

      outputs = model(inputs)                 # Feed the input (image) through the model and get the probability distribution over targets (digits)

      loss = criterion(outputs, labels)       # Calculate the loss (distance or error) between outputs and labels using CrossEntropyLoss
      loss.backward()                         # Calculate the gradient for each parameter for the current loss

      optimizer.step()                        # Update each model parameter (weight) according to the calculated gradient

---

**Exercise 2**

**Given the above code cell and the description of the training process, what is the role of *epochs* in the training process? Describe what an *epoch* constitutes and how it contributes to the concept of scalability.**


**Answer to Exercise 2**

Describe here ...

---


Let's now define the remaining parts using PyTorch and then we can finally train our first model.

We need a loss function, for which we will use the CrossEntropyLoss. We also need an algorithm that updates each model parameter in a smart way given the loss. For this we will choose the commonly used Adam optimizer. Other common optimizers are SGD (stochastic gradient descent), LBFGS, or RMSProp. A list of available optimizers in PyTorch can be found [here](https://pytorch.org/docs/stable/optim.html#algorithms).

In [None]:
criterion = torch.nn.CrossEntropyLoss() # Define loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # Define optimizer, note that we have to tell the optimizer which parameters should be updated. Here, we want all parameters to be updated
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # We define the GPU as the device that should be used during training, if a GPU is available. Note that everything said so far would also work on a CPU but would take much much longer

model = model.to(device) # Move the model to the desired device (GPU)

We have learned earlier that the model (using a GPU) can handle multiple pairs of input and output at the same time (this is called a batch). So, we cut the whole dataset up into batches. We can now iterate over those batches such that the model sees each pair of input and label once. Unfortunately, showing each pair of input and label once will not enable the model to adjust its parameters such that it classifies all images correctly. Therefore, we need to show the whole dataset to the model multiple times in order for it to achieve a sufficient classification performance.

The cell below executes the train function that we defined above for a given model, loss function (criterion), optimizer, and dataloader. We can further specify the device and the number of times the whole dataset is shown to the model (n_epochs). The cell below should finish after around 70 seconds for 10 epochs and the default model that we defined above. If you make changes to any of these parts, expect these times to differ.
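
As a quick back-of-the-envelope calculation, the sketch below counts how many batches, and therefore how many weight updates, result from the settings used in this tutorial. It assumes the standard MNIST training set of 60000 images, the batch size of 4096 from above, and 10 passes over the dataset.

```python
import math

n_samples = 60_000     # size of the MNIST training set
batch_size = 4096      # batch size used by the dataloader above
n_passes = 10          # number of times the whole dataset is shown to the model

# The last batch is smaller than the others, hence rounding up
batches_per_pass = math.ceil(n_samples / batch_size)
total_updates = batches_per_pass * n_passes

print(batches_per_pass, total_updates)
```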

Go ahead and train the model by executing the following cell:

In [None]:
train(model=model, loader=dataloader, n_epochs=10, criterion=criterion, optimizer=optimizer, device=device)

# Testing

To evaluate model performance, we define a simple classification accuracy as the number of times the model assigned the highest probability to the actual target digit divided by the number of items in the dataset. For a given dataloader and model, the function below computes this accuracy.

In [None]:
def evaluate(loader, model, device):
    num_correct = 0
    num_samples = 0
    model.eval()


    with torch.no_grad():
        with tqdm(enumerate(loader), total=len(loader), desc='0/0') as t:
            for i, (item) in t:
                x, y = item[0], item[1]

                x = x.to(device=device)
                y = y.to(device=device)

                scores = model(x)

                ############################
                _, predictions = scores.max(1)
                ############################
                num_correct += (predictions == y).sum().item()  # .item() converts the tensor to a plain Python number for clean printing
                num_samples += predictions.size(0)

                t.set_description(f'{num_correct}/{num_samples} correct ')

        accuracy = float(num_correct)/float(num_samples)
        print(f'Got {num_correct} \t/ {num_samples} correct -> accuracy {accuracy*100:.2f} %')

    return accuracy


---
**Exercise 3**

**Look at the line of code that is highlighted in the cell above:**

```_, predictions = scores.max(1)```

**Interpret the code. Explain what the variables ```scores``` and ```predictions``` represent, respectively. What does the function ```scores.max(1)``` do, both programatically (keep it simple) and conceptually with respect to the ```scores``` variable content.**

---

**Answer to Exercise 3**

Interpret here ...

---

Let's see how well our model is doing after it has been trained for 10 epochs:

In [None]:
accuracy = evaluate(model=model, loader=dataloader, device=device)

We see that the model performs nearly perfectly after it has been trained for 10 epochs.

An important additional check that is commonly done in Machine Learning is to test the model on data that it has not seen during training. While this might seem unfair or irrelevant to the task, we need to remember that the solution we were initially looking for is a model that can tell us, for any image of a handwritten digit, what digit is visible. And for such a model to work reliably and autonomously, we would want it to also be able to classify digits that have been written by, e.g., a new, unknown person.

The technical term for this is hold-out validation, the simplest relative of cross-validation. The easiest way to perform it for any given dataset is to split the dataset into two new datasets. One is called the training dataset; it consists of around 90% of the original data and is used for training the model. The other is called the test dataset and consists of the remaining 10% of the data. The test dataset is then used for so-called inference, i.e., the model has to make its prediction but the weights won't be changed if the output is wrong.
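
Such a split can be sketched with PyTorch's `random_split` helper. The dataset below is made up (100 random "images" with random labels) so that the example runs without any download; the 90/10 proportions follow the text above, and the fixed seed only serves to make the split repeatable.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Made-up dataset of 100 items standing in for a real one
full_dataset = TensorDataset(
    torch.randn(100, 1, 28, 28),
    torch.randint(0, 10, (100,)),
)

# Randomly assign 90 items to training and 10 items to testing
generator = torch.Generator().manual_seed(0)
train_subset, test_subset = random_split(full_dataset, [90, 10], generator=generator)

print(len(train_subset), len(test_subset))
```

Every item ends up in exactly one of the two subsets, so the model is never tested on an item it was trained on.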

Many datasets, including MNIST, have a pre-defined test dataset for the outlined purpose. The test dataset can be accessed and downloaded by setting the `train` argument to `False`.

In [None]:
mnist_test = torchvision.datasets.MNIST(root='.', download=True, transform=transform, train=False)

We can now assess/test the trained model's accuracy on this test dataset. Remember, this is data that the model has not seen before and could therefore not adjust its parameters to be able to predict the output class (digit) for these images.

We define a new dataloader for this test set and use the above 'evaluate' function to get the model's test performance in terms of classification accuracy.

In [None]:
testloader = DataLoader(dataset=mnist_test, batch_size=4096, num_workers=8)

In [None]:
accuracy = evaluate(model=model, loader=testloader, device=device)

Impressively, the model is still able to predict almost all images correctly. Note that the size of the MNIST test set is 10000 pairs of input and output.

But why the big hype around deep learning if a (non-deep) simple model that we can train in under 2 minutes can do this task so easily???

That is because this might have been one of the challenging tasks to solve back in the day but has now indeed become more of a toy example, used for educational and historical reasons. In order to illustrate a more current problem, we will move on to a new dataset called CIFAR-10.

# More advanced datasets and problems

CIFAR-10 is a dataset that contains natural images, i.e., color images of actual objects, more comparable to photographs. Each image is again assigned one of 10 labels. This time, the labels correspond to the object that is visible in the image. Let's first download the training and test datasets separately, then inspect the classes that the labels correspond to, and then have a look at some of the images contained in this dataset.

(The below code is basically copied from above, while only changing the name of the dataset. Isn't that amazing how easy it is to download a new dataset using PyTorch?)

In [None]:
cifar10 = torchvision.datasets.CIFAR10(root='.', transform=transform, download=True, train=True)
cifar10_test = torchvision.datasets.CIFAR10(root='.', transform=transform, download=True, train=False)

In [None]:
cifar10.classes

In [None]:
cifar_loader = DataLoader(dataset=cifar10, batch_size=4096, num_workers=8)
cifar_test_loader = DataLoader(dataset=cifar10_test, batch_size=4096, num_workers=8)

In [None]:
img, label = next(iter(cifar_loader))

In [None]:
img.shape

In [None]:
fig, ax = plt.subplots(1,10, figsize=(20, 8))

for index in range(10):
  ax[index].imshow(transforms.ToPILImage()(img[index]), cmap='gray')
  ax[index].set_title(cifar10.classes[label[index].numpy()])
  if index > 0:
    ax[index].axis('off')

plt.show()

As you can see, the images are far more varied in their content. However, the resolution of images has increased only slightly, up to 32x32 pixels. Additionally, we are dealing with color images, so instead of one "color" channel (as is the case of grayscale images) we now have 3 color channels corresponding to RGB (red, green, blue).

Let's now apply the same model architecture that we used before in order to solve the new task: assign one of the 10 object classes to the given RGB images of objects.

We have to make some small alterations to the model because of the new dimensions of the dataset. The input layer now needs 3\*32\*32=3072 nodes as we will now flatten all 3 color channels and all pixels into one long list of values. We will keep the number of hidden nodes (channels) and output channels.
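
The new input size can be checked with a small sketch. It uses a made-up batch of 8 blank images with the CIFAR-10 dimensions and flattens it in the same way as the `forward` function of the model does.

```python
import torch

# Made-up batch of 8 RGB images of 32x32 pixels
batch = torch.zeros(8, 3, 32, 32)

# Keep the batch dimension and flatten channels, height and width
flat = batch.view(batch.size(0), -1)

print(flat.shape)   # each image is now one long list of 3*32*32 values
```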

Further, we again define the CrossEntropyLoss as our criterion and we use the Adam optimizer, with the new model parameters (!).

In [None]:
cifar_model = MLP(in_channels=3*32*32, hidden_channels=100, out_channels=10)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cifar_model.parameters(), lr=0.001)      # Note that we need to tell the optimizer which parameters it should use

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # We define the GPU as the device that should be used during training, if a GPU is available. Note that everything said so far would also work on a CPU but would take much much longer

cifar_model = cifar_model.to(device)

Let's see how well this model performs after 10 epochs of training. The execution of the below cell with the instructed parameters should take around 2 minutes (more parameters, longer training).

In [None]:
train(model=cifar_model, loader=cifar_loader, criterion=criterion, n_epochs=10, optimizer=optimizer, device=device)

We again first evaluate the model on the training dataset to see whether it was able to learn the corresponding labels in the first place. If we are happy with the training classification accuracy, then we move on to test the model on the test dataset, to check whether it can also classify images that it hasn't seen during training.

In [None]:
accuracy = evaluate(model=cifar_model, loader=cifar_loader, device=device)

After 10 epochs, the above model should have a training accuracy of around 40 %. Go ahead and run the train function for another 20 epochs. Be aware that if we re-execute the cell in which the model is defined, this will reset the parameters to random values again, so you would need to adjust the number of epochs to 30. Do not train the model for more than 30 epochs; instead, continue with the tutorial :)

In [None]:
accuracy = evaluate(model=cifar_model, loader=cifar_test_loader, device=device)

As you can see, it seems very hard to increase the model's performance purely by training for longer. The model might not be "powerful" enough to solve this task. At this point, one could try to add more layers or more nodes in the hidden layer to give the model more flexibility. However, in that case the model would probably achieve a higher classification accuracy on the training set without performing better on the test set. This is called overfitting and tells us that the model is learning the noise in the data (e.g., a common background of objects of the same class: airplanes are more commonly photographed against a blue, sky-like background) but cannot transfer this to the test data, where an airplane might be standing at the airport (which is of course a totally realistic setting by which humans would not be confused). So it must be possible to change the network such that it learns which parts of an object are indicative of its class label. This leads us to convolutions.

# Convolutions to the rescue

You may or may not know convolutions from other fields or from mathematics. Generally speaking, a convolution expresses how the shape of one function is influenced by another. It can also be thought of as a kind of filtering, using one function that corresponds to the filter and another function that corresponds to the signal that is to be filtered.

In computer vision, a convolution is usually referred to as a feature or a kernel. This is because the function that represents the filter/kernel/feature is a sort of prototype for a characteristic in the data, e.g., a straight line or a circle in an image. When a convolution that represents a straight line is applied to a part of an image where a straight line is visible, this will yield the strongest output (defined by the mathematical computation behind the convolution, which we will not inspect any closer here). When the same convolution is then applied to some other part where, e.g., a circle is visible, this will yield a very weak output. In that sense, the output of a convolution reveals where and to what extent a certain feature or characteristic is present in an image.

To illustrate this, we can think back to how this would help in identifying the digit in a grayscale image of a handwritten digit. The handwritten 1, 4 or 7 all have a lot of straight or diagonal lines, while 3, 6, 8 or 0 have almost no straight lines but rather half-circles or circles. If we applied the respective convolutions/kernels/features/filters to the handwritten digit images, we could easily learn something about the content of the images by just inspecting the distribution of occurrences of these features (lines and circles).

The kernel itself is defined as a two-dimensional matrix, commonly of size 3x3, 5x5 or 7x7, where each entry in the matrix is a learnable real number that can be adjusted. The below image illustrates how such a matrix can form kernels corresponding to different types of straight lines or edges.

![](https://media5.datahacker.rs/2018/10/multiplication_slicice.png)
<sub>https://media5.datahacker.rs/2018/10/multiplication_slicice.png</sub>

During training the weights of the kernels (i.e. the values in the matrix) will be adjusted such that the resulting activations allow the model to distinguish parts of the images from other parts of the image.
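Although we do not go into the mathematics here, the sliding-window idea itself is short enough to sketch in plain NumPy. The helper `cross_correlate2d`, the toy `image` and the two line kernels below are made up for illustration (they are not the kernels from the figure). Note that convolutional layers in PyTorch compute this unflipped version (cross-correlation), whereas `scipy.signal.fftconvolve` flips the kernel first, as in the strict mathematical definition.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # the part of the image under the kernel
            out[i, j] = np.sum(patch * kernel)  # one output value per kernel position
    return out

# Toy 5x5 image containing a vertical line in column 2
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hypothetical line detectors (illustrative values)
vertical = np.array([[-1, 2, -1],
                     [-1, 2, -1],
                     [-1, 2, -1]], dtype=float)
horizontal = vertical.T

resp_v = cross_correlate2d(image, vertical)
resp_h = cross_correlate2d(image, horizontal)

print(resp_v)  # each row is [-3, 6, -3]
print(resp_h)  # all zeros
```

The vertical-line kernel responds strongly (value 6) exactly where the window is centered on the line, while the horizontal-line kernel gives 0 everywhere.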

Another large advantage of using convolutions is that a single convolutional kernel, e.g., corresponding to a straight line, can be applied to every location in the image without needing more than one set of weights. In case of the MLP, remember that the input layer needed to have as many nodes as there were pixels in the image and nodes in the hidden layer received input from all previous nodes. Here, one convolutional kernel, of which there would still be many in a neural network model, can be applied to all locations of the image, no matter how many pixels there are without changing the number of weights.
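The weight-sharing argument can be checked by counting parameters. The sketch below compares a fully connected layer that maps a flattened 32x32 RGB image to 100 hidden nodes with a convolutional layer that has 32 kernels of size 3x3; the helper `n_params` is a hypothetical name introduced here.

```python
import torch
import torch.nn as nn

def n_params(module):
    """Total number of learnable values (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

fc = nn.Linear(3 * 32 * 32, 100)                    # one weight per (input pixel, hidden node) pair
conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # 32 kernels, each covering 3 channels x 3 x 3

print(n_params(fc))    # 307300
print(n_params(conv))  # 896

# The same conv layer can process a larger image without any additional weights
out_small = conv(torch.randn(1, 3, 32, 32))
out_large = conv(torch.randn(1, 3, 64, 64))
print(out_small.shape, out_large.shape)
```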

---

**Exercise 4**

**Implement the convolutional kernels from the above image and apply them to the data `content` below. How does the resulting output change when changing the kernels? Can you interpret these changes in terms of feature detection?**

In [None]:
from scipy.signal import fftconvolve

raise NotImplementedError('First kernel not implemented ...')
kernel_1 = np.array([ ... ])

kernel_2 = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])

raise NotImplementedError('Third kernel not implemented ...')
kernel_3 = np.array([ ... ])

content = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, -1],
    [0, 0, 0, 1, 0, -1],
    [0, 0, 0, 1, 0, -1],
    [0, 0, 0, 0, 0, 0],
])

raise NotImplementedError('First convolution not implemented ...')
output_1 = ...

output_2 = fftconvolve(content, kernel_2)

raise NotImplementedError('Third convolution not implemented ...')
output_3 = ...


# Let's plot the content, kernels and output
fig, ax = plt.subplots(4,2, figsize=(10, 8))

ax[0,0].imshow(content)
plt.delaxes(ax[0,1])

ax[0,0].set_title('Content')

ax[1,0].imshow(kernel_1)
ax[1,0].set_title('Kernel 1')
bar = ax[1, 1].imshow(output_1)
ax[1, 1].set_title('Output 1')

ax[2,0].imshow(kernel_2)
ax[2,0].set_title('Kernel 2')
bar = ax[2, 1].imshow(output_2)
ax[2, 1].set_title('Output 2')

ax[3,0].imshow(kernel_3)
ax[3,0].set_title('Kernel 3')
bar = ax[3, 1].imshow(output_3)
ax[3, 1].set_title('Output 3')


plt.tight_layout()
plt.show()

---

Let's implement a convolutional neural network given what we learned so far.

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()

        ####################### THIS BLOCK DEFINES EACH INDIVIDUAL COMPONENT THAT WE WILL BELOW BRING INTO THE ORDER NEEDED

        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)               # Convolutional Layer with 3 channels as inputs, 32 channels as outputs (so that means it has 32 kernels/filters) a kernel size of 3 and a padding of 1

        self.relu = nn.ReLU()                                                 # ReLU activation layer. Can be "reused" multiple times as there are no weights attached to this directly

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)    # Conv layer with 32 in and 64 out channels

        self.max_pool = nn.MaxPool2d(2, 2)                                    # Max Pooling layer, we can ignore how this works for now

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)

        self.conv4 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)

        self.conv5 = nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1)

        self.conv6 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

        self.fc1 = nn.Linear(256*4*4, 1024)                                   # Fully connected layer, needs flattening of the output of the previous layer which we will perform below
        self.fc2 = nn.Linear(1024, 512)                                       # Gradually decrease the number of nodes, down to the number of output classes
        self.fc3 = nn.Linear(512, 10)
        #######################


        ####################### Define the order of individual components with which the model should process the input
        self.layers = nn.Sequential(
            self.conv1,
            self.relu,
            self.conv2,
            self.relu,
            self.max_pool,  # output: 64 x 16 x 16

            self.conv3,
            self.relu,
            self.conv4,
            self.relu,
            self.max_pool,  # output: 128 x 8 x 8

            self.conv5,
            self.relu,
            self.conv6,
            self.relu,
            self.max_pool,  # output: 256 x 4 x 4

            nn.Flatten(),   # output: 256*4*4
            self.fc1,
            self.relu,
            self.fc2,
            self.relu,
            self.fc3
        )
        #######################


    ####################### The forward function defines what happens when the model is called on an input x. In this case we just call the layers block, with the predefined order of individual components
    # Note that we don't need to define the self.layers variable, but we can also call each individual component on the input, which eventually just requires a bit more writing. See a short example below
    def forward(self, x):
        return self.layers(x)

        ####################### actually equivalent to:
        # x = self.conv1(x)
        # x = self.relu(x)
        # x = self.conv2(x)
        # etc. ...
        # ...
        # x = self.fc3(x)
        # return x
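To see where the `256*4*4` input size of `fc1` comes from, we can trace the shape of a dummy input. This is a sketch with stand-alone layers that use the same settings as `conv1` and `max_pool` above; it only looks at shapes, not at how pooling works.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                       # one dummy CIFAR-sized image
conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # same settings as conv1
pool = nn.MaxPool2d(2, 2)                           # same settings as max_pool

after_conv = conv(x)            # padding=1 with a 3x3 kernel keeps height and width
after_pool = pool(after_conv)   # 2x2 pooling with stride 2 halves height and width

print(after_conv.shape)  # torch.Size([1, 32, 32, 32])
print(after_pool.shape)  # torch.Size([1, 32, 16, 16])

# Three pooling steps halve the spatial size three times: 32 -> 16 -> 8 -> 4
size = 32
for _ in range(3):
    size //= 2
print(size, 256 * size * size)  # 4 4096
```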

Just like above, we need to create a new instance of the ConvNet model class, choose our loss function (criterion), the optimization algorithm (Adam), and the device to run the training on and train the model.

---

**Exercise 5**

**Take the previous code for instantiating a model, now of the ConvNet class, create a new optimizer with the ConvNet parameters, and move the model instance to the GPU. Then, take the ```train``` and ```evaluate``` functions and train the convolutional model for 20 epochs on the GPU on the train dataset. Afterwards, evaluate the trained model's performance on the train dataset and also on the test dataset.**

In [None]:
raise NotImplementedError()

...

---

After evaluating the trained model on the test set, we see that using convolutions in a simple model increases test accuracy significantly. However, the model still does not perform perfectly. This is again due to the size of the model. A larger model (i.e., a model with more convolutional layers and more channels per layer) would perform even better. However, here we could probably also squeeze out some performance improvements by just training the model for a bit longer. Due to the lack of time we will not look further into this but rather turn to another class of models.

Before we move on to the state-of-the-art models, we want to have a closer look at what the above model actually learned. The cool thing about convolutions is that we can visualize the learned features/channels/kernels. Since every kernel is a two-dimensional matrix (as we have seen above), this matrix can be visualized. The below function takes a list of images and plots them in a grid of axes. No need to go through this code.

In [None]:
def plot_images(images, fig=None, max_images: int = None, figsize=(15, 10), titles:list=None, cmap=None, axis_off:bool=True, orientation='landscape', custom_func_per_image=None):
    if max_images is None:
        max_images = len(images)

    if fig is None:
        fig = plt.figure(figsize=figsize)

    if orientation == 'landscape':
        n_rows = int(np.floor(np.sqrt(max_images)))
        n_cols = int(np.ceil(max_images / n_rows))
    else:
        n_cols = int(np.floor(np.sqrt(max_images)))
        n_rows = int(np.ceil(max_images / n_cols))

    index = 0
    for _ in range(n_rows):
        for _ in range(n_cols):
            if index >= max_images:
                break
            ax = fig.add_subplot(n_rows, n_cols, index+1)
            ax.imshow(images[index], cmap=cmap)
            if axis_off:
                ax.axis('off')
            if titles is not None and len(titles) > index:
                ax.set_title(titles[index])

            if custom_func_per_image is not None:
                custom_func_per_image(ax, index)
            index += 1

    fig.tight_layout()

    return fig

Let's pick one of the convolutional layers of the above model and visualize the trained kernels.

In [None]:
kernels = conv_model.conv6.parameters() # <- we can change conv6 to any of the other convolutional layers of the model
param_iter = next(iter(kernels))        # the first parameter of a convolutional layer is its weight tensor (the bias comes second)

In [None]:
fig = plot_images([x[0,:,:].cpu().detach() for x in param_iter], cmap='gray')       # For each kernel we plot the slice belonging to the first input channel. '.cpu()' moves it from the GPU to the CPU and '.detach()' removes it from the gradient computation, which allows us to plot it
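As a sketch of what is being plotted: the weight tensor of a `Conv2d` layer has the shape (output channels, input channels, kernel height, kernel width). The stand-alone layer below uses the same settings as `conv6` but has random, untrained weights (an assumption for illustration), so only the shapes carry over to the trained model.

```python
import torch.nn as nn

conv = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)  # same settings as conv6, untrained

weight = next(iter(conv.parameters()))   # the first parameter is the weight tensor
print(weight is conv.weight)             # True
print(weight.shape)                      # torch.Size([256, 256, 3, 3])

# One 3x3 slice per kernel: the part of each kernel that looks at input channel 0
slices = [k[0].detach().cpu() for k in weight]
print(len(slices), slices[0].shape)      # 256 torch.Size([3, 3])
```

Each of the 256 kernels in this layer therefore has 256 such 3x3 slices, of which the plot shows only the first.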

---

**Exercise 6**

**Execute the above cell for different layers (conv1, conv3, conv6) and compare the kernel visualizations. Can you find differences between layers? Can you identify kernels that have a certain structure (lines, circles, etc.)? Describe your findings.**



**Answer to Exercise 6**

Describe here ...

---

# Transfer learning

*The following part is based on a notebook by Max van Spengler.*

Now, we have seen how to build our own fairly basic models, but what if we want to use the modern, state-of-the-art models? How can we use such models with only limited amounts of resources? Luckily, the company [Hugging Face](https://huggingface.co/), with the help of many others, is putting lots of effort into democratizing AI by providing implementations for many state-of-the-art models together with pretrained model weights. These are basically parameters for the model that have been obtained by training on (usually massive) datasets.

While the classes for our dataset may be different from the classes in the training dataset, the visual features that appear in the images can often be quite similar.

Models often have an "encoding" part that condenses the images into a single feature vector that represents the most important properties of an image. For example, in our ConvNet model, the output after the nn.Flatten operation is a single vector of size 4096 = 256\*4\*4 for each image, so all the operations up to this point can be considered the "encoder". These feature vectors are then used by a classifier to predict a class. In our ConvNet this classifier consists of the three Linear layers together with the ReLU activation functions.

A common approach is to take the "encoding" part of a pretrained model and place a new classifier on top. Then, we can leave the encoding part as is and only train the (relatively small) classifier. Compared to training such a model from scratch, with this approach we
- need less data;
- need less compute power;
- need less time;
- reduce our carbon footprint.

As a result, we can build very advanced models at relatively little cost!

For this part of the tutorial we will use Hugging Face's [transformers](https://huggingface.co/docs/transformers/index) package.

In [None]:
!pip install transformers

In [None]:
from transformers import AutoImageProcessor, ResNetModel

## ResNet

The first large model that we will use is the [ResNet](https://arxiv.org/abs/1512.03385) architecture that was developed in 2015. This is basically a very deep convolutional network, that uses some additional tricks to make learning possible. Luckily, because of the transformers package we don't have to implement this model ourselves and can simply load the model straight into our notebook. The only thing that we have to do is make sure that we "freeze" the parameters of our encoder and we have to add a classifier on top. In PyTorch, we can simply disable the computation of gradients for layers, which will ensure that the model does not try to change the parameters of these layers.

We use the pretrained ResNet-50 model from Microsoft. This model has 50 layers with approximately 23.5 million (!) parameters and it is pretrained on the ImageNet dataset, which contains 1,281,167 training images, 50,000 validation images and 100,000 test images.
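What "freezing" does can be sketched on a toy model. The tiny `encoder` and `classifier` below are placeholders (not the ResNet), and `n_trainable` is a hypothetical helper introduced for this example.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a pretrained encoder and a new classifier
encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
classifier = nn.Linear(16, 3)

# Freeze the encoder: no gradients will be computed for its parameters
encoder.requires_grad_(False)

def n_trainable(module):
    """Number of parameter values that will receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(n_trainable(encoder))     # 0
print(n_trainable(classifier))  # 51  (16*3 weights + 3 biases)

# A backward pass only fills in gradients for the classifier
x = torch.randn(4, 8)
loss = classifier(encoder(x)).sum()
loss.backward()

print(encoder[0].weight.grad is None)     # True
print(classifier.weight.grad is not None) # True
```

Because the frozen parameters never receive a gradient, an optimizer step leaves them unchanged.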

In [None]:
class ResNetClassifier(nn.Module):
    def __init__(self, classes):
        super(ResNetClassifier, self).__init__()
        self.encoder = ResNetModel.from_pretrained(
            "microsoft/resnet-50",
        )
        self.encoder.requires_grad_(False)
        self.classifier = nn.Linear(2048, classes)

    def forward(self, x):
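        x = self.encoder(x).pooler_output.flatten(start_dim=1)  # (batch, 2048, 1, 1) -> (batch, 2048); unlike squeeze(), this keeps the batch dimension even for a batch of one image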
        return self.classifier(x)

resnet_model = ResNetClassifier(classes=10)
resnet_model.to(device)
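In the `forward` method above, the pooled encoder output has to be reduced to one vector per image before it reaches the linear classifier. The sketch below uses dummy tensors, assuming a pooled output of shape (batch, 2048, 1, 1), to compare two ways of doing this.

```python
import torch

# Dummy stand-ins for a pooled encoder output, assumed shape: (batch, 2048, 1, 1)
pooled_batch = torch.randn(4, 2048, 1, 1)
pooled_single = torch.randn(1, 2048, 1, 1)

# squeeze() removes every dimension of size 1, including a batch dimension of size 1
print(pooled_batch.squeeze().shape)              # torch.Size([4, 2048])
print(pooled_single.squeeze().shape)             # torch.Size([2048])

# flatten(start_dim=1) always keeps the batch dimension
print(pooled_single.flatten(start_dim=1).shape)  # torch.Size([1, 2048])
```

With the batch sizes used in this tutorial both give the same result; `flatten(start_dim=1)` is simply the safer choice if a batch ever contains a single image.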

In [None]:
resnet_model.encoder.embedder.embedder.convolution

Another thing that we have to change is how the data is transformed. This is because these models have been pretrained on images that have been transformed in specific ways. So, if we do not perform these same transformations, then we will confuse the model. Luckily, we can automatically load the transformation that was used for pretraining the model.

In [None]:
resnet_processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")

def resnet_transform(example):
    return resnet_processor(example, return_tensors="pt").pixel_values.squeeze()

Now we can construct the datasets as we did before, but now we supply the model specific transformation.

In [None]:
resnet_cifar10 = torchvision.datasets.CIFAR10(root='.', transform=resnet_transform, download=True, train=True)
resnet_cifar10_test = torchvision.datasets.CIFAR10(root='.', transform=resnet_transform, download=True, train=False)

resnet_cifar_loader = DataLoader(dataset=resnet_cifar10, batch_size=512, num_workers=8)
resnet_cifar_test_loader = DataLoader(dataset=resnet_cifar10_test, batch_size=512, num_workers=8)

Now we can fine-tune this model, by training only the classifier on our own dataset. Remember that we turned the gradient computations off for the encoder part of our model. Therefore, we will only train the classifier.

In [None]:
resnet_optimizer = torch.optim.Adam(resnet_model.parameters(), lr=0.001)
train(model=resnet_model, loader=resnet_cifar_loader, n_epochs=1, criterion=criterion, optimizer=resnet_optimizer, device=device)

Fine-tuning for 1 epoch should take around 6 minutes.

And lastly, we can check the accuracy of our fine-tuned model.

In [None]:
accuracy = evaluate(model=resnet_model, loader=resnet_cifar_test_loader, device=device)

Now, we are getting to performances that become very much human-like. Would you trust an algorithm that is correct in almost 70% of cases if it saves you a lot of time?

---

**Exercise 7**

**Look at how we visualized the kernels of the ConvNet model earlier and use that code to visualize the kernels of one of the ResNet50 layers. ResNet50 has kernels that are larger than those of the ConvNet model.**

**Implement the visualization of any layer of the ResNet50 model where the kernels are larger than 3x3. Describe what kind of features these kernels represent. Can you identify anything that we know the brain computes when processing visual information?**

---



In [None]:
raise NotImplementedError('')

# Copy the visualization code here and modify it accordingly.

However, we are still not close to perfect performance ...


---

**This is the end of the mandatory and graded part of the assignment. Below you find a small part on how to use the popular Vision Transformer architecture to achieve state-of-the-art performance on the CIFAR-10 object classification task.**

**The following part is voluntary!**

---

# Bonus: Vision transformers

In [None]:
from transformers import ViTModel

One of the most popular architectures of today is the Vision Transformer (ViT) architecture. This model uses a new type of mechanism, called attention, that was initially proposed for natural language processing. Using some smart tricks, researchers from Google Brain managed to adapt this model to images, which has led to massive successes. Understanding all the details of this model usually takes quite some time, but luckily we don't have to implement this model ourselves (although doing so can be a valuable lesson!). Instead we load the architecture and pretrained weights straight from the transformers package.

The model that we load is a relatively small vision transformer by the user WinKawaks on Hugging Face. It has ~21.8 million parameters and is also pretrained on the ImageNet dataset.

Note that there is an additional line in our initialization method which enables the computation of gradients for the pooler layer of the encoder. This is because the original vision transformer doesn't condense an image into a single vector. Therefore, the transformers package adds an additional layer called the pooler, which does this for us. This pooler, however, has not been pretrained, so we have to train this one ourselves. Luckily, the size of this pooler is negligible compared to the rest of the encoder, so training it should not be a problem.

In [None]:
class ViTClassifier(nn.Module):
    def __init__(self, classes):
        super(ViTClassifier, self).__init__()
        self.encoder = ViTModel.from_pretrained(
            "WinKawaks/vit-small-patch16-224",
        )
        self.encoder.requires_grad_(False)
        self.encoder.pooler.requires_grad_(True)
        self.classifier = nn.Linear(384, classes)

    def forward(self, x):
        x = self.encoder(x).pooler_output
        return self.classifier(x)

vit_model = ViTClassifier(classes=10)
vit_model.to(device)  # use the device selected earlier, so this also works when no GPU is available

Next, we load the model specific transformations again.

In [None]:
vit_processor = AutoImageProcessor.from_pretrained("WinKawaks/vit-small-patch16-224")

def vit_transform(example):
    return vit_processor(example, return_tensors="pt").pixel_values.squeeze()

And we build the datasets using this transformation.

In [None]:
vit_cifar10 = torchvision.datasets.CIFAR10(root='.', transform=vit_transform, download=True, train=True)
vit_cifar10_test = torchvision.datasets.CIFAR10(root='.', transform=vit_transform, download=True, train=False)

vit_cifar_loader = DataLoader(dataset=vit_cifar10, batch_size=512, num_workers=8)
vit_cifar_test_loader = DataLoader(dataset=vit_cifar10_test, batch_size=512, num_workers=8)

Now we are set to fine-tune our vision transformer model.

In [None]:
vit_optimizer = torch.optim.Adam(vit_model.parameters(), lr=0.001)
train(model=vit_model, loader=vit_cifar_loader, n_epochs=1, criterion=criterion, optimizer=vit_optimizer, device=device)

Training the Vision Transformer for 1 epoch should again take about 6 minutes.

And, lastly, we check the fine-tuned model's accuracy.

In [None]:
accuracy = evaluate(model=vit_model, loader=vit_cifar_test_loader, device=device)

What a beast! The transformer is outperforming all other models. Note however, that the transformer has maaaaaaaany parameters and has been trained on a huge dataset to achieve this. But as we can see, with very little fine-tuning we can use this pre-trained model to perform a task that is different from the task that it was trained on. And that even on new data.