# MY475 Seminar 3: Convolutional neural networks and autoencoders

In this seminar, we will study how to work with image data using both *convolutional neural networks* (CNNs) and *(variational) autoencoders*.

As normal, we will build our networks in three stages:

1. Format the input data
2. Design the neural network architecture
3. Train the model

Our first dataset is the canonical [MNIST](https://en.wikipedia.org/wiki/MNIST_database) dataset, a set of handwritten digits, each paired with its numerical label from 0 to 9. The first task is to recognise which digit each handwritten example represents, making this a *multi-class classification problem* with 10 outcome classes (0, 1, ..., 9).

## Exercise 0. Reading in images

Fortunately, the MNIST data is so common that it comes bundled with `torchvision` (this module is usually installed alongside base PyTorch), so we can use built-in functions to download the data. Before we get to the full data, however, it is worth understanding how these images get converted into numerical arrays (this is also useful for Exercise 2 below).

Complete the code below to read in a single image and display it using the `matplotlib` library.

In [None]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

# Load the image
example = Image.open("mnist_example.jpg")

# Display the image
plt.imshow(example, cmap=____)
plt.show()

# Convert the image to a numpy array
pixel_array = np.array(example)
pixel_array[16]
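
To make the image-to-array conversion concrete without relying on `mnist_example.jpg`, here is a minimal sketch that goes in the opposite direction first: it builds a tiny greyscale image from a hand-written array and then converts it back. The 4x4 `pixels` array and the name `tiny_img` are made up for illustration and are not part of the seminar data.

```python
import numpy as np
from PIL import Image

# A hypothetical 4x4 greyscale image: 0 is black, 255 is white
pixels = np.array(
    [
        [0, 0, 255, 0],
        [0, 255, 255, 0],
        [0, 0, 255, 0],
        [0, 255, 255, 255],
    ],
    dtype=np.uint8,
)

# A 2D uint8 array is interpreted as an 8-bit greyscale image (mode "L")
tiny_img = Image.fromarray(pixels)

# Converting back gives one row per pixel row and one column per pixel column
roundtrip = np.array(tiny_img)
print(tiny_img.mode)
print(roundtrip.shape)
print(roundtrip[1])  # the second row of pixel intensities
```

Each entry is a single pixel intensity, so indexing a row (as `pixel_array[16]` does above) returns the intensities along one horizontal line of the image.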

## Exercise 1. Building a convolutional neural network

### Data formatting

We'll download our data using a specific function in the `torchvision.datasets` submodule. This function will download the data into your working directory. We can also specify whether we want the canonical `train` or `test` split of the data and, since we want to represent the images as tensors, we can use `ToTensor()` from the `torchvision.transforms` submodule to convert the data into the correct format.

Complete the code below to generate both a train and a test dataset, as well as corresponding dataloaders for training.

In [None]:
# Get train and test sets for the MNIST data using the torch package
import torch
import torchvision
from torchvision.transforms import ToTensor

torch.random.manual_seed(42)

# Download and load the training data
mnist_train = torchvision.datasets.MNIST(
    "./mnist_data", download=True, train=True, transform=ToTensor()
)
trainloader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)

# Get the corresponding test data
mnist_test = torchvision.datasets.MNIST(
    "./mnist_data", download=True, train=False, transform=ToTensor()
)
testloader = torch.utils.data.DataLoader(mnist_test, batch_size=64, shuffle=True)

### Defining the CNN architecture

Now we can design our CNN to classify these images.

The key, new bits of code we will need to use are:

`torch.nn.Conv2d` - a 2D convolutional layer to create our feature maps

 * This function takes in a minimum of three arguments: the number of input channels (`in_channels`), the number of output channels (`out_channels`), and the kernel size (you guessed it...`kernel_size`!)
 * You can also modify the `stride` (default = 1) and amount of `padding` (default = 0). Notice, in the default implementation, as there is no padding there will be some reduction in the image size after each cross-correlation operation. If you want to downsample the image only through pooling, you can set `padding = 'same'` to keep the image size the same after the cross-correlation.
 * You can review the full documentation for this layer [here](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)
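
The effect of `padding` on the output size can be checked directly. The following is a minimal sketch using a random input tensor in place of a real image; the variable names are illustrative only.

```python
import torch

# One random single-channel 28x28 "image": (batch, channels, height, width)
x = torch.randn(1, 1, 28, 28)

# Default padding (0): a 3x3 kernel trims one pixel from each border
conv_default = torch.nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)

# padding="same": the input is padded so height and width are preserved
conv_same = torch.nn.Conv2d(
    in_channels=1, out_channels=2, kernel_size=3, padding="same"
)

out_default = conv_default(x)
out_same = conv_same(x)
print(out_default.shape)  # torch.Size([1, 2, 26, 26])
print(out_same.shape)  # torch.Size([1, 2, 28, 28])
```

For a kernel size of 3 and a stride of 1, `padding=1` gives the same output size as `padding="same"`.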

`torch.nn.MaxPool2d` - a 2D pooling layer we can use to downsample our feature maps

 * For a given `kernel_size` and `stride`, this layer will take the maximum value in each `kernel_size` x `kernel_size` window and output it to the next layer
 * This operation has the effect of reducing the resolution of our feature maps
 * Note: as with dropout, this layer has no learnable parameters, so we only need to define this attribute once (and can then apply it multiple times in the forward pass)
 * The full documentation can be found [here](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html)
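
To see max pooling in action, here is a small sketch on a hand-written 4x4 feature map (the values are made up so the result is easy to verify by eye). It also reuses the same pooling object twice, which is safe because the layer holds no weights.

```python
import torch

pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)

# A hypothetical 4x4 feature map with shape (batch, channels, height, width)
feature_map = torch.tensor(
    [
        [1.0, 2.0, 5.0, 6.0],
        [3.0, 4.0, 7.0, 8.0],
        [9.0, 10.0, 13.0, 14.0],
        [11.0, 12.0, 15.0, 16.0],
    ]
).reshape(1, 1, 4, 4)

# Each non-overlapping 2x2 window is replaced by its maximum
pooled = pool(feature_map)
print(pooled)  # values 4, 8, 12, 16 arranged as a 2x2 map

# The same layer object can be applied again to downsample further
pooled_twice = pool(pooled)
print(pooled_twice.shape)  # torch.Size([1, 1, 1, 1])
```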

We will then use fully connected layers (as per last week) at the end of our convolutions before making our final predictions.

### Building the neural network

Your task is to complete the below code to build a simple CNN with the following architecture:

1. Two convolutional layers. The first layer should yield 2 feature maps, and the second layer should yield 4 feature maps. You should use a kernel size of 3 for both layers, but prevent these cross-correlations from reducing the dimensionality of the inputs (using the `padding` argument)
2. Max pooling layers with a kernel size of 2 and stride of 2 to downsample the resolution of the image
3. Two fully connected layers with 64 and 32 nodes respectively
4. An output layer with 10 features (one for each digit)

You will need to *flatten* the output of your convolutions before passing the data to the fully connected layers. One way to do this is the tensor method `.view()` (called as `x.view(...)`), which changes the shape of the tensor while reusing the existing data, so nothing is copied in memory. (`reshape` behaves the same way when it can, but may silently fall back to making a copy.)

To calculate the number of input features to the first fully connected layer, you should consult the documentation for `torch.nn.Conv2d` and `torch.nn.MaxPool2d` for the required formulas (or do it from first principles in your head!).
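
A minimal sketch of flattening with `.view()` is shown below. The tensor shape (2 images, 3 feature maps of 5x5) is an arbitrary assumption chosen for illustration and is not the shape your CNN will produce.

```python
import torch

# Hypothetical batch: 2 images, 3 feature maps each, 5x5 resolution
feature_maps = torch.arange(2 * 3 * 5 * 5, dtype=torch.float32).reshape(2, 3, 5, 5)

# Keep the batch dimension and let -1 infer the remaining size
flat = feature_maps.view(feature_maps.size(0), -1)
print(flat.shape)  # torch.Size([2, 75])

# Both tensors point at the same underlying memory, so no copy was made
shares_memory = flat.data_ptr() == feature_maps.data_ptr()
print(shares_memory)  # True
```

Keeping the batch dimension explicit (rather than hard-coding it) means the same line works for any batch size.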
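
The output-size formula from the documentation can be written as a small helper. This is only a sketch: `conv_out_size` is a hypothetical name, it assumes the default dilation of 1 and square inputs, and the 64x64 example network below is deliberately different from the one in this exercise.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Height (or width) after a Conv2d or MaxPool2d layer, assuming dilation=1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Hypothetical network on a 64x64 input (not the exercise architecture)
size = 64
size = conv_out_size(size, kernel_size=5, padding=2)  # conv, size preserved: 64
size = conv_out_size(size, kernel_size=2, stride=2)   # pool: 32
size = conv_out_size(size, kernel_size=5)             # conv, no padding: 28
size = conv_out_size(size, kernel_size=2, stride=2)   # pool: 14

n_channels = 8  # assumed number of feature maps from the last conv layer
n_features = n_channels * size * size
print(size, n_features)  # 14 1568
```

The number of input features to the first fully connected layer is the number of feature maps multiplied by the final height and width.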

In [None]:
class CNN(torch.nn.Module):

    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = ____
        self.conv2 = torch.nn.Conv2d(2, 4, kernel_size=3, padding=1)
        self.fc1 = ____
        self.fc2 = ____
        self.out = torch.nn.Linear(32, 10)

        self.hidden_act = torch.nn.ReLU()
        self.pool = torch.nn.____(kernel_size=2, stride=____)

    def forward(self, x):
        # Your code here

    def predict_proba(self, x):
        return torch.nn.functional.____(self.forward(x), dim=1)

To test your model, we can take the very first training image example and pass it through our untrained model to check the output size:

In [None]:
# Convert the img to a tensor and test the output of the CNN
img_tensor = mnist_train[0][0]
cnn = ____()
output = cnn(img_tensor)
output

### Training the neural network

Finally, we will train the neural network using the MNIST data. We won't worry about a validation dataset today: because the runtimes are long, we will only loop through our data twice!

Complete the code below to train the neural network. You will need to define a loss function and an optimizer, and then loop over the data to train the model.

You should use cross entropy loss and the `torch.optim.Adam` optimizer.

**Warning!** Always read the PyTorch documentation for the function you are using. In our discussions of multi-class classification, we defined the output activation as a softmax function (converting logits to probabilities for each class). But the cross entropy loss function in PyTorch already includes a softmax operation, so we should not apply the softmax function to the output of our neural network. If you added a softmax activation before, remove this line and rerun your model definition before proceeding.

  * If you want to get class probabilities, you could define a new method `predict_proba` that applies the softmax function to the output of your neural network (i.e. by calling `forward()` and then activating the results). This method would not be used during training, but could be useful for making predictions after training is complete. *You do not need to implement this now in order to complete these exercises*
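
To check the claim that the cross entropy loss already includes the softmax step, the sketch below compares it with applying a log-softmax by hand followed by the negative log-likelihood loss. The logits and labels are random placeholders, not model outputs.

```python
import torch

torch.manual_seed(0)

# Placeholder logits for 5 examples and 10 classes, with made-up labels
logits = torch.randn(5, 10)
labels = torch.tensor([3, 0, 9, 1, 7])

# Cross entropy applied directly to the unnormalized logits
ce = torch.nn.CrossEntropyLoss()(logits, labels)

# The same quantity computed in two explicit steps
log_probs = torch.nn.functional.log_softmax(logits, dim=1)
nll = torch.nn.NLLLoss()(log_probs, labels)

print(torch.allclose(ce, nll))  # True
```

Passing already-softmaxed outputs to the cross entropy loss would therefore normalize them a second time, which changes the loss and its gradients.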

Even though the number of feature maps is very small, the sheer quantity of training data means the code will take a while to run (approx. 3 minutes on an M1 Max MacBook Pro).

In [None]:
cnn = CNN()

criterion = (
    torch.nn.____()
)  # notice the loss function requires the output to be unnormalized logits (i.e. not softmax)
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.001)

# Train the CNN
n_epochs = 2
cnn.train()
for epoch in range(n_epochs):
    for i, (imgs, labels) in enumerate(trainloader):
        optimizer.____()
        outputs = cnn(____)
        loss = criterion(outputs, labels)
        loss.backward()
        ____.step()

        if i % 100 == 0:
            print(f"Epoch {epoch+1}/{n_epochs}, Batch #{i}, Loss: {loss.item()}")

### Test time

Finally, we can assess how well our model performs on out-of-sample data using the test set and the classification accuracy metric. We do not need to convert logits to probabilities here: softmax preserves ordering, so the largest logit always has the highest probability, and we can just use `torch.max` to find the most likely class. When given a dimension, this function returns both the maximum value and the index of the maximum value (we want the latter), so we apply it along dimension 1 (the class dimension) to get the most likely class for each image.

A little caveat: PyTorch is fantastic, but for some functions the documentation can be easy to misread. So while you should always read it, you may also need to test it. Here, the `torch.max` [documentation](https://pytorch.org/docs/2.6/generated/torch.max.html#torch-max) opens with the single-argument form, which simply 'returns the maximum value', but when you also pass a dimension, as you'll see below, it returns *both* that value **and** the corresponding index of that value. So just be mindful of this as you implement novel features from this library.
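
A quick way to test this behaviour yourself is on a tiny hand-written tensor. The logits below are made up for illustration.

```python
import torch

# Made-up logits for 2 examples and 3 classes
logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 0.2, 3.0]])

# With a dimension argument, torch.max returns (values, indices)
values, indices = torch.max(logits, 1)
print(values)  # tensor([2., 3.])
print(indices)  # tensor([0, 2])

# Softmax does not change which class is largest
probs = torch.softmax(logits, dim=1)
same_prediction = torch.equal(torch.argmax(probs, dim=1), indices)
print(same_prediction)  # True
```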

Complete and run the code below:

In [None]:
with torch.no_grad():
    cnn.____()
    correct_count = 0
    for imgs, labels in testloader:
        # get logits for batch
        outputs = cnn(____)
        # get the predicted class
        _, predicted = torch.max(____, 1)
        correct_count += (predicted == labels).sum().item()

    print(f"Accuracy: {100*correct_count/len(____)}%")

## Exercise 2. Implement a CNN for the CIFAR-10 dataset

A somewhat more challenging dataset is [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html). Can you modify your previous code to work with this dataset, where images are 32x32 pixels and have three channels (RGB)? The dataset can also be loaded through PyTorch, using `datasets.CIFAR10` instead of `datasets.MNIST`.
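
When the input size changes, one practical way to find the new number of flattened features is to push a dummy tensor through the convolutional part of the network and read off the shape. The sketch below uses a made-up single-layer feature extractor (6 feature maps is an arbitrary choice), so it is a starting point rather than a solution to the exercise.

```python
import torch

# Hypothetical feature extractor for 3-channel inputs
features = torch.nn.Sequential(
    torch.nn.Conv2d(3, 6, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2, stride=2),
)

# A dummy batch with the CIFAR-10 image shape: (batch, channels, height, width)
dummy = torch.zeros(1, 3, 32, 32)

with torch.no_grad():
    n_flat = features(dummy).flatten(start_dim=1).shape[1]

print(n_flat)  # 6 * 16 * 16 = 1536
```

The resulting number is what the first fully connected layer would need as its `in_features` for this particular extractor.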


In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import random


random.seed(42)
torch.manual_seed(42)

# Define a transform to normalise the data with mean and sd
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

# Download and load the CIFAR10 training dataset
trainset = torchvision.datasets.CIFAR10(
    root="./cifar10_data", train=True, download=True, transform=transform
)

# Define the classes in CIFAR10
classes = (
    "plane",
    "car",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
)


# Function to show an image
def imshow(img):
    # Convert tensor to numpy array and transpose from (C,H,W) to (H,W,C)
    img = img.numpy().transpose((1, 2, 0))

    # Unnormalize if the data was normalized with mean and std
    img = img * np.array([0.5, 0.5, 0.5]) + np.array([0.5, 0.5, 0.5])

    # Clip values to be between 0 and 1
    img = np.clip(img, 0, 1)
    return img


# Randomly select 16 images
random_indices = random.sample(range(len(trainset)), 16)
random_images = [trainset[i][0] for i in random_indices]
random_labels = [trainset[i][1] for i in random_indices]

# Create a figure with a 4x4 grid
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
fig.subplots_adjust(hspace=0.4)

# Plot each image
for i, ax in enumerate(axes.flat):
    ax.imshow(imshow(random_images[i]))
    ax.set_title(f"{classes[random_labels[i]]}")
    ax.axis("off")

plt.suptitle("Sample of CIFAR10 Images", fontsize=16)
plt.tight_layout()
plt.show()

# Your code here

## Exercise 3. MNIST (variational) autoencoder

Revisiting the autoencoders from the lecture, can you improve the reconstruction and also the sampling of new MNIST images? Options include increasing the number of units in the bottleneck representation, adding convolutional layers, or adding further fully connected layers. You could also add structure by leveraging the knowledge that MNIST digits are grey-scale values between 0 and 1, incorporating a sigmoid activation function in the output layer and a binary cross-entropy loss function.
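
As one possible starting point for the sigmoid plus binary cross-entropy suggestion, here is a sketch of a loss for a variational autoencoder. It assumes the decoder output has already been passed through a sigmoid and that the encoder returns a mean and log-variance for a Gaussian latent. The function name `vae_loss`, the argument names, and the latent size of 16 are hypothetical and may not match the lecture code.

```python
import torch


def vae_loss(recon, target, mu, logvar):
    """Binary cross-entropy reconstruction term plus KL divergence to N(0, I)."""
    # Reconstruction: both recon and target must lie in [0, 1]
    bce = torch.nn.functional.binary_cross_entropy(recon, target, reduction="sum")
    # Closed-form KL divergence between N(mu, exp(logvar)) and N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl


torch.manual_seed(0)

# Placeholder data: 8 flattened 28x28 "images" and a 16-dimensional latent
target = torch.rand(8, 784)
recon = torch.sigmoid(torch.randn(8, 784))
mu = torch.zeros(8, 16)
logvar = torch.zeros(8, 16)

loss = vae_loss(recon, target, mu, logvar)
print(loss.item())
```

For a plain (non-variational) autoencoder, only the reconstruction term would be needed.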

In [None]:
# Your code here