# Exercise 6

## Group ID: 
## Exercise day: 

### Imports

In [4]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from torchvision import datasets

### Data loading and splitting

In [5]:
fashion_mnist_train = datasets.FashionMNIST(root='data', train=True, download=True)
fashion_mnist_test = datasets.FashionMNIST(root='data', train=False, download=True)

X_train = fashion_mnist_train.data.reshape(-1, 1, 28, 28).float() / 255
y_train = fashion_mnist_train.targets

X_test = fashion_mnist_test.data.reshape(-1, 1, 28, 28).float() / 255
y_test = fashion_mnist_test.targets

### PyTorch Dataset and DataLoader

To make the training process easier, we will use the PyTorch `Dataset` and `DataLoader` classes. The `Dataset` class is an abstract class representing a dataset, while the `DataLoader` class provides an iterable over a dataset that yields mini-batches. In this case, we use `TensorDataset` to wrap our tensors and `DataLoader` to iterate over the training and test datasets.

In [7]:
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
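
To see what a `DataLoader` yields, the following minimal sketch iterates over a small synthetic dataset. The random tensors and the names `X_demo`, `y_demo` and `demo_loader` are stand-ins chosen for illustration so that the example runs without downloading FashionMNIST.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for FashionMNIST (assumption for illustration):
# 100 grayscale 28x28 images with labels from 10 classes.
X_demo = torch.rand(100, 1, 28, 28)
y_demo = torch.randint(0, 10, (100,))

demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

# Each iteration yields one (inputs, targets) mini-batch.
X_batch, y_batch = next(iter(demo_loader))
print(X_batch.shape, y_batch.shape)

# 100 samples at batch size 32 give three full batches and one
# smaller final batch with the 4 leftover samples.
batch_sizes = [len(y) for _, y in demo_loader]
print(batch_sizes)
```

Shuffling only changes which samples end up in which batch, not the batch sizes. If the smaller last batch is unwanted, `DataLoader` accepts `drop_last=True`.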

## Description
With all the introduction steps done, we can now start implementing the `GoogLeNet` architecture.
In this exercise, you will implement a small version of the GoogLeNet architecture. GoogLeNet is a deep convolutional neural network that was proposed by Szegedy et al. in 2014. The architecture is known for its use of inception modules, which perform multiple convolutions with different filter sizes in parallel and then concatenate the results. The architecture also uses auxiliary classifiers to help with training.
This notebook guides you through the implementation. You need to complete the following subtasks:

1. Implement the Inception module (1 point)
2. Implement the auxiliary classifier (1 point)
3. Implement the BabyGoogLeNet architecture (2 points)
4. Train the model on FashionMNIST (1 point)

For a more detailed explanation of the GoogLeNet architecture, you can refer to the [original paper](https://arxiv.org/abs/1409.4842).


### Inception Module

The following figure shows the structure of the inception module. Here we refer only to subfigure (b), the Inception module with dimension reduction:

![Inception Module](inception_module.png)

The inception module is a module that performs multiple convolutions with different filter sizes and then concatenates the results. The module consists of four branches, each of which performs a different operation. The first branch performs a 1x1 convolution, the second branch performs a 1x1 convolution followed by a 3x3 convolution, the third branch performs a 1x1 convolution followed by a 5x5 convolution, and the fourth branch performs a 3x3 max pooling followed by a 1x1 convolution. The outputs of the four branches are then concatenated along the channel dimension.
After each convolution operation, a ReLU activation function is applied.
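
The concatenation only works if all branches keep the spatial size of the input. The sketch below illustrates this with arbitrary channel counts picked for the example (they are not the ones required by the exercise): with stride 1 and an odd kernel size `k`, a padding of `(k - 1) // 2` preserves height and width, so the branch outputs differ only in their number of channels.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8, 14, 14)  # B x C x H x W

# Stride 1 with padding (k - 1) // 2 keeps H and W unchanged.
conv1 = nn.Conv2d(8, 4, kernel_size=1)             # padding 0
conv3 = nn.Conv2d(8, 6, kernel_size=3, padding=1)
conv5 = nn.Conv2d(8, 2, kernel_size=5, padding=2)
pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # keeps all 8 channels

outs = [conv1(x), conv3(x), conv5(x), pool(x)]
for out in outs:
    print(out.shape)

# Concatenating along the channel dimension (dim=1) adds up the
# channel counts: 4 + 6 + 2 + 8 = 20.
merged = torch.cat(outs, dim=1)
print(merged.shape)
```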

### 1. Implement the Inception module (1 point)

The following PyTorch functions will be useful for implementing the different modules in this exercise:

- `nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)`: Creates a 2D convolutional layer with the specified number of input and output channels, kernel size, stride, and padding.

- `nn.ReLU()`: Creates a ReLU activation function.

- `nn.LeakyReLU()`: Creates a Leaky ReLU activation function.

- `torch.cat(tensors, dim)`: Concatenates the given sequence of tensors along the given dimension.

- `torch.flatten(input, start_dim, end_dim)`: Flattens a contiguous range of dims into a tensor.

- `nn.MaxPool2d(kernel_size, stride, padding)`: Creates a 2D max pooling layer with the specified kernel size, stride, and padding.

- `nn.AvgPool2d(kernel_size, stride, padding)`: Creates a 2D average pooling layer with the specified kernel size, stride, and padding.

- `nn.AdaptiveAvgPool2d(output_size)`: Creates an adaptive average pooling layer that outputs a tensor with the specified output size.

- `nn.Sequential(*args)`: A sequential container that stores a sequence of layers, which are executed in order. For example, `nn.Sequential(nn.Conv2d(1, 1, 3), nn.ReLU(), nn.Conv2d(1, 1, 3))` creates a sequential model with two convolutional layers and a ReLU activation function in between.

You can also use any other PyTorch functions that you find useful. You can refer to the [PyTorch documentation](https://pytorch.org/docs/stable/index.html) for more information.


One thing to keep in mind when using PyTorch is that each building block you implement should be a subclass of `nn.Module`. Subclassing `nn.Module` lets PyTorch register the layers you assign as attributes and keep track of their parameters, so that calls such as `model.parameters()` and `model.to(device)` reach all of them. Specify the layers you want to use in the `__init__` method and define the forward pass in the `forward` method.
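
As a minimal sketch of this pattern, the hypothetical `TinyBlock` below (not part of the exercise) defines one convolution and one activation in `__init__` and chains them in `forward`. Listing the parameters afterwards shows that the convolution's weight and bias were picked up automatically.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        # Layers assigned as attributes are registered as submodules.
        self.conv = nn.Conv2d(1, 2, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))


block = TinyBlock()

# Parameter names are prefixed with the attribute name of the layer.
param_names = [name for name, _ in block.named_parameters()]
print(param_names)

# Weight: 2 * 1 * 3 * 3 = 18 values, bias: 2 values.
n_params = sum(p.numel() for p in block.parameters())
print(n_params)

out = block(torch.randn(1, 1, 8, 8))
print(out.shape)
```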

In [22]:
class InceptionModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(InceptionModule, self).__init__()
        ### Your code here ###

        # allocate 1/4 of the output channels to each branch
        branch_out_channels = out_channels // 4
        # first branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, branch_out_channels, kernel_size=1),
            nn.ReLU()
        )

        # second branch
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(in_channels, branch_out_channels, kernel_size=3, padding=1),
            nn.ReLU()
        )

        # third branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(in_channels, branch_out_channels, kernel_size=5, padding=2),
            nn.ReLU()
        )

        # fourth branch
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, branch_out_channels, kernel_size=1),
            nn.ReLU()
        )
        #####################



    def forward(self, x):
        ### Your code here ###
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        # B x C x H x W
        out = torch.cat((branch1, branch2, branch3, branch4), -3)
        return out

        ######################

### Test

In [23]:
# Test the InceptionModule output shape
module = InceptionModule(1, 64)
x = torch.randn(1, 1, 28, 28)

assert module(x).shape == torch.Size([1, 64, 28, 28]), f"Shape was {module(x).shape} instead of torch.Size([1, 64, 28, 28])"

### 2. Implement the auxiliary classifier (1 point)

The auxiliary classifier is a small classifier that is added to the network to provide additional supervision during training. It is added after the first and second inception modules and consists of a 3x3 average pooling layer followed by a 1x1 convolutional layer, a LeakyReLU activation, adaptive average pooling with output size (1, 1), and a fully connected layer.

Implement the auxiliary classifier with the following structure:

- 3x3 average pooling layer with stride 2
- 1x1 convolutional layer
- LeakyReLU activation function
- Adaptive average pooling layer with output size (1, 1)
- Fully connected layer

The in_channels and out_channels of the convolutional layer are specified as arguments to the constructor.
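
To make the shapes concrete, the following sketch applies the listed layers one by one to a random input. The input size of 4 x 32 x 14 x 14 and the 64 output channels are assumptions for illustration. Without padding, the 3x3 average pooling with stride 2 maps a side length of 14 to `(14 - 3) // 2 + 1 = 6`.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 14, 14)  # assumed input: B x C x H x W

pooled = nn.AvgPool2d(kernel_size=3, stride=2)(x)      # 4 x 32 x 6 x 6
reduced = nn.Conv2d(32, 64, kernel_size=1)(pooled)     # 4 x 64 x 6 x 6
activated = nn.LeakyReLU()(reduced)                    # shape unchanged
squeezed = nn.AdaptiveAvgPool2d((1, 1))(activated)     # 4 x 64 x 1 x 1

# The fully connected layer expects a B x features input, so the
# trailing 1 x 1 spatial dimensions are flattened away first.
flat = torch.flatten(squeezed, start_dim=1)            # 4 x 64
logits = nn.Linear(64, 10)(flat)                       # 4 x 10

for t in (pooled, reduced, squeezed, flat, logits):
    print(t.shape)
```

Because of the adaptive pooling, the number of input features of the fully connected layer depends only on the channel count of the convolution, not on the spatial size of the input.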

In [33]:
class AuxillaryClassifier(nn.Module):
    def __init__(self, in_channels, out_channels, num_classes):
        super(AuxillaryClassifier, self).__init__()
        ### Your code here ###

        self.layers = nn.Sequential(
            # 3x3 average pooling
            nn.AvgPool2d(kernel_size=3, stride=2),
            # 1x1 conv
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.LeakyReLU(),
            # adaptive average pooling with output size 1x1
            nn.AdaptiveAvgPool2d((1, 1)),
            # fully connected layer, has to be flattened first
            nn.Flatten(),
            nn.Linear(out_channels, num_classes)

        )

        #####################

    def forward(self, x):
        ### Your code here ###
        return self.layers(x)
        ######################


### Test

In [34]:
# Test the AuxillaryClassifier output shape
module = AuxillaryClassifier(64, 64, 10)
x = torch.randn(4, 64, 28, 28)

assert module(x).shape == torch.Size([4, 10]), f"Shape was {module(x).shape} instead of torch.Size([4, 10])"

### 3. Implement the BabyGoogLeNet architecture (2 points)

The GoogLeNet architecture consists of multiple inception modules, followed by a global average pooling layer and a fully connected layer. The architecture also uses auxiliary classifiers to help with training.

Compared to the original GoogLeNet architecture, we will use a smaller version in this exercise. It consists of three inception modules, followed by a global average pooling layer and a fully connected layer. It also uses an auxiliary classifier after each of the first and second inception modules.

Implement the BabyGoogLeNet architecture with the following structure:

- CNN block:
    - 3x3 convolutional layer with 16 filters and stride 1 and padding 1
    - LeakyReLU activation function
    - 3x3 max pooling layer with stride 2 and padding 1
- Inception module 1 (16 input_channels and 32 output_channels)
- Auxiliary classifier 1 (32 input_channels, 64 output_channels and 10 classes)
- max pooling layer with kernel size 3 and stride 2 and padding 1
- Inception module 2 (32 input_channels and 64 output_channels)
- Auxiliary classifier 2 (64 input_channels, 128 output_channels and 10 classes)
- max pooling layer with kernel size 3 and stride 2 and padding 1
- Inception module 3 (64 input_channels and 128 output_channels)
- Global average pooling layer with output size 1 using `nn.AdaptiveAvgPool2d`
- Fully connected layer
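
As a quick check of the spatial sizes in the list above, the sketch below applies the pooling output-size formula `(size + 2 * padding - kernel_size) // stride + 1` three times, once for each max pooling layer, and compares the result with what `nn.MaxPool2d` produces. The helper `pooled_size` is a hypothetical name introduced for this example.

```python
import torch
import torch.nn as nn


def pooled_size(size, kernel_size=3, stride=2, padding=1):
    """Output side length of a pooling layer (floor mode, the PyTorch default)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Predicted side lengths: input, then after each of the three max pooling layers.
sizes = [28]
for _ in range(3):
    sizes.append(pooled_size(sizes[-1]))
print(sizes)

# Measure the same thing with an actual pooling layer.
pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 16, 28, 28)
measured = [x.shape[-1]]
for _ in range(3):
    x = pool(x)
    measured.append(x.shape[-1])
print(measured)
```

The convolutions and inception modules in this architecture use stride 1 with size-preserving padding, so only the pooling layers change the spatial size.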

As the auxiliary classifiers are only used during training, you have to distinguish between training and evaluation mode. You can do this by checking the model's `training` attribute (`self.training` inside `forward`, `model.training` from outside).
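
The mode switch can be sketched with a small hypothetical module, `TwoHeadNet`, which is not part of the exercise. A freshly constructed module starts in training mode; `eval()` sets the `training` flag to `False` and `train()` sets it back to `True`.

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(8, 8)
        self.head = nn.Linear(8, 3)
        self.aux_head = nn.Linear(8, 3)

    def forward(self, x):
        h = torch.relu(self.body(x))
        out = self.head(h)
        if self.training:
            # Extra output that is only needed for the training loss.
            return out, self.aux_head(h)
        return out


net = TwoHeadNet()
x = torch.randn(5, 8)

train_out = net(x)  # training mode: tuple of (main, auxiliary) outputs
net.eval()
eval_out = net(x)   # evaluation mode: main output only

print(type(train_out), eval_out.shape)
```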

In [35]:
class BabyGoogLeNet(nn.Module):
    def __init__(self):
        super(BabyGoogLeNet, self).__init__()
        ### Your code here ###

        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3,
                      padding=1, stride=1),
            nn.LeakyReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

        self.inception1 = InceptionModule(16, 32)
        self.aux1 = AuxillaryClassifier(32, 64, 10)
        self.maxpool1 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.inception2 = InceptionModule(32, 64)
        self.aux2 = AuxillaryClassifier(64, 128, 10)
        self.maxpool2 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.inception3 = InceptionModule(64, 128)
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, 10)
        )

        ######################

    def forward(self, x):
        ### Your code here ###
        # In training mode, also return the outputs of the auxiliary classifiers
        x = self.cnn(x)
        x = self.inception1(x)

        if self.training:
            aux1 = self.aux1(x)

        x = self.maxpool1(x)
        x = self.inception2(x)

        if self.training:
            aux2 = self.aux2(x)

        x = self.maxpool2(x)
        x = self.inception3(x)
        x = self.global_avg_pool(x)
        x = self.fc(x)

        if self.training:
            return x, aux1, aux2
        else:
            return x
        ######################


### Test

In [36]:
# Test the BabyGoogLeNet output shape
model = BabyGoogLeNet()
x = torch.randn(4, 1, 28, 28)

# Training mode
output, aux1, aux2 = model(x)
assert output.shape == torch.Size([4, 10]), f"Shape was {output.shape} instead of torch.Size([4, 10])"
assert aux1.shape == torch.Size([4, 10]), f"Shape was {aux1.shape} instead of torch.Size([4, 10])"
assert aux2.shape == torch.Size([4, 10]), f"Shape was {aux2.shape} instead of torch.Size([4, 10])"

# Evaluation mode
model.eval()
output = model(x)
assert output.shape == torch.Size([4, 10]), f"Shape was {output.shape} instead of torch.Size([4, 10])"

In [37]:
def validate(model, test_loader, device):
    model.to(device)
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            _, predicted = torch.max(y_pred, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    model.train()
    return correct / total

### 4. Implement the training loop (1 point)

You will also implement parts of the training loop for the BabyGoogLeNet architecture. The following steps have to be implemented:

- Forward pass: Pass the input through the network to get the output.
- Calculate the loss: Calculate the loss using the outputs (main output and the two auxiliary outputs) and the target.
- Backward pass: Perform backpropagation to calculate the gradients.
- Update the parameters: Update the weights of the model.
- Zero the gradients: Zero the gradients of the model after updating the weights. For this you can use the `zero_grad()` method of `nn.Module`, e.g. `model.zero_grad()`.
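
The five steps can be sketched on a toy problem as follows. This sketch swaps in `torch.optim.SGD` for the parameter update, which is one common way to do it and is not required by the exercise. The linear model, the random data and the names `toy_model`, `X` and `y` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup (assumption): 16 samples with 4 features and 3 classes.
toy_model = nn.Linear(4, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

X = torch.randn(16, 4)
y = torch.randint(0, 3, (16,))

losses = []
for _ in range(20):
    logits = toy_model(X)         # forward pass
    loss = criterion(logits, y)   # calculate the loss
    loss.backward()               # backward pass: fills param.grad
    optimizer.step()              # update the parameters
    optimizer.zero_grad()         # zero the gradients for the next step
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Gradients accumulate across calls to `backward()`, which is why they have to be reset once per update.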

The loss function should be of the following form:

$$
\begin{aligned}
\text{loss}(target, output, aux\_output_1, aux\_output_2, w_{aux}) ={}& \text{criterion}(output, target) \\
&+ w_{aux} \times \text{criterion}(aux\_output_1, target) \\
&+ w_{aux} \times \text{criterion}(aux\_output_2, target)
\end{aligned}
$$

where `criterion` is the cross-entropy loss function and $w_{aux}$ is a hyperparameter that controls the weight of the auxiliary classifiers. We set it to 0.9 for the training.
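
A minimal sketch of this loss with random logits standing in for the three network outputs (the tensors here are assumptions for illustration, not real model outputs):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

criterion = nn.CrossEntropyLoss()
w_aux = 0.9

# Random stand-ins: a batch of 4 samples with 10 classes.
target = torch.randint(0, 10, (4,))
output = torch.randn(4, 10)
aux_output_1 = torch.randn(4, 10)
aux_output_2 = torch.randn(4, 10)

main_loss = criterion(output, target)
aux_loss = criterion(aux_output_1, target) + criterion(aux_output_2, target)

# Both auxiliary terms share the same weight, so it can be factored out.
loss = main_loss + w_aux * aux_loss
print(loss.item())
```

The result is a single scalar tensor, so one call to `backward()` propagates gradients through the main path and both auxiliary classifiers.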

In [41]:
def train(model, criterion, train_loader, test_loader, learning_rate, n_epochs, aux_loss_weight, device):
    print(f'Training on {device} with learning rate of {learning_rate} for {n_epochs} epochs')
    # Transfer model to device, this is necessary if we want to use GPU acceleration
    model.to(device)
    for i in range(n_epochs):
        # Set model to training mode. This is necessary because parts of our model, such as the auxiliary classifiers, behave differently in training and evaluation mode
        model.train()
        epoch_loss = 0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device) # Model and data have to be on the same device
            # Your code here #
            # Forward pass
            y_pred, aux1, aux2 = model(X_batch)
            # Calculate loss
            loss = criterion(y_pred, y_batch) + aux_loss_weight * (
                criterion(aux1, y_batch) + criterion(aux2, y_batch)
            )
            epoch_loss += loss.item()
            # Backward pass
            loss.backward()
            # Update weights with a plain SGD step; no autograd tracking is needed here
            with torch.no_grad():
                for param in model.parameters():
                    param -= learning_rate * param.grad
            # Zero gradients
            model.zero_grad()

            ##################
        with torch.no_grad():
            # Set model to evaluation mode
            model.eval()
            val_loss = 0
            for X_batch, y_batch in test_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                val_loss += loss.item()
        val_acc = validate(model, test_loader, device)
        train_acc = validate(model, train_loader, device)
        print(f'Epoch {i+1}, train loss {epoch_loss/len(train_loader)}, test loss: {val_loss/len(test_loader)}, train acc: {train_acc}, test acc: {val_acc}')

In [42]:
model = BabyGoogLeNet()
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
aux_loss_weight = 0.9 # Scales the loss of the auxiliary classifiers
lr = 0.1  # 0.1 and 0.2 should work well enough
n_epochs = 25
# Set the device to 'cuda' if you have a GPU available, otherwise set it to 'mps' if you have Metal Performance Shaders (Apple's API for programming metal GPU) available, otherwise set it to 'cpu'
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

Train the model for at least 25 epochs.
Your model should achieve a validation accuracy of roughly 90% now.

In [43]:
train(model, criterion, train_loader, test_loader, lr, n_epochs, aux_loss_weight, device)

Training on cuda with learning rate of 0.1 for 25 epochs
Epoch 1, train loss 5.114478130330409, test loss: 1.05544536850255, train acc: 0.5681166666666667, test acc: 0.5636
Epoch 2, train loss 2.4984487278629213, test loss: 0.864342112639907, train acc: 0.7153833333333334, test acc: 0.7098
Epoch 3, train loss 1.7412569794192243, test loss: 0.4867699741368081, train acc: 0.8323333333333334, test acc: 0.821
Epoch 4, train loss 1.4577088203511512, test loss: 0.3892122449199106, train acc: 0.8656833333333334, test acc: 0.8499
Epoch 5, train loss 1.3113781955323494, test loss: 0.4390952677293948, train acc: 0.8534, test acc: 0.8409
Epoch 6, train loss 1.2245876280737837, test loss: 0.41525159709772486, train acc: 0.8578833333333333, test acc: 0.8442
Epoch 7, train loss 1.1591968031198994, test loss: 0.36937206053430105, train acc: 0.8734666666666666, test acc: 0.8582
Epoch 8, train loss 1.1009990716539722, test loss: 0.40119828549539965, train acc: 0.8629333333333333, test acc: 0.8455
Epoch

In [44]:
print(f'validation accuracy: {validate(model, test_loader, device)}')

validation accuracy: 0.8883
