<!-- Assignment 2 - SS 2023 -->

# Vision Networks and Fast Training (15 points)

This notebook contains one of the assignments for the exercises in Deep Learning and Neural Nets 2.
It provides a skeleton, i.e. code with gaps, that will be filled out by you in different exercises.
All exercise descriptions are visually annotated by a vertical bar on the left and some extra indentation,
unless you already messed with your jupyter notebook configuration.
Any questions that are not part of the exercise statement do not need to be answered,
but should rather be interpreted as triggers to guide your thought process.

**Note**: The cells in the introductory part (before the first subtitle)
perform all necessary imports and provide utility functions that should work without (too many) problems.
Please, do not alter this code or add extra import statements in your submission, unless explicitly allowed!

<span style="color:#d95c4c">**IMPORTANT:**</span> Please, change the name of your submission file so that it contains your student ID!

In this assignment, we will take a closer look at some famous vision architectures.
Since most of these architectures are very large, training them from scratch requires high-end hardware.
When such hardware is scarce, *Transfer Learning* offers an alternative:
by reusing the stored weights of a large network, a new network can be trained cheaply on new datasets.

In [1]:
import inspect
import torch
import torchvision
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from tqdm.notebook import trange

torch.manual_seed(1806)
torch.cuda.manual_seed(1806)

In [2]:
# google colab data management
import os.path

try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    _home = 'gdrive/MyDrive/'
except ImportError:
    _home = '~'
finally:
    data_root = os.path.join(_home, '.pytorch')

print(data_root)

~/.pytorch


## LeNet-5 and its Offspring

![LeNet-5 architecture](https://miro.medium.com/max/2154/1*1TI1aGBZ4dybR6__DI9dzA.png)

The LeNet-5 architecture (depicted above) is one of the first convolutional networks.
Since convolutions are extremely well suited for many computer vision tasks,
a wide variety of network architectures using convolutional layers has become available.
Although the differences in performance are sometimes large,
the architectures can generally be considered variations on the same theme.

### Alexnet

![alex-net architecture](https://cdn-images-1.medium.com/max/1000/1*wzflNwJw9QkjWWvTosXhNw.png)

In 2012 Alex Krizhevsky et al. won the [Imagenet Large Scale Visual Recognition Challenge](http://www.image-net.org/challenges/LSVRC/) (ILSVRC).
The network they used, which is known as *Alex-net*, is depicted above and follows the same basic principles as LeNet-5.
Alex-net has considerably more parameters than LeNet-5 and therefore requires a large amount of computational resources to train.

To speed up training, Alex-net was trained on GPUs.
Since GPUs have access to little memory compared to CPUs (especially back then),
alex-net did not fit on a single GPU and had to be split over 2 GPUs for training,
hence the distinction between two paths in the illustration of the network.

On modern GPUs, it is no longer a problem to fit alex-net on a single GPU.
Since most deep learning frameworks support hardware acceleration,
it has even become easy and common to train (large) networks on GPUs.
A more detailed description of how to achieve this in pytorch is given below.

Another important addition is the use of the dropout regularisation technique in the fully connected layers.
From DL & NN 1 you should remember that dropout behaves differently during testing and training.
When using Dropout or other modules with mode-dependent behaviour, e.g. BatchNorm, in pytorch,
it is important to make sure that your network operates in the right mode.
To do this, the `nn.Module` class provides the `train` and `eval` methods,
which are also invoked on all submodules to ensure that the desired behaviour is triggered.
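The effect of the two modes can be checked on a single dropout layer. The snippet below is a minimal sketch with made-up inputs (a vector of ones and a toy `nn.Sequential`), not part of the assignment code.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()              # training mode: units are zeroed at random
train_out = layer(x)       # surviving units are rescaled by 1 / (1 - p)

layer.eval()               # evaluation mode: dropout acts as the identity
eval_out = layer(x)

num_dropped = int((train_out == 0).sum())
eval_unchanged = torch.equal(eval_out, x)

# the mode is propagated to all submodules of a network
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
net.eval()
submodule_in_eval = not net[1].training
print(num_dropped, eval_unchanged, submodule_in_eval)
```

Calling `net.train()` afterwards switches all submodules back to training mode.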

### Pytorch GPU acceleration

In pytorch, training a model on GPU is relatively easy.
To copy a tensor `x` from main memory (or wherever it may be) to GPU memory,
all we need to do is call `x.to('cuda')` or equivalently `x.cuda()`.
When multiple GPUs are available, `x.to('cuda:0')` copies a tensor to the first GPU,
`x.to('cuda:1')` to the second, etc.
Similarly, to copy a tensor from a GPU (or again wherever it may be) to main memory,
`x.to('cpu')` or equivalently `x.cpu()` can be used.

Whenever a computation is done on tensors that reside on a specific device,
the result will also be on that device.
It is not possible, however, to make computations with tensors from different devices.
This means that the training of an entire network automatically takes place on e.g. a GPU,
as soon as all the variables reside on the same device.
When working with neural networks, 
this is the case if both the network parameters and the data are moved to the same device.

To move all parameters of a network to the correct device,
the `nn.Module` class provides a convenience `to` method
that moves all registered parameters and buffers, including those of submodules, to the correct device.
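As a small illustration of the calls described above, the following sketch moves a toy model and a toy input to the same device. The layer sizes are arbitrary assumptions, and the code falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)   # modules are moved in place (and returned)
x = torch.randn(3, 4)                # created in main memory
x = x.to(device)                     # tensors are copied, so re-assign the result

y = model(x)                         # the result lives on the same device
same_device = y.device == next(model.parameters()).device
print(y.device, same_device)
```

Note the asymmetry: `model.to(device)` modifies the module itself, whereas `x.to(device)` returns a new tensor and leaves `x` untouched.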

As for the data, it is often possible to fit the entire dataset in GPU memory.
However, doing so frequently brings no advantage, or even comes with disadvantages.
E.g. the `MNIST` dataset from `torchvision` provides PIL images that cannot reside on the GPU.
If the dataset were stored on the GPU, the data would have to move to the CPU first,
where the pre-processing is done on the PIL images, and then move back to the GPU.
Therefore, it is considered good practice to keep the dataset in main memory
and move the samples to the GPU only when they are needed for computation.
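A loop following this practice could look like the sketch below. The random `TensorDataset` is a hypothetical stand-in for a real dataset; only the current mini-batch is moved to the device of the model.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 3).to(device)

# hypothetical toy data that stays in main memory
dataset = TensorDataset(torch.randn(20, 8), torch.randint(0, 3, (20,)))
loader = DataLoader(dataset, batch_size=5)

batch_devices = []
with torch.no_grad():
    for x, y in loader:
        x, y = x.to(device), y.to(device)   # only this mini-batch is moved
        logits = model(x)
        batch_devices.append(logits.device.type)

print(batch_devices, dataset.tensors[0].device)
```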

### Exercise 1: Hardware Acceleration (3 points)

In order to allow our computations to be accelerated,
the utility functions `evaluate` and `update` require some minor adjustments.

 > Alter the `evaluate` and `update` functions from assignment 1
 > so that it is assured that the inputs are on the same device as the network parameters.
 > Also put the networks in the right modes so that dropout etc. work correctly.

In [3]:
@torch.no_grad()
def evaluate(network: nn.Module, data: DataLoader, metric: callable) -> list:
    # YOUR CODE HERE
    # use the device the network already lives on, so that inputs and parameters match
    device = next(network.parameters()).device

    network.eval()  # evaluation mode: disables dropout, BatchNorm uses running statistics
    errors = []
    for mini_batch_x, mini_batch_y in data:
        mini_batch_x = mini_batch_x.to(device)
        mini_batch_y = mini_batch_y.to(device)

        # no gradients are tracked here because of the torch.no_grad decorator
        logits = network(mini_batch_x)
        errors.append(float(metric(logits, mini_batch_y)))

    return errors

    
@torch.enable_grad()
def update(network: nn.Module, data: DataLoader, loss: nn.Module, 
           opt: optim.Optimizer) -> list:
    # YOUR CODE HERE
    # using ideas from: https://pytorch.org/tutorials/beginner/ptcheat.html?ref=blog.hackajob.com
    # use the device the network already lives on, so that inputs and parameters match
    device = next(network.parameters()).device

    network.train()  # training mode: enables dropout, BatchNorm uses batch statistics
    errors = []
    for mini_batch_x, mini_batch_y in data:
        mini_batch_x = mini_batch_x.to(device)
        mini_batch_y = mini_batch_y.to(device)

        opt.zero_grad()  # clear gradients of the previous step
        logits = network(mini_batch_x)
        error_one_batch = loss(logits, mini_batch_y)
        error_one_batch.backward()
        opt.step()

        errors.append(error_one_batch.item())

    return errors


In [4]:
# Test Cell: do not edit or delete!

In [5]:
# Test Cell: do not edit or delete!

In [6]:
# Test Cell: do not edit or delete!

In [7]:
# Test Cell: do not edit or delete!

In [8]:
# Test Cell: do not edit or delete!

In [9]:
# Test Cell: do not edit or delete!

### VGG

<img src="https://miro.medium.com/max/2628/1*lZTWFT36PXsZZK3HjZ3jFQ.png" 
     alt="VGG architecture" style="width: 70%; margin: auto" />

The Visual Geometry Group at Oxford University introduced 
different versions of architectures that are now known as VGG net.
11-layer, 16-layer and 19-layer variants exist,
all of which use only 3x3 convolutions in the feature extraction part.

After winning the Imagenet Large Scale Visual Recognition Challenge (ILSVRC) in 2014,
the weights of the winning models were made [available](https://www.robots.ox.ac.uk/~vgg/research/very_deep/).
This made it possible for researchers with lower computational budgets
to make use of the features the network has extracted for natural images.
Since 2021, large pre-trained models often end up serving as [foundation models](https://en.wikipedia.org/wiki/Foundation_models).
Note that VGG is a very small model compared to today's "large" models.

### Exercise 2: VGG for CIFAR-10 (2 points)

Most vision architectures have been trained on the ImageNet dataset, which is hard to come by:
it is very large (a few 100GB) and requires registration to get access to the images.
[CIFAR-10 and CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)
are similar datasets that are much easier to obtain
and are among the standard datasets in `torchvision.datasets`.
In this exercise the goal is to modify a vision network that was trained on ImageNet
to make predictions on CIFAR-10 so that we can reuse large parts of the weights.

 > Create a network with the same feature extraction architecture as
 > `torchvision.models.VGG` so that it can be used for CIFAR images.
 > Concretely, the goal is to replace the classifier to predict CIFAR labels
 > instead of the Imagenet labels.
 > Use global average pooling to make the classifier independent of the exact image size.
 > Keep the classifier architecture rectangular, i.e. same width for all layers (except for the classes).

**Hint:** Take a look at [`torchvision.models.VGG`](https://pytorch.org/vision/0.13/_modules/torchvision/models/vgg.html) if you need some inspiration.
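To see why global average pooling removes the dependence on the image size, consider the sketch below. It uses `nn.AdaptiveAvgPool2d` as one possible way to implement the pooling, and the feature-map shapes are illustrative assumptions.

```python
import torch
from torch import nn

pool = nn.AdaptiveAvgPool2d(1)      # averages every channel down to 1x1

small = torch.randn(2, 512, 1, 1)   # e.g. feature maps of small images
large = torch.randn(2, 512, 7, 7)   # e.g. feature maps of larger images

small_vec = torch.flatten(pool(small), 1)
large_vec = torch.flatten(pool(large), 1)
print(small_vec.shape, large_vec.shape)

# the pooled values are simply the per-channel spatial means
matches_mean = torch.allclose(large_vec, large.mean(dim=(2, 3)), atol=1e-6)
```

Both inputs end up as vectors with one entry per channel, so the same linear classifier can be applied to either of them.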

In [10]:
class CifarVGG(nn.Module):
    """ Variant of the VGG network for classifying CIFAR images. """
    
    def __init__(self, features: nn.Module, num_classes: int = 10):
        """
        Parameters
        ----------
        features : nn.Module
            The convolutional part of the VGG network.
        num_classes : int
            The number of output classes in the data.
        """
        # YOUR CODE HERE
        super().__init__()

        self.num_classes = num_classes
        self.features = features

        # Global average pooling reduces every feature map to a single value,
        # so the classifier input has 512 entries for any image size.
        self.global_avg_pooling = nn.AdaptiveAvgPool2d(1)

        # torchvision.models.VGG uses nn.Linear(512 * 7 * 7, 4096) as first layer;
        # after global average pooling only the 512 channels remain.
        # Rectangular classifier: all layers (except the output) have the same width.
        width = 512
        self.classifier = nn.Sequential(
            nn.Linear(512, width),
            nn.ReLU(True),
            nn.Linear(width, width),
            nn.ReLU(True),
            # no softmax here: nn.CrossEntropyLoss expects raw logits
            nn.Linear(width, self.num_classes),
        )

    def forward(self, x):
        conv_part_out = self.features(x)
        # pool to shape (N, 512, 1, 1), independent of the input image size
        pooled = self.global_avg_pooling(conv_part_out)
        pooled = torch.flatten(pooled, 1)  # one entry per channel
        return self.classifier(pooled)


In [11]:
# Test Cell: do not edit or delete!

In [12]:
# Test Cell: do not edit or delete!

In [13]:
# Test Cell: do not edit or delete!

### Exercise 3: Existing Features (2 points)

Training a network like VGG (or any of the other networks in this assignment)
can take a few hours when training on a GPU.
Therefore it is often useful to be able to load pre-trained weights into the network.
Also, saving a model that has been trained for hours can often save a lot of time.
In pytorch this is possible through what is called 
[`state_dict`s](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
Saving the parameters of a pytorch module can be done with `torch.save(module.state_dict(), path)`,
whereas loading saved parameters is done with `module.load_state_dict(torch.load(path))`.
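A round trip through a `state_dict` can be sketched as follows. The two small linear layers and the file name are hypothetical and serve only to illustrate the two calls mentioned above.

```python
import os
import tempfile

import torch
from torch import nn

source = nn.Linear(4, 2)
target = nn.Linear(4, 2)   # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear.pt")   # hypothetical file name
    torch.save(source.state_dict(), path)
    target.load_state_dict(torch.load(path))

state_keys = list(source.state_dict().keys())
weights_match = torch.equal(source.weight, target.weight)
print(state_keys, weights_match)
```

The keys of the `state_dict` are the names of the registered parameters and buffers, so loading only works if the names and shapes in the target module match.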

 > Write a function `vgg_init_` to initialise a `CifarVGG` network.
 > It should load the pre-trained weights for the **11-layer variant of VGG** from `torchvision.models.vgg`
 > to initialise the feature extractor of the model
 > and reasonably initialise the classifier using initialisation functions from `torch.nn.init`.

**Hint:** you can use all of the functions available in `torchvision.models.vgg`.

In [14]:
def vgg_init_(network: CifarVGG):
    """
    Initialise a CifarVGG network with a pre-trained VGG feature extractor.
    
    Parameters
    ----------
    network : CifarVGG
        The model to initialise.
    """
    from torchvision.models import vgg
    # YOUR CODE HERE
    # download (or load from cache) the ImageNet weights of the 11-layer VGG
    pretrained_weights = vgg.VGG11_Weights.IMAGENET1K_V1.get_state_dict(progress=True)

    # keep only the weights of the convolutional part and strip the 'features.' prefix
    prefix = 'features.'
    feature_weights = {
        key[len(prefix):]: value
        for key, value in pretrained_weights.items()
        if key.startswith(prefix)
    }
    network.features.load_state_dict(feature_weights)

    # the classifier is new: initialise it in the same way as torchvision's VGG
    for m in network.classifier.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, 0, 0.01)
            nn.init.constant_(m.bias, 0)


In [None]:
# sanity check
vgg11 = torchvision.models.vgg11() # gives the architecture of the convolutional part embedded in Sequential()
network = CifarVGG(vgg11.features, num_classes=10)
vgg_init_(network)

Downloading: "https://download.pytorch.org/models/vgg11-8a719046.pth" to /home/c/.cache/torch/hub/checkpoints/vgg11-8a719046.pth


  0%|          | 0.00/507M [00:00<?, ?B/s]

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

### Exercise 4: Training (part of) the Network (4 points)

Obviously, a classifier for CIFAR-10 will be different from a classifier for Imagenet.
With the initialisation above, the `CifarVGG` has a ready-to-go feature extractor,
but the classifier part still has to be trained.
To do this training efficiently, there are a few things left to do.
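Freezing part of a network can be illustrated on a toy model. In the sketch below, the tiny `features` and `classifier` modules are hypothetical stand-ins for the real network; only the classifier is handed to the optimiser and only its weights change.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
# hypothetical stand-ins for a feature extractor and a classifier
features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())
classifier = nn.Linear(4, 10)

features.requires_grad_(False)   # frozen: no gradients for these parameters
optimiser = optim.SGD(classifier.parameters(), lr=0.1)

frozen_before = features[0].weight.clone()
trained_before = classifier.weight.clone()

x = torch.randn(8, 3, 16, 16)
y = torch.randint(0, 10, (8,))
logits = classifier(features(x).mean(dim=(2, 3)))
loss = nn.functional.cross_entropy(logits, y)

optimiser.zero_grad()
loss.backward()
optimiser.step()

features_unchanged = torch.equal(features[0].weight, frozen_before)
classifier_changed = not torch.equal(classifier.weight, trained_before)
print(features_unchanged, classifier_changed)
```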

 > The code below should train the entire network on the entire dataset for a few epochs.
 > Modify the code so that 
 > 1. it only trains the classifier part of the network using SGD
 > and leaves the convolutional feature extractor untouched, i.e., *frozen*.
 > 2. the 32x32 CIFAR images are upscaled to 224x224 pixels.
 > 3. training is done on the GPU, which is generally faster. 
 > 4. it uses a parallel dataloader to make sure that the GPU does not have to wait for data.
 > 5. only a random subset of 500 images from the CIFAR data is used for training.
 > 6. a random subset of 500 images from the CIFAR data is used as validation data.
 > You will also need to include a round of validation in the training loop.
 > This should give you some confidence that the classifier is learning something useful.
 
**Hint:** you might find useful tools under
[`torch.utils.data`](https://pytorch.org/docs/stable/data.html).
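One possible way to draw fixed, disjoint random subsets is sketched below with `Subset` and `torch.randperm`. The random `TensorDataset` is a hypothetical stand-in for CIFAR-10 and the subset sizes are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# hypothetical toy dataset standing in for CIFAR-10
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))

generator = torch.Generator().manual_seed(1806)
permutation = torch.randperm(len(dataset), generator=generator).tolist()
train_set = Subset(dataset, permutation[:20])     # first 20 shuffled indices
valid_set = Subset(dataset, permutation[20:30])   # next 10, disjoint from training

train_batches = DataLoader(train_set, batch_size=8, shuffle=True)
valid_batches = DataLoader(valid_set, batch_size=8)
print(len(train_set), len(valid_set), len(train_batches))
```

Because the indices are drawn once, the subsets stay the same in every epoch. Passing `num_workers` greater than zero to `DataLoader` additionally loads the batches in parallel worker processes.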

In [None]:
def get_cifar10(root: str, batch_size: int = 32, resize: tuple[int, int] = (224, 224),
                num_train: int = 500, num_valid: int = 500, num_workers: int = 4):
    """
    Get dataloader(s) for CIFAR-10.
    
    Parameters
    ----------
    root : str
        Path to directory where CIFAR-10 dataset is stored.
    batch_size : int, optional
        The number of samples per mini-batch.
    resize : tuple of int, optional
        Desired width and height of loaded images.
    num_train : int, optional
        Number of (random) samples to use for training.
    num_valid : int, optional
        Number of (random) samples to use for validation.
    num_workers : int, optional
        Number of parallel processes to use for loading data.
        
    Returns
    -------
    train_batches : DataLoader
        A dataloader that loads mini-batches of CIFAR-10 for training.
    valid_batches : DataLoader
        A dataloader that loads (mini-)batches of CIFAR-10 for validation.
    """
    normalise = torchvision.transforms.Compose([
        torchvision.transforms.Resize(resize),  # 2. upscale the 32x32 images
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))
    ])
    cifar10 = torchvision.datasets.CIFAR10(root, transform=normalise, download=True, train=True)

    # 5. + 6. draw disjoint random subsets once, so that they stay fixed over all epochs
    permutation = torch.randperm(len(cifar10)).tolist()
    train_set = torch.utils.data.Subset(cifar10, permutation[:num_train])
    valid_set = torch.utils.data.Subset(cifar10, permutation[num_train:num_train + num_valid])

    # 4. num_workers > 0 loads the data in parallel worker processes
    train_batches = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    valid_batches = DataLoader(valid_set, batch_size=batch_size, num_workers=num_workers)

    return train_batches, valid_batches

In [None]:
# Test Cell: do not edit or delete!
train_batches, valid_batches = get_cifar10(data_root)

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
def get_vgg(num_classes: int = 10, device: str = "cpu"):
    """
    Create and initialise VGG network on given device.
    
    Parameters
    ----------
    num_classes : int, optional
        The number of output units for the network.
    device : str, optional
        A string representing the device to work on
    """
    vgg11 = torchvision.models.vgg11()
    net = CifarVGG(vgg11.features, num_classes=num_classes)
    vgg_init_(net)
    # YOUR CODE HERE

    net.to(device)  # move all parameters and buffers to the requested device

    return net

In [None]:
# Test Cell: do not edit or delete!
network = get_vgg(device='cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
class TransferTrainer:
    
    def __init__(self, model: nn.Module):
        """
        Create a trainer for transfer learning.
        
        Parameters
        ----------
        model : nn.Module
            The model to train.
        """
        self.model = model
        self.objective = nn.CrossEntropyLoss(reduction="sum")
        self.optimiser = None
        # YOUR CODE HERE

        # 1. freeze the pre-trained convolutional layers: no gradients are computed for them
        for m in self.model.modules():
            if isinstance(m, nn.Conv2d):
                for par in m.parameters():
                    par.requires_grad_(False)

        # only the parameters that are still trainable are handed to SGD
        trainable = [par for par in self.model.parameters() if par.requires_grad]
        self.optimiser = optim.SGD(trainable, lr=1e-3, momentum=0.9)

    
    def train(self, train_batches: DataLoader, valid_batches: DataLoader, 
              num_epochs: int = 1):
        """
        Train (part of) the network for a number of epochs.
        
        Parameters
        ----------
        train_batches : DataLoader
            The training data for updating the network.
        valid_batches : DataLoader
            The validation data for evaluating the network.
        num_epochs : int, optional
            The number of iterations over the training data.
        """
        train_errs, valid_errs = [], []
        for _ in trange(num_epochs):
            local_errs = update(self.model, train_batches, self.objective, self.optimiser)
            train_errs.append(sum(local_errs) / len(local_errs) / train_batches.batch_size)
            # YOUR CODE HERE

            # 6. also make use of validation set every loop:
            val_errs = evaluate(self.model, valid_batches, self.objective)
            valid_errs.append(sum(val_errs) / len(val_errs) / valid_batches.batch_size)

            # 3. training runs on the device of the model; update() and evaluate() move each batch there
        
        return train_errs, valid_errs
        

In [None]:
trainer = TransferTrainer(network)
train_errs, valid_errs = trainer.train(train_batches, valid_batches, 20)


In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:

# plot learning curves
from matplotlib import pyplot as plt
plt.plot(range(1, len(train_errs) + 1), train_errs, label="train")
plt.plot(range(1, len(valid_errs) + 1), valid_errs, label="valid")
plt.legend()
print(f"ran on {next(network.parameters()).device}")

## Skip-connections

One of the most popular modern network architectures for vision is the residual network.
The main feature of this architecture is the so-called skip-connection,
which makes it possible to combine the activations with the original inputs in each layer.
Since these skip-connections open up a gradient highway,
they make it possible to train much deeper networks than is possible without the skip-connections.

Mathematically, the simplest form of a skip connection can be written as
$$\boldsymbol{s} = \boldsymbol{x} + f(\boldsymbol{x}).$$
In order for this to work, the dimensions of $\boldsymbol{x}$ and $f(\boldsymbol{x})$ must line up.
This means that in this formulation, only square layers,
i.e. layers with the same number of inputs and outputs, are possible.

In order to use a skip-connection on layers that reduce the dimensionality,
a linear transform on $\boldsymbol{x}$ can be inserted in the equation.
Since also other operations are possible, on both inputs and (pre-)activations, 
we can generalise the skip-connection formula to
$$\boldsymbol{s} = \boldsymbol{C} \cdot \boldsymbol{x} + \boldsymbol{T} \cdot f(\boldsymbol{x}),$$
where $\boldsymbol{C}$ and $\boldsymbol{T}$ are linear transformations (a.k.a. matrices).
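Both formulas can be written down directly for fully connected layers. The sketch below uses arbitrary layer sizes and takes $\boldsymbol{T}$ to be the identity, which is an assumption made for brevity.

```python
import torch
from torch import nn

x = torch.randn(5, 16)

# square layer: s = x + f(x)
f_square = nn.Sequential(nn.Linear(16, 16), nn.Tanh())
s_square = x + f_square(x)

# layer that reduces the dimensionality: s = C x + f(x), with T the identity
f_reduce = nn.Sequential(nn.Linear(16, 8), nn.Tanh())
C = nn.Linear(16, 8, bias=False)   # linear transform on the inputs
s_reduce = C(x) + f_reduce(x)

print(s_square.shape, s_reduce.shape)
```

Without `C`, the sum in the second case would fail because a vector with 16 entries cannot be added to one with 8 entries.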

### Exercise 5: Pre-Residual Networks (4 points)

The original and most commonly used residual networks actually do not implement skip connections as in the formula above.
Upon closer inspection (e.g. `torchvision.models.resnet`), it becomes clear that the most famous skip-connection looks more like

$$\boldsymbol{a} = \phi(\boldsymbol{x} + f(\boldsymbol{x})),$$

where $\phi$ is some non-linear activation function.
This non-linearity typically interferes with the signal propagation of the network.
As a result, gradients might still vanish despite the skip-connection.

Pre-Residual Networks aim to counter this problem by moving skip-connections to the level of pre-activations, such that

$$\boldsymbol{a} = \boldsymbol{x} + f(\phi(\boldsymbol{x})).$$

This way, clean signal propagation can be guaranteed and learning should become easier.
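The difference between the two variants shows up when the residual branch contributes nothing. The sketch below sets the weights of a toy $f$ to zero (an assumption made purely to isolate the skip path) and compares both formulas on a hand-picked input.

```python
import torch
from torch import nn

phi = nn.ReLU()
f = nn.Linear(4, 4)
nn.init.zeros_(f.weight)   # assumption: f outputs zero, to isolate the skip path
nn.init.zeros_(f.bias)

x = torch.tensor([[-2.0, -1.0, 1.0, 2.0]])
post_activation = phi(x + f(x))   # a = phi(x + f(x)): negative inputs are clipped
pre_activation = x + f(phi(x))    # a = x + f(phi(x)): inputs pass through unchanged

print(post_activation)
print(pre_activation)
```

In the pre-activation form the input reaches the output untouched, whereas the non-linearity in the other form removes the negative part of the signal.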

 > Implement the `PreResBlock` class so that it can be used as a layer in a pre-residual network.
 > The residual part of the network, $f$, should be a small network with two convolutional layers.
 > Both layers should use the given `kernel_size` and preserve the image size if `stride` is one.
 > If `stride` is greater than one, the network should reduce the spatial dimensions by this factor.
 > Make sure that the network also works if `in_channels != out_channels` and `stride > 1`.
 > Try to avoid unnecessary parameters, especially for the skip-connection.

In [None]:
class PreResBlock(nn.Module):
    """ Residual block using skip-connections on pre-activation level. """
    
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3,
                 stride: int = 1, phi: nn.Module = nn.ReLU(), extra_pars: bool = True):
        """
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        kernel_size : int
            Size of the kernel in all dimensions.
        stride : int
            Factor by which to reduce the spatial dimensions.
        phi : nn.Module
            The activation function to use in the residual branch.
        """
        super().__init__()
        # YOUR CODE HERE
        # 'same' padding: preserves the image size for odd kernel sizes at stride one
        padding = kernel_size // 2

        # residual branch f(phi(x)): two pre-activated convolutions,
        # the first of which reduces the spatial dimensions by the stride
        # (phi must not operate in place, otherwise it would alter the skip path)
        self.residual = nn.Sequential(
            phi,
            nn.Conv2d(in_channels, out_channels, kernel_size, stride=stride, padding=padding),
            phi,
            nn.Conv2d(out_channels, out_channels, kernel_size, stride=1, padding=padding),
        )

        # skip-connection: only use parameters if the shapes would not line up otherwise
        if in_channels != out_channels:
            # 1x1 convolution adapts the number of channels (and subsamples if stride > 1)
            self.skip = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False)
        elif stride > 1:
            # parameter-free subsampling: keeps every stride-th pixel
            self.skip = nn.AvgPool2d(kernel_size=1, stride=stride)
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        # a = x + f(phi(x)), where x is only transformed if the shapes require it
        return self.skip(x) + self.residual(x)

In [None]:
# Test Cell: do not edit or delete!
block = PreResBlock(8, 8)
out = block(torch.zeros(1, 8, 32, 32))
sum(par.numel() for par in block.parameters()), out.shape

In [None]:
# Test Cell: do not edit or delete!
strided_block = PreResBlock(8, 8, stride=2)
out = strided_block(torch.zeros(1, 8, 32, 32))
sum(par.numel() for par in strided_block.parameters()), out.shape

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!