<!-- Assignment 3 - SS 2024 -->

# Monitoring, Hyperparameters and efficient CNNs  (15 points)

This notebook contains one of the assignments for the exercises in Deep Learning and Neural Nets 2.
It provides a skeleton, i.e. code with gaps, that will be filled out by you in different exercises.
All exercise descriptions are visually annotated by a vertical bar on the left and some extra indentation,
unless you already messed with your jupyter notebook configuration.
Any questions that are not part of the exercise statement do not need to be answered,
but should rather be interpreted as triggers to guide your thought process.

**Note**: The cells in the introductory part (before the first subtitle)
perform all necessary imports and provide utility functions that should work without (too many) problems.
Please, do not alter this code or add extra import statements in your submission, unless explicitly allowed!

<span style="color:#d95c4c">**IMPORTANT:**</span> Please, change the name of your submission file so that it contains your student ID!

In this assignment, the main goal is to get familiar with neural network hyperparameter search.
More specifically, you will perform hyperparameter search on some real-world data.
To prepare you for the search, we will first look at how you can monitor the training progress.

In [46]:
import random
from pathlib import Path

import torch
import torchvision
from torch import nn, optim
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

torch.manual_seed(1806)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

%load_ext tensorboard

cpu
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [47]:
# google colab data management
import os.path

try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    _home = 'gdrive/MyDrive/'
except ImportError:
    _home = '~'
finally:
    data_root = os.path.join(_home, '.pytorch')

print(data_root)

~\.pytorch


## Tracking Progress

Training a deep neural network with millions of parameters can cost quite some time.
E.g. AlexNet already requires roughly [225 hours][alexnet] (>1 week) of compute on a single GPU.
In order to make sure that the network is training as expected,
it is crucial to get some insights into how training progresses.
After all, you do not want to waste hundreds of hours of compute to find out 
that training had already diverged in the first few minutes.
Therefore it is important to be able to monitor the training process.

[alexnet]: https://arxiv.org/abs/1404.5997

As a matter of fact, the `update` and `evaluate` functions 
already implement some sort of ad hoc monitoring by providing the list of errors in a batch.
This list can be used to print the mean loss after every epoch
and can therefore be used to get an idea of how learning is progressing.
This specific implementation of monitoring the loss is not very flexible, however,
since it is not possible to access the information before the epoch has finished.

Before we start, we will tackle the flexibility of monitoring the loss
by creating a separate `Tracker` class to keep track
of important steps and results during training.

In [48]:
class Tracker:
    """ Tracks useful information as learning progresses. """

    def __init__(self, *loggers: "Logger"):
        """
        Parameters
        ----------
        logger0, logger1, ... loggerN : Logger
            One or more loggers for logging training information.
        """
        self.epoch = 0
        self.update = 0
        self._tag = None
        self._losses = []
        self._summary = {}

        self.loggers = list(loggers)

    def start_epoch(self, count: bool = True):
        """ Start one iteration of updates over the training data. """
        if count:
            self.epoch += 1
        
        self._summary.clear()
        for logger in self.loggers:
            logger.on_epoch_start(self.epoch)

    def end_epoch(self):
        """ Wrap up one iteration of updates over the training data. """
        for logger in self.loggers:
            logger.on_epoch_end(self.epoch, **self._summary)

        return dict(self._summary)

    def start(self, tag: str, num_batches: int = None):
        """ Start a loop over mini-batches. """
        self._tag = tag
        self._losses.clear()
        for logger in self.loggers:
            logger.on_iter_start(self.epoch, self.update, self._tag, num_steps_expected=num_batches)
    
    def step(self, loss: float):
        """ Register the loss of a single mini-batch. """
        self._losses.append(loss)
        for logger in self.loggers:
            logger.on_iter_update(self.epoch, self.update, self._tag, loss=loss)  

    def summary(self):
        """ Wrap up and summarise a loop over mini-batches. """
        losses = self._losses
        avg_loss = float("nan") if len(losses) == 0 else sum(losses) / len(losses)
        self._summary[self._tag] = avg_loss
        for logger in self.loggers:
            logger.on_iter_end(self.epoch, self.update, self._tag, avg_loss=avg_loss)

        return avg_loss

    def count_update(self):
        """ Increase the update counter. """
        self.update += 1
        for logger in self.loggers:
            logger.on_update(self.epoch, self.update)

This class provides the same functionality as the list that
you might have used in the current `update` and `evaluate` functions.
However, it also makes it possible to extend the functionality
of both functions without the need to interfere with existing code.

Note that there are libraries and frameworks out there that provide
(parts of) the functionality we will implement in what follows.
Two example frameworks that directly build on pytorch are
[pytorch-lightning](https://www.pytorchlightning.ai/)
and [pytorch ignite](https://pytorch.org/ignite/).

### Exercise 1: Combining Classes for Tracking (3 points)

You might not have noticed yet, but in assignment 2, a `Trainer` class was introduced.
The goal of this exercise is to extend this `Trainer` class to make use of the `Tracker`.

 > Update the `Trainer` class to make use of the `tracker` attribute (see `__init__`).
 > The functionality and outputs of the current implementation should be preserved.
 > Also, make sure to offload as much as possible to the `tracker`.
 > You will want to use every method of the `Tracker` class.

In [49]:
class Trainer:
    """ Class to organise learning and monitoring. """

    def __init__(self,
         model: nn.Module,
         criterion: nn.Module,
         optimiser: optim.Optimizer,
         tracker: Tracker = None,
    ):
        """
        Parameters
        ----------
        model : torch.nn.Module
            Neural Network that will be trained.
        criterion : torch.nn.Module
            Loss function to use for training.
        optimiser : torch.optim.Optimizer
            Optimisation strategy for training.
        tracker : Tracker, optional
            Tracker to keep track of training progress.
        """
        if tracker is None:
            tracker = Tracker()

        self.model = model
        self.criterion = criterion
        self.optimiser = optimiser

        self.tracker = tracker

    def state_dict(self):
        """ Current state of learning. """
        return {
            "model": self.model.state_dict(),
            "objective": self.criterion.state_dict(),
            "optimiser": self.optimiser.state_dict(),
            "num_epochs": self.tracker.epoch,
            "num_updates": self.tracker.update,
        }

    @property
    def device(self):
        """ Device of the (first) model parameters. """
        return next(self.model.parameters()).device

    @torch.no_grad()
    def evaluate(self, batches: DataLoader, tag: str = 'default_eval_tag'):
        """
        One epoch of evaluating the network.

        Parameters
        ----------
        batches : DataLoader
            An iterator over mini-batches of data to use for updating.
        tag : str, optional
            Identification tag for tracking loss values.

        Returns
        -------
        avg_loss : float
            The average loss over all mini-batches.
        """
        self.model.eval()
        device = self.device
        
        # YOUR CODE HERE
        # raise NotImplementedError()
        # The tracker does all bookkeeping and notifies the loggers itself,
        # so the loggers must not be called a second time from here.
        self.tracker.start(tag, num_batches=len(batches))

        for x, y in batches:
            x, y = x.to(device), y.to(device)
            logits = self.model(x)
            loss = self.criterion(logits, y)
            self.tracker.step(loss.item())

        # summary() averages the tracked losses and triggers `on_iter_end`.
        return self.tracker.summary()

    @torch.enable_grad()
    def update(self, batches: DataLoader, tag: str = 'default_update_tag'):
        """
        One epoch of updating the network.

        Parameters
        ----------
        batches : DataLoader
            An iterator over mini-batches of data to use for updating.
        tag : str, optional
            Identification tag for tracking loss values.

        Returns
        -------
        avg_loss : float
            The average loss over all mini-batches.
        """
        self.model.train()
        device = self.device
        
        # YOUR CODE HERE
        # raise NotImplementedError()
        # The tracker does all bookkeeping and notifies the loggers itself,
        # so the loggers must not be called a second time from here.
        self.tracker.start(tag, num_batches=len(batches))

        for x, y in batches:
            x, y = x.to(device), y.to(device)
            logits = self.model(x)
            loss = self.criterion(logits, y)

            self.optimiser.zero_grad()
            loss.backward()
            self.optimiser.step()

            # Count the parameter update, then register the batch loss.
            self.tracker.count_update()
            self.tracker.step(loss.item())

        # summary() averages the tracked losses and triggers `on_iter_end`.
        return self.tracker.summary()

    def train(self, train_batches, valid_batches=None, num_epochs: int = 1):
        """
        Train the network for multiple epochs.

        Parameters
        ----------
        train_batches : DataLoader
            The training data for updating the network.
        valid_batches : DataLoader, optional
            The validation data for estimating the generalisation performance.
        num_epochs : int, optional
            The number of epochs to train.

        Returns
        -------
        results : dict
            The average loss estimates after `num_epochs` epochs.
            
        """
        if valid_batches is None:
            valid_batches = ()

        # YOUR CODE HERE
        # raise NotImplementedError()
        history = {"train": [], "valid": []}
        for _ in range(num_epochs):
            self.tracker.start_epoch()

            # `update` and `evaluate` start and summarise their own loops
            # through the tracker, so no extra start/summary calls are needed.
            history["train"].append(self.update(train_batches, tag="train"))
            if valid_batches:
                history["valid"].append(self.evaluate(valid_batches, tag="valid"))

            self.tracker.end_epoch()

        def mean(values):
            # An empty list (e.g. no validation data) yields NaN instead of an error.
            return sum(values) / len(values) if values else float("nan")

        # Per-epoch average losses, averaged over all epochs of this call.
        return {"train": mean(history["train"]), "valid": mean(history["valid"])}

In [50]:
# sanity check (and test setup)
from torchvision import transforms
mean, std = .1307, .3081
normalise = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((mean, ), (std, ))
])

dataset = torchvision.datasets.FashionMNIST(data_root, train=False, transform=normalise, download=True)
loader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=2)
    
conv_net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.MaxPool2d(3), nn.ELU(),
    nn.Conv2d(8, 16, 7), nn.ELU(),
    nn.Flatten(),
    nn.Linear(64, 10),
)

trainer = Trainer(
    model=conv_net.to(device),
    criterion=nn.CrossEntropyLoss(reduction="sum"),
    optimiser=optim.Adam(conv_net.parameters(), lr=1e-2),
)

In [51]:
# Test Cell: do not edit or delete!

In [52]:
# Test Cell: do not edit or delete!
results = trainer.train(loader, loader)
assert trainer.tracker.epoch == 1, (
    f"ex1: expected tracker to have counted 1 epoch, but found {trainer.tracker.epoch} (-0.5 points)"
)

In [53]:
# Test Cell: do not edit or delete!
assert "train" in results, "ex1: could not find training loss in results"
assert "valid" in results, "ex1: could not find validation loss in results"

In [54]:
# Test Cell: do not edit or delete!
results = trainer.evaluate(loader, tag="extra")

## Logging Tracked Information

In its simplest form, a `Tracker` only keeps track of what happens in an epoch.
It knows about the loss values for each mini-batch,
but also how many epochs and updates already happened.
However, as mentioned earlier, a lot of features can be added to the `Tracker`.

Most notably, we can use the `Tracker` to store certain information during training.
Thus far, loss information has been collected to compute the average and is then discarded.
In order to revisit this information later, it can be written to a file, or _logged_.

For this purpose, we will use the interface provided by the `Logger` class (below).
This way, different types of information can be logged in a flexible way.
Luckily the `Tracker` class already provides everything that is necessary
to work with loggers to monitor whatever we need during learning.

In [55]:
class Logger:
    """ Extracts and/or persists tracker information. """

    def __init__(self, path: str = None):
        """
        Parameters
        ----------
        path : str or Path, optional
            Path to where data will be stored.
        """
        path = Path("run") if path is None else Path(path)
        self.path = path.expanduser().resolve()

    def on_epoch_start(self, epoch: int, **kwargs):
        """Actions to take on the start of an epoch."""
        pass

    def on_epoch_end(self, epoch: int, **kwargs):
        """Actions to take on the end of an epoch."""
        pass

    def on_iter_start(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take on the start of an iteration."""
        pass

    def on_iter_update(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take when an update has occurred."""
        pass

    def on_iter_end(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take on the end of an iteration."""
        pass
    
    def on_update(self, epoch: int, update: int):
        """Actions to take when the model is updated."""
        pass
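
To make the hook order concrete, here is a minimal sketch of a logger that only keeps losses in memory. The class name `LossHistory` is hypothetical and not part of the assignment. It is written as a standalone class so that the snippet runs by itself; inside the notebook it would subclass `Logger`, which also supplies the remaining no-op hooks that a `Tracker` calls.

```python
class LossHistory:
    """Hypothetical logger that keeps every loss in memory, grouped by tag."""

    def __init__(self):
        self.losses = {}    # tag -> list of per-batch losses
        self.averages = {}  # tag -> list of per-stage average losses

    def on_iter_start(self, epoch, update, tag, **kwargs):
        # Make sure there is a list for this tag before the loop starts.
        self.losses.setdefault(tag, [])

    def on_iter_update(self, epoch, update, tag, loss=None, **kwargs):
        # The batch loss arrives as a keyword argument.
        self.losses.setdefault(tag, []).append(loss)

    def on_iter_end(self, epoch, update, tag, avg_loss=None, **kwargs):
        # The average loss of the finished loop arrives as a keyword argument.
        self.averages.setdefault(tag, []).append(avg_loss)


# Drive the hooks by hand, in the order a tracker would call them.
history = LossHistory()
batch_losses = [0.9, 0.7, 0.5]  # made-up values for illustration
history.on_iter_start(1, 0, "train", num_steps_expected=len(batch_losses))
for loss in batch_losses:
    history.on_iter_update(1, 0, "train", loss=loss)
history.on_iter_end(1, 0, "train", avg_loss=sum(batch_losses) / len(batch_losses))

print(history.losses)
print(history.averages)
```

Accepting `**kwargs` in every hook keeps such a logger compatible if additional information is passed later on.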

### Exercise 2: Progress bar (1 point)

Monitoring the loss early on during training can be useful
to check whether things are working as expected.
In combination with an indication of progress in training,
expectations can be properly managed early on.

 > Create a logger that produces some sort of progress bar for each epoch.
 > The progress bar should show the current epoch, the current training stage (tag) and the current loss value.
 > Moreover, it should print a short summary after each epoch, including the average loss for each tag.
 > Note that most of this information is passed through the `kwargs` in the `Logger` methods.

**Hint:** You probably want to make use of the [`tqdm` library](https://tqdm.github.io/docs/tqdm/) to manage the progress bar.

In [56]:
class ProgressBar(Logger):
    """Log progress of epoch using a progress bar."""

    def __init__(self):
        super().__init__()
        # YOUR CODE HERE
        # raise NotImplementedError()
        self.progress_bar = None
    
    # TODO: implement any logger method you like/need

    def on_epoch_start(self, epoch: int, **kwargs):
        print(f"Starting epoch {epoch + 1}")

    def on_iter_start(self, epoch: int, update: int, tag: str, num_steps_expected: int = None, **kwargs):
        if self.progress_bar is not None:
            self.progress_bar.close()
        self.progress_bar = tqdm(total=num_steps_expected, desc=f"{tag} (Epoch {epoch + 1})", leave=False)

    def on_iter_update(self, epoch: int, update: int, tag: str, loss: float, **kwargs):
        self.progress_bar.set_description(f"{tag} (Epoch {epoch + 1}): Loss {loss:.4f}")
        self.progress_bar.update(1)

    def on_iter_end(self, epoch: int, update: int, tag: str, avg_loss: float, **kwargs):
        if self.progress_bar is not None:
            self.progress_bar.close()

    def on_epoch_end(self, epoch: int, **kwargs):
        print(f"Completed epoch {epoch + 1}. Summary:")
        # kwargs maps each tag to its average loss for this epoch
        if not kwargs:
            print("No tags to summarise. Check tag configurations.")
        else:
            for tag, avg_loss in kwargs.items():
                print(f"{tag}: Average Loss {avg_loss:.4f}")

        if self.progress_bar is not None:
            self.progress_bar.close()
            self.progress_bar = None

In [57]:
# sanity check (and test setup)
progress = ProgressBar()
trainer.tracker.loggers = [progress]
trainer.train(loader, loader, num_epochs=5)

Starting epoch 3
Completed epoch 3. Summary:
train: Average Loss 694.7586
valid: Average Loss 614.4181
Starting epoch 4
Completed epoch 4. Summary:
train: Average Loss 569.4003
valid: Average Loss 525.7129
Starting epoch 5
Completed epoch 5. Summary:
train: Average Loss 504.6120
valid: Average Loss 470.0963
Starting epoch 6
Completed epoch 6. Summary:
train: Average Loss 462.5868
valid: Average Loss 445.2923
Starting epoch 7
Completed epoch 7. Summary:
train: Average Loss 427.5823
valid: Average Loss 410.6964

{'train': 531.7879791259766, 'valid': 493.2431848144532}

In [58]:
# Test Cell: do not edit or delete!

In [59]:
# Test Cell: do not edit or delete!

### Exercise 3: Tensorboard (2 points)

[Tensorboard](https://www.tensorflow.org/tensorboard) 
is a library that allows you to track and visualise data during and after training.
Apart from scalar metrics, tensorboard can process distributions, images and much more.
It started as a part of tensorflow, but was then made available as a standalone library.
This makes it possible to use tensorboard for visualising pytorch data.
As a matter of fact, tensorboard is readily available in pytorch.
From [`torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html),
the `SummaryWriter` class can be used to track various types of data.

 > Create a Logger that makes use of the `SummaryWriter` to monitor the loss with tensorboard.
 > On one hand, it should monitor the loss for every batch and both modes using `<tag>/loss` as tag.
 > On the other hand, it should monitor the average losses after every stage, using `'<tag>/avg_loss'`.

In [60]:
class TensorBoard(Logger):
    """Log loss values to tensorboard."""

    def __init__(self, path: Path = None, every: int = 1):
        super().__init__(path)
        self.every = every
        # YOUR CODE HERE
        # raise NotImplementedError()
        # `Logger.__init__` already expanded and resolved the path,
        # so the writer logs to the same directory that `self.path` points to.
        self.writer = SummaryWriter(log_dir=str(self.path))
        self.global_step = 0
        
    def on_epoch_start(self, epoch: int, **kwargs):
        pass

    def on_epoch_end(self, epoch: int, **kwargs):
        pass

    def on_iter_start(self, epoch: int, update: int, tag: str, **kwargs):
        pass

    def on_iter_update(self, epoch: int, update: int, tag: str, loss: float, **kwargs):
        """
        Log the loss for every batch update.
        """
        if update % self.every == 0:
            self.global_step += 1
            self.writer.add_scalar(f"{tag}/loss", loss, global_step=epoch * 10000 + update)
            self.writer.flush()

    def on_iter_end(self, epoch: int, update: int, tag: str, avg_loss: float, **kwargs):
        """
        Log the average loss at the end of each stage (epoch).
        """
        self.writer.add_scalar(f"{tag}/avg_loss", avg_loss, global_step=epoch)
        self.writer.flush()

    def __del__(self):
        """
        Ensure that the writer is properly closed when the object is deleted.
        """
        self.writer.close()
        print('TensorBoard writer closed.')

In [61]:
%tensorboard --logdir run

Reusing TensorBoard on port 6006 (pid 1604), started 2 days, 1:08:26 ago. (Use '!kill 1604' to kill it.)

In [62]:
# sanity check (and test setup)
tb = TensorBoard()
trainer.tracker.loggers = [tb]
trainer.train(loader, loader, num_epochs=5)

TensorBoard writer closed.


{'train': 363.4068991088867, 'valid': 348.5600860595703}

In [63]:
# Test Cell: do not edit or delete!
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
path = next(tb.path.glob("events.out.tfevents.*"))
tb_data = EventAccumulator(str(path)).Reload()
tags = tb_data.Tags()["scalars"]

assert "train/loss" in tags, "ex3: could not find training loss (-1 points)"
assert "valid/loss" in tags, "ex3: could not find validation loss (-1 points)"

In [64]:
# Test Cell: do not edit or delete!
assert "train/avg_loss" in tags, "ex3: could not find avg training loss (-0.5 points)"
assert "valid/avg_loss" in tags, "ex3: could not find avg validation loss (-0.5 points)"

### Exercise 4: Always have a Backup-plan (1 point)

Apart from logging metrics such as loss and accuracy,
it can often be useful to create a backup (or checkpoint) of training progress.
After all, you do not want hours of training to get lost
due to a programming error in a print statement at the end of your code.
This idea can also be useful to implement some form of early-stopping.
However, we will ignore that for now.

 > Implement a logger that saves the state of the trainer every few epochs.
 > For the sake of convention, use the `.pth` extension for storing these backups.

**Hint:** you may want to raise a [`warning`](https://docs.python.org/3/library/warnings.html#available-functions) if no trainer has been attached.

In [65]:
class Backup(Logger):
    
    DEFAULT_FILE = "backup.pth"
    
    def __init__(self, path: Path = None, every: int = 1):
        super().__init__(path)
        self.trainer = None
        self.every = every
        
        if self.path.is_dir() or not self.path.suffix:
            self.path = self.path / self.DEFAULT_FILE
        
        self.path.parent.mkdir(exist_ok=True, parents=True)
    
    def attach_trainer(self, trainer: Trainer):
        self.trainer = trainer
    
    # YOUR CODE HERE
    # raise NotImplementedError()
    def on_epoch_end(self, epoch: int, **kwargs):
        import warnings

        if self.trainer is None:
            warnings.warn("Backup logger requires a trainer to be attached.")
            return
        
        if (epoch + 1) % self.every == 0:
            torch.save(self.trainer.state_dict(), self.path)
            print(f"Backup created at epoch {epoch + 1}: {self.path}")

In [66]:
# sanity check (and test setup)
checkpoints = Backup(every=2)
trainer.tracker.loggers = [checkpoints]
checkpoints.attach_trainer(trainer)
trainer.train(loader, loader, num_epochs=4)
trainer.tracker.epoch

Backup created at epoch 14: C:\Users\Q540900\Desktop\A.I. Master\Second Semester\Deep Learning II\Assignment 3\run\backup.pth
Backup created at epoch 16: C:\Users\Q540900\Desktop\A.I. Master\Second Semester\Deep Learning II\Assignment 3\run\backup.pth


15

In [67]:
# Test Cell: do not edit or delete!
print(torch.load(checkpoints.path)["num_epochs"])

15


In [68]:
# clean up checkpoints and tensorboard logs
! rm -r run

The command "rm" is either misspelled or
could not be found.


## Hyperparameter Search

Finding good hyperparameters for a model is a general problem in machine learning (or even statistics).
However, neural networks are (in)famous for their large number of hyperparameters.
To list a few: learning rate, batch size, epochs, pre-processing, layer count, neurons for each layer, 
activation function, initialisation, normalisation, layer type, skip connections, regularisation, ...
Moreover, it is often not possible to theoretically justify a particular choice for a hyperparameter.
E.g. there is no way to tell whether $N$ or $N + 1$ neurons in a layer would be better, without trying it out.
Therefore, hyperparameter search for neural networks is an especially tricky problem to solve.

###### Manual Search

The most straightforward approach to finding good hyperparameters is to just 
try out *reasonable* combinations of hyperparameters and pick the best model (using e.g. the validation set).
The first problem with this approach is that it requires a gut feeling as to what *reasonable* combinations are.
Moreover, it is often unclear how different hyperparameters interact with each other,
which can make an irrelevant hyperparameter look more important than it actually is or vice versa.
Finally, manual hyperparameter search is time consuming, since it is generally not possible to automate.

###### Grid Search

Getting a feeling for combinations of hyperparameters is often much harder than for individual hyperparameters.
The idea of grid search is to get a set of *reasonable* values for each hyperparameter individually
and organise these sets in a grid that represents all possible combinations of these values.
Each combination of hyperparameters in the grid can then be run simultaneously,
assuming that so much hardware is available, which can speed up the search significantly.
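
As a minimal sketch of how such a grid can be enumerated, the snippet below builds all combinations with `itertools.product`. The hyperparameter names and candidate values are assumptions chosen purely for illustration.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
search_space = {
    "lr": [1e-3, 1e-2, 1e-1],
    "batch_size": [64, 256],
    "activation": ["relu", "elu"],
}


def grid(space):
    """Yield one configuration dict per combination of the listed values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(grid(search_space))
print(len(configs))  # 3 * 2 * 2 = 12 combinations
print(configs[0])
```

Each configuration is independent of the others, which is why the runs can be distributed over as many devices as are available. Note that the grid grows multiplicatively with every added hyperparameter.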

###### Random Search

Since there are plenty of hyperparameters and each hyperparameter can have multiple *reasonable* values,
it is often not feasible to try out every possible combination in the grid.
On top of that, most of the models will be thrown away anyway because only the best model is of interest,
even though they might achieve similar performance.
The idea of random search is to randomly sample configurations, rather than choosing from pre-defined choices.
This can be interpreted as setting up an infinite grid and trying only a few --- rather than all --- possibilities.
Under the assumption that there are a lot of configurations with similarly good performance as the best model,
this should provide a model that performs very well with high probability for a fraction of the compute.
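
A rough sketch of the sampling step could look as follows. The ranges, the hyperparameter names and the function `sample_config` are illustrative assumptions, not part of the assignment. The learning rate is drawn log-uniformly, a common choice when a value can span several orders of magnitude.

```python
import random


def sample_config(rng):
    """Draw one configuration; all ranges are illustrative assumptions."""
    return {
        # log-uniform: every order of magnitude is equally likely
        "lr": 10 ** rng.uniform(-4, -1),
        "batch_size": rng.choice([32, 64, 128, 256]),
        "num_layers": rng.randint(2, 6),
    }


rng = random.Random(1806)  # seeded for reproducible trials
trials = [sample_config(rng) for _ in range(8)]
for config in trials:
    print(config)
```

The number of trials is fixed by the compute budget rather than by the size of the search space, which is the main practical difference to grid search.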

###### Bayesian Optimisation 

Rather than picking configurations completely at random, 
it is also possible to guide the random search.
This is essentially the premise of Bayesian optimisation:
sample inputs and evaluate the objective to find which parameters are likely to give good performance.

Bayesian optimisation uses a function approximator for the objective 
and what is known as an *acquisition* function.
The function approximator, or *surrogate*, 
has to be able to model a distribution over function values, e.g. a Gaussian Process.
The acquisition function then uses these distributions
to find where the largest improvements can be made, e.g. using the CDF.
For a more elaborate explanation of Bayesian optimisation, 
see e.g. [this tutorial](https://arxiv.org/abs/1807.02811).

This approach is less parallelisable than grid or random search,
since it uses the information from previous runs to find good sampling regions.
However, often there are more configurations to be tried out than there are computing devices
and it is still possible to sample multiple configurations at each step with Bayesian Optimisation.
Also consider [this paper](https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms) in this regard.
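
To illustrate how an acquisition function can use the CDF, the sketch below computes the *probability of improvement*, one of several possible acquisition functions. It assumes that the surrogate returns a Gaussian prediction (mean and standard deviation) of the validation loss for each candidate. The candidate names and numbers are made up, and no actual surrogate model is fitted here.

```python
import math


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, best, xi=0.0):
    """P(loss < best - xi) under a Gaussian prediction N(mean, std^2)."""
    if std <= 0.0:
        # Without uncertainty, the prediction either improves or it does not.
        return float(mean < best - xi)
    return normal_cdf((best - xi - mean) / std)


# Hypothetical surrogate predictions (mean, std) of the validation loss.
candidates = {
    "lr=1e-3": (0.40, 0.02),
    "lr=1e-2": (0.36, 0.05),
    "lr=1e-1": (0.50, 0.30),
}
best_so_far = 0.35  # lowest validation loss observed so far (made up)

scores = {
    name: probability_of_improvement(mean, std, best_so_far)
    for name, (mean, std) in candidates.items()
}
next_trial = max(scores, key=scores.get)
print(scores)
print(next_trial)
```

A candidate with a poor predicted mean but large uncertainty can still score reasonably well, which is how the acquisition function trades off exploration against exploitation.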

###### Neural Architecture Search

Instead of using Bayesian optimisation, 
the problem of hyperparameter search can also be tackled by other optimisation algorithms.
This approach is also known as *Neural Architecture Search* (NAS).
There are different optimisation strategies that can be used for NAS,
but the most common are evolutionary algorithms and (deep) reinforcement learning.
Consider reading [this survey](http://jmlr.org/papers/v20/18-598.html) 
to get an overview of how NAS can be used to construct neural networks.

## Efficient CNNs

In recent times CNNs have become more computationally efficient. Traditional convolutional layers apply filters across the entire depth of the input volume, mixing all the input channels to produce a single output channel. Depthwise separable convolutions, introduced as a key innovation in architectures like Xception, are a more efficient variant of the standard convolution operation. This process is divided into two layers: the depthwise convolution and the pointwise convolution. In the depthwise convolution, a single filter is applied per input channel, which significantly reduces the computational cost. Following this, a 1x1 convolution (pointwise convolution) is applied to combine the outputs of the depthwise layer, creating a new set of feature maps. This approach drastically reduces the number of parameters and computations, making the network more efficient and faster, which is especially beneficial for mobile and embedded devices.

<img src="https://www.researchgate.net/publication/358585116/figure/fig1/AS:1127546112487425@1645839350616/Depthwise-separable-convolutions.png" />
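
The savings can be checked with a quick parameter count. The sketch below uses plain arithmetic rather than actual layers; the channel sizes and kernel size are arbitrary example values, and both helper functions are hypothetical.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def separable_conv_params(c_in, c_out, k, bias=True):
    """Parameters of a depthwise k x k plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in + (c_in if bias else 0)    # one filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)   # 1x1 conv mixes the channels
    return depthwise + pointwise


standard = conv_params(32, 64, 3)              # 3*3*32*64 + 64 = 18496
separable = separable_conv_params(32, 64, 3)   # (288 + 32) + (2048 + 64) = 2432
print(standard, separable, round(standard / separable, 1))
```

For this example the separable variant needs roughly 7.6 times fewer parameters, and the gap widens as the kernel size or the number of output channels grows.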

Squeeze-and-Excitation layers introduce an additional level of adaptivity in CNNs, enabling the network to perform dynamic channel-wise feature recalibration. Squeeze-and-Excitation blocks are usually placed after a convolutional layer or block
and before the residual connection. They consist of a series of relatively inexpensive computations:

1. A three-dimensional input, consisting of the channel dimension and the two spatial
dimensions, is compressed into one dimension by global average pooling. As a result,
the spatial information is squeezed into one descriptor per channel.
2. The squeezed data is transformed by a two-layer feed-forward neural network. After
the first linear layer, ReLU is used as the activation function, and after the second, a
sigmoid function is applied. This normalizes the output to values between 0 and 1, which can be
interpreted as the significance of each channel.
3. The result is used to scale the input of the Squeeze-and-Excitation block by an
element-wise multiplication.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*bmObF5Tibc58iE9iOu327w.png" />
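
The three steps can be traced on a random tensor as in the following sketch. It is only meant to show the tensor shapes at each stage; the batch size, channel count and reduction factor are illustrative assumptions, and the two linear layers are freshly initialised rather than trained.

```python
import torch
from torch import nn

torch.manual_seed(0)
channels, reduction = 32, 16  # illustrative values (assumption)
x = torch.randn(4, channels, 8, 8)  # (batch, channels, height, width)

# The two-layer feed-forward network used in step 2.
fc1 = nn.Linear(channels, channels // reduction)
fc2 = nn.Linear(channels // reduction, channels)

# Step 1 (squeeze): global average pooling over the spatial dimensions.
squeezed = x.mean(dim=(2, 3))  # shape (batch, channels)

# Step 2 (excitation): linear -> ReLU -> linear -> sigmoid.
weights = torch.sigmoid(fc2(torch.relu(fc1(squeezed))))  # shape (batch, channels)

# Step 3 (scale): broadcast the per-channel weights over height and width.
out = x * weights[:, :, None, None]
print(squeezed.shape, weights.shape, out.shape)
```

The output has the same shape as the input, so such a block can be dropped in after any convolution without changing the rest of the network.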



### Exercise 5: Create an efficient CNN (4 points)

Today, neural networks frequently have millions or billions of parameters. However, CNNs have become more computationally efficient over the years. How far can you get with a limited amount of compute?

> Create an efficient CNN with fewer than 30,000 parameters.
> Use at least one depthwise separable or groupwise convolution or apply at least one squeeze-and-excitation layer after a convolution.

Hint: Skip connections and normalization layers are frequently used to stabilize the training behaviour of deep CNNs.
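
Groupwise convolutions are available in PyTorch through the `groups` argument of `nn.Conv2d`. The sketch below, with arbitrarily chosen channel sizes and without bias terms, shows how the number of weights shrinks as the number of groups grows, which may help when budgeting parameters.

```python
from torch import nn

c_in, c_out = 32, 64  # illustrative sizes; both must be divisible by `groups`

param_counts = {}
for groups in (1, 4, 32):
    conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, groups=groups, bias=False)
    # Each filter only sees c_in // groups input channels.
    param_counts[groups] = sum(p.numel() for p in conv.parameters())

print(param_counts)
```

Here `groups=1` is a standard convolution (18,432 weights), `groups=4` needs a quarter of that (4,608), and `groups=32` is the depthwise case with one input channel per filter (576).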

In [69]:
from torch.nn import functional as F

class SqueezeExcitation(nn.Module):
    """Rescale each channel by a learned weight between 0 and 1."""

    def __init__(self, channel, reduction=16):
        super().__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # squeeze: one value per channel
            # 1x1 convolutions on a 1x1 feature map act like linear layers
            nn.Conv2d(channel, channel // reduction, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channel // reduction, channel, kernel_size=1, stride=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        y = self.se(x)  # shape (N, C, 1, 1), broadcast over H and W below
        return x * y


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, nin, nout):
        super().__init__()
        # groups=nin applies one 3x3 filter per input channel
        self.depthwise = nn.Conv2d(nin, nin, kernel_size=3, padding=1, groups=nin)
        # the 1x1 convolution mixes the channels
        self.pointwise = nn.Conv2d(nin, nout, kernel_size=1)

    def forward(self, x):
        out = self.depthwise(x)
        out = self.pointwise(out)
        return out


class EfficientCNN(nn.Module):
    """Two residual blocks (separable conv, batch norm, SE) and a linear classifier."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.layer1 = DepthwiseSeparableConv(in_channels, 32)
        self.bn1 = nn.BatchNorm2d(32)
        self.se1 = SqueezeExcitation(32)
        # 1x1 convolution so that the skip path matches the channel count
        self.res1 = nn.Conv2d(in_channels, 32, kernel_size=1)

        self.layer2 = DepthwiseSeparableConv(32, 64)
        self.bn2 = nn.BatchNorm2d(64)
        self.se2 = SqueezeExcitation(64)
        self.res2 = nn.Conv2d(32, 64, kernel_size=1, stride=1)

        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        # Block 1: SE is applied before the residual connection
        identity = self.res1(x)
        out = F.relu(self.bn1(self.layer1(x)))
        out = self.se1(out)
        out += identity
        out = F.relu(out)

        # Block 2
        identity = self.res2(out)
        out = F.relu(self.bn2(self.layer2(out)))
        out = self.se2(out)
        out += identity
        out = F.relu(out)

        # Global average pooling, then flatten to (N, 64) for the classifier
        out = self.pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)

        return out


        

In [70]:
# sanity-check
model = EfficientCNN(in_channels=3, num_classes=10)
model(torch.zeros((1, 3, 32, 32)))
print("number of parameters: ", sum([p.numel() for p in model.parameters()]))

number of parameters:  6414


In [71]:
# Test Cell: do not edit or delete!

In [72]:
# Test Cell: do not edit or delete!

### Exercise 6: Training (4 points)

To get a feeling for hyperparameter search, you have to try it out on an example. You can use the monitoring tools from previous exercises to log performance and find out which hyperparameters work well.

> Train your EfficientCNN on CIFAR10 using the Trainer class. Use hyperparameter search for the learning rate, optimizer and maybe even the model architecture to get an optimal performance (CrossEntropyLoss < 0.9) within 10 epochs of training. 
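
When comparing hyperparameter settings, it is usually more informative to measure the loss on data that was not used for training. The following sketch shows one way to hold out a validation set with `random_split`. It uses a small stand-in `TensorDataset` with CIFAR10-like shapes so that it runs without a download; the names `toy_dataset`, `train_subset` and `valid_subset`, as well as the 80/20 ratio, are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data with CIFAR10-like shapes (assumption); in the notebook this
# would be the CIFAR10 training set.
toy_dataset = TensorDataset(
    torch.zeros(100, 3, 32, 32),
    torch.zeros(100, dtype=torch.long),
)

# Hold out 20% of the samples for validation (the ratio is an arbitrary choice).
num_valid = len(toy_dataset) // 5
generator = torch.Generator().manual_seed(1806)  # fixed seed for a reproducible split
train_subset, valid_subset = random_split(
    toy_dataset, [len(toy_dataset) - num_valid, num_valid], generator=generator
)

train_subset_loader = DataLoader(train_subset, batch_size=32, shuffle=True)
valid_subset_loader = DataLoader(valid_subset, batch_size=32, shuffle=False)
print(len(train_subset), len(valid_subset))
```

The two loaders could then be passed as separate training and validation loaders, assuming the `Trainer` from the earlier exercises accepts them in that order.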

In [73]:
# TODO: Cell for Hyperparameter search, you can freely edit or delete this code
# Transformations and dataset setup
# Scale each of the three RGB channels from [0, 1] to [-1, 1]
normalise = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
dataset = torchvision.datasets.CIFAR10(data_root, train=True, transform=normalise, download=True)
train_loader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=2)

# Define a list of hyperparameters to search through
learning_rates = [0.1, 0.01, 0.001, 0.0001]
optimizers = [optim.SGD, optim.Adam, optim.RMSprop]

# Search through the hyperparameters
best_loss = float('inf')
best_params = {}

for lr in learning_rates:
    for opt in optimizers:
        print(f"Training with learning rate: {lr} and optimizer: {opt}")
        model = EfficientCNN(in_channels=3, num_classes=10).to(device)
        optimizer = opt(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        trainer = Trainer(model, criterion, optimizer)
        
        # Train for 10 epochs
        for epoch in tqdm(range(10)):
            trainer.train(train_loader, train_loader, num_epochs=1)
        
        # Get the final loss and update best parameters if improved
        final_loss = trainer.evaluate(train_loader)
        if final_loss < best_loss:
            best_loss = final_loss
            best_params['lr'] = lr
            best_params['opt'] = opt
            best_params['model_state'] = model.state_dict()

print(f"Best params: {best_params}")
print(f"Best loss: {best_loss}")

Files already downloaded and verified
Training with learning rate: 0.1 and optimizer: <class 'torch.optim.sgd.SGD'>
Training with learning rate: 0.1 and optimizer: <class 'torch.optim.adam.Adam'>
Training with learning rate: 0.1 and optimizer: <class 'torch.optim.rmsprop.RMSprop'>
Training with learning rate: 0.01 and optimizer: <class 'torch.optim.sgd.SGD'>
Training with learning rate: 0.01 and optimizer: <class 'torch.optim.adam.Adam'>
Training with learning rate: 0.01 and optimizer: <class 'torch.optim.rmsprop.RMSprop'>
Training with learning rate: 0.001 and optimizer: <class 'torch.optim.sgd.SGD'>
Training with learning rate: 0.001 and optimizer: <class 'torch.optim.adam.Adam'>
Training with learning rate: 0.001 and optimizer: <class 'torch.optim.rmsprop.RMSprop'>
Training with learning rate: 0.0001 and optimizer: <class 'torch.optim.sgd.SGD'>
Training with learning rate: 0.0001 and optimizer: <class 'torch.optim.adam.Adam'>
Training with learning rate: 0.0001 and optimizer: <class 'torch.optim.rmsprop.RMSprop'>

Best params: {'lr': 0.01, 'opt': <class 'torch.optim.rmsprop.RMSprop'>, 'model_state': OrderedDict([('layer1.depthwise.weight', tensor(...)), ...

In [74]:
# Based on the search results, set best hyperparameters
best_lr = best_params['lr']
best_opt = best_params['opt']
# Continue from the weights of the best run found during the search
model = EfficientCNN(in_channels=3, num_classes=10).to(device)
model.load_state_dict(best_params['model_state'])

# Define the loss and the optimizer with the best learning rate
criterion = nn.CrossEntropyLoss()
optimizer = best_opt(model.parameters(), lr=best_lr)
trainer = Trainer(model, criterion, optimizer)
trainer.train(train_loader, train_loader, num_epochs=10)

{'train': 1.399788881807911, 'valid': 1.5038463799320922}

In [75]:
# Test Cell: do not edit or delete!

In [76]:
# Test Cell: do not edit or delete!