# **KogSys-ML-B: Einführung in Maschinelles Lernen**
## **Deep Learning 2: Training Procedure**
---

To set up a new conda environment suitable for this notebook, you can use the following console commands:

```bash
conda create -y -n pytorch python=3.13
conda activate pytorch
python -m pip install -r requirements.txt
```

**Note**: Conda can become very hard-drive hungry when you use many environments. Consider regularly deleting environments you no longer need and running the ``conda clean --all`` command to remove no longer needed packages and cached files.

You can also install the requirements for this notebook into an existing environment by running the cell below:

In [None]:
# !python -m pip install -q -U -r requirements.txt

In [None]:
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from sklearn.metrics import accuracy_score
from torch.optim.lr_scheduler import ReduceLROnPlateau
from tqdm import tqdm

torch.manual_seed(2025)

### **1 Recap**

In last week's tutorial, we saw the basic functionality of PyTorch and implemented our first CNN! A possible solution is implemented here, so if you still want to catch up on last week's exercises, don't look too closely at the cells in this section!

In [None]:
### choosing the device dynamically ###


def get_device() -> str:
    """
    Checks whether PyTorch can use CUDA (on NVIDIA GPUs) or MPS (Metal Performance Shaders, on Apple M chips) and returns the matching device name. If neither is available, the CPU is used.
    """
    return (
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )


get_device()

In [None]:
### model ###


class Model(nn.Module):
    """
    The Convolutional Neural Network implemented in the last tutorial session.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.network = nn.Sequential(
            nn.Conv2d(
                in_channels=3,
                out_channels=8,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=False,
            ),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(
                in_channels=8,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=1,
                bias=False,
            ),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 10, bias=False),
            # No nn.Softmax here: nn.CrossEntropyLoss expects raw logits and
            # applies log-softmax internally.
        )

    def forward(self, x):
        return self.network(x)

In [None]:
### data loading ###

train_data = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)
test_data = torchvision.datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)

train_loader = torch.utils.data.DataLoader(train_data, batch_size=8, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=8, shuffle=False)

classes = (
    "plane",
    "car",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
)

### **2 Deconstructing the Basic Training Loop**

At the beginning of training, we initialize the model with random weights (which happens simply by calling the constructor) and move it to the best available device (determined by `get_device()` from the recap cells).

In [None]:
### initialize model, move to device ###

model = Model()
model = model.to(get_device())

We then choose our loss function and optimizer. The loss function needs to be callable (in this case it is a callable object), and the optimizer is a `torch.optim.Optimizer` object, in this case Stochastic Gradient Descent. The optimizer must always be passed the parameters which it should optimize. We get those by calling `model.parameters()`.
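
To illustrate what "callable object" means here, the following minimal sketch calls an `nn.CrossEntropyLoss` instance on hand-written example values. The tensors `logits` and `targets` are made up for illustration. The comparison with `F.log_softmax` followed by `F.nll_loss` shows that the loss expects raw scores (logits) and class indices, because it applies log-softmax internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()  # an object that can be called like a function

# Made-up example values: raw scores of shape (batch, classes), class indices of shape (batch,)
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

loss_value = criterion(logits, targets)  # scalar tensor, averaged over the batch

# Same result computed in two explicit steps
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```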

In [None]:
### loss function and optimizer ###

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # lr stated explicitly (0.001 is the default in recent PyTorch versions)

In the training loop, we iterate first over epochs and then over a DataLoader, here the training loader. We unpack each batch provided by the DataLoader into images (inputs) and labels. `optimizer.zero_grad()` resets the gradients from the previous step to zero. We then pass the inputs to the model and compute the loss from the model outputs and the labels. Calling `loss.backward()` backpropagates the loss, and calling `optimizer.step()` updates the parameters.

We track the batch index `idx` using `enumerate` so that we can print the average loss every 1000 batches.

At the end of training, we save the model to a file.

In [None]:
### training loop ###

print("Starting Training")
model.train()

for epoch in range(10):  # Limit to 10 epochs to keep the runtime short
    sum_loss = 0.0

    for idx, data in enumerate(train_loader, 0):
        inputs, labels = data[0].to(get_device()), data[1].to(get_device())

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = loss_func(outputs, labels)
        loss.backward()
        optimizer.step()

        sum_loss += loss.item()
        if idx % 1000 == 999:
            print(f"Epoch {epoch + 1}, batch {idx + 1}: loss {sum_loss / 1000:.3f}")
            sum_loss = 0.0

print("Finished Training")


### save the model ###

model.eval()
torch.save(model.state_dict(), "cifar10_cnn.pth")

The following code loads the weights from the save-file, so that you can repeat evaluation without re-training the model when coming back to this notebook.

In [None]:
### load the model ###

model = Model()
model.load_state_dict(torch.load("cifar10_cnn.pth", weights_only=True))
model = model.to(get_device())

In [None]:
### evaluate ###

def eval(data: torch.utils.data.DataLoader, model: nn.Module) -> float:
    """
    Calculates accuracy of `model` on the DataLoader `data`
    """
    model.eval()
    device = next(model.parameters()).device

    with torch.no_grad():
        true, pred = [], []
        for batch in data:
            images, labels = batch[0].to(device), batch[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            # convert to plain Python ints before handing them to scikit-learn
            true.extend(labels.cpu().tolist())
            pred.extend(predicted.cpu().tolist())

    return accuracy_score(true, pred)


print(f"Test Accuracy: {eval(test_loader, model):.2f}")

### **3 Building a Better Training Loop**

#### **3.1 Optimizers: `Adam`**

**What is Adam?** [Adam](https://paperswithcode.com/method/adam) is an optimization algorithm introduced by Kingma & Ba (2015) that has since found wide application in optimizing neural network parameters. It extends Stochastic Gradient Descent with both momentum (controlled by $\beta_1$) and Root Mean Square Propagation (RMSprop), which adapts the learning rate for each parameter individually (controlled by $\beta_2$).
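
To make the roles of $\beta_1$ and $\beta_2$ more concrete, here is a minimal sketch of the Adam update for a single scalar parameter in plain Python. It follows the update rule from Kingma & Ba (2015), including the bias correction. The helper name `adam_step` and the toy objective $f(w) = w^2$ are assumptions made for illustration; this is not how PyTorch implements the optimizer internally.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad  # momentum: running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad**2  # RMSprop part: running mean of the squared gradients
    m_hat = m / (1 - beta1**t)  # bias correction, t is the step count starting at 1
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v


# Toy objective f(w) = w**2 with gradient 2 * w, starting at w = 1
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
```

Because the step is divided by the root of the running squared gradient, the very first update has a size of roughly `lr`, regardless of how large the gradient is.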

In [None]:
del model
model = Model()  # Create new untrained model, so that we can use this optimizer in training later (otherwise we would have to copy-paste this code)

optimizer = torch.optim.Adam(
    params=model.parameters(),  # parameters to optimize
    lr=0.001,  # learning rate
    betas=(
        0.9,
        0.999,
    ),  # beta 1 is the momentum factor, beta 2 is the RMSprop factor for per-parameter learning rate adjustment
    eps=1e-8,  # small constant that keeps the RMSprop denominator away from zero
    weight_decay=0,  # factor that weighs in the L2 norm of all weights, i.e. not only the loss is minimized, but also the weight values
)

#### **3.2 Learning Rate Scheduling**

Training performance can be improved by adapting the learning rate during training, with techniques like learning rate warmup (not starting with the full learning rate but increasing it over the first epochs) or decay (reducing the learning rate as training goes on). Decay is a very effective technique, demonstrated for example in the [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) paper (He et al., 2016). See how the classification error decreases visibly after reducing the learning rate:

<image src='images/lr_schedule.png' style='width:800px'>
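
The warmup idea mentioned above can be sketched with `torch.optim.lr_scheduler.LambdaLR`, which multiplies the base learning rate by a factor computed from the epoch index. The dummy parameter, the warmup length of five epochs and all `warmup_*` names are assumptions chosen for illustration; this is one possible way to implement warmup, not the schedule used in the paper.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter instead of a model
warmup_opt = torch.optim.SGD([warmup_param], lr=0.1)
warmup_epochs = 5  # assumed warmup length

# Linear warmup: the factor grows from 1/5 to 1 and then stays at 1
warmup_sched = LambdaLR(
    warmup_opt, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)

warmup_lrs = []
for warmup_epoch in range(7):
    warmup_lrs.append(warmup_opt.param_groups[0]["lr"])  # learning rate used in this epoch
    warmup_opt.step()  # stands in for one epoch of training
    warmup_sched.step()  # update the learning rate for the next epoch
```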

In [None]:
### learning rate scheduling ###

lr_scheduler = ReduceLROnPlateau(
    optimizer=optimizer,  # the optimizer for which learning rate should be adapted
    mode="max",  # whether the metric should increase or decrease. Default is 'min' to be used with loss, when used with acc should be 'max'
    patience=3,  # number of epochs without improvement that are tolerated before the learning rate is reduced
    threshold=1e-4,  # minimum change that counts as an improvement; relative to the best value (best * (1 +/- threshold)) by default, can also be set to absolute threshold mode
    factor=0.1,  # the value by which the learning rate is multiplied when it is reduced
)

Learning rate scheduling should be applied after the optimizer's update by calling `lr_scheduler.step()`. This should usually happen once per epoch, after iterating through the training DataLoader. Note that the `ReduceLROnPlateau` scheduler's `step()` method must be passed the metric to track, i.e. the value that has to plateau before the learning rate is reduced.
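
The call order can be sketched as follows. To keep the example small, a single dummy parameter replaces the model, and the list `demo_val_accuracies` contains made-up validation accuracies instead of values from a real validation run; all `demo_*` names are assumptions for illustration.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

demo_param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter instead of a model
demo_opt = torch.optim.SGD([demo_param], lr=0.1)
demo_sched = ReduceLROnPlateau(demo_opt, mode="max", patience=3, factor=0.1)

# Made-up validation accuracies: no improvement after the second epoch
demo_val_accuracies = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60]

demo_lrs = []
for val_acc in demo_val_accuracies:
    # ... the loop over the training DataLoader would go here ...
    demo_opt.step()  # stands in for the parameter updates of one epoch
    demo_sched.step(val_acc)  # pass the tracked metric once per epoch
    demo_lrs.append(demo_opt.param_groups[0]["lr"])
```

With `patience=3`, the first three epochs without improvement are tolerated, and the learning rate is multiplied by `factor` in the fourth one.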

#### **3.3 Validation**

To track training progress, looking only at the loss is not the best option. While the loss defines the criterion we optimize, it does not tell us as much about model _performance_ as, say, an accuracy calculated on a validation dataset would. Let's do exactly that!

We already have a function for calculating accuracy given a model and a DataLoader. Now we just need to split our dataset, and torch has a function exactly for that: `torch.utils.data.random_split()`. Note that this task (splitting an existing dataset) is different from the one presented in the assignment, where we want to build the split into the Dataset class itself.

**Note: Validation Dataset.** For datasets that are intended to be used as benchmarks, the test dataset is often either provided without labels or not provided at all. This is done to prevent models from being trained on the test data to cheat on the leaderboards. The validation dataset is thus all we have to get an estimate of how well our model performs.  
In such cases, we should not use the validation dataset for the classic validation tasks (e.g., to calculate metrics for early stopping), but rather treat it as our test dataset. Instead, it is advisable to create our own validation dataset from the training dataset.

In [None]:
### create validation set from train set ###

train, val = torch.utils.data.random_split(
    dataset=train_data,  # Dataset object to split
    lengths=(0.85, 0.15),  # Fractions of the returned datasets, must sum to 1
    generator=torch.Generator().manual_seed(
        2025
    ),  # Ensures reproducibility, optional parameter
)

train_loader = torch.utils.data.DataLoader(train, batch_size=8, shuffle=True)
val_loader = torch.utils.data.DataLoader(val, batch_size=8, shuffle=False)

#### **3.4 Checkpointing**

Using `torch.save()`, you can save components of your model and training process by passing a dictionary. By convention, such checkpoint files end in either `.pth` or `.pt`.

In [None]:
### checkpointing ###

def save(save_dict: dict, path: str) -> None:
    """
    Wrapper around torch.save, to demonstrate the syntax. You can call this function in the training loop or use torch.save directly.
    """

    torch.save(save_dict, path)

A common dictionary to save could be the following:

In [None]:
{
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": lr_scheduler.state_dict(),
    "loss": loss,
}

...which can then be loaded using `torch.load()` and accessed like the originally saved dictionary. Models, optimizers and schedulers have `.load_state_dict()` methods to load the stored `state_dict`s.
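
A minimal round trip could look like the sketch below. It uses a tiny `nn.Linear` layer as a stand-in for the CNN and writes to a temporary directory; the helper `build_demo()`, the file name `checkpoint_demo.pth` and all `demo_*`/`restored_*` names are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau


def build_demo():
    """Create a fresh (tiny, illustrative) model, optimizer and scheduler."""
    net = nn.Linear(4, 2)
    opt = torch.optim.Adam(net.parameters(), lr=0.001)
    sched = ReduceLROnPlateau(opt, mode="max", patience=3)
    return net, opt, sched


demo_net, demo_optim, demo_plateau = build_demo()
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint_demo.pth")
torch.save(
    {
        "epoch": 5,
        "model_state_dict": demo_net.state_dict(),
        "optimizer_state_dict": demo_optim.state_dict(),
        "scheduler_state_dict": demo_plateau.state_dict(),
    },
    ckpt_path,
)

# Later, e.g. in a new session: rebuild the objects first, then restore their states
restored_net, restored_optim, restored_plateau = build_demo()
checkpoint = torch.load(ckpt_path, weights_only=True)
restored_net.load_state_dict(checkpoint["model_state_dict"])
restored_optim.load_state_dict(checkpoint["optimizer_state_dict"])
restored_plateau.load_state_dict(checkpoint["scheduler_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue with the next epoch
```

Note that `load_state_dict()` only restores values. The objects themselves must be constructed with the same architecture and settings before loading.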

#### **3.5 Progress Bars!!!**

The `tqdm` library offers progress bars for the console which display progress (duh) and can also show additional information. We usually wrap progress bars around the DataLoaders rather than the epoch counter.

In [None]:
### default progress bar ###

for i in tqdm(range(1000000)):
    time.sleep(3e-6)

In [None]:
### customizable progress bar ###

pbar = tqdm(range(30000))

for i in pbar:
    pbar.set_description(f"i: {i}")

In [None]:
### syntactic sugar ###

for i in (pbar := tqdm(range(30000), ncols=100)):  # ncols sets the width of the bar
    pbar.set_description(f"i = {i}")
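
Wrapping a DataLoader instead of a `range` works the same way. The sketch below uses a small random `TensorDataset` as a stand-in for CIFAR-10; the `demo_*` names and the information shown via `set_postfix` are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Small random dataset as a stand-in for CIFAR-10 (64 images, 10 classes)
demo_data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
demo_loader = DataLoader(demo_data, batch_size=8, shuffle=False)

n_batches = 0
for images, labels in (demo_bar := tqdm(demo_loader, desc="Epoch 1")):
    n_batches += 1
    # set_postfix shows extra information next to the bar, e.g. a running loss
    demo_bar.set_postfix(batch_size=images.shape[0])
```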

#### **3.6 Putting It All Together**

**Exercise:** Build a training loop using the improvements shown in this notebook. Use validation accuracy for learning rate scheduling. Train for 10 epochs, and save a detailed checkpoint every five epochs.

**Exercise:** Load the checkpoint from epoch 10 for the model, optimizer and learning rate scheduler, and continue the training for 10 more epochs. You may use the same training loop as in the cell above.

**Exercise:** Evaluate the final model on the test split.

Note that in the case above, the model produced after epoch 20 may not be the best model we have seen during training, judging by the validation accuracy. For all we know, a model from an earlier epoch may perform better. For this purpose, we may also introduce a second set of checkpoints, which always store (and overwrite) a model whenever a new maximum validation accuracy is reached.
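
One possible way to keep such a "best so far" checkpoint is sketched below. The validation accuracies are made up, a tiny `nn.Linear` layer stands in for the CNN, and the file name `best_model_demo.pth` as well as all `best_*` names are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

best_net = nn.Linear(4, 2)  # stand-in for the CNN
best_path = os.path.join(tempfile.mkdtemp(), "best_model_demo.pth")
best_val_acc = float("-inf")
best_epoch = None

# Made-up validation accuracies, one per epoch
for demo_epoch, demo_val_acc in enumerate([0.41, 0.48, 0.46, 0.52, 0.50], start=1):
    # ... one epoch of training and validation would go here ...
    if demo_val_acc > best_val_acc:
        # New maximum reached: overwrite the previous best checkpoint
        best_val_acc, best_epoch = demo_val_acc, demo_epoch
        torch.save(
            {
                "epoch": demo_epoch,
                "val_acc": demo_val_acc,
                "model_state_dict": best_net.state_dict(),
            },
            best_path,
        )
```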

### **What Else?**

Of course, this isn't everything to learn about PyTorch! Here is an (incomplete) list of resources for you to look at if you want to dive deeper into this framework! Note that some of these are really advanced – so don't worry if you don't understand them, you won't need them for this course.

- [Tensorboard visualization of training metrics](https://docs.pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html)
- [Distributed Data Parallel (DDP), i.e. Multi-GPU Training](https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html)
- Technically not PyTorch, but an important tool for academic experiments: Running experiments from configs, e.g. via [YACS](https://github.com/rbgirshick/yacs) or [JSON argparse](https://jsonargparse.readthedocs.io/en/v4.44.0/)

### **Bibliography**

Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In Y. Bengio & Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html