# Deep Learning Lab #0 - Introduction to PyTorch
Welcome to the first laboratory session of the Deep Learning course. Today, we will examine PyTorch, a Python library for **Deep Learning**. PyTorch allows us to build and train deep models in an efficient and intuitive way, leaving most of the mechanics to be carried out **automatically**, such as:
* GPU support
* Automatic gradient computation for back-propagation

### On today's menu:
We are going to take a look at the main functionalities of PyTorch, such as:
* tensor operations;
* computational graph and backpropagation;
* modules;
* loss functions;
* optimizers;
* datasets and dataloaders.

To install PyTorch, follow the instructions at https://pytorch.org/get-started/locally/

In [None]:
!pip3 install torch torchvision torchaudio matplotlib wandb tensorboard -q

In [None]:
# Let's start by importing the library
import torch

## Tensor operations
PyTorch provides a dedicated class called **Tensor**, which represents scalar values as well as multidimensional arrays.
PyTorch offers a wide variety of methods for creating and manipulating tensors, many of which mirror their NumPy counterparts. Let us start with an overview of some methods for tensor creation.

In [None]:
# Create a (2, 3) tensor from python data
data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
]
a = torch.tensor(data)
print("> Tensor from python data")
print(a)

# Creates a (2, 3) tensor with all 0s
b = torch.zeros((2, 3))
print("> Tensor with all 0s")
print(b)

# Creates a (2, 3) tensor with all 1s
c = torch.ones((2, 3))
print("> Tensor with all 1s")
print(c)

# Creates a (1, 4, 3) tensor with values from a normal distribution
d = torch.randn((1, 4, 3))
print("> Tensor with random values")
print(d)

# Creates a tensor with values from 1 to 9 (the end value 10 is excluded)
e = torch.arange(1, 10)
print("> Tensor with values from 1 to 9")
print(e)

In [None]:
# Unless specified, the default type for tensors is float32
# For a list of all tensor types => https://pytorch.org/docs/stable/tensor_attributes.html
print(f"torch.get_default_dtype() = {torch.get_default_dtype()}")

# We can change the tensor type in two ways:
# (a) directly when creating the tensor
f = torch.zeros((2, 3), dtype=torch.int32)
print("> Tensor with all 0s as int32")
print(f)

# (b) after tensor creation
f = torch.zeros((2, 3)).int()
print("> Tensor with all 0s cast to int32")
print(f)

In [None]:
# NOTE: some operators are not implemented for some types, e.g.
# this_will_crash = torch.randn((1, 4, 3), dtype=torch.int32)

# In these cases, we can only cast later
this_wont_crash = torch.randn((1, 4, 3)).long()
print(this_wont_crash)

PyTorch supports the basic Python operators, which are applied elementwise to tensors.

In [None]:
# Create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print("> a")
print(a)

# Create a (2, 3) tensor with all 1s
b = torch.ones((2, 3))
print("> b")
print(b)

a = a * 2
print("> a * 2")
print(a)

print("> (a * 2) + b")
a = a + b
print(a)

print("> ((a * 2) + b)^2")
a = a ** 2
print(a)

Other operations, instead, operate on entire dimensions of the tensors and can change their size.

In [None]:
# Create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print("> a")
print(a)

# To find the shape of the tensor, we can use its .shape attribute
print("> a.shape")
print(a.shape)

# Sum along dim 0 (across the rows): one value per column, shape (3)
b = torch.sum(a, dim=0)
print("> torch.sum(a, dim=0)")
print(b)
print(b.shape)

# Sum along dim 1 (across the columns): one value per row, shape (2)
c = torch.sum(a, dim=1)
print("> torch.sum(a, dim=1)")
print(c)
print(c.shape)

# Sum all values in a
d = torch.sum(a)
print("> torch.sum(a)")
print(d)
print(d.shape)

Tensor indexing in PyTorch is quite similar to NumPy indexing.

In [None]:
# Create a (2, 3) tensor from python data
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print("> a")
print(a)

# Index a specific scalar
print("> a[0, 0]")
print(a[0, 0])

# Index row 0
print("> a[0]")
print(a[0])

# Index column 0
print("> a[:, 0]")
print(a[:, 0])

# Index columns 0 and 1
print("> a[:, 0:2]")
print(a[:, 0:2])

# Index the elements greater or equal to 3.0
print("> a[a >= 3.0]")
print(a[a >= 3.0])

# Assigning through an index modifies the original tensor in place
a[a >= 3.0] += 10
print("> a[a >= 3.0] += 10")
print(a)

# To copy a tensor, you can use the .clone() method
b = a.clone()
b[b >= 3.0] -= 10
print("> b[b >= 3.0] -= 10")
print(b)
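
Whether an indexing result shares memory with the original tensor depends on the kind of indexing used. The following is a minimal sketch, assuming the usual PyTorch semantics where integer and slice indexing return views while boolean-mask indexing returns a copy; the variable names `row_view` and `selected` are only illustrative.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Integer and slice indexing return a view that shares memory with a
row_view = a[0]
row_view[0] = 100.0
print("> a after writing through the view")
print(a)

# Boolean-mask indexing returns a new tensor holding a copy of the values
selected = a[a >= 5.0]
selected[0] = -1.0
print("> a after writing to the masked copy (unchanged)")
print(a)
```

Writing through `row_view` changes `a`, while writing to `selected` does not. When in doubt, calling `.clone()` makes the copy explicit.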

PyTorch supports a concept called **Tensor Broadcasting**, which is designed to automatically deal with operations involving tensors of **different sizes**. This often happens in practice, as in the case where an entire tensor is multiplied by a single scalar.

Two tensors are "broadcastable" if, when iterating over their dimensions from the last one towards the first, one of these conditions holds for each pair of dimension sizes:

1. **they match**: no special treatment is needed in this case
2. **one of them is 1**: the dimension of size 1 is replicated to make it reach the size of the corresponding dimension in the other tensor
3. **one of them does not exist**: the dimension is created with size 1 and the former rule applies
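
The three rules above can be sketched as a small pure-Python function that computes the shape resulting from broadcasting two shapes. The helper name `broadcast_shape` is hypothetical and is written here only for illustration; it is not part of PyTorch (the library ships its own `torch.broadcast_shapes`).

```python
def broadcast_shape(shape_a, shape_b):
    """Return the shape obtained by broadcasting two shapes.

    Raises ValueError if the shapes are not broadcastable.
    """
    result = []
    # Walk the dimensions from the last one towards the first one
    for offset in range(1, max(len(shape_a), len(shape_b)) + 1):
        # Rule 3: a dimension that does not exist is treated as size 1
        size_a = shape_a[-offset] if offset <= len(shape_a) else 1
        size_b = shape_b[-offset] if offset <= len(shape_b) else 1

        if size_a == size_b:
            # Rule 1: the sizes match
            result.append(size_a)
        elif size_a == 1:
            # Rule 2: the size-1 dimension is replicated
            result.append(size_b)
        elif size_b == 1:
            result.append(size_a)
        else:
            raise ValueError(f"sizes {size_a} and {size_b} are not broadcastable")
    return tuple(reversed(result))


print(broadcast_shape((2, 3), ()))
print(broadcast_shape((2, 1, 3), (4, 1)))
```

Calling the function with shapes `(2, 1, 3)` and `(1, 4)` raises a `ValueError`, because the last dimensions (3 and 4) neither match nor equal 1.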

Let's see some examples.

In [None]:
# Create a (2, 3) tensor
a = torch.arange(1, 7).reshape(2, 3)
print("> a")
print(a)
print(a.shape)

# Create a scalar value with no dimensions
b = torch.ones(())
print("> b")
print(b)
print(b.shape)

# We want to sum a and b. Which Tensor Broadcasting rules are we using?
c = a + b

print("> c = a + b")
print(c)
print(c.shape)

In [None]:
# Create a (2, 1, 3) tensor
a = torch.ones((2, 1, 3))
print("> a")
print(a)
print(a.shape)

# Create a (4, 1) tensor
b = torch.randn((4, 1))
print("> b")
print(b)
print(b.shape)

# Will this work?
c = a + b

print("> c = a + b")
print(c)
print(c.shape)

In [None]:
# Create a (2, 1, 3) tensor
a = torch.ones((2, 1, 3))
print("> a")
print(a)
print(a.shape)

# Create a (1, 4) tensor
b = torch.randn((1, 4))
print("> b")
print(b)
print(b.shape)

# Will this work?
c = a + b

print("> c = a + b")
print(c)
print(c.shape)

The `squeeze()` and `unsqueeze()` methods are often used in conjunction with broadcasting. These methods respectively remove or add a dimension with size 1 in the tensor on which they are called.

In [None]:
# Create a (2) tensor
a = torch.arange(0, 2)
print("> a")
print(a)
print(a.shape)

# Create a (3) tensor
b = torch.arange(0, 3)
print("> b")
print(b)
print(b.shape)

# Say we want to multiply each element in a with each element in b, getting a (2, 3) tensor.
# With the current tensor shapes, broadcasting won't work
# c = a * b

# To leverage broadcast, we can add a trailing 1 dimension to a and multiply
print("> a.unsqueeze(dim=-1)")
a = a.unsqueeze(dim=-1)
print(a)
print(a.shape)

c = a * b
print("> c = a * b")
print(c)
print(c.shape)

PyTorch adopts some conventions on the shape of the tensors expected by its modules. 1D data is typically represented in the `(batch_size, features_count)` format. 2D data instead is represented in the `(batch_size, channels, height, width)` format.
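
As a quick illustration of these shape conventions, here is a minimal sketch that feeds correctly shaped random data to two built-in modules. The sizes (8 features, 3 channels, 32x32 images) are arbitrary values chosen for the example.

```python
import torch
from torch import nn

batch_size = 4

# 1D data: (batch_size, features_count)
features = torch.randn(batch_size, 8)
linear = nn.Linear(in_features=8, out_features=16)
print("> linear(features).shape")
print(linear(features).shape)

# 2D data: (batch_size, channels, height, width)
images = torch.randn(batch_size, 3, 32, 32)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, padding=1)
print("> conv(images).shape")
print(conv(images).shape)
```

In both cases the batch dimension comes first and is preserved by the module.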

## Tensors and Computational Graphs
To compute the gradients of a given function for the optimization process, we need to track its input and the operations applied to it. This tracking results in an object called a Computational Graph.

<img src="https://colah.github.io/posts/2015-08-Backprop/img/tree-eval-derivs.png" width="500"><br><br>

For each operation executed, a new node is appended to the graph, allowing us to exploit the **chain rule** to compute all the derivatives in a single back-propagation pass. To efficiently support this functionality, PyTorch provides gradient support for each Tensor.

Let us now create two tensors, each containing a scalar value, and define some operations on them.

In [None]:
# Create the tensors. By default tensors do not require gradients, so we enable them
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

print("> a")
print(a)
print("> b")
print(b)

# The result c is also a tensor
c = a * b
print("> c = a * b")
print(c)

All the intermediate values of the computation are tracked in the background in order to enable back-propagation. Our computational graph will have a node for `a`, one for `b` and one for the `*` operation. Let us now compute the gradients of `c = a * b` with respect to its inputs.

In [None]:
print("Gradient before computation")
print(a.grad)
print(b.grad)
c.backward()
print("Gradient after computation")
print(a.grad)
print(b.grad)
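
To see the chain rule at work on a graph with more than one operation, the following minimal sketch chains two operations and compares the gradients computed by PyTorch with the ones derived by hand. The intermediate names `u` and `y` are introduced only for this example.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

# Two chained operations: u = a * b, then y = u ** 2
u = a * b
y = u ** 2
y.backward()

# Chain rule by hand:
# dy/da = dy/du * du/da = 2u * b
# dy/db = dy/du * du/db = 2u * a
manual_grad_a = 2 * u.detach() * b.detach()
manual_grad_b = 2 * u.detach() * a.detach()

print("> autograd vs manual gradient for a")
print(a.grad, manual_grad_a)
print("> autograd vs manual gradient for b")
print(b.grad, manual_grad_b)
```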

Calling the `backward()` method computes the gradients and adds them to the `grad` attribute of each tensor that requires gradients. By default it also frees the computational graph, releasing the memory used to store it. It is easy to see how automatic differentiation saves a lot of coding. Take as an example the code that would be needed to manually implement gradient computation in NumPy for a linear layer followed by a sigmoid activation:


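```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# define derivative of the activation
def derivative_sigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

# =========== FORWARD =========== #
# example values (assumed for illustration): one sample, 2 inputs, 3 outputs
x = np.array([1.0, 2.0])        # input (in_features,)
W = np.full((3, 2), 0.1)        # weights (out_features, in_features)
b = np.zeros(3)                 # bias (out_features,)
t = np.array([0.0, 1.0, 0.0])   # target (out_features,)
z = np.dot(W, x) + b
y = sigmoid(z)
L = np.sum((t - y) ** 2)        # squared error loss

# =========== BACKWARD =========== #

# compute gradient from L to z
dL_dy = -2 * (t - y)
dL_dz = dL_dy * derivative_sigmoid(z)

# compute gradient w.r.t. input, shape (in_features,)
dL_dx = np.dot(dL_dz, W)

# compute gradient w.r.t. parameters
dL_dW = np.outer(dL_dz, x)      # shape (out_features, in_features)
dL_db = dL_dz
```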
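
The fact that `backward()` sums new gradients into the existing `grad` attribute can be checked directly. The sketch below is a minimal illustration: it runs the same computation several times and inspects `a.grad` after each pass. Resetting the attribute by assigning `None` is one possible way to clear it, chosen here for brevity.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0])

# First backward pass: d(a * b)/da = b = 3
(a * b).backward()
first = a.grad.clone()

# Second pass on a freshly built graph: the new gradient is added to the old one
(a * b).backward()
second = a.grad.clone()

# Clear the accumulated gradient, then compute it again from scratch
a.grad = None
(a * b).backward()
third = a.grad.clone()

print("> a.grad after the first, second and third pass")
print(first, second, third)
```

This accumulation is the reason why gradients have to be cleared between optimization steps.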

## Working with different devices
Until now, our operations were executed on **CPU**. PyTorch, however, supports a wide range of processors for execution, including **GPUs** and **TPUs**. These are called **Devices**. Using the correct device can have a huge impact on **performance**. In order to execute an operation on a specific device, we first need to move all the involved tensors to the memory of such a device. The operation will then be automatically executed on that device.

In [None]:
# Check if we have CUDA support
print("> torch.cuda.is_available()")
print(torch.cuda.is_available())

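# Use the first GPU if available, otherwise fall back to the CPU.
# In the case of multiple GPUs, the index specifies which GPU to use
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Create a tensor directly in the memory of the selected device
a = torch.tensor([2.0], device=device)
# Create a tensor on the CPU and move it to the selected device
b = torch.tensor([3.0]).to(device)

# The result c is on the same device as a and b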
c = a * b

print("> c.device")
print(c.device)

Different operations can also be executed on different devices, and PyTorch automatically keeps track of this in the Computational Graph. We now have all the ingredients we need to create and train a deep model. However, working with such low-level operations is inconvenient and error-prone. We will now look at additional functionalities offered by the library to speed up development.

## Modules
Let us use what we have learned so far to implement a simple linear layer:
`y = Wx + b`

In [None]:
def apply_linear_layer(x, W, b):
    """Apply a linear layer on an input.

    Args:
    x (batch_size, in_features)
    W (out_features, in_features)
    b (out_features)
    """

    # (1, out_features, in_features)
    W = W.unsqueeze(0)
    # (batch_size, in_features, 1)
    x = x.unsqueeze(-1)

    # (batch_size, out_features, 1), squeezed to (batch_size, out_features)
    product = torch.matmul(W, x).squeeze(-1)

    # (batch_size, out_features)
    result = product + b
    return result

batch_size = 4
in_features = 8
out_features = 16

# Create input
x = torch.randn((batch_size, in_features))

# Weights of the linear layer
W = torch.randn((out_features, in_features), requires_grad=True)
b = torch.zeros((out_features), requires_grad=True)

# Get output
output = apply_linear_layer(x, W, b)

print("> output")
print(output)
print(output.shape)

While the layer is functional, instantiating multiple such layers quickly becomes **unmanageable**. The main problem is that the weights of the layer **are not tied** to the computational logic: creating a **class** for the layer solves this problem. PyTorch provides a base class (`torch.nn.Module`) for this purpose, which offers various functionalities.

Let's implement our linear layer in PyTorch style.

In [None]:
import torch
from torch import nn

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        """Linear layer.

        Args:
          in_features: number of input features
          out_features: number of output features
        """
        super(MyLinear, self).__init__()

        # Creates tensors for the weights
        W = torch.randn((out_features, in_features))
        b = torch.zeros((out_features))

        # Uses the Parameter class (subclass of Tensor) to create parameters for the module
        # When assigned to a member of self, Parameter tensors are automatically registered
        # Require gradient by default
        # Other Module objects are also automatically registered if assigned to self
        self.W = nn.Parameter(W)
        self.b = nn.Parameter(b)

    def forward(self, x):
        """Method executed when the object is called.

        Args:
          x (batch_size, in_features)

        Return:
          tensors (batch_size, out_features)
        """

        # (batch_size, in_features, 1)
        x = x.unsqueeze(-1)
        # Note that Parameters can be used in computations like regular tensors
        # (1, out_features, in_features)
        W = self.W.unsqueeze(0)

        # (batch_size, out_features, 1), squeezed to (batch_size, out_features)
        product = torch.matmul(W, x).squeeze(-1)

        # (batch_size, out_features)
        result = product + self.b
        return result

batch_size = 4
in_features = 8
out_features = 16

x = torch.randn((batch_size, in_features))

# Creates an instance of our linear layer
linear_layer = MyLinear(in_features, out_features)
# Computes the results. The forward method is called internally
output = linear_layer(x)

print("> output")
print(output)
print(output.shape)

Note how the parameters, their initialization and the processing logic are now all contained in the class, so multiple instances can be handled more conveniently. The Module class also provides a range of additional functionalities.

In [None]:
# Obtain all the parameters in the model. Useful for model optimization
for name, values in linear_layer.named_parameters():
    print(name, values)

# Move all the tensors associated with the model to the specified device.
# This applies recursively to other Module objects contained in the instance
device = "cuda:0" if torch.cuda.is_available() else "cpu"
linear_layer.to(device)

# Saving the model weights
print("> Saving model")
saved_model = linear_layer.state_dict()
torch.save(saved_model, "save.pth")

# Load the saved model
print("> Loading model")
loaded_state_dict = torch.load("save.pth")
linear_layer.load_state_dict(loaded_state_dict)
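
As noted in the comments of the layer definition, `Module` objects assigned to attributes of another `Module` are registered automatically, together with their parameters. The following minimal sketch shows this with a small two-layer network; the class name `TinyMLP` and the layer sizes are hypothetical choices made for this example.

```python
import torch
from torch import nn


class TinyMLP(nn.Module):
    """Two linear layers with a ReLU in between (illustrative only)."""

    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        # Sub-modules assigned to self are registered automatically
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.fc2 = nn.Linear(hidden_features, out_features)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyMLP(8, 16, 4)

# The parameters of the nested layers are visible from the outer module
names = [name for name, _ in model.named_parameters()]
print(names)

total_parameters = sum(p.numel() for p in model.parameters())
print(f"> Total parameters: {total_parameters}")

output = model(torch.randn((2, 8)))
print(output.shape)
```

Methods such as `to()`, `state_dict()` and `parameters()` therefore operate on the whole hierarchy of modules at once.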

PyTorch provides a wide range of `Module` implementations representing the most common computational blocks. These include:

*   Linear layers
*   1D, 2D and 3D Convolutions
*   Transposed Convolutions
*   Batch/Layer Normalization layers
*   RNN, LSTM, GRU cells
*   Multi-head Attention Layers
*   ...

Moreover, many common networks are available as ready-made `Module`s (for instance in the `torchvision.models` package):
*   AlexNet
*   VGG
*   ResNet
*   DenseNet
*   Transformers
*   Vision Transformers
*   ...

## Loss functions
PyTorch provides a wide range of already implemented loss functions as part of `torch.nn`, such as:
*   L1
*   MSE
*   Cross Entropy
*   Binary Cross Entropy


In [None]:
import torch
from torch import nn

# Create some tensors for the loss
a = torch.zeros((2, 4))
b = torch.ones((2, 4)) * 2

# Instantiate the loss
l1_loss = nn.L1Loss()
mse_loss = nn.MSELoss()

# Compute the loss functions
loss = l1_loss(a, b)
print("> l1_loss(a, b)")
print(loss)

loss = mse_loss(a, b)
print("> mse_loss(a, b)")
print(loss)

In [None]:
# Loss values are tensors attached to the Computational Graph, so we can call backward() on them
a = torch.rand((2, 4), requires_grad=True)
b = torch.rand((2, 4), requires_grad=True)

loss = l1_loss(a, b)
print("> l1_loss(a, b)")
print(loss)

print("> Before backward")
print(a.grad)
print(b.grad)

loss.backward()

print("> After backward")
print(a.grad)
print(b.grad)
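
Cross Entropy, listed above, follows a different input convention from L1 and MSE: `nn.CrossEntropyLoss` expects unnormalized scores (logits) of shape `(batch_size, num_classes)` and integer class indices of shape `(batch_size)`. The sketch below is a minimal illustration with made-up values: uniform logits over 3 classes give a loss of `log(3)`, while confident correct predictions give a loss close to 0.

```python
import math

import torch
from torch import nn

cross_entropy = nn.CrossEntropyLoss()

# Integer class indices, one per sample
targets = torch.tensor([0, 2])

# Uniform logits: every class gets probability 1/3
uniform_logits = torch.zeros((2, 3))
uniform_loss = cross_entropy(uniform_logits, targets)
print("> loss with uniform logits (compare with log(3))")
print(uniform_loss, math.log(3))

# Logits that strongly favor the correct class
confident_logits = torch.tensor([[10.0, 0.0, 0.0], [0.0, 0.0, 10.0]])
confident_loss = cross_entropy(confident_logits, targets)
print("> loss with confident correct logits")
print(confident_loss)
```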

## Optimizers

PyTorch implements a wide range of optimizers as part of the `torch.optim` package. When an optimizer is created, a sequence of tensors to optimize is required. The optimizer then uses each tensor's `grad` attribute to update its value.

A typical optimization cycle is made of the following steps:

1.   perform the computations that build the Computational Graph;
2.   compute the loss term;
3.   use `backward()` to compute gradients for each tensor in the Computational Graph;
4.   perform an optimization step using the optimizer;
5.   zero the gradient in all tensors for the next optimization cycle using `zero_grad()`.

In [None]:
import torch
from torch import nn

in_features = 8
out_features = 4
batch_size = 4
learning_rate = 1e-4

# Create tensors for the loss
x = torch.zeros((batch_size, in_features))
y = torch.ones((batch_size, out_features))

# Create the model to optimize
model = nn.Linear(in_features, out_features)

# Instantiate an optimizer on the parameters of the Linear model.
optimizer = torch.optim.SGD(model.parameters(), learning_rate)

# Instantiate the loss
l1_loss = nn.L1Loss()

# 1. Perform computations
y_pred = model(x)

# 2. Compute the loss term
loss = l1_loss(y_pred, y)

# 3. Compute the gradients of the loss term. The parameters involved in the computation now have a .grad value
loss.backward()

# 4. Perform the optimization step with the gradient values in .grad
optimizer.step()

# 5. Reset all .grad attributes for the next optimization cycle
optimizer.zero_grad()

print("> loss")
print(loss)
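
In practice the five steps are repeated many times in a loop. The following minimal sketch runs 100 optimization cycles on synthetic data; the data generation, the MSE loss, the learning rate of 0.1 and the fixed seed are all assumptions chosen to keep the example small and reproducible.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data: y is a linear function of x
x = torch.randn((64, 8))
true_weights = torch.randn((8, 1))
y = x @ true_weights

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mse_loss = nn.MSELoss()

losses = []
for step in range(100):
    y_pred = model(x)            # 1. build the Computational Graph
    loss = mse_loss(y_pred, y)   # 2. compute the loss term
    loss.backward()              # 3. compute the gradients
    optimizer.step()             # 4. update the parameters
    optimizer.zero_grad()        # 5. reset the gradients
    losses.append(loss.item())

print(f"> first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Storing `loss.item()` rather than `loss` keeps only the Python number and avoids holding on to tensors from previous iterations.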

## Datasets and Dataloaders
While we could manually load training data into input tensors, doing so would be a major performance bottleneck when training a deep model. For this reason, PyTorch provides a range of utilities in the `torch.utils.data` package that help us deal with data efficiently. The most relevant ones are the `Dataset` and `DataLoader` classes.

* The `Dataset` class represents our training data and contains the logic to load a single element. We typically subclass it when creating a new dataset.
* The `DataLoader` class is a utility class that efficiently loads batches of data from a dataset. It can use multiple worker processes to speed up data loading and to overlap the preparation of the next batch with the current model computations.

Subclassing the Dataset class requires overriding the `__len__` and the `__getitem__` methods to return respectively the number of elements in the dataset and the item at a specified position. Any object type can be returned by the `__getitem__` method.


In [None]:
import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    """A simple dataset representing the numbers from 0 to size-1"""

    def __init__(self, size):
        super(SimpleDataset, self).__init__()

        self.size = size

    def __getitem__(self, idx):
        """Get an item given its id.

        Args:
          idx: the integral index of the element to retrieve

        Returns:
          element at index idx
        """
        return torch.tensor([idx], dtype=torch.float32)

    def __len__(self):
        """Get the length of the dataset.

        Returns:
          number of elements that compose the dataset
        """
        return self.size

size = 10

# instantiate the dataset
dataset = SimpleDataset(size)

# fetch the length of the dataset (__len__ method)
length = len(dataset)
print(f"> Dataset length: {length}")

# get each element of the dataset through indexing
# (__getitem__ method)
for idx in range(len(dataset)):
  print(f"- {idx}: {dataset[idx]}")

A range of utility functions is provided to conveniently work with datasets, such as `random_split`.

In [None]:
train_size = 6
val_size = 2
test_size = 2

# split the dataset into training, validation and test sets
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])

# print all the splits
for current_dataset in [train_dataset, val_dataset, test_dataset]:
    current_length = len(current_dataset)
    print(f"> Current length: {current_length}")
    for idx in range(current_length):
        print(f"- {idx}: {current_dataset[idx]}")

Once a `Dataset` instance is available, a `DataLoader` object can be used to efficiently gather batches of data from the dataset. We just need to specify the size of the batch we would like to retrieve, the number of parallel workers to use for data processing, and whether or not we would like batch elements to be sampled randomly from the dataset.

The `DataLoader` object is iterable and yields a batch of data at each iteration. Internally, when a batch is requested, the dataloader uses the `__getitem__` method of the `Dataset` to retrieve each item in the batch. If the type returned by this method is known to PyTorch (e.g. it is a Tensor), the items are automatically combined into a single object representing the batch. For example, if the returned type is Tensor, PyTorch stacks all the Tensors composing the batch into a single Tensor with an additional leading dimension of size equal to the batch size. If the dataset returns custom data types instead, a `collate_fn` function can be specified that takes as input a list of objects returned by the dataset and returns a single object representing the entire batch.

Because `DataLoader` workers run in separate processes, PyTorch recommends that objects returned by datasets be placed in CPU memory, given the subtleties of handling objects placed in GPU memory from multiple processes.

In [None]:
from torch.utils.data import DataLoader

# Creates a dataloader for our dataset instance.
# Does not randomize the order of elements and returns the last batch even if
# it is not of size batch_size
dataloader = DataLoader(dataset, num_workers=2, batch_size=4, shuffle=False, drop_last=False)

print(f"> Length of dataset: {len(dataset)}")

print("> Unshuffled DataLoader")
for idx, batch in enumerate(dataloader):
  print(f"Batch {idx}:")
  print(batch)
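
The `collate_fn` hook described above is useful when items cannot simply be stacked, for example when they have different lengths. Here is a minimal sketch under that assumption: the dataset `VariableLengthDataset` and the function `pad_collate` are hypothetical names written for this example, and padding with zeros is just one possible way to build the batch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Item idx is a 1D tensor of length idx + 1 filled with the value idx."""

    def __init__(self, size):
        super().__init__()
        self.size = size

    def __getitem__(self, idx):
        return torch.full((idx + 1,), float(idx))

    def __len__(self):
        return self.size


def pad_collate(items):
    """Pad the items to a common length and also return the original lengths."""
    lengths = torch.tensor([len(item) for item in items])
    padded = pad_sequence(items, batch_first=True, padding_value=0.0)
    return padded, lengths


dataloader = DataLoader(
    VariableLengthDataset(5), batch_size=2, shuffle=False, collate_fn=pad_collate
)

batches = list(dataloader)
for padded, lengths in batches:
    print(padded.shape, lengths)
```

Returning the original lengths alongside the padded tensor lets later code ignore the padding values.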

## Transformations

The `Dataset` class gives us the freedom to insert data augmentation strategies directly inside the `__getitem__` method implementation. Doing so, however, is inconvenient, since for the same dataset we may want to apply different augmentations, for example one set during training and another during evaluation.

For this reason, a typical pattern in PyTorch is to pass a `transform` function to the Dataset constructor. The Dataset class applies the desired transformation **before** returning the `__getitem__` value, thus effectively returning `transform(item)`, where `item` is the untransformed element at index `idx`.

In [None]:
import torch
from torch.utils.data import Dataset


class TransformableDataset(Dataset):
    """A simple dataset class representing the numbers from 0 to size - 1"""
    def __init__(self, size, transform=None):
        super(TransformableDataset, self).__init__()

        self.transform = transform
        self.size = size

    def __getitem__(self, idx):
        """Get an item given its id.

        Args:
          idx: the integral index of the element to retrieve

        Returns:
          element at index idx
        """
        result = torch.tensor([idx], dtype=torch.float32)

        # If a transformation is available, we apply it
        if self.transform is not None:
          result = self.transform(result)

        return result

    def __len__(self):
        """Get the length of the dataset.

        Returns:
          number of elements that compose the dataset
        """
        return self.size

# A simple transformation
def square(input):
  return input ** 2

size = 10

# Instantiates the dataset
dataset = TransformableDataset(size, transform=square)

# Gets the length of the dataset (__len__ method)
length = len(dataset)
print(f"> Dataset length: {length}")

# Gets each element of the dataset through indexing
# (__getitem__ method)
for idx in range(len(dataset)):
  print(f"- {idx}: {dataset[idx]}")

Many transformations are available in PyTorch. In particular, the `torchvision.transforms` package contains a range of transformations designed for images and utilities to compose a complex chain of transformations into a single pipeline.

When designing a transformation, it is important to consider that torchvision image datasets typically return images represented by the PIL Image class, which is the format expected by many of the transformations in the `torchvision.transforms` package. The `ToTensor` transformation can be used to convert PIL Images to Tensors.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision
from torchvision import transforms

# Obtains the CIFAR100 dataset, downloading it if necessary
# Each returned element is a tuple (PIL Image, int) with the int representing the label of the image
# transform is applied only to the first element in the tuple
# target_transform can be specified for the second element
dataset = torchvision.datasets.CIFAR100(root="cifar100", transform=None, download=True)

# Plots the first 9 images without transformations
fig, axs = plt.subplots(3, 3, figsize=(5, 5))
for ax_idx, ax in enumerate(axs.flatten()):
    sample_image, sample_label = dataset[ax_idx]

    ax.axis("off")
    ax.set_title(f"Class: {sample_label}")
    ax.imshow(np.asarray(sample_image))

In [None]:
from torchvision import transforms

# Build a transformation that will apply a random affine transformation
affine_transformation = transforms.RandomAffine(degrees=20, translate=(0.1, 0.1))

# Pass transformation to dataset
dataset = torchvision.datasets.CIFAR100(root="cifar100", transform=affine_transformation, download=True)

# Plots the first 9 images
fig, axs = plt.subplots(3, 3, figsize=(5, 5))
for ax_idx, ax in enumerate(axs.flatten()):
    sample_image, sample_label = dataset[ax_idx]

    ax.axis("off")
    ax.set_title(f"Class: {sample_label}")
    ax.imshow(np.asarray(sample_image))

In [None]:
from torchvision import transforms

# We can also build a chain of transformations
transformations_sequence = [
  transforms.RandomAffine(degrees=20, translate=(0.1, 0.1)),
  # Random changes in pixel colors
  transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
  # The former transformations accept and return PIL Image objects, now convert to Tensor
  transforms.ToTensor(),
  # Apply normalization
  transforms.Normalize(mean=[0.4913, 0.4821, 0.4465], std=[0.2470, 0.2434, 0.2615])
]
composed_transformation = transforms.Compose(transformations_sequence)

# Pass transformation to dataset
dataset = torchvision.datasets.CIFAR100(root="cifar100", transform=composed_transformation, download=True)

# Plots the first 9 images
fig, axs = plt.subplots(3, 3, figsize=(5, 5))
for ax_idx, ax in enumerate(axs.flatten()):
    sample_image, sample_label = dataset[ax_idx]

    ax.axis("off")
    ax.set_title(f"Class: {sample_label}")
    # Because the sample_image is a Tensor of shape [channels, height, width], we need to reshape it so that matplotlib can show it
    # [channels, height, width] => [height, width, channels]
    ax.imshow(sample_image.permute(1, 2, 0))

## Logging

Understanding the training behavior of deep models is often challenging. However, a wide variety of metrics can give us clues on why a certain behavior is shown. Moreover, when working on a deep learning project, multiple configurations and architecture variations are typically tested, generating a large quantity of data. Thus, being able to explore these data and compare them among different configurations is of primary importance.

Multiple tools are available to achieve this goal. We will quickly cover two main logging utilities: TensorBoard and Weights & Biases (WandB). The idea behind these tools is simple: when training or evaluating a model, we log some metrics at each step, and the tool provides us with a web interface where plots showing the dynamics of our model are automatically populated.

In [None]:
# Let's clear previous runs (if any)
!rm -rf runs

In [None]:
import torch
from torch.utils.tensorboard import SummaryWriter

# ====== Write fake data representing a first experiment ====== #
# Creates a logger for the experiment
writer = SummaryWriter(log_dir="runs/exp1")

# Simulate 100 training steps
for training_step in range(100):
    # Log training metrics
    writer.add_scalar("train/quantity_a", training_step * 0.5, training_step)
    writer.add_scalar("train/quantity_b", training_step ** 1.5, training_step)
    writer.add_scalar("train/quantity_c", 1 / (1 + training_step), training_step)

# Close the logger
writer.close()

# ====== Write fake data representing a second experiment ====== #
writer = SummaryWriter(log_dir="runs/exp2")

for training_step in range(100):
    writer.add_scalar("train/quantity_a", training_step * 0.4, training_step)
    writer.add_scalar("train/quantity_b", training_step ** 1.4, training_step)
    writer.add_scalar("train/quantity_c", 1 / (1 + 2 * training_step), training_step)

writer.close()
# ============================================================== #

In [None]:
# If you are getting an error, enable third-party cookies!
%load_ext tensorboard
%tensorboard --logdir=runs

Let's now try out WandB.

In [None]:
# Import the library
import wandb

wandb.login()

# ====== Write fake data representing a first experiment ====== #
wandb.init(project="lab_01_intro", name="exp1")

# Simulate 100 training steps
for training_step in range(100):
    # Log training metrics
    wandb.log({
        "train/quantity_a": training_step * 0.5,
        "train/quantity_b": training_step ** 1.5,
        "train/quantity_c": 1 / (1 + training_step),
    })

# Mark the first run as finished before starting a new one
wandb.finish()

# ====== Write fake data representing a second experiment ====== #
wandb.init(project="lab_01_intro", name="exp2")

for training_step in range(100):
    # Log training metrics
    wandb.log({
        "train/quantity_a": training_step * 0.4,
        "train/quantity_b": training_step ** 1.3,
        "train/quantity_c": 1 / (1 + 2 * training_step),
    })

# Mark the second run as finished
wandb.finish()