# COMPSCI 714 - Lectutorial 2 - Training and evaluating a DNN with PyTorch

## Coding time 1 - PyTorch basics

In [None]:
import numpy as np
import torch

### PyTorch Tensors

The core data structure used in PyTorch is the **tensor**. Tensors are very similar to arrays and matrices (e.g., NumPy arrays) and can be used to store data as a multidimensional array with a data type.

The main difference from NumPy arrays is that PyTorch tensors support two main additional features:
- They can run on GPUs, while NumPy arrays are designed for CPU-based computations only and do not have built-in GPU support.
- They support auto-differentiation (i.e., PyTorch captures information about operations applied to the tensor and can use it to calculate gradients automatically with *Autograd*).

In [None]:
X = torch.tensor([[1.0, 4.0, 7.0, 9.0], [2.0, 3.0, 6.0, 8.0]])
X

Display the shape and data type of the tensor `X` (as with a NumPy array, using the `shape` and `dtype` attributes).

In [None]:
# TODO

In [None]:
# TODO

Try to index the tensor, e.g., display
- the third value of the first row, and
- the last value of both rows.

In [None]:
# TODO

In [None]:
# TODO

You can perform operations on tensors much as you would on NumPy arrays. \
Try running the following operations and comment on what they do.

In [None]:
8 * (X + 4)

In [None]:
X.exp()

In [None]:
X.mean()

In [None]:
X.mean(axis=1)

In [None]:
X.max(axis=0)

In [None]:
X @ X.T

You can convert a tensor to a NumPy array, and vice versa.

In [None]:
X.numpy()

In [None]:
torch.tensor(np.array([[1., 4., 7.], [2., 3., 6.]]))

NumPy uses 64-bit floats by default, so the tensor above has the data type `torch.float64`. If you want the data converted to 32-bit precision instead:

In [None]:
torch.FloatTensor(np.array([[1., 4., 7.], [2., 3., 6]]))
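
As a complement to the conversions above, here is a minimal sketch of how data type and memory sharing behave when going from NumPy to PyTorch. The variable names (`a`, `t_copy`, `t_shared`, `t_f32`) are only illustrative. It assumes the usual behaviour that `torch.tensor` copies the data while `torch.from_numpy` shares memory with the source array.

```python
import numpy as np
import torch

a = np.array([[1., 4., 7.], [2., 3., 6.]])  # NumPy defaults to float64

t_copy = torch.tensor(a)                      # copies the data, keeps float64
t_shared = torch.from_numpy(a)                # shares memory with `a`, keeps float64
t_f32 = torch.tensor(a, dtype=torch.float32)  # copies and converts to float32

a[0, 0] = 100.0  # modify the NumPy array in place

print(t_copy.dtype, t_f32.dtype)
print(t_copy[0, 0].item())    # the copy is unaffected by the change
print(t_shared[0, 0].item())  # the shared tensor sees the change
```

Passing `dtype=torch.float32` to `torch.tensor` is one alternative to `torch.FloatTensor` for obtaining 32-bit values.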

### Autograd

PyTorch comes with an implementation of auto-differentiation called *Autograd* (automatic gradients). It can be used to compute the derivative of a function with respect to its inputs, i.e., its gradient. To enable *Autograd* on a tensor, you have to set `requires_grad=True` when creating it.

In [None]:
x = torch.tensor(5.0, requires_grad=True)
x

Let's then create a function performing a computation on tensor `x`.

In [None]:
f = x ** 2
f

Notice that `f` is also a tensor, carrying the function `grad_fn=<PowBackward0>`. This is the function that will be used if you decide to backpropagate gradients through the operation that produced `f` (`**` is the power operator, hence the name `PowBackward0`).

Let's now backpropagate the gradient.

In [None]:
f.backward()

This backpropagates the gradient from `f` to `x`. This is quite straightforward here, but imagine this applied to a full DNN. The gradient would be backpropagated from the outputs to the inputs, through all the `grad_fn` functions registered during the forward pass.

Let's now have a look at the gradient value associated with `x`, i.e., the value of the derivative of `f` with respect to `x`. Does this value make sense to you?

In [None]:
x.grad
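
To practise reading gradients on a different function, here is a small sketch using a hypothetical tensor `w` and the function $g(w) = 3w^3 + 2w$, whose derivative is $9w^2 + 2$. It also illustrates that gradients accumulate across successive `backward()` calls, which is why training loops reset them.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

g = 3 * w ** 3 + 2 * w
g.backward()
grad_first = w.grad.item()        # 9 * 2^2 + 2 = 38
print(grad_first)

h = w ** 2
h.backward()                      # adds dh/dw = 2 * 2 = 4 to the stored gradient
grad_accumulated = w.grad.item()  # 38 + 4 = 42
print(grad_accumulated)

w.grad.zero_()                    # reset the stored gradient in place
print(w.grad.item())
```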

### Hardware acceleration

PyTorch directly supports CUDA-enabled NVIDIA GPUs and Apple's MPS backend. You can check whether either is available and otherwise fall back on the CPU:

In [None]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

In [None]:
device

To perform computations with tensors on a GPU, you need to move your tensors to the GPU device:

In [None]:
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
M = M.to(device)

In [None]:
M.device

In [None]:
M = torch.tensor([[2., 4., 6.], [8., 6., 4.]], device=device)  # create the tensor directly on the selected device

In [None]:
M.device

In [None]:
M = torch.rand((1000, 1000))

In [None]:
%timeit M @ M.T

In [None]:
M = torch.rand((1000, 1000), device=device)  # same matrix size, created on the selected device

In [None]:
%timeit M @ M.T

How much faster is this matrix multiplication when computed on the GPU?

In [None]:
# TODO
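
A note on measuring GPU time: CUDA operations are launched asynchronously, so a timer can stop before the GPU has finished its work. The sketch below shows one way to time the multiplication more carefully by calling `torch.cuda.synchronize()` before reading the clock. The helper `time_matmul` and its parameters are hypothetical names introduced for this illustration.

```python
import time
import torch

def time_matmul(device, n=1000, repeats=10):
    """Return the average time in seconds of one n x n matrix product on `device`."""
    M = torch.rand((n, n), device=device)
    M @ M.T                           # warm-up run, excluded from the timing
    if device == "cuda":
        torch.cuda.synchronize()      # wait for the warm-up to finish
    start = time.perf_counter()
    for _ in range(repeats):
        M @ M.T
    if device == "cuda":
        torch.cuda.synchronize()      # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / repeats

print(f"cpu:  {time_matmul('cpu'):.6f} s")
if torch.cuda.is_available():
    print(f"cuda: {time_matmul('cuda'):.6f} s")
```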

## Coding time 2: Training a simple DNN for regression

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import matplotlib.pyplot as plt

In [None]:
%pip install torchmetrics
import torchmetrics

### Regression

#### Loading the dataset

Let's first load the California housing dataset from Scikit-Learn.

In [None]:
housing_dataset = fetch_california_housing()

Run the next cell to see which data format the dataset is loaded as.

In [None]:
type(housing_dataset)

What is the shape of the data?

In [None]:
# TODO

First, let's divide it into train/validation/test sets with a 60%/20%/20% ratio.

The next line of code splits the data into train/test sets with an 80%/20% ratio. Extend the code to create the validation set as well.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(housing_dataset.data, housing_dataset.target, test_size=0.2)
# TODO

How many samples are there in each set?

What is the type of data structure used to store the sets?

In [None]:
# TODO

As we saw before, PyTorch works with tensors. We need to convert the sets and associated targets to tensors.

In [None]:
X_train = torch.FloatTensor(X_train)
X_valid = torch.FloatTensor(X_valid)
X_test = torch.FloatTensor(X_test)

y_train = torch.FloatTensor(y_train)
y_valid = torch.FloatTensor(y_valid)
y_test = torch.FloatTensor(y_test)

Next, we will do some quick pre-processing by standardising the values of the attributes. We can do it manually this time, by computing the mean and standard deviation of each attribute on the training set.

In [None]:
means = X_train.mean(axis=0, keepdims=True)
stds = X_train.std(axis=0, keepdims=True)
X_train = (X_train - means) / stds
X_valid = (X_valid - means) / stds
X_test = (X_test - means) / stds
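
To see what standardisation does, here is a minimal sketch on synthetic data (the tensors `X_tr` and `X_va` are made up and merely stand in for a training and a validation set). After standardising with the training statistics, each training attribute has a mean close to 0 and a standard deviation close to 1. The validation attributes are only approximately standardised, because their own statistics differ slightly.

```python
import torch

torch.manual_seed(0)
scales = torch.tensor([1.0, 10.0, 100.0])    # attributes on very different scales
X_tr = torch.randn(200, 3) * scales + 5.0    # synthetic "training" data
X_va = torch.randn(50, 3) * scales + 5.0     # synthetic "validation" data

tr_means = X_tr.mean(dim=0, keepdim=True)    # statistics from the training set only
tr_stds = X_tr.std(dim=0, keepdim=True)

X_tr_std = (X_tr - tr_means) / tr_stds
X_va_std = (X_va - tr_means) / tr_stds       # reuse the training statistics

print(X_tr_std.mean(dim=0))  # close to 0 for every attribute
print(X_tr_std.std(dim=0))   # close to 1 for every attribute
print(X_va_std.mean(dim=0))  # near 0, but not exactly
```

Reusing the training statistics for the validation and test sets keeps information from those sets out of the pre-processing.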

What are the shapes of `y_train`, `y_valid` and `y_test`?

In [None]:
# TODO

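These are 1D tensors; however, PyTorch models generally expect 2D tensors of shape (number of samples, number of values per sample), which is also the shape of the predictions they return. A 1D tensor may be treated differently from a 2D tensor by some operations, such as matrix multiplication or the broadcasting that happens when computing a loss.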

Therefore, we need to reshape our target tensors to 2D tensors.

In [None]:
y_train = y_train.reshape(-1, 1)
y_valid = y_valid.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

In [None]:
print(y_train.shape)
print(y_valid.shape)
print(y_test.shape)
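
One concrete reason the shapes matter is shown in the sketch below, which uses small made-up tensors. When predictions of shape `(3, 1)` are compared with targets of shape `(3,)`, broadcasting silently produces a `(3, 3)` matrix of differences, and `nn.MSELoss` returns a wrong value (PyTorch only emits a warning, which is silenced here).

```python
import warnings
import torch
import torch.nn as nn

y_pred_demo = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), as a model would output
y_true_1d = torch.tensor([1.0, 2.0, 3.0])          # shape (3,)
y_true_2d = y_true_1d.reshape(-1, 1)               # shape (3, 1)

diff_shape = (y_pred_demo - y_true_1d).shape       # broadcast to (3, 3)
print(diff_shape)

loss_demo = nn.MSELoss()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")                # hide the size-mismatch warning
    loss_mismatched = loss_demo(y_pred_demo, y_true_1d)
loss_matched = loss_demo(y_pred_demo, y_true_2d)

print(loss_mismatched.item())  # non-zero although the predictions are perfect
print(loss_matched.item())     # 0.0, as expected
```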

#### Declaring the model

Next, let's create our first multilayer neural network! The easiest way is to use the PyTorch `nn.Sequential` module, which lets you create a stack of layers.

The following cell defines a neural network with:
- 2 hidden layers, the first one with 50 neurons and the second one with 40 neurons. Both use the ReLU activation function (we will cover it next week).
- 1 output layer.

Note that we declared all the layers as fully connected layers, also called dense layers, with the `nn.Linear` module. To create a `Linear` layer, we need to pass the shape of the parameter matrix as arguments. This shape corresponds to the layer's $number\ of\ inputs \times number\ of\ outputs$.

What values should be assigned to the variables `n_attributes` and `n_outputs`?
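
As a small illustration of these shapes, the sketch below builds a hypothetical layer with 3 inputs and 2 outputs (sizes chosen arbitrarily, not those of the housing model). Note that PyTorch stores the weight matrix with shape (number of outputs, number of inputs) and adds one bias per output.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)    # hypothetical sizes: 3 inputs, 2 outputs
batch = torch.rand(5, 3)   # 5 samples with 3 attributes each
out = layer(batch)

print(layer.weight.shape)  # (2, 3): outputs x inputs
print(layer.bias.shape)    # (2,): one bias per output
print(out.shape)           # (5, 2): one row of outputs per sample

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)            # 3 * 2 weights + 2 biases
```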

In [None]:
n_attributes = # TODO
n_outputs = # TODO
my_model = nn.Sequential(
    nn.Linear(n_attributes, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, n_outputs)
)

We can wrap this in a function as we might need to reset our model later on.

In [None]:
def set_model():
  # TODO
  return model

#### Training the model

Next, we need to set:
- the optimiser we want to use to train the model (let's use SGD),
- the loss function (let's use MSE),
- the learning rate (let's start at 0.1),
- the number of epochs (let's set it to 20).

In [None]:
model = set_model()  # create a freshly initialised model
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()
n_epochs = 20

Finally, let's define a simple training loop.

If you look at how to do this with TensorFlow, you will see that the module used to create a model already has a predefined method to train (*fit*) the model. With PyTorch, you have to write the training loop yourself. This can be seen as more tedious, but on the positive side, it also gives you more control. And it is a great way to break down and understand each step of the training!

For better reusability, let's create a function `train` in which we will build our training loop.

Complete the training loop in the function below by including the following instructions in the correct order:
- `loss.backward()`: calculates the gradient of the loss with respect to the model's parameters
- `optimizer.step()`: takes one optimisation step (updates the parameters)
- `y_pred = model(X_train)`: performs a forward pass
- `optimizer.zero_grad()`: resets the gradients of all the parameters handled by the optimiser
- `loss = loss_fn(y_pred, y_train)`: calculates the loss


In [None]:
def train(model, optimizer, loss_fn, X_train, y_train, n_epochs):
  for epoch in range(n_epochs):
    # Instruction 1
    # Instruction 2
    # Instruction 3
    # Instruction 4
    # Instruction 5
    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")

Now, we are ready to train our model!
Run the following function call to do so.

In [None]:
train(model, optimizer, mse, X_train, y_train, n_epochs)

Modify your training function to return a list of the loss values.

In [None]:
def train_v2(model, optimizer, loss_fn, X_train, y_train, n_epochs):
  losses = []
  for epoch in range(n_epochs):
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item()}")
    # TODO

Train the model again and plot the loss after training.

**Warning**: The training of your model will resume where it stopped. If you want to start training from scratch again, you first need to reset the model parameters, either by re-running the cells where you declared the model and optimiser or by using the `set_model()` function we declared for that purpose. In both cases, the optimiser must be re-created so that it points to the new parameters.

In [None]:
# TODO

In [None]:
# TODO

#### Implementing Mini-batch GD

What type of gradient descent have you used so far?

Let's now try to implement Mini-batch gradient descent.

To do so, we need to use the `DataLoader` class, which facilitates the loading of batches of data. To be able to use a `DataLoader`, we first need to wrap our dataset in a `TensorDataset` object (this provides the API that the `DataLoader` class expects).

In [None]:
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True) # Creates a DataLoader for loading batches of 32 random samples
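
To get a feel for what a `DataLoader` yields, here is a minimal sketch on synthetic data (the names `X_demo`, `y_demo` and `demo_loader` are made up for this illustration). With 100 samples and a batch size of 32, one pass over the loader gives four batches, the last one holding the remaining 4 samples.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X_demo = torch.rand(100, 8)  # synthetic data: 100 samples, 8 attributes
y_demo = torch.rand(100, 1)
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

# Iterating over the loader yields (inputs, targets) pairs, one pair per batch
batch_sizes = [X_batch.shape[0] for X_batch, y_batch in demo_loader]

print(len(demo_loader))  # number of batches per epoch
print(batch_sizes)       # sizes of the successive batches
```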

Let's also use the GPU to train the model this time. To do so, we need to move the model's parameters to the GPU with the following instruction.

**Warning**: Do not forget to reinitialise your model parameters first by re-running the cell where you defined the model.

In [None]:
model = set_model().to(device)
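
One pitfall when resetting a model is that an optimiser created earlier keeps pointing to the parameters of the old model. The sketch below illustrates this with a hypothetical `make_model()` helper standing in for `set_model()`. After the reset, none of the new parameters are tracked until the optimiser is created again.

```python
import torch
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for set_model(), kept tiny for the illustration
    return nn.Linear(3, 1)

def tracks_model(optimizer, model):
    """Check whether every parameter of `model` is handled by `optimizer`."""
    tracked = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return all(id(p) in tracked for p in model.parameters())

demo_model = make_model()
demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.1)

demo_model = make_model()                                  # reset the model only
stale = tracks_model(demo_optimizer, demo_model)
print(stale)                                               # the old optimiser misses the new parameters

demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.1)  # re-create the optimiser
fresh = tracks_model(demo_optimizer, demo_model)
print(fresh)
```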

Let's now update our training loop to:
- Calculate the gradient update over a batch of data and not the full dataset.
- Use the GPU to perform the training.

Look at the lines with the `# NEW` tag and try to understand what changed compared to the previous training loop.

In [None]:
def train_v3(model, optimizer, loss_fn, train_loader, n_epochs):
    losses = []
    model.train() # Puts the model in training mode, will be useful later on when we use other types of layers
    for epoch in range(n_epochs):
        epoch_loss = 0. # NEW
        for X_batch, y_batch in train_loader: # NEW
            X_batch, y_batch = X_batch.to(device), y_batch.to(device) # NEW
            y_pred = model(X_batch) # NEW
            loss = loss_fn(y_pred, y_batch)
            epoch_loss += loss.item() # NEW
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        mean_epoch_loss = epoch_loss / len(train_loader) # NEW
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {mean_epoch_loss:.4f}")
        losses.append(mean_epoch_loss)
    return losses

You can now train the model with the new training loop.

You can try lowering the learning rate by a factor of 10 if the training does not converge.

In [None]:
# TODO

We reached a lower loss than before, but each epoch took more time.

The lower loss can be explained partly by the fact that mini-batch GD introduces some stochasticity into the optimisation process (i.e., it can help escape local minima).

The longer time per epoch is explained by the fact that mini-batch GD makes several gradient updates per epoch (one per batch), while batch GD makes only one. However, we reached a lower loss with mini-batch GD in far fewer epochs than with batch GD.
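
To put numbers on the difference in update counts, here is a small arithmetic sketch. It assumes the California housing dataset contains 20,640 samples and that 60% of them end up in the training set. The exact training-set size depends on how you performed the split.

```python
import math

n_samples = 20_640                # assumed size of the California housing dataset
n_train = n_samples * 60 // 100   # assumed 60% training split
batch_size = 32
n_epochs = 20

updates_per_epoch_batch_gd = 1                              # one update on the full training set
updates_per_epoch_minibatch = math.ceil(n_train / batch_size)

print(n_train)
print(updates_per_epoch_minibatch)
print(updates_per_epoch_batch_gd * n_epochs)    # total updates with batch GD
print(updates_per_epoch_minibatch * n_epochs)   # total updates with mini-batch GD
```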

## Coding time 3: Model evaluation

### Validation loss

It is usually good to also evaluate the model's loss on the validation set after each epoch, e.g., to monitor for overfitting.

Let's update our train function to include this.

In [None]:
def train_v4(model, optimizer, loss_fn, train_loader, valid_loader, n_epochs):
    train_losses = []
    valid_losses = []

    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_train_loss = 0.
        for X_train_batch, y_train_batch in train_loader:
            X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
            y_train_pred = model(X_train_batch)
            train_loss = loss_fn(y_train_pred, y_train_batch)
            epoch_train_loss += train_loss.item()
            train_loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        mean_epoch_train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(mean_epoch_train_loss)

        # Validation
        model.eval()
        epoch_valid_loss = 0.
        with torch.no_grad():
            for X_valid_batch, y_valid_batch in valid_loader:
                X_valid_batch, y_valid_batch = X_valid_batch.to(device), y_valid_batch.to(device)
                y_valid_pred = model(X_valid_batch)
                valid_loss = loss_fn(y_valid_pred, y_valid_batch)
                epoch_valid_loss += valid_loss.item()
        mean_epoch_valid_loss = epoch_valid_loss / len(valid_loader)
        valid_losses.append(mean_epoch_valid_loss)

        print(f"Epoch {epoch + 1}/{n_epochs}, Training Loss: {mean_epoch_train_loss:.4f}, Valid Loss: {mean_epoch_valid_loss:.4f}")

    return (train_losses, valid_losses)
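
The validation part of the loop combines two separate mechanisms: `model.eval()` switches layers such as dropout to their inference behaviour, while `torch.no_grad()` stops Autograd from recording operations. The sketch below illustrates the difference on a small made-up model. The dropout layer is an assumption used only to have a layer that behaves differently in training and evaluation modes.

```python
import torch
import torch.nn as nn

demo_model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
X_demo = torch.rand(6, 4)

demo_model.eval()                   # dropout is now disabled
out_tracked = demo_model(X_demo)    # Autograd still records this forward pass
with torch.no_grad():
    out_untracked = demo_model(X_demo)  # same values, but nothing is recorded

print(demo_model.training)          # False after eval()
print(out_tracked.requires_grad)    # True: eval() alone does not disable Autograd
print(out_untracked.requires_grad)  # False inside no_grad()
```

Skipping the recording saves memory and time during evaluation, since no gradients are needed there.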

In [None]:
valid_dataset = TensorDataset(X_valid, y_valid)
valid_loader = DataLoader(valid_dataset, batch_size=32)  # no shuffling needed for evaluation

In [None]:
model = set_model().to(device)
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()
n_epochs = 20
train_losses, valid_losses = train_v4(model, optimizer, mse, train_loader, valid_loader, n_epochs)

Do you notice anything strange with the learning curves?

What could we do to fix this?

### Evaluation metrics and classification

Let's now train an image classifier and use evaluation metrics.  

`torchvision` is the PyTorch module containing popular datasets, model architectures, and common image transformations for computer vision. Today, we will just use it to load the Fashion MNIST dataset and apply some quick pre-processing.

In [None]:
import torchvision
import torchvision.transforms.v2 as T

The following instructions are used to:
1. Define a pre-processing function to convert images to the PyTorch `Image` datatype (a subclass of `Tensor`), with float32 type and pixel values scaled to between 0 and 1 (from 0 to 255 in the original images).
2. Load the Fashion MNIST dataset (train and test sets) and apply the pre-processing.
3. Split the training data into training and validation sets.

In [None]:
toTensor = T.Compose([T.ToImage(), T.ToDtype(torch.float32, scale=True)]) # Define a pre-processing function to convert loaded images to adequate format
train_and_valid_data = torchvision.datasets.FashionMNIST(root="datasets", train=True, download=True, transform=toTensor)
test_data = torchvision.datasets.FashionMNIST(root="datasets", train=False, download=True, transform=toTensor)
train_data, valid_data = torch.utils.data.random_split(train_and_valid_data, [55_000, 5_000])

Create the data loaders.

In [None]:
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=32)
test_loader = DataLoader(test_data, batch_size=32)

Look at the shape of the first image, its data type and the target class name.

In [None]:
X_sample, y_sample = train_data[0]
X_sample.shape

In [None]:
X_sample.dtype

In [None]:
train_and_valid_data.classes[y_sample]

You can find below a more structured way of declaring a model. This can give you more freedom in terms of architecture design.

The code defines a class, inheriting from `nn.Module`, to instantiate models with 2 hidden layers and one output layer. The user can pass the number of inputs, the number of neurons in hidden layers 1 and 2, and the number of classes when instantiating a model.

The `forward` method has to be present if you use this approach, as it is automatically called when you perform a forward pass through the model (i.e., `model(X)`).

In [None]:
class ImageClassifier(nn.Module):
    def __init__(self, n_inputs, n_hidden1, n_hidden2, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_inputs, n_hidden1),
            nn.ReLU(),
            nn.Linear(n_hidden1, n_hidden2),
            nn.ReLU(),
            nn.Linear(n_hidden2, n_classes)
        )
    def forward(self, X):
        return self.mlp(X)
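
To check how data flows through such a class, here is a minimal sketch with a hypothetical, smaller `TinyClassifier` and random tensors in place of real images. It shows that calling the model invokes `forward`, that `nn.Flatten` turns each 1 x 28 x 28 image into a vector of 784 values, and that the output contains one score per class.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):  # hypothetical, smaller than ImageClassifier
    def __init__(self, n_inputs, n_hidden, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),                   # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(n_inputs, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, X):
        return self.mlp(X)

demo_model = TinyClassifier(n_inputs=28 * 28, n_hidden=16, n_classes=10)
fake_images = torch.rand(4, 1, 28, 28)         # batch of 4 random "images"

flat_shape = nn.Flatten()(fake_images).shape   # shape seen by the first Linear layer
logits = demo_model(fake_images)               # calls forward() under the hood

print(flat_shape)
print(logits.shape)
```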


Let's update our previous training loop to also calculate and return the evaluation metric.

In [None]:
def train_v5(model, optimizer, loss_fn, eval_metric, train_loader, valid_loader, n_epochs):
    train_losses = []
    train_eval_metrics = []
    valid_losses = []
    valid_eval_metrics = []

    for epoch in range(n_epochs):

        # Model evaluation
        model.eval()
        eval_metric.reset() # Reset the eval metric
        epoch_valid_loss = 0.
        with torch.no_grad():
            for X_valid_batch, y_valid_batch in valid_loader:
                X_valid_batch, y_valid_batch = X_valid_batch.to(device), y_valid_batch.to(device)
                y_valid_pred = model(X_valid_batch)
                valid_loss = loss_fn(y_valid_pred, y_valid_batch)
                epoch_valid_loss += valid_loss.item()
                eval_metric.update(y_valid_pred, y_valid_batch) # Update eval metric for validation
        mean_epoch_valid_loss = epoch_valid_loss / len(valid_loader)
        valid_losses.append(mean_epoch_valid_loss)
        # Calculate and store validation eval metric for this epoch
        epoch_valid_eval_metric = eval_metric.compute().item()
        valid_eval_metrics.append(epoch_valid_eval_metric)

        # Training
        eval_metric.reset() # Reset the eval metric
        model.train()
        epoch_train_loss = 0.
        for X_train_batch, y_train_batch in train_loader:
            X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
            y_train_pred = model(X_train_batch)
            train_loss = loss_fn(y_train_pred, y_train_batch)
            epoch_train_loss += train_loss.item()
            train_loss.backward()
            eval_metric.update(y_train_pred, y_train_batch) # Update eval metric for training
            optimizer.step()
            optimizer.zero_grad()
        mean_epoch_train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(mean_epoch_train_loss)
        # Calculate and store training eval metric for this epoch
        epoch_training_eval_metric = eval_metric.compute().item()
        train_eval_metrics.append(epoch_training_eval_metric)

        print(f"Epoch {epoch + 1}/{n_epochs}, Training Loss: {mean_epoch_train_loss:.4f}, Valid Loss: {mean_epoch_valid_loss:.4f}")
        print(f"Epoch {epoch + 1}/{n_epochs}, Training Eval Metric: {epoch_training_eval_metric:.4f}, Valid Eval Metric: {epoch_valid_eval_metric:.4f}")

    return (train_losses, valid_losses, train_eval_metrics, valid_eval_metrics)

Create an instance of the model and define the loss as Cross Entropy (loss for classification), the evaluation metric as accuracy, the optimiser as SGD (which works as mini-batch GD in our setup) and the number of epochs to 10.

Move the model to the GPU and start the training (will take a few minutes on the Colab T4 GPU).

In [None]:
model = ImageClassifier(n_inputs=28 * 28, n_hidden1=300, n_hidden2=100, n_classes=10)
xentropy = nn.CrossEntropyLoss()
accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
n_epochs = 10

model = model.to(device)
train_losses, valid_losses, train_accuracy, valid_accuracy = train_v5(model, optimizer, xentropy, accuracy, train_loader, valid_loader, n_epochs)
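
To make the loss and the metric more tangible, here is a small sketch on made-up logits for 4 samples and 3 classes. It computes accuracy by hand with plain PyTorch rather than `torchmetrics`. It also checks that `nn.CrossEntropyLoss`, which takes raw logits and integer class labels, matches the mean negative log-softmax score of the correct classes.

```python
import torch
import torch.nn as nn

demo_logits = torch.tensor([[2.0, 0.5, 0.1],
                            [0.2, 0.3, 3.0],
                            [1.0, 2.0, 0.0],
                            [0.1, 0.2, 0.3]])  # made-up scores: 4 samples, 3 classes
demo_targets = torch.tensor([0, 2, 0, 2])      # integer class labels

# Cross entropy computed by PyTorch and by hand
demo_loss = nn.CrossEntropyLoss()(demo_logits, demo_targets)
log_probs = torch.log_softmax(demo_logits, dim=1)
manual_loss = -log_probs[torch.arange(4), demo_targets].mean()

# Accuracy: fraction of samples whose largest logit is at the target class
demo_preds = demo_logits.argmax(dim=1)
demo_accuracy = (demo_preds == demo_targets).float().mean()

print(demo_loss.item(), manual_loss.item())
print(demo_preds.tolist())
print(demo_accuracy.item())
```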

In [None]:
plt.plot(train_losses[1:], label='Training loss')
plt.plot(valid_losses[1:], label='Validation loss')
plt.plot(train_accuracy[1:], label='Training accuracy')
plt.plot(valid_accuracy[1:], label='Validation accuracy')
plt.grid()
plt.legend()

You can now use the model to make predictions on "new" images.

In [None]:
model.eval()
X_new, y_new = next(iter(valid_loader))
X_new = X_new[:3].to(device)
with torch.no_grad():
  y_pred_logits = model(X_new)
y_pred = y_pred_logits.argmax(axis=1) # index of the largest logit
y_pred

In [None]:
[train_and_valid_data.classes[index] for index in y_pred]