# Intermediate architectures and advanced PyTorch tools
## TD 4

We are essentially going to use the same `Food101` ([credit where it's due](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/)) data, the same object `ImageDataset`, the same `DataLoader`.

The code below is mainly a copy of the code from the previous TD, except that global variables are now defined separately and everything is wrapped in functions. This makes it easier to train the same model with different hyperparameters, architectures, and so on.

For those that can use their GPUs, all the necessary `.to(device)` are already in the code.

If you encounter the error `OutOfMemoryError: CUDA out of memory.`, it means that your GPU does not have enough memory to run the model. You can try to reduce the batch size, the number of neurons or layers in the network, the number of filters in the convolutional layers, and so on.

In [2]:
# Imports

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import pathlib
import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Set the random seed for reproducibility
_ = torch.manual_seed(25)

You can set the `flush` parameter to `True` for all `print()` calls in Python by overriding the built-in `print()` function with `functools.partial()`. For example:

```py
from functools import partial
print = partial(print, flush=True)
```

We will use this to make sure that the outputs are printed in the correct order and at the correct time (for more info, check [this link](https://www.includehelp.com/python/flush-parameter-in-python-with-print-function.aspx)).

In [3]:
from functools import partial
print = partial(print, flush=True)

In [4]:
# Global variables

# Setup device-agnostic code
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {DEVICE} device")

# Batch size
BATCH_SIZE = 8

# Learning rate
LEARNING_RATE = 2e-2

# Number of epochs
NUM_EPOCHS = 15

# Number of classes
NUM_CLASSES = 3

Using cuda device


In [5]:
def get_datasets_and_dataloaders(
    batch_size: int = 4
) -> tuple[
    datasets.ImageFolder, 
    datasets.ImageFolder, 
    DataLoader, 
    DataLoader
]:
    """
    Load the training and test datasets into data loaders.
    """
    data_dir = pathlib.Path("data")
    train_dir = data_dir / "Food-3" / "train"
    test_dir = data_dir / "Food-3" / "test"

    data_transform = transforms.Compose(
        [
            transforms.Resize(size=(64, 64)),  # Resize the images to 64x64
            transforms.ToTensor()  # Convert the images to tensors
        ]
    )

    train_data = datasets.ImageFolder(
        root=train_dir,  # target folder of images
        transform=data_transform,  # transforms to perform on data (images)
        target_transform=None  # transforms to perform on labels (if necessary)
    ) 

    test_data = datasets.ImageFolder(
        root=test_dir,
        transform=data_transform
    )

    train_dataloader = DataLoader(
        dataset=train_data,
        batch_size=batch_size,  # how many samples per batch?
        shuffle=True  # shuffle the data?
    )

    test_dataloader = DataLoader(
        dataset=test_data,
        batch_size=batch_size,
        shuffle=False
    ) # don't usually need to shuffle testing data


    return train_data, test_data, train_dataloader, test_dataloader

In [6]:
# Load dataloaders in global variables
TRAIN_DATASET, TEST_DATASET, TRAIN_DATALOADER, TEST_DATALOADER = get_datasets_and_dataloaders(BATCH_SIZE)

# We actually don't really need to return the datasets, but it's nice to have them for reference. If you don't,
# you can just return the dataloaders and find the datasets by calling TRAIN_DATALOADER.dataset or TEST_DATALOADER.dataset:
print(TRAIN_DATALOADER.dataset == TRAIN_DATASET)
print(TEST_DATALOADER.dataset == TEST_DATASET)

True
True


In [7]:
class Net(nn.Module):
    def __init__(self, hidden_units=200):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(64*64*3, hidden_units)
        self.fc2 = nn.Linear(hidden_units, NUM_CLASSES)

    def forward(self, x):
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x

In [8]:
# Create model
MODEL: Net = Net().to(DEVICE)

In [9]:
def test_our_model() -> float:
    # 0. Put model in eval mode
    MODEL.eval()  # to remove stuff like dropout that's only going to be in the training part

    # 1. Setup test accuracy value
    test_acc: float = 0

    # 2. Turn on inference context manager
    with torch.no_grad():
        # Loop through DataLoader batches
        for X_test, y_test in TEST_DATALOADER:  # capital X because it is a "matrix" (a batch of inputs); y holds integer labels
            # a. Move data to device
            X_test_flattened = X_test.view(-1, 64*64*3).to(DEVICE) 
            y_test = y_test.to(DEVICE)

            # b. Forward pass
            model_output = MODEL(X_test_flattened)

            # c. Calculate and accumulate accuracy
            test_pred_label = model_output.argmax(dim=1)
            test_acc += (test_pred_label == y_test).sum()

    # Divide the number of correct predictions by the number of test samples
    test_acc = test_acc / (len(TEST_DATASET))
    return test_acc.item()

In [10]:
# Test our untrained model
print((f"{100*test_our_model():.2f}%"))

36.00%


You should get 36.00% accuracy on the testing set without training and with the default hyperparameters if you used the same seed.

---

Why does it not work with ` X_test_flattened = X_test.view(BATCH_SIZE, 64*64*3).to(DEVICE)`?

---

In [11]:
def main_train(loss_fn, optimizer) -> None:
    """
    Train the global MODEL; its parameters are modified in place.
    """
    start_time_global = time.time()

    # Put model in train mode
    MODEL.train()

    # Loop through data loader data batches
    for epoch in range(NUM_EPOCHS):
        start_time_epoch = time.time()

        # Setup train loss and train accuracy values
        train_loss, train_acc = 0, 0

        for X, y in TRAIN_DATALOADER:
            # 0. Move data to device
            X = X.view(-1, 64*64*3).to(DEVICE)
            y = y.to(DEVICE)

            # 1. Forward pass
            y_pred = MODEL(X)

            # 2. Calculate and accumulate loss
            loss = loss_fn(y_pred, y)
            train_loss += loss.item()

            # 3. Optimizer zero grad
            optimizer.zero_grad()

            # 4. Loss backward
            loss.backward()

            # 5. Optimizer step
            optimizer.step()

            # Calculate and accumulate accuracy metric across all batches
            y_pred_class = y_pred.argmax(dim=1)
            train_acc += (y_pred_class == y).sum()

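        # Normalise by the number of training samples.
        # Note: loss_fn already averages over each batch, so train_loss here is
        # roughly the mean per-sample loss divided by BATCH_SIZE.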
        train_loss = train_loss / (len(TRAIN_DATASET))
        train_acc = train_acc / (len(TRAIN_DATASET))
        print(
            f"epoch {epoch+1}/{NUM_EPOCHS},"
            f" train_loss = {train_loss:.2e},"
            f" train_acc = {100*train_acc.item():.2f}%,"
            f" time spent during this epoch = {time.time() - start_time_epoch:.2f}s,"
            f" total time spent = {time.time() - start_time_global:.2f}s"
        )

In [12]:
main_train(nn.CrossEntropyLoss(), torch.optim.SGD(MODEL.parameters(), lr=LEARNING_RATE))

epoch 1/15, train_loss = 1.25e-01, train_acc = 49.81%, time spent during this epoch = 6.92s, total time spent = 6.92s
epoch 2/15, train_loss = 1.16e-01, train_acc = 54.41%, time spent during this epoch = 6.06s, total time spent = 12.99s
epoch 3/15, train_loss = 1.12e-01, train_acc = 57.63%, time spent during this epoch = 6.46s, total time spent = 19.45s
epoch 4/15, train_loss = 1.09e-01, train_acc = 59.33%, time spent during this epoch = 6.33s, total time spent = 25.78s
epoch 5/15, train_loss = 1.08e-01, train_acc = 60.41%, time spent during this epoch = 6.32s, total time spent = 32.10s
epoch 6/15, train_loss = 1.05e-01, train_acc = 60.67%, time spent during this epoch = 5.98s, total time spent = 38.08s
epoch 7/15, train_loss = 1.03e-01, train_acc = 61.30%, time spent during this epoch = 5.98s, total time spent = 44.06s
epoch 8/15, train_loss = 1.01e-01, train_acc = 62.48%, time spent during this epoch = 5.98s, total time spent = 50.04s
epoch 9/15, train_loss = 9.84e-02, train_acc = 65

In [13]:
print((f"{100*test_our_model():.2f}%"))

55.67%


You should get 55.67% accuracy on the testing set after training with the default hyperparameters if you used the same seed. We have almost reached convergence: the loss is barely decreasing anymore, and if you train for more epochs, you will see the testing set accuracy decrease. Note that we kind of cheated by using the testing set to choose the number of epochs. We should instead use validation sets and cross-validation techniques ... and we will (today)! No worries.

-----

Is it possible for `train_loss` to decrease whilst `train_acc` decreases at the same time? Look at what happens between epochs 10 and 11 here:

```
epoch 10/15, train_loss = 9.64e-02, train_acc = 65.11%, [...], total time spent = 121.83s
epoch 11/15, train_loss = 9.54e-02, train_acc = 64.78%, [...], total time spent = 134.67s
```

Why is that?

-----
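
One way to convince yourself that the answer can be yes is the small sketch below. It is a simplified binary setting with made-up probabilities (not values from the model above): the loss rewards confidence on the true class, while accuracy only counts which side of the decision threshold each sample falls on.

```python
import math

def mean_cross_entropy(probs_true_class):
    """Mean of -log(p), where p is the probability given to the true class."""
    return sum(-math.log(p) for p in probs_true_class) / len(probs_true_class)

def accuracy(probs_true_class, threshold=0.5):
    """Binary case: a sample is correct when its true class gets more than half the mass."""
    return sum(p > threshold for p in probs_true_class) / len(probs_true_class)

# Hypothetical probabilities assigned to the TRUE class of 4 samples.
before = [0.51, 0.51, 0.51, 0.05]  # three barely correct, one badly wrong
after = [0.49, 0.90, 0.90, 0.30]   # one flips to barely wrong, the rest get more confident

loss_before, loss_after = mean_cross_entropy(before), mean_cross_entropy(after)
acc_before, acc_after = accuracy(before), accuracy(after)
print(f"before: loss = {loss_before:.3f}, acc = {acc_before:.2f}")
print(f"after:  loss = {loss_after:.3f}, acc = {acc_after:.2f}")
```

Here both the loss and the accuracy go down between the two snapshots.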

## Let's try to improve this accuracy!

You will need to install the Optuna package (`pip install optuna`) and import it at the beginning of your script. We should also import KFold from sklearn.model_selection. This is because we will use cross-validation to find the best hyperparameters.

In [14]:
import optuna
from sklearn.model_selection import KFold

The first, easy task is to decide whether to use a convolutional network or a dense network.

We will do this together (the choice between a convolutional and a dense network), and then you will have to implement the optimization of the learning rate* and of the choice of optimizer on your own.

\* *Careful! Small learning rates are not always better, especially if you do not change the number of epochs. You should try to find the best learning rate for the number of epochs you chose, and pick a number of epochs that is not too large for your computer to handle.*

In [15]:
class AdvancedNet(nn.Module):
    def __init__(self, use_conv: bool, hidden_units: int = 200):
        super(AdvancedNet, self).__init__()
        self.use_conv: bool = use_conv
        if use_conv:
            self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
            # output of this layer will be ((64+2*1-3)/1)+1 = 64. 
            # -> 64 channels of 64x64 images
            self.fc1 = nn.Linear(64*64*64, hidden_units)  # flattening will be necessary to enter fc1
            self.fc2 = nn.Linear(hidden_units, NUM_CLASSES)
        else:
            self.fc1 = nn.Linear(3*64*64, hidden_units)
            self.fc2 = nn.Linear(hidden_units, NUM_CLASSES)

    def forward(self, x):
        if self.use_conv:
            x = nn.ReLU()(self.conv(x))
            x = x.view(-1, 64*64*64)  # flattening is necessary, and, same as above,
            # we need to use -1 and not BATCH_SIZE because the last batch might be smaller
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x
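
The remark about `-1` versus `BATCH_SIZE` can be checked without PyTorch. The sketch below is a torch-free illustration with a hypothetical dataset of 20 images: it lists the batch sizes obtained when the last, incomplete batch is kept (the default `DataLoader` behaviour), and shows that only the full batches hold exactly `BATCH_SIZE * features` elements.

```python
def batch_sizes(n_samples: int, batch_size: int) -> list[int]:
    """Sizes of consecutive batches when the last, incomplete batch is kept."""
    return [min(batch_size, n_samples - start) for start in range(0, n_samples, batch_size)]

FEATURES = 64 * 64 * 3
BATCH = 8  # same value as BATCH_SIZE in this notebook

sizes = batch_sizes(n_samples=20, batch_size=BATCH)  # hypothetical dataset of 20 images
print(sizes)

for size in sizes:
    n_elements = size * FEATURES
    # A fixed first dimension needs exactly BATCH * FEATURES elements ...
    fits_fixed_shape = n_elements == BATCH * FEATURES
    # ... whereas -1 lets the first dimension be inferred from the element count.
    inferred_rows = n_elements // FEATURES
    print(f"batch of {size}: fits (BATCH, FEATURES)? {fits_fixed_shape}, inferred rows: {inferred_rows}")
```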

 Then, you will need to define a new function that will be used as the objective function for Optuna's optimization. This function should take in the `trial` object from Optuna as an argument and use the `trial` object to define and sample the hyperparameters that you want to optimize. For example, you can use the `trial` object to sample a choice between a convolutional and dense network, and to sample the number of neurons for the chosen network. After training the model, we will need to return the final validation accuracy calculated with cross-validation* as the objective function value for Optuna to maximise.

 \* We use cross-validation here (3-fold) because we want to use the testing set as little as possible. We will use the testing set only once, at the end, to get the final accuracy of the best model. But, cross-validation greatly increases the time required to run the algorithms, so we won't always use cross-validation to optimize hyperparameters.
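
To make the idea of folds concrete, here is a hand-rolled, unshuffled stand-in for a 3-fold split. It is only an illustration of the principle, not scikit-learn's implementation, and the dataset size of 10 is made up. Each index lands in exactly one validation fold and is in the training part of the other folds.

```python
def k_fold_indices(n_samples: int, n_splits: int):
    """Yield (train_idx, valid_idx) lists; each index is in exactly one validation fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```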

In [16]:
def objective(trial: optuna.trial.Trial) -> float:
    print("New trial")

    # Set up cross validation
    n_splits: int = 3
    fold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = [0]*n_splits

    use_conv: bool = trial.suggest_categorical('use_conv', [True, False])

    # Loop through the cross-validation folds
    for fold_idx, (train_idx, valid_idx) in enumerate(fold.split(range(len(TRAIN_DATASET)))):
        # train_idx and valid_idx are numpy arrays of indices of the training and validation sets for this fold respectively.
        # They do not contain the actual data, but the indices of the data in the dataset.
        # We can use these indices to create a subset of the dataset for this fold with torch.utils.data.Subset.
        # Obviously, if an index is in the validation set, it will not be in the training set. You can
        # check this by printing train_idx and valid_idx and check by yourself.
        
        print(f"Fold {fold_idx+1}/{n_splits}")

        # Create subsets of the dataset for this fold
        sub_train_data = torch.utils.data.Subset(TRAIN_DATASET, train_idx)
        sub_valid_data = torch.utils.data.Subset(TRAIN_DATASET, valid_idx)

        # Create data loaders for this fold
        sub_train_loader = torch.utils.data.DataLoader(sub_train_data, batch_size=BATCH_SIZE, shuffle=True)
        sub_valid_loader = torch.utils.data.DataLoader(sub_valid_data, batch_size=BATCH_SIZE, shuffle=False)
        
        # Generate the model.
        my_model: AdvancedNet = AdvancedNet(use_conv).to(DEVICE)
        
        # Set up the optimizer and the loss function once per fold
        # (plain SGD keeps no internal state, so this matches re-creating it every epoch)
        optimizer = torch.optim.SGD(my_model.parameters(), lr=LEARNING_RATE)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(NUM_EPOCHS):
            # Training of the model.
            # Put model in train mode
            my_model.train()

            for X, y in sub_train_loader:
                # 0. Reshape data to input to the network
                if use_conv:
                    pass
                else:
                    X = X.view(-1, 64*64*3)

                # 1. Move data to device
                X = X.to(DEVICE)
                y = y.to(DEVICE)

                # 2. Forward pass
                y_pred = my_model(X)

                # 3. Calculate and accumulate loss
                loss = loss_fn(y_pred, y)

                # 4. Optimizer zero grad
                optimizer.zero_grad()

                # 5. Loss backward
                loss.backward()

                # 6. Optimizer step
                optimizer.step()

        # Validation of the model.
        # Put model in eval mode
        my_model.eval()
        
        val_acc = 0
        with torch.no_grad():
            for X, y in sub_valid_loader:
                # 0. Reshape data to input to the network
                if use_conv:
                    pass
                else:
                    X = X.view(-1, 64*64*3)
                
                # 1. Move data to device
                X = X.to(DEVICE)
                y = y.to(DEVICE)

                # 2. Forward pass
                y_pred = my_model(X)
                
                # 3. Compute accuracy
                y_pred_class = y_pred.argmax(dim=1)

                val_acc += (y_pred_class == y).sum()

        scores[fold_idx] = (val_acc / len(sub_valid_data)).cpu()
        # .cpu() moves the score off the GPU; np.mean cannot convert CUDA tensors
        print(f"Fold {fold_idx+1}/{n_splits} accuracy: {scores[fold_idx]}")
    
    return np.mean(scores)

Finally, we will need to call the `optuna.create_study()` function to create a new study, and use the `study.optimize()` function to run the optimization, passing the objective function that we defined earlier.

You can find more information about how to use Optuna in the [Optuna documentation](https://optuna.readthedocs.io/en/stable/index.html).

In [17]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, timeout=1200, n_trials = 2) 
# - timeout=1200 -> stops after 20 minutes; 
# - n_trials = 2 -> here we only try two models, a dense or a convolutional model so
#   we need to make it stop after having trained the two models otherwise it will continue to 
#   loop on those two models unless it reaches the 20 minutes mark*. In practice, you will give
#   a lot of hyperparameters to optimize and you will want to run the optimization for a lot
#   longer than 20 minutes. The timeout parameter is useful in those cases because you won't 
#   know how long it'll take.
#   * e.g., https://i.imgur.com/bCzH1pm.png

pruned_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
complete_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]

print("\n")
print("--------------------")
print("--------------------")
print("--------------------")
print("\n")
print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)

print("  Params: ")
for key, value in trial.params.items():
    print(f"\t{key}: {value}")

[I 2023-01-24 13:44:11,556] A new study created in memory with name: no-name-7b673fbf-1ae0-4fcd-859e-ea34fb593c6b


New trial
Fold 1/3
Fold 1/3 accuracy: 0.6166666746139526
Fold 2/3
Fold 2/3 accuracy: 0.6344444751739502
Fold 3/3
Fold 3/3 accuracy: 0.6422222256660461


[I 2023-01-24 13:47:24,304] Trial 0 finished with value: 0.6311111450195312 and parameters: {'use_conv': True}. Best is trial 0 with value: 0.6311111450195312.


New trial
Fold 1/3
Fold 1/3 accuracy: 0.6366666555404663
Fold 2/3
Fold 2/3 accuracy: 0.6277778148651123
Fold 3/3
Fold 3/3 accuracy: 0.6333333253860474


[I 2023-01-24 13:50:43,063] Trial 1 finished with value: 0.6325926184654236 and parameters: {'use_conv': True}. Best is trial 1 with value: 0.6325926184654236.




--------------------
--------------------
--------------------


Study statistics: 
  Number of finished trials:  2
  Number of pruned trials:  0
  Number of complete trials:  2
Best trial:
  Value:  0.6325926184654236
  Params: 
	use_conv: True


Many of you may run into a problem: we only allowed two trials, but `Optuna` tried `False` then `False`, or `True` then `True`. This is because `Optuna` does not check whether a set of hyperparameters has already been tried. To fix this, we can add the following code to the objective function:

```py
from optuna.trial import TrialState

...

for previous_trial in trial.study.trials:
    if previous_trial.state == TrialState.COMPLETE and trial.params == previous_trial.params:
        print(f"Duplicated trial: {trial.params}, return {previous_trial.value}")
        return previous_trial.value
```

Then set `n_trials` to 5, for example: duplicated trials return immediately, and it becomes unlikely that one of the two options is never tried.
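
As a rough back-of-the-envelope check, the sketch below computes the chance that one of the two options is never sampled. It assumes that each trial picks `True` or `False` uniformly and independently, which is a simplifying assumption and not necessarily what Optuna's sampler does.

```python
def prob_one_option_missed(n_trials: int) -> float:
    """P(all trials pick the same of two options) under uniform, independent sampling."""
    # Either every trial picks True or every trial picks False
    return 2 * 0.5 ** n_trials

for n in (2, 5, 10):
    print(f"n_trials = {n}: {100 * prob_one_option_missed(n):.2f}% chance of missing one option")
```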

In [18]:
from optuna.trial import TrialState

def objective(trial: optuna.trial.Trial) -> float:
    print("New trial")

    # Set up cross validation
    n_splits: int = 3
    fold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = [0]*n_splits

    use_conv: bool = trial.suggest_categorical('use_conv', [True, False])

    # Check if this trial has already been run before
    for previous_trial in trial.study.trials:
        if previous_trial.state == TrialState.COMPLETE and trial.params == previous_trial.params:
            print(f"Duplicated trial: {trial.params}, return {previous_trial.value}")
            return previous_trial.value

    # Loop through the cross-validation folds
    for fold_idx, (train_idx, valid_idx) in enumerate(fold.split(range(len(TRAIN_DATASET)))):
        # train_idx and valid_idx are numpy arrays of indices of the training and validation sets for this fold respectively.
        # They do not contain the actual data, but the indices of the data in the dataset.
        # We can use these indices to create a subset of the dataset for this fold with torch.utils.data.Subset.
        # Obviously, if an index is in the validation set, it will not be in the training set. You can
        # check this by printing train_idx and valid_idx and check by yourself.
        
        print(f"Fold {fold_idx+1}/{n_splits}")

        # Create subsets of the dataset for this fold
        sub_train_data = torch.utils.data.Subset(TRAIN_DATASET, train_idx)
        sub_valid_data = torch.utils.data.Subset(TRAIN_DATASET, valid_idx)

        # Create data loaders for this fold
        sub_train_loader = torch.utils.data.DataLoader(sub_train_data, batch_size=BATCH_SIZE, shuffle=True)
        sub_valid_loader = torch.utils.data.DataLoader(sub_valid_data, batch_size=BATCH_SIZE, shuffle=False)
        
        # Generate the model.
        my_model: AdvancedNet = AdvancedNet(use_conv).to(DEVICE)
        
        # Set up the optimizer and the loss function once per fold
        # (plain SGD keeps no internal state, so this matches re-creating it every epoch)
        optimizer = torch.optim.SGD(my_model.parameters(), lr=LEARNING_RATE)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(NUM_EPOCHS):
            # Training of the model.
            # Put model in train mode
            my_model.train()

            for X, y in sub_train_loader:
                # 0. Reshape data to input to the network
                if use_conv:
                    pass
                else:
                    X = X.view(-1, 64*64*3)

                # 1. Move data to device
                X = X.to(DEVICE)
                y = y.to(DEVICE)

                # 2. Forward pass
                y_pred = my_model(X)

                # 3. Calculate and accumulate loss
                loss = loss_fn(y_pred, y)

                # 4. Optimizer zero grad
                optimizer.zero_grad()

                # 5. Loss backward
                loss.backward()

                # 6. Optimizer step
                optimizer.step()

        # Validation of the model.
        # Put model in eval mode
        my_model.eval()
        
        val_acc = 0
        with torch.no_grad():
            for X, y in sub_valid_loader:
                # 0. Reshape data to input to the network
                if use_conv:
                    pass
                else:
                    X = X.view(-1, 64*64*3)
                
                # 1. Move data to device
                X = X.to(DEVICE)
                y = y.to(DEVICE)

                # 2. Forward pass
                y_pred = my_model(X)
                
                # 3. Compute accuracy
                y_pred_class = y_pred.argmax(dim=1)

                val_acc += (y_pred_class == y).sum()

        scores[fold_idx] = (val_acc / len(sub_valid_data)).cpu()
        # .cpu() moves the score off the GPU; np.mean cannot convert CUDA tensors
        print(f"Fold {fold_idx+1}/{n_splits} accuracy: {scores[fold_idx]}")
    
    return np.mean(scores)


study = optuna.create_study(direction="maximize")
study.optimize(objective, timeout=1200, n_trials = 5) 
# - timeout=1200 -> stops after 20 minutes; 

pruned_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED]
complete_trials = [t for t in study.trials if t.state == optuna.trial.TrialState.COMPLETE]

print("\n")
print("--------------------")
print("--------------------")
print("--------------------")
print("\n")
print("Study statistics: ")
print("  Number of finished trials: ", len(study.trials))
print("  Number of pruned trials: ", len(pruned_trials))
print("  Number of complete trials: ", len(complete_trials))

print("Best trial:")
trial = study.best_trial

print("  Value: ", trial.value)

print("  Params: ")
for key, value in trial.params.items():
    print(f"\t{key}: {value}")

[I 2023-01-24 13:50:43,112] A new study created in memory with name: no-name-9ad8c7f1-3000-42a1-a92e-ae1faac6f53c


New trial
Fold 1/3
Fold 1/3 accuracy: 0.6155555844306946
Fold 2/3
Fold 2/3 accuracy: 0.6255555748939514
Fold 3/3
Fold 3/3 accuracy: 0.6322222352027893


[I 2023-01-24 13:54:01,482] Trial 0 finished with value: 0.6244444847106934 and parameters: {'use_conv': True}. Best is trial 0 with value: 0.6244444847106934.


New trial
Duplicated trial: {'use_conv': True}, return 0.6244444847106934


[I 2023-01-24 13:54:01,484] Trial 1 finished with value: 0.6244444847106934 and parameters: {'use_conv': True}. Best is trial 0 with value: 0.6244444847106934.


New trial
Fold 1/3
Fold 1/3 accuracy: 0.5877777934074402
Fold 2/3
Fold 2/3 accuracy: 0.5944444537162781
Fold 3/3
Fold 3/3 accuracy: 0.6122222542762756


[I 2023-01-24 14:06:58,015] Trial 2 finished with value: 0.5981481671333313 and parameters: {'use_conv': False}. Best is trial 0 with value: 0.6244444847106934.


New trial
Duplicated trial: {'use_conv': True}, return 0.6244444847106934


[I 2023-01-24 14:06:58,017] Trial 3 finished with value: 0.6244444847106934 and parameters: {'use_conv': True}. Best is trial 0 with value: 0.6244444847106934.


New trial
Duplicated trial: {'use_conv': True}, return 0.6244444847106934


[I 2023-01-24 14:06:58,020] Trial 4 finished with value: 0.6244444847106934 and parameters: {'use_conv': True}. Best is trial 0 with value: 0.6244444847106934.




--------------------
--------------------
--------------------


Study statistics: 
  Number of finished trials:  5
  Number of pruned trials:  0
  Number of complete trials:  5
Best trial:
  Value:  0.6244444847106934
  Params: 
	use_conv: True


Let's now train with the hyperparameters that Optuna found, available through the `study.best_params` attribute. You need to re-train on the whole training dataset! Otherwise you leave out some data and will probably not get the best accuracy.

In [19]:
# Create model
MODEL: AdvancedNet = AdvancedNet(**study.best_params).to(DEVICE)

In [20]:
print(MODEL)

AdvancedNet(
  (conv): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (fc1): Linear(in_features=262144, out_features=200, bias=True)
  (fc2): Linear(in_features=200, out_features=3, bias=True)
)
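
The `in_features=262144` of `fc1` follows from the convolution output-size formula already used in the comment of `AdvancedNet`. The helper below is a small illustrative function (its name is ours, not a PyTorch API) that applies the formula along one spatial dimension, assuming a dilation of 1, and then counts the weights of `fc1`.

```python
def conv2d_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension (dilation of 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

out_side = conv2d_output_size(64, kernel_size=3, stride=1, padding=1)
out_channels = 64
flattened = out_channels * out_side * out_side  # input size of fc1 after flattening
fc1_weights = flattened * 200                   # 200 hidden units, biases not counted

print(f"output side: {out_side}, flattened features: {flattened}, fc1 weights: {fc1_weights:,}")
```

Almost all the parameters of this model sit in `fc1`, which is worth keeping in mind if you run out of GPU memory.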


In [21]:
def main_train_conv(loss_fn, optimizer) -> None:
    """
    Train the global MODEL; its parameters are modified in place.
    """
    start_time_global = time.time()

    # Put model in train mode
    MODEL.train()

    # Loop through data loader data batches
    for epoch in range(NUM_EPOCHS):
        start_time_epoch = time.time()

        # Setup train loss and train accuracy values
        train_loss, train_acc = 0, 0

        for X, y in TRAIN_DATALOADER:
            # 0. Reshape data to input to the network
            pass  # we are happy with the shape BATCH_SIZE, 3, 64, 64

            # 1. Move data to device
            X = X.to(DEVICE)
            y = y.to(DEVICE)

            # 2. Forward pass
            y_pred = MODEL(X)

            # 3. Calculate and accumulate loss
            loss = loss_fn(y_pred, y)
            train_loss += loss.item()

            # 4. Optimizer zero grad
            optimizer.zero_grad()

            # 5. Loss backward
            loss.backward()

            # 6. Optimizer step
            optimizer.step()

            # Calculate and accumulate accuracy metric across all batches
            y_pred_class = y_pred.argmax(dim=1)
            train_acc += (y_pred_class == y).sum()

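        # Normalise by the (approximate) number of training samples.
        # Note: BATCH_SIZE * len(TRAIN_DATALOADER) slightly overcounts when the last batch
        # is smaller than BATCH_SIZE; len(TRAIN_DATASET), as used in main_train, is exact.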
        train_loss = train_loss / (BATCH_SIZE * len(TRAIN_DATALOADER))
        train_acc = train_acc / (BATCH_SIZE * len(TRAIN_DATALOADER))
        print(
            f"epoch {epoch+1}/{NUM_EPOCHS},"
            f" train_loss = {train_loss:.2e},"
            f" train_acc = {100*train_acc.item():.2f}%,"
            f" time spent during this epoch = {time.time() - start_time_epoch:.2f}s,"
            f" total time spent = {time.time() - start_time_global:.2f}s"
        )
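
`main_train` divides by `len(TRAIN_DATASET)`, whereas `main_train_conv` divides by `BATCH_SIZE * len(TRAIN_DATALOADER)`. The sketch below uses made-up numbers (a hypothetical dataset size and number of correct predictions) to show how the two normalisations differ when the dataset size is not a multiple of the batch size.

```python
import math

n_samples = 2700   # hypothetical dataset size, deliberately not a multiple of 8
batch_size = 8
n_correct = 1436   # hypothetical number of correct predictions

# Number of batches when the last, incomplete batch is kept
n_batches = math.ceil(n_samples / batch_size)

exact = n_correct / n_samples                   # divide by the true number of samples
approx = n_correct / (batch_size * n_batches)   # pretends every batch is full

print(f"{n_batches} batches, exact = {100 * exact:.2f}%, approx = {100 * approx:.2f}%")
```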

In [22]:
main_train_conv(nn.CrossEntropyLoss(), torch.optim.SGD(MODEL.parameters(), lr=LEARNING_RATE))

epoch 1/15, train_loss = 1.20e-01, train_acc = 53.11%, time spent during this epoch = 10.15s, total time spent = 10.15s
epoch 2/15, train_loss = 1.05e-01, train_acc = 61.61%, time spent during this epoch = 9.07s, total time spent = 19.22s
epoch 3/15, train_loss = 9.71e-02, train_acc = 65.87%, time spent during this epoch = 7.84s, total time spent = 27.06s
epoch 4/15, train_loss = 8.97e-02, train_acc = 69.01%, time spent during this epoch = 9.61s, total time spent = 36.68s
epoch 5/15, train_loss = 8.03e-02, train_acc = 73.41%, time spent during this epoch = 7.40s, total time spent = 44.08s
epoch 6/15, train_loss = 6.97e-02, train_acc = 77.00%, time spent during this epoch = 7.09s, total time spent = 51.16s
epoch 7/15, train_loss = 5.87e-02, train_acc = 81.14%, time spent during this epoch = 7.10s, total time spent = 58.26s
epoch 8/15, train_loss = 4.51e-02, train_acc = 85.32%, time spent during this epoch = 8.12s, total time spent = 66.38s
epoch 9/15, train_loss = 2.96e-02, train_acc = 

In [23]:
def test_our_model_conv() -> float:
    # 0. Put model in eval mode
    MODEL.eval()  # to remove stuff like dropout that's only going to be in the training part

    # 1. Setup test accuracy value
    test_acc: float = 0

    # 2. Turn on inference context manager
    with torch.no_grad():
        # Loop through DataLoader batches
        for X_test, y_test in TEST_DATALOADER:  # capital X because it is a "matrix" (a batch of inputs); y holds integer labels
            # a. Move data to device
            X_test_flattened = X_test.to(DEVICE)  # no need to flatten here
            y_test = y_test.to(DEVICE)

            # b. Forward pass
            model_output = MODEL(X_test_flattened)

            # c. Calculate and accumulate accuracy
            test_pred_label = model_output.argmax(dim=1)
            test_acc += (test_pred_label == y_test).sum()

    # Divide the number of correct predictions by the number of test samples
    test_acc = test_acc / (len(TEST_DATASET))
    return test_acc.item()

In [24]:
print((f"{100*test_our_model_conv():.2f}%"))

63.00%


Some overfitting has most likely happened here (look at the training accuracy!), but we did improve the test accuracy: 63.00% now against 55.67% earlier. This is close to the average validation accuracy (62.44%), which makes sense. It is still not amazing, which is why we should also optimise the learning rate, the number of epochs, etc., and not just the architecture.

Your turn now!

Optimize the learning rate and the number of output channels of the first convolutional layer.
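
As a starting hint only, a search space could be declared along the lines below. The function name, the parameter names and the ranges are assumptions chosen for illustration; `trial` is expected to be the `optuna` trial object passed to the objective function, and `suggest_float` / `suggest_int` are its sampling methods.

```python
def suggest_hyperparameters(trial) -> dict:
    """Sample one configuration from a hypothetical search space."""
    return {
        # log=True samples the learning rate uniformly on a log scale
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
        # step=8 restricts the number of channels to 8, 16, ..., 64
        "out_channels": trial.suggest_int("out_channels", 8, 64, step=8),
    }
```

Inside `objective`, the sampled values would replace `LEARNING_RATE` and the hard-coded 64 output channels of `AdvancedNet`, which would need an extra constructor argument for that purpose.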

In [1]:
## TODO: Add your code here

And then we train the model with the best hyperparameters on the whole training set and test it on the testing set: ...