<a href="https://colab.research.google.com/github/NicholasBaraghini/Machine-Learning-for-Computer-Vision-LAB-Sessions/blob/main/ML4CV_4_neural_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We start with our usual imports and figure adjustments.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import random_split, DataLoader, TensorDataset
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import math
from timeit import default_timer as timer
from functools import partial

plt.rcParams['figure.figsize'] = (12.0, 8.0)
plt.rcParams['font.size'] = 16

Then we load CIFAR10, and we create the usual `Dataset`s and `DataLoader`s.

In [None]:
tsfms = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda z: z.reshape(-1))]) 
train_ds = torchvision.datasets.CIFAR10(root="/data/", train=True, transform=tsfms, download=True)
test_ds = torchvision.datasets.CIFAR10(root="/data/", train=False, transform=tsfms)

classes = train_ds.classes
n_classes = len(classes)
n_features = len(train_ds[0][0])

Files already downloaded and verified
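
As a quick sanity check of the `reshape(-1)` transform, the sketch below flattens a random tensor shaped like one CIFAR10 image. The 3x32x32 shape is assumed here rather than read from the dataset, and `fake_image` is an illustrative name. The result has 3072 entries, which is the value `n_features` should take.

```python
import torch

# Stand-in for one CIFAR10 image after ToTensor(): channels x height x width.
# The 3x32x32 shape is an assumption; nothing is loaded from disk here.
fake_image = torch.rand(3, 32, 32)

# Same operation as the Lambda transform: collapse all axes into one vector.
flat = fake_image.reshape(-1)
print(flat.shape)  # torch.Size([3072])
```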


In [None]:
splitted_datasets = torch.utils.data.random_split(train_ds, [45000, 5000])
actual_train_subds = splitted_datasets[0]
valid_subds = splitted_datasets[1]
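
Note that this split is random, so it changes at every run. If reproducibility matters, `random_split` accepts a `generator` argument. A minimal sketch on a tiny synthetic dataset follows; the seed value, `toy_ds` and `split` are illustrative names, not part of the lab code.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Ten toy samples standing in for the real training set.
toy_ds = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

def split(seed):
    # A seeded generator makes the permutation behind the split repeatable.
    g = torch.Generator().manual_seed(seed)
    return random_split(toy_ds, [8, 2], generator=g)

train_a, valid_a = split(42)
train_b, valid_b = split(42)
print(train_a.indices == train_b.indices)  # True: same seed, same split
```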

In [None]:
small_actual_train_subds = torch.utils.data.Subset(actual_train_subds, range(500))
small_valid_subds = torch.utils.data.Subset(valid_subds, range(100))
small_test_subds = torch.utils.data.Subset(test_ds, range(100))

In [None]:
batch_size = 256
small_actual_train_dl = torch.utils.data.DataLoader(small_actual_train_subds, batch_size=batch_size, shuffle=True)
small_valid_dl = torch.utils.data.DataLoader(small_valid_subds, batch_size=batch_size)
small_test_dl = torch.utils.data.DataLoader(small_test_subds, batch_size=batch_size)
actual_train_dl = torch.utils.data.DataLoader(actual_train_subds, batch_size=batch_size, shuffle=True)
valid_dl = torch.utils.data.DataLoader(valid_subds, batch_size=batch_size)
test_dl = torch.utils.data.DataLoader(test_ds, batch_size=batch_size)

We will then create our first Neural Network. One way to create Neural Networks in PyTorch is by subclassing `torch.nn.Module`. In this way, our model will inherit a lot of ready-to-use convenience functions (access to parameters for optimization, get/set parameters, ...). 

We only need to create the layers we will use in the `__init__` function and define the `forward` function that specifies how to apply them. 

Layers are in turn subclasses of `torch.nn.Module`. In our example, we will use only linear layers, i.e. Fully Connected (FC) layers. Our network will have at least two FC layers: `self.first`, mapping the flattened input image into the (first) hidden representation, and `self.last`, mapping the (last) hidden representation into the scores for the classes.

To play with varying depths and activation functions, we will have two additional parameters:


*   `n_additional_hidden_layers`, which specifies how many hidden layers our network has in addition to `self.first`
*   `use_relu`, which selects the activation function: ReLU if `True`, sigmoid if `False`

Note that to store a variable number of layers in our network, we do not use a plain Python list, but `torch.nn.ModuleList`. This is important to make PyTorch aware of the layers in the list, e.g. to set/get their parameters when calling the methods of the base `Module` class.
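
To see the difference in practice, here is a small sketch comparing the two containers. The two toy classes are hypothetical and exist only for this comparison.

```python
import torch

class WithPlainList(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Layers kept in a plain Python list are NOT registered as submodules.
        self.layers = [torch.nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList registers every layer, so their parameters become visible.
        self.layers = torch.nn.ModuleList([torch.nn.Linear(4, 4) for _ in range(3)])

n_plain = len(list(WithPlainList().parameters()))
n_registered = len(list(WithModuleList().parameters()))
print(n_plain, n_registered)  # 0 6 (three layers, each with a weight and a bias)
```

With the plain list, an optimizer built from `parameters()` would receive nothing to update for those layers.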

In [None]:
class TwoPlusLayersNetwork(torch.nn.Module):
  def __init__(self, n_features, hidden_width, n_classes, n_additional_hidden_layers=0, use_relu=True):
    super(TwoPlusLayersNetwork, self).__init__()
    self.first = torch.nn.Linear(n_features, hidden_width) 
    self.activation = torch.relu if use_relu else torch.sigmoid
    self.last = torch.nn.Linear(hidden_width, n_classes)

    self.additional_hidden_layers = torch.nn.ModuleList(
        [torch.nn.Linear(hidden_width, hidden_width) for i in range(n_additional_hidden_layers)])
  
  def forward(self, x):
    # Call the layers directly rather than via .forward(), so that Module.__call__ (and any hooks) run
    x = self.first(x)
    x = self.activation(x)
    for layer in self.additional_hidden_layers:
      x = layer(x)
      x = self.activation(x)
    x = self.last(x)
    return x
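
To get a feeling for the sizes involved, the sketch below builds a rough equivalent of this network with one additional hidden layer. It uses `torch.nn.Sequential` purely for brevity, which is not how the class above is written. The sizes 3072, 50 and 10 are the ones appearing in the outputs of this notebook.

```python
import torch

n_features, hidden_width, n_classes = 3072, 50, 10

# first -> ReLU -> one additional hidden layer -> ReLU -> last
model = torch.nn.Sequential(
    torch.nn.Linear(n_features, hidden_width), torch.nn.ReLU(),
    torch.nn.Linear(hidden_width, hidden_width), torch.nn.ReLU(),
    torch.nn.Linear(hidden_width, n_classes),
)

# A batch of 8 random "flattened images" yields one score per class.
scores = model(torch.rand(8, n_features))

# (3072*50 + 50) + (50*50 + 50) + (50*10 + 10) weights and biases.
n_params = sum(p.numel() for p in model.parameters())
print(scores.shape, n_params)  # torch.Size([8, 10]) 156710
```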

We then define the usual function to train a model.

Note that we use 
*   `nn.parameters()` to get a list of trainable parameters
*   `nn.state_dict()` to get the model parameters when we achieve better validation accuracy and save them in the `best_params` variable. 

These are two of the convenience functions our network inherits from `torch.nn.Module`.

In [None]:
def ncorrect(scores, y):
  y_hat = torch.argmax(scores, 1)
  return (y_hat==y).sum()

def accuracy(scores, y):
  correct = ncorrect(scores, y)
  return correct.true_divide(y.shape[0])
  
def train_loop(n_features, hidden_width, n_classes, n_additional_hidden_layers, use_relu,
               train_dl, epochs, partial_opt, 
               valid_dl=None, verbose=False):
  best_valid_acc = 0
  best_params = []
  best_epoch = -1

  nn = TwoPlusLayersNetwork(n_features, hidden_width, n_classes, n_additional_hidden_layers, use_relu)

  # We "complete" the partial function by calling it and specifying the missing parameters
  opt = partial_opt(nn.parameters())

  for e in range(epochs):
    #train
    train_loss = 0
    train_samples = 0
    train_acc = 0
    for train_data in train_dl:
      scores = nn.forward(train_data[0])
      loss = F.cross_entropy(scores, train_data[1])
      train_loss += loss.item() * train_data[0].shape[0]
      train_samples += train_data[0].shape[0]
      train_acc += ncorrect(scores, train_data[1]).item()
      loss.backward()

      opt.step()
      opt.zero_grad()

    train_acc /= train_samples
    train_loss /= train_samples
    
    # validation
    with torch.no_grad():
      valid_loss = 0
      valid_samples = 0
      valid_acc = 0
      if valid_dl is not None:
        for valid_data in valid_dl:
          valid_scores = nn.forward(valid_data[0])
          valid_loss += F.cross_entropy(valid_scores, valid_data[1]).item() * valid_data[0].shape[0]
          valid_samples += valid_data[0].shape[0]
          valid_acc += ncorrect(valid_scores, valid_data[1]).item()
        valid_acc /= valid_samples
        valid_loss /= valid_samples
      
      if valid_dl is None or valid_acc > best_valid_acc:
        best_valid_acc = valid_acc if valid_dl is not None else 0
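        # state_dict() tensors share memory with the live parameters, so we clone them to take a real snapshot
        best_params = {k: v.clone() for k, v in nn.state_dict().items()}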
        best_epoch = e

      
    if verbose and e % 10 == 0:
      print(f"Epoch {e}: train loss {train_loss:.3f} - train acc {train_acc:.3f}" + ("" if valid_dl is None else f" - valid loss {valid_loss:.3f} - valid acc {valid_acc:.3f}"))
  
  if verbose and valid_dl is not None:
    print(f"Best epoch {best_epoch}, best acc {best_valid_acc}")

  return best_valid_acc, best_params, best_epoch
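
A detail of `train_loop` worth spelling out: each batch loss is multiplied by the batch size before being accumulated, because `F.cross_entropy` returns the mean over the batch by default and the last batch is usually smaller. The toy numbers below are made up to show that the weighted average recovers the true per-sample mean, while a plain average of batch means does not.

```python
import torch

# Hypothetical per-sample losses for 5 samples, split into batches of 2, 2 and 1.
per_sample_loss = torch.tensor([1.0, 3.0, 2.0, 4.0, 10.0])
batches = per_sample_loss.split(2)

weighted_sum, n_samples = 0.0, 0
for b in batches:
    batch_mean = b.mean().item()         # what a mean-reduced loss would return
    weighted_sum += batch_mean * len(b)  # undo the mean, as train_loop does
    n_samples += len(b)

weighted_avg = weighted_sum / n_samples
naive_avg = sum(b.mean().item() for b in batches) / len(batches)
print(weighted_avg, naive_avg)  # 4.0 5.0 (the true mean is 4.0)
```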

The two functions `parameters()` and `state_dict()` are similar, but serve different purposes. 

`parameters()` returns an iterator (a generator, not an actual list) over the trainable tensors, which is what `Optimizer`s require: they do not need to know which tensor corresponds to which layer, since all tensors are treated the same way when performing SGD.

In [None]:
hidden_width = 50  # must be defined before building the network
nn = TwoPlusLayersNetwork(n_features, hidden_width, n_classes, n_additional_hidden_layers=1, use_relu=True)
for p in nn.parameters():
  print(type(p), p.shape)

<class 'torch.nn.parameter.Parameter'> None torch.Size([50, 3072])
<class 'torch.nn.parameter.Parameter'> None torch.Size([50])
<class 'torch.nn.parameter.Parameter'> None torch.Size([10, 50])
<class 'torch.nn.parameter.Parameter'> None torch.Size([10])
<class 'torch.nn.parameter.Parameter'> None torch.Size([50, 50])
<class 'torch.nn.parameter.Parameter'> None torch.Size([50])


`state_dict()` instead returns an (ordered) dictionary that maps the name of each parameter, prefixed by the attribute holding its layer in our class (e.g. `first.weight`), to the corresponding tensor. It is therefore useful for taking a snapshot of the parameters of our model, which can later be restored by calling `load_state_dict()`.

In [None]:
type(nn.state_dict())

collections.OrderedDict

In [None]:
nn.state_dict().keys()

odict_keys(['first.weight', 'first.bias', 'last.weight', 'last.bias', 'additional_hidden_layers.0.weight', 'additional_hidden_layers.0.bias'])

In [None]:
nn.state_dict()["first.bias"]

tensor([ 0.0141,  0.0104,  0.0093,  0.0019,  0.0071,  0.0129,  0.0131, -0.0018,
        -0.0164,  0.0076,  0.0072, -0.0075, -0.0068,  0.0118,  0.0068,  0.0172,
         0.0024, -0.0086, -0.0102,  0.0095, -0.0133,  0.0001,  0.0098,  0.0133,
         0.0033, -0.0034, -0.0120, -0.0093, -0.0103,  0.0037, -0.0024,  0.0179,
         0.0114, -0.0150, -0.0080, -0.0055, -0.0069, -0.0176, -0.0028, -0.0049,
         0.0139,  0.0176,  0.0077,  0.0115,  0.0174,  0.0032, -0.0127, -0.0115,
        -0.0057,  0.0029])
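
One caveat when using `state_dict()` as a snapshot: the tensors it returns share memory with the live parameters, so in-place updates (such as those performed by an optimizer step) are visible through it. The sketch below, on a tiny stand-in layer, simulates such an update by hand and shows that an explicit copy is needed before restoring with `load_state_dict()`.

```python
import copy
import torch

layer = torch.nn.Linear(2, 2)

live_view = layer.state_dict()                # shares memory with the parameters
snapshot = copy.deepcopy(layer.state_dict())  # independent copy

with torch.no_grad():
    layer.weight.add_(1.0)                    # simulate an in-place optimizer update

print(torch.equal(live_view["weight"], layer.weight))  # True: the view followed the update
print(torch.equal(snapshot["weight"], layer.weight))   # False: the copy kept the old values

layer.load_state_dict(snapshot)                        # restore the old values
print(torch.equal(snapshot["weight"], layer.weight))   # True
```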

Let's verify that using the sigmoid as activation function makes it more difficult to train "deep" networks, here one with 10 additional hidden layers. 

In [None]:
start = timer()
lr=1e-3
hidden_width = 50
n_additional_hidden_layers = 10
use_relu = False
p_opt = partial(torch.optim.Adam, lr=lr)

train_loop(n_features, hidden_width, n_classes, n_additional_hidden_layers, use_relu,
           train_dl=small_actual_train_dl, epochs=200, partial_opt=p_opt, 
           valid_dl=small_valid_dl, verbose=True)
end = timer()
print(f"Elapsed time (s): {end-start}")

Epoch 0: train loss 2.333 - train acc 0.102 - valid loss 2.345 - valid acc 0.060
Epoch 10: train loss 2.296 - train acc 0.124 - valid loss 2.294 - valid acc 0.160
Epoch 20: train loss 2.294 - train acc 0.124 - valid loss 2.298 - valid acc 0.160
Epoch 30: train loss 2.294 - train acc 0.124 - valid loss 2.297 - valid acc 0.160
Epoch 40: train loss 2.294 - train acc 0.124 - valid loss 2.296 - valid acc 0.160
Epoch 50: train loss 2.294 - train acc 0.124 - valid loss 2.298 - valid acc 0.160
Epoch 60: train loss 2.294 - train acc 0.124 - valid loss 2.298 - valid acc 0.160
Epoch 70: train loss 2.295 - train acc 0.124 - valid loss 2.297 - valid acc 0.160
Epoch 80: train loss 2.294 - train acc 0.124 - valid loss 2.298 - valid acc 0.160
Epoch 90: train loss 2.294 - train acc 0.124 - valid loss 2.297 - valid acc 0.160
Epoch 100: train loss 2.294 - train acc 0.124 - valid loss 2.298 - valid acc 0.160
Epoch 110: train loss 2.294 - train acc 0.124 - valid loss 2.297 - valid acc 0.160
Epoch 120: trai

Let's compare this with ReLU.

In [None]:
start = timer()
lr=1e-3
hidden_width = 50
n_additional_hidden_layers = 10
use_relu = True
p_opt = partial(torch.optim.Adam, lr=lr)

train_loop(n_features, hidden_width, n_classes, n_additional_hidden_layers, use_relu,
           train_dl=small_actual_train_dl, epochs=200, partial_opt=p_opt, 
           valid_dl=small_valid_dl, verbose=True)
end = timer()
print(f"Elapsed time (s): {end-start}")

Epoch 0: train loss 2.312 - train acc 0.092 - valid loss 2.302 - valid acc 0.060
Epoch 10: train loss 2.298 - train acc 0.124 - valid loss 2.293 - valid acc 0.160
Epoch 20: train loss 2.236 - train acc 0.160 - valid loss 2.224 - valid acc 0.210
Epoch 30: train loss 2.163 - train acc 0.190 - valid loss 2.153 - valid acc 0.230
Epoch 40: train loss 2.090 - train acc 0.198 - valid loss 2.095 - valid acc 0.260
Epoch 50: train loss 2.030 - train acc 0.230 - valid loss 2.070 - valid acc 0.230
Epoch 60: train loss 1.973 - train acc 0.250 - valid loss 2.053 - valid acc 0.240
Epoch 70: train loss 1.875 - train acc 0.282 - valid loss 2.050 - valid acc 0.220
Epoch 80: train loss 1.836 - train acc 0.294 - valid loss 2.027 - valid acc 0.290
Epoch 90: train loss 1.718 - train acc 0.340 - valid loss 2.023 - valid acc 0.320
Epoch 100: train loss 1.687 - train acc 0.340 - valid loss 2.075 - valid acc 0.280
Epoch 110: train loss 1.620 - train acc 0.356 - valid loss 2.089 - valid acc 0.320
Epoch 120: trai

You can see that training a modestly deep network (by today's standards) with the sigmoid function gets stuck: the training loss plateaus around 2.294 and the accuracy stops moving, while the network using ReLUs keeps improving its training performance.
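
A common explanation for this behaviour, offered here as background rather than something measured in this notebook, is that the slope of the sigmoid never exceeds 0.25, so the activation's contribution to the backpropagated gradient shrinks at every layer. The sketch below checks the maximum slope with autograd and computes the resulting bound on the activation factor after 10 layers. It ignores the weights, which also scale the gradient.

```python
import torch

# Evaluate the derivative of the sigmoid on a grid that includes 0.
x = torch.linspace(-6.0, 6.0, 1201, requires_grad=True)
torch.sigmoid(x).sum().backward()

max_slope = x.grad.max().item()  # largest derivative, attained around x = 0
bound_10_layers = max_slope ** 10  # activation factor after 10 sigmoid layers, at best

print(f"{max_slope:.4f} {bound_10_layers:.2e}")
```

For comparison, the slope of ReLU is exactly 1 for positive inputs, so this particular shrinking factor is absent.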

Let's then define a function to perform hyper-parameter tuning. Since this is a small network, we can also afford to validate hyper-parameters that define the architecture, like the `hidden_width` of the layers or the number of hidden layers. We will also loop over optimizers (each wrapping its learning rate), as usual.

In [None]:
def hyperparameter_tuning(n_features, n_classes, train_dl,  
                          valid_dl, partial_opts, hidden_widths,
                          n_additional_hidden_layers_list, epochs=5):
  
  best_valid_acc = 0
  best_params = []
  best_hyper_params = []

  for hidden_width in hidden_widths:
    for n_additional_hidden_layers in n_additional_hidden_layers_list:
      for partial_opt in partial_opts:
        run_valid_acc, params, epoch = train_loop(n_features, hidden_width, n_classes, n_additional_hidden_layers, use_relu=True, 
                  train_dl=train_dl, epochs=epochs, partial_opt=partial_opt, valid_dl=valid_dl, verbose=False)

        if run_valid_acc > best_valid_acc:
          best_valid_acc = run_valid_acc
          best_params = params
          best_hyper_params = [partial_opt, epoch, hidden_width, n_additional_hidden_layers]
          print(f"Improved result: acc {best_valid_acc:.3f}, best_hyper_params {best_hyper_params}")
  return best_hyper_params, best_params
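
The three nested loops amount to a grid search: one training run per combination of values. As a sketch, the same grid can be enumerated with `itertools.product`, which makes it easy to count the runs in advance. The values below are the ones used later in this notebook, with learning rates standing in for the optimizers.

```python
from itertools import product

hidden_widths = [128]
n_additional_hidden_layers_list = [0, 1]
lrs = [1e-4, 1e-3]

# Same order as the nested loops: width, then depth, then optimizer.
grid = list(product(hidden_widths, n_additional_hidden_layers_list, lrs))
for hidden_width, n_layers, lr in grid:
    print(hidden_width, n_layers, lr)

print(len(grid))  # 4 training runs in total
```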

Then, the usual function to define which combination of optimizers and learning rates we want to validate.

In [None]:
def build_optlist():
  lrs = [1e-4, 1e-3]
  betas = [0.9]
  opts = []
  #opts += [partial(torch.optim.SGD, lr=lr) for lr in lrs]
  #opts += [partial(torch.optim.SGD, lr=lr, momentum=beta, nesterov=True) for lr in lrs for beta in betas]
  opts += [partial(torch.optim.Adam, lr=lr) for lr in lrs]
  #opts += [partial(torch.optim.RMSprop, lr=lr) for lr in lrs]
  return opts

build_optlist()

[functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.0001),
 functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001)]
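
Each entry of this list is an optimizer constructor with its learning rate already fixed; only the parameters to optimize are still missing. A minimal sketch of how such a partial is completed, using a tiny stand-in model (`model` and `make_adam` are illustrative names):

```python
from functools import partial
import torch

model = torch.nn.Linear(3, 2)  # stand-in for our network

# lr is bound now; the parameters are supplied later, once the model exists.
make_adam = partial(torch.optim.Adam, lr=1e-3)
opt = make_adam(model.parameters())

print(type(opt).__name__, opt.param_groups[0]["lr"])  # Adam 0.001
```

This is what lets `train_loop` build a fresh network for every run and still receive the optimizer choice from outside.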

Let's check everything works on the small `Dataset`s.

In [None]:
start=timer()
opts = build_optlist()
hidden_widths = [128]
n_hidden_layers_list = [0]
best_hyper_params, best_params = hyperparameter_tuning(n_features, n_classes, small_actual_train_dl, 
                  small_valid_dl, opts, hidden_widths, n_hidden_layers_list, epochs=200)
end=timer()
print(f"Elapsed time (s): {end-start:.3f}")
print(f"best optimizer {best_hyper_params[0]}, best epoch {best_hyper_params[1]}," 
      f"best hidden_width {best_hyper_params[2]}, best n_additional_hidden {best_hyper_params[3]}")

Improved result: acc 0.340, best_hyper_params [functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.01), 62, 128, 0]
Improved result: acc 0.470, best_hyper_params [functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001), 25, 128, 0]
Elapsed time (s): 34.131
best optimizer functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001), best epoch 25,best hidden_width 128, best n_additional_hidden 0


And then, let's validate on the real `Dataset`s.

In [None]:
start=timer()
opts = build_optlist()
hidden_widths = [128]
n_hidden_layers_list = [0,1]
best_hyper_params, best_params = hyperparameter_tuning(n_features, n_classes, 
      actual_train_dl, valid_dl, opts, hidden_widths, n_hidden_layers_list, epochs=30)
end=timer()
print(f"Elapsed time (s): {end-start:.3f}")
print(f"best optimizer {best_hyper_params[0]}, best epoch {best_hyper_params[1]}, "
      f"best hidden_width {best_hyper_params[2]}, best n_additional_hidden {best_hyper_params[3]}")

Improved result: acc 0.253, best_hyper_params [functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.01), 20, 128, 0]
Improved result: acc 0.494, best_hyper_params [functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001), 21, 128, 0]
Improved result: acc 0.505, best_hyper_params [functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001), 29, 128, 1]
Elapsed time (s): 932.106
best optimizer functools.partial(<class 'torch.optim.adam.Adam'>, lr=0.001), best epoch 29, best hidden_width 128, best n_additional_hidden 1


Let's train on the full training set.

In [None]:
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True)

In [None]:
start = timer()
best_opt = best_hyper_params[0]
best_epochs = best_hyper_params[1] + 1  # the best epoch index is 0-based, so we need one more epoch to reach it
_, best_params, best_epoch = train_loop(n_features=n_features, 
        hidden_width=best_hyper_params[2], n_classes=n_classes, 
        n_additional_hidden_layers=best_hyper_params[3], use_relu=True,
        train_dl=train_dl, epochs=best_epochs, partial_opt=best_opt, verbose=True)
end = timer()
print(f"Elapsed time (s): {end-start}")

Epoch 0: train loss 1.906 - train acc 0.314
Epoch 10: train loss 1.389 - train acc 0.506
Epoch 20: train loss 1.249 - train acc 0.556
Elapsed time (s): 225.143524608


And test on the full test set. To restore the parameters computed in training, we use the `load_state_dict` function.

In [None]:
nn = TwoPlusLayersNetwork(n_features, best_hyper_params[2], n_classes, best_hyper_params[3])
nn.load_state_dict(best_params)

start = timer()
test_samples = 0
test_acc = 0
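# No gradients are needed at test time, so we avoid building the autograd graph
with torch.no_grad():
  for test_data in test_dl:
    test_scores = nn.forward(test_data[0])
    test_samples += test_data[0].shape[0]
    test_acc += ncorrect(test_scores, test_data[1]).item()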
test_acc /= test_samples
end = timer()
print(f"Accuracy on full test set {test_acc:.3f}, elapsed time (s): {end-start:.3f}")


Accuracy on full test set 0.512, elapsed time (s): 1.179
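
Since the same evaluation pattern is used for validation and test, it can be wrapped in a small helper. The sketch below is one possible version, run on random toy data so that it is self-contained. `evaluate`, `toy_model` and `toy_dl` are hypothetical names, and the accuracy it prints is meaningless.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, dl):
    """Return the fraction of correctly classified samples in dl."""
    model.eval()           # no effect on plain linear layers, but a good habit
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for x, y in dl:
            scores = model(x)
            correct += (scores.argmax(dim=1) == y).sum().item()
            total += y.shape[0]
    return correct / total

# Random toy data: 20 samples, 5 features, 3 classes.
toy_model = torch.nn.Linear(5, 3)
toy_dl = DataLoader(TensorDataset(torch.rand(20, 5), torch.randint(0, 3, (20,))), batch_size=8)
acc = evaluate(toy_model, toy_dl)
print(f"Accuracy on toy data {acc:.3f}")
```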
