# Lab 2: Machine Learning Tutorial

In this tutorial, you will learn the basics of *Machine Learning* (ML) through hands-on practice. We cover the following topics:

1. Frameworks: PyTorch
2. Model Archetypes: Support Vector Machines, Linear Regression, Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks.
3. Performance metrics: Accuracy, Confusion Matrix.
4. Model Validation Techniques: Cross-Validation.

## 1. Introduction to PyTorch

Let's start by importing all the necessary libraries.

*__Note:__ Don't forget to install PyTorch using the instructions on their website! This is done to ensure you have proper GPU acceleration. More info in the Installation instructions on Studium.*

In [None]:
import torch
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

### 1.1 Tensors and Their Allocation

First, what is a *tensor*? Tensors are *multi-dimensional arrays*, together with some rules on how to do math with them. 

NumPy's `ndarray` is one example of a way to represent tensors in Python, and PyTorch `tensor`s are another. Both approaches look quite similar:

In [None]:
a = torch.tensor([3.0])  # 1D array with a single entry: 3
b = torch.zeros([3, 2])  # 3x2 matrix filled with zeros
c = torch.ones_like(b)  # b-sized matrix filled with ones
d = torch.ones_like(a)  # a-sized tensor filled with ones

a, b, c, d

There are many 'shortcut' functions to create a tensor. See the docs for more info: [PyTorch - Creation Ops](https://pytorch.org/docs/stable/torch.html#tensor-creation-ops).

A key difference between NumPy's `ndarray` and PyTorch `tensor` is that the latter can run both on a CPU and on a GPU while ndarray can only run on CPUs. This gives a big advantage in case of heavy computations, as doing them on a GPU is usually much faster.
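
Since both libraries represent the same kind of object, data can be passed back and forth between them. The following is a minimal sketch (the variable names are made up for illustration) of the two common conversion routes on the CPU: `torch.from_numpy()` shares memory with the source array, while `torch.tensor()` makes a copy.

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

t_shared = torch.from_numpy(arr)  # shares memory with arr (CPU only)
t_copy = torch.tensor(arr)        # independent copy of the data

arr[0, 0] = 10.0  # modify the NumPy array in place

shared_value = t_shared[0, 0].item()  # sees the change
copied_value = t_copy[0, 0].item()    # still holds the old value

back_to_numpy = t_shared.numpy()  # tensor -> ndarray (CPU tensors only)
print(shared_value, copied_value, type(back_to_numpy))
```

A tensor that lives on a GPU has to be moved back with `.cpu()` before calling `.numpy()`.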

However, PyTorch does not allocate tensors 'smartly' itself. Instead, you need to manually specify at the time of creation (or later on) on which device (CPU or GPU) you want to allocate the tensor. By default, it gets allocated on the CPU.

In [None]:
# First, discover if the current computer can use GPU acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Also works with ROCm (PyTorch > 1.8, Linux)
print(f"Using {device}")

# Let's create a tensor on the available device
torch.tensor([
    [0, 1],
    [2, 3]
], device=device)

# We can also move that tensor manually after creation
A = torch.tensor([
    [0, 1],
    [2, 3]
])  # By default, it is allocated on the CPU
print(f"Before moving the tensor: {A}")
A = A.to(device)  # Will return a 'new' tensor, does not change A unless we re-assign it
print(f"After moving the tensor: {A}")

### 1.2 Tensor Operations

Now that we have learned how to allocate tensors on GPUs for super-fast computation, let's see how we can actually handle the tensors (note that for our first, simpler exercises we will not bother moving tensors to the GPU, for the sake of readability).

Here is a non-exhaustive list of examples. Some of them have a really similar syntax to NumPy (the full list can be found in the [official docs](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html#operations-on-tensors)):

In [None]:
a = torch.tensor([[3.0, 5.0]])
b = torch.tensor([[-4.0, 4.0]])
W = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0]
])

print(f"a: {a}")
print(f"b: {b}")
print(f"W: {W}")

print("----")

# Slicing is very similar to NumPy:
print(f"Second column in W: {W[:, 1]}")

# torch.cat() joins tensors along an axis:
print(f"Stacking a and b:\n {torch.cat([a, b], dim=0)}")

print("----")

# We can use arithmetic operators:

print(f"a + b: {a + b}")  # element-wise a + b
print(f"a - b: {a - b}")  # element-wise a - b
print(f"a * b: {a * b}")  # element-wise a * b (!! different from matrix multiplication !!)
print(f"a / b: {a / b}")  # element-wise a/b
print(f"a @ W: {a @ W}")  # matrix multiplication

print("----")

# For each operator above, there is a function that does the same thing:
print(f"torch.matmul(a, W): {torch.matmul(a, W)}")  # same as 'a @ W'

With this, you can already do a lot of operations. Here is an example of a linear function:

In [None]:
X = torch.tensor([1.0, 1.0, 1.0])
W = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]
])  # Identity matrix
b = torch.tensor([0.0, 1.0, 2.0])

Y_pred = X @ W + b
Y_pred.numpy()

The example above is more than just a demonstration of PyTorch syntax: It is also a machine learning model called a *linear model*. We conceptualize it as a function $f_{W,b}(X) = Y_\mathrm{pred}$: the vector $X$ is the model's input, $Y_\mathrm{pred}$ is the model's output, and all other variables (here, $W$ and $b$) are the *model parameters*. The idea is that we can use some kind of automatic procedure to choose values for $W$ and $b$, so that $f_{W, b}()$ gives good approximate solutions for a problem. We often group all model parameters together into a variable called $\theta$.

For this linear model, the output $Y_\mathrm{pred}$ is a continuous number (making it a *regression model*), and the input $X$ is a *feature vector*. The process we performed above (taking a set of model parameters, taking one or more feature vectors, and computing the model's output) is called *prediction*.

A single *feature vector* (like $X$ above) is usually interpreted as a row-vector (notice that we multiplied it *to the left* of `W`). This is not standard mathematical notation (we would usually represent $X$ as a column-vector that goes *to the right* of `W`), but is very common in Machine Learning. The reason is that we can pile up samples like in a spreadsheet, so each row is a different sample and each column is a different feature. We don't have to change any math to calculate several samples at once: the rules of matrix multiplication guarantee that each row is calculated separately.
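
To see the row convention at work, here is a small sketch with made-up numbers (none of these values come from the lab). It computes a batch of four samples in one matrix product and compares the result with computing each row on its own.

```python
import torch

# Hypothetical linear model: 3 input features -> 2 outputs
W = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0]
])
b = torch.tensor([0.5, -0.5])

# Four samples piled up as rows, one feature per column
X_batch = torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [2.0, 0.0, -1.0]
])

batched = X_batch @ W + b                                # all samples at once
one_by_one = torch.stack([x @ W + b for x in X_batch])   # one row at a time

same = torch.allclose(batched, one_by_one)
print(batched.shape, same)
```

Both routes give the same numbers; the batched version is simply shorter and usually faster.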

#### Exercise 1

Calculate
$$ l = (Y_\textrm{pred}-Y)^2 $$
for the input

In [None]:
Y_pred = torch.tensor([1.0])
Y = torch.tensor([1.5])

*__Tip:__ Python's power syntax works out of the box with PyTorch.*

In [None]:
(Y_pred - Y) ** 2

*__Note:__ The expected output is* `tensor([0.2500])`

The function you just implemented is called the *squared error* and measures how well the model's output $Y_\mathrm{pred}$ fits the desired output $Y$. The closer the prediction gets to the real value, the smaller $l$ gets. Averaging over a collection of samples (a dataset), we obtain the *Mean Square Error* (MSE)

$$ L = \textrm{mean}_i \left( l^{(i)} \right) = \frac{1}{N} \sum_{i=1}^N \left( Y^{(i)}_\textrm{pred}-Y^{(i)} \right)^2, $$

where $N$ is the number of samples, and the superscript $(i)$ indicates "pertaining to the $i$-th sample". MSE is one way of measuring a model's *goodness of fit* (how well a model fits the data) and is used for regression models. We call it (and other functions like it) a *loss function*, and to highlight this we sometimes call MSE the *MSE loss*.

#### Exercise 2

Building upon Exercise 1, compute the MSE loss for these inputs:

In [None]:
Y_pred = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = torch.tensor([[1.5], [3.0], [2.5], [4.0], [6.0]])

*__Tip:__ PyTorch tensors have a `mean()` method.*

In [None]:
((Y_pred - Y) ** 2).mean()

*__Note:__ The expected output of this function is* `tensor(0.5000)`

Computing the loss of regression models is nice, but what about classification models? What if we want to predict if a certain value is *true* or *false*, e.g., face vs. no face? As it turns out, we can turn the classification problem into a regression problem with a simple trick: Assume that the current sample has some probability to be in each category, and build a regression model that predicts the probability for each category. Then, predict the sample to be in the category with the highest probability.

In a binary scenario (face, no face) we can create an even simpler regression model: We have the model return large positive values when it thinks the value is *true* (face), and large negative values when it thinks the value is *false* (no face). For ease of interpretability, we often convert the resulting score into a 0 to 1 range, because this allows us to interpret the result as a probability. For this, we need an S-shaped function that is close to 0 for large negative values, and close to 1 for large positive values. Anything with this shape is called a *sigmoid function*. We usually choose a sigmoid function called the *logistic function*:

$$ \sigma(z) = \frac{1}{1+e^{-z}}, $$

operating element-by-element if $z$ is an array.
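
To get a feel for the S-shape, the sketch below evaluates the logistic function at a few hand-picked scores using only Python's standard library. The helper name `logistic` and the chosen scores are illustrative assumptions.

```python
import math

def logistic(z):
    """Logistic function for a single number z."""
    return 1.0 / (1.0 + math.exp(-z))

scores = [-10.0, -1.0, 0.0, 1.0, 10.0]
probs = [logistic(z) for z in scores]

for z, p in zip(scores, probs):
    print(f"sigma({z:6.1f}) = {p:.5f}")

# The curve is symmetric around the point (0, 0.5):
symmetric = all(abs(logistic(z) + logistic(-z) - 1.0) < 1e-12 for z in scores)
```

Large negative scores land close to 0, large positive scores close to 1, and a score of exactly 0 maps to 0.5.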

#### Exercise 3

Calculate the logistic function $\sigma(z)$ for the input

In [None]:
Z = torch.tensor([1.0, 2.0, 3.0, 4.0])

*__Tip:__ `torch.exp()` might be useful in this case.*

In [None]:
1 / (1 + torch.exp(-Z))

*__Note:__ The expected output is* `tensor([0.7311, 0.8808, 0.9526, 0.9820])`

#### Exercise 4

Calculate
$$ l = \max(0, 1- Y \cdot Y_\textrm{pred}) $$
for the input 

In [None]:
Y_pred = torch.tensor([0.5])
Y = torch.tensor([-1.0])

*__Tip:__ there probably is a built in function for max, isn't there?*

In [None]:
torch.maximum(torch.zeros(1), 1 - Y * Y_pred)

*__Note:__ The expected output is:* `tensor([1.5000])`

The function above is called the [hinge loss](https://en.wikipedia.org/wiki/Hinge_loss). It is sometimes used as a loss when training binary classification models, and it follows the same idea of assigning positive $Y$ values to the category *true* (face) and negative $Y$ values to the category *false* (no face). It is best known in the context of *Support Vector Machines* (SVMs). Similar to the MSE loss earlier, we can use the hinge loss to compute a model's goodness of fit:

$$ L = \textrm{mean} \left( l^{(i)} \right) = \frac{1}{N} \sum_{i=1}^N \max \left( 0, 1 - Y^{(i)} \cdot Y^{(i)}_\textrm{pred} \right). $$

*__Note__: When using the hinge loss, the ground truth label $Y$ is chosen as $-1$ or $1$. Many other categorical losses choose $0$ or $1$ instead.*
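
The following plain-Python sketch (the helper `hinge` and the example values are made up for illustration) shows how the per-sample hinge loss behaves for a few label/prediction pairs, and how labels given as $0$ or $1$ can be mapped to $-1$ or $1$.

```python
def hinge(y, y_pred):
    """Per-sample hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * y_pred)

# (label, prediction) pairs
cases = [
    (1.0, 2.0),    # correct with a margin of at least 1: no loss
    (1.0, 0.5),    # correct sign but inside the margin: small loss
    (1.0, -1.0),   # wrong sign: loss grows linearly
    (-1.0, -3.0),  # correct with a large margin: no loss
]
losses = [hinge(y, y_pred) for y, y_pred in cases]
print(losses)

# Mapping {0, 1} labels to {-1, +1}
labels_01 = [0, 1, 1, 0]
labels_pm1 = [2 * y - 1 for y in labels_01]
print(labels_pm1)
```

Note that a prediction only reaches zero loss once it has the right sign *and* a magnitude of at least 1.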

#### Exercise 5

Create an expression that computes the hinge loss for the input

In [None]:
Y_pred = torch.tensor([[1.0], [0.5], [0.0], [-0.5], [-1.0]])
Y = torch.tensor([[1.0], [-1.0], [1.0], [-1.0], [1.0]])

In [None]:
torch.maximum(torch.zeros(1), 1 - Y * Y_pred).mean()

*__Note:__ The expected output is* `tensor(1.)`

#### Exercise 6

Create the expression for the following two functions using PyTorch.

Function 1 is called `sigmoid`, takes $X$ as input, and has the form
$$\textrm{sigmoid}(X) = \frac{1}{1+e^{-X}}$$

Function 2 is called `Y_pred`, takes as input $X$, $W$, $b$, and applies the linear transformation from the example above followed by the sigmoid from function 1. It has the form
$$Y_{pred}(X) = \textrm{sigmoid}(X \cdot W + b)$$

Then, compute `Y_pred(X, W, b)`.

This time, the inputs are a fake dataset of 5 examples with 2 features each, a weight matrix, and a constant (bias).

In [None]:
X = torch.tensor([
    [0.5, 1],
    [0.5, 2],
    [0.5, 3],
    [0.5, 4],
    [0.001, 0.1]
])
W = torch.tensor([
    [0.5],
    [-0.5]
])
b = torch.tensor([0.1])

In [None]:
def sigmoid(X):
    return 1 / (1 + torch.exp(-X))

def Y_pred(X, W, b):
    return sigmoid(X @ W + b)

Y_pred(X, W, b)

*__Note:__ The expected output is* 
```
tensor([[0.4626],
        [0.3430],
        [0.2405],
        [0.1611],
        [0.5126]])
```

The function $Y_\textrm{pred}$ is a machine learning model. It is called *perceptron*, it is a binary *classification model*, and it takes the same approach to classification that we already encountered above.

### 1.3 Computing Gradients

Now we have all the pieces in place to begin looking into training models. For this, we rely on *automatic differentiation*, which is arguably the most important feature of PyTorch. Differentiation with respect to a parameter allows us to compute how the model's output will change if we change this parameter. We call the result a gradient and write $\nabla_\theta Y_\textrm{pred}$ (the gradient of $Y_\textrm{pred}$ with respect to $\theta$). As the model's output affects the goodness of fit score, we can get a sense of how this score will change if we change a parameter of the model. Automatic differentiation then allows us to delegate this task to the computer.

Armed with *automatic differentiation*, we can work out how to change the model's parameters $\theta$ such that the goodness of fit improves for a given set of input features and target labels, i.e., our training set. A caveat of this process is that the gradient is only accurate in a small area around the current value of $\theta$. Hence, instead of updating to the perfect set of model parameters in one shot, we need to do this iteratively and in small steps. We call this algorithm *gradient descent* as we use the gradient to *descend* in small steps to the optimal values of $\theta$ to minimize a certain function (hence the name descent) and maximize the goodness of fit measure as a result. 

In PyTorch, computing the gradient via automatic differentiation looks like this:

In [None]:
Y = torch.tensor([
    [1.0],
    [1.0],
    [1.0],
    [-1.0],
    [1.0]
])  # expected output

X = torch.tensor([
    [0.5, 1],
    [0.5, 2],
    [0.5, 3],
    [0.5, 4],
    [0.001, 0.1]
])  # input

W = torch.tensor([
    [0.5],
    [-0.5]
], requires_grad=True)  # we want to optimize the weights
b = torch.tensor([0.1])  # bias

# prediction
Y_pred = X @ W + b  # linear model

# loss (i.e. difference between expected output and actual output)
loss = ((Y_pred - Y) ** 2).mean()  # MSE loss

# let's now compute the gradients
loss.backward()

W.grad
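
As a sanity check, the gradient of the MSE loss of a linear model can also be worked out by hand: $\nabla_W L = \frac{2}{N} X^\top (XW + b - Y)$. The sketch below repeats the setup from the cell above and compares this formula with what `backward()` stores in `W.grad`. The names `residual`, `manual_grad`, and `matches` are introduced here for illustration.

```python
import torch

X = torch.tensor([[0.5, 1.0], [0.5, 2.0], [0.5, 3.0], [0.5, 4.0], [0.001, 0.1]])
Y = torch.tensor([[1.0], [1.0], [1.0], [-1.0], [1.0]])
W = torch.tensor([[0.5], [-0.5]], requires_grad=True)
b = torch.tensor([0.1])

# Gradient via automatic differentiation
loss = ((X @ W + b - Y) ** 2).mean()
loss.backward()

# Gradient via the hand-derived formula (no tracking needed)
with torch.no_grad():
    residual = X @ W + b - Y
    manual_grad = (2.0 / X.shape[0]) * (X.T @ residual)

matches = torch.allclose(W.grad, manual_grad, atol=1e-6)
print(W.grad, manual_grad, matches)
```

For models with many layers such formulas quickly become tedious, which is why we let PyTorch do the bookkeeping.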

PyTorch records all the operations done on a tensor and, from that, works out how to compute the gradients, which it stores in the `grad` attribute. For this, it needs to know the variables with respect to which we want to compute gradients later. We set `requires_grad=True` to tell PyTorch to watch the operations done on that particular tensor. You can disable the automatic tracking for some operations with `torch.no_grad()`, used either as a context manager or as a decorator:

In [None]:
Y = torch.tensor([
    [1.0],
    [1.0],
    [1.0],
    [-1.0],
    [1.0]
])  # expected output
X = torch.tensor([
    [0.5, 1],
    [0.5, 2],
    [0.5, 3],
    [0.5, 4],
    [0.001, 0.1]
])  # input
W = torch.tensor([
    [0.5],
    [-0.5]
], requires_grad=True)  # we want to optimize the weights
b = torch.tensor([0.1])  # bias

# prediction
Y_pred = X @ W  # this operation is tracked

with torch.no_grad():
    a = Y_pred + b  # this operation is not tracked

@torch.no_grad()
def no_grad_fun(A, B):  # using this function will not affect the gradient
    return A @ B

print(Y_pred.detach())  # use detach to get a tensor that is detached from the gradients' graph

# loss (i.e. difference between expected output and actual output)
loss = ((Y_pred - Y) ** 2).mean()  # MSE loss

# let's now compute the gradients
loss.backward()

W.grad
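
One way to check whether a result is being tracked is to look at its `requires_grad` attribute. The sketch below uses a small made-up input and compares a tracked result, a result computed under `torch.no_grad()`, and a detached result.

```python
import torch

W = torch.tensor([[0.5], [-0.5]], requires_grad=True)
X = torch.tensor([[0.5, 1.0]])

tracked = X @ W            # recorded, because W requires gradients

with torch.no_grad():
    untracked = X @ W      # same numbers, but not recorded

detached = tracked.detach()  # same numbers, cut off from the graph

flags = (tracked.requires_grad, untracked.requires_grad, detached.requires_grad)
print(flags)
```

Only the first result can be used to compute gradients with respect to `W`.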

### 1.4 Gradient Descent

Using gradient descent we can find the minimum of a function:

In [None]:
# find the minimum of y = x^2
def function_to_minimize(X):
    return X * X

X = torch.tensor(5.0, requires_grad=True)
step_size = 0.1  # also called learning rate
n_steps = 25  # change this value to get more or less close to the minimum

for step in range(n_steps):
    # run the function
    y = function_to_minimize(X)
    
    X.grad = None  # reset last step gradients (if any)
    y.backward()  # compute gradients
    
    with torch.no_grad():  # don't record the update to the gradient
        X -= step_size * X.grad  # update X using the gradient 

    print(f"[{step+1:>2}] Minimized value: {function_to_minimize(X).detach().numpy():9.6f}")

As you can see, the algorithm updates `X` in such a way that `function_to_minimize(X)` gets smaller and smaller. With this, we can train our first model in PyTorch!
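
For this particular function the behaviour can be predicted by hand: the gradient of $x^2$ is $2x$, so every step multiplies $x$ by $(1 - 2 \cdot \textrm{step\_size})$, which is $0.8$ for a step size of $0.1$. The plain-Python sketch below (the helper `gradient_descent` is a made-up name) checks this against the loop, and also shows what can happen when the step size is too large.

```python
def gradient_descent(x0, step_size, n_steps):
    """Minimize y = x^2 using its hand-derived gradient 2x."""
    x = x0
    for _ in range(n_steps):
        grad = 2.0 * x
        x = x - step_size * grad
    return x

x_loop = gradient_descent(5.0, step_size=0.1, n_steps=25)
x_closed = 5.0 * (1.0 - 2.0 * 0.1) ** 25  # closed form: x0 * 0.8^n

# With step_size=1.1 the factor is (1 - 2.2) = -1.2, so |x| grows each step
x_large_step = gradient_descent(5.0, step_size=1.1, n_steps=25)

print(x_loop, x_closed, x_large_step)
```

This is the reason the learning rate needs some care: too small and training is slow, too large and the updates overshoot the minimum.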

PyTorch also offers an automatic way of updating our models' parameters: Optimizers. An optimizer takes as input the parameters we want to optimize (e.g. `X` in the previous example) and does the optimization for us. In the following example we use the *Stochastic Gradient Descent* (SGD) optimizer.

In [None]:
# find the minimum of y = x^2
def function_to_minimize(X):
    return X * X

X = torch.tensor(5.0, requires_grad=True)
step_size = 0.1
n_steps = 25  # change this value to get more or less close to the minimum

optimizer = torch.optim.SGD(
    [X],  # the parameter(s) to optimize
    lr=step_size
)

for step in range(n_steps):
    # run the function
    y = function_to_minimize(X)
    
    optimizer.zero_grad()  # reset old gradient
    y.backward()  # compute gradients
    optimizer.step()  # optimize parameters (i.e. X)

    print(f"[{step+1:>2}] Minimized value: {function_to_minimize(X).detach().numpy():9.6f}")
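
With its default settings (no momentum, no weight decay), `torch.optim.SGD` performs the same update as the manual loop from before. The following sketch wraps both variants in two made-up helper functions and compares the final values.

```python
import torch

def run_manual(n_steps, lr):
    x = torch.tensor(5.0, requires_grad=True)
    for _ in range(n_steps):
        y = x * x
        x.grad = None
        y.backward()
        with torch.no_grad():
            x -= lr * x.grad
    return x.item()

def run_sgd(n_steps, lr):
    x = torch.tensor(5.0, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(n_steps):
        y = x * x
        optimizer.zero_grad()
        y.backward()
        optimizer.step()
    return x.item()

x_manual = run_manual(25, 0.1)
x_sgd = run_sgd(25, 0.1)
print(x_manual, x_sgd)
```

The benefit of the optimizer shows once a model has many parameters, or when you want to switch to a different update rule without rewriting the loop.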

#### Exercise 7

Create a function called `mse_loss` that takes as input $Y_{pred}$ and $Y$ and computes the MSE loss. Then, create a function called `objective_function` that takes as input $X$, $Y$, $W$, $b$, and outputs the result of

```
mse_loss(linear_model(X, W, b), Y)
```

where `linear_model()` is `X @ W + b`. Finally, print the output of `objective_function(X_train, Y_train, W, b)` for the variables defined below using `torch.no_grad()`

In [None]:
# Small training dataset of 10 examples

X_train = torch.tensor([
    [-1.],
    [-0.77777778],
    [-0.55555556],
    [-0.33333333],
    [-0.11111111],
    [0.11111111],
    [0.33333333],
    [0.55555556],
    [0.77777778],
    [1.]
])
Y_train = torch.tensor([
    [1.0319852],
    [1.4628717],
    [1.9185644],
    [2.3697991],
    [2.8011405],
    [3.228254 ],
    [3.6749988],
    [4.1398144],
    [4.6044483],
    [5.0241356]
])

# parameters of the model
W = torch.tensor([[0.5]], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

W, b

In [None]:
def mse_loss(Y_pred, Y):
    return torch.mean((Y_pred - Y) ** 2)

def linear_model(X, W, b):
    return X @ W + b

def objective_function(X, Y, W, b):
    return mse_loss(linear_model(X, W, b), Y)

with torch.no_grad():
    print(objective_function(X_train, Y_train, W, b))

*__Note:__ The expected output is* `tensor(10.0723)`.

#### Exercise 8

Use gradient descent *without* using `torch.optim` module to optimize the `objective_function(X_train, Y_train, W, b)`. Run the algorithm for $250$ steps and with a learning rate of $0.01$.

In [None]:
lr = 0.01

for _ in range(250):
    loss = objective_function(X_train, Y_train, W, b)
    
    W.grad = None
    b.grad = None
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        b -= lr * b.grad

W, b

*__Note:__ The expected values of W and b should be close to*
```
(tensor([[1.8069]], requires_grad=True),
 tensor(3.0062, requires_grad=True))
```
The optimal values for this problem are $W = 2$ and $b = 3$; training has come pretty close.
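
For a linear model with a single feature, the least-squares solution can also be computed directly, without any iterations. The sketch below applies the textbook formulas $W = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $b = \bar{y} - W \bar{x}$ to the ten training points from Exercise 7, using plain Python. The names `W_opt` and `b_opt` are introduced here for illustration.

```python
xs = [-1.0, -0.77777778, -0.55555556, -0.33333333, -0.11111111,
      0.11111111, 0.33333333, 0.55555556, 0.77777778, 1.0]
ys = [1.0319852, 1.4628717, 1.9185644, 2.3697991, 2.8011405,
      3.228254, 3.6749988, 4.1398144, 4.6044483, 5.0241356]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least-squares fit for one feature plus a bias
W_opt = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
b_opt = y_mean - W_opt * x_mean

print(f"W = {W_opt:.4f}, b = {b_opt:.4f}")
```

Because the training data is noisy, the fitted values land near, but not exactly on, $W = 2$ and $b = 3$. Running gradient descent for more steps moves `W` and `b` towards this fit.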

#### Exercise 9

Now, achieve the same but by using `torch.optim.SGD`.

In [None]:
# re-initialise the model parameters
W = torch.tensor([[0.5]], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

W, b

In [None]:
lr = 0.01

optim = torch.optim.SGD([W, b], lr=lr)

for _ in range(250):
    loss = objective_function(X_train, Y_train, W, b)
    
    optim.zero_grad()
    loss.backward()
    optim.step()

W, b

Congratulations! You have implemented linear regression from scratch using PyTorch. This is important, because the majority of currently used machine learning algorithms follow the same recipe used here:

1. pick a model to use (here: linear model)
2. pick a goodness of fit measure / loss function (here: MSE loss)
3. pick an optimization algorithm (here: stochastic gradient descent)
4. Use the chosen optimizer to update the model parameters and improve the chosen goodness of fit measure.
5. Repeat 4 an (arbitrarily chosen) number of times. (I recommend blood sacrifices to the RNG gods when choosing a number.)
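
The recipe above can be sketched as one reusable function. This is only one possible way to organize it; the function name `train`, its arguments, and the tiny dataset are assumptions made for this example.

```python
import torch

def train(predict, loss_fn, params, X, Y, lr=0.01, n_steps=250):
    """Generic loop: model (predict), loss (loss_fn), optimizer (SGD)."""
    optimizer = torch.optim.SGD(params, lr=lr)
    history = []
    for _ in range(n_steps):
        loss = loss_fn(predict(X), Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history

# Tiny noise-free dataset following y = 2x + 3
X = torch.tensor([[-1.0], [0.0], [1.0]])
Y = torch.tensor([[1.0], [3.0], [5.0]])

W = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

history = train(
    predict=lambda inputs: inputs @ W + b,              # 1. model
    loss_fn=lambda pred, target: ((pred - target) ** 2).mean(),  # 2. loss
    params=[W, b],
    X=X, Y=Y, lr=0.1, n_steps=200                       # 3.-5. optimizer and steps
)
print(history[0], history[-1], W.item(), b.item())
```

Swapping in a different model or loss only changes the two function arguments; the loop itself stays the same.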

### 1.5 A Toy Dataset

We have built a first regression model; next up is building a (binary) classification model. For this, we will use a *synthetic* dataset (generated fake data) for which `sklearn.datasets` offers nice utilities. In particular, `make_classification()` allows us to generate a dataset with two classes called `Class A` and `Class B`, respectively.

In [None]:
# Use sklearn to generate synthetic data.
X, Y = make_classification(
    n_samples=150, 
    n_features=2, 
    n_redundant=0, 
    n_clusters_per_class=1, 
    flip_y=0,
    class_sep=2.0,
    shuffle=True,
    random_state=1337
)

# sklearn uses labels {0, 1}. Change them to {-1, +1} for the hinge loss.
Y = (2 * Y - 1)

Y = Y.astype(np.float32)
X = X.astype(np.float32)

X_train, Y_train = torch.tensor(X[:100, :]), torch.tensor(Y[:100]).reshape(-1, 1)
X_test, Y_test = torch.tensor(X[100:,:]), torch.tensor(Y[100:]).reshape(-1, 1)

X_train[:5], Y_train[:5]  # show some training samples and their true label

Our job now is to build a model that can predict which class each sample belongs to. This should work for both our training data (`X_train`) as well as our test data (`X_test`). We will begin by visualizing the data.

In [None]:
fig, ax = plt.subplots()

class_a_train = (np.arange(len(X)) < 100) & (Y == -1)
class_a_test = (np.arange(len(X)) >= 100) & (Y == -1)

class_b_train = (np.arange(len(X)) < 100) & (Y == 1)
class_b_test = (np.arange(len(X)) >= 100) & (Y == 1)

ax.set_title("Fire & Ice")
ax.scatter(*X[class_a_train, :].T, c="tab:blue")
ax.scatter(*X[class_a_test, :].T, c="tab:cyan")
ax.scatter(*X[class_b_train, :].T, c="tab:red")
ax.scatter(*X[class_b_test, :].T, c="tab:orange")
ax.legend(["Class A Train", "Class A Test", "Class B Train", "Class B Test"])
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")

As you can see in the figure above, the two classes are cleanly separated and the distributions of training and test data align for both classes. Hence, we would expect the model to do quite well on this task.

#### Exercise 10

Implement the following functions:

1. A function called `hinge_loss` that takes $Y_{pred}$ and $Y$ as input and returns the expected value of the hinge loss (as encountered in Exercise 5).
2. A function called `model` that takes $X$, $W$, $b$ as input and computes `linear_model(X, W, b)`. 
3. A function called `objective_function` that takes $X$, $Y$, $W$, $b$ as input and computes
```
hinge_loss(model(X, W, b), Y)
```

Finally, apply the `objective_function` to the training dataset generated above using `objective_function(X_train, Y_train, W, b)`.

Use these model parameters for this exercise:

In [None]:
W = torch.tensor([
    [0.593],
    [0.236]
], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

In [None]:
def hinge_loss(Y_pred, Y):
    return torch.maximum(torch.zeros(1), 1 - Y_pred * Y).mean()

def model(X, W, b):
    return linear_model(X, W, b)

def objective_function(X, Y, W, b):
    return hinge_loss(model(X, W, b), Y)

with torch.no_grad():
    print(objective_function(X_train, Y_train, W, b))

*__Note:__ The expected output is* `tensor(2.1518)`.

#### Exercise 11
 
Similar to Exercise 9, find values for $W$ and $b$ that minimize `objective_function()`. Do this using `SGD` with a constant stepsize (learning rate) `lr=0.01` and perform 250 steps.

In [None]:
lr = 0.01

In [None]:
optim = torch.optim.SGD([W, b], lr=lr)

for _ in range(250):
    loss = objective_function(X_train, Y_train, W, b)

    optim.zero_grad()
    loss.backward()
    optim.step()

W, b

*__Note:__ After the optimization, the values of W and b should be*
```
(tensor([[-0.8273],
         [ 0.0832]], requires_grad=True),
 tensor(-0.0955, requires_grad=True))
```

Amazing! You have just built and trained a linear SVM (*support vector machine*) in PyTorch from scratch. 

The next step is to investigate the performance of the model. For this, we need to recall how the output of the model relates to each class: `Class B` carries the label `1`, so whenever our model predicts a positive number, we predict that the sample belongs to `Class B`. Similarly, `Class A` carries the label `-1`, and whenever our model predicts a negative number, we predict that the sample belongs to `Class A`.

In [None]:
with torch.no_grad():  # Use no_grad(). Otherwise, the evaluation steps would be recorded in the autograd graph.
    # First check the training accuracy
    correctly_classified_examples = model(X_train, W, b).sign() == Y_train
    correct_percent = correctly_classified_examples.count_nonzero().item() / len(Y_train)
    print(f"Training accuracy is: {correct_percent*100:.2f}%")
    
    # then the test accuracy
    correctly_classified_examples = model(X_test, W, b).sign() == Y_test
    correct_percent = correctly_classified_examples.count_nonzero().item() / len(Y_test)
    print(f"Test accuracy is: {correct_percent*100:.2f}%")


A perfect score!

In addition to the numerical accuracy, we can visualize the so called *decision boundary* of our model. The decision boundary is the line at which our guess switches from one category (`Class A`) to the other (`Class B`). This is useful in understanding how the classifier decides:

In [None]:
fig, ax = plt.subplots()

class_a_train = (np.arange(len(X)) < 100) & (Y == -1)
class_a_test = (np.arange(len(X)) >= 100) & (Y == -1)

class_b_train = (np.arange(len(X)) < 100) & (Y == 1)
class_b_test = (np.arange(len(X)) >= 100) & (Y == 1)

ax.set_title("Fire & Ice")
ax.scatter(*X[class_a_train, :].T, c="tab:blue")
ax.scatter(*X[class_a_test, :].T, c="tab:cyan")
ax.scatter(*X[class_b_train, :].T, c="tab:red")
ax.scatter(*X[class_b_test, :].T, c="tab:orange")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")

points_x = np.linspace(-0.5, 0)

# always remember to detach tensors that require_grad
points_y = -(b.detach() + W.detach()[0] * points_x) / W.detach()[1]
ax.plot(points_x, points_y, linestyle="--", color="red")
ax.legend(["Class A Train", "Class A Test", "Class B Train", "Class B Test", "Decision Boundary"])


Points to the left of the red line are predicted to be `Class B`; points to the right of the red line are predicted to be `Class A`.
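
The formula used for `points_y` in the plot follows from setting the model output to zero: $w_1 x_1 + w_2 x_2 + b = 0$ gives $x_2 = -(b + w_1 x_1) / w_2$. The plain-Python sketch below checks this with the parameter values listed as the expected result of Exercise 11; the helper names `score` and `boundary_x2` are made up for illustration.

```python
# Parameter values from the expected output of Exercise 11
w1, w2, b = -0.8273, 0.0832, -0.0955

def score(x1, x2):
    """Output of the linear model for a single point."""
    return w1 * x1 + w2 * x2 + b

def boundary_x2(x1):
    """x2-coordinate of the decision boundary at a given x1."""
    return -(b + w1 * x1) / w2

# Points on the boundary have a score of (numerically) zero
on_line = [score(x1, boundary_x2(x1)) for x1 in (-0.5, -0.25, 0.0)]

left_score = score(-1.0, 0.0)   # a point far to the left
right_score = score(1.0, 0.0)   # a point far to the right
print(on_line, left_score, right_score)
```

With these parameters, points to the left get a positive score (label `1`) and points to the right get a negative score (label `-1`).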

## 2. Building a Neural Network

As mentioned above, and as you have seen with the two models we have trained so far, training follows a fairly simple recipe. PyTorch picks up on that and ships with a high-level module under `torch.nn`. It contains many handy functions and building blocks to make models and neural networks.

### 2.1 Model Building API

To build a model, we start by creating a class that inherits `torch.nn.Module`. This way, when we train and use the object, PyTorch will know automatically which function to call, which parameters to train, etc.

In [None]:
from torch import nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # here define the structure of the model e.g. :
        # a linear model similar to what we had earlier
        self.simple_network = nn.Linear(in_features=2, out_features=1)
    
    # here define the 'forward' step i.e. compute the output of the model
    def forward(self, x):
        return self.simple_network(x)

`nn.Linear` takes a first argument `in_features` that indicates the dimension of the input vector and another `out_features` that indicates the dimension of the output vector. We can also specify whether we want a bias, the $b$ vector we used earlier, with the `bias` argument (which is `True` by default). We can call each building block of our model as if it was a function that takes an input and returns the output of that block e.g. `self.simple_network(x)`.
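
To connect `nn.Linear` with the hand-written linear model from earlier, the sketch below inspects the layer's parameters and recomputes its output manually. One detail worth knowing is that the layer stores its weight with shape `(out_features, in_features)`, so the manual version needs a transpose. The input values are made up.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=2, out_features=1)  # randomly initialized

x = torch.tensor([
    [1.0, 2.0],
    [3.0, 4.0]
])

with torch.no_grad():
    manual = x @ layer.weight.T + layer.bias  # same math as X @ W + b
    same = torch.allclose(layer(x), manual)

print(layer.weight.shape, layer.bias.shape, same)
```

Because the parameters are initialized randomly, the printed values differ from run to run, but the two ways of computing the output agree.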

In [None]:
# let's try using the model
model = MyModel()

with torch.no_grad():
    Y_pred = model(X_train)

X_train[:5], Y_pred[:5]  # show some training samples and relative predictions

Inside `nn` we can also find some common loss functions. For example, we can use the MSE loss by either calling `nn.functional.mse_loss(Y_pred, Y_train)`, or by creating a loss object with `loss_fn = nn.MSELoss()`, and then using it like `loss = loss_fn(Y_pred, Y_train)`.

Let's now see a complete SVM example using these high-level APIs.

*__Note__: PyTorch doesn't give us an implementation of the hinge loss out-of-the-box due to its simplicity. We can however find implementations of more complex losses. For the following example, to get the same results as before, we use our defined function for `hinge_loss`*

In [None]:
# let's train the model

# we can automatically get the model parameters with its own function
optim = torch.optim.SGD(params=model.parameters(), lr=0.01)

print("[ Parameters Before Training ]")

# let's also see the parameters values before we begin
for a, b in model.named_parameters():
    print(a, b)

print("----")
print("[ Training ]")

loss_fn = hinge_loss

for step in range(250):
    Y_pred = model(X_train)
    loss = loss_fn(Y_pred, Y_train)

    optim.zero_grad()
    loss.backward()
    optim.step()
    
    if step % 25 == 0:
        print(f"[{step:>3}] Loss: {loss.item():9.6f}")


print("----")
print("[ Parameters After Training ]")

# let's also see the parameters values after the training
for a, b in model.named_parameters():
    print(a, b)

We can test the model exactly as before:

In [None]:
with torch.no_grad():  # use no grad
    correctly_classified_examples = model(X_train).sign() == Y_train
    correct_percent = correctly_classified_examples.count_nonzero().item() / len(Y_train)
    print(f"Training accuracy is: {correct_percent*100:.2f}%")
    
    # then the test accuracy
    correctly_classified_examples = model(X_test).sign() == Y_test
    correct_percent = correctly_classified_examples.count_nonzero().item() / len(Y_test)
    print(f"Test accuracy is: {correct_percent*100:.2f}%")

There is virtually no difference between the SVM that you trained above, and the SVM trained here. The `nn` API simply saves you time and produces more readable code.

#### Exercise 12

Train a new model on the fake data using the `nn` API. 

**Step 1**: Build the model.

1. Create a class called `LogisticRegressor` that inherits `nn.Module`.
2. Create a `Linear` layer with the same shape as before.
3. Create a `Sigmoid` layer. In this case, the sigmoid is our *activation function*, a function that we use after each layer of our network (only one in this case) to improve its performance.
4. In the forward function, feed the output of the linear layer to the sigmoid and return the result.

**Step 2**: Create the training loop. 

Use the optimizer `SGD` with learning rate 0.01 and the loss `BCELoss` (binary cross-entropy). Run the training for 200 steps.

**Step 3**: Evaluate the model.

Evaluate the accuracy of the model on the test data $X_{test}$, $Y_{test}$. Note that how you test the accuracy is different than before due to the addition of the sigmoid layer. Before, you had to check the sign of the output to check the predicted class but now your output is constrained between 0 and 1. In this case, we can set a threshold value (e.g. 0.5) to distinguish between the classes and then compare that to the ground truth as before. 
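
The two ways of reading off the predicted class are closely related: since $\sigma(0) = 0.5$ and the sigmoid is increasing, thresholding the sigmoid output at 0.5 selects the same samples as checking the sign of the score before the sigmoid. The sketch below illustrates this with made-up scores and labels.

```python
import torch

z = torch.tensor([-3.0, -0.5, 0.2, 4.0])  # hypothetical scores before the sigmoid
p = torch.sigmoid(z)                      # values between 0 and 1

by_sign = z > 0
by_threshold = p > 0.5

# Labels stored as {-1, +1} can be mapped to {0, 1} for comparison
labels_pm1 = torch.tensor([-1.0, -1.0, 1.0, 1.0])
labels_01 = (labels_pm1 + 1) / 2

accuracy = (by_threshold == (labels_01 > 0.5)).float().mean().item()
print(by_sign, by_threshold, accuracy)
```

A threshold other than 0.5 can be useful when the two kinds of mistakes are not equally costly.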

In [None]:
# (1) Build the model
class LogisticRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.linear(x))

model = LogisticRegressor()

In [None]:
# (2) Create the training loop
optim = torch.optim.SGD(model.parameters(), lr=0.01)

loss_fn = nn.BCELoss()

for _ in range(200):
    Y_pred = model(X_train)
    
    # BCELoss expects targets in [0, 1]; map the {-1, +1} labels to {0, 1}
    loss = loss_fn(Y_pred, (Y_train + 1) / 2)
    
    optim.zero_grad()
    loss.backward()
    optim.step()

In [None]:
# (3) Evaluate the model
with torch.no_grad():
    correct_percent = torch.eq(model(X_train) > 0.5, Y_train > 0).count_nonzero() / len(Y_train)
    print(f"Training accuracy is: {correct_percent*100:.2f}%")

    correct_percent = torch.eq(model(X_test) > 0.5, Y_test > 0).count_nonzero() / len(Y_test)
    print(f"Test accuracy is: {correct_percent*100:.2f}%")

*__Note:__ Both accuracies should be above 90%*

Well done! You have just implemented logistic regression.

#### Exercise 13

Create a *multi-layer perceptron* (MLP) and train it on the same data as before, following the steps below. An MLP is a simple neural network that consists of two logistic regression models stacked sequentially.

**Step 1**: Build the model.

1. Create a class called `MLP` that inherits `nn.Module`.
2. Create 2 `Linear` layers: one with the same input shape as before and 500 as output, and one that takes 500 as input and outputs the same shape as before.
3. Create a `Sigmoid` layer to be applied after each `Linear` layer.
4. In the forward function, feed the output of the first linear layer to the second layer, and return the result (don't forget to also apply the sigmoid!).

**Step 2**: Create the training loop. 

Use the optimizer `SGD` with learning rate 0.01 and the loss `BCELoss` (binary cross-entropy). Run the training for 200 steps.

**Step 3**: Evaluate the model.

In [None]:
# (1) Build the model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(2, 500)
        self.linear2 = nn.Linear(500, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        first_result = self.sigmoid(self.linear1(x))
        final_result = self.sigmoid(self.linear2(first_result))
        return final_result

model = MLP()

In [None]:
# (2) Create the training loop
optim = torch.optim.SGD(model.parameters(), lr=0.01)

loss_fn = nn.BCELoss()

for _ in range(200):
    Y_pred = model(X_train)
    
    loss = loss_fn(Y_pred, Y_train)
    
    optim.zero_grad()
    loss.backward()
    optim.step()

In [None]:
# (3) Evaluate the model
with torch.no_grad():
    correct_percent = torch.eq(model(X_train) > 0.5, Y_train > 0).count_nonzero() / len(Y_train)
    print(f"Training accuracy is: {correct_percent*100:.2f}%")

    correct_percent = torch.eq(model(X_test) > 0.5, Y_test > 0).count_nonzero() / len(Y_test)
    print(f"Test accuracy is: {correct_percent*100:.2f}%")

*__Note:__ Both accuracies should be above 90%*

### 2.2 Real World Data

So far we have used *synthetic data*, artificially generated through the `sklearn` utilities. Separating two blobs of points is a good first example, but not very similar to real-world problems.

Thus, we will now move on to widely available datasets collected from the real world. A very popular example is [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html), a collection of images, stored as 32x32px in 3-channel colors, that can be classified among ten categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Since the dataset is very common and often used as a benchmark in Machine Learning, it is easily available through PyTorch utilities.

In [None]:
import torchvision

train_data = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
test_data = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)

The images look something like the following:

In [None]:
fig, ax_matrix = plt.subplots(2, 4)

for idx in range(8):
    ax = ax_matrix[idx // 4, idx % 4]
    ax.imshow(train_data.data[idx])
    ax.axis("off")

fig.tight_layout()
plt.show()  # show the figure only after it has been fully drawn

*__Note__: Actual real world data is messy. It needs labeling, cleaning and preprocessing before it can be used for ML. CIFAR10, and all other benchmark datasets, have this done already. This is great for learning how algorithms work, but bad for learning how to do ML in the wild. We will, hence, revisit data cleaning during Assignment 1.*

Also, the dataset we just downloaded is not ready to use as it is now. Each data point (image) is stored as a `PILImage`, while, of course, we want to work with tensors. In the following code, we use `torchvision.transforms` to achieve this, and we also normalize the values inside. This module can also apply many different transformations that may come in handy depending on the dataset e.g. `CenterCrop`, `Resize`, etc.

In [None]:
import torchvision.transforms as transforms

transform = transforms.Compose(  # Do multiple transformations
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ]
)

train_data.transform = transform
test_data.transform = transform

transform
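To see what these two transformations do numerically, here is a small plain-Python sketch of the arithmetic. It assumes the usual behaviour of `ToTensor` on 8-bit images (scaling pixel values from 0..255 to 0..1) and of `Normalize` (subtracting the mean and dividing by the standard deviation, per channel). The helper names are made up for illustration and are not part of `torchvision`.

```python
def to_tensor_scale(pixel):
    """Mimics the value scaling of ToTensor: 0..255 integers -> 0.0..1.0 floats."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Mimics Normalize for one channel: (value - mean) / std."""
    return (value - mean) / std

for pixel in (0, 128, 255):
    scaled = to_tensor_scale(pixel)
    print(pixel, round(scaled, 3), round(normalize(scaled), 3))
```

With a mean and standard deviation of 0.5, the pixel values end up in the range [-1, 1], centered around zero. `ToTensor` additionally reorders the image from height x width x channels to channels x height x width, which this sketch does not show.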

Real world datasets are often too big to store in the computer's main memory, much less in the GPU's memory. Even if we manage to fit the samples in memory, when we train we need to keep a lot of extra state. This means that we often need to work in smaller *batches* (groups of samples), loading data on demand from disk, and discarding each batch before we load the next. With an added bit of random shuffling, this turns out to be so good for training that people use batches even if the dataset is small enough to fit in memory.

PyTorch provides utilities for loading data and batching it in the module `torch.utils.data`. Two important classes are `Dataset` and `DataLoader`.

* `Dataset` objects like `train_data` and `test_data` handle grabbing the data from disk or elsewhere; in the case of `CIFAR10` they get it directly from RAM (because CIFAR-10 is small). 
* `DataLoader` objects handle creating batches and iterating over the data.
   
Let's create some loaders for our data:

In [None]:
batch_size = 4

train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)  # reshuffle the training samples every epoch
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size)
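To get a feel for what a `DataLoader` yields, the following sketch builds a tiny stand-in dataset with `TensorDataset` (random tensors, not CIFAR10) and inspects the shape of each batch. The names `toy_data` and `toy_loader` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 10 random "images" with CIFAR10's shape and random labels
toy_images = torch.randn(10, 3, 32, 32)
toy_labels = torch.randint(0, 10, (10,))
toy_data = TensorDataset(toy_images, toy_labels)

# shuffle=True draws the samples in a new random order on every pass
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_loader]
print(batch_shapes)
```

Each iteration returns one batch of images together with the matching labels. Since 10 is not a multiple of 4, the last batch is smaller than the others; passing `drop_last=True` would discard it instead.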

Now, we want to build a model that is able to successfully classify these images into their right category. To achieve this we will use Convolutional Neural Networks (CNNs).
CNNs typically use, in their architecture, Convolutional layers:

![Convolution image](https://1.cms.s81c.com/sites/default/files/2021-01-06/ICLH_Diagram_Batch_02_17A-ConvolutionalNeuralNetworks-WHITEBG.png)

And Pooling layers:

![Max pooling image](https://pyimagesearch.com/wp-content/uploads/2021/05/Convolutional-Neural-Networks-CNNs-and-Layer-Types.png)

See the Machine Learning Primer for further details on how these layers work.

In the upcoming code we implement a simple CNN to classify our images. Since we are starting to work with bigger datasets and bigger networks it now makes sense to use a GPU (if available).

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# build the model (code partially from pytorch.org docs)
class SimpleCNN(nn.Module):
    def __init__(self, input_channels, output_labels):  
        super().__init__()
        
        self.cnn = nn.Sequential(  # runs each layer sequentially
            nn.Conv2d(input_channels, 6, 5),
            nn.ReLU(),  # a different activation layer than sigmoid
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(16*5*5, output_labels)
        )
        
    def forward(self, x):
        return self.cnn(x)
    
simple_cnn = SimpleCNN(
    input_channels=3,  # we have 3 color channels
    output_labels=10  # we have 10 output categories
).to(device)  # put the model in the GPU if available

simple_cnn
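The input size `16*5*5` of the final `Linear` layer follows from how each layer shrinks the 32x32 image. The sketch below redoes that arithmetic in plain Python, assuming the default stride of 1 and no padding for the convolutions; `conv_out` and `pool_out` are hypothetical helpers, not PyTorch functions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride):
    """Spatial size after a pooling layer along one dimension."""
    return (size - kernel) // stride + 1

size = 32                          # CIFAR10 images are 32x32
size = conv_out(size, kernel=5)    # first Conv2d with a 5x5 kernel
size = pool_out(size, 2, 2)        # first MaxPool2d(2, 2)
size = conv_out(size, kernel=5)    # second Conv2d with a 5x5 kernel
size = pool_out(size, 2, 2)        # second MaxPool2d(2, 2)

flat_features = 16 * size * size   # 16 channels come out of the last convolution
print(size, flat_features)
```

If you change a kernel size, add padding or feed in images of a different resolution, the same calculation tells you what the first `Linear` layer must expect.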

We have defined our first CNN. It will take our image input, and output 10 values: one for each category. A high value means the network thinks the corresponding class is very likely, so all we have to do is pick the highest-scoring class to get a prediction.

We can now train the network. Given the size of the dataset, the training will take longer than our previous 'synthetic' examples. However, the training procedure is the same as before, with the only exception that we use a different loss function: the *cross-entropy loss*.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.SGD(simple_cnn.parameters(), lr=0.001, momentum=0.9)

n_epochs = 2

print("Starting training...")

simple_cnn.train()  # set model in training mode

# loop over the dataset multiple times, similar to our "steps" used before
for epoch in range(n_epochs):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_dataloader):

        # don't forget to move the data to the right device
        # also, data will be freed from memory once out of scope
        outputs = simple_cnn(images.to(device))
        loss = loss_fn(outputs, labels.to(device))
        
        optim.zero_grad()
        loss.backward()
        optim.step()

        # print statistics
        running_loss += float(loss.item())
        if i % 1000 == 999:
            print(f'Epoch: {epoch+1} \t Batch: {i+1} \t Loss: {running_loss / 1000:.3f}')
            running_loss = 0.0

print('Finished Training')
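To make the loss values printed above easier to interpret, here is a plain-Python sketch of the cross-entropy of a single sample: the negative log of the softmax probability assigned to the correct class. `nn.CrossEntropyLoss` expects raw scores (logits) and integer class labels, and by default averages this quantity over the batch. The function `cross_entropy` below is an illustrative re-implementation, not the PyTorch one.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    prob_target = exps[target] / sum(exps)
    return -math.log(prob_target)

# A model that knows nothing spreads its belief evenly over the 10 classes
uniform = cross_entropy([0.0] * 10, target=3)

# A confident prediction is cheap when right and expensive when wrong
confident_right = cross_entropy([0.0, 5.0, 0.0], target=1)
confident_wrong = cross_entropy([0.0, 5.0, 0.0], target=0)

print(uniform, confident_right, confident_wrong)
```

The uniform case gives $\ln 10 \approx 2.3$, so a running loss near that value at the start of training on ten classes means the network is still guessing.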

In [None]:
# Evaluate model

# Set the model in evaluation mode
simple_cnn.eval()

correct_samples = 0
total_samples = 0

Y_pred = []  # our model's predictions, will be used later
Y_true = []  # true labels, will be used later

with torch.no_grad():
    for images, labels in test_dataloader:
        outputs = simple_cnn(images.to(device))

        # the class with the highest value is what we choose as prediction
        _, predicted_index = torch.max(outputs.data, dim=1)  # gets max for each image and returns its value and index
        
        Y_true.extend(list(labels.numpy()))
        Y_pred.extend(list(predicted_index.cpu().numpy()))
        
        total_samples += labels.to(device).size(0)
        correct_samples += (predicted_index == labels.to(device)).sum().item()

print(f'The test accuracy is {100 * correct_samples / total_samples:.2f}%')

We got about 50% accuracy. Not that impressive, right? Well, given that we have ten classes, a random guesser would have 10% accuracy! We can safely say that the model was able to learn at least something.

*__Note__: we have to set the model in training and evaluation mode for some layers to behave in their intended manner. It doesn't make a difference now but it will for more complex networks that use e.g. dropout, batch normalization, etc.*
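As a small illustration of why the mode matters, the sketch below applies a standalone `nn.Dropout` layer (which `SimpleCNN` does not contain) to a tensor of ones in both modes.

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()
train_out = dropout(x)  # entries are randomly zeroed, survivors are scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # in evaluation mode dropout passes the input through unchanged

print(train_out)
print(eval_out)
```

Calling `model.train()` or `model.eval()` on a network switches all of its layers at once, which is why the mode is set on the whole model rather than layer by layer.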

### 2.4 Confusion Matrix

When training models for multi-class classification, the overall accuracy score is still a great tool, but it only tells a fraction of the story. To get a better sense of how well our model is doing, we can additionally look at its *confusion matrix*.

The *confusion matrix* $C$ is a pivot table of the predicted label over the true label. Each row (index $i$) corresponds to one class for the true label $Y$, and each column (index $j$) corresponds to one class for the predicted label $Y_\textrm{pred}$. The cell $C_{ij}$ contains the number of samples with true label $i$ and prediction $j$. The elements in the diagonal, $C_{ii}$, tell us how many samples were correctly classified for each class. The elements outside the diagonal tell us how many samples of a class were *confused* for another class (hence the name). This allows us to detect e.g. if the model is having trouble with one specific class ($C_{ii}$ is low), or the model tends to confuse a class for another ($C_{ij}$ is high for a specific $i\neq j$).

In [None]:
categories = ['Plane', 'Car', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']

ConfusionMatrixDisplay.from_predictions(
    Y_true,
    Y_pred, 
    normalize="pred",
    display_labels=categories,
    values_format=".2f"
)
plt.show()

Looking at this plot, the hardest class to detect is cats (lowest value in the diagonal). They are most commonly confused with dogs.
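To make the definition concrete, here is a minimal plain-Python sketch that builds a confusion matrix by hand for a made-up three-class problem and normalizes it in two ways. The labels below are invented for illustration. In `sklearn`, `normalize="pred"` divides each column by its sum, while `normalize="true"` divides each row by its sum.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """C[i][j] counts the samples with true label i and predicted label j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for a 3-class problem
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 2, 1, 2]

C = confusion_matrix(toy_true, toy_pred, n_classes=3)

# Row-normalized diagonal: fraction of each true class that was recognized
recall = [C[i][i] / sum(C[i]) for i in range(3)]

# Column-normalized diagonal: fraction of each predicted class that was right
col_sums = [sum(row[j] for row in C) for j in range(3)]
precision = [C[j][j] / col_sums[j] for j in range(3)]

print(C)
print(recall)
print(precision)
```

The two normalizations answer different questions: the row-normalized diagonal tells how often a class is detected, the column-normalized diagonal tells how trustworthy a prediction of that class is. It is worth keeping in mind which one a plot shows before reading off the "hardest" class.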

### 2.5 Save and Load the Model

After spending so much time training this model, it may be worth saving it so that we don't need to repeat the whole process each time we want to use it.

In [None]:
# save the model
torch.save(simple_cnn.state_dict(), "my_great_model_trained.pth")

# load the model
new_cnn = SimpleCNN(3, 10)
new_cnn.load_state_dict(torch.load('my_great_model_trained.pth'))
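A quick way to convince yourself that saving and loading preserves the weights is a round trip with a tiny stand-in model. The sketch below uses a single `nn.Linear` layer and a temporary directory instead of the trained CNN; the names `toy_model` and `restored` are illustrative.

```python
import os
import tempfile

import torch
from torch import nn

toy_model = nn.Linear(4, 2)  # small stand-in for a trained network

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pth")
    torch.save(toy_model.state_dict(), path)

    restored = nn.Linear(4, 2)  # must have the same architecture as the saved model
    # map_location lets a checkpoint saved on a GPU load on a CPU-only machine
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to evaluation mode before using the model for predictions

same_weights = all(
    torch.equal(p, q) for p, q in zip(toy_model.parameters(), restored.parameters())
)
print(same_weights)
```

Note that the state dict only stores the parameters, not the class definition, so the code defining the model must be available when loading. A freshly created model also lives on the CPU until it is moved with `.to(device)`.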

#### Exercise 14

Extend our simple CNN with more `Linear` layers.

**Step 1**: Create a `ComplexCNN` class that has the same architecture as our `SimpleCNN` one and takes as input `middle_layer`.

**Step 2**: Change the last `Linear` layer to have output dimension of 120.

**Step 3**: Add a `ReLU` layer, a `Linear` layer with out_dimension given by `middle_layer` defaulting to 84, another `ReLU` layer and finally another `Linear` one to output our category.

**Step 4**: Train and evaluate the model on the CIFAR10 dataset. Print the final test accuracy.

In [None]:
# (1) Build the model
class ComplexCNN(nn.Module):
    def __init__(self, input_channels, output_labels, middle_layer=84):  
        super().__init__()
        
        self.cnn = nn.Sequential(  # runs each layer sequentially
            nn.Conv2d(input_channels, 6, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Flatten(),
            nn.Linear(16*5*5, 120),
            nn.ReLU(),
            nn.Linear(120, middle_layer),
            nn.ReLU(),
            nn.Linear(middle_layer, output_labels)
        )
        
    def forward(self, x):
        return self.cnn(x)
    
complex_cnn = ComplexCNN(
    input_channels=3,
    output_labels=10,
    middle_layer=84
).to(device)

complex_cnn

In [None]:
# (2) Create the training loop
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.SGD(complex_cnn.parameters(), lr=0.001, momentum=0.9)

n_epochs = 2

print("Starting training...")

complex_cnn.train()  # set model in training mode

# loop over the dataset multiple times, similar to our "steps" used before
for epoch in range(n_epochs):
    for i, (images, labels) in enumerate(train_dataloader):

        # don't forget to move the data to the device
        # also, data will be freed from memory once out of scope
        outputs = complex_cnn(images.to(device))
        loss = loss_fn(outputs, labels.to(device))
        
        optim.zero_grad()
        loss.backward()
        optim.step()

print('Finished Training')

In [None]:
# (3) Evaluate the model
complex_cnn.eval()

correct_samples = 0
total_samples = 0

Y_pred = []  # our model's predictions, will be used later
Y_true = []  # true labels, will be used later

with torch.no_grad():
    for images, labels in test_dataloader:
        outputs = complex_cnn(images.to(device))

        # the class with the highest value is what we choose as prediction
        _, predicted_index = torch.max(outputs.data, dim=1)  # gets max for each image and returns its value and index
        
        Y_true.extend(list(labels.numpy()))
        Y_pred.extend(list(predicted_index.cpu().numpy()))
        
        total_samples += labels.to(device).size(0)
        correct_samples += (predicted_index == labels.to(device)).sum().item()

print(f'The test accuracy is {100 * correct_samples / total_samples:.2f}%')

#### Exercise 15
Plot the confusion matrix for `complex_cnn`.

In [None]:
ConfusionMatrixDisplay.from_predictions(
    Y_true,
    Y_pred, 
    normalize="pred",
    display_labels=categories
)
plt.show()

#### Exercise 16
Save and load the `ComplexCNN` model.

In [None]:
# save the model
torch.save(complex_cnn.state_dict(), "my_great_complex_model_trained.pth")

# load the model
new_cnn = ComplexCNN(3, 10, middle_layer=84)
new_cnn.load_state_dict(torch.load('my_great_complex_model_trained.pth'))

*__Sidenote:__ One aspect of training that we haven't covered yet is hyper-parameter tuning. Varying the batch size, learning rate, number of units per layer, etc. has a direct effect on the performance of your model. If you have the time, try changing the following parts of your model: (1) Increase the dimension of one or more `Linear` layers. (2) Try adding an `nn.Dropout(p=0.1)` layer between the `Linear` layers and experiment with different values of `p` (`nn.Dropout2d` is meant for the feature maps produced by convolutional layers). (3) Decrease the learning rate. Then, re-train your model and check the accuracy.*

### 2.6 Cross-Validation

We have talked in the **ML Primer** about how models might "memorize" the training data too closely, and that is why we use separate train, validation and test sets to evaluate their performance. The ability of a model to respond well to new data is referred to as *generalization*. This is somewhat captured by the test-set accuracy, but can we quantify it better?

One option is to train the model several times using different data. This way, we obtain several estimates of the accuracy, and we can see if the training process is stable (similar results, low variance) or unstable (distinct results, high variance). Using completely separate samples would be very wasteful, but there is a more economical solution: *$K$-fold cross-validation*. It works as follows:

1. Split the whole dataset into $K$ *folds* (chunks of samples).
2. For each fold:
    1. Reserve the fold, and join all other folds into a training set.
    2. Train on the joined set, test on the reserved fold.

This gives $K$ separate estimates of the accuracy, so we can obtain an *average accuracy* and an *accuracy variance*. A high average means the model tends to produce good predictions; a low variance means the result depends little on which samples ended up in the training set, which is a sign of stable training and good generalization.
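The splitting procedure above can be sketched in a few lines of plain Python. This is only an illustration of the idea (in practice `sklearn.model_selection.KFold` does the same job), and the accuracies at the end are made-up numbers used to show how the estimates are summarized.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k chunks of (almost) equal size
    for i, val in enumerate(folds):
        # every fold except the reserved one goes into the training set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(n_samples=10, k=5))
for train, val in splits:
    print(sorted(val), len(train))

# Hypothetical per-fold accuracies, only to show how to summarize them
fold_accuracies = [0.52, 0.49, 0.51, 0.50, 0.48]
mean_acc = statistics.mean(fold_accuracies)
var_acc = statistics.variance(fold_accuracies)
print(mean_acc, var_acc)
```

Every sample is used for validation exactly once and for training $K-1$ times, which is what makes the scheme economical.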

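Another thing we can do with cross-validation is hyper-parameter tuning: we train models with different candidate values and compare their accuracy on held-out folds. In the simplified version below, each fold is used to train and validate a model with a different candidate value.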

Evaluating the accuracy using cross-validation in `sklearn` is quite easy, and you can find a [great practical overview](https://scikit-learn.org/stable/modules/cross_validation.html) over various types of cross-validation on their website.

Let's use a cross-validation approach to decide the learning rate we're going to use.

In [None]:
from sklearn.model_selection import KFold

def train_batch(model, inputs, labels, optim):
    model.train()
    outputs = model(inputs)
    loss = loss_fn(outputs, labels)
    
    optim.zero_grad()
    loss.backward()
    optim.step()
    
    return float(loss.item())

@torch.no_grad()
def eval_batch(model, inputs, labels):
    model.eval()
    outputs = model(inputs)

    # the class with the highest value is what we choose as prediction
    _, predicted_index = torch.max(outputs.data, dim=1)  # gets max for each image and returns its value and index
    
    return (predicted_index == labels).sum().item(), labels.size(0)
    

folds = 5 # number of chunks to split the dataset into

learning_rates = [0.1 / 10**x for x in range(folds)]  # 0.1, 0.01, ..., one candidate per fold

for i, (train_indices, val_indices) in enumerate(
        KFold(n_splits=folds, shuffle=True).split(train_data.data, train_data.targets)
    ):  
    
    train = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_data, train_indices),
    )
    val = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_data, val_indices)
    )
    
    model = SimpleCNN(3, 10).to(device)
    lr = learning_rates[i]
    optim = torch.optim.SGD(model.parameters(), lr=lr)
    
    loss = []
    
    # train only one epoch
    for images, labels in train:
        loss.append(train_batch(model, images.to(device), labels.to(device), optim))

    # eval: no explicit no_grad block is needed because eval_batch is decorated with @torch.no_grad()
    correct_samples, total_samples = 0, 0
    for images, labels in val:
        correct, total = eval_batch(model, images.to(device), labels.to(device))
        correct_samples += correct
        total_samples += total
        
    print(f"Accuracy with {lr} as learning rate: {100 * correct_samples / total_samples:.2f}%")

We found out that the best learning rate is 0.001. This is expected: with so few training steps, lower values cannot have an immediate effect, while bigger values only result in big oscillations in the weights.

*__Note__: A good practice when using cross-validation for hyper-parameter tuning is to use nested cross-validation. In this way we have an inner cross-validation for each single model, which gives a better accuracy estimate, and an outer validation to find the best model (relative to a hyper-parameter).*
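A step short of full nested cross-validation is to evaluate every candidate on every fold, so that each candidate gets its own mean and spread. The sketch below shows this on a deliberately tiny problem: fitting the slope of a line with plain SGD in pure Python instead of training a CNN. The data, the candidate learning rates and all function names are made up for illustration.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

def fit_slope(xs, ys, lr, epochs=5):
    """Fit y = w * x with plain SGD on the squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= lr * 2 * (w * x - y) * x
    return w

def mse(w, xs, ys):
    """Mean squared error of the line y = w * x on the given samples."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data: y = 3x plus a little noise
rng = random.Random(0)
toy_x = [rng.uniform(-1, 1) for _ in range(60)]
toy_y = [3 * x + rng.gauss(0, 0.1) for x in toy_x]

candidates = [0.5, 0.05, 0.0005]
results = {}
for lr in candidates:
    fold_errors = []
    # every candidate is trained and validated on every fold
    for train, val in k_fold_indices(len(toy_x), k=5):
        w = fit_slope([toy_x[i] for i in train], [toy_y[i] for i in train], lr)
        fold_errors.append(mse(w, [toy_x[i] for i in val], [toy_y[i] for i in val]))
    results[lr] = (statistics.mean(fold_errors), statistics.stdev(fold_errors))

best_lr = min(results, key=lambda lr: results[lr][0])
for lr, (mean_err, std_err) in results.items():
    print(f"lr={lr}: validation MSE {mean_err:.4f} +/- {std_err:.4f}")
print("best:", best_lr)
```

Because all candidates see the same folds, differences in the scores come from the hyper-parameter rather than from which samples happened to land in the validation set. The price is $K$ times more training per candidate.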

#### Exercise 17
Use cross-validation to find the best value among `[20, 80, 160]` for the `middle_layer` parameter of `ComplexCNN`.

(Remember that we left the output dimension of one linear layer as an init parameter?)

In [None]:
folds = 3  # number of chunks to split the dataset into

neurons = [20, 80, 160]

for i, (train_indices, val_indices) in enumerate(KFold(n_splits=folds, shuffle=True).split(train_data.data, train_data.targets)):  
    train = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_data, train_indices),
    )
    val = torch.utils.data.DataLoader(
        torch.utils.data.Subset(train_data, val_indices)
    )
    
    model = ComplexCNN(3, 10, neurons[i]).to(device)
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    
    loss = []
    
    # train only one epoch
    for images, labels in train:
        loss.append(train_batch(model, images.to(device), labels.to(device), optim))

    # eval
    correct_samples, total_samples = 0, 0
    for images, labels in val:
        correct, total = eval_batch(model, images.to(device), labels.to(device))
        correct_samples += correct
        total_samples += total
        
    print(f"Accuracy with {neurons[i]} as neurons in the middle layer: {100 * correct_samples / total_samples:.2f}%")