# Laserscanning - Exercise 9
PyTorch and deep learning.
Please use this notebook locally on your own machine as the cluster is currently in a fragile state. If you are facing any problem you cannot solve, feel free to contact me via email!

#### Please upload the implemented solutions by <u>31.01.2023</u> to the studip folder of your group. The file name should follow this format:
##### EX09_Group_XX.ipynb (e.g. EX09_Group_04.ipynb)

Please also edit the following:

<u>Group XX:</u>

| Firstname | Lastname |
| :--- | :--- |
| firstname1 | lastname1 |
| firstname2 | lastname2 |
| firstname3 | lastname3 |

In [None]:
import numpy as np
import torch

# Getting to know PyTorch
We will use a library called PyTorch for our experiments. It integrates well with Python and the `numpy` library. It is [very well documented](https://pytorch.org/docs/stable/index.html).

It has tensor manipulation capabilities which are often analogous to numpy. However, it goes beyond that: it allows computations to run either on the CPU or on the GPU.

In [None]:
a = [1.0, 2.0, 1.0] # List.

In [None]:
np.array(a)  # Numpy array.

In [None]:
torch.tensor(a) # PyTorch tensor.

In [None]:
np.array(a).dtype, torch.tensor(a).dtype  # np has float64 as default, torch has float32.

In [None]:
t = torch.arange(24); t  # arange as in np.

In [None]:
t.view(3,8)  # view generates a view with different dimensions, of the same data.

In [None]:
t.view(2,3,4)  # Same data as 2x3x4 tensor.

In [None]:
t.view(2,3,-1)  # Can also determine remaining dimension by itself (note the -1).

In [None]:
t.view(2,-1)  # Another example.

In [None]:
v = t.view(4,6); v

In [None]:
v.t()  # Transpose. Only for 2D.

In [None]:
v = t.view(2,3,4); v  # May think of: channels x rows x columns.

In [None]:
v.transpose(0,1)  # Transpose dimensions 0 and 1.

In [None]:
v.transpose(1,2)  # Transpose dimensions 1 and 2. Think of: channels x columns x rows.

In [None]:
v  # Show again. Think channels x rows x columns.

In [None]:
v.permute(1,2,0)  # Think channels x rows x columns --> rows x columns x channels.

In [None]:
v.permute(1,2,0).shape
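The cells above state that `view` shows the same data with different dimensions. The following is a minimal sketch of what that implies in practice: writing through a view changes the original tensor, and `permute` also returns a view whose memory layout is usually no longer contiguous. The variable names `p`, `flat` and `shares_storage` are chosen for this illustration only.

```python
import torch

t = torch.arange(24)
v = t.view(2, 3, 4)

# A view shares storage with the original tensor: writing through
# the view changes the original as well.
v[0, 0, 0] = 100
shares_storage = v.data_ptr() == t.data_ptr()

# permute() also returns a view, but with reordered strides, so the
# result is in general no longer contiguous in memory.
p = v.permute(1, 2, 0)
print(t[0].item(), shares_storage, tuple(p.shape), p.is_contiguous())

# view() needs compatible strides. On a non-contiguous tensor, use
# contiguous().view(...) or reshape(...), which copy the data if necessary.
flat = p.contiguous().view(-1)
print(tuple(flat.shape))
```

If a copy is acceptable, `reshape` is the more forgiving choice; `view` is useful when you want to be sure that no copy is made.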

In [None]:
t = torch.tensor([1,2,3])  # When data is integer, torch uses int64.
t.dtype

In [None]:
t = torch.tensor([1.0,2,3])  # When data is float, torch uses float32.
t.dtype

In [None]:
t = torch.tensor([1,2,3], dtype=torch.float64)  # To enforce a certain datatype, use dtype=...
t.dtype

In [None]:
t = torch.tensor([1,2,3])  # t is int64.
print(t.dtype)
u = t.double()  # Convert to float64.
print(u.dtype)
v = t.to(torch.double)  # Also convert to float64.
print(v.dtype)
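Since the notebook mixes `numpy` arrays and PyTorch tensors, here is a small sketch of how the two can be converted into each other. It assumes CPU tensors; the names `t_shared`, `t_copy`, `t32` and `back` are made up for this example.

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 1.0])     # numpy default: float64

t_shared = torch.from_numpy(a)    # keeps float64 and shares memory with a
t_copy = torch.tensor(a)          # keeps float64 as well, but copies the data

a[0] = 5.0                        # visible in t_shared, not in t_copy
print(t_shared.dtype, t_shared[0].item(), t_copy[0].item())

t32 = t_shared.float()            # convert to float32, the usual torch default
back = t32.numpy()                # back to numpy (CPU tensors only)
print(back.dtype)
```

Note that a tensor created from a float64 array stays float64, so an explicit `.float()` is often needed before mixing it with float32 model parameters.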

## That's what it's all good for: put the data on the GPU
...if you have one!

In [None]:
# v = t.to(device="cuda")  # Uncomment this if you have a GPU. Otherwise, it will throw an error.
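A common way to avoid the error on machines without a GPU is to select the device at run time. This is a small sketch of that pattern, not a required part of the exercise; the variable name `device` is a convention, not something PyTorch prescribes.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t = torch.tensor([1, 2, 3])
v = t.to(device)   # no-op on a CPU-only machine, a copy to the GPU otherwise
print(v.device)
```

The same `device` object can later be passed to `model.to(device)` and to every minibatch, so that the code runs unchanged on both kinds of machines.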

# Autograd: computing the gradient automatically
As we have learned, one essential part of optimization is computing the gradient. 'Layered' structures lead to nested function calls, e.g. $x\rightarrow y=f_1(x)\rightarrow z=f_2(y)$ leads to $f_2(f_1(x))$. Then, for the derivative, we need to apply the chain rule, which eventually leads to the backpropagation algorithm shown in the lecture.

## Computing the function value

Let's do an example, assuming three layers, denoted as $l$, $m$ and $n$, computing the function $n(m(l(x,y)))$. During a *forward pass*, we would be interested in the function value $v = n(m(l(x_0,y_0)))$, for given input values $(x_0,y_0)^\top$. For example, $v$ would be the value of the loss for that particular input.

Then, for optimization, we would need the derivative of $v$ with respect to $x$ and $y$, taken at the particular input $x_0$, $y_0$. This can be computed using a *backward pass*, called *backpropagation*.

So the sequence is:
- We start from the 'input layer' $l$, having two variables, $l=(x,y)^\top$.
- Then, our second layer $m$ computes two new values, $m=(l_1^2, \sin(l_2))^\top$.
- Finally, our third layer $n$ computes the square root of the sum: $n = \sqrt{m_1 + m_2}$.

Let's do the computation for the particular values $(x_0, y_0) = (1,2)$.

In [None]:
x, y = 1.0, 2.0
l = torch.tensor([x, y])

We take the square of the first element and the sine of the second element and combine them back into a two-element tensor using `stack()`.

In [None]:
m = torch.stack((l[0]*l[0], torch.sin(l[1])))

Then, we compute the sum of both elements and take the square root.

In [None]:
n = torch.sqrt(m.sum())
n

That is, for the input $(x_0, y_0) = (1,2)$, the function value of $v = \sqrt{x_0^2 + \sin(y_0)}$ is $1.3818$.

## Computing the gradient
Now to compute the gradient with respect to x and y, we would usually do the symbolic computation:
- $\frac{\partial}{\partial x}\sqrt{x^2 + \sin(y)} = \frac{1}{2}\left( x^2 + \sin(y)\right){}^{-1/2} \cdot 2 \cdot x$, and
- $\frac{\partial}{\partial y}\sqrt{x^2 + \sin(y)} = \frac{1}{2}\left( x^2 + \sin(y)\right){}^{-1/2} \cdot \cos(y)$.

Then, we would evaluate those two values at $(x_0, y_0)$. Let's do this 'by hand':

In [None]:
from math import sin, cos
ddx = 0.5 * (x*x + sin(y))**(-0.5) * 2 * x
ddy = 0.5 * (x*x + sin(y))**(-0.5) * cos(y)
torch.tensor((ddx, ddy))

## Automatic computation of the gradient: autograd
This worked, but was quite tedious. We had to do a symbolic computation, and the formulas we got are quite long. From the lecture, we know they will get longer with each layer, and to evaluate them, we will re-compute subparts of the formulas multiple times. This re-computation overhead is solved by backpropagation, which tabulates values and re-uses them.

As it turns out, PyTorch has backpropagation built in.

To start, we do exactly the same thing as above for the forward pass, but **switch on gradient computation**. This is important, as it will instruct PyTorch to keep track of all computations we do with the tensor. While we pile up algebraic manipulations, PyTorch will secretly build a graph of operations, which it will use later to perform backpropagation.

So the following computation is exactly the same as above, with the exception that we switch on the gradient for the initial tensor $l$.

In [None]:
x, y = 1.0, 2.0
l = torch.tensor([x, y], requires_grad=True)  # <-- Note the 'requires_grad=True'.
l

In [None]:
m = torch.stack((l[0]*l[0], torch.sin(l[1])))
n = torch.sqrt(m.sum())
n

So the function value is the same as above. To get the gradient, we just call the backpropagation on $n$, then grab the gradient from $l$.

In [None]:
n.backward()
l.grad

That is the same as our result from above, which we obtained based on the hand-computed symbolic derivative. So amazingly, **without doing any symbolic computation, we get the gradient**. There are three different ways to compute derivatives: symbolically (what we did above by hand), numerically (working with the difference quotient, $(f(x+\varepsilon)-f(x))/\varepsilon$), and automatically (see [here](https://en.wikipedia.org/wiki/Automatic_differentiation) if you are interested). PyTorch uses the last of these.
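To make the numerical variant concrete, here is a minimal sketch of the difference quotient for our function, compared against the symbolic derivatives. The step size `eps = 1e-6` is an assumption for this example; in practice it has to balance truncation error against rounding error.

```python
from math import sin, cos, sqrt

def f(x, y):
    # The function from above: sqrt(x^2 + sin(y)).
    return sqrt(x * x + sin(y))

x0, y0, eps = 1.0, 2.0, 1e-6

# Forward difference quotients, one per input variable.
ddx_num = (f(x0 + eps, y0) - f(x0, y0)) / eps
ddy_num = (f(x0, y0 + eps) - f(x0, y0)) / eps

# Symbolic derivatives from above, simplified using f itself.
ddx_sym = x0 / f(x0, y0)
ddy_sym = 0.5 * cos(y0) / f(x0, y0)

print(ddx_num, ddx_sym)
print(ddy_num, ddy_sym)
```

The numerical values agree with the symbolic ones only up to a small error, and each input variable needs its own function evaluation. Automatic differentiation avoids both drawbacks.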

## Autograd accumulates gradients
Note that autograd **accumulates the gradients**. This is by design, because the normal application is to sum up the loss function, computed for every sample (in a batch or minibatch). Thus, the gradient is the sum of gradients.

See what happens if we compute the function $n$ again:

In [None]:
m = torch.stack((l[0]*l[0], torch.sin(l[1])))
n = torch.sqrt(m.sum())
n

Of course, we get the same function value. What happens to our gradients (stored in $l$)?

In [None]:
n.backward()
l.grad

Note this is twice the values from above. PyTorch has accumulated the gradient in $l$. If this is not intended, you must remove or zero the gradient. E.g. in a training loop, one will zero the gradient after it has been used in an update step.

In [None]:
l.grad.zero_()  # <-- Zero the gradient. Alternatively one could use: l.grad = None
m = torch.stack((l[0]*l[0], torch.sin(l[1])))
n = torch.sqrt(m.sum())
n.backward()
l.grad
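The accumulation is what makes the gradient of a summed loss come out right. As a small sketch with made-up data (the tensors `w` and `X` and the helper `summed_loss` are hypothetical), calling `backward()` on two halves of a batch gives the same gradient as one call on the full batch:

```python
import torch

w = torch.tensor([0.5, -1.0], requires_grad=True)
X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

def summed_loss(batch):
    # Sum (not mean) of a per-sample loss, here simply (x . w)^2.
    return ((batch @ w) ** 2).sum()

# Gradient of the loss summed over all four samples.
summed_loss(X).backward()
full_grad = w.grad.clone()

# Reset, then back-propagate two halves separately: autograd adds them up.
w.grad = None
summed_loss(X[:2]).backward()
summed_loss(X[2:]).backward()
accumulated_grad = w.grad.clone()

print(full_grad, accumulated_grad)
```

With a mean instead of a sum, each partial loss would additionally have to be scaled by its share of the samples.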

# Plane estimation using PyTorch

In the lecture, we learned that a single-layer perceptron with an MSE loss in fact performs a least squares estimation of a plane.

We will first do a standard least squares estimation of a plane.

## Standard least squares estimation
First, we define some random $(x_1, x_2)$ samples (uniformly distributed in $[-1,1]$). We print out the first five.

In [None]:
M = 1000
X = torch.rand((M,2)) * 2 - 1
X[:5]

For the following purposes, it will be useful to have a third column which is all ones. So we rebuild our `X`: the first two columns will be the old `X` and the last column will be all ones.

In [None]:
tmp = torch.empty((M,3))
tmp[:,0:2] = X
tmp[:,2] = 1
X = tmp
X[:5]

Remember the plane equation is $y = w_1 x_1 + w_2 x_2 + b$. Using the matrix $X$, we can now simply obtain the $y$ vector by matrix multiplication:

In [None]:
w1_known, w2_known, b_known = 2.0, 4.0, 1.0
y_exact = X @ torch.tensor((w1_known, w2_known, b_known))
y_exact[:5]

Normally, we do not know `w1_known`, `w2_known`, and `b_known`. We only get random samples that were generated by a process, which produces points on a plane with added random noise. In our case, it is Gaussian noise:

In [None]:
y = torch.normal(mean=y_exact, std=0.5)

Let us plot these points: they are on a plane, with added noise, as we intended.

In [None]:
import ipyvolume as ipv
ipv.clear()
ipv.scatter(X[:,0].numpy(), X[:,1].numpy(), y.numpy(), marker="sphere")
ipv.show()

Now we pretend that we do know the model that produced these random points, but we do not know the parameters. To estimate them, we can do the standard least squares estimation, $\hat{\theta} = (X^\top X)^{-1}X^\top y$.

In [None]:
(X.t() @ X).inverse() @ X.t() @ y

That worked well! The result is close to the `w1_known`, `w2_known`, and `b_known` parameters which we used above to generate the data.
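Forming the inverse explicitly is fine for this small, well-conditioned problem. As a sketch of an alternative, `torch.linalg.lstsq` solves the same least squares problem without an explicit inverse. The data below is regenerated with a fixed seed so that the block is self-contained; `theta_true`, `theta_inv` and `theta_lstsq` are names chosen for this example.

```python
import torch

torch.manual_seed(0)
M = 1000

# Design matrix: two random columns in [-1, 1] and a column of ones.
X = torch.cat((torch.rand(M, 2) * 2 - 1, torch.ones(M, 1)), dim=1)
theta_true = torch.tensor([2.0, 4.0, 1.0])
y = X @ theta_true + 0.5 * torch.randn(M)

# Normal equations with an explicit inverse, as in the cell above.
theta_inv = (X.t() @ X).inverse() @ X.t() @ y

# The same estimate from a least squares solver.
# lstsq expects a 2D right-hand side, hence the unsqueeze/squeeze.
theta_lstsq = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)

print(theta_inv, theta_lstsq)
```

Both estimates should agree up to rounding and lie close to the known parameters.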

## Using optimization to fit parameters
Now we want to proceed to a general optimization approach. As we know, in our linear estimation problem above, this actually makes no sense, because we can figure out the least squares solution in closed form. That is, there is no iteration necessary. However, as soon as we add some nonlinearities, this would not work anymore.

First of all, we define our model. It is $y = w_1 x_1 + w_2 x_2 + b$, so it is just the scalar product of $(x_1, x_2, 1)$ and $\theta=(w_1, w_2, b)$, which can be conveniently expressed as `x @ theta`.

In [None]:
def model(x, theta):
    return x @ theta

Next, we define our loss function, the mean squared error.

In [None]:
def loss_function(y, y_ref):
    diff = y - y_ref
    return (diff @ diff) / y.shape[0]

Here is the training loop. We run for a number of epochs. In each epoch, we clear the gradients, forward propagate all samples (not just a single sample) through the model, compute the loss function, execute backward propagation to get the gradient, and then make a small step (according to the learning rate) in the direction of the negative gradient. That's all.

In [None]:
def training_loop(number_of_epochs, learning_rate, theta, X_training, y_training):
    for epoch in range(1, number_of_epochs+1):
        # Start with a new gradient.
        theta.grad = None
        
        # Compute the function values, using the current parameters.
        y = model(X_training, theta)
        
        # Compute the loss, and back-propagate.
        loss = loss_function(y, y_training)
        loss.backward()
        
        # Now make a small step in negative gradient direction.
        with torch.no_grad():
            theta -= learning_rate * theta.grad
            
        # Output something from time to time.
        if epoch % 50 == 0:
            print("Epoch %4d Loss %f" % (epoch, loss))
            
    return theta

In our examples (although this is not advisable in the neural network case below), we start with $w_1=w_2=b=0$.

In [None]:
theta = torch.zeros(3, requires_grad=True)

Then we run the loop to get...

In [None]:
training_loop(1000, 1e-2, theta, X, y)

So we ended up with an MSE of approximately 0.25, which makes sense because we know we generated the data with a standard deviation of 0.5, i.e. a noise variance of $0.5^2 = 0.25$.
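This noise floor can be checked directly. The sketch below, with freshly generated data and a fixed seed (both assumptions of this example), evaluates our loss at the known parameters and compares it with PyTorch's built-in `mse_loss`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M = 10_000
X = torch.cat((torch.rand(M, 2) * 2 - 1, torch.ones(M, 1)), dim=1)
theta_true = torch.tensor([2.0, 4.0, 1.0])
y_exact = X @ theta_true
y = torch.normal(mean=y_exact, std=0.5)

def loss_function(y, y_ref):
    diff = y - y_ref
    return (diff @ diff) / y.shape[0]

# Loss at the true parameters: only the noise remains, so MSE is about 0.25.
mse_manual = loss_function(y_exact, y)
mse_builtin = F.mse_loss(y_exact, y)
print(float(mse_manual), float(mse_builtin))
```

No choice of parameters can be expected to push the loss on noisy data far below this value.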

We now modify this function in order to record how the loss and the parameters behave during training.

In [None]:
def training_loop(number_of_epochs, learning_rate, theta, X_training, y_training):
    all_losses = []
    all_theta = []
    
    for epoch in range(1, number_of_epochs+1):
        # Start with a new gradient.
        theta.grad = None
        
        # Compute the function values, using the current parameters.
        y = model(X_training, theta)
        
        # Compute the loss, and back-propagate.
        loss = loss_function(y, y_training)
        all_losses.append(float(loss))
        loss.backward()
        
        # Now make a small step in negative gradient direction.
        with torch.no_grad():
            theta -= learning_rate * theta.grad
            
        all_theta.append(theta.detach().numpy().copy())
            
    return np.array(all_theta), all_losses

Then we run again...

In [None]:
theta = torch.zeros(3, requires_grad=True)
thetas, losses = training_loop(1000, 1e-2, theta, X, y)

Now, we can plot how the loss evolves. It converges nicely.

In [None]:
import matplotlib.pyplot as plt
plt.plot(losses);

We also see the three parameters converging to their target values, 2, 4 and 1.

In [None]:
plt.plot(thetas[:,0]); plt.plot(thetas[:,1]); plt.plot(thetas[:,2]);

One thing that bugs us is the learning rate. We have fixed it to 0.01. Can we achieve a faster convergence using a larger learning rate? We set it to 0.1.

In [None]:
theta = torch.zeros(3, requires_grad=True)
thetas, losses = training_loop(1000, 1e-1, theta, X, y)
plt.plot(losses);

In [None]:
plt.plot(thetas[:,0]); plt.plot(thetas[:,1]); plt.plot(thetas[:,2]);

As we see, this worked. It converges faster. We could stop well before the 1000 iterations we used.

Now greed grips us and we set the learning rate to 1, in order to be even faster.

In [None]:
theta = torch.zeros(3, requires_grad=True)
thetas, losses = training_loop(1000, 1.0, theta, X, y)
plt.plot(losses);

This is how the parameters evolve...

In [None]:
plt.plot(thetas[:,0]); plt.plot(thetas[:,1]); plt.plot(thetas[:,2]);

...here in a detailed plot of the first 50 epochs.

In [None]:
plt.plot(thetas[:50,0]); plt.plot(thetas[:50,1]); plt.plot(thetas[:50,2]);

Instead of computing the small step in negative gradient direction by ourselves, we can plug in an optimizer. In this case, we will use the PyTorch stochastic gradient descent optimizer, which does exactly the same thing we did. It is termed stochastic, but it just does a gradient descent for a given batch, be it a minibatch (in which case 'stochastic' makes sense) or the entire training data.

In [None]:
theta = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=1e-2)

There are some small modifications to the main loop. Instead of setting the gradient to `None` by ourselves, we now call `optimizer.zero_grad()`. Instead of our own parameter update, we can now call `optimizer.step()`.

In [None]:
def training_loop(number_of_epochs, optimizer, theta, X_training, y_training):
    all_losses = []
    all_theta = []
    
    for epoch in range(1, number_of_epochs+1):        
        # Compute the function values, using the current parameters.
        y = model(X_training, theta)
        
        # Compute the loss, and back-propagate.
        loss = loss_function(y, y_training)
        all_losses.append(float(loss))
        
        # Compute gradient and make step.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
            
        all_theta.append(theta.detach().numpy().copy())
            
    return np.array(all_theta), all_losses

In [None]:
sgd_thetas, sgd_losses = training_loop(1000, optimizer, theta, X, y)
plt.plot(sgd_losses);

Of course, since it does the same thing, we will still have the same problems if the learning rate is too large.

In [None]:
theta = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=1.0)
thetas, losses = training_loop(1000, optimizer, theta, X, y)
plt.plot(losses);

However, now that we have based everything on an optimizer, we can just plug in another, more elaborate optimizer, e.g. the `Adam` optimizer, to get a different result. This one works even if our initial learning rate is 1.0.

In [None]:
theta = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=1.0)
adam_thetas, adam_losses = training_loop(1000, optimizer, theta, X, y)
plt.plot(adam_losses);

In [None]:
plt.plot(adam_thetas[:200,0]); plt.plot(adam_thetas[:200,1]); plt.plot(adam_thetas[:200,2]);

In [None]:
plt.plot(sgd_losses[:200]); plt.plot(adam_losses[:200]);
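The statement at the start of this section, that a single-layer perceptron with MSE loss performs a plane fit, can also be sketched with PyTorch's own building blocks. Here `nn.Linear(2, 1)` holds $w_1$, $w_2$ and $b$, so no column of ones is needed. The data is regenerated with a fixed seed, and the learning rate 0.1 and 500 epochs are choices made for this example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
M = 1000
X = torch.rand(M, 2) * 2 - 1                      # inputs x1, x2, no ones column
y = X @ torch.tensor([2.0, 4.0]) + 1.0 + 0.5 * torch.randn(M)

perceptron = nn.Linear(2, 1)                      # weights (w1, w2) and bias b
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(perceptron.parameters(), lr=1e-1)

for epoch in range(500):
    optimizer.zero_grad()
    # The layer returns shape (M, 1); squeeze to match y of shape (M,).
    loss = loss_fn(perceptron(X).squeeze(1), y)
    loss.backward()
    optimizer.step()

w = perceptron.weight.detach().squeeze(0)
b = perceptron.bias.detach()
print(w, b, float(loss))
```

The learned weights and bias should end up near 2, 4 and 1, with a final loss close to the noise variance.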

## Some questions...

<u>Can you explain:</u>
1. What happens if the learning rate is too small?
2. What happens if it is too large?
3. Can you explain the behavior of the parameter values during training which we see in all the cases above?

SOME SENTENCES

---
# Optimizing a neural network for image classification
Now, instead of our simple single perceptron plane fit, let us use several layers for a simple image classification problem.

First of all, we need a dataset. PyTorch has built-in support for a number of datasets (see the [list here](https://pytorch.org/vision/stable/datasets.html)). We will follow along the lines of an example in the book by Stevens, Antiga, Viehmann: "Deep Learning with PyTorch", Manning Publications, and work on CIFAR-10 data. This is a collection of 60,000 very small images (32$\times$32) from 10 classes. It is balanced, with 6,000 images per class, and subdivided into a training set (50,000 images) and a validation set (10,000 images).

## Loading a dataset

The following lines will load the training and validation sets. The first time you execute the cell, the dataset will be automatically downloaded to the directory specified by `data_path`, which will be used as a cache.

In [None]:
from torchvision import datasets
data_path = './cifar10/'
cifar10 = datasets.CIFAR10(data_path, train=True, download=True)
cifar10_val = datasets.CIFAR10(data_path, train=False, download=True)

In the dataset, classes use the class numbers 0, 1, ..., 9. The following are the corresponding class names.

In [None]:
class_names = ['airplane','automobile','bird','cat','deer',
               'dog','frog','horse','ship','truck']

Let us look at some of the images and their labels. As noted, the images are really small.

In [None]:
n1, n2 = 8, 10
fig = plt.figure(figsize=(n1, n2))
for i in range(n1*n2):
    ax = fig.add_subplot(n1, n2, 1 + i, xticks=[], yticks=[])
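    img, label = cifar10[i]  # Use 'label' so the tensor l from above is not overwritten.
    ax.set_title(class_names[label])
    ax.imshow(img)  # Draw into this subplot explicitly.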
plt.show()

We need to do some tricks on the data. First, the dataset contains images as `PIL.Image` (Python Imaging Library images), in a format of 32$\times$32 pixels with 3 channels (red, green, blue), i.e., 32$\times$32$\times$3. However, we will need them as a tensor of dimensions channels$\times$rows$\times$columns.

Also, generally in machine learning, we want to *normalize* the input data so that all channels have zero mean and unit standard deviation. We have computed mean and standard deviation for all channels before and just introduce the values below.

Fortunately, `torchvision` allows us to chain transforms (to check which ones are available, run `dir(transforms)`), and to specify them during loading. So we will load the `cifar10` and `cifar10_val` datasets again, this time with transforms specified.

In [None]:
from torchvision import transforms
transform=transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4915, 0.4823, 0.4468),
                         (0.2470, 0.2435, 0.2616))
])
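The per-channel constants passed to `Normalize` were computed beforehand. The following is a sketch of how such values can be obtained and what the normalization does, using random numbers as a hypothetical stand-in for the stacked training images (the real values above come from CIFAR-10, not from this code):

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the stacked training images:
# N x channels x rows x columns, with values in [0, 1] as produced by ToTensor.
imgs = torch.rand(500, 3, 32, 32)

# Per-channel statistics: reduce over every dimension except the channel one.
mean = imgs.mean(dim=(0, 2, 3))
std = imgs.std(dim=(0, 2, 3))

# What Normalize computes for each channel: (x - mean) / std.
normalized = (imgs - mean[None, :, None, None]) / std[None, :, None, None]

print(mean, std)
print(normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```

After the transformation, every channel has mean 0 and standard deviation 1 up to rounding.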

In [None]:
cifar10 = datasets.CIFAR10(data_path, train=True,
    download=False, transform=transform)
cifar10_val = datasets.CIFAR10(data_path, train=False,
    download=False, transform=transform)

## Simplifying the problem to only two classes
To keep our problem really simple, we will work only on two classes: airplane and bird. We will just load the data completely into memory, stored in a list of (tensor, label number) pairs.

In [None]:
label_map = {0: 0, 2: 1}
class_names = ['airplane', 'bird']
cifar2 = [(img, label_map[label]) for img, label in cifar10 
          if label in [0, 2]]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val
              if label in [0, 2]]

Since each class has 5,000 samples in the training set and 1,000 samples in the validation set, we should now have 10,000 samples in the training and 2,000 samples in the validation set. Check this quickly.

In [None]:
len(cifar2), len(cifar2_val)

When training, we have the option to update the unknowns after presenting a *single* data sample, a *subset* of a few data samples, or *all* data samples. As you know from the lecture, usually a small, randomly chosen subset is used, called a *minibatch*. 

Fortunately, there is already a utility class in PyTorch, called `DataLoader`. In our case, we will use it to create randomly sampled (shuffled) subsets of 64 samples each.

In [None]:
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64,
                                           shuffle=True)

`DataLoader` combines dataset and sampling, and provides an iterator over the (sampled, shuffled) data. To see what it does, let us 'run' the iterator for one epoch, printing the shape of the presented data. As we see, 64 images with 3 channels and 32$\times$32 pixels, i.e. a 64$\times$3$\times$32$\times$32 tensor, is the training data. It is accompanied by 64 labels. This together is one minibatch. Since 10,000 = 64 $\times$ 156 + 16, the last minibatch will contain only 16 samples (scroll down to spot this).

In [None]:
[(d[0].shape, d[1].shape) for d in train_loader]
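The batch arithmetic can be verified without the image data. This sketch uses a `TensorDataset` of zeros as a hypothetical stand-in for `cifar2`, with the same number of samples:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for cifar2: 10,000 dummy samples with dummy labels.
data = TensorDataset(torch.zeros(10_000, 1),
                     torch.zeros(10_000, dtype=torch.long))
loader = DataLoader(data, batch_size=64, shuffle=True)

# Number of samples in each minibatch of one epoch.
batch_sizes = [labels.shape[0] for _, labels in loader]

print(len(loader), math.ceil(10_000 / 64), batch_sizes[-1])
```

If the smaller last minibatch is unwanted, `DataLoader` accepts `drop_last=True`, which skips it and leaves 156 full minibatches per epoch.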

## Setting up the model and loss function
In the plane fitting above, our model was the affine function, and the loss function was the mean squared error.

Now, we will have more complicated models, which can fortunately be specified in PyTorch in a quite comfortable way. We will use fully connected layers as follows:
- The first layer will be a linear (affine) layer. Since the images are 3$\times$32$\times$32, it will have 3,072 inputs. It will then reduce these to 128 features, i.e., 128 is the number of features in the hidden layer.
- As activation function, we will use the hyperbolic tangent, `Tanh`.
- Then, another layer will reduce the 128 features to only two features. This is how we represent the outcome of our classification: as two features, one for airplane, and one for bird.
- As we have seen in the lecture, we can convert this output to 'probabilities' using the `softmax` function. Then, we compute the negative log likelihood on that. This is done in our implementation by using the `LogSoftmax` layer, together with the `NLLLoss`.

All those layers are connected in succession - this can be comfortably achieved using the `Sequential` class.

In [None]:
import torch.nn as nn
model = nn.Sequential(
            nn.Linear(3072, 128),
            nn.Tanh(),
            nn.Linear(128, 2),
            nn.LogSoftmax(dim=1))

loss_fn = nn.NLLLoss()

## Setting up the optimizer and running the training

Next, we need an optimizer. We will just take plain stochastic gradient descent. Remember from above that we had to hand over the parameters to be optimized as tensors (with `requires_grad=True`). However, the parameters are now all hidden inside the model; none of them was created by us directly. Fortunately, we can just call `model.parameters()` to get all parameters inside our model. In our case, these are a 3,072$\times$128 weight matrix for the connections between the input layer and the hidden layer (stored by `nn.Linear` as a 128$\times$3,072 tensor), a 128-element bias vector of the hidden layer, another 128$\times$2 weight matrix for the connections between the hidden layer and the output layer (stored as 2$\times$128), and a 2-element bias vector for the output layer.

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

Next comes the training loop. It is very similar to the loop we used above in the plane fitting example. Since our image batches are batch$\times$3$\times$32$\times$32, but our network expects batch$\times$3072, we alter the shape using `view()`.

In [None]:
n_epochs = 100
all_losses = []
for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        outputs = model(imgs.view(imgs.shape[0], -1))
        loss = loss_fn(outputs, labels)
        all_losses.append(float(loss))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if epoch % 10 == 0:
        print("Epoch: %2d, Loss: %f" % (epoch, float(loss)))

## Analysing the behaviour
In the record, we collect the loss of every minibatch (157 per epoch) over all 100 epochs, for a total of 15,700 loss values. When we plot them, we see that generally, the loss is reduced during training. Of course, since the losses are per minibatch, we also see quite some noise.

In [None]:
plt.plot(all_losses);

In order to count how many correct classifications we have, we use the following code. It computes the proportion of correct classifications, for the training data set, reported as accuracy.

In [None]:
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64,
                                           shuffle=False)

correct = 0
total = 0

with torch.no_grad():
    for imgs, labels in train_loader:
        outputs = model(imgs.view(imgs.shape[0], -1))
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
        
print("Accuracy: %f" % (correct / total))

We can do the same for the validation dataset.

In [None]:
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64,
                                         shuffle=False)

correct = 0
total = 0

with torch.no_grad():
    for imgs, labels in val_loader:
        outputs = model(imgs.view(imgs.shape[0], -1))
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
        
print("Accuracy: %f" % (correct / total))
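Since the same counting loop is used for training and validation data, it could be wrapped in a helper. The function `accuracy` and its `flatten` flag below are hypothetical, not part of the notebook; the toy check uses `nn.Identity` as a stand-in model so that the inputs themselves act as class scores.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader, flatten=False):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            if flatten:
                # batch x 3 x 32 x 32 --> batch x 3072 for the linear model.
                inputs = inputs.view(inputs.shape[0], -1)
            predicted = model(inputs).argmax(dim=1)
            total += labels.shape[0]
            correct += int((predicted == labels).sum())
    return correct / total

# Toy check: four samples with two class scores each.
toy_scores = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 4.0], [1.0, 2.0]])
toy_labels = torch.tensor([0, 1, 1, 1])
toy_loader = DataLoader(TensorDataset(toy_scores, toy_labels), batch_size=2)

toy_accuracy = accuracy(nn.Identity(), toy_loader)
print(toy_accuracy)
```

For the fully connected model above, the call would be `accuracy(model, val_loader, flatten=True)`, under the assumption that the loaders are defined as in the preceding cells.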

Out of curiosity, we can ask the model for the total number of parameters.

In [None]:
sum([p.numel() for p in model.parameters()])
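The count can be reproduced by hand from the layer sizes. The sketch below rebuilds the same architecture under a different name (`fc_model`, chosen so that it does not overwrite the trained `model`) and compares the result with the arithmetic:

```python
import torch.nn as nn

fc_model = nn.Sequential(
    nn.Linear(3072, 128),
    nn.Tanh(),
    nn.Linear(128, 2),
    nn.LogSoftmax(dim=1))

# Each linear layer has (inputs x outputs) weights plus one bias per output.
by_hand = (3072 * 128 + 128) + (128 * 2 + 2)

counted = sum(p.numel() for p in fc_model.parameters())
shapes = [tuple(p.shape) for p in fc_model.parameters()]

print(by_hand, counted)
print(shapes)
```

Almost all parameters sit in the first weight matrix, which connects each of the 3,072 input values to each of the 128 hidden features.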

---
# Optimizing a convolutional neural network
We will now use a *convolutional neural network*, instead of the fully connected model, to solve the same task.

First of all, we define the model. It is also in layers, but we use a slightly different way of describing it. Instead of `nn.Sequential()`, we now define a new class, `Net`, which instantiates all layers in the constructor. It has another member function, `forward()`, which computes a forward propagation of a sample. As we know, this forward propagation will also establish the computation graph which is then later used for backpropagation. In detail, our network has:
- an input of three-channel (C), 32$\times$32 tensors, i.e. 3C$\times$32$\times$32,
- a 3$\times$3 convolution from 3 channels to 16 channels, followed by a `Tanh` activation and a maxpool operation with window size (and stride) 2, resulting in 16C$\times$16$\times$16,
- another 3$\times$3 convolution from 16 channels to 8 channels, followed by `Tanh` and maxpool, resulting in 8C$\times$8$\times$8,
- a reshape (using `view`) from 8C$\times$8$\times$8 to a 512-dimensional (D) feature vector,
- a linear layer from 512D to 32D, followed by `Tanh`, and
- another linear layer, reducing from 32D to the required 2D output.

The combination of a `LogSoftmax` layer and the `NLLLoss()` loss function (which we used above) is replaced by `CrossEntropyLoss`, which computes the same loss. (Although the loss is the same, the last layer of our network now does not explicitly represent probabilities or log probabilities. So to get them for test data, we would have to add a `Softmax` layer.)

In [None]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.act1 = nn.Tanh()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.act2 = nn.Tanh()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(8 * 8 * 8, 32)
        self.act3 = nn.Tanh()
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = self.pool1(self.act1(self.conv1(x)))
        out = self.pool2(self.act2(self.conv2(out)))
        out = out.view(-1, 8 * 8 * 8)
        out = self.act3(self.fc1(out))
        out = self.fc2(out)
        return out
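The remark above, that `CrossEntropyLoss` computes the same loss as `LogSoftmax` followed by `NLLLoss`, can be checked on a few random values. The tensors `logits` and `labels` are made up for this sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 2)             # raw outputs of a last linear layer
labels = torch.tensor([0, 1, 1, 0])

# Variant used for the fully connected model.
loss_a = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

# Variant used for the convolutional model.
loss_b = nn.CrossEntropyLoss()(logits, labels)

# To inspect 'probabilities', apply softmax over the class dimension.
probs = torch.softmax(logits, dim=1)

print(float(loss_a), float(loss_b))
print(probs.sum(dim=1))
```

Because softmax does not change which class has the largest value, the predicted class can be read off the raw outputs directly.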

Now let us run the training loop!

In [None]:
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64,
                                           shuffle=True)
model = Net()
loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

n_epochs = 100
all_losses = []
for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        outputs = model(imgs)
        loss = loss_fn(outputs, labels)
        all_losses.append(float(loss))
                
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if epoch % 10 == 0:
        print("Epoch: %2d, Loss: %f" % (epoch, float(loss)))

As above, we can have a look at the losses, recorded for each minibatch over all epochs.

In [None]:
plt.plot(all_losses);

And we can look at the training accuracy...

In [None]:
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64,
                                           shuffle=False)

correct = 0
total = 0

with torch.no_grad():
    for imgs, labels in train_loader:
        outputs = model(imgs)
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
        
print("Accuracy: %f" % (correct / total))

...as well as the validation accuracy.

In [None]:
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64,
                                         shuffle=False)

correct = 0
total = 0

with torch.no_grad():
    for imgs, labels in val_loader:
        outputs = model(imgs)
        _, predicted = torch.max(outputs, dim=1)
        total += labels.shape[0]
        correct += int((predicted == labels).sum())
        
print("Accuracy: %f" % (correct / total))

Fortunately, when we assembled all the layers in our `Net` class above, `nn.Module` registered every layer that we assigned as an attribute and thereby kept track of all the parameters. So as before, we can just ask for the parameters using `model.parameters()`, and thus count the total number of parameters.

In [None]:
sum([p.numel() for p in model.parameters()])
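As for the fully connected model, the number can be derived from the layer description. The sketch below traces the tensor shapes through stand-alone copies of the two convolution stages (with a dummy all-zero input) and adds up the parameters by hand:

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2)

# Dummy batch with a single 3 x 32 x 32 image.
x = torch.zeros(1, 3, 32, 32)
s1 = pool(torch.tanh(conv1(x)))    # padding=1 keeps 32 x 32, pooling halves it
s2 = pool(torch.tanh(conv2(s1)))
print(tuple(s1.shape), tuple(s2.shape), tuple(s2.view(-1, 8 * 8 * 8).shape))

# Convolution: in_channels * out_channels * 3 * 3 weights plus one bias
# per output channel. Linear: inputs * outputs weights plus one bias per output.
conv_params = (3 * 16 * 3 * 3 + 16) + (16 * 8 * 3 * 3 + 8)
fc_params = (512 * 32 + 32) + (32 * 2 + 2)
by_hand = conv_params + fc_params

counted_conv = sum(p.numel() for layer in (conv1, conv2)
                   for p in layer.parameters())
print(conv_params, counted_conv, by_hand)
```

The convolution kernels are shared across all image positions, which is why the two convolution layers contribute so few parameters.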

## Some questions...

<u>You have run two networks for image classification. What is the major difference between them? Which one makes more sense, and why?</u>

SOME SENTENCES

<u>Above, we plotted the evolution of the loss function for both networks. Please comment on the common aspects and their differences.</u>

SOME SENTENCES

<u>We have also output the accuracy for the training and the validation datasets, and the number of parameters. What kind of conclusions do you draw from these figures?</u>

SOME SENTENCES

<u>What would you suggest to improve the performance of our network, considering what you learned in the lecture? (Remember that it would even get more complicated when we switch to 10 classes.) </u>

SOME SENTENCES