# In-class exercise 10: PyTorch from the bottom up

Based on Jeremy Howard's PyTorch tutorial "What is torch.nn really?"

In this tutorial we will start at [PyTorch](https://pytorch.org/docs)'s lowest layer and then gradually introduce functions and features until we arrive at `nn.Sequential`. Lower layers give you more control over what you want to do, while higher layers let you implement models faster. So in practice you have to choose at which layer you want to work. Moreover, knowing how the lower layers work will give you a better understanding of what is happening behind the scenes when working with the higher-level abstractions.

```python
import numpy as np
import torch
import torchvision
import matplotlib.pyplot as plt
```

# Download the data
In this tutorial we will be working with the MNIST dataset. This is a classic dataset consisting of grayscale images of handwritten digits.

We will use [torchvision](https://pytorch.org/docs/stable/torchvision) to download the dataset. Torchvision also provides a lot of functionality for data preprocessing and augmentation, which is beyond the scope of this tutorial.

The input data $\mathbf{X}$ and targets $\mathbf{y}$ are stored in the dataset's `data` and `targets` attributes, so at first we just extract these and work with the raw tensors. We also rescale the pixel values from integers in [0, 255] to floats between 0 and 1.

The MNIST training split consists of 60,000 28x28 images, each showing a single digit (0 to 9).

Note that setting `train=True` gives you the development set, i.e. both training and validation data. So we need to split this further.
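A minimal sketch of these steps, assuming the `torchvision.datasets.MNIST` class, flattened 784-dimensional inputs and a 50,000/10,000 train/validation split (the split sizes are an assumption, they are not given in the text). The helper names `load_mnist_dev` and `preprocess` are illustrative. The preprocessing is demonstrated on a random stand-in tensor so that it runs without a download.

```python
import torch


def load_mnist_dev(root="./data"):
    """Download the MNIST development set and return the raw images and labels."""
    import torchvision  # imported here so the rest of the sketch runs offline

    dev_set = torchvision.datasets.MNIST(root, train=True, download=True)
    return dev_set.data, dev_set.targets


def preprocess(data, targets, num_train=50_000):
    """Scale pixels to [0, 1], flatten the images and split into train/validation."""
    x = data.float().div(255).reshape(len(data), -1)
    y = targets.long()
    return (x[:num_train], y[:num_train]), (x[num_train:], y[num_train:])


# Random stand-in with the same dtype and image shape as the raw MNIST tensors.
fake_data = torch.randint(0, 256, (100, 28, 28), dtype=torch.uint8)
fake_targets = torch.randint(0, 10, (100,))
(x_train, y_train), (x_val, y_val) = preprocess(fake_data, fake_targets, num_train=80)
print(x_train.shape, x_val.shape)
```

With the real data you would call `preprocess(*load_mnist_dev())` instead.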

# torch.Tensor
PyTorch uses its own `torch.Tensor` datatype. It is very similar to a NumPy array, but it can also be moved to a GPU and used for calculations there, and it can store gradient information, which enables automatic differentiation (backpropagation) on a dynamically built computation graph.

We start by manually setting up an affine layer. Setting the `requires_grad` flag tells PyTorch that these weights require gradients. PyTorch will then record all operations performed on the tensor, so backpropagation can be done automatically. Thanks to this ability we can use any normal function as a model in PyTorch.

Note that we initialize the weights via Xavier (Glorot) initialization. We only activate gradients after initialization, since the initialization itself should not be recorded for backpropagation.

A trailing `_` in the name of a PyTorch method denotes an in-place operation.
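The setup described above could look roughly as follows. This is a sketch that assumes the normal-distribution variant of Xavier initialization and a zero-initialized bias. The names `weight` and `bias` follow the later text, the rest is illustrative.

```python
import math
import torch

torch.manual_seed(0)
num_inputs, num_outputs = 28 * 28, 10

# Xavier (Glorot) normal initialization: std = sqrt(2 / (fan_in + fan_out)).
std = math.sqrt(2.0 / (num_inputs + num_outputs))
weight = torch.randn(num_inputs, num_outputs) * std
weight.requires_grad_()  # trailing underscore: switches the flag on in place

bias = torch.zeros(num_outputs, requires_grad=True)
print(weight.requires_grad, weight.grad_fn)
```

Because gradients are switched on only after the scaling, `weight` is a leaf tensor without a `grad_fn`, i.e. the initialization is not part of the computation graph.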

We now use these weights to create a simple linear model (i.e. logistic regression). We furthermore define a loss (negative log-likelihood) for training and a function to obtain the prediction accuracy.
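A possible implementation of the model, the loss and the accuracy is sketched below. The function names `log_softmax`, `neg_loglikelihood` and `get_accuracy` are taken from the surrounding text, but their bodies are assumptions. The weights use a simple scaled normal initialization and the batch is a random stand-in for real images.

```python
import torch

torch.manual_seed(0)
weight = (torch.randn(784, 10) * 0.05).requires_grad_()
bias = torch.zeros(10, requires_grad=True)


def log_softmax(x):
    # Subtracting logsumexp turns the logits into log-probabilities.
    return x - x.logsumexp(dim=-1, keepdim=True)


def model(x):
    return log_softmax(x @ weight + bias)


def neg_loglikelihood(log_probs, target):
    # Pick the log-probability of the correct class for each sample.
    return -log_probs[torch.arange(len(target)), target].mean()


def get_accuracy(log_probs, target):
    return (log_probs.argmax(dim=1) == target).float().mean()


x = torch.rand(64, 784)            # random stand-in for a batch of images
y = torch.randint(0, 10, (64,))    # random stand-in labels
preds = model(x)
print(neg_loglikelihood(preds, y), get_accuracy(preds, y))
```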

Let's see how our model performs before training.

tensor(0.1250)


We can now define a training loop. In this loop we need to
1. Get a mini-batch of data. When using dynamic computation graphs like in PyTorch it is important to choose a batch size that is large enough to leverage your hardware properly.
2. Generate predictions with our model
3. Calculate the loss
4. Compute the gradients via `loss.backward()`
5. Update the `weight` and `bias` based on the gradients (optimization) and reset the gradients, since PyTorch accumulates them across `backward()` calls
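The steps above can be sketched as follows. The data is a synthetic stand-in (zero-mean random features with labels from a random linear map), and the batch size, learning rate and number of epochs are assumptions chosen so that the example runs quickly.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for the flattened images and their labels.
x_train = torch.randn(256, 784) * 0.3
y_train = (x_train @ torch.randn(784, 10)).argmax(dim=1)

weight = (torch.randn(784, 10) * 0.05).requires_grad_()
bias = torch.zeros(10, requires_grad=True)


def model(x):
    logits = x @ weight + bias
    return logits - logits.logsumexp(dim=-1, keepdim=True)


def loss_fn(log_probs, target):
    return -log_probs[torch.arange(len(target)), target].mean()


batch_size, learning_rate, num_epochs = 64, 0.1, 20
loss_before = loss_fn(model(x_train), y_train).item()

for epoch in range(num_epochs):
    for start in range(0, len(x_train), batch_size):
        x_batch = x_train[start:start + batch_size]   # 1. get a mini-batch
        y_batch = y_train[start:start + batch_size]
        preds = model(x_batch)                        # 2. generate predictions
        loss = loss_fn(preds, y_batch)                # 3. calculate the loss
        loss.backward()                               # 4. compute the gradients
        with torch.no_grad():                         # 5. SGD step, not recorded by autograd
            weight -= learning_rate * weight.grad
            bias -= learning_rate * bias.grad
            weight.grad.zero_()                       # gradients accumulate, so reset them
            bias.grad.zero_()

loss_after = loss_fn(model(x_train), y_train).item()
print(loss_before, loss_after)
```

The update runs inside `torch.no_grad()` because the parameter update itself should not become part of the computation graph.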

That is all we need! And now we can check if our performance has improved.

tensor(0.2267, grad_fn=<NegBackward>)
tensor(0.9375)


Nice, it works! Wasn't this already way easier than with pure NumPy? But this is just the start. Now that we've implemented our model in the lowest level of PyTorch we can start to go up the ladder and make this even better and simpler!

# torch.nn.functional

We will start by replacing some of our hand-written functions with their professionally implemented counterparts in `torch.nn.functional`. This module contains the functional (stateless) versions of the operations in `torch.nn`, while the rest of `torch.nn` contains the classes. It is commonly imported under the name `F` via `import torch.nn.functional as F`.

Instead of using `log_softmax` and `neg_loglikelihood` we can just use `F.cross_entropy`, which combines both of these.
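The equivalence can be checked directly. The sketch below compares a hand-written log-softmax plus negative log-likelihood with `F.cross_entropy` on random logits (a stand-in for real model outputs).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(64, 10)
target = torch.randint(0, 10, (64,))

# Hand-written version: log-softmax followed by the negative log-likelihood.
log_probs = logits - logits.logsumexp(dim=-1, keepdim=True)
manual_loss = -log_probs[torch.arange(len(target)), target].mean()

# F.cross_entropy expects raw logits, so the model must not apply log_softmax itself.
library_loss = F.cross_entropy(logits, target)
print(manual_loss, library_loss)
```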

The loss should still be the same.

tensor(0.2267, grad_fn=<NllLossBackward>)


# nn.Module
Next we will use `nn.Module` and `nn.Parameter` for a clearer and more concise model definition and training loop. By subclassing `nn.Module` we obtain various convenience functions such as `.parameters()` and `.zero_grad()`.

Since `LogisticRegression` is now a class we will have to first instantiate it before using it. We can then call it as if it were a function.
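A sketch of what such a class could look like, assuming the constructor signature `LogisticRegression(num_inputs, num_outputs)` used later in this tutorial and the normal variant of Xavier initialization. The batch is a random stand-in.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F


class LogisticRegression(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        std = math.sqrt(2.0 / (num_inputs + num_outputs))  # Xavier (Glorot) normal
        # nn.Parameter registers the tensors so that .parameters() can find them.
        self.weight = nn.Parameter(torch.randn(num_inputs, num_outputs) * std)
        self.bias = nn.Parameter(torch.zeros(num_outputs))

    def forward(self, x):
        return x @ self.weight + self.bias  # raw logits for F.cross_entropy


torch.manual_seed(0)
model = LogisticRegression(28 * 28, 10)   # instantiate first ...
x = torch.rand(64, 28 * 28)               # random stand-in batch
y = torch.randint(0, 10, (64,))
loss = F.cross_entropy(model(x), y)       # ... then call it like a function
print(loss)
```

Calling `model(x)` invokes `forward` through `nn.Module.__call__`, which is why `forward` is not called directly.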

tensor(2.3309, grad_fn=<NllLossBackward>)


We can now take advantage of `.parameters()` and `.zero_grad()` to make our training loop more concise.
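A sketch of such a training loop is shown below. The signature of `fit` (with the data passed explicitly), the hyperparameters and the synthetic data are assumptions made to keep the example self-contained.

```python
import torch
from torch import nn
import torch.nn.functional as F


class LogisticRegression(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_inputs, num_outputs) * 0.05)
        self.bias = nn.Parameter(torch.zeros(num_outputs))

    def forward(self, x):
        return x @ self.weight + self.bias


def fit(model, x_train, y_train, num_epochs=1, batch_size=64, learning_rate=0.1):
    for epoch in range(num_epochs):
        for start in range(0, len(x_train), batch_size):
            x_batch = x_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            with torch.no_grad():
                for param in model.parameters():   # no need to name each tensor
                    param -= learning_rate * param.grad
                model.zero_grad()                  # resets all gradients at once


torch.manual_seed(0)
x_train = torch.randn(256, 784) * 0.3              # synthetic stand-in data
y_train = (x_train @ torch.randn(784, 10)).argmax(dim=1)
model = LogisticRegression(784, 10)
loss_before = F.cross_entropy(model(x_train), y_train).item()
fit(model, x_train, y_train, num_epochs=20)
loss_after = F.cross_entropy(model(x_train), y_train).item()
print(loss_before, loss_after)
```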

And check if our results are similar to before.

```python
print(loss_fn(model(x_train[:batch_size]), y_train[:batch_size]))
print(get_accuracy(model(x_train[:batch_size]), y_train[:batch_size]))
```

tensor(0.2308, grad_fn=<NllLossBackward>)
tensor(0.9375)


# nn.Linear

Instead of manually defining and initializing the affine layer, we can use the PyTorch class `nn.Linear`. PyTorch provides a wide range of predefined layers to simplify our code (and make it faster). On GitHub you will find layers for pretty much anything you might want to do.

To control weight initialization we define a function and apply it to every submodule of the model with `.apply(fn)`.
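A sketch of both steps, assuming that the function is called `initialize_weight` (the name used in later cells) and that it applies Xavier normal initialization to the weights and zeros to the biases. The attribute name `linear` is illustrative.

```python
import torch
from torch import nn


class LogisticRegression(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.linear = nn.Linear(num_inputs, num_outputs)  # holds weight and bias

    def forward(self, x):
        return self.linear(x)


def initialize_weight(module):
    # .apply() calls this on every submodule, so filter for the layers to initialize.
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)


torch.manual_seed(0)
model = LogisticRegression(28 * 28, 10)
model.apply(initialize_weight)
print(model)
```

Note that `nn.Linear` stores its weight with shape `[num_outputs, num_inputs]`, i.e. transposed compared to a hand-written `x @ weight`.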

Now let's check if we still get the same results as before.

tensor(0.2283, grad_fn=<NllLossBackward>)
tensor(0.9531)


# torch.optim

`torch.optim` provides various optimization algorithms. Here we will continue to use simple `SGD`, but you could just as easily switch to Adam or AMSGrad. Optimizers provide `.step()` and `.zero_grad()` methods, which allow us to make the last block in our `fit` function more concise.
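A sketch of the resulting `fit` function, assuming it reads `x_train`, `y_train`, `batch_size` and `loss_fn` from the enclosing scope. The data is a synthetic stand-in, and a bare `nn.Linear` takes the place of the model class to keep the example short.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
x_train = torch.randn(256, 784) * 0.3            # synthetic stand-in data
y_train = (x_train @ torch.randn(784, 10)).argmax(dim=1)
batch_size, learning_rate = 64, 0.1
loss_fn = F.cross_entropy


def fit(model, optimizer, num_epochs=1):
    for epoch in range(num_epochs):
        for start in range(0, len(x_train), batch_size):
            x_batch = x_train[start:start + batch_size]
            y_batch = y_train[start:start + batch_size]
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()        # replaces the manual parameter update
            optimizer.zero_grad()   # replaces the manual gradient reset


model = nn.Linear(28 * 28, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
loss_before = loss_fn(model(x_train), y_train).item()
fit(model, optimizer, num_epochs=20)
loss_after = loss_fn(model(x_train), y_train).item()
print(loss_before, loss_after)
```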

```python
fit(model, optimizer, num_epochs=2)
```

```python
print(loss_fn(model(x_train[:batch_size]), y_train[:batch_size]))
print(get_accuracy(model(x_train[:batch_size]), y_train[:batch_size]))
```

tensor(0.1945, grad_fn=<NllLossBackward>)
tensor(0.9531)


# Dataset
PyTorch also provides an abstract `Dataset` class for easier handling of various data. You can subclass this just like we subclassed `nn.Module`. A `Dataset` only needs to provide a `__len__` method (which is called by Python's `len` function) and a `__getitem__` method for indexing the dataset.

`TensorDataset` provides an easy way of converting tensors to datasets. This will make our data loading more concise, since we can handle both `x_train` and `y_train` simultaneously.
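The sketch below shows both options on random stand-in tensors: a hand-written `Dataset` subclass (the class name `PairDataset` is hypothetical) and the equivalent `TensorDataset`.

```python
import torch
from torch.utils.data import Dataset, TensorDataset


class PairDataset(Dataset):
    """Hypothetical hand-written dataset that pairs inputs with targets."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]


x_train = torch.rand(256, 784)             # random stand-in inputs
y_train = torch.randint(0, 10, (256,))     # random stand-in targets

manual_set = PairDataset(x_train, y_train)
train_set = TensorDataset(x_train, y_train)

# Indexing (or slicing) returns inputs and targets together.
x_batch, y_batch = train_set[0:64]
print(len(train_set), x_batch.shape, y_batch.shape)
```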

```python
model = LogisticRegression(28 * 28, 10)
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

fit(model, optimizer, train_set, num_epochs=2)

print(loss_fn(model(x_train[:batch_size]), y_train[:batch_size]))
print(get_accuracy(model(x_train[:batch_size]), y_train[:batch_size]))
```

tensor(0.2265, grad_fn=<NllLossBackward>)
tensor(0.9531)


# DataLoader

A `DataLoader` automatically generates mini-batches for your training loop. It can run multiple workers in parallel and provides useful functionality such as data shuffling. You can create a `DataLoader` for any `Dataset`.

Using the DataLoader makes our training loop a lot cleaner:
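A sketch of such a loop, assuming the signature `fit(model, optimizer, train_loader, num_epochs)` used in the next cell. The data is a synthetic stand-in and a bare `nn.Linear` serves as the model.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x_train = torch.randn(256, 784) * 0.3            # synthetic stand-in data
y_train = (x_train @ torch.randn(784, 10)).argmax(dim=1)
train_set = TensorDataset(x_train, y_train)
train_loader = DataLoader(train_set, batch_size=64)
loss_fn = F.cross_entropy


def fit(model, optimizer, train_loader, num_epochs=1):
    for epoch in range(num_epochs):
        for x_batch, y_batch in train_loader:    # the loader handles the slicing
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()


model = nn.Linear(28 * 28, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_before = loss_fn(model(x_train), y_train).item()
fit(model, optimizer, train_loader, num_epochs=20)
loss_after = loss_fn(model(x_train), y_train).item()
print(len(train_loader), loss_before, loss_after)
```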

```python
model = LogisticRegression(28 * 28, 10)
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

fit(model, optimizer, train_loader, num_epochs=2)

print(loss_fn(model(x_train[:batch_size]), y_train[:batch_size]))
print(get_accuracy(model(x_train[:batch_size]), y_train[:batch_size]))
```

tensor(0.2238, grad_fn=<NllLossBackward>)
tensor(0.9531)


# Validation

Now that we have a training loop we can go ahead and do some real work. We always need a validation set: it lets us detect overfitting, enables early stopping, and provides feedback for model development.

Since the validation set does not need backpropagation, we can run it without gradient tracking, which needs less memory, so we can use 2x larger batches for it. Furthermore, we should shuffle our training data to avoid correlation between batches. This is not necessary (and would waste computation time) for the validation set.

Note that you need to call `model.train()` before training and `model.eval()` before evaluation (inference), since some layers like dropout and batch normalization work differently in each mode.
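A sketch of a training loop with validation is given below. It assumes that `dataloaders` is a dictionary with the keys `"train"` and `"val"` and that `fit` prints the mean validation loss per epoch. Both are assumptions, as are the synthetic data and the bare `nn.Linear` model.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
x = torch.randn(320, 784) * 0.3                  # synthetic stand-in data
y = (x @ torch.randn(784, 10)).argmax(dim=1)
batch_size = 64
dataloaders = {
    "train": DataLoader(TensorDataset(x[:256], y[:256]), batch_size=batch_size, shuffle=True),
    "val": DataLoader(TensorDataset(x[256:], y[256:]), batch_size=2 * batch_size),
}
loss_fn = F.cross_entropy


def fit(model, optimizer, dataloaders, num_epochs=1):
    val_losses = []
    for epoch in range(num_epochs):
        model.train()                            # training mode (dropout, batch norm)
        for x_batch, y_batch in dataloaders["train"]:
            loss = loss_fn(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        model.eval()                             # evaluation mode
        with torch.no_grad():                    # no graph is recorded, which saves memory
            total = sum(loss_fn(model(xb), yb, reduction="sum").item()
                        for xb, yb in dataloaders["val"])
        val_losses.append(total / len(dataloaders["val"].dataset))
        print(f"Epoch {epoch}: {val_losses[-1]:.3f}")
    return val_losses


model = nn.Linear(28 * 28, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
val_losses = fit(model, optimizer, dataloaders, num_epochs=2)
```

Summing the per-sample losses and dividing by the dataset size gives the correct mean even if the last batch is smaller than the others.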

```python
model = LogisticRegression(28 * 28, 10)
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

fit(model, optimizer, dataloaders, num_epochs=2)
```

Epoch 0: 0.285
Epoch 1: 0.314


# CNN
Using simple logistic regression (or an MLP) for images basically ignores the data's underlying structure. We can do much better than this by switching to a CNN. Since our training loop does not assume anything about the model we can train a CNN without any changes.

Our CNN will consist of 3 convolutional layers, each using PyTorch's predefined `Conv2d` layer. At the end, we perform average pooling. Since `Conv2d` expects an input of shape `[batch_size, num_channels, height, width]` we need to reshape our input inside the model via `.view(_)`.
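A sketch of such a model, assuming the constructor signature `CNN(num_channels, num_classes)` used in the next cell. The kernel size, stride, padding and the placement of the ReLU activations are assumptions, since the text does not specify them.

```python
import torch
from torch import nn
import torch.nn.functional as F


class CNN(nn.Module):
    def __init__(self, num_channels, num_classes):
        super().__init__()
        # Assumed hyperparameters: 3x3 kernels with stride 2 halve the resolution each time.
        self.conv1 = nn.Conv2d(1, num_channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(num_channels, num_classes, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        x = x.view(-1, 1, 28, 28)        # [batch, 784] -> [batch, 1, 28, 28]
        x = F.relu(self.conv1(x))        # -> [batch, num_channels, 14, 14]
        x = F.relu(self.conv2(x))        # -> [batch, num_channels, 7, 7]
        x = self.conv3(x)                # -> [batch, num_classes, 4, 4]
        x = F.avg_pool2d(x, 4)           # -> [batch, num_classes, 1, 1]
        return x.view(-1, x.size(1))     # -> [batch, num_classes]


torch.manual_seed(0)
model = CNN(num_channels=16, num_classes=10)
out = model(torch.rand(8, 28 * 28))      # random stand-in batch of flattened images
print(out.shape)
```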

We will now furthermore use momentum in our optimizer to speed up training.

```python
model = CNN(num_channels=16, num_classes=10)
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

fit(model, optimizer, dataloaders, num_epochs=2)
```

Epoch 0: 0.382
Epoch 1: 0.303


# nn.Sequential

PyTorch provides a class `nn.Sequential` for simplifying the definition of modules that only consist of a stack of layers. Since these are exactly the models we have been using so far we will now switch to this interface.

Because not all functions are defined as PyTorch layers we will start by defining a module that just converts a function to a layer.

We can now define our CNN in a more concise manner. Note that we now use `nn.AdaptiveAvgPool2d`, which lets us specify the size of the output tensor we want instead of a pooling window that has to match the size of the input tensor.
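Both steps could look roughly like this. The wrapper name `Lambda` and the convolution hyperparameters are assumptions, and the input is a random stand-in batch.

```python
import torch
from torch import nn


class Lambda(nn.Module):
    """Wraps an arbitrary function so that it can be used as a layer."""

    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)


num_channels, num_classes = 16, 10
model = nn.Sequential(
    Lambda(lambda x: x.view(-1, 1, 28, 28)),       # [batch, 784] -> [batch, 1, 28, 28]
    nn.Conv2d(1, num_channels, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(num_channels, num_channels, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(num_channels, num_classes, kernel_size=3, stride=2, padding=1),
    nn.AdaptiveAvgPool2d(1),                       # output size 1x1, whatever the input size
    Lambda(lambda x: x.view(x.size(0), -1)),       # -> [batch, num_classes]
)

out = model(torch.rand(8, 28 * 28))
print(out.shape)
```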

```python
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

fit(model, optimizer, dataloaders, num_epochs=2)
```

Epoch 0: 0.327
Epoch 1: 0.255


# GPUs

PyTorch can run significantly faster on a GPU than on a CPU, so you should always try to leverage that hardware. To do so, you need to move both your model and your data to the device.

So let's first check if you have a GPU and choose the appropriate device.

True


Next, we move our model to the device.

Then, we redefine the DataLoader to use pinned (page-locked) memory via `pin_memory=True`. This is a trick that accelerates copying data from the CPU to the GPU.

Finally, we need to slightly change our training loop to send each batch to the device first.
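The steps above can be sketched as follows. The example falls back to the CPU if no GPU is available, uses synthetic stand-in data and a bare `nn.Linear` model, and assumes a `fit` signature with a single loader.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

torch.manual_seed(0)
x_train = torch.randn(256, 784) * 0.3            # synthetic stand-in data
y_train = (x_train @ torch.randn(784, 10)).argmax(dim=1)
train_loader = DataLoader(
    TensorDataset(x_train, y_train),
    batch_size=64,
    shuffle=True,
    pin_memory=use_cuda,                         # pinning only helps when copying to a GPU
)

model = nn.Linear(28 * 28, 10).to(device)        # move the model to the device
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def fit(model, optimizer, train_loader, num_epochs=1):
    model.train()
    for epoch in range(num_epochs):
        for x_batch, y_batch in train_loader:
            # Send each batch to the same device as the model.
            x_batch = x_batch.to(device, non_blocking=True)
            y_batch = y_batch.to(device, non_blocking=True)
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return loss.item()


last_loss = fit(model, optimizer, train_loader, num_epochs=2)
print(device, last_loss)
```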

And now we can run our CNN on the GPU!

```python
model.apply(initialize_weight)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

fit(model, optimizer, dataloaders, num_epochs=2)
```

Epoch 0: 0.466
Epoch 1: 0.288


# Summary
Great, so now we have a concise, but general training loop and know how to quickly define new models! Now let us sum up what we have learned during this journey:
- `torch.Tensor`: PyTorch tensors work like NumPy arrays, but can remember gradients and be sent to the GPU.
- `torch.nn`
    - `torch.nn.functional`: Provides various useful stateless functions for training neural networks, e.g. activation and loss functions.
    - `nn.Module`: Subclass from this to create a callable that acts like a function, but can remember state. It knows what `Parameter`s and submodules it contains and provides various functionality based on that.
    - `nn.Parameter`: Wraps a tensor and tells the containing Module that it is a trainable weight, so it is returned by `.parameters()` and updated by the optimizer.
    - `torch.nn`: Many useful layers are already implemented in this library, e.g. `nn.Linear` or `nn.Conv2d`.
    - `nn.Sequential`: Provides an easy way of defining purely stacked modules.
- `torch.optim`: Optimizers such as `SGD` or `Adam`, which let you easily update and train the `Parameter`s inside the passed model.
- `Dataset`: Interface for data using only the `__len__` and `__getitem__` functions. Tensors can be converted into a `Dataset` by using `TensorDataset`.
- `DataLoader`: Takes any `Dataset` and provides an iterator for returning mini-batches with various advanced functionality.
- `GPU`: To use your GPU you need to move your model and each mini-batch to your GPU using `.to(device)`.