diff --git a/course_UvA-DL/introduction-to-pytorch/.meta.yml b/course_UvA-DL/introduction-to-pytorch/.meta.yml
new file mode 100644
index 000000000..1a3726b25
--- /dev/null
+++ b/course_UvA-DL/introduction-to-pytorch/.meta.yml
@@ -0,0 +1,14 @@
+title: Introduction to PyTorch
+author: Phillip Lippe
+created: 2021-08-27
+updated: 2021-08-27
+license: CC BY-SA
+description: |
+  This tutorial will give a short introduction to PyTorch basics, and get you set up for writing your own neural networks.
+  This notebook is part of a lecture series on Deep Learning at the University of Amsterdam.
+  The full list of tutorials can be found at https://uvadlc-notebooks.rtfd.io.
+requirements:
+  - matplotlib
+accelerator:
+  - CPU
+  - GPU
diff --git a/course_UvA-DL/introduction-to-pytorch/Introduction_to_PyTorch.py b/course_UvA-DL/introduction-to-pytorch/Introduction_to_PyTorch.py
new file mode 100644
index 000000000..18693f734
--- /dev/null
+++ b/course_UvA-DL/introduction-to-pytorch/Introduction_to_PyTorch.py
@@ -0,0 +1,986 @@
+# %% [markdown]
+# Welcome to our PyTorch tutorial for the Deep Learning course 2020 at the University of Amsterdam!
+# The following notebook is meant to give a short introduction to PyTorch basics, and get you set up for writing your own neural networks.
+# PyTorch is an open source machine learning framework that allows you to write your own neural networks and optimize them efficiently.
+# However, PyTorch is not the only framework of its kind.
+# Alternatives to PyTorch include [TensorFlow](https://www.tensorflow.org/), [JAX](https://github.com/google/jax#quickstart-colab-in-the-cloud) and [Caffe](http://caffe.berkeleyvision.org/).
+# We choose to teach PyTorch at the University of Amsterdam because it is well established, has a huge developer community (originally developed by Facebook), is very flexible, and is especially popular in research.
+# Many current papers publish their code in PyTorch, and thus it is good to be familiar with PyTorch as well.
+# Meanwhile, TensorFlow (developed by Google) is usually known for being a production-grade deep learning library.
+# Still, if you know one machine learning framework in depth, it is very easy to learn another one because many of them use the same concepts and ideas.
+# For instance, TensorFlow's version 2 was heavily inspired by the most popular features of PyTorch, making the frameworks even more similar.
+# If you are already familiar with PyTorch and have created your own neural network projects, feel free to just skim this notebook.
+#
+# We are of course not the first ones to create a PyTorch tutorial.
+# There are many great tutorials online, including the ["60-min blitz"](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) on the official [PyTorch website](https://pytorch.org/tutorials/).
+# Yet, we chose to create our own tutorial, designed to give you the basics that are particularly necessary for the practicals, while still conveying how PyTorch works under the hood.
+# Over the next few weeks, we will also keep exploring new PyTorch features in the series of Jupyter notebook tutorials about deep learning.
+#
+# We will use a set of standard libraries that are often used in machine learning projects.
+# If you are running this notebook on Google Colab, all libraries should be pre-installed.
+# If you are running this notebook locally, make sure you have installed our `dl2020` environment ([link](https://github.com/uvadlc/uvadlc_practicals_2020/blob/master/environment.yml)) and have activated it.
+
+# %%
+import time
+
+import matplotlib.pyplot as plt
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.utils.data as data
+
+# %matplotlib inline
+from IPython.display import set_matplotlib_formats
+from matplotlib.colors import to_rgba
+from tqdm.notebook import tqdm  # Progress bar
+
+set_matplotlib_formats("svg", "pdf")
+
+# %% [markdown]
+# ## The Basics of PyTorch
+#
+# We will start by reviewing the very basic concepts of PyTorch.
+# As a prerequisite, we recommend being familiar with the `numpy` package as most machine learning frameworks are based on very similar concepts.
+# If you are not familiar with numpy yet, don't worry: here is a [tutorial](https://numpy.org/devdocs/user/quickstart.html) to go through.
+#
+# So, let's start with importing PyTorch.
+# The package is called `torch`, based on its original framework [Torch](http://torch.ch/).
+# As a first step, we can check its version:
+
+# %%
+print("Using torch", torch.__version__)
+
+# %% [markdown]
+# At the time of writing this tutorial (mid-August 2021), the current stable version is 1.9.
+# You should therefore see the output `Using torch 1.9.0`, possibly with some extension for the CUDA version on Colab.
+# In case you use the `dl2020` environment, you should see `Using torch 1.6.0` since the environment was provided in October 2020.
+# It is recommended to update the PyTorch version to the newest one.
+# If you see a lower version number than 1.6, make sure you have installed the correct environment, or ask one of your TAs.
+# In case PyTorch 1.10 or newer is released during the course, don't worry.
+# The interface between PyTorch versions doesn't change too much, and hence all code should also be runnable with newer versions.
+#
+# As in every machine learning framework, PyTorch provides functions that are stochastic, like generating random numbers.
+# However, a very good practice is to set up your code to be reproducible with the exact same random numbers.
+# This is why we set a seed below.
+
+# %%
+torch.manual_seed(42)  # Setting the seed
+
+# %% [markdown]
+# ### Tensors
+#
+# Tensors are the PyTorch equivalent to Numpy arrays, with the addition of support for GPU acceleration (more on that later).
+# The name "tensor" is a generalization of concepts you already know.
+# For instance, a vector is a 1-D tensor, and a matrix a 2-D tensor.
+# When working with neural networks, we will use tensors of various shapes and numbers of dimensions.
+#
+# Most common functions you know from numpy can be used on tensors as well.
+# Actually, since numpy arrays are so similar to tensors, we can convert most tensors to numpy arrays (and back) but we don't need it too often.
+#
+# #### Initialization
+#
+# Let's first start by looking at different ways of creating a tensor.
+# There are many possible options, the simplest one is to call
+# `torch.Tensor` passing the desired shape as input argument:
+
+# %%
+x = torch.Tensor(2, 3, 4)
+print(x)
+
+# %% [markdown]
+# The function `torch.Tensor` allocates memory for the desired tensor, but reuses any values that were already in that memory.
+# To directly assign values to the tensor during initialization, there are many alternatives including:
+#
+# * `torch.zeros`: Creates a tensor filled with zeros
+# * `torch.ones`: Creates a tensor filled with ones
+# * `torch.rand`: Creates a tensor with random values uniformly sampled between 0 and 1
+# * `torch.randn`: Creates a tensor with random values sampled from a normal distribution with mean 0 and variance 1
+# * `torch.arange`: Creates a tensor containing the values $N,N+1,N+2,...,M$
+# * `torch.Tensor` (input list): Creates a tensor from the list elements you provide
+
+# %%
+# Create a tensor from a (nested) list
+x = torch.Tensor([[1, 2], [3, 4]])
+print(x)
+
+# %%
+# Create a tensor with random values between 0 and 1 with the shape [2, 3, 4]
+x = torch.rand(2, 3, 4)
+print(x)
+
+# %% [markdown]
+# You can obtain the shape of a tensor in the same way as in numpy (`x.shape`), or using the `.size` method:
+
+# %%
+shape = x.shape
+print("Shape:", x.shape)
+
+size = x.size()
+print("Size:", size)
+
+dim1, dim2, dim3 = x.size()
+print("Size:", dim1, dim2, dim3)
+
+# %% [markdown]
+# #### Tensor to Numpy, and Numpy to Tensor
+#
+# Tensors can be converted to numpy arrays, and numpy arrays back to tensors.
+# To transform a numpy array into a tensor, we can use the function `torch.from_numpy`:
+
+# %%
+np_arr = np.array([[1, 2], [3, 4]])
+tensor = torch.from_numpy(np_arr)
+
+print("Numpy array:", np_arr)
+print("PyTorch tensor:", tensor)
+
+# %% [markdown]
+# To transform a PyTorch tensor back to a numpy array, we can use the function `.numpy()` on tensors:
+
+# %%
+tensor = torch.arange(4)
+np_arr = tensor.numpy()
+
+print("PyTorch tensor:", tensor)
+print("Numpy array:", np_arr)
+
+# %% [markdown]
+# The conversion of tensors to numpy requires the tensor to be on the CPU, and not the GPU (more on GPU support in a later section).
+# In case you have a tensor on the GPU, you need to call `.cpu()` on the tensor beforehand.
+# Hence, you get a line like `np_arr = tensor.cpu().numpy()`.
+
+# %% [markdown]
+# #### Operations
+#
+# Most operations that exist in numpy also exist in PyTorch.
+# A full list of operations can be found in the [PyTorch documentation](https://pytorch.org/docs/stable/tensors.html#), but we will review the most important ones here.
+#
+# The simplest operation is to add two tensors:
+
+# %%
+x1 = torch.rand(2, 3)
+x2 = torch.rand(2, 3)
+y = x1 + x2
+
+print("X1", x1)
+print("X2", x2)
+print("Y", y)
+
+# %% [markdown]
+# Calling `x1 + x2` creates a new tensor containing the sum of the two inputs.
+# However, we can also use in-place operations that are applied directly on the memory of a tensor.
+# We thereby change the values of `x2` without the chance of re-accessing the values it had before the operation.
+# An example is shown below:
+
+# %%
+x1 = torch.rand(2, 3)
+x2 = torch.rand(2, 3)
+print("X1 (before)", x1)
+print("X2 (before)", x2)
+
+x2.add_(x1)
+print("X1 (after)", x1)
+print("X2 (after)", x2)
+
+# %% [markdown]
+# In-place operations are usually marked with an underscore postfix (e.g. "add_" instead of "add").
+#
+# Another common operation aims at changing the shape of a tensor.
+# A tensor of size (2,3) can be re-organized to any other shape with the same number of elements (e.g. a tensor of size (6), or (3,2), ...).
+# In PyTorch, this operation is called `view`:
+
+# %%
+x = torch.arange(6)
+print("X", x)
+
+# %%
+x = x.view(2, 3)
+print("X", x)
+
+# %%
+x = x.permute(1, 0)  # Swapping dimension 0 and 1
+print("X", x)
+
+# %% [markdown]
+# Other commonly used operations include matrix multiplications, which are essential for neural networks.
+# Quite often, we have an input vector $\mathbf{x}$, which is transformed using a learned weight matrix $\mathbf{W}$.
+# There are multiple ways and functions to perform matrix multiplication, some of which we list below:
+#
+# * `torch.matmul`: Performs the matrix product over two tensors, where the specific behavior depends on the dimensions.
+# If both inputs are matrices (2-dimensional tensors), it performs the standard matrix product.
+# For higher dimensional inputs, the function supports broadcasting (for details see the [documentation](https://pytorch.org/docs/stable/generated/torch.matmul.html?highlight=matmul#torch.matmul)).
+# Can also be written as `a @ b`, similar to numpy.
+# * `torch.mm`: Performs the matrix product over two matrices, but doesn't support broadcasting (see [documentation](https://pytorch.org/docs/stable/generated/torch.mm.html?highlight=torch%20mm#torch.mm))
+# * `torch.bmm`: Performs the matrix product with an additional batch dimension.
+# If the first tensor $T$ is of shape ($b\times n\times m$), and the second tensor $R$ ($b\times m\times p$), the output $O$ is of shape ($b\times n\times p$), and has been calculated by performing $b$ matrix multiplications of the submatrices of $T$ and $R$: $O_i = T_i @ R_i$
+# * `torch.einsum`: Performs matrix multiplications and more (i.e. sums of products) using the Einstein summation convention.
+# Explanation of the Einstein sum can be found in assignment 1.
+#
+# Usually, we use `torch.matmul` or `torch.bmm`. We can try a matrix multiplication with `torch.matmul` below.
+
+# %%
+x = torch.arange(6)
+x = x.view(2, 3)
+print("X", x)
+
+# %%
+W = torch.arange(9).view(3, 3)  # We can also stack multiple operations in a single line
+print("W", W)
+
+# %%
+h = torch.matmul(x, W)  # Verify the result by calculating it by hand too!
+print("h", h)
+
+# %% [markdown]
+# #### Indexing
+#
+# We often have the situation where we need to select a part of a tensor.
+# Indexing works just like in numpy, so let's try it:
+
+# %%
+x = torch.arange(12).view(3, 4)
+print("X", x)
+
+# %%
+print(x[:, 1])  # Second column
+
+# %%
+print(x[0])  # First row
+
+# %%
+print(x[:2, -1])  # First two rows, last column
+
+# %%
+print(x[1:3, :])  # Last two rows
+
+# %% [markdown]
+# ### Dynamic Computation Graph and Backpropagation
+#
+# One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get **gradients/derivatives** of functions that we define.
+# We will mainly use PyTorch for implementing neural networks, and they are just fancy functions.
+# If we use weight matrices in our function that we want to learn, then those are called the **parameters** or simply the **weights**.
+#
+# If our neural network were to output a single scalar value, we would talk about taking the **derivative**, but you will see that quite often we will have **multiple** output variables ("values"); in that case we talk about **gradients**.
+# It's a more general term.
+#
+# Given an input $\mathbf{x}$, we define our function by **manipulating** that input, usually by matrix-multiplications with weight matrices and additions with so-called bias vectors.
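+# As a small illustration (a sketch with made-up example shapes, using only the operations from above), such a manipulation could look as follows:
+
+# %%
+x = torch.arange(6, dtype=torch.float32).view(2, 3)  # A (hypothetical) batch of 2 inputs with 3 features each
+W = torch.ones(3, 2)  # A weight matrix (in a real network, this would be learned)
+b = torch.zeros(2)  # A bias vector
+out = torch.matmul(x, W) + b  # Matrix-multiplication with the weights, addition of the bias
+print("Output", out)
+
+# %% [markdown]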
+# As we manipulate our input, we are automatically creating a **computational graph**.
+# This graph shows how to arrive at our output from our input.
+# PyTorch is a **define-by-run** framework; this means that we can just do our manipulations, and PyTorch will keep track of that graph for us.
+# Thus, we create a dynamic computation graph along the way.
+#
+# So, to recap: the only thing we have to do is to compute the **output**, and then we can ask PyTorch to automatically get the **gradients**.
+#
+# > **Note: Why do we want gradients?**
+# Consider that we have defined a function, a neural net, that is supposed to compute a certain output $y$ for an input vector $\mathbf{x}$.
+# We then define an **error measure** that tells us how wrong our network is; how bad it is in predicting output $y$ from input $\mathbf{x}$.
+# Based on this error measure, we can use the gradients to **update** the weights $\mathbf{W}$ that were responsible for the output, so that the next time we present input $\mathbf{x}$ to our network, the output will be closer to what we want.
+#
+# The first thing we have to do is to specify which tensors require gradients.
+# By default, when we create a tensor, it does not require gradients.
+
+# %%
+x = torch.ones((3,))
+print(x.requires_grad)
+
+# %% [markdown]
+# We can change this for an existing tensor using the function `requires_grad_()` (underscore indicating that this is an in-place operation).
+# Alternatively, when creating a tensor, you can pass the argument
+# `requires_grad=True` to most initializers we have seen above.
+
+# %%
+x.requires_grad_(True)
+print(x.requires_grad)
+
+# %% [markdown]
+# In order to get familiar with the concept of a computation graph, we will create one for the following function:
+#
+# $$y = \frac{1}{|x|}\sum_i \left[(x_i + 2)^2 + 3\right]$$
+#
+# You could imagine that $x$ are our parameters, and we want to optimize (either maximize or minimize) the output $y$.
+# For this, we want to obtain the gradients $\partial y / \partial \mathbf{x}$.
+# For our example, we'll use $\mathbf{x}=[0,1,2]$ as our input.
+
+# %%
+x = torch.arange(3, dtype=torch.float32, requires_grad=True)  # Only float tensors can have gradients
+print("X", x)
+
+# %% [markdown]
+# Now let's build the computation graph step by step.
+# You can combine multiple operations in a single line, but we will
+# separate them here to get a better understanding of how each operation
+# is added to the computation graph.
+
+# %%
+a = x + 2
+b = a ** 2
+c = b + 3
+y = c.mean()
+print("Y", y)
+
+# %% [markdown]
+# Using the statements above, we have created a computation graph that looks similar to the figure below:
+#
+# <center style="width: 100%"><img src="pytorch_computation_graph.svg" width="200px" /></center>
+#
+# We calculate $a$ based on the inputs $x$ and the constant $2$, $b$ is $a$ squared, and so on.
+# The visualization is an abstraction of the dependencies between inputs and outputs of the operations we have applied.
+# Each node of the computation graph has automatically defined a function for calculating the gradients with respect to its inputs, `grad_fn`.
+# You can see this when we printed the output tensor $y$.
+# This is why the computation graph is usually visualized in the reverse direction (arrows point from the result to the inputs).
+# We can perform backpropagation on the computation graph by calling the
+# function `backward()` on the last output, which effectively calculates
+# the gradients for each tensor that has the property
+# `requires_grad=True`:
+
+# %%
+y.backward()
+
+# %% [markdown]
+# `x.grad` will now contain the gradient $\partial y / \partial \mathbf{x}$, and this gradient indicates how a change in $\mathbf{x}$ will affect output $y$ given the current input $\mathbf{x}=[0,1,2]$:
+
+# %%
+print(x.grad)
+
+# %% [markdown]
+# We can also verify these gradients by hand.
+# We will calculate the gradients using the chain rule, in the same way as PyTorch did it:
+#
+# $$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial c_i}\frac{\partial c_i}{\partial b_i}\frac{\partial b_i}{\partial a_i}\frac{\partial a_i}{\partial x_i}$$
+#
+# Note that we have simplified this equation to index notation, and by using the fact that all operations besides the mean do not combine the elements in the tensor.
+# The partial derivatives are:
+#
+# $$
+# \frac{\partial a_i}{\partial x_i} = 1,\hspace{1cm}
+# \frac{\partial b_i}{\partial a_i} = 2\cdot a_i,\hspace{1cm}
+# \frac{\partial c_i}{\partial b_i} = 1,\hspace{1cm}
+# \frac{\partial y}{\partial c_i} = \frac{1}{3}
+# $$
+#
+# Hence, with the input being $\mathbf{x}=[0,1,2]$, our gradients are $\partial y/\partial \mathbf{x}=[4/3,2,8/3]$.
+# The previous code cell should have printed the same result.
+
+# %% [markdown]
+# ### GPU support
+#
+# A crucial feature of PyTorch is the support of GPUs, short for Graphics Processing Unit.
+# A GPU can perform many thousands of small operations in parallel, making it very well suited for performing large matrix operations in neural networks.
+# When comparing GPUs to CPUs, we can list the following main differences (credit: [Kevin Krewell, 2009](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/)):
+#
+# <center style="width: 100%"><img src="comparison_CPU_GPU.png" width="700px" /></center>
+#
+# CPUs and GPUs both have different advantages and disadvantages, which is why many computers contain both components and use them for different tasks.
+# In case you are not familiar with GPUs, you can read up more details in this [NVIDIA blog post](https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/) or [here](https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html).
+#
+# GPUs can accelerate the training of your network up to a factor of $100$, which is essential for large neural networks.
+# PyTorch implements a lot of functionality for supporting GPUs (mostly those of NVIDIA due to the libraries [CUDA](https://developer.nvidia.com/cuda-zone) and [cuDNN](https://developer.nvidia.com/cudnn)).
+# First, let's check whether you have a GPU available:
+
+# %%
+gpu_avail = torch.cuda.is_available()
+print(f"Is the GPU available? {gpu_avail}")
+
+# %% [markdown]
+# If you have a GPU on your computer but the command above returns False, make sure you have the correct CUDA-version installed.
+# The `dl2020` environment comes with the CUDA-toolkit 10.1, which is selected for the Lisa supercomputer.
+# Please change it if necessary (CUDA 10.2 is currently common).
+# On Google Colab, make sure that you have selected a GPU in your runtime setup (in the menu, check under `Runtime -> Change runtime type`).
+#
+# By default, all tensors you create are stored on the CPU.
+# We can push a tensor to the GPU by using the function `.to(...)`, or `.cuda()`.
+# However, it is often a good practice to define a `device` object in your code which points to the GPU if you have one, and otherwise to the CPU.
+# Then, you can write your code with respect to this device object, and it allows you to run the same code on both a CPU-only system, and one with a GPU.
+# Let's try it below.
+# We can specify the device as follows:
+
+# %%
+device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+print("Device", device)
+
+# %% [markdown]
+# Now let's create a tensor and push it to the device:
+
+# %%
+x = torch.zeros(2, 3)
+x = x.to(device)
+print("X", x)
+
+# %% [markdown]
+# In case you have a GPU, you should now see the attribute `device='cuda:0'` being printed next to your tensor.
+# The zero next to cuda indicates that this is the zero-th GPU device on your computer.
+# PyTorch also supports multi-GPU systems, but you will only need this once you have very big networks to train (if interested, see the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#distributed-basics)).
+# We can also compare the runtime of a large matrix multiplication on the CPU with an operation on the GPU:
+
+# %%
+x = torch.randn(5000, 5000)
+
+# CPU version
+start_time = time.time()
+_ = torch.matmul(x, x)
+end_time = time.time()
+print(f"CPU time: {(end_time - start_time):6.5f}s")
+
+# GPU version
+x = x.to(device)
+# The first operation on a CUDA device can be slow as it has to establish a CPU-GPU communication first.
+# Hence, we run an arbitrary command first without timing it for a fair comparison.
+if torch.cuda.is_available():
+    _ = torch.matmul(x * 0.0, x)
+start_time = time.time()
+_ = torch.matmul(x, x)
+end_time = time.time()
+print(f"GPU time: {(end_time - start_time):6.5f}s")
+
+# %% [markdown]
+# Depending on the size of the operation and the CPU/GPU in your system, the speedup of this operation can be >500x.
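+# (A side note on timing: CUDA calls are executed asynchronously, so `time.time()` can return before the GPU has actually finished its work.
+# A more careful measurement would wait for the GPU with `torch.cuda.synchronize()`, as sketched below.)
+
+# %%
+if torch.cuda.is_available():
+    torch.cuda.synchronize()  # Wait until all previously queued GPU work has finished
+    start_time = time.time()
+    _ = torch.matmul(x, x)
+    torch.cuda.synchronize()  # Make sure the matmul has completed before stopping the timer
+    end_time = time.time()
+    print(f"GPU time (synchronized): {(end_time - start_time):6.5f}s")
+
+# %% [markdown]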
+# As `matmul` operations are very common in neural networks, we can already see the great benefit of training a NN on a GPU.
+# The time estimate can be relatively noisy here because we haven't run it multiple times.
+# Feel free to extend this, but it also takes longer to run.
+#
+# When generating random numbers, the seed between CPU and GPU is not synchronized.
+# Hence, we need to set the seed on the GPU separately to ensure reproducible code.
+# Note that due to different GPU architectures, running the same code on different GPUs does not guarantee the same random numbers.
+# Still, we don't want our code to give us a different output every time we run it on the exact same hardware.
+# Hence, we also set the seed on the GPU:
+
+# %%
+# GPU operations have a separate seed we also want to set
+if torch.cuda.is_available():
+    torch.cuda.manual_seed(42)
+    torch.cuda.manual_seed_all(42)
+
+# Additionally, some operations on a GPU are implemented stochastically for efficiency
+# We want to ensure that all operations are deterministic on GPU (if used) for reproducibility
+torch.backends.cudnn.deterministic = True
+torch.backends.cudnn.benchmark = False
+
+# %% [markdown]
+# ## Learning by example: Continuous XOR
+#
+# If we want to build a neural network in PyTorch, we could specify all our parameters (weight matrices, bias vectors) using `Tensors` (with `requires_grad=True`), ask PyTorch to calculate the gradients and then adjust the parameters.
+# But things can quickly get cumbersome if we have a lot of parameters.
+# In PyTorch, there is a package called `torch.nn` that makes building neural networks more convenient.
+#
+# We will introduce the libraries and all additional parts you might need to train a neural network in PyTorch, using a simple example classifier on a simple yet well-known task: XOR.
+# Given two binary inputs $x_1$ and $x_2$, the label to predict is $1$ if either $x_1$ or $x_2$ is $1$ while the other is $0$, and $0$ in all other cases.
+# The example became famous because a single neuron, i.e. a linear classifier, cannot learn this simple function.
+# Hence, we will learn how to build a small neural network that can learn this function.
+# To make it a little bit more interesting, we move the XOR into continuous space and introduce some Gaussian noise on the binary inputs.
+# Our desired separation of an XOR dataset could look as follows:
+#
+# <center style="width: 100%"><img src="continuous_xor.svg" width="350px" /></center>
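+
+# %% [markdown]
+# As a quick sanity check of the plain XOR function itself (a small addition, not part of the original lecture material), we can compute the labels for the four noise-free corner points directly:
+
+# %%
+corners = torch.Tensor([[0, 0], [0, 1], [1, 0], [1, 1]])
+corner_labels = (corners.sum(dim=1) == 1).to(torch.long)  # 1 if exactly one of the two inputs is 1
+print("Inputs:\n", corners)
+print("XOR labels:", corner_labels)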
+
+# %% [markdown]
+# ### The model
+#
+# The package `torch.nn` defines a series of useful classes like linear network layers, activation functions, loss functions etc.
+# A full list can be found [here](https://pytorch.org/docs/stable/nn.html).
+# In case you need a certain network layer, check the documentation of the package first before writing the layer yourself as the package likely contains the code for it already.
+# We already imported it at the beginning of the notebook as `nn`.
+
+# %%
+
+# %% [markdown]
+# In addition to `torch.nn`, there is also `torch.nn.functional`.
+# It contains functions that are used in network layers.
+# This is in contrast to `torch.nn` which defines them as `nn.Modules` (more on it below), and `torch.nn` actually uses a lot of functionalities from `torch.nn.functional`.
+# Hence, the functional package is useful in many situations as well.
+
+# %% [markdown]
+# #### nn.Module
+#
+# In PyTorch, a neural network is built up out of modules.
+# Modules can contain other modules, and a neural network is considered to be a module itself as well.
+# The basic template of a module is as follows:
+
+
+# %%
+class MyModule(nn.Module):
+    def __init__(self):
+        super().__init__()
+        # Some init for my module
+
+    def forward(self, x):
+        # Function for performing the calculation of the module.
+        pass
+
+
+# %% [markdown]
+# The forward function is where the computation of the module takes place, and is executed when you call the module (`model = MyModule(); model(x)`).
+# In the init function, we usually create the parameters of the module, using `nn.Parameter`, or defining other modules that are used in the forward function.
+# The backward calculation is done automatically, but could be overridden as well if needed.
+#
+# #### Simple classifier
+# We can now make use of the pre-defined modules in the `torch.nn` package, and define our own small neural network.
+# We will use a minimal network with an input layer, one hidden layer with tanh as the activation function, and an output layer.
+# In other words, our network should look something like this:
+#
+# <center style="width: 100%"><img src="small_neural_network.svg" width="300px" /></center>
+#
+# The input neurons are shown in blue, which represent the coordinates $x_1$ and $x_2$ of a data point.
+# The hidden neurons including a tanh activation are shown in white, and the output neuron in red.
+# In PyTorch, we can define this as follows:
+
+
+# %%
+class SimpleClassifier(nn.Module):
+    def __init__(self, num_inputs, num_hidden, num_outputs):
+        super().__init__()
+        # Initialize the modules we need to build the network
+        self.linear1 = nn.Linear(num_inputs, num_hidden)
+        self.act_fn = nn.Tanh()
+        self.linear2 = nn.Linear(num_hidden, num_outputs)
+
+    def forward(self, x):
+        # Perform the calculation of the model to determine the prediction
+        x = self.linear1(x)
+        x = self.act_fn(x)
+        x = self.linear2(x)
+        return x
+
+
+# %% [markdown]
+# For the examples in this notebook, we will use a tiny neural network with two input neurons and four hidden neurons.
+# As we perform binary classification, we will use a single output neuron.
+# Note that we do not apply a sigmoid on the output yet.
+# This is because other functions, especially the loss, can be calculated more efficiently and precisely on the original outputs than on the sigmoid output.
+# We will discuss the detailed reason later.
+
+# %%
+model = SimpleClassifier(num_inputs=2, num_hidden=4, num_outputs=1)
+# Printing a module shows all its submodules
+print(model)
+
+# %% [markdown]
+# Printing the model lists all submodules it contains.
+# The parameters of a module can be obtained by using its `parameters()` method, or `named_parameters()` to get a name for each parameter object.
+# For our small neural network, we have the following parameters:
+
+# %%
+for name, param in model.named_parameters():
+    print(f"Parameter {name}, shape {param.shape}")
+
+# %% [markdown]
+# Each linear layer has a weight matrix of the shape `[output, input]`, and a bias of the shape `[output]`.
+# The tanh activation function does not have any parameters.
+# Note that parameters are only registered for `nn.Module` objects that are direct object attributes, i.e. `self.a = ...`.
+# If you define a list of modules, the parameters of those are not registered for the outer module and can cause some issues when you try to optimize your module.
+# There are alternatives, like `nn.ModuleList`, `nn.ModuleDict` and `nn.Sequential`, that allow you to have different data structures of modules.
+# We will use them in a few later tutorials and explain them there.
+
+# %% [markdown]
+# ### The data
+#
+# PyTorch also provides a few functionalities to load the training and
+# test data efficiently, summarized in the package `torch.utils.data`.
+
+# %%
+
+# %% [markdown]
+# The data package defines two classes which are the standard interface for handling data in PyTorch: `data.Dataset`, and `data.DataLoader`.
+# The dataset class provides a uniform interface to access the
+# training/test data, while the data loader makes sure to efficiently load
+# and stack the data points from the dataset into batches during training.
+
+# %% [markdown]
+# #### The dataset class
+#
+# The dataset class summarizes the basic functionality of a dataset in a natural way.
+# To define a dataset in PyTorch, we simply specify two functions: `__getitem__`, and `__len__`.
+# The get-item function has to return the $i$-th data point in the dataset, while the len function returns the size of the dataset.
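+# In its most bare-bones form, such a class could look as follows (a minimal sketch; `TemplateDataset` and its placeholder tensor are made up for illustration):
+
+
+# %%
+class TemplateDataset(data.Dataset):
+    def __init__(self):
+        super().__init__()
+        self.data = torch.arange(10)  # Placeholder data
+
+    def __len__(self):
+        # Return the size of the dataset
+        return self.data.shape[0]
+
+    def __getitem__(self, idx):
+        # Return the idx-th data point
+        return self.data[idx]
+
+
+# %% [markdown]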
+# For the XOR dataset, we can define the dataset class as follows:
+
+# %%
+
+
+class XORDataset(data.Dataset):
+    def __init__(self, size, std=0.1):
+        """
+        Inputs:
+            size - Number of data points we want to generate
+            std - Standard deviation of the noise (see generate_continuous_xor function)
+        """
+        super().__init__()
+        self.size = size
+        self.std = std
+        self.generate_continuous_xor()
+
+    def generate_continuous_xor(self):
+        # Each data point in the XOR dataset has two variables, x and y, that can be either 0 or 1
+        # The label is their XOR combination, i.e. 1 if only x or only y is 1 while the other is 0.
+        # If x=y, the label is 0.
+        data = torch.randint(low=0, high=2, size=(self.size, 2), dtype=torch.float32)
+        label = (data.sum(dim=1) == 1).to(torch.long)
+        # To make it slightly more challenging, we add a bit of Gaussian noise to the data points.
+        data += self.std * torch.randn(data.shape)
+
+        self.data = data
+        self.label = label
+
+    def __len__(self):
+        # Number of data points we have. Alternatively self.data.shape[0], or self.label.shape[0]
+        return self.size
+
+    def __getitem__(self, idx):
+        # Return the idx-th data point of the dataset
+        # If we have multiple things to return (data point and label), we can return them as a tuple
+        data_point = self.data[idx]
+        data_label = self.label[idx]
+        return data_point, data_label
+
+
+# %% [markdown]
+# Let's try to create such a dataset and inspect it:
+
+# %%
+dataset = XORDataset(size=200)
+print("Size of dataset:", len(dataset))
+print("Data point 0:", dataset[0])
+
+# %% [markdown]
+# To better relate to the dataset, we visualize the samples below.
+
+
+# %%
+def visualize_samples(data, label):
+    if isinstance(data, torch.Tensor):
+        data = data.cpu().numpy()
+    if isinstance(label, torch.Tensor):
+        label = label.cpu().numpy()
+    data_0 = data[label == 0]
+    data_1 = data[label == 1]
+
+    plt.figure(figsize=(4, 4))
+    plt.scatter(data_0[:, 0], data_0[:, 1], edgecolor="#333", label="Class 0")
+    plt.scatter(data_1[:, 0], data_1[:, 1], edgecolor="#333", label="Class 1")
+    plt.title("Dataset samples")
+    plt.ylabel(r"$x_2$")
+    plt.xlabel(r"$x_1$")
+    plt.legend()
+
+
+# %%
+visualize_samples(dataset.data, dataset.label)
+plt.show()
+
+# %% [markdown]
+# #### The data loader class
+#
+# The class `torch.utils.data.DataLoader` represents a Python iterable over a dataset with support for automatic batching, multi-process data loading and many more features.
+# The data loader communicates with the dataset using the function `__getitem__`, and stacks its outputs as tensors over the first dimension to form a batch.
+# In contrast to the dataset class, we usually don't have to define our own data loader class, but can just create an object of it with the dataset as input.
+# Additionally, we can configure our data loader with the following input arguments (only a selection, see full list [here](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)):
+#
+# * `batch_size`: Number of samples to stack per batch
+# * `shuffle`: If True, the data is returned in a random order.
+# This is important during training for introducing stochasticity.
+# * `num_workers`: Number of subprocesses to use for data loading.
+# The default, 0, means that the data will be loaded in the main process which can slow down training for datasets where loading a data point takes a considerable amount of time (e.g. large images).
+# More workers are recommended for those, but can cause issues on Windows computers.
+# For tiny datasets like ours, 0 workers are usually faster.
+# * `pin_memory`: If True, the data loader will copy Tensors into CUDA pinned memory before returning them.
+# This can save some time for large data points on GPUs.
+# Usually a good practice to use for a training set, but not necessarily for validation and test to save memory on the GPU.
+# * `drop_last`: If True, the last batch is dropped in case it is smaller than the specified batch size.
+# This occurs when the dataset size is not a multiple of the batch size.
+# Only potentially helpful during training to keep a consistent batch size.
+#
+# Let's create a simple data loader below:
+
+# %%
+data_loader = data.DataLoader(dataset, batch_size=8, shuffle=True)
+
+# %%
+# next(iter(...)) catches the first batch of the data loader
+# If shuffle is True, this will return a different batch every time we run this cell
+# For iterating over the whole dataset, we can simply use "for batch in data_loader: ..."
+data_inputs, data_labels = next(iter(data_loader))
+
+# The shape of the outputs is [batch_size, d_1,...,d_N] where d_1,...,d_N are the
+# dimensions of the data point returned from the dataset class
+print("Data inputs", data_inputs.shape, "\n", data_inputs)
+print("Data labels", data_labels.shape, "\n", data_labels)
+
+# %% [markdown]
+# ### Optimization
+#
+# After defining the model and the dataset, it is time to prepare the optimization of the model.
+# During training, we will perform the following steps:
+#
+# 1. Get a batch from the data loader
+# 2. Obtain the predictions from the model for the batch
+# 3. Calculate the loss based on the difference between predictions and labels
+# 4. Backpropagation: calculate the gradients for every parameter with respect to the loss
+# 5. Update the parameters of the model in the direction of the gradients
+#
+# We have seen how we can do steps 1, 2 and 4 in PyTorch. Now, we will look at steps 3 and 5.
+
+# %% [markdown]
+# #### Loss modules
+#
+# We can calculate the loss for a batch by simply performing a few tensor operations as those are automatically added to the computation graph.
+# For instance, for binary classification, we can use Binary Cross Entropy (BCE) which is defined as follows:
+#
+# $$\mathcal{L}_{BCE} = -\sum_i \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right]$$
+#
+# where $y$ are our labels, and $x$ our predictions, both in the range of $[0,1]$.
+# However, PyTorch already provides a list of predefined loss functions which we can use (see [here](https://pytorch.org/docs/stable/nn.html#loss-functions) for a full list).
+# For instance, for BCE, PyTorch has two modules: `nn.BCELoss()`, `nn.BCEWithLogitsLoss()`.
+# While `nn.BCELoss` expects the inputs $x$ to be in the range $[0,1]$, i.e. the output of a sigmoid, `nn.BCEWithLogitsLoss` combines a sigmoid layer and the BCE loss in a single class.
+# This version is numerically more stable than using a plain sigmoid followed by a BCE loss because of the logarithms applied in the loss function.
+# Hence, it is advised to use loss functions applied on "logits" where possible (remember to not apply a sigmoid on the output of the model in this case!).
+# For our model defined above, we therefore use the module `nn.BCEWithLogitsLoss`.
+
+# %%
+loss_module = nn.BCEWithLogitsLoss()
+
+# %% [markdown]
+# #### Stochastic Gradient Descent
+#
+# For updating the parameters, PyTorch provides the package `torch.optim` which has the most popular optimizers implemented.
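+# In its basic form, the update rule for a parameter $w$ with learning rate $\eta$ and loss $\mathcal{L}$ can be written as (a compact version of the description that follows):
+#
+# $$w \leftarrow w - \eta \cdot \frac{\partial \mathcal{L}}{\partial w}$$
+#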
+# We will discuss the specific optimizers and their differences later in the course, but will for now use the simplest of them: `torch.optim.SGD`.
+# Stochastic Gradient Descent updates parameters by multiplying the gradients with a small constant, called the learning rate, and subtracting those from the parameters (hence minimizing the loss).
+# Therefore, we slowly move towards the direction of minimizing the loss.
+# A good default value of the learning rate for a small network like ours is 0.1.
+
+# %%
+# Input to the optimizer are the parameters of the model: model.parameters()
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+
+# %% [markdown]
+# The optimizer provides two useful functions: `optimizer.step()`, and `optimizer.zero_grad()`.
+# The step function updates the parameters based on the gradients as explained above.
+# The function `optimizer.zero_grad()` sets the gradients of all parameters to zero.
+# While this function seems less relevant at first, it is a crucial pre-step before performing backpropagation.
+# If we were to call the `backward` function on the loss while the parameter gradients are non-zero from the previous batch, the new gradients would actually be added to the previous ones instead of overwriting them.
+# This is done because a parameter might occur multiple times in a computation graph, and we need to sum the gradients in this case instead of replacing them.
+# Hence, remember to call `optimizer.zero_grad()` before calculating the gradients of a batch.
+
+# %% [markdown]
+# ### Training
+#
+# Finally, we are ready to train our model.
+# As a first step, we create a slightly larger dataset and specify a data loader with a larger batch size.
+
+# %%
+train_dataset = XORDataset(size=1000)
+train_data_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)
+
+# %% [markdown]
+# Now, we can write a small training function.
+# Remember our five steps: load a batch, obtain the predictions, calculate the loss, backpropagate, and update.
+# Additionally, we have to push all data and model parameters to the device of our choice (GPU if available).
+# For the tiny neural network we have, communicating the data to the GPU actually takes much more time than we could save from running the operation on the GPU.
+# For large networks, the communication time is significantly smaller than the actual runtime, making a GPU crucial in these cases.
+# Still, to practice, we will push the data to the GPU here.
+
+# %%
+# Push model to device. This only has to be done once
+model.to(device)
+
+# %% [markdown]
+# In addition, we set our model to training mode.
+# This is done by calling `model.train()`.
+# There exist certain modules that need to perform a different forward
+# step during training than during testing (e.g. BatchNorm and Dropout),
+# and we can switch between them using `model.train()` and `model.eval()`.
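+# Whether a module is currently in training mode is stored in its boolean `training` attribute (a quick check; our simple model behaves identically in both modes):
+
+# %%
+model.eval()
+print("Training mode:", model.training)
+model.train()
+print("Training mode:", model.training)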
+
+
+# %%
+def train_model(model, optimizer, data_loader, loss_module, num_epochs=100):
+    # Set model to train mode
+    model.train()
+
+    # Training loop
+    for epoch in tqdm(range(num_epochs)):
+        for data_inputs, data_labels in data_loader:
+
+            # Step 1: Move input data to device (only strictly necessary if we use GPU)
+            data_inputs = data_inputs.to(device)
+            data_labels = data_labels.to(device)
+
+            # Step 2: Run the model on the input data
+            preds = model(data_inputs)
+            preds = preds.squeeze(dim=1)  # Output is [Batch size, 1], but we want [Batch size]
+
+            # Step 3: Calculate the loss
+            loss = loss_module(preds, data_labels.float())
+
+            # Step 4: Perform backpropagation
+            # Before calculating the gradients, we need to ensure that they are all zero.
+            # The gradients would not be overwritten, but actually added to the existing ones.
+            optimizer.zero_grad()
+            # Perform backpropagation
+            loss.backward()
+
+            # Step 5: Update the parameters
+            optimizer.step()
+
+
+# %%
+train_model(model, optimizer, train_data_loader, loss_module)
+
+# %% [markdown]
+# #### Saving a model
+#
+# After training a model, we save it to disk so that we can load the same weights at a later time.
+# For this, we extract the so-called `state_dict` from the model which contains all learnable parameters.
+# For our simple model, the state dict contains the following entries:
+
+# %%
+state_dict = model.state_dict()
+print(state_dict)
+
+# %% [markdown]
+# To save the state dictionary, we can use `torch.save`:
+
+# %%
+# torch.save(object, filename). For the filename, any extension can be used
+torch.save(state_dict, "our_model.tar")
+
+# %% [markdown]
+# To load a model from a state dict, we use the function `torch.load` to
+# load the state dict from the disk, and the module function
+# `load_state_dict` to overwrite our parameters with the new values:
+
+# %%
+# Load state dict from the disk (make sure it is the same name as above)
+state_dict = torch.load("our_model.tar")
+
+# Create a new model and load the state
+new_model = SimpleClassifier(num_inputs=2, num_hidden=4, num_outputs=1)
+new_model.load_state_dict(state_dict)
+
+# Verify that the parameters are the same
+print("Original model\n", model.state_dict())
+print("\nLoaded model\n", new_model.state_dict())
+
+# %% [markdown]
+# A detailed tutorial on saving and loading models in PyTorch can be found
+# [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).
+
+# %% [markdown]
+# ### Evaluation
+#
+# Once we have trained a model, it is time to evaluate it on a held-out test set.
+# As our dataset consists of randomly generated data points, we need to
+# first create a test set with a corresponding data loader.
+
+# %%
+test_dataset = XORDataset(size=500)
+# drop_last -> Don't drop the last batch although it is smaller than 128
+test_data_loader = data.DataLoader(test_dataset, batch_size=128, shuffle=False, drop_last=False)
+
+# %% [markdown]
+# As our metric, we will use accuracy, which is calculated as follows:
+#
+# $$acc = \frac{\#\text{correct predictions}}{\#\text{all predictions}} = \frac{TP+TN}{TP+TN+FP+FN}$$
+#
+# where TP are the true positives, TN true negatives, FP false positives, and FN the false negatives.
+#
+# When evaluating the model, we don't need to keep track of the computation graph as we don't intend to calculate the gradients.
+# This reduces the required memory and speeds up the model.
+# In PyTorch, we can deactivate the computation graph using `with torch.no_grad(): ...`.
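+# A quick demonstration of the effect (a small sketch, independent of our classifier):
+
+# %%
+x = torch.ones(3, requires_grad=True)
+with torch.no_grad():
+    y = x + 1  # No computation graph is built inside this block
+print("y.requires_grad:", y.requires_grad)
+
+# %% [markdown]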
+# Remember to additionally set the model to eval mode.
+
+
+# %%
+def eval_model(model, data_loader):
+    model.eval()  # Set model to eval mode
+    true_preds, num_preds = 0.0, 0.0
+
+    with torch.no_grad():  # Deactivate gradients for the following code
+        for data_inputs, data_labels in data_loader:
+
+            # Determine prediction of model on dev set
+            data_inputs, data_labels = data_inputs.to(device), data_labels.to(device)
+            preds = model(data_inputs)
+            preds = preds.squeeze(dim=1)
+            preds = torch.sigmoid(preds)  # Sigmoid to map predictions between 0 and 1
+            pred_labels = (preds >= 0.5).long()  # Binarize predictions to 0 and 1
+
+            # Keep records of predictions for the accuracy metric (true_preds=TP+TN, num_preds=TP+TN+FP+FN)
+            true_preds += (pred_labels == data_labels).sum()
+            num_preds += data_labels.shape[0]
+
+    acc = true_preds / num_preds
+    print(f"Accuracy of the model: {100.0*acc:4.2f}%")
+
+
+# %%
+eval_model(model, test_data_loader)
+
+# %% [markdown]
+# If we trained our model correctly, we should see a score close to 100% accuracy.
+# However, this is only possible because of our simple task, and
+# unfortunately, we usually don't get such high scores on test sets of
+# more complex tasks.
+
+# %% [markdown]
+# #### Visualizing classification boundaries
+#
+# To visualize what our model has learned, we can perform a prediction for every data point in a range of $[-0.5, 1.5]$, and visualize the predicted class as in the sample figure at the beginning of this section.
+# This shows where the model has created decision boundaries, and which points would be classified as $0$, and which as $1$.
+# We therefore get a background image out of blue (class 0) and orange (class 1).
+# In the spots where the model is uncertain, we will see a blurry overlap.
+# The specific code is less relevant compared to the output figure which
+# should hopefully show us a clear separation of classes:
+
+
+# %%
+@torch.no_grad()  # Decorator, same effect as "with torch.no_grad(): ..." over the whole function.
+def visualize_classification(model, data, label):
+    if isinstance(data, torch.Tensor):
+        data = data.cpu().numpy()
+    if isinstance(label, torch.Tensor):
+        label = label.cpu().numpy()
+    data_0 = data[label == 0]
+    data_1 = data[label == 1]
+
+    plt.figure(figsize=(4, 4))
+    plt.scatter(data_0[:, 0], data_0[:, 1], edgecolor="#333", label="Class 0")
+    plt.scatter(data_1[:, 0], data_1[:, 1], edgecolor="#333", label="Class 1")
+    plt.title("Dataset samples")
+    plt.ylabel(r"$x_2$")
+    plt.xlabel(r"$x_1$")
+    plt.legend()
+
+    # Let's make use of a lot of operations we have learned above
+    model.to(device)
+    c0 = torch.Tensor(to_rgba("C0")).to(device)
+    c1 = torch.Tensor(to_rgba("C1")).to(device)
+    x1 = torch.arange(-0.5, 1.5, step=0.01, device=device)
+    x2 = torch.arange(-0.5, 1.5, step=0.01, device=device)
+    xx1, xx2 = torch.meshgrid(x1, x2)  # Meshgrid function as in numpy
+    model_inputs = torch.stack([xx1, xx2], dim=-1)
+    preds = model(model_inputs)
+    preds = torch.sigmoid(preds)
+    # Specifying "None" in a dimension creates a new one
+    # Weight the class-0 color by the probability of class 0 (i.e. 1 - preds), and vice versa
+    output_image = (1 - preds) * c0[None, None] + preds * c1[None, None]
+    output_image = (
+        output_image.cpu().numpy()
+    )  # Convert to numpy array. This only works for tensors on CPU, hence first push to CPU
+    plt.imshow(output_image, origin="upper", extent=(-0.5, 1.5, -0.5, 1.5))
+    plt.grid(False)
+
+
+visualize_classification(model, dataset.data, dataset.label)
+plt.show()
+
+# %% [markdown]
+# The decision boundaries might not look exactly as in the figure in the preamble of this section, which can be caused by running it on CPU or a different GPU architecture.
+# Nevertheless, the result on the accuracy metric should be approximately the same.
+
+# %% [markdown]
+# ## Additional features we didn't get to discuss yet
+#
+# Finally, you are all set to start with your own PyTorch project!
+# In summary, we have looked at how we can build neural networks in PyTorch, and train and test them on data.
+# However, there is still much more to PyTorch we haven't discussed yet.
+# In the coming series of Jupyter notebooks, we will discover more and more functionalities of PyTorch, so that you also get familiar with PyTorch concepts beyond the basics.
+# If you are already interested in learning more of PyTorch, we recommend the official [tutorial website](https://pytorch.org/tutorials/) that contains many tutorials on various topics.
+# Especially logging with Tensorboard ([tutorial
+# here](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html))
+# is a good practice that we will explore from Tutorial 5 on.
diff --git a/course_UvA-DL/introduction-to-pytorch/comparison_CPU_GPU.png b/course_UvA-DL/introduction-to-pytorch/comparison_CPU_GPU.png
new file mode 100644
index 000000000..b7d94f91e
Binary files /dev/null and b/course_UvA-DL/introduction-to-pytorch/comparison_CPU_GPU.png differ
diff --git a/course_UvA-DL/introduction-to-pytorch/continuous_xor.svg b/course_UvA-DL/introduction-to-pytorch/continuous_xor.svg
new file mode 100644
index 000000000..12bfd7f81
--- /dev/null
+++ b/course_UvA-DL/introduction-to-pytorch/continuous_xor.svg
@@ -0,0 +1 @@
+
diff --git a/course_UvA-DL/introduction-to-pytorch/pytorch_computation_graph.svg b/course_UvA-DL/introduction-to-pytorch/pytorch_computation_graph.svg
new file mode 100644
index 000000000..19c488782
--- /dev/null
+++ b/course_UvA-DL/introduction-to-pytorch/pytorch_computation_graph.svg
@@ -0,0 +1 @@
+x2abc3yViewer does not support full SVG 1.1
diff --git a/course_UvA-DL/introduction-to-pytorch/small_neural_network.svg b/course_UvA-DL/introduction-to-pytorch/small_neural_network.svg
new file mode 100644
index 000000000..065197aa0
--- /dev/null
+++ b/course_UvA-DL/introduction-to-pytorch/small_neural_network.svg
@@ -0,0 +1 @@
+x1x2Viewer does not support full SVG 1.1