# A Quick Introduction to PyTorch

PyTorch is one of the preeminent machine-learning and optimization libraries currently available. It contains a number of powerful features that drastically simplify the task of fitting models and training neural networks. While we won't have time in this tutorial to examine more than a few of the core features, there are many additional tutorials available online. This tutorial will roughly follow a few of the introductory PyTorch tutorials available at [pytorch.org/tutorials](https://pytorch.org/tutorials/).

In this tutorial, we'll be covering three topics, each briefly:
1. The basic structure of objects in PyTorch
2. Non-linear optimization
3. Neural Networks

Author: Noah C. Benson &lt;[nben@uw.edu](mailto:nben@uw.edu)&gt;

## Basic PyTorch Data Structures

At first glance, PyTorch appears to be somewhat like NumPy in that it gives the user a set of classes and functions for interacting with a `Tensor` type that behaves much like NumPy's `ndarray` type. Both NumPy and PyTorch, for example, define functions like `log`, `sin`, and `mean` that work with their respective array type. However, the `Tensor` and `ndarray` objects aren't interchangeable. This is because PyTorch `Tensor`s are intended for use in optimization problems, and thus they track a wide variety of data about what computations they have been used in. These data are critical for performing efficient gradient-descent parameter-tuning, which is generally required for optimization and for training neural networks.

Let's start by importing a number of PyTorch modules for use in this tutorial.

In [None]:
# Import the PyTorch library:
import torch
# Import the Neural-Network sub-module of PyTorch:
from torch import nn
# The DataLoader class is a helper for loading datasets during model training:
from torch.utils.data import DataLoader

# Finally, we want to import matplotlib.pyplot so that we can visualize the
# results of our training.
import matplotlib.pyplot as plt

### PyTorch Tensors

Most of PyTorch revolves around a single datatype, the PyTorch `Tensor` (`torch.Tensor`). At first glance, the tensor class seems almost identical to NumPy's `ndarray` class. However, although they share many similarities, the two types are not directly compatible. The main differences between them are:
1. tensors track a variety of data that can be used to calculate the gradient of whatever computation one uses them in, and
2. tensors have a `device` parameter that allows one to specify `'cpu'` or `'cuda'` (when GPUs are available for calculations).

This notebook uses the default `device` value, `'cpu'`, so all of our calculations will run on CPUs, which is no problem for the kinds of calculations in this notebook. GPUs can be much faster for certain kinds of computations, especially those involving matrix convolution, which we will discuss later. To use the `'cuda'` option, you need both to have GPUs available and to have the appropriate software and drivers installed.
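The two differences listed above can be seen directly in a few lines. The following is only a minimal sketch with illustrative variable names (`x`, `c`, `w` are not used elsewhere in this tutorial); it assumes nothing beyond a working PyTorch installation and falls back to `'cpu'` when no GPU is reported.

```python
import torch

# Difference 1: tensors can track the computations they are used in.
x = torch.tensor(2.0, requires_grad=True)  # tracked for gradients
c = torch.tensor(3.0)                      # not tracked
y = x * c
# Results computed from a tracked tensor are tracked too, and they remember
# the operation that produced them in the grad_fn attribute.
print(y.requires_grad, y.grad_fn is not None)
# Results computed only from untracked tensors carry no such record.
z = c * c
print(z.requires_grad, z.grad_fn)

# Difference 2: tensors live on a device. Here we pick 'cuda' only when
# PyTorch reports that a GPU is usable, and use 'cpu' otherwise.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
w = torch.zeros((2, 3), device=device)
print(w.device)
```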

Let's take a closer look at the `torch.Tensor` type here.

In [None]:
# The torch library uses many of the same conventions as the numpy library.
u = torch.zeros((10, 3))  # Make a 10x3 matrix.

u[:,0] = 1 # Assign the first column to have values of 1.

u = (u - 0.5) *  3.0 # Subtract 1/2 then multiply the tensor by 3.

print(u) # Print the tensor.

In [None]:
# Note that we will sometimes get an error if we try to use numpy arrays and
# tensors interchangeably. Although some operations like this may work, you
# should generally work with one or the other and not try to mix them.
import numpy as np

a = np.zeros((10,3))

torch.mean(a)

In [None]:
# However, if you need to extract a numpy array from a tensor, you can
# first detach the tensor (i.e., remove it from the backend system
# that keeps track of gradients) then request the numpy representation.
a = u.detach().numpy()
print(type(a))
print(a)

In [None]:
# If you need to convert a numpy array into a tensor, you can also use the
# torch.from_numpy() function.
a = np.linspace(0, 1, 10)

b = torch.from_numpy(a)
print(type(b))
print(b)
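One detail worth knowing about `torch.from_numpy()` is that the resulting tensor shares memory with the original array and keeps its NumPy dtype. The sketch below illustrates this with small throwaway arrays (the names `a`, `b`, and `c` are just for this example); `clone()` is one way to get an independent copy.

```python
import numpy as np
import torch

a = np.zeros(3)
b = torch.from_numpy(a)

# The tensor keeps NumPy's dtype (float64 here) rather than PyTorch's
# usual default of float32.
print(b.dtype)

# The tensor and the array share memory, so an in-place change to one is
# visible in the other.
a[0] = 1.0
print(b)

# Cloning makes an independent copy that later changes to `a` don't affect.
c = torch.from_numpy(a).clone()
a[1] = 5.0
print(b)
print(c)
```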

The PyTorch library includes analog functions for most of NumPy's basic API;
for example, functions like `numpy.mean`, `numpy.std`, `numpy.exp`,
`numpy.log`, `numpy.sum`, `numpy.median`, and `numpy.sin` have PyTorch
 analogs `torch.mean`, `torch.std`, `torch.exp`, `torch.log`, `torch.sum`,
`torch.median`, and `torch.sin`. Most of these functions work similarly in
both libraries.

In [None]:
# To check that the interfaces are the same, let's make some identical
# arrays/tensors:
a = np.linspace(0, 1, 10)
u = torch.linspace(0, 1, 10)

print('Sum:')
print(' -', np.sum(a))
print(' -', torch.sum(u))

print('Standard Deviation:')
print(' -', np.std(a))
print(' -', torch.std(u))

Notice that the standard deviations produced by the two libraries aren't
identical!

This difference occurs because the libraries use different default formulas
for the standard deviation: `numpy.std` divides by `N` by default, while
`torch.std` divides by `N - 1`. (In brief: there are different formulas for
the standard deviation depending on whether you are trying to characterize
the set of values you have or trying to infer something about the
population/distribution from which the values were drawn.)

The point here is that it's important to verify that a PyTorch function does
what you think it does. Most of them operate similarly to NumPy, but it can't
be taken for granted.
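As a quick check, the two libraries can be made to agree by choosing the same formula explicitly. This sketch uses NumPy's `ddof` argument and PyTorch's `unbiased` argument; the variable names are illustrative only.

```python
import numpy as np
import torch

a = np.linspace(0, 1, 10)
u = torch.linspace(0, 1, 10)

# NumPy's default divides by N; ddof=1 switches to dividing by N - 1.
np_population = np.std(a)
np_sample = np.std(a, ddof=1)

# PyTorch's default divides by N - 1; unbiased=False switches to N.
torch_sample = torch.std(u)
torch_population = torch.std(u, unbiased=False)

print('Divide by N:    ', np_population, torch_population.item())
print('Divide by N - 1:', np_sample, torch_sample.item())
```

The remaining tiny differences are due to NumPy using 64-bit floats here while PyTorch uses 32-bit floats.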

## Nonlinear Optimization

The simplest kind of optimization problem involves some function `f(x)` that returns a single real value. Typically, the goal of the optimization is to determine the value(s) of `x` that produce the smallest value of `f(x)`. (Note that the `x` here might be a single number or a vector, but `f(x)` returns a single number.)

During optimization, we typically don't know how `f(x)` is calculated, but we can calculate `f(x)` for any given `x`.  For example, in Python code, we might define an optimization function as follows:

```python
def minimize(f):
    """Given a function `f` as an argument, attempts to find the value `x`
    such that `f(x)` is the minimum value of `f`. The return value is `x`.
    """
    # In this optimization function, we can call f(x):
    y0 = f(0)
    y1 = f(1)
    # But we don't know how f is calculating its outputs.
    ...

# A local minimum of this function could be found analytically, if we knew its form:
def fn_to_minimize(x):
    return (x - 2)**2 + (x - 1)**3

# But there's no way for the minimize function to know the form:
x_min = minimize(fn_to_minimize)
```

Most nonlinear optimization starts by making a guess as to what `x`-value yields the minimum `f(x)`, then calculating `f(x)` and using some strategy to make a new guess that is probably even closer to the minimum. If the value of `f(x)` for the new `x`-value is greater, then a different guess can be made; if it's lesser, then the process can be repeated.

The most common way, given `x1` and `f(x1)`, to guess a value `x2` such that `f(x2) < f(x1)` is to calculate the derivative `df(x) / dx` at `x1`. The sign of the derivative indicates the direction of increase (a positive derivative means the function is increasing and a negative derivative means it's decreasing), so by taking a small step away from `x1` in the direction opposite the derivative, one can get a little closer to the minimum.

This method works just as well with multiple parameters (or parameter vectors) as it does with a single parameter, as long as the function outputs a single number that can be minimized. In multidimensional functions, the gradient with respect to the parameters points in the direction of fastest increase, just as the derivative does in single-parameter functions.
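The procedure described above can be sketched in plain Python, without PyTorch. This is only an illustration: the helper names (`numerical_derivative`, `gradient_descent`), the finite-difference step `h`, the step size `lr`, and the number of steps are all assumptions chosen for this example. PyTorch calculates derivatives differently (and exactly), as we'll see shortly.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximates df/dx at x using a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient_descent(f, x0, lr=0.1, steps=100):
    """Repeatedly steps against the derivative of f, starting at x0."""
    x = x0
    for _ in range(steps):
        # A positive derivative means f increases to the right, so we move
        # left; a negative derivative means we move right.
        x = x - lr * numerical_derivative(f, x)
    return x

def f(x):
    return (x - 2)**2

x_min = gradient_descent(f, x0=5.0)
print(x_min)  # very close to 2
```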

As an example of how an optimization problem is typically solved in PyTorch, we will start with a very simple optimization problem: find the minimum of `f(x) = (x - 2)**2`.

In [None]:
# First we define the function that we're minimizing:
def func(x):
    return (x - 2)**2

# If we create a PyTorch tensor that tracks its gradient, we can have PyTorch
# automatically calculate the derivative of `func` for us.
x = torch.tensor(5.0, requires_grad=True)

# To do so, we call `func(x)`
y = func(x)
# Then call the backward gradient algorithm:
y.backward()
# Then the gradient is `x.grad`:
x.grad

To minimize `func(x)`, we can use the gradients along with an optimizer
object. PyTorch has several optimizer objects that use slightly different
strategies, most of which are based on gradient descent. We'll use one called
`SGD` which stands for Stochastic Gradient Descent. (The "stochastic" part of
the name is due to the algorithm's ability to handle training when random
subsets of a training dataset are used iteratively--the algorithm itself
doesn't incorporate stochastic decisions.)

In [None]:
# We want to start the minimization with x equal to this value:
x = torch.tensor(5.0, requires_grad=True)

# Declare an optimizer (SGD: stochastic gradient descent).
# We are minimizing over the argument x (i.e., the input), and
# we provide a low learning-rate (which affects how big the
# optimizer's steps are).
optimizer = torch.optim.SGD([x], lr=0.1)

# Now we can take several steps to see if the optimizer converges.
for step_number in range(40): # we'll take 40 steps...
    # We're starting a new step, so we reset the gradients.
    optimizer.zero_grad()
    loss = func(x)
    print("Step number", step_number,
          "  x = ", float(x),
          "  loss = ", float(loss))
    loss.backward()
    optimizer.step()

print(x, func(x))

This combination of calling `output.backward()` then examining the `input.grad` value works not just for single values, but also for high-dimensional tensors, and PyTorch is very efficient about calculating these gradients.
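As a small illustration of the multi-parameter case, here is a sketch with a three-element parameter vector. The function being differentiated (a sum of squares) is just an example chosen because its gradient, `2 * x`, is easy to verify by hand.

```python
import torch

# A vector of three parameters, all tracked for gradients.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The output must be a single number for backward() to be called
# without extra arguments, so we sum the squares.
y = torch.sum(x**2)
y.backward()

# The gradient of sum(x**2) with respect to x is 2*x.
print(x.grad)
```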

Because we can calculate the gradient for our loss function, we can minimize the loss by using a simple gradient descent optimizer.

Note that in the above example, because the optimization is a simple gradient descent, the starting point can potentially change the result. Specifically, the optimization algorithm will typically find the nearest local minimum to the starting point. We used a function with only one minimum value, but if `func(x)` were equal to `sin(x)`, then the algorithm would find whatever minimum was nearest to the initial `x`-value.
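To make the `sin(x)` case concrete, the sketch below runs the same kind of SGD loop from two different starting points. The helper name `descend`, the starting values, the learning rate, and the step count are all choices made for this illustration.

```python
import math
import torch

def descend(start, steps=200, lr=0.1):
    """Runs SGD on sin(x) from the given starting point; returns the final x."""
    x = torch.tensor(start, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.sin(x)
        loss.backward()
        optimizer.step()
    return x.item()

x_a = descend(1.0)
x_b = descend(4.0)
# Starting from 1.0, the optimizer ends up near -pi/2; starting from 4.0,
# it ends up near 3*pi/2. Both are minima of sin(x).
print(x_a, -math.pi / 2)
print(x_b, 3 * math.pi / 2)
```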

## Training a Neural Network to recognize images of handwritten numbers

The remainder of this tutorial provides an example walkthrough of how to train a neural network using PyTorch. We will use a public dataset of written numerals, the [MNIST database](https://en.wikipedia.org/wiki/MNIST_database), and we will set up a simple neural network that can be trained to recognize the number represented in a small (28x28) image.

### The MNIST Dataset

For this tutorial we'll be using the MNIST dataset, which contains images of handwritten numerals (0-9), each of which has been labeled. The dataset is available at various locations around the internet, but it has been prepopulated on the JupyterHub for us.

In [None]:
# Load training data.
(training_labels, training_images) = torch.load(
    '/home/jovyan/shared/data/benson-deep-learning/mnist_train.pt',
    weights_only=True)

# Load test data.
(test_labels, test_images) = torch.load(
    '/home/jovyan/shared/data/benson-deep-learning/mnist_test.pt',
    weights_only=True)

We now have a representation of this dataset in the variables `training_labels`/`training_images` and `test_labels`/`test_images`. Naturally, these correspond to our training and testing datasets; the two are kept separate so that the model can be evaluated on images it did not see during training, which lets us detect overfitting.

To use these with PyTorch, we will need to convert them into PyTorch `Dataset` classes. Datasets simply contain a number of data-points (images of numerals in our case) paired with their labels.

In [None]:
class MNISTDataset(torch.utils.data.Dataset):
    def __init__(self, images, labels):
        self.images = images.float()
        self.labels = labels
    def __getitem__(self, ii):
        return (self.images[ii], self.labels[ii])
    def __len__(self):
        return len(self.labels)
training_data = MNISTDataset(training_images, training_labels)
test_data = MNISTDataset(test_images, test_labels)

PyTorch typically expects to interact with training and testing data via a class called `DataLoader`, which we imported earlier, from the `torch.utils.data` module. This class manages the loading and caching of individual data samples from our datasets, and can even perform loading in parallel if the dataset is large and cumbersome enough. In this cell, we set up data loaders for the two datasets and demonstrate how they work.

In [None]:
# Training data usually arrives to the model-fitting routine as 
# a batch of samples. This sets the size of each batch.
batch_size = 64

# Create data loaders.
train_dataloader = DataLoader(training_data, batch_size=batch_size)
test_dataloader = DataLoader(test_data, batch_size=batch_size)

# Extract the first batch of images and labels and print their metadata:
for X, y in test_dataloader:
    print("Shape of X [Batch-Size, Height, Width]: ")
    print("     ", X.shape, X.dtype)
    print("Shape of y [1 Label per Image in the Batch]: ")
    print("     ", y.shape, y.dtype)
    break

Notice that when we loop over the `test_dataloader` object, it yields a sequence of `(X, y)` tuples where `X` is a tensor of images of handwritten numerals from the MNIST dataset and `y` is a tensor of the corresponding integer labels. Notice that the first dimension of both `X` and `y` is 64, which is our batch size; i.e., the dataloaders yield data samples in batches. The final two dimensions of `X` are the height and width of the image. Because the images are grayscale and thus have only one color-channel, each image is stored as a 2D array with no separate channel dimension; we will add that dimension explicitly when a model requires it.

The `y` labels that are returned are just integers, 0-9, corresponding to the associated numeral.
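The batching behavior can be seen in isolation with synthetic data. The sketch below uses random stand-in images and labels (entirely hypothetical, not the MNIST data) and PyTorch's `TensorDataset` helper in place of our `MNISTDataset` class. Note that the final batch is smaller when the dataset size isn't a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 150 random 28x28 "images" with random labels.
fake_images = torch.rand(150, 28, 28)
fake_labels = torch.randint(0, 10, (150,))
fake_data = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_data, batch_size=64)

# Collect the shapes of every batch that the loader yields.
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for (X, y) in loader]
for shapes in batch_shapes:
    print(shapes)
```

With 150 samples and a batch size of 64, the loader yields batches of 64, 64, and 22 samples. Passing `drop_last=True` to `DataLoader` would discard the incomplete final batch instead.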

Now, let's see if we can verify that these labels are correct by looking at a few of the images.

In [None]:
# We'll plot and label 4 images in a 2x2 grid:
(fig, axs) = plt.subplots(2, 2, figsize=(4,4))
axs = axs.flatten()

# Get one batch of samples from the dataloader--remember that this
# batch will have 64 images in it.
for (X_batch,y_batch) in train_dataloader:
    # Go ahead and plot the first four of these images
    for (ax,X,y) in zip(axs, X_batch, y_batch):
        ax.imshow(X)
        ax.set_title(y.item())
        ax.axis('off')
    break

### Defining and training the Network

The next step for our neural network project is to define the neural network itself.

PyTorch makes defining neural network models fairly easy. You simply need to declare a Python class that inherits from the `torch.nn.Module` class, which represents a single module of a neural network. Because a module can itself contain a number of network layers, the module can either represent a piece of a network or it can be the entire neural network.

When declaring a `Module`, we need to make sure to define the stack of layers in the network in the class's constructor (the `__init__()` method), and we need to declare how the model is calculated in the class's `forward()` method.

In [None]:
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()
print(model)

As we can see, this network is a stack of a few operators. First of all, we have a layer that flattens the inputs from 28x28 images into 784-element vectors. Next, we have a sub-stack of layers that consists of three [linear operators](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html), each of which is rectified by a [rectified linear unit](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html).

We aren't going to worry too much about what these particular layers of the network are doing. Suffice to say that these are fairly common components of neural networks, and that a full discussion of common neural network layers is beyond the scope of this tutorial.

One other thing to note is that the `out_features` of the final linear operator in the stack is 10. This means that the output is in fact a 10-dimensional tensor (like a numpy array whose shape is `(10,)`). Typically when performing this kind of classification problem we interpret each of the output dimensions as representing the likelihood of an input to the model belonging to one of the dataset classes, so if the output features for a particular image are `[0, 0.1, 0, 0.01, 0.5, 0.3, 0.2, 0, 0, 0.3]`, we interpret the model as predicting that the image belongs to class 4 (because `output[4]` is 0.5, which is the largest value in the outputs). In short, we can convert the output feature tensor into a predicted class number by taking the `argmax` of the output tensor.
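Here is a brief sketch of that conversion for a batch of outputs. The scores below are made up for illustration; they are not produced by the model above.

```python
import torch

# Hypothetical model outputs for a batch of two images (10 scores each).
outputs = torch.tensor([
    [0.0, 0.1, 0.0, 0.01, 0.5, 0.3, 0.2, 0.0, 0.0, 0.3],
    [2.0, 0.0, 0.0, 0.0,  0.0, 0.0, 0.0, 0.5, 0.0, 0.0]])

# argmax over dim=1 picks the highest-scoring class for each image.
predicted = torch.argmax(outputs, dim=1)
print(predicted)  # tensor([4, 0])

# softmax rescales each row into positive values that sum to 1 without
# changing which class has the largest value.
probs = torch.softmax(outputs, dim=1)
print(probs.sum(dim=1))
```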

---

In order to train the above neural network, we will need to define a loss function that we are trying to minimize. In this case, we can use a builtin loss function called [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). We won't discuss the details of how this loss function works, but it is commonly used with classification problems like the one we are encoding here.

We'll also need an optimizer, and in this case we'll use the same optimizer we used above: SGD (stochastic gradient descent).  Because our model was written using the `torch.nn.Module` class, we can get all of the model parameters by calling the `model.parameters()` method.
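To get a feel for what `parameters()` returns, the sketch below builds a stand-in `nn.Sequential` with the same layer sizes as the network above (the name `stack` is just for this example) and lists its parameter tensors.

```python
import torch
from torch import nn

# A stand-in with the same layer sizes as the network defined above.
stack = nn.Sequential(
    nn.Linear(28*28, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10), nn.ReLU())

# Each Linear layer contributes a weight matrix and a bias vector; the
# ReLU layers have no parameters of their own.
for (name, param) in stack.named_parameters():
    print(name, tuple(param.shape))

# The total number of individual values that the optimizer will tune:
n_params = sum(p.numel() for p in stack.parameters())
print(n_params)  # 669706
```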

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

To perform the training itself, we just need to step through the batches of the training dataset: for each batch we compute the model's loss, backpropagate to get the gradients, and let the optimizer take a step, much like we did in the simpler optimization example above.

In [None]:
# Get the size of the training dataset.
size = len(train_dataloader.dataset)

# Walk through each training batch:
for (batch_num, (X, y)) in enumerate(train_dataloader):
    # Compute prediction:
    pred = model(X)
    # And compute the loss of that prediction:
    loss = loss_fn(pred, y)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print out a status message every so often.
    if batch_num % 100 == 0:
        (loss, current) = (loss.item(), batch_num * len(X))
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

Okay, it appears that the above cell worked, but what did it do? We can see from the printed lines that as the optimization proceeded, the loss decreased. If we want to see how well this fitting procedure worked, we can look at some examples from the test dataset and see how well the trained model performs. For this, we can essentially copy-and-paste the code-block above that we used to look at the initial images and labels.

In [None]:
# We'll plot and label 4 images in a 2x2 grid:
(fig, axs) = plt.subplots(2, 2, figsize=(6,6))
axs = axs.flatten()

# Get one batch of samples from the dataloader--remember that this
# batch will have 64 images in it.
for (X_batch,y_batch) in test_dataloader:
    # We want a random set of 4 images each time, so we keep drawing
    # new image batches until we randomly draw a number over 0.9:
    if np.random.rand() < 0.9: continue
    # Go ahead and plot the first four of these images
    for (ax,X,y) in zip(axs, X_batch, y_batch):
        ax.imshow(X)
        # Get the model's prediction for this particular image:
        pred = model(X[None,:])
        # Convert the predicted tensor into a single label:
        label = torch.argmax(pred)
        # If that label is equal to y, the network got it right;
        # otherwise, it got it wrong!
        y_name = str(y.item())
        label_name = str(label.item())
        ax.set_title(f"Estimate: {label_name}; True: {y_name}")
        ax.axis('off')
    break

Clearly, our network isn't perfect--if you run the above cell many times, you will see that sometimes the network gets the numeral correct, and sometimes it gets it wrong. However, you might also notice that when the network is wrong, it is wrong in a fairly understandable way (for example, a 6 might be labeled a 0 or a 2 might be labeled a 7). This shouldn't be too surprising, considering that we trained a fairly small neural network with a simple architecture, but hopefully this example demonstrates the fundamentals of how PyTorch organizes models and networks used for machine learning.
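Looking at four images at a time only gives a rough impression. One way to summarize performance is to calculate the fraction of correct predictions over a whole dataloader. The helper below, `accuracy`, is a hypothetical function written for this sketch (it is not part of PyTorch), and it is demonstrated on a toy dataset so that the block runs on its own.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, dataloader):
    """Returns the fraction of samples whose argmax prediction is correct."""
    n_correct = 0
    n_total = 0
    # Gradients aren't needed for evaluation, so we turn tracking off.
    with torch.no_grad():
        for (X, y) in dataloader:
            pred = model(X)
            n_correct += (torch.argmax(pred, dim=1) == y).sum().item()
            n_total += len(y)
    return n_correct / n_total

# Hypothetical toy check: the "model" returns its input unchanged and each
# input is a one-hot vector, so every prediction matches its label.
toy_X = torch.eye(10).repeat(3, 1)
toy_y = torch.arange(10).repeat(3)
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=8)
print(accuracy(nn.Identity(), toy_loader))  # 1.0
```

With the objects defined in this tutorial, a call like `accuracy(model, test_dataloader)` should give the fraction of test images that the trained network labels correctly.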

## Training a Convolutional Neural Network (CNN)

A convolutional neural network is a specific kind of neural network.
Fundamentally, it operates on the same principles: the computation starts with a set of numbers that get filtered by a set of weights into a new layer of numbers; these numbers then get filtered by more layers, etc. The difference is primarily in how the networks are organized.

CNNs typically operate on images, and their parameters (weights) are small images called kernels that are convolved with the inputs. Image convolutions are difficult to explain with words, but [this animation](https://en.m.wikipedia.org/wiki/File:2D_Convolution_Animation.gif) does a good job of explaining the basic idea. In essence, each image kernel represents a filter that gets applied to each position/pixel of the image and that pixel's immediate neighborhood in the image to produce a new value for the next layer of the network.
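For readers who prefer code to animations, the sliding-window operation can be sketched in a few lines of NumPy. The function `convolve2d_valid` is a hypothetical helper written for this illustration; it uses no padding and a stride of 1, and, like PyTorch's `nn.Conv2d`, it applies the kernel without flipping it.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slides `kernel` over `image` (no padding, stride 1)."""
    (kh, kw) = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the neighborhood whose top-left corner is (i, j).
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 4x4 image containing the values 0-15 and a 2x2 kernel of ones, which
# simply adds up each 2x2 neighborhood.
image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((2, 2))
result = convolve2d_valid(image, kernel)
print(result.shape)  # (3, 3)
print(result)
```

In a CNN, the values in `kernel` are the parameters that the optimizer tunes during training.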

The image convolutions that a model performs can vary according to several meta-parameters, primarily:
* they can be larger or smaller, thus operating over a larger or smaller neighborhood of the input image;
* they can have a `stride`, instructing them to skip over pixels in the input when producing the output thus decreasing the size of the output image;
* they can pad their inputs to produce outputs of the same size (alternatively, the images decrease in size slightly due to pixel loss around the edges).
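The effect of these settings on the output size follows a simple formula, which can be sketched in plain Python. The function name `conv_output_size` is made up for this example, and the formula assumes no dilation. The kernel sizes and strides in the list mirror those used in the network defined in the next cell.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one dimension of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride) pairs applied in sequence to a 28-pixel image side.
layers = [(3, 1), (3, 1), (5, 2), (3, 1), (7, 2)]
size = 28
sizes = []
for (kernel_size, stride) in layers:
    size = conv_output_size(size, kernel_size, stride)
    sizes.append(size)
print(sizes)  # [26, 24, 10, 8, 1]

# With a 3x3 kernel, padding of 1 pixel keeps the output the same size.
same_size = conv_output_size(28, kernel_size=3, padding=1)
print(same_size)  # 28
```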

We can build a CNN using the `nn.Conv2d` module object; aside from the use of this type, the model is very similar to the `NeuralNetwork` type we built earlier.

In [None]:
# Define model
class ConvolutionalNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv2d_relu_stack = nn.Sequential(
            nn.Conv2d(1, 2, kernel_size=3, stride=1), #26x26
            nn.Conv2d(2, 2, kernel_size=3, stride=1), #24x24
            nn.ReLU(),
            nn.Conv2d(2, 8, kernel_size=5, stride=2), #10x10
            nn.Conv2d(8, 8, kernel_size=3, stride=1), #8x8
            nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=7, stride=2), #1x1
            nn.Flatten())
        
    def forward(self, x):
        logits = self.conv2d_relu_stack(x)
        return logits

model = ConvolutionalNeuralNetwork()
print(model)

We will train the network in a very similar manner as well.

In [None]:
# Get the size of the training dataset.
size = len(train_dataloader.dataset)

# Make the optimizer:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Walk through each training batch:
for (batch_num, (X, y)) in enumerate(train_dataloader):
    # Compute prediction:
    pred = model(X[:,None,...])
    # And compute the loss of that prediction:
    loss = loss_fn(pred, y)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print out a status message every so often.
    if batch_num % 100 == 0:
        (loss, current) = (loss.item(), batch_num * len(X))
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

Finally, we can validate the CNN model as we did for the earlier neural network.

In [None]:
# We'll plot and label 4 images in a 2x2 grid:
(fig, axs) = plt.subplots(2, 2, figsize=(6,6))
axs = axs.flatten()

# Get one batch of samples from the dataloader--remember that this
# batch will have 64 images in it.
for (X_batch,y_batch) in test_dataloader:
    # We want a random set of 4 images each time, so we keep drawing
    # new image batches until we randomly draw a number over 0.9:
    if np.random.rand() < 0.9: continue
    # Go ahead and plot the first four of these images
    for (ax,X,y) in zip(axs, X_batch, y_batch):
        ax.imshow(X)
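        # Get the model's prediction for this particular image; the CNN was
        # trained on [batch, channel, height, width] inputs, so we add both
        # the batch and channel dimensions here:
        pred = model(X[None, None, ...])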
        # Convert the predicted tensor into a single label:
        label = torch.argmax(pred)
        # If that label is equal to y, the network got it right;
        # otherwise, it got it wrong!
        y_name = str(y.item())
        label_name = str(label.item())
        ax.set_title(f"Estimate: {label_name}; True: {y_name}")
        ax.axis('off')
    break

## Additional References

Additional PyTorch materials can be found primarily at [pytorch.org](https://pytorch.org/) (note specifically the [Docs](https://pytorch.org/docs/) and [Tutorials](https://pytorch.org/tutorials/) links at the top of the page). You may also want to check out PyTorch's [60-minute blitz video tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html).