# Pytorch Tutorial
*by Hong Xu*

This tutorial aims to teach the skills necessary to write and understand *pytorch* code for deep learning purposes. We will touch on the basics of *pytorch*, the essential tools, and some additional resources. You do not need a GPU to run this tutorial, although you will need one to train any non-trivial network, so I include the GPU commands as an option to try. Most of the code here comes from [jcjohnson's pytorch examples](https://github.com/jcjohnson/pytorch-examples).

First you will need to install *pytorch* and *torchvision*. If you don't have these, you can either
* Install Anaconda, and run `conda install pytorch torchvision cudatoolkit=10.1 -c pytorch` or
* Run
    * `pip3 install torch torchvision` (linux/mac)
    * `pip3 install torch===1.4.0 torchvision===0.5.0 -f https://download.pytorch.org/whl/torch_stable.html` (windows)

Now we check whether they were installed correctly by importing them:

In [1]:
import torch
import torchvision
import numpy as np
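
If the imports succeed, printing the versions is a cheap extra check. The sketch below is only one way to do this; `torch.cuda.is_available()` reports whether a CUDA-capable GPU is visible, and it is fine for this tutorial if it prints `False`. The variable name `has_torchvision` is introduced here for illustration.

```python
import importlib.util

import torch

# Report the installed PyTorch version and whether a CUDA GPU is visible.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# Look up torchvision without importing it, so a missing package gives a
# readable message instead of an ImportError.
has_torchvision = importlib.util.find_spec("torchvision") is not None
print("torchvision installed:", has_torchvision)
```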

## Pytorch Basics

In this section, we will introduce four levels of abstraction: tensors, variables, layers, and modules.

### Tensors

*Pytorch* is a library primarily focused on deep learning, where the most convenient way to represent data and parameters is through tensors. **Tensors are multi-dimensional arrays**.
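
To make "multi-dimensional array" concrete, here is a minimal sketch that builds tensors with zero to three dimensions and inspects them. The variable names are illustrative only; the calls are standard `torch` functions.

```python
import numpy as np
import torch

scalar = torch.tensor(3.0)               # 0 dimensions, shape ()
vector = torch.tensor([1.0, 2.0, 3.0])   # 1 dimension, shape (3,)
matrix = torch.ones(2, 3)                # 2 dimensions, shape (2, 3)
batch = torch.zeros(4, 2, 3)             # 3 dimensions, shape (4, 2, 3)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, tuple(t.shape), t.dtype)

# Element-wise arithmetic works as it does in numpy.
doubled = vector * 2 + 1                 # tensor([3., 5., 7.])

# Conversion to and from numpy arrays (CPU tensors only).
as_numpy = matrix.numpy()
back_to_torch = torch.from_numpy(np.arange(6, dtype=np.float32).reshape(2, 3))
```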

<img src="./figures/tensor.png" width="350"  />

In *pytorch*, tensors play much the same role as *numpy* arrays, but they can also run on a GPU when requested. They have properties like `.shape` and support element-wise operations. The following code (from [jcjohnson](https://github.com/jcjohnson/pytorch-examples)'s pytorch examples) demonstrates training a 2-layer network on random input using only *pytorch* tensors.

Recall how a neural network is trained from the following image, and try to match each step of the process to the code below:

![structure](./figures/optimizer_explain.png)

In [2]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU <----------------------------------------

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y
  h = x.mm(w1)
  h_relu = h.clamp(min=0)
  y_pred = h_relu.mm(w2)

  # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
  # of shape (); we can get its value as a Python number with loss.item().
  loss = (y_pred - y).pow(2).sum()
  if(t%50 == 0):
      print(t, loss.item())

  # Backprop to compute gradients of w1 and w2 with respect to loss
  grad_y_pred = 2.0 * (y_pred - y)
  grad_w2 = h_relu.t().mm(grad_y_pred)
  grad_h_relu = grad_y_pred.mm(w2.t())
  grad_h = grad_h_relu.clone()
  grad_h[h < 0] = 0
  grad_w1 = x.t().mm(grad_h)

  # Update weights using gradient descent
  w1 -= learning_rate * grad_w1
  w2 -= learning_rate * grad_w2

0 29787476.0
50 14170.0693359375
100 595.2889404296875
150 54.57164764404297
200 7.232732772827148
250 1.146773338317871
300 0.19770023226737976
350 0.03555745631456375
400 0.006758421193808317
450 0.0015261215157806873


Running on GPU is as simple as uncommenting the `device = torch.device('cuda')` line above. Do not overlook the power of GPU operations on tensors. For instance, the following 3000x3000 matrix multiplication takes 350 ms on the CPU, and only 0.1 ms on the GPU.

```python
# Task : compute matrix multiplication C = AB
d = 3000

# using numpy: takes 350 ms
A = np.random.rand(d,d).astype(np.float32)
B = np.random.rand(d,d).astype(np.float32)
C = A.dot(B)

# using torch with GPU
A = torch.rand(d, d).cuda()
B = torch.rand(d, d).cuda()
C = torch.mm(A, B)
```
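
The timings quoted above depend on the hardware, so it can be worth measuring them yourself. The sketch below is one way to do that, not the method used to obtain the numbers above; `time_matmul` is a hypothetical helper written for this example. CUDA operations are launched asynchronously, so the code calls `torch.cuda.synchronize()` before reading the clock; without it, the measurement can reflect only the launch overhead.

```python
import time

import torch


def time_matmul(d, device, repeats=3):
    """Return the best wall-clock time (seconds) of a d x d matrix product."""
    A = torch.rand(d, d, device=device)
    B = torch.rand(d, d, device=device)
    torch.mm(A, B)  # warm-up run, excluded from the timing
    best = float("inf")
    for _ in range(repeats):
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for pending GPU work
        start = time.perf_counter()
        torch.mm(A, B)
        if device.type == "cuda":
            torch.cuda.synchronize()  # GPU kernels run asynchronously
        best = min(best, time.perf_counter() - start)
    return best


d = 1000  # smaller than the 3000 used above, to keep the check quick
cpu_time = time_matmul(d, torch.device("cpu"))
print(f"CPU: {cpu_time * 1e3:.2f} ms")

if torch.cuda.is_available():
    gpu_time = time_matmul(d, torch.device("cuda"))
    print(f"GPU: {gpu_time * 1e3:.2f} ms")
```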

Technically, you could write any network code in *numpy*, but it would be painfully slow and excruciating.

### Variable

**A variable represents a node in a computational graph; it stores data and gradients**. This structure accounts for the fact that in gradient descent, the most popular optimization framework, we need to store both the value and the gradient of every parameter.

* `x.data` is a Tensor
* `x.grad` is a Variable of gradients (same shape as x.data)
* `x.grad.data` is the Tensor of gradients

![tensor](./figures/variable.png)

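**Autograd**: We can use the `torch.autograd` package to automatically calculate the gradients. This will be particularly useful when building intricate networks. In the code below, you will see how the variables imported from autograd work with `loss.backward()` to automatically compute the gradients of any variable on a computational graph. Note that since *pytorch* 0.4, `Variable` has been merged into `Tensor`: the `Variable` wrapper used below still works, but it simply returns a tensor, and `requires_grad=True` can be passed directly when creating a tensor.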


In [3]:
import torch
from torch.autograd import Variable

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = Variable(torch.randn(N, D_in, device=device), requires_grad=False)
y = Variable(torch.randn(N, D_out, device=device), requires_grad=False)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = Variable(torch.randn(D_in, H, device=device), requires_grad=True)
w2 = Variable(torch.randn(H, D_out, device=device), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
  # Forward pass: compute predicted y using operations on Tensors. Since w1 and
  # w2 have requires_grad=True, operations involving these Tensors will cause
  # PyTorch to build a computational graph, allowing automatic computation of
  # gradients. Since we are no longer implementing the backward pass by hand we
  # don't need to keep references to intermediate values.
  y_pred = x.mm(w1).clamp(min=0).mm(w2)
  
  loss = (y_pred - y).pow(2).sum()
  if(t%50 == 0):
      print(t, loss.item())

  # Use autograd to compute the backward pass. This call will compute the
  # gradient of loss with respect to all Tensors with requires_grad=True.
  # After this call w1.grad and w2.grad will be Tensors holding the gradient
  # of the loss with respect to w1 and w2 respectively.
  loss.backward()

  # Update weights using gradient descent. For this step we just want to mutate
  # the values of w1 and w2 in-place; we don't want to build up a computational
  # graph for the update steps, so we use the torch.no_grad() context manager
  # to prevent PyTorch from building a computational graph for the updates
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()

0 42456720.0
50 10116.80078125
100 229.71112060546875
150 9.445591926574707
200 0.5174202919006348
250 0.03318571299314499
300 0.0025607836432754993
350 0.00036081703728996217
400 9.777984814718366e-05
450 4.161860124440864e-05
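
The same mechanism can be checked on a tiny example where the gradient is known in closed form. This is a minimal sketch, separate from the network above: it uses a plain tensor with `requires_grad=True` instead of the `Variable` wrapper, and it also shows why the training loop zeroes the gradients, since `backward()` accumulates into `.grad` rather than overwriting it.

```python
import torch

# A plain tensor can track gradients itself; no Variable wrapper is needed.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Build a tiny graph: loss = sum(x_i ** 2), so d(loss)/dx_i = 2 * x_i.
loss = x.pow(2).sum()
loss.backward()
first_grad = x.grad.clone()

print(first_grad)        # gradient computed by autograd
print(2 * x.detach())    # closed-form gradient for comparison

# Gradients accumulate across backward() calls, which is why the training
# loops zero them on every iteration.
x.pow(2).sum().backward()
accumulated = x.grad.clone()   # now twice the single-pass gradient
x.grad.zero_()
```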


### Layers

**Layers are abstractions of frequently used computational graphs**, particularly in deep learning applications. Using layers from `torch.nn` allows us to quickly and efficiently build deep learning architectures that can automatically compute gradients and optimize. Layers are often the smallest units we have to deal with extensively for any network architecture. `torch.nn` also includes a plethora of loss functions, which can be seen as layers with a scalar output.

Some examples of `torch.nn` layers:
* ConvXd (X = 1, 2, 3) 
* ConvTransposeXd (X = 1, 2, 3)
* MaxPoolXd (X = 1, 2, 3)
* Dropout
* Linear
* Normalization
* MSELoss
* L1Loss
* BCELoss
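
As a quick illustration of how a few of these layers are used, the sketch below pushes a random batch through a convolution, a pooling layer, and a linear layer, then through a loss. The input size (8 single-channel 28x28 images) and the layer sizes are arbitrary choices for this example, not values from the tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

images = torch.randn(8, 1, 28, 28)   # batch of 8 single-channel 28x28 inputs

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)
pool = nn.MaxPool2d(kernel_size=2)
linear = nn.Linear(in_features=4 * 13 * 13, out_features=10)
loss_fn = nn.MSELoss()

features = pool(conv(images))          # (8, 4, 13, 13): 28 -> 26 -> 13
scores = linear(features.flatten(1))   # (8, 10)
loss = loss_fn(scores, torch.zeros(8, 10))  # loss layers return a scalar

print(features.shape, scores.shape, loss.shape)
```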

We can now write the previous code in a condensed way using only `torch.nn` layers:

In [4]:
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
  # Forward pass: compute predicted y by passing x to the model. Module objects
  # override the __call__ operator so you can call them like functions. When
  # doing so you pass a Tensor of input data to the Module and it produces
  # a Tensor of output data.
  y_pred = model(x)

  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())
  
  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad

0 607.2765502929688
50 33.4088020324707
100 2.6166069507598877
150 0.30343571305274963
200 0.041633669286966324
250 0.006539386697113514
300 0.001097670290619135
350 0.00019118207274004817
400 3.4066804801113904e-05
450 6.165347713249503e-06
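
The comments in the cell above contrast `reduction='sum'` with the mean. As a small worked check (a sketch on random data, not part of the original examples), the two differ only by the number of elements `N * D_out`, and the summed version matches the hand-written loss from the earlier cells.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
N, D_out = 64, 10
y_pred = torch.randn(N, D_out)
y = torch.randn(N, D_out)

sum_loss = torch.nn.MSELoss(reduction='sum')(y_pred, y)
mean_loss = torch.nn.MSELoss(reduction='mean')(y_pred, y)
manual = (y_pred - y).pow(2).sum()   # the loss written by hand earlier

print(sum_loss.item(), manual.item(), mean_loss.item() * N * D_out)
```

Because the gradients scale by the same factor, switching to `reduction='mean'` would typically call for a larger learning rate than the one used with the summed loss.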


### Network Modules

**Network modules are user-level abstractions of network substructures**. A module can be an entire network, or a network can be composed of many modules. For instance, you can have a module for a ResnetBlock, a commonly used structure in a lot of image processing networks. Modules allow you to build potentially massive networks in a manageable way.

Modules are wrapped in a class in pytorch. A module class has two important methods: the `__init__` method where all the layers or other modules are initialized and the `forward` method in charge of defining the computational graph.

Simple example (from Chongruo Wu’s tutorial):

![net_example](./figures/net_example.png)

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # Convolutional layer with 5x5 kernel size,
        # 1 input channel and 10 output channels.
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.mp = nn.MaxPool2d(2)
        self.fc = nn.Linear(320, 10)

    def forward(self, x):
        in_size = x.size(0)
        x = F.relu(self.mp(self.conv1(x)))
        x = F.relu(self.mp(self.conv2(x)))
        x = x.view(in_size, -1)  # Flattens the tensor
        x = self.fc(x)
        # Normalize over the class dimension
        return F.log_softmax(x, dim=1)
```
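
The `320` in the final linear layer follows from the input size. The sketch below assumes single-channel 28x28 inputs (MNIST-sized), which the example does not state but which is consistent with that number: 28 becomes 24 after a 5x5 convolution and 12 after pooling, then 8 and 4, giving 20 * 4 * 4 = 320 features. The class is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.mp = nn.MaxPool2d(2)
        self.fc = nn.Linear(320, 10)

    def forward(self, x):
        in_size = x.size(0)
        x = F.relu(self.mp(self.conv1(x)))
        x = F.relu(self.mp(self.conv2(x)))
        x = x.view(in_size, -1)
        return F.log_softmax(self.fc(x), dim=1)


net = Net()
x = torch.randn(4, 1, 28, 28)   # assumed input size: 4 images of 1x28x28

after_block1 = net.mp(net.conv1(x))             # (4, 10, 12, 12)
after_block2 = net.mp(net.conv2(after_block1))  # (4, 20, 4, 4)
out = net(x)                                    # (4, 10) log-probabilities

print(after_block1.shape, after_block2.shape, out.shape)
```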

Now we can tidy up our code from before by writing a TwoLayerNet module class.

In [5]:
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    """
    In the forward function we accept a Tensor of input data and we must return
    a Tensor of output data. We can use Modules defined in the constructor as
    well as arbitrary (differentiable) operations on Tensors.
    """
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function. There is no optimizer yet: the weights are
# updated by hand at the end of each iteration, using the same step size as
# in the previous cell.
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())

  # Zero the gradients before running the backward pass.
  model.zero_grad()

  # Backward pass: compute gradient of the loss with respect to all the learnable
  # parameters of the model. Internally, the parameters of each Module are stored
  # in Tensors with requires_grad=True, so this call will compute gradients for
  # all learnable parameters in the model.
  loss.backward()

  # Update the weights using gradient descent. Each parameter is a Tensor, so
  # we can access its data and gradients like we did before.
  with torch.no_grad():
    for param in model.parameters():
      param.data -= learning_rate * param.grad

0 719.9278564453125
50 36.06056213378906
100 2.6069955825805664
150 0.3170478641986847
200 0.050712019205093384
250 0.00920913927257061
300 0.0017887178109958768
350 0.00036087119951844215
400 7.470513082807884e-05
450 1.574227280798368e-05


## Pytorch: Optimizer

In this section, we will discuss the optimizer, which allows us to switch between learning algorithms with minimal modifications. In particular, we have been using gradient descent exclusively up to this point, but we might want to use a more sophisticated optimizer such as AdaGrad, RMSProp, or Adam, among others.

The `torch.optim` package allows us to do just that. Simply create an optimizer:

```python
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
```

This creates an Adam optimizer with learning rate `learning_rate`. The manual gradient descent step can then be replaced with

```python
optimizer.step()
```

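The code below trains our usual network through the optimizer interface. As written it constructs `torch.optim.SGD`; to use Adam, replace that line with the `torch.optim.Adam` constructor shown above.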

In [6]:
import torch

class TwoLayerNet(torch.nn.Module):
  def __init__(self, D_in, H, D_out):
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.linear2 = torch.nn.Linear(H, D_out)

  def forward(self, x):
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
  # Forward pass: Compute predicted y by passing x to the model
  y_pred = model(x)

  # Compute and print loss
  loss = loss_fn(y_pred, y)
  if(t%50 == 0):
      print(t, loss.item())

  # Zero gradients, perform a backward pass, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

0 727.7957763671875
50 31.996126174926758
100 1.9637930393218994
150 0.23282673954963684
200 0.04093573987483978
250 0.008779775351285934
300 0.002096084877848625
350 0.0005341502837836742
400 0.00014208459469955415
450 3.887387356371619e-05
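
The cell above constructs `torch.optim.SGD`. Switching to Adam changes only the constructor line, as the following sketch suggests. The learning rate `1e-3`, the seed, and the 200 iterations are assumptions made for this example, not values from the tutorial, and the model is written with `torch.nn.Sequential` to keep the block short.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# The only line that differs from the SGD version (lr is an assumed value).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for t in range(200):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    losses.append(loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("first loss:", losses[0], "last loss:", losses[-1])
```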


## Pytorch: MNIST step-by-step example

This section focuses on analyzing a step-by-step example of the 