# PyTorch Tutorial

This is a short introduction to PyTorch. It covers what I find useful for understanding how to use PyTorch to train deep networks (MLPs, CNNs, RNNs).

There are a few reasons to use PyTorch over TensorFlow; if you are interested in an in-depth comparison I am sure you will be able to find something on the web. My personal perspective is that PyTorch might be better as an educational and self-educational resource because of the following reasons:

* **Multiple-level APIs**: do you want to write a new optimization algorithm? Use PyTorch's ```Tensor``` object and ignore everything else; need to perform visualization on weights and activation matrices? ```Tensor```s support Python-style indexing; want to try a new kind of architecture? Use the ```layers```-style APIs. You can use as little or as much of PyTorch as you want, and it still makes sense.
* **Plays nice with Python**: writing TensorFlow 1.x code can be weird and convoluted because of the concept of ```Session``` and because TF's ```Tensor```s have to be evaluated before they hold an actual value. This is because TensorFlow builds a static computational graph from your code, and each time you want to run the graph you have to interact with a ```Session```. PyTorch, on the other hand, constructs a dynamic graph that is built again each time you perform a forward pass on your network: this means that all Python control statements can be used in your model definition. This is especially helpful if you are learning about RNNs.

**Sources**:

[Official tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)

[Autograd whitepaper](https://openreview.net/pdf?id=BJJsrmfCZ)

[Autograd documentation](https://pytorch.org/docs/stable/autograd.html)

[pytorch-examples repository](https://github.com/jcjohnson/pytorch-examples) (very good)

In [0]:
# run this once
!pip install torch torchvision

## PyTorch basics

The core object in PyTorch is the ```tensor```. If you are familiar with TensorFlow, its role should not come as a surprise: it is the object that PyTorch uses to represent variables, data and just about everything numeric in your graph computation. The main difference is that PyTorch's ```tensor```s have a nicer API (opinion warning!), which makes them easier to use together with numpy.

In [0]:
## Hello, pytorch!

import torch
x = torch.rand(5, 3)
print(x)
z = x.numpy()
print(z)

## don't see the difference? try the following:
print(type(x))
print(type(z))

## tensor slicing
print(x[:, 2])
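One detail of the numpy interoperability is worth a quick sketch: for CPU tensors, ```x.numpy()``` and ```torch.from_numpy(...)``` share memory with the original object instead of copying it. The variable names below are illustrative only and are not used elsewhere in this tutorial.

```python
import numpy as np
import torch

# Tensor -> numpy: on CPU the array is a view of the tensor's memory.
t = torch.ones(3)
a = t.numpy()
t.add_(1)            # in-place update of the tensor
print(a)             # the numpy array sees the change

# numpy -> Tensor: torch.from_numpy shares memory as well.
b = np.zeros(3, dtype=np.float32)
u = torch.from_numpy(b)
b[0] = 5.0
print(u)

# Use .clone() first when an independent copy is wanted.
c = t.clone().numpy()
t.add_(1)
print(c)             # unaffected by the last in-place update
```

If the shared memory is not what you want, cloning before converting (as in the last step) is one simple way around it.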

In [0]:
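## Hello, tf! (TensorFlow 1.x graph-mode API; under TensorFlow 2.x it lives in tf.compat.v1)
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.random_normal((5, 3))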
print(x)

session = tf.Session()
x_ = session.run(x)
print(x_)

This allows us to discuss an important difference between ```tf``` and ```torch```. The first one uses a static computational graph, whereas ```torch``` uses a dynamic one. 

Simply put, this means that ```tf``` builds the computational graph only once; on the other hand, ```torch``` re-uses the user-provided graph definition to build the graph again and again. This has a few interesting repercussions:

* *Lower efficiency*: the graph is built again on every forward pass, and because it is allowed to change from one pass to the next, the bookkeeping needed for automatic differentiation has to be redone each time as well.
* *Possibility to use standard language control flow*: you can use Python's control flow operators in your program: the tensors and functions they build will be differentiated nicely. On the other hand, in tensorflow you have ```tf.cond(...)``` and similar functions. There is no ```session``` concept in PyTorch.

Note: it is still not apparent to me how exactly this second point follows from having a dynamic graph...
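A small sketch may help with the control-flow point. PyTorch records operations on tensors as they are executed, so a Python ```while``` or ```if``` simply decides which operations get recorded in a given forward pass. The function ```forward``` and the numbers below are made up for illustration.

```python
import torch

def forward(x, w):
    # Multiply by w a data-dependent number of times (plain Python loop).
    h = x
    steps = 0
    while h.norm() < 10 and steps < 5:
        h = h * w
        steps += 1
    # Plain Python branching on a tensor value.
    if h.sum() > 0:
        return h.sum(), steps
    return -h.sum(), steps

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor([1.0, 1.0])

out, steps = forward(x, w)
out.backward()

# With these inputs the loop runs 3 times, so out = 2 * w**3
# and d(out)/d(w) = 6 * w**2.
print(steps, out.item(), w.grad.item())
```

Only the operations that actually ran are part of the graph that ```backward()``` walks, which is why no special graph-level conditional is needed here.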




## PyTorch gradient computation

Each ```tensor``` object holds more than just an array of numbers; it also carries a few attributes needed for gradient computation, such as ```requires_grad```, ```grad``` and ```grad_fn```. Let's look at a few of these concepts without going into much detail:

In [0]:
## User-defined tensors do not have grad_fn. This avoids the tf.placeholder() stuff you have in tf to feed data into your model.
import torch
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
print(x.grad_fn)
print(y.grad_fn)



In [0]:
z = y * y * 3
out = z.mean()

print(z)
print(out)
## x.grad is None until the first backward pass, so only reset it if it exists
if x.grad is not None:
  x.grad.zero_() ## try to comment this out if you are curious!
out.backward()
print(x.grad) ## d(out)/d(x)
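To see what zeroing the gradient is for, here is a minimal sketch that recomputes the same expression several times. The helper ```compute_out``` is a hypothetical name introduced only for this example. Since ```out``` is the mean of ```3 * (x + 2)**2``` over four entries, each entry of the gradient should be ```6 * (x + 2) / 4 = 4.5``` when ```x``` is all ones.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

def compute_out(x):
    # Same computation as in the cells above.
    y = x + 2
    return (y * y * 3).mean()

compute_out(x).backward()
first = x.grad.clone()
print(first)            # every entry is 4.5

# A second backward pass adds to x.grad instead of replacing it.
compute_out(x).backward()
accumulated = x.grad.clone()
print(accumulated)      # twice the first gradient

# After zeroing, a new backward pass gives the plain gradient again.
x.grad.zero_()
compute_out(x).backward()
print(x.grad)
```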

## Implementing a two-layer MLP with PyTorch

In [0]:
## random data

num_examples = 1000
input_size = 100
num_classes = 10

x = torch.randn(num_examples, input_size)
y = torch.randn(num_examples, num_classes)


In [0]:
## model definition and training

H_size = 100

w1 = torch.randn(input_size, H_size, requires_grad=True)
b1 = torch.randn(H_size, requires_grad=True)
w2 = torch.randn(H_size, num_classes, requires_grad=True)
b2 = torch.randn(num_classes, requires_grad=True)

num_iterations = 5
learning_rate = 1e-6

for i in range(num_iterations):
  o1 = torch.matmul(x, w1) + b1
  o1 = torch.clamp(o1, min=0)
  o2 = torch.matmul(o1, w2) + b2
  y_pred = torch.clamp(o2, min=0)
  # sum of squared errors over all examples and outputs
  loss = (y_pred - y).pow(2).sum()
  print(i, loss.item())

  loss.backward()
  #print(w1.grad[:2])
  #print(w2.grad[:2])
  with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad
    b1 -= learning_rate * b1.grad
    b2 -= learning_rate * b2.grad

  # Manually zero the gradients after running the backward pass
  w1.grad.zero_()
  w2.grad.zero_()
  b1.grad.zero_()
  b2.grad.zero_()
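As a sanity check on what ```loss.backward()``` computes in the loop above, the gradients can also be derived by hand with the chain rule and compared against autograd. This is only a sketch: the tensor sizes are shrunk, the seed is arbitrary, and names such as ```grad_o2``` are my own.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randn(8, 3)
w1 = torch.randn(4, 5, requires_grad=True)
b1 = torch.randn(5, requires_grad=True)
w2 = torch.randn(5, 3, requires_grad=True)
b2 = torch.randn(3, requires_grad=True)

# Forward pass, same structure as the training loop.
h = torch.clamp(torch.matmul(x, w1) + b1, min=0)
o2 = torch.matmul(h, w2) + b2
y_pred = torch.clamp(o2, min=0)
loss = (y_pred - y).pow(2).sum()
loss.backward()

# Backward pass by hand, without autograd.
with torch.no_grad():
    # through the squared error and the output clamp
    grad_o2 = 2.0 * (y_pred - y) * (o2 > 0).float()
    grad_w2 = h.t() @ grad_o2
    grad_b2 = grad_o2.sum(dim=0)
    # through the second matmul and the hidden clamp
    grad_o1 = (grad_o2 @ w2.t()) * (h > 0).float()
    grad_w1 = x.t() @ grad_o1
    grad_b1 = grad_o1.sum(dim=0)

manual = [grad_w1, grad_b1, grad_w2, grad_b2]
auto = [w1.grad, b1.grad, w2.grad, b2.grad]
matches = [torch.allclose(m, a, rtol=1e-4, atol=1e-5) for m, a in zip(manual, auto)]
print(matches)
```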


## The above, step by step 

Cell by cell, actually!

In [0]:
# model definition
H_size = 100

w1 = torch.randn(input_size, H_size, requires_grad=True)
b1 = torch.randn(H_size, requires_grad=True)
w2 = torch.randn(H_size, num_classes, requires_grad=True)
b2 = torch.randn(num_classes, requires_grad=True)

In [0]:
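# a single training step
learning_rate = 1e-8

o1 = torch.matmul(x, w1) + b1
o1 = torch.clamp(o1, min=0)
o2 = torch.matmul(o1, w2) + b2
y_pred = torch.clamp(o2, min=0)
loss = (y_pred - y).pow(2).sum()

loss.backward()

print(loss.item())

with torch.no_grad():
  w1 -= learning_rate * w1.grad
  w2 -= learning_rate * w2.grad
  
print(w1.grad[0, 0:10])

# clear the gradients so the next backward pass starts from zero
w1.grad.zero_()
w2.grad.zero_()

In [0]:
# run this cell multiple times and watch the loss go down!

# redo the forward pass: the previous graph and loss belong to the old weights
o1 = torch.clamp(torch.matmul(x, w1) + b1, min=0)
y_pred = torch.clamp(torch.matmul(o1, w2) + b2, min=0)
loss = (y_pred - y).pow(2).sum()

loss.backward()

with torch.no_grad():
  w1 -= learning_rate * w1.grad
  w2 -= learning_rate * w2.grad
  w1.grad.zero_()
  w2.grad.zero_()
  
print(loss.item())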



## The ```Sequential``` high-level APIs

In [0]:
model = torch.nn.Sequential(
          torch.nn.Linear(input_size, H_size),
          torch.nn.ReLU(),
          torch.nn.Linear(H_size, num_classes),
        )

# size_average is deprecated; reduction='sum' gives the same summed loss
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-8

for i in range(100):
  y_pred = model(x)
  
  loss = loss_fn(y_pred, y)
  print(i, loss.item())
  
  model.zero_grad()
  
  loss.backward()

  with torch.no_grad():
    for param in model.parameters():
      # inside no_grad() the parameter can be updated in place directly
      param -= learning_rate * param.grad
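The manual parameter update above can also be delegated to an optimizer from ```torch.optim```. The following is a sketch rather than part of the original notebook: it rebuilds the data and model so it runs on its own, and the seed, the learning rate of ```1e-5``` and the 20 iterations are arbitrary choices of mine.

```python
import torch

torch.manual_seed(0)
input_size, H_size, num_classes = 100, 100, 10
x = torch.randn(1000, input_size)
y = torch.randn(1000, num_classes)

model = torch.nn.Sequential(
    torch.nn.Linear(input_size, H_size),
    torch.nn.ReLU(),
    torch.nn.Linear(H_size, num_classes),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Plain SGD performs the same update as the manual loop: p -= lr * p.grad
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

losses = []
for i in range(20):
    loss = loss_fn(model(x), y)
    losses.append(loss.item())

    optimizer.zero_grad()   # clear gradients from the previous iteration
    loss.backward()         # compute new gradients
    optimizer.step()        # apply the update to every parameter

print(losses[0], losses[-1])
```

Swapping ```torch.optim.SGD``` for another optimizer class changes the update rule without touching the rest of the loop.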