# PyTorch Introduction

Goal takeaways:
- Automatic differentiation is a powerful tool
- PyTorch implements common functions used in deep learning
- Data processing with PyTorch `Dataset`
- Mixed precision training in PyTorch

In [3]:
import numpy as np
import torch

torch.manual_seed(446)
np.random.seed(446)

## Tensors and relation to numpy

By this point, we have worked with numpy quite a bit. PyTorch's building block, the `tensor`, is similar to numpy's `ndarray`.

In [6]:
# we create tensors in a similar way to numpy nd arrays
x_numpy = np.array([0.1, 0.2, 0.3])
x_torch = torch.tensor([0.1, 0.2, 0.3])
print('x_numpy, x_torch')
print(x_numpy, x_torch)
print()

# to and from numpy, pytorch
print('to and from numpy and pytorch')
print(torch.from_numpy(x_numpy), x_torch.numpy())
print()

# we can do basic operations like +-*/
y_numpy = np.array([3, 4, 5.])
y_torch = torch.tensor([3, 4, 5])
print('x + y')
print(x_numpy + y_numpy, x_torch + y_torch)
print()

# many functions that are in numpy are also in pytorch
print("norm")
print(np.linalg.norm(x_numpy), torch.norm(x_torch))
print()

# to apply an operation along a dimension,
# we use the dim keyword argument instead of axis
print('mean along the 0th dimension')
x_numpy = np.array([[1, 2.], [3, 4.]])
x_torch = torch.tensor([[1, 2.], [3, 4.]])
print(np.mean(x_numpy, axis=0), torch.mean(x_torch, dim=0))

x_numpy, x_torch
[0.1 0.2 0.3] tensor([0.1000, 0.2000, 0.3000])

to and from numpy and pytorch
tensor([0.1000, 0.2000, 0.3000], dtype=torch.float64) [0.1 0.2 0.3]

x + y
[3.1 4.2 5.3] tensor([3.1000, 4.2000, 5.3000])

norm
0.37416573867739417 tensor(0.3742)

mean along the 0th dimension
[2. 3.] tensor([2., 3.])
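The output above shows `dtype=torch.float64` for the tensor built with `torch.from_numpy`. A minimal sketch of why: `torch.from_numpy` reuses the NumPy array's memory and dtype, while `torch.tensor` copies the data. The variable names below are illustrative only.

```python
import numpy as np
import torch

a_np = np.array([0.1, 0.2, 0.3])    # NumPy defaults to float64
a_shared = torch.from_numpy(a_np)   # shares memory with a_np, keeps float64
a_copy = torch.tensor(a_np)         # copies the data, keeps float64

a_np[0] = 10.0                      # modify the NumPy array in place
print(a_shared[0].item())           # the shared tensor sees the change
print(a_copy[0].item())             # the copy does not

# Tensors built from Python floats default to float32
print(torch.tensor([0.1, 0.2, 0.3]).dtype)
```

If independent data is needed, copying (for example with `torch.tensor`) avoids surprises from the shared memory.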


## `Tensor.view`
We can use the `Tensor.view()` function to reshape tensors similarly to `numpy.reshape()`.

It can also automatically calculate the correct size of a dimension if `-1` is passed in for it. This is useful if we are working with
batches, but the batch size is unknown.

In [7]:
# A batch of MNIST-sized (28x28) images; real MNIST has 1 channel, 3 are used here for illustration
N, C, W, H = 10000, 3, 28, 28
X = torch.randn((N, C, W, H))

print(X.shape)
print(X.view(N, C, 784).shape)
print(X.view(-1, C, 784).shape)  # automatically choose the 0th dimension

torch.Size([10000, 3, 28, 28])
torch.Size([10000, 3, 784])
torch.Size([10000, 3, 784])
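One caveat worth illustrating: `view` never copies data, so it only works when the requested shape is compatible with the tensor's memory layout. The sketch below uses a small transposed tensor (an illustrative choice) to show `view` failing and `Tensor.reshape` succeeding by making a copy.

```python
import torch

m = torch.arange(6).view(2, 3)   # contiguous 2x3 tensor
mt = m.t()                       # transpose: same storage, non-contiguous
print(mt.is_contiguous())

try:
    mt.view(6)                   # view cannot reinterpret this memory layout
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)

flat = mt.reshape(6)             # reshape copies when a view is impossible
print(flat)
```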


## Broadcasting Semantics
Two tensors are "Broadcastable" if the following rules hold:
- Each tensor has at least one dimension
- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal,
one of them is 1, or one of them does not exist.

In [9]:
# PyTorch operations support NumPy Broadcasting Semantics
x = torch.empty(5, 1, 4, 1)
y = torch.empty(3, 1, 1)
print((x + y).size())

torch.Size([5, 3, 4, 1])
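To make the rules concrete, here is a small sketch that applies them by hand. `broadcast_shape` is a hypothetical helper written for this illustration, not a PyTorch function; it walks the shapes from the trailing dimension and treats a missing dimension as size 1.

```python
import torch


def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rules from right to left."""
    result = []
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        # A dimension that does not exist behaves like a dimension of size 1
        da = shape_a[-i] if i <= len(shape_a) else 1
        db = shape_b[-i] if i <= len(shape_b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"sizes {da} and {db} do not broadcast")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


x = torch.empty(5, 1, 4, 1)
y = torch.empty(3, 1, 1)
print(broadcast_shape(x.shape, y.shape))
print(tuple((x + y).shape))
```

Both prints should agree, and shapes such as `(2, 3)` and `(4,)` raise an error because the trailing sizes 3 and 4 differ and neither is 1.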


## Computation graphs

What's special about PyTorch's `tensor` object is that, for tensors created with `requires_grad=True`, it implicitly builds a
computation graph in the background. A computation graph is a way of writing a mathematical expression as a graph. There is an
algorithm (reverse-mode automatic differentiation, also known as backpropagation) that computes the gradients with respect to
all the variables of a computation graph in time of the same order as evaluating the function itself.

Consider the expression $e = (a+b) * (b+1)$ with values $a=2, b=1$. The diagram below shows the evaluated computation graph,
and the cell after it writes the same expression in PyTorch.

[![](https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggVERcbiAgICBFKGUgPSBjKmQsZT02KVxuICAgIEMoYyA9IGErYiwgYyA9IDMpXG4gICAgRChkID0gYisxLCBkID0gMilcbiAgICBBKGEsIGE9MilcbiAgICBCKGIsIGI9MSlcbiAgICBBIC0tPiBDXG4gICAgQiAtLT4gQ1xuICAgIEMgLS0-IEVcbiAgICBEIC0tPiBFIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQifSwidXBkYXRlRWRpdG9yIjpmYWxzZSwiYXV0b1N5bmMiOnRydWUsInVwZGF0ZURpYWdyYW0iOmZhbHNlfQ)](https://mermaid-js.github.io/mermaid-live-editor/edit#eyJjb2RlIjoiZ3JhcGggVERcbiAgICBFKGUgPSBjKmQsZT02KVxuICAgIEMoYyA9IGErYiwgYyA9IDMpXG4gICAgRChkID0gYisxLCBkID0gMilcbiAgICBBKGEsIGE9MilcbiAgICBCKGIsIGI9MSlcbiAgICBBIC0tPiBDXG4gICAgQiAtLT4gQ1xuICAgIEMgLS0-IEVcbiAgICBEIC0tPiBFIiwibWVybWFpZCI6IntcbiAgXCJ0aGVtZVwiOiBcImRlZmF1bHRcIlxufSIsInVwZGF0ZUVkaXRvciI6ZmFsc2UsImF1dG9TeW5jIjp0cnVlLCJ1cGRhdGVEaWFncmFtIjpmYWxzZX0)

In [10]:
a = torch.tensor(2.0, requires_grad=True)  # we set requires_grad=True to let PyTorch know to keep the graph
b = torch.tensor(1.0, requires_grad=True)
c = a + b
d = b + 1
e = c * d
print(f'{c}')
print(f'{d}')
print(f'{e}')

3.0
2.0
6.0


We can see that PyTorch kept track of the computation graph for us.
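The printed values alone do not reveal the graph, so here is a short sketch that inspects it. It repeats the same expression and looks at `is_leaf` and `grad_fn`, then backpropagates through the graph; the expected derivatives in the comments follow from the product rule.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
c = a + b
d = b + 1
e = c * d

print(a.is_leaf, e.is_leaf)   # a was created by the user, e by an operation
print(e.grad_fn)              # the graph node for the operation that produced e

e.backward()                  # traverse the graph from e back to the leaves
print(a.grad, b.grad)         # de/da = d = 2, de/db = c + d = 5
```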

## CUDA Semantics

It's easy to copy a tensor from the CPU to the GPU or from the GPU back to the CPU.

In [12]:
cpu = torch.device("cpu")

# "cuda" selects the default CUDA GPU, both on Colab (with a GPU runtime) and on a personal computer;
# torch.device("gpu") is not a valid device string. This cell requires a CUDA-capable GPU.
gpu = torch.device("cuda")

x = torch.rand(10)
print(x)
x = x.to(gpu)
print(x)
x = x.to(cpu)
print(x)

tensor([0.3959, 0.6177, 0.7256, 0.0971, 0.9186, 0.8277, 0.4409, 0.9344, 0.8967,
        0.1897])
tensor([0.3959, 0.6177, 0.7256, 0.0971, 0.9186, 0.8277, 0.4409, 0.9344, 0.8967,
        0.1897], device='cuda:0')
tensor([0.3959, 0.6177, 0.7256, 0.0971, 0.9186, 0.8277, 0.4409, 0.9344, 0.8967,
        0.1897])
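The cell above fails on a machine without a CUDA GPU. A common pattern, sketched here, is to pick the device at runtime with `torch.cuda.is_available()` so the same code runs in both settings; the variable names are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(10)
x_dev = x.to(device)        # no-op if x is already on that device
y = (x_dev * 2).cpu()       # compute on the device, bring the result back
print(device, y.device)
```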


## PyTorch as an auto grad framework

Now that we have seen that PyTorch keeps the graph around for us, let's use it to compute some gradients.

Consider the function $f(x) = (x-2)^2$

Q: Compute $\frac{df(x)}{dx}$ and then compute $f'(1)$

We call `backward()` on the output variable (`y`) of the computation, which computes the gradients of `y` with respect to all leaf tensors that require gradients (here `x`) at once.

In [13]:
def f(x):
    return (x - 2) ** 2


def fp(x):
    return 2 * (x - 2)


x = torch.tensor([1.0], requires_grad=True)

y = f(x)
y.backward()

print(f'Analytical f\'(x): {fp(x)}')
print(f'PyTorch\'s f\'(x): {x.grad}')

Analytical f'(x): tensor([-2.], grad_fn=<MulBackward0>)
PyTorch's f'(x): tensor([-2.])
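One behavior of `.grad` that matters later: each `backward()` call adds to it rather than overwriting it. A minimal sketch, recomputing the same function twice at $x = 1$:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)

((x - 2) ** 2).backward()
first = x.grad.clone()       # gradient after one backward pass

((x - 2) ** 2).backward()    # build the graph again and backpropagate
second = x.grad.clone()      # the two gradients have been summed

print(first, second)
x.grad.zero_()               # reset before the next independent computation
```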


It can also find gradients of functions of several variables.

Let $\omega = [\omega_1, \omega_2]^T$

Consider $g(\omega) = 2\omega_1\omega_2 + \omega_2 \cos(\omega_1)$

Q: Compute $\nabla_{\omega}g(\omega)$ and verify $\nabla_{\omega}g([\pi, 1]^T) = [2, 2\pi - 1]^T$

In [28]:
def g(w):
    return 2 * w[0] * w[1] + w[1] * torch.cos(w[0])


def grad_g(w):
    return torch.tensor([2 * w[1] - w[1] * torch.sin(w[0]), 2 * w[0] + torch.cos(w[0])])

w = torch.tensor([np.pi, 1], requires_grad=True)

z = g(w)
z.backward()

print(f'Analytical grad g(w): {grad_g(w)}')
print(f'PyTorch\'s grad g(w): {w.grad}')

Analytical grad g(w): tensor([2.0000, 5.2832])
PyTorch's grad g(w): tensor([2.0000, 5.2832])
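When no analytical gradient is at hand, a finite-difference check is a common way to sanity-check autograd. The sketch below uses central differences with a step `eps` chosen for illustration, and `float64` to keep the rounding error small.

```python
import math
import torch


def g(w):
    return 2 * w[0] * w[1] + w[1] * torch.cos(w[0])


w = torch.tensor([math.pi, 1.0], dtype=torch.float64, requires_grad=True)
g(w).backward()

eps = 1e-6  # illustrative step size
numeric = torch.zeros(2, dtype=torch.float64)
with torch.no_grad():
    for i in range(2):
        step = torch.zeros(2, dtype=torch.float64)
        step[i] = eps
        # Central difference approximation of the i-th partial derivative
        numeric[i] = (g(w + step) - g(w - step)) / (2 * eps)

print(w.grad)
print(numeric)
```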


## Using the gradients

Now that we have gradients, we can use our favorite optimization algorithm, gradient descent!

Let $f$ be the same function we defined above.

Q: What is the value of $x$ that minimizes $f$?

In [39]:
x = torch.tensor([5.0], requires_grad=True)
step_size = 0.25

print(f'Starting point: x = {x.item():.3f}')

Starting point: x = 5.000


In [40]:
print('iter, \tx, \tf(x), \tf\'(x), \tf\'(x) pytorch')
for i in range(15):
    y = f(x)
    y.backward()  # compute the gradient

    print(f'{i} \t{x.item():.3f} \t{f(x).item():.3f} \t{fp(x).item():.3f} \t{x.grad.item():.3f}')

    # Perform a GD update step. The update itself must not be tracked by
    # autograd, so we run it under torch.no_grad().
    with torch.no_grad():
        x -= step_size * x.grad

    # We need to zero the gradient because backward() accumulates
    # into .grad instead of overwriting it.
    x.grad.zero_()

iter, 	x, 	f(x), 	f'(x), 	f'(x) pytorch
0 	5.000 	9.000 	6.000 	6.000
1 	3.500 	2.250 	3.000 	3.000
2 	2.750 	0.562 	1.500 	1.500
3 	2.375 	0.141 	0.750 	0.750
4 	2.188 	0.035 	0.375 	0.375
5 	2.094 	0.009 	0.188 	0.188
6 	2.047 	0.002 	0.094 	0.094
7 	2.023 	0.001 	0.047 	0.047
8 	2.012 	0.000 	0.023 	0.023
9 	2.006 	0.000 	0.012 	0.012
10 	2.003 	0.000 	0.006 	0.006
11 	2.001 	0.000 	0.003 	0.003
12 	2.001 	0.000 	0.001 	0.001
13 	2.000 	0.000 	0.001 	0.001
14 	2.000 	0.000 	0.000 	0.000
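The manual update above is what `torch.optim.SGD` does for plain gradient descent. As a sketch, the same minimization can be written with the optimizer handling the update and the gradient reset; the learning rate `0.25` mirrors `step_size` from the loop above.

```python
import torch


def f(x):
    return (x - 2) ** 2


x = torch.tensor([5.0], requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.25)

for i in range(15):
    optimizer.zero_grad()   # clear gradients accumulated in the previous step
    loss = f(x)
    loss.backward()         # compute the gradient
    optimizer.step()        # x <- x - lr * x.grad

print(f'{x.item():.3f}')
```

Using an optimizer object makes it easy to swap in other update rules later without touching the loop body.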
