Credits



*   [DataFlowR online course](https://dataflowr.github.io/website/)
*   [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html)



# PyTorch tensors and automatic differentiation

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import numpy as np

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print('Using gpu: %s ' % torch.cuda.is_available())

Using gpu: True 


Tensors are used to encode the signal to process, but also the internal states and parameters of models.

**Manipulating data through this constrained structure makes it possible to use CPUs and GPUs at peak performance.**



## Initializing a tensor

Construct a 3x5 matrix, uninitialized:

In [4]:
x = torch.empty(3,5)
print(x.dtype)
print(x)

torch.float32
tensor([[6.0021e+00, 1.1379e-42, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]])


In [5]:
x = torch.randn(3,5)
print(x)

tensor([[ 1.0848,  0.9703,  0.9429, -0.2934,  0.5745],
        [ 0.0297,  0.2637, -0.2185, -0.0887, -1.4960],
        [ 0.5915, -0.0129, -0.9466,  1.2913, -0.8218]])


In [6]:
print(x.size())

torch.Size([3, 5])


`torch.Size` is in fact a subclass of [tuple](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences), so it supports the same operations.


In [7]:
x.size()[1]

5

In [8]:
x.size() == (3,5)

True
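
Because the size behaves like a tuple, it can also be unpacked directly. The short sketch below illustrates this; the names `rows` and `cols` are chosen here for illustration and do not appear elsewhere in this notebook.

```python
import torch

x = torch.randn(3, 5)

# torch.Size subclasses tuple, so unpacking, indexing and len() all work.
rows, cols = x.size()
print(rows, cols)                     # 3 5
print(len(x.size()))                  # 2
print(isinstance(x.size(), tuple))    # True

# x.shape is an alias for x.size(); numel() gives the number of elements.
print(x.shape == x.size())            # True
print(x.numel())                      # 15
```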

In [9]:
x_ones = torch.ones_like(x) # retains the properties of x

### Bridge to numpy


In [10]:
y = x.numpy()
print(y)

[[ 1.0848428   0.9702707   0.9429496  -0.29342934  0.5744521 ]
 [ 0.02971681  0.26370436 -0.21845931 -0.08866614 -1.4959964 ]
 [ 0.59146595 -0.01287817 -0.9465888   1.2913276  -0.8218247 ]]


In [11]:
a = np.ones(5)
b = torch.from_numpy(a)
print(a.dtype)
print(b)

float64
tensor([1., 1., 1., 1., 1.], dtype=torch.float64)


In [12]:
a[1]=0
b


tensor([1., 0., 1., 1., 1.], dtype=torch.float64)
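
The cell above shows that `b` changes when `a` is modified: `torch.from_numpy` shares memory with the NumPy array instead of copying it. The sketch below contrasts this with `torch.tensor`, which copies the data; the names `shared`, `copied`, `t` and `n` are illustrative only.

```python
import numpy as np
import torch

a = np.ones(5)
shared = torch.from_numpy(a)   # shares memory with ``a``
copied = torch.tensor(a)       # copies the data

a[1] = 0
print(shared)   # the second entry is now 0
print(copied)   # still all ones

# The bridge also shares memory in the other direction (CPU tensors only).
t = torch.zeros(3)
n = t.numpy()
t.add_(1)       # the in-place change is visible through ``n``
print(n)
```

If an independent copy is needed, `torch.tensor(a)` or `torch.from_numpy(a).clone()` are two possible choices.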

In [13]:
c = b.long()
print(c.dtype, c)
print(b.dtype, b)

torch.int64 tensor([1, 0, 1, 1, 1])
torch.float64 tensor([1., 0., 1., 1., 1.], dtype=torch.float64)


In [14]:
xr = torch.randn(3, 5)
print(xr.dtype, xr)

torch.float32 tensor([[ 0.0177,  0.0381,  0.3973, -0.5031, -0.8506],
        [ 0.4498,  0.3508, -1.8947,  0.8824, -1.8403],
        [-0.4247, -0.2585, -0.6598,  1.7379,  0.4978]])


In [15]:
resb = xr + b
resb

tensor([[ 1.0177,  0.0381,  1.3973,  0.4969,  0.1494],
        [ 1.4498,  0.3508, -0.8947,  1.8824, -0.8403],
        [ 0.5753, -0.2585,  0.3402,  2.7379,  1.4978]], dtype=torch.float64)

In [16]:
resc = xr + c
resc

tensor([[ 1.0177,  0.0381,  1.3973,  0.4969,  0.1494],
        [ 1.4498,  0.3508, -0.8947,  1.8824, -0.8403],
        [ 0.5753, -0.2585,  0.3402,  2.7379,  1.4978]])

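Be careful with types! `resb` is `float64` while `resc` is `float32`, so entries that print identically can still differ in their last digits: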

In [17]:
resb == resc

tensor([[False,  True, False,  True,  True],
        [False,  True,  True, False,  True],
        [False,  True,  True, False, False]])

In [18]:
torch.set_printoptions(precision=10)

In [19]:
resb[0,1]

tensor(0.0381111614, dtype=torch.float64)

In [20]:
resc[0,1]

tensor(0.0381111614)

In [21]:
resc[0,1].dtype

torch.float32

In [22]:
xr[0,1]

tensor(0.0381111614)

In [23]:
torch.set_printoptions(precision=4)
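
The dtype of a mixed-type result can be checked before computing anything. The sketch below uses `torch.result_type` and `torch.allclose`, neither of which appears in the cells above, to show one way of predicting the promoted type and of comparing results with a tolerance instead of `==`.

```python
import torch

xr = torch.randn(3, 5)                                        # float32
b = torch.tensor([1., 0., 1., 1., 1.], dtype=torch.float64)   # float64
c = b.long()                                                  # int64

# dtype that an operation on the two operands would produce
print(torch.result_type(xr, b))   # torch.float64
print(torch.result_type(xr, c))   # torch.float32

resb = xr + b
resc = xr + c

# Exact equality is fragile across precisions; compare with a tolerance
# after casting both sides to a common dtype.
close = torch.allclose(resb, resc.to(torch.float64))
print(close)
```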

## Attributes of a tensor

In [24]:
tensor = torch.rand(3,4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


## Manipulating

Standard numpy-like indexing and slicing

In [25]:
tensor = torch.ones(4, 4)
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:, 0]}")
print(f"Last column: {tensor[..., -1]}")
tensor[:,1] = 0
print(tensor)

First row: tensor([1., 1., 1., 1.])
First column: tensor([1., 1., 1., 1.])
Last column: tensor([1., 1., 1., 1.])
tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])


Joining tensors

In [26]:
t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)

tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.]])


Arithmetic operations

In [27]:
# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
# ``tensor.T`` returns the transpose of a tensor
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(y1)
torch.matmul(tensor, tensor.T, out=y3)


# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)

tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])

Single element tensors

In [28]:
agg = tensor.sum()
agg_item = agg.item()
print(agg_item, type(agg_item))

12.0 <class 'float'>


### In-place modification


In [29]:
x

tensor([[ 1.0848,  0.9703,  0.9429, -0.2934,  0.5745],
        [ 0.0297,  0.2637, -0.2185, -0.0887, -1.4960],
        [ 0.5915, -0.0129, -0.9466,  1.2913, -0.8218]])

In [30]:
xr

tensor([[ 0.0177,  0.0381,  0.3973, -0.5031, -0.8506],
        [ 0.4498,  0.3508, -1.8947,  0.8824, -1.8403],
        [-0.4247, -0.2585, -0.6598,  1.7379,  0.4978]])

In [31]:
print(x+xr)

tensor([[ 1.1026,  1.0084,  1.3402, -0.7965, -0.2761],
        [ 0.4795,  0.6145, -2.1132,  0.7937, -3.3363],
        [ 0.1668, -0.2714, -1.6064,  3.0292, -0.3240]])


In [32]:
x.add_(xr)
print(x)

tensor([[ 1.1026,  1.0084,  1.3402, -0.7965, -0.2761],
        [ 0.4795,  0.6145, -2.1132,  0.7937, -3.3363],
        [ 0.1668, -0.2714, -1.6064,  3.0292, -0.3240]])


Any operation that mutates a tensor in-place has a name ending with an underscore (```_```).

For example: ```x.fill_(y)```, ```x.t_()```, will change ```x```.
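
As a small illustration of the naming convention, the sketch below compares `add` with `add_`. It uses `data_ptr()`, which is not used elsewhere in this notebook, only to show that the in-place version keeps the same underlying storage.

```python
import torch

x = torch.ones(2, 3)
y = torch.full((2, 3), 2.0)

ptr_before = x.data_ptr()

z = x.add(y)      # out-of-place: returns a new tensor, x is unchanged
print(x[0, 0].item(), z[0, 0].item())   # 1.0 3.0

x.add_(y)         # in-place: x itself now holds the sum
print(x[0, 0].item())                   # 3.0

# The in-place version reuses the same storage.
same_storage = x.data_ptr() == ptr_before
print(same_storage)                     # True
```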

In [33]:
print(x.t())

tensor([[ 1.1026,  0.4795,  0.1668],
        [ 1.0084,  0.6145, -0.2714],
        [ 1.3402, -2.1132, -1.6064],
        [-0.7965,  0.7937,  3.0292],
        [-0.2761, -3.3363, -0.3240]])


In [34]:
x.t_()
print(x)

tensor([[ 1.1026,  0.4795,  0.1668],
        [ 1.0084,  0.6145, -0.2714],
        [ 1.3402, -2.1132, -1.6064],
        [-0.7965,  0.7937,  3.0292],
        [-0.2761, -3.3363, -0.3240]])


### [Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html)


Broadcasting automagically expands dimensions by replicating coefficients when this is necessary to perform an operation.

1. If one of the tensors has fewer dimensions than the other, it is reshaped by adding as many dimensions of size 1 as necessary at the front; then
2. for every mismatch, if one of the two tensors has size 1 along that axis, it is expanded along this axis by replicating coefficients.

If the sizes differ along one of the dimensions and neither of them is 1, the operation fails.
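
The two rules can be checked without building any data. The sketch below relies on `torch.broadcast_shapes` (available in recent PyTorch versions, and not used elsewhere in this notebook); the shapes are arbitrary examples.

```python
import torch

# Rule 1: (5,) is treated as (1, 5). Rule 2: size-1 axes are expanded.
s1 = torch.broadcast_shapes((4, 1), (5,))
s2 = torch.broadcast_shapes((2, 1, 3), (4, 1))
print(s1)   # torch.Size([4, 5])
print(s2)   # torch.Size([2, 4, 3])

# A mismatch where neither size is 1 raises an error.
try:
    torch.ones(4, 3) + torch.ones(4, 5)
    failed = False
except RuntimeError as err:
    failed = True
    print("broadcast failed:", err)
```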

In [37]:
A = torch.tensor([[1.], [2.], [3.], [4.]])
print(A.size())
B = torch.tensor([[5., -5., 5., -5., 5.]])
print(B.size())
C = A + B

torch.Size([4, 1])
torch.Size([1, 5])


In [43]:
C

tensor([[ 6., -4.,  6., -4.,  6.],
        [ 7., -3.,  7., -3.,  7.],
        [ 8., -2.,  8., -2.,  8.],
        [ 9., -1.,  9., -1.,  9.]])

The original (column-)vector
\begin{eqnarray*}
A = \left( \begin{array}{c}
1\\
2\\
3\\
4\\
\end{array}\right)
\end{eqnarray*}
is transformed into the matrix
\begin{eqnarray*}
A = \left( \begin{array}{ccccc}
1&1&1&1&1\\
2&2&2&2&2\\
3&3&3&3&3\\
4&4&4&4&4
\end{array}\right)
\end{eqnarray*}
and the original (row-)vector
\begin{eqnarray*}
B = (5,-5,5,-5,5)
\end{eqnarray*}
is transformed into the matrix
\begin{eqnarray*}
B = \left( \begin{array}{ccccc}
5&-5&5&-5&5\\
5&-5&5&-5&5\\
5&-5&5&-5&5\\
5&-5&5&-5&5
\end{array}\right)
\end{eqnarray*}
so that summing these matrices gives:
\begin{eqnarray*}
A+B = \left( \begin{array}{ccccc}
6&-4&6&-4&6\\
7&-3&7&-3&7\\
8&-2&8&-2&8\\
9&-1&9&-1&9
\end{array}\right)
\end{eqnarray*}
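
The intermediate matrices above can be made explicit with `expand`, which returns a view with the replicated shape without copying data. This is only a sketch for inspection (the names `A_big` and `B_big` are illustrative); broadcasting does this implicitly.

```python
import torch

A = torch.tensor([[1.], [2.], [3.], [4.]])     # shape (4, 1)
B = torch.tensor([[5., -5., 5., -5., 5.]])     # shape (1, 5)

A_big = A.expand(4, 5)   # every column repeats A
B_big = B.expand(4, 5)   # every row repeats B
print(A_big)
print(B_big)

# Adding the expanded views matches the broadcast sum.
same = torch.equal(A_big + B_big, A + B)
print(same)              # True
```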

### Cuda


In [44]:
torch.cuda.is_available()

True

In [45]:
device = torch.device('cuda')   # run on GPU
# device = torch.device('cpu')  # uncomment this line to run on CPU instead

In [46]:
x.device

device(type='cpu')

In [47]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z,z.type())
    print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!

tensor([[ 2.1026,  1.4795,  1.1668],
        [ 2.0084,  1.6145,  0.7286],
        [ 2.3402, -1.1132, -0.6064],
        [ 0.2035,  1.7937,  4.0292],
        [ 0.7239, -2.3363,  0.6760]], device='cuda:0') torch.cuda.FloatTensor
tensor([[ 2.1026,  1.4795,  1.1668],
        [ 2.0084,  1.6145,  0.7286],
        [ 2.3402, -1.1132, -0.6064],
        [ 0.2035,  1.7937,  4.0292],
        [ 0.7239, -2.3363,  0.6760]], dtype=torch.float64)


In [48]:
x = torch.randn(1)
x = x.to(device)

In [49]:
x.device

device(type='cuda', index=0)

In [50]:
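# detach() returns a tensor that shares storage with x but is cut off from the autograd graph
x = x.detach()
print(x)
print(x.item())
# .cpu() is needed before .numpy() when the tensor lives on the GPU
print(x.cpu().numpy())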

tensor([-0.8401], device='cuda:0')
-0.8400593400001526
[-0.84005934]
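
A common way to keep such code runnable on machines without a GPU is to choose the device once and pass it everywhere. The following is a minimal sketch of that pattern, not the only possible one; `z_np` is an illustrative name.

```python
import torch

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(5, 3)
y = torch.ones_like(x, device=device)   # created directly on ``device``
x = x.to(device)                        # returns x itself if already there
z = x + y                               # both operands are on the same device

# NumPy only sees CPU memory, so move the result back first.
z_np = z.cpu().numpy()
print(z.device, z_np.shape)
```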


# Autograd: automatic differentiation


When executing tensor operations, PyTorch can automatically construct on-the-fly the graph of operations to compute the gradient of any quantity with respect to any tensor involved.

To be more concrete, we introduce the following example: we consider parameters $w\in \mathbb{R}$ and $b\in \mathbb{R}$ with the corresponding function:
\begin{eqnarray*}
\ell = \left(\exp(wx+b) - y^* \right)^2
\end{eqnarray*}

Our goal here is to compute the following partial derivatives:
\begin{eqnarray*}
\frac{\partial \ell}{\partial w}\mbox{ and }\frac{\partial \ell}{\partial b}.
\end{eqnarray*}

The reason for doing this will become clear when you solve the practicals for this lesson!

You can decompose this function as a composition of basic operations. This is called the forward pass on the graph of operations.
![backprop1](https://dataflowr.github.io/notebooks/Module2/img/backprop1.png)
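
Written out step by step, the decomposition could look like the sketch below. The numeric values and the intermediate names `u`, `v` and `d` are assumptions made for illustration; the figure may label the nodes differently.

```python
import math

# Hypothetical values, chosen only for illustration.
w, b, x, ystar = 0.5, 2.0, 0.5, 1.0

u = w * x          # multiply
v = u + b          # add the bias
y = math.exp(v)    # exponential
d = y - ystar      # residual
loss = d ** 2      # square

print(loss)
```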

Let's say we start with our model in `numpy`:

In [51]:
w = np.array([0.5])
b = np.array([2])
xx = np.array([0.5])#np.arange(0,1.5,.5)

Transform these into tensors:

In [52]:
xx_t = torch.from_numpy(xx)
w_t = torch.from_numpy(w)
b_t = torch.from_numpy(b)


A `tensor` has a Boolean field `requires_grad`, set to `False` by default, which states if PyTorch should build the graph of operations so that gradients with respect to it can be computed.

In [53]:
w_t.requires_grad

False

We want to take the derivative with respect to $w$, so we change this value:

In [54]:
w_t.requires_grad_(True)

tensor([0.5000], dtype=torch.float64, requires_grad=True)

We want to do the same thing for $b$ but the following line will produce an error!

In [58]:
b_t.requires_grad_(True)
b_t

RuntimeError: only Tensors of floating point dtype can require gradients

Reading the error message should allow you to correct the mistake! `np.array([2])` has an integer dtype, so `b_t` must first be converted to a floating-point type.

In [59]:
dtype = torch.float64

In [60]:
b_t = b_t.type(dtype)

In [61]:
b_t.requires_grad_(True)

tensor([2.], dtype=torch.float64, requires_grad=True)
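
As an alternative sketch, tensors can be created directly with a floating-point dtype and `requires_grad=True`, which avoids the conversion step. The variable names below are local to this example and `int_ok` is purely illustrative.

```python
import torch

# Float tensors can request gradients at construction time.
w = torch.tensor([0.5], dtype=torch.float64, requires_grad=True)
b = torch.tensor([2.0], dtype=torch.float64, requires_grad=True)
x = torch.tensor([0.5], dtype=torch.float64)   # data: no gradient needed

y = torch.exp(w * x + b)
print(w.requires_grad, x.requires_grad)   # True False
print(y.requires_grad)                    # True: inherited from w and b
print(y.grad_fn is not None)              # True: y is part of the graph

# Integer tensors are rejected, as in the error above.
try:
    torch.tensor([2], requires_grad=True)
    int_ok = True
except RuntimeError:
    int_ok = False
print(int_ok)
```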



We now compute the function:

In [62]:
def fun(x,ystar):
    y = torch.exp(w_t*x+b_t)
    print(y)
    return torch.sum((y-ystar)**2)

ystar_t = torch.randn_like(xx_t)
l_t = fun(xx_t,ystar_t)

tensor([9.4877], dtype=torch.float64, grad_fn=<ExpBackward0>)


In [63]:
l_t

tensor(76.5741, dtype=torch.float64, grad_fn=<SumBackward0>)

In [64]:
l_t.requires_grad

True

Once the computation is finished, i.e. after the *forward pass*, you can call ```.backward()``` to have all the gradients computed automatically.

In [65]:
print(w_t.grad)

None


In [66]:
l_t.backward()

In [67]:
print(w_t.grad)
print(b_t.grad)

tensor([83.0240], dtype=torch.float64)
tensor([166.0480], dtype=torch.float64)




Let's try to understand these numbers...

![backprop2](https://dataflowr.github.io/notebooks/Module2/img/backprop2.png)

In [68]:
yy_t = torch.exp(w_t*xx_t+b_t)
print(torch.sum(2*(yy_t-ystar_t)*yy_t*xx_t))
print(torch.sum(2*(yy_t-ystar_t)*yy_t))

tensor(83.0240, dtype=torch.float64, grad_fn=<SumBackward0>)
tensor(166.0480, dtype=torch.float64, grad_fn=<SumBackward0>)
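
The two expressions evaluated in this cell come from the chain rule. As a sketch, writing $y = \exp(wx+b)$ for the intermediate value (the same quantity as `y` in `fun`), we have:
\begin{eqnarray*}
\frac{\partial \ell}{\partial w} &=& 2\left(y - y^*\right)\, y\, x,\\
\frac{\partial \ell}{\partial b} &=& 2\left(y - y^*\right)\, y.
\end{eqnarray*}
Since $x = 0.5$ here, the derivative with respect to $w$ is exactly half the derivative with respect to $b$, which matches the values $83.0240$ and $166.0480$ printed above.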


`tensor.backward()` accumulates the gradients in the `grad` fields of tensors.

In [69]:
l_t = fun(xx_t,ystar_t)
l_t.backward()

tensor([9.4877], dtype=torch.float64, grad_fn=<ExpBackward0>)


In [70]:
print(w_t.grad)
print(b_t.grad)

tensor([166.0480], dtype=torch.float64)
tensor([332.0960], dtype=torch.float64)


By default, `backward` frees the computational graph once it has been used, so calling it a second time on the same result raises an error:

In [71]:
l_t.backward()

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

In [72]:
# Manually zero the gradients
w_t.grad.zero_()
b_t.grad.zero_()
l_t = fun(xx_t,ystar_t)
l_t.backward(retain_graph=True)  # keep the graph so backward can run again
l_t.backward()                   # second pass: gradients accumulate
print(w_t.grad)
print(b_t.grad)

tensor([9.4877], dtype=torch.float64, grad_fn=<ExpBackward0>)
tensor([166.0480], dtype=torch.float64)
tensor([332.0960], dtype=torch.float64)


The gradients must be set to zero manually; otherwise they accumulate across successive _.backward()_ calls.
This accumulating behavior is useful in particular for computing the gradient of a loss summed over several “mini-batches,” or the gradient of a sum of losses.
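
The sketch below illustrates the mini-batch case on the same model. The data values, the helper `loss_fn` and the batch size are hypothetical choices made for this example: gradients accumulated over two mini-batches are compared with the gradient of the loss on the full data.

```python
import torch

w = torch.tensor([0.5], dtype=torch.float64, requires_grad=True)
b = torch.tensor([2.0], dtype=torch.float64, requires_grad=True)

def loss_fn(x, ystar):
    # same loss as above, summed over the entries of x
    return torch.sum((torch.exp(w * x + b) - ystar) ** 2)

# Hypothetical data, split into two mini-batches of size 2.
x_all = torch.tensor([0.0, 0.5, 1.0, 1.5], dtype=torch.float64)
y_all = torch.tensor([1.0, 2.0, 3.0, 4.0], dtype=torch.float64)

# Accumulate gradients batch by batch.
for xb, yb in zip(x_all.split(2), y_all.split(2)):
    loss_fn(xb, yb).backward()
accumulated = w.grad.clone()

# Reset, then differentiate the loss on the full data at once.
w.grad.zero_()
b.grad.zero_()
loss_fn(x_all, y_all).backward()

match = torch.allclose(accumulated, w.grad)
print(match)   # True
```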