<p style="align: center;"><img align=center src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500 height=450/></p>

<h3 style="text-align: center;"><b>"Deep Learning". Advanced track</b></h3>

<h2 style="text-align: center;"><b>Seminar 6. PyTorch library basics</b></h2>


# PyTorch basics: syntax, torch.cuda and torch.autograd

<p style="align: center;"><img src="https://upload.wikimedia.org/wikipedia/commons/9/96/Pytorch_logo.png" width=400 height=100></p>

Hi! In this notebook we will cover the basics of the **PyTorch deep learning framework**. 

<h3 style="text-align: center;"><b>Intro</b></h3>

**Frameworks** are code libraries with their own internal structure and pipelines.

There are many deep learning frameworks nowadays (02/2019). They differ mainly in how computations are organized internally. For example, in **[Caffe](http://caffe.berkeleyvision.org/)** and **[Caffe2](https://caffe2.ai/)** you assemble the model from "ready-made blocks" (just like $LEGO^{TM}$ :). In **[TensorFlow](https://www.tensorflow.org/)** and **[Theano](http://deeplearning.net/software/theano/)** you first declare the computation graph, then compile it and use it for inference/training (`tf.Session()`). By the way, TensorFlow (since v1.10) also has [Eager Execution](https://www.tensorflow.org/guide/eager), which can be handy for fast prototyping and debugging. **[Keras](https://keras.io/)** is a very popular and useful DL framework that lets you build networks quickly and has many in-demand features.

<p style="align: center;"><img src="https://habrastorage.org/web/e3e/c3e/b78/e3ec3eb78d714a7993a6b922911c0866.png" width=500 height=500></p>  
<p style="text-align: center;"><i>Image credit: https://habr.com/post/334380/</i></p>

We will use PyTorch because it is actively developed and supported by the community and [Facebook AI Research](https://research.fb.com/category/facebook-ai-research/).

<h3 style="text-align: center;"><b>Installation</b></h3>

Detailed installation instructions are available on the [official PyTorch website](https://pytorch.org/).

## Syntax

In [3]:
import torch

Some facts about PyTorch:  
- dynamic computation graph
- handy `torch.nn` and `torchvision` modules for fast neural network prototyping
- even faster than TensorFlow on some tasks
- makes it easy to use a GPU

At its core, PyTorch provides two main features:

- An n-dimensional Tensor, similar to a NumPy array but able to run on GPUs
- Automatic differentiation for building and training neural networks

If PyTorch were a formula, it would be:  

$$PyTorch = NumPy + CUDA + Autograd$$

(CUDA - [wiki](https://en.wikipedia.org/wiki/CUDA))
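The formula above can be read as three features that work together. Here is a minimal sketch of each part; the variable names are chosen only for illustration, and the code falls back to the CPU when no GPU is available:

```python
import torch

# "NumPy" part: array-style math on tensors
x = torch.tensor([1.0, 2.0, 3.0])
squares = x ** 2                      # elementwise, like NumPy

# "CUDA" part: the same code runs on a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x_dev = x.to(device)

# "Autograd" part: record operations and differentiate them
w = torch.tensor([0.5, -1.0, 2.0], device=device, requires_grad=True)
loss = (w * x_dev).sum()              # loss = w . x
loss.backward()                       # d(loss)/dw = x

print(squares)
print(w.grad.cpu())
```

Each of these three parts is covered in more detail in the sections of this notebook.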

Let's see how we can use PyTorch to operate with vectors and tensors.  

Recall that **a tensor** is a multidimensional array, e.g.:  

`x = np.array([1,2,3])` -- a vector = a tensor with 1 dimension (to be more precise: `(3,)`)  
`y = np.array([[1, 2, 3], [4, 5, 6]])` -- a matrix = a tensor with 2 dimensions (`(2, 3)` in this case)  
`z = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],  
               [[1, 2, 3], [4, 5, 6], [7, 8, 9]],  
               [[1, 2, 3], [4, 5, 6], [7, 8, 9]]])` -- "a cube" = a tensor with 3 dimensions (`(3, 3, 3)` in this case)

A real-world example of a 3-dimensional tensor is **an image**: it has 3 dimensions, `height`, `width` and `channel depth` (3 for color images, 1 for greyscale). You can think of it as a parallelepiped filled with real numbers.
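As a small illustration, the sketch below builds a fake image tensor (the sizes are arbitrary) and rearranges its axes. Image files are usually stored as height x width x channels, while PyTorch convolution layers expect the channels first, so `permute` is a common first step:

```python
import torch

height, width, channels = 4, 6, 3

# A fake RGB image stored as height x width x channels
image_hwc = torch.zeros(height, width, channels)
image_hwc[..., 0] = 1.0               # fill the first (red) channel with ones

# Rearrange the axes to channels x height x width
image_chw = image_hwc.permute(2, 0, 1)

print(image_hwc.shape)   # torch.Size([4, 6, 3])
print(image_chw.shape)   # torch.Size([3, 4, 6])
```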

In PyTorch we will use `torch.Tensor` (`FloatTensor`, `IntTensor`, `ByteTensor`) for all the computations.

All tensor types:

In [3]:
torch.HalfTensor      # 16-bit, floating point
torch.FloatTensor     # 32-bit, floating point
torch.DoubleTensor    # 64-bit, floating point

torch.ShortTensor     # 16-bit, integer, signed
torch.IntTensor       # 32-bit, integer, signed
torch.LongTensor      # 64-bit, integer, signed

torch.CharTensor      # 8-bit, integer, signed
torch.ByteTensor      # 8-bit, integer, unsigned

torch.ByteTensor

We will use only `torch.FloatTensor()` and `torch.IntTensor()`. 
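The typed constructors above have a more general counterpart: `torch.tensor(data, dtype=...)`. The sketch below (illustrative values only) shows how the element type is inferred or set explicitly, and how to inspect it through `.dtype`:

```python
import torch

# dtype is inferred from the data ...
ints = torch.tensor([1, 2, 3])            # Python ints   -> torch.int64
floats = torch.tensor([1.0, 2.0, 3.0])    # Python floats -> torch.float32 (the default)

# ... or set explicitly
small_ints = torch.tensor([1, 2, 3], dtype=torch.int32)

# The typed constructor used in this notebook gives float32 as well
legacy = torch.FloatTensor([1, 2])

print(ints.dtype, floats.dtype, small_ints.dtype, legacy.dtype)
```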

Let's get started!

* Creating the tensor:

In [4]:
a = torch.FloatTensor([1, 2])
a


tensor([1., 2.])

In [5]:
a.shape

torch.Size([2])

In [11]:
b = torch.FloatTensor([[1,2,3], [4,5,6]])
b

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [7]:
b.shape

torch.Size([2, 3])

In [8]:
x = torch.FloatTensor(2,3,4)

In [8]:
x

tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
         [0.0000e+00, 1.0194e-38, 7.7052e+31, 7.2148e+22],
         [2.5226e-18, 6.4825e-10, 1.0072e-11, 7.7206e-10]],

        [[6.4475e-10, 4.0515e-11, 1.0432e-08, 2.9571e-18],
         [6.7333e+22, 1.7591e+22, 1.7184e+25, 4.3222e+27],
         [6.1972e-04, 7.2443e+22, 1.7728e+28, 7.0367e+22]]])

In [9]:
x = torch.FloatTensor(100)
x

tensor([1.7253e+19, 1.8888e+31, 5.3383e+31, 4.2964e+24, 2.9388e+29, 4.7421e+16,
        5.8270e-10, 1.4583e-19, 4.4721e+21, 4.4330e+27, 1.3848e-14, 1.4585e-19,
        2.4068e-12, 2.5204e-12, 4.9115e-14, 1.3563e-19, 1.7753e+28, 1.3458e-14,
        1.7260e+25, 5.9423e-02, 1.8888e+31, 4.7414e+16, 1.2690e+31, 5.6025e-02,
        1.4585e-19, 2.4039e-12, 1.3563e-19, 1.3563e-19, 1.3563e-19, 2.9965e+32,
        6.0039e+31, 7.5338e+28, 9.8700e+17, 1.8370e+25, 1.0514e-14, 9.3233e-09,
        8.2892e-33, 1.3563e-19, 1.3563e-19, 1.2686e+31, 5.6025e-02, 1.4585e-19,
        1.4583e-19, 6.1140e-02, 7.7455e+26, 3.0357e+32, 7.5589e+28, 5.2839e-11,
        4.4653e+30, 4.1209e+21, 8.1580e-33, 1.4754e-19, 4.7429e+30, 1.3818e+31,
        1.9203e+31, 3.0266e+24, 1.9051e+31, 2.1687e+29, 4.7429e+30, 5.3977e+28,
        1.2943e+22, 2.2136e-10, 1.0397e+21, 1.3896e+31, 1.9203e+31, 7.3581e+31,
        1.7968e+35, 1.2035e+30, 1.2728e+25, 7.7783e+31, 2.4176e-12, 7.7781e+31,
        1.8515e+28, 9.1041e-12, 6.2609e+

In [10]:
x = torch.IntTensor(45, 57, 14, 2)
x.shape

torch.Size([45, 57, 14, 2])

**Note:** if you create a `torch.Tensor` by passing only its sizes to the constructor, the memory is not initialized, so the tensor holds whatever values were there before ("random trash numbers"):

In [11]:
x = torch.IntTensor(3, 2, 4)
x

tensor([[[743989232,       437, 743989232,       437],
         [743919280,       437, 743919280,       437]],

        [[752667440,       437, 752650160,       437],
         [752672304,       437, 752672304,       437]],

        [[734276448,       437, 734276448,       437],
         [734276448,       437, 734276448,       437]]], dtype=torch.int32)

Here are several ways to get a tensor filled with zeros:

In [3]:
x1 = torch.FloatTensor(3, 2, 4)
x1.zero_()
x2 = torch.zeros(3, 2, 4)
x3 = torch.zeros_like(x1)

assert torch.allclose(x1, x2) and torch.allclose(x1, x3)
x1

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.]]])
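Besides `torch.zeros`, there are factory functions for other common fill patterns. A short sketch (the shape and the fill value 7 are arbitrary):

```python
import torch

shape = (3, 2, 4)

uninit = torch.empty(shape)          # uninitialized memory, like torch.FloatTensor(3, 2, 4)
zeros = torch.zeros(shape)
ones = torch.ones(shape)
sevens = torch.full(shape, 7.0)      # every element equals 7
steps = torch.arange(0, 10, 2)       # tensor([0, 2, 4, 6, 8])

print(uninit.shape, zeros.sum().item(), ones.sum().item(), sevens[0, 0, 0].item())
print(steps)
```

Only `torch.empty` leaves the values undefined; the other functions always return the same contents.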

Initialization from random distributions:

In [9]:
x = torch.randn((2,3))                # samples from Normal(0, 1), shape (2, 3)
x

tensor([[ 0.3119, -0.4487,  1.2334],
        [-1.0515, -0.0313, -1.7783]])

In [14]:
x.random_(0, 10)                      # discrete uniform over {0, ..., 9}
x.uniform_(0, 1)                      # U[0, 1]
x.normal_(mean=0, std=1)              # Normal with the given mean and std
x.bernoulli_(p=0.5)                   # Bernoulli with parameter p

tensor([[0., 1., 1.],
        [0., 1., 1.]])
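Random initialization gives different numbers on every run. If you need reproducible results, you can fix the seed of the generator. A minimal sketch (the seed value 0 is arbitrary):

```python
import torch

torch.manual_seed(0)                 # fix the random number generator state
first = torch.randn(2, 3)

torch.manual_seed(0)                 # same seed -> same sequence of numbers
second = torch.randn(2, 3)

print(torch.equal(first, second))    # True

# In-place samplers overwrite an existing tensor
x = torch.empty(2, 3)
x.uniform_(0, 1)                     # values between 0 and 1
print(x)
```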

## Numpy -> Torch

Most NumPy functions have a counterpart in torch.

https://github.com/torch/torch7/wiki/Torch-for-Numpy-users

`np.reshape()` corresponds to the tensor method `.view()`:

In [12]:
b, b.stride()

(tensor([[1., 2., 3.],
         [4., 5., 6.]]),
 (3, 1))

In [15]:
b.view(3, 2), b.view(2, 3).stride()  

(tensor([[1., 2.],
         [3., 4.],
         [5., 6.]]),
 (3, 1))

In [18]:
b.view(2, 3)

tensor([[1., 2., 3.],
        [4., 5., 6.]])

**Note:** `.view()` returns a new tensor that shares its data with the old one; the shape of the old tensor remains unchanged

In [17]:
b.view(-1)

tensor([1., 2., 3., 4., 5., 6.])

In [18]:
b

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [19]:
b.T.stride(), b.is_contiguous(), b.T.is_contiguous()

((1, 3), True, False)

In [20]:
b.reshape(-1) # returns a view when possible, otherwise a contiguous copy

tensor([1., 2., 3., 4., 5., 6.])

In [21]:
b

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [22]:
b.T

tensor([[1., 4.],
        [2., 5.],
        [3., 6.]])

In [23]:
b.T.stride()

(1, 3)
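The strides explain when `.view()` works and when it does not. The sketch below (it redefines a small `b` to be self-contained) shows that a view shares storage with the original tensor, that a transposed tensor cannot be flattened with `.view()`, and two ways around it:

```python
import torch

b = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# A view shares storage with the original tensor
flat = b.view(-1)
flat[0] = 100.0
print(b[0, 0])                        # the change is visible through b

# The transpose is not contiguous, so .view() cannot flatten it
bt = b.t()
try:
    bt.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape() copies when needed; .contiguous().view() does the same explicitly
flat_t = bt.reshape(-1)
flat_t2 = bt.contiguous().view(-1)
print(view_failed, flat_t)
```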

* Change a tensor type:

In [24]:
a = torch.FloatTensor([1.5, 3.2, -7])

In [25]:
a.type_as(torch.IntTensor())

tensor([ 1,  3, -7], dtype=torch.int32)

In [26]:
a.to(torch.int32)

tensor([ 1,  3, -7], dtype=torch.int32)

In [27]:
a.type_as(torch.ByteTensor())

tensor([  1,   3, 249], dtype=torch.uint8)

In [28]:
a.to(torch.uint8)

tensor([  1,   3, 249], dtype=torch.uint8)

**Note:** `.type_as()` creates a new tensor, the old one remains unchanged

In [29]:
a

tensor([ 1.5000,  3.2000, -7.0000])
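Note that converting floats to integers drops the fractional part instead of rounding. If rounding is what you want, do it explicitly before the conversion. A short sketch with illustrative values:

```python
import torch

a = torch.tensor([1.5, 3.2, -7.9])

truncated = a.to(torch.int32)           # fractional part is dropped (toward zero)
rounded = a.round().to(torch.int32)     # round first, then convert
as_double = a.double()                  # shorthand for a.to(torch.float64)

print(truncated)                        # tensor([ 1,  3, -7], dtype=torch.int32)
print(rounded)                          # tensor([ 2,  3, -8], dtype=torch.int32)
print(a.dtype, as_double.dtype)
```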

* Indexing is just like in `NumPy`:

In [30]:
a = torch.FloatTensor([[100, 20, 35], [15, 163, 534], [52, 90, 66]])
a

tensor([[100.,  20.,  35.],
        [ 15., 163., 534.],
        [ 52.,  90.,  66.]])

In [31]:
a[0, 0]

tensor(100.)

In [32]:
a[0:2, 1]

tensor([ 20., 163.])

**Arithmetic and Boolean operations** and their analogues:  

| Operator | Analogue |
|:-:|:-:|
|`+`| `torch.add()` |
|`-`| `torch.sub()` |
|`*`| `torch.mul()` |
|`/`| `torch.div()` |

* Addition:

In [19]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
b = torch.FloatTensor([[-1, -2, -3], [-10, -20, -30], [100, 200, 300]])

In [20]:
a + b

tensor([[  0.,   0.,   0.],
        [  0.,   0.,   0.],
        [200., 400., 600.]])

In [21]:
a.add(b)

tensor([[  0.,   0.,   0.],
        [  0.,   0.,   0.],
        [200., 400., 600.]])

In [36]:
b = -a
b

tensor([[  -1.,   -2.,   -3.],
        [ -10.,  -20.,  -30.],
        [-100., -200., -300.]])

In [37]:
a + b

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

* Subtraction:

In [38]:
a - b

tensor([[  2.,   4.,   6.],
        [ 20.,  40.,  60.],
        [200., 400., 600.]])

In [39]:
a.sub(b) # copy

tensor([[  2.,   4.,   6.],
        [ 20.,  40.,  60.],
        [200., 400., 600.]])

In [40]:
a.sub_(b) # inplace

tensor([[  2.,   4.,   6.],
        [ 20.,  40.,  60.],
        [200., 400., 600.]])

* Multiplication (elementwise):

In [41]:
a * b

tensor([[-2.0000e+00, -8.0000e+00, -1.8000e+01],
        [-2.0000e+02, -8.0000e+02, -1.8000e+03],
        [-2.0000e+04, -8.0000e+04, -1.8000e+05]])

In [42]:
a.mul(b)

tensor([[-2.0000e+00, -8.0000e+00, -1.8000e+01],
        [-2.0000e+02, -8.0000e+02, -1.8000e+03],
        [-2.0000e+04, -8.0000e+04, -1.8000e+05]])

* Division (elementwise):

In [25]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
b = torch.FloatTensor([[-1, -2, -3], [-10, -20, -30], [100, 200, 300]])

In [26]:
a / b

tensor([[-1., -1., -1.],
        [-1., -1., -1.],
        [ 1.,  1.,  1.]])

In [27]:
a.div(b)

tensor([[-1., -1., -1.],
        [-1., -1., -1.],
        [ 1.,  1.,  1.]])

**Note:** all these operations create new tensors, the old tensors remain unchanged (the in-place variants with a trailing underscore, such as `sub_()`, are the exception)

In [28]:
a

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

In [29]:
b

tensor([[ -1.,  -2.,  -3.],
        [-10., -20., -30.],
        [100., 200., 300.]])
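The difference between the regular and the in-place versions of an operation is easy to check. A minimal sketch with small illustrative tensors:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

c = a.add(b)        # out-of-place: returns a new tensor, a is untouched
print(a)            # tensor([1., 2., 3.])

a.add_(b)           # in-place (trailing underscore): a itself now holds the sum
print(a)            # tensor([11., 22., 33.])

# Operators also broadcast a scalar over every element
d = b / 10
print(d)            # tensor([1., 2., 3.])
```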

* Comparison operators:

In [30]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
b = torch.FloatTensor([[-1, -2, -3], [-10, -20, -30], [100, 200, 300]])

In [31]:
a == b

tensor([[False, False, False],
        [False, False, False],
        [ True,  True,  True]])

In [32]:
a != b

tensor([[ True,  True,  True],
        [ True,  True,  True],
        [False, False, False]])

In [33]:
a < b

tensor([[False, False, False],
        [False, False, False],
        [False, False, False]])

In [38]:
a > b

tensor([[ True,  True,  True],
        [ True,  True,  True],
        [False, False, False]])

In [36]:
a[a > b].reshape((2,3))

tensor([[ 1.,  2.,  3.],
        [10., 20., 30.]])

* Using boolean mask indexing:

In [16]:
a[a > b]

tensor([ 1.,  2.,  3., 10., 20., 30.])

In [54]:
b[a == b]

tensor([100., 200., 300.])
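A boolean mask can also be used to count, replace or overwrite elements. The sketch below uses an arbitrary threshold of 15 for illustration:

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]])

mask = a > 15                          # boolean tensor with the same shape as a
print(mask.sum().item())               # number of True entries: 5

# Out-of-place: take 15 where the mask is True, the original value elsewhere
clipped = torch.where(mask, torch.tensor(15.0), a)

# In-place assignment through the mask (done on a copy to keep a intact)
a_copy = a.clone()
a_copy[mask] = 0.0

print(clipped)
print(a_copy)
```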

Elementwise application of the **universal functions**:

In [39]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])

In [40]:
a.sin()

tensor([[ 0.8415,  0.9093,  0.1411],
        [-0.5440,  0.9129, -0.9880],
        [-0.5064, -0.8733, -0.9998]])

In [41]:
torch.sin(a)

tensor([[ 0.8415,  0.9093,  0.1411],
        [-0.5440,  0.9129, -0.9880],
        [-0.5064, -0.8733, -0.9998]])

In [42]:
a.tan()

tensor([[ 1.5574, -2.1850, -0.1425],
        [ 0.6484,  2.2372, -6.4053],
        [-0.5872, -1.7925, 45.2447]])

In [59]:
a.exp()

tensor([[2.7183e+00, 7.3891e+00, 2.0086e+01],
        [2.2026e+04, 4.8517e+08, 1.0686e+13],
        [       inf,        inf,        inf]])

In [60]:
a.log()

tensor([[0.0000, 0.6931, 1.0986],
        [2.3026, 2.9957, 3.4012],
        [4.6052, 5.2983, 5.7038]])

In [61]:
b = -a
b

tensor([[  -1.,   -2.,   -3.],
        [ -10.,  -20.,  -30.],
        [-100., -200., -300.]])

In [62]:
b.abs()

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

* The sum, mean, max, min:

In [49]:
a.sum(dim=1)

tensor([  6.,  60., 600.])

In [64]:
a.mean()

tensor(74.)

Along axis:

In [65]:
a

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

In [66]:
a.sum(dim=0)

tensor([111., 222., 333.])

In [67]:
a.sum(1)

tensor([  6.,  60., 600.])

In [68]:
a.max()

tensor(300.)

In [55]:
a.max(0)

torch.return_types.max(
values=tensor([100., 200., 300.]),
indices=tensor([2, 2, 2]))

In [70]:
a.min()

tensor(1.)

In [71]:
a.min(0)

torch.return_types.min(
values=tensor([1., 2., 3.]),
indices=tensor([0, 0, 0]))

In [58]:
c = torch.FloatTensor([[1, 2, 3], [10, 200, 30], [100, 20, 300]])
c

tensor([[  1.,   2.,   3.],
        [ 10., 200.,  30.],
        [100.,  20., 300.]])

In [57]:
c.max(0)

torch.return_types.max(
values=tensor([100., 200., 300.]),
indices=tensor([2, 1, 2]))

**Note:** the second tensor returned by `.max(dim)` and `.min(dim)` contains the indices of the max/min elements along this axis. E.g. above, `a.min(0)` returned the values `(1, 2, 3)`, which are the minimum elements along axis 0 (one per column), and their indices along axis 0 are `(0, 0, 0)`.
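The returned pair can be unpacked directly, and there are shortcuts when only the indices are needed. A small sketch that reuses the values of `c` from above:

```python
import torch

c = torch.tensor([[1.0, 2.0, 3.0], [10.0, 200.0, 30.0], [100.0, 20.0, 300.0]])

values, indices = c.max(dim=0)         # the result unpacks like a tuple
print(values)                          # tensor([100., 200., 300.])
print(indices)                         # tensor([2, 1, 2])

row_argmax = c.argmax(dim=1)           # only the indices, one per row
col_sums = c.sum(dim=0, keepdim=True)  # keepdim keeps the reduced axis with size 1
print(row_argmax, col_sums.shape)
```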

**Matrix operations**:

* Transpose a tensor:

In [59]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
a

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

In [60]:
a.t()

tensor([[  1.,  10., 100.],
        [  2.,  20., 200.],
        [  3.,  30., 300.]])

This is not an in-place operation either:

In [61]:
a

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

* Dot product of vectors:

In [5]:
a = torch.FloatTensor([1, 2, 3, 4, 5, 6])
b = torch.FloatTensor([-1, -2, -4, -6, -8, -10])

In [6]:
a.dot(b)

tensor(-141.)

In [7]:
a.shape, b.shape

(torch.Size([6]), torch.Size([6]))

In [8]:
a @ b

tensor(-141.)

In [9]:
type(a)

torch.Tensor

In [10]:
type(b)

torch.Tensor

In [12]:
type(a@b)

torch.Tensor

* Matrix product:

In [14]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
b = torch.FloatTensor([[-1, -2, -3], [-10, -20, -30], [100, 200, 300]])

In [17]:
a.mm(b)

tensor([[  279.,   558.,   837.],
        [ 2790.,  5580.,  8370.],
        [27900., 55800., 83700.]])

In [18]:
a @ b

tensor([[  279.,   558.,   837.],
        [ 2790.,  5580.,  8370.],
        [27900., 55800., 83700.]])

Both operands remain unchanged:

In [85]:
a

tensor([[  1.,   2.,   3.],
        [ 10.,  20.,  30.],
        [100., 200., 300.]])

In [86]:
b

tensor([[ -1.,  -2.,  -3.],
        [-10., -20., -30.],
        [100., 200., 300.]])

In [71]:
a = torch.FloatTensor([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
b = torch.FloatTensor([[-1], [-10], [100]])

In [72]:
print(a.shape, b.shape)

torch.Size([3, 3]) torch.Size([3, 1])


In [73]:
a @ b

tensor([[  279.],
        [ 2790.],
        [27900.]])

In [75]:
(a @ b).shape

torch.Size([3, 1])

If we flatten the tensor `b` into a 1-D array (`b.view(-1)`), the matrix is multiplied by it as by a column vector, and the result is also 1-D:

In [86]:
b

tensor([[ -1.],
        [-10.],
        [100.]])

In [91]:
b.view(-1)

tensor([ -1., -10., 100.])

In [92]:
a @ b.view(-1)

tensor([  279.,  2790., 27900.])

In [93]:
a.mv(b.view(-1))

tensor([  279.,  2790., 27900.])

In [94]:
y = torch.Tensor(2, 3, 4, 5)
z = torch.Tensor(2, 3, 5, 6)
(y @ z).shape

torch.Size([2, 3, 4, 6])
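The `@` operator chooses the kind of product from the number of dimensions of its operands. The sketch below summarizes the cases seen above, using tensors of ones with arbitrary sizes:

```python
import torch

v = torch.ones(3)            # 1-D
m = torch.ones(4, 3)         # 2-D
batch = torch.ones(2, 4, 3)  # a batch of two 4x3 matrices

print((v @ v).shape)                 # 1-D @ 1-D -> scalar (dot product)
print((m @ v).shape)                 # 2-D @ 1-D -> 1-D (matrix-vector product)
print((m @ m.t()).shape)             # 2-D @ 2-D -> 2-D (matrix product)
print((batch @ m.t()).shape)         # 3-D @ 2-D -> the matrix is broadcast over the batch
```

The methods `.dot()`, `.mv()` and `.mm()` are stricter: each of them accepts only the dimensions of its own case.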

**From NumPy to PyTorch conversion**:

In [2]:
import numpy as np

a = np.random.rand(3, 3)
a

array([[0.41119881, 0.4268529 , 0.84194002],
       [0.59647231, 0.49509554, 0.74978182],
       [0.72376793, 0.79701025, 0.96919461]])

In [90]:
b = torch.from_numpy(a)
b

tensor([[0.9460, 0.8570, 0.6760],
        [0.7849, 0.3468, 0.0738],
        [0.0721, 0.3871, 0.1364]], dtype=torch.float64)

**NOTE!** `a` and `b` share the same data storage, so changes in one of them lead to changes in the other:

In [91]:
b -= b
b

tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], dtype=torch.float64)

In [92]:
a

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

**From PyTorch to NumPy conversion:**

In [99]:
a = torch.FloatTensor(2, 3, 4)
a

tensor([[[1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08],
         [1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08],
         [1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08]],

        [[1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08],
         [1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08],
         [1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08]]])

In [100]:
type(a)

torch.Tensor

In [101]:
x = a.numpy()
x

array([[[1.e-08, 1.e-08, 1.e-08, 1.e-08],
        [1.e-08, 1.e-08, 1.e-08, 1.e-08],
        [1.e-08, 1.e-08, 1.e-08, 1.e-08]],

       [[1.e-08, 1.e-08, 1.e-08, 1.e-08],
        [1.e-08, 1.e-08, 1.e-08, 1.e-08],
        [1.e-08, 1.e-08, 1.e-08, 1.e-08]]], dtype=float32)

In [102]:
x.shape

(2, 3, 4)

In [103]:
type(x)

numpy.ndarray

In [104]:
x -= x

In [105]:
a

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])
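If the shared memory is not what you want, make a copy explicitly. A minimal sketch with a small illustrative array:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(arr)    # shares memory with arr
copied = torch.tensor(arr)        # always copies the data

arr[0] = -1.0                     # modify the NumPy array

print(shared)                     # sees the change
print(copied)                     # keeps the original value

# clone() first to get a NumPy array that is independent of the tensor
independent = shared.clone().numpy()
independent[1] = 0.0
print(arr)                        # arr[1] is still 2.0
```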

Let's write `forward_pass(X, w)` for a single neuron with a sigmoid activation ($w_0$ is a part of $w$) using PyTorch:

In [94]:
def forward_pass(X, w):
    return torch.sigmoid(X @ w)

In [95]:
X = torch.FloatTensor([[-5, 5], [2, 3], [1, -1]])
w = torch.FloatTensor([[-0.5], [2.5]])
result = forward_pass(X, w)
print('result: {}'.format(result))

result: tensor([[1.0000],
        [0.9985],
        [0.0474]])


In [96]:
X@w

tensor([[15.0000],
        [ 6.5000],
        [-3.0000]])
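One common way to make $w_0$ a part of $w$ is to prepend a column of ones to `X`. The sketch below assumes this convention; the bias value 3.0 is purely illustrative and does not come from the example above:

```python
import torch

def forward_pass(X, w):
    # single sigmoid neuron: sigma(X @ w)
    return torch.sigmoid(X @ w)

X = torch.tensor([[-5.0, 5.0], [2.0, 3.0], [1.0, -1.0]])

# Prepend a column of ones so that the first weight plays the role of the bias w_0
ones = torch.ones(X.shape[0], 1)
X_aug = torch.cat([ones, X], dim=1)            # shape (3, 3)

w = torch.tensor([[3.0], [-0.5], [2.5]])       # w_0 = 3.0 is an illustrative value
out = forward_pass(X_aug, w)

# The same result with an explicit bias term
out_explicit = torch.sigmoid(X @ w[1:] + w[0])
print(out.shape, torch.allclose(out, out_explicit))
```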

## <h1 style="text-align: center;"><a href="https://ru.wikipedia.org/wiki/CUDA">CUDA</a></h1>

[CUDA documentation](https://docs.nvidia.com/cuda/)

We can use both the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit) for computations in PyTorch. Switching between them is easy, which is one of the most important features of the framework.

In [3]:
x = torch.FloatTensor(1024, 10024).uniform_()
x

tensor([[0.0401, 0.4454, 0.8064,  ..., 0.6840, 0.1050, 0.3472],
        [0.4592, 0.5255, 0.7796,  ..., 0.0501, 0.4926, 0.5331],
        [0.1048, 0.7120, 0.6402,  ..., 0.7543, 0.0650, 0.2821],
        ...,
        [0.4126, 0.8782, 0.8510,  ..., 0.4894, 0.1222, 0.2790],
        [0.5710, 0.5876, 0.6326,  ..., 0.3231, 0.4816, 0.6545],
        [0.7488, 0.5398, 0.7202,  ..., 0.0101, 0.4830, 0.3498]])

In [4]:
x.type()

'torch.FloatTensor'

In [5]:
x.is_cuda

False

Move a tensor to the GPU (this allocates GPU memory):

In [6]:
!nvidia-smi

Sat Jan 02 13:29:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8    14W /  N/A |    161MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [12]:
x = x.cuda()
x

tensor([[0.0401, 0.4454, 0.8064,  ..., 0.6840, 0.1050, 0.3472],
        [0.4592, 0.5255, 0.7796,  ..., 0.0501, 0.4926, 0.5331],
        [0.1048, 0.7120, 0.6402,  ..., 0.7543, 0.0650, 0.2821],
        ...,
        [0.4126, 0.8782, 0.8510,  ..., 0.4894, 0.1222, 0.2790],
        [0.5710, 0.5876, 0.6326,  ..., 0.3231, 0.4816, 0.6545],
        [0.7488, 0.5398, 0.7202,  ..., 0.0101, 0.4830, 0.3498]],
       device='cuda:0')

In [8]:
!nvidia-smi

Sat Jan 02 13:29:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0    29W /  N/A |    679MiB /  8192MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
x

tensor([[0.0401, 0.4454, 0.8064,  ..., 0.6840, 0.1050, 0.3472],
        [0.4592, 0.5255, 0.7796,  ..., 0.0501, 0.4926, 0.5331],
        [0.1048, 0.7120, 0.6402,  ..., 0.7543, 0.0650, 0.2821],
        ...,
        [0.4126, 0.8782, 0.8510,  ..., 0.4894, 0.1222, 0.2790],
        [0.5710, 0.5876, 0.6326,  ..., 0.3231, 0.4816, 0.6545],
        [0.7488, 0.5398, 0.7202,  ..., 0.0101, 0.4830, 0.3498]],
       device='cuda:0')

In [13]:
x = x.cpu()
!nvidia-smi

torch.cuda.empty_cache()
!nvidia-smi

Sat Jan 02 12:53:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0    30W /  N/A |    681MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [10]:
device = torch.device("cuda:0")
x = x.to(device)
x

tensor([[0.0401, 0.4454, 0.8064,  ..., 0.6840, 0.1050, 0.3472],
        [0.4592, 0.5255, 0.7796,  ..., 0.0501, 0.4926, 0.5331],
        [0.1048, 0.7120, 0.6402,  ..., 0.7543, 0.0650, 0.2821],
        ...,
        [0.4126, 0.8782, 0.8510,  ..., 0.4894, 0.1222, 0.2790],
        [0.5710, 0.5876, 0.6326,  ..., 0.3231, 0.4816, 0.6545],
        [0.7488, 0.5398, 0.7202,  ..., 0.0101, 0.4830, 0.3498]],
       device='cuda:0')

In [11]:
!nvidia-smi

Sat Jan 02 13:31:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 2070   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P0    29W /  N/A |    681MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's multiply two tensors on the GPU and then move the result to the CPU:

In [13]:
a = torch.FloatTensor(10000, 10000).uniform_()
b = torch.FloatTensor(10000, 10000).uniform_()
c = a.cuda().mul(b.cuda()).cpu()

In [125]:
c

tensor([[0.0012, 0.2051, 0.4646,  ..., 0.3617, 0.1091, 0.0315],
        [0.3328, 0.3207, 0.3378,  ..., 0.0379, 0.4017, 0.7091],
        [0.2017, 0.2484, 0.0779,  ..., 0.0412, 0.2483, 0.4548],
        ...,
        [0.4912, 0.3384, 0.3785,  ..., 0.4149, 0.4922, 0.0312],
        [0.2373, 0.1510, 0.4573,  ..., 0.2100, 0.1666, 0.7077],
        [0.0110, 0.2734, 0.2392,  ..., 0.0141, 0.7753, 0.4950]])

In [126]:
a

tensor([[0.0813, 0.8068, 0.7543,  ..., 0.4184, 0.4151, 0.0908],
        [0.8222, 0.4505, 0.4318,  ..., 0.0694, 0.4496, 0.7591],
        [0.4407, 0.3484, 0.6002,  ..., 0.1208, 0.4895, 0.7962],
        ...,
        [0.6698, 0.5983, 0.6350,  ..., 0.9427, 0.5958, 0.0567],
        [0.3812, 0.9589, 0.7442,  ..., 0.3967, 0.9847, 0.7579],
        [0.0163, 0.5849, 0.7876,  ..., 0.2243, 0.8950, 0.6320]])

Tensors placed on the CPU and tensors placed on the GPU cannot be used together in one operation:

In [14]:
a = torch.FloatTensor(10000, 10000).uniform_().cpu()
b = torch.FloatTensor(10000, 10000).uniform_().cuda()

In [15]:
a + b

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
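The usual fix is to move the operands to the same device before combining them. A minimal sketch (small arbitrary sizes) that also runs on a machine without a GPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand(100, 100)                  # created on the CPU
b = torch.rand(100, 100, device=device)   # created directly on the chosen device

# Move a to the device of b before adding
total = a.to(b.device) + b

print(total.device, total.shape)
```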

Example of working with GPU:

In [16]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [17]:
x = torch.FloatTensor(5, 5, 5).uniform_()

# check for CUDA availability (NVIDIA GPU)
if torch.cuda.is_available():
    device = torch.device('cuda')          # CUDA device object
    y = torch.ones_like(x, device=device)  # create a tensor directly on the GPU
    x = x.to(device)                       # or just `.to("cuda")`
    z = x + y
    print(z)
    # `.to` can also change the dtype while moving the tensor
    print(z.to("cpu", torch.double))

tensor([[[1.7103, 1.3067, 1.6329, 1.9796, 1.6858],
         [1.8093, 1.6806, 1.0085, 1.3209, 1.5874],
         [1.4154, 1.0591, 1.9960, 1.9804, 1.6723],
         [1.9225, 1.0853, 1.4865, 1.8066, 1.0401],
         [1.9484, 1.2702, 1.6252, 1.1983, 1.4079]],

        [[1.2002, 1.7652, 1.5089, 1.0096, 1.7345],
         [1.7537, 1.6421, 1.4570, 1.9552, 1.0273],
         [1.0975, 1.0360, 1.8590, 1.0140, 1.9357],
         [1.7492, 1.0682, 1.2421, 1.9253, 1.3472],
         [1.6340, 1.5248, 1.2879, 1.1833, 1.2339]],

        [[1.2398, 1.6800, 1.6322, 1.4584, 1.1835],
         [1.1367, 1.9143, 1.1514, 1.6518, 1.0214],
         [1.4547, 1.7466, 1.4360, 1.8045, 1.0691],
         [1.6835, 1.3066, 1.5733, 1.2349, 1.6105],
         [1.4530, 1.6686, 1.4774, 1.7631, 1.4839]],

        [[1.9421, 1.5182, 1.6417, 1.0364, 1.2409],
         [1.2417, 1.9962, 1.7904, 1.6778, 1.7674],
         [1.5855, 1.3705, 1.6531, 1.5293, 1.2166],
         [1.9382, 1.6901, 1.2070, 1.9993, 1.2234],
         [1.4826, 1.4778,

## AutoGrad

**The chain rule (a.k.a. backpropagation in neural networks)** is used here.

Assume we have $f(w(\theta))$:
$${\frac  {\partial{f}}{\partial{\theta}}}
={\frac  {\partial{f}}{\partial{w}}}\cdot {\frac  {\partial{w}}{\partial{\theta}}}$$


*Additional reading: in the multidimensional case the chain rule is expressed as a composition of derivatives:*
$$
D_\theta(f\circ w) = D_{w(\theta)}(f)\circ D_\theta(w)
$$

Simple example of gradient propagation:

$$y = \sin \left(x_2^2(x_1 + x_2)\right)$$

<img src="https://ars.els-cdn.com/content/image/1-s2.0-S0010465515004099-gr1.jpg" width=700></img>


The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.
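For the function $y = \sin \left(x_2^2(x_1 + x_2)\right)$ from the picture, the result of autograd can be checked against the derivatives obtained by hand with the chain rule. The input values $x_1 = 1$ and $x_2 = 2$ in this sketch are chosen arbitrarily:

```python
import math
import torch

x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

u = x2 ** 2 * (x1 + x2)        # inner function
y = torch.sin(u)               # outer function
y.backward()                   # autograd applies the chain rule

# Derivatives computed by hand
u_val = 2.0 ** 2 * (1.0 + 2.0)                                   # u = 12
dy_dx1 = math.cos(u_val) * 2.0 ** 2                              # cos(u) * x2^2
dy_dx2 = math.cos(u_val) * (2 * 2.0 * (1.0 + 2.0) + 2.0 ** 2)    # cos(u) * (2*x2*(x1+x2) + x2^2)

print(x1.grad.item(), dy_dx1)
print(x2.grad.item(), dy_dx2)
```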

The examples:

In [18]:
dtype = torch.float
device = torch.device("cuda:0")
# device = torch.device("cpu") # Uncomment this to run on CPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 3, 3, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [19]:
y_pred = (x @ w1).clamp(min=0).matmul(w2)
loss = (y_pred - y).pow(2).sum()
# calculate the gradients
loss.backward()

In [20]:
print((y_pred - y).pow(2).sum())

tensor(5871.9170, device='cuda:0', grad_fn=<SumBackward0>)


In [21]:
w1.grad, w2.grad

(tensor([[-1692.5210, -1054.4653,  -868.6688],
         [ 1334.2271,   379.9589,   811.7915],
         [ -348.5174,   816.7480,   277.9403]], device='cuda:0'),
 tensor([[  58.0329, -869.0873,  957.5291, 1056.8748, 1311.5952,  571.3660,
          -336.2128,  442.7061,    2.5595,  426.3328],
         [ -94.3596, -142.9242,  361.0377,  411.0956,  351.2498,  160.6999,
          -108.7632,  132.2290,   42.6296,   10.7221],
         [ -18.8681, -563.3286,  736.5255,  819.4072,  967.5113,  449.6131,
          -251.7136,  337.2563,  -43.5571,  249.8490]], device='cuda:0'))

In [22]:
loss.grad # non-leaf tensor: its .grad is not retained by default (PyTorch emits a warning)


In [23]:
# ask autograd to keep the gradients of the non-leaf tensors y_pred and loss
y_pred = (x @ w1).clamp(min=0).matmul(w2)
y_pred.retain_grad()

loss = (y_pred - y).pow(2).sum()
loss.retain_grad()

loss.backward()

In [24]:
loss.grad

tensor(1., device='cuda:0')

In [25]:
x.grad # doesn't require grad

In [26]:
y.grad # doesn't require grad

**NOTE:** gradients are stored in the `.grad` field of the leaf tensors (variables) with respect to which they were computed. By default they *are not stored* in non-leaf tensors such as `loss`; that is why `retain_grad()` was called above.
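One more detail: `.backward()` adds new gradients to whatever is already stored in `.grad` instead of overwriting it. The cells above call `loss.backward()` twice, so `w1.grad` and `w2.grad` hold the sum of both passes. A minimal sketch with a toy tensor `w` (not the weights from above):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (w ** 2).sum()
loss.backward()
first_grad = w.grad.clone()    # 2 * w
print(first_grad)              # tensor([2., 4.])

loss = (w ** 2).sum()          # build the graph again
loss.backward()
second_grad = w.grad.clone()   # accumulated: 2 * w + 2 * w
print(second_grad)             # tensor([4., 8.])

w.grad.zero_()                 # reset before the next backward pass
print(w.grad)                  # tensor([0., 0.])
```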

In [27]:
w1

tensor([[-2.6005, -0.8509, -1.7464],
        [ 1.2162, -0.3466,  1.1503],
        [-0.9665,  1.0468,  0.1761]], device='cuda:0', requires_grad=True)

In [28]:
with torch.no_grad():
    pass
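A typical use of `torch.no_grad()` is the weight update itself: it should not be recorded by autograd. Below is a sketch of a manual gradient descent loop for the two-layer network from above. It runs on the CPU, and the seed, the learning rate and the number of steps are illustrative assumptions, not tuned values:

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 64, 3, 3, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-5           # illustrative value, not tuned
losses = []
for step in range(100):
    # forward pass and loss, as in the cells above
    y_pred = (x @ w1).clamp(min=0) @ w2
    loss = (y_pred - y).pow(2).sum()
    losses.append(loss.item())
    loss.backward()

    # the update must not become a part of the computation graph
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # gradients accumulate, so reset them for the next step
        w1.grad.zero_()
        w2.grad.zero_()

print(losses[0], losses[-1])
```

In practice this loop is usually written with `torch.optim` and `torch.nn`, but the manual version shows what happens under the hood.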

<h3 style="text-align: center;"><b>Further reading:</b></h3>

*1). Official PyTorch tutorials: https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py*

*2). arXiv article about the deep learning frameworks comparison: https://arxiv.org/pdf/1511.06435.pdf*

*3). Useful repo with different tutorials: https://github.com/yunjey/pytorch-tutorial*

*4). Facebook AI Research (main contributor of PyTorch) website: https://facebook.ai/developers/tools*