![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)

# <center> Deep Learning Methods </center>
## <center> Lecture 5 - PyTorch</center>
### <center> PyTorch Basics </center>

Colab users should use a GPU runtime:<br>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/DeepLearningMethods/05_PyTorch/MainPyTorchBasics.ipynb)

### Useful PyTorch tutorials:
https://pytorch.org/tutorials/

In [1]:
#-- Wide screen:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
#-- Imports:
import numpy as np

#-- torch:
import torch

### Tensors
Tensors are similar to NumPy’s ndarrays, except that they can also be used on GPUs.

In [3]:
mX = torch.ones(2, 3)
mX

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [4]:
#-- type, shape & size:
mX.type(), mX.shape, mX.size()

('torch.FloatTensor', torch.Size([2, 3]), torch.Size([2, 3]))

In [5]:
#-- dtype = int:
mX = torch.ones(2, 3, dtype=int)
mX, mX.type()

(tensor([[1, 1, 1],
         [1, 1, 1]]),
 'torch.LongTensor')

In [6]:
#-- To NumPy:
vX = torch.linspace(1, 3, 15)
vX

tensor([1.0000, 1.1429, 1.2857, 1.4286, 1.5714, 1.7143, 1.8571, 2.0000, 2.1429,
        2.2857, 2.4286, 2.5714, 2.7143, 2.8571, 3.0000])

In [7]:
vX.numpy()

array([1.       , 1.1428572, 1.2857143, 1.4285715, 1.5714285, 1.7142857,
       1.8571429, 2.       , 2.142857 , 2.2857141, 2.4285715, 2.5714285,
       2.7142856, 2.857143 , 3.       ], dtype=float32)
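
A small aside on the conversion above: for CPU tensors, `.numpy()` and `torch.from_numpy` do not copy the data, so the tensor and the array share the same memory. The sketch below illustrates this; the variable names (`vT`, `vA`, `vB`, `vC`, `vU`) are illustrative and not part of the lecture code.

```python
import numpy as np
import torch

vT = torch.zeros(3)          #-- CPU tensor (float32)
vA = vT.numpy()              #-- NumPy array sharing memory with vT
vC = vT.clone().numpy()      #-- clone() first to get an independent copy

vT[0] = 5.0                  #-- in-place change on the tensor
print(vA)                    #-- the change is visible through vA
print(vC)                    #-- the copy is unaffected

vB = np.arange(3.0)          #-- float64 array
vU = torch.from_numpy(vB)    #-- tensor sharing memory with vB (keeps float64)
vB[1] = 10.0                 #-- in-place change on the array
print(vU, vU.dtype)
```

If an independent copy is needed, calling `.clone()` (or `np.copy`) before converting avoids surprises from the shared memory.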

#### Notice the difference between the following two cells:
(be careful when initializing a tensor with whole numbers)

In [8]:
vX = torch.tensor([1, 2, 5, 6.])
vX, vX.type()

(tensor([1., 2., 5., 6.]), 'torch.FloatTensor')

In [9]:
vX = torch.tensor([1, 2, 5, 6])
vX, vX.type()

(tensor([1, 2, 5, 6]), 'torch.LongTensor')
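
One place where the integer/float distinction matters is autograd: only floating point (and complex) tensors can require gradients. The following sketch, with illustrative variable names, shows the error and two ways to avoid relying on type inference.

```python
import torch

vInt   = torch.tensor([1, 2, 5, 6])                       #-- inferred as int64
vFloat = torch.tensor([1, 2, 5, 6], dtype=torch.float32)  #-- explicit dtype

#-- Integer tensors cannot track gradients:
try:
    torch.tensor([1, 2, 5, 6], requires_grad=True)
    bRaised = False
except RuntimeError as oErr:
    bRaised = True
    print(oErr)

vConverted = vInt.float()   #-- convert an existing tensor to float32
print(vInt.dtype, vFloat.dtype, vConverted.dtype)
```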

###  Autograd
Consider the following function:
$$y=f\left(x\right)=x^{2}+3$$
$$\implies f'\left(x\right)=2x$$

In [10]:
f = lambda x: x**2 + 3
x = torch.tensor(7., requires_grad=True)
y = f(x)

In [11]:
#-- compute gradients:
y.backward()

In [12]:
#-- check that f'(7) = 14:
x.grad

tensor(14.)
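
As a sanity check of the result above, the autograd derivative can be compared with a numerical (central difference) approximation. This is only a sketch: the step size `eps` and the use of `float64` are choices made here for illustration.

```python
import torch

f = lambda x: x**2 + 3

#-- float64 keeps the finite-difference error small
x = torch.tensor(7., dtype=torch.float64, requires_grad=True)
f(x).backward()

eps = 1e-6
with torch.no_grad():  #-- no graph needed for the numerical estimate
    numGrad = (f(x + eps) - f(x - eps)) / (2 * eps)

print(x.grad.item(), numGrad.item())
```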

Consider now:
$$y=f\left(\boldsymbol{x},\boldsymbol{w}\right)=\boldsymbol{w}^{T}\boldsymbol{x}$$
$$\implies\nabla_{\boldsymbol{x}}f=\boldsymbol{w}$$
and
$$\implies\nabla_{\boldsymbol{w}}f=\boldsymbol{x}$$

In [13]:
f  = lambda vX, vW: vW[None,:] @ vX[:,None]
vX = torch.tensor([1., 3], requires_grad=True)
vW = torch.tensor([2., 5], requires_grad=True)
y  = f(vX, vW)

In [14]:
#-- compute gradients:
y.backward()

In [15]:
#-- check that:
#--     1. ∇xf = w
#--     2. ∇wf = x
print(vX.grad)
print(vW.grad)

tensor([2., 5.])
tensor([1., 3.])


### Why do we need to zero the gradients?
Let us repeat the code from the cells above:

In [16]:
y = f(vX, vW)
y.backward()

In [17]:
print(vX.grad)
print(vW.grad)

tensor([ 4., 10.])
tensor([2., 6.])


Note that the results are different now: each gradient has doubled.  
This is because `backward()` accumulates (adds) into `.grad`, and we did not reset the gradients between the two calls.  
Let us try again:

In [18]:
vX.grad.zero_()
vW.grad.zero_()

y = f(vX, vW)
y.backward()

In [19]:
print(vX.grad)
print(vW.grad)

tensor([2., 5.])
tensor([1., 3.])
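
In a loop, the reset is usually done once per iteration, before `backward()`. The sketch below shows two common ways: setting `.grad` to `None` by hand, and calling `zero_grad()` on an optimizer that holds the tensors. The optimizer and its learning rate are placeholders chosen for illustration; no update step is taken here.

```python
import torch

vX = torch.tensor([1., 3], requires_grad=True)
vW = torch.tensor([2., 5], requires_grad=True)

for ii in range(3):
    #-- reset before each backward pass (None means "no gradient yet")
    vX.grad = None
    vW.grad = None
    y = vW @ vX
    y.backward()

#-- the same idea with an optimizer that tracks the tensors:
oOptim = torch.optim.SGD([vX, vW], lr=0.1)
oOptim.zero_grad()
y = vW @ vX
y.backward()

print(vX.grad)
print(vW.grad)
```

Without the reset, the three passes of the loop would have left three times the gradient in `.grad`.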


Consider:
$$\boldsymbol{y}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}\boldsymbol{x}$$
where
* $\boldsymbol{x}\in\mathbb{R}^{\text{in}}$
* $\boldsymbol{W}\in\mathbb{R}^{\text{out}\times\text{in}}$
* $\boldsymbol{y}\in\mathbb{R}^{\text{out}}$

The Jacobian $\boldsymbol{J}_{f}\left(\boldsymbol{x}\right)$:
$$\implies\nabla_{\boldsymbol{x}}f\left(\boldsymbol{x}\right)\left[\boldsymbol{h}\right]=\boldsymbol{W}\boldsymbol{h}$$
$$\implies\boldsymbol{J}_{f}\left(\boldsymbol{x}\right)=\boldsymbol{W}$$

Since $\boldsymbol{y}\in\mathbb{R}^{\text{out}}$ is a vector, we define:
$$g\left(\boldsymbol{x}\right)=\boldsymbol{h}^{T}\boldsymbol{y}\in\mathbb{R}$$
Then,
$$\boldsymbol{y}\text{.backward}\left(\boldsymbol{h}\right)=\nabla_{\boldsymbol{x}}g\left(\boldsymbol{x}\right)=\boldsymbol{h}^{T}\boldsymbol{W}$$
In general: $\boldsymbol{y}\text{.backward}\left(\boldsymbol{h}\right)=\boldsymbol{h}^{T}\boldsymbol{J}_{f}$

In [20]:
f  = lambda vX, mW: mW @ vX
vX = torch.tensor([1., 4, 5], requires_grad=True)
mW = torch.tensor([[1., 4, 0],
                   [2,  1, 7]])

vY = f(vX, mW)
vH = torch.ones(2)
vY.backward(vH)

In [21]:
#-- check that:
#--     x.grad = h^T @ ∇xf
print(vX.grad)
print(vH @ mW)

tensor([3., 5., 7.])
tensor([3., 5., 7.])
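
The relation $\boldsymbol{h}^{T}\boldsymbol{J}_{f}$ can also be checked against an explicitly computed Jacobian. The sketch below uses `torch.autograd.functional.jacobian` for that; the vector `vH` is an arbitrary choice made here (different from the all-ones vector above) so that the weighting is visible.

```python
import torch
from torch.autograd.functional import jacobian

mW = torch.tensor([[1., 4, 0],
                   [2,  1, 7]])
vX = torch.tensor([1., 4, 5], requires_grad=True)
vH = torch.tensor([1., -2.])            #-- arbitrary weights (illustration only)

#-- full Jacobian of f(x) = W x, shape (out, in) = (2, 3)
mJ = jacobian(lambda v: mW @ v, vX.detach())

vY = mW @ vX
vY.backward(vH)                         #-- stores h^T J in vX.grad

print(mJ)
print(vX.grad)
print(vH @ mJ)
```

For a linear map the Jacobian is simply `mW`, so `vX.grad` and `vH @ mJ` should agree.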


Some imports:

In [22]:
if 'google.colab' in str(get_ipython()):
    !pip install torchinfo

In [23]:
import torch.nn            as nn
import torch.nn.functional as F
import torchinfo

### Sequential model:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\sigma\left(\boldsymbol{W}_{2}\sigma\left(\boldsymbol{W}_{1}\boldsymbol{x}+\boldsymbol{b}_{1}\right)+\boldsymbol{b}_{2}\right)$$

In [24]:
oModel = nn.Sequential(
    nn.Identity(),                            #-- just for the summary
    nn.Linear(100, 50),            nn.ReLU(), #-- z1 = σ(W1 * x + b1)
    nn.Linear(50,  25),            nn.ReLU(), #-- z2 = σ(W2 * z1 + b2)
    nn.Linear(25,  10, bias=False)            #-- y  = W3 * z2
)

torchinfo.summary(oModel, (16, 100))

Layer (type:depth-idx)                   Output Shape              Param #
Sequential                               --                        --
├─Identity: 1-1                          [16, 100]                 --
├─Linear: 1-2                            [16, 50]                  5,050
├─ReLU: 1-3                              [16, 50]                  --
├─Linear: 1-4                            [16, 25]                  1,275
├─ReLU: 1-5                              [16, 25]                  --
├─Linear: 1-6                            [16, 10]                  250
Total params: 6,575
Trainable params: 6,575
Non-trainable params: 0
Total mult-adds (M): 0.11
Input size (MB): 0.01
Forward/backward pass size (MB): 0.01
Params size (MB): 0.03
Estimated Total Size (MB): 0.04
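
The parameter counts in the summary can be reproduced without `torchinfo` by iterating over the model parameters: a `Linear(in, out)` layer holds `in * out` weights plus `out` biases (if `bias=True`). A minimal sketch, rebuilding the same model without the `nn.Identity()` layer (which has no parameters):

```python
import torch.nn as nn

oModel = nn.Sequential(
    nn.Linear(100, 50), nn.ReLU(),   #-- 100 * 50 + 50 = 5,050
    nn.Linear(50,  25), nn.ReLU(),   #--  50 * 25 + 25 = 1,275
    nn.Linear(25,  10, bias=False),  #--  25 * 10      =   250
)

for name, p in oModel.named_parameters():
    print(f'{name:10s} {tuple(p.shape)!s:12s} {p.numel():6d}')

nParams = sum(p.numel() for p in oModel.parameters())
print(f'Total params: {nParams}')
```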

### Custom module (layer)
Consider the following architecture:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\left(\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)\right)$$

<center> <img src="https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DeepLearningMethods/07_PyTorch/ParallelNetwork.png?raw=true" alt="a" style="width: 500px;"/> </center>

Since $\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)$ and $\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)$ are computed in parallel from the same input,  
this model cannot be implemented with `nn.Sequential` using only the built-in layers.


#### Option I: Define a new (custom) layer:
$$\text{NewLayer}\left(\boldsymbol{x}\right)=\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)$$
and then use `nn.Sequential`

In [25]:
class NewLayer(nn.Module):
    def __init__(self, dIn, dOut):
        super().__init__() #-- always do this
        self.Linear1 = nn.Linear(dIn, dOut, bias=False)
        self.Linear2 = nn.Linear(dIn, dOut, bias=False)

    def forward(self, mX):
        mZ1 = torch.sigmoid(self.Linear1(mX)) #-- σ1(W1 * x)
        mZ2 = torch.tanh   (self.Linear2(mX)) #-- σ2(W2 * x)
        return mZ1 + mZ2

In [26]:
oModel = nn.Sequential(
    NewLayer(100, 50),            #-- z = σ1(W1 * x) + σ2(W2 * x)
    nn.Linear(50, 10, bias=False) #-- y = W3 * z
)
torchinfo.summary(oModel, (16, 100))

Layer (type:depth-idx)                   Output Shape              Param #
Sequential                               --                        --
├─NewLayer: 1-1                          [16, 50]                  --
│    └─Linear: 2-1                       [16, 50]                  5,000
│    └─Linear: 2-2                       [16, 50]                  5,000
├─Linear: 1-2                            [16, 10]                  500
Total params: 10,500
Trainable params: 10,500
Non-trainable params: 0
Total mult-adds (M): 0.17
Input size (MB): 0.01
Forward/backward pass size (MB): 0.01
Params size (MB): 0.04
Estimated Total Size (MB): 0.06

#### Option II: Manually define the architecture:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\left(\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)\right)$$

In [27]:
class ParallelModel(nn.Module):
    def __init__(self, dIn, dHidden, dOut):
        super().__init__() #-- always do this
        self.Linear1 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear2 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear3 = nn.Linear(dHidden, dOut,    bias=False)

    def forward(self, mX):
        mZ1 = torch.sigmoid(self.Linear1(mX)) #-- σ1(W1 * x)
        mZ2 = torch.tanh   (self.Linear2(mX)) #-- σ2(W2 * x)
        mY  = self.Linear3 (mZ1 + mZ2)        #-- W3 * (σ1(W1 * x) + σ2(W2 * x))
        return mY

In [28]:
oModel = ParallelModel(100, 50, 10)

torchinfo.summary(oModel, (16, 100))

Layer (type:depth-idx)                   Output Shape              Param #
ParallelModel                            --                        --
├─Linear: 1-1                            [16, 50]                  5,000
├─Linear: 1-2                            [16, 50]                  5,000
├─Linear: 1-3                            [16, 10]                  500
Total params: 10,500
Trainable params: 10,500
Non-trainable params: 0
Total mult-adds (M): 0.17
Input size (MB): 0.01
Forward/backward pass size (MB): 0.01
Params size (MB): 0.04
Estimated Total Size (MB): 0.06
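
Both options describe the same function, which can be checked by copying the weights of one model into the other and comparing the outputs. The sketch below repeats the two class definitions so that it runs on its own; the names `oModelA` and `oModelB` are introduced here for illustration.

```python
import torch
import torch.nn as nn

class NewLayer(nn.Module):
    def __init__(self, dIn, dOut):
        super().__init__()
        self.Linear1 = nn.Linear(dIn, dOut, bias=False)
        self.Linear2 = nn.Linear(dIn, dOut, bias=False)

    def forward(self, mX):
        return torch.sigmoid(self.Linear1(mX)) + torch.tanh(self.Linear2(mX))

class ParallelModel(nn.Module):
    def __init__(self, dIn, dHidden, dOut):
        super().__init__()
        self.Linear1 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear2 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear3 = nn.Linear(dHidden, dOut,    bias=False)

    def forward(self, mX):
        mZ = torch.sigmoid(self.Linear1(mX)) + torch.tanh(self.Linear2(mX))
        return self.Linear3(mZ)

oModelA = nn.Sequential(NewLayer(100, 50), nn.Linear(50, 10, bias=False))  #-- Option I
oModelB = ParallelModel(100, 50, 10)                                       #-- Option II

#-- copy the weights of Option I into Option II:
with torch.no_grad():
    oModelB.Linear1.weight.copy_(oModelA[0].Linear1.weight)
    oModelB.Linear2.weight.copy_(oModelA[0].Linear2.weight)
    oModelB.Linear3.weight.copy_(oModelA[1].weight)

mX  = torch.randn(16, 100)
mYA = oModelA(mX)
mYB = oModelB(mX)
print(torch.allclose(mYA, mYB))
```

Option I keeps the parallel block reusable inside other `nn.Sequential` models, while Option II keeps the whole architecture in one place; which is preferable depends on how the block will be reused.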

### Using GPU
To move data to the GPU we use `.cuda()` or `.to('cuda')`

In [29]:
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
DEVICE

device(type='cuda', index=0)

In [30]:
mX = torch.randn(2, 4).cuda()
mX

tensor([[ 0.6915,  1.8768,  0.7592, -1.1731],
        [ 1.1161,  0.0019, -0.6576, -0.1635]], device='cuda:0')

In [31]:
mX = torch.randn(2, 4).to(DEVICE)
mX

tensor([[ 0.8333, -0.2878,  0.1535, -0.7411],
        [ 1.2242, -1.3785,  0.1699, -0.6099]], device='cuda:0')

In [32]:
#-- Generate data directly on the GPU
mX = torch.randn(2, 4, device=DEVICE)
mX

tensor([[ 0.0039, -0.7947, -2.1475, -0.0750],
        [-0.3380,  1.5361,  1.4912,  0.7783]], device='cuda:0')

In [33]:
#-- Move the parameters of the model to the GPU:
oModel.to(DEVICE)
next(oModel.parameters()).device

device(type='cuda', index=0)

Back to the CPU:

In [34]:
mX = mX.cpu()
#-- or:
mX = mX.to('cpu')

mX.device

device(type='cpu')
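
Putting the pieces together, a forward pass requires the model parameters and the input to be on the same device. The sketch below is one device-agnostic way to write this so that it also runs on a machine without a GPU; the small `nn.Linear` model is a stand-in chosen for illustration.

```python
import torch
import torch.nn as nn

DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

oModel = nn.Linear(4, 2).to(DEVICE)   #-- module.to() moves the parameters in place
mX     = torch.randn(8, 4)            #-- created on the CPU
mX     = mX.to(DEVICE)                #-- tensor.to() is not in place: reassign the result

mY     = oModel(mX)                   #-- input and parameters are on the same device
mOut   = mY.detach().cpu().numpy()    #-- move back to the CPU before converting to NumPy

print(mY.device, mOut.shape)
```

Note the asymmetry: `oModel.to(DEVICE)` modifies the module itself, whereas `mX.to(DEVICE)` returns the moved tensor and leaves `mX` unchanged unless it is reassigned.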

### CPU vs GPU:

In [35]:
import time

mX1 = torch.randn(10000, 10000)
mX2 = torch.randn(10000, 10000)

startTime = time.time()
mX3       = mX1 @ mX2
endTime   = time.time()

print(f'CPU time: {endTime - startTime}')

CPU time: 8.724896907806396


In [36]:
mX1 = torch.randn(10000, 10000, device='cuda')
mX2 = torch.randn(10000, 10000, device='cuda')

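#-- NOTE: CUDA kernels are launched asynchronously. Without torch.cuda.synchronize()
#--       before reading the clock, this mostly measures the launch time of the 10 products.
startTime = time.time()
for _ in range(10):
    mX3 = mX1 @ mX2
endTime = time.time()

print(f'GPU time: {endTime - startTime}')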

GPU time: 0.0320281982421875
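
Because CUDA operations return before the computation finishes, a fairer comparison waits for the GPU with `torch.cuda.synchronize()` before reading the clock, and reports the time per product. The helper below (`TimeMatMul`, a name introduced here) is a minimal sketch of that idea; the matrix size and the number of repetitions are arbitrary and kept small so it runs quickly on a CPU as well.

```python
import time
import torch

def TimeMatMul(device, n=512, nReps=10):
    """Average wall-clock seconds of one n-by-n matrix product on `device`."""
    mA = torch.randn(n, n, device=device)
    mB = torch.randn(n, n, device=device)
    mA @ mB                                  #-- warm-up run (not timed)
    if device.type == 'cuda':
        torch.cuda.synchronize()             #-- wait for pending kernels
    startTime = time.perf_counter()
    for _ in range(nReps):
        mC = mA @ mB
    if device.type == 'cuda':
        torch.cuda.synchronize()             #-- wait until all products are done
    return (time.perf_counter() - startTime) / nReps

cpuTime = TimeMatMul(torch.device('cpu'))
print(f'CPU time per product: {cpuTime:.6f} s')

if torch.cuda.is_available():
    gpuTime = TimeMatMul(torch.device('cuda'))
    print(f'GPU time per product: {gpuTime:.6f} s')
```

The warm-up run keeps one-time initialization costs out of the measurement.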


###  The End