![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)

# <center> Machine Learning Methods </center>
## <center> Lecture 28 - PyTorch</center>
### <center> PyTorch Basics</center>

Colab users should use GPU runtime:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethod/28_PyTorch/MainPytorchBasics1.ipynb)

### Useful PyTorch tutorials:
https://pytorch.org/tutorials/

In [None]:
#-- Wide screen:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [None]:
#-- Auto reload:
%load_ext autoreload
%autoreload 2

In [1]:
#-- Imports:
import numpy             as np
import pandas            as pd
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rc('font', **{'size' : 16})

#-- torch:
import torch

### Tensors
Tensors are similar to NumPy's ndarrays,  
but they can also be stored and processed on a GPU, and they support automatic differentiation.

In [2]:
mX = torch.ones(2, 3)
mX

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [3]:
#-- type, shape & size:
mX.type(), mX.shape, mX.size()

('torch.FloatTensor', torch.Size([2, 3]), torch.Size([2, 3]))

In [4]:
#-- dtype = int:
mX = torch.ones(2, 3, dtype=int)
mX, mX.type()

(tensor([[1, 1, 1],
         [1, 1, 1]]),
 'torch.LongTensor')

In [5]:
#-- To NumPy:
vX = torch.linspace(1, 3, 15)
vX

tensor([1.0000, 1.1429, 1.2857, 1.4286, 1.5714, 1.7143, 1.8571, 2.0000, 2.1429,
        2.2857, 2.4286, 2.5714, 2.7143, 2.8571, 3.0000])

In [6]:
vX.numpy()

array([1.       , 1.1428572, 1.2857143, 1.4285715, 1.5714285, 1.7142857,
       1.8571429, 2.       , 2.142857 , 2.2857141, 2.4285715, 2.5714285,
       2.7142856, 2.857143 , 3.       ], dtype=float32)
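
The conversion also works in the other direction. The following is a minimal sketch (the names `vA`, `vT`, `vB` and `vC` are only illustrative) showing that `torch.from_numpy` and `.numpy()` share memory with their source on the CPU, so an in-place change on one side is visible on the other:

```python
import numpy as np
import torch

#-- NumPy -> PyTorch: torch.from_numpy shares memory with the source array
vA = np.arange(4, dtype=np.float32)
vT = torch.from_numpy(vA)

vT[0] = 10.0   #-- in-place write through the tensor...
print(vA)      #-- ...is visible in the NumPy array

#-- PyTorch -> NumPy: .numpy() on a CPU tensor shares memory as well
vB    = vT.numpy()
vB[1] = 20.0
print(vT)

#-- Use .clone() when an independent copy is needed
vC    = vT.clone()
vC[2] = 30.0
print(vT[2])   #-- the original tensor is unchanged
```

When the two objects should not be coupled, copying explicitly (as with `.clone()` above) avoids surprising side effects.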

#### Notice the difference between the following two cells:
(be careful when initializing a tensor with whole numbers: the dtype is inferred from the values)

In [7]:
vX = torch.tensor([1, 2, 5, 6.])
vX, vX.type()

(tensor([1., 2., 5., 6.]), 'torch.FloatTensor')

In [8]:
vX = torch.tensor([1, 2, 5, 6])
vX, vX.type()

(tensor([1, 2, 5, 6]), 'torch.LongTensor')
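
One practical reason to care about the inferred dtype is that only floating point (and complex) tensors can track gradients. The sketch below illustrates this, assuming the default floating point dtype has not been changed; the variable names are only illustrative:

```python
import torch

vInt   = torch.tensor([1, 2, 5, 6])    #-- all integers    -> torch.int64
vFloat = torch.tensor([1, 2, 5, 6.])   #-- one float entry -> torch.float32
print(vInt.dtype, vFloat.dtype)

#-- Integer tensors cannot require gradients:
try:
    torch.tensor([1, 2, 5, 6], requires_grad=True)
    bRaised = False
except RuntimeError as oErr:
    bRaised = True
    print(f'RuntimeError: {oErr}')

#-- Two ways to get a floating point tensor from whole numbers:
vX1 = torch.tensor([1, 2, 5, 6], dtype=torch.float32)
vX2 = vInt.float()
print(vX1.dtype, vX2.dtype)
```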

###  Autograd
Consider the following function:
$$y=f\left(x\right)=x^{2}+3$$
$$\implies f'\left(x\right)=2x$$

In [9]:
f = lambda x: x**2 + 3
x = torch.tensor(7., requires_grad=True)
y = f(x)

In [10]:
#-- compute gradients:
y.backward()

In [11]:
#-- check that f'(7) = 14:
x.grad

tensor(14.)
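
As a sanity check, the autograd result can be compared with a numerical derivative. This is only a sketch: the step size `eps` and the use of `float64` are choices made here to keep the numerical error small.

```python
import torch

f = lambda x: x**2 + 3

#-- Autograd derivative at x = 7
x = torch.tensor(7., dtype=torch.float64, requires_grad=True)
f(x).backward()
dAuto = x.grad.item()

#-- Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
eps  = 1e-4
dNum = (f(7. + eps) - f(7. - eps)) / (2 * eps)

print(dAuto, dNum)  #-- both should be close to f'(7) = 14
```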

Consider now:
$$y=f\left(\boldsymbol{x},\boldsymbol{w}\right)=\boldsymbol{w}^{T}\boldsymbol{x}$$
$$\implies\nabla_{\boldsymbol{x}}f=\boldsymbol{w}$$
and
$$\implies\nabla_{\boldsymbol{w}}f=\boldsymbol{x}$$

In [12]:
f  = lambda vX, vW: vW[None,:] @ vX[:,None]
vX = torch.tensor([1., 3], requires_grad=True)
vW = torch.tensor([2., 5], requires_grad=True)
y  = f(vX, vW)

In [13]:
#-- compute gradients:
y.backward()

In [14]:
#-- check that:
#--     1. ∇xf = w
#--     2. ∇wf = x
print(vX.grad)
print(vW.grad)

tensor([2., 5.])
tensor([1., 3.])
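
An alternative to `backward()` is `torch.autograd.grad`, which returns the gradients directly instead of accumulating them in `.grad`. A minimal sketch, using `torch.dot` so that the output is a 0-dim scalar (the names `vGradX` and `vGradW` are only illustrative):

```python
import torch

vX = torch.tensor([1., 3], requires_grad=True)
vW = torch.tensor([2., 5], requires_grad=True)
y  = torch.dot(vW, vX)   #-- w^T x as a 0-dim (scalar) tensor

#-- Returns one gradient per input, without writing to .grad
vGradX, vGradW = torch.autograd.grad(y, (vX, vW))
print(vGradX)    #-- equals w
print(vGradW)    #-- equals x
print(vX.grad)   #-- None: nothing was accumulated
```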


### Why do we need to zero the gradients?
Let us repeat the code from the cells above:

In [21]:
y = f(vX, vW)
y.backward()

In [22]:
print(vX.grad)
print(vW.grad)

tensor([10., 25.])
tensor([ 5., 15.])


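Note that the results are different now.  
This is because `backward()` adds the new gradients to `.grad` instead of overwriting it, and we did not reset the gradients.  
(The values above are five times the original gradients, which is consistent with `backward()` having been called five times in total.)  
Let us reset the gradients and try again: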

In [25]:
vX.grad.data.zero_()
vW.grad.data.zero_()

y = f(vX, vW)
y.backward()

In [26]:
print(vX.grad)
print(vW.grad)

tensor([2., 5.])
tensor([1., 3.])
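
The accumulation can be seen in isolation with the scalar function $f\left(x\right)=x^{2}+3$. This is a small sketch; the variable `gradAccumulated` is only illustrative:

```python
import torch

f = lambda x: x**2 + 3
x = torch.tensor(7., requires_grad=True)

#-- Each backward() call ADDS f'(7) = 14 to x.grad
for _ in range(3):
    f(x).backward()
print(x.grad)   #-- 3 * 14

gradAccumulated = x.grad.item()

#-- Reset, then a single backward() gives the plain derivative again
x.grad.zero_()
f(x).backward()
print(x.grad)
```

In a training loop the same reset is usually done for all parameters at once, for example with the `zero_grad()` method of an optimizer or of an `nn.Module`.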


Consider:
$$\boldsymbol{y}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}\boldsymbol{x}$$
$$\implies\nabla_{\boldsymbol{x}}f\left[\boldsymbol{h}\right]=\boldsymbol{W}\boldsymbol{h}$$
$$\implies\nabla_{\boldsymbol{x}}f=\boldsymbol{W}$$

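Since $\boldsymbol{y}$ is a vector, `backward` requires an argument $\boldsymbol{h}$ with the same shape as $\boldsymbol{y}$. Applying:
$$\boldsymbol{y}\text{.backward}\left(\boldsymbol{h}\right)$$
accumulates the vector-Jacobian product $\boldsymbol{h}^{T}\nabla_{\boldsymbol{x}}f$ in `x.grad`  
(here $\boldsymbol{h}^{T}\boldsymbol{W}$).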

In [27]:
f  = lambda mW, vX: mW @ vX
vX = torch.tensor([1., 4], requires_grad=True)
mW = torch.tensor([[1., 4],
                   [2,  1]], requires_grad=True)

vY = f(mW, vX)
vH = torch.ones(2)
vY.backward(vH)

In [28]:
#-- check that:
#--     x.grad = h^T @ ∇xf
print(vX.grad)
print(vH @ mW)  #-- vH is 1-D, so no transpose is needed

tensor([3., 5.])
tensor([3., 5.], grad_fn=<SqueezeBackward3>)
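
To inspect the full Jacobian rather than a single vector-Jacobian product, one option is `torch.autograd.functional.jacobian`. The sketch below reuses the values from the cell above and checks that the Jacobian of $\boldsymbol{W}\boldsymbol{x}$ with respect to $\boldsymbol{x}$ is $\boldsymbol{W}$ (the name `mJ` is only illustrative):

```python
import torch

mW = torch.tensor([[1., 4],
                   [2,  1]])
vX = torch.tensor([1., 4])
vH = torch.ones(2)

#-- Full Jacobian of y = W x with respect to x (shape: 2 x 2)
mJ = torch.autograd.functional.jacobian(lambda vX: mW @ vX, vX)
print(mJ)        #-- equals W
print(vH @ mJ)   #-- h^T J, the quantity that backward(h) accumulates in x.grad
```

Computing the full Jacobian costs roughly one backward pass per output entry, which is why `backward` works with vector-Jacobian products instead.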


Some imports:

In [29]:
import torch.nn            as nn
import torch.nn.functional as F
import torchsummary

### Sequential model:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\sigma\left(\boldsymbol{W}_{2}\sigma\left(\boldsymbol{W}_{1}\boldsymbol{x}+\boldsymbol{b}_{1}\right)+\boldsymbol{b}_{2}\right)$$

In [30]:
oModel = nn.Sequential(
    nn.Sequential(), #-- just for the summary
    nn.Linear(100, 50), nn.ReLU(), #-- z1 = σ(W1 * x + b1)
    nn.Linear(50,  25), nn.ReLU(), #-- z2 = σ(W2 * z1 + b2)
    nn.Linear(25,  10, bias=False) #-- y  = W3 * z2
)

torchsummary.summary(oModel, (100,)); print()

Layer (type:depth-idx)                   Output Shape              Param #
├─Sequential: 1-1                        [-1, 100]                 --
├─Linear: 1-2                            [-1, 50]                  5,050
├─ReLU: 1-3                              [-1, 50]                  --
├─Linear: 1-4                            [-1, 25]                  1,275
├─ReLU: 1-5                              [-1, 25]                  --
├─Linear: 1-6                            [-1, 10]                  250
Total params: 6,575
Trainable params: 6,575
Non-trainable params: 0
Total mult-adds (M): 0.01
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.03
Estimated Total Size (MB): 0.03
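
The parameter counts in the summary can be reproduced by hand: a `nn.Linear(dIn, dOut)` layer holds `dIn * dOut` weights plus `dOut` biases (when `bias=True`). A minimal sketch, with the empty `nn.Sequential()` placeholder omitted and a batch size of 8 chosen arbitrarily:

```python
import torch
import torch.nn as nn

oModel = nn.Sequential(
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50,  25), nn.ReLU(),
    nn.Linear(25,  10, bias=False),
)

#-- Parameters per tensor: weights are (out x in), biases are (out,)
for name, mP in oModel.named_parameters():
    print(f'{name:10s} {tuple(mP.shape)} -> {mP.numel()}')

nParams = sum(mP.numel() for mP in oModel.parameters())
print(f'Total: {nParams}')   #-- 5,050 + 1,275 + 250

#-- Forward pass on a batch of 8 samples
mX = torch.randn(8, 100)
mY = oModel(mX)
print(mY.shape)
```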



### Custom module (layer)
Consider the following architecture:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\left(\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)\right)$$

<center> <img src="https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DeepLearningMethods/07_PyTorch/ParallelNetwork.png?raw=true" alt="a" style="width: 500px;"/> </center>

Since $\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)$ and $\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)$ are computed in parallel,  
this model cannot be implemented using only built-in layers in `nn.Sequential`.


#### Option I: Define a new (custom) layer:
$$\text{NewLayer}\left(\boldsymbol{x}\right)=\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)$$
and then use `nn.Sequential`

In [31]:
class NewLayer(nn.Module):
    def __init__(self, dIn, dOut):
        super(NewLayer, self).__init__() #-- always do this
        self.Linear1 = nn.Linear(dIn, dOut, bias=False)
        self.Linear2 = nn.Linear(dIn, dOut, bias=False)

    def forward(self, mX):
        mZ1 = torch.sigmoid(self.Linear1(mX)) #-- σ1(W1 * x)
        mZ2 = torch.tanh(self.Linear2(mX))    #-- σ2(W2 * x)
        return mZ1 + mZ2

In [32]:
oModel = nn.Sequential(
    NewLayer(100, 50),            #-- z = σ1(W1 * x) + σ2(W2 * x)
    nn.Linear(50, 10, bias=False) #-- y = W3 * z
)
torchsummary.summary(oModel, (100,)); print()

Layer (type:depth-idx)                   Output Shape              Param #
├─NewLayer: 1-1                          [-1, 50]                  --
|    └─Linear: 2-1                       [-1, 50]                  5,000
|    └─Linear: 2-2                       [-1, 50]                  5,000
├─Linear: 1-2                            [-1, 10]                  500
Total params: 10,500
Trainable params: 10,500
Non-trainable params: 0
Total mult-adds (M): 0.02
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.04
Estimated Total Size (MB): 0.04



#### Option II: Manually define the architecture:
$$\hat{\boldsymbol{y}}=f\left(\boldsymbol{x}\right)=\boldsymbol{W}_{3}\left(\sigma_{1}\left(\boldsymbol{W}_{1}\boldsymbol{x}\right)+\sigma_{2}\left(\boldsymbol{W}_{2}\boldsymbol{x}\right)\right)$$

In [33]:
class ParallelModel(nn.Module):
    def __init__(self, dIn, dHidden, dOut):
        super(ParallelModel, self).__init__() #-- always do this
        self.Linear1 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear2 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear3 = nn.Linear(dHidden, dOut,    bias=False)

    def forward(self, mX):
        mZ1 = torch.sigmoid(self.Linear1(mX)) #-- σ1(W1 * x)
        mZ2 = torch.tanh(self.Linear2(mX))    #-- σ2(W2 * x)
        mY  = self.Linear3(mZ1 + mZ2)         #-- W3 * (σ1(W1 * x) + σ2(W2 * x))
        return mY

In [34]:
oModel = ParallelModel(100, 50, 10)

torchsummary.summary(oModel, (100,)); print()

Layer (type:depth-idx)                   Output Shape              Param #
├─Linear: 1-1                            [-1, 50]                  5,000
├─Linear: 1-2                            [-1, 50]                  5,000
├─Linear: 1-3                            [-1, 10]                  500
Total params: 10,500
Trainable params: 10,500
Non-trainable params: 0
Total mult-adds (M): 0.01
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.04
Estimated Total Size (MB): 0.04
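
Both options describe the same function. One way to check this, sketched below, is to copy the weights from one model into the other and compare the outputs. The classes are redefined compactly so that the snippet is self-contained; the names `oModel1`, `oModel2` and `bSame` are only illustrative.

```python
import torch
import torch.nn as nn

class NewLayer(nn.Module):
    def __init__(self, dIn, dOut):
        super().__init__()
        self.Linear1 = nn.Linear(dIn, dOut, bias=False)
        self.Linear2 = nn.Linear(dIn, dOut, bias=False)

    def forward(self, mX):
        return torch.sigmoid(self.Linear1(mX)) + torch.tanh(self.Linear2(mX))

class ParallelModel(nn.Module):
    def __init__(self, dIn, dHidden, dOut):
        super().__init__()
        self.Linear1 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear2 = nn.Linear(dIn,     dHidden, bias=False)
        self.Linear3 = nn.Linear(dHidden, dOut,    bias=False)

    def forward(self, mX):
        mZ = torch.sigmoid(self.Linear1(mX)) + torch.tanh(self.Linear2(mX))
        return self.Linear3(mZ)

oModel1 = nn.Sequential(NewLayer(100, 50), nn.Linear(50, 10, bias=False))  #-- Option I
oModel2 = ParallelModel(100, 50, 10)                                       #-- Option II

#-- Copy the weights of Option I into Option II
with torch.no_grad():
    oModel2.Linear1.weight.copy_(oModel1[0].Linear1.weight)
    oModel2.Linear2.weight.copy_(oModel1[0].Linear2.weight)
    oModel2.Linear3.weight.copy_(oModel1[1].weight)

#-- With identical weights, both models give the same output
mX = torch.randn(4, 100)
with torch.no_grad():
    bSame = torch.allclose(oModel1(mX), oModel2(mX))
print(bSame)
```

Option I is convenient when the parallel block is reused in several places; Option II keeps the whole forward pass in one class.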



### Using GPU
To move data to the GPU we use `.cuda()`, or `.to('cuda')`

In [35]:
mX = torch.randn(2, 4).cuda()
mX

tensor([[-1.8986,  0.9100, -0.2939, -0.6268],
        [-0.3024,  1.5884, -1.4253, -1.6571]], device='cuda:0')

In [36]:
mX = torch.randn(2, 4).to('cuda')
mX

tensor([[-0.0570, -0.5940, -0.6023,  1.4956],
        [ 1.2450, -0.8159,  0.1091,  0.4449]], device='cuda:0')

In [37]:
#-- Generate data directly on the GPU:
mX = torch.randn(2, 4, device='cuda')
mX

tensor([[-1.1095,  1.1097,  0.6147, -0.7757],
        [-1.4830, -1.2083, -0.6251, -0.1581]], device='cuda:0')

In [38]:
#-- Move the parameters of the model to the GPU:
oModel.to('cuda')
next(oModel.parameters()).device

device(type='cuda', index=0)

Back to the CPU:

In [39]:
mX = mX.cpu()
#-- or:
mX = mX.to('cpu')
mX.device

device(type='cpu')
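
The cells above assume that a GPU is present. A common pattern, sketched here with a small hypothetical `nn.Linear(4, 2)` model, is to select the device once and pass it around, so that the same code also runs on a machine without a GPU:

```python
import torch
import torch.nn as nn

#-- Pick the GPU when one is visible to PyTorch, otherwise stay on the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

oModel = nn.Linear(4, 2).to(device)         #-- parameters live on `device`
mX     = torch.randn(3, 4, device=device)   #-- data is created on `device`
mY     = oModel(mX)                         #-- inputs and parameters must share a device

print(device, mY.device)
```

Note that `.to(device)` moves the parameters of a module in place, whereas for a tensor it returns a new tensor that must be assigned (as in `mX = mX.to('cpu')` above).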

### CPU vs GPU:

In [40]:
import time

mX1 = torch.randn(10000, 10000)
mX2 = torch.randn(10000, 10000)

startTime = time.time()
mX3       = mX1 @ mX2
endTime   = time.time()

print(f'CPU time: {endTime - startTime}')

CPU time: 6.751370429992676


In [41]:
mX1 = mX1.cuda()
mX2 = mX2.cuda()

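#-- Note: CUDA kernels are launched asynchronously. Without
#-- torch.cuda.synchronize(), this may time little more than the kernel launch.
startTime = time.time()
mX3       = mX1 @ mX2
endTime   = time.time()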

print(f'GPU time: {endTime - startTime}')

GPU time: 0.04587554931640625
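
GPU timings need some care: CUDA operations are queued asynchronously, so the clock may stop before the product is actually computed, and the first CUDA call can include one-time setup cost. The following is a sketch of a more careful measurement. The helper name `TimeMatMul` and the default size `n=1000` are choices made here, not part of the lecture code.

```python
import time
import torch

def TimeMatMul(device, n=1000):
    '''Times a single (n x n) @ (n x n) product on the given device.'''
    mA = torch.randn(n, n, device=device)
    mB = torch.randn(n, n, device=device)

    if device.type == 'cuda':
        mA @ mB                    #-- warm-up: the first call may include setup cost
        torch.cuda.synchronize()   #-- wait for pending kernels before starting the clock

    startTime = time.perf_counter()
    mC        = mA @ mB
    if device.type == 'cuda':
        torch.cuda.synchronize()   #-- wait until the product is actually computed
    endTime   = time.perf_counter()

    return endTime - startTime

print(f"CPU time: {TimeMatMul(torch.device('cpu'))}")
if torch.cuda.is_available():
    print(f"GPU time: {TimeMatMul(torch.device('cuda'))}")
```

For more stable numbers, the measurement can be repeated several times and averaged.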


###  The End