# PyTorch


## General

> An open source machine learning framework that accelerates the path from research prototyping to production deployment.

- Mostly used to create neural networks
- Specialized to use hardware acceleration (GPU, TPU, Tensor Cores etc.)
- __API based on `numpy`__ (`torch.Tensor` instead of `np.array`)

In [4]:
import numpy as np
import torch
# Alias below is pretty common
# one can also use torch directly
import torch.nn.functional as F

X = torch.rand(300, 10)  # Random uniform
print(type(X))

W = torch.randn(10)  # Random normal
b = torch.tensor([1])  # constant bias (integer dtype, promoted to float32 in the sum below)

y = X @ W + b

print(y.dtype, y.shape)

<class 'torch.Tensor'>
torch.float32 torch.Size([300])


## Data types

PyTorch, similarly to `numpy`, provides multiple data types, for example:

- `torch.float` (32-bit precision)
- `torch.double` (64-bit precision)
- `torch.half` (16-bit precision)

and many others (see [here](https://pytorch.org/docs/stable/tensor_attributes.html)).

> Usually we will use floating point values (either `float` or `half`), depending on context

### Why not double?

> The default `dtype` in PyTorch is `float` because __it takes half the memory of `double` and is accurate enough__

Also, GPU memory is costly (and there isn't enough of it usually), hence lower precision (up to a certain point) might be a good solution.
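
To make the memory argument concrete, the sketch below compares the storage needed by the same tensor in the three precisions listed above. The tensor shape and the helper name `size_in_bytes` are illustrative choices, not part of the original notebook, and the default dtype is assumed to be unchanged.

```python
import torch


def size_in_bytes(t: torch.Tensor) -> int:
    # element_size() is bytes per element, nelement() is the number of elements
    return t.element_size() * t.nelement()


x_float = torch.zeros(1000, 1000)  # default dtype: torch.float (32-bit)
x_double = x_float.double()        # 64-bit
x_half = x_float.half()            # 16-bit

for name, t in [("half", x_half), ("float", x_float), ("double", x_double)]:
    print(f"{name:>6}: {size_in_bytes(t) / 1e6:.0f} MB")
```

Each step down in precision halves the memory needed for the same number of values.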

## Casting

One can easily cast PyTorch tensors to desired data types, see below:

In [5]:
import numpy as np

array = np.random.randn(10, 5)
tensor = torch.from_numpy(array)
print(array.dtype, tensor.dtype)

float64 torch.float64


In [6]:
# cast to half type
new_tensor = tensor.half()

new_tensor.dtype

torch.float16

In [7]:
# numpy interoperability
new_tensor.numpy()

array([[ 0.2207 , -0.4895 , -0.974  , -0.7163 , -0.3904 ],
       [ 0.261  , -0.1913 ,  0.09644, -0.4658 ,  1.246  ],
       [-0.2954 ,  0.4834 ,  0.8955 , -0.6025 , -1.111  ],
       [-0.3735 , -0.6553 ,  1.651  ,  0.1934 , -0.1555 ],
       [ 0.5166 , -0.6772 , -0.3328 , -1.351  ,  1.748  ],
       [-0.1812 ,  0.512  , -0.475  , -0.4182 , -0.2444 ],
       [ 1.421  , -0.3674 , -0.2566 , -0.2686 ,  1.169  ],
       [ 0.4937 , -1.168  , -0.0638 , -0.03564, -0.572  ],
       [-1.131  ,  1.469  ,  0.3977 ,  1.199  , -0.5757 ],
       [-0.648  , -0.01033, -0.608  , -0.2213 , -0.4197 ]], dtype=float16)

In [8]:
# upcasting
(new_tensor + tensor).dtype

torch.float64
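
The upcasting above follows PyTorch's type promotion rules. As a rough sketch, `torch.promote_types` and `torch.result_type` can be used to ask which dtype an operation would produce without running it. The variable names are illustrative.

```python
import torch

half = torch.ones(3, dtype=torch.half)
double = torch.ones(3, dtype=torch.double)
integer = torch.tensor([1, 2, 3])  # inferred as torch.int64
single = torch.ones(3)             # torch.float32 by default

# Two floating point dtypes: the wider one wins
print(torch.promote_types(torch.half, torch.double))
print((half + double).dtype)

# Integer combined with floating point: the result is floating point
print(torch.result_type(integer, single))
print((integer + single).dtype)

# Explicit casts: shorthand methods or the general .to(dtype)
print(double.float().dtype, double.to(torch.float16).dtype)
```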

## Device

PyTorch can utilize multiple device types. In general:
- we use CPU for data loading
- we use specialized devices (usually GPU, sometimes TPU) for running the data through a neural network

> TPU support is currently experimental, see challenges for more info

Let's start by checking if GPU is available on our devices:

In [9]:
torch.cuda.is_available()

False

Based on this information we can create a special device type that we can later use:

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(device)

cpu


In later sections (basics of training) you will see how we can use this device variable
for device agnostic code.
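
A minimal sketch of what device-agnostic code tends to look like: create tensors directly on `device`, or move existing ones with `.to(device)`. It runs unchanged on machines with and without a GPU. The tensor names and shapes are illustrative.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device...
weights = torch.randn(10, device=device)

# ...or create it on CPU (e.g. during data loading) and move it afterwards
batch = torch.rand(32, 10)
batch = batch.to(device)

# Operands of an operation must live on the same device
outputs = batch @ weights
print(outputs.device, outputs.shape)
```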

## Automatic differentiation

In order for neural networks to learn we need to calculate gradients of `loss` w.r.t. parameters (like we did with linear regression previously).

> This time the differentiation graph (__sometimes also called a tape__) is provided by PyTorch

![](./images/grad.jpg)

To use PyTorch's [autograd](https://pytorch.org/docs/stable/autograd.html) we need a few changes in the above code.

First, we have to mark tensors which require gradient using the `requires_grad=True` argument during creation:

> Most PyTorch functions that create tensors, like `rand`, `randn` etc., have `requires_grad` as an optional parameter!

In [11]:
W = torch.randn(10, requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

> Only tensors of floating-point (or complex) data type can require gradients! __No integers or the like__

After that we can use them normally:

In [12]:
y = X @ W + b

loss = y.sum()
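
The restriction to floating-point tensors is easy to verify. In the sketch below, requesting gradients for an integer tensor raises a `RuntimeError`, while casting to a floating type first works. The variable names are illustrative.

```python
import torch

try:
    torch.tensor([1, 2, 3], requires_grad=True)  # int64 tensor
    integer_allowed = True
except RuntimeError as error:
    integer_allowed = False
    print("Error:", error)

# Casting to a floating point dtype first makes the request valid
weights = torch.tensor([1, 2, 3]).float().requires_grad_()
print(integer_allowed, weights.requires_grad, weights.dtype)
```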

## Running backpropagation

As we did in "Gradient Methods", we can run the backpropagation algorithm explicitly.

> In PyTorch we run backpropagation __on a tensor__ (by calling its `backward` method)

In [13]:
print(W.grad, b.grad)

loss.backward()

# Use .grad attribute 
print(W.grad, b.grad)

None None
tensor([147.7014, 155.7213, 148.9580, 143.3640, 146.8089, 154.9344, 145.1965,
        145.6324, 139.8669, 152.7871]) tensor([300.])


> A tensor holding `1` is implicitly fed into `backward` __if the tensor is a scalar!__

> If the tensor is not a scalar, you have to provide a tensor with the initial gradient of the same shape, see [here](https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward)
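
As a sketch of the non-scalar case: calling `backward` on a vector `y` with a tensor of ones as the initial gradient gives the same gradients as `y.sum().backward()`. The shapes below are smaller than in the cells above and purely illustrative.

```python
import torch

X = torch.rand(5, 3)
W = torch.randn(3, requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)

y = X @ W + b  # shape (5,), not a scalar

# y.backward() without arguments would raise an error here;
# the initial gradient must have the same shape as y
y.backward(gradient=torch.ones_like(y))

print(W.grad)  # equals the column sums of X
print(b.grad)  # one contribution of 1 per sample
```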

## grad_fn, how PyTorch keeps track of operations

- __PyTorch keeps the function which created the tensor (if any) inside the `grad_fn`__ attribute
- if `grad_fn` is `None`, the tensor is a leaf, which means it:
    - was created by the user explicitly (either with `requires_grad` set to `True` or `False`), or
    - is the result of operations on tensors none of which require gradient

See below:

In [14]:
print(y.grad_fn, y.is_leaf, y.requires_grad)

print(W.grad_fn, W.is_leaf, W.requires_grad)

print(X.grad_fn, X.is_leaf, X.requires_grad)

<AddBackward0 object at 0x0000022884966EE0> False True
None True True
None True False
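
A small sketch of how `grad_fn` and `is_leaf` behave for derived tensors: a result only gets a `grad_fn` when at least one of its inputs requires gradient. The tensor names are illustrative.

```python
import torch

X = torch.rand(4, 3)                    # created by the user, no gradient
W = torch.randn(3, requires_grad=True)  # created by the user, with gradient

untracked = X * 2  # no input requires gradient, so nothing is recorded
tracked = X @ W    # W requires gradient, so the operation is recorded

for name, t in [("X * 2", untracked), ("X @ W", tracked)]:
    print(name, t.grad_fn, t.is_leaf, t.requires_grad)
```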


# Modules

> __PyTorch provides multiple modules based on what we intend to do__

> __REFER TO THIS LIST WHENEVER IN DOUBT__

> __MOST OF NUMPY FUNCTIONALITY IS IMPLEMENTED HENCE THERE IS NO NEED TO USE IT, ALWAYS USE PYTORCH COUNTERPARTS!__

> __DO NOT USE PYTHON BUILT-IN FUNCTIONS LIKE SUM!__
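
One reason for the last rule, sketched below: Python's built-in `sum` iterates over the first dimension in a Python loop, so it is slow and reduces only that dimension, whereas `torch.sum` is a single vectorized operation that reduces all elements (or the dimension requested). The tensor is illustrative.

```python
import torch

t = torch.ones(3, 4)

builtin_result = sum(t)        # adds the 3 rows together, one by one
torch_result = torch.sum(t)    # reduces over all elements
per_row = torch.sum(t, dim=1)  # reduces over an explicit dimension

print(builtin_result.shape, torch_result.shape, per_row.shape)
```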

## [import torch](https://pytorch.org/docs/stable/torch.html) 

> __Basic functionality, acts like `numpy` high level namespace__

- creating tensors, like:
    - `torch.tensor` - creates tensor from `list` data (like `np.array`)
    - `torch.randn` - random normal tensor of specified shape
    - `torch.zeros`, `torch.zeros_like`
- indexing, slicing, like:
    - `torch.cat` - concatenate across __given dimension__
    - `torch.stack` - stack across __new dimension__
    - `torch.reshape` - reshape the tensor (use instead of `Tensor.view`)
- random sampling and seeding (`torch.manual_seed`, `torch.seed`)
- mathematical functions, like:
    - `torch.sin` 
    - `torch.sigmoid`
    - `torch.abs`
    - `torch.mean`
- reduction operations, like:
    - `torch.argmax` - __index of item with maximum value__ (possibly across dimensions)
    - `torch.min` - minimum value (possibly across dimensions)
    - `torch.var_mean` - variance and mean (possibly across dimensions)
- comparison operations, like:
    - `torch.topk` - maximum `k` values (and their indices)
    - `torch.sort` - sort values (possibly across dimensions)
    - `torch.argsort` - as above, but return indices sorting the tensor
- enabling/disabling gradient (`torch.no_grad`, see training & loop section)
- __other operations on tensors__

Let's see some of them in action:

In [2]:
import torch

t1 = torch.tensor([1, 2, 3])  # prefer torch.tensor: it infers dtype from the data
t2 = torch.Tensor([1, 2, 3])  # legacy constructor: always the default float dtype

t1.dtype, t2.dtype

(torch.int64, torch.float32)

In [3]:
# Random tensor with the same shape as t1, then slicing it

t1 = torch.zeros(64, 18)
t2 = torch.randn_like(t1)
t2[0, 9:]

tensor([ 0.4164,  0.3896, -0.2437,  0.1109,  0.1156,  0.7962, -0.9374,  0.5662,
         0.1221])

In [7]:
torch.topk(t2, k=3, dim=0)

torch.return_types.topk(
values=tensor([[2.5645, 2.4702, 2.9204, 2.1125, 2.4508, 3.1817, 2.4998, 2.2284, 2.1669,
         2.1737, 2.0773, 4.0561, 1.7559, 2.7503, 2.2525, 1.6442, 3.5927, 1.9406],
        [2.5129, 2.4029, 2.1267, 1.9612, 1.6177, 2.2472, 1.4460, 2.2186, 2.0504,
         1.6029, 1.9764, 2.4437, 1.5435, 2.3339, 2.1958, 1.5503, 2.2064, 1.6870],
        [2.3649, 1.9801, 1.4681, 1.4865, 1.6087, 1.9867, 1.4129, 1.9937, 1.9930,
         1.5907, 1.9118, 2.3072, 1.4368, 2.2790, 1.8168, 1.4041, 2.0703, 1.4445]]),
indices=tensor([[ 5, 29, 42, 23, 26, 14, 63, 30, 51,  6, 53, 34, 56, 13, 38, 60, 52, 57],
        [31, 17,  1, 41, 36, 47,  7, 31, 30, 14, 21, 52,  7, 47, 24, 59, 33, 56],
        [ 3, 42, 32,  2, 15, 10, 48, 23, 20,  3,  8, 11,  1, 62, 52, 32, 16, 48]]))

In [11]:
t1 = torch.tensor([1, 2, 3])
torch.stack((t1, t1, t1, t1), dim=0).shape

torch.Size([4, 3])

In [12]:
torch.cat((t1, t1, t1), dim=0).shape

torch.Size([9])
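
Reductions and sorting take a `dim` argument in the same way as `topk` above. A short sketch on a small hand-written tensor, chosen so the results can be checked by eye:

```python
import torch

t = torch.tensor([[3.0, 1.0, 2.0],
                  [0.0, 5.0, 4.0]])

print(torch.argmax(t, dim=1))   # index of the largest value in each row
print(torch.min(t, dim=0))      # values and indices of the column-wise minima
print(torch.argsort(t, dim=1))  # indices that would sort each row

variance, mean = torch.var_mean(t, dim=1)  # unbiased variance by default
print(variance, mean)

# reshape keeps the data and changes the shape; -1 infers the remaining size
print(torch.reshape(t, (3, -1)).shape)
```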

## [import torch.nn as nn](https://pytorch.org/docs/stable/nn.html)

> __PyTorch neural network related modules__

We will learn more about that in the following lessons, __but keep in mind that__:
- __All of the operations (layers) are presented as classes (which you instantiate)__
- __Layers can be mixed together and depend on `torch.nn.Module` (ALL OF THEM ARE INSTANCES OF IT)__

Some examples:

In [None]:
import torch.nn as nn

layer = nn.Linear(10, 5)
for param in layer.parameters():
    print(param.shape)

In [None]:
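# Layers are callable objects (functors)

another_layer = nn.Conv3d(12, 32, 3)
# Conv3d expects (batch, channels, depth, height, width);
# the spatial sizes below are arbitrary illustrative values
inputs = torch.randn(64, 12, 8, 8, 8)
outputs = another_layer(inputs)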
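
Because every layer is an `nn.Module`, layers compose naturally. As a sketch, `nn.Sequential` chains modules so that the output of one becomes the input of the next. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),  # holds a (32, 10) weight and a (32,) bias
    nn.ReLU(),          # no parameters
    nn.Linear(32, 5),
)

inputs = torch.randn(64, 10)
outputs = model(inputs)
print(outputs.shape)

# The container is itself a Module and exposes the parameters of its children
for name, param in model.named_parameters():
    print(name, param.shape)
```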

## [import torch.nn.functional as F](https://pytorch.org/docs/stable/nn.functional.html)

> __PyTorch neural network components PROVIDED AS FUNCTIONS INSTEAD OF OBJECTS__

Why would we use one over the other? __Code readability__ is the answer.

First, let's see where `torch.nn` looks better than `torch.nn.functional`:

In [13]:
# Setup cell with data

data = torch.randn(64, 10) # 64 examples, 10 features

In [16]:
# torch.nn approach

import torch.nn as nn

layer = nn.Linear(10, 5)
output = layer(data)

output.shape

torch.Size([64, 5])

In [17]:
# torch.nn.functional approach

import torch.nn.functional as F

output = F.linear(data, weight=torch.randn(5, 10))

output.shape

torch.Size([64, 5])

### Pros of torch.nn

- __Creating stateful objects (NEURAL NETWORKS AND OTHER MODELS)__
- Implicitly creates weight (with appropriate initialization!)
- Easier to compose and to split into separate steps
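
To see that both approaches compute the same thing, the sketch below feeds the weight and bias created by `nn.Linear` into `F.linear`. In the functional form the caller is responsible for creating, initializing and storing those tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

data = torch.randn(64, 10)

layer = nn.Linear(10, 5)  # creates and initializes weight and bias
module_output = layer(data)

# Same computation, with the parameters passed in explicitly
functional_output = F.linear(data, weight=layer.weight, bias=layer.bias)

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(module_output, functional_output))
```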

Now let's see when `torch.nn.functional` is better:

In [18]:
# torch.nn approach

softmax = torch.nn.Softmax(dim=1)
probabilities = softmax(data)

probabilities.shape

torch.Size([64, 10])

In [19]:
# torch.nn.functional approach

probabilities = F.softmax(data, dim=1)

probabilities.shape

torch.Size([64, 10])

### Pros of torch.nn.functional

- __Using non-stateful objects__ (softmax has no weights or attributes)
- Directly applicable
- __Called in place where needed__ (no object to create and keep around)

## torch.nn.Module

> [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) is a base class for every deep learning model in PyTorch (usually neural networks)

Given that, we will inherit from it each time we create a more complicated module.

Let's see how we can code up __linear regression__:

In [12]:
class LinearRegression(torch.nn.Module):
    def __init__(self, n_features: int):
        # This line is always required at the beginning:
        # it sets up the internal state used to register parameters and submodules
        super().__init__()

        self.W = torch.nn.Parameter(torch.randn(n_features))
        self.b = torch.nn.Parameter(torch.ones(1))
        self.other_tensor = torch.randn(5)

    def forward(self, X):
        return X @ self.W + self.b

### torch.nn.Parameter

> If we want a tensor to be a learnable parameter of an `nn.Module` we have to wrap it inside `nn.Parameter`

Let's see what parameters our model currently has:

In [13]:
model = LinearRegression(15)

# named_parameters is a generator, you can also use parameters method
for name, parameter in model.named_parameters():
    print(name, parameter.shape)

W torch.Size([15])
b torch.Size([1])


As one can see `self.other_tensor` __is not registered as a parameter__.

This means we won't be able to easily optimize it; it is "merely an attribute".
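
If a tensor should travel with the module (saved in `state_dict`, moved together with the model) without being optimized, `nn.Module.register_buffer` is one option. This goes slightly beyond the cell above: the class name `WithBuffer` and its attribute names are hypothetical.

```python
import torch


class WithBuffer(torch.nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        # Learnable, returned by parameters()
        self.W = torch.nn.Parameter(torch.randn(n_features))
        # Plain attribute, ignored by the module machinery
        self.plain_attribute = torch.randn(5)
        # Tracked by the module, but not learnable
        self.register_buffer("running_mean", torch.zeros(n_features))

    def forward(self, X):
        return (X - self.running_mean) @ self.W


model = WithBuffer(15)
print([name for name, _ in model.named_parameters()])
print([name for name, _ in model.named_buffers()])
print(list(model.state_dict().keys()))
```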

## forward method

Users should implement the logic of the model (how data goes through the neural network) inside this method.

> When running data through our model we will use the `__call__` method (`model(X)`, not `model.forward(X)`). __This ensures any hooks registered for the module will run correctly__

In [14]:
output = model(torch.randn(64, 15))

output.shape

torch.Size([64])
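
The difference between calling the module and calling `forward` directly can be sketched with a forward hook. The hook function `log_output_shape` and the `calls` list are illustrative names; `register_forward_hook` is the `nn.Module` method used to attach it.

```python
import torch

calls = []


def log_output_shape(module, inputs, output):
    # Forward hooks receive the module, its positional inputs (a tuple) and its output
    calls.append(tuple(output.shape))


layer = torch.nn.Linear(15, 1)
handle = layer.register_forward_hook(log_output_shape)

X = torch.randn(64, 15)
layer(X)          # goes through __call__, so the hook runs
layer.forward(X)  # bypasses __call__, so the hook is skipped

handle.remove()   # unregister the hook
print(calls)
```

Only the first call is recorded, which is why `model(X)` is the recommended way to run a module.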

## Input shape

> Most PyTorch layers expect `(batch_size, n_features_1, ..., n_features_N)` tensors as input

In the case above, batch size was `64` with `15` input features to linear regression.
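
As a sketch of this convention, `nn.Linear` acts on the last dimension and treats all leading dimensions as batch-like. A single sample therefore usually needs an explicit batch dimension, added here with `unsqueeze`. The sizes are illustrative.

```python
import torch

layer = torch.nn.Linear(15, 4)

batch = torch.randn(64, 15)  # (batch_size, n_features)
print(layer(batch).shape)

sequences = torch.randn(64, 7, 15)  # extra dimension between batch and features
print(layer(sequences).shape)

single_sample = torch.randn(15)
as_batch = single_sample.unsqueeze(0)  # add a batch dimension of size 1
print(as_batch.shape, layer(as_batch).shape)
```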

## Exercise

Create __multiclass LogisticRegression__ layer from scratch by inheriting from `nn.Module` and using `nn.Parameter`.

- User can specify number of input features and `n_classes` in the initialization
- Remember about the `forward` method (__returns logits__)
- Create another method `predict_proba` by reusing code above (which PyTorch module should you use?)
- Create another method `predict` which returns correct class __for each sample__ (use output from __correctly called__ `forward` method)

In [15]:
class LogisticRegression(torch.nn.Module):
    def __init__(self, in_features: int, n_classes: int):
        super().__init__()
        self.in_features: int = in_features
        self.n_classes: int = n_classes

        # W: (in_features, n_classes)
        self.W = torch.nn.Parameter(torch.randn(self.in_features, self.n_classes))
        self.b = torch.nn.Parameter(torch.ones(self.n_classes))

    def forward(self, X):
        # X: (batch, in_features)
        return X @ self.W + self.b

    def predict_proba(self, X):
        return torch.nn.functional.softmax(self(X), dim=1)

    def predict(self, X):
        return torch.argmax(self(X), dim=1)

## Summary

- PyTorch can be considered as `numpy` on GPU for neural networks
- PyTorch provides different data types:
    - `float` is a good default value (good balance between necessary precision, performance and memory usage)
- PyTorch can run on different devices:
    - GPU is used for running neural networks
    - CPU is used for data loading and other intensive tasks
- We should write device agnostic code (basics shown here)
- We should inherit from `torch.nn.Module` when creating neural networks
    - Override `__init__` and `forward`
    - Use this model via functor `__call__`


# Challenges

## Assessment

- Check the available methods and attributes of [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). What are they used for? Look at some examples.

## Non-assessment

- Check out [CUDA semantics](https://pytorch.org/docs/stable/notes/cuda.html) in PyTorch. How can you choose a specific GPU device if there are several of them?
- What are the aforementioned hooks? Check out [this article](https://medium.com/the-dl/how-to-use-pytorch-hooks-5041d777f904) for more information