# GPU & Data Parallel


# Processes vs threads

These terms are often confused, and it is important to get them right in order to know where and when each of them can be used.

> __A coroutine lives in a thread, a thread lives in a process, a process runs on a core, a core lives in a CPU.__

## Core

> A core is a physical part of a CPU (or of another device), and processes run on cores

- The more cores, the more processes we can run at once
- There are technologies (e.g. hyper-threading) that expose two virtual cores per physical core (this indicates how many threads can run on a core at the same time)
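
As a quick illustration of the points above, the standard library can report how many logical cores the operating system sees. This is a minimal sketch; the reported number typically includes the virtual cores provided by hyper-threading, and `os.cpu_count()` may return `None` when the count cannot be determined.

```python
import os

# Number of logical cores visible to the OS (virtual cores included),
# or None if it cannot be determined on this platform
logical_cores = os.cpu_count()

print("Logical cores:", logical_cores)
```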

## Process

> A process is any running program; the operating system manages it as a single unit

It is characterized by:
- __Separate memory space (one part of computer memory is used exclusively by one process)__
- __Creation/Destruction is costly__
- __Communication between processes is costly__
- One process per core at a time
- Independent (one crashing process does not interfere with others)

> __`torch.utils.data.DataLoader` uses multiple processes to load data (when `num_workers > 0`)!__

## Threads

> Threads are more lightweight and more than one can run in a single process

- Program needs at least one thread to run a.k.a. __main thread__
- __Shared memory space__ (threads can overwrite the same variable; in Python the GIL serializes bytecode execution, but compound updates still need a lock)
- __Creation/Destruction is cheaper__
- __Communication between threads is faster__
- Multiple threads per core (only one can own the core at a time)
- Dependent, can crash because of what the others are doing
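
The shared memory space of threads can be sketched with the standard `threading` module. The names `counter` and `add_many` are made up for this illustration. Every thread updates the same object, and a lock protects the read-modify-write step so that no update is lost.

```python
import threading

# One object living in the process memory, visible to every thread
counter = {"value": 0}
lock = threading.Lock()


def add_many(times):
    """Increment the shared counter `times` times."""
    for _ in range(times):
        # The lock makes the read-modify-write step atomic
        with lock:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter["value"])  # 4 threads * 10_000 increments = 40000
```

Separate processes would each get their own copy of `counter`, which is why communication between processes is more costly.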

### Global Interpreter Lock

> __Python has a Global Interpreter Lock (GIL), which slows down threads!__

The GIL is a mutex that lets only one thread execute Python bytecode at a time. Some consequences:
- __multithreaded code might be slower than single-threaded code__ (depending on the use case and run)
- __it can be worked around__ (a different language or interpreter; the latter is not advised in general)
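
A rough way to observe the effect is to time the same CPU-bound function run sequentially and in threads. This is only a sketch: the helper names are invented here, timings vary between machines and runs, and interpreters without a GIL may behave differently, so no particular result is guaranteed.

```python
import threading
import time


def count_down(n):
    """Pure Python, CPU-bound work that holds the GIL while it runs."""
    while n > 0:
        n -= 1


def time_sequential(n, repeats):
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, repeats):
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


sequential = time_sequential(500_000, 4)
threaded = time_threaded(500_000, 4)
print(f"sequential: {sequential:.3f}s, threaded: {threaded:.3f}s")
```

With the GIL in place, the threaded variant is usually not meaningfully faster for this kind of workload.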

## Concurrent vs parallel

> Concurrent programming refers to threads (or other forms of execution) switching ownership of the processor

> Parallel refers to running across different cores

- __Processes can always run in parallel__ (given enough cores)
- __Threads can run concurrently__ (e.g. two threads on one core with HyperThreading) or __in parallel__ (across multiple cores)

## General tips

- __Try to schedule your tasks evenly across processes__ (so they take similar amount of time and resources)
- __Try to divide your tasks into as independent parts as possible__
- __Try not to share data__ (send parts of data and gather at the end from different execution units)
- Circumvent GIL when speed is a factor (e.g. move execution to language like C/C++)
- Usually use Python's `multiprocessing` module for parallel code

# CPU

- __Central Processing Unit__
- __Focused on instructions__ - large set of instructions can be run on it
- Optimized for general tasks
- Typically smaller number of cores (around 4-18, twice with hyperthreading)
- Can run up to around 1000 threads

# GPU

- __Graphics Processing Unit__
- __Focused on data__ - specialized for SIMD (Single Instruction Multiple Data) tasks on floating point data
- Unable to handle general tasks well
- Typically large number of specialized cores (3000-10000 cores)
- Can run tens/hundreds of thousands of threads at once

![](images/num_cores.jpg)

# Which one to choose?

> Fortunately, we can use both devices, where they excel

For deep learning this looks more or less like this:

## CPU

> Remember that the CPU is also responsible for other tasks and its performance may vary based on OS load!

- General processing and instructions (e.g. data loading)
- Usually performs "slower" tasks (use multiple worker processes for loading data)

## GPU

> GPU may also be responsible for other tasks, but usually to a lesser extent

- __Running identical set of instructions with a lot of data__
- This means most layers benefit from this approach, as they usually boil down to GEMM (GEneral Matrix Multiplication)

# Which hardware is supported

- Right now __Intel CPUs__ are the most widely used for everything, including deep learning
- __NVidia's GPUs__ are at the forefront of deep learning
- __AMD's GPUs__ are currently not officially supported

# I know I have a GPU! Why isn't it available?

Correct **drivers** are needed for us to work with GPU

> A driver is a piece of software that lets the operating system and a hardware device communicate with each other.

- Correct drivers are usually provided out of the box (or are easily installable)
- If we do something incorrectly with drivers, it might be hard to revert changes (occurs rarely if done with care and thought)
- __Possible solution:__ Let someone else do that for us (e.g. cloud providers, system administrators)

> Appropriate software versions are also required (e.g. PyTorch compiled with CUDA support)

In [1]:
import torch

cuda_available = torch.cuda.is_available() # check if cuda is available

print('Got GPU?', cuda_available)

Got GPU? False
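
When the check above returns `False`, it helps to see whether the installed PyTorch build was compiled with CUDA at all. The helper below is a small sketch (the function name `describe_cuda_setup` is invented for this illustration) that gathers the relevant facts in one place.

```python
import torch


def describe_cuda_setup():
    """Collect basic facts about CUDA support in the current environment."""
    info = {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled with, None for CPU-only builds
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        info["first_device_name"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_cuda_setup().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the installed package is likely a CPU-only build; if it is set but `cuda_available` is `False`, drivers are the more likely culprit.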


# CUDA

> CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model designed for GPU

Programming with CUDA can be done in multiple languages, __but usually with C/C++ (or, less commonly, Fortran)__

- You program in CUDA using low level macros and functions
- This code is passed from CPU to GPU device where it is run
- Requires specific compiler called __nvcc__ (run `nvcc --version` to get more info about your CUDA version)

## Important things

- Each time the CPU passes control to the GPU device, the device can run its own set of operations
- The CPU has to wait for the results of those operations at a point called the __synchronization point__
- __The fewer synchronization points the better!__ Always try to "stick" with one "environment" (fortunately PyTorch does most of this for us)

> You can find more about PyTorch's cuda capabilities inside [`torch.cuda`](https://pytorch.org/docs/stable/cuda.html) package
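
One common source of avoidable synchronization points is calling `.item()` on every batch, since it copies a value back to the CPU. The sketch below uses made-up stand-in losses instead of a real training loop, and falls back to the CPU when no GPU is present (in which case there is nothing to synchronize, but the pattern is the same).

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for per-batch losses that already live on the device
losses = [torch.tensor(float(i), device=device) for i in range(5)]

# Variant 1: .item() forces the CPU to wait for the GPU on every iteration
total_every_step = 0.0
for loss in losses:
    total_every_step += loss.item()

# Variant 2: accumulate on the device and synchronize only once at the end
running_total = torch.zeros((), device=device)
for loss in losses:
    running_total += loss.detach()
total_once = running_total.item()

print(total_every_step, total_once)
```

Both variants compute the same sum; the second one simply touches the CPU once instead of once per batch.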

# Tensor Cores 

> Read more about them on [NVidia's site](https://www.nvidia.com/en-us/data-center/tensor-cores/)

- Specialized set of instructions (and a new data type) which allows us to speed up our computations __up to 10 times!__ 
- Provided in newer graphics cards
- Suitable for mixed precision training (see below) and high data throughput

## Tensor Cores tips

> In order to utilize Tensor Cores efficiently, there are a few guidelines one should follow while creating most of the architectures

__Based on [this article](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/)__

- Use Mixed Precision training (example in PyTorch below)
- Parameters and inputs should be:
    - divisible by `8` for `float16` (a.k.a. half precision)
    - divisible by `16` for `int8` (rare case in deep learning)
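
A small helper can round a layer size up to the nearest multiple that satisfies the divisibility guideline above. This is just an illustrative sketch; the function name `round_up_to_multiple` is not part of PyTorch.

```python
def round_up_to_multiple(value: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= `value`."""
    return ((value + multiple - 1) // multiple) * multiple


# 15 input features padded for float16, and for int8
print(round_up_to_multiple(15))      # 16
print(round_up_to_multiple(17, 16))  # 32
```

Whether padding a dimension is worth it depends on the model, so it is best treated as something to measure.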
    
## Other performance tips

- Use "math-light" operations and activations:
    - ReLU is "math-light" as it only involves thresholding value
    - Tanh is "math-heavy" as it involves computing exponentials

## Exercise

Below you have a few code cells (__up to PyTorch's AMP__) with schematic code.

- Analyze them and write a comment next to each, stating whether Tensor Cores best practices were violated
- If they were, which part of the code violates the guidelines and why? If they weren't, why not?
- Any other issues with this code that you can spot?
- Look closely at every line, __errors may occur not only in model, but also data loading__

In [None]:
import torch

# Case 1 - 15 features, 3 class classification

data = torch.randn(128, 15)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.Linear(64, 32),
    torch.nn.Linear(32, 16),
    torch.nn.Linear(16, 8),
    torch.nn.Linear(8, 4),
    torch.nn.Linear(4, 5)
)

model(data).shape

In [None]:
import torch

# Case 2 - 15 features, 16 class classification

data = torch.randn(123, 16)
dataloader = torch.utils.data.DataLoader(data, batch_size=64)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 16)
)

for batch in dataloader:
    model(dataloader)

In [None]:
import torch

# Case 3 - Images and convolution

data, mask = torch.randn(1024, 3, 26, 26), torch.randn(1024, 1, 26, 26)
dataset = SuperDataset(data, mask) # let's assume this is a torch.utils.data.Dataset
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3)
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 128, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.Conv2d(128, 64, kernel_size=3),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 1, kernel_size=3)
    torch.nn.Sigmoid(),
)

for img, mask in dataloader:
    predicted = model(dataloader)
    ...

# Automatic Mixed Precision (AMP)

> __Automatic Mixed Precision__ automatically casts __parts of neural networks and inputs__ to lower precision datatype

> In order to fully utilize current performance boosts (including Tensor Cores) we have to use __mixed precision training__

PyTorch provides easy to use interface:

## Autocasting

> [`torch.cuda.amp.autocast`](https://pytorch.org/docs/stable/amp.html#id4) is a context manager or decorator which runs regions of the code in mixed precision

In [None]:
import torch.optim as optim
from torch.cuda.amp import autocast

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# inputs.dtype == float32
for inputs, targets in data:
    optimizer.zero_grad()

    # Enables autocasting for the forward pass (model + loss)
    with autocast():
        # Eligible ops (e.g. linear layers, convolutions) run in float16, DONE AUTOMATICALLY
        output = model(inputs)
        loss = loss_fn(output, targets)

    # Exits the context manager before backward()
    loss.backward()
    optimizer.step()

### Autocasting know-how

- __Only for CUDA!__
- __Running `backward` inside `autocast` is not recommended!__ (see below)
- Some layers (or parts of layers, e.g. weights/buffers) will have their precision unchanged
- It is analyzed on a per-layer basis in order not to affect (at least drastically) the model's performance
- Regions of autocast/no-autocast (`torch.cuda.amp.autocast(enabled=False)`) can be nested inside each other (rarely useful)
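
The point about unchanged precision can be checked directly. The sketch below (the helper name `autocast_dtypes` is invented here) runs a linear layer under `autocast` and reports the dtype of its weights and of its output; it returns `None` when no CUDA device is available.

```python
import torch


def autocast_dtypes():
    """Return (weight dtype, output dtype) for a linear layer under autocast.

    Returns None when CUDA is not available.
    """
    if not torch.cuda.is_available():
        return None

    layer = torch.nn.Linear(16, 8).cuda()
    inputs = torch.randn(4, 16, device="cuda")  # float32 inputs

    with torch.cuda.amp.autocast():
        output = layer(inputs)

    return layer.weight.dtype, output.dtype


print(autocast_dtypes())
```

On a CUDA machine the weights should stay in `torch.float32` while the output comes back as `torch.float16`.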

## Gradient Scaling

> [`torch.cuda.amp.GradScaler`](https://pytorch.org/docs/stable/amp.html#id5) __prevents underflow during backward pass__ (small updates may not fit into `half` and those would be lost!) 

- The loss scale is determined automatically and adjusted dynamically during training (via `scaler.update()`)

In [None]:
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for inputs, targets in data:
        optimizer.zero_grad()
        with autocast():
            output = model(inputs)
            loss = loss_fn(output, targets)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

# Exercise

> Requires GPU capable device!

Create training loop (with specified number of epochs) with `autocast` and gradient scaling (no unscaling).

In [None]:
!pip install pytorch-lightning-bolts

In [None]:
import tempfile

import torch
import torchvision
from pl_bolts.datamodules import CIFAR10DataModule

with tempfile.TemporaryDirectory() as data_dir:
    dm = CIFAR10DataModule(
        data_dir=data_dir, shuffle=True, num_workers=1, normalize=True, batch_size=64
    )
    train_dataloader = dm.train_dataloader()
    test_dataloader = dm.test_dataloader()


# Use provided model, optimizer, criterion
model = torchvision.models.resnet50(num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

# Setup GradScaler here
...

epochs = 10
for epoch in range(epochs):
    # Create train and validation loops over dataloaders
    # Including autocasting and proper usage of GradScaler
    ...

# Moving/creating data on GPU

Torch tensors and models have a `.to` method which moves them to a device (we have seen that previously)

> PyTorch provides multiple arguments when moving/constructing our data on a device, which allows us to fine-tune performance

Let's see two methods, which cover most of the possibilities:

## torch.tensor

Allows us to __create__ `torch.Tensor` instance from Python's `list` instances, is part of PyTorch's [creation ops](https://pytorch.org/docs/stable/torch.html)

Let's see its signature:

In [None]:
torch.tensor(data, *, dtype=None, device=None, requires_grad=False, pin_memory=False)

### device

`device` can be:
- instance of `torch.device` (advised)
- string specifying name of the device (usually `"cpu"` or `"cuda"`)

> PyTorch allows us to easily move data to different GPUs __(one tensor can reside only on a single GPU!)__

Check out the code below:

In [None]:
cpu_device = torch.device("cpu")
first_available_gpu = torch.device("cuda")

# Same as above
zeroth_gpu = torch.device("cuda:0")

N = 4
n_th_gpu = torch.device(f"cuda:{N}")
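
Constructing a `torch.device` does not check that the device exists, so using `cuda:4` on a machine with fewer GPUs fails only when a tensor is placed on it. A defensive sketch (the helper `pick_device` is an assumption of this example, not a PyTorch function) could look like this:

```python
import torch


def pick_device(index: int = 0) -> torch.device:
    """Return the GPU with the given index if it exists, otherwise the CPU."""
    if torch.cuda.is_available() and index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")


device = pick_device(4)

# Create the tensor directly on the chosen device instead of moving it later
t = torch.tensor([1.0, 2.0, 3.0], device=device)
print(t.device)
```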

### pin_memory

> When `.to` method is invoked, special "staging area" has to be prepared on CPU and data from pageable memory is copied to it (think of it as `git add` before `git commit`)

![](images/pin_memory.png)


#### Why?

- GPU cannot access pageable memory on CPU
- Creating a pinned memory region and creating `torch.Tensor`s directly in it allows us to mitigate this issue

#### Pros

- __Faster memory transfers__ (usually)
- No need to create "pinned memory" region over and over again
- No need for copy from CPU

#### Cons

- __Part of CPU memory is occupied by PyTorch for the duration of program run__
- Due to above, it will not be available for system usage if needed

### When to use?

- If we create `torch.Tensor` instances of the same (similar) size on the device (which is often the case)
- __We should measure performance__ as it depends on many variables
- __At least try it if you need speedup__ and data loading is the __bottleneck__ 

> __Usually used with `torch.utils.data.DataLoader` as it has `pin_memory` argument!__

> __Check [`torch.utils.bottleneck`](https://pytorch.org/docs/stable/bottleneck.html) for more info about profiling!__
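
A minimal sketch of the `DataLoader` usage mentioned above, with random stand-in data. Pinning is enabled only when CUDA is available, since pinned memory is only useful for transfers to a GPU; the dataset shape and batch size are arbitrary choices for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Random stand-in data: 256 samples, 16 features, binary labels
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

# With pin_memory=True the loader returns batches placed in pinned memory
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

batch_shapes = []
for features, labels in loader:
    features = features.to(device)
    labels = labels.to(device)
    batch_shapes.append(tuple(features.shape))

print(batch_shapes)
```

Whether this gives a speedup should be measured on the actual workload.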

## tensor.to(...) method

> Allows us to __move__ tensor to device (usually GPU) and specify many details about it

In [None]:
to(
    device=None, # Previously
    dtype=None, # Previously
    non_blocking=False,
    copy=False, # Make a new copy even if tensor matches the format
    # You almost never should set copy=True
    memory_format=torch.preserve_format, # OBLIGATORY ASSESSMENT!
)

### non_blocking

> `non_blocking=True` will allow CPU to run without waiting for the move to complete

#### Pros

- CPU does not have to wait for the move operation to finish
- Improves parallelization of code
- Possible speedups (see when to use)

#### Cons

- Not usable in many cases
- May confuse users as to your intent
- May be harder to profile accurately

#### When to use

- Setting it to `True` __should not__ hurt performance
- When it may logically improve performance (see below)
- When the __synchronization point__ is not immediate

See a case where it is immediate and nonimmediate:

In [None]:
# At least 1 CUDA device required
if torch.cuda.device_count() >= 1:
    
    # IMMEDIATE (no point)

    t = torch.randn(100, 100).to("cuda", non_blocking=True)
    # Control is passed back and you have to wait until move finishes
    t += 10
    
    # NONIMMEDIATE (might improve performance)
    t1 = torch.randn(100, 100).to("cuda", non_blocking=True)
    t2 = torch.randn(10, 10)
    t2 = torch.cos(torch.sin(t2))
    # Possibly some other operations on CPU
    t2 = t2 @ t2
    
    # Synchronization point, data has to be moved to GPU by now
    t1 += 10

# Challenges

## Assessment 

- Read about Python's coroutines in documentation [here](https://docs.python.org/3/library/asyncio-task.html) (general understanding of coroutines may be part of an assessment!)
- __Read about TPUs [here](https://cloud.google.com/tpu/docs/tpus) and expect a few questions about them!__
- What is the difference between BCHW format and BHWC (a.k.a. channels first vs channels last)? Read about it [here](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html)
- Go over [torch.cuda](https://pytorch.org/docs/stable/cuda.html) to see what can be done with CUDA enabled devices in PyTorch

## Non-assessment

- Read more about Python's [GIL](https://wiki.python.org/moin/GlobalInterpreterLock). What are non Cython options to circumvent the limitations?
- Read more about PyTorch and TPU integration via `torch-xla` package [here](https://pytorch.org/xla/release/1.8/index.html)
- Read more about performance optimization for Deep Learning [here](https://docs.nvidia.com/deeplearning/performance/index.html). Some of the tips were provided, but there's way more to uncover if you are interested!
- Read more about optimization of data transfers to CUDA enabled devices [here](https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/)