# Multiple GPUs

There are two main ways to use multiple GPUs for neural network training or inference:
- __Model Parallel - Splitting the model across many GPUs__
- __Data Parallel - Splitting the data across many GPUs__

> __Both of those modes can be mixed, though usually we only need Data Parallel!__

## Model Parallel

> Model Parallel requires manually moving parts of the model to specific devices (see the Assessment challenges below for the full picture)

We will only look at Model Parallel on a single machine with multiple GPUs:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim


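class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First layer lives on the first GPU
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        # Second layer lives on the second GPU
        self.net2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        # Move the input to the device of the layer that consumes it
        x = self.relu(self.net1(x.to('cuda:0')))
        # Move the intermediate activation to the second GPU before the next layer
        return self.net2(x.to('cuda:1'))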
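
A training step with such a model looks almost like the single-device case. The one thing to watch is that the labels must live on the same device as the model output. Below is a minimal sketch, not a definitive implementation: `TwoDeviceModel` is a hypothetical variant of `ToyModel` that takes its devices as arguments, and it assumes a fallback to CPU for both halves when fewer than two GPUs are visible, so the cell can run anywhere.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Assumption: use two GPUs when available, otherwise put both halves on the CPU
two_gpus = torch.cuda.device_count() >= 2
dev0, dev1 = ("cuda:0", "cuda:1") if two_gpus else ("cpu", "cpu")


class TwoDeviceModel(nn.Module):
    """Hypothetical variant of ToyModel with configurable devices."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))


model = TwoDeviceModel(dev0, dev1)
loss_fn = nn.MSELoss()
# model.parameters() yields parameters from both devices; the optimizer handles this
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
# Labels must be on the same device as the outputs (the second device)
labels = torch.randn(20, 5).to(dev1)
loss = loss_fn(outputs, labels)
# Autograd propagates gradients back across the device boundary
loss.backward()
optimizer.step()
```

Note that in this naive setup only one device is busy at any moment, which is part of why more automated approaches exist.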

## Data Parallel

> __PyTorch provides a dedicated [`torch.nn.parallel.DistributedDataParallel`](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html#distributeddataparallel) class for working with data split across multiple devices__

- __Works on single and multiple machines__
- Is currently the fastest of PyTorch's data-parallel implementations
- Works with Model Parallel (see [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#combine-ddp-with-model-parallelism))
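
To make the wrapping step concrete, here is a minimal sketch of one `DistributedDataParallel` training step. It rests on several assumptions that are not part of the original notes: the `gloo` backend (so it also runs on CPU, whereas `nccl` is the usual choice for GPUs), a file-based `init_method`, and a single process with `world_size=1`. The function name `train_step` is hypothetical.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


def train_step(rank, world_size, init_file):
    # Assumption: gloo + file-based rendezvous keeps the sketch CPU-friendly
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    try:
        model = nn.Linear(10, 5)
        # DDP synchronizes gradients across processes during backward()
        ddp_model = DDP(model)
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

        inputs, labels = torch.randn(20, 10), torch.randn(20, 5)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(ddp_model(inputs), labels)
        loss.backward()
        optimizer.step()
        return loss.item()
    finally:
        # Always release the process group, even if the step fails
        dist.destroy_process_group()


# Assumption: one in-process "worker"; real runs start one process per device
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
loss_value = train_step(rank=0, world_size=1, init_file=init_file)
```

In a real multi-GPU run, one process per GPU would typically be started with `torchrun` or `torch.multiprocessing.spawn`, and each process would move its model replica to its own device before wrapping it, for example `DDP(model.to(rank), device_ids=[rank])`.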

### DistributedDataParallel vs DataParallel

__Never use [`torch.nn.DataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html), because:__
- `DataParallel` works only on a single machine
- It __runs multiple threads in a single process, not multiple processes__, so it is constrained by Python's GIL (Global Interpreter Lock)
- Because of this, it is usually slower than `DistributedDataParallel`, even on a single machine
- __Does not work with Model Parallel__
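
Because `DistributedDataParallel` uses one process per device, each process has to load its own slice of the dataset. A common way to do this is `torch.utils.data.distributed.DistributedSampler`. The sketch below simulates two ranks inside one process by passing `num_replicas` and `rank` explicitly. The toy dataset and the `world_size` value are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical toy dataset: eight samples holding the values 0..7
dataset = TensorDataset(torch.arange(8))
world_size = 2  # assumption: two processes, one per GPU

shards = {}
for rank in range(world_size):
    # Passing num_replicas and rank explicitly avoids needing a process group here
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    # Collect every sample value this rank would see during one epoch
    shards[rank] = [int(v) for (batch,) in loader for v in batch]
```

With shuffling disabled, rank 0 receives samples 0, 2, 4, 6 and rank 1 receives samples 1, 3, 5, 7, so the two processes never train on the same sample within an epoch. When shuffling is enabled, calling `sampler.set_epoch(epoch)` at the start of each epoch changes the ordering between epochs.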

# Challenges

## Assessment 

- Check [PyTorch Pipeline parallelism](https://pytorch.org/docs/stable/pipeline.html) for a more sensible and automated approach to splitting a model across devices

## Non-assessment

- Check out the [Model Parallel tutorial](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html) to see how to use pretrained, ready-made models with multiple GPUs
- Check out [Getting Started with RPC](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html) for Model Parallel across multiple machines