# Multi-GPU

## What's the point of multi-GPU?

Multiple GPUs are worth using when model training takes up a significant fraction of the total pipeline execution time.
If several GPUs are available, you can train the model on all of them at once, which speeds up training.

The `device` parameter allows training a model on multiple GPUs: a copy of the model is created on each selected GPU.
Each batch is then split across the selected GPUs, gradients are computed separately on each of them and then averaged on one device (usually the first of the selected GPUs).
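
The split-and-average scheme described above can be illustrated without any GPU. The following is a minimal sketch in plain Python, assuming a toy one-parameter model `y = w * x` with a mean squared error loss; the helper names `mse_grad` and `split` are hypothetical and are not part of batchflow. It shows that averaging gradients computed on equal-sized shards gives the same result as computing the gradient on the whole batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split(items, n_parts):
    """Split a list into n_parts equal contiguous chunks (hypothetical helper)."""
    if len(items) % n_parts:
        raise ValueError("batch size must be divisible by the number of devices")
    size = len(items) // n_parts
    return [items[i * size:(i + 1) * size] for i in range(n_parts)]


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
n_devices = 2

# Each "device" computes the gradient on its own shard of the batch.
shard_grads = [mse_grad(w, sx, sy)
               for sx, sy in zip(split(xs, n_devices), split(ys, n_devices))]

# The per-device gradients are then averaged in one place.
averaged = sum(shard_grads) / n_devices

# With equal-sized shards this matches the full-batch gradient.
full = mse_grad(w, xs, ys)
print(shard_grads, averaged, full)
```

The equality holds only because the shards have the same size, which is one way to see why an even split of the batch across devices matters.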

Initialization of a large model on a large number of GPUs may take some time (minutes or tens of minutes)!

In [1]:
import os
import sys
import warnings

sys.path.append('../../..')
from batchflow import Pipeline, B, C, V, D
from batchflow.opensets import Imagenette320
from batchflow.models.tf import ResNet18

Specify which GPU(s) to use. See the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) for details.

In [2]:
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=3,4,5,6,7

env: CUDA_DEVICE_ORDER=PCI_BUS_ID
env: CUDA_VISIBLE_DEVICES=3,4,5,6,7
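
Outside a notebook, the same variables can be set from Python before the deep-learning framework initializes CUDA. The sketch below also shows how the visible devices are renumbered from zero, so that `'GPU:0'` refers to the first device in the list rather than to physical GPU 0. The `logical_to_physical` dictionary is only an illustration built from the variable's value; it is not a batchflow or CUDA API.

```python
import os

# Must be set before the framework creates its first CUDA context.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,5,6,7"

# Visible devices are renumbered starting from zero, in the order listed.
physical_ids = [int(i) for i in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
logical_to_physical = {f"GPU:{k}": pid for k, pid in enumerate(physical_ids)}

print(logical_to_physical)
# {'GPU:0': 3, 'GPU:1': 4, 'GPU:2': 5, 'GPU:3': 6, 'GPU:4': 7}
```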


## Create a dataset, define a default model config

In [3]:
dataset = Imagenette320(bar=True)

model_config = {'inputs/images/shape': B.image_shape,
                'inputs/labels/classes': D.num_classes,
                'initial_block/inputs': 'images',
                'device': C('device'),
                'microbatch': C('microbatch')}

 50%|█████     | 1/2 [00:15<00:15, 15.82s/it]


In [4]:
BATCH_SIZE = 64

# Train model on single GPU

**By default, if one or more GPUs are visible to a pipeline model, the model uses only the first GPU!** Here it is `'GPU:0'`. You can also set the device explicitly, as discussed later.

In [5]:
config = {'microbatch': None, 'device': None}

template = (Pipeline()
            .init_variable('loss_history', [])
            .init_model('dynamic', ResNet18, 'conv_nn', config=model_config)
            .resize((320, 320))
            .to_array()
            .train_model('conv_nn', fetches='loss',
                         images=B.images, labels=B.labels,
                         save_to=V('loss_history', mode='a'), use_lock=True))

pipeline_single = template << dataset.train << config

Most of the execution time of the next cell is spent on model initialization. We will compare the initialization time on one GPU and on multiple GPUs later.

In [6]:
%%time
pipeline_single.next_batch(BATCH_SIZE)

CPU times: user 12.2 s, sys: 4.21 s, total: 16.4 s
Wall time: 16.8 s


<batchflow.batch_image.ImagesBatch at 0x7f5b15dbf630>

In [7]:
pipeline_single.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [01:29<00:00,  2.12it/s]


<batchflow.pipeline.Pipeline at 0x7f5b153dd0f0>

# Add GPUs

We can use the `device` parameter to train the model on two GPUs:

In [8]:
config.update({'device': ['GPU:1', 'GPU:2']})

The `device` parameter can be a string, a list of strings, or a regular expression.

Example:
```python
'device': 'GPU:0'                     # Use only GPU:0
'device': ['GPU:0', 'GPU:1', 'GPU:2'] # Use GPU:0, GPU:1 and GPU:2
'device': 'GPU:*'                     # Use all available GPUs
```

> **Batch size must be divisible by the number of devices!** \
**If `microbatch` is on, microbatch size must be divisible by the number of devices!**
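
The divisibility rule can be checked before a pipeline is built. The following is a small sketch, assuming that `microbatch` is the microbatch size in items (as the rule above implies); `per_device_size` is a hypothetical helper written for this illustration and is not a batchflow function.

```python
def per_device_size(batch_size, n_devices, microbatch=None):
    """Return how many items each device receives per forward/backward pass.

    Hypothetical helper: the chunk that gets split across devices is the
    microbatch if one is set, otherwise the whole batch.
    """
    chunk = batch_size if microbatch is None else microbatch
    if chunk % n_devices != 0:
        raise ValueError(
            f"{chunk} items cannot be split evenly across {n_devices} devices")
    return chunk // n_devices


print(per_device_size(64, 2))                 # 32
print(per_device_size(64, 2, microbatch=2))   # 1

try:
    per_device_size(64, 3)
except ValueError as error:
    print(error)  # 64 items cannot be split evenly across 3 devices
```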

# Train model on multiple GPUs

In [10]:
pipeline_multi = template << dataset.train << config

In [11]:
%%time
pipeline_multi.next_batch(BATCH_SIZE)

CPU times: user 28.6 s, sys: 3.17 s, total: 31.7 s
Wall time: 30.2 s


<batchflow.batch_image.ImagesBatch at 0x7f595050b6a0>

In [12]:
pipeline_multi.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [01:01<00:00,  3.06it/s]


<batchflow.pipeline.Pipeline at 0x7f5950496e80>

Training takes about 1 minute on two GPUs, compared with about 1.5 minutes on one. Adding more GPUs is expected to reduce the training time further.

# Multi-GPU and microbatching

### Schematic illustration of how batches are distributed across GPUs

<img src="./img/Batch_microbatch_GPU.png" width="700">

We can use `microbatch` and `device` at the same time, which is useful when batches are very large.
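
One plausible reading of this scheme is that a batch is first cut into microbatches, which are processed one after another, and each microbatch is then split across the selected GPUs. The sketch below enumerates item indices under that assumption, using contiguous chunks; `plan_batch` is a hypothetical helper, and the actual order of items inside batchflow may differ.

```python
def plan_batch(batch_size, microbatch, n_devices):
    """Return a list of sequential steps.

    Each step is a list with one entry per device, holding the indices of
    the batch items that device would process (assumed contiguous split).
    Assumes batch_size is divisible by microbatch and microbatch by n_devices.
    """
    per_device = microbatch // n_devices
    steps = []
    for start in range(0, batch_size, microbatch):
        step = [list(range(start + d * per_device, start + (d + 1) * per_device))
                for d in range(n_devices)]
        steps.append(step)
    return steps


plan = plan_batch(batch_size=8, microbatch=4, n_devices=2)
for i, step in enumerate(plan):
    print(i, step)
# 0 [[0, 1], [2, 3]]
# 1 [[4, 5], [6, 7]]

# Settings used in this tutorial: batch of 64, microbatch of 2, two GPUs.
tutorial_plan = plan_batch(batch_size=64, microbatch=2, n_devices=2)
print(len(tutorial_plan), len(tutorial_plan[0][0]))  # 32 1
```

Under this reading, the tutorial's settings give 32 sequential steps per batch with a single item per GPU in each step.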

Add microbatching.

In [13]:
config.update({'device': ['GPU:3', 'GPU:4'], 'microbatch': 2})

# Train model with multiple GPUs and microbatching

In [14]:
pipeline_multi_micro = template << dataset.train << config

In [15]:
%%time
pipeline_multi_micro.next_batch(BATCH_SIZE)

CPU times: user 31.9 s, sys: 3.4 s, total: 35.3 s
Wall time: 32.4 s


<batchflow.batch_image.ImagesBatch at 0x7f55507819b0>

In [16]:
pipeline_multi_micro.run(BATCH_SIZE, shuffle=True, n_epochs=1, bar=True, drop_last=True, prefetch=10)

100%|██████████| 197/197 [02:56<00:00,  1.07it/s]


<batchflow.pipeline.Pipeline at 0x7f5550781b00>

Training finishes without errors, which means that `device` and `microbatch` can be used together.

Let's compare the training times of the three models: one GPU, two GPUs, and two GPUs with microbatching. Adding a second GPU reduced the training time by a factor of about 1.5. It did not halve the time because of the overhead of exchanging information between GPUs. Adding microbatching to the two-GPU model, in turn, increased the training time (more about that in the [01_microbatch tutorial](./01_microbatch.ipynb)).
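
The comparison can be reproduced from the epoch times shown by the progress bars above (01:29, 01:01 and 02:56). The snippet is a plain arithmetic sketch; `to_seconds` is a hypothetical helper, and the timings come from a single run, so they are indicative only.

```python
def to_seconds(mmss):
    """Convert a 'MM:SS' string from a progress bar into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Epoch times copied from the progress bars in this tutorial.
timings = {
    "1 GPU": "01:29",
    "2 GPUs": "01:01",
    "2 GPUs + microbatch": "02:56",
}

baseline = to_seconds(timings["1 GPU"])
for name, elapsed in timings.items():
    secs = to_seconds(elapsed)
    print(f"{name}: {secs} s, speedup x{baseline / secs:.2f}")
# 1 GPU: 89 s, speedup x1.00
# 2 GPUs: 61 s, speedup x1.46
# 2 GPUs + microbatch: 176 s, speedup x0.51
```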

As stated at the beginning, initialization takes more time on multiple GPUs than on one GPU: about 30 s versus about 17 s of wall time in the cells with the `%%time` cell magic.

Using multiple GPUs is a convenient way to speed up model training.

The next tutorial is about [different training procedures](./03_train_steps.ipynb).