# Multi-GPU Considerations

These days, training neural networks on multiple GPUs is often necessary.
As SpeechBrain is strongly linked to PyTorch, we provide the same multi-GPU utilities:

- *Data Parallel (DP)*
- *Distributed Data Parallel (DDP)*

One big difference between DP and DDP is that DP can only run on a single machine (with multiple GPUs), while DDP can also exploit GPUs across different servers.


## DataParallel
Data Parallel (DP) relies on a wrapper applied to neural network modules.
The wrapper can be applied as follows:

`nn_modules = torch.nn.DataParallel(nn_modules)`

DP uses four primitives to implement data parallelism:
1. *replicate:* replicates a module on multiple GPU devices.
2. *scatter:* distributes the input along its first dimension.
3. *parallel_apply:* applies a set of already-distributed inputs to a set of already-distributed models.
4. *gather:* gathers and concatenates the outputs.
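
To make the four steps concrete, here is a toy, CPU-only sketch in plain Python. It does not call PyTorch: the function names mirror the primitives but are hypothetical stand-ins, lists stand in for tensors, and threads stand in for GPUs. The ceiling-sized chunking in `scatter` is an assumption made for illustration.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def replicate(module, n_devices):
    """Make one independent copy of the module per (simulated) device."""
    return [copy.deepcopy(module) for _ in range(n_devices)]


def scatter(batch, n_devices):
    """Split the batch along its first dimension into at most n_devices chunks."""
    chunk_size = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]


def parallel_apply(replicas, chunks):
    """Run each replica on its own chunk concurrently."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda pair: pair[0](pair[1]), zip(replicas, chunks)))


def gather(outputs):
    """Concatenate the per-device outputs back into one batch."""
    return [item for chunk in outputs for item in chunk]


def data_parallel_forward(module, batch, n_devices):
    """Chain the four primitives into one forward pass."""
    replicas = replicate(module, n_devices)
    chunks = scatter(batch, n_devices)
    return gather(parallel_apply(replicas, chunks))


class Doubler:
    """Stand-in for a neural network module: doubles every input."""

    def __call__(self, chunk):
        return [2 * x for x in chunk]


result = data_parallel_forward(Doubler(), [1, 2, 3, 4, 5], n_devices=2)
print(result)
```

The output order matches the input order because `gather` concatenates the chunks in the same order in which `scatter` produced them.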


The common pattern for using Data Parallel in SpeechBrain is the following:


```
cd recipes/<dataset>/<task>/
python experiment.py params.yaml --data_parallel_backend=True --data_parallel_count=2
```

**IMPORTANT**: the batch size seen by each GPU is **batch_size / data_parallel_count**, so you should consider changing the `batch_size` value according to your needs.
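
As a quick check of this arithmetic, the following sketch computes the per-GPU share of a batch. The helper name `per_gpu_batch_size` is hypothetical and not part of SpeechBrain; it assumes the batch divides evenly and raises an error otherwise.

```python
def per_gpu_batch_size(batch_size, data_parallel_count):
    """Return batch_size / data_parallel_count, insisting on an even split."""
    if data_parallel_count < 1:
        raise ValueError("data_parallel_count must be at least 1")
    if batch_size % data_parallel_count != 0:
        raise ValueError("batch_size is not divisible by data_parallel_count")
    return batch_size // data_parallel_count


# With batch_size=8 and --data_parallel_count=2, each GPU sees 4 examples.
print(per_gpu_batch_size(8, 2))
```

To keep the per-GPU batch at its single-GPU value, multiply `batch_size` by `data_parallel_count`.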

## Distributed Data Parallel (DDP)

DDP implements data parallelism across different processes. This way, the GPUs do not have to be in the same server, which makes this solution much more flexible. However, the training routines must be written with multiple processes in mind.

In SpeechBrain, we have put considerable effort into making the code compliant with DDP. For instance, to avoid conflicts across processes, we developed the `run_on_main` function. It is called when critical operations, such as writing a file to disk, are performed, and it ensures that these operations run in a single process only. The other processes wait until the operation is completed.
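
The idea behind `run_on_main` can be illustrated with a toy simulation. This is not SpeechBrain's implementation: threads stand in for DDP processes, a `threading.Barrier` stands in for the distributed barrier, and the names `run_on_main_sketch` and `worker` are hypothetical.

```python
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)
written = []   # records which rank performed the critical operation
observed = {}  # what each rank sees after the barrier


def run_on_main_sketch(rank, func):
    """Run func on rank 0 only; every rank then waits at the barrier."""
    if rank == 0:
        func()
    barrier.wait()  # other ranks block here until rank 0 is done


def worker(rank):
    run_on_main_sketch(rank, lambda: written.append(rank))
    # Past the barrier, the result of the critical operation is visible to all.
    observed[rank] = list(written)


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(written)
```

Only the simulated rank 0 performs the operation, and no rank proceeds until it has finished.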

Using DDP in SpeechBrain is quite easy:

```
cd recipes/<dataset>/<task>/
python -m torch.distributed.launch --nproc_per_node=4 experiment.py hyperparams.yaml --distributed_launch=True --distributed_backend='nccl'
```

Where:
- `nproc_per_node` must be equal to the number of GPUs.
- `distributed_backend` is the backend that manages synchronization across processes (e.g., 'nccl', 'gloo'). Try switching the DDP backend if you have issues with nccl.

You can run the model on different servers with:


```
# Machine 1
cd recipes/<dataset>/<task>/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr machine_1_address --master_port 5555 experiment.py hyperparams.yaml --distributed_launch=True --distributed_backend='nccl'

# Machine 2
cd recipes/<dataset>/<task>/
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr machine_1_address --master_port 5555 experiment.py hyperparams.yaml --distributed_launch=True --distributed_backend='nccl'
```

Machine 1 will have 2 subprocesses (subprocess 1 with local_rank=0, rank=0, and subprocess 2 with local_rank=1, rank=1). Machine 2 will have 2 subprocesses (subprocess 1 with local_rank=0, rank=2, and subprocess 2 with local_rank=1, rank=3).


To use DDP, you should launch your script with `torch.distributed.launch`, which sets up each subprocess with the right environment variables (`local_rank` and `rank`). The `local_rank` variable lets the program select the right device for each DDP subprocess, while the `rank` variable (which is unique for each subprocess) is used to register the subprocess in the DDP group. In this way, we can manage multi-GPU training over multiple servers.
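
The relationship between `rank` and `local_rank` in the two-machine example can be sketched as follows. The function name `process_layout` is hypothetical, and the formula assumes that every node launches the same number of processes, as in that example.

```python
def process_layout(nnodes, nproc_per_node):
    """List (node_rank, local_rank, rank) for every subprocess."""
    layout = []
    for node_rank in range(nnodes):
        for local_rank in range(nproc_per_node):
            # The global rank is unique across all machines.
            rank = node_rank * nproc_per_node + local_rank
            layout.append((node_rank, local_rank, rank))
    return layout


for node_rank, local_rank, rank in process_layout(nnodes=2, nproc_per_node=2):
    print(f"node={node_rank} local_rank={local_rank} rank={rank}")
```

Here `local_rank` selects the GPU on the current machine, while `rank` identifies the process within the whole DDP group.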

Note that using DDP on different machines introduces a **communication overhead** that might slow down training (depending on how fast the connection between the machines is).


We would like to advise our users that, despite being more efficient, DDP is also more prone to unexpected bugs. Indeed, DDP is quite server-dependent, and some setups might generate errors with the PyTorch implementation of DDP. Future versions of PyTorch are expected to improve the stability of DDP.


