# Multi-node distributed training with PyTorch Lightning

description: multi-node, multi-GPU distributed training with PyTorch Lightning and DistributedDataParallel (DDP)

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

In [None]:
import git
from pathlib import Path

# get root of git repo
prefix = Path(git.Repo(".", search_parent_directories=True).working_tree_dir)

# training script
source_dir = prefix.joinpath(
    "code", "train", "pytorch-lightning", "mnist-autoencoder"
)
script_name = "train-multi-node.py"

# environment file
environment_file = prefix.joinpath("environments", "pt-lightning.yml")

# azure ml settings
environment_name = "pt-lightning"
experiment_name = "pt-lightning-ddp-example"
cluster_name = "gpu-K80-2"

## Create environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment. This notebook will use the same environment definition that was used for part 1 of the tutorial. The dependencies for this tutorial include **mlflow** and **azureml-mlflow**.

In [None]:
from azureml.core import Environment

env = Environment.from_conda_specification(environment_name, environment_file)

# specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)

Alternatively, you can just capture all your dependencies directly in a custom Docker image or Dockerfile, and create your environment from that. For more information, see [Train with custom image](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-custom-image).

## Adapt training script to set required env vars

To run multi-node jobs, Lightning requires the following environment variables to be set on each node in your cluster:

* `MASTER_ADDR`: IP address of rank 0 node
* `MASTER_PORT`: free port on rank 0 node
* `NODE_RANK`: global rank of the node (from 0 to N-1, where N is the total number of nodes)

Since Azure ML does not currently set these environment variables, we will write a utility script, *azureml_env_adapter.py*, that sets them from the OpenMPI environment variables available on each node. Import the `set_environment_variables()` function from the utility script into your training script and call it at the beginning of the training script (in this case, inside the `cli_main()` function).
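The following is a minimal sketch of what such an adapter might look like, not the actual contents of *azureml_env_adapter.py*. It assumes that OpenMPI exposes the process rank as `OMPI_COMM_WORLD_RANK`, and that the address of the rank 0 node is available in a variable named `AZ_BATCH_MASTER_NODE` in `host:port` form; the latter name, the default port of 6105, and the optional `env` parameter (added so the function can be tried with a plain dictionary) are assumptions made for illustration.

```python
import os


def set_environment_variables(env=None, default_port=6105):
    """Derive MASTER_ADDR, MASTER_PORT and NODE_RANK from MPI-provided variables.

    `env` defaults to os.environ; passing a dict makes the function easy to try out.
    """
    env = os.environ if env is None else env

    # Assumption: AZ_BATCH_MASTER_NODE holds "<ip>:<port>" for the rank 0 node.
    env["MASTER_ADDR"] = env["AZ_BATCH_MASTER_NODE"].split(":")[0]

    # Keep an existing MASTER_PORT; otherwise fall back to the (arbitrary) default.
    env.setdefault("MASTER_PORT", str(default_port))

    # With one MPI process per node, the OpenMPI world rank doubles as the node rank.
    env["NODE_RANK"] = env["OMPI_COMM_WORLD_RANK"]
    return env


# Try it with a fake environment for the second of two nodes.
example = set_environment_variables(
    {"AZ_BATCH_MASTER_NODE": "10.0.0.4:6000", "OMPI_COMM_WORLD_RANK": "1"}
)
print(example["MASTER_ADDR"], example["MASTER_PORT"], example["NODE_RANK"])
```

In a training script, the function would be called with no arguments so that it writes to the real process environment before the `Trainer` is created.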

In a future release, Azure ML will set these environment variables automatically for PyTorch jobs, at which point this adapter code will no longer be necessary. Once this is available, we will update this tutorial.

## Configure and run training job
Create a `ScriptRunConfig` to specify the training script and its arguments, the environment, and the cluster to run on.

Lightning supports several [distributed modes](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-modes) for training. DistributedDataParallel (DDP) is recommended over DataParallel (DP) for training.

For multi-node, specify the number of GPUs per node to train on (typically this will correspond to the number of GPUs in your cluster's SKU) and the distributed mode, in this case DistributedDataParallel ("ddp"). In addition, specify the number of nodes to use for distributed training. PyTorch Lightning expects these as arguments `--gpus`, `--accelerator` and `--num_nodes`, respectively. See their [Multi-GPU DistributedDataParallel](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-data-parallel) training documentation for more information. Note that you do not need to define these flags manually in your training script as Lightning can add them automatically. The training script parses the command-line arguments and passes them to the [`Trainer()`](https://pytorch-lightning.readthedocs.io/en/stable/trainer.html?highlight=Trainer).
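To illustrate how these flags reach the training script, here is a stand-in that uses only the standard library's `argparse`. In the real script Lightning registers the `Trainer` flags for you (in Lightning 1.x releases this is typically done with `Trainer.add_argparse_args(parser)`); the flags are declared by hand below only so that the sketch runs without Lightning installed. The helper name `parse_training_args` and the default values are hypothetical.

```python
import argparse


def parse_training_args(argv):
    """Parse the distributed-training flags described above (hand-declared stand-in)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--gpus", type=int, default=0)         # GPUs per node
    parser.add_argument("--accelerator", type=str, default=None)  # e.g. "ddp"
    parser.add_argument("--num_nodes", type=int, default=1)
    return parser.parse_args(argv)


# Command-line arguments always arrive as strings; argparse converts them.
args = parse_training_args(
    ["--max_epochs", "50", "--gpus", "2", "--accelerator", "ddp", "--num_nodes", "2"]
)

# These keyword arguments are what the training script would hand to Trainer().
trainer_kwargs = vars(args)
print(trainer_kwargs)
```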

### Azure ML distributed job configuration
For Azure ML to launch the multi-node job, define an `MpiConfiguration` with a `node_count` value that matches the value you passed to your training script's `--num_nodes` argument. Leave `process_count_per_node` at its default value of 1, so there is no need to specify it explicitly here. Even though this is a multi-node, multi-GPU job, Azure ML launches only one process per node, because Lightning handles launching the additional processes for each GPU.
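The division of labor between Azure ML and Lightning can be checked with a little arithmetic. The sketch below assumes DDP's usual one-training-process-per-GPU layout; the helper name `process_layout` is hypothetical and is not part of either library.

```python
def process_layout(num_nodes, gpus_per_node, process_count_per_node=1):
    """Count who launches which processes in a multi-node DDP job."""
    # Azure ML (via MPI) starts this many processes across the cluster.
    launched_by_azureml = num_nodes * process_count_per_node
    # DDP runs one training process per GPU, so this is the total world size.
    world_size = num_nodes * gpus_per_node
    # Lightning starts whatever is still missing on each node.
    launched_by_lightning = world_size - launched_by_azureml
    return {
        "launched_by_azureml": launched_by_azureml,
        "launched_by_lightning": launched_by_lightning,
        "world_size": world_size,
    }


# Two nodes with two GPUs each, as in the job configured in this notebook.
layout = process_layout(num_nodes=2, gpus_per_node=2)
print(layout)
```

For the two-node, two-GPU configuration used here, Azure ML starts two processes (one per node) and Lightning starts the remaining two, for a world size of four.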

In [None]:
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

cluster = ws.compute_targets[cluster_name]

num_nodes = 2
src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=[
        "--max_epochs",
        50,
        "--gpus",
        2,
        "--accelerator",
        "ddp",
        "--num_nodes",
        num_nodes,
    ],
    compute_target=cluster,
    environment=env,
    distributed_job_config=MpiConfiguration(node_count=num_nodes),
)

run = Experiment(ws, experiment_name).submit(src)
run

You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()