# Train with PyTorch Lightning

description: train PyTorch Lightning models on a single node, including single-node multi-GPU

In [None]:
!pip install --upgrade tensorboard azureml-tensorboard

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

In [None]:
# training script
source_dir = "src"
script_name = "train.py"

# environment file
environment_file = "environment.yml"

# azure ml settings
environment_name = "pt-lightning"
experiment_name = "pt-lightning-tutorial"
compute_name = "gpu-K80-2"

## Create environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment. The dependencies for this tutorial include **torch**, **torchvision**, and **pytorch-lightning**.

Since this example is for GPU training, you will need to specify a GPU base image that has the necessary dependencies. Azure ML maintains a set of base images published on Microsoft Container Registry (MCR) that you can use. See the [Azure/AzureML-Containers](https://github.com/Azure/AzureML-Containers) GitHub repo for more information.

Azure ML will build a conda environment with the dependencies you specified in your .yml file on the base image.
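As a rough illustration of what `environment.yml` might contain, the sketch below renders a minimal conda specification from Python. Only **torch**, **torchvision**, and **pytorch-lightning** come from this tutorial. The helper name `render_conda_spec`, the `conda-forge` channel, the Python version, the `azureml-defaults` package, and the lack of version pins are all assumptions made for illustration, so adjust them to match your own setup.

```python
from typing import Sequence


def render_conda_spec(name: str, pip_packages: Sequence[str]) -> str:
    """Return the text of a minimal conda environment file (hypothetical helper)."""
    lines = [
        f"name: {name}",
        "channels:",
        "  - conda-forge",  # assumption: any channel that provides python and pip works
        "dependencies:",
        "  - python=3.8",  # assumption: pick the version your training script needs
        "  - pip",
        "  - pip:",
    ]
    # Each pip package becomes one entry in the nested pip list.
    lines += [f"      - {pkg}" for pkg in pip_packages]
    return "\n".join(lines) + "\n"


spec = render_conda_spec(
    "pt-lightning",
    # azureml-defaults is an assumption; the other three are named in this tutorial.
    ["azureml-defaults", "torch", "torchvision", "pytorch-lightning"],
)
print(spec)

# To use the result with the next cell, write it to the path in `environment_file`:
# with open("environment.yml", "w") as f:
#     f.write(spec)
```

In practice you would usually keep `environment.yml` as a checked-in file and pin package versions so that runs are reproducible.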

In [None]:
from azureml.core import Environment

env = Environment.from_conda_specification(environment_name, environment_file)

# specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)

Alternatively, you can just capture all your dependencies directly in a custom Docker image or Dockerfile, and create your environment from that. For more information, see [Train with custom image](https://docs.microsoft.com/azure/machine-learning/how-to-train-with-custom-image).

## Configure and run training job
Create a `ScriptRunConfig` to specify the training script and its arguments, the environment, and the cluster to run on.

For single-node, single-GPU training, pass `1` to the `--gpus` command-line argument expected by Lightning.
You do not need to define this flag by hand in your training script, because Lightning can add its `Trainer` arguments to your argument parser for you (in Lightning 1.x, via `Trainer.add_argparse_args`). The training script parses the command-line arguments and passes them to the [`Trainer()`](https://pytorch-lightning.readthedocs.io/en/stable/trainer.html?highlight=Trainer).

Lightning handles all the NVIDIA flags for you, so there is no need to set them yourself.
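The contents of `train.py` are not shown in this notebook, so the following is only a sketch of how such a script might parse the arguments submitted in the next cell. It registers the flags by hand with `argparse` so that it runs without Lightning installed. The function name `parse_trainer_args` and the default values are hypothetical.

```python
import argparse
from typing import List, Optional


def parse_trainer_args(argv: Optional[List[str]] = None) -> dict:
    """Parse the flags used in this tutorial into keyword arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Lightning training script (sketch)")
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--gpus", type=int, default=0)
    parser.add_argument("--accelerator", type=str, default=None)
    args = parser.parse_args(argv)
    # Drop unset options so that only explicitly meaningful values are forwarded.
    return {key: value for key, value in vars(args).items() if value is not None}


# Azure ML passes the `arguments` list to the script on the command line,
# so the values arrive as strings and argparse converts them back to integers.
trainer_kwargs = parse_trainer_args(["--max_epochs", "25", "--gpus", "1"])
print(trainer_kwargs)

# In a real train.py these would then be forwarded, for example:
#   trainer = pytorch_lightning.Trainer(**trainer_kwargs)
```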

In [None]:
from azureml.core import ScriptRunConfig, Experiment

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=["--max_epochs", 25, "--gpus", 1],
    compute_target=compute_name,
    environment=env,
)

run = Experiment(ws, experiment_name).submit(src)
run

In [None]:
run.wait_for_completion(show_output=True)

### Single-node multi-GPU training

Lightning supports several [distributed modes](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#distributed-modes) for training. Of these, DistributedDataParallel (DDP) is recommended over DataParallel (DP).

For multi-GPU training on a single node, pass Lightning two arguments: `--gpus`, the number of GPUs to train on (typically the number of GPUs in your cluster's SKU), and `--accelerator`, the distributed mode, in this case DistributedDataParallel (`"ddp"`). The Lightning implementation of DDP starts the individual processes on each GPU for you. See the Lightning [Multi-GPU](https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html) training documentation for more information.
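The single-GPU and multi-GPU submissions differ only in their argument lists. The sketch below captures that difference in one small function. The helper `lightning_arguments` is hypothetical and not part of Azure ML or Lightning; it only builds the list that you would pass as `arguments` to `ScriptRunConfig`.

```python
from typing import List


def lightning_arguments(max_epochs: int, gpus: int) -> List[str]:
    """Build the command-line arguments for a single-node run (hypothetical helper)."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1 for GPU training")
    arguments = ["--max_epochs", str(max_epochs), "--gpus", str(gpus)]
    if gpus > 1:
        # With more than one GPU, ask Lightning for DistributedDataParallel.
        arguments += ["--accelerator", "ddp"]
    return arguments


print(lightning_arguments(25, 1))
print(lightning_arguments(25, 2))
```

Building the list in one place keeps the two submissions consistent if you later change the number of epochs or move to a SKU with more GPUs.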

In [None]:
from azureml.core import ScriptRunConfig, Experiment

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=["--max_epochs", 25, "--gpus", 2, "--accelerator", "ddp"],
    compute_target=compute_name,
    environment=env,
)

run = Experiment(ws, experiment_name).submit(src)
run

You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=True)