# Data Parallel Training with PyTorch DDP

In this notebook we will learn how to engineer a data parallel job using PyTorch Distributed Data Parallel (`DDP`). While SageMaker doesn’t support PyTorch DDP natively, it’s possible to run DDP training jobs on SageMaker. This notebook and the associated code assets provide such an implementation.

As a training task, we will fine-tune a pretrained ResNet18 model to classify ants and bees. We use the open-source **Hymenoptera dataset**. We use data parallelism to distribute the task between 2 `ml.p3.2xlarge` instances with a single GPU device each. Feel free to change the number and type of instances in the training cluster and observe how this changes training speed. Note that this is a small-scale training job and will not be indicative of training efficiency on real-life tasks.

We start with the necessary imports and basic SageMaker training configuration. Then we download, unzip, and upload the dataset to an Amazon S3 bucket. Note that these operations may take several minutes to complete.

In [1]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role() # replace it with role ARN if you are not using SageMaker Notebook or Studio environments.

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/pytorch-distribution-options'
print('Bucket:\n{}'.format(bucket))

Bucket:
sagemaker-us-east-1-941656036254


In [3]:
# Downloading dataset and unzipping it locally
! wget https://download.pytorch.org/tutorial/hymenoptera_data.zip
! unzip hymenoptera_data.zip

Archive:  hymenoptera_data.zip
   creating: hymenoptera_data/
   creating: hymenoptera_data/train/
   creating: hymenoptera_data/train/ants/
  inflating: hymenoptera_data/train/ants/0013035.jpg  
  inflating: hymenoptera_data/train/ants/1030023514_aad5c608f9.jpg  
  inflating: hymenoptera_data/train/ants/1095476100_3906d8afde.jpg  
  inflating: hymenoptera_data/train/ants/1099452230_d1949d3250.jpg  
  inflating: hymenoptera_data/train/ants/116570827_e9c126745d.jpg  
  inflating: hymenoptera_data/train/ants/1225872729_6f0856588f.jpg  
  inflating: hymenoptera_data/train/ants/1262877379_64fcada201.jpg  
  inflating: hymenoptera_data/train/ants/1269756697_0bce92cdab.jpg  
  inflating: hymenoptera_data/train/ants/1286984635_5119e80de1.jpg  
  inflating: hymenoptera_data/train/ants/132478121_2a430adea2.jpg  
  inflating: hymenoptera_data/train/ants/1360291657_dc248c5eea.jpg  
  inflating: hymenoptera_data/train/ants/1368913450_e146e2fb6d.jpg  
  inflating: hymenoptera_data/train/ants/147318

In [4]:
data_url = sagemaker_session.upload_data(path="./hymenoptera_data", key_prefix="hymenoptera_data")
print(f"S3 location of dataset {data_url}")

## Launching Training Processes

Amazon SageMaker has no out-of-the-box support for PyTorch DDP training. Specifically, it doesn’t know how to start distributed DDP processes in a training cluster. Hence, we need to develop a launcher utility to perform this function. This utility is quite simple and can be reused for any other DDP-based training job.

In the launcher script we use the PyTorch module `torch.distributed.run`, which simplifies spawning training processes in a cluster. As part of the launcher script, we need to collect information about the training world, specifically the number of nodes and GPU devices in the cluster, and identify the node that will act as the master coordinator. Then `torch.distributed.run` will spawn multiple training processes.


Let’s highlight several key areas in our launcher script. First, we need to collect information about the SageMaker training cluster. For this we use environment variables that SageMaker sets automatically.

```python
    nodes = json.loads(os.getenv("SM_HOSTS"))
    nnodes = len(nodes)
    node_rank = nodes.index(os.getenv("SM_CURRENT_HOST"))
    nproc_per_node = os.getenv("SM_NUM_GPUS", 1)
```

Next, we form the command line that starts `torch.distributed.run`:

```python
    cmd = [
        sys.executable,
        "-m",
        "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={str(nnodes)}",
        f"--node_rank={node_rank}",
        f"--rdzv_id={os.getenv('SAGEMAKER_JOB_NAME')}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={nodes[0]}:{RDZV_PORT}",
        distr_args.train_script,
    ]
    # Adding training hyperparameters which will be then passed in training script
    cmd.extend(training_hyperparameters)
```
Note that we append the training hyperparameters “as is” to the end of the command line. These arguments are handled not by the launcher but by the training script, which uses them to configure training. Lastly, we use Python's `subprocess.Popen` to start the `torch.distributed.run` utility as a module:

```python
    process = subprocess.Popen(cmd, env=os.environ)
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
```

Note that we pass the current environment variables to the subprocess to preserve all SageMaker variables. If the spawned process returns a non-zero exit code (an indication of an error), we raise an exception to propagate the error code to the SageMaker control plane.
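The error-propagation pattern can be tried in isolation. The sketch below wraps the same `Popen`/`wait` logic in a hypothetical helper, `run_and_propagate` (a name introduced here for illustration, not part of the launcher), and uses a short child Python process as a stand-in for `torch.distributed.run`.

```python
import os
import subprocess
import sys


def run_and_propagate(cmd):
    """Run cmd with the parent's environment and raise if it exits non-zero.

    Hypothetical helper for illustration; it mirrors the Popen/wait pattern above.
    """
    process = subprocess.Popen(cmd, env=os.environ)
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
    return process.returncode


if __name__ == "__main__":
    # Stand-in for the real training command: a child process that exits with code 3.
    failing_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
    try:
        run_and_propagate(failing_cmd)
    except subprocess.CalledProcessError as err:
        print(f"child failed with exit code {err.returncode}")
```

Because the exception is not caught in the launcher itself, the launcher process also exits with a non-zero code, which is how the failure becomes visible outside the container.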

To summarize, our launcher utility is responsible for collecting the training cluster configuration and then starting `torch.distributed.run` on each node. `torch.distributed.run` then takes care of starting the training processes on that node.
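The pieces of the launcher described above can be put together in one self-contained sketch. It is not the actual `launcher.py`: the function name `build_launch_command`, the use of `parse_known_args` to separate `--train-script` from the remaining hyperparameters, the port value, and the sample host names are all assumptions made for illustration. The environment is passed in as a plain dictionary so the logic can be run outside SageMaker.

```python
import json
import sys
from argparse import ArgumentParser

RDZV_PORT = 7777  # assumed value; the real launcher defines its own port constant


def build_launch_command(argv, env):
    """Build the torch.distributed.run command line for the current node.

    argv: launcher arguments, e.g. ["--train-script", "train_ddp.py", "--epochs", "25"].
    env:  mapping that holds the SageMaker variables used in the snippets above.
    """
    parser = ArgumentParser()
    parser.add_argument("--train-script", type=str, required=True)
    # Arguments the launcher does not recognize are kept and forwarded untouched.
    distr_args, training_hyperparameters = parser.parse_known_args(argv)

    nodes = json.loads(env["SM_HOSTS"])
    node_rank = nodes.index(env["SM_CURRENT_HOST"])
    nproc_per_node = env.get("SM_NUM_GPUS", 1)

    cmd = [
        sys.executable,
        "-m",
        "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={len(nodes)}",
        f"--node_rank={node_rank}",
        f"--rdzv_id={env.get('SAGEMAKER_JOB_NAME')}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={nodes[0]}:{RDZV_PORT}",
        distr_args.train_script,
    ]
    # Training hyperparameters go last so the training script receives them.
    cmd.extend(training_hyperparameters)
    return cmd


if __name__ == "__main__":
    # Sample values for a two-node cluster, as seen from the second node.
    fake_env = {
        "SM_HOSTS": '["algo-1", "algo-2"]',
        "SM_CURRENT_HOST": "algo-2",
        "SM_NUM_GPUS": "1",
        "SAGEMAKER_JOB_NAME": "ddp-demo",
    }
    argv = ["--train-script", "train_ddp.py", "--epochs", "25"]
    print(" ".join(build_launch_command(argv, fake_env)))
```

Every node builds the same rendezvous endpoint from the first host in `SM_HOSTS`, so all nodes agree on the coordinator without any extra communication.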

Run the cell below to review the full listing of the launcher utility.

In [15]:
!pygmentize 2_sources/launcher.py

# This module gathers requirements parameters of pytorch distirbuted training world
# from environmental variable propagated by DSP for Pytorch Distributed job type.

# The module is intended to be light-weight and rely exclusively on native torch distributed utility:
# https://github.com/pytorch/pytorch/blob/master/torch/distributed/run.py


from argparse import ArgumentParser


import sys
import subprocess
import os
from argparse import ArgumentParser, REMAINDER
import logging
import json

logging.basicConfig(level=logging.DEBUG)
LOGGER = logging.getLogger(__name__)

# port for distributed DDP processes to communic

## Adapting Training Script For DDP

To use DDP, we need to make minimal changes to our training script. First, we initialize the training process and add it to the DDP process group:

```python
dist.init_process_group(
    backend="nccl",
    rank=int(os.getenv("RANK", 0)),
    world_size=int(os.getenv("WORLD_SIZE", 1)),
)
```

Since we have GPU-based instances, we use the `NCCL` communication backend. We also use the environment variables set by the `torch.distributed.run` module: world size (`WORLD_SIZE`) and global rank (`RANK`).

Next, we need to identify which GPU device will store the model and run computations. We use the `LOCAL_RANK` environment variable, which `torch.distributed.run` sets when it spawns the process.

```python
# Environment variables are strings, so convert LOCAL_RANK to an int device index
torch.cuda.set_device(int(os.getenv("LOCAL_RANK", 0)))
device = torch.device("cuda")
model = model.to(device)
```
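Environment variables always arrive as strings, so it helps to convert them once, in one place. The sketch below is one possible way to do that; `DistEnv` and `read_dist_env` are hypothetical names introduced here, and the defaults assume a single-process run started without `torch.distributed.run`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank of this process across all nodes
    world_size: int  # total number of training processes
    local_rank: int  # index of this process (and its GPU) on the current node


def read_dist_env(env=os.environ):
    """Read RANK, WORLD_SIZE and LOCAL_RANK as integers.

    The defaults describe a single-process run without torch.distributed.run.
    """
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
    )


if __name__ == "__main__":
    # Values the process on the second single-GPU node would see in a 2-process job.
    print(read_dist_env({"RANK": "1", "WORLD_SIZE": "2", "LOCAL_RANK": "0"}))
```

With one GPU per node, as in this notebook, `local_rank` is 0 on every node while `rank` differs, which is why the device is selected by local rank and the data slice by global rank.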

We then wrap our regular PyTorch model in the DDP implementation. This wrapper lets us work with the model as if it were a regular, locally stored model. Under the hood, DDP synchronizes gradients between the training processes in the process group.
```python
model = DDP(model)
```
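To build intuition for what gradient synchronization means, here is a toy, pure-Python simulation of its result: each process computes gradients on its own data slice, and after synchronization every process holds the element-wise average. This is only an illustration of the outcome. It is not how DDP is implemented, and the function name `all_reduce_mean` is introduced here for the example.

```python
def all_reduce_mean(grads_per_rank):
    """Average gradients element-wise across ranks.

    grads_per_rank: one list of floats per training process, all of equal length.
    Returns the gradient list every rank holds after synchronization.
    """
    world_size = len(grads_per_rank)
    return [sum(values) / world_size for values in zip(*grads_per_rank)]


if __name__ == "__main__":
    # Two processes computed different gradients on their own data slices.
    rank0_grads = [0.5, -1.0, 2.0]
    rank1_grads = [1.5, 1.0, 0.0]
    print(all_reduce_mean([rank0_grads, rank1_grads]))  # prints [1.0, 0.0, 1.0]
```

Since every process applies the same averaged gradients, the model replicas stay identical after each optimizer step.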

As the last step, we need to modify the training data loader so that each training process gets a unique slice of data during each training step. For this, we use `DistributedSampler`, which samples data records based on the total number of processes (`world_size` variable) and the global rank (`rank` variable) of the given training process:

```python
    # Note that we pass the global rank to the data sampler to get a unique data slice
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        image_datasets["train"], num_replicas=args.world_size, rank=args.rank
    )
    train_loader = torch.utils.data.DataLoader(
        image_datasets["train"],
        batch_size=args.batch_size,
        shuffle=False,
        num_workers=0,
        pin_memory=True,
        sampler=train_sampler,
    ) 
```
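The partitioning idea behind the sampler can be shown without PyTorch. The sketch below is a simplified, pure-Python illustration for the non-shuffled case: the index list is padded so it divides evenly, and each rank then takes every `num_replicas`-th index starting at its own rank. The function name `shard_indices` is hypothetical, and the sketch assumes the dataset has at least as many records as there are processes.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Indices one rank receives when shuffling is off and no samples are dropped.

    Simplified illustration of the partitioning idea, not the library code.
    """
    if dataset_len < num_replicas:
        raise ValueError("this sketch assumes dataset_len >= num_replicas")
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating leading indices so every rank gets the same number of samples.
    indices += indices[: total_size - dataset_len]
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    for rank in range(2):
        print(rank, shard_indices(dataset_len=7, num_replicas=2, rank=rank))
    # prints:
    # 0 [0, 2, 4, 6]
    # 1 [1, 3, 5, 0]
```

`DistributedSampler` shuffles by default. When shuffling is on, calling `train_sampler.set_epoch(epoch)` at the start of each epoch makes the ordering differ between epochs.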
You can review the full listing of the training script by running the cell below.


In [16]:
!pygmentize 2_sources/train_ddp.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m division, print_function

[37m# Common imports[39;49;00m
[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mrandom[39;49;00m
[34mfrom[39;49;00m [04m[36mwebbrowser[39;49;00m [34mimport[39;49;00m get

[37m# Third Party imports[39;49;00m
[34mimport[39;49;00m [04m[36mnumpy[39;49;00m [34mas[39;49;00m [04m[36mnp[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mdistributed[39;49;00m [34mas[39;49;00m [04m[36mdist[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m [34mas[39;49;00m [04m[36mnn[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mnn[39;49;00m[04m[36m.[39;49;00m[04m[36mfunctional[39;49;00m [34mas[39;49;00

In [14]:
from sagemaker.pytorch import PyTorch

# Two single-GPU instances; change the type or count to see how training speed changes
ps_instance_type = 'ml.p3.2xlarge'
ps_instance_count = 2

#distribution = {'parameter_server': {
#                    'enabled': True}
#                }
hyperparameters = {
  'train-script': 'train_ddp.py',
  'epochs': 25,
  #'batch-size-per-device' : 16,
  #'steps-per-epoch': 100
  }

estimator_ms = PyTorch(
                       source_dir='2_sources',
                       entry_point='launcher.py', 
                       role=role,
                       framework_version='1.9',
                       py_version='py38',
                       disable_profiler=True,
                       debugger_hook_config=False,
                       hyperparameters=hyperparameters,
                       instance_count=ps_instance_count, 
                       instance_type=ps_instance_type,
                       )

estimator_ms.fit(inputs={"train":f"{data_url}/train", "val":f"{data_url}/val"})

2022-04-22 20:04:35 Starting - Starting the training job......
2022-04-22 20:05:23 Starting - Preparing the instances for training......
2022-04-22 20:06:36 Downloading - Downloading input data...
2022-04-22 20:06:57 Training - Downloading the training image.......................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-04-22 20:10:46,664 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-22 20:10:46,687 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-04-22 20:10:46,696 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-04-22 20:10:46,649 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-04-22 20:10:46,670 sagemaker_pytorch_containe