# Introduction


In this notebook, we will explore distributed elastic training using PyTorch, with a focus on BERT pretraining using a large model size. We will also showcase the cost benefits of using low-cost, low-priority VMs on Azure Machine Learning.

This notebook is intended for data scientists and machine learning practitioners who are familiar with PyTorch, and want to learn how to use Azure Machine Learning to run distributed elastic training jobs on Azure.


## Requirements/Prerequisites
- An Azure account with an active subscription [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- Azure Machine Learning workspace [Configure workspace](../../../configuration.ipynb) 
- Python Environment
- Install Azure ML Python SDK Version 2

## Learning Objectives
- Connect to workspace using Python SDK v2
- Distributed training of a PyTorch model using Torch Elastic


# 1. Setup and Dependencies

## 1.1 Import required libraries


In [1]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute

## 1.2 Connect to workspace using DefaultAzureCredential
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios. 

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [9]:
credential = DefaultAzureCredential()
ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your AML workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AML_WORKSPACE_NAME>"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

## 1.3 Create a low-priority compute cluster

In this section, we will create and configure a low-priority compute cluster on Azure Machine Learning. 

Training on [low-priority compute](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-optimize-cost#low-pri-vm) can cost much less than on dedicated compute.

When you create a compute cluster, you choose its priority: either _dedicated_ (the default) or _low priority_. (See [this API doc](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.amlcompute.amlcomputeprovisioningconfiguration?view=azure-ml-py) for details.)

A job using dedicated compute is granted uninterrupted access to a VM. In contrast, low-priority compute is temporary and may be preempted by higher priority jobs. When a low-priority job is preempted, it must stop running and wait for compute to become available again. When compute becomes available again, the job restarts from scratch on the new compute. To avoid starting from scratch every time a job is preempted and restarted, special handling is needed in the training script to save state between preemptions.
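The state-saving pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's training script: the file name `checkpoint.pkl`, the helper names, and the placeholder "weights" string are all assumptions, and a real PyTorch script would typically persist the model and optimizer state dicts instead.

```python
import os
import pickle
import tempfile

CHECKPOINT_NAME = "checkpoint.pkl"  # hypothetical file name


def save_checkpoint(directory, state):
    """Write state to a temp file, then rename it over the checkpoint.

    The rename is atomic, so a preemption in the middle of a write
    cannot leave a half-written checkpoint behind.
    """
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, os.path.join(directory, CHECKPOINT_NAME))


def load_checkpoint(directory):
    """Return the last saved state, or a fresh state if none exists."""
    path = os.path.join(directory, CHECKPOINT_NAME)
    if not os.path.exists(path):
        return {"epoch": 0, "weights": None}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(directory, total_epochs, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates a preemption."""
    state = load_checkpoint(directory)
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch == stop_after:
            return state  # simulated preemption before this epoch runs
        # Placeholder for one real epoch of training.
        state = {"epoch": epoch + 1, "weights": f"weights-after-epoch-{epoch + 1}"}
        save_checkpoint(directory, state)
    return state
```

Calling `train` a second time with the same directory continues from the epoch recorded in the checkpoint rather than from epoch 0. The directory should live on storage that survives the node, such as a mounted job output.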

In [7]:
compute_name = "low-pri-cluster"
compute_cluster = AmlCompute(
    name=compute_name,
    description="Low priority compute cluster",
    size="Standard_NC6s_v3",
    min_instances=0,
    max_instances=5,
    tier="LowPriority",
)
 
ml_client.begin_create_or_update(compute_cluster)

<azure.core.polling._poller.LROPoller at 0x23b1a8744f0>

# 2. PyTorch Elastic Concepts

In this section, we will explain the concepts of PyTorch Store and Rendezvous Backend.

PyTorch Elastic is a framework for distributed training in PyTorch that allows you to scale your training jobs to multiple nodes and handle failures gracefully. Here are some of the key concepts in PyTorch Elastic:

- **Rendezvous**: A rendezvous is a mechanism for nodes to discover each other and agree on a common set of parameters for the training job. In PyTorch Elastic, a rendezvous is typically implemented as a backend service that nodes can connect to and exchange information with. The rendezvous service is responsible for coordinating the nodes and ensuring that they all have the same view of the training job (e.g. the same set of parameters, node rank, etc.)

- **RendezvousHandler**: A rendezvous handler is an object that encapsulates the logic for connecting to a rendezvous service and exchanging information with other nodes. In PyTorch Elastic, a rendezvous handler is responsible for creating a rendezvous and registering the current node with it. `RendezvousHandler` is the abstract interface; implementations include `DynamicRendezvousHandler`, which delegates state storage to a pluggable backend, and the older `EtcdRendezvousHandler`.

- **RendezvousBackend**: A rendezvous backend is the implementation of the rendezvous service. PyTorch Elastic provides several rendezvous backends, such as `C10dRendezvousBackend` (built on a `TCPStore` or `FileStore`) and `EtcdRendezvousBackend`.

- **PyTorch Store**: A PyTorch store is a key-value store that is used to share data between nodes in a distributed training job. PyTorch ships a `TCPStore` implementation that exchanges data over a TCP socket and a `FileStore` implementation that exchanges data through a file on disk. PyTorch Elastic adds an `EtcdStore`.

These concepts work together to enable distributed training in PyTorch Elastic. A rendezvous backend is responsible for coordinating nodes and ensuring that they all have the same view of the training job. A rendezvous handler is responsible for creating a rendezvous and registering the current node with it. A PyTorch store is used to share data between nodes in the training job.
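To make the store concept concrete, here is a rough in-memory sketch of the key-value operations a store exposes. `InMemoryStore` is a hypothetical stand-in written for illustration; it only mimics the `set`, `get`, and `add` methods of a PyTorch store within a single process and simplifies their behavior (a real store's `get` waits for a missing key instead of raising). The key names and the address are made up.

```python
import threading


class InMemoryStore:
    """Hypothetical single-process stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        """Store a value as bytes under the given key."""
        with self._lock:
            self._data[key] = value if isinstance(value, bytes) else str(value).encode()

    def get(self, key):
        """Return the bytes stored under key (raises KeyError if missing)."""
        with self._lock:
            return self._data[key]

    def add(self, key, amount):
        """Atomically increment a counter and return its new value."""
        with self._lock:
            current = int(self._data.get(key, b"0")) + amount
            self._data[key] = str(current).encode()
            return current


store = InMemoryStore()

# One node publishes a value that every other node needs to read.
store.set("master_addr", "10.0.0.4")

# Each joining node takes the next free slot from a shared counter.
ranks = [store.add("joined_nodes", 1) - 1 for _ in range(3)]
```

The atomic counter is what lets nodes that arrive in any order end up with distinct ranks without talking to each other directly.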

### When to use a custom PyTorch Store and Rendezvous Backend

When training on dynamic or low-priority clusters, using existing stores like `TCPStore` or `EtcdStore` can have some limitations. For example:

- **Node failures**: If nodes can go down during training, the existing stores may not be able to handle the failure gracefully. This can result in lost data or inconsistent state across the nodes.

- **Master node**: If the master node is not known upfront, it may be difficult to configure the existing stores to work with the cluster. For example, the master node may need to be manually specified in the configuration, which can be cumbersome if the cluster is dynamic.

- **Additional infrastructure setup**: Using existing stores may require additional infrastructure setup, such as setting up a separate Etcd cluster or TCP server. This can add complexity to the training setup and increase the chances of failure.

Creating a custom rendezvous backend or store that is stateless can simplify things in some cases. A stateless backend or store can be more resilient to node failures, since it does not rely on maintaining state across nodes. It can also be easier to configure, since it does not require setting up additional infrastructure.


# 3. Custom PyTorch Store and Rendezvous Backend

In this notebook, we are going to use Azure Table Storage for storing rendezvous information and sharing data between nodes. Azure Table Storage is a NoSQL key-value store that is part of the Azure Storage service. It is a good fit for this scenario because it is a fully managed service that can scale to handle large amounts of data. It also supports atomic operations, which makes it easy to implement a rendezvous backend or store that is stateless.


## 3.1 Custom PyTorch Store

The custom PyTorch store is implemented in the ``src/azure_table_store.py`` file.

In this example, the `AzureTableStore` class implements the `Store` interface and uses the Azure Table Storage API to store and retrieve data. The `AzureTableStore` class is stateless: it does not need to maintain any state across nodes, which makes it resilient to node failures. It is also easy to configure, since it does not require setting up additional infrastructure.

## 3.2 Custom Rendezvous Backend

The custom rendezvous backend is implemented in the ``src/azure_table_backend.py`` file.
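The source file is not reproduced here. As a rough illustration of the contract such a backend has to satisfy, the sketch below mirrors the `get_state` / `set_state` pair of PyTorch Elastic's `RendezvousBackend` interface on top of a fake table row guarded by a version tag. `FakeTable` and `TableBackendSketch` are hypothetical names and everything runs in memory; this is not the code in `src/azure_table_backend.py`, and it makes no Azure calls.

```python
import itertools


class FakeTable:
    """Hypothetical in-memory stand-in for one table row guarded by an ETag."""

    def __init__(self):
        self._value = None
        self._etag = None
        self._counter = itertools.count(1)

    def read(self):
        """Return the current value and its version tag."""
        return self._value, self._etag

    def write_if_match(self, value, expected_etag):
        """Write only if the caller saw the latest version; return success."""
        if expected_etag != self._etag:
            return False
        self._value = value
        self._etag = f"etag-{next(self._counter)}"
        return True


class TableBackendSketch:
    """Mirrors the get_state/set_state contract of a rendezvous backend."""

    def __init__(self, table):
        self._table = table

    def get_state(self):
        """Return (state, token), or None if no state has been written yet."""
        value, etag = self._table.read()
        return None if value is None else (value, etag)

    def set_state(self, state, token=None):
        """Try a conditional write; return (current state, token, succeeded)."""
        written = self._table.write_if_match(state, token)
        value, etag = self._table.read()
        return value, etag, written


backend = TableBackendSketch(FakeTable())
first_state, first_token, first_ok = backend.set_state(b"round-1")
second_state, second_token, second_ok = backend.set_state(b"round-2", first_token)
# A writer still holding the old token loses the race and sees the newer state.
stale_state, stale_token, stale_ok = backend.set_state(b"stale", first_token)
```

The token plays the role of a fencing token: a node that read an outdated state cannot overwrite a newer one, which is the property the atomic operations of the table service are meant to provide.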

## 3.3 Custom Rendezvous Handler

To allow PyTorch Elastic to use Azure Table Storage as the rendezvous store for distributed training, we can create an instance of the `AzureTableRendezvousBackend` class and pass it to a rendezvous handler.



# 4. Distributed Elastic Training with PyTorch

Distributed elastic training allows us to dynamically adjust the number of training workers during the training process. In this section, we will implement the PyTorch distributed elastic training code and integrate our custom PyTorch Store and Rendezvous Backend.
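One consequence of a changing worker count is that each worker's share of the data has to be recomputed whenever the group is re-formed. The sketch below shows the idea with plain Python. It assumes the launcher exports the `RANK` and `WORLD_SIZE` environment variables, as `torchrun` does; the helper names are hypothetical, and the strided split is only one possible partitioning scheme.

```python
import os


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices one worker processes in an epoch."""
    return list(range(rank, num_samples, world_size))


def current_worker():
    """Read this worker's rank and the group size from the environment."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, world_size


# With four workers, rank 1 handles every fourth sample.
before = shard_indices(10, rank=1, world_size=4)  # [1, 5, 9]

# After the group shrinks to two workers, the same rank covers more data.
after = shard_indices(10, rank=1, world_size=2)  # [1, 3, 5, 7, 9]
```

Because the shard is derived from the current rank and world size rather than stored, a restarted worker only needs to call `current_worker()` again to pick up its new share.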

# 5. Submit the job to Azure Machine Learning

## 5.1 Configure Command for reading and writing data
The CIFAR-10 dataset is downloaded as a compressed file from a public URL. The `read_write_data.py` script in the `src` folder extracts the files using the `tarfile` library.
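The script itself is not shown in this notebook. A minimal sketch of what such an extraction step could look like follows; the function names are assumptions, and only the `--input_data` and `--output_folder` flags are taken from the job command used in this section.

```python
import argparse
import os
import tarfile


def extract_archive(archive_path, output_folder):
    """Extract a .tar.gz archive into output_folder and return member names."""
    os.makedirs(output_folder, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(output_folder)
    return names


def parse_args(argv=None):
    """Parse the two paths that the job command passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", required=True)
    parser.add_argument("--output_folder", required=True)
    return parser.parse_args(argv)


# Typical use inside the script:
#   args = parse_args()
#   extract_archive(args.input_data, args.output_folder)
```

At run time Azure ML replaces `${{inputs.cifar_zip}}` and `${{outputs.cifar}}` with concrete paths, so the script can treat both arguments as ordinary file system locations.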

In [9]:
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes


inputs = {
    "cifar_zip": Input(
        type=AssetTypes.URI_FILE,
        path="https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz",
    ),
}

outputs = {
    "cifar": Output(
        type=AssetTypes.URI_FOLDER,
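        # Read the workspace details from ml_client so this also works when
        # MLClient.from_config() succeeded and the variables above were never set.
        path=f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/workspaceblobstore/paths/CIFAR-10",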
    )
}

job = command(
    code="./src",  # local path where the code is stored
    command="python read_write_data.py --input_data ${{inputs.cifar_zip}} --output_folder ${{outputs.cifar}}",
    inputs=inputs,
    outputs=outputs,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
    compute="cpu-cluster",
)

# submit the command
returned_job = ml_client.jobs.create_or_update(job)
# get a URL for the status of the job
returned_job.studio_url

Uploading src (0.02 MBs): 100%|##########| 21798/21798 [00:00<00:00, 70059.55it/s]



'https://ml.azure.com/runs/sincere_sun_wvznthjf13?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/sdk_vnext_cli&tid=72f988bf-86f1-41af-91ab-2d7cd011db47'

In [10]:
ml_client.jobs.stream(returned_job.name)

RunId: sincere_sun_wvznthjf13
Web View: https://ml.azure.com/runs/sincere_sun_wvznthjf13?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/sdk_vnext_cli

Execution Summary
RunId: sincere_sun_wvznthjf13
Web View: https://ml.azure.com/runs/sincere_sun_wvznthjf13?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/sdk_vnext_cli



In [10]:
returned_job = ml_client.jobs.get(returned_job.name)
returned_job

| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| elastic | sincere_sun_wvznthjf13 | command | Completed | Link to Azure Machine Learning studio |


In [11]:
print(returned_job.name)
print(returned_job.experiment_name)
print(returned_job.outputs.cifar)
print(returned_job.outputs.cifar.path)

sincere_sun_wvznthjf13
elastic
${{parent.jobs.sincere_sun_wvznthjf13.outputs.cifar}}
azureml://subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/sdk_vnext_cli/datastores/workspaceblobstore/paths/CIFAR-10


## 5.2 Configure Command Job for Distributed Training with PyTorch Elastic

In this section, we will configure the command job for distributed training with PyTorch Elastic. The `max_instance_count` parameter sets the maximum number of nodes the job is allowed to scale up to.

In [23]:
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

# === Note on path ===
# `path` can be a local path or a cloud path. AzureML supports `https://`, `abfss://`, `wasbs://` and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

inputs = {
    "cifar": Input(
        type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path
    ),
    "epoch": 1,
    "batchsize": 256,
    "lr": 0.01,
}

job = command(
    code="./src",
    command="python driver.py cifar_train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}}",
    inputs=inputs,
    environment="azureml:AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu:7",
    compute=compute_name,
    resources={"max_instance_count": 2},
    distribution={"type": "PyTorch"},
)

In [24]:
# Submit the job
submitted_command_job = ml_client.jobs.create_or_update(job)

# Monitor the job on Azure ML Studio
submitted_command_job.studio_url

'https://ml.azure.com/runs/happy_jewel_kt3h0x98gz?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/gasi_rg_centraleuap/workspaces/sdk_vnext_cli&tid=72f988bf-86f1-41af-91ab-2d7cd011db47'