Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Distributed PyTorch with DistributedDataParallel
In this tutorial, you will train a PyTorch model on the [CIFAR10](http://www.cs.toronto.edu/~kriz/cifar.html) dataset using distributed training with PyTorch's `DistributedDataParallel` module across an Azure Stack Hub CPU Kubernetes cluster. The training dataset is stored in an Azure Machine Learning datastore backed by an Azure Stack Hub storage account.

## Prerequisites

* A Kubernetes cluster deployed on Azure Stack Hub and connected to Azure through Azure Arc.

   For details on how to deploy a Kubernetes cluster on Azure Stack Hub and enable the Azure Arc connection, follow [this guide](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/AML-ARC-Compute.md).

* A datastore in your Azure Machine Learning workspace backed by an Azure Stack Hub storage account.

   [This document](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/Train-AzureArc.md) is a detailed guide on how to create an Azure Machine Learning workspace, create an Azure Stack Hub storage account, and set up a datastore in the AML workspace backed by the ASH storage account.

* The ability to run a notebook.

   If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook located [here](https://github.com/Azure/MachineLearningNotebooks) first if you haven't already. This sets you up with a working config file that has information on your workspace, subscription ID, etc.

In [3]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception (azureml-telemetry 1.19.0 (c:\users\v-songshanli\anaconda3\envs\pythonproject\lib\site-packages), Requirement.parse('azureml-telemetry~=1.18.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception (azureml-telemetry 1.19.0 (c:\users\v-songshanli\anaconda3\envs\pythonproject\lib\site-packages), Requirement.parse('azureml-telemetry~=1.18.0'), {'azureml-automl-core'}).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.PipelineRun = azureml.pipeline.core.run:PipelineRun._from_dto with exception (azureml-core 1.19.0 (c:\users\v-songshanli\anaconda3\envs\pythonproject\lib\site-packages), Requirement.parse('azureml-core~=1.18.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint a

SDK version: 1.19.0


## Initialize workspace

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`. 

If you haven't done so already, open the `config.json` file and fill in your workspace information.
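As a quick sanity check before calling `Workspace.from_config()`, the sketch below verifies that a config file contains the three fields that the downloadable `config.json` normally carries (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` is hypothetical and not part of the SDK, and the values shown are placeholders.

```python
import json

# Fields found in the config.json that can be downloaded for a workspace.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent or empty in a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Example with placeholder values (replace with your own workspace details).
example = json.dumps({
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "",  # left empty on purpose to show the check
})
print(missing_config_keys(example))
```

An empty list means all three fields are filled in; anything else names the fields that still need values.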

In [4]:
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget
from azureml.exceptions import ComputeTargetException

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: sl-ash2-mal
Azure region: eastus
Subscription id: 6b736da6-3246-44dd-a0b8-b5e95484633d
Resource group: sl-ash2


## Prepare dataset

Download the CIFAR-10 dataset from [cifar10-data](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz). Create a folder named `cifar10-data` under the working directory of this notebook, then copy `cifar-10-python.tar.gz` into it. The following cell uploads `cifar-10-python.tar.gz` to the workspace datastore and registers it as a dataset in the workspace.

Uploading and registering the dataset takes about 3 minutes.

To set up a datastore using an Azure Stack Hub storage account, refer to [Train_azure_arc](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/Train-AzureArc.md). To register the dataset manually, refer to this [video](https://msit.microsoftstream.com/video/51f7a3ff-0400-b9eb-2703-f1eb38bc6232).
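Before running the upload cell, it can help to confirm that the archive is where the cell expects it. The following is a minimal sketch using only the standard library; the helper name `archive_ready` is hypothetical, and the folder and file names are the ones described above.

```python
import os


def archive_ready(src_dir="cifar10-data", filename="cifar-10-python.tar.gz"):
    """Return True if the CIFAR-10 archive exists and is non-empty in src_dir."""
    path = os.path.join(src_dir, filename)
    return os.path.isfile(path) and os.path.getsize(path) > 0


if not archive_ready():
    print("Place cifar-10-python.tar.gz in ./cifar10-data before running the upload cell.")
```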


In [5]:
import os
from azureml.core import Datastore, Dataset

dataset_name = 'cifar10'
datastore_name = "ashstore"

if dataset_name not in ws.datasets:
    datastore = Datastore.get(ws, datastore_name)

    # Assumes cifar-10-python.tar.gz is in the local folder cifar10-data
    src_dir, target_path = 'cifar10-data', 'cifar10-data-ash'
    
    # upload data from local to AML datastore:
    datastore.upload(src_dir, target_path)

    # register data uploaded as AML dataset:
    datastore_paths = [(datastore, target_path)]
    cifar_ds = Dataset.File.from_files(path=datastore_paths)
    cifar_ds.register(ws, dataset_name, "CIFAR-10 images from https://www.cs.toronto.edu/~kriz/cifar.html")
        
dataset_ash = ws.datasets[dataset_name]

## Create or attach existing ArcKubernetesCompute

The attach code here depends on the Python package `azureml-contrib-k8s`, which is currently in private preview. Install the private preview branch of the AzureML SDK by running the following command:

<pre>
pip install --disable-pip-version-check --extra-index-url https://azuremlsdktestpypi.azureedge.net/azureml-contrib-k8s-preview/D58E86006C65 azureml-contrib-k8s
</pre>

Attaching an ASH cluster for the first time may take about 7 minutes. Subsequent attachments are much faster.

In [6]:
from azureml.contrib.core.compute.arckubernetescompute import ArcKubernetesCompute

# Replace with the resource ID of your own Arc-connected cluster
resource_id = "/subscriptions/6b736da6-3246-44dd-a0b8-b5e95484633d/resourceGroups/AML-stack-val/providers/Microsoft.Kubernetes/connectedClusters/kub-orlando-Test"

attach_config = ArcKubernetesCompute.attach_configuration(
    resource_id=resource_id,
)

# Use one name both for attaching and for looking up the compute target
attach_name = "nc6"

try:
    arcK_target_result = ArcKubernetesCompute.attach(ws, attach_name, attach_config)
    arcK_target_result.wait_for_completion(show_output=True)
    print('arc attach  success')
except ComputeTargetException as e:
    print(e)
    print('arc attach  failed')

arcK_target = ws.compute_targets[attach_name]

SucceededProvisioning operation finished, operation "Succeeded"
arc attach  success
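The `resource_id` used in the attach cell above follows the Azure Resource Manager path layout for an Arc-connected Kubernetes cluster. As a small illustration, the sketch below assembles such an ID from its parts. The helper name `arc_cluster_resource_id` is hypothetical, and the arguments in the example call are placeholders to be replaced with your own values.

```python
def arc_cluster_resource_id(subscription_id, resource_group, cluster_name):
    """Build the resource ID of an Azure Arc connected Kubernetes cluster."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Kubernetes/connectedClusters/"
        f"{cluster_name}"
    )


# Placeholder values; substitute your own subscription, resource group, and cluster.
print(arc_cluster_resource_id("<subscription-id>", "<resource-group>", "<cluster-name>"))
```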


### Configure the training job and submit a run


In [7]:
from azureml.core import Experiment

experiment_name = 'pytorch-cifar-distr'
experiment = Experiment(ws, name=experiment_name)

### Create an environment


In [8]:
from azureml.core import Environment

pytorch_env = Environment.from_conda_specification(name = 'pytorch-1.6-cpu', file_path = 'pytorch-script/conda_dependencies.yml')

# Specify a CPU base image
pytorch_env.docker.enabled = True
pytorch_env.docker.base_image = 'mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04'


### Configure the training job: torch.distributed with GLOO backend

Create a ScriptRunConfig object to specify the configuration details of your training job, including your training script, environment to use, and the compute target to run on.

In order to run a distributed PyTorch job with **torch.distributed** using the GLOO backend, create a `PyTorchConfiguration` and pass it to the `distributed_job_config` parameter of the ScriptRunConfig constructor. Specify `communication_backend='Gloo'` in the PyTorchConfiguration. The code below sets `node_count=2`, which is the number of worker nodes. If one master node is used, the number of distributed jobs will be 3. GLOO is the recommended backend for communication between CPUs.

The script for distributed training on CIFAR10 is already provided for you at `pytorch-script/cifar_dist_main.py`. In practice, you should be able to take any custom PyTorch training script as is and run it with Azure ML without having to modify your code.

With `node_count=2`, training for one epoch may take about 20 minutes on a VM size comparable to Standard_DS3_v2.
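To make the data-parallel idea concrete, the sketch below shows how a training set can be split across processes in the style of a distributed sampler: the index list is padded so it divides evenly, and each rank takes every `world_size`-th index. This is a simplified illustration without shuffling, not the code in `cifar_dist_main.py`. The function names are hypothetical, and the world size of 2 and per-process batch size of 128 are assumptions chosen for illustration.

```python
import math


def shard_indices(dataset_size, world_size, rank):
    """Indices one process would see under an evenly padded, strided split."""
    per_rank = math.ceil(dataset_size / world_size)
    total = per_rank * world_size
    # Pad by wrapping around so every rank gets the same number of samples.
    indices = [i % dataset_size for i in range(total)]
    return indices[rank:total:world_size]


def batches_per_epoch(dataset_size, world_size, batch_size):
    """Number of batches each process runs in one epoch."""
    per_rank = math.ceil(dataset_size / world_size)
    return math.ceil(per_rank / batch_size)


train_size = 50_000  # number of CIFAR-10 training images
print(len(shard_indices(train_size, world_size=2, rank=0)))         # 25000
print(batches_per_epoch(train_size, world_size=2, batch_size=128))  # 196
```

Under these assumed settings each process handles 196 batches per epoch, which is consistent with the `196` shown in the training log further down, although the script's actual settings are not shown in this notebook.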

In [9]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration
from azureml.core import Dataset
import os

dataset_ash = Dataset.get_by_name(ws, name=dataset_name)
args = [
    '--data-folder', dataset_ash.as_mount(),
    '--dist-backend', 'gloo',
    '--epochs', 1,  # alternative: 20
]

# Configure the AML PyTorch distributed job
distributed_job_config = PyTorchConfiguration(communication_backend='Gloo', node_count=2)

project_folder = "pytorch-script"
run_script = "cifar_dist_main.py"
src = ScriptRunConfig(
    source_directory=project_folder,
    script=run_script,
    arguments=args,
    compute_target=arcK_target,
    environment=pytorch_env,
    distributed_job_config=distributed_job_config,
)

### Submit job
Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.

In [10]:
run = experiment.submit(src)
run.wait_for_completion(show_output=True) # this provides a verbose log

RunId: pytorch-cifar-distr_1613442168_09fb2717
Web View: https://ml.azure.com/experiments/pytorch-cifar-distr/runs/pytorch-cifar-distr_1613442168_09fb2717?wsid=/subscriptions/6b736da6-3246-44dd-a0b8-b5e95484633d/resourcegroups/sl-ash2/workspaces/sl-ash2-mal

Streaming azureml-logs/65_job_prep-tvmps_4984b88dd9855bc51a7c2dc4ec9b8fb975e1f2535330a25170316224556bd189_d.txt

[2021-02-16T02:26:25.867258] Entering job preparation.
[2021-02-16T02:26:26.814648] Starting job preparation.
[2021-02-16T02:26:26.814685] Extracting the control code.
[2021-02-16T02:26:26.836668] fetching and extracting the control code on master node.
[2021-02-16T02:26:26.836693] Starting extract_project.
[2021-02-16T02:26:26.836728] Starting to extract zip file.
[2021-02-16T02:26:27.798329] Finished extracting zip file.
[2021-02-16T02:26:27.917262] Using urllib.request Python 3.0 or later
[2021-02-16T02:26:27.917316] Start fetching snapshots.
[2021-02-16T02:26:27.917353] Start fetching snapshot.
[2021-02-16T02:26:27.9

125 196 Loss: 1.820 | Acc: 31.554% (5089/16128)
126 196 Loss: 1.818 | Acc: 31.613% (5139/16256)
127 196 Loss: 1.818 | Acc: 31.659% (5187/16384)
128 196 Loss: 1.816 | Acc: 31.783% (5248/16512)
129 196 Loss: 1.815 | Acc: 31.839% (5298/16640)
130 196 Loss: 1.814 | Acc: 31.870% (5344/16768)
131 196 Loss: 1.812 | Acc: 31.919% (5393/16896)
132 196 Loss: 1.809 | Acc: 31.984% (5445/17024)
133 196 Loss: 1.807 | Acc: 32.014% (5491/17152)
134 196 Loss: 1.805 | Acc: 32.089% (5545/17280)
135 196 Loss: 1.803 | Acc: 32.181% (5602/17408)
136 196 Loss: 1.802 | Acc: 32.294% (5663/17536)
137 196 Loss: 1.801 | Acc: 32.343% (5713/17664)
138 196 Loss: 1.798 | Acc: 32.464% (5776/17792)
139 196 Loss: 1.796 | Acc: 32.545% (5832/17920)
140 196 Loss: 1.795 | Acc: 32.602% (5884/18048)
141 196 Loss: 1.793 | Acc: 32.647% (5934/18176)
142 196 Loss: 1.791 | Acc: 32.709% (5987/18304)
143 196 Loss: 1.790 | Acc: 32.775% (6041/18432)
144 196 Loss: 1.789 | Acc: 32.829% (6093/18560)
145 196 Loss: 1.786 | Acc: 32.914% (6151


Streaming azureml-logs/75_job_post-tvmps_4984b88dd9855bc51a7c2dc4ec9b8fb975e1f2535330a25170316224556bd189_d.txt

[2021-02-16T02:37:48.584051] Entering job release
[2021-02-16T02:37:49.882145] Starting job release
[2021-02-16T02:37:49.882642] Logging experiment finalizing status in history service.[2021-02-16T02:37:49.882786] job release stage : upload_datastore starting...

Starting the daemon thread to refresh tokens in background for process with pid = 484
[2021-02-16T02:37:49.883036] job release stage : start importing azureml.history._tracking in run_history_release.
[2021-02-16T02:37:49.883257] job release stage : execute_job_release starting...[2021-02-16T02:37:49.883541] job release stage : copy_batchai_cached_logs starting...

[2021-02-16T02:37:49.885432] job release stage : copy_batchai_cached_logs completed...
[2021-02-16T02:37:49.936077] Entering context manager injector.
[2021-02-16T02:37:49.977103] job release stage : upload_datastore completed...
[2021-02-16T02:37:50.269

{'runId': 'pytorch-cifar-distr_1613442168_09fb2717',
 'target': 'nc6',
 'status': 'Completed',
 'startTimeUtc': '2021-02-16T02:25:52.395478Z',
 'endTimeUtc': '2021-02-16T02:38:09.719482Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': 'decc257a-e2e9-4923-97a5-b12fed3fa395',
  'azureml.git.repository_uri': 'git@github.com:lisongshan007/AML-Kubernetes.git',
  'mlflow.source.git.repoURL': 'git@github.com:lisongshan007/AML-Kubernetes.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'e2bca45811ab76c19bc777642ccf7f12fecd9162',
  'mlflow.source.git.commit': 'e2bca45811ab76c19bc777642ccf7f12fecd9162',
  'azureml.git.dirty': 'False',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '24d86baf-41f9-40e1-ae0d-5ae8b4c9cdfc'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'input__5a164dc5', 'mecha

In [11]:
# register the model
register_model_name = 'cifar10torch'
model = run.register_model(model_name=register_model_name, model_path='outputs/cifar10torch.pt')

The machine learning model named "cifar10torch" should be registered in your AML workspace.

## Test Registered Model

To test the trained model, you can create a new AKS cluster (or use an existing one) to serve the model using AML deployment.

In [12]:
from azureml.core import Environment, Workspace, Model, ComputeTarget
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice, AksWebservice
from azureml.core.compute_target import ComputeTargetException
import numpy as np
import json

In [13]:
ws = Workspace.from_config()

# Choose a name for your AKS cluster
aks_name = 'aks-service-2'

if aks_name not in ws.compute_targets:
    # Use the default configuration (can also provide parameters to customize)
    prov_config = AksCompute.provisioning_configuration()

    # Create the cluster
    aks_target = ComputeTarget.create(workspace=ws,
                                      name=aks_name,
                                      provisioning_configuration=prov_config)
    is_new_compute = True

    if aks_target.get_status() != "Succeeded":
        aks_target.wait_for_completion(show_output=True)
else:
    aks_target = ws.compute_targets[aks_name]
    is_new_compute = False
    
print("using compute target: ", aks_target.name)

using compute target:  aks-service-2


### Deploy the model

In [14]:
env = Environment.from_conda_specification(name='pytorch-cifar', file_path='pytorch-script/conda_dependencies.yml')

inference_config = InferenceConfig(entry_script='score_pytorch.py', environment=env)
deploy_config = AksWebservice.deploy_configuration()

#register_model_name = "cifar10torch80"
model = ws.models[register_model_name]
service_name = 'cifartorchservice4'

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deploy_config,
                       deployment_target=aks_target,
                       overwrite=True)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........
Succeeded
AKS service creation operation finished, operation "Succeeded"
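The deployment above points at an entry script, `score_pytorch.py`, whose contents are not shown in this notebook. Azure ML entry scripts follow an `init()` / `run()` convention, so a rough sketch of the general shape might look like the following. The stand-in model and the `"data"` request key are hypothetical; only the `"predicts"` response key is taken from the test cell later in this notebook.

```python
import json

model = None  # populated by init()


def init():
    """Called once when the service starts; a real script would load the registered model here."""
    global model
    # Stand-in "model": returns one zero logit per CIFAR-10 class for each input item.
    model = lambda batch: [[0.0] * 10 for _ in batch]


def run(raw_data):
    """Called per request with the request body as a JSON string."""
    try:
        batch = json.loads(raw_data)["data"]  # "data" is a hypothetical key
        return {"predicts": model(batch)}
    except Exception as exc:
        # Report failures in the response instead of crashing the service.
        return {"error": str(exc)}


init()
print(run(json.dumps({"data": [[0.1, 0.2], [0.3, 0.4]]})))
```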


### Test with Inputs:

For testing purposes, the first five images from the test batch have been extracted. Let's take a look at these pictures:

In [15]:
from PIL import Image

img_files = ["test_img_0_cat.jpg", "test_img_1_ship.jpg","test_img_2_ship.jpg","test_img_3_plane.jpg","test_img_4_frog.jpg"]

for img_file in img_files:
    image = Image.open("test_imgs/{}".format(img_file))
    image.show()

After some data processing, these five images are converted to JSON as input for the trained model. The outputs are logits for each class per image:

In [16]:
with open("cifar_test_input_pytorch.json", "r") as fp:
    inputs_json = json.load(fp)
    
inputs = json.dumps(inputs_json)
resp = service.run(inputs)
predicts = resp["predicts"]
print(predicts)

[[-0.5160019397735596, -1.2456295490264893, 1.6823105812072754, 0.6995013356208801, 0.8892011046409607, 0.47988930344581604, 1.4375267028808594, -0.47753068804740906, -1.1731479167938232, -1.5666691064834595], [5.023127555847168, 5.144172668457031, -0.8311423063278198, -3.577611207962036, -2.236011505126953, -5.63224983215332, -3.499835729598999, -2.228290557861328, 5.397491455078125, 2.4311327934265137], [3.812830686569214, 2.7654457092285156, -0.23190726339817047, -2.1370229721069336, -1.7254106998443604, -4.214117050170898, -2.6311137676239014, -2.1098084449768066, 5.139931678771973, 1.676523208618164], [3.4747440814971924, 2.662689685821533, 0.07015886902809143, -1.692741870880127, -1.5648386478424072, -3.8260350227355957, -2.626882553100586, -1.4485933780670166, 4.246033191680908, 1.0078842639923096], [-1.1009407043457031, -1.784458875656128, 2.548421859741211, 0.21697556972503662, 2.352715492248535, -0.056166648864746094, 3.779412269592285, -0.6903271675109863, -3.094657659530639
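The preprocessing that produced `cifar_test_input_pytorch.json` is not shown in this notebook. As a rough illustration of what such a step can involve, the sketch below converts an image from height x width x channel order with 0-255 integers to the channel-first float layout that PyTorch models typically expect, then wraps it in a JSON payload. The helper name, the `"data"` key, and the simple 0-1 scaling are all assumptions; the real script may normalize differently.

```python
import json


def to_chw_floats(pixels_hwc):
    """Convert an H x W x C image of 0-255 ints to C x H x W floats in [0, 1]."""
    height, width = len(pixels_hwc), len(pixels_hwc[0])
    channels = len(pixels_hwc[0][0])
    return [
        [[pixels_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A tiny 2 x 2 RGB "image" standing in for a 32 x 32 CIFAR-10 picture.
tiny_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

# One-image batch; "data" is a hypothetical key name.
payload = json.dumps({"data": [to_chw_floats(tiny_image)]})
```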

You can then map the logits to predicted labels:

In [17]:
import numpy as np
classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
np_predicts = np.array(predicts)
pred_indexes = np.argmax(np_predicts, 1)

predict_labels = [classes[i] for i in pred_indexes]
print(predict_labels)

['bird', 'ship', 'ship', 'ship', 'frog']
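The test image file names carry their true labels, so the predictions above can be checked against them. The sketch below does that comparison and also turns the first image's logits (rounded from the output above) into probabilities with a softmax. The helper names are hypothetical, and only the standard library is used.

```python
import math

classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_filename(name):
    """Extract the label from a name such as 'test_img_0_cat.jpg'."""
    return name.rsplit(".", 1)[0].split("_")[-1]


img_files = ["test_img_0_cat.jpg", "test_img_1_ship.jpg", "test_img_2_ship.jpg",
             "test_img_3_plane.jpg", "test_img_4_frog.jpg"]
predicted = ['bird', 'ship', 'ship', 'ship', 'frog']  # output of the cell above

expected = [label_from_filename(f) for f in img_files]
correct = sum(p == e for p, e in zip(predicted, expected))
print(f"{correct}/{len(expected)} correct")  # 3/5 correct

# Logits for the first image, rounded to four decimals.
first_logits = [-0.5160, -1.2456, 1.6823, 0.6995, 0.8892,
                0.4799, 1.4375, -0.4775, -1.1731, -1.5667]
probs = softmax(first_logits)
top = max(range(len(probs)), key=probs.__getitem__)
print(classes[top], round(probs[top], 3))
```

After a single epoch, three of the five test images are classified correctly; the cat is predicted as a bird and the plane as a ship.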


### Delete the newly created cluster

Note: Deleting the cluster is important if you wish to avoid ongoing costs for it.

In [18]:
if is_new_compute:
    # Delete the web service first, since it runs on the cluster
    service.delete()
    aks_target.delete()

## Next Steps

1. Learn how to [download the model and upload it to Azure Storage blobs](../AML-model-download-upload.ipynb)
2. Learn how to [run inference using KFServing with a model in Azure Storage blobs](https://aka.ms/kfas)