Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Train using Distributed PyTorch on Azure Arc-enabled Machine Learning with NFS-mounted data

This example notebook demonstrates how to train a deep learning model using PyTorch and data stored on an NFS server. It walks through the following steps:

* Set up an NFS Server
* Download training data to the NFS Server
* Configure NFS Server mounts on your Kubernetes Cluster
* Set up your connection to Azure Machine Learning
* Create the necessary Azure Machine Learning objects
* Submit a Training Run

## Set up an NFS Server
This notebook assumes that you either have access to an existing NFS server or know how to set one up. Setting up and configuring NFS is beyond the scope of this example. To complete this notebook you need the address of your NFS server, and you need to mount it locally so that it is accessible to this notebook.

Once you have a working NFS server mount, set the 'nfs_mount_path' variable below to point to it.

In [12]:
nfs_mount_path = '/nfs_share'

import os
mnist_dir = os.path.join(nfs_mount_path, 'mnist')
os.makedirs(mnist_dir, exist_ok=True)
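
Before downloading data, it can help to confirm that the path really is a usable share rather than an empty local directory. The following is a minimal sketch, not part of the original notebook; `check_share` is a hypothetical helper name, and `/nfs_share` is the example path used above. Note that `is_mount` is `False` when the NFS export is mounted at a parent directory, so treat it as a hint rather than proof.

```python
import os
import tempfile


def check_share(path):
    """Report whether `path` exists, is a mount point, and accepts writes."""
    status = {
        "exists": os.path.isdir(path),
        "is_mount": os.path.ismount(path),
        "writable": False,
    }
    if status["exists"]:
        try:
            # Create and immediately discard a temporary file in the share.
            with tempfile.TemporaryFile(dir=path):
                status["writable"] = True
        except OSError:
            # Read-only or permission-restricted share: leave writable False.
            pass
    return status


# '/nfs_share' is the example mount path from the cell above.
print(check_share("/nfs_share"))
```

If `exists` or `writable` is `False`, the download in the next step will either fail or write to local disk instead of the NFS server.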

## Download training data to the NFS Server
This step uses the Torchvision utilities (from PyTorch) to download the MNIST data to the NFS server.

In [13]:
%pip install torchvision==0.7.0

from torchvision import datasets

datasets.MNIST(mnist_dir, train=True, download=True)

Note: you may need to restart the kernel to use updated packages.


Dataset MNIST
    Number of datapoints: 60000
    Root location: /nfs_share/mnist
    Split: Train
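
To confirm what the download actually wrote to the share, you can list the files under the dataset directory. This is a small sketch using only the standard library; `list_files` is a hypothetical helper, and it makes no assumption about file names because the on-disk layout depends on the torchvision version.

```python
import os


def list_files(root):
    """Return sorted (relative_path, size_in_bytes) pairs for files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append((os.path.relpath(full_path, root), os.path.getsize(full_path)))
    return sorted(found)


# '/nfs_share/mnist' is the root location reported in the output above.
for rel_path, size in list_files("/nfs_share/mnist"):
    print(f"{size:>12}  {rel_path}")
```

An empty listing means nothing was written under that root, which is worth resolving before submitting a run.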

## Configure NFS Server mounts on your Kubernetes Cluster

Follow the instructions [here](../amlarc-nfs-setup/README.md) to configure your Azure Arc-enabled Machine Learning cluster to mount your NFS server.

## Set up your connection to Azure Machine Learning

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception (azureml-core 1.32.0 (/disks/4TB/code/e2e/reinforcement-learning/lib/python3.6/site-packages), Requirement.parse('azureml-core~=1.30.0')).


SDK version: 1.32.0


In [2]:
# Connect to the Workspace described by local configuration
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


dasommer-ml-eus
dasommer
eastus
4aaa645c-5ae2-4ae9-a17a-84b9023bc56a
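
The SDK message above recommends `ServicePrincipalAuthentication` for unattended runs. The following is a rough sketch of that approach, not something the original notebook does. The environment variable names (`AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`) and both helper functions are assumptions chosen for illustration; adapt them to however your secrets are stored.

```python
import os

# Assumed variable names; the notebook itself does not define these.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def read_service_principal_settings(env=os.environ):
    """Collect service principal credentials from environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "tenant_id": env["AZURE_TENANT_ID"],
        "service_principal_id": env["AZURE_CLIENT_ID"],
        "service_principal_password": env["AZURE_CLIENT_SECRET"],
    }


def connect_unattended(env=os.environ):
    """Load the workspace from local config without an interactive login."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**read_service_principal_settings(env))
    return Workspace.from_config(auth=auth)
```

With the variables set, `ws = connect_unattended()` would replace the interactive `Workspace.from_config()` call.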


## Create the necessary Azure Machine Learning objects

In [3]:
# Create an Experiment
from azureml.core import Experiment
experiment_name = 'train-on-amlarc-with-nfs'
experiment = Experiment(workspace = ws, name = experiment_name)

In [4]:
# Create a Docker-based environment with PyTorch installed
from azureml.core import Environment
from azureml.core.runconfig import DockerConfiguration

env_name = 'AzureML-PyTorch-1.6-CPU'
myenv = Environment.get(workspace=ws, name=env_name)

# Enable Docker
docker_config = DockerConfiguration(use_docker=True)

In [5]:
# Specify the name of an existing Azure Arc-enabled Machine Learning compute target
amlarc_cluster = 'amlarc'

## Submit a Training Run

In [9]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

# Configure the run. The script receives the NFS data path set above as a plain
# string, so the same path must also resolve inside the job's containers on the cluster.
backend = 'Gloo'
dist_config = PyTorchConfiguration(communication_backend=backend, node_count=3)

src = ScriptRunConfig(source_directory='scripts', 
                      script='train.py', 
                      compute_target=amlarc_cluster,
                      environment=myenv,
                      arguments=['--data-dir', mnist_dir, '--backend', backend],
                      docker_runtime_config=docker_config,
                      distributed_job_config=dist_config)
 
run = experiment.submit(config=src)
run

Experiment,Id,Type,Status,Details Page,Docs Page
train-on-amlarc-with-nfs,train-on-amlarc-with-nfs_1629910140_7ddd53e1,azureml.scriptrun,Starting,Link to Azure Machine Learning studio,Link to Documentation


Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).
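
The `ScriptRunConfig` above passes `--data-dir` and `--backend` to `scripts/train.py`, which is not shown in this notebook. The sketch below illustrates how such a script might consume those arguments; it is not the actual `train.py`. The function names are hypothetical, and it assumes the launcher provides `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` in the environment, which you should verify for your cluster.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the two arguments that the run configuration passes in."""
    parser = argparse.ArgumentParser(description="Distributed MNIST training (sketch)")
    parser.add_argument("--data-dir", required=True, help="dataset root on the NFS mount")
    parser.add_argument("--backend", default="gloo", help="torch.distributed backend")
    args = parser.parse_args(argv)
    # Normalize 'Gloo' to 'gloo' so downstream comparisons are case-insensitive.
    args.backend = args.backend.lower()
    return args


def rendezvous_settings(env=os.environ):
    """Read rank and world size, falling back to a single-process setup."""
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def init_distributed(args, env=os.environ):
    """Join the process group; needs MASTER_ADDR and MASTER_PORT to be set."""
    # Imported lazily so argument parsing can be tested without PyTorch.
    import torch.distributed as dist

    settings = rendezvous_settings(env)
    dist.init_process_group(
        backend=args.backend,
        init_method="env://",
        rank=settings["rank"],
        world_size=settings["world_size"],
    )
    return settings
```

Keeping argument parsing separate from process-group setup makes it possible to check the command line locally before submitting a multi-node run.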

In [8]:
# Shows output of the run on stdout.
run.wait_for_completion(show_output=True)

RunId: train-on-amlarc-with-nfs_1629907282_1e5fef56
Web View: https://ml.azure.com/runs/train-on-amlarc-with-nfs_1629907282_1e5fef56?wsid=/subscriptions/4aaa645c-5ae2-4ae9-a17a-84b9023bc56a/resourcegroups/dasommer/workspaces/dasommer-ml-eus&tid=72f988bf-86f1-41af-91ab-2d7cd011db47

Streaming azureml-logs/55_azureml-execution-tvmps_ff698f97a874dd9a67b874c4963a6dbc-ps-0_d.txt

2021-08-25 16:03:07 bash: /azureml-envs/azureml_9a80c1e51ee3bc159c49887413775b4b/lib/libtinfo.so.5: no version information available (required by bash)
2021-08-25 16:03:07 ++ dirname /dlws-scripts/bootstrap.sh
2021-08-25 16:03:07 + CWD=/dlws-scripts
2021-08-25 16:03:07 + hostname
2021-08-25 16:03:07 aks-agentpool-11828563-vmss000001
2021-08-25 16:03:07 + whoami
2021-08-25 16:03:07 root
2021-08-25 16:03:07 + . /dlts-runtime/env/init.env
2021-08-25 16:03:07 ++ export DLTS_SD_SELF_IP=10.240.0.5
2021-08-25 16:03:07 ++ DLTS_SD_SELF_IP=10.240.0.5
2021-08-25 16:03:07 ++ export DLTS_SD_SELF_SSH_PORT=42042
2021-08-25 16:03:

ActivityFailedException: ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "User program failed with RuntimeError: Dataset not found. You can use download=True to download it",
        "messageParameters": {},
        "detailsUri": "https://aka.ms/azureml-run-troubleshooting",
        "details": []
    },
    "time": "0001-01-01T00:00:00.000Z"
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"User program failed with RuntimeError: Dataset not found. You can use download=True to download it\",\n        \"messageParameters\": {},\n        \"detailsUri\": \"https://aka.ms/azureml-run-troubleshooting\",\n        \"details\": []\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\"\n}"
    }
}
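
torchvision raises `Dataset not found` when it cannot find the dataset files under the root it was given and `download` is not enabled. One possible cause here, stated as an assumption rather than a diagnosis, is that the path passed via `--data-dir` is not mounted at the same location inside the job's containers. A pre-flight check at the top of the training script can make that failure easier to read. The helper below is a hypothetical sketch using only the standard library.

```python
import os
import socket


def require_data_dir(data_dir):
    """Fail early, with the host name, if the data directory is missing or empty."""
    host = socket.gethostname()
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(
            f"Data directory {data_dir!r} does not exist on host {host!r}; "
            "check that the NFS share is mounted at this path for the job."
        )
    entries = sorted(os.listdir(data_dir))
    if not entries:
        raise FileNotFoundError(
            f"Data directory {data_dir!r} on host {host!r} is empty; "
            "check that the dataset was downloaded to the NFS share."
        )
    return entries
```

Calling `require_data_dir(args.data_dir)` before constructing the dataset would surface the node name and path in the run logs instead of the generic torchvision message.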

In [None]:
run.get_metrics()