Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Train using Distributed PyTorch on Azure Arc-enabled Machine Learning with NFS-mounted data

This example notebook demonstrates how to train a deep learning model using PyTorch and data stored on an NFS server. It covers the following steps:

* Set up an NFS Server
* Download training data to the NFS Server
* Configure NFS Server mounts on your Kubernetes Cluster
* Set up your connection to Azure Machine Learning
* Create the necessary Azure Machine Learning objects
* Submit a Training Run

## Set up an NFS Server
This notebook assumes that you either have access to an existing NFS server or know how to set one up. Setting up and configuring NFS is beyond the scope of this example. To complete this notebook, you need the address of your NFS server and a local mount of its share that this notebook can read from and write to.

Once you have a working NFS server mount, set the `nfs_mount_path` variable below to point to it.

In [None]:
nfs_mount_path = '/nfs_share'

import os
mnist_dir = os.path.join(nfs_mount_path, 'mnist')
os.makedirs(mnist_dir, exist_ok=True)
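
Before downloading data, it can help to confirm that the mount is actually usable. The following is a minimal sketch, not part of the original notebook: `check_share_writable` is a hypothetical helper, and a temporary directory stands in for `nfs_mount_path` so the example runs anywhere.

```python
import os
import tempfile


def check_share_writable(path):
    """Return True if `path` is an existing directory that accepts new files."""
    if not os.path.isdir(path):
        return False
    try:
        # Create and immediately delete a scratch file to prove write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


# Hypothetical usage: a temporary directory stands in for nfs_mount_path.
with tempfile.TemporaryDirectory() as scratch:
    print(check_share_writable(scratch))
    print(check_share_writable(os.path.join(scratch, 'missing')))
```

In the notebook you would call `check_share_writable(nfs_mount_path)` instead. If you also want to confirm that the path is a mount point rather than an ordinary local directory, `os.path.ismount(nfs_mount_path)` is one option.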

## Download training data to the NFS Server
This step uses torchvision's `datasets.MNIST` utility to download the MNIST training data into the `mnist` directory on the NFS share.

In [None]:
%pip install torchvision==0.7.0

from torchvision import datasets

# Download the MNIST training split into the NFS-backed directory created above
datasets.MNIST(mnist_dir, train=True, download=True)
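
To confirm that the download landed on the share, you can list what was written. This is a sketch under assumptions: `summarize_tree` is a hypothetical helper, and the demo builds a small throwaway directory instead of reading the real `mnist_dir`. The exact file layout torchvision produces depends on its version, so none is asserted here.

```python
import os
import tempfile


def summarize_tree(root):
    """Return sorted (relative_path, size_in_bytes) pairs for every file under `root`."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(entries)


# Hypothetical usage: a throwaway directory stands in for mnist_dir.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, 'raw'))
    with open(os.path.join(demo_root, 'raw', 'sample.bin'), 'wb') as f:
        f.write(b'\x00' * 16)
    for rel_path, size in summarize_tree(demo_root):
        print(f'{size:>10} bytes  {rel_path}')
```

Calling `summarize_tree(mnist_dir)` after the download cell should return a non-empty list; an empty list suggests the data went somewhere other than the NFS share.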

## Configure NFS Server mounts on your Kubernetes Cluster

Follow the [NFS setup instructions](../amlarc-nfs-setup/README.md) to configure your Azure Arc-enabled Machine Learning cluster to mount your NFS server.

## Set up your connection to Azure Machine Learning

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

In [None]:
# Connect to the Workspace described by local configuration
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')
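
`Workspace.from_config()` reads a `config.json` file that identifies the workspace. If the call fails, a quick sanity check of that file can narrow down the cause. The sketch below is illustrative only: `missing_config_keys` is a hypothetical helper, the values are placeholders, and the demo writes its own temporary file rather than touching your real configuration.

```python
import json
import os
import tempfile

# Keys that a workspace config.json is expected to contain.
REQUIRED_KEYS = ('subscription_id', 'resource_group', 'workspace_name')


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Hypothetical usage with placeholder values in a temporary file.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, 'config.json')
    with open(demo_path, 'w') as f:
        json.dump({'subscription_id': '<subscription-id>',
                   'resource_group': '<resource-group>'}, f)
    print(missing_config_keys(demo_path))
```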

## Create the necessary Azure Machine Learning objects

In [None]:
# Create an Experiment
from azureml.core import Experiment
experiment_name = 'train-on-amlarc-with-nfs'
experiment = Experiment(workspace=ws, name=experiment_name)

In [None]:
# Get a Docker-based environment with PyTorch installed
from azureml.core import Environment
from azureml.core.runconfig import DockerConfiguration

env_name = 'AzureML-PyTorch-1.6-CPU'
myenv = Environment.get(workspace=ws, name=env_name)

# Enable Docker
docker_config = DockerConfiguration(use_docker=True)

In [None]:
# Specify the name of an existing Azure Arc-enabled Machine Learning compute target
amlarc_cluster = 'amlarc'
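
A misspelled compute target name typically only surfaces when the run is submitted, so checking it up front can save a round trip. The sketch below assumes you pass it a name-to-target mapping such as the one exposed by `ws.compute_targets`; `require_compute_target` is a hypothetical helper, and the demo uses a plain dictionary as a stand-in.

```python
def require_compute_target(targets, name):
    """Return targets[name], or raise a descriptive error listing the known names."""
    if name not in targets:
        available = ', '.join(sorted(targets)) or '(none)'
        raise ValueError(
            f"Compute target '{name}' not found. Available targets: {available}"
        )
    return targets[name]


# Hypothetical usage: a plain dict stands in for ws.compute_targets.
demo_targets = {'amlarc': 'placeholder-target'}
print(require_compute_target(demo_targets, 'amlarc'))
```

In the notebook, the equivalent call would be `require_compute_target(ws.compute_targets, amlarc_cluster)`.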

## Submit a Training Run

In [None]:
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

# Configure the run. The training script receives the NFS data path set above
# (mnist_dir), so the job must see the NFS share at that same path.
backend = 'Gloo'
dist_config = PyTorchConfiguration(communication_backend=backend, node_count=3)

src = ScriptRunConfig(source_directory='scripts', 
                      script='train.py', 
                      compute_target=amlarc_cluster,
                      environment=myenv,
                      arguments=['--data-dir', mnist_dir, '--backend', backend],
                      docker_runtime_config=docker_config,
                      distributed_job_config=dist_config)
 
run = experiment.submit(config=src)
run
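
The `scripts/train.py` file is not shown in this notebook. As a rough illustration of how such a script could consume the `--data-dir` and `--backend` arguments passed above, here is a minimal sketch. Everything in it is an assumption rather than the actual script: the function names are hypothetical, and it assumes the job launcher exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for `torch.distributed`'s `env://` initialization.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the two arguments that the ScriptRunConfig above passes to the script."""
    parser = argparse.ArgumentParser(description='Hypothetical train.py entry point')
    parser.add_argument('--data-dir', required=True, help='Directory holding the MNIST data')
    parser.add_argument('--backend', default='gloo', help='torch.distributed backend name')
    args = parser.parse_args(argv)
    # Normalize to the lowercase form that torch.distributed uses ('gloo', 'nccl').
    args.backend = args.backend.lower()
    return args


def read_rendezvous(environ=os.environ):
    """Read the variables that env:// initialization relies on (defaults are assumptions)."""
    return {
        'rank': int(environ.get('RANK', 0)),
        'world_size': int(environ.get('WORLD_SIZE', 1)),
        'master_addr': environ.get('MASTER_ADDR', '127.0.0.1'),
        'master_port': environ.get('MASTER_PORT', '29500'),
    }


def init_distributed(backend):
    """Join the process group; requires PyTorch and the rendezvous variables to be set."""
    # Imported lazily so the helpers above can be exercised without PyTorch installed.
    import torch.distributed as dist
    dist.init_process_group(backend=backend, init_method='env://')


# Hypothetical usage: parse the same arguments the notebook submits.
args = parse_args(['--data-dir', '/nfs_share/mnist', '--backend', 'Gloo'])
print(args.data_dir, args.backend)
print(read_rendezvous({'RANK': '2', 'WORLD_SIZE': '3'}))
```

A real training script would call `init_distributed(args.backend)` before building its model and data loaders, then load MNIST from `args.data_dir`.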

Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).

In [None]:
# Stream the run's output to stdout and block until the run completes.
run.wait_for_completion(show_output=True)

In [None]:
# Retrieve the metrics logged to the run
run.get_metrics()