# Train TensorFlow Model Distributed on Batch AI
In this notebook we will train a TensorFlow model ([ResNet50](https://arxiv.org/abs/1512.03385)) in a distributed fashion using [Horovod](https://github.com/uber/horovod) on the ImageNet dataset. This tutorial will take you through the following steps:
 * [Create Azure Resources](#azure_resources)
 * [Create Fileserver (NFS)](#create_fileshare)
 * [Configure Batch AI Cluster](#configure_cluster)
 * [Submit and Monitor Job](#job)
 * [Clean Up Resources](#clean_up)

In [1]:
import sys
sys.path.append("../common") 

from dotenv import dotenv_values, set_key, find_dotenv, get_key
from getpass import getpass
import os
import json
from utils import get_password, write_json_to_file, dotenv_for

Below are the variables that describe our experiment. By default we use NC24rs_v3 (Standard_NC24rs_v3) VMs, which have V100 GPUs and InfiniBand. We use 2 nodes with 4 GPUs each, for a total of 8 GPUs. Feel free to increase the number of nodes, but be aware of the quota limits of your subscription.

Set USE_FAKE to True if you want to use fake data rather than the ImageNet dataset. This is often a good way to debug your model and to measure how much overhead IO adds.

In [2]:
# Variables for Batch AI - change as necessary
ID                     = "ddtftestyz"
GROUP_NAME             = f"batch{ID}rg"
STORAGE_ACCOUNT_NAME   = f"batch{ID}st"
FILE_SHARE_NAME        = f"batch{ID}share"
SELECTED_SUBSCRIPTION  = "Team Danielle Internal" #"<YOUR SUBSCRIPTION>"
WORKSPACE              = "workspace"
NUM_NODES              = 2
CLUSTER_NAME           = "yzhang100"
VM_SIZE                = "Standard_NC24rs_v3"
GPU_TYPE               = "V100"
PROCESSES_PER_NODE     = 4
LOCATION               = "eastus"
NFS_NAME               = f"batch{ID}nfs"
EXPERIMENT             = f"distributed_tensorflow_{GPU_TYPE}"
USERNAME               = "batchai_user"
USE_FAKE               = False
DOCKERHUB              = "yzhang001" #"<YOUR DOCKERHUB>"

In [3]:
FAKE='-env FAKE=True' if USE_FAKE else ''
TOTAL_PROCESSES = PROCESSES_PER_NODE * NUM_NODES
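
The derived names and the process count above can be sanity-checked before any Azure resource is created. The sketch below is an illustration and not part of the original notebook: `check_config` is a hypothetical helper that mirrors the f-strings in the variables cell, and the one external rule it relies on is that Azure storage account names must be 3 to 24 characters of lowercase letters and digits.

```python
import re

def check_config(exp_id, num_nodes, processes_per_node):
    """Derive the resource names used in this notebook and validate them.

    Hypothetical helper: it mirrors the f-strings in the variables cell and
    returns the total number of MPI processes (one per GPU).
    """
    storage_account = f"batch{exp_id}st"
    # Azure storage account names: 3-24 chars, lowercase letters and digits only.
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_account):
        raise ValueError(f"Invalid storage account name: {storage_account!r}")
    if num_nodes < 1 or processes_per_node < 1:
        raise ValueError("num_nodes and processes_per_node must be positive")
    return {
        "group": f"batch{exp_id}rg",
        "storage_account": storage_account,
        "file_share": f"batch{exp_id}share",
        "nfs": f"batch{exp_id}nfs",
        "total_processes": num_nodes * processes_per_node,
    }

config = check_config("ddtftestyz", num_nodes=2, processes_per_node=4)
print(config["storage_account"], config["total_processes"])
```

With the defaults from the variables cell this gives 8 processes, which matches TOTAL_PROCESSES.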

<a id='azure_resources'></a>
## Create Azure Resources
First we need to log in to our Azure account. 

In [4]:
!az login -o table

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AFZLBYPLW to authenticate.
CloudName    IsDefault    Name                                                State    TenantId
-----------  -----------  --------------------------------------------------  -------  ------------------------------------
AzureCloud   True         Boston DS Dev                                       Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Solution Template Testing                           Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        ADS Demo Subscription                               Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Energy Solution Accelerator                         Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Team Danielle Internal                              Enabled  72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False      

If you have more than one Azure subscription you will need to select one with the command below. If you only have one subscription you can skip this step.

In [5]:
!az account set --subscription "$SELECTED_SUBSCRIPTION"

In [32]:
!az account list -o table

Name                                                CloudName    SubscriptionId                        State    IsDefault
--------------------------------------------------  -----------  ------------------------------------  -------  -----------
Boston DS Dev                                       AzureCloud   0ca618d2-22a8-413a-96d0-0f1b531129c3  Enabled  False
Solution Template Testing                           AzureCloud   3bcfa59c-82a0-44f9-ac08-b3479370bace  Enabled  False
ADS Demo Subscription                               AzureCloud   9f156ff1-0bac-4c28-adcd-60bd97ff0cfc  Enabled  False
Energy Solution Accelerator                         AzureCloud   a0691237-17b5-4b11-a762-63d8d3fecfd6  Enabled  False
Team Danielle Internal                              AzureCloud   edf507a2-6235-46c5-b560-fd463ba2e771  Enabled  True
Azure Stack Diagnostics CI and Production VaaS      AzureCloud   a8183b2d-7a4c-45e9-8736-dac11b84ff14  Enabled  False
Core-ES-BLD                            

Next we create the group that will hold all our Azure resources.

In [7]:
!az group create -n $GROUP_NAME -l $LOCATION -o table

Location    Name
----------  -----------------
eastus      batchddtftestyzrg


Next we create the storage account that will host our fileshare, where all the outputs from the jobs will be stored.

In [8]:
json_data = !az storage account create -l $LOCATION -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME --sku Standard_LRS
print('Storage account {} provisioning state: {}'.format(STORAGE_ACCOUNT_NAME, 
                                                         json.loads(''.join(json_data))['provisioningState']))

Storage account batchddtftestyzst provisioning state: Succeeded


In [9]:
json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME
storage_account_key = json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['value']
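
The cell above drops any line containing 'WARNING' before parsing, because the CLI can mix notices into the captured output. Since the same pattern appears again later in this notebook, it could be wrapped in a small helper. This is a minimal sketch: `parse_az_json` is a hypothetical name and the sample lines are made up for illustration.

```python
import json

def parse_az_json(lines):
    """Join captured CLI output lines into JSON, skipping WARNING lines.

    `lines` is a list of strings, as produced by `output = !az ...` in IPython.
    """
    payload = "".join(line for line in lines if "WARNING" not in line)
    return json.loads(payload)

# Made-up sample output for illustration; real keys are much longer.
sample = [
    "WARNING: some notice printed by the CLI",
    "[",
    '  {"keyName": "key1", "permissions": "Full", "value": "not-a-real-key"},',
    '  {"keyName": "key2", "permissions": "Full", "value": "also-not-real"}',
    "]",
]
first_key = parse_az_json(sample)[0]["value"]
print(first_key)
```

Note that this filter would also discard a JSON line that happens to contain the text WARNING, so it is only suitable for output like the above.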

In [10]:
!az storage share create --account-name $STORAGE_ACCOUNT_NAME \
--account-key $storage_account_key --name $FILE_SHARE_NAME

{
  "created": false
}


In [11]:
!az storage directory create --share-name $FILE_SHARE_NAME  --name scripts \
--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key

{
  "created": false
}


Here we set some defaults so we don't have to keep adding them to every command.

In [12]:
!az configure --defaults location=$LOCATION
!az configure --defaults group=$GROUP_NAME

In [34]:
%env AZURE_STORAGE_ACCOUNT $STORAGE_ACCOUNT_NAME
%env AZURE_STORAGE_KEY=$storage_account_key

env: AZURE_STORAGE_ACCOUNT=batchddtftestyzst
env: AZURE_STORAGE_KEY=<redacted>


#### Create Workspace
Batch AI has the concept of workspaces and experiments. Below we will create the workspace for our work.

In [14]:
!az batchai workspace create -n $WORKSPACE -g $GROUP_NAME

{
  "creationTime": "2018-08-14T16:58:56.865000+00:00",
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace",
  "location": "eastus",
  "name": "workspace",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-08-14T16:58:56.865000+00:00",
  "resourceGroup": "batchddtftestyzrg",
  "tags": null,
  "type": "Microsoft.BatchAI/workspaces"
}


<a id='create_fileshare'></a>
## Create Fileserver
In this example we will store the data on an NFS fileshare. It is possible to use many storage solutions with Batch AI; NFS offers the best trade-off between performance and ease of use. The best performance is achieved by loading the data locally, but this can be cumbersome since it requires that the data is downloaded by all the nodes, which with the ImageNet dataset can take hours.

In [36]:
!az batchai file-server create -n $NFS_NAME --disk-count 4 --disk-size 250 -w $WORKSPACE \
-s Standard_DS4_v2 -u $USERNAME -p {get_password(dotenv_for())} -g $GROUP_NAME --storage-sku Premium_LRS

{
  "creationTime": "2018-08-15T13:13:37.339000+00:00",
  "dataDisks": {
    "cachingType": "none",
    "diskCount": 4,
    "diskSizeInGb": 250,
    "storageAccountType": "Premium_LRS"
  },
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace/fileservers/batchddtftestyznfs",
  "mountSettings": {
    "fileServerInternalIp": "10.0.0.4",
    "fileServerPublicIp": "137.117.110.238",
    "mountPoint": "/data"
  },
  "name": "batchddtftestyznfs",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-08-15T13:20:32.858000+00:00",
  "resourceGroup": "batchddtftestyzrg",
  "sshConfiguration": {
    "publicIpsToAllow": null,
    "userAccountSettings": {
      "adminUserName": "batchai_user",
      "adminUserPassword": null,
      "adminUserSshPublicKey": null
    }
  },
  "subnet": {
    "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/filese

In [37]:
!az batchai file-server list -o table -w $WORKSPACE -g $GROUP_NAME

Name                Resource Group     Size             Disks       Public IP        Internal IP    Mount Point
------------------  -----------------  ---------------  ----------  ---------------  -------------  -------------
batchddtftestyznfs  batchddtftestyzrg  Standard_DS4_v2  4 x 250 Gb  137.117.110.238  10.0.0.4       /data


In [38]:
json_data = !az batchai file-server list -w $WORKSPACE -g $GROUP_NAME
nfs_ip=json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['mountSettings']['fileServerPublicIp']

After we have created the NFS share we need to copy the data to it. To do this we write the script below which will be executed on the fileserver. It installs a tool called azcopy and then downloads and extracts the data to the appropriate directory.

In [39]:
%%writefile nodeprep.sh
#!/usr/bin/env bash
wget https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy
chmod 777 install_azcopy
sudo ./install_azcopy

mkdir -p /data/imagenet
azcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.csv \
        --destination  /data/imagenet/validation.csv\
        --source-sas "?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=7x3rN7c/nlXbnZ0gAFywd5Er3r6MdwCq97Vwvda25WE%3D"\
        --quiet

azcopy --source https://datasharesa.blob.core.windows.net/imagenet/validation.tar.gz \
        --destination  /data/imagenet/validation.tar.gz\
        --source-sas "?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=zy8L4shZa3XXBe152hPnhXsyfBqCufDOz01a9ZHWU28%3D"\
        --quiet

azcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.csv \
        --destination  /data/imagenet/train.csv\
        --source-sas "?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=EUcahDDZcefOKtHoVWDh7voAC1BoxYNM512spFmjmDU%3D"\
        --quiet

azcopy --source https://datasharesa.blob.core.windows.net/imagenet/train.tar.gz \
        --destination  /data/imagenet/train.tar.gz\
        --source-sas "?se=2025-01-01&sp=r&sv=2017-04-17&sr=b&sig=qP%2B7lQuFKHo5UhQKpHcKt6p5fHT21lPaLz1O/vv4FNU%3D"\
        --quiet

cd /data/imagenet
tar -xzf train.tar.gz
tar -xzf validation.tar.gz

Overwriting nodeprep.sh
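
The four azcopy calls in the script differ only in the blob name and its SAS token. As an illustration, the command text could be generated from a list of blob names. This is a sketch under assumptions: `azcopy_command` is a hypothetical helper, the flags are copied from the script above, and `<SAS_TOKEN>` is a placeholder to be replaced with the real SAS strings.

```python
def azcopy_command(blob_name, sas_token,
                   container_url="https://datasharesa.blob.core.windows.net/imagenet",
                   dest_dir="/data/imagenet"):
    """Return one azcopy invocation in the same form as nodeprep.sh."""
    return (
        f"azcopy --source {container_url}/{blob_name} \\\n"
        f"        --destination {dest_dir}/{blob_name} \\\n"
        f'        --source-sas "{sas_token}" \\\n'
        f"        --quiet"
    )

# Placeholder tokens: substitute the SAS strings from the script above.
blobs = {name: "<SAS_TOKEN>" for name in
         ["validation.csv", "validation.tar.gz", "train.csv", "train.tar.gz"]}
script_body = "\n\n".join(azcopy_command(n, t) for n, t in blobs.items())
print(script_body)
```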


Next we will copy the script over and run it on the NFS VM. This will install azcopy, then download and prepare the data.

In [40]:
USERNAME

'batchai_user'

In [41]:
nfs_ip

'137.117.110.238'

In [42]:
!sshpass -p {get_password(dotenv_for())} scp -o "StrictHostKeyChecking=no" nodeprep.sh $USERNAME@{nfs_ip}:~/



In [43]:
!sshpass -p {get_password(dotenv_for())} ssh -o "StrictHostKeyChecking=no" $USERNAME@{nfs_ip} "sudo chmod 777 ~/nodeprep.sh && ./nodeprep.sh"

--2018-08-15 13:27:32--  https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.32.133
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.32.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 481 [text/plain]
Saving to: ‘install_azcopy’

     0K                                                       100%  118M=0s

2018-08-15 13:27:32 (118 MB/s) - ‘install_azcopy’ saved [481/481]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   983  100   983    0     0   4568      0 --:--:-- --:--:-- --:--:--  4572
Get:1 http://security.ubuntu.com/ubuntu xenial-security InRelease [107 kB]
Get:2 https://packages.microsoft.com/repos/microsoft-ubuntu-xenial-prod xenial InRelease [2,

Processing triggers for libc-bin (2.23-0ubuntu10) ...
--2018-08-15 13:28:19--  https://aka.ms/downloadazcopyprlinux
Resolving aka.ms (aka.ms)... 23.212.169.122
Connecting to aka.ms (aka.ms)|23.212.169.122|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://azcopy.azureedge.net/azcopy-7-1-0-netcorepreview/azcopy_7.1.0-netcorepreview_all.tar.gz [following]
--2018-08-15 13:28:20--  https://azcopy.azureedge.net/azcopy-7-1-0-netcorepreview/azcopy_7.1.0-netcorepreview_all.tar.gz
Resolving azcopy.azureedge.net (azcopy.azureedge.net)... 72.21.81.200, 2606:2800:11f:17a5:191a:18d5:537:22f9
Connecting to azcopy.azureedge.net (azcopy.azureedge.net)|72.21.81.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3841375 (3.7M) [application/octet-stream]
Saving to: ‘azcopy.tar.gz’

     0K .......... .......... .......... .......... ..........  1% 5.55M 1s
    50K .......... .......... .......... .......... ..........  2%  282M 0s


sent 11,682,997 bytes  received 1,290 bytes  23,368,574.00 bytes/sec
total size is 11,675,344  speedup is 1.00
[2018/08/15 13:28:22] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:00:01
[2018/08/15 13:28:47] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:00:25
[2018/08/15 13:29:03] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:00:01
[2018/08/15 13:41:33] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:12:21


Next we create our experiment.

In [45]:
EXPERIMENT

'distributed_tensorflow_V100'

In [44]:
!az batchai experiment create -n $EXPERIMENT -g $GROUP_NAME -w $WORKSPACE

{
  "creationTime": "2018-08-15T14:16:59.595000+00:00",
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace/experiments/distributed_tensorflow_v100",
  "name": "distributed_tensorflow_v100",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-08-15T14:16:59.595000+00:00",
  "resourceGroup": "batchddtftestyzrg",
  "type": "Microsoft.BatchAI/workspaces/experiments"
}

<a id='configure_cluster'></a>
## Configure Batch AI Cluster
We then upload the scripts we wish to execute onto the fileshare. The fileshare will later be mounted by Batch AI. An alternative to uploading the scripts would be to embed them inside the Docker container.

In [47]:
!az storage file upload --share-name $FILE_SHARE_NAME --source ./src/imagenet_estimator_tf_horovod.py --path scripts
!az storage file upload --share-name $FILE_SHARE_NAME --source ./src/resnet_model.py --path scripts
!az storage file upload --share-name $FILE_SHARE_NAME --source ../common/timer.py --path scripts

Finished[#############################################################]  100.0000%
Finished[#############################################################]  100.0000%
Finished[#############################################################]  100.0000%


Below is the command to create the cluster.

In [48]:
!az batchai cluster create \
    -w $WORKSPACE \
    --name $CLUSTER_NAME \
    --image UbuntuLTS \
    --vm-size $VM_SIZE \
    --min $NUM_NODES --max $NUM_NODES \
    --afs-name $FILE_SHARE_NAME \
    --afs-mount-path extfs \
    --user-name $USERNAME \
    --password {get_password(dotenv_for())} \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs 

{
  "allocationState": "resizing",
  "allocationStateTransitionTime": "2018-08-15T14:24:04.869000+00:00",
  "creationTime": "2018-08-15T14:24:04.869000+00:00",
  "currentNodeCount": 0,
  "errors": null,
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/yzhang100",
  "name": "yzhang100",
  "nodeSetup": {
    "mountVolumes": {
      "azureBlobFileSystems": null,
      "azureFileShares": [
        {
          "accountName": "batchddtftestyzst",
          "azureFileUrl": "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare",
          "credentials": {
            "accountKey": null,
            "accountKeySecretReference": null
          },
          "directoryMode": "0777",
          "fileMode": "0777",
          "relativeMountPath": "extfs"
        }
      ],
      "fileServers": [
        {
          "fileServer": {
            "id": "/subscriptions/edf5

Let's check that the cluster was created successfully.

In [49]:
!az batchai cluster show -n $CLUSTER_NAME -w $WORKSPACE

{
  "allocationState": "steady",
  "allocationStateTransitionTime": "2018-08-15T14:26:33.091000+00:00",
  "creationTime": "2018-08-15T14:24:04.869000+00:00",
  "currentNodeCount": 2,
  "errors": null,
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/yzhang100",
  "name": "yzhang100",
  "nodeSetup": {
    "mountVolumes": {
      "azureBlobFileSystems": null,
      "azureFileShares": [
        {
          "accountName": "batchddtftestyzst",
          "azureFileUrl": "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare",
          "credentials": {
            "accountKey": null,
            "accountKeySecretReference": null
          },
          "directoryMode": "0777",
          "fileMode": "0777",
          "relativeMountPath": "extfs"
        }
      ],
      "fileServers": [
        {
          "fileServer": {
            "id": "/subscript

In [50]:
!az batchai cluster list -w $WORKSPACE -o table

Name       Resource Group     Workspace    VM Size             State    Idle    Running    Preparing    Leaving    Unusable
---------  -----------------  -----------  ------------------  -------  ------  ---------  -----------  ---------  ----------
yzhang100  batchddtftestyzrg  workspace    STANDARD_NC24RS_V3  steady   0       0          2            0          0


In [51]:
!az batchai cluster node list -c $CLUSTER_NAME -w $WORKSPACE -o table

ID                                IP            SSH Port
--------------------------------  ------------  ----------
tvm-587366007_1-20180815t142631z  40.87.83.215  50001
tvm-587366007_2-20180815t142631z  40.87.83.215  50000


<a id='job'></a>
## Submit and Monitor Job
Below we specify the job we wish to execute.  

In [52]:
jobs_dict = {
  "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2017-09-01-preview/job.json",
  "properties": {
    "nodeCount": NUM_NODES,
    "customToolkitSettings": {
      "commandLine": f"source /opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpivars.sh; \
      echo $AZ_BATCH_HOST_LIST; \
      mpirun -n {TOTAL_PROCESSES} -ppn {PROCESSES_PER_NODE} -hosts $AZ_BATCH_HOST_LIST \
      -env I_MPI_FABRICS=dapl \
      -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 \
      -env I_MPI_DYNAMIC_CONNECTION=0 \
      -env I_MPI_DEBUG=6 \
      -env I_MPI_HYDRA_DEBUG=on \
      -env DISTRIBUTED=True \
      {FAKE} \
      python -u $AZ_BATCHAI_INPUT_SCRIPTS/imagenet_estimator_tf_horovod.py"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/extfs",
    "inputDirectories": [{
        "id": "SCRIPTS",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/extfs/scripts"
      },
      {
        "id": "TRAIN",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet",
      },
      {
        "id": "TEST",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet",
      },
    ],
    "outputDirectories": [{
        "id": "MODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/extfs",
        "pathSuffix": "Models"
    }],
    "containerSettings": {
      "imageSourceRegistry": {
        "image": f"{DOCKERHUB}/distributed-training.horovod-tf"
      }
    }
  }
}
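
The `commandLine` string above packs several pieces into one line: the process counts, the host list and one `-env` flag per variable. The sketch below shows how the `mpirun` part could be assembled from those pieces. `build_mpirun_command` is a hypothetical helper and not part of the original notebook; the flag names and values are taken from the job definition above.

```python
def build_mpirun_command(total_processes, processes_per_node, env, script,
                         hosts="$AZ_BATCH_HOST_LIST"):
    """Assemble an Intel MPI `mpirun` call from its parts.

    `env` maps variable names to values; each entry becomes `-env NAME=VALUE`.
    """
    env_flags = " ".join(f"-env {name}={value}" for name, value in env.items())
    return (f"mpirun -n {total_processes} -ppn {processes_per_node} "
            f"-hosts {hosts} {env_flags} python -u {script}")

mpi_env = {
    "I_MPI_FABRICS": "dapl",
    "I_MPI_DAPL_PROVIDER": "ofa-v2-ib0",
    "I_MPI_DYNAMIC_CONNECTION": 0,
    "I_MPI_DEBUG": 6,
    "I_MPI_HYDRA_DEBUG": "on",
    "DISTRIBUTED": True,
}
use_fake = False  # mirrors USE_FAKE in the variables cell
if use_fake:
    mpi_env["FAKE"] = True

command = build_mpirun_command(
    8, 4, mpi_env, "$AZ_BATCHAI_INPUT_SCRIPTS/imagenet_estimator_tf_horovod.py")
print(command)
```

Keeping the environment variables in a dictionary makes it easier to toggle a single setting, such as the fake-data flag, without editing a long string.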

In [53]:
write_json_to_file(jobs_dict, 'job.json')
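
`write_json_to_file` is imported from the `utils` module in `../common`, which is not shown in this notebook. For readers without that module, a stand-in could look like the sketch below. This assumes the helper simply serializes a dictionary to disk; the name `write_json_to_file_sketch` and the formatting options are assumptions, not the original implementation.

```python
import json
import os
import tempfile

def write_json_to_file_sketch(json_dict, filename, mode="w"):
    """Serialize `json_dict` to `filename` as indented JSON (assumed behavior)."""
    with open(filename, mode) as outfile:
        json.dump(json_dict, outfile, indent=4, sort_keys=True)
        outfile.write("\n")

# Small made-up job fragment, written to a temporary directory.
example_job = {"properties": {"nodeCount": 2}}
path = os.path.join(tempfile.mkdtemp(), "job.json")
write_json_to_file_sketch(example_job, path)

# Read the file back to confirm it round-trips.
with open(path) as infile:
    round_tripped = json.load(infile)
print(round_tripped == example_job)
```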

In [54]:
JOB_NAME='tf-horovod-{}'.format(NUM_NODES*PROCESSES_PER_NODE)

We now submit the job to Batch AI.

In [55]:
!az batchai job create -n $JOB_NAME --cluster $CLUSTER_NAME -w $WORKSPACE -e $EXPERIMENT -f job.json

{
  "caffe2Settings": null,
  "caffeSettings": null,
  "chainerSettings": null,
  "cluster": {
    "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchddtftestyzrg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/yzhang100",
    "resourceGroup": "batchddtftestyzrg"
  },
  "cntkSettings": null,
  "constraints": {
    "maxWallClockTime": "7 days, 0:00:00"
  },
  "containerSettings": {
    "imageSourceRegistry": {
      "credentials": null,
      "image": "yzhang001/distributed-training.horovod-tf",
      "serverUrl": null
    },
    "shmSize": null
  },
  "creationTime": "2018-08-15T14:28:26.038000+00:00",
  "customMpiSettings": null,
  "customToolkitSettings": {
    "commandLine": "source /opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/mpivars.sh;       echo $AZ_BATCH_HOST_LIST;       mpirun -n 8 -ppn 4 -hosts $AZ_BATCH_HOST_LIST       -env I_MPI_FABRICS=dapl       -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0       -env I_MP

With the command below we can check the status of the job.

In [60]:
!az batchai job list -w $WORKSPACE -e $EXPERIMENT -o table

Name          Cluster    Cluster RG         Cluster Workspace    Tool    Nodes    State      Exit code
------------  ---------  -----------------  -------------------  ------  -------  ---------  -----------
tf-horovod-8  yzhang100  batchddtftestyzrg  workspace            custom  2        succeeded  0
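
Rather than re-running the status command by hand, the check can be wrapped in a polling loop. The sketch below is generic and does not call the CLI itself: `wait_for_state` is a hypothetical helper, and `fetch_state` stands for any function that returns the current job state, for example by running `az batchai job show` and reading the state from its JSON. Apart from `succeeded`, which appears in the table above, the state names used here are assumptions.

```python
import time

TERMINAL_STATES = {"succeeded", "failed"}

def wait_for_state(fetch_state, terminal_states=TERMINAL_STATES,
                   poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Call `fetch_state()` until it returns a terminal state or time runs out."""
    waited = 0
    while True:
        state = fetch_state()
        if state in terminal_states:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {state!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Stand-in for a function that would query the job state through the CLI.
fake_states = iter(["queued", "running", "running", "succeeded"])
final_state = wait_for_state(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

Passing `sleep` as a parameter keeps the loop testable without waiting; in the notebook the default `time.sleep` would be used.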


To view the files that the job has generated, use the command below.

In [59]:
!az batchai job file list -w $WORKSPACE -e $EXPERIMENT --job $JOB_NAME --output-directory-id stdouterr

[
  {
    "contentLength": 13055,
    "downloadUrl": "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchddtftestyzrg/workspaces/workspace/experiments/distributed_tensorflow_v100/jobs/tf-horovod-8/2837b74b-1269-424d-a17f-58918a8f70db/stdouterr/execution-tvm-587366007_1-20180815t142631z.log?sv=2016-05-31&sr=f&sig=cbZqwemJ7j3v88eE8kqyMMp979zWNahCy3vkkvLVi%2Bk%3D&se=2018-08-15T15%3A57%3A34Z&sp=rl",
    "fileType": "file",
    "lastModified": "2018-08-15T14:38:17+00:00",
    "name": "execution-tvm-587366007_1-20180815t142631z.log"
  },
  {
    "contentLength": 19287,
    "downloadUrl": "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchddtftestyzrg/workspaces/workspace/experiments/distributed_tensorflow_v100/jobs/tf-horovod-8/2837b74b-1269-424d-a17f-58918a8f70db/stdouterr/execution-tvm-587366007_2-20180815t142631z.log?sv=2016-05-31&sr=f&sig=4yjyCtAiYBSuoY9QhiKx

We are also able to stream the stdout and stderr that our job produces. This is useful for checking the progress of the job as well as for debugging issues.

In [61]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --job $JOB_NAME --output-directory-id stdouterr -f stdout.txt

File found with URL "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchddtftestyzrg/workspaces/workspace/experiments/distributed_tensorflow_v100/jobs/tf-horovod-8/2837b74b-1269-424d-a17f-58918a8f70db/stdouterr/stdout.txt?sv=2016-05-31&sr=f&sig=fu9zqjXgUs%2BcVEjO8x96I0QKS5xgdh0g%2Bx9y%2BSRPuAQ%3D&se=2018-08-15T16%3A08%3A27Z&sp=rl". Start streaming
10.0.0.5,10.0.0.6
[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 3  Build 20170405 (id: 17193)
[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0
[5] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa

In [62]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT --job $JOB_NAME --output-directory-id stdouterr -f stderr.txt

File found with URL "https://batchddtftestyzst.file.core.windows.net/batchddtftestyzshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchddtftestyzrg/workspaces/workspace/experiments/distributed_tensorflow_v100/jobs/tf-horovod-8/2837b74b-1269-424d-a17f-58918a8f70db/stdouterr/stderr.txt?sv=2016-05-31&sr=f&sig=IsGAoIbia9KRCzHSlXr1Pof918B1%2BfiQVZna1ikvwD4%3D&se=2018-08-15T16%3A08%3A41Z&sp=rl". Start streaming
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*args, **kwds)
  return f(*a

We can either wait for the job to complete or delete it with the command below.

In [None]:
!az batchai job delete -w $WORKSPACE -e $EXPERIMENT --name $JOB_NAME -y

<a id='clean_up'></a>
## Clean Up Resources
Next we tidy up the resources we created.  
First we reset the default values we set earlier.

In [None]:
!az configure --defaults group=''
!az configure --defaults location=''

Next we delete the cluster.

In [None]:
!az batchai cluster delete -w $WORKSPACE --name $CLUSTER_NAME -g $GROUP_NAME -y

Once the cluster is deleted you will not incur any cost for the computation but you can still retain your experiments and workspace. If you wish to delete those as well execute the commands below.

In [None]:
!az batchai experiment delete -w $WORKSPACE --name $EXPERIMENT -g $GROUP_NAME -y

In [None]:
!az batchai workspace delete -n $WORKSPACE -g $GROUP_NAME -y

Finally we can delete the group and we will have deleted everything created for this tutorial.

In [None]:
!az group delete --name $GROUP_NAME -y