# Create Azure and Batch AI Resources
In this notebook we will create the resources needed to train a [ResNet50](https://arxiv.org/abs/1512.03385) model on the ImageNet dataset in a distributed fashion using [Horovod](https://github.com/uber/horovod). If you plan to use fake data, you can skip the sections marked optional. This notebook will take you through the following steps:
 * [Create Azure Resources](#azure_resources)
 * [Upload Data to Blob (Optional)](#upload_data)
 * [Create Fileserver (NFS)](#create_fileshare)
 * [Configure Batch AI Cluster](#configure_cluster)

In [4]:
import sys
sys.path.append("common") 

from dotenv import set_key
import os
import json
from utils import get_password, dotenv_for
from pathlib import Path

Below are the variables that describe our experiment. By default we use NC24rs_v3 (Standard_NC24rs_v3) VMs, which have V100 GPUs and InfiniBand. The default configuration is 2 nodes with 4 GPUs each, for a total of 8 GPUs. Feel free to increase the number of nodes, but be aware of the quota limits on your subscription.
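
The GPU count above follows directly from two of the variables defined below. As a minimal sketch of that arithmetic, assuming one training process is launched per GPU (the helper name `total_gpus` is ours and is not part of the notebook):

```python
def total_gpus(num_nodes: int, processes_per_node: int) -> int:
    """Total worker count, assuming one process per GPU on every node."""
    if num_nodes < 1 or processes_per_node < 1:
        raise ValueError("both values must be positive")
    return num_nodes * processes_per_node


# Defaults used in this notebook: NUM_NODES = 2, PROCESSES_PER_NODE = 4.
print(total_gpus(2, 4))  # 8
```

Doubling `NUM_NODES` doubles the number of GPUs requested, so it is worth checking this number against your subscription quota before creating the cluster.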

Set USE_FAKE to True if you want to use fake data rather than the ImageNet dataset. This is often a good way to debug your model and to measure how much overhead IO adds.

In [64]:
# Variables for Batch AI - change as necessary
ID                     = "dtdemo"
GROUP_NAME             = f"batch{ID}rg"
STORAGE_ACCOUNT_NAME   = f"batch{ID}st"
FILE_SHARE_NAME        = f"batch{ID}share"
SELECTED_SUBSCRIPTION  = "Boston Team Danielle"
WORKSPACE              = "workspace"
NUM_NODES              = 2
CLUSTER_NAME           = "msv100"
VM_SIZE                = "Standard_NC24rs_v3"
GPU_TYPE               = "V100"
PROCESSES_PER_NODE     = 4
LOCATION               = "eastus"
NFS_NAME               = f"batch{ID}nfs"
USERNAME               = "batchai_user"
USE_FAKE               = False
DOCKERHUB              = os.getenv('DOCKER_REPOSITORY', "masalvar")
DATA                   = Path("/data")
CONTAINER_NAME         = f"batch{ID}container"
DOCKER_PWD             = "<YOUR_DOCKER_PWD>"

dotenv_path = dotenv_for()
set_key(dotenv_path, 'DOCKER_PWD', DOCKER_PWD)
set_key(dotenv_path, 'GROUP_NAME', GROUP_NAME)
set_key(dotenv_path, 'FILE_SHARE_NAME', FILE_SHARE_NAME)
set_key(dotenv_path, 'WORKSPACE', WORKSPACE)
set_key(dotenv_path, 'NUM_NODES', str(NUM_NODES))
set_key(dotenv_path, 'CLUSTER_NAME', CLUSTER_NAME)
set_key(dotenv_path, 'GPU_TYPE', GPU_TYPE)
set_key(dotenv_path, 'PROCESSES_PER_NODE', str(PROCESSES_PER_NODE))
set_key(dotenv_path, 'STORAGE_ACCOUNT_NAME', STORAGE_ACCOUNT_NAME)

(True, 'STORAGE_ACCOUNT_NAME', 'batchdtdemost')
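
Because `STORAGE_ACCOUNT_NAME` is built from `ID`, a poorly chosen `ID` can produce a name that Azure rejects. Storage account names must be 3 to 24 characters long and contain only lowercase letters and digits. The check below is a small sketch of that rule; the function name is hypothetical and it does not check whether the name is already taken.

```python
import re


def is_valid_storage_account_name(name: str) -> bool:
    """Check the length and character rules for Azure storage account names."""
    return re.fullmatch(r"[a-z0-9]{3,24}", name) is not None


ID = "dtdemo"
print(is_valid_storage_account_name(f"batch{ID}st"))   # True
print(is_valid_storage_account_name("Batch_Demo_St"))  # False
```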

<a id='azure_resources'></a>
## Create Azure Resources
First we need to log in to our Azure account. 

In [9]:
!az login -o table

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AJB7HJE89 to authenticate.
CloudName    IsDefault    Name                                            State     TenantId
-----------  -----------  ----------------------------------------------  --------  ------------------------------------
AzureCloud   True         Visual Studio Enterprise                        Enabled   72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Azure Internal - London                         Enabled   72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Boston Team Danielle                            Enabled   72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Cosmos_WDG_Core_BnB_100348                      Enabled   72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Azure Stack Diagnostics CI and Production VaaS  Enabled   72f988bf-86f1-41af-91ab-2d7cd011db47
AzureCloud   False        Orb Prod           

If you have more than one Azure subscription you will need to select one with the command below. If you only have one subscription you can skip this step.

In [10]:
!az account set --subscription "$SELECTED_SUBSCRIPTION"

In [11]:
!az account list -o table

A few accounts are skipped as they don't have 'Enabled' state. Use '--all' to display them.
Name                                            CloudName    SubscriptionId                        State    IsDefault
----------------------------------------------  -----------  ------------------------------------  -------  -----------
Visual Studio Enterprise                        AzureCloud   fb11e9eb-22e1-4347-8d0a-84ef60157664  Enabled  False
Azure Internal - London                         AzureCloud   1ba81249-8edd-4619-a486-3d28a2176aad  Enabled  False
Boston Team Danielle                            AzureCloud   edf507a2-6235-46c5-b560-fd463ba2e771  Enabled  True
Cosmos_WDG_Core_BnB_100348                      AzureCloud   dae41bd3-9db4-4b9b-943e-832b57cac828  Enabled  False
Azure Stack Diagnostics CI and Production VaaS  AzureCloud   a8183b2d-7a4c-45e9-8736-dac11b84ff14  Enabled  False
Orb Prod                                        AzureCloud   4d2b0ef3-39f9-42c6-9332

Next we create the group that will hold all our Azure resources.

In [12]:
!az group create -n $GROUP_NAME -l $LOCATION -o table

Location    Name
----------  -------------
eastus      batchdtdemorg


Next we create the storage account that will host our file share, where all the outputs from the jobs will be stored.

In [13]:
json_data = !az storage account create -l $LOCATION -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME --sku Standard_LRS
print('Storage account {} provisioning state: {}'.format(STORAGE_ACCOUNT_NAME, 
                                                         json.loads(''.join(json_data))['provisioningState']))

Storage account batchdtdemost provisioning state: Succeeded


In [14]:
json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME
storage_account_key = json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['value']
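
The pattern used above (join the captured output lines, drop any line containing `WARNING`, then parse the rest as JSON) appears more than once in this notebook. A minimal sketch of it as a reusable helper follows; `parse_az_json` is a hypothetical name, and the sample lines are placeholders shaped like the key listing, not real CLI output.

```python
import json


def parse_az_json(lines):
    """Join CLI output lines into one JSON document, skipping warning lines."""
    payload = "".join(line for line in lines if "WARNING" not in line)
    return json.loads(payload)


# Placeholder output: one warning line followed by a JSON list of keys.
sample = [
    "WARNING: example warning line emitted by the CLI",
    "[",
    '  {"keyName": "key1", "value": "<key-1>"},',
    '  {"keyName": "key2", "value": "<key-2>"}',
    "]",
]
print(parse_az_json(sample)[0]["value"])  # <key-1>
```

Filtering on the substring `WARNING` is a heuristic: it would also drop a JSON line that happened to contain that word.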

In [15]:
!az storage share create --account-name $STORAGE_ACCOUNT_NAME \
--account-key $storage_account_key --name $FILE_SHARE_NAME

{
  "created": true
}


In [16]:
!az storage directory create --share-name $FILE_SHARE_NAME  --name scripts \
--account-name $STORAGE_ACCOUNT_NAME --account-key $storage_account_key

{
  "created": true
}


Here we set some defaults so we don't have to keep adding them to every command.

In [61]:
!az configure --defaults location=$LOCATION
!az configure --defaults group=$GROUP_NAME

In [62]:
%env AZURE_STORAGE_ACCOUNT $STORAGE_ACCOUNT_NAME
%env AZURE_STORAGE_KEY=$storage_account_key

env: AZURE_STORAGE_ACCOUNT=batchdtdemost
env: AZURE_STORAGE_KEY=AtQA2uvmxTSvo0SXnI5FjMOXl+qp5fKwNcPL+Y2N0N/0+EhcRt4RhFuXf+YKvG9qDSrB6ZrgNmJ8fgloABMtSQ==


#### Create Workspace
Batch AI has the concept of workspaces and experiments. Below we will create the workspace for our work.

In [31]:
!az batchai workspace create -n $WORKSPACE -g $GROUP_NAME

{
  "creationTime": "2018-12-17T11:28:01.861000+00:00",
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace",
  "location": "eastus",
  "name": "workspace",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-12-17T11:28:01.861000+00:00",
  "resourceGroup": "batchdtdemorg",
  "tags": null,
  "type": "Microsoft.BatchAI/workspaces"
}


<a id='upload_data'></a>
## Upload Data to Blob (Optional)
In this section we will create a blob container and upload the ImageNet data we prepared locally in the previous notebook.

**You only need to run this section if you want to use real data. If USE_FAKE is set to True the commands below won't be executed.**


In [68]:
if USE_FAKE is False:
    !az storage container create --account-name {STORAGE_ACCOUNT_NAME} \
                                 --account-key {storage_account_key} \
                                 --name {CONTAINER_NAME}



In [None]:
if USE_FAKE is False:
    # Should take about 20 minutes
    !azcopy --source {DATA/"train.tar.gz"} \
    --destination https://{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER_NAME}/train.tar.gz \
    --dest-key {storage_account_key} --quiet

Finished: 0 file(s), 0 B; Average Speed:0 B/s.
Finished: 0 file(s), 4 MB; Average Speed:436.59 KB/s.
Finished: 0 file(s), 100 MB; Average Speed:8.75 MB/s.
Finished: 0 file(s), 916 MB; Average Speed:41 MB/s.
...
Finished: 0 file(s), 24.965 GB; Average Speed:119.07 MB/s.
...
Finished: 0 file(s), 48.824 GB; Average Speed:123.7 MB/s.
...
Finished: 0 file(s), 73.559 GB; Average Speed:126.63 MB/s.
...
Finished: 0 file(s), 97.75 GB; Average Speed:128.67 MB/s.
...
Finished: 0 file(s), 118.23 GB; Average Speed:127.79 MB/s.
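
The progress figures give a rough way to sanity-check the "about 20 minutes" estimate. The sketch below converts a transferred size and an average speed into elapsed minutes. It assumes azcopy reports binary units (1 GB = 1024 MB), which is our assumption, and the log above is truncated, so the result covers only the portion of the upload that is shown.

```python
def transfer_minutes(size_gb: float, speed_mb_per_s: float) -> float:
    """Estimated transfer time in minutes, assuming 1 GB = 1024 MB."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    return size_gb * 1024 / speed_mb_per_s / 60


# Last progress line shown above: 118.23 GB at an average of 127.79 MB/s.
print(round(transfer_minutes(118.23, 127.79), 1))  # 15.8
```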

In [24]:
if USE_FAKE is False:
    !azcopy --source {DATA/"validation.tar.gz"} \
    --destination https://{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER_NAME}/validation.tar.gz \
    --dest-key {storage_account_key} --quiet

Finished: 0 file(s), 0 B; Average Speed:0 B/s.
Finished: 0 file(s), 4 MB; Average Speed:450.13 KB/s.
Finished: 0 file(s), 164 MB; Average Speed:14.72 MB/s.
Finished: 0 file(s), 604 MB; Average Speed:45.81 MB/s.
Finished: 0 file(s), 988 MB; Average Speed:64.87 MB/s.
Finished: 0 file(s), 1.012 GB; Average Speed:59.96 MB/s.
Finished: 0 file(s), 1.277 GB; Average Speed:67.68 MB/s.
Finished: 0 file(s), 1.367 GB; Average Speed:65.5 MB/s.

<a id='create_fileshare'></a>
## Create Fileserver
In this example we will store the data on an NFS file server. Batch AI supports many storage solutions, and NFS offers the best tradeoff between performance and ease of use. The best performance is achieved by loading the data from local disk, but this is cumbersome because every node has to download the data, which can take hours with the ImageNet dataset. If you are using fake data the file server won't be used, but we create one anyway so that it is ready if you later want to train on the real ImageNet data.

In [33]:
!az batchai file-server create -n $NFS_NAME --disk-count 4 --disk-size 250 -w $WORKSPACE \
-s Standard_DS4_v2 -u $USERNAME -p {get_password(dotenv_for())} -g $GROUP_NAME --storage-sku Premium_LRS

{
  "creationTime": "2018-12-17T11:28:16.993000+00:00",
  "dataDisks": {
    "cachingType": "none",
    "diskCount": 4,
    "diskSizeInGb": 250,
    "storageAccountType": "Premium_LRS"
  },
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace/fileservers/batchdtdemonfs",
  "mountSettings": {
    "fileServerInternalIp": "10.0.0.4",
    "fileServerPublicIp": "104.211.11.81",
    "mountPoint": "/data"
  },
  "name": "batchdtdemonfs",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-12-17T11:37:33.643000+00:00",
  "resourceGroup": "batchdtdemorg",
  "sshConfiguration": {
    "publicIpsToAllow": null,
    "userAccountSettings": {
      "adminUserName": "batchai_user",
      "adminUserPassword": null,
      "adminUserSshPublicKey": null
    }
  },
  "subnet": {
    "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/fileserverrg-3080f303-f4

In [34]:
!az batchai file-server list -o table -w $WORKSPACE -g $GROUP_NAME

Name            Resource Group    Size             Disks       Public IP      Internal IP    Mount Point
--------------  ----------------  ---------------  ----------  -------------  -------------  -------------
batchdtdemonfs  batchdtdemorg     Standard_DS4_v2  4 x 250 Gb  104.211.11.81  10.0.0.4       /data


In [35]:
json_data = !az batchai file-server list -w $WORKSPACE -g $GROUP_NAME
nfs_ip=json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['mountSettings']['fileServerPublicIp']
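
Taking element `[0]` works here because the workspace contains a single file server. If there were several, a lookup by name would be safer. The sketch below assumes each entry in the list has the same `name`, `provisioningState` and `mountSettings` fields as the creation output shown earlier; `public_ip_for` is a hypothetical helper.

```python
def public_ip_for(file_servers, name):
    """Return the public IP of the named file server once it is provisioned."""
    for server in file_servers:
        if server["name"] == name:
            state = server.get("provisioningState")
            if state != "succeeded":
                raise RuntimeError(f"file server {name!r} is in state {state!r}")
            return server["mountSettings"]["fileServerPublicIp"]
    raise KeyError(f"no file server named {name!r}")


# Sample entry using the values from the output above.
servers = [
    {
        "name": "batchdtdemonfs",
        "provisioningState": "succeeded",
        "mountSettings": {
            "fileServerInternalIp": "10.0.0.4",
            "fileServerPublicIp": "104.211.11.81",
            "mountPoint": "/data",
        },
    }
]
print(public_ip_for(servers, "batchdtdemonfs"))  # 104.211.11.81
```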

After we have created the NFS share we need to copy the data to it. To do this we write the script below which will be executed on the fileserver. It installs a tool called azcopy and then downloads and extracts the data to the appropriate directory.

In [36]:
# Start the string with the shebang so that it is the first line of nodeprep.sh
nodeprep_script = f"""#!/usr/bin/env bash
wget https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy
chmod 777 install_azcopy
sudo ./install_azcopy

mkdir -p /data/imagenet

azcopy --source https://{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER_NAME}/validation.tar.gz \
        --destination  /data/imagenet/validation.tar.gz\
        --source-key {storage_account_key}\
        --quiet


azcopy --source https://{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER_NAME}/train.tar.gz \
        --destination  /data/imagenet/train.tar.gz\
        --source-key {storage_account_key}\
        --quiet

cd /data/imagenet
tar -xzf train.tar.gz
tar -xzf validation.tar.gz
"""

In [37]:
with open('nodeprep.sh', 'w') as f:
    f.write(nodeprep_script)

Next we will copy the script over and run it on the NFS VM. This installs azcopy, then downloads and extracts the data.

In [38]:
if USE_FAKE:
    raise Warning("You should not be running this section if you simply want to use fake data")

In [39]:
if USE_FAKE is False:
    !sshpass -p {get_password(dotenv_for())} scp -o "StrictHostKeyChecking=no" nodeprep.sh $USERNAME@{nfs_ip}:~/

ssh: connect to host 104.211.11.81 port 22: Connection timed out
lost connection
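
A timeout like the one above usually means that port 22 on the file server was not reachable when the cell ran, for example because the VM was still starting or because a firewall blocks outbound SSH. One way to avoid running `scp` too early is to poll the port first. This is a sketch using only the standard library; `wait_for_port` is a hypothetical helper and the default timeouts are arbitrary.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22, timeout: float = 300.0,
                  interval: float = 5.0) -> bool:
    """Return True once a TCP connection succeeds, False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect is enough; close the socket straight away.
            with socket.create_connection((host, port), timeout=min(interval, 10.0)):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (not executed here): wait_for_port(nfs_ip) before calling scp.
```

A successful TCP connection only shows that something is listening on the port; authentication can still fail afterwards.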


In [46]:
if USE_FAKE is False:
    !sshpass -p {get_password(dotenv_for())} ssh -o "StrictHostKeyChecking=no" $USERNAME@{nfs_ip} "sudo chmod 777 ~/nodeprep.sh && ./nodeprep.sh"

--2018-11-28 11:27:12--  https://gist.githubusercontent.com/msalvaris/073c28a9993d58498957294d20d74202/raw/87a78275879f7c9bb8d6fb9de8a2d2996bb66c24/install_azcopy
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.32.133
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.32.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 481 [text/plain]
Saving to: ‘install_azcopy’

     0K                                                       100%  107M=0s

2018-11-28 11:27:12 (107 MB/s) - ‘install_azcopy’ saved [481/481]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   983  100   983    0     0   4168      0 --:--:-- --:--:-- --:--:--  4182
Hit:1 http://azure.archive.ubuntu.com/ubuntu xenial InRelease
Get:2 http://azure.archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]
Get:3 http://azure.archive

Processing triggers for libc-bin (2.23-0ubuntu10) ...
--2018-11-28 11:27:59--  https://aka.ms/downloadazcopyprlinux
Resolving aka.ms (aka.ms)... 23.222.209.19
Connecting to aka.ms (aka.ms)|23.222.209.19|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://azcopy.azureedge.net/azcopy-7-1-0-netcorepreview/azcopy_7.1.0-netcorepreview_all.tar.gz [following]
--2018-11-28 11:27:59--  https://azcopy.azureedge.net/azcopy-7-1-0-netcorepreview/azcopy_7.1.0-netcorepreview_all.tar.gz
Resolving azcopy.azureedge.net (azcopy.azureedge.net)... 72.21.81.200, 2606:2800:11f:17a5:191a:18d5:537:22f9
Connecting to azcopy.azureedge.net (azcopy.azureedge.net)|72.21.81.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3841375 (3.7M) [application/octet-stream]
Saving to: ‘azcopy.tar.gz’

     0K .......... .......... .......... .......... ..........  1% 23.7M 0s
    50K .......... .......... .......... .......... ..........  2%  211M 0s
 


sent 11,683,102 bytes  received 1,290 bytes  23,368,784.00 bytes/sec
total size is 11,675,344  speedup is 1.00
[2018/11/28 11:28:20] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:00:20
[2018/11/28 11:35:38] Transfer summary:
-----------------
Total files transferred: 1
Transfer successfully:   1
Transfer skipped:        0
Transfer failed:         0
Elapsed time:            00.00:07:12


<a id='configure_cluster'></a>
## Configure Batch AI Cluster
We then upload the scripts we wish to execute to the file share, which will later be mounted by Batch AI. An alternative to uploading the scripts would be to embed them in the Docker image.

In [55]:
!az storage file upload --share-name $FILE_SHARE_NAME --source HorovodPytorch/cluster_config/docker.service --path scripts
!az storage file upload --share-name $FILE_SHARE_NAME --source HorovodPytorch/cluster_config/nodeprep.sh --path scripts

Finished[#############################################################]  100.0000%
Finished[#############################################################]  100.0000%


Below is the command to create the cluster.

In [56]:
!az batchai cluster create \
    -w $WORKSPACE \
    --name $CLUSTER_NAME \
    --image UbuntuLTS \
    --vm-size $VM_SIZE \
    --min $NUM_NODES --max $NUM_NODES \
    --afs-name $FILE_SHARE_NAME \
    --afs-mount-path extfs \
    --user-name $USERNAME \
    --password {get_password(dotenv_for())} \
    --storage-account-name $STORAGE_ACCOUNT_NAME \
    --storage-account-key $storage_account_key \
    --nfs $NFS_NAME \
    --nfs-mount-path nfs \
    --config-file HorovodPytorch/cluster_config/cluster.json

{
  "allocationState": "resizing",
  "allocationStateTransitionTime": "2018-12-17T12:56:56.065000+00:00",
  "creationTime": "2018-12-17T12:56:56.065000+00:00",
  "currentNodeCount": 0,
  "errors": null,
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/msv100",
  "name": "msv100",
  "nodeSetup": {
    "mountVolumes": {
      "azureBlobFileSystems": null,
      "azureFileShares": [
        {
          "accountName": "batchdtdemost",
          "azureFileUrl": "https://batchdtdemost.file.core.windows.net/batchdtdemoshare",
          "credentials": {
            "accountKey": null,
            "accountKeySecretReference": null
          },
          "directoryMode": "0777",
          "fileMode": "0777",
          "relativeMountPath": "extfs"
        }
      ],
      "fileServers": [
        {
          "fileServer": {
            "id": "/subscriptions/edf507a2-6235-46c5-b560-f

Let's check that the cluster was created successfully.

In [48]:
!az batchai cluster show -n $CLUSTER_NAME -w $WORKSPACE

{
  "allocationState": "resizing",
  "allocationStateTransitionTime": "2018-12-17T12:35:16.177000+00:00",
  "creationTime": "2018-12-17T12:35:16.177000+00:00",
  "currentNodeCount": 0,
  "errors": null,
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/msv100",
  "name": "msv100",
  "nodeSetup": {
    "mountVolumes": {
      "azureBlobFileSystems": null,
      "azureFileShares": [
        {
          "accountName": "batchdtdemost",
          "azureFileUrl": "https://batchdtdemost.file.core.windows.net/batchdtdemoshare",
          "credentials": {
            "accountKey": null,
            "accountKeySecretReference": null
          },
          "directoryMode": "0777",
          "fileMode": "0777",
          "relativeMountPath": "extfs"
        }
      ],
      "fileServers": [
        {
          "fileServer": {
            "id": "/subscriptions/edf507a2-6235-4

In [58]:
!az batchai cluster list -w $WORKSPACE -o table

Name    Resource Group    Workspace    VM Size             State    Idle    Running    Preparing    Leaving    Unusable
------  ----------------  -----------  ------------------  -------  ------  ---------  -----------  ---------  ----------
msv100  batchdtdemorg     workspace    STANDARD_NC24RS_V3  steady   2       0          0            0          0
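
The JSON output above shows the cluster while it is still `resizing` with zero nodes, whereas the table shows it `steady` with two idle nodes. A small predicate over the parsed `az batchai cluster show` output can make that distinction explicit before jobs are submitted. The sketch below relies only on the `allocationState`, `currentNodeCount` and `errors` fields visible in the output; `cluster_ready` is a hypothetical name.

```python
def cluster_ready(cluster: dict, expected_nodes: int) -> bool:
    """True when allocation has settled with the expected node count and no errors."""
    return (
        cluster.get("allocationState") == "steady"
        and cluster.get("currentNodeCount") == expected_nodes
        and not cluster.get("errors")
    )


resizing = {"allocationState": "resizing", "currentNodeCount": 0, "errors": None}
steady = {"allocationState": "steady", "currentNodeCount": 2, "errors": None}
print(cluster_ready(resizing, 2))  # False
print(cluster_ready(steady, 2))    # True
```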


In [59]:
!az batchai cluster node list -c $CLUSTER_NAME -w $WORKSPACE -o table

ID                                IP             SSH Port
--------------------------------  -------------  ----------
tvm-829305193_1-20181217t125904z  40.121.91.247  50001
tvm-829305193_2-20181217t125904z  40.121.91.247  50000
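
Both nodes share one public IP and are distinguished by their SSH port, so the port must be passed explicitly when connecting to a node. A minimal sketch that builds the corresponding commands from the table above (the helper name is ours; the username is the `USERNAME` value set at the top of the notebook):

```python
def ssh_command(username: str, ip: str, port: int) -> str:
    """Build the ssh invocation for one cluster node."""
    return f"ssh -p {port} {username}@{ip}"


# Values copied from the node list above.
nodes = [
    ("tvm-829305193_1-20181217t125904z", "40.121.91.247", 50001),
    ("tvm-829305193_2-20181217t125904z", "40.121.91.247", 50000),
]
for node_id, ip, port in nodes:
    print(node_id, "->", ssh_command("batchai_user", ip, port))
```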
