# Train TensorFlow Model Distributed on Batch AI
In this notebook we will train a TensorFlow model ([ResNet50](https://arxiv.org/abs/1512.03385)) in a distributed fashion using [Horovod](https://github.com/uber/horovod) on the ImageNet dataset. This tutorial will take you through the following steps:
 * [Create Experiment](#experiment)
 * [Upload Training Scripts](#training_scripts)
 * [Submit and Monitor Job](#job)
 * [Clean Up Resources](#clean_up)

In [16]:
import sys
sys.path.append("../common") 

import json
from dotenv import get_key
import os
from utils import write_json_to_file, dotenv_for

Set USE_FAKE to True if you want to train on fake data rather than the ImageNet dataset. This is often a good way to debug your model and to measure how much overhead data IO adds.

In [None]:
# Variables for Batch AI - change as necessary
dotenv_path = dotenv_for()
GROUP_NAME             = get_key(dotenv_path, 'GROUP_NAME')
FILE_SHARE_NAME        = get_key(dotenv_path, 'FILE_SHARE_NAME')
WORKSPACE              = get_key(dotenv_path, 'WORKSPACE')
NUM_NODES              = int(get_key(dotenv_path, 'NUM_NODES'))
CLUSTER_NAME           = get_key(dotenv_path, 'CLUSTER_NAME')
GPU_TYPE               = get_key(dotenv_path, 'GPU_TYPE')
PROCESSES_PER_NODE     = int(get_key(dotenv_path, 'PROCESSES_PER_NODE'))
STORAGE_ACCOUNT_NAME   = get_key(dotenv_path, 'STORAGE_ACCOUNT_NAME')

EXPERIMENT             = f"distributed_tensorflow_{GPU_TYPE}"
USE_FAKE               = False
DOCKERHUB              = os.getenv('DOCKER_REPOSITORY', "masalvar")

In [56]:
FAKE='-x FAKE=True' if USE_FAKE else ''
TOTAL_PROCESSES = PROCESSES_PER_NODE * NUM_NODES
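The two derived values above can also be computed by a small helper, which makes it easier to check them before a job is submitted. This is only a sketch: the function name `mpi_settings` is hypothetical and not part of the notebook's `utils` module.

```python
def mpi_settings(processes_per_node, num_nodes, use_fake):
    """Return (total_processes, fake_flag) for the mpirun command line."""
    if processes_per_node < 1 or num_nodes < 1:
        raise ValueError("processes_per_node and num_nodes must be positive")
    # One MPI rank per process on every node.
    total_processes = processes_per_node * num_nodes
    # mpirun's -x option exports an environment variable to every rank.
    fake_flag = "-x FAKE=True" if use_fake else ""
    return total_processes, fake_flag


total, flag = mpi_settings(processes_per_node=4, num_nodes=2, use_fake=True)
print(total, repr(flag))
```

With 2 nodes and 4 processes per node this gives 8 ranks, which matches the `-np 8` that appears in the submitted command line later in the notebook.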

<a id='experiment'></a>
## Create Experiment
Next we create our experiment.

In [9]:
!az batchai experiment create -n $EXPERIMENT -g $GROUP_NAME -w $WORKSPACE

{
  "creationTime": "2018-12-17T13:19:30.658000+00:00",
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace/experiments/distributed_pytorch_v100",
  "name": "distributed_pytorch_v100",
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-12-17T13:19:30.658000+00:00",
  "resourceGroup": "batchdtdemorg",
  "type": "Microsoft.BatchAI/workspaces/experiments"
}

<a id='training_scripts'></a>
## Upload Training Scripts
We need to upload our training scripts and associated files to the file share. First we retrieve the storage account key.

In [22]:
json_data = !az storage account keys list -n $STORAGE_ACCOUNT_NAME -g $GROUP_NAME
storage_account_key = json.loads(''.join([i for i in json_data if 'WARNING' not in i]))[0]['value']
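The cell above drops any CLI warning lines before parsing the remaining output as JSON and taking the first key. The same logic is sketched below as a standalone function so it can be tried without calling Azure. The function name and the sample lines are made up for illustration; the key values are placeholders, not real keys.

```python
import json


def first_key_from_cli_output(lines):
    """Return the first key value from the lines printed by the CLI."""
    # Warning lines are not valid JSON, so filter them out before parsing.
    json_text = "".join(line for line in lines if "WARNING" not in line)
    keys = json.loads(json_text)
    if not keys:
        raise ValueError("no storage keys were returned")
    return keys[0]["value"]


# Hypothetical output, shaped like a list of key records.
sample_lines = [
    "WARNING: hypothetical warning line",
    "[",
    '  {"keyName": "key1", "permissions": "Full", "value": "not-a-real-key"},',
    '  {"keyName": "key2", "permissions": "Full", "value": "also-not-real"}',
    "]",
]
print(first_key_from_cli_output(sample_lines))
```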

In [23]:
%env AZURE_STORAGE_ACCOUNT=$STORAGE_ACCOUNT_NAME
%env AZURE_STORAGE_KEY=$storage_account_key

env: AZURE_STORAGE_ACCOUNT=batchdtdemost
env: AZURE_STORAGE_KEY=AtQA2uvmxTSvo0SXnI5FjMOXl+qp5fKwNcPL+Y2N0N/0+EhcRt4RhFuXf+YKvG9qDSrB6ZrgNmJ8fgloABMtSQ==


Now we upload the training scripts to the file share.

In [None]:
!az storage file upload --share-name $FILE_SHARE_NAME --source src/imagenet_estimator_tf_horovod.py --path scripts
!az storage file upload --share-name $FILE_SHARE_NAME --source src/resnet_model.py --path scripts
!az storage file upload --share-name $FILE_SHARE_NAME --source ../common/timer.py --path scripts

Let's check the cluster we created earlier.

In [25]:
!az batchai cluster list -w $WORKSPACE -o table

Name    Resource Group    Workspace    VM Size             State    Idle    Running    Preparing    Leaving    Unusable
------  ----------------  -----------  ------------------  -------  ------  ---------  -----------  ---------  ----------
msv100  batchdtdemorg     workspace    STANDARD_NC24RS_V3  steady   2       0          0            0          0


<a id='job'></a>
## Submit and Monitor Job
Below we specify the job we wish to execute.  

In [None]:
jobs_dict = {
  "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2017-09-01-preview/job.json",
  "properties": {
    "nodeCount": NUM_NODES,
    "customToolkitSettings": {
      "commandLine": f"echo $AZ_BATCH_HOST_LIST; \
    cat $AZ_BATCHAI_MPI_HOST_FILE; \
    mpirun -np {TOTAL_PROCESSES} --hostfile $AZ_BATCHAI_MPI_HOST_FILE \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    -mca btl_tcp_if_include eth0 \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -mca btl ^openib \
    -x NCCL_IB_DISABLE=1 \
    -x DISTRIBUTED=True \
    -x AZ_BATCHAI_INPUT_TRAIN \
    -x AZ_BATCHAI_INPUT_TEST \
    --allow-run-as-root \
      {FAKE} \
      python -u $AZ_BATCHAI_INPUT_SCRIPTS/imagenet_estimator_tf_horovod.py"
    },
    "stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/extfs",
    "inputDirectories": [{
        "id": "SCRIPTS",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/extfs/scripts"
      },
      {
        "id": "TRAIN",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet",
      },
      {
        "id": "TEST",
        "path": "$AZ_BATCHAI_MOUNT_ROOT/nfs/imagenet",
      },
    ],
    "outputDirectories": [{
        "id": "MODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/extfs",
        "pathSuffix": "Models"
    }],
    "containerSettings": {
      "imageSourceRegistry": {
        "image": f"{DOCKERHUB}/caia-horovod-tensorflow"
      }
    }
  }
}
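The command line in the job definition is a long string held together with backslash continuations, which is easy to break when adding or removing an option. One alternative, sketched here, is to assemble the same `mpirun` invocation from a list of parts. The function `build_mpirun_command` is hypothetical and is not used by the notebook; the options are copied from the job definition above.

```python
def build_mpirun_command(total_processes, script, fake_flag=""):
    """Join the mpirun options from the job definition into one string."""
    parts = [
        "mpirun", "-np", str(total_processes),
        "--hostfile", "$AZ_BATCHAI_MPI_HOST_FILE",
        "-bind-to", "none", "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH",
        "-mca", "btl_tcp_if_include", "eth0",
        "-x", "NCCL_SOCKET_IFNAME=eth0",
        "-mca", "btl", "^openib",
        "-x", "NCCL_IB_DISABLE=1",
        "-x", "DISTRIBUTED=True",
        "-x", "AZ_BATCHAI_INPUT_TRAIN",
        "-x", "AZ_BATCHAI_INPUT_TEST",
        "--allow-run-as-root",
    ]
    if fake_flag:
        # fake_flag looks like "-x FAKE=True"; split it into separate tokens.
        parts.extend(fake_flag.split())
    parts.extend(["python", "-u", script])
    return " ".join(parts)


command = build_mpirun_command(
    8,
    "$AZ_BATCHAI_INPUT_SCRIPTS/imagenet_estimator_tf_horovod.py",
    fake_flag="-x FAKE=True",
)
print(command)
```

Because the parts are joined with single spaces, the result avoids the runs of whitespace that the continuation style leaves in the submitted command.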

In [62]:
write_json_to_file(jobs_dict, 'job.json')
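`write_json_to_file` comes from the `utils` module in `../common`, whose source is not shown in this notebook. As an assumption about what such a helper does, a minimal equivalent could look like the sketch below. The name `write_json_sketch` is hypothetical, and the real helper may differ in formatting or arguments.

```python
import json
import os
import tempfile


def write_json_sketch(json_dict, filename, mode="w"):
    """Serialise a dictionary to a JSON file (assumed behaviour)."""
    with open(filename, mode) as outfile:
        json.dump(json_dict, outfile, indent=4, sort_keys=True)
        outfile.write("\n")


# Write a tiny job definition to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "job.json")
write_json_sketch({"properties": {"nodeCount": 2}}, path)
with open(path) as infile:
    round_tripped = json.load(infile)
print(round_tripped)
```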

In [None]:
# The suffix is the total number of MPI processes across all nodes
JOB_NAME = f'tensorflow-horovod-{TOTAL_PROCESSES}'

We now submit the job to Batch AI.

In [64]:
!az batchai job create -n $JOB_NAME --cluster $CLUSTER_NAME -w $WORKSPACE -e $EXPERIMENT -f job.json

{
  "caffe2Settings": null,
  "caffeSettings": null,
  "chainerSettings": null,
  "cluster": {
    "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchdtdemorg/providers/Microsoft.BatchAI/workspaces/workspace/clusters/msv100",
    "resourceGroup": "batchdtdemorg"
  },
  "cntkSettings": null,
  "constraints": {
    "maxWallClockTime": "7 days, 0:00:00"
  },
  "containerSettings": {
    "imageSourceRegistry": {
      "credentials": null,
      "image": "masalvar/caia-horovod-pytorch",
      "serverUrl": null
    },
    "shmSize": null
  },
  "creationTime": "2018-12-17T13:47:50.202000+00:00",
  "customMpiSettings": null,
  "customToolkitSettings": {
    "commandLine": "echo $AZ_BATCH_HOST_LIST;     cat $AZ_BATCHAI_MPI_HOST_FILE;     mpirun -np 8 --hostfile $AZ_BATCHAI_MPI_HOST_FILE     -bind-to none -map-by slot     -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH     -mca btl_tcp_if_include eth0     -x NCCL_SOCKET_IFNAME=eth0     -mca btl ^openib     -x N

With the command below we can check the status of the job.

In [67]:
!az batchai job list -w $WORKSPACE -e $EXPERIMENT -o table

Name               Cluster    Cluster RG     Cluster Workspace    Tool    Nodes    State    Exit code
-----------------  ---------  -------------  -------------------  ------  -------  -------  -----------
pytorch-horovod-8  msv100     batchdtdemorg  workspace            custom  2        running


To view the files that the job has generated, use the command below.

In [37]:
!az batchai job file list -w $WORKSPACE -e $EXPERIMENT -j $JOB_NAME --output-directory-id stdouterr

[
  {
    "contentLength": 9925,
    "downloadUrl": "https://batchdtdemost.file.core.windows.net/batchdtdemoshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchdtdemorg/workspaces/workspace/experiments/distributed_pytorch_v100/jobs/pytorch-horovod-8/ffa69c05-3f59-41b3-bfc4-370ad3022d9a/stdouterr/execution-tvm-829305193_1-20181217t125904z.log?sv=2016-05-31&sr=f&sig=RDpy9UMuftOa1w2TM6fROekEqc6ISPRmAwsoQufRzig%3D&se=2018-12-17T14%3A29%3A33Z&sp=rl",
    "fileType": "file",
    "lastModified": "2018-12-17T13:26:52+00:00",
    "name": "execution-tvm-829305193_1-20181217t125904z.log"
  },
  {
    "contentLength": 14343,
    "downloadUrl": "https://batchdtdemost.file.core.windows.net/batchdtdemoshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchdtdemorg/workspaces/workspace/experiments/distributed_pytorch_v100/jobs/pytorch-horovod-8/ffa69c05-3f59-41b3-bfc4-370ad3022d9a/stdouterr/execution-tvm-829305193_2-20181217t125904z.log?sv=2016-05-31&sr=f&sig=i1S6%2BgVSpK%2BX1o%2BXOLuNBFJ%2FZrRK8W1d7ZE

We are also able to stream the stdout and stderr that our job produces. This is useful for checking the progress of the job as well as for debugging issues.

In [68]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT -j $JOB_NAME --output-directory-id stdouterr -f stdout.txt

File found with URL "https://batchdtdemost.file.core.windows.net/batchdtdemoshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchdtdemorg/workspaces/workspace/experiments/distributed_pytorch_v100/jobs/pytorch-horovod-8/43d5d58c-2ecd-4aa4-a459-93be2f302b7e/stdouterr/stdout.txt?sv=2016-05-31&sr=f&sig=%2F3GTrC%2BW73ccZZ82QFvYxHmjrayV0pquj6EYSeqC%2B0I%3D&se=2018-12-17T14%3A51%3A24Z&sp=rl". Start streaming
10.0.0.5,10.0.0.6
10.0.0.5 slots=4 max-slots=4
10.0.0.6 slots=4 max-slots=4
INFO:__main__:1:  Runnin Distributed
INFO:__main__:0:  Runnin Distributed
INFO:__main__:5:  Runnin Distributed
INFO:__main__:2:  Runnin Distributed
INFO:__main__:3:  Runnin Distributed
INFO:__main__:6:  Runnin Distributed
INFO:__main__:7:  Runnin Distributed
INFO:__main__:4:  Runnin Distributed
INFO:__main__:0:  PyTorch version 0.4.0
INFO:__main__:0:  Setting up fake loaders
INFO:__main__:2:  PyTorch version 0.4.0
INFO:__main__:2:  Setting up fake loaders
INFO:__main__:1:  PyTorch version 0.4.0
INFO:__main__:1

^C
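The first lines of the streamed output come from the `echo` and `cat` commands at the start of the job's command line: the host list, followed by the MPI host file with the slots available on each node. A quick sanity check is that the slots add up to the number of processes passed to `mpirun -np`. The parser below is a minimal sketch with a hypothetical name, and treating a host without a `slots=` entry as one slot is a simplification made here, not a statement about Open MPI's defaults.

```python
def parse_hostfile(text):
    """Map each host in an MPI host file to its number of slots."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        host, options = fields[0], fields[1:]
        slot_count = 1  # simplification used by this sketch
        for option in options:
            key, _, value = option.partition("=")
            if key == "slots":
                slot_count = int(value)
        slots[host] = slot_count
    return slots


# Host file contents copied from the streamed output above.
hostfile = """10.0.0.5 slots=4 max-slots=4
10.0.0.6 slots=4 max-slots=4"""
slots = parse_hostfile(hostfile)
print(slots, sum(slots.values()))
```

Here the two nodes offer 4 slots each, 8 in total, which agrees with the `-np 8` in the submitted command.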


In [69]:
!az batchai job file stream -w $WORKSPACE -e $EXPERIMENT -j $JOB_NAME --output-directory-id stdouterr -f stderr.txt

File found with URL "https://batchdtdemost.file.core.windows.net/batchdtdemoshare/edf507a2-6235-46c5-b560-fd463ba2e771/batchdtdemorg/workspaces/workspace/experiments/distributed_pytorch_v100/jobs/pytorch-horovod-8/43d5d58c-2ecd-4aa4-a459-93be2f302b7e/stdouterr/stderr.txt?sv=2016-05-31&sr=f&sig=9HSNbBWc0aGcQINWHJz508JAKw935Miy%2BkMwEj184NQ%3D&se=2018-12-17T14%3A51%3A44Z&sp=rl". Start streaming
^C


We can either wait for the job to complete or delete it with the command below.

In [70]:
!az batchai job delete -w $WORKSPACE -e $EXPERIMENT --name $JOB_NAME -y

[K[0minished ..

<a id='clean_up'></a>
## Clean Up Resources
Next we tidy up the resources we created.  
First we reset the default values we set earlier.

In [71]:
!az configure --defaults group=''
!az configure --defaults location=''

Next we delete the cluster.

In [72]:
!az batchai cluster delete -w $WORKSPACE --name $CLUSTER_NAME -g $GROUP_NAME -y

[K[0minished ..

Once the cluster is deleted you will no longer incur any compute costs, but your experiments and workspace are retained. If you wish to delete those as well, execute the commands below.

In [73]:
!az batchai experiment delete -w $WORKSPACE --name $EXPERIMENT -g $GROUP_NAME -y

[K[0minished ..

In [74]:
!az batchai workspace delete -n $WORKSPACE -g $GROUP_NAME -y

[K[0minished ..

Finally, we can delete the resource group, which removes everything created for this tutorial.

In [75]:
!az group delete --name $GROUP_NAME -y

[K[0minished ..