# Batch AI
In this notebook we will go through the steps of setting up the cluster, executing the notebooks on it, and pulling the executed notebooks back to the local machine.

We have defined a setup script called setup_bait.py. Here we simply execute it, which also brings all of its variables and methods into the notebook namespace. You can also use the setup script inside an IPython environment: simply execute anaconda-project run ipython-bait

In [1]:
%run setup_bait.py

Below we set up the cluster and wait for the VMs to be allocated.

In [2]:
setup_cluster()

In [11]:
wait_for_cluster()

Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
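
A helper such as wait_for_cluster typically amounts to a polling loop. The following is a minimal sketch of that idea under stated assumptions, not the code in setup_bait.py: `wait_until_allocated` and the `get_state` callable are hypothetical names, and a real `get_state` would have to query the Batch AI service for the cluster's allocation state and node count.

```python
import time


def wait_until_allocated(get_state, target, poll_seconds=5.0,
                         timeout_seconds=1800.0, sleep=time.sleep):
    """Poll until the cluster is steady with at least `target` nodes.

    `get_state` is a hypothetical callable returning a tuple of
    (allocation_state, allocated_node_count). Returns the number of
    seconds spent sleeping between polls.
    """
    waited = 0.0
    while True:
        state, allocated = get_state()
        print("Cluster state: {} Target: {}; Allocated: {}".format(
            state, target, allocated))
        if state == "steady" and allocated >= target:
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(
                "cluster not ready after {} seconds".format(waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays, and the timeout guards against waiting forever if the requested nodes never arrive.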


Below we print the status of the cluster. We can see many details of the cluster we created including its name and the docker images for the various DL frameworks.

In [4]:
print_cluster_list()

[{'allocation_state': 'steady',
  'allocation_state_transition_time': '2018-05-07T13:09:29.9359999999999999Z',
  'creation_time': '2018-05-07T13:07:42.752Z',
  'current_node_count': 10,
  'id': '/subscriptions/10d0b7c6-9243-4713-91a9-2730375d3a1b/resourceGroups/baitrg/providers/Microsoft.BatchAI/clusters/mync6',
  'location': 'eastus',
  'name': 'mync6',
  'node_setup': {'mount_volumes': {'azure_file_shares': [{'account_name': 'baitstr',
                                                          'azure_file_url': 'https://baitstr.file.core.windows.net/baitshare',
                                                          'credentials': {},
                                                          'directory_mode': '0777',
                                                          'file_mode': '0777',
                                                          'relative_mount_path': 'azurefileshare'}]}},
  'node_state_counts': {'idle_node_count': 10,
                        'leaving_node_cou

We can submit all of the jobs with the submit_all function. We also have a submit function for each of the DL frameworks if you wish to execute one separately.

In [5]:
submit_all(epochs=10)

INFO:__main__:Submitting job run_cntk
INFO:__main__:Submitting job run_chainer
INFO:__main__:Submitting job run_mxnet
INFO:__main__:Submitting job run_keras_cntk
INFO:__main__:Submitting job run_keras_tf
INFO:__main__:Submitting job run_caffe2
INFO:__main__:Submitting job run_pytorch
INFO:__main__:Submitting job run_tf
INFO:__main__:Submitting job run_gluon


We can periodically execute the command below to observe the status of the jobs. The number of jobs that can execute in parallel is limited by the number of nodes in the cluster; the cluster above has 10 nodes, so all 9 jobs run at the same time. If the exit-code is anything other than 0, there has been a problem with the job.

In [7]:
print_jobs_summary()

run_cntk: status:running | exit-code None
run_chainer: status:running | exit-code None
run_mxnet: status:running | exit-code None
run_keras_cntk: status:running | exit-code None
run_keras_tf: status:running | exit-code None
run_caffe2: status:running | exit-code None
run_pytorch: status:running | exit-code None
run_tf: status:running | exit-code None
run_gluon: status:running | exit-code None
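
If you capture the summary as text, the exit-code check can be automated. The sketch below assumes the line format shown above; `parse_jobs_summary` and `failed_jobs` are hypothetical helpers rather than part of the setup script, and any status values other than those printed in this notebook are assumptions.

```python
import re

# Matches lines such as "run_tf: status:running | exit-code None".
SUMMARY_LINE = re.compile(
    r"^(?P<name>\S+): status:(?P<status>\S+) \| exit-code (?P<code>\S+)$")


def parse_jobs_summary(text):
    """Return {job_name: (status, exit_code)}; exit_code is None if unset."""
    jobs = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not job summary lines
        code = match.group("code")
        exit_code = None if code == "None" else int(code)
        jobs[match.group("name")] = (match.group("status"), exit_code)
    return jobs


def failed_jobs(jobs):
    """Names of jobs that finished with a non-zero exit code."""
    return sorted(name for name, (_, code) in jobs.items()
                  if code not in (None, 0))
```

A job whose exit code is still None has not finished, so it is not counted as failed.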


We can use the wait_for_job function to wait for a job to complete. Once the job has completed, its stdout is printed. Let's take a look at the TensorFlow job. The job name can be found in the output of print_jobs_summary as well as in the log messages emitted when we submitted the jobs.

In [8]:
wait_for_job('run_tf')

Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
Job state: succeeded ExitCode: 0
Waiting for job output to become available...
OS:  linux
Python:  3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
Numpy:  1.14.2
Tensorflow:  1.8.0
Preparing train set...
Preparing test set...
Done.
(50000, 3, 32, 32) (10000, 3, 32, 32) (50000,) (10000,)
float32 float32 int32 int32
CPU times: user 852 ms, sys: 584 ms, total: 1.44 s
Wall time: 5.53 s
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 149 ms
CPU times: user 512 ms, sys: 340 ms, total: 852 ms
Wall time: 879 ms
0 Train accuracy: 0.515625
1 Train accuracy: 0.59375
2 Train accuracy: 0.5625
3 Train accuracy: 0.5625
4 Train accuracy: 0.71875
5 Train accuracy: 0.734375
6 Train accuracy: 0.734375
7 Train accuracy: 0.71875
8 Train accuracy: 0.859375
9 Train accuracy: 0.859375
Training took 168.220 sec.
CPU times: user 7.61 s, sys: 996 ms, total: 8.6 s
Wall time: 8.58 s
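
To compare frameworks it can help to pull the numbers out of this stdout. The sketch below assumes the log lines look like the ones printed above ("N Train accuracy: X" and "Training took X sec."); `parse_training_log` is a hypothetical helper, and other frameworks' notebooks may print differently.

```python
import re

ACCURACY_LINE = re.compile(
    r"^(\d+) Train accuracy: ([0-9.]+)[ \t]*$", re.MULTILINE)
TRAINING_TIME = re.compile(r"Training took ([0-9.]+) sec\.")


def parse_training_log(stdout):
    """Return (per-epoch accuracies, training time in seconds or None)."""
    accuracies = [float(m.group(2)) for m in ACCURACY_LINE.finditer(stdout)]
    took = TRAINING_TIME.search(stdout)
    return accuracies, (float(took.group(1)) if took else None)
```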

Now let's download one of the notebooks we ran.

In [9]:
download_files('run_tf', 'notebooks')

INFO:__main__:Downloading Tensorflow_run_tf.ipynb


Downloading https://baitstr.file.core.windows.net/baitshare/10d0b7c6-9243-4713-91a9-2730375d3a1b/baitrg/jobs/run_tf/f34ebe05-7fe6-40a8-b431-7389f7e1821d/outputs/notebooks/Tensorflow_run_tf.ipynb?sv=2016-05-31&sr=f&sig=t7oBsMkWoevy9PZkqm1h8Bz45eLhmiqMTIUmLXk%2FmDY%3D&se=2018-05-07T14%3A25%3A18Z&sp=rl ...Done
All files Downloaded
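
The download URL carries a shared access signature (SAS) in its query string: `se` is the expiry time and `sp` the granted permissions (here `rl`, read and list). A minimal sketch for inspecting such a URL with the standard library follows; `describe_sas_url` is a hypothetical helper, and it assumes the expiry uses the second-resolution UTC format seen above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def describe_sas_url(url):
    """Extract the file name, permissions and expiry from a SAS URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # percent-decodes values such as %3A
    expiry = datetime.strptime(query["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "file_name": parts.path.rsplit("/", 1)[-1],
        "permissions": query["sp"][0],
        "expires": expiry.replace(tzinfo=timezone.utc),
    }
```

Because the signature expires, a link copied from the output above stops working after the `se` timestamp; run download_files again to obtain a fresh one.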


Open the notebook and compare its cell outputs with the stdout that wait_for_job printed for the job; the two are identical. You can download the other notebooks as well by simply supplying the name of the job.
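
The comparison can also be done programmatically, since a notebook file is JSON. The sketch below assumes the nbformat 4 layout (a top-level `cells` list whose code cells hold `outputs`); `notebook_stdout` is a hypothetical helper, and the path you pass depends on where download_files placed the file.

```python
import json


def notebook_stdout(path):
    """Concatenate the stdout stream outputs of all code cells."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if (output.get("output_type") == "stream"
                    and output.get("name") == "stdout"):
                text = output.get("text", "")
                # The text may be stored as one string or a list of lines.
                chunks.append("".join(text) if isinstance(text, list) else text)
    return "".join(chunks)
```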

Once all the jobs are complete, we can delete them and then delete the cluster.

In [12]:
delete_all_jobs()

In [13]:
delete_cluster()

<msrest.polling.poller.LROPoller at 0x7fe270fce7b8>
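
The LROPoller shown above represents a long-running operation, so the deletion may still be in progress when the cell returns. A minimal sketch for blocking until it finishes is given below; it assumes delete_cluster returns the poller (as the cell output suggests), and `wait_for_deletion` is a hypothetical helper that relies only on the poller's `wait` and `done` methods.

```python
def wait_for_deletion(poller, timeout=None):
    """Block until the long-running operation finishes or `timeout` elapses.

    `poller` is expected to behave like msrest's LROPoller, which offers
    wait(timeout) and done(). Returns True if the operation has completed.
    """
    poller.wait(timeout)
    return poller.done()
```

Assuming the above, usage would look like `wait_for_deletion(delete_cluster(), timeout=600)`.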

In [14]:
print_status()

Cluster state: steady Target: 10; Allocated: 10; Idle: 10; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0


These simple methods make working with Batch AI very convenient, but they may not be suitable for every use case. For more details, check out the Batch AI documentation as well as the setup script.