This notebook is a tutorial on how to train a deep learning model on GPUs using Azure Batch Shipyard. In this example we train a model built with CNTK, but Batch Shipyard can be used to train a model built in any framework.

This notebook assumes that nothing has been set up previously and will create everything from scratch. The necessary steps are broken up into the following sections:
* [Setup](#section1)
* [Azure account login](#section2)
* [Create Azure resources](#section3)
* [Transfer model files](#section4)
* [Batch Shipyard configuration](#section5)
* [Using Batch Shipyard](#section6)
* [Delete everything](#section7)


In this specific example Batch Shipyard will spin up a node with GPU support and load our Docker image, which has the necessary libraries installed. It will then pull the Python files that contain our model onto the node and execute them. The output of the execution is streamed back for us to see. The two files used are [prepare_cifar_data.py](model/prepare_cifar_data.py) and [ConvNet_CIFAR10.py](model/ConvNet_CIFAR10.py). The former downloads the data and converts it to the appropriate format; the latter constructs, trains and evaluates the model.



<a id='installation'></a>
### Installation
If you do not have Batch Shipyard, the Azure CLI or Blobxfer installed, please follow the instructions below to install them before continuing with this tutorial.  
[Batch Shipyard](https://github.com/Azure/batch-shipyard/blob/master/docs/01-batch-shipyard-installation.md)    
[Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)  
[Blobxfer](https://github.com/Azure/blobxfer)


In [4]:
import json
from os import path

<a id='section1'></a>

## Setup

We assume here that the shipyard command is in your PATH and can be executed within the notebook with the !shipyard command. If not, please add it to your PATH.
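
One way to check this assumption from inside the notebook is sketched below. It relies only on Python's standard `shutil.which`. The helper name `missing_tools` and the default list of command names (`shipyard`, `az`, `blobxfer`) are assumptions for illustration and should be adjusted to match your installation.

```python
import shutil


def missing_tools(tools=("shipyard", "az", "blobxfer")):
    """Return the command names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required command-line tools were found.")
```

If anything is reported as missing, install it or extend your PATH before running the remaining cells.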

Below are the various name definitions for the resources needed to run batch jobs.

In [25]:
# Feel free to modify these
group_name = 'batchcntkrg'
batch_account_name = "batchcntkacc"
storage_account_name = "batchcntkstr"
location = 'eastus' # We are setting everything up in East US
code_share_name = "code"

# Leave these alone for now
STORAGE_ALIAS = "mystorageaccount"
IMAGE_NAME = "microsoft/cntk:2.0.rc1-gpu-python3.5-cuda8.0-cudnn5.1" # CNTK GPU image (the latest at the time of writing)
STORAGE_ENDPOINT = "core.windows.net"
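
If you do change the account names, it can help to check them before creating any resources. The sketch below assumes the commonly documented rule for storage and Batch account names (3 to 24 characters, lowercase letters and digits only). Please confirm this against the current Azure documentation. The helper `invalid_account_names` is hypothetical and not part of any Azure tool.

```python
import re

# Assumed naming rule: 3-24 characters, lowercase letters and digits only.
ACCOUNT_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def invalid_account_names(names):
    """Return the names that do not satisfy the assumed account naming rule."""
    return [name for name in names if not ACCOUNT_NAME_PATTERN.fullmatch(name)]


# The default names used in this notebook pass the check.
print(invalid_account_names(["batchcntkacc", "batchcntkstr"]))

# Upper case, underscores and very short names are flagged.
print(invalid_account_names(["Batch_CNTK", "ab"]))
```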

<a id='section2'></a>

## Azure account login

The command below initiates a login to your Azure account. It displays a URL to visit, where you enter a one-off code and log in to your Azure account using your browser.

In [None]:
!az login -o table

If you have multiple subscriptions, you can select the one you need with the command below.

In [7]:
selected_subscription = '"My Team"'
!az account set --subscription $selected_subscription

<a id='section3'></a>

## Create Azure resources

### Create Group
Azure encourages the use of resource groups to organise the Azure components you deploy. This makes them easier to find, and it also lets us delete a number of resources simply by deleting the group.

In [8]:
%%time
!az group create -n $group_name -l $location -o table

Location    Name
----------  -----------
eastus      batchcntkrg
CPU times: user 58 ms, sys: 24.8 ms, total: 82.7 ms
Wall time: 3.32 s


### Create Batch account
In this section we define an ARM template to create our Batch and storage accounts. Once we have created the accounts, we can then use the Azure CLI to query them and obtain the batch_account_key, batch_service_url and storage_account_key, which we will need for our Batch Shipyard configuration files later.

In [10]:
template_dict = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "batchAccounts_name": {
            "defaultValue": batch_account_name,
            "type": "String"
        },
        "storageAccounts_name": {
            "defaultValue": storage_account_name,
            "type": "String"
        }
    },
    "variables": {},
    "resources": [
        {
            "type": "Microsoft.Batch/batchAccounts",
            "name": "[parameters('batchAccounts_name')]",
            "apiVersion": "2015-12-01",
            "location": location,
            "properties": {
                "autoStorage": {
                    "storageAccountId": "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_name'))]"
                }
            },
            "resources": [],
            "dependsOn": [
                "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_name'))]"
            ]
        },
        {
            "type": "Microsoft.Storage/storageAccounts",
            "sku": {
                "name": "Standard_LRS",
                "tier": "Standard"
            },
            "kind": "Storage",
            "name": "[parameters('storageAccounts_name')]",
            "apiVersion": "2016-01-01",
            "location": location,
            "tags": {},
            "properties": {},
            "resources": [],
            "dependsOn": []
        }
    ]
}

In [11]:
template_filename = 'template.json'

In [12]:
with open(template_filename, 'w') as outfile:
    json.dump(template_dict, outfile)

Validate the template

In [13]:
!az group deployment validate --template-file $template_filename -g $group_name

{
  "error": null,
  "properties": {
    "correlationId": "2ec7bbea-75af-4f53-bfc6-88457a3b5e0f",
    "debugSetting": null,
    "dependencies": [
      {
        "dependsOn": [
          {
            "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchcntkrg/providers/Microsoft.Storage/storageAccounts/batchcntkstr",
            "resourceGroup": "batchcntkrg",
            "resourceName": "batchcntkstr",
            "resourceType": "Microsoft.Storage/storageAccounts"
          }
        ],
        "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchcntkrg/providers/Microsoft.Batch/batchAccounts/batchcntkacc",
        "resourceGroup": "batchcntkrg",
        "resourceName": "batchcntkacc",
        "resourceType": "Microsoft.Batch/batchAccounts"
      }
    ],
    "mode": "Incremental",
    "outputs": null,
    "parameters": {
      "batchAccounts_name": {
        "type": "String",
        "value": "batchcntkac

You should see "Succeeded" in the provisioningState field in the JSON above.

Next we deploy the template

In [14]:
%%time
!az group deployment create --template-file $template_filename -g $group_name

{
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchcntkrg/providers/Microsoft.Resources/deployments/template",
  "name": "template",
  "properties": {
    "correlationId": "6b45de5c-2ae9-4dda-ae63-d2ffde46d961",
    "debugSetting": null,
    "dependencies": [
      {
        "dependsOn": [
          {
            "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchcntkrg/providers/Microsoft.Storage/storageAccounts/batchcntkstr",
            "resourceGroup": "batchcntkrg",
            "resourceName": "batchcntkstr",
            "resourceType": "Microsoft.Storage/storageAccounts"
          }
        ],
        "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/batchcntkrg/providers/Microsoft.Batch/batchAccounts/batchcntkacc",
        "resourceGroup": "batchcntkrg",
        "resourceName": "batchcntkacc",
        "resourceType": "Microsoft.Batch/batchAccounts"
      }
    ],
    "mode": "Incremental",
    "o

Again we should see "Succeeded" in the provisioningState field in the JSON above.
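
Rather than reading through the JSON by eye, the field can be extracted programmatically. The sketch below assumes the CLI output has been captured as a string or a list of lines (for example with `output = !az ...`) and that the field sits under `properties`, as in the output shown above. The helper `provisioning_state` and the shortened sample are hypothetical.

```python
import json


def provisioning_state(cli_output):
    """Return properties.provisioningState from the CLI's JSON output.

    Accepts a single string or a list of lines (which is what `!command`
    returns in IPython). Returns None if the field is absent.
    """
    text = cli_output if isinstance(cli_output, str) else "".join(cli_output)
    properties = json.loads(text).get("properties") or {}
    return properties.get("provisioningState")


# Hypothetical, heavily shortened sample of the deployment output.
sample = [
    "{",
    '  "name": "template",',
    '  "properties": {"mode": "Incremental", "provisioningState": "Succeeded"}',
    "}",
]
print(provisioning_state(sample))
```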

Next we retrieve the batch_account_key, batch_service_url and storage_account_key which we will need for the Batch Shipyard configuration files further down.

In [16]:
json_data = !az batch account keys list -n $batch_account_name -g $group_name
batch_account_key = json.loads(''.join(json_data))['primary']

In [17]:
json_data = !az batch account list -g $group_name
batch_service_url = 'https://'+json.loads(''.join(json_data))[0]['accountEndpoint']

In [18]:
json_data = !az storage account keys list -n $storage_account_name -g $group_name
storage_account_key = json.loads(''.join(json_data))[0]['value']
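
The three cells above share one pattern: join the captured output lines and decode them as JSON. If the CLI prints an error instead (for example because the login has expired), `json.loads` fails with a message that hides the real cause. A minimal sketch of a wrapper that surfaces the raw output is shown below. The name `parse_cli_json` and the sample lines are hypothetical.

```python
import json


def parse_cli_json(lines):
    """Join the output lines captured from `!command` and decode them as JSON."""
    text = "".join(lines)
    try:
        return json.loads(text)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        raise RuntimeError("CLI output was not valid JSON:\n" + text) from err


# Hypothetical sample shaped like the account list queried above.
sample_lines = [
    "[",
    '  {"accountEndpoint": "batchcntkacc.eastus.batch.azure.com"}',
    "]",
]
accounts = parse_cli_json(sample_lines)
print("https://" + accounts[0]["accountEndpoint"])
```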

<a id='section4'></a>

## Transfer files
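Here we create a file share in the storage account and transfer to it the files needed to create and execute our model. Note that the cell below calls the older `azure` command-line tool rather than `az`, so that tool must also be installed.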

In [20]:
%%time
!azure storage share create $code_share_name -a $storage_account_name -k $storage_account_key

info:    Executing command storage share create
+
+
data:    {
data:        name: 'code',
data:        metadata: {},
data:        etag: '"0x8D47FF2F446FB17"',
data:        lastModified: 'Mon, 10 Apr 2017 09:21:22 GMT',
data:        requestId: '7607fda2-001a-0065-43db-b155b9000000',
data:        quota: '5120',
data:        shareUsage: '0'
data:    }
info:    storage share create command OK
CPU times: user 49.4 ms, sys: 16.2 ms, total: 65.6 ms
Wall time: 2.34 s


In [156]:
localfolder = 'model'
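
Before uploading, it may be worth confirming that the two scripts the job runs are actually present in the local folder. The sketch below uses only the standard library. The helper `missing_files` is hypothetical, and the file names are the ones referenced at the top of this notebook.

```python
import os


def missing_files(folder, required):
    """Return the names in `required` that are not regular files inside `folder`."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(folder, name))
    ]


required_scripts = ["prepare_cifar_data.py", "ConvNet_CIFAR10.py"]
missing = missing_files("model", required_scripts)
print("Missing from model/:", missing if missing else "nothing")
```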

In [752]:
%%time
!blobxfer $storage_account_name $code_share_name $localfolder --fileshare --upload --storageaccountkey $storage_account_key 
# Transfers all files in local folder to the share

 azure blobxfer parameters [v0.12.1]
             platform: Linux-3.10.0-327.36.3.el7.x86_64-x86_64-with-centos-7.2.1511-Core
   python interpreter: CPython 3.5.3
     package versions: az.common=1.1.4 az.sml=0.20.5 az.stor=0.33.0 crypt=1.8.1 req=2.13.0
      subscription id: None
      management cert: None
   transfer direction: local->Azure
       local resource: model
      include pattern: None
      remote resource: None
   max num of workers: 12
              timeout: None
      storage account: batchcntkstr
              use SAS: False
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: True
 container/share name: code
  container/share URI: https://batchcntkstr.file.core.windows.net/code
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: True
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 1
        remote delete: False
          

<a id='section5'></a>

## Batch Shipyard configuration
In order to execute a job with Batch Shipyard you need a minimum of four configuration files:
* [credentials](#credentials)
* [configuration](#configuration)
* [pool](#pool)
* [jobs](#jobs)


<a id='credentials'></a>

### Credentials
Here we define all the credentials necessary for Batch Shipyard to run our job.

In [175]:
credentials = {
    "credentials": {
        "batch": {
            "account_key": batch_account_key,
            "account_service_url": batch_service_url
        },
        "storage": {
            STORAGE_ALIAS : {
                    "account": storage_account_name,
                    "account_key": storage_account_key,
                    "endpoint": STORAGE_ENDPOINT
            }
        }
    }
}
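
Because this dictionary holds account keys, it is easy to leak them by printing it or by leaving the cell output in a shared notebook. The sketch below shows one way to mask the keys before displaying the structure. The helper `redact` and the placeholder values are hypothetical, and the only field it treats as secret by default is `account_key`, as used above.

```python
def redact(value, secret_fields=("account_key",)):
    """Return a copy of a nested dict/list with the secret fields masked."""
    if isinstance(value, dict):
        return {
            key: "***" if key in secret_fields else redact(item, secret_fields)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, secret_fields) for item in value]
    return value


# Placeholder values only, not real credentials.
sample_credentials = {
    "credentials": {
        "batch": {"account_key": "PLACEHOLDER", "account_service_url": "https://example.invalid"},
        "storage": {"mystorageaccount": {"account": "batchcntkstr", "account_key": "PLACEHOLDER"}},
    }
}
print(redact(sample_credentials))
```

The original dictionary is left unchanged, so it can still be written to `credentials.json` afterwards.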

<a id='configuration'></a>

### Configuration
The config file holds the general settings for Batch Shipyard. Here we simply define the storage alias that Batch Shipyard should use and the Docker image to make available on the nodes.

In [176]:
config = {
    "batch_shipyard": {
        "storage_account_settings": STORAGE_ALIAS
    },
    "global_resources": {
        "docker_images": [
            IMAGE_NAME
        ]
    }
}

<a id='pool'></a>

### Pool
This is where we define the properties of the compute pool we wish to create. The configuration below creates a pool that is made up of a single NC6 VM running Ubuntu. If you wish to run a job that uses GPUs then you need to use a VM from the NC or NV series. 

In [177]:
pool={
    "pool_specification": {
        "id": "singlegpu",
        "vm_size": "STANDARD_NC6",
        "vm_count": 1,
        "publisher": "Canonical",
        "offer": "UbuntuServer",
        "sku": "16.04-LTS",
        "ssh": {
            "username": "docker"
        },
        "reboot_on_start_task_failed": False,
        "block_until_all_global_resources_loaded": True,
    }
}
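
If you change `vm_size`, a quick guard can catch a non-GPU size before the pool is created. The sketch below encodes only the rule stated above (NC or NV series); it knows nothing about GPU series introduced later. The helper `is_gpu_vm_size` is hypothetical.

```python
def is_gpu_vm_size(vm_size):
    """Return True if the size name belongs to the NC or NV series."""
    return vm_size.upper().startswith(("STANDARD_NC", "STANDARD_NV"))


print(is_gpu_vm_size("STANDARD_NC6"))    # size used in this notebook
print(is_gpu_vm_size("STANDARD_D2_V2"))  # a general-purpose size
```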


<a id='jobs'></a>

### Jobs
In the dictionary below we define the properties of the job we wish to execute. You can see that we have specified that the image to use is the one we defined at the beginning of this notebook. Note also that we set the gpu switch to true, since we want the job to use the GPU. Finally, the command is as follows:

```
source /cntk/activate-cntk
python $AZ_BATCH_NODE_SHARED_DIR/code/prepare_cifar_data.py
python $AZ_BATCH_NODE_SHARED_DIR/code/ConvNet_CIFAR10.py
```

In essence, this activates the CNTK Anaconda environment and then runs the prepare_cifar_data.py script. Once this is done, it runs the ConvNet_CIFAR10.py script, which trains and evaluates the model.

In [734]:
command = 'bash -c "source /cntk/activate-cntk; \
           python $AZ_BATCH_NODE_SHARED_DIR/code/prepare_cifar_data.py; \
           python $AZ_BATCH_NODE_SHARED_DIR/code/ConvNet_CIFAR10.py"'

jobs = {
    "job_specifications": [
        {
            "id": "cntkjob",
            "tasks": [
                {
                    "id": "run_cifar10",# This should be changed per task
                    "image": IMAGE_NAME,
                    "remove_container_after_exit": True,
                    "command": command,
                    "gpu": True,
                }
            ],
            "input_data": {
                "azure_storage": [
                    {
                        "storage_account_settings": STORAGE_ALIAS,
                        "file_share": code_share_name,
                        "blobxfer_extra_options": None,
                        "destination":"$AZ_BATCH_NODE_SHARED_DIR/code"
                    }
                ]
            }
        }
    ]
}
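
The command string in the cell above is written by hand, which makes it easy to misplace a quote when scripts are added or renamed. As an alternative, the sketch below assembles an equivalent string from a list of script names. The helper `build_command` and its parameters are hypothetical. The activation script and directory are the ones used above.

```python
def build_command(scripts, code_dir="$AZ_BATCH_NODE_SHARED_DIR/code", separator="; "):
    """Build a `bash -c` command that activates CNTK and runs each script in order."""
    steps = ["source /cntk/activate-cntk"]
    steps += ["python {}/{}".format(code_dir, script) for script in scripts]
    return 'bash -c "{}"'.format(separator.join(steps))


print(build_command(["prepare_cifar_data.py", "ConvNet_CIFAR10.py"]))
```

Passing `separator=" && "` would stop the chain as soon as one step fails, whereas `;` (the behaviour of the original command) runs every step regardless.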

In [735]:
!rm -rf config # Remove any previous config directory (-f avoids an error if it does not exist)
!mkdir config # Create the config directory where we will store all our Batch Shipyard configuration files

In [736]:
def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)

In [737]:
write_json_to_file(credentials, path.join('config', 'credentials.json'))

In [738]:
write_json_to_file(config, path.join('config', 'config.json'))

In [739]:
write_json_to_file(pool, path.join('config', 'pool.json'))

In [740]:
write_json_to_file(jobs, path.join('config', 'jobs.json'))
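
The dictionaries above use the Python literals `True`, `False` and `None`. These are valid because the standard `json` module converts them to JSON `true`, `false` and `null` when the files are written. The short sketch below illustrates the conversion on a small fragment. The fragment itself is made up from keys used earlier in this notebook.

```python
import json

# A small fragment mixing the Python literals used in the dictionaries above.
fragment = {"gpu": True, "blobxfer_extra_options": None, "vm_count": 1}

# indent and sort_keys only affect readability, not the meaning of the file.
text = json.dumps(fragment, indent=2, sort_keys=True)
print(text)

# Reading the text back yields the original Python values.
round_tripped = json.loads(text)
print(round_tripped == fragment)
```

Passing `indent=2` to `json.dump` in `write_json_to_file` would make the generated configuration files easier to inspect by hand.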

<a id='section6'></a>

## Using Batch Shipyard

Before we do anything we need to create the pool. This can take a little bit of time so be patient.

In [747]:
%%time
!shipyard pool add --yes --configdir config

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 8.11 µs
2017-04-11 08:07:18.362 INFO - creating container: shipyardtor-batchcntkacc-singlegpu
2017-04-11 08:07:19.057 INFO - creating table: shipyardtorrentinfo
2017-04-11 08:07:19.290 INFO - creating table: shipyardimages
2017-04-11 08:07:19.338 INFO - creating container: shipyardrf-batchcntkacc-singlegpu
2017-04-11 08:07:19.386 INFO - creating queue: shipyardgr-batchcntkacc-singlegpu
2017-04-11 08:07:19.671 INFO - creating table: shipyardregistry
2017-04-11 08:07:19.721 INFO - creating table: shipyarddht
2017-04-11 08:07:19.770 INFO - creating container: shipyardremotefs
2017-04-11 08:07:19.814 INFO - creating table: shipyardgr
2017-04-11 08:07:19.864 INFO - deleting blobs: shipyardtor-batchcntkacc-singlegpu
2017-04-11 08:07:19.937 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardtorrentinfo
2017-04-11 08:07:20.008 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardimages
2017-04-11 08:07:20.057 INFO - de

Once the pool is created we can confirm everything by running the command below.

In [753]:
!shipyard pool list --configdir config

2017-04-11 08:29:09.968 INFO - pool_id=singlegpu [state=PoolState.active allocation_state=AllocationState.steady vm_size=standard_nc6, vm_count=1 target_vm_count=1]
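
To check the pool state in code rather than by eye, the `key=value` pairs in a line like the one above can be pulled out with a regular expression. This is only a sketch based on the log format shown here, which may differ between Batch Shipyard versions. The helper `parse_pool_line` is hypothetical.

```python
import re


def parse_pool_line(line):
    """Extract key=value pairs from a pool listing log line into a dict of strings."""
    return dict(re.findall(r"(\w+)=([\w.]+)", line))


sample_line = (
    "2017-04-11 08:29:09.968 INFO - pool_id=singlegpu "
    "[state=PoolState.active allocation_state=AllocationState.steady "
    "vm_size=standard_nc6, vm_count=1 target_vm_count=1]"
)
info = parse_pool_line(sample_line)
print(info["state"], info["allocation_state"])
```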


Now that we have confirmed everything is working we can execute our job using the command below. The tail switch at the end will stream stdout from the node.

In [754]:
%%time
!shipyard jobs add --configdir config --tail stdout.txt

2017-04-11 08:29:13.648 INFO - Adding job cntkjob to pool singlegpu
2017-04-11 08:29:14.168 DEBUG - remote file is the same for shipyardtaskrf-cntkjob/run_cifar10.shipyard.envlist, skipping
2017-04-11 08:29:14.168 INFO - Adding task: run_cifar10
2017-04-11 08:29:14.418 DEBUG - attempting to stream file stdout.txt from job=cntkjob task=run_cifar10

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Done.
Extracting files...
Done.
Preparing train set...
Done.
Preparing test set...
Done.
Writing train text file...
Done.
Writing test text file...
Done.
Converting train data to png images...
Done.
Converting test data to png images...
Done.
Setting up input variables
Creating NN mo

If we simply want to review what happened we can execute the command below. Of course when you delete the job all this information is also deleted.

In [755]:
!shipyard data stream --configdir config --filespec cntkjob,run_cifar10,stdout.txt

2017-04-11 08:48:53.086 DEBUG - attempting to stream file stdout.txt from job=cntkjob task=run_cifar10

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Done.
Extracting files...
Done.
Preparing train set...
Done.
Preparing test set...
Done.
Writing train text file...
Done.
Writing test text file...
Done.
Converting train data to png images...
Done.
Converting test data to png images...
Done.
Setting up input variables
Creating NN model
Training 1195594 parameters in 14 parameter tensors.

Starting training
Finished Epoch[1 of 30]: [Training] loss = 1.943535 * 50000, metric = 71.96% * 50000 19.817s (2523.1 samples/s);
Finished Epoch[2 of 30]: [Training] loss = 1.524170 * 500
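
The per-epoch lines in this output have a regular shape, so the loss and metric can be collected for plotting or comparison. The sketch below assumes the exact line format shown above, which may vary between CNTK versions. The helper `parse_epochs` is hypothetical, and lines that do not match the full pattern (such as a truncated line) are skipped.

```python
import re

EPOCH_PATTERN = re.compile(
    r"Finished Epoch\[(\d+) of (\d+)\]: \[Training\] "
    r"loss = ([\d.]+) \* \d+, metric = ([\d.]+)%"
)


def parse_epochs(text):
    """Return one dict per complete epoch line found in the training output."""
    return [
        {
            "epoch": int(match.group(1)),
            "total_epochs": int(match.group(2)),
            "loss": float(match.group(3)),
            "metric_percent": float(match.group(4)),
        }
        for match in EPOCH_PATTERN.finditer(text)
    ]


sample_output = (
    "Starting training\n"
    "Finished Epoch[1 of 30]: [Training] loss = 1.943535 * 50000, "
    "metric = 71.96% * 50000 19.817s (2523.1 samples/s);\n"
)
print(parse_epochs(sample_output))
```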

If something goes wrong you can run the following command to get the stderr output from the job.

In [756]:
!shipyard data stream --configdir config --filespec cntkjob,run_cifar10,stderr.txt

2017-04-11 08:48:58.680 DEBUG - attempting to stream file stderr.txt from job=cntkjob task=run_cifar10
-------------------------------------------------------------------
Build info: 

		Built time: Apr  3 2017 07:05:06
		Last modified date: Mon Apr  3 05:11:23 2017
		Build type: release
		Build target: GPU
		With 1bit-SGD: no
		With ASGD: yes
		Math lib: mkl
		CUDA_PATH: /usr/local/cuda-8.0
		CUB_PATH: /usr/local/cub-1.4.1
		CUDNN_PATH: /usr/local/cudnn-5.1
		Build Branch: HEAD
		Build SHA1: 7661e81777360d3222a26dcb969973ce1d4c513f
		Built by Source/CNTK/buildinfo.h$$0 on ef88a481c30f
		Build Path: /home/philly/jenkins/workspace/CNTK-Build-Linux
		MPI distribution: Open MPI
		MPI version: 1.10.3
-------------------------------------------------------------------



To delete the job, use the command below. Be aware that this will remove all the files created by the job and its tasks.

In [757]:
!shipyard jobs del -y --configdir config --wait

2017-04-11 08:49:51.208 INFO - Deleting job: cntkjob
2017-04-11 08:49:51.784 DEBUG - waiting for job cntkjob to delete


To deallocate the VM, simply execute the command below. If you do not, the VM will sit idle and you will continue to incur charges.

In [758]:
!shipyard pool del -y --configdir config --wait

2017-04-11 08:50:25.623 INFO - Deleting pool: singlegpu
2017-04-11 08:50:25.941 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardtorrentinfo
2017-04-11 08:50:26.418 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardimages
2017-04-11 08:50:26.540 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardgr
2017-04-11 08:50:26.666 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardregistry
2017-04-11 08:50:26.789 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyarddht
2017-04-11 08:50:26.857 DEBUG - clearing table (pk=batchcntkacc$singlegpu): shipyardperf
2017-04-11 08:50:26.912 DEBUG - deleting queue: shipyardgr-batchcntkacc-singlegpu
2017-04-11 08:50:27.493 DEBUG - deleting container: shipyardrf-batchcntkacc-singlegpu
2017-04-11 08:50:27.814 DEBUG - deleting container: shipyardtor-batchcntkacc-singlegpu
2017-04-11 08:50:27.874 DEBUG - waiting for pool singlegpu to delete


<a id='section7'></a>

## Delete everything
Once you have deleted the pool, all that remains is the storage account and the Batch account. You will only incur a cost for the storage account, but if you wish to delete everything, execute the command below.

In [759]:
!az group delete -n $group_name --yes --verbose

Starting long running operation 'Starting group delete'
Long running operation 'Starting group delete' completed with result None
