# V - Automatic Model Selection from Parametric Sweep using Task Dependencies
In this notebook we take the example from the [Parametric Sweep](04_Parameter_Sweep.ipynb) notebook and automate the entire chain, from training to model selection, using task dependencies in a single Azure Batch job.

* [Setup](#section1)
* [Configure job](#section2)
* [Submit job](#section3)
* [Download best model](#section4)
* [Delete job](#section5)

<a id='section1'></a>

## Setup

Create a simple alias for Batch Shipyard:

In [1]:
%alias shipyard SHIPYARD_CONFIGDIR=config python $HOME/batch-shipyard/shipyard.py %l

Check that everything is working:

In [2]:
shipyard

Usage: shipyard.py [OPTIONS] COMMAND [ARGS]...

  Batch Shipyard: Provision and Execute Docker Workloads on Azure Batch

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cert      Certificate actions
  data      Data actions
  fs        Filesystem in Azure actions
  jobs      Jobs actions
  keyvault  KeyVault actions
  misc      Miscellaneous actions
  pool      Pool actions
  storage   Storage actions


Read in the account information we saved earlier:

In [3]:
import json

def read_json(filename):
    with open(filename, 'r') as infile:
        return json.load(infile)
    
account_info = read_json('account_information.json')

storage_account_key = account_info['storage_account_key']
storage_account_name = account_info['storage_account_name']
IMAGE_NAME = account_info['IMAGE_NAME']
STORAGE_ALIAS = account_info['STORAGE_ALIAS']

<a id='section2'></a>

## Configure Job

As with the previous job, which ran on a single node, this job will run on GPU-enabled nodes. We will repeat the parametric search from the previous notebook, but instead of downloading all the results and evaluating model performance ourselves, a final task will do that for us using task dependencies.

In [4]:
import os

def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)


Generate the `jobs.json` configuration file:

In [5]:
JOB_ID = 'cntk-ps-as-job'

jobs = {
    "job_specifications": [
        {
            "id": JOB_ID,
            "tasks": [
                        {
                            "image": IMAGE_NAME,
                            "task_factory": {
                                "parametric_sweep": {
                                    "product": [
                                        {  # num_convolution_layers
                                            "start": 2,
                                            "stop": 4,
                                            "step": 1
                                        },
                                        {  # minibatch_size
                                            "start": 32,
                                            "stop": 96,
                                            "step": 32
                                        }
                                    ]
                                }
                            },
                            "command": "bash -c \"source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data --num_convolution_layers {0} --minibatch_size {1} --max_epochs 30\"",
                            "remove_container_after_exit": True,
                            "gpu": True,
                            "resource_files": [
                                    {
                                        "file_path": "ConvNet_CIFAR10.py",
                                        "blob_source": "https://batchshipyardexamples.blob.core.windows.net/code/ConvNet_CIFAR10.py",
                                        "file_mode": "0777"
                                    }
                            ],
                            "output_data": {
                                "azure_storage": [
                                    {
                                        "storage_account_settings": STORAGE_ALIAS,
                                        "container": "output",
                                        "source": "$AZ_BATCH_TASK_DIR/wd/Models"
                                    },
                                ]
                            },
                        }
            ]  
        }
    ]
}

# 2 values of num_convolution_layers x 2 values of minibatch_size
num_parameter_sweep_tasks = 4
print('number of tasks for parametric sweep {}: {}'.format(JOB_ID, num_parameter_sweep_tasks))

number of tasks for parametric sweep cntk-ps-as-job: 4
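
The task count above is hard-coded, so it can help to check it against the sweep definition. The following is a minimal sketch, assuming that each `start`/`stop`/`step` axis expands like Python's `range()` (with `stop` exclusive), that the `{0}` and `{1}` placeholders in the command are filled positionally, and that tasks are numbered in the order `itertools.product` yields combinations. The shortened `template` string is illustrative only and is not the full command from the job specification.

```python
import itertools

# Sweep axes copied from the job specification above.
sweep_axes = [
    {"start": 2, "stop": 4, "step": 1},     # num_convolution_layers
    {"start": 32, "stop": 96, "step": 32},  # minibatch_size
]

# Assumption: each axis expands like Python's range(), so "stop" is exclusive.
axis_values = [range(a["start"], a["stop"], a["step"]) for a in sweep_axes]
combinations = list(itertools.product(*axis_values))

# Shortened, illustrative command template using the same {0} and {1} slots.
template = ("python -u ConvNet_CIFAR10.py "
            "--num_convolution_layers {0} --minibatch_size {1}")
commands = [template.format(*combo) for combo in combinations]

for task_index, command in enumerate(commands):
    print(task_index, command)
print("expected number of sweep tasks:", len(combinations))
```

Under these assumptions the count agrees with `num_parameter_sweep_tasks`, and widening either axis changes the count without any manual bookkeeping.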


Now we'll create the Python program that performs the best model selection. Note that this code is nearly identical to the code for selecting the best model locally in the [Parameter sweep notebook](04_Parameter_Sweep.ipynb).

In [6]:
%%writefile autoselect.py
import json
import os
import shutil

def read_json(filename):
    with open(filename, 'r') as infile:
        return json.load(infile)

def scandir(basedir):
    for root, dirs, files in os.walk(basedir):
        for f in files:
            yield os.path.join(root, f) 

MODELS_DIR = os.path.join('wd', 'Models')
            
results_dict = {}
for model in scandir(MODELS_DIR):
    if not model.endswith('.json'):
        continue
    # path is wd/Models/<configuration>/<file>, so index 2 is the configuration directory
    key = model.split(os.sep)[2]
    results_dict[key] = read_json(model)

# use items() instead of iteritems() as this will be run in python3
tuple_min_error = min(results_dict.items(), key=lambda x: x[1]['test_metric'])
configuration_with_min_error = tuple_min_error[0]
print('task with smallest error: {} ({})'.format(configuration_with_min_error, tuple_min_error[1]['test_metric']))

# copy best model to wd
MODEL_NAME = 'ConvNet_CIFAR10_model.dnn'
shutil.copy(os.path.join(MODELS_DIR, configuration_with_min_error, MODEL_NAME), '.')

Writing autoselect.py
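
To see the selection step of `autoselect.py` in isolation, here is a small sketch that runs the same `min()` call on an in-memory dictionary. The configuration names and `test_metric` values are hypothetical and are not results from this job.

```python
# Hypothetical results: the keys and error values are made up for illustration.
results_dict = {
    "config_a": {"test_metric": 0.31},
    "config_b": {"test_metric": 0.24},
    "config_c": {"test_metric": 0.27},
}

# min() compares the (key, value) pairs by the test_metric stored in each value,
# so the pair with the lowest error is returned.
best_key, best_result = min(results_dict.items(),
                            key=lambda item: item[1]["test_metric"])
print("task with smallest error: {} ({})".format(best_key,
                                                 best_result["test_metric"]))
```

Because the key of the winning pair is the configuration directory name, it can be joined directly with `MODELS_DIR` to locate the model file, as the script does.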


We now need to prepare the file to be uploaded to the Azure Storage account to be referenced in the task:

In [7]:
INPUT_CONTAINER = 'input-autoselect'
OUTPUT_CONTAINER = 'output-autoselect'
UPLOAD_DIR = 'autoselect_upload'

!rm -rf $UPLOAD_DIR
!mkdir -p $UPLOAD_DIR
!mv autoselect.py $UPLOAD_DIR
!ls -alF $UPLOAD_DIR

total 12
drwxr-xr-x  2 nbuser nbuser 4096 Aug 11 10:43 ./
drwx------ 16 nbuser nbuser 4096 Aug 11 10:43 ../
-rw-r--r--  1 nbuser nbuser 1001 Aug 11 10:43 autoselect.py


Alias `blobxfer` and upload it to `INPUT_CONTAINER`:

In [8]:
%alias blobxfer python -m blobxfer

In [9]:
blobxfer $storage_account_name $INPUT_CONTAINER $UPLOAD_DIR --upload --storageaccountkey $storage_account_key

 azure blobxfer parameters [v0.12.1]
             platform: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid
   python interpreter: CPython 2.7.11
     package versions: az.common=1.1.8 az.sml=0.20.5 az.stor=0.35.1 crypt=2.0.3 req=2.18.3
      subscription id: None
      management cert: None
   transfer direction: local->Azure
       local resource: autoselect_upload
      include pattern: None
      remote resource: None
   max num of workers: 24
              timeout: None
      storage account: batch71d77646st
              use SAS: False
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: input-autoselect
  container/share URI: https://batch71d77646st.blob.core.windows.net/input-autoselect
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: True
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 1
       

Now we'll append the `auto-model-selection` task, which depends on the prior training tasks. The important property here is `depends_on_range`, which specifies the range of task ids that the `auto-model-selection` task depends on. Additionally, this task requires the output of the prior training tasks, which is specified in `input_data`.

In [10]:
def generate_input_data_spec(job_id, task_id):
    return {
        "job_id": job_id,
        "task_id": task_id,
        "include": ["wd/Models/*_{}_{}/*".format(task_id, job_id)]
    }

input_data = []
for x in range(0, num_parameter_sweep_tasks):
    input_data.append(generate_input_data_spec(JOB_ID, '{}'.format(x)))

model_selection_task = {
    "id": "auto-model-selection",
    "command": 'bash -c "source /cntk/activate-cntk; python -u autoselect.py"',
    "depends_on_range": [0, num_parameter_sweep_tasks - 1],
    "image": IMAGE_NAME,
    "remove_container_after_exit": True,
    "input_data": {
        "azure_batch": input_data,
        "azure_storage": [
            {
                "storage_account_settings": STORAGE_ALIAS,
                "container": INPUT_CONTAINER
            }
        ]
    },
    "output_data": {
        "azure_storage": [
            {
                "storage_account_settings": STORAGE_ALIAS,
                "container": OUTPUT_CONTAINER,
                "include": ["*wd/ConvNet_CIFAR10_model.dnn"],
                "blobxfer_extra_options": "--delete --strip-components 2"
            }
        ]
    }
}

# append auto-model-selection task to jobs
jobs['job_specifications'][0]['tasks'].append(model_selection_task)
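
The selection task should only pull data from tasks it also depends on, otherwise that data may not exist yet when it starts. A minimal sketch of this consistency check follows, assuming that both ends of `depends_on_range` are inclusive (which is what `num_parameter_sweep_tasks - 1` in the cell above suggests). The names `dependency_ids`, `input_task_ids` and `missing` are introduced here for illustration.

```python
num_parameter_sweep_tasks = 4  # same value as in the notebook
depends_on_range = [0, num_parameter_sweep_tasks - 1]

# Assumption: both ends of the range are inclusive, so [0, 3] covers tasks 0 to 3.
first, last = depends_on_range
dependency_ids = [str(task_id) for task_id in range(first, last + 1)]

# Task ids requested through input_data, built the same way as in the cell above.
input_task_ids = [str(x) for x in range(0, num_parameter_sweep_tasks)]

# Any id listed here would be read without waiting for that task to finish.
missing = sorted(set(input_task_ids) - set(dependency_ids))
print("dependencies:", dependency_ids)
print("input tasks without a matching dependency:", missing)
```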

In [11]:
write_json_to_file(jobs, os.path.join('config', 'jobs.json'))
print(json.dumps(jobs, indent=4, sort_keys=True))

{
    "job_specifications": [
        {
            "id": "cntk-ps-as-job", 
            "tasks": [
                {
                    "command": "bash -c \"source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data --num_convolution_layers {0} --minibatch_size {1} --max_epochs 30\"", 
                    "gpu": true, 
                    "image": "microsoft/cntk:2.0-gpu-python3.5-cuda8.0-cudnn5.1", 
                    "output_data": {
                        "azure_storage": [
                            {
                                "container": "output", 
                                "source": "$AZ_BATCH_TASK_DIR/wd/Models", 
                                "storage_account_settings": "mystorageaccount"
                            }
                        ]
                    }, 
                    "remove_container_after_exit": true, 
                    "resource_files": [
                        {
                            "b

<a id='section3'></a>

## Submit job
Check that everything is OK with our pool before we submit our job:

In [12]:
shipyard pool list

2017-08-11 10:44:11,748 INFO - list of pools
* pool id: gpupool
  * vm size: standard_nc6
  * state: PoolState.active
  * allocation state: AllocationState.steady
  * no resize errors
  * vm count:
    * dedicated:
      * current: 3
      * target: 3
    * low priority:
      * current: 0
      * target: 0
  * node agent: batch.node.ubuntu 16.04


Now that we have confirmed everything is working, we can submit our job using the command below.

In [13]:
shipyard jobs add

2017-08-11 10:44:19,959 INFO - Adding job cntk-ps-as-job to pool gpupool
2017-08-11 10:44:20,682 INFO - uploading file /tmp/tmpMUZiU1 as u'shipyardtaskrf-cntk-ps-as-job/0.shipyard.envlist'
2017-08-11 10:44:21,099 INFO - uploading file /tmp/tmpymGaGJ as u'shipyardtaskrf-cntk-ps-as-job/1.shipyard.envlist'
2017-08-11 10:44:21,703 INFO - uploading file /tmp/tmpnZ5CsU as u'shipyardtaskrf-cntk-ps-as-job/2.shipyard.envlist'
2017-08-11 10:44:22,317 INFO - uploading file /tmp/tmp2Ah4dW as u'shipyardtaskrf-cntk-ps-as-job/3.shipyard.envlist'
2017-08-11 10:44:22,871 DEBUG - submitting 5 tasks (0 -> 4) to job cntk-ps-as-job
2017-08-11 10:44:23,259 INFO - submitted all 5 tasks to job cntk-ps-as-job


Using the command below we can check the status of the tasks in our job. Once all tasks, including `auto-model-selection`, have an exit code we can continue. You can also view the **heatmap** of this pool on [Azure Portal](https://portal.azure.com) to monitor the progress of this job on the compute nodes under your Batch account.

In [16]:
shipyard jobs listtasks --jobid $JOB_ID

2017-08-11 11:11:46,992 INFO - list of tasks for job cntk-ps-as-job
* task id: 0
  * job id: cntk-ps-as-job
  * state: TaskState.completed
  * max retries: 0
  * retention time: 10675199 days, 2:48:05.477581
  * execution details:
    * pool id: gpupool
    * node id: tvm-3257026573_1-20170811t093905z
    * started: 2017-08-11 10:44:24.853753+00:00
    * completed: 2017-08-11 10:54:49.762664+00:00
    * duration: 0:10:24.908911
    * exit code: 0
* task id: 1
  * job id: cntk-ps-as-job
  * state: TaskState.completed
  * max retries: 0
  * retention time: 10675199 days, 2:48:05.477581
  * execution details:
    * pool id: gpupool
    * node id: tvm-3257026573_2-20170811t093905z
    * started: 2017-08-11 10:44:24.884756+00:00
    * completed: 2017-08-11 10:52:26.989212+00:00
    * duration: 0:08:02.104456
    * exit code: 0
* task id: 2
  * job id: cntk-ps-as-job
  * state: TaskState.completed
  * max retries: 0
  * retention time: 10675199 days, 2:48:05.4775
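
If the task listing is captured as text, the "every task has an exit code" check can be automated. The following is a rough sketch that assumes the listing keeps the layout shown above and that a task which has not finished simply has no `exit code` line. The sample text is abbreviated, and its task 3 entry is hypothetical, shown as still running to illustrate the pending case.

```python
import re

# Abbreviated sample in the same layout as the listing above; the task 3 entry
# is hypothetical and has no exit code yet.
sample_listing = """\
* task id: 0
  * state: TaskState.completed
    * exit code: 0
* task id: 1
  * state: TaskState.completed
    * exit code: 0
* task id: 3
  * state: TaskState.running
"""

def exit_codes_by_task(listing):
    """Map each task id to its exit code, or None if none is reported yet."""
    codes = {}
    current = None
    for line in listing.splitlines():
        stripped = line.strip()
        task_match = re.match(r"\* task id: (\S+)", stripped)
        code_match = re.match(r"\* exit code: (-?\d+)", stripped)
        if task_match:
            current = task_match.group(1)
            codes[current] = None
        elif code_match and current is not None:
            codes[current] = int(code_match.group(1))
    return codes

codes = exit_codes_by_task(sample_listing)
pending = [task for task, code in codes.items() if code is None]
failed = [task for task, code in codes.items() if code not in (None, 0)]
print("pending:", pending, "failed:", failed)
```

An empty `pending` list means every task has reported an exit code, and a non-empty `failed` list is a reason to inspect the task logs before downloading the model.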

<a id='section4'></a>

## Download best model
The best performing model from the parametric sweep should now have been saved to our `OUTPUT_CONTAINER` container by the `auto-model-selection` task. We will download it into the local directory named by `MODELS_DIR`:

In [17]:
MODELS_DIR = 'auto-selected-model'

Download the best performing model:

In [18]:
blobxfer $storage_account_name $OUTPUT_CONTAINER $MODELS_DIR --remoteresource . --download --storageaccountkey $storage_account_key

 azure blobxfer parameters [v0.12.1]
             platform: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid
   python interpreter: CPython 2.7.11
     package versions: az.common=1.1.8 az.sml=0.20.5 az.stor=0.35.1 crypt=2.0.3 req=2.18.3
      subscription id: None
      management cert: None
   transfer direction: Azure->local
       local resource: auto-selected-model
      include pattern: None
      remote resource: .
   max num of workers: 24
              timeout: None
      storage account: batch71d77646st
              use SAS: False
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: output-autoselect
  container/share URI: https://batch71d77646st.blob.core.windows.net/output-autoselect
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: True
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 1
      

The best model file (`ConvNet_CIFAR10_model.dnn`) is now ready for use.

In [19]:
!ls -alF $MODELS_DIR

total 1920
drwxr-xr-x  2 nbuser nbuser    4096 Aug 11 11:11 ./
drwx------ 17 nbuser nbuser    4096 Aug 11 11:11 ../
-rw-r--r--  1 nbuser nbuser 1957354 Aug 11 11:11 ConvNet_CIFAR10_model.dnn


<a id='section5'></a>

## Delete job

To delete the job, use the command below. Be aware that this will remove all the files created by the job and its tasks.

In [20]:
shipyard jobs del -y --termtasks --wait

2017-08-11 11:12:04,390 DEBUG - Skipping termination of completed task 0 on job cntk-ps-as-job
2017-08-11 11:12:04,754 DEBUG - Skipping termination of completed task 1 on job cntk-ps-as-job
2017-08-11 11:12:05,129 DEBUG - Skipping termination of completed task 2 on job cntk-ps-as-job
2017-08-11 11:12:05,530 DEBUG - Skipping termination of completed task 3 on job cntk-ps-as-job
2017-08-11 11:12:05,717 DEBUG - Skipping termination of completed task auto-model-selection on job cntk-ps-as-job
2017-08-11 11:12:06,176 INFO - deleting job: cntk-ps-as-job
2017-08-11 11:12:06,358 DEBUG - waiting for job cntk-ps-as-job to delete
2017-08-11 11:12:38,422 INFO - job cntk-ps-as-job does not exist
