# IV - Parameter Sweep
In this notebook we will run a simple parameter sweep on our model. We will then retrieve the results of the sweep and use them to download the best-performing model from blob storage.

* [Setup](#section1)
* [Configure job](#section2)
* [Submit job](#section3)
* [Check results](#section4)
* [Download best model](#section5)
* [Delete job](#section6)

<a id='section1'></a>

## Setup

Create a simple alias for Batch Shipyard

In [21]:
%alias shipyard SHIPYARD_CONFIGDIR=config python $HOME/batch-shipyard/shipyard.py %l

Check that everything is working

In [22]:
shipyard

Usage: shipyard.py [OPTIONS] COMMAND [ARGS]...

  Batch Shipyard: Provision and Execute Docker Workloads on Azure Batch

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  cert      Certificate actions
  data      Data actions
  fs        Filesystem in Azure actions
  jobs      Jobs actions
  keyvault  KeyVault actions
  misc      Miscellaneous actions
  pool      Pool actions
  storage   Storage actions


Read in the account information we saved earlier

In [23]:
import json

def read_json(filename):
    with open(filename, 'r') as infile:
        return json.load(infile)
    
account_info = read_json('account_information.json')

storage_account_key = account_info['storage_account_key']
storage_account_name = account_info['storage_account_name']
IMAGE_NAME = account_info['IMAGE_NAME']
STORAGE_ALIAS = account_info['STORAGE_ALIAS']

<a id='section2'></a>

## Configure job

As with the previous single-node job, we will run this job on GPU-enabled nodes. The difference is that we will use a [task factory](https://github.com/Azure/batch-shipyard/blob/master/docs/35-batch-shipyard-task-factory.md). The task factory generates a number of tasks, each with a different set of parameters that is passed to our model training script. These parameters affect how the model trains and, ultimately, how well it performs. Each task records the trained model and its evaluation results on the node; when the task ends, these are copied to the specified storage container.

In [24]:
import os


def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)


Generate the `jobs.json` configuration file

For the purposes of the tutorial, we are constraining the parameter search space to just 4 final combinations.
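
To make the count of 4 concrete, here is a minimal sketch of how a `product` sweep could expand into per-task commands. It assumes each `start`/`stop`/`step` entry behaves like Python's `range()` (so `stop` is excluded), which is consistent with the 4 tasks submitted later in this notebook. The `template` string is a shortened, illustrative stand-in for the real `command`, and the task numbering is only an enumeration of the combinations, not a statement about how Batch Shipyard assigns ids.

```python
from itertools import product

# Sweep ranges copied from the "product" list in the jobs configuration.
sweep = [
    {"start": 2, "stop": 4, "step": 1},     # num_convolution_layers
    {"start": 32, "stop": 96, "step": 32},  # minibatch_size
]

# Shortened, illustrative command template; {0} and {1} are filled per task.
template = ("python -u ConvNet_CIFAR10.py "
            "--num_convolution_layers {0} --minibatch_size {1}")

# Assumption: each entry expands like range(), i.e. "stop" is exclusive.
axes = [range(s["start"], s["stop"], s["step"]) for s in sweep]

# The Cartesian product of all axes gives one parameter tuple per task.
combinations = list(product(*axes))
commands = [template.format(*combo) for combo in combinations]

for index, command in enumerate(commands):
    print('{}: {}'.format(index, command))
```

With the ranges above, the two axes are `[2, 3]` and `[32, 64]`, which yields 4 combinations.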

<a id='section3'></a>

In [25]:
JOB_ID = 'cntk-parametricsweep-job'

jobs = {
    "job_specifications": [
        {
            "id": JOB_ID,
            "tasks": [
                        {
                            "image": IMAGE_NAME,
                            "task_factory": {
                                "parametric_sweep": {
                                    "product": [
                                        {#num_convolution_layers
                                            "start": 2,
                                            "stop": 4,
                                            "step": 1
                                        },
                                        {#minibatch_size
                                            "start": 32,
                                            "stop": 96,
                                            "step": 32
                                        }
                                    ]
                                }
                            },
                            "command": "bash -c \"source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data --num_convolution_layers {0} --minibatch_size {1} --max_epochs 30\"",
                            "remove_container_after_exit": True,
                            "gpu": True,
                            "resource_files": [
                                    {
                                        "file_path": "ConvNet_CIFAR10.py",
                                        "blob_source": "https://batchshipyardexamples.blob.core.windows.net/code/ConvNet_CIFAR10.py",
                                        "file_mode": "0777"
                                    }
                            ],
                            "output_data": {
                                "azure_storage": [
                                    {
                                        "storage_account_settings": STORAGE_ALIAS,
                                        "container": "output",
                                        "source": "$AZ_BATCH_TASK_DIR/wd/Models"
                                    },
                                ]
                            },
                        }
            ]  
        }
    ]
}

In [26]:
write_json_to_file(jobs, os.path.join('config', 'jobs.json'))
print(json.dumps(jobs, indent=4, sort_keys=True))

{
    "job_specifications": [
        {
            "id": "cntk-parametricsweep-job", 
            "tasks": [
                {
                    "command": "bash -c \"source /cntk/activate-cntk; python -u ConvNet_CIFAR10.py --datadir $AZ_BATCH_NODE_SHARED_DIR/data --num_convolution_layers {0} --minibatch_size {1} --max_epochs 30\"", 
                    "gpu": true, 
                    "image": "microsoft/cntk:2.0-gpu-python3.5-cuda8.0-cudnn5.1", 
                    "output_data": {
                        "azure_storage": [
                            {
                                "container": "output", 
                                "source": "$AZ_BATCH_TASK_DIR/wd/Models", 
                                "storage_account_settings": "mystorageaccount"
                            }
                        ]
                    }, 
                    "remove_container_after_exit": true, 
                    "resource_files": [
                        {
                    

## Submit job
Check that the nodes in our pool are healthy before we submit our job.

In [27]:
shipyard pool listnodes

2017-08-11 10:19:40,332 INFO - compute nodes for pool gpupool
* node id: tvm-3257026573_1-20170811t093905z
  * state: ComputeNodeState.idle
  * scheduling state: SchedulingState.enabled
  * no errors
  * start task:
    * exit code: 0
    * started: 2017-08-11 09:41:12.873732+00:00
    * completed: 2017-08-11 09:48:32.262192+00:00
    * duration: 0:07:19.388460
  * vm size: standard_nc6
  * dedicated: True
  * ip address: 10.0.0.6
  * running tasks: 0
  * total tasks run: 0
  * total tasks succeeded: 0
* node id: tvm-3257026573_2-20170811t093905z
  * state: ComputeNodeState.idle
  * scheduling state: SchedulingState.enabled
  * no errors
  * start task:
    * exit code: 0
    * started: 2017-08-11 09:40:34.895513+00:00
    * completed: 2017-08-11 09:47:46.141084+00:00
    * duration: 0:07:11.245571
  * vm size: standard_nc6
  * dedicated: True
  * ip address: 10.0.0.5
  * running tasks: 0
  * total tasks run: 0
  * total tasks succeeded: 0
* node id: tvm-

Now that we have confirmed everything is working we can execute our job using the command below. 

In [28]:
shipyard jobs add

2017-08-11 10:19:49,006 INFO - Adding job cntk-parametricsweep-job to pool gpupool
2017-08-11 10:19:50,268 INFO - uploading file /tmp/tmpvMVlvn as u'shipyardtaskrf-cntk-parametricsweep-job/0.shipyard.envlist'
2017-08-11 10:19:50,950 INFO - uploading file /tmp/tmpOF_qOz as u'shipyardtaskrf-cntk-parametricsweep-job/1.shipyard.envlist'
2017-08-11 10:19:51,646 INFO - uploading file /tmp/tmpffi81f as u'shipyardtaskrf-cntk-parametricsweep-job/2.shipyard.envlist'
2017-08-11 10:19:52,511 INFO - uploading file /tmp/tmpSwIVwP as u'shipyardtaskrf-cntk-parametricsweep-job/3.shipyard.envlist'
2017-08-11 10:19:52,759 DEBUG - submitting 4 tasks (0 -> 3) to job cntk-parametricsweep-job
2017-08-11 10:19:53,259 INFO - submitted all 4 tasks to job cntk-parametricsweep-job


The command below checks the status of our job. We can only continue with the notebook once every task has an exit code, so re-run the cell below periodically until all tasks show the completed state with an exit code. Continuing before every task has finished training will cause subsequent cells to fail.

You can also view the **heatmap** of this pool on [Azure Portal](https://portal.azure.com) to monitor the progress of this job on the compute nodes under your Batch account.
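
If you would rather check the completion condition programmatically than by eye, the sketch below parses task listing text in the format printed by `jobs listtasks` in this notebook. It assumes the output has already been captured into a string; `summarize_tasks` and `all_tasks_done` are hypothetical helpers written for this illustration, not part of Batch Shipyard, and the sample text is an abbreviated copy of the listing shown in this notebook.

```python
import re


def summarize_tasks(listing):
    """Map task id -> {'state', 'exit_code'} parsed from listing text."""
    tasks = {}
    current = None
    for raw_line in listing.splitlines():
        line = raw_line.strip()
        match = re.match(r'\* task id: (\S+)', line)
        if match:
            current = match.group(1)
            tasks[current] = {'state': None, 'exit_code': None}
            continue
        if current is None:
            continue  # skip log lines that precede the first task
        match = re.match(r'\* state: TaskState\.(\w+)', line)
        if match:
            tasks[current]['state'] = match.group(1)
        match = re.match(r'\* exit code: (-?\d+)', line)
        if match:
            tasks[current]['exit_code'] = int(match.group(1))
    return tasks


def all_tasks_done(tasks, expected_count):
    """True only if every expected task is completed and has an exit code."""
    if len(tasks) != expected_count:
        return False
    return all(t['state'] == 'completed' and t['exit_code'] is not None
               for t in tasks.values())


# Abbreviated sample in the format of the listing in this notebook.
sample = """
* task id: 0
  * job id: cntk-parametricsweep-job
  * state: TaskState.completed
  * execution details:
    * exit code: 0
* task id: 1
  * job id: cntk-parametricsweep-job
  * state: TaskState.completed
  * execution details:
    * exit code: 0
"""

tasks = summarize_tasks(sample)
print(all_tasks_done(tasks, expected_count=2))
```

Passing the expected number of tasks (4 for this sweep) guards against treating a partial listing as finished.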

In [32]:
shipyard jobs listtasks --jobid $JOB_ID

2017-08-11 10:39:39,162 INFO - list of tasks for job cntk-parametricsweep-job
* task id: 0
  * job id: cntk-parametricsweep-job
  * state: TaskState.completed
  * max retries: 0
  * retention time: 10675199 days, 2:48:05.477581
  * execution details:
    * pool id: gpupool
    * node id: tvm-3257026573_1-20170811t093905z
    * started: 2017-08-11 10:19:54.480387+00:00
    * completed: 2017-08-11 10:30:18.602361+00:00
    * duration: 0:10:24.121974
    * exit code: 0
* task id: 1
  * job id: cntk-parametricsweep-job
  * state: TaskState.completed
  * max retries: 0
  * retention time: 10675199 days, 2:48:05.477581
  * execution details:
    * pool id: gpupool
    * node id: tvm-3257026573_2-20170811t093905z
    * started: 2017-08-11 10:19:54.388684+00:00
    * completed: 2017-08-11 10:27:55.844296+00:00
    * duration: 0:08:01.455612
    * exit code: 0
* task id: 2
  * job id: cntk-parametricsweep-job
  * state: TaskState.completed
  * max retries: 0
  * ret

<a id='section4'></a>

## Check results
The results of our parameter search should now be saved to our output container.

**Note:** You will encounter errors if you did not wait for all tasks to complete with an exit code in the previous cell.

First let's alias `blobxfer` to aid in downloading our blobs. We will aggregate our results in the `MODELS_DIR`.

In [33]:
%alias blobxfer python -m blobxfer

MODELS_DIR = 'psmodels'

In [34]:
blobxfer $storage_account_name output $MODELS_DIR --remoteresource . --download --include "*_$JOB_ID/model_results.json" --storageaccountkey $storage_account_key

 azure blobxfer parameters [v0.12.1]
             platform: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid
   python interpreter: CPython 2.7.11
     package versions: az.common=1.1.8 az.sml=0.20.5 az.stor=0.35.1 crypt=2.0.3 req=2.18.3
      subscription id: None
      management cert: None
   transfer direction: Azure->local
       local resource: psmodels
      include pattern: *_cntk-parametricsweep-job/model_results.json
      remote resource: .
   max num of workers: 24
              timeout: None
      storage account: batch71d77646st
              use SAS: False
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: output
  container/share URI: https://batch71d77646st.blob.core.windows.net/output
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: True
  keep mismatched MD5: False
     recursive if dir: True
component strip on up: 
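
The `--include` pattern in the download command selects only the small `model_results.json` files and skips the much larger model files. The sketch below illustrates that selection with Python's `fnmatch` module, under the assumption that the include filter follows the same shell-style wildcard rules; the blob names are examples in the form seen elsewhere in this notebook.

```python
from fnmatch import fnmatchcase

JOB_ID = 'cntk-parametricsweep-job'

# Same pattern as the --include argument, with $JOB_ID expanded.
pattern = '*_{}/model_results.json'.format(JOB_ID)

# Example blob names in the form seen in the output container.
blob_names = [
    '270d141e_0_cntk-parametricsweep-job/model_results.json',
    '6c24fc0b_1_cntk-parametricsweep-job/model_results.json',
    '80534ba8_2_cntk-parametricsweep-job/ConvNet_CIFAR10_model.dnn',
]

# Assumption: shell-style matching, where "*" may match any characters.
selected = [name for name in blob_names if fnmatchcase(name, pattern)]

for name in selected:
    print(name)
```

Only the two `model_results.json` names match; the `.dnn` model file is filtered out and is fetched separately once the best task is known.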

Now we will combine all of the `model_results.json` files into one dictionary for analysis.

In [35]:
def scandir(basedir):
    for root, dirs, files in os.walk(basedir):
        for f in files:
            yield os.path.join(root, f) 

results_dict = {}
for model in scandir(MODELS_DIR):
    if not model.endswith('.json'):
        continue
    # Use the name of the directory containing the file (one per task) as the key
    key = os.path.basename(os.path.dirname(model))
    results_dict[key] = read_json(model)
    
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "09ec4f8d_task-00001_cntk-parametricsweep-job": {
        "parameters": {
            "max_epochs": 30, 
            "minibatch_size": 64, 
            "num_convolution_layers": 2
        }, 
        "test_metric": 0.2509
    }, 
    "270d141e_0_cntk-parametricsweep-job": {
        "parameters": {
            "max_epochs": 30, 
            "minibatch_size": 32, 
            "num_convolution_layers": 2
        }, 
        "test_metric": 0.2339
    }, 
    "6834b826_3_cntk-parametricsweep-job": {
        "parameters": {
            "max_epochs": 30, 
            "minibatch_size": 64, 
            "num_convolution_layers": 3
        }, 
        "test_metric": 0.2349
    }, 
    "6c24fc0b_1_cntk-parametricsweep-job": {
        "parameters": {
            "max_epochs": 30, 
            "minibatch_size": 64, 
            "num_convolution_layers": 2
        }, 
        "test_metric": 0.2293
    }, 
    "80534ba8_2_cntk-parametricsweep-job": {
        "parameters": {
            "max_epo

From the aggregated results dictionary, we select the one with the smallest error:

In [36]:
# items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
tuple_min_error = min(results_dict.items(), key=lambda x: x[1]['test_metric'])
configuration_with_min_error = tuple_min_error[0]
print('task with smallest error: {} ({})'.format(configuration_with_min_error, tuple_min_error[1]['test_metric']))

task with smallest error: 80534ba8_2_cntk-parametricsweep-job (0.2212)
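
Taking the minimum returns only the winner. To see how the configurations compare, a small sketch like the one below ranks every entry by `test_metric`. It uses only the three entries that are fully visible in the output above, so the top entry of this subset is not the overall best task reported by the notebook; `results_subset` is an illustrative name, and in the notebook you would pass `results_dict` instead.

```python
# Three entries copied from the aggregated results printed above.
results_subset = {
    '270d141e_0_cntk-parametricsweep-job': {
        'parameters': {'max_epochs': 30, 'minibatch_size': 32,
                       'num_convolution_layers': 2},
        'test_metric': 0.2339,
    },
    '6834b826_3_cntk-parametricsweep-job': {
        'parameters': {'max_epochs': 30, 'minibatch_size': 64,
                       'num_convolution_layers': 3},
        'test_metric': 0.2349,
    },
    '6c24fc0b_1_cntk-parametricsweep-job': {
        'parameters': {'max_epochs': 30, 'minibatch_size': 64,
                       'num_convolution_layers': 2},
        'test_metric': 0.2293,
    },
}

# Sort ascending by error so the best configuration comes first.
ranked = sorted(results_subset.items(), key=lambda kv: kv[1]['test_metric'])

for name, result in ranked:
    params = result['parameters']
    print('{}  layers={}  minibatch={}  error={:.4f}'.format(
        name,
        params['num_convolution_layers'],
        params['minibatch_size'],
        result['test_metric']))
```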


<a id='section5'></a>

## Download best model
Now we'll download the corresponding best performing model.

In [37]:
MODEL_NAME = 'ConvNet_CIFAR10_model.dnn'
BEST_MODEL_BLOB_NAME = '{}/{}'.format(configuration_with_min_error, MODEL_NAME)
print(BEST_MODEL_BLOB_NAME)

80534ba8_2_cntk-parametricsweep-job/ConvNet_CIFAR10_model.dnn


In [38]:
blobxfer $storage_account_name output $MODELS_DIR --remoteresource $BEST_MODEL_BLOB_NAME --download --storageaccountkey $storage_account_key

 azure blobxfer parameters [v0.12.1]
             platform: Linux-4.4.0-87-generic-x86_64-with-debian-stretch-sid
   python interpreter: CPython 2.7.11
     package versions: az.common=1.1.8 az.sml=0.20.5 az.stor=0.35.1 crypt=2.0.3 req=2.18.3
      subscription id: None
      management cert: None
   transfer direction: Azure->local
       local resource: psmodels
      include pattern: None
      remote resource: 80534ba8_2_cntk-parametricsweep-job/ConvNet_CIFAR10_model.dnn
   max num of workers: 24
              timeout: None
      storage account: batch71d77646st
              use SAS: False
  upload as page blob: False
  auto vhd->page blob: False
 upload to file share: False
 container/share name: output
  container/share URI: https://batch71d77646st.blob.core.windows.net/output
    compute block MD5: False
     compute file MD5: True
    skip on MD5 match: True
   chunk size (bytes): 4194304
     create container: True
  keep mismatched MD5: False
     recursive if dir: True
comp

In [39]:
!mv $MODELS_DIR/$configuration_with_min_error/$MODEL_NAME $MODELS_DIR
!rm -rf $MODELS_DIR/*_$JOB_ID  # optionally remove all of the temporary result json directories/files
!ls -alF $MODELS_DIR

total 1920
drwxr-xr-x  2 nbuser nbuser    4096 Aug 11 10:40 ./
drwx------ 15 nbuser nbuser    4096 Aug 11 09:48 ../
-rw-r--r--  1 nbuser nbuser 1957354 Aug 11 10:40 ConvNet_CIFAR10_model.dnn


The best model file (`ConvNet_CIFAR10_model.dnn`) is now ready for use.

**Note:** We could have created a Batch task that did the model selection for us using task dependencies. The model selection task would be dependent upon all of the parametric sweep training tasks and would only run after those tasks complete successfully. The Batch task could then proceed with the logic above.

Please see the advanced notebook that shows how this is accomplished: [Automatic Model Selection from Parametric Sweep with Task Dependencies](05_Advanced_Auto_Model_Selection.ipynb)

<a id='section6'></a>

## Delete job

To delete the job, use the command below. Be aware that this will delete all of the files created by the job and its tasks.

In [40]:
shipyard jobs del -y --termtasks --wait

2017-08-11 10:40:55,297 DEBUG - Skipping termination of completed task 0 on job cntk-parametricsweep-job
2017-08-11 10:40:55,651 DEBUG - Skipping termination of completed task 1 on job cntk-parametricsweep-job
2017-08-11 10:40:56,017 DEBUG - Skipping termination of completed task 2 on job cntk-parametricsweep-job
2017-08-11 10:40:56,188 DEBUG - Skipping termination of completed task 3 on job cntk-parametricsweep-job
2017-08-11 10:40:56,375 INFO - deleting job: cntk-parametricsweep-job
2017-08-11 10:40:56,712 DEBUG - waiting for job cntk-parametricsweep-job to delete
2017-08-11 10:41:28,661 INFO - job cntk-parametricsweep-job does not exist
