# IV - Parameter Sweep
In this notebook we will run a simple parameter sweep on our model. We will then collect the results of the sweep and use them to download the best performing model from blob storage.

* [Setup](#section1)
* [Configure job](#section2)
* [Submit job](#section3)
* [Check results](#section4)
* [Download best model](#section5)
* [Delete job](#section6)

<a id='section1'></a>

## Setup

Create a simple alias for Batch Shipyard

In [None]:
%alias shipyard SHIPYARD_CONFIGDIR=config python $HOME/batch-shipyard/shipyard.py %l

Check that everything is working

In [None]:
shipyard

Read in the account information we saved earlier

In [None]:
import json

def read_json(filename):
    with open(filename, 'r') as infile:
        return json.load(infile)
    
account_info = read_json('account_information.json')

storage_account_key = account_info['storage_account_key']
storage_account_name = account_info['storage_account_name']

<a id='section2'></a>

## Configure Job

As with the previous job, which ran on a single node, we will run this job on GPU-enabled nodes. The difference is that we create one task per parameter combination. Each task passes a different set of parameters to our model training script. These parameters affect the training of the model and, in the end, its performance. The model and the results of its evaluation are recorded and stored on the node. At the end of the task, the results are copied into the specified storage container.

In [None]:
from copy import copy
import os
import random
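try:
    from sklearn.model_selection import ParameterGrid  # scikit-learn 0.18 and later
except ImportError:
    from sklearn.grid_search import ParameterGrid  # older scikit-learn releases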
from toolz import curry, pipe

def write_json_to_file(json_dict, filename):
    """ Simple function to write JSON dictionaries to files
    """
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)

CNTK_TRAIN_DATA_FILE = 'Train_cntk_text.txt'
CNTK_TEST_DATA_FILE = 'Test_cntk_text.txt'
URL_FMT = 'https://{}.blob.core.windows.net/{}/{}'

def select_random_data_storage_container():
    """Randomly select a storage account and container for CNTK train/test data.
    This is specific for the workshop to help distribute attendee load. This
    function will only work on Python2"""
    ss = random.randint(0, 4)
    cs = random.randint(0, 4)
    sa = '{}{}bigai'.format(ss, chr(ord('z') - ss))
    cont = '{}{}{}'.format(cs, chr(ord('i') - cs * 2), chr(ord('j') - cs * 2))
    return sa, cont

def create_resource_file_list():
    sa, cont = select_random_data_storage_container()
    ret = [{
        'file_path': CNTK_TRAIN_DATA_FILE,
        'blob_source': URL_FMT.format(sa, cont, CNTK_TRAIN_DATA_FILE)
    }]
    sa, cont = select_random_data_storage_container()
    ret.append({
        'file_path': CNTK_TEST_DATA_FILE,
        'blob_source': URL_FMT.format(sa, cont, CNTK_TEST_DATA_FILE)
    })
    return ret

def compose_command(num_convolution_layers, minibatch_size, max_epochs=30):
    cmd_str = ' '.join(("source /cntk/activate-cntk;",
                        "python -u /code/ConvNet_CIFAR10.py",
                        "--num_convolution_layers {num_convolution_layers}",
                        "--minibatch_size {minibatch_size}",
                        "--max_epochs {max_epochs}")).format(num_convolution_layers=num_convolution_layers,
                                                             minibatch_size=minibatch_size,
                                                             max_epochs=max_epochs)
    return 'bash -c "{}"'.format(cmd_str)

@curry
def append_parameter(param_name, param_value, data_dict):
    data_dict[param_name]=param_value
    return data_dict

def task_generator(parameters):
    for params in ParameterGrid(parameters):
        yield pipe(copy(_TASK_TEMPLATE),
                   append_parameter('command', compose_command(**params)),
                   append_parameter('resource_files', create_resource_file_list()))
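To make the relationship between the parameter grid and the generated tasks concrete, here is a minimal standalone sketch. It uses `itertools.product` as a stand-in for `ParameterGrid` so that it runs without scikit-learn; the names `expand_grid`, `compose_command_sketch` and `example_parameters` are ours and are not part of the workshop code. The ordering of combinations produced here is an assumption and may not match the order `ParameterGrid` uses in your scikit-learn version.

```python
from itertools import product


def expand_grid(parameters):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(parameters)  # fixed key order so the output is reproducible
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


def compose_command_sketch(num_convolution_layers, minibatch_size, max_epochs=30):
    # Same command layout as compose_command in the cell above.
    cmd = ("source /cntk/activate-cntk; "
           "python -u /code/ConvNet_CIFAR10.py "
           "--num_convolution_layers {} --minibatch_size {} --max_epochs {}"
           ).format(num_convolution_layers, minibatch_size, max_epochs)
    return 'bash -c "{}"'.format(cmd)


example_parameters = {
    "num_convolution_layers": [2, 3],
    "minibatch_size": [32, 64],
}

# One command (and therefore one task) per combination: 2 x 2 = 4.
commands = [compose_command_sketch(**params)
            for params in expand_grid(example_parameters)]
for command in commands:
    print(command)
```

Adding a third value to either list would grow the sweep to 6 tasks, and adding a third value to both would grow it to 9, which is why the search space is kept small for the workshop.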

Generate the `jobs.json` configuration file

In [None]:
OUTPUT_STORAGE_ALIAS = "mystorageaccount"

IMAGE_NAME = "masalvar/cntkcifar" # Custom CNTK image

_TASK_TEMPLATE = {
    "image": IMAGE_NAME,
    "remove_container_after_exit": True,
    "gpu": True,
    "output_data": {
        "azure_storage": [
            {
                "storage_account_settings": OUTPUT_STORAGE_ALIAS,
                "container": "output",
                "source": "$AZ_BATCH_TASK_DIR/wd/Models"
            },
        ]
    },
}

For the purposes of the workshop, we are constraining the parameter search space to just 3 final combinations. Four will be generated, but we will remove the first one so that we only need to wait for "one round" of processing. In other words, since we have 3 total compute nodes in the pool and 3 tasks, all of the tasks fit into a single round.
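The "one round" reasoning can be checked with a little arithmetic. The sketch below is a rough estimate only: it assumes each node runs one task at a time (the `tasks_per_node` parameter is our own illustrative knob) and that all tasks take about the same time.

```python
import math


def rounds_needed(num_tasks, num_nodes, tasks_per_node=1):
    """Estimate how many sequential waves of tasks the pool has to process."""
    slots = num_nodes * tasks_per_node
    return int(math.ceil(num_tasks / float(slots)))


print(rounds_needed(4, 3))  # all four combinations on three nodes
print(rounds_needed(3, 3))  # after dropping one combination
```

With all four combinations, the fourth task would have to wait for a node to free up, roughly doubling the wall-clock time of the sweep.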

In [None]:
parameters = {
    "num_convolution_layers": [2, 3],  # this could be expanded to [2, 3, 4] for example
    "minibatch_size": [32, 64]         # this could be expanded to [32, 64, 128] for example
}

In [None]:
JOB_ID = 'cntk-parametricsweep-job'

jobs = {
    "job_specifications": [
        {
            "id": JOB_ID,
            "tasks": list(task_generator(parameters))    
        }
    ]
}

# for purposes of expediency in the workshop, we'll remove one of the tasks to
# make 3 total to match the number of compute nodes in our pool
del jobs['job_specifications'][0]['tasks'][0]

print('number of tasks for parametric sweep {}: {}'.format(JOB_ID, len(jobs['job_specifications'][0]['tasks'])))

In [None]:
write_json_to_file(jobs, os.path.join('config', 'jobs.json'))
print(json.dumps(jobs, indent=4, sort_keys=True))

<a id='section3'></a>

## Submit job
Check that everything is OK with our pool before we submit our job.

In [None]:
shipyard pool listnodes

Now that we have confirmed everything is working we can execute our job using the command below. 

In [None]:
shipyard jobs add

Using the command below we can check the status of our job. We can only continue with the notebook once every task has an exit code, so please re-run the cell below periodically until all tasks show the completed state with an exit code. Continuing before all tasks in the job have finished training will cause subsequent cells to fail.

You can also view the **heatmap** of this pool on [Azure Portal](https://portal.azure.com) to monitor the progress of this job on the compute nodes under your Batch account.

In [None]:
shipyard jobs listtasks --jobid $JOB_ID
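If you would rather not re-run the cell by hand, the waiting logic can be sketched as a small polling helper. This is only an illustration of the pattern: `wait_until` and `all_tasks_finished` are hypothetical helpers of ours, and the task records (dicts with `state` and `exit_code` keys) are an assumed shape. How you obtain those records from Batch Shipyard is deliberately left out, since it is not covered in this notebook.

```python
import time


def wait_until(predicate, interval_seconds=30, timeout_seconds=3600,
               sleep=time.sleep):
    """Call predicate() repeatedly until it returns True or the timeout elapses.

    Returns True if the predicate succeeded, False if we gave up.
    """
    waited = 0
    while True:
        if predicate():
            return True
        if waited >= timeout_seconds:
            return False
        sleep(interval_seconds)
        waited += interval_seconds


def all_tasks_finished(tasks):
    # Hypothetical record shape: a task is finished once it is 'completed'
    # and has an exit code.
    return all(task.get('state') == 'completed' and
               task.get('exit_code') is not None
               for task in tasks)


# Illustrative records only, not real output from the job.
example_tasks = [
    {'state': 'completed', 'exit_code': 0},
    {'state': 'running', 'exit_code': None},
]
print(all_tasks_finished(example_tasks))
```

In practice, `predicate` would wrap whatever call you use to fetch the current task states and pass them to `all_tasks_finished`.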

<a id='section4'></a>

## Check results
The results of our parameter search should now be saved to our output container.

**Note:** You will encounter errors if you did not wait for all tasks to complete with an exit code in the previous cell.

First let's alias `blobxfer` to aid in downloading our blobs. We will aggregate our results in the `MODELS_DIR`.

In [None]:
%alias blobxfer python -m blobxfer

MODELS_DIR = 'psmodels'

In [None]:
blobxfer $storage_account_name output $MODELS_DIR --remoteresource . --download --include "*_$JOB_ID/model_results.json" --storageaccountkey $storage_account_key

Now we will combine all of the `model_results.json` files into one dictionary for analysis.

In [None]:
def scandir(basedir):
    for root, dirs, files in os.walk(basedir):
        for f in files:
            yield os.path.join(root, f) 

results_dict = {}
for model in scandir(MODELS_DIR):
    if not model.endswith('.json'):
        continue
    key = model.split(os.sep)[1]
    results_dict[key] = read_json(model)
    
print(json.dumps(results_dict, indent=4, sort_keys=True))

From the aggregated results dictionary, we select the one with the smallest error:

In [None]:
# .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
tuple_min_error = min(results_dict.items(), key=lambda x: x[1]['test_metric'])
configuration_with_min_error = tuple_min_error[0]
print('task with smallest error: {} ({})'.format(configuration_with_min_error, tuple_min_error[1]['test_metric']))
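It can also be useful to see how all of the configurations compare, not just the winner. The sketch below ranks a results dictionary by `test_metric`, treating it as an error where lower is better, as above. The dictionary keys and metric values are made up for illustration and are not real results from the sweep.

```python
# Hypothetical results in the same shape as results_dict:
# {directory name: contents of model_results.json}
example_results = {
    'task-a_cntk-parametricsweep-job': {'test_metric': 0.31},
    'task-b_cntk-parametricsweep-job': {'test_metric': 0.27},
    'task-c_cntk-parametricsweep-job': {'test_metric': 0.35},
}

# Sort ascending so the smallest error comes first.
ranked = sorted(example_results.items(),
                key=lambda item: item[1]['test_metric'])

for rank, (name, result) in enumerate(ranked, start=1):
    print('{}. {} ({})'.format(rank, name, result['test_metric']))

best_name = ranked[0][0]
```

The first entry of `ranked` is the same configuration that `min` selects, so replacing `example_results` with `results_dict` gives the full leaderboard for the sweep.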

<a id='section5'></a>

## Download best model
Now we'll download the corresponding best performing model.

In [None]:
MODEL_NAME = 'ConvNet_CIFAR10_model.dnn'
BEST_MODEL_BLOB_NAME = '{}/{}'.format(configuration_with_min_error, MODEL_NAME)
print(BEST_MODEL_BLOB_NAME)

In [None]:
blobxfer $storage_account_name output $MODELS_DIR --remoteresource $BEST_MODEL_BLOB_NAME --download --storageaccountkey $storage_account_key

In [None]:
!mv $MODELS_DIR/$configuration_with_min_error/$MODEL_NAME $MODELS_DIR
!rm -rf $MODELS_DIR/*_$JOB_ID  # optionally remove all of the temporary result json directories/files
!ls -alF $MODELS_DIR

The best model file (`ConvNet_CIFAR10_model.dnn`) is now ready for use.

**Note:** We could have created a Batch task that did the model selection for us using task dependencies. The model selection task would be dependent upon all of the parametric sweep training tasks and would only run after those tasks complete successfully. The Batch task could then proceed with the logic above.

Please see the advanced notebook that shows how this is accomplished: [Automatic Model Selection from Parametric Sweep with Task Dependencies](06_Advanced_Auto_Model_Selection.ipynb)
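As a rough illustration of the dependency idea in the note above, the sketch below builds a task list in which a final selection task depends on every training task. Treat it as an assumption-laden outline: the `id` and `depends_on` task keys should be checked against the jobs configuration documentation for your Batch Shipyard version, and `add_selection_task`, the task ids and the selection command are hypothetical names of ours. The advanced notebook shows the approach the workshop actually uses.

```python
def add_selection_task(training_tasks, selection_command, image):
    """Return a new task list: training tasks with explicit ids, followed by
    one task that depends on all of them."""
    tasks = []
    for index, task in enumerate(training_tasks):
        named = dict(task)  # copy so the caller's dicts are left untouched
        named['id'] = 'train-{:03d}'.format(index)  # hypothetical id scheme
        tasks.append(named)
    tasks.append({
        'id': 'select-best',
        'image': image,
        'command': selection_command,
        # Assumed key: run only after all listed tasks complete successfully.
        'depends_on': [task['id'] for task in tasks],
    })
    return tasks


example_tasks = add_selection_task(
    [{'command': 'train A'}, {'command': 'train B'}, {'command': 'train C'}],
    selection_command='python /code/select_best_model.py',  # hypothetical script
    image='masalvar/cntkcifar',
)
print(example_tasks[-1]['depends_on'])
```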

<a id='section6'></a>

## Delete job

To delete the job use the command below. Just be aware that this will get rid of all the files created by the job and tasks.

In [None]:
shipyard jobs del -y --termtasks --wait

## Next Steps
You can proceed to the [Notebook: Clean Up](05_Clean_up.ipynb) if you are done for now, or proceed to one of the following additional Notebooks:
* [Notebook: Automatic Model Selection](06_Advanced_Auto_Model_Selection.ipynb)
* [Notebook: Tensorboard Visualization](07_Advanced_Tensorboard.ipynb) - note this requires running this notebook on your own machine
* [Notebook: Parallel and Distributed](08_Advanced_Parallel_and_Distributed.ipynb)
* [Notebook: Keras with TensorFlow](09_Keras_Single_GPU_Training_With_Tensorflow.ipynb)