# Hyperparameter Search
Perform a hyperparameter search.

The steps are:
- [import libraries and dotenv parameters](#import),
- [create a Batch AI client](#client),
- [create the Batch AI job configuration parameters](#parameters),
- [generate combinations of hyperparameter values](#combinations),
- [generate a job for each combination](#jobs),
- [run the jobs on Batch AI](#run),
- [extract the performance of each combination](#extract),
- [identify the combination that had the best performance](#best), and
- [use this combination to build and save the best model](#build).

## Imports <a id='import'></a>

In [None]:
from __future__ import print_function
import os
import sys
import glob
import pandas as pd
import dotenv
import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService
sys.path.append('.')
import utilities as utils
from utilities.job_factory import ParameterSweep, NumericParameter, DiscreteParameter
%load_ext dotenv

In the next cell are the names of various files and services used or created in this notebook.

In [None]:
# The location of the dotenv file
dotenv_path = dotenv.find_dotenv()
# The mount point of the Azure file share in the Docker container
dotenv.set_key(dotenv_path, 'azure_file_share_mount_path', 'afs')
# The mount point of the Azure blob container in the Docker container
dotenv.set_key(dotenv_path, 'azure_blob_mount_path', 'bfs')
# The Batch AI experiment
dotenv.set_key(dotenv_path, 'experiment_name', 'hyperparameter_search_experiment')

Import the contents of the `.env` file into the environment.

In [None]:
%dotenv -o

Define Python variables used in this notebook.

In [None]:
configuration_path = os.getenv('configuration_path')
image_name = os.getenv('docker_login') + os.getenv('image_repo') + ':latest'
azure_blob_container_name = os.getenv('azure_blob_container_name')
dataset_path = os.getenv('dataset_path')
azure_file_share_name = os.getenv('azure_file_share_name')
script_path = os.getenv('script_path')
script_name = os.getenv('script_name')
cluster_name = os.getenv('cluster_name')
azure_blob_mount_path = os.getenv('azure_blob_mount_path')
azure_file_share_mount_path = os.getenv('azure_file_share_mount_path')
experiment_name = os.getenv('experiment_name')
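
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in `.env` only surfaces later, for instance as a `TypeError` when the strings for `image_name` are concatenated. The sketch below is not part of the original notebook; it shows one way to fail early with a clearer message. The helper name `require_env` and the example values are hypothetical.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise a clear error."""
    value = environ.get(name)
    if value is None:
        raise KeyError(
            'Environment variable {0!r} is not set; check the .env file'.format(name))
    return value


# Illustrative mapping only; in the notebook the values come from the .env file.
example_env = {'cluster_name': 'mycluster'}

cluster_name_example = require_env('cluster_name', example_env)
print(cluster_name_example)

try:
    require_env('script_name', example_env)
    missing_detected = False
except KeyError as error:
    # The message names the variable that needs to be added to .env.
    missing_detected = True
    print(error)
```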

## Create a Batch AI client <a id='client'></a>
Read the configuration, and use it to create a Batch AI client.

In [None]:
cfg = utils.config.Configuration(configuration_path)
client = utils.config.create_batchai_client(cfg)

## Define the parameters common to all jobs  <a id='parameters'></a>
Specify the Docker image used to create the Docker containers that run the experiment's jobs.

In [None]:
container_settings = models.ContainerSettings(
    image_source_registry=models.ImageSourceRegistry(image=image_name)
)

Define the volumes to be mounted and their mount points in each Docker container's file system. These give the containers access to the datasets and the training script, and a location to store the jobs' outputs.

In [None]:
mount_volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share_mount_path)
    ],
    azure_blob_file_systems=[
        models.AzureBlobFileSystemReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            container_name=azure_blob_container_name,
            relative_mount_path=azure_blob_mount_path)
    ]
)

Define the locations in a container's file system for
- storing the job's standard output and error,
- obtaining the datasets, and
- storing the job's outputs.

In [None]:
std_out_err_path_prefix = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path)

input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(azure_blob_mount_path, dataset_path))
]

output_directories = [
    models.OutputDirectory(
        id='ALL',
        path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path))
]

Define the path to the training script.

In [None]:
python_script_file_path = '$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}/{2}'.format(
    azure_file_share_mount_path, script_path, script_name)
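
To make the path templates concrete, the sketch below expands them with the mount points set earlier in this notebook (`afs` and `bfs`). The values `data` and `scripts` are placeholders, and the script name is assumed to be `TrainTestClassifier.py`; the real values come from `dataset_path`, `script_path` and `script_name` in `.env`. The `$AZ_BATCHAI_JOB_MOUNT_ROOT` variable is left unexpanded here because it is only defined where the job runs.

```python
# Mount points as set with dotenv.set_key earlier in the notebook.
azure_file_share_mount_path = 'afs'
azure_blob_mount_path = 'bfs'

# Placeholder values (assumptions); the real ones are read from .env.
dataset_path = 'data'
script_path = 'scripts'
script_name = 'TrainTestClassifier.py'

mount_root = '$AZ_BATCHAI_JOB_MOUNT_ROOT'

# Datasets are read from the blob container mount.
input_path = '{0}/{1}/{2}'.format(mount_root, azure_blob_mount_path, dataset_path)
# Standard output, standard error and job outputs go to the file share mount.
output_prefix = '{0}/{1}'.format(mount_root, azure_file_share_mount_path)
# The training script is also read from the file share mount.
script_file = '{0}/{1}/{2}/{3}'.format(
    mount_root, azure_file_share_mount_path, script_path, script_name)

for label, path in [('input', input_path),
                    ('output prefix', output_prefix),
                    ('script', script_file)]:
    print('{0}: {1}'.format(label, path))
```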

## Generate the combinations of hyperparameters <a id='combinations'></a>
Define specifications for the hyperparameters, and use them to create a parameter substitution object. We choose a single value for the number of estimators that is enough to let us reliably identify the best of the parameter configurations. Once we have the best combination, we will build a model using a larger number of estimators to boost the performance.

In [None]:
param_specs = [
    DiscreteParameter(
        parameter_name="ESTIMATORS",
        values=[1000]
    ),
    DiscreteParameter(
        parameter_name="NGRAMS",
        values=[1, 2, 3, 4]
    ),
    DiscreteParameter(
        parameter_name="MATCH",
        values=[10, 20, 30, 40]
    ),
    DiscreteParameter(
        parameter_name="MIN_CHILD_SAMPLES",
        values=[5, 10, 20]
    ),
    DiscreteParameter(
        parameter_name="WEIGHT",
        values=["", "--unweighted"]
    ),
]

parameters = ParameterSweep(param_specs)
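
As a quick check on the size of the search, the sketch below enumerates the grid with `itertools.product`. It assumes the sweep takes the full Cartesian product of the discrete values listed above; it does not use `ParameterSweep` itself.

```python
import itertools

# The value lists from param_specs above, keyed by parameter name.
value_lists = {
    'ESTIMATORS': [1000],
    'NGRAMS': [1, 2, 3, 4],
    'MATCH': [10, 20, 30, 40],
    'MIN_CHILD_SAMPLES': [5, 10, 20],
    'WEIGHT': ['', '--unweighted'],
}

names = list(value_lists)
# One dict per combination, pairing each name with one of its values.
combinations = [dict(zip(names, values))
                for values in itertools.product(*(value_lists[n] for n in names))]

print('{:,} combinations'.format(len(combinations)))
```

Under that assumption there are 1 x 4 x 4 x 3 x 2 = 96 combinations, so the search would submit 96 jobs.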

We define the command line arguments that will be passed to the training script. We will use the parameter substitution object to specify where we would like to substitute the values of the hyperparameters in the command line. Note that `parameters` is used like a dict, with the `parameter_name` being used as the key to specify which parameter to substitute. When `parameters.generate_jobs` is called below, the `parameters[name]` variables will be replaced with actual values.

In [None]:
command_line_args = '--inputs $AZ_BATCHAI_INPUT_SCRIPT --outputs $AZ_BATCHAI_OUTPUT_ALL'\
    ' --estimators {estimators}'\
    ' --ngrams {ngrams}'\
    ' --match {match}'\
    ' --min_child_samples {min_child_samples}'\
    ' {weight}'.format(
    estimators=parameters['ESTIMATORS'],
    ngrams=parameters['NGRAMS'],
    match=parameters['MATCH'],
    min_child_samples=parameters['MIN_CHILD_SAMPLES'],
    weight=parameters['WEIGHT'])
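
How the substitution is carried out is internal to `utilities.job_factory`. As a rough mental model only, the sketch below uses a hypothetical stand-in class, `PlaceholderSweep`, whose items are shell-style placeholders such as `$PARAM_NGRAMS`. The `PARAM_` prefix is chosen to match the environment variable names read back later in this notebook. This is not the library's implementation.

```python
from string import Template


class PlaceholderSweep(object):
    """Hypothetical stand-in for ParameterSweep; not the real implementation."""

    def __init__(self, names):
        self.names = list(names)

    def __getitem__(self, name):
        if name not in self.names:
            raise KeyError(name)
        # Indexing yields a placeholder, not a value.
        return '$PARAM_{0}'.format(name)

    def fill(self, command_line, values):
        """Replace each placeholder with the value chosen for one job."""
        mapping = {'PARAM_{0}'.format(n): values[n] for n in self.names}
        # safe_substitute leaves other $VARIABLES in the command line untouched.
        return Template(command_line).safe_substitute(mapping)


sweep = PlaceholderSweep(['NGRAMS', 'MATCH'])
template = '--inputs $AZ_BATCHAI_INPUT_SCRIPT --ngrams {0} --match {1}'.format(
    sweep['NGRAMS'], sweep['MATCH'])
filled = sweep.fill(template, {'NGRAMS': 2, 'MATCH': 30})

print(template)
print(filled)
```

The point of the design is that the command line is written once with placeholders, and each generated job receives its own concrete values.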

We put the script path and command line arguments together in a custom toolkit settings structure.

In [None]:
custom_toolkit_settings = models.CustomToolkitSettings(
    command_line=' '.join(['python', python_script_file_path, command_line_args])
)
print(custom_toolkit_settings.command_line)

## Generate the jobs for each combination <a id='jobs'></a>
Retrieve the cluster information.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)

Use the information from above to create the job creation parameters (`JobCreateParameters`) shared by all jobs.

In [None]:
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    std_out_err_path_prefix=std_out_err_path_prefix,
    input_directories=input_directories,
    output_directories=output_directories,
    mount_volumes=mount_volumes,
    container_settings=container_settings,
    custom_toolkit_settings=custom_toolkit_settings
    )

Generate a list of jobs to submit, one for each combination of the parameters.

In [None]:
jobs_to_submit, param_combinations = parameters.generate_jobs(jcp)
print('{:,} configurations.'.format(len(param_combinations)))
print('The command line of the first job is\n{}'.format(jobs_to_submit[0].custom_toolkit_settings.command_line))

## Run the jobs in an experiment <a id='run'></a>
Create a new experiment called `hyperparameter_search_experiment`, and create a job submitter to add jobs to the experiment's job queue.

In [None]:
experiment = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)

Submit the jobs to the experiment.

In [None]:
submitted_jobs = experiment_utils.submit_jobs(jobs_to_submit, 'hyperparam_job2').result()

Wait for the experiment to complete. This should take about an hour and a half.

You can interrupt this cell before all of the experiment's jobs have completed, and later run it again as needed.

In [None]:
%%time
_ = experiment_utils.wait_all_jobs()

To interrupt the experiment before it's complete, you can delete all its queued and running jobs.

In [None]:
if False:
    experiment_utils.delete_jobs_in_experiment(execution_state=models.ExecutionState.queued)
    experiment_utils.delete_jobs_in_experiment(execution_state=models.ExecutionState.running)

## Extract the performance results  <a id='extract'></a>
Get a list of successful jobs in the experiment.

In [None]:
jobs = [job
        for job in client.jobs.list_by_experiment(cfg.resource_group, cfg.workspace, experiment_name)
        if job.execution_state == models.ExecutionState.succeeded]

Define an extractor that pulls the desired metric from a job's log file. In this example, we extract the number between "`INFO:root:Accuracy @3 =`" and "`%`".

In [None]:
accuracy_extractor = utils.job.MetricExtractor(
    output_dir_id='ALL',
    logfile='TrainTestClassifier.log',
    regex=r'INFO:root:Accuracy @3 = (.*?)%')
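
To see what the pattern captures, the sketch below applies it with Python's `re` module to a made-up log line in the format described above. The pattern is written here as a raw string, and the sample line and its value are illustrative only. How `MetricExtractor` applies the pattern internally is not shown in this notebook.

```python
import re

# Capture the text between "Accuracy @3 = " and the following "%".
pattern = re.compile(r'INFO:root:Accuracy @3 = (.*?)%')

# A made-up log line (assumption), not taken from a real job.
sample_line = 'INFO:root:Accuracy @3 = 93.21%'

match = pattern.search(sample_line)
accuracy = float(match.group(1)) if match else None
print(accuracy)

# A line reporting a different metric does not match.
other = pattern.search('INFO:root:Accuracy @1 = 80.00%')
print(other)
```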

Get the metric values from the log files of the finished jobs.

In [None]:
%%time
job_accuracies = experiment_utils.get_metrics_for_jobs(jobs, accuracy_extractor)

## Find the best set of parameters <a id='best'></a>
Sort the metrics in decreasing order, and print a summary description of the values.

In [None]:
job_accuracies.sort(key=lambda r: r['metric_value'], reverse=True)

accuracies = pd.Series({result['job_name']: result['metric_value']
                        for result in job_accuracies},
                       name='Accuracy @3')
accuracies.describe().to_frame().transpose()

Print the best configuration.

In [None]:
best_params = {ev.name[len('PARAM_'):]:ev.value for ev in job_accuracies[0]['job'].environment_variables}
print("Best job: {0} with parameters:".format(job_accuracies[0]['job_name']))
pd.Series(best_params, name='Value').to_frame()
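
The selection logic can be tried without a cluster. The sketch below uses `namedtuple` stand-ins for the job objects and invented metric values, so every name and number in it is an assumption for illustration. Unlike the cell above, it also skips any environment variable whose name does not start with `PARAM_`, which avoids mangled keys if a job carries other variables.

```python
from collections import namedtuple

# Minimal stand-ins for the job objects (assumptions, not the SDK classes).
EnvVar = namedtuple('EnvVar', ['name', 'value'])
Job = namedtuple('Job', ['environment_variables'])

# Invented results in the same shape as job_accuracies.
example_accuracies = [
    {'job_name': 'job_a', 'metric_value': 91.5,
     'job': Job([EnvVar('PARAM_NGRAMS', '2'), EnvVar('PARAM_MATCH', '20')])},
    {'job_name': 'job_b', 'metric_value': 94.0,
     'job': Job([EnvVar('PARAM_NGRAMS', '3'), EnvVar('PARAM_MATCH', '30'),
                 EnvVar('OTHER_SETTING', 'x')])},
]

# Highest metric first, so the best job is at index 0.
example_accuracies.sort(key=lambda r: r['metric_value'], reverse=True)
best = example_accuracies[0]

prefix = 'PARAM_'
example_best_params = {ev.name[len(prefix):]: ev.value
                       for ev in best['job'].environment_variables
                       if ev.name.startswith(prefix)}

print(best['job_name'], example_best_params)
```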

## Use them to build the best model <a id='build'></a>
Define variables that hold the best combination of parameters and the number of estimators to use. Typically, increasing the number of estimators improves performance, so the final model uses eight times the number of estimators used in the search.

In [None]:
best_estimators = 8 * int(best_params['ESTIMATORS'])
best_min_child_samples = best_params['MIN_CHILD_SAMPLES']
best_match = best_params['MATCH']
best_ngrams = best_params['NGRAMS']
best_weight = best_params['WEIGHT']

Run the training script with the best parameters, and save the model. This should take anywhere from ten minutes to an hour and a half depending on the size of the features determined by the hyperparameters, in particular `match` and `ngrams` (larger is longer).

In [None]:
%run -t TrainTestClassifier.py\
    --save\
    --estimators $best_estimators\
    --match $best_match\
    --ngrams $best_ngrams\
    --min_child_samples $best_min_child_samples\
    $best_weight
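
If the final training run needs to happen outside IPython, the same arguments can be assembled as a list. The sketch below only builds and prints the list; the parameter values are invented for illustration, and it assumes the script accepts the flags exactly as in the `%run` cell above.

```python
import sys

# Illustrative values (assumptions); in the notebook they come from best_params.
example_params = {'ESTIMATORS': '1000', 'NGRAMS': '3', 'MATCH': '30',
                  'MIN_CHILD_SAMPLES': '10', 'WEIGHT': ''}

argv = [sys.executable, 'TrainTestClassifier.py', '--save',
        '--estimators', str(8 * int(example_params['ESTIMATORS'])),
        '--match', str(example_params['MATCH']),
        '--ngrams', str(example_params['NGRAMS']),
        '--min_child_samples', str(example_params['MIN_CHILD_SAMPLES'])]

# WEIGHT is either '' or '--unweighted'; pass it only when it is non-empty,
# because an empty string would reach the script as an extra empty argument.
if example_params['WEIGHT']:
    argv.append(example_params['WEIGHT'])

print(' '.join(argv[1:]))
```

The resulting list could then be handed to `subprocess.check_call(argv)`.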

To tear down the experiment and all related resources, go to [the last notebook](06_Tear_Down.ipynb).