Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# HyperParameter tuning


Let's get started. First let's import some Python libraries.

In [None]:
%matplotlib inline
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt

In [None]:
import azureml
from azureml.core import Workspace, Run

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

## Create an Azure ML experiment
Let's create an experiment and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [None]:
from azureml.core import Experiment

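script_folder = './keras-cifar10'
# create the local folder that will hold the training scripts
os.makedirs(script_folder, exist_ok=True)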

exp = Experiment(workspace=ws, name='keras-cifar10')

## Create Batch AI cluster as compute target
[Batch AI](https://docs.microsoft.com/en-us/azure/batch-ai/overview) is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Batch AI cluster in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.

## Intelligent hyperparameter tuning
We have trained the model with one set of hyperparameters. Now let's see how we can tune the hyperparameters by launching multiple runs on the cluster. First, let's define the parameter space using random sampling.

In [None]:
from azureml.train.hyperdrive import (RandomParameterSampling, BanditPolicy,
                                      HyperDriveRunConfig, PrimaryMetricGoal,
                                      choice)

ps = RandomParameterSampling(
    {
        '--batch-size': choice(25, 50, 100),
        '--decay': choice(1e-7, 1e-6, 1e-5),
        '--learning-rate': choice(1e-5, 1e-4, 1e-3)
    }
)
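
To make the idea of random sampling concrete, here is a minimal pure-Python sketch that does not use the Azure ML SDK. It assumes each `choice` is drawn uniformly and independently for every run; `search_space` and `sample_configuration` are hypothetical names introduced only for this illustration.

```python
import random

# Hypothetical stand-in for the search space defined above:
# each hyperparameter maps to its list of discrete choices.
search_space = {
    '--batch-size': [25, 50, 100],
    '--decay': [1e-7, 1e-6, 1e-5],
    '--learning-rate': [1e-5, 1e-4, 1e-3],
}

def sample_configuration(space, rng):
    """Draw one value per hyperparameter (assumed uniform and independent)."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(search_space, rng) for _ in range(20)]

# Size of the full grid, for comparison with the number of samples drawn.
grid_size = 1
for values in search_space.values():
    grid_size *= len(values)

print(grid_size)  # 27
for config in samples[:3]:
    print(config)
```

The full grid has 27 combinations, so drawing 20 random samples covers only part of it, and some combinations may be drawn more than once.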

Next, we will create a new estimator without the above parameters, since they will be passed in later by the sampler. Note that we still need to keep the `--data-folder` parameter, since it is not a hyperparameter we will sweep.

In [None]:
from azureml.train.dnn import TensorFlow

# compute_target is the cluster created in the compute target step above
est = TensorFlow(source_directory=script_folder,
                 script_params={'--data-folder': ws.get_default_datastore().as_mount(), '--epochs': 25},
                 compute_target=compute_target,
                 entry_script='cifar10_azureml.py',
                 conda_packages=['keras', 'h5py'],
                 use_gpu=True)

Now we will define an early termination policy. The `BanditPolicy` below is applied at every second evaluation interval, that is, every second time the primary metric (defined later) is reported. With `slack_factor=0.1`, Azure ML terminates any run whose best metric falls outside the allowed slack relative to the best-performing run. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.

In [None]:
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
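
As a rough illustration of the slack factor, the sketch below assumes the rule described in the Azure ML documentation for a maximized metric: a run is terminated when its best metric is below `best / (1 + slack_factor)`. The function name and the accuracy values are hypothetical, and this is not the service's actual implementation.

```python
def bandit_should_terminate(run_best, overall_best, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal).

    Assumed rule: terminate when run_best < overall_best / (1 + slack_factor).
    """
    threshold = overall_best / (1.0 + slack_factor)
    return run_best < threshold

# Hypothetical best validation accuracies reported so far by four runs.
reported = {'run_a': 0.80, 'run_b': 0.74, 'run_c': 0.70, 'run_d': 0.55}
best = max(reported.values())

for name, metric in reported.items():
    verdict = 'terminate' if bandit_should_terminate(metric, best, 0.1) else 'keep'
    print(name, verdict)
```

With a best accuracy of 0.80 and a slack factor of 0.1, the cut-off is roughly 0.727, so `run_a` and `run_b` continue while `run_c` and `run_d` are stopped.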

Now we are ready to configure a HyperDrive run configuration object and specify the primary metric `validation_acc` that is recorded in the training runs. If you revisit the training script, you will notice that this value is logged after every epoch (one full pass over the training data). We also tell the service that we want to maximize this value. Finally, we set the maximum total number of runs to 20 and the maximum number of concurrent runs to 4, which is the same as the number of nodes in our compute cluster.

In [None]:
htc = HyperDriveRunConfig(estimator=est, 
                          hyperparameter_sampling=ps, 
                          primary_metric_name='validation_acc', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=20,
                          max_concurrent_runs=4)

Finally, let's launch the hyperparameter tuning job.

In [None]:
htr = exp.submit(config=htc)

We can use a run history widget to show the progress. Be patient as this might take a while to complete.

In [None]:
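# older preview SDK releases exposed this widget under azureml.train.widgets
from azureml.widgets import RunDetails

RunDetails(htr).show()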

In [None]:
htr.wait_for_completion(show_output = True)

## Find and register best model
When all the runs have finished, we can find the one with the highest validation accuracy.

In [None]:
best_run = htr.get_best_run_by_primary_metric()
print(best_run)
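
Conceptually, picking the best run is a comparison over the primary metric. The sketch below is an assumption about how such a ranking could work, not the SDK's implementation: it ranks each run by the best value it logged. The run ids, the metric values, and `best_run_by_metric` are hypothetical.

```python
# Hypothetical per-epoch validation accuracies logged by three child runs.
run_metrics = {
    'run_1': [0.41, 0.55, 0.62],
    'run_2': [0.38, 0.60, 0.71],
    'run_3': [0.45, 0.58, 0.66],
}

def best_run_by_metric(metrics, maximize=True):
    """Rank each run by its best logged value and return the winner's id."""
    pick = max if maximize else min
    best_per_run = {run_id: pick(values) for run_id, values in metrics.items()}
    return pick(best_per_run, key=best_per_run.get)

print(best_run_by_metric(run_metrics))  # run_2
```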

Now let's list the model files uploaded during the run.

In [None]:
print(best_run.get_file_names())

We can then register the folder (and all files in it) as a model named `tf-dnn-mnist` under the workspace for deployment.

In [None]:
model = best_run.register_model(model_name='tf-dnn-mnist', model_path='outputs/model')