Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Hyperparameter Tuning of the Online Anomaly Detection algorithm

## Introduction

In the previous notebook, you learned to leverage the AML SDK features for Machine Learning experimentation to test the performance of our online solution for Anomaly Detection.  These tools allowed you to test the solution with different parameter settings.

In this lab, we are going to take it a step further and use Azure `HyperDrive` to do the hard work of finding the best parameters for us. 

Typically, HyperDrive is used to tune hyperparameters in machine learning algorithms, such as the regularization constant in a support vector machine or the number of hidden layers in a neural network.

However, HyperDrive was designed to be an extremely flexible architecture. You can combine it with any script that accepts hyperparameters as arguments and returns a number that you are trying to either minimize or maximize by finding the right settings for those hyperparameters. This is exactly what we are going to do here.
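
As a minimal sketch of that contract (not the lab's actual script), the snippet below parses the two hyperparameters swept later in this lab, `--window_size` and `--com`, and returns a single number. The `evaluate` function and its formula are hypothetical placeholders. In a real run the score would come from the anomaly detector and be logged through the AML SDK.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--window_size', type=int, default=500)
    parser.add_argument('--com', type=int, default=12)
    # parse_known_args ignores extra arguments such as --data-folder
    args, _unknown = parser.parse_known_args(argv)
    return args


def evaluate(window_size, com):
    """Hypothetical stand-in for the real scoring code; returns one number."""
    return 1.0 / (1.0 + abs(window_size - 1000) / 1000 + abs(com - 12) / 12)


if __name__ == '__main__':
    args = parse_hyperparameters()
    score = evaluate(args.window_size, args.com)
    # In an Azure ML run, this is where the metric would be logged.
    print('score:', score)
```

Any script with this shape (arguments in, one number out) could in principle be swept the same way.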

## Getting started

Let's get started. First let's import some Python libraries.

In [1]:
#%matplotlib inline

import numpy as np
import os

In [2]:
import azureml
from azureml.core import Workspace, Run

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.0.21


## Diagnostics
Opt-in diagnostics for better experience, quality, and security of future releases.

In [3]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

Turning diagnostics collection on. 


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [4]:
from azureml.core.workspace import Workspace

# config_path = '/dbfs/tmp/'

# If you are running this on Jupyter, you may want to run 
config_path = os.path.expanduser('~')

ws = Workspace.from_config(path=os.path.join(config_path, 'aml_config','config.json'))

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Found the config file in: C:\Users\sethmott\aml_config\config.json


If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: amlws
Azure region: eastus2
Subscription id: 5be49961-ea44-42ec-8021-b728be90d58c
Resource group: sethmottaml
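
For reference, here is a small sketch of what the `config.json` read by `Workspace.from_config()` typically contains, assuming the usual three-key layout. The values are placeholders, and the file is written to a temporary folder rather than to your real `aml_config` directory.

```python
import json
import os
import tempfile

# Placeholder values; replace them with the details of your own workspace.
config = {
    'subscription_id': '<your-subscription-id>',
    'resource_group': '<your-resource-group>',
    'workspace_name': '<your-workspace-name>',
}

# Use a temporary folder so the sketch does not touch a real configuration.
config_dir = os.path.join(tempfile.mkdtemp(), 'aml_config')
os.makedirs(config_dir, exist_ok=True)
config_file = os.path.join(config_dir, 'config.json')

with open(config_file, 'w') as f:
    json.dump(config, f, indent=4)

with open(config_file) as f:
    print(f.read())
```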


## Create an Azure ML experiment
Let's create an experiment named "ADMLExp" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

In [5]:
from azureml.core import Experiment

script_folder = 'scripts'

os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='ADMLExp')

## Download telemetry dataset
To test on the telemetry dataset, we first need to download it from the course's blob storage container and save the files in a local `data` folder.

In [7]:
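import os
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request

data_path = os.path.join(config_path, 'data')
os.makedirs(data_path, exist_ok=True)

# The file names are appended directly to this prefix, so it must end with a slash.
# This assumes the files sit in a `telemetry` folder inside the container.
container = 'https://coursematerial.blob.core.windows.net/data/telemetry/'

# download both files into the local data folder
for file_name in ['telemetry.csv', 'anoms.csv']:
    urllib.request.urlretrieve(container + file_name, filename=os.path.join(data_path, file_name))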

HTTPError: HTTP Error 404: The specified blob does not exist.
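
An `HTTP Error 404` like the one above usually means that the requested URL does not point to an existing blob. One way to make such problems easier to spot is to build each URL explicitly and print it before downloading. A minimal sketch, assuming the files sit in a `telemetry` folder inside the container (the `blob_url` helper is hypothetical, not part of any SDK):

```python
def blob_url(container_url, blob_name):
    """Join a container (or folder) URL and a blob name with exactly one slash."""
    return container_url.rstrip('/') + '/' + blob_name.lstrip('/')


container = 'https://coursematerial.blob.core.windows.net/data/telemetry'

for name in ('telemetry.csv', 'anoms.csv'):
    print(blob_url(container, name))
```

Pasting a printed URL into a browser is a quick check of whether the blob actually exists.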

## Upload dataset to default datastore 
A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored and then made accessible to a Run, either by mounting or by copying the data to the compute target. A datastore can be backed by either Azure Blob Storage or an Azure File Share (ADLS will be supported in the future). For simple data handling, each workspace provides a default datastore that can be used in case the data is not already in Blob Storage or a File Share.

In this next step, we will upload the downloaded data into the workspace's default datastore, which we will later mount on a Batch AI cluster for training.

In [None]:
ds = ws.get_default_datastore()
ds.upload(src_dir=data_path, target_path='telemetry', overwrite=True, show_progress=True)

## Create Batch AI cluster as compute target
[Batch AI](https://docs.microsoft.com/en-us/azure/batch-ai/overview) is a service for provisioning and managing clusters of Azure virtual machines for running machine learning workloads. Let's create a new Batch AI cluster in the current workspace, if it doesn't already exist. We will then run the training script on this compute target.

The cell below first looks for an existing cluster with the given name. If none is found, it creates a new cluster of `Standard_DS3_v2` CPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the Batch AI cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and produces only sparse output along the way. Please make sure to wait until the call returns before moving to the next cell.

In [None]:
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "ADPMAmlCompute"

try:
    # look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    if isinstance(compute_target, BatchAiCompute):
        print('Found existing compute target {}.'.format(cluster_name))
    else:
        print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = BatchAiCompute.provisioning_configuration(vm_size="Standard_DS3_v2",
                                                                #vm_priority='lowpriority', # optional
                                                                autoscale_enabled=True,
                                                                cluster_min_nodes=1, 
                                                                cluster_max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

## Download the execution script into the script folder

The execution script is already created for you. You can simply copy it into the script folder. You could also use the one from the previous lab, but let's play it safe.

In [None]:
# download the script file from the repo. 
urllib.request.urlretrieve(
    'https://raw.githubusercontent.com/Azure/LearnAI-ADPM/master/solutions/sample_run_AmlCompute.py', 
    filename=os.path.join(script_folder, 'sample_run_AmlCompute.py'))

Make sure the execution script looks correct.

In [None]:
with open(os.path.join(script_folder,'sample_run_AmlCompute.py'), 'r') as f:
    print(f.read())

## Configure Estimator and policy for hyperparameter tuning

We have trained the model with one set of hyperparameters. Now let's see how we can do hyperparameter tuning by launching multiple runs on the cluster. First, let's define the parameter space using random sampling.

In [None]:
from azureml.train.hyperdrive import *

ps = RandomParameterSampling(
    {
        '--window_size': choice(100, 500, 1000, 2000, 5000),
        '--com': choice(4, 6, 12, 24)
    }
)
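
To make the idea of random sampling over `choice` values concrete, here is a plain-Python sketch that mimics it. It does not use HyperDrive. `sample_configuration` is a hypothetical helper, and the assumption is that each hyperparameter is drawn uniformly and independently.

```python
import itertools
import random

search_space = {
    '--window_size': [100, 500, 1000, 2000, 5000],
    '--com': [4, 6, 12, 24],
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):
    print(sample_configuration(search_space, rng))

# Count the distinct combinations in this discrete space (5 values times 4 values).
n_combinations = len(list(itertools.product(*search_space.values())))
print(n_combinations)
```

In this sketch the draws are made with replacement, so the same combination can come up more than once.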

Next, we will create a new estimator without the above parameters since they will be passed in later. Note that we still need to keep the `--data-folder` parameter since that's not a hyperparameter we will sweep.

## Create an AzureML training Estimator

Next, we construct an `azureml.train.Estimator` estimator object, use the Batch AI cluster as compute target, and pass the mount-point of the datastore to the training code as a parameter.

The estimator provides a simple way of launching a custom job on a compute target. It automatically provides a Docker image. If additional pip or conda packages are required, their names can be passed in via the `pip_packages` and `conda_packages` arguments, and they will be included in the resulting Docker image.

In our case, we will need to install the following `pip_packages`: `numpy`, `pandas`, `scikit-learn`, `pyculiarity`.

Unlike in the previous lab, we do not provide hyperparameters as `script_params` to the Estimator, because they will be set by `HyperDrive`.

In [None]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ws.get_default_datastore().as_mount(),
# We are not using the following parameters, because they will be set by HyperDrive
#     '--window_size': 500,
#     '--com': 12
}

est = Estimator(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 entry_script='sample_run_AmlCompute.py',
                 pip_packages=['numpy','pandas','scikit-learn','pyculiarity'])

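Now we will define an early termination policy. With `evaluation_interval=2`, the `BanditPolicy` is applied every second time the primary metric (defined later) is reported. With `slack_factor=0.1`, Azure ML terminates any run whose best metric falls short of the best-performing run by more than the allowed slack. Setting `delay_evaluation=250` postpones the first policy evaluation by that many intervals. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.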

In [None]:
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1, delay_evaluation=250)
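
The slack rule can be sketched in a few lines of plain Python. This is an illustration under the assumption that the primary metric is maximized and that a run is cut when its metric drops below `best / (1 + slack_factor)`. `should_terminate` is a hypothetical helper, not an SDK function, and the metric values are made up.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack of the best run.

    Assumes the primary metric is being maximized.
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


best = 0.90
for metric in (0.85, 0.80):
    print(metric, should_terminate(metric, best, slack_factor=0.1))
```

With a best metric of 0.90 and a slack factor of 0.1, the threshold is roughly 0.818, so a run at 0.85 continues while a run at 0.80 is stopped.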

Now we are ready to configure the HyperDrive run and specify the primary metric `fbeta_score`, which must match the name under which the training script logs the metric. We also tell the service that we are looking to maximize this value. Finally, we cap the total number of runs at 30 and the number of concurrent runs at 4, which is the same as the maximum number of nodes in our compute cluster.

In [None]:
htc = HyperDriveRunConfig(estimator=est, 
                          hyperparameter_sampling=ps, 
                          primary_metric_name='fbeta_score', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          policy=policy,
                          max_total_runs=30,
                          max_concurrent_runs=4)

Finally, let's launch the hyperparameter tuning job.

In [None]:
htr = exp.submit(config=htc)

In [None]:
htr

If you want, you can wait for completion, by uncommenting the next cell. But you can also skip the cell and look at preliminary results. You will still have to wait for about a minute before the first results show up.

In [None]:
# htr.wait_for_completion(show_output = True)

## Find and register best model
When all the jobs finish, we can find the one that has the highest `fbeta_score`.

**Note**: If you get a `TrainingException` or a `KeyError` below, you probably just have to wait until the first training run is completed.

In [None]:
best_run = htr.get_best_run_by_primary_metric()
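
Conceptually, picking the best run amounts to taking the maximum over the primary metric. The sketch below shows that idea on a made-up list of results. The numbers are invented for illustration only and are not output from this lab.

```python
# Made-up results for illustration: one dictionary per child run.
runs = [
    {'window_size': 100, 'com': 4, 'fbeta_score': 0.61},
    {'window_size': 1000, 'com': 12, 'fbeta_score': 0.78},
    {'window_size': 5000, 'com': 24, 'fbeta_score': 0.70},
]

# The goal is MAXIMIZE, so the best run is the one with the largest metric.
best = max(runs, key=lambda run: run['fbeta_score'])
print(best)
```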

### Hands-on lab

Go to the Azure portal and explore how HyperDrive logs run metrics there.

### End lab

Now let's look at the arguments that were used for the best run and at its final metric value.

In [None]:
run_details = best_run.get_details()

print("arguments of best run: %s" % (run_details['runDefinition']['Arguments']))
best_run.get_metrics()['final_fbeta_score']

In [None]:
run_details
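
The arguments of a run are easier to read as a dictionary. A small sketch, assuming the `Arguments` entry is a flat list that alternates flags and values (the example list is made up, the mount path is a placeholder, and `arguments_to_dict` is a hypothetical helper):

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--flag', 'value', ...] list into {'--flag': 'value'}."""
    if len(arguments) % 2 != 0:
        raise ValueError('expected alternating flags and values')
    return dict(zip(arguments[0::2], arguments[1::2]))


# Made-up example of what such a list might look like.
example_arguments = ['--data-folder', '<mount-path>', '--window_size', '500', '--com', '12']
print(arguments_to_dict(example_arguments))
```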

### Hands-on lab

Use Python's `help(azureml.train.hyperdrive)` and explore the documentation for HyperDrive.

Above, we used a BanditPolicy. Try to fully understand what the parameters of the policy are. 

Try to pick one other policy and see whether you can replace the BanditPolicy above and run a new HyperDrive job.

### End Lab

## Clean up

We can also delete the compute cluster. But remember that if you set the `cluster_min_nodes` value to 0 when you created the cluster, all nodes are released automatically once the jobs are finished, so an idle cluster incurs no compute cost and you don't have to delete it. In this lab `cluster_min_nodes` was set to 1, so one node stays allocated until the cluster is deleted. The next time you submit jobs to it, the cluster will automatically grow up to `cluster_max_nodes`, which is set to 4.

In [None]:
# delete the cluster if you need to.
compute_target.delete()