Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## 07. HyperDrive with scikit-learn
- Create Batch AI cluster
- Train on a single node
- Set up Hyperdrive
- Parameter sweep with Hyperdrive on Batch AI cluster
- Monitor parameter sweep runs with run history widget
- Find best model

## Prerequisites
Make sure you have gone through the [00. Installation and Configuration](00.configuration.ipynb) notebook first if you haven't already.

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Initialize a workspace object from persisted configuration.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Create An Experiment
An **Experiment** is a logical container in an Azure ML Workspace. It hosts run records, which can include run metrics and output artifacts from your experiments.

In [None]:
from azureml.core import Experiment
experiment_name = 'hyperdrive-with-sklearn'
experiment = Experiment(workspace = ws, name = experiment_name)

Create a folder to store the training script.

In [None]:
import os
script_folder = './samples/hyperdrive-with-sklearn'
os.makedirs(script_folder, exist_ok = True)

## Provision New Cluster
Create a new Batch AI cluster using the following Python code.

**Note**: As with other Azure services, there are limits on certain resources (e.g. Batch AI cluster size) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget

# choose a name for your cluster
batchai_cluster_name = ws.name + "cpu"

found = False
# see if this compute target already exists in the workspace
for ct in ws.compute_targets():
    print(ct.name, ct.type)
    if (ct.name == batchai_cluster_name and ct.type == 'BatchAI'):
        found = True
        print('found compute target. just use it.')
        compute_target = ct
        break
        
if not found:
    print('creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                autoscale_enabled = True,
                                                                cluster_min_nodes = 1, 
                                                                cluster_max_nodes = 4)

    # create the cluster
    compute_target = ComputeTarget.create(ws,batchai_cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current BatchAI cluster status, use the 'status' property
    print(compute_target.status.serialize())

## Ridge Regression with scikit-learn
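The training script itself is only copied and printed below, not explained. Assuming it fits scikit-learn's `Ridge` model (as the section title suggests), the `alpha` value that is swept later in this notebook is the regularization strength in the standard ridge objective, where $X$ is the feature matrix, $y$ the targets, and $w$ the coefficient vector:

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2, \qquad \alpha \ge 0
$$

A larger `alpha` shrinks the coefficients more strongly; `alpha = 0` reduces to ordinary least squares.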

In [None]:
from shutil import copyfile
# copy the diabetes_sklearn.py file to the project folder
copyfile('./diabetes_sklearn.py', os.path.join(script_folder, 'diabetes_sklearn.py'))

In [None]:
# review the diabetes_sklearn.py file if you'd like
with open(os.path.join(script_folder, 'diabetes_sklearn.py'), 'r') as fin:
    print (fin.read())

## Create an estimator for the sklearn script
You can use an estimator pattern to run the script. 

In [None]:
from azureml.train.estimator import Estimator
script_params = {
    '--alpha': 0.1
}

sk_est = Estimator(source_directory = script_folder,
                   script_params = script_params,
                   compute_target = compute_target,
                   entry_script = 'diabetes_sklearn.py',
                   conda_packages = ['scikit-learn'])
                   #custom_docker_base_image = 'ninghai/azureml:0.3') # use a custom image here
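
The keys in `script_params` reach the entry script as command-line arguments. The contents of `diabetes_sklearn.py` are not reproduced in this notebook, so the following is only a minimal sketch of how such a script might read `--alpha`, assuming it uses `argparse`. The function name `parse_args` and the default value are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed on the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # '--alpha' matches the key used in script_params;
    # the default of 0.5 is an arbitrary placeholder, not a value from the real script
    parser.add_argument('--alpha', type=float, default=0.5,
                        help='regularization strength')
    return parser.parse_args(argv)


# Simulate the command line that would be built from script_params = {'--alpha': 0.1}
args = parse_args(['--alpha', '0.1'])
print('alpha =', args.alpha)
```

Passing `argv` explicitly makes the parser easy to try out in a notebook cell; when `argv` is `None`, `argparse` falls back to `sys.argv`.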

In [None]:
# start the job
from azureml.core.experiment import Experiment

run = experiment.submit(sk_est)

### View run details
**IMPORTANT**: Please use Chrome to open the URL shown below.

In [None]:
run

In [None]:
run.wait_for_completion(show_output = True)

In [None]:
from azureml.train.widgets import RunDetails

RunDetails(run).show()

You can also check the Batch AI cluster and job status using Azure CLI commands:

```shell
# check cluster status. You can see how many nodes are running.
$ az batchai cluster list

# check job status. You can see how many jobs are running
$ az batchai job list
```

## Now Try a Hyperdrive run
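
The cell below sweeps `alpha` by random sampling from a uniform distribution on the interval from 0.0 to 1.0. As a rough illustration of what that means, and not Hyperdrive's actual implementation, the same idea can be sketched with the standard library alone. The helper name `sample_alphas` and the `seed` argument are hypothetical.

```python
import random


def sample_alphas(n_runs, low=0.0, high=1.0, seed=None):
    """Draw one alpha per run, uniformly at random between low and high."""
    rng = random.Random(seed)  # a private generator keeps the draws reproducible
    return [rng.uniform(low, high) for _ in range(n_runs)]


# 20 draws, mirroring max_total_runs = 20 in the Hyperdrive configuration
alphas = sample_alphas(20, seed=0)
print(len(alphas), 'candidate alpha values, ranging from',
      round(min(alphas), 3), 'to', round(max(alphas), 3))
```

Unlike a grid search, random sampling does not revisit a fixed set of values, so every additional run explores a new point in the range.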

In [None]:
from azureml.train.hyperdrive import *

# parameter space to sweep over
ps = RandomParameterSampling(
    {
        "alpha": uniform(0.0, 1.0)
    }
)

# early termination policy
# check every 2 iterations and terminate any run whose primary metric (mse) falls
# outside the 10% slack range of the best recorded run so far.
etp = BanditPolicy(slack_factor = 0.1, evaluation_interval = 2)

# Hyperdrive run configuration
hrc = HyperDriveRunConfig(
    estimator = sk_est,
    hyperparameter_sampling = ps,
    policy = etp,
    # metric to watch (for early termination)
    primary_metric_name = 'mse',
    # goal: minimize the primary metric (lower mse is better)
    primary_metric_goal = PrimaryMetricGoal.MINIMIZE,
    max_total_runs = 20,
    max_concurrent_runs = 4,
)
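
To make the slack rule concrete, here is a minimal sketch of the comparison a bandit-style policy performs at each evaluation interval. It is an illustration under stated assumptions, not the service's code: the function `should_terminate` is hypothetical, metrics are assumed to be positive, and the threshold used for a minimization goal (`best * (1 + slack_factor)`) is assumed to mirror the `best / (1 + slack_factor)` rule used for maximization.

```python
def should_terminate(run_metric, best_metric, slack_factor, minimize=True):
    """Return True if a run falls outside the slack range of the best run.

    Assumes positive metric values. The minimize branch is an assumed
    mirror image of the maximize rule, not a documented formula.
    """
    if minimize:
        # lower is better: allow up to slack_factor above the best value
        return run_metric > best_metric * (1 + slack_factor)
    # higher is better: allow down to best / (1 + slack_factor)
    return run_metric < best_metric / (1 + slack_factor)


# Hypothetical MSE values: with a best MSE of 3000 and slack_factor = 0.1,
# the cut-off is roughly 3300.
print(should_terminate(3200.0, 3000.0, 0.1))  # inside the slack range
print(should_terminate(3400.0, 3000.0, 0.1))  # outside the slack range
```

A smaller `slack_factor` stops runs more aggressively, which saves compute at the risk of ending runs that would have improved later.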

In [None]:
# Start Hyperdrive run

hr = experiment.submit(hrc)

### Use a widget to show runs
Runs will automatically start to show in the following widget once rendered. You can keep the Notebook open and watch them "grow".

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(hr).show()

**Note**: This is a sample image with 200 runs. Your result might look different.
![img](../images/hyperdrive-sklearn.png)

In [None]:
# check cluster status, pay attention to the # of running nodes
# !az batchai cluster list -o table

# check the Batch AI job queue. Notice the Job name is the run history Id. Pay attention to the State of the job.
# !az batchai job list -o table

In [None]:
hr

### Find best run
Wait until all Hyperdrive runs have finished before running the cells below.

In [None]:
hr.wait_for_completion(show_output = True)

In [None]:
hr.get_status()

In [None]:
from tqdm import tqdm

runs = {}

for r in tqdm(hr.get_children()):
    metrics = r.get_metrics()
    if 'mse' in metrics:
        runs[r.id] = metrics

In [None]:
import numpy as np
best_run_id = min(runs, key = lambda k: runs[k]['mse'])
best_run = runs[best_run_id]
print('Best Run: alpha = {0:.4f}, MSE = {1:.4f}'.format(best_run['alpha'], best_run['mse']))
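
The selection above assumes each run stores a single scalar under `'mse'`. If a training script logs the same metric name more than once, the stored value may be a list instead. Whether that applies here depends on `diabetes_sklearn.py`, so treat the following as a defensive sketch only. The helpers `last_value` and `best_run_by_mse`, the run ids, and all numbers are hypothetical, not real results.

```python
def last_value(value):
    """Return the final entry of a repeatedly logged metric, else the value itself."""
    return value[-1] if isinstance(value, (list, tuple)) else value


def best_run_by_mse(runs):
    """Return (run_id, metrics) for the entry with the smallest final 'mse'."""
    best_id = min(runs, key=lambda k: last_value(runs[k]['mse']))
    return best_id, runs[best_id]


# Made-up metrics keyed by made-up run ids, for illustration only
example_runs = {
    'run_a': {'alpha': 0.10, 'mse': 3400.0},
    'run_b': {'alpha': 0.45, 'mse': [3300.0, 3250.0]},  # logged twice
    'run_c': {'alpha': 0.90, 'mse': 3500.0},
}

best_id, best_metrics = best_run_by_mse(example_runs)
print('best run:', best_id, 'alpha =', best_metrics['alpha'])
```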

### Plot the best run [Optional] 
Note you will need to install `matplotlib` for this.

In [None]:
%matplotlib inline
import matplotlib
from matplotlib import pyplot as plt

In [None]:
# get metrics of alpha and mse for all runs
metrics = np.array([[runs[r]['alpha'], runs[r]['mse']] for r in runs])

# sort the metrics by alpha values
metrics = np.array(sorted(metrics, key = lambda m: m[0]))

In [None]:
plt.title('MSE over alpha', fontsize = 16)

plt.plot(metrics[:,0], metrics[:,1], 'r--')
plt.plot(metrics[:,0], metrics[:,1], 'bo')

plt.xlabel('alpha', fontsize = 14)
plt.ylabel('mean squared error', fontsize = 14)

plt.show()