# Hyperparameter Tuning using HyperDrive

In [1]:
# Importing dependencies
from azureml.core import Workspace, Experiment, Model
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

from azureml.core.run import Run
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice, uniform
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core import Environment, ScriptRunConfig


import os
import joblib
import logging
import argparse

## Dataset

For this project, I'm using the Heart Failure Prediction dataset from Kaggle. It contains 12 clinical features that can be used to predict mortality from heart failure. I downloaded the data and stored it in my GitHub repository, and I use `TabularDatasetFactory` to load it in tabular form.

In [2]:
# Getting data in tabular form from the path
path = "https://raw.githubusercontent.com/ShashiChilukuri/Data2Deployment-AzureML/main/heart_failure_clinical_records_dataset.csv"
data = TabularDatasetFactory.from_delimited_files(path)

In [3]:
# Creating workspace and experiment
ws = Workspace.from_config()
experiment_name = 'ClassifyHeartFailure-HyperDrive'

experiment=Experiment(ws, experiment_name)

In [4]:
# Check if the cluster exists; if not, create one
cluster_name = "Compute-Standard"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                           min_nodes=1,
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)

# get status of the cluster
print(cpu_cluster.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 3, 'targetNodeCount': 3, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 3, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2022-06-28T00:32:32.023000+00:00', 'errors': None, 'creationTime': '2022-06-27T22:08:07.283404+00:00', 'modifiedTime': '2022-06-27T22:08:13.905336+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT1800S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


## HyperDrive Configuration

To predict heart failure, I'm using a Random Forest model, and I'm tuning its hyperparameters with Azure HyperDrive. HyperDrive takes a parameter sampler and an early stopping policy as inputs. For parameter sampling, I used random parameter sampling over the hyperparameter search space. I picked it because it is typically quicker than grid sampling: it draws combinations at random rather than exhaustively trying every one. For early stopping, I used the Bandit early termination policy. At every evaluation interval, it compares each run's primary metric with that of the best run so far and terminates runs that fall outside the specified slack. It is easy to use and, in my view, gives more flexibility than other stopping policies such as median stopping.
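
To make the Bandit rule concrete, here is a minimal local sketch of the slack-factor check for a metric that is being maximized. It is not HyperDrive's implementation: the function name `bandit_should_terminate` and the accuracy values are hypothetical, and the threshold `best / (1 + slack_factor)` follows my reading of the Azure ML documentation for `BanditPolicy`.

```python
def bandit_should_terminate(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is being maximized (as with accuracy here).
    """
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold


# Hypothetical accuracies, checked against a hypothetical best run of 0.80
# with the same slack_factor=0.1 used in this notebook.
best_so_far = 0.80
for accuracy in (0.78, 0.73, 0.70):
    decision = "terminate" if bandit_should_terminate(accuracy, best_so_far, 0.1) else "keep"
    print(f"accuracy={accuracy:.2f} -> {decision}")
```

With a slack factor of 0.1, the cutoff in this example is 0.80 / 1.1, roughly 0.727, so only the run at 0.70 would be stopped.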

The HyperDrive configuration guides the search for the best model. Along with the parameter sampler and the early stopping policy, I used "Accuracy" as the primary metric, since it is a reasonable metric for simple datasets, and set the goal to maximize it: the higher the accuracy, the better the model. The experiment is limited to 20 total runs, with up to 4 running concurrently.

In [5]:
# Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Create the different params that you will be using during training
parameter_space = {"--n_estimators": choice(10, 20, 40), "--min_samples_split": choice(2,4,6)}
param_sampling = RandomParameterSampling(parameter_space = parameter_space)

# Setup environment for your training run
sklearn_env = Environment.from_conda_specification(name='sklearn-env', file_path='conda_dependencies.yml')

# Create a ScriptRunConfig Object to specify the configuration details of your training job
src = ScriptRunConfig(source_directory = ".",
                      script='train.py',
                      compute_target=cluster_name,
                      environment = sklearn_env)


hyperdrive_run_config = HyperDriveConfig(run_config=src,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20,
                                         max_concurrent_runs=4,
                                         policy=early_termination_policy)
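
As a rough illustration of the search space configured above, the sketch below enumerates the full grid locally and then simulates independent random draws from it. It uses only the standard library and does not call HyperDrive; the helper names `all_combinations` and `sample_random` are hypothetical, and the simulation says nothing about how HyperDrive itself handles repeated samples.

```python
import itertools
import random

# Same option values as the choice(...) expressions in the cell above.
search_space = {
    "--n_estimators": [10, 20, 40],
    "--min_samples_split": [2, 4, 6],
}


def all_combinations(space):
    """Enumerate every combination, which is what a grid search would try."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def sample_random(space, rng):
    """Draw one combination by picking each option independently at random."""
    return {name: rng.choice(options) for name, options in space.items()}


grid = all_combinations(search_space)
print("Grid size:", len(grid))  # 3 options x 3 options = 9 combinations

rng = random.Random(0)  # seeded only to make the illustration repeatable
samples = [sample_random(search_space, rng) for _ in range(20)]
distinct = {tuple(sample.items()) for sample in samples}
print("Distinct combinations in 20 local draws:", len(distinct))
```

Because this grid has only nine combinations, a budget of 20 runs is larger than the grid itself, so the local simulation necessarily revisits some combinations. The speed advantage of random sampling matters more as the search space grows.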

In [6]:
# Submit the experiment
hyperdrive_run = experiment.submit(hyperdrive_run_config, show_output=True)
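
Each child run launched by this submission invokes `train.py` with the sampled values passed as command-line arguments named after the keys of the parameter space. The notebook does not show `train.py`, so the following is only a sketch of how the script might read those arguments; the `parse_args` helper and the default values are assumptions, not the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that HyperDrive is expected to pass to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical argument handling for train.py")
    # Argument names match the keys of the parameter space; defaults are assumed.
    parser.add_argument("--n_estimators", type=int, default=20,
                        help="Number of trees in the forest")
    parser.add_argument("--min_samples_split", type=int, default=2,
                        help="Minimum number of samples required to split a node")
    return parser.parse_args(argv)


# Simulate the arguments one child run might receive.
args = parse_args(["--n_estimators", "40", "--min_samples_split", "4"])
print(args.n_estimators, args.min_samples_split)
```

The parsed values would then typically be passed to the model constructor and logged through the run object, which is where metrics such as `Accuracy` in the output below come from.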

## Run Details

Use the `RunDetails` widget to monitor the progress of the HyperDrive child runs.

In [7]:
RunDetails(hyperdrive_run).show()

hyperdrive_run.wait_for_completion(show_output=True)

assert hyperdrive_run.get_status() == "Completed"

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_ccd86951-7948-4255-a989-028b2030e6c7
Web View: https://ml.azure.com/runs/HD_ccd86951-7948-4255-a989-028b2030e6c7?wsid=/subscriptions/6971f5ac-8af1-446e-8034-05acea24681f/resourcegroups/aml-quickstarts-199609/workspaces/quick-starts-ws-199609&tid=660b3398-b80e-49d2-bc5b-ac1dc93b5254

Streaming azureml-logs/hyperdrive.txt

<START>[2022-06-28T01:18:49.214410][API][INFO]Experiment created<END>
<START>[2022-06-28T01:18:49.789213][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2022-06-28T01:18:50.5478197Z][SCHEDULER][INFO]Scheduling job, id='HD_ccd86951-7948-4255-a989-028b2030e6c7_0'<END>
<START>[2022-06-28T01:18:50.6724721Z][SCHEDULER][INFO]Scheduling job, id='HD_ccd86951-7948-4255-a989-028b2030e6c7_1'<END>
<START>[2022-06-28T01:18:50.8976648Z][SCHEDULER][INFO]Scheduling job, id='HD_ccd86951-7948-4255-a989-028b2030e6c7_3'<END>
<START>[2022-06-28T01:18:50.821411][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to 

## Best Model

Get the best model from the hyperdrive experiments and display all the properties of the model.

In [8]:
best_hd_run = hyperdrive_run.get_best_run_by_primary_metric()
best_hd_run_metrics = best_hd_run.get_metrics()

print('Best Run Id: ', best_hd_run.id)
print('\n Best Run Metrics: ', best_hd_run_metrics)

Best Run Id:  HD_ccd86951-7948-4255-a989-028b2030e6c7_0

 Best Run Metrics:  {'N.O trees in the forest:': 20, 'Min samples to split:': 2, 'Accuracy': 0.7888888888888889}


In [9]:
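# Save the best run's ID locally. Note that this serializes the run ID string,
# not the fitted model; the model files are registered from the run's outputs/ folder below.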
os.makedirs("./outputs", exist_ok=True)
joblib.dump(value=best_hd_run.id,filename='outputs/best_hyperdrive_run_model.joblib')

['outputs/best_hyperdrive_run_model.joblib']

## Model Deployment

As part of the project, I trained both an AutoML model (in the other notebook) and a HyperDrive-tuned model (in this notebook). The better of the two is picked for deployment.

Regardless of which model is picked, both models are registered. Below is the registration of the HyperDrive model.

In [10]:
model = best_hd_run.register_model(model_name='heart_failure_hyperdrive', 
                                   model_path='outputs/', 
                                   properties={'Accuracy': best_hd_run_metrics['Accuracy'],
                                               'N Estimators': best_hd_run_metrics['N.O trees in the forest:'],
                                               'Min Samples Split': best_hd_run_metrics['Min samples to split:']})

The AutoML model performed better than the HyperDrive model from this notebook, so I will be deploying the AutoML model. There are no further steps in this notebook.