# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import os
import joblib
import logging
import json
import requests

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset

from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails

from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice

In [2]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


Workspace name: quick-starts-ws-141035
Azure region: southcentralus
Subscription id: 9a7511b8-150f-4a58-8528-3e7d50216c31
Resource group: aml-quickstarts-141035


In [3]:
experiment_name = 'hyperdrive-experiment'
experiment = Experiment(ws, experiment_name)

In [4]:
# Create compute cluster
cluster_name = "capstone-cluster"

# Reuse the cluster if it already exists; otherwise provision a new one
try:
    compute_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2', max_nodes=6)
    compute_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

compute_cluster.wait_for_completion(show_output=True, min_node_count=1, timeout_in_minutes=10)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [5]:
# Load the dataset from the workspace if it is registered; otherwise create it from the file
# NOTE: update the key to match the dataset name
key = "Heart-Failure-Dataset"
description_text = "Dataset for heart failure prediction."

if key in ws.datasets.keys():
    dataset = ws.datasets[key]
    print("Dataset found in the workspace")
else:
    # Create an AML tabular dataset from the external CSV file
    data = 'https://raw.githubusercontent.com/TahreemArif/ML-Azure-Capstone-Project/master/heart_failure_clinical_records_dataset.csv'
    dataset = Dataset.Tabular.from_delimited_files(data)
    # Register the dataset in the workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()

Dataset found in the workspace


## Hyperdrive Configuration

The task at hand is a classification task: predicting heart failure. The algorithm used is logistic regression. HyperDrive tunes the following two hyperparameters:

   * Inverse of regularization strength (C): controls the strength of the regularization penalty. Smaller values mean stronger regularization, which helps prevent overfitting.
   * Maximum iterations (max_iter): the maximum number of iterations the solver may take to converge.
  
The primary metric is accuracy, and the goal is to maximize it. An early termination policy (Bandit policy) improves computational efficiency by terminating runs whose primary metric falls outside the specified slack_factor relative to the best-performing run.
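
To make the slack factor rule concrete, here is a minimal sketch of the comparison a Bandit policy applies when the goal is to maximize the metric: a run is cancelled when its metric is below `best_metric / (1 + slack_factor)`. The function name `should_terminate` and the sample accuracies are illustrative assumptions, not part of the Azure ML SDK, and the real policy also respects `evaluation_interval` (and an optional evaluation delay), which this sketch ignores.

```python
def should_terminate(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """Return True if a run falls outside the allowed slack (maximize goal).

    Hypothetical helper: mirrors the documented ratio rule only,
    not the SDK's scheduling around evaluation intervals.
    """
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


best_so_far = 0.80   # illustrative best accuracy reported so far
slack_factor = 0.1   # same value as in this notebook

for accuracy in (0.78, 0.73, 0.70):
    decision = "terminate" if should_terminate(accuracy, best_so_far, slack_factor) else "keep"
    print(f"accuracy={accuracy:.2f} -> {decision}")
```

With a slack factor of 0.1 and a best accuracy of 0.80, the cut-off is roughly 0.727, so only the run at 0.70 would be stopped in this example.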


In [6]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval = 2)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling({
    '--C': choice(0.001, 0.01, 0.1, 1, 10, 100),
    '--max_iter': choice(range(25, 500, 25))
})

#TODO: Create your estimator and hyperdrive config
estimator = SKLearn(source_directory = './', entry_script = 'train.py', compute_target = compute_cluster)


hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=25,
                                         max_concurrent_runs=5)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
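
As a rough illustration of what random sampling over this search space means, the sketch below rebuilds the same discrete options in plain Python and draws 25 configurations, matching `max_total_runs`. It uses only the standard library; `sample_configuration` is a hypothetical helper, and HyperDrive's own sampler may differ in details such as seeding or handling of repeated draws.

```python
import random

# Same discrete options as the param_sampling definition in this notebook
search_space = {
    "--C": [0.001, 0.01, 0.1, 1, 10, 100],
    "--max_iter": list(range(25, 500, 25)),  # 25, 50, ..., 475
}


def sample_configuration(space, rng):
    """Draw one option per hyperparameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in space.items()}


# Number of distinct combinations an exhaustive grid search would need
grid_size = 1
for options in search_space.values():
    grid_size *= len(options)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
samples = [sample_configuration(search_space, rng) for _ in range(25)]

print(f"Grid size: {grid_size}")
print(f"Configurations sampled: {len(samples)}")
```

The full grid has 6 x 19 = 114 combinations, so 25 runs cover at most about a fifth of it. That is the trade-off random sampling makes in exchange for a shorter experiment.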


In [7]:
#TODO: Submit your experiment
hyperdrive_run = experiment.submit(hyperdrive_run_config)



## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [8]:
RunDetails(hyperdrive_run).show()


In [9]:
hyperdrive_run.wait_for_completion()

{'runId': 'HD_41fd543d-e1ff-46cf-ac5a-57622f0498c2',
 'target': 'capstone-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-21T11:52:34.480417Z',
 'endTimeUtc': '2021-03-21T12:06:27.842266Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': 'a88adf2f-82d7-4974-af9d-2cb21debdcd7',
  'score': '0.8166666666666667',
  'best_child_run_id': 'HD_41fd543d-e1ff-46cf-ac5a-57622f0498c2_4',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg141035.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_41fd543d-e1ff-46cf-ac5a-57622f0498c2/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=6keg71JnWiLGxvIQOudQaNSt0BUrgeHilTdxRABda7Y%3D&st=2021-03-21T11%3A56%3A52Z&se=2021-03-21T20%3A06%3A52Z&sp=r'},
 'submittedBy': 'ODL_User 141035'}

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [10]:
best_hyperdrive_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_hyperparameters = best_hyperdrive_run.get_details()['runDefinition']['arguments']

In [11]:
print("Best run ID: {}".format(best_hyperdrive_run.id))
print("Best Hyperparameters: {}".format(best_run_hyperparameters))
print("Accuracy: {}".format(best_hyperdrive_run.get_metrics()["Accuracy"]))

Best run ID: HD_41fd543d-e1ff-46cf-ac5a-57622f0498c2_4
Best Hyperparameters: ['--C', '0.001', '--max_iter', '450']
Accuracy: 0.8166666666666667
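
The best run's arguments come back as a flat list of strings. If typed values are needed later, for example to retrain the model locally, a small helper can pair the flags with their values. This is a sketch only: `arguments_to_dict` is a hypothetical name, and it assumes the list strictly alternates flag and value, as in the output above.

```python
def arguments_to_dict(arguments):
    """Turn ['--name', 'value', ...] into {'name': 'value', ...}."""
    if len(arguments) % 2 != 0:
        raise ValueError("expected alternating flag/value pairs")
    params = {}
    for flag, raw_value in zip(arguments[0::2], arguments[1::2]):
        params[flag.lstrip("-")] = raw_value
    return params


# Argument list copied from the best run reported above
best_args = ['--C', '0.001', '--max_iter', '450']

params = arguments_to_dict(best_args)
best_c = float(params["C"])               # regularization strength is a float
best_max_iter = int(params["max_iter"])   # iteration cap is an integer

print(best_c, best_max_iter)
```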


In [12]:
#TODO: Save the best model
# NOTE: this writes the best run's hyperparameter arguments to disk, not a fitted model object
joblib.dump(value=best_run_hyperparameters, filename='hyperdrive_model.joblib')

['hyperdrive_model.joblib']
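
To keep the fitted model itself rather than the argument list, the training script has to write the model into the run's files, from which it can then be downloaded. The sketch below shows one possible way, using the `get_file_names` and `download_file` methods of an Azure ML `Run`. Both `remote_path` and `local_path` are assumptions: `train.py` is not shown in this notebook, so the actual output file name may differ or the file may not exist at all.

```python
def download_best_model(run, remote_path="outputs/model.joblib",
                        local_path="best_model.joblib"):
    """Download a model file from a run, if the run actually produced it.

    `run` is expected to behave like azureml.core.Run.
    `remote_path` is an ASSUMPTION about what train.py writes;
    `local_path` is an arbitrary illustrative file name.
    """
    available = run.get_file_names()
    if remote_path not in available:
        raise FileNotFoundError(
            f"{remote_path!r} not found in run files: {available}"
        )
    run.download_file(name=remote_path, output_file_path=local_path)
    return local_path


# Usage inside this notebook (requires a completed HyperDrive run):
# model_path = download_best_model(best_hyperdrive_run)
# model = joblib.load(model_path)
```

Checking the file list first gives a readable error when the assumed path is wrong, instead of a failure inside the download call.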

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service