# Hyperparameter Tuning using HyperDrive

In [1]:
from azureml.core import Workspace, Experiment

## Dataset 

The dataset was created in the AutoML notebook. As mentioned there, I use the Credit Card Default dataset from the UCI Machine Learning Repository. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#

I downloaded the .csv file, pre-processed it, and stored the resulting CSV in the datastore. The train.py script loads the data in its `__main__` block and trains a model.

In [2]:
ws = Workspace.from_config()
# choose a name for experiment
experiment_name = 'capstone-project'
experiment=Experiment(ws, experiment_name)

We now load the compute cluster; if it doesn't exist, we create one.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "compute-train"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', 
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    
compute_target.wait_for_completion(show_output=True)

Found existing compute target
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Hyperdrive Configuration

We also need an environment for the HyperDrive runs to execute in. It is built from a conda specification and installs the necessary libraries (scikit-learn and azureml-defaults).

In [4]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - azureml-defaults

Writing conda_dependencies.yml


In [5]:
from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')

I have decided to use a Logistic Regression model. Because the model is simple, its few hyperparameters have a direct effect on the result. I tune two of them: C, which controls the regularization strength (in scikit-learn, smaller values mean stronger regularization), and l1_ratio (passed as --l1), which sets the weighting between L1 and L2 regularization. For early termination I use a BanditPolicy with slack_amount=0.1, so a run is terminated when its accuracy falls more than 0.1 below that of the best run so far. The configuration consists of the training script, the environment, and the primary metric, for which I chose accuracy since this is a classification problem.
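
The early-termination rule described above can be illustrated in plain Python. This is only a sketch of the idea behind `slack_amount` for a metric that is maximized; `should_terminate` and the accuracy values are hypothetical and are not part of the Azure ML SDK.

```python
def should_terminate(run_metric, best_metric, slack_amount=0.1):
    """Return True if a run trails the best run by more than slack_amount.

    Illustrates the idea behind BanditPolicy(slack_amount=...) for a
    metric that is maximized; this is not Azure ML code.
    """
    return run_metric < best_metric - slack_amount


# Hypothetical accuracies reported by three runs at the same interval.
best_so_far = 0.78
for accuracy in (0.77, 0.70, 0.65):
    print(accuracy, should_terminate(accuracy, best_so_far))
```

With a best accuracy of 0.78, only the run at 0.65 falls below the 0.68 cut-off and would be stopped early.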

In [6]:
from azureml.widgets import RunDetails
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice
import os
from azureml.core import ScriptRunConfig

# Use a Bandit policy for early termination.
# A run is terminated when its primary metric falls more than 0.1 below the best run's metric.
early_termination_policy = BanditPolicy(slack_amount=0.1)

#I create a RandomParameterSampling instance that chooses randomly between the choices I give for the hyperparameters
param_sampling = RandomParameterSampling({
    "--C": choice(0.01, 0.05, 0.1, 0.5, 1, 100),
    "--l1": choice(0.1, 0.3, 0.6, 0.9)
    }
)

# Run train.py through ScriptRunConfig rather than the Estimator class, which is deprecated.
src = ScriptRunConfig(source_directory=".",
                      script='train.py',
                      # Default values; HyperDrive appends the sampled --C and --l1 after these.
                      arguments=['--C', 1.0, '--l1', 0.5],
                      compute_target=compute_target,
                      environment=sklearn_env)


# Creating a HyperDriveConfig
hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name='Accuracy',
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=25,
                                     max_concurrent_runs=4)
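
To get a feel for the size of this search space, here is a small standard-library sketch. It assumes that each `choice` is drawn uniformly and independently, which is a simplification; how HyperDrive itself handles repeated combinations is not covered here. The helper `sample_configuration` is hypothetical.

```python
import itertools
import random

# The same discrete choices as in param_sampling.
search_space = {
    "--C": [0.01, 0.05, 0.1, 0.5, 1, 100],
    "--l1": [0.1, 0.3, 0.6, 0.9],
}


def sample_configuration(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


# Enumerate the full grid: 6 values of C times 4 values of l1.
all_combinations = list(itertools.product(*search_space.values()))
print(len(all_combinations))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [sample_configuration(search_space, rng) for _ in range(25)]
distinct = {tuple(d.values()) for d in draws}
print(len(draws), len(distinct))
```

The grid holds 6 × 4 = 24 distinct combinations, so under this assumption 25 random draws must repeat at least one of them, and typically leave part of the grid unexplored.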

## Run Details
We use the RunDetails widget to track the progress of the run.

In [12]:
hyperdrive_run = experiment.submit(hyperdrive_config)
RunDetails(hyperdrive_run).show()
hyperdrive_run.wait_for_completion(show_output=True)

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_8f767b81-65e4-4d8c-9aba-136b1663c218
Web View: https://ml.azure.com/experiments/capstone-project/runs/HD_8f767b81-65e4-4d8c-9aba-136b1663c218?wsid=/subscriptions/f5878af0-ca26-411c-9906-acf91f5420e2/resourcegroups/grouprisk/workspaces/msc-lrn-dev

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-02-14T22:12:47.578025][API][INFO]Experiment created<END>\n""<START>[2021-02-14T22:12:48.081214][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-02-14T22:12:48.300590][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2021-02-14T22:12:49.0505887Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_8f767b81-65e4-4d8c-9aba-136b1663c218
Web View: https://ml.azure.com/experiments/capstone-project/runs/HD_8f767b81-65e4-4d8c-9aba-136b1663c218?wsid=/subscriptions/f5878af0-ca26-411c-9906-acf

{'runId': 'HD_8f767b81-65e4-4d8c-9aba-136b1663c218',
 'target': 'compute-train',
 'status': 'Completed',
 'startTimeUtc': '2021-02-14T22:12:47.307256Z',
 'endTimeUtc': '2021-02-14T22:27:23.021295Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '3d0d7e7b-c96c-4074-aae1-926cd7e20ce8',
  'score': '0.7768',
  'best_child_run_id': 'HD_8f767b81-65e4-4d8c-9aba-136b1663c218_1',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://msclrndev4577006030.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_8f767b81-65e4-4d8c-9aba-136b1663c218/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=W1HRnFgr1%2FFEOb9XNHWgR1ll0B8bbynFZZgwZtyzTn4%3D&st=2021-02-14T22%3A17%3A23Z&se=2021-02-15T06%3A27%3A23Z&sp=r'},
 'submittedBy': 'Avineil Jain'}
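
The summary above is an ordinary dictionary, so the best score and the duration of the sweep can be read from it directly. The sketch below works on an excerpt copied from that output; the names `summary`, `TIME_FORMAT`, and `duration_seconds` are my own for illustration.

```python
import json
from datetime import datetime

# Excerpt of the summary printed above (values copied verbatim).
summary = {
    "status": "Completed",
    "startTimeUtc": "2021-02-14T22:12:47.307256Z",
    "endTimeUtc": "2021-02-14T22:27:23.021295Z",
    "properties": {
        "primary_metric_config": '{"name": "Accuracy", "goal": "maximize"}',
        "score": "0.7768",
        "best_child_run_id": "HD_8f767b81-65e4-4d8c-9aba-136b1663c218_1",
    },
}

# Timestamps end in a literal "Z", so parse them with an explicit format.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
start = datetime.strptime(summary["startTimeUtc"], TIME_FORMAT)
end = datetime.strptime(summary["endTimeUtc"], TIME_FORMAT)
duration_seconds = (end - start).total_seconds()

# The metric configuration is stored as a JSON string, the score as text.
metric = json.loads(summary["properties"]["primary_metric_config"])
best_score = float(summary["properties"]["score"])

print(f"{metric['name']} ({metric['goal']}): {best_score}")
print(f"Sweep duration: {duration_seconds:.0f} s")
```

For this run the sweep took a little under 15 minutes and the best child run reached an accuracy of 0.7768.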

## Best Model


In [13]:
import joblib
# I get the best run, and I save it to the workspace by registering it 

best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
model = best_run.register_model(model_name='Hyperdrive_best', model_path='outputs/model.joblib')

['--C', '1', '--l1', '0.5', '--C', '0.01', '--l1', '0.9']
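
The printed list contains each flag twice: first the defaults passed to ScriptRunConfig ('--C', '1', '--l1', '0.5'), then the values HyperDrive sampled for the best run. Assuming train.py parses its arguments with argparse (the script is not shown here, so the parser below is a hypothetical stand-in), the last occurrence of each flag wins:

```python
import argparse

# Hypothetical stand-in for the argument parsing inside train.py.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--l1", type=float, default=0.5)

# The argument list printed above: defaults first, sampled values last.
logged_arguments = ['--C', '1', '--l1', '0.5', '--C', '0.01', '--l1', '0.9']
args = parser.parse_args(logged_arguments)

print(args.C, args.l1)  # 0.01 0.9
```

Under that assumption, the best run was effectively trained with C=0.01 and l1_ratio=0.9.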


## Model Deployment

I will deploy the model trained by AutoML rather than this one, since it achieved a much higher accuracy than the best HyperDrive run (0.7768).