# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [23]:
import os
import joblib
from azureml.widgets import RunDetails
from azureml.core.model import Model
from azureml.core import Workspace, Experiment, Environment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.script_run_config import ScriptRunConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice, randint
from azureml.core.conda_dependencies import CondaDependencies

## Dataset

This project uses the Kaggle dataset "60k Stack Overflow Questions with Quality Rating", which can be downloaded from https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate. Questions from 2016-2020 are labelled in three categories (HQ, LQ_EDIT, LQ_CLOSE) according to their quality. A subset of 45000 rows is used for the project. The Title and Body of each question are concatenated and treated as a single text. This text is the only predictor of quality; tags are ignored. For simplicity, the task is reduced to a binary classification between HQ and LQ, meaning that LQ_EDIT and LQ_CLOSE are aggregated into one class, LQ.
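
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names `Title`, `Body`, and `Y`, the helper `prepare_row`, and the three toy rows are all assumptions made for the example.

```python
# Illustrative sketch only: column names, helper, and rows are assumptions.
LABEL_MAP = {"HQ": "HQ", "LQ_EDIT": "LQ", "LQ_CLOSE": "LQ"}

def prepare_row(row):
    """Join title and body into one text and reduce the label to HQ or LQ."""
    text = f"{row['Title']} {row['Body']}"
    return {"text": text, "label": LABEL_MAP[row["Y"]]}

# Toy rows invented for the example (not taken from the dataset)
rows = [
    {"Title": "How do I sort a list?", "Body": "I tried sorted() first.", "Y": "HQ"},
    {"Title": "help", "Body": "code not working", "Y": "LQ_CLOSE"},
    {"Title": "Java error", "Body": "see title", "Y": "LQ_EDIT"},
]
prepared = [prepare_row(r) for r in rows]
print([p["label"] for p in prepared])
```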

Binary classification of the texts is performed on a matrix of word counts.
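
To make the word-count representation concrete, here is a small plain-Python sketch. It is a stand-in, not the project's implementation: the notebook does not show how the matrix is built, and the whitespace tokenization and the helper name `build_count_matrix` are assumptions.

```python
from collections import Counter

def build_count_matrix(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in texts[i]."""
    # Assumption: lowercase + whitespace split as a stand-in tokenizer
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: j for j, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[index[token]] = count
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = build_count_matrix(["python list sort", "sort sort java"])
print(vocabulary)  # ['java', 'list', 'python', 'sort']
print(matrix)      # [[0, 1, 1, 1], [1, 0, 0, 2]]
```

Each row corresponds to one text and each column to one vocabulary word, which is the tabular form a gradient boosting classifier can consume.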

In [2]:
ws = Workspace.from_config()
experiment_name = 'capstone-proj'

azureml_env = Environment.get(workspace=ws, name="AzureML-Minimal")
azureml_env.save_to_directory(path="azureml_env", overwrite=True)
env = Environment.load_from_directory("azureml_env")
env.name = "Capstone Project"
conda_dep = CondaDependencies()
conda_dep.add_pip_package("joblib")
conda_dep.add_pip_package("lightgbm")

env.python.conda_dependencies=conda_dep

exp = Experiment(ws, experiment_name)
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-144899
aml-quickstarts-144899
southcentralus
2c48c51c-bd47-40d4-abbe-fb8eabd19c8c


In [3]:
run = exp.start_logging()

In [4]:
amlcompute_cluster_name = "comp-capstone"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().
compute_target.get_status()

Creating...
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded......................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


<azureml.core.compute.amlcompute.AmlComputeStatus at 0x7f7d601209e8>

## Hyperdrive Configuration

GradientBoostingClassifier from scikit-learn is used because it is practical to work with and because gradient boosting generally performs well on tabular datasets. The idea of combining many base estimators and weighting data points, features, and estimators at each iteration is very powerful.

The hyperparameters searched are the two most fundamental ones for gradient boosters: the learning rate and the number of estimators. A high number of estimators combined with a high learning rate would cause overfitting, while too few estimators would lead to underfitting and a very small learning rate might get stuck at a poor local optimum.
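
Random sampling over these two hyperparameters can be pictured with the standard library alone. The sketch below mirrors the ranges used in this notebook's HyperDrive configuration (learning rate uniform on [0.01, 0.15], number of estimators chosen from 100, 300, 500). It only imitates the sampling idea and is not how HyperDrive is implemented; `sample_config` is a hypothetical helper.

```python
import random

def sample_config(rng):
    """Draw one configuration from the two-parameter search space."""
    return {
        "learning_rate": rng.uniform(0.01, 0.15),     # continuous range
        "n_estimators": rng.choice([100, 300, 500]),  # discrete options
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_config(rng) for _ in range(20)]  # 20 draws, like max_total_runs
for config in configs[:3]:
    print(config)
```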

The early stopping policy (Bandit Policy) checks the optimization job every 2 intervals and prevents the tuning process from going astray with values that contribute little to improving the metric. If a run's metric falls outside the slack range relative to the best run so far (a slack factor of 0.2, i.e. 20%, here), the run is terminated. However, judging by the logs, it does not seem to work perfectly: there is a warning that the primary metric is not chosen, although it has been set properly and is visible in the Azure ML UI. In any case, GradientBoostingClassifier has its own early stopping mechanism, so we are safe in this respect.
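
The slack rule can be written down in a few lines. The sketch below follows the rule as documented for `BanditPolicy` with a `slack_factor` and a maximization goal: a run is terminated when its best metric is below the best metric so far divided by `1 + slack_factor`. The function name and the AUC values are hypothetical and chosen only for illustration.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (metric is maximized)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold

best_auc = 0.76  # hypothetical best AUC among runs so far
print(bandit_should_terminate(0.60, best_auc, 0.2))  # True: 0.60 < 0.76 / 1.2
print(bandit_should_terminate(0.70, best_auc, 0.2))  # False: 0.70 is within the slack
```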

The remaining settings are provided through ScriptRunConfig with a Python environment (a mixture of pip and conda packages, as defined above). Total runs are restricted to 20 and concurrent runs to 4, as expected by Udacity reviewers.
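
HyperDrive passes each sampled value to the training script as a command-line argument, so `train.py` has to accept `--learning_rate` and `--n_estimators`. The script itself is not shown in this notebook; the following is a hypothetical sketch of its argument handling only, with assumed default values.

```python
import argparse

def parse_args(argv=None):
    """Parse the two hyperparameters that HyperDrive supplies on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py entry point")
    parser.add_argument("--learning_rate", type=float, default=0.1)  # assumed default
    parser.add_argument("--n_estimators", type=int, default=100)     # assumed default
    return parser.parse_args(argv)

args = parse_args(["--learning_rate", "0.05", "--n_estimators", "300"])
print(args.learning_rate, args.n_estimators)

# In the real script, the primary metric would be logged under the same name
# given to HyperDriveConfig, for example:
#     run = Run.get_context()
#     run.log("AUC", auc_value)
```

The logged metric name has to match `primary_metric_name` exactly, otherwise HyperDrive cannot rank the child runs.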


In [5]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.2)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling(
    {
        '--learning_rate': uniform(0.01,0.15),
        '--n_estimators': choice(100,300,500)
    }
)

#TODO: Create your estimator and hyperdrive config
src = ScriptRunConfig(source_directory=os.getcwd(),
                      environment=env,
                      compute_target=compute_target,
                      script="train.py")

hyperdrive_run_config = HyperDriveConfig(run_config=src,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name="AUC",
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=20,
                                         max_concurrent_runs=4)

In [6]:
#TODO: Submit your experiment
exp_run = exp.submit(config=hyperdrive_run_config)

## Run Details

The `RunDetails` widget ran but did not show any output except logs. Screenshots from the Azure ML UI are provided in Readme.md.

In [7]:
exp_run.wait_for_completion()

{'runId': 'HD_99eabec3-c743-4baa-8150-6c9738c7a37c',
 'target': 'comp-capstone',
 'status': 'Completed',
 'startTimeUtc': '2021-05-15T18:31:56.530264Z',
 'endTimeUtc': '2021-05-15T18:54:37.839221Z',
 'properties': {'primary_metric_config': '{"name": "AUC", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '25ab491c-f901-428d-bb05-37c82e1cce93',
  'score': '0.7683815205914833',
  'best_child_run_id': 'HD_99eabec3-c743-4baa-8150-6c9738c7a37c_17',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg144899.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_99eabec3-c743-4baa-8150-6c9738c7a37c/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=kz7yZ%2BcjzHJZnD1zkbd1ZS426rrcz4o3tDaPkaVru5Q%3D&st=2021-05-15T18%3A45%3A20Z&se=2021-05-16T02%3A55%3A20Z&sp=r'},
 'submittedBy': 'ODL_User 144899'}

In [8]:
RunDetails(exp_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [9]:
best_run = exp_run.get_best_run_by_primary_metric()


In [10]:
best_run.get_metrics()

{'Learning Rate': 0.14066546074402428,
 'Number of Estimators': 500,
 'AUC': 0.7683815205914833,
 'Accuracy': 0.7681818181818182}

In [11]:
best_run.get_file_names()

['azureml-logs/55_azureml-execution-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt',
 'azureml-logs/65_job_prep-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'logs/azureml/104_azureml.log',
 'logs/azureml/job_prep_azureml.log',
 'logs/azureml/job_release_azureml.log',
 'outputs/hd_model 2021-05-15 18-53 -lr=0.1407 -ne=500.joblib']

In [14]:
best_run.get_file_names()[-1]

'outputs/hd_model 2021-05-15 18-53 -lr=0.1407 -ne=500.joblib'
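
Indexing with `[-1]` works here because the model file happens to be listed last. A slightly more defensive lookup, sketched below with a hypothetical helper `find_model_file`, selects the file by its location and extension instead. The shortened file list is copied from the output above.

```python
import os

def find_model_file(file_names, suffix=".joblib"):
    """Return the first run file under outputs/ that ends with the given suffix."""
    for name in file_names:
        if name.startswith("outputs/") and name.endswith(suffix):
            return name
    raise FileNotFoundError(f"No '{suffix}' file found among run files")

file_names = [
    "azureml-logs/70_driver_log.txt",
    "outputs/hd_model 2021-05-15 18-53 -lr=0.1407 -ne=500.joblib",
]
model_file = find_model_file(file_names)
print(os.path.basename(model_file))
```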

In [13]:
best_run = exp_run.get_best_run_by_primary_metric()
file_names = best_run.get_file_names()
print(file_names)

# The model file is the last entry in the list; download it into the working directory
model_file = file_names[-1]
local_model_file = os.path.basename(model_file)
best_run.download_file(model_file, output_file_path=local_model_file)

# Load the downloaded model and save a copy under ./outputs with a fixed name
local_path = './outputs'
os.makedirs(local_path, exist_ok=True)

best_hd_model = joblib.load(local_model_file)
joblib.dump(best_hd_model, 'outputs/best_hd_model.joblib')

['azureml-logs/55_azureml-execution-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt', 'azureml-logs/65_job_prep-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_8f6e042aff84bd467b154e46177868ea5b961fb254ba111a9f5ea85ef89d73f9_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/104_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/hd_model 2021-05-15 18-53 -lr=0.1407 -ne=500.joblib']


Trying to unpickle estimator DummyClassifier from version 0.24.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
Trying to unpickle estimator DecisionTreeRegressor from version 0.24.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.
Trying to unpickle estimator GradientBoostingClassifier from version 0.24.2 when using version 0.22.2.post1. This might lead to breaking code or invalid results. Use at your own risk.


['outputs/best_hd_model.joblib']

## Model Deployment

This model has not been deployed as a web service. Please check automl.ipynb.

Still, the best HyperDrive model, which has already been downloaded, is registered under the name hdr_best_gbc.

In [26]:
Model.register(workspace=ws, model_path='outputs/best_hd_model.joblib', model_name="hdr_best_gbc",
 description="Stackoverflow Quality Prediction - HyperDrive.")



Registering model hdr_best_gbc


Model(workspace=Workspace.create(name='quick-starts-ws-144899', subscription_id='2c48c51c-bd47-40d4-abbe-fb8eabd19c8c', resource_group='aml-quickstarts-144899'), name=hdr_best_gbc, id=hdr_best_gbc:1, version=1, tags={}, properties={})