# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment, Environment

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform
from azureml.train.hyperdrive import choice

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import os
import shutil
import joblib

import numpy as np
import pandas as pd


In [3]:
ws = Workspace.from_config()
experiment_name = 'capstone-project'

experiment=Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


run = experiment.start_logging()

Workspace name: quick-starts-ws-133870
Azure region: southcentralus
Subscription id: 9b72f9e6-56c5-4c16-991b-19c652994860
Resource group: aml-quickstarts-133870


In [4]:
my_env = Environment.get(workspace=ws, name="AzureML-AutoML")


### Creating the compute

In [5]:
cpu_cluster_name = "cpu-cluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [6]:
# Create the data folder and the project folder if they don't exist
if not os.path.isdir('data'):
    os.mkdir('data')
    
if not os.path.exists('project_folder'):
    os.makedirs('project_folder')
    
project_folder="./project_folder/"

In [28]:
shutil.copy("train.py", project_folder)

'./project_folder/train.py'

In [8]:
train_df = pd.read_csv("Train.csv")
test_df = pd.read_csv("Test.csv")

train_df.head()


Unnamed: 0,ID,text,label
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression
1,9JDAGUV3,Why do I get hallucinations?,Drugs
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression
3,6UY7DX6Q,Why is life important?,Suicide
4,FYC0FTFB,How could I be helped to go through the depres...,Depression


In [9]:
le = LabelEncoder()
train_df["label"] = le.fit_transform(train_df["label"].tolist())
train_df.head()

Unnamed: 0,ID,text,label
0,SUAVK39Z,I feel that it was better I dieAm happy,1
1,9JDAGUV3,Why do I get hallucinations?,2
2,419WR1LQ,I am stresseed due to lack of financial suppor...,1
3,6UY7DX6Q,Why is life important?,3
4,FYC0FTFB,How could I be helped to go through the depres...,1


In [11]:
train_df.to_csv('data/train.csv', index=False)
test_df.to_csv('data/test.csv', index=False)

In [12]:
datastore=ws.get_default_datastore()

In [13]:
datastore.upload(src_dir="./data", target_path="mental_health", show_progress=True)

Uploading an estimated of 2 files
Uploading ./data/test.csv
Uploaded ./data/test.csv, 1 files out of an estimated total of 2
Uploading ./data/train.csv
Uploaded ./data/train.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_382eceb52c374d6aaa49f9581e1e5e64

In [18]:
shutil.copy("./data/train.csv", project_folder)

'./project_folder/train.csv'

In [14]:
from azureml.data.dataset_factory import TabularDatasetFactory


In [15]:
dataset = TabularDatasetFactory.from_delimited_files(path=datastore.path("mental_health/train.csv"), separator=",")

In [16]:
dataset.register(ws, "mental_health_clf", create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'mental_health/train.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "9b3b620c-03ce-472b-ad6f-ac3a1ab5480a",
    "name": "mental_health_clf",
    "version": 1,
    "workspace": "Workspace.create(name='quick-starts-ws-133870', subscription_id='9b72f9e6-56c5-4c16-991b-19c652994860', resource_group='aml-quickstarts-133870')"
  }
}

## Hyperdrive Configuration

*TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.*

I use an XGBoost classifier, a gradient-boosted decision tree model, because it trains quickly and tends to perform well. I chose to tune `max_depth` and `n_estimators` because they span a wide range of values and usually have a large effect on the results.
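
HyperDrive hands each sampled value to the training script as a command-line argument (`--max_depth`, `--n_estimators`). The notebook does not show `train.py`, so the following is only a minimal sketch of how such a script might read those arguments; the function name `parse_args` and the default values are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments appended to the training script's command line.

    Hypothetical helper: the real train.py is not shown in this notebook.
    """
    parser = argparse.ArgumentParser()
    # Reference to the input dataset supplied through ScriptRunConfig.
    parser.add_argument("--input_data", type=str)
    # Hyperparameters sampled by HyperDrive; the defaults are placeholders.
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--n_estimators", type=int, default=100)
    return parser.parse_args(argv)


# Simulate one sampled configuration being passed on the command line.
args = parse_args(["--input_data", "mental_health",
                   "--max_depth", "3", "--n_estimators", "500"])
print(args.max_depth, args.n_estimators)
```

Declaring the hyperparameters with `type=int` means the script receives numbers rather than strings, which matters if they are later passed to a model constructor.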

I used `RandomParameterSampling` because it is efficient in both computation and time. The table below compares random and grid sampling.

| Random Sampling | Grid Sampling |
| --- | --- |
| Supports early termination | Supports early termination |
| Can use both discrete and continuous values | Can only use `choice` (discrete values) |
| Selects randomly | Tries all possible combinations of the hyperparameters |
| More computationally efficient (we have limited resources) | Less computationally efficient |
| Takes less time (the lab has a time limit) | Takes more time |
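
To make the cost difference concrete, the sketch below counts how many configurations each strategy would evaluate for the search space and run budget used in this notebook. It is plain Python, not the HyperDrive sampler itself: `random.Random` and the fixed seed are stand-ins chosen only so the illustration is repeatable.

```python
import itertools
import random

# Search space and budget taken from this notebook's HyperDrive configuration.
n_estimators_options = [200, 400, 500, 700, 1000]
max_depth_options = list(range(2, 7))
max_total_runs = 12

# Grid sampling would try every combination of the two hyperparameters.
grid = list(itertools.product(n_estimators_options, max_depth_options))

# Random sampling draws only as many configurations as the run budget allows.
rng = random.Random(0)  # assumption: fixed seed for a repeatable illustration
sampled = [
    (rng.choice(n_estimators_options), rng.choice(max_depth_options))
    for _ in range(max_total_runs)
]

print(len(grid), "grid combinations vs", len(sampled), "random draws")
```

For this search space a full grid needs 25 runs, roughly twice the 12-run budget used here.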

* An early termination policy stops poorly performing runs before they finish. The tuning job therefore takes less time overall, because a stopped run frees the compute for another hyperparameter set. For early stopping, we used `BanditPolicy`, which cancels runs whose primary metric falls outside the defined slack of the best run so far.
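
The slack rule can be sketched in a few lines. This is an illustration of the documented behaviour for a maximize goal, where a run is cancelled when its metric drops below `best / (1 + slack_factor)`; it is not the HyperDrive implementation, and the helper names `bandit_cutoff` and `should_terminate` are made up for this example.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric a run may report without being cancelled (maximize goal)."""
    return best_metric / (1 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if the run falls outside the allowed slack of the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


# Illustrative value close to the best accuracy reported in this notebook.
best = 0.798
print(round(bandit_cutoff(best, 0.1), 4))
print(should_terminate(0.76, best))
print(should_terminate(0.70, best))
```

With a best accuracy near 0.798 and `slack_factor=0.1`, the cutoff is about 0.7255, so a run reporting 0.76 would keep going under this rule while a run at 0.70 would be cancelled. The real policy also takes `evaluation_interval` and `delay_evaluation` into account, which this sketch leaves out.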

In [43]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
early_termination_policy = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

#TODO: Create the different params that you will be using during training
param_sampling = RandomParameterSampling({
    "n_estimators": choice(200, 400, 500, 700, 1000),
    "max_depth": choice(range(2, 7)),
})

if "training" not in os.listdir():
    os.mkdir("./training")

# TODO: Create your estimator and hyperdrive config.
# A ScriptRunConfig is used here in place of an estimator.
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=cpu_cluster,
                      arguments=['--input_data', dataset.as_named_input("mental_health")],
                      environment=my_env)

hyperdrive_run_config = HyperDriveConfig(run_config=src,
                                         hyperparameter_sampling=param_sampling,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         policy=early_termination_policy,
                                         max_total_runs=12,
                                         max_concurrent_runs=4)

In [44]:
#TODO: Submit your experiment
hyperdrive_run = experiment.submit(config=hyperdrive_run_config)

## Run Details

*OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?*

The models achieved similar results, with accuracy ranging from 0.76 to 0.798. Models with more estimators and a larger `max_depth` performed worse, probably because they overfit the training data.


TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [45]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [46]:
hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_401a63d3-75df-466c-8e88-2c9cd2b30e78
Web View: https://ml.azure.com/experiments/capstone-project/runs/HD_401a63d3-75df-466c-8e88-2c9cd2b30e78?wsid=/subscriptions/9b72f9e6-56c5-4c16-991b-19c652994860/resourcegroups/aml-quickstarts-133870/workspaces/quick-starts-ws-133870

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-01-07T20:52:19.778287][API][INFO]Experiment created<END>\n""<START>[2021-01-07T20:52:20.231859][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-01-07T20:52:20.459595][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2021-01-07T20:52:21.6159929Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END><START>[2021-01-07T20:52:52.0090893Z][SCHEDULER][INFO]Scheduling job, id='HD_401a63d3-75df-466c-8e88-2c9cd2b30e78_0'<END><START>[2021-01-07T20:52:52.0082627Z][SCHEDULER][INFO]The execution enviro

{'runId': 'HD_401a63d3-75df-466c-8e88-2c9cd2b30e78',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-07T20:52:19.568313Z',
 'endTimeUtc': '2021-01-07T21:04:39.2125Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '8d2d0501-7f51-4c6a-9de7-9c0a4d3ffc52',
  'score': '0.7983870967741935',
  'best_child_run_id': 'HD_401a63d3-75df-466c-8e88-2c9cd2b30e78_2',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg133870.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_401a63d3-75df-466c-8e88-2c9cd2b30e78/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=CWf9edFqpz9REBH4d0vyAbFy4Ql6PUnoJDdVeOO3DTI%3D&st=2021-01-07T20%3A54%3A47Z&se=2021-01-08T05%3A04%3A47Z&sp=r'}}

In [47]:
assert hyperdrive_run.get_status() == "Completed"

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [50]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])


['--input_data', 'DatasetConsumptionConfig:mental_health', '--max_depth', '3', '--n_estimators', '500']


In [56]:
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
capstone-project,HD_401a63d3-75df-466c-8e88-2c9cd2b30e78_2,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [51]:
for file in best_run.get_file_names():
    print(file)

azureml-logs/55_azureml-execution-tvmps_c4894330e38a04f502f5aab2cf763f0878173efe1024f7a0b98d4471ed5e593b_d.txt
azureml-logs/65_job_prep-tvmps_c4894330e38a04f502f5aab2cf763f0878173efe1024f7a0b98d4471ed5e593b_d.txt
azureml-logs/70_driver_log.txt
azureml-logs/75_job_post-tvmps_c4894330e38a04f502f5aab2cf763f0878173efe1024f7a0b98d4471ed5e593b_d.txt
azureml-logs/process_info.json
azureml-logs/process_status.json
logs/azureml/101_azureml.log
logs/azureml/dataprep/backgroundProcess.log
logs/azureml/dataprep/backgroundProcess_Telemetry.log
logs/azureml/dataprep/engine_spans_5ddde550-0ed9-47a8-90eb-179ddbd5749c.jsonl
logs/azureml/dataprep/python_span_49c1b5b5-ef2c-4d42-9333-fc9a695bbc0f.jsonl
logs/azureml/dataprep/python_span_5ddde550-0ed9-47a8-90eb-179ddbd5749c.jsonl
logs/azureml/job_prep_azureml.log
logs/azureml/job_release_azureml.log
outputs/model.pkl


In [52]:
accuracy=best_run.get_metrics()["Accuracy"]
print("Accuracy: ",accuracy)

Accuracy:  0.7983870967741935


In [54]:
#TODO: Save the best model
model = best_run.register_model(model_name='mental-health-clf', model_path="outputs/model.pkl")
best_run.download_file("outputs/model.pkl")

## Free resources

In [None]:
# delete the created compute
try:
    cpu_cluster.delete()
except ComputeTargetException:
    print("compute doesn't exist")
# delete the workspace
ws.delete()
