# Hyperparameter Tuning using HyperDrive

We start by importing the dependencies (not all of them are listed here).

In [1]:
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import choice
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

import os

## Dataset

Next, we use a separate `train.py` script to load the data from https://raw.githubusercontent.com/hananeouhammouch/Parkinsons-detection/master/parkinsons.data , define the hyperparameters (`C`, `max_iter`), clean the data (by removing the patient's name and separating the dependent and independent variables), and finally split it into training and test sets.

# train.py

    from sklearn.linear_model import LogisticRegression
    import argparse
    import os
    import numpy as np
    from sklearn.metrics import mean_squared_error
    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd
    from azureml.core.run import Run
    from azureml.data.dataset_factory import TabularDatasetFactory


    def clean_data(data):
        # Drop rows with missing values and the patient identifier,
        # then separate the target column ("status") from the features.
        x_df = data.to_pandas_dataframe().dropna()
        x_df.drop("name", inplace=True, axis=1)
        y_df = x_df.pop("status")

        return x_df, y_df


    def main():
        # Add arguments to the script
        parser = argparse.ArgumentParser()

        parser.add_argument('--C', type=float, default=1.0, help="Inverse of regularization strength. Smaller values cause stronger regularization")
        parser.add_argument('--max_iter', type=int, default=100, help="Maximum number of iterations to converge")

        args = parser.parse_args()

        # Create TabularDataset using TabularDatasetFactory
        # Data is located at:
        path_file = "https://raw.githubusercontent.com/hananeouhammouch/Parkinsons-detection/master/parkinsons.data"

        ds = TabularDatasetFactory.from_delimited_files(path=path_file)

        x, y = clean_data(ds)

        # Split data into train and test sets.
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30)

        run = Run.get_context()

        # np.float and np.int were removed from recent NumPy releases,
        # so the built-in float() and int() are used instead.
        run.log("Regularization Strength:", float(args.C))
        run.log("Max iterations:", int(args.max_iter))

        model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)

        accuracy = model.score(x_test, y_test)
        run.log("Accuracy", float(accuracy))


    if __name__ == '__main__':
        main()
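Note that `train.py` calls `train_test_split` without a `random_state`, so every child run draws a different 70/30 split, and part of the accuracy difference between runs can come from the split rather than from `C` or `max_iter`. The sketch below illustrates the idea in plain Python. `split_indices` is a hypothetical helper written for this illustration, and the row count of 195 is an assumption about the dataset rather than something stated in this notebook.

```python
import math
import random


def split_indices(n_rows, test_size=0.30, seed=None):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration; it mirrors the idea of
    train_test_split(..., test_size=0.30), not its implementation.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # seed=None gives a different order each call
    n_test = math.ceil(test_size * n_rows)  # round the test share up
    return indices[n_test:], indices[:n_test]


# Assumed row count, for illustration only.
train_idx, test_idx = split_indices(195, test_size=0.30, seed=42)
print(len(train_idx), len(test_idx))

# The same seed reproduces the same split, which is what a fixed random_state provides.
assert split_indices(195, seed=42) == (train_idx, test_idx)
```

Passing a fixed `random_state` to `train_test_split` in `train.py` would make the child runs directly comparable in the same way.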
   

# Define the workspace and the experiment name 



In [2]:
ws = Workspace.from_config()
experiment_name = 'Parkinson-classification'

experiment=Experiment(workspace=ws, name=experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = experiment.start_logging()

Workspace name: quick-starts-ws-132695
Azure region: southcentralus
Subscription id: d4ad7261-832d-46b2-b093-22156001df5b
Resource group: aml-quickstarts-132695


# Hyperdrive Configuration and Execution

## Define the compute cluster (STANDARD_DS3_V2, 4 nodes)

In [3]:
amlcompute_cluster_name = "cpu-clusters"

try:
    aml_compute = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                           max_nodes=4)
    aml_compute = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

aml_compute.wait_for_completion(show_output=True , min_node_count = 1, timeout_in_minutes = 2)

Creating
Succeeded....................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


## Explain the model and the reasons for choosing the algorithm, hyperparameters, termination policy, and config settings


The algorithm we chose for this classification problem is logistic regression (`LogisticRegression`), because we are trying to predict whether a patient has Parkinson's disease based on a range of biomedical voice measurements. The target has two outcomes (yes or no), which makes this a binary classification task.
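As a reminder of why logistic regression fits a two-outcome target, the sketch below shows how a linear score is squashed into a probability and then thresholded into a 0/1 label. The weights, bias, and feature values are made-up numbers for illustration, not fitted values from this experiment.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single sample."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)


# Hypothetical weights and features, for illustration only.
prob, label = predict_label(features=[0.8, -1.2], weights=[1.5, 0.4], bias=-0.2)
print(f"P(status=1) = {prob:.3f}, predicted status = {label}")
```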

To improve the model, we optimize its hyperparameters using HyperDrive, Azure Machine Learning's hyperparameter tuning capability.

First, we define the hyperparameter space to sweep over, which here means the `C` and `max_iter` parameters. We use random sampling (`RandomParameterSampling`), which picks each hyperparameter value at random from its list of choices, to try different configurations in search of the one that maximizes the primary metric (`Accuracy`).
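To make the idea of random sampling over discrete choices concrete, here is a minimal plain-Python sketch. It is not HyperDrive's implementation: `sample_configurations` is a hypothetical helper, and the fixed seed is only there to make the illustration repeatable. The value lists are the ones used in this notebook's `param_sampling`.

```python
import itertools
import random

# Search space copied from the notebook's param_sampling definition.
search_space = {
    "--C": [0.1, 0.2, 0.3, 0.4, 0.5],
    "--max_iter": [100, 150, 200, 250, 300],
}


def sample_configurations(space, n_samples, seed=None):
    """Draw n_samples configurations, picking each value independently at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_samples)
    ]


grid_size = len(list(itertools.product(*search_space.values())))
samples = sample_configurations(search_space, n_samples=20, seed=0)

print(f"{grid_size} possible combinations, {len(samples)} sampled")
print(samples[:3])
```

With 5 values per hyperparameter there are 25 possible combinations, so a budget of 20 runs covers most of the space without an exhaustive grid search. In this sketch the same combination can be drawn more than once.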

Then we define the early termination policy using `BanditPolicy` with a slack factor of 0.01, applied every time the primary metric is reported (`evaluation_interval=1`). The policy conserves resources by terminating runs whose primary metric falls outside the allowed slack relative to the best-performing run so far. It does not guarantee that each run improves on the previous one.
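The slack-factor rule can be sketched in a few lines. For a metric that is being maximized, a run is cut when its metric is below `best / (1 + slack_factor)`. The function name and the accuracy values below are hypothetical and only illustrate the arithmetic. This is not the service's code.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (maximize goal).

    With a goal of maximizing the metric, the cut-off is
    best_metric / (1 + slack_factor).
    """
    cutoff = best_metric / (1.0 + slack_factor)
    return run_metric < cutoff


best_so_far = 0.95   # hypothetical best accuracy among the runs so far
slack = 0.01         # slack_factor used in this notebook

for accuracy in (0.95, 0.945, 0.93):   # hypothetical child-run accuracies
    verdict = "terminate" if bandit_should_terminate(accuracy, best_so_far, slack) else "keep"
    print(f"accuracy={accuracy:.3f} -> {verdict}")
```

With a slack factor of 0.01 the cut-off sits roughly one percent below the best run (about 0.9406 in this example), so the policy is fairly aggressive.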

Once that is done, we create the `SKLearn` estimator, which points to `train.py` and to the compute cluster.

Finally, we define the HyperDrive configuration, where we set the maximum number of runs to 20 (because we don't have a lot of data) and reuse the elements defined above, before submitting the experiment.

In [4]:
# Create the hyperparameter sampling that will be used during training
param_sampling = RandomParameterSampling( {
    "--C":  choice(0.1, 0.2, 0.3, 0.4, 0.5),
    "--max_iter":  choice(100, 150, 200, 250, 300)
    }
)

#Create an early termination policy.
early_termination_policy = BanditPolicy(evaluation_interval=1, slack_factor=0.01)

# Create the estimator and the HyperDrive configuration
estimator =  SKLearn(source_directory='./', 
                entry_script='train.py', compute_target=aml_compute)


hyperdrive_run_config =HyperDriveConfig(hyperparameter_sampling=param_sampling, 
                                    primary_metric_name='Accuracy', 
                                    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                    policy=early_termination_policy,
                                    max_total_runs=20,
                                    max_concurrent_runs=4,
                                    estimator=estimator
                                   )

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
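The warning above suggests replacing the `SKLearn` estimator with `ScriptRunConfig`. A minimal sketch of what that could look like follows. It assumes the `AzureML-Tutorial` curated environment named in the warning is available in the workspace and contains the packages `train.py` needs. `build_hyperdrive_config` is a hypothetical helper, not part of the SDK, and the sketch has not been run against this workspace.

```python
def build_hyperdrive_config(ws, compute_target, param_sampling, policy):
    """Sketch: HyperDriveConfig built on ScriptRunConfig instead of the SKLearn estimator."""
    # Imported inside the function so the sketch can be loaded without the SDK installed.
    from azureml.core import Environment, ScriptRunConfig
    from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

    # Curated environment named in the deprecation warning (assumed to be available).
    env = Environment.get(workspace=ws, name="AzureML-Tutorial")

    src = ScriptRunConfig(
        source_directory="./",
        script="train.py",
        compute_target=compute_target,
        environment=env,
    )

    return HyperDriveConfig(
        run_config=src,  # takes the place of estimator=...
        hyperparameter_sampling=param_sampling,
        policy=policy,
        primary_metric_name="Accuracy",
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=20,
        max_concurrent_runs=4,
    )
```

In this notebook it would be called as `build_hyperdrive_config(ws, aml_compute, param_sampling, early_termination_policy)`, and the result submitted with `experiment.submit(config=...)` as before.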


In [5]:
#Submit the experiment

hyperdrive_run = experiment.submit(config=hyperdrive_run_config)




## Run Details

We use the `RunDetails` widget to follow the progress of the different runs.

In [6]:
RunDetails(hyperdrive_run).show()

hyperdrive_run.wait_for_completion(show_output=True)

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

RunId: HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046
Web View: https://ml.azure.com/experiments/Parkinson-classification/runs/HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046?wsid=/subscriptions/d4ad7261-832d-46b2-b093-22156001df5b/resourcegroups/aml-quickstarts-132695/workspaces/quick-starts-ws-132695

Streaming azureml-logs/hyperdrive.txt

<START>[2020-12-31T21:15:18.822174][API][INFO]Experiment created<END>
<START>[2020-12-31T21:15:19.556874][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>
<START>[2020-12-31T21:15:19.388870][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2020-12-31T21:15:20.3353703Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046
Web View: https://ml.azure.com/experiments/Parkinson-classification/runs/HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046?wsid=/s

{'runId': 'HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046',
 'target': 'cpu-clusters',
 'status': 'Completed',
 'startTimeUtc': '2020-12-31T21:15:18.075049Z',
 'endTimeUtc': '2020-12-31T21:27:43.612848Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': 'f78d3ee8-247a-4bd9-8e04-0b0de03b1888',
  'score': '0.9661016949152542',
  'best_child_run_id': 'HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046_9',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg132695.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=XJpT0kO%2Fp5PA0Mt42JghJ8IAPa4Czhok4%2FQMo8V8xq8%3D&st=2020-12-31T21%3A17%3A54Z&se=2021-01-01T05%3A27%3A54Z&sp=r'}}
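The summary dictionary stores several values as strings, including the score and the metric configuration, so a little parsing is needed before they can be used. The sketch below works on a subset of the output printed above. `summarize` is a hypothetical helper, and in the notebook the dictionary would come from a call such as `hyperdrive_run.get_details()`.

```python
import json
from datetime import datetime

# Subset of the run details printed above.
details = {
    "status": "Completed",
    "startTimeUtc": "2020-12-31T21:15:18.075049Z",
    "endTimeUtc": "2020-12-31T21:27:43.612848Z",
    "properties": {
        "primary_metric_config": '{"name": "Accuracy", "goal": "maximize"}',
        "score": "0.9661016949152542",
        "best_child_run_id": "HD_dd822bff-dc7e-4393-a5a2-9faf2ff77046_9",
    },
}

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize(run_details):
    """Pull the headline numbers out of a HyperDrive run-details dictionary."""
    props = run_details["properties"]
    metric = json.loads(props["primary_metric_config"])  # stored as a JSON string
    started = datetime.strptime(run_details["startTimeUtc"], TIME_FORMAT)
    ended = datetime.strptime(run_details["endTimeUtc"], TIME_FORMAT)
    return {
        "metric": metric["name"],
        "goal": metric["goal"],
        "best_score": float(props["score"]),  # stored as a string
        "best_child_run_id": props["best_child_run_id"],
        "duration_seconds": (ended - started).total_seconds(),
    }


summary = summarize(details)
print(summary)
```

For this run the sweep took about 12.4 minutes and the best child run, number 9, reached an accuracy of roughly 0.966.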

## Best Model

In the cell below, we get the best run from the HyperDrive experiment and display the arguments it was launched with and the files it produced.

In [7]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
print(best_run.get_file_names())

['--C', '0.2', '--max_iter', '200']
['azureml-logs/55_azureml-execution-tvmps_bbd1377372381a54cce55fd7086c398db85cc976122004d210c02836d2955e0b_d.txt', 'azureml-logs/65_job_prep-tvmps_bbd1377372381a54cce55fd7086c398db85cc976122004d210c02836d2955e0b_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_bbd1377372381a54cce55fd7086c398db85cc976122004d210c02836d2955e0b_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/103_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log']
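The argument list printed for the best run is exactly what `train.py` receives on its command line. As a small check, the sketch below rebuilds the same two `argparse` arguments that `train.py` declares and parses that list back into typed values. `build_parser` is a hypothetical helper that only mirrors the script's parser.

```python
import argparse


def build_parser():
    """Same two arguments that train.py declares."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--max_iter", type=int, default=100)
    return parser


# Argument list printed for the best run above.
best_run_arguments = ["--C", "0.2", "--max_iter", "200"]

args = build_parser().parse_args(best_run_arguments)
print(args.C, args.max_iter)
```

The best configuration found by the sweep is therefore `C=0.2` with `max_iter=200`.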


In [8]:
#Save and register the best model
model = best_run.register_model(model_name='Parkinson_detection', model_path='./')
