# Hyperparameter Tuning using HyperDrive

## Introduction
Asteroids are minor planets, especially those of the inner Solar System. Larger asteroids have also been called planetoids.
Studying asteroids is important because past events show that some of them can be hazardous.

For this Capstone project, I use machine learning to predict whether an asteroid is hazardous.

## Dataset
The data describes asteroids and is provided by NEOWS (Near-Earth Object Web Service). It is a NASA dataset and can be found on Kaggle. One can download the dataset from [this](https://www.kaggle.com/shrutimehta/nasa-asteroids-classification/download) link.

The dataset contains various information about the asteroids and labels each asteroid as hazardous (1) or non-hazardous (0).
The dataset consists of ***4687 data instances (rows) and 40 features (columns)***. There are no null values in the dataset.

For this project, I downloaded the dataset, saved it in the project's GitHub repository, and access it using [this](https://raw.githubusercontent.com/Anupriya-S/Capstone-Azure-Machine-Learning-Engineer/main/nasa.csv) link.
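
The figures quoted above (row count, column count, no null values) can be checked once the CSV has been downloaded. The following is a minimal sketch using only the standard library. The helper `summarize_csv` and the toy column names are hypothetical and exist only for illustration. On the real file one would pass the downloaded text instead of `toy_csv`.

```python
import csv
import io


def summarize_csv(text):
    """Return (data rows, columns, empty cells) for CSV text with a header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_rows = 0
    n_missing = 0
    for row in reader:
        n_rows += 1
        # Count cells that are empty or contain only whitespace.
        n_missing += sum(1 for cell in row if cell.strip() == "")
    return n_rows, len(header), n_missing


# Toy data with hypothetical columns; the second row has one empty cell.
toy_csv = "name,diameter_km,hazardous\nA,0.12,0\nB,,1\nC,0.40,0\n"
rows, cols, missing = summarize_csv(toy_csv)
print(rows, cols, missing)
```

If the description above holds, the same function applied to `nasa.csv` should report 4687 rows, 40 columns, and no empty cells.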

## Azure Machine Learning SDK-specific Imports
In the cell below, we are importing all the dependencies that we will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
from azureml.core import Environment

## Initialize Workspace and Create an Azure ML Experiment
Let's initialize a workspace object from persisted configuration and create an experiment named "capstone_hyperdrive".

In [2]:
ws = Workspace.from_config()
experiment_name = 'capstone_hyperdrive'

exp=Experiment(ws, experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EMAH34ZFB to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
Workspace name: quick-starts-ws-137239
Azure region: southcentralus
Subscription id: 61c5c3f0-6dc7-4ed9-a7f3-c704b20e3b30
Resource group: aml-quickstarts-137239


## Create or Attach an AmlCompute Cluster
We need to create a compute target for our HyperDrive run. This compute target can also be used for running the AutoML run in the next step.

In [3]:
aml_compute_target = "cpu-cluster"

try:
  aml_compute = AmlCompute(ws, aml_compute_target)
  print("Found existing compute target!")
except ComputeTargetException:
  print("Creating new compute cluster...")
  provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", min_nodes = 1, max_nodes = 4)
  aml_compute = ComputeTarget.create(ws, aml_compute_target, provisioning_config)
  aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print("Azure Machine Learning Compute Cluster Created!")

Creating new compute cluster...
Creating
Succeeded.......................
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
Azure Machine Learning Compute Cluster Created!


## HyperDrive Configuration
Since this is a classification problem, we are using a ***Decision Tree Classifier***. Decision trees are a popular and easily interpreted tool for classification and prediction. They can handle high-dimensional data and often reach good accuracy on tabular data. Because our dataset has a large number of features, a Decision Tree Classifier is a reasonable choice.

For the parameter sampler, we used the ***Random Sampling*** method. ***Random sampling*** draws each hyperparameter value at random from its search space, which takes far less time than an exhaustive search and often gives results that are almost as good.
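
To make the idea concrete, here is a small standard-library sketch of random sampling over the same value ranges that this notebook configures for HyperDrive. It is not how HyperDrive is implemented. The `SEARCH_SPACE` layout and the `sample_configuration` helper are assumptions made for illustration.

```python
import random

# Same ranges as the HyperDrive search space in this notebook,
# written as (kind, specification) pairs for this sketch only.
SEARCH_SPACE = {
    "--max_depth": ("choice", [5, 10, 15, 20, 25]),
    "--min_samples_split": ("choice", [2, 10, 50, 90, 100, 150, 200]),
    "--min_impurity_decrease": ("uniform", (0.0, 1.0)),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently and at random."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "choice":
            config[name] = rng.choice(spec)
        elif kind == "uniform":
            low, high = spec
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"Unknown distribution kind: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(SEARCH_SPACE, rng) for _ in range(4)]
for sample in samples:
    print(sample)
```

Each draw is independent of the previous ones, which is why the cost grows with the number of runs rather than with the size of the search space.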

In this case, the parameter search space consists of three hyperparameters:
1. Maximum Depth of the Decision tree
2. Minimum number of samples to be present at a node before splitting further
3. Minimum decrease in impurity after every split
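
The third hyperparameter is the least intuitive, so a worked example may help. The sketch below computes the weighted impurity decrease of a split, following the formula given in scikit-learn's documentation for `min_impurity_decrease`, with Gini impurity as the criterion. The use of Gini is an assumption, since `train.py` is not shown here, and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)."""
    n_t = len(parent)
    return (n_t / n_total) * (
        gini(parent)
        - len(right) / n_t * gini(right)
        - len(left) / n_t * gini(left)
    )


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A perfect split separates the two classes completely.
perfect = weighted_impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1], n_total=8)

# A weaker split leaves one misplaced sample on each side.
weak = weighted_impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1], n_total=8)

print(perfect, weak)
```

A split is only made when this value is at least the configured threshold. Under Gini impurity a two-class node has impurity of at most 0.5, so if the training script uses that criterion, a threshold above 0.5 can never be met and the tree would stay unsplit.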

The early termination policy (***BanditPolicy***) stops runs that are performing poorly compared with the best run so far. With `evaluation_interval=2` and `slack_factor=0.1`, the policy is applied at every second interval at which the primary metric (Accuracy) is reported. Because the goal is to maximize the metric, any run whose best Accuracy is below the best run's Accuracy divided by (1 + 0.1), roughly 91% of the best value, is terminated. This saves us from continuing to explore hyperparameters that show little promise of reaching our target metric.
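
The slack-factor rule can be written out in a few lines. This is a plain-Python sketch of the rule for a metric that is being maximized, not the Azure ML implementation. The function names are hypothetical.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may have without being terminated (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def is_evaluation_point(interval, evaluation_interval=2):
    """The policy is applied only at multiples of evaluation_interval."""
    return interval > 0 and interval % evaluation_interval == 0


def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """True if the run falls outside the allowed slack relative to the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_so_far = 0.99
print(bandit_cutoff(best_so_far, 0.1))      # cutoff is roughly 0.90
print(should_terminate(0.85, best_so_far))  # below the cutoff
print(should_terminate(0.95, best_so_far))  # within the slack
```

With a best Accuracy of 0.99, a run reporting 0.85 would be cancelled at the next evaluation point, while a run at 0.95 would be allowed to continue.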

In ***HyperDriveConfig*** we specified Accuracy as the primary metric, and our goal is to maximize it. HyperDrive will execute at most 40 runs, with up to 4 running concurrently.

***Note: Since we are using a curated Azure ML environment, we don't need to create an environment file ourselves.***

In [4]:
# Early termination policy (not required when using Bayesian sampling)
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Hyperparameter search space sampled during training
param_sampling = RandomParameterSampling(
    {
        "--max_depth": choice(5, 10, 15, 20, 25),
        "--min_samples_split": choice(2, 10, 50, 90, 100, 150, 200),
        "--min_impurity_decrease": uniform(0.0, 1.0)
    }
)

sklearn_env = Environment.get(workspace=ws, name='AzureML-Tutorial')

# Estimator and HyperDrive configuration
estimator = SKLearn(source_directory=".",
              compute_target=aml_compute,
              entry_script='train.py',
              environment_definition=sklearn_env)

hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling, 
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=40,
                                     max_concurrent_runs=4,
                                     policy=early_termination_policy)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
If environment_definition or conda_dependencies_file_path is specified, Azure ML will not install any framework related packages on behalf of the user.
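
The estimator above runs `train.py`, which is not reproduced in this notebook. HyperDrive passes each sampled value to that script as a command-line argument, so the script needs a matching parser. The following is a minimal sketch of what that parsing might look like, assuming `argparse` is used. The defaults and help texts are assumptions, not the project's actual code.

```python
import argparse


def build_parser():
    """Parser for the three hyperparameters sampled by HyperDrive."""
    parser = argparse.ArgumentParser(description="Train a decision tree classifier")
    parser.add_argument("--max_depth", type=int, default=None,
                        help="Maximum depth of the decision tree")
    parser.add_argument("--min_samples_split", type=int, default=2,
                        help="Minimum samples at a node before splitting further")
    parser.add_argument("--min_impurity_decrease", type=float, default=0.0,
                        help="Minimum decrease in impurity required for a split")
    return parser


# Simulate the arguments of one child run instead of reading sys.argv.
args = build_parser().parse_args(
    ["--max_depth", "20", "--min_samples_split", "2", "--min_impurity_decrease", "0.03"]
)
print(args.max_depth, args.min_samples_split, args.min_impurity_decrease)
```

The argument names must match the keys used in the sampling configuration exactly, and the script must log a metric named `Accuracy` for the primary metric to be picked up.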


In [5]:
# Submit the HyperDrive run to the experiment

hyperdrive_run = exp.submit(hyperdrive_run_config, show_output=True)



## Run Details
In the cell below, we are using the `RunDetails` widget to show the training logs almost in real-time.

In [6]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [7]:
hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_82e5ec1b-3394-4e69-b892-fe39505eca3f
Web View: https://ml.azure.com/experiments/capstone_hyperdrive/runs/HD_82e5ec1b-3394-4e69-b892-fe39505eca3f?wsid=/subscriptions/61c5c3f0-6dc7-4ed9-a7f3-c704b20e3b30/resourcegroups/aml-quickstarts-137239/workspaces/quick-starts-ws-137239

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-02-04T15:27:19.663836][API][INFO]Experiment created<END>\n""<START>[2021-02-04T15:27:20.458412][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2021-02-04T15:27:21.0133695Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>"<START>[2021-02-04T15:27:20.153611][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n"<START>[2021-02-04T15:27:51.6482506Z][SCHEDULER][INFO]Scheduling job, id='HD_82e5ec1b-3394-4e69-b892-fe39505eca3f_3'<END><START>[2021-02-04T15:27:51.6263948Z][SCHEDULER][INFO]Scheduling job, i

{'runId': 'HD_82e5ec1b-3394-4e69-b892-fe39505eca3f',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-02-04T15:27:19.504945Z',
 'endTimeUtc': '2021-02-04T15:46:34.762952Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '025b21e3-6cdc-49ab-8536-87287965e5b2',
  'score': '0.9936034115138592',
  'best_child_run_id': 'HD_82e5ec1b-3394-4e69-b892-fe39505eca3f_21',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg137239.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_82e5ec1b-3394-4e69-b892-fe39505eca3f/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=GJcmWYJMKqO%2FzUZfKcXUjy8ldRSKWBEaEos6YFz1Ugk%3D&st=2021-02-04T15%3A36%3A56Z&se=2021-02-04T23%3A46%3A56Z&sp=r'},
 'submittedBy': 'ODL_User 137239'}
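
The run details include start and end timestamps, from which the total duration of the sweep can be worked out. A short sketch with the standard library, using the two values printed above (the helper name `run_duration` is hypothetical):

```python
from datetime import datetime

# Format of the 'startTimeUtc' and 'endTimeUtc' fields shown in the run details.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def run_duration(start_utc, end_utc):
    """Return the elapsed time between two UTC timestamps as a timedelta."""
    start = datetime.strptime(start_utc, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_utc, TIMESTAMP_FORMAT)
    return end - start


duration = run_duration("2021-02-04T15:27:19.504945Z", "2021-02-04T15:46:34.762952Z")
minutes, seconds = divmod(duration.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.1f} s")
```

For this experiment the whole HyperDrive run took a little over 19 minutes.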

In [8]:
assert(hyperdrive_run.get_status() == "Completed")

## Best Model

In the cell below, we retrieve the best run from the HyperDrive experiment using the `get_best_run_by_primary_metric()` method. We then display the details associated with the best HyperDrive run:
1. Run ID
2. Values of the hyperparameters optimized
3. Accuracy
4. Files uploaded during the run

In [9]:
# Retrieve the best run from the HyperDrive experiment

best_run = hyperdrive_run.get_best_run_by_primary_metric()

# details associated with the best HyperDrive run
print("Run ID:", best_run.id)
print(best_run.get_details()['runDefinition']['arguments'])
print("Accuracy =", best_run.get_metrics()['Accuracy'])

# list the model files uploaded during the run
print("\n", best_run.get_file_names())

Run ID: HD_82e5ec1b-3394-4e69-b892-fe39505eca3f_21
['--max_depth', '20', '--min_impurity_decrease', '0.03315527414519093', '--min_samples_split', '2']
Accuracy = 0.9936034115138592

 ['azureml-logs/55_azureml-execution-tvmps_74375af68f2d4e81ed4e9e87c14fc21ac4c48a222a1e3ae6e8874f0d358455ac_d.txt', 'azureml-logs/65_job_prep-tvmps_74375af68f2d4e81ed4e9e87c14fc21ac4c48a222a1e3ae6e8874f0d358455ac_d.txt', 'azureml-logs/70_driver_log.txt', 'azureml-logs/75_job_post-tvmps_74375af68f2d4e81ed4e9e87c14fc21ac4c48a222a1e3ae6e8874f0d358455ac_d.txt', 'azureml-logs/process_info.json', 'azureml-logs/process_status.json', 'logs/azureml/103_azureml.log', 'logs/azureml/job_prep_azureml.log', 'logs/azureml/job_release_azureml.log', 'outputs/hyperdrive-model.joblib']
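
The argument list printed for the best run is a flat sequence of flags and values. To reuse those hyperparameters elsewhere, it can be convenient to turn the list into a dictionary. This is a small sketch that assumes the list contains only flag/value pairs, as it does in the output above. The helper `arguments_to_dict` is hypothetical.

```python
def arguments_to_dict(arguments):
    """Pair each '--flag' with the value that follows it."""
    if len(arguments) % 2 != 0:
        raise ValueError("Expected an even number of items (flag/value pairs)")
    flags = arguments[0::2]
    values = arguments[1::2]
    return {flag.lstrip("-"): value for flag, value in zip(flags, values)}


# Argument list copied from the best run's output above.
best_arguments = ['--max_depth', '20', '--min_impurity_decrease',
                  '0.03315527414519093', '--min_samples_split', '2']

raw = arguments_to_dict(best_arguments)

# Values arrive as strings, so convert them to the types the model expects.
best_params = {
    "max_depth": int(raw["max_depth"]),
    "min_samples_split": int(raw["min_samples_split"]),
    "min_impurity_decrease": float(raw["min_impurity_decrease"]),
}
print(best_params)
```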


In the cell below, we are registering the best model trained by HyperDrive.

In [10]:
# Register the model file saved by the best run

h_model = best_run.register_model(model_name='hyperdrive-model', model_path='outputs/hyperdrive-model.joblib')

## Model Deployment

Since this is not the better of the two models we have, we will not go further with the deployment of this model.