# Hyperparameter Tuning using HyperDrive


## Imports
Import all the dependencies needed to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice

import os
import joblib
import pandas as pd
import logging

## Workspace Configuration

In [2]:
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="udacity-project")

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: quick-starts-ws-140005
Azure region: southcentralus
Subscription id: 9e65f93e-bdd8-437b-b1e8-0647cd6098f7
Resource group: aml-quickstarts-140005


## Create compute cluster

In [3]:
# Choose a name for your cluster.
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview

In this project, the *lending club* dataset from LendingClub, an American peer-to-peer lending company, was used. The purpose is to use the data for risk analytics and risk minimization in a banking and financial context. To that end, statistical information about past loan applicants is used to train a supervised learning model whose label is whether or not the applicant failed to fully repay the loan, so that it can predict whether a new applicant is likely to repay. The aim is for the model to identify patterns in the dataset that help determine the outcome of a new application based on the applicant's financial history. In this way, the probability of default can be assessed and lenders can make an informed decision that may reduce credit loss, e.g. by denying the loan, raising the interest rate, or offering a different loan amount.

The task at hand is therefore not only to train an accurate predictive model using a logistic regression algorithm, but also to gain insight into the most important features that determine the model's output. This allows the company to understand which variables are strong indicators of loan default and apply this knowledge in future risk assessment.

Please note that only an overview of the dataset is provided here; the actual exploratory data analysis (EDA) is carried out in the `train.py` script.

Below is a table with all the information available in the dataset for training the model.


|      LoanStatNew     |                                                                                                Description                                                                                               |
|:--------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| loan_amnt            | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.                             |
| term                 | The number of payments on the loan. Values are in months and can be either 36 or 60.                                                                                                                     |
| int_rate             | Interest Rate on the loan                                                                                                                                                                                |
| installment          | The monthly payment owed by the borrower if the loan originates.                                                                                                                                         |
| grade                | LC assigned loan grade                                                                                                                                                                                   |
| sub_grade            | LC assigned loan subgrade                                                                                                                                                                                |
| emp_title            | The job title supplied by the Borrower when applying for the loan.                                                                                                                                       |
| emp_length           | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.                                                                        |
| home_ownership       | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER                                                    |
| annual_inc           | The self-reported annual income provided by the borrower during registration.                                                                                                                            |
| verification_status  | Indicates if income was verified by LC, not verified, or if the income source was verified                                                                                                               |
| issue_d              | The month which the loan was funded                                                                                                                                                                      |
| loan_status          | Current status of the loan                                                                                                                                                                               |
| purpose              | A category provided by the borrower for the loan request.                                                                                                                                                |
| title                | The loan title provided by the borrower                                                                                                                                                                  |
| zip_code             | The first 3 numbers of the zip code provided by the borrower in the loan application.                                                                                                                    |
| addr_state           | The state provided by the borrower in the loan application                                                                                                                                               |
| dti                  | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| earliest_cr_line     | The month the borrower's earliest reported credit line was opened                                                                                                                                        |
| open_acc             | The number of open credit lines in the borrower's credit file.                                                                                                                                           |
| pub_rec              | Number of derogatory public records                                                                                                                                                                      |
| revol_bal            | Total credit revolving balance                                                                                                                                                                           |
| revol_util           | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.                                                                               |
| total_acc            | The total number of credit lines currently in the borrower's credit file                                                                                                                                 |
| initial_list_status  | The initial listing status of the loan. Possible values are – W, F                                                                                                                                       |
| application_type     | Indicates whether the loan is an individual application or a joint application with two co-borrowers                                                                                                     |
| mort_acc             | Number of mortgage accounts.                                                                                                                                                                             |
| pub_rec_bankruptcies | Number of public record bankruptcies                                                                                                                                                                     |



In [None]:
# Create Data directory and download dataset from Kaggle
# %mkdir Data
# %wget 

In [5]:
# Data was uploaded from a local file
df = pd.read_csv('./Data/lending_club_loan.csv')

## Hyperdrive Configuration


### Hyperparameters

In this experiment, a Logistic Regression model is trained using supervised learning for prediction purposes. The HyperDrive run sweeps different hyperparameter combinations to find the values that optimize the model's primary metric. Two hyperparameters are sampled: the inverse of the regularization strength, `--C`, and the maximum number of iterations, `--max_iter`.

Regularization is used to prevent the model from overfitting, which is a particular risk when there are many features but relatively little data. It penalizes large parameter values so that the model does not fit the training data almost perfectly at the expense of generalization. The stronger the regularization, the less a parameter's value can grow in response to small perturbations in the data. Because `--C` is the inverse of the regularization strength, smaller values mean stronger regularization; here it is sampled from a uniform distribution between 0.01 and 1.
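The effect of the inverse regularization strength can be illustrated with a minimal pure-Python sketch. This is not the code in `train.py`: the toy one-feature data, the learning rate and the helper name `fit_logistic_1d` are assumptions made for illustration. The objective has the form `0.5 * w**2 + C * (sum of log-losses)`, which is how scikit-learn documents its L2-penalized logistic regression, so a smaller `C` should pull the weight closer to zero.

```python
import math

# Toy, non-separable 1-D data (hypothetical; not taken from the lending club dataset).
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [-1, -1, 1, -1, 1, 1]  # labels in {-1, +1}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(C, lr=0.1, n_iter=2000):
    """Minimise 0.5 * w**2 + C * sum(log(1 + exp(-y * w * x))) by gradient descent."""
    w = 0.0
    for _ in range(n_iter):
        # Gradient of the data term: sum(-y * x * sigmoid(-y * w * x))
        grad_loss = sum(-y * x * sigmoid(-y * w * x) for x, y in zip(X, Y))
        # The penalty term 0.5 * w**2 contributes w to the gradient.
        w -= lr * (w + C * grad_loss)
    return w


w_strong = fit_logistic_1d(C=0.01)  # small C: strong regularization
w_weak = fit_logistic_1d(C=1.0)     # large C: weak regularization
print(f"C=0.01 -> w={w_strong:.4f}")
print(f"C=1.00 -> w={w_weak:.4f}")
```

On this toy problem the weight learned with `C=0.01` stays much closer to zero than the one learned with `C=1.0`, which is the shrinkage behaviour described above.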

The second hyperparameter, the maximum number of iterations, caps the number of iterations the solver is allowed to take to converge for a given set of parameters. Convergence is decided by the solver's own stopping criterion rather than by accuracy reaching its maximum value. The cap prevents a run from continuing without converging; the values tried are 100, 200, 500 and 1000.
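Conceptually, random sampling over this search space amounts to drawing each hyperparameter independently for every child run. The sketch below mimics that with the standard library only; it is not how HyperDrive is implemented, and `sample_configuration` is a hypothetical helper. The ranges and the count of 16 mirror the `RandomParameterSampling` definition and the `max_total_runs` setting used in this notebook.

```python
import random

# Search space mirroring the notebook's RandomParameterSampling definition.
C_RANGE = (0.01, 1.0)
MAX_ITER_CHOICES = (100, 200, 500, 1000)


def sample_configuration(rng):
    """Draw one hyperparameter combination (hypothetical helper)."""
    return {
        "--C": rng.uniform(*C_RANGE),                # continuous, uniform
        "--max_iter": rng.choice(MAX_ITER_CHOICES),  # discrete choice
    }


rng = random.Random(0)  # seeded only so the sketch is repeatable
configs = [sample_configuration(rng) for _ in range(16)]  # max_total_runs = 16

for cfg in configs[:4]:
    print(cfg)
```

Because each draw is independent, nothing prevents two runs from landing on very similar values of `--C`, which is one reason a random sweep may report several near-identical scores.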

### Termination policy

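A `BanditPolicy` was used. The policy is applied every two evaluation intervals (this can be modified with the `evaluation_interval` parameter), where an interval is counted each time the training script logs the primary metric. At each check, a run's primary metric is compared with that of the best performing run so far, and if it falls outside the allowed slack, i.e. below the best value divided by (1 + `slack_factor`) when the metric is being maximized, the run is terminated early. With `slack_factor=0.1`, as used here, the cutoff is roughly 91% of the best value.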
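The termination rule can be written down in a few lines. This is a sketch of the rule as documented for Azure ML when the goal is to maximize the metric and a `slack_factor` (rather than a `slack_amount`) is given; the function name `bandit_should_terminate` is hypothetical and the code does not call the SDK.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the slack of a maximizing Bandit policy."""
    # Runs are kept only if they reach at least best / (1 + slack_factor).
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


# Example with a best metric of 0.8 and a slack factor of 0.2:
# the cutoff is 0.8 / 1.2, so 0.65 is terminated while 0.70 keeps running.
print(bandit_should_terminate(0.65, best_metric=0.8, slack_factor=0.2))
print(bandit_should_terminate(0.70, best_metric=0.8, slack_factor=0.2))
```

A larger `slack_factor` lowers the cutoff and makes the policy more tolerant; a smaller one terminates runs more aggressively.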

### Config settings

The training script and the compute target are passed to an `SKLearn` estimator, which also provides the framework's dependencies.

The final step is to pass the hyperparameter sampling, the termination policy and the estimator to the `HyperDriveConfig` constructor, along with the primary metric to track during training, **Accuracy**, and the goal for that metric, which is to be **maximized**. The last two parameters, `max_total_runs` and `max_concurrent_runs`, set the maximum total number of runs (16) and the maximum number of runs executed in parallel (4), respectively.

Finally, the hyperdrive run is submitted and the model from the best run saved.

In [4]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

#TODO: Create the different params that you will be using during training
ps = RandomParameterSampling(
    {
        '--C' : uniform(0.01,1),
        '--max_iter': choice(100,200,500,1000)
    }
)

#TODO: Create your estimator and hyperdrive config
est = SKLearn(
    source_directory = './',
    compute_target = compute_target,
    entry_script = 'train.py'
)

hyperdrive_config = HyperDriveConfig(
    estimator = est,
    hyperparameter_sampling = ps,
    policy = policy,
    primary_metric_name = 'Accuracy',
    primary_metric_goal = PrimaryMetricGoal.MAXIMIZE,
    max_total_runs = 16,
    max_concurrent_runs = 4
)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.
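The warning above suggests moving from the `SKLearn` estimator to `ScriptRunConfig`. A minimal sketch of what that could look like follows. It assumes that the `AzureML-Tutorial` curated environment named in the warning is still available in the workspace and contains everything `train.py` needs, which has not been verified here; `build_hyperdrive_config` is a hypothetical helper name.

```python
def build_hyperdrive_config(ws, compute_target, sampling, policy):
    """Sketch: the same sweep as in this notebook, built on ScriptRunConfig."""
    # Imported inside the function so the sketch can be loaded without the SDK installed.
    from azureml.core import Environment, ScriptRunConfig
    from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

    # Assumption: the curated environment named in the deprecation warning exists.
    env = Environment.get(workspace=ws, name="AzureML-Tutorial")

    src = ScriptRunConfig(
        source_directory="./",
        script="train.py",
        compute_target=compute_target,
        environment=env,
    )

    # HyperDriveConfig takes `run_config` in place of `estimator`.
    return HyperDriveConfig(
        run_config=src,
        hyperparameter_sampling=sampling,
        policy=policy,
        primary_metric_name="Accuracy",
        primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
        max_total_runs=16,
        max_concurrent_runs=4,
    )
```

In this notebook it would be called as `hyperdrive_config = build_hyperdrive_config(ws, compute_target, ps, policy)` and submitted exactly as before.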


In [5]:
#TODO: Submit your experiment

hyperdrive_run = exp.submit(config=hyperdrive_config)
hyperdrive_run.wait_for_completion(show_output=True)



RunId: HD_7d331357-f725-491d-8619-84d4cccc7657
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_7d331357-f725-491d-8619-84d4cccc7657?wsid=/subscriptions/9e65f93e-bdd8-437b-b1e8-0647cd6098f7/resourcegroups/aml-quickstarts-140005/workspaces/quick-starts-ws-140005

Streaming azureml-logs/hyperdrive.txt

<START>[2021-03-07T17:17:53.206718][API][INFO]Experiment created<END>
<START>[2021-03-07T17:17:53.904731][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>
<START>[2021-03-07T17:17:54.819914][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>
<START>[2021-03-07T17:17:55.3577629Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>

Execution Summary
RunId: HD_7d331357-f725-491d-8619-84d4cccc7657
Web View: https://ml.azure.com/experiments/udacity-project/runs/HD_7d331357-f725-491d-8619-84d4cccc7657?wsid=/subscriptions/9e65f

{'runId': 'HD_7d331357-f725-491d-8619-84d4cccc7657',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-07T17:17:52.949985Z',
 'endTimeUtc': '2021-03-07T17:32:16.871536Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '53f3321b-5e4c-4481-b455-a607f0e061bc',
  'score': '0.8367365011892111',
  'best_child_run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_2',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg140005.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_7d331357-f725-491d-8619-84d4cccc7657/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=jaGpRyCNHCdENot0ML42TgMS55dHGfPeXeJNHjP4vHg%3D&st=2021-03-07T17%3A23%3A00Z&se=2021-03-08T01%3A33%3A00Z&sp=r'},
 'submittedBy': 'ODL_User 140005'}

## Run Details


In [6]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [7]:
# Print top five child runs for comparison
top_5 = hyperdrive_run.get_children_sorted_by_primary_metric(top=5, reverse=False, discard_no_metric=False)
for child_run in top_5:
    print(child_run)

{'run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_2', 'hyperparameters': '{"--C": 0.5898041139069939, "--max_iter": 200}', 'best_primary_metric': 0.8367365011892111, 'status': 'Completed'}
{'run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_6', 'hyperparameters': '{"--C": 0.9621344661493635, "--max_iter": 500}', 'best_primary_metric': 0.8367238500075908, 'status': 'Completed'}
{'run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_0', 'hyperparameters': '{"--C": 0.8159483668786063, "--max_iter": 1000}', 'best_primary_metric': 0.8367238500075908, 'status': 'Completed'}
{'run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_3', 'hyperparameters': '{"--C": 0.1851288640794702, "--max_iter": 100}', 'best_primary_metric': 0.8367111988259703, 'status': 'Completed'}
{'run_id': 'HD_7d331357-f725-491d-8619-84d4cccc7657_8', 'hyperparameters': '{"--C": 0.5768914096614193, "--max_iter": 200}', 'best_primary_metric': 0.83669854764435, 'status': 'Completed'}
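To compare these runs more easily, the `hyperparameters` field (a JSON string) can be parsed and tabulated. The sketch below works on values copied from the output above, with run IDs shortened to their numeric suffix, rather than calling the SDK; the variable names are illustrative.

```python
import json

# Values copied from the printed output above (run IDs shortened to their suffix).
top_5_printed = [
    {"run": "_2", "best_primary_metric": 0.8367365011892111,
     "hyperparameters": '{"--C": 0.5898041139069939, "--max_iter": 200}'},
    {"run": "_6", "best_primary_metric": 0.8367238500075908,
     "hyperparameters": '{"--C": 0.9621344661493635, "--max_iter": 500}'},
    {"run": "_0", "best_primary_metric": 0.8367238500075908,
     "hyperparameters": '{"--C": 0.8159483668786063, "--max_iter": 1000}'},
    {"run": "_3", "best_primary_metric": 0.8367111988259703,
     "hyperparameters": '{"--C": 0.1851288640794702, "--max_iter": 100}'},
    {"run": "_8", "best_primary_metric": 0.83669854764435,
     "hyperparameters": '{"--C": 0.5768914096614193, "--max_iter": 200}'},
]

rows = []
for child in top_5_printed:
    params = json.loads(child["hyperparameters"])  # stored as a JSON string
    rows.append((child["run"], params["--C"], params["--max_iter"],
                 child["best_primary_metric"]))

for run, c, max_iter, acc in rows:
    print(f"run {run}: C={c:.4f}, max_iter={max_iter:>4}, accuracy={acc:.6f}")

# Difference between the best and the fifth-best accuracy.
spread = rows[0][3] - rows[-1][3]
print(f"accuracy spread across the top five runs: {spread:.2e}")
```

The top five accuracies differ by less than 4e-5 even though `--C` ranges from about 0.19 to 0.96 and `--max_iter` from 100 to 1000. This suggests, without proving it, that within the sampled ranges the model's accuracy is largely insensitive to these two hyperparameters.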


## Best Model


In [8]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
# Fetch the run details once and reuse them for the arguments
run_parameter_values = best_run.get_details()
parameter_values = run_parameter_values['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print('\nAccuracy: ', best_run_metrics['Accuracy'])
print('\nHyperparameters: ', parameter_values)
print('\nRun parameters: ', run_parameter_values)

Best Run Id:  HD_7d331357-f725-491d-8619-84d4cccc7657_2

Accuracy:  0.8367365011892111

Hyperparameters:  ['--C', '0.5898041139069939', '--max_iter', '200']

Run parameters:  {'runId': 'HD_7d331357-f725-491d-8619-84d4cccc7657_2', 'target': 'cpu-cluster', 'status': 'Completed', 'startTimeUtc': '2021-03-07T17:22:56.856462Z', 'endTimeUtc': '2021-03-07T17:25:22.281355Z', 'properties': {'_azureml.ComputeTargetType': 'amlcompute', 'ContentSnapshotId': '53f3321b-5e4c-4481-b455-a607f0e061bc', 'ProcessInfoFile': 'azureml-logs/process_info.json', 'ProcessStatusFile': 'azureml-logs/process_status.json'}, 'inputDatasets': [], 'outputDatasets': [], 'runDefinition': {'script': 'train.py', 'command': '', 'useAbsolutePath': False, 'arguments': ['--C', '0.5898041139069939', '--max_iter', '200'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'cpu-cluster', 'dataReferences': {}, 'data': {}, 'outputData': {}, 'jobName': None, 'maxRunDurationSeconds': None, 'nod

In [9]:
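#TODO: Save the best model
# NOTE: this call serializes the best run's ID (a string), not a fitted estimator.
# To persist the trained model itself, retrieve the model file from the best run's
# output files, if train.py saved one.
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=best_run.id, filename='outputs/model.joblib')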

['outputs/model.joblib']
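Note that the cell above writes the best run's ID to `outputs/model.joblib`, not a trained estimator. If `train.py` saves the fitted model among the run's output files, it can be fetched from the best run instead. The sketch below assumes a remote file name of `outputs/model.joblib`, which is a guess and not confirmed by this notebook. It uses the `get_file_names` and `download_file` methods of an Azure ML `Run`; the helper name and the local directory are hypothetical.

```python
import os


def download_model_file(run, remote_name="outputs/model.joblib",
                        local_dir="downloaded_model"):
    """Download `remote_name` from `run` if it exists; return the local path or None."""
    # Assumption: train.py wrote the fitted model to `remote_name` during the run.
    if remote_name not in run.get_file_names():
        return None
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(remote_name))
    run.download_file(name=remote_name, output_file_path=local_path)
    return local_path
```

With the objects from this notebook the call would be `download_model_file(best_run)`. A return value of `None` means the run has no file under that name, in which case `best_run.get_file_names()` shows what is actually available.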