# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

## Dataset

### Overview

For this project, the data used is **Mobile Price Classification** ([data source](https://www.kaggle.com/iabhishekofficial/mobile-price-classification?select=train.csv))
from Kaggle. Kaggle describes the dataset as follows:

```
Bob has started his own mobile company. He wants to give tough fight to big companies like Apple,Samsung etc.

He does not know how to estimate price of mobiles his company creates. In this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data of mobile phones of various companies.

Bob wants to find out some relation between features of a mobile phone(eg:- RAM,Internal Memory etc) and its selling price. But he is not so good at Machine Learning. So he needs your help to solve this problem.

In this problem you do not have to predict actual price but a price range indicating how high the price is.
```

We are using the *train.csv* file.

### Task
*TODO*: Explain the task you are going to be solving with this dataset and the features you will be using for it.

As described above, we use technical characteristics of mobile phones
to classify each phone's price range into one of four classes, 0 to 3. Each phone
belongs to exactly one class, so this is a multi-class classification problem.

The features available are the following:

* **battery_power**: Total energy a battery can store at one time, measured in mAh.

* **blue**: Has Bluetooth or not.

* **clock_speed**: Speed at which the microprocessor executes instructions.

* **dual_sim**: Has dual SIM support or not.

* **fc**: Front camera megapixels.

* **four_g**: Has 4G or not.

* **int_memory**: Internal Memory in Gigabytes.

* **m_dep**: Mobile Depth in cm.

* **mobile_wt**: Weight of mobile phone.

* **n_cores**: Number of cores of processor.

* **pc**: Primary Camera mega pixels.

* **px_height**: Pixel Resolution Height.

* **px_width**: Pixel Resolution Width.

* **ram**: Random Access Memory in Mega Bytes.

* **sc_h**: Screen Height of mobile in cm.

* **sc_w**: Screen Width of mobile in cm.

* **talk_time**: Longest time that a single battery charge will last when you are talking.

* **three_g**: Has 3G or not.

* **touch_screen**: Has touch screen or not.

* **wifi**: Has wifi or not.

* **price_range**: This is the target variable with value of 0 (low cost), 1 (medium cost), 2 (high cost) and 3 (very high cost).


In this data the target is balanced in the training set, i.e., each class has almost the same representation. This is important because it makes it easier to build a model that generalizes using classical methods.
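A quick way to verify the balance claim is to compute the share of each class in the target column. The sketch below uses a toy label list rather than the real `price_range` column, and `class_shares` is a hypothetical helper name; with pandas, `df['price_range'].value_counts(normalize=True)` would give the same information.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Toy labels standing in for the price_range column (not the real data).
toy_labels = [0, 1, 2, 3] * 5
shares = class_shares(toy_labels)
print(shares)  # {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
```

Shares close to 0.25 for all four classes indicate a balanced target.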

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [3]:
ws = Workspace.from_config()
experiment_name = 'hyperdrive-mobile'

experiment=Experiment(ws, experiment_name)

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code A8WWBGX54 to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.


## Hyperdrive Configuration

**Steps**
1. Select Parameter Sampler
2. Define a Policy
3. Create an estimator using `train.py`.
4. Define a `HyperDriveConfig` that aims to maximise accuracy with at most 10 runs in total and at most 4 running concurrently, because the cluster has at most 4 nodes.
5. Submit the job and review the results.
6. Register the model with `.register_model()`. 

* **Parameter sampler**: I chose `RandomParameterSampling`
because it accepts both discrete and continuous hyperparameters. `lr` is drawn from a
uniform distribution between 0.01 and 0.3, which is a common range for this kind of
task. `max_depth` is chosen from the integers between 3 and 10; larger values of
`max_depth` tend to cause overfitting. `num_leaves` and `n_estimators` were chosen
similarly, trying to cover the range of typical values based on my own experience.
These three hyperparameters (`max_depth`, `num_leaves` and `n_estimators`) use a random
discrete choice.

* **Early Stopping Policy**: The policy chosen is `BanditPolicy`. It avoids
wasting compute on unpromising runs: at each evaluation interval it compares the
metric reported by each run with that of the best run so far and, if it is worse
by more than the allowed `slack_factor`, the run is terminated early.
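The termination rule behind the Bandit policy can be sketched in a few lines. This is an illustration rather than the Azure ML implementation: it assumes a metric that is maximised and the slack rule documented for `BanditPolicy`, where a run is stopped if its best metric is below `best / (1 + slack_factor)`. The function name is hypothetical.

```python
def bandit_should_terminate(run_best, overall_best, slack_factor):
    """Return True if a run falls outside the slack of the best run.

    Assumes the primary metric is being maximised.
    """
    threshold = overall_best / (1.0 + slack_factor)
    return run_best < threshold

# With slack_factor=0.1 and a best accuracy of 0.9325,
# the cut-off is 0.9325 / 1.1, roughly 0.8477.
print(bandit_should_terminate(0.84, 0.9325, 0.1))  # True
print(bandit_should_terminate(0.86, 0.9325, 0.1))  # False
```

A smaller `slack_factor` makes the policy more aggressive, because the cut-off moves closer to the best metric.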

Training is done by the script `train.py`, which also performs a bit of feature engineering. It defines 3 new variables: `Vol_Dens`, which is mobile weight divided by screen volume; `px_dens`, which is the pixel density of the mobile's screen; and `talk_cons`, which represents battery consumption due to calls.
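Since `train.py` is not shown in this notebook, the exact formulas are not known. The sketch below is one plausible reading, with every formula an assumption: screen volume as `sc_h * sc_w * m_dep`, pixel density as total pixels over screen area, and call consumption as `battery_power / talk_time`. The helper names are hypothetical, and zero denominators are mapped to 0.0 as a defensive choice.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def add_engineered_features(row):
    """Return a copy of row with the three derived features added.

    All three formulas are assumptions, not taken from train.py.
    """
    out = dict(row)
    screen_area = row["sc_h"] * row["sc_w"]
    screen_volume = screen_area * row["m_dep"]
    out["Vol_Dens"] = safe_div(row["mobile_wt"], screen_volume)
    out["px_dens"] = safe_div(row["px_height"] * row["px_width"], screen_area)
    out["talk_cons"] = safe_div(row["battery_power"], row["talk_time"])
    return out

# Toy phone, not a row from the real dataset.
toy_row = {
    "sc_h": 10, "sc_w": 5, "m_dep": 0.5, "mobile_wt": 150,
    "px_height": 1000, "px_width": 2000,
    "battery_power": 1500, "talk_time": 10,
}
features = add_engineered_features(toy_row)
print(features["Vol_Dens"], features["px_dens"], features["talk_cons"])
```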

After that, train-test split is done with 80-20 ratio and LGBMClassifier is applied.
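HyperDrive passes each sampled configuration to the training script as command-line arguments. A minimal sketch of how `train.py` might read them is shown below, assuming `argparse`. The flag names match the hyperparameters of this notebook, while the default values and the mapping onto `LGBMClassifier` keyword arguments are assumptions.

```python
import argparse

def parse_args(argv):
    """Parse the hyperparameters passed on the command line."""
    parser = argparse.ArgumentParser(description="Train a price-range classifier")
    # Defaults are hypothetical; HyperDrive overrides them on every run.
    parser.add_argument("--lr", type=float, default=0.1)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--num_leaves", type=int, default=40)
    parser.add_argument("--n_estimators", type=int, default=100)
    return parser.parse_args(argv)

args = parse_args(["--lr", "0.2", "--max_depth", "3",
                   "--n_estimators", "200", "--num_leaves", "60"])

# Assumed mapping onto LGBMClassifier keyword arguments.
lgbm_params = {
    "learning_rate": args.lr,
    "max_depth": args.max_depth,
    "num_leaves": args.num_leaves,
    "n_estimators": args.n_estimators,
}
print(lgbm_params)
```

In the real script the list would come from `sys.argv[1:]`, and `lgbm_params` would be unpacked into the classifier constructor.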

For the estimator definition, we had to list `lightgbm` in the `pip_packages` parameter so that it is installed in the training environment.

The metric we want to optimize is Accuracy, and we allow 10 runs in total with at most 4
running concurrently, matching the `max_nodes` we've set up for our compute cluster.
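To make the run budget concrete, the sketch below draws 10 configurations from the same search space used by the `RandomParameterSampling` definition in this notebook and groups them into batches of at most 4. It uses only the standard `random` module, and the batching is a simplification: the actual scheduling is handled by HyperDrive and is not reproduced here. `sample_config` and `plan_runs` are hypothetical names.

```python
import random

# Search space mirroring the sampler definition in this notebook.
LR_RANGE = (0.01, 0.3)
MAX_DEPTH = [3, 4, 5, 6, 7, 8, 9, 10]
NUM_LEAVES = [20, 40, 60, 80, 100, 120, 140]
N_ESTIMATORS = [100, 200, 300, 400, 500, 600, 700]

def sample_config(rng):
    """Draw one configuration: lr is continuous, the rest are discrete."""
    return {
        "lr": rng.uniform(*LR_RANGE),
        "max_depth": rng.choice(MAX_DEPTH),
        "num_leaves": rng.choice(NUM_LEAVES),
        "n_estimators": rng.choice(N_ESTIMATORS),
    }

def plan_runs(max_total_runs=10, max_concurrent_runs=4, seed=0):
    """Sample all configurations and split them into concurrent batches."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(max_total_runs)]
    return [configs[i:i + max_concurrent_runs]
            for i in range(0, max_total_runs, max_concurrent_runs)]

batches = plan_runs()
print([len(batch) for batch in batches])  # [4, 4, 2]
```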

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Create compute cluster
# Use vm_size = "Standard_D2_V2" in your provisioning configuration.
# max_nodes should be no greater than 4.

### YOUR CODE HERE ###

# Choose a name for your CPU cluster
cpu_cluster_name = "hd-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                            max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating....
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [5]:

from azureml.widgets import RunDetails
from azureml.train.sklearn import SKLearn
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.policy import BanditPolicy
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.parameter_expressions import uniform, choice
import os

# Specify parameter sampler
ps = RandomParameterSampling({
    "lr": uniform(.01, 0.3),
    "max_depth": choice(3, 4, 5, 6, 7, 8, 9, 10),
    "num_leaves": choice(20, 40, 60, 80, 100, 120, 140),
    "n_estimators": choice(100, 200, 300, 400, 500, 600, 700)
    }
)

# Specify a Policy
policy = BanditPolicy(slack_factor=.1, evaluation_interval=2)

if "training" not in os.listdir():
    os.mkdir("./training")

# Create a SKLearn estimator for use with train.py
est = SKLearn(source_directory='./',
              entry_script='train.py',
              compute_target=cpu_cluster,
              pip_packages=['lightgbm'])

# Create a HyperDriveConfig using the estimator, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(estimator=est,
                                    hyperparameter_sampling=ps,
                                    policy=policy,
                                    primary_metric_name='Accuracy',
                                    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                    max_total_runs=10,
                                    max_concurrent_runs=4)

# hyper_run = HyperDriveRun(exp, run.get_snapshot_id, hyperdrive_config=hyperdrive_config)
# RunDetails(hyper_run)

'SKLearn' estimator is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or the AzureML-Tutorial curated environment.


In [13]:
#TODO: Submit your experiment
hyperdrive_run = experiment.submit(config=hyperdrive_config)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [14]:
RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [15]:
hyperdrive_run.wait_for_completion(show_output=True)

RunId: HD_ba14f286-4616-463b-a895-54b15d1464f6
Web View: https://ml.azure.com/experiments/hyperdrive-mobile/runs/HD_ba14f286-4616-463b-a895-54b15d1464f6?wsid=/subscriptions/f5091c60-1c3c-430f-8d81-d802f6bf2414/resourcegroups/aml-quickstarts-141054/workspaces/quick-starts-ws-141054

Streaming azureml-logs/hyperdrive.txt

"<START>[2021-03-21T16:47:23.616605][API][INFO]Experiment created<END>\n"<START>[2021-03-21T16:47:24.3246647Z][SCHEDULER][INFO]The execution environment is being prepared. Please be patient as it can take a few minutes.<END>"<START>[2021-03-21T16:47:25.679057][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space<END>\n""<START>[2021-03-21T16:47:26.033843][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.<END>\n"<START>[2021-03-21T16:47:54.6280926Z][SCHEDULER][INFO]The execution environment was successfully prepared.<END><START>[2021-03-21T16:47:54.6342739Z][SCHEDULER][INFO]Scheduling job, id='HD_ba14f2

{'runId': 'HD_ba14f286-4616-463b-a895-54b15d1464f6',
 'target': 'hd-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-21T16:47:23.380555Z',
 'endTimeUtc': '2021-03-21T16:56:32.953874Z',
 'properties': {'primary_metric_config': '{"name": "Accuracy", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '3d0d4f7d-1211-4233-84b1-a6e58bf46695',
  'score': '0.9325',
  'best_child_run_id': 'HD_ba14f286-4616-463b-a895-54b15d1464f6_4',
  'best_metric_status': 'Succeeded'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://mlstrg141054.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_ba14f286-4616-463b-a895-54b15d1464f6/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=EMsAFSOIWzWYPdXeQwNHzzc6HAmLZXzMnSDbFCNqEyA%3D&st=2021-03-21T16%3A46%3A36Z&se=2021-03-22T00%3A56%3A36Z&sp=r'},
 'submittedBy': 'ODL_User 141054'}

## Best Model

In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [16]:
import joblib
# Get your best run and save the model from that run.

### YOUR CODE HERE ###
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print('\n Accuracy: ', best_run_metrics['Accuracy'])
#print('\n Inverse regularisation value (C): ', parameter_values[1])
#print('\n Max Iterations: ', parameter_values[3])
print(parameter_values)
# Save model
print('\n SAVE MODEL...')
final_model = best_run.register_model(model_name = 'hypermodel', model_path = '/outputs/model.joblib')
print('\n SAVE MODEL...')

Best Run Id:  HD_ba14f286-4616-463b-a895-54b15d1464f6_4

 Accuracy:  0.9325
['--lr', '0.14521344766025812', '--max_depth', '3', '--n_estimators', '200', '--num_leaves', '60']

 SAVE MODEL...

 SAVE MODEL...
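The best run's hyperparameters are returned as a flat list of alternating flags and values, which is awkward to read or reuse. A small sketch for turning that list into a dictionary follows; `arguments_to_dict` is a hypothetical helper, and the values stay as strings because that is how they appear in the run definition.

```python
def arguments_to_dict(arguments):
    """Convert ['--flag', 'value', ...] into {'flag': 'value', ...}."""
    if len(arguments) % 2 != 0:
        raise ValueError("expected alternating flags and values")
    flags = arguments[::2]
    values = arguments[1::2]
    return {flag.lstrip("-"): value for flag, value in zip(flags, values)}

# The argument list printed for the best run above.
best_arguments = ['--lr', '0.14521344766025812', '--max_depth', '3',
                  '--n_estimators', '200', '--num_leaves', '60']
best_params = arguments_to_dict(best_arguments)
print(best_params)
```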


In [21]:
final_model

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| hyperdrive-mobile | HD_ba14f286-4616-463b-a895-54b15d1464f6_4 | azureml.scriptrun | Completed | Link to Azure Machine Learning studio | Link to Documentation |


In [23]:
best_run.get_details()

{'runId': 'HD_ba14f286-4616-463b-a895-54b15d1464f6_4',
 'target': 'hd-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-03-21T16:54:10.203939Z',
 'endTimeUtc': '2021-03-21T16:54:58.265218Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '3d0d4f7d-1211-4233-84b1-a6e58bf46695',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--lr',
   '0.14521344766025812',
   '--max_depth',
   '3',
   '--n_estimators',
   '200',
   '--num_leaves',
   '60'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'hd-cluster',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'priority': None,
  'credentialPassthrough': False,
  

In [17]:
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()

print_model(final_model)


AttributeError: 'Model' object has no attribute 'steps'
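The `AttributeError` above is raised because `print_model` expects a pipeline-like object with a `steps` attribute, while `final_model` is the registered model handle returned by `register_model()` and not the fitted estimator. A more tolerant printer could look like the sketch below, which accepts either a pipeline-like object or a single estimator exposing `get_params()`. `describe_estimator` and `_ToyEstimator` are hypothetical names, and the toy class only stands in for a fitted classifier.

```python
from pprint import pformat

def describe_estimator(estimator, prefix=""):
    """Describe a pipeline-like object or a single estimator as text."""
    steps = getattr(estimator, "steps", None)
    if steps is not None:
        # Pipeline-like: recurse into each (name, step) pair.
        return "\n".join(
            describe_estimator(step, prefix + name + " - ")
            for name, step in steps
        )
    if hasattr(estimator, "get_params"):
        # Single estimator: print its class name and parameters.
        return prefix + type(estimator).__name__ + "\n" + pformat(estimator.get_params())
    return prefix + type(estimator).__name__ + " exposes neither steps nor get_params()"

class _ToyEstimator:
    """Stand-in for a fitted classifier (illustration only)."""
    def __init__(self, **params):
        self._params = params

    def get_params(self):
        return dict(self._params)

toy = _ToyEstimator(max_depth=3, n_estimators=200)
print(describe_estimator(toy))
```

To apply this to the registered model, the serialized file would first have to be fetched and deserialized, for example with `final_model.download()` followed by `joblib.load()` on the returned path. That step is assumed here and was not run as part of this notebook.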