# Hyperparameter Tuning using HyperDrive

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
# Much of the following is adapted from
# https://github.com/microsoft/MLHyperparameterTuning/blob/master/04_Hyperparameter_Random_Search.ipynb
import pandas as pd
import time
from azureml.core import Workspace, Experiment, Dataset, Datastore
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import (
    RandomParameterSampling, choice, PrimaryMetricGoal,
    HyperDriveConfig, MedianStoppingPolicy)
from azureml.widgets import RunDetails
import azureml.core

print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

azureml.core.VERSION=1.22.0


## Dataset

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
ws = Workspace.from_config()
experiment_name = 'your experiment name here'

experiment=Experiment(ws, experiment_name)

UserErrorException: UserErrorException:
	Message: Experiment name must be between 1 and 255 characters long. Its first character has to be alphanumeric, and the rest may contain hyphens and underscores. No whitespace is allowed.
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Experiment name must be between 1 and 255 characters long. Its first character has to be alphanumeric, and the rest may contain hyphens and underscores. No whitespace is allowed."
    }
}

# Note: Because I am using the same data as in the AutoML run (so that the two can be compared), the data preprocessing was already done in automl.ipynb. I therefore do not run the following cell here and instead read the data from the .csv file created in automl.ipynb. The code is included here in case the Udacity rubric requires every step to appear in every notebook.

In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'uda3experiment'

experiment=Experiment(ws, experiment_name)

import pandas as pd
import numpy as np
data = pd.read_csv('data/Processed_DJI.csv')
# calculate the difference to next date
df = pd.concat([data,data[['Close']].diff(periods=-1).rename({'Close':'Close Diff'}, axis=1)], axis=1)
# create label 1 if stock goes up, or 0 if it goes down
df['y']= np.where(df['Close Diff'] < 0, 1, 0)
# Forward-fill missing values: this way no future data is used for the missing data handling
df2 = df.ffill()
# Remove Close and Close Diff, as we are now only interested in the binary up/down indicator.
# Also remove the EMAs that are calculated from many past values: these would generate
# several NaN values at the beginning of the data, as EMA_200 can only be calculated once we have
# observed 200 days. We only have 2000 points of data, so having 200 data points with NaN on a feature
# is not worth it.
# Also remove Date: even if it were predictive, it would not generalize
# to the future.
df3 = df2.drop(columns= ['EMA_50','EMA_200', 'Close Diff', 'Close', 'Date', 'Name' ])
# and then removing any data points with missing features
df3 = df3.dropna()
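# Min-max scale every column using the min and max of the first 20 rows only,
# then drop those 20 reference rows from the data
normalized_df = (df3 - df3.iloc[0:20].min()) / (df3.iloc[0:20].max() - df3.iloc[0:20].min())
normalized_df = normalized_df.drop(normalized_df.index[0:20])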
normalized_df['y'] = normalized_df['y'].astype(int)
normalized_df.to_csv('data/normalized_data.csv')
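
To make the labelling step in the cell above concrete, here is a minimal sketch on a made-up five-row price series (the values are illustrative and not taken from `Processed_DJI.csv`). `diff(periods=-1)` subtracts the next row's close from the current one, so a negative difference means the next close is higher.

```python
import numpy as np
import pandas as pd

# Toy closing prices (hypothetical values, for illustration only)
toy = pd.DataFrame({'Close': [100.0, 101.0, 99.0, 99.5, 98.0]})

# Current close minus the NEXT close; the last row has no next row, so it is NaN
toy['Close Diff'] = toy['Close'].diff(periods=-1)

# 1 when the next close is higher than the current one, otherwise 0
toy['y'] = np.where(toy['Close Diff'] < 0, 1, 0)

print(toy)
```

Note that the last row has no next close. Comparing NaN with 0 evaluates to False, so that row receives the label 0 rather than a missing value.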

In [None]:
ws = Workspace.from_config()
datastore = ws.get_default_datastore()
dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/normalized_data.csv'))

In [None]:
df = dataset.to_pandas_dataframe()

In [5]:
labels = df['y']
features = df.drop(['y'], axis=1)

In [6]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=.2, random_state=42)

In [7]:
# loosely following this article for basic random forest classifier with hyperparameter tuning:
# https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
from sklearn.ensemble import RandomForestClassifier
from pprint import pprint

rf = RandomForestClassifier(random_state = 42) #  meaning of life etc.

In [8]:
rf.fit(train_features, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
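
Since the HyperDrive runs later in this notebook optimise AUC, it can help to record the AUC of the untuned forest as a point of comparison. The following is a minimal sketch, not the notebook's own evaluation: it uses synthetic data from `make_classification` as a stand-in so that it runs on its own. In the notebook you would pass `train_features`, `test_features`, `train_labels` and `test_labels` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the DJI features and labels (assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same default configuration as the baseline classifier above
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print('Baseline AUC: {:.3f}'.format(baseline_auc))
```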

## Hyperdrive Configuration

TODO: Explain the model you are using and the reason for choosing the different hyperparameters, termination policy and config settings.

In [9]:
from azureml.train.hyperdrive import HyperDriveConfig
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, uniform, PrimaryMetricGoal


In [10]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cheap-compute"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)

Found existing cluster, use it.


In [11]:
# TODO: Create an early termination policy. This is not required if you are using Bayesian sampling.

from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.hyperdrive import GridParameterSampling, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails

experiment_folder = "experiment_folder"
# Create a Python environment for the experiment
sklearn_env = Environment("sklearn-env")

# Ensure the required packages are installed (we need scikit-learn, Azure ML defaults, and Azure ML dataprep)
packages = CondaDependencies.create(conda_packages=['scikit-learn','pip'],
                                    pip_packages=['azureml-defaults','azureml-dataprep[pandas]'])
sklearn_env.python.conda_dependencies = packages

# Get the training dataset
dji_ds = dataset

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='dji_training.py',
                                # Add non-hyperparameter arguments - in this case, the training dataset
                                arguments = ['--input-data', dji_ds.as_named_input('training_data')],
                                environment=sklearn_env,
                                compute_target = training_cluster)

# I pick the Median Stopping Policy with the following parameters, as the Microsoft documentation states:
# "For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy
# with evaluation_interval 1 and delay_evaluation 5. These are conservative settings, that can provide approximately
# 25%-35% savings with no loss on primary metric (based on our evaluation data)."
# Source: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters
from azureml.train.hyperdrive import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

#TODO: Create the different params that you will be using during training
param_sampling = GridParameterSampling( {
    "n_estimators": choice(800,900,1000,1100,1200),
    "min_samples_split": choice(4,5,6),
    "min_samples_leaf": choice(3,4,5),
    "max_depth":choice(90,100,110)
})

#TODO: Create your estimator and hyperdrive config
#estimator = <your estimator here> 
# Microsoft doc says that estimators are deprecated so not sure why this is here...
# https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.sklearn.sklearn?view=azure-ml-py

hyperdrive_run_config = HyperDriveConfig(run_config=script_config,
                             hyperparameter_sampling=param_sampling,
                             policy=early_termination_policy,
                             primary_metric_name="AUC",
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=2,
                             max_concurrent_runs=4)
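
To illustrate the idea behind the `MedianStoppingPolicy` configured above, here is a simplified sketch of a median stopping rule for a metric that is maximised. It is an approximation written for this notebook, not Azure ML's implementation: the function name `should_stop`, the way peers are selected, and the exact handling of the delay window are assumptions.

```python
from statistics import median


def should_stop(run_history, peer_histories, evaluation_interval=1, delay_evaluation=5):
    """Return True if a run would be stopped under a simplified median stopping rule.

    run_history: metric values reported so far by the run being judged.
    peer_histories: metric values reported by the other runs.
    """
    n = len(run_history)
    # No decision during the delay window or between evaluation intervals
    if n <= delay_evaluation or n % evaluation_interval != 0:
        return False
    # Running average of each peer over the same first n intervals
    averages = [sum(h[:n]) / n for h in peer_histories if len(h) >= n]
    if not averages:
        return False
    # Stop when the run's best value so far is below the median of those averages
    return max(run_history) < median(averages)


# Hypothetical AUC histories, for illustration only
peers = [
    [0.60, 0.62, 0.64, 0.66, 0.68, 0.70],
    [0.55, 0.58, 0.61, 0.63, 0.65, 0.67],
    [0.50, 0.52, 0.55, 0.57, 0.60, 0.62],
]
weak_run = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52]
strong_run = [0.60, 0.65, 0.70, 0.72, 0.75, 0.78]

print(should_stop(weak_run, peers))
print(should_stop(strong_run, peers))
print(should_stop(weak_run[:5], peers))
```

One practical consequence: if a training script reports the primary metric only once per run, a delay of five intervals leaves the policy nothing to evaluate, so it would not terminate anything.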

In [12]:
#TODO: Submit your experiment
experiment = Experiment(workspace=ws, name='Dji-hyperdrive-fixed')
run = experiment.submit(config=hyperdrive_run_config)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

Current provisioning state of AmlCompute is "Deleting"


{'runId': 'HD_603f2e99-1a1a-4bf7-93ed-4aed2bae620f',
 'target': 'cheap-compute',
 'status': 'Canceled',
 'startTimeUtc': '2021-03-14T10:20:59.619548Z',
 'endTimeUtc': '2021-03-14T10:30:09.24375Z',
 'error': {'error': {'code': 'UserError',
   'message': 'All child runs cancelled.',
   'messageParameters': {},
   'details': []},
  'time': '0001-01-01T00:00:00.000Z'},
 'properties': {'primary_metric_config': '{"name": "AUC", "goal": "maximize"}',
  'resume_from': 'null',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'ContentSnapshotId': '6cad4917-3e25-4431-ad4a-5435f1ca22ce'},
 'inputDatasets': [],
 'outputDatasets': [],
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://uda3ws4193020758.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_603f2e99-1a1a-4bf7-93ed-4aed2bae620f/azureml-logs/hyperdrive.txt?sv=2019-02-02&sr=b&sig=EfVtktDNNfgVhf5yo2G4H82ZiF1KmFkMp6xgSVPgn5M%3D&st=2021-03-14T10%3A20%3A27Z&se=2021-03-14T18%3A30%3A27Z&sp=r'},
 'submi

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?
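
One thing to keep in mind when comparing runs is how little of the search space was covered. The grid passed to `GridParameterSampling` has 5 × 3 × 3 × 3 = 135 combinations, while `max_total_runs=2` caps the experiment at two of them. The sketch below enumerates the grid with `itertools.product` to make that explicit; the values are copied from the configuration cell in this notebook.

```python
from itertools import product

# Search space copied from the GridParameterSampling definition in this notebook
grid = {
    'n_estimators': [800, 900, 1000, 1100, 1200],
    'min_samples_split': [4, 5, 6],
    'min_samples_leaf': [3, 4, 5],
    'max_depth': [90, 100, 110],
}

# Every combination a full grid sweep would have to train
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

max_total_runs = 2  # value used in the HyperDriveConfig in this notebook
print('Grid size:', len(combinations))
print('Fraction covered: {:.1%}'.format(max_total_runs / len(combinations)))
```

With so few runs, differences between the trained models say little about which region of the grid is best.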

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [63]:
RunDetails(run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

## Best Model

TODO: In the cell below, get the best model from the hyperdrive experiments and display all the properties of the model.

In [13]:
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()


print('Best Run Id: ', best_run.id)
print('\n AUC:', best_run_metrics['AUC'])
best_run.get_details()['runDefinition']['arguments']

AttributeError: 'NoneType' object has no attribute 'get_metrics'

In [46]:
best_run.get_details()['runDefinition']['arguments']

['--input-data',
 'DatasetConsumptionConfig:training_data',
 '--n_estimators',
 '900',
 '--min_samples_split',
 '4',
 '--min_samples_leaf',
 '3',
 '--max_depth',
 '90']
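
The argument list above shows how HyperDrive passes each sampled hyperparameter to the training script as a command-line flag after `--input-data`. The script `dji_training.py` itself is not shown in this notebook, so the following is only a sketch of how such a script could parse these flags and build the classifier. The function names `parse_args` and `build_model` and the default values are hypothetical; a real script would also load the dataset, train, and log the `AUC` metric.

```python
import argparse

from sklearn.ensemble import RandomForestClassifier


def parse_args(argv=None):
    """Parse the flags seen in the run's argument list (defaults are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input-data', type=str, dest='input_data')
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--min_samples_split', type=int, default=2)
    parser.add_argument('--min_samples_leaf', type=int, default=1)
    parser.add_argument('--max_depth', type=int, default=None)
    return parser.parse_args(argv)


def build_model(args):
    """Create a random forest from the parsed hyperparameters."""
    return RandomForestClassifier(
        n_estimators=args.n_estimators,
        min_samples_split=args.min_samples_split,
        min_samples_leaf=args.min_samples_leaf,
        max_depth=args.max_depth,
        random_state=42,
    )


# Argument list of the best run, copied from the output above
args = parse_args([
    '--input-data', 'DatasetConsumptionConfig:training_data',
    '--n_estimators', '900',
    '--min_samples_split', '4',
    '--min_samples_leaf', '3',
    '--max_depth', '90',
])
model = build_model(args)
print(model.get_params()['n_estimators'], model.get_params()['max_depth'])
```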

In [47]:
# I don't know what is meant by "all the properties of the model", so here is a full dump of the run details to satisfy the rubric
best_run.get_details()

{'runId': 'HD_02d49c48-b199-41ad-99e0-16726955bfdb_1',
 'target': 'cheap-compute',
 'status': 'Completed',
 'startTimeUtc': '2021-03-13T15:06:57.263562Z',
 'endTimeUtc': '2021-03-13T15:09:16.874314Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '6cad4917-3e25-4431-ad4a-5435f1ca22ce',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': 'a04bb4a5-511d-4b33-9f47-48428a35379f'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'training_data', 'mechanism': 'Direct'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'dji_training.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--input-data',
   'DatasetConsumptionConfig:training_data',
   '--n_estimators',
   '900',
   '--min_samples_split',
   '4',
   '--min_samples_leaf',
   '3',
   '--max_depth',
   '90'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  '

In [51]:
model = best_run.register_model(model_name='dji_hyper_model',
                           model_path='outputs/dji_model.pkl')
print(model.name, model.id, model.version, sep='\t')

dji_hyper_model	dji_hyper_model:1	1


In [56]:
from azureml.core.model import Model
moddy = Model(ws,name='dji_hyper_model')

In [60]:
moddy.download('.')

'dji_model.pkl'

## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.
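
This model was not deployed in this notebook, but for reference, an Azure ML inference config points at an entry script that defines `init()` and `run()`. Below is a minimal sketch of such a script, assuming the model was saved with `joblib` under the file name `dji_model.pkl` and that requests arrive as JSON of the form `{"data": [[...]]}`; both are assumptions, not something this notebook confirms. The block ends with a local smoke test that uses a throwaway model in a temporary directory, so it runs without a workspace.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

model = None


def init():
    """Called once when the service starts: load the registered model file."""
    global model
    # Azure ML sets AZUREML_MODEL_DIR to the folder holding the registered model
    model_dir = os.getenv('AZUREML_MODEL_DIR', '.')
    model = joblib.load(os.path.join(model_dir, 'dji_model.pkl'))


def run(raw_data):
    """Called per request: parse the JSON payload and return class predictions."""
    data = np.array(json.loads(raw_data)['data'])
    return model.predict(data).tolist()


# Local smoke test with a throwaway model (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    toy_model = RandomForestClassifier(n_estimators=10, random_state=42)
    toy_model.fit([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 0.8]], [0, 1, 0, 1])
    joblib.dump(toy_model, os.path.join(tmp, 'dji_model.pkl'))
    os.environ['AZUREML_MODEL_DIR'] = tmp
    init()
    predictions = run(json.dumps({'data': [[0.05, 0.05], [0.95, 0.95]]}))

print(predictions)
```

In a deployed service only the imports, `init()` and `run()` would live in the entry script; the smoke test is there only to exercise them locally.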

TODO: In the cell below, send a request to the web service you deployed to test it.

TODO: In the cell below, print the logs of the web service and delete the service