# Hyperparameter Search
In this notebook, we create an AML cluster and use it to run an AutoML search for the best model and set of hyperparameters.

The steps in this notebook are
- [import libraries](#import),
- [read in the Azure ML workspace](#workspace),
- [create an AML cluster](#cluster),
- [upload the data to the cloud](#upload),
- [define the AutoML configuration](#configuration),
- [submit the configuration](#submit), and
- [get the results](#results).

## Imports  <a id='import'></a>

In [1]:
import os
import pandas as pd
import time
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration, DataReferenceConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.automl import AutoMLConfig
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import RandomParameterSampling, choice, PrimaryMetricGoal, HyperDriveRunConfig
from azureml.widgets import RunDetails
import azureml.core
from get_auth import get_auth
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))

azureml.core.VERSION=1.0.41


## Read in the Azure ML workspace <a id='workspace'></a>
Read in the workspace created in a previous notebook.

In [2]:
auth = get_auth()
ws = Workspace.from_config(auth=auth)
ws_details = ws.get_details()
print('Name:\t\t{}\nLocation:\t{}'
      .format(ws_details['name'],
              ws_details['location']))

Trying to create Workspace with CLI Authentication
Name:		hypetuning
Location:	eastus2


## Create an AML cluster <a id='cluster'></a>
Define the properties of the cluster needed.

In [3]:
cluster_name = 'hypetuning'
provisioning_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_D4_v2',
        # vm_priority = 'lowpriority', # optional
        max_nodes=16)

Create the configured cluster if it doesn't already exist, or retrieve it if it does exist. Creation can take about a minute.

In [4]:
if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if type(compute_target) is not AmlCompute:
        raise Exception('Compute target {} is not an AML cluster.'
                        .format(cluster_name))
    print('Using pre-existing AML cluster {}'.format(cluster_name))
else:
    # Create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)

    # You can poll for a minimum number of nodes and set a specific timeout. 
    # If min node count is provided, provisioning will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Using pre-existing AML cluster hypetuning


Print a detailed view of the cluster.    

In [5]:
pd.Series(compute_target.get_status().serialize(), name='Value').to_frame()

| Field | Value |
|---|---|
| currentNodeCount | 0 |
| targetNodeCount | 0 |
| nodeStateCounts | {'preparingNodeCount': 0, 'runningNodeCount': ... |
| allocationState | Steady |
| allocationStateTransitionTime | 2019-06-01T01:00:34.184000+00:00 |
| errors | |
| creationTime | 2019-05-31T20:46:59.611898+00:00 |
| modifiedTime | 2019-05-31T20:47:15.353894+00:00 |
| provisioningState | Succeeded |
| provisioningStateTransitionTime | |


## Upload the data to the cloud <a id='upload'></a>
Prepare the data in X/y form for AutoML.

In [6]:
data_path = "data"
train_path = os.path.join(data_path, "balanced_pairs_train.tsv")
tune_path = os.path.join(data_path, "balanced_pairs_tune.tsv")
train = pd.read_csv(train_path, sep='\t', encoding='latin1')
tune = pd.read_csv(tune_path, sep='\t', encoding='latin1')
feature_columns = ["Text_x", "Text_y"]
label_column = "Label"
train_X = (train.Text_x + ' ' + train.Text_y)  # train_X = train[feature_columns]
train_y = train[label_column]
tune_X = (tune.Text_x + ' ' + tune.Text_y)  # tune_X = tune[feature_columns]
tune_y = tune[label_column]
train_label_counts = train[label_column].value_counts()
train_label_weight = train.shape[0] / (train_label_counts.shape[0] * train_label_counts)
train_weight = train[label_column].apply(lambda x: train_label_weight[x])
tune_label_counts = tune[label_column].value_counts()
tune_label_weight = tune.shape[0] / (tune_label_counts.shape[0] * tune_label_counts)
tune_weight = tune[label_column].apply(lambda x: tune_label_weight[x])
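
The weights computed above follow the common "balanced" formula: each label gets the weight `n_samples / (n_classes * label_count)`, so every class contributes the same total weight. A minimal, dependency-free sketch of that arithmetic, using made-up labels (`balanced_weights` and `toy_labels` are illustrative names, not part of the notebook):

```python
from collections import Counter


def balanced_weights(labels):
    """Return one weight per sample: n_samples / (n_classes * count of its label)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    per_label = {label: n_samples / (n_classes * count)
                 for label, count in counts.items()}
    return [per_label[label] for label in labels]


# Hypothetical toy labels: three negatives and one positive.
toy_labels = [0, 0, 0, 1]
toy_weights = balanced_weights(toy_labels)
print(toy_weights)
```

In this toy case the single positive sample carries as much total weight as the three negatives combined. When the classes are already balanced, every weight is 1.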

Write the X/y data out to a data directory.

In [7]:
automl_data_path = "automl_data"
os.makedirs(automl_data_path, exist_ok=True)

train_X_path = os.path.join(automl_data_path, "train_X.tsv")
train_X.to_csv(train_X_path, sep='\t', header=True, index=False)

train_y_path = os.path.join(automl_data_path, "train_y.tsv")
train_y.to_csv(train_y_path, sep='\t', header=True, index=False)

train_weight_path = os.path.join(automl_data_path, "train_weight.tsv")
train_weight.to_csv(train_weight_path, sep='\t', header=True, index=False)

tune_X_path = os.path.join(automl_data_path, "tune_X.tsv")
tune_X.to_csv(tune_X_path, sep='\t', header=True, index=False)

tune_y_path = os.path.join(automl_data_path, "tune_y.tsv")
tune_y.to_csv(tune_y_path, sep='\t', header=True, index=False)

tune_weight_path = os.path.join(automl_data_path, "tune_weight.tsv")
tune_weight.to_csv(tune_weight_path, sep='\t', header=True, index=False)

We put the data in a particular directory on the workspace's default data store. This will show up in the same location in the file system of every job running on the AML cluster.

Get a handle to the workspace's default data store.

In [8]:
ds = ws.get_default_datastore()

Upload the data. We use `overwrite=True` so that the uploaded files always reflect the current local data. To avoid taking the time to re-upload files that are already present under the same names, use `overwrite=False`.

In [9]:
ds.upload(src_dir=os.path.join('.', automl_data_path), target_path='data', overwrite=True, show_progress=True)

Uploading ./automl_data/train_X.tsv
Uploading ./automl_data/train_weight.tsv
Uploading ./automl_data/train_y.tsv
Uploading ./automl_data/tune_X.tsv
Uploading ./automl_data/tune_weight.tsv
Uploading ./automl_data/tune_y.tsv
Uploaded ./automl_data/tune_y.tsv, 1 files out of an estimated total of 6
Uploaded ./automl_data/train_y.tsv, 2 files out of an estimated total of 6
Uploaded ./automl_data/tune_weight.tsv, 3 files out of an estimated total of 6
Uploaded ./automl_data/train_weight.tsv, 4 files out of an estimated total of 6
Uploaded ./automl_data/tune_X.tsv, 5 files out of an estimated total of 6
Uploaded ./automl_data/train_X.tsv, 6 files out of an estimated total of 6


$AZUREML_DATAREFERENCE_4d1e69a1b0d6466c9daed63dd75f615c

Create a data reference to download the data to an absolute location on the nodes.

In [10]:
dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore="data", 
                   path_on_compute=os.path.join("/tmp", "azureml"),
                   mode='download', # download files from datastore to compute target
                   overwrite=False)
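
The `get_data.py` script in the next cell reads from `/tmp/azureml/data`, which suggests that the downloaded files end up under `path_on_compute` followed by `path_on_datastore`. A small sketch of that path arithmetic, assuming this layout holds (the helper `compute_path` is hypothetical and not an Azure ML API):

```python
import posixpath


def compute_path(path_on_compute, path_on_datastore, filename):
    """Join the node-side download root, the datastore folder, and a file name."""
    # posixpath is used because the cluster nodes are assumed to run Linux.
    return posixpath.join(path_on_compute, path_on_datastore, filename)


node_path = compute_path("/tmp/azureml", "data", "train_X.tsv")
print(node_path)
```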

Create the `get_data.py` file in the `scripts` directory.

In [11]:
%%writefile scripts/get_data.py

import os
import pandas as pd

def get_data():
    automl_data_path = os.path.join("/tmp", "azureml", "data")
    
    train_X_path = os.path.join(automl_data_path, "train_X.tsv")
    train_X = pd.read_csv(train_X_path, sep='\t', encoding='latin1').values.flatten()

    train_y_path = os.path.join(automl_data_path, "train_y.tsv")
    train_y = pd.read_csv(train_y_path, sep='\t', encoding='latin1').values.flatten()

    train_weight_path = os.path.join(automl_data_path, "train_weight.tsv")
    train_weight = pd.read_csv(train_weight_path, sep='\t', encoding='latin1').values.flatten()

    tune_X_path = os.path.join(automl_data_path, "tune_X.tsv")
    tune_X = pd.read_csv(tune_X_path, sep='\t', encoding='latin1').values.flatten()

    tune_y_path = os.path.join(automl_data_path, "tune_y.tsv")
    tune_y = pd.read_csv(tune_y_path, sep='\t', encoding='latin1').values.flatten()

    tune_weight_path = os.path.join(automl_data_path, "tune_weight.tsv")
    tune_weight = pd.read_csv(tune_weight_path, sep='\t', encoding='latin1').values.flatten()

    data = {
        "X" : train_X, "y" : train_y, "sample_weight": train_weight,
        "X_valid": tune_X, "y_valid": tune_y, "sample_weight_valid": tune_weight 
    }
    return data


Overwriting scripts/get_data.py
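
Because `get_data` runs remotely, a missing key or misaligned split only surfaces once the job is on the cluster. A minimal sketch of a local sanity check on the returned dictionary follows. `check_data` and the toy dictionary are hypothetical helpers for illustration; the key names are the ones used in `get_data.py` above.

```python
REQUIRED_KEYS = ("X", "y", "sample_weight",
                 "X_valid", "y_valid", "sample_weight_valid")


def check_data(data):
    """Raise ValueError if a key is missing or a split's arrays differ in length."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("missing keys: {}".format(missing))
    # The first three keys form the training split, the last three the validation split.
    for split_keys in (REQUIRED_KEYS[:3], REQUIRED_KEYS[3:]):
        lengths = {len(data[key]) for key in split_keys}
        if len(lengths) != 1:
            raise ValueError("length mismatch among {}".format(split_keys))
    return True


# Hypothetical toy data standing in for the arrays read from the TSV files.
toy_data = {
    "X": ["a b", "c d"], "y": [0, 1], "sample_weight": [1.0, 1.0],
    "X_valid": ["e f"], "y_valid": [1], "sample_weight_valid": [1.0],
}
print(check_data(toy_data))
```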


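## Define the AutoML configuration <a id='configuration'></a>
Create a run configuration that targets the cluster and includes the data reference, then use it in an AutoML configuration.

In [12]: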
# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set the compute target to the AML cluster
conda_run_config.target = compute_target

# Set the data reference of the run configuration
conda_run_config.data_references = {ds.name: dr}

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'])
conda_run_config.environment.python.conda_dependencies = cd

automated_ml_config = AutoMLConfig(task="classification",
                                   debug_log="dbpedia_auc.log",
                                   path="scripts",
                                   data_script="get_data.py",
                                   primary_metric="AUC_weighted",
                                   run_configuration=conda_run_config,
                                   preprocess=True,
                                   enable_feature_sweeping=False,
                                   iterations=50,
                                   iteration_timeout_minutes=90,
                                   max_concurrent_iterations=16,
                                   max_cores_per_iteration=4)
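
The settings above imply a rough time budget for the search. The sketch below is a back-of-the-envelope estimate only: it assumes iterations run in full "waves" of `max_concurrent_iterations` and that every iteration uses its whole timeout, and it ignores cluster start-up and any extra work AutoML does. The actual scheduling may differ.

```python
import math

iterations = 50                  # from the AutoMLConfig above
iteration_timeout_minutes = 90   # from the AutoMLConfig above
max_concurrent_iterations = 16   # from the AutoMLConfig above

# Assumption: iterations are processed wave by wave, one wave per timeout period.
waves = math.ceil(iterations / max_concurrent_iterations)
worst_case_minutes = waves * iteration_timeout_minutes

hours, minutes = divmod(worst_case_minutes, 60)
print("waves:", waves)
print("rough worst case: {} h {} min".format(hours, minutes))
```

Under these assumptions the search needs four waves and at most about six hours. Iterations that finish before their timeout shorten this.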

## Run the search <a id='submit'></a>
Get an experiment to run the search; create it if it doesn't already exist.

In [13]:
exp = Experiment(workspace=ws, name='hypetuning')

Submit the configuration to be run. This should return almost immediately, and the value will be a run object.

In [14]:
run = exp.submit(automated_ml_config)
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| hypetuning | AutoML_22e6fd35-8169-45e1-ad62-44ac1f127028 | automl | Preparing | Link to Azure Portal | Link to Documentation |


The experiment returns a run that, when printed, shows a table with a link to the `Details Page` in the Azure Portal. That page lets you monitor the status of this run and of its child runs. By clicking on a particular child run, you can see its details, the files output by the script for that configuration, and the logs of the run, including the `driver.log` with the script's printouts.

If you want to cancel this trial, run the code in the cell below.

In [15]:
# run.cancel()

Save the ID of the run in a file. You may use this at a later time to recover the run, as is shown in the next notebook.

In [16]:
run_id = run.id
run_id_path = "run_id.txt"
with open(run_id_path, "w") as fp:
    fp.write(run_id)
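
Reading the ID back is the mirror image of the cell above. The sketch below shows only the file round trip, using a temporary directory and a made-up ID. `save_run_id` and `load_run_id` are hypothetical helper names, and rebuilding the run object from the ID is left to the SDK.

```python
import os
import tempfile


def save_run_id(run_id, path):
    """Write the run ID to a text file."""
    with open(path, "w") as fp:
        fp.write(run_id)


def load_run_id(path):
    """Read the run ID back, dropping any surrounding whitespace."""
    with open(path) as fp:
        return fp.read().strip()


# Hypothetical ID used only for this demonstration.
demo_id = "AutoML_example-run-id"
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "run_id.txt")
    save_run_id(demo_id, demo_path)
    restored_id = load_run_id(demo_path)

print(restored_id == demo_id)
```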

The parent run's status will not be `Completed` until all child runs have either failed or completed. Other possible run statuses include `Preparing`, `Running`, `Finalizing`, and `Failed`.

In [17]:
run.get_status()

'Preparing'
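
If you prefer to poll the status yourself, a loop like the following can wrap any status function. This is a minimal sketch under stated assumptions: `wait_until_done` is a hypothetical helper, the `Canceled` entry in the terminal set is an assumption beyond the statuses listed above, and a fake status sequence stands in for `run.get_status` so that the example runs without a workspace.

```python
import time


def wait_until_done(get_status, terminal=("Completed", "Failed", "Canceled"),
                    poll_seconds=30, max_polls=None, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status, then return it."""
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in terminal:
            return status
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("still {!r} after {} polls".format(status, polls))
        sleep(poll_seconds)


# Stand-in for run.get_status: replays a fixed sequence of statuses.
fake_statuses = iter(["Preparing", "Running", "Running", "Finalizing", "Completed"])
final_status = wait_until_done(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With a real run, you would pass `run.get_status` as the first argument and keep a non-zero `poll_seconds`.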

Use the RunDetails widget to monitor the execution of the AutoML trial.

In [18]:
RunDetails(run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

Wait for the runs to complete. This returns a `dict` with detailed information about the run. Here, we see that the run has `Completed`.

In [17]:
%%time

run_status = run.wait_for_completion()
print(run_status['status'])
if run_status['status'] != 'Completed':
    raise Exception('The run did not successfully complete.')

Completed
CPU times: user 2min 58s, sys: 7.26 s, total: 3min 6s
Wall time: 3h 57min 33s


## Select the best model <a id='results'></a>
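We can automatically select the best run. `get_output` returns a pair: the best child run and the fitted model from that run, so `best_run[0]` below is the run and `best_run[1]` is the model.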

In [None]:
best_run = run.get_output()

In [None]:
help(run)

In [None]:
best_run[0]

## Test the best model
Read in the test data.

In [None]:
test_path = os.path.join(data_path, "balanced_pairs_test.tsv")
test = pd.read_csv(test_path, sep='\t', encoding='latin1')
test_X = (test.Text_x + ' ' + test.Text_y)  # test_X = test[feature_columns]
test_y = test[label_column]

In [None]:
automl_data_path = "automl_data"

test_X_path = os.path.join(automl_data_path, "test_X.tsv")
test_X.to_csv(test_X_path, sep='\t', header=True, index=False)

test_y_path = os.path.join(automl_data_path, "test_y.tsv")
test_y.to_csv(test_y_path, sep='\t', header=True, index=False)

In [None]:
test['probabilities'] = best_run[1].predict_proba(test_X.values)[:, 1]

# Order the testing data by dupe Id and question AnswerId.
group_column = 'Id_x'
answerid_column = 'AnswerId_y'
test.sort_values([group_column, answerid_column], inplace=True)

In [None]:
# Group the probabilities of each dupe's candidate questions into one tuple per dupe.
probabilities = (
    test.probabilities
    .groupby(test[group_column], sort=False)
    .apply(lambda x: tuple(x.values)))

# Get the individual records.
output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
test_score = (test[output_columns_x]
              .drop_duplicates()
              .set_index(group_column))
test_score['probabilities'] = probabilities
test_score.reset_index(inplace=True)
test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
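
The grouping step above collects, for each dupe, the probabilities of all its candidate pairs into one tuple. A dependency-free sketch of the same idea on made-up data (`group_scores` and `toy_rows` are illustrative names only):

```python
def group_scores(rows):
    """Collect scores per question ID, keeping the first-seen order of the IDs."""
    grouped = {}
    for question_id, score in rows:
        grouped.setdefault(question_id, []).append(score)
    return {qid: tuple(scores) for qid, scores in grouped.items()}


# Hypothetical (question ID, probability) pairs, already sorted by question.
toy_rows = [(7, 0.9), (7, 0.1), (8, 0.3), (8, 0.6), (8, 0.2)]
toy_grouped = group_scores(toy_rows)
print(toy_grouped)
```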

In [None]:
import numpy as np

def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of the true label given the scores of the question labels."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]
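
To see what these helpers compute, here is a dependency-free sketch of the same ranking logic on made-up data. The `_plain` function names and the toy values are hypothetical. Ties share their average rank, which is what the pandas `rank` call above does by default.

```python
def score_rank_plain(scores):
    """Rank scores in descending order; tied scores share their average rank."""
    ranks = []
    for s in scores:
        higher = sum(1 for other in scores if other > s)
        ties = sum(1 for other in scores if other == s)
        # Tied entries occupy ranks higher+1 .. higher+ties; take their mean.
        ranks.append(higher + (ties + 1) / 2)
    return ranks


def label_rank_plain(label, scores, label_order):
    """Rank of the true label, or len(scores) + 1 if the label is absent."""
    label_order = list(label_order)
    if label not in label_order:
        return len(scores) + 1
    return score_rank_plain(scores)[label_order.index(label)]


# Hypothetical candidate answers and their scores for one question.
toy_answer_ids = [101, 102, 103]
toy_scores = [0.2, 0.7, 0.1]

print(label_rank_plain(102, toy_scores, toy_answer_ids))  # highest score
print(label_rank_plain(101, toy_scores, toy_answer_ids))  # second highest
print(label_rank_plain(999, toy_scores, toy_answer_ids))  # label not among candidates
```

A true label that is not among the candidates is ranked one place after the last candidate, so it never counts as a hit at any cutoff up to the number of candidates.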

In [None]:
print("Evaluating the model's performance.")

# Collect the ordered AnswerId for computing scores.
labels = sorted(train[answerid_column].unique())
label_order = pd.DataFrame({'label': labels})

# Compute the ranks of the correct answers.
test_score['Ranks'] = test_score.apply(lambda x:
                                       label_rank(x.AnswerId,
                                                  x.probabilities,
                                                  label_order.label),
                                       axis=1)

# Compute the number of correctly ranked answers
args_rank = 3
for i in range(1, args_rank+1):
    print('Accuracy @{} = {:.2%}'
          .format(i, (test_score['Ranks'] <= i).mean()))
mean_rank = test_score['Ranks'].mean()
print('Mean Rank {:.4f}'.format(mean_rank))
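
The two metrics printed above are simple functions of the per-question ranks: accuracy at `k` is the fraction of questions whose correct answer is ranked `k` or better, and the mean rank is the average of the ranks. A small sketch on made-up ranks (`accuracy_at_k` and `toy_ranks` are illustrative names):

```python
def accuracy_at_k(ranks, k):
    """Fraction of questions whose correct answer is ranked k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct answer for five questions.
toy_ranks = [1, 3, 2, 7, 1]

for k in range(1, 4):
    print("Accuracy @{} = {:.2%}".format(k, accuracy_at_k(toy_ranks, k)))
toy_mean_rank = sum(toy_ranks) / len(toy_ranks)
print("Mean Rank {:.4f}".format(toy_mean_rank))
```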

In [None]:
test.columns

Group the test data by dupe and compute, from each dupe's probabilities and its AnswerIds, the rank of its correct answer.

In [None]:
test_rank = test.groupby(group_column).apply(
    lambda x: label_rank(x.AnswerId_x.values,
                         x.probabilities.values,
                         x.AnswerId_y.values))

In [None]:
test_rank.shape

In [None]:
test_rank[:5]

In [None]:
args_rank = 3
for i in range(1, args_rank+1):
    print('Accuracy @{} = {:.2%}'
          .format(i, (test_rank <= i).mean()))
mean_rank = test_rank.mean()
print('Mean Rank {:.4f}'.format(mean_rank))

In [None]:
def dupe_ranks(x):
    y = pd.Series({'AnswerId': x.AnswerId_x.iloc[0],
         'probabilities': tuple(x.probabilities.values),
         'AnswerIds': tuple(x.AnswerId_y.values),
         'Rank': label_rank(x.AnswerId_x.values, x.probabilities.values, x.AnswerId_y.values)
        })
    return y

In [None]:
test_rank = test.groupby(group_column).apply(dupe_ranks)

In [None]:
test_rank.head()

In [None]:
test_rank.Rank.mean()