# Tune Hyperparameters

Many machine learning algorithms require *hyperparameters*: parameter values that influence training but can't be determined from the training data itself. For example, when training a logistic regression model, you can use a *regularization rate* hyperparameter to counteract overfitting; when training a convolutional neural network, you can use hyperparameters like *learning rate* and *batch size* to control how weights are adjusted and how many data items are processed in a mini-batch, respectively. The choice of hyperparameter values can significantly affect the performance of a trained model and the time taken to train it, and you often need to try multiple combinations to find the best one.
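
To make "trying multiple combinations" concrete, here is a minimal sketch in plain Python that enumerates every combination of a small search space. The value lists mirror the search space used later in this notebook; the variable names are illustrative only and are not part of any Azure ML API.

```python
from itertools import product

# Candidate values for each hyperparameter (mirrors the search space used later in this notebook)
search_space = {
    "num_layers": [2, 3],
    "units": [16, 32, 64],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

# Build every combination as a dict of hyperparameter name -> value
names = list(search_space)
combinations = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(combinations))  # 2 * 3 * 3 = 18
print(combinations[0])
```

Even this small space yields 18 candidate models, which is why tuning is usually automated rather than done by hand.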

In this case, you'll train a spam classification model with three hyperparameters (number of layers, units per layer, and learning rate), but the principles apply to any kind of model you can train with Azure Machine Learning.

## Connect to your workspace

To get started, connect to your workspace.

> **Note**: If you haven't already established an authenticated session with your Azure subscription, you'll be prompted to authenticate by clicking a link, entering an authentication code, and signing into Azure.

In [1]:
import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from datetime import datetime
import pytz

# Load the workspace from the saved config file
ws = Workspace.from_config()
               
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

print("Current DateTime: ", datetime.now(pytz.timezone("America/New_York")).strftime("%m/%d/%Y %H:%M:%S"))

Ready to use Azure ML 1.41.0 to work with nahmed30-azureml-workspace
nahmed30-azureml-workspace
epe-poc-nazeer
centralus
16bc73b5-82be-47f2-b5ab-f2373344794c
Current DateTime:  09/07/2022 23:40:45


In [2]:
import os
from azureml.core.experiment import Experiment

# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_folder = 'emailspam_hyperdrive_experimentv5'
experiment_name = 'emailspam_hyperdrive_experimentv5'
project_folder = 'emailspam_hyperdrive_experimentv5'

os.makedirs(project_folder, exist_ok=True)

print('Project Folder is ready.')

experiment = Experiment(ws, experiment_name)
experiment

Project Folder is ready.


Name,Workspace,Report Page,Docs Page
emailspam_hyperdrive_experimentv5,nahmed30-azureml-workspace,Link to Azure Machine Learning studio,Link to Documentation


## Prepare data

In this lab, you'll use a dataset of messages labeled as *spam* or *ham* (not spam). Run the cell below to load the dataset from your workspace; it must already be registered under the name assigned to `key`.

In [3]:
from azureml.core.dataset import Dataset

# Load the registered dataset from the workspace
# NOTE: update the key to match the dataset name
key = "UdacityPrjEmailSpamDataSet"
description_text = "Spam Detection DataSet for Udacity Capstone Proj "

# Fail early with a clear message instead of calling a method on None
if key not in ws.datasets.keys():
    raise KeyError("Dataset '{}' is not registered in this workspace.".format(key))
dataset = ws.datasets[key]

df = dataset.to_pandas_dataframe()
df.describe()


Unnamed: 0,v1,v2,Column3,Column4,Column5
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2
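
The summary above shows that the label column `v1` has two values and that the most frequent one, `ham`, accounts for 4825 of the 5572 rows. As a small worked example, the sketch below uses only those two printed counts to compute the accuracy of a trivial model that always predicts `ham`. This is a useful reference point because accuracy is the primary metric in the tuning experiment later on.

```python
# Counts taken from the describe() output above
total_messages = 5572
ham_messages = 4825  # frequency of the most common label, "ham"

spam_messages = total_messages - ham_messages
baseline_accuracy = ham_messages / total_messages  # accuracy of always predicting "ham"

print(f"spam messages: {spam_messages}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

A tuned model only adds value if its accuracy is clearly above this baseline of roughly 87%.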


## Create compute

Hyperparameter tuning involves running multiple training iterations with different hyperparameter values and comparing the performance metrics of the resulting models. To do this efficiently, we'll take advantage of on-demand cloud compute and create a cluster, which allows multiple training iterations to run concurrently.
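
As a rough illustration of why concurrency matters, the sketch below estimates total run time using the run limits configured later in this notebook (16 total runs, 2 at a time). The per-run duration is a hypothetical figure chosen for the example, and the estimate ignores node start-up and queueing time.

```python
import math

max_total_runs = 16       # total child runs configured later in this notebook
max_concurrent_runs = 2   # child runs allowed to execute at the same time
minutes_per_run = 5       # hypothetical duration of one training run

# Runs execute in "waves" of at most max_concurrent_runs
waves = math.ceil(max_total_runs / max_concurrent_runs)
sequential_minutes = max_total_runs * minutes_per_run
concurrent_minutes = waves * minutes_per_run

print(waves, sequential_minutes, concurrent_minutes)
```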

Use the following code to specify an Azure Machine Learning compute cluster (it will be created if it doesn't already exist).

> **Important**: Change `cpu-cluster` in the code below to the name of your compute cluster before running it! Cluster names must be globally unique names between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',# for GPU, use "STANDARD_NC6"
                                                           #vm_priority = 'lowpriority', # optional
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Prepare a training script

Now let's create the environment definition and the training script in the project folder you created earlier.

> **Note**: Compute instances and clusters are based on standard Azure virtual machine images. For this exercise, the *Standard_DS11_v2* image is recommended to achieve the optimal balance of cost and performance. If your subscription has a quota that does not include this image, choose an alternative image; but bear in mind that a larger image may incur higher cost and a smaller image may not be sufficient to complete the tasks. Alternatively, ask your Azure administrator to extend your quota.

You'll need a Python environment to be hosted on the compute, so let's define that as a Conda configuration file.

In [5]:
%%writefile $project_folder/emailspam_hyperdrive_env_v5.yml
name: batch_environment
dependencies:
- python=3.8.5
- scikit-learn
- pandas
- numpy
- regex
- tensorflow=2.2.0
- nltk
- pip
- pip:
  - azureml-defaults


Overwriting emailspam_hyperdrive_experimentv5/emailspam_hyperdrive_env_v5.yml


In [6]:
%%writefile $project_folder/emailspam_hyperdrive_training_v5.py

import logging
import os
import csv
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn import datasets

import tensorflow as tf

import regex as re

from tensorflow import keras
from tensorflow.keras import layers

from sklearn.preprocessing import StandardScaler
from tensorflow.keras import models, layers
# from keras-tuner import HyperModel, RandomSearch, Hyperband, BayesianOptimization

import nltk
from nltk.corpus import stopwords

from sklearn.metrics import roc_auc_score, roc_curve

import azureml.core
from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

from azureml.core.dataset import Dataset

import argparse, joblib, os
from azureml.core import Run


# Get the experiment run context
run = Run.get_context()

# Get script arguments
parser = argparse.ArgumentParser()

# Input dataset
parser.add_argument("--input-data", type=str, dest='input_data', help='training dataset')

#    "--num_layers": uniform(2, 20),
#    "--learning_rate": choice(1e-2, 1e-3, 1e-4) 

parser.add_argument('--learning_rate', type=float, default=1e-3, help="Learning rate for the Adam optimizer")
parser.add_argument('--num_layers', type=int, default=10, help="Number of layers")
parser.add_argument('--units', type=int, default=64, help="Number of nodes")



# Add arguments to args collection
args = parser.parse_args()

# Log hyperparameter values
# Use the built-in float/int; the np.float and np.int aliases were removed in NumPy 1.24
run.log('learning_rate', float(args.learning_rate))
run.log("Number of Layers:", int(args.num_layers))
run.log("Number of Nodes:", int(args.units))

# load the email spam dataset
print("Loading Data...")
df = run.input_datasets['training_data'].to_pandas_dataframe() # Get the training data from the estimator input


nltk.download('stopwords')

stop_words= set(stopwords.words("english"))

stop_words.update(['https', 'http', 'amp', 'CO', 't', 'u', 'new', "I'm", "would"])


#--------------------------------------------------------

df = df.replace('spam', 1)
df = df.replace('ham', 0)

#--------------------------------------------------------
print(df.describe())

print(df.head())

def cleanText(text):
    whitespace = re.compile(r"\s+")
    web_address = re.compile(r"(?i)https?:\/\/[a-z0-9.~_\-\/]+")  # match both http and https
    user = re.compile(r"(?i)@[a-z0-9_]+")
    text = text.replace('.', '')
    text = whitespace.sub(' ', text)
    text = web_address.sub('', text)
    text = user.sub('', text)
    text = re.sub(r"\[[^()]*\]", "", text)
    text = re.sub(r"\d+", "", text)
    text = re.sub(r'[^\w\s]','',text)
    text = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", text)
    return text.lower()

df.v2 = [cleanText(item) for item in df.v2]

#---------------------------------------------------------

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.oov_token = '<oovToken>'
tokenizer.fit_on_texts(df.v2)
vocab = tokenizer.word_index
vocabCount = len(vocab)+1

#----------------------------------------------------------

SPLIT = 5000

xTrain = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(df.v2.to_numpy()), padding='pre', maxlen=171)
yTrain = df.v1.to_numpy()
dim = xTrain.shape[1]
xTest = xTrain[SPLIT:]
yTest = yTrain[SPLIT:]

xTrain = xTrain[:SPLIT]
yTrain = yTrain[:SPLIT]

print("******************************xTrain shape*************************")
print(xTrain.shape, yTrain.shape, xTest.shape, yTest.shape)


#-----------------------HP Model-----------------------------


hpmodel = tf.keras.Sequential()
hpmodel.add(tf.keras.layers.Embedding(input_dim=vocabCount+1, output_dim=64, input_length=dim))
hpmodel.add(tf.keras.layers.GlobalAveragePooling1D())
for i in range(args.num_layers):
        hpmodel.add(tf.keras.layers.Dense(args.units, activation='relu'))
hpmodel.add(tf.keras.layers.Dense(1, activation='sigmoid')) 


hpmodel.compile(loss='binary_crossentropy', 
                optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate), 
                        metrics=['accuracy'])

hpmodel.fit(xTrain, yTrain, epochs=10, shuffle=True)


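# calculate accuracy
# predict() returns sigmoid probabilities with shape (n, 1); flatten and threshold at 0.5
# so the comparison with the 0/1 labels is element-wise
y_hat = (hpmodel.predict(xTest).ravel() > 0.5).astype(int)
acc = np.average(y_hat == yTest.astype(int))
print('accuracy:', acc)
run.log('accuracy', float(acc))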


# calculate AUC
# y_scores = hpmodel.predict_proba(xTest)
# auc = roc_auc_score(yTest,y_scores[:,1])
# print('AUC: ' + str(auc))
# run.log('AUC', np.float(auc))

# Save the model in the run outputs
# os.makedirs('outputs', exist_ok=True)
# joblib.dump(value=hpmodel, filename='outputs/emailspam_model.pkl')


#--------------------------------------------------------------


text = "I'll meet you at time square on Monday morning 10 AM"
processedText = cleanText(text)

finalText = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences([processedText]), padding='pre', maxlen=171)
prediction = hpmodel.predict(finalText)
print("prediction shape", prediction.shape)
print(prediction)

print("****************HAM/SPAM******************************")
print(int(np.rint(prediction[0,0])))
print("****************HAM/SPAM******************************")

#-------------------MODEL OUTPUT----------------------------------------

text = "Congratulations, you have won a 10000 dollars lottery. Please give your bank details to claim the money"
processedText = cleanText(text)

finalText = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences([processedText]), padding='pre', maxlen=171)
prediction = hpmodel.predict(finalText)
print("prediction shape", prediction.shape)
print(prediction)

print("****************HAM/SPAM******************************")
print(int(np.rint(prediction[0,0])))
print("****************HAM/SPAM******************************")

run.complete()


Overwriting emailspam_hyperdrive_experimentv5/emailspam_hyperdrive_training_v5.py


In [7]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling, choice
from azureml.widgets import RunDetails

# Create a Python environment for the experiment
# hyper_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/hyperdrive_env.yml")

hyper_env = Environment.from_conda_specification("experiment_env", experiment_folder + "/emailspam_hyperdrive_env_v5.yml")


# Get the training dataset
emailspam_ds = ws.datasets.get("UdacityPrjEmailSpamDataSet")

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                                script='emailspam_hyperdrive_training_v5.py',
                                # Add non-hyperparameter arguments (in this case, the training dataset)
                                arguments = ['--input-data', emailspam_ds.as_named_input('training_data')],
                                environment=hyper_env,
                                compute_target = amlcompute_cluster_name)


# hp.Int('num_layers', 2, 20)
# hp.Int('units_' + str(i), min_value=32,max_value=512, step=32)
# hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])), 

# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters


params = RandomParameterSampling( 
    {
    "--num_layers": choice(range(2, 4)),
    "--units": choice(16, 32, 64),
    "--learning_rate": choice(1e-2, 1e-3, 1e-4) 
    }
    )


# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(run_config=script_config, 
                          hyperparameter_sampling=params, 
                          policy=None, # No early stopping policy
                          primary_metric_name='accuracy', # Find the highest Accuracy metric
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=16, # Restrict the experiment to 16 iterations
                          max_concurrent_runs=2) # Run up to 2 iterations in parallel

# Run the experiment
# experiment = Experiment(workspace=ws, name='mslearn-emailspam-hyperdrive')


run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()


**Straight to hyperparameter tuning**

## Run a hyperparameter tuning experiment

Azure Machine Learning includes a hyperparameter tuning capability through *hyperdrive* experiments. These experiments launch multiple child runs, each with a different hyperparameter combination. The run producing the best model (as determined by the logged target performance metric for which you want to optimize) can be identified, and its trained model selected for registration and deployment.
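
The idea of "many child runs, pick the best by the primary metric" can be sketched in plain Python. The example below is a local simulation, not the hyperdrive service: `fake_accuracy` is a hypothetical stand-in that returns a made-up score instead of training a model, and all function and variable names are illustrative.

```python
import math
import random

# Search space mirroring the one defined in this notebook
search_space = {
    "num_layers": [2, 3],
    "units": [16, 32, 64],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def sample_configuration(rng):
    """Draw one value per hyperparameter, independently and at random."""
    return {name: rng.choice(values) for name, values in search_space.items()}

def fake_accuracy(config):
    """Hypothetical stand-in for a training run: returns a made-up score, not a real metric."""
    return (0.80
            + 0.01 * config["num_layers"]
            + 0.0005 * config["units"]
            - 0.02 * abs(math.log10(config["learning_rate"]) + 3))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
child_runs = []
for run_number in range(16):  # corresponds to max_total_runs
    config = sample_configuration(rng)
    child_runs.append({"run": run_number, "config": config, "accuracy": fake_accuracy(config)})

# The "best run" is the child run with the highest primary metric
best_child = max(child_runs, key=lambda r: r["accuracy"])
print(best_child)
```

In the real experiment, each child run executes the training script on the compute cluster and the logged `accuracy` value plays the role of the simulated score.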

> **Note**: In this example, we aren't specifying an early stopping policy. Such a policy is only relevant if the training script performs multiple training iterations, logging the primary metric for each iteration. This approach is typically employed when training deep neural network models over multiple *epochs*.
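
For context, here is a simplified sketch of how a slack-based early stopping rule can work, similar in spirit to the `BanditPolicy` that Azure Machine Learning offers. It is plain Python written for illustration, not the service's implementation: the function name, the metric values, and the slack factor are all assumptions. Applying such a rule to this notebook would also require the training script to log accuracy after each epoch rather than once at the end.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack (goal: maximize the metric)."""
    threshold = best_metric / (1 + slack_factor)
    return run_metric < threshold

best_accuracy_so_far = 0.8  # illustrative best metric at the current evaluation interval
slack_factor = 0.2          # illustrative slack; threshold is 0.8 / 1.2

# Hypothetical metrics reported by three child runs at the same interval
reported = {"run_a": 0.78, "run_b": 0.70, "run_c": 0.60}
decisions = {
    name: should_terminate(accuracy, best_accuracy_so_far, slack_factor)
    for name, accuracy in reported.items()
}
print(decisions)
```

Only the run that falls below the threshold is stopped; the others continue training.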

In [None]:
run.get_metrics()

In [None]:
#------------------------HP Best run----------------------------

# Print all child runs, sorted by the primary metric
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)

# Get the best run, and its metrics and arguments
best_run = run.get_best_run_by_primary_metric()

print("-----------------BEST RUN----------------------------")
print(best_run)

best_run_metrics = best_run.get_metrics()

script_arguments = best_run.get_details()['runDefinition']['arguments']
print('Best Run Id: ', best_run.id)
# print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['accuracy'])
print(' -Arguments:',script_arguments)


https://www.tensorflow.org/tutorials/keras/keras_tuner

Now that you've found the best run, you can register the model it trained.
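
Before registering anything, it can help to turn the best run's flat argument list into named hyperparameters. The sketch below assumes the list alternates between flag and value; the sample list is hypothetical and only illustrates the shape, and `parse_arguments` is a helper defined here, not an Azure ML function.

```python
def parse_arguments(arguments):
    """Turn a flat ['--name', 'value', ...] list into a {'name': 'value'} dict."""
    pairs = zip(arguments[0::2], arguments[1::2])
    return {name.lstrip("-"): value for name, value in pairs}

# Hypothetical argument list, shaped like the one printed for the best run
example_arguments = [
    "--input-data", "training_data",
    "--num_layers", "3",
    "--units", "64",
    "--learning_rate", "0.001",
]

best_hyperparameters = parse_arguments(example_arguments)
print(best_hyperparameters)
```

The resulting dictionary can be stored alongside the model, for example as tags, so the winning configuration stays traceable.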

> **More Information**: For more information about Hyperdrive, see the [Azure ML documentation](https://docs.microsoft.com/azure/machine-learning/how-to-tune-hyperparameters).