# Hyper-parameter tuning

**Learning Objectives**
1. Understand various approaches to hyperparameter tuning
2. Automate hyperparameter tuning using AI Platform

## Introduction

Let's see if we can improve upon that by tuning our hyperparameters.

Hyperparameters are parameters that are set *prior* to training a model, as opposed to parameters which are learned *during* training. 

These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.

Here are the four most common ways to finding the ideal hyperparameters:
1. Manual
2. Grid Search
3. Random Search
4. Bayesian Optimzation

**1. Manual**

Traditionally, hyperparameter tuning is a manual trial and error process. A data scientist has some intution about suitable hyperparameters which they use as a starting point, then they observe the result and use that information to try a new set of hyperparameters to try to beat the existing performance. 

Pros
- Educational, builds up your intuition as a data scientist
- Inexpensive because only one trial is conducted at a time

Cons
- Requires alot of time and patience

**2. Grid Search**

On the other extreme we can use grid search. Define a discrete set of values to try for each hyperparameter then try every possible combination. 

Pros
- Can run hundreds of trials in parallel using the cloud
- Gauranteed to find the best solution within the search space

Cons
- Expensive

**3. Random Search**

Alternatively define a range for each hyperparamter (e.g. 0-256) and sample uniformly at random from that range. 

Pros
- Can run hundreds of trials in parallel using the cloud
- Requires less trials than Grid Search to find a good solution

Cons
- Expensive (but less so than Grid Search)

**4. Bayesian Optimization**

Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from  past trials to select parameters for future trials. The details of how this is done is beyond the scope of this notebook, but if you're interested you can read how it works here [here](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization). 

Pros
- Picks values intelligenty based on results from past trials
- Less expensive because requires fewer trials to get a good result

Cons
- Requires sequential trials for best results, takes longer

**AI Platform HyperTune**

AI Platform HyperTune, powered by [Google Vizier](https://ai.google/research/pubs/pub46180), uses Bayesian Optimization by default, but [also supports](https://cloud.google.com/ml-engine/docs/tensorflow/hyperparameter-tuning-overview#search_algorithms) Grid Search and Random Search. 


When tuning just a few hyperparameters (say less than 4), Grid Search and Random Search work well, but when tunining several hyperparameters and the search space is large Bayesian Optimization is best.

In [2]:
PROJECT = "munn-sandbox"  # Replace with your PROJECT
BUCKET = "munn-sandbox"  # Replace with your BUCKET
REGION = "us-central1"            # Choose an available region for AI Platform
TFVERSION = "2.1"                # TF version for AI Platform to use

In [3]:
import os 
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = TFVERSION 

## Move code into python package

In the [previous lab](), we moved our code into a python package for training on Cloud AI Platform. Let's just check that the files are there. You should see the following files in the `taxifare/trainer` directory:
 - `__init__.py`
 - `model.py`
 - `task.py`

In [8]:
!ls -la taxifare/trainer

total 24
drwxr-xr-x 3 jupyter jupyter 4096 Mar 25 22:16 .
drwxr-xr-x 6 jupyter jupyter 4096 Mar 25 22:16 ..
-rw-r--r-- 1 jupyter jupyter    0 Mar 25 22:06 __init__.py
drwxr-xr-x 2 jupyter jupyter 4096 Mar 25 22:07 .ipynb_checkpoints
-rw-r--r-- 1 jupyter jupyter 7356 Mar 25 22:16 model.py
-rw-r--r-- 1 jupyter jupyter 1651 Mar 25 22:16 task.py


To use hyperparameter tuning in your training job you must perform the following steps:

 1. Specify the hyperparameter tuning configuration for your training job by including a HyperparameterSpec in your TrainingInput object.

 2. Include the following code in your training application:

  - Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial.
Add your hyperparameter metric to the summary for your graph.

  - To submit a hyperparameter tuning job, we must modify `model.py` and `task.py` to expose any variables we want to tune as command line arguments.

### Modify model.py

In [11]:
%%writefile ./taxifare/trainer/model.py
import datetime
import logging
import os
import shutil

import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import (
    ModelCheckpoint,
    TensorBoard,
)
from tensorflow import feature_column as fc
from tensorflow.keras.activations import relu
from tensorflow.keras.layers import (
    Dense,
    DenseFeatures,
    Input,
    Lambda,
)
from tensorflow.keras.models import Model


logging.info(tf.version.VERSION)


CSV_COLUMNS = [
        'fare_amount',
        'pickup_datetime',
        'pickup_longitude',
        'pickup_latitude',
        'dropoff_longitude',
        'dropoff_latitude',
        'passenger_count',
        'key',
]
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]
DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']


def features_and_labels(row_data):
    for unwanted_col in ['key']:
        row_data.pop(unwanted_col)
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label


def load_dataset(pattern, batch_size, num_repeat):
    dataset = tf.data.experimental.make_csv_dataset(
        file_pattern=pattern,
        batch_size=batch_size,
        column_names=CSV_COLUMNS,
        column_defaults=DEFAULTS,
        num_epochs=num_repeat,
    )
    return dataset.map(features_and_labels)


def create_train_dataset(pattern, batch_size):
    dataset = load_dataset(pattern, batch_size, num_repeat=None)
    return dataset.prefetch(1)


def create_eval_dataset(pattern, batch_size):
    dataset = load_dataset(pattern, batch_size, num_repeat=1)
    return dataset.prefetch(1)


def parse_datetime(s):
    if type(s) is not str:
        s = s.numpy().decode('utf-8')
    return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S %Z")


def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff*londiff + latdiff*latdiff)


def get_dayofweek(s):
    ts = parse_datetime(s)
    return DAYS[ts.weekday()]


@tf.function
def dayofweek(ts_in):
    return tf.map_fn(
        lambda s: tf.py_function(get_dayofweek, inp=[s], Tout=tf.string),
        ts_in
    )


@tf.function
def fare_thresh(x):
    return 60 * relu(x)


def transform(inputs, NUMERIC_COLS, STRING_COLS, nbuckets):
    # Pass-through columns
    transformed = inputs.copy()
    del transformed['pickup_datetime']

    feature_columns = {
        colname: fc.numeric_column(colname)
        for colname in NUMERIC_COLS
    }

    # Scaling longitude from range [-70, -78] to [0, 1]
    for lon_col in ['pickup_longitude', 'dropoff_longitude']:
        transformed[lon_col] = Lambda(
            lambda x: (x + 78)/8.0,
            name='scale_{}'.format(lon_col)
        )(inputs[lon_col])

    # Scaling latitude from range [37, 45] to [0, 1]
    for lat_col in ['pickup_latitude', 'dropoff_latitude']:
        transformed[lat_col] = Lambda(
            lambda x: (x - 37)/8.0,
            name='scale_{}'.format(lat_col)
        )(inputs[lat_col])

    # Adding Euclidean dist (no need to be accurate: NN will calibrate it)
    transformed['euclidean'] = Lambda(euclidean, name='euclidean')([
        inputs['pickup_longitude'],
        inputs['pickup_latitude'],
        inputs['dropoff_longitude'],
        inputs['dropoff_latitude']
    ])
    feature_columns['euclidean'] = fc.numeric_column('euclidean')

    # hour of day from timestamp of form '2010-02-08 09:17:00+00:00'
    transformed['hourofday'] = Lambda(
        lambda x: tf.strings.to_number(
            tf.strings.substr(x, 11, 2), out_type=tf.dtypes.int32),
        name='hourofday'
    )(inputs['pickup_datetime'])
    feature_columns['hourofday'] = fc.indicator_column(
        fc.categorical_column_with_identity(
            'hourofday', num_buckets=24))

    latbuckets = np.linspace(0, 1, nbuckets).tolist()
    lonbuckets = np.linspace(0, 1, nbuckets).tolist()
    b_plat = fc.bucketized_column(
        feature_columns['pickup_latitude'], latbuckets)
    b_dlat = fc.bucketized_column(
            feature_columns['dropoff_latitude'], latbuckets)
    b_plon = fc.bucketized_column(
            feature_columns['pickup_longitude'], lonbuckets)
    b_dlon = fc.bucketized_column(
            feature_columns['dropoff_longitude'], lonbuckets)
    ploc = fc.crossed_column(
            [b_plat, b_plon], nbuckets * nbuckets)
    dloc = fc.crossed_column(
            [b_dlat, b_dlon], nbuckets * nbuckets)
    pd_pair = fc.crossed_column([ploc, dloc], nbuckets ** 4)
    feature_columns['pickup_and_dropoff'] = fc.embedding_column(
            pd_pair, 100)

    return transformed, feature_columns


def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))


def build_dnn_model(nbuckets, nnsize, lr):
    # input layer is all float except for pickup_datetime which is a string
    STRING_COLS = ['pickup_datetime']
    NUMERIC_COLS = (
            set(CSV_COLUMNS) - set([LABEL_COLUMN, 'key']) - set(STRING_COLS)
    )
    inputs = {
        colname: Input(name=colname, shape=(), dtype='float32')
        for colname in NUMERIC_COLS
    }
    inputs.update({
        colname: Input(name=colname, shape=(), dtype='string')
        for colname in STRING_COLS
    })

    # transforms
    transformed, feature_columns = transform(
        inputs, NUMERIC_COLS, STRING_COLS, nbuckets=nbuckets)
    dnn_inputs = DenseFeatures(feature_columns.values())(transformed)

    x = dnn_inputs
    for layer, neurons in enumerate(nnsize):
        x = Dense(neurons, activation='relu', name='h{}'.format(layer))(x)
    output = Dense(1, name='fare')(x)

    model = Model(inputs, output)
    
    # TODO add in custom lr to optimizer
    my_optimizer = tf.keras.optimizers.Adam(learning_rate=lr, name='Adam')
    model.compile(optimizer=my_optimizer, loss='mse', metrics=[rmse, 'mse'])
    return model


def train_and_evaluate(hparams):
    batch_size = hparams['batch_size'] # this would become a todo
    eval_data_path = hparams['eval_data_path']
    nnsize = hparams['nnsize'] # this would become a todo
    nbuckets = hparams['nbuckets'] # this would become a todo
    lr = hparams['lr']  # TODO
    num_evals = hparams['num_evals']
    num_examples_to_train_on = hparams['num_examples_to_train_on']
    output_dir = hparams['output_dir']
    train_data_path = hparams['train_data_path']

    timestamp = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    savedmodel_dir = os.path.join(output_dir, 'export/savedmodel')
    model_export_path = os.path.join(savedmodel_dir, timestamp)
    checkpoint_path = os.path.join(output_dir, 'checkpoints')
    tensorboard_path = os.path.join(output_dir, 'tensorboard')

    if tf.io.gfile.exists(output_dir):
        tf.io.gfile.rmtree(output_dir)

    model = build_dnn_model(nbuckets, nnsize, lr)
    logging.info(model.summary())

    trainds = create_train_dataset(train_data_path, batch_size)
    evalds = create_eval_dataset(eval_data_path, batch_size)

    steps_per_epoch = num_examples_to_train_on // (batch_size * num_evals)

    checkpoint_cb = ModelCheckpoint(
        checkpoint_path,
        save_weights_only=True,
        verbose=1
    )
    tensorboard_cb = TensorBoard(tensorboard_path)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=num_evals,
        steps_per_epoch=max(1, steps_per_epoch),
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[checkpoint_cb, tensorboard_cb]
    )

    # Exporting the model with default serving function.
    tf.saved_model.save(model, model_export_path)
    return history



Overwriting ./taxifare/trainer/model.py


### Modify task.py

In [48]:
%%writefile taxifare/trainer/task.py
import argparse
import os

from trainer import model


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    
    # will become a TODO
    parser.add_argument(
        "--batch_size",
        help = "Batch size for training steps",
        type = int,
        default = 32
    )

    parser.add_argument(
        "--eval_data_path",
        help = "GCS location pattern of eval files",
        required = True
    )
    
    # will become a TODO
    parser.add_argument(
        "--nnsize",
        help = "Hidden layer sizes (provide space-separated sizes)",
        nargs = "+",
        type = int,
        default=[32, 8]
    )

    # will become a TODO
    parser.add_argument(
        "--nbuckets",
        help = "Number of buckets to divide lat and lon with",
        type = int,
        default = 10
    )
    
    # TODO
    parser.add_argument(
        "--lr",
        help = "learning rate for optimizer",
        type = float,
        default = 0.001
    )
    
    parser.add_argument(
        "--num_evals",
        help = "Number of times to evaluate model on eval data training.",
        type = int,
        default = 5
    )

    parser.add_argument(
        "--num_examples_to_train_on",
        help = "Number of examples to train on.",
        type = int,
        default = 100
    )

    parser.add_argument(
    "--output_dir",
        help = "GCS location to write checkpoints and export models",
        required = True
    )

    parser.add_argument(
        "--train_data_path",
        help = "GCS location pattern of train files containing eval URLs",
        required = True
    )

    parser.add_argument(
        "--job-dir",
        help = "this model ignores this field, but it is required by gcloud",
        default = "junk"
    )

    args = parser.parse_args()
    hparams = args.__dict__
    
    # Append trial_id to path so trials don't overwrite each other
    hparams["output_dir"] = os.path.join(
        args["output_dir"],
        json.loads(
            os.environ.get("TF_CONFIG", "{}")
        ).get("task", {}).get("trial", "")
    )     
    hparams.pop("job-dir", None)

    model.train_and_evaluate(hparams)


Overwriting taxifare/trainer/task.py


### Create config.yaml file

Specify the hyperparameter tuning configuration for your training job
Create a HyperparameterSpec object to hold the hyperparameter tuning configuration for your training job, and add the HyperparameterSpec as the hyperparameters object in your TrainingInput object.

In your HyperparameterSpec, set the hyperparameterMetricTag to a value representing your chosen metric. If you don't specify a hyperparameterMetricTag, AI Platform Training looks for a metric with the name training/hptuning/metric. The following example shows how to create a configuration for a metric named metric1:

In [42]:
%%writefile hptuning_config.yaml
trainingInput:
  scaleTier: BASIC
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 50
    maxParallelTrials: 5
    hyperparameterMetricTag: rmse
    enableTrialEarlyStopping: True
    params:
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
    - parameterName: nbuckets
      type: INTEGER
      minValue: 10
      maxValue: 25
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 15
      - 30
      - 50

Overwriting hptuning_config.yaml


#### Report your hyperparameter metric to AI Platform Training

The way to report your hyperparameter metric to the AI Platform Training service depends on whether you are using TensorFlow for training or not. It also depends on whether you are using a runtime version or a custom container for training.

We recommend that your training code reports your hyperparameter metric to AI Platform Training frequently in order to take advantage of early stopping.

TensorFlow with a runtime version
If you use an AI Platform Training runtime version and train with TensorFlow, then you can report your hyperparameter metric to AI Platform Training by writing the metric to a TensorFlow summary. Use one of the following functions:

tf.compat.v1.summary.FileWriter.add_summary (also known as tf.summary.FileWriter.add_summary in TensorFlow 1.x)
tf.summary.scalar (only in TensorFlow 2.x)

In [13]:
<example of custom eval metric>

SyntaxError: invalid syntax (<ipython-input-13-9af87942650c>, line 1)

In [25]:
%%bash

EVAL_DATA_PATH=./taxifare/tests/data/taxi-valid*
TRAIN_DATA_PATH=./taxifare/tests/data/taxi-train*
OUTPUT_DIR=./taxifare-model

python3 ./taxifare/trainer/task.py \
--eval_data_path $EVAL_DATA_PATH \
--output_dir $OUTPUT_DIR \
--train_data_path $TRAIN_DATA_PATH \
--batch_size 5 \
--num_examples_to_train_on 100 \
--num_evals 1 \
--nbuckets 10 \
--nnsize 32 8

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
dropoff_latitude (InputLayer)   [(None,)]            0                                            
__________________________________________________________________________________________________
dropoff_longitude (InputLayer)  [(None,)]            0                                            
__________________________________________________________________________________________________
pickup_longitude (InputLayer)   [(None,)]            0                                            
__________________________________________________________________________________________________
pickup_latitude (InputLayer)    [(None,)]            0                                            
______________________________________________________________________________________________

2020-03-27 17:01:46.529586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-27 17:01:47.775619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-03-27 17:01:47.777498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-03-27 17:01:48.723672: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-27 17:01:50.703900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-27 17:01:50.704643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235G

In [29]:
!ls -la

total 216
drwxr-xr-x 7 jupyter jupyter  4096 Mar 27 17:25 .
drwxr-xr-x 6 jupyter jupyter  4096 Mar 25 13:54 ..
-rw-r--r-- 1 jupyter jupyter  6903 Mar 24 21:44 0_export_data_from_bq_to_gcs.ipynb
-rw-r--r-- 1 jupyter jupyter 25390 Mar 27 15:21 1_training_at_scale.ipynb
-rw-r--r-- 1 jupyter jupyter 45383 Mar 27 17:25 2_hyperparameter_tuning.ipynb
-rw-r--r-- 1 jupyter jupyter 14611 Mar 26 17:19 3_kubeflow_pipelines.ipynb
-rw-r--r-- 1 jupyter jupyter 47608 Mar 25 13:54 4a_streaming_data_training.ipynb
-rw-r--r-- 1 jupyter jupyter 18620 Mar 25 13:54 4b_streaming_data_inference.ipynb
-rw-r--r-- 1 jupyter jupyter   269 Mar 25 17:58 Dockerfile
-rw-r--r-- 1 jupyter jupyter   681 Mar 27 16:59 hptuning_config.yaml
drwxr-xr-x 2 jupyter jupyter  4096 Mar 27 17:00 .ipynb_checkpoints
-rw-r--r-- 1 jupyter jupyter   599 Mar 24 21:44 Makefile
drwxr-xr-x 3 jupyter jupyter  4096 Mar 24 21:44 pipelines
-rw-r--r-- 1 jupyter jupyter    30 Mar 25 22:16 requirements.txt
-rw-r--r-- 1 jupyter jupyter   208 Mar 27

In [49]:
%%bash
# Replace with your BUCKET and REGION
BUCKET="munn-sandbox"
REGION="munn-sandbox"
TFVERSION="2.1"

OUTDIR=gs://${BUCKET}/taxifare/trained_model
JOBID=taxifare_$(date -u +%y%m%d_%H%M%S)
echo ${OUTDIR} ${REGION} ${JOBID}
gsutil -m rm -rf ${OUTDIR}

# Model and training hyperparameters
BATCH_SIZE=15
NUM_EXAMPLES_TO_TRAIN_ON=100
NUM_EVALS=10
NBUCKETS=10
NNSIZE="32 8"

# GCS paths
GCS_PROJECT_PATH=gs://${BUCKET}/taxifare
DATA_PATH=${GCS_PROJECT_PATH}/data
OUTPUT_DIR=${GCS_PROJECT_PATH}/model_hptune
TRAIN_DATA_PATH=${DATA_PATH}/taxi-train*
EVAL_DATA_PATH=${DATA_PATH}/taxi-valid*

gcloud ai-platform jobs submit training ${JOBID} \
    --module-name=trainer.task \
    --package-path=taxifare/trainer \
    --staging-bucket=gs://${BUCKET} \
    --config=hptuning_config.yaml \
    --python-version=3.7 \
    --runtime-version=${TFVERSION} \
    -- \
    --eval_data_path=${EVAL_DATA_PATH} \
    --output_dir=${OUTPUT_DIR} \
    --train_data_path=${TRAIN_DATA_PATH} \
    --batch_size ${BATCH_SIZE} \
    --num_examples_to_train_on ${NUM_EXAMPLES_TO_TRAIN_ON} \
    --num_evals ${NUM_EVALS} \
    --nbuckets ${NBUCKETS} \
    --nnsize ${NNSIZE}

gs://munn-sandbox/taxifare/trained_model munn-sandbox taxifare_200327_184752
jobId: taxifare_200327_184752
state: QUEUED


CommandException: 1 files/objects could not be removed.
Job [taxifare_200327_184752] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe taxifare_200327_184752

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs taxifare_200327_184752


In [None]:
OUTDIR="gs://{}/taxifare/trained_hp_tune".format(BUCKET)
!gsutil -m rm -rf {OUTDIR} # start fresh each time
!gcloud ai-platform jobs submit training taxifare_$(date -u +%y%m%d_%H%M%S) \
    --package-path=taxifaremodel \
    --module-name=taxifaremodel.task \
    --config=hyperparam.yaml \
    --job-dir=gs://{BUCKET}/taxifare \
    --python-version=3.5 \
    --runtime-version={TFVERSION} \
    --region={REGION} \
    -- \
    --train_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-train.csv \
    --eval_data_path=gs://{BUCKET}/taxifare/smallinput/taxi-valid.csv  \
    --train_steps=5000 \
    --output_dir={OUTDIR}