# Hyperparameter tuning

**Learning Objectives**
1. Learn how to use `cloudml-hypertune` to report the results for Cloud hyperparameter tuning trial runs
2. Learn how to configure the `.yaml` file for submitting a Cloud hyperparameter tuning job
3. Submit a hyperparameter tuning job to Vertex AI

## Introduction

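In the previous lab, we moved our training code into a Python package for training on Vertex AI. Let's see if we can improve upon that model by tuning our hyperparameters.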

Hyperparameters are parameters that are set *prior* to training a model, as opposed to parameters which are learned *during* training. 

These include learning rate and batch size, but also model design parameters such as type of activation function and number of hidden units.

Here are the four most common ways of finding the ideal hyperparameters:
1. Manual
2. Grid Search
3. Random Search
4. Bayesian Optimization

**1. Manual**

Traditionally, hyperparameter tuning is a manual trial-and-error process. A data scientist starts from hyperparameters that intuition suggests are suitable, observes the result, and uses that information to choose a new set of hyperparameters that might beat the existing performance.

Pros
- Educational, builds up your intuition as a data scientist
- Inexpensive because only one trial is conducted at a time

Cons
- Requires a lot of time and patience

**2. Grid Search**

At the other extreme, we can use grid search: define a discrete set of values to try for each hyperparameter, then try every possible combination.

Pros
- Can run hundreds of trials in parallel using the cloud
- Guaranteed to find the best combination among the grid values tried

Cons
- Expensive
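
As a minimal sketch of what "every possible combination" means, the snippet below enumerates a small grid with the standard library. The hyperparameter names and candidate values are illustrative assumptions, not a recommended search space.

```python
import itertools

# Illustrative grid (assumed values): each hyperparameter gets a discrete list.
search_space = {
    "lr": [0.001, 0.01, 0.1],
    "batch_size": [16, 64, 128],
    "nbuckets": [10, 25],
}


def grid_trials(space):
    """Yield one dict per combination of the candidate values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


trials = list(grid_trials(search_space))
# The number of trials is the product of the list lengths: 3 * 3 * 2.
print(len(trials))
print(trials[0])
```

The trial count grows multiplicatively with every hyperparameter added, which is why grid search becomes expensive quickly.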

**3. Random Search**

Alternatively, define a range for each hyperparameter (e.g. 0-256) and sample uniformly at random from that range.

Pros
- Can run hundreds of trials in parallel using the cloud
- Requires fewer trials than Grid Search to find a good solution

Cons
- Expensive (but less so than Grid Search)
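
A minimal sketch of random search with the standard library follows. The ranges are assumptions chosen for illustration; note that the learning rate is sampled uniformly in log space rather than linearly, a common variation when a range spans several orders of magnitude.

```python
import math
import random


def sample_trial(rng):
    """Draw one random hyperparameter configuration (illustrative ranges)."""
    # Sample the exponent uniformly so each decade is equally likely.
    log_lr = rng.uniform(math.log10(1e-4), math.log10(1e-1))
    return {
        "lr": 10**log_lr,
        "nbuckets": rng.randint(10, 25),  # inclusive on both ends
        "batch_size": rng.choice([16, 64, 128]),
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
random_trials = [sample_trial(rng) for _ in range(8)]
for trial in random_trials:
    print(trial)
```

Unlike grid search, the number of trials here is chosen independently of the number of hyperparameters.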

**4. Bayesian Optimization**

Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from past trials to select parameters for future trials. The details of how this is done are beyond the scope of this notebook, but if you're interested you can read how it works [here](https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization).

Pros
- Picks values intelligently based on results from past trials
- Less expensive because it requires fewer trials to get a good result

Cons
- Requires sequential trials for best results, takes longer

**Vertex AI HyperTune**

Vertex AI HyperTune, powered by [Google Vizier](https://ai.google/research/pubs/pub46180), uses Bayesian Optimization by default, but [also supports](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview#search_algorithms) Grid Search and Random Search. 


When tuning just a few hyperparameters (say fewer than 4), Grid Search and Random Search work well, but when tuning many hyperparameters over a large search space, Bayesian Optimization is best.

In [None]:
import os
import warnings
from datetime import datetime

import tensorboard
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

%load_ext tensorboard
warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

In [None]:
# Change below if necessary
PROJECT = !gcloud config get-value project  # noqa: E999
PROJECT = PROJECT[0]
BUCKET = PROJECT
REGION = "us-central1"

%env PROJECT=$PROJECT
%env BUCKET=$BUCKET
%env REGION=$REGION

In [None]:
%%bash
gcloud config set project $PROJECT
gcloud config set ai/region $REGION

## Make code compatible with Vertex AI Training Service
In order to make our code compatible with Vertex AI Training Service we need to make the following changes:

1. Upload data to Google Cloud Storage 
2. Move code into a trainer Python package
3. Submit training job with `gcloud` to train on Vertex AI

### Upload data to Google Cloud Storage (GCS)

Cloud services don't have access to our local files, so we need to upload them to a location the Cloud servers can read from. In this case we'll use GCS.


In [None]:
!gsutil ls gs://$BUCKET/taxifare/data

## Move code into Python package

In the [previous lab](./1_training_at_scale.ipynb), we moved our code into a Python package for training on Vertex AI. Let's check that the files are there. You should see the following files in the `taxifare/trainer` directory:
 - `__init__.py`
 - `model.py`
 - `task.py`

In [None]:
!ls -la taxifare/trainer

To use hyperparameter tuning in your training job you must perform the following steps:

 1. Specify the hyperparameter tuning configuration for your training job by including `parameters` in the `StudySpec` of your Hyperparameter Tuning Job.

 2. Include the following code in your training application:

  - Parse the command-line arguments representing the hyperparameters you want to tune, and use the values to set the hyperparameters for your training trial (we already exposed these parameters as command-line arguments in the earlier lab).

  - Report your hyperparameter metrics during training. Note that while you could just report the metrics at the end of training, it is better to set up a callback, to take advantage of Early Stopping.

  - Read in the environment variable `$AIP_MODEL_DIR`, set by Vertex AI and containing the trial number, as our `--output_dir`. As the training code will be submitted several times in a parallelized fashion, it is safer to use this variable than to try to assemble a unique ID within the trainer code.
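
The argument-parsing and `AIP_MODEL_DIR` steps above can be sketched with the standard library alone. The bucket path and the `./local_output` fallback are hypothetical values used only to make the sketch runnable outside Vertex AI; on the service, the variable is set for each trial.

```python
import argparse
import os


def parse_args(argv):
    """Parse tunable hyperparameters; default the output dir to AIP_MODEL_DIR."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument(
        "--output_dir",
        # "./local_output" is an assumed fallback for local runs.
        default=os.getenv("AIP_MODEL_DIR", "./local_output"),
    )
    # parse_known_args ignores any extra flags the service may append.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the variable a trial would see (hypothetical path).
os.environ["AIP_MODEL_DIR"] = "gs://hypothetical-bucket/taxifare/3/model"

# The tuning service passes each tuned hyperparameter as a command-line flag.
args = parse_args(["--lr=0.01", "--batch_size=64"])
print(args.lr, args.batch_size, args.output_dir)
```

Because every trial receives its own directory through the environment variable, parallel trials do not overwrite each other's checkpoints.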

### Modify model.py

In [None]:
%%writefile ./taxifare/trainer/model.py
"""Data prep, train and evaluate DNN model."""

import logging
import os
import hypertune

import numpy as np
import tensorflow as tf
import keras
from keras import callbacks
from keras.layers import (
    Concatenate,
    Dense,
    Discretization,
    Embedding,
    Flatten,
    HashedCrossing,
    Input,
    Lambda,
)

def parse_csv(row):
    ds = tf.strings.split(row, ",")
    # Label: fare_amount
    label = tf.strings.to_number(ds[0])
    # Feature: pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude
    feature = tf.strings.to_number(ds[2:6])  # use some features only
    # Passing feature in tuple so that we can handle them separately.
    return (feature[0], feature[1], feature[2], feature[3]), label


def create_dataset(pattern, batch_size, num_repeat, mode="eval"):
    ds = tf.data.Dataset.list_files(pattern)
    ds = ds.flat_map(tf.data.TextLineDataset)
    ds = ds.map(parse_csv)
    if mode == "train":
        # shuffle() returns a new dataset, so the result must be reassigned.
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.repeat(num_repeat).batch(batch_size, drop_remainder=True)
    return ds

def lat_lon_parser(row, pick_lat):
    ds = tf.strings.split(row, ",")
    # latitude idx: 3 and 5, longitude idx: 2 and 4
    idx = [3,5] if pick_lat else [2,4]
    return tf.strings.to_number(tf.gather(ds, idx))

def adapt_normalize(train_data_path):
    ds = tf.data.Dataset.list_files(train_data_path)
    ds = ds.flat_map(tf.data.TextLineDataset)
    lat_values = ds.map(lambda x: lat_lon_parser(x, True)).batch(10000)
    lon_values = ds.map(lambda x: lat_lon_parser(x, False)).batch(10000)

    lat_scaler = keras.layers.Normalization(axis=None)
    lon_scaler = keras.layers.Normalization(axis=None)
    lat_scaler.adapt(lat_values)
    lon_scaler.adapt(lon_values)

    print("Computed statistics for latitude:")
    print(f"mean: {lat_scaler.mean}, variance: {lat_scaler.variance}")
    print("+++++")
    print("Computed statistics for longitude:")
    print(f"mean: {lon_scaler.mean}, variance: {lon_scaler.variance}")

    return lat_scaler, lon_scaler


def euclidean(params):
    lon1, lat1, lon2, lat2 = params
    londiff = lon2 - lon1
    latdiff = lat2 - lat1
    return tf.sqrt(londiff * londiff + latdiff * latdiff)


def transform(inputs, nbuckets, normalizers):
    lat_scaler, lon_scaler = normalizers

    # Normalize longitude
    scaled_plon = lon_scaler(inputs["pickup_longitude"])
    scaled_dlon = lon_scaler(inputs["dropoff_longitude"])

    # Normalize latitude
    scaled_plat = lat_scaler(inputs["pickup_latitude"])
    scaled_dlat = lat_scaler(inputs["dropoff_latitude"])

    # Lambda layer for the custom euclidean function
    euclidean_distance = Lambda(euclidean, name="euclidean")(
        [scaled_plon, scaled_plat, scaled_dlon, scaled_dlat]
    )

    # Discretization
    latbuckets = np.linspace(start=-5, stop=5, num=nbuckets).tolist()
    lonbuckets = np.linspace(start=-5, stop=5, num=nbuckets).tolist()

    plon = Discretization(lonbuckets, name="plon_bkt")(scaled_plon)
    plat = Discretization(latbuckets, name="plat_bkt")(scaled_plat)
    dlon = Discretization(lonbuckets, name="dlon_bkt")(scaled_dlon)
    dlat = Discretization(latbuckets, name="dlat_bkt")(scaled_dlat)

    # Feature Cross with HashedCrossing layer
    p_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 2, name="p_fc")((plon, plat))
    d_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 2, name="d_fc")((dlon, dlat))
    pd_fc = HashedCrossing(num_bins=(nbuckets + 1) ** 4, name="pd_fc")((p_fc, d_fc))

    # Embedding with Embedding layer
    pd_embed = Flatten()(
        Embedding(input_dim=(nbuckets + 1) ** 4, output_dim=10, name="pd_embed")(
            pd_fc
        )
    )

    transformed = Concatenate()([
        scaled_plon,
        scaled_dlon,
        scaled_plat,
        scaled_dlat,
        euclidean_distance, 
        pd_embed
    ])

    return transformed


def rmse(y_true, y_pred):
    squared_error = tf.keras.ops.square(y_pred[:, 0] - y_true)
    return tf.keras.ops.sqrt(tf.keras.ops.mean(squared_error))

def build_dnn_model(nbuckets, nnsize, lr, normalizers):
    INPUT_COLS = [
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude",
    ]

    inputs = {
        colname: Input(name=colname, shape=(1,), dtype="float32")
        for colname in INPUT_COLS
    }

    # transforms
    x = transform(inputs, nbuckets, normalizers)

    for layer, nodes in enumerate(nnsize):
        x = Dense(nodes, activation="relu", name=f"h{layer}")(x)
    output = Dense(1, name="fare")(x)

    model = keras.Model(inputs=list(inputs.values()), outputs=output)
    lr_optimizer = keras.optimizers.Adam(learning_rate=lr)
    model.compile(optimizer=lr_optimizer, loss="mse", metrics=[rmse, "mse"])

    return model

# Instantiate the HyperTune reporting object
hpt = hypertune.HyperTune()

# Reporting callback
class HPTCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        global hpt
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag="val_rmse",
            metric_value=logs["val_rmse"],
            global_step=epoch,
        )


def train_and_evaluate(hparams):
    batch_size = hparams["batch_size"]
    nbuckets = hparams["nbuckets"]
    lr = hparams["lr"]
    nnsize = [int(s) for s in hparams["nnsize"].split()]
    eval_data_path = hparams["eval_data_path"]
    num_evals = hparams["num_evals"]
    num_examples_to_train_on = hparams["num_examples_to_train_on"]
    output_dir = hparams["output_dir"]
    train_data_path = hparams["train_data_path"]

    model_export_path = os.path.join(output_dir, "model.keras")
    serving_model_export_path = os.path.join(output_dir, "savedmodel")
    checkpoint_path = os.path.join(output_dir, "checkpoint.keras")
    tensorboard_path = os.path.join(output_dir, "tensorboard")

    if tf.io.gfile.exists(output_dir):
        tf.io.gfile.rmtree(output_dir)

    # Compute normalization statistics from the training data only.
    normalizers = adapt_normalize(train_data_path)

    model = build_dnn_model(nbuckets, nnsize, lr, normalizers)
    model.summary()  # prints the summary; it returns None, so there is nothing to log
    
    trainds = create_dataset(
        pattern=train_data_path, batch_size=batch_size, num_repeat=None, mode="train"
    )

    evalds = create_dataset(
        pattern=eval_data_path, batch_size=batch_size, num_repeat=1, mode="eval"
    )

    steps_per_epoch = num_examples_to_train_on // (batch_size * num_evals)

    checkpoint_cb = callbacks.ModelCheckpoint(
        checkpoint_path, verbose=1
    )
    tensorboard_cb = callbacks.TensorBoard(tensorboard_path, histogram_freq=1)

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=num_evals,
        steps_per_epoch=max(1, steps_per_epoch),
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[checkpoint_cb, tensorboard_cb, HPTCallback()],
    )

    # Save the Keras model file.
    model.save(model_export_path)
    # Exporting the model in savedmodel for serving.
    model.export(serving_model_export_path)
    return history

### Modify task.py

In [None]:
%%writefile taxifare/trainer/task.py
"""Argument definitions for model training code in `trainer.model`."""
import argparse
import os

from trainer import model

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--batch_size",
        help="Batch size for training steps",
        type=int,
        default=32,
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location pattern of eval files",
        required=True,
    )
    parser.add_argument(
        "--nnsize",
        help="Hidden layer sizes (provide space-separated sizes)",
        default="32 8",
    )
    parser.add_argument(
        "--nbuckets",
        help="Number of buckets to divide lat and lon with",
        type=int,
        default=10,
    )
    parser.add_argument(
        "--lr", help="learning rate for optimizer", type=float, default=0.001
    )
    parser.add_argument(
        "--num_evals",
        help="Number of times to evaluate model on eval data during training.",
        type=int,
        default=5,
    )
    parser.add_argument(
        "--num_examples_to_train_on",
        help="Number of examples to train on.",
        type=int,
        default=100,
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        default=os.getenv("AIP_MODEL_DIR"),
    )
    parser.add_argument(
        "--train_data_path",
        help="GCS location pattern of train files",
        required=True,
    )

    args, _ = parser.parse_known_args()

    hparams = args.__dict__
    print("output_dir", hparams["output_dir"])
    model.train_and_evaluate(hparams)


In [None]:
%%writefile taxifare/setup.py
"""Using `setuptools` to create a source distribution."""

from setuptools import find_packages, setup

setup(
    name="taxifare_trainer",
    version="0.1",
    packages=find_packages(),
    include_package_data=True,
    description="Taxifare model training application.",
)

In [None]:
%%bash
cd taxifare
python ./setup.py sdist --formats=gztar
cd ..

In [None]:
%%bash
gsutil cp taxifare/dist/taxifare_trainer-0.1.tar.gz gs://${BUCKET}/taxifare/

## Run Hyperparameter Tuning Job on Vertex AI

Hyperparameter tuning takes advantage of the processing infrastructure of Google Cloud to test different hyperparameter configurations when training your model. It can give you optimized hyperparameter values that maximize your model's predictive accuracy.


### Set up CustomJob
To leverage that capability, we first define a CustomJob object, just as we would for a normal custom training job on Vertex AI. For more details, please refer to the [custom training notebook](./1_training_at_scale_vertex.ipynb).

In [None]:
NUM_EXAMPLES_TO_TRAIN_ON = 50000
NUM_EVALS = 20
NNSIZE = "32 8"

base_path = f"gs://{BUCKET}/taxifare"
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

args = [
    "--eval_data_path",
    f"{base_path}/data/taxi-valid*",
    "--train_data_path",
    f"{base_path}/data/taxi-train*",
    "--num_examples_to_train_on",
    f"{NUM_EXAMPLES_TO_TRAIN_ON}",
    "--num_evals",
    f"{NUM_EVALS}",
    "--nnsize",
    f"{NNSIZE}",
]

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": None,
            "accelerator_count": None,
        },
        "replica_count": 1,
        "python_package_spec": {
            "executor_image_uri": "us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest",
            "package_uris": [f"{base_path}/taxifare_trainer-0.1.tar.gz"],
            "python_module": "trainer.task",
            "args": args,
        },
    }
]

custom_job = aiplatform.CustomJob(
    display_name="custom_job",
    worker_pool_specs=worker_pool_specs,
    staging_bucket=f"{base_path}/staging",
    base_output_dir=f"{base_path}/trained_model_{timestamp}",
)

### Create HyperparameterTuningJob

Now, let's define some specs for the hyperparameter tuning job.
When you configure a hyperparameter tuning job, you must specify the following details:

#### `metric_spec`
The metrics you want to optimize, specified as a dictionary `{<metric tag name>: <goal>}`.

The metric name must match the tag name you report in your training application.

You can set the goal to either `'maximize'` (e.g., accuracy) or `'minimize'` (e.g., loss value).

#### `parameter_spec`
In a ParameterSpec object, you specify the hyperparameter data type as an instance of a parameter value specification. The following table lists the supported parameter value specifications.

|Type|Data type|Value ranges|Value data|
|--|--|--|--|
|DoubleValueSpec|DOUBLE|minValue & maxValue|Floating-point values|
|IntegerValueSpec|INTEGER|minValue & maxValue|Integer values|
|CategoricalValueSpec|CATEGORICAL|categoricalValues|List of category strings|
|DiscreteValueSpec|DISCRETE|discreteValues|List of values in ascending order|


You can also specify the scaling for each hyperparameter. The available scaling types are:

|Scale Type|Description|Alias|
|--|--|--|
|UNIT_LINEAR_SCALE|Scales the feasible space linearly| `'linear'`|
|UNIT_LOG_SCALE|Scales the feasible space logarithmically to the range 0 through 1. The entire feasible space must be strictly positive.| `'log'`|
|UNIT_REVERSE_LOG_SCALE| Scales the feasible space "reverse" logarithmically 0 through 1. The result is that values close to the top of the feasible space are spread out more than points near the bottom. The entire feasible space must be strictly positive.| `'reverse_log'`|
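
To build intuition for the scale types in the table, here is a minimal sketch of how each one could map a value from the feasible range onto 0 through 1. The formulas are assumptions written for illustration (the reverse-log formula in particular is one plausible formulation) and are not taken from the Vertex AI implementation.

```python
import math


def linear_scale(x, lo, hi):
    """Map x in [lo, hi] to [0, 1] linearly."""
    return (x - lo) / (hi - lo)


def log_scale(x, lo, hi):
    """Map x in [lo, hi] to [0, 1] logarithmically (requires lo > 0)."""
    return (math.log(x) - math.log(lo)) / (math.log(hi) - math.log(lo))


def reverse_log_scale(x, lo, hi):
    """Assumed mirror image of log_scale: spreads out values near hi."""
    return 1.0 - log_scale(hi + lo - x, lo, hi)


lo, hi = 0.0001, 0.1  # the learning-rate range used in this lab
for lr in (0.0001, 0.001, 0.01, 0.1):
    print(
        f"lr={lr:<7} linear={linear_scale(lr, lo, hi):.3f} "
        f"log={log_scale(lr, lo, hi):.3f}"
    )
```

On the linear scale, a learning rate of 0.001 sits at roughly 0.009 of the way through the range, whereas on the log scale it sits one third of the way through. This is why a log scale is a natural choice for parameters that span several orders of magnitude.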

#### `max_trial_count`

Decide how many trials you want to allow the service to run and set the maxTrialCount value in the HyperparameterTuningJob object.

Increasing the number of trials generally yields better results, but it is not always so. Usually, there is a point of diminishing returns after which additional trials have little or no effect on the accuracy.

#### `parallel_trial_count`
You can specify how many trials can run in parallel by setting parallelTrialCount in the HyperparameterTuningJob.

Running parallel trials reduces the wall-clock time the tuning job takes, although the total processing time required is typically unchanged.

**However, running in parallel can reduce the effectiveness of the tuning job overall when you use Google Vizier**. That is because Google Vizier uses the results of previous trials to inform the values to assign to the hyperparameters of subsequent trials. When running in parallel, some trials start without having the benefit of the results of any trials still running.
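
The trade-off can be made concrete with a little arithmetic. The sketch below assumes, as a simplification, that trials run in fixed-size waves of equal duration; real jobs start new trials as earlier ones finish, so treat the numbers as rough guides. The function names are hypothetical.

```python
import math


def sequential_rounds(max_trial_count, parallel_trial_count):
    """Approximate number of back-to-back waves of trials."""
    return math.ceil(max_trial_count / parallel_trial_count)


def uninformed_trials(max_trial_count, parallel_trial_count):
    """Trials in the first wave, which start before any result exists."""
    return min(max_trial_count, parallel_trial_count)


for parallel in (1, 2, 4, 8):
    print(
        f"parallel={parallel}: "
        f"rounds={sequential_rounds(8, parallel)}, "
        f"uninformed={uninformed_trials(8, parallel)}"
    )
```

With 8 trials, going from 2 to 8 parallel trials cuts the waves from 4 to 1, but every trial then starts without any feedback from the others, which makes the search behave like random search.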


Refer to [the document](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview#hyperparameters) for more details.

In [None]:
hpt_job = aiplatform.HyperparameterTuningJob(
    display_name=f"taxifare_{timestamp}",
    custom_job=custom_job,
    metric_spec={
        "val_rmse": "minimize",
    },
    parameter_spec={
        "lr": hpt.DoubleParameterSpec(min=0.0001, max=0.1, scale="log"),
        "nbuckets": hpt.IntegerParameterSpec(min=10, max=25, scale="linear"),
        "batch_size": hpt.DiscreteParameterSpec(
            values=[16, 64, 128], scale="linear"
        ),
    },
    max_trial_count=8,
    parallel_trial_count=2,
)

Now that all the configuration is complete, let's submit the job. Because we pass `sync=False`, the call returns without waiting for the job to finish. You can follow its progress on the console page using the URL link provided below.

In [None]:
hpt_job.run(sync=False)

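Here is a comparable `gcloud ai` command, where you provide the configuration in a YAML file. Note that some values in this example (the batch sizes and the trial count) differ from the Python cell above.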

```bash
# Output directory and job name
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
BASE_OUTPUT_DIR=gs://${BUCKET}/taxifare_$TIMESTAMP
JOB_NAME=taxifare_$TIMESTAMP
echo ${BASE_OUTPUT_DIR} ${REGION} ${JOB_NAME}

# Vertex AI machines to use for training
PYTHON_PACKAGE_URI="gs://${BUCKET}/taxifare/taxifare_trainer-0.1.tar.gz"
MACHINE_TYPE="n1-standard-4"
REPLICA_COUNT=1
PYTHON_PACKAGE_EXECUTOR_IMAGE_URI="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-17.py310:latest"
PYTHON_MODULE="trainer.task"

# Model and training hyperparameters
BATCH_SIZE=15
NUM_EXAMPLES_TO_TRAIN_ON=100
NUM_EVALS=10
NBUCKETS=10
LR=0.001
NNSIZE="32 8"

# GCS paths
GCS_PROJECT_PATH=gs://$BUCKET/taxifare
DATA_PATH=$GCS_PROJECT_PATH/data
TRAIN_DATA_PATH=$DATA_PATH/taxi-train*
EVAL_DATA_PATH=$DATA_PATH/taxi-valid*


echo > ./config.yaml "displayName: $JOB_NAME
studySpec:
  metrics:
  - metricId: val_rmse
    goal: MINIMIZE
  parameters:
  - parameterId: lr
    doubleValueSpec:
      minValue: 0.0001
      maxValue: 0.1
    scaleType: UNIT_LOG_SCALE
  - parameterId: nbuckets
    integerValueSpec:
      minValue: 10
      maxValue: 25
    scaleType: UNIT_LINEAR_SCALE
  - parameterId: batch_size
    discreteValueSpec:
      values:
      - 15
      - 30
      - 50
    scaleType: UNIT_LINEAR_SCALE
  algorithm: ALGORITHM_UNSPECIFIED # results in Bayesian optimization
trialJobSpec:
  baseOutputDirectory:
    outputUriPrefix: $BASE_OUTPUT_DIR
  workerPoolSpecs:
  - machineSpec:
      machineType: $MACHINE_TYPE
    pythonPackageSpec:
      args:
      - --train_data_path=$TRAIN_DATA_PATH
      - --eval_data_path=$EVAL_DATA_PATH
      - --batch_size=$BATCH_SIZE
      - --num_examples_to_train_on=$NUM_EXAMPLES_TO_TRAIN_ON
      - --num_evals=$NUM_EVALS
      - --nbuckets=$NBUCKETS
      - --lr=$LR
      - --nnsize=$NNSIZE
      executorImageUri: $PYTHON_PACKAGE_EXECUTOR_IMAGE_URI
      packageUris:
      - $PYTHON_PACKAGE_URI
      pythonModule: $PYTHON_MODULE
    replicaCount: $REPLICA_COUNT"

gcloud ai hp-tuning-jobs create \
    --region=$REGION \
    --display-name=$JOB_NAME \
    --config=config.yaml \
    --max-trial-count=10 \
    --parallel-trial-count=2
```

### Open TensorBoard
We can use TensorBoard with hyperparameter tuning jobs to compare the loss curves of the trials.

You might not see any data for a bit until the job begins. Check the job status on the console, and then return here to click the refresh button in the top right to update TensorBoard.

In [None]:
%tensorboard --logdir {base_path}/trained_model_{timestamp} --port 8082

Copyright 2025 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License