# Running a Hyperparameter Tuning Job with Vertex Training

## Learning objectives

In this notebook, you learn how to:

1. Create a Vertex AI custom job for training a model. 
2. Launch a hyperparameter tuning job with the Python SDK.
3. Clean up resources.

## Overview

This notebook demonstrates how to run a hyperparameter tuning job with Vertex Training to discover optimal hyperparameter values for an ML model. To speed up the training process, `MirroredStrategy` from the `tf.distribute` module is used to distribute training across multiple GPUs on a single machine.

In this notebook, you create a custom-trained model from a Python script in a Docker container. You learn how to modify training application code for hyperparameter tuning and submit a Vertex Training hyperparameter tuning job with the Python SDK.

### Dataset

The dataset used for this tutorial is the [horses or humans dataset](https://www.tensorflow.org/datasets/catalog/horses_or_humans) from [TensorFlow Datasets](https://www.tensorflow.org/datasets). The trained model predicts if an image is of a horse or a human.

Each learning objective corresponds to a _#TODO_ in this student lab notebook. Try to complete this notebook first, and then review the [solution notebook](../solutions/distributed-hyperparameter-tuning.ipynb).

### Install additional packages

Install the latest version of Vertex SDK for Python.

In [1]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [2]:
# Install necessary dependencies
! pip3 install {USER_FLAG} --upgrade google-cloud-aiplatform



### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [3]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

### Set up your Google Cloud project

1. [Enable the Vertex AI API and Compute Engine API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.
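
For readers running code outside Jupyter, the following is a rough plain-Python sketch of what a `!` line with an interpolated variable amounts to: the value is substituted into the command, a child process runs, and its output lines are captured. The `PROJECT_ID` value is a hypothetical placeholder, and the child process here is the Python interpreter rather than `gcloud` so that the sketch runs anywhere. Unlike Jupyter, this sketch does not go through a shell.

```python
import subprocess
import sys

PROJECT_ID = "my-project"  # hypothetical placeholder, not a real project

# Substitute the Python value into the command line, run it as a child
# process, and capture its standard output as text.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", PROJECT_ID],
    capture_output=True,
    text=True,
    check=True,
)

# Jupyter returns shell output as a list of lines; splitlines() mimics that.
shell_output = result.stdout.splitlines()
print(shell_output[0])
```

This is why the notebook reads the project ID from `shell_output[0]` after running `gcloud` with a `!` prefix.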

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [2]:
import os

PROJECT_ID = "qwiklabs-gcp-00-b9e7121a76ba"  # Replace your Project ID here 

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Project ID:  qwiklabs-gcp-00-b9e7121a76ba


Otherwise, set your project ID here.

In [3]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "qwiklabs-gcp-00-b9e7121a76ba"   # Replace your Project ID here

Set project ID

In [4]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [5]:
# Import necessary library
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
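
As a small illustration of why the timestamp prevents collisions, the sketch below appends a second-resolution timestamp to a name prefix. The helper `timestamped_name` is hypothetical and is not part of the notebook; the dates are fixed so the result is reproducible.

```python
from datetime import datetime


def timestamped_name(prefix: str, now: datetime) -> str:
    """Appends a second-resolution timestamp to a resource name prefix."""
    return prefix + now.strftime("%Y%m%d%H%M%S")


# Two sessions started one second apart get distinct resource names.
first = timestamped_name(
    "horses-humans-hyperparam-job", datetime(2022, 5, 26, 6, 38, 17))
second = timestamped_name(
    "horses-humans-hyperparam-job", datetime(2022, 5, 26, 6, 38, 18))
print(first)
print(second)
```

Two sessions that start within the same second would still collide, so this scheme suits small shared sessions rather than large automated runs.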

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a custom training job using the Cloud SDK, you will need to provide a staging bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.

In [6]:
BUCKET_URI = "gs://qwiklabs-gcp-00-b9e7121a76ba"  # Replace your Bucket name here
REGION = "us-central1"  # @param {type:"string"}

In [7]:
if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://qwiklabs-gcp-00-b9e7121a76ba":  # Replace your Bucket name here 
    BUCKET_URI = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP

In [8]:
print(BUCKET_URI)

gs://qwiklabs-gcp-00-b9e7121a76baaip-20220526063817
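
As the printed URI shows, the cell above joins the project ID and `aip-` without a separator. The sketch below is one possible way to build a more readable name and sanity-check it before creating the bucket. The helper `make_bucket_uri` and the regular expression are assumptions for illustration; the pattern covers only a commonly cited subset of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dashes, underscores, and dots, starting and ending with a letter or digit).

```python
import re

# Assumed subset of the bucket naming rules; see the lead-in above.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def make_bucket_uri(project_id: str, timestamp: str) -> str:
    """Builds gs://<project>-aip-<timestamp> and checks the name shape."""
    name = f"{project_id}-aip-{timestamp}"
    if not _BUCKET_NAME.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return "gs://" + name


# "my-project" is a hypothetical project ID.
print(make_bucket_uri("my-project", "20220526063817"))
```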


**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [9]:
# Create your bucket
! gcloud storage buckets create --location=$REGION $BUCKET_URI

Creating gs://qwiklabs-gcp-00-b9e7121a76baaip-20220526063817/...


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [10]:
# List the contents of your Cloud Storage bucket to validate access
! gcloud storage ls --all-versions --long $BUCKET_URI

### Import libraries and define constants

In [11]:
# Import necessary libraries
import os
import sys

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

### Write Dockerfile

The first step in containerizing your code is to create a Dockerfile. In the Dockerfile, you include all the commands needed to run the image, such as installing the necessary libraries and setting up the entry point for the training code.

This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.5 GPU Docker image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. After downloading that image, this Dockerfile installs the [CloudML Hypertune](https://github.com/GoogleCloudPlatform/cloudml-hypertune) library and sets up the entrypoint for the training code.


In [12]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Writing Dockerfile


### Create training application code

Next, you create a trainer directory with a `task.py` script that contains the code for your training application.

In [13]:
# Create trainer directory

! mkdir trainer

In the next cell, you write the contents of the training script, `task.py`. This file downloads the _horses or humans_ dataset from TensorFlow datasets and trains a `tf.keras` functional model using `MirroredStrategy` from the `tf.distribute` module.

There are a few components that are specific to using the hyperparameter tuning service:

* The script imports the `hypertune` library. Note that the Dockerfile included instructions to pip install the hypertune library.
* The function `get_args()` defines a command-line argument for each hyperparameter you want to tune. In this example, the hyperparameters that will be tuned are the learning rate, the momentum value in the optimizer, and the number of units in the last hidden layer of the model. The value passed in those arguments is then used to set the corresponding hyperparameter in the code.
* At the end of the `main()` function, the hypertune library is used to define the metric to optimize. In this example, the metric to optimize is the validation accuracy. This metric is passed to an instance of `HyperTune` and reported under the tag `accuracy`.
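
To make the argument handling concrete, the sketch below rebuilds the same flags that `get_args()` defines and parses one hypothetical trial's values. The values in `trial_argv` and the `--flag=value` form are assumptions for illustration; in a real job, the tuning service chooses the values for each trial.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirrors the flags that get_args() defines in trainer/task.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", required=True, type=float)
    parser.add_argument("--momentum", required=True, type=float)
    parser.add_argument("--units", required=True, type=int)
    parser.add_argument("--epochs", type=int, default=10)
    return parser


# Hypothetical values for a single trial.
trial_argv = ["--learning_rate=0.01", "--momentum=0.9", "--units=128"]
args = build_parser().parse_args(trial_argv)

# --epochs was not passed, so it falls back to its default.
print(args.learning_rate, args.momentum, args.units, args.epochs)
```

Because the three tuned flags are marked `required=True`, a trial that omits one of them fails at startup instead of silently training with an unintended value.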

In [14]:
%%writefile trainer/task.py

import argparse
import hypertune
import tensorflow as tf
import tensorflow_datasets as tfds

def get_args():
  """Parses args. Must include all hyperparameters you want to tune."""

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--learning_rate', required=True, type=float, help='learning rate')
  parser.add_argument(
      '--momentum', required=True, type=float, help='SGD momentum value')
  parser.add_argument(
      '--units',
      required=True,
      type=int,
      help='number of units in last hidden layer')
  parser.add_argument(
      '--epochs',
      required=False,
      type=int,
      default=10,
      help='number of training epochs')
  args = parser.parse_args()
  return args


def preprocess_data(image, label):
  """Resizes and scales images."""

  image = tf.image.resize(image, (150, 150))
  return tf.cast(image, tf.float32) / 255., label


def create_dataset(batch_size):
  """Loads Horses Or Humans dataset and preprocesses data."""

  data, info = tfds.load(
      name='horses_or_humans', as_supervised=True, with_info=True)

  # Create train dataset
  train_data = data['train'].map(preprocess_data)
  train_data = train_data.shuffle(1000)
  train_data = train_data.batch(batch_size)

  # Create validation dataset
  validation_data = data['test'].map(preprocess_data)
  validation_data = validation_data.batch(64)

  return train_data, validation_data


def create_model(units, learning_rate, momentum):
  """Defines and compiles model."""

  inputs = tf.keras.Input(shape=(150, 150, 3))
  x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu')(inputs)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Flatten()(x)
  x = tf.keras.layers.Dense(units, activation='relu')(x)
  outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
  model = tf.keras.Model(inputs, outputs)
  model.compile(
      loss='binary_crossentropy',
      optimizer=tf.keras.optimizers.SGD(
          learning_rate=learning_rate, momentum=momentum),
      metrics=['accuracy'])
  return model


def main():
  args = get_args()

  # Create Strategy
  strategy = tf.distribute.MirroredStrategy()

  # Scale batch size
  GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync  
  train_data, validation_data = create_dataset(GLOBAL_BATCH_SIZE)

  # Wrap model variables within scope
  with strategy.scope():
    model = create_model(args.units, args.learning_rate, args.momentum)

  # Train model
  history = model.fit(
      train_data, epochs=args.epochs, validation_data=validation_data)

  # Define Metric
  hp_metric = history.history['val_accuracy'][-1]

  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=hp_metric,
      global_step=args.epochs)


if __name__ == '__main__':
  main()

Writing trainer/task.py
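
The batch size scaling in `main()` can be checked in isolation. `MirroredStrategy` splits each global batch across its replicas, so multiplying by the replica count keeps each replica's share at 64 examples. The sketch below uses plain Python with a hypothetical helper, `global_batch_size`, in place of `strategy.num_replicas_in_sync`.

```python
PER_REPLICA_BATCH_SIZE = 64  # same constant as in main()


def global_batch_size(
    num_replicas: int, per_replica: int = PER_REPLICA_BATCH_SIZE
) -> int:
    """Returns the batch size to pass to create_dataset()."""
    return per_replica * num_replicas


# Replica counts chosen for illustration only.
for replicas in (1, 2, 4):
    print(replicas, global_batch_size(replicas))
```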


### Build the Container

In the next cells, you build the container and push it to Google Container Registry.

In [15]:
# Set the IMAGE_URI
IMAGE_URI = f"gcr.io/{PROJECT_ID}/horse-human:hypertune"

In [16]:
# Build the docker image
! docker build -f Dockerfile -t $IMAGE_URI ./

Sending build context to Docker daemon  355.3kB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
latest: Pulling from deeplearning-platform-release/tf2-gpu.2-5


In [17]:
# Push it to Google Container Registry:
! docker push $IMAGE_URI

The push refers to repository [gcr.io/qwiklabs-gcp-00-b9e7121a76ba/horse-human]


### Create and run hyperparameter tuning job on Vertex AI

Once your container is pushed to Google Container Registry, you use the Vertex SDK to create and run the hyperparameter tuning job.

You define the following specifications:
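* `worker_pool_specs`: List of dictionaries specifying the machine type and Docker image. This example defines a single-node cluster with one `n1-standard-4` machine and no accelerators (`accelerator_count` is 0). To train on GPUs, set `accelerator_type` to a GPU type such as `NVIDIA_TESLA_T4` and `accelerator_count` to the number of GPUs.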
* `parameter_spec`: Dictionary specifying the parameters to optimize. The dictionary key is the string assigned to the command line argument for each hyperparameter in your training application code, and the dictionary value is the parameter specification. The parameter specification includes the type, min/max values, and scale for the hyperparameter.
* `metric_spec`: Dictionary specifying the metric to optimize. The dictionary key is the `hyperparameter_metric_tag` that you set in your training application code, and the value is the optimization goal.
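
To see why the learning rate uses a log scale while momentum uses a linear scale, the sketch below compares the two ways of drawing values from the range 0.001 to 1. It is a plain random-sampling illustration of what the scale means, not the search procedure the tuning service applies; the helper names are hypothetical.

```python
import math
import random


def sample_linear(rng: random.Random, low: float, high: float) -> float:
    """Draws uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(rng: random.Random, low: float, high: float) -> float:
    """Draws uniformly in log10 space, so each decade is equally likely."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed for repeatability
log_draws = [sample_log(rng, 0.001, 1.0) for _ in range(10_000)]
linear_draws = [sample_linear(rng, 0.001, 1.0) for _ in range(10_000)]

# Fraction of draws below 0.01, the lowest of the three decades.
log_small = sum(d < 0.01 for d in log_draws) / len(log_draws)
linear_small = sum(d < 0.01 for d in linear_draws) / len(linear_draws)
print(f"log scale: {log_small:.2f}, linear scale: {linear_small:.2f}")
```

On a linear scale, almost all draws land between 0.1 and 1, so small learning rates are barely explored. On a log scale, each decade receives roughly a third of the draws.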

In [18]:
# Define required specifications
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "ACCELERATOR_TYPE_UNSPECIFIED",
            "accelerator_count": 0,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": IMAGE_URI},
    }
]

metric_spec = {"accuracy": "maximize"}

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=1, scale="log"),
    "momentum": hpt.DoubleParameterSpec(min=0, max=1, scale="linear"),
    "units": hpt.DiscreteParameterSpec(values=[64, 128, 512], scale=None),
}

Create a `CustomJob`.

In [19]:
print(BUCKET_URI)

gs://qwiklabs-gcp-00-b9e7121a76baaip-20220526063817


In [20]:
# Create a CustomJob

JOB_NAME = "horses-humans-hyperparam-job" + TIMESTAMP

my_custom_job = # TODO 1: Your code goes here(
    display_name=JOB_NAME,
    project=PROJECT_ID,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=BUCKET_URI,
)

Then, create and run a `HyperparameterTuningJob`.

There are a few arguments to note:

* `max_trial_count`: Sets an upper bound on the number of trials the service will run. The recommended practice is to start with a smaller number of trials and get a sense of how impactful your chosen hyperparameters are before scaling up.

* `parallel_trial_count`:  If you use parallel trials, the service provisions multiple training processing clusters. The worker pool spec that you specify when creating the job is used for each individual training cluster.  Increasing the number of parallel trials reduces the amount of time the hyperparameter tuning job takes to run; however, it can reduce the effectiveness of the job overall. This is because the default tuning strategy uses results of previous trials to inform the assignment of values in subsequent trials.
 
* `search_algorithm`: The available search algorithms are grid, random, or default (None). The default option applies Bayesian optimization to search the space of possible hyperparameter values and is the recommended algorithm.
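
The tradeoff between speed and search quality can be estimated with simple arithmetic. The sketch below assumes a simplified model in which trials run in lockstep batches of `parallel_trial_count`; the service may schedule trials differently, and `sequential_rounds` is a hypothetical helper.

```python
import math


def sequential_rounds(max_trial_count: int, parallel_trial_count: int) -> int:
    """Number of trial batches when each batch runs fully in parallel."""
    return math.ceil(max_trial_count / parallel_trial_count)


# Parallelism levels chosen for illustration, with 15 trials in total.
for parallel in (1, 3, 5, 15):
    print(parallel, sequential_rounds(15, parallel))
```

Under this model, 15 trials with 3 in parallel gives 5 rounds, so later rounds can still learn from earlier results. With 15 in parallel there is a single round, and no trial benefits from any other trial's outcome.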

In [None]:
# Create and run HyperparameterTuningJob

hp_job = # TODO 2: Your code goes here(
    display_name=JOB_NAME,
    custom_job=my_custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=15,
    parallel_trial_count=3,
    project=PROJECT_ID,
    search_algorithm=None,
)

hp_job.run()

Creating HyperparameterTuningJob
HyperparameterTuningJob created. Resource name: projects/585438674354/locations/us-central1/hyperparameterTuningJobs/1248415738746634240
To use this HyperparameterTuningJob in another session:
hpt_job = aiplatform.HyperparameterTuningJob.get('projects/585438674354/locations/us-central1/hyperparameterTuningJobs/1248415738746634240')
View HyperparameterTuningJob:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1248415738746634240?project=585438674354
HyperparameterTuningJob projects/585438674354/locations/us-central1/hyperparameterTuningJobs/1248415738746634240 current state:
JobState.JOB_STATE_PENDING
HyperparameterTuningJob projects/585438674354/locations/us-central1/hyperparameterTuningJobs/1248415738746634240 current state:
JobState.JOB_STATE_PENDING
HyperparameterTuningJob projects/585438674354/locations/us-central1/hyperparameterTuningJobs/1248415738746634240 current state:
JobState.JOB_STATE_RUNNING
HyperparameterTuningJ

**The job takes nearly 50 minutes to complete.**

Click on the generated link in the output to see your run in the Cloud Console. When the job completes, you will see the results of the tuning trials.

![console_ui_results](tuning_results.png)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [22]:
# Set this to True only if you'd like to delete your bucket
delete_bucket = # TODO 3: Your code goes here

if delete_bucket or os.getenv("IS_TESTING"):
    ! gcloud storage rm --recursive $BUCKET_URI