In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Distributed Vertex AI Hyperparameter Tuning

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/distributed_hyperparameter_tuning.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/distributed_hyperparameter_tuning.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/distributed_hyperparameter_tuning.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to run a hyperparameter tuning job with Vertex AI Training to discover optimal hyperparameter values for an ML model. To speed up the training process, `MirroredStrategy` from the `tf.distribute` module is used to distribute training across multiple GPUs on a single machine.

Learn more about [Vertex AI Hyperparameter Tuning](https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview).

### Objective

In this notebook, you create a custom-trained model from a Python script packaged in a Docker container. You learn how to modify training application code for hyperparameter tuning and how to submit a Vertex AI Hyperparameter Tuning job with the Python SDK.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Training`
- `Vertex AI Hyperparameter Tuning`

The steps performed include:

- Train using a Python package.
- Report accuracy to the hyperparameter tuning service.
- Save the model artifacts to Cloud Storage using GCSFuse.

### Dataset

The dataset used for this tutorial is the [horses or humans dataset](https://www.tensorflow.org/datasets/catalog/horses_or_humans) from [TensorFlow Datasets](https://www.tensorflow.org/datasets). The trained model predicts if an image is of a horse or a human.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage


Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installations

Install the following packages to execute this notebook.

In [None]:
! pip3 install --upgrade google-cloud-aiplatform -q

### Colab Only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Before you begin

#### Set your project ID

**If you don't know your project ID**, try the following:
-  Run `gcloud config list`
-  Run `gcloud projects list`
-  See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# set the project id
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable used by Vertex AI. 
Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench** 
- Do nothing as you are already authenticated.

**2. Local JupyterLab instance,** uncomment and run:

In [None]:
# ! gcloud auth login

**3. Colab,** uncomment and run:

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} {BUCKET_URI}

### Import libraries and define constants

In [None]:
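import os

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

# Set SDK defaults so that jobs run in REGION and stage files in BUCKET_URI
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)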

### Write Dockerfile

The first step in containerizing your code is to create a Dockerfile. In the Dockerfile, you'll include all the commands needed to run the image such as installing the necessary libraries and setting up the entry point for the training code.

This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.5 GPU Docker image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. After downloading that image, this Dockerfile installs the [CloudML Hypertune](https://github.com/GoogleCloudPlatform/cloudml-hypertune) library and sets up the entrypoint for the training code.


In [None]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

### Create training application code

Next, you create a trainer directory with a `task.py` script that contains the code for your training application.

In [None]:
# Create trainer directory

! mkdir -p trainer

In the next cell, you write the contents of the training script, `task.py`. This file downloads the _horses or humans_ dataset from TensorFlow Datasets and trains a `tf.keras` functional model using `MirroredStrategy` from the `tf.distribute` module.
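
With `MirroredStrategy`, each batch is split evenly across the GPU replicas, so the script multiplies a per-replica batch size of 64 by the number of replicas. The following is a minimal sketch of that arithmetic in plain Python. The helper names are hypothetical, and the train split size of 1,027 examples is an assumption that you should check against the TensorFlow Datasets catalog.

```python
# Illustrative arithmetic only; TensorFlow is not required to run this.
PER_REPLICA_BATCH_SIZE = 64  # the constant the training script multiplies by the replica count


def global_batch_size(num_replicas: int, per_replica: int = PER_REPLICA_BATCH_SIZE) -> int:
    """Return the batch size to pass to Dataset.batch() so each replica sees `per_replica` examples."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    return per_replica * num_replicas


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Return the number of batches per epoch when the final partial batch is kept."""
    return -(-num_examples // batch_size)  # ceiling division


NUM_TRAIN_EXAMPLES = 1027  # assumed size of the train split (hypothetical constant)

for replicas in (1, 2):
    batch = global_batch_size(replicas)
    print(f"replicas={replicas} global_batch={batch} steps={steps_per_epoch(NUM_TRAIN_EXAMPLES, batch)}")
```

Doubling the number of replicas doubles the global batch size and roughly halves the number of steps per epoch, which is where the speedup comes from.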

There are a few components that are specific to using the hyperparameter tuning service:

* The script imports the `hypertune` library. Note that the Dockerfile included instructions to pip install the hypertune library.
* The function `get_args()` defines a command-line argument for each hyperparameter you want to tune. In this example, the hyperparameters that will be tuned are the learning rate, the momentum value in the optimizer, and the number of units in the last hidden layer of the model. The value passed in those arguments is then used to set the corresponding hyperparameter in the code.
* At the end of the `main()` function, the hypertune library is used to report the metric to optimize. In this example, that metric is the validation accuracy from the final training epoch, which is passed to an instance of `HyperTune` under the tag `accuracy`.
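
To see how the tuned values reach the script, the following sketch parses a trial's command-line flags with `argparse`, using the same flag names described above. The values in `trial_argv` are hypothetical; in a real job the service chooses them for each trial, and the `--name=value` form shown here is an assumption about how they are passed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one flag per tuned hyperparameter, plus an optional epoch count."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", required=True, type=float)
    parser.add_argument("--momentum", required=True, type=float)
    parser.add_argument("--units", required=True, type=int)
    parser.add_argument("--epochs", type=int, default=10)
    return parser


# Hypothetical values for a single trial (not produced by the tuning service).
trial_argv = ["--learning_rate=0.01", "--momentum=0.9", "--units=128"]

args = build_parser().parse_args(trial_argv)
print(args.learning_rate, args.momentum, args.units, args.epochs)
```

Because `--epochs` is not tuned in this tutorial, it falls back to its default, while the three required flags must be supplied for every trial.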

In [None]:
%%writefile trainer/task.py

import argparse
import hypertune
import tensorflow as tf
import tensorflow_datasets as tfds

def get_args():
  """Parses args. Must include all hyperparameters you want to tune."""

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--learning_rate', required=True, type=float, help='learning rate')
  parser.add_argument(
      '--momentum', required=True, type=float, help='SGD momentum value')
  parser.add_argument(
      '--units',
      required=True,
      type=int,
      help='number of units in last hidden layer')
  parser.add_argument(
      '--epochs',
      required=False,
      type=int,
      default=10,
      help='number of training epochs')
  args = parser.parse_args()
  return args


def preprocess_data(image, label):
  """Resizes and scales images."""

  image = tf.image.resize(image, (150, 150))
  return tf.cast(image, tf.float32) / 255., label


def create_dataset(batch_size):
  """Loads Horses Or Humans dataset and preprocesses data."""

  data, info = tfds.load(
      name='horses_or_humans', as_supervised=True, with_info=True)

  # Create train dataset
  train_data = data['train'].map(preprocess_data)
  train_data = train_data.shuffle(1000)
  train_data = train_data.batch(batch_size)

  # Create validation dataset
  validation_data = data['test'].map(preprocess_data)
  validation_data = validation_data.batch(64)

  return train_data, validation_data


def create_model(units, learning_rate, momentum):
  """Defines and compiles model."""

  inputs = tf.keras.Input(shape=(150, 150, 3))
  x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu')(inputs)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Flatten()(x)
  x = tf.keras.layers.Dense(units, activation='relu')(x)
  outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
  model = tf.keras.Model(inputs, outputs)
  model.compile(
      loss='binary_crossentropy',
      optimizer=tf.keras.optimizers.SGD(
          learning_rate=learning_rate, momentum=momentum),
      metrics=['accuracy'])
  return model


def main():
  args = get_args()

  # Create Strategy
  strategy = tf.distribute.MirroredStrategy()

  # Scale batch size
  GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync  
  train_data, validation_data = create_dataset(GLOBAL_BATCH_SIZE)

  # Wrap model variables within scope
  with strategy.scope():
    model = create_model(args.units, args.learning_rate, args.momentum)

  # Train model
  history = model.fit(
      train_data, epochs=args.epochs, validation_data=validation_data)

  # Define Metric
  hp_metric = history.history['val_accuracy'][-1]

  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=hp_metric,
      global_step=args.epochs)


if __name__ == '__main__':
  main()

### Build the Container

In the next cells, you build the container and push it to Google Container Registry.

In [None]:
# Set the IMAGE_URI
IMAGE_URI = f"gcr.io/{PROJECT_ID}/horse-human:hypertune"

In [None]:
# Build the docker image
! docker build -f Dockerfile -t $IMAGE_URI ./

In [None]:
# Push it to Google Container Registry:
! docker push $IMAGE_URI

### Create and run hyperparameter tuning job on Vertex AI

Once your container is pushed to Google Container Registry, you use the Vertex SDK to create and run the hyperparameter tuning job.

You define the following specifications:
* `worker_pool_specs`: List of dictionaries, one per worker pool, specifying the machine type and Docker image. This example defines a single-node cluster with one `n1-standard-4` machine with two `NVIDIA_TESLA_T4` GPUs.
* `parameter_spec`: Dictionary specifying the parameters to optimize. The dictionary key is the string assigned to the command line argument for each hyperparameter in your training application code, and the dictionary value is the parameter specification. The parameter specification includes the type, min/max values, and scale for the hyperparameter.
* `metric_spec`: Dictionary specifying the metric to optimize. The dictionary key is the `hyperparameter_metric_tag` that you set in your training application code, and the value is the optimization goal.
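
To build intuition for the `scale` setting in a parameter specification, the sketch below maps a position in the unit interval onto a parameter range using either linear or log scaling. This only illustrates the idea of the two scales. It is not the service's search algorithm, and `from_unit_interval` is a hypothetical helper.

```python
import math


def from_unit_interval(u: float, lo: float, hi: float, scale: str) -> float:
    """Map u in [0, 1] to a value in [lo, hi] using linear or log spacing."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must be in [0, 1]")
    if scale == "linear":
        return lo + u * (hi - lo)
    if scale == "log":
        if lo <= 0:
            raise ValueError("log scale requires a positive lower bound")
        return math.exp(math.log(lo) + u * (math.log(hi) - math.log(lo)))
    raise ValueError(f"unknown scale: {scale}")


# Midpoint of the learning rate range used in this tutorial under each scale.
print(from_unit_interval(0.5, 0.001, 1.0, "linear"))
print(from_unit_interval(0.5, 0.001, 1.0, "log"))
```

With linear spacing, the midpoint of the learning rate range is about 0.5. With log spacing it is about 0.032, so small learning rates get far more coverage. This is also why a log-scaled range needs a positive minimum, whereas the linearly scaled `momentum` range can start at 0.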

In [None]:
worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "NVIDIA_TESLA_T4",
            "accelerator_count": 2,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": IMAGE_URI},
    }
]

metric_spec = {"accuracy": "maximize"}

parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=1, scale="log"),
    "momentum": hpt.DoubleParameterSpec(min=0, max=1, scale="linear"),
    "units": hpt.DiscreteParameterSpec(values=[64, 128, 512], scale=None),
}

Create a `CustomJob`.

In [None]:
# Create a CustomJob

JOB_NAME = "horses-humans-hyperparam-job"

my_custom_job = aiplatform.CustomJob(
    display_name=JOB_NAME,
    project=PROJECT_ID,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=BUCKET_URI,
)

Then, create and run a `HyperparameterTuningJob`.

There are a few arguments to note:

* `max_trial_count`: Sets an upper bound on the number of trials the service will run. The recommended practice is to start with a smaller number of trials and get a sense of how impactful your chosen hyperparameters are before scaling up.

* `parallel_trial_count`: If you use parallel trials, the service provisions multiple training clusters. The worker pool spec that you specify when creating the job is used for each individual training cluster. Increasing the number of parallel trials reduces the time the hyperparameter tuning job takes to run; however, it can reduce the effectiveness of the job overall, because the default tuning strategy uses the results of previous trials to inform the values assigned in subsequent trials.
 
* `search_algorithm`: The available search algorithms are grid, random, or default (None). The default option applies Bayesian optimization to search the space of possible hyperparameter values and is the recommended algorithm.
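
The trade-off between `max_trial_count` and `parallel_trial_count` can be sketched with simple arithmetic. The example assumes, as a simplification, that trials run in lock-step waves of `parallel_trial_count` trials each; the real service may start a new trial as soon as another finishes. The helper names are hypothetical.

```python
import math


def tuning_waves(max_trial_count: int, parallel_trial_count: int) -> int:
    """Return the number of sequential waves needed to run all trials."""
    if max_trial_count < 1 or parallel_trial_count < 1:
        raise ValueError("both counts must be at least 1")
    return math.ceil(max_trial_count / parallel_trial_count)


def prior_results_per_wave(max_trial_count: int, parallel_trial_count: int) -> list:
    """Return, for each wave, how many finished trials exist when that wave starts."""
    waves = tuning_waves(max_trial_count, parallel_trial_count)
    return [min(wave * parallel_trial_count, max_trial_count) for wave in range(waves)]


for parallel in (1, 3, 15):
    print(parallel, tuning_waves(15, parallel), prior_results_per_wave(15, parallel))
```

Under this simplified model, 15 trials with 3 in parallel run in 5 waves, and each wave can build on the results of the previous ones. Running all 15 in parallel finishes in a single wave, but no trial has earlier results to learn from.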

In [None]:
# Create and run HyperparameterTuningJob

hp_job = aiplatform.HyperparameterTuningJob(
    display_name=JOB_NAME,
    custom_job=my_custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=15,
    parallel_trial_count=3,
    project=PROJECT_ID,
    search_algorithm=None,
)

hp_job.run()

Click on the generated link in the output to see your run in the Cloud Console. When the job completes, you will see the results of the tuning trials.

![console_ui_results](tuning_results.png)
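
You can also pick out the best trial programmatically. The sketch below assumes each trial object exposes `id`, `parameters` (items with `parameter_id` and `value`), and `final_measurement.metrics` (items with `metric_id` and `value`), which mirrors the shape of the Vertex AI `Trial` resource. To stay runnable offline, it uses `SimpleNamespace` stand-ins with made-up placeholder values; these are not results from this tutorial, and `fake_trial`, `metric_value`, and `best_trial` are hypothetical helpers.

```python
from types import SimpleNamespace


def metric_value(trial, metric_id="accuracy"):
    """Return the final value of `metric_id` for a trial, or None if it was not reported."""
    measurement = getattr(trial, "final_measurement", None)
    if measurement is None:
        return None
    for metric in measurement.metrics:
        if metric.metric_id == metric_id:
            return metric.value
    return None


def best_trial(trials, metric_id="accuracy"):
    """Return the trial with the highest reported value of `metric_id`."""
    scored = [(metric_value(trial, metric_id), trial) for trial in trials]
    scored = [(value, trial) for value, trial in scored if value is not None]
    if not scored:
        raise ValueError("no trial reported the requested metric")
    return max(scored, key=lambda pair: pair[0])[1]


def fake_trial(trial_id, accuracy, **params):
    """Build a stand-in object shaped like a tuning trial (placeholder data only)."""
    return SimpleNamespace(
        id=trial_id,
        parameters=[SimpleNamespace(parameter_id=k, value=v) for k, v in params.items()],
        final_measurement=SimpleNamespace(
            metrics=[SimpleNamespace(metric_id="accuracy", value=accuracy)]
        ),
    )


# Placeholder trials with invented accuracies, used only to exercise the helpers.
trials = [
    fake_trial("1", 0.71, units=64),
    fake_trial("2", 0.84, units=128),
    fake_trial("3", 0.79, units=512),
]
best = best_trial(trials)
print(best.id, {p.parameter_id: p.value for p in best.parameters})
```

After the tuning job finishes, you could try passing `hp_job.trials` to `best_trial` in place of the stand-in list, provided the trial objects have the shape assumed here.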

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Set this to True only if you want to delete your bucket
delete_bucket = False

hp_job.delete()

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI