# Hyperparameter Tuning on Vertex AI (example model: black-and-white to color, from Chapter 6)

### Important Note: 
This notebook might deploy and consume cloud resources in your Google Cloud project(s), for which you may be billed. It is your responsibility to verify the impact of this code before you run it, and to monitor and delete any resources to avoid ongoing cloud charges.

## Install useful packages

### Dependencies

#### Before running this notebook, make sure the following libraries are installed at the versions listed.

- numpy==1.21.6
- google-cloud-aiplatform==1.24.1
- google-cloud-storage==2.9.0
- pillow==9.5.0
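
The pins above can be checked before running anything else. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` and its `get_version` parameter are illustrative assumptions, not part of the original notebook.

```python
from importlib import metadata

# Version pins copied from the dependency list above.
REQUIRED = {
    "numpy": "1.21.6",
    "google-cloud-aiplatform": "1.24.1",
    "google-cloud-storage": "2.9.0",
    "pillow": "9.5.0",
}


def find_mismatches(required, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that are missing or off-pin.

    `found` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(REQUIRED)` in a notebook cell returns an empty dictionary when every pin is satisfied; anything else lists what to reinstall.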

## Imports

In [1]:
import numpy as np
import glob
import matplotlib.pyplot as plt
import os
import google.cloud.aiplatform as aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt
from datetime import datetime
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
%matplotlib inline

## Setup project configurations

In [2]:
PROJECT_ID='417812395597'
REGION='us-west2'
SERVICE_ACCOUNT='417812395597-compute@developer.gserviceaccount.com'
BUCKET_URI='gs://my-training-artifacts'

## Initialize Vertex AI (SDK)

In [3]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Containerize training application code

### This task.py will run inside the container that we will define later
#### task.py should contain the entire training flow, including:
- load and prepare the training data
- define the model architecture
- train the model (run a trial with the given hyperparameters)
- save the model (optional)
- pass the trial's output metric to the hypertune reporting method
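
Each trial receives its hyperparameters as command-line flags, so the argument parser can be exercised locally before building the container. The sketch below mirrors the flags that task.py declares; the values in `example_argv` are hypothetical and only stand in for whatever the tuning service would choose.

```python
import argparse


def build_parser():
    """Parser with the same flags and types that task.py declares."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", required=True, type=int)
    parser.add_argument("--steps_per_epoch", required=True, type=int)
    parser.add_argument("--learning_rate", required=True, type=float)
    parser.add_argument("--batch_size", required=True, type=int)
    parser.add_argument("--loss", required=True, type=str)
    return parser


# Hypothetical arguments for a single trial (assumed values, for illustration only).
example_argv = [
    "--epochs=20",
    "--steps_per_epoch=150",
    "--learning_rate=0.0005",
    "--batch_size=32",
    "--loss=mse",
]

args = build_parser().parse_args(example_argv)
print(args.epochs, args.steps_per_epoch, args.learning_rate, args.batch_size, args.loss)
```

Because every flag is `required=True`, a hyperparameter that is missing from the tuning configuration makes the trial exit at startup instead of silently training with a default.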

In [4]:
%%writefile task.py
# Single, Mirror and Multi-Machine Distributed Training

import tensorflow as tf
import tensorflow
from tensorflow.python.client import device_lib
import argparse
import os
import sys
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
import hypertune

def get_args():
    '''Parses args. Must include all hyperparameters you want to tune.'''

    parser = argparse.ArgumentParser()
    parser.add_argument(
      '--epochs',
      required=True,
      type=int,
      help='training epochs')
    parser.add_argument(
      '--steps_per_epoch',
      required=True,
      type=int,
      help='steps_per_epoch')
    parser.add_argument(
      '--learning_rate',
      required=True,
      type=float,
      help='learning rate')
    parser.add_argument(
      '--batch_size',
      required=True,
      type=int,
      help='training batch size')
    parser.add_argument(
      '--loss',
      required=True,
      type=str,
      help='loss function')
    
    args = parser.parse_args()
    return args

print('Python Version = {}'.format(sys.version))
print('TensorFlow Version = {}'.format(tf.__version__))
print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
print('DEVICES', device_lib.list_local_devices())

# Single Machine, single compute device
DISTRIBUTE='single'
if DISTRIBUTE == 'single':
    # tf.test.is_gpu_available() is deprecated; list the physical GPUs instead.
    if tf.config.list_physical_devices('GPU'):
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
# Single Machine, multiple compute device
elif DISTRIBUTE == 'mirror':
    strategy = tf.distribute.MirroredStrategy()
# Multiple Machine, multiple compute device
elif DISTRIBUTE == 'multi':
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Multi-worker configuration
print('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))

# Preparing dataset
BUFFER_SIZE = 10000

def make_datasets_unbatched():
    # Load train, validation and test sets
    dest = 'gs://data-bucket-417812395597/'
    train_x = np.load(BytesIO(
        file_io.read_file_to_string(dest+'train_x', binary_mode=True)
    ))
    train_y = np.load(BytesIO(
        file_io.read_file_to_string(dest+'train_y', binary_mode=True)
    ))
    val_x = np.load(BytesIO(
        file_io.read_file_to_string(dest+'val_x', binary_mode=True)
    ))
    val_y = np.load(BytesIO(
        file_io.read_file_to_string(dest+'val_y', binary_mode=True)
    ))
    test_x = np.load(BytesIO(
        file_io.read_file_to_string(dest+'test_x', binary_mode=True)
    ))
    test_y = np.load(BytesIO(
        file_io.read_file_to_string(dest+'test_y', binary_mode=True)
    ))
    return train_x, train_y, val_x, val_y, test_x, test_y

def tf_model():
    black_n_white_input = tensorflow.keras.layers.Input(shape=(80, 80, 1))
    
    enc = black_n_white_input
    
    #Encoder part
    enc = tensorflow.keras.layers.Conv2D(
        32, kernel_size=3, strides=2, padding='same'
    )(enc)
    enc = tensorflow.keras.layers.LeakyReLU(alpha=0.2)(enc)
    enc = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(enc)
    
    enc = tensorflow.keras.layers.Conv2D(
        64, kernel_size=3, strides=2, padding='same'
    )(enc)
    enc = tensorflow.keras.layers.LeakyReLU(alpha=0.2)(enc)
    enc = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(enc)
    
    enc = tensorflow.keras.layers.Conv2D(
        128, kernel_size=3, strides=2, padding='same'
    )(enc)
    enc = tensorflow.keras.layers.LeakyReLU(alpha=0.2)(enc)
    enc = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(enc)
    
    enc = tensorflow.keras.layers.Conv2D(
        256, kernel_size=1, strides=2, padding='same'
    )(enc)
    enc = tensorflow.keras.layers.LeakyReLU(alpha=0.2)(enc)
    enc = tensorflow.keras.layers.Dropout(0.5)(enc)
    
    #Decoder part
    dec = enc
    
    dec = tensorflow.keras.layers.Conv2DTranspose(
        256, kernel_size=3, strides=2, padding='same'
    )(dec)
    dec = tensorflow.keras.layers.Activation('relu')(dec)
    dec = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(dec)
    
    dec = tensorflow.keras.layers.Conv2DTranspose(
        128, kernel_size=3, strides=2, padding='same'
    )(dec)
    dec = tensorflow.keras.layers.Activation('relu')(dec)
    dec = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(dec)
    
    dec = tensorflow.keras.layers.Conv2DTranspose(
        64, kernel_size=3, strides=2, padding='same'
    )(dec)
    dec = tensorflow.keras.layers.Activation('relu')(dec)
    dec = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(dec)
    
    dec = tensorflow.keras.layers.Conv2DTranspose(
        32, kernel_size=3, strides=2, padding='same'
    )(dec)
    dec = tensorflow.keras.layers.Activation('relu')(dec)
    dec = tensorflow.keras.layers.BatchNormalization(momentum=0.8)(dec)
    
    dec = tensorflow.keras.layers.Conv2D(
        3, kernel_size=3, padding='same'
    )(dec)
    
    color_image = tensorflow.keras.layers.Activation('tanh')(dec)
    
    return black_n_white_input, color_image

# Build and compile the TF model
def build_and_compile_tf_model(loss_fn, learning_rate):
    black_n_white_input, color_image = tf_model()
    model = tensorflow.keras.models.Model(
        inputs=black_n_white_input,
        outputs=color_image
    )
    _optimizer = tensorflow.keras.optimizers.Adam(
        learning_rate=learning_rate,
        beta_1=0.5
    )
    model.compile(
        loss=loss_fn,
        optimizer=_optimizer
    )
    return model

def main():
    args = get_args()
    
    NUM_WORKERS = strategy.num_replicas_in_sync
    # Here the batch size scales up by number of workers since
    # `tf.data.Dataset.batch` expects the global batch size.
    GLOBAL_BATCH_SIZE = args.batch_size * NUM_WORKERS
    MODEL_DIR = os.getenv("AIP_MODEL_DIR")

    train_x, train_y, val_x, val_y, _, _ = make_datasets_unbatched()

    with strategy.scope():
        # Creation of dataset, and model building/compiling need to be within
        # `strategy.scope()`.
        model = build_and_compile_tf_model(args.loss, args.learning_rate)

    history = model.fit(
        train_x,
        train_y,
        batch_size=GLOBAL_BATCH_SIZE,
        epochs=args.epochs,
        steps_per_epoch=args.steps_per_epoch,
        validation_data=(val_x, val_y),
    )
    model.save(MODEL_DIR)
    
    # DEFINE HPT METRIC
    hp_metric = history.history['val_loss'][-1]

    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='val_loss',
      metric_value=hp_metric,
      global_step=args.epochs)


if __name__ == "__main__":
    main()

Overwriting task.py
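
As a sanity check on the architecture in `tf_model()`, the spatial size of the feature maps can be traced by hand. With `padding='same'`, a strided `Conv2D` produces `ceil(size / stride)` and a `Conv2DTranspose` produces `size * stride`. The sketch below applies those two rules to the layer stack above; the helper names are illustrative and not part of task.py.

```python
import math


def conv_same(size, stride):
    """Output size of Conv2D with padding='same'."""
    return math.ceil(size / stride)


def conv_transpose_same(size, stride):
    """Output size of Conv2DTranspose with padding='same'."""
    return size * stride


def trace_spatial_size(size=80):
    """Follow one spatial dimension through the encoder and decoder."""
    sizes = [size]
    for _ in range(4):  # four stride-2 Conv2D layers in the encoder
        size = conv_same(size, 2)
        sizes.append(size)
    for _ in range(4):  # four stride-2 Conv2DTranspose layers in the decoder
        size = conv_transpose_same(size, 2)
        sizes.append(size)
    size = conv_same(size, 1)  # final Conv2D with 3 filters and default stride 1
    sizes.append(size)
    return sizes


print(trace_spatial_size())  # [80, 40, 20, 10, 5, 10, 20, 40, 80, 80]
```

The 80x80 input is therefore compressed to 5x5 at the bottleneck and restored to 80x80, with the last layer emitting 3 color channels.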


## Create staging bucket

In [5]:
BUCKET_URI = "gs://hpt-staging"  # @param {type:"string"}

if BUCKET_URI == "" or BUCKET_URI is None or BUCKET_URI == "gs://[your-bucket-name]":
    BUCKET_URI = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP

! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
    
GCS_OUTPUT_BUCKET = BUCKET_URI + "/output/"

Creating gs://hpt-staging/...
ServiceException: 409 A Cloud Storage bucket named 'hpt-staging' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
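
The 409 above only means the bucket already exists, so the cell is safe to re-run. If you prefer to avoid the error, one option is to look the bucket up first and create it only when it is missing. The sketch below assumes `client` behaves like `google.cloud.storage.Client` (its `lookup_bucket` and `create_bucket` methods); the helper name `ensure_bucket` is hypothetical.

```python
def ensure_bucket(client, bucket_name, location):
    """Return the bucket, creating it only when the lookup finds nothing.

    Assumption: `client` exposes lookup_bucket(name) -> bucket or None, and
    create_bucket(name, location=...), as google.cloud.storage.Client does.
    `bucket_name` is the bare name, without the "gs://" prefix.
    """
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)
    return bucket


# Possible usage inside the notebook (requires google-cloud-storage and credentials):
#   from google.cloud import storage
#   ensure_bucket(storage.Client(project=PROJECT_ID), "hpt-staging", REGION)
```

Note that if the name is taken by a bucket in a project you cannot access, the lookup is likely to raise a permission error rather than return `None`, and a different bucket name is needed.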


## Write Dockerfile with all dependencies

In [6]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8

WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune

# Copies the trainer code to the Docker image.
COPY task.py .

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "task"]

Overwriting Dockerfile


## Build the training container and push to Container Registry

### Define image name and tag

In [7]:
PROJECT_NAME="kartik-first-project"
IMAGE_URI = (
    f"gcr.io/{PROJECT_NAME}/example-tf-hptune:latest"
)

### Build the image

In [8]:
! docker build ./ -t $IMAGE_URI

Sending build context to Docker daemon  2.409MB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8
 ---> daa3282bb3b7
Step 2/5 : WORKDIR /
 ---> Using cache
 ---> 5323ad28255d
Step 3/5 : RUN pip install cloudml-hypertune
 ---> Using cache
 ---> 2988333848b9
Step 4/5 : COPY task.py .
 ---> 0363fe6e9c20
Step 5/5 : ENTRYPOINT ["python", "-m", "task"]
 ---> Running in 762e9ce0b817
Removing intermediate container 762e9ce0b817
 ---> 4c41d77ed699
Successfully built 4c41d77ed699
Successfully tagged gcr.io/kartik-first-project/example-tf-hptune:latest


### Push to Google Container Registry (GCR)

In [9]:
! docker push $IMAGE_URI

The push refers to repository [gcr.io/kartik-first-project/example-tf-hptune]


## Configure HyperParameter Tuning Job

### Define worker specifications

In [10]:
# The spec of the worker pools, including machine type and Docker image.
# The `image_uri` is built from PROJECT_NAME; set that to your own project.

worker_pool_specs = [
    {
        "machine_spec": {
            "machine_type": "n1-standard-8",
            "accelerator_type": None,
            "accelerator_count": 0,
        },
        "replica_count": 1,
        "container_spec": {
            "image_uri": f"gcr.io/{PROJECT_NAME}/example-tf-hptune:latest"
        },
    }
]
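
The spec above runs each trial on CPU only. To try the same container with a GPU, the `machine_spec` needs an accelerator type and a non-zero count. The helper below is a sketch of how both variants could be built from one function; `make_worker_pool_specs` and the `my-project` image path are hypothetical, and `NVIDIA_TESLA_T4` is just one accelerator type whose availability depends on your region and quota.

```python
def make_worker_pool_specs(
    image_uri,
    machine_type="n1-standard-8",
    accelerator_type=None,
    accelerator_count=0,
    replica_count=1,
):
    """Build a single-pool worker spec, adding accelerator fields only when requested."""
    machine_spec = {"machine_type": machine_type}
    if accelerator_type and accelerator_count > 0:
        machine_spec["accelerator_type"] = accelerator_type
        machine_spec["accelerator_count"] = accelerator_count
    return [
        {
            "machine_spec": machine_spec,
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri},
        }
    ]


# Placeholder image path (assumption); substitute the IMAGE_URI defined earlier.
example_image = "gcr.io/my-project/example-tf-hptune:latest"

cpu_specs = make_worker_pool_specs(example_image)
gpu_specs = make_worker_pool_specs(
    example_image, accelerator_type="NVIDIA_TESLA_T4", accelerator_count=1
)
```

Since task.py falls back to `/cpu:0` when no GPU is visible, the same image should work with either spec.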

### Define parameter specifications (for tuning)

In [11]:
# Dictionary of the hyperparameters to tune.
# Each key is the parameter_id, which is passed to the training job
# as a command-line argument of the same name.
# Each value is the specification (type, range or values, scale) of that parameter.
parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=0.0001, max=0.001, scale="log"),
    "epochs": hpt.DiscreteParameterSpec(values=[10, 20, 30], scale=None),
    "steps_per_epoch": hpt.IntegerParameterSpec(min=100, max=300, scale="linear"),
    "batch_size": hpt.DiscreteParameterSpec(values=[16, 32, 64], scale=None),
    "loss": hpt.CategoricalParameterSpec(["mse"]), # we can add other loss values
}
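
To see why `scale="log"` suits `learning_rate`, compare how a log scale and a linear scale spread values over the range 0.0001 to 0.001. The sketch below uses plain random sampling purely as an illustration; it does not reproduce the search strategy the tuning service actually uses, and the helper names are assumptions.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value uniformly in log space between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_linear(low, high, rng):
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


LOW, HIGH = 1e-4, 1e-3  # same bounds as the learning_rate spec above
rng = random.Random(0)

log_samples = [sample_log_uniform(LOW, HIGH, rng) for _ in range(1000)]
lin_samples = [sample_linear(LOW, HIGH, rng) for _ in range(1000)]

# Geometric midpoint of the range, roughly 3.16e-4.
midpoint = math.sqrt(LOW * HIGH)
log_below = sum(s < midpoint for s in log_samples) / len(log_samples)
lin_below = sum(s < midpoint for s in lin_samples) / len(lin_samples)
print(f"share below {midpoint:.2e}: log={log_below:.2f}, linear={lin_below:.2f}")
```

On a log scale about half of the draws land below the geometric midpoint, while on a linear scale only about a quarter do, so the smaller learning rates get far less coverage.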

### Define metric spec

In [12]:
metric_spec = {"val_loss": "minimize"}
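
With a `minimize` goal, the best trial is simply the one with the smallest reported `val_loss`. The sketch below shows that selection on stand-in objects. It assumes trial objects shaped like those in the tuning job's `trials` list (an `id`, plus `final_measurement.metrics` entries carrying `metric_id` and `value`); the helper `best_trial` and the numbers used are hypothetical, not results from this notebook.

```python
from types import SimpleNamespace


def best_trial(trials, metric_id="val_loss", goal="minimize"):
    """Pick the trial whose final measurement is best for the given goal."""

    def metric_value(trial):
        for metric in trial.final_measurement.metrics:
            if metric.metric_id == metric_id:
                return metric.value
        raise KeyError(f"trial {trial.id} did not report {metric_id}")

    pick = min if goal == "minimize" else max
    return pick(trials, key=metric_value)


def fake_trial(trial_id, val_loss):
    """Stand-in for a finished trial; the values are made up for illustration."""
    metric = SimpleNamespace(metric_id="val_loss", value=val_loss)
    return SimpleNamespace(
        id=trial_id, final_measurement=SimpleNamespace(metrics=[metric])
    )


example_trials = [fake_trial("1", 0.042), fake_trial("2", 0.031), fake_trial("3", 0.055)]
winner = best_trial(example_trials)
print(winner.id)
```

Trials that failed before reporting a metric would raise `KeyError` here, so in practice you may want to filter them out first.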

### Define Custom Job (that will run each trial)

In [13]:
my_custom_job = aiplatform.CustomJob(
    display_name="example-tf-hpt-job",
    worker_pool_specs=worker_pool_specs,
    staging_bucket=GCS_OUTPUT_BUCKET,
)

## Create HyperParameter Tuning Job

In [14]:
hp_job = aiplatform.HyperparameterTuningJob(
    display_name="example-tf-hpt-job",
    custom_job=my_custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=5,
    parallel_trial_count=3,
)
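
`max_trial_count` and `parallel_trial_count` trade wall-clock time against how much each new trial can learn from earlier results. As a rough guide, the smallest number of sequential rounds is the trial count divided by the parallelism, rounded up. This is only a lower-bound sketch with a hypothetical helper name; the service may schedule trials differently.

```python
import math


def min_sequential_rounds(max_trial_count, parallel_trial_count):
    """Lower bound on the number of back-to-back rounds needed to run all trials."""
    if parallel_trial_count < 1 or max_trial_count < 1:
        raise ValueError("both counts must be at least 1")
    return math.ceil(max_trial_count / parallel_trial_count)


# Values used in the job above: 5 trials, up to 3 at a time.
rounds = min_sequential_rounds(5, 3)
print(rounds)  # 2
```

With fully sequential trials (`parallel_trial_count=1`) the same job would need five rounds, but every trial after the first could use all earlier results.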

## Run job

In [15]:
hp_job.run()

Creating HyperparameterTuningJob
HyperparameterTuningJob created. Resource name: projects/417812395597/locations/us-west2/hyperparameterTuningJobs/852574510317043712
To use this HyperparameterTuningJob in another session:
hpt_job = aiplatform.HyperparameterTuningJob.get('projects/417812395597/locations/us-west2/hyperparameterTuningJobs/852574510317043712')
View HyperparameterTuningJob:
https://console.cloud.google.com/ai/platform/locations/us-west2/training/852574510317043712?project=417812395597
HyperparameterTuningJob projects/417812395597/locations/us-west2/hyperparameterTuningJobs/852574510317043712 current state:
JobState.JOB_STATE_PENDING
HyperparameterTuningJob projects/417812395597/locations/us-west2/hyperparameterTuningJobs/852574510317043712 current state:
JobState.JOB_STATE_RUNNING
HyperparameterTuningJob projects/417812395597/locations/us-west2/hyperparameterTuningJobs/852574510317043712 current state:
JobState.JOB_STATE_RUNNING
HyperparameterTuningJob projects/417812395597