In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# E2E ML on GCP: MLOps stage 2 : experimentation

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage2/mlops_experimentation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/ai/platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage2/mlops_experimentation.ipynb">
      Open in Google Cloud Notebooks
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI for E2E MLOps on Google Cloud in production. This tutorial covers stage 2 : experimentation.

### Dataset

This tutorial uses the [Chicago Taxi](https://www.kaggle.com/chicago/chicago-taxi-trips-bq) dataset. The version of the dataset you will use in this tutorial is stored in a public BigQuery table. The trained model predicts whether someone would leave a tip for a taxi fare.

### Objective

In this tutorial, you create an MLOps stage 2: experimentation process.

This tutorial uses the following Vertex AI services:

- `Vertex AI Datasets`
- `Vertex AI Models`
- `Vertex AI AutoML`
- `Vertex AI Training`
- `Vertex AI TensorBoard`
- `Vertex AI Vizier`
- `Vertex AI Batch Prediction`

The steps performed include:

- Review the `Dataset` resource created during stage 1.
- Train an AutoML tabular binary classifier model in the background.
- Build the experimental model architecture.
- Construct a custom training package for the `Dataset` resource.
- Test the custom training package locally.
- Test the custom training package in the cloud with Vertex AI Training.
- Hyperparameter tune the model training with Vertex AI Vizier.
- Train the custom model with Vertex AI Training.
- Add a serving function for online/batch prediction to the custom model.
- Test the custom model with the serving function.
- Evaluate the custom model using Vertex AI Batch Prediction.
- Wait for the AutoML training job to complete.
- Evaluate the AutoML model using Vertex AI Batch Prediction with the same evaluation slices as the custom model.
- Set the evaluation results of the AutoML model as the baseline.
- If the evaluation of the custom model is below baseline, continue to experiment with the custom model.
- If the evaluation of the custom model is above baseline, save the model as the first best model.

### Recommendations

When doing E2E MLOps on Google Cloud for experimentation, the following best practices with structured (tabular) data are recommended:

 - Determine a baseline evaluation using AutoML.
 - Design and build a model architecture.
     - Upload the untrained model architecture as a Vertex AI Model resource.


 - Construct a training package that can be run locally and as a Vertex AI Training job.
     - Decompose the training package into: data, model, train and task Python modules.
     - Obtain the location of the transformed training data from the user metadata of the Vertex AI Dataset resource.
     - Obtain the location of the model artifacts from the Vertex AI Model resource.
     - Include in the training package initializing a Vertex AI Experiment and corresponding run.
     - Log hyperparameters and training parameters for the experiment.
     - Add callbacks for early stop, TensorBoard, and hyperparameter tuning, where hyperparameter tuning is a command-line option.


 - Test the training package locally with a small number of epochs.
 - Test the training package with Vertex AI Training.
 - Do hyperparameter tuning with Vertex AI Hyperparameter Tuning.
 - Do full training of the custom model with Vertex AI Training.
     - Log the hyperparameter values for the experiment/run.


 - Evaluate the custom model.
     - Single evaluation slice, same metrics as AutoML
         - Add evaluation to the training package and return the results in a file in the Cloud Storage bucket used for training
     - Custom evaluation slices, custom metrics
         - Evaluate custom evaluation slices as a Vertex AI Batch Prediction for both AutoML and custom model
         - Perform custom metrics on the results from the batch job


 - Compare custom model metrics against the AutoML baseline
     - If less than baseline, then continue to experiment
     - If greater than baseline, then upload the model as the new baseline and save the evaluation results with the model.
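
The comparison in the last recommendation can be sketched as a small decision function. This is a minimal illustration, not part of the tutorial's training package: the function name `next_step` and the metric values are hypothetical, and it assumes a metric where higher is better (for example, accuracy), with ties treated as not beating the baseline.

```python
def next_step(custom_metric: float, baseline_metric: float) -> str:
    """Decide what to do with a custom model given the AutoML baseline.

    Assumes a higher metric value is better (for example, accuracy).
    A tie is treated as not beating the baseline.
    """
    if custom_metric > baseline_metric:
        return "promote"
    return "keep experimenting"


# Illustrative values only; real values come from your evaluation runs.
print(next_step(custom_metric=0.87, baseline_metric=0.85))
print(next_step(custom_metric=0.80, baseline_metric=0.85))
```

For a metric where lower is better, such as log loss, the comparison would need to be reversed.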

## Installations

Install *one time* the packages for executing the MLOps notebooks.

In [None]:
ONCE_ONLY = False
if ONCE_ONLY:
    ! pip3 install -U tensorflow==2.5 $USER_FLAG
    ! pip3 install -U tensorflow-data-validation==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-transform==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-io==0.18 $USER_FLAG
    ! pip3 install --upgrade google-cloud-aiplatform[tensorboard] $USER_FLAG
    ! pip3 install --upgrade google-cloud-pipeline-components $USER_FLAG
    ! pip3 install --upgrade google-cloud-bigquery $USER_FLAG
    ! pip3 install --upgrade google-cloud-logging $USER_FLAG
    ! pip3 install --upgrade apache-beam[gcp] $USER_FLAG
    ! pip3 install --upgrade pyarrow $USER_FLAG
    ! pip3 install --upgrade cloudml-hypertune $USER_FLAG
    ! pip3 install --upgrade kfp $USER_FLAG
    ! pip3 install --upgrade torchvision $USER_FLAG
    ! pip3 install --upgrade rpy2 $USER_FLAG

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

#### Service Account

**If you don't know your service account**, try to get it with the `gcloud` command by executing the second cell below.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    shell_output = !gcloud auth list 2>/dev/null
    SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()
    print("Service Account:", SERVICE_ACCOUNT)

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
import google.cloud.aiplatform as aip

#### Import TensorFlow

Import the TensorFlow package into your Python environment.

In [None]:
import tensorflow as tf

#### Import TensorFlow Transform

Import the TensorFlow Transform (TFT) package into your Python environment.

In [None]:
import tensorflow_transform as tft

#### Import TensorFlow Data Validation

Import the TensorFlow Data Validation (TFDV) package into your Python environment.

In [None]:
import tensorflow_data_validation as tfdv

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

#### Set hardware accelerators

You can set hardware accelerators for training and prediction.

Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:

    (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)


Otherwise specify `(None, None)` to use a container image to run on a CPU.

Learn more about [hardware accelerator support for your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

*Note*: TF releases before 2.3 for GPU support will fail to load the custom model in this tutorial. It is a known issue and fixed in TF 2.3. This is caused by static graph ops that are generated in the serving function. If you encounter this issue on your own custom models, use a container image for TF 2.3 with GPU support.

In [None]:
if os.getenv("IS_TESTING_TRAIN_GPU"):
    TRAIN_GPU, TRAIN_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_TRAIN_GPU")),
    )
else:
    TRAIN_GPU, TRAIN_NGPU = (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)

if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (None, None)

#### Set pre-built containers

Set the pre-built Docker container image for training and prediction.


For the latest list, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers).


For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).

In [None]:
if os.getenv("IS_TESTING_TF"):
    TF = os.getenv("IS_TESTING_TF")
else:
    TF = "2.5".replace(".", "-")

if TF[0] == "2":
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf2-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf2-cpu.{}".format(TF)
else:
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf-cpu.{}".format(TF)

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)
DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)

print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)
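
The branching in the cell above can be condensed into one helper. The sketch below is an illustration only: `prebuilt_image_uri` is a hypothetical name, it reproduces the string composition used above, and it does not check that the resulting image actually exists in the registry.

```python
def prebuilt_image_uri(region, kind, tf_version, use_gpu):
    """Compose a pre-built container URI the same way the cell above does.

    region:     a Vertex AI region such as "us-central1"
    kind:       "training" or "prediction"
    tf_version: a dotted TensorFlow version such as "2.5"
    use_gpu:    True for a GPU image, False for a CPU image
    """
    device = "gpu" if use_gpu else "cpu"
    # In the cell above, only TF 2.x prediction images use the "tf2-" prefix.
    if kind == "prediction" and tf_version.startswith("2"):
        prefix = "tf2"
    else:
        prefix = "tf"
    version = tf_version.replace(".", "-")
    multi_region = region.split("-")[0]  # "us-central1" -> "us"
    return (
        f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/"
        f"{prefix}-{device}.{version}:latest"
    )


print(prebuilt_image_uri("us-central1", "training", "2.5", use_gpu=True))
print(prebuilt_image_uri("us-central1", "prediction", "2.5", use_gpu=False))
```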

#### Set machine type

Next, set the machine type to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure the compute resources for the VMs you will use for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.

In [None]:
if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

if os.getenv("IS_TESTING_DEPLOY_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_DEPLOY_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)
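
The restrictions in the notes above can be checked before a job is submitted. The following is a sketch under stated assumptions: `training_machine_type` is a hypothetical helper, and it reads `standard` and `highcpu` in the note as the `n1-standard` and `n1-highcpu` families.

```python
# vCPU counts listed above
VALID_VCPUS = (2, 4, 8, 16, 32, 64, 96)

# Combinations the note above lists as unsupported for training
# (assumption: "standard" and "highcpu" refer to the n1 families).
UNSUPPORTED_FOR_TRAINING = {
    ("n1-standard", 2),
    ("n1-highcpu", 2),
    ("n1-highcpu", 4),
    ("n1-highcpu", 8),
}


def training_machine_type(family, vcpus):
    """Return a machine type string such as "n1-standard-4", or raise ValueError."""
    if vcpus not in VALID_VCPUS:
        raise ValueError(f"{vcpus} is not a listed vCPU count")
    if (family, vcpus) in UNSUPPORTED_FOR_TRAINING:
        raise ValueError(f"{family}-{vcpus} is not supported for training")
    return f"{family}-{vcpus}"


print(training_machine_type("n1-standard", 4))
```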

### Retrieve the dataset from stage 1

Next, retrieve the dataset you created during stage 1 with the helper function `find_dataset()`. This helper function finds all the datasets whose display name matches the specified prefix and import format (e.g., bq). Finally it sorts the matches by create time and returns the latest version.

In [None]:
def find_dataset(display_name_prefix, import_format):
    """Return the most recently created matching dataset, or None if none match."""
    # Key in the dataset's input config for each supported import format
    source_keys = {"bq": "bigquerySource", "csv": "gcsSource"}
    source_key = source_keys.get(import_format)

    matches = []
    for dataset in aip.TabularDataset.list():
        if not dataset.display_name.startswith(display_name_prefix):
            continue
        try:
            input_config = dataset.to_dict()["metadata"]["inputConfig"]
        except (KeyError, TypeError):
            # Skip datasets that do not carry an input config
            continue
        if source_key and input_config.get(source_key):
            matches.append(dataset)

    # Pick the latest match by creation time
    return max(matches, key=lambda match: match.create_time, default=None)


dataset = find_dataset("Chicago Taxi", "bq")

print(dataset)
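
The "latest version" selection can be tried without any cloud resources. The sketch below uses stand-in objects with hypothetical display names and dates; real code would pass `aip.TabularDataset` instances, which expose the same `create_time` attribute used above.

```python
from datetime import datetime
from types import SimpleNamespace


def latest_by_create_time(resources):
    """Return the most recently created resource, or None if the list is empty."""
    return max(resources, key=lambda r: r.create_time, default=None)


# Stand-in objects for illustration; names and dates are made up.
candidates = [
    SimpleNamespace(display_name="Chicago Taxi v1", create_time=datetime(2021, 9, 1)),
    SimpleNamespace(display_name="Chicago Taxi v2", create_time=datetime(2021, 10, 1)),
]

print(latest_by_create_time(candidates).display_name)
print(latest_by_create_time([]))
```

Returning `None` for an empty list lets the caller detect that no stage 1 dataset was found instead of failing later with an unrelated error.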

### Load dataset's user metadata

Load the user metadata for the dataset.

In [None]:
import json

try:
    with tf.io.gfile.GFile(
        "gs://" + dataset.labels["user_metadata"] + "/metadata.jsonl", "r"
    ) as f:
        metadata = json.load(f)

    print(metadata)
except:
    print("no metadata")
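
Later cells in this notebook read several keys from `metadata`, so it can help to check for them right after loading. This is a minimal sketch: `missing_metadata_keys` is a hypothetical helper, the key names are the ones this notebook reads, and the feature names in the example dictionary are made up.

```python
# Keys that later cells in this notebook read from the user metadata
REQUIRED_KEYS = (
    "numeric_features",
    "categorical_features",
    "embedding_features",
    "label_column",
    "transform_artifacts_dir",
    "transformed_data_prefix",
)


def missing_metadata_keys(metadata):
    """Return the required keys that are absent from the metadata dictionary."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


# Hypothetical, incomplete metadata for illustration only.
example = {
    "numeric_features": ["trip_miles"],
    "categorical_features": ["payment_type"],
    "embedding_features": ["pickup_grid"],
    "label_column": "tip_bin",
}

print(missing_metadata_keys(example))
```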

### Create and run training pipeline

To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.

#### Create training pipeline

An AutoML training pipeline is created with the `AutoMLTabularTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the `TrainingJob` resource.
- `optimization_prediction_type`: The type of task to train the model for.
  - `classification`: A tabular classification model.
  - `regression`: A tabular regression model.
- `column_transformations`: (Optional): Transformations to apply to the input columns
- `optimization_objective`: The optimization objective to minimize or maximize.
  - binary classification:
    - `minimize-log-loss`
    - `maximize-au-roc`
    - `maximize-au-prc`
    - `maximize-precision-at-recall`
    - `maximize-recall-at-precision`
  - multi-class classification:
    - `minimize-log-loss`
  - regression:
    - `minimize-rmse`
    - `minimize-mae`
    - `minimize-rmsle`

The instantiated object is the DAG (directed acyclic graph) for the training pipeline.

In [None]:
dag = aip.AutoMLTabularTrainingJob(
    display_name="chicago_" + TIMESTAMP,
    optimization_prediction_type="classification",
    optimization_objective="minimize-log-loss",
)

print(dag)

#### Run the training pipeline

Next, you run the DAG to start the training job by invoking the method `run`, with the following parameters:

- `dataset`: The `Dataset` resource to train the model.
- `model_display_name`: The human readable name for the trained model.
- `training_fraction_split`: The percentage of the dataset to use for training.
- `test_fraction_split`: The percentage of the dataset to use for test (holdout data).
- `validation_fraction_split`: The percentage of the dataset to use for validation.
- `target_column`: The name of the column to train as the label.
- `budget_milli_node_hours`: (Optional) Maximum training time, specified in milli node hours (1,000 = 1 node hour).
- `disable_early_stopping`: If `False` (the default), training may complete before using the entire budget if the service believes it cannot further improve on the model objective measurements. If `True`, the entire budget is used.

The `run` method when completed returns the `Model` resource.

The execution of the training pipeline can take up to 180 minutes.

In [None]:
async_model = dag.run(
    dataset=dataset,
    model_display_name="chicago_" + TIMESTAMP,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    budget_milli_node_hours=8000,
    disable_early_stopping=False,
    target_column="tip_bin",
    sync=False,
)
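
As a quick sanity check on the arguments above, the budget can be converted to node hours and the split fractions checked for consistency. The helper names below are hypothetical and the sketch runs locally without calling Vertex AI.

```python
def node_hours(budget_milli_node_hours):
    """Convert a budget in milli node hours to node hours (1,000 = 1 node hour)."""
    return budget_milli_node_hours / 1000


def splits_sum_to_one(training, validation, test, tolerance=1e-9):
    """Check that the three split fractions add up to 1, allowing for float error."""
    return abs(training + validation + test - 1.0) <= tolerance


# Values used in the run() call above
print(node_hours(8000))
print(splits_sum_to_one(0.8, 0.1, 0.1))
```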

### Create experiment for tracking training related metadata

Set up tracking of the parameters (configuration) and metrics (results) for each experiment:

- `aip.init()` - Create an experiment instance
- `aip.start_run()` - Track a specific run within the experiment.

Learn more about [Introduction to Vertex AI ML Metadata](https://cloud.google.com/vertex-ai/docs/ml-metadata/introduction).

In [None]:
EXPERIMENT_NAME = "chicago-" + TIMESTAMP
aip.init(experiment=EXPERIMENT_NAME)
aip.start_run("run-1")

### Create a Vertex AI TensorBoard instance

Create a Vertex AI TensorBoard instance to use TensorBoard in conjunction with Vertex AI Training for custom model training.

Learn more about [Get started with Vertex AI TensorBoard](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview).

In [None]:
TENSORBOARD_DISPLAY_NAME = "chicago_" + TIMESTAMP
tensorboard = aip.Tensorboard.create(display_name=TENSORBOARD_DISPLAY_NAME)
tensorboard_resource_name = tensorboard.gca_resource.name
print("TensorBoard resource name:", tensorboard_resource_name)

### Create the input layer for your custom model

Next, you create the input layer for your custom tabular model, based on the data types of each feature.

In [None]:
from tensorflow.keras.layers import Input


def create_model_inputs(
    numeric_features=None, categorical_features=None, embedding_features=None
):
    inputs = {}
    # Treat an omitted feature list as empty so the None defaults are usable
    for feature_name in numeric_features or []:
        inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.float32)
    for feature_name in categorical_features or []:
        inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.int64)
    for feature_name in embedding_features or []:
        inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.int64)

    return inputs

In [None]:
input_layers = create_model_inputs(
    numeric_features=metadata["numeric_features"],
    categorical_features=metadata["categorical_features"],
    embedding_features=metadata["embedding_features"],
)

print(input_layers)

### Create the binary classifier custom model

Next, you create your binary classifier custom tabular model.

In [None]:
from math import sqrt

from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import (Activation, Concatenate, Dense, Embedding,
                                     experimental)


def create_binary_classifier(
    input_layers,
    tft_output,
    metaparams,
    numeric_features,
    categorical_features,
    embedding_features,
):
    layers = []
    for feature_name in input_layers:
        if feature_name in embedding_features:
            vocab_size = tft_output.vocabulary_size_by_name(feature_name)
            embedding_size = int(sqrt(vocab_size))
            embedding_output = Embedding(
                input_dim=vocab_size + 1,
                output_dim=embedding_size,
                name=f"{feature_name}_embedding",
            )(input_layers[feature_name])
            layers.append(embedding_output)
        elif feature_name in categorical_features:
            vocab_size = tft_output.vocabulary_size_by_name(feature_name)
            onehot_layer = experimental.preprocessing.CategoryEncoding(
                num_tokens=vocab_size,
                output_mode="binary",
                name=f"{feature_name}_onehot",
            )(input_layers[feature_name])
            layers.append(onehot_layer)
        elif feature_name in numeric_features:
            numeric_layer = tf.expand_dims(input_layers[feature_name], -1)
            layers.append(numeric_layer)
        else:
            pass

    joined = Concatenate(name="combines_inputs")(layers)
    feedforward_output = Sequential(
        [Dense(units, activation="relu") for units in metaparams["hidden_units"]],
        name="feedforward_network",
    )(joined)
    logits = Dense(units=1, name="logits")(feedforward_output)
    pred = Activation("sigmoid")(logits)

    model = Model(inputs=input_layers, outputs=[pred])
    return model

In [None]:
TRANSFORM_ARTIFACTS_DIR = metadata["transform_artifacts_dir"]
tft_output = tft.TFTransformOutput(TRANSFORM_ARTIFACTS_DIR)

metaparams = {"hidden_units": [128, 64]}
aip.log_params(metaparams)

model = create_binary_classifier(
    input_layers,
    tft_output,
    metaparams,
    numeric_features=metadata["numeric_features"],
    categorical_features=metadata["categorical_features"],
    embedding_features=metadata["embedding_features"],
)

model.summary()
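
To see how `create_binary_classifier()` sizes each embedding, the arithmetic can be reproduced on its own. In this sketch `embedding_dims` is a hypothetical helper and the vocabulary sizes are made up; only the two formulas (`vocab_size + 1` and `int(sqrt(vocab_size))`) come from the function above.

```python
from math import sqrt


def embedding_dims(vocab_size):
    """Return (input_dim, output_dim) as computed in create_binary_classifier()."""
    input_dim = vocab_size + 1          # one extra row beyond the vocabulary
    output_dim = int(sqrt(vocab_size))  # square root, truncated to an integer
    return input_dim, output_dim


# Illustrative vocabulary sizes only
for vocab_size in (10, 100, 1000):
    print(vocab_size, embedding_dims(vocab_size))
```

The square-root rule keeps the embedding width small relative to the vocabulary, so the number of embedding weights grows roughly as `vocab_size ** 1.5`.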

#### Visualize the model architecture

Next, visualize the architecture of the custom model.

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, show_dtype=True)

### Save model artifacts

Next, save the model artifacts to your Cloud Storage bucket.

In [None]:
MODEL_DIR = f"{BUCKET_NAME}/base_model"

model.save(MODEL_DIR)

### Upload the local model to a Vertex AI Model resource

Next, you upload your local custom model artifacts to Vertex AI to convert into a managed Vertex AI Model resource.

In [None]:
vertex_custom_model = aip.Model.upload(
    display_name="chicago_" + TIMESTAMP,
    artifact_uri=MODEL_DIR,
    serving_container_image_uri=DEPLOY_IMAGE,
    labels={"base_model": "1"},
    sync=True,
)

### Construct the training package

#### Package layout

Before you start training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - \_\_init\_\_.py
  - task.py
  - other Python scripts

The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.

The file `trainer/task.py` is the Python script for executing the custom training job.

In [None]:
# Make folder for Python training script
! rm -rf custom
! mkdir custom

# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'google-cloud-aiplatform',\n\n        'cloudml-hypertune',\n\n        'tensorflow_datasets==1.3.0',\n\n        'tensorflow_data_validation==1.2',\n\n    ],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: Chicago Taxi tabular binary classifier\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: cdpe@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex AI"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py
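
Before adding the training modules, you may want to confirm that the package skeleton matches the layout described above. The following sketch is an assumption-laden illustration: `missing_package_files` is a hypothetical helper, and the demonstration runs against a scratch directory rather than the `custom` folder.

```python
import tempfile
from pathlib import Path

# Files from the package layout above that exist before any trainer module is written
EXPECTED_FILES = (
    "PKG-INFO",
    "README.md",
    "setup.cfg",
    "setup.py",
    "trainer/__init__.py",
)


def missing_package_files(root):
    """Return the expected files that are not present under the package root."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstrate on a scratch directory that is deliberately incomplete.
with tempfile.TemporaryDirectory() as scratch:
    pkg = Path(scratch)
    (pkg / "trainer").mkdir()
    for name in ("README.md", "setup.py", "trainer/__init__.py"):
        (pkg / name).touch()
    print(missing_package_files(pkg))
```

In the notebook, calling `missing_package_files("custom")` after the cell above should return an empty list if every file was written.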

#### Get feature specification for the preprocessed data

Next, create the feature specification for the preprocessed data.

In [None]:
transform_feature_spec = tft_output.transformed_feature_spec()
print(transform_feature_spec)

### Load the transformed data into a tf.data.Dataset

Next, you load the gzip-compressed TFRecords from Cloud Storage into a `tf.data.Dataset` generator. These functions are reused when training the custom model with Vertex AI Training, so you save them to the Python training package.

In [None]:
%%writefile custom/trainer/data.py

import tensorflow as tf

def _gzip_reader_fn(filenames):
    """Small utility returning a record reader that can read gzip'ed files."""
    return tf.data.TFRecordDataset(filenames, compression_type="GZIP")


def get_dataset(file_pattern, feature_spec, label_column, batch_size=200):
    """Generates features and label for tuning/training.
    Args:
      file_pattern: input tfrecord file pattern.
      feature_spec: a dictionary of feature specifications.
      label_column: name of the feature to split out as the label.
      batch_size: representing the number of consecutive elements of returned
        dataset to combine in a single batch
    Returns:
      A dataset that contains (features, labels) tuples where features is a
        dictionary of Tensors, and labels is a single Tensor of label values.
    """

    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,
        batch_size=batch_size,
        features=feature_spec,
        label_key=label_column,
        reader=_gzip_reader_fn,
        num_epochs=1,
        drop_final_batch=True,
    )

    return dataset

In [None]:
from custom.trainer import data

TRANSFORMED_DATA_PREFIX = metadata["transformed_data_prefix"]
LABEL_COLUMN = metadata["label_column"]

train_data_file_pattern = TRANSFORMED_DATA_PREFIX + "/train/data-*.gz"
val_data_file_pattern = TRANSFORMED_DATA_PREFIX + "/val/data-*.gz"
test_data_file_pattern = TRANSFORMED_DATA_PREFIX + "/test/data-*.gz"

for input_features, target in data.get_dataset(
    train_data_file_pattern, transform_feature_spec, LABEL_COLUMN, batch_size=3
).take(1):
    for key in input_features:
        print(
            f"{key} {input_features[key].dtype}: {input_features[key].numpy().tolist()}"
        )
    print(f"target: {target.numpy().tolist()}")

#### Test the model architecture with transformed input

Next, test the model architecture with a sample of the transformed training input.

*Note:* Since the model is untrained, the predictions should be random. Since this is a binary classifier, expect predicted values near 0.5.

In [None]:
model(input_features)

## Develop and test the training scripts

When experimenting, one typically develops and tests the training package locally, before moving to training in the cloud.

### Create training script

Next, you write the Python script for compiling and training the model.

In [None]:
%%writefile custom/trainer/train.py

from trainer import data
import tensorflow as tf
import logging
from hypertune import HyperTune

def compile(model, hyperparams):
    ''' Compile the model '''
    optimizer = tf.keras.optimizers.Adam(learning_rate=hyperparams["learning_rate"])
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    metrics = [tf.keras.metrics.BinaryAccuracy(name="accuracy")]

    model.compile(optimizer=optimizer,loss=loss, metrics=metrics)
    return model

def warmup(
    model,
    hyperparams,
    train_data_dir,
    label_column,
    transformed_feature_spec
):
    ''' Warmup the initialized model weights '''

    train_dataset = data.get_dataset(
        train_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    lr_inc = (hyperparams['end_learning_rate'] - hyperparams['start_learning_rate']) / hyperparams['num_epochs']

    def scheduler(epoch, lr):
        if epoch == 0:
            return hyperparams['start_learning_rate']
        return lr + lr_inc


    callbacks = [tf.keras.callbacks.LearningRateScheduler(scheduler)]

    logging.info("Model warmup started...")
    history = model.fit(
            train_dataset,
            epochs=hyperparams["num_epochs"],
            steps_per_epoch=hyperparams["steps"],
            callbacks=callbacks
    )

    logging.info("Model warmup completed.")
    return history


def train(
    model,
    hyperparams,
    train_data_dir,
    val_data_dir,
    label_column,
    transformed_feature_spec,
    log_dir,
    tuning=False
):
    ''' Train the model '''

    train_dataset = data.get_dataset(
        train_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    val_dataset = data.get_dataset(
        val_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor=hyperparams["early_stop"]["monitor"], patience=hyperparams["early_stop"]["patience"], restore_best_weights=True
    )

    callbacks = [early_stop]

    if log_dir:
        tensorboard = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

        # list.append() mutates in place and returns None, so do not reassign
        callbacks.append(tensorboard)

    if tuning:
        # Instantiate the HyperTune reporting object
        hpt = HyperTune()

        # Reporting callback
        class HPTCallback(tf.keras.callbacks.Callback):

            def on_epoch_end(self, epoch, logs=None):
                hpt.report_hyperparameter_tuning_metric(
                    hyperparameter_metric_tag='val_loss',
                    metric_value=logs['val_loss'],
                    global_step=epoch
                )

        callbacks.append(HPTCallback())

    logging.info("Model training started...")
    history = model.fit(
            train_dataset,
            epochs=hyperparams["num_epochs"],
            validation_data=val_dataset,
            callbacks=callbacks
    )

    logging.info("Model training completed.")
    return history

def evaluate(
    model,
    hyperparams,
    test_data_dir,
    label_column,
    transformed_feature_spec
):
    logging.info("Model evaluation started...")
    test_dataset = data.get_dataset(
        test_data_dir,
        transformed_feature_spec,
        label_column,
        hyperparams["batch_size"],
    )

    evaluation_metrics = model.evaluate(test_dataset)
    logging.info("Model evaluation completed.")

    return evaluation_metrics

### Train the model locally

Next, test the training package locally, by training with just a few epochs:

- `num_epochs`: The number of epochs to pass to the training package.
- `compile()`: Compile the model for training.
- `warmup()`: Warmup the initialized model weights.
- `train()`: Train the model.

In [None]:
os.chdir("custom")

import logging

from trainer import train

TENSORBOARD_LOG_DIR = "./logs"

logging.getLogger().setLevel(logging.INFO)

hyperparams = {}
hyperparams["learning_rate"] = 0.01
aip.log_params(hyperparams)

train.compile(model, hyperparams)

warmupparams = {}
warmupparams["start_learning_rate"] = 0.0001
warmupparams["end_learning_rate"] = 0.01
warmupparams["num_epochs"] = 4
warmupparams["batch_size"] = 64
warmupparams["steps"] = 50
aip.log_params(warmupparams)

train.warmup(
    model, warmupparams, train_data_file_pattern, LABEL_COLUMN, transform_feature_spec
)

trainparams = {}
trainparams["num_epochs"] = 5
trainparams["batch_size"] = 64
trainparams["early_stop"] = {"monitor": "val_loss", "patience": 5}
aip.log_params(trainparams)

train.train(
    model,
    trainparams,
    train_data_file_pattern,
    val_data_file_pattern,
    LABEL_COLUMN,
    transform_feature_spec,
    TENSORBOARD_LOG_DIR,
)

os.chdir("..")

### Evaluate the model locally

Next, test the evaluation portion of the training package:


- `evaluate()`: Evaluate the model.

In [None]:
os.chdir("custom")

from trainer import train

evalparams = {}
evalparams["batch_size"] = 64

metrics = {}
metrics["loss"], metrics["acc"] = train.evaluate(
    model, evalparams, test_data_file_pattern, LABEL_COLUMN, transform_feature_spec
)
print("ACC", metrics["acc"], "LOSS", metrics["loss"])
aip.log_metrics(metrics)

os.chdir("..")

### Retrieve model from Vertex AI

Next, create the Python script to retrieve your experimental model from Vertex AI.

In [None]:
%%writefile custom/trainer/model.py

import google.cloud.aiplatform as aip

def get(model_id):
    model = aip.Model(model_id)
    return model

### Create the task script for the Python training package

Next, you create the `task.py` script for driving the training package. Some notable steps include:

- Command-line arguments:
    - `model-id`: The resource ID of the `Model` resource you built during experimenting. This is the untrained model architecture.
    - `dataset-id`: The resource ID of the `Dataset` resource to use for training.
    - `experiment`: The name of the experiment.
    - `run`: The name of the run within this experiment.
    - `tensorboard-log-dir`: The logging directory for Vertex AI TensorBoard.


- `get_data()`:
    - Loads the Dataset resource into memory.
    - Obtains the user metadata from the Dataset resource.
    - From the metadata, obtains the location of the transformed data, the transformation function, and the name of the label column.


- `get_model()`:
    - Loads the Model resource into memory.
    - Obtains location of model artifacts of the model architecture.
    - Loads the model architecture.
    - Compiles the model.


- `warmup_model()`:
    - Warms up the initialized model weights.


- `train_model()`:
    - Trains the model.


- `evaluate_model()`:
    - Evaluates the model.
    - Saves evaluation metrics to Cloud Storage bucket.

In [None]:
%%writefile custom/trainer/task.py
import os
import argparse
import logging
import json

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow.python.client import device_lib

import google.cloud.aiplatform as aip

from trainer import data
from trainer import model as model_
from trainer import train
try:
    # serving.py is added to the package later in the tutorial
    from trainer import serving
except ImportError:
    pass

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')
parser.add_argument('--model-id', dest='model_id',
                    default=None, type=str, help='Vertex Model ID.')
parser.add_argument('--dataset-id', dest='dataset_id',
                    default=None, type=str, help='Vertex Dataset ID.')
parser.add_argument('--lr', dest='lr',
                    default=0.001, type=float,
                    help='Learning rate.')
parser.add_argument('--start_lr', dest='start_lr',
                    default=0.0001, type=float,
                    help='Starting learning rate.')
parser.add_argument('--epochs', dest='epochs',
                    default=20, type=int,
                    help='Number of epochs.')
parser.add_argument('--steps', dest='steps',
                    default=200, type=int,
                    help='Number of steps per epoch.')
parser.add_argument('--batch_size', dest='batch_size',
                    default=16, type=int,
                    help='Batch size.')
parser.add_argument('--distribute', dest='distribute', type=str, default='single',
                    help='distributed training strategy')
parser.add_argument('--tensorboard-log-dir', dest='tensorboard_log_dir',
                    default=os.getenv('AIP_TENSORBOARD_LOG_DIR'), type=str,
                    help='Output file for tensorboard logs')
parser.add_argument('--experiment', dest='experiment',
                    default=None, type=str,
                    help='Name of experiment')
parser.add_argument('--project', dest='project',
                    default=None, type=str,
                    help='Name of project')
parser.add_argument('--run', dest='run',
                    default=None, type=str,
                    help='Name of run in experiment')
parser.add_argument('--evaluate', dest='evaluate',
                    default=False, type=bool,
                    help='Whether to perform evaluation')
parser.add_argument('--serving', dest='serving',
                    default=False, type=bool,
                    help='Whether to attach the serving function')
parser.add_argument('--tuning', dest='tuning',
                    default=False, type=bool,
                    help='Whether to perform hyperparameter tuning')
parser.add_argument('--warmup', dest='warmup',
                    default=False, type=bool,
                    help='Whether to perform warmup weight initialization')
args = parser.parse_args()


logging.getLogger().setLevel(logging.INFO)
logging.info('DEVICES'  + str(device_lib.list_local_devices()))

# Single Machine, single compute device
if args.distribute == 'single':
    if tf.config.list_physical_devices('GPU'):
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
    logging.info("Single device training")
# Single Machine, multiple compute device
elif args.distribute == 'mirrored':
    strategy = tf.distribute.MirroredStrategy()
    logging.info("Mirrored Strategy distributed training")
# Multi Machine, multiple compute device
elif args.distribute == 'multiworker':
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    logging.info("Multi-worker Strategy distributed training")
    logging.info('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
logging.info('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))

# Initialize the run for this experiment
if args.experiment:
    logging.info("Initialize experiment: {}".format(args.experiment))
    aip.init(experiment=args.experiment, project=args.project)
    aip.start_run(args.run)

metadata = {}

def get_data():
    ''' Get the preprocessed training data '''
    global train_data_file_pattern, val_data_file_pattern, test_data_file_pattern
    global label_column, transform_feature_spec, metadata

    dataset = aip.TabularDataset(args.dataset_id)
    METADATA = 'gs://' + dataset.labels['user_metadata'] + "/metadata.jsonl"

    with tf.io.gfile.GFile(METADATA, "r") as f:
        metadata = json.load(f)

    TRANSFORMED_DATA_PREFIX = metadata['transformed_data_prefix']
    label_column = metadata['label_column']

    train_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/train/data-*.gz'
    val_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/val/data-*.gz'
    test_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/test/data-*.gz'

    TRANSFORM_ARTIFACTS_DIR = metadata['transform_artifacts_dir']
    tft_output = tft.TFTransformOutput(TRANSFORM_ARTIFACTS_DIR)
    transform_feature_spec = tft_output.transformed_feature_spec()

def get_model():
    ''' Get the untrained model architecture '''
    global model_artifacts

    vertex_model = model_.get(args.model_id)
    model_artifacts = vertex_model.gca_resource.artifact_uri
    model = tf.keras.models.load_model(model_artifacts)

    # Compile the model
    hyperparams = {}
    hyperparams["learning_rate"] = args.lr
    if args.experiment:
        aip.log_params(hyperparams)

    metadata.update(hyperparams)
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

    train.compile(model, hyperparams)
    return model

def warmup_model(model):
    ''' Warmup the initialized model weights '''
    warmupparams = {}
    warmupparams["num_epochs"] = args.epochs
    warmupparams["batch_size"] = args.batch_size
    warmupparams["steps"] = args.steps
    warmupparams["start_learning_rate"] = args.start_lr
    warmupparams["end_learning_rate"] = args.lr

    train.warmup(model, warmupparams, train_data_file_pattern, label_column, transform_feature_spec)
    return model

def train_model(model):
    ''' Train the model '''
    trainparams = {}
    trainparams["num_epochs"] = args.epochs
    trainparams["batch_size"] = args.batch_size
    trainparams["early_stop"] = {"monitor": "val_loss", "patience": 5}
    if args.experiment:
        aip.log_params(trainparams)

    metadata.update(trainparams)
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

    train.train(model, trainparams, train_data_file_pattern, val_data_file_pattern, label_column, transform_feature_spec, args.tensorboard_log_dir, args.tuning)
    return model

def evaluate_model(model):
    ''' Evaluate the model '''
    evalparams = {}
    evalparams["batch_size"] = args.batch_size
    metrics = train.evaluate(model, evalparams, test_data_file_pattern, label_column, transform_feature_spec)

    metadata.update({'metrics': metrics})
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

get_data()
with strategy.scope():
    model = get_model()

if args.warmup:
    model = warmup_model(model)
else:
    model = train_model(model)

if args.evaluate:
    evaluate_model(model)

if args.serving:
    logging.info('Save serving model to: ' + args.model_dir)
    serving.construct_serving_model(
        model=model,
        serving_model_dir=args.model_dir,
        metadata=metadata
    )
elif args.warmup:
    logging.info('Save warmed up model to: ' + model_artifacts)
    model.save(model_artifacts)
else:
    logging.info('Save trained model to: ' + args.model_dir)
    model.save(args.model_dir)

### Test training package locally

Next, test your completed training package locally with just a few epochs.
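
One caveat when passing boolean flags such as `--evaluate=True`: `task.py` declares them with `type=bool`, and `argparse` applies `bool()` to the raw string, so any non-empty value (including `False`) parses as `True`. To leave a flag disabled, omit it. The sketch below demonstrates the behavior and one possible stricter parser; `str2bool` is a hypothetical helper, not part of the training package.

```python
import argparse


def str2bool(value):
    """Hypothetical helper: parse common true/false spellings strictly."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}")


# Same declaration style as task.py: bool() is applied to the raw string.
loose = argparse.ArgumentParser()
loose.add_argument("--evaluate", dest="evaluate", default=False, type=bool)

# Stricter alternative using the hypothetical helper.
strict = argparse.ArgumentParser()
strict.add_argument("--evaluate", dest="evaluate", default=False, type=str2bool)

loose_false = loose.parse_args(["--evaluate=False"]).evaluate
strict_false = strict.parse_args(["--evaluate=False"]).evaluate
omitted = loose.parse_args([]).evaluate

print(loose_false, strict_false, omitted)
```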

In [None]:
DATASET_ID = dataset.resource_name
MODEL_ID = vertex_custom_model.resource_name
!cd custom; python3 -m trainer.task --model-id={MODEL_ID} --dataset-id={DATASET_ID} --experiment='chicago' --run='test' --project={PROJECT_ID} --epochs=5 --model-dir=/tmp --evaluate=True

### Warmup training

Now that you have tested the training scripts, you perform warmup training on the base model. Warmup training is used to stabilize the weight initialization. By doing so, each subsequent training and tuning of the model architecture will start with the same stabilized weight initialization.
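
To make the idea concrete, the sketch below computes a per-epoch learning-rate ramp from the starting rate to the target rate. It assumes a linear ramp purely for illustration; the schedule actually used by `train.warmup()` may differ, and `warmup_learning_rates` is a hypothetical helper.

```python
def warmup_learning_rates(start_lr, end_lr, num_epochs):
    """Hypothetical helper: linearly ramp the learning rate across epochs."""
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")
    if num_epochs == 1:
        return [end_lr]
    step = (end_lr - start_lr) / (num_epochs - 1)
    return [start_lr + step * epoch for epoch in range(num_epochs)]


# Values mirror the --start_lr, --lr and --epochs flags used in this section.
schedule = warmup_learning_rates(start_lr=0.0001, end_lr=0.01, num_epochs=5)
for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: learning rate {rate:.6f}")
```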

In [None]:
MODEL_DIR = f"{BUCKET_NAME}/base_model"

!cd custom; python3 -m trainer.task --model-id={MODEL_ID} --dataset-id={DATASET_ID} --project={PROJECT_ID} --epochs=5 --steps=300 --batch_size=16 --lr=0.01 --start_lr=0.0001 --model-dir={MODEL_DIR} --warmup=True

## Mirrored Strategy

When training on a single VM, you can train with either a single compute device or multiple compute devices on the same VM. With Vertex AI Distributed Training, you can specify both the number of compute devices for the VM instance and the type of compute device: CPU or GPU.

Vertex AI Distributed Training supports `tf.distribute.MirroredStrategy` for TensorFlow models. To enable training across multiple compute devices on the same VM, take the following additional steps in your Python training script:

1. Set the strategy to `tf.distribute.MirroredStrategy`.
2. Compile the model within the scope of `tf.distribute.MirroredStrategy`. *Note:* This tells `MirroredStrategy` which variables to mirror across your compute devices.
3. Increase the batch size to `num_devices * batch_size`, where `batch_size` is the batch size for each compute device.

During training, the distribution of batches is synchronized across the devices, as are the updates to the model parameters.
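
The synchronization described above can be illustrated without TensorFlow. The following is a conceptual sketch in plain Python, not how `tf.distribute.MirroredStrategy` is implemented: a global batch is split evenly across replicas, each replica computes a gradient on its shard, and the gradients are averaged so that every replica applies the same update. All function names and data values are hypothetical.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Scale the per-device batch size by the number of compute devices."""
    return per_replica_batch_size * num_replicas


def shard(batch, num_replicas):
    """Split a global batch into equal, contiguous per-replica shards."""
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def gradient(weight, examples):
    """Gradient of mean squared error for the one-parameter model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in examples) / len(examples)


num_replicas = 2
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
weight = 0.5

# Each replica holds a mirrored copy of `weight` and sees only its own shard.
replica_grads = [gradient(weight, s) for s in shard(batch, num_replicas)]

# Averaging the per-replica gradients plays the role of the all-reduce step.
synced_grad = sum(replica_grads) / num_replicas

print(global_batch_size(16, num_replicas), synced_grad, gradient(weight, batch))
```

With equal-sized shards, the averaged gradient matches the gradient computed over the whole global batch, which is why every mirrored copy of the weights stays identical after each step.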

### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomPythonPackageTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the custom training job.
- `container_uri`: The training container image.
- `python_package_gcs_uri`: The location of the Python training package as a tarball.
- `python_module_name`: The name of the module to run in the Python package, for example `trainer.task`.
- `model_serving_container_image_uri`: The container image for deploying the model.

*Note:* There is no requirements parameter. You specify any requirements in the `setup.py` script in your Python package.

In [None]:
DISPLAY_NAME = "chicago_" + TIMESTAMP

job = aip.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    python_package_gcs_uri=f"{BUCKET_NAME}/trainer_chicago.tar.gz",
    python_module_name="trainer.task",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    project=PROJECT_ID,
)

In [None]:
! rm -rf custom/logs
! rm -rf custom/trainer/__pycache__

#### Store training script on your Cloud Storage bucket

Next, you package the training folder into a compressed tarball, and then store it in your Cloud Storage bucket.
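
If you prefer to build the archive from Python rather than with shell commands, the standard-library `tarfile` module can produce an equivalent file. This is a sketch under assumptions: `make_tarball` is a hypothetical helper, and the demonstration uses a temporary stand-in directory rather than the real `custom` folder.

```python
import os
import tarfile
import tempfile


def make_tarball(source_dir, output_path):
    """Hypothetical helper: write source_dir into a gzip-compressed tarball."""
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps the top-level folder name inside the archive.
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return output_path


# Demonstrate on a throwaway directory that mimics the package layout.
with tempfile.TemporaryDirectory() as workdir:
    package_dir = os.path.join(workdir, "custom", "trainer")
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "task.py"), "w") as f:
        f.write("# placeholder\n")

    archive = make_tarball(
        os.path.join(workdir, "custom"), os.path.join(workdir, "custom.tar.gz")
    )
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

The resulting archive still has to be copied to your Cloud Storage bucket, for example with `gsutil cp`.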

In [None]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_NAME/trainer_chicago.tar.gz

#### Run the custom Python package training job

Next, start the training job by invoking the `run()` method. The parameters are the same as when running a `CustomTrainingJob`.

*Note:* The parameter `service_account` is set so that the experiment initialization step `aip.init(experiment="...")` has the necessary permission to access the Vertex AI Metadata Store.

In [None]:
MODEL_DIR = BUCKET_NAME + "/testing"

CMDARGS = [
    "--epochs=5",
    "--batch_size=16",
    "--distribute=mirrored",
    "--experiment=chicago",
    "--run=test",
    "--project=" + PROJECT_ID,
    "--model-id=" + MODEL_ID,
    "--dataset-id=" + DATASET_ID,
]

model = job.run(
    model_display_name="chicago_" + TIMESTAMP,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    base_output_dir=MODEL_DIR,
    service_account=SERVICE_ACCOUNT,
    tensorboard=tensorboard_resource_name,
    sync=True,
)

### Delete a custom training job

After a training job is completed, you can delete the training job with the method `delete()`.  Prior to completion, a training job can be canceled with the method `cancel()`.

In [None]:
job.delete()

#### Delete the model

The method `delete()` will delete the model.

In [None]:
model.delete()

## Hyperparameter tuning

Next, you perform hyperparameter tuning with the training package. The training package includes a few additions so that the same package can be used for hyperparameter tuning, local testing, and full cloud training:

- Command-Line:
  - `tuning`: indicates to use the HyperTune service as a callback during training.


- `train()`: If tuning is set, creates and adds a callback to HyperTune service.

### Prepare your machine specification

Now define the machine specification for your custom training job. This tells Vertex what type of machine instance to provision for the training.
  - `machine_type`: The type of GCP instance to provision -- e.g., n1-standard-8.
  - `accelerator_type`: The type, if any, of hardware accelerator. In this tutorial if you previously set the variable `TRAIN_GPU != None`, you are using a GPU; otherwise you will use a CPU.
  - `accelerator_count`: The number of accelerators.

In [None]:
if TRAIN_GPU:
    machine_spec = {
        "machine_type": TRAIN_COMPUTE,
        "accelerator_type": TRAIN_GPU,
        "accelerator_count": TRAIN_NGPU,
    }
else:
    machine_spec = {"machine_type": TRAIN_COMPUTE, "accelerator_count": 0}

### Prepare your disk specification

(optional) Now define the disk specification for your custom training job. This tells Vertex what type and size of disk to provision in each machine instance for the training.

  - `boot_disk_type`: Either SSD or Standard. SSD is faster, and Standard is less expensive. Defaults to SSD.
  - `boot_disk_size_gb`: Size of disk in GB.

In [None]:
DISK_TYPE = "pd-ssd"  # [ pd-ssd, pd-standard]
DISK_SIZE = 200  # GB

disk_spec = {"boot_disk_type": DISK_TYPE, "boot_disk_size_gb": DISK_SIZE}

### Define worker pool specification for hyperparameter tuning job

Next, define the worker pool specification. Because the learning rate and batch size are the hyperparameters being tuned, they are omitted from the command-line arguments here. During each trial, the Vertex AI Hyperparameter Tuning service picks values for both and passes them to the training package as command-line arguments.
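
As an illustration of this hand-off, the sketch below parses one trial's arguments the way `task.py` would. Only a subset of the flags is declared, and the `--lr` and `--batch_size` values are made-up examples of what a single trial might receive.

```python
import argparse

# A subset of the flags declared in task.py.
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", dest="epochs", default=20, type=int)
parser.add_argument("--lr", dest="lr", default=0.001, type=float)
parser.add_argument("--batch_size", dest="batch_size", default=16, type=int)
parser.add_argument("--distribute", dest="distribute", default="single", type=str)

# Fixed arguments taken from the worker pool specification.
fixed_args = ["--epochs=5", "--distribute=mirrored"]

# Example values a single trial might receive (made up for illustration).
trial_args = ["--lr=0.0316", "--batch_size=64"]

args = parser.parse_args(fixed_args + trial_args)
print(args.epochs, args.distribute, args.lr, args.batch_size)
```

Because the flag names must match the keys in `parameter_spec`, a parameter named `lr` has to be declared in the script as `--lr`.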

In [None]:
CMDARGS = [
    "--epochs=5",
    "--distribute=mirrored",
    # "--experiment=chicago",
    # "--run=tune",
    # "--project=" + PROJECT_ID,
    "--model-id=" + MODEL_ID,
    "--dataset-id=" + DATASET_ID,
    "--tuning=True",
]

worker_pool_spec = [
    {
        "replica_count": 1,
        "machine_spec": machine_spec,
        "disk_spec": disk_spec,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [BUCKET_NAME + "/trainer_chicago.tar.gz"],
            "python_module": "trainer.task",
            "args": CMDARGS,
        },
    }
]

## Create a custom job

Use the class `CustomJob` to create a custom job, such as for hyperparameter tuning, with the following parameters:

- `display_name`: A human readable name for the custom job.
- `worker_pool_specs`: The specification for the corresponding VM instances.

In [None]:
job = aip.CustomJob(
    display_name="chicago_" + TIMESTAMP, worker_pool_specs=worker_pool_spec
)

## Create a hyperparameter tuning job

Use the class `HyperparameterTuningJob` to create a hyperparameter tuning job, with the following parameters:

- `display_name`: A human readable name for the hyperparameter tuning job.
- `custom_job`: The worker pool spec from this custom job applies to the CustomJobs created in all the trials.
- `metric_spec`: The metrics to optimize. The dictionary key is the metric_id, which is reported by your training job, and the dictionary value is the optimization goal of the metric (`minimize` or `maximize`).
- `parameter_spec`: The parameters to optimize. The dictionary key is the parameter_id, which is passed into your training job as a command-line keyword argument, and the dictionary value is the parameter specification.
- `search_algorithm`: The search algorithm to use: `grid`, `random`, or `None`. If `None` is specified, the `Vizier` service (Bayesian optimization) is used.
- `max_trial_count`: The maximum number of trials to perform.
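
To see why a learning rate is a natural candidate for a logarithmic scale, the sketch below compares evenly spaced points on a linear and a log scale over the same range. It only illustrates the two scalings; it is not the search algorithm used by the Vizier service, and both helper functions are hypothetical.

```python
import math


def linear_points(low, high, count):
    """Evenly spaced values between low and high."""
    step = (high - low) / (count - 1)
    return [low + step * i for i in range(count)]


def log_points(low, high, count):
    """Values evenly spaced in log10 space between low and high."""
    exponents = linear_points(math.log10(low), math.log10(high), count)
    return [10 ** e for e in exponents]


linear = linear_points(0.001, 0.1, 5)
logarithmic = log_points(0.001, 0.1, 5)

# On a linear scale most points land in the top decade; on a log scale
# each decade (0.001 to 0.01 and 0.01 to 0.1) gets equal coverage.
print([round(v, 4) for v in linear])
print([round(v, 4) for v in logarithmic])
```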

In [None]:
from google.cloud.aiplatform import hyperparameter_tuning as hpt

hpt_job = aip.HyperparameterTuningJob(
    display_name="chicago_" + TIMESTAMP,
    custom_job=job,
    metric_spec={
        "val_loss": "minimize",
    },
    parameter_spec={
        "lr": hpt.DoubleParameterSpec(min=0.001, max=0.1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec([16, 32, 64, 128, 256], scale="linear"),
    },
    search_algorithm=None,
    max_trial_count=8,
    parallel_trial_count=1,
)

## Run the hyperparameter tuning job

Use the `run()` method to execute the hyperparameter tuning job.

In [None]:
hpt_job.run()

### Best trial

Now look at which trial was the best:

In [None]:
# The objective (val_loss) is minimized, so the best trial has the lowest value.
best = (None, None, None, float("inf"))
for trial in hpt_job.trials:
    # Look up parameters by ID rather than by position
    params = {param.parameter_id: param.value for param in trial.parameters}
    val_loss = float(trial.final_measurement.metrics[0].value)

    # Keep track of the best outcome: (trial ID, batch size, learning rate, val_loss)
    if val_loss < best[3]:
        best = (
            trial.id,
            float(params["batch_size"]),
            float(params["lr"]),
            val_loss,
        )

print(best)

### Delete the hyperparameter tuning job

The method `delete()` will delete the hyperparameter tuning job.

In [None]:
hpt_job.delete()

### Save the best hyperparameter values

In [None]:
LR = best[2]
BATCH_SIZE = int(best[1])

### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomPythonPackageTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the custom training job.
- `container_uri`: The training container image.
- `python_package_gcs_uri`: The location of the Python training package as a tarball.
- `python_module_name`: The name of the module to run in the Python package, for example `trainer.task`.
- `model_serving_container_image_uri`: The container image for deploying the model.

*Note:* There is no requirements parameter. You specify any requirements in the `setup.py` script in your Python package.

In [None]:
DISPLAY_NAME = "chicago_" + TIMESTAMP

job = aip.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    python_package_gcs_uri=f"{BUCKET_NAME}/trainer_chicago.tar.gz",
    python_module_name="trainer.task",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    project=PROJECT_ID,
)

#### Run the custom Python package training job

Next, start the training job by invoking the `run()` method. The parameters are the same as when running a `CustomTrainingJob`.

*Note:* The parameter `service_account` is set so that the experiment initialization step `aip.init(experiment="...")` has the necessary permission to access the Vertex AI Metadata Store.

In [None]:
MODEL_DIR = BUCKET_NAME + "/trained"
FULL_EPOCHS = 100

CMDARGS = [
    f"--epochs={FULL_EPOCHS}",
    f"--lr={LR}",
    f"--batch_size={BATCH_SIZE}",
    "--distribute=mirrored",
    "--experiment=chicago",
    "--run=full",
    "--project=" + PROJECT_ID,
    "--model-id=" + MODEL_ID,
    "--dataset-id=" + DATASET_ID,
    "--evaluate=True",
]

model = job.run(
    model_display_name="chicago_" + TIMESTAMP,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    base_output_dir=MODEL_DIR,
    service_account=SERVICE_ACCOUNT,
    tensorboard=tensorboard_resource_name,
    sync=True,
)

### Delete a custom training job

After a training job is completed, you can delete the training job with the method `delete()`.  Prior to completion, a training job can be canceled with the method `cancel()`.

In [None]:
job.delete()

### Get the experiment results

Next, you use the experiment name as a parameter to the method `get_experiment_df()` to get the results of the experiment as a pandas dataframe.

In [None]:
EXPERIMENT_NAME = "chicago"

experiment_df = aip.get_experiment_df()
experiment_df = experiment_df[experiment_df.experiment_name == EXPERIMENT_NAME]
experiment_df.T

## Review the custom model evaluation results

Next, you review the evaluation metrics built into the training package.

In [None]:
METRICS = MODEL_DIR + "/model/metrics.txt"
! gsutil cat $METRICS

### Delete the TensorBoard instance

Next, delete the TensorBoard instance.

In [None]:
tensorboard.delete()

In [None]:
vertex_custom_model = model
model = tf.keras.models.load_model(MODEL_DIR + "/model")

## Add a serving function

Next, you add a serving function to your model for online and batch prediction. This allows prediction requests to be sent in raw (unpreprocessed) format, either as a serialized tf.Example or as a JSONL object. The serving function then preprocesses the prediction request into the transformed format expected by the model.

In [None]:
%%writefile custom/trainer/serving.py

import tensorflow as tf
import tensorflow_data_validation as tfdv
import tensorflow_transform as tft
import logging

def _get_serve_features_fn(model, tft_output):
    """Returns a function that accept a dictionary of features and applies TFT."""

    model.tft_layer = tft_output.transform_features_layer()

    @tf.function
    def serve_features_fn(raw_features):
        """Returns the output to be used in the serving signature."""

        transformed_features = model.tft_layer(raw_features)
        probabilities = model(transformed_features)
        return {"scores": probabilities}


    return serve_features_fn

def _get_serve_tf_examples_fn(model, tft_output, feature_spec):
    """Returns a function that parses a serialized tf.Example and applies TFT."""

    model.tft_layer = tft_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        """Returns the output to be used in the serving signature."""
        for key in list(feature_spec.keys()):
            if key not in features:
                feature_spec.pop(key)

        parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)

        transformed_features = model.tft_layer(parsed_features)
        probabilities = model(transformed_features)
        return {"scores": probabilities}

    return serve_tf_examples_fn

def construct_serving_model(
    model, serving_model_dir, metadata
):
    global features

    schema_location = metadata['schema']
    features = metadata['numeric_features'] + metadata['categorical_features'] + metadata['embedding_features']
    print("FEATURES", features)
    tft_output_dir = metadata["transform_artifacts_dir"]

    schema = tfdv.load_schema_text(schema_location)
    feature_spec = tft.tf_metadata.schema_utils.schema_as_feature_spec(schema).feature_spec

    tft_output = tft.TFTransformOutput(tft_output_dir)

    # Drop features that were not used in training
    features_input_signature = {
        feature_name: tf.TensorSpec(
            shape=(None, 1), dtype=spec.dtype, name=feature_name
        )
        for feature_name, spec in feature_spec.items()
        if feature_name in features
    }

    signatures = {
        "serving_default": _get_serve_features_fn(
            model, tft_output
        ).get_concrete_function(features_input_signature),
        "serving_tf_example": _get_serve_tf_examples_fn(
            model, tft_output, feature_spec
        ).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")
        ),
    }

    logging.info("Model saving started...")
    model.save(serving_model_dir, signatures=signatures)
    logging.info("Model saving completed.")

### Construct the serving model

Now construct the serving model and store the serving model to your Cloud Storage bucket.

In [None]:
os.chdir("custom")

from trainer import serving

SERVING_MODEL_DIR = BUCKET_NAME + "/serving_model"

serving.construct_serving_model(
    model=model, serving_model_dir=SERVING_MODEL_DIR, metadata=metadata
)

serving_model = tf.keras.models.load_model(SERVING_MODEL_DIR)

os.chdir("..")

### Test the serving model locally with tf.Example data

Next, test the layer interface in the serving model for tf.Example data.

In [None]:
EXPORTED_TFREC_PREFIX = metadata["exported_tfrec_prefix"]
file_names = tf.data.TFRecordDataset.list_files(
    EXPORTED_TFREC_PREFIX + "/data-*.tfrecord"
)
for batch in tf.data.TFRecordDataset(file_names).batch(3).take(1):
    predictions = serving_model.signatures["serving_tf_example"](batch)
    for key in predictions:
        print(f"{key}: {predictions[key]}")

### Test the serving model locally with JSONL data

Next, test the layer interface in the serving model for JSONL data.

In [None]:
schema = tfdv.load_schema_text(metadata["schema"])
feature_spec = tft.tf_metadata.schema_utils.schema_as_feature_spec(schema).feature_spec

instance = {
    "dropoff_grid": "POINT(-87.6 41.9)",
    "euclidean": 2064.2696,
    "loc_cross": "",
    "payment_type": "Credit Card",
    "pickup_grid": "POINT(-87.6 41.9)",
    "trip_miles": 1.37,
    "trip_day": 12,
    "trip_hour": 6,
    "trip_month": 2,
    "trip_day_of_week": 4,
    "trip_seconds": 555,
}

for feature_name in instance:
    dtype = feature_spec[feature_name].dtype
    instance[feature_name] = tf.constant([[instance[feature_name]]], dtype)

predictions = serving_model.signatures["serving_default"](**instance)
for key in predictions:
    print(f"{key}: {predictions[key].numpy()}")

### Upload the serving model to a Vertex AI Model resource

Next, you upload your serving custom model artifacts to Vertex AI to convert into a managed Vertex AI Model resource.

In [None]:
vertex_serving_model = aip.Model.upload(
    display_name="chicago_" + TIMESTAMP,
    artifact_uri=SERVING_MODEL_DIR,
    serving_container_image_uri=DEPLOY_IMAGE,
    labels={"user_metadata": BUCKET_NAME[5:]},
    sync=True,
)

### Evaluate the serving model

Next, evaluate the serving model with the evaluation (test) slices. For an apples-to-apples comparison, you use the same evaluation slices for both the custom model and the AutoML model. Since your evaluation slices and metrics may be custom, we recommend that you:

- Send each evaluation slice as a Vertex AI Batch Prediction Job.
- Use a custom evaluation script to evaluate the results from the batch prediction job.

In [None]:
SERVING_OUTPUT_DATA_DIR = BUCKET_NAME + "/batch_eval"
EXPORTED_JSONL_PREFIX = metadata["exported_jsonl_prefix"]

MIN_NODES = 1
MAX_NODES = 1

job = vertex_serving_model.batch_predict(
    instances_format="jsonl",
    predictions_format="jsonl",
    job_display_name="chicago_" + TIMESTAMP,
    gcs_source=EXPORTED_JSONL_PREFIX + "*.jsonl",
    gcs_destination_prefix=SERVING_OUTPUT_DATA_DIR,
    model_parameters=None,
    machine_type=DEPLOY_COMPUTE,
    accelerator_type=DEPLOY_GPU,
    accelerator_count=DEPLOY_NGPU,
    starting_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
    sync=True,
)

### Perform custom evaluation metrics

After the batch job has completed, you input the results and target labels to your custom evaluation script. For demonstration purposes, we just display the results of the batch prediction.
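
A custom evaluation script might look roughly like the sketch below, which computes accuracy from batch prediction output and a list of target labels. The record layout is an assumption: each JSONL line is taken to hold an `instance` and a `prediction` containing a `scores` list (matching the output key of the serving function), and the 0.5 threshold and sample values are made up. Inspect your own result files before relying on this layout.

```python
import json


def accuracy_from_batch_results(result_lines, labels, threshold=0.5):
    """Compare thresholded scores against target labels (assumed layout)."""
    correct = 0
    for line, label in zip(result_lines, labels):
        record = json.loads(line)
        # Assumption: the prediction holds the "scores" output of the serving
        # function as a one-element list.
        score = record["prediction"]["scores"][0]
        predicted = 1 if score >= threshold else 0
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up records standing in for lines read from the results file.
sample_results = [
    json.dumps({"instance": {"trip_miles": 1.37}, "prediction": {"scores": [0.91]}}),
    json.dumps({"instance": {"trip_miles": 0.40}, "prediction": {"scores": [0.12]}}),
    json.dumps({"instance": {"trip_miles": 5.10}, "prediction": {"scores": [0.35]}}),
]
sample_labels = [1, 0, 1]

accuracy = accuracy_from_batch_results(sample_results, sample_labels)
print(f"accuracy: {accuracy:.3f}")
```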

In [None]:
batch_dir = ! gsutil ls $SERVING_OUTPUT_DATA_DIR
batch_dir = batch_dir[0]
outputs = ! gsutil ls $batch_dir
errors = outputs[0]
results = outputs[1]
print("errors")
! gsutil cat $errors
print("results")
! gsutil cat $results | head -n10

In [None]:
model = async_model

### Wait for completion of AutoML training job

Next, wait for the AutoML training job to complete. Alternatively, one can set the parameter `sync` to `True` in the `run()` method to block until the AutoML training job is completed.

In [None]:
model.wait()

## Review model evaluation scores
After your model has finished training, you can review the evaluation scores for it.

First, you need to get a reference to the new model. As with datasets, you can either use the reference to the model variable you created when you trained the model or you can list all of the models in your project.

In [None]:
# Get model resource ID
models = aip.Model.list(filter="display_name=chicago_" + TIMESTAMP)

# Get a reference to the Model Service client
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
model_service_client = aip.gapic.ModelServiceClient(client_options=client_options)

model_evaluations = model_service_client.list_model_evaluations(
    parent=models[0].resource_name
)
model_evaluation = list(model_evaluations)[0]
print(model_evaluation)

## Compare metric results with AutoML baseline

Finally, you decide whether the current experiment produces a custom model that is better than the AutoML baseline, as follows:

- Compare the evaluation results for each evaluation slice between the custom model and the AutoML model.
- Weight the results according to your business purposes.
- Add up the results and determine whether the custom model is better.
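
The comparison steps above can be sketched as a weighted sum over slices. This is a minimal illustration, not a prescribed method: the function names, slice names, weights, and metric values below are all hypothetical placeholders, and it assumes a single "higher is better" metric per slice.

```python
def weighted_score(slice_metrics, slice_weights):
    """Return the weighted average of per-slice metric values."""
    total_weight = sum(slice_weights.values())
    if total_weight <= 0:
        raise ValueError("slice weights must sum to a positive value")
    weighted_sum = sum(
        slice_metrics[name] * weight for name, weight in slice_weights.items()
    )
    return weighted_sum / total_weight


def custom_beats_baseline(custom_metrics, automl_metrics, slice_weights, margin=0.0):
    """Return True if the custom model's weighted score exceeds the baseline's by more than margin."""
    custom = weighted_score(custom_metrics, slice_weights)
    baseline = weighted_score(automl_metrics, slice_weights)
    return custom > baseline + margin


# Placeholder values for illustration only; substitute your own slice results.
slice_weights = {"all": 0.5, "short_trips": 0.3, "airport": 0.2}
custom_metrics = {"all": 0.86, "short_trips": 0.80, "airport": 0.92}
automl_metrics = {"all": 0.85, "short_trips": 0.83, "airport": 0.88}

print("custom:", weighted_score(custom_metrics, slice_weights))
print("automl:", weighted_score(automl_metrics, slice_weights))
print("custom is better:", custom_beats_baseline(custom_metrics, automl_metrics, slice_weights))
```

The optional `margin` lets you require that the custom model win by a minimum amount before it replaces the baseline.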

### Store evaluation results for custom model

Next, you store the custom metrics information as user metadata in a `metadata.jsonl` file in your Cloud Storage bucket.

In [None]:
import json

# Start a new metadata record for this experiment
metadata = {}
metadata["train_eval_metrics"] = METRICS
metadata["custom_eval_metrics"] = "[you-fill-this-in]"

# BUCKET_NAME already includes the gs:// prefix
with tf.io.gfile.GFile(BUCKET_NAME + "/metadata.jsonl", "w") as f:
    json.dump(metadata, f)

!gsutil cat $BUCKET_NAME/metadata.jsonl

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Dataset
- Pipeline
- Model
- Endpoint
- AutoML Training Job
- Batch Job
- Custom Job
- Hyperparameter Tuning Job
- Cloud Storage Bucket

In [None]:
delete_all = False

if delete_all:
    # Delete the dataset using the Vertex dataset object
    try:
        if "dataset" in globals():
            dataset.delete()
    except Exception as e:
        print(e)

    # Delete the model using the Vertex model object
    try:
        if "model" in globals():
            model.delete()
    except Exception as e:
        print(e)

    # Delete the Cloud Storage bucket and all of its contents
    if "BUCKET_NAME" in globals():
        ! gsutil rm -r $BUCKET_NAME