In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# E2E ML on GCP: MLOps stage 3: formalization

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/mlops_formalization.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/ai/platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/mlops_formalization.ipynb">
      Open in Google Cloud Notebooks
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI for end-to-end (E2E) MLOps on Google Cloud in production. It covers stage 3: formalization.

### Dataset

The dataset used for this tutorial is the [Chicago Taxi](https://www.kaggle.com/chicago/chicago-taxi-trips-bq) dataset. The version of the dataset you will use in this tutorial is stored in a public BigQuery table. The trained model predicts whether a rider leaves a tip of at least 20% of the fare.

### Objective

In this tutorial, you create an MLOps stage 3: formalization process.

This tutorial uses the following Google Cloud services and resources:

- `Vertex AI Pipelines`
- `Vertex AI Training`
- `Google Cloud Pipeline Components`
- `Vertex AI Dataset` and `Vertex AI Model` resources
- `Dataflow`

The steps performed include:

- Obtain resources from the experimentation stage.
    - Baseline model.
    - Dataset schema/statistics for baseline model.
- Formalize a data preprocessing pipeline.
    - Extract columns/rows from BigQuery table to local BigQuery table.
    - Use the TensorFlow Data Validation library to determine statistics, schema, and features.
    - Use Dataflow to preprocess the data.
    - Create a Vertex AI Dataset.
- Formalize a build model architecture pipeline.
    - Create the Vertex AI Model base model.
- Formalize a training pipeline.

### Recommendations

When doing E2E MLOps on Google Cloud for formalization, the following best practices with structured (tabular) data are recommended:

- Use pipeline components to automate the training of the model.
   - Use pre-built Vertex AI components where available.


- Decompose the pipeline into the following sub-pipelines:

   - Data pipeline
   - Model architecture pipeline
   - Training pipeline

- The data pipeline should perform the following tasks:
    - Do statistical analysis on the dataset using the TensorFlow Data Validation library.
    - Split the dataset examples into training, validation and test datasets using `Dataflow` components.
    - Preprocess and transform the split datasets into machine learning ready format, i.e., `TFRecord`, using `Dataflow` components.
    - Preprocess copies of test dataset for testing serving model using `Dataflow` components.
    - Create a `Vertex AI Dataset resource` using `Vertex AI Dataset` components.
       - Store as metadata the statistics, schema, transformation function, transformed data features, and location of transformed datasets.


- The model architecture pipeline should perform the following tasks:
    - Retrieve the metadata for the corresponding `Vertex AI Dataset`.
    - Construct the input layer to the model architecture according to the transformed feature information in the metadata.
    - Construct the body of the model architecture.
    - Upload the model architecture's model artifacts as a `Vertex AI Model` using `Vertex AI Model` components.
    - Label the `Vertex AI Model` resource as the `base model architecture`.


- The training pipeline should perform the following tasks:
    - Start a `Vertex AI Experiment` for this training run, and log corresponding tasks, parameters and results.
    - Load the corresponding `Vertex AI Dataset` resource.
        - Retrieve the metadata for the locations of the training, test and validation datasets.
    - Load the corresponding `Vertex AI Model` resource.
        - From the metadata, retrieve the location of the model artifacts for the model architectures.
        - From the metadata, retrieve the hyperparameters for the current model baseline.
        - Load and compile the model artifacts.
    - Train the model.
        - Train the model with corresponding hyperparameters.
        - Track the training with a `Vertex AI Tensorboard` instance.
        - Store the trained model artifacts on Cloud Storage.
    - Evaluate the model.
        - Evaluate the model using the test dataset.
        - Store the model's evaluation metrics as metadata.
    - Upload the model artifacts to a `Vertex AI Model` resource.
        - Add a serving function to the model artifacts.
        - Upload the model artifacts + serving function.
        - Add label for metadata, including location of training parameters, evaluation metrics and tagging the model instance as a candidate model.
    - Create a `Vertex AI Endpoint` resource.
    - Deploy the trained `Vertex AI Model` resource to the `Vertex AI Endpoint` resource.

## Installations

Install the packages required for executing the MLOps notebooks. You only need to do this *one time*: set `ONCE_ONLY` to `True` for the first run.

In [None]:
ONCE_ONLY = False
if ONCE_ONLY:
    ! pip3 install -U tensorflow==2.5 $USER_FLAG
    ! pip3 install -U tensorflow-data-validation==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-transform==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-io==0.18 $USER_FLAG
    ! pip3 install --upgrade google-cloud-aiplatform[tensorboard] $USER_FLAG
    ! pip3 install --upgrade google-cloud-pipeline-components $USER_FLAG
    ! pip3 install --upgrade google-cloud-bigquery $USER_FLAG
    ! pip3 install --upgrade google-cloud-logging $USER_FLAG
    ! pip3 install --upgrade apache-beam[gcp] $USER_FLAG
    ! pip3 install --upgrade pyarrow $USER_FLAG
    ! pip3 install --upgrade cloudml-hypertune $USER_FLAG
    ! pip3 install --upgrade kfp $USER_FLAG
    ! pip3 install --upgrade torchvision $USER_FLAG
    ! pip3 install --upgrade rpy2 $USER_FLAG
    ! pip3 install --upgrade python-tabulate $USER_FLAG
    ! pip3 install -U opencv-python-headless==4.5.2.52 $USER_FLAG

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
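
As a minimal sketch of how the timestamp is meant to be used, the hypothetical helper `unique_name` below (not part of the notebook) appends a second-resolution timestamp to a base name, so that two users who create the same resource at different times get different names.

```python
from datetime import datetime
from typing import Optional


def unique_name(base: str, now: Optional[datetime] = None) -> str:
    """Append a second-resolution timestamp to a resource name."""
    now = now or datetime.now()
    return f"{base}_{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime makes the result reproducible for illustration.
example = unique_name("chicago", datetime(2021, 7, 4, 12, 30, 5))
print(example)
```

Two sessions started within the same second would still collide, which is usually acceptable for a tutorial setting.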

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

#### Service Account

**If you don't know your service account**, try to get it with the `gcloud` command by executing the second cell below.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    shell_output = !gcloud auth list 2>/dev/null
    SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()
    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step -- you only need to run these once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_NAME

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_NAME

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
import google.cloud.aiplatform as aip

#### Import TensorFlow

Import the TensorFlow package into your Python environment.

In [None]:
import tensorflow as tf

In [None]:
from typing import NamedTuple

from kfp import dsl
from kfp.v2 import compiler
from kfp.v2.dsl import component

In [None]:
from google_cloud_pipeline_components.v1.dataflow import DataflowPythonJobOp
from google_cloud_pipeline_components.v1.wait_gcp_resources import \
    WaitGcpResourcesOp

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aip.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

#### Set hardware accelerators

You can set hardware accelerators for training and prediction.

Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:

    (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)


Otherwise specify `(None, None)` to use a container image to run on a CPU.

Learn more about [hardware accelerator support for your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

*Note*: TF releases before 2.3 for GPU support will fail to load the custom model in this tutorial. It is a known issue and fixed in TF 2.3. This is caused by static graph ops that are generated in the serving function. If you encounter this issue on your own custom models, use a container image for TF 2.3 with GPU support.

In [None]:
if os.getenv("IS_TESTING_TRAIN_GPU"):
    TRAIN_GPU, TRAIN_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_TRAIN_GPU")),
    )
else:
    TRAIN_GPU, TRAIN_NGPU = (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 1)

if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (None, None)

#### Set pre-built containers

Set the pre-built Docker container image for training and prediction.


For the latest list, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers).


For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).

In [None]:
if os.getenv("IS_TESTING_TF"):
    TF = os.getenv("IS_TESTING_TF")
else:
    TF = "2.5".replace(".", "-")

if TF[0] == "2":
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf2-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf2-cpu.{}".format(TF)
else:
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf-cpu.{}".format(TF)

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)
DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)

print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)
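
The branching in the cell above follows a simple naming pattern, which can be condensed into one function. The sketch below is an illustration only: `prebuilt_image` is a hypothetical helper, and the URI pattern is copied from the cell above rather than from an authoritative list of images, so check the pre-built container pages for the images that actually exist.

```python
def prebuilt_image(kind: str, tf_version: str, use_gpu: bool, region: str) -> str:
    """Build the URI of a pre-built TensorFlow container image.

    kind is "training" or "prediction"; tf_version is e.g. "2.5".
    """
    if kind not in ("training", "prediction"):
        raise ValueError(f"unknown image kind: {kind}")

    device = "gpu" if use_gpu else "cpu"
    version = tf_version.replace(".", "-")
    # In the cell above, only TF 2.x prediction images use the "tf2" prefix.
    prefix = "tf2" if kind == "prediction" and tf_version.startswith("2") else "tf"
    # The registry host uses the multi-region part of the region, e.g. "us".
    multi_region = region.split("-")[0]
    return (
        f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/"
        f"{prefix}-{device}.{version}:latest"
    )


print(prebuilt_image("training", "2.5", True, "us-central1"))
print(prebuilt_image("prediction", "2.5", False, "us-central1"))
```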

#### Set machine type

Next, set the machine type to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure the compute resources for the VMs you will use for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75GB of memory per vCPU.
     - `n1-highmem`: 6.5GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.
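
The rules above can be captured in a small validation helper. This is a sketch under the assumption that the constraints listed in this section are complete; `machine_type` and the two constants are hypothetical names, not part of the Vertex AI SDK.

```python
SUPPORTED_VCPUS = (2, 4, 8, 16, 32, 64, 96)

# (family, vCPUs) combinations this section lists as unsupported for training.
UNSUPPORTED_FOR_TRAINING = {
    ("standard", 2),
    ("highcpu", 2),
    ("highcpu", 4),
    ("highcpu", 8),
}


def machine_type(family: str, vcpus: int, for_training: bool = True) -> str:
    """Return an n1 machine type string such as "n1-standard-4"."""
    if family not in ("standard", "highmem", "highcpu"):
        raise ValueError(f"unknown n1 family: {family}")
    if vcpus not in SUPPORTED_VCPUS:
        raise ValueError(f"unsupported vCPU count: {vcpus}")
    if for_training and (family, vcpus) in UNSUPPORTED_FOR_TRAINING:
        raise ValueError(f"n1-{family}-{vcpus} is not supported for training")
    return f"n1-{family}-{vcpus}"


print(machine_type("standard", 4))
```

Failing early on an invalid combination is cheaper than waiting for a training job to be rejected.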

In [None]:
if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

if os.getenv("IS_TESTING_DEPLOY_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_DEPLOY_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)

#### Location of BigQuery training data

Now set the variable `IMPORT_FILE` to the location of the data table in BigQuery.

In [None]:
IMPORT_FILE = "bq://bigquery-public-data.chicago_taxi_trips.taxi_trips"
BQ_TABLE = "bigquery-public-data.chicago_taxi_trips.taxi_trips"

### Retrieve the dataset from stage 1

Next, retrieve the dataset you created during stage 1 with the helper function `find_dataset()`. This helper function finds all the datasets whose display name matches the specified prefix and import format (e.g., bq). Finally, it compares the matches by create time and returns the most recent one.

In [None]:
def find_dataset(display_name_prefix, import_format):
    matches = []
    datasets = aip.TabularDataset.list()
    for dataset in datasets:
        if dataset.display_name.startswith(display_name_prefix):
            try:
                if (
                    "bq" == import_format
                    and dataset.to_dict()["metadata"]["inputConfig"]["bigquerySource"]
                ):
                    matches.append(dataset)
                if (
                    "csv" == import_format
                    and dataset.to_dict()["metadata"]["inputConfig"]["gcsSource"]
                ):
                    matches.append(dataset)
            except:
                pass

    # Return the most recently created match, or None if nothing matched.
    latest = None
    for match in matches:
        if latest is None or match.create_time > latest.create_time:
            latest = match

    return latest


dataset = find_dataset("Chicago Taxi", "bq")

print(dataset)
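
The "latest match" selection can also be written with `max()`. The sketch below uses a stand-in `Resource` dataclass instead of real Vertex AI objects, so it runs without any cloud access; the only assumption is that each resource exposes `display_name` and `create_time`, as the helper above relies on.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Resource:
    """Stand-in for a Vertex AI resource; only the fields used here."""

    display_name: str
    create_time: datetime


def latest_match(resources: Iterable[Resource], prefix: str) -> Optional[Resource]:
    """Return the newest resource whose display name starts with prefix."""
    matches = [r for r in resources if r.display_name.startswith(prefix)]
    # default=None avoids a ValueError when nothing matches.
    return max(matches, key=lambda r: r.create_time, default=None)


resources = [
    Resource("Chicago Taxi v1", datetime(2021, 6, 1)),
    Resource("Chicago Taxi v2", datetime(2021, 7, 1)),
    Resource("Flowers", datetime(2021, 8, 1)),
]
newest = latest_match(resources, "Chicago Taxi")
print(newest)
```

Returning `None` for an empty result lets the caller decide how to handle a missing stage 1 dataset.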

### Load dataset's user metadata

Load the user metadata for the dataset.

In [None]:
import json

try:
    with tf.io.gfile.GFile(
        "gs://" + dataset.labels["user_metadata"] + "/metadata.jsonl", "r"
    ) as f:
        metadata = json.load(f)

    print(metadata)
except:
    print("no metadata")

### Retrieve the model architecture and baseline model from stage 2

Next, retrieve the model architecture and baseline trained model you created during stage 2 with the helper function `find_model()`. This helper function finds all the models whose display name matches the specified prefix and that carry the specified label. Finally, it compares the matches by create time and returns the most recent one.

In [None]:
def find_model(display_name_prefix, label=None):
    matches = []
    models = aip.Model.list()
    for model in models:
        if model.display_name.startswith(display_name_prefix):
            try:
                if label in model.to_dict()["labels"].keys():
                    matches.append(model)
            except:
                pass

    model = None
    create_time = None
    for match in matches:
        if create_time is None or match.create_time > create_time:
            create_time = match.create_time
            model = match

    return model


base_model = find_model("chicago", "base_model")
baseline_model = find_model("chicago", "user_metadata")

print(base_model)
print(baseline_model)

### Load baseline model's user metadata

Load the user metadata for the baseline model.

In [None]:
import json

try:
    with tf.io.gfile.GFile(
        "gs://" + baseline_model.labels["user_metadata"] + "/metadata.jsonl", "r"
    ) as f:
        baseline_metadata = json.load(f)
        print(baseline_metadata)

        with tf.io.gfile.GFile(baseline_metadata["train_eval_metrics"], "r") as f:
            baseline_metrics = json.load(f)
            print(baseline_metrics)
except:
    print("no metadata")

## Formalizing pipelines introduction

A primary reason for formalizing the training and deployment of a model into a pipeline is that things change over time, and you will want to rebuild or retrain your model. A pipeline lets you integrate these tasks as an automated process within a CI/CD process.

While one generally represents a formalized pipeline as a single e2e pipeline, in practice you decompose the e2e pipeline into the following sub-pipelines:

- data pipeline
    - data analysis
    - data preprocessing
- model pipeline
    - model architecture construction
    - base model storage
- training pipeline
    - model training
    - model evaluation
- deployment pipeline
    - model candidate
    - pre-production deployment evaluations
    - deployment to production

## Formalizing data pipeline introduction

The data pipeline consists of data analysis and data preprocessing tasks.

### Data analysis task

This task performs an analysis of the dataset to determine its statistical distribution. This distribution is then used to build a dataset schema. The schema is then used by the data preprocessing task. Additionally, for tabular data, the default feature type of each feature (categorical or numeric) is determined.

### Data preprocessing task

This task performs a conversion of the raw dataset data into one or more machine learning ready formats. The dataset schema is used to determine how to preprocess the data. Other tasks include: splitting the dataset into training, test and validation, and encoding and storing the preprocessed data to disk.

### Triggers

Within the CI/CD process, the data pipeline is triggered for reasons such as the following (the list is not exhaustive):

- New data added to the dataset.
- Addition or subtraction of features.
- Code changes to the preprocessing tasks.
- Code changes to the feature engineering tasks.
- Input layer changes that invalidate the stored preprocessed data.
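
One way to picture these triggers inside a CI/CD process is a check that maps change events to a rerun decision. The sketch below is purely illustrative: the event names and the helper `should_run_data_pipeline` are hypothetical and do not correspond to any Vertex AI or CI/CD API.

```python
# Hypothetical event names, one per trigger listed above.
DATA_PIPELINE_TRIGGERS = frozenset(
    {
        "new_data",
        "feature_change",
        "preprocessing_code_change",
        "feature_engineering_code_change",
        "input_layer_change",
    }
)


def should_run_data_pipeline(events) -> bool:
    """Return True if any observed event requires rerunning the data pipeline."""
    return any(event in DATA_PIPELINE_TRIGGERS for event in events)


print(should_run_data_pipeline(["docs_change", "new_data"]))
```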

### Create component for creating a local BigQuery dataset

Next, you create a component that makes a local copy (i.e., in your project) of the BigQuery Chicago Taxi dataset. The component:

- Selects the features to include.
- Selects the criteria for including rows.
- Performs feature engineering.

This component returns as an artifact the BigQuery path to the local dataset copy.

In [None]:
@component(packages_to_install=["google-cloud-bigquery"])
def make_chicago_bq_dataset(bq_table: str, year: int, limit: int, project: str) -> str:
    from google.cloud import bigquery

    bqclient = bigquery.Client(project=project)

    BQ_DATASET = bq_table.split(".")[1]
    BQ_TABLE_COPY = f"{project}.{BQ_DATASET}.taxi_trips"

    if bq_table.startswith("bq://"):
        bq_table = bq_table[5:]

    query = f"""
    CREATE OR REPLACE TABLE `{BQ_TABLE_COPY}`
    AS (
        WITH
          taxitrips AS (
          SELECT
            trip_start_timestamp,
            trip_seconds,
            trip_miles,
            payment_type,
            pickup_longitude,
            pickup_latitude,
            dropoff_longitude,
            dropoff_latitude,
            tips,
            fare
          FROM
            `{bq_table}`
          WHERE pickup_longitude IS NOT NULL
          AND pickup_latitude IS NOT NULL
          AND dropoff_longitude IS NOT NULL
          AND dropoff_latitude IS NOT NULL
          AND trip_miles > 0
          AND trip_seconds > 0
          AND fare > 0
          AND EXTRACT(YEAR FROM trip_start_timestamp) = {year}
        )

        SELECT
          EXTRACT(MONTH from trip_start_timestamp) as trip_month,
          EXTRACT(DAY from trip_start_timestamp) as trip_day,
          EXTRACT(DAYOFWEEK from trip_start_timestamp) as trip_day_of_week,
          EXTRACT(HOUR from trip_start_timestamp) as trip_hour,
          CAST(trip_seconds AS FLOAT64) as trip_seconds,
          trip_miles,
          payment_type,
          ST_AsText(
              ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.1)
          ) AS pickup_grid,
          ST_AsText(
              ST_SnapToGrid(ST_GeogPoint(dropoff_longitude, dropoff_latitude), 0.1)
          ) AS dropoff_grid,
          ST_Distance(
              ST_GeogPoint(pickup_longitude, pickup_latitude),
              ST_GeogPoint(dropoff_longitude, dropoff_latitude)
          ) AS euclidean,
          CONCAT(
              ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
                  pickup_latitude), 0.1)),
              ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
                  dropoff_latitude), 0.1))
          ) AS loc_cross,
          IF((tips/fare >= 0.2), 1, 0) AS tip_bin,
        FROM
          taxitrips
        LIMIT {limit}
    )
    """

    response = bqclient.query(query)
    _ = response.result()

    return BQ_TABLE_COPY
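
Two pieces of logic in this component are easy to check in isolation: deriving the name of the local table copy, and the tip label computed by `IF((tips/fare >= 0.2), 1, 0)`. The sketch below mirrors both in plain Python. The helpers `local_copy_name` and `tip_bin` are hypothetical names introduced for illustration, and the three-part table reference check is an added assumption.

```python
def local_copy_name(bq_table: str, project: str, table: str = "taxi_trips") -> str:
    """Return `project.dataset.table` for the local copy of a BigQuery table."""
    if bq_table.startswith("bq://"):
        bq_table = bq_table[len("bq://"):]
    parts = bq_table.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected project.dataset.table, got: {bq_table}")
    # Reuse the dataset name of the source table in the destination project.
    _, dataset, _ = parts
    return f"{project}.{dataset}.{table}"


def tip_bin(tips: float, fare: float, threshold: float = 0.2) -> int:
    """Return 1 when the tip is at least `threshold` of the fare, else 0."""
    return 1 if tips / fare >= threshold else 0


print(
    local_copy_name(
        "bq://bigquery-public-data.chicago_taxi_trips.taxi_trips", "my-project"
    )
)
print(tip_bin(3.0, 10.0), tip_bin(1.0, 10.0))
```

The query filters on `fare > 0`, so the division in `tip_bin` never sees a zero fare for rows that reach this step.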

### Create component for performing data analysis on a BigQuery dataset table

Next, you create a component that performs data analysis on a BigQuery dataset table using the TensorFlow Data Validation (TFDV) library. The component:

- Creates a client connection to BigQuery.
- Extracts the BigQuery table to a pandas dataframe.
- Uses TFDV to generate the dataset statistics.
- Uses TFDV to generate the dataset schema.
- Determines feature types (numeric vs. categorical) from the statistics.
- Writes the statistics, schema, and metadata to a Cloud Storage bucket.

This component returns as an artifact a dictionary representing the dataset metadata.

In [None]:
@component(
    packages_to_install=[
        "tensorflow",
        "tensorflow-data-validation==1.2",
        "google-cloud-bigquery",
    ]
)
def data_analysis(
    bq_table: str, label_column: str, data_bucket: str, project: str
) -> dict:

    import json

    import tensorflow as tf
    import tensorflow_data_validation as tfdv
    from google.cloud import bigquery

    bqclient = bigquery.Client(project=project)

    table = bigquery.TableReference.from_string(bq_table)

    rows = bqclient.list_rows(table)

    dataframe = rows.to_dataframe()

    stats = tfdv.generate_statistics_from_dataframe(
        dataframe=dataframe,
        stats_options=tfdv.StatsOptions(
            label_feature=label_column, sample_rate=1, num_top_values=50
        ),
    )

    tfdv.write_stats_text(stats, data_bucket + "/statistics.jsonl")

    NUMERIC_FEATURES = []
    CATEGORICAL_FEATURES = []
    for feature in stats.datasets[0].features:
        name = feature.path.step[0]
        if name == label_column:
            continue
        if feature.type == 0:  # int: treated as categorical
            CATEGORICAL_FEATURES.append(name)
        elif feature.type == 1:  # float: treated as numeric
            NUMERIC_FEATURES.append(name)
        elif feature.type == 2:  # string: treated as categorical
            CATEGORICAL_FEATURES.append(name)

    schema = tfdv.infer_schema(statistics=stats)

    tfdv.write_schema_text(output_path=data_bucket + "/schema.txt", schema=schema)

    metadata = {
        "label_column": label_column,
        "statistics": data_bucket + "/statistics.jsonl",
        "schema": data_bucket + "/schema.txt",
        "numeric_features": NUMERIC_FEATURES,
        "categorical_features": CATEGORICAL_FEATURES,
    }

    with tf.io.gfile.GFile(data_bucket + "/metadata.jsonl", "w") as f:
        json.dump(metadata, f)

    return metadata
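
The feature-type rule used in this component (floats are numeric, integers and strings are categorical, the label is skipped) can be exercised without TFDV. The sketch below feeds a plain dictionary of type codes to a hypothetical helper `split_features`. The codes 0, 1 and 2 follow the comments in the component above, and the feature names are taken from the query earlier in this notebook.

```python
INT, FLOAT, STRING = 0, 1, 2  # type codes as used in the component above


def split_features(feature_types: dict, label_column: str):
    """Split feature names into (numeric, categorical) lists by type code."""
    numeric, categorical = [], []
    for name, type_code in feature_types.items():
        if name == label_column:
            continue  # the label is not an input feature
        if type_code == FLOAT:
            numeric.append(name)
        elif type_code in (INT, STRING):
            categorical.append(name)
    return numeric, categorical


feature_types = {
    "trip_miles": FLOAT,
    "trip_hour": INT,
    "payment_type": STRING,
    "tip_bin": INT,
}
numeric, categorical = split_features(feature_types, label_column="tip_bin")
print(numeric, categorical)
```

Treating integers as categorical suits columns such as `trip_hour` or `trip_month`, where the values are a small set of codes rather than magnitudes.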

### Create component for constructing the run arguments for the Dataflow component

Next, you create a component for constructing the run arguments for the subsequent Dataflow component.

In [None]:
@component()
def make_dataflow_args(
    bucket: str,
    bq_table: str,
    setup_file: str,
    metadata: dict,
    transformed_data_prefix: str,
    transform_artifacts_dir: str,
    exported_tfrec_prefix: str,
    exported_jsonl_prefix: str,
    label: str,
) -> list:
    return [
        "--bucket",
        bucket,
        "--bq-table",
        bq_table,
        "--runner",
        "DataflowRunner",
        "--setup_file",
        setup_file,
        "--metadata",
        str(metadata),
        "--transformed-data-prefix",
        transformed_data_prefix,
        "--transform-artifacts-dir",
        transform_artifacts_dir,
        "--exported-tfrec-prefix",
        exported_tfrec_prefix,
        "--exported-jsonl-prefix",
        exported_jsonl_prefix,
        "--label",
        label,
    ]
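
A dictionary passed through a command-line flag has to be serialized to a string and parsed back on the other side. The sketch below, using only the standard library, shows that `json.dumps`/`json.loads` round-trips cleanly, while rewriting the quotes of a `str(dict)` can break on values containing apostrophes. The `note` field is a made-up value chosen to show that failure.

```python
import argparse
import json

metadata = {
    "label_column": "tip_bin",
    "numeric_features": ["trip_miles"],
    "categorical_features": ["payment_type"],
    "note": "rider's tip",  # hypothetical value containing an apostrophe
}

parser = argparse.ArgumentParser()
parser.add_argument("--metadata", dest="metadata", type=str)

# Round trip through JSON: serialize for the flag, parse on the receiving side.
args = parser.parse_args(["--metadata", json.dumps(metadata)])
restored = json.loads(args.metadata)


def quote_swap_parse(text: str) -> dict:
    """Parse a str(dict) by swapping single quotes for double quotes."""
    return json.loads(text.replace("'", '"'))


try:
    quote_swap_parse(str(metadata))
    quote_swap_ok = True
except json.JSONDecodeError:
    quote_swap_ok = False

print(restored == metadata, quote_swap_ok)
```

The quote-swapping approach works as long as no key or value contains a quote character, which holds for the feature names and bucket paths used in this tutorial.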

### Write the Dataflow Python module for preprocessing the data

Next, you write the Python script for preprocessing the data. This script will be used by the subsequent Dataflow component.

#### Dataset splitting

- Query the BigQuery table for all examples (parse_bq_record).
- Split the examples into training, evaluation and test datasets (split_dataset).
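
The split is deterministic: each row is assigned to a partition from a checksum of its JSON text, so the same row always lands in the same split. The sketch below reproduces that idea outside of Beam. It drops the `num_partitions` argument that `beam.Partition` passes to the function in the script, and the sample rows are made up.

```python
import json
from collections import Counter


def split_dataset(row: dict, ratio) -> int:
    """Return the partition index for a row, weighted by ratio."""
    # Sum of character codes acts as a coarse, repeatable hash of the row.
    bucket = sum(map(ord, json.dumps(row))) % sum(ratio)
    total = 0
    for i, part in enumerate(ratio):
        total += part
        if bucket < total:
            return i
    return len(ratio) - 1


rows = [{"trip_miles": float(i), "payment_type": "Cash"} for i in range(1000)]
counts = Counter(split_dataset(row, [8, 1, 1]) for row in rows)
print(counts)
```

Because the checksum is a coarse hash, the realized proportions only approximate the 8:1:1 ratio, and identical rows always fall into the same partition.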

#### Data preprocessing

- Preprocess each example (preprocessing_fn).
- Write the preprocessed data to a Cloud Storage bucket as TFRecords.
- Write the transformation function artifacts to a Cloud Storage bucket.
- Write the raw (unprocessed) examples to a Cloud Storage bucket as TFRecords.
- Write the raw (unprocessed) examples to a Cloud Storage bucket as JSONL.
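
The preprocessing function in the script scales numeric features to z-scores and maps categorical features to vocabulary indices with one out-of-vocabulary bucket. The sketch below is a plain-Python approximation of those two ideas, not the TensorFlow Transform implementation. `fit_z_score` and `fit_vocabulary` are hypothetical helpers, and the use of the population standard deviation and frequency-ordered indices are assumptions of this sketch.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_z_score(values):
    """Learn mean and standard deviation, return a scaling function."""
    mu, sigma = mean(values), pstdev(values)

    def scale(x: float) -> float:
        return (x - mu) / sigma if sigma else 0.0

    return scale


def fit_vocabulary(values):
    """Learn a vocabulary, return a function mapping values to indices."""
    # Most frequent value gets index 0; unseen values share one extra index.
    vocab = {v: i for i, (v, _) in enumerate(Counter(values).most_common())}
    oov_index = len(vocab)

    def encode(value) -> int:
        return vocab.get(value, oov_index)

    return encode


# Statistics are learned from the training split only, then reused as-is.
scale_miles = fit_z_score([1.0, 2.0, 3.0])
encode_payment = fit_vocabulary(["Cash", "Cash", "Credit Card"])

print(scale_miles(2.0), encode_payment("Cash"), encode_payment("Prcard"))
```

Learning the statistics on the training split and reusing them for the other splits keeps information from the evaluation and test data out of the transformation.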

In [None]:
%%writefile preprocess.py

import argparse
import logging
import json
import os

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
import tensorflow_data_validation as tfdv

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

def run(argv=None):
    """ Main entry: data management"""

    parser = argparse.ArgumentParser()
    parser.add_argument('--bq-table', dest='bq_table', type=str)
    parser.add_argument('--bucket', dest='bucket', type=str)
    parser.add_argument('--metadata', dest='metadata', type=str)
    parser.add_argument('--transformed-data-prefix', dest='transformed_data_prefix', type=str)
    parser.add_argument('--transform-artifacts-dir', dest='transform_artifacts_dir', type=str)
    parser.add_argument('--exported-tfrec-prefix', dest='exported_tfrec_prefix', type=str)
    parser.add_argument('--exported-jsonl-prefix', dest='exported_jsonl_prefix', type=str)
    parser.add_argument('--label', dest='label', type=str)

    args, pipeline_args = parser.parse_known_args(argv)

    logging.info("ARGS")
    logging.info(args)
    logging.info("PIPELINE ARGS")
    logging.info(pipeline_args)

    metadata = json.loads(args.metadata.replace("'", '"'))

    numeric_features = metadata['numeric_features']
    categorical_features = metadata['categorical_features']
    schema_location = metadata['schema']

    for i in range(0, len(pipeline_args), 2):
        if "--temp_location" == pipeline_args[i]:
            temp_location = pipeline_args[i+1]
        elif "--project" == pipeline_args[i]:
            project = pipeline_args[i+1]

    exported_train = args.bucket + '/exported_data/train'
    exported_eval  = args.bucket + '/exported_data/eval'

    logging.info("Get schema")
    schema = tfdv.load_schema_text(schema_location)
    feature_spec = tft.tf_metadata.schema_utils.schema_as_feature_spec(
        schema
    ).feature_spec

    raw_metadata = tft.tf_metadata.dataset_metadata.DatasetMetadata(
        tft.tf_metadata.schema_utils.schema_from_feature_spec(feature_spec)
    )
    query = f"SELECT * FROM {args.bq_table}"

    logging.info("Preprocess the data")
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=pipeline_options) as pipeline:
        with tft_beam.Context(temp_location):

            def parse_bq_record(bq_record):
                """Parses a bq_record to a dictionary."""
                output = {}
                for key in bq_record:
                    output[key] = [bq_record[key]]
                return output

            def split_dataset(bq_row, num_partitions, ratio):
                """Returns a partition number for a given bq_row."""
                import json

                assert num_partitions == len(ratio)
                bucket = sum(map(ord, json.dumps(bq_row))) % sum(ratio)
                total = 0
                for i, part in enumerate(ratio):
                    total += part
                    if bucket < total:
                        return i
                return len(ratio) - 1

            def convert_to_jsonl(data, label=None):
                ''' Converts a parsed record to JSON '''
                if label:
                    del data[label]
                return json.dumps(data)

            def preprocessing_fn(inputs):
                outputs = {}
                for key in inputs.keys():
                    if key in numeric_features:
                        outputs[key] = tft.scale_to_z_score(inputs[key])
                    elif key in categorical_features:
                        outputs[key] = tft.compute_and_apply_vocabulary(
                                            inputs[key],
                                            num_oov_buckets=1,
                                            vocab_filename=key,
                                        )
                    else:
                        outputs[key] = inputs[key]
                    outputs[key] = tf.squeeze(outputs[key], -1)
                return outputs


            # Read raw BigQuery data.
            raw_train_data, raw_val_data, raw_test_data = (
                pipeline
                | "Read Raw Data"
                >> beam.io.ReadFromBigQuery(
                    query=query,
                    project=project,
                    use_standard_sql=True,
                )
                | "Parse Data" >> beam.Map(parse_bq_record)
                | "Split" >> beam.Partition(split_dataset, 3, ratio=[8, 1, 1])
            )

            # Create a train_dataset from the data and schema.
            raw_train_dataset = (raw_train_data, raw_metadata)

            # Analyze and transform raw_train_dataset to produce transformed_train_dataset and transform_fn.
            transformed_train_dataset, transform_fn = (
                raw_train_dataset
                | "Analyze & Transform"
                >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
            )

            # Get data and schema separately from the transformed_dataset.
            transformed_train_data, transformed_metadata = transformed_train_dataset

            # Write transformed train data.
            _ = (
                transformed_train_data
                | "Write Transformed Train Data"
                >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=os.path.join(
                        args.transformed_data_prefix, "train/data"
                    ),
                    file_name_suffix=".gz",
                    coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema),
                )
            )

            # Create a val_dataset from the data and schema.
            raw_val_dataset = (raw_val_data, raw_metadata)

            # Transform raw_val_dataset to produce transformed_val_dataset using transform_fn.
            transformed_val_dataset = (
                raw_val_dataset,
                transform_fn,
            ) | "Transform Validation Data" >> tft_beam.TransformDataset()

            # Get data from the transformed_val_dataset.
            transformed_val_data, _ = transformed_val_dataset

            # Write transformed val data.
            _ = (
                transformed_val_data
                | "Write Transformed Validation Data"
                >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=os.path.join(args.transformed_data_prefix, "val/data"),
                    file_name_suffix=".gz",
                    coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema),
                )
            )

            # Create a test_dataset from the data and schema.
            raw_test_dataset = (raw_test_data, raw_metadata)

            # Transform raw_test_dataset to produce transformed_test_dataset using transform_fn.
            transformed_test_dataset = (
                raw_test_dataset,
                transform_fn,
            ) | "Transform Test Data" >> tft_beam.TransformDataset()


            # Get data from the transformed_test_dataset.
            transformed_test_data, _ = transformed_test_dataset

            # Write transformed test data.
            _ = (
                transformed_test_data
                | "Write Transformed Test Data"
                >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=os.path.join(args.transformed_data_prefix, "test/data"),
                    file_name_suffix=".gz",
                    coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema),
                )
            )

            # Write transform_fn.
            _ = transform_fn | "Write Transform Artifacts" >> tft_beam.WriteTransformFn(
                args.transform_artifacts_dir
            )

            # Write raw test data to GCS as TF Records
            _ = raw_test_data | "Write TF Test Data" >> beam.io.tfrecordio.WriteToTFRecord(
                file_path_prefix=os.path.join(args.exported_tfrec_prefix, "data"),
                file_name_suffix=".tfrecord",
                coder=tft.coders.ExampleProtoCoder(raw_metadata.schema),
            )

            # Convert raw test data to JSON (for batch prediction)
            json_test_data = (
                raw_test_data
            ) | "Convert Batch Test Data" >> beam.Map(convert_to_jsonl, label=args.label)

            # Write raw test data to GCS as JSONL files.
            _ = json_test_data | "Write JSONL Test Data" >> beam.io.WriteToText(
                file_path_prefix=args.exported_jsonl_prefix, file_name_suffix=".jsonl"
            )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
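
The `split_dataset` function in the module above assigns each row to a partition by hashing its JSON serialization, so the same row always lands in the same split. The following is a minimal standalone sketch of that logic outside of Apache Beam; the sample rows are hypothetical and only illustrate the mechanics.

```python
import json


def split_dataset(row, num_partitions, ratio):
    """Returns a partition index for a row, mirroring the module above."""
    assert num_partitions == len(ratio)
    # Sum of character codes of the serialized row, reduced modulo the ratio total.
    bucket = sum(map(ord, json.dumps(row))) % sum(ratio)
    total = 0
    for i, part in enumerate(ratio):
        total += part
        if bucket < total:
            return i
    return len(ratio) - 1


# Hypothetical parsed rows (each value wrapped in a list, as parse_bq_record does).
rows = [{"trip_miles": [float(i)], "payment_type": ["Cash"]} for i in range(1000)]

counts = [0, 0, 0]
for row in rows:
    counts[split_dataset(row, 3, ratio=[8, 1, 1])] += 1

print("train/val/test counts:", counts)
```

Because the partition depends only on the row content, re-running the pipeline on the same table reproduces the same split. The character-sum hash is simple rather than uniform, so the realized proportions only approximate the requested 8:1:1 ratio.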

#### Write the requirements (installs) for the Dataflow (Apache Beam) pipeline module

Next, create the `requirements.txt` file, which lists the Python packages that must be installed to execute the Apache Beam pipeline module: `apache-beam`, `tensorflow-transform`, `tensorflow-data-validation` and `future`.

In [None]:
%%writefile requirements.txt
apache-beam
tensorflow-transform==1.2.0
tensorflow-data-validation==1.2
future

#### Prepare package requirements for Dataflow job.

Before you can run a Dataflow job, you need to specify the package requirements for the worker pool that will execute the job.

In [None]:
%%writefile setup.py
import setuptools

REQUIRED_PACKAGES = [
    'tensorflow-transform==1.2.0',
    'tensorflow-data-validation==1.2',
    'future'
]
PACKAGE_NAME = 'my_package'
PACKAGE_VERSION = '0.0.1'
setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='preprocessing transformation',
    install_requires=REQUIRED_PACKAGES,
    author="cdpe@google.com",
    packages=setuptools.find_packages()
)

#### Copy Python module, requirements and setup files to Cloud Storage

Next, you copy the Python module, the requirements file and the setup file to your Cloud Storage bucket.

In [None]:
GCS_PREPROCESS_PY = BUCKET_NAME + "/preprocess.py"
! gsutil cp preprocess.py $GCS_PREPROCESS_PY
GCS_REQUIREMENTS_TXT = BUCKET_NAME + "/requirements.txt"
! gsutil cp requirements.txt $GCS_REQUIREMENTS_TXT
GCS_SETUP_PY = BUCKET_NAME + "/setup.py"
! gsutil cp setup.py $GCS_SETUP_PY

### Create transformed data analysis component

Next, you create a component that performs data analysis on the transformed training data using TensorFlow Transform. The component does the following:

- Loads the transformation function artifacts as a `TFTransformOutput`.
- Uses the transform output to determine the vocabulary size (number of unique values) of each categorical feature.
- If the vocabulary size is greater than 10, reclassifies the feature from categorical to embedding.
- Updates the metadata file for the dataset.
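
The reclassification rule can be sketched without TensorFlow Transform. In this minimal example, the feature names and vocabulary sizes are hypothetical stand-ins for the values that `vocabulary_size_by_name` would return, and `split_by_cardinality` is an illustrative helper, not part of the notebook.

```python
def split_by_cardinality(vocab_sizes, threshold=10):
    """Splits features into (categorical, embedding) lists by vocabulary size."""
    categorical, embedding = [], []
    for feature, size in vocab_sizes.items():
        if size > threshold:
            # High-cardinality features are better served by an embedding.
            embedding.append(feature)
        else:
            categorical.append(feature)
    return categorical, embedding


# Hypothetical vocabulary sizes for three categorical features.
vocab_sizes = {
    "payment_type": 6,
    "pickup_community_area": 77,
    "trip_day_of_week": 7,
}

categorical_features, embedding_features = split_by_cardinality(vocab_sizes)
print("categorical:", categorical_features)
print("embedding:", embedding_features)
```

A feature with exactly 10 unique values stays categorical, because the comparison is strictly greater than the threshold.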

In [None]:
@component(packages_to_install=["tensorflow", "tensorflow-transform==1.2.0", "future"])
def transformed_data_analysis(
    metadata_location: str,
    transformed_data_prefix: str,
    transform_artifacts_dir: str,
    exported_jsonl_prefix: str,
    exported_tfrec_prefix: str,
) -> dict:
    import json

    import tensorflow as tf
    import tensorflow_transform as tft

    tft_output = tft.TFTransformOutput(transform_artifacts_dir)

    with tf.io.gfile.GFile(metadata_location, "r") as f:
        metadata = json.load(f)
    categorical_features = metadata["categorical_features"]

    CATEGORICAL_FEATURES = []
    EMBEDDING_FEATURES = []
    for feature in categorical_features:
        unique = tft_output.vocabulary_size_by_name(feature)
        if unique > 10:
            EMBEDDING_FEATURES.append(feature)
            print("Convert to embedding", feature, unique)
        else:
            CATEGORICAL_FEATURES.append(feature)

    metadata["categorical_features"] = CATEGORICAL_FEATURES
    metadata["embedding_features"] = EMBEDDING_FEATURES

    metadata["transformed_data_prefix"] = transformed_data_prefix
    metadata["transform_artifacts_dir"] = transform_artifacts_dir
    metadata["exported_jsonl_prefix"] = exported_jsonl_prefix
    metadata["exported_tfrec_prefix"] = exported_tfrec_prefix

    with tf.io.gfile.GFile(metadata_location, "w") as f:
        json.dump(metadata, f)

    return metadata

### Construct pipeline for data analysis and preprocessing

Next, construct the pipeline with the following tasks:

- Create a local copy of the BigQuery dataset.
- Perform data analysis on the dataset.
- Prepare run arguments for the Dataflow script.
- Execute the Dataflow script and wait for the job to complete.
- Perform data analysis on the transformed data.
- Create a Vertex AI Dataset resource.

In [None]:
@dsl.pipeline(name="data-preprocessing", description="Prepare the dataset")
def pipeline(
    bq_table: str,
    display_name: str,
    transformed_data_prefix: str,
    transform_artifacts_dir: str,
    exported_tfrec_prefix: str,
    exported_jsonl_prefix: str,
    label_column: str,
    python_file_path: str,
    requirements_file_path: str,
    setup_file: str,
    staging_dir: str,
    data_bucket: str,
    metadata_location: str,
    dataset_labels: dict,
    year: int,
    limit: int,
    project: str = PROJECT_ID,
    region: str = REGION,
):
    from google_cloud_pipeline_components import aiplatform as gcc_aip

    bq_op = make_chicago_bq_dataset(
        bq_table=bq_table, year=year, limit=limit, project=project
    )

    analysis_op = data_analysis(
        bq_table=bq_op.output,
        label_column=label_column,
        data_bucket=data_bucket,
        project=project,
    )

    args_op = make_dataflow_args(
        bucket=data_bucket,
        setup_file=setup_file,
        bq_table=bq_op.output,
        metadata=analysis_op.output,
        transformed_data_prefix=transformed_data_prefix,
        transform_artifacts_dir=transform_artifacts_dir,
        exported_tfrec_prefix=exported_tfrec_prefix,
        exported_jsonl_prefix=exported_jsonl_prefix,
        label=label_column,
    )

    dataflow_python_op = DataflowPythonJobOp(
        project=project,
        location=region,
        python_module_path=python_file_path,
        temp_location=staging_dir,
        requirements_file_path=requirements_file_path,
        args=args_op.output,
    ).after(bq_op)

    dataflow_wait_op = WaitGcpResourcesOp(
        gcp_resources=dataflow_python_op.outputs["gcp_resources"]
    )

    transformed_analysis_op = transformed_data_analysis(
        metadata_location=metadata_location,
        transformed_data_prefix=transformed_data_prefix,
        transform_artifacts_dir=transform_artifacts_dir,
        exported_jsonl_prefix=exported_jsonl_prefix,
        exported_tfrec_prefix=exported_tfrec_prefix,
    ).after(dataflow_wait_op)

    dataset_op = gcc_aip.TabularDatasetCreateOp(
        project=project,
        display_name=display_name,
        bq_source=bq_table,
        labels=dataset_labels,
    ).after(transformed_analysis_op)

### Compile and execute the data analysis and preprocessing pipeline

Next, you compile the pipeline and then execute it. The pipeline takes the following parameters, which are passed as the dictionary `parameter_values`:

- `bq_table`: The BigQuery table used for training the model.
- `display_name`: The display name for the generated Vertex AI resources.
- `transformed_data_prefix`: The Cloud Storage location of the preprocessed training, test and validation data.
- `transform_artifacts_dir`: The Cloud Storage location of the transform function artifacts.
- `exported_tfrec_prefix`: The Cloud Storage location of the debug/test data for the serving model as TFRecords.
- `exported_jsonl_prefix`: The Cloud Storage location of the debug/test data for the serving model in JSONL format.
- `label_column`: The name of the label column.
- `python_file_path`: The Cloud Storage location of the Dataflow Python script for preprocessing the data.
- `requirements_file_path`: The Cloud Storage location of the requirements.txt file for the Dataflow component.
- `setup_file`: The Cloud Storage location of the setup.py script for the Dataflow component.
- `staging_dir`: The Cloud Storage location for the temporary location for the Apache Beam pipeline.
- `data_bucket`: The Cloud Storage location for data analysis artifacts.
- `metadata_location`: The Cloud Storage location for the Vertex AI Dataset user metadata.
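- `dataset_labels`: User-defined labels to add to the Vertex AI Dataset; here, the bucket that holds the user metadata.
- `year`: The `year` value passed to the task that creates the BigQuery dataset copy.
- `limit`: The `limit` value passed to the task that creates the BigQuery dataset copy.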
- `project`: The project ID.
- `region`: The region.
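
The `dataset_labels` value stores the bucket name without its `gs://` prefix, and a later component rebuilds the metadata path from that label. A minimal sketch of the round trip follows; the bucket name is a hypothetical placeholder, and the slicing assumes `BUCKET_NAME` always starts with `gs://`.

```python
BUCKET_NAME = "gs://my-example-bucket"  # hypothetical bucket name

# Strip the 5-character "gs://" prefix before storing the value as a label.
dataset_labels = {"user_metadata": BUCKET_NAME[5:]}

# Later, a component reads the label and rebuilds the full metadata path.
metadata_location = "gs://" + dataset_labels["user_metadata"] + "/metadata.jsonl"

print(dataset_labels)
print(metadata_location)
```

The prefix is presumably dropped because label values accept only a restricted character set, so a value containing `:` or `/` would be rejected.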

In [None]:
PIPELINE_ROOT = "{}/pipeline_root/data_preprocess".format(BUCKET_NAME)

EXPORTED_JSONL_PREFIX = os.path.join(BUCKET_NAME, "exported_data/jsonl")
EXPORTED_TFREC_PREFIX = os.path.join(BUCKET_NAME, "exported_data/tfrec")
TRANSFORMED_DATA_PREFIX = os.path.join(BUCKET_NAME, "transformed_data")
TRANSFORM_ARTIFACTS_DIR = os.path.join(BUCKET_NAME, "transformed_artifacts")

compiler.Compiler().compile(pipeline_func=pipeline, package_path="data_preprocess.json")

pipeline = aip.PipelineJob(
    display_name="data_preprocess",
    template_path="data_preprocess.json",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "bq_table": IMPORT_FILE,
        "display_name": "Chicago Taxi" + TIMESTAMP,
        "transformed_data_prefix": TRANSFORMED_DATA_PREFIX,
        "transform_artifacts_dir": TRANSFORM_ARTIFACTS_DIR,
        "exported_tfrec_prefix": EXPORTED_TFREC_PREFIX,
        "exported_jsonl_prefix": EXPORTED_JSONL_PREFIX,
        "label_column": "tip_bin",
        "python_file_path": GCS_PREPROCESS_PY,
        "requirements_file_path": GCS_REQUIREMENTS_TXT,
        "setup_file": GCS_SETUP_PY,
        "staging_dir": PIPELINE_ROOT,
        "data_bucket": BUCKET_NAME,
        "metadata_location": BUCKET_NAME + "/metadata.jsonl",
        "dataset_labels": {"user_metadata": BUCKET_NAME[5:]},
        "year": 2020,
        "limit": 300000,
        "project": PROJECT_ID,
        "region": REGION,
    },
)

pipeline.run()

! rm -f data_preprocess.json requirements.txt setup.py preprocess.py

### View the data pipeline execution results

In [None]:
PROJECT_NUMBER = pipeline.gca_resource.name.split("/")[1]
print(PROJECT_NUMBER)


def print_pipeline_output(job, output_task_name):
    JOB_ID = job.name
    print(JOB_ID)
    for task in job.gca_resource.job_detail.task_details:
        TASK_ID = task.task_id
        EXECUTE_OUTPUT = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/executor_output.json"
        )
        GCP_RESOURCES = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/gcp_resources"
        )
        EVAL_METRICS = (
            PIPELINE_ROOT
            + "/"
            + PROJECT_NUMBER
            + "/"
            + JOB_ID
            + "/"
            + output_task_name
            + "_"
            + str(TASK_ID)
            + "/evaluation_metrics"
        )
        if tf.io.gfile.exists(EXECUTE_OUTPUT):
            ! gsutil cat $EXECUTE_OUTPUT
            return EXECUTE_OUTPUT
        elif tf.io.gfile.exists(GCP_RESOURCES):
            ! gsutil cat $GCP_RESOURCES
            return GCP_RESOURCES
        elif tf.io.gfile.exists(EVAL_METRICS):
            ! gsutil cat $EVAL_METRICS
            return EVAL_METRICS

    return None


print("make-chicago-bq-dataset")
artifacts = print_pipeline_output(pipeline, "make-chicago-bq-dataset")
print("\n")
print("data-analysis")
artifacts = print_pipeline_output(pipeline, "data-analysis")
print("\n")
print("make-dataflow-args")
artifacts = print_pipeline_output(pipeline, "make-dataflow-args")
print("\n")
print("dataflow-python")
artifacts = print_pipeline_output(pipeline, "dataflow-python")
print("\n")
print("wait-gcp-resources")
artifacts = print_pipeline_output(pipeline, "wait-gcp-resources")
print("\n")
print("transformed-data-analysis")
artifacts = print_pipeline_output(pipeline, "transformed-data-analysis")
print("\n")
print("tabular-dataset-create")
artifacts = print_pipeline_output(pipeline, "tabular-dataset-create")
print("\n")

output = !gsutil cat $artifacts
output = json.loads(output[0])
dataset_id = output["artifacts"]["dataset"]["artifacts"][0]["metadata"]["resourceName"]
print(dataset_id)

## Formalizing model pipeline introduction

The model pipeline consists of a single task: building the model architecture.

### Build model architecture task

- Construct the model architecture for the base model and save as a Vertex AI Model resource.

### Triggers

Within the CI/CD process, the model pipeline is triggered for reasons such as the following (the list is not exhaustive):

- If the data pipeline is re-executed.
- Code changes to the build model architecture task.
- The model architecture is being replaced.

### Create build model architecture component

Next, you create a component that creates the base model architecture. Note that the base model is untrained. In this example, the model architecture is for a tabular model. The component does the following:

- Loads the corresponding Vertex AI Dataset.
- Loads the dataset metadata.
- Uses the feature-type information in the metadata to build the input layer.
- Builds the DNN body of the model.
- Saves the base model artifacts to the Cloud Storage location.

The component returns the Cloud Storage location of the saved base model artifacts. The pipeline then uploads these artifacts as a Vertex AI Model resource.
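
For embedding features, the component sizes each embedding layer from the feature's vocabulary size: the output dimension is the integer square root of the vocabulary size, and the input dimension is the vocabulary size plus one, presumably to cover the single out-of-vocabulary bucket used during preprocessing. A minimal sketch of that arithmetic follows; `embedding_dims` is an illustrative helper and the vocabulary sizes are hypothetical.

```python
from math import sqrt


def embedding_dims(vocab_size):
    """Returns the Embedding layer dimensions for a given vocabulary size."""
    return {
        "input_dim": vocab_size + 1,  # vocabulary plus one out-of-vocabulary bucket
        "output_dim": int(sqrt(vocab_size)),  # truncated square root
    }


# Hypothetical vocabulary sizes.
for size in (77, 100, 500):
    print(size, embedding_dims(size))
```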

In [None]:
@component(packages_to_install=["tensorflow==2.5", "tensorflow-transform", "future"])
def build_model(
    dataset_id: str, display_name: str, deploy_image: str, bucket: str, project: str
) -> str:

    import subprocess

    subprocess.call(["pip3", "install", "google-cloud-aiplatform"])

    import json
    import logging
    from math import sqrt

    import google.cloud.aiplatform as aip
    import tensorflow as tf
    import tensorflow_transform as tft
    from tensorflow.keras import Model, Sequential
    from tensorflow.keras.layers import (Activation, Concatenate, Dense,
                                         Embedding, Input, experimental)

    logging.info("Tensorflow version: " + tf.__version__)

    aip.init(project=project, staging_bucket=bucket, experiment=display_name)
    aip.start_run(run="retrain")

    # Load the dataset resource from the dataset resource ID.
    dataset = aip.TabularDataset(dataset_id)

    # Load the metadata for this dataset
    with tf.io.gfile.GFile(
        "gs://" + dataset.labels["user_metadata"] + "/metadata.jsonl", "r"
    ) as f:
        metadata = json.load(f)

    def create_model_inputs(
        numeric_features=None, categorical_features=None, embedding_features=None
    ):
        inputs = {}
        for feature_name in numeric_features:
            inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.float32)
        for feature_name in categorical_features:
            inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.int64)
        for feature_name in embedding_features:
            inputs[feature_name] = Input(name=feature_name, shape=[], dtype=tf.int64)

        return inputs

    input_layers = create_model_inputs(
        numeric_features=metadata["numeric_features"],
        categorical_features=metadata["categorical_features"],
        embedding_features=metadata["embedding_features"],
    )

    logging.info("Created input layers for model")
    logging.info(input_layers)

    def create_binary_classifier(
        input_layers,
        tft_output,
        metaparams,
        numeric_features,
        categorical_features,
        embedding_features,
    ):
        layers = []
        for feature_name in input_layers:
            if feature_name in embedding_features:
                vocab_size = tft_output.vocabulary_size_by_name(feature_name)
                embedding_size = int(sqrt(vocab_size))
                embedding_output = Embedding(
                    input_dim=vocab_size + 1,
                    output_dim=embedding_size,
                    name=f"{feature_name}_embedding",
                )(input_layers[feature_name])
                layers.append(embedding_output)
            elif feature_name in categorical_features:
                vocab_size = tft_output.vocabulary_size_by_name(feature_name)
                onehot_layer = experimental.preprocessing.CategoryEncoding(
                    num_tokens=vocab_size,
                    output_mode="binary",
                    name=f"{feature_name}_onehot",
                )(input_layers[feature_name])
                layers.append(onehot_layer)
            elif feature_name in numeric_features:
                numeric_layer = tf.expand_dims(input_layers[feature_name], -1)
                layers.append(numeric_layer)
            else:
                pass

        logging.info("Created layers for model")
        logging.info(layers)

        joined = Concatenate(name="combines_inputs")(layers)
        feedforward_output = Sequential(
            [Dense(units, activation="relu") for units in metaparams["hidden_units"]],
            name="feedforward_network",
        )(joined)
        logits = Dense(units=1, name="logits")(feedforward_output)
        pred = Activation("sigmoid")(logits)

        model = Model(inputs=input_layers, outputs=[pred])
        return model

    TRANSFORM_ARTIFACTS_DIR = metadata["transform_artifacts_dir"]
    tft_output = tft.TFTransformOutput(TRANSFORM_ARTIFACTS_DIR)

    metaparams = {"hidden_units": [128, 64]}
    aip.log_params(metaparams)

    model = create_binary_classifier(
        input_layers,
        tft_output,
        metaparams,
        numeric_features=metadata["numeric_features"],
        categorical_features=metadata["categorical_features"],
        embedding_features=metadata["embedding_features"],
    )

    logging.info("Created binary classifier model")
    model.summary(print_fn=logging.info)

    logging.info("Save base model architecture")
    MODEL_DIR = f"{bucket}/base_model"
    model.save(MODEL_DIR)

    return MODEL_DIR

### Construct pipeline for building the model architecture

Next, construct the pipeline with the following tasks:

- Build the base model architecture.
- Create a Vertex AI Model resource for the base model.

In [None]:
@dsl.pipeline(name="build-model", description="Build the base model architecture")
def pipeline(
    dataset_id: str,
    display_name: str,
    deploy_image: str,
    bucket: str,
    project: str = PROJECT_ID,
    region: str = REGION,
    labels: dict = {"base_model": "1"},
):
    from google_cloud_pipeline_components.types import artifact_types
    from google_cloud_pipeline_components.v1.model import ModelUploadOp
    from kfp.v2.components import importer_node

    model_build_op = build_model(
        dataset_id=dataset_id,
        display_name=display_name,
        deploy_image=deploy_image,
        bucket=bucket,
        project=project,
    )

    import_unmanaged_model_task = importer_node.importer(
        artifact_uri=model_build_op.output,
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={
            "containerSpec": {
                "imageUri": DEPLOY_IMAGE,
            },
        },
    ).after(model_build_op)

    model_upload = ModelUploadOp(
        project=project,
        display_name=display_name,
        unmanaged_container_model=import_unmanaged_model_task.outputs["artifact"],
    ).after(import_unmanaged_model_task)

### Compile and execute the build model architecture pipeline

Next, you compile the pipeline and then execute it. The pipeline takes the following parameters, which are passed as the dictionary `parameter_values`:

- `dataset_id`: The full resource name of the corresponding Vertex AI Dataset.
- `display_name`: The display name for the generated Vertex AI Model resource.
- `deploy_image`: The associated deployment container image.
- `bucket`: The Cloud Storage location to store the model artifacts.
- `project`: The project ID.
- `region`: The region.

In [None]:
PIPELINE_ROOT = "{}/pipeline_root/model_build".format(BUCKET_NAME)

compiler.Compiler().compile(pipeline_func=pipeline, package_path="model_build.json")

pipeline = aip.PipelineJob(
    display_name="model-build",
    template_path="model_build.json",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "dataset_id": dataset_id,
        "display_name": "chicago" + TIMESTAMP,
        "deploy_image": DEPLOY_IMAGE,
        "bucket": BUCKET_NAME,
        "project": PROJECT_ID,
        "region": REGION,
    },
)

pipeline.run()

! rm -f model_build.json

### View the model pipeline execution results

In [None]:
print("build-model")
artifacts = print_pipeline_output(pipeline, "build-model")
print("\n")
print("model-upload")
artifacts = print_pipeline_output(pipeline, "model-upload")
print("\n")

output = !gsutil cat $artifacts
output = json.loads(output[0])
model_id = output["artifacts"]["model"]["artifacts"][0]["metadata"]["resourceName"]
print(model_id)

## Formalizing training pipeline introduction

The training pipeline consists of training the model.

### Train the model task

- Retrieve the model architecture.
- If warmup training:
    - Warm up the model.
    - Save model weights back to the base model architecture.
- If full training:
    - Retrieve the hyperparameters from the baseline model.
    - Train the model.
    - Evaluate the model.
    - Save the model artifacts to the specified Cloud Storage location.
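
During warmup, the training script in this section raises the learning rate linearly from a starting value by a fixed increment per epoch. The following is a minimal sketch of the resulting schedule in plain Python; `warmup_schedule` is an illustrative helper and the learning rates are hypothetical values.

```python
def warmup_schedule(start_lr, end_lr, num_epochs):
    """Returns the learning rate used at each warmup epoch."""
    lr_inc = (end_lr - start_lr) / num_epochs
    schedule = []
    lr = start_lr
    for epoch in range(num_epochs):
        # Epoch 0 uses the starting rate; later epochs add a fixed increment.
        lr = start_lr if epoch == 0 else lr + lr_inc
        schedule.append(lr)
    return schedule


# Hypothetical warmup settings.
schedule = warmup_schedule(start_lr=0.0001, end_lr=0.001, num_epochs=3)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr={lr:.6f}")
```

With this formula the last warmup epoch runs one increment below the end learning rate, because the increment is applied only `num_epochs - 1` times.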

### Triggers

Within the CI/CD process, the training pipeline is triggered for reasons such as the following (the list is not exhaustive):

- If the data pipeline is re-executed.
- If the model pipeline is re-executed.
- Code changes to the model training task.

### Construct the training package

#### Package layout

Before you start training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - \_\_init\_\_.py
  - task.py
  - other Python scripts

The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.

The file `trainer/task.py` is the Python script for executing the custom training job.
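
As an alternative to shell commands, the same layout can be created with `pathlib`. This is a minimal sketch that builds empty placeholder files in a temporary directory; `make_package_skeleton` is an illustrative helper, and a real package would need actual file contents.

```python
import tempfile
from pathlib import Path


def make_package_skeleton(root):
    """Creates empty placeholder files for the training package layout."""
    root = Path(root)
    (root / "trainer").mkdir(parents=True, exist_ok=True)
    for name in (
        "PKG-INFO",
        "README.md",
        "setup.cfg",
        "setup.py",
        "trainer/__init__.py",
        "trainer/task.py",
    ):
        (root / name).touch()
    # Return the relative paths of all files that were created.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    files = make_package_skeleton(Path(tmp) / "custom")

print(files)
```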

In [None]:
# Make folder for Python training script
! rm -rf custom
! mkdir custom

# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'google-cloud-aiplatform',\n\n        'cloudml-hypertune',\n\n        'tensorflow_datasets==1.3.0',\n\n        'tensorflow_data_validation==1.2',\n\n    ],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: Chicago Taxi tabular binary classification\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: cdpe@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex AI"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py

### Load the transformed data into a tf.data.Dataset

Next, you load the gzip-compressed TFRecords from Cloud Storage into a `tf.data.Dataset` generator. These functions are reused when training the custom model with `Vertex AI Training`, so you save them to the Python training package.

In [None]:
%%writefile custom/trainer/data.py

import tensorflow as tf

def _gzip_reader_fn(filenames):
    """Small utility returning a record reader that can read gzip'ed files."""
    return tf.data.TFRecordDataset(filenames, compression_type="GZIP")


def get_dataset(file_pattern, feature_spec, label_column, batch_size=200):
    """Generates features and label for tuning/training.
    Args:
      file_pattern: input tfrecord file pattern.
      feature_spec: a dictionary of feature specifications.
      label_column: name of the feature to use as the label.
      batch_size: representing the number of consecutive elements of returned
        dataset to combine in a single batch
    Returns:
      A dataset that contains (features, indices) tuple where features is a
        dictionary of Tensors, and indices is a single Tensor of label indices.
    """

    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,
        batch_size=batch_size,
        features=feature_spec,
        label_key=label_column,
        reader=_gzip_reader_fn,
        num_epochs=1,
        drop_final_batch=True,
    )

    return dataset

## Develop and test the training scripts

When experimenting, one typically develops and tests the training package locally, before moving to training in the cloud.

### Create training script

Next, you write the Python script for compiling and training the model.

In [None]:
%%writefile custom/trainer/train.py

from trainer import data
import tensorflow as tf
import logging
from hypertune import HyperTune

def compile(model, hyperparams):
    ''' Compile the model '''
    optimizer = tf.keras.optimizers.Adam(learning_rate=hyperparams["learning_rate"])
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    metrics = [tf.keras.metrics.BinaryAccuracy(name="accuracy")]

    model.compile(optimizer=optimizer,loss=loss, metrics=metrics)
    return model

def warmup(
    model,
    hyperparams,
    train_data_dir,
    label_column,
    transformed_feature_spec
):
    ''' Warmup the initialized model weights '''

    train_dataset = data.get_dataset(
        train_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    lr_inc = (hyperparams['end_learning_rate'] - hyperparams['start_learning_rate']) / hyperparams['num_epochs']

    def scheduler(epoch, lr):
        if epoch == 0:
            return hyperparams['start_learning_rate']
        return lr + lr_inc


    callbacks = [tf.keras.callbacks.LearningRateScheduler(scheduler)]

    logging.info("Model warmup started...")
    history = model.fit(
            train_dataset,
            epochs=hyperparams["num_epochs"],
            steps_per_epoch=hyperparams["steps"],
            callbacks=callbacks
    )

    logging.info("Model warmup completed.")
    return history


def train(
    model,
    hyperparams,
    train_data_dir,
    val_data_dir,
    label_column,
    transformed_feature_spec,
    log_dir,
    tuning=False
):
    ''' Train the model '''

    train_dataset = data.get_dataset(
        train_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    val_dataset = data.get_dataset(
        val_data_dir,
        transformed_feature_spec,
        label_column,
        batch_size=hyperparams["batch_size"],
    )

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor=hyperparams["early_stop"]["monitor"], patience=hyperparams["early_stop"]["patience"], restore_best_weights=True
    )

    callbacks = [early_stop]

    if log_dir:
        tensorboard = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

        # list.append returns None, so append in place rather than reassigning.
        callbacks.append(tensorboard)

    if tuning:
        # Instantiate the HyperTune reporting object
        hpt = HyperTune()

        # Reporting callback
        class HPTCallback(tf.keras.callbacks.Callback):

            def on_epoch_end(self, epoch, logs=None):
                hpt.report_hyperparameter_tuning_metric(
                    hyperparameter_metric_tag='val_loss',
                    metric_value=logs['val_loss'],
                    global_step=epoch
                )

        if not callbacks:
            callbacks = []
        callbacks.append(HPTCallback())

    logging.info("Model training started...")
    history = model.fit(
            train_dataset,
            epochs=hyperparams["num_epochs"],
            validation_data=val_dataset,
            callbacks=callbacks
    )

    logging.info("Model training completed.")
    return history

def evaluate(
    model,
    hyperparams,
    test_data_dir,
    label_column,
    transformed_feature_spec
):
    logging.info("Model evaluation started...")
    test_dataset = data.get_dataset(
        test_data_dir,
        transformed_feature_spec,
        label_column,
        hyperparams["batch_size"],
    )

    evaluation_metrics = model.evaluate(test_dataset)
    logging.info("Model evaluation completed.")

    return evaluation_metrics
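
To make the learning-rate ramp in `warmup()` concrete, the following sketch replays the same scheduler logic in plain Python, without Keras. The values `0.0001`, `0.001`, and `5` are illustrative assumptions (they match the `--start_lr` and `--lr` defaults and the warmup `--epochs` value used later in this tutorial), and `warmup_schedule` is a hypothetical helper name.

```python
def warmup_schedule(start_lr, end_lr, num_epochs):
    """Replay the per-epoch learning rates produced by the warmup scheduler."""
    lr_inc = (end_lr - start_lr) / num_epochs
    rates = []
    lr = start_lr
    for epoch in range(num_epochs):
        # Epoch 0 uses the starting rate; later epochs add a fixed increment.
        lr = start_lr if epoch == 0 else lr + lr_inc
        rates.append(lr)
    return rates


rates = warmup_schedule(start_lr=0.0001, end_lr=0.001, num_epochs=5)
for epoch, lr in enumerate(rates):
    print(f"epoch {epoch}: lr={lr:.6f}")
```

Because the first epoch starts at `start_lr` and only `num_epochs - 1` increments are applied, the last warmup epoch runs one increment below `end_lr`.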

## Add a serving function

Next, you add a serving function to your model for online and batch prediction. This allows prediction requests to be sent in raw format (unpreprocessed), either as a serialized TF.Example or JSONL object. The serving function will then preprocess the prediction request into the transformed format expected by the model.

In [None]:
%%writefile custom/trainer/serving.py

import tensorflow as tf
import tensorflow_data_validation as tfdv
import tensorflow_transform as tft
import logging

def _get_serve_features_fn(model, tft_output):
    """Returns a function that accepts a dictionary of features and applies TFT."""

    model.tft_layer = tft_output.transform_features_layer()

    @tf.function
    def serve_features_fn(raw_features):
        """Returns the output to be used in the serving signature."""

        transformed_features = model.tft_layer(raw_features)
        probabilities = model(transformed_features)
        return {"scores": probabilities}


    return serve_features_fn

def _get_serve_tf_examples_fn(model, tft_output, feature_spec):
    """Returns a function that parses a serialized tf.Example and applies TFT."""

    model.tft_layer = tft_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        """Returns the output to be used in the serving signature."""
        for key in list(feature_spec.keys()):
            if key not in features:
                feature_spec.pop(key)

        parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)

        transformed_features = model.tft_layer(parsed_features)
        probabilities = model(transformed_features)
        return {"scores": probabilities}

    return serve_tf_examples_fn

def construct_serving_model(
    model, serving_model_dir, metadata
):
    global features

    schema_location = metadata['schema']
    features = metadata['numeric_features'] + metadata['categorical_features'] + metadata['embedding_features']
    logging.info("Features: %s", features)
    tft_output_dir = metadata["transform_artifacts_dir"]

    schema = tfdv.load_schema_text(schema_location)
    feature_spec = tft.tf_metadata.schema_utils.schema_as_feature_spec(schema).feature_spec

    tft_output = tft.TFTransformOutput(tft_output_dir)

    # Drop features that were not used in training
    features_input_signature = {
        feature_name: tf.TensorSpec(
            shape=(None, 1), dtype=spec.dtype, name=feature_name
        )
        for feature_name, spec in feature_spec.items()
        if feature_name in features
    }

    signatures = {
        "serving_default": _get_serve_features_fn(
            model, tft_output
        ).get_concrete_function(features_input_signature),
        "serving_tf_example": _get_serve_tf_examples_fn(
            model, tft_output, feature_spec
        ).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")
        ),
    }

    logging.info("Model saving started...")
    model.save(serving_model_dir, signatures=signatures)
    logging.info("Model saving completed.")
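
As an illustration of the raw request format that the `serving_default` signature accepts, the sketch below builds a JSON prediction body in the `instances` format used by Vertex AI online prediction. Because each input in `features_input_signature` has shape `(None, 1)`, every feature value in an instance is wrapped in a single-element list. The feature names and values (`trip_miles`, `payment_type`) are hypothetical placeholders, not the tutorial's actual feature list, and `make_instance` is an assumed helper name.

```python
import json


def make_instance(raw_features):
    """Wrap each raw feature value in a list to match the (None, 1) input shape."""
    return {name: [value] for name, value in raw_features.items()}


# Hypothetical raw (unpreprocessed) feature values for two taxi trips.
rows = [
    {"trip_miles": 2.5, "payment_type": "Credit Card"},
    {"trip_miles": 0.8, "payment_type": "Cash"},
]

request_body = json.dumps({"instances": [make_instance(row) for row in rows]})
print(request_body)
```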

### Retrieve model from Vertex AI

Next, create the Python script to retrieve your experimental model from Vertex AI.

In [None]:
%%writefile custom/trainer/model.py

import google.cloud.aiplatform as aip

def get(model_id):
    model = aip.Model(model_id)
    return model

### Create the task script for the Python training package

Next, you create the `task.py` script for driving the training package. Some notable steps include:

- Command-line arguments:
    - `model-id`: The resource ID of the `Model` resource you built during experimenting. This is the untrained model architecture.
    - `dataset-id`: The resource ID of the `Dataset` resource to use for training.
    - `experiment`: The name of the experiment.
    - `run`: The name of the run within this experiment.
    - `tensorboard-log-dir`: The logging directory for Vertex AI Tensorboard.


- `get_data()`:
    - Loads the Dataset resource into memory.
    - Obtains the user metadata from the Dataset resource.
    - From the metadata, obtains the location of the transformed data, the transformation function, and the name of the label column.


- `get_model()`:
    - Loads the Model resource into memory.
    - Obtains location of model artifacts of the model architecture.
    - Loads the model architecture.
    - Compiles the model.


- `warmup_model()`:
    - Warms up the initialized model weights.


- `train_model()`:
    - Trains the model.


- `evaluate_model()`:
    - Evaluates the model.
    - Saves evaluation metrics to Cloud Storage bucket.

In [None]:
%%writefile custom/trainer/task.py
import os
import argparse
import logging
import json

import tensorflow as tf
import tensorflow_transform as tft
from tensorflow.python.client import device_lib

import google.cloud.aiplatform as aip

from trainer import data
from trainer import model as model_
from trainer import train
try:
    from trainer import serving
except ImportError:
    # The serving module is optional; it is only needed when --serving is set
    pass

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')
parser.add_argument('--model-id', dest='model_id',
                    default=None, type=str, help='Vertex Model ID.')
parser.add_argument('--dataset-id', dest='dataset_id',
                    default=None, type=str, help='Vertex Dataset ID.')
parser.add_argument('--lr', dest='lr',
                    default=0.001, type=float,
                    help='Learning rate.')
parser.add_argument('--start_lr', dest='start_lr',
                    default=0.0001, type=float,
                    help='Starting learning rate.')
parser.add_argument('--epochs', dest='epochs',
                    default=20, type=int,
                    help='Number of epochs.')
parser.add_argument('--steps', dest='steps',
                    default=200, type=int,
                    help='Number of steps per epoch.')
parser.add_argument('--batch_size', dest='batch_size',
                    default=16, type=int,
                    help='Batch size.')
parser.add_argument('--distribute', dest='distribute', type=str, default='single',
                    help='distributed training strategy')
parser.add_argument('--tensorboard-log-dir', dest='tensorboard_log_dir',
                    default=os.getenv('AIP_TENSORBOARD_LOG_DIR'), type=str,
                    help='Output file for tensorboard logs')
parser.add_argument('--experiment', dest='experiment',
                    default=None, type=str,
                    help='Name of experiment')
parser.add_argument('--project', dest='project',
                    default=None, type=str,
                    help='Name of project')
parser.add_argument('--run', dest='run',
                    default=None, type=str,
                    help='Name of run in experiment')
def str2bool(value):
    ''' Parse a command-line string as a boolean; type=bool treats any non-empty string as True '''
    return str(value).lower() in ('true', '1', 'yes')

parser.add_argument('--evaluate', dest='evaluate',
                    default=False, type=str2bool,
                    help='Whether to perform evaluation')
parser.add_argument('--serving', dest='serving',
                    default=False, type=str2bool,
                    help='Whether to attach the serving function')
parser.add_argument('--tuning', dest='tuning',
                    default=False, type=str2bool,
                    help='Whether to perform hyperparameter tuning')
parser.add_argument('--warmup', dest='warmup',
                    default=False, type=str2bool,
                    help='Whether to perform warmup weight initialization')
args = parser.parse_args()


logging.getLogger().setLevel(logging.INFO)
logging.info('DEVICES'  + str(device_lib.list_local_devices()))

# Single Machine, single compute device
if args.distribute == 'single':
    # tf.test.is_gpu_available() is deprecated in TF 2.x
    if tf.config.list_physical_devices('GPU'):
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
    logging.info("Single device training")
# Single Machine, multiple compute device
elif args.distribute == 'mirrored':
    strategy = tf.distribute.MirroredStrategy()
    logging.info("Mirrored Strategy distributed training")
# Multi Machine, multiple compute device
elif args.distribute == 'multiworker':
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    logging.info("Multi-worker Strategy distributed training")
    logging.info('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
logging.info('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))

# Initialize the run for this experiment
if args.experiment:
    logging.info("Initialize experiment: {}".format(args.experiment))
    aip.init(experiment=args.experiment, project=args.project)
    aip.start_run(args.run)

metadata = {}

def get_data():
    ''' Get the preprocessed training data '''
    global train_data_file_pattern, val_data_file_pattern, test_data_file_pattern
    global label_column, transform_feature_spec, metadata

    dataset = aip.TabularDataset(args.dataset_id)
    METADATA = 'gs://' + dataset.labels['user_metadata'] + "/metadata.jsonl"

    with tf.io.gfile.GFile(METADATA, "r") as f:
        metadata = json.load(f)

    TRANSFORMED_DATA_PREFIX = metadata['transformed_data_prefix']
    label_column = metadata['label_column']

    train_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/train/data-*.gz'
    val_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/val/data-*.gz'
    test_data_file_pattern = TRANSFORMED_DATA_PREFIX + '/test/data-*.gz'

    TRANSFORM_ARTIFACTS_DIR = metadata['transform_artifacts_dir']
    tft_output = tft.TFTransformOutput(TRANSFORM_ARTIFACTS_DIR)
    transform_feature_spec = tft_output.transformed_feature_spec()

def get_model():
    ''' Get the untrained model architecture '''
    global model_artifacts

    vertex_model = model_.get(args.model_id)
    model_artifacts = vertex_model.gca_resource.artifact_uri
    model = tf.keras.models.load_model(model_artifacts)

    # Compile the model
    hyperparams = {}
    hyperparams["learning_rate"] = args.lr
    if args.experiment:
        aip.log_params(hyperparams)

    metadata.update(hyperparams)
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

    train.compile(model, hyperparams)
    return model

def warmup_model(model):
    ''' Warmup the initialized model weights '''
    warmupparams = {}
    warmupparams["num_epochs"] = args.epochs
    warmupparams["batch_size"] = args.batch_size
    warmupparams["steps"] = args.steps
    warmupparams["start_learning_rate"] = args.start_lr
    warmupparams["end_learning_rate"] = args.lr

    train.warmup(model, warmupparams, train_data_file_pattern, label_column, transform_feature_spec)
    return model

def train_model(model):
    ''' Train the model '''
    trainparams = {}
    trainparams["num_epochs"] = args.epochs
    trainparams["batch_size"] = args.batch_size
    trainparams["early_stop"] = {"monitor": "val_loss", "patience": 5}
    if args.experiment:
        aip.log_params(trainparams)

    metadata.update(trainparams)
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

    train.train(model, trainparams, train_data_file_pattern, val_data_file_pattern, label_column, transform_feature_spec, args.tensorboard_log_dir, args.tuning)
    return model

def evaluate_model(model):
    ''' Evaluate the model '''
    evalparams = {}
    evalparams["batch_size"] = args.batch_size
    metrics = train.evaluate(model, evalparams, test_data_file_pattern, label_column, transform_feature_spec)

    metadata.update({'metrics': metrics})
    with tf.io.gfile.GFile(os.path.join(args.model_dir, "metrics.txt"), "w") as f:
        f.write(json.dumps(metadata))

get_data()
with strategy.scope():
    model = get_model()

if args.warmup:
    model = warmup_model(model)
else:
    model = train_model(model)

if args.evaluate:
    evaluate_model(model)

if args.serving:
    logging.info('Save serving model to: ' + args.model_dir)
    serving.construct_serving_model(
        model=model,
        serving_model_dir=args.model_dir,
        metadata=metadata
    )
elif args.warmup:
    logging.info('Save warmed up model to: ' + model_artifacts)
    model.save(model_artifacts)
else:
    logging.info('Save trained model to: ' + args.model_dir)
    model.save(args.model_dir)
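
The metadata file that `get_data()` reads drives most of the script. As a rough sketch, the snippet below shows how the three file patterns are derived from it. The keys are the ones the scripts above read; the bucket paths and the label name are made-up placeholders, and `build_file_patterns` is a hypothetical helper.

```python
import json

# Hypothetical contents of metadata.jsonl; only the keys mirror the scripts above.
metadata_text = json.dumps({
    "transformed_data_prefix": "gs://example-bucket/transformed",
    "transform_artifacts_dir": "gs://example-bucket/transform_artifacts",
    "label_column": "tip_bin",
})


def build_file_patterns(metadata):
    """Return the train/val/test glob patterns derived from the data prefix."""
    prefix = metadata["transformed_data_prefix"]
    return {split: f"{prefix}/{split}/data-*.gz" for split in ("train", "val", "test")}


metadata = json.loads(metadata_text)
patterns = build_file_patterns(metadata)
print(patterns["train"])
```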

### Test training package locally

Next, test your completed training package locally with just a few epochs.

In [None]:
DATASET_ID = dataset_id
MODEL_ID = model_id
!cd custom; python3 -m trainer.task --model-id={MODEL_ID} --dataset-id={DATASET_ID} --experiment='chicago' --run='test' --project={PROJECT_ID} --epochs=5 --model-dir=/tmp --evaluate=True --serving=True

#### Store training script on your Cloud Storage bucket

Next, you package the training folder into a compressed tarball, and then store it in your Cloud Storage bucket.

In [None]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_NAME/trainer_chicago.tar.gz
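
If shelling out to `tar` and `gzip` is inconvenient, the same kind of archive can be produced with Python's standard `tarfile` module. This is only a sketch of an alternative: it builds a throwaway `custom/trainer` folder in a temporary directory so that it runs anywhere, and `make_tarball` is a hypothetical helper name. Uploading the result to Cloud Storage is left to the `gsutil cp` step above.

```python
import os
import tarfile
import tempfile


def make_tarball(source_dir, output_path):
    """Create a gzip-compressed tar archive containing source_dir."""
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative (custom/...).
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return output_path


# Build a tiny stand-in package so the sketch is self-contained.
workdir = tempfile.mkdtemp()
package_dir = os.path.join(workdir, "custom", "trainer")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "task.py"), "w") as f:
    f.write("print('training')\n")

archive = make_tarball(
    os.path.join(workdir, "custom"), os.path.join(workdir, "custom.tar.gz")
)
with tarfile.open(archive, "r:gz") as tar:
    names = tar.getnames()
print(names)
```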

## Construct custom training pipeline

In the example below, you construct a pipeline for training a custom model using pre-built Google Cloud Pipeline Components for Vertex AI Training, as follows:


1. Pipeline arguments, which specify:
    - `dataset_id`: The full resource name of the `Vertex AI Dataset` to use for training.
    - `model_id`: The full resource name of the `Vertex AI Model` that holds the model architecture.
    - `python_package`: The custom training Python package.
    - `python_module`: The entry module in the package to execute.
    - `args`: The command line arguments to the custom training Python script.
    - `container_uri`: The training container.
    - `model_serving_container_image_uri`: The serving container.
    - `machine_type`: The machine type of the VMs used for training.
    - `replica_count`: The number of VMs for training.
    - `accelerator_type`: The type of accelerator, if any, for training.
    - `accelerator_count`: The number of accelerators for training.
    - `tensorboard`: The full resource path to the `Vertex AI Tensorboard` instance.
    - `warmup`: Whether to warm up the initial weights of the base model architecture instead of running full training.
    - `bucket`: The Cloud Storage location of the trained model artifacts.
    - `service_account`: The service account for executing the pipeline components.
    - `project`: The project for executing the pipeline components.
    - `label`: Metadata relating to the trained model to store in the corresponding trained `Vertex AI Model` resource.


2. Use the prebuilt component `CustomPythonPackageTrainingJobRunOp` to train a custom model and upload it as a `Vertex AI Model` resource, where:
    - The display name for the model is passed into the pipeline.
    - The dataset and the base model are identified by the resource IDs given in the command-line arguments (`--dataset-id`, `--model-id`), not by the output of an upstream component.
    - The Python package and command-line arguments are passed into the pipeline.
    - The training and serving containers are passed into the pipeline as `container_uri` and `model_serving_container_image_uri`.
    - The component returns the model resource as `outputs["model"]`.


3. Use the prebuilt component `EndpointCreateOp` to create a `Vertex AI Endpoint` to deploy the trained model to, where:
    - Since the component has no data dependencies on other components, by default it would be executed in parallel with the model training.
    - The `after(training_op)` call is added to serialize its execution, so it is only executed if the training operation completes successfully.
    - The component returns the endpoint resource as `outputs["endpoint"]`.


4. Use the prebuilt component `ModelDeployOp` to deploy the trained `Vertex AI Model` to the endpoint, where:
    - The model is the output from the `CustomPythonPackageTrainingJobRunOp`.
    - The endpoint is the output from the `EndpointCreateOp`.
    - The deployment uses a single `n1-standard-4` replica.

*Note:* Since each component is executed as a graph node in its own execution context, you pass the parameter `project` to each component op, in contrast to calling `aip.init(project=project)` once, as you would in a Python script that calls the SDK methods directly within a single execution context.

In [None]:
@dsl.pipeline(
    name="chicago-custom-training",
    description="Custom tabular binary classification training",
)
def pipeline(
    display_name: str,
    dataset_id: str,
    model_id: str,
    python_package: str,
    python_module: str,
    args: str,
    container_uri: str,
    machine_type: str,
    bucket: str,
    model_serving_container_image_uri: str = "",
    tensorboard: str = "",
    service_account: str = "",
    label: str = str({"candidate_model": "1"}).replace("'", '"'),
    replica_count: int = 1,
    accelerator_type: str = "",
    accelerator_count: int = 0,
    warmup: str = "False",
    project: str = PROJECT_ID,
    region: str = REGION,
):
    from google_cloud_pipeline_components import aiplatform as gcc_aip
    from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp,
                                                              ModelDeployOp)

    with dsl.Condition(warmup == "True", name="warmup-model"):

        warmup_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
            project=project,
            display_name=display_name,
            # Warmup Training
            python_package_gcs_uri=python_package,
            python_module_name=python_module,
            container_uri=container_uri,
            staging_bucket=bucket,
            args=args,
            replica_count=replica_count,
            machine_type=machine_type,
            accelerator_type=accelerator_type,
            accelerator_count=accelerator_count,
        )

    with dsl.Condition(warmup == "False", name="train-model"):

        training_op = gcc_aip.CustomPythonPackageTrainingJobRunOp(
            project=project,
            display_name=display_name,
            # Full Training
            python_package_gcs_uri=python_package,
            python_module_name=python_module,
            container_uri=container_uri,
            staging_bucket=bucket,
            args=args,
            replica_count=replica_count,
            machine_type=machine_type,
            accelerator_type=accelerator_type,
            accelerator_count=accelerator_count,
            tensorboard=tensorboard,
            service_account=service_account,
            # Serving - As part of this operation, the model is registered to Vertex AI
            model_serving_container_image_uri=model_serving_container_image_uri,
            model_display_name=display_name,
            labels=label,
        )

        endpoint_op = EndpointCreateOp(
            project=project,
            location=region,
            display_name=display_name,
        ).after(training_op)

        deploy_op = ModelDeployOp(
            model=training_op.outputs["model"],
            endpoint=endpoint_op.outputs["endpoint"],
            dedicated_resources_min_replica_count=1,
            dedicated_resources_max_replica_count=1,
            dedicated_resources_machine_type="n1-standard-4",
        )

### Compile and execute the model warmup condition of the pipeline

Next, you compile the pipeline and then execute it. The pipeline takes the following parameters, which are passed as the dictionary `parameter_values`:

- `dataset_id`: The full resource name of the corresponding Vertex AI Dataset.
- `model_id`: The full resource name of the corresponding Vertex AI Model architecture.
- `display_name`: The display name for the trained Vertex AI Model resource.
- `python_package`: The Python package for the custom warmup training job.
- `python_module`: The Python module in the package to execute.
- `args`: The command line arguments to pass to the Python module.
- `container_uri`: The training container image.
- `machine_type`: The VM for executing the training job.
- `replica_count`: The number of virtual machines -- if doing distributed multi-machine training.
- `accelerator_type`: The type of HW accelerators -- if any.
- `accelerator_count`: The number of HW accelerators -- if any.
- `bucket`: The Cloud Storage location to store the model artifacts.
- `tensorboard`: The full resource name of a Vertex AI Tensorboard.
- `service_account`: The service account for the Tensorboard instance.
- `project`: The project ID.
- `region`: The region.

*Note*: This portion of the pipeline does not create a new `Vertex AI Model` resource. Instead, it updates the weights in the model artifacts of the existing `Vertex AI Model` base architecture.

In [None]:
PIPELINE_ROOT = "{}/pipeline_root/model-train".format(BUCKET_NAME)

compiler.Compiler().compile(pipeline_func=pipeline, package_path="model_train.json")

pipeline = aip.PipelineJob(
    display_name="model-warmup",
    template_path="model_train.json",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "dataset_id": dataset_id,
        "model_id": model_id,
        "display_name": "chicago" + TIMESTAMP,
        "python_package": f"{BUCKET_NAME}/trainer_chicago.tar.gz",
        "python_module": "trainer.task",
        "args": [
            "--dataset-id",
            dataset_id,
            "--model-id",
            model_id,
            "--epochs",
            str(5),
            "--batch_size",
            str(16),
            "--steps",
            str(200),
            "--lr",
            str(baseline_metrics["learning_rate"]),
            "--start_lr",
            str(0.0001),
            "--warmup",
            "True",
            "--project",
            PROJECT_ID,
        ],
        "container_uri": TRAIN_IMAGE,
        "machine_type": TRAIN_COMPUTE,
        "replica_count": 1,
        "accelerator_type": TRAIN_GPU.name,
        "accelerator_count": TRAIN_NGPU,
        "bucket": BUCKET_NAME,
        "project": PROJECT_ID,
        "region": REGION,
        "warmup": "True",
    },
)

pipeline.run()
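
The `args` list ends up as command-line arguments for `trainer.task`, so every element should be a string. One way to avoid mixing types is to build the list from a dictionary; the sketch below does that. `build_args` is a hypothetical helper, not part of the Vertex AI SDK, and the values shown are illustrative.

```python
def build_args(flags):
    """Flatten a {flag: value} mapping into a list of command-line strings."""
    args = []
    for flag, value in flags.items():
        args.extend([flag, str(value)])
    return args


warmup_args = build_args({
    "--epochs": 5,
    "--batch_size": 16,
    "--start_lr": 0.0001,
    "--warmup": True,
})
print(warmup_args)
```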

### Create a Vertex AI TensorBoard instance

Create a Vertex AI TensorBoard instance to use TensorBoard in conjunction with Vertex AI Training for custom model training.

Learn more about [Get started with Vertex AI TensorBoard](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview).

In [None]:
TENSORBOARD_DISPLAY_NAME = "chicago_" + TIMESTAMP
tensorboard = aip.Tensorboard.create(display_name=TENSORBOARD_DISPLAY_NAME)
tensorboard_resource_name = tensorboard.gca_resource.name
print("TensorBoard resource name:", tensorboard_resource_name)

### Compile and execute the model training pipeline

Next, you compile the pipeline and then execute it. The pipeline takes the following parameters, which are passed as the dictionary `parameter_values`:

- `dataset_id`: The full resource name of the corresponding Vertex AI Dataset.
- `model_id`: The full resource name of the corresponding Vertex AI Model architecture.
- `display_name`: The display name for the trained Vertex AI Model resource.
- `python_package`: The Python package for the custom training job.
- `python_module`: The Python module in the package to execute.
- `args`: The command line arguments to pass to the Python module.
    - *Note*: The pipeline uses the hyperparameters from the baseline model. Alternatively, one could use the hyperparameters from the current blessed model, or repeat hyperparameter tuning.
- `container_uri`: The training container image.
- `model_serving_container_image_uri`: The associated deployment container image.
- `machine_type`: The VM for executing the training job.
- `replica_count`: The number of virtual machines -- if doing distributed multi-machine training.
- `accelerator_type`: The type of HW accelerators -- if any.
- `accelerator_count`: The number of HW accelerators -- if any.
- `bucket`: The Cloud Storage location to store the model artifacts.
- `tensorboard`: The full resource name of a Vertex AI Tensorboard.
- `service_account`: The service account for the Tensorboard instance.
- `project`: The project ID.
- `region`: The region.

In [None]:
PIPELINE_ROOT = "{}/pipeline_root/model-train".format(BUCKET_NAME)

pipeline = aip.PipelineJob(
    display_name="model-train",
    template_path="model_train.json",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "dataset_id": dataset_id,
        "model_id": model_id,
        "display_name": "chicago" + TIMESTAMP,
        "python_package": f"{BUCKET_NAME}/trainer_chicago.tar.gz",
        "python_module": "trainer.task",
        "args": [
            "--dataset-id",
            dataset_id,
            "--model-id",
            model_id,
            "--experiment",
            "chicago" + TIMESTAMP,
            "--run",
            "retrain",
            "--epochs",
            str(int(baseline_metrics["num_epochs"])),
            "--batch_size",
            str(int(baseline_metrics["batch_size"])),
            "--lr",
            str(baseline_metrics["learning_rate"]),
            "--evaluate",
            "True",
            "--serving",
            "True",
            "--project",
            PROJECT_ID,
        ],
        "container_uri": TRAIN_IMAGE,
        "tensorboard": tensorboard.gca_resource.name,
        "service_account": SERVICE_ACCOUNT,
        "model_serving_container_image_uri": DEPLOY_IMAGE,
        "machine_type": TRAIN_COMPUTE,
        "replica_count": 1,
        "accelerator_type": TRAIN_GPU.name,
        "accelerator_count": TRAIN_NGPU,
        "bucket": BUCKET_NAME,
        "project": PROJECT_ID,
        "region": REGION,
    },
)

pipeline.run()

! rm -rf model_train.json custom.tar.gz custom

### View the training pipeline results

In [None]:
print("custompythonpackagetrainingjob-run-2")
artifacts = print_pipeline_output(pipeline, "custompythonpackagetrainingjob-run-2")
print("\n")
output = !gsutil cat $artifacts
output = json.loads(output[0])
model_id = output["artifacts"]["model"]["artifacts"][0]["metadata"]["resourceName"]
print("\n")
print(model_id)
print("\n")

print("endpoint-create")
artifacts = print_pipeline_output(pipeline, "endpoint-create")
print("\n")
print("model-deploy")
artifacts = print_pipeline_output(pipeline, "model-deploy")
print("\n")
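
The nested lookup of the model resource name is easier to follow against a concrete payload. The sketch below uses a hand-written JSON string that only mimics the nesting the cell indexes into; the project number and model ID are invented, and real pipeline output contains more fields. `extract_model_resource_name` is a hypothetical helper.

```python
import json

# Hand-written stand-in for the training component's output artifact file.
sample_output = json.dumps({
    "artifacts": {
        "model": {
            "artifacts": [
                {"metadata": {"resourceName": "projects/123/locations/us-central1/models/456"}}
            ]
        }
    }
})


def extract_model_resource_name(raw_json):
    """Return the model resource name from the training component's output JSON."""
    output = json.loads(raw_json)
    return output["artifacts"]["model"]["artifacts"][0]["metadata"]["resourceName"]


print(extract_model_resource_name(sample_output))
```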

### Get the experiment results

Next, you use the experiment name as a parameter to the method `get_experiment_df()` to get the results of the experiment as a pandas dataframe.

In [None]:
EXPERIMENT_NAME = "chicago" + TIMESTAMP

aip.init(experiment=EXPERIMENT_NAME)
experiment_df = aip.get_experiment_df()
experiment_df = experiment_df[experiment_df.experiment_name == EXPERIMENT_NAME]
experiment_df.T

#### Get the evaluation metrics of the trained model

Now that the model is trained, get and display the evaluation metric results.

In [None]:
model = aip.Model(model_id)
model_artifacts = model.gca_resource.artifact_uri

!gsutil cat {model_artifacts}/metrics.txt

In [None]:
delete_all = False

if delete_all:
    # Delete the dataset using the Vertex dataset object
    try:
        if "dataset" in globals():
            dataset.delete()
    except Exception as e:
        print(e)

    # Delete the model using the Vertex model object
    try:
        if "model" in globals():
            model.delete()
    except Exception as e:
        print(e)

    # delete the BQ table
    # delete the pipeline

    if "BUCKET_NAME" in globals():
        ! gsutil rm -r $BUCKET_NAME