In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This notebook is a revised version of an unpublished notebook from Juan Acevedo.

# E2E ML on GCP: MLOps stage 3 : formalization: get started with TFX pipelines


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_tfx_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage3/get_started_with_tfx_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/ml_ops/stage3/get_started_with_tfx_pipeline.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview


This tutorial demonstrates how to use Vertex AI for E2E MLOps on Google Cloud in production. This tutorial covers stage 3 : formalization: get started with TFX and Vertex AI Pipelines.

### Objective

In this tutorial, you learn how to use TensorFlow Extended (TFX) with `Vertex AI Pipelines`.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Pipelines`
- `Vertex AI Training`
- `Google Cloud Pipeline Components`
- `Dataflow`


The steps performed include:

- Create a TFX e2e pipeline.
- Execute the pipeline locally.
- Execute the pipeline on Google Cloud using `Vertex AI Training`.
- Execute the pipeline using `Vertex AI Pipelines`.

### Dataset

The dataset used for this tutorial is the [CIFAR10 dataset](https://www.tensorflow.org/datasets/catalog/cifar10) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). The version of the dataset is built into TensorFlow. The trained model predicts which of ten classes an image belongs to: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Dataflow

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and [Dataflow pricing](https://cloud.google.com/dataflow/pricing)
and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installations

Install the packages required for executing this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q
! pip3 install --upgrade google-cloud-pipeline-components $USER_FLAG -q
! pip3 install --upgrade tfx[kfp] $USER_FLAG -q
! pip3 install --upgrade tensorflow $USER_FLAG -q
! pip3 install --upgrade tensorflow-hub $USER_FLAG -q
! pip3 install --upgrade apache-beam[gcp] $USER_FLAG -q
! pip3 install -U tensorflow-io {USER_FLAG} -q
! pip3 install -U tensorflow-estimator==2.6.0 {USER_FLAG} -q

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### GPU runtime

*Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select* **Runtime > Change Runtime Type > GPU**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. Enable the APIs necessary to execute this notebook -- see cell below.

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$`.

### Enable APIs

You can enable the required APIs using `gcloud`.

In [None]:
! gcloud services enable compute.googleapis.com         \
                       containerregistry.googleapis.com  \
                       aiplatform.googleapis.com  \
                       cloudbuild.googleapis.com \
                       cloudfunctions.googleapis.com \
                       dataflow.googleapis.com

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.
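As a minimal sketch of that naming scheme, the snippet below appends a second-resolution timestamp to a base resource name. The helper `timestamped_name` and the base name `tfx-cifar10` are hypothetical and used only for illustration; the notebook itself only defines `TIMESTAMP`.

```python
from datetime import datetime


def timestamped_name(base: str, now: datetime) -> str:
    """Append a YYYYMMDDHHMMSS timestamp to a resource name."""
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


# A fixed datetime keeps the example reproducible; the notebook uses datetime.now().
example = timestamped_name("tfx-cifar10", datetime(2022, 3, 1, 9, 30, 5))
print(example)  # tfx-cifar10-20220301093005
```

Two users who start their sessions at different seconds get different resource names, which is enough to avoid collisions in a shared project.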

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

**Click Create service account**.

In the **Service account name** field, enter a name, and click **Create**.

In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex" into the filter box, and select **Vertex Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

Click Create. A JSON file that contains your key downloads to your local environment.

Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = False
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        IS_COLAB = True
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
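Because a default bucket name is derived from the project ID and a timestamp, it can help to sanity-check the result before calling `gsutil mb`. The check below is a rough sketch that covers only a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The function name is hypothetical, and the official naming guidelines contain further restrictions that are not modeled here.

```python
import re

# First and last characters must be alphanumeric; total length is 3 to 63.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes this simplified format check."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(looks_like_valid_bucket_name("my-project-aip-20220301093005"))
print(looks_like_valid_bucket_name("My-Bucket"))
```

A name that passes this check can still be rejected, for example because another project already owns it.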

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "aip-" + TIMESTAMP
    BUCKET_URI = "gs://" + BUCKET_NAME

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service Account

**If you don't know your service account**, try to get your service account using `gcloud` command by executing the second cell below.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    if IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step -- you only need to run these once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/dataflow.admin $BUCKET_URI
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/dataflow.worker $BUCKET_URI

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
from os import listdir

import google.cloud.aiplatform as aiplatform

from tfx import v1 as tfx
from tfx.components import (BulkInferrer, Evaluator, ExampleValidator,
                            ImportExampleGen, InfraValidator, Pusher,
                            SchemaGen, StatisticsGen, Trainer, Transform,
                            Tuner)
from tfx.dsl.components.common import resolver
from tfx.dsl.experimental import latest_blessed_model_resolver
from tfx.orchestration import metadata, pipeline
from tfx.proto import (bulk_inferrer_pb2, example_gen_pb2, pusher_pb2,
                       trainer_pb2)

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

#### Set hardware accelerators

You can set hardware accelerators for training and prediction.

Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:

    (aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)


Otherwise specify `(None, None)` to use a container image to run on a CPU.

Learn more about [hardware accelerator support for your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

*Note*: TF releases before 2.3 for GPU support will fail to load the custom model in this tutorial. It is a known issue and fixed in TF 2.3. This is caused by static graph ops that are generated in the serving function. If you encounter this issue on your own custom models, use a container image for TF 2.3 with GPU support.
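To illustrate how an accelerator pair like `(TRAIN_GPU, TRAIN_NGPU)` is typically consumed, here is a small sketch that builds a machine spec dictionary. The key names follow the `machine_spec` fields used by Vertex AI custom job worker pools; the helper `machine_spec` is hypothetical, and a plain string stands in for the `AcceleratorType` enum so the snippet runs without the SDK.

```python
from typing import Optional


def machine_spec(machine_type: str, gpu: Optional[str], ngpu: Optional[int]) -> dict:
    """Build a machine spec; accelerator fields are added only for GPU runs."""
    spec = {"machine_type": machine_type}
    if gpu and ngpu:
        spec["accelerator_type"] = gpu
        spec["accelerator_count"] = ngpu
    return spec


cpu_spec = machine_spec("n1-standard-4", None, None)
gpu_spec = machine_spec("n1-standard-4", "NVIDIA_TESLA_K80", 4)
print(cpu_spec)
print(gpu_spec)
```

With `(None, None)` the accelerator keys are simply omitted, which corresponds to the CPU-only case described above.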

In [None]:
# The SDK was imported above as `aiplatform`, so reference the enum through it.
if os.getenv("IS_TESTING_TRAIN_GPU"):
    TRAIN_GPU, TRAIN_NGPU = (
        aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_TRAIN_GPU")),
    )
else:
    TRAIN_GPU, TRAIN_NGPU = (None, None)

if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (None, None)

#### Set pre-built containers

Set the pre-built Docker container image for training and prediction.


For the latest list, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers).


For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).
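The image URIs used in this notebook follow a regular pattern, which the sketch below reproduces for TensorFlow 2.x only. The function name `prebuilt_image_uri` is hypothetical; the pattern itself mirrors the TF 2 branch of the cell that follows and is not a guarantee that a given image tag exists, so check the lists linked above.

```python
def prebuilt_image_uri(region: str, kind: str, tf_version: str, use_gpu: bool) -> str:
    """Compose a pre-built container URI; kind is 'training' or 'prediction'."""
    if kind not in ("training", "prediction"):
        raise ValueError(f"unknown kind: {kind}")
    # Training images use the 'tf' prefix, prediction images use 'tf2'.
    prefix = "tf" if kind == "training" else "tf2"
    device = "gpu" if use_gpu else "cpu"
    tag = tf_version.replace(".", "-")
    # The registry host uses the multi-region, i.e. the part before the first dash.
    multi_region = region.split("-")[0]
    return f"{multi_region}-docker.pkg.dev/vertex-ai/{kind}/{prefix}-{device}.{tag}:latest"


print(prebuilt_image_uri("us-central1", "training", "2.5", use_gpu=False))
print(prebuilt_image_uri("europe-west4", "prediction", "2.5", use_gpu=True))
```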

In [None]:
if os.getenv("IS_TESTING_TF"):
    TF = os.getenv("IS_TESTING_TF")
else:
    TF = "2.5".replace(".", "-")

if TF[0] == "2":
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf2-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf2-cpu.{}".format(TF)
else:
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf-cpu.{}".format(TF)

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)
DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)

print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)

#### Set machine type

Next, set the machine type to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure the compute resources for the VMs you will use for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.
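The rules above can be captured in a few lines. The following sketch composes a machine type name, rejects the combinations listed as unsupported for training, and reports the total memory implied by the per-vCPU figures. The helper `compute_spec` and the constant names are hypothetical, and the tables contain only the values stated in this section.

```python
MEMORY_GB_PER_VCPU = {"n1-standard": 3.75, "n1-highmem": 6.5, "n1-highcpu": 0.9}
VALID_VCPUS = (2, 4, 8, 16, 32, 64, 96)
# Combinations listed above as not supported for training.
UNSUPPORTED_FOR_TRAINING = {"n1-standard": {2}, "n1-highcpu": {2, 4, 8}}


def compute_spec(family: str, vcpus: int, for_training: bool = True):
    """Return (machine type name, total memory in GB) or raise ValueError."""
    if family not in MEMORY_GB_PER_VCPU:
        raise ValueError(f"unknown machine family: {family}")
    if vcpus not in VALID_VCPUS:
        raise ValueError(f"unsupported vCPU count: {vcpus}")
    if for_training and vcpus in UNSUPPORTED_FOR_TRAINING.get(family, set()):
        raise ValueError(f"{family}-{vcpus} is not supported for training")
    return f"{family}-{vcpus}", MEMORY_GB_PER_VCPU[family] * vcpus


print(compute_spec("n1-standard", 4))  # ('n1-standard-4', 15.0)
```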

In [None]:
if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

if os.getenv("IS_TESTING_DEPLOY_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_DEPLOY_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)

## Introduction to TensorFlow Extended (TFX)

A component is an implementation of an ML task that you can use as a step in your TFX pipeline. TFX provides several [standard components](https://www.tensorflow.org/tfx/guide#tfx_standard_components) that you can use in your pipelines. If these components do not meet your needs, you can build [custom components](https://www.tensorflow.org/tfx/guide/understanding_custom_components).

The outputs of steps in a TFX pipeline are artifacts. Subsequent steps in your workflow may use these artifacts as inputs. In this way, TFX lets you transfer data between workflow steps.

This tutorial will cover the following standard TFX components.

- ExampleGen
- StatisticsGen
- SchemaGen
- ExampleValidator
- Transform
- Trainer
- Tuner
- Evaluator
- InfraValidator
- Pusher
- BulkInferrer
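Because each component consumes artifacts emitted by earlier ones, a pipeline forms a dependency graph. The sketch below models the upstream relationships for the components whose inputs are described later in this notebook and derives one valid execution order with the standard library's `graphlib` (Python 3.9 or later). It is only an illustration of the idea, not how TFX schedules components, and the `Trainer` depending on `Tuner` reflects the optional hyperparameters input rather than a hard requirement.

```python
from graphlib import TopologicalSorter

# Map each component to the components whose artifacts it consumes.
upstream = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Tuner": {"Transform"},
    "Trainer": {"Transform", "Tuner"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(upstream).static_order())
print(order)
```

Several orders can be valid (for example, `ExampleValidator` and `Transform` are independent of each other), so the exact printed sequence may differ.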

<img src="https://g3doc.corp.google.com/cloud/sales/teams/sales_aiml_northam/g3doc/tfx-pipelines/img/tfx-components.png"/>

#### Location of Cloud Storage training data.

Next, you download a subset of the CIFAR-10 dataset as TFRecords.

In [None]:
! rm -rf custom
! mkdir custom

! wget https://github.com/tensorflow/tfx/raw/master/tfx/examples/cifar10/data/test/cifar10_test.tfrecord -P custom/data/test/
! wget https://github.com/tensorflow/tfx/raw/master/tfx/examples/cifar10/data/train/cifar10_train.tfrecord -P custom/data/train/
! wget https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/cifar10/data/labels.txt -P custom/data/

DATA_ROOT = "custom/data"

### Overview of ExampleGen component

The `ExampleGen` component ingests data into TFX pipelines. It consumes external files/services to generate examples which will be read by other TFX components. It also provides consistent and configurable partitioning, and shuffles the dataset following ML best practice.

* Consumes: Data from external data sources such as CSV, TFRecord, Avro, Parquet and BigQuery.

* Emits: Artifact of tf.Example records, tf.SequenceExample records, or proto format, depending on the payload format.

In this example, you call `ImportExampleGen()` with the following parameters:

- `input_base`: (optional) The location of the TFRecord dataset. Default is None.
- `input_config`: (optional) How the dataset is laid out at the input_base. Default is None. If unset, the files under input_base will be treated as a single split.
    - `splits`: How the dataset is split.
    
Additional parameters you may set:

- `output_config`: (optional) The output configuration. Default is None. If unset, default splits will be 'train' and 'eval' with size 2:1.
- `range_config`: (optional) Specifies the range of span values to consider. Default is None. If unset, driver will default to searching for latest span with no restrictions.
- `payload_format`: (optional) Payload format of input data. Should be one of example_gen_pb2.PayloadFormat enum. Default is `example_gen_pb2.FORMAT_TF_EXAMPLE`.
    
The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [ImportExampleGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/ImportExampleGen).
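The default 2:1 output split mentioned above can be pictured as hash bucketing: each record is hashed into one of three buckets, two of which go to `train` and one to `eval`. The sketch below illustrates that idea with `hashlib`; it is a stand-in under that assumption and does not reproduce the hash function or record keys TFX actually uses. The function `assign_split` is hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(record: bytes, buckets=(("train", 2), ("eval", 1))) -> str:
    """Deterministically map a record to a split by hashing it into buckets."""
    total = sum(size for _, size in buckets)
    digest = hashlib.sha256(record).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for name, size in buckets:
        upper += size
        if bucket < upper:
            return name
    raise ValueError("bucket sizes must be positive")


counts = Counter(assign_split(f"example-{i}".encode()) for i in range(3000))
print(counts)
```

Because the assignment depends only on the record contents, the same record always lands in the same split across runs, which is what makes the partition consistent.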

In [None]:
input_config = example_gen_pb2.Input(
    splits=[
        example_gen_pb2.Input.Split(name="train", pattern="train/*"),
        example_gen_pb2.Input.Split(name="eval", pattern="test/*"),
    ]
)

example_gen = ImportExampleGen(input_base=DATA_ROOT, input_config=input_config)

### Overview of the StatisticsGen component

The `StatisticsGen` component generates feature statistics over both training and serving data, which can be used by other pipeline components. `StatisticsGen` uses Apache Beam to scale to large datasets.

* Consumes: Dataset artifact created by an ExampleGen pipeline component.

* Emits: Dataset statistics artifact.

In this example, you call `StatisticsGen()` with the following parameters:

- `examples`: The dataset artifact from which to produce the dataset statistics artifact, i.e., `example_gen.outputs['examples']`.

Additional parameters you may set:

- `schema`: (optional) A schema channel to use for automatically configuring the value of `stats_options` passed to TensorFlow Data Validation (TFDV) library.
- `stats_options`: (optional) The StatsOptions instance to configure optional TFDV behavior. When `stats_options.schema` is set, it will be used instead of the schema channel input.
- `exclude_splits`: (optional) List of names of splits for which statistics should not be generated. Default is None.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [StatisticsGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/StatisticsGen).
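To make "dataset statistics" concrete, here is a toy, dependency-free sketch that computes per-feature counts, missing counts, and value ranges for one split of numeric features. It is only a stand-in for the idea: the real component delegates to TensorFlow Data Validation and produces a much richer statistics artifact. The function `feature_statistics` and the sample data are hypothetical.

```python
from statistics import mean


def feature_statistics(examples):
    """Summarize each numeric feature across a list of example dicts."""
    stats = {}
    for name in sorted({key for ex in examples for key in ex}):
        values = [ex[name] for ex in examples if name in ex]
        stats[name] = {
            "count": len(values),
            "missing": len(examples) - len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
    return stats


# Three labeled examples and one example with no label.
train = [{"label": 0}, {"label": 9}, {"label": 3}, {}]
stats = feature_statistics(train)
print(stats)
```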

In [None]:
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])

### Overview of the SchemaGen component

Some of the TFX components use a schema description of the input data. The schema is an instance of `schema.proto`. It can specify data types for feature values, whether a feature has to be present in all examples, allowed value ranges, and other properties. A `SchemaGen` pipeline component will automatically generate a schema by inferring types, categories, and ranges from the training data.

* Consumes: Dataset statistics artifact from a `StatisticsGen` component.

* Emits: Dataset schema artifact.

In this example, you call `SchemaGen()` with the following parameters:

- `statistics`: The dataset statistics artifact from which to produce a dataset schema artifact, i.e., `statistics_gen.outputs['statistics']`.
- `infer_feature_shape`: (optional) Whether to infer the feature shape. Defaults to True.

Additional parameters you may set:

- `exclude_splits`: (optional) List of names of splits to exclude from inferring the schema. Default is None.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [SchemaGen](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/SchemaGen).
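The kind of inference described above (types, presence, and ranges) can be sketched in plain Python. This toy version infers directly from example dictionaries, whereas `SchemaGen` works from the statistics artifact and emits a `schema.proto` instance; the function `infer_schema`, the dictionary layout, and the sample data are all hypothetical.

```python
def infer_schema(examples):
    """Infer a type, a required flag, and a numeric range for each feature."""
    schema = {}
    for name in sorted({key for ex in examples for key in ex}):
        values = [ex[name] for ex in examples if name in ex]
        type_names = {type(v).__name__ for v in values}
        entry = {
            "type": type_names.pop() if len(type_names) == 1 else "mixed",
            # A feature is required only if every example contains it.
            "required": len(values) == len(examples),
        }
        if entry["type"] in ("int", "float"):
            entry["min"], entry["max"] = min(values), max(values)
        schema[name] = entry
    return schema


train = [
    {"label": 0, "image": b"png-bytes"},
    {"label": 9, "image": b"png-bytes"},
    {"label": 3},
]
schema = infer_schema(train)
print(schema)
```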

In [None]:
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs["statistics"], infer_feature_shape=True
)

### Overview of the ExampleValidator component

The `ExampleValidator` component identifies anomalies in training and serving data. It can detect different classes of anomalies in the data. For example it can:

* Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.

* Detect training-serving skew by comparing training and serving data.

* Detect data drift by looking at a series of data.

The ExampleValidator component:

* Consumes: A schema artifact from a `SchemaGen` component and statistics artifact from a `StatisticsGen` component.

* Emits: Validation artifact.

In this example, you call `ExampleValidator()` with the following parameters:

- `statistics`: The dataset statistics artifact from `StatisticsGen` component.
- `schema`: The dataset schema artifact from `SchemaGen` component.

Additional parameters you may set:

- `exclude_splits`: (optional) List of names of splits to exclude from validating. Default is None.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [ExampleValidator](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/ExampleValidator).
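The first kind of check, comparing data against a schema, can be illustrated with a small sketch. The schema layout, the function `find_anomalies`, and the anomaly messages are hypothetical and much simpler than what TensorFlow Data Validation reports; the real component also compares statistics rather than individual examples.

```python
# A hand-written schema: labels must be present and be integers in [0, 9].
SCHEMA = {"label": {"type": "int", "required": True, "min": 0, "max": 9}}


def find_anomalies(examples, schema):
    """Return (example index, feature, reason) for every schema violation."""
    anomalies = []
    for index, example in enumerate(examples):
        for name, spec in schema.items():
            if name not in example:
                if spec.get("required"):
                    anomalies.append((index, name, "missing required feature"))
                continue
            value = example[name]
            if type(value).__name__ != spec["type"]:
                anomalies.append((index, name, "unexpected type"))
            elif "min" in spec and not spec["min"] <= value <= spec["max"]:
                anomalies.append((index, name, "out of range"))
    return anomalies


serving = [{"label": 3}, {"label": 12}, {"label": "cat"}, {}]
anomalies = find_anomalies(serving, SCHEMA)
print(anomalies)
```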

In [None]:
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs["statistics"], schema=schema_gen.outputs["schema"]
)

### Overview of the Transform component

The `Transform` component performs feature engineering on `tf.Examples` emitted from an `ExampleGen` component, using a data schema created by a `SchemaGen` component, and emits both a SavedModel as well as statistics on both pre-transform and post-transform data. When executed, the SavedModel will accept `tf.Examples` emitted from an `ExampleGen` component and emit the transformed feature data.

* Consumes: `tf.Examples` from an `ExampleGen` component, and a dataset schema from a `SchemaGen` component.

* Emits: A SavedModel for a `Trainer` component, pre-transform and post-transform statistics.

In this example, you call `Transform()` with the following parameters:

- `examples`: Dataset examples from a `ExampleGen` component.
- `schema`: A dataset schema from a `SchemaGen` component.
- `module_file`: The file path to a python module file, from which the `preprocessing_fn` function will be loaded. Exactly one of `module_file` or `preprocessing_fn` must be supplied.

Additional parameters you may set:

- `preprocessing_fn`: (optional) The path to a Python function that implements a `preprocessing_fn`.
- `splits_config`: (optional) Specifies splits that should be analyzed and transformed. Defaults to None.
- `analyze_cache`: (optional) When provided, Transform will try to use the cached calculation if possible. Defaults to None.
- `materialize`: (optional) If True, write transformed examples as an output. Defaults to True.
- `disable_analyzer_cache`: (optional) If False, Transform will use input cache if provided and write cache output. If True, analyzer_cache must not be provided. Defaults to False.
- `force_tf_compat_v1`: (optional) If True and/or TF2 behaviors are disabled Transform will use Tensorflow in compat.v1 mode irrespective of installed version of Tensorflow. Defaults to False. 
- `custom_config`: (optional) A dictionary which contains additional parameters that will be passed to preprocessing_fn. Defaults to None.
- `disable_statistics`: (optional) If True, do not invoke TFDV to compute pre-transform and post-transform statistics. When statistics are computed, they will be stored in the pre_transform_feature_stats/ and post_transform_feature_stats/ subfolders of the transform_graph export. Defaults to False.
- `stats_options_updater_fn`: (optional) The path to a python function that implements a 'stats_options_updater_fn'. See 'module_file' for expected signature of the function. 'stats_options_updater_fn' cannot be defined if 'module_file' is specified. Defaults to None.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [Transform](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Transform).

In [None]:
TRANSFORM_MODULE = "custom/transform.py"  # implements preprocessing_fn

transform = Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file=TRANSFORM_MODULE,
)

#### Write the transform module

Next, write the transform module for preprocessing the examples from the dataset.
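Two details of the module below are easy to check in isolation. First, `tf.keras.applications.mobilenet.preprocess_input` rescales pixel values from the range [0, 255] to [-1, 1]. Second, the image feature is renamed to `input_1` while other features get an `_xf` suffix. The dependency-free sketch that follows mirrors both rules; `scale_pixel` is a hypothetical helper that reproduces the scaling arithmetic on a single value rather than calling Keras.

```python
def scale_pixel(value: float) -> float:
    """Map a pixel value in [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0


def transformed_name(key: str, image_key: str = "image") -> str:
    """Mirror the module's naming rule for transformed features."""
    return "input_1" if key == image_key else key + "_xf"


print(scale_pixel(0), scale_pixel(127.5), scale_pixel(255))  # -1.0 0.0 1.0
print(transformed_name("image"), transformed_name("label"))  # input_1 label_xf
```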

In [None]:
%%writefile custom/transform.py
import tensorflow as tf

_IMAGE_KEY = 'image'
_LABEL_KEY = 'label'

def _transformed_name(key):
    # This makes it easier for continuous training since mobilenet's input layer is input_1
    if key == _IMAGE_KEY:
        return 'input_1'
    else:
        return key + '_xf'


# TFX Transform will call this function.
def preprocessing_fn(inputs):
    """tf.transform's callback function for preprocessing inputs.

      Args:
        inputs: map from feature keys to raw not-yet-transformed features.

      Returns:
        Map from string feature key to transformed feature operations.
    """
    outputs = {}

    # tf.io.decode_png function cannot be applied on a batch of data.
    # We have to use tf.map_fn
    image_features = tf.map_fn(
        lambda x: tf.io.decode_png(x[0], channels=3),
        inputs[_IMAGE_KEY],
        dtype=tf.uint8
    )
    image_features = tf.image.resize(image_features, [32, 32])
    image_features = tf.keras.applications.mobilenet.preprocess_input(image_features)

    outputs[_transformed_name(_IMAGE_KEY)] = image_features
    # Do not apply label transformation as it will result in wrong evaluation.
    outputs[_transformed_name(_LABEL_KEY)] = inputs[_LABEL_KEY]

    return outputs

### Overview of the Tuner component

The `Tuner` component tunes the hyperparameters for the model.

* Consumes:

    - `tf.Examples` used for training and evaluation.
    - A user provided module file (or module fn) that defines the tuning logic, including model definition, hyperparameter search space, objective, etc.
    - Protobuf definition of training and evaluation arguments.
    - (Optional) Protobuf definition of tuning arguments.
    - (Optional) Transform graph produced by an upstream Transform component.
    - (Optional) A data schema created by a SchemaGen pipeline component and optionally altered by the developer.

* Emits: the best hyperparameter results artifact.

In this example, you call `Tuner()` with the following parameters:

- `examples`: The training/evaluation examples from `ExampleGen`.
- `module_file`: (optional) A path to a Python module file containing the UDF tuner definition. The module file must implement a function named `tuner_fn` at its top level, with the signature `def tuner_fn(fn_args: FnArgs) -> TunerFnResult`. Exactly one of `module_file` or `tuner_fn` must be supplied.
- `transform_graph`: The input transform graph if present. This is used when transformed examples are provided.
- `train_args`: (optional) A trainer_pb2.TrainArgs instance, containing args used for training. Currently only splits and num_steps are available. Default behavior (when splits is empty) is train on train split.
- `eval_args`: (optional) A trainer_pb2.EvalArgs instance, containing args used for eval. Currently only splits and num_steps are available. Default behavior (when splits is empty) is evaluate on eval split.

Additional parameters you may set:

- `schema`: (optional) The schema for the training and evaluation data. This is used when raw examples are provided.
- `base_model`: (optional) The model that will be used for training. This can be used for warmstart, transfer learning or model ensembling.
- `tuner_fn`: (optional) A python path to UDF model definition function. See `module_file` for the required signature of the UDF. Exactly one of `module_file` or `tuner_fn` must be supplied.
- `tune_args`: (optional) A `tuner_pb2.TuneArgs` instance, containing args used for tuning. Currently only `num_parallel_trials` is available.
- `custom_config`: (optional) A dictionary which contains additional training job parameters that will be passed into the user module.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [Tuner](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Tuner).
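Conceptually, a tuner samples candidate hyperparameters from a search space, scores each candidate against an objective, and keeps the best one. The sketch below shows that loop as a plain random search. It is an illustration only: the TFX `Tuner` component builds on KerasTuner and trains a real model per trial, while `toy_objective`, `SEARCH_SPACE`, and `random_search` here are hypothetical stand-ins.

```python
import random

SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dense_units": [32, 64, 128],
}


def toy_objective(params: dict) -> float:
    """Stand-in for a validation loss; a real tuner would train a model here."""
    return abs(params["learning_rate"] - 1e-3) * 100 + abs(params["dense_units"] - 64) / 64


def random_search(space, objective, num_trials, seed=0):
    """Sample num_trials candidates and return the lowest-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(num_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(SEARCH_SPACE, toy_objective, num_trials=20)
print(best_params, best_score)
```

The best hyperparameters found this way play the role of the artifact that the `Tuner` emits for downstream use.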

In [None]:
from tfx.proto import trainer_pb2

TUNER_MODULE = "custom/tuner.py"  # implements tuner_fn

tuner = Tuner(
    module_file=TUNER_MODULE,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=trainer_pb2.TrainArgs(num_steps=20),
    eval_args=trainer_pb2.EvalArgs(num_steps=5),
)

#### Write the tuner module

Next, write the tuner module that performs hyperparameter tuning for the model.

In [None]:
%%writefile custom/tuner.py

# not implemented in this tutorial

### Overview of the Trainer component

The `Trainer` TFX pipeline component trains a TensorFlow model.

* Consumes:

    - `tf.Examples` used for training and eval
    - A user provided module file that defines the trainer logic
    - Protobuf definition of train args and eval args
    - (Optional) A data schema created by a SchemaGen pipeline component
    - (Optional) Transform graph produced by an upstream Transform component
    - (Optional) Pre-trained models used for scenarios such as warmstart
    - (Optional) Hyperparameters, which will be passed to the user module function.

* Emits: At least one model for inference/serving (typically in SavedModel format) and optionally another model for eval (typically an EvalSavedModel)

In this example, you call `Trainer()` with the following parameters:

- `module_file`: (optional) The file path to a python module file, from which the `run_fn` function will be loaded. Exactly one of `module_file` or `run_fn` must be supplied.
- `examples`: The source of examples used in training (required). May be raw or transformed.
- `transform_graph`: (optional) The input transform graph if present.
- `schema`: (optional) The schema for training and evaluation data.
- `train_args`: (optional) A `proto.TrainArgs` instance, containing args used for training. Currently only splits and num_steps are available. Default behavior (when splits is empty) is train on train split.
- `eval_args`: (optional) A `proto.EvalArgs` instance, containing args used for evaluation. Currently only splits and num_steps are available. Default behavior (when splits is empty) is evaluate on eval split.

Additional parameters you may set:

- `run_fn`: (optional) A python path to UDF model definition function for generic trainer. See `module_file` for details. Exactly one of `module_file` or `run_fn` must be supplied if Trainer uses GenericExecutor (default).
- `trainer_fn`: (optional) A python path to UDF model definition function for estimator based trainer. See `module_file` for the required signature of the UDF. Exactly one of `module_file` or `trainer_fn` must be supplied if `Trainer` uses Estimator based Executor.
- `hyperparameters`: (optional) The hyperparameters for training module. 
- `base_model`: (optional) The model that will be used for training. This can be used for warmstart, transfer learning or model ensembling.
- `custom_config`: (optional) A dict which contains additional training job parameters that will be passed into the user module.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [Trainer](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Trainer).

In [None]:
TRAINER_MODULE = "custom/train.py"  # implements run_fn

trainer = Trainer(
    module_file=TRAINER_MODULE,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"],
    # not implemented in this tutorial:
    # hyperparameters=tuner.outputs['best_hyperparameters'],
    # This will be passed to `run_fn`.
    train_args=trainer_pb2.TrainArgs(num_steps=100),
    eval_args=trainer_pb2.EvalArgs(num_steps=5),
)

#### Write the trainer module

Next, write the trainer module for training the model.

In [None]:
%%writefile custom/train.py
import os
from typing import List, Text
import absl
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow.keras.regularizers import L2

from tfx.components.trainer.fn_args_utils import DataAccessor
from tfx.components.trainer.fn_args_utils import FnArgs
from tfx.components.trainer.rewriting import converters
from tfx.components.trainer.rewriting import rewriter
from tfx.components.trainer.rewriting import rewriter_factory
from tfx.dsl.io import fileio
from tfx_bsl.tfxio import dataset_options

# When training on the whole dataset use following constants instead.
# This setting should give ~91% accuracy on the whole test set
# _TRAIN_DATA_SIZE = 50000
# _EVAL_DATA_SIZE = 10000
# _TRAIN_BATCH_SIZE = 64
# _EVAL_BATCH_SIZE = 64
# _CLASSIFIER_LEARNING_RATE = 3e-4
# _FINETUNE_LEARNING_RATE = 5e-5
# _CLASSIFIER_EPOCHS = 12

_TRAIN_DATA_SIZE = 1024
_EVAL_DATA_SIZE = 1024
_TRAIN_BATCH_SIZE = 32
_EVAL_BATCH_SIZE = 32
LEARNING_RATE = 1e-3
FINETUNE_LEARNING_RATE = 7e-6
EPOCHS = 30


_IMAGE_KEY = 'image'
_LABEL_KEY = 'label'

def _transformed_name(key):
    # This makes it easier for continuous training since mobilenet's input layer is input_1
    if key == _IMAGE_KEY:
        return 'input_1'
    else:
        return key + '_xf'
    
def _get_serve_image_fn(model, tf_transform_output):
  """Returns a function that feeds the input tensor into the model."""
  model.tft_layer = tf_transform_output.transform_features_layer()
  @tf.function
  def serve_image_fn(serialized_tf_examples):

    feature_spec = tf_transform_output.raw_feature_spec()
    feature_spec.pop(_LABEL_KEY)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)

    transformed_features = model.tft_layer(parsed_features)
    #del transformed_features[_transformed_name(_IMAGE_KEY)]

    return model(transformed_features)

  return serve_image_fn


def _image_augmentation(image_features):
  """Perform image augmentation on batches of images.

  Args:
    image_features: a batch of image features

  Returns:
    The augmented image features
  """
  batch_size = tf.shape(image_features)[0]
  image_features = tf.image.random_flip_left_right(image_features)
  image_features = tf.image.resize_with_crop_or_pad(image_features, 36, 36)
  image_features = tf.image.random_crop(image_features,
                                        (batch_size, 32, 32, 3))
  return image_features


def _data_augmentation(feature_dict):
  """Perform data augmentation on batches of data.

  Args:
    feature_dict: a dict containing features of samples

  Returns:
    The feature dict with augmented features
  """
  image_features = feature_dict[_transformed_name(_IMAGE_KEY)]
  image_features = _image_augmentation(image_features)
  feature_dict[_transformed_name(_IMAGE_KEY)] = image_features
  return feature_dict


def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              is_train: bool = False,
              batch_size: int = 200) -> tf.data.Dataset:
  """Generates features and label for tuning/training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    tf_transform_output: A TFTransformOutput.
    is_train: Whether the input dataset is train split or not.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  dataset = data_accessor.tf_dataset_factory(
      file_pattern,
      dataset_options.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_transformed_name(_LABEL_KEY)),
      tf_transform_output.transformed_metadata.schema)
  # Apply data augmentation. We have to do data augmentation here because
  # we need to apply data augmentation on-the-fly during training. If we put
  # it in Transform, it will only be applied once on the whole dataset, which
  # would defeat the purpose of data augmentation.
  if is_train:
    dataset = dataset.map(lambda x, y: (_data_augmentation(x), y))

  return dataset


def _build_keras_model() -> tf.keras.Model:
  """Creates an image classification model with a MobileNet backbone.

  Returns:
    The image classification Keras Model and the backbone MobileNet model
  """
  # We create a MobileNet model with weights pre-trained on ImageNet.
  # We remove the top classification layer of the MobileNet, which was
  # used for classifying ImageNet objects. We will add our own classification
  # layer for CIFAR10 later. We use average pooling at the last convolution
  # layer to get a 1D vector for classification, which is consistent with the
  # original MobileNet setup
  base_model = tf.keras.applications.MobileNet(
      input_shape=(32, 32, 3),
      include_top=False,
      weights='imagenet',
      pooling='avg')
  base_model.input_spec = None

  # freeze the layers of the base model
  base_model.trainable = False

  model = tf.keras.Sequential([
      tf.keras.layers.InputLayer(
          input_shape=(32, 32, 3), name=_transformed_name(_IMAGE_KEY)),
      base_model,
      tf.keras.layers.Dense(10, activation='softmax', kernel_regularizer=L2(0.001))
  ])

  model.compile(
      loss='sparse_categorical_crossentropy',
      optimizer=tf.keras.optimizers.RMSprop(learning_rate=LEARNING_RATE),
      metrics=['sparse_categorical_accuracy'])
  model.summary(print_fn=absl.logging.info)

  return model, base_model

# TFX Trainer will call this function.
def run_fn(fn_args: FnArgs):
    """Train the model based on given args.

    Args:
    fn_args: Holds args used to train the model as name/value pairs.

    Raises:
    ValueError: if invalid inputs.
    """
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)

    strategy = tf.distribute.MirroredStrategy(devices=['device:GPU:0'])

    baseline_path = fn_args.base_model
    if baseline_path is not None:
        with strategy.scope():
            model = tf.keras.models.load_model(os.path.join(baseline_path))
            # rename input layer to match transform name
            #model.layers[0].layers[0]._name = _transformed_name(_IMAGE_KEY)
    else:
        with strategy.scope():
            model, base_model = _build_keras_model()

    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=True,
        batch_size=_TRAIN_BATCH_SIZE)
    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=False,
        batch_size=_EVAL_BATCH_SIZE)

    absl.logging.info('TensorBoard logging to {}'.format(fn_args.model_run_dir))
    # Write logs to path
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=fn_args.model_run_dir, update_freq='batch')

    # Our training regime has two phases: we first freeze the backbone and train
    # the newly added classifier only, then unfreeze part of the backbone and
    # fine-tune with classifier jointly.
    steps_per_epoch = int(_TRAIN_DATA_SIZE / _TRAIN_BATCH_SIZE)

    absl.logging.info('Start training the top classifier')
    model.fit(
        train_dataset,
        epochs=EPOCHS,
        steps_per_epoch=fn_args.train_steps, #steps_per_epoch,
        validation_data=eval_dataset,
        validation_steps=fn_args.eval_steps,
        callbacks=[tensorboard_callback])

    # only if first training, finetune.
    if baseline_path is None:
        absl.logging.info('Start fine-tuning the model')
        # Unfreeze the top MobileNet layers and do joint fine-tuning
        base_model.trainable = True

        # We need to recompile the model because layer properties have changed
        model.compile(
            loss='sparse_categorical_crossentropy',
            optimizer=tf.keras.optimizers.RMSprop(learning_rate=FINETUNE_LEARNING_RATE),
            metrics=['sparse_categorical_accuracy'])
        model.summary(print_fn=absl.logging.info)

        model.fit(
            train_dataset,
            initial_epoch=EPOCHS,
            epochs=EPOCHS + 20,
            steps_per_epoch=steps_per_epoch,
            validation_data=eval_dataset,
            validation_steps=fn_args.eval_steps,
            callbacks=[tensorboard_callback])

    signatures = {
      'serving_default':
          _get_serve_image_fn(model,tf_transform_output).get_concrete_function(
              tf.TensorSpec(
                  shape=[None],
                  dtype=tf.string,
                  name=_IMAGE_KEY))
    }

    temp_saving_model_dir = os.path.join(fn_args.serving_model_dir)
    model.save(temp_saving_model_dir, save_format='tf', signatures=signatures)

### Overview of the Evaluator component

The Evaluator component performs deep analysis on the training results for your models, to help you understand how your model performs on subsets of your data. The Evaluator also helps you validate your exported models, ensuring that they are "good enough" to be pushed to production.

* Consumes:

    - An eval split from `ExampleGen`

    - A trained model from `Trainer`

    - (Optional) A previously blessed model

* Emits:

    - Analysis results to ML Metadata

    - Validation results to ML Metadata

In this example, you call `Evaluator()` with the following parameters:

- `examples`: The source of examples used for evaluation (required).
- `model`: (optional) The model produced by the `Trainer` component.
- `eval_config`: (optional) Instance of `tfma.EvalConfig` containing configuration settings for running the evaluation.

Additional parameters you may set:

- `baseline_model`: (optional) The baseline model for model diff and model validation purpose.
- `example_splits`: (optional) Names of splits on which the metrics are computed. Default behavior (when example_splits is set to None or Empty) is using the 'eval' split.
- `schema`: (optional) The schema for TFXIO.
- `module_file`: (optional) A path to python module file containing UDFs for Evaluator customization. 
- `module_path`: (optional) A python path to the custom module that contains the UDFs. See 'module_file' for the required signature of UDFs.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [Evaluator](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Evaluator).
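
The evaluation configuration in this example puts two thresholds on `SparseCategoricalAccuracy`: an absolute lower bound and a change threshold relative to a baseline model. The pure-Python sketch below illustrates how the two are intended to combine. It is not TFMA's implementation: the function name `passes_thresholds` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
from typing import Optional


def passes_thresholds(
    candidate: float,
    baseline: Optional[float] = None,
    lower_bound: float = 0.55,
    min_change: float = -1e-3,
) -> bool:
    """Return True if a higher-is-better metric satisfies both thresholds."""
    # Value threshold: the candidate metric must reach the lower bound.
    if candidate < lower_bound:
        return False
    # Change threshold: skipped when no baseline model exists (first run).
    if baseline is None:
        return True
    # The candidate may be at most |min_change| worse than the baseline.
    return (candidate - baseline) >= min_change


print(passes_thresholds(0.60))                 # True: first run, no baseline
print(passes_thresholds(0.60, baseline=0.70))  # False: regression vs. baseline
print(passes_thresholds(0.50, baseline=0.40))  # False: below the lower bound
```

With a negative `min_change`, a candidate that is marginally worse than the baseline can still pass, which tolerates small run-to-run noise.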

In [None]:
import tensorflow_model_analysis as tfma

LOWER_BOUND_VALIDATION = 0.55  # the metric threshold to validate the model.

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(
            metrics=[
                tfma.MetricConfig(
                    class_name="SparseCategoricalAccuracy",
                    threshold=tfma.MetricThreshold(
                        value_threshold=tfma.GenericValueThreshold(
                            lower_bound={"value": LOWER_BOUND_VALIDATION}
                        ),
                        # Change threshold will be ignored if there is no
                        # baseline model resolved from MLMD (first run).
                        change_threshold=tfma.GenericChangeThreshold(
                            direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                            absolute={"value": -1e-3},
                        ),
                    ),
                )
            ]
        )
    ],
)

evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    # baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config,
)

### Overview of the InfraValidator component

`InfraValidator` is a TFX component that serves as an early warning layer before a model is pushed to production. The name comes from the fact that it validates the model in the actual model-serving infrastructure. Whereas `Evaluator` guarantees the performance of the model, `InfraValidator` guarantees that the model is mechanically fine and prevents bad models from being pushed.

* Consumes: a trained model in SavedModel format from the `Trainer` component

* Emits: infra validation result artifact

In this example, you call `InfraValidator()` with the following parameters:

- `model`: The model produced by the `Trainer` component.
- `serving_spec`: A ServingSpec configuration about serving binary and test platform config to launch model server for validation. 

Additional parameters you may set:

- `examples`: (optional) The source of examples used for validating the infrastructure.
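- `request_spec`: (optional) A `RequestSpec` configuration about making requests from the `examples` input. If not specified, `InfraValidator` does not make a request for validation.
- `validation_spec`: (optional) A `ValidationSpec` configuration.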

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [InfraValidator](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/InfraValidator).

In [None]:
# Not implemented in this tutorial
try:
    infra_validator = InfraValidator(
        model=trainer.outputs["model"], serving_spec=tfx.proto.ServingSpec(...)
    )
except Exception as e:
    print(e)

### Overview of the Pusher component

The `Pusher` component is used to push a validated model to a deployment target during model training or re-training. Before the deployment, Pusher relies on one or more blessings from other validation components to decide whether to push the model or not.

* Consumes:

    - A trained model from the `Trainer` component
    - (Optional but recommended) A blessing from `InfraValidator`, issued if the model is mechanically servable in a production environment

* Emits: the same trained model along with versioning metadata

In this example, you call `Pusher()` with the following parameters:

- `model`: (optional) The model artifact from the `Trainer` component.
- `model_blessing`: (optional) The model evaluation artifact from the `Evaluator` component.
- `push_destination`: (optional) A pusher_pb2.PushDestination instance, providing info for tensorflow serving to load models.

Additional parameters you may set:

- `infra_blessing`: The infrastructure validation artifact from `InfraValidator`.
- `custom_config`: A dictionary which contains the deployment job parameters to be passed to Cloud platforms.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [Pusher](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/Pusher).

In [None]:
LOCAL_ROOT = "./"
_pipeline_name = "cifar10"
serving_model_dir_lite = os.path.join(LOCAL_ROOT, "serving_model_lite", _pipeline_name)


pusher = Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory=serving_model_dir_lite
        )
    ),
)

### Overview of the BulkInferrer component

The `BulkInferrer` TFX component performs batch inference on unlabeled data. The generated InferenceResult contains the original features and the prediction results.

* Consumes:

    - A trained model in SavedModel format
    - Unlabelled tf.Examples that contain features
    - (Optional) Validation results from Evaluator component


* Emits: The inference (prediction) results

In this example, you call `BulkInferrer()` with the following parameters:

- `examples`: The examples for inference, usually from an instance of `ExampleGen`.
- `model`: (optional) The model artifact from the `Trainer` component.
- `model_blessing`: (optional) The model evaluation artifact from the `Evaluator` component.
- `data_spec`: (optional) A `bulk_inferrer_pb2.DataSpec` instance that describes data selection.
- `model_spec`: (optional) A `bulk_inferrer_pb2.ModelSpec` instance that describes model specification.

Additional parameters you may set:

- `output_example_spec`: (optional) A `bulk_inferrer_pb2.OutputExampleSpec` instance. Specify it if you want `BulkInferrer` to output examples instead of inference results.

The output from this call is a compiled component that can be executed in the context of a pipeline.

Learn more about [BulkInferrer](https://www.tensorflow.org/tfx/api_docs/python/tfx/v1/components/BulkInferrer).

In [None]:
bulk_inferrer = BulkInferrer(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    data_spec=bulk_inferrer_pb2.DataSpec(),
    model_spec=bulk_inferrer_pb2.ModelSpec(),
)

## Define the TFX pipeline

Next, you define the TFX pipeline. In this tutorial, the pipeline consists of the following steps:

- `example_gen`: Generate tf.Examples for training and evaluation.
- `statistics_gen`: Generate the dataset statistics artifact.
- `schema_gen`: Generate the dataset schema artifact.
- `example_validator`: Analyze the dataset examples for anomalies.
- `transform`: Preprocess the dataset examples.
- `trainer`: Train the model from the preprocessed dataset examples.
- `evaluator`: Evaluate whether the model meets deployment criteria.
- `pusher`: Deploy the model.

The pipeline DAG is defined (and compiled) by instantiating a `pipeline.Pipeline`, with the following parameters:

- `pipeline_name`: The display name for the pipeline.
- `pipeline_root`: The storage location for writing the artifacts from each pipeline step.
- `components`: The ordered list of pipeline steps.
- `enable_cache`: Whether to reuse cached results from previous pipeline run. Defaults to False.
- `metadata_connection_config`: Settings for storing metadata information from the pipeline steps.
- `beam_pipeline_args`: The parameters for the Apache Beam job.
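
As a rough illustration of the DAG behind these steps, the sketch below writes down each step's upstream dependencies, as wired through the `outputs[...]` arguments earlier in this tutorial, and derives one valid execution order with the standard library. It is a toy model of the dependency graph, not how TFX schedules components internally.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key is a step; the value is the set of upstream steps it reads from.
dependencies = {
    "example_gen": set(),
    "statistics_gen": {"example_gen"},
    "schema_gen": {"statistics_gen"},
    "example_validator": {"statistics_gen", "schema_gen"},
    "transform": {"example_gen", "schema_gen"},
    "trainer": {"transform", "schema_gen"},
    "evaluator": {"example_gen", "trainer"},
    "pusher": {"trainer", "evaluator"},
}

# static_order() yields every step after all of its upstream steps.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Several orders can be valid (for example, `example_validator` and `transform` do not depend on each other), so the printed order is only one possibility.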

In [None]:
LOCAL_ROOT = "./"

_pipeline_name = "cifar10"
_pipeline_root = os.path.join(LOCAL_ROOT, "pipelines", _pipeline_name)
_metadata_path = os.path.join(LOCAL_ROOT, "metadata", _pipeline_name, "metadata.db")

_beam_pipeline_args = [
    "--direct_running_mode=multi_processing",
    "--direct_num_workers=0",
]


components = [
    example_gen,
    statistics_gen,
    schema_gen,
    example_validator,
    transform,
    trainer,
    evaluator,
    pusher,
]

tfx_pipeline = pipeline.Pipeline(
    pipeline_name=_pipeline_name,
    pipeline_root=_pipeline_root,
    components=components,
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        _metadata_path
    ),
    beam_pipeline_args=_beam_pipeline_args,
)

### Execute the TFX pipeline

Next, you execute the pipeline. In this tutorial, you use the Apache Beam `BeamDagRunner` to execute the TFX pipeline DAG.

In [None]:
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

BeamDagRunner().run(tfx_pipeline)

#### Visualizing artifact output from ImportExampleGen

Next, you view the output from `ImportExampleGen`, which consists of compressed archived files of the image data from the CIFAR-10 dataset.
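
The cells in this section select a run with `listdir(path)[0]`. Because `os.listdir` returns entries in arbitrary order, that may not be the most recent run once the pipeline has executed more than once. The following is a minimal sketch of a more deterministic choice, assuming each run sub-directory is named with an integer execution ID; the helper name `latest_run_dir` is hypothetical.

```python
import os
import tempfile


def latest_run_dir(artifact_dir: str) -> str:
    """Return the sub-directory with the highest numeric name."""
    runs = [d for d in os.listdir(artifact_dir) if d.isdigit()]
    if not runs:
        raise FileNotFoundError(f"no run directories under {artifact_dir}")
    # Compare as integers so that "12" sorts after "9".
    return os.path.join(artifact_dir, max(runs, key=int))


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for run_id in ("1", "9", "12"):
        os.makedirs(os.path.join(root, run_id))
    selected = os.path.basename(latest_run_dir(root))

print(selected)  # 12
```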

In [None]:
import os
from os import listdir

path = f"{_pipeline_root}/ImportExampleGen/examples"
examples_run = os.path.join(path, [f for f in listdir(path)][0])
train_examples_filepath = os.path.join(examples_run, "Split-train")
eval_examples_filepath = os.path.join(examples_run, "Split-eval")

print("Training TFRecords")
! ls {train_examples_filepath}
print("Evaluation TFRecords")
! ls {eval_examples_filepath}

#### Visualizing artifact output from StatisticsGen

You can visualize the `StatisticsGen` artifact and compare the train and eval datasets. This visualization gives you insights into the data, such as the data distribution and possible data skew.

In [None]:
import os
from os import listdir

import tensorflow_data_validation as tfdv

path = f"{_pipeline_root}/StatisticsGen/statistics/"
stats_run = os.path.join(path, [f for f in listdir(path)][0])
train_stats_filepath = os.path.join(stats_run, "Split-train", "FeatureStats.pb")
eval_stats_filepath = os.path.join(stats_run, "Split-eval", "FeatureStats.pb")

train_stats = tfdv.load_stats_binary(train_stats_filepath)
eval_stats = tfdv.load_stats_binary(eval_stats_filepath)

tfdv.visualize_statistics(
    train_stats, rhs_statistics=eval_stats, lhs_name="train", rhs_name="eval"
)

#### Visualizing artifact output from SchemaGen

Next, you view the output from `SchemaGen`, which is in protobuf format stored as a plain text file.

In [None]:
path = f"{_pipeline_root}/SchemaGen/schema/"
schema_run = os.path.join(path, [f for f in listdir(path)][0])

! cat {schema_run}/schema.pbtxt

#### Visualizing artifact output from ExampleValidator

Next, you view the output from `ExampleValidator`, which is in protobuf format.

In [None]:
path = f"{_pipeline_root}/ExampleValidator/anomalies/"
example_validator_run = os.path.join(path, [f for f in listdir(path)][0])

train_example_validator_filepath = os.path.join(
    example_validator_run, "Split-train", "SchemaDiff.pb"
)
eval_example_validator_filepath = os.path.join(
    example_validator_run, "Split-eval", "SchemaDiff.pb"
)

! ls {train_example_validator_filepath}
! ls {eval_example_validator_filepath}

#### Visualizing artifact output from Transform

Next, you view the output from `Transform`, which contains the following subfolders:

- `pre_transform_schema`
- `pre_transform_stats`
- `post_transform_anomalies`
- `post_transform_schema`
- `post_transform_stats`
- `transform_graph`
- `transformed_examples`

In [None]:
path = f"{_pipeline_root}/Transform/transform_graph/"
transform_graph_run = os.path.join(path, [f for f in listdir(path)][0])

print("Transform Graph")
! ls {transform_graph_run}

path = f"{_pipeline_root}/Transform/transformed_examples/"
transformed_examples_run = os.path.join(path, [f for f in listdir(path)][0])

transformed_examples_train_filepath = os.path.join(
    transformed_examples_run, "Split-train"
)
transformed_examples_eval_filepath = os.path.join(
    transformed_examples_run, "Split-eval"
)

print("Transformed Examples")
! ls {transformed_examples_train_filepath} {transformed_examples_eval_filepath}

#### Visualizing artifact output from Trainer

Next, you view the output from `Trainer`, which consists of the trained model artifacts.

In [None]:
path = f"{_pipeline_root}/Trainer/model/"
trainer_run = os.path.join(path, [f for f in listdir(path)][0])

model_artifacts = os.path.join(trainer_run, "Format-Serving")

! ls {model_artifacts}

#### Visualizing artifact output from Evaluator

Next, you view the output from `Evaluator`, which consists of two parts: `evaluation` and `blessing`.

The `evaluation` consists of the validation results stored as TFRecords.

The `blessing` consists of a single (empty) plain text file, whose name is either:

- `BLESSED`: Model passed the validation requirements.
- `NOT_BLESSED`: Model did not pass the validation requirements.
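
Since the verdict is encoded in a file name, it can be checked programmatically. The following is a minimal sketch, assuming the blessing run directory contains a file named exactly `BLESSED` when the model passed; the helper name `is_blessed` is hypothetical.

```python
import os
import tempfile


def is_blessed(blessing_dir: str) -> bool:
    """Return True if the directory contains a file named BLESSED."""
    return os.path.isfile(os.path.join(blessing_dir, "BLESSED"))


# Demonstration with a temporary directory standing in for a blessing run.
with tempfile.TemporaryDirectory() as tmp:
    before = is_blessed(tmp)
    # Create the empty marker file, as described above.
    open(os.path.join(tmp, "BLESSED"), "w").close()
    after = is_blessed(tmp)

print(before, after)  # False True
```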

In [None]:
path = f"{_pipeline_root}/Evaluator/evaluation/"
evaluation_run = os.path.join(path, [f for f in listdir(path)][0])

path = f"{_pipeline_root}/Evaluator/blessing/"
blessing_run = os.path.join(path, [f for f in listdir(path)][0])

print("Evaluation")
! ls {evaluation_run}
print("Blessing")
! ls {blessing_run}

#### Visualizing artifact output from Pusher

Next, you view the output from `Pusher`. If the model passed evaluation, then the output will contain the pushed model artifacts:

- `saved_model.pb`
- `keras_metadata.pb`
- `assets`
- `variables`

In [None]:
path = f"{_pipeline_root}/Pusher/pushed_model/"
pusher_run = os.path.join(path, [f for f in listdir(path)][0])

! ls {pusher_run}

#### Delete local temporary files

In [None]:
! rm -rf metadata pipelines

## Running the TFX pipeline as a `Vertex AI CustomJob`

In this section of the tutorial, you run your pre-existing TFX pipeline in Google Cloud using a `Vertex AI CustomJob`. The `CustomJob` service lets you run any Python package on Google Cloud and track the job under **Training->Custom Jobs**.

You perform the following steps:

1. Place the TFX pipeline code into a Python script.
2. Create a Python package containing:
    - Python scripts (TFX pipeline, transform/trainer/tune scripts)
    - setup and requirements
3. Compress and tar the package, and copy it to your Cloud Storage bucket.
4. Copy the training data to your Cloud Storage bucket.
5. Set up the worker pool specification.
6. Create and run the `CustomJob`.
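
To make the worker pool specification step more concrete, here is a hedged sketch of the dictionary shape that a `CustomJob` accepts for a Python package. The field names follow the Vertex AI `WorkerPoolSpec` and `PythonPackageSpec` messages; the helper name `build_worker_pool_specs` and every value passed to it (image URI, bucket, module name, machine type) are placeholders, not values from this tutorial.

```python
from typing import Any, Dict, List, Optional


def build_worker_pool_specs(
    executor_image_uri: str,
    package_uri: str,
    python_module: str,
    machine_type: str = "n1-standard-4",
    args: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Build a single-replica worker pool spec for a Python package job."""
    return [
        {
            "replica_count": 1,
            "machine_spec": {"machine_type": machine_type},
            "python_package_spec": {
                "executor_image_uri": executor_image_uri,
                "package_uris": [package_uri],
                "python_module": python_module,
                "args": list(args or []),
            },
        }
    ]


# All values below are placeholders, not values from this tutorial.
worker_pool_specs = build_worker_pool_specs(
    executor_image_uri="EXECUTOR_IMAGE_URI",
    package_uri="gs://my-bucket/trainer.tar.gz",
    python_module="trainer.task",
)
print(worker_pool_specs[0]["python_package_spec"]["python_module"])
```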

### Recommendation

Using a `CustomJob` to execute TFX pipelines on Google Cloud is recommended for smaller jobs, as opposed to converting the pipeline to a `Vertex AI Pipeline` and executing it there.

The way Vertex AI Pipelines allocates resources is not well suited to small jobs. When running on Vertex AI Pipelines, you run Dataflow jobs for ingesting data and for creating statistics, schemas, and transformations, and you run training as custom training jobs in Vertex AI.

### Write the TFX pipeline as a Python script

Next, you write your TFX pipeline code as a Python script, which will be run by the `CustomJob`.

*Note:* sqlite3 cannot open a database on a Cloud Storage path (i.e., `gs://`). You use Cloud Storage FUSE (GCSFuse) to make the bucket appear as a mounted filesystem under `/gcs/`.
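
The path rewriting that the pipeline script performs can be isolated in a small helper. The following is a minimal sketch, assuming the bucket is mounted under `/gcs/` as in the script; the helper name `to_gcsfuse_path` and the bucket name `my-bucket` are hypothetical.

```python
GS_PREFIX = "gs://"
GCSFUSE_PREFIX = "/gcs/"


def to_gcsfuse_path(path: str) -> str:
    """Map a gs:// URI to its /gcs/ mount path; leave other paths unchanged."""
    if path.startswith(GS_PREFIX):
        # Replace only the leading scheme, never a later occurrence.
        return GCSFUSE_PREFIX + path[len(GS_PREFIX):]
    return path


print(to_gcsfuse_path("gs://my-bucket/metadata/cifar10/metadata.db"))
# /gcs/my-bucket/metadata/cifar10/metadata.db
```

Local paths pass through unchanged, so the same helper works when the script runs outside Google Cloud.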

In [None]:
! mkdir -p custom/trainer
! gsutil cp custom/transform.py {BUCKET_URI}/transform.py
! gsutil cp custom/train.py {BUCKET_URI}/train.py

In [None]:
content = f"""
import os
import tensorflow as tf
import tensorflow_hub as hub

from tfx import v1 as tfx
from tfx.components import (ImportExampleGen, 
                            StatisticsGen, 
                            SchemaGen, 
                            ExampleValidator, 
                            Transform,
                            Trainer,
                            Tuner,
                            Evaluator,
                            InfraValidator,
                            Pusher,
                            BulkInferrer
                           )
from tfx.dsl.components.common import resolver
from tfx.orchestration import pipeline
from tfx.orchestration import metadata
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

from tfx.proto import example_gen_pb2
from tfx.proto import trainer_pb2
from tfx.proto import pusher_pb2
from tfx.proto import bulk_inferrer_pb2

import tensorflow_model_analysis as tfma

DATA_ROOT = "{BUCKET_URI}/data"
LOCAL_ROOT = "{BUCKET_URI}"

input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern='train/*'),
    example_gen_pb2.Input.Split(name='eval', pattern='test/*')
])

example_gen = ImportExampleGen(
    input_base=DATA_ROOT, 
    input_config=input_config
)

statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples']
)

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'], 
    infer_feature_shape=True
)

example_validator = ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_gen.outputs['schema']
)

TRANSFORM_MODULE = '{BUCKET_URI}/transform.py'  # implements preprocessing_fn

transform = Transform(
      examples=example_gen.outputs['examples'],
      schema=schema_gen.outputs['schema'],
      module_file=TRANSFORM_MODULE
)

TRAINER_MODULE = '{BUCKET_URI}/train.py'  # implements run_fn

trainer = Trainer(
    module_file=TRAINER_MODULE,   
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    
    # not implemented in this tutorial: 
    # hyperparameters=tuner.outputs['best_hyperparameters'],
    
    # This will be passed to `run_fn`.
    train_args=trainer_pb2.TrainArgs(num_steps=100),
    eval_args=trainer_pb2.EvalArgs(num_steps=5)
)

LOWER_BOUND_VALIDATION = 0.55  # the metric threshold to validate the model.

eval_config = tfma.EvalConfig(
      model_specs=[tfma.ModelSpec(label_key='label')],
      slicing_specs=[tfma.SlicingSpec()],
      metrics_specs=[
          tfma.MetricsSpec(metrics=[
              tfma.MetricConfig(
                  class_name='SparseCategoricalAccuracy',
                  threshold=tfma.MetricThreshold(
                      value_threshold=tfma.GenericValueThreshold(
                          lower_bound={{'value': LOWER_BOUND_VALIDATION}}),
                      # Change threshold will be ignored if there is no
                      # baseline model resolved from MLMD (first run).
                      change_threshold=tfma.GenericChangeThreshold(
                          direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                          absolute={{'value': -1e-3}})))
          ])
      ])

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    #baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config
)

_pipeline_name = 'cifar10'
serving_model_dir_lite = os.path.join(LOCAL_ROOT, 'serving_model_lite', _pipeline_name)

gs_prefix = 'gs://'
gcsfuse_prefix = '/gcs/'
if serving_model_dir_lite.startswith(gs_prefix):
    serving_model_dir_lite = serving_model_dir_lite.replace(gs_prefix, gcsfuse_prefix)
    dirpath = os.path.split(serving_model_dir_lite)[0]
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)


pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory=serving_model_dir_lite
        )
    )
)


_pipeline_root = LOCAL_ROOT + "/pipelines/" + _pipeline_name
_metadata_path = os.path.join(LOCAL_ROOT, 'metadata', _pipeline_name, 'metadata.db')

if _metadata_path.startswith(gs_prefix):
    _metadata_path = _metadata_path.replace(gs_prefix, gcsfuse_prefix)
    dirpath = os.path.split(_metadata_path)[0]
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
if _pipeline_root.startswith(gs_prefix):
    _pipeline_root = _pipeline_root.replace(gs_prefix, gcsfuse_prefix)
    dirpath = os.path.split(_pipeline_root)[0]
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)

_beam_pipeline_args = [
    '--direct_running_mode=multi_processing',
    '--direct_num_workers=0',
]


components = [
    example_gen, 
    statistics_gen, 
    schema_gen,
    example_validator,
    transform,
    trainer,
    evaluator,
    pusher
]

pipeline = pipeline.Pipeline(
    pipeline_name=_pipeline_name,
    pipeline_root=_pipeline_root,
    components=components,
    enable_cache=False,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(_metadata_path),
    beam_pipeline_args=_beam_pipeline_args
)

BeamDagRunner().run(pipeline)
"""

with open("custom/trainer/tfx_pipeline.py", "w") as f:
    f.write(content)

### Examine the TFX pipeline package

#### Package layout

Before you start the custom job for your TFX pipeline, you will look at how a Python package is assembled for a custom job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - \_\_init\_\_.py
  - tfx_pipeline.py
  - transform.py
  - tuner.py
  - train.py

The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.

The file `trainer/tfx_pipeline.py` is the Python script that the custom job executes to run the TFX pipeline. *Note*: when you refer to it in the worker pool specification, you replace the directory slash with a dot and drop the file suffix (`.py`), giving `trainer.tfx_pipeline`.
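
The path-to-module convention described above can be sketched with the standard library. The helper name `module_name_from_path` is hypothetical and only illustrates the rule; it is not part of the Vertex AI SDK.

```python
from pathlib import PurePosixPath


def module_name_from_path(script_path: str) -> str:
    """Convert a package-relative script path into a dotted module name."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py file, got {script_path!r}")
    # Drop the suffix, then join the remaining path components with dots.
    return ".".join(path.with_suffix("").parts)


module_name = module_name_from_path("trainer/tfx_pipeline.py")
print(module_name)  # trainer.tfx_pipeline
```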

#### Package Assembly

In the following cells, you will assemble the TFX pipeline package.

In [None]:
# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'tensorflow==2.5.0',\n\n        'tensorflow_hub',\n\n    'tfx',\n\n],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: CIFAR-10 TFX pipeline\n\nVersion: 0.0.0\n\nSummary: Demonstration TFX pipeline script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: aferlitsch@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex"
! echo "$pkg_info" > custom/PKG-INFO

! touch custom/trainer/__init__.py

#### Copy Python package and data to your bucket

Next, archive and compress the TFX pipeline package, and copy both the package and the data to your Cloud Storage bucket.

In [None]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_URI/trainer.tar.gz

! gsutil cp -r custom/data $BUCKET_URI/data
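
If you prefer to build the archive without shell commands, the `tar` and `gzip` steps can be sketched with Python's standard `tarfile` module. The function name `make_package_archive` and the throwaway directory layout are assumptions made for illustration; the copy to Cloud Storage is not covered here.

```python
import tarfile
import tempfile
from pathlib import Path


def make_package_archive(package_dir: str, archive_path: str) -> list:
    """Archive and gzip `package_dir`, returning the member names written."""
    package = Path(package_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name, as `tar cvf custom.tar custom` does.
        tar.add(package, arcname=package.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration on a temporary directory with a hypothetical, minimal layout.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "custom"
    (pkg / "trainer").mkdir(parents=True)
    (pkg / "setup.py").write_text("import setuptools\n")
    (pkg / "trainer" / "__init__.py").write_text("")
    members = make_package_archive(str(pkg), str(Path(tmp) / "custom.tar.gz"))

print(members)
```

Listing the members after writing is a quick way to confirm that `setup.py` and the `trainer` folder sit under the same top-level directory before you upload the archive.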

### Prepare your machine specification

Now define the machine specification for your custom job. This tells Vertex AI what type of machine instance to provision for running the TFX pipeline.
  - `machine_type`: The type of Google Cloud VM instance to provision, e.g., `n1-standard-8`.
  - `accelerator_type`: The type, if any, of hardware accelerator. In this tutorial, if you previously set the variable `TRAIN_GPU` to a value other than `None`, you are using a GPU; otherwise you will use a CPU.
  - `accelerator_count`: The number of accelerators.

In [None]:
if TRAIN_GPU:
    machine_spec = {
        "machine_type": TRAIN_COMPUTE,
        "accelerator_type": TRAIN_GPU,
        "accelerator_count": TRAIN_NGPU,
    }
else:
    machine_spec = {"machine_type": TRAIN_COMPUTE, "accelerator_count": 0}
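
The branching in the cell above can also be wrapped in a small function, which makes the CPU and GPU cases easy to compare. This is a minimal sketch: `build_machine_spec` is a hypothetical helper, and the machine and accelerator names are example values rather than settings taken from this tutorial.

```python
from typing import Optional


def build_machine_spec(
    machine_type: str,
    accelerator_type: Optional[str] = None,
    accelerator_count: int = 0,
) -> dict:
    """Return a machine spec, adding accelerator fields only when a GPU is requested."""
    if accelerator_type and accelerator_count > 0:
        return {
            "machine_type": machine_type,
            "accelerator_type": accelerator_type,
            "accelerator_count": accelerator_count,
        }
    # CPU-only: no accelerator type, zero accelerators.
    return {"machine_type": machine_type, "accelerator_count": 0}


cpu_spec = build_machine_spec("n1-standard-8")
gpu_spec = build_machine_spec("n1-standard-8", "NVIDIA_TESLA_T4", 1)
print(cpu_spec)
print(gpu_spec)
```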

### Prepare your disk specification

(Optional) Now define the disk specification for your custom job. This tells Vertex AI what type and size of disk to provision in each machine instance.

  - `boot_disk_type`: Either SSD or Standard. SSD is faster, and Standard is less expensive. Defaults to SSD.
  - `boot_disk_size_gb`: Size of disk in GB.

In [None]:
DISK_TYPE = "pd-ssd"  # [ pd-ssd, pd-standard]
DISK_SIZE = 200  # GB

disk_spec = {"boot_disk_type": DISK_TYPE, "boot_disk_size_gb": DISK_SIZE}

### Define the worker pool specification

Next, you define the worker pool specification for your custom job. The worker pool specification consists of the following:

- `replica_count`: The number of instances to provision of this machine type.
- `machine_spec`: The hardware specification.
- `disk_spec`: (optional) The disk storage specification.
- `python_package_spec`: The Python package to install on the VM instance(s) and which Python module to invoke, along with command line arguments for the Python module.

Let's dive deeper now into the Python package specification:

- `executor_image_uri`: The Docker image that is configured for your custom job.
- `package_uris`: A list of the locations (URIs) of your Python packages to install on the provisioned instance. The locations need to be in a Cloud Storage bucket. These can be either individual Python files or a zip (archive) of an entire package. In the latter case, the job service unzips (unarchives) the contents into the Docker image.
- `python_module`: The Python module (script) to invoke for running the custom job. In this example, you invoke `trainer.tfx_pipeline`; note that the `.py` suffix is not appended.
- `args`: The command line arguments to pass to the corresponding Python module.

In [None]:
CMDARGS = []

worker_pool_spec = [
    {
        "replica_count": 1,
        "machine_spec": machine_spec,
        "disk_spec": disk_spec,
        "python_package_spec": {
            "executor_image_uri": TRAIN_IMAGE,
            "package_uris": [BUCKET_URI + "/trainer.tar.gz"],
            "python_module": "trainer.tfx_pipeline",
            "args": CMDARGS,
        },
    }
]
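
Before submitting the job, it can help to sanity-check the Python package portion of the worker pool specification against the rules listed above. The following is a rough sketch only: `check_python_package_spec` is a hypothetical helper, the checks are not exhaustive, and the image and bucket names are placeholders.

```python
def check_python_package_spec(spec: dict) -> list:
    """Return a list of human-readable problems found in a python_package_spec dict."""
    problems = []
    if not spec.get("executor_image_uri"):
        problems.append("executor_image_uri is missing")
    package_uris = spec.get("package_uris", [])
    if not package_uris:
        problems.append("package_uris is empty")
    for uri in package_uris:
        # Packages must be staged in a Cloud Storage bucket.
        if not uri.startswith("gs://"):
            problems.append(f"package URI is not in Cloud Storage: {uri}")
    module = spec.get("python_module", "")
    if not module:
        problems.append("python_module is missing")
    elif module.endswith(".py") or "/" in module:
        problems.append(f"python_module should be dotted, without '.py': {module}")
    return problems


good_spec = {
    "executor_image_uri": "gcr.io/example-project/example-image:latest",  # placeholder
    "package_uris": ["gs://example-bucket/trainer.tar.gz"],  # placeholder
    "python_module": "trainer.tfx_pipeline",
    "args": [],
}
bad_spec = {"package_uris": ["trainer.tar.gz"], "python_module": "trainer/tfx_pipeline.py"}

print(check_python_package_spec(good_spec))
print(check_python_package_spec(bad_spec))
```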

## Create a custom job

Use the class `CustomJob` to create a custom job, such as for hyperparameter tuning, with the following parameters:

- `display_name`: A human readable name for the custom job.
- `worker_pool_specs`: The specification for the corresponding VM instances.

In [None]:
job = aiplatform.CustomJob(
    display_name="tfx_" + TIMESTAMP, worker_pool_specs=worker_pool_spec
)

### Execute the custom job

Next, you execute the `CustomJob` using the `run()` method, with the following parameters:

- `service_account`: The service account to execute the job.
- `sync`: Whether to block until the job completes (`True`) or return immediately and run the job asynchronously (`False`).

In [None]:
job.run(service_account=SERVICE_ACCOUNT, sync=True)

### Delete the custom job

The method `delete()` will delete the custom job.

In [None]:
job.delete()

#### Verify the metadata database was created

Next, verify that the metadata.db database file exists where you specified it to be written in your Cloud Storage bucket.

In [None]:
! gsutil ls {BUCKET_URI}/metadata/cifar10/metadata.db
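
The pipeline script rewrites `gs://` paths to `/gcs/` paths before writing the metadata database and pipeline root, which is why a file written under `/gcs/...` inside the job shows up under `gs://...` here. A minimal sketch of that mapping in both directions is shown below. The function names are hypothetical, and unlike the script's `str.replace` call, this version only rewrites a leading prefix.

```python
GS_PREFIX = "gs://"
GCSFUSE_PREFIX = "/gcs/"


def to_gcsfuse_path(uri: str) -> str:
    """Map a gs:// URI to the /gcs/ mount path; other paths are returned unchanged."""
    if uri.startswith(GS_PREFIX):
        return GCSFUSE_PREFIX + uri[len(GS_PREFIX):]
    return uri


def to_gs_uri(path: str) -> str:
    """Inverse mapping: a /gcs/ mount path back to a gs:// URI."""
    if path.startswith(GCSFUSE_PREFIX):
        return GS_PREFIX + path[len(GCSFUSE_PREFIX):]
    return path


# "example-bucket" is a placeholder bucket name.
fuse_path = to_gcsfuse_path("gs://example-bucket/metadata/cifar10/metadata.db")
print(fuse_path)  # /gcs/example-bucket/metadata/cifar10/metadata.db
print(to_gs_uri(fuse_path))  # gs://example-bucket/metadata/cifar10/metadata.db
```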

#### Visualizing artifact output from ImportExampleGen

Next, you view the output from `ImportExampleGen`, which consists of compressed TFRecord files containing the image data from the CIFAR-10 dataset.

In [None]:
_pipeline_root = f"{BUCKET_URI}/pipelines/cifar10"

path = f"{_pipeline_root}/ImportExampleGen/examples"
dirs = ! gsutil ls {path}
examples_run = dirs[-1]

train_examples_filepath = os.path.join(examples_run, "Split-train")
eval_examples_filepath = os.path.join(examples_run, "Split-eval")

print("Training TFRecords")
! gsutil ls {train_examples_filepath}
print("Evaluation TFRecords")
! gsutil ls {eval_examples_filepath}
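
The cells in this section take `dirs[-1]` as the most recent run. That holds as long as the listing order matches the run order. Assuming the run directories are named with integer IDs and the listing is sorted as text, a directory named `10` would sort before `2` once a component has run ten or more times. The sketch below picks the numerically largest run instead; `latest_run_dir` is a hypothetical helper and the bucket name is a placeholder.

```python
def latest_run_dir(listing: list) -> str:
    """Pick the directory whose trailing path component is the largest integer."""

    def last_component(uri: str) -> str:
        return uri.rstrip("/").rsplit("/", 1)[-1]

    # Keep only entries that end in a numeric run ID.
    numbered = [uri for uri in listing if last_component(uri).isdigit()]
    if not numbered:
        raise ValueError("no numbered run directories found")
    return max(numbered, key=lambda uri: int(last_component(uri)))


listing = [
    "gs://example-bucket/pipelines/cifar10/ImportExampleGen/examples/1/",
    "gs://example-bucket/pipelines/cifar10/ImportExampleGen/examples/10/",
    "gs://example-bucket/pipelines/cifar10/ImportExampleGen/examples/2/",
]
latest = latest_run_dir(listing)
print(latest)
```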

#### Visualizing artifact output from StatisticsGen

You can visualize the `StatisticsGen` artifacts and compare the train and eval datasets. The visualization gives you insight into the data, such as the distribution of each feature and possible skew between the two splits.

In [None]:
import os
from os import listdir

import tensorflow_data_validation as tfdv

path = f"{_pipeline_root}/StatisticsGen/statistics/"
dirs = ! gsutil ls {path}
stats_run = dirs[-1]
train_stats_filepath = os.path.join(stats_run, "Split-train", "FeatureStats.pb")
eval_stats_filepath = os.path.join(stats_run, "Split-eval", "FeatureStats.pb")

train_stats = tfdv.load_stats_binary(train_stats_filepath)
eval_stats = tfdv.load_stats_binary(eval_stats_filepath)

tfdv.visualize_statistics(
    train_stats, rhs_statistics=eval_stats, lhs_name="train", rhs_name="eval"
)

#### Visualizing artifact output from SchemaGen

Next, you view the output from `SchemaGen`, which is in protobuf format stored as a plain text file.

In [None]:
path = f"{_pipeline_root}/SchemaGen/schema/"
dirs = ! gsutil ls {path}
schema_run = dirs[-1]

! gsutil cat {schema_run}schema.pbtxt

#### Visualizing artifact output from ExampleValidator

Next, you view the output from `ExampleValidator`, which is in protobuf format.

In [None]:
path = f"{_pipeline_root}/ExampleValidator/anomalies/"
dirs = ! gsutil ls {path}
example_validator_run = dirs[-1]

train_example_validator_filepath = os.path.join(
    example_validator_run, "Split-train", "SchemaDiff.pb"
)
eval_example_validator_filepath = os.path.join(
    example_validator_run, "Split-eval", "SchemaDiff.pb"
)

! gsutil ls {train_example_validator_filepath}
! gsutil ls {eval_example_validator_filepath}

#### Visualizing artifact output from Transform

Next, you view the output from `Transform`, which contains the following subfolders:

- `pre_transform_schema`
- `pre_transform_stats`
- `post_transform_anomalies`
- `post_transform_schema`
- `post_transform_stats`
- `transform_graph`
- `transformed_examples`

In [None]:
path = f"{_pipeline_root}/Transform/transform_graph/"
dirs = ! gsutil ls {path}
transform_graph_run = dirs[-1]

print("Transform Graph")
! gsutil ls {transform_graph_run}

path = f"{_pipeline_root}/Transform/transformed_examples/"
dirs = ! gsutil ls {path}
transformed_examples_run = dirs[-1]

transformed_examples_train_filepath = os.path.join(
    transformed_examples_run, "Split-train"
)
transformed_examples_eval_filepath = os.path.join(
    transformed_examples_run, "Split-eval"
)

print("Transformed Examples")
! gsutil ls {transformed_examples_train_filepath} {transformed_examples_eval_filepath}

#### Visualizing artifact output from Trainer

Next, you view the output from `Trainer`, which consists of the trained model artifacts.

In [None]:
path = f"{_pipeline_root}/Trainer/model/"
dirs = ! gsutil ls {path}
trainer_run = dirs[-1]
model_artifacts = os.path.join(trainer_run, "Format-Serving")

! gsutil ls {model_artifacts}

#### Visualizing artifact output from Evaluator

Next, you view the output from `Evaluator`, which consists of two parts: `evaluation` and `blessing`.

The `evaluation` consists of the validation results stored as TFRecords.

The `blessing` consists of a single (empty) plain text file, whose name is either:

- `BLESSED`: Model passed the validation requirements.
- `NOT_BLESSED`: Model did not pass the validation requirements.

In [None]:
path = f"{_pipeline_root}/Evaluator/evaluation/"
dirs = ! gsutil ls {path}
evaluation_run = dirs[-1]

path = f"{_pipeline_root}/Evaluator/blessing/"
dirs = ! gsutil ls {path}
blessing_run = dirs[-1]

print("Evaluation")
! gsutil ls {evaluation_run}
print("Blessing")
! gsutil ls {blessing_run}
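
Because the blessing is encoded in a file name, you can derive the result from the listing alone. The sketch below assumes you have captured the listing as a list of URIs (for example with `listing = ! gsutil ls {blessing_run}`); `blessing_status` is a hypothetical helper and the URIs are placeholders.

```python
def blessing_status(listing: list) -> str:
    """Return 'BLESSED', 'NOT_BLESSED', or 'UNKNOWN' based on the file names listed."""
    names = {uri.rstrip("/").rsplit("/", 1)[-1] for uri in listing}
    if "BLESSED" in names:
        return "BLESSED"
    if "NOT_BLESSED" in names:
        return "NOT_BLESSED"
    return "UNKNOWN"


passed = ["gs://example-bucket/pipelines/cifar10/Evaluator/blessing/7/BLESSED"]
failed = ["gs://example-bucket/pipelines/cifar10/Evaluator/blessing/7/NOT_BLESSED"]
print(blessing_status(passed))  # BLESSED
print(blessing_status(failed))  # NOT_BLESSED
```

Comparing whole file names, instead of searching for the substring `BLESSED`, avoids misreading `NOT_BLESSED` as a pass.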

#### Visualizing artifact output from Pusher

Next, you view the output from `Pusher`. If the model passed evaluation, then the output will contain the pushed model artifacts:

- `saved_model.pb`
- `keras_metadata.pb`
- `assets`
- `variables`

In [None]:
path = f"{_pipeline_root}/Pusher/pushed_model/"
dirs = ! gsutil ls {path}
pusher_run = dirs[-1]

! gsutil ls {pusher_run}

## Execute TFX pipeline as `Vertex AI Pipeline`

The Vertex AI pipeline will run in the cloud and deploy a model to a Vertex AI endpoint. All metadata will be stored in the Vertex AI Metadata store.

With a few modifications to our code this pipeline can be deployed in Vertex.

- Define constants with cloud paths instead of local ones (done previously as CustomJob)
- Copy the data to our GCS bucket (done previously as CustomJob)
- Copy the Transform and Trainer module to our GCS bucket (done previously as CustomJob)
- Create Service Account Key (done previously as CustomJob)
- Change our Beam arguments to use the DataflowRunner
- Change the Trainer and Pusher components to use google_cloud_ai_platform extensions.

### Recommendation

For larger jobs, executing TFX pipelines with `Vertex AI Pipelines` is recommended over executing them as a `CustomJob`.

The way Vertex AI Pipelines allocates resources is best suited for large jobs. When running on Vertex AI Pipelines, the steps that ingest data and create statistics, schema, and transformations run as Dataflow jobs, and training runs as a custom training job in Vertex AI.

### Update Apache Beam arguments to use Dataflow

`Dataflow` is a serverless service for running Apache Beam pipelines, such as large-scale data preprocessing. You enable `Dataflow` as a runner by setting the parameter `--runner` to `DataflowRunner`.

Additional parameters set:

- `project`: Your project ID.
- `region`: Your region (location).
- `service_account_email`: Your service account.
- `machine_type`: The VM to provision.
- `disk_size_gb`: The amount of disk space for the provisioned VM.
- `temp_location`: The Cloud Storage location for Apache Beam to write temporary resources.

Learn more about [Dataflow](https://cloud.google.com/dataflow).

In [None]:
_beam_pipeline_args = [
    "--runner=DataflowRunner",
    f"--project={PROJECT_ID}",
    f"--temp_location={BUCKET_URI}/tmp/",
    f"--region={REGION}",
    "--disk_size_gb=50",
    f"--machine_type={TRAIN_COMPUTE}",
    f"--service_account_email={SERVICE_ACCOUNT}",
    "--experiments=use_runner_v2",
]
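
Building the argument list in a function makes it easier to catch a malformed `temp_location` before the pipeline is compiled. This is a sketch under stated assumptions: `dataflow_beam_args` is a hypothetical helper, the example values are placeholders, and the underscore spelling `--machine_type` reflects how Apache Beam's worker options are understood to declare that flag.

```python
def dataflow_beam_args(
    project: str,
    region: str,
    bucket_uri: str,
    service_account: str,
    machine_type: str,
    disk_size_gb: int = 50,
) -> list:
    """Assemble Beam pipeline arguments for running on Dataflow."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError("temp_location must be a Cloud Storage URI (gs://...)")
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--temp_location={bucket_uri.rstrip('/')}/tmp/",
        f"--region={region}",
        f"--disk_size_gb={disk_size_gb}",
        f"--machine_type={machine_type}",  # assumed flag spelling, see lead-in
        f"--service_account_email={service_account}",
    ]


# All values below are placeholders.
beam_args = dataflow_beam_args(
    project="example-project",
    region="us-central1",
    bucket_uri="gs://example-bucket",
    service_account="example-sa@example-project.iam.gserviceaccount.com",
    machine_type="n1-standard-4",
)
print("\n".join(beam_args))
```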

### Update the Trainer and Pusher components

Next, you update the Trainer and Pusher components to use TFX extensions for `Vertex AI`.

#### Update the Trainer component

You update the `Trainer` component call as follows:

- Add a `custom_config` with a `Vertex AI` job specification, which contains the `worker_pool_specs`.
- Replace the `tfx.components.Trainer` with the `Vertex AI` extension `tfx.extensions.google_cloud_ai_platform.Trainer`.

In [None]:
vertex_job_spec = {
    "project": PROJECT_ID,
    "worker_pool_specs": [
        {
            "machine_spec": {
                "machine_type": TRAIN_COMPUTE,
                "accelerator_type": TRAIN_GPU,
                "accelerator_count": TRAIN_NGPU,
            },
            "replica_count": 1,
            "container_spec": {"image_uri": TRAIN_IMAGE},
        }
    ],
}

train_custom_config = {
    tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY: vertex_job_spec,
    tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: REGION,  # must be us-central1 for vertexai
    tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
}

DATA_ROOT = os.path.join(BUCKET_URI, "data")
labels_path = os.path.join(DATA_ROOT, "labels.txt")

train_custom_config["labels_path"] = labels_path

TRAINER_MODULE = os.path.join(BUCKET_URI, "train.py")

trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
    module_file=TRAINER_MODULE,
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"],
    # base_model=model_resolver.outputs['model'],
    train_args=trainer_pb2.TrainArgs(num_steps=160),
    eval_args=trainer_pb2.EvalArgs(num_steps=4),
    custom_config=train_custom_config,
)

#### Update the Pusher component

You update the `Pusher` component call as follows:

- Update the `custom_config` with a `Vertex AI` serving specification.
- Replace the `tfx.components.Pusher` with the `Vertex AI` extension `tfx.extensions.google_cloud_ai_platform.Pusher`.

In [None]:
vertex_serving_spec = {
    "project_id": PROJECT_ID,
    "endpoint_name": "vertex-pipeline-cifar10",
    "machine_type": DEPLOY_COMPUTE,
}

pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    custom_config={
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: REGION,
        tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY: DEPLOY_IMAGE,
        tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY: vertex_serving_spec,
    },
)

### Update the TFX pipeline to run as a `Vertex AI Pipeline`

Finally, you will update your TFX pipeline code to run as a `Vertex AI Pipeline`.

In [None]:
content = f"""
import os
import logging
import absl

import tensorflow as tf
import tensorflow_hub as hub

from tfx import v1 as tfx
from tfx.components import (ImportExampleGen, 
                            StatisticsGen, 
                            SchemaGen, 
                            ExampleValidator, 
                            Transform,
                            Trainer,
                            Tuner,
                            Evaluator,
                            InfraValidator,
                            Pusher,
                            #BulkInferrer
                           )
from tfx.dsl.components.common import resolver
from tfx.orchestration import pipeline
from tfx.orchestration import metadata
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

from tfx.proto import example_gen_pb2
from tfx.proto import trainer_pb2
from tfx.proto import pusher_pb2
from tfx.proto import bulk_inferrer_pb2

import tensorflow_model_analysis as tfma

DATA_ROOT = "{BUCKET_URI}/data"
LOCAL_ROOT = "{BUCKET_URI}"

def create_pipeline():
    input_config = example_gen_pb2.Input(splits=[
        example_gen_pb2.Input.Split(name='train', pattern='train/*'),
        example_gen_pb2.Input.Split(name='eval', pattern='test/*')
    ])

    example_gen = ImportExampleGen(
        input_base=DATA_ROOT, 
        input_config=input_config
    )

    statistics_gen = StatisticsGen(
        examples=example_gen.outputs['examples']
    )

    schema_gen = SchemaGen(
        statistics=statistics_gen.outputs['statistics'], 
        infer_feature_shape=True
    )

    example_validator = ExampleValidator(
          statistics=statistics_gen.outputs['statistics'],
          schema=schema_gen.outputs['schema']
    )

    TRANSFORM_MODULE = '{BUCKET_URI}/transform.py'  # implements preprocessing_fn

    transform = Transform(
          examples=example_gen.outputs['examples'],
          schema=schema_gen.outputs['schema'],
          module_file=TRANSFORM_MODULE,
          force_tf_compat_v1=True
    )

    TRAINER_MODULE = '{BUCKET_URI}/train.py'  # implements run_fn

    vertex_job_spec = {{
          'project': '{PROJECT_ID}',
          'worker_pool_specs' : [
              {{
                'machine_spec': {{
                  'machine_type' : '{TRAIN_COMPUTE}',
                  'accelerator_type' : {TRAIN_GPU},  
                  'accelerator_count' : {TRAIN_NGPU}
                }},
                'replica_count' : 1,
                'container_spec' : {{
                  'image_uri' : '{TRAIN_IMAGE}'
                }}
              }}
          ]
    }}

    train_custom_config = {{
      tfx.extensions.google_cloud_ai_platform.TRAINING_ARGS_KEY : vertex_job_spec,
      tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY : '{REGION}', 
      tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY : True
    }}

    labels_path = os.path.join(DATA_ROOT, 'labels.txt')

    train_custom_config['labels_path'] = labels_path

    trainer = tfx.extensions.google_cloud_ai_platform.Trainer(
        module_file=TRAINER_MODULE,
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],

        # not implemented in this tutorial: 
        # hyperparameters=tuner.outputs['best_hyperparameters'],

        # This will be passed to `run_fn`.
        train_args=trainer_pb2.TrainArgs(num_steps=100),
        eval_args=trainer_pb2.EvalArgs(num_steps=5),
        # Run training as a Vertex AI custom training job.
        custom_config=train_custom_config
    )

    LOWER_BOUND_VALIDATION = 0.55  # the metric threshold to validate the model.

    eval_config = tfma.EvalConfig(
          model_specs=[tfma.ModelSpec(label_key='label')],
          slicing_specs=[tfma.SlicingSpec()],
          metrics_specs=[
              tfma.MetricsSpec(metrics=[
                  tfma.MetricConfig(
                      class_name='SparseCategoricalAccuracy',
                      threshold=tfma.MetricThreshold(
                          value_threshold=tfma.GenericValueThreshold(
                              lower_bound={{'value': LOWER_BOUND_VALIDATION}}),
                          # Change threshold will be ignored if there is no
                          # baseline model resolved from MLMD (first run).
                          change_threshold=tfma.GenericChangeThreshold(
                              direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                              absolute={{'value': -1e-3}})))
              ])
          ])

    evaluator = Evaluator(
        examples=example_gen.outputs['examples'],
        model=trainer.outputs['model'],
        #baseline_model=model_resolver.outputs['model'],
        eval_config=eval_config
    )

    _pipeline_name = 'cifar10'

    vertex_serving_spec = {{
        'project_id' : '{PROJECT_ID}',
          'endpoint_name' : 'vertex-pipeline-cifar10',
          'machine_type' : '{DEPLOY_COMPUTE}'
      }}

    pusher = tfx.extensions.google_cloud_ai_platform.Pusher(
      model=trainer.outputs['model'],
      model_blessing=evaluator.outputs['blessing'],
      custom_config={{
        tfx.extensions.google_cloud_ai_platform.ENABLE_VERTEX_KEY: True,
        tfx.extensions.google_cloud_ai_platform.VERTEX_REGION_KEY: '{REGION}',
        tfx.extensions.google_cloud_ai_platform.VERTEX_CONTAINER_IMAGE_URI_KEY: '{DEPLOY_IMAGE}',
        tfx.extensions.google_cloud_ai_platform.SERVING_ARGS_KEY: vertex_serving_spec
      }}
    )

    _pipeline_root = LOCAL_ROOT + "/pipelines/" + _pipeline_name
    _metadata_path = os.path.join(LOCAL_ROOT, 'metadata', _pipeline_name, 'metadata.db')

    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    try:
        if _metadata_path.startswith(gs_prefix):
            _metadata_path = _metadata_path.replace(gs_prefix, gcsfuse_prefix)
            dirpath = os.path.split(_metadata_path)[0]
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)
        if _pipeline_root.startswith(gs_prefix):
            _pipeline_root = _pipeline_root.replace(gs_prefix, gcsfuse_prefix)
            dirpath = os.path.split(_pipeline_root)[0]
            if not os.path.isdir(dirpath):
                os.makedirs(dirpath)
    except:
        pass

    _beam_pipeline_args = [
        '--runner=DataflowRunner',
        "--project={PROJECT_ID}",
        "--temp_location={BUCKET_URI}/tmp/",
        "--region={REGION}",
        '--disk_size_gb=50',
        '--machine_type={TRAIN_COMPUTE}',
        "--service_account_email={SERVICE_ACCOUNT}",
        '--experiments=use_runner_v2'
    ]

    components = [
        example_gen, 
        statistics_gen, 
        schema_gen,
        example_validator,
        transform,
        trainer,
        evaluator,
        pusher
    ]

    tfx_pipeline = pipeline.Pipeline(
        pipeline_name=_pipeline_name,
        pipeline_root=_pipeline_root,
        components=components,
        enable_cache=False,
        # metadata_connection_config=metadata.sqlite_metadata_connection_config(_metadata_path),  # Vertex AI tracks metadata in ML Metadata
        beam_pipeline_args=_beam_pipeline_args
    )
    
    return tfx_pipeline

# To compile this pipeline to JSON from the command line:
#   $ python tfx_pipeline.py
if __name__ == '__main__':

  loggers = [logging.getLogger(name) for name in logging.root.manager.loggerDict]
  for logger in loggers:
    logger.setLevel(logging.INFO)
  logging.getLogger().setLevel(logging.INFO)

  absl.logging.set_verbosity(absl.logging.FATAL)

  runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
    config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
    output_filename="cifar10-pipeline.json"
  )

  runner.run(
      create_pipeline()
  )
"""

with open("custom/trainer/tfx_pipeline.py", "w") as f:
    f.write(content)

### Generate the JSON pipeline file

Next, execute the script locally to generate the KFP pipeline JSON definition.

*Note*: This won't actually execute the pipeline. It only generates the JSON file.

In [None]:
! python3 custom/trainer/tfx_pipeline.py
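
To confirm what was compiled, you can list the task names in the generated JSON. The sketch below assumes the file nests tasks under `pipelineSpec` → `root` → `dag` → `tasks`; that layout is an assumption about the compiled format, and the sample document is hand-written for illustration, not real compiler output. `list_pipeline_tasks` is a hypothetical helper.

```python
import json
import os
import tempfile


def list_pipeline_tasks(pipeline_json_path: str) -> list:
    """Return the sorted task names found in a compiled pipeline JSON file."""
    with open(pipeline_json_path) as f:
        spec = json.load(f)
    # Assumed layout: pipelineSpec -> root -> dag -> tasks (a mapping keyed by task name).
    pipeline_spec = spec.get("pipelineSpec", spec)
    tasks = pipeline_spec.get("root", {}).get("dag", {}).get("tasks", {})
    return sorted(tasks)


# Hand-written sample that mimics the assumed layout.
sample = {"pipelineSpec": {"root": {"dag": {"tasks": {"Trainer": {}, "ImportExampleGen": {}}}}}}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample-pipeline.json")
    with open(sample_path, "w") as f:
        json.dump(sample, f)
    task_names = list_pipeline_tasks(sample_path)

print(task_names)  # ['ImportExampleGen', 'Trainer']
```

In the notebook you would call `list_pipeline_tasks("cifar10-pipeline.json")`. An empty list suggests that the file uses a different layout than the one assumed here.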

### Execute the `Vertex AI Pipeline`

Now that you have a compiled KFP JSON pipeline definition, you execute it as a `Vertex AI Pipeline` job.

The output will contain a link labeled "See your Pipeline job here". The link takes you to the pipeline run in `Vertex AI`, which looks like the image below.

<img src='https://g3doc.corp.google.com/cloud/sales/teams/sales_aiml_northam/g3doc/tfx-pipelines/img/vertex-ai-pipeline.png'/>

In [None]:
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root/cifar10"

job = aiplatform.PipelineJob(
    display_name="cifar10-tfx",
    template_path="cifar10-pipeline.json",
    pipeline_root=PIPELINE_ROOT,
    enable_caching=True,
)

job.run()

### View the pipeline results

In [None]:
PROJECT_NUMBER = job.gca_resource.name.split("/")[1]
print(PROJECT_NUMBER)


def print_pipeline_output(job, output_task_name):
    JOB_ID = job.name
    print(JOB_ID)
    # Candidate output files, checked in this order for each task.
    filenames = ("executor_output.json", "gcp_resources", "evaluation_metrics")
    for task in job.gca_resource.job_detail.task_details:
        task_dir = f"{PIPELINE_ROOT}/{PROJECT_NUMBER}/{JOB_ID}/{output_task_name}_{task.task_id}"
        for filename in filenames:
            candidate = f"{task_dir}/{filename}"
            # Check if the file exists; an exit code of 0 means success.
            !gsutil -q stat $candidate
            if _exit_code == 0:
                ! gsutil cat $candidate
                return candidate

    return ""


print("ImportExampleGen")
artifacts = print_pipeline_output(job, "ImportExampleGen")
print("\n\n")
print("StatisticsGen")
artifacts = print_pipeline_output(job, "StatisticsGen")
print("\n\n")
print("SchemaGen")
metrics = print_pipeline_output(job, "SchemaGen")
print("\n\n")
print("ExampleValidator")
artifacts = print_pipeline_output(job, "ExampleValidator")
print("\n\n")
print("Transform")
artifacts = print_pipeline_output(job, "Transform")
print("\n\n")
print("Trainer")
artifacts = print_pipeline_output(job, "Trainer")
print("\n\n")
print("Evaluator")
artifacts = print_pipeline_output(job, "Evaluator")
print("\n\n")
print("Pusher")
artifacts = print_pipeline_output(job, "Pusher")
print("\n\n")
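
The cell above takes the project number from the second path segment of the job's resource name. A slightly more defensive sketch parses the whole name, assuming the `projects/.../locations/.../pipelineJobs/...` format; `parse_pipeline_job_name` is a hypothetical helper and the IDs shown are made up.

```python
def parse_pipeline_job_name(resource_name: str) -> dict:
    """Split a pipeline job resource name into its project, location, and job ID."""
    parts = resource_name.strip("/").split("/")
    # Expect alternating collection names and IDs.
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "pipelineJobs"]:
        raise ValueError(f"unexpected resource name: {resource_name!r}")
    return {"project": parts[1], "location": parts[3], "pipeline_job": parts[5]}


parsed = parse_pipeline_job_name(
    "projects/123456789/locations/us-central1/pipelineJobs/cifar10-20220101000000"
)
print(parsed["project"])  # 123456789
print(parsed["pipeline_job"])  # cifar10-20220101000000
```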

### Delete the pipeline job

The method `delete()` will delete the pipeline job.

In [None]:
job.delete()

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

In [None]:
delete_bucket = False

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI

! rm -rf custom custom.tar.gz