# AI Platform Training with TFX and AI Platform Pipelines


This notebook-based tutorial creates and runs a TFX pipeline that trains an ML model using Cloud AI Platform Training and pushes the trained model to a Cloud Storage directory for serving.

This notebook is based on the TFX pipeline we built in
[Simple TFX Pipeline for Cloud AI Platform Pipelines Tutorial](https://colab.sandbox.google.com/drive/1XK_S1Op3XtSzxBa5psciGBoSdJBBcHJz?resourcekey=0-ZOyUzMvkNGWtjor_gpzQGQ).
If you have not read that tutorial yet, you should read it before proceeding with this notebook.

Google Cloud AI Platform is a fully managed, end-to-end platform for data science and machine learning. In addition to Pipelines, which we are already using, AI Platform offers a managed Training service, and the two can be used together with TFX.

In this tutorial, we will use AI Platform Training to train a model in a TFX pipeline.


## Set up
Before you run this notebook, ensure that your Google Cloud user account and
project are granted access to the Managed Pipelines Experimental. To be granted
access to the Managed Pipelines Experimental, fill out this
[form](http://go/cloud-mlpipelines-signup) and let your account representative
know you have requested access.

This notebook is intended to be run on
[Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) or on
[AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). See
the "AI Platform Notebooks" section in the Experimental
[User Guide](https://docs.google.com/document/d/1JXtowHwppgyghnj1N1CT73hwD1caKtWkLcm2_0qGBoI/edit?usp=sharing)
for more detail on creating a notebook server instance.

**To run this notebook on AI Platform Notebooks**, click the **File** menu and
select "Download .ipynb". Then upload the notebook from your local machine to
AI Platform Notebooks. (To upload, look for the upward-pointing arrow icon in
the left panel of AI Platform Notebooks.)


### Configure GCP project and install packages.


If you are running this notebook on Colab, authenticate with your user account:

In [None]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()


**If you are on AI Platform Notebooks**, authenticate with Google Cloud before
running the next section, by running
```sh
gcloud auth login
```
**in the Terminal window** (which you can open via **File** > **New** in the
menu). You only need to do this once per notebook instance.

We will install the required Python packages, including TFX and KFP, to author ML pipelines and submit jobs to AI Platform Pipelines.

In [None]:
AIPLATFORM_PIPELINES_CLIENT_WHEEL='aiplatform_pipelines_client-0.1.0.caip20210415.dev0-py3-none-any.whl'
!gsutil cp gs://cloud-aiplatform-pipelines/testing/{AIPLATFORM_PIPELINES_CLIENT_WHEEL} .
KFP_WHEEL='kfp-1.5.0rc5.tar.gz'
!gsutil cp gs://cloud-aiplatform-pipelines/testing/{KFP_WHEEL} .

In [None]:
if 'google.colab' in sys.modules:
  USER_FLAG = ''
else:
  USER_FLAG = '--user'

In [None]:
!pip install {USER_FLAG} pip --upgrade
!pip install {USER_FLAG} -i https://pypi-nightly.tensorflow.org/simple tfx[kfp]==0.30.0.dev20210418 "google-cloud-storage>=1.37.1" {KFP_WHEEL} {AIPLATFORM_PIPELINES_CLIENT_WHEEL} --upgrade

#### Did you restart the runtime?

If you are using Google Colab, the first time that you run
the cell above, you must restart the runtime by clicking
the "RESTART RUNTIME" button above or using the "Runtime > Restart
runtime ..." menu. This is because of the way that Colab
loads packages.

If you are not on Colab, you can restart the runtime with the following cell.

In [None]:
import sys
if 'google.colab' not in sys.modules:
  # Automatically restart kernel after installs
  import IPython
  app = IPython.Application.instance()
  app.kernel.do_shutdown(True)

In [None]:
# Make sure we are logged in after restart.
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()

Check the TensorFlow, TFX, and KFP versions.

In [None]:
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
import tfx
print('TFX version: {}'.format(tfx.__version__))
import kfp
print('KFP version: {}'.format(kfp.__version__))

### Set up variables

We will set up some variables used to customize the pipelines below. The following information is required:

* Your _Google Cloud Project_ id. See [Identifying your project id](https://cloud.google.com/resource-manager/docs/creating-managing-projects#identifying_projects).
* The name of a _Cloud Storage bucket_ (`GCS_BUCKET_NAME`) used to store the pipeline artifacts, the trainer module, and the data.
* An _API_KEY_ to trigger a pipeline on AI Platform Pipelines.
* GCP Region.

**Enter required values in the cell below before running it**.
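
If the assertion in that cell fails, it does not say which value is still empty. As a minimal sketch, a check like the following reports the missing names. `find_missing_settings` and `example_settings` are hypothetical names used for illustration only, and the values shown are placeholders.

```python
def find_missing_settings(settings):
  """Returns the names of settings whose value is empty or None."""
  return [name for name, value in settings.items() if not value]


# Placeholder values for illustration only; none of them are real.
example_settings = {
    'GOOGLE_CLOUD_PROJECT': 'my-project',
    'GCS_BUCKET_NAME': '',
    'API_KEY': 'not-a-real-key',
    'GOOGLE_CLOUD_REGION': 'us-central1',
}

missing = find_missing_settings(example_settings)
if missing:
  print('Please set: {}'.format(', '.join(missing)))
```

In the notebook, the same function could be called with the four variables defined in the cell below.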


In [None]:

GOOGLE_CLOUD_PROJECT = ''     # <--- ENTER THIS
GCS_BUCKET_NAME = ''          # <--- ENTER THIS
API_KEY = ''                  # <--- ENTER THIS
GOOGLE_CLOUD_REGION = 'us-central1'      # <--- ENTER THIS

assert GOOGLE_CLOUD_PROJECT and GCS_BUCKET_NAME and API_KEY and GOOGLE_CLOUD_REGION, 'Please set all required parameters.'

Set `gcloud` to use your project.

In [None]:
!gcloud config set project {GOOGLE_CLOUD_PROJECT}

In [None]:
PIPELINE_NAME = 'penguin-caip-ucaip-training'

# Path for pipeline artifacts.
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)

# Paths for users' Python module.
MODULE_ROOT = 'gs://{}/pipeline_module/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)

# Paths for users' data.
DATA_ROOT = 'gs://{}/data/{}'.format(GCS_BUCKET_NAME, PIPELINE_NAME)

# This is the path where your model will be pushed for serving.
SERVING_MODEL_DIR = 'gs://{}/serving_model/{}'.format(
    GCS_BUCKET_NAME, PIPELINE_NAME)

print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))

### Prepare example data
We will use the same
[Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) as the [Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple).

There are four numeric features in this dataset which were already normalized
to have range [0,1]. We will build a classification model which predicts the
`species` of penguins.

We need to make our own copy of the dataset. Because TFX ExampleGen reads
inputs from a directory, we need to create a directory on GCS and copy the
dataset to it.

In [None]:
!gsutil cp gs://download.tensorflow.org/data/palmer_penguins/penguins_processed.csv {DATA_ROOT}/

Take a quick look at the CSV file.

In [None]:
!gsutil cat {DATA_ROOT}/penguins_processed.csv | head

## Create a pipeline

Our pipeline will be very similar to the pipeline we created in 
[Simple TFX Pipeline for Cloud AI Platform Pipelines Tutorial](https://colab.sandbox.google.com/drive/1XK_S1Op3XtSzxBa5psciGBoSdJBBcHJz?resourcekey=0-ZOyUzMvkNGWtjor_gpzQGQ). The pipeline will consist of three components: CsvExampleGen, Trainer and Pusher. This time, however, we will set a special `Executor` for the Trainer. An `Executor` is the class that actually runs a component's workload.

TFX provides a special `Executor` that submits training jobs to the AI Platform Training service. All we have to do is specify this executor in the `Trainer` component along with some required GCP parameters.

The `Pusher` component keeps its default executor in this notebook and copies the trained model to a directory on Cloud Storage.

In this tutorial, we will run AI Platform Training jobs using CPUs only, and then with a GPU.

### Write model code.

The model itself is almost identical to the model in [Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple). We add a `_get_distribution_strategy()` function that creates a [TensorFlow distribution strategy](https://www.tensorflow.org/guide/distributed_training). `run_fn` calls it and builds the model under `MirroredStrategy` when `use_gpu` is set in the Trainer's `custom_config`.
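
The CPU/GPU branching can be illustrated without TensorFlow. The sketch below is not the module used in this tutorial: `scope_for`, `FakeStrategy`, and `build_model` are hypothetical names, and `FakeStrategy` only mimics the `scope()` context manager of a `tf.distribute` strategy. It shows one possible way to pick a scope from `custom_config` so that the model-building call is written only once.

```python
import contextlib


def scope_for(strategy):
  """Returns strategy.scope() if a strategy is given, else a no-op context."""
  if strategy is None:
    return contextlib.nullcontext()
  return strategy.scope()


class FakeStrategy:
  """Stand-in for a distribution strategy; records whether scope was entered."""

  def __init__(self):
    self.entered = False

  @contextlib.contextmanager
  def scope(self):
    self.entered = True
    yield


def build_model(custom_config, strategy_factory):
  """Builds a placeholder model inside the scope chosen from custom_config."""
  strategy = strategy_factory() if custom_config.get('use_gpu', False) else None
  with scope_for(strategy):
    model = 'model'  # Placeholder for the real model-building call.
  return model, strategy


_, cpu_strategy = build_model({}, FakeStrategy)
_, gpu_strategy = build_model({'use_gpu': True}, FakeStrategy)
print(cpu_strategy is None, gpu_strategy.entered)
```

The trainer module in this notebook uses an explicit `if`/`else` instead, which behaves the same way.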

In [None]:
_trainer_module_file = 'penguin_trainer.py'

In [None]:
%%writefile {_trainer_module_file}

# Copied from https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple and
# slightly modified run_fn() to add distribution_strategy.

from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_metadata.proto.v0 import schema_pb2
from tensorflow_transform.tf_metadata import schema_utils

from tfx.components.trainer import fn_args_utils
from tfx_bsl.tfxio import dataset_options

_FEATURE_KEYS = [
    'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g'
]
_LABEL_KEY = 'species'

_TRAIN_BATCH_SIZE = 20
_EVAL_BATCH_SIZE = 10

# Since we're not generating or creating a schema, we will instead create
# a feature spec.  Since there are a fairly small number of features this is
# manageable for this dataset.
_FEATURE_SPEC = {
    **{
        feature: tf.io.FixedLenFeature(shape=[1], dtype=tf.float32)
        for feature in _FEATURE_KEYS
    }, _LABEL_KEY: tf.io.FixedLenFeature(shape=[1], dtype=tf.int64)
}


def _input_fn(file_pattern: List[str],
              data_accessor: fn_args_utils.DataAccessor,
              schema: schema_pb2.Schema,
              batch_size: int) -> tf.data.Dataset:
  """Generates features and label for training.

  Args:
    file_pattern: List of paths or patterns of input tfrecord files.
    data_accessor: DataAccessor for converting input to RecordBatch.
    schema: schema of the input data.
    batch_size: representing the number of consecutive elements of returned
      dataset to combine in a single batch

  Returns:
    A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
  """
  return data_accessor.tf_dataset_factory(
      file_pattern,
      dataset_options.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()


def _make_keras_model() -> tf.keras.Model:
  """Creates a DNN Keras model for classifying penguin data.

  Returns:
    A Keras Model.
  """
  # The model below is built with Functional API, please refer to
  # https://www.tensorflow.org/guide/keras/overview for all API options.
  inputs = [keras.layers.Input(shape=(1,), name=f) for f in _FEATURE_KEYS]
  d = keras.layers.concatenate(inputs)
  for _ in range(2):
    d = keras.layers.Dense(8, activation='relu')(d)
  outputs = keras.layers.Dense(3)(d)

  model = keras.Model(inputs=inputs, outputs=outputs)
  model.compile(
      optimizer=keras.optimizers.Adam(1e-2),
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=[keras.metrics.SparseCategoricalAccuracy()])

  model.summary(print_fn=logging.info)
  return model


# NEW: Read `use_gpu` from the custom_config of the Trainer.
#      if it uses GPU, enable MirroredStrategy.
def _get_distribution_strategy(fn_args: fn_args_utils.FnArgs):
  if fn_args.custom_config.get('use_gpu', False):
    logging.info('Using MirroredStrategy with one GPU.')
    return tf.distribute.MirroredStrategy(devices=['device:GPU:0'])
  return None


# TFX Trainer will call this function.
def run_fn(fn_args: fn_args_utils.FnArgs):
  """Train the model based on given args.

  Args:
    fn_args: Holds args used to train the model as name/value pairs.
  """

  # This schema is usually either an output of SchemaGen or a manually-curated
  # version provided by pipeline author. A schema can also be derived from TFT
  # graph if a Transform component is used. In the case when either is missing,
  # `schema_from_feature_spec` could be used to generate schema from very simple
  # feature_spec, but the schema returned would be very primitive.
  schema = schema_utils.schema_from_feature_spec(_FEATURE_SPEC)

  train_dataset = _input_fn(
      fn_args.train_files,
      fn_args.data_accessor,
      schema,
      batch_size=_TRAIN_BATCH_SIZE)
  eval_dataset = _input_fn(
      fn_args.eval_files,
      fn_args.data_accessor,
      schema,
      batch_size=_EVAL_BATCH_SIZE)

  # NEW: If we have a distribution strategy, build a model in a strategy scope.
  strategy = _get_distribution_strategy(fn_args)
  if strategy is None:
    model = _make_keras_model()
  else:
    with strategy.scope():
      model = _make_keras_model()

  model.fit(
      train_dataset,
      steps_per_epoch=fn_args.train_steps,
      validation_data=eval_dataset,
      validation_steps=fn_args.eval_steps)

  # The result of the training should be saved in `fn_args.serving_model_dir`
  # directory.
  model.save(fn_args.serving_model_dir, save_format='tf')

Copy the module file to GCS so that the pipeline components can access it.

Alternatively, you could build a container image that includes the module file and use that image to run the pipeline and the AI Platform Training jobs.

In [None]:
!gsutil cp {_trainer_module_file} {MODULE_ROOT}/

### Write a pipeline definition

We will define a function to create a TFX pipeline. It has the same three Components as in [Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple), but we use a custom executor for `Trainer`.

`ai_platform_trainer_executor.GenericExecutor` behaves like the regular `trainer.GenericExecutor`, except that it moves the model training computation to the cloud. It launches a job in the Google Cloud AI Platform Training service, and the Trainer component in the orchestration system simply waits until that job completes.


In [None]:
from tfx.components import CsvExampleGen
from tfx.components import Pusher
from tfx.components import Trainer
from tfx.dsl.components.base import executor_spec
from tfx.extensions.google_cloud_ai_platform.trainer import executor as ai_platform_trainer_executor
from tfx.orchestration import pipeline
from tfx.proto import pusher_pb2
from tfx.proto import trainer_pb2


def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
                     module_file: str, serving_model_dir: str, project_id: str,
                     region: str, use_gpu: bool) -> pipeline.Pipeline:
  """Implements the penguin pipeline with TFX."""
  # Brings data into the pipeline or otherwise joins/converts training data.
  example_gen = CsvExampleGen(input_base=data_root)

  # NEW: Configuration for AI Platform Training.
  # This dictionary will be passed as `CustomJobSpec` to unified AI Platform Training.
  ai_platform_job_spec = {
      'project': project_id,
      'worker_pool_specs': [{
          'machine_spec': {
              'machine_type': 'n1-standard-4',
          },
          'replica_count': 1,
          'container_spec': {
              'image_uri': 'gcr.io/tfx-oss-public/tfx:{}'.format(tfx.__version__),
          },
      }],
  }
  if use_gpu:
    # See https://cloud.google.com/ai-platform-unified/docs/reference/rest/v1/MachineSpec#AcceleratorType
    # for available accelerator types.
    ai_platform_job_spec['worker_pool_specs'][0]['machine_spec'].update({
        'accelerator_type': 'NVIDIA_TESLA_K80',
        'accelerator_count': 1
    })

  # Trains a model using AI Platform Training.
  trainer = Trainer(
      module_file=module_file,
      examples=example_gen.outputs['examples'],
      train_args=trainer_pb2.TrainArgs(num_steps=100),
      eval_args=trainer_pb2.EvalArgs(num_steps=5),
      # NEW: We need to specify a custom executor with related configs.
      custom_executor_spec=executor_spec.ExecutorClassSpec(
          ai_platform_trainer_executor.GenericExecutor),
      custom_config={
          ai_platform_trainer_executor.ENABLE_UCAIP_KEY:
              True,
          ai_platform_trainer_executor.UCAIP_REGION_KEY:
              region,
          ai_platform_trainer_executor.TRAINING_ARGS_KEY:
              ai_platform_job_spec,
          'use_gpu':
              use_gpu,
      })
  
  # Pushes the model to a filesystem destination.
  pusher = Pusher(
      model=trainer.outputs['model'],
      push_destination=pusher_pb2.PushDestination(
          filesystem=pusher_pb2.PushDestination.Filesystem(
              base_directory=serving_model_dir)))

  components = [
      example_gen,
      trainer,
      pusher,
  ]

  return pipeline.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      components=components)


## Run the pipeline on AI Platform Pipelines.

We will use AI Platform Pipelines to run the pipeline as we did in [Simple TFX Pipeline for Cloud AI Platform Pipelines Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/gcp/caip_pipelines_simple).

We need to define a runner to actually run the pipeline.


In [None]:
import os
from tfx.orchestration import pipeline as tfx_pipeline
from tfx.orchestration.kubeflow.v2 import kubeflow_v2_dag_runner

PIPELINE_DEFINITION_FILE = PIPELINE_NAME + '_pipeline.json'

# We will write pipeline definition to PIPELINE_DEFINITION_FILE.
runner = kubeflow_v2_dag_runner.KubeflowV2DagRunner(
    config=kubeflow_v2_dag_runner.KubeflowV2DagRunnerConfig(
        project_id=GOOGLE_CLOUD_PROJECT
    ),
    output_filename=PIPELINE_DEFINITION_FILE)
_ = runner.run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_root=DATA_ROOT,
        module_file=os.path.join(MODULE_ROOT, _trainer_module_file),
        serving_model_dir=SERVING_MODEL_DIR,
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_REGION,
        # We will use CPUs only for now.
        use_gpu=False),
    write_out=True)

The generated definition file can be submitted using the AI Platform Pipelines client.
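
Before submitting, it can help to confirm that the file was written and parses as JSON. The following is a minimal sketch using only the standard library. `summarize_definition` is a hypothetical helper, and the demonstration file and its keys are made up, so nothing here assumes the real layout of the pipeline definition.

```python
import json
import os
import tempfile


def summarize_definition(path):
  """Loads a JSON file and returns its sorted top-level keys."""
  with open(path) as f:
    definition = json.load(f)
  if not isinstance(definition, dict):
    raise ValueError('Expected a JSON object at the top level.')
  return sorted(definition)


# Demonstration with a stand-in file. Its keys are invented for illustration
# and are not the real pipeline definition schema.
with tempfile.TemporaryDirectory() as tmp_dir:
  demo_path = os.path.join(tmp_dir, 'demo_pipeline.json')
  with open(demo_path, 'w') as f:
    json.dump({'name': 'demo', 'steps': []}, f)
  demo_keys = summarize_definition(demo_path)

print(demo_keys)
```

In the notebook, `summarize_definition(PIPELINE_DEFINITION_FILE)` could be called after `runner.run(...)` has written the file.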

In [None]:
from aiplatform.pipelines import client

pipelines_client = client.Client(
    project_id=GOOGLE_CLOUD_PROJECT,
    region=GOOGLE_CLOUD_REGION,
    api_key=API_KEY
)

pipelines_client.create_run_from_job_spec(PIPELINE_DEFINITION_FILE)

Now you can visit the link in the output above or
visit 'AI Platform (Unified) > Pipelines' in [Google Cloud Console](https://console.developers.google.com/) to see the
progress.

### Run the pipeline using GPU


Cloud AI Platform Training supports training using GPUs. See [Machine spec reference](https://cloud.google.com/ai-platform-unified/docs/reference/rest/v1/MachineSpec#AcceleratorType) for available options.

We already defined our pipeline to support GPU training. All we need to do is set `use_gpu` to True. The pipeline will then be created with a machine spec that includes one NVIDIA_TESLA_K80, and our model training code will use `tf.distribute.MirroredStrategy`.
Run the following cell to compile and submit the pipeline with `use_gpu=True`.
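
To make the difference between the two runs concrete, the sketch below rebuilds the job specification as a plain function and compares the CPU and GPU variants. `build_job_spec` is a hypothetical helper that mirrors the dictionary built in `_create_pipeline`. The project id and image URI are placeholders.

```python
def build_job_spec(project_id, image_uri, use_gpu,
                   accelerator_type='NVIDIA_TESLA_K80', accelerator_count=1):
  """Returns a job spec dict, adding an accelerator when use_gpu is True."""
  machine_spec = {'machine_type': 'n1-standard-4'}
  if use_gpu:
    machine_spec.update({
        'accelerator_type': accelerator_type,
        'accelerator_count': accelerator_count,
    })
  return {
      'project': project_id,
      'worker_pool_specs': [{
          'machine_spec': machine_spec,
          'replica_count': 1,
          'container_spec': {'image_uri': image_uri},
      }],
  }


# 'my-project' and the image URI are placeholder values.
cpu_spec = build_job_spec('my-project', 'gcr.io/example/image:tag', False)
gpu_spec = build_job_spec('my-project', 'gcr.io/example/image:tag', True)

cpu_machine = cpu_spec['worker_pool_specs'][0]['machine_spec']
gpu_machine = gpu_spec['worker_pool_specs'][0]['machine_spec']

# Keys that only appear in the GPU machine spec.
added_keys = sorted(set(gpu_machine) - set(cpu_machine))
print(added_keys)
```

Only the `machine_spec` entry changes between the two runs. The rest of the job specification stays the same.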


In [None]:
runner.run(
    _create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_root=DATA_ROOT,
        module_file=os.path.join(MODULE_ROOT, _trainer_module_file),
        serving_model_dir=SERVING_MODEL_DIR,
        project_id=GOOGLE_CLOUD_PROJECT,
        region=GOOGLE_CLOUD_REGION,
        # Updated: Use a GPU. The job spec adds one NVIDIA_TESLA_K80 and
        # the model code will use tf.distribute.MirroredStrategy.
        use_gpu=True),
    write_out=True)

pipelines_client.create_run_from_job_spec(PIPELINE_DEFINITION_FILE)

You can visit the link in the output above or
visit 'AI Platform (Unified) > Pipelines' in [Google Cloud Console](https://console.developers.google.com/) to see the
progress.

-----------------------------
Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.