# Continuous training with TFX and Cloud AI Platform

## Learning Objectives

1.  Use the TFX CLI to build a TFX pipeline.
2.  Deploy a TFX pipeline on the managed AI Platform service.
3.  Create and monitor TFX pipeline runs using the TFX CLI and KFP UI.

In this lab, you use the [TFX CLI](https://www.tensorflow.org/tfx/guide/cli) utility to build and deploy a TFX pipeline that uses [**Kubeflow Pipelines**](https://www.tensorflow.org/tfx/guide/kubeflow) for orchestration, **AI Platform** for model training, and a managed [AI Platform Pipelines instance (Kubeflow Pipelines)](https://www.tensorflow.org/tfx/guide/kubeflow) that runs on a Kubernetes cluster for compute. You will then create and monitor pipeline runs using both the TFX CLI and the KFP UI.

### Setup

In [1]:
import yaml

# Set `PATH` to include the directory containing TFX CLI and skaffold.
PATH=%env PATH
%env PATH=/home/jupyter/.local/bin:{PATH}

env: PATH=/home/jupyter/.local/bin:/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games


In [2]:
!python -c "import tfx; print('TFX version: {}'.format(tfx.__version__))"
!python -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"

TFX version: 0.25.0
KFP version: 1.0.4


**Note**: this lab was built and tested with the following package versions:

`TFX version: 0.25.0`  
`KFP version: 1.0.4`

(Optional) If running the above command results in different package versions or you receive an import error, upgrade to the correct versions by running the cell below:

In [None]:
%pip install --upgrade --user tfx==0.25.0
%pip install --upgrade --user kfp==1.0.4

Note: you may need to restart the kernel to pick up the correct package versions.

## Understanding the pipeline design
The pipeline source code can be found in the `pipeline` folder.

In [3]:
%cd pipeline

/home/jupyter/mlops-on-gcp/workshops/tfx-caip-tf23/lab-02-tfx-pipeline/pipeline


In [4]:
!ls -la

total 92
drwxr-xr-x 5 jupyter jupyter  4096 Jan 13 03:55 .
drwxr-xr-x 5 jupyter jupyter  4096 Jan 11 08:08 ..
drwxr-xr-x 2 jupyter jupyter  4096 Jan  8 02:45 .ipynb_checkpoints
-rw-r--r-- 1 jupyter jupyter    97 Jan  7 19:29 Dockerfile
drwxr-xr-x 2 jupyter jupyter  4096 Jan 11 08:08 __pycache__
-rw-r--r-- 1 jupyter jupyter   291 Jan  8 02:56 build.yaml
-rw-r--r-- 1 jupyter jupyter  1450 Jan 11 07:46 config.py
-rw-r--r-- 1 jupyter jupyter  1222 Dec 30 22:39 features.py
-rw-r--r-- 1 jupyter jupyter 10534 Jan 13 03:55 model.py
-rw-r--r-- 1 jupyter jupyter 11304 Jan 11 08:14 pipeline.py
-rw-r--r-- 1 jupyter jupyter  2032 Dec 30 22:39 preprocessing.py
-rw-r--r-- 1 jupyter jupyter  3387 Jan 11 08:07 runner.py
drwxr-xr-x 2 jupyter jupyter  4096 Dec 30 22:39 schema
-rw-r--r-- 1 jupyter jupyter  4667 Jan  8 02:45 tfx_covertype_continuous_training_7.tar.gz
-rw-r--r-- 1 jupyter jupyter  4665 Jan  8 02:55 tfx_covertype_continuous_training_8.tar.gz
-rw-r--r-- 1 jupyter jupyter  4705 Jan 11 08:08 tf

The `config.py` module configures the default values for the environment-specific settings and for the pipeline runtime parameters.
The default values can be overridden at compile time by providing updated values in a set of environment variables.
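
The override mechanism described above can be sketched as follows. This is a minimal illustration, not the contents of the lab's `config.py`: the `load_settings` helper and the `DEFAULTS` table are hypothetical, and the default values are simply the ones used later in this notebook.

```python
import os

# Illustrative defaults taken from values used later in this lab;
# the real config.py may define different names and defaults.
DEFAULTS = {
    "GCP_REGION": "us-central1",
    "MODEL_NAME": "tfx_covertype_classifier",
    "RUNTIME_VERSION": "2.3",
    "PYTHON_VERSION": "3.7",
}


def load_settings(environ=None):
    """Return the settings, letting environment variables override defaults."""
    # Fall back to the process environment when no mapping is supplied.
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```

Calling `load_settings()` with no argument reads `os.environ`, so any variable exported with `%env` before compilation takes precedence over the default.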

The `pipeline.py` module contains the TFX DSL defining the workflow implemented by the pipeline.

The `preprocessing.py` module implements the data preprocessing logic for the `Transform` component.

The `model.py` module implements the training logic for the `Trainer` component.

The `runner.py` module configures and executes `KubeflowDagRunner`. At compile time, the `KubeflowDagRunner.run()` method converts the TFX DSL into the pipeline package in the [Argo](https://argoproj.github.io/argo/) format.

The `features.py` module contains feature definitions common across `preprocessing.py` and `model.py`.


## Building and deploying the pipeline

You will use the TFX CLI to compile and deploy the pipeline. As explained in the previous section, the environment-specific settings can be provided through a set of environment variables and are embedded into the pipeline package at compile time.

### Exercise: Create AI Platform Pipelines cluster

Navigate to [AI Platform Pipelines](https://console.cloud.google.com/ai-platform/pipelines/clusters) page in the Google Cloud Console.

**1.  Create or select an existing Kubernetes cluster (GKE) and deploy AI Platform Pipelines**. Make sure to select `"Allow access to the following Cloud APIs https://www.googleapis.com/auth/cloud-platform"` so that the Kubeflow SDK has programmatic access to your pipeline for the rest of the lab. Also, provide an `App instance name` such as "tfx" or "mlops". Note that you may have already deployed an AI Platform Pipelines instance during the Setup for the lab series. If so, you can use that instance in the next step.

Validate the deployment of your AI Platform Pipelines instance in the console before proceeding.

**2. Configure your environment settings**.

Update the constants below with the settings that reflect your lab environment.

- `GCP_REGION` - the compute region for AI Platform Training and Prediction
- `ARTIFACT_STORE_URI` - the GCS bucket created during installation of AI Platform Pipelines. The bucket name contains `kubeflowpipelines`, for example `[PROJECT_ID]-kubeflowpipelines-default`.

In [5]:
# Use the following command to identify the GCS bucket for metadata and pipeline storage.
!gsutil ls

gs://artifacts.dougkelly-sandbox.appspot.com/
gs://dougkelly-sandbox/
gs://dougkelly-sandbox-kubeflowpipelines-default/
gs://dougkelly-sandbox-msc-demos/
gs://msc-bqml-demos/
gs://msc-demos/
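
If a project has many buckets, the candidate artifact store can be picked out of the listing programmatically. The sketch below assumes only what the listing above shows, namely that the bucket name contains `kubeflowpipelines`; `find_kfp_buckets` is a hypothetical helper, not part of the lab code.

```python
def find_kfp_buckets(gsutil_output):
    """Return bucket URIs whose name contains 'kubeflowpipelines'."""
    # One bucket URI per non-empty line; drop the trailing slash.
    buckets = [
        line.strip().rstrip("/")
        for line in gsutil_output.splitlines()
        if line.strip()
    ]
    return [bucket for bucket in buckets if "kubeflowpipelines" in bucket]
```

In a notebook, `listing = !gsutil ls` captures the output as a list of lines, so `find_kfp_buckets("\n".join(listing))` would return the matching URIs.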


- `ENDPOINT` - the endpoint of your AI Platform Pipelines instance. You can find it on the [AI Platform Pipelines](https://console.cloud.google.com/ai-platform/pipelines/clusters) page in the Google Cloud Console. Open the *SETTINGS* for your instance and use the value of the `host` variable in the *Connect to this Kubeflow Pipelines instance from a Python client via Kubeflow Pipelines SDK* section of the *SETTINGS* window. The format is `'....[region].pipelines.googleusercontent.com'`.
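
A copy-and-paste slip in the endpoint is easy to make, so a quick shape check before using it can save a failed CLI call. The sketch below is an assumption-laden convenience: the regular expression is derived only from the format string quoted above, and `normalize_endpoint` is a hypothetical helper.

```python
import re

# Assumed shape: a single host label followed by the fixed domain suffix.
ENDPOINT_PATTERN = re.compile(r"^[\w-]+\.pipelines\.googleusercontent\.com$")


def normalize_endpoint(value):
    """Strip any URL scheme and trailing slash, then check the hostname shape."""
    host = re.sub(r"^https?://", "", value.strip()).rstrip("/")
    if not ENDPOINT_PATTERN.match(host):
        raise ValueError("Unexpected endpoint format: {!r}".format(value))
    return host
```

For example, passing the `host` value copied from the console with a leading `https://` returns the bare hostname that the `--endpoint` flag is given in this notebook.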

In [6]:
GCP_REGION = 'us-central1'
ENDPOINT = '60ff837483ecde05-dot-us-central2.pipelines.googleusercontent.com'
ARTIFACT_STORE_URI = 'gs://dougkelly-sandbox-kubeflowpipelines-default'

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]

### Compile the pipeline

You can build and upload the pipeline to the AI Platform Pipelines instance in one step, using the `tfx pipeline create` command. The `tfx pipeline create` command goes through the following steps:
- (Optional) Builds the custom image that provides a runtime environment for TFX components.
- Compiles the pipeline DSL into a pipeline package.
- Uploads the pipeline package to the instance.

As you debug the pipeline DSL, you may prefer to first use the `tfx pipeline compile` command, which only executes the compilation step. After the DSL compiles successfully you can use `tfx pipeline create` to go through all steps.
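
The compile-then-create workflow can also be scripted outside a notebook. The sketch below only assembles the argument lists; it uses the flags that appear in this lab (`--pipeline_path`, `--engine`, `--endpoint`, `--build_target_image`), and `tfx_pipeline_command` itself is a hypothetical helper rather than part of TFX.

```python
def tfx_pipeline_command(action, pipeline_path, endpoint=None, engine=None,
                         build_target_image=None):
    """Build the argument list for a `tfx pipeline <action>` invocation."""
    if action not in ("compile", "create", "update"):
        raise ValueError("Unsupported action: {!r}".format(action))
    args = ["tfx", "pipeline", action, "--pipeline_path", pipeline_path]
    # Optional flags are appended only when a value was provided.
    if engine:
        args += ["--engine", engine]
    if endpoint:
        args += ["--endpoint", endpoint]
    if build_target_image:
        args += ["--build_target_image", build_target_image]
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)`, so a failed compile stops the script before the create step is attempted.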

#### Set the pipeline's compile time settings

The pipeline can run using the security context of the GKE default node pool's service account or of the service account defined in the `user-gcp-sa` secret of the Kubernetes namespace hosting Kubeflow Pipelines. To use the `user-gcp-sa` service account, change the value of `USE_KFP_SA` to `True`.

Note that the default AI Platform Pipelines configuration does not define the `user-gcp-sa` secret.

In [7]:
PIPELINE_NAME = 'tfx_covertype_continuous_training_9'
MODEL_NAME = 'tfx_covertype_classifier'

USE_KFP_SA=False
DATA_ROOT_URI = 'gs://workshop-datasets/covertype/small'
CUSTOM_TFX_IMAGE = 'gcr.io/{}/{}'.format(PROJECT_ID, PIPELINE_NAME)
RUNTIME_VERSION = '2.3'
PYTHON_VERSION = '3.7'

In [8]:
%env PROJECT_ID={PROJECT_ID}
%env KUBEFLOW_TFX_IMAGE={CUSTOM_TFX_IMAGE}
%env ARTIFACT_STORE_URI={ARTIFACT_STORE_URI}
%env DATA_ROOT_URI={DATA_ROOT_URI}
%env GCP_REGION={GCP_REGION}
%env MODEL_NAME={MODEL_NAME}
%env PIPELINE_NAME={PIPELINE_NAME}
%env RUNTIME_VERSION={RUNTIME_VERSION}
%env PYTHON_VERSION={PYTHON_VERSION}
%env USE_KFP_SA={USE_KFP_SA}

env: PROJECT_ID=dougkelly-sandbox
env: KUBEFLOW_TFX_IMAGE=gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9
env: ARTIFACT_STORE_URI=gs://dougkelly-sandbox-kubeflowpipelines-default
env: DATA_ROOT_URI=gs://workshop-datasets/covertype/small
env: GCP_REGION=us-central1
env: MODEL_NAME=tfx_covertype_classifier
env: PIPELINE_NAME=tfx_covertype_continuous_training_9
env: RUNTIME_VERSION=2.3
env: PYTHON_VERSION=3.7
env: USE_KFP_SA=False


In [13]:
!tfx pipeline compile --engine kubeflow --pipeline_path runner.py

CLI
Compiling pipeline
Pipeline compiled successfully.
Pipeline package path: /home/jupyter/mlops-on-gcp/workshops/tfx-caip-tf23/lab-02-tfx-pipeline/pipeline/tfx_covertype_continuous_training_9.tar.gz


Note: you should see a `{PIPELINE_NAME}.tar.gz` file (here, `tfx_covertype_continuous_training_9.tar.gz`) appear in your current directory.
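
To confirm what the compile step produced, you can list the contents of the package without extracting it. This is a minimal sketch using only the standard library. `list_package_members` is a hypothetical helper, and the expectation that the archive holds a single Argo workflow YAML file is an assumption to verify against your own package.

```python
import tarfile


def list_package_members(package_path):
    """Return the file names stored in a compiled pipeline package (.tar.gz)."""
    with tarfile.open(package_path, "r:gz") as archive:
        return archive.getnames()
```

For example, `list_package_members('tfx_covertype_continuous_training_9.tar.gz')` returns the member names, which you can then open to inspect the generated workflow definition.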

### Deploy the pipeline package to AI Platform Pipelines

After the pipeline code compiles without errors, you can use the `tfx pipeline create` command to perform the full build and deploy the pipeline. The TFX CLI builds the custom image, e.g. `gcr.io/[PROJECT_ID]/tfx_covertype_continuous_training`, and uploads the compiled pipeline package to run on AI Platform Pipelines.

In [42]:
!tfx pipeline create  \
--pipeline_path=runner.py \
--endpoint={ENDPOINT} \
--build_target_image={CUSTOM_TFX_IMAGE}

CLI
Creating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Target image gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9 is not used. If the build spec is provided, update the target image in the build spec file build.yaml.
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9 -> gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9: Not found. Building
[Skaffold] Building [gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9]...
[Skaffold] Sending build context to Docker daemon  103.9kB
[Skaffold] Step 1/4 : FROM tensorflow/tfx:0.25.0
[Skaffold]  ---> 05d9b228cf63
[Skaffold] Step 2/4 : WORKDIR ./pipeline
[Skaffold]  ---> Using cache
[Skaffold]  ---> fb2b71cc2724
[Skaffold] Step 3/4 : COPY ./ ./
[Skaffold]  ---> 302ba8f

Hint: you should see a `build.yaml` file in your pipeline folder created by skaffold. 

If you need to redeploy the pipeline you can first delete the previous version using `tfx pipeline delete` or you can update the pipeline in-place using `tfx pipeline update`.

To delete the pipeline:

`tfx pipeline delete --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}`

To update the pipeline:

`tfx pipeline update --pipeline_path runner.py --endpoint {ENDPOINT}`

In [14]:
!tfx pipeline update --pipeline_path runner.py --endpoint {ENDPOINT}

CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
[Skaffold] Generating tags...
[Skaffold]  - gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9 -> gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9:latest
[Skaffold] Checking cache...
[Skaffold]  - gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9: Not found. Building
[Skaffold] Building [gcr.io/dougkelly-sandbox/tfx_covertype_continuous_training_9]...
[Skaffold] Sending build context to Docker daemon    106kB
[Skaffold] Step 1/4 : FROM tensorflow/tfx:0.25.0
[Skaffold]  ---> 05d9b228cf63
[Skaffold] Step 2/4 : WORKDIR ./pipeline
[Skaffold]  ---> Using cache
[Skaffold]  ---> fb2b71cc2724
[Skaffold] Step 3/4 : COPY ./ ./
[Skaffold]  ---> ad662fe5f281
[Skaffold] Step 4/4 : ENV PYTHONPATH="/pipeline:${PYTHONPATH}"
[Skaffold]  ---> Running in 7559e147f8aa
[Skaffold] Removing intermediate container 7559e147f8aa
[Skaffold] 

### Create and monitor a pipeline run
After the pipeline has been deployed, you can trigger and monitor pipeline runs using TFX CLI or KFP UI.

In [19]:
enable_tuning = 'True'
components = ['examplegen', 'trainer']

if enable_tuning:
    components.append('tuner')

components

['examplegen', 'trainer', 'tuner']

In [21]:
bool(enable_tuning)

True
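
In the two cells above, `enable_tuning` is the string `'True'`, and `bool()` of any non-empty string is `True`, so the string `'False'` would also append the tuner. The same caveat applies to flags exported with `%env`, such as `USE_KFP_SA`, which reach the pipeline code as strings. A minimal sketch of explicit parsing follows; `parse_bool` is a hypothetical helper, not part of the lab code, and the accepted spellings are an assumption.

```python
def parse_bool(value):
    """Interpret common textual spellings of a boolean flag."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no", ""):
        return False
    raise ValueError("Not a boolean flag: {!r}".format(value))
```

With this helper, `if parse_bool(enable_tuning):` appends the tuner only when the flag really spells a true value.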

**1.  Trigger a pipeline run using the TFX CLI**.

In [30]:
!tfx run create --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Creating a run for pipeline: tfx_covertype_continuous_training_7
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Run created for pipeline: tfx_covertype_continuous_training_7
+-------------------------------------+--------------------------------------+----------+---------------------------+-----------------------------------------------------------------------------------------------------------------------------+
| pipeline_name                       | run_id                               | status   | created_at                | link                                                                                                                        |
| tfx_covertype_continuous_training_7 | de152ba4-2309-4e9b-aa44-2bed3b85ae24 |          | 2021-01-08T02:46:29+00:00 | http://60ff837483ecde05-dot-us-central2.pipelines.googleusercontent.com/#/runs/details/de152ba4-2309-4e9b-aa44-2bed3b85ae24 |
+-------------------------------------+--------------

**2. Trigger a pipeline run from the KFP UI**.

On the [AI Platform Pipelines](https://console.cloud.google.com/ai-platform/pipelines/clusters) page, click `OPEN PIPELINES DASHBOARD`. A new tab will open. Select the `Pipelines` tab on the left, where you will see the `tfx_covertype_continuous_training` pipeline you deployed previously. Click on the pipeline name to open a window with a graphical display of your TFX pipeline. Next, click the `Create a run` button. Verify that the `Pipeline name` and `Pipeline version` are pre-populated, and optionally provide a `Run name` and an `Experiment` under which to group the run metadata, before clicking `Start`.

*Note: each full pipeline run takes about 45 minutes to 1 hour.* While the pipeline executes, take the time to review the pipeline metadata artifacts created in the GCS storage bucket for each component, including data splits, your TensorFlow SavedModel, and model evaluation results.

To list all runs of the pipeline:

In [20]:
!tfx run list --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

CLI
Listing all runs of pipeline: tfx_covertype_continuous_training
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
+-----------------------------------+--------------------------------------+----------+---------------------------+----------------------------------------------------------------------------------------------------------------------------+
| pipeline_name                     | run_id                               | status   | created_at                | link                                                                                                                       |
| tfx_covertype_continuous_training | a8eb5d5a-94f3-44ee-9ae9-24e26fa923ba | Running  | 2020-09-26T20:38:17+00:00 | http://a56ba690a399e62-dot-us-central2.pipelines.googleusercontent.com/#/runs/details/a8eb5d5a-94f3-44ee-9ae9-24e26fa923ba |
+-----------------------------------+--------------------------------------+----------+---------------------------+----------
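
If you capture the CLI output in a script, the table can be turned into records so that a `run_id` or `status` can be read programmatically. The sketch below assumes the pipe-delimited layout shown above, which is console output rather than a stable interface; `parse_cli_table` is a hypothetical helper.

```python
def parse_cli_table(output):
    """Parse a pipe-delimited ASCII table into a list of dicts keyed by header."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        # Keep only complete table rows; skip borders and log messages.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        rows.append([cell.strip() for cell in line.split("|")[1:-1]])
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]
```

Each returned dictionary carries the column names from the header row, so `runs[0]['run_id']` yields a value that can be passed to `tfx run status`.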

To retrieve the status of a given run:

In [None]:
RUN_ID='[YOUR RUN ID]'

!tfx run status --pipeline_name {PIPELINE_NAME} --run_id {RUN_ID} --endpoint {ENDPOINT}

## Next Steps

In this lab, you learned how to manually build and deploy a TFX pipeline to AI Platform Pipelines and trigger pipeline runs from a notebook. In the next lab, you will construct a Cloud Build CI/CD workflow that automatically builds and deploys the same TFX covertype classifier pipeline.

## License

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>