# Continuous training pipeline with KFP and Cloud AI Platform

**Learning Objectives:**
1. Learn how to use KF prebuilt components (BigQuery, CAIP training and predictions)
1. Learn how to use KF lightweight Python components
1. Learn how to build a KF pipeline with these components
1. Learn how to compile, upload, and run a KF pipeline with the command line


In this lab, you will build, deploy, and run a KFP pipeline that orchestrates **BigQuery** and **Cloud AI Platform** services to train, tune, and deploy a **scikit-learn** model.


## Understanding the pipeline design


The workflow implemented by the pipeline is defined using a Python-based domain-specific language (DSL). The pipeline's DSL is in the `covertype_training_pipeline.py` file that we will generate below.

The pipeline's DSL has been designed to avoid hardcoding any environment-specific settings like file paths or connection strings. These settings are provided to the pipeline code through a set of environment variables.
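As a minimal sketch of this pattern, a pipeline module can look its settings up in `os.environ` and fail fast when one is missing. The variable names below mirror the `%env` cell later in this notebook; the `load_settings` helper itself is hypothetical and is not part of the lab's pipeline code.

```python
import os
from typing import Mapping, Optional

# Settings the pipeline module reads at compile time (names taken from the %env cell).
REQUIRED_SETTINGS = (
    "BASE_IMAGE",
    "TRAINER_IMAGE",
    "COMPONENT_URL_SEARCH_PREFIX",
    "RUNTIME_VERSION",
    "PYTHON_VERSION",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required settings, raising KeyError if any are unset or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not source.get(name)]
    if missing:
        raise KeyError("Missing environment settings: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_SETTINGS}
```

Accepting an optional mapping keeps the helper testable with a plain dict, without modifying the real process environment.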


In [73]:
from typing import NamedTuple

import kfp
from kfp import dsl

from kfp.v2 import compiler
from kfp.v2.dsl import (Artifact, Dataset, Input, InputPath, Model, Output,
                        OutputPath, ClassificationMetrics, Metrics, component)

from kfp.v2.google.client import AIPlatformClient

from google.cloud import aiplatform
from google_cloud_pipeline_components import aiplatform as gcc_aip

from jinja2 import Template

import time

In [74]:
REGION = 'us-central1'

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]

ARTIFACT_STORE = f'gs://{PROJECT_ID}-vertex'

PIPELINE_ROOT = f'{ARTIFACT_STORE}/pipeline'
DATA_ROOT = f'{ARTIFACT_STORE}/data'
JOB_DIR_ROOT = f'{ARTIFACT_STORE}/jobs'
TRAINING_FILE_PATH = f'{DATA_ROOT}/training/dataset.csv'
VALIDATION_FILE_PATH = f'{DATA_ROOT}/validation/dataset.csv'
API_ENDPOINT = f'{REGION}-aiplatform.googleapis.com'

PIPELINE_NAME = 'covertype_kfp_pipeline'
PIPELINE_JSON = f'{PIPELINE_NAME}.json'

In [263]:
IMAGE_NAME='trainer_image_covertype_vertex'
TAG='latest'
TRAINER_IMAGE=f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{TAG}'
TRAINER_IMAGE

'gcr.io/qwiklabs-gcp-04-14242c0aa6a7/trainer_image_covertype_vertex:latest'

In [None]:
!gcloud builds submit --timeout 15m --tag $TRAINER_IMAGE trainer_image

In [229]:
TIMESTAMP = time.strftime("%Y%m%d_%H%M%S")
JOB_NAME = f"covertype_training_{TIMESTAMP}"
JOB_DIR = f"{JOB_DIR_ROOT}/{JOB_NAME}"
JOB_DIR

'gs://qwiklabs-gcp-04-14242c0aa6a7-vertex/jobs/covertype_training_20210806_190130'

In [230]:
STAGING_BUCKET = f'{PIPELINE_ROOT}/staging'
STAGING_BUCKET

'gs://qwiklabs-gcp-04-14242c0aa6a7-vertex/pipeline/staging'

In [231]:
SERVING_CONTAINER_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-20:latest'

In [232]:
alpha = 0.01
max_iter = 5

In [233]:
TRAINER_IMAGE

'gcr.io/qwiklabs-gcp-04-14242c0aa6a7/trainer_image_covertype_vertex:latest'

In [234]:
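# Set the project, region, and staging location for subsequent Vertex AI SDK calls.
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

job = aiplatform.CustomContainerTrainingJob(
    display_name='covertype_training',
    container_uri=TRAINER_IMAGE,
    # `command` overrides the image's ENTRYPOINT; the flags are parsed by train.py.
    command=[
        "python",
        "train.py",
        f"--job_dir={JOB_DIR}",
        f"--training_dataset_path={TRAINING_FILE_PATH}",
        f"--validation_dataset_path={VALIDATION_FILE_PATH}",
        f"--alpha={alpha}",
        f"--max_iter={max_iter}",
        "--nohptune",
    ],
    staging_bucket=STAGING_BUCKET,
    model_serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
)
# run() waits for the training job and returns the uploaded Model.
# The model is deployed in a separate cell, so only one endpoint is created.
model = job.run(replica_count=1, model_display_name='covertype_kfp_model')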

INFO:google.cloud.aiplatform.training_jobs:Training Output directory:
gs://qwiklabs-gcp-04-14242c0aa6a7-vertex/pipeline/staging/aiplatform-custom-training-2021-08-06-19:01:35.317 
INFO:google.cloud.aiplatform.training_jobs:View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6306002650504626176?project=71720575744
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/71720575744/locations/us-central1/trainingPipelines/6306002650504626176 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/71720575744/locations/us-central1/trainingPipelines/6306002650504626176 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training_jobs:CustomContainerTrainingJob projects/71720575744/locations/us-central1/trainingPipelines/6306002650504626176 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.training

In [236]:
endpoint = model.deploy(
    traffic_split={"0": 100},
    machine_type="n1-standard-2",
)

INFO:google.cloud.aiplatform.models:Creating Endpoint
INFO:google.cloud.aiplatform.models:Create Endpoint backing LRO: projects/71720575744/locations/us-central1/endpoints/3201971374130200576/operations/1925549699834576896
INFO:google.cloud.aiplatform.models:Endpoint created. Resource name: projects/71720575744/locations/us-central1/endpoints/3201971374130200576
INFO:google.cloud.aiplatform.models:To use this Endpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.Endpoint('projects/71720575744/locations/us-central1/endpoints/3201971374130200576')
INFO:google.cloud.aiplatform.models:Deploying model to Endpoint : projects/71720575744/locations/us-central1/endpoints/3201971374130200576
INFO:google.cloud.aiplatform.models:Deploy Endpoint model backing LRO: projects/71720575744/locations/us-central1/endpoints/3201971374130200576/operations/6537235718261964800
INFO:google.cloud.aiplatform.models:Endpoint model deployed. Resource name: projects/71720575744/loca

In [288]:
@component(
    base_image='python:3.8',
    output_component_file='covertype_kfp_train_and_deploy.yaml',
    packages_to_install=['google-cloud-aiplatform'],
)
def train_and_deploy(
        project: str,
        location: str,
        container_uri: str,
        training_file_path: str,
        validation_file_path: str,
        staging_bucket: str,
        job_dir: str,
        alpha: float, 
        max_iter: int,
    ):
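    # Lightweight components run only the function body in a fresh container,
    # so imports and constants must be defined inside the function.
    from google.cloud import aiplatform

    SERVING_CONTAINER_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-20:latest'

    aiplatform.init(project=project, location=location,
                    staging_bucket=staging_bucket)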
    job = aiplatform.CustomContainerTrainingJob(
        display_name='covertype_kfp_training',
        container_uri=container_uri,
        command=[
            "python", 
            "train.py",
            f"--job_dir={job_dir}",
            f"--training_dataset_path={training_file_path}",
            f"--validation_dataset_path={validation_file_path}",
            f"--alpha={alpha}",
            f"--max_iter={max_iter}",
            "--nohptune"
        ],
        staging_bucket=staging_bucket,
        model_serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    )
    model = job.run(replica_count=1, model_display_name='covertype_kfp_model')
    endpoint = model.deploy(
        traffic_split={"0": 100},
        machine_type='n1-standard-2',
    )

In [289]:
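# Calling the component outside a pipeline does not run train_and_deploy locally;
# it only creates a task object from the component specification.
train_and_deploy_task = train_and_deploy(
    project=PROJECT_ID,
    location=REGION,
    container_uri=TRAINER_IMAGE,
    training_file_path=TRAINING_FILE_PATH,
    validation_file_path=VALIDATION_FILE_PATH,
    staging_bucket=STAGING_BUCKET,
    job_dir=JOB_DIR,
    alpha=0.02,
    max_iter=2,
)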

In [290]:
@dsl.pipeline(
    name="covertype-kfp-pipeline",
    description="The pipeline training and deploying the Covertype classifier",
    pipeline_root=PIPELINE_ROOT,
)
def covertype_train():
    train_and_deploy_op = train_and_deploy(
        project=PROJECT_ID,
        location=REGION,
        container_uri=TRAINER_IMAGE,
        training_file_path=TRAINING_FILE_PATH,
        validation_file_path=VALIDATION_FILE_PATH,
        staging_bucket=STAGING_BUCKET,
        job_dir=JOB_DIR,
        alpha=0.02, 
        max_iter=2,            
    )


In [291]:
compiler.Compiler().compile(
    pipeline_func=covertype_train, 
    package_path=PIPELINE_JSON,
)

In [292]:
api_client = AIPlatformClient(
    project_id=PROJECT_ID,
    region=REGION,
)

In [293]:
response = api_client.create_run_from_job_spec(
    job_spec_path=PIPELINE_JSON,
    # pipeline_root=PIPELINE_ROOT  # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
)

The custom components execute in a container image defined in `base_image/Dockerfile`.

In [4]:
!cat base_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire scikit-learn==0.20.4 pandas==0.24.2 kfp==0.2.5


The training step in the pipeline employs the AI Platform Training component to schedule an AI Platform Training job in a custom training container. The custom training image is defined in `trainer_image/Dockerfile`.

In [5]:
!cat trainer_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire cloudml-hypertune scikit-learn==0.20.4 pandas==0.24.2
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]
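The contents of `train.py` are not shown in this notebook. The Dockerfile installs Python Fire, which accepts `--nohptune` as shorthand for `hptune=False`. The sketch below reproduces the same command-line contract (the flags passed by the training job cells above) with the standard library's `argparse`, so it runs without extra packages. The `parse_args` helper is hypothetical, and defaulting `hptune` to `True` is an assumption.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags that the training job passes to a hypothetical train.py."""
    parser = argparse.ArgumentParser(description="Covertype trainer (sketch)")
    parser.add_argument("--job_dir", required=True, help="GCS directory for job outputs")
    parser.add_argument("--training_dataset_path", required=True)
    parser.add_argument("--validation_dataset_path", required=True)
    parser.add_argument("--alpha", type=float, required=True)
    parser.add_argument("--max_iter", type=int, required=True)
    # Fire-style boolean pair: --hptune enables tuning mode, --nohptune disables it.
    parser.add_argument("--hptune", dest="hptune", action="store_true")
    parser.add_argument("--nohptune", dest="hptune", action="store_false")
    # Assumption of this sketch: tuning mode is on unless --nohptune is given.
    parser.set_defaults(hptune=True)
    return parser.parse_args(argv)
```

A real trainer would then load the two CSV files, fit the scikit-learn model with `alpha` and `max_iter`, and write the model under `job_dir`.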


## Building and deploying the pipeline

Before deploying to AI Platform Pipelines, the pipeline DSL has to be compiled into a pipeline runtime format, also referred to as a pipeline package. The runtime format is based on [Argo Workflows](https://github.com/argoproj/argo), which is expressed in YAML.


### Configure environment settings

Update the constants below with the settings reflecting your lab environment.

- `REGION` - the compute region for AI Platform Training and Prediction
- `ARTIFACT_STORE` - the GCS bucket created during installation of AI Platform Pipelines. The bucket name starts with the `hostedkfp-default-` prefix.
- `ENDPOINT` - the endpoint of your AI Platform Pipelines instance. You can find it on the [AI Platform Pipelines](https://console.cloud.google.com/ai-platform/pipelines/clusters) page in the Google Cloud Console:

1. Open the *SETTINGS* for your instance.
2. Use the value of the `host` variable in the *Connect to this Kubeflow Pipelines instance from a Python client via Kubeflow Pipelines SDK* section of the *SETTINGS* window.
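The lab opens the KFP UI at `https://[ENDPOINT]`, which suggests that `ENDPOINT` holds the bare host name. If the `host` value you copy includes a scheme or a trailing slash, a small helper like the one below can normalize it. Both helpers are hypothetical, and the example host used with them is made up.

```python
def endpoint_from_host(host: str) -> str:
    """Strip an http(s) scheme and trailing slashes from a KFP host value."""
    endpoint = host.strip()
    for scheme in ("https://", "http://"):
        if endpoint.startswith(scheme):
            endpoint = endpoint[len(scheme):]
            break
    return endpoint.rstrip("/")


def kfp_ui_url(endpoint: str) -> str:
    """Build the KFP UI URL in the https://[ENDPOINT] form used in this lab."""
    return f"https://{endpoint_from_host(endpoint)}"
```

For example, `ENDPOINT = endpoint_from_host('https://my-kfp-instance.example.com/')` yields `my-kfp-instance.example.com`.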

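### Build the trainer image

The trainer image (`TRAINER_IMAGE`) is built from the `trainer_image` directory by the `gcloud builds submit` cell near the top of this notebook.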

### Build the base image for custom components

In [10]:
IMAGE_NAME='base_image'
TAG='latest'
BASE_IMAGE='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, TAG)

In [None]:
!gcloud builds submit --timeout 15m --tag $BASE_IMAGE base_image

Creating temporary tarball archive of 1 file(s) totalling 122 bytes before compression.
Uploading tarball of [base_image] to [gs://qwiklabs-gcp-01-3aff5ef1f764_cloudbuild/source/1606483509.529902-02e717e4bbe24c6daf7e1902153a93a3.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/qwiklabs-gcp-01-3aff5ef1f764/builds/bdc125ea-5d1a-4255-94b9-df5c5f8385bf].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/bdc125ea-5d1a-4255-94b9-df5c5f8385bf?project=1016448934670].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "bdc125ea-5d1a-4255-94b9-df5c5f8385bf"

FETCHSOURCE
Fetching storage object: gs://qwiklabs-gcp-01-3aff5ef1f764_cloudbuild/source/1606483509.529902-02e717e4bbe24c6daf7e1902153a93a3.tgz#1606483510262206
Copying gs://qwiklabs-gcp-01-3aff5ef1f764_cloudbuild/source/1606483509.529902-02e717e4bbe24c6daf7e1902153a93a3.tgz#1606483510262206...
/ [1 files][  227.0 B/  227.0 B]                                    

### Compile the pipeline

You can compile the DSL using an API from the **KFP SDK** or using the **KFP** compiler.

The following steps compile the pipeline DSL using the **KFP** CLI compiler, `dsl-compile`.

#### Set the pipeline's compile time settings

The pipeline can run using the security context of the GKE default node pool's service account or the service account defined in the `user-gcp-sa` secret of the Kubernetes namespace hosting Kubeflow Pipelines. If you want to use the `user-gcp-sa` service account, change the value of `USE_KFP_SA` to `True`.

Note that the default AI Platform Pipelines configuration does not define the `user-gcp-sa` secret.
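Because `%env` stores every value as a string, the pipeline code has to interpret `USE_KFP_SA` explicitly: in Python, `bool('False')` evaluates to `True`. A minimal parsing sketch follows; the helper names are hypothetical. In a KFP v1 pipeline, a `True` value is typically where the `user-gcp-sa` secret would be attached to each step (for example with `kfp.gcp.use_gcp_secret`), but the lab's pipeline code is not shown here.

```python
import os


def parse_bool_flag(value: str) -> bool:
    """Interpret a 'True'/'False' style environment string as a bool."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no", ""):
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


def use_kfp_service_account() -> bool:
    """Read USE_KFP_SA, defaulting to False (the node pool's service account)."""
    return parse_bool_flag(os.environ.get("USE_KFP_SA", "False"))
```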

In [22]:
USE_KFP_SA = False

COMPONENT_URL_SEARCH_PREFIX = 'https://raw.githubusercontent.com/kubeflow/pipelines/0.2.5/components/gcp/'
RUNTIME_VERSION = '1.15'
PYTHON_VERSION = '3.7'

%env USE_KFP_SA={USE_KFP_SA}
%env BASE_IMAGE={BASE_IMAGE}
%env TRAINER_IMAGE={TRAINER_IMAGE}
%env COMPONENT_URL_SEARCH_PREFIX={COMPONENT_URL_SEARCH_PREFIX}
%env RUNTIME_VERSION={RUNTIME_VERSION}
%env PYTHON_VERSION={PYTHON_VERSION}

env: USE_KFP_SA=False
env: BASE_IMAGE=gcr.io/qwiklabs-gcp-01-3aff5ef1f764/base_image:latest
env: TRAINER_IMAGE=gcr.io/qwiklabs-gcp-01-3aff5ef1f764/trainer_image:latest
env: COMPONENT_URL_SEARCH_PREFIX=https://raw.githubusercontent.com/kubeflow/pipelines/0.2.5/components/gcp/
env: RUNTIME_VERSION=1.15
env: PYTHON_VERSION=3.7


#### Use the CLI compiler to compile the pipeline

In [23]:
!dsl-compile --py pipeline/covertype_training_pipeline.py --output covertype_training_pipeline.yaml

The result is the `covertype_training_pipeline.yaml` file. 

In [24]:
!head covertype_training_pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: covertype-classifier-training-
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 0.5.1, pipelines.kubeflow.org/pipeline_compilation_time: '2020-11-27T13:31:14.897418',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "The pipeline training
      and deploying the Covertype classifierpipeline_yaml", "inputs": [{"name": "project_id"},
      {"name": "region"}, {"name": "source_table_name"}, {"name": "gcs_root"}, {"name":
      "dataset_id"}, {"name": "evaluation_metric_name"}, {"name": "evaluation_metric_threshold"},
      {"name": "model_id"}, {"name": "version_id"}, {"name": "replace_existing_version"},
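The `head` output above shows that the compiled package is an Argo `Workflow` manifest. As a lightweight sanity check before uploading, you could scan its top-level keys with the standard library. This is a line-based sketch, not a YAML parser, and the helper names are hypothetical.

```python
def workflow_header(yaml_text: str) -> dict:
    """Extract top-level apiVersion and kind, plus the first generateName."""
    header = {}
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Top-level keys start at column 0; nested keys are indented.
        if line.startswith("apiVersion:"):
            header["apiVersion"] = line.split(":", 1)[1].strip()
        elif line.startswith("kind:"):
            header["kind"] = line.split(":", 1)[1].strip()
        elif stripped.startswith("generateName:") and "generateName" not in header:
            header["generateName"] = stripped.split(":", 1)[1].strip()
    return header


def is_argo_workflow(yaml_text: str) -> bool:
    """True if the text declares an argoproj.io Workflow at the top level."""
    header = workflow_header(yaml_text)
    return header.get("apiVersion", "").startswith("argoproj.io/") and header.get("kind") == "Workflow"
```

In the notebook, `is_argo_workflow(open('covertype_training_pipeline.yaml').read())` should return `True` for the package compiled above.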


### Deploy the pipeline package

In [25]:
PIPELINE_NAME='covertype_continuous_training'

!kfp --endpoint $ENDPOINT pipeline upload \
-p $PIPELINE_NAME \
covertype_training_pipeline.yaml

(500)
Reason: Internal Server Error

## Submitting pipeline runs

You can trigger pipeline runs using an API from the KFP SDK or using the KFP CLI. To submit the run using the KFP CLI, execute the following commands. Notice how the pipeline's parameters are passed to the pipeline run.

### List the pipelines in AI Platform Pipelines

In [26]:
!kfp --endpoint $ENDPOINT pipeline list

+--------------------------------------+-------------------------------------------------+---------------------------+
| Pipeline ID                          | Name                                            | Uploaded at               |
| defa3c60-637b-4332-88b9-d8647c2aec84 | covertype_continuous_training                   | 2020-11-27T13:29:43+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| b9d5fe74-7c0a-4350-897c-27b373642fed | tfx_covertype-v2                                | 2020-11-24T15:46:37+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| 4e656e01-47dc-45c3-9bb6-df557dba99ba | covertype_continuous_training_test              | 2020-11-24T15:28:38+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| 892cd8cf-f8a2-4b5f-944e-ebb5d4e1d518 | tfx_cov

### Submit a run

Find the ID of the `covertype_continuous_training` pipeline you uploaded in the previous step and update the value of `PIPELINE_ID`.


In [17]:
PIPELINE_ID='defa3c60-637b-4332-88b9-d8647c2aec84'

In [27]:
EXPERIMENT_NAME = 'Covertype_Classifier_Training'
RUN_ID = 'Run_001'
SOURCE_TABLE = 'covertype_dataset.covertype'
DATASET_ID = 'splits'
EVALUATION_METRIC = 'accuracy'
EVALUATION_METRIC_THRESHOLD = '0.69'
MODEL_ID = 'covertype_classifier'
VERSION_ID = 'v01'
REPLACE_EXISTING_VERSION = 'True'

GCS_STAGING_PATH = f'{ARTIFACT_STORE}/staging'

In [29]:
!kfp --endpoint $ENDPOINT run submit \
-e $EXPERIMENT_NAME \
-r $RUN_ID \
-p $PIPELINE_ID \
project_id=$PROJECT_ID \
gcs_root=$GCS_STAGING_PATH \
region=$REGION \
source_table_name=$SOURCE_TABLE \
dataset_id=$DATASET_ID \
evaluation_metric_name=$EVALUATION_METRIC \
evaluation_metric_threshold=$EVALUATION_METRIC_THRESHOLD \
model_id=$MODEL_ID \
version_id=$VERSION_ID \
replace_existing_version=$REPLACE_EXISTING_VERSION

Run 4fdb59af-17b7-4136-83be-e7eeec23248b is submitted
+--------------------------------------+---------+----------+---------------------------+
| run id                               | name    | status   | created at                |
| 4fdb59af-17b7-4136-83be-e7eeec23248b | Run_001 |          | 2020-11-27T13:32:24+00:00 |
+--------------------------------------+---------+----------+---------------------------+


where

- EXPERIMENT_NAME is set to the experiment used to run the pipeline. You can choose any name you want. If the experiment does not exist, the command creates it.
- RUN_ID is the name of the run. You can use an arbitrary name.
- PIPELINE_ID is the ID of your pipeline. Use the value retrieved by the `kfp pipeline list` command.
- GCS_STAGING_PATH is the URI to the GCS location used by the pipeline to store intermediate files. By default, it is set to the `staging` folder in your artifact store.
- REGION is a compute region for AI Platform Training and Prediction. 
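The `kfp run submit` cell passes the run options (`-e`, `-r`, `-p`) followed by the pipeline parameters as trailing `name=value` pairs. If you prefer to drive the CLI from Python, for example from a scheduled script, a sketch like the following assembles the same argument list. The helper is hypothetical; you would execute the result with `subprocess.run(cmd, check=True)`.

```python
from typing import Dict, List


def build_run_submit_cmd(endpoint: str, experiment: str, run_id: str,
                         pipeline_id: str, params: Dict[str, str]) -> List[str]:
    """Assemble a `kfp run submit` argument list in the same form as the cell above."""
    cmd = ["kfp", "--endpoint", endpoint, "run", "submit",
           "-e", experiment, "-r", run_id, "-p", pipeline_id]
    # Pipeline parameters follow the options as name=value pairs.
    cmd += [f"{name}={value}" for name, value in params.items()]
    return cmd
```

Passing a list rather than a single shell string avoids quoting problems when a parameter value contains spaces or shell metacharacters.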

You should already be familiar with these and the other parameters passed to the command. If not, go back and review the pipeline code.


### Monitoring the run

You can monitor the run using the KFP UI. Your instructor will walk you through the KFP UI and monitoring techniques.

To access the KFP UI in your environment, use the following URI:

https://[ENDPOINT]


**NOTE:** Your pipeline run may fail due to a bug in the BigQuery component that does not handle certain race conditions. If the pipeline fails, retry the run from the KFP UI.


<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>