# Orchestrating a training pipeline with Vertex Pipelines

**Learning Objectives:**

1.  **Understand the fundamentals of MLOps**: Grasp the importance of building automated and reproducible machine learning pipelines.
2.  **Learn to use the KFP SDK**: Get hands-on experience with the Kubeflow Pipelines SDK to define a pipeline's workflow in Python.
3.  **Learn to use Google Cloud Pipeline Components**: Use pre-built components that provide a simplified interface to Vertex AI services.
4.  **Build an end-to-end training pipeline**: Create a pipeline that automates data preparation, model training, and model deployment.
5.  **Run and monitor a pipeline**: Learn how to compile the pipeline, submit it to Vertex AI for execution, and monitor its progress.

## Setup

In [None]:
REGION = "us-central1"
PROJECT = !(gcloud config get-value project)
PROJECT = PROJECT[0]

In [None]:
# Set `PATH` to include the directory containing KFP CLI
PATH = %env PATH
%env PATH=/home/jupyter/.local/bin:{PATH}

## Understanding the pipeline design


**Vertex AI Pipelines** is a serverless service that orchestrates your machine learning workflows. Under the hood, Vertex AI Pipelines uses [Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/sdk-overview/), an open-source platform, which means you can define your pipelines using the KFP SDK and benefit from a large ecosystem of pre-built components.

Think of a pipeline as a graph of interconnected components. Each **component** is a self-contained set of code that performs a single step in your ML workflow, such as:

*   Preparing data.
*   Training a model.
*   Evaluating a model.
*   Deploying a model.

These components take inputs and produce outputs, which are then passed to downstream components, creating a dependency graph that Vertex AI executes for you. This approach has several advantages:

1.  **Reproducibility**: Each pipeline run is logged, and its artifacts are stored, which makes it easy to reproduce experiments and track model lineage.
2.  **Scalability**: Vertex AI handles the underlying infrastructure, so you can run your pipelines at scale without worrying about provisioning and managing servers.
3.  **Modularity**: Since pipelines are composed of individual components, you can easily reuse components across different pipelines and share them with your team.
4.  **Automation**: You can trigger pipeline runs on a schedule or in response to events, which is a key component of a robust MLOps strategy.
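
The dependency graph described above can be modeled in a few lines of plain Python. The sketch below is only an illustration, not how Vertex AI schedules work internally: the step names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the orchestrator to derive a valid execution order from each step's upstream dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical steps; each maps to the set of upstream steps whose outputs it consumes.
DEPENDENCIES = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}


def execution_order(dependencies):
    """Return one valid order in which every step runs after its upstream steps."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

In a real pipeline you do not compute this order yourself: passing one component's output to another is what declares the edge, and the service derives the schedule.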

In this lab, you will define your pipeline in a Python file using the **KFP SDK**. You will also use pre-built **Google Cloud Pipeline Components** that provide a simplified, high-level interface to Vertex AI services like AutoML and Vertex AI Endpoints, so you can orchestrate a multi-step workflow without writing the API calls for each service yourself.

## Building and deploying the pipeline

As described above, a pipeline is composed of **components**, each performing one step in your ML workflow. In this section, you will define the components that make up your pipeline and then orchestrate them in a single Python function.

You will use two types of components:

1.  **Custom component**: A component that you build yourself. In this lab, you will create a simple component that generates a BigQuery query.
2.  **Pre-built components**: The `google-cloud-pipeline-components` library provides a set of pre-built components that make it easy to interact with Vertex AI services.

In [None]:
GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
BUCKET_NAME = f"{PROJECT_ID}-fraudfinder"
config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py
print(config.n)
exec(config.n)

In [None]:
# Pipeline variables
PIPELINE_NAME = f"fraud-finder-automl-pipeline-{ID}"

# Feature Store component variables
BQ_DATASET = "tx"
READ_INSTANCES_TABLE = f"ground_truth_{ID}"
READ_INSTANCES_URI = f"bq://{PROJECT_ID}.{BQ_DATASET}.{READ_INSTANCES_TABLE}"

# Dataset component variables
DATASET_NAME = f"fraud_finder_dataset_{ID}"

### A custom lightweight component

For simple steps, you can use the KFP SDK to create **lightweight components**: self-contained Python functions that KFP converts into pipeline components. You do not need to build a container image yourself. The function's code is packaged into the component definition and runs at execution time inside the `base_image` you specify.

The first component in your pipeline is a lightweight component that generates a BigQuery query. The query selects all labeled records (rows where `tx_fraud` is not null) from the `v_ff_training_dataset` view and saves them to a new table, which becomes the source for your Vertex AI dataset. The main reason for creating this component is to **parameterize the query** and make the pipeline more reusable.

In [None]:
%%writefile ./pipeline_vertex/create_load_query_component.py
# Copyright 2025 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS"
# BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

"""Lightweight component that builds the training dataset query."""
from kfp.dsl import component


@component(
    base_image="python:3.9",
)
def create_train_dataset_query(
    source_view: str,
    destination_table: str,
) -> str:
    """Builds the query that creates or replaces the BigQuery training dataset snapshot."""
    # Keep only labeled records.
    query = f"""
    CREATE OR REPLACE TABLE `{destination_table}` AS
    SELECT * FROM {source_view} WHERE tx_fraud IS NOT NULL
    """
    return query
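
Because the component body is ordinary Python, the query text can be checked without launching a pipeline. The sketch below is a plain-function stand-in that mirrors the component's logic (it is not the KFP component itself); the `build_query` name and the destination table names are illustrative assumptions.

```python
def build_query(source_view: str, destination_table: str) -> str:
    """Stand-in for the component logic: snapshot labeled rows into a table."""
    return (
        f"CREATE OR REPLACE TABLE `{destination_table}` AS "
        f"SELECT * FROM {source_view} WHERE tx_fraud IS NOT NULL"
    )


# The same function yields a different snapshot table per (hypothetical) environment.
dev_query = build_query("tx.v_ff_training_dataset", "tx.train_table_dev")
prod_query = build_query("tx.v_ff_training_dataset", "tx.train_table_prod")
print(dev_query)
```

This is the benefit of parameterizing the query: the source view and destination table become inputs rather than values baked into the SQL.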

### Defining the pipeline

Now that you have created the custom component, you can define the pipeline's workflow. You will use the KFP SDK to define a Python function that describes the graph of components that make up your pipeline. The `@dsl.pipeline` decorator marks your Python function as a pipeline definition, which the KFP compiler later turns into a file that can be submitted to Vertex AI.

The pipeline will have the following steps:

1.  **`create_train_dataset_query`**: This is the custom component you just created. It will generate a BigQuery query and return it as a string.
2.  **`BigqueryQueryJobOp`**: This is a pre-built component that takes a BigQuery query as input and executes it. This component will create the training dataset table in BigQuery.
3.  **`TabularDatasetCreateOp`**: This pre-built component creates a new Vertex AI Tabular Dataset from a BigQuery table.
4.  **`AutoMLTabularTrainingJobRunOp`**: This is the core component of the pipeline. It takes the Vertex AI Dataset as input and trains a tabular classification model using AutoML. You will configure it to:
    *   Use `tx_fraud` as the target column.
    *   Split the data 80/10/10 into training, validation, and test sets, using the `timestamp` column to order the rows.
    *   Use a training budget of 1 node hour (`budget_milli_node_hours=1000`), with early stopping enabled.
5.  **`EndpointCreateOp`**: Once the model is trained, this component will create a new Vertex AI Endpoint. An endpoint is a resource that you can use to serve predictions from your model.
6.  **`ModelDeployOp`**: This final component will deploy the trained model to the endpoint. Once the model is deployed, you will be able to send it prediction requests.
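
Two of the training settings above involve small arithmetic that is easy to get wrong: the budget is expressed in milli node hours, and the split fractions should account for the whole dataset. The helper names below are hypothetical and are not part of any Google Cloud library. The check that the fractions sum to exactly 1 is an assumption of this sketch and may be stricter than what the service itself requires.

```python
import math


def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to milli node hours (1 node hour = 1000)."""
    return round(node_hours * 1000)


def validate_splits(training: float, validation: float, test: float) -> None:
    """Raise ValueError unless the fractions are non-negative and sum to 1."""
    fractions = (training, validation, test)
    if any(f < 0 for f in fractions):
        raise ValueError(f"Split fractions must be non-negative: {fractions}")
    if not math.isclose(sum(fractions), 1.0, abs_tol=1e-9):
        raise ValueError(f"Split fractions must sum to 1, got {sum(fractions)}")


budget = node_hours_to_milli(1)  # value used for the 1 node hour budget
validate_splits(0.8, 0.1, 0.1)   # the 80/10/10 split passes silently
```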

In [None]:
%%writefile ./pipeline_vertex/pipeline_vertex_automl.py
# Copyright 2025 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS"
# BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

"""Kubeflow Fraudfinder Pipeline."""

import os

from google_cloud_pipeline_components.v1.automl.training_job import (
    AutoMLTabularTrainingJobRunOp,
)
from google_cloud_pipeline_components.v1.dataset import TabularDatasetCreateOp
from google_cloud_pipeline_components.v1.endpoint import (
    EndpointCreateOp,
    ModelDeployOp,
)
from google_cloud_pipeline_components.v1.bigquery import BigqueryQueryJobOp

from kfp import dsl
from create_load_query_component import create_train_dataset_query


PIPELINE_ROOT = os.getenv("PIPELINE_ROOT")
PROJECT = os.getenv("PROJECT")
REGION = os.getenv("REGION", "us-central1")
DATASET_SOURCE = os.getenv("DATASET_SOURCE")
PIPELINE_NAME = os.getenv("PIPELINE_NAME", "fraudfinder")
ENDPOINT_NAME = os.getenv("ENDPOINT_NAME", "ff_model_endpoint")
DISPLAY_NAME = os.getenv("MODEL_DISPLAY_NAME", PIPELINE_NAME)
TARGET_COLUMN = os.getenv("TARGET_COLUMN", "tx_fraud")
BUCKET_NAME = os.getenv("BUCKET_NAME")
SERVING_MACHINE_TYPE = os.getenv("SERVING_MACHINE_TYPE", "n1-standard-4")
ID = os.getenv("ID")

VERTEX_DATASET_SOURCE = f"bq://{PROJECT}.{DATASET_SOURCE}"

# Feature Store component variables
FEATURESTORE_ID = f"fraudfinder_{ID}"
BQ_DATASET = "tx"
READ_INSTANCES_TABLE = f"ground_truth_{ID}"
READ_INSTANCES_URI = f"bq://{PROJECT}.{BQ_DATASET}.{READ_INSTANCES_TABLE}"
bucket_name = f"gs://{BUCKET_NAME}"

column_specs = {
    'tx_amount': "numeric",
    'customer_id_avg_amount_14day_window': "numeric",
    'customer_id_avg_amount_15min_window': "numeric",
    'customer_id_avg_amount_1day_window': "numeric",
    'customer_id_avg_amount_30min_window': "numeric",
    'customer_id_avg_amount_60min_window': "numeric",
    'customer_id_avg_amount_7day_window': "numeric",
    'customer_id_nb_tx_14day_window': "numeric",
    'customer_id_nb_tx_15min_window': "numeric",
    'customer_id_nb_tx_1day_window': "numeric",
    'customer_id_nb_tx_30min_window': "numeric",
    'customer_id_nb_tx_60min_window': "numeric",
    'customer_id_nb_tx_7day_window': "numeric",
    'terminal_id_avg_amount_15min_window': "numeric",
    'terminal_id_avg_amount_30min_window': "numeric",
    'terminal_id_avg_amount_60min_window': "numeric",
    'terminal_id_nb_tx_14day_window': "numeric",
    'terminal_id_nb_tx_15min_window': "numeric",
    'terminal_id_nb_tx_1day_window': "numeric",
    'terminal_id_nb_tx_30min_window': "numeric",
    'terminal_id_nb_tx_60min_window': "numeric",
    'terminal_id_nb_tx_7day_window': "numeric",
    'terminal_id_risk_14day_window': "numeric",
    'terminal_id_risk_1day_window': "numeric",
    'terminal_id_risk_7day_window': "numeric"
}

SOURCE_VIEW = f"{BQ_DATASET}.v_ff_training_dataset"

@dsl.pipeline(
    name=f"{PIPELINE_NAME}-vertex-automl-pipeline",
    description=f"AutoML Vertex Pipeline for {PIPELINE_NAME}",
    pipeline_root=PIPELINE_ROOT
)
def create_pipeline():
    
    # Prepare the SQL query for the BigQuery job
    bq_load_query_op = create_train_dataset_query(
        source_view=SOURCE_VIEW,
        destination_table=DATASET_SOURCE
    )

    # Use the BigqueryQueryJobOp to ingest training dataset
    bq_job_op = BigqueryQueryJobOp(
        project=PROJECT,
        query=bq_load_query_op.output,
        #query_parameters=bq_query_params_list,
    )
    
    # Create the Vertex AI tabular dataset
    dataset_create_task = TabularDatasetCreateOp(
        project=PROJECT,
        display_name=DISPLAY_NAME,
        bq_source=VERTEX_DATASET_SOURCE
    ).after(bq_job_op)

    # Run the AutoML Tabular Training Job
    automl_training_task = AutoMLTabularTrainingJobRunOp(
        project=PROJECT,
        display_name=DISPLAY_NAME,
        optimization_prediction_type="classification",
        dataset=dataset_create_task.outputs["dataset"],
        target_column=TARGET_COLUMN,
        timestamp_split_column_name='timestamp',
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        # Feature list configuration
        column_specs=column_specs,
        # column_transformations=column_transformations,
        # Training budget and early stopping
        budget_milli_node_hours=1000,  # 1000 milli node hours = 1 node hour
        disable_early_stopping=False,  # Keep early stopping enabled
    )
    
    # Create Vertex AI Endpoint
    endpoint_create_task = EndpointCreateOp(
        project=PROJECT,
        display_name=ENDPOINT_NAME,
    ).after(automl_training_task)

    # Deploy model to the Vertex AI Endpoint
    model_deploy_task = ModelDeployOp(  # pylint: disable=unused-variable
        model=automl_training_task.outputs["model"],
        endpoint=endpoint_create_task.outputs["endpoint"],
        deployed_model_display_name=DISPLAY_NAME,
        dedicated_resources_machine_type=SERVING_MACHINE_TYPE,
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=1,
    )


### Compile the pipeline

Now that you have defined your pipeline in Python, you need to compile it into a format that Vertex AI can understand. The KFP SDK provides a compiler that takes your Python function and converts it into a YAML file. This YAML file contains a static definition of your pipeline's workflow and can be submitted to Vertex AI for execution.

Before you compile the pipeline, you need to define some environment variables. The pipeline code is designed to be reusable, so instead of hardcoding values like your project ID or a GCS bucket path, it reads them from environment variables with `os.getenv`. Because these values are read when the module is loaded, the KFP compiler embeds them in the compiled YAML file; to change them, update the environment variables and recompile.
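
Since the pipeline module reads its settings with `os.getenv`, a missing variable silently becomes `None` and may only surface later as a confusing error. A minimal sketch of a fail-fast check follows. The `load_settings` helper and the choice of which variables count as required are assumptions for illustration, not part of the lab's code.

```python
import os

# Assumed split between required settings and settings with defaults.
REQUIRED = ("PIPELINE_ROOT", "PROJECT", "DATASET_SOURCE")
DEFAULTS = {"REGION": "us-central1", "TARGET_COLUMN": "tx_fraud"}


def load_settings(environ=os.environ):
    """Collect settings from the environment, failing fast on missing ones."""
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    settings = {name: environ[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = environ.get(name, default)
    return settings


# Example with an explicit mapping instead of the real environment.
example = load_settings({
    "PIPELINE_ROOT": "gs://my-bucket/pipeline",
    "PROJECT": "my-project",
    "DATASET_SOURCE": "tx.train_table_automl_demo",
})
print(example["REGION"])
```

Passing a plain dictionary, as in the example, makes the check easy to exercise without touching the notebook's real environment.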

In [None]:
ARTIFACT_STORE = f"gs://{PROJECT}-kfp-artifact-store"
PIPELINE_ROOT = f"{ARTIFACT_STORE}/pipeline"
DATASET_SOURCE = "tx.train_table_automl_demo"

%env PIPELINE_ROOT={PIPELINE_ROOT}
%env PROJECT={PROJECT}
%env REGION={REGION}
%env DATASET_SOURCE={DATASET_SOURCE}
%env ID={ID}
%env BUCKET_NAME={BUCKET_NAME}
%env ENDPOINT_NAME={ENDPOINT_NAME}

The `PIPELINE_ROOT` variable points to a Cloud Storage path that is used to store the **artifacts** of your pipeline runs. An artifact is an output generated by a component, such as a trained model or a dataset. Vertex AI stores these artifacts automatically, which is important for tracking model lineage and reproducing experiments.

In [None]:
!gsutil ls | grep ^{ARTIFACT_STORE}/$ || gsutil mb -l {REGION} {ARTIFACT_STORE}

Now you can use the KFP CLI to compile the pipeline. The `--py` flag points to your Python file, and the `--output` flag is the name of the YAML file that will be generated.

In [None]:
PIPELINE_YAML = "fraudfinder_automl_vertex_pipeline.yaml"

In [None]:
!kfp dsl compile --py pipeline_vertex/pipeline_vertex_automl.py --output $PIPELINE_YAML

**Note:** You can also use the Python SDK to compile the pipeline:

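```python
from kfp import compiler

# Assumes `create_pipeline` has been imported from pipeline_vertex_automl.py
# and that PIPELINE_YAML is defined as above.
compiler.Compiler().compile(
    pipeline_func=create_pipeline,
    package_path=PIPELINE_YAML,
)
```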

The result is a YAML file containing the compiled pipeline definition. You can inspect its first few lines:

In [None]:
!head {PIPELINE_YAML}

### Submit the pipeline for execution

With the compiled pipeline in hand, you can now submit it to Vertex AI for execution. You will use the Vertex AI Python SDK to do this.

The `aiplatform.PipelineJob` class is used to configure and run a pipeline. You will provide the following parameters:

*   `display_name`: A human-readable name for the pipeline run.
*   `template_path`: The path to the compiled YAML file.
*   `enable_caching`: If set to `True`, Vertex AI will try to reuse the outputs of previous component executions if the inputs have not changed. This can save you a lot of time and money when you are iterating on your pipelines.
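
The idea behind caching can be illustrated with a toy model: a step's result is stored under a key derived from the step name and its inputs, and a later call with identical inputs returns the stored result instead of re-running the step. This is a deliberately simplified sketch, not Vertex AI's actual caching implementation. The `run_step` helper and the in-memory dictionary are assumptions made for illustration.

```python
import hashlib
import json

_cache = {}   # maps cache key -> stored step result
calls = []    # records each time a step really executes


def cache_key(step_name, inputs):
    """Derive a stable key from the step name and its inputs."""
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name, func, inputs, enable_caching=True):
    """Run func(**inputs), reusing a stored result when caching allows it."""
    key = cache_key(step_name, inputs)
    if enable_caching and key in _cache:
        return _cache[key]
    calls.append(step_name)
    result = func(**inputs)
    _cache[key] = result
    return result


def build_query(table):
    return f"SELECT * FROM {table}"


first = run_step("query", build_query, {"table": "tx.demo"})
second = run_step("query", build_query, {"table": "tx.demo"})   # reused, not re-run
third = run_step("query", build_query, {"table": "tx.other"})   # inputs changed, re-run
```

With caching disabled, the step would execute on every call regardless of whether its inputs changed.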

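Once you have configured the `PipelineJob`, you can call the `run()` method to start the execution. By default, `run()` blocks until the pipeline finishes; use `submit()` instead if you want the call to return immediately. The cell below sets `enable_caching=False`, so every step is re-executed on each run. You can monitor the progress of the pipeline run in the Vertex AI UI.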

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT, location=REGION)

pipeline = aiplatform.PipelineJob(
    display_name="automl_fraudfinder_kfp_pipeline",
    template_path=PIPELINE_YAML,
    enable_caching=False,
)

pipeline.run()

## Summary

Congratulations! You have successfully built and run an end-to-end machine learning pipeline using Vertex AI Pipelines. 

In this lab, you learned how to:

*   Define a pipeline's workflow in Python using the KFP SDK.
*   Use a combination of custom and pre-built Google Cloud Pipeline Components to orchestrate a workflow that uses BigQuery and AutoML.
*   Compile a pipeline into a YAML file and submit it to Vertex AI for execution.

This pipeline provides a solid foundation for building more sophisticated MLOps workflows. From here, you could:

*   **Automate the pipeline**: Use Cloud Scheduler and Cloud Functions to trigger the pipeline on a schedule.
*   **Add an evaluation component**: Before deploying the model, you could add a component that evaluates the model's performance on a test set and only deploys the model if it meets a certain quality threshold.
*   **Experiment with different model architectures**: You could adapt this pipeline to train a custom model with Vertex AI Training instead of using AutoML.
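
As an illustration of the evaluation gate suggested above, the deployment decision can be reduced to a small predicate. The metric name `au_prc` and the threshold value are hypothetical. In a real pipeline this logic would live in its own component, with the deployment steps made conditional on its output.

```python
def should_deploy(metrics: dict, metric_name: str = "au_prc", threshold: float = 0.9) -> bool:
    """Return True only if the named metric is present and meets the threshold."""
    value = metrics.get(metric_name)
    return value is not None and value >= threshold


# Hypothetical evaluation results for two candidate models.
print(should_deploy({"au_prc": 0.95}))
print(should_deploy({"au_prc": 0.50}))
```

Treating a missing metric as a failed check is a conservative design choice: a model is deployed only when there is positive evidence that it meets the bar.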

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.