# Orchestrating Vertex AI Pipelines with Cloud Composer

This notebook demonstrates a common MLOps workflow on Google Cloud. It uses **Cloud Composer** (managed Apache Airflow)
to orchestrate data loading, transformation, and the execution of a **Vertex AI Pipeline**.

**Learning Objectives:**
1. Learn how to create a custom directed acyclic graph (DAG) for Cloud Composer
2. Learn how to use Airflow operators to trigger a Vertex AI Pipeline and monitor the job's status
3. Learn how to orchestrate Vertex AI Pipelines together with an existing ETL (Extract, Transform, Load) pipeline

**Important Notes:**
Airflow DAGs are deployed by uploading them to the DAGs folder in your Cloud Composer environment's Cloud Storage (GCS) bucket. Airflow then discovers and parses these files. You do *not* run the DAG code directly from this notebook to execute the Airflow workflow.

## Directed acyclic graph (DAG) overview and implementation details

The provided DAG performs the following steps:

1.  **Load Data to BigQuery** (`load_csv_to_bigquery`): Loads a CSV dataset from Google Cloud Storage (GCS) into a BigQuery table.
    ***This step emulates an ETL (Extract, Transform, Load) process that prepares data*** and loads it into BigQuery.
    * **Operator**: `GCSToBigQueryOperator`
    * **Purpose**: This task transfers a CSV file (`data/covertype/dataset.csv` from the `asl-public` bucket) to a specified BigQuery table (`airflow_demo_dataset.covertype`).
    * **Configuration**: It's configured to create the table if it doesn't exist and truncate it if it does, ensuring a fresh load for each run. It also handles skipping a header row.
    * **Trigger**: This is the initial task, executing first.

2.  **Export Data from BigQuery to GCS** (`bigquery_to_gcs_export`): Exports the processed data from BigQuery back to GCS. This step prepares the data in a format suitable for consumption by a Vertex AI Pipeline.
    * **Operator**: `BigQueryToGCSOperator`
    * **Purpose**: This task exports the data from the BigQuery table (`airflow_demo_dataset.covertype`) to a CSV file in GCS, at the path set in the `TRAINING_FILE_PATH` constant. This emulates a data preparation step where data is transformed and then made available for downstream ML processes.
    * **Trigger**: Executes once `load_csv_to_bigquery` successfully completes.

3.  **Run Vertex AI Pipeline** (`start_vertex_ai_pipeline`): Triggers a pre-compiled Kubeflow Pipeline (KFP) on Vertex AI using a YAML file stored in GCS. This pipeline can encapsulate various machine learning tasks like training, evaluation, and deployment.
    * **Operator**: `RunPipelineJobOperator`
    * **Purpose**: After the data is exported to GCS, this task triggers a new **Vertex AI Pipeline** job using the specified compiled pipeline YAML file from GCS.
    * **Parameter Passing**: It passes the GCS path of the exported training data (`TRAINING_FILE_PATH`) as the `training_file_path` parameter to the Kubeflow pipeline.
    * **Dynamic Naming**: The `display_name` is dynamically generated with a timestamp to ensure uniqueness for each pipeline run.
    * **XCom**: The `pipeline_job_id` of the triggered pipeline is pushed to XCom, allowing subsequent tasks to reference this specific job.
    * **Trigger**: Executes once `bigquery_to_gcs_export` successfully completes.

4.  **Get Vertex AI Pipeline Status** (`vertex_ai_pipline_status`): Retrieves the status and details of the Vertex AI Pipeline job started by the previous task.
    * **Operator**: `GetPipelineJobOperator`
    * **Purpose**: This task retrieves detailed information and the current status of the Vertex AI Pipeline job initiated by the previous task. It uses the `pipeline_job_id` pulled from XCom.
    * **Trigger**: Executes once `start_vertex_ai_pipeline` successfully completes.

5.  **Delete Vertex AI Pipeline Job** (`delete_vertex_ai_pipeline_job`): Cleans up by deleting the Vertex AI Pipeline job.
    * **Operator**: `DeletePipelineJobOperator`
    * **Purpose**: This task cleans up the Vertex AI Pipeline job by deleting it. This is important for managing resources and keeping your Vertex AI environment tidy.
    * **Trigger Rule**: `TriggerRule.ALL_DONE` ensures this task runs regardless of whether the preceding tasks succeeded or failed, as long as they have all completed their execution. This is a robust approach for cleanup tasks.
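    * **Trigger**: Executes once `vertex_ai_pipline_status` finishes in any state, including when it failed or was not run because an earlier task failed (due to the `ALL_DONE` trigger rule). If no pipeline job was started, there is no `pipeline_job_id` in XCom, so the delete task itself is expected to fail.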
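
The effect of the trigger rules on this linear chain can be illustrated with a small Airflow-free model. This is only a sketch of the documented semantics of `all_success` (the Airflow default) and `all_done`, not Airflow code; the `CHAIN` list and the `simulate` function are hypothetical names introduced for illustration.

```python
# Simplified, Airflow-free model of trigger rules on a linear task chain.
CHAIN = [
    ("load_csv_to_bigquery", "all_success"),
    ("bigquery_to_gcs_export", "all_success"),
    ("start_vertex_ai_pipeline", "all_success"),
    ("vertex_ai_pipline_status", "all_success"),
    ("delete_vertex_ai_pipeline_job", "all_done"),
]


def simulate(failing_task=None):
    """Map each task to 'success', 'failed', or 'upstream_failed'."""
    states = {}
    upstream = "success"  # nothing precedes the first task
    for task_id, trigger_rule in CHAIN:
        if trigger_rule == "all_success" and upstream != "success":
            state = "upstream_failed"  # the task is not executed
        elif task_id == failing_task:
            state = "failed"
        else:
            state = "success"  # the task was executed and did not raise
        states[task_id] = state
        upstream = state
    return states


for task_id, state in simulate(failing_task="bigquery_to_gcs_export").items():
    print(f"{task_id}: {state}")
```

With a failure in `bigquery_to_gcs_export`, the two Vertex AI tasks that follow are marked `upstream_failed`, while the cleanup task is still executed. The model only tracks whether a task is executed; it does not check that a pipeline job id exists.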

---

## Prerequisites

Before deploying and running this DAG, ensure you have the following:

* A **Google Cloud Project** with billing enabled.
* A **Cloud Composer environment** provisioned in your GCP project. This notebook assumes the environment was already created by following [Run an Apache Airflow DAG in Cloud Composer](https://cloud.google.com/composer/docs/composer-3/run-apache-airflow-dag). If you have not done so, create a Cloud Composer environment using those instructions.
* **Vertex AI API** enabled in your GCP project.
* **BigQuery API** enabled in your GCP project.
* A compiled **Kubeflow Pipeline YAML file** uploaded to a GCS bucket (e.g., `gs://your-bucket/covertype_kfp_pipeline.yaml`). This file should define all the steps of your Vertex AI Pipeline. It is recommended to create `covertype_kfp_pipeline.yaml` with the lab "Continuous Training with Kubeflow Pipeline and Vertex AI" (notebook `asl-ml-immersion/notebooks/kubeflow_pipelines/pipelines/solutions/kfp_pipeline_vertex_lightweight.ipynb`).

---

## Setup and Configuration

1.  **Update Placeholders**:
    In the DAG file created below (`composer_vertex_ai_pipelines.py`), replace the placeholder values with your specific project details:

    * `PROJECT_ID`: Replace `"my_project_id"` with your actual Google Cloud Project ID.
    * `GCS_VERTEX_AI_PIPELINE_YAML`: Replace `gs://.../covertype_kfp_pipeline.yaml` with the GCS path to your compiled Kubeflow Pipeline YAML file.
    * `TRAINING_FILE_PATH`: Update `gs://.../train_export.csv` to the desired GCS path for the exported training data.
    * `BIGQUERY_DATASET_ID`: Replace `airflow_demo_dataset` with the ID of your BigQuery dataset. The dataset must already exist: the load task creates the `covertype` table if needed (`CREATE_IF_NEEDED`), but it does not create the dataset.

2.  **Ensure IAM Permissions**:
    The service account associated with your Cloud Composer environment must have the necessary IAM roles to:
    * Read from and write to **BigQuery**.
    * Read from and write to **Cloud Storage**.
    * Create, run, and manage **Vertex AI Pipeline Jobs**.

    Recommended roles include:
    * `BigQuery Data Editor`
    * `Storage Object Admin`
    * `Vertex AI User`
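
Before uploading the DAG, it can help to confirm that none of the placeholders from step 1 were left unfilled. The check below is a minimal sketch, not part of the original DAG: the function `find_unfilled_settings`, the simplified bucket pattern, and the bucket name `example-bucket` are assumptions made for illustration.

```python
import re

# Loose pattern for "gs://<bucket>/<object>"; real bucket naming rules are stricter.
GCS_URI_PATTERN = re.compile(r"gs://[a-z0-9][a-z0-9._-]*[a-z0-9]/\S+")


def find_unfilled_settings(settings: dict[str, str]) -> list[str]:
    """Return the names of settings that still look like placeholders."""
    unfilled = []
    for name, value in settings.items():
        if "..." in value or value == "my_project_id":
            unfilled.append(name)  # elided path or default project id
        elif value.startswith("gs:") and not GCS_URI_PATTERN.fullmatch(value):
            unfilled.append(name)  # malformed Cloud Storage URI
    return unfilled


example = {
    "PROJECT_ID": "my_project_id",
    "GCS_VERTEX_AI_PIPELINE_YAML": "gs:// ... /dags/covertype_kfp_pipeline.yaml",
    "TRAINING_FILE_PATH": "gs://example-bucket/data/train_export.csv",  # hypothetical
    "BIGQUERY_DATASET_ID": "airflow_demo_dataset",
}
print(find_unfilled_settings(example))
```

In this example the project id and the pipeline YAML path are reported, because they still contain the default or elided values.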


In [7]:
# Create the ./dags folder (-p avoids an error if it already exists)
!mkdir -p dags


### Create composer_vertex_ai_pipelines.py file:

In [9]:
%%writefile ./dags/composer_vertex_ai_pipelines.py
# Copyright 2025 Google LLC

# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS"
# BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.
"""An example of using a Cloud Composer DAG for Vertex AI Pipelines integration."""

import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.vertex_ai.pipeline_job import (
    DeletePipelineJobOperator,
    GetPipelineJobOperator,
    RunPipelineJobOperator,
)
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import (
    BigQueryToGCSOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.utils.trigger_rule import TriggerRule

# Replace with your actual project and region
# TODO: Put your project id here
PROJECT_ID = "my_project_id"
REGION = "us-central1"

# TODO: Change path to the covertype_kfp_pipeline.yaml file:
GCS_VERTEX_AI_PIPELINE_YAML = "gs:// ... /dags/covertype_kfp_pipeline.yaml"

GCS_SOURCE_DATASET_PATH = "data/covertype/dataset.csv"
GCS_BUCKET_NAME = "asl-public"

# TODO: Put your BigQuery dataset id here:
BIGQUERY_DATASET_ID = "airflow_demo_dataset"
TABLE_ID = "covertype"

# TODO: Put path for a train dataset
TRAINING_FILE_PATH = "gs://.../train_export.csv"

BIGQUERY_TABLE_SCHEMA = (
    [
        {"name": "Elevation", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "Aspect", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "Slope", "type": "INTEGER", "mode": "NULLABLE"},
        {
            "name": "Horizontal_Distance_To_Hydrology",
            "type": "INTEGER",
            "mode": "NULLABLE",
        },
        {
            "name": "Vertical_Distance_To_Hydrology",
            "type": "INTEGER",
            "mode": "NULLABLE",
        },
        {
            "name": "Horizontal_Distance_To_Roadways",
            "type": "INTEGER",
            "mode": "NULLABLE",
        },
        {"name": "Hillshade_9am", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "Hillshade_Noon", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "Hillshade_3pm", "type": "INTEGER", "mode": "NULLABLE"},
        {
            "name": "Horizontal_Distance_To_Fire_Points",
            "type": "INTEGER",
            "mode": "NULLABLE",
        },
        {"name": "Wilderness_Area", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Soil_Type", "type": "STRING", "mode": "NULLABLE"},
        {"name": "Cover_Type", "type": "INTEGER", "mode": "NULLABLE"},
    ]  # no trailing comma: "(...,)" would make this a 1-tuple instead of a list
)

default_args = {
    "retries": 0,  # Disable retries
}

with DAG(
    dag_id="composer_vertex_ai_pipelines",
    start_date=datetime.datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    default_args=default_args,
    tags=["vertex_ai", "pipeline", "ml"],
) as dag:

    # Load dataset from GCS to BigQuery (Emulating basic ETL process)
    load_gcs_to_bigquery = GCSToBigQueryOperator(
        task_id="load_csv_to_bigquery",
        bucket=GCS_BUCKET_NAME,
        source_objects=[GCS_SOURCE_DATASET_PATH],
        destination_project_dataset_table=f"{BIGQUERY_DATASET_ID}.{TABLE_ID}",
        # Optional: Define schema, remove if auto-detect works for you
        schema_fields=BIGQUERY_TABLE_SCHEMA,
        # Or "NEWLINE_DELIMITED_JSON", "PARQUET", "AVRO", etc.
        source_format="CSV",
        # Creates the table if it doesn't exist
        create_disposition="CREATE_IF_NEEDED",
        # Overwrites the table if it exists. Use "WRITE_APPEND" to append.
        write_disposition="WRITE_TRUNCATE",
        skip_leading_rows=1,  # For CSVs with a header row
        field_delimiter=",",  # For CSVs
    )

    # Export dataset from BigQuery to GCS
    bigquery_to_gcs = BigQueryToGCSOperator(
        task_id="bigquery_to_gcs_export",
        source_project_dataset_table=f"{BIGQUERY_DATASET_ID}.{TABLE_ID}",
        destination_cloud_storage_uris=[TRAINING_FILE_PATH],  # expects a list of URIs
        export_format="CSV",
        print_header=True,
    )

    # Trigger the pipeline with a GCS compiled yaml file
    run_vertex_ai_pipeline = RunPipelineJobOperator(
        task_id="start_vertex_ai_pipeline",
        project_id=PROJECT_ID,
        region=REGION,
        template_path=GCS_VERTEX_AI_PIPELINE_YAML,
        # example of passing params to kubeflow pipeline to override default values:
        parameter_values={
            "training_file_path": TRAINING_FILE_PATH,
        },
        # Disable caching so every pipeline step re-runs on each trigger
        enable_caching=False,
        # Unique display name
        display_name="triggered-demo-pipeline-{{ ts_nodash }}",
    )

    # Get VertexAI pipeline job information
    get_vertexai_ai_pipline_status = GetPipelineJobOperator(
        task_id="vertex_ai_pipline_status",
        project_id=PROJECT_ID,
        region=REGION,
        pipeline_job_id="{{ task_instance.xcom_pull("
        "task_ids='start_vertex_ai_pipeline', "
        "key='pipeline_job_id') }}",
    )

    # Delete VertexAI pipeline job
    delete_pipeline_job = DeletePipelineJobOperator(
        task_id="delete_vertex_ai_pipeline_job",
        project_id=PROJECT_ID,
        region=REGION,
        pipeline_job_id="{{ task_instance.xcom_pull("
        "task_ids='start_vertex_ai_pipeline', "
        "key='pipeline_job_id') }}",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Combine all steps into a DAG
    (
        load_gcs_to_bigquery
        >> bigquery_to_gcs
        >> run_vertex_ai_pipeline
        >> get_vertexai_ai_pipline_status
        >> delete_pipeline_job
    )

Overwriting ./dags/composer_vertex_ai_pipelines.py
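
The status and delete tasks pass the same Jinja expression as `pipeline_job_id`. Python joins adjacent string literals into one string, so the three fragments in the DAG file form a single template that Airflow renders when the task runs. As a sketch, the expression could be defined once and reused in both operators; the constant name `XCOM_JOB_ID_TEMPLATE` is hypothetical and does not appear in the DAG file.

```python
# Adjacent string literals are concatenated by Python into one template string.
XCOM_JOB_ID_TEMPLATE = (
    "{{ task_instance.xcom_pull("
    "task_ids='start_vertex_ai_pipeline', "
    "key='pipeline_job_id') }}"
)

# The value is still an unrendered template; Airflow fills it in at run time.
print(XCOM_JOB_ID_TEMPLATE)
```

Both `GetPipelineJobOperator` and `DeletePipelineJobOperator` could then receive `pipeline_job_id=XCOM_JOB_ID_TEMPLATE`, which keeps the task id and XCom key in one place.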


## Airflow DAG Code
Inspect the saved Airflow DAG file (`./dags/composer_vertex_ai_pipelines.py`) before you upload it to your Cloud Composer environment.
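
One way to inspect the file without installing Airflow in the notebook kernel is to parse it with Python's standard `ast` module. The sketch below checks that the source is syntactically valid and lists the literal `task_id` values; the helper `list_task_ids` and the tiny `sample` source are hypothetical and only stand in for the real DAG file.

```python
import ast


def list_task_ids(dag_source: str) -> list[str]:
    """Return the literal task_id values in source order."""
    tree = ast.parse(dag_source)  # raises SyntaxError on malformed code
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for keyword in node.keywords:
                if keyword.arg == "task_id" and isinstance(keyword.value, ast.Constant):
                    found.append((node.lineno, keyword.value.value))
    return [task_id for _, task_id in sorted(found)]


# Tiny stand-in source; `Operator` is a placeholder name, nothing is imported.
sample = '''
load = Operator(task_id="load_csv_to_bigquery")
export = Operator(task_id="bigquery_to_gcs_export")
load >> export
'''
print(list_task_ids(sample))
```

To apply it to the saved file, pass the file contents instead, for example `list_task_ids(open("./dags/composer_vertex_ai_pipelines.py").read())`. Parsing only checks syntax and does not import Airflow, so it cannot catch wrong operator arguments.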

## Uploading the DAG to Cloud Composer Storage

To deploy this DAG to your Cloud Composer environment, you need to upload it to the DAGs folder in your Composer's associated Cloud Storage bucket.

**Before running this cell, make sure you have identified your Composer DAGs bucket.**
Its DAGs folder is typically named `gs://us-central1-YOUR_COMPOSER_ENV_NAME-HASH-bucket/dags/`.
You can find it in the Cloud Composer console.

In [1]:
# TODO: put your Cloud Composer Bucket here:
CLOUD_COMPOSER_BUCKET = "gs://...-bucket"
CLOUD_COMPOSER_DAGS_PATH = f"{CLOUD_COMPOSER_BUCKET}/dags"
%env CLOUD_COMPOSER_DAGS_PATH={CLOUD_COMPOSER_DAGS_PATH}

env: CLOUD_COMPOSER_DAGS_PATH=gs://...-bucket/dags


In [10]:
%%bash

gsutil cp ./dags/composer_vertex_ai_pipelines.py "$CLOUD_COMPOSER_DAGS_PATH/"

Copying file://./dags/composer_vertex_ai_pipelines.py [Content-Type=text/x-python]...
/ [1 files][  6.2 KiB/  6.2 KiB]                                                
Operation completed over 1 objects/6.2 KiB.                                      


## Running the DAG

You can trigger the DAG manually from the Airflow UI:

1.  Navigate to your Cloud Composer environment in the Google Cloud Console.
2.  Click on the "Airflow UI" link.
3.  In the Airflow UI, find the `composer_vertex_ai_pipelines` DAG.
4.  Toggle the DAG to "On" if it's not already.
5.  Click the "Trigger DAG" button.

You can also schedule the DAG by changing the `schedule` parameter in the DAG definition, which is currently set to `None` (manual triggering only), to a value such as a cron expression.
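
As a sketch of a scheduled configuration, the dictionary below collects keyword arguments that could be passed as `DAG(**scheduled_dag_kwargs)`. It assumes an Airflow version that accepts the `schedule` argument, which the DAG file in this notebook already uses; the variable name and the cron value are only illustrations.

```python
import datetime

# Hypothetical keyword arguments for the DAG(...) call; only `schedule`
# differs from the values used in the DAG file.
scheduled_dag_kwargs = {
    "dag_id": "composer_vertex_ai_pipelines",
    "start_date": datetime.datetime(2025, 1, 1),
    "schedule": "0 6 * * *",  # cron expression: every day at 06:00
    "catchup": False,  # do not backfill runs between start_date and now
}

print(scheduled_dag_kwargs["schedule"])
```

Keeping `catchup=False` matters once a schedule is set, because otherwise Airflow may create runs for every missed interval since `start_date`.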

---

## Monitoring

Monitor the DAG run from the Airflow UI. You can view the status of each task, logs, and XCom values. For Vertex AI Pipeline job details, you can refer to the Vertex AI section in the Google Cloud Console.

---


Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.