# Using Cloud Composer to orchestrate Kubeflow pipeline on Vertex AI

**Learning Objectives:**
1. Learn how to create a custom DAG for Cloud Composer to trigger and check the status of a Kubeflow pipeline on Vertex AI
1. Learn how to write a Cloud Build config file to build and push all the artifacts for a KFP pipeline
1. Learn how to set up a Cloud Build GitHub trigger to start a new run of the Kubeflow pipeline

## Uploading an Airflow DAG to Cloud Composer Storage via Jupyter (Illustrative)

This notebook demonstrates the Python code for an Airflow DAG and shows how you *would* programmatically interact with Google Cloud Storage to upload files.

**Important Notes:**
1.  **DAG Execution:** Airflow DAGs are typically uploaded directly to your Cloud Composer environment's GCS DAGs folder. Airflow workers then discover and parse these files. You do *not* run the DAG code directly from this notebook to execute the Airflow workflow.
2.  **Authentication:** To run the GCS upload code, ensure your Jupyter environment (e.g., Colab, a local Jupyter server with `gcloud` authenticated, or a Vertex AI Workbench instance) has the necessary Google Cloud credentials and permissions to write to your Composer DAGs bucket.
3.  **Cloud Composer DAGs Folder:** The target GCS path for DAGs in Cloud Composer is usually `gs://YOUR_COMPOSER_BUCKET/dags/`.

In [None]:
import os

## Configuring environment settings

In [None]:
PROJECT_ID = !(gcloud config get-value project)
PROJECT_ID = PROJECT_ID[0]
REGION = "us-central1"
ARTIFACT_STORE = f"gs://{PROJECT_ID}-kfp-artifact-store"
os.environ["REGION"] = REGION
os.environ["ARTIFACT_STORE"] = ARTIFACT_STORE
VERTEX_AI_PIPELINE_YAML = "gs://your-bucket-name/path/to/covertype_kfp_pipeline.yaml" # TODO: Update path to your compiled KFP YAML
GCS_SOURCE_DATASET_PATH = "data/covertype/dataset.csv"
BIGQUERY_DATASET_ID = "airflow_demo_dataset"
TABLE_ID = "covertype"

## Airflow DAG Code
Below is the Airflow DAG (`demo_vertex_ai_pipeline_integration.py`) that you intend to upload to your Cloud Composer environment.

In [None]:
airflow_dag_code = f"""
import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.vertex_ai.pipeline_job import (
    DeletePipelineJobOperator,
    GetPipelineJobOperator,
    RunPipelineJobOperator,
)
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import (
    BigQueryToGCSOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)
from airflow.utils.trigger_rule import TriggerRule

# Replace with your actual project and region
PROJECT_ID = "{PROJECT_ID}"  # Update with your Project ID
REGION = "{REGION}"

# Path to a compiled kubeflow pipeline yaml
VERTEX_AI_PIPELINE_YAML = "{VERTEX_AI_PIPELINE_YAML}" # Update path to your compiled KFP YAML

GCS_SOURCE_DATASET_PATH = "{GCS_SOURCE_DATASET_PATH}"
GCS_BUCKET_NAME = "asl-public" # Public bucket; use your own bucket name if you use your own data

GCS_TRAIN_DATASET_PATH = "gs://your-bucket-name/data/train_export.csv" # IMPORTANT: update this path for the exported training data

# Put your BigQuery dataset id here:
BIGQUERY_DATASET_ID = "airflow_demo_dataset"
TABLE_ID = "covertype"

BIGQUERY_TABLE_SCHEMA = [
    {{"name": "Elevation", "type": "INTEGER", "mode": "NULLABLE"}},
    {{"name": "Aspect", "type": "INTEGER", "mode": "NULLABLE"}},
    {{"name": "Slope", "type": "INTEGER", "mode": "NULLABLE"}},
    {{
        "name": "Horizontal_Distance_To_Hydrology",
        "type": "INTEGER",
        "mode": "NULLABLE",
    }},
    {{
        "name": "Vertical_Distance_To_Hydrology",
        "type": "INTEGER",
        "mode": "NULLABLE",
    }},
    {{
        "name": "Horizontal_Distance_To_Roadways",
        "type": "INTEGER",
        "mode": "NULLABLE",
    }},
    {{"name": "Hillshade_9am", "type": "INTEGER", "mode": "NULLABLE"}},
    {{"name": "Hillshade_Noon", "type": "INTEGER", "mode": "NULLABLE"}},
    {{"name": "Hillshade_3pm", "type": "INTEGER", "mode": "NULLABLE"}},
    {{
        "name": "Horizontal_Distance_To_Fire_Points",
        "type": "INTEGER",
        "mode": "NULLABLE",
    }},
    {{"name": "Wilderness_Area", "type": "STRING", "mode": "NULLABLE"}},
    {{"name": "Soil_Type", "type": "STRING", "mode": "NULLABLE"}},
    {{"name": "Cover_Type", "type": "INTEGER", "mode": "NULLABLE"}},
]

with DAG(
    dag_id="demo_vertex_ai_pipeline_integration",
    start_date=datetime.datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["vertex_ai", "pipeline", "ml"],
    params={{
        "gcs_train_dataset_path": GCS_TRAIN_DATASET_PATH,
    }},
) as dag:

    # Loading dataset from GCS to BigQuery (Emulating basic ETL process)
    load_gcs_to_bigquery = GCSToBigQueryOperator(
        task_id="load_csv_to_bigquery",
        bucket=GCS_BUCKET_NAME,
        source_objects=[GCS_SOURCE_DATASET_PATH],
        destination_project_dataset_table=f"{{BIGQUERY_DATASET_ID}}.{{TABLE_ID}}",
        # Optional: Define schema, remove if auto-detect works for you
        schema_fields=BIGQUERY_TABLE_SCHEMA,
        # Or "NEWLINE_DELIMITED_JSON", "PARQUET", "AVRO", etc.
        source_format="CSV",
        # Creates the table if it doesn't exist
        create_disposition="CREATE_IF_NEEDED",
        # Overwrites the table if it exists. Use "WRITE_APPEND" to append.
        write_disposition="WRITE_TRUNCATE",
        skip_leading_rows=1,  # For CSVs with a header row
        field_delimiter=",",  # For CSVs
    )

    # Exporting dataset from BigQuery to GCS
    bigquery_to_gcs = BigQueryToGCSOperator(
        task_id="bigquery_to_gcs_export",
        source_project_dataset_table=f"{{BIGQUERY_DATASET_ID}}.{{TABLE_ID}}",
        # The operator expects a list of destination URIs
        destination_cloud_storage_uris=[GCS_TRAIN_DATASET_PATH],
        export_format="CSV",
        print_header=True,
    )

    # Triggering a pipeline from a GCS compiled yaml file
    run_vertex_ai_pipeline = RunPipelineJobOperator(
        task_id="start_vertex_ai_pipeline",
        project_id=PROJECT_ID,
        region=REGION,
        template_path=VERTEX_AI_PIPELINE_YAML,
        # example of passing params to kubeflow pipeline
        parameter_values={{
            "training_file_path": "{{{{ params.gcs_train_dataset_path }}}}",
        }},
        # Unique display name
        display_name="triggered-demo-pipeline-{{{{ ts_nodash }}}}",
    )

    # Fetching VertexAI pipeline job information
    get_vertexai_ai_pipline_status = GetPipelineJobOperator(
        task_id="vertex_ai_pipline_status",
        project_id=PROJECT_ID,
        region=REGION,
        pipeline_job_id="{{{{ task_instance.xcom_pull("
        "task_ids='start_vertex_ai_pipeline', "
        "key='pipeline_job_id') }}}}",
    )

    # Deleting VertexAI pipeline job
    delete_pipeline_job = DeletePipelineJobOperator(
        task_id="delete_vertex_ai_pipeline_job",
        project_id=PROJECT_ID,
        region=REGION,
        pipeline_job_id="{{{{ task_instance.xcom_pull("
        "task_ids='start_vertex_ai_pipeline', "
        "key='pipeline_job_id') }}}}",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    # Combine all steps into a DAG
    (
        load_gcs_to_bigquery
        >> bigquery_to_gcs
        >> run_vertex_ai_pipeline
        >> get_vertexai_ai_pipline_status
        >> delete_pipeline_job
    )
"""
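
When DAG source is assembled inside an f-string, as in the cell above, Python substitutes anything written in single braces. Literal braces that must reach the generated file, such as dictionary literals or Airflow's Jinja templates, therefore need to be doubled. The sketch below illustrates this on a tiny snippet; `project_id` and the snippet contents are hypothetical stand-ins, not part of the DAG.

```python
# Hypothetical value standing in for the notebook's PROJECT_ID.
project_id = "my-project"

# {project_id}   -> substituted by Python when the f-string is evaluated
# {{ ... }}      -> becomes a single literal brace pair (a dict literal here)
# {{{{ ... }}}}  -> becomes a literal {{ ... }} that Airflow renders with Jinja
snippet = f"""
PROJECT_ID = "{project_id}"
SCHEMA = [{{"name": "Elevation", "type": "INTEGER"}}]
DISPLAY_NAME = "triggered-demo-pipeline-{{{{ ts_nodash }}}}"
"""

print(snippet)

# The rendered text is itself valid Python, so it can be executed as a check.
namespace = {}
exec(snippet, namespace)
```

After rendering, `DISPLAY_NAME` still contains `{{ ts_nodash }}`, which is the form Airflow needs in order to fill in the value at run time.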

This notebook assumes the dataset has already been created and stored in Google Cloud Storage, following the instructions in the [walkthrough notebook](https://github.com/GoogleCloudPlatform/asl-ml-immersion/blob/master/notebooks/kubeflow_pipelines/walkthrough/solutions/kfp_walkthrough_vertex.ipynb).

If you haven't run that notebook, run the cell below to copy the dataset into your artifact store before running the pipeline.

In [None]:
%%bash
gsutil cp gs://asl-public/data/covertype/training/dataset.csv $ARTIFACT_STORE/data/training/dataset.csv
gsutil cp gs://asl-public/data/covertype/validation/dataset.csv $ARTIFACT_STORE/data/validation/dataset.csv

Let's define the name of the GCS bucket used to save the build log. The cell below only sets the name; it does not create the bucket.

In [None]:
BUCKET = PROJECT_ID + "-cicd-log"
os.environ["BUCKET"] = BUCKET

## Saving the DAG to a Local File
First, let's save the DAG code to a local Python file.

In [None]:
dag_filename = "demo_vertex_ai_pipeline_integration.py"

with open(dag_filename, "w") as f:
    f.write(airflow_dag_code)

print(f"DAG saved locally as {dag_filename}")
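
Because the file was generated from a string, it may be worth confirming that it parses as Python before uploading it. The following is a minimal sketch using only the standard library; the helper name `has_valid_syntax` and the throwaway files are hypothetical. It checks syntax only and does not import Airflow or validate operator arguments.

```python
import ast
import tempfile
from pathlib import Path


def has_valid_syntax(path):
    """Return True if the file parses as Python, False on a syntax error."""
    try:
        ast.parse(Path(path).read_text(), filename=str(path))
        return True
    except SyntaxError:
        return False


# Demonstration on throwaway files (hypothetical contents, not the real DAG).
with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good_dag.py"
    good.write_text("x = {'a': 1}\n")
    bad = Path(tmp_dir) / "bad_dag.py"
    bad.write_text("x = {'a': 1\n")  # missing closing brace
    good_ok = has_valid_syntax(good)
    bad_ok = has_valid_syntax(bad)

print(good_ok, bad_ok)
```

For the file saved above, the same check would be `has_valid_syntax(dag_filename)`.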

## Uploading the DAG to Cloud Composer Storage

To deploy this DAG to your Cloud Composer environment, you need to upload it to the DAGs folder in the Cloud Storage bucket associated with that environment.

**Before uploading, make sure you have:**
1.  **Installed Google Cloud Storage client library:** `pip install google-cloud-storage`
2.  **Authenticated:** Your environment needs to be authenticated to GCP (e.g., `gcloud auth application-default login` or running in a GCP VM/Cloud Run/Vertex AI Workbench).
3.  **Identified your Composer DAGs bucket:** This is typically named `gs://us-central1-YOUR_COMPOSER_ENV_NAME-HASH-bucket/dags/`. You can find this in the Cloud Composer console.
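
With those prerequisites in place, the upload itself could look roughly like the sketch below. It uses the `google-cloud-storage` client library; the helper names `split_gcs_uri` and `upload_dag` are hypothetical, and the DAGs folder URI is a placeholder you would replace with your own. The client library is imported inside `upload_dag` so that the path-handling helper can be used without it.

```python
import os
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/some/prefix/' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a valid GCS URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def upload_dag(local_path, dags_folder_uri, project_id=None):
    """Upload a local DAG file into the given DAGs folder and return its URI."""
    # Imported here so the rest of the module works without the library.
    from google.cloud import storage

    bucket_name, prefix = split_gcs_uri(dags_folder_uri)
    file_name = os.path.basename(local_path)
    blob_name = f"{prefix}/{file_name}" if prefix else file_name

    client = storage.Client(project=project_id)
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


# Placeholder URI: replace with your environment's DAGs folder before running.
# uploaded_uri = upload_dag(
#     "demo_vertex_ai_pipeline_integration.py",
#     "gs://YOUR_COMPOSER_BUCKET/dags/",
# )
# print(f"DAG uploaded to {uploaded_uri}")
```

The upload call is left commented out because it needs valid credentials and a real bucket. Once the file is in the DAGs folder, Airflow picks it up on its next scan of that folder.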

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.