In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Pipelines: Training and batch prediction with BigQuery source and destination for a custom tabular classification model 

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/custom_tabular_train_batch_pred_bq_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fvertex-ai-samples%2Fmain%2Fnotebooks%2Fofficial%2Fpipelines%2Fcustom_tabular_train_batch_pred_bq_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-enterprise-logo-32px.png" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/custom_tabular_train_batch_pred_bq_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
<a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/pipelines/custom_tabular_train_batch_pred_bq_pipeline.ipynb" target='_blank'>
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
     </a>
   </td>
</table>
<br/><br/><br/><br/>

*Note: This notebook uses KFP 1.x and GCPC 1.x. We recommend using 2.x*

## Overview

This notebook demonstrates performing training and batch prediction for a custom tabular classification model inside a Vertex AI pipeline. The batch prediction job takes data from a BigQuery source and writes the results to a BigQuery destination.

Learn more about [Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction) and [Vertex AI Batch Prediction components](https://cloud.google.com/vertex-ai/docs/pipelines/batchprediction-component).

### Objective

In this tutorial, you train a scikit-learn tabular classification model and create a batch prediction job for it through a Vertex AI pipeline built with the `google_cloud_pipeline_components` library. Both the source and the destination data for the batch prediction job are stored in BigQuery.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Pipelines
- Vertex AI dataset
- Vertex AI Training
- Vertex AI Model Registry
- Vertex AI batch prediction

The steps performed include:

- Create a dataset in BigQuery.
- Set some data aside from the source dataset for batch prediction.
- Create a custom Python package for the training application.
- Upload the Python package to Cloud Storage.
- Create a Vertex AI Pipeline that:
    - creates a Vertex AI dataset from the source dataset.
    - trains a scikit-learn RandomForest classification model on the dataset.
    - uploads the trained model to Vertex AI Model Registry.
    - runs a batch prediction job with the model on the test data.
- Check the prediction results from the destination table in BigQuery.
- Clean up the resources created in this notebook.

### Dataset

The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this notebook uses for training is available publicly at the BigQuery location bigquery-public-data.ml_datasets.census_adult_income. It consists of the following fields:

- age: Age.
- workclass: Nature of employment.
- functional_weight: Sample weight of the individual from the original Census data. How likely they were to be included in this dataset, based on their demographic characteristics vs. whole-population estimates.
- education: Level of education completed.
- education_num: Estimated years of education completed based on the value of the education field.
- marital_status: Marital status.
- occupation: Occupation category.
- relationship: Relationship to the household.
- race: Race.
- sex: Gender.
- capital_gain: Amount of capital gains.
- capital_loss: Amount of capital loss.
- hours_per_week: Hours worked per week.
- native_country: Country of birth.
- income_bracket: Either " >50K" or " <=50K" based on income.

### Costs 
This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), 
[BigQuery pricing](https://cloud.google.com/bigquery/pricing), 
[Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the 
[Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Get started


### Install the Vertex AI SDK for Python and other required packages


In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform \
                                 google-cloud-bigquery \
                                 pandas \
                                 pyarrow \
                                 'kfp<2' \
                                 'google-cloud-pipeline-components<2' \
                                 db-dtypes 

### Restart runtime (Colab only)

To use the newly installed packages, you must restart the runtime on Google Colab.


In [None]:
import sys

if "google.colab" in sys.modules:

    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Wait until it's finished before continuing to the next step. ⚠️</b>
</div>

### Authenticate your notebook environment (Colab only)
Authenticate your environment on Google Colab.


In [None]:
import sys

if "google.colab" in sys.modules:

    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project. Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**If your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $LOCATION -p $PROJECT_ID $BUCKET_URI

#### Service Account

You use a service account to create Vertex AI Pipeline jobs. If you don't want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.
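
The default Compute Engine service account email follows the pattern `<project-number>-compute@developer.gserviceaccount.com`. As a minimal sketch, the helper below derives that email from the lines printed by `gcloud projects describe`. The helper name `default_compute_service_account` and the sample output values are hypothetical and only illustrate the idea; the sketch assumes the output contains a `projectNumber:` line.

```python
def default_compute_service_account(describe_lines):
    """Build the default Compute Engine service account email from the
    lines printed by `gcloud projects describe` (assumed YAML-like)."""
    for line in describe_lines:
        # Split only on the first colon so values containing colons survive
        key, _, value = line.partition(":")
        if key.strip() == "projectNumber":
            project_number = value.strip().strip("'\"")
            return f"{project_number}-compute@developer.gserviceaccount.com"
    raise ValueError("projectNumber not found in the gcloud output")


# Illustrative output lines (all values are made up)
sample_output = [
    "createTime: '2022-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]
service_account = default_compute_service_account(sample_output)
print(service_account)
```

Looking up the `projectNumber` key by name, rather than by line position, keeps the parsing independent of the order in which the fields are printed.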

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries

Import the Vertex AI Python SDK and other required Python libraries.

In [None]:
from google.cloud import aiplatform, bigquery
from kfp.dsl import pipeline
from kfp.v2 import compiler

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

# Initialize BigQuery client
bq_client = bigquery.Client(
    project=PROJECT_ID,
    credentials=aiplatform.initializer.global_config.credentials,
)

### Define constants

Set the constants that you need for training the model and for creating and running the pipeline.

In [None]:
# Source of the dataset
DATA_SOURCE = "bq://bigquery-public-data.ml_datasets.census_adult_income"
# Set name for the managed Vertex AI dataset
DATASET_DISPLAY_NAME = "adult_census_dataset_unique"
# BigQuery Dataset name
BQ_DATASET_ID = "income_prediction_unique1"
# Set name for the BigQuery source table for batch prediction
BQ_INPUT_TABLE = "income_test_data_unique"
# Set the fraction of the data to use for the train set
TRAIN_SPLIT = 0.9
# Provide the container for training the model
TRAINING_CONTAINER = "us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest"
# Provide the container for serving the model
SERVING_CONTAINER = "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-23:latest"
# Set the display name for training job
TRAINING_JOB_DISPLAY_NAME = "income_classify_train_job_unique"
# Model display name for Vertex AI Model Registry
MODEL_DISPLAY_NAME = "income_classify_model_unique"
# Set the name for batch prediction job
BATCH_PREDICTION_JOB_NAME = "income_classify_batch_pred_unique"
# Display name for the Vertex AI Pipeline
PIPELINE_DISPLAY_NAME = "income_classify_batch_pred_pipeline_unique"
# Filename to compile the pipeline to
PIPELINE_FILE_NAME = f"{PIPELINE_DISPLAY_NAME}.json"

## Create a BigQuery dataset
In this tutorial, both the input and the output of the batch prediction job are stored in BigQuery, so you create a BigQuery dataset to hold them.

In [None]:
# Create a BQ dataset
bq_dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET_ID}")
bq_dataset = bq_client.create_dataset(bq_dataset)
print(f"Created dataset {bq_client.project}.{bq_dataset.dataset_id}")

## Create test data for batch prediction

Query the public dataset source and create a test set in the created BigQuery dataset.

For batch prediction, the test set is created by randomly selecting a small fraction (1 - TRAIN_SPLIT) of the source dataset.
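
To see why filtering on a uniform random value keeps roughly `1 - TRAIN_SPLIT` of the rows, here is a minimal sketch that simulates the same rule in plain Python. The function name `is_test_row`, the row count, and the seed are assumptions made for illustration; the actual selection happens in BigQuery with `RAND()`.

```python
import random

TRAIN_SPLIT = 0.9  # same value as the notebook constant


def is_test_row(pseudo_random: float, train_split: float = TRAIN_SPLIT) -> bool:
    """Mirror the SQL filter: keep a row when its random value exceeds the split."""
    return pseudo_random > train_split


# Simulate drawing one uniform value in [0, 1) per row
rng = random.Random(0)
n_rows = 100_000
n_test = sum(is_test_row(rng.random()) for _ in range(n_rows))
test_fraction = n_test / n_rows
print(f"Fraction of rows kept for batch prediction: {test_fraction:.3f}")
```

Because the selection is random, the exact number of rows in the test table differs from run to run; only the expected fraction is fixed.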

In [None]:
# Query to create a test set from the source table
query = f"""
CREATE OR REPLACE TABLE
  `{PROJECT_ID}.{BQ_DATASET_ID}.{BQ_INPUT_TABLE}` AS

SELECT
  * EXCEPT (pseudo_random, income_bracket)
FROM (
  SELECT
    *,
    RAND() AS pseudo_random 
  FROM
    `bigquery-public-data.ml_datasets.census_adult_income` )
WHERE pseudo_random > {TRAIN_SPLIT}
"""
# Run the query and wait for the table to be created
_ = bq_client.query(query).result()

## Create a Python package for your training application

Before you perform the batch prediction task, you train the Random Forest classification model on the census income dataset. The training runs in a prebuilt container in Vertex AI, so you package the training application in the following steps.

Learn more about [creating a Python training application for a prebuilt container](https://cloud.google.com/vertex-ai/docs/training/create-python-prebuilt-container).

### Prepare the source directory

Create a source directory named `python_package` with a `trainer` subfolder inside. Next, create an `__init__.py` file in the `trainer` folder to make it a package.
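
If you prefer to build the same layout from Python rather than shell commands, a minimal sketch with `pathlib` could look like the following. The helper name `create_trainer_layout` is hypothetical, and the sketch writes into a temporary directory so that it doesn't touch your working directory.

```python
import tempfile
from pathlib import Path


def create_trainer_layout(root: Path) -> Path:
    """Create <root>/python_package/trainer/__init__.py and return the trainer folder."""
    trainer_dir = root / "python_package" / "trainer"
    # parents=True creates python_package as well; exist_ok makes reruns safe
    trainer_dir.mkdir(parents=True, exist_ok=True)
    (trainer_dir / "__init__.py").touch()
    return trainer_dir


with tempfile.TemporaryDirectory() as tmp:
    trainer_dir = create_trainer_layout(Path(tmp))
    created = sorted(p.name for p in trainer_dir.iterdir())
    print(created)
```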

In [None]:
!mkdir -p python_package
!mkdir -p python_package/trainer
!touch python_package/trainer/__init__.py

### Create the trainer task
Within trainer/, create a module named task.py that serves as the entrypoint for your training code.

The trainer code below preprocesses the train set and stores the preprocessing transforms in a scikit-learn pipeline. A [Random Forest model](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is then trained on the preprocessed train data and added as an estimator to the pipeline. After the pipeline is saved locally, it's uploaded to the Cloud Storage bucket for deployment.

An advantage of using a scikit-learn pipeline is that it saves you from the trouble of writing additional scripts for preprocessing the data while generating predictions. 

Learn more about [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
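
The trainer selects individual columns by assigning a hand-built 0/1 scores array to `SelectKBest`. As a minimal sketch of how those arrays line up with the column order, the plain-Python helper below builds one mask per selection. The helper name `selection_mask` is hypothetical and is not part of the trainer code; it only illustrates the layout of the arrays.

```python
COLUMNS = [
    "age", "workclass", "functional_weight", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
]
CATEGORICAL_COLUMNS = {
    "workclass", "education", "marital_status", "occupation",
    "relationship", "race", "sex", "native_country",
}


def selection_mask(selected, columns=COLUMNS):
    """Return a 0/1 list with a 1 at the position of every selected column."""
    return [1 if col in selected else 0 for col in columns]


# One mask per categorical column: a single 1 at that column's position
workclass_mask = selection_mask({"workclass"})
# One mask for all numerical columns together
numerical_mask = selection_mask(set(COLUMNS) - CATEGORICAL_COLUMNS)

print(workclass_mask)
print(numerical_mask)
```

The numerical mask contains six ones, which matches the `k=6` used for the numerical selector in the trainer.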

In [None]:
%%writefile python_package/trainer/task.py
import os
import joblib
import argparse
from google.cloud import storage
from google.cloud import bigquery
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

# Read environment variables
PROJECT = os.getenv("CLOUD_ML_PROJECT_ID")
TRAINING_DATA_URI = os.getenv("AIP_TRAINING_DATA_URI")

# Create the BigQuery and Cloud Storage clients
bq_client = bigquery.Client(project=PROJECT)
storage_client = storage.Client(project=PROJECT)

# Define the constants
TARGET = 'income_bracket'
ARTIFACTS_PATH = os.getenv("AIP_MODEL_DIR")
# Get the bucket name from the model dir
BUCKET_NAME = ARTIFACTS_PATH.replace("gs://","").split("/")[0]

MODEL_FILENAME = 'model.joblib' 
# Define the format of your input data, excluding the target column.
# These are the columns from the census data files.
COLUMNS = [
    'age',
    'workclass',
    'functional_weight',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country'
]
# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = [
    'workclass',
    'education',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native_country'
]

# Function to fetch the data from BigQuery
def download_table(bq_table_uri: str):
    prefix = "bq://"
    if bq_table_uri.startswith(prefix):
        bq_table_uri = bq_table_uri[len(prefix):]

    table = bigquery.TableReference.from_string(bq_table_uri)
    rows = bq_client.list_rows(
        table,
    )
    return rows.to_dataframe(create_bqstorage_client=False)

# Function to upload local files to GCS
def upload_model(bucket_name: str,
                filename: str):
    # Upload the saved model file to GCS
    bucket = storage_client.get_bucket(bucket_name)
    storage_path = os.path.join(ARTIFACTS_PATH, filename)
    blob = storage.blob.Blob.from_string(storage_path, client=storage_client)
    blob.upload_from_filename(filename)
    

if __name__ == '__main__':
    # Load the training data
    X_train = download_table(TRAINING_DATA_URI)

    # Remove the column we are trying to predict ('income_bracket') from our features list
    # Convert the DataFrame to a list of lists
    train_features = X_train.drop(TARGET, axis=1).to_numpy().tolist()
    # Create our training labels list from the target column
    train_labels = X_train[TARGET].to_numpy().tolist()

    # Since the census data set has categorical features, we need to convert
    # them to numerical values. We use a list of pipelines to convert each
    # categorical column and then use FeatureUnion to combine them before calling
    # the RandomForestClassifier.
    categorical_pipelines = []

    # Each categorical column needs to be extracted individually and converted to a numerical value.
    # To do this, each categorical column uses a pipeline that extracts one feature column via
    # SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.
    # A scores array (created below) selects and extracts the feature column. The scores array is
    # created by iterating over the COLUMNS and checking if it's a CATEGORICAL_COLUMN.
    for i, col in enumerate(COLUMNS):
        if col in CATEGORICAL_COLUMNS:
            # Create a scores array to get the individual categorical column.
            # Example:
            #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical',
            #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']
            #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
            #
            # Returns: [['State-gov']]
            scores = []
            # Build the scores array
            for j in range(len(COLUMNS)):
                if i == j: # This column is the categorical column we want to extract.
                    scores.append(1) # Set to 1 to select this column
                else: # Every other column should be ignored.
                    scores.append(0)
            skb = SelectKBest(k=1)
            skb.scores_ = scores
            # Convert the categorical column to a numerical value
            lbn = LabelBinarizer()
            r = skb.transform(train_features)
            lbn.fit(r)
            # Create the pipeline to extract the categorical feature
            categorical_pipelines.append(
                ('categorical-{}'.format(i), Pipeline([
                    ('SKB-{}'.format(i), skb),
                    ('LBN-{}'.format(i), lbn)])))

    # Create pipeline to extract the numerical features
    skb = SelectKBest(k=6)
    # From COLUMNS use the features that are numerical
    skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
    categorical_pipelines.append(('numerical', skb))

    # Combine all the features using FeatureUnion
    preprocess = FeatureUnion(categorical_pipelines)

    # Create the classifier
    classifier = RandomForestClassifier()

    # Transform the features and fit them to the classifier
    classifier.fit(preprocess.transform(train_features), train_labels)

    # Create the overall model as a single pipeline
    pipeline = Pipeline([
        ('union', preprocess),
        ('classifier', classifier)
    ])

    # Save the pipeline locally
    joblib.dump(pipeline, MODEL_FILENAME)
    
    # Upload the locally saved model to GCS
    upload_model(bucket_name = BUCKET_NAME, 
                 filename=MODEL_FILENAME
                )

### Create a setup file
Create a setup.py file that tells Setuptools how to create the source distribution. You also specify your application's standard dependencies as part of the setup.py file. Vertex AI uses pip to install your training application on the replicas that it allocates for your job. 

Learn more about [Setuptools](https://setuptools.readthedocs.io/en/latest/).

In [None]:
%%writefile python_package/setup.py

from setuptools import find_packages
from setuptools import setup

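REQUIRED_PACKAGES = ['pandas','pyarrow']

setup(
    name='trainer',
    version='0.1',
    # Install the dependencies listed above along with the trainer package
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='My training application.'
)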

### Create the source distribution

Run the following command to create a source distribution, dist/trainer-0.1.tar.gz.

In [None]:
!cd python_package && python3 setup.py sdist --formats=gztar

### Copy the source distribution to Cloud Storage

To train the custom classification model using a prebuilt container, copy the source distribution of your training application to a Cloud Storage path. During training, the Vertex AI SDK locates the package through the `python_package_gcs_uri` parameter.
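
The source distribution file name is derived from the `name` and `version` set in setup.py, so the Cloud Storage URI of the package can be assembled from those values. The following is a minimal sketch; the helper name `sdist_gcs_uri` and the bucket name `gs://my-bucket` are hypothetical, and the `training_package` folder matches the copy destination used in this notebook.

```python
def sdist_gcs_uri(bucket_uri: str, name: str, version: str,
                  folder: str = "training_package") -> str:
    """Build the Cloud Storage URI of a gztar source distribution."""
    # Strip a trailing slash so the URI never contains a double slash
    return f"{bucket_uri.rstrip('/')}/{folder}/{name}-{version}.tar.gz"


package_uri = sdist_gcs_uri("gs://my-bucket", "trainer", "0.1")
print(package_uri)
```

If you change the package name or version in setup.py, the URI that you pass as `python_package_gcs_uri` has to change accordingly.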

In [None]:
!gsutil cp -r python_package/dist/* $BUCKET_URI/training_package/

## Create and run the pipeline

The preparations for your pipeline are now complete. In this step, you create a Vertex AI Pipeline made up of the following components, which run in the order listed:

- TabularDatasetCreateOp: Creates a new managed tabular dataset in Vertex AI. 
- CustomPythonPackageTrainingJobRunOp: Creates and runs a custom training job in Vertex AI using a Python package.
- ModelBatchPredictOp: Creates a batch prediction job in Vertex AI and waits for it to complete.

All the above components are imported from the google-cloud-pipeline-components Python library. Learn more about [Google Cloud Pipeline Components](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.26/index.html).
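
The order of the components follows from their data dependencies: the training component consumes the dataset that the first component creates, and the batch prediction component consumes the trained model. As a rough illustration only (this is not how KFP is implemented), the sketch below uses the standard-library `graphlib` module (Python 3.9 or later) to derive the same order from a hand-written dependency map.

```python
from graphlib import TopologicalSorter

# Map each component to the set of components whose outputs it consumes
dependencies = {
    "TabularDatasetCreateOp": set(),
    "CustomPythonPackageTrainingJobRunOp": {"TabularDatasetCreateOp"},
    "ModelBatchPredictOp": {"CustomPythonPackageTrainingJobRunOp"},
}

# static_order() yields every component after all of its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())
for step, component in enumerate(execution_order, start=1):
    print(step, component)
```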

In [None]:
# Define the pipeline


@pipeline(name="custom-model-bq-batch-prediction-pipeline")
def custom_model_bq_batch_prediction_pipeline(
    project: str,
    location: str,
    dataset_display_name: str,
    dataset_bq_source: str,
    training_job_dispaly_name: str,
    gcs_staging_directory: str,
    python_package_gcs_uri: str,
    python_package_module_name: str,
    training_split: float,
    test_split: float,
    training_container_uri: str,
    serving_container_uri: str,
    training_bigquery_destination: str,
    model_display_name: str,
    batch_prediction_display_name: str,
    batch_prediction_instances_format: str,
    batch_prediction_predictions_format: str,
    batch_prediction_source_uri: str,
    batch_prediction_destination_uri: str,
    batch_prediction_machine_type: str = "n1-standard-4",
    batch_prediction_batch_size: int = 1000,
):
    from google_cloud_pipeline_components.aiplatform import (
        CustomPythonPackageTrainingJobRunOp, ModelBatchPredictOp,
        TabularDatasetCreateOp)

    # Create the dataset
    dataset_create_op = TabularDatasetCreateOp(
        project=project,
        location=location,
        display_name=dataset_display_name,
        bq_source=dataset_bq_source,
    )

    # Run the training task
    train_op = CustomPythonPackageTrainingJobRunOp(
        display_name=training_job_dispaly_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module_name=python_package_module_name,
        container_uri=training_container_uri,
        model_display_name=model_display_name,
        model_serving_container_image_uri=serving_container_uri,
        dataset=dataset_create_op.outputs["dataset"],
        base_output_dir=gcs_staging_directory,
        bigquery_destination=training_bigquery_destination,
        training_fraction_split=training_split,
        test_fraction_split=test_split,
        staging_bucket=gcs_staging_directory,
    )

    # Run the batch prediction task
    _ = ModelBatchPredictOp(
        project=project,
        location=location,
        model=train_op.outputs["model"],
        instances_format=batch_prediction_instances_format,
        bigquery_source_input_uri=batch_prediction_source_uri,
        predictions_format=batch_prediction_predictions_format,
        bigquery_destination_output_uri=batch_prediction_destination_uri,
        job_display_name=batch_prediction_display_name,
        machine_type=batch_prediction_machine_type,
        manual_batch_tuning_parameters_batch_size=batch_prediction_batch_size,
    )

### Compile the pipeline

After defining your pipeline, compile it to a file (PIPELINE_FILE_NAME) in `JSON` or `YAML` format.

In [None]:
# Compile the pipeline
compiler.Compiler().compile(
    pipeline_func=custom_model_bq_batch_prediction_pipeline,
    package_path=PIPELINE_FILE_NAME,
)

### Set the parameters

Now, define the parameters to run your pipeline.

To pass the required arguments to the individual components in your pipeline, you define the following parameters:
- project: Project ID for the Google Cloud project where the pipeline needs to run.
- location: Region where the pipeline needs to run.
- dataset_display_name: Display name for the managed dataset resource in Vertex AI.
- dataset_bq_source: BigQuery table URI to serve as a source for the managed dataset in Vertex AI.
- training_job_dispaly_name: Display name for the custom Python package training job.
- gcs_staging_directory: Staging directory for Vertex AI to store training artifacts.
- python_package_gcs_uri: Cloud Storage path to the Python package for training.
- python_package_module_name: Module name (trainer task) inside the Python package for training.
- training_split: Fraction of the total data to be used for training.
- test_split: Fraction of the total data to be used for testing. The split fractions provided to the **CustomPythonPackageTrainingJobRunOp** component should always sum to 1.
- training_container_uri: Prebuilt container image URI for training the model. 
- serving_container_uri: Prebuilt container image URI for serving the model on Vertex AI.
- training_bigquery_destination: The BigQuery project location where the training data is to be written to during training.
- model_display_name: Display name for the model to be uploaded to Vertex AI Model Registry.
- batch_prediction_display_name: Display name for the batch prediction job.
- batch_prediction_instances_format: Format of the input instances for batch prediction.
- batch_prediction_predictions_format: Format of the results from the batch prediction.
- batch_prediction_source_uri: Source URI of the input data.
- batch_prediction_destination_uri: Destination URI where the batch prediction results need to be stored.

**Note:** Though a test split percentage is provided, test data isn't used during the training process. This test data is different from the test data created in earlier steps for batch prediction.
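
Because the split values have to sum to 1, it can help to check them before submitting the pipeline. Here is a minimal sketch of such a check; the helper name `validate_splits` and its optional `validation_split` argument are hypothetical and not part of the pipeline code. It compares with a tolerance because values such as `1 - 0.9` are not exactly representable in floating point.

```python
import math


def validate_splits(training_split: float, test_split: float,
                    validation_split: float = 0.0) -> float:
    """Check that the split fractions lie in [0, 1] and sum to 1; return the sum."""
    fractions = (training_split, validation_split, test_split)
    if any(f < 0 or f > 1 for f in fractions):
        raise ValueError(f"Each split must be between 0 and 1, got {fractions}")
    total = sum(fractions)
    # Use a tolerance instead of == to absorb floating-point rounding
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Splits must sum to 1, got {total}")
    return total


TRAIN_SPLIT = 0.9  # same value as the notebook constant
total = validate_splits(TRAIN_SPLIT, 1 - TRAIN_SPLIT)
print("Split fractions sum to", total)
```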

In [None]:
# Define the parameters for running the pipeline
parameters = {
    "project": PROJECT_ID,
    "location": LOCATION,
    "dataset_display_name": DATASET_DISPLAY_NAME,
    "dataset_bq_source": DATA_SOURCE,
    "training_job_dispaly_name": TRAINING_JOB_DISPLAY_NAME,
    "gcs_staging_directory": BUCKET_URI,
    "python_package_gcs_uri": f"{BUCKET_URI}/training_package/trainer-0.1.tar.gz",
    "python_package_module_name": "trainer.task",
    "training_split": TRAIN_SPLIT,
    "test_split": 1 - TRAIN_SPLIT,
    "training_container_uri": TRAINING_CONTAINER,
    "serving_container_uri": SERVING_CONTAINER,
    "training_bigquery_destination": f"bq://{PROJECT_ID}",
    "model_display_name": MODEL_DISPLAY_NAME,
    "batch_prediction_display_name": BATCH_PREDICTION_JOB_NAME,
    "batch_prediction_instances_format": "bigquery",
    "batch_prediction_predictions_format": "bigquery",
    "batch_prediction_source_uri": f"bq://{PROJECT_ID}.{BQ_DATASET_ID}.{BQ_INPUT_TABLE}",
    "batch_prediction_destination_uri": f"bq://{PROJECT_ID}.{BQ_DATASET_ID}",
}

### Run the pipeline

Create a Vertex AI Pipeline job and run it using the `PipelineJob` class.

The PipelineJob class takes the following parameters:

- display_name: The display name of the Vertex AI pipeline.
- template_path: The path of PipelineJob or PipelineSpec (JSON or YAML) file.
- parameter_values: The mapping from runtime parameter names to their values, which control the pipeline run.
- enable_caching: Whether to turn on caching for the run.

Learn more about the `PipelineJob` class from [Vertex AI PipelineJob documentation](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.PipelineJob).

In [None]:
# Create a Vertex AI Pipeline job
job = aiplatform.PipelineJob(
    display_name=PIPELINE_DISPLAY_NAME,
    template_path=PIPELINE_FILE_NAME,
    parameter_values=parameters,
    enable_caching=True,
)
# Run the pipeline job
job.run(service_account=SERVICE_ACCOUNT)

## Fetch results from the predictions table

After the Vertex AI pipeline job finishes successfully, fetch the batch prediction results into a dataframe.

### Get the table name

Run the following cell to get the name of the predictions table from the details of the batch prediction job that the pipeline created.

In [None]:
OUTPUT_TABLE = None
# Load the batch prediction job details using the display name
[batch_prediction_job] = aiplatform.BatchPredictionJob.list(
    filter=f'display_name="{BATCH_PREDICTION_JOB_NAME}"'
)
# Fetch the name of the output table
OUTPUT_TABLE = batch_prediction_job.output_info.bigquery_output_table
print("Predictions table ID:", OUTPUT_TABLE)

### Query the results table

Fetch a specified number of rows from the predictions table using the following cell.

In [None]:
# Specify the number of rows to fetch
ROWS = 10
# Define the query
query = f"""
    SELECT prediction FROM `{PROJECT_ID}.{BQ_DATASET_ID}.{OUTPUT_TABLE}` LIMIT {ROWS}
"""
# Fetch the data into a dataframe
df = bq_client.query(query).to_dataframe()
# Display the dataframe
df

### Fetch the resources for deletion

Using the display names of the individual resources, load the resources created inside the pipeline for the cleanup step.
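
Unpacking a list result into a single-element target, as in `[dataset] = ...`, raises a `ValueError` whenever the filter matches zero resources or more than one, for example after running the notebook twice with the same display names. As a minimal sketch, a small helper can turn that into a more descriptive error. The helper name `exactly_one` is hypothetical, and the sample lists stand in for the results of the SDK `list()` calls.

```python
def exactly_one(resources, display_name: str):
    """Return the single matching resource or raise a descriptive error."""
    matches = list(resources)
    if len(matches) != 1:
        raise LookupError(
            f"Expected exactly one resource named {display_name!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


# Stand-in values; in the notebook these would be lists returned by the SDK
single = exactly_one(["dataset-a"], "adult_census_dataset_unique")
print(single)

try:
    exactly_one(["dataset-a", "dataset-b"], "adult_census_dataset_unique")
except LookupError as err:
    message = str(err)
    print(message)
```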

In [None]:
# Load the Vertex AI tabular dataset using the display name
[dataset] = aiplatform.TabularDataset.list(
    filter=f'display_name="{DATASET_DISPLAY_NAME}"'
)

# Load the Vertex AI model resource using the display name
[model] = aiplatform.Model.list(filter=f'display_name="{MODEL_DISPLAY_NAME}"')

# Load the custom training job using the display name
[training_job] = aiplatform.CustomPythonPackageTrainingJob.list(
    filter=f'display_name="{TRAINING_JOB_DISPLAY_NAME}"'
)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Vertex AI Pipeline job
- Vertex AI TabularDataset
- Vertex AI model
- Vertex AI Training job
- Vertex AI batch prediction job
- BigQuery dataset
- Cloud Storage bucket (Set `delete_bucket` to **True** to delete the Cloud Storage bucket)

In [None]:
delete_bucket = False

# Delete the Vertex AI Pipeline job
job.delete()

# Delete the Vertex AI TabularDataset
dataset.delete()

# Delete the Vertex AI Model
model.delete()

# Delete the Vertex AI Training job
training_job.delete()

# Delete the Vertex AI Batch prediction job
batch_prediction_job.delete()

# Delete the BigQuery dataset
! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET_ID

# Delete Cloud Storage objects
if delete_bucket:
    ! gsutil -m rm -r $BUCKET_URI

! rm $PIPELINE_FILE_NAME