In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI Pipelines: Batch Prediction with BigQuery source and destination from a Custom Tabular Classification model

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/custom_tabular_classification_bq_io_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/custom_tabular_classification_bq_io_pipeline.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pipelines/custom_tabular_classification_bq_io_pipeline.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook demonstrates how to run a batch prediction job for a custom tabular classification model inside a Vertex AI pipeline. The pipeline is built with the `google_cloud_pipeline_components` Python library, and the batch prediction job is configured with a BigQuery source and destination.

### Objective

In this tutorial, you train a scikit-learn tabular classification model and learn how to create a batch prediction job for it through a Vertex AI pipeline job using `google_cloud_pipeline_components`. The source and destination data for the batch prediction job are stored in BigQuery.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI `Model Registry`
- Vertex AI `Pipelines`
- Vertex AI `Batch Predictions`


The steps performed include:

- Save data to BigQuery for batch prediction.
- Train a scikit-learn RandomForest classification model on the dataset.
- Upload the scikit-learn model to Vertex AI Model Registry.
- Create a Vertex AI Pipeline job to fetch the model and run a batch prediction job.
- Run the Vertex AI Pipeline job.
- Check the prediction results from the destination BigQuery table.
- Clean up the resources created in this notebook.

### Dataset

The [Census Income Data Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) that this notebook uses for training is available publicly at the BigQuery location `bigquery-public-data.ml_datasets.census_adult_income`. It consists of the following fields:

- `age`: Age.
- `workclass`: Nature of employment.
- `functional_weight`: Sample weight of the individual from the original Census data, which reflects how likely they were to be included in this dataset based on their demographic characteristics compared with whole-population estimates.
- `education`: Level of education completed.
- `education_num`: Estimated years of education completed based on the value of the education field.
- `marital_status`: Marital status.
- `occupation`: Occupation category.
- `relationship`: Relationship to the household.
- `race`: Race.
- `sex`: Gender.
- `capital_gain`: Amount of capital gains.
- `capital_loss`: Amount of capital loss.
- `hours_per_week`: Hours worked per week.
- `native_country`: Country of birth.
- `income_bracket`: Either " >50K" or " <=50K" based on income.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage
* Artifact Registry
* Cloud Build

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing), and [Cloud Build pricing](https://cloud.google.com/build/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Vertex AI Workbench Notebooks**, your environment already meets
all the requirements to run this notebook.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform \
                            google-cloud-bigquery \
                            kfp \
                            google-cloud-pipeline-components \
                            pyarrow \
                            pandas {USER_FLAG} -q
! pip3 install scikit-learn==1.0  {USER_FLAG} -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI, Compute Engine, Artifact Registry, BigQuery and Cloud Build APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute.googleapis.com,artifactregistry.googleapis.com,bigquery.googleapis.com,cloudbuild.googleapis.com).

1. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` (or wrapped in `{}`) into these commands.

#### Set your project ID

**If you don't know your project ID**, leave the placeholder in place; the fallback cell below tries to read the project ID from your `gcloud` configuration.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The following regions are among those supported by Vertex AI. It is recommended that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you generate a short random identifier (referred to here as a UUID) for each session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string

# Generate a random identifier of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
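To get a feel for how unlikely a name collision is with an 8-character suffix, the following minimal sketch counts the possible suffixes and applies the standard birthday-problem approximation. The function name `collision_probability` and the figure of 100 sessions are illustrative assumptions and are not part of this notebook.

```python
import math
import string

# Same alphabet as generate_uuid: lowercase letters plus digits.
ALPHABET_SIZE = len(string.ascii_lowercase + string.digits)


def collision_probability(n_sessions: int, length: int = 8) -> float:
    """Approximate the chance that at least two sessions share a suffix.

    Uses the birthday-problem approximation 1 - exp(-n(n-1) / (2N)),
    where N is the number of possible suffixes of the given length.
    """
    n_possible = ALPHABET_SIZE ** length
    exponent = n_sessions * (n_sessions - 1) / (2 * n_possible)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-exponent)


print("Possible suffixes:", ALPHABET_SIZE ** 8)
print("Collision probability for 100 sessions:", collision_probability(100))
```

If a shared project hosts many more sessions than this, passing a larger `length` to `generate_uuid` is one way to keep the probability low.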

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already
authenticated. 

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS '[your-service-account-key-path]'

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a pipeline or batch prediction job using the Vertex AI SDK, Vertex AI uses a Cloud Storage bucket as a staging location. You can provide the staging location each time you run a job, or once when you initialize the SDK. In this tutorial, the Vertex AI SDK is initialized with a staging bucket that you create in the next steps.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = f"{PROJECT_ID}-aip-{UUID}"
    BUCKET_URI = f"gs://{BUCKET_NAME}"
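Because the fallback name is assembled from your project ID, it can be worth checking it against the Cloud Storage naming rules before you create the bucket. The sketch below covers only a subset of those rules (3 to 63 characters; lowercase letters, digits, hyphens, underscores, and dots; starting and ending with a letter or digit). The helper `looks_like_valid_bucket_name` and the sample names are hypothetical.

```python
import re

# Subset of the Cloud Storage bucket naming rules (assumption: names without
# dots-as-domains and without reserved prefixes are all you need to check).
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic length and character checks."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


for candidate in ["my-project-aip-ab12cd34", "My-Bucket", "ab", "-leading-hyphen"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected, for example if another bucket already uses it.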

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service Account

You use a service account to create Vertex AI Pipeline jobs. If you do not want to use your project's Compute Engine service account, set `SERVICE_ACCOUNT` to another service account ID.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    else:  # IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step. You only need to run this step once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Import libraries

Import the Vertex AI Python SDK and other required Python libraries.

In [None]:
from google.cloud import bigquery
from google.cloud import aiplatform
from google.cloud.aiplatform_v1 import types
import kfp
from kfp.v2 import compiler
from sklearn.model_selection import train_test_split
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

# Initialize BigQuery client 
bq_client = bigquery.Client(
    project=PROJECT_ID,
    credentials=aiplatform.initializer.global_config.credentials,
)

### Define constants

Set the constants used to train the model and to create and run the pipeline.

In [None]:
DATA_SOURCE = "bigquery-public-data.ml_datasets.census_adult_income"  # Source of the dataset
TEST_SIZE = 0.25  # Fraction of the dataset held out as the test set
RANDOM_STATE = 36  # Random seed for a reproducible split

# Define the format of your input data, excluding the target column.
# These are the columns from the census data files.
COLUMNS = [
    'age',
    'workclass',
    'functional_weight',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country'
]

# Categorical columns are columns that need to be turned into a numerical value to be used by scikit-learn
CATEGORICAL_COLUMNS = [
    'workclass',
    'education',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native_country'
]

# Target column for training
TARGET = 'income_bracket'

# BigQuery Dataset name
BQ_DATASET_ID = f"income_prediction_{UUID}"

# name of the BigQuery source table for batch prediction
BQ_INPUT_TABLE = f"income_test_data_{UUID}"

# name of the BigQuery output table for batch prediction
BQ_OUTPUT_TABLE = f"{BQ_INPUT_TABLE}_predictions"

# Filename to save the model locally
MODEL_FILENAME = "model.joblib"

# Model display name for Vertex AI Model Registry
MODEL_DISPLAY_NAME = f"income_pred_skl_{UUID}"

# Display name for the Vertex AI Pipeline
PIPELINE_DISPLAY_NAME = f"income_classifier_pipeline_{UUID}"

# Filename to compile the pipeline to
PIPELINE_FILENAME = "income_classification_pipeline.json"

## Dataset preparation

### Fetch the Dataset

Get the data from the public BigQuery dataset to train your model locally.

In [None]:
# Query to fetch the data from BQ
query = f'''
Select * from `{DATA_SOURCE}`
'''

# Fetch the data into a dataframe
df = bq_client.query(query).to_dataframe()
df.head()

### Split into train and test sets

Split the fetched dataset into train and test sets.

In [None]:
# Split the dataset into train and test
X_train, X_test = train_test_split(
    df, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
print(X_train.shape, X_test.shape)
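To make the roles of `TEST_SIZE` and `RANDOM_STATE` concrete, here is a minimal standard-library sketch of a seeded shuffle-and-split. It is a simplified stand-in and does not reproduce scikit-learn's implementation; the helper `split_rows`, its rounding choice, and the 100-row input are assumptions for illustration.

```python
import math
import random


def split_rows(rows, test_size: float, seed: int):
    """Shuffle rows with a fixed seed, then hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Assumption: round the test-set size up to a whole number of rows.
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(100), test_size=0.25, seed=36)
print(len(train_rows), len(test_rows))
```

Reusing the same seed yields the same partition, which is why the notebook fixes `RANDOM_STATE`.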

### Save test data to BigQuery

To run a batch prediction using BigQuery input, save the test data to a BigQuery table. 

#### Create Schema configuration

To create a table in BigQuery with the proper data types, define the table's schema configuration based on the data types in the dataframe.

In [None]:
# Define the schema configuration
schema_config = []
for i in COLUMNS:
    if X_test[i].dtype == "int64":
        schema_config.append(bigquery.SchemaField(i, 'INTEGER'))
    elif X_test[i].dtype in ['object', 'category']:
        schema_config.append(bigquery.SchemaField(i, 'STRING'))
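The dtype-to-type decision in the loop above can also be written as a lookup table. The following minimal sketch works on plain dtype name strings rather than a real dataframe; the helper `bq_type_for_dtype` and the `float64` entry are assumptions for illustration (the census feature columns used here are integers and strings).

```python
# Hypothetical lookup from a pandas dtype name to a BigQuery column type.
_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "object": "STRING",
    "category": "STRING",
    "float64": "FLOAT",  # assumption: not needed for the census columns
}


def bq_type_for_dtype(dtype_name: str) -> str:
    """Return the BigQuery type for a dtype name, or fail loudly if unmapped."""
    try:
        return _DTYPE_TO_BQ[dtype_name]
    except KeyError:
        raise ValueError(f"No BigQuery type mapped for dtype {dtype_name!r}") from None


example_dtypes = {"age": "int64", "workclass": "object"}
schema = [(name, bq_type_for_dtype(dtype)) for name, dtype in example_dtypes.items()]
print(schema)
```

Raising on an unmapped dtype avoids silently dropping a column from the schema, which the `if`/`elif` chain would do.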

#### Create a BigQuery Dataset

Run the following cell to create a dataset in BigQuery.

In [None]:
# Create a BQ dataset
bq_dataset = bigquery.Dataset(f"{PROJECT_ID}.{BQ_DATASET_ID}")
bq_dataset = bq_client.create_dataset(bq_dataset)
print(f"Created dataset {bq_client.project}.{bq_dataset.dataset_id}")

#### Create the table from dataframe

Create the BigQuery table and load the test data from the dataframe.

**Note:** Only the features are saved to the table; the target column is excluded.

In [None]:
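# Create a BQ table from dataframe
table_ref = bq_dataset.table(BQ_INPUT_TABLE)
job_config = bigquery.LoadJobConfig(
    schema=schema_config, write_disposition="WRITE_TRUNCATE"
)

# Pass job_config so that the schema and write disposition are applied.
# The location is derived from REGION (for example, "us-central1" -> "US");
# adjust it if your dataset lives outside the US multi-region.
job = bq_client.load_table_from_dataframe(
    X_test[COLUMNS],
    table_ref,
    job_config=job_config,
    location=REGION.split("-")[0].upper(),
)

job.result()  # Waits for table load to complete.
print("Loaded dataframe to {}".format(table_ref.path))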

## Preprocess the data

Run the code below to preprocess the train set and store the preprocessing steps in a scikit-learn pipeline. 

The scikit-learn pipeline saves you from the trouble of writing additional scripts to preprocess the test data for generating predictions. 

Learn more about [scikit-learn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

In [None]:
# Remove the column we are trying to predict (TARGET, 'income_bracket') from the features
# and convert the dataframe to a list of lists
train_features = X_train.drop(TARGET, axis=1).to_numpy().tolist()
# Create the training labels as a flat list
train_labels = X_train[TARGET].to_numpy().tolist()

# Since the census data set has categorical features, we need to convert
# them to numerical values. We use a list of pipelines to convert each
# categorical column and then use FeatureUnion to combine them before calling
# the RandomForestClassifier.
categorical_pipelines = []

# Each categorical column needs to be extracted individually and converted to a numerical value.
# To do this, each categorical column uses a pipeline that extracts one feature column via
# SelectKBest(k=1) and a LabelBinarizer() to convert the categorical value to a numerical one.
# A scores array (created below) selects and extracts the feature column. The scores array is
# created by iterating over COLUMNS and checking whether each column is in CATEGORICAL_COLUMNS.
for i, col in enumerate(COLUMNS):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married', 'Adm-clerical',
        #         'Not-in-family', 'White', 'Male', 2174, 0, 40, 'United-States']
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        scores = []
        # Build the scores array
        for j in range(len(COLUMNS)):
            if i == j: # This column is the categorical column we want to extract.
                scores.append(1) # Set to 1 to select this column
            else: # Every other column should be ignored.
                scores.append(0)
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            ('categorical-{}'.format(i), Pipeline([
                ('SKB-{}'.format(i), skb),
                ('LBN-{}'.format(i), lbn)])))

# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS use the features that are numerical
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(('numerical', skb))

# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)
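The select-then-encode idea behind each categorical pipeline can be illustrated without scikit-learn. The sketch below is a simplified stand-in for what `SelectKBest(k=1)` with a hand-set scores array and `LabelBinarizer()` are used for here; it does not reproduce every detail of scikit-learn's behavior. The helper names and the three-column sample rows are hypothetical.

```python
from typing import List, Sequence


def one_hot_mask(n_columns: int, selected: int) -> List[int]:
    """Build a scores list with 1 at the selected column and 0 elsewhere."""
    return [1 if j == selected else 0 for j in range(n_columns)]


def select_column(rows: Sequence[Sequence], mask: Sequence[int]) -> List[list]:
    """Keep only the column flagged by the mask, one single-item list per row."""
    idx = list(mask).index(1)
    return [[row[idx]] for row in rows]


def binarize(values: Sequence[Sequence[str]]) -> List[List[int]]:
    """One-hot encode single-item rows, with one output column per sorted class."""
    classes = sorted({v[0] for v in values})
    return [[1 if v[0] == c else 0 for c in classes] for v in values]


rows = [
    [39, "State-gov", 77516],
    [50, "Private", 83311],
    [38, "Private", 215646],
    [52, "Self-emp", 209642],
]
mask = one_hot_mask(n_columns=3, selected=1)
selected = select_column(rows, mask)
encoded = binarize(selected)
print(mask)
print(selected)
print(encoded)
```

In the notebook, one such select-and-encode pair is built per categorical column, and `FeatureUnion` concatenates their outputs with the numerical columns.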

## Create a RandomForest model

Using scikit-learn's `RandomForestClassifier`, fit a random forest model on the training data.

Then add the trained classifier to the scikit-learn pipeline.

Learn more about `RandomForestClassifier` in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# Create the classifier
classifier = RandomForestClassifier()

# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)

# Create the overall model as a single pipeline
pipeline = Pipeline([
    ('union', preprocess),
    ('classifier', classifier)
])

Test the trained classifier on a few samples from the training data.

In [None]:
instances = X_train[COLUMNS].sample(5).to_numpy().tolist()
pipeline.predict(instances)

Save the model pipeline to a file using the `joblib` Python library.

In [None]:
# Save the pipeline
joblib.dump(pipeline, MODEL_FILENAME)

## Upload the model to Vertex AI

Create a Vertex AI model resource from the saved model file.

For this step, you can use the Vertex AI SDK's `Model.upload_scikit_learn_model_file` method, which creates a Vertex AI model resource directly from your scikit-learn model file. It takes the following arguments:

- `model_file_path`: Local file path of the model.
- `sklearn_version`: The version of the scikit-learn serving container. This notebook uses "1.0", matching the scikit-learn version installed earlier.
- `display_name`: The display name of the Vertex AI Model.

Learn more about the Vertex AI `Model` class in the [Vertex AI Model documentation](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model).

In [None]:
# Create the Vertex AI model resource
model = aiplatform.Model.upload_scikit_learn_model_file(
    model_file_path=MODEL_FILENAME,
    display_name=MODEL_DISPLAY_NAME,
    sklearn_version="1.0",
)

Get the resource name of the created Vertex AI Model.

In [None]:
MODEL_RESOURCE_NAME = model.resource_name

## Create Vertex AI Pipeline

Next, you create a Vertex AI pipeline using `google-cloud-pipeline-components`. This pipeline fetches the Vertex AI model resource and runs a batch prediction job with it.

### Define the Pipeline 

The pipeline uses the following components:

- `GetVertexModelOp`: Gets a Vertex AI Model Artifact. Learn more about [GetVertexModelOp component](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.experimental.evaluation.html#google_cloud_pipeline_components.experimental.evaluation.GetVertexModelOp).

- `ModelBatchPredictOp`: Creates a Google Cloud Vertex AI batch prediction job and waits for it to complete. Learn more about [ModelBatchPredictOp component](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-1.0.23/google_cloud_pipeline_components.aiplatform.html#google_cloud_pipeline_components.aiplatform.ModelBatchPredictOp).

In [None]:
# Define the pipeline
@kfp.dsl.pipeline(
    name='custom-model-batch-prediction-pipeline')
def custom_model_batch_prediction_pipeline(
    project: str,
    location: str,
    model_name: str,
    bigquery_source_input_uri: str,
    bigquery_destination_output_uri: str,
    batch_prediction_display_name: str = "model-registry-batch-prediction",
    batch_predict_machine_type: str = "n1-standard-4",
    
):
  
    from google_cloud_pipeline_components.experimental.evaluation import GetVertexModelOp
    from google_cloud_pipeline_components.aiplatform import ModelBatchPredictOp
    
    # Get the Vertex AI model resource
    get_model_task = GetVertexModelOp(model_resource_name=model_name)
    
    # Run Batch Predictions
    _ = ModelBatchPredictOp(
        project=project,
        location=location,
        model=get_model_task.outputs['model'],
        instances_format='bigquery',
        bigquery_source_input_uri=bigquery_source_input_uri,
        predictions_format='bigquery',
        bigquery_destination_output_uri=bigquery_destination_output_uri,
        job_display_name=batch_prediction_display_name,
        machine_type=batch_predict_machine_type,
    )

### Compile the pipeline

Compile the pipeline to a JSON file.

In [None]:
compiler.Compiler().compile(
    pipeline_func=custom_model_batch_prediction_pipeline,
    package_path=PIPELINE_FILENAME
)

### Run the Pipeline

Define the parameters to create a Vertex AI Pipeline job.

In [None]:
parameters = {
            'project':PROJECT_ID,
            'location':REGION,
            'model_name':MODEL_RESOURCE_NAME,
            'bigquery_source_input_uri':f"bq://{PROJECT_ID}.{table_ref.dataset_id}.{table_ref.table_id}",
            'bigquery_destination_output_uri':f"bq://{PROJECT_ID}.{table_ref.dataset_id}"
        }
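The two BigQuery URIs in the parameters above follow the same `bq://` pattern: the source names a table, while the destination names only a dataset. A small helper can make that explicit. This is a minimal sketch; the function `bq_uri` and the sample project, dataset, and table names are hypothetical.

```python
def bq_uri(project: str, dataset: str, table: str = "") -> str:
    """Build a bq:// URI for a dataset, or for a table when one is given."""
    parts = [project, dataset]
    if table:
        parts.append(table)
    return "bq://" + ".".join(parts)


# Sample values for illustration only.
source_uri = bq_uri("my-project", "income_prediction_ab12cd34", "income_test_data_ab12cd34")
destination_uri = bq_uri("my-project", "income_prediction_ab12cd34")
print(source_uri)
print(destination_uri)
```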

Create a Vertex AI Pipeline job and run it using the `PipelineJob` class.

The `PipelineJob` class takes the following parameters as arguments:

- `display_name`: The display name of the Vertex AI Pipeline.
- `template_path`: The path of PipelineJob or PipelineSpec JSON (or YAML) file.
- `parameter_values`: The mapping from runtime parameter names to the values that control the pipeline run.
- `enable_caching`: Whether to turn on caching for the run.

Learn more about the `PipelineJob` class from [Vertex AI PipelineJob documentation](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.PipelineJob).

In [None]:
job = aiplatform.PipelineJob(
    display_name=PIPELINE_DISPLAY_NAME,
    template_path=PIPELINE_FILENAME,
    parameter_values=parameters,
    enable_caching=False
)

job.run(service_account=SERVICE_ACCOUNT)

## Fetch results from the predictions table

After the Vertex AI pipeline job finishes successfully, fetch the batch prediction results into a dataframe.

### Get the table name

Run the following cell to get the name of the predictions table from the pipeline job's artifact details.

In [None]:
OUTPUT_TABLE = None

# Iterate over the pipeline tasks
for task in job._gca_resource.job_detail.task_details:
    if task.task_name == "model-batch-predict" and (
        task.state == types.PipelineTaskDetail.State.SUCCEEDED
        or task.state == types.PipelineTaskDetail.State.SKIPPED
    ):
        # Obtain the artifacts from the batch-prediction task
        OUTPUT_TABLE = (
            task.outputs["bigquery_output_table"].artifacts[0].metadata["tableId"]
        )

print("Predictions table ID:", OUTPUT_TABLE)

### Query the predictions table

Fetch a specified number of rows from the predictions table using the following cell.

In [None]:
ROWS = 10  # Number of rows to fetch
query = f'''
    Select prediction from `{PROJECT_ID}.{BQ_DATASET_ID}.{OUTPUT_TABLE}` limit {ROWS}
'''

# Fetch the data into a dataframe
df = bq_client.query(query).to_dataframe()

df

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

Set `delete_bucket` to **True** to delete the Cloud Storage bucket used in this notebook.

In [None]:
delete_bucket = False

# Delete the Vertex AI Pipeline job
job.delete()

# Delete the Vertex AI Model
model.delete()

# Delete the BigQuery dataset
! bq rm -r -f -d $PROJECT_ID:$BQ_DATASET_ID

# Delete Cloud Storage objects
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI