# Data Journey Day 2 - Vertex AI Pipeline for AutoML Tabular

<table align="left">

  <td>
    <a href="https://github.com/AmritRaj23/data-journey/blob/main/day-2/Vertex-Pipelines/DataJourneyVpip.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://github.com/AmritRaj23/data-journey/blob/main/day-2/Vertex-Pipelines/DataJourneyVpip.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
     </a>
  </td>
</table>
<br/><br/><br/>

In [1]:
# ! pip3 install --upgrade --user google-cloud-aiplatform \
#                                     google-cloud-storage \
#                                     kfp \
#                                     google-cloud-pipeline-components -q

### Set GCP config

In [1]:
PROJECT_ID = "<project-id>"
REGION = "<region>"
BUCKET_NAME = "<bucket>"  
BUCKET_URI = f"gs://{BUCKET_NAME}"
SERVICE_ACCOUNT = "<default compute service account>"  
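
Before running the later cells, it can help to confirm that none of the settings above still hold a template value. The following is a minimal sketch, not part of the original notebook: the helper name `find_unset` and the sample values are hypothetical, and it assumes placeholders are written as `<...>` or `[...]`.

```python
import re

# Matches unfilled template values such as "<project-id>" or "[your-project-id]".
_PLACEHOLDER = re.compile(r"^\s*(<[^<>]*>|\[[^\[\]]*\])\s*$")


def find_unset(config: dict) -> list:
    """Return the names of settings that are empty or still placeholders."""
    unset = []
    for name, value in config.items():
        if not value or _PLACEHOLDER.match(value):
            unset.append(name)
    return unset


# Hypothetical configuration used only to illustrate the check.
example = {
    "PROJECT_ID": "my-project",  # filled in
    "REGION": "<region>",        # still a placeholder
    "BUCKET_NAME": "",           # empty
}
print(find_unset(example))  # ['REGION', 'BUCKET_NAME']
```

In the notebook, the same helper could be called with a dictionary built from `PROJECT_ID`, `REGION`, `BUCKET_NAME`, and `SERVICE_ACCOUNT`.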

In [3]:
if PROJECT_ID in ("", None, "<project-id>", "[your-project-id]"):
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [4]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


### Import libraries and define constants

In [None]:
import os
import json
import logging
import time
from typing import NamedTuple

# Import GCP libraries.
from google.cloud import aiplatform
from google_cloud_pipeline_components.aiplatform import (
    AutoMLTabularTrainingJobRunOp, EndpointCreateOp, ModelDeployOp,
    TabularDatasetCreateOp)
from google.cloud import bigquery


# Import Kubeflow Pipeline SDK.
import kfp.v2 as kfp
from kfp.v2 import dsl, compiler
from kfp.v2.dsl import (Artifact, ClassificationMetrics, Input, Metrics,
                        Output, component)

In [8]:
# Set the path for storing the pipeline artifacts.
PIPELINE_NAME = "automl-tabular-beans-training"
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root/beans"

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [9]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

## Define a metrics evaluation custom component

In this tutorial, you define one custom pipeline component. The remaining components are pre-built
components for Vertex AI services.

The custom pipeline component you define is a Python function-based component.
Python function-based components make it easier to iterate quickly by letting you write your component code as a Python function; the SDK generates the component specification for you.

Note the `@component` decorator. When you evaluate the `classification_model_eval_metrics` function, the component is compiled into what is essentially a task factory function that can be used in the pipeline definition.

In addition, a `tabular_eval_component.yaml` component definition file is generated. The component YAML file can be shared, placed under version control, and used later to define a pipeline step.

The component definition specifies a base image for the component to use and specifies that the `google-cloud-aiplatform` package should be installed. When no base image is specified, a Python 3.7 image is used by default.

The custom pipeline component retrieves the classification model evaluation generated by the AutoML tabular training process, parses the evaluation data, and renders the ROC curve and confusion matrix for the model. It also uses given metrics threshold information and compares that to the evaluation results to determine whether the model is sufficiently accurate to deploy.

*Note:* This custom component is specific to AutoML tabular classification models.
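
The threshold comparison described above can be tried locally before it runs inside a pipeline. The sketch below mirrors the logic of the `classification_thresholds_check` helper in the component, using only the standard library. The function name `passes_thresholds` and the metric values are invented for illustration; real values come from the AutoML model evaluation.

```python
import json

# Metrics for which a larger value means a better model.
HIGHER_IS_BETTER = ("auRoc", "auPrc")


def passes_thresholds(metrics: dict, thresholds_json: str) -> bool:
    """Return False if any higher-is-better metric falls below its threshold."""
    thresholds = json.loads(thresholds_json)
    for name, minimum in thresholds.items():
        if name in HIGHER_IS_BETTER and metrics[name] < minimum:
            return False
    return True


# Invented evaluation metrics, for illustration only.
sample_metrics = {"auRoc": 0.97, "auPrc": 0.91, "logLoss": 0.12}

print(passes_thresholds(sample_metrics, '{"auRoc": 0.95}'))  # True
print(passes_thresholds(sample_metrics, '{"auPrc": 0.95}'))  # False
```

As in the component, thresholds for metrics outside `auRoc` and `auPrc` are ignored rather than rejected, so a key such as `logLoss` in the thresholds string has no effect on the decision.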

In [10]:
@component(
    base_image="gcr.io/deeplearning-platform-release/tf2-cpu.2-6:latest",
    output_component_file="tabular_eval_component.yaml",
    packages_to_install=["google-cloud-aiplatform"],
)
def classification_model_eval_metrics(
    project: str,
    location: str,
    thresholds_dict_str: str,
    model: Input[Artifact],
    metrics: Output[Metrics],
    metricsc: Output[ClassificationMetrics],
) -> NamedTuple("Outputs", [("dep_decision", str)]):  # Return parameter.

    import json
    import logging

    from google.cloud import aiplatform

    # Pass the region as well so the SDK calls the regional API endpoint.
    aiplatform.init(project=project, location=location)

    # Fetch model eval info
    def get_eval_info(model):
        response = model.list_model_evaluations()
        metrics_list = []
        metrics_string_list = []
        for evaluation in response:
            evaluation = evaluation.to_dict()
            print("model_evaluation")
            print(" name:", evaluation["name"])
            print(" metrics_schema_uri:", evaluation["metricsSchemaUri"])
            metrics = evaluation["metrics"]
            for metric in metrics.keys():
                logging.info("metric: %s, value: %s", metric, metrics[metric])
            metrics_str = json.dumps(metrics)
            metrics_list.append(metrics)
            metrics_string_list.append(metrics_str)

        return (
            evaluation["name"],
            metrics_list,
            metrics_string_list,
        )

    # Use the given metrics threshold(s) to determine whether the model is
    # accurate enough to deploy.
    def classification_thresholds_check(metrics_dict, thresholds_dict):
        for k, v in thresholds_dict.items():
            logging.info("k {}, v {}".format(k, v))
            if k in ["auRoc", "auPrc"]:  # higher is better
                if metrics_dict[k] < v:  # if under threshold, don't deploy
                    logging.info("{} < {}; returning False".format(metrics_dict[k], v))
                    return False
        logging.info("threshold checks passed.")
        return True

    def log_metrics(metrics_list, metricsc):
        test_confusion_matrix = metrics_list[0]["confusionMatrix"]
        logging.info("rows: %s", test_confusion_matrix["rows"])

        # log the ROC curve
        fpr = []
        tpr = []
        thresholds = []
        for item in metrics_list[0]["confidenceMetrics"]:
            fpr.append(item.get("falsePositiveRate", 0.0))
            tpr.append(item.get("recall", 0.0))
            thresholds.append(item.get("confidenceThreshold", 0.0))
        print(f"fpr: {fpr}")
        print(f"tpr: {tpr}")
        print(f"thresholds: {thresholds}")
        metricsc.log_roc_curve(fpr, tpr, thresholds)

        # log the confusion matrix
        annotations = []
        for item in test_confusion_matrix["annotationSpecs"]:
            annotations.append(item["displayName"])
        logging.info("confusion matrix annotations: %s", annotations)
        metricsc.log_confusion_matrix(
            annotations,
            test_confusion_matrix["rows"],
        )

        # log textual metrics info as well
        for metric in metrics_list[0].keys():
            if metric != "confidenceMetrics":
                val_string = json.dumps(metrics_list[0][metric])
                metrics.log_metric(metric, val_string)

    logging.getLogger().setLevel(logging.INFO)

    # extract the model resource name from the input Model Artifact
    model_resource_path = model.metadata["resourceName"]
    logging.info("model path: %s", model_resource_path)

    # Get the trained model resource
    model = aiplatform.Model(model_resource_path)

    # Get model evaluation metrics from the trained model
    eval_name, metrics_list, metrics_str_list = get_eval_info(model)
    logging.info("got evaluation name: %s", eval_name)
    logging.info("got metrics list: %s", metrics_list)
    log_metrics(metrics_list, metricsc)

    thresholds_dict = json.loads(thresholds_dict_str)
    deploy = classification_thresholds_check(metrics_list[0], thresholds_dict)
    if deploy:
        dep_decision = "true"
    else:
        dep_decision = "false"
    logging.info("deployment decision is %s", dep_decision)

    return (dep_decision,)
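
To see what `log_metrics` passes to `log_roc_curve`, the ROC extraction step can be run on its own. This is a minimal sketch with a hand-written `confidenceMetrics` list; the numbers are invented, and the helper name `roc_points` is hypothetical. Like the component, it treats a missing field as `0.0`.

```python
def roc_points(confidence_metrics):
    """Split confidenceMetrics entries into parallel FPR, TPR, and threshold lists."""
    fpr, tpr, thresholds = [], [], []
    for item in confidence_metrics:
        fpr.append(item.get("falsePositiveRate", 0.0))
        tpr.append(item.get("recall", 0.0))  # recall is the true positive rate
        thresholds.append(item.get("confidenceThreshold", 0.0))
    return fpr, tpr, thresholds


# Invented entries shaped like the evaluation's confidenceMetrics list.
sample = [
    {"confidenceThreshold": 0.1, "recall": 1.0, "falsePositiveRate": 0.4},
    {"confidenceThreshold": 0.5, "recall": 0.9, "falsePositiveRate": 0.1},
    {"confidenceThreshold": 0.9, "recall": 0.6},  # falsePositiveRate omitted
]

fpr, tpr, thresholds = roc_points(sample)
print(fpr)         # [0.4, 0.1, 0.0]
print(tpr)         # [1.0, 0.9, 0.6]
print(thresholds)  # [0.1, 0.5, 0.9]
```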

## Define pipeline 

Define the pipeline for AutoML tabular classification using the components from `google_cloud_pipeline_components`.

In [11]:
@kfp.dsl.pipeline(name=PIPELINE_NAME, pipeline_root=PIPELINE_ROOT)
def pipeline(
    bq_source: str,
    DATASET_DISPLAY_NAME: str,
    TRAINING_DISPLAY_NAME: str,
    MODEL_DISPLAY_NAME: str,
    ENDPOINT_DISPLAY_NAME: str,
    MACHINE_TYPE: str,
    project: str,
    gcp_region: str,
    thresholds_dict_str: str,
):

    # Defining Dataset create component with bq_source data as input.
    dataset_create_op = TabularDatasetCreateOp(
        project=project, display_name=DATASET_DISPLAY_NAME, bq_source=bq_source
    )
    
    # Defining training job component with previously created dataset as input.
    training_op = AutoMLTabularTrainingJobRunOp(
        project=project,
        display_name=TRAINING_DISPLAY_NAME,
        optimization_prediction_type="classification",
        optimization_objective="minimize-log-loss",
        budget_milli_node_hours=1000,
        model_display_name=MODEL_DISPLAY_NAME,
        column_specs={
            "Area": "numeric",
            "Perimeter": "numeric",
            "MajorAxisLength": "numeric",
            "MinorAxisLength": "numeric",
            "AspectRation": "numeric",
            "Eccentricity": "numeric",
            "ConvexArea": "numeric",
            "EquivDiameter": "numeric",
            "Extent": "numeric",
            "Solidity": "numeric",
            "roundness": "numeric",
            "Compactness": "numeric",
            "ShapeFactor1": "numeric",
            "ShapeFactor2": "numeric",
            "ShapeFactor3": "numeric",
            "ShapeFactor4": "numeric",
            "Class": "categorical",
        },
        dataset=dataset_create_op.outputs["dataset"],
        target_column="Class",
    )
    
    # Define (custom) evaluation component with training job model as input.
    model_eval_task = classification_model_eval_metrics(
        project,
        gcp_region,
        thresholds_dict_str,
        training_op.outputs["model"],
    )
    
    # Define Condition wrapper component checking if model will be deployed.
    with dsl.Condition(
        model_eval_task.outputs["dep_decision"] == "true",
        name="deploy_decision",
    ):
        
        # Create an endpoint and deploy the model if the condition is true.
        endpoint_op = EndpointCreateOp(
            project=project,
            location=gcp_region,
            display_name=ENDPOINT_DISPLAY_NAME,
        )

        ModelDeployOp(
            model=training_op.outputs["model"],
            endpoint=endpoint_op.outputs["endpoint"],
            dedicated_resources_min_replica_count=1,
            dedicated_resources_max_replica_count=1,
            dedicated_resources_machine_type=MACHINE_TYPE,
        )
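
The `column_specs` dictionary in the pipeline repeats `"numeric"` for every feature. One possible alternative, shown here as a sketch rather than as the notebook's method, is to build the dictionary from a list of column names. The helper name `build_column_specs` is hypothetical; the column names are copied from the pipeline definition above, including the dataset's own `AspectRation` spelling.

```python
NUMERIC_COLUMNS = [
    "Area", "Perimeter", "MajorAxisLength", "MinorAxisLength",
    "AspectRation", "Eccentricity", "ConvexArea", "EquivDiameter",
    "Extent", "Solidity", "roundness", "Compactness",
    "ShapeFactor1", "ShapeFactor2", "ShapeFactor3", "ShapeFactor4",
]
TARGET_COLUMN = "Class"


def build_column_specs(numeric_columns, target_column):
    """Map each feature to "numeric" and the target to "categorical"."""
    specs = {name: "numeric" for name in numeric_columns}
    specs[target_column] = "categorical"
    return specs


column_specs = build_column_specs(NUMERIC_COLUMNS, TARGET_COLUMN)
print(len(column_specs))      # 17
print(column_specs["Class"])  # categorical
```

The resulting dictionary has the same keys and values as the literal passed to `AutoMLTabularTrainingJobRunOp` above.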

## Compile the pipeline

Next, compile the pipeline to the specified JSON file.

In [12]:
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path="tabular_classification_pipeline.json",
)
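
After compiling, it can be useful to check which tasks ended up in the pipeline specification. The sketch below assumes the compiled JSON stores tasks under `root.dag.tasks`, optionally wrapped in a top-level `pipelineSpec` key; that layout is an assumption about the SDK version in use and may differ in yours. The helper name `list_pipeline_tasks` and the task names in the stand-in dictionary are hypothetical.

```python
import json


def list_pipeline_tasks(compiled: dict) -> list:
    """Return the sorted task names found in a compiled pipeline spec."""
    # Assumption: some SDK versions nest the spec under "pipelineSpec".
    spec = compiled.get("pipelineSpec", compiled)
    tasks = spec.get("root", {}).get("dag", {}).get("tasks", {})
    return sorted(tasks)


# Hand-written stand-in for a compiled spec; task names are invented.
fake_spec = json.loads("""
{
  "pipelineSpec": {
    "root": {"dag": {"tasks": {"train-model": {}, "create-dataset": {}}}}
  }
}
""")

print(list_pipeline_tasks(fake_spec))  # ['create-dataset', 'train-model']
```

To inspect the real file, load `tabular_classification_pipeline.json` with `json.load` and pass the result to the helper.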



## Run the pipeline


In [None]:
# Optional suffix for making display names unique; leave it empty to use the default display names.
UUID = ""
PIPELINE_DISPLAY_NAME = f"pipeline_beans_{UUID}"
DATASET_DISPLAY_NAME = f"dataset_beans_{UUID}"
MODEL_DISPLAY_NAME = f"model_beans_{UUID}"
TRAINING_DISPLAY_NAME = f"automl_training_beans_{UUID}"
ENDPOINT_DISPLAY_NAME = f"endpoint_beans_{UUID}"

# Set machine type
MACHINE_TYPE = "n1-standard-4"
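
If several people run this notebook in the same project, identical display names can be hard to tell apart. A short random suffix is one way to avoid that. The following sketch is an assumption about how `UUID` might be filled, not something the notebook prescribes; the helper name `generate_suffix` is hypothetical.

```python
import random
import string


def generate_suffix(length: int = 8) -> str:
    """Return a random string of lowercase letters and digits."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choices(alphabet, k=length))


suffix = generate_suffix()
print(f"pipeline_beans_{suffix}")
```

Assigning the result to `UUID` before the display names are built would give each run its own set of names.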

In [14]:
# Configure the pipeline
job = aiplatform.PipelineJob(
    display_name=PIPELINE_DISPLAY_NAME,
    template_path="tabular_classification_pipeline.json",
    pipeline_root=PIPELINE_ROOT,
    parameter_values={
        "project": PROJECT_ID,
        "gcp_region": REGION,
        "bq_source": "bq://jp-sandbox-359611.beans_vpip.beans",
        "thresholds_dict_str": '{"auRoc": 0.95}',
        "DATASET_DISPLAY_NAME": DATASET_DISPLAY_NAME,
        "TRAINING_DISPLAY_NAME": TRAINING_DISPLAY_NAME,
        "MODEL_DISPLAY_NAME": MODEL_DISPLAY_NAME,
        "ENDPOINT_DISPLAY_NAME": ENDPOINT_DISPLAY_NAME,
        "MACHINE_TYPE": MACHINE_TYPE,
    },
    enable_caching=False,
)
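
The `thresholds_dict_str` parameter is a JSON string because the pipeline parameter is declared as `str`. Rather than typing the JSON by hand, it can be produced from a dictionary with `json.dumps`, as in this small sketch:

```python
import json

# Thresholds as a regular dictionary; the component parses the string back with json.loads.
thresholds = {"auRoc": 0.95}
thresholds_dict_str = json.dumps(thresholds)

print(thresholds_dict_str)  # {"auRoc": 0.95}
```

The string round-trips through `json.loads` to the original dictionary, which is what the evaluation component relies on.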

Run the pipeline job. Click on the generated link to see your run in the Cloud Console.

In [15]:
# Run the job
job.run()

Creating PipelineJob
PipelineJob created. Resource name: projects/887365041836/locations/us-central1/pipelineJobs/automl-tabular-beans-training-20220905081603
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/887365041836/locations/us-central1/pipelineJobs/automl-tabular-beans-training-20220905081603')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/automl-tabular-beans-training-20220905081603?project=887365041836
PipelineJob projects/887365041836/locations/us-central1/pipelineJobs/automl-tabular-beans-training-20220905081603 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/887365041836/locations/us-central1/pipelineJobs/automl-tabular-beans-training-20220905081603 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/887365041836/locations/us-central1/pipelineJobs/automl-tabular-beans-training-20220905081603 current state:
PipelineState.PIPELINE_S
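
The log above shows both the job's resource name and its Cloud Console link. The link follows a visible pattern, so it can be rebuilt from the resource name if the output is lost. This is a sketch based only on the URL format printed above; the helper names are hypothetical, and the URL scheme could change.

```python
import re

# projects/<project>/locations/<location>/pipelineJobs/<job id>
_JOB_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/pipelineJobs/(?P<job_id>[^/]+)$"
)


def parse_pipeline_job_name(resource_name: str) -> dict:
    """Split a PipelineJob resource name into project, location, and job ID."""
    match = _JOB_NAME.match(resource_name)
    if match is None:
        raise ValueError(f"not a PipelineJob resource name: {resource_name!r}")
    return match.groupdict()


def console_url(resource_name: str) -> str:
    """Build the Cloud Console link in the format shown in the run log."""
    parts = parse_pipeline_job_name(resource_name)
    return (
        "https://console.cloud.google.com/vertex-ai/locations/"
        f"{parts['location']}/pipelines/runs/{parts['job_id']}"
        f"?project={parts['project']}"
    )


name = (
    "projects/887365041836/locations/us-central1/pipelineJobs/"
    "automl-tabular-beans-training-20220905081603"
)
print(console_url(name))
```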

## Send Prediction Requests

In [55]:
from typing import Dict

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# Define function to request prediction from model endpoint based on JSON input.
def predict_tabular_classification_sample(
    project: str,
    endpoint_id: str,
    instance_dict: Dict,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    # for more info on the instance schema, please use get_model_sample.py
    # and look at the yaml found in instance_schema_uri
    instance = json_format.ParseDict(instance_dict, Value())
    instances = [instance]
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    endpoint = client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )
    print("response")
    print(" deployed_model_id:", response.deployed_model_id)
    # See gs://google-cloud-aiplatform/schema/predict/prediction/tabular_classification_1.0.0.yaml for the format of the predictions.
    predictions = response.predictions
    for prediction in predictions:
        print(" prediction:", dict(prediction))

In [56]:
# Construct a BigQuery client object.
client = bigquery.Client()

# Define vertex model endpoint ID.
ENDPOINT_ID = "3179954753295613952"

In [57]:
# Querying the datapoint with max area value
query = """
    SELECT 
      *
    FROM `jp-sandbox-359611.beans_vpip.beans`
    WHERE Area = (
      SELECT MAX(Area)
      FROM `jp-sandbox-359611.beans_vpip.beans`
    )
    """
query_job = client.query(query)  # Make an API request.

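# Extract the rows as dictionaries and serialize them to JSON.
# default=str converts values that are not JSON-native to strings.
records = [dict(row) for row in query_job]
json_results = json.dumps(records, default=str)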

In [59]:
predict_tabular_classification_sample(
    project=PROJECT_ID,
    endpoint_id=ENDPOINT_ID,
    location="us-central1",
    instance_dict=records[0],
)

response
 deployed_model_id: 1514075889959174144
 prediction: {'scores': [0.0, 9.879711165818666e-33, 1.630759597343158e-20, 1.140950795012128e-18, 4.064171665874028e-09, 3.03625202237312e-12, 1.0], 'classes': ['DERMASON', 'SIRA', 'SEKER', 'HOROZ', 'CALI', 'BARBUNYA', 'BOMBAY']}
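
The response pairs each entry in `classes` with the score at the same position in `scores`. A small helper can pick the most likely class. The sketch below works on a plain dictionary copied from the output above; the helper name `top_prediction` is hypothetical.

```python
def top_prediction(prediction: dict):
    """Return the (class, score) pair with the highest score."""
    classes = prediction["classes"]
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


# Prediction copied from the response shown above.
prediction = {
    "scores": [
        0.0, 9.879711165818666e-33, 1.630759597343158e-20,
        1.140950795012128e-18, 4.064171665874028e-09,
        3.03625202237312e-12, 1.0,
    ],
    "classes": [
        "DERMASON", "SIRA", "SEKER", "HOROZ", "CALI", "BARBUNYA", "BOMBAY",
    ],
}

label, score = top_prediction(prediction)
print(label, score)  # BOMBAY 1.0
```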


## Send altered prediction requests

In [None]:
# Send the same request once per second; interrupt the kernel to stop the loop.
while True:
    predict_tabular_classification_sample(
        project=PROJECT_ID,
        endpoint_id=ENDPOINT_ID,
        location="us-central1",
        instance_dict=records[0],
    )

    time.sleep(1)

response
 deployed_model_id: 1514075889959174144
 prediction: {'classes': ['DERMASON', 'SIRA', 'SEKER', 'HOROZ', 'CALI', 'BARBUNYA', 'BOMBAY'], 'scores': [0.0, 9.879711165818666e-33, 1.630759597343158e-20, 1.140950795012128e-18, 4.064171665874028e-09, 3.03625202237312e-12, 1.0]}
response
 deployed_model_id: 1514075889959174144
 prediction: {'classes': ['DERMASON', 'SIRA', 'SEKER', 'HOROZ', 'CALI', 'BARBUNYA', 'BOMBAY'], 'scores': [0.0, 9.879711165818666e-33, 1.630759597343158e-20, 1.140950795012128e-18, 4.064171665874028e-09, 3.03625202237312e-12, 1.0]}
response
 deployed_model_id: 1514075889959174144
 prediction: {'scores': [0.0, 9.879711165818666e-33, 1.630759597343158e-20, 1.140950795012128e-18, 4.064171665874028e-09, 3.03625202237312e-12, 1.0], 'classes': ['DERMASON', 'SIRA', 'SEKER', 'HOROZ', 'CALI', 'BARBUNYA', 'BOMBAY']}
response
 deployed_model_id: 1514075889959174144
 prediction: {'scores': [0.0, 9.879711165818666e-33, 1.630759597343158e-20, 1.140950795012128e-18, 4.0641716658
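
The loop above sends the same record on every iteration. To send requests that actually differ, the numeric features can be perturbed slightly before each call. The following is a sketch under assumptions, not the notebook's method: the helper name `perturb_instance`, the 5% noise scale, and the sample values are all illustrative.

```python
import random


def perturb_instance(instance: dict, scale: float = 0.05, rng=None) -> dict:
    """Return a copy with each numeric value scaled by a random factor in [1-scale, 1+scale]."""
    rng = rng or random.Random()
    altered = {}
    for name, value in instance.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number:
            altered[name] = value * (1.0 + rng.uniform(-scale, scale))
        else:
            altered[name] = value  # leave strings such as the class label unchanged
    return altered


# Invented record, for illustration only.
base = {"Area": 254616, "Perimeter": 1985.37, "Class": "BOMBAY"}
altered = perturb_instance(base, scale=0.05, rng=random.Random(0))
print(altered)
```

Inside the loop, `perturb_instance(records[0])` could replace `records[0]` as the `instance_dict` argument.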