In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# E2E ML on GCP: MLOps stage 4 : formalization: get started with Vertex ML Metadata

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage4/get_started_with_vertex_ml_metadata.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage4/get_started_with_vertex_ml_metadata.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/ml_ops/stage4/get_started_with_vertex_ml_metadata.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>            
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI for E2E MLOps on Google Cloud in production. This tutorial covers stage 4, formalization: get started with Vertex ML Metadata.

### Objective

In this tutorial, you learn how to use `Vertex ML Metadata`.

This tutorial uses the following Google Cloud ML services:

- `Vertex ML Metadata`
- `Vertex AI Pipelines`

The steps performed include:

- Create a `Metadatastore` resource.
- Create (record)/List an `Artifact`, with artifacts and metadata.
- Create (record)/List an `Execution`.
- Create (record)/List a `Context`.
- Add `Artifact` to `Execution` as events.
- Add `Execution` and `Artifact` into the `Context`.
- Delete `Artifact`, `Execution` and `Context`.
- Create and run a `Vertex AI Pipeline` ML workflow to train and deploy a scikit-learn model.
    - Create custom pipeline components that generate artifacts and metadata.
    - Compare Vertex AI Pipelines runs.
    - Trace the lineage for pipeline-generated artifacts.
    - Query your pipeline run metadata.

### Dataset

The dataset used for this tutorial is the UCI Machine Learning ['Dry beans dataset'](https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset), from: KOKLU, M. and OZKAN, I.A., (2020), "Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques." In Computers and Electronics in Agriculture, 174, 105507. [DOI](https://doi.org/10.1016/j.compag.2020.105507).

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installations

Install the packages required for executing the notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform[tensorboard] $USER_FLAG -q
! pip3 install --upgrade google-cloud-pipeline-components $USER_FLAG -q

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### GPU runtime

*Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select* **Runtime > Change Runtime Type > GPU**

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the following APIs: Vertex AI APIs, Compute Engine APIs, and Cloud Storage.](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,storage-component.googleapis.com)

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$`.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [None]:
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
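
As a minimal sketch of how such a timestamp is typically appended to a resource name, the helper below joins a base name and a timestamp with a dash. The helper name `with_timestamp` and the fixed example datetime are assumptions made for illustration; the notebook itself calls `datetime.now()`.

```python
from datetime import datetime


def with_timestamp(base_name: str, timestamp: str) -> str:
    """Append a session timestamp to a base resource name."""
    return f"{base_name}-{timestamp}"


# A fixed datetime keeps this example reproducible; the notebook uses datetime.now().
example_timestamp = datetime(2022, 1, 31, 9, 5, 7).strftime("%Y%m%d%H%M%S")
job_id = with_timestamp("mlmd-pipeline-small", example_timestamp)
print(job_id)
```

Because the format string sorts chronologically, names built this way also list in creation order.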

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

**Click Create service account**.

In the **Service account name** field, enter a name, and click **Create**.

In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex" into the filter box, and select **Vertex Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

Click Create. A JSON file that contains your key downloads to your local environment.

Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [None]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = False
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        IS_COLAB = True
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "aip-" + TIMESTAMP
    BUCKET_URI = "gs://" + BUCKET_NAME
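
Before creating the bucket, it can help to check that the generated name is plausible. The sketch below mirrors the fallback name used above and tests it against a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The function names are hypothetical, and the check does not cover every rule, so treat it as a quick sanity test rather than a full validator.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified naming check."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


def default_bucket_name(project_id: str, timestamp: str) -> str:
    """Mirror the notebook's fallback: PROJECT_ID + "aip-" + TIMESTAMP."""
    return project_id + "aip-" + timestamp


candidate = default_bucket_name("my-project", "20220131090507")
print(candidate, looks_like_valid_bucket_name(candidate))
```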

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_URI

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

#### Service Account

**If you don't know your service account**, try to get it with the `gcloud` command by executing the second cell below.

In [None]:
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}

In [None]:
if (
    SERVICE_ACCOUNT == ""
    or SERVICE_ACCOUNT is None
    or SERVICE_ACCOUNT == "[your-service-account]"
):
    # Get your service account from gcloud
    if not IS_COLAB:
        shell_output = !gcloud auth list 2>/dev/null
        SERVICE_ACCOUNT = shell_output[2].replace("*", "").strip()

    if IS_COLAB:
        shell_output = ! gcloud projects describe  $PROJECT_ID
        project_number = shell_output[-1].split(":")[1].strip().replace("'", "")
        SERVICE_ACCOUNT = f"{project_number}-compute@developer.gserviceaccount.com"

    print("Service Account:", SERVICE_ACCOUNT)
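
The Colab branch above reads the project number from the last line of the `gcloud projects describe` output. A slightly more defensive sketch looks the field up by key instead of by position. The function names and the sample output lines are assumptions for illustration only; the service account pattern is the one used in the cell above.

```python
def find_project_number(describe_lines):
    """Return the project number from `gcloud projects describe` output lines."""
    for line in describe_lines:
        key, _, value = line.partition(":")
        if key.strip() == "projectNumber":
            # Strip surrounding whitespace and optional quotes.
            return value.strip().strip("'\"")
    raise ValueError("projectNumber not found in gcloud output")


def default_compute_service_account(project_number):
    """Build the default Compute Engine service account address."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


# Hypothetical output lines, for illustration only.
sample_lines = [
    "createTime: '2022-01-01T00:00:00.000Z'",
    "lifecycleState: ACTIVE",
    "name: my-project",
    "projectId: my-project",
    "projectNumber: '123456789012'",
]

number = find_project_number(sample_lines)
print(default_compute_service_account(number))
```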

#### Set service account access for Vertex AI Pipelines

Run the following commands to grant your service account access to read and write pipeline artifacts in the bucket that you created in the previous step -- you only need to run these once per service account.

In [None]:
! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectCreator $BUCKET_URI

! gsutil iam ch serviceAccount:{SERVICE_ACCOUNT}:roles/storage.objectViewer $BUCKET_URI

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
import google.cloud.aiplatform as aip

#### Import Vertex AI SDK

Import the Vertex AI SDK into your Python environment.

In [None]:
import google.cloud.aiplatform_v1beta1 as aip_beta

#### Vertex AI constants

Set up the following constants for Vertex AI:

- `API_ENDPOINT`: The Vertex AI API service endpoint for `ML Metadata` services.

In [None]:
# API service endpoint
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)

# Vertex location root path for your dataset, model and endpoint resources
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION

## Set up clients

The Vertex AI client library works as a client/server model. On your side (the Python script), you create a client that sends requests to and receives responses from the Vertex AI server.

You use different clients in this tutorial for different steps in the workflow, so set them all up upfront.

- Metadata Service for creating, recording, searching, and analyzing artifacts and metadata.

In [None]:
# client options same for all services
client_options = {"api_endpoint": API_ENDPOINT}


def create_metadata_client():
    client = aip_beta.MetadataServiceClient(client_options=client_options)
    return client


clients = {}
clients["metadata"] = create_metadata_client()

for client in clients.items():
    print(client)

## Introduction to Vertex AI Metadata

The `Vertex ML Metadata` service provides you with the ability to record, and subsequently search and analyze, the artifacts and corresponding metadata produced by your ML workflows. For example, during experimentation one might desire to record the location of the model artifacts, as artifacts, and the training hyperparameters and evaluation metrics as the corresponding metadata.

The service supports recording ML metadata both manually and automatically, with the latter occurring when you use Vertex AI Pipelines.

### Concepts and organization

Vertex ML Metadata describes your ML system's metadata as a graph.

**Artifacts**: Artifacts are pieces of data that ML systems consume or produce, such as: datasets, models, or logs. For large artifacts like datasets or models, the artifact record includes the URI where the data is stored.

**Executions**: An execution describes a single step in your ML system's workflow.

**Events**: Executions can depend on artifacts as inputs or produce artifacts as outputs. Events describe the relationship between artifacts and executions to help you determine the lineage of artifacts. For example, an event is created to record that a dataset is used by an execution, and another event is created to record that this execution produced a model.

**Contexts**: Contexts let you group artifacts and executions together in a single, queryable, and typed category.

### ML artifact lineage

Vertex ML Metadata provides the ability to understand changes in the performance of your ML system, and to analyze the metadata produced by your ML workflow and the lineage of its artifacts. An artifact's lineage includes all the factors that contributed to its creation, as well as artifacts and metadata that descend from this artifact.
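
To make the graph idea concrete, here is a toy, in-memory sketch of how events connect artifacts and executions, and how the upstream half of an artifact's lineage can be traced from them. It does not use the Vertex ML Metadata API; the `Event` tuple, the function `upstream_lineage`, and the sample names are all assumptions for illustration.

```python
from collections import namedtuple

# Toy stand-in for metadata events; not the service's data model.
Event = namedtuple("Event", ["execution", "artifact", "type"])

events = [
    Event("train", "dataset", "INPUT"),
    Event("train", "model", "OUTPUT"),
    Event("deploy", "model", "INPUT"),
    Event("deploy", "endpoint", "OUTPUT"),
]


def upstream_lineage(artifact, events):
    """Return (artifacts, executions) that contributed to `artifact`."""
    artifacts, executions = set(), set()
    pending = [artifact]
    while pending:
        current = pending.pop()
        # Executions that produced the current artifact.
        producers = {
            e.execution
            for e in events
            if e.artifact == current and e.type == "OUTPUT"
        }
        for execution in producers - executions:
            executions.add(execution)
            # Inputs of that execution are further upstream.
            for e in events:
                if (
                    e.execution == execution
                    and e.type == "INPUT"
                    and e.artifact not in artifacts
                ):
                    artifacts.add(e.artifact)
                    pending.append(e.artifact)
    return artifacts, executions


print(upstream_lineage("endpoint", events))
```

The sketch only walks upstream; tracing descendants would follow the same events in the opposite direction.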

Learn more about [Introduction to Vertex ML Metadata](https://cloud.google.com/vertex-ai/docs/ml-metadata/introduction).

### Create a `MetadataStore` resource

Each project may have one or more `MetadataStore` resources. By default, if none is explicitly created, each project has a default store, which is specified as:

    projects/<project_id>/locations/<region>/metadataStores/<name>

You create a `MetadataStore` resource using the `create_metadata_store()` method, with the following parameters:

- `parent`: The fully qualified subpath for all resources in your project, i.e., projects/<project_id>/locations/<location>
- `metadata_store_id`: The name of the `MetadataStore` resource.
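
The resource name patterns above can be sketched as small string helpers. The function names are hypothetical and the project, region, and store IDs are placeholders; only the path layout comes from the text.

```python
def location_path(project_id, region):
    """Build the `parent` path for a project and region."""
    return f"projects/{project_id}/locations/{region}"


def metadata_store_path(project_id, region, store_id):
    """Build the fully qualified name of a MetadataStore resource."""
    return f"{location_path(project_id, region)}/metadataStores/{store_id}"


def split_metadata_store_path(name):
    """Split a store resource name into (project_id, region, store_id)."""
    parts = name.split("/")
    expected_keys = ["projects", "locations", "metadataStores"]
    if len(parts) != 6 or parts[0::2] != expected_keys:
        raise ValueError(f"not a MetadataStore resource name: {name!r}")
    return parts[1], parts[3], parts[5]


store_name = metadata_store_path("my-project", "us-central1", "my-metadata-store")
print(store_name)
print(split_metadata_store_path(store_name))
```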

In [None]:
metadata_store = clients["metadata"].create_metadata_store(
    parent=PARENT, metadata_store_id="my-metadata-store"
)

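# create_metadata_store() returns a long-running operation; result() waits
# for it and returns the MetadataStore, whose name is the full resource path.
metadata_store_id = metadata_store.result().name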
print(metadata_store_id)

### List metadata schemas

When you create an `Artifact`, `Execution` or `Context` resource, you specify a schema that describes the corresponding metadata. The schemas must be pre-registered for your `Metadatastore` resource.

You can get a list of all registered schemas, default and user defined, using the `list_metadata_schemas()` method, with the following parameters:

- `parent`: The fully qualified resource identifier for the `MetadataStore` resource.

Learn more about [Metadata system schemas](https://cloud.google.com/vertex-ai/docs/ml-metadata/system-schemas).

In [None]:
schemas = clients["metadata"].list_metadata_schemas(parent=metadata_store_id)

for schema in schemas:
    print(schema)

### Create an `Artifact` resource

You create an `Artifact` resource using the `create_artifact()` method, with the following parameters:

- `parent`: The fully qualified resource identifier to the `Metadatastore` resource.
- `artifact`: The definition of the `Artifact` resource
    - `display_name`: The human readable name for the `Artifact` resource.
    - `uri`: The uniform resource identifier of the artifact file. May be empty if there is no actual artifact file.
    - `labels`: User defined labels to assign to the `Artifact` resource.
    - `schema_title`: The title of the schema that describes the metadata.
    - `metadata`: The metadata key/value pairs to associate with the `Artifact` resource.
- `artifact_id`: (optional) A user defined short ID for the `Artifact` resource.
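
As a rough sketch of how these parameters fit together, the helper below assembles them into a plain dictionary with the same nesting as the list above. The helper name is hypothetical, and passing such a dictionary as the request to `create_artifact()` is an assumption based on how this notebook later passes a dictionary to `list_artifacts()`; the cell that follows uses the typed `Artifact` message instead.

```python
def build_create_artifact_request(
    parent,
    display_name,
    schema_title="system.Artifact",
    uri="",
    labels=None,
    metadata=None,
    artifact_id=None,
):
    """Assemble a plain-dict request mirroring the parameters listed above."""
    artifact = {"display_name": display_name, "schema_title": schema_title}
    if uri:
        artifact["uri"] = uri
    if labels:
        artifact["labels"] = dict(labels)
    if metadata:
        artifact["metadata"] = dict(metadata)

    request = {"parent": parent, "artifact": artifact}
    if artifact_id:
        # Optional short ID; omitted when not supplied.
        request["artifact_id"] = artifact_id
    return request


request = build_create_artifact_request(
    parent="projects/my-project/locations/us-central1/metadataStores/my-metadata-store",
    display_name="my_example_artifact",
    uri="my_url",
    labels={"my_label": "value"},
    metadata={"param": "value"},
    artifact_id="myartifactid",
)
print(request)
```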

In [None]:
from google.cloud.aiplatform_v1beta1.types import Artifact

artifact_item = Artifact(
    display_name="my_example_artifact",
    uri="my_url",
    labels={"my_label": "value"},
    schema_title="system.Artifact",
    metadata={"param": "value"},
)

artifact = clients["metadata"].create_artifact(
    parent=metadata_store_id,
    artifact=artifact_item,
    artifact_id="myartifactid",
)

print(artifact)

### List `Artifact` resources in a `Metadatastore`

You can list all `Artifact` resources using the `list_artifacts()` method, with the following parameters:

- `parent`: The fully qualified resource identifier for the `MetadataStore` resource.

In [None]:
artifacts = clients["metadata"].list_artifacts(parent=metadata_store_id)

for _artifact in artifacts:
    print(_artifact)

### Create an `Execution` resource

You create an `Execution` resource using the `create_execution()` method, with the following parameters:

- `parent`: The fully qualified resource identifier to the `Metadatastore` resource.
- `execution`:
    - `display_name`: A human readable name for the `Execution` resource.
    - `schema_title`: The title of the schema that describes the metadata.
    - `metadata`: The metadata key/value pairs to associate with the `Execution` resource.
- `execution_id`: (optional) A user defined short ID for the `Execution` resource.

In [None]:
from google.cloud.aiplatform_v1beta1.types import Execution

execution = clients["metadata"].create_execution(
    parent=metadata_store_id,
    execution=Execution(
        display_name="my_execution",
        schema_title="system.CustomJobExecution",
        metadata={"value": "param"},
    ),
    execution_id="myexecutionid",
)

print(execution)

### List `Execution` resources in a `Metadatastore`

You can list all `Execution` resources using the `list_executions()` method, with the following parameters:

- `parent`: The fully qualified resource identifier for the `MetadataStore` resource.

In [None]:
executions = clients["metadata"].list_executions(parent=metadata_store_id)

for _execution in executions:
    print(_execution)

### Create a `Context` resource

You create a `Context` resource using the `create_context()` method, with the following parameters:

- `parent`: The fully qualified resource identifier to the `Metadatastore` resource.
- `context`:
    - `display_name`: A human readable name for the `Context` resource.
    - `schema_title`: The title of the schema that describes the metadata.
    - `labels`: User defined labels to assign to the `Context` resource.
    - `metadata`: The metadata key/value pairs to associate with the `Context` resource.
- `context_id`: (optional) A user defined short ID for the `Context` resource.

In [None]:
from google.cloud.aiplatform_v1beta1.types import Context

context = clients["metadata"].create_context(
    parent=metadata_store_id,
    context=Context(
        display_name="my_context",
        labels={"my_label": "my_value"},
        schema_title="system.Pipeline",
        metadata={"param": "value"},
    ),
    context_id="mycontextid",
)

print(context)

### List `Context` resources in a `Metadatastore`

You can list all `Context` resources using the `list_contexts()` method, with the following parameters:

- `parent`: The fully qualified resource identifier for the `MetadataStore` resource.

In [None]:
contexts = clients["metadata"].list_contexts(parent=metadata_store_id)

for _context in contexts:
    print(_context)

### Add events to `Execution` resource

An `Execution` resource consists of a sequence of events that occurred during the execution. Each event consists of an artifact that is either an input or an output of the `Execution` resource.

You can add execution events to an `Execution` resource using the `add_execution_events()` method, with the following parameters:

- `execution`: The fully qualified resource identifier for the `Execution` resource.
- `events`: The sequence of events constituting the execution.

In [None]:
from google.cloud.aiplatform_v1beta1.types import Event

clients["metadata"].add_execution_events(
    execution=execution.name,
    events=[
        Event(
            artifact=artifact.name,
            type_=Event.Type.INPUT,
            labels={"my_label": "my_value"},
        )
    ],
)

### Combine Artifacts and Executions into a Context

A Context is used to group `Artifact` resources and `Execution` resources together under a single, queryable, and typed category. Contexts can be used to represent sets of metadata.

You can combine a set of `Artifact` and `Execution` resources into a `Context` resource using the `add_context_artifacts_and_executions()` method, with the following parameters:

- `context`: The fully qualified resource identifier of the `Context` resource.
- `artifacts`: A list of fully qualified resource identifiers of the `Artifact` resources.
- `executions`: A list of fully qualified resource identifiers of the `Execution` resources.

In [None]:
clients["metadata"].add_context_artifacts_and_executions(
    context=context.name, artifacts=[artifact.name], executions=[execution.name]
)

### Query a context

You can query the subgraph of a `Context` resource using the `query_context_lineage_subgraph()` method, with the following parameters:

- `context`: The fully qualified resource identifier of the `Context` resource.

In [None]:
subgraph = clients["metadata"].query_context_lineage_subgraph(context=context.name)

print(subgraph)

### Delete an `Artifact` resource

You can delete an `Artifact` resource using the `delete_artifact()` method, with the following parameters:

- `name`: The fully qualified resource identifier for the `Artifact` resource.

In [None]:
clients["metadata"].delete_artifact(name=artifact.name)

### Delete an `Execution` resource

You can delete an `Execution` resource using the `delete_execution()` method, with the following parameters:

- `name`: The fully qualified resource identifier for the `Execution` resource.

In [None]:
clients["metadata"].delete_execution(name=execution.name)

### Delete a `Context` resource

You can delete a `Context` resource using the `delete_context()` method, with the following parameters:

- `name`: The fully qualified resource identifier for the `Context` resource.

In [None]:
clients["metadata"].delete_context(name=context.name)

## Introduction to tracking ML Metadata in a `Vertex AI Pipeline`

Vertex AI Pipelines automatically records the metrics and artifacts created when the pipeline is executed. You can then use the SDK to track and analyze the metrics and artifacts across pipeline runs.

In [None]:
from kfp.v2 import compiler, dsl
from kfp.v2.dsl import (Artifact, Dataset, Input, Metrics, Model, Output,
                        OutputPath, component, pipeline)

### Creating a 3-step pipeline with custom components

First, you create a pipeline to run on `Vertex AI Pipelines`, consisting of the following custom components:

* `get_dataframe`: Retrieve data from a BigQuery table and convert it into a pandas DataFrame.
* `sklearn_train`: Use the pandas DataFrame to train and export a scikit-learn model, along with some metrics.
* `deploy_model`: Deploy the exported scikit-learn model to a `Vertex AI Endpoint` resource.

#### get_dataframe component

This component does the following:

* Creates a reference to a BigQuery table using the BigQuery client library
* Downloads the BigQuery table and converts it to a shuffled pandas DataFrame
* Exports the DataFrame to a CSV file

#### sklearn_train component

This component does the following:

* Imports a CSV as a pandas DataFrame
* Splits the DataFrame into train and test sets
* Trains a scikit-learn model
* Logs metrics from the model
* Saves the model artifacts as a local `model.joblib` file

#### deploy_model component

This component does the following:

* Uploads the scikit-learn model to a `Vertex AI Model` resource.
* Deploys the model to a `Vertex AI Endpoint` resource.

In [None]:
@component(
    packages_to_install=["google-cloud-bigquery", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="create_dataset.yaml",
)
def get_dataframe(bq_table: str, output_data_path: OutputPath("Dataset")):
    from google.cloud import bigquery

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(bq_table)
    rows = bqclient.list_rows(table)
    dataframe = rows.to_dataframe(
        create_bqstorage_client=True,
    )
    dataframe = dataframe.sample(frac=1, random_state=2)
    dataframe.to_csv(output_data_path)


@component(
    packages_to_install=["scikit-learn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="beans_model_component.yaml",
)
def sklearn_train(
    dataset: Input[Dataset], metrics: Output[Metrics], model: Output[Model]
):
    import pandas as pd
    from joblib import dump
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv(dataset.path)
    labels = df.pop("Class").tolist()
    data = df.values.tolist()
    x_train, x_test, y_train, y_test = train_test_split(data, labels)

    skmodel = DecisionTreeClassifier()
    skmodel.fit(x_train, y_train)
    score = skmodel.score(x_test, y_test)
    print("accuracy is:", score)

    metrics.log_metric("accuracy", (score * 100.0))
    metrics.log_metric("framework", "Scikit Learn")
    metrics.log_metric("dataset_size", len(df))
    dump(skmodel, model.path + ".joblib")


@component(
    packages_to_install=["google-cloud-aiplatform"],
    base_image="python:3.9",
    output_component_file="beans_deploy_component.yaml",
)
def deploy_model(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model],
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)

    deployed_model = aiplatform.Model.upload(
        display_name="beans-model-pipeline",
        artifact_uri=model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest",
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")

    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name

### Construct and compile the pipeline

Next, construct the pipeline:

In [None]:
PIPELINE_ROOT = f"{BUCKET_URI}/pipeline_root/3step"


@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="mlmd-pipeline",
)
def pipeline(
    bq_table: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION,
):
    dataset_task = get_dataframe(bq_table)

    model_task = sklearn_train(dataset_task.output)

    deploy_model(model=model_task.outputs["model"], project=project, region=region)

### Compile and execute two runs of the pipeline

Next, you compile the pipeline and then run two separate instances of the pipeline. In the first instance, you train the model with a small version of the dataset and in the second instance you train it with a larger version of the dataset.

In [None]:
# ISO 8601 timestamp truncated to whole seconds, used later to filter artifacts
NOW = datetime.now().isoformat(timespec="seconds")

compiler.Compiler().compile(pipeline_func=pipeline, package_path="mlmd_pipeline.json")

run1 = aip.PipelineJob(
    display_name="mlmd-pipeline",
    template_path="mlmd_pipeline.json",
    job_id="mlmd-pipeline-small-{}".format(TIMESTAMP),
    parameter_values={"bq_table": "sara-vertex-demos.beans_demo.small_dataset"},
    enable_caching=True,
)

run2 = aip.PipelineJob(
    display_name="mlmd-pipeline",
    template_path="mlmd_pipeline.json",
    job_id="mlmd-pipeline-large-{}".format(TIMESTAMP),
    parameter_values={"bq_table": "sara-vertex-demos.beans_demo.large_dataset"},
    enable_caching=True,
)

run1.run()
run2.run()

run1.delete()
run2.delete()

! rm -f mlmd_pipeline.json *.yaml

### Compare the pipeline runs

Now that you have two completed pipeline runs, you can compare the runs.

You can use the `get_pipeline_df()` method to access the metadata from the runs. The `mlmd-pipeline` parameter here refers to the name you gave to your pipeline.

**Alternatively, for guidance on inspecting pipeline artifacts and metadata in the Vertex AI Console, see [this codelab](https://codelabs.developers.google.com/vertex-mlmd-pipelines#5).**

In [None]:
df = aip.get_pipeline_df(pipeline="mlmd-pipeline")
print(df)

### Visualize the pipeline runs

Next, you create a custom visualization with matplotlib to see the relationship between your model's accuracy and the amount of data used for training.

In [None]:
import matplotlib.pyplot as plt

plt.plot(df["metric.dataset_size"], df["metric.accuracy"], label="Accuracy")
plt.title("Accuracy and dataset size")
plt.legend(loc=4)
plt.show()

### Querying your `Metadatastore` resource

Finally, you query your `Metadatastore` resource by specifying a `filter` parameter when calling the `list_artifacts()` method.
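
The filter is a plain string, so it can be assembled with ordinary string formatting. The sketch below builds a `create_time` and `state` filter of the same shape as the one used in the next cell. The helper name is hypothetical, and attaching an explicit UTC offset to the timestamp is an assumption made here to keep the comparison unambiguous.

```python
from datetime import datetime, timezone


def created_since_filter(since, state="LIVE"):
    """Build a filter for resources created at or after `since`."""
    if since.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    stamp = since.isoformat(timespec="seconds")
    return f'create_time >= "{stamp}" AND state = {state}'


# A fixed datetime keeps this example reproducible.
example_filter = created_since_filter(
    datetime(2022, 1, 31, 9, 5, 7, tzinfo=timezone.utc)
)
print(example_filter)
```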

In [None]:
FILTER = f'create_time >= "{NOW}" AND state = LIVE'
artifact_req = {
    "parent": metadata_store_id,
    "filter": FILTER,
}

artifacts = clients["metadata"].list_artifacts(artifact_req)

for _artifact in artifacts:
    print(_artifact)
    clients["metadata"].delete_artifact(name=_artifact.name)

### Delete a `MetadataStore` resource

You can delete a `MetadataStore` resource using the `delete_metadata_store()` method, with the following parameters:

- `name`: The fully qualified resource identifier for the `MetadataStore` resource.

In [None]:
clients["metadata"].delete_metadata_store(name=metadata_store_id)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial.

In [None]:
delete_bucket = False

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI