In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoMLOps - BQML Introduction Training Example

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/automlops/main/examples/training/03_bqml_introduction_training_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/automlops/blob/main/examples/training/03_bqml_introduction_training_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/automlops/main/examples/training/03_bqml_introduction_training_example.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

# Overview

In this tutorial, you will build a [Vertex AI](https://cloud.google.com/vertex-ai) pipeline, complete with an integrated CI/CD pipeline. This tutorial will walk you through how to use AutoMLOps to define, create and run pipelines for BQML training jobs.

# Objective
In this tutorial, you will learn how to create and run MLOps pipelines integrated with CI/CD. This tutorial goes through an example pipeline that uses BQML for model training and evaluation. The pipeline has the following steps:
1. create_dataset: A custom component that will create an empty BQ dataset resource.
2. train_model: A custom component that will train a DNN classifier on the training data.
3. evaluate_model: A custom component that will evaluate the performance of the classifier.
4. deploy_model: A custom component that will take the classifier and deploy it to an endpoint.

# Prerequisites

In order to use AutoMLOps, the following are required:

- Python 3.7 - 3.10
- [Google Cloud SDK 407.0.0](https://cloud.google.com/sdk/gcloud/reference)
- [beta 2022.10.21](https://cloud.google.com/sdk/gcloud/reference/beta)
- `git` installed
- `git` configured with your user identity:
```
  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"
```
- [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc) are set up. This can be done through the following commands:
```
gcloud auth application-default login
gcloud config set account <account@example.com>
```
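To confirm where file-based Application Default Credentials would be picked up, a minimal stdlib-only sketch can check the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first and then the well-known file written by `gcloud auth application-default login`. The helper name `find_adc_file` is hypothetical. The check assumes the default gcloud config directory (a custom `CLOUDSDK_CONFIG` is not handled) and does not cover credentials served by a metadata server, for example on Vertex AI Workbench.

```python
import os
from pathlib import Path
from typing import Optional


def find_adc_file() -> Optional[Path]:
    """Return the credentials file ADC would likely use, or None."""
    # 1. An explicit key file takes precedence when the variable is set.
    explicit = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None

    # 2. Otherwise look for the file written by
    #    `gcloud auth application-default login`.
    if os.name == 'nt':
        base = Path(os.environ.get('APPDATA', '')) / 'gcloud'
    else:
        base = Path.home() / '.config' / 'gcloud'
    candidate = base / 'application_default_credentials.json'
    return candidate if candidate.is_file() else None


print(find_adc_file() or 'No file-based ADC found')
```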

# APIs & IAM
Depending on the options you select, AutoMLOps will enable some or all of the following APIs during the provision step:
- [aiplatform.googleapis.com](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [artifactregistry.googleapis.com](https://cloud.google.com/artifact-registry/docs/reference/rest)
- [cloudbuild.googleapis.com](https://cloud.google.com/build/docs/api/reference/rest)
- [cloudfunctions.googleapis.com](https://cloud.google.com/functions/docs/reference/rest)
- [cloudresourcemanager.googleapis.com](https://cloud.google.com/resource-manager/reference/rest)
- [cloudscheduler.googleapis.com](https://cloud.google.com/scheduler/docs/reference/rest)
- [compute.googleapis.com](https://cloud.google.com/compute/docs/reference/rest/v1)
- [iam.googleapis.com](https://cloud.google.com/iam/docs/reference/rest)
- [iamcredentials.googleapis.com](https://cloud.google.com/iam/docs/reference/credentials/rest)
- [pubsub.googleapis.com](https://cloud.google.com/pubsub/docs/reference/rest)
- [run.googleapis.com](https://cloud.google.com/run/docs/reference/rest)
- [storage.googleapis.com](https://cloud.google.com/storage/docs/apis)
- [sourcerepo.googleapis.com](https://cloud.google.com/source-repositories/docs/reference/rest)


AutoMLOps will create the following service account and update [IAM permissions](https://cloud.google.com/iam/docs/understanding-roles) during the provision step:
1. Pipeline Runner Service Account (defaults to: vertex-pipelines@PROJECT_ID.iam.gserviceaccount.com). Roles added:
- roles/aiplatform.user
- roles/artifactregistry.reader
- roles/bigquery.user
- roles/bigquery.dataEditor
- roles/iam.serviceAccountUser
- roles/storage.admin
- roles/cloudfunctions.admin

# User Guide

For a user guide, see these [slides](../../AutoMLOps_User_Guide.pdf).

# Costs

This tutorial uses billable components of Google Cloud:
- Vertex AI
- Artifact Registry
- Cloud Storage
- Cloud Source Repository
- Cloud Build
- Cloud Run
- Cloud Scheduler
- Cloud Pub/Sub

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

# Ground-rules for using AutoMLOps
1. Do not use variables, functions, or other code defined outside the scope of a custom component. Each custom component becomes its own container and has no access to out-of-scope code.
2. Import statements and helper functions must be added inside the function. Provide parameter type hints.
3. Test each of your components for accuracy and correctness before running them using AutoMLOps. AutoMLOps cannot fix bugs automatically, and bugs are much harder to fix once the code is running in a pipeline.
4. If you are using Kubeflow, be sure to define all the requirements needed to run the custom component. It is easy to leave out a package, which causes the container to fail when it runs within a pipeline.
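
Rules 1 and 2 can be illustrated with a plain Python function, independent of any pipeline tooling. In this sketch the function name `scale_values` and its parameters are hypothetical. The point is that every import and helper lives inside the function body and every parameter carries a type hint, so the function still works when its source is lifted into a container on its own.

```python
def scale_values(values: list, factor: float) -> list:
    """Self-contained: imports and helpers are defined inside the body."""
    import math  # Imported here, not at module level.

    def _round_up(x: float) -> int:  # Helper defined inside the function.
        return math.ceil(x)

    return [_round_up(v * factor) for v in values]


print(scale_values([1.2, 2.5, 3.0], 2.0))
```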


# Dataset
The dataset used for this tutorial is the Penguins dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). This version of the dataset is used to predict penguin species from the available features, such as culmen length and flipper length.

# Setup Git
Set up your git configuration below.

In [None]:
!git config --global user.email 'you@example.com'
!git config --global user.name 'Your Name'

# Install AutoMLOps

Install AutoMLOps from [PyPI](https://pypi.org/project/google-cloud-automlops/), or locally by cloning the repo and running `pip install .`

In [None]:
!pip3 install google-cloud-automlops --user

# Restart the kernel
Once you've installed the AutoMLOps package, you need to restart the notebook kernel so it can find the package.

**Note: Once this cell has finished running, continue on. You do not need to re-run any of the cells above.**

In [1]:
import os

if not os.getenv('IS_TESTING'):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

# Set your project ID
Set your project ID below. If you don't know your project ID, leave the field blank and the following cells may be able to find it.
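
The `!gcloud` shell magic used in the next cells only works inside IPython. Outside a notebook, a rough equivalent is to call `gcloud` through `subprocess`. This is a sketch under stated assumptions: the helper name `resolve_project_id` is hypothetical, and it returns `None` when `gcloud` is missing, fails, times out, or has no default project configured.

```python
import subprocess
from typing import Optional

PLACEHOLDER = '[your-project-id]'


def resolve_project_id(project_id: Optional[str]) -> Optional[str]:
    """Return project_id if set, else ask gcloud for the configured default."""
    if project_id and project_id != PLACEHOLDER:
        return project_id
    try:
        completed = subprocess.run(
            ['gcloud', 'config', 'list', '--format', 'value(core.project)'],
            capture_output=True, text=True, check=True, timeout=30)
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None  # gcloud is not installed, failed, or took too long.
    return completed.stdout.strip() or None


print('Project ID:', resolve_project_id('my-example-project'))
```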

In [1]:
PROJECT_ID = '[your-project-id]'  # @param {type:"string"}

In [2]:
if PROJECT_ID == '' or PROJECT_ID is None or PROJECT_ID == '[your-project-id]':
    # Get your GCP project id from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print('Project ID:', PROJECT_ID)

Project ID: automlops-sandbox


In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


Set your `MODEL_ID` below:

In [None]:
MODEL_ID = 'penguins-dnn'

# AutoMLOps Pipeline - Using Kubeflow components
This workflow will generate a pipeline using the Kubeflow spec. AutoMLOps provides 2 functions for defining MLOps pipelines:

- `AutoMLOps.component(...)`: Defines a component, which is a containerized python function.
- `AutoMLOps.pipeline(...)`: Defines a pipeline, which is a series of components.

AutoMLOps provides 5 functions for building and maintaining MLOps pipelines:

- `AutoMLOps.generate(...)`: Generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline.
- `AutoMLOps.provision(...)`: Runs provisioning scripts to create and maintain necessary infra for MLOps.
- `AutoMLOps.deprovision(...)`: Runs deprovisioning scripts to tear down MLOps infra created using AutoMLOps.
- `AutoMLOps.deploy(...)`: Builds and pushes the component container, then triggers the pipeline job.
- `AutoMLOps.launchAll(...)`: Runs `generate()`, `provision()`, and `deploy()` all in succession.

Please see the [readme](https://github.com/GoogleCloudPlatform/automlops/blob/main/README.md) for more information.

**Note: This workflow requires python packages `kfp<2.0.0` and `google-cloud-aiplatform`.**

In [None]:
# Quote the version specifier so the shell does not treat '<' as input redirection
!pip3 install 'kfp<2.0.0' google-cloud-aiplatform

## Imports

In [4]:
from kfp.v2 import dsl
from kfp.v2.dsl import Artifact, Metrics, Model, Output
from google_cloud_automlops import AutoMLOps

## Clear the cache
`AutoMLOps.clear_cache` will remove previous instantiations of AutoMLOps components and pipelines. Use this function if you have previously defined a component that you no longer need.

In [5]:
AutoMLOps.clear_cache()

Cache cleared.


## Create Dataset
Define a Kubeflow custom component for creating an empty dataset resource. You must specify the `output_component_file` with the name of your component. For `AutoMLOps` to know where to find the Kubeflow component spec, set this variable to the following string `f"{AutoMLOps.OUTPUT_DIR}/your_component_name.yaml"`
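
For reference, the component in the next cell names the new dataset after the last dot-separated segment of `bq_table`. A small sketch of that derivation, with a hypothetical helper `build_create_schema_ddl`, shows what the resulting statement looks like for the table used in this tutorial. The `if_not_exists` flag is an illustrative parameter; it toggles BigQuery's `CREATE SCHEMA IF NOT EXISTS` form, which makes repeated runs safe when the dataset already exists.

```python
def build_create_schema_ddl(bq_table: str, if_not_exists: bool = True) -> str:
    """Build the DDL that creates a dataset named after the source table."""
    # 'project.dataset.table' -> 'table'
    dataset_name = bq_table.split('.')[-1]
    clause = 'IF NOT EXISTS ' if if_not_exists else ''
    return f'CREATE SCHEMA {clause}{dataset_name}'


print(build_create_schema_ddl('bigquery-public-data.ml_datasets.penguins'))
```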

In [6]:
@dsl.component(
    packages_to_install=[
        'google-cloud-bigquery'
    ],
    output_component_file=f'{AutoMLOps.OUTPUT_DIR}/create_dataset.yaml'
)
def create_dataset(
    bq_table: str,
    project_id: str
):
    """Custom component that creates an empty dataset resource.

    Args:
        bq_table: The source BigQuery table.
        project_id: The project ID.
    """
    from google.cloud import bigquery
    bq_client = bigquery.Client(project=project_id)
    bq_data_name = bq_table.split('.')[-1]

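    # IF NOT EXISTS keeps scheduled re-runs from failing on an existing dataset.
    dataset_query = f'''CREATE SCHEMA IF NOT EXISTS {bq_data_name}'''
    job = bq_client.query(dataset_query)
    job.result()  # Wait for the DDL statement to finish and surface any error.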

## Model Training
Define a Kubeflow custom component that creates and trains a BigQuery ML tabular classification model on the public penguins dataset and stores the model in your project using the `CREATE MODEL` statement. The model configuration is specified in the `OPTIONS` clause as follows:

- `model_type`: The type and architecture of tabular model to train, e.g., DNN classification. Learn more about supported [model types](https://cloud.google.com/bigquery/docs/vertex-xai).
- `labels`: The column that holds the labels.

Learn more about [The CREATE MODEL statement](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create).
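
To see how these options end up in the statement, the following sketch renders the query text without contacting BigQuery. The helper `render_create_model_query` is hypothetical; it mirrors the template used by the training component in the next cell, so you can print and inspect the SQL before running the pipeline.

```python
def render_create_model_query(bq_table: str, labels: str,
                              model_name: str, model_type: str) -> str:
    """Render a CREATE MODEL statement for inspection (no query is run)."""
    # The model is stored in a dataset named after the source table.
    dataset_name = bq_table.split('.')[-1]
    return (
        f'CREATE OR REPLACE MODEL `{dataset_name}.{model_name}`\n'
        f"OPTIONS(model_type='{model_type}', labels=['{labels}'], "
        f"model_registry='vertex_ai')\n"
        f'AS SELECT * FROM `{bq_table}`'
    )


query = render_create_model_query(
    bq_table='bigquery-public-data.ml_datasets.penguins',
    labels='species',
    model_name='penguins_dnn',
    model_type='DNN_CLASSIFIER')
print(query)
```

The values are interpolated directly into the SQL text, so this pattern is only appropriate for trusted, pipeline-controlled strings.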

In [7]:
@dsl.component(
    packages_to_install=[
        'google-cloud-bigquery'
    ],
    output_component_file=f'{AutoMLOps.OUTPUT_DIR}/train_model.yaml'
)
def train_model(
    bq_table: str,
    labels: str,
    model_name: str,
    model_type: str,
    project_id: str
):
    """Custom component that trains a DNN classifier on the training data.

    Args:
        bq_table: Full uri of the BQ training data.
        labels: The column that holds the labels.
        model_name: Name for the model.
        model_type: The type and architecture of tabular model to train, e.g., DNN classification.
        project_id: Project id.
    """
    from google.cloud import bigquery
    bq_client = bigquery.Client(project=project_id)
    bq_data_name = bq_table.split('.')[-1]

    model_query = f'''
    CREATE OR REPLACE MODEL `{bq_data_name}.{model_name}`
    OPTIONS(
        model_type='{model_type}',
        labels = ['{labels}'],
        model_registry='vertex_ai'
        )
    AS
    SELECT *
    FROM `{bq_table}`
    '''

    job = bq_client.query(model_query)
    print(job.errors, job.state)

    from time import sleep

    # Poll every 30 seconds until the training job finishes.
    while job.running():
        sleep(30)
        print('Running ...')
    print(job.errors, job.state)

    tblname = job.ddl_target_table
    tblname = f'{tblname.dataset_id}.{tblname.table_id}'
    print(f'{tblname} created in {job.ended - job.started}')

## Evaluate the trained BigQuery ML model

Define a Kubeflow custom component to retrieve the model evaluation for the trained BigQuery ML model.

Learn more about [The ML.EVALUATE function](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate).
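
The evaluation query in the next cell keeps a single row (`LIMIT 1`), and the component logs each column of that row as a metric. The sketch below shows the same flattening step on plain Python data, with no BigQuery or pandas dependency. The helper `first_row_metrics` and the sample values are made up for illustration; they are not real evaluation results.

```python
from typing import Dict, Mapping, Sequence


def first_row_metrics(columns: Mapping[str, Sequence[float]]) -> Dict[str, float]:
    """Take the first value of every column and cast it to a plain float."""
    return {name: float(values[0]) for name, values in columns.items()}


# Placeholder numbers only, shaped like a one-row result set.
sample = {'precision': [0.5], 'recall': [0.25], 'roc_auc': [0.75]}
for name, value in first_row_metrics(sample).items():
    print(f'{name}: {value}')
```

Casting to `float` is a design choice in this sketch: values pulled from a DataFrame are NumPy scalars, and a plain Python float is the least surprising type to hand to a metrics logger.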

In [8]:
@dsl.component(
    packages_to_install=[
        'google-cloud-bigquery', 
        'pandas',
        'pyarrow',
        'db_dtypes'
    ],
    output_component_file=f'{AutoMLOps.OUTPUT_DIR}/evaluate_model.yaml',
)
def evaluate_model(
    bq_table: str,
    metrics: Output[Metrics],
    model_name: str,
    project_id: str
):
    """Custom component that evaluates the model.

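    Args:
        bq_table: Full uri of the BQ training data.
        metrics: Output artifact that the evaluation metrics are logged to.
        model_name: Name for the model.
        project_id: Project id.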
    """
    from google.cloud import bigquery
    bq_client = bigquery.Client(project=project_id)
    bq_data_name = bq_table.split('.')[-1]

    eval_query = f'''
    SELECT *
    FROM
      ML.EVALUATE(MODEL {bq_data_name}.{model_name})
    ORDER BY roc_auc DESC
    LIMIT 1'''

    job = bq_client.query(eval_query)
    results = job.result().to_dataframe()

    for column in results:
        metrics.log_metric(column, results[column].values[0])

## Uploading & Deploying the Model
Define a Kubeflow custom component for uploading and deploying a model in Vertex AI.

In [9]:
@dsl.component(
    packages_to_install=[
        'google-cloud-aiplatform'
    ],
    output_component_file=f'{AutoMLOps.OUTPUT_DIR}/deploy_model.yaml',
)
def deploy_model(
    machine_type: str,
    model_name: str,
    project_id: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    """Custom component that deploys the model.

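    Args:
        machine_type: Machine type for the deployed model, e.g., n1-standard-4.
        model_name: Name for the model.
        project_id: Project id.
        region: Resource region.
        vertex_endpoint: Output artifact for the Vertex AI endpoint.
        vertex_model: Output artifact for the Vertex AI model.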
    """
    from google.cloud import aiplatform
    aiplatform.init(project=project_id, location=region)
    model = aiplatform.Model(model_name=model_name)

    endpoint = model.deploy(
        machine_type=machine_type,
        deployed_model_display_name=f'deployed-{model_name}')
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = model.resource_name

## Define the Pipeline
Define your pipeline. You can optionally give the pipeline a name and description. Define the structure by listing the components to be called in your pipeline; use `.after` to specify the order of execution.
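
Chaining `.after` calls amounts to declaring a dependency graph. As an illustration outside Kubeflow, the sketch below encodes the same four-step chain with the standard library's `graphlib` (Python 3.9 or later) and prints a valid execution order. It models only the ordering; it does not run any component.

```python
from graphlib import TopologicalSorter

# Each key runs after the steps in its set, mirroring the .after() calls.
dependencies = {
    'create_dataset': set(),
    'train_model': {'create_dataset'},
    'evaluate_model': {'train_model'},
    'deploy_model': {'evaluate_model'},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(' -> '.join(execution_order))
```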

In [10]:
@AutoMLOps.pipeline(name='bqml-automlops-pipeline', description='This is an example demo for BQML.')
def pipeline(bq_table: str,
             labels: str,
             machine_type: str,
             model_name: str,
             model_type: str,
             project_id: str,
             region: str
            ):

    create_dataset_task = create_dataset(
        bq_table=bq_table,
        project_id=project_id)

    train_model_task = train_model(
        bq_table=bq_table,
        labels=labels,
        model_name=model_name,
        model_type=model_type,
        project_id=project_id).after(create_dataset_task)

    evaluate_model_task = evaluate_model(
        bq_table=bq_table,
        model_name=model_name,
        project_id=project_id).after(train_model_task)

    deploy_model_task = deploy_model(
        machine_type=machine_type,
        model_name=model_name,
        project_id=project_id,
        region=region).after(evaluate_model_task)

## Define the Pipeline Arguments
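
The keys of the parameter dictionary are expected to match the parameter names of the pipeline function. One hedged way to catch a mismatch before generating code is to compare them with `inspect.signature`. The stub `pipeline_stub` and the helper `check_params` below are hypothetical stand-ins: the stub has the same parameter list as the pipeline defined above, and whether the decorated `pipeline` object itself exposes its signature depends on AutoMLOps internals, so that is not assumed here.

```python
import inspect


def pipeline_stub(bq_table: str, labels: str, machine_type: str,
                  model_name: str, model_type: str,
                  project_id: str, region: str):
    """Stand-in with the same parameters as the tutorial's pipeline."""


def check_params(func, params: dict) -> tuple:
    """Return (missing, unexpected) parameter names for func."""
    expected = set(inspect.signature(func).parameters)
    provided = set(params)
    return sorted(expected - provided), sorted(provided - expected)


example_params = {
    'bq_table': 'bigquery-public-data.ml_datasets.penguins',
    'labels': 'species',
    'machine_type': 'n1-standard-4',
    'model_name': 'penguins_dnn',
    'model_type': 'DNN_CLASSIFIER',
    'project_id': 'my-example-project',  # Placeholder project ID.
    'region': 'us-central1',
}
missing, unexpected = check_params(pipeline_stub, example_params)
print('missing:', missing, 'unexpected:', unexpected)
```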

In [11]:
pipeline_params = {
    'bq_table': 'bigquery-public-data.ml_datasets.penguins',
    'labels': 'species',
    'machine_type': 'n1-standard-4',
    'model_name': 'penguins_dnn',
    'model_type': 'DNN_CLASSIFIER',
    'project_id': PROJECT_ID,
    'region': 'us-central1'
}

## Generate and Run the pipeline
`AutoMLOps.generate(...)` generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline.
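
The `schedule_pattern` argument in the call below is a five-field cron expression (minute, hour, day of month, month, day of week). A tiny sketch that labels the fields makes it easier to check that a pattern matches the schedule you intend. The helper `describe_cron` is hypothetical and does no validation beyond counting fields; the time zone the schedule runs in is not covered here.

```python
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def describe_cron(pattern: str) -> dict:
    """Map each field of a five-field cron expression to its name."""
    parts = pattern.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f'Expected 5 fields, got {len(parts)}: {pattern!r}')
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron('59 11 * * 0'))
print(describe_cron('0 0 * * 0'))
```

The first pattern fires at 11:59 on Sundays (day of week `0`); the second fires at 00:00 on Sundays.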

In [None]:
AutoMLOps.generate(project_id=PROJECT_ID,
                   pipeline_params=pipeline_params,
                   use_ci=True,
                   naming_prefix=MODEL_ID,
                   schedule_pattern='59 11 * * 0' # cron: retrain every Sunday at 11:59
)

Writing directories under AutoMLOps/
Writing configurations to AutoMLOps/configs/defaults.yaml
Writing Kubeflow Pipelines code to AutoMLOps/pipelines, AutoMLOps/components, AutoMLOps/services
Writing README.md to AutoMLOps/README.md
Writing scripts to AutoMLOps/scripts
Writing CloudBuild config to AutoMLOps/cloudbuild.yaml
Code Generation Complete.


`AutoMLOps.provision(...)` runs provisioning scripts to create and maintain necessary infra for MLOps.

In [None]:
AutoMLOps.provision(hide_warnings=False)            # hide_warnings is optional, defaults to True

-cloudfunctions.functions.get
-serviceusage.services.use
-serviceusage.services.enable
-cloudfunctions.functions.create
-pubsub.subscriptions.list
-cloudscheduler.jobs.list
-pubsub.topics.create
-source.repos.list
-artifactregistry.repositories.create
-resourcemanager.projects.setIamPolicy
-iam.serviceAccounts.list
-iam.serviceAccounts.create
-pubsub.subscriptions.create
-cloudscheduler.jobs.create
-storage.buckets.create
-source.repos.create
-artifactregistry.repositories.list
-cloudbuild.builds.create
-cloudbuild.builds.list
-pubsub.topics.list
-storage.buckets.get

You are currently using: srastatter@google.com. Please check your account permissions.
The following are the recommended roles for provisioning:
-roles/resourcemanager.projectIamAdmin
-roles/cloudfunctions.admin
-roles/artifactregistry.admin
-roles/iam.serviceAccountAdmin
-roles/serviceusage.serviceUsageAdmin
-roles/aiplatform.serviceAgent
-roles/cloudscheduler.admin
-roles/pubsub.editor
-roles/source.admin
-roles/cloudbuil

`AutoMLOps.deploy(...)` builds and pushes the component container, then triggers the pipeline job.

In [None]:
AutoMLOps.deploy(precheck=True,                     # precheck is optional, defaults to True
                 hide_warnings=False)               # hide_warnings is optional, defaults to True

-artifactregistry.repositories.get
-cloudbuild.builds.get
-resourcemanager.projects.getIamPolicy
-storage.buckets.update
-serviceusage.services.get
-cloudfunctions.functions.get
-pubsub.topics.get
-iam.serviceAccounts.get
-source.repos.update
-pubsub.subscriptions.get

You are currently using: srastatter@google.com. Please check your account permissions.
The following are the recommended roles for deploying with precheck:
-roles/serviceusage.serviceUsageViewer
-roles/iam.roleViewer
-roles/pubsub.viewer
-roles/storage.admin
-roles/cloudbuild.builds.editor
-roles/source.writer
-roles/iam.serviceAccountUser
-roles/cloudfunctions.viewer
-roles/artifactregistry.reader

Checking for required API services in project automlops-sandbox...
Checking for Artifact Registry in project automlops-sandbox...
Checking for Storage Bucket in project automlops-sandbox...
Checking for Pipeline Runner Service Account in project automlops-sandbox...
Checking for IAM roles on Pipeline Runner Service Account in