In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# AutoMLOps - Clustering Example

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/automlops/blob/main/examples/training/01_clustering_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/automlops/blob/main/examples/training/01_clustering_example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/automlops/main/examples/training/01_clustering_example.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

# Overview

In this tutorial, you will build a [Vertex AI](https://cloud.google.com/vertex-ai) pipeline with integrated CI/CD, and use AutoMLOps to define, create, and run it.

# Objective
In this tutorial, you will learn how to create and run MLOps pipelines integrated with CI/CD. The example pipeline fits a K-Means model to the scikit-learn Iris dataset using a very basic two-step workflow:
1. `create_dataset`: A custom component that loads the scikit-learn Iris dataset and writes it to GCS.
2. `fit_kmeans`: A custom component that determines K-Means clusters within the data.

# Prerequisites

In order to use AutoMLOps, the following are required:

- Python 3.7 - 3.10
- [Google Cloud SDK 407.0.0](https://cloud.google.com/sdk/gcloud/reference)
- [beta 2022.10.21](https://cloud.google.com/sdk/gcloud/reference/beta)
- `git` installed
- `git` configured with your identity:
```
  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"
```
- [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc) are set up. This can be done with the following commands:
```
gcloud auth application-default login
gcloud config set account <account@example.com>
```
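
Before continuing, it can help to confirm these prerequisites from Python. The following is a minimal preflight sketch, not part of AutoMLOps: the helper names (`python_supported`, `find_adc_file`, `missing_tools`) are hypothetical, and the ADC check only looks for a credentials file at the `GOOGLE_APPLICATION_CREDENTIALS` path or at the default Linux/macOS location written by `gcloud auth application-default login`. It does not verify that the credentials are valid.

```python
import os
import shutil
import sys
from pathlib import Path


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version is within 3.7 - 3.10."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 7) <= major_minor <= (3, 10)


def find_adc_file(environ=None, home=None):
    """Return the path of an ADC credentials file, or None if none is found."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    # An explicitly configured credentials file takes precedence.
    explicit = environ.get('GOOGLE_APPLICATION_CREDENTIALS')
    if explicit and Path(explicit).is_file():
        return Path(explicit)
    # Default location on Linux/macOS (assumption: not running on Windows).
    default = home / '.config' / 'gcloud' / 'application_default_credentials.json'
    return default if default.is_file() else None


def missing_tools(tools=('git', 'gcloud')):
    """Return the command-line tools that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print('Python version supported:', python_supported())
print('ADC file:', find_adc_file())
print('Missing tools:', missing_tools())
```

If any check fails, revisit the matching prerequisite above before running the rest of the notebook.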

# APIs & IAM
Depending on the options you select, AutoMLOps enables some or all of the following APIs during the provision step:
- [aiplatform.googleapis.com](https://cloud.google.com/vertex-ai/docs/reference/rest)
- [artifactregistry.googleapis.com](https://cloud.google.com/artifact-registry/docs/reference/rest)
- [cloudbuild.googleapis.com](https://cloud.google.com/build/docs/api/reference/rest)
- [cloudfunctions.googleapis.com](https://cloud.google.com/functions/docs/reference/rest)
- [cloudresourcemanager.googleapis.com](https://cloud.google.com/resource-manager/reference/rest)
- [cloudscheduler.googleapis.com](https://cloud.google.com/scheduler/docs/reference/rest)
- [compute.googleapis.com](https://cloud.google.com/compute/docs/reference/rest/v1)
- [iam.googleapis.com](https://cloud.google.com/iam/docs/reference/rest)
- [iamcredentials.googleapis.com](https://cloud.google.com/iam/docs/reference/credentials/rest)
- [logging.googleapis.com](https://cloud.google.com/logging/docs/reference/v2/rest)
- [pubsub.googleapis.com](https://cloud.google.com/pubsub/docs/reference/rest)
- [run.googleapis.com](https://cloud.google.com/run/docs/reference/rest)
- [storage.googleapis.com](https://cloud.google.com/storage/docs/apis)


AutoMLOps will create the following service account and update [IAM permissions](https://cloud.google.com/iam/docs/understanding-roles) during the provision step:
1. Pipeline Runner Service Account (defaults to `vertex-pipelines@PROJECT_ID.iam.gserviceaccount.com`). Roles added:
    - roles/aiplatform.user
    - roles/artifactregistry.reader
    - roles/bigquery.user
    - roles/bigquery.dataEditor
    - roles/iam.serviceAccountUser
    - roles/storage.admin
    - roles/cloudfunctions.admin

# User Guide

For a user guide, see these [slides](../../AutoMLOps_User_Guide.pdf).

# Costs

This tutorial uses billable components of Google Cloud:
- Vertex AI
- Artifact Registry
- Cloud Storage
- Cloud Build
- Cloud Run
- Cloud Scheduler
- Cloud Pub/Sub

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

# Ground rules for using AutoMLOps
1. Do not use variables, functions, or other code defined outside the scope of a custom component. Each custom component becomes its own container and has no access to out-of-scope code.
2. Add import statements and helper functions inside the component function, and provide parameter type hints.
3. Test each component for accuracy and correctness before running it with AutoMLOps. AutoMLOps cannot fix bugs automatically, and bugs are much harder to fix once the code has been turned into a pipeline.
4. If you are using Kubeflow, define all the requirements needed to run the custom component. It is easy to leave out a package, which causes the container to fail when it runs within a pipeline.
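
The first two rules can be illustrated without AutoMLOps at all. The sketch below is an approximation: it uses `exec` with an empty namespace to stand in for a fresh container, and the function `scaled_mean` and the helper `runs_in_isolation` are hypothetical names invented for this example. A function that imports what it needs internally still runs in the clean namespace, while one that leans on notebook-level imports and globals fails with a `NameError`.

```python
# Self-contained: the import lives inside the function body.
GOOD_SRC = '''
def scaled_mean(values: list, scale: float) -> float:
    import statistics
    return statistics.mean(values) * scale
'''

# Not self-contained: relies on a notebook-level import and a global SCALE.
BAD_SRC = '''
def scaled_mean(values: list) -> float:
    return statistics.mean(values) * SCALE
'''


def runs_in_isolation(source: str, func_name: str, *args) -> bool:
    """Define the function in an empty namespace and try to call it."""
    namespace = {}  # stands in for a fresh container with no notebook state
    exec(source, namespace)
    try:
        namespace[func_name](*args)
        return True
    except NameError:
        # The function referenced a name that only existed outside its scope.
        return False


print(runs_in_isolation(GOOD_SRC, 'scaled_mean', [1, 2, 3], 2.0))
print(runs_in_isolation(BAD_SRC, 'scaled_mean', [1, 2, 3]))
```

The first call succeeds and the second does not, which mirrors what happens when an out-of-scope reference ends up inside a containerized component.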


# Setup Git
Set up your git configuration below.

In [None]:
!git config --global user.email 'you@example.com'
!git config --global user.name 'Your Name'

# Install AutoMLOps

Install AutoMLOps from [PyPI](https://pypi.org/project/google-cloud-automlops/), or install it locally by cloning the repo and running `pip install .`.

In [None]:
!pip3 install google-cloud-automlops --user

# Restart the kernel
Once you've installed the AutoMLOps package, you need to restart the notebook kernel so it can find the package.

**Note: Once this cell has finished running, continue on. You do not need to re-run any of the cells above.**

In [None]:
import os

if not os.getenv('IS_TESTING'):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

# Set your project ID
Set your project ID below. If you don't know your project ID, leave the placeholder unchanged or blank, and the next cell will try to read it from your gcloud configuration.

In [1]:
PROJECT_ID = '[your-project-id]'  # @param {type:"string"}

In [2]:
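if PROJECT_ID == '' or PROJECT_ID is None or PROJECT_ID == '[your-project-id]':
    # Get your GCP project id from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    # gcloud prints nothing when no default project is configured
    if not shell_output or not shell_output[0]:
        raise ValueError('No default gcloud project found; set PROJECT_ID manually.')
    PROJECT_ID = shell_output[0]
    print('Project ID:', PROJECT_ID)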

Project ID: automlops-sandbox


In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


Set your `MODEL_ID` below:

In [4]:
MODEL_ID = 'iris-k-means'

# 1. AutoMLOps Clustering Example
This workflow will define and generate a pipeline using AutoMLOps. AutoMLOps provides 2 functions for defining MLOps pipelines:

- `AutoMLOps.component(...)`: Defines a component, which is a containerized Python function.
- `AutoMLOps.pipeline(...)`: Defines a pipeline, which is a series of components.

AutoMLOps provides 6 functions for building and maintaining MLOps pipelines:

- `AutoMLOps.generate(...)`: Generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline.
- `AutoMLOps.provision(...)`: Runs provisioning scripts to create and maintain necessary infra for MLOps.
- `AutoMLOps.deprovision(...)`: Runs deprovisioning scripts to tear down MLOps infra created using AutoMLOps.
- `AutoMLOps.deploy(...)`: Builds and pushes the component container, then triggers the pipeline job.
- `AutoMLOps.launchAll(...)`: Runs `generate()`, `provision()`, and `deploy()` all in succession.
- `AutoMLOps.monitor(...)`: Creates model monitoring jobs on deployed endpoints.

Please see the [readme](https://github.com/GoogleCloudPlatform/automlops/blob/main/README.md) for more information.

## Import AutoMLOps

In [5]:
from google_cloud_automlops import AutoMLOps

## Data Loading
Define a custom component for loading and creating a dataset using `@AutoMLOps.component`. Import statements and helper functions must be added inside the function. Provide parameter type hints.

In [7]:
@AutoMLOps.component
def create_dataset(data_path: str):
    """Custom component that loads the sklearn Iris dataset and writes it to GCS.

    Args:
        data_path: The GCS location to write the Iris data.
    """
    import pandas as pd
    from sklearn import datasets

    # Load data
    iris = datasets.load_iris()
    data = pd.DataFrame(data=iris.data, columns=iris.feature_names)  
    target = pd.DataFrame(data=iris.target, columns=['Species'])
    df = pd.concat([data, target], axis=1)

    # Calculate petal and sepal area and save dataset
    df['sepal_area'] = df['sepal length (cm)'] * df['sepal width (cm)']
    df['petal_area'] = df['petal length (cm)'] * df['petal width (cm)']
    df.to_csv(data_path, index=False)

## Model Fitting
Define a custom component for fitting KMeans clusters using `@AutoMLOps.component`. Import statements and helper functions must be added inside the function. 

In [8]:
@AutoMLOps.component
def fit_kmeans(
    data_path: str,
    cluster_path: str
):
    """Custom component that determines KMeans clusters.

    Args:
        data_path (str): The GCS location of the Iris data.
        cluster_path (str): The GCS location of the Iris dataset augmented with clusters.
    """
    from sklearn.cluster import KMeans
    import pandas as pd

    # Load data
    df = pd.read_csv(data_path)

    # Fit KMeans with 3 clusters to the sepal and petal area
    kmeans = KMeans(n_clusters=3, n_init='auto')
    df['Cluster'] = kmeans.fit_predict(df[['sepal_area', 'petal_area']])

    df[['sepal_area', 'petal_area', 'Species', 'Cluster']].to_csv(cluster_path, index=False)
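
To make the clustering step more concrete, here is a toy pure-Python sketch of the classic K-Means loop (Lloyd's algorithm) on 2-D points. It is only an illustration of the idea behind `KMeans.fit_predict`, not scikit-learn's implementation; the function name `kmeans`, the sample points, and the fixed iteration count are assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Toy Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize centroids by picking k distinct input points at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Two well-separated groups, loosely analogous to (sepal_area, petal_area) pairs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The first three points end up sharing one label and the last three share the other. Which group receives label 0 depends on the random initialization, just as cluster numbering in the component above carries no meaning relative to `Species`.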

## Define the Pipeline
Define your pipeline using `@AutoMLOps.pipeline`. You can optionally give the pipeline a name and description. Define the structure by listing the components to be called in your pipeline; use `.after` to specify the order of execution.

In [9]:
@AutoMLOps.pipeline #(name='automlops-pipeline', description='This is an optional description')
def pipeline(data_path: str,
             cluster_path: str):

    create_dataset_task = create_dataset(
        data_path=data_path)

    fit_kmeans_task = fit_kmeans(
        data_path=data_path,
        cluster_path=cluster_path).after(create_dataset_task)
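
Conceptually, `.after` adds an edge to a dependency graph: `fit_kmeans` may only start once `create_dataset` has finished. The sketch below models that ordering with the standard library's `graphlib` (Python 3.9+, which is an assumption of this example). It is a mental model of the execution order, not how Kubeflow or Vertex AI schedule tasks internally.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it starts.
dependencies = {
    'create_dataset': set(),
    'fit_kmeans': {'create_dataset'},  # mirrors .after(create_dataset_task)
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['create_dataset', 'fit_kmeans']
```

Without the `.after` call, the two tasks would have no declared relationship, because `fit_kmeans` receives `data_path` as a plain string rather than as an output of `create_dataset`.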

## Define the Pipeline Arguments

In [10]:
import datetime
date_bucket = datetime.datetime.now()
pipeline_params = {
    'data_path': f'gs://{PROJECT_ID}-bucket/kmeans/{date_bucket}/iris.csv',
    'cluster_path': f'gs://{PROJECT_ID}-bucket/kmeans/{date_bucket}/iris_clusters.csv',
}
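
Note that interpolating `datetime.datetime.now()` directly into an f-string yields a folder name containing a space and colons (for example `2024-01-02 03:04:05.123456`). If you prefer paths that are easier to copy and quote, one option is to format the timestamp first. This is a minimal sketch under stated assumptions: the helper `build_pipeline_params`, the bucket name, and the timestamp format are illustrative choices, not part of AutoMLOps.

```python
import datetime


def build_pipeline_params(bucket: str, now: datetime.datetime) -> dict:
    """Build the two pipeline paths under a timestamped folder."""
    # Compact stamp without spaces or colons, e.g. 20240102-030405.
    stamp = now.strftime('%Y%m%d-%H%M%S')
    prefix = f'gs://{bucket}/kmeans/{stamp}'
    return {
        'data_path': f'{prefix}/iris.csv',
        'cluster_path': f'{prefix}/iris_clusters.csv',
    }


example = build_pipeline_params(
    'my-project-bucket',  # hypothetical bucket name
    datetime.datetime(2024, 1, 2, 3, 4, 5),
)
print(example['data_path'])  # gs://my-project-bucket/kmeans/20240102-030405/iris.csv
```

Whichever format you use, make sure the bucket in these paths is one that exists in your project, or that the provision step creates.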

## Generate and Run the pipeline
`AutoMLOps.generate(...)` generates the MLOps codebase. Users can specify the tooling and technologies they would like to use in their MLOps pipeline. If you are interested in integrating with GitHub and GitHub Actions, follow the setup steps in [this doc](../../docs/Using%20Github%20With%20AMO.md) and uncomment the relevant code block below.

In [11]:
# Setup using local scripts and cloudbuild:
AutoMLOps.generate(project_id=PROJECT_ID,
                   pipeline_params=pipeline_params,
                   use_ci=False,
                   naming_prefix=MODEL_ID,
                   deployment_framework='cloud-build',
)

# # Setup using Github, Github Actions, and Terraform:
# AutoMLOps.generate(project_id=PROJECT_ID,
#                    pipeline_params=pipeline_params,
#                    naming_prefix=MODEL_ID,
#                    schedule_pattern='59 11 * * 0', # retrain every Sunday at 11:59 (cron: minute hour day-of-month month day-of-week)
#                    use_ci=True,
#                    deployment_framework='github-actions',
#                    provisioning_framework='terraform',   
#                    source_repo_type='github',
#                    project_number='<project_number>',
#                    source_repo_name='<source/repo/string>',
#                    workload_identity_pool='<identity_pool_string>',
#                    workload_identity_provider='<identity_provider_string>',
#                    workload_identity_service_account='<workload_identity_sa>'
# )

Writing directories under AutoMLOps/
Writing configurations to AutoMLOps/configs/defaults.yaml
Writing Kubeflow Pipelines code to AutoMLOps/pipelines, AutoMLOps/components, AutoMLOps/services
Writing README.md to AutoMLOps/README.md
Writing scripts to AutoMLOps/scripts
Writing CloudBuild config to AutoMLOps/cloudbuild.yaml
Code Generation Complete.
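
The `schedule_pattern` in the commented-out block above is a standard five-field cron expression. As a quick reading aid, the sketch below splits a pattern into named fields; the helper `describe_cron` is hypothetical and performs no validation beyond counting fields.

```python
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def describe_cron(pattern: str) -> dict:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = pattern.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f'expected 5 cron fields, got {len(parts)}')
    return dict(zip(CRON_FIELDS, parts))


print(describe_cron('59 11 * * 0'))
```

For `'59 11 * * 0'` this gives minute 59, hour 11, and day-of-week 0 (Sunday), so the job fires at 11:59 on Sundays in the scheduler's configured time zone. A run at midnight on Sunday would instead be written `0 0 * * 0`.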


`AutoMLOps.provision(...)` runs provisioning scripts to create and maintain necessary infra for MLOps.

In [12]:
AutoMLOps.provision()            # hide_warnings is optional, defaults to True

Setting up API services in project automlops-sandbox
Operation "operations/acat.p2-45373616427-129a5ec2-c1b1-4af8-b33b-89f236eeb7ae" finished successfully.
Setting up Artifact Registry in project automlops-sandbox
Listing items under project automlops-sandbox, location us-central1.

Creating Artifact Registry: iris-k-means-artifact-registry in project automlops-sandbox
Create request issued for: [iris-k-means-artifact-registry]
Waiting for operation [projects/automlops-sandbox/locations/us-central1/operations/70548b29-cf18-4377-9263-e287eff77b10] to complete...
......done.
Created repository [iris-k-means-artifact-registry].
Setting up Storage Bucket in project automlops-sandbox
BucketNotFoundException: 404 gs://automlops-sandbox-iris-k-means-bucket bucket does not exist.
Creating GS Bucket: automlops-sandbox-iris-k-means-bucket in project automlops-sandbox
Creating gs://automlops-sandbox-iris-k-means-bucket/...
Setting up Pipeline Job Run

`AutoMLOps.deploy(...)` builds and pushes the component container, then triggers the pipeline job.

In [13]:
AutoMLOps.deploy()                     # precheck is optional, defaults to True
                                       # hide_warnings is optional, defaults to True

Initialized empty Git repository in /Users/srastatter/Documents/2023/MLOps-graduation/AutoMLOps-github/examples/training/.git/
Switched to a new branch 'automlops'
[automlops (root-commit) 007af6e] init
 1 file changed, 149 insertions(+)
 create mode 100644 .gitignore
remote: Waiting for private key checker: 1/1 objects left        
To https://source.developers.google.com/p/automlops-sandbox/r/iris-k-means-repository
 * [new branch]      automlops -> automlops
[automlops 44f89f9] Run AutoMLOps
 23 files changed, 1031 insertions(+)
 create mode 100644 AutoMLOps/README.md
 create mode 100644 AutoMLOps/cloudbuild.yaml
 create mode 100644 AutoMLOps/components/component_base/Dockerfile
 create mode 100644 AutoMLOps/components/component_base/requirements.txt
 create mode 100644 AutoMLOps/components/component_base/src/create_dataset.py
 create mode 100644 AutoMLOps/components/component_base/src/fit_kmeans.py
 create mode 100644 AutoMLOps/components/create_dataset/component.yaml
 create mode 1