In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Running Vertex Pipelines using E2E Samples repository - Triggering Pipelines.


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/teamdatatonic/vertex-pipelines-end-to-end-samples/blob/develop/examples/pipelines_colab.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/teamdatatonic/vertex-pipelines-end-to-end-samples/blob/develop/examples/pipelines.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/teamdatatonic/vertex-pipelines-end-to-end-samples/blob/develop/examples/pipelines_workbench.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This notebook shows you how to run production-ready pipelines on Google Cloud using Datatonic's Vertex Pipelines End-to-end Samples repository.

Learn more about [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).

### Objective

In this tutorial, you learn how to launch your first training and prediction pipelines and analyse the results.

This tutorial uses the following Google Cloud services and resources:

- *`Vertex Pipelines`*
- *`Google Cloud Storage`*
- *`Artifact Registry`*
- *`BigQuery`*
- *`Cloud Build`*

The steps performed include:

* Run ML training and batch prediction pipelines using the Kubeflow Pipelines SDK for an example use case.

### Dataset

The example ML training and prediction pipelines for scikit-learn/XGBoost use the popular [Chicago Taxi Trips Dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=chicago_taxi_trips&page=dataset). The dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency.


This public dataset is hosted in Google BigQuery.

### Costs 


This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage
* Cloud Build
* Artifact Registry


Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
[Cloud Storage pricing](https://cloud.google.com/storage/pricing),
[Cloud Build pricing](https://cloud.google.com/build/pricing),
and [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Prerequisites

- [Pyenv](https://github.com/pyenv/pyenv#installation) for managing Python versions
- [Google Cloud SDK (gcloud)](https://cloud.google.com/sdk/docs/quickstart)
- Make
- [Poetry](https://python-poetry.org)
- Infrastructure deployed ([01_infrastructure_setup.ipynb](01_infrastructure_setup.ipynb)).
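
Before continuing, it can help to confirm that the command-line tools listed above are available on your `PATH`. The following is a minimal sketch that uses only the Python standard library. The executable names (`pyenv`, `gcloud`, `make`, `poetry`) are assumed from the list above, and `missing_tools` is a hypothetical helper, not part of the repository.

```python
import shutil

# Executable names assumed from the prerequisites list above.
REQUIRED_TOOLS = ["pyenv", "gcloud", "make", "poetry"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All prerequisite tools were found.")
```

This only checks that each executable can be located; it does not verify versions or that the infrastructure from the previous notebook has been deployed.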

## Change working directory

In [None]:
%cd vertex-pipelines-end-to-end-samples/

## Installation

Install the packages required for executing this notebook.

In [None]:
# Install the correct Python version (skipped if it is already installed)
! pyenv install --skip-existing

# Configure Poetry to use the active Python version
! poetry config virtualenvs.prefer-active-python true

# Install Poetry dependencies for the ML pipelines
! make install

### Terraform

**If you do not have Terraform installed, please follow the instructions below.**

**Local JupyterLab instance**

Follow the official installation instructions: https://developer.hashicorp.com/terraform/downloads

## Before You Begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[my-project-id]"  # @param {type:"string"}
# Set the project id
! gcloud config set project {PROJECT_ID}
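
A common mistake at this step is leaving the `[my-project-id]` placeholder in place. The sketch below checks a value against the documented project ID format (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). The helper name `is_valid_project_id` is hypothetical, and the check deliberately ignores legacy domain-scoped project IDs.

```python
import re

# Project ID format: 6-30 characters, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
# Assumption: legacy domain-scoped IDs (containing ":") are out of scope.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id):
    """Return True if the string looks like a well-formed project ID."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None
```

For example, `is_valid_project_id(PROJECT_ID)` returns `False` while the placeholder value is still set, because square brackets are not allowed in a project ID. A passing check only confirms the format, not that the project exists or that you have access to it.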

### Authenticate your Google Cloud account


As you're running a Jupyter environment locally, you'll need to authenticate manually. Please follow the instructions provided below.

In [None]:
! gcloud auth login

### Environment Setup

To run the `make` commands, the relevant environment variables must be set. You can reuse the `.env` file defined in [01_infrastructure_setup.ipynb](01_infrastructure_setup.ipynb).

Load environment variables using [Python-dotenv](https://pypi.org/project/python-dotenv/).

In [None]:
# ! pip install python-dotenv

In [None]:
from dotenv import load_dotenv
load_dotenv()
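
`load_dotenv()` does not raise an error when a variable is missing from the `.env` file, so a quick check can catch problems early. The sketch below looks for `VERTEX_PROJECT_ID` and `VERTEX_LOCATION`, the two variables referenced by the cells in this notebook. Your `.env` file may define more, and `missing_env_vars` is a hypothetical helper name.

```python
import os

# Variable names taken from the commands used in this notebook.
REQUIRED_ENV_VARS = ["VERTEX_PROJECT_ID", "VERTEX_LOCATION"]


def missing_env_vars(required=REQUIRED_ENV_VARS, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```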

## Example ML Pipelines

To automate, monitor, and govern your ML workflows, you can use [Vertex AI](https://cloud.google.com/vertex-ai/docs/start/introduction-mlops), Google Cloud's managed platform for building, deploying, and operating Machine Learning (ML) models. Its main capabilities include:

- **Automation**: Vertex AI offers a suite of tools and services designed to automate various aspects of ML development and deployment. This includes automating data preprocessing, model training, hyperparameter tuning, and model deployment. Automation not only saves time but also reduces the potential for human error, making your ML workflows more efficient and reliable.

- **Monitoring**: Effective monitoring is crucial for maintaining the performance and reliability of ML models in production. Vertex AI provides monitoring capabilities that allow you to track model performance, detect drift in data distributions, and set up alerts for anomalies. This proactive monitoring ensures that your models continue to deliver accurate results as data and business conditions change over time.

- **Governance**: Managing ML models and data in a secure and compliant manner is essential for businesses, especially those in regulated industries. Vertex AI helps you implement governance policies to control access to data, monitor model usage, and enforce compliance with data privacy regulations. This ensures that your ML operations are in line with legal and ethical standards.

- **Scalability**: Vertex AI is built on Google Cloud's infrastructure, so it can scale from small experiments to large production deployments as your workloads grow.

- **Collaboration**: Collaboration is essential in ML development, and Vertex AI provides features that facilitate collaboration among data scientists, machine learning engineers, and other stakeholders. You can share notebooks, collaborate on model development, and maintain version control of your ML assets.

- **Model Serving**: Vertex AI makes it easy to deploy ML models as APIs for real-time inference or batch processing. This enables you to integrate your ML models into applications, websites, or other services with ease.

- **Cost Management**: Cost control is a crucial aspect of any ML project. Vertex AI offers cost management tools and insights to help you optimize your ML workflows and keep your expenses in check.

Together, these tools and services cover the entire ML lifecycle, from data preparation to model deployment and beyond, which makes Vertex AI a practical foundation for running ML in a scalable, efficient, and secure manner.

To learn more about MLOps on Vertex AI and how it can transform your ML workflows, you can visit the [official documentation](https://cloud.google.com/vertex-ai/docs/start/introduction-mlops).


**This repository provides example ML training and prediction pipelines for XGBoost using the popular [Chicago Taxi Dataset](https://console.cloud.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips).**


#### Prerequisites

Before you can successfully execute the example pipelines, there are a few additional components you need to deploy. These components have not been included in the Terraform code as they are specific to these pipelines.

1. Create a new BigQuery dataset for the Chicago Taxi data:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:chicago_taxi_trips"

2. Create a new BigQuery dataset for data processing during the pipelines:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:preprocessing"

3. Set up a BigQuery transfer job to mirror the Chicago Taxi dataset to your project:

In [None]:
! pip install google-cloud-bigquery-datatransfer

In [None]:
import os
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Copy the public Chicago Taxi dataset into the dataset created in step 1
destination_project_id = os.environ["VERTEX_PROJECT_ID"]
destination_dataset_id = "chicago_taxi_trips"
source_project_id = "bigquery-public-data"
source_dataset_id = "chicago_taxi_trips"

# Define a dataset copy that is refreshed every 24 hours
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="Chicago taxi trip mirror",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": source_project_id,
        "source_dataset_id": source_dataset_id,
    },
    schedule="every 24 hours",
)

# Create the transfer config in the destination project
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
TRANSFER_CONFIG = transfer_config.name
print(f"Created transfer config: {TRANSFER_CONFIG}")
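
The transfer is scheduled to run every 24 hours. If you want to refresh the copy without waiting for the next scheduled run, the Data Transfer client can request a manual run. The following is a minimal sketch, assuming `client` is the `DataTransferServiceClient` created above and `transfer_config_name` is the resource name stored in `TRANSFER_CONFIG`. The helper `start_manual_run` is hypothetical and not part of the repository.

```python
import datetime


def start_manual_run(client, transfer_config_name, run_time=None):
    """Request one run of an existing transfer config and return the runs.

    `client` is expected to be a
    bigquery_datatransfer.DataTransferServiceClient instance.
    """
    if run_time is None:
        # Ask for a run at the current time (timezone-aware, UTC).
        run_time = datetime.datetime.now(datetime.timezone.utc)
    response = client.start_manual_transfer_runs(
        request={
            "parent": transfer_config_name,
            "requested_run_time": run_time,
        }
    )
    return list(response.runs)
```

A possible usage is `runs = start_manual_run(transfer_client, TRANSFER_CONFIG)`, after which each returned run exposes its `name` and `state`. The copy can take some time to complete, so check the run state before starting the pipelines.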

### Building the container images

The `model` directory contains the code for the custom training and serving container images (for example, the training script `model/training/train.py`).

A custom container is a Docker image that you create to run your training application. By running your machine learning (ML) training job in a custom container, you can use ML frameworks, non-ML dependencies, libraries, and binaries that are not otherwise supported on Vertex AI. To learn more, see the [official documentation](https://cloud.google.com/vertex-ai/docs/training/containers-overview).

To build the training and serving container images and push them to Artifact Registry, run the next cell.

In [None]:
! make build

### Running Pipelines

You can run the training pipeline by executing the cell below.

This starts the pipeline on Vertex AI using the chosen template. Specifically, it will:

1. Compile the pipeline using the Kubeflow Pipelines SDK
1. Trigger the pipeline with the help of `pipelines/trigger/main.py`
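
To make these two steps more concrete, here is a rough sketch of what compiling and submitting a pipeline can look like with the Kubeflow Pipelines SDK and the Vertex AI SDK. It is not the repository's `pipelines/trigger/main.py`. The helper names are hypothetical, the KFP SDK v2 import path is an assumption, and `pipeline_root` must be a Cloud Storage path that you supply.

```python
import os


def compile_pipeline(pipeline_func, package_path):
    """Compile a KFP pipeline function into a pipeline spec file."""
    # Imported lazily so the other helpers work without the KFP SDK.
    # Assumption: KFP SDK v2, where the compiler lives in `kfp.compiler`.
    from kfp import compiler

    compiler.Compiler().compile(
        pipeline_func=pipeline_func, package_path=package_path
    )
    return package_path


def build_job_kwargs(template_path, display_name, pipeline_root,
                     parameter_values=None, env=None):
    """Collect PipelineJob arguments, reading project and region from env."""
    env = os.environ if env is None else env
    return {
        "display_name": display_name,
        "template_path": template_path,
        "pipeline_root": pipeline_root,
        "parameter_values": parameter_values or {},
        "project": env["VERTEX_PROJECT_ID"],
        "location": env["VERTEX_LOCATION"],
    }


def submit_pipeline(job_kwargs):
    """Submit a compiled pipeline to Vertex AI without waiting for it."""
    # Imported lazily; requires the google-cloud-aiplatform package.
    from google.cloud import aiplatform

    job = aiplatform.PipelineJob(**job_kwargs)
    job.submit()
    return job
```

In practice, the `make run` command used in this notebook wraps both steps, so you should not need to call helpers like these directly.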

**Run Training Pipeline**

In [None]:
! make run pipeline=training build=false

After executing the command, a link will appear, leading you to the Vertex AI platform, where you can monitor your pipeline job. Alternatively, you can access it directly through the Google Cloud Console UI by navigating to Vertex AI and then selecting Pipelines.
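
If you prefer to check on pipeline runs from the notebook instead of the console, the Vertex AI SDK can list pipeline jobs. The sketch below is one possible approach, assuming the `google-cloud-aiplatform` package is installed. The helper names `list_recent_jobs` and `summarise_jobs` are hypothetical.

```python
def list_recent_jobs(project, location):
    """Return pipeline jobs in the project, newest first."""
    # Imported lazily; requires the google-cloud-aiplatform package.
    from google.cloud import aiplatform

    return aiplatform.PipelineJob.list(
        order_by="create_time desc", project=project, location=location
    )


def summarise_jobs(jobs, limit=5):
    """Return (display_name, state) pairs for the first `limit` jobs."""
    return [
        # `state` is an enum on real jobs; fall back to str() otherwise.
        (job.display_name, getattr(job.state, "name", str(job.state)))
        for job in list(jobs)[:limit]
    ]
```

For example, `summarise_jobs(list_recent_jobs(os.environ["VERTEX_PROJECT_ID"], os.environ["VERTEX_LOCATION"]))` would return the names and states of the five most recent runs.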

**Training Pipeline**
![Training Pipeline](https://github.com/teamdatatonic/vertex-pipelines-end-to-end-samples/blob/develop/docs/images/training_pipeline.png?raw=true)

**Run Prediction Pipeline**

After a successful training run, you can run the prediction pipeline.

In [None]:
! make run pipeline=prediction build=false

**Prediction Pipeline**

![Predictions Pipeline](https://github.com/teamdatatonic/vertex-pipelines-end-to-end-samples/blob/develop/docs/images/prediction_pipeline.png?raw=true)

### Congratulations on successfully running your first ML pipelines!