In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Running Vertex Pipelines using the End-to-end Samples repository

{TODO: Update the links below.} 

{TODO: 🔴 Potentially remove colab and workbench as terraform installation could be difficult?}

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This notebook shows you how to run production-ready pipelines on Google Cloud using Datatonic's Vertex Pipelines End-to-end Samples repository.

Learn more about [Vertex Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/introduction).

### Objective

In this tutorial, you learn how to set up the repo, launch your first training and prediction pipelines, and analyse the results.

This tutorial uses the following Google Cloud services and resources:

- *`Vertex Pipelines`*
- *`Google Cloud Storage`*
- *`Artifact Registry`*
- *`BigQuery`*
- *`Cloud Build`*

The steps performed include:

* Deploy infrastructure using Terraform for a typical setup of Vertex AI and other relevant services.
* Run ML training and batch prediction pipelines using the Kubeflow Pipelines SDK for an example use case.

### Dataset

The example ML training and prediction pipelines for scikit-learn/XGBoost use the popular [Chicago Taxi Trips Dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=chicago_taxi_trips&page=dataset). The dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency.


This public dataset is hosted in Google BigQuery.

### Costs 


This tutorial uses billable components of Google Cloud:

* Vertex AI
* BigQuery
* Cloud Storage
* Cloud Build
* Artifact Registry


Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
[BigQuery pricing](https://cloud.google.com/bigquery/pricing),
[Cloud Storage pricing](https://cloud.google.com/storage/pricing),
[Cloud Build pricing](https://cloud.google.com/build/pricing),
and [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Prerequisites

- [Pyenv](https://github.com/pyenv/pyenv#installation) for managing Python versions
- [Google Cloud SDK (gcloud)](https://cloud.google.com/sdk/docs/quickstart)
- Make
- [Poetry](https://python-poetry.org)
- [Terraform](https://www.terraform.io) (To install Terraform on your local machine we recommend using [tfswitch](https://tfswitch.warrensbox.com/) to automatically choose and download an appropriate version.)
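
Before continuing, it can help to confirm that the tools listed above are actually on your `PATH`. The following is a minimal sketch using only the standard library; the executable names in `REQUIRED_TOOLS` are assumptions about how each tool is installed, and `find_missing_tools` is a hypothetical helper, not part of the repository.

```python
import shutil

# Assumed executable names for the prerequisites listed above.
REQUIRED_TOOLS = ["pyenv", "gcloud", "make", "poetry", "terraform"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found on PATH.")
```

This check only looks for the executables. It does not verify their versions.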

## Clone Turbo Templates repository

In [None]:
# Clone a Git repository
!git clone -b develop https://github.com/teamdatatonic/vertex-pipelines-end-to-end-samples

In [None]:
%cd vertex-pipelines-end-to-end-samples/

## Installation

Install the packages required for executing this notebook.

In [None]:
# Install the correct Python version
! pyenv install --skip-existing

# Configure Poetry to use the active Python version
! poetry config virtualenvs.prefer-active-python true

# Install Poetry dependencies for ML pipelines
! make install

### Colab only: Uncomment the following cell to restart the kernel. 🔴{TODO: Potentially remove colab as terraform installation could be difficult?}

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before You Begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[my-project-id]"  # @param {type:"string"}
# Set the project id
! gcloud config set project {PROJECT_ID}
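
A common mistake is to leave the placeholder value in `PROJECT_ID`. As a rough sanity check, the sketch below tests a string against the usual Google Cloud project ID format (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen). `is_valid_project_id` is a hypothetical helper, and legacy or domain-scoped project IDs may not match this pattern.

```python
import re

# 1 leading letter + 4..28 middle characters + 1 trailing letter/digit = 6..30 characters.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if `project_id` matches the common project ID format."""
    return bool(_PROJECT_ID_PATTERN.fullmatch(project_id))


for candidate in ("[my-project-id]", "my-project-id"):
    print(candidate, is_valid_project_id(candidate))
```

A passing check only means the string is well formed. It does not confirm that the project exists or that you have access to it.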

### Authenticate your Google Cloud account {TODO: Potentially remove colab and workbench as terraform installation could be difficult?🔴🔴🔴}


Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

### Environment Setup {Work in progress}

To run the `make` commands, the relevant environment variables need to be set. Create `env.sh` and update the environment variables for your dev environment (particularly `VERTEX_PROJECT_ID`, `VERTEX_LOCATION` and `RESOURCE_SUFFIX`).

#### Using `os` library

In [None]:
import os

os.environ['VERTEX_PROJECT_ID'] = 'my-project-id'
os.environ['VERTEX_LOCATION'] = 'europe-west2'

os.environ['VERTEX_CMEK_IDENTIFIER'] = ''  # optional
os.environ['VERTEX_NETWORK'] = ''  # optional

# Suffix (e.g. '<your name>') to facilitate running concurrent pipelines in the same Google Cloud project.
# Change if working in a team to avoid overwriting resources during development 
os.environ['RESOURCE_SUFFIX'] = 'default'

# Leave as-is
os.environ['VERTEX_SA_EMAIL'] = f'vertex-pipelines@{os.environ["VERTEX_PROJECT_ID"]}.iam.gserviceaccount.com'
os.environ['VERTEX_PIPELINE_ROOT'] = f'gs://{os.environ["VERTEX_PROJECT_ID"]}-pl-root'
os.environ['CONTAINER_IMAGE_REGISTRY'] = f'{os.environ["VERTEX_LOCATION"]}-docker.pkg.dev/{os.environ["VERTEX_PROJECT_ID"]}/vertex-images'

Using the `os` library could be suitable. See below:

In [None]:
# 1. Env variables are available in shell commands.
!echo $VERTEX_PROJECT_ID

In [None]:
# 2. Env variables available for make commands.
# ! make install
! make compile pipeline=training

In [None]:
# 3. Env variables can be used in cells using os library.
PROJECT_ID = os.environ["VERTEX_PROJECT_ID"]
TF_BUCKET_URI = f"{PROJECT_ID}-tfstate"
print(TF_BUCKET_URI)
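
Whichever approach you choose, it can be useful to fail early when a variable is unset or empty. The sketch below checks the three variables highlighted above; `missing_env_vars` is a hypothetical helper, and the list of required names is an assumption you may want to extend.

```python
import os

# Variables called out as particularly important in the text above.
REQUIRED_VARS = ["VERTEX_PROJECT_ID", "VERTEX_LOCATION", "RESOURCE_SUFFIX"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in `environ` (defaults to os.environ)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Unset or empty variables:", ", ".join(missing))
else:
    print("All required variables are set.")
```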

#### Using dotenv library

In [None]:
%%writefile .env

VERTEX_CMEK_IDENTIFIER= # optional
VERTEX_LOCATION=europe-west2
VERTEX_NETWORK= # optional
VERTEX_PROJECT_ID=my-project-id-dot

# Suffix (e.g. '<your name>') to facilitate running concurrent pipelines in the same Google Cloud project. Change if working in a team to avoid overwriting resources during development 
RESOURCE_SUFFIX=default

# Leave as-is
VERTEX_SA_EMAIL=vertex-pipelines@${VERTEX_PROJECT_ID}.iam.gserviceaccount.com
VERTEX_PIPELINE_ROOT=gs://${VERTEX_PROJECT_ID}-pl-root
CONTAINER_IMAGE_REGISTRY=${VERTEX_LOCATION}-docker.pkg.dev/${VERTEX_PROJECT_ID}/vertex-images

In [None]:
! pip install python-dotenv

In [None]:
from dotenv import load_dotenv

load_dotenv()
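
The last three entries in `.env` are built from `${VAR}` references to earlier entries. To illustrate what those references should resolve to, here is a minimal sketch that performs the same kind of substitution with the standard library's `string.Template`. It is not python-dotenv's own parser, and `expand` is a hypothetical helper.

```python
from string import Template

# Base values copied from the .env file above.
base = {
    "VERTEX_PROJECT_ID": "my-project-id-dot",
    "VERTEX_LOCATION": "europe-west2",
}

# Derived entries, written exactly as they appear in the .env file.
templates = {
    "VERTEX_SA_EMAIL": "vertex-pipelines@${VERTEX_PROJECT_ID}.iam.gserviceaccount.com",
    "VERTEX_PIPELINE_ROOT": "gs://${VERTEX_PROJECT_ID}-pl-root",
    "CONTAINER_IMAGE_REGISTRY": "${VERTEX_LOCATION}-docker.pkg.dev/${VERTEX_PROJECT_ID}/vertex-images",
}


def expand(templates, values):
    """Substitute ${NAME} references in each template using `values`."""
    return {key: Template(text).substitute(values) for key, text in templates.items()}


resolved = expand(templates, base)
for key, value in resolved.items():
    print(f"{key}={value}")
```

Comparing this output with the values printed by `echo` in the next cells is a quick way to confirm the file was loaded as expected.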

In [None]:
# 1. Env variables are available in shell commands.
!echo $VERTEX_PROJECT_ID
!echo $CONTAINER_IMAGE_REGISTRY

In [None]:
# 2. Env variables available for make commands.
# ! make install
! make compile pipeline=training

In [None]:
# 3. Env variables can be used in cells using os library.
import os
PROJECT_ID = os.environ["VERTEX_PROJECT_ID"]
TF_BUCKET_URI = f"{PROJECT_ID}-tfstate"
print(TF_BUCKET_URI)

#### Writing `env.sh`

This approach has some drawbacks in a notebook: variables set by `source env.sh` do not persist across cells.

In [None]:
%%writefile env.sh

#!/bin/bash
export VERTEX_CMEK_IDENTIFIER= # optional
export VERTEX_LOCATION=europe-west2
export VERTEX_NETWORK= # optional
export VERTEX_PROJECT_ID=my-project-id-envsh

# Suffix (e.g. '<your name>') to facilitate running concurrent pipelines in the same Google Cloud project. Change if working in a team to avoid overwriting resources during development 
export RESOURCE_SUFFIX=default

# Leave as-is
export VERTEX_SA_EMAIL=vertex-pipelines@${VERTEX_PROJECT_ID}.iam.gserviceaccount.com
export VERTEX_PIPELINE_ROOT=gs://${VERTEX_PROJECT_ID}-pl-root
export CONTAINER_IMAGE_REGISTRY=${VERTEX_LOCATION}-docker.pkg.dev/${VERTEX_PROJECT_ID}/vertex-images

This approach has some issues (see below): 🔴🔴🔴 

In [None]:
# 1. Env variables are not available in later shell commands, because each %%bash cell runs in its own subprocess. 🔴

In [None]:
%%bash
source env.sh

In [None]:
! echo $VERTEX_PROJECT_ID

In [None]:
# To use env variables in a command, you need to add %%bash and source env.sh to the cell every time.
# For example, some of the steps in infrastructure deployment would look like: 🔴

In [None]:
%%bash
source env.sh
echo $VERTEX_PROJECT_ID

In [None]:
%%bash
source env.sh
gsutil mb -l $VERTEX_LOCATION -p $VERTEX_PROJECT_ID gs://$VERTEX_PROJECT_ID-tfstate

In [None]:
%%bash
source env.sh
terraform -chdir=terraform/envs/dev init -backend-config="bucket=${VERTEX_PROJECT_ID}-tfstate" 

In [None]:
# Make commands work because the Makefile contains:
# -include env.sh
# export

In [None]:
! make compile pipeline=training

In [None]:
# Env variables are not available to Python code in the notebook kernel 🔴
import os
PROJECT_ID = os.environ["VERTEX_PROJECT_ID"]
TF_BUCKET_URI = f"{PROJECT_ID}-tfstate"
print(TF_BUCKET_URI)
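
One possible workaround, sketched below, is to source `env.sh` in a child `bash` process, print the resulting environment, and copy it into `os.environ` so that later cells and `!` commands can see it. This assumes `bash` is available; `read_exports` is a hypothetical helper, and the line-based parsing does not handle values that span multiple lines.

```python
import os
import subprocess


def read_exports(script_path, bash="bash"):
    """Source `script_path` in a child bash process and return its environment as a dict."""
    # `set -a` exports every variable the script assigns, even without `export`.
    command = 'set -a; source "$1" >/dev/null; env'
    output = subprocess.run(
        [bash, "-c", command, "bash", os.path.abspath(script_path)],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    env = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore lines without "=", such as parts of multi-line values
            env[name] = value
    return env


# Only runs when env.sh exists in the current working directory.
if os.path.exists("env.sh"):
    os.environ.update(read_exports("env.sh"))
    print(os.environ.get("VERTEX_PROJECT_ID"))
```

Note that this copies the whole child environment, including shell bookkeeping variables, into the kernel. Filter the dictionary first if you only want the `VERTEX_*` values.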

## Infrastructure deployment using Terraform

Check that you have Terraform installed. If not, we recommend using [`tfswitch`](https://tfswitch.warrensbox.com/) to automatically choose and download an appropriate version for you.

#### Create tfstate bucket

Before provisioning the infrastructure, you need to create a Google Cloud Storage (GCS) bucket that will be used to store the state files for Terraform deployments.

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $VERTEX_LOCATION -p $VERTEX_PROJECT_ID gs://$VERTEX_PROJECT_ID-tfstate
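
If you are unsure whether the bucket already exists, you could check first and create it only when the check fails. The sketch below wraps the same `gsutil mb` command in Python and uses `gsutil ls -b` as the existence check; `bucket_commands` and `ensure_bucket` are hypothetical helpers. A failed check can also mean missing permissions, in which case the create step will report the error.

```python
import subprocess


def tfstate_bucket_uri(project_id):
    """Bucket URI used for Terraform state in this notebook."""
    return f"gs://{project_id}-tfstate"


def bucket_commands(project_id, location):
    """Return the (check, create) gsutil argument lists for the tfstate bucket."""
    uri = tfstate_bucket_uri(project_id)
    check = ["gsutil", "ls", "-b", uri]
    create = ["gsutil", "mb", "-l", location, "-p", project_id, uri]
    return check, create


def ensure_bucket(project_id, location, run=subprocess.run):
    """Create the bucket if the check command fails. Return True if it was created."""
    check, create = bucket_commands(project_id, location)
    if run(check, capture_output=True).returncode == 0:
        return False  # bucket is already visible to this account
    run(create, check=True)
    return True
```

For example, `ensure_bucket(os.environ["VERTEX_PROJECT_ID"], os.environ["VERTEX_LOCATION"])` would run the check and, if needed, the create command with the variables set earlier.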

### Deploy required infrastructure

**Initialize Backend**

In [None]:
! terraform -chdir=terraform/envs/dev init -backend-config="bucket=${VERTEX_PROJECT_ID}-tfstate" 

**Terraform Plan**

In [None]:
! terraform -chdir=terraform/envs/dev plan -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}" # -lock=false

**Terraform Apply** Check the `terraform plan` output to see what infrastructure will be provisioned. If everything looks fine, run the cell below.

In [None]:
! terraform -chdir=terraform/envs/dev apply -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}" -auto-approve # -lock=false
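
The `plan` and `apply` commands above, and the `destroy` command used during cleanup, share the same `-chdir` and `-var` arguments. As a small illustration, the sketch below builds those argument lists in one place so the shared parts cannot drift apart. `terraform_command` is a hypothetical helper, not something provided by the repository.

```python
import shlex


def terraform_command(action, project_id, region,
                      env_dir="terraform/envs/dev", extra_args=()):
    """Build the argument list for a Terraform action in the given environment directory."""
    return [
        "terraform",
        f"-chdir={env_dir}",  # global option, must come before the subcommand
        action,
        "-var", f"project_id={project_id}",
        "-var", f"region={region}",
        *extra_args,
    ]


plan = terraform_command("plan", "my-project-id", "europe-west2")
apply = terraform_command("apply", "my-project-id", "europe-west2",
                          extra_args=["-auto-approve"])
print(shlex.join(plan))
print(shlex.join(apply))
```

The resulting lists can be passed to `subprocess.run(..., check=True)` if you prefer running Terraform from Python instead of `!` commands.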

In [None]:
# I've been getting state lock issues 🔴🔴🔴  -lock=false fixes it

## 2. Example ML Pipelines

This repository contains example ML training and prediction pipelines for two popular frameworks (XGBoost/sklearn and TensorFlow) using the [Chicago Taxi Dataset](https://console.cloud.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips). The details can be found in the [separate README](pipelines/README.md).

#### Pre-requisites

Before you can run these example pipelines successfully, there are a few additional things you need to deploy. They are not included in the Terraform code because they are specific to these pipelines.

1. Create a new BigQuery dataset for the Chicago Taxi data:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:chicago_taxi_trips"

2. Create a new BigQuery dataset for data processing during the pipelines:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:preprocessing"

3. Set up a BigQuery transfer job to mirror the Chicago Taxi dataset to your project:

In [None]:
! pip install google-cloud-bigquery-datatransfer

In [None]:
import os

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = os.environ["VERTEX_PROJECT_ID"]
destination_dataset_id = "chicago_taxi_trips"
source_project_id = "bigquery-public-data"
source_dataset_id = "chicago_taxi_trips"
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="Chicago taxi trip mirror",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": source_project_id,
        "source_dataset_id": source_dataset_id,
    },
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
TRANSFER_CONFIG = transfer_config.name
print(f"Created transfer config: {TRANSFER_CONFIG}")
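
The cleanup step at the end of this notebook reads `TRANSFER_CONFIG` from the kernel's memory, so the value is lost if the kernel restarts in between. One option, sketched below, is to save the name to a small JSON file and read it back later. The file name `transfer_config.json` and both helper functions are hypothetical.

```python
import json
from pathlib import Path

STATE_FILE = Path("transfer_config.json")  # hypothetical file name


def save_transfer_config_name(name, path=STATE_FILE):
    """Write the transfer config resource name to a JSON file."""
    path.write_text(json.dumps({"transfer_config": name}))


def load_transfer_config_name(path=STATE_FILE):
    """Return the saved resource name, or None if nothing has been saved."""
    if not path.exists():
        return None
    return json.loads(path.read_text())["transfer_config"]
```

Calling `save_transfer_config_name(TRANSFER_CONFIG)` right after the config is created, and `load_transfer_config_name()` before cleanup, keeps the two steps independent of kernel state.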

### Building the container images

The [model/](/model/) directory contains the code for the custom training and serving container images, including the training script [model/training/train.py](model/training/train.py).

Build the training and serving container images and push them to Artifact Registry.

In [None]:
! make build

### Running Pipelines

You can run a pipeline (for example, the training pipeline) by executing the cell below.

This starts the pipeline on Vertex AI using the chosen template. Specifically, it will:

1. Compile the pipeline using the Kubeflow Pipelines SDK
1. Trigger the pipeline with the help of `pipelines/trigger/main.py`
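
To give a rough idea of what the trigger step involves, the sketch below submits an already compiled pipeline definition with the Vertex AI SDK's `PipelineJob`. It is not the code in `pipelines/trigger/main.py`: the helper names, the display name, and the template file name in the usage note are assumptions, and the SDK is imported lazily so the first helper can be tried without it.

```python
import os


def pipeline_job_kwargs(template_path, environ=None):
    """Collect PipelineJob arguments from the environment variables set earlier."""
    environ = os.environ if environ is None else environ
    suffix = environ.get("RESOURCE_SUFFIX", "default")
    return {
        "display_name": f"training-{suffix}",  # hypothetical naming scheme
        "template_path": template_path,
        "pipeline_root": environ["VERTEX_PIPELINE_ROOT"],
    }


def submit_pipeline(template_path, environ=None):
    """Submit a compiled pipeline to Vertex AI and return the job object."""
    # Imported here so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    environ = os.environ if environ is None else environ
    aiplatform.init(
        project=environ["VERTEX_PROJECT_ID"],
        location=environ["VERTEX_LOCATION"],
    )
    job = aiplatform.PipelineJob(**pipeline_job_kwargs(template_path, environ))
    job.submit(service_account=environ["VERTEX_SA_EMAIL"])
    return job
```

With a compiled definition on disk (for example a hypothetical `training.json`), `submit_pipeline("training.json")` would start a run without waiting for it to finish.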

**Run Training Pipeline**

In [None]:
! make run pipeline=training build=false

**Run Prediction Pipeline** 

After a successful training run, you can try the prediction pipeline.

In [None]:
! make run pipeline=prediction build=false

## 3. Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:


**Terraform Plan (destroy)**

In [None]:
! terraform -chdir=terraform/envs/dev plan -destroy -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}" # -lock=false

**Terraform Destroy**

🔴🔴🔴  Can't delete the staging and root buckets because they contain objects and `force_destroy` is not set to `true` in our Terraform configuration.

In [None]:
# ! terraform -chdir=terraform/envs/dev destroy -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}" -auto-approve # -lock=false  

1. Delete data transfer config.

In [None]:
import google.api_core.exceptions
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config_name = TRANSFER_CONFIG
try:
    transfer_client.delete_transfer_config(name=transfer_config_name)
except google.api_core.exceptions.NotFound:
    print("Transfer config not found.")
else:
    print(f"Deleted transfer config: {transfer_config_name}")

2. Delete BigQuery dataset for the Chicago Taxi data:

In [None]:
! bq --location=${VERTEX_LOCATION} rm -f -r --dataset "${VERTEX_PROJECT_ID}:chicago_taxi_trips" 

3. Delete BigQuery dataset for data processing during the pipelines:

In [None]:
! bq --location=${VERTEX_LOCATION} rm -f -r --dataset "${VERTEX_PROJECT_ID}:preprocessing" 