In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [TODO] Add your H1 title heading here

{TODO: Update the links below.} 

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

{TODO: Include a paragraph or two explaining what this example demonstrates, who should be interested in it, and what you need to know before you get started.}

Learn more about [web-doc-title](linkback-to-webdoc-page). {TODO: if more than one primary feature, add tag/linkback for each one}

### Objective

In this tutorial, you learn how to set up the repo, launch your first training and prediction pipelines, and analyse the results.

This tutorial uses the following Google Cloud ML services and resources:

- *{TODO: Add high level bullets for the services/resources demonstrated; e.g., Vertex AI Training}*


The steps performed include:

- *{TODO: Add high level bullets for the steps performed in the notebook}*

### Dataset

{TODO: Include a paragraph with Dataset information and where to obtain it.} 

{TODO: Make sure the dataset is accessible to the public. **Googlers**: Add your dataset to the [public samples bucket](http://goto/cloudsamples#sample-storage-bucket) within gs://cloud-samples-data/vertex-ai, if it doesn't already exist there.}

### Costs 

{TODO: Update the list of billable products that your tutorial uses.}

This tutorial uses billable components of Google Cloud:

* Vertex AI
* {TODO: BigQuery}
* Cloud Storage

{TODO: Include links to pricing documentation for each product you listed above.
 NOTE: If you use BigQuery or Dataflow, you need to add this to the pricing.
}

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
{ TODO: [BigQuery pricing](https://cloud.google.com/bigquery/pricing), }
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), 
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Prerequisites

- [Pyenv](https://github.com/pyenv/pyenv#installation) for managing Python versions
- [Google Cloud SDK (gcloud)](https://cloud.google.com/sdk/docs/quickstart)
- Make
- [Poetry](https://python-poetry.org)

## Clone Turbo Templates repository

In [None]:
# Clone a Git repository
!git clone -b develop https://github.com/teamdatatonic/vertex-pipelines-end-to-end-samples

In [None]:
%cd vertex-pipelines-end-to-end-samples/

## Installation

Install the following packages required to execute this notebook. 

{TODO: Suggest using the latest major GA version of each package; i.e., --upgrade}

In [None]:
# Install the correct Python version
! pyenv install --skip-existing

# Configure Poetry
! poetry config virtualenvs.prefer-active-python true

# Install Poetry dependencies for ML pipelines
! make install

# Install pre-commit hooks
! cd pipelines && poetry run pre-commit install


### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## 1. Setup

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
vertex_project_id = "[my-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {vertex_project_id}

# Set the environment variable
%env VERTEX_PROJECT_ID={vertex_project_id}
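
Before moving on, it can help to confirm that the placeholder value was actually replaced. The following is a minimal sketch, not part of the repository: the helper name `is_valid_project_id` is hypothetical, and the pattern encodes the documented project ID format (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen).

```python
import re

# Project ID shape: 6-30 characters, lowercase letters, digits and hyphens,
# starting with a letter and not ending with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if the string has the shape of a Google Cloud project ID.

    Hypothetical helper: it only checks the format, not whether the
    project exists or whether you have access to it.
    """
    return bool(_PROJECT_ID_RE.fullmatch(project_id))


# The unedited placeholder contains brackets, so it is rejected.
print(is_valid_project_id("[my-project-id]"))
print(is_valid_project_id("my-project-123"))
```

In the notebook you would call `is_valid_project_id(vertex_project_id)` right after setting the variable.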

#### Location

You can also change the `VERTEX_LOCATION` variable used by Vertex AI. Learn more about [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
vertex_location = "europe-west2"  # @param {type: "string"}

# Set the environment variable
%env VERTEX_LOCATION={vertex_location}

#### Resource Suffix

Specify the `RESOURCE_SUFFIX` (e.g. `<your name>`) to facilitate running concurrent pipelines in the same Google Cloud project.
If you are working in a team, change this value to avoid overwriting each other's resources during development.

In [None]:
resource_suffix="default" # @param {type: "string"}

# Set the environment variable
%env RESOURCE_SUFFIX={resource_suffix}

#### Vertex Pipelines Service Account

Set `VERTEX_SA_EMAIL` to the service account in your Google Cloud project that will run the pipelines.

In [None]:
vertex_sa_email=f"vertex-pipelines@{vertex_project_id}.iam.gserviceaccount.com"

# Set the environment variable
%env VERTEX_SA_EMAIL={vertex_sa_email}

#### Vertex Pipelines Root Bucket

Set `VERTEX_PIPELINE_ROOT` to the Cloud Storage bucket in your Google Cloud project that is used to stage pipeline artifacts.

In [None]:
vertex_pipeline_root=f"gs://{vertex_project_id}-pl-root"

# Set the environment variable
%env VERTEX_PIPELINE_ROOT={vertex_pipeline_root}

#### Container Image Registry

Set `CONTAINER_IMAGE_REGISTRY` to the container image repository that holds the training and serving container images.

In [None]:
container_image_registry=f"{vertex_location}-docker.pkg.dev/{vertex_project_id}/vertex-images"

# Set the environment variable
%env CONTAINER_IMAGE_REGISTRY={container_image_registry}
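
The values set in the cells above all follow from the project ID, the location and the resource suffix. As a rough sketch, the same naming conventions can be collected in one function so they are easy to review in one place. The function name `derive_settings` and the example project ID are hypothetical; the string patterns are copied from the cells above.

```python
def derive_settings(project_id: str, location: str, suffix: str = "default") -> dict:
    """Build the environment variable values used in the setup cells.

    Hypothetical helper: the naming patterns mirror the cells above and
    assume the resources were created with those default names.
    """
    return {
        "VERTEX_PROJECT_ID": project_id,
        "VERTEX_LOCATION": location,
        "RESOURCE_SUFFIX": suffix,
        "VERTEX_SA_EMAIL": f"vertex-pipelines@{project_id}.iam.gserviceaccount.com",
        "VERTEX_PIPELINE_ROOT": f"gs://{project_id}-pl-root",
        "CONTAINER_IMAGE_REGISTRY": f"{location}-docker.pkg.dev/{project_id}/vertex-images",
    }


# Example project ID is a placeholder for illustration only.
settings = derive_settings("my-project-123", "europe-west2")
for name, value in settings.items():
    print(f"{name}={value}")
```

If you prefer this over the individual `%env` cells, `os.environ.update(settings)` would export the same values to the notebook's environment.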

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Infrastructure deployment using Terraform

Install Terraform on your local machine. We recommend using [`tfswitch`](https://tfswitch.warrensbox.com/) to automatically choose and download an appropriate version for you (run `tfswitch` from the [`terraform/envs/dev`](terraform/envs/dev/) directory).

Now you can deploy the infrastructure using Terraform:

#### Create tfstate bucket

Before provisioning the infrastructure, create a Google Cloud Storage (GCS) bucket to store the state files for Terraform deployments.

- *{Note to notebook author: For any user-provided strings that need to be unique (like bucket names or model ID's), append "-unique" to the end so proper testing can occur}*

In [None]:
tf_state_bucket_uri = f"gs://{vertex_project_id}-tfstate"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {vertex_location} -p {vertex_project_id} {tf_state_bucket_uri}
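
If you rerun the notebook, the create command fails once the bucket exists. A possible way to make the step repeatable is to probe for the bucket first. This is a sketch under assumptions: `ensure_bucket` is a hypothetical helper, and it treats any non-zero exit from `gsutil ls -b` as "bucket not found", although a permissions error would look the same.

```python
import subprocess


def ensure_bucket(bucket_uri, location, project_id, run=subprocess.run):
    """Create the bucket unless `gsutil ls -b` can already list it.

    Returns True if a create command was issued, False if the bucket
    was found. `run` is injectable so the logic can be exercised
    without calling gsutil.
    """
    # Assumption: a non-zero exit code means the bucket does not exist.
    probe = run(["gsutil", "ls", "-b", bucket_uri], capture_output=True)
    if probe.returncode == 0:
        return False
    # Same command as the notebook cell above.
    run(["gsutil", "mb", "-l", location, "-p", project_id, bucket_uri], check=True)
    return True


# Usage in this notebook (needs gsutil and an authenticated account):
# ensure_bucket(tf_state_bucket_uri, vertex_location, vertex_project_id)
```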

### Deploy required infrastructure

**Initialize Backend**

In [None]:
! terraform -chdir=terraform/envs/dev init -backend-config="bucket=${VERTEX_PROJECT_ID}-tfstate" 
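
The init command above hard-codes the bucket name as `${VERTEX_PROJECT_ID}-tfstate`, while `tf_state_bucket_uri` is an editable parameter. If you change the URI, the two can drift apart. The sketch below derives the backend bucket name from the URI instead. The helper name `backend_init_command` is hypothetical; the Terraform arguments are the ones used in the cell above.

```python
import shlex


def backend_init_command(tf_state_bucket_uri: str, chdir: str = "terraform/envs/dev") -> str:
    """Return a `terraform init` command whose backend bucket matches the URI.

    Hypothetical helper: it only builds the command string and does not
    run Terraform.
    """
    prefix = "gs://"
    if not tf_state_bucket_uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {tf_state_bucket_uri!r}")
    # The backend config expects the bare bucket name, without the scheme.
    bucket_name = tf_state_bucket_uri[len(prefix):].strip("/")
    args = [
        "terraform",
        f"-chdir={chdir}",
        "init",
        f"-backend-config=bucket={bucket_name}",
    ]
    return shlex.join(args)


# Example URI is a placeholder for illustration only.
print(backend_init_command("gs://my-project-123-tfstate"))
```

In the notebook, the returned string could be run with `! {backend_init_command(tf_state_bucket_uri)}`.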

**Terraform Plan**

In [None]:
! terraform -chdir=terraform/envs/dev plan -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}"

**Terraform Apply**: Check the `terraform plan` output above to see what infrastructure will be provisioned. If everything looks OK, uncomment and run the cell below.

In [None]:
# ! terraform -chdir=terraform/envs/dev apply -var "project_id=${VERTEX_PROJECT_ID}" -var "region=${VERTEX_LOCATION}" -auto-approve -lock=false

## 2. Example ML Pipelines

This repository contains example ML training and prediction pipelines for two popular frameworks (XGBoost/sklearn and TensorFlow), using the [Chicago Taxi Dataset](https://console.cloud.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips). The details of these can be found in the [separate README](pipelines/README.md).

#### Prerequisites

Before you can run these example pipelines successfully, there are a few additional things you need to deploy. They are not included in the Terraform code because they are specific to these pipelines.

1. Create a new BigQuery dataset for the Chicago Taxi data:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:chicago_taxi_trips"

2. Create a new BigQuery dataset for data processing during the pipelines:

In [None]:
! bq --location=${VERTEX_LOCATION} mk --dataset "${VERTEX_PROJECT_ID}:preprocessing"

3. Set up a BigQuery transfer job to mirror the Chicago Taxi dataset to your project:

In [None]:
! pip install google-cloud-bigquery-datatransfer

In [None]:
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = vertex_project_id
destination_dataset_id = "chicago_taxi_trips"
source_project_id = "bigquery-public-data"
source_dataset_id = "chicago_taxi_trips"
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="Chicago taxi trip mirror",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": source_project_id,
        "source_dataset_id": source_dataset_id,
    },
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
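
Running the cell above more than once creates a new transfer config each time. One way to guard against that, sketched here under assumptions, is to look for an existing config with the same display name first. The helper `find_config_by_display_name` is hypothetical; `list_transfer_configs` is the client method assumed for listing, and the stand-in objects below only imitate the `display_name` and `name` attributes of real configs.

```python
from types import SimpleNamespace


def find_config_by_display_name(configs, display_name):
    """Return the first config whose display_name matches, or None."""
    for config in configs:
        if config.display_name == display_name:
            return config
    return None


# Stand-in objects for illustration; the resource name is made up.
sample_configs = [
    SimpleNamespace(
        display_name="Chicago taxi trip mirror",
        name="projects/example/transferConfigs/example-id",
    )
]
match = find_config_by_display_name(sample_configs, "Chicago taxi trip mirror")
print(match.name if match else "no existing config")

# Usage in this notebook (needs the client and credentials from the cell above):
# parent = transfer_client.common_project_path(destination_project_id)
# existing = find_config_by_display_name(
#     transfer_client.list_transfer_configs(parent=parent),
#     "Chicago taxi trip mirror",
# )
# Only call create_transfer_config(...) when `existing` is None.
```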

### Building the container images

The [model/](/model/) directory contains the code for the custom training and serving container images, including the training script at [model/training/train.py](model/training/train.py).

Build the training and serving container images and push them to Artifact Registry.

In [None]:
! make build

### Running Pipelines

You can run an example pipeline with `make run`, passing the pipeline name (for example, `pipeline=training`).

This executes the chosen pipeline on Vertex AI. Specifically, it will:

1. Compile the pipeline using the Kubeflow Pipelines SDK
1. Trigger the pipeline with the help of `pipelines/trigger/main.py`
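
Before triggering a pipeline, it can help to confirm that the environment variables from the Setup section are present in the notebook's environment. The check below is a minimal sketch: the helper `missing_env_vars` is hypothetical, the list contains the variables named in this notebook, and whether the Makefile needs every one of them is an assumption.

```python
import os

# Variables named in the Setup section of this notebook.
REQUIRED_ENV_VARS = (
    "VERTEX_PROJECT_ID",
    "VERTEX_LOCATION",
    "RESOURCE_SUFFIX",
    "VERTEX_SA_EMAIL",
    "VERTEX_PIPELINE_ROOT",
    "CONTAINER_IMAGE_REGISTRY",
)


def missing_env_vars(environ=os.environ, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

A misspelled variable name in an earlier `%env` cell shows up here as a missing entry.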

**Run Training Pipeline**

In [None]:
! make run pipeline=training build=false

**Run Prediction Pipeline** 

After a successful training run, you can try the prediction pipeline.

In [None]:
! make run pipeline=prediction build=false

## 3. Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

{TODO: Include infrastructure cleanup}