# Lab: Chicago taxifare tip prediction on Google Cloud Vertex Pipelines using the TFX SDK

## Learning objectives

* Define a machine learning pipeline to predict taxi fare tips using the TFX SDK.
* Compile and run a TFX pipeline on Google Cloud's Vertex Pipelines.

## Dataset

The [Chicago Taxi Trips](https://pantheon.corp.google.com/marketplace/details/city-of-chicago-public-data/chicago-taxi-trips) dataset is one of the [public datasets hosted with BigQuery](https://cloud.google.com/bigquery/public-data/). It contains taxi trips from 2013 to the present, as reported to the City of Chicago in its role as a regulatory agency. The task is to predict whether a given trip will result in a tip of at least 20% of the fare.
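
The labeling rule can be stated in a few lines of Python. This is a minimal sketch rather than part of the lab code: the function name `tip_label` is hypothetical, and the threshold mirrors the `tips/fare >= 0.2` expression used in the SQL later in this lab.

```python
def tip_label(tips: float, fare: float, threshold: float = 0.2) -> int:
    """Return 1 when the tip is at least `threshold` of the fare, else 0."""
    if fare <= 0:
        # The lab's SQL filters out non-positive fares, so reject them here too.
        raise ValueError("fare must be positive")
    return 1 if tips / fare >= threshold else 0


# A 3.00 tip on a 10.00 fare is a 30% tip, so it is a positive example.
print(tip_label(tips=3.0, fare=10.0))
# A 1.00 tip on a 10.00 fare is a 10% tip, so it is a negative example.
print(tip_label(tips=1.0, fare=10.0))
```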

## Setup

### Define constants

In [24]:
GOOGLE_CLOUD_PROJECT_ID = !(gcloud config get-value core/project)
GOOGLE_CLOUD_PROJECT_ID = GOOGLE_CLOUD_PROJECT_ID[0]

In [25]:
GOOGLE_CLOUD_REGION = 'us-central1'

In [80]:
BQ_DATASET_NAME = 'chicago_taxifare_tips'
BQ_TABLE_NAME = 'chicago_taxi_tips_ml'
BQ_LOCATION = 'US'
BQ_URI = f"bq://{GOOGLE_CLOUD_PROJECT_ID}.{BQ_DATASET_NAME}.{BQ_TABLE_NAME}"

In [135]:
DATASET_DISPLAY_NAME = 'chicago-taxifare-tips'
MODEL_DISPLAY_NAME = f'{DATASET_DISPLAY_NAME}-classifier'
PIPELINE_NAME = f'{MODEL_DISPLAY_NAME}-train-pipeline'

### Create Google Cloud Storage bucket for storing Vertex Pipeline artifacts

In [151]:
GCS_LOCATION = f"gs://{GOOGLE_CLOUD_PROJECT_ID}-tfx"

In [152]:
!gsutil mb -l $GOOGLE_CLOUD_REGION $GCS_LOCATION

Creating gs://dougkelly-vertex-demos-tfx/...
ServiceException: 409 A Cloud Storage bucket named 'dougkelly-vertex-demos-tfx' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


### Import libraries

In [95]:
import os
import tensorflow as tf
import tfx
import kfp

from google.cloud import bigquery
from google.cloud import aiplatform as vertex_ai

In [29]:
print(f"tensorflow: {tf.__version__}")
print(f"tfx: {tfx.__version__}")
print(f"kfp: {kfp.__version__}")
print(f"Google Cloud Vertex AI Python SDK: {vertex_ai.__version__}")

tensorflow: 2.6.2
tfx: 1.4.0
kfp: 1.8.1
Google Cloud Vertex AI Python SDK: 1.7.1


## Create BigQuery dataset

In [26]:
!bq --location=$BQ_LOCATION mk -d \
$GOOGLE_CLOUD_PROJECT_ID:$BQ_DATASET_NAME

BigQuery error in mk operation: Dataset 'dougkelly-vertex-demos:chicago_taxi'
already exists.


## Create BigQuery table for ML classification task

In [75]:
SAMPLE_SIZE = 20000
YEAR = 2020

In [76]:
sql_script = '''
CREATE OR REPLACE TABLE `@PROJECT_ID.@DATASET.@TABLE` 
AS (
    WITH
      taxitrips AS (
      SELECT
        trip_start_timestamp,
        trip_seconds,
        trip_miles,
        payment_type,
        pickup_longitude,
        pickup_latitude,
        dropoff_longitude,
        dropoff_latitude,
        tips,
        fare
      FROM
        `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE 1=1 
      AND pickup_longitude IS NOT NULL
      AND pickup_latitude IS NOT NULL
      AND dropoff_longitude IS NOT NULL
      AND dropoff_latitude IS NOT NULL
      AND trip_miles > 0
      AND trip_seconds > 0
      AND fare > 0
      AND EXTRACT(YEAR FROM trip_start_timestamp) = @YEAR
    )

    SELECT
      trip_start_timestamp,
      EXTRACT(MONTH from trip_start_timestamp) as trip_month,
      EXTRACT(DAY from trip_start_timestamp) as trip_day,
      EXTRACT(DAYOFWEEK from trip_start_timestamp) as trip_day_of_week,
      EXTRACT(HOUR from trip_start_timestamp) as trip_hour,
      trip_seconds,
      trip_miles,
      payment_type,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(pickup_longitude, pickup_latitude), 0.1)
      ) AS pickup_grid,
      ST_AsText(
          ST_SnapToGrid(ST_GeogPoint(dropoff_longitude, dropoff_latitude), 0.1)
      ) AS dropoff_grid,
      ST_Distance(
          ST_GeogPoint(pickup_longitude, pickup_latitude), 
          ST_GeogPoint(dropoff_longitude, dropoff_latitude)
      ) AS euclidean,
      CONCAT(
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(pickup_longitude,
              pickup_latitude), 0.1)), 
          ST_AsText(ST_SnapToGrid(ST_GeogPoint(dropoff_longitude,
              dropoff_latitude), 0.1))
      ) AS loc_cross,
      IF((tips/fare >= 0.2), 1, 0) AS tip_bin,
      IF(ABS(MOD(FARM_FINGERPRINT(STRING(trip_start_timestamp)), 10)) < 9, 'UNASSIGNED', 'TEST') AS data_split
    FROM
      taxitrips
    LIMIT @LIMIT
)
'''
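
The `data_split` column in the query is a deterministic hash split: each row's timestamp is hashed, the hash is reduced modulo 10, and buckets 0 to 8 become `UNASSIGNED` while bucket 9 becomes `TEST`. The sketch below illustrates the same idea in plain Python. It is an assumption-laden stand-in, not the query's exact behavior: it uses SHA-256 in place of BigQuery's `FARM_FINGERPRINT`, so bucket assignments will differ from the table, and the helper name `assign_split` is hypothetical.

```python
import hashlib


def assign_split(key: str, test_buckets: int = 1, total_buckets: int = 10) -> str:
    """Deterministically map a key to 'UNASSIGNED' or 'TEST'."""
    # Stand-in hash (assumption): SHA-256 instead of FARM_FINGERPRINT.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Python's % never returns a negative number, so no ABS() is needed here.
    bucket = int.from_bytes(digest[:8], "big") % total_buckets
    return "UNASSIGNED" if bucket < total_buckets - test_buckets else "TEST"


# The same key always lands in the same split, run after run.
print(assign_split("2020-01-01 00:00:00"))
```

Because the split depends only on the key, trips that share a start timestamp always land in the same split.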

In [77]:
sql_script = sql_script.replace(
    '@PROJECT_ID', GOOGLE_CLOUD_PROJECT_ID).replace(
    '@DATASET', BQ_DATASET_NAME).replace(
    '@TABLE', BQ_TABLE_NAME).replace(
    '@YEAR', str(YEAR)).replace(
    '@LIMIT', str(SAMPLE_SIZE))
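
Chained `str.replace` calls work here, but they silently leave a placeholder in place if a value is forgotten. As a hedged alternative, the sketch below substitutes every `@NAME` token in one pass and fails loudly on a missing value. The helper `render_sql` and the sample values are hypothetical and are not part of the lab code.

```python
import re


def render_sql(template: str, params: dict) -> str:
    """Replace each @NAME token with params['NAME'], raising if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value supplied for @{name}")
        return str(params[name])

    # Tokens are an '@' followed by uppercase letters and underscores.
    return re.sub(r"@([A-Z_]+)", substitute, template)


template = "SELECT * FROM `@PROJECT_ID.@DATASET.@TABLE` WHERE y = @YEAR LIMIT @LIMIT"
print(render_sql(template, {
    "PROJECT_ID": "my-project",  # placeholder value for illustration
    "DATASET": "chicago_taxifare_tips",
    "TABLE": "chicago_taxi_tips_ml",
    "YEAR": 2020,
    "LIMIT": 20000,
}))
```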

In [78]:
bq_client = bigquery.Client(project=GOOGLE_CLOUD_PROJECT_ID, location=BQ_LOCATION)
job = bq_client.query(sql_script)
_ = job.result()

In [79]:
%%bigquery

SELECT data_split, COUNT(*)
FROM chicago_taxifare_tips.chicago_taxi_tips_ml
GROUP BY data_split



|   | data_split | f0_ |
|---|------------|-----|
| 0 | UNASSIGNED | 18204 |
| 1 | TEST | 1796 |
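
As a quick sanity check on the counts above, the split proportions can be computed directly. This is a small illustrative sketch: the counts are copied from the query output, and the helper name `split_fractions` is hypothetical.

```python
def split_fractions(counts: dict) -> dict:
    """Return each split's share of the total row count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


split_counts = {"UNASSIGNED": 18204, "TEST": 1796}
for name, fraction in split_fractions(split_counts).items():
    print(f"{name}: {split_counts[name]} rows ({fraction:.2%})")
```

The result is close to the 90/10 ratio that the modulo-10 hash split targets, with small deviations expected for a 20,000-row sample.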


## Create a Vertex AI managed dataset resource for pipeline dataset lineage tracking

### Initialize Vertex AI Python SDK

In [82]:
vertex_ai.init(project=GOOGLE_CLOUD_PROJECT_ID, location=GOOGLE_CLOUD_REGION)

### Create Vertex managed tabular dataset

In [None]:
tabular_dataset = vertex_ai.TabularDataset.create(
    display_name=BQ_DATASET_NAME,
    bq_source=BQ_URI,
)
tabular_dataset.gca_resource

## Create a TFX pipeline

In [92]:
PIPELINE_DIR="tfx_taxifare_tips"

### Write model code

In [None]:
%%writefile {PIPELINE_DIR}/model_training/features.py


In [None]:
%%writefile {PIPELINE_DIR}/model_training/preprocessing.py


In [39]:
%%writefile {PIPELINE_DIR}/model_training/model.py


Writing tfx-taxifare-tips/model.py


### Write pipeline definition with the TFX SDK

In [41]:
%%writefile {PIPELINE_DIR}/pipeline.py


Writing tfx-taxifare-tips/pipeline.py


In [None]:
%%writefile {PIPELINE_DIR}/runner.py


## Run your TFX pipeline on Vertex Pipelines

### Create an Artifact Registry repository on Google Cloud for your pipeline container image

In [89]:
ARTIFACT_REGISTRY="tfx-taxifare-tips" 

In [90]:
# Create a Docker repository in Artifact Registry using the gcloud CLI.
# Documentation link: https://cloud.google.com/sdk/gcloud/reference/artifacts/repositories/create

!gcloud artifacts repositories create {ARTIFACT_REGISTRY} \
--repository-format=docker \
--location={GOOGLE_CLOUD_REGION} \
--description="Artifact registry for TFX pipeline images for Chicago taxifare prediction."

ERROR: (gcloud.artifacts.repositories.create) ALREADY_EXISTS: the repository already exists


In [99]:
IMAGE_NAME="tfx-taxifare-tips"
IMAGE_TAG="latest"
IMAGE_URI=f"{GOOGLE_CLOUD_REGION}-docker.pkg.dev/{GOOGLE_CLOUD_PROJECT_ID}/{ARTIFACT_REGISTRY}/{IMAGE_NAME}:{IMAGE_TAG}"
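
The image URI follows the pattern `{region}-docker.pkg.dev/{project}/{repository}/{image}:{tag}`. The sketch below wraps that pattern in a small helper so the pieces are explicit. The function name `artifact_registry_image_uri` is hypothetical, and the sample values are taken from the outputs shown in this notebook.

```python
def artifact_registry_image_uri(
    region: str, project_id: str, repository: str, image: str, tag: str = "latest"
) -> str:
    """Build a Docker image URI for an Artifact Registry repository."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


print(artifact_registry_image_uri(
    region="us-central1",
    project_id="dougkelly-vertex-demos",
    repository="tfx-taxifare-tips",
    image="tfx-taxifare-tips",
))
```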

### Set the pipeline configurations for the Vertex AI run

In [145]:
os.environ["DATASET_DISPLAY_NAME"] = DATASET_DISPLAY_NAME
os.environ["MODEL_DISPLAY_NAME"] = MODEL_DISPLAY_NAME
os.environ["PIPELINE_NAME"] = PIPELINE_NAME
os.environ["GOOGLE_CLOUD_PROJECT_ID"] = GOOGLE_CLOUD_PROJECT_ID
os.environ["GOOGLE_CLOUD_REGION"] = GOOGLE_CLOUD_REGION
os.environ["GCS_LOCATION"] = GCS_LOCATION
os.environ["TRAIN_LIMIT"] = "20000"
os.environ["TEST_LIMIT"] = "2000"
os.environ["BEAM_RUNNER"] = "DataflowRunner"
os.environ["TRAINING_RUNNER"] = "vertex"
os.environ["TFX_IMAGE_URI"] = IMAGE_URI
os.environ["ENABLE_CACHE"] = "1"
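
The lab's `config` module is not shown in this notebook, so the following is only a guess at the general pattern: read settings from environment variables, convert types, and derive dependent values. The class `PipelineSettings` and the function `load_settings` are hypothetical names. The derived paths (`tfx_artifacts`, `model_registry`) are taken from the config output printed in the next cell.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineSettings:
    project_id: str
    region: str
    gcs_location: str
    train_limit: int
    test_limit: int
    enable_cache: bool

    @property
    def artifact_store_uri(self) -> str:
        return f"{self.gcs_location}/tfx_artifacts"

    @property
    def model_registry_uri(self) -> str:
        return f"{self.gcs_location}/model_registry"


def load_settings(env: Mapping[str, str] = os.environ) -> PipelineSettings:
    """Read settings from a mapping of environment variables (assumed layout)."""
    required = ("GOOGLE_CLOUD_PROJECT_ID", "GOOGLE_CLOUD_REGION", "GCS_LOCATION")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"missing required settings: {', '.join(missing)}")
    return PipelineSettings(
        project_id=env["GOOGLE_CLOUD_PROJECT_ID"],
        region=env["GOOGLE_CLOUD_REGION"],
        gcs_location=env["GCS_LOCATION"],
        train_limit=int(env.get("TRAIN_LIMIT", "0")),
        test_limit=int(env.get("TEST_LIMIT", "0")),
        enable_cache=env.get("ENABLE_CACHE", "0") == "1",
    )


# Example with an explicit mapping instead of the real environment.
settings = load_settings({
    "GOOGLE_CLOUD_PROJECT_ID": "my-project",  # placeholder value
    "GOOGLE_CLOUD_REGION": "us-central1",
    "GCS_LOCATION": "gs://my-project-tfx",
    "TRAIN_LIMIT": "20000",
    "ENABLE_CACHE": "1",
})
print(settings.artifact_store_uri)
```

Failing early on a missing required variable keeps an unset value from surfacing later as a harder-to-read error during pipeline compilation.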

In [146]:
from tfx_taxifare_tips.tfx_pipeline import config
import importlib
importlib.reload(config)

# Print only the upper-case module attributes, which hold the settings.
for key, value in config.__dict__.items():
    if key.isupper():
        print(f'{key}: {value}')

GOOGLE_CLOUD_PROJECT_ID: dougkelly-vertex-demos
GOOGLE_CLOUD_REGION: us-central1
GCS_LOCATION: gs://dougkelly-vertex-demos-tfx
ARTIFACT_STORE_URI: gs://dougkelly-vertex-demos-tfx/tfx_artifacts
MODEL_REGISTRY_URI: gs://dougkelly-vertex-demos-tfx/model_registry
BQ_DATASET_NAME: chicago_taxi
MODEL_DISPLAY_NAME: chicago-taxifare-tips-classifier
PIPELINE_NAME: chicago-taxifare-tips-classifier-train-pipeline
ML_USE_COLUMN: ml_use
EXCLUDE_COLUMNS: trip_start_timestamp
TRAIN_LIMIT: 20000
TEST_LIMIT: 2000
SERVE_LIMIT: 0
NUM_TRAIN_SPLITS: 4
NUM_EVAL_SPLITS: 1
ACCURACY_THRESHOLD: 0.8
USE_KFP_SA: False
TFX_IMAGE_URI: us-central1-docker.pkg.dev/dougkelly-vertex-demos/tfx-taxifare-tips/tfx-taxifare-tips:latest
BEAM_RUNNER: DataflowRunner
BEAM_DIRECT_PIPELINE_ARGS: ['--project=dougkelly-vertex-demos', '--temp_location=gs://dougkelly-vertex-demos-tfx/temp']
BEAM_DATAFLOW_PIPELINE_ARGS: ['--project=dougkelly-vertex-demos', '--temp_location=gs://dougkelly-vertex-demos-tfx/temp', '--region=us-central1', 
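
The two Beam argument lists differ only in the Dataflow-specific flags. A minimal sketch of how such lists could be assembled follows. The helper `build_beam_args` is hypothetical, and the flag set is limited to what appears in this notebook's output (`--project`, `--temp_location`, `--region`, `--runner`).

```python
from typing import List, Optional


def build_beam_args(
    project: str,
    temp_location: str,
    runner: str = "DirectRunner",
    region: Optional[str] = None,
) -> List[str]:
    """Assemble Beam pipeline flags for a local or a Dataflow run."""
    args = [f"--project={project}", f"--temp_location={temp_location}"]
    if runner == "DataflowRunner":
        if region is None:
            raise ValueError("region is required for DataflowRunner")
        args += [f"--region={region}", f"--runner={runner}"]
    return args


print(build_beam_args(
    project="my-project",  # placeholder value
    temp_location="gs://my-project-tfx/temp",
    runner="DataflowRunner",
    region="us-central1",
))
```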

### Build the TFX pipeline container image

In [147]:
!echo $TFX_IMAGE_URI

us-central1-docker.pkg.dev/dougkelly-vertex-demos/tfx-taxifare-tips/tfx-taxifare-tips:latest


In [148]:
!gcloud builds submit --tag $TFX_IMAGE_URI . --timeout=20m --machine-type=e2-highcpu-8

Creating temporary tarball archive of 54 file(s) totalling 150.9 KiB before compression.
Uploading tarball of [.] to [gs://dougkelly-vertex-demos_cloudbuild/source/1638755445.219295-e2efcc66af9f4caeabeda50be85c4bfe.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/dougkelly-vertex-demos/locations/global/builds/318dba06-51a3-42a4-863f-a50a8299cb66].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/318dba06-51a3-42a4-863f-a50a8299cb66?project=617979904441].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "318dba06-51a3-42a4-863f-a50a8299cb66"

FETCHSOURCE
Fetching storage object: gs://dougkelly-vertex-demos_cloudbuild/source/1638755445.219295-e2efcc66af9f4caeabeda50be85c4bfe.tgz#1638755445589760
Copying gs://dougkelly-vertex-demos_cloudbuild/source/1638755445.219295-e2efcc66af9f4caeabeda50be85c4bfe.tgz#1638755445589760...
/ [1 files][ 36.5 KiB/ 36.5 KiB]                                                
Ope

### Compile the TFX pipeline

In [153]:
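# Import the module before reloading it so this cell also works in a fresh kernel.
from tfx_taxifare_tips.tfx_pipeline import pipeline_runner
importlib.reload(pipeline_runner)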

<module 'tfx_taxifare_tips.tfx_pipeline.pipeline_runner' from '/home/jupyter/training-data-analyst/self-paced-labs/vertex-ai/vertex-pipelines/tfx/tfx_taxifare_tips/tfx_pipeline/pipeline_runner.py'>

In [154]:
from tfx_taxifare_tips.tfx_pipeline import pipeline_runner

pipeline_definition_file = f'{config.PIPELINE_NAME}.json'
pipeline_definition = pipeline_runner.compile_training_pipeline(pipeline_definition_file)

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:root:Pipeline components: TrainDataGen, StatisticsGen, ExampleValidator, Tranform, ModelTrainer, BaselineModelResolver, ModelEvaluator, ModelPusher
INFO:root:Beam pipeline args: ['--project=dougkelly-vertex-demos', '--temp_location=gs://dougkelly-vertex-demos-tfx/temp', '--region=us-central1', '--runner=DataflowRunner']


TypeError: None has type NoneType, but expected one of: bytes, unicode
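
The `TypeError` above is characteristic of a `None` value reaching a field that expects a string. One plausible but unconfirmed cause is a configuration value that was never set. As a hedged debugging aid, the sketch below lists settings that are `None` or empty so they can be fixed before compiling. The helper `find_unset` is hypothetical and is not part of the lab code.

```python
def find_unset(settings: dict) -> list:
    """Return the sorted names of settings whose value is None or an empty string."""
    return sorted(name for name, value in settings.items() if value is None or value == "")


# Illustrative values only. In the notebook you might pass
# {k: v for k, v in vars(config).items() if k.isupper()} instead.
example = {
    "PIPELINE_NAME": "chicago-taxifare-tips-classifier-train-pipeline",
    "TFX_IMAGE_URI": None,
    "GCS_LOCATION": "",
}
print(find_unset(example))
```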

In [None]:
# Match the file name used when the pipeline was compiled above.
PIPELINE_DEFINITION_FILE = f"{PIPELINE_NAME}.json"

In [None]:
vertex_ai.init(project=GOOGLE_CLOUD_PROJECT_ID, location=GOOGLE_CLOUD_REGION)

In [None]:
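# Assumption: pipeline artifacts are rooted under the bucket created above;
# adjust PIPELINE_ROOT if your pipeline writes elsewhere.
PIPELINE_ROOT = f"{GCS_LOCATION}/pipeline_root"

pipeline_job = vertex_ai.pipeline_jobs.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path=PIPELINE_DEFINITION_FILE,
    pipeline_root=PIPELINE_ROOT,
)
# sync=False submits the run and returns without waiting for it to finish.
pipeline_job.run(sync=False)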

### Extract pipeline run metadata

In [None]:
pipeline_df = vertex_ai.get_pipeline_df(PIPELINE_NAME)
pipeline_df = pipeline_df[pipeline_df.pipeline_name == PIPELINE_NAME]
pipeline_df.T

### Upload trained model from Google Cloud Storage to Vertex AI