# Taxi fare prediction using the Chicago Taxi-cab dataset

## Table of contents
* [Overview](#section-1)
* [Dataset](#section-2)
* [Objective](#section-3)
* [Costs](#section-4)
* [Data analysis](#section-5)
* [Fit a simple linear regression model](#section-6)
* [Save the model and upload to a GCS bucket](#section-7)
* [Deploy the model on Vertex AI with support for Vertex Explainable AI](#section-8)
* [Get explanations from the deployed model](#section-9)
* [Clean up](#section-10)

## Overview
<a name="section-1"></a>

This notebook demonstrates data analysis, feature selection, model building, and deployment on Vertex AI with Vertex Explainable AI configured, using a subset of the Chicago Taxi-cab dataset for a taxi-fare prediction problem.

Note: This notebook file was developed to run in a [Vertex AI Workbench managed notebooks](https://console.cloud.google.com/vertex-ai/workbench/list/managed) instance using the Python(Local) kernel. Some components of this notebook may not work in other notebook environments.

## Dataset
<a name="section-2"></a>

The Chicago Taxi-cab dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy while allowing aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported, but the City believes that most are. This dataset is publicly available on BigQuery under the public datasets with the table ID `bigquery-public-data.chicago_taxi_trips.taxi_trips`, and also as a public dataset on Kaggle Datasets at [Chicago Taxi Trips Dataset](https://www.kaggle.com/chicago/chicago-taxi-trips-bq).

For more information about this dataset and how it was created, refer to the [Chicago Digital website](http://digital.cityofchicago.org/index.php/chicago-taxi-data-released).

## Objective
<a name="section-3"></a>

The goal of this notebook is to provide an overview of recent Vertex AI features, such as Explainable AI and BigQuery in Notebooks, by solving a taxi-fare prediction problem. The steps followed in this notebook include:

- Loading the dataset using `BigQuery in Notebooks`.
- Performing exploratory data analysis on the dataset.
- Feature selection and preprocessing.
- Building a linear regression model using scikit-learn.
- Configuring the model for Vertex Explainable AI.
- Deploying the model to Vertex AI.
- Testing the deployed model.
- Clean up.

## Costs
<a name="section-4"></a>

This tutorial uses the following billable components of Google Cloud:

- Vertex AI
- BigQuery
- Cloud Storage


Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

## Select or Create Cloud Storage Bucket for storing the model

When you create a model resource on Vertex AI using the Cloud SDK, you need to provide the Cloud Storage URI of the location where the model is stored. Using the saved model, you can then create Vertex AI model and endpoint resources to serve online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets. You may also change the `LOCATION` variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Vertex AI services are available.

In [None]:
BUCKET_NAME = "[your-bucket-name]"
BUCKET_URI = f"gs://{BUCKET_NAME}"
LOCATION = "us-central1"

In [None]:
from datetime import datetime

# Set a default bucket name in case no bucket name is given
if BUCKET_NAME == "" or BUCKET_NAME == "[your-bucket-name]" or BUCKET_NAME is None:
    TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
    BUCKET_NAME = PROJECT_ID + "aip-" + TIMESTAMP
    BUCKET_URI = "gs://" + BUCKET_NAME

<b>Only if your bucket doesn't already exist</b>: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $LOCATION $BUCKET_URI

Next, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_URI

## Import the required libraries and define constants

In [None]:
# load the required libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

The dataset is large and noisy, so only data from a specific date range is used. Many blogs and resources available online use data from around May 2018 and report better results for that period than for other date ranges. While more complex models have also been proposed for this problem, for example ones that take weather data, holidays, and seasons into account, this notebook explores only a simple linear regression model, because the main objective is to demonstrate model deployment on Vertex AI with Vertex Explainable AI configured.

## Accessing the data through BigQuery in Notebooks
The `BigQuery in Notebooks` feature of Vertex AI's managed notebooks lets you use BigQuery and its features from the notebook itself, eliminating the need to switch between tabs every time. Every cell in the notebook has a BigQuery integration option at the top right; selecting it lets you compose a SQL query that is executed in BigQuery.

The chosen dataset consists of the following fields :
- `unique_key` : Unique identifier for the trip.
- `taxi_id` : A unique identifier for the taxi.
- `trip_start_timestamp`: When the trip started, rounded to the nearest 15 minutes.
- `trip_end_timestamp`: When the trip ended, rounded to the nearest 15 minutes.
- `trip_seconds`: Time of the trip in seconds.
- `trip_miles`: Distance of the trip in miles.
- `pickup_census_tract`: The Census Tract where the trip began. For privacy, this Census Tract is not shown for some trips.
- `dropoff_census_tract`: The Census Tract where the trip ended. For privacy, this Census Tract is not shown for some trips.
- `pickup_community_area`: The Community Area where the trip began.
- `dropoff_community_area`: The Community Area where the trip ended.
- `fare`: The fare for the trip.
- `tips`: The tip for the trip. Cash tips generally will not be recorded.
- `tolls`: The tolls for the trip.
- `extras`: Extra charges for the trip.
- `trip_total`: Total cost of the trip, the total of the fare, tips, tolls, and extras.
- `payment_type`: Type of payment for the trip.
- `company`: The taxi company.
- `pickup_latitude`: The latitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy.
- `pickup_longitude`: The longitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy.
- `pickup_location`: The location of the center of the pickup census tract or the community area if the census tract has been hidden for privacy.
- `dropoff_latitude`: The latitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy.
- `dropoff_longitude`: The longitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy.
- `dropoff_location`: The location of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy.

Among the available fields in the dataset, only those that seem common and relevant for analysis and modeling are selected: `taxi_id`, `trip_start_timestamp`, `trip_seconds`, `trip_miles`, `payment_type` and `trip_total`. The field `trip_total` is treated as the target variable to be predicted by the machine learning model. Because this field is the sum of the `fare`, `tips`, `tolls` and `extras` fields, those fields are directly tied to the target and are excluded from modeling. Due to the volume of the data, only a one-week subset of the dataset, 12-May-2018 to 18-May-2018, is considered. Even within this date range the datapoints can be noisy, so the following conditions are applied:

- Time taken for the trip > 0.
- Distance covered during the trip > 0.
- Total trip charges > 3.
- Pickup and dropoff areas are valid (not empty).

#@bigquery

select 
-- select the required fields
taxi_id, trip_start_timestamp, 
trip_seconds, trip_miles, trip_total, 
payment_type

from `bigquery-public-data.chicago_taxi_trips.taxi_trips` 
where 
-- specify the required criteria
trip_start_timestamp >= '2018-05-12' and 
trip_end_timestamp <= '2018-05-18' and
trip_seconds > 0 and
trip_miles > 0 and
trip_total > 3 and
pickup_community_area is not NULL and 
dropoff_community_area is not NULL


The BigQuery integration also lets you load the queried data into a pandas dataframe using the `Query and load as DataFrame` button. Clicking the button adds a new cell below that provides a code snippet to load the data into a dataframe.

In [None]:
# The following two lines are only necessary to run once.
# Comment out otherwise for speed-up.
from google.cloud.bigquery import Client

client = Client()

query = """select 
taxi_id, trip_start_timestamp, 
trip_seconds, trip_miles, trip_total, 
payment_type, pickup_community_area, 
dropoff_community_area 

from `bigquery-public-data.chicago_taxi_trips.taxi_trips` 
where 
trip_start_timestamp >= '2018-05-12' and 
trip_end_timestamp <= '2018-05-18' and
trip_seconds > 0 and trip_seconds < 6*60*60 and
trip_miles > 0 and
trip_total > 3 and
pickup_community_area is not NULL and 
dropoff_community_area is not NULL"""
job = client.query(query)
df = job.to_dataframe()

Check the fields in the data and the shape.

In [None]:
# check the dataframe's shape
print(df.shape)
# check the columns in the dataframe
df.columns

Check some sample data.

In [None]:
df.head()

Check the dtypes of fields in the data.

In [None]:
df.dtypes

Check for null values in the dataframe.

In [None]:
df.info()

Depending on the percentage of null values in the data, one can choose to either drop them or impute them with the mean or median (for numerical values) or the mode (for categorical values). In the current data, there don't seem to be any null values.
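If nulls did appear, the drop-or-impute choice described above could look like the following sketch. The toy dataframe and its values are made up for illustration; only the column names mirror this notebook.

```python
import pandas as pd

# Toy frame with gaps; values are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {
        "trip_miles": [1.0, None, 3.0, 4.0],
        "payment_type": ["Cash", "Cash", None, "Credit Card"],
    }
)

# Share of missing values per column, useful for deciding between drop and impute.
null_share = toy.isna().mean()

# Numerical field: fill with the median. Categorical field: fill with the mode.
imputed = toy.copy()
imputed["trip_miles"] = imputed["trip_miles"].fillna(imputed["trip_miles"].median())
imputed["payment_type"] = imputed["payment_type"].fillna(
    imputed["payment_type"].mode()[0]
)

# Alternative when nulls are rare: drop the affected rows instead.
dropped = toy.dropna()
```

The median is generally less sensitive to outliers than the mean, which may matter for skewed fields such as trip distance.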

Check the distributions of the numerical fields. If any field has a constant value, it can be dropped because it adds no information to the model.

In [None]:
df.describe().T
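Constant fields can also be found programmatically rather than by reading the summary table. The sketch below uses a made-up toy frame; the `tolls` column here is only an example of a field that happens to be constant.

```python
import pandas as pd

# Toy frame in which one column never varies (illustrative values).
toy = pd.DataFrame({"trip_miles": [1.2, 3.4, 0.8], "tolls": [0.0, 0.0, 0.0]})

# A column with a single distinct value carries no signal for the model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Drop the constant columns and keep the rest.
reduced = toy.drop(columns=constant_cols)
```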

In the current dataset, `trip_total` is the target field. To access the fields by their type easily, identify the categorical and numerical fields in the data and save them.

In [None]:
target = "trip_total"
categ_cols = ["payment_type", "pickup_community_area", "dropoff_community_area"]
num_cols = ["trip_seconds", "trip_miles"]

## Analyze numerical data
<a name="section-5"></a>

To further analyze the data, various plots can be used on the numerical and categorical fields. Histograms and box-plots suit numerical data, while bar charts suit categorical data; both help in understanding the distribution of the data and its outliers.

Plot histograms and box-plots of the numerical fields.

In [None]:
for i in num_cols + [target]:
    _, ax = plt.subplots(1, 2, figsize=(12, 4))
    df[i].plot(kind="hist", bins=100, ax=ax[0])
    ax[0].set_title(str(i) + " -Histogram")
    df[i].plot(kind="box", ax=ax[1])
    ax[1].set_title(str(i) + " -Boxplot")
    plt.show()

The field `trip_seconds` describes the time taken for the trip in seconds. Optionally, it can be converted into hours for an easier understanding.

In [None]:
df["trip_hours"] = round(df["trip_seconds"] / 3600, 2)
df["trip_hours"].plot(kind="box")

Similarly, another field `trip_speed` can be added by dividing `trip_miles` by `trip_hours` to express the speed of the trip in miles/hour.

In [None]:
df["trip_speed"] = round(df["trip_miles"] / df["trip_hours"], 2)
df["trip_speed"].plot(kind="box")
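One detail worth noting: because `trip_hours` is rounded to two decimals, a trip lasting only a few seconds rounds to 0.0 hours, and pandas then returns `inf` for its speed instead of raising an error. The sketch below illustrates this on made-up values and shows one possible alternative, dividing by the unrounded duration. This is an illustration, not a change the notebook makes; the outlier filters applied later remove such rows anyway.

```python
import numpy as np
import pandas as pd

# Two made-up trips: one of 10 seconds, one of 30 minutes.
toy = pd.DataFrame({"trip_seconds": [10, 1800], "trip_miles": [0.1, 12.0]})
toy["trip_hours"] = (toy["trip_seconds"] / 3600).round(2)

# Dividing by the rounded hours gives inf for the 10-second trip.
rounded_speed = toy["trip_miles"] / toy["trip_hours"]

# Dividing by the unrounded duration avoids the zero denominator.
exact_speed = (toy["trip_miles"] / (toy["trip_seconds"] / 3600)).round(2)

# Count how many rows would carry an infinite speed.
n_infinite = int(np.isinf(rounded_speed).sum())
```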

So far, only univariate plots have been considered. To better understand the relationships between the variables, a pair-plot can be used.

In [None]:
sns.pairplot(
    data=df[["trip_seconds", "trip_miles", "trip_total", "trip_speed"]].sample(10000)
)
plt.show()

The box-plots and histograms plotted so far show that some outliers skew the data and could be removed. The pair-plot also shows a roughly linear relationship between the independent variables `trip_seconds` and `trip_miles` and the dependent variable `trip_total`.

Restrict the data based on the following conditions to remove some of the outliers:
- Total charge greater than $3.
- Total miles driven greater than 0 and less than 300.
- Trip duration of at least 1 minute.
- Trip duration of no more than 2 hours.
- Trip speed of no more than 70 mph.

These conditions are based on general assumptions, because the data clearly contains recording errors, such as speeds greater than 500 mph and travel times of more than 5 hours, that show up as outliers.

In [None]:
# set constraints to remove outliers
df = df[df["trip_total"] > 3]

df = df[(df["trip_miles"] > 0) & (df["trip_miles"] < 300)]

df = df[df["trip_seconds"] >= 60]

df = df[df["trip_hours"] <= 2]

df = df[df["trip_speed"] <= 70]
df.reset_index(drop=True, inplace=True)
df.shape

## Analyze Categorical data

Further, explore the categorical data by plotting the distribution of all the levels in each field.

In [None]:
for i in categ_cols:
    print(df[i].unique().shape)
    df[i].value_counts(normalize=True).plot(kind="bar", figsize=(10, 4))
    plt.title(i)
    plt.show()

From the above analysis, one can see that almost 99% of the payment types are Cash and Credit Card. Other payment types exist, but they account for a very small share of the data, so these low-frequency levels can be dropped. The pickup and dropoff community areas both appear to have the same set of levels, which makes sense. Here too, one can choose to omit the low-frequency levels, but both fields must then be left with the same levels. In the current notebook, they are kept as is for modeling.
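If the low-frequency community areas were dropped, one way to keep both fields on a shared set of levels is sketched below. The toy data, the `MIN_SHARE` cutoff, and the `frequent_levels` helper are all hypothetical and not part of this notebook.

```python
import pandas as pd

# Made-up pickup/dropoff areas; area 76 is rare in the pickup column.
toy = pd.DataFrame(
    {
        "pickup_community_area": [8, 8, 8, 32, 32, 76],
        "dropoff_community_area": [8, 32, 32, 8, 8, 8],
    }
)
MIN_SHARE = 0.2  # hypothetical cutoff for a level to count as frequent


def frequent_levels(series, min_share):
    """Return the set of levels whose relative frequency is at least min_share."""
    share = series.value_counts(normalize=True)
    return set(share[share >= min_share].index)


# Keep only levels that are frequent in BOTH fields, so they share one level set.
keep = frequent_levels(toy["pickup_community_area"], MIN_SHARE) & frequent_levels(
    toy["dropoff_community_area"], MIN_SHARE
)
mask = toy["pickup_community_area"].isin(list(keep)) & toy[
    "dropoff_community_area"
].isin(list(keep))
filtered = toy[mask].reset_index(drop=True)
```

Taking the intersection is a conservative choice: a level survives only if it is common as both a pickup and a dropoff area.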

The relationships between the target variable and the categorical fields can be represented through boxplots. For each level, the corresponding distribution of the target variable can be identified.

In [None]:
for i in categ_cols:
    plt.figure(figsize=(10, 4))
    sns.boxplot(x=i, y=target, data=df)
    plt.xticks(rotation=45)
    plt.title(i)
    plt.show()

There seems to be one case where the `trip_total` is over 3000 and has the same pickup and dropoff community area i.e., 28 which is clearly an outlier compared to the rest of the points. This datapoint can be removed.

In [None]:
df = df[df["trip_total"] < 3000].reset_index(drop=True)

Keep only the `Credit Card` and `Cash` payment types. Further, encode them by assigning 0 for `Credit Card` and 1 for `Cash` payment types.

In [None]:
# keep only the Credit Card and Cash payment types
df = df[df["payment_type"].isin(["Credit Card", "Cash"])].reset_index(drop=True)
# encode the payment types: Credit Card -> 0, Cash -> 1
df["payment_type"] = df["payment_type"].map({"Credit Card": 0, "Cash": 1})

The data also contains timestamp fields that can prove useful. `trip_start_timestamp` represents the start time of the taxi trip, and fields such as the day of the week and the hour of the day can be derived from it.

In [None]:
df["trip_start_timestamp"] = pd.to_datetime(df["trip_start_timestamp"])
df["dayofweek"] = df["trip_start_timestamp"].dt.dayofweek
df["hour"] = df["trip_start_timestamp"].dt.hour

Since the current dataset covers only one week, the newly derived fields can be dropped if they show little variation with respect to the target variable.

Plot sum and average of the `trip_total` with respect to the `dayofweek`.

In [None]:
# plot sum and average of trip_total w.r.t the dayofweek
_, ax = plt.subplots(1, 2, figsize=(10, 4))
df[["dayofweek", "trip_total"]].groupby("dayofweek").trip_total.sum().plot(
    kind="bar", ax=ax[0]
)
ax[0].set_title("Sum of trip_total")
df[["dayofweek", "trip_total"]].groupby("dayofweek").trip_total.mean().plot(
    kind="bar", ax=ax[1]
)
ax[1].set_title("Avg. of trip_total")
plt.show()

Plot sum and average of the `trip_total` with respect to the `hour`.

In [None]:
_, ax = plt.subplots(1, 2, figsize=(10, 4))
df[["hour", "trip_total"]].groupby("hour").trip_total.sum().plot(kind="bar", ax=ax[0])
ax[0].set_title("Sum of trip_total")
df[["hour", "trip_total"]].groupby("hour").trip_total.mean().plot(kind="bar", ax=ax[1])
ax[1].set_title("Avg. of trip_total")
plt.show()

Since the target variable is not constant across the levels of these fields, they can be considered for training. To simplify things, these derived features can be bucketed into fewer levels.

The `dayofweek` field can be bucketed into a binary field indicating whether or not the day was a weekend: a weekday is assigned 1, a weekend day 0. Similarly, the `hour` field can be bucketed and encoded. The normal working hours in Chicago are assumed to be between *8AM* and *10PM* (hour values 8 through 22); a value within the working hours is encoded as 1, otherwise 0.

In [None]:
# bucket and encode the dayofweek and hour
df["dayofweek"] = df["dayofweek"].apply(lambda x: 0 if x in [5, 6] else 1)
df["hour"] = df["hour"].apply(lambda x: 0 if x in [23, 0, 1, 2, 3, 4, 5, 6, 7] else 1)

Check the data distribution before training the model.

In [None]:
df.describe().T

## Divide the data in Train and Test sets

Split the preprocessed dataset into train and test sets so that the linear regression model can be validated on the test set.

In [None]:
cols = [
    "trip_seconds",
    "trip_miles",
    "payment_type",
    "pickup_community_area",
    "dropoff_community_area",
    "dayofweek",
    "hour",
    "trip_speed",
]
x = df[cols].copy()
y = df[target].copy()

# split the data into 75-25% ratio
X_train, X_test, y_train, y_test = train_test_split(
    x, y, train_size=0.75, test_size=0.25, random_state=13
)
X_train.shape, X_test.shape

## Fit a Simple Linear Regression model
<a name="section-6"></a>

Fit a linear regression model on the train data using scikit-learn's `LinearRegression` class.

In [None]:
# Building the regression model
reg = LinearRegression()
reg.fit(X_train, y_train)

Print the `R2 score` and `RMSE` values for the model on train and test sets.

In [None]:
# compute the R2 score and RMSE on the train and test sets
y_train_pred = reg.predict(X_train)
train_score = r2_score(y_train, y_train_pred)
# RMSE is the square root of the MSE; computing it explicitly avoids the
# `squared=False` argument, which newer scikit-learn releases no longer accept
train_rmse = mean_squared_error(y_train, y_train_pred) ** 0.5
y_test_pred = reg.predict(X_test)
test_score = r2_score(y_test, y_test_pred)
test_rmse = mean_squared_error(y_test, y_test_pred) ** 0.5
print("Train R2-score:", train_score, "Train RMSE:", train_rmse)
print("Test R2-score:", test_score, "Test RMSE:", test_rmse)

A low RMSE and train and test R2 scores of 0.93 suggest that the model fits the data well. The coefficients learned by the model for each of its independent variables can be inspected through the `coef_` attribute of the scikit-learn model.
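For reference, the two metrics reported above can be written out directly from their definitions. The sketch below is a plain-Python illustration on made-up numbers; the notebook itself relies on scikit-learn's `r2_score` and `mean_squared_error`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


fares = [10.0, 20.0, 30.0]  # made-up true values
predicted = [12.0, 18.0, 30.0]  # made-up predictions

demo_rmse = rmse(fares, predicted)
demo_r2 = r2(fares, predicted)
```

RMSE is expressed in the same unit as the target (dollars here), while R2 is unitless and compares the model against always predicting the mean.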

Check the coefficients learned by the model.

In [None]:
coef_df = pd.DataFrame({"col": cols, "coeff": reg.coef_})
coef_df.set_index("col").plot(kind="bar")

## Save the model and upload to a GCS bucket
<a name="section-7"></a>

To deploy the model on Vertex AI, the model needs to be stored in a Cloud Storage bucket first.

In [None]:
import joblib
from google.cloud import storage

FILE_NAME = "model.joblib"
joblib.dump(reg, FILE_NAME)

# Upload the saved model file to Cloud Storage
BLOB_PATH = "taxicab_fare_prediction/"

BLOB_NAME = BLOB_PATH + FILE_NAME

bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob(BLOB_NAME)
blob.upload_from_filename(FILE_NAME)
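Before uploading, it can be worth confirming that the serialized file loads back into an equivalent model. The sketch below does this round trip with a tiny stand-in model and a temporary directory; `demo_model`, `X_demo`, and `y_demo` are hypothetical names, and in this notebook the same check would be applied to `reg`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model fitted on made-up data.
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Serialize to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = bool(
    np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
)
```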

## Deploy the Model on Vertex AI with support for Vertex Explainable AI
<a name="section-8"></a>

Configure the Vertex Explainable AI before deploying the model. For further details, see [Configuring Vertex Explainable AI in Vertex AI models](https://cloud.google.com/vertex-ai/docs/explainable-ai/configuring-explanations#scikit-learn-and-xgboost-pre-built-containers).

In [None]:
MODEL_DISPLAY_NAME = "taxi_fare_prediction_model"
ARTIFACT_GCS_PATH = f"{BUCKET_URI}/{BLOB_PATH}"

# The feature name ("Input_feature") and output name ("Predicted_taxi_fare") can be arbitrary
exp_metadata = {"inputs": {"Input_feature": {}}, "outputs": {"Predicted_taxi_fare": {}}}

Create a model resource from the uploaded model with explanation metadata configured.

In [None]:
from google.cloud import aiplatform
from google.cloud.aiplatform_v1.types import SampledShapleyAttribution
from google.cloud.aiplatform_v1.types.explanation import ExplanationParameters

# Create a Vertex AI model resource with support for Vertex Explainable AI

aiplatform.init(project=PROJECT_ID, location=LOCATION)

model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    artifact_uri=ARTIFACT_GCS_PATH,
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest",
    explanation_metadata=exp_metadata,
    explanation_parameters=ExplanationParameters(
        sampled_shapley_attribution=SampledShapleyAttribution(path_count=25)
    ),
)

model.wait()

print(model.display_name)
print(model.resource_name)

Create an Endpoint resource for the model.

In [None]:
ENDPOINT_DISPLAY_NAME = "taxi_fare_prediction_endpoint"

endpoint = aiplatform.Endpoint.create(
    display_name=ENDPOINT_DISPLAY_NAME, project=PROJECT_ID, location=LOCATION
)

print(endpoint.display_name)
print(endpoint.resource_name)

Save the Endpoint Id for inference.

In [None]:
# endpoint.name holds the ID portion of the resource name printed above
ENDPOINT_ID = endpoint.name

Deploy the model to the created endpoint with the required machine-type.

In [None]:
DEPLOYED_MODEL_NAME = "taxi_fare_prediction_deployment"
MACHINE_TYPE = "n1-standard-2"

# deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=DEPLOYED_MODEL_NAME,
    machine_type=MACHINE_TYPE,
)

model.wait()

print(model.display_name)
print(model.resource_name)

Save the ID of the deployed model. The ID of the deployed model can also be checked using the `endpoint.list_models()` method.

In [None]:
# assumes this endpoint hosts only the model deployed above
DEPLOYED_MODEL_ID = endpoint.list_models()[0].id

## Get explanations from the deployed model
<a name="section-9"></a>

For testing the deployed online model, select two instances from the test data as payload.

In [None]:
# format the top 2 test instances as the request's payload
test_json = {"instances": [X_test.iloc[0].tolist(), X_test.iloc[1].tolist()]}

Call the endpoint with the payload request and parse the response for explanations. The explanations consist of attributions for the independent variables used to train the model, computed with the configured attribution method. In this case, the `Sampled Shapley` method is used, which assigns credit for the outcome to each feature and considers different permutations of the features. This method provides a sampling approximation of exact Shapley values. Further information on the attribution methods for explanations can be found on the [Overview of Explainable AI](https://cloud.google.com/vertex-ai/docs/explainable-ai/overview) page.

In [None]:
features = X_train.columns.to_list()


def plot_attributions(attrs):
    """
    Function to plot the features and their attributions for an instance
    """
    rows = {"feature_name": [], "attribution": []}
    for i, val in enumerate(features):
        rows["feature_name"].append(val)
        rows["attribution"].append(attrs["Input_feature"][i])
    attr_df = pd.DataFrame(rows).set_index("feature_name")
    attr_df.plot(kind="bar")
    plt.show()
    return


def explain_tabular_sample(
    project: str, location: str, endpoint_id: str, instances: list
):
    """
    Function to make an explanation request for the specified payload and generate feature attribution plots
    """
    aiplatform.init(project=project, location=location)

    endpoint = aiplatform.Endpoint(endpoint_id)

    response = endpoint.explain(instances=instances)
    print("#" * 10 + "Explanations" + "#" * 10)
    for explanation in response.explanations:
        print(" explanation")
        # Feature attributions.
        attributions = explanation.attributions

        for attribution in attributions:
            print("  attribution")
            print("   baseline_output_value:", attribution.baseline_output_value)
            print("   instance_output_value:", attribution.instance_output_value)
            print("   output_display_name:", attribution.output_display_name)
            print("   approximation_error:", attribution.approximation_error)
            print("   output_name:", attribution.output_name)
            # iterate directly instead of shadowing the list with its own items
            for output_index in attribution.output_index:
                print("   output_index:", output_index)

            plot_attributions(attribution.feature_attributions)

    print("#" * 10 + "Predictions" + "#" * 10)
    for prediction in response.predictions:
        print(prediction)

    return response


test_json = [X_test.iloc[0].tolist(), X_test.iloc[1].tolist()]
prediction = explain_tabular_sample(PROJECT_ID, LOCATION, ENDPOINT_ID, test_json)
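To build intuition for the Sampled Shapley attributions returned above, the idea can be sketched in a few lines of plain Python: average each feature's marginal contribution over randomly sampled feature orderings. This is an illustrative approximation only, not Vertex AI's implementation. The `sampled_shapley` function, the toy `linear_fare` model, its weights, and the all-zero baseline are assumptions made for the example; `path_count=25` mirrors the value configured earlier.

```python
import random


def sampled_shapley(predict, instance, baseline, path_count, seed=0):
    """Approximate Shapley values by averaging marginal contributions
    over `path_count` random feature orderings."""
    rng = random.Random(seed)
    n = len(instance)
    totals = [0.0] * n
    for _ in range(path_count):
        order = list(range(n))
        rng.shuffle(order)
        # Start from the baseline and switch features on one at a time.
        current = list(baseline)
        previous_output = predict(current)
        for idx in order:
            current[idx] = instance[idx]
            output = predict(current)
            totals[idx] += output - previous_output
            previous_output = output
    return [t / path_count for t in totals]


# Hypothetical linear model: fare = 3 + 2 * miles + 0.01 * seconds
weights = [2.0, 0.01]


def linear_fare(x):
    return 3.0 + sum(w * v for w, v in zip(weights, x))


attributions = sampled_shapley(
    linear_fare, instance=[5.0, 600.0], baseline=[0.0, 0.0], path_count=25
)
```

For a linear model, each attribution reduces to the coefficient times the feature's distance from the baseline, and the attributions sum to the difference between the instance output and the baseline output. Sampling only matters for models with feature interactions.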

## Next Steps

Since the Chicago Taxi-cab dataset is continuously updated, one can perform the same kind of analysis and model training every time a new set of data is available. The date range can also be extended from a week to a month or more, depending on the quality of the data. Most of the steps followed in this notebook would still be valid and can be applied to the new data unless the data is too noisy. The notebook itself can also be scheduled to run at specified times to retrain the model, using the scheduling option of the [Vertex AI Workbench Executor](https://console.cloud.google.com/vertex-ai/workbench/list/executions) feature.

## Clean Up
<a name="section-10"></a>

Delete the resources created in this notebook.

Undeploy the model by specifying the `DEPLOYED_MODEL_ID`.

In [None]:
endpoint.undeploy(deployed_model_id=DEPLOYED_MODEL_ID)

Delete the endpoint resource.

In [None]:
endpoint.delete()

Delete the model resource.

In [None]:
model.delete()

Delete the created Cloud Storage bucket along with its contents.

In [None]:
! gsutil -m rm -r $BUCKET_URI