In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table align="left">

  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-managed-notebook?download_url=https://raw.githubusercontent.com/hyunuk/vertex-ai-samples/experiment/notebooks/official/workbench/spark/spark_sample_notebook.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This notebook tutorial shows how to run Apache Spark ML jobs with Dataproc and BigQuery. It walks through the common stages of a machine learning pipeline: data ingestion, data cleaning, feature engineering, modeling, and evaluation.

### Dataset

The two datasets, [NYC TLC (Taxi and Limousine Commission) Trips](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-tlc-trips) and [NYC Citi Bike Trips](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-citi-bike), are available in [BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data), which provides free querying of up to 1 TB of data each month. They contain trip data for taxis and for Citi Bike, the public bicycle sharing system serving New York City.

### Objective

This notebook tutorial runs an Apache Spark ML job that fetches data from the BigQuery datasets, performs exploratory data analysis, cleans the data, engineers features, trains a model, evaluates the model, reviews the results, and saves the model to a Cloud Storage bucket.

This notebook tutorial performs the following steps:

- Setting up a Google Cloud project and Dataproc cluster.
- Creating a Cloud Storage bucket.
- Configuring the spark-bigquery-connector.
- Ingesting data from BigQuery into Spark DataFrames.
- Performing exploratory data analysis (EDA).
- Cleaning the ingested data.
- Engineering features.
- Training and evaluating gradient-boosted tree regression models.
- Saving a model to Cloud Storage.
- Deleting the resources created for this notebook tutorial.

### Costs 

This tutorial uses billable components of Google Cloud:

* [Vertex AI](https://cloud.google.com/vertex-ai/pricing)
* [Cloud Storage](https://cloud.google.com/storage/pricing)
* [Dataproc](https://cloud.google.com/dataproc/pricing)

You can use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Before you begin

### Set up your Google Cloud project:

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you create an account, you receive a $300 credit towards your compute and storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Notebooks API, Vertex AI API, and Dataproc API](https://console.cloud.google.com/flows/enableapi?apiid=notebooks.googleapis.com,aiplatform.googleapis.com,dataproc&_ga=2.209429842.1903825585.1657549521-326108178.1655322249)

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and inserts the value of Python variables prefixed with `$` into the commands.

### Create a Dataproc cluster

The Spark job executed in this notebook tutorial is compute intensive. Since the job can take a significant amount of time to complete in a standard notebook environment, this notebook tutorial runs on a Dataproc cluster that is created with the Dataproc Component Gateway and Jupyter component installed on the cluster.

**Existing Dataproc with Jupyter cluster?**: If you have a running Dataproc cluster that has the [Component Gateway and Jupyter component installed on the cluster](https://cloud.google.com/dataproc/docs/concepts/components/jupyter#gcloud-command), you can use it in this tutorial. If you plan to use it, skip this step, and go to `Switch your kernel`.

In [None]:
CLUSTER_NAME = "[your-cluster]"  # @param {type: "string"}
CLUSTER_REGION = "[your-region]"  # @param {type: "string"}

if CLUSTER_REGION == "[your-region]":
    CLUSTER_REGION = "us-central1"

print(f"CLUSTER_NAME: {CLUSTER_NAME}")
print(f"CLUSTER_REGION: {CLUSTER_REGION}")

In [None]:
!gcloud dataproc clusters create $CLUSTER_NAME \
    --region=$CLUSTER_REGION \
    --enable-component-gateway \
    --optional-components=JUPYTER

Your `CLUSTER_NAME` must be **unique within your Google Cloud project**. It must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
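
The naming rule above can be checked locally before the cluster is created. The following is a minimal sketch, not an official validator: the helper name `is_valid_cluster_name` and the regular expression are assumptions derived only from the rule as stated here, and uniqueness within the project is not checked.

```python
import re

# Pattern derived from the rule stated above: a lowercase letter, then up to
# 51 more characters (lowercase letters, digits, hyphens), not ending in a hyphen.
CLUSTER_NAME_PATTERN = re.compile(r"[a-z](?:[a-z0-9-]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_cluster_name("spark-ml-demo"))   # True
print(is_valid_cluster_name("Spark-ml-demo"))   # False
print(is_valid_cluster_name("spark-ml-demo-"))  # False
```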

#### Switch your kernel

Your notebook kernel is listed at the top of the notebook page. Your notebook should run on the Python 3 kernel running on your Dataproc cluster.

Select **Kernel > Change Kernel** from the top menu, then select `Python 3 on CLUSTER_NAME: Dataproc cluster in REGION (Remote)`.

### Set your project ID

Run the following cell to get your project ID.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

If the previous command has no output, copy your project ID from the project selector in the [Google Cloud console](https://console.cloud.google.com/). Insert the ID in the `[your-project-id]` placeholder, then run the following command.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type: "string"}

In [None]:
! gcloud config set project $PROJECT_ID -q

### Create a Cloud Storage bucket

The model trained in this tutorial is saved to a Cloud Storage bucket.

#### Region

Before creating a Cloud Storage bucket, re-define the `REGION` variable (when you changed the notebook kernel earlier, previously set variables were deleted).

In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

To avoid name collisions, create a timestamp for the current notebook session and append it to the names of the resources that you create in this tutorial, such as the Cloud Storage bucket or BigQuery dataset.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

Replace the `[your-bucket-name]` placeholder with the name of your Cloud Storage bucket. The name must be unique across all Cloud Storage buckets.

In [None]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}/"

if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = f"{PROJECT_ID}{TIMESTAMP}"
    BUCKET_URI = f"gs://{BUCKET_NAME}/"

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

Confirm your access to the Cloud Storage bucket by displaying the bucket's metadata:

In [None]:
! gsutil ls -L -b $BUCKET_URI

In [None]:
! gsutil ls -al $BUCKET_URI

## Tutorial

### Import required libraries

In [None]:
# A Spark Session is how you interact with Spark SQL to create Dataframes
from pyspark.sql import SparkSession

# PySpark functions
from pyspark.sql.functions import avg, col, count, desc, round, size, udf, to_timestamp, unix_timestamp, broadcast, pandas_udf, PandasUDFType, to_date

# These allow us to create a schema for our data
from pyspark.sql.types import ArrayType, IntegerType, StringType, DoubleType, BooleanType

import geopandas as gpd
from shapely.geometry import Point
import pandas as pd

from pyspark.ml.regression import GBTRegressor
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator


### Initialize the SparkSession

To use Apache Spark with BigQuery, you must include the [spark-bigquery-connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) when you initialize the SparkSession.

In [None]:
# Initialize the SparkSession with the following config.
spark = (
    SparkSession.builder.appName("spark-bigquery-ml-nyc-trips-demo")
    .config(
        "spark.jars",
        "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.2.jar",
    )
    .getOrCreate()
)

### Fetch data from BigQuery

In [None]:
# Load the NYC Yellow Taxi 2018 trips from the BigQuery public dataset.
taxi_df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018")
    .load()
)

# Load the NYC Citi Bike trips from the BigQuery public dataset.
bike_df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.new_york_citibike.citibike_trips")
    .load()
)

### Perform Exploratory Data Analysis (EDA)

As we get started with a new problem, the first step is to gain an understanding of what the dataset contains. EDA is used to derive insights from the data. Data scientists and analysts try to find different patterns, relations, and anomalies in the data using some statistical graphs and other visualization techniques. It allows analysts to understand the data better before making any assumptions.

Check the data types for Taxi dataset first.

In [None]:
taxi_df.printSchema()

Filter out unnecessary columns and check null counts of the fields.

In [None]:
taxi_df = taxi_df.select(
    col("pickup_datetime"),
    col("dropoff_datetime"),
    col("trip_distance"),
    col("fare_amount"),
    col("pickup_location_id"),
    col("dropoff_location_id"),
)
taxi_df.describe().show()

From this summary, you can learn several things:
  - There are over 112 million trip records for Yellow Taxi in 2018.
  - The dataset contains abnormal values, such as nulls and negative values.
  - `pickup_datetime` and `dropoff_datetime` are strings. To use them effectively, they need to be converted.
  - In previous years, the exact latitude and longitude were used for the pickup and dropoff locations. This raised [privacy concerns](https://agkn.wordpress.com/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/), so the dataset now provides `pickup_location_id` and `dropoff_location_id` instead. These IDs correspond to the [NYC Taxi Zones](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc), which are roughly based on the NYC Department of City Planning’s Neighborhood Tabulation Areas (NTAs) and are meant to approximate neighborhoods.
  - The maximum value of `pickup_location_id` and `dropoff_location_id` is shown as `99`. This is likely misleading: both columns are strings, so the maximum is determined by string (lexicographic) ordering rather than numeric ordering.
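
The suspicious `99` maximum can be reproduced with plain Python. This is a small illustration with made-up ID values, not data from the dataset: when IDs are compared as strings, `"99"` sorts after `"263"`.

```python
# Made-up sample of zone IDs, stored as strings like the columns described above.
location_ids = ["7", "99", "263", "100"]

# Strings compare character by character, so "99" beats "263" because "9" > "2".
max_as_string = max(location_ids)

# Casting to int first gives the numeric maximum.
max_as_int = max(int(location_id) for location_id in location_ids)

print(max_as_string)  # 99
print(max_as_int)     # 263
```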

First, convert the time columns. `pickup_datetime` and `dropoff_datetime` are currently strings, so use the `to_timestamp()` and `unix_timestamp()` functions to get each pickup and dropoff datetime as a Unix timestamp.

Unix time is a way of representing time as the number of seconds since `January 1st, 1970 at 00:00:00 UTC`. Compared to the Timestamp type, Unix time can be represented as an integer, making it easier to parse and use across different systems.
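
As a quick illustration of this definition, the standard library can convert a timezone-aware datetime to Unix time and back. This sketch is independent of Spark and only demonstrates the representation.

```python
from datetime import datetime, timezone

# A timezone-aware datetime in UTC.
moment = datetime(2022, 7, 31, 12, 0, 0, tzinfo=timezone.utc)

# Seconds elapsed since 1970-01-01 00:00:00 UTC.
unix_seconds = int(moment.timestamp())
print(unix_seconds)  # 1659268800

# Converting back recovers the original datetime.
round_trip = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(round_trip.isoformat())  # 2022-07-31T12:00:00+00:00
```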

After converting the original `pickup_datetime` and `dropoff_datetime` into `start_time` and `end_time`, we can derive more interesting columns from these two timestamps.

In [None]:
@udf(returnType=BooleanType())
def is_weekdays(timestamp):
    """
    The preprocessing function takes a timestamp and returns whether it falls on a weekday.
    Args:
        timestamp: Unix timestamp, in seconds, that represents the time.
                (e.g., timestamp = 1659268800 represents "Sun, 31 Jul 2022 12:00:00 GMT")
    Returns:
        A boolean value: True if the given timestamp falls on a weekday
        (Monday through Friday, evaluated in UTC), False otherwise.
    """
    # 1970-01-01 was a Thursday, so adding 4 makes 0 = Sunday, ..., 6 = Saturday.
    # A missing timestamp maps to 7, which falls outside the weekday range.
    day_of_week = ((timestamp // 86400) + 4) % 7 if timestamp else 7
    return 0 < day_of_week < 6

@udf(returnType=IntegerType())
def timestamp_to_time_in_minutes(timestamp):
    """
    The preprocessing function takes a timestamp and returns the time of day in minutes.
    Args:
        timestamp: Unix timestamp, in seconds, that represents the time.
                (e.g., timestamp = 1659268800 represents "Sun, 31 Jul 2022 12:00:00 GMT")
    Returns:
        The time of day in minutes since midnight in EST (UTC-05), from 0 to 1439.
                (e.g., if timestamp == 1659268800, returns 420 since it is 7:00 in EST)
    """
    # Shift to UTC-05 (18000 seconds) before taking the modulo, so that times
    # before 05:00 UTC wrap to the previous evening instead of going negative.
    return ((timestamp - 18000) % 86400) // 60 if timestamp else None

@udf(returnType=DoubleType())
def manhattan_dist(start_lat, start_lon, end_lat, end_lon):
    """
    The preprocessing function takes two coordinates(latitude and longitude) and returns the Manhattan distance.
    Args:
        start_lat: The latitude of the start station.
        start_lon: The longitude of the start station.
        end_lat: The latitude of the end station.
        end_lon: The longitude of the end station.
    Returns:
        The Manhattan distance of given two coordinates.
    """
    return abs(end_lon - start_lon) + abs(end_lat - start_lat)
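
The timestamp arithmetic used by these UDFs can be checked without Spark. The following is a plain-Python sketch under stated assumptions: the helper names are hypothetical, the UTC-05 offset is fixed (daylight saving time is ignored), and the offset is subtracted before the modulo so that the minute value always stays within a single day.

```python
SECONDS_PER_DAY = 86400
EST_OFFSET_SECONDS = 5 * 3600  # fixed UTC-05 offset; daylight saving time is ignored


def day_of_week_utc(timestamp: int) -> int:
    """Return 0 for Sunday through 6 for Saturday, evaluated in UTC."""
    # 1970-01-01 was a Thursday (index 4 when Sunday is 0).
    return (timestamp // SECONDS_PER_DAY + 4) % 7


def minutes_since_midnight_est(timestamp: int) -> int:
    """Return minutes since midnight in UTC-05, always in the range 0..1439."""
    return ((timestamp - EST_OFFSET_SECONDS) % SECONDS_PER_DAY) // 60


noon_utc_sunday = 1659268800  # Sun, 31 Jul 2022 12:00:00 GMT
print(day_of_week_utc(noon_utc_sunday))             # 0 (Sunday)
print(minutes_since_midnight_est(noon_utc_sunday))  # 420 (07:00)

early_utc = noon_utc_sunday - 9 * 3600  # 03:00:00 GMT on the same day
print(minutes_since_midnight_est(early_utc))        # 1320 (22:00 the previous evening)
```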

In [None]:
# Convert the type of pickup_datetime from a string to a Unix timestamp.
taxi_df = taxi_df.withColumn('start_time', unix_timestamp(to_timestamp(col('pickup_datetime'))))

# Convert the type of dropoff_datetime from a string to a Unix timestamp.
taxi_df = taxi_df.withColumn('end_time', unix_timestamp(to_timestamp(col('dropoff_datetime'))))

# Flag whether start_time falls on a weekday
taxi_df = taxi_df.withColumn('is_weekdays', is_weekdays(col('start_time')))

# Convert start_time to start_time_in_minute
taxi_df = taxi_df.withColumn('start_time_in_minute', timestamp_to_time_in_minutes(col('start_time')))

# Calculate trip_duration
taxi_df = taxi_df.withColumn('trip_duration', col('end_time') - col('start_time'))

In [None]:
taxi_df.printSchema()

Before going deeper into the Taxi dataset, let's do similar work for the Citi Bike dataset.

In [None]:
bike_df.printSchema()

In [None]:
bike_df = bike_df.select(
    col("tripduration").alias("trip_duration"),
    col("starttime"),
    col("stoptime"),
    col("start_station_latitude"),
    col("start_station_longitude"),
    col("end_station_latitude"),
    col("end_station_longitude"),
    col("usertype"),
)
bike_df.describe().show()

This summary also reveals some interesting information:
  - There are over 53 million trip records for Citi Bike from 2013 to 2018.
  - The dataset contains some abnormal values.
  - `starttime` and `stoptime` are strings. To use them effectively, they need to be converted.
  - Unlike the Taxi dataset, the start and end locations have exact latitudes and longitudes. However, since every bike is parked at a station, these coordinates represent the stations.

In [None]:
# Convert the type of starttime from a string to a Unix timestamp.
bike_df = bike_df.withColumn('starttime', unix_timestamp(to_timestamp(col('starttime'))))

# Convert the type of stoptime from a string to a Unix timestamp.
bike_df = bike_df.withColumn('stoptime', unix_timestamp(to_timestamp(col('stoptime'))))

# Check whether the starttime is a weekday or a weekend.
bike_df = bike_df.withColumn('is_weekdays', is_weekdays(col('starttime')))

# Convert starttime to start_time_in_minute
bike_df = bike_df.withColumn('start_time_in_minute', timestamp_to_time_in_minutes(col('starttime')))

# Calculate the Manhattan distance between start_station and end_station
bike_df = bike_df.withColumn('trip_distance', manhattan_dist('start_station_latitude', 'start_station_longitude', 'end_station_latitude', 'end_station_longitude'))

In [None]:
bike_df.printSchema()
bike_df.describe().show()

#### Visualization
Check the distributions of the numerical columns. In PySpark, visualization is expensive because the data is large. For example, the NYC Taxi dataset in 2018 has more than 112M rows. Therefore, approximately 2% of the data (approx. 2.2M rows) is extracted as a sample, which is enough for a 99% confidence level with a margin of error below 0.1%.
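
The sample-size claim can be sanity-checked with the standard margin-of-error formula for a proportion. This is a rough sketch under assumptions that the text does not state: simple random sampling, the worst-case proportion of 0.5, and a z-score of 2.576 for a 99% confidence level.

```python
import math

population = 112_000_000     # approximate row count quoted above
sample_fraction = 0.02
sample_size = round(population * sample_fraction)

z_99 = 2.576                 # z-score for a 99% confidence level
worst_case_p = 0.5           # proportion that maximizes the variance p * (1 - p)

# Margin of error for an estimated proportion: z * sqrt(p * (1 - p) / n).
margin_of_error = z_99 * math.sqrt(worst_case_p * (1 - worst_case_p) / sample_size)

print(sample_size)
print(f"{margin_of_error:.4%}")
```

Under these assumptions, the margin of error comes out below 0.1%.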

In [None]:
taxi_sample = taxi_df.sample(0.02)
bike_sample = bike_df.sample(0.02)

taxi_pd = taxi_sample.toPandas()
bike_pd = bike_sample.toPandas()

taxi_pd.info()
bike_pd.info()

When a [Decimal Type](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.types.DecimalType.html) column in PySpark is converted to a Pandas DataFrame, it becomes an object-type column, not a float-type column. To visualize these "object" columns, they need to be converted to the float type.

In [None]:
FLOAT_TYPE_COLUMNS = ["trip_distance", "fare_amount", "pickup_location_id", "dropoff_location_id"]

taxi_pd = taxi_pd.drop(columns=['pickup_datetime', 'dropoff_datetime'])
for COLUMN in FLOAT_TYPE_COLUMNS:
    taxi_pd[COLUMN] = taxi_pd[COLUMN].astype(float)
taxi_pd.info()

Iterate through the columns and draw a box plot and a histogram for each.

In [None]:
for column in taxi_pd.columns:
    if column == 'is_weekdays':
        continue
    _, ax = plt.subplots(1, 2, figsize=(10, 4))
    taxi_pd[column].plot(kind="box", ax=ax[0])
    taxi_pd[column].plot(kind="hist", ax=ax[1])
    plt.title(column)
    plt.show()

### Data Cleaning

In [None]:
# Load the NYC Taxi Zones polygons and collect the location IDs of the zones in Manhattan.
gdf_zone = gpd.read_file("https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc?method=export&format=GeoJSON")
gdf_zone['location_id'] = gdf_zone['location_id'].astype('int64')
location_set = set(gdf_zone.loc[gdf_zone['borough'] == "Manhattan", 'location_id'].tolist())

In [None]:
START_2018 = 1514782800 # Jan 1, 2018 00:00:00 EST (UTC-05)
END_2018 = 1546318800 # Jan 1, 2019 00:00:00 EST (UTC-05)

@pandas_udf('long')
def preprocess_zone_id(lat: pd.Series, lon: pd.Series) -> pd.Series:
    point_var = [Point(xy) for xy in zip(lon, lat)]
    gdf_points = gpd.GeoDataFrame(pd.DataFrame({'lat': lat, 'lon': lon}), crs='epsg:4326', geometry=point_var)
    gdf_joined = gpd.sjoin(gdf_points, gdf_zone, how='left')
    return gdf_joined['location_id']

@udf(returnType=BooleanType())
def is_in_manhattan(location_id):
    return location_id in location_set
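
The two epoch constants can be cross-checked with the standard library. This is a minimal sketch, assuming a fixed UTC-05 offset with no daylight saving time. It shows that `1514782800` is midnight at the start of Jan 1, 2018 and `1546318800` is midnight at the start of Jan 1, 2019, both in UTC-05.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # fixed UTC-05 offset, no daylight saving time

start_of_2018 = int(datetime(2018, 1, 1, tzinfo=EST).timestamp())
start_of_2019 = int(datetime(2019, 1, 1, tzinfo=EST).timestamp())

print(start_of_2018)  # 1514782800
print(start_of_2019)  # 1546318800
```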

In [None]:
taxi_df = taxi_df.dropna()
taxi_df = taxi_df.withColumn("dropoff_location_id", taxi_df.dropoff_location_id.cast('int'))
taxi_df = taxi_df.withColumn("pickup_location_id", taxi_df.pickup_location_id.cast('int'))
taxi_df = taxi_df.withColumn("is_start_manhattan", is_in_manhattan(col("pickup_location_id")))
taxi_df = taxi_df.withColumn("is_end_manhattan", is_in_manhattan(col("dropoff_location_id")))

taxi_df = taxi_df.where(
    (col('start_time') >= START_2018)
    & (col('start_time') <= END_2018)
    & (col('end_time') >= START_2018)
    & (col('end_time') <= END_2018)
    & (col('start_time') < col('end_time'))
    & (col('trip_duration') > 0) 
    & (col('trip_duration') < 4000)
    & (col('trip_distance') > 0.2)
    & (col('trip_distance') < 15)
    & (col("pickup_location_id") != col("dropoff_location_id")) 
    & (col("fare_amount") > 0)
    & (col("fare_amount") < 500)
    & (col("is_start_manhattan") == True)
    & (col("is_end_manhattan") == True)
)

taxi_df.printSchema()
# taxi_df.describe().show()

In [None]:
bike_df = bike_df.withColumn('start_zone_id', preprocess_zone_id(bike_df['start_station_latitude'], bike_df['start_station_longitude']))
bike_df = bike_df.withColumn('end_zone_id', preprocess_zone_id(bike_df['end_station_latitude'], bike_df['end_station_longitude']))
bike_df = bike_df.withColumn('is_start_manhattan', is_in_manhattan(col('start_zone_id')))
bike_df = bike_df.withColumn('is_end_manhattan', is_in_manhattan(col('end_zone_id')))

bike_df = bike_df.where(
    (col('trip_duration') > 0)
    & (col('trip_duration') < 7200)
    & (col("start_zone_id") != col("end_zone_id")) 
    & (col('usertype') == "Subscriber")
    & (col('starttime') < col('stoptime'))
    & (col("is_start_manhattan") == True)
    & (col("is_end_manhattan") == True)
).dropna()

bike_df.printSchema()
# bike_df.describe().show()

In [None]:
taxi_feature_cols = [
    "is_weekdays",
    "start_time_in_minute",
    "dropoff_location_id",
    "pickup_location_id",
    "trip_distance",
]

In [None]:
bike_feature_cols = [
    "is_weekdays",
    "start_station_longitude",
    "start_station_latitude",
    "end_station_longitude",
    "end_station_latitude",
    "start_zone_id",
    "end_zone_id",
    "start_time_in_minute",
    "trip_distance",
]

In [None]:
taxi_assembler = VectorAssembler(inputCols=taxi_feature_cols, outputCol='features')
bike_assembler = VectorAssembler(inputCols=bike_feature_cols, outputCol='features')
standard_scaler = StandardScaler(inputCol="features", outputCol="features_scaled")
gbt = GBTRegressor(
    featuresCol="features",
    labelCol="trip_duration",
    predictionCol="pred_trip_duration",
)
evaluator_r2 = RegressionEvaluator(
    labelCol=gbt.getLabelCol(),
    predictionCol=gbt.getPredictionCol(),
    metricName="r2"
)
evaluator_rmse = RegressionEvaluator(
    labelCol=gbt.getLabelCol(),
    predictionCol=gbt.getPredictionCol(),
    metricName="rmse"
)

In [None]:
taxi_transformed_data = taxi_assembler.transform(taxi_df)
taxi_scaled_df = standard_scaler.fit(taxi_transformed_data).transform(taxi_transformed_data)
taxi_scaled_df.select("features", "features_scaled").show(10, truncate=False)
(taxi_training_data, taxi_test_data) = taxi_scaled_df.randomSplit([0.7, 0.3])

# Wall time: 1min 3s

In [None]:
bike_transformed_data = bike_assembler.transform(bike_df)
bike_scaled_df = standard_scaler.fit(bike_transformed_data).transform(bike_transformed_data)
bike_scaled_df.select("features", "features_scaled").show(10, truncate=False)
(bike_training_data, bike_test_data) = bike_scaled_df.randomSplit([0.7, 0.3])

# Wall time: 2min 19s

In [None]:
taxi_gbt_model = gbt.fit(taxi_training_data)
taxi_gbt_predictions = taxi_gbt_model.transform(taxi_test_data)

# Wall time: 7min 12s

In [None]:
bike_gbt_model = gbt.fit(bike_training_data)
bike_gbt_predictions = bike_gbt_model.transform(bike_test_data)

# Wall time: 16min 8s

In [None]:
taxi_gbt_accuracy_r2 = evaluator_r2.evaluate(taxi_gbt_predictions)
taxi_gbt_accuracy_rmse = evaluator_rmse.evaluate(taxi_gbt_predictions)

print(f"Taxi Test GBT R2 Accuracy = {taxi_gbt_accuracy_r2}")
print(f"Taxi Test GBT RMSE Accuracy = {taxi_gbt_accuracy_rmse}")

# RMSE:245.99563606237268

# print(f"Taxi Coefficients: {taxi_model.coefficients}")
# print(f"Taxi Intercept: {taxi_model.intercept}")
# Taxi Test GBT R2 Accuracy = 0.708183954907601 <- 
# Taxi Test GBT R2 Accuracy = 0.6598682899938532
# rmse = 265.5806200025988
# Wall time: 2min 13s
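
The two metrics reported above measure different things, and neither is an accuracy in the classification sense. RMSE is an error in the units of the label (seconds of trip duration here), so lower is better. R2 is the share of label variance explained by the model, so values closer to 1 are better. The following plain-Python sketch computes both by hand on made-up numbers. It is meant only to illustrate the definitions, not to reproduce the Spark evaluator.

```python
import math

# Made-up trip durations in seconds, for illustration only.
actual = [300.0, 600.0, 900.0]
predicted = [330.0, 570.0, 900.0]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# Root mean squared error: square root of the average squared residual.
rmse = math.sqrt(sum(squared_errors) / len(actual))

# R2: one minus the ratio of residual variation to total variation around the mean.
mean_actual = sum(actual) / len(actual)
total_sum_of_squares = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - sum(squared_errors) / total_sum_of_squares

print(f"RMSE = {rmse:.2f}")  # RMSE = 24.49
print(f"R2 = {r2:.2f}")      # R2 = 0.99
```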

In [None]:
bike_gbt_accuracy_r2 = evaluator_r2.evaluate(bike_gbt_predictions)
bike_gbt_accuracy_rmse = evaluator_rmse.evaluate(bike_gbt_predictions)
print(f"Bike Test GBT R2 Accuracy = {bike_gbt_accuracy_r2}")
print(f"Bike Test GBT RMSE Accuracy = {bike_gbt_accuracy_rmse}")

# print(f"bike Coefficients: {bike_gbt_model.coefficients}")
# print(f"bike Intercept: {bike_gbt_model.intercept}")

# Bike Test GBT R2 Accuracy = 0.540752577358068
# Bike Test GBT RMSE Accuracy = 341.8116656204376

# Exclude subscriber
# Bike Test GBT R2 Accuracy = 0.4413798408518055
# Bike Test GBT RMSE Accuracy = 440.703561958477

# change to manhattan distance
# Bike Test GBT R2 Accuracy = 0.5165664510244884
# Bike Test GBT RMSE Accuracy = 350.56915653602664
# Wall time: 4min 29s

# include lat lon
# Bike Test GBT R2 Accuracy = 0.5305022032396006
# Bike Test GBT RMSE Accuracy = 345.53336096800973
# Wall time: 5min 48s


### Save the model to a Cloud Storage path

In [None]:
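# BUCKET_URI already ends with "/", so append the model directory directly.
# The directory name "bike_gbt_model" is an arbitrary choice.
bike_gbt_model.write().overwrite().save(f"{BUCKET_URI}bike_gbt_model")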
# print(bike_gbt_model)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

### Delete Vertex AI Workbench - Managed Notebook

To delete the Vertex AI Workbench managed notebook used in this project, follow the [Clean up](https://cloud.google.com/vertex-ai/docs/workbench/managed/create-managed-notebooks-instance-console-quickstart#clean-up) section of the `Managed notebooks` page.

### Delete a Dataproc Cluster

To delete the Dataproc cluster, follow the [Deleting a cluster](https://cloud.google.com/dataproc/docs/guides/manage-cluster#deleting_a_cluster) section of the `Manage a cluster` page.

In [None]:
# Delete Google Cloud Storage bucket
! gsutil rm -r $BUCKET_URI

In [None]:
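# Delete the BigQuery dataset, if you created one.
# DATASET_NAME is not defined earlier in this notebook; set it before running this cell.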
! bq rm -r -f $DATASET_NAME

After you delete the BigQuery dataset, list the remaining datasets in your project with the following command.

In [None]:
! bq ls