# <img src="https://lh3.googleusercontent.com/mUTbNK32c_DTSNrhqETT5aQJYFKok2HB1G2nk2MZHvG5bSs0v_lmDm_ArW7rgd6SDGHXo0Ak2uFFU96X6Xd0GQ=w160-h128" width="45" valign="top" alt="BigQuery"> Welcome to Rideshare AI Lakehouse powered by BigQuery Studio

## **Step 01: Overview**

### Summary

This portion of the demo showcases how you can build an AI Lakehouse using Google Cloud's Data and Analytics stack. The demo is for our fictitious company, Rideshare Plus, where we will perform exploratory data analysis (EDA) over the trips data.

All code is located on GitHub: https://goo.gle/dagd

Please check out the other part of the demo, which uses AI, image object analysis, and data at scale.

In this notebook, we focus on BigQuery DataFrames, a new BigQuery feature that provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine. BigQuery DataFrames provides two APIs:

* `bigframes.pandas` provides a pandas-like API for analytics.
* `bigframes.ml` provides a scikit-learn-like API for ML.

Learn more in the [BigQuery DataFrames reference documentation](https://cloud.google.com/python/docs/reference/bigframes/latest).

These are the steps we will perform:
- Load the `rideshare_lakehouse_enriched.biglake_rideshare_trip_iceberg` table into a BigQuery DataFrame.
- Perform some basic cleaning operations, such as dropping columns and removing rows with `NaN` values.
- Use an unsupervised ML algorithm, [K-Means](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.cluster.KMeans), to segment the trips into a number of clusters.
- Select a subset of trips from each cluster and ask an LLM ([PaLM 2](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.PaLM2TextGenerator)) to make sense of the clusters for us.

## **Step 02: Loading and exploring data**

Load the Python packages. Since this notebook is executed from Colab Enterprise, there is no need to `pip install bigframes`, as it is already included.

In [1]:
import pandas as pd
import json
import bigframes.pandas as bpd
from bigframes.ml.cluster import KMeans
from bigframes.ml.llm import PaLM2TextGenerator

Set the BigQuery location.

In [2]:
# bpd.reset_session()
bpd.options.bigquery.location = "US"
session = bpd.get_global_session()

Load data from a BigQuery table into a BigQuery DataFrame.

In [3]:
TABLE_ID = "${project_id}.${bigquery_rideshare_llm_enriched_dataset}.trip"
df = bpd.read_gbq(TABLE_ID)

In [33]:
type(df)

bigframes.dataframe.DataFrame

We can now work with this DataFrame much as we would with a regular pandas DataFrame.

In [4]:
df.head()

| | rideshare_trip_id | pickup_location_id | pickup_datetime | dropoff_location_id | dropoff_datetime | ride_distance | is_airport | payment_type_id | fare_amount | tip_amount | taxes_amount | total_amount | credit_card_number | credit_card_expire_date | credit_card_cvv_code | partition_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8d5add86-faf2-40dd-83bb-30c8e351c462 | 230 | 2020-02-20 17:01:07+00:00 | 68 | 2020-02-20 17:08:36+00:00 | 1.39 | False | 1 | 7.0 | 1.13 | 4.3 | 12.43 | 4712-1545-8669-8650 | 2023-06-08 | 979 | 2020-02-01 |
| 1 | bc8df12b-e253-4563-a6a1-60a005a37519 | 13 | 2020-02-19 19:39:23+00:00 | 40 | 2020-02-19 19:50:14+00:00 | 3.54 | False | 1 | 12.5 | 5.73 | 10.42 | 28.65 | 2466-7804-1267-4276 | 2023-09-13 | 774 | 2020-02-01 |
| 2 | c0d3ee0f-a07d-42ec-9938-4f1ba8c61620 | 170 | 2020-02-20 18:48:13+00:00 | 230 | 2020-02-20 19:02:01+00:00 | 1.34 | False | 1 | 10.0 | 2.14 | 4.3 | 16.440001 | 6034-2304-8303-3285 | 2022-12-25 | 693 | 2020-02-01 |
| 3 | 959945b9-e3dc-4ea2-aab6-88c01076792c | 138 | 2020-02-20 16:49:39+00:00 | 148 | 2020-02-20 17:23:07+00:00 | 9.13 | False | 1 | 31.5 | 7.16 | 4.3 | 42.959999 | 1435-2960-3143-4533 | 2023-03-20 | 534 | 2020-02-01 |
| 4 | 5ee62c5f-f0e3-4689-8b0d-220ccbc33ee1 | 68 | 2020-02-20 19:18:16+00:00 | 90 | 2020-02-20 19:23:58+00:00 | 0.83 | False | 1 | 5.5 | 1.96 | 4.3 | 11.76 | 2626-6341-6870-6351 | 2025-01-26 | 233 | 2020-02-01 |


In [5]:
df.dtypes

rideshare_trip_id                                  string
pickup_location_id                                  Int64
pickup_datetime            timestamp[us, tz=UTC][pyarrow]
dropoff_location_id                                 Int64
dropoff_datetime           timestamp[us, tz=UTC][pyarrow]
ride_distance                                     Float64
is_airport                                        boolean
payment_type_id                                     Int64
fare_amount                                       Float64
tip_amount                                        Float64
taxes_amount                                      Float64
total_amount                                      Float64
credit_card_number                                 string
credit_card_expire_date              date32[day][pyarrow]
credit_card_cvv_code                               string
partition_date                       date32[day][pyarrow]
dtype: object

In [6]:
df.shape

(92065103, 16)

In [7]:
df['pickup_location_id'].value_counts()

pickup_location_id
237    4232359
236    3893134
161    3348262
132    3234829
186    2962252
162    2855960
142    2800124
170    2745446
48     2629145
239    2574292
141    2457604
230    2428247
234    2375138
163    2338341
79     2206273
68     2114802
107    2091523
238    2037461
263    2008511
138    1982481
140    1952300
229    1892468
164    1851651
249    1788509
90     1578441
Name: count, dtype: Int64

In [8]:
df['dropoff_location_id'].value_counts()

dropoff_location_id
236    4034306
237    3735420
161    3054195
170    2653724
141    2504885
239    2504708
142    2447666
48     2336382
162    2320331
238    2215909
230    2198198
234    2065140
68     2035711
163    2017032
263    1995475
186    1967517
140    1947550
229    1885624
79     1842346
107    1765890
164    1679978
246    1539785
75     1468699
249    1460092
262    1408448
Name: count, dtype: Int64
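
Conceptually, `value_counts()` groups rows by value and lists the counts from most to least frequent. The sketch below reproduces that idea with the Python standard library on a small list of made-up location IDs. It is only an illustration of the result's shape, not a description of how BigQuery DataFrames executes the query.

```python
from collections import Counter

# Hypothetical pickup location IDs, invented for illustration only.
pickup_location_ids = [237, 236, 237, 161, 236, 237]

# Count occurrences and list them from most to least frequent,
# which mirrors the ordering of the value_counts() output.
counts = Counter(pickup_location_ids).most_common()
print(counts)  # [(237, 3), (236, 2), (161, 1)]
```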

In [9]:
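# Drop the columns that are not used for clustering, then keep a reproducible 5% random sample.
df = (
    df.drop(columns=['trip_id','customer_id','driver_id','fare_amount','tip_amount','payment_type_id'])
    .sample(frac=0.05, random_state=1)
)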

In [10]:
df.shape

(4603255, 7)

In [11]:
df = df.dropna()

## **Step 03: Training a K-means model**

Fit the K-means model.

In [12]:
N_CLUSTERS = 4
kmeans_model = KMeans(n_clusters=N_CLUSTERS).fit(df)

In [13]:
kmeans_model.get_params()


{'n_clusters': 4}
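
For intuition about what fitting a K-means model involves, here is a minimal pure-Python sketch of Lloyd's algorithm, the textbook K-means procedure. It is not the implementation BigQuery runs behind `bigframes.ml.cluster.KMeans`. The `assign` and `kmeans` helpers, the starting centroids, and the sample trips are all assumptions made for this illustration.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps for a fixed number of rounds."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its previous centroid.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else old
            )
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical (ride_distance, total_amount) pairs, invented for illustration.
trips = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (9.0, 40.0), (10.0, 42.0), (11.0, 44.0)]
centroids, labels = kmeans(trips, centroids=[trips[0], trips[3]])
print(centroids)  # [(2.0, 12.0), (10.0, 42.0)]
print(labels)     # [0, 0, 0, 1, 1, 1]
```

The labels in this sketch are 0-based, whereas the centroid IDs that appear in this notebook start at 1.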

In [14]:
kmeans_model.score(df)

| | davies_bouldin_index | mean_squared_distance |
| --- | --- | --- |
| 0 | 2.035664 | 3.88733 |
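
Both metrics reward compact, well-separated clusters, so lower values are better. The sketch below computes them from their textbook definitions on a tiny hand-made example. The helper names and data are assumptions for illustration, and the exact computation BigQuery performs (for example, on preprocessed features) may differ.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def mean_squared_distance(clusters):
    """Average squared distance from every point to its cluster centroid."""
    total, count = 0.0, 0
    for members in clusters:
        c = centroid(members)
        total += sum(math.dist(p, c) ** 2 for p in members)
        count += len(members)
    return total / count

def davies_bouldin_index(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / distance_ij ratio."""
    centroids = [centroid(m) for m in clusters]
    # Scatter: mean distance from a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, c) for p in m) / len(m)
        for m, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Two hypothetical one-dimensional clusters, invented for illustration.
clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
msd = mean_squared_distance(clusters)
dbi = davies_bouldin_index(clusters)
print(msd, dbi)
```

In this example each point lies 1 unit from its centroid and the centroids are 10 units apart, so the mean squared distance is 1 and the Davies-Bouldin index is (1 + 1) / 10 = 0.2.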


Predict the `centroid_id` (cluster) for each row of the trips DataFrame.

In [15]:
centroid_id = kmeans_model.predict(df)

Append the cluster information to the original dataframe

In [16]:
df['centroid_id'] = centroid_id['CENTROID_ID']


In [17]:
df

| | pickup_location_id | pickup_datetime | dropoff_location_id | dropoff_datetime | ride_distance | is_airport | total_amount | centroid_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 31788803 | 141 | 2020-01-07 00:01:12+00:00 | 256 | 2020-01-07 00:17:26+00:00 | 6.86 | False | 31.0 | 2 |
| 34848326 | 186 | 2022-07-28 20:04:17+00:00 | 4 | 2022-07-28 20:30:46+00:00 | 3.0 | False | 24.950001 | 3 |
| 6289096 | 116 | 2022-02-04 13:05:58+00:00 | 166 | 2022-02-04 13:15:23+00:00 | 1.58 | False | 9.3 | 3 |
| 2050367 | 48 | 2021-10-01 23:43:23+00:00 | 239 | 2021-10-01 23:57:03+00:00 | 3.86 | False | 19.299999 | 3 |
| 13697325 | 144 | 2021-05-22 18:46:21+00:00 | 68 | 2021-05-22 18:58:52+00:00 | 2.42 | False | 15.3 | 4 |
| 3357510 | 113 | 2022-08-19 11:08:31+00:00 | 231 | 2022-08-19 11:16:16+00:00 | 0.89 | False | 11.8 | 3 |
| 43980201 | 137 | 2021-05-20 22:13:00+00:00 | 74 | 2021-05-20 22:26:00+00:00 | 5.73 | False | 23.469999 | 4 |
| 31810276 | 233 | 2022-03-10 12:36:17+00:00 | 100 | 2022-03-10 12:52:00+00:00 | 1.0 | False | 16.0 | 3 |
| 50124132 | 236 | 2022-05-25 08:22:14+00:00 | 162 | 2022-05-25 08:35:36+00:00 | 1.5 | False | 15.3 | 3 |
| 23856329 | 107 | 2022-07-27 21:56:48+00:00 | 137 | 2022-07-27 22:01:48+00:00 | 0.9 | False | 11.15 | 3 |


Save the model to BigQuery

In [40]:
kmeans_model.to_gbq(model_name="${project_id}.${bigquery_rideshare_llm_enriched_dataset}.trips_cluster_model")

KMeans(n_clusters=4)

## **Step 04: LLM (PaLM 2) inference**

Construct an LLM prompt, attaching some sample rows from the DataFrame.

In [60]:
promtp_explain = "Generate an explanation and give a representative name for the following clusters of trips for a ride sharing application in simple terms, answer in JSON format with the following format { clusters_summary: [ cluster_id: <CLUSTER_ID_HERE>, name: <CLUSTER_NAME_HERE>, description: <CLUSTER_DESCRIPTION_HERE>, ...] }"
for i in range(N_CLUSTERS):
    promtp_explain = promtp_explain + f"5 examples trips for Cluster {i+1} are : {df[df['centroid_id'] == i+1].sample(5).to_pandas().to_json(orient='split')}" + ","

In [61]:
promtp_explain

'Generate an explanation and give a representative name for the following clusters of trips for a ride sharing application in simple terms, answer in JSON format with the following format { clusters_summary: [ cluster_id: <CLUSTER_ID_HERE>, name: <CLUSTER_NAME_HERE>, description: <CLUSTER_DESCRIPTION_HERE>, ...] }5 examples trips for Cluster 1 are : {"columns":["pickup_location_id","pickup_datetime","dropoff_location_id","dropoff_datetime","ride_distance","is_airport","total_amount","centroid_id"],"index":[31181016,88429264,67600119,10169417,18550742],"data":[[237,1634722584000,144,1634724139000,3.9700000286,false,23.7999992371,1],[239,1619722351000,142,1619722569000,0.5699999928,false,10.5600004196,1],[262,1633182744000,262,1633182906000,0.5799999833,false,9.1199998856,1],[236,1633620532000,163,1633621432000,1.6100000143,false,16.5599994659,1],[170,1622989470000,261,1622990522000,5.4800000191,false,26.7600002289,1]]},5 examples trips for Cluster 2 are : {"columns":["pickup_location_id
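
The loop above appends one JSON sample per cluster to the instruction text. A minimal standard-library sketch of the same pattern is shown here, with invented sample rows and a hypothetical `build_prompt` helper. It collects the parts in a list and joins them once, which keeps each cluster's examples on a separate line.

```python
import json

def build_prompt(instruction, samples_by_cluster):
    """Join an instruction with one line of JSON examples per cluster."""
    parts = [instruction]
    for cluster_id, rows in sorted(samples_by_cluster.items()):
        parts.append(f"Example trips for cluster {cluster_id}: {json.dumps(rows)}")
    return "\n".join(parts)

# Hypothetical rows, invented for illustration.
samples = {
    1: [{"ride_distance": 0.6, "total_amount": 9.1}],
    2: [{"ride_distance": 6.9, "total_amount": 31.0}],
}
prompt = build_prompt("Name and describe these trip clusters.", samples)
print(prompt)
```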

In [62]:
llm_model = PaLM2TextGenerator(session=session, connection_name='us.vertex-ai')

In [63]:

# Wrap the prompt in a one-row pandas DataFrame with a "prompt" column.
df_prompt = pd.DataFrame({"prompt": [promtp_explain]})
bf_df_prompt = bpd.read_pandas(df_prompt)

Run inference with the LLM.

In [64]:
pred = llm_model.predict(bf_df_prompt,max_output_tokens=1024).to_pandas()

In [79]:
result = pred['ml_generate_text_llm_result'][0].replace("```JSON\n","").replace("```","").replace("\n","")

In [76]:
result = result.replace("```JSON\n","").replace("```","").replace("\n","").replace(" json","")

In [80]:
json_result = json.loads(result)
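
Chained `replace` calls depend on the exact fence text the model happens to return. As a hedged alternative, the sketch below uses a hypothetical `extract_json` helper that strips any Markdown code-fence markers and parses the span between the first `{` and the last `}`. The response string is invented for illustration, and the helper assumes the response contains a single JSON object.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker

def extract_json(text):
    """Parse the JSON object embedded in an LLM response string."""
    # Drop fence markers, with or without a "json"/"JSON" language tag.
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", text, flags=re.IGNORECASE)
    # Keep only the span from the first "{" to the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    return json.loads(cleaned[start : end + 1])

# Hypothetical response text, invented for illustration.
response = (
    FENCE + "JSON\n"
    + '{"clusters_summary": [{"cluster_id": 1, "name": "Short rides"}]}\n'
    + FENCE
)
parsed = extract_json(response)
print(parsed["clusters_summary"][0]["name"])  # Short rides
```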

In [81]:
json_result

{'clusters_summary': [{'cluster_id': 1,
   'name': 'Short rides',
   'description': 'Rides that are less than 2 miles and take less than 10 minutes.'},
  {'cluster_id': 2,
   'name': 'Medium rides',
   'description': 'Rides that are between 2 and 5 miles and take between 10 and 20 minutes.'},
  {'cluster_id': 3,
   'name': 'Long rides',
   'description': 'Rides that are more than 5 miles and take more than 20 minutes.'},
  {'cluster_id': 4,
   'name': 'Airport rides',
   'description': 'Rides that start or end at an airport.'}]}