# TLC Trip Data Record Analysis

This notebook analyzes the TLC Trip Data Record. We enrich the trip data with New York weather data and public holiday data, drop unnecessary columns, and create new columns for further analysis.

We save the data in an Amazon S3 bucket partitioned by year, month, day, hour, and vehicle operator, and prepare it for use in visualization tools and machine learning.

In [None]:
# Print the magics in Glue Spark kernel
%help

## 1. Launch the Glue Interactive Sessions development environment

We develop the notebook in the Glue Interactive Sessions development environment. The cell magics below configure the session: idle timeout, Glue version, worker type, and number of workers.

In [11]:
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 4


You are already connected to a glueetl session c8b783ae-a23d-4769-99fa-79d01bd86409.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Current idle_timeout is 2880 minutes.
idle_timeout has been set to 2880 minutes.


You are already connected to a glueetl session c8b783ae-a23d-4769-99fa-79d01bd86409.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Setting Glue version to: 4.0


You are already connected to a glueetl session c8b783ae-a23d-4769-99fa-79d01bd86409.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous worker type: G.1X
Setting new worker type to: G.1X


You are already connected to a glueetl session c8b783ae-a23d-4769-99fa-79d01bd86409.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous number of workers: 4
Setting new number of workers to: 4


In [22]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# pyspark.sql.functions
from pyspark.sql.functions import year, month, dayofmonth, hour, dayofweek, date_format
from pyspark.sql.functions import when
from pyspark.sql.functions import coalesce

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)





In [35]:
import os
# Detect if the code is running in a Glue job
def is_glue_job():
    try:
        args = getResolvedOptions(sys.argv,['JOB_NAME'])
        print("JOB_NAME: ", args['JOB_NAME'])
        return True
    except:
        return False

print("Running in Glue Job: ", is_glue_job())

Running in Glue Job:  False


## 2. TLC Trip Data Record 

The TLC Trip Data Record is a public dataset provided by the New York City Taxi and Limousine Commission (TLC) that contains data on over 1.1 billion taxi trips in New York City. The data is stored in Amazon S3 as Parquet files, with a separate file for each year and month. It is available from 2009 to the present and is updated monthly.

Before we start, we have already downloaded the For-Hire Vehicle (FHV) trip records from 2019-02 to 2023-06 and stored them in the S3 bucket.

### 2.1 Load Trip Record Data

Load the data from the S3 bucket into Spark DataFrames. The Parquet schema changed in 2023-02: `PULocationID` and `DOLocationID` were INT64 before 2023-02 but are INT32 from 2023-02 onward. To keep the data consistent, we convert both columns to INT64.

To do this, we put the data in two folders in the Amazon S3 bucket: `fhvhv` contains the data before 2023-02, and `fhvhv_abnormal` contains the data since 2023-02. We load each folder into its own Spark DataFrame, cast the INT32 columns to INT64, and concatenate the two DataFrames.
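
As a rough illustration of the idea, the sketch below compares two schema versions and lists the columns that need a cast. It uses plain dictionaries rather than Spark schemas, and the helper name `columns_to_cast` is hypothetical; only the columns relevant to the comparison are shown.

```python
# Column types per schema version (subset only, for illustration).
SCHEMA_BEFORE_2023_02 = {"PULocationID": "long", "DOLocationID": "long", "trip_time": "long"}
SCHEMA_SINCE_2023_02 = {"PULocationID": "int", "DOLocationID": "int", "trip_time": "long"}


def columns_to_cast(source: dict, target: dict) -> dict:
    """Return {column: target_type} for each column whose type differs from the target schema."""
    return {
        name: target[name]
        for name, dtype in source.items()
        if name in target and target[name] != dtype
    }


# Columns of the newer files that must be widened to match the older files.
casts = columns_to_cast(SCHEMA_SINCE_2023_02, SCHEMA_BEFORE_2023_02)
print(casts)
```

Each entry in `casts` corresponds to one `withColumn(...).cast(...)` call on the newer DataFrame.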

In [3]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType, LongType

schema_since_2023_02 = StructType([
    StructField("hvfhs_license_num", StringType()),
    StructField("dispatching_base_num", StringType()),
    StructField("request_datetime", TimestampType()),
    StructField("pickup_datetime", TimestampType()),
    StructField("dropoff_datetime", TimestampType()),
    StructField("PULocationID", IntegerType()),
    StructField("DOLocationID", IntegerType()),
    StructField("trip_miles", DoubleType()),
    StructField("trip_time", LongType()),
    StructField("base_passenger_fare", DoubleType()),
    StructField("tolls", DoubleType()),
    StructField("sales_tax", DoubleType()),
    StructField("congestion_surcharge", DoubleType()),
    StructField("tips", DoubleType()),
    StructField("driver_pay", DoubleType()),
    StructField("shared_request_flag", StringType()),
    StructField("shared_match_flag", StringType())
])

schema_before_2023_02 = StructType([
    StructField("hvfhs_license_num", StringType()),
    StructField("dispatching_base_num", StringType()),
    StructField("request_datetime", TimestampType()),
    StructField("pickup_datetime", TimestampType()),
    StructField("dropoff_datetime", TimestampType()),
    StructField("PULocationID", LongType()),
    StructField("DOLocationID", LongType()),
    StructField("trip_miles", DoubleType()),
    StructField("trip_time", LongType()),
    StructField("base_passenger_fare", DoubleType()),
    StructField("tolls", DoubleType()),
    StructField("sales_tax", DoubleType()),
    StructField("congestion_surcharge", DoubleType()),
    StructField("tips", DoubleType()),
    StructField("driver_pay", DoubleType()),
    StructField("shared_request_flag", StringType()),
    StructField("shared_match_flag", StringType())
])




In [4]:
# if running in Glue job, read all data, otherwise, read a sample
if is_glue_job():
    print("Running in Glue job, read all data")
    df_before_2023_02 = spark.read.schema(schema_before_2023_02).parquet("s3://qiaoshi-aws-ml/tlc/fhvhv/")
    df_since_2023_02 = spark.read.schema(schema_since_2023_02).parquet("s3://qiaoshi-aws-ml/tlc/fhvhv_abnormal/")
else:
    print("Running in a development environment, read a sample")
    df_before_2023_02 = spark.read.schema(schema_before_2023_02).parquet("s3://qiaoshi-aws-ml/tlc/fhvhv/").sample(False, 0.01) # 1% sample
    df_since_2023_02 = spark.read.schema(schema_since_2023_02).parquet("s3://qiaoshi-aws-ml/tlc/fhvhv_abnormal/").sample(False, 0.01) # 1% sample


df_since_2023_02 = df_since_2023_02.withColumn("PULocationID", df_since_2023_02["PULocationID"].cast("long"))
df_since_2023_02 = df_since_2023_02.withColumn("DOLocationID", df_since_2023_02["DOLocationID"].cast("long"))
print("Count before 2023-02: ", df_before_2023_02.count())
print("Count since 2023-02: ", df_since_2023_02.count())

Running in a development environment, read a sample
Count before 2023-02:  7835862
Count since 2023-02:  967020


In [5]:
df_records = df_before_2023_02.union(df_since_2023_02)
print("Total counts after union: ", df_records.count())

Total counts after union:  8802882


### 2.2 Data Transformation

In this section, we apply the following transformations:

- `request_datetime` contains nulls; we fill them with `pickup_datetime`.
- `congestion_surcharge` contains nulls; we fill them with 0.
- `trip_time` contains nulls; we recompute it as the difference in seconds between `dropoff_datetime` and `pickup_datetime`.
- Convert `hvfhs_license_num` to the operator name (Juno, Uber, Via, or Lyft), stored in the `rider` column.
- Add the columns `year`, `month`, `day`, `hour`, and `weekday` based on `request_datetime`. We use them to partition the data and to join it with the New York City weather data.
- Convert `trip_miles` from miles to kilometers (`trip_km`), which is easier to read for an international audience.


The `hvfhs_license_num` is the TLC license number of the HVFHS base or business. We convert it to the well-known operator name as follows:
- HV0002: Juno
- HV0003: Uber
- HV0004: Via
- HV0005: Lyft
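
The transformations in this section can be sketched on a single trip in plain Python, which makes the rules easy to check without a Spark session. This is a minimal sketch, not the notebook's implementation: the function name `transform_trip` and the dict-based record are assumptions made for illustration.

```python
from datetime import datetime

MILES_TO_KM = 1.60934
OPERATOR_BY_LICENSE = {"HV0002": "Juno", "HV0003": "Uber", "HV0004": "Via", "HV0005": "Lyft"}


def transform_trip(trip: dict) -> dict:
    """Apply the transformations of this section to one trip given as a dict."""
    out = dict(trip)
    # request_datetime falls back to pickup_datetime when it is missing.
    out["request_datetime"] = trip.get("request_datetime") or trip["pickup_datetime"]
    # A missing congestion surcharge is treated as 0.
    if out.get("congestion_surcharge") is None:
        out["congestion_surcharge"] = 0.0
    # trip_time is the ride duration in seconds.
    out["trip_time"] = (trip["dropoff_datetime"] - trip["pickup_datetime"]).total_seconds()
    # Map the license number to the operator name; unmapped values become "Unknown".
    out["rider"] = OPERATOR_BY_LICENSE.get(trip["hvfhs_license_num"], "Unknown")
    requested = out["request_datetime"]
    out["year"], out["month"] = requested.year, requested.month
    out["day"], out["hour"] = requested.day, requested.hour
    # Same numbering as Spark's dayofweek(): 1 is Sunday, 7 is Saturday.
    out["weekday_n"] = requested.isoweekday() % 7 + 1
    out["trip_km"] = trip["trip_miles"] * MILES_TO_KM
    return out


sample = transform_trip({
    "hvfhs_license_num": "HV0003",
    "request_datetime": None,
    "pickup_datetime": datetime(2019, 2, 1, 0, 3, 13),
    "dropoff_datetime": datetime(2019, 2, 1, 0, 12, 12),
    "congestion_surcharge": None,
    "trip_miles": 2.2,
})
print(sample["rider"], sample["trip_time"], sample["weekday_n"])
```

The sample reuses the pickup and drop-off times of the first row shown in the cell outputs, with `request_datetime` set to `None` to exercise the fallback.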


In [7]:
# fill the request_datetime with pickup_datetime if null
df_records = df_records.withColumn("request_datetime", coalesce(df_records["request_datetime"], df_records["pickup_datetime"]))

# if the value of congestion_surcharge is null, fill it with 0
df_records = df_records.fillna({'congestion_surcharge': 0.0})
print("Total counts after filling null: ", df_records.count())

df_records.show(5)

Total counts after filling null:  8802882
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+
|hvfhs_license_num|dispatching_base_num|   request_datetime|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|sales_tax|congestion_surcharge|tips|driver_pay|shared_request_flag|shared_match_flag|
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+
|           HV0003|              B02884|2019-01-31 23:52:45|2019-02-01 00:03:13|2019-02-01 00:12:12|         174|         254|       2.2|      539|               6.54|  0.0|     0.5

In [8]:
# Recompute `trip_time` in seconds as `dropoff_datetime` - `pickup_datetime`; this also fills null values.
from pyspark.sql.functions import unix_timestamp

df_records = df_records.withColumn("trip_time", (unix_timestamp(df_records["dropoff_datetime"]) - unix_timestamp(df_records["pickup_datetime"])).cast("double"))
print("Total counts after adding trip_time: ", df_records.count())
df_records.show(5)

Total counts after adding trip_time:  8802882
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+
|hvfhs_license_num|dispatching_base_num|   request_datetime|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|sales_tax|congestion_surcharge|tips|driver_pay|shared_request_flag|shared_match_flag|
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+
|           HV0003|              B02884|2019-01-31 23:52:45|2019-02-01 00:03:13|2019-02-01 00:12:12|         174|         254|       2.2|    539.0|               6.54|  0.0|    

In [9]:
# add a new column to indicate the rider
df_records = df_records.withColumn("rider", 
                        when(df_records["hvfhs_license_num"] == "HV0002", "Juno")
                        .when(df_records["hvfhs_license_num"] == "HV0003", "Uber")
                        .when(df_records["hvfhs_license_num"] == "HV0004", "Via")
                        .when(df_records["hvfhs_license_num"] == "HV0005", "Lyft")
                        .otherwise("Unknown")  # Optional: for unmapped values
                        )
print("Total counts after adding rider: ", df_records.count())
df_records.show(5)

Total counts after adding rider:  8802882
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+-----+
|hvfhs_license_num|dispatching_base_num|   request_datetime|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|sales_tax|congestion_surcharge|tips|driver_pay|shared_request_flag|shared_match_flag|rider|
+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+-----+
|           HV0003|              B02884|2019-01-31 23:52:45|2019-02-01 00:03:13|2019-02-01 00:12:12|         174|         254|       2.2|    539.0|               6

In [10]:
# Add columns `year`, `month`, `day`, `hour`, `weekday` from `request_datetime`.
df_records = df_records.withColumn("year", year(df_records["request_datetime"]))
df_records = df_records.withColumn("month", month(df_records["request_datetime"]))
df_records = df_records.withColumn("day", dayofmonth(df_records["request_datetime"]))
df_records = df_records.withColumn("hour", hour(df_records["request_datetime"]))
# dayofweek() returns 1 for Sunday through 7 for Saturday
df_records = df_records.withColumn("weekday_n", dayofweek(df_records["request_datetime"]))
df_records = df_records.withColumn("weekday", date_format(df_records["request_datetime"], "EEEE"))





In [11]:
# Convert the trip_miles from miles to kilometers
df_records = df_records.withColumn("trip_km", df_records["trip_miles"] * 1.60934)




In [12]:
# Write the data to the S3 bucket in parquet format partitioned by year, month, day if running in Glue job
if is_glue_job():
    print("Writing to S3 bucket...")
    df_records.write.mode("append").format("parquet").partitionBy("year", "month", "day").save("s3://qiaoshi-aws-ml/tlc/fhvhv_partitioned/")
    print("Done writing to S3 bucket.")
else:
    print("Not running in Glue job, skip writing to S3 bucket.")


Not running in Glue job, skip writing to S3 bucket.


## 3. New York Weather data

Weather is an important factor in taxi trips, so we join the NY_Weather data with the TLC Trip Data Record. The data is downloaded from [Visual Crossing](https://www.visualcrossing.com/).

We downloaded the weather data from 2019-01 to 2023-06 and stored it in the Amazon S3 bucket. We then join it with the TLC Trip Data Record.

### 3.1 Data Column Definitions

- name: name of the weather station
- datetime: date and time of the weather record
- temp: temperature in Fahrenheit
- feelslike: feels like temperature in Fahrenheit
- dew: dew point in Fahrenheit
- humidity: relative humidity
- precip: precipitation in inches
- precipprob: probability of precipitation
- preciptype: type of precipitation
- snow: snowfall in inches
- snowdepth: snow depth in inches
- windgust: wind gust in miles per hour
- windspeed: wind speed in miles per hour
- winddir: wind direction in degrees
- sealevelpressure: sea level pressure in millibars
- cloudcover: cloud cover percentage
- visibility: visibility in miles
- solarradiation: solar radiation in watts per square meter
- solarenergy: solar energy in megajoules per square meter
- uvindex: UV index
- severerisk: severe weather risk index
- conditions: weather conditions
- icon: weather icon
- stations: number of weather stations reporting data

### 3.2 Load Weather Data

Since the weather data is well formatted, we use a Glue Crawler to create a Glue table and load it into a Spark DataFrame.

The weather data is stored in the Amazon S3 bucket, and the Glue Table name is `tlcny_weather`.

In [24]:
dyf_weather = glueContext.create_dynamic_frame.from_catalog(database='tlc', table_name='tlcny_weather')
dyf_weather.printSchema()

root
|-- name: string
|-- datetime: string
|-- temp: string
|-- feelslike: string
|-- dew: string
|-- humidity: double
|-- precip: double
|-- precipprob: long
|-- preciptype: string
|-- snow: double
|-- snowdepth: double
|-- windgust: string
|-- windspeed: double
|-- winddir: string
|-- sealevelpressure: double
|-- cloudcover: double
|-- visibility: double
|-- solarradiation: string
|-- solarenergy: double
|-- uvindex: long
|-- severerisk: string
|-- conditions: string
|-- icon: string
|-- stations: string


In [25]:
# Convert Dynamic DataFrame to Spark DataFrame
df_weather = dyf_weather.toDF()
print("Count of weather records: ", df_weather.count())

Count of weather records:  38687


### 3.3 Data Transformation

In this section, we apply the following transformations:

1. Drop the fields we do not need: `name`, `dew`, `stations`, `conditions`, `severerisk`.
2. Convert US units to the metric system, which is easier to read for an international audience.
3. Add the columns `year`, `month`, `day`, `hour`.
4. Remove duplicated rows.

The `year`, `month`, `day`, and `hour` columns are derived from the weather `datetime` field. They line up with the columns of the same name derived from `request_datetime` in the TLC Trip Data Record, so they can serve as join keys.
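
The unit conversions can be checked in isolation with a few plain-Python helpers. This is a sketch only: the function names are hypothetical, and it assumes the source values use the US units listed in the column definitions.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def inches_to_cm(inches: float) -> float:
    """Convert a length from inches to centimeters (1 in = 2.54 cm)."""
    return inches * 2.54


def mph_to_mps(mph: float) -> float:
    """Convert a speed from miles per hour to meters per second."""
    return mph * 0.44704


def miles_to_km(miles: float) -> float:
    """Convert a distance from miles to kilometers."""
    return miles * 1.60934


# Spot checks against well-known reference points.
print(fahrenheit_to_celsius(212.0))        # boiling point of water
print(round(inches_to_cm(1.0), 2))         # one inch
print(round(mph_to_mps(10.0), 1))          # ten miles per hour
print(round(miles_to_km(10.0), 1))         # ten miles
```

Note that inches convert to centimeters with the factor 2.54; the factor 30.48 (0.3048 × 100) would convert feet to centimeters.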


In [26]:
# Drop columns: name (location name), dew (dew point), stations (number of stations reporting), conditions (weather conditions), severerisk (severe weather risk)
df_weather = df_weather.drop("name", "dew", "stations", "conditions", "severerisk")

df_weather.show(5)

+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+
|           datetime| temp|feelslike|humidity|precip|precipprob|preciptype|snow|snowdepth|windgust|windspeed|winddir|sealevelpressure|cloudcover|visibility|solarradiation|solarenergy|uvindex|       icon|
+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+
|2019-02-01T00:00:00| -9.4|    -15.4|   47.32|   0.0|         0|          | 0.0|      0.0|        |     13.1|    250|          1030.9|       0.4|      16.0|             0|        0.0|      0|clear-night|
|2019-02-01T01:00:00| -9.4|    -15.9|   47.16|   0.0|         0|          | 0.0|      0.0|    44.6|     14.9|    260|          1030.5|       0.4|      16.0|             0|        0.0| 

In [27]:
from pyspark.sql.functions import round as spark_round  # alias avoids shadowing Python's built-in round()

# Convert US units to metric so the values are easier to read for an international audience.
# Assumes the source columns use the US units listed in section 3.1.
df_weather = df_weather.withColumn("temp_c", spark_round((df_weather["temp"] - 32) * 5/9, 1))
df_weather = df_weather.withColumn("feelslike_c", spark_round((df_weather["feelslike"] - 32) * 5/9, 1))
df_weather = df_weather.withColumn("precip_cm", spark_round(df_weather["precip"] * 2.54, 1)) # inches to centimeters
df_weather = df_weather.withColumn("snow_cm", spark_round(df_weather["snow"] * 2.54, 1)) # inches to centimeters
df_weather = df_weather.withColumn("snowdepth_cm", spark_round(df_weather["snowdepth"] * 2.54, 1)) # inches to centimeters
df_weather = df_weather.withColumn("windgust_mps", spark_round(df_weather["windgust"] * 0.44704, 1))  # mph to meters per second
df_weather = df_weather.withColumn("windspeed_mps", spark_round(df_weather["windspeed"] * 0.44704, 1)) # mph to meters per second
df_weather = df_weather.withColumn("visibility_km", spark_round(df_weather["visibility"] * 1.60934, 1)) # miles to kilometers

df_weather.show(5)


+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+
|           datetime| temp|feelslike|humidity|precip|precipprob|preciptype|snow|snowdepth|windgust|windspeed|winddir|sealevelpressure|cloudcover|visibility|solarradiation|solarenergy|uvindex|       icon|temp_c|feelslike_c|precip_cm|snow_cm|snowdepth_cm|windgust_mps|windspeed_mps|visibility_km|
+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+
|2019-02-01T00:00:00| -9.4|    -15.4|   47.32|   0.0|         0|          | 0.0|      0.0|        |     13.1|    25

In [28]:
# Add columns year, month, day, hour based on field datetime

df_weather = df_weather.withColumn("year", year(df_weather["datetime"]))
df_weather = df_weather.withColumn("month", month(df_weather["datetime"]))
df_weather = df_weather.withColumn("day", dayofmonth(df_weather["datetime"]))
df_weather = df_weather.withColumn("hour", hour(df_weather["datetime"]))

df_weather.show(5)


+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+----+-----+---+----+
|           datetime| temp|feelslike|humidity|precip|precipprob|preciptype|snow|snowdepth|windgust|windspeed|winddir|sealevelpressure|cloudcover|visibility|solarradiation|solarenergy|uvindex|       icon|temp_c|feelslike_c|precip_cm|snow_cm|snowdepth_cm|windgust_mps|windspeed_mps|visibility_km|year|month|day|hour|
+-------------------+-----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+----+-----+---+----+
|2019-02-01T00:00:00| -9.4|    -15.4|   47.32|   0.0|  

In [29]:
from pyspark.sql.functions import col

duplicated_rows = df_weather.groupBy("year", "month", "day", "hour").count().filter(col("count") > 1)
duplicated_rows.show()


+----+-----+---+----+-----+
|year|month|day|hour|count|
+----+-----+---+----+-----+
|2019|   11|  3|   1|    2|
|2022|   11|  6|   1|    2|
|2020|   11|  1|   1|    2|
|2021|   11|  7|   1|    2|
+----+-----+---+----+-----+


In [30]:
# The weather data has duplicated hours (01:00 on the day US daylight saving time ends each November); keep one record per hour
print("Count before removing duplicated records: ", df_weather.count())
df_weather = df_weather.dropDuplicates(["year", "month", "day", "hour"])
print("Count after removing duplicated records: ", df_weather.count())

Count before removing duplicated records:  38687
Count after removing duplicated records:  38683
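
The four duplicated hours all fall at 01:00 on an early November date. That is consistent with the end of US daylight saving time, which falls on the first Sunday of November and makes the local hour from 01:00 to 01:59 occur twice. The sketch below checks the dates with the standard library; the helper name `first_sunday_of_november` is hypothetical.

```python
from datetime import date, timedelta


def first_sunday_of_november(year: int) -> date:
    """Return the first Sunday of November, the day US daylight saving time ends."""
    first = date(year, 11, 1)
    # date.weekday(): Monday is 0 and Sunday is 6.
    return first + timedelta(days=(6 - first.weekday()) % 7)


# (year, month, day) of the duplicated 01:00 records found in the weather data.
duplicated_hours = [(2019, 11, 3), (2020, 11, 1), (2021, 11, 7), (2022, 11, 6)]

matches = [first_sunday_of_november(y) == date(y, m, d) for y, m, d in duplicated_hours]
print(matches)
```

Because both records of a repeated hour share the same `year`, `month`, `day`, and `hour`, `dropDuplicates` on those keys keeps one of them arbitrarily.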


## Public holidays in New York City

We used ChatGPT to list the public holidays in New York City from 2019 to 2023 and asked it to output them in CSV format. The prompt we used is shown below. We saved the result and uploaded it to the S3 bucket.

```
List all public holidays in New York from 2019 to 2023. Output in CSV format. Here is an example:

<example>
Year,Month,Day,Holiday
2019,1,1,New Year's Day
2019,1,21,Martin Luther King Jr. Day
</example>
```

We upload the file to the Amazon S3 bucket with the following command, then use an AWS Glue Crawler to create a table:
```
aws s3 cp ~/Developer/sjtu/data-analytics/data/holidays_ny.csv s3://qiaoshi-aws-ml/tlc/holidays/ny.csv
```
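
Before uploading, the generated CSV can be given a quick sanity check. The sketch below parses the two example rows from the prompt and rejects impossible calendar dates. The function name `parse_holidays` is hypothetical, and the check does not verify that a date is actually a holiday.

```python
import csv
import io
from datetime import date


def parse_holidays(csv_text: str) -> dict:
    """Parse Year,Month,Day,Holiday rows into {date: holiday name}."""
    holidays = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # date() raises ValueError for impossible dates such as month 13 or February 30.
        day = date(int(row["Year"]), int(row["Month"]), int(row["Day"]))
        holidays[day] = row["Holiday"]
    return holidays


example_csv = """Year,Month,Day,Holiday
2019,1,1,New Year's Day
2019,1,21,Martin Luther King Jr. Day
"""

holidays = parse_holidays(example_csv)
print(len(holidays))
```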


### Load Public Holiday Data


In [18]:
dyf_holidays = glueContext.create_dynamic_frame.from_catalog(database='tlc', table_name='holidays')

# Convert Dynamic DataFrame to Spark DataFrame
df_holidays = dyf_holidays.toDF()

df_holidays.show(5)

+----+-----+---+--------------------+
|year|month|day|             holiday|
+----+-----+---+--------------------+
|2019|    1|  1|      New Year's Day|
|2019|    1| 21|Martin Luther Kin...|
|2019|    2| 18|     Presidents' Day|
|2019|    5| 27|        Memorial Day|
|2019|    7|  4|    Independence Day|
+----+-----+---+--------------------+
only showing top 5 rows


## 4. TLC Trip Data Record Enrichment, Cleanup & Exploration

In this section, we will enrich the TLC Trip Data Record with the weather data and public holidays data.
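
A left join keeps every trip, but it preserves the row count only when each join key appears at most once on the right-hand side. The plain-Python sketch below illustrates this with made-up rows; `left_join` is a hypothetical helper, not a Spark API.

```python
def left_join(trips, weather, keys=("year", "month", "day", "hour")):
    """Left-join two lists of dicts on the given key columns."""
    index = {}
    for row in weather:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for trip in trips:
        matches = index.get(tuple(trip[k] for k in keys))
        if not matches:
            joined.append(dict(trip))  # no weather for this hour: keep the trip as-is
            continue
        for weather_row in matches:
            joined.append({**weather_row, **trip})  # one output row per matching weather row
    return joined


# Made-up rows, for illustration only.
trips = [
    {"year": 2019, "month": 11, "day": 3, "hour": 1, "trip_km": 3.5},
    {"year": 2019, "month": 11, "day": 3, "hour": 2, "trip_km": 1.2},
]
unique_weather = [{"year": 2019, "month": 11, "day": 3, "hour": 1, "temp_c": 8.0}]
duplicated_weather = unique_weather + [{"year": 2019, "month": 11, "day": 3, "hour": 1, "temp_c": 7.5}]

print(len(left_join(trips, unique_weather)), len(left_join(trips, duplicated_weather)))
```

Comparing the record counts before and after each join, as the cells in this section do, is a cheap way to detect duplicated keys.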

### 4.1 Enrich with Weather Data

In [20]:
# Join the TLC Trip Data Record with the weather data
print("total records before join with weather data: ", df_records.count())
df_enriched = df_records.join(df_weather, on=['year', 'month', 'day', 'hour'], how='left')


print("total records after join with weather data: ", df_enriched.count())
df_enriched.show(5)

total records before join with weather data:  8802882
total records after join with weather data:  8805238
+----+-----+---+----+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+-----+---------+--------+------------------+-------------------+----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+
|year|month|day|hour|hvfhs_license_num|dispatching_base_num|   request_datetime|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|sales_tax|congestion_surcharge|tips|driver_pay|shared_request_flag|shared_match_flag|

### 4.2 Enrich with Public Holiday Data

In [21]:
# Join the TLC Trip Data Record with the public holiday data
df_enriched = df_enriched.join(df_holidays, on=['year', 'month', 'day'], how='left').dropDuplicates()
print("total records after join with holiday data: ", df_enriched.count())

# Add a column is_holiday of type boolean
df_enriched = df_enriched.withColumn("is_holiday", df_enriched["holiday"].isNotNull())
print("total records after adding is_holiday column: ", df_enriched.count())

df_enriched.show(3)

total records after join with holiday data:  8805238
+----+-----+---+----+-----------------+--------------------+-------------------+-------------------+-------------------+------------+------------+----------+---------+-------------------+-----+---------+--------------------+----+----------+-------------------+-----------------+-----+---------+--------+------------------+-------------------+----+---------+--------+------+----------+----------+----+---------+--------+---------+-------+----------------+----------+----------+--------------+-----------+-------+-----------+------+-----------+---------+-------+------------+------------+-------------+-------------+-------+----------+
|year|month|day|hour|hvfhs_license_num|dispatching_base_num|   request_datetime|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|trip_miles|trip_time|base_passenger_fare|tolls|sales_tax|congestion_surcharge|tips|driver_pay|shared_request_flag|shared_match_flag|rider|weekday_n| weekday|          

### 4.3 Data Cleanup

After the joins, the DataFrame contains redundant columns: the raw license number and the original columns that now have converted equivalents. We drop them.

In [None]:
# Drop redundant columns
df_enriched = df_enriched.drop("hvfhs_license_num", "datetime", "temp", "feelslike", "precip", "snow", "snowdepth", "windgust", "windspeed", "visibility", "trip_miles")

df_enriched.printSchema()

## 5. Data Exploration

We conduct a preliminary analysis of the data to answer the following questions:

- Is there a correlation between the weather and taxi trips?
- How do the columns correlate with each other?
- What is the busiest day of 

### 5.1 Column Relevance

Compute the correlation between each pair of columns.
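
As a reference for what a correlation matrix entry means, the sketch below computes the Pearson correlation coefficient of two columns in plain Python. The function name `pearson` and the sample values are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
precip_cm = [0.0, 0.2, 0.5, 1.0]
trips_per_hour = [100, 110, 125, 150]

print(round(pearson(precip_cm, trips_per_hour), 3))
```

In PySpark, `DataFrame.stat.corr(col1, col2)` computes this coefficient for one pair of numeric columns at a time.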

In [None]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# # Compute the correlation matrix
# corr = df_enriched.corr()

# # Plot the correlation matrix as a heatmap
# plt.figure(figsize=(12, 10))
# sns.heatmap(corr, cmap='coolwarm', annot=True, fmt=".2f")
# plt.title("Correlation Matrix")
# plt.show()


### 5.2 XXXX



## 6. Save the enriched TLC Trip Data Record

After the ETL, we save the enriched TLC Trip Data Record to the S3 bucket. We will use it for visualization in BI tools (e.g., Amazon QuickSight) and for building machine learning models in Amazon SageMaker.

We will try both zero-code machine learning using Amazon SageMaker Canvas, and code-based machine learning using Amazon SageMaker Studio.
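
When Spark writes with `partitionBy`, each partition becomes a directory named `column=value`. The sketch below builds the prefix a given record would land under; `partition_prefix` is a hypothetical helper and the bucket name is a placeholder.

```python
def partition_prefix(base: str, record: dict, partition_cols=("year", "month", "day")) -> str:
    """Build the Hive-style directory prefix for a record, e.g. base/year=2019/month=2/day=1/."""
    parts = "/".join(f"{col}={record[col]}" for col in partition_cols)
    return f"{base.rstrip('/')}/{parts}/"


# Placeholder bucket and path, for illustration only.
prefix = partition_prefix("s3://example-bucket/tlc/final/", {"year": 2019, "month": 2, "day": 1})
print(prefix)
```

Tools that understand this layout can skip whole directories when a query filters on the partition columns.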


In [None]:
print("Count of records before saving: ", df_enriched.count())

In [None]:
# Save the data to Amazon S3 bucket, partitioned by year, month, day
df_enriched.write.partitionBy("year", "month", "day").parquet("s3://qiaoshi-aws-ml/tlc/fhvhv_final/")

In [None]:
# Commit the Job to AWS Glue, complete the job.
job.commit()

print('Data processing completed.')

## 7. Finance Data

Uber and Lyft have been public companies for a while, so their financial data is available from the public market. We can use it to analyze the financial performance of the companies and compare it with the TLC Trip Data Record.

As of today, neither Uber nor Lyft has split its stock; Juno and Via are not public companies. We will use the stock price to analyze the financial performance of the companies.