# Big Data Project Notebook

## Dataset

Name: NYC Taxi & Limousine Commission (TLC) Trip Record Data
Source link: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

## Main Job

**Objective**: Analyze how tip generosity varies by pickup location and time of day, and identify the top pickup zones with the most generous passengers.

### Plan

#### Setup

- clean up and select relevant columns
- create derived columns (tip percentage, hour of day)

#### Shuffles

- join pickup location IDs with the zone lookup table
- aggregate per pickup zone and hour
- aggregate across hours per pickup zone
- order by average tip percentage


# Setup

- Import libraries
- Setup Spark
- Load the trip records and the zone lookup table

In [3]:
# Import libraries
import findspark
from pyspark.sql import SparkSession

# Setup Spark
findspark.init()
spark = SparkSession.builder \
    .appName("NYC Taxi Analysis - Sample") \
    .getOrCreate()
## health check
df = spark.range(5)
df.show()

# Load sample data
trip_file = "sample_data/yellow_tripdata_2025-01.parquet"
df_trips = spark.read.parquet(trip_file)
## health check
df_trips.show(5)
df_trips.printSchema()
# Load zone lookup table
zone_file = "sample_data/taxi_zone_lookup.csv"
df_zones = spark.read.csv(zone_file, header=True, inferSchema=True)
## health check
df_zones.show(5)
df_zones.printSchema()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:18:38|  2025-01-01 00:26:59|              1|      

# Main Job

## Join Trip Data with Zone Lookup Table

In [4]:
from pyspark.sql.functions import col

# Join trip data with zone lookup to get human-readable pickup zones
df_joined = df_trips.join(
    df_zones,
    df_trips.PULocationID == df_zones.LocationID,
    "left"
)

# Select relevant columns only
df_joined = df_joined.select(
    "PULocationID",
    "Borough",
    "Zone",
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "trip_distance",
    "fare_amount",
    "tip_amount",
    "total_amount",
    "payment_type"
)

# Health check
df_joined.show(5)
print("Joined dataset count:", df_joined.count())


+------------+---------+--------------------+--------------------+---------------------+-------------+-----------+----------+------------+------------+
|PULocationID|  Borough|                Zone|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|tip_amount|total_amount|payment_type|
+------------+---------+--------------------+--------------------+---------------------+-------------+-----------+----------+------------+------------+
|         229|Manhattan|Sutton Place/Turt...| 2025-01-01 00:18:38|  2025-01-01 00:26:59|          1.6|       10.0|       3.0|        18.0|           1|
|         236|Manhattan|Upper East Side N...| 2025-01-01 00:32:40|  2025-01-01 00:35:13|          0.5|        5.1|      2.02|       12.12|           1|
|         141|Manhattan|     Lenox Hill West| 2025-01-01 00:44:04|  2025-01-01 00:46:01|          0.6|        5.1|       2.0|        12.1|           1|
|         244|Manhattan|Washington Height...| 2025-01-01 00:14:27|  2025-01-01 00:20:01|
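
The zone lookup table is small, so this join may not need a full shuffle. Spark can broadcast a small table to every executor and join map-side, although whether it does so depends on the `spark.sql.autoBroadcastJoinThreshold` setting and on the plan it picks. The plain-Python sketch below illustrates the idea, with a dictionary standing in for the broadcast table. The sample rows and the unmatched ID 999 are made up for illustration.

```python
# Plain-Python sketch of a left join against a small lookup table.
# zone_lookup stands in for df_zones; trips stands in for df_trips.
zone_lookup = {
    141: ("Manhattan", "Lenox Hill West"),
    211: ("Manhattan", "SoHo"),
}

trips = [
    {"PULocationID": 141, "tip_amount": 2.0},
    {"PULocationID": 211, "tip_amount": 1.5},  # made-up row
    {"PULocationID": 999, "tip_amount": 0.0},  # hypothetical ID with no match
]

def left_join(trips, lookup):
    """Attach Borough and Zone to each trip; unmatched IDs get None, like a left join."""
    joined = []
    for trip in trips:
        borough, zone = lookup.get(trip["PULocationID"], (None, None))
        joined.append({**trip, "Borough": borough, "Zone": zone})
    return joined

joined = left_join(trips, zone_lookup)
for row in joined:
    print(row)
```

In PySpark, the corresponding hint would be wrapping the small table as `broadcast(df_zones)` (from `pyspark.sql.functions`) inside the `join` call.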

## Compute derived columns

In [5]:
from pyspark.sql.functions import col, hour, round  # Spark's round shadows the Python built-in

# Keep only trips with a positive fare: avoids division by zero and drops negative fares
df_cleaned = df_joined.filter(col("fare_amount") > 0)

# Add derived columns
df_enriched = df_cleaned.withColumn(
    "tip_percentage", round(col("tip_amount") / col("fare_amount") * 100, 2)
).withColumn(
    "hour_of_day", hour(col("tpep_pickup_datetime"))
)

# Health check
df_enriched.select("PULocationID", "Borough", "Zone", "hour_of_day", "tip_percentage").show(5)


+------------+---------+--------------------+-----------+--------------+
|PULocationID|  Borough|                Zone|hour_of_day|tip_percentage|
+------------+---------+--------------------+-----------+--------------+
|         229|Manhattan|Sutton Place/Turt...|          0|          30.0|
|         236|Manhattan|Upper East Side N...|          0|         39.61|
|         141|Manhattan|     Lenox Hill West|          0|         39.22|
|         244|Manhattan|Washington Height...|          0|           0.0|
|         244|Manhattan|Washington Height...|          0|           0.0|
+------------+---------+--------------------+-----------+--------------+
only showing top 5 rows
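
As a sanity check on the derived columns, the arithmetic can be reproduced in plain Python for the first three rows shown above. This is only a sketch: Python's `round` and Spark's `round` break ties differently, which does not matter for these values. Note also that, as far as the TLC data dictionary describes it, `tip_amount` records card tips only, so cash trips show up as 0% tips. Filtering on `payment_type == 1` (credit card) before computing percentages may be worth considering.

```python
from datetime import datetime

# (pickup time, fare_amount, tip_amount) copied from the joined output above
rows = [
    ("2025-01-01 00:18:38", 10.0, 3.0),
    ("2025-01-01 00:32:40", 5.1, 2.02),
    ("2025-01-01 00:44:04", 5.1, 2.0),
]

def derive(pickup, fare, tip):
    """Return (hour_of_day, tip_percentage) for one trip with a positive fare."""
    hour_of_day = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
    tip_percentage = round(tip / fare * 100, 2)
    return hour_of_day, tip_percentage

derived = [derive(*row) for row in rows]
print(derived)
```

The results should match the `hour_of_day` and `tip_percentage` values in the first three rows of the health-check output.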


## Aggregate by Pickup Zone and Hour

In [6]:
from pyspark.sql.functions import avg, count

# Aggregate by pickup zone and hour
df_agg_hour = df_enriched.groupBy( # shuffle!
    "PULocationID", "Borough", "Zone", "hour_of_day"
).agg(
    round(avg("tip_percentage"), 2).alias("avg_tip_pct"),
    count("*").alias("num_trips")
)

# Health check
df_agg_hour.show(10)


+------------+---------+--------------------+-----------+-----------+---------+
|PULocationID|  Borough|                Zone|hour_of_day|avg_tip_pct|num_trips|
+------------+---------+--------------------+-----------+-----------+---------+
|         229|Manhattan|Sutton Place/Turt...|          1|      20.45|      669|
|         265|      N/A|      Outside of NYC|          1|      29.67|       55|
|         152|Manhattan|      Manhattanville|          0|       4.78|       51|
|         211|Manhattan|                SoHo|          2|      16.76|      773|
|         256| Brooklyn|Williamsburg (Sou...|         18|       4.91|      111|
|         243|Manhattan|Washington Height...|         22|        2.2|       50|
|          51|    Bronx|          Co-Op City|         15|        0.0|       14|
|         243|Manhattan|Washington Height...|         16|       3.92|       39|
|          53|   Queens|       College Point|         20|        0.0|        8|
|         159|    Bronx|       Melrose S

## Aggregate Across Hours

In [8]:
from pyspark.sql.functions import avg, round, desc, sum as spark_sum

# Aggregate across all hours per pickup zone
# (avg of avg_tip_pct is an unweighted mean of the hourly averages, not a per-trip mean)
df_agg_zone = df_agg_hour.groupBy(
    "PULocationID", "Borough", "Zone"
).agg(
    round(avg("avg_tip_pct"), 2).alias("avg_tip_pct_overall"),
    spark_sum("num_trips").alias("total_trips")
)

# Health check
df_agg_zone.show(10)
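
Averaging the hourly averages gives every hour the same weight, so a quiet hour with a handful of trips counts as much as a rush hour. If the goal is the zone's average over all trips, a trip-weighted mean is one alternative. In PySpark that could be written as `spark_sum(col("avg_tip_pct") * col("num_trips")) / spark_sum("num_trips")`. The plain-Python sketch below shows the difference on made-up numbers (the hourly values are hypothetical, not taken from the dataset).

```python
# Hypothetical hourly aggregates for one zone: (avg_tip_pct, num_trips)
hourly = [
    (20.0, 900),  # busy hour
    (5.0, 100),   # quiet hour
]

def unweighted_mean(hourly):
    """Mean of the hourly averages, as avg("avg_tip_pct") computes it."""
    return sum(pct for pct, _ in hourly) / len(hourly)

def weighted_mean(hourly):
    """Mean over all trips: each hourly average is weighted by its trip count."""
    total_trips = sum(n for _, n in hourly)
    return sum(pct * n for pct, n in hourly) / total_trips

print(unweighted_mean(hourly))  # 12.5
print(weighted_mean(hourly))    # 18.5
```

Because `avg_tip_pct` is already rounded to two decimals, a weighted mean built from it is approximate. Aggregating directly from `df_enriched` per zone would avoid that rounding.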

## Order by average tip percentage

In [None]:
# Order by descending tip percentage to find top zones
df_top_zones = df_agg_zone.orderBy(desc("avg_tip_pct_overall"))

# Show top 10 tipping zones
df_top_zones.show(10)
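
Zones with very few trips can top this ranking by chance. The hourly output above already includes groups with only 8 and 14 trips. One option is to require a minimum trip count before ranking, which in PySpark would be a `filter(col("total_trips") >= ...)` ahead of `orderBy`. A plain-Python sketch of the idea follows; the threshold of 100 and the zone figures are hypothetical.

```python
MIN_TRIPS = 100  # hypothetical cut-off; tune to the dataset

# Hypothetical per-zone aggregates, shaped like rows of df_agg_zone
zones = [
    {"Zone": "Zone A", "avg_tip_pct_overall": 35.0, "total_trips": 12},
    {"Zone": "Zone B", "avg_tip_pct_overall": 22.5, "total_trips": 48000},
    {"Zone": "Zone C", "avg_tip_pct_overall": 24.1, "total_trips": 3100},
]

def top_zones(zones, min_trips=MIN_TRIPS, limit=10):
    """Drop low-volume zones, then rank by average tip percentage, highest first."""
    eligible = [z for z in zones if z["total_trips"] >= min_trips]
    ranked = sorted(eligible, key=lambda z: z["avg_tip_pct_overall"], reverse=True)
    return ranked[:limit]

ranking = top_zones(zones)
for z in ranking:
    print(z["Zone"], z["avg_tip_pct_overall"], z["total_trips"])
```

Here "Zone A" has the highest percentage but is excluded because it has only 12 trips.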