# Big Data Project Notebook

## Dataset

Name: NYC Taxi & Limousine Commission (TLC) Trip Record Data
Source link: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page?utm_source=chatgpt.com)

## Main Job

**Objective**: Analyze how tip generosity varies by pickup location and time of day, and identify the top pickup zones with the most generous passengers.

### Plan

#### Setup

- clean up and select relevant columns
- create derived columns (tip percentage, hour of day)

#### Shuffles

- join trip pickup location IDs with the zone lookup table
- aggregate the average tip percentage per pickup zone and hour
- aggregate across hours per pickup zone
- order by average tip percentage


# Setup

- Import libraries
- Setup Spark
- Load the trip and zone datasets

In [1]:
# Import Spark
import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Spark session and context
conf = SparkConf().setAppName("NYC Taxi Analysis RDD").setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

trip_file = "sample_data/yellow_tripdata_2025-01.parquet"
zone_file = "sample_data/taxi_zone_lookup.csv"

# Load datasets
df_trips = spark.read.parquet(trip_file).select(
    "PULocationID",
    "tpep_pickup_datetime",
    "fare_amount",
    "tip_amount"
)

# Convert to RDD of tuples: (PULocationID, (pickup_datetime, fare_amount, tip_amount))
rdd_trips = df_trips.rdd.map(lambda row: (
    row['PULocationID'],
    (row['tpep_pickup_datetime'], row['fare_amount'], row['tip_amount'])
))

rdd_zones = sc.textFile(zone_file)

header = rdd_zones.first()
rdd_zones = rdd_zones.filter(lambda line: line != header)

# Convert CSV line into tuple: (LocationID, (Borough, Zone))
def parse_zone(line):
    parts = line.split(",")
    return (int(parts[0]), (parts[1], parts[2]))

rdd_zones = rdd_zones.map(parse_zone)

# Health checks
print("Trips RDD count:", rdd_trips.count())
print("Zones RDD count:", rdd_zones.count())
print("Sample trips:", rdd_trips.take(5))
print("Sample zones:", rdd_zones.take(5))


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/10/30 18:28:54 WARN Utils: Your hostname, rioly, resolves to a loopback address: 127.0.1.1; using 192.168.1.4 instead (on interface wlan0)
25/10/30 18:28:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/30 18:28:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Trips RDD count: 3475226
Zones RDD count: 265




Sample trips: [(229, (datetime.datetime(2025, 1, 1, 0, 18, 38), 10.0, 3.0)), (236, (datetime.datetime(2025, 1, 1, 0, 32, 40), 5.1, 2.02)), (141, (datetime.datetime(2025, 1, 1, 0, 44, 4), 5.1, 2.0)), (244, (datetime.datetime(2025, 1, 1, 0, 14, 27), 7.2, 0.0)), (244, (datetime.datetime(2025, 1, 1, 0, 21, 34), 5.8, 0.0))]
Sample zones: [(1, ('"EWR"', '"Newark Airport"')), (2, ('"Queens"', '"Jamaica Bay"')), (3, ('"Bronx"', '"Allerton/Pelham Gardens"')), (4, ('"Manhattan"', '"Alphabet City"')), (5, ('"Staten Island"', '"Arden Heights"'))]
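
The sample zones output above still carries the literal double quotes from the CSV (for example `'"Manhattan"'`), because `line.split(",")` does not interpret quoting. A minimal sketch of one alternative, using Python's standard `csv` module, follows. The function name `parse_zone_csv` and the sample lines (including their fourth column) are illustrative assumptions, not part of the notebook; all outputs in this notebook were produced with the original `parse_zone`.

```python
import csv

def parse_zone_csv(line):
    """Parse one lookup-table line into (LocationID, (Borough, Zone))."""
    # csv.reader interprets quoted fields, so the surrounding quotes are
    # dropped and a comma inside a quoted field would not split it.
    fields = next(csv.reader([line]))
    return (int(fields[0]), (fields[1], fields[2]))

# Hypothetical lines shaped like the lookup file implied by the output above.
sample_lines = [
    '4,"Manhattan","Alphabet City","Yellow Zone"',
    '1,"EWR","Newark Airport","EWR"',
]
parsed_zones = [parse_zone_csv(line) for line in sample_lines]
print(parsed_zones)
```

If adopted, it would presumably replace `parse_zone` in `rdd_zones.map(...)`, and the quotes would then disappear from every later output.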



# Main Job

## Join Trip Data with Zone Lookup Table

In [2]:
rdd_joined = rdd_trips.join(rdd_zones)

# Reformat for convenience: (PULocationID, Borough, Zone, pickup_dt, fare, tip)
rdd_enriched = rdd_joined.map(lambda x: (
    x[0],                     # PULocationID
    x[1][1][0],               # Borough
    x[1][1][1],               # Zone
    x[1][0][0],               # pickup_datetime
    x[1][0][1],               # fare_amount
    x[1][0][2]                # tip_amount
))

# Health check
print("Enriched trips RDD count:", rdd_enriched.count())
print("Sample enriched trips:", rdd_enriched.take(5))


                                                                                

Enriched trips RDD count: 3475226
Sample enriched trips: [(170, '"Manhattan"', '"Murray Hill"', datetime.datetime(2025, 1, 1, 0, 14, 47), 4.4, 2.35), (170, '"Manhattan"', '"Murray Hill"', datetime.datetime(2025, 1, 1, 0, 27, 19), 5.8, 0.0), (170, '"Manhattan"', '"Murray Hill"', datetime.datetime(2025, 1, 1, 0, 42, 44), 14.9, 3.98), (170, '"Manhattan"', '"Murray Hill"', datetime.datetime(2025, 1, 1, 0, 7, 28), 10.7, 2.83), (170, '"Manhattan"', '"Murray Hill"', datetime.datetime(2025, 1, 1, 0, 19, 2), 12.8, 3.55)]
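
`RDD.join` is an inner join, so a trip whose `PULocationID` is missing from the lookup table would be dropped. The enriched count equals the trips count (3475226), which suggests every pickup ID found a match. The plain-Python sketch below mimics the shape of a joined element, `(key, (left_value, right_value))`, which is what the nested indices such as `x[1][1][0]` unpack. The helper `inner_join`, the trip labels, and the ID `999` are hypothetical.

```python
def inner_join(pairs, lookup):
    """Keep only pairs whose key exists in lookup, like an inner join."""
    return [(key, (value, lookup[key])) for key, value in pairs if key in lookup]

# Hypothetical miniature inputs: (PULocationID, (label, fare, tip))
trips = [
    (170, ("t1", 4.4, 2.35)),
    (170, ("t2", 5.8, 0.0)),
    (999, ("t3", 10.0, 1.0)),   # 999 is not in the lookup table
]
zones = {170: ("Manhattan", "Murray Hill")}

joined = inner_join(trips, zones)
dropped = len(trips) - len(joined)

# Same index path as the notebook: x[1][1][0] is the borough
boroughs = [x[1][1][0] for x in joined]
print(joined, dropped, boroughs)
```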


## Compute derived columns

In [3]:
from datetime import datetime

# Extract the pickup hour from a datetime object or a "YYYY-MM-DD HH:MM:SS" string
def extract_hour(pickup_dt):
    if isinstance(pickup_dt, str):
        return datetime.strptime(pickup_dt, "%Y-%m-%d %H:%M:%S").hour
    else:
        return pickup_dt.hour  # if already datetime object

# Compute tip percentage and pickup hour
rdd_derived = rdd_enriched.map(lambda x: (
    x[0],             # PULocationID
    x[1],             # Borough
    x[2],             # Zone
    extract_hour(x[3]),       # pickup_hour
    x[4],             # fare_amount
    x[5],             # tip_amount
    (x[5] / x[4] * 100) if x[4] != 0 else 0  # tip_pct
))

# RDD structure now:
# (PULocationID, Borough, Zone, pickup_hour, fare_amount, tip_amount, tip_pct)

# Health check
print("Sample derived RDD:", rdd_derived.take(5))


Sample derived RDD: [(170, '"Manhattan"', '"Murray Hill"', 0, 4.4, 2.35, 53.40909090909091), (170, '"Manhattan"', '"Murray Hill"', 0, 5.8, 0.0, 0.0), (170, '"Manhattan"', '"Murray Hill"', 0, 14.9, 3.98, 26.711409395973153), (170, '"Manhattan"', '"Murray Hill"', 0, 10.7, 2.83, 26.448598130841123), (170, '"Manhattan"', '"Murray Hill"', 0, 12.8, 3.55, 27.734374999999993)]
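
The two derived columns can be checked in isolation against the first sample row above (fare 4.4, tip 2.35). This is a small sketch under one assumption: the function names are hypothetical, and the guard treats any non-positive fare as 0%, which matches the optimized job later in this notebook rather than the `!= 0` check used in this cell.

```python
from datetime import datetime

def pickup_hour(pickup_dt):
    """Return the hour (0-23) from a datetime or a 'YYYY-MM-DD HH:MM:SS' string."""
    if isinstance(pickup_dt, datetime):
        return pickup_dt.hour
    return datetime.strptime(pickup_dt, "%Y-%m-%d %H:%M:%S").hour

def tip_percentage(fare_amount, tip_amount):
    """Tip as a percentage of the fare; 0.0 when the fare is not positive."""
    if fare_amount <= 0:
        return 0.0
    return tip_amount / fare_amount * 100

hour = pickup_hour(datetime(2025, 1, 1, 0, 14, 47))
pct = tip_percentage(4.4, 2.35)
print(hour, round(pct, 2))
```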


## Aggregate by Pickup Zone and Hour

For each pickup zone and each hour, I compute the average tip percentage of the trips picked up in that zone during that hour.

In [5]:
# Map to key-value for aggregation
rdd_for_agg = rdd_derived.map(lambda x: (
    (x[0], x[1], x[2], x[3]),  # key = (PULocationID, Borough, Zone, pickup_hour)
    x[6]                        # tip_pct
))

rdd_grouped = rdd_for_agg.groupByKey()

rdd_agg_hour = rdd_grouped.map(lambda x: (
    x[0][0],     # PULocationID
    x[0][1],     # Borough
    x[0][2],     # Zone
    x[0][3],     # pickup_hour
    round(sum(x[1])/len(list(x[1])), 2),  # avg_tip_pct
    len(list(x[1]))                        # num_trips
))

# Health check
print("Sample aggregated by hour RDD (naive):", rdd_agg_hour.take(5))




Sample aggregated by hour RDD (naive): [(170, '"Manhattan"', '"Murray Hill"', 21, 18.73, 5005), (238, '"Manhattan"', '"Upper West Side North"', 7, 16.44, 3044), (238, '"Manhattan"', '"Upper West Side North"', 11, 20.57, 4464), (238, '"Manhattan"', '"Upper West Side North"', 22, 18.3, 1557), (255, '"Brooklyn"', '"Williamsburg (North Side)"', 5, 1.07, 21)]
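
The cell above is labelled naive because `groupByKey` gathers every tip percentage for a key before averaging. The same average can be obtained by carrying only a running `(sum, count)` pair per key, which is the idea the optimized job uses with `reduceByKey`. The plain-Python sketch below compares the two approaches on hypothetical records; the function names and values are illustrative.

```python
from collections import defaultdict

# Hypothetical records: ((PULocationID, pickup_hour), tip_pct)
records = [((170, 21), 20.0), ((170, 21), 10.0), ((238, 7), 15.0)]

def average_by_grouping(pairs):
    """Collect all values per key, then average (groupByKey style)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

def average_by_running_totals(pairs):
    """Keep only (sum, count) per key, then divide (reduceByKey style)."""
    totals = {}
    for key, value in pairs:
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + value, count + 1)
    return {key: s / c for key, (s, c) in totals.items()}

grouped = average_by_grouping(records)
running = average_by_running_totals(records)
print(grouped == running)
```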



## Aggregate Across Hours per Pickup Zone

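For each pickup zone, I compute the mean of the hourly average tip percentages across all hours. This is an unweighted mean: every hour counts equally, regardless of how many trips it contains.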

In [6]:
# Map to key-value for aggregation
rdd_for_agg_zone = rdd_agg_hour.map(lambda x: (
    (x[0], x[1], x[2]),   # key = (PULocationID, Borough, Zone)
    (x[4], x[5])           # (avg_tip_pct, num_trips)
))


rdd_grouped_zone = rdd_for_agg_zone.groupByKey()

rdd_agg_zone = rdd_grouped_zone.map(lambda x: (
    x[0][0],  # PULocationID
    x[0][1],  # Borough
    x[0][2],  # Zone
    round(sum(v[0] for v in x[1]) / len(list(x[1])), 2),  # avg_tip_pct_overall
    sum(v[1] for v in x[1])                                # total_trips
))

# Health check
print("Sample aggregated by zone RDD (overall):", rdd_agg_zone.take(5))


Sample aggregated by zone RDD (overall): [(136, '"Bronx"', '"Kingsbridge Heights"', 0.58, 290), (1, '"EWR"', '"Newark Airport"', 31.73, 377), (87, '"Manhattan"', '"Financial District North"', 15.33, 18053), (121, '"Queens"', '"Hillcrest/Pomonok"', 1.1, 358), (243, '"Manhattan"', '"Washington Heights North"', 4.02, 1163)]
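
Because the zone-level figure is a mean of hourly means, an hour with very few trips weighs as much as a busy one. The sketch below contrasts this with a trip-weighted mean on hypothetical numbers; the function names and the data are assumptions chosen only to make the difference visible.

```python
# Hypothetical hourly rows for one zone: (avg_tip_pct, num_trips)
hourly_rows = [(30.0, 1), (10.0, 99)]

def mean_of_hourly_means(rows):
    """Unweighted mean of the hourly averages (what the cell above computes)."""
    return sum(avg for avg, _ in rows) / len(rows)

def trip_weighted_mean(rows):
    """Mean over all trips, recovered by weighting each hour by its trip count."""
    total_trips = sum(n for _, n in rows)
    return sum(avg * n for avg, n in rows) / total_trips

unweighted = mean_of_hourly_means(hourly_rows)
weighted = trip_weighted_mean(hourly_rows)
print(unweighted, weighted)
```

Which of the two is appropriate depends on the question: the unweighted version describes a typical hour, the weighted version a typical trip.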


## Order by average tip percentage

Once the per-zone averages are computed, I sort the result in descending order to obtain the most generous zones.

In [7]:
rdd_top_zones = rdd_agg_zone.sortBy(lambda x: x[3], ascending=False) # avg_tip_pct_overall

# Health check: show top 10 zones
print("Top 10 pickup zones by tip generosity:")
for row in rdd_top_zones.take(10):
    print(row)


Top 10 pickup zones by tip generosity:
(265, '"N/A"', '"Outside of NYC"', 372.8, 1380)
(29, '"Brooklyn"', '"Brighton Beach"', 136.37, 270)
(23, '"Staten Island"', '"Bloomfield/Emerson Hill"', 115.15, 15)
(1, '"EWR"', '"Newark Airport"', 31.73, 377)
(22, '"Brooklyn"', '"Bensonhurst West"', 28.14, 345)
(199, '"Bronx"', '"Rikers Island"', 25.05, 1)
(264, '"Unknown"', '"N/A"', 22.87, 8141)
(162, '"Manhattan"', '"Midtown East"', 22.18, 117930)
(237, '"Manhattan"', '"Upper East Side South"', 21.53, 163703)
(142, '"Manhattan"', '"Lincoln Square East"', 21.5, 110585)
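
Several entries in this ranking rest on very few trips (Rikers Island has 1, Bloomfield/Emerson Hill has 15), so their averages are noisy. One possible refinement, sketched below on the ten rows printed above, is to require a minimum number of trips before ranking. The threshold `MIN_TRIPS = 1000` and the helper `filter_by_volume` are hypothetical choices, not something the notebook applies.

```python
# The ten rows printed above, with CSV quotes removed for readability:
# (PULocationID, Borough, Zone, avg_tip_pct_overall, total_trips)
top_rows = [
    (265, "N/A", "Outside of NYC", 372.8, 1380),
    (29, "Brooklyn", "Brighton Beach", 136.37, 270),
    (23, "Staten Island", "Bloomfield/Emerson Hill", 115.15, 15),
    (1, "EWR", "Newark Airport", 31.73, 377),
    (22, "Brooklyn", "Bensonhurst West", 28.14, 345),
    (199, "Bronx", "Rikers Island", 25.05, 1),
    (264, "Unknown", "N/A", 22.87, 8141),
    (162, "Manhattan", "Midtown East", 22.18, 117930),
    (237, "Manhattan", "Upper East Side South", 21.53, 163703),
    (142, "Manhattan", "Lincoln Square East", 21.5, 110585),
]

MIN_TRIPS = 1000  # hypothetical threshold, to be tuned for the analysis

def filter_by_volume(rows, min_trips):
    """Keep rows whose total_trips (index 4) reaches the threshold."""
    return [row for row in rows if row[4] >= min_trips]

well_sampled = filter_by_volume(top_rows, MIN_TRIPS)
for row in well_sampled:
    print(row)
```

In the RDD pipeline the equivalent step would presumably be a `filter` on `rdd_agg_zone` before the `sortBy`.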


---

# Optimized Job

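The following code assumes that the setup code has been executed. It broadcasts the small zone table instead of joining against it, and it aggregates with `reduceByKey` over `(sum, count)` pairs instead of `groupByKey`. Two details differ from the main job and can shift results in the last decimals: the tip percentage is set to 0 for any non-positive fare (`fare > 0` instead of `fare != 0`), and hourly averages are not rounded before the zone-level mean is taken.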

In [11]:
from datetime import datetime

# Broadcast the zones table
zones_dict = dict(rdd_zones.collect())
broadcast_zones = sc.broadcast(zones_dict)

# Compute derived columns and aggregate
def enrich_trip(trip):
    pu_id, (pickup_dt, fare, tip) = trip
    
    if isinstance(pickup_dt, datetime):
        hour = pickup_dt.hour
    else:
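        try:
            hour = datetime.strptime(str(pickup_dt), "%Y-%m-%d %H:%M:%S").hour
        except ValueError:
            hour = -1  # sentinel for unparseable timestamps
    try:
        fare = float(fare)
        tip = float(tip)
    except (TypeError, ValueError):
        fare, tip = 0.0, 0.0  # treat missing or malformed amounts as zero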
    
    # Calculate tip percentage
    tip_pct = (tip / fare * 100) if fare > 0 else 0.0
    
    # Get zone info from broadcast variable
    borough, zone = broadcast_zones.value.get(pu_id, ("Unknown", "Unknown"))
    
    # Return key-value pair for aggregation
    # Key: (PULocationID, Borough, Zone, hour)
    # Value: (tip_pct, count)
    return ((pu_id, borough, zone, hour), (tip_pct, 1))

# Helper function to sum tip percentages and counts
def sum_tip_count(a, b):
    return (a[0] + b[0], a[1] + b[1])

# Enrich trips and aggregate by (zone, hour)
rdd_zone_hour_agg = rdd_trips.map(enrich_trip).reduceByKey(sum_tip_count)

# Compute average tip per zone-hour
rdd_zone_hour_avg = rdd_zone_hour_agg.map(
    lambda kv: ((kv[0][0], kv[0][1], kv[0][2]), kv[1][0] / kv[1][1])
    # Key: (PULocationID, Borough, Zone), Value: avg_tip_pct for that hour
)

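# Partition by PULocationID before aggregating across hours.
# The same partitioner (partition count and function object) is passed to
# reduceByKey, and mapValues is used instead of map because map drops the
# partitioner; together these should let the zone-level reduce reuse this
# partitioning instead of shuffling again.
num_partitions = 8

def zone_partitioner(key):
    return hash(key[0])  # key = (PULocationID, Borough, Zone)

rdd_zone_hour_avg = rdd_zone_hour_avg.partitionBy(
    num_partitions,
    partitionFunc=zone_partitioner
).cache()

# Aggregate across all hours per zone to get overall average tip
rdd_zone_agg = rdd_zone_hour_avg.mapValues(lambda avg: (avg, 1)) \
    .reduceByKey(sum_tip_count, num_partitions, zone_partitioner) \
    .mapValues(lambda total: total[0] / total[1])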

# Sort zones by descending average tip percentage
rdd_top_zones = rdd_zone_agg.sortBy(lambda kv: kv[1], ascending=False)

# Show top 10 zones
print("Top 10 pickup zones by tip percentage (optimized):")
for zone_info, avg_tip in rdd_top_zones.take(10):
    pu_id, borough, zone = zone_info
    print(f"{zone} ({borough}, ID: {pu_id}) - Avg Tip: {avg_tip:.2f}%")


Top 10 pickup zones by tip percentage (optimized):
"Outside of NYC" ("N/A", ID: 265) - Avg Tip: 372.80%
"Brighton Beach" ("Brooklyn", ID: 29) - Avg Tip: 136.36%
"Bloomfield/Emerson Hill" ("Staten Island", ID: 23) - Avg Tip: 115.15%
"Newark Airport" ("EWR", ID: 1) - Avg Tip: 31.77%
"Bensonhurst West" ("Brooklyn", ID: 22) - Avg Tip: 28.14%
"Rikers Island" ("Bronx", ID: 199) - Avg Tip: 25.05%
"N/A" ("Unknown", ID: 264) - Avg Tip: 22.86%
"Midtown East" ("Manhattan", ID: 162) - Avg Tip: 22.19%
"Lincoln Square East" ("Manhattan", ID: 142) - Avg Tip: 21.54%
"Upper East Side South" ("Manhattan", ID: 237) - Avg Tip: 21.54%

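
The partitioning step in the optimized job routes each key by `hash(key[0])`, that is, by `PULocationID`. As a rough illustration of what that means, the sketch below assigns a few keys to partitions by taking the hash modulo the number of partitions, which is how a hash-based partitioner typically picks a bucket. The helper `assign_partition` is hypothetical, and the keys are taken from zones that appear in the outputs above.

```python
NUM_PARTITIONS = 8  # same value as num_partitions in the optimized job

def assign_partition(key, num_partitions):
    """Pick a partition index from the PULocationID only (first key element)."""
    # For Python ints, hash() is deterministic across processes, which is
    # not true for strings unless PYTHONHASHSEED is fixed.
    return hash(key[0]) % num_partitions

keys = [
    (170, "Manhattan", "Murray Hill"),
    (238, "Manhattan", "Upper West Side North"),
    (255, "Brooklyn", "Williamsburg (North Side)"),
]
assignment = {key[0]: assign_partition(key, NUM_PARTITIONS) for key in keys}
print(assignment)
```

Since only the integer ID is hashed, all rows of a zone land in the same partition, so the per-zone reduction can run without moving data between partitions once the rows are placed.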