# Big Data Project Notebook

## Dataset

Name: NYC Taxi & Limousine Commission (TLC) Trip Record Data
Source link: [NYC TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

## Main Job

**Objective**: Analyze how tip generosity varies by pickup location and time of day, and identify the top pickup zones with the most generous passengers.

### Plan

#### Setup

- clean up and select relevant columns
- create derived columns (tip percentage, hour of day)
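
The two setup steps above can be sketched in plain Python to make the row-level logic explicit. This is a minimal sketch, not the notebook's Spark code. The column names come from the trip schema, but the sample rows are made up, and the `derive` helper is hypothetical. The definition of tip percentage as `tip_amount / fare_amount * 100` and the two clean-up filters are also assumptions. The first filter keeps only credit-card trips (`payment_type == 1`), since the TLC data dictionary notes that cash tips are not recorded. The second drops non-positive fares.

```python
from datetime import datetime

# Made-up sample rows; only the column names are taken from the trip schema.
trips = [
    {"tpep_pickup_datetime": datetime(2025, 1, 1, 0, 18, 38),
     "PULocationID": 1, "payment_type": 1, "fare_amount": 10.0, "tip_amount": 2.0},
    {"tpep_pickup_datetime": datetime(2025, 1, 1, 8, 5, 0),
     "PULocationID": 2, "payment_type": 2, "fare_amount": 12.0, "tip_amount": 0.0},
    {"tpep_pickup_datetime": datetime(2025, 1, 1, 17, 30, 0),
     "PULocationID": 2, "payment_type": 1, "fare_amount": 0.0, "tip_amount": 1.0},
    {"tpep_pickup_datetime": datetime(2025, 1, 1, 23, 59, 0),
     "PULocationID": 1, "payment_type": 1, "fare_amount": 40.0, "tip_amount": 10.0},
]


def derive(row):
    """Return the row with tip_pct and pickup_hour added, or None if it is filtered out."""
    # Assumed clean-up rules: credit-card trips only, strictly positive fare.
    if row["payment_type"] != 1 or row["fare_amount"] <= 0:
        return None
    return {
        **row,
        "tip_pct": 100.0 * row["tip_amount"] / row["fare_amount"],
        "pickup_hour": row["tpep_pickup_datetime"].hour,
    }


# Keep only the rows that pass the filters.
prepared = [d for d in map(derive, trips) if d is not None]

for row in prepared:
    print(row["PULocationID"], row["pickup_hour"], row["tip_pct"])
```

In Spark the same idea would normally be written as column expressions rather than a per-row function, so that the work stays inside the engine.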

#### Shuffles

- join pickup zone IDs (`PULocationID`) with the zone lookup table
- group by pickup zone and hour of day
- compute the average tip percentage per group
- order by average tip percentage, descending
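
The shuffle steps above can be illustrated without Spark on a handful of rows. This is a minimal sketch under stated assumptions: the input rows, zone IDs, and zone names are made up, the `tip_pct` and `pickup_hour` fields are assumed to have been derived already, and the join is assumed to be an inner join, so trips without a matching zone are dropped.

```python
from collections import defaultdict

# Made-up lookup table: LocationID -> Zone name.
zones = {1: "Zone A", 2: "Zone B"}

# Made-up trips with the derived columns already present.
prepared = [
    {"PULocationID": 1, "pickup_hour": 8, "tip_pct": 20.0},
    {"PULocationID": 1, "pickup_hour": 8, "tip_pct": 10.0},
    {"PULocationID": 2, "pickup_hour": 8, "tip_pct": 25.0},
    {"PULocationID": 2, "pickup_hour": 23, "tip_pct": 5.0},
    {"PULocationID": 99, "pickup_hour": 8, "tip_pct": 50.0},  # no matching zone
]

# Step 1: inner join on the pickup zone ID.
joined = [
    {**row, "Zone": zones[row["PULocationID"]]}
    for row in prepared
    if row["PULocationID"] in zones
]

# Step 2: group tip percentages by (zone, hour).
groups = defaultdict(list)
for row in joined:
    groups[(row["Zone"], row["pickup_hour"])].append(row["tip_pct"])

# Step 3: average tip percentage per group.
averages = [
    (zone, hour, sum(values) / len(values))
    for (zone, hour), values in groups.items()
]

# Step 4: most generous (zone, hour) pairs first.
ranked = sorted(averages, key=lambda item: item[2], reverse=True)

for zone, hour, avg_tip_pct in ranked:
    print(zone, hour, avg_tip_pct)
```

Each step here corresponds to an operation that redistributes rows by key in a distributed engine, which is why the plan lists them as shuffles. The lookup table is small, so a broadcast join is one option worth checking for the first step.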


# Setup

- Import libraries
- Setup Spark
- Load the trip dataset and the zone lookup table

In [3]:
# Import libraries
import findspark

# Make pyspark importable; call this before importing pyspark
findspark.init()

from pyspark.sql import SparkSession

# Set up Spark
spark = (
    SparkSession.builder
    .appName("NYC Taxi Analysis - Sample")
    .getOrCreate()
)

# Health check: Spark can build and display a small DataFrame
df = spark.range(5)
df.show()

# Load sample trip data
trip_file = "sample_data/yellow_tripdata_2025-01.parquet"
df_trips = spark.read.parquet(trip_file)

# Health check: preview rows and schema
df_trips.show(5)
df_trips.printSchema()

# Load zone lookup table
zone_file = "sample_data/taxi_zone_lookup.csv"
df_zones = spark.read.csv(zone_file, header=True, inferSchema=True)

# Health check: preview rows and schema
df_zones.show(5)
df_zones.printSchema()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:18:38|  2025-01-01 00:26:59|              1|      