# NYC Yellow Taxi - Bronze Layer Ingestion

## Overview

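This notebook performs the initial Bronze layer ingestion of NYC Yellow Taxi trip data. It reads the 2009 Parquet files, adds a `year` partition column derived from `Trip_Pickup_DateTime`, and writes the result as a Delta table registered in Unity Catalog as `nyc_taxi.bronze.yellow_taxi_trips`.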

### Bronze Layer Principles

1. **Raw data preservation** - All columns cast to STRING
2. **Schema-on-read** - No type enforcement at ingestion
3. **Full lineage** - Source file tracking via `source_file` column
4. **No DQ filtering** - Data quality checks applied at Silver layer
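
The cells in this notebook do not show the STRING cast or the `source_file` column being applied, so the following is only a minimal sketch of how principles 1 and 3 could be expressed. The helper name `bronze_select_exprs` is hypothetical, and the sketch assumes column names contain no backticks and that Spark SQL's `input_file_name()` is available in the session.

```python
def bronze_select_exprs(columns):
    """Build Spark SQL expressions that cast every column to STRING
    and append a source_file lineage column (hypothetical helper)."""
    exprs = [f"CAST(`{name}` AS STRING) AS `{name}`" for name in columns]
    # input_file_name() is a Spark SQL function returning the file a row was read from
    exprs.append("input_file_name() AS source_file")
    return exprs


# Example with three of the column names that appear in the 2009 files
exprs = bronze_select_exprs(["vendor_name", "Trip_Pickup_DateTime", "Fare_Amt"])
for expr in exprs:
    print(expr)

# With a live session, the expressions could be applied as:
# df_bronze = df.selectExpr(*bronze_select_exprs(df.columns))
```

Building plain SQL expression strings keeps the helper independent of a running Spark session, so it can be checked on its own before being passed to `selectExpr`.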

---


## 1. Setup & Configuration


In [1]:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, to_timestamp

from nyc_taxi_eta.configs.settings import YELLOW_TAXI_DIR

# Create Spark session with Delta Lake and Unity Catalog
# NOTE: The Delta extension is enabled below; remove that line if it conflicts
# with Unity Catalog's catalog routing in your environment
spark = (
    SparkSession.builder.appName("BronzeIngestion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.jars.packages", "io.delta:delta-spark_2.13:4.0.0,io.unitycatalog:unitycatalog-spark_2.13:0.3.0")
    # Unity Catalog configuration
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.nyc_taxi", "io.unitycatalog.spark.UCSingleCatalog")
    .config("spark.sql.catalog.nyc_taxi.uri", "http://localhost:8080")
    .config("spark.sql.catalog.nyc_taxi.token", "")
    # Set nyc_taxi as default catalog
    .config("spark.sql.defaultCatalog", "nyc_taxi")
    # Performance configs
    .config("spark.driver.memory", "80g")
    .config("spark.sql.shuffle.partitions", "12")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)

# Verify catalog is set correctly
print(f"Default catalog: {spark.catalog.currentCatalog()}")
print(f"Spark version: {spark.version}")

Default catalog: nyc_taxi
Spark version: 4.0.0
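
The builder chain above mixes catalog wiring with performance tuning. One possible way to keep the two apart is to hold the settings in plain dictionaries and apply them in a loop. This is a sketch only: `apply_conf`, `CATALOG_CONF`, and `PERF_CONF` are hypothetical names, and the values are copied from the cell above.

```python
# Hypothetical grouping of settings taken from the session cell above
CATALOG_CONF = {
    "spark.sql.catalog.nyc_taxi": "io.unitycatalog.spark.UCSingleCatalog",
    "spark.sql.catalog.nyc_taxi.uri": "http://localhost:8080",
    "spark.sql.catalog.nyc_taxi.token": "",
    "spark.sql.defaultCatalog": "nyc_taxi",
}
PERF_CONF = {
    "spark.sql.shuffle.partitions": "12",
    "spark.sql.adaptive.enabled": "true",
}


def apply_conf(builder, *conf_groups):
    """Call builder.config(key, value) for every entry and return the builder."""
    for group in conf_groups:
        for key, value in group.items():
            builder = builder.config(key, value)
    return builder


# With PySpark installed, usage could look like:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("BronzeIngestion")
# spark = apply_conf(builder, CATALOG_CONF, PERF_CONF).getOrCreate()
```

Because `apply_conf` only relies on a `config(key, value)` method that returns the builder, it can be exercised with a stand-in object before a real session is created.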


## 2. Read Source Data & Write Bronze Table


In [2]:
# Collect the 2009 monthly Parquet files and read them as one DataFrame
paths = [str(p) for p in YELLOW_TAXI_DIR.glob("yellow_tripdata_2009*.parquet")]

df = spark.read.parquet(*paths)
# df.show()
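
If the glob in the cell above matches nothing, the read receives an empty path list. A small guard can report that before Spark is involved. The sketch below is an assumption about how such a check might look; `list_parquet_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def list_parquet_files(directory, pattern):
    """Return sorted string paths matching pattern, or raise if none match."""
    paths = sorted(str(p) for p in Path(directory).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory}")
    return paths


# Possible usage in this notebook:
# paths = list_parquet_files(YELLOW_TAXI_DIR, "yellow_tripdata_2009*.parquet")
# df = spark.read.parquet(*paths)
```

Sorting the paths also makes the order of input files deterministic between runs, which can help when comparing logs.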

In [3]:
spark.catalog.currentCatalog()

'nyc_taxi'

In [10]:
from nyc_taxi_eta.configs.settings import BRONZE_DIR

try:
    start_time = datetime.now()

    catalog = "nyc_taxi"
    schema = "bronze"
    table = "yellow_taxi_trips"
    full_table_name = f"{catalog}.{schema}.{table}"

    # Verify we're in the right catalog
    print(f"Current catalog: {spark.catalog.currentCatalog()}")
    
    # Add year column extracted from Trip_Pickup_DateTime
    df_with_year = df.withColumn(
        "year", year(to_timestamp(col("Trip_Pickup_DateTime")))
    )
    
    # Drop the table if it exists (use the fully qualified name)
    spark.sql(f"DROP TABLE IF EXISTS {full_table_name}")
    
    # Write with the DataFrame API to an explicit three-part table name.
    # The `path` option stores the Delta files at an external location.
    (
        df_with_year.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("year")
        .option("path", f"{BRONZE_DIR}/{table}")  # External location
        .saveAsTable(full_table_name)
    )

    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()

    print(f"✅ Table written: {full_table_name}")
    print(f"⏱️  Duration: {duration:.1f} seconds")
    
    # Verify
    spark.sql(f"SELECT COUNT(*) as cnt FROM {full_table_name}").show()

except Exception as e:
    # Report the failure, then re-raise so the cell does not appear to succeed
    print(f"An error occurred: {e}")
    raise

Current catalog: nyc_taxi
✅ Table written: nyc_taxi.bronze.yellow_taxi_trips
⏱️  Duration: 81.5 seconds
+---------+
|      cnt|
+---------+
|170896055|
+---------+
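
The write cell measures duration with a pair of `datetime.now()` calls. The same measurement can be wrapped in a small context manager so that it is reported even when the block raises. This is a sketch under that assumption; `timed` is a hypothetical helper.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the elapsed wall-clock seconds for the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on failure, so the duration is always reported
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.1f} seconds")


with timed("Bronze write"):
    time.sleep(0.01)  # placeholder for the DataFrame write call
```

`time.perf_counter()` is used here instead of `datetime.now()` because it is a monotonic clock intended for measuring intervals.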



In [11]:
spark.sql(f"SELECT * FROM {full_table_name}").show()

+-----------+--------------------+---------------------+---------------+------------------+------------------+---------+---------+-----------------+------------------+---------+------------+-----------------+---------+-------+-------+---------+-----------------+----+
|vendor_name|Trip_Pickup_DateTime|Trip_Dropoff_DateTime|Passenger_Count|     Trip_Distance|         Start_Lon|Start_Lat|Rate_Code|store_and_forward|           End_Lon|  End_Lat|Payment_Type|         Fare_Amt|surcharge|mta_tax|Tip_Amt|Tolls_Amt|        Total_Amt|year|
+-----------+--------------------+---------------------+---------------+------------------+------------------+---------+---------+-----------------+------------------+---------+------------+-----------------+---------+-------+-------+---------+-----------------+----+
|        VTS| 2009-06-14 23:23:00|  2009-06-14 23:48:00|              1|             17.52|        -73.787442|40.641525|     NULL|             NULL|        -73.980072|40.742963|      Credit|      

In [12]:
# Stop Spark
spark.stop()
print("Done!")

Done!
