## 04 - Enrich GTFS Real-Time with Static Data (Gold Layer)

This notebook joins real-time GTFS vehicle updates with static route/trip metadata from King County Metro to create an enriched, analysis-ready dataset in the Gold layer.

### Purpose
To enhance real-time transit records by associating each update with its corresponding route type, direction, and short name, so that downstream analysis can group and filter by route and direction.

### Workflow Summary
- Loads real-time GTFS updates from the Silver layer (today’s data only)
- Loads static `trips` and `routes` tables from the GTFS static feed
- Joins real-time updates with static trip and route metadata
- Appends the enriched data to the partitioned Gold Delta table


In [0]:
# One-time utility: run this cell only the very first time you switch the Gold write from overwrite to append logic.
# dbutils.fs.rm("dbfs:/gold/gtfs_rt_enriched/", recurse=True)

In [0]:
import datetime as dt

# Use today’s date (driver local time) to select the real-time partition
INGESTION_DATE = dt.date.today().isoformat()

# Snapshot date of the static GTFS feed (fixed; kept for reference, the static paths below do not use it)
STATIC_DATE = "2025-05-21"
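
Because the date is taken from the clock, rerunning the notebook for an earlier day would need a manual edit. The following is a minimal sketch of one way to allow that, assuming a hypothetical `override` value (for example a job parameter) that is not part of the original notebook; when it is empty the function falls back to today's date, as the cell above does.

```python
import datetime as dt
from typing import Optional


def resolve_ingestion_date(override: Optional[str] = None,
                           today: Optional[dt.date] = None) -> str:
    """Return the partition date as an ISO string (YYYY-MM-DD).

    `override` is a hypothetical parameter for reruns or backfills.
    `today` exists only so the function can be exercised with a fixed date.
    """
    if override:
        # fromisoformat raises ValueError on input that is not an ISO date
        return dt.date.fromisoformat(override).isoformat()
    return (today or dt.date.today()).isoformat()


# Default behaviour: same value the notebook computes today
INGESTION_DATE = resolve_ingestion_date()
```

Validating the override up front means a malformed date fails immediately instead of silently selecting an empty partition.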


In [0]:
from pyspark.sql import functions as F

# Define Silver-layer input path (real-time)
RT_SILVER_PATH = "dbfs:/silver/gtfs_rt/"

# Define static GTFS paths for trips and routes (fixed locations, not parameterized by date)
TRIPS_SILVER_PATH  = "dbfs:/silver/gtfs_static/trips"
ROUTES_SILVER_PATH = "dbfs:/silver/gtfs_static/routes"

# Target path for writing enriched Gold-layer data
GOLD_PATH = "dbfs:/gold/gtfs_rt_enriched/"


In [0]:
# Load real-time Silver-layer data for today
# Filter out rows without trip_id or missing location data
df_rt = (
    spark.read.format("delta").load(RT_SILVER_PATH)
    .filter(F.col("ingestion_date") == INGESTION_DATE)
    .filter(F.col("trip_id").isNotNull())  # ✅ Required for joining
    .filter(F.col("latitude").isNotNull() & F.col("longitude").isNotNull())  # ✅ Required for location
)

# Load static trip and route metadata
df_trips  = spark.read.format("delta").load(TRIPS_SILVER_PATH)
df_routes = spark.read.format("delta").load(ROUTES_SILVER_PATH)


In [0]:
# Drop the real-time route_id so that the join with trips supplies the only route_id column
df_rt_clean = df_rt.drop("route_id")

# Join real-time updates with trips to get route_id and direction_id
df_joined = (
    df_rt_clean
    .join(df_trips.select("trip_id", "route_id", "direction_id"), on="trip_id", how="left")
)

# Further join with routes to get route type and short name
df_enriched = (
    df_joined
    .join(
        df_routes.select("route_id", "route_short_name", "route_type"),
        on="route_id",
        how="left"
    )
    .withColumn("joined_at", F.current_timestamp())
)
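
The effect of the two left joins can be illustrated without Spark. The sketch below is a plain-Python analogue, not the notebook's implementation: `left_join` is a hypothetical helper and the sample rows are made up. It shows that a real-time row whose `trip_id` has no match in `trips` is kept, with `None` in every metadata column.

```python
def left_join(rows, lookup, key, fields):
    """Left-join a list of dicts against a dict keyed by `key`.

    Unmatched rows are kept and receive None for every field in `fields`.
    Assumes `lookup` has one entry per key; a Spark join would instead
    emit one output row per matching row if the key were duplicated.
    """
    out = []
    for row in rows:
        match = lookup.get(row.get(key), {})
        out.append({**row, **{f: match.get(f) for f in fields}})
    return out


# Made-up sample data for illustration only
rt = [
    {"vehicle_id": "v1", "trip_id": "t1"},
    {"vehicle_id": "v2", "trip_id": "t9"},  # t9 is absent from trips
]
trips = {"t1": {"route_id": "r1", "direction_id": 0}}
routes = {"r1": {"route_short_name": "40", "route_type": 3}}

joined = left_join(rt, trips, "trip_id", ["route_id", "direction_id"])
enriched = left_join(joined, routes, "route_id", ["route_short_name", "route_type"])

# Rows that a null check on route_id / route_short_name would flag
missing = [r for r in enriched
           if r["route_id"] is None or r["route_short_name"] is None]
```

A missing trip propagates through the second join: with no `route_id`, the route lookup cannot match either, so both `route_short_name` and `route_type` stay empty.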


In [0]:
# Sanity check: how many records are missing route metadata after joining?
df_enriched.filter(F.col("route_id").isNull() | F.col("route_short_name").isNull()).count()
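
The count above is only informative if someone looks at it. As a hedged sketch, the check could be turned into a guard that stops the run when too many rows fail to match. The helper names and the 5% threshold below are illustrative assumptions, not values from the original notebook.

```python
def unmatched_ratio(missing: int, total: int) -> float:
    """Share of enriched rows lacking route metadata (0.0 for an empty batch)."""
    return missing / total if total else 0.0


def check_unmatched(missing: int, total: int, max_ratio: float = 0.05) -> float:
    """Return the unmatched ratio, or raise if it exceeds `max_ratio`.

    `max_ratio` is an illustrative threshold; pick one that suits the feed.
    """
    ratio = unmatched_ratio(missing, total)
    if ratio > max_ratio:
        raise ValueError(
            f"{missing}/{total} rows ({ratio:.1%}) are missing route metadata"
        )
    return ratio
```

In the notebook, `missing` would be the result of the filtered count above and `total` would come from `df_enriched.count()`.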


In [0]:
# Preview enriched result with key fields
df_enriched.select(
    "vehicle_id", "route_short_name", "direction_id",
    "latitude", "longitude", "event_ts", "joined_at"
).show(5, truncate=False)


In [0]:
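# Set the ingestion_date partition column explicitly before writing to the Gold layer
df_enriched = df_enriched.withColumn("ingestion_date", F.lit(INGESTION_DATE))

# Deduplicate against rows already in Gold, keyed on (vehicle_id, event_ts).
# Check for the table explicitly rather than catching every exception, so that
# a genuine read failure stops the run instead of silently skipping deduplication.
from delta.tables import DeltaTable

if DeltaTable.isDeltaTable(spark, GOLD_PATH):
    existing_gold = spark.read.format("delta").load(GOLD_PATH).select("vehicle_id", "event_ts")

    df_enriched = df_enriched.join(
        existing_gold,
        on=["vehicle_id", "event_ts"],
        how="left_anti"
    )
else:
    # First run: no Gold table exists yet, so there is nothing to deduplicate against
    print("✓ No existing Gold table found. Proceeding without anti-join.")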

# Append enriched data to Gold Delta table
df_enriched.write \
    .format("delta") \
    .mode("append") \
    .partitionBy("ingestion_date") \
    .save(GOLD_PATH)

print("✓ GTFS-RT Enriched data appended to Gold")
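
The anti-join before the append is what makes a rerun on the same day safe. The following plain-Python sketch mimics that logic with lists and sets; `anti_join` and `append_batch` are hypothetical helpers and the rows are made up. It assumes, as the notebook does, that `(vehicle_id, event_ts)` identifies a record.

```python
KEY_FIELDS = ("vehicle_id", "event_ts")


def anti_join(new_rows, existing_keys, key_fields=KEY_FIELDS):
    """Keep only rows whose key tuple is not already present."""
    return [r for r in new_rows
            if tuple(r[f] for f in key_fields) not in existing_keys]


def append_batch(gold, batch):
    """Append rows of `batch` that are not yet in `gold`; return how many were added."""
    existing = {tuple(r[f] for f in KEY_FIELDS) for r in gold}
    fresh = anti_join(batch, existing)
    gold.extend(fresh)
    return len(fresh)


# Made-up sample data for illustration only
gold = []
batch = [
    {"vehicle_id": "v1", "event_ts": "2025-05-21T08:00:00"},
    {"vehicle_id": "v2", "event_ts": "2025-05-21T08:00:05"},
]

first_run = append_batch(gold, batch)   # both rows are new
second_run = append_batch(gold, batch)  # rerun with the same batch adds nothing
```

Like a `left_anti` join, this only removes rows that already exist in the target. Duplicates inside a single batch would still be appended, so a batch-level deduplication on the same keys may be worth adding if the Silver data can contain them.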


In [0]:
# Print column names to confirm schema
print(df_enriched.columns)
