## 10 - Transform GTFS Static Data (Silver Layer)

This notebook processes static GTFS tables from the Bronze layer into a cleaned Silver format.

### Purpose
To clean and enrich the core GTFS reference tables (`stops`, `routes`, and `trips`) by casting columns to explicit types, adding derived columns, and inspecting the resulting schemas.

### Workflow Summary
- Reads static GTFS data (stops, routes, trips) from the Bronze layer
- Applies column type casting and adds new columns (e.g., `location`, `ingestion_ts`)
- Writes the cleaned data into Delta tables in the Silver layer for consistent downstream usage


In [0]:
# Set the static GTFS ingestion date (must match the date used in the Bronze layer)
from pyspark.sql import functions as F

# Ingestion date as YYYY-MM-DD; despite the name, TODAY is a fixed value, not the current date
TODAY = "2025-05-21"

# Set Bronze input and Silver output paths
BRONZE_PATH = f"dbfs:/bronze/gtfs_static/{TODAY}"
SILVER_PATH = f"dbfs:/silver/gtfs_static/{TODAY}"
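
Because the ingestion date is a hand-edited string, a typo would point both paths at a partition that does not exist. A minimal sketch of a guard, assuming the date is always written as `YYYY-MM-DD`; `layer_paths` is a hypothetical helper and not part of the original notebook:

```python
import datetime as dt


def layer_paths(ingest_date: str, dataset: str = "gtfs_static") -> dict:
    """Validate a date string and build the Bronze/Silver paths for it.

    Hypothetical helper: the layout mirrors BRONZE_PATH and SILVER_PATH.
    """
    # strptime raises ValueError if the string is not a real year-month-day date
    parsed = dt.datetime.strptime(ingest_date, "%Y-%m-%d").date()
    day = parsed.isoformat()
    return {
        "bronze": f"dbfs:/bronze/{dataset}/{day}",
        "silver": f"dbfs:/silver/{dataset}/{day}",
    }


paths = layer_paths("2025-05-21")
print(paths["bronze"])
print(paths["silver"])
```

Failing early on a malformed date is cheaper than discovering an empty read further down the notebook.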


In [0]:
# Transform 'stops' table — cast lat/lon to double, create location struct
df_stops = (
    spark.read.format("delta").load(f"{BRONZE_PATH}/stops")
    .withColumn("stop_lat", F.col("stop_lat").cast("double"))
    .withColumn("stop_lon", F.col("stop_lon").cast("double"))
    .withColumn("location", F.struct("stop_lat", "stop_lon"))
    .withColumn("ingestion_ts", F.current_timestamp())
)

# Write to Silver layer
(
    df_stops.write
    .format("delta")
    .mode("overwrite")
    .save(f"{SILVER_PATH}/stops")
)
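
When Spark's ANSI mode is off, casting an unparseable string to `double` yields `null` instead of raising an error, so bad coordinates can reach Silver unnoticed. The sketch below shows one way to flag missing or out-of-range values for a single coordinate pair. `coordinate_issues` and the sample rows are illustrative assumptions, not part of the original notebook:

```python
import math
from typing import List, Optional


def coordinate_issues(lat: Optional[float], lon: Optional[float]) -> List[str]:
    """Return the problems found in one (lat, lon) pair; an empty list means OK."""
    issues = []
    for name, value, limit in (("stop_lat", lat, 90.0), ("stop_lon", lon, 180.0)):
        if value is None or math.isnan(value):
            issues.append(f"{name} is missing")
        elif abs(value) > limit:
            issues.append(f"{name}={value} is outside [-{limit}, {limit}]")
    return issues


# Made-up sample rows for illustration only: (stop_id, stop_lat, stop_lon)
samples = [("A", 52.52, 13.40), ("B", None, 13.40), ("C", 95.0, 200.0)]
report = {stop_id: coordinate_issues(lat, lon) for stop_id, lat, lon in samples}
print(report)
```

In the notebook, the same function could be applied to the rows returned by `df_stops.select("stop_id", "stop_lat", "stop_lon").collect()`. That is reasonable only because static `stops` tables are usually small. For larger data, an equivalent `filter` on the DataFrame would be the better fit.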


In [0]:
# Quick preview of Silver 'stops' table
spark.read.format("delta").load(f"{SILVER_PATH}/stops").show(5)


In [0]:
# View schema of 'stops' to confirm lat/lon + location struct
spark.read.format("delta").load(f"{SILVER_PATH}/stops").printSchema()
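
Reading `printSchema()` output by eye works interactively but does not fail a scheduled run. A minimal sketch of a programmatic check, assuming the expected types listed in `EXPECTED_STOPS` (an assumption to adjust to your feed). `schema_mismatches` is a hypothetical helper that takes a list of `(column, type)` pairs in the same shape as PySpark's `DataFrame.dtypes`:

```python
from typing import Dict, List, Tuple


def schema_mismatches(
    actual: List[Tuple[str, str]], expected: Dict[str, str]
) -> Dict[str, str]:
    """Return {column: description} for each missing or wrongly typed column."""
    found = dict(actual)
    problems = {}
    for column, wanted in expected.items():
        if column not in found:
            problems[column] = "missing"
        elif found[column] != wanted:
            problems[column] = f"expected {wanted}, got {found[column]}"
    return problems


# Assumed expectations for the Silver 'stops' table
EXPECTED_STOPS = {
    "stop_lat": "double",
    "stop_lon": "double",
    "ingestion_ts": "timestamp",
}

# Hand-written stand-in for df.dtypes so the sketch runs without Spark
example_dtypes = [("stop_id", "string"), ("stop_lat", "double"), ("stop_lon", "string")]
mismatches = schema_mismatches(example_dtypes, EXPECTED_STOPS)
print(mismatches)
```

Against the real table, the call would look like `schema_mismatches(df.dtypes, EXPECTED_STOPS)`, with a non-empty result treated as a failure.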


In [0]:
# Show stop_id and full location struct
df_silver = spark.read.format("delta").load(f"{SILVER_PATH}/stops")
df_silver.select("stop_id", "location").show(5, truncate=False)


In [0]:
# View separated lat/lon from location struct
df_silver.select(
    "stop_id",
    "location.stop_lat",
    "location.stop_lon"
).show(5)


In [0]:
# Transform 'routes' table — apply casting and timestamp
df_routes = (
    spark.read.format("delta").load(f"{BRONZE_PATH}/routes")
    .withColumn("route_id", F.col("route_id").cast("string"))
    .withColumn("route_short_name", F.col("route_short_name").cast("string"))
    .withColumn("route_type", F.col("route_type").cast("int"))
    .withColumn("ingestion_ts", F.current_timestamp())
)
# Save to Silver
df_routes.write.format("delta").mode("overwrite").save(f"{SILVER_PATH}/routes")
print("✓ Silver routes saved")
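
Once `route_type` is an integer, the code itself is opaque to downstream users. The sketch below maps the basic GTFS `route_type` codes to readable labels. `route_type_label` is a hypothetical helper, and feeds that use extended route type codes are not covered by this table:

```python
from typing import Optional

# Basic route_type codes defined in the GTFS Schedule reference (routes.txt)
GTFS_ROUTE_TYPES = {
    0: "Tram, Streetcar, Light rail",
    1: "Subway, Metro",
    2: "Rail",
    3: "Bus",
    4: "Ferry",
    5: "Cable tram",
    6: "Aerial lift",
    7: "Funicular",
    11: "Trolleybus",
    12: "Monorail",
}


def route_type_label(code: Optional[int]) -> Optional[str]:
    """Return the label for a basic route_type code, or None if it is unknown."""
    return GTFS_ROUTE_TYPES.get(code)


print(route_type_label(3))
print(route_type_label(99))
```

A `None` result is a useful signal in itself: it marks either a null produced by the cast or a code outside the basic set.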


In [0]:
# Transform 'trips' table — cast fields and enrich with timestamp
df_trips = (
    spark.read.format("delta").load(f"{BRONZE_PATH}/trips")
    .withColumn("trip_id", F.col("trip_id").cast("string"))
    .withColumn("route_id", F.col("route_id").cast("string"))
    .withColumn("service_id", F.col("service_id").cast("string"))
    .withColumn("direction_id", F.col("direction_id").cast("int"))
    .withColumn("ingestion_ts", F.current_timestamp())
)
# Save to Silver
df_trips.write.format("delta").mode("overwrite").save(f"{SILVER_PATH}/trips")
print("✓ Silver trips saved")
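
The `routes` and `trips` cells repeat the same cast-then-timestamp pattern. One possible way to reduce the repetition is to describe the casts as data and generate SQL expressions from them. This is a sketch, not the notebook's method: `CASTS` and `cast_expressions` are hypothetical names, and the cast targets are copied from the two cells above.

```python
from typing import Dict, List

# Column casts taken from the 'routes' and 'trips' transform cells
CASTS: Dict[str, Dict[str, str]] = {
    "routes": {
        "route_id": "string",
        "route_short_name": "string",
        "route_type": "int",
    },
    "trips": {
        "trip_id": "string",
        "route_id": "string",
        "service_id": "string",
        "direction_id": "int",
    },
}


def cast_expressions(columns: List[str], casts: Dict[str, str]) -> List[str]:
    """Build one SQL expression per column, casting only the columns in `casts`."""
    exprs = []
    for name in columns:
        if name in casts:
            exprs.append(f"CAST(`{name}` AS {casts[name].upper()}) AS `{name}`")
        else:
            # Columns without a cast rule pass through unchanged
            exprs.append(f"`{name}`")
    return exprs


# Example column list standing in for df.columns
example_columns = ["route_id", "route_long_name", "route_type"]
exprs = cast_expressions(example_columns, CASTS["routes"])
print(exprs)
```

In the notebook, the generated list could be passed to `df.selectExpr(*cast_expressions(df.columns, CASTS["routes"]))` before adding `ingestion_ts`, so that adding a table only means adding an entry to `CASTS`.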


In [0]:
# Preview both 'routes' and 'trips' in Silver
spark.read.format("delta").load(f"{SILVER_PATH}/routes").show(5)
spark.read.format("delta").load(f"{SILVER_PATH}/trips").show(5)


In [0]:
# View schema of both Silver tables
spark.read.format("delta").load(f"{SILVER_PATH}/routes").printSchema()    # Routes
spark.read.format("delta").load(f"{SILVER_PATH}/trips").printSchema()     # Trips
