## 01 - Ingest GTFS Static Data (Bronze Layer)

This notebook downloads and ingests the **static GTFS feed** from King County Metro into the **Bronze Delta Lake layer**.

### Purpose
This step extracts and stores the core static transit reference data (routes, stops, trips, calendars, etc.) that later enrichment and analysis steps depend on.

### Workflow Summary
- Downloads the latest GTFS static `.zip` file from [King County Metro GTFS](https://metro.kingcounty.gov/gtfs/)
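- Extracts every `.txt` file in the archive (e.g., `routes.txt`, `stops.txt`, `calendar.txt`)
- Reads each file into a Spark DataFrame, using the first row as the header
- Writes each table to a separate **Delta table** under `dbfs:/bronze/gtfs_static/<ingest date>/<table name>` (for example, `dbfs:/bronze/gtfs_static/2025-05-21/stops`)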


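The path layout described above can be sketched in plain Python. `bronze_table_path` is a hypothetical helper, not something the notebook defines; it mirrors how the ingest cell combines the base path, the ingest date, and the file name.

```python
import posixpath

def bronze_table_path(base: str, ingest_date: str, member_name: str) -> str:
    """Build the Delta path for one GTFS file (hypothetical helper)."""
    # Strip the ".txt" extension so "stops.txt" becomes the table name "stops"
    table = posixpath.splitext(member_name)[0]
    return f"{base}/gtfs_static/{ingest_date}/{table}"

print(bronze_table_path("dbfs:/bronze", "2025-05-21", "stops.txt"))
```

Partitioning the path by ingest date keeps each snapshot of the feed separate, so a later run does not overwrite an earlier one.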
In [0]:
### Download and Ingest GTFS Static Files
from pyspark.sql import functions as F
import requests, zipfile, io, datetime as dt, os, shutil, tempfile   # dt alias

# ---------- config ----------
GTFS_URL    = "https://metro.kingcounty.gov/gtfs/google_transit.zip"        # GTFS static data feed
TODAY       = "2025-05-21"          # Static ingest date
BRONZE_BASE = "dbfs:/bronze"
BRONZE_PATH = f"{BRONZE_BASE}/gtfs_static/{TODAY}"

# Create temporary directory for downloaded and extracted files
tmp_dir = tempfile.mkdtemp(prefix="gtfs_")
print(f"Temp dir: {tmp_dir}")

try:
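    print("Downloading GTFS zip …")
    resp = requests.get(GTFS_URL, timeout=30)
    resp.raise_for_status()     # Fail fast on HTTP errors instead of parsing an error page as a zip
    z = zipfile.ZipFile(io.BytesIO(resp.content))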

    for name in z.namelist():
        if not name.endswith(".txt"):
            continue        # Skip non-text files
        print(f"Processing {name} …")

        # Extract file locally
        local_file = os.path.join(tmp_dir, name)
        os.makedirs(os.path.dirname(local_file), exist_ok=True)
        with z.open(name) as src, open(local_file, "wb") as dst:
            dst.write(src.read())

        # Read into Spark DataFrame (no schema inference, so every column is a string)
        df = (spark.read
                .option("header", True)
                .csv(f"file://{local_file}"))

        # Write to Bronze layer as Delta table
        (df.write
           .format("delta")
           .mode("overwrite")
           .save(f"{BRONZE_PATH}/{os.path.splitext(name)[0]}"))   # Table name = file name without .txt

    print("✓ GTFS static ingest complete")

finally:
    shutil.rmtree(tmp_dir, ignore_errors=True) # Clean up temp files
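The ingest loop above writes whatever `.txt` files the archive happens to contain. As a minimal sketch, a check like the following could confirm that the files a downstream step is likely to need are present before writing anything. `EXPECTED_FILES` and `missing_gtfs_files` are hypothetical names, and the expected set is an assumption based on the GTFS Schedule reference, not something this notebook enforces.

```python
import io
import zipfile

# Assumed baseline set of files; adjust to what downstream notebooks actually read
EXPECTED_FILES = {"agency.txt", "routes.txt", "trips.txt", "stops.txt", "stop_times.txt"}

def missing_gtfs_files(archive: zipfile.ZipFile) -> set:
    """Return the expected GTFS files that are absent from the archive."""
    # Compare on base names so files nested in a folder are still recognised
    present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    missing = EXPECTED_FILES - present
    # Service dates may live in calendar.txt, calendar_dates.txt, or both
    if not ({"calendar.txt", "calendar_dates.txt"} & present):
        missing.add("calendar.txt or calendar_dates.txt")
    return missing

# Demo on a small in-memory archive instead of the real feed
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("routes.txt", "route_id\n1\n")
    zf.writestr("stops.txt", "stop_id\n1\n")
demo = zipfile.ZipFile(buf)
print(sorted(missing_gtfs_files(demo)))
```

In the notebook, the same function could be called on `z` right after the download, raising an error if the returned set is not empty.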


In [0]:
### Preview `stops` Table
# Show a few rows from Bronze/stops
spark.read.format("delta").load(f"{BRONZE_PATH}/stops").show(5)

In [0]:
### Preview Other GTFS Tables
# View 5 rows of 'routes' table
spark.read.format("delta").load(f"{BRONZE_PATH}/routes").show(5)

# View 'calendar' table
spark.read.format("delta").load(f"{BRONZE_PATH}/calendar").show(5)

# View 'trips' table
spark.read.format("delta").load(f"{BRONZE_PATH}/trips").show(5)
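Because the Bronze tables are read without schema inference, date columns such as `start_date` and `end_date` in `calendar` show up as strings. GTFS stores service dates as `YYYYMMDD`, so a later step could parse them roughly as follows. `parse_gtfs_date` is a hypothetical helper shown only for illustration.

```python
import datetime as dt

def parse_gtfs_date(value: str) -> dt.date:
    """Parse a GTFS service date in YYYYMMDD form (hypothetical helper)."""
    # strptime raises ValueError on malformed input, which surfaces bad rows early
    return dt.datetime.strptime(value.strip(), "%Y%m%d").date()

print(parse_gtfs_date("20250521").isoformat())
```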


In [0]:
### Inspect Route Metadata
# Check key route attributes
# route_type defines the mode of transportation (bus, rail, etc.)
df_routes = spark.read.format("delta").load(f"{BRONZE_PATH}/routes")
df_routes.select("route_id", "route_short_name", "route_type").show(10)
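The `route_type` values shown above are numeric codes stored as strings. A small lookup like the one below could turn them into readable labels. The mapping follows the basic codes in the GTFS Schedule reference; `ROUTE_TYPE_LABELS` and `describe_route_type` are hypothetical names, and feeds that use extended route types would need additional entries.

```python
# Basic GTFS route_type codes; extended codes are not covered here
ROUTE_TYPE_LABELS = {
    0: "Tram, streetcar, or light rail",
    1: "Subway or metro",
    2: "Rail",
    3: "Bus",
    4: "Ferry",
    5: "Cable tram",
    6: "Aerial lift",
    7: "Funicular",
    11: "Trolleybus",
    12: "Monorail",
}

def describe_route_type(value) -> str:
    """Map a raw route_type value (string or int) to a label."""
    try:
        code = int(str(value).strip())
    except ValueError:
        return "Unknown"    # Covers None, blanks, and non-numeric text
    return ROUTE_TYPE_LABELS.get(code, "Unknown")

for raw in ["3", "0", "4", None]:
    print(raw, "->", describe_route_type(raw))
```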

In [0]:
# Same stops preview using the literal path, so it runs without the config cell
spark.read.format("delta").load("dbfs:/bronze/gtfs_static/2025-05-21/stops").show(5)
