## 14 – Create Platinum Star‑Schema (for Power BI)

This notebook materialises **dimension** and **fact** tables that Power BI (or any BI tool)
can query efficiently.

**Layer:** Platinum `fact_transit_event`

**What it contains:** Real‑time vehicle events + nearest weather

**Source:** Gold `gtfs_rt_weather_joined`, plus Platinum `dim_route` and `dim_trip`

All tables are written to **`dbfs:/plat/*`** in Delta format.


In [0]:
from pyspark.sql import functions as F, Window
import datetime as dt

# ---------- I/O locations ----------
# Gold input path (real-time GTFS enriched with weather)
GOLD_RT_WEATHER = "dbfs:/gold/gtfs_rt_weather_joined" 
# Platinum root path for fact/dim tables
PLAT = "dbfs:/plat"                         

In [0]:
# Load dimension tables from Platinum layer
dim_trip = spark.read.format("delta").load(f"{PLAT}/dim_trip")
dim_route = spark.read.format("delta").load(f"{PLAT}/dim_route")

In [0]:
# Load enriched GTFS + Weather Gold data
fact = spark.read.format("delta").load(GOLD_RT_WEATHER)
print(f"Events loaded: {fact.count()}")


In [0]:
# Join with trip dimension table using SCD2 logic
# Match on trip_id AND timestamp within the [start_time, end_time) window
cond_trip = (
    (F.col("f.trip_id") == F.col("t.trip_id")) &
    (F.col("f.event_ts") >= F.col("t.start_time")) &
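    (F.col("f.event_ts") <  F.coalesce(
        F.col("t.end_time"),
        # Null end_time = current version. Cast the sentinel explicitly
        # (assumes end_time is a timestamp column) so the comparison does
        # not rely on implicit string coercion.
        F.lit("9999-12-31 23:59:59").cast("timestamp")))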
)
# Join and add trip_sk
fact = (
    fact.alias("f")
    .join(dim_trip.alias("t"), cond_trip, "left")
    .select("f.*", "t.trip_sk")
)  # only f.* and trip_sk are selected, so no dim_trip columns leak through
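
The join condition above treats each dimension version as a half-open window `[start_time, end_time)`, with a null `end_time` standing for the current version. As a rough illustration of that rule only, here is a plain-Python sketch (not Spark); the `versions` data and the `lookup_sk` helper are hypothetical and exist just to show how boundary timestamps resolve.

```python
from datetime import datetime

# Same far-future sentinel the join uses for open-ended versions
FAR_FUTURE = datetime(9999, 12, 31, 23, 59, 59)

# Hypothetical versions of one trip: (trip_sk, start_time, end_time).
# end_time=None marks the current, still-open version.
versions = [
    (101, datetime(2024, 1, 1), datetime(2024, 3, 1)),
    (102, datetime(2024, 3, 1), None),
]


def lookup_sk(event_ts, versions):
    """Return the surrogate key whose [start, end) window contains event_ts."""
    for sk, start, end in versions:
        upper = end if end is not None else FAR_FUTURE
        if start <= event_ts < upper:
            return sk
    return None  # no matching version, like the NULL a left join produces


print(lookup_sk(datetime(2024, 2, 15), versions))   # 101
print(lookup_sk(datetime(2024, 3, 1), versions))    # 102
print(lookup_sk(datetime(2023, 12, 31), versions))  # None
```

Because the upper bound is exclusive, an event stamped exactly at a version change belongs to the newer version, and an event older than every version gets no surrogate key.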

In [0]:
# Join with route dimension table using SCD2 logic
cond_route = (
    (F.col("f.route_id") == F.col("r.route_id")) &
    (F.col("f.event_ts") >= F.col("r.start_time")) &
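    (F.col("f.event_ts") <  F.coalesce(
        F.col("r.end_time"),
        # Null end_time = current version. Cast the sentinel explicitly
        # (assumes end_time is a timestamp column) so the comparison does
        # not rely on implicit string coercion.
        F.lit("9999-12-31 23:59:59").cast("timestamp")))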
)
# Join and add route_sk
fact = (
    fact.alias("f")
    .join(dim_route.alias("r"), cond_route, "left")
    .select("f.*", "r.route_sk")
)
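
A range join like the two above keeps one fact row per event only if the validity windows of each natural key never overlap; two overlapping versions would match the same event twice and duplicate it. The sketch below shows one way such a check could look in plain Python. The `find_overlaps` helper and the sample tuples are hypothetical, and it assumes all versions passed in belong to a single `route_id` or `trip_id`.

```python
from datetime import datetime

FAR_FUTURE = datetime(9999, 12, 31, 23, 59, 59)


def find_overlaps(versions):
    """Return adjacent (sk, sk) pairs whose [start, end) windows overlap.

    `versions` holds (sk, start_time, end_time) tuples for ONE natural key;
    end_time=None means the version is still open. After sorting by start
    time, an empty result means no two windows overlap at all.
    """
    ordered = sorted(versions, key=lambda v: v[1])
    overlaps = []
    for (sk_a, _, end_a), (sk_b, start_b, _) in zip(ordered, ordered[1:]):
        upper_a = end_a if end_a is not None else FAR_FUTURE
        if start_b < upper_a:  # next version starts before the previous closes
            overlaps.append((sk_a, sk_b))
    return overlaps


clean = [
    (1, datetime(2024, 1, 1), datetime(2024, 3, 1)),
    (2, datetime(2024, 3, 1), None),
]
broken = [
    (1, datetime(2024, 1, 1), None),  # never closed when version 2 arrived
    (2, datetime(2024, 3, 1), None),
]

print(find_overlaps(clean))   # []
print(find_overlaps(broken))  # [(1, 2)]
```

Comparing the row count printed after loading the Gold data with the final `fact_transit_event` count is a cheap complementary signal: with left joins and non-overlapping windows the two numbers should be equal.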

In [0]:
# Drop unnecessary columns to keep fact table lean
fact_clean = (
    fact
    .drop("trip_id", "route_id",            # Already replaced by SKs
          "shortForecast",
          "route_short_name", "route_type",
          "timestamp", "direction_id",
          "temperature", "condition",
          "windSpeed", "forecast_retrieved_at"
          )  # columns not needed in the fact table
)
# Write final fact_transit_event table to Platinum layer
fact_clean.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("ingestion_date") \
    .save(f"{PLAT}/fact_transit_event")

print(f"✓ fact_transit_event rows: {fact_clean.count()}")


In [0]:
# Preview the cleaned fact table
fact_clean.display()