# Gold Layer – Race Events

## Objective
This notebook creates the `gold.race_events` table by combining race control messages, session info, and weather conditions.  
It provides a timeline of race-critical events, flags, and conditions.

## Steps
1. Load Silver tables: `race_control_messages`, `session_info`, `weather_data`.  
2. Normalize event timestamps: rename `Date` to `event_time`.  
3. Categorize events:
   - Yellow/Red/Green/Blue flags  
   - Virtual Safety Car (VSC)  
   - Safety Car deployment  
   - Other race messages  
4. Enrich with session metadata (round, circuit, year).  
5. Enrich with weather conditions, approximated as per-session averages because weather samples do not align with individual event timestamps.  
6. Write final dataset into Gold layer as `gold.race_events`.  
7. Optimize table with ZORDER on `(session_key, event_time)`.  
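
The categorization in step 3 is order-sensitive: a chained `when` returns the first branch that matches, and the pattern `SAFETY CAR` also matches inside `VIRTUAL SAFETY CAR`. The plain-Python sketch below illustrates that precedence outside Spark. The `RULES` table and the `categorize` helper are hypothetical names used only for illustration, and the sample flag values are assumptions about the Silver data.

```python
import re

# Ordered (field, pattern, label) rules; the first match wins,
# mirroring a chained F.when(...).when(...).otherwise(...).
RULES = [
    ("flag", r"YELLOW", "Yellow Flag"),
    # Word boundaries keep RED from matching inside a value like CHEQUERED.
    ("flag", r"\bRED\b", "Red Flag"),
    ("flag", r"GREEN", "Green Flag"),
    ("flag", r"BLUE", "Blue Flag"),
    # The more specific pattern must come first, because "SAFETY CAR"
    # also matches inside "VIRTUAL SAFETY CAR".
    ("message", r"VIRTUAL SAFETY CAR", "Virtual Safety Car"),
    ("message", r"SAFETY CAR", "Safety Car"),
]


def categorize(flag, message):
    """Return the event_type label for one race control message."""
    # Missing values never match, like a null column in a Spark when().
    fields = {"flag": flag or "", "message": message or ""}
    for field, pattern, label in RULES:
        if re.search(pattern, fields[field]):
            return label
    return "Other"
```

Swapping the last two rules would label every Virtual Safety Car message as `Safety Car`, which is why the more specific rule is listed first.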

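The notebook attaches session-level weather averages to every event. If weather at the time of each event is needed instead, one option is a backward as-of lookup: for each event, take the latest weather sample recorded at or before it. The sketch below shows the idea in plain Python for a single session. The helper `weather_as_of` and the sample data are hypothetical, and translating this to a per-`session_key` join in Spark is not shown.

```python
from bisect import bisect_right


def weather_as_of(event_time, sample_times, samples):
    """Return the latest weather sample taken at or before event_time.

    sample_times must be sorted ascending and parallel to samples.
    Returns None when the event precedes the first sample.
    """
    idx = bisect_right(sample_times, event_time)
    return samples[idx - 1] if idx > 0 else None


# Hypothetical data: seconds since session start, track temperature in Celsius.
sample_times = [0, 60, 120, 180]
samples = [
    {"TrackTemp": 41.0},
    {"TrackTemp": 41.5},
    {"TrackTemp": 42.1},
    {"TrackTemp": 41.8},
]

# An event at t=130 picks up the sample recorded at t=120.
nearest = weather_as_of(130, sample_times, samples)
```

Compared with session averages, this keeps changes during the session visible (for example, rain starting mid-race), at the cost of a more expensive join.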

In [0]:
from pyspark.sql import functions as F

# 1. Load Silver tables
race_msgs = spark.table("silver.race_control_messages")
session_info = spark.table("silver.session_info")
weather = spark.table("silver.weather_data")

# 2. Clean and normalize race control messages
race_msgs = race_msgs.withColumnRenamed("Date", "event_time")

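# Standardize flags and categories (the first matching branch wins).
# Word boundaries keep RED from matching inside a value such as CHEQUERED,
# assuming such values occur in the Flag column.
# VIRTUAL SAFETY CAR is tested before SAFETY CAR, because the shorter
# pattern would otherwise capture Virtual Safety Car messages as well.
race_msgs = race_msgs.withColumn(
    "event_type",
    F.when(F.col("Flag").rlike("YELLOW"), "Yellow Flag")
     .when(F.col("Flag").rlike(r"\bRED\b"), "Red Flag")
     .when(F.col("Flag").rlike("GREEN"), "Green Flag")
     .when(F.col("Flag").rlike("BLUE"), "Blue Flag")
     .when(F.col("Message").rlike("VIRTUAL SAFETY CAR"), "Virtual Safety Car")
     .when(F.col("Message").rlike("SAFETY CAR"), "Safety Car")
     .otherwise("Other")
)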

# 3. Join with session info for context
race_events = (
    race_msgs
    .join(session_info, on="session_key", how="left")
)

# 4. Approximate weather conditions with per-session averages,
# because weather sample timestamps do not line up with event timestamps
weather_summary = (
    weather.groupBy("session_key")
    .agg(
        F.avg("TrackTemp").alias("avg_track_temp"),
        F.avg("AirTemp").alias("avg_air_temp"),
        F.avg("Humidity").alias("avg_humidity"),
        F.avg("Rainfall").alias("avg_rainfall")
    )
)

race_events = race_events.join(weather_summary, on="session_key", how="left")

# 5. Final clean up
race_events = race_events.select(
    "session_key",
    "event_time",
    "event_type",
    "Message",
    "Category",
    "Scope",
    "Status",
    "Year",
    "Round",
    "Circuit",
    "SessionName",
    "EventName",
    "avg_track_temp",
    "avg_air_temp",
    "avg_humidity",
    "avg_rainfall"
).dropDuplicates()

# 6. Write to Gold layer
(
    race_events.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("gold.race_events")
)

# 7. Optimize
spark.sql("OPTIMIZE gold.race_events ZORDER BY (session_key, event_time)")
