# 📊 Notebook 06: Visualization & Storytelling

## Purpose of This Notebook

All analytical computation for this project is complete.

By the end of Notebook 05, we have:
- Deterministic undercut detection
- Heuristic-free evaluation
- Validated undercut outcomes
- Final metrics persisted to PostgreSQL and `data/final`

**Notebook 06 introduces no new analytical logic.**

Its sole purpose is to:
- Prepare gold, BI-consumable tables
- Freeze the analytical contract
- Enable transparent visualization and storytelling

---

## What This Notebook Will Do

This notebook will:
- Load final, trusted outputs produced by Notebook 05
- Join descriptive context (race, circuit, driver names)
- Produce denormalized, read-only "gold tables"
- Export data for visualization tools (Power BI)

---

## What This Notebook Will NOT Do

This notebook will explicitly NOT:
- Redefine what an undercut is
- Recompute or adjust any metrics
- Introduce heuristics or assumptions
- Change success classification logic

All interpretation belongs strictly in the visualization layer.

---

## Analytical Contract (Frozen)

### Primary Metric
- **Mean post-pit delta (ms)**  
  → Used to define undercut success

### Secondary Metrics
- First post-pit lap delta (ms)
- Best lap delta (ms)

These explain *perception vs reality* but do not determine success.
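
The frozen contract can also be written down as data, so that a later cell can check a table against it without touching any metric. The following is a minimal sketch, assuming the column names validated in Cell 1 (`mean_delta_ms`, `first_lap_delta_ms`, `best_lap_delta_ms`, `undercut_success`). The `AnalyticalContract` class and the `check_contract` helper are hypothetical names and are not part of `src`.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class AnalyticalContract:
    """Hypothetical, read-only description of the frozen metrics."""

    primary_metric: str = "mean_delta_ms"
    secondary_metrics: tuple = ("first_lap_delta_ms", "best_lap_delta_ms")
    success_flag: str = "undercut_success"

    def required_columns(self) -> set:
        return {self.primary_metric, self.success_flag, *self.secondary_metrics}


def check_contract(df: pd.DataFrame, contract: AnalyticalContract) -> list:
    """Return the sorted contract columns missing from df (structure only)."""
    return sorted(contract.required_columns() - set(df.columns))


contract = AnalyticalContract()

# Toy frame with made-up values, used only to exercise the check.
toy = pd.DataFrame({"mean_delta_ms": [-350.0], "undercut_success": [True]})
missing_columns = check_contract(toy, contract)
print(missing_columns)
```

Because the dataclass is frozen, an accidental reassignment such as `contract.primary_metric = ...` raises an error instead of silently changing the contract.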

---

## Downstream Consumer

The outputs of this notebook are intended for:
- Power BI dashboards
- Reports and presentations

All conclusions must be traceable back to:
- PostgreSQL tables
- `data/final` artifacts
- Notebook 05 logic


In [1]:
# ============================================================
# Notebook 06 - Cell 1
# Load Final Trusted Outputs (Path-Safe)
# ============================================================

import sys
from pathlib import Path

# ------------------------------------------------------------
# Ensure project root is on PYTHONPATH
# (assumes the notebook runs from a direct subdirectory of the root)
# ------------------------------------------------------------
PROJECT_ROOT = Path.cwd().resolve().parent

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# ------------------------------------------------------------
# Standard imports
# ------------------------------------------------------------
from sqlalchemy import text
import pandas as pd

# ------------------------------------------------------------
# Project imports
# ------------------------------------------------------------
from src.config import Config
from src.db import get_engine
from src.logging_config import setup_logging

# ------------------------------------------------------------
# Logging
# ------------------------------------------------------------
logger, _ = setup_logging()
logger.info("Notebook 06 started - loading final analytical outputs")

# ------------------------------------------------------------
# Database connection
# ------------------------------------------------------------
engine = get_engine()
logger.info("PostgreSQL engine initialized successfully")

# ------------------------------------------------------------
# Load final undercut tables (AUTHORITATIVE)
# ------------------------------------------------------------
with engine.connect() as conn:
    undercut_events = pd.read_sql(
        text("SELECT * FROM undercut_events"),
        conn
    )

    undercut_summary = pd.read_sql(
        text("SELECT * FROM undercut_summary"),
        conn
    )

logger.info(
    "Final tables loaded - "
    f"undercut_events: {len(undercut_events):,} rows, "
    f"undercut_summary: {len(undercut_summary):,} rows"
)

# ------------------------------------------------------------
# Defensive sanity checks (structure only)
# ------------------------------------------------------------
if undercut_events.empty:
    raise RuntimeError("undercut_events is empty - Notebook 06 cannot proceed")

required_columns = {
    "race_id",
    "pit_lap",
    "attacking_driver",
    "defending_driver",
    "mean_delta_ms",
    "first_lap_delta_ms",
    "best_lap_delta_ms",
    "undercut_success",
    "green_flag_valid",
}

missing = required_columns - set(undercut_events.columns)
if missing:
    raise RuntimeError(
        f"undercut_events missing required columns: {sorted(missing)}"
    )

logger.info("Final undercut tables validated - schema and row counts OK")


2025-12-23 11:07:39,249 | INFO | src.logging_config | Notebook 06 started - loading final analytical outputs
2025-12-23 11:07:39,377 | INFO | src.logging_config | PostgreSQL engine initialized successfully
2025-12-23 11:07:39,616 | INFO | src.logging_config | Final tables loaded - undercut_events: 122 rows, undercut_summary: 1 rows
2025-12-23 11:07:39,619 | INFO | src.logging_config | Final undercut tables validated - schema and row counts OK


In [2]:
# ============================================================
# Notebook 06 - Cell 2
# Build Gold BI-Ready Undercut Table (Schema-Faithful)
# ============================================================

logger.info("Building gold, BI-ready undercut table (schema-faithful)")

# ------------------------------------------------------------
# Load race dimension (actual schema)
# ------------------------------------------------------------
with engine.connect() as conn:
    races = pd.read_sql(
        text("""
            SELECT
                race_id,
                season,
                round
            FROM races
        """),
        conn
    )

logger.info(f"Loaded races table - rows: {len(races):,}")

# ------------------------------------------------------------
# Build gold table (joins only, no logic)
# ------------------------------------------------------------
gold = (
    undercut_events
    .merge(
        races,
        on="race_id",
        how="left",
        validate="many_to_one"
    )
)

logger.info(
    f"Gold table constructed - rows: {len(gold):,}, columns: {gold.shape[1]}"
)

# ------------------------------------------------------------
# Defensive checks
# ------------------------------------------------------------
required_gold_columns = {
    "season",
    "round",
    "race_id",
    "attacking_driver",
    "defending_driver",
    "pit_lap",
    "mean_delta_ms",
    "first_lap_delta_ms",
    "best_lap_delta_ms",
    "undercut_success",
    "green_flag_valid",
}

missing = required_gold_columns - set(gold.columns)
if missing:
    raise RuntimeError(
        f"Gold table missing required columns: {sorted(missing)}"
    )

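# Any NaN fails the check; naming the columns shows whether the gap came
# from the join (season, round) or from an upstream metric column.
nan_columns = gold.columns[gold.isna().any()].tolist()
if nan_columns:
    raise RuntimeError(
        f"NaNs detected in gold table columns {nan_columns} - "
        "check join integrity and upstream outputs"
    )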

logger.info("Gold table validated - BI-ready and pipeline-aligned")


2025-12-23 11:07:39,637 | INFO | src.logging_config | Building gold, BI-ready undercut table (schema-faithful)
2025-12-23 11:07:39,644 | INFO | src.logging_config | Loaded races table - rows: 68
2025-12-23 11:07:39,674 | INFO | src.logging_config | Gold table constructed - rows: 122, columns: 13
2025-12-23 11:07:39,679 | INFO | src.logging_config | Gold table validated - BI-ready and pipeline-aligned
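
Cell 2 relies on two pandas behaviours: `validate="many_to_one"` rejects duplicate keys in the dimension table, and a left join keeps every event row but leaves NaN where no race matches. The sketch below illustrates both on toy data. The `race_id` values and metric values are made up, and `indicator=True` is an optional addition that the cell above does not use.

```python
import pandas as pd

# Toy stand-ins for undercut_events and races; ids and values are made up.
events = pd.DataFrame({
    "race_id": ["2022_01", "2022_01", "2023_05"],
    "mean_delta_ms": [-420.0, 130.0, -75.0],
})
races = pd.DataFrame({
    "race_id": ["2022_01", "2022_02"],
    "season": [2022, 2022],
    "round": [1, 2],
})

# indicator=True adds a _merge column that marks rows with no match in races.
joined = events.merge(
    races, on="race_id", how="left", validate="many_to_one", indicator=True
)
unmatched = joined.loc[joined["_merge"] == "left_only", "race_id"].tolist()
print(unmatched)

# A duplicated key on the right side violates many_to_one and raises MergeError.
dup_races = pd.concat([races, races.iloc[[0]]], ignore_index=True)
try:
    events.merge(dup_races, on="race_id", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

The left join preserves the event row count, which is consistent with the 122 rows logged both before and after the merge.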


In [3]:
# ============================================================
# Notebook 06 - Cell 3
# Export Gold Table to data/final
# ============================================================

from pathlib import Path

logger.info("Exporting gold undercut table to data/final")

# ------------------------------------------------------------
# Resolve final output directory
# ------------------------------------------------------------
final_dir = Config.DATA_DIR / "final"
final_dir.mkdir(parents=True, exist_ok=True)

parquet_path = final_dir / "undercut_gold.parquet"
csv_path = final_dir / "undercut_gold.csv"

# ------------------------------------------------------------
# Write outputs
# ------------------------------------------------------------
gold.to_parquet(parquet_path, index=False)
gold.to_csv(csv_path, index=False)

logger.info(
    "Gold table exported successfully - "
    f"{parquet_path.name}, {csv_path.name}"
)


2025-12-23 11:08:38,985 | INFO | src.logging_config | Exporting gold undercut table to data/final
2025-12-23 11:08:39,164 | INFO | src.logging_config | Gold table exported successfully - undercut_gold.parquet, undercut_gold.csv
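
Since conclusions must stay traceable to the `data/final` artifacts, it can help to confirm that an exported file reloads to the same table. The sketch below does this for a CSV export on a toy frame with made-up values, written to a temporary directory rather than `data/final`. It is an optional check under those assumptions and is not part of the pipeline above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the gold table; values are made up.
gold_toy = pd.DataFrame({
    "race_id": ["2022_01", "2022_01"],
    "pit_lap": [18, 23],
    "mean_delta_ms": [-420.0, 130.5],
    "undercut_success": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "undercut_gold.csv"
    gold_toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Compares shape, column names and values. Dtypes are not compared,
# because CSV does not store them and they are re-inferred on load.
pd.testing.assert_frame_equal(gold_toy, reloaded, check_dtype=False)
round_trip_ok = True
print(round_trip_ok)
```

The same comparison could be applied to the Parquet file with `pd.read_parquet`, which additionally preserves dtypes.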


In [4]:
# D1 - Distinct seasons in races table
races['season'].value_counts().sort_index()


season
2022    22
2023    22
2024    24
Name: count, dtype: int64

In [5]:
# D2 - Distinct seasons in lap_features
lap_features = pd.read_sql_table("lap_features", engine)

lap_features['race_id'].str.slice(0, 4).value_counts().sort_index()


race_id
2022    23577
2023    24422
2024    26606
Name: count, dtype: int64

In [6]:
# D3 - Candidate events by season
undercut_events['race_id'].str.slice(0, 4).value_counts().sort_index()


race_id
2022    104
2023     18
Name: count, dtype: int64

In [7]:
# D4 - Unique races contributing undercut events
(
    undercut_events
    .assign(season=lambda df: df['race_id'].str.slice(0, 4))
    .groupby('season')['race_id']
    .nunique()
)


season
2022    3
2023    1
Name: race_id, dtype: int64
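
D1, D3 and D4 can be read side by side as a per-season coverage table, which makes a season with no events (2024 in the outputs above) show up as an explicit zero instead of a missing row. The sketch below builds such a table on toy data. It assumes, as the `str.slice(0, 4)` calls above do, that `race_id` starts with the four-digit season. The ids and the `races_per_season` helper are hypothetical.

```python
import pandas as pd

# Toy stand-ins for races and undercut_events; ids are made up but,
# as assumed above, start with the four-digit season.
races = pd.DataFrame({"race_id": ["2022_01", "2022_02", "2023_01", "2024_01"]})
events = pd.DataFrame({"race_id": ["2022_01", "2022_01", "2023_01"]})


def races_per_season(df: pd.DataFrame) -> pd.Series:
    """Count distinct race_id values per season prefix."""
    return (
        df.assign(season=df["race_id"].str.slice(0, 4))
        .groupby("season")["race_id"]
        .nunique()
    )


coverage = (
    pd.DataFrame({
        "total_races": races_per_season(races),
        "races_with_events": races_per_season(events),
    })
    .fillna(0)  # seasons absent from events become 0 instead of NaN
    .astype(int)
)
print(coverage)
```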

In [8]:
# D5 - Pit laps per season
(
    lap_features
    .assign(season=lambda df: df['race_id'].str.slice(0, 4))
    .groupby('season')['is_pit_lap']
    .sum()
)


season
2022    22621
2023    23553
2024    25916
Name: is_pit_lap, dtype: int64

In [9]:
# D6 - Green laps per season
(
    lap_features
    .assign(season=lambda df: df['race_id'].str.slice(0, 4))
    .groupby('season')['is_green_lap']
    .sum()
)


season
2022    1106
2023    1404
2024    1082
Name: is_green_lap, dtype: int64

In [12]:
# D7-A - Inspect available columns in lap_features
lap_features.columns.tolist()


['race_id',
 'driver_code',
 'lap_number',
 'lap_start_time_ms',
 'lap_end_time_ms',
 'cumulative_time_ms',
 'gap_to_leader_ms',
 'delta_prev_lap_ms',
 'is_green_lap',
 'is_sc_lap',
 'is_vsc_lap',
 'is_red_lap',
 'is_pit_lap',
 'is_out_lap',
 'stint_id']

In [13]:
# D7-B - NaN counts in derived timing columns by season
timing_cols = [
    "lap_start_time_ms",
    "lap_end_time_ms",
    "cumulative_time_ms",
    "gap_to_leader_ms",
    "delta_prev_lap_ms",
]

(
    lap_features
    .assign(season=lambda df: df["race_id"].str.slice(0, 4))
    .groupby("season")[timing_cols]
    .apply(lambda df: df.isna().sum())
)


        lap_start_time_ms  lap_end_time_ms  cumulative_time_ms  gap_to_leader_ms  delta_prev_lap_ms
season
2022                    0                0                   0                 0                438
2023                    0                0                   0                 0                438
2024                    0                0                   0                 0                478


In [14]:
# D7-C - Fraction of NaNs in derived timing columns by season
(
    lap_features
    .assign(season=lambda df: df["race_id"].str.slice(0, 4))
    .groupby("season")[timing_cols]
    .apply(lambda df: df.isna().mean())
)


        lap_start_time_ms  lap_end_time_ms  cumulative_time_ms  gap_to_leader_ms  delta_prev_lap_ms
season
2022                  0.0              0.0                 0.0               0.0           0.018577
2023                  0.0              0.0                 0.0               0.0           0.017935
2024                  0.0              0.0                 0.0               0.0           0.017966
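
The only timing column with gaps is `delta_prev_lap_ms`, at a steady rate of roughly 1.8% in every season. One plausible explanation, which this notebook does not confirm, is that a lap-over-lap delta has no predecessor on each driver's first lap of a race. The sketch below reproduces that pattern on toy data. The `lap_time_ms` column is hypothetical (it is not in `lap_features`), and computing the delta with `groupby(...).diff()` is an assumption about the upstream logic.

```python
import pandas as pd

# Toy laps for two made-up drivers in one made-up race.
laps = pd.DataFrame({
    "race_id": ["2022_01"] * 6,
    "driver_code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "lap_number": [1, 2, 3, 1, 2, 3],
    "lap_time_ms": [95000, 93500, 93200, 95400, 93900, 93100],
})

# diff() within each (race, driver) group has no previous lap for the
# first row of the group, so that row is NaN.
laps["delta_prev_lap_ms"] = (
    laps.sort_values(["race_id", "driver_code", "lap_number"])
    .groupby(["race_id", "driver_code"])["lap_time_ms"]
    .diff()
)

nan_count = int(laps["delta_prev_lap_ms"].isna().sum())
nan_laps = laps.loc[laps["delta_prev_lap_ms"].isna(), "lap_number"].unique().tolist()
print(nan_count, nan_laps)
```

If this reading holds for the real table, the NaN rows in `lap_features` would all sit on the first recorded lap of a driver in a race, which can be checked directly against `lap_number`.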


In [15]:
# D7-D - Green laps per season
(
    lap_features
    .assign(season=lambda df: df["race_id"].str.slice(0, 4))
    .groupby("season")["is_green_lap"]
    .sum()
)


season
2022    1106
2023    1404
2024    1082
Name: is_green_lap, dtype: int64

In [16]:
# D7-E - Pit laps per season
(
    lap_features
    .assign(season=lambda df: df["race_id"].str.slice(0, 4))
    .groupby("season")["is_pit_lap"]
    .sum()
)


season
2022    22621
2023    23553
2024    25916
Name: is_pit_lap, dtype: int64