# 📓 Notebook 04: Lap-Level Temporal Alignment & Feature Engineering

---

## 🧭 Context: Where We Are Coming From

Notebook 03 concluded the **data ingestion and validation phase** of this project.

At the end of Notebook 03, we achieved something non-trivial:

- All race, driver, and lap data is loaded into PostgreSQL
- Relational integrity is enforced
- Lap grain is **explicitly defined and immutable**
- Idempotent execution is guaranteed
- No orphan rows exist
- No duplicate races, drivers, or laps exist
- Track status data exists on disk, but is **not yet analytically usable**

This means that **raw correctness is no longer the problem**.

However, **analytical readiness is still missing**.

At this stage, we have:
- Discrete lap rows
- No explicit notion of *time*
- No continuous race timeline
- No pit structure
- No mapping between lap events and track status events

In short:

> **We know *what* happened, but not *when* or *in what context*.**

Notebook 04 exists to close that gap.

---

## 🎯 Purpose of Notebook 04

The purpose of Notebook 04 is to **convert structurally correct lap data into analytically usable temporal data**, without making *any strategic or interpretive assumptions*.

This notebook is intentionally **pre-strategy**.

It does **not**:
- Detect undercuts
- Classify strategies
- Judge performance
- Filter laps based on competitiveness

Instead, it **constructs the factual substrate** upon which all later strategy analysis depends.

---

## 🧠 Core Questions Notebook 04 Must Answer

Notebook 04 exists to answer the following foundational questions:

1. ⏱️ *When did each lap start and end, relative to the race timeline?*
2. 🧮 *How does each lap relate to the laps before and after it?*
3. 🚦 *What track conditions overlapped each lap in time?*
4. 🛠️ *Which laps represent pit activity, out laps, and stint boundaries?*
5. 📐 *Are all derived features internally consistent and safe to reason about?*

If any of these questions remain ambiguous or incorrect, **all downstream strategy analysis becomes invalid**.

---

## 🏗️ What This Notebook Will Do (Explicit Scope)

Notebook 04 will perform **five major tasks**, in strict order:

---

### 1️⃣ Construct Continuous Lap Timelines ⏳

- Convert discrete lap rows into a **continuous race timeline**
- Compute:
  - `lap_start_time_ms`
  - `lap_end_time_ms`
  - `cumulative_time_ms`
- Guarantee:
  - Time is monotonic per driver
  - No negative or overlapping windows
  - No implicit inference or smoothing

This step introduces *time* into the system.
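
As a toy illustration of the arithmetic behind these three columns, the sketch below builds lap windows for a single driver in plain Python. It is not the notebook's pandas implementation; the function name `lap_windows` and the lap times are hypothetical.

```python
from itertools import accumulate


def lap_windows(lap_times_ms):
    """Return one (start_ms, end_ms) window per lap for a single driver."""
    # End of lap i is the running total of lap times up to and including lap i.
    ends = list(accumulate(lap_times_ms))
    # Start of lap i is its end minus its own duration.
    starts = [end - lap for end, lap in zip(ends, lap_times_ms)]
    return list(zip(starts, ends))


# Hypothetical lap times in milliseconds for one driver.
windows = lap_windows([90_000, 88_500, 89_200])
print(windows)
```

Because each start equals the previous end, the windows are contiguous and cannot overlap as long as no lap time is negative.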

---

### 2️⃣ Compute Relative Timing Features 📊

- Compute lap-to-lap deltas per driver
- Compute gap-to-leader per lap
- Preserve undefined values explicitly (e.g., first lap deltas)

This provides **comparability without interpretation**.
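
A minimal plain-Python sketch of both features follows, assuming the leader on a lap is simply the driver with the smallest cumulative time on that lap. The helper names and driver codes are hypothetical, and the toy truncates to the shortest driver history for simplicity.

```python
def gaps_to_leader(cumulative_ms):
    """Map each driver to per-lap gaps behind the smallest cumulative time.

    cumulative_ms: dict of driver code -> list of cumulative times (ms),
    where index i holds the time at the end of lap i + 1.
    """
    n_laps = min(len(times) for times in cumulative_ms.values())
    gaps = {driver: [] for driver in cumulative_ms}
    for lap in range(n_laps):
        leader_time = min(times[lap] for times in cumulative_ms.values())
        for driver, times in cumulative_ms.items():
            gaps[driver].append(times[lap] - leader_time)
    return gaps


def lap_deltas(lap_times_ms):
    """Lap-to-lap change in lap time; the first lap has no predecessor."""
    if not lap_times_ms:
        return []
    # None marks the undefined first-lap delta explicitly.
    return [None] + [curr - prev for prev, curr in zip(lap_times_ms, lap_times_ms[1:])]


gaps = gaps_to_leader({"AAA": [90_000, 179_000], "BBB": [91_000, 178_500]})
deltas = lap_deltas([90_000, 89_000, 91_500])
```

Here `gaps` is `{"AAA": [0, 500], "BBB": [1000, 0]}` and `deltas` is `[None, -1000, 2500]`: the undefined first-lap delta stays undefined rather than being coerced to zero.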

---

### 3️⃣ Align Track Status Events to Lap Windows 🚦

- Load standardized track status event data
- Treat track status as a **temporal overlay**, not a fact table
- Mark laps as overlapping:
  - GREEN
  - SC
  - VSC
  - RED
- Use **conservative overlap logic**
- Perform **no forward fill**
- Allow ambiguity where data does not provide certainty

This step explicitly separates **mechanical alignment** from **analytical meaning**.
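
The alignment rule can be sketched in plain Python, assuming half-open lap windows (`start <= event_time < end`) as used in Cell 5. The function name `flag_laps` and the sample events are hypothetical.

```python
def flag_laps(windows, events):
    """Collect, per lap, the statuses of events whose timestamp falls in it.

    windows: list of (start_ms, end_ms) lap windows, treated as half-open.
    events:  list of (time_ms, status) track status events.
    """
    flags = [set() for _ in windows]
    for time_ms, status in events:
        for i, (start, end) in enumerate(windows):
            # Start is inclusive, end is exclusive: a boundary event
            # belongs to the lap that begins at that instant.
            if start <= time_ms < end:
                flags[i].add(status)
    return flags


windows = [(0, 90_000), (90_000, 178_500), (178_500, 267_700)]
events = [(0, "GREEN"), (95_000, "SC"), (178_500, "GREEN")]
flags = flag_laps(windows, events)
```

Only laps that contain an event timestamp are flagged. With no forward fill, a lap that runs entirely under a status raised earlier receives no flag, and an event outside every window flags nothing.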

---

### 4️⃣ Identify Pit Structure & Stints 🛠️

- Identify pit laps via time anomalies
- Identify out laps via pit-lap adjacency
- Construct `stint_id` deterministically
- Preserve lap grain strictly
- Do **not** classify strategies

This establishes **structural race phases** without evaluation.
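
Given a per-lap pit flag for one driver, out laps and stint ids follow mechanically. The plain-Python sketch below mirrors the shift-and-cumulative-sum convention used in Cell 6; the function name `pit_structure` and the sample flags are hypothetical.

```python
from itertools import accumulate


def pit_structure(is_pit_lap):
    """Derive out-lap flags and stint ids from per-lap pit flags."""
    # An out lap is the lap right after a pit lap: shift the flags by one.
    is_out_lap = [False, *is_pit_lap][: len(is_pit_lap)]
    # The stint id is the running count of pit laps seen so far, so the
    # pit lap itself already carries the incremented id.
    stint_id = list(accumulate(int(flag) for flag in is_pit_lap))
    return is_out_lap, stint_id


pit_flags = [False, False, True, False, False, True, False]
out_laps, stints = pit_structure(pit_flags)
```

For these flags, `out_laps` is `[False, False, False, True, False, False, True]` and `stints` is `[0, 0, 1, 1, 1, 2, 2]`. Attributing the pit lap to the new stint rather than the old one is a convention, not a claim about strategy.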

---

### 5️⃣ Validate All Derived Features (Invariant Enforcement) 🔒

- Explicitly validate:
  - Lap grain integrity
  - Temporal monotonicity
  - Gap consistency
  - Track status mechanical correctness
  - Pit / out-lap logic
  - Stint monotonicity
  - Correct handling of undefined values
- Fail loudly on **any silent analytical corruption**

This final step **seals the notebook**.
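
"Fail loudly" here means raising an exception instead of logging a warning and continuing. The sketch below shows one such invariant check in plain Python; the helper name `check_monotonic` is hypothetical and stands in for the pandas-based checks in Cell 7.

```python
def check_monotonic(values, name):
    """Raise if the sequence ever decreases; return None when it is valid."""
    for position, (prev, curr) in enumerate(zip(values, values[1:]), start=1):
        if curr < prev:
            raise RuntimeError(
                f"Invariant violation: {name} decreases at position {position} "
                f"({prev} -> {curr})"
            )


# A valid cumulative timeline passes silently.
check_monotonic([0, 90_000, 178_500], "cumulative_time_ms")

# A corrupted timeline stops execution with a descriptive error.
try:
    check_monotonic([0, 90_000, 85_000], "cumulative_time_ms")
    failed = False
except RuntimeError as exc:
    failed = True
    message = str(exc)
```

The error message names the offending column and position, so a failure points directly at the broken assumption.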

---

## 🚫 What This Notebook Explicitly Does *Not* Do

To avoid analytical leakage, Notebook 04 does **not**:

- Decide what is a “competitive” lap
- Exclude safety-car laps
- Define undercut windows
- Rank strategies
- Aggregate outcomes

All such interpretation is deferred **by design**.

---

## 🧩 Expected Outcome

At the end of Notebook 04, we expect to have:

- A lap-level dataset with:
  - Continuous time
  - Explicit race context
  - Explicit pit structure
  - Explicit track status overlap
- Zero ambiguity about:
  - Data correctness
  - Feature validity
  - Structural assumptions

Only **after** this point does it make sense to ask:

> *“Is the undercut actually worth it, or is it hype?”*

Notebook 04 exists to ensure that question is asked **on solid ground**.


In [1]:
# ============================================================
# Notebook 04 - Lap-Level Feature Engineering
# Cell 1: Scope Definition & Environment Bootstrap
# ============================================================

"""
This notebook derives deterministic, lap-level temporal features
from relationally trusted PostgreSQL data.

NON-NEGOTIABLE CONSTRAINTS
--------------------------
• One row == one (race_id, driver_code, lap_number)
• PostgreSQL integrity is assumed correct (Notebook 03)
• No data correction or inference
• Track status is event-based and aligned only after timelines exist
• NO strategy logic (undercuts) in this notebook

This notebook is where TIME enters the system.
"""

# -----------------------------
# Standard library imports
# -----------------------------
import sys
from pathlib import Path

# -----------------------------
# Third-party imports
# -----------------------------
import pandas as pd

# -----------------------------
# Robust project root discovery
# -----------------------------
def find_project_root(start: Path) -> Path:
    """
    Walk upward from start until a directory containing `src/` is found.
    This is the only safe way to bootstrap imports in Jupyter.
    """
    current = start.resolve()
    for parent in [current, *current.parents]:
        if (parent / "src").is_dir():
            return parent
    raise RuntimeError(
        "Project root not found. Expected a `src/` directory in parent paths."
    )

PROJECT_ROOT = find_project_root(Path.cwd())

if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# -----------------------------
# Project-level imports (now safe)
# -----------------------------
from src.config import Config
from src.db import get_engine
from src.logging_config import setup_logging
from src.utils import validate_dataframe_columns

# -----------------------------
# Logging setup (idempotent)
# -----------------------------
logger, error_logger = setup_logging()
logger.info("Notebook 04 started ‚Äî Lap-Level Feature Engineering")

# -----------------------------
# Database engine initialization
# -----------------------------
try:
    engine = get_engine()
    logger.info("PostgreSQL engine initialized successfully")
except Exception:
    error_logger.error("Failed to initialize PostgreSQL engine", exc_info=True)
    raise

# -----------------------------
# Explicit analytical guarantees
# -----------------------------
EXPECTED_LAP_COLUMNS = {
    "race_id",
    "driver_code",
    "lap_number",
    "lap_time_ms",
    "tyre_compound",
}

logger.info(
    "Notebook 04 guarantees:\n"
    "- Relational integrity already enforced\n"
    "- Lap grain is immutable\n"
    "- Temporal features will be derived deterministically\n"
    "- Track status will be aligned conservatively\n"
    "- No strategy classification occurs here"
)


2025-12-18 15:03:07,303 | INFO | src.logging_config | Notebook 04 started - Lap-Level Feature Engineering
2025-12-18 15:03:07,374 | INFO | src.logging_config | PostgreSQL engine initialized successfully
2025-12-18 15:03:07,375 | INFO | src.logging_config | Notebook 04 guarantees:
- Relational integrity already enforced
- Lap grain is immutable
- Temporal features will be derived deterministically
- Track status will be aligned conservatively
- No strategy classification occurs here


In [2]:
# ============================================================
# Cell 2 - PostgreSQL Reads & Canonical Lap Frame Construction
# ============================================================

"""
This cell constructs the canonical lap-level DataFrame.

Responsibilities:
• Read races, drivers, laps from PostgreSQL
• Perform explicit relational joins only
• Enforce lap grain and ordering
• Validate schema assumptions defensively

This cell introduces NO derived features.
"""

# -----------------------------
# SQL queries (explicit, minimal)
# -----------------------------
LAPS_QUERY = """
SELECT
    l.race_id,
    l.driver_code,
    l.lap_number,
    l.lap_time_ms,
    l.tyre_compound
FROM laps l
"""

DRIVERS_QUERY = """
SELECT
    d.race_id,
    d.driver_code,
    d.driver_number
FROM drivers d
"""

RACES_QUERY = """
SELECT
    r.race_id,
    r.season,
    r.round
FROM races r
"""

# -----------------------------
# Load tables from PostgreSQL
# -----------------------------
logger.info("Loading base tables from PostgreSQL")

laps_df = pd.read_sql(LAPS_QUERY, engine)
drivers_df = pd.read_sql(DRIVERS_QUERY, engine)
races_df = pd.read_sql(RACES_QUERY, engine)

logger.info(
    f"Loaded tables ‚Äî "
    f"laps: {len(laps_df):,}, "
    f"drivers: {len(drivers_df):,}, "
    f"races: {len(races_df):,}"
)

# -----------------------------
# Defensive schema validation
# -----------------------------
validate_dataframe_columns(
    laps_df,
    EXPECTED_LAP_COLUMNS,
    df_name="laps_df"
)

validate_dataframe_columns(
    drivers_df,
    {"race_id", "driver_code", "driver_number"},
    df_name="drivers_df"
)

validate_dataframe_columns(
    races_df,
    {"race_id", "season", "round"},
    df_name="races_df"
)

# -----------------------------
# Explicit relational joins
# -----------------------------
logger.info("Joining laps -> drivers -> races")

lap_frame = (
    laps_df
    .merge(
        drivers_df,
        on=["race_id", "driver_code"],
        how="inner",
        validate="many_to_one"
    )
    .merge(
        races_df,
        on="race_id",
        how="inner",
        validate="many_to_one"
    )
)

# -----------------------------
# Lap grain & ordering enforcement
# -----------------------------
lap_frame = lap_frame.sort_values(
    by=["race_id", "driver_code", "lap_number"],
    ascending=[True, True, True]
).reset_index(drop=True)

# -----------------------------
# Grain integrity checks
# -----------------------------
expected_rows = len(laps_df)
actual_rows = len(lap_frame)

if actual_rows != expected_rows:
    raise RuntimeError(
        f"Lap grain violation detected: "
        f"expected {expected_rows:,} rows, got {actual_rows:,}"
    )

if lap_frame.duplicated(
    subset=["race_id", "driver_code", "lap_number"]
).any():
    raise RuntimeError(
        "Duplicate lap keys detected after joins"
    )

logger.info(
    "Canonical lap frame constructed successfully ‚Äî "
    "lap grain preserved and ordering enforced"
)


2025-12-18 15:03:07,389 | INFO | src.logging_config | Loading base tables from PostgreSQL
2025-12-18 15:03:07,717 | INFO | src.logging_config | Loaded tables - laps: 74,605, drivers: 1,359, races: 68
2025-12-18 15:03:07,718 | INFO | src.logging_config | Joining laps -> drivers -> races
2025-12-18 15:03:07,835 | INFO | src.logging_config | Canonical lap frame constructed successfully - lap grain preserved and ordering enforced


In [3]:
# ============================================================
# Cell 3 - Cumulative Race Time & Lap Time Windows
# ============================================================

"""
This cell introduces TIME into the system.

Responsibilities:
• Compute cumulative race time per driver
• Handle missing lap times explicitly and safely
• Derive lap start and end timestamps
• Preserve lap grain strictly

No gaps, no pit logic, no track status.
"""

logger.info("Computing cumulative race time per driver (NaN-safe)")

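# -----------------------------
# Explicit handling of missing lap times
# -----------------------------
# Missing lap times are treated as zero-duration laps so that the cumulative
# sum stays defined on every row; the original lap_time_ms column is untouched.
lap_frame["lap_time_ms_filled"] = lap_frame["lap_time_ms"].fillna(0)

if (lap_frame["lap_time_ms_filled"] < 0).any():
    raise RuntimeError(
        "Negative lap_time_ms detected - invalid temporal data"
    )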

# -----------------------------
# Cumulative time computation
# -----------------------------
lap_frame["cumulative_time_ms"] = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["lap_time_ms_filled"]
    .cumsum()
)

# -----------------------------
# Lap start / end window derivation
# -----------------------------
lap_frame["lap_start_time_ms"] = (
    lap_frame["cumulative_time_ms"] - lap_frame["lap_time_ms_filled"]
)

lap_frame["lap_end_time_ms"] = lap_frame["cumulative_time_ms"]

# -----------------------------
# Temporal integrity checks
# -----------------------------
if (lap_frame["lap_start_time_ms"] < 0).any():
    raise RuntimeError(
        "Negative lap_start_time_ms detected ‚Äî temporal alignment broken"
    )

if not (
    lap_frame["lap_end_time_ms"] >= lap_frame["lap_start_time_ms"]
).all():
    raise RuntimeError(
        "Invalid lap time windows detected (end < start)"
    )

logger.info(
    "Cumulative time windows computed successfully ‚Äî "
    "timeline preserved with explicit NaN handling"
)


2025-12-18 15:03:07,843 | INFO | src.logging_config | Computing cumulative race time per driver (NaN-safe)
2025-12-18 15:03:07,870 | INFO | src.logging_config | Cumulative time windows computed successfully - timeline preserved with explicit NaN handling


In [4]:
# ============================================================
# Cell 4 - Gap to Leader & Lap Delta Features
# ============================================================

"""
This cell derives relative timing context.

Responsibilities:
• Compute gap to race leader per lap
• Compute lap-to-lap deltas per driver
• Preserve determinism and lap grain

No track status. No pit logic. No strategy.
"""

logger.info("Computing gap to leader and lap-to-lap deltas")

# -----------------------------
# Gap to leader computation
# -----------------------------
leader_times = (
    lap_frame
    .groupby(["race_id", "lap_number"], sort=False)["cumulative_time_ms"]
    .min()
    .rename("leader_time_ms")
    .reset_index()
)

lap_frame = lap_frame.merge(
    leader_times,
    on=["race_id", "lap_number"],
    how="left",
    validate="many_to_one"
)

lap_frame["gap_to_leader_ms"] = (
    lap_frame["cumulative_time_ms"] - lap_frame["leader_time_ms"]
)

# -----------------------------
# Gap integrity checks
# -----------------------------
if (lap_frame["gap_to_leader_ms"] < 0).any():
    raise RuntimeError(
        "Negative gap_to_leader_ms detected ‚Äî leader alignment broken"
    )

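# -----------------------------
# Lap-to-lap delta computation
# -----------------------------
# Difference between a lap's time and the same driver's previous lap time.
# Differencing cumulative_time_ms would merely reproduce the lap time itself,
# so the delta is taken on the lap times directly. The NaN-filled column keeps
# the delta defined everywhere except each driver's first lap.
lap_frame["delta_prev_lap_ms"] = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["lap_time_ms_filled"]
    .diff()
)

# -----------------------------
# Delta integrity checks
# -----------------------------
# A lap may legitimately be faster than the previous one, so negative deltas
# are valid. Only the first row of each race-driver sequence may be undefined.
first_in_group = ~lap_frame.duplicated(subset=["race_id", "driver_code"])
if (lap_frame["delta_prev_lap_ms"].isna() != first_in_group).any():
    raise RuntimeError(
        "Unexpected NaN pattern in delta_prev_lap_ms - temporal ordering broken"
    )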

logger.info(
    "Relative timing features computed successfully ‚Äî "
    "gap and delta features ready"
)


2025-12-18 15:03:07,879 | INFO | src.logging_config | Computing gap to leader and lap-to-lap deltas
2025-12-18 15:03:07,958 | INFO | src.logging_config | Relative timing features computed successfully - gap and delta features ready


In [5]:
# ============================================================
# Cell 5 - Track Status Alignment (Event → Lap Windows)
# ============================================================

from src.identity import parse_race_identity
from src.utils import normalize_column_names

logger.info("Discovering standardized per-race track status files")

track_status_files = list(
    Config.STANDARDIZED_DATA_DIR.rglob("track_status.parquet")
)

if not track_status_files:
    raise FileNotFoundError(
        "No track_status.parquet files found under standardized data directory"
    )

logger.info(
    f"Discovered {len(track_status_files)} track_status.parquet files"
)

track_status_dfs = []

for path in track_status_files:
    season, round_, race_id = parse_race_identity(path.parent)

    df = pd.read_parquet(path)
    df = normalize_column_names(df)

    # -----------------------------
    # Resolve time column
    # -----------------------------
    if "time" in df.columns:
        time_col = "time"
    elif "time_ms" in df.columns:
        time_col = "time_ms"
    else:
        raise ValueError(
            f"{path} does not contain a recognizable time column"
        )

    # -----------------------------
    # Normalize time to integer ms
    # -----------------------------
    if pd.api.types.is_timedelta64_dtype(df[time_col]):
        # Round before casting so floating-point error cannot drop a millisecond
        df["time_ms"] = (
            df[time_col]
            .dt.total_seconds()
            .mul(1000)
            .round()
            .astype("int64")
        )
    elif pd.api.types.is_numeric_dtype(df[time_col]):
        df["time_ms"] = df[time_col].astype("int64")
    else:
        raise TypeError(
            f"{path} contains unsupported time dtype: {df[time_col].dtype}"
        )

    if "track_status" not in df.columns:
        raise ValueError(
            f"{path} does not contain 'track_status' column"
        )

    df["race_id"] = race_id

    validate_dataframe_columns(
        df,
        {"race_id", "time_ms", "track_status"},
        df_name=f"track_status_df ({path})"
    )

    track_status_dfs.append(
        df[["race_id", "time_ms", "track_status"]]
    )

track_status_df = pd.concat(
    track_status_dfs,
    ignore_index=True
)

logger.info(
    f"Total track status events loaded: {len(track_status_df):,}"
)

track_status_df = track_status_df.sort_values(
    by=["race_id", "time_ms"],
    ascending=[True, True]
).reset_index(drop=True)

# -----------------------------
# Initialize lap-level flags
# -----------------------------
lap_frame["is_green_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")
lap_frame["is_sc_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")
lap_frame["is_vsc_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")
lap_frame["is_red_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")

logger.info("Aligning track status events to lap windows")

# -----------------------------
# Event → lap window alignment
# -----------------------------
for race_id, race_events in track_status_df.groupby("race_id"):
    laps = lap_frame.loc[lap_frame["race_id"] == race_id]

    if laps.empty:
        continue

    for _, event in race_events.iterrows():
        event_time = event["time_ms"]
        status = event["track_status"]

        overlap_mask = (
            (laps["lap_start_time_ms"] <= event_time) &
            (event_time < laps["lap_end_time_ms"])
        )

        idx = laps.index[overlap_mask]

        if status == "GREEN":
            lap_frame.loc[idx, "is_green_lap"] = True
        elif status == "SC":
            lap_frame.loc[idx, "is_sc_lap"] = True


2025-12-18 15:03:07,977 | INFO | src.logging_config | Discovering standardized per-race track status files
2025-12-18 15:03:07,987 | INFO | src.logging_config | Discovered 68 track_status.parquet files
2025-12-18 15:03:08,479 | INFO | src.logging_config | Total track status events loaded: 789
2025-12-18 15:03:08,485 | INFO | src.logging_config | Aligning track status events to lap windows


In [6]:
# ============================================================
# Cell 6 - Pit Lap, Out Lap & Stint Identification
# ============================================================

"""
This cell labels pit structure.

Responsibilities:
• Identify pit laps
• Identify out laps
• Assign stint_id per driver per race

No strategy logic. No filtering.
"""

logger.info("Identifying pit laps, out laps, and stints")

# -----------------------------
# Configuration
# -----------------------------
PIT_LAP_THRESHOLD_MS = 30_000

# -----------------------------
# Initialize flags
# -----------------------------
lap_frame["is_pit_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")
lap_frame["is_out_lap"] = pd.Series(False, index=lap_frame.index, dtype="bool")

# -----------------------------
# Pit lap detection
# -----------------------------
lap_frame.loc[
    lap_frame["delta_prev_lap_ms"] >= PIT_LAP_THRESHOLD_MS,
    "is_pit_lap"
] = True

# ------------------------------------------------------------
# Out lap detection (per-driver shift on an int8 copy of the pit flag)
# ------------------------------------------------------------
prev_pit_int = (
    lap_frame
    .assign(is_pit_int=lap_frame["is_pit_lap"].astype("int8"))
    .groupby(["race_id", "driver_code"], sort=False)["is_pit_int"]
    .shift(1)
    .fillna(0)
)

lap_frame.loc[prev_pit_int == 1, "is_out_lap"] = True

# -----------------------------
# Stint identification
# -----------------------------
lap_frame["stint_id"] = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["is_pit_lap"]
    .cumsum()
)

# -----------------------------
# Defensive validation
# -----------------------------
if lap_frame["stint_id"].isna().any():
    raise RuntimeError("stint_id contains NaNs ‚Äî stint construction failed")

logger.info(
    "Pit structure identified successfully ‚Äî "
    "pit laps, out laps, and stints labeled"
)


2025-12-18 15:03:09,444 | INFO | src.logging_config | Identifying pit laps, out laps, and stints
2025-12-18 15:03:09,494 | INFO | src.logging_config | Pit structure identified successfully - pit laps, out laps, and stints labeled


In [7]:
# ============================================================
# Cell 7 - Feature Validation & Invariant Enforcement
# ============================================================

"""
This cell enforces the HARD invariants guaranteed by Notebook 04.

Purpose:
• Detect silent analytical corruption
• Enforce structural and temporal contracts
• Block progression if Notebook 04 guarantees are violated

This cell MUST NOT:
• create new columns
• modify values
• infer semantics
• filter laps
• define analytical usability

If a check fails here, the pipeline is BROKEN, not ambiguous.
"""

logger.info("Running final feature validation & invariant enforcement (Notebook 04)")

# ------------------------------------------------------------
# 1. Lap grain integrity (ABSOLUTE)
# ------------------------------------------------------------
lap_key = ["race_id", "driver_code", "lap_number"]

if lap_frame.duplicated(subset=lap_key).any():
    raise RuntimeError(
        "Invariant violation: duplicate (race_id, driver_code, lap_number) detected"
    )

logger.info("Invariant 1 passed ‚Äî lap grain is unique")

# ------------------------------------------------------------
# 2. Temporal window integrity
# ------------------------------------------------------------
if (lap_frame["lap_start_time_ms"] < 0).any():
    raise RuntimeError(
        "Invariant violation: negative lap_start_time_ms detected"
    )

if (lap_frame["lap_end_time_ms"] < lap_frame["lap_start_time_ms"]).any():
    raise RuntimeError(
        "Invariant violation: lap_end_time_ms precedes lap_start_time_ms"
    )

logger.info("Invariant 2 passed ‚Äî lap time windows are valid")

# ------------------------------------------------------------
# 3. Cumulative time monotonicity (per driver, per race)
# ------------------------------------------------------------
cum_diff = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["cumulative_time_ms"]
    .diff()
)

if (cum_diff < 0).any():
    raise RuntimeError(
        "Invariant violation: cumulative_time_ms decreases within a race-driver sequence"
    )

logger.info("Invariant 3 passed ‚Äî cumulative time is monotonic")

# ------------------------------------------------------------
# 4. Gap-to-leader invariants
# ------------------------------------------------------------
if (lap_frame["gap_to_leader_ms"] < 0).any():
    raise RuntimeError(
        "Invariant violation: negative gap_to_leader_ms detected"
    )

leaders_per_lap = (
    lap_frame
    .groupby(["race_id", "lap_number"], sort=False)["gap_to_leader_ms"]
    .apply(lambda x: (x == 0).sum())
)

if (leaders_per_lap < 1).any():
    raise RuntimeError(
        "Invariant violation: lap with no leader (no zero gap)"
    )

logger.info("Invariant 4 passed ‚Äî gap-to-leader is consistent")

# ------------------------------------------------------------
# 5. Track status MECHANICAL integrity
# ------------------------------------------------------------
track_flags = [
    "is_green_lap",
    "is_sc_lap",
    "is_vsc_lap",
    "is_red_lap",
]

# 5a. Flags must exist
for col in track_flags:
    if col not in lap_frame.columns:
        raise RuntimeError(
            f"Invariant violation: missing track status column '{col}'"
        )

# 5b. Flags must be boolean
for col in track_flags:
    if lap_frame[col].dtype != bool:
        raise RuntimeError(
            f"Invariant violation: {col} is not boolean dtype"
        )

# 5c. Flags must not contain NaNs
if lap_frame[track_flags].isna().any().any():
    raise RuntimeError(
        "Invariant violation: NaNs detected in track status flags"
    )

logger.info("Invariant 5 passed ‚Äî track status flags are mechanically sound")

# ------------------------------------------------------------
# 6. Pit / out-lap structural integrity
# ------------------------------------------------------------
# Out lap must never occur on lap 1
if lap_frame.loc[lap_frame["lap_number"] == 1, "is_out_lap"].any():
    raise RuntimeError(
        "Invariant violation: out lap detected on lap 1"
    )

# Out lap must immediately follow a pit lap
prev_pit = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["is_pit_lap"]
    .shift(1)
)

bad_outlaps = lap_frame.loc[
    lap_frame["is_out_lap"] & (prev_pit != True)
]

if not bad_outlaps.empty:
    raise RuntimeError(
        "Invariant violation: is_out_lap without preceding pit lap"
    )

logger.info("Invariant 6 passed ‚Äî pit / out-lap structure is valid")

# ------------------------------------------------------------
# 7. Stint ID integrity
# ------------------------------------------------------------
stint_diff = (
    lap_frame
    .groupby(["race_id", "driver_code"], sort=False)["stint_id"]
    .diff()
)

if (stint_diff < 0).any():
    raise RuntimeError(
        "Invariant violation: stint_id decreases within a race-driver sequence"
    )

logger.info("Invariant 7 passed ‚Äî stint_id is monotonic")

# ------------------------------------------------------------
# 8. Derived-feature null safety (precise)
# ------------------------------------------------------------

# Columns that MUST be defined for every lap
must_be_complete = [
    "lap_start_time_ms",
    "lap_end_time_ms",
    "cumulative_time_ms",
    "gap_to_leader_ms",
    "stint_id",
]

nulls = lap_frame[must_be_complete].isna().any()

if nulls.any():
    raise RuntimeError(
        f"Invariant violation: NaNs detected in mandatory derived columns: "
        f"{nulls[nulls].index.tolist()}"
    )

# delta_prev_lap_ms is EXPECTED to be NaN for lap 1 per driver
# Explicitly assert that NaNs only occur where lap_number == 1
bad_delta = lap_frame.loc[
    lap_frame["delta_prev_lap_ms"].isna() &
    (lap_frame["lap_number"] != 1)
]

if not bad_delta.empty:
    raise RuntimeError(
        "Invariant violation: delta_prev_lap_ms is NaN outside lap_number == 1"
    )

logger.info("Invariant 8 passed ‚Äî derived feature null safety enforced")

# ------------------------------------------------------------
# FINAL
# ------------------------------------------------------------
logger.info(
    "Notebook 04 VALIDATION COMPLETE ‚Äî "
    "all guaranteed invariants satisfied, data is structurally safe"
)


2025-12-18 15:03:09,513 | INFO | src.logging_config | Running final feature validation & invariant enforcement (Notebook 04)
2025-12-18 15:03:09,531 | INFO | src.logging_config | Invariant 1 passed - lap grain is unique
2025-12-18 15:03:09,533 | INFO | src.logging_config | Invariant 2 passed - lap time windows are valid
2025-12-18 15:03:09,551 | INFO | src.logging_config | Invariant 3 passed - cumulative time is monotonic
2025-12-18 15:03:10,017 | INFO | src.logging_config | Invariant 4 passed - gap-to-leader is consistent
2025-12-18 15:03:10,020 | INFO | src.logging_config | Invariant 5 passed - track status flags are mechanically sound
2025-12-18 15:03:10,037 | INFO | src.logging_config | Invariant 6 passed - pit / out-lap structure is valid
2025-12-18 15:03:10,057 | INFO | src.logging_config | Invariant 7 passed - stint_id is monotonic
2025-12-18 15:03:10,063 | INFO | src.logging_config | Invariant 8 passed - derived feature null safety enforced
2025-12-18 15:03:10,0

In [8]:
# ============================================================
# Notebook 04 - Cell 8
# Persist Lap-Level Analytical Features (PostgreSQL)
# ============================================================

"""
This cell persists the validated output of Notebook 04
into PostgreSQL as a new analytical table.

Design guarantees:
• Idempotent execution
• No mutation of raw tables
• One row per (race_id, driver_code, lap_number)
• Only runs AFTER invariant validation has passed
• Downstream notebooks (Notebook 05+) are read-only consumers

If this cell fails, Notebook 04 is considered incomplete.
"""

# -----------------------------
# Configuration
# -----------------------------
TARGET_TABLE = "lap_features"

# Explicit column order (schema contract)
FEATURE_COLUMNS = [
    # Identity / grain
    "race_id",
    "driver_code",
    "lap_number",

    # Temporal features
    "lap_start_time_ms",
    "lap_end_time_ms",
    "cumulative_time_ms",

    # Relative timing
    "gap_to_leader_ms",
    "delta_prev_lap_ms",

    # Track status (mechanical overlay)
    "is_green_lap",
    "is_sc_lap",
    "is_vsc_lap",
    "is_red_lap",

    # Pit / stint structure
    "is_pit_lap",
    "is_out_lap",
    "stint_id",
]

logger.info("Preparing to persist lap-level analytical features")

# -----------------------------
# Final defensive checks
# -----------------------------
# These should never fail if Cell 7 passed, but act as a last guardrail

missing = set(FEATURE_COLUMNS) - set(lap_frame.columns)
if missing:
    raise RuntimeError(
        f"Persistence aborted - missing expected feature columns: {sorted(missing)}"
    )

if lap_frame.duplicated(
    subset=["race_id", "driver_code", "lap_number"]
).any():
    raise RuntimeError(
        "Persistence aborted - lap grain duplication detected"
    )

logger.info("Final pre-persistence checks passed")

# -----------------------------
# Prepare frame for persistence
# -----------------------------
persist_df = lap_frame[FEATURE_COLUMNS].copy()

# Enforce deterministic ordering (not required by DB, but useful)
persist_df = persist_df.sort_values(
    by=["race_id", "driver_code", "lap_number"],
    kind="mergesort"
)

logger.info(
    f"Persisting lap features ‚Äî rows: {len(persist_df):,}, "
    f"races: {persist_df['race_id'].nunique()}, "
    f"drivers: {persist_df['driver_code'].nunique()}"
)

# -----------------------------
# Idempotent write to PostgreSQL
# -----------------------------
# Strategy:
# • Replace entire table atomically
# • Notebook 04 is the single writer
# • Downstream notebooks are read-only

with engine.begin() as connection:
    persist_df.to_sql(
        TARGET_TABLE,
        connection,
        if_exists="replace",   # idempotent, deterministic
        index=False,
        method="multi"
    )

logger.info(
    f"Lap-level analytical features persisted successfully "
    f"to PostgreSQL table '{TARGET_TABLE}'"
)

# -----------------------------
# Post-write verification
# -----------------------------
row_count = pd.read_sql(
    f"SELECT COUNT(*) AS n FROM {TARGET_TABLE}",
    engine
)["n"].iloc[0]

if row_count != len(persist_df):
    raise RuntimeError(
        f"Post-write verification failed - expected {len(persist_df)} rows, "
        f"found {row_count}"
    )

logger.info(
    "Post-write verification passed - row counts match exactly"
)

# -----------------------------
# FINAL SEAL
# -----------------------------
logger.info(
    "Notebook 04 COMPLETE - lap-level features are persisted, "
    "validated, and ready for downstream analysis"
)

# IMPORTANT:
# Notebook 05 and beyond MUST read from `lap_features`.
# No downstream notebook is allowed to recompute these features.


2025-12-18 16:07:46,287 | INFO | src.logging_config | Preparing to persist lap-level analytical features
2025-12-18 16:07:46,353 | INFO | src.logging_config | Final pre-persistence checks passed
2025-12-18 16:07:46,454 | INFO | src.logging_config | Persisting lap features - rows: 74,605, races: 68, drivers: 28
2025-12-18 16:08:43,227 | INFO | src.logging_config | Lap-level analytical features persisted successfully to PostgreSQL table 'lap_features'
2025-12-18 16:08:43,256 | INFO | src.logging_config | Post-write verification passed - row counts match exactly
2025-12-18 16:08:43,258 | INFO | src.logging_config | Notebook 04 COMPLETE - lap-level features are persisted, validated, and ready for downstream analysis


# 🏁 Notebook 04: Conclusion & Learnings

---

## 🧠 What This Notebook Accomplished

Notebook 04 successfully transformed our project from:

> **“Relationally correct lap data”**  
into  
> **“Temporally aligned, structurally validated, strategy-ready race data.”**

This notebook introduced *time*, *context*, and *structural meaning*, without introducing *interpretation*.

That distinction turned out to be far more subtle than expected.

---

## 🧱 What We Built (Final State)

By the end of this notebook, every lap row now contains:

### ⏱️ Temporal Features
- `lap_start_time_ms`
- `lap_end_time_ms`
- `cumulative_time_ms`

### 📊 Relative Timing
- `gap_to_leader_ms`
- `delta_prev_lap_ms` (explicitly undefined where appropriate)

### 🚦 Track Context
- `is_green_lap`
- `is_sc_lap`
- `is_vsc_lap`
- `is_red_lap`

*(Applied via conservative overlap logic: no inference, no forward fill)*

### 🛠️ Structural Race Phases
- `is_pit_lap`
- `is_out_lap`
- `stint_id`

All features preserve **lap grain exactly**.
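
To make the temporal and relative-timing columns concrete, here is a minimal sketch of how they could be derived from per-lap durations. It is an illustration under stated assumptions, not the exact code used earlier in this notebook: the input column `lap_time_ms`, the toy driver codes, and the choice of "leader" as the smallest cumulative time on each lap number are all assumptions.

```python
import pandas as pd

# Toy input: one race, two drivers, lap durations in milliseconds.
# `lap_time_ms` is an assumed input column name, used only for illustration.
laps = pd.DataFrame({
    "race_id":     [1, 1, 1, 1, 1, 1],
    "driver_code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "lap_number":  [1, 2, 3, 1, 2, 3],
    "lap_time_ms": [90_000, 88_000, 87_500, 91_000, 88_500, 87_000],
})

laps = laps.sort_values(["race_id", "driver_code", "lap_number"], kind="mergesort")
driver_key = ["race_id", "driver_code"]

# Race clock at the end of each lap = running sum of that driver's lap times.
laps["cumulative_time_ms"] = laps.groupby(driver_key)["lap_time_ms"].cumsum()
laps["lap_end_time_ms"] = laps["cumulative_time_ms"]
laps["lap_start_time_ms"] = laps["lap_end_time_ms"] - laps["lap_time_ms"]

# Gap to leader (assumed definition): difference to the smallest cumulative
# time recorded for the same lap number in the same race.
leader_time = laps.groupby(["race_id", "lap_number"])["cumulative_time_ms"].transform("min")
laps["gap_to_leader_ms"] = laps["cumulative_time_ms"] - leader_time

# Delta to the previous lap: left as NaN on each driver's first lap,
# because there is no previous lap to compare against.
laps["delta_prev_lap_ms"] = laps.groupby(driver_key)["lap_time_ms"].diff()

print(laps)
```

Every step is a grouped transform, so the number of rows never changes and the lap grain is preserved.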

---

## ⚠️ Problems Encountered (And Why They Mattered)

This notebook surfaced **multiple non-obvious failure modes**, none of which were trivial bugs.

---

### ❌ Track Status Assumptions Were Wrong 🚦

Initial assumptions treated track status as:
- mutually exclusive per lap
- continuous over time
- analytically meaningful at construction time

Reality:
- Track status is **event-based**
- Events may overlap lap windows
- Some laps legitimately overlap **no events**
- Ambiguity is inherent

👉 **Resolution**: Treat track status as a mechanical overlay only.
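
A mechanical overlay of this kind can be sketched as a plain interval-intersection test. The example below is a simplified illustration, not the notebook's actual implementation: the `events` frame, its column names (`status`, `start_ms`, `end_ms`), and the half-open window convention are assumptions.

```python
import pandas as pd

laps = pd.DataFrame({
    "lap_number":        [1, 2, 3, 4],
    "lap_start_time_ms": [0, 90_000, 180_000, 270_000],
    "lap_end_time_ms":   [90_000, 180_000, 270_000, 360_000],
})

# Hypothetical status event windows on the same race clock.
events = pd.DataFrame({
    "status":   ["sc", "vsc"],
    "start_ms": [100_000, 170_000],
    "end_ms":   [185_000, 175_000],
})

def overlaps_any(lap_start, lap_end, windows):
    """True if the half-open lap window [lap_start, lap_end) intersects any event window."""
    hit = (windows["start_ms"] < lap_end) & (windows["end_ms"] > lap_start)
    return bool(hit.any())

for status in ("sc", "vsc"):
    windows = events[events["status"] == status]
    # One flag per status, set purely from overlap: nothing is inferred or forward-filled.
    laps[f"is_{status}_lap"] = [
        overlaps_any(start, end, windows)
        for start, end in zip(laps["lap_start_time_ms"], laps["lap_end_time_ms"])
    ]

print(laps)
```

In this toy data, lap 2 carries both flags and laps 1 and 4 carry none, which is exactly the kind of ambiguity the overlay leaves visible for later notebooks to resolve.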

---

### ❌ Boolean Dtype & Pandas Semantics 🐼

We encountered:
- silent dtype coercion
- future deprecation warnings
- unexpected behavior from boolean shifts

These issues did not break the logic, but **would have broken the pipeline in the future**.

👉 **Resolution**: Enforced explicit dtype handling and Pandas-safe patterns.
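
The boolean-shift pitfall is easy to reproduce. The sketch below shows one Pandas-safe pattern. It assumes, for illustration only, that an out lap is the lap immediately following a pit lap for the same driver and that a new stint begins on each out lap; the notebook's actual definitions may differ.

```python
import pandas as pd

laps = pd.DataFrame({
    "driver_code": ["AAA"] * 4 + ["BBB"] * 3,
    "lap_number":  [1, 2, 3, 4, 1, 2, 3],
    "is_pit_lap":  [False, True, False, False, False, False, True],
})

# A plain shift introduces a missing value, so the bool column is silently
# upcast to a non-boolean dtype. It also leaks values across drivers.
naive = laps["is_pit_lap"].shift(1)
print(naive.dtype)

# Safer: shift within each driver, fill the gap explicitly, and state the dtype.
laps["is_out_lap"] = (
    laps.groupby("driver_code")["is_pit_lap"]
    .shift(1, fill_value=False)
    .astype(bool)
)

# Assumed stint rule: a new stint starts on every out lap, numbered from 1 per driver.
laps["stint_id"] = (
    laps["is_out_lap"].astype("int64").groupby(laps["driver_code"]).cumsum() + 1
)

print(laps)
```

Stating `fill_value` and the final dtype explicitly means the column's type no longer depends on Pandas' default upcasting behavior.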

---

### ❌ Misdefined “Sanity Checks” 🔍

Several early validation attempts failed, not because the data was wrong, but because the *invariants were wrong*.

Examples:
- Expecting exactly one track status per lap
- Expecting every lap to have a status
- Expecting no NaNs in mathematically undefined features

👉 **Key insight**:
> *A sanity check that fails on valid data is not a sanity check.*

We refined invariants until they matched **both reality and design**.
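
The difference between an over-strict invariant and a refined one can be illustrated on a toy frame. This is a sketch rather than the validation cell itself; the sample values are invented for the example.

```python
import numpy as np
import pandas as pd

laps = pd.DataFrame({
    "driver_code":       ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "lap_number":        [1, 2, 3, 1, 2],
    "delta_prev_lap_ms": [np.nan, -2000.0, -500.0, np.nan, 300.0],
    "is_green_lap":      [True, False, False, True, True],
    "is_sc_lap":         [False, True, False, False, False],
})

# Over-strict invariant: "no NaNs anywhere". It fails on perfectly valid data.
too_strict_ok = not laps["delta_prev_lap_ms"].isna().any()

# Refined invariant: NaN exactly where the feature is mathematically undefined,
# i.e. on each driver's first lap, and nowhere else.
first_lap = laps["lap_number"] == laps.groupby("driver_code")["lap_number"].transform("min")
refined_ok = bool((laps["delta_prev_lap_ms"].isna() == first_lap).all())

# Refined status invariant: flags must be real booleans, but a lap may carry
# zero, one, or several of them.
status_cols = ["is_green_lap", "is_sc_lap"]
flags_ok = all(laps[col].dtype == bool for col in status_cols)
laps_without_status = int((~laps[status_cols].any(axis=1)).sum())

print(too_strict_ok, refined_ok, flags_ok, laps_without_status)
```

The refined checks still catch real defects, such as a NaN in the middle of a stint or a non-boolean flag, without rejecting laps that legitimately have no status or no previous lap.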

---

## 🔒 Final Validation (Why This Notebook Is Now Sealed)

The final validation cell enforces **only guarantees that Notebook 04 explicitly claims**:

- ✔ Unique lap grain
- ✔ Monotonic time
- ✔ Valid lap windows
- ✔ Consistent gap-to-leader logic
- ✔ Mechanically valid track status flags
- ✔ Coherent pit / out-lap structure
- ✔ Monotonic stints
- ✔ Correct handling of undefined values

No data was altered to make checks pass.

Instead, the **contract itself was clarified and encoded**.
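
A few of these guarantees can be encoded as a small checker that reports violations without modifying the data. The function below is a hypothetical sketch covering four of the eight checks. The name `find_violations` and the exact conditions are assumptions, not the notebook's validation cell.

```python
import pandas as pd

GRAIN = ["race_id", "driver_code", "lap_number"]

def find_violations(df: pd.DataFrame) -> list[str]:
    """Return a description of each broken guarantee; an empty list means all hold."""
    problems = []
    per_driver = df.sort_values(GRAIN, kind="mergesort").groupby(["race_id", "driver_code"])

    if df.duplicated(subset=GRAIN).any():
        problems.append("duplicate lap grain")
    if (df["lap_end_time_ms"] <= df["lap_start_time_ms"]).any():
        problems.append("empty or inverted lap window")
    if (per_driver["cumulative_time_ms"].diff().dropna() <= 0).any():
        problems.append("non-monotonic cumulative time")
    if (per_driver["stint_id"].diff().dropna() < 0).any():
        problems.append("stint_id decreases within a driver")
    return problems

good = pd.DataFrame({
    "race_id":            [1, 1, 1],
    "driver_code":        ["AAA", "AAA", "AAA"],
    "lap_number":         [1, 2, 3],
    "lap_start_time_ms":  [0, 90_000, 178_000],
    "lap_end_time_ms":    [90_000, 178_000, 265_500],
    "cumulative_time_ms": [90_000, 178_000, 265_500],
    "stint_id":           [1, 1, 2],
})

# Corrupt one value to show that the checker reports it instead of repairing it.
bad = good.copy()
bad.loc[2, "cumulative_time_ms"] = 100_000

print(find_violations(good))
print(find_violations(bad))
```

Returning a list of violations, rather than fixing rows in place, keeps the checker consistent with the rule that no data is altered to make checks pass.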

---

## 🧠 What We Now Understand About Our Data

After Notebook 04:

- The data is **structurally sound**
- Time is **explicit and reliable**
- Ambiguity is **visible, not hidden**
- Nothing has been inferred prematurely
- All later analytical choices can be **made consciously**

This is the difference between:
> *“clean data”*  
and  
> *“defensible data.”*

---

## 🚀 Next Steps: Notebook 05

With Notebook 04 sealed, we can now safely proceed to:

### 📓 Notebook 05: Undercut Definition, Detection & Evaluation

That notebook will:
- Define what *counts* as an undercut
- Define what *counts* as a competitive lap
- Resolve track status ambiguity explicitly
- Measure undercut outcomes
- Aggregate results across races and seasons

Crucially:
> **Notebook 05 will contain interpretation, because Notebook 04 made it safe to do so.**

---

## 🏁 Final Reflection

Notebook 04 turned out to be the **most architecturally important notebook in the project**.

Not because of complexity,  
but because it forced us to confront:

- hidden assumptions
- ambiguous data
- false invariants
- premature interpretation

By resolving these now, we have ensured that **any conclusion about the undercut is grounded in reality, not convenience**.

The system is now ready for strategy analysis, *properly*.
