# Scientific Validation & Feature Interpretability

This notebook is a defense document for the seminar paper. It addresses common examiner questions about data integrity, leakage, feature validity, model choice, and error behavior.

**Citations:** FastF1 documentation (FastF1, 2024) and Theisen (2021) are referenced in relevant sections as required by the FOM guidelines.


In [None]:
# Ensure repo root is on sys.path so `src` is importable
import sys
from pathlib import Path

ROOT = Path.cwd().resolve().parent if Path.cwd().name == 'notebooks' else Path.cwd().resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))
print(f'Using project root: {ROOT}')


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

try:
    import seaborn as sns
    _SEABORN_OK = True
except ImportError:
    _SEABORN_OK = False
    print("seaborn not installed; seaborn-based plots will be skipped. Run: pip install seaborn")

from pathlib import Path

from src.data import load_laps_for_seasons, clean_laps
from _common import load_dataset, prepare_features, ROOT
from src.split import SplitConfig
from src.eval import compute_metrics
from src.models import make_model_registry, build_search
import joblib

np.random.seed(42)


## 1. Data Integrity (Defense: *Is your data clean?*)

**Visual statistics:** A before/after cleaning summary with exact counts for the key removal rules.

**Outlier justification:** Lap-time distributions before and after cleaning, showing that extreme safety-car (SC) and incident laps are removed.

Citations: FastF1 documentation (FastF1, 2024).
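
Because the cleaning rules are applied one after another, each reported count is the number of laps removed at that step. A lap that violates several rules is therefore attributed to the first rule that catches it. The following is a minimal sketch of this effect on made-up toy data; the `count_removed` helper and all values are hypothetical and are not part of `clean_laps`.

```python
import pandas as pd

NAN = float("nan")

# Toy laps with made-up values (not FastF1 data).
laps = pd.DataFrame({
    "IsAccurate":  [True, False, True, False, True, True],
    "PitInTime":   [NAN, 12.5, 30.1, NAN, NAN, NAN],
    "TrackStatus": ["1", "4", "1", "1", "26", "1"],
})

def count_removed(df, rules):
    """Apply (name, keep_mask_fn) rules in order and count rows removed per rule."""
    counts = {}
    for name, keep_fn in rules:
        keep = keep_fn(df)
        counts[name] = int((~keep).sum())
        df = df[keep]
    return counts, df

rules = [
    ("IsAccurate=False", lambda d: d["IsAccurate"].eq(True)),
    ("Pit-In", lambda d: d["PitInTime"].isna()),
    # Same character-class pattern as the notebook cell: any status code containing 4, 5, 6 or 7.
    ("SC/VSC/RedFlag", lambda d: ~d["TrackStatus"].astype(str).str.contains("[4567]", na=False)),
]

counts, kept = count_removed(laps, rules)
counts_rev, kept_rev = count_removed(laps, rules[::-1])

print(counts)      # {'IsAccurate=False': 2, 'Pit-In': 1, 'SC/VSC/RedFlag': 1}
print(counts_rev)  # {'SC/VSC/RedFlag': 2, 'Pit-In': 1, 'IsAccurate=False': 1}
```

In this toy case the per-rule counts change with the rule order, while the final set of kept laps stays the same. This is why the per-rule counts are only meaningful when computed in the same order as the cleaning function.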


In [None]:
YEARS = [2022, 2023, 2024, 2025]
EXCLUDE_LAP1 = False

# Load raw laps (uses FastF1 cache when available)
raw = load_laps_for_seasons(YEARS)
raw = raw.copy()
raw["LapTimeSeconds"] = raw["LapTime"].dt.total_seconds()

# Use clean_laps to execute the official cleaning rules and print stats
clean = clean_laps(raw, exclude_lap1=EXCLUDE_LAP1, verbose=True)
clean = clean.copy()
clean["LapTimeSeconds"] = clean["LapTime"].dt.total_seconds()

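# Recompute per-rule counts in the same order as clean_laps.
# Each count is the number of laps removed at that step, so it depends on the rule order.
# Only rules called with record=True are stored in filter_counts.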
filter_counts = {}

def _apply_filter(df, mask, name, record=False):
    removed = int((~mask).sum())
    if record:
        filter_counts[name] = removed
    return df[mask]

_tmp = raw.copy()
if "LapTime" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["LapTime"].notna(), "No LapTime")
if "IsAccurate" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["IsAccurate"] == True, "IsAccurate=False", True)  # noqa: E712
if "PitOutTime" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["PitOutTime"].isna(), "Pit-Out")
if "PitInTime" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["PitInTime"].isna(), "Pit-In", True)
if "TrackStatus" in _tmp.columns:
    safety_mask = _tmp["TrackStatus"].astype(str).str.contains("[4567]", na=False)
    _tmp = _apply_filter(_tmp, ~safety_mask, "SC/VSC/RedFlag", True)
if "Deleted" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["Deleted"] != True, "Deleted")  # noqa: E712
if "LapNumber" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["LapNumber"] > 0, "Formation Lap")
if EXCLUDE_LAP1 and "LapNumber" in _tmp.columns:
    _tmp = _apply_filter(_tmp, _tmp["LapNumber"] > 1, "Lap 1 (Standing Start)")

# Summary table
summary = pd.DataFrame({
    "Stage": ["Before cleaning", "After cleaning"],
    "Lap count": [len(raw), len(clean)],
})
summary


In [None]:
# Key removal counts (exact reasons requested)
key_counts = pd.DataFrame({
    "Reason": ["IsAccurate=False", "Pit-In", "SC/VSC/RedFlag"],
    "Laps removed": [
        filter_counts.get("IsAccurate=False", 0),
        filter_counts.get("Pit-In", 0),
        filter_counts.get("SC/VSC/RedFlag", 0),
    ],
})
key_counts


In [None]:
# Visual summary: before vs after cleaning
fig = px.bar(
    summary,
    x="Stage",
    y="Lap count",
    color="Stage",
    title="Before vs After Cleaning: Lap Count",
    labels={"Stage": "Dataset stage", "Lap count": "Number of laps"},
)
fig.show()


In [None]:
# Outlier justification: histogram before vs after cleaning
fig = go.Figure()
fig.add_histogram(
    x=raw["LapTimeSeconds"].dropna(),
    name="Before",
    opacity=0.6,
    nbinsx=80,
)
fig.add_histogram(
    x=clean["LapTimeSeconds"].dropna(),
    name="After",
    opacity=0.6,
    nbinsx=80,
)
fig.update_layout(
    title="LapTimeSeconds Distribution Before vs After Cleaning",
    xaxis_title="Lap time (seconds)",
    yaxis_title="Count",
    barmode="overlay",
    legend_title="Dataset",
)
fig.show()


## 2. Temporal Splitting (Defense: *Did you cheat with Data Leakage?*)

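**Methodological note:** A chronological split is mandatory because F1 data are sequential: car setups and track conditions evolve from session to session. A random shuffle would allow the model to "see" future car setups and track conditions when predicting earlier laps. That is data leakage, and it makes the reported test error scientifically invalid.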

Citations: FastF1 documentation (FastF1, 2024); Theisen (2021).
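
The leakage argument can be checked mechanically: after a chronological split, the latest training session must precede the earliest test session. The sketch below illustrates this on synthetic data. The `chronological_split` helper, the cutoff date, and the toy values are assumptions for illustration; the notebook's actual split is configured through `SplitConfig`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned laps: five laps per session date.
rng = np.random.default_rng(42)
dates = pd.to_datetime(["2022-03-20", "2023-03-05", "2024-03-02", "2025-03-16"])
toy = pd.DataFrame({
    "SessionDate": dates.repeat(5),
    "LapTimeSeconds": rng.normal(90.0, 1.0, size=20),
})

def chronological_split(df, cutoff, date_col="SessionDate"):
    """Rows strictly before `cutoff` go to train; all other rows go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

train, test = chronological_split(toy, "2024-01-01")

# Leakage check: every training session precedes every test session.
assert train["SessionDate"].max() < test["SessionDate"].min()

# A random shuffle offers no such guarantee: training rows can postdate test rows.
shuffled = toy.sample(frac=1.0, random_state=42)
rand_train, rand_test = shuffled.iloc[:10], shuffled.iloc[10:]
leaks = rand_train["SessionDate"].max() >= rand_test["SessionDate"].min()
print(f"Random split mixes past and future sessions: {leaks}")
```

The same assertion could be applied to the real train and test frames as a guard against accidental leakage.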


In [None]:
# Timeline plot: Train (2022-2023) vs Test (2024-2025)
if "SessionDate" not in clean.columns:
    raise KeyError("SessionDate not found in laps; expected from FastF1 session load.")

timeline = clean[["SessionDate", "LapTimeSeconds", "Season"]].dropna().copy()
timeline["Split"] = np.where(timeline["Season"] <= 2023, "Train (2022-2023)", "Test (2024-2025)")

# Sample for visualization clarity
sample_n = min(50000, len(timeline))
plot_df = timeline.sample(sample_n, random_state=42)

fig = px.scatter(
    plot_df,
    x="SessionDate",
    y="LapTimeSeconds",
    color="Split",
    title="Temporal Split: Training vs Test Data",
    labels={"SessionDate": "Date", "LapTimeSeconds": "Lap time (seconds)"},
)
fig.update_layout(legend_title="Split")
fig.show()


## 3. Physics Feature Impact (Defense: *Do engineered features actually matter?*)

This section validates that physics-inspired features have measurable statistical relationships with lap time.

Citations: FastF1 documentation (FastF1, 2024); Theisen (2021).
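
To make clear what the correlation cells below measure, the following sketch computes the same statistic on a synthetic stint where the true relationship is known by construction. All numbers (base lap time, time penalty per kilogram, noise level) are illustrative assumptions, not estimates from the dataset. The rank-based variant is an optional robustness check added here for illustration; it is not part of the notebook's analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_laps = 60

# Synthetic stint: fuel burns off linearly and each kilogram costs a fixed amount of time.
fuel_kg = np.linspace(100.0, 5.0, n_laps)
lap_time = 88.0 + 0.03 * fuel_kg + rng.normal(0.0, 0.2, n_laps)

toy = pd.DataFrame({"EstimatedFuelWeight": fuel_kg, "LapTimeSeconds": lap_time})

# Pearson correlation: strength of the linear association.
pearson = toy.corr().loc["EstimatedFuelWeight", "LapTimeSeconds"]

# Spearman correlation, computed as Pearson on ranks: strength of any monotonic association.
spearman = toy.rank().corr().loc["EstimatedFuelWeight", "LapTimeSeconds"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

A correlation close to zero on real data would not by itself prove that a feature is useless, since correlation only captures pairwise association and ignores interactions with other features.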


In [None]:
# Load processed feature dataset
feature_df, metadata = load_dataset()

required_cols = ["LapTimeSeconds", "EstimatedFuelWeight", "EstimatedGrip", "TrackEvolution", "TyreLife", "Compound"]
missing = [c for c in required_cols if c not in feature_df.columns]
if missing:
    raise KeyError(f"Missing required columns in processed dataset: {missing}")

# Correlation heatmap (seaborn)
if _SEABORN_OK:
    corr_cols = ["LapTimeSeconds", "EstimatedFuelWeight", "EstimatedGrip", "TrackEvolution"]
    corr = feature_df[corr_cols].corr(numeric_only=True)
    plt.figure(figsize=(6, 4))
    sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", cbar_kws={"label": "Correlation"})
    plt.title("Correlation Heatmap: LapTimeSeconds vs Physics Features")
    plt.xlabel("Features")
    plt.ylabel("Features")
    plt.show()


In [None]:
# Correlation ranking (absolute correlation with LapTimeSeconds)
features = ["EstimatedFuelWeight", "EstimatedGrip", "TrackEvolution"]

corr_series = (
    feature_df[["LapTimeSeconds"] + features]
    .corr(numeric_only=True)["LapTimeSeconds"]
    .drop("LapTimeSeconds")
    .abs()
    .sort_values()
)

plt.figure(figsize=(6, 3.5))
plt.barh(corr_series.index, corr_series.values)
plt.title("Absolute Correlation with LapTimeSeconds")
plt.xlabel("|Correlation|")
plt.ylabel("Feature")
plt.legend(["|Correlation|"])
plt.show()


In [None]:
# Physics sanity check: EstimatedGrip vs TyreLife by compound
plot_df = feature_df.copy()
plot_df["Compound"] = plot_df["Compound"].astype(str).str.upper()
plot_df = plot_df[plot_df["Compound"].isin(["SOFT", "MEDIUM", "HARD"])].dropna(subset=["TyreLife", "EstimatedGrip"])

# Sample for clarity
plot_df = plot_df.sample(min(30000, len(plot_df)), random_state=42)

if _SEABORN_OK:
    plt.figure(figsize=(7, 4))
    sns.lineplot(
        data=plot_df,
        x="TyreLife",
        y="EstimatedGrip",
        hue="Compound",
        estimator="median",
        errorbar=None,
    )
    plt.title("EstimatedGrip vs TyreLife by Compound")
    plt.xlabel("TyreLife (laps)")
    plt.ylabel("EstimatedGrip")
    plt.legend(title="Compound")
    plt.show()


## 4. Model Comparison (Defense: *Why this model?*)

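This section compares MAE, RMSE, and $R^2$ across Linear, XGBoost, and Deep MLP. It also plots MAE against RMSE. Because RMSE weights large errors more heavily than MAE, the gap between the two shows how much of a model's error comes from a few large misses rather than from uniformly small ones. This gap serves as a rough proxy in the bias vs variance discussion.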

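Note: By default, this notebook **loads saved models** from `reports/models/` so that the results can be reproduced without retraining or re-tuning hyperparameters. If any of the three model files is missing, all models are retrained. Set `LOAD_MODELS_IF_AVAILABLE = False` to force retraining.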
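
The difference between MAE and RMSE can be illustrated with a small worked example. The `mae` and `rmse` functions below are hypothetical stand-ins written for this sketch; they are not the implementation behind `compute_metrics`. Both error patterns have the same MAE, but the one with a single large miss has a much higher RMSE.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

y_true = np.zeros(10)
uniform_pred = np.full(10, 0.5)           # every lap is off by 0.5 s
spiky_pred = np.array([0.0] * 9 + [5.0])  # nine perfect laps, one lap off by 5 s

print(mae(y_true, uniform_pred), rmse(y_true, uniform_pred))  # 0.5 0.5
print(mae(y_true, spiky_pred), rmse(y_true, spiky_pred))      # 0.5 and approx. 1.58
```

RMSE is never smaller than MAE, so points in the MAE vs RMSE scatter plot always lie on or above the diagonal; the further above, the more the error is concentrated in a few laps.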


In [None]:
SEED = 42
TUNE_MODE = "off"  # used only if training is required
LOAD_MODELS_IF_AVAILABLE = True

MODELS_DIR = ROOT / "reports" / "models"
MODEL_PATHS = {
    "Linear": MODELS_DIR / "linear.joblib",
    "XGBoost": MODELS_DIR / "xgboost.joblib",
    "Deep MLP": MODELS_DIR / "deep_mlp.joblib",
}

split_config = SplitConfig(test_rounds=6)
df, metadata = load_dataset()
train_df, val_df, trainval_df, test_df, features = prepare_features(df, metadata, split_config=split_config)

X_train = train_df[features]
y_train = train_df["LapTimeSeconds"].to_numpy()
X_val = val_df[features]
y_val = val_df["LapTimeSeconds"].to_numpy()
X_trainval = trainval_df[features]
y_trainval = trainval_df["LapTimeSeconds"].to_numpy()
X_test = test_df[features]
y_test = test_df["LapTimeSeconds"].to_numpy()

fitted = {}
if LOAD_MODELS_IF_AVAILABLE and all(p.exists() for p in MODEL_PATHS.values()):
    for name, path in MODEL_PATHS.items():
        fitted[name] = joblib.load(path)
else:
    base_models = make_model_registry(features, random_state=SEED)
    models = {k: build_search(k, v, random_state=SEED, mode=TUNE_MODE) for k, v in base_models.items()}
    from src.eval import evaluate_models
    _, _, tmp = evaluate_models(models, X_train, y_train, X_val, y_val)
    for name, estimator in tmp.items():
        best = estimator.best_estimator_ if hasattr(estimator, "best_estimator_") else estimator
        best.fit(X_trainval, y_trainval)
        fitted[name] = best

# Compute test metrics
rows = []
for name, model in fitted.items():
    preds = model.predict(X_test)
    scores = compute_metrics(y_test, preds)
    scores["model"] = name
    rows.append(scores)

metrics_test = pd.DataFrame(rows).sort_values("mae").reset_index(drop=True)
metrics_test


In [None]:
# Bias-variance discussion: MAE vs RMSE
fig = px.scatter(
    metrics_test,
    x="mae",
    y="rmse",
    color="model",
    text="model",
    title="MAE vs RMSE (Bias-Variance Indicator)",
    labels={"mae": "MAE (s)", "rmse": "RMSE (s)"},
)
fig.update_traces(textposition="top center")
fig.show()


## 5. Error Analysis (Defense: *Where does your model fail?*)

We analyze the residual distribution and the mean absolute error by circuit and by compound, using the model with the lowest MAE on the test set.


In [None]:
# Pick best model by MAE
best_row = metrics_test.iloc[0]
best_name = best_row["model"]
best_model = fitted[best_name]

preds = best_model.predict(X_test)
residuals = y_test - preds

# Residual distribution
fig = px.histogram(
    residuals,
    nbins=60,
    title=f"Residual Distribution (Actual - Predicted): {best_name}",
    labels={"value": "Residual (seconds)", "count": "Count"},
)
fig.update_traces(name="Residuals", showlegend=True)
fig.update_layout(showlegend=True, legend_title="Series")
fig.show()

# Error by category: Circuit and Compound
errors_df = test_df.copy()
errors_df["Pred"] = preds
errors_df["AbsError"] = np.abs(errors_df["Pred"] - errors_df["LapTimeSeconds"])

if "Circuit" in errors_df.columns:
    circuit_mae = (
        errors_df.groupby("Circuit")
        .agg(mae=("AbsError", "mean"))
        .sort_values("mae", ascending=False)
        .head(15)
        .reset_index()
    )
    fig = px.bar(
        circuit_mae,
        x="mae",
        y="Circuit",
        color="mae",
        orientation="h",
        title=f"MAE by Circuit (15 Highest-Error Circuits) - {best_name}",
        labels={"mae": "MAE (s)", "Circuit": "Circuit"},
    )
    fig.show()

if "Compound" in errors_df.columns:
    comp_mae = (
        errors_df.groupby("Compound")
        .agg(mae=("AbsError", "mean"))
        .sort_values("mae", ascending=False)
        .reset_index()
    )
    fig = px.bar(
        comp_mae,
        x="Compound",
        y="mae",
        title=f"MAE by Compound - {best_name}",
        labels={"mae": "MAE (s)", "Compound": "Compound"},
    )
    fig.show()


## 6. Future Outlook (A+ Bonus)

**Attention / Transformer concept:** The current model uses sliding-window lags (e.g., LapTimeLag1–3), which only capture short-term history. A Transformer with attention could ingest the **entire race sequence** per driver and learn long-range dependencies such as strategy phases, traffic effects, and evolving track conditions. This would allow the model to weigh distant but relevant events (e.g., pit stops, safety car periods) when predicting current lap performance.

Citations: Theisen (2021); FastF1 documentation (FastF1, 2024).
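
As a rough illustration of the attention idea, the sketch below implements single-head scaled dot-product self-attention with a causal mask in NumPy, so that each lap can only attend to itself and earlier laps. It is a simplification under stated assumptions: it uses identity projections instead of learned query, key, and value matrices, has no positional encoding, and runs on random toy features. It is not a trained model and not part of the current pipeline.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head scaled dot-product attention with identity projections.

    x has shape (n_laps, d). Lap t may attend only to laps 0..t.
    Returns the attended features and the attention weights.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal marks future laps, which must not be visible.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
lap_features = rng.normal(size=(6, 4))  # 6 laps, 4 toy features per lap
attended, weights = causal_self_attention(lap_features)

print(weights.round(2))  # lower-triangular matrix; each row sums to 1
```

Unlike a fixed three-lap lag window, row `t` of `weights` spans all laps up to `t`, so a learned version of this mechanism could in principle assign weight to a pit stop or safety car period many laps earlier.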
