# F1 Lap Time Prediction Model

**Professional approach: Offline Training → Stateful Inference**

### Architecture
1. **Bulk download** all historical race data once → save as Parquet (no API calls during training/inference)
2. **Train** on full history — model learns driver × circuit × tyre degradation patterns
3. **Predict** given current race state: *"Here are laps 1–15, predict lap 16"*
4. **Export** model → deploy to FastAPI backend

### Key Design Decisions
- **No API calls at inference time** — model runs from exported `.pkl` file
- **Prediction depends most on current race** — rolling averages from laps already completed
- **Tyre degradation modelled per compound** — captures non-linear deg curves
- **FastF1 rate limit safe** — downloads cached to Parquet, re-runs cost 0 API calls
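
The "laps 1–15, predict lap 16" idea can be sketched in plain Python, independent of the notebook's pandas pipeline. The helper name `current_race_features` is hypothetical and the lap times are made-up values; the sketch only shows that lag and rolling features are built from completed laps and use nothing from the lap being predicted.

```python
from statistics import mean


def current_race_features(completed_laps, windows=(3, 5)):
    """Build lag and rolling-average features from completed laps only.

    completed_laps: lap times in seconds, oldest first.
    Hypothetical helper for illustration; not part of the notebook.
    """
    if not completed_laps:
        raise ValueError("need at least one completed lap")
    feats = {"prev_lap_1": completed_laps[-1]}
    for w in windows:
        # Mean of the last w completed laps (fewer if the race is young).
        feats[f"roll_avg_{w}"] = mean(completed_laps[-w:])
    return feats


# Made-up lap times for laps 1-5; the features describe the state before lap 6.
laps_so_far = [92.0, 91.0, 90.0, 91.0, 93.0]
features = current_race_features(laps_so_far)
print(features)
```

Because the function only ever sees `completed_laps`, the same code path can serve both training rows and live inference.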

---
## 1. Install & Import

In [None]:
!pip install fastf1 xgboost scikit-learn pandas numpy matplotlib seaborn pyarrow --quiet

In [None]:
import fastf1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import os
import warnings
import json
from pathlib import Path
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

warnings.filterwarnings('ignore')
sns.set_theme(style='darkgrid')

# Paths
CACHE_DIR = Path('f1_cache')
DATA_DIR = Path('f1_data')
MODEL_DIR = Path('model_export')

for d in [CACHE_DIR, DATA_DIR, MODEL_DIR]:
    d.mkdir(exist_ok=True)

fastf1.Cache.enable_cache(str(CACHE_DIR))
print('Setup complete')

---
## 2. Configuration

In [None]:
# ============================================================
#  CONFIGURATION
# ============================================================

SEASONS = [2023, 2024]              # Seasons to use
SESSION_TYPE = 'R'                   # R = Race
MAX_EVENTS_PER_SEASON = None         # None = all, set to 3-5 for quick testing
PARQUET_FILE = DATA_DIR / 'all_race_laps.parquet'

# Download only when the Parquet file does not exist yet.
# To force a fresh download, delete the file or set DOWNLOAD_DATA = True.
DOWNLOAD_DATA = not PARQUET_FILE.exists()

print(f'Seasons:       {SEASONS}')
print(f'Download data: {DOWNLOAD_DATA}')
print(f'Parquet file:  {PARQUET_FILE}')

---
## 3. Data Collection (run once, then cached to Parquet)

Downloads all race lap data from FastF1 and saves it to a Parquet file.  
On subsequent runs the cell skips the download and loads the file from disk instead.
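
The download-once pattern can be sketched on its own. This is a minimal illustration, not the notebook's code: it uses a JSON file as a stand-in for Parquet so it runs with the standard library only, and `load_or_fetch` and `fake_download` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(cache_file, fetch):
    """Return cached rows if cache_file exists, else call fetch() and cache the result."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    rows = fetch()
    cache_file.write_text(json.dumps(rows))
    return rows


calls = {"n": 0}


def fake_download():
    # Stand-in for the slow FastF1 download; counts how often it runs.
    calls["n"] += 1
    return [{"Driver": "VER", "LapNumber": 1, "LapTimeSec": 95.2}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "laps.json"
    first = load_or_fetch(cache, fake_download)   # downloads and writes the cache
    second = load_or_fetch(cache, fake_download)  # served from disk

print(calls["n"], first == second)
```

The second call never reaches `fake_download`, which is the property that keeps re-runs free of API calls.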

In [None]:
def download_all_laps(seasons, session_type, max_events=None):
    """
    Download lap data from FastF1 for given seasons.
    Returns a single DataFrame with all laps.
    """
    all_laps = []

    for year in seasons:
        print(f'\n--- {year} Season ---')
        try:
            schedule = fastf1.get_event_schedule(year)
            events = schedule[schedule['EventFormat'] != 'testing']
            if max_events:
                events = events.head(max_events)
        except Exception as e:
            print(f'  Could not load schedule: {e}')
            continue

        for _, event in events.iterrows():
            name = event['EventName']
            print(f'  {name}...', end=' ')
            try:
                session = fastf1.get_session(year, name, session_type)
                session.load(laps=True, telemetry=False, weather=False, messages=False)
                laps = session.laps
                if laps.empty:
                    print('no data')
                    continue

                # Keep relevant columns
                cols_to_keep = [
                    'Driver', 'DriverNumber', 'Team', 'LapNumber', 'LapTime',
                    'Stint', 'TyreLife', 'Compound', 'FreshTyre',
                    'Sector1Time', 'Sector2Time', 'Sector3Time',
                    'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST',
                    'IsPersonalBest', 'Position', 'TrackStatus',
                    'IsAccurate'
                ]
                cols_available = [c for c in cols_to_keep if c in laps.columns]
                laps = laps[cols_available].copy()

                # Convert timedeltas to seconds
                for col in ['LapTime', 'Sector1Time', 'Sector2Time', 'Sector3Time']:
                    if col in laps.columns:
                        laps[f'{col}Sec'] = laps[col].dt.total_seconds()
                        laps = laps.drop(columns=[col])

                # Add metadata
                laps['Year'] = year
                laps['Event'] = name
                # Same value as Event; history matches across seasons only if the event name is unchanged
                laps['Circuit'] = session.event['EventName']
                laps['TotalLaps'] = laps['LapNumber'].max()

                all_laps.append(laps)
                print(f'{len(laps)} laps')

            except Exception as e:
                print(f'failed ({e})')

    return pd.concat(all_laps, ignore_index=True) if all_laps else pd.DataFrame()


# ---- Download or load from cache ----
if DOWNLOAD_DATA:
    print('Downloading from FastF1 (this takes a while on first run)...\n')
    df_raw = download_all_laps(SEASONS, SESSION_TYPE, MAX_EVENTS_PER_SEASON)
    df_raw.to_parquet(PARQUET_FILE, index=False)
    print(f'\nSaved {len(df_raw):,} laps to {PARQUET_FILE}')
else:
    print(f'Loading from cached parquet: {PARQUET_FILE}')
    df_raw = pd.read_parquet(PARQUET_FILE)
    print(f'Loaded {len(df_raw):,} laps')

print(f'\nDataset: {df_raw["Year"].nunique()} seasons, '
      f'{df_raw["Event"].nunique()} events, '
      f'{df_raw["Driver"].nunique()} drivers')

---
## 4. Data Cleaning

Remove unreliable laps: pit in/out, safety car, red flags, outliers.
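
The percentile filter used for outliers can be illustrated in plain Python. This is a sketch under stated assumptions: `keep_central_laps` is a hypothetical helper, the lap times are made up, and `statistics.quantiles` with `method="inclusive"` stands in for pandas' `quantile()`, which also interpolates linearly between data points by default.

```python
from statistics import quantiles


def keep_central_laps(lap_times):
    """Keep lap times between the 5th and 95th percentile (inclusive)."""
    # n=20 yields cut points every 5%; the first is the 5th percentile,
    # the last is the 95th.
    cuts = quantiles(lap_times, n=20, method="inclusive")
    lo, hi = cuts[0], cuts[-1]
    return [t for t in lap_times if lo <= t <= hi]


# Made-up event with lap times of 80, 81, ..., 100 seconds.
event_laps = [float(80 + i) for i in range(21)]
kept = keep_central_laps(event_laps)
print(len(event_laps), "->", len(kept))
```

Note that a fixed 5th to 95th percentile window removes roughly 10% of laps from every event, whether or not they are genuine outliers.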

In [None]:
df = df_raw.copy()
print(f'Raw:             {len(df):,} laps')

# 1. Drop laps without a valid time
df = df.dropna(subset=['LapTimeSec'])
print(f'After NaN drop:  {len(df):,}')

# 2. Keep only accurate laps (FastF1 flags inaccurate pit-in/out laps)
if 'IsAccurate' in df.columns:
    df = df[df['IsAccurate'] == True].copy()
    print(f'After IsAccurate: {len(df):,}')

# 3. Remove extreme outliers per event (5th–95th percentile)
def clip_outliers(group, col='LapTimeSec'):
    lo, hi = group[col].quantile(0.05), group[col].quantile(0.95)
    return group[(group[col] >= lo) & (group[col] <= hi)]

df = df.groupby(['Year', 'Event'], group_keys=False).apply(clip_outliers)
print(f'After outliers:  {len(df):,}')

# 4. Quick sanity check
print(f'\nLap time stats (seconds):')
print(df['LapTimeSec'].describe().round(2))

---
## 5. Feature Engineering

### Feature hierarchy (by prediction importance):

| Priority | Category | Features | Why |
|----------|----------|----------|-----|
| **1 (highest)** | Current race laps | prev_lap_1/2/3, roll_avg_3/5/10, race_mean | The best predictor of the next lap is the most recent laps |
| **2** | Tyre / stint state | TyreLife, compound_enc, deg_rate, stint_avg_3 | Tyre degradation is the main source of lap time variation |
| **3** | Race context | LapNumber, fuel_effect, Position | Fuel burn makes the car lighter and therefore faster, by roughly 0.06 s per lap |
| **4 (lowest)** | Historical | hist_circuit_mean, circuit_avg | Baseline expectation, lower weight for in-race predictions |
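
Two of the features in the table can be worked out by hand. The sketch below is a plain-Python illustration with hypothetical function names and made-up stint times; it assumes the 0.06 s per lap fuel figure quoted in the table and a least-squares slope for the degradation rate.

```python
def fuel_effect(lap_number, total_laps, sec_per_lap=0.06):
    """Estimated time penalty (s) from fuel still on board at this lap."""
    race_progress = lap_number / total_laps
    return (1 - race_progress) * sec_per_lap * total_laps


def deg_rate(stint_lap_times):
    """Least-squares slope (s per lap) of lap time against lap index within a stint."""
    n = len(stint_lap_times)
    if n < 2:
        return 0.0  # a slope needs at least two points
    x_mean = (n - 1) / 2
    y_mean = sum(stint_lap_times) / n
    num = sum((i - x_mean) * (t - y_mean) for i, t in enumerate(stint_lap_times))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


print(round(fuel_effect(1, 57), 2))    # lap 1 of 57: 56 laps of fuel left
print(round(fuel_effect(57, 57), 2))   # final lap: no fuel penalty left
print(round(deg_rate([90.0, 90.1, 90.2, 90.3]), 3))
```

A stint that loses a tenth per lap gives a slope of about 0.1 s per lap, and the fuel term shrinks linearly to zero at the flag.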

In [None]:
def build_features(df):
    """
    Feature engineering pipeline.
    Builds features in priority order: current race → tyre → context → historical.
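    Lap-time lags, rolling stats and historical means use only completed laps
    or earlier races. circuit_avg, avg_speed and Position are exceptions: they
    are computed from all laps or from the lap being predicted.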
    """
    df = df.sort_values(['Year', 'Event', 'Driver', 'LapNumber']).copy()

    # Unique key per driver per race
    df['race_id'] = df['Year'].astype(str) + '_' + df['Event']
    df['rd_key'] = df['race_id'] + '_' + df['Driver']

    grp = df.groupby('rd_key')['LapTimeSec']

    # ==========================================================
    #  PRIORITY 1: Current Race Lap Features
    # ==========================================================

    # Lag features — shifted so we only use completed laps
    df['prev_lap_1'] = grp.shift(1)
    df['prev_lap_2'] = grp.shift(2)
    df['prev_lap_3'] = grp.shift(3)

    # Rolling averages over completed laps
    shifted = grp.shift(1)
    df['roll_avg_3'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.rolling(3, min_periods=1).mean()
    )
    df['roll_avg_5'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.rolling(5, min_periods=1).mean()
    )
    df['roll_avg_10'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.rolling(10, min_periods=1).mean()
    )

    # Lap-over-lap deltas (trend)
    df['delta_1'] = grp.shift(1) - grp.shift(2)
    df['delta_2'] = grp.shift(2) - grp.shift(3)

    # Race cumulative stats up to previous lap
    df['race_mean'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.expanding().mean()
    )
    df['race_best'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.expanding().min()
    )
    df['race_std'] = shifted.groupby(df['rd_key']).transform(
        lambda x: x.expanding().std()
    )

    # ==========================================================
    #  PRIORITY 2: Tyre & Stint Features
    # ==========================================================

    # Ordinal compound encoding (1 = SOFT, 2 = MEDIUM, 3 = HARD; wet-weather types coded separately)
    compound_map = {
        'SOFT': 1, 'MEDIUM': 2, 'HARD': 3,
        'INTERMEDIATE': 4, 'WET': 5,
        'HYPERSOFT': 0, 'ULTRASOFT': 0.5, 'SUPERSOFT': 1,
        'SUPERHARD': 4, 'UNKNOWN': 2, 'TEST_UNKNOWN': 2
    }
    df['compound_enc'] = df['Compound'].map(compound_map).fillna(2)

    # Tyre life
    df['Stint'] = df['Stint'].fillna(1).astype(int)
    df['TyreLife'] = pd.to_numeric(df['TyreLife'], errors='coerce')
    df['TyreLife'] = df['TyreLife'].fillna(
        df.groupby(['rd_key', 'Stint'])['LapNumber'].transform(lambda x: x - x.min() + 1)
    )
    df['tyre_life_sq'] = df['TyreLife'] ** 2  # Non-linear degradation
    df['tyre_compound_int'] = df['TyreLife'] * df['compound_enc']  # Interaction: lets degradation differ by compound

    # Fresh tyre flag
    df['fresh_tyre'] = df['FreshTyre'].fillna(True).astype(int)

    # Stint-level rolling average
    stint_shifted = df.groupby(['rd_key', 'Stint'])['LapTimeSec'].shift(1)
    df['stint_avg_3'] = stint_shifted.groupby([df['rd_key'], df['Stint']]).transform(
        lambda x: x.rolling(3, min_periods=1).mean()
    )
    df['stint_mean'] = stint_shifted.groupby([df['rd_key'], df['Stint']]).transform(
        lambda x: x.expanding().mean()
    )

    # Stint lap number
    df['stint_lap'] = df.groupby(['rd_key', 'Stint']).cumcount() + 1

    # Tyre degradation rate (slope within current stint)
    def stint_slope(series):
        vals = series.dropna().values
        if len(vals) < 2:
            return 0.0
        return np.polyfit(range(len(vals)), vals, 1)[0]

    df['deg_rate'] = stint_shifted.groupby([df['rd_key'], df['Stint']]).transform(
        lambda x: x.expanding().apply(stint_slope, raw=False)
    )

    # ==========================================================
    #  PRIORITY 3: Race Context
    # ==========================================================

    # Fuel effect: roughly 0.06 s of lap time per remaining lap of fuel.
    # fuel_corrected_lap is the race progress fraction (0 to 1), not a corrected lap time.
    df['fuel_corrected_lap'] = df['LapNumber'] / df['TotalLaps']
    df['fuel_effect'] = (1 - df['fuel_corrected_lap']) * 0.06 * df['TotalLaps']

    # Position
    df['Position'] = pd.to_numeric(df['Position'], errors='coerce')

    # Speed trap averages
    for col in ['SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df['avg_speed'] = df[['SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']].mean(axis=1)

    # ==========================================================
    #  PRIORITY 4: Historical Driver × Circuit Performance
    # ==========================================================

    # Build historical stats EXCLUDING the current race (no leakage)
    hist = df.groupby(['Circuit', 'Driver', 'race_id'])['LapTimeSec'].agg(
        hist_mean='mean', hist_best='min'
    ).reset_index()

    # For each race, the historical stats should come from OTHER races only
    # We use an expanding approach grouped by circuit+driver
    hist = hist.sort_values('race_id')
    hist['hist_circuit_mean'] = hist.groupby(['Circuit', 'Driver'])['hist_mean'].transform(
        lambda x: x.shift(1).expanding().mean()
    )
    hist['hist_circuit_best'] = hist.groupby(['Circuit', 'Driver'])['hist_best'].transform(
        lambda x: x.shift(1).expanding().min()
    )
    hist = hist[['Circuit', 'Driver', 'race_id', 'hist_circuit_mean', 'hist_circuit_best']]

    df = df.merge(hist, on=['Circuit', 'Driver', 'race_id'], how='left')

    # Circuit average (across all drivers, for relative comparison)
    circuit_avg = df.groupby('Circuit')['LapTimeSec'].transform('mean')
    df['circuit_avg'] = circuit_avg
    df['driver_advantage'] = df['circuit_avg'] - df['hist_circuit_mean']

    # ==========================================================
    #  ENCODING
    # ==========================================================

    le_driver = LabelEncoder()
    df['driver_enc'] = le_driver.fit_transform(df['Driver'].astype(str))

    le_team = LabelEncoder()
    df['team_enc'] = le_team.fit_transform(df['Team'].astype(str))

    le_circuit = LabelEncoder()
    df['circuit_enc'] = le_circuit.fit_transform(df['Circuit'].astype(str))

    return df, le_driver, le_team, le_circuit


print('Building features...')
df, le_driver, le_team, le_circuit = build_features(df)
print(f'Done. Shape: {df.shape}')

---
## 6. Prepare Training Data

Fill NaN values with sensible defaults (rather than dropping rows).  
Early-race laps have fewer past laps to reference, so NaNs are expected and handled.
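
The cascade idea (use the most specific value available, then fall back to broader ones) can be sketched in plain Python. `first_valid` is a hypothetical helper and the numbers are made up; the notebook itself does this with chained pandas `fillna` calls.

```python
import math


def first_valid(*candidates, default=0.0):
    """Return the first candidate that is neither None nor NaN."""
    for value in candidates:
        if value is not None and not math.isnan(value):
            return value
    return default


nan = float("nan")

# Lap 1 of a race: no previous lap, no race mean yet, but a historical mean exists.
prev_lap_1 = first_valid(nan, nan, 91.8, 92.5)
# Driver new to the circuit: fall through to the circuit average.
rookie_prev_lap_1 = first_valid(nan, nan, nan, 92.5)
print(prev_lap_1, rookie_prev_lap_1)
```

The order of the candidates encodes the priority: previous lap, race mean, driver history at the circuit, then the circuit average.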

In [None]:
FEATURES = [
    # P1: Current race
    'prev_lap_1', 'prev_lap_2', 'prev_lap_3',
    'roll_avg_3', 'roll_avg_5', 'roll_avg_10',
    'delta_1', 'delta_2',
    'race_mean', 'race_best', 'race_std',

    # P2: Tyre / stint
    'TyreLife', 'tyre_life_sq', 'compound_enc', 'tyre_compound_int',
    'fresh_tyre', 'Stint', 'stint_lap', 'stint_avg_3', 'stint_mean',
    'deg_rate',

    # P3: Race context
    'LapNumber', 'fuel_corrected_lap', 'fuel_effect',
    'Position', 'avg_speed',

    # P4: Historical
    'hist_circuit_mean', 'hist_circuit_best',
    'circuit_avg', 'driver_advantage',

    # Encoded identifiers
    'driver_enc', 'team_enc', 'circuit_enc',
]

TARGET = 'LapTimeSec'

# --- Show NaN landscape before filling ---
nan_before = df[FEATURES].isnull().sum()
print('NaN counts before filling:')
print(nan_before[nan_before > 0].to_string())

# ---- Intelligent NaN filling ----

# Lag features: cascade fill (3 → 2 → 1 → race_mean → circuit_avg)
df['prev_lap_3'] = df['prev_lap_3'].fillna(df['prev_lap_2'])
df['prev_lap_2'] = df['prev_lap_2'].fillna(df['prev_lap_1'])
df['prev_lap_1'] = df['prev_lap_1'].fillna(df['race_mean']).fillna(df['hist_circuit_mean']).fillna(df['circuit_avg'])
# After filling prev_lap_1, re-fill 2 and 3
df['prev_lap_2'] = df['prev_lap_2'].fillna(df['prev_lap_1'])
df['prev_lap_3'] = df['prev_lap_3'].fillna(df['prev_lap_2'])

# Rolling averages: cascade
df['roll_avg_10'] = df['roll_avg_10'].fillna(df['roll_avg_5'])
df['roll_avg_5'] = df['roll_avg_5'].fillna(df['roll_avg_3'])
df['roll_avg_3'] = df['roll_avg_3'].fillna(df['prev_lap_1'])
df['roll_avg_5'] = df['roll_avg_5'].fillna(df['roll_avg_3'])
df['roll_avg_10'] = df['roll_avg_10'].fillna(df['roll_avg_5'])

# Deltas: 0 = no trend data yet
df['delta_1'] = df['delta_1'].fillna(0)
df['delta_2'] = df['delta_2'].fillna(0)

# Race cumulative: fill from historical or circuit
df['race_mean'] = df['race_mean'].fillna(df['hist_circuit_mean']).fillna(df['circuit_avg'])
df['race_best'] = df['race_best'].fillna(df['hist_circuit_best']).fillna(df['circuit_avg'])
df['race_std'] = df['race_std'].fillna(0)

# Stint: fill from race-level
df['stint_avg_3'] = df['stint_avg_3'].fillna(df['roll_avg_3'])
df['stint_mean'] = df['stint_mean'].fillna(df['race_mean'])
df['deg_rate'] = df['deg_rate'].fillna(0)

# Speed / Position: per-event median
for col in ['avg_speed', 'Position']:
    df[col] = df.groupby(['Year', 'Event'])[col].transform(lambda x: x.fillna(x.median()))

# Historical: fill with circuit average
df['hist_circuit_mean'] = df['hist_circuit_mean'].fillna(df['circuit_avg'])
df['hist_circuit_best'] = df['hist_circuit_best'].fillna(df['circuit_avg'])
df['driver_advantage'] = df['driver_advantage'].fillna(0)

# Final safety net: fill any remaining NaN with column median
for col in FEATURES:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())

# Drop only if target is missing
model_df = df.dropna(subset=[TARGET])

X = model_df[FEATURES]
y = model_df[TARGET]

nan_after = X.isnull().sum().sum()
print(f'\nNaN remaining: {nan_after}')
print(f'Training samples: {len(X):,}')
print(f'Features: {len(FEATURES)}')

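# Ordered split, no shuffling: the last 20% of rows form the test set.
# Rows are sorted by Year, then Event name (alphabetical), so the test set is
# the alphabetical tail of the last season, not a strictly chronological hold-out.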
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(f'\nTrain: {len(X_train):,}  |  Test: {len(X_test):,}')

---
## 7. Train Model

In [None]:
model = XGBRegressor(
    n_estimators=800,
    max_depth=8,
    learning_rate=0.03,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=5,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    early_stopping_rounds=50,
)

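# Early stopping monitors the last entry of eval_set, which is the test set here,
# so the test metrics reported in section 8 are somewhat optimistic.
model.fit(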
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=100
)

print(f'\nBest iteration: {model.best_iteration}')

---
## 8. Evaluate

In [None]:
y_pred = model.predict(X_test)
y_pred_train = model.predict(X_train)

print('Model Performance')
print('=' * 50)
for name, yt, yp in [('Train', y_train, y_pred_train), ('Test', y_test, y_pred)]:
    mae = mean_absolute_error(yt, yp)
    rmse = np.sqrt(mean_squared_error(yt, yp))
    r2 = r2_score(yt, yp)
    print(f'{name:6s}  MAE: {mae:.3f}s  RMSE: {rmse:.3f}s  R²: {r2:.4f}')

print(f'\nThe model predicts lap times within ~{mean_absolute_error(y_test, y_pred):.2f} seconds on average.')

In [None]:
# Time-series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
cv_model = XGBRegressor(
    n_estimators=300, max_depth=6, learning_rate=0.05,
    subsample=0.8, colsample_bytree=0.8, random_state=42, n_jobs=-1
)
cv_scores = cross_val_score(cv_model, X, y, cv=tscv, scoring='neg_mean_absolute_error')
print(f'CV MAE: {-cv_scores.mean():.3f} +/- {cv_scores.std():.3f} seconds')

---
## 9. Feature Importance

In [None]:
importance = pd.DataFrame({
    'Feature': FEATURES,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

# Categorize
P1 = ['prev_lap', 'roll_avg', 'delta_', 'race_mean', 'race_best', 'race_std']
P2 = ['Tyre', 'tyre', 'compound', 'fresh', 'Stint', 'stint', 'deg']
P3 = ['Lap', 'fuel', 'Position', 'speed']

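def cat(f):
    # Encoded identifiers first, so 'circuit_enc' is not counted as historical
    if f in ('driver_enc', 'team_enc', 'circuit_enc'): return 'Other'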
    if any(k in f for k in P1): return 'P1: Current Race'
    if any(k in f for k in P2): return 'P2: Tyre/Stint'
    if any(k in f for k in P3): return 'P3: Race Context'
    if 'hist' in f or 'circuit' in f or 'advantage' in f: return 'P4: Historical'
    return 'Other'

importance['Category'] = importance['Feature'].apply(cat)
cat_colors = {
    'P1: Current Race': '#e74c3c',
    'P2: Tyre/Stint': '#f39c12',
    'P3: Race Context': '#2ecc71',
    'P4: Historical': '#3498db',
    'Other': '#95a5a6'
}

fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# All features
colors = importance['Category'].map(cat_colors)
axes[0].barh(importance['Feature'], importance['Importance'], color=colors)
axes[0].set_xlabel('Importance')
axes[0].set_title('Individual Feature Importance')
axes[0].invert_yaxis()

# Category totals
cat_imp = importance.groupby('Category')['Importance'].sum().sort_values()
cat_c = [cat_colors.get(c, '#95a5a6') for c in cat_imp.index]
axes[1].barh(cat_imp.index, cat_imp.values, color=cat_c)
axes[1].set_xlabel('Total Importance')
axes[1].set_title('Importance by Priority Category')

for i, (idx, val) in enumerate(cat_imp.items()):
    axes[1].text(val + 0.005, i, f'{val:.1%}', va='center', fontsize=11)

plt.tight_layout()
plt.savefig(str(MODEL_DIR / 'feature_importance.png'), dpi=150, bbox_inches='tight')
plt.show()

print('\nTop 10 features:')
print(importance.head(10).to_string(index=False))

---
## 10. Diagnostic Plots

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Predicted vs Actual
axes[0,0].scatter(y_test, y_pred, alpha=0.1, s=5, c='#e74c3c')
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
axes[0,0].plot(lims, lims, 'k--', lw=2)
axes[0,0].set_xlabel('Actual (s)'); axes[0,0].set_ylabel('Predicted (s)')
axes[0,0].set_title('Predicted vs Actual Lap Times')

# Residuals
residuals = y_test.values - y_pred
axes[0,1].hist(residuals, bins=100, color='#3498db', edgecolor='white')
axes[0,1].axvline(0, color='red', lw=2, ls='--')
axes[0,1].set_xlabel('Residual (s)'); axes[0,1].set_title(f'Residuals (mean={residuals.mean():.3f}s)')

# Error vs Tyre Life
test_analysis = X_test.copy()
test_analysis['abs_error'] = np.abs(residuals)
tyre_err = test_analysis.groupby('TyreLife')['abs_error'].mean()
tyre_err = tyre_err[tyre_err.index <= 40]
axes[1,0].plot(tyre_err.index, tyre_err.values, color='#f39c12', lw=2)
axes[1,0].set_xlabel('Tyre Life (laps)'); axes[1,0].set_ylabel('MAE (s)')
axes[1,0].set_title('Prediction Error vs Tyre Age')

# Error by race stage
test_analysis['stage'] = pd.cut(test_analysis['LapNumber'], bins=5)
stage_err = test_analysis.groupby('stage')['abs_error'].mean()
axes[1,1].bar(range(len(stage_err)), stage_err.values, color='#2ecc71', edgecolor='white')
axes[1,1].set_xticks(range(len(stage_err)))
axes[1,1].set_xticklabels([str(b) for b in stage_err.index], rotation=30, ha='right')
axes[1,1].set_xlabel('Lap Range'); axes[1,1].set_ylabel('MAE (s)')
axes[1,1].set_title('Prediction Error by Race Stage')

plt.tight_layout()
plt.savefig(str(MODEL_DIR / 'diagnostics.png'), dpi=150, bbox_inches='tight')
plt.show()

---
## 11. Tyre Degradation Curves

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

compound_info = {
    'SOFT': ('#FF3333', 1),
    'MEDIUM': ('#FFDD00', 2),
    'HARD': ('#CCCCCC', 3),
}

for name, (color, _) in compound_info.items():
    mask = df['Compound'] == name
    data = df[mask].groupby('TyreLife')['LapTimeSec'].median()
    data = data[data.index <= 35]
    if len(data) > 2:
        ax.plot(data.index, data.values, label=name, color=color, lw=2.5)

ax.set_xlabel('Tyre Life (laps)', fontsize=12)
ax.set_ylabel('Median Lap Time (s)', fontsize=12)
ax.set_title('Tyre Degradation by Compound (all circuits)', fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(str(MODEL_DIR / 'tyre_deg.png'), dpi=150, bbox_inches='tight')
plt.show()

---
## 12. Export Model for Deployment

Exports everything needed to run predictions from your FastAPI backend —  
no FastF1 or internet required at inference time.
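
The export step boils down to one pickled bundle plus a readable JSON summary. The sketch below illustrates that round trip with made-up values; a plain dict stands in for the trained model so the example runs without xgboost.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Stand-in bundle: a plain dict replaces the trained model. All values are made up.
artifacts = {
    "model": {"kind": "stub"},
    "features": ["prev_lap_1", "TyreLife"],
    "driver_map": {"HAM": 0, "VER": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "lap_time_model.pkl"
    meta_path = Path(tmp) / "model_metadata.json"

    # Binary bundle for the backend, readable metadata for humans.
    model_path.write_bytes(pickle.dumps(artifacts))
    meta_path.write_text(json.dumps({"features": artifacts["features"]}, indent=2))

    restored = pickle.loads(model_path.read_bytes())
    meta = json.loads(meta_path.read_text())

print(restored == artifacts, meta["features"])
```

Two caveats apply to the real bundle: unpickling executes code, so only load files you created yourself, and the backend needs the libraries the pickled model was built with, ideally at matching versions.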

In [None]:
# Build metadata for inference
# These maps let the backend convert user-friendly names to encoded values

driver_map = {name: int(idx) for idx, name in enumerate(le_driver.classes_)}
team_map = {name: int(idx) for idx, name in enumerate(le_team.classes_)}
circuit_map = {name: int(idx) for idx, name in enumerate(le_circuit.classes_)}

# Circuit average lap times (for baseline when no race data yet)
circuit_baselines = df.groupby('Circuit')['LapTimeSec'].agg(
    ['mean', 'min']
).round(3).to_dict(orient='index')

# Driver × Circuit historical performance
driver_circuit_hist = df.groupby(['Circuit', 'Driver'])['LapTimeSec'].agg(
    ['mean', 'min']
).round(3).reset_index()
driver_circuit_hist.columns = ['circuit', 'driver', 'mean', 'best']
driver_circuit_dict = {}
for _, row in driver_circuit_hist.iterrows():
    key = f"{row['driver']}@{row['circuit']}"
    driver_circuit_dict[key] = {'mean': row['mean'], 'best': row['best']}

# Save everything
artifacts = {
    'model': model,
    'features': FEATURES,
    'driver_map': driver_map,
    'team_map': team_map,
    'circuit_map': circuit_map,
    'circuit_baselines': circuit_baselines,
    'driver_circuit_hist': driver_circuit_dict,
    'compound_map': {'SOFT': 1, 'MEDIUM': 2, 'HARD': 3, 'INTERMEDIATE': 4, 'WET': 5},
    'training_info': {
        'seasons': SEASONS,
        'total_laps': len(X),
        'test_mae': float(mean_absolute_error(y_test, y_pred)),
        'test_r2': float(r2_score(y_test, y_pred)),
    }
}

model_path = MODEL_DIR / 'lap_time_model.pkl'
with open(model_path, 'wb') as f:
    pickle.dump(artifacts, f)

# Also save metadata as JSON (human-readable, for API docs)
meta_path = MODEL_DIR / 'model_metadata.json'
meta = {
    'features': FEATURES,
    'drivers': list(driver_map.keys()),
    'teams': list(team_map.keys()),
    'circuits': list(circuit_map.keys()),
    'compounds': ['SOFT', 'MEDIUM', 'HARD', 'INTERMEDIATE', 'WET'],
    'training_info': artifacts['training_info'],
}
with open(meta_path, 'w') as f:
    json.dump(meta, f, indent=2)

print(f'Model saved:    {model_path} ({model_path.stat().st_size / 1024:.0f} KB)')
print(f'Metadata saved: {meta_path}')
print(f'\nFiles in export directory:')
for p in MODEL_DIR.iterdir():
    print(f'  {p.name} ({p.stat().st_size / 1024:.0f} KB)')

---
## 13. Inference Engine

This is the class you'd copy to your FastAPI backend.  
It takes **current race state** (laps completed so far) and predicts the next lap.
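
Predicting more than one lap ahead means feeding each prediction back in as if it were a completed lap. The sketch below shows that loop with a toy model; `stub_predict` and `simulate_remaining` are hypothetical names, and the 0.02 s per lap tyre penalty is an arbitrary illustration, not a fitted value.

```python
from statistics import mean


def stub_predict(completed_laps, tyre_life):
    """Toy next-lap model: recent pace plus 0.02 s per lap of tyre age."""
    return mean(completed_laps[-3:]) + 0.02 * tyre_life


def simulate_remaining(completed_laps, tyre_life, total_laps, pit_laps=()):
    """Predict every remaining lap, feeding each prediction back as input."""
    sim_laps = list(completed_laps)
    predictions = []
    for lap_num in range(len(completed_laps) + 1, total_laps + 1):
        # Fresh tyres on a planned pit lap, otherwise one lap older.
        tyre_life = 1 if lap_num in pit_laps else tyre_life + 1
        pred = stub_predict(sim_laps, tyre_life)
        predictions.append(pred)
        sim_laps.append(pred)  # later laps see predictions, not real times
    return predictions


forecast = simulate_remaining([90.0, 90.0, 90.0], tyre_life=3, total_laps=6, pit_laps={5})
print([round(t, 3) for t in forecast])
```

Since later inputs are themselves predictions, errors can compound over the remaining distance, so long-range forecasts deserve less trust than the next-lap prediction.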

In [None]:
class LapTimePredictor:
    """
    Stateful lap time predictor.

    Usage:
        predictor = LapTimePredictor('model_export/lap_time_model.pkl')
        predicted = predictor.predict(
            driver='VER', circuit='Monaco Grand Prix',
            team='Red Bull Racing',
            completed_laps=[78.1, 77.9, 78.3, 78.0, 78.5],  # times so far
            current_compound='MEDIUM', tyre_life=5,
            stint=1, fresh_tyre=True,
            position=1, total_laps=78
        )
    """

    def __init__(self, model_path):
        with open(model_path, 'rb') as f:
            arts = pickle.load(f)
        self.model = arts['model']
        self.features = arts['features']
        self.driver_map = arts['driver_map']
        self.team_map = arts['team_map']
        self.circuit_map = arts['circuit_map']
        self.circuit_baselines = arts['circuit_baselines']
        self.driver_circuit_hist = arts['driver_circuit_hist']
        self.compound_map = arts['compound_map']

    def predict(self, driver, circuit, team, completed_laps,
                current_compound, tyre_life, stint=1,
                fresh_tyre=True, position=10, total_laps=57,
                avg_speed=None):
        """
        Predict the next lap time.

        Args:
            driver: Driver abbreviation (e.g. 'VER', 'HAM')
            circuit: Circuit name (e.g. 'Monaco Grand Prix')
            team: Team name (e.g. 'Red Bull Racing')
            completed_laps: List of lap times (seconds) completed so far
            current_compound: 'SOFT', 'MEDIUM', 'HARD', 'INTERMEDIATE', 'WET'
            tyre_life: Laps on current set of tyres
            stint: Current stint number (1, 2, 3...)
            fresh_tyre: Whether tyres are new (True) or used (False)
            position: Current race position
            total_laps: Total race laps
            avg_speed: Average speed trap reading (optional)

        Returns:
            float: Predicted next lap time in seconds
        """
        laps = np.array(completed_laps, dtype=float)
        n = len(laps)
        next_lap_num = n + 1

        # --- Historical lookup ---
        hist_key = f'{driver}@{circuit}'
        hist = self.driver_circuit_hist.get(hist_key, {})
        baseline = self.circuit_baselines.get(circuit, {})
        circuit_mean = baseline.get('mean', np.mean(laps) if n > 0 else 90.0)
        circuit_best = baseline.get('min', np.min(laps) if n > 0 else 80.0)
        hist_mean = hist.get('mean', circuit_mean)
        hist_best = hist.get('best', circuit_best)

        # --- Build feature vector ---
        f = {}

        # P1: Current race
        f['prev_lap_1'] = laps[-1] if n >= 1 else hist_mean
        f['prev_lap_2'] = laps[-2] if n >= 2 else f['prev_lap_1']
        f['prev_lap_3'] = laps[-3] if n >= 3 else f['prev_lap_2']
        f['roll_avg_3'] = np.mean(laps[-3:]) if n >= 1 else hist_mean
        f['roll_avg_5'] = np.mean(laps[-5:]) if n >= 1 else hist_mean
        f['roll_avg_10'] = np.mean(laps[-10:]) if n >= 1 else hist_mean
        f['delta_1'] = float(laps[-1] - laps[-2]) if n >= 2 else 0.0
        f['delta_2'] = float(laps[-2] - laps[-3]) if n >= 3 else 0.0
        f['race_mean'] = float(np.mean(laps)) if n >= 1 else hist_mean
        f['race_best'] = float(np.min(laps)) if n >= 1 else hist_best
        # ddof=1 gives the sample std, matching pandas .std() used in training
        f['race_std'] = float(np.std(laps, ddof=1)) if n >= 2 else 0.0

        # P2: Tyre / stint
        f['TyreLife'] = tyre_life
        f['tyre_life_sq'] = tyre_life ** 2
        f['compound_enc'] = self.compound_map.get(current_compound, 2)
        f['tyre_compound_int'] = tyre_life * f['compound_enc']
        f['fresh_tyre'] = int(fresh_tyre)
        f['Stint'] = stint
        f['stint_lap'] = tyre_life  # approximation
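        # Use only laps already completed in the current stint. tyre_life is the
        # tyre age on the lap being predicted (as in training), so tyre_life - 1
        # laps of this stint are done. Guard 0, because laps[-0:] is the whole array.
        stint_done = max(int(tyre_life) - 1, 0)
        stint_laps = laps[n - min(stint_done, n):]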
        f['stint_avg_3'] = float(np.mean(stint_laps[-3:])) if len(stint_laps) >= 1 else f['roll_avg_3']
        f['stint_mean'] = float(np.mean(stint_laps)) if len(stint_laps) >= 1 else f['race_mean']
        f['deg_rate'] = float(np.polyfit(range(len(stint_laps)), stint_laps, 1)[0]) if len(stint_laps) >= 2 else 0.0

        # P3: Race context
        f['LapNumber'] = next_lap_num
        f['fuel_corrected_lap'] = next_lap_num / total_laps
        f['fuel_effect'] = (1 - f['fuel_corrected_lap']) * 0.06 * total_laps
        f['Position'] = position
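        # NOTE: training values are speed-trap readings in km/h, so the 0.0 default
        # lies far outside the training range; pass avg_speed whenever it is known.
        f['avg_speed'] = avg_speed if avg_speed is not None else 0.0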

        # P4: Historical
        f['hist_circuit_mean'] = hist_mean
        f['hist_circuit_best'] = hist_best
        f['circuit_avg'] = circuit_mean
        f['driver_advantage'] = circuit_mean - hist_mean

        # Encoded identifiers
        f['driver_enc'] = self.driver_map.get(driver, 0)
        f['team_enc'] = self.team_map.get(team, 0)
        f['circuit_enc'] = self.circuit_map.get(circuit, 0)

        # Predict
        X = pd.DataFrame([f])[self.features]
        return float(self.model.predict(X)[0])

    def predict_remaining(self, driver, circuit, team, completed_laps,
                          current_compound, tyre_life, stint=1,
                          fresh_tyre=True, position=10, total_laps=57,
                          pit_plan=None):
        """
        Predict all remaining laps in the race.

        Args:
            pit_plan: Optional list of dicts for planned pit stops.
                      e.g. [{'lap': 25, 'compound': 'HARD'}]
                      If None, assumes no more pit stops.

        Returns:
            list of predicted lap times
        """
        predictions = []
        sim_laps = list(completed_laps)
        sim_tyre_life = tyre_life
        sim_compound = current_compound
        sim_stint = stint
        sim_fresh = fresh_tyre

        pit_lap_set = {}
        if pit_plan:
            for p in pit_plan:
                pit_lap_set[p['lap']] = p['compound']

        for lap_num in range(len(completed_laps) + 1, total_laps + 1):
            # Check if there's a pit stop at this lap
            if lap_num in pit_lap_set:
                sim_compound = pit_lap_set[lap_num]
                sim_tyre_life = 1
                sim_stint += 1
                sim_fresh = True
            else:
                sim_tyre_life += 1
                sim_fresh = False

            pred = self.predict(
                driver=driver, circuit=circuit, team=team,
                completed_laps=sim_laps,
                current_compound=sim_compound,
                tyre_life=sim_tyre_life,
                stint=sim_stint,
                fresh_tyre=sim_fresh,
                position=position,
                total_laps=total_laps,
            )
            predictions.append(pred)
            sim_laps.append(pred)  # Feed prediction back as input

        return predictions


print('LapTimePredictor class defined')

---
## 14. Test Predictions

Validate the inference engine against actual race data.
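
The two cells in this section check the predictor in two different ways: one-step-ahead (every prediction sees the actual laps before it) and recursive (predictions are fed back as history). The sketch below illustrates the difference with a toy stand-in for the model. `toy_one_step`, `one_step_ahead` and `recursive_forecast` are hypothetical names used only for this illustration, and the lap times are made-up placeholder values, not race data.

```python
from typing import List, Optional, Set


def toy_one_step(history: List[float], tyre_life: int) -> float:
    """Hypothetical stand-in for predictor.predict():
    mean of the last three laps plus a small linear tyre-age penalty."""
    recent = history[-3:]
    return sum(recent) / len(recent) + 0.02 * tyre_life


def one_step_ahead(actual: List[float], tyre_lives: List[int],
                   context_size: int) -> List[float]:
    """Predict each lap from the ACTUAL laps that preceded it."""
    return [toy_one_step(actual[:i], tyre_lives[i])
            for i in range(context_size, len(actual))]


def recursive_forecast(context: List[float], tyre_life: int, n_laps: int,
                       pit_laps: Optional[Set[int]] = None) -> List[float]:
    """Predict n_laps ahead, feeding every prediction back as history.

    tyre_life is the tyre age on the last completed lap; it is advanced
    (or reset to 1 on a pit lap) before each prediction.
    """
    pit_laps = pit_laps or set()
    history = list(context)
    first_lap = len(history) + 1
    preds = []
    for lap_num in range(first_lap, first_lap + n_laps):
        tyre_life = 1 if lap_num in pit_laps else tyre_life + 1
        pred = toy_one_step(history, tyre_life)
        preds.append(pred)
        history.append(pred)  # the next lap only sees the prediction
    return preds


# Placeholder lap times (s) and tyre ages for laps 1-8
actual = [90.0, 90.2, 90.1, 90.3, 90.4, 90.6, 90.5, 90.8]
tyre_lives = list(range(1, len(actual) + 1))

one_step = one_step_ahead(actual, tyre_lives, context_size=5)
recursive = recursive_forecast(actual[:5], tyre_life=5, n_laps=3)
with_stop = recursive_forecast(actual[:5], tyre_life=5, n_laps=3, pit_laps={7})

print('one-step :', [round(p, 3) for p in one_step])
print('recursive:', [round(p, 3) for p in recursive])
print('with stop:', [round(p, 3) for p in with_stop])
```

Both approaches agree on the first predicted lap because they see the same history. They diverge afterwards, which is why a one-step-ahead error is usually an optimistic estimate of how well a full-race trace will hold up.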

In [None]:
# --- Load predictor from saved model ---
predictor = LapTimePredictor(str(MODEL_DIR / 'lap_time_model.pkl'))

# --- Pick a real driver-race from the test set to validate ---
test_data = model_df.iloc[len(X_train):]  # test portion (assumes an ordered, unshuffled split)
sample_rd = test_data.groupby('rd_key').filter(lambda x: len(x) >= 20)
sample_key = sample_rd['rd_key'].unique()[0]
sample = test_data[test_data['rd_key'] == sample_key].sort_values('LapNumber')

driver = sample['Driver'].iloc[0]
circuit = sample['Circuit'].iloc[0]
team = sample['Team'].iloc[0]
total_laps = int(sample['TotalLaps'].iloc[0])

print(f'Testing: {driver} ({team}) at {circuit}')
print(f'Total laps available: {len(sample)}\n')

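# One-step-ahead check: predict laps 6–25 (at most 20 laps), each from the actual laps before it.
# Assumes `sample` starts at lap 1 with no gaps, so row i holds lap i + 1.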
print(f'{"Lap":>4} {"Predicted":>10} {"Actual":>10} {"Error":>8}')
print('-' * 36)

errors = []
for i in range(5, min(25, len(sample))):
    context = sample.iloc[:i]
    actual = sample.iloc[i]

    pred = predictor.predict(
        driver=driver, circuit=circuit, team=team,
        completed_laps=context['LapTimeSec'].tolist(),
        current_compound=actual['Compound'],
        tyre_life=int(actual['TyreLife']),
        stint=int(actual['Stint']),
        fresh_tyre=bool(actual.get('fresh_tyre', 0)),
        position=int(actual.get('Position', 10)),
        total_laps=total_laps,
    )

    err = abs(pred - actual['LapTimeSec'])
    errors.append(err)
    print(f'{int(actual["LapNumber"]):>4} {pred:>10.3f} {actual["LapTimeSec"]:>10.3f} {err:>8.3f}')

print(f'\nAverage error: {np.mean(errors):.3f}s')
print(f'Max error:     {np.max(errors):.3f}s')

In [None]:
# --- Visualize: predicted vs actual lap trace ---
context_size = 5
ctx_laps = sample.iloc[:context_size]
remaining = sample.iloc[context_size:]

preds = predictor.predict_remaining(
    driver=driver, circuit=circuit, team=team,
    completed_laps=ctx_laps['LapTimeSec'].tolist(),
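    # predict_remaining() advances tyre_life by one before its first prediction,
    # so pass the tyre state of the last context lap, not of the first predicted lap.
    current_compound=ctx_laps.iloc[-1]['Compound'],
    tyre_life=int(ctx_laps.iloc[-1].get('TyreLife', context_size)),
    stint=int(ctx_laps.iloc[-1].get('Stint', 1)),
    position=int(remaining.iloc[0].get('Position', 10)),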
    total_laps=len(sample),  # Use available laps as total
)

fig, ax = plt.subplots(figsize=(14, 5))
all_actual_laps = sample['LapNumber'].values
all_actual_times = sample['LapTimeSec'].values

pred_lap_nums = list(range(context_size + 1, context_size + 1 + len(preds)))

ax.plot(all_actual_laps, all_actual_times, 'b-o', ms=3, label='Actual', alpha=0.8)
ax.plot(pred_lap_nums[:len(remaining)], preds[:len(remaining)], 'r--s', ms=3, label='Predicted', alpha=0.8)
ax.axvline(context_size + 0.5, color='green', ls=':', lw=2, label=f'Prediction starts (lap {context_size+1})')

ax.set_xlabel('Lap Number')
ax.set_ylabel('Lap Time (s)')
ax.set_title(f'{driver} at {circuit} — Actual vs Predicted Lap Trace')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(str(MODEL_DIR / 'prediction_trace.png'), dpi=150, bbox_inches='tight')
plt.show()

---
## 15. Download for Deployment

In [None]:
try:
    from google.colab import files
    for p in MODEL_DIR.iterdir():
        files.download(str(p))
    print('Files downloaded')
except ImportError:
    print(f'Not in Colab. Model files are in: {MODEL_DIR.absolute()}')
    for p in MODEL_DIR.iterdir():
        print(f'  {p.name}')

---
## Architecture Summary

```
TRAINING (this notebook, run in Colab):

  FastF1 API ──(bulk download once)──> Parquet file ──> Feature Engineering ──> XGBoost ──> model.pkl
                                       (cached on disk)                                     (exported)


INFERENCE (your FastAPI backend, zero API calls):

  User input:                          LapTimePredictor class:
  ┌─────────────────────┐              ┌──────────────────────────────────┐
  │ driver: VER         │              │ 1. Look up driver×circuit history│
  │ circuit: Monaco     │──────────────│ 2. Compute features from input   │
  │ compound: MEDIUM    │              │ 3. model.predict(features)       │
  │ tyre_life: 12       │              │ 4. Return predicted lap time     │
  │ completed_laps: []  │              └──────────────────────────────────┘
  └─────────────────────┘
```
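
The inference path in the diagram can be sketched without committing to a web framework. The example below is a minimal illustration, not the backend itself: `LapRequest`, `handle_request` and `StubPredictor` are hypothetical names, the stub stands in for a loaded `LapTimePredictor`, and the default values for `stint`, `fresh_tyre`, `position` and `total_laps` are borrowed from the `predict_remaining` signature. In a FastAPI app, `handle_request` would be called from a route handler.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Compound names as they appear in FastF1 lap data
VALID_COMPOUNDS = {'SOFT', 'MEDIUM', 'HARD', 'INTERMEDIATE', 'WET'}


@dataclass
class LapRequest:
    """Hypothetical request payload mirroring the user input in the diagram."""
    driver: str
    circuit: str
    team: str
    compound: str
    tyre_life: int
    completed_laps: List[float] = field(default_factory=list)
    stint: int = 1
    fresh_tyre: bool = True
    position: int = 10
    total_laps: int = 57

    def validate(self) -> None:
        if self.compound.upper() not in VALID_COMPOUNDS:
            raise ValueError(f'Unknown compound: {self.compound}')
        if self.tyre_life < 0:
            raise ValueError('tyre_life must not be negative')
        if any(t <= 0 for t in self.completed_laps):
            raise ValueError('Lap times must be positive seconds')


class StubPredictor:
    """Stand-in for LapTimePredictor so the sketch runs without a model file."""

    def predict(self, *, completed_laps: List[float], **_ignored) -> float:
        # Mean of the laps so far, or an arbitrary placeholder with no history
        if not completed_laps:
            return 90.0
        return sum(completed_laps) / len(completed_laps)


def handle_request(req: LapRequest, predictor) -> Dict[str, object]:
    """Validate the payload, call the predictor, and shape the response."""
    req.validate()
    lap_time = predictor.predict(
        driver=req.driver, circuit=req.circuit, team=req.team,
        completed_laps=req.completed_laps,
        current_compound=req.compound.upper(),
        tyre_life=req.tyre_life, stint=req.stint,
        fresh_tyre=req.fresh_tyre, position=req.position,
        total_laps=req.total_laps,
    )
    return {'driver': req.driver, 'predicted_lap_time_s': round(lap_time, 3)}


request = LapRequest(driver='VER', circuit='Monaco', team='Red Bull Racing',
                     compound='MEDIUM', tyre_life=12,
                     completed_laps=[75.0, 75.4])
response = handle_request(request, StubPredictor())
print(response)
```

Validating the payload before it reaches the model keeps malformed input (an unknown compound, a negative lap time) from turning into a silent, plausible-looking prediction.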

### Feature priority: current-race data is expected to dominate

| Priority | Expected importance | Features |
|----------|---------------------|----------|
| P1 (highest) | ~50–60% | prev_lap_1/2/3, rolling averages, deltas, race mean/best |
| P2 | ~20–30% | tyre_life, compound, deg_rate, stint averages |
| P3 | ~10% | lap number, fuel effect, position |
| P4 (lowest) | ~5–10% | historical driver×circuit mean, circuit average |

These shares are rough expectations, not enforced constraints: XGBoost learns feature importance from the data, so check them against the trained model's `feature_importances_` before relying on them.
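
One way to compare the table with a trained model is to sum feature importances per priority group. The sketch below is a hedged illustration: `PRIORITY_GROUPS` and `group_importance` are hypothetical, the feature names are assumed from the table and may not match the notebook's actual columns, and the importance values are placeholders rather than results from this model.

```python
from collections import defaultdict
from typing import Dict

# Assumed feature -> priority mapping; extend it to cover the real feature list
PRIORITY_GROUPS = {
    'prev_lap_1': 'P1', 'prev_lap_2': 'P1', 'prev_lap_3': 'P1',
    'tyre_life': 'P2', 'deg_rate': 'P2',
    'driver_advantage': 'P4',
}


def group_importance(importances: Dict[str, float],
                     groups: Dict[str, str]) -> Dict[str, float]:
    """Return each priority group's share of total feature importance."""
    total = sum(importances.values())
    shares = defaultdict(float)
    for name, value in importances.items():
        # Features missing from the mapping are reported, not dropped
        shares[groups.get(name, 'unmapped')] += value / total
    return dict(shares)


# Placeholder values for illustration only. With a trained model, use e.g.
#   dict(zip(predictor.features, predictor.model.feature_importances_))
placeholder_importances = {
    'prev_lap_1': 0.30, 'prev_lap_2': 0.20, 'prev_lap_3': 0.10,
    'tyre_life': 0.20, 'deg_rate': 0.10,
    'driver_advantage': 0.05, 'some_new_feature': 0.05,
}

shares = group_importance(placeholder_importances, PRIORITY_GROUPS)
for group in sorted(shares):
    print(f'{group:>9}: {shares[group]:.0%}')
```

If the P4 share turns out to be far larger than the table suggests, the model is leaning on historical averages more than on the current race, which would be worth investigating before deployment.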