In this notebook we build and train the first models on the dataset created in EDA.ipynb.

In [51]:
from src import metrics
from src import plots

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

In [52]:
df = pd.read_csv("data/processed/monaco_2025_colapinto_alllaps.csv")

# Target
y = df["LapTime_s"].to_numpy()

# Legal features
LEGAL_FEATURES_NUM = ["LapNumber", "Stint", "TyreLife", "Position"]
LEGAL_FEATURES_CAT = ["Session", "Compound"]

# Keep only the columns that are actually present in the dataframe
LEGAL_FEATURES_NUM = [c for c in LEGAL_FEATURES_NUM if c in df.columns]
LEGAL_FEATURES_CAT = [c for c in LEGAL_FEATURES_CAT if c in df.columns]

X = df[LEGAL_FEATURES_NUM + LEGAL_FEATURES_CAT].copy()



In [53]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), LEGAL_FEATURES_NUM),
        ("cat", OneHotEncoder(handle_unknown="ignore"), LEGAL_FEATURES_CAT),
    ]
)

rf = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("regressor", rf),
])


We run K-Fold cross-validation to get a more reliable evaluation than a single train/test split would give.

In [54]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

mae_scores = []
rmse_scores = []
r2_scores = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train_fold = X.iloc[train_idx]
    X_test_fold  = X.iloc[test_idx]
    y_train_fold = y[train_idx]
    y_test_fold  = y[test_idx]
    
    # train the model on this fold
    model.fit(X_train_fold, y_train_fold)
    y_pred_fold = model.predict(X_test_fold)
    
    mae = metrics.MAE(y_test_fold, y_pred_fold)
    rmse = metrics.RMSE(y_test_fold, y_pred_fold)
    r2 = metrics.R2(y_test_fold, y_pred_fold)
    
    mae_scores.append(mae)
    rmse_scores.append(rmse)
    r2_scores.append(r2)
    
    print(f"Fold {fold}: MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")

print("\n=== Mean CV results (5 folds) ===")
print(f"Mean MAE : {np.mean(mae_scores):.3f} ± {np.std(mae_scores):.3f}")
print(f"Mean RMSE: {np.mean(rmse_scores):.3f} ± {np.std(rmse_scores):.3f}")
print(f"Mean R2  : {np.mean(r2_scores):.3f} ± {np.std(r2_scores):.3f}")


Fold 1: MAE=0.946, RMSE=1.366, R2=0.647
Fold 2: MAE=0.532, RMSE=0.849, R2=0.885
Fold 3: MAE=0.771, RMSE=1.050, R2=0.808
Fold 4: MAE=0.890, RMSE=1.318, R2=0.553
Fold 5: MAE=0.580, RMSE=0.726, R2=0.817

=== Mean CV results (5 folds) ===
Mean MAE : 0.744 ± 0.164
Mean RMSE: 1.062 ± 0.251
Mean R2  : 0.742 ± 0.123

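The `src.metrics` module used in the loop is not shown in this notebook. The sketch below is only an assumption about what `metrics.MAE`, `metrics.RMSE` and `metrics.R2` compute, using the standard textbook definitions; the lowercase function names are hypothetical and the real module may differ.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error (assumed equivalent of metrics.MAE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error (assumed equivalent of metrics.RMSE)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination (assumed equivalent of metrics.R2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Tiny hand-checkable example: only the last prediction is off, by 2 s
y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.0, 2.0, 5.0]
print(mae(y_true_demo, y_pred_demo), rmse(y_true_demo, y_pred_demo), r2(y_true_demo, y_pred_demo))
```

Since RMSE squares the errors, a few very slow or badly predicted laps raise it more than they raise MAE, which is consistent with RMSE being larger than MAE in every fold above.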

NEXT STEPS:

Analyze the large errors:
- Look at the laps with a residual > ~2 s and check whether they cluster in:
  - a particular stint,
  - a particular compound,
  - moments with traffic or lock-ups (if there is any indication of them).
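The residual check described above could be sketched as follows. This is a minimal illustration, not the notebook's actual analysis: the helper name `large_residual_report`, the 2 s default threshold, and the toy lap times and predictions are all made up for the example. Only the `Stint` and `Compound` column names come from the feature list.

```python
import numpy as np
import pandas as pd


def large_residual_report(laps, y_true, y_pred, threshold_s=2.0):
    """Flag laps whose absolute residual exceeds threshold_s and summarize them.

    Hypothetical helper: returns the annotated laps plus the count and share
    of large errors per stint and per compound.
    """
    out = laps.copy()
    out["residual_s"] = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    out["large_error"] = out["residual_s"].abs() > threshold_s
    by_stint = out.groupby("Stint")["large_error"].agg(n_large="sum", share="mean")
    by_compound = out.groupby("Compound")["large_error"].agg(n_large="sum", share="mean")
    return out, by_stint, by_compound


# Toy data (placeholder values, not real lap times)
laps_demo = pd.DataFrame({
    "LapNumber": [1, 2, 3, 4, 5, 6],
    "Stint": [1, 1, 1, 2, 2, 2],
    "Compound": ["HARD", "HARD", "HARD", "MEDIUM", "MEDIUM", "MEDIUM"],
})
y_true_demo = [80.0, 76.5, 76.0, 82.5, 75.8, 75.5]
y_pred_demo = [77.0, 76.3, 76.1, 78.0, 75.9, 75.6]

flagged, by_stint, by_compound = large_residual_report(laps_demo, y_true_demo, y_pred_demo)
print(flagged[flagged["large_error"]])
print(by_stint)
print(by_compound)
```

With the notebook's own objects, out-of-fold predictions for every lap could be obtained with `sklearn.model_selection.cross_val_predict(model, X, y, cv=kf)` and passed in as `y_pred`, so that each residual comes from a model that did not see that lap during training.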

More features:
- A more explicit measure of “race phase” (for example, lap_norm = LapNumber / max_laps).
- Aggregate features from previous laps (mean of the last N laps, best previous lap, etc.).
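A possible way to build these features with pandas is sketched below. The helper name, the new column names (`prev_mean_s`, `prev_best_s`) and the default `n_prev=3` are hypothetical, and `max_laps` is assumed to default to the largest `LapNumber` in the data. The lap times are shifted by one lap before aggregating, so each row only sees laps that were already completed and the target does not leak into its own features.

```python
import pandas as pd


def add_race_phase_and_history(laps, n_prev=3, max_laps=None):
    """Add lap_norm plus rolling-mean and best-so-far features from previous laps.

    Assumes `laps` holds a single session with LapNumber and LapTime_s columns.
    """
    out = laps.sort_values("LapNumber").reset_index(drop=True)
    if max_laps is None:
        max_laps = out["LapNumber"].max()  # assumption: last observed lap
    out["lap_norm"] = out["LapNumber"] / max_laps

    # Shift by one so the current lap's own time is never used as a feature
    prev = out["LapTime_s"].shift(1)
    out["prev_mean_s"] = prev.rolling(n_prev, min_periods=1).mean()
    out["prev_best_s"] = prev.cummin()
    return out


# Toy data (placeholder values, not real lap times)
laps_demo = pd.DataFrame({
    "LapNumber": [1, 2, 3, 4, 5],
    "LapTime_s": [80.0, 78.0, 77.0, 79.0, 76.0],
})
feats = add_race_phase_and_history(laps_demo)
print(feats)
```

The first lap has no history, so its `prev_mean_s` and `prev_best_s` are NaN and would need imputing or dropping before being fed to the pipeline. For a file that mixes several sessions, the same function could be applied per `Session` group.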

More data:
- Add Gasly's Monaco data (same car, different driver).
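One way to merge a second driver's laps is sketched below, assuming both tables share the same columns. The `Driver` column, the `combine_drivers` helper and the driver codes are hypothetical, and the tiny frames hold placeholder values rather than real data.

```python
import pandas as pd


def combine_drivers(frames_by_driver):
    """Stack per-driver lap tables and tag each row with its driver."""
    parts = [laps.assign(Driver=name) for name, laps in frames_by_driver.items()]
    return pd.concat(parts, ignore_index=True)


# Placeholder frames standing in for the two processed lap tables
colapinto_demo = pd.DataFrame({"LapNumber": [1, 2], "LapTime_s": [80.0, 78.0]})
gasly_demo = pd.DataFrame({"LapNumber": [1, 2], "LapTime_s": [79.5, 77.8]})

combined = combine_drivers({"COL": colapinto_demo, "GAS": gasly_demo})
print(combined)
```

If `Driver` were then appended to `LEGAL_FEATURES_CAT`, the existing `OneHotEncoder` step would encode it like the other categorical features, letting the model learn a per-driver offset.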

