# 🤖 Machine Learning F1 – 2025 Season
This notebook explores **predictive modeling** on the 2025 F1 season:
- Podium prediction (classification)
- Pit stop duration prediction (regression)
> 🏁 *Data is good, data that predicts F1 is better!* 🙂

## 1️⃣ Imports & Data Loading
We load all the datasets needed for modeling.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, mean_absolute_error, r2_score
import plotly.express as px
import plotly.graph_objects as go

In [2]:
DATA_DIR = '../data'

df_results = pd.read_parquet(f'{DATA_DIR}/results_2025.parquet')
df_quali = pd.read_parquet(f'{DATA_DIR}/qualifying_2025.parquet')
df_pits = pd.read_parquet(f'{DATA_DIR}/pitstops_2025.parquet')
df_drv_stand = pd.read_parquet(f'{DATA_DIR}/driver_standings_2025.parquet')
df_team_stand = pd.read_parquet(f'{DATA_DIR}/team_standings_2025.parquet')
df_weather = pd.read_parquet(f'{DATA_DIR}/weather_2025.parquet')
df_flights = pd.read_parquet(f'{DATA_DIR}/flightlegs_2025.parquet')

## 2️⃣ Podium prediction (supervised classification) 🏆
Goal: predict whether a driver will finish on the podium (top 3) from qualifying results, team, and weather.
- **Target**: `is_podium` (1 if top 3, 0 otherwise)
- **Features**: grid position, qualifying position, team, weather, Q1/Q2/Q3 lap times
- **Model**: Random Forest Classifier
- **Evaluation**: classification_report, AUC, ROC, feature importance

In [3]:
# Weather at the start of each GP: first() keeps the first non-null value of each column per event
# (this assumes df_weather is sorted chronologically within each event)
df_weather_gp = df_weather.groupby('event').first().reset_index()
df_weather_gp = df_weather_gp[['event', 'AirTemp', 'Rainfall', 'TrackTemp']]

# Qualifying position (best Position per driver and GP)
best_quali = df_quali.groupby(['FullName', 'event'])['Position'].min().reset_index()
best_quali = best_quali.rename(columns={'Position': 'QualiPosition'})

# Merge everything onto the race results
df_ml = (
    df_results
    .merge(df_quali[['FullName', 'event', 'Q1', 'Q2', 'Q3']], on=['FullName', 'event'], how='left')
    .merge(best_quali, on=['FullName', 'event'], how='left')
    .merge(df_weather_gp, on='event', how='left')
)
df_ml['is_podium'] = (df_ml['Position'] <= 3).astype(int)

features = [
    'GridPosition',
    'QualiPosition',
    'TeamName',
    'Q1', 'Q2', 'Q3',
    'AirTemp', 'Rainfall', 'TrackTemp'
]
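# Note: dropna() also removes rows with a missing Q2 or Q3 time (typically drivers
# eliminated before Q3), so the model only sees drivers with all three times.
df_features = df_ml[features + ['is_podium']].dropna()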
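
The chained merges above assume one qualifying row per driver and event. If `df_quali` held several rows for the same pair, the left merge would silently duplicate race results. A minimal sketch of how that assumption could be checked with the `validate` argument of `DataFrame.merge`, using small made-up frames (`results_toy` and `quali_toy` are hypothetical, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for df_results and df_quali (hypothetical values).
results_toy = pd.DataFrame({
    "FullName": ["Driver A", "Driver B"],
    "event": ["GP 1", "GP 1"],
    "Position": [1, 2],
})
quali_toy = pd.DataFrame({
    "FullName": ["Driver A", "Driver A", "Driver B"],  # Driver A appears twice
    "event": ["GP 1", "GP 1", "GP 1"],
    "Q1": [80.1, 80.3, 80.5],
})
keys = ["FullName", "event"]

# Without validation, the duplicated key inflates the result from 2 to 3 rows.
merged_unchecked = results_toy.merge(quali_toy, on=keys, how="left")
n_rows_unchecked = len(merged_unchecked)

# With validate="one_to_one", pandas raises MergeError instead of duplicating rows.
try:
    results_toy.merge(quali_toy, on=keys, how="left", validate="one_to_one")
    merge_is_clean = True
except pd.errors.MergeError:
    merge_is_clean = False

# Deduplicating the right-hand side first makes the merge safe.
quali_dedup = quali_toy.drop_duplicates(subset=keys, keep="first")
merged_checked = results_toy.merge(quali_dedup, on=keys, how="left", validate="one_to_one")

print(n_rows_unchecked, merge_is_clean, len(merged_checked))
```

Which duplicate to keep (first, fastest, last session) is a modeling decision; `keep="first"` here is only a placeholder.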

In [4]:
# Encoding: one-hot encode the team, convert Q1/Q2/Q3 to seconds
df_features = pd.get_dummies(df_features, columns=['TeamName'])
for q in ['Q1', 'Q2', 'Q3']:
    if np.issubdtype(df_features[q].dtype, np.timedelta64):
        df_features[q] = df_features[q].dt.total_seconds()
    elif df_features[q].dtype == 'O':
        df_features[q] = pd.to_timedelta(df_features[q]).dt.total_seconds()
    # Fallback: fill any remaining gap with the slowest time (a no-op here, since dropna() already ran)
    df_features[q] = df_features[q].fillna(df_features[q].max())

# Split, model, evaluation
X = df_features.drop(columns=['is_podium'])
y = df_features['is_podium']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
roc_auc = roc_auc_score(y_test, y_proba)
print(f'AUC: {roc_auc:.3f}')

              precision    recall  f1-score   support

           0       0.90      0.86      0.88        21
           1       0.70      0.78      0.74         9

    accuracy                           0.83        30
   macro avg       0.80      0.82      0.81        30
weighted avg       0.84      0.83      0.84        30

AUC: 0.881


The model picks out podium finishers well above chance on this test set, in a noisy setting where part of the outcome is inherently random in F1.

**Interpretation:**
- The model reaches 83% accuracy and an AUC of 0.88 on a test set of 30 rows (9 podiums).
- It recovers 78% of the actual podiums (recall), and 70% of its podium predictions are correct (precision), despite the randomness of F1 (strategy, incidents, weather).
- The most important features are grid position, qualifying position, and team.
> ⚠️ More advanced models (XGBoost, stacking, etc.) could improve the score but require more tuning and additional data that is often not publicly available.
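
The report above can be read back into raw counts. The sketch below assumes the confusion matrix implied by the printed support, precision and recall (18 true negatives, 3 false positives, 2 false negatives, 7 true positives). These counts are inferred from the rounded report, not printed by the notebook.

```python
# Counts inferred from the classification report (an assumption, not notebook output).
tn, fp, fn, tp = 18, 3, 2, 7
total = tn + fp + fn + tp

precision_podium = tp / (tp + fp)   # share of predicted podiums that were real
recall_podium = tp / (tp + fn)      # share of real podiums that were found
precision_other = tn / (tn + fn)    # same two metrics for the "no podium" class
recall_other = tn / (tn + fp)
accuracy = (tp + tn) / total

# Baseline that always predicts "no podium": correct on 21 of 30 rows.
majority_accuracy = (tn + fp) / total

print(f"podium: precision={precision_podium:.2f} recall={recall_podium:.2f}")
print(f"other:  precision={precision_other:.2f} recall={recall_other:.2f}")
print(f"accuracy={accuracy:.2f} vs majority baseline={majority_accuracy:.2f}")
```

With only 9 podium rows in the test set, a single misclassified podium moves recall by about 11 points, so these scores are best read as rough estimates.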

In [5]:
# ROC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'AUC = {roc_auc:.2f}', line=dict(color='royalblue')))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', name='Random', line=dict(dash='dash', color='gray')))
fig.update_layout(title="ROC Curve – Podium prediction", xaxis_title='False Positive Rate', yaxis_title='True Positive Rate', template='plotly_dark', width=700, height=500)
fig.show()


In [6]:
# Feature importance
importances = clf.feature_importances_
feat_names = X.columns
feat_imp_df = pd.DataFrame({'feature': feat_names, 'importance': importances}).sort_values('importance', ascending=False)
fig = px.bar(feat_imp_df.head(12), x='importance', y='feature', orientation='h', title="Top Feature Importances", labels={'importance': "Importance", 'feature': "Variable"})
fig.update_layout(template='plotly_dark', height=450)
fig.show()
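
The `feature_importances_` attribute of a random forest is impurity-based and computed from the training data, which tends to favour continuous or high-cardinality features. Permutation importance measured on held-out data is a common complement. A minimal sketch on synthetic data (the dataset and the names `X_demo`, `y_demo`, `clf_demo` are illustrative, not the notebook's variables):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features; with shuffle=False the 2 informative ones come first (f0, f1).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=42
)
clf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
clf_demo.fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the drop in AUC.
result = permutation_importance(
    clf_demo, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)
perm_df = pd.DataFrame({
    "feature": X_demo.columns,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std,
}).sort_values("importance_mean", ascending=False)
print(perm_df)
```

Applied to the podium model, the same call would take `clf`, `X_test` and `y_test`. One-hot team columns would each be permuted separately, so their individual scores may look small even if the team as a whole matters.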

## 3️⃣ Pit stop duration prediction (supervised regression) 🛞
Goal: predict the duration of a pit stop from strategy, team, lap, weather, etc.
- **Target**: `pit_duration` (in seconds)
- **Features**: team, tyre compounds, weather, lap, position, tyre wear
- **Model**: Random Forest Regressor
- **Evaluation**: MAE, R², scatter plot, feature importance

In [7]:
# Join weather on event (reuses the start-of-race weather computed above)
df_pits_valid = df_pits.dropna(subset=['PitInTime', 'PitOutTime', 'TeamName', 'CompoundIn', 'CompoundOut', 'PositionIn', 'TyreLifeIn']).copy()
df_pits_valid = df_pits_valid.merge(df_weather_gp, on='event', how='left')

# Stop duration (s): time between pit entry and pit exit
df_pits_valid['PitInTime'] = pd.to_timedelta(df_pits_valid['PitInTime'])
df_pits_valid['PitOutTime'] = pd.to_timedelta(df_pits_valid['PitOutTime'])
df_pits_valid['pit_duration'] = (df_pits_valid['PitOutTime'] - df_pits_valid['PitInTime']).dt.total_seconds()
df_pits_valid = df_pits_valid[df_pits_valid['pit_duration'] > 0]

features_pit = [
    'TeamName', 'CompoundIn', 'CompoundOut', 'LapIn', 'PositionIn', 'TyreLifeIn',
    'AirTemp', 'Rainfall', 'TrackTemp'
]
df_feat_pit = pd.get_dummies(df_pits_valid[features_pit + ['pit_duration']].dropna(), columns=['TeamName', 'CompoundIn', 'CompoundOut'])
X_pit = df_feat_pit.drop(columns=['pit_duration'])
y_pit = df_feat_pit['pit_duration']

X_pit_train, X_pit_test, y_pit_train, y_pit_test = train_test_split(X_pit, y_pit, test_size=0.2, random_state=42)


In [8]:
# Model, prediction, evaluation
regr = RandomForestRegressor(n_estimators=100, random_state=42)
regr.fit(X_pit_train, y_pit_train)
y_pit_pred = regr.predict(X_pit_test)

mae = mean_absolute_error(y_pit_test, y_pit_pred)
r2 = r2_score(y_pit_test, y_pit_pred)
print(f"MAE = {mae:.2f} s | R² = {r2:.2f}")

MAE = 1.71 s | R² = 0.16


**Interpretation:**
- The model predicts pit stop duration with a mean absolute error of 1.7 s.
- The R² is low (0.16): the features explain only about 16% of the variance, which reflects the high variability of stops (traffic, incidents, strategy, etc.).
- The most important features are team, tyre compound, and weather.
> ⚠️ A more advanced model or additional features (traffic, incidents, telemetry) could improve the prediction.
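
To put an R² of 0.16 in context: R² compares the model's squared error with that of a baseline that always predicts the mean duration, so a score of 0 means "no better than the mean" and 1 means a perfect fit. A minimal sketch of both metrics computed by hand on made-up durations (the values in `y_true` and `y_hat` are hypothetical, not taken from the dataset):

```python
# Hypothetical pit stop durations in seconds (illustration only).
y_true = [22.0, 24.0, 23.0, 30.0, 21.0]
y_hat = [23.0, 23.5, 23.0, 25.0, 22.5]

n = len(y_true)
mean_true = sum(y_true) / n

# Mean absolute error of the predictions.
mae_demo = sum(abs(t - p) for t, p in zip(y_true, y_hat)) / n

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2_demo = 1 - ss_res / ss_tot

# Constant baseline: always predict the mean (its R² is 0 by construction).
mae_baseline = sum(abs(t - mean_true) for t in y_true) / n

print(f"model:    MAE = {mae_demo:.2f} s | R² = {r2_demo:.2f}")
print(f"baseline: MAE = {mae_baseline:.2f} s | R² = 0.00")
```

Reporting the baseline MAE next to the model's MAE makes it easier to judge how much of the 1.7 s error is actually gained over a constant guess.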


In [9]:
fig = px.scatter(
    x=y_pit_test, y=y_pit_pred,
    labels={'x': "Actual duration (s)", 'y': "Predicted (s)"},
    title="Pit stop duration – Prediction vs. actual", opacity=0.7
)
fig.add_shape(type="line", x0=y_pit_test.min(), y0=y_pit_test.min(), x1=y_pit_test.max(), y1=y_pit_test.max(), line=dict(color="white", dash="dash"))
fig.update_layout(template='plotly_dark', width=600, height=500)
fig.show()

In [10]:
# Feature importance
importances_pit = regr.feature_importances_
feat_names_pit = X_pit.columns
feat_imp_df_pit = pd.DataFrame({'feature': feat_names_pit, 'importance': importances_pit}).sort_values('importance', ascending=False)
fig = px.bar(feat_imp_df_pit.head(12), x='importance', y='feature', orientation='h', title="Top Feature Importances (Pit stop duration, enriched)", labels={'importance': "Importance", 'feature': "Variable"})
fig.update_layout(template='plotly_dark', height=450)
fig.show()