<a href="https://colab.research.google.com/github/100479095/Predictor_F1_2025/blob/main/entrenamiento_modelo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**INTRODUCTION**

---

This notebook trains several AI models to predict the finishing positions of the last Formula 1 Grand Prix of 2025, Abu Dhabi. The data, collected from Kaggle, was preprocessed beforehand and is stored in the file "f1_training_data_2016_onwards.csv". Here we drop some columns from that CSV to obtain a generic dataset, then cross-validate several models to find the best one, which is used to predict the final results of the GP.


#**Data loading**

---

First, we import the libraries used throughout training and cross-validation.

In [18]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

We load the data from the CSV **f1_training_data_2016_onwards.csv**. Keep in mind that this file is **generated by the Python loading script**, and for the code to run correctly it must be **uploaded to Google Colab**.
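A minimal sketch of a guard around the load step, assuming the CSV sits in the session's working directory. The helper name `load_training_data` is hypothetical and not part of the notebook; it only wraps `pd.read_csv` so that a missing upload fails with a clear message.

```python
from pathlib import Path

import pandas as pd


def load_training_data(path="f1_training_data_2016_onwards.csv"):
    """Read the training CSV, failing early with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found: generate it with the loading script "
            "and upload it to the Colab session first."
        )
    return pd.read_csv(csv_path)


# Usage (assumes the file has already been uploaded):
# df = load_training_data()
```

Checking for the file first avoids a less specific error later in the notebook when the upload step was skipped.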

In [7]:
df = pd.read_csv('f1_training_data_2016_onwards.csv')
# Show the first 5 rows
print(df.head())

   RACEID  DRIVERID  CONSTRUCTORID  CIRCUITID  ROUND  YEAR  LAP DISTANCE KM  \
0     948         1            131          1      1  2016            5.278   
1     948         3            131          1      1  2016            5.278   
2     948         4              1          1      1  2016            5.278   
3     948         8              6          1      1  2016            5.278   
4     948        13              3          1      1  2016            5.278   

   LAPS RACE  AVG WIND SPEED  MAX WIND SPEED  ...  MATE LAST POSITION  \
0         57           17.35           25.97  ...                   1   
1         57           17.35           25.97  ...                   2   
2         57           17.35           25.97  ...                  12   
3         57           17.35           25.97  ...                   4   
4         57           17.35           25.97  ...                  13   

   CONSTRUCTOR POINTS BEFORE GP  CONSTRUCTOR WINS SEASON       Q1       Q2  \
0       

## ID cleanup

In this section we drop the RACEID and DRIVERID columns, since they are unique or highly specific and we want the model to generalize across the data. The target column, MS RACE, is also removed from the features and log-transformed.

In [11]:
# Left out of the drop list: 'CONSTRUCTORID', 'CIRCUITID', 'Q1', 'Q2', 'Q3'
cols_to_drop = ['RACEID', 'DRIVERID', 'MS RACE']
X = df.drop(columns=cols_to_drop)

# Log transform of the target (log1p compresses the large millisecond values)
y = np.log1p(df['MS RACE'])

# Identify categorical and numerical columns
categorical_features = ['CONSTRUCTORID', 'CIRCUITID']
numerical_features = [col for col in X.columns if col not in categorical_features]

# Build the preprocessing transformer
preprocessor = ColumnTransformer(
    transformers=[
        # Scale numerical columns so the model is not biased by large magnitudes
        ('num', StandardScaler(), numerical_features),
        # Convert IDs to binary vectors (one-hot)
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
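As an illustration of what this preprocessing does, the following sketch applies the same kind of `ColumnTransformer` to a tiny, made-up frame. Only the column names `CONSTRUCTORID`, `CIRCUITID` and `LAPS RACE` come from the notebook; the values, the constructor ID `999` and the names `toy_preprocessor`, `train`, `new` and `race_ms` are assumptions for the example. `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; values are illustrative, not taken from the dataset
train = pd.DataFrame({
    "CONSTRUCTORID": [131, 1, 6, 131],
    "CIRCUITID": [1, 1, 2, 2],
    "LAPS RACE": [57, 57, 70, 70],
})
new = pd.DataFrame({
    "CONSTRUCTORID": [999],  # a constructor never seen during fit
    "CIRCUITID": [1],
    "LAPS RACE": [57],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["LAPS RACE"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["CONSTRUCTORID", "CIRCUITID"]),
    ],
    sparse_threshold=0,  # always return a dense array (for display only)
)

encoded_train = toy_preprocessor.fit_transform(train)
encoded_new = toy_preprocessor.transform(new)

# 1 scaled column + 3 constructor columns + 2 circuit columns
print(encoded_train.shape)
# The unseen constructor is encoded as all zeros instead of raising an error
print(encoded_new)

# log1p and expm1 are inverses, so predictions can be mapped back to milliseconds
race_ms = np.array([5_400_000.0, 5_460_000.0])  # hypothetical race times
print(np.allclose(np.expm1(np.log1p(race_ms)), race_ms))
```

With `handle_unknown='ignore'`, a team or circuit that did not appear in the training data contributes no information to the prediction rather than breaking it.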


## Definition and Evaluation of Regression Models

---

To perform the regression, we pick the best of three models: Random Forest, MLP, or Gradient Boosting.


In [12]:
# Define models to evaluate
models = [
    ('RandomForestRegressor', RandomForestRegressor(n_estimators=50, random_state=42)),
    ('MLPRegressor', MLPRegressor(max_iter=500, random_state=42)),
    ('GradientBoostingRegressor', GradientBoostingRegressor(random_state=42))
]

# Split data into training and testing sets
# Perform stratified split by 'YEAR'
print("Performing stratified train-test split by 'YEAR'...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X['YEAR']
)
print("Distribution of 'YEAR' in X_train:")
print(X_train['YEAR'].value_counts(normalize=True).sort_index())

print("\nDistribution of 'YEAR' in X_test:")
print(X_test['YEAR'].value_counts(normalize=True).sort_index())

# Initialize KFold for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("KFold cross-validation object initialized.")

print("\nEvaluating models with cross-validation (neg_mean_squared_error):")
# Run cross-validation
results = []
names = []

print("Comparando modelos (RMSE en escala Logarítmica - Menor es mejor):")
for name, model in models:
    # Build a separate pipeline per model so preprocessing is fit inside each fold (avoids data leakage)
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

    # cross_val_score returns 'neg_mean_squared_error', so we negate it and take the square root.
    # Note: this runs on the full X, y with cv=5 (unshuffled folds), not on X_train with kf.
    cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
    rmse_scores = np.sqrt(-cv_scores)

    results.append(rmse_scores)
    names.append(name)
    print(f"{name}: {rmse_scores.mean():.4f} (+/- {rmse_scores.std():.4f})")

Performing stratified train-test split by 'YEAR'...
Distribution of 'YEAR' in X_train:
YEAR
2016    0.107558
2017    0.093023
2018    0.097674
2019    0.097674
2020    0.079070
2021    0.102326
2022    0.102326
2023    0.102326
2024    0.111337
2025    0.106686
Name: proportion, dtype: float64

Distribution of 'YEAR' in X_test:
YEAR
2016    0.106977
2017    0.093023
2018    0.097674
2019    0.097674
2020    0.079070
2021    0.102326
2022    0.102326
2023    0.102326
2024    0.111628
2025    0.106977
Name: proportion, dtype: float64
KFold cross-validation object initialized.

Evaluating models with cross-validation (neg_mean_squared_error):
Comparando modelos (RMSE en escala Logarítmica - Menor es mejor):
RandomForestRegressor: 0.3131 (+/- 0.1234)
MLPRegressor: 1.0972 (+/- 0.4882)
GradientBoostingRegressor: 0.3054 (+/- 0.1232)
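The comparison loop passes `cv=5` and the full `X`, `y` to `cross_val_score`, so neither the `kf` object nor the held-out split takes part in model selection. The sketch below shows one possible variant that compares models on the training portion only, with shuffled folds. It runs on synthetic data from `make_regression`, not on the F1 dataset, and the names `X_demo`, `kf_demo`, `candidates` and `cv_rmse` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Shuffled folds, reused for every candidate so the scores are comparable
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = [
    ("RandomForestRegressor", RandomForestRegressor(n_estimators=20, random_state=42)),
    ("GradientBoostingRegressor", GradientBoostingRegressor(n_estimators=50, random_state=42)),
]

cv_rmse = {}
for name, model in candidates:
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_tr, y_tr, cv=kf_demo, scoring="neg_mean_squared_error")
    cv_rmse[name] = np.sqrt(-scores)
    print(f"{name}: {cv_rmse[name].mean():.4f} (+/- {cv_rmse[name].std():.4f})")
```

Keeping the test rows out of the comparison means the final test metrics are not influenced by the model choice.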


## Hyperparameter Optimization



In [16]:
print("Starting GridSearchCV for GradientBoostingRegressor...")

# Select the best model for hyperparameter tuning
best_model_name = 'GradientBoostingRegressor'
best_model = None
for name, model in models:
    if name == best_model_name:
        best_model = model
        break
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', best_model)])
# Define the parameter grid for GridSearchCV
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__max_depth': [3, 4, 5]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=kf,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

print("GridSearchCV completed.")

# Print the best parameters and best score
print(f"\nBest parameters for {best_model_name}: {grid_search.best_params_}")
print(f"Best cross-validation score (neg_mean_squared_error) for {best_model_name}: {grid_search.best_score_:.4f}")
print(f"Best RMSE score for {best_model_name}: {np.sqrt(np.abs(grid_search.best_score_)):.4f}")

Starting GridSearchCV for GradientBoostingRegressor...
Fitting 5 folds for each of 27 candidates, totalling 135 fits
GridSearchCV completed.

Best parameters for GradientBoostingRegressor: {'model__learning_rate': 0.1, 'model__max_depth': 4, 'model__n_estimators': 100}
Best cross-validation score (neg_mean_squared_error) for GradientBoostingRegressor: -0.0445
Best RMSE score for GradientBoostingRegressor: 0.2109
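The two reported scores are linked by simple arithmetic, which the sketch below reproduces from the rounded value `-0.0445`. Because that input is rounded, the last digit may differ slightly from the `0.2109` printed by the notebook. The multiplicative reading is a rough interpretation rather than a metric computed in the notebook: since the target is `log1p` of a time in milliseconds, an error of `d` on the log scale corresponds to a factor of about `exp(d)` on the original scale. The variable names are assumptions for the example.

```python
import numpy as np

best_score = -0.0445  # rounded neg_mean_squared_error printed by GridSearchCV

# scikit-learn negates MSE so that "greater is better"; undo the sign, then take the root
rmse_log = np.sqrt(-best_score)

# On a log scale, an error of rmse_log is roughly a multiplicative factor on the time
typical_factor = np.exp(rmse_log)

print(f"RMSE (log scale): {rmse_log:.4f}")
print(f"Approximate multiplicative error: x{typical_factor:.3f}")
```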


## Final Model Training and Prediction




In [21]:
print("Training final model with optimal hyperparameters and making predictions...")

# Get the best model from GridSearchCV
best_model_final = grid_search.best_estimator_

# Predict and convert back to the original scale (ms)
y_pred_log = best_model_final.predict(X_test)
y_pred = np.expm1(y_pred_log)  # inverse of log1p
y_test_ms = np.expm1(y_test)   # separate name so re-running the cell does not transform y_test twice

# Evaluate on the original millisecond scale
rmse = np.sqrt(mean_squared_error(y_test_ms, y_pred))
mape = mean_absolute_percentage_error(y_test_ms, y_pred)
r2 = metrics.r2_score(y_test_ms, y_pred)

print(f"RMSE (en ms): {rmse:,.0f}")
print(f"MAPE (Error Porcentual): {mape:.2%}")
print(f"R-squared (R2): {r2:.4f}")


Training final model with optimal hyperparameters and making predictions...
RMSE (en ms): 1,590,106
MAPE (Error Porcentual): 14.34%
R-squared (R2): 0.1980
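The model predicts a race time in milliseconds, while the stated goal is a finishing order. One way to bridge the two, sketched here on hypothetical values, is to rank the predicted times within each race. This step is not part of the notebook. It also assumes `RACEID` and `DRIVERID` are kept aside before they are dropped from the features, and the column names `PREDICTED_MS` and `PREDICTED_POSITION` are made up for the example.

```python
import pandas as pd

# Hypothetical predictions for two races; the times are illustrative, not model output
preds = pd.DataFrame({
    "RACEID": [1, 1, 1, 2, 2, 2],
    "DRIVERID": [1, 3, 4, 1, 3, 4],
    "PREDICTED_MS": [5_410_000, 5_402_000, 5_455_000, 5_900_000, 5_915_000, 5_890_000],
})

# Within each race, the shortest predicted time gets position 1
preds["PREDICTED_POSITION"] = (
    preds.groupby("RACEID")["PREDICTED_MS"].rank(method="first").astype(int)
)

print(preds.sort_values(["RACEID", "PREDICTED_POSITION"]))
```

Ranking only needs the predicted times to be ordered correctly within a race, so it can be more forgiving than the absolute errors in milliseconds suggest.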
