<a href="https://colab.research.google.com/github/100479095/Predictor_F1_2025/blob/main/entrenamiento_modelo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**INTRODUCTION**

---

This notebook trains several machine learning models to predict the finishing positions of the last Formula 1 Grand Prix of 2025, Abu Dhabi. The data, collected from Kaggle, were preprocessed beforehand and are stored in the file "f1_training_data_2016_onwards.csv". Here we drop some columns from that CSV to obtain a more generic dataset, then run cross-validation across several models to find the best one, which is then used to predict the final results of the GP.


#**Data loading**

---

First, we import the libraries used throughout training and cross-validation.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

We load the data from the **CSV file f1_training_data_2016_onwards.csv**. Note that this file is **generated by the Python loading script**, and for the code to run correctly it must be **uploaded to Google Colab**.
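
Since the CSV has to be uploaded by hand, a small guard can make a missing file easier to diagnose. The sketch below is illustrative only: `load_training_csv` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_training_csv(path="f1_training_data_2016_onwards.csv"):
    """Hypothetical helper: read the CSV, failing early with a clear hint."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Raise before pandas does, so the message says what to do next
        raise FileNotFoundError(
            f"{csv_path} not found; upload it to the Colab session first."
        )
    return pd.read_csv(csv_path)
```

Calling `load_training_csv()` with no arguments would then stand in for the plain `pd.read_csv` call.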

In [3]:
df = pd.read_csv('f1_training_data_2016_onwards.csv')
# Show the first 5 rows
print(df.head())

   RACEID  DRIVERID  CONSTRUCTORID  CIRCUITID  ROUND  YEAR  LAP DISTANCE KM  \
0     948         1            131          1      1  2016            5.278   
1     948         3            131          1      1  2016            5.278   
2     948         4              1          1      1  2016            5.278   
3     948         8              6          1      1  2016            5.278   
4     948        13              3          1      1  2016            5.278   

   LAPS RACE  AVG WIND SPEED  MAX WIND SPEED  ...  MATE LAST POSITION  \
0         57           17.35           25.97  ...                   1   
1         57           17.35           25.97  ...                   2   
2         57           17.35           25.97  ...                  12   
3         57           17.35           25.97  ...                   4   
4         57           17.35           25.97  ...                  13   

   CONSTRUCTOR POINTS BEFORE GP  CONSTRUCTOR WINS SEASON       Q1       Q2  \
0       

## ID cleanup

In this section we drop the RACEID and DRIVERID columns: they are unique or highly specific identifiers, and we want the model to generalize across the data.

In [6]:
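# Drop identifiers that are unique or too specific to generalize from
df_processed = df.drop(columns=['RACEID', 'DRIVERID'])
# 'CONSTRUCTORID' and 'CIRCUITID' were also candidates for removal but are kept
# Target: 'MS RACE'; features: every remaining column
y = df_processed['MS RACE']
X = df_processed.drop(columns=['MS RACE'])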
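
One way to judge whether a column behaves like an identifier is to compare its number of distinct values with the number of rows. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and a hypothetical helper `unique_ratio`.

```python
import pandas as pd

# Toy data: values are illustrative only, not taken from the real CSV
toy = pd.DataFrame({
    "RACEID": [948, 948, 949, 949],
    "DRIVERID": [1, 3, 1, 3],
    "ROUND": [1, 1, 2, 2],
    "ROW_KEY": [10, 11, 12, 13],  # hypothetical fully unique key
})


def unique_ratio(frame):
    """Fraction of distinct values per column (1.0 means every row is unique)."""
    return frame.nunique() / len(frame)


ratios = unique_ratio(toy)
print(ratios.sort_values(ascending=False))
```

A ratio close to 1.0 suggests an identifier. Lower ratios still call for judgment, since a column such as RACEID can be specific to one event without being unique per row.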


## Definition and Evaluation of Regression Models

---

For the regression we select the best of 3 models (Random Forest, MLP or Gradient Boosting).


In [10]:
# Define models to evaluate
models = [
    ('RandomForestRegressor', RandomForestRegressor(random_state=42)),
    ('MLPRegressor', MLPRegressor(random_state=42, max_iter=1000, early_stopping=True)),
    ('GradientBoostingRegressor', GradientBoostingRegressor(random_state=42))
]

# Split data into training and testing sets
# Perform stratified split by 'YEAR'
print("Performing stratified train-test split by 'YEAR'...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X['YEAR']
)
print("Distribution of 'YEAR' in X_train:")
print(X_train['YEAR'].value_counts(normalize=True).sort_index())

print("\nDistribution of 'YEAR' in X_test:")
print(X_test['YEAR'].value_counts(normalize=True).sort_index())

# Initialize KFold for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("KFold cross-validation object initialized.")

results = {}
print("\nEvaluating models with cross-validation (neg_mean_squared_error):")
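for name, model in models:
    # cross_val_score returns the negated MSE for each fold (higher is better)
    cv_scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)
    # Convert each fold's negated MSE to RMSE, then average across folds
    RMSE_scores = np.sqrt(np.abs(cv_scores))
    mean_RMSE = RMSE_scores.mean()
    results[name] = mean_RMSE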

print("\nCross-validation results:")
for name, score in results.items():
    print(f"{name}: {score:.4f}")

Performing stratified train-test split by 'YEAR'...
Distribution of 'YEAR' in X_train_raw:
YEAR
2016    0.107558
2017    0.093023
2018    0.097674
2019    0.097674
2020    0.079070
2021    0.102326
2022    0.102326
2023    0.102326
2024    0.111337
2025    0.106686
Name: proportion, dtype: float64

Distribution of 'YEAR' in X_test_raw:
YEAR
2016    0.106977
2017    0.093023
2018    0.097674
2019    0.097674
2020    0.079070
2021    0.102326
2022    0.102326
2023    0.102326
2024    0.111628
2025    0.106977
Name: proportion, dtype: float64
Stratified data split complete.
X_train_raw shape: (3440, 30), y_train shape: (3440,)
X_test_raw shape: (860, 30), y_test shape: (860,)
KFold cross-validation object initialized.

Evaluating models with cross-validation (neg_mean_squared_error):
RandomForestRegressor: Mean MAE = 1592526.9461
MLPRegressor: Mean MAE = 1936427.3389
GradientBoostingRegressor: Mean MAE = 1588421.7270

Cross-validation results:
RandomForestRegressor: 1592526.9461
MLPRegres
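
The evaluation loop above turns scikit-learn's negated MSE scores into RMSE. As a minimal sketch on made-up fold scores (the numbers are illustrative, not results from this notebook):

```python
import numpy as np

# Hypothetical per-fold scores, as returned with scoring='neg_mean_squared_error'
cv_scores = np.array([-4.0, -9.0, -16.0])

# Negate to recover the MSE of each fold, then take the square root
rmse_per_fold = np.sqrt(-cv_scores)
mean_rmse = rmse_per_fold.mean()

# Square root of the mean MSE: a related but different quantity
rmse_of_mean_mse = np.sqrt(-cv_scores.mean())

print(rmse_per_fold, mean_rmse, rmse_of_mean_mse)
```

The loop reports the mean of the per-fold RMSEs, whereas `np.sqrt(np.abs(grid_search.best_score_))` in the tuning step is the square root of the mean MSE. The two are usually close but not identical, so small differences between them are expected.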

## Hyperparameter Optimization



In [11]:
print("Starting GridSearchCV for GradientBoostingRegressor...")

# Select the best model for hyperparameter tuning
best_model_name = 'GradientBoostingRegressor'
best_model = None
for name, model in models:
    if name == best_model_name:
        best_model = model
        break

# Define the parameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=best_model,
    param_grid=param_grid,
    cv=kf,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

print("GridSearchCV completed.")

# Print the best parameters and best score
print(f"\nBest parameters for {best_model_name}: {grid_search.best_params_}")
print(f"Best cross-validation score (neg_mean_squared_error) for {best_model_name}: {grid_search.best_score_:.4f}")
print(f"Best RMSE score for {best_model_name}: {np.sqrt(np.abs(grid_search.best_score_)):.4f}")

Starting GridSearchCV for GradientBoostingRegressor...
Fitting 5 folds for each of 27 candidates, totalling 135 fits
GridSearchCV completed.

Best parameters for GradientBoostingRegressor: {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 200}
Best cross-validation score (neg_mean_squared_error) for GradientBoostingRegressor: -2492948499314.9468
Best RMSE score for GradientBoostingRegressor: 1578907.3752
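
The log line "Fitting 5 folds for each of 27 candidates, totalling 135 fits" follows directly from the grid size. A quick check, using scikit-learn's `ParameterGrid` with the same grid and the same number of folds:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
}
n_splits = 5  # matches KFold(n_splits=5) used for cross-validation

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` also trains the best configuration once more on the full training set, which is not counted in the 135 fits.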


## Final Model Training and Prediction




In [12]:
print("Training final model with optimal hyperparameters and making predictions...")

# Get the best model from GridSearchCV
best_model_final = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model_final.predict(X_test)

# Calculate evaluation metrics
mae = metrics.mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
r2 = metrics.r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f"\nFinal Model Performance on Test Set:")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R-squared (R2): {r2:.4f}")

Training final model with optimal hyperparameters and making predictions...

Final Model Performance on Test Set:
Mean Absolute Error (MAE): 1131551.5230
Root Mean Squared Error (RMSE): 1597198.6233
R-squared (R2): 0.1909
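
For reference, the three metrics reported above can be reproduced from their definitions. The sketch below uses tiny made-up arrays (not the notebook's data) and checks the hand-computed values against `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Illustrative values only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

# Cross-check against scikit-learn's implementations
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, metrics.r2_score(y_true, y_hat))

print(mae, rmse, r2)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE. The gap between the two on the test set above (roughly 1.60 million versus 1.13 million) suggests that a number of predictions are off by a large margin.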
