## Contents

- [Data import](#data-import)
- [Preprocessing](#preprocessing)
- [Training](#training)
- [Model analysis](#model-analysis)
- [Logging the model to MLflow](#logging-the-model-to-mlflow)

In [40]:
import time
import mlflow
import pandas as pd
from evaluation.evaluator import Evaluator

SEED = 22  # reproducibility

# =====================================
MODEL_NAME = "XG BOOST MODEL"  # Fill in
# =====================================

## Data import

In [41]:
df_train = pd.read_parquet("../data/train.parquet")
df_test = pd.read_parquet("../data/test.parquet")

# NOTE: icao, callsign and timestamp are still present at this point in case debugging is needed
X_train, y_train = df_train.drop(columns="takeoff_time"), df_train["takeoff_time"]
X_test, y_test = df_test.drop(columns="takeoff_time"), df_test["takeoff_time"]

In [42]:
X_train.shape, X_test.shape

((123733, 60), (27791, 60))

In [43]:
X_train.columns


Index(['timestamp', 'icao', 'callsign', 'holding_point', 'runway', 'operator',
       'turbulence_category', 'last_min_takeoffs', 'last_min_landings',
       'last_event_turb_cat', 'time_since_last_event_seconds',
       'time_before_holding_point', 'time_at_holding_point', 'hour', 'weekday',
       'is_holiday', 'Z1', 'KA6', 'KA8', 'K3', 'K2', 'K1', 'Y1', 'Y2', 'Y3',
       'Y7', 'Z6', 'Z4', 'Z2', 'Z3', 'LF', 'L1', 'LA', 'LB', 'LC', 'LD', 'LE',
       '36R_18L', '32R_14L', '36L_18R', '32L_14R', 'temperature_2m (°C)',
       'relative_humidity_2m (%)', 'dew_point_2m (°C)', 'precipitation (mm)',
       'snowfall (cm)', 'weather_code (wmo code)', 'surface_pressure (hPa)',
       'cloud_cover (%)', 'cloud_cover_low (%)', 'cloud_cover_mid (%)',
       'cloud_cover_high (%)', 'is_day ()', 'wind_speed_10m (km/h)',
       'wind_direction_10m (°)', 'wind_direction_100m (°)',
       'soil_moisture_0_to_7cm (m³/m³)', 'soil_temperature_100_to_255cm (°C)',
       'soil_moisture_100_to_255cm (m³/m³

In [44]:
X_train.head(4)

| | timestamp | icao | callsign | holding_point | runway | operator | turbulence_category | last_min_takeoffs | last_min_landings | last_event_turb_cat | ... | cloud_cover_mid (%) | cloud_cover_high (%) | is_day () | wind_speed_10m (km/h) | wind_direction_10m (°) | wind_direction_100m (°) | soil_moisture_0_to_7cm (m³/m³) | soil_temperature_100_to_255cm (°C) | soil_moisture_100_to_255cm (m³/m³) | et0_fao_evapotranspiration (mm) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-11-07 05:02:26.219 | 4CAC23 | RYR99AM_ | Z4 | 36L/18R | RYR | Light | 0 | 1 | Medium 1 | ... | 0 | 0 | 0 | 6.0 | 3 | 41 | 0.316 | 20.4 | 0.164 | 0.0 |
| 1 | 2024-11-07 05:02:26.721 | 4CAC23 | RYR99AM_ | Z4 | 36L/18R | RYR | Light | 0 | 1 | Medium 1 | ... | 0 | 0 | 0 | 6.0 | 3 | 41 | 0.316 | 20.4 | 0.164 | 0.0 |
| 2 | 2024-11-07 05:02:34.900 | 4CAC23 | RYR99AM_ | Z4 | 36L/18R | RYR | Light | 0 | 1 | Medium 1 | ... | 0 | 0 | 0 | 6.0 | 3 | 41 | 0.316 | 20.4 | 0.164 | 0.0 |
| 3 | 2024-11-07 05:02:35.399 | 4CAC23 | RYR99AM_ | Z4 | 36L/18R | RYR | Light | 0 | 1 | Medium 1 | ... | 0 | 0 | 0 | 6.0 | 3 | 41 | 0.316 | 20.4 | 0.164 | 0.0 |


In [45]:
X_train['timestamp']

0        2024-11-07 05:02:26.219
1        2024-11-07 05:02:26.721
2        2024-11-07 05:02:34.900
3        2024-11-07 05:02:35.399
4        2024-11-07 05:02:35.706
                   ...          
124159   2025-01-14 19:59:59.993
124160   2025-01-14 22:29:01.359
124161   2025-01-14 22:29:06.417
124162   2025-01-14 22:29:11.578
124163   2025-01-14 22:29:16.437
Name: timestamp, Length: 123733, dtype: datetime64[ns]
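
The training section later relies on TimeSeriesSplit, which assumes the rows are already in chronological order. A minimal sketch of how that assumption could be checked, using a small hypothetical frame (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df_train; values are illustrative only.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-11-07 05:02:34.900",
        "2024-11-07 05:02:26.219",
        "2024-11-07 05:02:35.399",
    ]),
    "time_at_holding_point": [8.7, 0.0, 9.2],
})

# True only if every timestamp is >= the previous one.
was_sorted = toy["timestamp"].is_monotonic_increasing

# A stable sort keeps the relative order of rows that share a timestamp.
toy_sorted = toy.sort_values("timestamp", kind="stable").reset_index(drop=True)
is_sorted = toy_sorted["timestamp"].is_monotonic_increasing

print(was_sorted, is_sorted)
```

If such a check failed on the real data, the sort would need to be applied to the full DataFrame before separating features and target, so that both stay aligned.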

## Preprocessing

In [46]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error

Automatic preprocessing of the data:

- cat_cols selects the categorical columns of X_train (dtype object).
- num_cols selects the numeric columns of X_train (dtype int64 or float64).

A ColumnTransformer named preprocessor is then created that:

- applies OneHotEncoder to the categorical columns (cat_cols), turning each category into a binary (0 or 1) column;
- passes the numeric columns (num_cols) through unchanged (passthrough).

Columns matched by neither selector, such as the datetime64 timestamp, are dropped, because ColumnTransformer defaults to remainder='drop'.

Tree-based models such as Random Forest, and the XGBoost regressor trained below, do not need aggressive preprocessing because they:

- do not require the numeric variables to be scaled (it makes no difference whether a variable ranges over 0-1 or 0-10000);
- accept one-hot-encoded categorical variables (they do not work directly with strings, hence the OneHotEncoder);
- are fairly robust to outliers and unusual distributions in the features.

In [47]:
# Drop the identifier columns from X_train and X_test
cols_to_drop = ['icao', 'callsign']
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Build the preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', 'passthrough', num_cols)
])
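
To make the behaviour of this preprocessor concrete, here is a small sketch on toy data. The column names echo the notebook, but the values and the toy_train / toy_new frames are hypothetical. It illustrates three things: the datetime column is selected by neither list and is dropped, each category becomes one binary column, and a category unseen during fit is encoded as all zeros because of handle_unknown='ignore'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames with hypothetical values; only the column names echo the notebook.
toy_train = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-11-07 05:02:26", "2024-11-07 05:02:34", "2024-11-07 05:03:10"]
    ),
    "runway": pd.Series(["36L/18R", "36R/18L", "36L/18R"], dtype="object"),
    "last_min_takeoffs": [0, 1, 2],
    "time_at_holding_point": [12.5, 40.0, 3.2],
})
toy_new = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-11-08 06:00:00"]),
    "runway": pd.Series(["32L/14R"], dtype="object"),  # category never seen during fit
    "last_min_takeoffs": [1],
    "time_at_holding_point": [20.0],
})

cat_cols = toy_train.select_dtypes(include=["object"]).columns.tolist()
num_cols = toy_train.select_dtypes(include=["int64", "float64"]).columns.tolist()

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", "passthrough", num_cols),
])


def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; convert it for display."""
    return matrix.toarray() if hasattr(matrix, "toarray") else matrix


train_out = to_dense(preprocessor.fit_transform(toy_train))
new_out = to_dense(preprocessor.transform(toy_new))

print(cat_cols, num_cols)  # timestamp appears in neither list
print(train_out.shape)     # 2 one-hot columns + 2 numeric columns
print(new_out)             # unseen runway: both one-hot columns are zero
```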

## Training

In [48]:
#!pip install xgboost
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV


In [49]:
from xgboost import XGBRegressor
tscv = TimeSeriesSplit(n_splits=5)

xgb = XGBRegressor(random_state=SEED, verbosity=0)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', xgb)
])

param_grid = {
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [3, 6, 10],
    'regressor__learning_rate': [0.01, 0.1, 0.3],
    'regressor__subsample': [0.7, 1.0],
    'regressor__colsample_bytree': [0.7, 1.0],
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=tscv,   
    scoring='neg_mean_absolute_error',
    verbose=3,
    n_jobs=-1
)
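
As a sanity check on the search size and on how TimeSeriesSplit partitions the rows, the following sketch counts the grid candidates and prints the folds for 12 toy rows (the row count is an arbitrary choice for illustration). Each fold trains on an expanding window of past rows and validates on the block that follows, so no validation row precedes a training row.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

# Same grid as in the notebook.
param_grid = {
    "regressor__n_estimators": [100, 200],
    "regressor__max_depth": [3, 6, 10],
    "regressor__learning_rate": [0.01, 0.1, 0.3],
    "regressor__subsample": [0.7, 1.0],
    "regressor__colsample_bytree": [0.7, 1.0],
}
n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 3 * 2 * 2

tscv = TimeSeriesSplit(n_splits=5)
n_fits = n_candidates * tscv.get_n_splits()

# 12 toy rows, assumed to be in chronological order.
rows = np.arange(12).reshape(-1, 1)
folds = [(train.tolist(), val.tolist()) for train, val in tscv.split(rows)]

print(n_candidates, n_fits)
for train, val in folds:
    print("train:", train, "validation:", val)
```

The candidate and fit counts agree with the "72 candidates, totalling 360 fits" line that GridSearchCV reports when fitting.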


In [50]:
start_time = time.time()

# ========================================
# TRAINING HERE
grid_search.fit(X_train, y_train)
# ========================================

end_time = time.time()
execution_time = end_time - start_time

Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV 1/5] END regressor__colsample_bytree=0.7, regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100, regressor__subsample=1.0;, score=-67.041 total time=   0.4s
[CV 1/5] END regressor__colsample_bytree=0.7, regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100, regressor__subsample=0.7;, score=-66.822 total time=   0.4s
[CV 2/5] END regressor__colsample_bytree=0.7, regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100, regressor__subsample=1.0;, score=-76.519 total time=   0.7s
[CV 3/5] END regressor__colsample_bytree=0.7, regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100, regressor__subsample=1.0;, score=-75.776 total time=   0.7s
[CV 2/5] END regressor__colsample_bytree=0.7, regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100, regressor__subsample=0.7;, score=-77.522 total t
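
The verbose log is hard to read at a glance; the fitted search object exposes the same information in structured form. The sketch below is self-contained and therefore uses synthetic data and a DecisionTreeRegressor as a stand-in for the XGBoost pipeline (both are assumptions made for illustration). The attributes best_params_, best_score_ and cv_results_ are the same ones available on grid_search.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a stand-in for X_train / y_train.
rng = np.random.default_rng(22)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + rng.normal(size=200)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=22),  # stand-in for the pipeline
    param_grid={"max_depth": [2, 4, 8]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# The scorer is negated MAE, so flip the sign to read it as an error.
best_mae = -search.best_score_

results = (
    pd.DataFrame(search.cv_results_)[
        ["params", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)

print(search.best_params_, round(best_mae, 3))
print(results)
```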

## Model analysis

In [51]:
# ===============================================================
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Same TimeSeriesSplit configuration as the one passed to GridSearchCV
tscv = TimeSeriesSplit(n_splits=5)

# Keep only the LAST split (largest training window, most recent validation window)
train_idx, val_idx = list(tscv.split(X_train))[-1]

# Slice the training and validation parts of the last split
X_train_tscv, X_val_tscv = X_train.iloc[train_idx], X_train.iloc[val_idx]
y_train_tscv, y_val_tscv = y_train.iloc[train_idx], y_train.iloc[val_idx]

# Predict with the best model found by the grid search.
# Caveat: with the default refit=True, best_estimator_ was refit on ALL of X_train,
# so it has already seen these validation rows; mae_val and rmse_val are therefore
# in-sample figures rather than a true out-of-sample estimate.
y_train_pred = grid_search.best_estimator_.predict(X_train_tscv)
y_val_pred = grid_search.best_estimator_.predict(X_val_tscv)

# Compute the metrics
mae_train = mean_absolute_error(y_train_tscv, y_train_pred)
rmse_train = np.sqrt(mean_squared_error(y_train_tscv, y_train_pred))

mae_val = mean_absolute_error(y_val_tscv, y_val_pred)
rmse_val = np.sqrt(mean_squared_error(y_val_tscv, y_val_pred))

print(f"MAE Train: {mae_train:.4f}")
print(f"RMSE Train: {rmse_train:.4f}")
print(f"MAE Validation: {mae_val:.4f}")
print(f"RMSE Validation: {rmse_val:.4f}")

# ===============================================================

MAE Train: 70.7087
RMSE Train: 96.9675
MAE Validation: 83.6664
RMSE Validation: 125.0202
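
Because GridSearchCV refits the best estimator on the whole training set by default, validation rows scored with best_estimator_ have already been seen during fitting. One way to obtain an out-of-sample figure is to clone the estimator and fit it on the training part of the last split only. The sketch below shows the idea on synthetic data with a DecisionTreeRegressor standing in for the tuned pipeline (both are assumptions for illustration, not the notebook's model).

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a stand-in for X_train / y_train.
rng = np.random.default_rng(22)
X = rng.normal(size=(300, 4))
y = 10.0 * X[:, 0] + rng.normal(scale=5.0, size=300)

# Plays the role of grid_search.best_estimator_: fitted on ALL rows.
model_full = DecisionTreeRegressor(random_state=22).fit(X, y)

# Last split of the same TimeSeriesSplit configuration.
train_idx, val_idx = list(TimeSeriesSplit(n_splits=5).split(X))[-1]

# Validation rows were part of the fit, so this error is optimistic.
mae_seen = mean_absolute_error(y[val_idx], model_full.predict(X[val_idx]))

# Unfitted copy with the same hyperparameters, trained on the past only.
model_fold = clone(model_full).fit(X[train_idx], y[train_idx])
mae_unseen = mean_absolute_error(y[val_idx], model_fold.predict(X[val_idx]))

print(f"MAE on rows seen during fit: {mae_seen:.4f}")
print(f"MAE on held-out rows:        {mae_unseen:.4f}")
```

In the notebook, the same pattern would apply clone to grid_search.best_estimator_ and fit it on X_train_tscv and y_train_tscv; the cross-validated score in grid_search.best_score_ is another out-of-sample reference.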


In [52]:
# ===============================================================
# Generate predictions on the test set
y_pred = grid_search.best_estimator_.predict(X_test)
# ===============================================================

df_test['prediction'] = y_pred

In [53]:
# Note: df_test must contain the 'prediction' column
ev = Evaluator(df_test, MODEL_NAME, mae_val, rmse_val)
report = ev.getReport()
ev.visualEvaluation()

### Feature influence

In [54]:
print(grid_search.best_estimator_)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['holding_point', 'runway',
                                                   'operator',
                                                   'turbulence_category',
                                                   'last_event_turb_cat',
                                                   'weekday']),
                                                 ('num', 'passthrough',
                                                  ['last_min_takeoffs',
                                                   'last_min_landings',
                                                   'time_since_last_event_seconds',
                                                   'time_before_holding_point',
                                                   'time_at_holding_point',
           

## Logging the model to MLflow

In [55]:
import mlflow
import mlflow.sklearn

# Set the tracking URI and the experiment name
mlflow.set_tracking_uri("file:./mlflow_yu")
mlflow.set_experiment("takeoff_time_prediction")

# Start a new MLflow run
with mlflow.start_run():

    # - General information -
    mlflow.set_tag("model_type", "XGBoost")  
    mlflow.set_tag("framework", "scikit-learn + xgboost")  
    mlflow.set_tag("target_variable", "takeoff_time")  
    mlflow.set_tag("preprocessing", "OneHotEncoder + passthrough")
    mlflow.set_tag("dataset", "original")  
    mlflow.set_tag("seed", SEED)  
    
    # - Best hyperparameters -
    mlflow.log_param("model", "XGBoost")
    mlflow.log_param("n_estimators", grid_search.best_params_['regressor__n_estimators'])
    mlflow.log_param("max_depth", grid_search.best_params_['regressor__max_depth'])
    mlflow.log_param("learning_rate", grid_search.best_params_['regressor__learning_rate'])
    mlflow.log_param("subsample", grid_search.best_params_['regressor__subsample'])
    mlflow.log_param("colsample_bytree", grid_search.best_params_['regressor__colsample_bytree'])

    # =====================================
    # ADD MORE HYPERPARAMETERS HERE if you tune any others (e.g. gamma)
    # =====================================

    # - Metrics -
    # execution_time was measured around grid_search.fit in the training cell;
    # recomputing it here would also count the analysis cells run since then.
    mlflow.log_metric("execution_time_s", execution_time)

    # Validation metrics
    mlflow.log_metric("mae_val", mae_val)
    mlflow.log_metric("rmse_val", rmse_val)

    # Training metrics
    mlflow.log_metric("mae_train", mae_train)
    mlflow.log_metric("rmse_train", rmse_train)

    # Log global test metrics
    for metric_name, value in report["global"].items():
        mlflow.log_metric(f"{metric_name}_test", value)

    # Log test metrics per runway
    for runway, metrics in report["by_runway"].items():
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{metric_name}_test_runway_{runway}", value)

    # Log test metrics per holding point
    for hp, metrics in report["by_holding_point"].items():
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{metric_name}_test_hp_{hp}", value)

    # - Model -
    # Log the fitted pipeline as a model artifact
    mlflow.sklearn.log_model(grid_search.best_estimator_, "model")
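
Rather than logging each tuned hyperparameter by hand, the whole best_params_ dictionary could be logged in one call with mlflow.log_params. The helper below, strip_step_prefix, is a hypothetical name introduced here; it only removes the pipeline step prefix so the logged keys read as plain XGBoost parameter names. The example dictionary has the shape of grid_search.best_params_ but its values are illustrative, not the notebook's actual result.

```python
def strip_step_prefix(params, prefix="regressor__"):
    """Return a copy of `params` with the pipeline step prefix removed from each key."""
    cleaned = {}
    for key, value in params.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Illustrative values in the shape of grid_search.best_params_ (not real results).
example_best_params = {
    "regressor__colsample_bytree": 0.7,
    "regressor__learning_rate": 0.1,
    "regressor__max_depth": 6,
    "regressor__n_estimators": 200,
    "regressor__subsample": 1.0,
}

clean_params = strip_step_prefix(example_best_params)
print(clean_params)

# Inside an active MLflow run, the dictionary can be logged in a single call:
#     mlflow.log_params(clean_params)
```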


In [57]:
!mlflow ui --backend-store-uri ./mlflow_yu

[2025-04-27 14:31:18 +0200] [69582] [INFO] Starting gunicorn 21.2.0
[2025-04-27 14:31:18 +0200] [69582] [INFO] Listening at: http://127.0.0.1:5000 (69582)
[2025-04-27 14:31:18 +0200] [69582] [INFO] Using worker: sync
[2025-04-27 14:31:18 +0200] [69583] [INFO] Booting worker with pid: 69583
[2025-04-27 14:31:18 +0200] [69584] [INFO] Booting worker with pid: 69584
[2025-04-27 14:31:18 +0200] [69585] [INFO] Booting worker with pid: 69585
[2025-04-27 14:31:18 +0200] [69586] [INFO] Booting worker with pid: 69586
