# MODELING

## HEURISTIC


<hr>

<code> **Data Project II** </code>

## Contents

- [Data import](#data-import)
- [Preprocessing](#preprocessing)
- [Training](#training)
- [Model analysis](#model-analysis)
- [Model registration in MLflow](#model-registration-in-mlflow)


In [3]:
import time
import mlflow
import pandas as pd
from evaluation.evaluator import Evaluator

SEED = 22 # reproducibility

# =====================================
MODEL_NAME = "Heurística I"
# =====================================

## Data import

In [5]:
df_train = pd.read_parquet("../data/train.parquet")
df_test = pd.read_parquet("../data/test.parquet")

# NOTE: ICAO, Callsign and Timestamp are kept in the features in case debugging is needed
# `columns=` already selects the axis, so the redundant `axis=1` is dropped
X_train, y_train = df_train.drop(columns="takeoff_time"), df_train["takeoff_time"]
X_test, y_test = df_test.drop(columns="takeoff_time"), df_test["takeoff_time"]

In [6]:
X_train.shape, X_test.shape

((123733, 60), (27791, 60))

## Preprocessing

The heuristic used in this notebook requires no preprocessing.

## Training

We define the heuristic:

In [11]:
import numpy as np

def h(row):
    
    # Base time (in seconds)
    base_time = 100
    
    # - Adjustment for recent traffic -
    traffic = row['last_min_takeoffs'] + row['last_min_landings']
    if traffic > 5:
        base_time += 40
    elif traffic > 3:
        base_time += 20
    else:
        base_time += 5
    
    # - Adjustment for wake turbulence category -
    if row['last_event_turb_cat'].startswith('H'):  # Heavy
        base_time += 30
    elif row['last_event_turb_cat'].startswith('M'):  # Medium
        base_time += 15
    
    # - Adjustment for time since the last event -
    if row['time_since_last_event_seconds'] < 60:
        base_time += (60 - row['time_since_last_event_seconds']) # the more recent the event, the longer the wait
    
    # - Adjustment for peak hours -
    if 7 <= row['hour'] <= 10 or 17 <= row['hour'] <= 20:  # Morning and evening
        base_time += 100
        
    # - Adjustment for holidays -
    if row['is_holiday']:
        base_time -= 30  # less traffic
    
    # Never predict less than 30 seconds
    return max(base_time, 30)
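
The function above reads six columns and calls `.startswith` on `last_event_turb_cat`, so a missing value in that column would raise an `AttributeError`. Since no preprocessing is applied, a quick input check before running the heuristic may be useful. The following is a minimal sketch: `HEURISTIC_COLUMNS` and `check_heuristic_inputs` are hypothetical names introduced here, and whether the dataset actually contains nulls is not stated in this notebook.

```python
import pandas as pd

# Columns read by the heuristic h (taken from the function body above).
HEURISTIC_COLUMNS = [
    "last_min_takeoffs",
    "last_min_landings",
    "last_event_turb_cat",
    "time_since_last_event_seconds",
    "hour",
    "is_holiday",
]


def check_heuristic_inputs(df: pd.DataFrame) -> dict:
    """Report which required columns are absent and which contain nulls.

    Hypothetical helper, not part of the original project.
    """
    missing = [c for c in HEURISTIC_COLUMNS if c not in df.columns]
    present = [c for c in HEURISTIC_COLUMNS if c in df.columns]
    null_counts = df[present].isna().sum()
    return {
        "missing_columns": missing,
        # Keep only the columns that actually have nulls.
        "null_counts": {c: int(n) for c, n in null_counts.items() if n > 0},
    }
```

Calling `check_heuristic_inputs(df_train)` and `check_heuristic_inputs(df_test)` before `apply` would show whether any row could break the heuristic.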

In [12]:
start_time = time.time()

df_train['prediction'] = df_train.apply(h, axis=1)

end_time = time.time()
execution_time = end_time - start_time
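
`DataFrame.apply(h, axis=1)` calls the Python function once per row, which is typically slow on a table of this size. The same rules can be written over whole columns. The sketch below is one possible vectorized equivalent, not the version used in this notebook: `h_vectorized` is a hypothetical name, it has not been benchmarked here, and it treats a missing turbulence category as "no adjustment" instead of raising.

```python
import numpy as np
import pandas as pd


def h_vectorized(df: pd.DataFrame) -> pd.Series:
    """Column-wise sketch of the row-wise heuristic h (same rules)."""
    # Recent traffic: +40 if > 5, +20 if > 3, otherwise +5.
    traffic = df["last_min_takeoffs"] + df["last_min_landings"]
    traffic_adj = np.select([traffic > 5, traffic > 3], [40, 20], default=5)

    # Turbulence category: +30 for Heavy, +15 for Medium.
    turb = df["last_event_turb_cat"].astype(str)
    turb_adj = np.select(
        [turb.str.startswith("H"), turb.str.startswith("M")], [30, 15], default=0
    )

    # Time since last event: add the shortfall below 60 seconds.
    gap = df["time_since_last_event_seconds"]
    gap_adj = np.where(gap < 60, 60 - gap, 0)

    # Peak hours (07-10 and 17-20, inclusive): +100.
    hour = df["hour"]
    peak_adj = np.where(hour.between(7, 10) | hour.between(17, 20), 100, 0)

    # Holidays: -30.
    holiday_adj = np.where(df["is_holiday"].astype(bool), -30, 0)

    total = 100 + traffic_adj + turb_adj + gap_adj + peak_adj + holiday_adj
    return pd.Series(np.maximum(total, 30), index=df.index, name="prediction")


# Made-up rows, for illustration only.
example = pd.DataFrame({
    "last_min_takeoffs": [4, 2, 0],
    "last_min_landings": [3, 2, 1],
    "last_event_turb_cat": ["Heavy", "Medium", "Light"],
    "time_since_last_event_seconds": [20, 120, 60],
    "hour": [8, 13, 17],
    "is_holiday": [False, True, False],
})
print(h_vectorized(example).tolist())  # [310, 105, 205]
```

The first row, for instance, is 100 (base) + 40 (traffic 7) + 30 (Heavy) + 40 (60 - 20) + 100 (peak hour) = 310.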

## Model analysis

In [14]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# ===============================================================
y_true = df_train['takeoff_time']
y_pred = df_train['prediction']

mae_train = mean_absolute_error(y_true, y_pred)
rmse_train = np.sqrt(mean_squared_error(y_true, y_pred))

mae_val = None
rmse_val = None
# ===============================================================
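
For reference, the two metrics computed above can be written out by hand. This is a small sketch with made-up numbers, assuming the target is expressed in seconds (as the heuristic's base time suggests); `mae` and `rmse` are illustrative helpers, not functions from the project.

```python
import numpy as np


def mae(y_true, y_pred) -> float:
    """Mean absolute error: average size of the error, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: like MAE, but large errors weigh more."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, for illustration only.
y_true_demo = [100, 200, 300]
y_pred_demo = [110, 190, 330]
print(round(mae(y_true_demo, y_pred_demo), 2))   # 16.67
print(round(rmse(y_true_demo, y_pred_demo), 2))  # 19.15
```

Because the heuristic has no fitted parameters, the "train" metrics here are simply an evaluation on a second sample; a large gap between RMSE and MAE would point to a few very large errors.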

In [15]:
# ===============================================================
# Generate predictions on the test set
df_test['prediction'] = df_test.apply(h, axis=1)
# ===============================================================

In [16]:
# Note: df_test must contain the 'prediction' column
ev = Evaluator(df_test, MODEL_NAME)
report = ev.getReport()
ev.visualEvaluation()

## Model registration in MLflow

In [18]:
mlflow.set_tracking_uri("file:./mlflow_experiments")
mlflow.set_experiment("takeoff_time_prediction")

with mlflow.start_run():

    # - General information -

    # ========================================================================
    mlflow.set_tag("model_type", MODEL_NAME)
    mlflow.set_tag("framework", "pandas") # scikit-learn, tensorflow, etc.
    mlflow.set_tag("target_variable", "takeoff_time") # response variable
    mlflow.set_tag("preprocessing", "none") # transformations, separated by "+"
    mlflow.set_tag("dataset", "original") # state whether the dataset has been modified
    mlflow.set_tag("seed", SEED) # seed for reproducibility
    # ========================================================================
    
    
    # - Metrics -

    mlflow.log_metric("execution_time_s", execution_time)

    mlflow.log_metric("mae_train", mae_train)
    mlflow.log_metric("rmse_train", rmse_train)

    # Log global metrics on the test set
    for metric_name, value in report["global"].items():
        mlflow.log_metric(f"{metric_name}_test", value)
    
    # Log metrics per runway
    for runway, metrics in report["by_runway"].items():
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{metric_name}_test_runway_{runway}", value)
    
    # Log metrics per holding point
    for hp, metrics in report["by_holding_point"].items():
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{metric_name}_test_hp_{hp}", value)

    # - Model -

    import mlflow.pyfunc

    class HeuristicModel(mlflow.pyfunc.PythonModel):
        
        def predict(self, context, model_input):
            # model_input will be a DataFrame
            return model_input.apply(h, axis=1)
    
    model = HeuristicModel()
    mlflow.pyfunc.log_model(
        artifact_path=MODEL_NAME,
        python_model=model
    )
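
The three logging loops above follow one naming pattern, so the report could also be flattened into a single dictionary first and passed to `mlflow.log_metrics` in one call. The sketch below assumes the report has the shape implied by those loops (`"global"`, `"by_runway"`, `"by_holding_point"`); `flatten_report` is a hypothetical helper and the example values are invented.

```python
def flatten_report(report: dict) -> dict:
    """Flatten the nested evaluation report into {metric_name: value}.

    The names mirror the ones used in the logging loops above.
    """
    flat = {f"{name}_test": value for name, value in report["global"].items()}
    groups = (("by_runway", "runway"), ("by_holding_point", "hp"))
    for group_key, tag in groups:
        for group, metrics in report.get(group_key, {}).items():
            for name, value in metrics.items():
                flat[f"{name}_test_{tag}_{group}"] = value
    return flat


# Invented report, for illustration only.
demo_report = {
    "global": {"mae": 30.0, "rmse": 45.0},
    "by_runway": {"36L": {"mae": 28.0}},
    "by_holding_point": {"Z1": {"mae": 33.0}},
}
print(flatten_report(demo_report))
```

Inside the run, `mlflow.log_metrics(flatten_report(report))` would then replace the three loops, and the flat dictionary is also easy to inspect or save alongside the run.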
    


Add type hints to the `predict` method to enable data validation and automatic signature inference during model logging. Check https://mlflow.org/docs/latest/model/python_model.html#type-hint-usage-in-pythonmodel for more details.



In [None]:
!mlflow ui --backend-store-uri ./mlflow_experiments

[2025-04-26 14:52:59 +0200] [77373] [INFO] Starting gunicorn 23.0.0
[2025-04-26 14:52:59 +0200] [77373] [INFO] Listening at: http://127.0.0.1:5000 (77373)
[2025-04-26 14:52:59 +0200] [77373] [INFO] Using worker: sync
[2025-04-26 14:52:59 +0200] [77374] [INFO] Booting worker with pid: 77374
[2025-04-26 14:53:00 +0200] [77375] [INFO] Booting worker with pid: 77375
[2025-04-26 14:53:00 +0200] [77376] [INFO] Booting worker with pid: 77376
[2025-04-26 14:53:00 +0200] [77384] [INFO] Booting worker with pid: 77384
