# BIKE SHARING DEMAND - MODELING

This notebook trains and evaluates baseline models for predicting bike-sharing demand.

## Objectives:
1. Train 3 baseline models (linear regression with Ridge regularization, Random Forest, XGBoost)
2. Log experiments with MLflow
3. Evaluate against the target metrics (MAE < 50, RMSE < 80, R² > 0.7, MAPE < 25%)
4. Analyze feature importance
5. Select the best model for optimization

---

**Prerequisites:**
- Normalized datasets in `data/processed/`
- Scaler saved at `models/scaler.pkl`
- Previous notebook executed with complete feature engineering
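
The prerequisites above can be checked before running anything else. The following is a minimal sketch, assuming the file names used later in the data-loading cell and in the list above; `REQUIRED_FILES` and `missing_prerequisites` are hypothetical names introduced for illustration, not part of the project.

```python
from pathlib import Path

# File names taken from the loading cell and the prerequisites list;
# the constant and the helper are hypothetical.
REQUIRED_FILES = [
    "data/processed/bike_sharing_features_train_normalized.csv",
    "data/processed/bike_sharing_features_validation_normalized.csv",
    "data/processed/bike_sharing_features_test_normalized.csv",
    "models/scaler.pkl",
]


def missing_prerequisites(project_root):
    """Return the required files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

Calling `missing_prerequisites(Path.cwd().parent)` should return an empty list when every prerequisite is in place.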

# 1. SETUP AND CONFIGURATION

We import libraries, configure MLflow, and define global variables.

## 1.1 Imports


In [None]:
# System and paths
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    mean_absolute_percentage_error
)

# MLflow
import mlflow
import mlflow.sklearn
import mlflow.xgboost

# Utilities
import joblib
from datetime import datetime
import json

# Plot configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ Libraries imported successfully")
print(f"Execution date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


## 1.2 Paths and Constants


In [None]:
# Directories
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data' / 'processed'
MODELS_DIR = PROJECT_ROOT / 'models'
MLFLOW_DIR = PROJECT_ROOT / 'mlruns'

# Create output directories if they do not exist
# (MODELS_DIR is written to later, so it must exist too)
MODELS_DIR.mkdir(parents=True, exist_ok=True)
MLFLOW_DIR.mkdir(parents=True, exist_ok=True)

# Target metrics (from the EDA analysis)
TARGET_METRICS = {
    'MAE': 50,      # Mean Absolute Error < 50
    'RMSE': 80,     # Root Mean Squared Error < 80
    'R2': 0.7,      # R² > 0.7
    'MAPE': 25      # Mean Absolute Percentage Error < 25%
}

# MLflow configuration
EXPERIMENT_NAME = "bike-sharing-demand-baseline"
# as_uri() builds a well-formed file:// URI from the absolute path
mlflow.set_tracking_uri(MLFLOW_DIR.as_uri())

print("="*70)
print("PROJECT CONFIGURATION")
print("="*70)
print(f"Project Root: {PROJECT_ROOT}")
print(f"Data Directory: {DATA_DIR}")
print(f"Models Directory: {MODELS_DIR}")
print(f"MLflow Tracking: {MLFLOW_DIR}")
print(f"\nTarget Metrics:")
for metric, target in TARGET_METRICS.items():
    print(f"  • {metric}: {'<' if metric != 'R2' else '>'} {target}")
print("="*70)


# 2. DATA LOADING

We load the normalized datasets generated in the previous notebook.


## 2.1 Load Normalized Datasets


In [None]:
# Load datasets
train_df = pd.read_csv(DATA_DIR / 'bike_sharing_features_train_normalized.csv')
val_df = pd.read_csv(DATA_DIR / 'bike_sharing_features_validation_normalized.csv')
test_df = pd.read_csv(DATA_DIR / 'bike_sharing_features_test_normalized.csv')

print("="*70)
print("DATASETS LOADED")
print("="*70)
print(f"Train: {train_df.shape}")
print(f"  Start date: {train_df['timestamp'].min()}")
print(f"  End date:   {train_df['timestamp'].max()}")
print(f"\nValidation: {val_df.shape}")
print(f"  Start date: {val_df['timestamp'].min()}")
print(f"  End date:   {val_df['timestamp'].max()}")
print(f"\nTest: {test_df.shape}")
print(f"  Start date: {test_df['timestamp'].min()}")
print(f"  End date:   {test_df['timestamp'].max()}")
print("="*70)

# Integrity checks
assert train_df.shape[1] == val_df.shape[1] == test_df.shape[1], "Datasets have different numbers of columns"
assert train_df.isnull().sum().sum() == 0, "Train has null values"
assert val_df.isnull().sum().sum() == 0, "Validation has null values"
assert test_df.isnull().sum().sum() == 0, "Test has null values"

print("\n✓ Integrity check completed")


## 2.2 Prepare Features and Target


In [None]:
# Columns to exclude (metadata and targets)
exclude_cols = ['timestamp', 'dteday', 'cnt', 'casual', 'registered']

# Features (every column except the excluded ones)
feature_cols = [col for col in train_df.columns if col not in exclude_cols]

# Split into X and y
X_train = train_df[feature_cols].values
y_train = train_df['cnt'].values

X_val = val_df[feature_cols].values
y_val = val_df['cnt'].values

X_test = test_df[feature_cols].values
y_test = test_df['cnt'].values

print("="*70)
print("FEATURES AND TARGET")
print("="*70)
print(f"Total features: {len(feature_cols)}")
print(f"\nFeatures included (first 10):")
for i, feat in enumerate(feature_cols[:10], 1):
    print(f"  {i:2d}. {feat}")
# Only report a remainder when there are more than 10 features
if len(feature_cols) > 10:
    print(f"  ... and {len(feature_cols) - 10} more")

print(f"\nTarget: cnt (total bike demand)")
print(f"\nShapes:")
print(f"  X_train: {X_train.shape}")
print(f"  y_train: {y_train.shape}")
print(f"  X_val:   {X_val.shape}")
print(f"  y_val:   {y_val.shape}")
print(f"  X_test:  {X_test.shape}")
print(f"  y_test:  {y_test.shape}")

print(f"\nTarget statistics (cnt):")
print(f"  Train - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}, Min: {y_train.min():.0f}, Max: {y_train.max():.0f}")
print(f"  Val   - Mean: {y_val.mean():.2f}, Std: {y_val.std():.2f}, Min: {y_val.min():.0f}, Max: {y_val.max():.0f}")
print(f"  Test  - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}, Min: {y_test.min():.0f}, Max: {y_test.max():.0f}")
print("="*70)


# 3. MLFLOW CONFIGURATION

We configure the MLflow experiment used to track the models.


In [None]:
# Get the experiment if it already exists; otherwise create it.
# (An explicit lookup avoids a bare `except` that would hide unrelated errors.)
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
if experiment is None:
    experiment_id = mlflow.create_experiment(
        EXPERIMENT_NAME,
        tags={
            "project": "mlops-team-61",
            "phase": "baseline-models",
            "dataset": "bike-sharing",
            "features": str(len(feature_cols))
        }
    )
    print(f"✓ Experiment created: {EXPERIMENT_NAME}")
else:
    experiment_id = experiment.experiment_id
    print(f"✓ Existing experiment: {EXPERIMENT_NAME}")

mlflow.set_experiment(EXPERIMENT_NAME)

print(f"  Experiment ID: {experiment_id}")
print(f"  Tracking URI: {mlflow.get_tracking_uri()}")
print(f"\n📊 To view the MLflow UI, run in a terminal:")
print(f"   mlflow ui --backend-store-uri {mlflow.get_tracking_uri()}")
print(f"   Then open: http://localhost:5000")


# 4. EVALUATION FUNCTIONS

We define reusable functions for evaluating models.
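
To make the four target metrics concrete, here is a minimal sketch that computes them with plain NumPy on toy values. `regression_metrics` is a hypothetical helper for illustration; the notebook itself relies on the scikit-learn functions imported earlier.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and MAPE (in percent) with plain NumPy."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # R² = 1 - SS_res / SS_tot
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # This plain formula breaks down if y_true contains zeros
    mape = np.mean(np.abs(errors / y_true)) * 100
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


# Toy values, not project data
metrics = regression_metrics([100, 200, 300], [110, 190, 330])
```

Note that scikit-learn's `mean_absolute_percentage_error` divides by `max(|y_true|, eps)`, so hours with zero demand yield very large MAPE values rather than an error; this is worth checking if `cnt` can be zero.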


In [None]:
def evaluate_model(y_true, y_pred, dataset_name="Validation"):
    """
    Evaluate a model with multiple metrics.

    Args:
        y_true: Actual values
        y_pred: Predicted values
        dataset_name: Dataset name (Train/Validation/Test); currently unused

    Returns:
        dict: Dictionary with the computed metrics
    """
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100

    # Additional metrics
    residuals = y_true - y_pred
    
    metrics = {
        'mae': mae,
        'mse': mse,
        'rmse': rmse,
        'r2': r2,
        'mape': mape,
        'residuals_mean': residuals.mean(),
        'residuals_std': residuals.std()
    }
    
    return metrics


def print_metrics(metrics, dataset_name="Validation", targets=TARGET_METRICS):
    """
    Print metrics in a readable format, compared against the targets.
    """
    print(f"\n{'='*70}")
    print(f"METRICS - {dataset_name.upper()}")
    print(f"{'='*70}")

    # MAE
    mae_status = "✓" if metrics['mae'] < targets['MAE'] else "✗"
    print(f"MAE:  {metrics['mae']:8.2f}  {mae_status}  (target: < {targets['MAE']})")

    # RMSE
    rmse_status = "✓" if metrics['rmse'] < targets['RMSE'] else "✗"
    print(f"RMSE: {metrics['rmse']:8.2f}  {rmse_status}  (target: < {targets['RMSE']})")

    # R²
    r2_status = "✓" if metrics['r2'] > targets['R2'] else "✗"
    print(f"R²:   {metrics['r2']:8.4f}  {r2_status}  (target: > {targets['R2']})")

    # MAPE
    mape_status = "✓" if metrics['mape'] < targets['MAPE'] else "✗"
    print(f"MAPE: {metrics['mape']:8.2f}% {mape_status}  (target: < {targets['MAPE']}%)")

    print(f"\nResiduals:")
    print(f"  Mean: {metrics['residuals_mean']:8.2f}  (should be ~0)")
    print(f"  Std:  {metrics['residuals_std']:8.2f}")
    print(f"{'='*70}")


def plot_predictions(y_true, y_pred, title="Predictions vs Actual", sample_size=500):
    """
    Plot predictions against actual values, plus the residual distribution.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Scatter plot (random sample of points)
    idx = np.random.choice(len(y_true), min(sample_size, len(y_true)), replace=False)
    axes[0].scatter(y_true[idx], y_pred[idx], alpha=0.5, s=20)
    axes[0].plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 
                 'r--', lw=2, label='Perfect Prediction')
    axes[0].set_xlabel('Actual Values')
    axes[0].set_ylabel('Predictions')
    axes[0].set_title(f'{title} - Scatter')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Residual distribution
    residuals = y_true - y_pred
    axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
    axes[1].axvline(0, color='red', linestyle='--', lw=2, label='Zero Error')
    axes[1].axvline(residuals.mean(), color='green', linestyle='--', lw=2, 
                    label=f'Mean: {residuals.mean():.2f}')
    axes[1].set_xlabel('Residuals (Actual - Prediction)')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Residual Distribution')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

print("✓ Evaluation functions defined")


# 5. MODEL 1: LINEAR REGRESSION (RIDGE)

A simple baseline model with Ridge regularization.
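
As a rough illustration of what the L2 penalty does, the sketch below fits Ridge in closed form with NumPy on synthetic data. `ridge_fit` is a hypothetical helper; it assumes an unpenalized intercept (handled by centering) and is not the solver scikit-learn uses internally.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form Ridge fit with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic data (not project data): y = 3*x0 - 2*x1 + 5 plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.1, size=200)

coef_small, _ = ridge_fit(X_demo, y_demo, alpha=1.0)
coef_large, _ = ridge_fit(X_demo, y_demo, alpha=1000.0)
# A larger alpha shrinks the coefficient vector toward zero
```

With `alpha=1.0`, as in the baseline below, the shrinkage is mild when there are many training rows; larger values trade bias for lower variance.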


## 5.1 Training


In [None]:
print("="*70)
print("MODEL 1: RIDGE REGRESSION")
print("="*70)

# Hyperparameters
ridge_params = {
    'alpha': 1.0,
    'random_state': 42
}

# MLflow Run
with mlflow.start_run(run_name="ridge_baseline") as run:
    
    # Log parameters
    mlflow.log_params(ridge_params)
    mlflow.log_param("model_type", "Ridge Regression")
    mlflow.log_param("n_features", len(feature_cols))
    
    # Train model
    print("\nTraining Ridge Regression...")
    ridge_model = Ridge(**ridge_params)
    ridge_model.fit(X_train, y_train)
    print("✓ Model trained")

    # Predictions
    y_train_pred_ridge = ridge_model.predict(X_train)
    y_val_pred_ridge = ridge_model.predict(X_val)
    y_test_pred_ridge = ridge_model.predict(X_test)

    # Evaluate
    train_metrics_ridge = evaluate_model(y_train, y_train_pred_ridge, "Train")
    val_metrics_ridge = evaluate_model(y_val, y_val_pred_ridge, "Validation")
    test_metrics_ridge = evaluate_model(y_test, y_test_pred_ridge, "Test")
    
    # Log metrics
    for prefix, metrics in [('train', train_metrics_ridge), 
                             ('val', val_metrics_ridge),
                             ('test', test_metrics_ridge)]:
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{prefix}_{metric_name}", value)
    
    # Log model
    mlflow.sklearn.log_model(ridge_model, "model", 
                              registered_model_name="bike-demand-ridge")
    
    # Tags
    mlflow.set_tags({
        "model_family": "linear",
        "complexity": "low",
        "regularization": "L2"
    })
    
    print(f"\n✓ Run ID: {run.info.run_id}")

# Show results
print_metrics(train_metrics_ridge, "Train")
print_metrics(val_metrics_ridge, "Validation")
print_metrics(test_metrics_ridge, "Test")


## 5.2 Visualization


In [None]:
plot_predictions(y_val, y_val_pred_ridge, "Ridge Regression - Validation")


# 6. MODEL 2: RANDOM FOREST

An ensemble model based on decision trees.
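
The sketch below illustrates two properties of scikit-learn's `RandomForestRegressor` that the following cells rely on: the forest prediction is the average of its trees, and the impurity-based `feature_importances_` sum to 1. The data is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, not project data
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) * 50 + 10 * X_demo[:, 1] + rng.normal(scale=2, size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=42)
forest.fit(X_demo, y_demo)

# The forest prediction is the mean of the individual tree predictions
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_demo)

# Impurity-based importances are normalized to sum to 1
importances = forest.feature_importances_
```

Because the importances are relative shares, they are comparable within one model but not directly across models.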


## 6.1 Training


In [None]:
print("="*70)
print("MODEL 2: RANDOM FOREST")
print("="*70)

# Hyperparameters
rf_params = {
    'n_estimators': 100,
    'max_depth': 20,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'random_state': 42,
    'n_jobs': -1
}

# MLflow Run
with mlflow.start_run(run_name="random_forest_baseline") as run:
    
    # Log parameters
    mlflow.log_params(rf_params)
    mlflow.log_param("model_type", "Random Forest")
    mlflow.log_param("n_features", len(feature_cols))
    
    # Train model
    print("\nTraining Random Forest...")
    rf_model = RandomForestRegressor(**rf_params)
    rf_model.fit(X_train, y_train)
    print("✓ Model trained")

    # Predictions
    y_train_pred_rf = rf_model.predict(X_train)
    y_val_pred_rf = rf_model.predict(X_val)
    y_test_pred_rf = rf_model.predict(X_test)

    # Evaluate
    train_metrics_rf = evaluate_model(y_train, y_train_pred_rf, "Train")
    val_metrics_rf = evaluate_model(y_val, y_val_pred_rf, "Validation")
    test_metrics_rf = evaluate_model(y_test, y_test_pred_rf, "Test")
    
    # Log metrics
    for prefix, metrics in [('train', train_metrics_rf), 
                             ('val', val_metrics_rf),
                             ('test', test_metrics_rf)]:
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{prefix}_{metric_name}", value)
    
    # Feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_cols,
        'importance': rf_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    # Log feature importance as an artifact
    importance_path = MODELS_DIR / 'rf_feature_importance.csv'
    feature_importance.to_csv(importance_path, index=False)
    mlflow.log_artifact(str(importance_path))
    
    # Log model
    mlflow.sklearn.log_model(rf_model, "model", 
                              registered_model_name="bike-demand-rf")
    
    # Tags
    mlflow.set_tags({
        "model_family": "ensemble",
        "complexity": "medium",
        "base_learner": "decision_tree"
    })
    
    print(f"\n✓ Run ID: {run.info.run_id}")

# Show results
print_metrics(train_metrics_rf, "Train")
print_metrics(val_metrics_rf, "Validation")
print_metrics(test_metrics_rf, "Test")


## 6.2 Feature Importance


In [None]:
# Show top 20 features
print("="*70)
print("TOP 20 MOST IMPORTANT FEATURES - RANDOM FOREST")
print("="*70)
print(feature_importance.head(20).to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
top_features = feature_importance.head(20)
ax.barh(range(len(top_features)), top_features['importance'])
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'])
ax.invert_yaxis()
ax.set_xlabel('Importance')
ax.set_title('Top 20 Features - Random Forest')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()


## 6.3 Visualization


In [None]:
plot_predictions(y_val, y_val_pred_rf, "Random Forest - Validation")


# 7. MODEL 3: XGBOOST

A gradient boosting model (the primary model according to the ML Canvas).
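
The core idea of gradient boosting can be sketched in a few lines: each new tree is fit to the residuals of the current ensemble, and its output is added with a small learning rate. This is plain boosting for squared error built on scikit-learn trees, not XGBoost's regularized second-order algorithm; `boost` and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred  # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees, pred


# Synthetic data, not project data
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 20 * np.sin(X_demo[:, 0]) + 5 * X_demo[:, 1] + rng.normal(scale=1, size=400)

base, trees, pred = boost(X_demo, y_demo)
mse_start = np.mean((y_demo - base) ** 2)  # error of the constant prediction
mse_end = np.mean((y_demo - pred) ** 2)    # training error after boosting
```

The `n_estimators`, `learning_rate`, and `max_depth` hyperparameters in the cell below play the same roles as `n_rounds`, `learning_rate`, and `max_depth` here.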


## 7.1 Training


In [None]:
print("="*70)
print("MODEL 3: XGBOOST")
print("="*70)

# Hyperparameters
xgb_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 3,
    'gamma': 0,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'random_state': 42,
    'n_jobs': -1
}

# MLflow Run
with mlflow.start_run(run_name="xgboost_baseline") as run:
    
    # Log parameters
    mlflow.log_params(xgb_params)
    mlflow.log_param("model_type", "XGBoost")
    mlflow.log_param("n_features", len(feature_cols))
    
    # Train model
    print("\nTraining XGBoost...")
    xgb_model = XGBRegressor(**xgb_params)
    xgb_model.fit(
        X_train, y_train,
        eval_set=[(X_train, y_train), (X_val, y_val)],
        verbose=False
    )
    print("✓ Model trained")

    # Predictions
    y_train_pred_xgb = xgb_model.predict(X_train)
    y_val_pred_xgb = xgb_model.predict(X_val)
    y_test_pred_xgb = xgb_model.predict(X_test)

    # Evaluate
    train_metrics_xgb = evaluate_model(y_train, y_train_pred_xgb, "Train")
    val_metrics_xgb = evaluate_model(y_val, y_val_pred_xgb, "Validation")
    test_metrics_xgb = evaluate_model(y_test, y_test_pred_xgb, "Test")
    
    # Log metrics
    for prefix, metrics in [('train', train_metrics_xgb), 
                             ('val', val_metrics_xgb),
                             ('test', test_metrics_xgb)]:
        for metric_name, value in metrics.items():
            mlflow.log_metric(f"{prefix}_{metric_name}", value)
    
    # Feature importance
    feature_importance_xgb = pd.DataFrame({
        'feature': feature_cols,
        'importance': xgb_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    # Log feature importance
    importance_path_xgb = MODELS_DIR / 'xgb_feature_importance.csv'
    feature_importance_xgb.to_csv(importance_path_xgb, index=False)
    mlflow.log_artifact(str(importance_path_xgb))
    
    # Log model
    mlflow.xgboost.log_model(xgb_model, "model", 
                              registered_model_name="bike-demand-xgboost")
    
    # Tags
    mlflow.set_tags({
        "model_family": "boosting",
        "complexity": "medium",
        "algorithm": "gradient_boosting"
    })
    
    print(f"\n✓ Run ID: {run.info.run_id}")

# Show results
print_metrics(train_metrics_xgb, "Train")
print_metrics(val_metrics_xgb, "Validation")
print_metrics(test_metrics_xgb, "Test")


## 7.2 Feature Importance


In [None]:
# Show top 20 features
print("="*70)
print("TOP 20 MOST IMPORTANT FEATURES - XGBOOST")
print("="*70)
print(feature_importance_xgb.head(20).to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
top_features_xgb = feature_importance_xgb.head(20)
ax.barh(range(len(top_features_xgb)), top_features_xgb['importance'])
ax.set_yticks(range(len(top_features_xgb)))
ax.set_yticklabels(top_features_xgb['feature'])
ax.invert_yaxis()
ax.set_xlabel('Importance')
ax.set_title('Top 20 Features - XGBoost')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()


## 7.3 Visualization


In [None]:
plot_predictions(y_val, y_val_pred_xgb, "XGBoost - Validation")


# 8. MODEL COMPARISON

We compare the 3 baseline models and select the one with the lowest validation RMSE.


In [None]:
# Build comparison table
comparison_df = pd.DataFrame([
    {
        'Model': 'Ridge Regression',
        'Train_MAE': train_metrics_ridge['mae'],
        'Val_MAE': val_metrics_ridge['mae'],
        'Test_MAE': test_metrics_ridge['mae'],
        'Train_RMSE': train_metrics_ridge['rmse'],
        'Val_RMSE': val_metrics_ridge['rmse'],
        'Test_RMSE': test_metrics_ridge['rmse'],
        'Train_R2': train_metrics_ridge['r2'],
        'Val_R2': val_metrics_ridge['r2'],
        'Test_R2': test_metrics_ridge['r2'],
        'Val_MAPE': val_metrics_ridge['mape']
    },
    {
        'Model': 'Random Forest',
        'Train_MAE': train_metrics_rf['mae'],
        'Val_MAE': val_metrics_rf['mae'],
        'Test_MAE': test_metrics_rf['mae'],
        'Train_RMSE': train_metrics_rf['rmse'],
        'Val_RMSE': val_metrics_rf['rmse'],
        'Test_RMSE': test_metrics_rf['rmse'],
        'Train_R2': train_metrics_rf['r2'],
        'Val_R2': val_metrics_rf['r2'],
        'Test_R2': test_metrics_rf['r2'],
        'Val_MAPE': val_metrics_rf['mape']
    },
    {
        'Model': 'XGBoost',
        'Train_MAE': train_metrics_xgb['mae'],
        'Val_MAE': val_metrics_xgb['mae'],
        'Test_MAE': test_metrics_xgb['mae'],
        'Train_RMSE': train_metrics_xgb['rmse'],
        'Val_RMSE': val_metrics_xgb['rmse'],
        'Test_RMSE': test_metrics_xgb['rmse'],
        'Train_R2': train_metrics_xgb['r2'],
        'Val_R2': val_metrics_xgb['r2'],
        'Test_R2': test_metrics_xgb['r2'],
        'Val_MAPE': val_metrics_xgb['mape']
    }
])

print("="*100)
print("BASELINE MODEL COMPARISON")
print("="*100)
print(comparison_df.to_string(index=False))

# Identify the best model by validation RMSE
best_idx = comparison_df['Val_RMSE'].idxmin()
best_row = comparison_df.loc[best_idx]  # idxmin() returns an index label, so use .loc
best_model = best_row['Model']

print(f"\n{'='*100}")
print(f"🏆 BEST MODEL: {best_model}")
print(f"{'='*100}")
print(f"  Validation MAE:  {best_row['Val_MAE']:.2f}  (target: < {TARGET_METRICS['MAE']})")
print(f"  Validation RMSE: {best_row['Val_RMSE']:.2f}  (target: < {TARGET_METRICS['RMSE']})")
print(f"  Validation R²:   {best_row['Val_R2']:.4f}  (target: > {TARGET_METRICS['R2']})")
print(f"  Validation MAPE: {best_row['Val_MAPE']:.2f}%  (target: < {TARGET_METRICS['MAPE']}%)")

# Save comparison
comparison_path = MODELS_DIR / 'model_comparison.csv'
comparison_df.to_csv(comparison_path, index=False)
print(f"\n✓ Comparison saved to: {comparison_path}")


# 9. SUMMARY AND NEXT STEPS

---

## ✅ Trained Models

1. **Ridge Regression** - Simple linear baseline
2. **Random Forest** - Tree ensemble
3. **XGBoost** - Gradient boosting (primary model)

## 📊 Target Metrics

| Metric | Target | Description |
|--------|--------|-------------|
| MAE | < 50 | Mean absolute error |
| RMSE | < 80 | Root mean squared error |
| R² | > 0.7 | Coefficient of determination |
| MAPE | < 25% | Mean absolute percentage error |

## 🚀 Next Steps

1. **Error analysis by segment** (hour, weather, season)
2. **Hyperparameter tuning** of the best model
3. **Feature selection** based on importance
4. **Model ensembling** (stacking/blending)
5. **Exhaustive evaluation** on the test set
6. **Deployment** of the final model
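
As a starting point for the first step, error analysis by segment, the following sketch groups absolute errors by a segment column with pandas. `mae_by_segment` is a hypothetical helper, and the `hr` and `pred` column names are assumptions; the actual segment columns depend on the features produced by the previous notebook.

```python
import pandas as pd


def mae_by_segment(df, segment_col, y_true_col="cnt", y_pred_col="pred"):
    """Return the mean absolute error per value of segment_col, worst first."""
    abs_err = (df[y_true_col] - df[y_pred_col]).abs()
    return abs_err.groupby(df[segment_col]).mean().sort_values(ascending=False)


# Toy frame, not project data; `hr` and `pred` are assumed column names
demo = pd.DataFrame({
    "hr": [8, 8, 17, 17, 3, 3],
    "cnt": [300, 320, 500, 480, 10, 12],
    "pred": [280, 330, 420, 470, 11, 12],
})
segment_mae = mae_by_segment(demo, "hr")
```

Segments at the top of the result are where the model errs most, which can guide feature engineering before tuning.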

## 📝 Generated Files

- `models/rf_feature_importance.csv`
- `models/xgb_feature_importance.csv`
- `models/model_comparison.csv`
- MLflow runs in `mlruns/`

## 🔗 MLflow UI

To view the experiments, run from the project root (the directory that contains `mlruns/`):
```bash
mlflow ui
```

Open: http://localhost:5000

---

**Notebook completed:** `02_modeling.ipynb`  
**Status:** ✅ Baseline models trained  
**Next:** Optimization and exhaustive evaluation
