# ÔøΩÔøΩ Hyperparameter Tuning: RSF vs XGBoost-AFT

**Objetivo**: Encontrar la mejor configuraci√≥n de hiperpar√°metros para modelos de supervivencia.

**Modelos**:
- Random Survival Forest (RSF)
- XGBoost-AFT (Accelerated Failure Time)

**Output**:
- `models/best_rsf.pkl`
- `models/best_xgb_aft.json`

---

In [1]:
# ==============================================================================
# 0. CONFIGURACI√ìN Y CARGA
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import xgboost as xgb
import joblib
from pathlib import Path

from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sklearn.model_selection import GridSearchCV, StratifiedKFold, ParameterGrid

# ----------------------------
# Reproducibilidad b√°sica
# ----------------------------
SEED = 42
np.random.seed(SEED)

# ----------------------------
# Paths y checks
# ----------------------------
# Usamos la versi√≥n corregida (v2) del dataset
DATA_PATH = Path("data/processed/train_survival_final.parquet")
MODELS_DIR = Path("models")
MODELS_DIR.mkdir(parents=True, exist_ok=True)

if not DATA_PATH.exists():
    raise FileNotFoundError(f"‚ùå No existe el archivo: {DATA_PATH.resolve()}")

# ----------------------------
# Cargar datos
# ----------------------------
df_train = pd.read_parquet(DATA_PATH)
print(f"‚úÖ Datos cargados: {len(df_train)} registros | columnas: {df_train.shape[1]}")

required_cols = {"event", "duration"}
missing = required_cols - set(df_train.columns)
if missing:
    raise ValueError(f"‚ùå Faltan columnas requeridas: {missing}")

# Validaci√≥n m√≠nima
if (df_train["duration"] <= 0).any():
    bad = int((df_train["duration"] <= 0).sum())
    raise ValueError(f"‚ùå Hay {bad} filas con duration <= 0. Corrige antes de entrenar.")

print(f"   Duration range: [{df_train['duration'].min():.2f}, {df_train['duration'].max():.2f}]")
print(f"   Event rate: {df_train['event'].mean():.1%}")

‚úÖ Datos cargados: 296 registros | columnas: 63
   Duration range: [0.18, 30.00]
   Event rate: 45.6%


In [2]:
# ==============================================================================
# 1. PREPROCESAMIENTO PARA RSF
# ==============================================================================
cols_to_drop = ["event", "duration", "carrera_norm"]  # meta-data o labels
X_train = df_train.drop(columns=[c for c in cols_to_drop if c in df_train.columns])

# Si hay columnas no num√©ricas, intenta one-hot autom√°ticamente
non_numeric_cols = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
if non_numeric_cols:
    print(f"‚ö†Ô∏è Columnas no num√©ricas detectadas: {non_numeric_cols}")
    print("‚û°Ô∏è Aplicando one-hot encoding autom√°tico.")
    X_train = pd.get_dummies(X_train, columns=non_numeric_cols, drop_first=True)

# NaNs: imputaci√≥n simple con mediana
if X_train.isna().any().any():
    print("‚ö†Ô∏è Se detectaron NaNs. Aplicando imputaci√≥n (mediana).")
    X_train = X_train.fillna(X_train.median(numeric_only=True))

# Formato para Scikit-Survival (RSF)
y_sksurv = Surv.from_dataframe("event", "duration", df_train)

print(f"‚úÖ Features (X): {X_train.shape[1]} columnas")
print(f"‚úÖ Target (y): {len(y_sksurv)} registros estructurados")

‚úÖ Features (X): 61 columnas
‚úÖ Target (y): 296 registros estructurados


---
## 2. Tuning Random Survival Forest (RSF)

Usamos StratifiedKFold para mantener proporci√≥n de eventos en cada fold.

In [3]:
# ==============================================================================
# 2. TUNING RANDOM SURVIVAL FOREST (RSF)
# ==============================================================================
print("üå≤ INICIANDO TUNING DE RSF (puede tardar varios minutos)...")

rsf_param_grid = {
    "n_estimators": [200, 500],
    "min_samples_leaf": [5, 10, 20],
    "max_depth": [None, 10],
    "max_features": ["sqrt"]
}

# Estratificar por event (proporci√≥n censura estable por fold)
y_event = df_train["event"].astype(int).values
cv_rsf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

rsf_grid = GridSearchCV(
    estimator=RandomSurvivalForest(random_state=SEED, n_jobs=-1),
    param_grid=rsf_param_grid,
    cv=cv_rsf,
    n_jobs=-1,
    verbose=1,
    refit=True
)

rsf_grid.fit(X_train, y_sksurv)

print(f"\nüèÜ Mejor RSF C-index promedio: {rsf_grid.best_score_:.4f}")
print(f"‚öôÔ∏è Mejores par√°metros RSF: {rsf_grid.best_params_}")

# Guardar mejor modelo RSF
joblib.dump(rsf_grid.best_estimator_, MODELS_DIR / "best_rsf.pkl")
joblib.dump(rsf_grid.best_params_, MODELS_DIR / "best_rsf_params.pkl")
print(f"‚úÖ Modelo RSF guardado en {MODELS_DIR / 'best_rsf.pkl'}")

üå≤ INICIANDO TUNING DE RSF (puede tardar varios minutos)...
Fitting 5 folds for each of 12 candidates, totalling 60 fits



üèÜ Mejor RSF C-index promedio: 0.6000
‚öôÔ∏è Mejores par√°metros RSF: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 20, 'n_estimators': 500}


‚úÖ Modelo RSF guardado en models/best_rsf.pkl


---
## 3. Tuning XGBoost-AFT

XGBoost AFT requiere formato especial de censura:
- Evento=1 ‚Üí tiempo exacto `[T, T]`
- Evento=0 ‚Üí censura derecha ‚Üí `[T, +inf]`

In [4]:
# ==============================================================================
# 3. TUNING XGBOOST-AFT (L√≥gica avanzada de censura)
# ==============================================================================
print("üöÄ INICIANDO TUNING DE XGBOOST-AFT...")

def make_aft_bounds(df):
    """
    AFT requiere intervalo [lower, upper]
    Evento=1 => tiempo exacto [T, T]
    Evento=0 => censura derecha => [T, +inf]
    """
    y_lower = df["duration"].to_numpy(dtype=float)
    y_upper = df["duration"].to_numpy(dtype=float)
    y_upper[df["event"].to_numpy(dtype=int) == 0] = np.inf
    return y_lower, y_upper

y_lower, y_upper = make_aft_bounds(df_train)

# DMatrix con bounds
dtrain = xgb.DMatrix(X_train)
dtrain.set_float_info("label_lower_bound", y_lower)
dtrain.set_float_info("label_upper_bound", y_upper)

# Grid de hiperpar√°metros (reducido para tiempo razonable)
xgb_params_grid = {
    "learning_rate": [0.01, 0.05],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3],
    "reg_lambda": [1.0, 5.0],
    "aft_loss_distribution": ["normal", "logistic"],
}

best_loss_val = float("inf")
best_params = None
best_num_rounds = None

grid = list(ParameterGrid(xgb_params_grid))
print(f"Evaluando {len(grid)} combinaciones para XGBoost...")

for i, params in enumerate(grid, start=1):
    current_params = dict(params)
    current_params.update({
        "objective": "survival:aft",
        "eval_metric": "aft-nloglik",
        "tree_method": "hist",
        "verbosity": 0,
        "seed": SEED,
        "nthread": -1
    })

    cv_results = xgb.cv(
        params=current_params,
        dtrain=dtrain,
        num_boost_round=1000,
        nfold=5,
        stratified=False,
        early_stopping_rounds=20,
        verbose_eval=False,
        seed=SEED
    )

    # ‚úÖ BUG FIX: la mejor ronda es el argmin, no el total de filas
    loss_series = cv_results["test-aft-nloglik-mean"]
    mean_loss = float(loss_series.min())
    best_round_here = int(loss_series.idxmin() + 1)

    if mean_loss < best_loss_val:
        best_loss_val = mean_loss
        best_params = current_params
        best_num_rounds = best_round_here
        print(f"   ü•á Nuevo mejor ({i}/{len(grid)}): Loss={mean_loss:.4f} | "
              f"Dist={params['aft_loss_distribution']} | Rounds={best_num_rounds}")

print(f"\nüèÜ Mejor XGBoost Loss (aft-nloglik): {best_loss_val:.4f}")
print(f"‚öôÔ∏è Mejores par√°metros XGB: {best_params}")
print(f"üîÅ Mejor num_boost_round: {best_num_rounds}")

üöÄ INICIANDO TUNING DE XGBOOST-AFT...
Evaluando 32 combinaciones para XGBoost...


   ü•á Nuevo mejor (1/32): Loss=17.0652 | Dist=normal | Rounds=39


   ü•á Nuevo mejor (2/32): Loss=16.9793 | Dist=normal | Rounds=41



üèÜ Mejor XGBoost Loss (aft-nloglik): 16.9793
‚öôÔ∏è Mejores par√°metros XGB: {'aft_loss_distribution': 'normal', 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 1, 'reg_lambda': 5.0, 'objective': 'survival:aft', 'eval_metric': 'aft-nloglik', 'tree_method': 'hist', 'verbosity': 0, 'seed': 42, 'nthread': -1}
üîÅ Mejor num_boost_round: 41


In [5]:
# ==============================================================================
# 4. ENTRENAR MODELO FINAL XGBOOST-AFT
# ==============================================================================

# Entrenar el modelo con los mejores hiperpar√°metros y rondas √≥ptimas
final_xgb = xgb.train(
    params=best_params,
    dtrain=dtrain,
    num_boost_round=best_num_rounds
)

# Guardar modelo
final_xgb.save_model(str(MODELS_DIR / "best_xgb_aft.json"))
joblib.dump(best_params, MODELS_DIR / "best_xgb_params.pkl")
print(f"‚úÖ Modelo XGBoost-AFT guardado en {MODELS_DIR / 'best_xgb_aft.json'}")

‚úÖ Modelo XGBoost-AFT guardado en models/best_xgb_aft.json


In [6]:
# ==============================================================================
# üìä RESUMEN FINAL
# ==============================================================================
print("=" * 70)
print("üìä RESUMEN DEL HYPERPARAMETER TUNING")
print("=" * 70)

print(f"\nüå≤ RANDOM SURVIVAL FOREST:")
print(f"   C-index promedio: {rsf_grid.best_score_:.4f}")
print(f"   Mejores par√°metros:")
for k, v in rsf_grid.best_params_.items():
    print(f"      {k}: {v}")

print(f"\nüöÄ XGBOOST-AFT:")
print(f"   Loss (aft-nloglik): {best_loss_val:.4f}")
print(f"   Distribuci√≥n ganadora: {best_params.get('aft_loss_distribution', 'N/A')}")
print(f"   Mejor num_boost_round: {best_num_rounds}")

print(f"\nüìÅ MODELOS GUARDADOS:")
print(f"   - models/best_rsf.pkl")
print(f"   - models/best_xgb_aft.json")

print("\n" + "=" * 70)
print("‚úÖ PROCESO COMPLETADO. Listo para evaluaci√≥n final en test set.")
print("=" * 70)

üìä RESUMEN DEL HYPERPARAMETER TUNING

üå≤ RANDOM SURVIVAL FOREST:
   C-index promedio: 0.6000
   Mejores par√°metros:
      max_depth: None
      max_features: sqrt
      min_samples_leaf: 20
      n_estimators: 500

üöÄ XGBOOST-AFT:
   Loss (aft-nloglik): 16.9793
   Distribuci√≥n ganadora: normal
   Mejor num_boost_round: 41

üìÅ MODELOS GUARDADOS:
   - models/best_rsf.pkl
   - models/best_xgb_aft.json

‚úÖ PROCESO COMPLETADO. Listo para evaluaci√≥n final en test set.
