# Feature Selection and Final Training

**Version:** v2  
**Input:** `./data/processed/train_final.parquet`, `test_final.parquet`  
**Objective:** Select the Top-20 features and train survival models

---

## Methodological Justification (Based on the Corpus)

### 1. Dimensionality Reduction

> **"Given the high dimensionality (60+ features), we apply Permutation Importance with a base Random Forest to select the variables with real predictive power and discard noise (Hastie et al., 2009)."**

| Corpus Paper | Justification |
|--------------|---------------|
| Barnwal et al. (2022) *Survival Regression with Accelerated Failure Time* | XGBoost-AFT for censored data |
| Andonovikj et al. (2024) *Survival Analysis as Semi-Supervised* | Feature selection in survival analysis |
| Abd ElHafeez (2021) *Methods to Analyze Time-to-Event Data* | Cox regression and C-index |
| Getie Ayaneh (2020) *Survival Models for Waiting Time* | Application to employability |

### 2. Models to Train

- **Random Survival Forest (RSF)**: Robust ensemble for censored data
- **XGBoost-AFT**: Accelerated Failure Time with regularization

---

In [None]:
# ==============================================================================
# CONFIGURATION
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import joblib
from pathlib import Path

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
MODELS_DIR.mkdir(exist_ok=True)

print("✅ Configuration loaded")

In [None]:
# Load data
train = pd.read_parquet(DATA_DIR / "train_final.parquet")
test = pd.read_parquet(DATA_DIR / "test_final.parquet")

# Separate the features from the survival targets (event flag and duration)
feature_cols = [c for c in train.columns if c not in ['event', 'duration']]
X_train = train[feature_cols]
y_train_event = train['event']
y_train_duration = train['duration']

X_test = test[feature_cols]
y_test_event = test['event']
y_test_duration = test['duration']

print("✅ Data loaded:")
print(f"   Train: {X_train.shape} | Test: {X_test.shape}")
print(f"   Features: {len(feature_cols)}")
print(f"   Event rate train: {y_train_event.mean():.1%}")

In [None]:
# ML imports
import xgboost as xgb
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

print("✅ ML libraries loaded")

---
## 1. Feature Selection with Permutation Importance

We fit a baseline RandomForest and rank the features by importance, keeping the 20 most important variables.
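
To make the idea concrete, here is a minimal standalone sketch of permutation importance, separate from the notebook's pipeline. It assumes a toy fixed model (`toy_predict`, a hypothetical stand-in for a fitted estimator) that uses only the first of two features, and it measures how much the mean squared error grows when each column is shuffled.

```python
import random

def toy_predict(row):
    # Hypothetical stand-in for a fitted model: depends only on feature 0.
    return 3.0 * row[0]

def mse(rows, targets):
    return sum((toy_predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance_sketch(rows, targets, n_repeats=5, seed=42):
    """Mean increase in MSE when each feature column is shuffled."""
    shuffler = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            shuffler.shuffle(column)
            # Rebuild the rows with only column j permuted
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(shuffled, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

generator = random.Random(0)
rows = [[generator.random(), generator.random()] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]  # target driven by feature 0 only

importances = permutation_importance_sketch(rows, targets)
print(importances)
```

Shuffling the feature the model relies on degrades the score, while shuffling the ignored feature leaves it unchanged, which is the signal used to rank and discard variables.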

In [None]:
# ==============================================================================
# FEATURE IMPORTANCE WITH RANDOM FOREST
# ==============================================================================

print("🔍 Computing feature importance...")

# Fit a quick baseline RF on duration. This treats censored durations as
# observed event times, so it is only a rough screening step.
rf_baseline = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_baseline.fit(X_train, y_train_duration)

# Permutation importance: mean drop in R^2 when each feature is shuffled.
# Computed on the training set, so overfit features may look stronger.
perm = permutation_importance(
    rf_baseline, X_train, y_train_duration,
    n_repeats=5, random_state=RANDOM_STATE, n_jobs=-1
)

importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': perm.importances_mean
}).sort_values('importance', ascending=False)

print("\n📊 Top 20 Features by Importance:")
print(importance.head(20).to_string(index=False))

# Select the Top 20
TOP_K = 20
top_features = importance.head(TOP_K)['feature'].tolist()
print(f"\n✅ Selected {TOP_K} features")

In [None]:
# Reduce to the Top-K features
X_train_sel = X_train[top_features]
X_test_sel = X_test[top_features]

print("✅ Reduced dimensions:")
print(f"   Train: {X_train_sel.shape}")
print(f"   Test: {X_test_sel.shape}")

---
## 2. Training: Random Survival Forest
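
For reference, scikit-survival expects the target as a NumPy structured array with a boolean event field and a float time field. The sketch below builds an equivalent array by hand from toy values; the field names `event` and `time` are assumed here to mirror the defaults of `Surv.from_arrays`.

```python
import numpy as np

# Toy data: 1 = event observed, 0 = right-censored
event = np.array([1, 0, 1, 0])
duration = np.array([5.0, 12.0, 3.0, 8.0])

# Structured array with one (event, time) record per subject
y_surv = np.empty(len(event), dtype=[("event", bool), ("time", float)])
y_surv["event"] = event.astype(bool)
y_surv["time"] = duration

n_events = int(y_surv["event"].sum())
n_censored = int((~y_surv["event"]).sum())
print(y_surv.dtype.names)
print(f"{n_events} events, {n_censored} censored")
```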

In [None]:
# ==============================================================================
# RANDOM SURVIVAL FOREST
# ==============================================================================

# scikit-survival format: structured array of (event, time)
y_train_surv = Surv.from_arrays(y_train_event.astype(bool), y_train_duration)
y_test_surv = Surv.from_arrays(y_test_event.astype(bool), y_test_duration)

# Train the RSF with the best hyperparameters from the earlier tuning
print("🌲 Training Random Survival Forest...")

rsf = RandomSurvivalForest(
    n_estimators=500,
    min_samples_leaf=20,
    max_depth=None,
    max_features='sqrt',
    random_state=RANDOM_STATE,
    n_jobs=-1
)

rsf.fit(X_train_sel, y_train_surv)

# Evaluation on TEST (rsf.predict returns risk scores: higher = higher risk)
rsf_pred = rsf.predict(X_test_sel)
c_index_rsf = concordance_index_censored(
    y_test_event.astype(bool), 
    y_test_duration, 
    rsf_pred
)[0]

print(f"\n🏆 RSF C-index (TEST): {c_index_rsf:.4f}")

# Save model
joblib.dump(rsf, MODELS_DIR / "rsf_final.pkl")
print(f"✅ Model saved: {MODELS_DIR / 'rsf_final.pkl'}")

---
## 3. Training: XGBoost-AFT
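
XGBoost's AFT objective takes each label as an interval with a lower and an upper bound. An observed event has both bounds equal to its duration, while a right-censored record only has a known lower bound, so its upper bound is set to +inf. A small sketch with toy values (the helper name `aft_bounds_sketch` is hypothetical and works on plain lists rather than pandas Series):

```python
import math

def aft_bounds_sketch(events, durations):
    """Return (lower, upper) label bounds for right-censored data."""
    lower = [float(d) for d in durations]
    # Censored rows (event == 0) get an open upper bound
    upper = [float(d) if e else math.inf for e, d in zip(events, durations)]
    return lower, upper

events = [1, 0, 1]
durations = [5, 12, 3]
lower, upper = aft_bounds_sketch(events, durations)
print(lower)  # [5.0, 12.0, 3.0]
print(upper)  # [5.0, inf, 3.0]
```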

In [None]:
# ==============================================================================
# XGBOOST AFT
# ==============================================================================

print("🚀 Training XGBoost-AFT...")

# Build the label bounds for AFT
def make_aft_bounds(event, duration):
    """Return (lower, upper) bounds; censored rows get an upper bound of +inf."""
    y_lower = duration.values.astype(float)
    y_upper = duration.values.astype(float).copy()
    y_upper[event.values == 0] = np.inf
    return y_lower, y_upper

y_lower_train, y_upper_train = make_aft_bounds(y_train_event, y_train_duration)
y_lower_test, y_upper_test = make_aft_bounds(y_test_event, y_test_duration)

# DMatrix
dtrain = xgb.DMatrix(X_train_sel)
dtrain.set_float_info('label_lower_bound', y_lower_train)
dtrain.set_float_info('label_upper_bound', y_upper_train)

dtest = xgb.DMatrix(X_test_sel)
dtest.set_float_info('label_lower_bound', y_lower_test)
dtest.set_float_info('label_upper_bound', y_upper_test)

# Optimal parameters from tuning
xgb_params = {
    'objective': 'survival:aft',
    'eval_metric': 'aft-nloglik',
    'aft_loss_distribution': 'normal',
    'learning_rate': 0.01,
    'max_depth': 3,
    'min_child_weight': 1,
    'reg_lambda': 5.0,
    'tree_method': 'hist',
    'seed': RANDOM_STATE
}

# Train
xgb_model = xgb.train(xgb_params, dtrain, num_boost_round=100)

# Prediction (predicted survival time)
xgb_pred = xgb_model.predict(dtest)

# C-index for XGBoost (the prediction is a time: larger = later event)
c_index_xgb = concordance_index_censored(
    y_test_event.astype(bool),
    y_test_duration,
    -xgb_pred  # Negated because a longer time means lower risk
)[0]

print(f"\n🏆 XGBoost-AFT C-index (TEST): {c_index_xgb:.4f}")

# Save
xgb_model.save_model(str(MODELS_DIR / "xgb_aft_final.json"))
print(f"✅ Model saved: {MODELS_DIR / 'xgb_aft_final.json'}")

---
## 4. Results Summary
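
The C-index reported in this section is the share of comparable pairs that a model ranks correctly. As an illustration only, here is a simplified pure-Python sketch of Harrell's C for right-censored data. It skips pairs with tied times, which `concordance_index_censored` handles more carefully, and the function name `c_index_sketch` is hypothetical.

```python
def c_index_sketch(events, times, risk):
    """Simplified Harrell's C: a pair (i, j) is comparable when subject i
    has an observed event and a strictly shorter time than subject j."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier event has higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risk counts as half
    return concordant / comparable

events = [True, True, False, True]
times = [2.0, 4.0, 5.0, 7.0]
perfect_risk = [4.0, 3.0, 2.0, 1.0]    # shorter time -> higher risk
reversed_risk = [1.0, 2.0, 3.0, 4.0]
constant_risk = [1.0, 1.0, 1.0, 1.0]

print(c_index_sketch(events, times, perfect_risk))   # 1.0
print(c_index_sketch(events, times, reversed_risk))  # 0.0
print(c_index_sketch(events, times, constant_risk))  # 0.5
```

A perfectly ordered risk score gives 1.0, a reversed one gives 0.0, and an uninformative constant score gives 0.5, which matches the "random" end of the reference scale.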

In [None]:
# ==============================================================================
# FINAL SUMMARY
# ==============================================================================

def interpret_c_index(c):
    """Map a C-index to the reference scale printed in the summary."""
    if c >= 0.80:
        return '✅ Excellent'
    if c >= 0.70:
        return '✅ Good'
    if c >= 0.60:
        return '⚠️ Acceptable'
    return '❌ Low'

print("=" * 70)
print("📊 FINAL RESULTS ON TEST SET")
print("=" * 70)

print(f"""
🎯 EVALUATION METRICS (C-index)

| Model           | C-index (Test) | Interpretation |
|-----------------|----------------|----------------|
| RSF             | {c_index_rsf:.4f}         | {interpret_c_index(c_index_rsf)} |
| XGBoost-AFT     | {c_index_xgb:.4f}         | {interpret_c_index(c_index_xgb)} |

📌 C-index reference:
   0.50 = Random
   0.60 = Acceptable
   0.70 = Good
   0.80 = Excellent

📁 SAVED MODELS:
   - models/rsf_final.pkl
   - models/xgb_aft_final.json

🔍 SELECTED FEATURES ({TOP_K}):
""")

for i, f in enumerate(top_features, 1):
    print(f"   {i:2d}. {f}")

In [None]:
# Save the list of selected features
import json

feature_selection_info = {
    'top_features': top_features,
    'c_index_rsf': float(c_index_rsf),
    'c_index_xgb': float(c_index_xgb),
    'n_train': len(X_train),
    'n_test': len(X_test)
}

with open(MODELS_DIR / 'feature_selection_results.json', 'w') as f:
    json.dump(feature_selection_info, f, indent=2)

print(f"✅ Results saved to {MODELS_DIR / 'feature_selection_results.json'}")