# üéØ Baseline Training (Escenario A: Real Only)

**Versi√≥n:** v3_experimental  
**Rol:** ML Engineer  
**Fecha:** 2026-01-08

---

## Objetivo

Establecer la **l√≠nea base de rendimiento** usando solo datos reales, contra la cual se comparar√°n los escenarios con datos sint√©ticos.

## Datos

| Dataset | Path | n |
|---------|------|---|
| Train | `../../v2/data/processed/train_final.parquet` | 296 |
| Test | `../../v2/data/processed/test_final.parquet` | 75 |

## Modelos

1. **Random Survival Forest (RSF)**
2. **XGBoost-AFT** (Accelerated Failure Time)

## M√©tricas

- **C-index** (concordance index) - Poder discriminativo
- **IBS** (Integrated Brier Score) - Calibraci√≥n

---

In [1]:
# ==============================================================================
# CONFIGURACI√ìN Y DEPENDENCIAS
# ==============================================================================

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import json
import joblib
from pathlib import Path
from datetime import datetime

# Survival Analysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored, integrated_brier_score
from sksurv.functions import StepFunction

# XGBoost
import xgboost as xgb

# Configuraci√≥n
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Paths
DATA_DIR = Path("../../v2/data/processed")
OUTPUT_DIR = Path(".")
OUTPUT_DIR.mkdir(exist_ok=True)

print("‚úÖ Dependencias cargadas")
print(f"   Random State: {RANDOM_STATE}")

‚úÖ Dependencias cargadas
   Random State: 42


---
## 1. Carga de Datos

In [2]:
# ==============================================================================
# CARGA DE DATOS
# ==============================================================================

# Cargar datasets
train_df = pd.read_parquet(DATA_DIR / "train_final.parquet")
test_df = pd.read_parquet(DATA_DIR / "test_final.parquet")

print(f"üìä Datos cargados:")
print(f"   Train: {train_df.shape}")
print(f"   Test:  {test_df.shape}")

# Identificar columnas
target_cols = ['duration', 'event']
zero_variance_cols = ['tech_python', 'tech_big_data']  # Del diagn√≥stico

# Excluir zero-variance (si existen)
cols_to_drop = [c for c in zero_variance_cols if c in train_df.columns]
feature_cols = [c for c in train_df.columns if c not in target_cols + cols_to_drop]

print(f"\nüìã Estructura:")
print(f"   Features: {len(feature_cols)}")
print(f"   Excluidas (zero-variance): {cols_to_drop}")

üìä Datos cargados:
   Train: (296, 63)
   Test:  (75, 63)

üìã Estructura:
   Features: 59
   Excluidas (zero-variance): ['tech_python', 'tech_big_data']


In [3]:
# ==============================================================================
# PREPARAR DATOS PARA MODELOS
# ==============================================================================

# Features
X_train = train_df[feature_cols].values
X_test = test_df[feature_cols].values

# Targets
y_train_duration = train_df['duration'].values
y_train_event = train_df['event'].values.astype(bool)

y_test_duration = test_df['duration'].values
y_test_event = test_df['event'].values.astype(bool)

# Formato estructurado para sksurv
y_train_surv = np.array([(e, t) for e, t in zip(y_train_event, y_train_duration)],
                        dtype=[('event', bool), ('duration', float)])
y_test_surv = np.array([(e, t) for e, t in zip(y_test_event, y_test_duration)],
                       dtype=[('event', bool), ('duration', float)])

print(f"‚úÖ Datos preparados:")
print(f"   X_train: {X_train.shape}")
print(f"   X_test:  {X_test.shape}")
print(f"   Event rate (train): {y_train_event.mean():.1%}")
print(f"   Event rate (test):  {y_test_event.mean():.1%}")

‚úÖ Datos preparados:
   X_train: (296, 59)
   X_test:  (75, 59)
   Event rate (train): 45.6%
   Event rate (test):  45.3%


---
## 2. Random Survival Forest (RSF)

In [4]:
# ==============================================================================
# RANDOM SURVIVAL FOREST
# ==============================================================================

print("üå≤ Entrenando Random Survival Forest...")

rsf = RandomSurvivalForest(
    n_estimators=100,
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=RANDOM_STATE
)

rsf.fit(X_train, y_train_surv)

# Predicci√≥n (risk scores)
rsf_risk_train = rsf.predict(X_train)
rsf_risk_test = rsf.predict(X_test)

# C-index
c_index_rsf_train = concordance_index_censored(
    y_train_event, y_train_duration, rsf_risk_train
)[0]

c_index_rsf_test = concordance_index_censored(
    y_test_event, y_test_duration, rsf_risk_test
)[0]

print(f"\nüèÜ RSF Results:")
print(f"   C-index (Train): {c_index_rsf_train:.4f}")
print(f"   C-index (Test):  {c_index_rsf_test:.4f}")

# Guardar modelo
joblib.dump(rsf, OUTPUT_DIR / "rsf_baseline.pkl")
print(f"   ‚úÖ Modelo guardado: rsf_baseline.pkl")

üå≤ Entrenando Random Survival Forest...



üèÜ RSF Results:
   C-index (Train): 0.7210
   C-index (Test):  0.4675


   ‚úÖ Modelo guardado: rsf_baseline.pkl


---
## 3. XGBoost-AFT (Accelerated Failure Time)

In [5]:
# ==============================================================================
# XGBOOST-AFT
# ==============================================================================

print("üöÄ Entrenando XGBoost-AFT...")

# Preparar labels para AFT (log-transform de duration)
y_train_lower = np.log(y_train_duration)
y_train_upper = np.where(y_train_event, y_train_lower, np.inf)

y_test_lower = np.log(y_test_duration)
y_test_upper = np.where(y_test_event, y_test_lower, np.inf)

# DMatrix con bounds
dtrain = xgb.DMatrix(X_train)
dtrain.set_float_info('label_lower_bound', y_train_lower)
dtrain.set_float_info('label_upper_bound', y_train_upper)

dtest = xgb.DMatrix(X_test)
dtest.set_float_info('label_lower_bound', y_test_lower)
dtest.set_float_info('label_upper_bound', y_test_upper)

# Par√°metros AFT
params = {
    'objective': 'survival:aft',
    'eval_metric': 'aft-nloglik',
    'aft_loss_distribution': 'normal',
    'aft_loss_distribution_scale': 1.0,
    'max_depth': 3,
    'learning_rate': 0.1,
    'seed': RANDOM_STATE
}

# Entrenar
xgb_model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    verbose_eval=False
)

# Predicci√≥n (log-time predicho ‚Üí convertir a risk)
# En AFT, mayor tiempo predicho = menor riesgo
xgb_pred_train = xgb_model.predict(dtrain)
xgb_pred_test = xgb_model.predict(dtest)

# Risk = -log_time (invertir para que mayor riesgo = evento m√°s temprano)
xgb_risk_train = -xgb_pred_train
xgb_risk_test = -xgb_pred_test

# C-index
c_index_xgb_train = concordance_index_censored(
    y_train_event, y_train_duration, xgb_risk_train
)[0]

c_index_xgb_test = concordance_index_censored(
    y_test_event, y_test_duration, xgb_risk_test
)[0]

print(f"\nüèÜ XGBoost-AFT Results:")
print(f"   C-index (Train): {c_index_xgb_train:.4f}")
print(f"   C-index (Test):  {c_index_xgb_test:.4f}")

# Guardar modelo
xgb_model.save_model(OUTPUT_DIR / "xgb_aft_baseline.json")
print(f"   ‚úÖ Modelo guardado: xgb_aft_baseline.json")

üöÄ Entrenando XGBoost-AFT...



üèÜ XGBoost-AFT Results:
   C-index (Train): 0.7544
   C-index (Test):  0.4791
   ‚úÖ Modelo guardado: xgb_aft_baseline.json


---
## 4. C√°lculo de IBS (Integrated Brier Score)

In [6]:
# ==============================================================================
# INTEGRATED BRIER SCORE
# ==============================================================================

print("üìä Calculando Integrated Brier Score...")

try:
    # Obtener funciones de supervivencia del RSF
    surv_funcs = rsf.predict_survival_function(X_test)
    
    # Crear grid de tiempos para evaluaci√≥n
    # Usar solo tiempos dentro del rango observado
    min_time = max(y_train_duration.min(), y_test_duration.min())
    max_time = min(y_train_duration.max(), y_test_duration.max())
    
    times = np.linspace(min_time + 0.1, max_time - 0.1, 50)
    
    # Evaluar supervivencia en cada tiempo
    preds = np.array([[fn(t) for t in times] for fn in surv_funcs])
    
    # Calcular IBS
    ibs_rsf = integrated_brier_score(y_train_surv, y_test_surv, preds, times)
    
    print(f"   RSF IBS: {ibs_rsf:.4f}")
    
except Exception as e:
    print(f"   ‚ö†Ô∏è No se pudo calcular IBS para RSF: {e}")
    ibs_rsf = None

# Para XGBoost-AFT, IBS requiere funciones de supervivencia completas
# que no est√°n disponibles directamente, marcamos como N/A
ibs_xgb = None
print(f"   XGBoost-AFT IBS: N/A (AFT no provee funciones de supervivencia)")

üìä Calculando Integrated Brier Score...
   RSF IBS: 0.1204
   XGBoost-AFT IBS: N/A (AFT no provee funciones de supervivencia)


---
## 5. Resumen y Exportaci√≥n de Resultados

In [7]:
# ==============================================================================
# RESUMEN DE RESULTADOS
# ==============================================================================

print("\n" + "="*60)
print("üìä RESUMEN BASELINE (Escenario A: Real Only)")
print("="*60)

results = {
    "metadata": {
        "experiment": "Baseline (Real Only)",
        "scenario": "A",
        "date": datetime.now().isoformat(),
        "random_state": RANDOM_STATE
    },
    "data": {
        "n_train": int(X_train.shape[0]),
        "n_test": int(X_test.shape[0]),
        "n_features": int(X_train.shape[1]),
        "event_rate_train": float(y_train_event.mean()),
        "event_rate_test": float(y_test_event.mean())
    },
    "rsf": {
        "c_index_train": float(c_index_rsf_train),
        "c_index_test": float(c_index_rsf_test),
        "ibs": float(ibs_rsf) if ibs_rsf is not None else None,
        "model_path": "rsf_baseline.pkl",
        "hyperparameters": {
            "n_estimators": 100,
            "max_depth": 5,
            "min_samples_split": 10,
            "min_samples_leaf": 5
        }
    },
    "xgb_aft": {
        "c_index_train": float(c_index_xgb_train),
        "c_index_test": float(c_index_xgb_test),
        "ibs": None,
        "model_path": "xgb_aft_baseline.json",
        "hyperparameters": {
            "objective": "survival:aft",
            "aft_loss_distribution": "normal",
            "max_depth": 3,
            "learning_rate": 0.1,
            "num_boost_round": 100
        }
    },
    "best_model": {
        "name": "RSF" if c_index_rsf_test >= c_index_xgb_test else "XGBoost-AFT",
        "c_index": float(max(c_index_rsf_test, c_index_xgb_test))
    }
}

# Mostrar tabla resumen
print(f"\n{'Modelo':<15} {'C-index Train':<15} {'C-index Test':<15} {'IBS':<10}")
print("-" * 55)
print(f"{'RSF':<15} {c_index_rsf_train:<15.4f} {c_index_rsf_test:<15.4f} {ibs_rsf if ibs_rsf else 'N/A':<10}")
print(f"{'XGBoost-AFT':<15} {c_index_xgb_train:<15.4f} {c_index_xgb_test:<15.4f} {'N/A':<10}")
print("-" * 55)
print(f"\nüèÜ Mejor modelo: {results['best_model']['name']} (C-index: {results['best_model']['c_index']:.4f})")

# Guardar resultados
with open(OUTPUT_DIR / "baseline_metrics.json", 'w') as f:
    json.dump(results, f, indent=2)

print(f"\n‚úÖ Resultados guardados en: baseline_metrics.json")


üìä RESUMEN BASELINE (Escenario A: Real Only)

Modelo          C-index Train   C-index Test    IBS       
-------------------------------------------------------
RSF             0.7210          0.4675          0.12040254081869432
XGBoost-AFT     0.7544          0.4791          N/A       
-------------------------------------------------------

üèÜ Mejor modelo: XGBoost-AFT (C-index: 0.4791)

‚úÖ Resultados guardados en: baseline_metrics.json


---
## 6. Interpretaci√≥n

### C-index Interpretaci√≥n

| Rango | Interpretaci√≥n |
|-------|----------------|
| 0.50 | Modelo aleatorio |
| 0.50 - 0.60 | Modelo pobre |
| 0.60 - 0.70 | Modelo aceptable |
| 0.70 - 0.80 | Modelo bueno |
| > 0.80 | Modelo excelente |

### Diagn√≥stico

Seg√∫n el diagn√≥stico previo (`dataset_diagnosis.json`):
- La correlaci√≥n m√°xima feature-event es **0.17** (muy baja)
- Esto limita el poder predictivo de cualquier modelo
- La augmentaci√≥n con datos sint√©ticos busca mejorar esto

---

## Archivos Generados

| Archivo | Descripci√≥n |
|---------|-------------|
| `rsf_baseline.pkl` | Modelo Random Survival Forest entrenado |
| `xgb_aft_baseline.json` | Modelo XGBoost-AFT entrenado |
| `baseline_metrics.json` | M√©tricas de rendimiento |

---

## Siguiente Paso

Proceder al **Prompt 4: Generaci√≥n Sint√©tica** para crear datos que complementen el entrenamiento.