# 📊 Baseline and Synthesis Methodology

**Objective:** Establish a performance baseline and justify the choice of synthesizer.

---

## Justification: GaussianCopula vs. GANs for Small Data

### Why GaussianCopula for N < 500?

The comparison below follows the **Synthetic Data Vault (SDV)** documentation and the literature; the size thresholds are rules of thumb rather than hard limits:

| Criterion | GaussianCopula | CTGAN/TVAE |
|-----------|----------------|------------|
| **Recommended minimum size** | N ≥ 100 | N ≥ 1000 |
| **Stability on small data** | ✅ High | ❌ Unstable |
| **Training time** | Seconds | Minutes to hours |
| **Overfitting risk** | Low | High |
| **Preserves correlations** | ✅ Excellent | ⚠️ Variable |

### Theoretical Basis (Corpus)

> **Patki et al. (2016)**, "The Synthetic Data Vault" (paraphrased, not a verbatim quotation):
> GaussianCopula is preferable when N < 500 because GANs tend to
> memorize examples rather than generalize from small samples.

> **Xu et al. (2019)**, "Modeling Tabular Data using Conditional GAN" (paraphrased, not a verbatim quotation):
> CTGAN requires at least 1000 samples to avoid mode collapse.

### Decision

For our dataset (N=296 training rows):
- ✅ **GaussianCopula**: selected method
- ❌ CTGAN: ruled out due to overfitting risk
- ❌ TVAE: ruled out for the same reason
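
To make the stability argument concrete, here is a minimal pure-Python sketch of the idea behind a Gaussian copula for two numeric columns. It is an illustration under simplifying assumptions (empirical marginals, a single correlation parameter), not SDV's implementation; all function names and the toy data are hypothetical. What it shows is that the dependence between a pair of columns is summarized by one correlation in normal-score space, so there is far less to fit than in a GAN.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal: mu=0, sigma=1


def normal_scores(values):
    """Map each value to a standard-normal score via its rank (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: return the observed value at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_copula(x, y):
    """'Training' = sorted marginals plus one correlation in normal-score space."""
    rho = pearson(normal_scores(x), normal_scores(y))
    return {"x_sorted": sorted(x), "y_sorted": sorted(y), "rho": rho}


def sample_copula(model, n_rows, rng):
    """Draw correlated normals, then map them back through the marginals."""
    rho = model["rho"]
    rows = []
    for _ in range(n_rows):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(model["x_sorted"], u1),
                     empirical_quantile(model["y_sorted"], u2)))
    return rows


# Toy data (hypothetical): 296 rows, y depends positively on x.
rng = random.Random(42)
x = [rng.gauss(40.0, 10.0) for _ in range(296)]
y = [0.5 * xi + rng.gauss(0.0, 5.0) for xi in x]

model = fit_copula(x, y)
synthetic = sample_copula(model, 296, rng)
synth_x = [row[0] for row in synthetic]
synth_y = [row[1] for row in synthetic]

print(f"fitted rho:            {model['rho']:.3f}")
print(f"synthetic correlation: {pearson(synth_x, synth_y):.3f}")
```

Because this sketch samples through the empirical quantiles, every synthetic value is one that was observed in the real column; a production synthesizer would normally fit parametric marginals instead.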

---

In [1]:
# ==============================================================================
# CONFIGURATION
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import json
import joblib
from pathlib import Path

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
MODELS_DIR.mkdir(exist_ok=True)

print("✅ Configuration loaded")

✅ Configuration loaded


In [2]:
# ==============================================================================
# LOAD DATA
# ==============================================================================
train = pd.read_parquet(DATA_DIR / "train_final.parquet")
test = pd.read_parquet(DATA_DIR / "test_final.parquet")

feature_cols = [c for c in train.columns if c not in ['event', 'duration']]

X_train = train[feature_cols]
y_train_event = train['event']
y_train_duration = train['duration']

X_test = test[feature_cols]
y_test_event = test['event']
y_test_duration = test['duration']

print("✅ Data loaded:")
print(f"   Train: {X_train.shape}")
print(f"   Test: {X_test.shape}")
print(f"   Features: {len(feature_cols)}")

✅ Data loaded:
   Train: (296, 61)
   Test: (75, 61)
   Features: 61


---

## Baseline: Random Survival Forest

In [3]:
# ==============================================================================
# BASELINE - RANDOM SURVIVAL FOREST
# ==============================================================================
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored

# Build structured survival targets (event indicator, duration)
y_train_surv = Surv.from_arrays(y_train_event.astype(bool), y_train_duration)
y_test_surv = Surv.from_arrays(y_test_event.astype(bool), y_test_duration)

# Train RSF
print("🌲 Training RSF Baseline...")
rsf_baseline = RandomSurvivalForest(
    n_estimators=500,
    min_samples_leaf=20,
    max_features='sqrt',
    random_state=RANDOM_STATE,
    n_jobs=-1
)
rsf_baseline.fit(X_train, y_train_surv)

# Evaluate: predict() returns risk scores (higher = higher risk)
rsf_pred = rsf_baseline.predict(X_test)
c_index_rsf = concordance_index_censored(
    y_test_event.astype(bool), y_test_duration, rsf_pred
)[0]

print(f"\n🏆 RSF Baseline C-index: {c_index_rsf:.4f}")

🌲 Training RSF Baseline...



🏆 RSF Baseline C-index: 0.4543
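
To make the metric easier to read, the following is a minimal pure-Python sketch of Harrell's concordance index for right-censored data. It is an illustration, not the `sksurv` implementation; the function name `harrell_c_index` and the toy values are hypothetical, and pairs with identical durations are simply skipped. A pair is comparable when the subject with the shorter time had an observed event, and it is concordant when that subject also received the higher risk score.

```python
def harrell_c_index(events, durations, risk_scores):
    """Fraction of comparable pairs ranked correctly (risk ties count 0.5).

    Simplification: pairs with identical durations are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the "earlier event" of a pair
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (hypothetical values): 5 subjects, one censored.
events = [True, True, False, True, True]
durations = [2.0, 4.0, 5.0, 7.0, 9.0]
good_risk = [0.9, 0.7, 0.6, 0.3, 0.1]    # shorter time -> higher risk
flipped_risk = [-r for r in good_risk]   # same scores with the sign inverted

c_good = harrell_c_index(events, durations, good_risk)
c_flipped = harrell_c_index(events, durations, flipped_risk)
print(c_good, c_flipped)
```

When no risk scores are tied, negating the scores turns a C-index of c into 1 - c. A value below 0.50 therefore means that more comparable pairs were ordered the wrong way than the right way; on a small test set this can also be sampling noise.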


---

## Baseline: XGBoost-AFT

In [4]:
# ==============================================================================
# BASELINE - XGBOOST AFT
# ==============================================================================
import xgboost as xgb

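# Build AFT label bounds. This is a row-wise transform of each split's own
# labels (nothing is fitted), so it introduces no train/test leakage.
def make_aft_bounds(event, duration):
    # Observed events: lower == upper == duration.
    # Right-censored rows: upper bound is +inf.
    y_lower = duration.values.astype(float)
    y_upper = duration.values.astype(float).copy()
    y_upper[event.values == 0] = np.inf
    return y_lower, y_upper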

y_lower_train, y_upper_train = make_aft_bounds(y_train_event, y_train_duration)
y_lower_test, y_upper_test = make_aft_bounds(y_test_event, y_test_duration)

# Create DMatrix objects
dtrain = xgb.DMatrix(X_train)
dtrain.set_float_info('label_lower_bound', y_lower_train)
dtrain.set_float_info('label_upper_bound', y_upper_train)

dtest = xgb.DMatrix(X_test)
dtest.set_float_info('label_lower_bound', y_lower_test)
dtest.set_float_info('label_upper_bound', y_upper_test)

# AFT parameters
xgb_params = {
    'objective': 'survival:aft',
    'eval_metric': 'aft-nloglik',
    'aft_loss_distribution': 'normal',
    'learning_rate': 0.05,
    'max_depth': 5,
    'min_child_weight': 3,
    'reg_lambda': 5.0,
    'tree_method': 'hist',
    'seed': RANDOM_STATE
}

print("🚀 Training XGBoost-AFT Baseline...")
xgb_baseline = xgb.train(xgb_params, dtrain, num_boost_round=100)

# Evaluate. The model predicts a survival time (larger = later event = lower risk),
# so the prediction is negated to obtain a risk score for the C-index.
xgb_pred = xgb_baseline.predict(dtest)
c_index_xgb = concordance_index_censored(
    y_test_event.astype(bool), y_test_duration, -xgb_pred
)[0]

print(f"\n🏆 XGBoost-AFT Baseline C-index: {c_index_xgb:.4f}")

🚀 Training XGBoost-AFT Baseline...



🏆 XGBoost-AFT Baseline C-index: 0.3960
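
The interval encoding used by `survival:aft` can be sketched without any libraries. This is a minimal illustration with hypothetical toy values, assuming right-censoring only; `aft_bounds` is a stand-in name, not part of XGBoost. An observed event gets the degenerate interval [t, t], while a censored row gets [t, +inf) because the true event time is only known to exceed t.

```python
import math


def aft_bounds(events, durations):
    """Return (lower, upper) label bounds for right-censored survival data."""
    lower = [float(t) for t in durations]
    upper = [float(t) if e else math.inf for e, t in zip(events, durations)]
    return lower, upper


# Toy rows (hypothetical): event=1 means the event was observed.
events = [1, 0, 1, 0]
durations = [12, 30, 5, 18]

lower, upper = aft_bounds(events, durations)
for e, lo, hi in zip(events, lower, upper):
    kind = "observed" if e else "censored"
    print(f"{kind:>8}: [{lo}, {hi}]")

# An AFT model predicts a survival time, so a longer predicted time means
# lower risk; negating it gives a score where higher = higher risk.
predicted_times = [10.0, 40.0, 6.0, 25.0]   # hypothetical model output
risk_scores = [-t for t in predicted_times]
```

The negation in the last line is the same sign flip applied to `xgb_pred` before it is passed to `concordance_index_censored`.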


In [5]:
# ==============================================================================
# SAVE BASELINE RESULTS
# ==============================================================================

baseline_results = {
    'n_train': int(len(X_train)),
    'n_test': int(len(X_test)),
    'n_features': int(len(feature_cols)),
    'rsf_c_index': float(c_index_rsf),
    'xgb_c_index': float(c_index_xgb),
    'best_baseline': 'RSF' if c_index_rsf > c_index_xgb else 'XGBoost',
    'best_c_index': float(max(c_index_rsf, c_index_xgb))
}

with open(DATA_DIR / 'baseline_results.json', 'w') as f:
    json.dump(baseline_results, f, indent=2)

# Save models
joblib.dump(rsf_baseline, MODELS_DIR / 'rsf_baseline.pkl')
xgb_baseline.save_model(str(MODELS_DIR / 'xgb_baseline.json'))

print("✅ Results saved:")
print(f"   - {DATA_DIR / 'baseline_results.json'}")
print(f"   - {MODELS_DIR / 'rsf_baseline.pkl'}")
print(f"   - {MODELS_DIR / 'xgb_baseline.json'}")

✅ Results saved:
   - data/processed/baseline_results.json
   - models/rsf_baseline.pkl
   - models/xgb_baseline.json


---

## GaussianCopula Synthesizer Configuration

### Dataset Metadata (for SDV)

In [6]:
# ==============================================================================
# SYNTHESIZER CONFIGURATION (not trained yet)
# ==============================================================================

# Define metadata for the synthesizer
synthesizer_metadata = {
    'primary_key': None,  # No primary key
    'columns': {}
}

# Classify columns
for col in train.columns:
    if col == 'duration':
        synthesizer_metadata['columns'][col] = {
            'sdtype': 'numerical',
            'computer_representation': 'Float'
        }
    elif col == 'event':
        synthesizer_metadata['columns'][col] = {
            'sdtype': 'categorical'
        }
    elif col.startswith('tech_') or col == 'genero_m':
        synthesizer_metadata['columns'][col] = {
            'sdtype': 'categorical'  # Binary
        }
    elif col.startswith('hab_'):
        synthesizer_metadata['columns'][col] = {
            'sdtype': 'numerical',
            'computer_representation': 'Float'
        }
    elif col == 'edad':
        synthesizer_metadata['columns'][col] = {
            'sdtype': 'numerical',
            'computer_representation': 'Float'
        }

# Save metadata
with open(DATA_DIR / 'synthesizer_metadata.json', 'w') as f:
    json.dump(synthesizer_metadata, f, indent=2)

print("✅ Synthesizer metadata defined:")
print(f"   Numerical columns: {sum(1 for v in synthesizer_metadata['columns'].values() if v['sdtype']=='numerical')}")
print(f"   Categorical columns: {sum(1 for v in synthesizer_metadata['columns'].values() if v['sdtype']=='categorical')}")
print(f"\n📁 Saved: {DATA_DIR / 'synthesizer_metadata.json'}")

✅ Synthesizer metadata defined:
   Numerical columns: 9
   Categorical columns: 54

📁 Saved: data/processed/synthesizer_metadata.json
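
Because the `if`/`elif` chain above has no final `else`, a column that matches none of the rules would be left out of the metadata without any warning. The counts printed above (9 + 54 = 63, that is, 61 features plus `event` and `duration`) indicate that every column was covered here, but a guard makes that explicit. A minimal sketch, in which `infer_sdtype`, `build_columns`, and the sample column names are hypothetical and the rules simply mirror the chain above:

```python
def infer_sdtype(col):
    """Mirror the notebook's rules; return None when no rule matches."""
    if col in ("duration", "edad") or col.startswith("hab_"):
        return "numerical"
    if col == "event" or col == "genero_m" or col.startswith("tech_"):
        return "categorical"
    return None


def build_columns(columns):
    """Build the 'columns' mapping and fail loudly on unclassified columns."""
    classified, unmatched = {}, []
    for col in columns:
        sdtype = infer_sdtype(col)
        if sdtype is None:
            unmatched.append(col)
        elif sdtype == "numerical":
            classified[col] = {"sdtype": "numerical",
                               "computer_representation": "Float"}
        else:
            classified[col] = {"sdtype": "categorical"}
    if unmatched:
        raise ValueError(f"Columns without an sdtype rule: {unmatched}")
    return classified


# Hypothetical column names, used only to exercise the guard.
ok = build_columns(["duration", "event", "edad", "hab_python", "tech_sql", "genero_m"])
print(len(ok))

try:
    build_columns(["duration", "event", "salario"])
except ValueError as exc:
    print(exc)
```

Raising instead of skipping means a renamed or newly added feature is caught when the metadata is built, rather than silently missing from it later.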


In [7]:
# ==============================================================================
# SUMMARY
# ==============================================================================

print("=" * 70)
print("📊 BASELINE SUMMARY")
print("=" * 70)

print(f"""
🎯 OBJECTIVE: Establish a baseline for comparison with synthetic data

📈 BASELINE RESULTS (Test Set, N={len(X_test)}):

| Model         | C-index    | Notes                    |
|---------------|------------|--------------------------|
| RSF           | {c_index_rsf:.4f}     | {'⭐ Best' if c_index_rsf > c_index_xgb else ''} |
| XGBoost-AFT   | {c_index_xgb:.4f}     | {'⭐ Best' if c_index_xgb > c_index_rsf else ''} |

📌 INTERPRETATION:
   C-index = 0.50 → Random prediction
   Current C-index → {'Above chance' if max(c_index_rsf, c_index_xgb) > 0.52 else 'Not better than chance'}

🔧 SELECTED SYNTHESIZER:
   Method: GaussianCopula (SDV)
   Justification: N={len(X_train)} < 500 → GANs unstable
   
📋 NEXT STEPS:
   → Train GaussianCopula and generate synthetic data
   → Run TSTR (Train on Synthetic, Test on Real) experiments
""")

📊 BASELINE SUMMARY

🎯 OBJECTIVE: Establish a baseline for comparison with synthetic data

📈 BASELINE RESULTS (Test Set, N=75):

| Model         | C-index    | Notes                    |
|---------------|------------|--------------------------|
| RSF           | 0.4543     | ⭐ Best |
| XGBoost-AFT   | 0.3960     |  |

📌 INTERPRETATION:
   C-index = 0.50 → Random prediction
   Current C-index → Not better than chance

🔧 SELECTED SYNTHESIZER:
   Method: GaussianCopula (SDV)
   Justification: N=296 < 500 → GANs unstable
   
📋 NEXT STEPS:
   → Train GaussianCopula and generate synthetic data
   → Run TSTR (Train on Synthetic, Test on Real) experiments

