# Emergency Pivot: Binary Classification

**Context:** The survival analysis failed (C-index ~0.45, below the 0.5 expected from random ranking).  
**New Objective:** Predict whether the graduate obtains a STEM job (event=1) or not (event=0).

---

## Methodological Justification

> According to Hosmer & Lemeshow (2000), when the time to event cannot be predicted, 
> binary classification of the outcome remains scientifically valid for 
> identifying risk and protective factors.

**Primary Metric:** AUC (Area Under the ROC Curve)
- AUC > 0.6: useful model
- AUC > 0.7: good model
- AUC > 0.8: excellent model
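
One way to read these thresholds: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, which is the same concordance idea behind the C-index mentioned in the context. The sketch below computes it by brute-force pairwise comparison and maps the value to the labels in the list. The helper names `pairwise_auc` and `auc_label`, the toy data, and the "limited" label for values at or below 0.6 are illustrative assumptions, not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_label(auc):
    """Map an AUC value to the qualitative bands listed above."""
    if auc > 0.8:
        return "excellent"
    if auc > 0.7:
        return "good"
    if auc > 0.6:
        return "useful"
    return "limited"  # assumed label for AUC <= 0.6


# Hypothetical toy labels and scores, for illustration only
y_toy = [0, 0, 1, 1, 0, 1]
s_toy = [0.1, 0.4, 0.35, 0.8, 0.5, 0.7]
toy_auc = pairwise_auc(y_toy, s_toy)
print(f"AUC = {toy_auc:.3f} -> {auc_label(toy_auc)}")
```

The pairwise loop is quadratic in the sample size, so it is only meant to make the definition concrete; the notebook itself relies on `roc_auc_score`.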

---

In [None]:
# ==============================================================================
# CONFIGURATION
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_DIR = Path("data/processed")
FIGURES_DIR = Path("figures")
FIGURES_DIR.mkdir(exist_ok=True)

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 11, 'savefig.dpi': 300, 'savefig.facecolor': 'white'})

print("✅ Configuration loaded")

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score, 
                             confusion_matrix, classification_report, roc_curve)
from imblearn.over_sampling import SMOTENC

print("✅ ML libraries loaded")

In [None]:
# ==============================================================================
# 1. LOAD DATA
# ==============================================================================
train = pd.read_parquet(DATA_DIR / "train_final.parquet")
test = pd.read_parquet(DATA_DIR / "test_final.parquet")

# Features and target ('event' is the label; 'duration' is excluded from the predictors)
feature_cols = [c for c in train.columns if c not in ['event', 'duration']]

X_train = train[feature_cols]
y_train = train['event']

X_test = test[feature_cols]
y_test = test['event']

print("✅ Data loaded:")
print(f"   Train: {X_train.shape} | Test: {X_test.shape}")
print(f"   Event rate train: {y_train.mean():.1%}")
print(f"   Event rate test: {y_test.mean():.1%}")

---
## 2. Data Augmentation (SMOTE-NC)
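
As a rough intuition for how SMOTE-NC handles mixed feature types: a synthetic minority row is built by interpolating the continuous features between a minority sample and one of its nearest minority neighbours, while each categorical feature takes the most frequent value among those neighbours. The sketch below is a simplified, pure-Python illustration of that idea and not the `imblearn` implementation. The function name `smote_nc_sample`, the toy rows, and the choice to rank neighbours by continuous features only are assumptions made for brevity (the full algorithm also penalises categorical mismatches in the distance).

```python
import random
from collections import Counter


def smote_nc_sample(minority, cont_idx, cat_idx, k=3, rng=None):
    """Create one synthetic minority row (simplified SMOTE-NC sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows")
    rng = rng or random.Random(42)
    base = rng.choice(minority)

    def dist(row):
        # Simplification: Euclidean distance over continuous features only
        return sum((row[i] - base[i]) ** 2 for i in cont_idx) ** 0.5

    neighbours = sorted((r for r in minority if r is not base), key=dist)[:k]
    partner = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)

    synthetic = list(base)
    for i in cont_idx:
        # Continuous features: a point on the segment between base and partner
        synthetic[i] = base[i] + gap * (partner[i] - base[i])
    for i in cat_idx:
        # Categorical features: most frequent value among the neighbours
        synthetic[i] = Counter(r[i] for r in neighbours).most_common(1)[0][0]
    return synthetic


# Hypothetical minority rows: [continuous, continuous, tech_* flag, genero_m]
minority_rows = [
    [0.20, 1.0, 1, 0],
    [0.30, 1.5, 1, 1],
    [0.25, 1.2, 0, 1],
    [0.90, 4.0, 1, 1],
]
new_row = smote_nc_sample(minority_rows, cont_idx=[0, 1], cat_idx=[2, 3],
                          rng=random.Random(0))
print(new_row)
```

Because categorical values are copied from existing rows rather than interpolated, the binary flags in a synthetic row stay valid 0/1 values, which suggests the integer cast in the cell below acts mainly as a safeguard.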

In [None]:
# ==============================================================================
# SMOTE-NC TO BALANCE CLASSES
# ==============================================================================

# Identify the categorical (binary) columns and their positional indices
cat_cols = [c for c in feature_cols if c.startswith('tech_') or c == 'genero_m']
cat_indices = [feature_cols.index(c) for c in cat_cols]

print(f"Categorical columns: {len(cat_cols)}")
print(f"Original distribution: {dict(y_train.value_counts())}")

# Apply SMOTE-NC to the training set only (the test set is left untouched)
smote = SMOTENC(categorical_features=cat_indices, random_state=RANDOM_STATE)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Cast binary columns back to integers
for c in cat_cols:
    if c in X_train_smote.columns:
        X_train_smote[c] = X_train_smote[c].round().astype(int)

print("\n✅ SMOTE-NC applied:")
print(f"   Original: {len(X_train)} -> Resampled: {len(X_train_smote)}")
print(f"   New distribution: 0={sum(y_train_smote==0)}, 1={sum(y_train_smote==1)}")

---
## 3. Training: XGBoost Classifier

In [None]:
# ==============================================================================
# XGBOOST CLASSIFIER
# ==============================================================================

xgb_clf = XGBClassifier(
    objective='binary:logistic',
    n_estimators=200,
    max_depth=5,
    learning_rate=0.05,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    random_state=RANDOM_STATE,
    use_label_encoder=False,
    eval_metric='logloss'
)

print("🚀 Training XGBoost Classifier...")
xgb_clf.fit(X_train_smote, y_train_smote)
print("✅ Training complete")

---
## 4. Evaluation on the Test Set
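
Accuracy and F1 both derive from the four confusion-matrix counts, and they can diverge sharply when classes are imbalanced. The sketch below recomputes them by hand on toy data to make that relationship explicit. The function name `binary_metrics` and the example labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from raw 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Convention assumed here: undefined ratios are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical balanced example
balanced = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                          [1, 0, 1, 0, 1, 0, 0, 1])
print(balanced)

# Hypothetical imbalanced example: predicting "0" for everyone
# yields high accuracy but an F1 of zero for the positive class
y_imb = [1] + [0] * 9
always_zero = binary_metrics(y_imb, [0] * 10)
print(always_zero["accuracy"], always_zero["f1"])
```

The second example is the reason a threshold-free ranking metric such as AUC is reported alongside accuracy.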

In [None]:
# ==============================================================================
# PREDICTIONS AND METRICS
# ==============================================================================

y_pred = xgb_clf.predict(X_test)
y_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]

# Metrics (accuracy and F1 use the default 0.5 threshold; AUC uses the probabilities)
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)
f1 = f1_score(y_test, y_pred)

print("=" * 60)
print("📊 TEST SET RESULTS")
print("=" * 60)
print("\n🎯 MAIN METRICS:")
print(f"   Accuracy: {accuracy:.4f}")
print(f"   AUC:      {auc:.4f}")
print(f"   F1-Score: {f1:.4f}")

print("\n📋 CLASSIFICATION REPORT:")
print(classification_report(y_test, y_pred, target_names=['No STEM Job', 'STEM Job']))

---
## 5. Visualizations for the Thesis
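
The first figure in this section is the ROC curve. Conceptually, each point on it comes from choosing a score threshold, predicting 1 for every case at or above it, and recording the resulting false-positive and true-positive rates. The sketch below does that sweep in plain Python and integrates the curve with the trapezoid rule. The function names `roc_points` and `trapezoid_auc` and the toy data are hypothetical, and this is an illustration of the idea rather than a reimplementation of `roc_curve`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping every distinct score."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs both classes present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted 1
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy labels and scores, for illustration only
y_toy = [0, 0, 1, 1, 0, 1]
s_toy = [0.1, 0.4, 0.35, 0.8, 0.5, 0.7]
curve = roc_points(y_toy, s_toy)
print(curve)
print(f"AUC = {trapezoid_auc(curve):.3f}")
```

Lowering the threshold can only add predicted positives, so the curve moves monotonically from (0, 0) to (1, 1); the dashed diagonal in the figure corresponds to scores that carry no ranking information.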

In [None]:
# ==============================================================================
# FIGURE 1: ROC CURVE
# ==============================================================================

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

fig_roc, ax_roc = plt.subplots(figsize=(8, 8))
ax_roc.plot(fpr, tpr, 'b-', linewidth=2.5, label=f'XGBoost (AUC = {auc:.3f})')
ax_roc.plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Random (AUC = 0.5)')
ax_roc.fill_between(fpr, tpr, alpha=0.2)

ax_roc.set_xlabel('False Positive Rate (FPR)', fontweight='bold')
ax_roc.set_ylabel('True Positive Rate (TPR)', fontweight='bold')
ax_roc.set_title('ROC Curve - STEM Employability Prediction', fontweight='bold', pad=15)
ax_roc.legend(loc='lower right', fontsize=11)
ax_roc.set_xlim([0, 1])
ax_roc.set_ylim([0, 1.02])
ax_roc.spines['top'].set_visible(False)
ax_roc.spines['right'].set_visible(False)

fig_roc.savefig(FIGURES_DIR / "fig_roc_curve.png")
plt.close(fig_roc)
print(f"✅ Saved: {FIGURES_DIR / 'fig_roc_curve.png'}")

In [None]:
# ==============================================================================
# FIGURE 2: CONFUSION MATRIX
# ==============================================================================

cm = confusion_matrix(y_test, y_pred)

fig_cm, ax_cm = plt.subplots(figsize=(8, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No STEM Job', 'STEM Job'],
            yticklabels=['No STEM Job', 'STEM Job'],
            annot_kws={'size': 16}, ax=ax_cm)

ax_cm.set_xlabel('Predicted', fontweight='bold', fontsize=12)
ax_cm.set_ylabel('Actual', fontweight='bold', fontsize=12)
ax_cm.set_title('Confusion Matrix (Test Set)', fontweight='bold', pad=15, fontsize=14)

fig_cm.savefig(FIGURES_DIR / "fig_confusion_matrix.png")
plt.close(fig_cm)
print(f"✅ Saved: {FIGURES_DIR / 'fig_confusion_matrix.png'}")

In [None]:
# ==============================================================================
# FIGURE 3: FEATURE IMPORTANCE
# ==============================================================================

importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': xgb_clf.feature_importances_
}).sort_values('Importance', ascending=True).tail(15)

fig_imp, ax_imp = plt.subplots(figsize=(10, 8))

colors = ['#4682B4' if 'hab' in f else '#2E8B57' if 'tech' in f else '#708090' 
          for f in importance['Feature']]
ax_imp.barh(importance['Feature'], importance['Importance'], color=colors, edgecolor='white')

ax_imp.set_xlabel('Importance (Gain)', fontweight='bold')
ax_imp.set_title('Predictors of STEM Employability\n(XGBoost Feature Importance)', 
                 fontweight='bold', pad=15)
ax_imp.spines['top'].set_visible(False)
ax_imp.spines['right'].set_visible(False)

from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#4682B4', label='Soft Skills'),
    Patch(facecolor='#2E8B57', label='Tech Skills'),
    Patch(facecolor='#708090', label='Demographics')
]
ax_imp.legend(handles=legend_elements, loc='lower right')

fig_imp.savefig(FIGURES_DIR / "fig_feature_importance_class.png")
plt.close(fig_imp)
print(f"✅ Saved: {FIGURES_DIR / 'fig_feature_importance_class.png'}")

In [None]:
# ==============================================================================
# FINAL RESULT
# ==============================================================================

print("\n" + "=" * 70)
if auc > 0.60:
    print("✅ PIVOT SUCCESSFUL: the model clears the 'useful' threshold for predicting employability")
    print(f"   AUC = {auc:.4f} > 0.60")
else:
    print("⚠️ AUC <= 0.60: model with limited predictive capacity")
    print(f"   AUC = {auc:.4f}")
print("=" * 70)

print(f"""
📊 EXECUTIVE SUMMARY:

Objective: Predict whether a STEM graduate obtains a related job
Model: XGBoost Classifier + SMOTE-NC

METRICS (Test Set, n={len(y_test)}):
• Accuracy: {accuracy:.1%}
• AUC:      {auc:.4f}
• F1-Score: {f1:.4f}

CONFUSION MATRIX:
• True Negatives:  {cm[0,0]}
• False Positives: {cm[0,1]}
• False Negatives: {cm[1,0]}
• True Positives:  {cm[1,1]}

FILES GENERATED:
• figures/fig_roc_curve.png
• figures/fig_confusion_matrix.png
• figures/fig_feature_importance_class.png
""")