# 🔬 Data Audit for Synthetic Generation

**Objective:** Statistically diagnose the training dataset and define the ground rules for synthetic data generation.

**Author:** Data Scientist Auditor  
**Date:** 2026-01-08  
**Version:** 2.0 (Enhanced)

---

## 📚 Theoretical Framework and Scientific References

### Fundamentals of Survival Data

> **Lawless, J.F. (2003)**, *"Statistical Models and Methods for Lifetime Data"*, 2nd Ed., Wiley:
> - Survival data have special characteristics: strictly positive time and right censoring
> - The censoring rate affects statistical power and the validity of the estimates
> - For synthesis: preserve the marginal distribution of `duration` and its correlation with `event`

### Time-to-Employment among Graduates

> **Getie Ayaneh et al. (2020)**, DOI: [10.1155/2020/8653405](https://doi.org/10.1155/2020/8653405):
> - "Survival Models for the Analysis of Waiting Time to First Employment"
> - Typical censoring: 40-60% in employability studies
> - Key variables: gender, age, type of degree program

> **Alemu (2022)**, DOI: [10.1155/2022/2165610](https://doi.org/10.1155/2022/2165610):
> - "Understanding Waiting Time from Graduation to First Employment"
> - Recommends a censoring analysis stratified by cohort

### Synthetic Data for Survival Analysis

> **Andonovikj et al. (2024)**, "Survival analysis as semi-supervised multi-label classification":
> - Synthesizers must preserve the duration-event relationship (the two are not independent)
> - Critical constraint: duration > 0 always

---

In [None]:
# ==============================================================================
# CONFIGURATION
# ==============================================================================
import pandas as pd
import numpy as np
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Paths relative to v3_experimental/01_diagnosis/
DATA_PATH = Path("../../v2/data/processed/train_final.parquet")
OUTPUT_PATH = Path("dataset_diagnosis.json")

print("✅ Configuration loaded")
print(f"   Input: {DATA_PATH}")
print(f"   Output: {OUTPUT_PATH}")

In [None]:
# ==============================================================================
# LOAD DATA (TRAIN only - NOT TEST)
# ==============================================================================

df = pd.read_parquet(DATA_PATH)

print("📊 DATASET LOADED:")
print(f"   Rows: {len(df)}")
print(f"   Columns: {len(df.columns)}")
print("\n   Source: Recent graduates survey - undergraduate (EPN)")
print("   Constraint: TRAIN data only, to avoid data leakage")

---
## 1️⃣ Column Classification

According to Lawless (2003), it is crucial to distinguish:
- **Continuous**: variables with >10 unique values (edad, duration)
- **Discrete**: variables with ≤10 unique values
- **Binary**: a special case of discrete variables with exactly 2 values {0,1}
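
The thresholds above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own implementation: the helper `classify_column`, the `max_discrete` parameter, and the columns `tech_python`, `nivel`, and `constante` are hypothetical names chosen for the example, and the zero-variance label is an extra guard for columns with a single value.

```python
import pandas as pd

def classify_column(s: pd.Series, max_discrete: int = 10) -> str:
    """Label a column from its number of distinct non-null values."""
    values = set(s.dropna().unique())
    if len(values) <= 1:
        return "zero_variance"          # a single value carries no information
    if len(values) == 2 and values <= {0, 1}:
        return "binary"                 # exactly the two values 0 and 1
    if len(values) <= max_discrete:
        return "categorical"            # discrete, multi-class
    return "continuous"                 # more than max_discrete distinct values

# Hypothetical toy data, 12 rows per column
toy = pd.DataFrame({
    "edad": list(range(20, 32)),        # 12 distinct values
    "tech_python": [0, 1] * 6,          # only 0 and 1
    "nivel": [1, 2, 3] * 4,             # 3 distinct values
    "constante": [7] * 12,              # one distinct value
})
labels = {col: classify_column(toy[col]) for col in toy.columns}
print(labels)
```

Note that the binary test has to run before the ≤10 test, otherwise every 0/1 column would be labeled as a generic categorical.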

In [None]:
# ==============================================================================
# COLUMN CLASSIFICATION
# ==============================================================================

continuous_cols = []
discrete_cols = []
binary_cols = []
categorical_cols = []
zero_variance_cols = []  # NEW: problematic variables
column_info = {}

for col in df.columns:
    n_unique = df[col].nunique()
    dtype = str(df[col].dtype)
    
    info = {
        'dtype': dtype,
        'n_unique': int(n_unique),
        'null_count': int(df[col].isna().sum()),
        'null_pct': float(df[col].isna().mean()),
    }
    
    # Detect zero variance (NEW)
    if n_unique <= 1:
        zero_variance_cols.append(col)
        info['category'] = 'zero_variance'
        info['warning'] = 'Carries no information; consider excluding'
    # Classify (binary columns are also counted as discrete)
    elif n_unique == 2 and set(df[col].dropna().unique()).issubset({0, 1, 0.0, 1.0}):
        binary_cols.append(col)
        discrete_cols.append(col)
        info['category'] = 'binary'
    elif n_unique <= 10:
        categorical_cols.append(col)
        discrete_cols.append(col)
        info['category'] = 'categorical'
        info['values'] = sorted(df[col].dropna().unique().tolist())
    else:
        # Assumes that columns with >10 unique values are numeric
        continuous_cols.append(col)
        info['category'] = 'continuous'
        info['min'] = float(df[col].min())
        info['max'] = float(df[col].max())
        info['mean'] = float(df[col].mean())
        info['std'] = float(df[col].std())
    
    column_info[col] = info

print("📋 COLUMN CLASSIFICATION:")
print(f"   Continuous: {len(continuous_cols)}")
print(f"   Discrete: {len(discrete_cols)}")
print(f"   - Binary: {len(binary_cols)}")
print(f"   - Categorical (multi-class): {len(categorical_cols)}")
print(f"\n⚠️  PROBLEMATIC (zero variance): {len(zero_variance_cols)}")
for col in zero_variance_cols:
    print(f"   - {col}: only value = {df[col].unique()[0]}")

---
## 2️⃣ Target Analysis (Event/Duration)

Per Getie Ayaneh et al. (2020), the censoring rate in employability studies is typically between 40% and 60%.
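
As a small illustration of how the censoring rate is read off the `event` column, the sketch below computes it overall and per cohort. The toy values and the `cohort` column are invented for the example; the real dataset may not contain a cohort variable under that name.

```python
import pandas as pd

# Hypothetical toy data: event == 1 is an observed event, event == 0 is censored
toy = pd.DataFrame({
    "event":    [1, 0, 1, 1, 0, 0, 1, 0],
    "duration": [3.0, 12.0, 5.5, 1.0, 12.0, 9.0, 7.0, 12.0],
    "cohort":   ["2022", "2022", "2022", "2022", "2023", "2023", "2023", "2023"],
})

# Overall censoring rate: share of rows with event == 0
censored = (toy["event"] == 0).astype(int)
censoring_rate = float(censored.mean())

# Stratified view: censoring rate within each (hypothetical) cohort
by_cohort = censored.groupby(toy["cohort"]).mean()

# Compare against the 40-60% range quoted above
in_typical_range = 0.40 <= censoring_rate <= 0.60

print(f"overall: {censoring_rate:.2f}")
print(by_cohort)
```

An overall rate inside the typical range can still hide strata that fall well outside it, which is one reason to look at the per-cohort breakdown before fixing a single target rate for the synthesizer.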

In [None]:
# ==============================================================================
# TARGET ANALYSIS (Event/Duration)
# ==============================================================================

# Censoring rate
n_events = int(df['event'].sum())
n_censored = len(df) - n_events
censoring_rate = n_censored / len(df)

target_info = {
    'event': {
        'n_events': n_events,
        'n_censored': n_censored,
        'censoring_rate': float(censoring_rate),
        'event_rate': float(1 - censoring_rate)
    },
    'duration': {
        'min': float(df['duration'].min()),
        'max': float(df['duration'].max()),
        'mean': float(df['duration'].mean()),
        'median': float(df['duration'].median()),
        'std': float(df['duration'].std()),
        'n_unique': int(df['duration'].nunique())
    }
}

print("🎯 TARGET ANALYSIS:")
print(f"   Events (E=1): {n_events} ({100*(1-censoring_rate):.1f}%)")
print(f"   Censored (E=0): {n_censored} ({100*censoring_rate:.1f}%)")
print(f"\n   Duration range: [{target_info['duration']['min']:.2f}, {target_info['duration']['max']:.2f}] months")
print(f"   Duration mean: {target_info['duration']['mean']:.2f} ± {target_info['duration']['std']:.2f}")

# Sanity check against the range reported in the literature
if 0.4 <= censoring_rate <= 0.6:
    print(f"\n✅ Censoring rate ({censoring_rate:.1%}) within the expected range (40-60%)")
else:
    print(f"\n⚠️  Censoring rate ({censoring_rate:.1%}) outside the typical range (40-60%)")

---
## 3️⃣ Domain Constraints

According to Lawless (2003), the fundamental constraints on survival data are:
- `duration > 0` (time can be neither negative nor zero)
- `event ∈ {0, 1}` (binary indicator: 1 = event observed, 0 = censored)
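
The two constraints can be checked row by row on any candidate sample. The sketch below is illustrative only: `check_hard_constraints` is a hypothetical helper and the sample values are invented. Missing values are counted as violations because the comparison `duration > 0` is false for NaN.

```python
import pandas as pd

def check_hard_constraints(data: pd.DataFrame) -> dict:
    """Count the rows that violate each hard constraint."""
    return {
        # NaN fails the comparison, so it is counted as a violation
        "duration > 0": int((~(data["duration"] > 0)).sum()),
        "event in {0, 1}": int((~data["event"].isin([0, 1])).sum()),
    }

# Hypothetical sample with two bad durations and one bad event code
sample = pd.DataFrame({
    "duration": [2.5, 0.0, 7.0, -1.0, 4.0],
    "event":    [1, 0, 2, 1, 0],
})
violations = check_hard_constraints(sample)
print(violations)
```

Counting rows rather than returning a single pass/fail flag makes it easier to judge how much of a synthetic sample would be discarded by a reject-on-violation policy.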

In [None]:
# ==============================================================================
# DOMAIN CONSTRAINT DETECTION
# ==============================================================================

constraints = []

# 1. Duration > 0 (CRITICAL for survival analysis)
duration_positive = (df['duration'] > 0).all()
constraints.append({
    'variable': 'duration',
    'constraint': 'duration > 0',
    'satisfied': bool(duration_positive),
    'violations': int((~(df['duration'] > 0)).sum()),
    'criticality': 'HARD',
    'reference': 'Lawless (2003): Time must be strictly positive'
})

# 2. Binary event (violations counts offending rows)
event_valid = df['event'].isin([0, 1])
constraints.append({
    'variable': 'event',
    'constraint': 'event ∈ {0, 1}',
    'satisfied': bool(event_valid.all()),
    'violations': int((~event_valid).sum()),
    'criticality': 'HARD',
    'reference': 'Standard censoring indicator'
})

# 3. Plausible age (fixed bounds; the observed range is recorded alongside)
if 'edad' in df.columns:
    edad_min, edad_max = df['edad'].min(), df['edad'].max()
    edad_valid = (df['edad'] >= 18) & (df['edad'] <= 65)
    constraints.append({
        'variable': 'edad',
        'constraint': '18 ≤ edad ≤ 65',
        'satisfied': bool(edad_valid.all()),
        'violations': int((~edad_valid).sum()),
        'range_observed': [int(edad_min), int(edad_max)],
        'criticality': 'SOFT',
        'reference': 'Plausibility bounds; observed range from the EPN survey'
    })

# 4. Soft skills normalized to [0, 1]
hab_cols = [c for c in df.columns if c.startswith('hab_')]
for h in hab_cols:
    in_range = (df[h] >= 0) & (df[h] <= 1)
    constraints.append({
        'variable': h,
        'constraint': '0 ≤ value ≤ 1',
        'satisfied': bool(in_range.all()),
        'violations': int((~in_range).sum()),
        'criticality': 'SOFT'
    })

# 5. Binary tech skills
tech_cols = [c for c in df.columns if c.startswith('tech_') and c not in zero_variance_cols]
tech_constraint = {
    'variable': 'tech_* (binary)',
    'constraint': 'value ∈ {0, 1}',
    'satisfied': True,
    'violations': 0,
    'n_features': len(tech_cols),
    'criticality': 'HARD'
}
for t in tech_cols:
    if not set(df[t].unique()).issubset({0, 1}):
        tech_constraint['satisfied'] = False
        tech_constraint['violations'] += 1  # counts offending columns, not rows
constraints.append(tech_constraint)

print(f"🔒 CONSTRAINTS DETECTED: {len(constraints)}")
for c in constraints:
    status = "✅" if c['satisfied'] else "❌"
    crit = f"[{c.get('criticality', 'SOFT')}]"
    print(f"   {status} {crit} {c['constraint']}")

---
## 4️⃣ Correlation Analysis (NEW)

According to Andonovikj et al. (2024), synthesizers must preserve the duration-event correlation. We analyze correlations with the target to understand which variables are predictive.
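
Because `event` is binary, the Pearson correlation between `duration` and `event` is the point-biserial correlation: it is driven by the gap between the mean duration of event rows and of censored rows. The sketch below checks that identity on invented toy values; it is an illustration under those assumptions, not part of the audit itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: censored rows (event == 0) tend to have long durations
toy = pd.DataFrame({
    "duration": [3.0, 12.0, 5.5, 1.0, 12.0, 9.0, 7.0, 12.0],
    "event":    [1, 0, 1, 1, 0, 0, 1, 0],
})

# Pearson correlation, computed the same way as Series.corr does by default
pearson = toy["duration"].corr(toy["event"])

# Point-biserial form: (mean_1 - mean_0) / sd * sqrt(p * (1 - p))
m1 = toy.loc[toy["event"] == 1, "duration"].mean()
m0 = toy.loc[toy["event"] == 0, "duration"].mean()
p = toy["event"].mean()                      # share of rows with event == 1
sd = toy["duration"].std(ddof=0)             # population standard deviation
point_biserial = (m1 - m0) / sd * np.sqrt(p * (1 - p))

print(f"pearson={pearson:.4f}  point_biserial={point_biserial:.4f}")
```

On this toy sample the coefficient is negative because the censored rows have the longest durations. A synthesizer that samples `duration` and `event` independently would push the value toward zero, which is the failure this check is meant to catch.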

In [None]:
# ==============================================================================
# CORRELATION ANALYSIS
# ==============================================================================

# Duration-event correlation
duration_event_corr = df['duration'].corr(df['event'])

# Correlations with event (excluding duration and zero-variance variables)
feature_cols = [c for c in df.columns if c not in ['event', 'duration'] + zero_variance_cols]
correlations_with_event = {}
correlations_with_duration = {}

for col in feature_cols:
    try:
        corr_event = float(df[col].corr(df['event']))
        corr_duration = float(df[col].corr(df['duration']))
    except (TypeError, ValueError):
        # Non-numeric columns cannot be correlated; skip them
        continue
    # Drop undefined (NaN) correlations so they do not distort the ranking
    if not pd.isna(corr_event):
        correlations_with_event[col] = corr_event
    if not pd.isna(corr_duration):
        correlations_with_duration[col] = corr_duration

# Top correlations with event, ranked by absolute value
sorted_by_event = sorted(correlations_with_event.items(), key=lambda x: abs(x[1]), reverse=True)
max_corr_event = sorted_by_event[0][1] if sorted_by_event else 0.0

correlation_analysis = {
    'duration_event_correlation': float(duration_event_corr),
    'max_feature_event_correlation': float(max_corr_event),
    'top_5_correlated_with_event': dict(sorted_by_event[:5]),
    'warning': None
}

print("📈 CORRELATION ANALYSIS:")
print(f"\n   Duration ↔ Event: {duration_event_corr:.3f}")
if abs(duration_event_corr) > 0.3:
    print("   ⚠️  Moderate or stronger correlation: the synthesizer must preserve this relationship")
    correlation_analysis['warning'] = 'Duration-event correlation (|r| > 0.3) must be preserved'

print(f"\n   Maximum feature→event correlation: {max_corr_event:.3f}")
if abs(max_corr_event) < 0.2:
    print("   🚨 CRITICAL: No feature has |corr| ≥ 0.2 with event")
    print("      → This is consistent with the model's low C-index (0.5669)")

print("\n   Top 5 correlations with event:")
for col, corr in sorted_by_event[:5]:
    print(f"      {col}: {corr:.3f}")

---
## 5️⃣ Sparsity Analysis (Tech Skills)

In [None]:
# ==============================================================================
# SPARSITY ANALYSIS OF TECH SKILLS
# ==============================================================================

tech_cols_all = [c for c in df.columns if c.startswith('tech_')]
# Restrict the zero-variance list to tech_* columns so the counts below add up
zero_var_tech = [c for c in tech_cols_all if c in zero_variance_cols]
tech_sparsity = {}

for col in tech_cols_all:
    ones_pct = df[col].mean() * 100
    tech_sparsity[col] = {
        'ones_pct': float(ones_pct),
        'ones_count': int(df[col].sum()),
        'is_sparse': bool(ones_pct < 5),  # <5% is considered sparse
        'is_zero_variance': col in zero_variance_cols
    }

sparse_features = [k for k, v in tech_sparsity.items() if v['is_sparse'] and not v['is_zero_variance']]

print(f"📊 SPARSITY OF TECH SKILLS ({len(tech_cols_all)} features):")
print(f"\n   Zero-variance (exclude): {len(zero_var_tech)}")
print(f"   Sparse (<5% values=1): {len(sparse_features)}")
print(f"   Normal (≥5% values=1): {len(tech_cols_all) - len(sparse_features) - len(zero_var_tech)}")

if zero_variance_cols:
    print("\n   ❌ Zero-variance features (EXCLUDE from the synthesizer):")
    for col in zero_variance_cols:
        print(f"      - {col}")

---
## 6️⃣ Cardinality of Categorical Variables

In [None]:
# ==============================================================================
# CARDINALITY OF CATEGORICAL VARIABLES
# ==============================================================================

cardinality = {}
for col in categorical_cols + binary_cols:
    if col not in zero_variance_cols:
        cardinality[col] = {
            'n_unique': int(df[col].nunique()),
            # dropna() keeps the value list consistent with n_unique, which ignores NaN
            'values': sorted(str(v) for v in df[col].dropna().unique().tolist()),
            'is_binary': col in binary_cols
        }

print("📊 CARDINALITY (excluding zero-variance):")
for col in list(cardinality.keys())[:10]:
    info = cardinality[col]
    print(f"   {col}: {info['n_unique']} values {'(Binary)' if info['is_binary'] else ''}")

---
## 7️⃣ Ground Rules for Synthetic Generation (NEW)

Based on the preceding analysis, we define the rules the synthesizer must follow.
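
One way such rules could be enforced is as a post-processing pass over a raw synthetic sample: drop rows that break a hard inequality, round binary indicators, and clip bounded variables. The sketch below is a hedged illustration of that idea. The helper `apply_rules`, the bounds `(22, 45)` and `24.0`, and the columns `tech_python` and `hab_equipo` are hypothetical; in practice the bounds would come from the ranges observed in the training data.

```python
import pandas as pd

def apply_rules(synth: pd.DataFrame, edad_range: tuple, duration_max: float) -> pd.DataFrame:
    """Post-process a raw synthetic sample so it satisfies the domain rules."""
    # Hard rule, reject: drop rows whose duration is not strictly positive
    out = synth[synth["duration"] > 0].copy()

    # Hard rule, round: force event and tech_* indicators to 0 or 1
    binary_cols = ["event"] + [c for c in out.columns if c.startswith("tech_")]
    for col in binary_cols:
        out[col] = out[col].clip(0, 1).round().astype(int)

    # Soft rules, clip: keep bounded variables inside their allowed ranges
    out["edad"] = out["edad"].clip(*edad_range)
    for col in [c for c in out.columns if c.startswith("hab_")]:
        out[col] = out[col].clip(0.0, 1.0)
    out["duration"] = out["duration"].clip(upper=duration_max)

    return out.reset_index(drop=True)

# Hypothetical raw output of a synthesizer, before any rule is applied
raw = pd.DataFrame({
    "duration":    [4.2, -0.3, 30.0],
    "event":       [0.8, 0.1, 1.3],
    "tech_python": [0.2, 0.9, 0.6],
    "edad":        [17.0, 25.0, 70.0],
    "hab_equipo":  [1.2, 0.5, -0.1],
})
clean = apply_rules(raw, edad_range=(22, 45), duration_max=24.0)
print(clean)
```

Rejecting and clipping both change the sample's distribution, so the censoring rate and the duration-event correlation are worth re-measuring after this pass rather than before it.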

In [None]:
# ==============================================================================
# RULES FOR SYNTHETIC GENERATION
# ==============================================================================

# Soft constraints are built first so the edad rule is added only if the column exists
soft_constraints = []
if 'edad' in df.columns:
    soft_constraints.append({
        'rule': f'edad ∈ [{int(df["edad"].min())}, {int(df["edad"].max())}]',
        'type': 'range',
        'action': 'clip_to_range'
    })
soft_constraints.append({
    'rule': 'hab_* ∈ [0, 1]',
    'type': 'range',
    'action': 'clip_to_range'
})
soft_constraints.append({
    'rule': f'duration ∈ [{df["duration"].min():.2f}, {df["duration"].max():.2f}]',
    'type': 'range',
    'action': 'clip_to_range'
})

synthesis_rules = {
    'hard_constraints': [
        {
            'rule': 'duration > 0',
            'type': 'inequality',
            'action': 'reject_if_violated',
            'reference': 'Lawless (2003)'
        },
        {
            'rule': 'event ∈ {0, 1}',
            'type': 'categorical',
            'action': 'round_to_nearest'
        },
        {
            'rule': 'tech_* ∈ {0, 1}',
            'type': 'binary',
            'action': 'round_to_nearest'
        }
    ],
    'soft_constraints': soft_constraints,
    'features_to_exclude': zero_variance_cols,
    'preservation_targets': [
        {
            'metric': 'censoring_rate',
            'target_value': float(censoring_rate),
            'tolerance': 0.05
        },
        {
            'metric': 'duration_event_correlation',
            'target_value': float(duration_event_corr),
            'tolerance': 0.1
        }
    ],
    'recommended_synthesizer': {
        'method': 'GaussianCopula',
        'reason': f'N={len(df)} < 500, GANs are unstable on small data (Xu et al., 2019)',
        'alternative': 'CTGAN with epochs=300+ if deep learning is preferred'
    }
}

print("🎮 GROUND RULES FOR THE SYNTHESIZER:")
print("\n   HARD CONSTRAINTS (never violate):")
for r in synthesis_rules['hard_constraints']:
    print(f"   - {r['rule']}")

print("\n   SOFT CONSTRAINTS (preferably kept):")
for r in synthesis_rules['soft_constraints']:
    print(f"   - {r['rule']}")

print(f"\n   EXCLUDE FROM THE SYNTHESIZER ({len(zero_variance_cols)} features):")
for col in zero_variance_cols:
    print(f"   - {col}")

print("\n   METRICS TO PRESERVE:")
for p in synthesis_rules['preservation_targets']:
    print(f"   - {p['metric']}: {p['target_value']:.3f} ± {p['tolerance']}")

In [None]:
# ==============================================================================
# SAVE ENHANCED JSON REPORT
# ==============================================================================

# Scientific references
scientific_references = [
    {
        'author': 'Lawless, J.F.',
        'year': 2003,
        'title': 'Statistical Models and Methods for Lifetime Data',
        'publisher': 'Wiley',
        'relevance': 'Fundamentals of survival analysis, domain constraints'
    },
    {
        'author': 'Getie Ayaneh et al.',
        'year': 2020,
        'doi': '10.1155/2020/8653405',
        'title': 'Survival Models for the Analysis of Waiting Time to First Employment',
        'relevance': 'Time-to-employment analysis among graduates'
    },
    {
        'author': 'Andonovikj et al.',
        'year': 2024,
        'title': 'Survival analysis as semi-supervised multi-label classification',
        'relevance': 'Duration-event relationship in data synthesis'
    },
    {
        'author': 'Xu et al.',
        'year': 2019,
        'title': 'Modeling Tabular Data using Conditional GAN (CTGAN)',
        'relevance': 'Synthesizers for tabular data'
    }
]

diagnosis = {
    'metadata': {
        'version': '2.0',
        'created': '2026-01-08',
        'author': 'Data Scientist Auditor',
        'purpose': 'Diagnosis for synthetic generation of survival data'
    },
    'scientific_references': scientific_references,
    'dataset_info': {
        'n_rows': int(len(df)),
        'n_cols': int(len(df.columns)),
        'source': str(DATA_PATH),
        'original_source': 'Recent graduates survey - undergraduate (EPN)'
    },
    'column_classification': {
        'continuous': continuous_cols,
        'discrete': discrete_cols,
        'binary': binary_cols,
        'categorical': categorical_cols,
        'zero_variance': zero_variance_cols
    },
    'column_details': column_info,
    'target_analysis': target_info,
    'correlation_analysis': correlation_analysis,
    'constraints': constraints,
    'cardinality': cardinality,
    'sparsity_analysis': {
        'n_sparse_features': len(sparse_features),
        'n_zero_variance': len(zero_variance_cols),
        'sparse_features': sparse_features
    },
    'synthesis_rules': synthesis_rules,
    'problems_identified': []
}

# Identify problems
if abs(max_corr_event) < 0.2:
    diagnosis['problems_identified'].append(
        '🚨 CRITICAL: No feature has |correlation| ≥ 0.2 with the target'
    )
if len(df) < 500:
    diagnosis['problems_identified'].append(
        f'⚠️ Small dataset (N={len(df)} < 500)'
    )
if len(zero_variance_cols) > 0:
    diagnosis['problems_identified'].append(
        f'⚠️ {len(zero_variance_cols)} zero-variance features to exclude'
    )
if len(df.columns) / len(df) > 0.15:
    diagnosis['problems_identified'].append(
        f'⚠️ High relative dimensionality (p/N = {len(df.columns)/len(df):.2f})'
    )

# Explicit UTF-8: ensure_ascii=False writes emoji and symbols such as ∈ verbatim
with open(OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(diagnosis, f, indent=2, ensure_ascii=False)

print("\n" + "="*70)
print(f"✅ REPORT SAVED: {OUTPUT_PATH}")
print("="*70)
print("\n📋 EXECUTIVE SUMMARY:")
print(f"   Rows: {diagnosis['dataset_info']['n_rows']}")
print(f"   Columns: {diagnosis['dataset_info']['n_cols']}")
print(f"   Censoring rate: {target_info['event']['censoring_rate']:.1%}")
print(f"   Constraints OK: {sum(1 for c in constraints if c['satisfied'])}/{len(constraints)}")
print(f"   Zero-variance features: {len(zero_variance_cols)}")
print("\n   Problems identified:")
for p in diagnosis['problems_identified']:
    print(f"   {p}")