# Deep Diagnosis of the Dataset

**Objective:** A complete statistical audit of the training dataset to identify:
1. Data types (continuous vs. discrete)
2. Variable cardinality
3. Censoring rate
4. Domain constraints
5. Potential problems

---

## Methodological Justification

> According to **Lawless (2003)**, *"Statistical Models and Methods for Lifetime Data"*:
> - The distribution of the time to event should be examined before modeling
> - Interval-censored data require special treatment
> - The censoring rate affects statistical power

---

In [1]:
# ==============================================================================
# CONFIGURATION
# ==============================================================================
import warnings
warnings.filterwarnings('ignore')  # Silence library warnings to keep the output readable

import pandas as pd
import numpy as np
import json
from pathlib import Path

DATA_DIR = Path("data/processed")
TRAIN_PATH = DATA_DIR / "train_final.parquet"  # Using train_final because train_survival_selected does not exist

print("✅ Configuration loaded")
print(f"   Dataset: {TRAIN_PATH}")

✅ Configuration loaded
   Dataset: data/processed/train_final.parquet


In [2]:
# ==============================================================================
# 1. LOADING AND OVERVIEW
# ==============================================================================
df = pd.read_parquet(TRAIN_PATH)

print("=" * 70)
print("📊 DATASET OVERVIEW")
print("=" * 70)
print(f"\nDimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print("\nColumns:")
for i, c in enumerate(df.columns):
    print(f"   [{i:2d}] {c}")

📊 DATASET OVERVIEW

Dimensions: 296 rows × 63 columns

Columns:
   [ 0] edad
   [ 1] genero_m
   [ 2] hab_1
   [ 3] hab_2
   [ 4] hab_3
   [ 5] hab_4
   [ 6] hab_5
   [ 7] hab_6
   [ 8] hab_7
   [ 9] tech_programacion
   [10] tech_python
   [11] tech_java
   [12] tech_desarrollo_web
   [13] tech_desarrollo_movil
   [14] tech_base_datos
   [15] tech_machine_learning
   [16] tech_inteligencia_artificial
   [17] tech_analisis_datos
   [18] tech_big_data
   [19] tech_redes
   [20] tech_telecomunicaciones
   [21] tech_redes_inalambricas
   [22] tech_sistemas_celulares
   [23] tech_ciberseguridad
   [24] tech_cloud_computing
   [25] tech_sistemas_operativos
   [26] tech_iot
   [27] tech_automatizacion
   [28] tech_robotica
   [29] tech_electronica
   [30] tech_instrumentacion
   [31] tech_control
   [32] tech_electricidad
   [33] tech_sistemas_potencia
   [34] tech_energias_renovables
   [35] tech_calidad_energia
   [36] tech_termodinamica
   [37] tech_mecanica_fluidos
   [38] te

---
## 2. Data Type Analysis

In [3]:
# ==============================================================================
# COLUMN CLASSIFICATION
# ==============================================================================

# Identify types
continuous_cols = []
categorical_cols = []
binary_cols = []

for col in df.columns:
    if col in ['event', 'duration']:
        continue  # Targets

    unique_vals = df[col].nunique()

    if unique_vals == 2:
        # Binary columns are also counted as categorical
        binary_cols.append(col)
        categorical_cols.append(col)
    elif unique_vals <= 10 or col.startswith('tech_'):
        categorical_cols.append(col)
    else:
        continuous_cols.append(col)

print("=" * 70)
print("📋 VARIABLE CLASSIFICATION")
print("=" * 70)

print(f"\n🔢 CONTINUOUS ({len(continuous_cols)}):")
for c in continuous_cols:
    print(f"   - {c}: {df[c].nunique()} unique values, range [{df[c].min():.2f}, {df[c].max():.2f}]")

print(f"\n📊 CATEGORICAL ({len(categorical_cols)}):")
print(f"   - Binary: {len(binary_cols)}")
cat_non_binary = [c for c in categorical_cols if c not in binary_cols]
print(f"   - Multi-class: {len(cat_non_binary)}")

📋 VARIABLE CLASSIFICATION

🔢 CONTINUOUS (1):
   - edad: 17 unique values, range [21.00, 40.00]

📊 CATEGORICAL (60):
   - Binary: 51
   - Multi-class: 9
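
Under the rule in the cell above, a column with a single unique value is neither binary nor continuous, so it is counted in the non-binary categorical bucket. A minimal sketch, assuming a hypothetical helper `classify_column` (not part of the original notebook) and toy data, that separates out constant columns:

```python
import pandas as pd

def classify_column(series: pd.Series, name: str, max_levels: int = 10) -> str:
    """Hypothetical helper: same rule as the cell above, plus a 'constant' bucket."""
    n_unique = series.nunique()
    if n_unique <= 1:
        return "constant"      # carries no information for modeling
    if n_unique == 2:
        return "binary"
    if n_unique <= max_levels or name.startswith("tech_"):
        return "categorical"
    return "continuous"

# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "edad": range(20, 40),
    "tech_python": [0, 1] * 10,
    "tech_unused": [0] * 20,
    "hab_1": [0.0, 0.25, 0.5, 0.75, 1.0] * 4,
})
kinds = {col: classify_column(toy[col], col) for col in toy.columns}
print(kinds)
```

On the real data this would be called as `classify_column(df[col], col)` for each feature column.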


---
## 3. Target Analysis (Event/Duration)

In [4]:
# ==============================================================================
# CENSORING AND DURATION ANALYSIS
# ==============================================================================

print("=" * 70)
print("🎯 TARGET ANALYSIS")
print("=" * 70)

# Event indicator (1 = event observed, 0 = censored)
n_events = int(df['event'].sum())
n_censored = len(df) - n_events
censoring_rate = n_censored / len(df)

print("\n📌 EVENT (event):")
print(f"   Events (E=1):   {n_events} ({100*n_events/len(df):.1f}%)")
print(f"   Censored (E=0): {n_censored} ({100*censoring_rate:.1f}%)")
print(f"   Censoring rate: {censoring_rate:.4f}")

# Duration
print("\n⏱️ DURATION (duration):")
print(f"   Minimum: {df['duration'].min():.2f} months")
print(f"   Maximum: {df['duration'].max():.2f} months")
print(f"   Mean:    {df['duration'].mean():.2f} months")
print(f"   Median:  {df['duration'].median():.2f} months")
print(f"   Std:     {df['duration'].std():.2f}")
print(f"   Unique values: {df['duration'].nunique()}")

# Distribution by event status
print("\n📊 DISTRIBUTION BY EVENT:")
print(f"   Duration (E=0): mean={df[df['event']==0]['duration'].mean():.2f}, "
      f"std={df[df['event']==0]['duration'].std():.2f}")
print(f"   Duration (E=1): mean={df[df['event']==1]['duration'].mean():.2f}, "
      f"std={df[df['event']==1]['duration'].std():.2f}")

🎯 TARGET ANALYSIS

📌 EVENT (event):
   Events (E=1):   135 (45.6%)
   Censored (E=0): 161 (54.4%)
   Censoring rate: 0.5439

⏱️ DURATION (duration):
   Minimum: 0.58 months
   Maximum: 30.00 months
   Mean:    15.43 months
   Median:  17.64 months
   Std:     10.99
   Unique values: 265

📊 DISTRIBUTION BY EVENT:
   Duration (E=0): mean=10.37, std=10.92
   Duration (E=1): mean=21.47, std=7.48
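
With more than half of the rows censored, the mean and median of `duration` mix event times with censoring times. The Kaplan-Meier estimator is the standard nonparametric way to examine the time-to-event distribution under right censoring. A minimal NumPy sketch follows; the function name `kaplan_meier` and the toy data are illustrative assumptions, not part of the original notebook:

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (distinct event times, estimated survival S(t) at each of them)."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    times = np.unique(durations[events == 1])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                     # still under observation at t
        n_event = np.sum((durations == t) & (events == 1))   # events exactly at t
        s *= 1.0 - n_event / at_risk
        survival.append(s)
    return times, np.array(survival)

# Toy data for illustration only (not the real dataset)
toy_durations = [2, 3, 3, 5, 8, 8, 12]
toy_events = [1, 1, 0, 1, 0, 1, 0]
times, surv = kaplan_meier(toy_durations, toy_events)
for t, s in zip(times, surv):
    print(f"t={t:5.1f}  S(t)={s:.3f}")
```

On the real data the call would be `kaplan_meier(df['duration'], df['event'])`; censored rows contribute to the risk set up to their censoring time but never count as events.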


---
## 4. Domain Constraint Detection

In [5]:
# ==============================================================================
# LOGICAL CONSTRAINTS
# ==============================================================================

constraints = []

print("=" * 70)
print("🔒 DOMAIN CONSTRAINTS DETECTED")
print("=" * 70)

# 1. Duration > 0
if (df['duration'] > 0).all():
    constraints.append({'variable': 'duration', 'constraint': 'duration > 0', 'status': 'OK'})
    print("\n✅ duration > 0: Satisfied")
else:
    constraints.append({'variable': 'duration', 'constraint': 'duration > 0', 'status': 'VIOLATED'})
    print("\n❌ duration > 0: Violated")

# 2. Binary event
if set(df['event'].unique()) <= {0, 1}:
    constraints.append({'variable': 'event', 'constraint': 'event ∈ {0, 1}', 'status': 'OK'})
    print("✅ event ∈ {0, 1}: Satisfied")
else:
    constraints.append({'variable': 'event', 'constraint': 'event ∈ {0, 1}', 'status': 'VIOLATED'})
    print("❌ event ∈ {0, 1}: Violated")

# 3. Reasonable age
if 'edad' in df.columns:
    edad_min, edad_max = df['edad'].min(), df['edad'].max()
    if edad_min >= 18 and edad_max <= 65:
        constraints.append({'variable': 'edad', 'constraint': '18 ≤ edad ≤ 65', 'status': 'OK'})
        print(f"✅ Reasonable age [{edad_min:.0f}, {edad_max:.0f}]: Satisfied")
    else:
        constraints.append({'variable': 'edad', 'constraint': '18 ≤ edad ≤ 65', 'status': 'WARNING'})
        print(f"⚠️ Age [{edad_min:.0f}, {edad_max:.0f}]: Check extreme values")

# 4. Soft skills in [0, 1]
hab_cols = [c for c in df.columns if c.startswith('hab_')]
hab_violations = []
for h in hab_cols:
    if df[h].min() >= 0 and df[h].max() <= 1:
        constraints.append({'variable': h, 'constraint': '0 ≤ value ≤ 1', 'status': 'OK'})
    else:
        constraints.append({'variable': h, 'constraint': '0 ≤ value ≤ 1', 'status': 'WARNING'})
        hab_violations.append(h)

# Report the actual result of the check instead of printing success unconditionally
if not hab_violations:
    print(f"✅ Soft skills normalized [0,1]: {len(hab_cols)} variables")
else:
    print(f"⚠️ Soft skills outside [0,1]: {hab_violations}")

# 5. Binary tech skills
tech_cols = [c for c in df.columns if c.startswith('tech_')]
tech_binary_ok = all(set(df[c].unique()) <= {0, 1} for c in tech_cols)
if tech_binary_ok:
    constraints.append({'variable': 'tech_*', 'constraint': 'value ∈ {0, 1}', 'status': 'OK'})
    print(f"✅ Tech skills binary: {len(tech_cols)} variables")
else:
    constraints.append({'variable': 'tech_*', 'constraint': 'value ∈ {0, 1}', 'status': 'WARNING'})
    print("⚠️ Tech skills: Some values are not binary")

🔒 DOMAIN CONSTRAINTS DETECTED

✅ duration > 0: Satisfied
✅ event ∈ {0, 1}: Satisfied
✅ Reasonable age [21, 40]: Satisfied
✅ Soft skills normalized [0,1]: 7 variables
✅ Tech skills binary: 52 variables
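
The range checks above can be written once and reused across column groups. A small sketch, assuming a hypothetical helper `check_range` and toy data; note that `Series.between` is inclusive on both ends by default and evaluates to `False` for missing values, so columns containing NaN are flagged too:

```python
import pandas as pd

def check_range(frame: pd.DataFrame, columns, low, high):
    """Hypothetical helper: return the columns with any value outside [low, high]."""
    return [c for c in columns if not frame[c].between(low, high).all()]

# Toy data for illustration only (not the real dataset)
toy = pd.DataFrame({
    "hab_1": [0.0, 0.5, 1.0],
    "hab_2": [0.2, 1.4, 0.9],   # 1.4 violates the [0, 1] constraint
})
violations = check_range(toy, ["hab_1", "hab_2"], 0, 1)
print(violations)
```

Returning the offending column names, rather than a single boolean, makes it possible to report which variables need attention.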


In [6]:
# ==============================================================================
# PII DETECTION (Personally Identifiable Information)
# ==============================================================================

print("\n" + "=" * 70)
print("🔐 PII ANALYSIS (Privacy)")
print("=" * 70)

pii_candidates = []

# Look for columns whose names suggest PII.
# Caveat: this is substring matching, so a short keyword such as 'id'
# also matches inside longer words (e.g. 'electricidad', 'calidad').
pii_keywords = ['nombre', 'email', 'telefono', 'cedula', 'id', 'direccion',
                'name', 'mail', 'phone', 'address', 'ssn']

for col in df.columns:
    col_lower = col.lower()
    if any(kw in col_lower for kw in pii_keywords):
        pii_candidates.append(col)

if pii_candidates:
    print(f"\n⚠️ Possible PII columns detected: {pii_candidates}")
else:
    print("\n✅ No obvious PII columns detected")
    print("   (Based on column names only; values are not inspected)")


🔐 PII ANALYSIS (Privacy)

⚠️ Possible PII columns detected: ['tech_ciberseguridad', 'tech_electricidad', 'tech_calidad_energia', 'tech_mecanica_fluidos', 'tech_gestion_calidad', 'tech_hidraulica_sanitaria']
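
All six flagged columns contain the substring `id` (for example `electricidad`, `calidad`, `fluidos`), so they appear to be false positives of substring matching rather than identifiers. A sketch of one alternative, assuming a hypothetical helper `find_pii_columns` that compares whole name tokens (split on non-alphanumeric characters) against the keyword list; `user_id` and `Email` are made-up column names for illustration:

```python
import re

PII_KEYWORDS = {'nombre', 'email', 'telefono', 'cedula', 'id', 'direccion',
                'name', 'mail', 'phone', 'address', 'ssn'}

def find_pii_columns(columns, keywords=PII_KEYWORDS):
    """Flag a column only when one of its name tokens equals a keyword."""
    flagged = []
    for col in columns:
        # 'tech_calidad_energia' -> ['tech', 'calidad', 'energia']
        tokens = re.split(r"[^a-z0-9]+", col.lower())
        if any(tok in keywords for tok in tokens):
            flagged.append(col)
    return flagged

cols = ['tech_ciberseguridad', 'tech_electricidad', 'tech_calidad_energia',
        'user_id', 'Email']  # last two are hypothetical
flagged = find_pii_columns(cols)
print(flagged)
```

The trade-off is that token matching misses concatenated names such as `userid`, so it reduces false positives at the cost of some recall.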


---
## 5. Correlation Analysis

In [7]:
# ==============================================================================
# CORRELATIONS WITH THE TARGET
# ==============================================================================

print("=" * 70)
print("📈 CORRELATIONS WITH EVENT")
print("=" * 70)

# Numeric columns only.
# Pearson correlation of each feature with the event indicator; censoring is not accounted for.
numeric_cols = df.select_dtypes(include=[np.number]).columns
corrs = df[numeric_cols].corr()['event'].drop(['event', 'duration']).sort_values(key=abs, ascending=False)

print("\nTop 10 correlations with event:")
for i, (col, corr) in enumerate(corrs.head(10).items(), 1):
    star = "⭐" if abs(corr) > 0.1 else ""
    print(f"   {i:2d}. {col:30s}: {corr:+.4f} {star}")

print(f"\n📌 Variables with |r| > 0.1: {sum(abs(corrs) > 0.1)}")
print(f"📌 Variables with |r| > 0.2: {sum(abs(corrs) > 0.2)}")

# Duration-event correlation
corr_de = df[['duration', 'event']].corr().iloc[0,1]
print(f"\n🔗 Duration-event correlation: {corr_de:.4f}")

📈 CORRELATIONS WITH EVENT

Top 10 correlations with event:
    1. tech_sistemas_operativos      : +0.1697 ⭐
    2. hab_1                         : +0.1511 ⭐
    3. tech_geotecnia                : -0.1440 ⭐
    4. tech_gestion_empresarial      : +0.1402 ⭐
    5. tech_gestion_calidad          : +0.1298 ⭐
    6. tech_programacion             : +0.1290 ⭐
    7. tech_mecanica_fluidos         : -0.1229 ⭐
    8. tech_electricidad             : +0.1213 ⭐
    9. hab_7                         : +0.1159 ⭐
   10. tech_algoritmos               : +0.1144 ⭐

📌 Variables with |r| > 0.1: 17
📌 Variables with |r| > 0.2: 0

🔗 Duration-event correlation: 0.5040
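
Pearson correlation with `event` ignores `duration` and treats censored rows as if the event never happened. A censoring-aware way to rank features one at a time is Harrell's concordance index, which counts how often the subject with the earlier observed event also has the higher score. The sketch below is a simplified version (pairs with tied times are skipped); `concordance_index` is an illustrative name and the data are toy values, not the notebook's:

```python
import numpy as np

def concordance_index(durations, events, risk_scores):
    """Simplified Harrell's C: fraction of comparable pairs ordered correctly."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk_scores, dtype=float)

    # Pair (i, j) is comparable when i had the event strictly before j's observed time
    comparable = events[:, None] & (durations[:, None] < durations[None, :])
    n_pairs = comparable.sum()
    if n_pairs == 0:
        return float("nan")

    diff = risk[:, None] - risk[None, :]
    concordant = np.sum(comparable & (diff > 0))   # earlier event has higher risk
    tied = np.sum(comparable & (diff == 0))        # ties in score count as half
    return (concordant + 0.5 * tied) / n_pairs

# Toy data for illustration only
toy_durations = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
print(concordance_index(toy_durations, toy_events, [4, 3, 2, 1]))  # perfectly ordered
print(concordance_index(toy_durations, toy_events, [1, 1, 1, 1]))  # uninformative score
```

A value near 0.5 indicates a score no better than chance, and values well below 0.5 indicate an inverse relationship. Passing a single feature column as `risk_scores` gives a univariate ranking that respects censoring.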


---
## 6. Modeling Problem Diagnosis

In [8]:
# ==============================================================================
# DIAGNOSIS: WHY DOES THE MODEL FAIL?
# ==============================================================================

print("=" * 70)
print("🔬 DIAGNOSIS: WHY DO THE MODELS FAIL?")
print("=" * 70)

problems = []

# 1. Very low correlations
max_corr = abs(corrs).max()
if max_corr < 0.2:
    problems.append("🚨 CRITICAL: No feature has |r| > 0.2 with the target")
    print("\n1. PROBLEM: Very low correlations")
    print(f"   Maximum correlation with event: {max_corr:.4f}")
    print("   → No single feature has a strong linear association with event")

# 2. High duration-event correlation
if abs(corr_de) > 0.4:
    problems.append("⚠️ High duration-event correlation")
    print(f"\n2. OBSERVATION: High duration-event correlation ({corr_de:.4f})")

# 3. Small dataset
if len(df) < 500:
    problems.append("⚠️ Small dataset (N < 500)")
    print(f"\n3. LIMITATION: Small dataset (N={len(df)})")
    print("   → Hard to detect weak patterns")

# 4. High dimensionality
n_features = len([c for c in df.columns if c not in ['event', 'duration']])
if n_features > len(df) / 5:
    problems.append("⚠️ High relative dimensionality")
    print("\n4. PROBLEM: High dimensionality")
    print(f"   Features: {n_features}, N/p ratio: {len(df)/n_features:.1f}")

# Summary
print(f"\n📋 PROBLEM SUMMARY: {len(problems)}")
for p in problems:
    print(f"   • {p}")

🔬 DIAGNOSIS: WHY DO THE MODELS FAIL?

1. PROBLEM: Very low correlations
   Maximum correlation with event: 0.1697
   → No single feature has a strong linear association with event

2. OBSERVATION: High duration-event correlation (0.5040)

3. LIMITATION: Small dataset (N=296)
   → Hard to detect weak patterns

4. PROBLEM: High dimensionality
   Features: 61, N/p ratio: 4.9

📋 PROBLEM SUMMARY: 4
   • 🚨 CRITICAL: No feature has |r| > 0.2 with the target
   • ⚠️ High duration-event correlation
   • ⚠️ Small dataset (N < 500)
   • ⚠️ High relative dimensionality
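
For survival models the effective sample size is driven by the number of observed events rather than the number of rows, so the N/p ratio above is optimistic. A quick sketch of the events-per-variable (EPV) figure, using the 135 events and 61 features reported in the outputs above; `events_per_variable` is a hypothetical helper, and the often-quoted threshold of about 10 EPV is a rule of thumb, not a hard limit:

```python
def events_per_variable(n_events: int, n_features: int) -> float:
    """Hypothetical helper: observed events divided by candidate predictors."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_events / n_features

# Values taken from the notebook outputs: 135 events, 61 features
epv = events_per_variable(135, 61)
print(f"Events per variable: {epv:.2f}")
```

With roughly two events per candidate feature, some form of feature selection or regularization would likely be needed before fitting a multivariable model.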


In [9]:
# ==============================================================================
# EXPORT JSON REPORT
# ==============================================================================

report = {
    'n_rows': int(len(df)),
    'n_cols': int(len(df.columns)),
    'censoring_rate': float(censoring_rate),
    'event_rate': float(1 - censoring_rate),
    'continuous_cols': continuous_cols,
    'categorical_cols': categorical_cols,
    'binary_cols': binary_cols,
    'constraints_detected': constraints,
    'max_correlation_with_event': float(max_corr),
    'duration_range': [float(df['duration'].min()), float(df['duration'].max())],
    'duration_unique_values': int(df['duration'].nunique()),
    'pii_detected': pii_candidates,
    'problems_identified': problems
}

# Write with an explicit encoding so the file is identical across platforms
with open(DATA_DIR / 'diagnosis_report.json', 'w', encoding='utf-8') as f:
    json.dump(report, f, indent=2)

print("\n" + "=" * 70)
print("📁 REPORT SAVED")
print("=" * 70)
print(f"\nFile: {DATA_DIR / 'diagnosis_report.json'}")
print("\n📊 Summary:")
print(f"   • Rows: {report['n_rows']}")
print(f"   • Columns: {report['n_cols']}")
print(f"   • Censoring rate: {report['censoring_rate']:.2%}")
print(f"   • Continuous vars: {len(report['continuous_cols'])}")
print(f"   • Categorical vars: {len(report['categorical_cols'])}")
print(f"   • Max correlation: {report['max_correlation_with_event']:.4f}")


📁 REPORT SAVED

File: data/processed/diagnosis_report.json

📊 Summary:
   • Rows: 296
   • Columns: 63
   • Censoring rate: 54.39%
   • Continuous vars: 1
   • Categorical vars: 60
   • Max correlation: 0.1697
