# 🧬 Synthetic Data Generation with SDV

**Version:** v3_experimental  
**Role:** Data Engineer (Synthetic Data)  
**Date:** 2026-01-08

---

## Objective

Generate synthetic tabular data with SDV's **GaussianCopulaSynthesizer** to augment the training dataset.

## Method

**GaussianCopula** is the recommended method for small datasets (<1000 rows) because:
- It does not need much data to learn the column distributions
- It captures correlations between variables
- It is stable and reproducible

## Constraints

> ⚠️ **ZERO LEAKAGE**: Only the Train file is used. The Test set stays sealed.

---

In [None]:
# ==============================================================================
# CONFIGURATION AND DEPENDENCIES
# ==============================================================================

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import json
from pathlib import Path
from datetime import datetime

# SDV
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import evaluate_quality

# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Paths
DATA_DIR = Path("../../v2/data/processed")
OUTPUT_DIR = Path(".")

# Number of synthetic rows to generate
N_SYNTHETIC = 1000

print("✅ Dependencies loaded")
print("   SDV GaussianCopula")
print(f"   Random State: {RANDOM_STATE}")
print(f"   N Synthetic: {N_SYNTHETIC}")

---
## 1. Loading the Training Data

In [None]:
# ==============================================================================
# DATA LOADING (TRAIN ONLY - ZERO LEAKAGE)
# ==============================================================================

train_df = pd.read_parquet(DATA_DIR / "train_final.parquet")

print("📊 Data loaded (TRAIN ONLY):")
print(f"   Shape: {train_df.shape}")
print(f"   Columns: {len(train_df.columns)}")

# Find zero-variance columns (at most one distinct value) to exclude
zero_var_cols = [col for col in train_df.columns if train_df[col].nunique() <= 1]

print(f"\n⚠️ Zero-variance columns detected: {zero_var_cols}")
print("   (They will be excluded from the synthesizer)")

# Build the clean dataframe used for synthesis
train_clean = train_df.drop(columns=zero_var_cols, errors='ignore')
print(f"\n📋 Dataset for synthesis: {train_clean.shape}")

---
## 2. SDV Metadata Definition

In [None]:
# ==============================================================================
# METADATA DEFINITION
# ==============================================================================

print("📝 Defining metadata for SDV...")

# Detect the metadata automatically from the dataframe
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(train_clean)

# Show the detected types
print("\n📋 Detected types:")
for col, info in metadata.columns.items():
    sdtype = info.get('sdtype', 'unknown')
    print(f"   {col}: {sdtype}")

# Validate the metadata
metadata.validate()
print("\n✅ Metadata validated successfully")

In [None]:
# ==============================================================================
# MANUAL METADATA ADJUSTMENTS (if needed)
# ==============================================================================

# Make sure 'event' is categorical/boolean
if 'event' in metadata.columns:
    metadata.update_column('event', sdtype='categorical')
    print("✏️ 'event' set to categorical")

# Make sure 'duration' is numerical
if 'duration' in metadata.columns:
    metadata.update_column('duration', sdtype='numerical')
    print("✏️ 'duration' set to numerical")

# Make sure the binary columns (tech_* and genero_m) are categorical
binary_cols = [c for c in train_clean.columns if c.startswith('tech_') or c == 'genero_m']
for col in binary_cols:
    if col in metadata.columns:
        metadata.update_column(col, sdtype='categorical')

print(f"✏️ {len(binary_cols)} binary columns set to categorical")

# Validate again
metadata.validate()
print("\n✅ Metadata updated and validated")

---
## 3. Training the Synthesizer

In [None]:
# ==============================================================================
# CONFIGURE AND TRAIN GAUSSIANCOPULASYNTHESIZER
# ==============================================================================

print("🔧 Configuring GaussianCopulaSynthesizer...")

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    enforce_min_max_values=True,
    enforce_rounding=True,
    numerical_distributions={
        'duration': 'truncnorm',  # Truncated normal to avoid negative values
        'edad': 'truncnorm'
    },
    default_distribution='norm'
)

print("\n🏋️ Training the synthesizer...")
synthesizer.fit(train_clean)

print("\n✅ Synthesizer trained successfully")

---
## 4. Generating Synthetic Data

In [None]:
# ==============================================================================
# GENERATE SYNTHETIC DATA
# ==============================================================================

print(f"🧬 Generating {N_SYNTHETIC} synthetic rows...")

synthetic_df = synthesizer.sample(num_rows=N_SYNTHETIC)

print("\n📊 Synthetic data generated:")
print(f"   Shape: {synthetic_df.shape}")
print("\n   First rows:")
print(synthetic_df.head())

---
## 5. Post-Processing
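
The cell below snaps each `hab_*` value to the grid 0, 0.25, 0.5, 0.75, 1.0 and clips it to [0, 1]. As a minimal sketch of that rule on plain floats (the helper name `snap_to_grid` and the sample values are illustrative assumptions, not part of the notebook):

```python
def snap_to_grid(value, step=0.25, low=0.0, high=1.0):
    """Round value to the nearest multiple of step, then clamp to [low, high]."""
    snapped = round(value / step) * step
    return min(max(snapped, low), high)


# Hypothetical raw synthetic values, including two that fall outside [0, 1].
raw = [-0.10, 0.12, 0.13, 0.40, 0.62, 0.88, 1.30]
cleaned = [snap_to_grid(v) for v in raw]
print(cleaned)
```

Python's `round` resolves exact midpoints to the nearest even integer, as does the pandas `.round()` used in the notebook, so a value of exactly 0.125 snaps down to 0.0 rather than up to 0.25.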

In [None]:
# ==============================================================================
# POST-PROCESSING: ENFORCE DOMAIN CONSTRAINTS
# ==============================================================================

print("🔧 Applying post-processing...")

# 1. Duration must be > 0: clip to the smallest duration seen in Train
if 'duration' in synthetic_df.columns:
    min_duration = train_clean['duration'].min()
    # Count the values the clip will actually change
    before = (synthetic_df['duration'] < min_duration).sum()
    synthetic_df['duration'] = synthetic_df['duration'].clip(lower=min_duration)
    print(f"   ✅ duration: {before} values corrected (clipped to {min_duration:.2f})")

# 2. Event must be 0 or 1
if 'event' in synthetic_df.columns:
    synthetic_df['event'] = synthetic_df['event'].round().astype(int).clip(0, 1)
    print(f"   ✅ event: converted to binary {synthetic_df['event'].unique()}")

# 3. Age (edad) must stay within the range seen in Train
if 'edad' in synthetic_df.columns:
    min_edad = train_clean['edad'].min()
    max_edad = train_clean['edad'].max()
    synthetic_df['edad'] = synthetic_df['edad'].clip(min_edad, max_edad).round().astype(int)
    print(f"   ✅ edad: clipped to [{min_edad}, {max_edad}]")

# 4. Binary columns (tech_* and genero_m) must be 0 or 1
binary_cols = [c for c in synthetic_df.columns if c.startswith('tech_') or c == 'genero_m']
for col in binary_cols:
    synthetic_df[col] = synthetic_df[col].round().astype(int).clip(0, 1)
print(f"   ✅ {len(binary_cols)} binary columns converted to 0/1")

# 5. Skill columns hab_* must lie in [0, 1]
hab_cols = [c for c in synthetic_df.columns if c.startswith('hab_')]
for col in hab_cols:
    # Snap to the valid values: 0, 0.25, 0.5, 0.75, 1.0
    synthetic_df[col] = (synthetic_df[col] * 4).round() / 4
    synthetic_df[col] = synthetic_df[col].clip(0, 1)
print(f"   ✅ {len(hab_cols)} hab_* columns snapped to [0, 0.25, 0.5, 0.75, 1.0]")

print("\n✅ Post-processing complete")

---
## 6. Quality Validation
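
For intuition about what a quality score measures on a categorical column such as `event`, the sketch below computes one minus the total variation distance between the real and synthetic category frequencies. To my understanding this is the idea behind the SDMetrics `TVComplement` metric that feeds the report's column-shape score, but the helper `tv_complement` and the toy event lists are illustrative assumptions, not the library's code.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 - total variation distance between two categorical samples."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


# Hypothetical event columns: 30% events in the real data, 40% in the synthetic data.
real_event = [1] * 30 + [0] * 70
synth_event = [1] * 40 + [0] * 60
score = tv_complement(real_event, synth_event)
print(f"TV complement: {score:.3f}")
```

A score of 1.0 means identical category frequencies and 0.0 means no overlap at all, which makes it a stricter check than comparing the event rates by eye.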

In [None]:
# ==============================================================================
# SDV QUALITY VALIDATION
# ==============================================================================

print("📊 Evaluating the quality of the synthetic data...")

try:
    quality_report = evaluate_quality(
        real_data=train_clean,
        synthetic_data=synthetic_df,
        metadata=metadata
    )

    overall_score = quality_report.get_score()
    print(f"\n🏆 Overall Quality Score: {overall_score:.4f}")

except Exception as e:
    print(f"⚠️ Could not compute the quality score: {e}")
    overall_score = None

In [None]:
# ==============================================================================
# STATISTICAL COMPARISON
# ==============================================================================

print("📊 Statistical comparison, Real vs Synthetic:")
print("\n" + "="*70)

# Compare key statistics
comparison_cols = ['duration', 'event', 'edad']

for col in comparison_cols:
    if col in train_clean.columns and col in synthetic_df.columns:
        real_mean = train_clean[col].mean()
        synth_mean = synthetic_df[col].mean()
        real_std = train_clean[col].std()
        synth_std = synthetic_df[col].std()

        print(f"\n{col}:")
        print(f"   Real:      μ={real_mean:.3f}, σ={real_std:.3f}")
        print(f"   Synthetic: μ={synth_mean:.3f}, σ={synth_std:.3f}")
        print(f"   Δ mean:    {abs(real_mean - synth_mean):.3f}")

# Compare the event rate
if 'event' in train_clean.columns and 'event' in synthetic_df.columns:
    real_event_rate = train_clean['event'].mean()
    synth_event_rate = synthetic_df['event'].mean()
    print("\nEvent Rate:")
    print(f"   Real:      {real_event_rate:.1%}")
    print(f"   Synthetic: {synth_event_rate:.1%}")
    print(f"   Δ:         {abs(real_event_rate - synth_event_rate):.1%}")

---
## 7. Saving Results

In [None]:
# ==============================================================================
# SAVE SYNTHETIC DATA AND MODEL
# ==============================================================================

print("💾 Saving results...")

# 1. Save the synthetic data
synthetic_df.to_parquet(OUTPUT_DIR / "synthetic_data_copula.parquet", index=False)
print(f"   ✅ synthetic_data_copula.parquet ({len(synthetic_df)} rows)")

# 2. Save the synthesizer model
synthesizer.save(OUTPUT_DIR / "synthesizer_model.pkl")
print("   ✅ synthesizer_model.pkl")

# 3. Save the metadata
metadata.save_to_json(OUTPUT_DIR / "synthesizer_metadata.json")
print("   ✅ synthesizer_metadata.json")

# 4. Save the generation report
report = {
    "metadata": {
        "date": datetime.now().isoformat(),
        "method": "GaussianCopulaSynthesizer",
        "sdv_version": "1.32.0",
        "random_state": RANDOM_STATE
    },
    "input": {
        "n_real": len(train_clean),
        "n_features": len(train_clean.columns),
        "excluded_cols": zero_var_cols
    },
    "output": {
        "n_synthetic": len(synthetic_df),
        # Test against None so that a legitimate score of 0.0 is still recorded
        "quality_score": float(overall_score) if overall_score is not None else None
    },
    "statistics_comparison": {
        "duration": {
            "real_mean": float(train_clean['duration'].mean()),
            "synth_mean": float(synthetic_df['duration'].mean()),
            "real_std": float(train_clean['duration'].std()),
            "synth_std": float(synthetic_df['duration'].std())
        },
        "event_rate": {
            "real": float(train_clean['event'].mean()),
            "synthetic": float(synthetic_df['event'].mean())
        }
    },
    "files_generated": [
        "synthetic_data_copula.parquet",
        "synthesizer_model.pkl",
        "synthesizer_metadata.json"
    ]
}

with open(OUTPUT_DIR / "generation_report.json", 'w') as f:
    json.dump(report, f, indent=2)
print("   ✅ generation_report.json")

print("\n" + "="*50)
print("🎉 SYNTHETIC GENERATION COMPLETE")
print("="*50)

---
## Summary

### Generated Files

| File | Description |
|------|-------------|
| `synthetic_data_copula.parquet` | 1000 synthetic rows |
| `synthesizer_model.pkl` | Trained GaussianCopula model |
| `synthesizer_metadata.json` | SDV metadata |
| `generation_report.json` | Generation report |

### Next Step

**Prompt 5: Training with Augmented Data** - Train models on combinations of real and synthetic data.

---