# 🔒 Privacy Audit: Memorization Detection

**Version:** v3_experimental  
**Role:** Data Privacy Officer  
**Date:** 2026-01-08

---

## Objective

Assess privacy risks in the generated synthetic data, specifically:
1. **Exact copies**: Did the synthesizer memorize real records?
2. **DCR distance**: How close are the synthetic records to the real ones?

## Metrics

- **DCR (Distance to Closest Record)**: For each synthetic record, the minimum distance to any real record (that is, the distance to its nearest real neighbor)
- **Exact copies**: Number of synthetic rows identical to a real row
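
As a minimal illustration of the DCR definition above, the sketch below computes it by brute force on toy 2-D points. The helper name `dcr` and the sample points are hypothetical and chosen only for illustration; the audit itself works on standardized feature vectors from the real and synthetic tables.

```python
import math

def dcr(synthetic, real):
    """For each synthetic point, return the Euclidean distance to its nearest real point."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]

# Toy data (illustrative only, not taken from the audited dataset)
real_points = [(0.0, 0.0), (3.0, 4.0)]
synthetic_points = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

distances = dcr(synthetic_points, real_points)
print(distances)
```

The first synthetic point coincides with a real point, so its DCR is 0; this is the case the exact-copy check is meant to catch.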

## Rationale

> Avoiding memorization in generative models trained on small datasets is critical to preserving the privacy of the original participants.

---

In [None]:
# ==============================================================================
# CONFIGURATION AND DEPENDENCIES
# ==============================================================================

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import json
from pathlib import Path
from datetime import datetime

# Visualization
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

# Distances
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler

# Paths
DATA_DIR = Path("../../v2/data/processed")
SYNTH_GLOBAL_PATH = Path("../04_synthetic_sdv/synthetic_data_copula.parquet")
SYNTH_ADVANCED_PATH = Path("../05_synthetic_advanced/synthetic_data_advanced.parquet")
OUTPUT_DIR = Path(".")

print("✅ Dependencies loaded")
print("   Metrics: DCR, Exact Copies")

---
## 1. Data Loading

In [None]:
# ==============================================================================
# DATA LOADING
# ==============================================================================

# Real data
real_df = pd.read_parquet(DATA_DIR / "train_final.parquet")

# Drop zero-variance columns (they carry no information for distance computations)
zero_var_cols = [col for col in real_df.columns if real_df[col].nunique() <= 1]
real_df = real_df.drop(columns=zero_var_cols, errors='ignore')

# Synthetic data (both methods)
synth_global = pd.read_parquet(SYNTH_GLOBAL_PATH)
synth_advanced = pd.read_parquet(SYNTH_ADVANCED_PATH)

print("📊 Data loaded:")
print(f"   Real:               {real_df.shape}")
print(f"   Synthetic Global:   {synth_global.shape}")
print(f"   Synthetic Advanced: {synth_advanced.shape}")

# Keep only the columns shared by all three datasets, in the real data's column order
# (a set intersection would give a different column order from run to run)
common_cols = [
    col for col in real_df.columns
    if col in synth_global.columns and col in synth_advanced.columns
]
print(f"   Common columns: {len(common_cols)}")

---
## 2. Exact Copy Detection
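
One caveat when rows are compared as tuples, as the cell below does: `NaN` never compares equal to `NaN`, so a synthetic row that reproduces a real row containing a missing value would not be counted as a copy. Whether this matters depends on whether the audited tables contain missing values, which is not stated here. The sketch below shows the effect in plain Python and one possible workaround; `normalize_row` is a hypothetical helper, not part of the notebook.

```python
import math

def normalize_row(row):
    """Replace NaN with None so that rows with missing values can compare equal."""
    return tuple(None if isinstance(v, float) and math.isnan(v) else v for v in row)

real_row = (1.0, float("nan"))
synth_row = (1.0, float("nan"))  # same values, but a distinct NaN object

# NaN != NaN, so the raw tuples do not match
raw_match = synth_row in {real_row}

# After normalization the two rows compare equal
normalized_match = normalize_row(synth_row) in {normalize_row(real_row)}

print(raw_match, normalized_match)
```

If the data has no missing values, the tuple comparison used in this section is unaffected.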

In [None]:
# ==============================================================================
# FUNCTION TO DETECT EXACT COPIES
# ==============================================================================

def count_exact_copies(real_df, synth_df, columns):
    """
    Count how many synthetic rows are exact copies of real rows.
    Returns (number of exact copies, total number of synthetic rows).
    """
    # Round to avoid float precision mismatches
    real_rounded = real_df[columns].round(6)
    synth_rounded = synth_df[columns].round(6)

    # Convert rows to tuples for comparison
    real_tuples = set(map(tuple, real_rounded.values))
    synth_tuples = list(map(tuple, synth_rounded.values))

    # Count exact matches
    exact_copies = sum(1 for t in synth_tuples if t in real_tuples)

    return exact_copies, len(synth_df)

print("🔍 Searching for exact copies...")

# Analyze Global
copies_global, total_global = count_exact_copies(real_df, synth_global, common_cols)
pct_global = copies_global / total_global * 100

# Analyze Advanced
copies_advanced, total_advanced = count_exact_copies(real_df, synth_advanced, common_cols)
pct_advanced = copies_advanced / total_advanced * 100

print("\n📊 Exact Copy Results:")
print(f"   Global:   {copies_global}/{total_global} ({pct_global:.2f}%)")
print(f"   Advanced: {copies_advanced}/{total_advanced} ({pct_advanced:.2f}%)")

# Risk evaluation
if copies_global > 0 or copies_advanced > 0:
    print("\n⚠️ ALERT: Exact copies detected")
else:
    print("\n✅ No exact copies detected - low risk of memorization")

---
## 3. DCR (Distance to Closest Record) Computation
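
The DCR in this section is computed on standardized features. A small sketch of why that matters: with raw values, a large-scale column can dominate the distance and change which real record counts as "closest". The `(age, income)` records and the helper names below are hypothetical; the scaling mirrors what `StandardScaler` does (per-column mean and population standard deviation, fit on the real data only).

```python
import math
import statistics

def standardize(rows, means, stds):
    """Z-score each column using the given per-column means and standard deviations."""
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_index(point, candidates):
    """Index of the candidate closest to `point` in Euclidean distance."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

# Hypothetical (age, income) records, for illustration only
real = [(30.0, 50000.0), (60.0, 51000.0)]
synth = (31.0, 50600.0)

# Fit the scaling on the real data only (population std, as StandardScaler uses)
columns = list(zip(*real))
means = [statistics.fmean(c) for c in columns]
stds = [statistics.pstdev(c) for c in columns]

nearest_raw = nearest_index(synth, real)
nearest_scaled = nearest_index(
    standardize([synth], means, stds)[0],
    standardize(real, means, stds),
)
print(nearest_raw, nearest_scaled)
```

On raw values the income column decides the nearest neighbor; after standardization the near-identical age does, so the two approaches pick different real records.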

In [None]:
# ==============================================================================
# DCR COMPUTATION
# ==============================================================================

def calculate_dcr(real_df, synth_df, columns):
    """
    Compute, for each synthetic record, the distance to the closest real record.
    Uses Euclidean distance on standardized (z-scored) features.
    Assumes every column in `columns` is numeric.
    """
    # Prepare data
    real_vals = real_df[columns].values.astype(float)
    synth_vals = synth_df[columns].values.astype(float)

    # Standardize (important for variables on different scales);
    # the scaler is fit on the real data and reused for the synthetic data
    scaler = StandardScaler()
    real_scaled = scaler.fit_transform(real_vals)
    synth_scaled = scaler.transform(synth_vals)

    # Compute the full (n_synth x n_real) distance matrix
    distances = cdist(synth_scaled, real_scaled, metric='euclidean')

    # DCR = minimum distance to any real record
    dcr = distances.min(axis=1)

    return dcr

print("📏 Computing DCR (Distance to Closest Record)...")

# Compute for both methods
dcr_global = calculate_dcr(real_df, synth_global, common_cols)
dcr_advanced = calculate_dcr(real_df, synth_advanced, common_cols)

print("\n📊 DCR Statistics:")
print(f"\n   Global:")
print(f"      Min:    {dcr_global.min():.4f}")
print(f"      Max:    {dcr_global.max():.4f}")
print(f"      Mean:   {dcr_global.mean():.4f}")
print(f"      Median: {np.median(dcr_global):.4f}")
print(f"      Std:    {dcr_global.std():.4f}")

print(f"\n   Advanced:")
print(f"      Min:    {dcr_advanced.min():.4f}")
print(f"      Max:    {dcr_advanced.max():.4f}")
print(f"      Mean:   {dcr_advanced.mean():.4f}")
print(f"      Median: {np.median(dcr_advanced):.4f}")
print(f"      Std:    {dcr_advanced.std():.4f}")

---
## 4. Visualization: DCR Histogram

In [None]:
# ==============================================================================
# DCR HISTOGRAM
# ==============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Global histogram
ax1 = axes[0]
ax1.hist(dcr_global, bins=50, color='steelblue', edgecolor='white', alpha=0.7)
ax1.axvline(dcr_global.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {dcr_global.mean():.2f}')
ax1.axvline(0, color='darkred', linestyle='-', linewidth=2, alpha=0.5, label='Zero (Exact Copy)')
ax1.set_xlabel('DCR (Distance to Closest Real Record)', fontsize=11)
ax1.set_ylabel('Frequency', fontsize=11)
ax1.set_title('DCR Distribution - Synthetic Global', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Advanced histogram
ax2 = axes[1]
ax2.hist(dcr_advanced, bins=50, color='darkorange', edgecolor='white', alpha=0.7)
ax2.axvline(dcr_advanced.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {dcr_advanced.mean():.2f}')
ax2.axvline(0, color='darkred', linestyle='-', linewidth=2, alpha=0.5, label='Zero (Exact Copy)')
ax2.set_xlabel('DCR (Distance to Closest Real Record)', fontsize=11)
ax2.set_ylabel('Frequency', fontsize=11)
ax2.set_title('DCR Distribution - Synthetic Advanced', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'dcr_histogram.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Histogram saved: dcr_histogram.png")

In [None]:
# ==============================================================================
# OVERLAY COMPARISON
# ==============================================================================

fig, ax = plt.subplots(figsize=(10, 6))

# Use shared bin edges so the two histograms are directly comparable
shared_bins = np.histogram_bin_edges(np.concatenate([dcr_global, dcr_advanced]), bins=50)
ax.hist(dcr_global, bins=shared_bins, color='steelblue', alpha=0.5, label='Global', edgecolor='white')
ax.hist(dcr_advanced, bins=shared_bins, color='darkorange', alpha=0.5, label='Advanced', edgecolor='white')

ax.axvline(dcr_global.mean(), color='steelblue', linestyle='--', linewidth=2)
ax.axvline(dcr_advanced.mean(), color='darkorange', linestyle='--', linewidth=2)

ax.set_xlabel('DCR (Distance to Closest Real Record)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('DCR Comparison: Global vs Advanced', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'dcr_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Comparison saved: dcr_comparison.png")

---
## 5. Risk Analysis
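
The risk levels in this section are half-open intervals on the DCR value: high below 0.5, medium from 0.5 up to (but not including) 1.0, and low from 1.0 upward. As a sketch of that bucketing for a single record, the snippet below maps values to levels with `bisect`; the names `THRESHOLDS`, `LEVELS`, and `risk_level` and the sample values are hypothetical, and only the two cut-offs come from the notebook.

```python
from bisect import bisect_right
from collections import Counter

THRESHOLDS = [0.5, 1.0]              # same cut-offs as the audit cell in this section
LEVELS = ["high", "medium", "low"]

def risk_level(dcr_value):
    """Map one DCR value to a level: high < 0.5 <= medium < 1.0 <= low."""
    return LEVELS[bisect_right(THRESHOLDS, dcr_value)]

# Hypothetical DCR values, including both boundary cases
sample_dcr = [0.0, 0.49, 0.5, 0.99, 1.0, 2.3]
counts = Counter(risk_level(d) for d in sample_dcr)
print(counts)
```

A value exactly on a threshold falls into the less severe bucket, which matches the `<` and `>=` comparisons used in the audit cell.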

In [None]:
# ==============================================================================
# PRIVACY RISK ANALYSIS
# ==============================================================================

print("🔒 PRIVACY RISK ANALYSIS")
print("="*60)

# Risk thresholds on the standardized DCR: a very low DCR suggests possible memorization
THRESHOLD_HIGH_RISK = 0.5    # records very close to a real record
THRESHOLD_MEDIUM_RISK = 1.0  # records moderately close to a real record

# Count records at each risk level
high_risk_global = (dcr_global < THRESHOLD_HIGH_RISK).sum()
high_risk_advanced = (dcr_advanced < THRESHOLD_HIGH_RISK).sum()

medium_risk_global = ((dcr_global >= THRESHOLD_HIGH_RISK) & (dcr_global < THRESHOLD_MEDIUM_RISK)).sum()
medium_risk_advanced = ((dcr_advanced >= THRESHOLD_HIGH_RISK) & (dcr_advanced < THRESHOLD_MEDIUM_RISK)).sum()

print("\n📊 Records by Risk Level:")
print(f"\n   GLOBAL (n={len(dcr_global)}):")
print(f"      🔴 High (DCR < {THRESHOLD_HIGH_RISK}):   {high_risk_global} ({high_risk_global/len(dcr_global)*100:.1f}%)")
print(f"      🟡 Medium ({THRESHOLD_HIGH_RISK} <= DCR < {THRESHOLD_MEDIUM_RISK}): {medium_risk_global} ({medium_risk_global/len(dcr_global)*100:.1f}%)")
print(f"      🟢 Low (DCR >= {THRESHOLD_MEDIUM_RISK}): {len(dcr_global) - high_risk_global - medium_risk_global}")

print(f"\n   ADVANCED (n={len(dcr_advanced)}):")
print(f"      🔴 High (DCR < {THRESHOLD_HIGH_RISK}):   {high_risk_advanced} ({high_risk_advanced/len(dcr_advanced)*100:.1f}%)")
print(f"      🟡 Medium ({THRESHOLD_HIGH_RISK} <= DCR < {THRESHOLD_MEDIUM_RISK}): {medium_risk_advanced} ({medium_risk_advanced/len(dcr_advanced)*100:.1f}%)")
print(f"      🟢 Low (DCR >= {THRESHOLD_MEDIUM_RISK}): {len(dcr_advanced) - high_risk_advanced - medium_risk_advanced}")

In [None]:
# ==============================================================================
# FINAL RISK DETERMINATION
# ==============================================================================

print("\n" + "="*60)
print("📋 PRIVACY CONCLUSION")
print("="*60)

# Criteria for dataset-level privacy risk
def assess_risk(exact_copies, high_risk_count, total):
    if exact_copies > 0:
        return "CRITICAL", "🔴"
    elif high_risk_count / total > 0.05:  # >5% of records very close to a real record
        return "HIGH", "🟠"
    elif high_risk_count / total > 0.01:  # >1%
        return "MEDIUM", "🟡"
    else:
        return "LOW", "🟢"

risk_global, emoji_global = assess_risk(copies_global, high_risk_global, len(dcr_global))
risk_advanced, emoji_advanced = assess_risk(copies_advanced, high_risk_advanced, len(dcr_advanced))

print(f"\n   Synthetic Global:   {emoji_global} {risk_global} risk")
print(f"   Synthetic Advanced: {emoji_advanced} {risk_advanced} risk")

if copies_global == 0 and copies_advanced == 0:
    print("\n✅ APPROVED: The synthetic data shows no evident memorization (no exact copies).")
    print("   The data can be used to train models; also check the DCR risk levels above.")
else:
    print("\n⚠️ NEEDS REVIEW: Possible memorization issues were detected.")

---
## 6. Save Report

In [None]:
# ==============================================================================
# SAVE PRIVACY REPORT
# ==============================================================================

privacy_report = {
    "metadata": {
        "date": datetime.now().isoformat(),
        "audit_type": "Privacy and Memorization Check",
        "metrics": ["Exact Copies", "DCR (Distance to Closest Record)"]
    },
    "exact_copies": {
        "global": {
            "count": int(copies_global),
            "total": int(total_global),
            "percentage": float(pct_global)
        },
        "advanced": {
            "count": int(copies_advanced),
            "total": int(total_advanced),
            "percentage": float(pct_advanced)
        }
    },
    "dcr_statistics": {
        "global": {
            "min": float(dcr_global.min()),
            "max": float(dcr_global.max()),
            "mean": float(dcr_global.mean()),
            "median": float(np.median(dcr_global)),
            "std": float(dcr_global.std())
        },
        "advanced": {
            "min": float(dcr_advanced.min()),
            "max": float(dcr_advanced.max()),
            "mean": float(dcr_advanced.mean()),
            "median": float(np.median(dcr_advanced)),
            "std": float(dcr_advanced.std())
        }
    },
    "risk_assessment": {
        "global": {
            "high_risk_count": int(high_risk_global),
            "medium_risk_count": int(medium_risk_global),
            "risk_level": risk_global
        },
        "advanced": {
            "high_risk_count": int(high_risk_advanced),
            "medium_risk_count": int(medium_risk_advanced),
            "risk_level": risk_advanced
        }
    },
    "conclusion": {
        "memorization_detected": copies_global > 0 or copies_advanced > 0,
        "approved_for_use": copies_global == 0 and copies_advanced == 0
    },
    "files_generated": [
        "dcr_histogram.png",
        "dcr_comparison.png",
        "privacy_report.json"
    ]
}

with open(OUTPUT_DIR / "privacy_report.json", 'w') as f:
    json.dump(privacy_report, f, indent=2)

print("\n💾 Report saved: privacy_report.json")
print("\n" + "="*50)
print("🎉 PRIVACY AUDIT COMPLETED")
print("="*50)

---
## Summary

### Metrics Evaluated

| Metric | Global | Advanced |
|--------|--------|----------|
| Exact copies | ? | ? |
| Minimum DCR | ? | ? |
| Mean DCR | ? | ? |
| Risk | ? | ? |
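
The placeholders in the table above could be filled from `privacy_report.json` after a run. The sketch below builds the four table rows from a report dictionary with the same key layout as the one saved in section 6; `summary_rows` is a hypothetical helper, and the numbers in `example_report` are placeholders for illustration, not results of this audit.

```python
def summary_rows(report):
    """Build the Markdown rows of the summary table from a privacy report dict."""
    def pair(section, key, fmt="{}"):
        # One formatted value per method, in table column order
        return [fmt.format(report[section][m][key]) for m in ("global", "advanced")]

    rows = [
        ["Exact copies", *pair("exact_copies", "count")],
        ["Minimum DCR", *pair("dcr_statistics", "min", "{:.4f}")],
        ["Mean DCR", *pair("dcr_statistics", "mean", "{:.4f}")],
        ["Risk", *pair("risk_assessment", "risk_level")],
    ]
    return ["| " + " | ".join(row) + " |" for row in rows]

# Placeholder values for illustration only; these are NOT results of the audit
example_report = {
    "exact_copies": {"global": {"count": 0}, "advanced": {"count": 0}},
    "dcr_statistics": {
        "global": {"min": 0.25, "mean": 1.5},
        "advanced": {"min": 0.75, "mean": 2.0},
    },
    "risk_assessment": {
        "global": {"risk_level": "LOW"},
        "advanced": {"risk_level": "LOW"},
    },
}
lines = summary_rows(example_report)
print("\n".join(lines))
```

In practice the dictionary would come from `json.load` on the saved report file rather than from a literal.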

### DCR Interpretation

- **DCR = 0**: Exact copy (CRITICAL)
- **0 < DCR < 0.5**: Very close (HIGH risk)
- **0.5 <= DCR < 1.0**: Close (MEDIUM risk)
- **DCR >= 1.0**: Well separated (LOW risk)
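
These cut-offs are in standardized Euclidean units, so how strict they are depends on how many columns enter the distance. As a rough sketch under a simplifying assumption (two records that differ by the same amount in every feature), the distance grows with the square root of the number of features. The function name and the 0.2 gap below are hypothetical.

```python
import math

def distance_for_uniform_gap(gap, n_features):
    """Euclidean distance between two records that differ by `gap` in every feature."""
    return math.sqrt(n_features * gap ** 2)

# The same per-feature gap of 0.2 standard deviations at different dimensionalities
for n_features in (4, 25, 100):
    d = distance_for_uniform_gap(0.2, n_features)
    print(n_features, round(d, 3))
```

With 4 features this gap gives a distance of about 0.4, which falls in the high-risk band, while with 25 features the same gap gives about 1.0, which falls in the low-risk band. The fixed thresholds are therefore best read relative to the number of common columns in the audit.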

### Next Step

If approved, proceed to TSTR (Train on Synthetic, Test on Real) training with the validated synthetic data.

---