# 09: Ensamblaje y Creación de Features

**Propósito:** Este es el *notebook* final del pipeline de datos. Su objetivo es tomar los tres *datasets* maestros limpios de `data/02_processed/` y unirlos en un único *dataset* analítico listo para el modelamiento (`analytical_dataset.parquet`).

**Proceso (La "Gran Fusión"):**
1.  **Cargar:** Cargar `diputados...`, `votaciones...`, `boletines...` y el archivo externo `colegios_chile.csv`.
2.  **Enriquecer Diputados:**
    * Hacer `merge` con `colegios_chile.csv` para obtener la `dependencia` (Público/Privado).
    * Calcular `antiguedad_partido_anios` usando las fechas de inicio de período y militancia.
3.  **Ensamblar:** Unir las tres tablas enriquecidas. La tabla `votaciones_master` es nuestra "tabla de hechos" (la base) que une todo.
4.  **Guardar:** Guardar el *dataset* final en `data/03_final/`.

**Dependencias:**
* `data/02_processed/diputados_periodo_master_clean.parquet`
* `data/02_processed/votaciones_master_clean.parquet`
* `data/02_processed/boletines_master_clean.parquet`
* `data/01_raw/colegios_chile.csv` (O la ruta a tu archivo de colegios)

**Salidas (Artifacts):**
* `data/03_final/analytical_dataset.parquet`

In [1]:
import pandas as pd
import numpy as np
import logging
from pathlib import Path
import sys
import json # Para parsear los ámbitos
from sklearn.preprocessing import MultiLabelBinarizer 

# --- Configurar Logging ---
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- Importar lógica personalizada de /src ---
sys.path.append('../') 
try:
    # Necesitamos la función de normalizar texto para el merge de colegios
    from src.common_utils import normalize_string
    from src.feature_engineering_utils import encode_candidates, find_dependencia_fast
except ImportError as e:
    logging.error(f"ERROR: No se pudo importar desde /src. {e}")
    raise

2025-11-18 15:23:51,093 - INFO - PyTorch version 2.5.1+cu121 available.
2025-11-18 15:23:51,881 - INFO - Use pytorch device_name: cuda:0
2025-11-18 15:23:51,881 - INFO - Load pretrained SentenceTransformer: paraphrase-multilingual-MiniLM-L12-v2


In [2]:
# --- 1. Configuración de Rutas y Constantes ---
ROOT = Path.cwd().parent
DATA_DIR_PROCESSED = ROOT / "data" / "02_processed"
DATA_DIR_FINAL = ROOT / "data" / "03_final"

# (Directorio de donde cargas tu archivo de colegios)
DATA_DIR_RAW = ROOT / "data" / "01_raw" 

# Asegurarse que el directorio de salida exista
DATA_DIR_FINAL.mkdir(parents=True, exist_ok=True)

# --- Archivos de Entrada ---
DIPUTADOS_FILE = DATA_DIR_PROCESSED / "diputados_master_clean.parquet"
VOTACIONES_FILE = DATA_DIR_PROCESSED / "votaciones_master_clean.parquet"
BOLETINES_FILE = DATA_DIR_PROCESSED / "boletines_master_clean.parquet"
COLEGIOS_FILE = DATA_DIR_RAW / "colegios_chile.csv" 

# --- Archivo de Salida ---
OUTPUT_FILE = DATA_DIR_FINAL / "analytical_dataset.parquet"

logging.info(f"Directorio Procesado: {DATA_DIR_PROCESSED}")
logging.info(f"Directorio Final: {DATA_DIR_FINAL}")
logging.info(f"Archivo de Salida: {OUTPUT_FILE}")

2025-11-18 15:23:58,043 - INFO - Directorio Procesado: C:\Users\angel\OneDrive\Documents\U\2025-2\Proyecto de Grado\Legislative-Voting-Behavior-Prediction-\data\02_processed
2025-11-18 15:23:58,044 - INFO - Directorio Final: C:\Users\angel\OneDrive\Documents\U\2025-2\Proyecto de Grado\Legislative-Voting-Behavior-Prediction-\data\03_final
2025-11-18 15:23:58,045 - INFO - Archivo de Salida: C:\Users\angel\OneDrive\Documents\U\2025-2\Proyecto de Grado\Legislative-Voting-Behavior-Prediction-\data\03_final\analytical_dataset.parquet


## 1. Cargar Datasets Maestros

Cargamos las tres tablas maestras de la capa `02_processed` y cualquier BBDD externa (como la de colegios).

In [3]:
logging.info("Cargando datasets maestros...")

try:
    df_diputados = pd.read_parquet(DIPUTADOS_FILE)
    df_votaciones = pd.read_parquet(VOTACIONES_FILE)
    df_boletines = pd.read_parquet(BOLETINES_FILE)
    
    logging.info(f"Diputados cargados: {df_diputados.shape}")
    logging.info(f"Votaciones cargadas: {df_votaciones.shape}")
    logging.info(f"Boletines cargados: {df_boletines.shape}")

except FileNotFoundError as e:
    logging.error(f"ERROR: No se encontró un archivo maestro en {DATA_DIR_PROCESSED}. {e}")
    logging.error("Asegúrese de haber ejecutado los notebooks 06, 07 y 08.")
    raise

# --- Cargar BBDD Externa de Colegios ---
try:
    df_colegios_db = pd.read_csv(
        COLEGIOS_FILE,
        sep=";",              # el Mineduc casi siempre usa punto y coma
        encoding="latin-1",   # evita problemas con tildes y ñ
        on_bad_lines="skip",  # salta filas con errores
        engine="python"       # más tolerante que el parser por defecto
    )
    logging.info(f"BBDD Externa de Colegios cargada: {df_colegios_db.shape}")
except FileNotFoundError as e:
    logging.warning(f"WARNING: No se encontró el archivo de colegios en {COLEGIOS_FILE}.")
    logging.warning("La feature 'dependencia_colegio' será 'Desconocida'.")
    df_colegios_db = None

2025-11-18 15:23:58,055 - INFO - Cargando datasets maestros...
2025-11-18 15:23:58,928 - INFO - Diputados cargados: (929, 29)
2025-11-18 15:23:58,928 - INFO - Votaciones cargadas: (1926750, 18)
2025-11-18 15:23:58,929 - INFO - Boletines cargados: (3286, 18)
2025-11-18 15:23:59,426 - INFO - BBDD Externa de Colegios cargada: (16694, 50)


## 2. Feature Engineering: Diputados (Antigüedad y Colegios)

Enriquecemos el dataset maestro de diputados con las *features* externas e internas que definimos.

In [4]:
logging.info("Iniciando Feature Engineering en Diputados...")
df_diputados_enriquecido = df_diputados.copy()

# --- 3a. Feature: Dependencia del Colegio ---
if df_colegios_db is not None:
    logging.info("Calculando 'dependencia_colegio'...")
    # Normalizar la llave en la BBDD de colegios
    df_colegios_db['colegio_merge_key'] = df_colegios_db['NOM_RBD'].apply(normalize_string)
    
    # Seleccionar solo las columnas necesarias y eliminar duplicados
    df_colegios_lookup = df_colegios_db[['colegio_merge_key', 'COD_DEPE']].drop_duplicates()
    df_unique = (
        df_diputados_enriquecido[["colegio_merge_key"]]
        .drop_duplicates()
        .dropna()
        .reset_index(drop=True)
    )
    
    # --- 3. (EL PASO CLAVE) Codificar candidatos UNA SOLA VEZ ---
    # (Esto puede tardar unos minutos, pero solo corre una vez)
    emb_colegios, df_colegios_lookup_indexed = encode_candidates(df_colegios_lookup)
    
    
    # --- 4. Aplicar la función RÁPIDA ---
    print("Iniciando el matching semántico (rápido)...")
    # Ajusta el 'threshold' (umbral) según tus necesidades
    df_unique[["match_fuzzy", "score", "dependencia_oficial"]] = df_unique["colegio_merge_key"].apply(
        lambda x: pd.Series(find_dependencia_fast(x, df_colegios_lookup_indexed, emb_colegios, threshold=0.65))
    )
    print("Matching semántico completado.")
    
    # --- 5. Hacemos el merge (sin cambios) ---
    df_diputados_enriquecido = pd.merge(
        df_diputados_enriquecido,
        df_unique,
        on='colegio_merge_key',
        how='left'
    )
    df_diputados_enriquecido['dependencia_colegio'] = df_diputados_enriquecido['dependencia_oficial'].fillna('Desconocida')

    dependencia_map = {
        1.0: 'Corporación Municipal',
        2.0: 'Municipal DAEM',
        3.0: 'Particular Subvencionado',
        4.0: 'Particular Pagado',
        5.0: 'Adm. Delegada (DL 3166)',
        6.0: 'Servicio Local de Educación',
    }
    

    df_diputados_enriquecido['dependencia_etiqueta'] = df_diputados_enriquecido['dependencia_oficial'].map(dependencia_map)

    df_diputados_enriquecido['dependencia_etiqueta'] = df_diputados_enriquecido['dependencia_etiqueta'].fillna('Desconocido / Extranjero')
    df_diputados_enriquecido = df_diputados_enriquecido.drop(columns=['dependencia_oficial'])

else:
    df_diputados_enriquecido['dependencia_colegio'] = 'Desconocida'


# --- 3b. Feature: Antigüedad en el Partido ---
logging.info("Calculando 'antiguedad_partido_anios'...")
if 'militancia_fecha_inicio' in df_diputados_enriquecido.columns:
    # Asegurar que sean datetime
    f_inicio_periodo = pd.to_datetime(df_diputados_enriquecido['periodo_fecha_inicio'], errors='coerce')
    f_inicio_militancia = pd.to_datetime(df_diputados_enriquecido['militancia_fecha_inicio'], errors='coerce')
    
    # Calcular diferencia en días y luego en años
    time_diff_days = (f_inicio_periodo - f_inicio_militancia).dt.days
    df_diputados_enriquecido['antiguedad_partido_anios'] = time_diff_days / 365.25
    
    # Manejar valores negativos (si militancia fue *después* de iniciar período)
    df_diputados_enriquecido.loc[df_diputados_enriquecido['antiguedad_partido_anios'] < 0, 'antiguedad_partido_anios'] = 0
else:
    logging.warning("No se encontró 'militancia_fecha_inicio', feature 'antiguedad' será NaN.")
    df_diputados_enriquecido['antiguedad_partido_anios'] = np.nan

logging.info("DataFrame de Diputados enriquecido.")
display(df_diputados_enriquecido[['diputado_id', 'dependencia_colegio', 'antiguedad_partido_anios']].sample(5))

2025-11-18 15:23:59,437 - INFO - Iniciando Feature Engineering en Diputados...
2025-11-18 15:23:59,439 - INFO - Calculando 'dependencia_colegio'...


Codificando la base de datos de MINEDUC (esto se hace 1 vez)...


Batches:   0%|          | 0/492 [00:00<?, ?it/s]

Codificación de candidatos completada.
Iniciando el matching semántico (rápido)...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2025-11-18 15:24:17,196 - INFO - Calculando 'antiguedad_partido_anios'...
2025-11-18 15:24:17,218 - INFO - DataFrame de Diputados enriquecido.


Matching semántico completado.


Unnamed: 0,diputado_id,dependencia_colegio,antiguedad_partido_anios
223,892,6.0,0.0
226,897,4.0,0.0
595,994,3.0,3.997262
516,877,3.0,3.997262
118,933,4.0,7.997262


## 3. Feature Engineering: Indice de Rice y Disciplina partidaria

In [5]:
try:
  df_analisis = pd.merge(
      df_votaciones,
      df_diputados_enriquecido,
      how="left",
      on=['diputado_id', 'periodo']
  )
  logging.info("Dataset analítico creado")
except KeyError:
  logging.error("Dataset no creado")

2025-11-18 15:24:18,723 - INFO - Dataset analítico creado


In [6]:
grouped = df_analisis.groupby(['votacion_id', 'partido_id', 'periodo'])
vote_pcts = grouped['voto_valor'].value_counts(normalize=True).unstack(fill_value=0)

## Índice de rice

El índice de Rice es un número entre 0 y 1 que indica el grado de acuerdo dentro de un cuerpo de votación.

In [7]:
logging.info("Calculando Voto de Bancada y Cohesión (Índice de Rice)...")

# 1. Agrupar por votación y partido
# (Usamos las llaves 'votacion_id' y 'partido_id')
grouped = df_analisis.groupby(['votacion_id', 'partido_id', 'periodo'])

# 2. Calcular % de votos
# (normalize=True da el porcentaje)
vote_pcts = grouped['voto_valor'].value_counts(normalize=True).unstack(fill_value=0)

# Asegurarse de que las columnas 1.0 (Aprueba) y 0.0 (Rechaza) existan
if 1.0 not in vote_pcts.columns:
    vote_pcts[1.0] = 0.0
if 0.0 not in vote_pcts.columns:
    vote_pcts[0.0] = 0.0

# 3. Calcular Índice de Rice (para la bancada)
vote_pcts['indice_rice'] = abs(vote_pcts[1.0] - vote_pcts[0.0])

# 4. Determinar Voto Mayoritario (Voto de Bancada)
vote_pcts['voto_bancada'] = np.where(vote_pcts[1.0] >= vote_pcts[0.0], 1.0, 0.0)

# (Manejar el caso de empate 50/50, lo marcamos como NaN o 0.5)
vote_pcts.loc[vote_pcts[1.0] == vote_pcts[0.0], 'voto_bancada'] = 0.5 # O np.nan

# 5. Seleccionar solo las features que creamos
party_line_features = vote_pcts[['indice_rice', 'voto_bancada']]

# 6. Unir estas features de vuelta al DataFrame principal
df = df_analisis.merge(
    party_line_features,
    left_on=['votacion_id', 'partido_id', 'periodo'],
    right_index=True,
    how='left'
)

logging.info("'indice_rice' y 'voto_bancada' calculados.")
display(df[['diputado_id', 'partido_id', 'votacion_id', 'voto_valor', 'voto_bancada', 'indice_rice', 'periodo']].sample(5))

2025-11-18 15:24:19,883 - INFO - Calculando Voto de Bancada y Cohesión (Índice de Rice)...
2025-11-18 15:24:22,376 - INFO - 'indice_rice' y 'voto_bancada' calculados.


Unnamed: 0,diputado_id,partido_id,votacion_id,voto_valor,voto_bancada,indice_rice,periodo
81209,177,PPD,10594,1,1.0,1.0,2002-2006
37830,813,RN,15010,1,1.0,1.0,2002-2006
1145618,961,IND,22135,1,1.0,1.0,2018-2022
643198,879,DC,18584,1,1.0,1.0,2014-2018
1693007,1178,IND,42443,0,0.0,0.25,2022-2026


## Disciplina Partidaria

In [8]:
logging.info("Calculando columna auxiliar 'voto_con_bancada'...")

# (voto_valor == voto_bancada)
df['voto_con_bancada'] = (df['voto_valor'] == df['voto_bancada']).astype(int)

# Manejar casos especiales
df.loc[df['voto_valor'] == 2, 'voto_con_bancada'] = 0 # Abstención no es "con bancada"
df.loc[df['voto_valor'] == 3, 'voto_con_bancada'] = 0 
df.loc[df['voto_valor'] == 4, 'voto_con_bancada'] = 0 
df.loc[df['voto_bancada'] == 0.5, 'voto_con_bancada'] = 0 # Bancada dividida no tiene línea
logging.info("'voto_con_bancada' calculada.")

# --- PASO 2b: Ventana Móvil (Rolling Window) ---
logging.info("Calculando 'disciplina_partidaria_hist' (rolling)...")
N_VOTOS_VENTANA = 50 # (Hiperparámetro ajustable)

# (Agrupamos por 'diputado_id' para que la ventana no "cruce" entre diputados)
# (shift(1) es CRÍTICO: usa solo historia pasada, previene data leakage)
df['disciplina_partidaria_hist'] = df.groupby('diputado_id')['voto_con_bancada'].shift(1).rolling(
    window=N_VOTOS_VENTANA,
    min_periods=10 
).mean()

logging.info("'disciplina_partidaria_hist' calculada.")
display(df[['diputado_id', 'partido_id', 'votacion_id', 'voto_valor', 'voto_bancada', 'disciplina_partidaria_hist', 'periodo']].sample(5))

2025-11-18 15:24:22,529 - INFO - Calculando columna auxiliar 'voto_con_bancada'...
2025-11-18 15:24:22,567 - INFO - 'voto_con_bancada' calculada.
2025-11-18 15:24:22,567 - INFO - Calculando 'disciplina_partidaria_hist' (rolling)...
2025-11-18 15:24:22,677 - INFO - 'disciplina_partidaria_hist' calculada.


Unnamed: 0,diputado_id,partido_id,votacion_id,voto_valor,voto_bancada,disciplina_partidaria_hist,periodo
1866947,1141,LIBERAL,83609,1,1.0,0.68,2022-2026
861497,982,DC,28456,1,0.0,0.92,2014-2018
1042464,977,,39640,2,,0.58,2018-2022
569748,816,DC,18464,1,1.0,0.96,2010-2014
380339,819,PPD,14188,1,1.0,0.96,2010-2014


## 4. Unimos los boletines a nuestro dataset

In [9]:
try:
  df_analisis = pd.merge(
      df,
      df_boletines,
      how="left",
      on=['boletin_id']
      )
  logging.info("Dataset analítico creado")
  display(df_analisis[['votacion_id', 'nombre_completo', 'boletin_id', 'topic_titulo_id', 'topic_materia_id']].sample(5))
except KeyError:
  logging.error("Dataset no creado")

2025-11-18 15:24:23,849 - INFO - Dataset analítico creado


Unnamed: 0,votacion_id,nombre_completo,boletin_id,topic_titulo_id,topic_materia_id
1067744,25460,Raúl Leiva Carvajal,9914-11,8.0,0.0
1178261,40477,Harry Jürgensen Rundshagen,13596-29,17.0,1.0
44657,15310,Exequiel Silva Ortiz,2892-06,1.0,13.0
95758,13686,Pablo Prieto Lorca,2853-04,5.0,4.0
1615492,34097,Raúl Soto Mardones,14090-07,0.0,1.0


## 5. Guardar el dataset

In [10]:
try:
    df_analisis["dependencia_colegio"] = pd.to_numeric(df_analisis["dependencia_colegio"], errors="coerce")
    df_analisis.to_parquet(OUTPUT_FILE, index=False)
    logging.info(f"Guardado exitosamente: {OUTPUT_FILE}")
    logging.info(f"Dimensiones del DataFrame maestro: {df_analisis.shape}")
    
    print("\n--- Columnas Finales del DataFrame Limpio ---")
    print(df_analisis.columns.tolist())
    
except Exception as e:
    logging.error(f"ERROR al guardar en Parquet: {e}")

display(df_analisis.head())

2025-11-18 15:24:33,315 - ERROR - ERROR al guardar en Parquet: malloc of size 67108864 failed


Unnamed: 0,votacion_id,fecha_votacion,total_si,total_no,total_abstenciones,total_dispensado,quorum,diputado_id,voto_valor,boletin_id,...,autores_json,materias_str,materias_json,boletin_id_consultado,topic_materia,topic_titulo,topic_materia_id,topic_materia_nombre,topic_titulo_id,topic_titulo_nombre
0,14898,2002-12-19 12:06:00,65,0,0,0,Quórum Simple,800,1,2625-07,...,"[""R\u00edos Santander, Mario""]",PREDIOS URBANOS,"[""PREDIOS URBANOS""]",2625.0,9.0,8.0,-1.0,-1_sector_público_remuneraciones_salud,10.0,10_viviendas_urbanismo_construcciones_raíces
1,14898,2002-12-19 12:06:00,65,0,0,0,Quórum Simple,802,1,2625-07,...,"[""R\u00edos Santander, Mario""]",PREDIOS URBANOS,"[""PREDIOS URBANOS""]",2625.0,9.0,8.0,-1.0,-1_sector_público_remuneraciones_salud,10.0,10_viviendas_urbanismo_construcciones_raíces
2,14898,2002-12-19 12:06:00,65,0,0,0,Quórum Simple,807,1,2625-07,...,"[""R\u00edos Santander, Mario""]",PREDIOS URBANOS,"[""PREDIOS URBANOS""]",2625.0,9.0,8.0,-1.0,-1_sector_público_remuneraciones_salud,10.0,10_viviendas_urbanismo_construcciones_raíces
3,14898,2002-12-19 12:06:00,65,0,0,0,Quórum Simple,806,1,2625-07,...,"[""R\u00edos Santander, Mario""]",PREDIOS URBANOS,"[""PREDIOS URBANOS""]",2625.0,9.0,8.0,-1.0,-1_sector_público_remuneraciones_salud,10.0,10_viviendas_urbanismo_construcciones_raíces
4,14898,2002-12-19 12:06:00,65,0,0,0,Quórum Simple,811,1,2625-07,...,"[""R\u00edos Santander, Mario""]",PREDIOS URBANOS,"[""PREDIOS URBANOS""]",2625.0,9.0,8.0,-1.0,-1_sector_público_remuneraciones_salud,10.0,10_viviendas_urbanismo_construcciones_raíces
