# ⚙️ Data Preprocessing - PAKDD 2010 Credit Risk

This notebook applies the preprocessing pipeline defined in 
`src/features/build_features.py`, validates that it works, 
performs the train/test split, and saves the fitted pipeline 
for use in the `03_Training.ipynb` notebook.

---

## 🎯 Objectives:
- Load the intermediate dataset with correct headers  
- Separate the predictor variables (X) from the target variable (y)  
- Split the data into train/test sets and save them in `data/processed/`  
- Fit the preprocessing pipeline (`BaseCleaner + ColumnTransformer`)  
- Save the fitted pipeline to `models/preprocessing_pipeline.joblib`


## 📚 Import Libraries

In [None]:
import pandas as pd
import joblib
import sys
import os
# Make sure Python can find the 'src' module
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from src.features.build_features import build_preprocessing_pipeline
from src.utils.split import split_and_save

pd.set_option('display.max_columns', 100)
pd.set_option('display.precision', 3)

print("✅ Libraries imported successfully.")

✅ Libraries imported successfully.


## 📁 Load Intermediate Data

In [2]:
DATA_PATH = "../data/interim/train_clean_headers.parquet"
TARGET_COL = "TARGET_LABEL_BAD=1"

print(f"üì• Cargando dataset desde: {DATA_PATH}")

if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"‚ùå No se encontr√≥ el archivo: {DATA_PATH}")

df = pd.read_parquet(DATA_PATH)

print(f"‚úÖ Dataset cargado correctamente.")
print(f"üìê Dimensiones: {df.shape[0]:,} filas x {df.shape[1]} columnas")
display(df.head(3))

üì• Cargando dataset desde: ../data/interim/train_clean_headers.parquet
‚úÖ Dataset cargado correctamente.
üìê Dimensiones: 50,000 filas x 54 columnas


Unnamed: 0,ID_CLIENT,CLERK_TYPE,PAYMENT_DAY,APPLICATION_SUBMISSION_TYPE,QUANT_ADDITIONAL_CARDS,POSTAL_ADDRESS_TYPE,SEX,MARITAL_STATUS,QUANT_DEPENDANTS,EDUCATION_LEVEL,STATE_OF_BIRTH,CITY_OF_BIRTH,NACIONALITY,RESIDENCIAL_STATE,RESIDENCIAL_CITY,RESIDENCIAL_BOROUGH,FLAG_RESIDENCIAL_PHONE,RESIDENCIAL_PHONE_AREA_CODE,RESIDENCE_TYPE,MONTHS_IN_RESIDENCE,FLAG_MOBILE_PHONE,FLAG_EMAIL,PERSONAL_MONTHLY_INCOME,OTHER_INCOMES,FLAG_VISA,FLAG_MASTERCARD,FLAG_DINERS,FLAG_AMERICAN_EXPRESS,FLAG_OTHER_CARDS,QUANT_BANKING_ACCOUNTS,QUANT_SPECIAL_BANKING_ACCOUNTS,PERSONAL_ASSETS_VALUE,QUANT_CARS,COMPANY,PROFESSIONAL_STATE,PROFESSIONAL_CITY,PROFESSIONAL_BOROUGH,FLAG_PROFESSIONAL_PHONE,PROFESSIONAL_PHONE_AREA_CODE,MONTHS_IN_THE_JOB,PROFESSION_CODE,OCCUPATION_TYPE,MATE_PROFESSION_CODE,MATE_EDUCATION_LEVEL,FLAG_HOME_ADDRESS_DOCUMENT,FLAG_RG,FLAG_CPF,FLAG_INCOME_PROOF,PRODUCT,FLAG_ACSP_RECORD,AGE,RESIDENCIAL_ZIP_3,PROFESSIONAL_ZIP_3,TARGET_LABEL_BAD=1
0,1,C,5,Web,0,1,F,6,1,0,RN,Assu,1,RN,Santana do Matos,Centro,Y,105,1.0,15.0,N,1,900.0,0.0,1,1,0,0,0,0,0,0.0,0,N,,,,N,,0,9.0,4.0,,,0,0,0,0,1,N,32,595,595,1
1,2,C,15,Carga,0,1,F,2,0,0,RJ,rio de janeiro,1,RJ,RIO DE JANEIRO,CAMPO GRANDE,Y,20,1.0,1.0,N,1,750.0,0.0,0,0,0,0,0,0,0,0.0,0,Y,,,,N,,0,11.0,4.0,11.0,,0,0,0,0,1,N,34,230,230,1
2,3,C,5,Web,0,1,F,2,0,0,RN,GARANHUNS,1,RN,Parnamirim,Boa Esperanca,Y,105,1.0,,N,1,500.0,0.0,0,0,0,0,0,0,0,0.0,0,N,,,,N,,0,11.0,,,,0,0,0,0,1,N,27,591,591,0


## ✂️ Train/Test Split and Saving

In [3]:
print("üîß Separando variables predictoras (X) y target (y)...")

X_train, X_test, y_train, y_test = split_and_save(df, TARGET_COL)

print("‚úÖ Divisi√≥n completada y datasets guardados en data/processed/")

üîß Separando variables predictoras (X) y target (y)...
‚úÖ Guardado en ../data/processed/ (train: (40000, 53), test: (10000, 53))
‚úÖ Divisi√≥n completada y datasets guardados en data/processed/

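The body of `split_and_save` lives in `src/utils/split.py` and is not shown in this notebook. The following is only a minimal sketch of what such a helper could look like, assuming an 80/20 split (consistent with the 40,000/10,000 shapes above), stratification on the target, and CSV output. The function name `split_and_save_sketch`, its parameters, the stratification, and the file format are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_save_sketch(df, target_col, out_dir, test_size=0.2, random_state=42):
    """Hypothetical stand-in for split_and_save: split, then persist each part."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # Stratifying keeps the default rate similar in both splits (an assumption here).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # CSV avoids extra dependencies; the real helper may use another format.
    parts = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
    for name, part in parts.items():
        part.to_csv(out_dir / f"{name}.csv", index=False)
    return X_train, X_test, y_train, y_test


# Toy data: 100 rows with a 20% positive rate.
toy = pd.DataFrame({
    "AGE": range(100),
    "SEX": ["F", "M"] * 50,
    "TARGET_LABEL_BAD=1": [1] * 20 + [0] * 80,
})

with tempfile.TemporaryDirectory() as tmp:
    X_tr, X_te, y_tr, y_te = split_and_save_sketch(toy, "TARGET_LABEL_BAD=1", tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())

print(X_tr.shape, X_te.shape)
print(saved_files)
```

Splitting before fitting any transformer matters here: the preprocessing pipeline should learn its statistics from the training rows only, so the test set stays untouched until evaluation.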

## 🧩 Build and Fit the Preprocessing Pipeline

In [4]:
from time import time
t0 = time()

print("⚙️ Building pipeline...")
pipeline = build_preprocessing_pipeline()

print("🚀 Fitting pipeline on the training data...")
X_train_processed = pipeline.fit_transform(X_train, y_train)

print(f"✅ Pipeline fitted successfully in {time() - t0:.2f} seconds.")
print(f"📏 Original shape: {X_train.shape} → Transformed: {X_train_processed.shape}")

⚙️ Building pipeline...
🚀 Fitting pipeline on the training data...
✅ Pipeline fitted successfully in 0.44 seconds.
📏 Original shape: (40000, 53) → Transformed: (40000, 151)

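The actual `build_preprocessing_pipeline` is defined in `src/features/build_features.py` and is not reproduced here. As a rough sketch of the `BaseCleaner + ColumnTransformer` structure, the example below chains a cleaning step with a `ColumnTransformer` that has a numeric and a categorical branch. The class `BaseCleanerSketch`, the column subsets, and the choice of imputers, scaler, and encoder are assumptions; only the step name `preprocessor` and the branch names `numeric_pipe` and `categorical_pipe` are taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC_COLS = ["AGE", "MONTHS_IN_RESIDENCE"]  # illustrative subset
CATEGORICAL_COLS = ["SEX", "PRODUCT"]          # illustrative subset


class BaseCleanerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical cleaner: keeps only the columns used by later steps."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[list(self.columns)].copy()


def build_pipeline_sketch():
    numeric_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("numeric_pipe", numeric_pipe, NUMERIC_COLS),
        ("categorical_pipe", categorical_pipe, CATEGORICAL_COLS),
    ])
    return Pipeline([
        ("cleaner", BaseCleanerSketch(NUMERIC_COLS + CATEGORICAL_COLS)),
        ("preprocessor", preprocessor),
    ])


# Toy rows, including a missing value and an ID column the cleaner drops.
X_toy = pd.DataFrame({
    "AGE": [32, 34, 27, 45],
    "MONTHS_IN_RESIDENCE": [15.0, 1.0, np.nan, 6.0],
    "SEX": ["F", "F", "M", "N"],
    "PRODUCT": [1, 1, 2, 7],
    "ID_CLIENT": [1, 2, 3, 4],
})

pipe = build_pipeline_sketch()
X_out = pipe.fit_transform(X_toy)
print(X_toy.shape, "->", X_out.shape)
```

The column count grows because each categorical column expands into one column per observed category, which is the same effect that takes the real data from 53 to 151 columns.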

## 🔍 Checking the Generated Features

In [5]:
try:
    feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
    print(f"üî§ Se generaron {len(feature_names)} columnas transformadas.")
    print("üî∏ Ejemplo de columnas transformadas:")
    print(feature_names[:20])
except Exception as e:
    print(f"‚ö†Ô∏è No se pudieron obtener los nombres de columnas: {e}")

üî§ Se generaron 151 columnas transformadas.
üî∏ Ejemplo de columnas transformadas:
['numeric_pipe__PERSONAL_ASSETS_VALUE'
 'numeric_pipe__QUANT_BANKING_ACCOUNTS'
 'numeric_pipe__QUANT_SPECIAL_BANKING_ACCOUNTS' 'numeric_pipe__AGE'
 'numeric_pipe__MONTHS_IN_RESIDENCE' 'numeric_pipe__MONTHS_IN_THE_JOB'
 'categorical_pipe__SEX_F' 'categorical_pipe__SEX_M'
 'categorical_pipe__SEX_N' 'categorical_pipe__PRODUCT_1'
 'categorical_pipe__PRODUCT_2' 'categorical_pipe__PRODUCT_7'
 'categorical_pipe__NACIONALITY_0' 'categorical_pipe__NACIONALITY_1'
 'categorical_pipe__NACIONALITY_2' 'categorical_pipe__MARITAL_STATUS_0'
 'categorical_pipe__MARITAL_STATUS_1' 'categorical_pipe__MARITAL_STATUS_2'
 'categorical_pipe__MARITAL_STATUS_3' 'categorical_pipe__MARITAL_STATUS_4']


## 💾 Save the Fitted Pipeline

In [6]:

MODEL_DIR = "../models"
os.makedirs(MODEL_DIR, exist_ok=True)
PIPELINE_PATH = os.path.join(MODEL_DIR, "preprocessing_pipeline.joblib")

joblib.dump(pipeline, PIPELINE_PATH)

print(f"üíæ Pipeline guardado exitosamente en: {PIPELINE_PATH}")

üíæ Pipeline guardado exitosamente en: ../models\preprocessing_pipeline.joblib

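How `03_Training.ipynb` consumes the saved file is not shown here. A plausible pattern, sketched below on a toy pipeline, is to reload the object with `joblib.load` and call only `transform` on new data, so the statistics learned from the training set are reused rather than refitted. The toy arrays and the single `StandardScaler` step are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [4.0]])

# Fit on training data only, as in the notebook.
pipeline_toy = Pipeline([("scaler", StandardScaler())]).fit(X_train_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessing_pipeline.joblib")
    joblib.dump(pipeline_toy, path)
    reloaded = joblib.load(path)

# transform (not fit_transform): the training mean and scale are reused.
X_test_processed = reloaded.transform(X_test_toy)
same = np.allclose(X_test_processed, pipeline_toy.transform(X_test_toy))
print(same)
```

A pipeline that contains custom classes, such as `BaseCleaner`, can only be reloaded in an environment where those classes are importable from the same module path, so the `sys.path` setup from the first cell would be needed again before loading.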

## 🧾 Process Summary

In [7]:
print("✅ PREPROCESSING COMPLETED SUCCESSFULLY ✅")
print(f"""
📊 Initial dataset: {df.shape}
🔹 Train: {X_train.shape}
🔹 Test: {X_test.shape}
🧠 Pipeline fitted and saved to: {PIPELINE_PATH}
📁 Splits saved in: ../data/processed/
""")

✅ PREPROCESSING COMPLETED SUCCESSFULLY ✅

📊 Initial dataset: (50000, 54)
🔹 Train: (40000, 53)
🔹 Test: (10000, 53)
🧠 Pipeline fitted and saved to: ../models\preprocessing_pipeline.joblib
📁 Splits saved in: ../data/processed/

