# 02 - Preprocessing: Common Pipeline for LR, RF, XGB, SVM

**Goal:** build and persist a **single** preprocessing *pipeline* shared by all models and hypotheses.
- Feature engineering (`engineer_basic_features`)
- Imputation (numeric: *median*; categorical: *most frequent*)
- Scaling (`StandardScaler`) for linear models and SVM
- One-Hot Encoding of categorical variables (`handle_unknown='ignore'`)
- `Fare_log` transformation
- Save: `data/processed/preprocessing_pipeline.pkl` and `data/processed/feature_definitions.json`
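
The steps listed above could be wired together roughly as follows. This is only a minimal sketch under stated assumptions: the column lists, the toy data, and the name `sketch` are hypothetical, and the real `build_preprocessing_pipeline` in `src.preprocessing` may be organized differently (for example, it also runs `engineer_basic_features`, which is not reproduced here).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column groups; the real ones are defined in src.preprocessing.
NUMERIC_COLS = ["Age", "SibSp", "Parch"]
LOG_COLS = ["Fare"]
CATEGORICAL_COLS = ["Sex", "Embarked"]

# Numeric columns: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Fare: impute, apply log(1 + x) to tame the right skew, then standardize.
log_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log1p", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# Categorical columns: most-frequent imputation, then one-hot encoding.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

sketch = ColumnTransformer([
    ("num", numeric, NUMERIC_COLS),
    ("fare_log", log_numeric, LOG_COLS),
    ("cat", categorical, CATEGORICAL_COLS),
])

# Toy rows with missing values, only to exercise the sketch.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "SibSp": [1, 0, 0, 1],
    "Parch": [0, 0, 2, 0],
    "Fare": [7.25, 71.28, np.nan, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
X_toy = sketch.fit_transform(toy)
print(X_toy.shape)
```

Keeping the scaler inside the shared pipeline means tree-based models (RF, XGB) receive scaled inputs too; that is harmless for them and lets every model consume the same matrix.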

In [1]:
import sys, os
from pathlib import Path

candidates = [Path.cwd(), Path.cwd().parent, Path.cwd().parent.parent]
PROJECT_ROOT = None
for c in candidates:
    if (c / "src").exists():
        PROJECT_ROOT = c
        break
if PROJECT_ROOT is None:
    raise RuntimeError("Could not find the 'src' folder in the current directory or its parent levels.")
sys.path.insert(0, str(PROJECT_ROOT))
os.environ["PYTHONPATH"] = str(PROJECT_ROOT) + os.pathsep + os.environ.get("PYTHONPATH", "")
print("Project root:", PROJECT_ROOT)

Project root: c:\Users\luigu\OneDrive\Escritorio\Titanic_MLProject
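
The cell above only checks the working directory and two levels above it. If the notebook may live deeper in the tree, the same idea can be generalized by walking every ancestor. The helper below is a hypothetical sketch (`find_project_root` is not part of `src.utils`):

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check `start` itself first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None


root = find_project_root(Path.cwd())
print("Project root:", root)
```

Returning `None` instead of raising leaves the decision about a missing root to the caller, which can still raise a `RuntimeError` as the original cell does.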


In [2]:
import joblib
import pandas as pd

from src.utils import p, read_csv, ensure_dirs, save_json
from src.preprocessing import build_preprocessing_pipeline, get_feature_definitions

DATA_RAW = p("data", "raw", "Titanic-Dataset.csv")
PIPE_OUT = p("data", "processed", "preprocessing_pipeline.pkl")
FEAT_JSON = p("data", "processed", "feature_definitions.json")

ensure_dirs()

In [3]:
# Load the raw dataset
df_raw = read_csv(DATA_RAW)

# Build the pipeline
preproc = build_preprocessing_pipeline(df_raw)

# Fit so the pipeline stores the imputer statistics, OHE categories, scaler means/standard deviations, etc.
_ = preproc.fit(df_raw)

# Persist
joblib.dump(preproc, PIPE_OUT)
print("Saved:", PIPE_OUT)

Saved: C:\Users\luigu\OneDrive\Escritorio\Titanic_MLProject\data\processed\preprocessing_pipeline.pkl
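
A quick way to confirm that a persisted pipeline is usable is to reload it and check that it reproduces the original transform. The sketch below uses a small stand-in pipeline and a temporary file rather than the project's `preproc` and `PIPE_OUT`, so all names in it are illustrative:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted `preproc` object (hypothetical toy data).
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
]).fit(X)

# Dump to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline.pkl"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should produce the same matrix as the original.
same_output = np.allclose(pipe.transform(X), restored.transform(X))
print("Round trip OK:", same_output)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording the scikit-learn version next to the `.pkl` file is a reasonable precaution.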


In [4]:
feat_defs = get_feature_definitions()

feat_defs["Title"] = {
    "description": "Title extracted from the passenger name",
    "type": "categorical",
    "creation_method": "regex extraction (grouped into {Mr, Mrs, Miss, Master, Officer, Royalty})",
    "missing_handling": "none",
    "values": ["Mr", "Mrs", "Miss", "Master", "Officer", "Royalty"]
}

save_json(feat_defs, FEAT_JSON)
print("Saved:", FEAT_JSON)

Saved: C:\Users\luigu\OneDrive\Escritorio\Titanic_MLProject\data\processed\feature_definitions.json
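
The `Title` definition above mentions a regex extraction grouped into six values. A minimal sketch of what that could look like is shown below. The pattern, the mapping `TITLE_GROUPS`, the fallback value, and the function `extract_title` are assumptions made for illustration; the project's actual logic lives in `engineer_basic_features` and may group rare titles differently.

```python
import re

# Titanic names look like "Surname, Title. Given names"; capture the text
# between the comma and the first period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

# Hypothetical mapping from raw titles to the six groups in feat_defs["Title"].
TITLE_GROUPS = {
    "Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master",
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
    "Dr": "Officer", "Rev": "Officer", "Col": "Officer",
    "Major": "Officer", "Capt": "Officer",
    "Sir": "Royalty", "Lady": "Royalty", "Don": "Royalty",
    "Dona": "Royalty", "Jonkheer": "Royalty", "the Countess": "Royalty",
}


def extract_title(name: str, default: str = "Mr") -> str:
    """Map a raw passenger name to a grouped title (fallback is an assumption)."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return default
    return TITLE_GROUPS.get(match.group(1).strip(), default)


for name in ["Braund, Mr. Owen Harris", "Aubart, Mme. Leontine Pauline"]:
    print(name, "->", extract_title(name))
```

Because every name is mapped to one of the six groups (or the fallback), the derived column has no missing values, which is consistent with `"missing_handling": "none"`.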


In [5]:
# Example transform to verify the output (matrix ready for the models)
X_mat = preproc.transform(df_raw)
print("Shape transform:", X_mat.shape)

Shape transform: (891, 31)
