<a href="https://colab.research.google.com/github/Noirwolf04/pipeline_hackaton/blob/main/MVP_Paso2_Pipeline_FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 01 – Item 2: Feature Engineering and Pipeline (Churn)

**Objective:** build a `ColumnTransformer` + `Pipeline` that takes raw data (with mixed types) and returns a **ready-to-use numeric matrix** for a churn model.

- Dataset: *Bank Customer Churn* (Kaggle)  
- Target (for later training): `Exited` (0 = stays, 1 = churns)  
- This notebook covers **preprocessing only** (Step 2).


## 1) Installation / Imports

> Requires `pandas` and `scikit-learn`.  
> If `kagglehub` is not installed, install it with: `pip -q install kagglehub`


In [None]:
# !pip -q install kagglehub

import os
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer


## 2) Load the dataset (KaggleHub) and minimal cleaning

- Columns that carry no useful information for the model are dropped: `RowNumber`, `CustomerId`, `Surname`
- Column names are standardized (optional) so the contract stays consistent.


In [None]:
import kagglehub

# Download the dataset
path = kagglehub.dataset_download("radheshyamkollipara/bank-customer-churn")
print("Path:", path)

# Main file
csv_path = os.path.join(path, "Customer-Churn-Records.csv")
df = pd.read_csv(csv_path)

print("Original shape:", df.shape)
df.head()


Using Colab cache for faster access to the 'bank-customer-churn' dataset.
Path: /kaggle/input/bank-customer-churn
Original shape: (10000, 18)


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


In [None]:
# Minimal cleaning (drop identifier columns)
cols_drop = ["RowNumber", "CustomerId", "Surname"]
# errors="ignore" already skips any column that is not present
df = df.drop(columns=cols_drop, errors="ignore")

# Normalize column names
df = df.rename(columns={
    "Satisfaction Score": "SatisfactionScore",
    "Card Type": "CardType",
    "Point Earned": "PointEarned",
    "Complain": "complain",
})

print("Clean shape:", df.shape)
df.head()


Clean shape: (10000, 15)


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,complain,SatisfactionScore,CardType,PointEarned
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


## 3) Define variables (as confirmed in session)

### Numeric (standard scaling)
- `Age`, `CreditScore`, `Balance`, `EstimatedSalary`, `Tenure`, `NumOfProducts`

### Nominal categorical (One-Hot)
- `Geography`, `Gender`  
*(If `CardType` is used, it would also go here, but only once DS confirms it enters the model.)*

### Binary / ordinal (used directly, no scaling)
- Binary: `IsActiveMember`, `HasCrCard`, `complain` (0/1 or True/False)
- Ordinal: `SatisfactionScore` (1–5)  
> Note: because it is ordinal with a natural order, it can be kept as numeric (no One-Hot).  
> In this pipeline we treat it as **ordinal-numeric** (imputation + passthrough), with no scaling.


In [None]:
# Feature lists
NUM_FEATURES = ["Age", "CreditScore", "Balance", "EstimatedSalary", "Tenure", "NumOfProducts"]
CAT_FEATURES = ["Geography", "Gender"]
BIN_FEATURES = ["IsActiveMember", "HasCrCard", "complain"]
ORD_FEATURES = ["SatisfactionScore"]

# Validation: check which columns actually exist in the DataFrame
present = set(df.columns)
missing = [c for c in (NUM_FEATURES + CAT_FEATURES + BIN_FEATURES + ORD_FEATURES + ["Exited"]) if c not in present]
print("Missing columns (if any):", missing)


Missing columns (if any): []


## 4) Preprocessing pipeline (ColumnTransformer)


Binary columns are converted to `int` **before** imputation and then kept as 0/1.


In [None]:
# Transformer that converts bool -> int
def to_int_01(X):
    X = pd.DataFrame(X).copy()
    for col in X.columns:
        # True/False -> 1/0
        if X[col].dtype == bool:
            X[col] = X[col].astype(int)
        else:
            # In case values arrive as "0"/"1" strings or mixed types.
            # errors="ignore" is deprecated in recent pandas, so catch the failure instead.
            try:
                X[col] = pd.to_numeric(X[col])
            except (ValueError, TypeError):
                pass
            # If the column is still object dtype (e.g. "True"/"False" text), try to map it
            if X[col].dtype == object:
                X[col] = X[col].map({"True": 1, "False": 0, True: 1, False: 0}).fillna(X[col])
    return X.values

bool_to_int = FunctionTransformer(to_int_01, feature_names_out="one-to-one")


In [None]:
# One pipeline per variable type

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

bin_pipe = Pipeline(steps=[
    ("to_int", bool_to_int),
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # binary columns stay as 0/1, with no scaling
])

ord_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # the ordinal column stays numeric (1..5), with no one-hot encoding
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipe, NUM_FEATURES),
        ("cat", cat_pipe, CAT_FEATURES),
        ("bin", bin_pipe, BIN_FEATURES),
        ("ord", ord_pipe, ORD_FEATURES),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

preprocessor


## 5) Deliverable function: `raw_to_matrix(df)`  

Takes a raw `DataFrame` and returns a fully numeric `numpy.ndarray`.
- `fit=True` fits the preprocessor (for training/validation).
- `fit=False` only transforms (for production/inference).
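
The difference between the two modes can be sketched on a toy column (values invented for illustration). Statistics learned during `fit` are reused unchanged at inference time, so new rows are imputed and scaled with the training median, mean, and standard deviation rather than their own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"Balance": [0.0, 100.0, 200.0]})
incoming = pd.DataFrame({"Balance": [100.0, np.nan]})  # NaN simulates a missing value

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Training: learn the median (100), mean (100) and std from the training rows.
train_matrix = pipe.fit_transform(train)

# Inference: transform only; the missing value is filled with the training median.
incoming_matrix = pipe.transform(incoming)

print(train_matrix.ravel())
print(incoming_matrix.ravel())
```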


In [None]:
_FITTED = {"ok": False}

def raw_to_matrix(df_raw: pd.DataFrame, fit: bool = False) -> np.ndarray:
    # Work on a copy so the caller's DataFrame is not modified
    X = df_raw.copy()

    # Normalize column names
    X = X.rename(columns={
        "Satisfaction Score": "SatisfactionScore",
        "Card Type": "CardType",
        "Point Earned": "PointEarned",
        "Complain": "complain",
    })

    # Make sure all required columns are present
    expected_cols = NUM_FEATURES + CAT_FEATURES + BIN_FEATURES + ORD_FEATURES
    missing = [c for c in expected_cols if c not in X.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    X = X[expected_cols]

    if fit:
        M = preprocessor.fit_transform(X)
        _FITTED["ok"] = True
    else:
        if not _FITTED["ok"]:
            raise RuntimeError("The preprocessor is not fitted. Run raw_to_matrix(..., fit=True) first.")
        M = preprocessor.transform(X)
    return M


## 6) Quick test with real data from the dataset

This check runs the pipeline **on the real dataset**.


In [None]:
# Split X / y
target_col = "Exited"
if target_col not in df.columns:
    raise ValueError("Target column 'Exited' does not exist in the dataset.")

X_real = df[NUM_FEATURES + CAT_FEATURES + BIN_FEATURES + ORD_FEATURES].copy()
y_real = df[target_col].astype(int)

# Fit + transform (training)
M = raw_to_matrix(X_real, fit=True)

print("X_real shape:", X_real.shape)
print("Preprocessed matrix shape:", M.shape)
print("Example (first 2 rows):")
M[:2]


X_real shape: (10000, 12)
Preprocessed matrix shape: (10000, 15)
Example (first 2 rows):


array([[ 0.29351742, -0.32622142, -1.22584767,  0.02188649, -1.04175968,
        -0.91158349,  1.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ,  1.        ,  1.        ,  2.        ],
       [ 0.19816383, -0.44003595,  0.11735002,  0.21653375, -1.38753759,
        -0.91158349,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        ,  1.        ,  0.        ,  1.        ,  3.        ]])

## 7) Get the final feature names (useful for debugging)

Audit the columns generated after One-Hot encoding.


In [None]:
# Final column names after preprocessing
feature_names = preprocessor.get_feature_names_out()
print("Total features:", len(feature_names))
feature_names[:30]


Total features: 15


array(['Age', 'CreditScore', 'Balance', 'EstimatedSalary', 'Tenure',
       'NumOfProducts', 'Geography_France', 'Geography_Germany',
       'Geography_Spain', 'Gender_Female', 'Gender_Male',
       'IsActiveMember', 'HasCrCard', 'complain', 'SatisfactionScore'],
      dtype=object)

## 8) Input/output contract (reference)

**Minimum input (raw):** must contain all required columns.

- Expected types (recommended):
  - Numeric: int/float
  - Categorical: string
  - Binary: bool or 0/1
  - Ordinal: int (1..5)
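
A single record that follows this contract can be wrapped into a one-row `DataFrame` before it is passed to the preprocessor. The helper below is a hypothetical sketch: `REQUIRED_COLUMNS` and `record_to_frame` are names introduced here for illustration, not part of the notebook, and the check only covers missing keys, not types.

```python
import pandas as pd

# Column names taken from the feature lists defined in section 3.
REQUIRED_COLUMNS = [
    "Age", "CreditScore", "Balance", "EstimatedSalary", "Tenure", "NumOfProducts",
    "Geography", "Gender",
    "IsActiveMember", "HasCrCard", "complain",
    "SatisfactionScore",
]

def record_to_frame(record: dict) -> pd.DataFrame:
    """Validate a single raw record and return it as a one-row DataFrame."""
    missing = [c for c in REQUIRED_COLUMNS if c not in record]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Wrap the dict in a list so pandas builds one row, then fix the column order.
    return pd.DataFrame([record])[REQUIRED_COLUMNS]

# Sample values for illustration.
record = {
    "Geography": "Spain", "Gender": "Male", "Age": 42, "CreditScore": 650,
    "Balance": 14.5, "EstimatedSalary": 14.0, "Tenure": 6, "NumOfProducts": 5,
    "SatisfactionScore": 2, "IsActiveMember": True, "HasCrCard": True,
    "complain": False,
}
frame = record_to_frame(record)
print(frame.shape)
```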

**Preprocessor output:** numeric matrix `M` (NumPy array) for the model.

> The model's inference contract (forecast/probability) is defined in Step 3.


In [None]:
example_input = {
  "Geography": "Spain",
  "Gender": "Male",
  "Age": 42,
  "CreditScore": 650,
  "Balance": 14.5,
  "EstimatedSalary": 14.0,
  "Tenure": 6,
  "NumOfProducts": 5,
  "SatisfactionScore": 2,
  "IsActiveMember": True,
  "HasCrCard": True,
  "complain": False
}
example_input


{'Geography': 'Spain',
 'Gender': 'Male',
 'Age': 42,
 'CreditScore': 650,
 'Balance': 14.5,
 'EstimatedSalary': 14.0,
 'Tenure': 6,
 'NumOfProducts': 5,
 'SatisfactionScore': 2,
 'IsActiveMember': True,
 'HasCrCard': True,
 'complain': False}

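## 9) Export (optional): save the preprocessor for production

Save it with `joblib` so the backend uses the same preprocessing. Because the pickled `FunctionTransformer` stores `to_int_01` by reference, that function must be importable wherever the file is loaded.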
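
On the consuming side, the artifact can be restored with `joblib.load` and used for `transform` only. Below is a minimal round-trip sketch; the toy scaler and the temporary directory are stand-ins chosen for illustration, whereas the real artifact is the fitted `preprocessor`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted preprocessor.
fitted = StandardScaler().fit(np.array([[0.0], [10.0], [20.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "toy_preprocessor.joblib")
    joblib.dump(fitted, artifact_path)
    restored = joblib.load(artifact_path)

# The restored object should reproduce the original transformation exactly.
new_rows = np.array([[10.0], [30.0]])
same_output = np.allclose(fitted.transform(new_rows), restored.transform(new_rows))
print(same_output)
```

The loading environment should run a scikit-learn version compatible with the one that produced the file, since pickled estimators are not guaranteed to load across versions.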


In [None]:
import joblib

out_path = "preprocessor_step2.joblib"
joblib.dump(preprocessor, out_path)
print("Saved:", out_path)


Saved: preprocessor_step2.joblib
