# 🔧 Interactive Notebook: sklearn Pipelines and Feature Engineering

> **Modules 07-08 of the MLOps Guide**

This notebook lets you experiment with the following concepts:
- sklearn Pipelines
- ColumnTransformer
- Custom Transformers
- Data Leakage Detection

---

## 1. Initial Setup

In [None]:
# Install dependencies (run only if needed)
# !pip install pandas scikit-learn numpy

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score, classification_report

print("✅ Libraries loaded successfully")

## 2. Create an Example Dataset

We simulate a **Bank Churn** dataset similar to the one in the portfolio.

In [None]:
# Create a synthetic dataset
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'CreditScore': np.random.randint(300, 850, n_samples),
    'Age': np.random.randint(18, 92, n_samples),
    'Tenure': np.random.randint(0, 10, n_samples),
    'Balance': np.random.uniform(0, 250000, n_samples),
    'NumOfProducts': np.random.randint(1, 4, n_samples),
    'HasCrCard': np.random.randint(0, 2, n_samples),
    'IsActiveMember': np.random.randint(0, 2, n_samples),
    'EstimatedSalary': np.random.uniform(10000, 200000, n_samples),
    'Geography': np.random.choice(['France', 'Germany', 'Spain'], n_samples),
    'Gender': np.random.choice(['Male', 'Female'], n_samples),
    'Exited': np.random.randint(0, 2, n_samples)  # Target
})

# Introduce some NaNs to make it more realistic
# (replace=False guarantees exactly 50 and 30 distinct rows)
data.loc[np.random.choice(n_samples, 50, replace=False), 'Balance'] = np.nan
data.loc[np.random.choice(n_samples, 30, replace=False), 'CreditScore'] = np.nan

print(f"Dataset shape: {data.shape}")
print(f"\nMissing values:\n{data.isnull().sum()}")
data.head()

## 3. Basic Pipeline (Module 07)

### 3.1 The Problem WITHOUT a Pipeline

❌ **Fragile code with a risk of data leakage:**

In [None]:
# ❌ BAD APPROACH - Do not do this in production

# Separate features and target
X = data.drop('Exited', axis=1)
y = data['Exited']

# ⚠️ LEAKAGE: fit on the ENTIRE dataset before the split
scaler = StandardScaler()
numeric_cols = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']

# Impute missing values (on all of X): the means include future test rows
X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].mean())

# Scale (on all of X) - DATA LEAKAGE!
X[numeric_cols] = scaler.fit_transform(X[numeric_cols])

# Only now do the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("⚠️ This code has DATA LEAKAGE")
print("The imputation means and the scaler saw test-set information before the split")
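
To make the leak concrete, the sketch below compares the statistic a scaler learns from all rows with the one it learns from the training rows only. The column of values is synthetic and hypothetical, standing in for something like `Balance`; the names `leaky_scaler` and `clean_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical numeric column (one feature, 200 rows)
values = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

train, test = train_test_split(values, test_size=0.2, random_state=0)

leaky_scaler = StandardScaler().fit(values)  # sees train + test rows
clean_scaler = StandardScaler().fit(train)   # sees train rows only

print("mean learned from all rows:  ", leaky_scaler.mean_[0])
print("mean learned from train only:", clean_scaler.mean_[0])
```

The two means differ because the first one already contains the test rows. The gap is small here, but it is exactly the information a held-out set is supposed to withhold.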

### 3.2 The Solution WITH a Pipeline

✅ **Robust code without data leakage:**

In [None]:
# ✅ GOOD APPROACH - Unified pipeline

# Define columns
numeric_features = ['CreditScore', 'Age', 'Tenure', 'Balance',
                    'NumOfProducts', 'EstimatedSalary']
categorical_features = ['Geography', 'Gender']
binary_features = ['HasCrCard', 'IsActiveMember']

# Preprocessor built with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ]), categorical_features),
        ('bin', 'passthrough', binary_features)
    ],
    remainder='drop'  # Drop any columns not listed above
)

# Full pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

print("✅ Pipeline created:")
print(pipe)

In [None]:
# Split the data BEFORE any transformation
X = data.drop('Exited', axis=1)
y = data['Exited']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the pipeline (fit on train only)
pipe.fit(X_train, y_train)

# Evaluate
y_pred = pipe.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\n✅ Accuracy: {accuracy:.3f}")
print(f"\n📊 Classification Report:\n{classification_report(y_test, y_pred)}")

### 3.3 Cross-Validation with a Pipeline

Because the whole pipeline is refit inside every fold, the imputation and scaling statistics are learned only from that fold's training portion:

In [None]:
# Cross-validation with the pipeline
cv_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

print(f"CV Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")
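
What `cross_val_score` does with a pipeline can be sketched as an explicit loop: clone the unfitted pipeline, fit it on the fold's training rows, score it on the held-out rows. The example below is a simplified illustration under stated assumptions: it uses synthetic data from `make_classification` and swaps in `LogisticRegression` for speed, neither of which comes from this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_pipe = clone(demo_pipe)                         # fresh, unfitted copy
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])  # scaler fit on fold-train only
    manual_scores.append(fold_pipe.score(X_demo[val_idx], y_demo[val_idx]))

auto_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

For a classifier, `cv=5` resolves to a non-shuffled `StratifiedKFold`, so the manual loop and `cross_val_score` see the same folds.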

## 4. Custom Transformer (Module 08)

Create custom transformers for feature engineering:

In [None]:
class AgeGroupTransformer(BaseEstimator, TransformerMixin):
    """Groups ages into categories."""

    # Tuples instead of lists: avoids mutable default arguments
    def __init__(self, bins=(0, 25, 35, 50, 65, 100),
                 labels=('Young', 'Adult', 'Middle', 'Senior', 'Elderly')):
        self.bins = bins
        self.labels = labels

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        X = X.copy()
        X['AgeGroup'] = pd.cut(X['Age'], bins=list(self.bins), labels=list(self.labels))
        return X

    def get_feature_names_out(self, input_features=None):
        # Explicit None check: `if input_features` raises on NumPy arrays
        if input_features is None:
            return ['AgeGroup']
        return list(input_features) + ['AgeGroup']


# Try out the transformer
age_transformer = AgeGroupTransformer()
transformed = age_transformer.fit_transform(X_train[['Age']])
print("\n✅ AgeGroup distribution:")
print(transformed['AgeGroup'].value_counts())
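
Since `fit` learns nothing here, the same stateless step could also be sketched with scikit-learn's `FunctionTransformer` instead of a full class. The helper name `add_age_group` and the five sample ages are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_age_group(frame, bins=(0, 25, 35, 50, 65, 100),
                  labels=('Young', 'Adult', 'Middle', 'Senior', 'Elderly')):
    """Hypothetical helper: append an AgeGroup column without mutating the input."""
    out = frame.copy()
    out['AgeGroup'] = pd.cut(out['Age'], bins=list(bins), labels=list(labels))
    return out

age_group_fn = FunctionTransformer(add_age_group)

ages = pd.DataFrame({'Age': [22, 30, 45, 60, 80]})
age_groups = age_group_fn.fit_transform(ages)
print(age_groups)
```

`pd.cut` uses right-closed intervals by default, so an age of exactly 25 would land in `Young` and 26 in `Adult`. A dedicated class remains the better fit once the transformer has to learn something in `fit`.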

In [None]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Custom transformer that creates derived features.

    Similar to the FeatureEngineer in the CarVision portfolio.
    """

    def __init__(self, create_balance_salary_ratio=True):
        self.create_balance_salary_ratio = create_balance_salary_ratio

    def fit(self, X, y=None):
        # Remember the original columns
        self.feature_names_in_ = X.columns.tolist()
        return self

    def transform(self, X):
        X = X.copy()

        # Feature 1: Balance/Salary ratio (+1 avoids division by zero)
        if self.create_balance_salary_ratio:
            X['BalanceSalaryRatio'] = X['Balance'] / (X['EstimatedSalary'] + 1)

        # Feature 2: Tenure per product
        X['TenurePerProduct'] = X['Tenure'] / (X['NumOfProducts'] + 0.1)

        # Feature 3: Mature client (tenure > 5 and active)
        X['MatureClient'] = ((X['Tenure'] > 5) & (X['IsActiveMember'] == 1)).astype(int)

        return X

    def get_feature_names_out(self, input_features=None):
        # List only the features that transform() actually creates
        new_features = ['TenurePerProduct', 'MatureClient']
        if self.create_balance_salary_ratio:
            new_features.insert(0, 'BalanceSalaryRatio')
        if input_features is not None:
            return list(input_features) + new_features
        return self.feature_names_in_ + new_features


# Try out the FeatureEngineer
fe = FeatureEngineer()
X_engineered = fe.fit_transform(X_train)
print("\n✅ New features created:")
print(X_engineered[['Balance', 'EstimatedSalary', 'BalanceSalaryRatio',
                    'TenurePerProduct', 'MatureClient']].head())
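
As a sanity check on the three formulas, the sketch below applies them by hand to two hypothetical customers whose values were chosen to give round numbers. It repeats only the arithmetic from `transform`, not the class itself.

```python
import pandas as pd

# Two made-up customers for illustration
rows = pd.DataFrame({
    'Balance': [50000.0, 0.0],
    'EstimatedSalary': [99999.0, 40000.0],
    'Tenure': [6, 2],
    'NumOfProducts': [2, 1],
    'IsActiveMember': [1, 1],
})

ratio = rows['Balance'] / (rows['EstimatedSalary'] + 1)
tenure_per_product = rows['Tenure'] / (rows['NumOfProducts'] + 0.1)
mature = ((rows['Tenure'] > 5) & (rows['IsActiveMember'] == 1)).astype(int)

print(pd.DataFrame({
    'BalanceSalaryRatio': ratio,
    'TenurePerProduct': tenure_per_product,
    'MatureClient': mature,
}))
```

The first customer gets a ratio of 50000 / 100000 = 0.5 and counts as mature (tenure 6, active); the second has a zero balance and a tenure of only 2, so both the ratio and the flag are 0.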

## 5. Full Pipeline with a Custom Transformer

In [None]:
# Pipeline that includes the FeatureEngineer
numeric_features_extended = numeric_features + ['BalanceSalaryRatio', 'TenurePerProduct']
binary_features_extended = binary_features + ['MatureClient']

# New preprocessor
preprocessor_v2 = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features_extended),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ]), categorical_features),
        ('bin', 'passthrough', binary_features_extended)
    ],
    remainder='drop'
)

# Full pipeline with feature engineering
pipe_v2 = Pipeline([
    ('features', FeatureEngineer()),
    ('preprocessor', preprocessor_v2),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

print("✅ Pipeline v2 created with FeatureEngineer")
print(pipe_v2)

In [None]:
# Train and evaluate pipeline v2
pipe_v2.fit(X_train, y_train)
y_pred_v2 = pipe_v2.predict(X_test)
accuracy_v2 = accuracy_score(y_test, y_pred_v2)

# The synthetic target is random, so both scores should sit near 0.5
# and any gap between them is noise rather than a real improvement.
print("\n📊 Comparison:")
print(f"Pipeline v1 (without FeatureEngineer): {accuracy:.3f}")
print(f"Pipeline v2 (with FeatureEngineer): {accuracy_v2:.3f}")
print(f"Difference: {(accuracy_v2 - accuracy)*100:+.1f} percentage points")
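
Because `Exited` in this notebook is drawn at random, neither pipeline has real signal to find, and a single train/test comparison mostly measures noise. A rough way to see this is to compare a model against a trivial baseline under cross-validation. The sketch below uses its own synthetic features and labels, and `DummyClassifier` as the baseline; these are assumptions for illustration rather than part of the notebook's workflow.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_rand = rng.normal(size=(300, 4))      # features unrelated to the labels
y_rand = rng.integers(0, 2, size=300)   # labels drawn independently of X_rand

baseline_scores = cross_val_score(
    DummyClassifier(strategy='most_frequent'), X_rand, y_rand, cv=5
)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X_rand, y_rand, cv=5
)

print(f"Baseline: {baseline_scores.mean():.3f} (+/- {baseline_scores.std() * 2:.3f})")
print(f"Forest:   {forest_scores.mean():.3f} (+/- {forest_scores.std() * 2:.3f})")
```

If a feature-engineering change does not beat the baseline by more than the fold-to-fold spread, it is safer to treat the difference as noise.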

## 6. Detecting Data Leakage (Module 08)

### 🔴 Example of Data Leakage

In [None]:
# ❌ EXAMPLE OF DATA LEAKAGE
# Create a feature that uses information from the target

data_with_leakage = data.copy()

# ⚠️ This feature is derived from the target, including each row's own label
data_with_leakage['AvgExitByGeo'] = data_with_leakage.groupby('Geography')['Exited'].transform('mean')

print("⚠️ LEAKAGE: AvgExitByGeo is computed from the target")
print(data_with_leakage[['Geography', 'Exited', 'AvgExitByGeo']].head(10))

print("\n🔴 This feature carries information from the target, which is unknown at prediction time.")
print("The model will learn to 'cheat' by using this information.")
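
With only three geographies the effect is hard to see, so the sketch below exaggerates it. It uses a hypothetical high-cardinality column (`branch`) and purely random labels, then compares a target-mean feature computed over all rows against one computed from the training rows only. `LogisticRegression` and every name in the block are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    'branch': rng.integers(0, 500, size=n),  # hypothetical category, ~4 rows each
    'target': rng.integers(0, 2, size=n),    # random labels: no real signal
})
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)

# Leaky: group means over ALL rows, so each test row's own label is included
leaky_feature = df.groupby('branch')['target'].transform('mean')
leaky_model = LogisticRegression().fit(
    leaky_feature.loc[train_df.index].to_frame(), train_df['target']
)
leaky_acc = leaky_model.score(
    leaky_feature.loc[test_df.index].to_frame(), test_df['target']
)

# Clean: group means from training rows only; unseen branches get the train mean
train_means = train_df.groupby('branch')['target'].mean()
train_global_mean = train_df['target'].mean()
clean_train = train_df['branch'].map(train_means).to_frame()
clean_test = test_df['branch'].map(train_means).fillna(train_global_mean).to_frame()
clean_model = LogisticRegression().fit(clean_train, train_df['target'])
clean_acc = clean_model.score(clean_test, test_df['target'])

print(f"Test accuracy with leaky feature: {leaky_acc:.3f}")
print(f"Test accuracy with clean feature: {clean_acc:.3f}")
```

The labels are random, so an honest model cannot do better than chance on the test set. Any accuracy the leaky version gains above that comes from test labels smuggled into the feature.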

### ✅ How to Avoid the Leakage

In [None]:
# ✅ CORRECT: compute statistics on the training set only

class TargetEncoder(BaseEstimator, TransformerMixin):
    """Target encoding WITHOUT data leakage."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y):
        """Compute the encoding from training data only."""
        df = X.copy()
        df['__target__'] = np.asarray(y)  # positional, independent of y's index

        # Learned state is created in fit() so that refitting starts clean
        self.encoding_map_ = {}
        self.global_mean_ = df['__target__'].mean()
        for col in self.columns:
            self.encoding_map_[col] = df.groupby(col)['__target__'].mean().to_dict()

        return self

    def transform(self, X):
        """Apply the learned encoding (never sees the test y)."""
        X = X.copy()
        for col in self.columns:
            # Categories unseen during fit fall back to the training target mean
            X[f'{col}_encoded'] = X[col].map(self.encoding_map_[col]).fillna(self.global_mean_)
        return X

# Correct usage
encoder = TargetEncoder(columns=['Geography'])
encoder.fit(X_train, y_train)  # Uses y_train only

X_test_encoded = encoder.transform(X_test)  # Never sees y_test
print("✅ Correct target encoding (no leakage)")
print(X_test_encoded[['Geography', 'Geography_encoded']].head())
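
Placing a target encoder inside a `Pipeline` extends the same guarantee to cross-validation, since the encoding is then relearned from each fold's training rows. The block below is a minimal sketch with a hypothetical single-column encoder (`MeanTargetEncoder`), synthetic data, and `LogisticRegression` as a stand-in classifier; it is not the notebook's `TargetEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class MeanTargetEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical minimal encoder: one column -> per-category target mean."""

    def __init__(self, column='Geography'):
        self.column = column

    def fit(self, X, y):
        target = pd.Series(np.asarray(y), index=X.index)
        self.means_ = target.groupby(X[self.column]).mean()
        self.global_mean_ = target.mean()
        return self

    def transform(self, X):
        encoded = X[self.column].map(self.means_).fillna(self.global_mean_)
        return encoded.to_frame(name=f'{self.column}_encoded')

rng = np.random.default_rng(1)
X_geo = pd.DataFrame({'Geography': rng.choice(['France', 'Germany', 'Spain'], size=300)})
y_geo = pd.Series(rng.integers(0, 2, size=300))

encoder_pipe = Pipeline([
    ('encode', MeanTargetEncoder()),
    ('clf', LogisticRegression()),
])
geo_scores = cross_val_score(encoder_pipe, X_geo, y_geo, cv=5)
print(np.round(geo_scores, 3))
```

Each fold clones the pipeline and calls `fit` with that fold's training labels only, so the held-out labels never reach the encoder.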

## 7. Save and Load the Pipeline

In [None]:
import joblib

# Save the trained pipeline
joblib.dump(pipe_v2, 'pipeline_demo.joblib')
print("✅ Pipeline saved to 'pipeline_demo.joblib'")

# Load the pipeline
loaded_pipe = joblib.load('pipeline_demo.joblib')

# Verify that it behaves identically
y_pred_loaded = loaded_pipe.predict(X_test)
assert (y_pred_v2 == y_pred_loaded).all()
print("✅ Pipeline loaded and verified")
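
A pickled estimator is only reliably loadable with a compatible scikit-learn version, so one possible habit is to store the version next to the model. The sketch below assumes a simple dictionary layout (`model`, `sklearn_version`) and a tiny stand-in model; both are illustrative choices, not a scikit-learn convention.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model so the example runs on its own
demo_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / 'model_bundle.joblib'
    # Hypothetical bundle layout: model plus the version that produced it
    joblib.dump({'model': demo_model, 'sklearn_version': sklearn.__version__}, artifact_path)
    bundle = joblib.load(artifact_path)

same_version = bundle['sklearn_version'] == sklearn.__version__
bundle_predictions = bundle['model'].predict([[0.0], [3.0]])

print("Version matches:", same_version)
print("Predictions:", bundle_predictions)
```

Custom classes such as `FeatureEngineer` also need to be importable in the process that loads the file, because pickling stores a reference to the class rather than its source code.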

## 8. Practice Exercises

### Exercise 1: Add a new transformer
Create a transformer that groups `CreditScore` into categories (Poor, Fair, Good, Excellent).

### Exercise 2: Detect the leakage
What happens if you add `X['ExitedPrediction'] = y` before the split?

### Exercise 3: Compare models
Replace `RandomForestClassifier` with `GradientBoostingClassifier` and compare.

---

## 📚 Resources

- [Module 07: sklearn Pipelines](../07_SKLEARN_PIPELINES.md)
- [Module 08: Feature Engineering](../08_INGENIERIA_FEATURES.md)
- [EJERCICIOS.md](../EJERCICIOS.md) - Exercises 7.1, 7.2, 8.1, 8.2
- [RECURSOS_POR_MODULO.md](../RECURSOS_POR_MODULO.md) - Recommended videos

In [None]:
# Cleanup
import os
if os.path.exists('pipeline_demo.joblib'):
    os.remove('pipeline_demo.joblib')
    print("🧹 Temporary file removed")