# 02 - Data Preparation and Feature Engineering

This notebook demonstrates data transformation and feature creation techniques, with Copilot assisting along the way.

**Objectives:**
- Data cleaning and transformation
- Feature engineering
- Scaling and normalization
- Validation with Pandera
- Train/test split

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
import pandera as pa
from pandera import Column, DataFrameSchema, Check

print("✅ Libraries loaded")

## 1. Load Data from the Previous Notebook

In [None]:
# Load the explored dataset
df = pd.read_parquet('./data/exploracion_completa.parquet')
print(f"Dataset loaded: {df.shape}")
df.head()

## 2. Feature Engineering

💡 **Copilot tip:** Use `Ctrl+I` to generate custom features based on domain knowledge.

In [None]:
# Create new features on a copy so the loaded data stays untouched
df_engineered = df.copy()

# Example: interactions between the top features
df_engineered['feature_0_x_1'] = df['feature_0'] * df['feature_1']
df_engineered['feature_0_squared'] = df['feature_0'] ** 2

# Ratio; the small constant avoids division by an exact zero, but values of
# feature_1 very close to -1e-8 can still produce extreme ratios
df_engineered['feature_ratio_0_1'] = df['feature_0'] / (df['feature_1'] + 1e-8)

# Row-wise aggregations over the original feature_* columns
# (taken from df, so the engineered columns above are not included)
feature_cols = [col for col in df.columns if col.startswith('feature_')]
df_engineered['feature_sum'] = df[feature_cols].sum(axis=1)
df_engineered['feature_mean'] = df[feature_cols].mean(axis=1)
df_engineered['feature_std'] = df[feature_cols].std(axis=1)

print(f"Features created. Shape: {df_engineered.shape}")
print(f"New features: {df_engineered.shape[1] - df.shape[1]}")
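
The ratio above adds a small constant to the denominator, which only protects against an exact zero. A minimal sketch of one alternative follows, assuming it is acceptable to treat near-zero denominators as missing and impute them later. The helper name `safe_ratio`, the `eps` threshold, and the `demo` data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(num: pd.Series, den: pd.Series, eps: float = 1e-8) -> pd.Series:
    """Divide num by den, returning NaN wherever |den| is below eps."""
    den_masked = den.where(den.abs() >= eps)  # near-zero denominators become NaN
    return num / den_masked


# Synthetic rows: one ordinary denominator, one exact zero, one tiny negative.
demo = pd.DataFrame({
    "feature_0": [1.0, 2.0, 3.0],
    "feature_1": [2.0, 0.0, -5e-9],
})
demo["feature_ratio_0_1"] = safe_ratio(demo["feature_0"], demo["feature_1"])
print(demo)
```

The resulting NaN values would need to be imputed or dropped before training any model that does not accept missing values.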

## 3. Schema Validation with Pandera

In [None]:
# Define the expected schema
schema = DataFrameSchema(
    {
        "feature_0": Column(float, checks=Check.in_range(-5, 5)),
        "feature_1": Column(float),
        "target": Column(int, checks=Check.isin([0, 1])),
        "feature_sum": Column(float, nullable=False),
        "feature_mean": Column(float, nullable=False),
    },
    strict=False  # Allow additional columns
)

# Validate
try:
    validated_df = schema.validate(df_engineered)
    print("✅ Schema validation passed")
except pa.errors.SchemaError as e:
    print(f"❌ Validation error: {e}")
    raise  # Stop here rather than continue with data that failed validation

## 4. Feature Scaling

In [None]:
# Separate features and target
X = df_engineered.drop('target', axis=1)
y = df_engineered['target']

# Train/test split (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
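
Since the split uses `stratify=y`, the class proportions in both partitions should closely match those of the full dataset. A minimal sketch of how to check this follows, run on synthetic data. The names `X_demo`, `y_demo`, and `proportions`, and the roughly 20% positive rate, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (assumption: roughly 20% positives).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature_0": rng.normal(size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.2).astype(int), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions of the full data and of each partition, side by side.
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(proportions)
```

In the notebook itself, the same check could be applied to `y`, `y_train`, and `y_test`.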

In [None]:
# Apply StandardScaler (fit on the training set only, then reuse it on the test set)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

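print("✅ Scaling complete")
# pandas .std() uses ddof=1 while StandardScaler uses ddof=0,
# so the reported std is close to 1 rather than exactly 1
print(f"Train mean (post-scaling): {X_train_scaled.mean().mean():.6f}")
print(f"Train std (post-scaling): {X_train_scaled.std().mean():.6f}")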
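
The imports at the top also bring in `RobustScaler`, which the notebook does not use afterwards. A minimal sketch of when it might be preferable follows, assuming a feature with a few extreme outliers. The data is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=(500, 1))
values[:5] = 1_000.0  # inject a few extreme outliers (synthetic assumption)

standard = StandardScaler().fit_transform(values)  # centers on mean, scales by std
robust = RobustScaler().fit_transform(values)      # centers on median, scales by IQR

# Spread of the non-outlier rows after each transformation.
inlier_std_standard = standard[5:].std()
inlier_std_robust = robust[5:].std()
print(f"Inlier std, StandardScaler: {inlier_std_standard:.4f}")
print(f"Inlier std, RobustScaler:   {inlier_std_robust:.4f}")
```

With `StandardScaler`, the outliers inflate the standard deviation and squeeze the bulk of the data into a very narrow range. `RobustScaler` relies on the median and interquartile range, so the inliers keep a usable spread.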

## 5. Before/After Comparison

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before scaling
X_train['feature_0'].hist(bins=50, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Before Scaling', fontsize=14, fontweight='bold')
axes[0].set_xlabel('feature_0')

# After scaling
X_train_scaled['feature_0'].hist(bins=50, ax=axes[1], color='salmon', edgecolor='black')
axes[1].set_title('After Scaling', fontsize=14, fontweight='bold')
axes[1].set_xlabel('feature_0 (scaled)')

plt.tight_layout()
plt.show()

## 6. Save Prepared Data

In [None]:
from pathlib import Path

import joblib

# Make sure the output directories exist before writing
Path('./data').mkdir(parents=True, exist_ok=True)
Path('./models').mkdir(parents=True, exist_ok=True)

# Save processed datasets
X_train_scaled.to_parquet('./data/X_train.parquet')
X_test_scaled.to_parquet('./data/X_test.parquet')
y_train.to_frame().to_parquet('./data/y_train.parquet')
y_test.to_frame().to_parquet('./data/y_test.parquet')

# Save the scaler for inference
joblib.dump(scaler, './models/scaler.joblib')

print("✅ Prepared data saved")
print("✅ Scaler saved for inference")

## 7. Summary

**Transformations applied:**
- ✅ Feature engineering (interactions, ratios, aggregations)
- ✅ Schema validation with Pandera
- ✅ Stratified train/test split
- ✅ Scaling with StandardScaler

**Next step:** Model training (notebook 03)