# Churn Prediction Project - Feature Engineering

### Summary of transformations

1. Basic cleaning (before the split)

- Remove duplicates: df.drop_duplicates()

- Fix data types: convert mistyped columns (pd.to_numeric, pd.to_datetime)

- Drop irrelevant columns (e.g. customerID if it adds no value)

2. EDA and initial analysis

- This step is exploratory: it does not modify the data directly, but it guides the later decisions

3. Separate features and target

4. Train/test split

5. Missing-value handling (fit on train only, then transform)

- Impute numeric columns: mean or median

- Impute categorical columns: mode or a "missing" category

On valid/test: transform only

This can be done here in the notebook or inside the pipeline if you plan to reuse it.

6. Outliers (optional) (thresholds computed on train only)

- Detect with Z-score or IQR

- Treat them if they have a large effect: remove them or apply RobustScaler

On valid/test: apply the same transformation (do not remove rows from valid!)

7. Correct skewness (fit on train only)

- Apply np.log1p() or PowerTransformer if needed

8. Categorical encoding (fit on train, transform on valid)

- pd.get_dummies() or OneHotEncoder()

- OrdinalEncoder() for ordinal features; LabelEncoder() only for the target

9. Numeric scaling (fit on train, transform on valid)

- StandardScaler() for linear models, KNN, neural networks

- RobustScaler() if there are outliers

- No scaling needed for tree-based models (RandomForest, XGBoost...)

10. Class balancing (train only)

- SMOTE, RandomOverSampler, etc.

- Or use class_weight='balanced' in models that support it

11. Feature selection (fit on train, transform on valid)

- Drop irrelevant or highly correlated columns

- Use SelectKBest, RFE, or model.feature_importances_

12. Feature engineering (creating new variables)

- Ratios, interactions, extractions...

This can be done before the split when the new feature is computed row by row, or inside the pipeline if you need reproducibility.

13. Validation and modeling

Once the data are:

- Imputed

- Encoded

- Scaled

- Balanced (train only)

you can train your model and validate it.

14. Build the production pipeline

Group all of the steps above so they can be applied to new, real data.


## 5. Feature Engineering

In [3]:
# Load the data

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib

df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Keep a copy to preserve a raw version
# df_raw = df.copy()

# Quick look
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Processing and Feature Engineering

For processing and feature engineering, keep in mind that:

- The goal is to go from raw data to data that are ready to train the model.

- Many transformations and decisions are guided by the EDA, for example:

  - If there are categorical variables (strings) → convert them to numeric values through encoding.

  - If distributions are skewed or differ widely in scale → consider transformations (for example, logarithmic).

  - If there are missing values → impute or remove them.

  - If there are outliers → treat or transform them.

  - If there are strong correlations between variables → consider dropping one of them.
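
As a minimal sketch of the skewness point in the list above, assuming a synthetic right-skewed column (the `charges` name and the lognormal data are illustrative and not taken from the Telco dataset), `np.log1p` compresses the long right tail:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data (illustrative only; not the Telco dataset)
rng = np.random.default_rng(42)
charges = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="charges")

skew_before = charges.skew()
# log1p(x) = log(1 + x): defined at x = 0 and compresses large values
skew_after = np.log1p(charges).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness close to 0 after the transform suggests the variable is closer to symmetric, which tends to help linear and distance-based models.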

### Where to apply each transformation

Transformations can be applied at different points in the workflow depending on their nature:

**On the whole dataset (before splitting)**

When they fix global errors or data-quality issues that affect every part of the dataset equally.

Examples: removing duplicates, fixing data types, cleaning inconsistent categories.

**On the training set only**

When the process learns information from the data (such as means, modes or imputation values).

If these steps were fitted before the split, information from the test set would leak into training (data leakage) and the evaluation would look better than it really is.

Examples: imputation, outlier detection and removal, target-based feature selection, class balancing.

**Inside the Feature Pipeline**

When the transformation must be repeated in exactly the same way when the model is applied to new data.

Examples: scaling, encoding, imputation, numeric transformations, creation of new features.

**NOTE**: The pipeline is fitted on the training data only and is then used to transform both the validation data and any new data.
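
The fit-on-train, transform-everywhere rule can be sketched as follows. This is a minimal illustration on synthetic data; the `X_demo` names are assumptions made for the example and are not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric feature (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 1))

X_demo_train, X_demo_test = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_demo_train_scaled = scaler_demo.fit_transform(X_demo_train)  # learns mean/std from train only
X_demo_test_scaled = scaler_demo.transform(X_demo_test)        # reuses the train statistics

# The stored mean is the training mean, not the mean of the full dataset
print(scaler_demo.mean_[0], X_demo_train.mean(), X_demo.mean())
```

Because the test rows never take part in `fit`, the scaled test set will generally not have mean exactly 0, which is the expected behavior.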

## Basic cleaning before the split

In [4]:
# Remove duplicates
df = df.drop_duplicates()

In [5]:
# Fix data types (e.g. TotalCharges to numeric); unparseable values become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

In [6]:
# Drop the column because it adds no value
df = df.drop(columns=["customerID"])

In [7]:
df.info()  # check for null values, object dtypes, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 


### Features and target split

In [8]:
# Separate the target from the features (customerID was already dropped above)
X = df.drop("Churn", axis=1)
y = df["Churn"].map({"No": 0, "Yes": 1})  # Optional: convert to numeric

### Train-test split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # keeps the churn class ratio (roughly 73/27) the same in train and test
)
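
As a quick check of what `stratify` does, here is a minimal sketch on a toy target. The `y_toy` and `X_toy` objects and the 30% positive rate are assumptions made for illustration, not the real Churn column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 30% positives (illustrative; not the real Churn column)
y_toy = pd.Series([1] * 300 + [0] * 700)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

_, _, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both subsets keep approximately the same positive rate
print(y_toy_train.mean(), y_toy_test.mean())
```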

### Identify numeric and categorical columns

In [10]:
num_cols = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = [col for col in X_train.columns if col not in num_cols]

## Impute missing values

In [11]:
# Check whether converting TotalCharges created any NaNs
X_train["TotalCharges"].isnull().sum()  # a few are expected


np.int64(8)

In [12]:
# # Option 1: Manual -> impute with the training median
# mediana_total = X_train["TotalCharges"].median() # the median is robust to extreme values

# X_train["TotalCharges"] = X_train["TotalCharges"].fillna(mediana_total)
# X_test["TotalCharges"] = X_test["TotalCharges"].fillna(mediana_total)

In [13]:
# Option 2: scikit-learn -> can be used inside sklearn pipelines
from sklearn.impute import SimpleImputer

# Imputer for numeric columns -> strategy "mean" or "median" (KNNImputer is an alternative)
num_imputer = SimpleImputer(strategy="median")
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])

# # Imputer for categorical columns -> mode (KNN-based imputation requires encoding first)
# cat_imputer = SimpleImputer(strategy="most_frequent")
# X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
# X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])


## Scale numeric variables

In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer

# Option 1: StandardScaler
# Standardizes each variable to mean 0 and standard deviation 1.

scaler = StandardScaler()

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])


# # Option 2: MinMaxScaler
# # Scales values to the range [0, 1].
# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(X_train[num_cols])
# X_test_scaled = scaler.transform(X_test[num_cols])


# # Option 3: RobustScaler
# # Uses the median and the interquartile range (IQR).
# from sklearn.preprocessing import RobustScaler

# scaler = RobustScaler()
# X_train_scaled = scaler.fit_transform(X_train[num_cols])
# X_test_scaled = scaler.transform(X_test[num_cols])

In [15]:
X_train

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
3738,Male,-0.441773,No,No,0.102371,No,No phone service,DSL,No,No,Yes,No,Yes,Yes,Month-to-month,No,Electronic check,-0.521976,-0.263290
3151,Male,-0.441773,Yes,Yes,-0.711743,Yes,No,Fiber optic,Yes,No,No,No,No,No,Month-to-month,No,Mailed check,0.337478,-0.504815
4860,Male,-0.441773,Yes,Yes,-0.793155,No,No phone service,DSL,Yes,Yes,No,Yes,No,No,Two year,No,Mailed check,-0.809013,-0.751214
3867,Female,-0.441773,Yes,No,-0.263980,Yes,No,DSL,No,Yes,Yes,No,Yes,Yes,Two year,Yes,Credit card (automatic),0.284384,-0.173700
3810,Male,-0.441773,Yes,Yes,-1.281624,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,No,Electronic check,-0.676279,-0.990851
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6303,Female,-0.441773,Yes,No,1.567778,Yes,Yes,Fiber optic,No,Yes,Yes,Yes,Yes,Yes,Two year,No,Electronic check,1.470695,2.373711
6227,Male,-0.441773,No,No,-1.240918,Yes,No,DSL,No,No,No,No,No,No,Month-to-month,No,Bank transfer (automatic),-0.626504,-0.975133
4673,Female,2.263606,No,No,-0.304686,Yes,Yes,Fiber optic,Yes,Yes,No,No,Yes,Yes,Month-to-month,Yes,Mailed check,1.256662,0.157569
2710,Female,-0.441773,Yes,No,-0.345392,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,No,Credit card (automatic),-1.477661,-0.798435


## Encode categorical variables (One-Hot Encoding)

In [16]:
# # Option 1: get_dummies (one-hot encoding with pandas)
# # Encode train and valid separately, then align their columns
# X_train_encoded = pd.get_dummies(X_train[cat_cols], drop_first=True)
# X_valid_encoded = pd.get_dummies(X_test[cat_cols], drop_first=True)
#
# # Align columns (in case a category appears in train but not in valid, or vice versa)
# X_train_encoded, X_valid_encoded = X_train_encoded.align(X_valid_encoded, join='left', axis=1, fill_value=0)

# X_train_encoded

In [17]:
# Option 2: LabelEncoder / OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder  # use LabelEncoder for the target and OrdinalEncoder for features

encoder = OrdinalEncoder()  # works inside a pipeline
X_train[cat_cols] = encoder.fit_transform(X_train[cat_cols])
X_test[cat_cols] = encoder.transform(X_test[cat_cols])

X_train


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
3738,1.0,-0.441773,0.0,0.0,0.102371,0.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0,0.0,0.0,2.0,-0.521976,-0.263290
3151,1.0,-0.441773,1.0,1.0,-0.711743,1.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.337478,-0.504815
4860,1.0,-0.441773,1.0,1.0,-0.793155,0.0,1.0,0.0,2.0,2.0,0.0,2.0,0.0,0.0,2.0,0.0,3.0,-0.809013,-0.751214
3867,0.0,-0.441773,1.0,0.0,-0.263980,1.0,0.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,1.0,1.0,0.284384,-0.173700
3810,1.0,-0.441773,1.0,1.0,-1.281624,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,-0.676279,-0.990851
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6303,0.0,-0.441773,1.0,0.0,1.567778,1.0,2.0,1.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0,1.470695,2.373711
6227,1.0,-0.441773,0.0,0.0,-1.240918,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.626504,-0.975133
4673,0.0,2.263606,0.0,0.0,-0.304686,1.0,2.0,1.0,2.0,2.0,0.0,0.0,2.0,2.0,0.0,1.0,3.0,1.256662,0.157569
2710,0.0,-0.441773,1.0,0.0,-0.345392,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,-1.477661,-0.798435


In [18]:
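# Option 3: OneHotEncoder (0s and 1s)
from sklearn.preprocessing import OneHotEncoder  # works inside a pipeline

# Note: X_train[cat_cols] was overwritten with ordinal codes in the previous cell,
# so the feature names are built from codes (e.g. gender_0.0), not the original labels.
# Only the categorical columns are encoded; the scaled numeric columns are not part of X_train_ohe.
ohe = OneHotEncoder()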
X_train_ohe = ohe.fit_transform(X_train[cat_cols])
X_test_ohe = ohe.transform(X_test[cat_cols])

## Outliers

The IQR (interquartile range) method is a widely used, simple and robust technique for detecting outliers.

The IQR is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile) of a numeric variable.

IQR = Q3 - Q1

A value is considered an outlier if it lies:

Below: Q1 - 1.5 * IQR

Above: Q3 + 1.5 * IQR

This method is robust to the presence of outliers because it is based on percentiles rather than the mean.
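
A small worked example of the rule above, using made-up values rather than data from the notebook:

```python
import numpy as np

# Illustrative values with one obvious extreme (not taken from the dataset)
values = np.array([10, 12, 13, 14, 15, 16, 18, 20, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag everything outside [lower, upper]
outlier_mask = (values < lower) | (values > upper)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print("outliers:", values[outlier_mask])
```

Here Q1 = 13 and Q3 = 18, so the IQR is 5 and the bounds are 5.5 and 25.5; only the value 95 falls outside them.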

In [19]:
# def iqr_bounds(series):
#     # Bounds are learned from the training data only
#     q1 = series.quantile(0.25)
#     q3 = series.quantile(0.75)
#     iqr = q3 - q1
#     return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# for col in num_cols:
#     lower, upper = iqr_bounds(X_train[col])
#     X_train[col] = X_train[col].clip(lower, upper)
#     X_test[col] = X_test[col].clip(lower, upper)  # transform only, using the train bounds

## Class balancing (train only)

Applied when the target column of a classification problem is imbalanced.

- Undersampling -> remove rows from the majority class
- Oversampling -> add rows to the minority class
- Class weight -> the model adjusts the weight of each class
- Classification threshold -> move the probability cutoff that separates one class from the other
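
The last two options in the list above need no resampling. The following is a minimal sketch on a synthetic imbalanced problem; the data, the `LogisticRegression` model and the 0.3 threshold are assumptions chosen for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (about 10% positives); illustrative only
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

# Alternative: keep the plain model but lower the decision threshold from 0.5 to 0.3
proba = plain.predict_proba(X_te)[:, 1]
recall_low_threshold = recall_score(y_te, (proba >= 0.3).astype(int))

print(recall_plain, recall_weighted, recall_low_threshold)
```

Both options typically raise recall on the minority class at the cost of more false positives, so precision should be checked as well.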

In [23]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_ohe, y_train)


In [24]:
import pandas as pd

# Get the column names from the OneHotEncoder
ohe_columns = ohe.get_feature_names_out(cat_cols)

# Convert the sparse matrix to a dense array
X_train_bal_dense = X_train_bal.toarray()

# Convert to a DataFrame
X_train_bal_df = pd.DataFrame(X_train_bal_dense, columns=ohe_columns)

# Show the first rows
X_train_bal_df.head()


Unnamed: 0,gender_0.0,gender_1.0,Partner_0.0,Partner_1.0,Dependents_0.0,Dependents_1.0,PhoneService_0.0,PhoneService_1.0,MultipleLines_0.0,MultipleLines_1.0,...,StreamingMovies_2.0,Contract_0.0,Contract_1.0,Contract_2.0,PaperlessBilling_0.0,PaperlessBilling_1.0,PaymentMethod_0.0,PaymentMethod_1.0,PaymentMethod_2.0,PaymentMethod_3.0
0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [25]:
X_train_bal_df.shape

(8278, 41)

In [26]:
y_train_bal.head()

0    0
1    0
2    0
3    0
4    0
Name: Churn, dtype: int64

## Discretization

Converts continuous numeric variables into categories (bins).

Use it if you believe a variable behaves differently across ranges.

Example: tenure → [0-12], [13-24], etc.

It is useful if you want interpretable models that capture non-linear effects, or if it improves performance.
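
A minimal sketch of binning with `pd.cut`, assuming made-up tenure values in months. The bin labels are illustrative, and `include_lowest=True` is used so that a tenure of 0 falls into the first bin instead of becoming NaN:

```python
import pandas as pd

# Illustrative tenure values in months, including 0 (a brand-new customer)
tenure_demo = pd.Series([0, 5, 12, 13, 24, 30, 60, 72], name="tenure")

bins = [0, 12, 24, 48, 60, 72]
labels = ["0-12", "13-24", "25-48", "49-60", "61-72"]

# Intervals are closed on the right; include_lowest=True also closes the first one on the left
tenure_bin = pd.cut(tenure_demo, bins=bins, labels=labels, include_lowest=True)
print(tenure_bin.tolist())
```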

In [27]:
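# # Not applied in this case; shown only as an example
# # include_lowest=True keeps tenure == 0 in the first bin (otherwise it becomes NaN)
# df['tenure_bin'] = pd.cut(df['tenure'], bins=[0, 12, 24, 48, 60, 72], labels=False, include_lowest=True)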

## Feature Selection

What it is: removing variables that add no value or are highly correlated with others.

When to do it: when you have many variables (not the case here).

Options:

- Correlation (df.corr() to drop near-duplicate variables)

- Automatic methods (SelectKBest, RFE, etc.)

- Feature importance from models such as RandomForest
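
As a minimal sketch of the automatic option, assuming synthetic data with three informative features (the `f0`..`f7` column names and `k=3` are illustrative), `SelectKBest` is fitted on the training split only because its scores depend on the target:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features and 5 noise features (illustrative only)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_syn = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(8)])
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

selector = SelectKBest(score_func=f_classif, k=3)
X_tr_sel = selector.fit_transform(X_tr, y_tr)  # scores use the target, so fit on train only
X_te_sel = selector.transform(X_te)            # keep the same columns on test

selected = list(X_tr.columns[selector.get_support()])
print(selected, X_tr_sel.shape, X_te_sel.shape)
```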

## Save the processed data and features

In [28]:
import pandas as pd

# Convert the sparse matrices to dense DataFrames
X_train_df = pd.DataFrame(X_train_bal.toarray())
X_test_df = pd.DataFrame(X_test_ohe.toarray())

# Convert y_train_bal and y_test to Series if they are not already
y_train_series = pd.Series(y_train_bal).reset_index(drop=True)
y_test_series = pd.Series(y_test).reset_index(drop=True)

# Save as CSV in the data/processed folder (without the index)
X_train_df.to_csv("../data/processed/X_train.csv", index=False)
y_train_series.to_csv("../data/processed/y_train.csv", index=False)

X_test_df.to_csv("../data/processed/X_test.csv", index=False)
y_test_series.to_csv("../data/processed/y_test.csv", index=False)

print("Data saved in data/processed/")

Data saved in data/processed/


# Pipelines

## Feature Pipeline

In [None]:
# feature_pipeline.py

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

def build_feature_pipeline(numerical_cols, categorical_cols):
    # Numeric columns: imputation + scaling
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])

    # Categorical columns: one-hot encoding (an imputer step could be added before it)
    # Note: `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
    cat_pipeline = Pipeline([
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])

    # ColumnTransformer applies each sub-pipeline to its own column group
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, numerical_cols),
        ("cat", cat_pipeline, categorical_cols)
    ])

    return full_pipeline


def preprocess_new_data(df_raw, pipeline):
    """
    Apply the same transformations to a new raw DataFrame.

    `pipeline` must already be fitted on the training data (for example,
    build_feature_pipeline(...).fit(X_train)); here it is only used to
    transform, never re-fitted.
    """
    # Drop columns that are not used as features
    df = df_raw.drop(columns=["customerID"], errors="ignore")

    # Convert TotalCharges to numeric; unparseable values become NaN
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

    # NaNs are filled by the pipeline's imputer with the training median,
    # so no statistic is computed from the new data

    # Transform only: the pipeline keeps the parameters learned on train
    df_processed = pipeline.transform(df)

    return df_processed
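
A possible end-to-end usage is sketched below, assuming a tiny made-up training frame. The `*_demo` names, the `feature_pipeline.joblib` file name and the temporary directory are hypothetical. The transformer is fitted once on training data, saved with `joblib`, reloaded, and then used only to transform new rows:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative training frame (values are made up)
train_demo = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})
num_demo, cat_demo = ["tenure", "MonthlyCharges"], ["Contract"]

pipeline_demo = ColumnTransformer(
    [
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())]), num_demo),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_demo),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipeline_demo.fit(train_demo)  # fit once, on training data only

# Persist and reload, as a serving process would (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "feature_pipeline.joblib")
joblib.dump(pipeline_demo, path)
loaded = joblib.load(path)

# New data: a missing numeric value and an unseen category are both handled
new_demo = pd.DataFrame({
    "tenure": [10, None],
    "MonthlyCharges": [70.70, 20.00],
    "Contract": ["Two year", "One year"],
})
features_new = loaded.transform(new_demo)
print(features_new.shape)
```

The missing tenure is filled with the training median and the unseen "Two year" contract is encoded as all zeros, so nothing is learned from the new rows.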
