# **Feature Engineering**

The feature engineering will be performed using scikit-learn pipelines and transformers.

📚 **Import the libraries**

In [1]:
# base libraries for data science
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

💾 **Load the data**

In [2]:
# display floats with only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)

BASE_DIR = Path("/home/lof/Projects/Telco-Customer-Churn")
DATA_DIR = BASE_DIR / "data" / "interim"
churn_df = pd.read_parquet(DATA_DIR / "churn_type_fixed.parquet", engine="pyarrow")

In [3]:
cols_boolean = [
    "SeniorCitizen",
    "Partner",
    "Dependents",
    "PhoneService",
    "PaperlessBilling",
    "Churn",
]
churn_df[cols_boolean] = churn_df[cols_boolean].astype("category")
cols_categoric = [
    "StreamingMovies",
    "InternetService",
    "StreamingTV",
    "OnlineSecurity",
    "MultipleLines",
    "DeviceProtection",
    "TechSupport",
    "gender ",
    "OnlineBackup",
    "PaymentMethod",
    "Contract",
]
churn_df[cols_categoric] = churn_df[cols_categoric].astype("category")

👷 **Data preparation**

In [4]:
selected_features = churn_df.columns.tolist()
churn_features = churn_df[selected_features].copy()
churn_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24742 entries, 0 to 24741
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   MonthlyCharges    24527 non-null  float64 
 1   StreamingMovies   24415 non-null  category
 2   Partner           24639 non-null  category
 3   PhoneService      24525 non-null  category
 4   InternetService   24445 non-null  category
 5   StreamingTV       24375 non-null  category
 6   OnlineSecurity    24393 non-null  category
 7   MultipleLines     24493 non-null  category
 8   Dependents        24585 non-null  category
 9   DeviceProtection  24383 non-null  category
 10  SeniorCitizen     24636 non-null  category
 11  TotalCharges      24560 non-null  float64 
 12  TechSupport       24379 non-null  category
 13  gender            24684 non-null  category
 14  PaperlessBilling  24504 non-null  category
 15  tenure            24562 non-null  float64 
 16  Churn             2457

**Null values**

We check whether the target variable contains nulls. The target cannot be imputed, so any rows where it is null must be discarded.

In [5]:
churn_features.isna().sum()

MonthlyCharges      215
StreamingMovies     327
Partner             103
PhoneService        217
InternetService     297
StreamingTV         367
OnlineSecurity      349
MultipleLines       249
Dependents          157
DeviceProtection    359
SeniorCitizen       106
TotalCharges        182
TechSupport         363
gender               58
PaperlessBilling    238
tenure              180
Churn               171
OnlineBackup        368
PaymentMethod       240
Contract            238
dtype: int64

All rows where "Churn" is NaN must be dropped: the target cannot contain nulls if we want to stratify the split on it.

In [6]:
churn_features = churn_features.dropna(subset=["Churn"])
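A minimal sketch of the same filter on a hypothetical toy frame (`toy_df` is not part of the notebook). It illustrates that `dropna(subset=...)` removes only the rows with a missing target and leaves feature nulls in place for the imputers to handle later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row lacks the target, another lacks a feature
toy_df = pd.DataFrame(
    {
        "tenure": [1.0, np.nan, 12.0, 24.0],
        "Churn": [0.0, 1.0, np.nan, 1.0],
    }
)

# Drop only the rows where the target is missing
toy_clean = toy_df.dropna(subset=["Churn"])

print(toy_clean.shape)  # rows with a null target are gone
print(toy_clean["tenure"].isna().sum())  # feature nulls are kept for imputation
```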

**Duplicate data**

We count the duplicate records so we can remove them and avoid data leakage between the train and test sets.

In [7]:
duplicate_rows = churn_features.duplicated().sum()
print("Number of duplicate records: ", duplicate_rows)

Number of duplicate records:  17705


In [8]:
churn_features.drop_duplicates(inplace=True, keep="first")
churn_features.duplicated().sum()  # Check that no duplicates remain

np.int64(0)
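As an illustration of what the two cells above do, here is a small sketch on made-up data (the frame and its values are hypothetical). `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates(keep="first")` keeps exactly one copy of each row.

```python
import pandas as pd

# Hypothetical toy data with two repeated rows
toy_df = pd.DataFrame(
    {"tenure": [1, 1, 5, 5, 9], "Contract": ["A", "A", "B", "B", "C"]}
)

# Count rows that repeat an earlier row
n_duplicates = int(toy_df.duplicated().sum())

# Keep the first occurrence of each row; returns a new frame instead of mutating
deduplicated = toy_df.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Returning a new frame rather than using `inplace=True` is a stylistic choice here; both approaches remove the same rows.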

**Data cleaning**

In the previous step (EDA) we identified the variables that contain erroneous data, so we deal with them before starting the feature engineering process.

In [9]:
# Replace the incorrect values with np.nan
churn_features["StreamingTV"] = churn_features["StreamingTV"].replace("5412335", np.nan)
churn_features["StreamingMovies"] = churn_features["StreamingMovies"].replace("1523434", np.nan)
churn_features["MultipleLines"] = churn_features["MultipleLines"].replace("1244132", np.nan)
churn_features["DeviceProtection"] = churn_features["DeviceProtection"].replace("1243524", np.nan)

cols = ["StreamingTV", "StreamingMovies", "MultipleLines", "DeviceProtection"]
for col in cols:
    print(f"Unique values in {col}: {churn_features[col].unique()}")

Unique values in StreamingTV: ['No', 'Yes', 'No internet service', NaN]
Categories (3, object): ['No', 'No internet service', 'Yes']
Unique values in StreamingMovies: ['Yes', 'No', 'No internet service', NaN]
Categories (3, object): ['No', 'No internet service', 'Yes']
Unique values in MultipleLines: ['Yes', 'No', 'No phone service', NaN]
Categories (3, object): ['No', 'No phone service', 'Yes']
Unique values in DeviceProtection: ['Yes', 'No', 'No internet service', NaN]
Categories (3, object): ['No', 'No internet service', 'Yes']



In [10]:
# Define constants for the valid range limits
MIN_MONTHLY_CHARGE = 0
MAX_MONTHLY_CHARGE = 500
MIN_TOTAL_CHARGE = 0
MAX_TOTAL_CHARGE = 20000

# Replace out-of-range values with np.nan.
# Note: if either column fails its check (or is already NaN), both columns are set to NaN.
churn_features.loc[
    ~(
        (churn_features["MonthlyCharges"] > MIN_MONTHLY_CHARGE)
        & (churn_features["MonthlyCharges"] < MAX_MONTHLY_CHARGE)
        & (churn_features["TotalCharges"] > MIN_TOTAL_CHARGE)
        & (churn_features["TotalCharges"] < MAX_TOTAL_CHARGE)
    ),
    ["MonthlyCharges", "TotalCharges"],
] = np.nan
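The mask above couples the two columns: a row that fails any one check, or that already has a NaN in either column, gets both values nulled. If that coupling is not intended, a per-column variant could look like the sketch below. The `limits` dictionary and the toy frame are hypothetical and only mirror the constants defined in the cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one out-of-range charge and one pre-existing NaN
toy = pd.DataFrame(
    {
        "MonthlyCharges": [70.0, 900.0, np.nan],
        "TotalCharges": [1500.0, 1500.0, 300.0],
    }
)

# Assumed exclusive limits per column, mirroring the notebook's constants
limits = {"MonthlyCharges": (0, 500), "TotalCharges": (0, 20000)}

per_column = toy.copy()
for col, (low, high) in limits.items():
    # between(..., inclusive="neither") is False for NaN and for out-of-range values
    in_range = per_column[col].between(low, high, inclusive="neither")
    # Series.where keeps values where the condition holds and writes NaN elsewhere
    per_column[col] = per_column[col].where(in_range)

print(per_column)
```

With this variant a valid `TotalCharges` survives even when `MonthlyCharges` on the same row is invalid, which leaves more information for the imputer.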

In [11]:
# Define constants for the valid range limits
MIN_MONTHLY_CHARGE = 0
MAX_MONTHLY_CHARGE = 500
MIN_TOTAL_CHARGE = 0
MAX_TOTAL_CHARGE = 20000


def replace_out_of_range_values(X: pd.DataFrame) -> pd.DataFrame:
    """
    Replace out-of-range values in the 'MonthlyCharges' and 'TotalCharges' columns with np.nan.
    """
    assert isinstance(X, pd.DataFrame)
    X = X.copy()  # avoid mutating the DataFrame passed in by the pipeline
    mask = ~(
        (X["MonthlyCharges"] > MIN_MONTHLY_CHARGE)
        & (X["MonthlyCharges"] < MAX_MONTHLY_CHARGE)
        & (X["TotalCharges"] > MIN_TOTAL_CHARGE)
        & (X["TotalCharges"] < MAX_TOTAL_CHARGE)
    )
    X.loc[mask, ["MonthlyCharges", "TotalCharges"]] = np.nan
    return X


def replace_invalid_values(X: pd.DataFrame) -> pd.DataFrame:
    """
    Replace invalid values in specific columns with np.nan
    and convert certain columns to the 'category' dtype.
    """
    assert isinstance(X, pd.DataFrame)
    X = X.copy()  # avoid mutating the DataFrame passed in by the pipeline

    # Replace invalid values (column -> invalid code found during EDA)
    invalid_values = {
        "StreamingTV": "5412335",
        "StreamingMovies": "1523434",
        "MultipleLines": "1244132",
        "DeviceProtection": "1243524",
    }
    X = X.replace(invalid_values, np.nan)

    # Columns to convert to the 'category' dtype
    cols_boolean = [
        "SeniorCitizen",
        "Partner",
        "Dependents",
        "PhoneService",
        "PaperlessBilling",
        "Churn",
    ]
    cols_categoric = [
        "StreamingMovies",
        "InternetService",
        "StreamingTV",
        "OnlineSecurity",
        "MultipleLines",
        "DeviceProtection",
        "TechSupport",
        "gender ",  # the column name carries a trailing space in this dataset
        "OnlineBackup",
        "PaymentMethod",
        "Contract",
    ]

    # Convert to 'category' only the columns that exist in the DataFrame
    existing_cols_boolean = [col for col in cols_boolean if col in X.columns]
    existing_cols_categoric = [col for col in cols_categoric if col in X.columns]

    X[existing_cols_boolean] = X[existing_cols_boolean].astype("category")
    X[existing_cols_categoric] = X[existing_cols_categoric].astype("category")

    return X
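To show how a cleaning function like the ones above behaves once wrapped in a `FunctionTransformer`, here is a small sketch. Note that it swaps in a different technique: instead of listing each invalid code, the hypothetical helper `null_invalid_codes` keeps only an assumed allow-list of labels and nulls everything else. The helper, the allow-list, and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def null_invalid_codes(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: set any value outside the allowed labels to NaN."""
    allowed = {"Yes", "No", "No internet service"}  # assumed valid labels
    X = X.copy()  # work on a copy so the caller's frame is not mutated
    for col in X.columns:
        # Series.where keeps allowed labels and writes NaN elsewhere
        X[col] = X[col].where(X[col].isin(allowed))
    return X


toy = pd.DataFrame({"StreamingTV": ["Yes", "5412335", "No"]})

# validate=False lets the DataFrame reach the function unchanged
cleaner = FunctionTransformer(null_invalid_codes, validate=False)
cleaned = cleaner.fit_transform(toy)

print(cleaned)
```

An allow-list also catches invalid codes that were not seen during EDA, at the cost of having to maintain the list of valid labels per column.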

👨‍🏭 **Feature Engineering**

In [12]:
print(list(churn_df.select_dtypes(include=["number"]).columns))
print(list(churn_df.select_dtypes(include=["category"]).columns))

['MonthlyCharges', 'TotalCharges', 'tenure']
['StreamingMovies', 'Partner', 'PhoneService', 'InternetService', 'StreamingTV', 'OnlineSecurity', 'MultipleLines', 'Dependents', 'DeviceProtection', 'SeniorCitizen', 'TechSupport', 'gender ', 'PaperlessBilling', 'Churn', 'OnlineBackup', 'PaymentMethod', 'Contract']


In [13]:
cols_numeric = ["MonthlyCharges", "TotalCharges", "tenure"]
cols_categoric = [
    "StreamingMovies",
    "InternetService",
    "StreamingTV",
    "OnlineSecurity",
    "MultipleLines",
    "DeviceProtection",
    "TechSupport",
    "gender ",
    "OnlineBackup",
    "PaymentMethod",
    "Contract",
]
# Encode target variable
churn_features["Churn"] = churn_features["Churn"].astype("int")
print(churn_features["Churn"].value_counts())

Churn
0    5058
1    1808
Name: count, dtype: int64


In [14]:
from sklearn.preprocessing import StandardScaler

numeric_pipe = Pipeline(
    steps=[
        (
            "outlier_removal",
            FunctionTransformer(replace_out_of_range_values, validate=False),
        ),
        ("imputer", KNNImputer(n_neighbors=5)),
        ("scaler", StandardScaler()),  # Optional, depending on whether your model needs it
    ]
)

categorical_pipe = Pipeline(
    steps=[
        (
            "clean_categories",
            FunctionTransformer(replace_invalid_values, validate=False),
        ),
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(drop="first", handle_unknown="ignore")),
    ]
)


preprocessor = ColumnTransformer(
    transformers=[
        ("numericas", numeric_pipe, cols_numeric),
        ("categoricas nominales", categorical_pipe, cols_categoric),
    ]
)

In [15]:
preprocessor

### **Pipeline description**
---

**Numeric Pipeline:**

Columns: ['MonthlyCharges', 'TotalCharges', 'tenure']

Steps:

1. FunctionTransformer(replace_out_of_range_values): Replaces out-of-range MonthlyCharges and TotalCharges values with NaN.
2. KNNImputer(n_neighbors=5): Imputes each missing value with the mean of that feature across the 5 nearest neighbors.
3. StandardScaler(): Standardizes each column to zero mean and unit variance (optional, depending on the model).

---

**Categorical Pipeline:**

Columns: ['StreamingMovies', 'InternetService', 'StreamingTV', 'OnlineSecurity', 'MultipleLines', 'DeviceProtection', 'TechSupport', 'gender ', 'OnlineBackup', 'PaymentMethod', 'Contract']

Steps:

1. FunctionTransformer(replace_invalid_values): Replaces the invalid codes identified during EDA with NaN.
2. SimpleImputer(strategy="most_frequent"): Imputes missing values with the mode, i.e. the most frequent value.
3. OneHotEncoder(drop="first", handle_unknown="ignore"): Encodes each categorical variable as a set of binary indicator columns, dropping the first category of each variable to avoid multicollinearity.

---

**Column Transformer:**

Combines the numeric and categorical pipelines into a single preprocessing step.

**Train / Test split**

In [16]:
X_features = churn_features.drop("Churn", axis="columns")
Y_target = churn_features["Churn"]

# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target
)
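The split above is not seeded, so the shapes repeat across runs but the rows that land in each set do not. A minimal sketch of a reproducible stratified split follows; the toy data and `random_state=42` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy target: 75% class 0, 25% class 1
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 75 + [1] * 25, name="Churn")

# stratify keeps the class ratio in both sets; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Running the same call again yields the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```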

In [17]:
x_train.shape, y_train.shape

((5492, 19), (5492,))

In [18]:
x_test.shape, y_test.shape

((1374, 19), (1374,))

**Preprocessing pipeline**

In [19]:
fitted_preprocessor = preprocessor.fit(x_train)  # fit() returns the fitted transformer, not transformed data

In [20]:
# Get feature names from the preprocessor
numeric_features = (
    preprocessor.transformers_[0][1].named_steps["scaler"].get_feature_names_out(cols_numeric)
)
categorical_features = (
    preprocessor.transformers_[1][1].named_steps["onehot"].get_feature_names_out(cols_categoric)
)
feature_names = list(numeric_features) + list(categorical_features)

# transform x_train with the preprocessor and wrap the resulting array in a DataFrame
x_train_transformed = preprocessor.transform(x_train)
x_train_transformed = pd.DataFrame(x_train_transformed, columns=feature_names)
x_train_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5492 entries, 0 to 5491
Data columns (total 25 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   MonthlyCharges                         5492 non-null   float64
 1   TotalCharges                           5492 non-null   float64
 2   tenure                                 5492 non-null   float64
 3   StreamingMovies_No internet service    5492 non-null   float64
 4   StreamingMovies_Yes                    5492 non-null   float64
 5   InternetService_Fiber optic            5492 non-null   float64
 6   InternetService_No                     5492 non-null   float64
 7   StreamingTV_No internet service        5492 non-null   float64
 8   StreamingTV_Yes                        5492 non-null   float64
 9   OnlineSecurity_No internet service     5492 non-null   float64
 10  OnlineSecurity_Yes                     5492 non-null   float64
 11  Mult

In [21]:
x_train_transformed.head()

Unnamed: 0,MonthlyCharges,TotalCharges,tenure,StreamingMovies_No internet service,StreamingMovies_Yes,InternetService_Fiber optic,InternetService_No,StreamingTV_No internet service,StreamingTV_Yes,OnlineSecurity_No internet service,...,TechSupport_No internet service,TechSupport_Yes,gender _Male,OnlineBackup_No internet service,OnlineBackup_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Contract_One year,Contract_Two year
0,0.26,1.06,1.33,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
1,0.85,0.83,0.59,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-1.33,-0.25,1.33,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,1.12,0.82,0.31,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
4,0.5,1.44,1.49,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0


In [22]:
x_train.head()

Unnamed: 0,MonthlyCharges,StreamingMovies,Partner,PhoneService,InternetService,StreamingTV,OnlineSecurity,MultipleLines,Dependents,DeviceProtection,SeniorCitizen,TotalCharges,TechSupport,gender,PaperlessBilling,tenure,OnlineBackup,PaymentMethod,Contract
914,72.45,Yes,1.0,1.0,DSL,No,Yes,Yes,1.0,Yes,0.0,4653.85,No,Male,0.0,65.0,Yes,Credit card (automatic),One year
1263,90.05,Yes,1.0,1.0,Fiber optic,No,Yes,No,1.0,Yes,0.0,4137.2,No,Male,0.0,47.0,No,Electronic check,Month-to-month
3202,24.75,No internet service,1.0,1.0,No,No internet service,No internet service,Yes,1.0,No internet service,0.0,1715.1,No internet service,Male,0.0,65.0,No internet service,Credit card (automatic),Two year
3890,98.15,No,1.0,1.0,Fiber optic,Yes,Yes,No,0.0,Yes,1.0,4116.8,Yes,Male,1.0,40.0,Yes,Credit card (automatic),One year
7730,79.45,No,0.0,1.0,DSL,Yes,Yes,Yes,1.0,Yes,0.0,5502.55,Yes,Male,1.0,69.0,Yes,Credit card (automatic),Two year


### **Recommendations and final ideas**

1. **Remove the scaler from the numeric variables**

Not all machine learning algorithms are sensitive to differing feature scales. For tree-based models (such as Random Forest or XGBoost), scaling matters little, but for models such as regression or neural networks it can be key.

2. **Dig deeper into the `n_neighbors` hyperparameter of the KNN imputation**
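Both recommendations can be checked empirically rather than decided up front. The sketch below is one possible way to do it, not the notebook's method: it runs a grid search over the imputer's `n_neighbors` and over keeping or skipping the scaler. The synthetic data, the logistic regression classifier, the candidate values, and the ROC AUC metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data, with ~10% of values knocked out
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

model = Pipeline(
    steps=[
        ("imputer", KNNImputer()),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

param_grid = {
    "imputer__n_neighbors": [3, 5, 10],  # candidate neighbor counts (assumed)
    "scaler": [StandardScaler(), "passthrough"],  # "passthrough" skips the step
}

search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print(search.best_params_)
```

Because the imputer sits inside the pipeline, it is refit on each training fold, so the comparison between settings does not leak information from the validation folds.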