# Feature Engineering — Customer Churn

This notebook prepares the customer churn dataset for modeling: it encodes the target as a binary label, fixes data types, one-hot encodes the categorical features, and creates a stratified train/test split.


## Imports

Load required libraries for preprocessing and model preparation.


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


## Load dataset

Load the dataset used during exploratory analysis.


In [5]:
df = pd.read_csv("../data/telco_customer_churn.csv")
df.head()


,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Target variable

Map the `Churn` label to a binary integer: `Yes` becomes 1 and `No` becomes 0.


In [7]:
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
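
One thing worth checking after this step: `Series.map` with a dict returns NaN for any value that is not a key, so a label with stray whitespace or different casing would silently become a missing target. The sketch below uses toy labels (the malformed `" yes"` entry is hypothetical, not taken from the dataset) to show how such values could be counted.

```python
import pandas as pd

# Toy labels (hypothetical); " yes" stands in for a malformed entry.
labels = pd.Series(["Yes", "No", " yes", "No"])

encoded = labels.map({"Yes": 1, "No": 0})

# Values missing from the mapping dict become NaN rather than raising an error.
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

A non-zero count would suggest normalizing the labels before mapping.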

## Data type corrections

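`TotalCharges` can contain blank entries, so it is converted explicitly to a numeric type. Blanks become NaN under `errors="coerce"`, and the affected rows are then dropped.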

In [8]:
# TotalCharges sometimes contains blanks
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Drop rows with missing TotalCharges
df = df.dropna(subset=["TotalCharges"])
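
The coerce-then-drop pattern above can be illustrated on a small frame. This is a sketch with made-up rows, where `" "` mimics a blank `TotalCharges` entry; it is not a sample of the real file.

```python
import pandas as pd

# Toy column (hypothetical values); " " mimics a blank TotalCharges entry.
raw = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# errors="coerce" turns unparseable strings into NaN instead of raising.
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")
n_missing = int(raw["TotalCharges"].isna().sum())

# Drop only the rows where the coerced column is missing.
cleaned = raw.dropna(subset=["TotalCharges"])
print(n_missing, len(cleaned), cleaned["TotalCharges"].dtype)
```

Counting the NaN values before dropping makes it easy to confirm how many rows the step removes.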


## Remove identifier columns

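`customerID` is an identifier rather than a customer attribute, so it carries no predictive signal. It is dropped before the categorical columns are encoded.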

In [9]:
df = df.drop(columns=["customerID"])
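
One way to spot identifier-like columns is to look for columns whose values are all distinct, since one-hot encoding such a column would produce one output column per row. The sketch below assumes a toy frame; the three IDs are the first three shown in the preview above, and the `id_like` name is illustrative.

```python
import pandas as pd

# Toy frame; the IDs are the first three from the df.head() preview.
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# Columns where every value is distinct behave like row identifiers.
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```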

## Feature / target split

Separate input features from the prediction target.

In [10]:
X = df.drop(columns=["Churn"])
y = df["Churn"]

## Feature type identification

Group features by dtype: `object` columns are treated as categorical, and all remaining columns as numerical.

In [11]:
categorical_features = X.select_dtypes(include="object").columns.tolist()
numerical_features = X.select_dtypes(exclude="object").columns.tolist()

categorical_features, numerical_features

(['gender',
  'Partner',
  'Dependents',
  'PhoneService',
  'MultipleLines',
  'InternetService',
  'OnlineSecurity',
  'OnlineBackup',
  'DeviceProtection',
  'TechSupport',
  'StreamingTV',
  'StreamingMovies',
  'Contract',
  'PaperlessBilling',
  'PaymentMethod'],
 ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'])
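
Note that the split is driven purely by dtype, which is why `SeniorCitizen`, a 0/1 flag stored as an integer, lands in the numerical group. A minimal sketch of the same `select_dtypes` calls on a toy frame (values are hypothetical, and dtypes are set explicitly so the result does not depend on how a given pandas version infers string columns):

```python
import pandas as pd

# Toy frame with explicit dtypes (values are hypothetical).
toy = pd.DataFrame({
    "Contract": pd.Series(["One year", "Month-to-month"], dtype=object),
    "SeniorCitizen": pd.Series([0, 1], dtype="int64"),
    "MonthlyCharges": pd.Series([56.95, 70.70], dtype="float64"),
})

# Same selection logic as the cell above, applied to the toy frame.
categorical = toy.select_dtypes(include="object").columns.tolist()
numerical = toy.select_dtypes(exclude="object").columns.tolist()
print(categorical, numerical)
```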

## Preprocessing pipeline

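One-hot encode the categorical features and pass the numerical features through unchanged (no scaling is applied). With `handle_unknown="ignore"`, a category not seen during fitting is encoded as all zeros instead of raising an error.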

In [13]:
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("num", "passthrough", numerical_features),
    ]
)

## Train-test split

Hold out 20% of the rows for testing. Stratifying on `y` keeps the churn rate approximately equal in the training and test sets.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
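
The effect of `stratify` can be checked by comparing the positive rate in each split. A minimal sketch on a synthetic target (100 rows with 30 positives, chosen for illustration and unrelated to the real churn rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target (hypothetical): 100 rows, 30 of them positive.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 30% positive rate.
print(y_tr.mean(), y_te.mean())
```

The same comparison on `y_train.mean()` and `y_test.mean()` would confirm the split above.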

## Apply preprocessing

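Fit the preprocessor on the training data only, then apply the fitted transformation to both sets. Fitting on the training split alone keeps information from the test set out of the encoding.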

In [16]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

X_train_processed.shape, X_test_processed.shape

((5625, 45), (1407, 45))

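## Save preprocessor (optional)

Persist the fitted preprocessor so the same transformation can be reused during modeling. Note that this cell saves only the preprocessor, not the processed arrays.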

In [19]:
import joblib

joblib.dump(preprocessor, "../data/preprocessor.joblib")

['../data/preprocessor.joblib']