# Feature Engineering – Customer Churn Prediction

## Objective of Feature Engineering

The goal of feature engineering is to transform raw customer data into meaningful
and model-ready features that improve predictive performance.

This process is guided by insights obtained from the exploratory data analysis (EDA),
ensuring that each feature included has a clear business or behavioral justification.

At this stage, the focus is on:
- Selecting relevant features
- Encoding categorical variables
- Transforming numerical features when necessary
- Preparing a clean dataset for machine learning models

The resulting dataset will serve as the direct input for the modeling phase.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [2]:
# Load cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Create a working copy
data = df.copy()

data.head()

Unnamed: 0,customer_id,gender,age,under_30,senior_citizen,married,dependents,number_of_dependents,country,state,...,total_extra_data_charges,total_long_distance_charges,total_revenue,satisfaction_score,customer_status,churn_label,churn_score,cltv,churn_category,churn_reason
0,8779-QRDMV,Male,78,No,Yes,No,No,0,United States,California,...,20,0.0,59.65,3,Churned,Yes,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,United States,California,...,0,390.8,1024.1,3,Churned,Yes,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,United States,California,...,0,203.94,1910.88,2,Churned,Yes,81,3179,Competitor,Competitor made better offer
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,United States,California,...,0,494.0,2995.07,2,Churned,Yes,88,5337,Dissatisfaction,Limited range of services
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,United States,California,...,0,234.21,3102.36,2,Churned,Yes,67,2793,Price,Extra data charges


## Defining the Target Variable


In [3]:
# Define target variable
y = data['churn_label']

In [4]:
# Encode target variable
y = y.map({'No': 0, 'Yes': 1})

In [5]:
# Check class distribution
y.value_counts(normalize=True) * 100

churn_label
0    73.463013
1    26.536987
Name: proportion, dtype: float64

## Target Variable Definition

The target variable for this project is `churn_label`, which indicates whether a customer has churned.

To prepare the data for machine learning models, the target variable was encoded as follows:
- 0 → Customer did not churn
- 1 → Customer churned

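The class distribution shows that about 73.5% of customers did not churn and 26.5% churned.
This moderate class imbalance may affect model training and evaluation in later stages,
for example by making accuracy alone a misleading metric.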
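
As one possible way to account for the imbalance in the modeling phase, the sketch below computes "balanced" class weights with scikit-learn. The toy target `y_toy` only mimics the roughly 73/27 split and is not the project data. Whether class weighting is the right remedy here is an assumption, not something this notebook establishes.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with roughly the same 73/27 split as the notebook (illustrative only)
y_toy = np.array([0] * 73 + [1] * 27)

classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# "balanced" weight per class = n_samples / (n_classes * class_count)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The minority class (churned) receives the larger weight, which many scikit-learn estimators can consume through their `class_weight` parameter.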


In [6]:
# Selected feature columns based on EDA
selected_features = [
    # Demographics
    'gender', 'under_30', 'senior_citizen', 'married', 'dependents',

    # Contract & Account
    'tenure_in_months', 'contract', 'payment_method', 'paperless_billing',

    # Pricing
    'monthly_charge', 'total_charges',

    # Services
    'internet_service', 'online_security', 'online_backup',
    'device_protection_plan', 'premium_tech_support',
    'streaming_tv', 'streaming_movies', 'streaming_music',

    # Offers & Referrals
    'offer', 'referred_a_friend'
]


In [7]:
X = data[selected_features].copy()

X.head()

Unnamed: 0,gender,under_30,senior_citizen,married,dependents,tenure_in_months,contract,payment_method,paperless_billing,monthly_charge,...,internet_service,online_security,online_backup,device_protection_plan,premium_tech_support,streaming_tv,streaming_movies,streaming_music,offer,referred_a_friend
0,Male,No,Yes,No,No,1,Month-to-Month,Bank Withdrawal,Yes,39.65,...,Yes,No,No,Yes,No,No,Yes,No,No Offer,No
1,Female,No,Yes,Yes,Yes,8,Month-to-Month,Credit Card,Yes,80.65,...,Yes,No,Yes,No,No,No,No,No,Offer E,Yes
2,Male,No,Yes,No,Yes,18,Month-to-Month,Bank Withdrawal,Yes,95.45,...,Yes,No,No,No,No,Yes,Yes,Yes,Offer D,No
3,Female,No,Yes,Yes,Yes,25,Month-to-Month,Bank Withdrawal,Yes,98.5,...,Yes,No,Yes,Yes,No,Yes,Yes,No,Offer C,Yes
4,Female,No,Yes,Yes,Yes,37,Month-to-Month,Bank Withdrawal,Yes,76.5,...,Yes,No,No,No,No,No,No,No,Offer C,Yes


## Feature Selection Based on EDA

Feature selection was guided by insights obtained from the exploratory data analysis (EDA).
Only variables that demonstrated a meaningful relationship with customer churn
were retained for modeling.

The selected features include customer demographics, contract characteristics,
pricing information, service usage, and promotional engagement.

Variables related to customer identifiers, geographic information,
or post-churn outcomes were excluded to prevent data leakage
and reduce model complexity.

Selecting features manually also keeps the resulting model easier to interpret.
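
A quick guard against accidental leakage is to check that none of the excluded columns appear in the feature list. The sketch below copies `selected_features` from the notebook. The `excluded` set is a hypothetical list built from column names visible in the `data.head()` output, and exactly which columns count as identifiers or post-churn outcomes is an assumption.

```python
# Feature list copied from the selection cell in this notebook
selected_features = [
    'gender', 'under_30', 'senior_citizen', 'married', 'dependents',
    'tenure_in_months', 'contract', 'payment_method', 'paperless_billing',
    'monthly_charge', 'total_charges',
    'internet_service', 'online_security', 'online_backup',
    'device_protection_plan', 'premium_tech_support',
    'streaming_tv', 'streaming_movies', 'streaming_music',
    'offer', 'referred_a_friend',
]

# Hypothetical exclusion list: identifier and post-churn columns (assumed)
excluded = {
    'customer_id', 'customer_status', 'churn_label',
    'churn_score', 'churn_category', 'churn_reason',
}

# Any overlap would mean a leaky column slipped into the features
overlap = excluded.intersection(selected_features)
print(sorted(overlap))
```

An empty result means none of the assumed leaky columns were selected.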

In [8]:
# Identify categorical columns
categorical_cols = X.select_dtypes(include='object').columns.tolist()

categorical_cols

['gender',
 'under_30',
 'senior_citizen',
 'married',
 'dependents',
 'contract',
 'payment_method',
 'paperless_billing',
 'internet_service',
 'online_security',
 'online_backup',
 'device_protection_plan',
 'premium_tech_support',
 'streaming_tv',
 'streaming_movies',
 'streaming_music',
 'offer',
 'referred_a_friend']

In [9]:
# Apply one-hot encoding
X_encoded = pd.get_dummies(
    X,
    columns=categorical_cols,
    drop_first=True
)

In [10]:
X_encoded.shape  # not displayed: a notebook cell only renders its last expression
X_encoded.head()

Unnamed: 0,tenure_in_months,monthly_charge,total_charges,gender_Male,under_30_Yes,senior_citizen_Yes,married_Yes,dependents_Yes,contract_One Year,contract_Two Year,...,premium_tech_support_Yes,streaming_tv_Yes,streaming_movies_Yes,streaming_music_Yes,offer_Offer A,offer_Offer B,offer_Offer C,offer_Offer D,offer_Offer E,referred_a_friend_Yes
0,1,39.65,39.65,True,False,True,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
1,8,80.65,633.3,False,False,True,True,True,False,False,...,False,False,False,False,False,False,False,False,True,True
2,18,95.45,1752.55,True,False,True,False,True,False,False,...,False,True,True,True,False,False,False,True,False,False
3,25,98.5,2514.5,False,False,True,True,True,False,False,...,False,True,True,False,False,False,True,False,False,True
4,37,76.5,2868.15,False,False,True,True,True,False,False,...,False,False,False,False,False,False,True,False,False,True


## Handling Categorical Variables (Encoding)

Machine learning models require numerical inputs.
Therefore, categorical variables were transformed into numeric format using one-hot encoding.

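Binary categorical features were encoded into a single indicator column,
while multi-category features were expanded into one indicator column per category,
minus a dropped baseline category.

The `drop_first=True` option drops the first category of each feature. This avoids
perfect multicollinearity among the indicator columns (the dummy variable trap),
since the dropped category is implied when all remaining indicators are false.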
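
The effect of `drop_first=True` can be seen on a small toy frame. The data below is made up for illustration; only the column names and category labels are borrowed from the notebook.

```python
import pandas as pd

# Toy data: one binary feature and one three-category feature
toy = pd.DataFrame({
    "paperless_billing": ["Yes", "No", "Yes"],
    "contract": ["Month-to-Month", "One Year", "Two Year"],
})

encoded = pd.get_dummies(
    toy,
    columns=["paperless_billing", "contract"],
    drop_first=True,  # drop the first (alphabetically sorted) category per feature
)

print(encoded.columns.tolist())
# ['paperless_billing_Yes', 'contract_One Year', 'contract_Two Year']
```

Here `No` and `Month-to-Month` act as the implicit baselines: a row with both contract indicators false is a month-to-month customer.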


In [11]:
# Identify numerical columns
num_cols = [
    'tenure_in_months',
    'monthly_charge',
    'total_charges'
]

In [12]:
scaler = StandardScaler()
X_encoded[num_cols] = scaler.fit_transform(X_encoded[num_cols])

In [13]:
X_encoded[num_cols].describe()

Unnamed: 0,tenure_in_months,monthly_charge,total_charges
count,7043.0,7043.0,7043.0
mean,-8.070910000000001e-17,-6.456728e-17,8.07091e-18
std,1.000071,1.000071,1.000071
min,-1.278988,-1.54586,-0.9980237
25%,-0.9529936,-0.9725399,-0.829736
50%,-0.1380083,0.1857327,-0.3909126
75%,0.9214727,0.8338335,0.6646863
max,1.61421,1.794352,2.826236


After encoding and scaling, the dataset contains only numeric and boolean features, which are suitable for machine learning algorithms.

## Feature Scaling (Numerical Features)

Before training machine learning models, numerical features were standardized using **StandardScaler**.

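Standardization transforms each numerical variable to have:
- A mean close to 0
- A standard deviation close to 1

The `describe()` output reports a standard deviation of 1.000071 rather than exactly 1
because pandas computes the sample standard deviation (`ddof=1`), while `StandardScaler`
divides by the population standard deviation (`ddof=0`).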

This step is especially important for models that rely on distance or gradient-based optimization
(such as Logistic Regression, KNN, and Support Vector Machines), as it prevents features with larger
numeric ranges from dominating the learning process.

The following numerical features were scaled:
- `tenure_in_months`
- `monthly_charge`
- `total_charges`

After scaling, all numerical features are on a comparable scale, which helps scale-sensitive models converge and keeps any single feature from dominating because of its units.
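
The notebook fits the scaler on the full dataset before splitting. A common alternative, sketched here on synthetic data, is to split first and fit `StandardScaler` on the training rows only, so that test-set statistics never influence the transformation. The toy frame and its value ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two of the numerical columns (illustrative values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tenure_in_months": rng.integers(1, 73, size=200),
    "monthly_charge": rng.uniform(20, 120, size=200),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean/std learned from training rows only
test_scaled = scaler.transform(test)        # training statistics reused on the test rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled test set will generally not have a mean of exactly 0, which is expected: it is measured against the training distribution.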


In [14]:
# Features: the encoded and scaled matrix; y is the encoded target from In [4]
X = X_encoded

# Train / Test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Check shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5634, 27), (1409, 27), (5634,), (1409,))

## Train / Test Split Summary

The dataset was successfully split into training and testing sets:

- Training set: 5,634 records
- Testing set: 1,409 records
- Number of features: 27

Both sets maintain identical feature structures and preserve the original
class distribution of the target variable through stratified sampling.
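
The effect of `stratify=y` can be checked by comparing the positive-class rate in each split. The sketch below uses a synthetic target with a 73/27 split rather than the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with a 73/27 class split (illustrative only)
y_toy = pd.Series([0] * 730 + [1] * 270, name="churn_label")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With 0/1 labels, the mean is the share of the positive class
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Both rates should sit very close to the overall churn rate. Without `stratify`, they can drift apart by chance.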

At this stage, the data is fully prepared for machine learning modeling.

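One caveat: `StandardScaler` was fit on the full dataset before the split, so the mean and
standard deviation of the test rows influenced how the training data was scaled. This is a
mild form of data leakage. Fitting the scaler on `X_train` only and then applying it to
`X_test` avoids it. Target leakage was addressed separately by excluding post-churn columns
during feature selection.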

In [15]:
# Save processed datasets
X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)

## Saving Processed Data

To ensure reproducibility and clean separation between feature engineering
and modeling, the processed training and testing datasets were saved to disk.

These datasets will be loaded directly in the modeling notebook,
avoiding redundant preprocessing steps.
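
When reloading the target files in the modeling notebook, note that `pd.read_csv` returns a one-column DataFrame. The sketch below shows one way to get a Series back, using a temporary directory and a made-up target so it runs on its own. The real paths are the ones used in the save cell.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up target standing in for y_train (illustrative only)
y_train = pd.Series([0, 1, 0, 0], name="churn_label")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_train.to_csv(path, index=False)  # writes a header row plus one value per line

    # read_csv yields a DataFrame; squeeze("columns") collapses it to a Series
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
```

Saving with `index=False` also means the original row index is not preserved, so the feature and target files stay aligned by row position only.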
