## Feature Engineering & Preprocessing

In this notebook, we prepare the Telco Customer Churn dataset for machine learning by:
- Separating features and target
- Handling missing values
- Encoding categorical variables
- Scaling numerical features where required
- Creating a reusable preprocessing pipeline


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [3]:
df = pd.read_csv("../../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [5]:
df.isna().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [6]:
df[df['TotalCharges'].isna()]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


In [7]:
df = df.dropna()

Most ML models cannot work with missing values; they typically raise an error or behave incorrectly. To handle missing values we could fill them with 0, with the mean (the column average), or with the median (the middle value), or we could simply drop the affected rows. Since only 11 of the 7,043 rows have a missing `TotalCharges`, we drop them.
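The options listed above can be compared side by side on a small example. This is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); only the column names match the notebook. In the rows displayed above, every missing `TotalCharges` comes with `tenure` equal to 0, so filling with 0 would also be a defensible choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Telco data (values are made up for illustration).
toy = pd.DataFrame({
    "tenure": [0, 12, 24, 0],
    "MonthlyCharges": [20.0, 50.0, 80.0, 60.0],
    "TotalCharges": [np.nan, 600.0, 1920.0, np.nan],
})

# Option 1: drop rows with a missing TotalCharges (the choice made in this notebook).
dropped = toy.dropna(subset=["TotalCharges"])

# Option 2: fill with 0, which reads as "nothing billed yet".
filled_zero = toy.fillna({"TotalCharges": 0.0})

# Option 3: fill with the median of the observed values.
median_value = toy["TotalCharges"].median()
filled_median = toy.fillna({"TotalCharges": median_value})

print(len(dropped))
print(filled_zero["TotalCharges"].tolist())
print(filled_median["TotalCharges"].tolist())
```

Dropping keeps the data untouched but loses rows; filling keeps every row at the cost of inserting values that were never observed.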

In [8]:
df["Churn"].value_counts(normalize=True)
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})
df["Partner"] = df["Partner"].map({"Yes": 1, "No": 0})
df["PhoneService"] = df["PhoneService"].map({"Yes": 1, "No": 0})
df["Dependents"] = df["Dependents"].map({"Yes": 1, "No": 0})
df["PaperlessBilling"] = df["PaperlessBilling"].map({"Yes": 1, "No": 0})
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   int64  
 4   Dependents        7032 non-null   int64  
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   int64  
 7   MultipleLines     7032 non-null   object 
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   object 
 10  OnlineBackup      7032 non-null   object 
 11  DeviceProtection  7032 non-null   object 
 12  TechSupport       7032 non-null   object 
 13  StreamingTV       7032 non-null   object 
 14  StreamingMovies   7032 non-null   object 
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   int64  
 17  

Features that take only Yes/No values can be binary encoded as 1/0.

In [9]:
internet_feature_cols = [
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies"
]

mapping = {
    "No internet service": 0,
    "No": 1,
    "Yes": 2
}

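for col in internet_feature_cols:
    # Series.map looks each label up in the dictionary; a label missing from
    # the mapping becomes NaN, which makes astype(int) fail loudly.
    df[col] = df[col].map(mapping).astype(int)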



In [10]:
df[internet_feature_cols].info()
df[internet_feature_cols].head()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   OnlineSecurity    7032 non-null   int64
 1   OnlineBackup      7032 non-null   int64
 2   DeviceProtection  7032 non-null   int64
 3   TechSupport       7032 non-null   int64
 4   StreamingTV       7032 non-null   int64
 5   StreamingMovies   7032 non-null   int64
dtypes: int64(6)
memory usage: 384.6 KB


Unnamed: 0,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,1,2,1,1,1,1
1,2,1,2,1,1,1
2,2,2,1,1,1,1
3,2,1,2,2,1,1
4,1,1,1,1,1,1


Internet-related service features were encoded into three numerical categories to distinguish between lack of internet service, availability without adoption, and active usage. This preserves service availability information while avoiding unnecessary dimensionality expansion. Because the codes are integers, models will treat them as ordered (0 < 1 < 2).
- `"No internet service"` → 0
- `"No"` → 1
- `"Yes"` → 2

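The three-level encoding described above can be sketched on a toy frame. This is an illustrative example, not the notebook's exact cell: the toy values are made up, and the explicit check for unmapped labels is an added safeguard.

```python
import pandas as pd

mapping = {"No internet service": 0, "No": 1, "Yes": 2}

# Toy frame using two of the internet-related columns (values are made up).
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "TechSupport": ["Yes", "No", "No internet service"],
})

encoded = toy.copy()
for col in encoded.columns:
    encoded[col] = encoded[col].map(mapping)
    # Series.map yields NaN for any label that is not a key of the mapping.
    if encoded[col].isna().any():
        raise ValueError(f"Unmapped label in column {col!r}")
    encoded[col] = encoded[col].astype(int)

print(encoded)
```

Each column stays a single integer column, whereas one-hot encoding the same three labels would produce three columns per feature.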
In [11]:
# Churn was already mapped to 0/1 in In [8]; mapping it again would turn every value into NaN.
assert df['Churn'].isin([0, 1]).all()

In [13]:
X = df.drop(columns=['Churn', 'customerID'])
y = df['Churn']

The target was converted to binary, and the `customerID` column was dropped because it has no predictive meaning.

In [14]:
categorical_cols = X.select_dtypes(include='object').columns
numerical_cols = X.select_dtypes(exclude='object').columns

categorical_cols, numerical_cols

(Index(['gender', 'MultipleLines', 'InternetService', 'Contract',
        'PaymentMethod'],
       dtype='object'),
 Index(['SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService',
        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
        'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'MonthlyCharges',
        'TotalCharges'],
       dtype='object'))

In [15]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)


The dataset contains a mix of numerical and categorical features, which must be processed differently before being used by machine learning models.

#### Numerical Features
Numerical columns such as `tenure`, `MonthlyCharges`, and `TotalCharges` are scaled using StandardScaler.  
Scaling ensures that all numerical features are on a comparable scale, preventing features with larger magnitudes from dominating the model, especially for models like logistic regression.
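To make the effect of scaling concrete, here is a small sketch on made-up values for two numerical columns. It checks that `StandardScaler` matches the formula z = (x - mean) / std computed by hand, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two numerical columns: tenure and MonthlyCharges.
X_toy = np.array([
    [1.0, 30.0],
    [10.0, 60.0],
    [25.0, 90.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Same computation by hand: subtract the column mean, divide by the
# column standard deviation (population std, ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so neither column outweighs the other purely because of its units.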

#### Categorical Features
Categorical columns such as `Contract`, `InternetService`, and `PaymentMethod` are converted into numerical format using One-Hot Encoding.  
One-hot encoding creates a binary column for each category and avoids introducing false ordinal relationships between categories.

#### ColumnTransformer
A `ColumnTransformer` is used to apply the appropriate transformation to each feature type:
- Numerical features → scaling
- Categorical features → one-hot encoding

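When the `ColumnTransformer` is fit on the training split only, this approach ensures:
- Consistent preprocessing during training and testing
- No data leakage from the test set into the scaling statistics or the encoder categories
- A clean and reusable preprocessing pipeline that can be integrated directly with machine learning models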

Overall, this preprocessing step prepares heterogeneous customer data in a form that machine learning algorithms can interpret correctly and fairly.
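As a sketch of how the preprocessor could be combined with a model, the example below wraps a `ColumnTransformer` and a classifier in one `Pipeline` and fits it on the training split only. Everything here is an assumption for illustration: the data is synthetic with random labels, and `LogisticRegression`, the 80/20 split, and the `random_state` values are choices made for this example rather than taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Telco features (random values, random labels).
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n),
    "MonthlyCharges": rng.uniform(20, 110, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
y_toy = pd.Series(rng.integers(0, 2, size=n), name="Churn")

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Preprocessing and model live in one object, so fit() learns the scaling
# statistics and the category list from the training data only.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

Because the test split is only ever passed to `predict`, it is transformed with the parameters learned from the training split, which is what keeps test information out of the fit.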
