# Feature Engineering & Preprocessing – Telecom Churn

This notebook focuses on transforming raw customer data into a
model-ready format.

Key objectives:
- Handle missing values
- Encode categorical variables
- Address multicollinearity
- Prepare features for machine learning models


## Install Libraries

In [2]:
pip install scikit-learn



## Import Libraries

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Import Dataset

In [4]:
# Load the raw churn dataset (relative Windows-style path)
data = "..\\data\\raw\\WA_Fn-UseC_-Telco-Customer-Churn.csv"
df = pd.read_csv(data)


## Fix TotalCharges and Convert it into Numeric

In [5]:
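# TotalCharges contains blank strings, so it is not numeric yet.
# Replace the blanks with NaN, then convert the column to numeric.
df['TotalCharges'] = df['TotalCharges'].replace(" ", np.nan)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'])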

In [6]:
# Assign the result back to the column. Calling fillna(..., inplace=True) on a
# column selection works on a copy and raises a FutureWarning (pandas 3.0 change).
df['TotalCharges'] = df['TotalCharges'].fillna(0)

# We fill with 0 because these customers have not been billed yet,
# so the values are missing not at random (MNAR).
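
The "not billed yet" reasoning can be checked before filling. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and assumes that unbilled customers are the ones with `tenure == 0`. It also uses `errors="coerce"`, an alternative to replacing blanks by hand.

```python
import pandas as pd

# Toy frame; values are illustrative only (not from the Telco dataset)
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "TotalCharges": [" ", "350.5", " ", "99.0"],
})

# errors="coerce" turns anything non-numeric (including " ") into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Assumption being checked: every missing value belongs to a zero-tenure customer
missing_mask = toy["TotalCharges"].isna()
all_missing_are_new = bool((toy.loc[missing_mask, "tenure"] == 0).all())
n_filled = int(missing_mask.sum())

# Assign the result back instead of using inplace=True on a column selection
toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
```

If `all_missing_are_new` came out `False` on the real data, filling with 0 would be harder to justify and another strategy might be needed.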


## Drop the Customer ID

In [7]:
# customerID is a unique identifier and has no predictive value
df.drop(columns=['customerID'], inplace=True)

## Encoding The Target Variable

In [8]:
# Encode the target variable.
# Encoding converts a non-numeric (categorical) column into numbers; here Yes/No becomes 1/0.

df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

In [9]:
# Separate numerical and categorical columns
numerical_col = df.select_dtypes(exclude='object').columns.tolist()
categorical_col = df.select_dtypes(include='object').columns.tolist()

print(numerical_col,categorical_col)

['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn'] ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']


## Feature Selection Decision

A strong correlation exists between `tenure` and `TotalCharges`.
To avoid redundancy and improve interpretability in linear models,
`TotalCharges` is dropped while retaining `tenure`.

This decision is based on business logic and EDA insights.
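
A minimal sketch of how such a correlation can be measured. The data here is synthetic: it assumes `TotalCharges` is roughly `tenure` times `MonthlyCharges`, which is an illustration and not a property verified on the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data; the generating rule below is an assumption for illustration
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=500)
monthly = rng.uniform(20, 120, size=500)

toy = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly,
})

# Pearson correlation matrix of the numeric columns
corr = toy.corr()
r_tenure_total = corr.loc["tenure", "TotalCharges"]
```

On the real data the same `df[numeric_columns].corr()` call would be run before `TotalCharges` is dropped. What counts as "strong" is a judgment call that depends on the model.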

In [10]:
# Addressing correlation and multicollinearity
# Multicollinearity occurs when two or more independent features (input variables) in a model are highly
# correlated with each other, meaning they contain similar information.
# Highly correlated variables are not useless, but including them together in a linear model causes
# redundancy, which affects the stability and interpretability of the coefficients (slopes).

# Drop the TotalCharges column
df.drop(columns='TotalCharges', inplace=True)

## Encoding
Categorical variables are one-hot encoded because most machine learning models expect numeric input and cannot process object or categorical columns directly.

In [11]:
df_encoded = pd.get_dummies(df, columns=categorical_col, drop_first=True)

# pd.get_dummies converts the columns listed in categorical_col into indicator (dummy) columns
# df is the DataFrame being encoded
# columns= is passed by keyword because the second positional argument of get_dummies is prefix
# drop_first=True drops the first category of each encoded feature to avoid the dummy variable trap
# (redundancy / multicollinearity), which is especially important for Logistic Regression
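
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. The `Contract` values mirror the dataset's categories, but the rows are invented. `dtype=int` is used only so the dummies print as 0/1.

```python
import pandas as pd

# Toy frame; rows are invented for illustration
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 12, 24, 8],
})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=["Contract"], dtype=int)

# With drop_first: the first category (alphabetically) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

full_cols = list(full.columns)
reduced_cols = list(reduced.columns)

# In the full encoding the Contract indicators always sum to 1 per row,
# which is the exact linear dependence the dummy variable trap refers to
contract_cols = [c for c in full.columns if c.startswith("Contract_")]
row_sums = full[contract_cols].sum(axis=1)
```

In the reduced frame, a row with zeros in both remaining `Contract_` columns is a month-to-month customer.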

## Split Features & Target (Code)

In [12]:
X = df_encoded.drop(columns='Churn')
y = df_encoded['Churn']

## Train-Test Split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# test_size=0.2: 20% of the data goes to the test set
# random_state=42: makes the split reproducible
# stratify=y: keeps the class proportions the same in train and test
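
The effect of `stratify=y` can be verified by comparing the positive-class rate in each split. The sketch below uses a made-up target with a 25% positive rate, chosen for illustration. It is not the dataset's churn rate.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 25 positives and 75 negatives (illustrative, not the real churn rate)
y_toy = pd.Series([1] * 25 + [0] * 75)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification both rates should match the overall rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

The same check on the real split is `y_train.mean()` against `y_test.mean()`.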

## Feature Scaling (Numerical Only)

Features can sit on very different scales, for example an age of 20 next to a salary of 10,000. A scale-sensitive model can then prioritise the larger-valued feature (salary) purely because of its magnitude, which is wrong. Scaling brings such features onto a comparable scale.

In [14]:
# Create a scaler → scaler = StandardScaler()
# Make copies of the train/test data → safe practice
# Scale training data → fit_transform (calculate mean/std, then scale)
# Scale test data using the training statistics → transform

# Scale only the continuous columns. select_dtypes(exclude='object') would also select
# the one-hot dummy columns and the 0/1 SeniorCitizen flag, which do not need scaling.
numerical_col = ['tenure', 'MonthlyCharges']
scaler = StandardScaler()

# X_train_scaled and X_test_scaled are separate DataFrames with the same values and columns;
# changes to them do not affect X_train or X_test
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[numerical_col] = scaler.fit_transform(X_train[numerical_col])
X_test_scaled[numerical_col] = scaler.transform(X_test[numerical_col])
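
The fit-on-train, transform-on-test pattern can be seen on a tiny example. The numbers are invented, and the single `tenure` column stands in for whichever columns get scaled.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only
train = pd.DataFrame({"tenure": [1.0, 10.0, 20.0, 29.0]})
test = pd.DataFrame({"tenure": [15.0, 60.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training mean/std

# The scaled training column has mean 0 and (population) std 1
train_mean = train_scaled.mean()
train_std = train_scaled.std()

# A test value equal to the training mean (15.0) maps to 0
first_test_value = test_scaled[0, 0]
```

The scaled test column generally does not have mean 0 and std 1. That is expected, because the test set must not influence the statistics used for scaling.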

## Save Processed Data (Code)

In [26]:

X_train_scaled.to_csv("..\\data\\processed\\X_train_scaled.csv", index=False)
X_test_scaled.to_csv("..\\data\\processed\\X_test_scaled.csv", index=False)
y_train.to_csv("..\\data\\processed\\y_train.csv", index=False)
y_test.to_csv("..\\data\\processed\\y_test.csv", index=False)
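
The paths above use Windows separators. A possible portable variant builds them with `pathlib`, sketched below. A temporary folder stands in for the processed-data directory, and `out_dir` and `y_toy` are hypothetical names used only in this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical output directory; a temp folder stands in for ../data/processed
out_dir = Path(tempfile.mkdtemp()) / "data" / "processed"
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

# Toy target series, illustrative only
y_toy = pd.Series([0, 1, 0], name="Churn")
y_toy.to_csv(out_dir / "y_train.csv", index=False)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_dir / "y_train.csv")["Churn"]
```

`Path` objects use the right separator for the current operating system, and `mkdir(parents=True, exist_ok=True)` avoids a failure when the target folder does not exist yet.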

## Feature Engineering Summary

- Removed non-informative identifier columns
- Handled missing values using business logic
- Addressed multicollinearity by removing redundant features
- Encoded categorical variables using one-hot encoding
- Scaled numerical features for model compatibility
- Prepared clean train-test datasets for modeling
