# Data Preprocessing

Preparing Data for Machine Learning Models

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("D:/Project/data/telco_feature_engineered.csv")

print(df.columns)

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn', 'tenure_group',
       'NumServicesUsed', 'MonthlyChargeGroup', 'IsFiber_and_TechSupport'],
      dtype='object')


In [3]:
df.shape

(7032, 24)

## 1. Label Encoding for Binary Columns

Binary columns with values such as __Yes/No__ are converted to 0/1 so that ML models can work with them.

In [4]:
from sklearn.preprocessing import LabelEncoder

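# LabelEncoder assigns integer codes in sorted label order, so 'No' -> 0 and 'Yes' -> 1
le = LabelEncoder()
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling',
               'Churn', 'SeniorCitizen']

# Re-fit the encoder on each column and overwrite the column with its integer codes
for col in binary_cols:
    df[col] = le.fit_transform(df[col])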

df[binary_cols].head()

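|   | Partner | Dependents | PhoneService | PaperlessBilling | Churn | SeniorCitizen |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 1 | 1 | 0 |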
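
To make the encoding direction explicit, here is a minimal sketch on a made-up Yes/No column (the `toy` frame is hypothetical and not part of the Telco data). It shows that `LabelEncoder` numbers the classes in sorted order, and that an explicit dictionary mapping gives the same codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for a Yes/No feature such as Partner (values are made up).
toy = pd.DataFrame({"Partner": ["Yes", "No", "No", "Yes"]})

# LabelEncoder sorts the distinct labels and uses each label's position as its code.
le = LabelEncoder()
encoded = le.fit_transform(toy["Partner"])
classes = list(le.classes_)  # position in this list == assigned integer code

# An explicit mapping yields the same codes and documents the intent in the code.
mapped = toy["Partner"].map({"No": 0, "Yes": 1})

print(classes)
print(list(encoded))
print(mapped.tolist())
```

The explicit mapping can be easier to audit when a column might contain unexpected labels, because unmapped values become NaN instead of silently receiving a code.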


## 2. One-Hot Encoding for Multi-class Categorical Columns

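Some features, such as __InternetService__, have more than two categories.
Most ML models cannot work directly with strings like __'DSL'__ or __'Fiber optic'__, so each category gets its own 0/1 indicator column. With drop_first=True, the first category of each feature is dropped, since it is implied when all of that feature's other indicators are 0.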

In [5]:
# Encode every remaining string column; drop_first=True omits one redundant indicator per feature
df = pd.get_dummies(df, drop_first=True, dtype=int)
df.head()

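|   | SeniorCitizen | Partner | Dependents | tenure | PhoneService | PaperlessBilling | MonthlyCharges | TotalCharges | Churn | NumServicesUsed | ... | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | tenure_group_Mid-Term | tenure_group_New | MonthlyChargeGroup_Low | MonthlyChargeGroup_Medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 29.85 | 29.85 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 34 | 1 | 0 | 56.95 | 1889.5 | 0 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 2 | 1 | 1 | 53.85 | 108.15 | 1 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 45 | 0 | 0 | 42.3 | 1840.75 | 0 | 3 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 2 | 1 | 1 | 70.7 | 151.65 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |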
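
As a small illustration of what `pd.get_dummies` does with `drop_first=True`, the sketch below encodes a three-row toy frame (the `toy` data is made up for this example). The first category in sorted order, 'DSL', gets no column of its own and is represented by zeros in the remaining indicators.

```python
import pandas as pd

# Toy frame: one numeric column and one three-category string column (values are made up).
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "InternetService": ["DSL", "Fiber optic", "No"],
})

# Numeric columns pass through unchanged; string columns become 0/1 indicators.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
columns = list(encoded.columns)

print(columns)
print(encoded)
```

Dropping one indicator per feature avoids perfectly collinear columns, which matters mostly for unregularized linear models; tree-based models are generally indifferent to it.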


## 3. Scaling Numerical Features

Features such as __MonthlyCharges__ and __TotalCharges__ have very different ranges.
Scale-sensitive models like Logistic Regression or SVM usually perform better when all features are on a similar scale, so the numerical columns are standardized to zero mean and unit variance.

In [6]:
from sklearn.preprocessing import StandardScaler

num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'NumServicesUsed']
scaler = StandardScaler()
# Note: the scaler is fit on the full dataset here, i.e. before the train-test split
df[num_cols] = scaler.fit_transform(df[num_cols])

df[num_cols].describe()

|   | tenure | MonthlyCharges | TotalCharges | NumServicesUsed |
|---|---|---|---|---|
| count | 7032.0 | 7032.0 | 7032.0 | 7032.0 |
| mean | -1.126643e-16 | 6.062651e-17 | -1.119064e-16 | 6.214218e-17 |
| std | 1.000071 | 1.000071 | 1.000071 | 1.000071 |
| min | -1.280248 | -1.547283 | -0.9990692 | -1.631168 |
| 25% | -0.9542963 | -0.9709769 | -0.8302488 | -1.146183 |
| 50% | -0.1394171 | 0.184544 | -0.3908151 | -0.1762139 |
| 75% | 0.9199259 | 0.8331482 | 0.6668271 | 0.7937555 |
| max | 1.612573 | 1.793381 | 2.824261 | 2.24871 |
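
The summary above shows means of essentially zero and a standard deviation of 1.000071 instead of exactly 1. A minimal sketch of why, using the five MonthlyCharges values from the preview earlier as sample data: `StandardScaler` computes z = (x - mean) / std with the population standard deviation (ddof=0), while pandas `describe()` reports the sample standard deviation (ddof=1). For n = 7032 rows the ratio between the two is sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five MonthlyCharges values taken from the preview rows, used only as sample input.
x = pd.DataFrame({"MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70]})

# StandardScaler: subtract the mean, divide by the population std (ddof=0).
scaled = StandardScaler().fit_transform(x)[:, 0]
col = x["MonthlyCharges"]
manual = ((col - col.mean()) / col.std(ddof=0)).to_numpy()

pop_std = scaled.std(ddof=0)      # what StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)   # what pandas describe() would report

# Expected describe() std for a column of 7032 standardized rows.
expected_describe_std = np.sqrt(7032 / 7031)

print(pop_std, sample_std)
print(round(expected_describe_std, 6))
```

The small deviation from 1 is therefore a reporting convention, not a sign that the scaling went wrong.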


## 4. Train-Test Split

- Train the model on one part of the data
- Test it on the held-out part to evaluate performance on unseen cases

In [7]:
from sklearn.model_selection import train_test_split

# Define target and features
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split: 80% train, 20% test; stratify=y keeps the churn ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
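
In this notebook the scaler is fit on the full dataset before the split, so test-set statistics influence the scaling. A commonly used alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. This is not what the notebook does; the `toy` frame and its value ranges are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up); the real notebook uses the Telco dataframe.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=200).astype(float),
    "MonthlyCharges": rng.uniform(18.0, 120.0, size=200),
    "Churn": rng.integers(0, 2, size=200),
})

X = toy.drop("Churn", axis=1)
y = toy["Churn"]

# Split first, so the test rows stay unseen by every fitted preprocessing step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_test = X_train.copy(), X_test.copy()

num_cols = ["tenure", "MonthlyCharges"]
scaler = StandardScaler()
# Learn mean and std from the training rows only, then reuse them for the test rows.
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train[num_cols].mean().round(6).tolist())
print(X_test[num_cols].mean().round(3).tolist())
```

With this ordering the training columns have mean zero by construction, while the test columns are only approximately centered, which is the expected behavior for data the scaler has not seen.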


## 5. Save the Data

In [8]:
import joblib

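# Save the four splits together as a single tuple
joblib.dump((X_train, X_test, y_train, y_test), 'D:/Project/model/preprocessed_data.pkl')

# Save the fitted scaler so the same transformation can be applied to new data later
joblib.dump(scaler, 'D:/Project/model/scaler.pkl')

# Also keep the full preprocessed dataframe as a CSV
df.to_csv("D:/Project/data/preprocessed_data_csv.csv", index=False)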
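
For reference, a minimal sketch of the matching load step: objects written with `joblib.dump` are read back with `joblib.load`, and a restored scaler applies the mean and standard deviation it learned earlier. The example uses a temporary directory and a tiny made-up array instead of the project paths above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on a tiny made-up column whose mean is 2.0.
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# Round-trip the fitted scaler through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = str(Path(tmp) / "scaler.pkl")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The restored scaler reuses the stored statistics; it is not refit.
new_scaled = restored.transform(np.array([[2.0]]))

print(restored.mean_, new_scaled)
```

The tuple of splits can be restored the same way and unpacked into four names, in the order in which it was saved.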

In [9]:
print(X_train.shape, X_test.shape)
print(y_train.value_counts())
print(y_test.value_counts())

(5625, 29) (1407, 29)
Churn
0    4130
1    1495
Name: count, dtype: int64
Churn
0    1033
1     374
Name: count, dtype: int64
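
As a quick arithmetic check on the counts printed above, the sketch below computes the churn rate in each split and the share of rows that went to the test set. The numbers are copied from the output; the helper `churn_rate` is a name introduced here for illustration.

```python
# Class counts copied from the printed value_counts() output.
train_counts = {0: 4130, 1: 1495}
test_counts = {0: 1033, 1: 374}


def churn_rate(counts):
    """Fraction of rows labelled 1 (churned) in a split."""
    return counts[1] / sum(counts.values())


train_rate = churn_rate(train_counts)
test_rate = churn_rate(test_counts)

# Share of all rows that ended up in the test set.
n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_share = n_test / (n_train + n_test)

print(round(train_rate, 4), round(test_rate, 4), round(test_share, 3))
```

Both splits have a churn rate of about 26.6%, which is what `stratify=y` is meant to guarantee, and the test set holds about 20% of the 7032 rows.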
