# Data Preprocessing for Deep Learning

This notebook focuses on preparing telecom customer data for training a deep learning churn prediction model.


In [27]:
import numpy as np
import pandas as pd

In [28]:
# Absolute path on the author's machine; adjust to your local checkout.
data_path = r"D:\Aman Deep\Deep-Learning\telecom-churn-deep-learning\data\processed\telecom_churn_initial_clean.csv"
df = pd.read_csv(data_path)

In [29]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


The cleaned dataset from the EDA phase is loaded for preprocessing.

In [30]:
df = df.drop(columns=["customerID"])

Customer IDs are removed as they do not contribute to churn prediction.

In [31]:
y = df['Churn'].map({'Yes': 1, 'No': 0})

Churn is converted into a binary numerical variable for model training.
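One thing worth checking after a dictionary-based mapping is that every label was covered. The sketch below uses a small made-up Series (not the real data) to show how `Series.map` leaves `NaN` for any value missing from the dictionary; the names `churn_demo`, `y_demo`, and `n_unmapped` are illustrative only.

```python
import pandas as pd

# Toy labels standing in for df['Churn']; the values are illustrative only.
churn_demo = pd.Series(["Yes", "No", "No", "Yes", "no"])

y_demo = churn_demo.map({"Yes": 1, "No": 0})

# Series.map returns NaN for any value that is not a key in the dictionary,
# so an unexpected spelling such as "no" shows up as a missing label.
n_unmapped = int(y_demo.isna().sum())
print(n_unmapped)
```

A count of zero on the real column would confirm that only `Yes` and `No` occur.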

In [32]:
X = df.drop(columns=['Churn'])

In [33]:
categorical_features = X.select_dtypes(include="object").columns
numerical_features = X.select_dtypes(exclude="object").columns

In [34]:
print("Categorical Features: \n", categorical_features)
print("\n\nNumerical Features:\n ", numerical_features)

Categorical Features: 
 Index(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod'],
      dtype='object')


Numerical Features:
  Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')


Separating categorical and numerical features allows appropriate preprocessing strategies for each type.

In [35]:
X[numerical_features] = X[numerical_features].fillna(X[numerical_features].median())


Missing numerical values are filled with the column median, which is less sensitive than the mean to the outliers commonly present in billing data.
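To illustrate why the median is the more robust choice, here is a minimal sketch on a made-up column with one extreme value. The numbers and the frame name `charges_demo` are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one extreme outlier.
charges_demo = pd.DataFrame({"TotalCharges": [20.0, 30.0, 40.0, np.nan, 8000.0]})

# Both statistics skip NaN by default.
median_value = charges_demo["TotalCharges"].median()  # 35.0
mean_value = charges_demo["TotalCharges"].mean()      # 2022.5

# Same pattern as the notebook: fill each column with its own median.
filled_demo = charges_demo.fillna(charges_demo.median())
print(filled_demo)
```

The mean is pulled far toward the outlier, while the median stays close to the typical values, so the imputed entry remains plausible.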

In [36]:
X_encoded = pd.get_dummies(X, columns=categorical_features, drop_first=True)

In [37]:
X_encoded.shape

(7043, 30)

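One-hot encoding converts the categorical variables into numerical indicator columns suitable for neural networks. With drop_first=True, one level per feature is dropped to avoid redundant columns, giving 30 features in total.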
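As a small illustration of what `pd.get_dummies` does with `drop_first=True`, the sketch below encodes a made-up three-row frame. The frame `toy` is hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# A column with k categories yields k - 1 indicator columns when
# drop_first=True; the first category in sorted order is dropped.
encoded_toy = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
print(list(encoded_toy.columns))
```

Here `Month-to-month` becomes the implicit baseline: a row with both remaining indicators set to false belongs to that category.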

In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)

In [40]:
X_scaled_df = pd.DataFrame(
    X_scaled,
    columns=X_encoded.columns
)

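Feature scaling standardizes each column to zero mean and unit variance, which helps gradient-based neural network training converge stably and efficiently. Note that the scaler here is fit on the full dataset before the split, so test-set statistics influence the scaling; fitting it on the training split alone would avoid this leakage.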
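For reference, the transformation `StandardScaler` applies can be reproduced by hand. The sketch below compares it against a manual z-score on a few made-up values; the array `values` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single toy feature column (values are illustrative).
values = np.array([[29.85], [56.95], [53.85], [42.30], [70.70]])

scaled = StandardScaler().fit_transform(values)

# StandardScaler subtracts the column mean and divides by the population
# standard deviation (ddof=0), which is also NumPy's default for std().
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```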

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

Stratified splitting ensures the churn ratio remains consistent across training and test sets.
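A minimal sketch of a stratified split followed by train-only scaling, on synthetic data rather than the churn dataset. It is one common ordering, not the notebook's own pipeline, and all names (`X_syn`, `y_syn`, `scaler_syn`, and so on) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 features, 25% positive class.
rng = np.random.default_rng(42)
X_syn = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_syn = np.array([0] * 150 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# With stratify, both splits keep roughly the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()

# Fit on the training rows only, then reuse those statistics for the test rows.
scaler_syn = StandardScaler()
X_tr_scaled = scaler_syn.fit_transform(X_tr)
X_te_scaled = scaler_syn.transform(X_te)

print(train_rate, test_rate)
```

Calling `transform` rather than `fit_transform` on the test rows keeps the test set from contributing to the mean and variance estimates.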

In [41]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((5634, 30), (1409, 30), (5634,), (1409,))

The 80/20 stratified split yields 5,634 training and 1,409 test samples with 30 features each, preserving the class distribution.

import os
os.getcwd()


In [47]:
os.makedirs("../data/processed", exist_ok=True)


In [48]:
np.save("../data/processed/X_train.npy", X_train)
np.save("../data/processed/X_test.npy", X_test)
np.save("../data/processed/y_train.npy", y_train)
np.save("../data/processed/y_test.npy", y_test)


Preprocessed datasets are saved to maintain separation between data preparation and model training stages.
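A training notebook could read the arrays back with `np.load`. The sketch below shows the save and load round trip in a temporary directory so it runs anywhere; the array `arr` and the temporary path are stand-ins, and a real training script would point at the `../data/processed` files written above.

```python
import os
import tempfile

import numpy as np

# Stand-in array; in the notebook this would be X_train and friends.
arr = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_train.npy")
    np.save(path, arr)       # writes a binary .npy file
    loaded = np.load(path)   # reads it back as an ndarray

print(np.array_equal(arr, loaded))
```

The `.npy` format stores shape and dtype alongside the values, so the loaded arrays need no further parsing before training.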