# Customer Churn — Data Preprocessing

## Objective
Prepare a clean, model-ready dataset with no data leakage.

This notebook is part of an end-to-end customer churn classification project.
All preprocessing, modeling, and evaluation steps are designed to be:
- Leakage-safe
- Reproducible
- Interview-defensible



This notebook focuses on understanding and preparing a customer churn dataset
for supervised machine learning. The objective is to assess data quality,
handle missing values, encode categorical variables, and produce a clean,
model-ready dataset for downstream analysis.


## Loading Libraries & Dataset

In [1]:
import pandas as pd
import numpy as np


df = pd.read_csv("../data/churn.csv")
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Dataset Overview

In [2]:
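# Only the last expression in a cell is displayed, so the bare df.shape and
# pd.DataFrame(...) lines showed nothing; df.info() already reports the row
# count, column names, non-null counts, and dtypes.
df.info()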

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


The dataset contains customer-level demographic, service usage, and billing
information. The target variable is `Churn`, indicating whether a customer
has discontinued the service.

Understanding column types and data quality is critical before applying
any machine learning models.


## Target Variable Analysis

In [None]:
# Inspect the raw labels and the class balance
print(df['Churn'].unique())
churn_dist = df['Churn'].value_counts(normalize=True)
print(churn_dist)

# Convert the target to a binary label (No -> 0, Yes -> 1)
df['Churn'] = df['Churn'].map({'No': 0, 'Yes': 1})

- The classes are moderately imbalanced, so accuracy alone can be misleading; report recall, precision, F1, or ROC-AUC alongside it.
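
To make the accuracy caveat concrete, the sketch below computes the class shares and the accuracy of a trivial model that always predicts the majority class. The `labels` series is made-up toy data standing in for `df['Churn']`, so the proportions it prints are illustrative only and are not results from this dataset.

```python
import pandas as pd

# Toy labels standing in for df['Churn']; the real proportions will differ.
labels = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="Churn")

# Share of each class in the target.
class_share = labels.value_counts(normalize=True)

# A model that always predicts the most frequent class reaches this accuracy
# without learning anything, which is why accuracy alone can mislead.
majority_baseline = float(class_share.max())

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model evaluated later should be compared against this baseline rather than against 0.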

### Column Categorization

We categorize columns into:
- Numerical features
- Categorical features
- Target variable

This helps decide preprocessing steps such as scaling and encoding.


In [4]:
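# Numeric predictors; errors='ignore' keeps this working whether or not Churn is already numeric
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.drop('Churn', errors='ignore').tolist()
# Object/category columns to encode, excluding the row identifier
categorical_features = df.select_dtypes(include=['object', 'category']).columns.drop('customerID').tolist()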
target_variable = df['Churn']


## Missing Values Analysis

In [5]:
df.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

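- `isna()` reports no missing values, so we can proceed to encoding the categorical variables. Note that `isna()` does not flag blank or whitespace-only strings, so numeric columns stored as text (such as `TotalCharges`) should be checked separately.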
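
A minimal sketch of that separate check, assuming `TotalCharges` is read as text and that unparseable entries are blank strings. The `toy` frame and the rule "fill with 0 when `tenure` is 0" are illustrative assumptions, not steps taken from this notebook.

```python
import pandas as pd

# Toy frame standing in for the churn data; the blank string mimics a
# value that isna() does not report as missing.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", "1889.5", " "],
})

hidden_before = int(toy["TotalCharges"].isna().sum())  # blank string is not counted

# Coerce to numeric: anything that fails to parse becomes NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Assumption: a missing total for a customer with zero tenure means
# nothing has been billed yet, so 0 is a reasonable fill value.
zero_tenure = toy["tenure"] == 0
toy.loc[zero_tenure, "TotalCharges"] = toy.loc[zero_tenure, "TotalCharges"].fillna(0.0)

print(hidden_before, n_missing, toy["TotalCharges"].dtype)
```

If the fill value is instead estimated from the data (for example a median), it should be computed on the training split only to stay leakage-safe.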

## Encode Categorical Variables

In [6]:
# Separate binary vs multi-category features
binary_categorical = [col for col in categorical_features if df[col].nunique() == 2]

multi_categorical = [col for col in categorical_features if df[col].nunique() > 2]

In [7]:
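# Encode binary categorical features as 0/1

for col in binary_categorical:
    # Sort the two labels so the mapping does not depend on row order
    # (e.g. 'No' -> 0, 'Yes' -> 1; 'Female' -> 0, 'Male' -> 1)
    first, second = sorted(df[col].unique())
    df[col] = df[col].map({first: 0, second: 1})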

In [8]:
# One-Hot Encode multi-category features

df = pd.get_dummies(df, columns=multi_categorical, drop_first=True)
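
Because one-hot encoding creates one column per distinct value, it can help to check cardinality first. The sketch below is one way to do that on a toy frame; `MAX_LEVELS` is a hypothetical threshold chosen for the example, and the column values are made up.

```python
import pandas as pd

# Toy frame; 'TotalCharges' is stored as text to mimic a numeric column
# that was read in as object.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "TotalCharges": ["29.85", "1889.5", "108.15", "1840.75"],
})

MAX_LEVELS = 3  # hypothetical threshold; tune per dataset

# Number of distinct values in each text column.
cardinality = toy.select_dtypes(include="object").nunique()

safe_to_encode = cardinality[cardinality <= MAX_LEVELS].index.tolist()
needs_review = cardinality[cardinality > MAX_LEVELS].index.tolist()

# Only the low-cardinality columns are one-hot encoded.
encoded = pd.get_dummies(toy, columns=safe_to_encode, drop_first=True)

print("Encode:", safe_to_encode)
print("Review before encoding:", needs_review)
print(encoded.columns.tolist())
```

Columns flagged for review are usually either identifiers to drop or numeric values stored as text that should be converted instead of encoded.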

In [9]:
# Dropping customerID column
df = df.drop(columns='customerID', errors='ignore')

# Sanity check
df.select_dtypes(include='object').columns

Index([], dtype='object')

## Feature/Target Split

In [24]:
y = df['Churn']
X = df.drop(columns=['Churn'], errors='ignore')

In [26]:
# Sanity check
print(X.shape)
print(y.shape)

(7043, 6559)
(7043,)

- A feature count of 6,559 is far more than one-hot encoding a handful of service, contract, and payment columns should produce. A count this large suggests that a high-cardinality text column, most likely `TotalCharges`, was still stored as `object` and was expanded into one column per distinct value. Converting it to a numeric dtype before the encoding step should bring the feature count down to a few dozen.


In [29]:
# Save preprocessed dataframe to CSV
df.to_csv("../data/churn_preprocessed.csv", index=False)

### Key Observations

- The dataset contains both numerical and categorical customer attributes.
- The target variable shows moderate class imbalance.
- Several categorical variables required encoding before modeling.
- After preprocessing, the dataset is fully numeric and ML-ready.

## Data Quality Summary

- The `TotalCharges` column is stored as text and needs conversion to a numeric dtype, with handling for values that fail to parse, before categorical encoding.
- No duplicate records were found.
- After preprocessing, the dataset contains no missing values and all features
  are suitable for machine learning models.
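
The claims in this summary can be checked programmatically. The helper below is a hypothetical sketch (the name `validate_model_ready` and the toy frame are assumptions for illustration); it reports missing values, duplicate rows, and any columns that are not numeric or boolean.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def validate_model_ready(frame: pd.DataFrame, target: str = "Churn") -> dict:
    """Return simple data-quality facts for a preprocessed frame."""
    non_numeric = [
        col for col in frame.columns
        if not (is_numeric_dtype(frame[col]) or is_bool_dtype(frame[col]))
    ]
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric_columns": non_numeric,
        "target_present": target in frame.columns,
    }


# Toy preprocessed frame; values are made up.
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract_One year": [False, True, False],
    "Churn": [0, 0, 1],
})

report = validate_model_ready(toy)
print(report)
```

Duplicate checks are most meaningful when run before the identifier column is dropped, since distinct customers can share identical feature values afterwards.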
