# Day 2 - Preprocessing and Data Balancing

**Objective:** Encode categorical features, split the dataset, handle class imbalance using SMOTE, and save the final datasets.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# Load the raw Telco churn dataset; the Day 1 cleaning steps are re-applied below
df = pd.read_csv("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Reconvert TotalCharges and drop rows with nulls (as done in Day 1)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(subset=['TotalCharges'], inplace=True)

# Reset index (optional)
df.reset_index(drop=True, inplace=True)

# Drop customerID (a unique identifier with no predictive value)
df.drop('customerID', axis=1, inplace=True)
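
The `TotalCharges` cleanup above can be illustrated on a toy column. This is a minimal sketch that assumes the problem values are blank strings; the `raw` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as text, with one blank entry
raw = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# errors="coerce" turns unparseable strings into NaN instead of raising
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")
n_missing = int(raw["TotalCharges"].isna().sum())

# Drop the rows that could not be parsed and renumber the index
cleaned = raw.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(n_missing, len(cleaned))
```

Counting the NaN values before dropping them is a cheap way to confirm how many rows the cleanup removes.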


## Encoding Categorical Columns

We will use:
- **Label Encoding** for binary categorical features.
- **One-Hot Encoding** for multi-category features.
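
The two encodings can be compared on a small example. This is a sketch on a made-up `toy` frame; it uses an explicit `map` for the binary column, which gives the same 0/1 result as `LabelEncoder` on `Yes`/`No` values but makes the mapping visible.

```python
import pandas as pd

# Hypothetical toy frame with one binary and one multi-category column
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Binary column: a single 0/1 column is enough
toy["Partner"] = toy["Partner"].map({"No": 0, "Yes": 1})

# Multi-category column: one indicator column per level,
# minus the first level (drop_first=True), which becomes the baseline
toy_encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)
print(toy_encoded)
```

With `drop_first=True`, a row whose indicator columns are all zero belongs to the dropped baseline level (here `Month-to-month`).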


In [3]:
# Label encode 'Yes/No' columns
# (LabelEncoder sorts labels alphabetically, so 'No' -> 0 and 'Yes' -> 1)
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']

le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])


In [4]:
# One-hot encode the remaining categorical columns.
# drop_first=True drops one level per column, so binary 'gender' becomes a single column.
multi_cat_cols = ['InternetService', 'Contract', 'PaymentMethod', 'MultipleLines', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                  'gender']

df_encoded = pd.get_dummies(df, columns=multi_cat_cols, drop_first=True)


## Train-Test Split

Split the data into training and test sets (80/20), stratified on the target (`Churn`) so that both sets keep the same churn rate.
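
The effect of stratification can be checked on synthetic labels. This is a minimal sketch with a made-up 30% positive rate; `X_toy` and `y_toy` are illustrative names, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target: 100 rows, 30 positives and 70 negatives
y_toy = pd.Series([1] * 30 + [0] * 70, name="Churn")
X_toy = pd.DataFrame({"feature": np.arange(100)})

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate by chance.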


In [5]:
# Split into features and target
X = df_encoded.drop('Churn', axis=1)
y = df_encoded['Churn']

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Before SMOTE:")
print("Training target class distribution:\n", y_train.value_counts(normalize=True))


Before SMOTE:
Training target class distribution:
 Churn
0    0.734222
1    0.265778
Name: proportion, dtype: float64


## Balancing Classes with SMOTE

We apply SMOTE to the training data only, balancing the churned and non-churned classes. The test set keeps its original class distribution and contains no synthetic samples.
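
The core idea behind SMOTE is interpolation between a minority-class point and one of its nearest minority-class neighbors. The sketch below illustrates that idea with plain NumPy on made-up 2D points; it is not the `imblearn` implementation, and the `synthesize` helper is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def synthesize(points, k, rng):
    """Create one synthetic point between a random base point and a near neighbor."""
    base = points[rng.integers(len(points))]
    # Distances from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    # Indices of the k nearest neighbors; position 0 is the base point itself
    neighbor_idx = np.argsort(dists)[1:k + 1]
    neighbor = points[rng.choice(neighbor_idx)]
    # Pick a random position on the segment from base toward neighbor
    lam = rng.random()
    return base + lam * (neighbor - base)

synthetic = np.array([synthesize(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because each synthetic point lies on a segment between two real points, interpolated values for one-hot columns can be fractional rather than exactly 0 or 1.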


In [6]:
# Apply SMOTE
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print("After SMOTE:")
print("Balanced training target class distribution:\n", y_train_sm.value_counts(normalize=True))


After SMOTE:
Balanced training target class distribution:
 Churn
0    0.5
1    0.5
Name: proportion, dtype: float64


In [7]:
# Save processed datasets
X_train_sm.to_csv("../data/X_train_sm.csv", index=False)
y_train_sm.to_csv("../data/y_train_sm.csv", index=False)
X_test.to_csv("../data/X_test.csv", index=False)
y_test.to_csv("../data/y_test.csv", index=False)
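
When the saved files are loaded again, the targets come back as one-column DataFrames rather than Series. The round trip below is a sketch using a temporary directory and a made-up `y_demo` Series; the file name is illustrative and is not one of the files written above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical target Series standing in for y_train_sm or y_test
y_demo = pd.Series([0, 1, 1, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_demo.csv"
    # index=False writes only the header and the values
    y_demo.to_csv(path, index=False)
    # read_csv returns a one-column DataFrame; squeeze it back to a Series
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
```

Squeezing to a Series keeps the target one-dimensional, which is the shape scikit-learn estimators expect for `y`.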


## ✅ Summary

- Encoded all categorical variables
- Performed stratified train-test split
- Applied SMOTE to balance the training set
- Saved datasets for model training (Day 3)
