<a href="https://colab.research.google.com/github/SamFisher8/Telco-Customer-Churn-Analysis/blob/main/Data_Preparation/data/processed/US_2_5_Class_Imbalance_SMOTE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
## US-2.5: Class Imbalance Handling (Risk Mitigation)

### Problem
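#The dataset is imbalanced (roughly 73% / 27%). Counts in the raw Telco dataset:
#- No Churn: ~5,174
#- Churn: ~1,869
#The processed file used in this notebook has fewer rows, so the exact counts are recomputed below.

#This imbalance can cause models to overpredict the majority class.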

### Solution
#Apply SMOTE (Synthetic Minority Over-sampling Technique) **only to the training set** to achieve an approximately 50/50 class distribution.

### Constraint
#- SMOTE must NOT be applied to the test set.
#- Test set must remain realistic and imbalanced.


In [None]:
!git clone https://github.com/SamFisher8/Telco-Customer-Churn-Analysis.git
%cd Telco-Customer-Churn-Analysis

Cloning into 'Telco-Customer-Churn-Analysis'...
remote: Enumerating objects: 48, done.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


In [None]:
df = pd.read_csv("Data_Preparation/data/processed/Dataset_Encoded_v1.csv")

df.shape


(6741, 10)

In [None]:
df.dtypes


Column,dtype
gender,int64
SeniorCitizen,int64
Dependents,int64
tenure,int64
PhoneService,int64
MultipleLines,int64
InternetService,int64
Contract,int64
MonthlyCharges,int64
Churn,int64


In [None]:
df['Churn'].value_counts()


Churn,count
0,4950
1,1791


**What this tells us**

- Majority class (No Churn = 0): 4,950

- Minority class (Churn = 1): 1,791

- Ratio ≈ 73.4% / 26.6%

**This confirms:**

- A material class imbalance

- A high risk of model bias toward "No Churn"

**SMOTE is therefore justified as a risk mitigation, applied to the training set only**
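
The percentages above follow directly from the `value_counts()` output. A minimal sketch of the arithmetic, using only the two counts printed by the notebook (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {0: 4950, 1: 1791}  # 0 = No Churn, 1 = Churn

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]  # majority rows per minority row

print(f"Total rows: {total}")
for label, share in shares.items():
    print(f"Class {label}: {share:.1%}")
print(f"Imbalance ratio (majority/minority): {imbalance_ratio:.2f}")
```

A classifier that always predicts class 0 would already reach about 73.4% accuracy on this data, which is why accuracy alone is a weak metric for this problem.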

In [None]:
from sklearn.model_selection import train_test_split

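# Separate the features from the target column.
X = df.drop('Churn', axis=1)
y = df['Churn']

# Hold out 20% for testing; stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)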

print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)


Training set: (5392, 9) (5392,)
Testing set: (1349, 9) (1349,)
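
The shapes printed above can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to the training set; treat that rounding rule as an assumption to confirm against the installed version. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


n_rows, n_cols = 6741, 10   # df.shape reported by the notebook
n_train, n_test = split_sizes(n_rows, test_size=0.2)
n_features = n_cols - 1     # 'Churn' is dropped from X

print("Training set:", (n_train, n_features))
print("Testing set:", (n_test, n_features))
```

Under that assumption the sketch reproduces the 5,392 / 1,349 split and the 9 feature columns reported by the notebook.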


In [None]:
from imblearn.over_sampling import SMOTE

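# Fit SMOTE on the training data only; the test set is never resampled.
smote = SMOTE(random_state=42)

# With default settings the minority class is oversampled until both classes are equal in size.
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)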

print("Before SMOTE:")
print(y_train.value_counts())

print("\nAfter SMOTE:")
print(y_train_smote.value_counts())


Before SMOTE:
Churn
0    3959
1    1433
Name: count, dtype: int64

After SMOTE:
Churn
0    3959
1    3959
Name: count, dtype: int64
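
To make the resampling step less of a black box, the sketch below does two things. It first derives, from the counts printed above, how many synthetic churn rows SMOTE had to create and what the untouched test set must contain. It then illustrates the core idea behind SMOTE: interpolating between a minority sample and one of its minority-class neighbours. The `interpolate` helper and the two example rows are hypothetical and simplified; the real `imblearn` implementation also performs the k-nearest-neighbour search, which is omitted here.

```python
import random

# Counts from the notebook output above.
full = {0: 4950, 1: 1791}           # whole dataset
train_before = {0: 3959, 1: 1433}   # training set before SMOTE
train_after = {0: 3959, 1: 3959}    # training set after SMOTE

# Synthetic minority rows added by SMOTE.
n_synthetic = train_after[1] - train_before[1]

# The test set holds whatever the training set did not take.
test = {label: full[label] - train_before[label] for label in full}
test_churn_share = test[1] / sum(test.values())

print("Synthetic churn rows:", n_synthetic)
print("Test set counts:", test, f"(churn share = {test_churn_share:.1%})")


def interpolate(sample, neighbour, rng):
    """Return a point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


rng = random.Random(42)
# Two made-up minority rows with two features each (illustrative values only).
a, b = [2.0, 70.0], [6.0, 90.0]
synthetic = interpolate(a, b, rng)
print("Synthetic sample:", synthetic)
```

The derived test set (991 No Churn, 358 Churn) keeps the original imbalance of roughly 26.5% churn, which is what the constraint at the top of the notebook asks for. Because interpolation treats every feature as continuous, integer-encoded categorical columns may need extra care; `imblearn` also offers `SMOTENC` for mixed data, which may be worth evaluating.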
