# Customer Churn Prediction for Beta Bank

## Project Context
Beta Bank is experiencing customer churn and needs a predictive model to identify at-risk customers before they leave. Customer retention is significantly more cost-effective than acquisition, making this prediction crucial for business profitability.

## Dataset Overview
- **Target Variable**: `Exited` (1=customer churned, 0=customer stayed)
- **Features**: Customer demographics, banking relationship data, and account information
- **Challenge**: Handle class imbalance effectively using multiple techniques

## Project Objectives
1. **Develop classification model** with F1-score ≥ 0.59 on test set
2. **Compare multiple approaches** for handling class imbalance
3. **Evaluate AUC-ROC performance** alongside F1-score metrics
4. **Provide actionable insights** for customer retention strategies

## Key Dataset Features
- **Demographics**: Geography, Gender, Age
- **Banking Data**: Credit Score, Balance, Tenure, Products Used
- **Behavioral**: Credit Card ownership, Account activity, Estimated Salary

## Technical Approach
This analysis will explore data preprocessing, class imbalance handling, model selection, and performance evaluation to build an effective churn prediction system.

##### 1. Importing Libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

##### 2. Data Loading and Preprocessing

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
# Class balance (target variable)
print(df['Exited'].value_counts())

0    7963
1    2037
Name: Exited, dtype: int64


The dataset is imbalanced: 79.6% of customers stayed (0) and 20.4% left (1). A model trained on this data tends to favor predicting "stayed", and plain accuracy is misleading because always predicting the majority class already scores 79.6%. Class balancing techniques are therefore needed for useful churn prediction.
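To make the point about accuracy concrete, the arithmetic can be worked through for a hypothetical baseline that predicts "stayed" for every customer. The class counts come from the output above; the baseline itself is only an illustration and is not a model used in this project.

```python
# Class counts taken from the value_counts() output above
stayed, churned = 7963, 2037
total = stayed + churned

# Hypothetical baseline: predict "stayed" (0) for every customer
true_positives = 0            # no churner is ever flagged
false_positives = 0           # nobody is flagged at all
false_negatives = churned     # every churner is missed
true_negatives = stayed

accuracy = (true_positives + true_negatives) / total

# Precision is undefined when nothing is predicted positive; treat it as 0
predicted_positives = true_positives + false_positives
precision = true_positives / predicted_positives if predicted_positives else 0.0
recall = true_positives / (true_positives + false_negatives)
f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"Accuracy: {accuracy:.3f}")  # 0.796
print(f"F1-score: {f1:.3f}")        # 0.000
```

The baseline looks respectable by accuracy yet finds no churners at all, which is why F1-score and AUC-ROC are the metrics tracked here.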

In [5]:
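# Encode categorical features using OHE.
# Note: df still contains RowNumber, CustomerId, and Surname at this point,
# so Surname is one-hot encoded too and the two ID columns remain as features.
df_ohe = pd.get_dummies(df, drop_first=True)
df_ohe = df_ohe.fillna(0)  # Only Tenure has missing values (909 rows); fill them with 0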
print(f"Features after encoding: {df_ohe.shape[1]}")

Features after encoding: 2945


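One-hot encoding (OHE) converts categorical features into numerical indicator columns. The column count is as high as 2945 because `Surname` was encoded along with `Geography` and `Gender`, so nearly all of the columns are surname indicators. Identifier-like columns such as `RowNumber`, `CustomerId`, and `Surname` are unlikely to carry generalizable signal and would normally be dropped before encoding; the results below were produced with them included.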
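The cell above encodes every object column, including `Surname`. Below is a minimal sketch of an alternative on made-up toy rows, assuming the identifier-like columns are dropped first and `Tenure` is imputed with the median. Both choices are assumptions for illustration, not what this notebook does, and they would change the scores reported later.

```python
import pandas as pd

# Toy rows with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["Smith", "Jones", "Lee", "Khan"],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Tenure": [2.0, None, 5.0, 7.0],
    "Exited": [0, 1, 0, 1],
})

# Encoding everything, as in the cell above: each extra surname adds a column
encoded_all = pd.get_dummies(toy, drop_first=True)

# Assumed alternative: drop identifier-like columns before encoding
cleaned = toy.drop(columns=["RowNumber", "CustomerId", "Surname"])
# Assumed alternative: fill missing Tenure with the median instead of 0
cleaned["Tenure"] = cleaned["Tenure"].fillna(cleaned["Tenure"].median())
encoded_clean = pd.get_dummies(cleaned, drop_first=True)

print(f"Columns when encoding everything: {encoded_all.shape[1]}")
print(f"Columns after dropping identifiers: {encoded_clean.shape[1]}")
print(list(encoded_clean.columns))
```

On the toy frame the surname indicators alone account for several columns; on the real data the same effect is what inflates the feature count into the thousands.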

##### 3. Data Splitting

In [6]:
# Split data into features and target
target = df_ohe['Exited']
features = df_ohe.drop('Exited', axis=1)

# Split into train+valid and test (20% for test)
features_temp, features_test, target_temp, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345)

# Now split the rest into train and valid (75% train, 25% valid of the remaining 80%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.25, random_state=12345)

print(f"Training set size: {features_train.shape[0]}")
print(f"Validation set size: {features_valid.shape[0]}")
print(f"Test set size: {features_test.shape[0]}")

Training set size: 6000
Validation set size: 2000
Test set size: 2000


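The data is divided into 60% training, 20% validation, and 20% test sets. The splits are random rather than stratified (no `stratify` argument is passed to `train_test_split`), so the class distribution is only approximately preserved across them.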
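The `train_test_split` calls above do not pass `stratify`. If an exact class ratio in every split is wanted, one option is to stratify on the target at each step. The sketch below uses a synthetic stand-in for the data (1000 rows, 20% positives); it is an assumption about how the split could be done, not the split used for the reported results.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 20% positive class
features = pd.DataFrame({"x": range(1000)})
target = pd.Series([1] * 200 + [0] * 800, name="Exited")

# Same 60/20/20 scheme as above, but stratified on the target at each step
features_temp, features_test, target_temp, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345, stratify=target)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.25, random_state=12345,
    stratify=target_temp)

# Report size and churn rate of each split
for name, part in [("train", target_train), ("valid", target_valid),
                   ("test", target_test)]:
    print(f"{name}: n={len(part)}, churn rate={part.mean():.3f}")
```

With stratification, each split keeps roughly the same share of churners as the full dataset, which makes validation and test scores easier to compare.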

##### 4. Baseline Model

In [7]:
print("--- BASE MODEL (NO BALANCING) ---")
for depth in [10, 50, 100]:
    model = RandomForestClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    f1 = f1_score(target_valid, predicted_valid)
    print(f"Random Forest (max_depth={depth}) F1-score: {f1:.3f}")

--- BASE MODEL (NO BALANCING) ---
Random Forest (max_depth=10) F1-score: 0.000
Random Forest (max_depth=50) F1-score: 0.340
Random Forest (max_depth=100) F1-score: 0.503


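Without balancing, every model falls below the 0.59 requirement, and the shallowest forest (max_depth=10) scores an F1 of 0.000, meaning it detected essentially none of the churners in the validation set. Class balancing is needed to improve minority-class detection.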

##### 5. Model Optimization with Class Balancing

In [8]:
# Random Forest Classifier with Downsampling + Class Weights
def downsample(features, target, fraction):
    features_zeros = features[target == 0]  # Majority class (stays)
    features_ones = features[target == 1]   # Minority class (leaves)
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Sample fraction of majority class + all minority class
    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345), features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345), target_ones])
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

    return features_downsampled, target_downsampled

# Apply downsampling to training set
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.4)
print(f"Original training size: {len(features_train)}")
print(f"Downsampled training size: {len(features_downsampled)}")

# Train on the downsampled set with balanced class weights, comparing tree depths
print("\n--- RANDOM FOREST WITH BALANCING ---")
for depth in [10, 50, 100]:
    rf_model = RandomForestClassifier(max_depth=depth, random_state=12345, class_weight='balanced')
    rf_model.fit(features_downsampled, target_downsampled)
    
    predicted_valid = rf_model.predict(features_valid) # Evaluate
    f1 = f1_score(target_valid, predicted_valid)
    print(f"Random Forest (max_depth={depth}) F1-score: {f1:.3f}")
    prob_valid = rf_model.predict_proba(features_valid)[:, 1]
    auc_valid = roc_auc_score(target_valid, prob_valid)
    print(f"AUC-ROC on validation set: {auc_valid:.3f}")

Original training size: 6000
Downsampled training size: 3131

--- RANDOM FOREST WITH BALANCING ---
Random Forest (max_depth=10) F1-score: 0.530
AUC-ROC on validation set: 0.798
Random Forest (max_depth=50) F1-score: 0.560
AUC-ROC on validation set: 0.824
Random Forest (max_depth=100) F1-score: 0.562
AUC-ROC on validation set: 0.824


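Downsampling combined with balanced class weights improved the validation F1-score at every depth (for example, from 0.503 to 0.562 at max_depth=100).
* The best result was at max_depth=100, with F1=0.562 and AUC-ROC=0.824. On the validation set this is still below the 0.59 target.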
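To see what downsampling does to the class ratio, here is a compact, index-based variant of the `downsample` idea applied to made-up toy data. The `seed` parameter and the toy frame are assumptions for illustration; the function used for the reported results is the one defined in the cell above.

```python
import pandas as pd
from sklearn.utils import shuffle

def downsample(features, target, fraction, seed=12345):
    """Keep `fraction` of the majority class (0) and all of the minority class (1)."""
    zeros_idx = target[target == 0].sample(frac=fraction, random_state=seed).index
    ones_idx = target[target == 1].index
    keep = zeros_idx.append(ones_idx)
    # Shuffle features and target together so rows stay aligned
    return shuffle(features.loc[keep], target.loc[keep], random_state=seed)

# Toy data (made up): 80 stayers, 20 churners
toy_features = pd.DataFrame({"x": range(100)})
toy_target = pd.Series([0] * 80 + [1] * 20)

small_features, small_target = downsample(toy_features, toy_target, 0.4)
print(f"Rows kept: {len(small_target)}")  # 32 stayers + 20 churners = 52
print(f"Churn share before: {toy_target.mean():.3f}, after: {small_target.mean():.3f}")
```

Only the majority class is thinned, so the minority share rises while no churner rows are lost. Downsampling should be applied to training data only, never to the validation or test sets.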

##### 6. Final Model Evaluation

In [9]:
print("--- FINAL MODEL EVALUATION ---")

# Combine train and validation sets for final training
features_final = pd.concat([features_train, features_valid])
target_final = pd.concat([target_train, target_valid])

# Reuse the downsampling fraction from the validation experiments (0.4)
features_downsampled, target_downsampled = downsample(features_final, target_final, 0.4)

# Train final model with the best depth found on validation (max_depth=100)
final_model = RandomForestClassifier(max_depth=100, random_state=12345, class_weight='balanced')
final_model.fit(features_downsampled, target_downsampled)

# Evaluate on test set
predicted_test = final_model.predict(features_test)
f1 = f1_score(target_test, predicted_test)
probs_test = final_model.predict_proba(features_test)[:, 1]
auc = roc_auc_score(target_test, probs_test)

print(f"F1-score on test set: {f1:.3f}")
print(f"AUC-ROC on test set: {auc:.3f}")

--- FINAL MODEL EVALUATION ---
F1-score on test set: 0.619
AUC-ROC on test set: 0.856


Final model evaluation on the test set:
* F1-score = 0.619 (exceeds the 0.59 requirement) and AUC-ROC = 0.856.
* The configuration combines 40% downsampling of the majority class, balanced class weights, and max_depth=100. Only one downsampling fraction (0.4) is evaluated in this notebook, so it is not necessarily the optimal one.

##### 7. General Conclusion

The final model achieved an F1-score of 0.619 and an AUC-ROC of 0.856 on the test set, meeting the F1 ≥ 0.59 requirement. By addressing class imbalance through downsampling and class weighting, the model gives the bank a way to identify at-risk customers early and to focus retention efforts and marketing spend on them.