# Data Science Report: Predicting Bank Customer Churn

---



## Introduction

This project develops and evaluates a machine learning pipeline that predicts bank customer churn (the likelihood that a customer will leave the bank) from demographic, behavioral, and financial data.

The workflow integrates:

*   **Python + Jupyter Notebook**: Data preparation, model training, and evaluation.
*   **XGBoost pipeline**: Optimized for classification on tabular data.
*   **Power BI**: Interactive dashboards for business stakeholders.
*   **Final CSV Export**: Predictions with churn probabilities for each customer, for both the test split and the full dataset.


#### **Objective**:

*   Identify high-risk churn customers.
*   Understand segment-level drivers of churn.
*   Provide actionable insights for retention strategies.

## Dataset Overview

**Source**:
The data comes from the following Kaggle dataset:
https://www.kaggle.com/datasets/radheshyamkollipara/bank-customer-churn/data

**Rows**: 10,000 customers.

**Final columns exported (post-model)**:


*   CustomerId: unique identifier
*   Exited: actual churn (0 = stayed, 1 = churned)
*   pred_churn: model-predicted churn (binary)
*   pred_prob_churn: churn probability (0–1)
*   AgeRange, SalaryRange, TenureBand: engineered categorical bins for business-friendly analysis
*   Additional features: Geography, Gender, IsActiveMember, CreditScore, Balance, Card Type, etc.

The churn rate in the dataset is ~20.4%, making this a moderately imbalanced classification problem.

# Bank Customer Churn Prediction – Final Report

## 1. Project Overview
Customer churn is one of the most significant challenges for financial institutions. Retaining existing customers is far less costly than acquiring new ones, so accurately predicting churn enables proactive retention strategies.  

This project:
- Explored and preprocessed the Bank Churn Dataset (Kaggle).  
- Built an XGBoost classifier to predict churn.  
- Evaluated performance using a holdout test set.  
- Generated enriched prediction outputs with demographic and financial segmentation.  
- Designed a Power BI dashboard for interactive business analysis.

This notebook and dashboard demonstrate a full ML-to-business workflow: from raw data, to model, to business insights.

## 2. Data Preparation
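- **Dropped identifier columns from the model features**: `RowNumber`, `CustomerId`, `Surname` (`CustomerId` is kept aside so it can be re-attached to the exported predictions).  
- **Removed leakage-prone features**: e.g., `Complain`, which is likely recorded only after a customer is already dissatisfied and so acts as a near-proxy for churn.  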
- **Categorical Encoding**: OneHotEncoder for `Geography`, `Gender`, `Card Type`.  
- **Feature Scaling**: StandardScaler applied to numeric features.  
- **Feature Engineering**:
  - **AgeRange**: Customers grouped into quartiles (e.g., `21–32`, `33–41`).  
  - **SalaryRange**: Customers grouped into quartiles with currency formatting.  
  - **Sort Order Codes** added for clean Power BI visual sorting.  
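
The quartile bands and sort codes described above could be produced along the following lines. This is a minimal sketch, not the notebook's actual code: the use of `pd.qcut`, the helper name `add_quartile_band`, the `Sort` column suffix, and the toy ages are all assumptions made for illustration.

```python
import pandas as pd


def add_quartile_band(df, column, band_name, fmt="{:.0f}".format):
    """Bin `column` into quartiles; add a readable label and a numeric sort code.

    Hypothetical helper: names and label format are illustrative assumptions.
    """
    out = df.copy()
    # labels=False returns the quartile index (0..3) for each row
    codes = pd.qcut(out[column], q=4, labels=False, duplicates="drop").rename("bin")
    # Label each band with the smallest and largest value actually observed in it
    bounds = out.groupby(codes)[column].agg(low="min", high="max")
    labels = {
        code: f"{fmt(float(row['low']))}\u2013{fmt(float(row['high']))}"
        for code, row in bounds.iterrows()
    }
    out[band_name] = codes.map(labels)
    # 1-based code so a dashboard can sort bands numerically, not alphabetically
    out[f"{band_name}Sort"] = codes + 1
    return out


# Toy data only; not taken from the Kaggle dataset
ages = pd.DataFrame(
    {"Age": [21, 25, 30, 32, 33, 36, 40, 41, 42, 45, 50, 55, 56, 60, 70, 80]}
)
banded = add_quartile_band(ages, "Age", "AgeRange")
print(banded.drop_duplicates("AgeRange"))
```

For a salary band with currency formatting, the same helper could be called with `fmt="${:,.0f}".format`. Sorting a text column such as `AgeRange` alphabetically would misorder bands once values differ in digit count, which is why a separate numeric sort column is useful in Power BI.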

## 3. Model Training
An **XGBoost classifier** was trained with log-loss as the evaluation metric, a stratified 80/20 train–test split, and fixed random seeds for reproducibility.  

**Why XGBoost?**
- Handles mixed categorical and numeric tabular data well once categoricals are encoded.  
- Built-in regularization reduces overfitting.  
- Provides feature importances and probability outputs.
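
The preprocessing and split described in Sections 2 and 3 can be sketched as a single scikit-learn pipeline. This is an illustration under stated assumptions, not the notebook's code: the helper names, the seed value of 42, and the toy data are hypothetical, and `LogisticRegression` is used here purely as a lightweight stand-in so the sketch runs without XGBoost installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

CATEGORICAL = ["Geography", "Gender", "Card Type"]
DROPPED = ["RowNumber", "CustomerId", "Surname", "Complain"]
TARGET = "Exited"


def split_features(df, seed=42):
    """Drop ID/leakage columns, then make a stratified 80/20 split."""
    X = df.drop(columns=DROPPED + [TARGET], errors="ignore")
    y = df[TARGET]
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


def build_pipeline(numeric_cols, classifier):
    """One-hot encode categoricals, scale numerics, then fit `classifier`."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", StandardScaler(), numeric_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", classifier)])


# Toy frame with a 20% churn rate; values are synthetic, not from the dataset
rows = 25
demo = pd.DataFrame({
    "CustomerId": range(rows),
    "Geography": ["France", "Germany", "Spain", "France", "Germany"] * 5,
    "Gender": ["Female", "Male"] * 12 + ["Female"],
    "Card Type": ["GOLD", "SILVER", "PLATINUM", "DIAMOND", "GOLD"] * 5,
    "CreditScore": [500 + 10 * i for i in range(rows)],
    "Age": [25 + i for i in range(rows)],
    "Exited": [1 if i % 5 == 4 else 0 for i in range(rows)],
})

X_train, X_test, y_train, y_test = split_features(demo)
numeric_cols = X_train.select_dtypes(include="number").columns.tolist()

# Stand-in classifier (assumption); the report's model is XGBoost
pipeline = build_pipeline(numeric_cols, LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)
probs = pipeline.predict_proba(X_test)[:, 1]  # probability of class 1 (churn)
```

In the report's setting, the classifier passed to `build_pipeline` would presumably be something like `xgboost.XGBClassifier(eval_metric="logloss", random_state=42)`; the exact hyperparameters used are not stated in this report. Keeping the preprocessing inside the pipeline means the encoder and scaler are fitted on the training split only, which avoids leaking test-set statistics.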

## 4. Test Set Evaluation (Unbiased Performance)

Using the 20% holdout test set (`test_set_predictions.csv`):  

| Metric      | Class 0 (Stayed) | Class 1 (Churned) | Overall |
|-------------|------------------|-------------------|---------|
| Precision   | 0.88             | 0.64              | —       |
| Recall      | 0.94             | 0.82              | —       |
| F1-score    | 0.91             | 0.72              | —       |
| Accuracy    | —                | —                 | **0.86** |
| ROC AUC     | —                | —                 | **0.87** |


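**Confusion Matrix**:

These counts total 9,999 customers, which is close to the full dataset's 10,000 rows rather than a 20% holdout of about 2,000 rows, so the matrix appears to describe full-dataset predictions.

| Result         | Pred 0 | Pred 1 |
|----------------|--------|--------|
| Actual 0 (No)  | 7023   | 939    |
| Actual 1 (Yes) | 371    | 1666   |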



**Interpretation**:
- The model achieves high recall (82%) for churners, catching most customers likely to leave.  
- Precision of 64% means roughly a third of flagged customers would not actually churn, which is acceptable if retention actions cost less than losing a customer.  
- ROC AUC of 0.87 indicates the model ranks churners above non-churners well.  
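
Per-class precision, recall, and F1 follow directly from the four cells of a confusion matrix. The sketch below recomputes them from the counts in the matrix above; the helper name `class_metrics` is an illustrative assumption. Because those counts total 9,999 customers, the results describe that matrix and need not match every cell of the holdout metrics table.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the confusion matrix in this section
tn, fp, fn, tp = 7023, 939, 371, 1666

# Class 1 (churned) is the positive class
churn_precision, churn_recall, churn_f1 = class_metrics(tp, fp, fn)

# For class 0 (stayed) the roles swap: its "true positives" are the TN cell,
# its false positives are churners predicted to stay (FN), and vice versa
stay_precision, stay_recall, stay_f1 = class_metrics(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"churned: P={churn_precision:.2f} R={churn_recall:.2f} F1={churn_f1:.2f}")
print(f"stayed:  P={stay_precision:.2f} R={stay_recall:.2f} F1={stay_f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```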

## 5. Full Dataset Predictions (For Business Analysis)

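The pipeline was retrained on the **full dataset**, and enriched predictions were exported to `churn_predictions.csv`. Because the model has seen every row it scores here, these predictions support segment analysis rather than performance measurement. The export includes:  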
- Customer ID (for lookup).  
- Predicted churn probability (`pred_prob_churn`).  
- Predicted churn label (`pred_churn`).  
- Age and Salary ranges with sortable ordering.  

These predictions fuel the **Power BI dashboard**, where business stakeholders can explore:  
- Churn distribution by Age & Salary segments  
- Geographic churn patterns
- Effect of customer activity
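
As a rough illustration of how the exported prediction columns relate to each other, the sketch below assembles `pred_prob_churn` and `pred_churn` for a scored batch. The helper name `build_export`, the 0.5 cutoff, and the customer IDs are assumptions for illustration; the report does not state which threshold produced `pred_churn`.

```python
import io

import pandas as pd


def build_export(customer_ids, churn_probs, threshold=0.5):
    """Assemble the prediction columns for one scored batch of customers."""
    export = pd.DataFrame({
        "CustomerId": customer_ids,
        "pred_prob_churn": churn_probs,
    })
    # Binary label: churn when the probability reaches the cutoff (assumed 0.5)
    export["pred_churn"] = (export["pred_prob_churn"] >= threshold).astype(int)
    return export


# Made-up IDs and probabilities, for demonstration only
export = build_export([1001, 1002, 1003], [0.81, 0.12, 0.50])

# Written to an in-memory buffer here; a real run would pass a file path
buffer = io.StringIO()
export.to_csv(buffer, index=False)
csv_header = buffer.getvalue().splitlines()[0]
```

Keeping the probability column alongside the binary label lets the dashboard re-derive labels at a different cutoff without rescoring customers.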

## 6. Key Business Insights
- **Older customers** show higher churn rates across all income groups.
  - May need targeted engagement strategies.  
- **Inactive members** have elevated churn risk.
  - Retention could focus on re-engagement campaigns.  
- **Customers in Germany** show higher churn than other geographies across all groups.
  - Suggests branch-level or regional initiatives in Germany could reduce attrition.  
- **Women** show higher churn rates across the board.
  - Suggests targeted engagement and retention strategies may be needed.  


## 7. Next Steps
- **Hyperparameter Tuning**: Use GridSearchCV for optimal XGBoost parameters.  
- **Calibration**:
  - Apply Platt scaling (fitting a logistic curve that maps the model's scores to probabilities) or isotonic regression (fitting a non-decreasing step function to those scores) to better align predicted probabilities with observed churn rates.
  - Use Power BI to expose the decision threshold as an adjustable parameter, so users can change the cutoff applied to `pred_prob_churn` without retraining the model.
    - This would let a user on, say, a marketing team adjust the predictions to the outcome they are targeting.
      - For example, if they wanted a predicted churn rate of 10%, they could choose the threshold that flags about 10% of customers.
- **Deployment**: Expose the trained pipeline as an API or batch job for integration with CRM systems.  
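
The calibration and threshold ideas above can be sketched as follows. This is a minimal illustration on synthetic scores, not part of the project: the helper names are hypothetical, the Platt-style fit uses scikit-learn's `LogisticRegression` on one-dimensional scores (which applies default L2 regularization), and the over-confident scores are simulated rather than taken from the model.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


def fit_platt(scores, labels):
    """Platt-style calibration: a logistic curve fitted on 1-D scores."""
    model = LogisticRegression()
    model.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: model.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]


def fit_isotonic(scores, labels):
    """Isotonic calibration: a non-decreasing map from score to probability."""
    model = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    model.fit(scores, labels)
    return model.predict


def threshold_for_flag_rate(probs, flag_rate):
    """Cutoff such that roughly `flag_rate` of customers score at or above it."""
    return float(np.quantile(probs, 1.0 - flag_rate))


# Synthetic data: true churn probabilities and deliberately inflated scores
rng = np.random.default_rng(0)
true_p = rng.uniform(0.02, 0.6, size=2000)
labels = rng.binomial(1, true_p)
raw = np.sqrt(true_p)  # over-confident: sqrt(p) > p for 0 < p < 1

# Fit calibrators on one half, apply them to the other half
fit_idx, apply_idx = np.arange(1000), np.arange(1000, 2000)
platt = fit_platt(raw[fit_idx], labels[fit_idx])
isotonic = fit_isotonic(raw[fit_idx], labels[fit_idx])
platt_probs = platt(raw[apply_idx])
iso_probs = isotonic(raw[apply_idx])

# Choose the cutoff that flags about 10% of customers as likely churners
cutoff = threshold_for_flag_rate(platt_probs, 0.10)
flagged = platt_probs >= cutoff
print(f"cutoff={cutoff:.3f}, share flagged={flagged.mean():.1%}")
```

Calibrators should be fitted on data the classifier was not trained on, otherwise they learn from over-optimistic scores. Note that changing the threshold changes which customers are flagged, not the model itself, so it can be adjusted in a dashboard without retraining.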