<a href="https://www.kaggle.com/code/aggarwalbhavya/credit-card-customer-churn-and-clustering?scriptVersionId=259344424" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

- ### Background
    This data set contains **credit card customer data** from a fictional bank. It simulates real-world behavior with features related to demographics, credit usage, and account activity.
    
    This notebook explores the data, visualises it to draw out meaningful insights, and predicts whether a customer will churn based on their profile and activity.

- ### Key Features
| Feature Name               | Data Type | Category    | Description                                           |
| -------------------------- | --------- | ----------- | ----------------------------------------------------- |
| `CLIENTNUM`                | int64     | Identifier  | Unique customer ID (not used for modeling)            |
| `Attrition_Flag`           | object    | Target      | Churn status: 🟢 Existing or 🔴 Attrited              |
| `Customer_Age`             | int64     | Numerical   | Age of the customer                                   |
| `Gender`                   | object    | Categorical | Customer gender                                       |
| `Dependent_count`          | int64     | Numerical   | Number of dependents                                  |
| `Education_Level`          | object    | Categorical | Education level (High School, Graduate, etc.)         |
| `Marital_Status`           | object    | Categorical | Marital status (Married, Single, etc.)                |
| `Income_Category`          | object    | Categorical | Income bracket (Less than \$40K, \$40K - \$60K, etc.) |
| `Card_Category`            | object    | Categorical | Credit card type (Blue, Silver, Gold, Platinum)       |
| `Months_on_book`           | int64     | Numerical   | Tenure with the bank (in months)                      |
| `Total_Relationship_Count` | int64     | Numerical   | Total number of bank products held                    |
| `Months_Inactive_12_mon`   | int64     | Numerical   | Inactive months in the past 12 months                 |
| `Contacts_Count_12_mon`    | int64     | Numerical   | Customer service contacts in the past 12 months       |
| `Credit_Limit`             | float64   | Numerical   | Credit card limit                                     |
| `Total_Revolving_Bal`      | int64     | Numerical   | Revolving balance on the card                         |
| `Avg_Open_To_Buy`          | float64   | Numerical   | Average available credit                              |
| `Total_Trans_Amt`          | int64     | Numerical   | Total transaction amount in last 12 months            |
| `Total_Trans_Ct`           | int64     | Numerical   | Total transaction count in last 12 months             |
| `Total_Ct_Chng_Q4_Q1`      | float64   | Numerical   | Change in transaction count Q4 vs Q1                  |
| `Total_Amt_Chng_Q4_Q1`     | float64   | Numerical   | Change in transaction amount Q4 vs Q1                 |
| `Avg_Utilization_Ratio`    | float64   | Numerical   | Average card utilization rate                         |


- ### Input

  * `BankChurners.csv`: Main dataset including both input features and target variable.

- ### Project Objective
    
    The goal of this notebook is to **analyze customer behavior and predict churn**, supporting business decisions like:
    
    * Targeted retention strategies
    * Personalized offers for at-risk customers
    * Reducing customer attrition

- ### Key Steps

    * **Exploratory Data Analysis (EDA):** <br>
      Understand patterns in customer behavior and churn.
    
    * **Feature Engineering:**
      Encode categorical variables, scale numerical features, and create meaningful derived variables (e.g. utilization ratios, transaction trends).
    
    * **Modeling:**
      Apply various classifiers like:
    
      * Logistic Regression
      * Random Forest
      * XGBoost
      * LightGBM
      * MLPClassifier
  

- ### Evaluation Framework

  * Use **Stratified Cross-Validation**
  * Assess using:

    * Accuracy
    * Precision
    * Recall
    * F1-score
    * ROC-AUC
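
The evaluation framework above could be wired up roughly as follows. This is a minimal sketch, assuming a synthetic imbalanced dataset from `make_classification` as a stand-in for the churn data and a logistic regression as a placeholder model; the fold count, seed, and the names `X_demo`, `y_demo`, and `cv_results` are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (roughly 15% positives); not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv_results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)

# Report mean and standard deviation of each metric across folds.
for metric in scoring:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric:>9}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training, which matters when comparing several classifiers on the same folds.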

# Import Libraries

In [None]:
# Core data manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import shap

import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')

# Statistical functions
from scipy.stats import skew

# Display utilities for Jupyter notebook
from IPython.display import display

# Machine Learning pre-processing and modeling
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import optuna 
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Metrics
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, confusion_matrix, precision_recall_curve, auc, f1_score as f1

# Statistical
from scipy.stats import chi2_contingency
from scipy.stats import shapiro, probplot
from scipy.stats import mannwhitneyu
from scipy.stats import levene
from scipy.stats import ttest_ind
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import kruskal
from scipy.stats import anderson
from scipy.stats import normaltest
!pip install scikit-posthocs
import scikit_posthocs as sp

# Suppress warnings for clearer output
import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", 500)
pd.set_option("max_colwidth", None)

# Load Data

In [None]:
# Loading the Data set
df_customer_churn = pd.read_csv("/kaggle/input/credit-card-customers/BankChurners.csv")

# Verify shapes
print("Data Shape: ", df_customer_churn.shape)

# Data Preview and Information

In [None]:
# Displaying a few rows of the data set
print('Data Preview: ')
display(df_customer_churn.head())

# Insights from Initial Data Exploration

- **Data set size and structure:**
  The data set contains **10127** samples with **23** columns, including the target variable `Attrition_Flag`.

- **Feature Overview:**
  - **Numerical features:** `Customer_Age`, `Dependent_count`, `Months_on_book`, `Total_Relationship_Count`, `Months_Inactive_12_mon`, `Contacts_Count_12_mon`, `Credit_Limit`, `Total_Revolving_Bal`, `Avg_Open_To_Buy`, `Total_Trans_Amt`, `Total_Trans_Ct`, `Total_Ct_Chng_Q4_Q1`, `Total_Amt_Chng_Q4_Q1`, and `Avg_Utilization_Ratio`.
  - **Categorical features:** `Attrition_Flag`, `Gender`, `Education_Level`, `Marital_Status`, `Income_Category`, `Card_Category`.
  - The target variable `Attrition_Flag` is stored as an **object** (string) column.

- **Data Completeness:**
   - The dataset has **no missing values** (no NaN entries), so no imputation is needed at this stage.
   - Data types are appropriate: numerical features are float64 or int64, and categorical features are objects (strings).
   - The columns `Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1` and `Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2` are not meaningful for analysis, so we can drop them.

In [None]:
# Drop the two leftover Naive Bayes classifier output columns in a single call
df_customer_churn.drop(
    columns=[
        "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
        "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2",
    ],
    inplace=True,
)

In [None]:
# Strip stray leading/trailing whitespace from column names
df_customer_churn.columns = (
    df_customer_churn.columns
    .str.strip()
)

In [None]:
cat_features = ["Attrition_Flag", "Gender", "Education_Level", "Marital_Status", "Income_Category", "Card_Category", "Dependent_count",
                "Months_Inactive_12_mon", "Contacts_Count_12_mon", "Total_Relationship_Count"]

num_features = ["Customer_Age", "Credit_Limit", "Avg_Open_To_Buy", "Total_Trans_Amt", "Months_on_book", 
                "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1", "Total_Amt_Chng_Q4_Q1", "Avg_Utilization_Ratio", "Total_Revolving_Bal"]

def convert_cat(df, cat_features=cat_features):
    """Cast the listed columns to pandas 'category' dtype in place, skipping absent ones."""
    for feature in cat_features:
        if feature in df.columns:
            df[feature] = df[feature].astype('category')

convert_cat(df_customer_churn)

In [None]:
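# Downcast numeric columns to smaller dtypes to save memory.
# astype() with a dict only touches the listed columns; reassigning the whole
# frame avoids version-dependent behavior when assigning back into a column subset.
df_customer_churn = df_customer_churn.astype({
    "Customer_Age": "int8",
    "Credit_Limit": "float32",
    "Avg_Open_To_Buy": "float32",
    "Total_Trans_Amt": "int32",
    "Months_on_book": "int8",
    "Total_Trans_Ct": "int16",
    "Total_Ct_Chng_Q4_Q1": "float32",
    "Total_Amt_Chng_Q4_Q1": "float32",
    "Avg_Utilization_Ratio": "float32",
    "Total_Revolving_Bal": "int16"
})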
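
Downcasting to narrow integer types silently wraps around if a value does not fit, so it can be worth checking ranges first. The sketch below shows one way to do that with `np.iinfo`, assuming a small hypothetical frame `df_demo` in place of the real data; the helper `fits_in_dtype` is an illustrative name, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for two of the integer columns.
df_demo = pd.DataFrame({
    "Customer_Age": [26, 46, 73],
    "Total_Trans_Amt": [510, 3899, 18484],
})

target_dtypes = {"Customer_Age": "int8", "Total_Trans_Amt": "int32"}

def fits_in_dtype(series: pd.Series, dtype: str) -> bool:
    """Return True if every value of an integer series fits in `dtype`."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Check every column first, then cast in one step.
for col, dtype in target_dtypes.items():
    assert fits_in_dtype(df_demo[col], dtype), f"{col} overflows {dtype}"

df_small = df_demo.astype(target_dtypes)
print(df_small.dtypes)

# Compare memory footprints before and after the cast.
before = df_demo.memory_usage(deep=True).sum()
after = df_small.memory_usage(deep=True).sum()
print(f"{before} -> {after} bytes")
```

The same check applies to `int16` columns; float columns would need `np.finfo` and a decision about acceptable precision loss instead.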

In [None]:
# Information on the data frame
print('Data Information: ')
df_customer_churn.info()  # info() prints directly and returns None, so display() is not needed

In [None]:
# Data Description
print('Data Description: ')
cm = sns.light_palette('green', as_cmap=True)
display(df_customer_churn.drop(columns='CLIENTNUM').describe().T.style.background_gradient(cmap=cm))

## Descriptive Insights – Numerical Features

**1. Demographics & Tenure**

* **Customer Age**

  * Mean: **46.3 years** | Range: 26–73
  * 25–75%: **41–52** → Primarily middle-aged adults
  * Std: 8.0 → Moderate spread around the mean

* **Months on Book** (Tenure)

  * Mean: **35.9 months** (\~3 years)
  * 25–75%: **31–40** → The middle 50% of customers have stayed roughly 2.5–3.5 years
  * Std: 8.0 → Moderate spread

**2. Credit Behavior**

* **Credit Limit**

  * Mean: **\$8,632** | Max: \$34,516
  * 25–75%: **\$4,549–\$11,068**
  * Std: **\$9,089** → Highly dispersed, potential right-skewness

* **Avg Open to Buy**

  * Mean: **\$7,469** (closely follows `Credit_Limit`)
  * Std: **\$9,091**, nearly identical to that of `Credit_Limit` (\$9,089) → Suggests the two features are strongly related

* **Avg Utilization Ratio**

  * Mean: **27.5%**
  * 25–75%: **2.3%–50.3%** | Max: \~**99.9%**
  * → Some customers nearly max out their limits → Potential risk or loyal high spenders

* **Total Revolving Balance**

  * Mean: **\$1,162**
  * 25–75%: **\$359–\$1,784**
  * Std: **\$815** → Wide variability, warrants skewness check
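
The suspected dependency between `Credit_Limit` and `Avg_Open_To_Buy` can be tested directly rather than inferred from similar standard deviations. The sketch below is one way to do it, assuming a handful of hypothetical rows in `df_demo` and assuming, as a hypothesis to verify on the real data, that open-to-buy equals the credit limit minus the revolving balance.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook itself would use df_customer_churn.
df_demo = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0],
    "Total_Revolving_Bal": [777, 864, 0, 2517],
    "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0],
})

# Pearson correlation between the two credit features.
corr = df_demo["Credit_Limit"].corr(df_demo["Avg_Open_To_Buy"])

# Hypothesis to test: open-to-buy = credit limit - revolving balance.
residual = (
    df_demo["Credit_Limit"]
    - df_demo["Total_Revolving_Bal"]
    - df_demo["Avg_Open_To_Buy"]
)

print(f"correlation: {corr:.3f}")
print(f"max |residual|: {residual.abs().max():.2f}")
```

If the residual is (near) zero on the real data, one of the three columns is redundant, which is worth knowing before fitting linear models that are sensitive to collinearity.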

**3. Transaction Behavior**

* **Total Transaction Amount**

  * Mean: **\$4,404** | Max: \$18,484
  * 25–75%: **\$2,156–\$4,741**
  * Std: **\$3,397** → Outliers likely among high spenders

* **Total Transaction Count**

  * Mean: **\~65 transactions/year**
  * 25–75%: **45–81 transactions**
  * Std: 23.5 → Ranges from light to highly active users

**4. Behavioral Changes**

* **Total Amt Change Q4/Q1**

  * Mean: **0.76** | Max: **3.40**
  * 25–75%: **0.63–0.86**
  * → Some customers drastically increased spending in Q4 → May signal churn or upsell opportunity

* **Total Ct Change Q4/Q1**

  * Mean: **0.71**
  * Std: 0.24 → Frequency shifts may indicate behavioral trends

**Summary**

* **Credit and transaction features show high variance** → Consider scaling or transformation
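* **Age and tenure have a mean close to the midpoint of their quartiles and a moderate spread**, suggesting roughly symmetric distributions → Likely need less transformation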
* Features like `Total_Amt_Chng_Q4_Q1`, `Avg_Utilization_Ratio`, and `Total_Revolving_Bal` exhibit **strong financial behavior patterns** → Valuable churn predictors.
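
The "scaling or transformation" suggestion in the summary above can be sketched as follows. This is an illustration only, assuming a synthetic log-normal series as a stand-in for a right-skewed feature such as `Credit_Limit`; the choice of `log1p` followed by `RobustScaler` is one reasonable option, not necessarily what the rest of this notebook applies.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import RobustScaler

# Synthetic right-skewed amounts (log-normal); not drawn from the real dataset.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=8.5, sigma=0.9, size=2000), name="amount")

skew_raw = skew(amounts)

# log1p compresses the long right tail; it requires non-negative values.
amounts_log = np.log1p(amounts)
skew_log = skew(amounts_log)

# RobustScaler centers on the median and scales by the IQR, so outliers matter less.
scaled = RobustScaler().fit_transform(amounts_log.to_frame())

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

Tree-based models such as Random Forest or XGBoost are largely insensitive to monotonic transformations, so this step mainly benefits Logistic Regression and `MLPClassifier`.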

In [None]:
# Data description for categorical features
print('Data Description (categorical): ')
display(df_customer_churn.describe(include=['category', 'object']).T)

## Descriptive Insights - Categorical Features

`Attrition_Flag` *(Target Variable)*

* Two classes:
  * `Existing Customer` – 85.0%
  * `Attrited Customer` – 15.0%
* **Observation:** Highly imbalanced → Resampling or class weighting needed
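
A minimal sketch of checking the class balance and deriving class weights, assuming a hypothetical target series `y_demo` built with the 85/15 split quoted above; on the real data the same calls would be applied to `df_customer_churn["Attrition_Flag"]`.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target column with an 85/15 split, mirroring the labels above.
y_demo = pd.Series(["Existing Customer"] * 85 + ["Attrited Customer"] * 15)

# Class shares as fractions of all rows.
print(y_demo.value_counts(normalize=True))

classes = np.unique(y_demo)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_demo.to_numpy()
)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which computes the same weights internally; the explicit dictionary is useful when a library expects the weights to be passed by hand.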

`Gender`

* Two values: `F`, `M`
* Majority: Female (≈ 52.9%)
* **Insight:** Balanced distribution → Can be used in segmentation or churn analysis

`Dependent_count`

* 6 unique values (0–5)
* Most frequent: 3 dependents (2,732 customers)
* **Insight:** Represents family responsibility → Consider ordinal treatment

`Education_Level`

* 7 levels, top category: `Graduate` (3,128)
* **Insight:** Relevant for income segmentation and credit behavior analysis

`Marital_Status`

* 4 values, most common: `Married` (4,687)
* **Insight:** May impact spending behavior → Check churn across groups

`Income_Category`

* 6 income brackets
* Most frequent: `Less than $40K` (3,561)
* **Insight:** Skewed toward lower-income segment → May affect product usage

`Card_Category`

* 4 categories, heavily skewed toward `Blue` (≈ 93%)
* **Insight:** Highly imbalanced → Consider grouping or frequency encoding

`Total_Relationship_Count`

* 6 values, most common: 3 (2,305 customers)
* **Insight:** Reflects product engagement → Key indicator of customer loyalty

`Months_Inactive_12_mon`

* 7 values, peak at 3 months (3,846 customers)
* **Insight:** Strong behavioral feature → Linked to potential churn risk

`Contacts_Count_12_mon`

* 7 levels, most common: 3 contacts/year (3,380)
* **Insight:** Indicates engagement with bank → Analyze correlation with churn

**Summary of Key Observations**

| Key Finding                                        | Suggested Next Step                             |
| -------------------------------------------------- | ----------------------------------------------- |
| Target is imbalanced (85% vs. 15%)                 | Apply resampling or class weights               |
| Card type is skewed toward `Blue`                  | Group rare categories or use frequency encoding |
| Behavioral features show clear usage patterns      | Leverage for churn prediction                   |
| Some features are ordinal (`Income`, `Dependents`) | Consider ordinal encoding or binning            |
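
Two of the next steps in the table above, ordinal encoding and frequency encoding, can be sketched with plain pandas. The example assumes a tiny hypothetical frame `df_demo`; the income bracket labels and their order in `income_order` are assumptions for illustration and should be replaced with the full set of labels found in the data.

```python
import pandas as pd

# Hypothetical sample rows; the bracket labels are assumed for illustration.
df_demo = pd.DataFrame({
    "Income_Category": ["Less than $40K", "$40K - $60K", "$60K - $80K", "Less than $40K"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue"],
})

# Ordinal encoding: declare the order explicitly, then take the integer codes.
income_order = ["Less than $40K", "$40K - $60K", "$60K - $80K"]
income_dtype = pd.CategoricalDtype(categories=income_order, ordered=True)
df_demo["Income_Ordinal"] = df_demo["Income_Category"].astype(income_dtype).cat.codes

# Frequency encoding: replace each category with its share of rows.
card_freq = df_demo["Card_Category"].value_counts(normalize=True)
df_demo["Card_Freq"] = df_demo["Card_Category"].map(card_freq)

print(df_demo)
```

Any label missing from `income_order` receives the code `-1`, so it is worth asserting that no `-1` values remain after encoding the real column.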

# Data Quality Checks

## Missing Values

In [None]:
def displayNULL (df, dataset_name = None, style = 1):
    if style == 1 and dataset_name is None:
        for column in df.columns:
            if df[column].isna().sum() > 0:
                