# Credit Card Customer Churn Analysis: Hypothesis Testing

## Key deliverable: assess which customers are likely to churn using customer
## data and develop strategies to improve retention. My role was to perform
## hypothesis testing to support the visualisation work and the customer
## retention strategy.


## We considered several hypotheses and retained the three we judged most
## likely to yield useful insights.

### Hypothesis 1: Card_Category affects churn probability
### Test: Chi-square test of independence
### Explanation: Does card type impact churn?

### Hypothesis 2: Credit_Limit differs significantly between churners and
### non-churners
### Test: Welch's t-test (two-tailed, unequal variances)
### Explanation: Does available credit play a role?

### Hypothesis 3: Avg_Utilization_Ratio is significantly lower for churners
### Test: Welch's t-test (one-tailed)
### Explanation: Are churners using less of their credit limit?

In [12]:
import pandas as pd
import numpy as np

In [13]:
import matplotlib.pyplot as plt

In [14]:
import seaborn as sns

### For the analysis, I used a cleaned CSV file from which rows with unknown
### values in key columns had been removed.
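
The cleaning step itself is not shown in this notebook. A minimal sketch of one way it could be done, assuming the unknowns are stored as the literal string "Unknown" and that the key columns are Education_Level, Marital_Status and Income_Category (both are assumptions, and the toy frame below is invented purely for illustration):

```python
import pandas as pd

# Toy frame for illustration only; NOT the notebook's data.
raw = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Education_Level": ["Graduate", "Unknown", "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", "Unknown", "Single"],
    "Income_Category": ["$60K - $80K", "Less Than $40K", "$40K - $60K", "Unknown"],
})

# Assumed "key columns"; adjust to whichever columns the cleaning actually targeted.
key_cols = ["Education_Level", "Marital_Status", "Income_Category"]

# Flag rows where any key column holds the placeholder string, then drop them.
has_unknown = raw[key_cols].eq("Unknown").any(axis=1)
cleaned = raw.loc[~has_unknown].reset_index(drop=True)

print(f"Dropped {has_unknown.sum()} of {len(raw)} rows")
```

Reporting how many rows were dropped makes it easier to judge whether the removal could bias the churn comparison.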

In [28]:
import pandas as pd
df=pd.read_csv('../jupyter_notebooks/input/cleaned_no_unknown.csv')
df

Unnamed: 0,Customer_ID,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Tenure_Months,...,Credit_Limit,Total_Revolving_Bal,Available_Credit,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,NB_Stay_Probability,NB_Churn_Probability
0,768805383,Existing Customer,45.0,M,3,High School,Married,$60K - $80K,Blue,39.0,...,12691.0,777,11914.0,1.201,1144.00,42,1.172,0.061,0.000093,0.99991
1,818770008,Existing Customer,49.0,F,5,Graduate,Single,Less Than $40K,Blue,44.0,...,8256.0,864,7392.0,1.201,1291.00,33,1.172,0.105,0.000057,0.99994
2,713982108,Existing Customer,51.0,M,3,Graduate,Married,$80K - $120K,Blue,36.0,...,3418.0,0,3418.0,1.201,1887.00,20,1.172,0.000,0.000021,0.99998
3,709106358,Existing Customer,40.0,M,3,Uneducated,Married,$60K - $80K,Blue,21.0,...,4716.0,0,4716.0,1.201,816.00,28,1.172,0.000,0.000022,0.99998
4,713061558,Existing Customer,44.0,M,2,Graduate,Married,$40K - $60K,Blue,36.0,...,4010.0,1247,2763.0,1.201,1088.00,24,0.846,0.311,0.000055,0.99994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7076,710841183,Existing Customer,54.0,M,1,High School,Single,$60K - $80K,Blue,34.0,...,13940.0,2109,11831.0,0.660,8619.25,114,0.754,0.151,0.000038,0.99996
7077,713899383,Existing Customer,56.0,F,1,Graduate,Single,Less Than $40K,Blue,50.0,...,3688.0,606,3082.0,0.570,8619.25,120,0.791,0.164,0.000148,0.99985
7078,772366833,Existing Customer,50.0,M,2,Graduate,Single,$40K - $60K,Blue,40.0,...,4003.0,1851,2152.0,0.703,8619.25,117,0.857,0.462,0.000191,0.99981
7079,716506083,Attrited Customer,44.0,F,1,High School,Married,Less Than $40K,Blue,36.0,...,5409.0,0,5409.0,0.819,8619.25,60,0.818,0.000,0.000695,0.99930


In [31]:
import scipy.stats as stats

In [33]:
# Prepare data
df_clean = df[["Attrition_Flag", "Card_Category"]].copy()
df_clean["Churn"] = (df_clean["Attrition_Flag"] == "Attrited Customer").astype(int)

# H1: Card_Category affects churn probability — Chi-squared test
contingency_table = pd.crosstab(df_clean["Card_Category"], df_clean["Churn"])
chi2, p_h1, dof, expected = stats.chi2_contingency(contingency_table)

# Print results
print("H1: Card_Category affects churn probability")
print("Chi² Statistic:", chi2)
print("p-value:", p_h1)
print("Significant at 0.05 level:", p_h1 < 0.05)

H1: Card_Category affects churn probability
Chi² Statistic: 1.1720705064793
p-value: 0.7597104531338725
Significant at 0.05 level: False
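
A non-significant chi-square result is easier to trust when the test's assumptions have been checked. The sketch below shows one way to inspect the expected cell counts returned by `chi2_contingency` and to compute Cramér's V as an effect size. The counts in the table are invented for illustration and are not the notebook's data; the threshold of 5 is the usual rule of thumb, not something stated in this analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative counts only (NOT the notebook's data):
# rows = card category, columns = Churn flag (0 = stayed, 1 = churned).
contingency_table = pd.DataFrame(
    {0: [5500, 330, 70, 12], 1: [1050, 60, 15, 3]},
    index=["Blue", "Silver", "Gold", "Platinum"],
)

chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

# Rule of thumb: the chi-square approximation is unreliable when
# expected counts fall below about 5 in any cell.
min_expected = expected.min()
n_small = int((expected < 5).sum())

# Cramér's V: effect size in [0, 1] for an r x c contingency table.
n = contingency_table.to_numpy().sum()
k = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print("Smallest expected count:", round(min_expected, 2))
print("Cells with expected count < 5:", n_small)
print("Cramér's V:", round(cramers_v, 4))
```

If any expected count is small, merging rare card categories before testing is one common option.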



In [42]:
# Select relevant columns and create churn flag
df_clean = df[["Attrition_Flag", "Card_Category", "Credit_Limit", "Avg_Utilization_Ratio"]].copy()
df_clean["Churn"] = (df_clean["Attrition_Flag"] == "Attrited Customer").astype(int)

# H2: Credit_Limit differs between churners and non-churners (Independent T-test)
credit_limit_churn = df_clean[df_clean["Churn"] == 1]["Credit_Limit"]
credit_limit_no_churn = df_clean[df_clean["Churn"] == 0]["Credit_Limit"]
t_h2, p_h2 = stats.ttest_ind(credit_limit_churn, credit_limit_no_churn, equal_var=False)
print("H2: Credit_Limit differs between churners and non-churners")
print("t-Stat:", t_h2)
print("p-value:", p_h2)
print("Significant:", p_h2 < 0.05)
print("\n")

H2: Credit_Limit differs between churners and non-churners
t-Stat: -1.8130829511943392
p-value: 0.07001068934718169
Significant: False
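
With a p-value close to 0.05, it can be worth checking whether the conclusion depends on the choice of test. If Credit_Limit is strongly right-skewed (not checked in this notebook), a rank-based Mann-Whitney U test is a common companion to Welch's t-test. The sketch below runs both on synthetic lognormal data; the samples and their parameters are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "credit limits" for illustration only.
rng = np.random.default_rng(0)
limit_churn = rng.lognormal(mean=8.8, sigma=0.9, size=400)
limit_stay = rng.lognormal(mean=8.9, sigma=0.9, size=1600)

# Skewness well above 0 suggests the mean may be driven by a few large values.
skewness = stats.skew(limit_stay)

# Welch's t-test compares means; Mann-Whitney U compares rank distributions.
t_stat, p_welch = stats.ttest_ind(limit_churn, limit_stay, equal_var=False)
u_stat, p_mw = stats.mannwhitneyu(limit_churn, limit_stay, alternative="two-sided")

print("Skewness (non-churners):", round(skewness, 2))
print("Welch p-value:", p_welch)
print("Mann-Whitney p-value:", p_mw)
```

If the two tests disagree on real data, that usually points to outliers or skew rather than a clear difference in typical credit limits.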





In [44]:
# Select relevant columns and create churn flag
df_clean = df[["Attrition_Flag", "Card_Category", "Credit_Limit", "Avg_Utilization_Ratio"]].copy()
df_clean["Churn"] = (df_clean["Attrition_Flag"] == "Attrited Customer").astype(int)

# H3: Avg_Utilization_Ratio is significantly lower for churners (One-tailed T-test)
util_ratio_churn = df_clean[df_clean["Churn"] == 1]["Avg_Utilization_Ratio"]
util_ratio_no_churn = df_clean[df_clean["Churn"] == 0]["Avg_Utilization_Ratio"]
t_h3, p_h3_two_tailed = stats.ttest_ind(util_ratio_churn, util_ratio_no_churn, equal_var=False)
p_h3 = p_h3_two_tailed / 2 if t_h3 < 0 else 1 - (p_h3_two_tailed / 2)
print("H3: Avg_Utilization_Ratio is significantly lower for churners")
print("t-Stat:", t_h3)
print("One-tailed p-value:", p_h3)
print("Significant:", p_h3 < 0.05)

H3: Avg_Utilization_Ratio is significantly lower for churners
t-Stat: -15.99231734545571
One-tailed p-value: 9.136424529223674e-54
Significant: True
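
With several thousand rows, even a small difference can produce a tiny p-value, so an effect size helps show how large the gap in utilisation is. The sketch below computes Cohen's d and also shows that, assuming SciPy 1.6 or later, `alternative="less"` gives the same one-tailed p-value as the manual halving used above. The data are synthetic beta-distributed ratios chosen for illustration, not the notebook's data, and `cohens_d` is a helper defined here rather than a SciPy function.

```python
import numpy as np
from scipy import stats

# Synthetic utilisation ratios in [0, 1] for illustration only.
rng = np.random.default_rng(42)
util_churn = rng.beta(1, 6, size=500)
util_stay = rng.beta(2, 5, size=2000)

# Manual one-tailed p-value from the two-tailed Welch test (as in the cell above).
t_stat, p_two = stats.ttest_ind(util_churn, util_stay, equal_var=False)
p_manual = p_two / 2 if t_stat < 0 else 1 - p_two / 2

# Same test using SciPy's built-in one-sided alternative (SciPy >= 1.6).
_, p_less = stats.ttest_ind(util_churn, util_stay, equal_var=False, alternative="less")


def cohens_d(a, b):
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


d = cohens_d(util_churn, util_stay)

print("Manual one-tailed p:", p_manual)
print("alternative='less' p:", p_less)
print("Cohen's d:", round(d, 3))
```

A negative d means the first group (churners here) has the lower mean; by the usual convention, magnitudes near 0.2, 0.5 and 0.8 are read as small, medium and large.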
