# **Bank Customer Churn Prediction**

# **1. Introduction**

## **About dataset:**
The bank customer churn dataset is commonly used for predicting customer churn in the banking industry. It contains information on bank customers and records whether each one left the bank or stayed.

Dataset: [Binary Classification with a Bank Churn Dataset | Kaggle](https://www.kaggle.com/competitions/playground-series-s4e1/data)

## **Column Descriptions:**
- Customer ID: A unique identifier for each customer

- Surname: The customer's surname or last name

- Geography: The country where the customer resides (France, Spain or Germany)

- Credit Score: A numerical value representing the customer's credit score

- Gender: The customer's gender (Male or Female)

- Age: The customer's age.

- Tenure: The number of years the customer has been with the bank

- Balance: The customer's account balance

- NumOfProducts: The number of bank products the customer uses (e.g., savings account, credit card)

- HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)

- IsActiveMember: Whether the customer is an active member (1 = yes, 0 = no)

- EstimatedSalary: The estimated salary of the customer

- Exited: Whether the customer has churned (1 = yes, 0 = no)

# **2. Import Libraries**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import plotly.graph_objects as go
from scipy.stats import norm
import seaborn as sns
import matplotlib.pyplot as plt

# **3. Import Dataset**

In [None]:
# load dataset
path="/content/drive/MyDrive/Projects/data visua/dataset/bankchurn.csv"
df = pd.read_csv(path)
df.head(5)

In [None]:
df.shape

# **4. Remove Null and Duplicate Values**

In [None]:
# Remove null values (the count is reported per column)
print(f"Null values per column:\n{df.isnull().sum()}")
df.dropna(inplace=True)

# Remove duplicate values
print(f"Number of Duplicate values: {df.duplicated().sum()}")
df.drop_duplicates(inplace=True)

df.reset_index(drop=True, inplace=True)

# **5. Exploratory Data Analysis (EDA)**

## 5.1. Remove Surname

In [None]:
# Count unique surnames
unique_sur = df["Surname"].nunique()
print(f"Total number of rows: {df.shape[0]}")
print(f"Number of unique surnames: {unique_sur}")

In [None]:
# count each value of Surname
count_values = df["Surname"].value_counts()
count_values

In [None]:
# min max average num of surname
min_value = count_values.min()
max_value = count_values.max()
avge = count_values.mean()
print(f"Maximum: {max_value}")
print(f"Minimum: {min_value}")
print(f"Average: {avge}")

In [None]:
# Word cloud
from wordcloud import WordCloud

# Concatenate all surnames into a single string
surnames_text = ' '.join(df['Surname'])

# Generate the word cloud
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = None,
                min_font_size = 10).generate(surnames_text)

# Plot the word cloud
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()
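
The cells above only inspect `Surname`; none of them actually drops the column. If the intent of this subsection is to remove it, a minimal sketch could look like the following. The toy frame and the names `toy`, `cardinality_ratio` and `toy_no_surname` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    "Surname": ["Smith", "Okafor", "Smith", "Chen"],
    "Age": [42, 35, 51, 29],
    "Exited": [1, 0, 0, 1],
})

# Share of distinct surnames among all rows
cardinality_ratio = toy["Surname"].nunique() / len(toy)

# errors="ignore" keeps the cell re-runnable after the column is already gone
toy_no_surname = toy.drop(columns=["Surname"], errors="ignore")

print(cardinality_ratio)
print(list(toy_no_surname.columns))
```

On the real data the same call would be `df = df.drop(columns=["Surname"], errors="ignore")`.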

## 5.2. Customer Churn Distribution

In [None]:
# map color
map_color_exited={
        0: "deepskyblue",   # Exited = 0
        1: "red"            # Exited = 1
        }

In [None]:
# 5.2. Customer Churn Distribution

# Count customers per Exited value
churn_counts = df["Exited"].value_counts().reset_index()
churn_counts.columns = ["Exited", "Count"]

# Convert Exited to string so the pie slices are treated as categories
churn_counts["Exited"] = churn_counts["Exited"].astype(str)

# String-keyed color map for the pie chart only; a separate name keeps the
# integer-keyed map_color_exited intact for the later histograms
pie_color_map = {"0": "deepskyblue", "1": "red"}

# Pie chart
fig = px.pie(
    churn_counts,
    names="Exited",
    values="Count",
    color="Exited",
    color_discrete_map=pie_color_map,
    title="Customer Churn Distribution"
)

fig.update_traces(textinfo='percent+label+value')
fig.show()

- 130,113 customers (78.8%) stayed with the bank.
- 34,921 customers (21.2%) churned (left the bank).

The churn rate is moderately low (~21%), which suggests the bank retains most of its customers.

However, more than 1 in 5 customers leaving still poses a risk, especially since acquiring new customers is usually more expensive than retaining current ones.

## 5.3. Churn Distribution by Gender

In [None]:
# 5.3. Churn Distribution by Gender

def plot_churn_by_gender(df, mapcolor):
    # Total and churn
    n_female = df[df["Gender"] == "Female"].shape[0]
    n_male = df[df["Gender"] == "Male"].shape[0]
    total = n_female + n_male

    female_churn = df[(df["Gender"] == "Female") & (df["Exited"] == 1)].shape[0]
    male_churn = df[(df["Gender"] == "Male") & (df["Exited"] == 1)].shape[0]

    # get percent
    pct_female_total = round((n_female / total) * 100, 2)
    pct_male_total = round((n_male / total) * 100, 2)

    pct_female_churn = round((female_churn / n_female) * 100, 2)
    pct_male_churn = round((male_churn / n_male) * 100, 2)


    map_color = mapcolor

    # plot
    fig = px.histogram(
        df,
        x="Gender",
        color="Exited",
        barmode="group",
        title=f"Churn by Gender | Female: {pct_female_total}% of total, Churn of female: {pct_female_churn}% | Male: {pct_male_total}% of total, Churn of male: {pct_male_churn}%",
        color_discrete_map=map_color,
        category_orders={"Gender": ["Female", "Male"]},
        text_auto=True
    )

    fig.update_layout(
        xaxis_title="Gender",
        yaxis_title="Number of Customers",
        bargap=0.2
    )

    fig.update_traces(textfont_size=14, textposition='outside')
    fig.show()

In [None]:
plot_churn_by_gender(df, map_color_exited)

There are more male than female customers in the dataset:

- Male: 93,150 (78,334 stayed, 14,816 churned)

- Female: 71,884 (51,779 stayed, 20,105 churned)

Churn Rate by Gender:

- Female: ~28%

- Male: ~15.9%

Female customers churn significantly more than male customers.
This may suggest a gap in service satisfaction or product relevance for female users, which the bank should investigate further.
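
The counts and rates quoted above can be derived from the data rather than read off the chart. The sketch below shows one way to do that with a named aggregation; the toy frame and the names `toy` and `summary` are assumptions for illustration, and the same `groupby` would be applied to the notebook's `df`.

```python
import pandas as pd

# Toy data standing in for df[["Gender", "Exited"]] (made-up values)
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male", "Male"],
    "Exited": [1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Because Exited is 0/1, sum = churned customers, count = group size, mean = churn rate
summary = toy.groupby("Gender")["Exited"].agg(
    churned="sum", total="count", churn_rate="mean"
)
summary["churn_rate_pct"] = (summary["churn_rate"] * 100).round(2)

print(summary)
```

Replacing `"Gender"` with `"Geography"` gives the equivalent per-country summary.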

### Female vs Male

[One-Tailed Two-Proportion Z-Test](https://www.statology.org/two-proportion-z-test/)

$$z = \frac{p_1 - p_2}{\sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

- $p_1$, $p_2$: sample proportions of churn for the female and male groups
- $n_1$, $n_2$: total number of customers in each group
- $p$: pooled proportion

$$p = \frac{p_1 n_1 + p_2 n_2}{n_1 + n_2}$$

The hypotheses:

- Null Hypothesis (H0): p₁ ≤ p₂ (Female churn rate ≤ Male churn rate)

- Alternative Hypothesis (H1): p₁ > p₂ (Female churn rate > Male churn rate)

| Gender | Churned (x) | Total (n) | Churn Rate (p) |
|--------|-------------|-----------|----------------|
| Female | 20,105      | 71,884    | 27.97%         |
| Male   | 14,816      | 93,150    | 15.91%         |
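
As a rough hand check, the numbers in the table can be plugged into the pooled two-proportion formula. The intermediate values are rounded to four significant digits, so the result is approximate:

$$p = \frac{20105 + 14816}{71884 + 93150} = \frac{34921}{165034} \approx 0.2116$$

$$z \approx \frac{0.2797 - 0.1591}{\sqrt{0.2116 \times 0.7884 \times \left(\frac{1}{71884} + \frac{1}{93150}\right)}} \approx \frac{0.1206}{0.002028} \approx 59.5$$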

In [None]:
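# Churned (x) and total (n) for each group
x1, n1 = 20105, 71884  # Female
x2, n2 = 14816, 93150  # Male

p1 = x1 / n1  # Female churn rate
p2 = x2 / n2  # Male churn rate
p = (x1 + x2) / (n1 + n2)  # Pooled proportion

z = round((p1 - p2) / np.sqrt(p * (1 - p) * (1/n1 + 1/n2)), 4)
# norm.sf gives the upper-tail probability directly, without the rounding loss of 1 - norm.cdf(z)
p_value = round(norm.sf(z), 10)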

# Distribution
x_vals = np.linspace(-10, 60, 1000)
y_vals = norm.pdf(x_vals)

fig = go.Figure()

# Normal curve
fig.add_trace(go.Scatter(x=x_vals, y=y_vals, mode='lines', name='Standard Normal Distribution', line=dict(color='blue')))

# Rejection region
fig.add_trace(go.Scatter(
    x=x_vals,
    y=np.where(x_vals >= 1.645, y_vals, None),
    fill='tozeroy',
    mode='none',
    name='Rejection Region (α = 0.05)',
    fillcolor='rgba(255,0,0,0.3)'
))

# Z-score line
fig.add_trace(go.Scatter(
    x=[z, z],
    y=[0, 0.4],  # constant height (just for visualization)
    mode='lines',
    line=dict(color='darkred', dash='dash'),
    name=f'Z = {z}'
))

# Annotation for p-value
fig.add_annotation(
    x=z,
    y=0.03,
    text=f"p-value ≈ {p_value}",
    showarrow=True,
    arrowhead=1,
    ax=-60,
    ay=-30,
    font=dict(color="darkred")
)

# Layout
fig.update_layout(
    title="Interactive One-Tailed Z-Test: Female Churn Rate > Male Churn Rate",
    xaxis_title="Z-score",
    yaxis_title="Probability Density",
    xaxis_range=[ -10, 60],
    showlegend=True,
    width=900,
    height=500
)

fig.show()

- The blue curve represents the standard normal distribution (Z-distribution) with mean = 0 and standard deviation = 1.

- The red shaded area starting from Z = 1.645 is the rejection region for a one-tailed test at significance level α = 0.05.

- The dashed red vertical line at Z = 59.491 marks the observed Z-score of the test.

- The annotation "p-value ≈ 0.0" shows that the probability of getting such a high Z-score under the null hypothesis is virtually 0.

- Since Z ≫ 1.645 and p ≈ 0, we reject H0: the female churn rate is significantly higher than the male churn rate.
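
The p-value prints as 0.0 because the true tail probability is smaller than the smallest positive double, not because it is exactly zero. The sketch below estimates its order of magnitude with the leading term of the normal tail expansion, P(Z > z) ≈ φ(z) / z. This is an approximation that is only reasonable for large positive z, and the helper name `log10_upper_tail` is hypothetical.

```python
import math

def log10_upper_tail(z: float) -> float:
    """Approximate log10 of P(Z > z) for a standard normal and large positive z.

    Uses the leading term of the tail expansion: P(Z > z) ~ pdf(z) / z.
    """
    if z <= 0:
        raise ValueError("approximation is only meant for large positive z")
    # Natural log of pdf(z) / z
    log_p = -0.5 * z * z - math.log(z) - 0.5 * math.log(2.0 * math.pi)
    return log_p / math.log(10.0)

z = 59.491  # observed Z-score from the female vs male test

# Direct evaluation of the upper tail underflows in double precision
naive = 0.5 * math.erfc(z / math.sqrt(2.0))

print(naive)
print(log10_upper_tail(z))
```

With SciPy, `norm.logsf(z)` returns the natural logarithm of the same tail probability and can be used instead of the approximation.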

## 5.4. Churn Distribution by Geography

In [None]:
def plot_churn_by_geography(df, mapcolor):
    # all unique regions
    regions = df["Geography"].unique()
    total = df.shape[0]

    # summary for each region
    summaries = []
    for region in regions:
        n_region = df[df["Geography"] == region].shape[0]
        churn_region = df[(df["Geography"] == region) & (df["Exited"] == 1)].shape[0]
        pct_region_total = round((n_region / total) * 100, 2)
        pct_region_churn = round((churn_region / n_region) * 100, 2)
        summaries.append(f"{region}: {pct_region_total}% of total, Churn of {region}: {pct_region_churn}%")

    # title
    title_text = "Churn by Geography | " + " | ".join(summaries)

    # Plot
    fig = px.histogram(
        df,
        x="Geography",
        color="Exited",
        barmode="group",
        title=title_text,
        color_discrete_map=mapcolor,
        category_orders={"Geography": sorted(regions)},
        text_auto=True
    )

    fig.update_layout(
        xaxis_title="Geography",
        yaxis_title="Number of Customers",
        bargap=0.2
    )

    fig.update_traces(textfont_size=14, textposition='outside')
    fig.show()

In [None]:
plot_churn_by_geography(df, map_color_exited)

| Geography | Churned (x) | Total (n) | Churn Rate (p) |
|-----------|-------------|-----------|----------------|
| France    | 15,572      | 94,215    | 16.53%         |
| Germany   | 13,114      | 34,606    | 37.90%         |
| Spain     | 6,235       | 36,213    | 17.22%         |

### Germany vs France
- H0: p_GER ≤ p_FR
- H1: p_GER > p_FR

In [None]:
# Germany vs France
# churned (X), Total (n)
x1, n1 = 13114, 34606  # Germany
x2, n2 = 15572, 94215  # France

# Proportions
p1 = x1 / n1  # Germany churn rate
p2 = x2 / n2  # France churn rate

# Pooled proportion
p = (x1 + x2) / (n1 + n2)

# Z-score and p-value (norm.sf is the upper-tail probability, equivalent to 1 - norm.cdf)
z = round((p1 - p2) / np.sqrt(p * (1 - p) * (1/n1 + 1/n2)), 4)
p_value = round(norm.sf(z), 10)

# Z-distribution plot
x_vals = np.linspace(-10, 90, 1000)
y_vals = norm.pdf(x_vals)

fig = go.Figure()

# Standard normal curve
fig.add_trace(go.Scatter(x=x_vals, y=y_vals, mode='lines',
                         name='Standard Normal Distribution', line=dict(color='blue')))

# Rejection region (α = 0.05, Z = 1.645)
fig.add_trace(go.Scatter(
    x=x_vals,
    y=np.where(x_vals >= 1.645, y_vals, None),
    fill='tozeroy',
    mode='none',
    name='Rejection Region (α = 0.05)',
    fillcolor='rgba(255,0,0,0.3)'
))

# Z-score line
fig.add_trace(go.Scatter(
    x=[z, z],
    y=[0, 0.4],
    mode='lines',
    line=dict(color='darkred', dash='dash'),
    name=f'Z = {z}'
))

# Annotate p-value
fig.add_annotation(
    x=z,
    y=0.03,
    text=f"p-value ≈ {p_value}",
    showarrow=True,
    arrowhead=1,
    ax=-60,
    ay=-30,
    font=dict(color="darkred")
)

# Layout
fig.update_layout(
    title="Z-Test: Germany Churn Rate > France Churn Rate",
    xaxis_title="Z-score",
    yaxis_title="Probability Density",
    xaxis_range=[-10, 90],
    showlegend=True,
    width=900,
    height=500
)

fig.show()

- The Z-score for Germany vs France is Z = 81.7043

- The red region is the rejection zone for significance level α = 0.05, starting at Z = 1.645

- Because Z ≫ 1.645 and p-value ≈ 0.0, we reject the null hypothesis H₀

-> Germany's churn rate is significantly higher than France's

### Germany vs Spain
- H0: p_GER ≤ p_SP

- H1: p_GER > p_SP

In [None]:
# Germany vs Spain
# churned (X), Total (n)
x1, n1 = 13114, 34606   # Germany
x2, n2 = 6235, 36213    # Spain

# Proportions
p1 = x1 / n1  # Germany churn rate
p2 = x2 / n2  # Spain churn rate

# Pooled proportion
p = (x1 + x2) / (n1 + n2)

# Z-score and p-value (norm.sf is the upper-tail probability, equivalent to 1 - norm.cdf)
z = round((p1 - p2) / np.sqrt(p * (1 - p) * (1/n1 + 1/n2)), 4)
p_value = round(norm.sf(z), 10)

# Z-distribution for plot
x_vals = np.linspace(-10, 90, 1000)
y_vals = norm.pdf(x_vals)

# Create figure
fig = go.Figure()

# Normal curve
fig.add_trace(go.Scatter(x=x_vals, y=y_vals, mode='lines', name='Standard Normal Distribution', line=dict(color='blue')))

# Rejection region (α = 0.05 → Z = 1.645)
fig.add_trace(go.Scatter(
    x=x_vals,
    y=np.where(x_vals >= 1.645, y_vals, None),
    fill='tozeroy',
    mode='none',
    name='Rejection Region (α = 0.05)',
    fillcolor='rgba(255,0,0,0.3)'
))

# Z-score line
fig.add_trace(go.Scatter(
    x=[z, z],
    y=[0, 0.4],
    mode='lines',
    line=dict(color='darkred', dash='dash'),
    name=f'Z = {z}'
))

# Annotate p-value
fig.add_annotation(
    x=z,
    y=0.03,
    text=f"p-value ≈ {p_value}",
    showarrow=True,
    arrowhead=1,
    ax=-60,
    ay=-30,
    font=dict(color="darkred")
)

# Layout
fig.update_layout(
    title="Z-Test: Germany Churn Rate > Spain Churn Rate",
    xaxis_title="Z-score",
    yaxis_title="Probability Density",
    xaxis_range=[-10, 90],
    showlegend=True,
    width=900,
    height=500
)

fig.show()

- The chart shows the standard normal distribution (blue), with the rejection region shaded in red starting at Z = 1.645

- The dashed red line at Z ≈ 61.73 represents the observed Z-score for Germany vs Spain

- The p-value ≈ 0.0 indicates that such an extreme Z-score is virtually impossible under the null hypothesis

- Since Z ≫ 1.645 and p ≈ 0, we reject the null hypothesis

-> Germany's churn rate is significantly higher than Spain's

### Conclusion

Germany has a higher churn rate than both France and Spain, with very high statistical significance (Z > 60, p ≈ 0 in both pairs of tests).
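
The three tests in this section repeat the same computation with different counts. A minimal sketch of a reusable helper is shown below, assuming the same pooled one-tailed test as above. The names `ZTestResult` and `two_proportion_ztest` are hypothetical, and the upper tail is computed with `math.erfc` from the standard library instead of `scipy.stats.norm`.

```python
import math
from typing import NamedTuple

class ZTestResult(NamedTuple):
    z: float
    p_value: float
    reject_h0: bool

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int,
                         alpha: float = 0.05) -> ZTestResult:
    """One-tailed pooled two-proportion z-test of H1: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal: P(Z > z)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return ZTestResult(z, p_value, p_value < alpha)

# (churned, total) of the first group, then of the second group, taken from the tables above
comparisons = {
    "Female vs Male": (20105, 71884, 14816, 93150),
    "Germany vs France": (13114, 34606, 15572, 94215),
    "Germany vs Spain": (13114, 34606, 6235, 36213),
}

results = {name: two_proportion_ztest(*counts) for name, counts in comparisons.items()}
for name, res in results.items():
    print(f"{name}: z = {res.z:.2f}, reject H0 = {res.reject_h0}")
```

Keeping the counts in one dictionary makes it easy to add further comparisons, such as Spain vs France, without copying the plotting cell.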

## 5.5. Geography and Gender  

In [None]:
# Group by Geography and Gender
gender_geo_counts = df.groupby(["Geography", "Gender"]).size().reset_index(name="Count")

# Alternative (disabled): percent relative to the whole dataset
# total = gender_geo_counts["Count"].sum()
# gender_geo_counts["Percent"] = round((gender_geo_counts["Count"] / total) * 100, 2)

# Calculate total per country and percent within each country
total_per_country = gender_geo_counts.groupby("Geography")["Count"].transform("sum")
gender_geo_counts["Percent"] = round((gender_geo_counts["Count"] / total_per_country) * 100, 2)

# Create label: "Count (Percent%)"
gender_geo_counts["Label"] = gender_geo_counts["Count"].astype(str) + " (" + gender_geo_counts["Percent"].astype(str) + "%)"

In [None]:
# Bar Chart
fig = px.bar(
    gender_geo_counts,
    x="Geography",
    y="Count",
    color="Gender",
    barmode="group",
    text=gender_geo_counts["Label"],
    title="Number and Percentage of Male and Female Customers per Country"
)

# Update layout
fig.update_layout(
    yaxis_title="Number of Customers",
    xaxis_title="Country",
    bargap=0.2
)

# Show text above bars
fig.update_traces(textposition="outside", textfont_size=13)

# Display chart
fig.show()

In [None]:
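# Group by country, gender and churn status
grouped_churn = df.groupby(["Geography", "Gender", "Exited"]).size().reset_index(name="Count")

# Cast Exited to string so Plotly treats it as a category;
# px.bar maps a numeric color column to a continuous color scale instead
grouped_churn["Exited"] = grouped_churn["Exited"].astype(str)

# Bar Chart
fig = px.bar(
    grouped_churn,
    x="Geography",
    y="Count",
    color="Exited",
    barmode="group",
    facet_col="Gender",
    category_orders={"Exited": ["0", "1"]},
    color_discrete_map={"0": "deepskyblue", "1": "red"},
    labels={"Exited": "Churned (1 = Yes, 0 = No)"},
    title="Churn vs Stay by Gender and Country"
)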

fig.update_layout(
    yaxis_title="Number of Customers",
    xaxis_title="Country",
    bargap=0.2
)

fig.show()