# Bank Customer Churn – Segments and Business Insights

This is the third notebook in the **Modern Bank Churn** project.

Goals:

1. Use the **tuned LightGBM model** to score customers with churn probabilities.
2. Define **risk segments** based on churn probability.
3. Analyse each segment to understand typical customer profiles.
4. Discuss possible **retention strategies** per segment.

We retrain the model here so that the notebook is fully self-contained.


## 1. Imports and configuration

In [None]:
from __future__ import annotations

from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from lightgbm import LGBMClassifier

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_PATH: Path = Path("data") / "Churn_Modelling.csv"

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found at {DATA_PATH.resolve()}. "
        "Please download the Bank Customer Churn CSV and place it under the 'data/' directory."
    )


## 2. Load, clean, and prepare data

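We reuse the cleaning and preprocessing logic from the previous notebooks and then retrain the model:

- Drop identifier columns (`RowNumber`, `CustomerId`, `Surname`).
- Scale numeric features and one-hot encode `Geography` and `Gender`.
- Split into train/test sets (80/20, stratified on the target).
- Fit a LightGBM model with a fixed configuration.

For simplicity, we skip Optuna here and hard-code the configuration,
assuming hyperparameter tuning was done in the previous notebook.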
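
If the previous notebook saved its best parameters to disk, the hard-coded configuration could be replaced by loading them. The following is only a minimal sketch under that assumption: the file name `artifacts/best_lgbm_params.json` and the helper `load_tuned_params` are hypothetical and not part of this project.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Fallback values mirroring the fixed configuration used in this notebook.
FALLBACK_PARAMS: Dict[str, Any] = {
    "n_estimators": 300,
    "learning_rate": 0.05,
    "num_leaves": 32,
    "max_depth": 6,
}


def load_tuned_params(path: Path, fallback: Dict[str, Any]) -> Dict[str, Any]:
    """Return the parameters stored in a JSON file, or the fallback if it is missing."""
    if not path.exists():
        return dict(fallback)
    with path.open("r", encoding="utf-8") as fh:
        tuned = json.load(fh)
    # Tuned values override the fallback; keys absent from the file keep their defaults.
    return {**fallback, **tuned}


# Hypothetical location; adjust to wherever the tuning notebook writes its results.
params = load_tuned_params(Path("artifacts") / "best_lgbm_params.json", FALLBACK_PARAMS)
print(params)
```

The resulting dictionary could then be unpacked into the classifier, for example `LGBMClassifier(**params, random_state=RANDOM_STATE)`.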


In [None]:
def load_bank_churn_data(path: Path) -> pd.DataFrame:
    """Load the bank customer churn dataset from a CSV file."""
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path!s}")
    df: pd.DataFrame = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path!s}")
    return df


def clean_bank_churn_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Clean the bank customer churn dataset (drop IDs, check target)."""
    df = raw_df.copy()
    id_cols: List[str] = ["RowNumber", "CustomerId", "Surname"]
    drop_cols: List[str] = [c for c in id_cols if c in df.columns]
    if drop_cols:
        df = df.drop(columns=drop_cols)
        print(f"Dropped identifier columns: {drop_cols}")
    if "Exited" not in df.columns:
        raise ValueError("Target column 'Exited' not found in DataFrame.")
    return df


raw_df: pd.DataFrame = load_bank_churn_data(DATA_PATH)
df: pd.DataFrame = clean_bank_churn_data(raw_df)

TARGET_COL: str = "Exited"
X: pd.DataFrame = df.drop(columns=[TARGET_COL])
y: pd.Series = df[TARGET_COL].astype(int)

categorical_cols: List[str] = [c for c in ["Geography", "Gender"] if c in X.columns]
numeric_cols: List[str] = [c for c in X.columns if c not in categorical_cols]

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE,
)

clf = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=32,
    max_depth=6,
    subsample=0.8,
    subsample_freq=1,  # LightGBM only applies row subsampling when the bagging frequency is > 0
    colsample_bytree=0.8,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("clf", clf),
    ]
)

pipeline.fit(X_train, y_train)
y_proba_test = pipeline.predict_proba(X_test)[:, 1]
print("Test ROC-AUC:", roc_auc_score(y_test, y_proba_test))


## 3. Score all customers and define segments

We now score **all customers** (not only the test set) to create churn risk
segments.

Example segmentation:

- **Low risk**: `p(churn) < 0.2`
- **Medium risk**: `0.2 ≤ p(churn) < 0.5`
- **High risk**: `p(churn) ≥ 0.5`

These thresholds are illustrative and can be adjusted based on business needs.
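
The same thresholds can also be applied in a vectorised way with `pd.cut`, which returns an ordered categorical so that tables and plots keep the low / medium / high order. This is a small sketch on made-up probabilities; the names `BIN_EDGES`, `SEGMENT_LABELS`, and `segment_by_threshold` are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative thresholds from the text; adjust them to business needs.
BIN_EDGES = [-np.inf, 0.2, 0.5, np.inf]
SEGMENT_LABELS = ["low_risk", "medium_risk", "high_risk"]


def segment_by_threshold(proba: pd.Series) -> pd.Series:
    """Map churn probabilities to ordered risk segments."""
    # right=False closes each bin on the left, so 0.2 is medium risk and 0.5 is high risk.
    return pd.cut(proba, bins=BIN_EDGES, labels=SEGMENT_LABELS, right=False)


example = pd.Series([0.05, 0.2, 0.49, 0.5, 0.93])
print(segment_by_threshold(example).tolist())
# ['low_risk', 'medium_risk', 'medium_risk', 'high_risk', 'high_risk']
```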


In [None]:
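# Score all customers with out-of-fold predictions: each customer is scored by a
# model that did not see that customer during training. Refitting on all of X and
# predicting on the same rows would give over-optimistic in-sample probabilities.
# The choice of 5 folds is an arbitrary default.
from sklearn.model_selection import StratifiedKFold, cross_val_predict

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
proba_all = cross_val_predict(pipeline, X, y, cv=cv, method="predict_proba")[:, 1]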

df_scores = df.copy()
df_scores["churn_proba"] = proba_all

def assign_risk_segment(p: float) -> str:
    """Assign a risk segment label based on churn probability p."""
    if p < 0.2:
        return "low_risk"
    if p < 0.5:
        return "medium_risk"
    return "high_risk"


df_scores["risk_segment"] = df_scores["churn_proba"].apply(assign_risk_segment)
df_scores[["Exited", "churn_proba", "risk_segment"]].head()


### 3.1 Segment sizes and average churn probability

In [None]:
segment_summary = (
    df_scores.groupby("risk_segment")
    .agg(
        n_customers=("Exited", "size"),
        churn_rate=("Exited", "mean"),
        avg_churn_proba=("churn_proba", "mean"),
        avg_age=("Age", "mean"),
        avg_balance=("Balance", "mean"),
        avg_num_products=("NumOfProducts", "mean"),
        avg_estimated_salary=("EstimatedSalary", "mean"),
    )
    .reset_index()
)

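display(segment_summary)

# Plot segments from low to high risk rather than in alphabetical order.
segment_order = ["low_risk", "medium_risk", "high_risk"]

sns.barplot(data=segment_summary, x="risk_segment", y="n_customers", order=segment_order)
plt.title("Number of customers per risk segment")
plt.ylabel("# customers")
plt.show()

sns.barplot(data=segment_summary, x="risk_segment", y="churn_rate", order=segment_order)
plt.title("Observed churn rate per segment")
plt.ylabel("Churn rate")
plt.show()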


### 3.2 Feature distributions by segment

We compare some key features between segments, such as `Age`, `Balance`,
`NumOfProducts`, and `IsActiveMember`.


In [None]:
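# Plot segments from low to high risk rather than in alphabetical order.
segment_order = ["low_risk", "medium_risk", "high_risk"]

# Age distribution by segment
sns.boxplot(data=df_scores, x="risk_segment", y="Age", order=segment_order)
plt.title("Age distribution by risk segment")
plt.show()

# Balance distribution by segment
sns.boxplot(data=df_scores, x="risk_segment", y="Balance", order=segment_order)
plt.title("Balance distribution by risk segment")
plt.show()

# Number of products by segment
sns.countplot(data=df_scores, x="NumOfProducts", hue="risk_segment", hue_order=segment_order)
plt.title("Number of products by risk segment")
plt.ylabel("# customers")
plt.show()

# Activity by segment
if "IsActiveMember" in df_scores.columns:
    sns.barplot(
        data=df_scores,
        x="risk_segment",
        y="IsActiveMember",
        order=segment_order,
        estimator=np.mean,
    )
    plt.title("Average activity (IsActiveMember) by risk segment")
    plt.ylabel("Mean IsActiveMember")
    plt.show()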


## 4. Business interpretation

Based on the segments and their profiles, we can sketch some strategies:

- **High-risk segment** (`risk_segment == "high_risk"`):
  - Predicted churn probability of 0.5 or higher.
  - Often lower activity (`IsActiveMember`) and distinct age and balance profiles (see section 3.2).
  - Potential actions:
    - Personalised outreach (calls, meetings).
    - Tailored product bundles or better terms.
    - Service quality review for this group.

- **Medium-risk segment** (`risk_segment == "medium_risk"`):
  - Elevated but not extreme churn probability (0.2 to 0.5).
  - Potential actions:
    - Targeted digital campaigns and nudges.
    - Soft incentives (fee waivers, loyalty points).

- **Low-risk segment** (`risk_segment == "low_risk"`):
  - Predicted churn probability below 0.2.
  - Focus on maintaining satisfaction efficiently.
  - Potential actions:
    - Light-touch engagement.
    - Cross-sell / up-sell if appropriate.


## 5. Summary and next steps

In this notebook we:

- Retrained a LightGBM churn model with a fixed configuration and scored all customers.
- Defined risk segments based on churn probability.
- Characterised segments by age, balance, number of products, activity, etc.
- Linked segments to possible **retention strategies**.

Possible extensions:

- Integrate **customer lifetime value (CLV)** into the segmentation, so that
  high-risk, high-value customers are prioritised.
- Use **lift / gain charts** to evaluate how well the model concentrates churn
  in the top-risk deciles.
- Combine segments with **operational constraints** (call centre capacity,
  marketing budget) to build concrete campaign plans.
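
The lift / gain idea above can be sketched with a small table builder. This is a minimal illustration rather than a definitive implementation: the function name `decile_gain_table` and the toy labels and scores are made up for the example.

```python
import numpy as np
import pandas as pd


def decile_gain_table(y_true, y_score, n_bins: int = 10) -> pd.DataFrame:
    """Summarise churn concentration per score bin (bin 1 = highest scores)."""
    data = pd.DataFrame({"y": np.asarray(y_true), "score": np.asarray(y_score)})
    # Rank customers from highest to lowest predicted churn probability.
    data = data.sort_values("score", ascending=False, kind="mergesort").reset_index(drop=True)
    # Assign (near) equal-sized bins by rank position.
    data["bin"] = np.arange(len(data)) * n_bins // len(data) + 1

    table = data.groupby("bin").agg(
        n_customers=("y", "size"),
        n_churners=("y", "sum"),
        avg_score=("score", "mean"),
    )
    base_rate = data["y"].mean()
    table["churn_rate"] = table["n_churners"] / table["n_customers"]
    # Lift: how much more churn a bin contains than a random sample of equal size.
    table["lift"] = table["churn_rate"] / base_rate
    # Cumulative gain: share of all churners captured up to and including this bin.
    table["cum_gain"] = table["n_churners"].cumsum() / data["y"].sum()
    return table


# Toy data: 10 customers, 3 churners, scores already distinct.
toy_y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
gain_table = decile_gain_table(toy_y, toy_scores, n_bins=5)
print(gain_table)
```

In this notebook the function could be called as `decile_gain_table(y_test, y_proba_test)`, so that the evaluation is based on held-out predictions.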
