# Bank Customer Churn – LightGBM, Optuna & SHAP

This notebook is part of the **Modern Bank Churn** project.

Goals of this notebook:

1. Reuse the bank churn dataset and preprocessing logic.
2. Train a **LightGBM** model for churn prediction.
3. Use **Optuna** to tune hyperparameters with cross-validation.
4. Explain the tuned model with **SHAP** values.


## 1. Imports and configuration

We add to the usual stack:

- `lightgbm` (`LGBMClassifier`) for gradient boosting.
- `optuna` for hyperparameter optimisation.
- `shap` for explainability.

In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.metrics import (
    RocCurveDisplay,  # used by evaluate_model_simple to draw the ROC curve
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator

from lightgbm import LGBMClassifier
import optuna
import shap

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_PATH: Path = Path("data") / "Churn_Modelling.csv"

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found at {DATA_PATH.resolve()}. "
        "Please download the Bank Customer Churn CSV and place it under the 'data/' directory."
    )


## 2. Load and clean the data

We mirror the cleaning steps from the first notebook so this one is self-contained:

- Drop identifier columns (`RowNumber`, `CustomerId`, `Surname`).
- Ensure `Exited` is present.


In [None]:
def load_bank_churn_data(path: Path) -> pd.DataFrame:
    """Load the bank customer churn dataset from a CSV file."""
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path!s}")
    df: pd.DataFrame = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path!s}")
    return df


def clean_bank_churn_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Clean the bank customer churn dataset (drop IDs, check target)."""
    df = raw_df.copy()

    id_cols: List[str] = ["RowNumber", "CustomerId", "Surname"]
    drop_cols: List[str] = [c for c in id_cols if c in df.columns]
    if drop_cols:
        df = df.drop(columns=drop_cols)
        print(f"Dropped identifier columns: {drop_cols}")

    if "Exited" not in df.columns:
        raise ValueError("Target column 'Exited' not found in DataFrame.")

    return df


raw_df: pd.DataFrame = load_bank_churn_data(DATA_PATH)
df: pd.DataFrame = clean_bank_churn_data(raw_df)
display(df.head())
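
Before modelling, it can help to confirm that the cleaned frame has no missing values and to see how imbalanced the target is. The helper below is a hypothetical addition (`quick_checks` is not part of the project), shown on a tiny made-up frame rather than the real data.

```python
import pandas as pd


def quick_checks(frame: pd.DataFrame, target: str) -> dict:
    """Return row count, total missing cells, and the share of positive labels."""
    return {
        "n_rows": int(len(frame)),
        "n_missing": int(frame.isna().sum().sum()),
        "positive_rate": float(frame[target].mean()),
    }


# Tiny made-up frame standing in for the cleaned churn data
toy = pd.DataFrame(
    {
        "Age": [42, 35, None, 51],
        "Geography": ["France", "Spain", "Germany", "France"],
        "Exited": [1, 0, 0, 0],
    }
)
checks = quick_checks(toy, "Exited")
print(checks)
```

On the real data the call would be `quick_checks(df, "Exited")`; a low positive rate is one reason to prefer stratified splits and ROC-AUC later on.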


## 3. Train–test split and preprocessing

We:

- Separate `X` and `y`.
- Perform a stratified train–test split.
- Use `ColumnTransformer` to one-hot encode `Geography` and `Gender` and to
  standardise the numeric features with `StandardScaler`.

Tree-based models such as LightGBM split on thresholds and are insensitive to
feature scale, so the scaler is optional here; we keep it so the same
preprocessor can be reused with scale-sensitive models.


In [None]:
TARGET_COL: str = "Exited"

if TARGET_COL not in df.columns:
    raise KeyError(f"Target column {TARGET_COL!r} not found in DataFrame.")

X: pd.DataFrame = df.drop(columns=[TARGET_COL])
y: pd.Series = df[TARGET_COL].astype(int)

categorical_cols: List[str] = [c for c in ["Geography", "Gender"] if c in X.columns]
numeric_cols: List[str] = [c for c in X.columns if c not in categorical_cols]

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE,
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)


## 4. Baseline LightGBM model

We start with a reasonable default `LGBMClassifier` inside a pipeline and
evaluate it using a simple train/test split.

This gives us a reference point before hyperparameter tuning with Optuna.


In [None]:
def evaluate_model_simple(
    model: BaseEstimator,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
) -> Dict[str, float]:
    """Fit a model and compute basic metrics on train and test data."""
    model.fit(X_train, y_train)

    y_pred_test = model.predict(X_test)
    y_proba_test = model.predict_proba(X_test)[:, 1]

    acc = accuracy_score(y_test, y_pred_test)
    roc_auc = roc_auc_score(y_test, y_proba_test)

    print(f"Test accuracy: {acc:.3f}")
    print(f"Test ROC-AUC: {roc_auc:.3f}")
    print("\nClassification report (test):")
    print(classification_report(y_test, y_pred_test, target_names=["Stayed", "Exited"]))

    cm = confusion_matrix(y_test, y_pred_test)
    sns.heatmap(
        cm,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=["Pred stayed", "Pred exited"],
        yticklabels=["True stayed", "True exited"],
    )
    # Titles stay generic because this helper is reused for the tuned model
    plt.title("Confusion matrix - LightGBM")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

    RocCurveDisplay.from_predictions(y_test, y_proba_test)
    plt.title("ROC curve - LightGBM")
    plt.show()

    return {"accuracy": acc, "roc_auc": roc_auc}


lgbm_baseline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "clf",
            LGBMClassifier(
                n_estimators=200,
                learning_rate=0.05,
                subsample=0.8,
                # LightGBM only applies `subsample` when subsample_freq > 0
                subsample_freq=1,
                colsample_bytree=0.8,
                random_state=RANDOM_STATE,
            ),
        ),
    ]
)

baseline_metrics = evaluate_model_simple(lgbm_baseline, X_train, X_test, y_train, y_test)
baseline_metrics
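
The helper reports both accuracy and ROC-AUC because they answer different questions: accuracy depends on the 0.5 decision threshold used by `predict`, while ROC-AUC measures how well the predicted probabilities rank churners above non-churners. The sketch below uses made-up labels and scores (not project data) and assumes churners are the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: 8 customers stayed (0), 2 exited (1)
y_true = np.array([0] * 8 + [1] * 2)

# Scorer A: the same probability for everyone (no ranking ability)
proba_a = np.full(10, 0.1)
# Scorer B: ranks both churners above all non-churners, but never crosses 0.5
proba_b = np.array([0.1] * 8 + [0.4] * 2)

results = {}
for name, proba in [("A", proba_a), ("B", proba_b)]:
    pred = (proba >= 0.5).astype(int)  # same threshold that `predict` applies
    results[name] = (accuracy_score(y_true, pred), roc_auc_score(y_true, proba))
    print(name, results[name])
```

Both scorers reach the same accuracy because neither predicts a single churner at the 0.5 threshold, yet only scorer B ranks customers usefully. This is the motivation for tuning on ROC-AUC in the next section.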


## 5. Hyperparameter tuning with Optuna

We now define an **Optuna** optimisation loop that:

1. Samples LightGBM hyperparameters.
2. Builds a pipeline with those hyperparameters.
3. Evaluates mean ROC-AUC via cross-validation on the training set.
4. Returns the mean ROC-AUC as the objective value, which the study maximises.

We use:

- `StratifiedKFold` for consistent splits.
- A modest number of trials (e.g. 30–50) to keep runtime manageable.
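
As a small illustration of what `StratifiedKFold` guarantees, the sketch below splits made-up labels with a 20% positive rate and checks the positive rate inside each validation fold. The arrays are placeholders, not the churn data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positives in each validation fold
fold_rates = [float(y_toy[val_idx].mean()) for _, val_idx in cv.split(X_toy, y_toy)]
print(fold_rates)
```

Every fold keeps the overall class ratio, so the per-fold ROC-AUC scores are computed on comparable samples.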


In [None]:
def create_lgbm_pipeline(trial: optuna.Trial) -> Pipeline:
    """Create a LightGBM pipeline with hyperparameters suggested by Optuna."""
    # Hyperparameters suggested by Optuna
    num_leaves = trial.suggest_int("num_leaves", 16, 64)
    max_depth = trial.suggest_int("max_depth", 3, 10)
    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.2, log=True)
    n_estimators = trial.suggest_int("n_estimators", 100, 500)
    min_child_samples = trial.suggest_int("min_child_samples", 10, 100)
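    # Note: LightGBM applies `subsample` only when `subsample_freq` > 0 (default 0),
    # so with the classifier settings below this parameter does not change the fit.
    subsample = trial.suggest_float("subsample", 0.6, 1.0)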
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.6, 1.0)
    reg_lambda = trial.suggest_float("reg_lambda", 0.0, 10.0)
    reg_alpha = trial.suggest_float("reg_alpha", 0.0, 10.0)

    clf = LGBMClassifier(
        num_leaves=num_leaves,
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        min_child_samples=min_child_samples,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )

    pipeline = Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("clf", clf),
        ]
    )
    return pipeline


def objective(trial: optuna.Trial) -> float:
    """Optuna objective function: maximise ROC-AUC via cross-validation.

    We return the mean ROC-AUC across folds.
    """
    pipeline = create_lgbm_pipeline(trial)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

    scores = cross_val_score(
        pipeline,
        X_train,
        y_train,
        cv=cv,
        scoring="roc_auc",
        n_jobs=-1,
    )
    mean_score = float(scores.mean())
    return mean_score


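# Seed the sampler so the sequence of suggested hyperparameters is repeatable
study = optuna.create_study(
    direction="maximize",
    study_name="lgbm_bank_churn",
    sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
)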
study.optimize(objective, n_trials=30, show_progress_bar=False)

print("Best trial:")
print("  Value (ROC-AUC):", study.best_value)
print("  Params:")
for k, v in study.best_params.items():
    print(f"    {k}: {v}")


### 5.1 Fit the best LightGBM model

We now create a pipeline with the best parameters found by Optuna, fit it
on the full training data, and evaluate it on the test set.


In [None]:
best_params = study.best_params
best_clf = LGBMClassifier(
    num_leaves=best_params["num_leaves"],
    max_depth=best_params["max_depth"],
    learning_rate=best_params["learning_rate"],
    n_estimators=best_params["n_estimators"],
    min_child_samples=best_params["min_child_samples"],
    subsample=best_params["subsample"],
    colsample_bytree=best_params["colsample_bytree"],
    reg_lambda=best_params["reg_lambda"],
    reg_alpha=best_params["reg_alpha"],
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

best_lgbm_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("clf", best_clf),
    ]
)

tuned_metrics = evaluate_model_simple(best_lgbm_pipeline, X_train, X_test, y_train, y_test)
tuned_metrics
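
To read the two result dictionaries side by side, one option is a small comparison table. `compare_metrics` is a hypothetical helper, not part of the project; it only assumes both dictionaries share the same metric names.

```python
from typing import Dict

import pandas as pd


def compare_metrics(baseline: Dict[str, float], tuned: Dict[str, float]) -> pd.DataFrame:
    """Put two metric dictionaries side by side and add the difference."""
    table = pd.DataFrame({"baseline": baseline, "tuned": tuned})
    table["delta"] = table["tuned"] - table["baseline"]
    return table


# In this notebook the call would be:
# compare_metrics(baseline_metrics, tuned_metrics)
```

A small difference on a single test split can be noise, so the table is a first look rather than proof that tuning helped.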


## 6. Model explainability with SHAP

We use **SHAP** to understand how features influence the LightGBM predictions.

Steps:

1. Fit the best pipeline on the full training set (already done).
2. Extract the trained `LGBMClassifier` and the transformed training data.
3. Use `shap.TreeExplainer` to compute SHAP values.
4. Plot:
   - SHAP summary plot (global importance).
   - SHAP dependence plots for key features.


In [None]:
# Refit on the full training data so this cell can run on its own
# (evaluate_model_simple has already fitted this pipeline once)
best_lgbm_pipeline.fit(X_train, y_train)

# Extract trained model and transformed training features
preprocessor_fitted: ColumnTransformer = best_lgbm_pipeline.named_steps["preprocess"]  # type: ignore[assignment]
clf_fitted: LGBMClassifier = best_lgbm_pipeline.named_steps["clf"]  # type: ignore[assignment]

X_train_transformed = preprocessor_fitted.transform(X_train)

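# Build SHAP explainer
explainer = shap.TreeExplainer(clf_fitted)

# SHAP expects a dense matrix for plotting in many cases
X_train_transformed_dense = X_train_transformed.toarray() if hasattr(X_train_transformed, "toarray") else X_train_transformed

shap_values = explainer.shap_values(X_train_transformed_dense)

# The return type varies across SHAP versions for binary classifiers:
# a list [class 0, class 1], a 2-D array for the positive class, or a 3-D
# array (samples, features, classes). Normalise to the list form so that
# shap_values[1] is always the positive ("Exited") class.
if isinstance(shap_values, np.ndarray):
    if shap_values.ndim == 3:
        shap_values = [shap_values[:, :, k] for k in range(shap_values.shape[2])]
    else:
        shap_values = [-shap_values, shap_values]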

# Get feature names from preprocessor
cat_encoder: OneHotEncoder = preprocessor_fitted.named_transformers_["cat"].named_steps["encoder"]  # type: ignore[index]
cat_feature_names = cat_encoder.get_feature_names_out(categorical_cols)

feature_names: List[str] = numeric_cols + list(cat_feature_names)


In [None]:
# SHAP summary plot (global importance)
shap.summary_plot(shap_values[1], X_train_transformed_dense, feature_names=feature_names)
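
The ordering in the summary plot corresponds to the mean absolute SHAP value per feature, which can also be tabulated directly. `mean_abs_shap` is a hypothetical helper, and the matrix below is made up for illustration.

```python
import numpy as np
import pandas as pd


def mean_abs_shap(shap_matrix: np.ndarray, names: list) -> pd.Series:
    """Mean absolute SHAP value per feature, largest first."""
    importance = np.abs(shap_matrix).mean(axis=0)
    return pd.Series(importance, index=names).sort_values(ascending=False)


# Made-up SHAP matrix: 3 samples x 2 features
toy_shap = np.array([[0.5, -0.1], [-0.3, 0.1], [0.4, 0.1]])
ranking = mean_abs_shap(toy_shap, ["Age", "Balance"])
print(ranking)
```

In this notebook the equivalent call would be `mean_abs_shap(shap_values[1], feature_names)`, assuming `shap_values[1]` holds the positive-class matrix.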


In [None]:
# Optional: dependence plots for selected key features.
# Change the feature names depending on importance and domain interest.
# Numeric features were standardised, so the x-axes are in scaled units.
for feat in ["Age", "Balance", "NumOfProducts", "IsActiveMember"]:
    if feat in feature_names:
        shap.dependence_plot(feat, shap_values[1], X_train_transformed_dense, feature_names=feature_names)


### Section summary

In this notebook we:

- Built a **LightGBM** churn model.
- Tuned its hyperparameters with **Optuna** using ROC-AUC as objective.
- Compared the tuned model with the baseline on the held-out test set.
- Used **SHAP** to interpret which features drive churn predictions.

Next, in the segmentation notebook, we will use the tuned model to create
**actionable customer segments** and link them to potential retention strategies.
