# Bank Customer Churn – Uplift Modelling for Retention Campaigns (Simulated RCT)

This notebook is a **new churn project** focused on **uplift modelling**.

Instead of only predicting **who will churn**, we want to estimate **who will be
positively influenced by a retention campaign**.

Because the `Churn_Modelling.csv` dataset does **not** contain an actual
experiment, we:

- **Simulate** a retention campaign as a randomised experiment (treatment vs control).
- Construct **potential outcomes** with heterogeneous treatment effects.
- Fit an **uplift model** to learn where the campaign works best.

High-level steps:

1. Load and clean the bank churn dataset.
2. Simulate a randomised retention campaign (treatment assignment).
3. Simulate heterogeneous treatment effects and an observed post-campaign churn.
4. Check randomisation balance and estimate the average treatment effect (ATE).
5. Train an uplift model using the **two-model approach** (treated vs control models).
6. Evaluate uplift by **uplift-by-quantile** analysis.
7. Discuss how to use uplift scores to target future retention campaigns.


## 1. Imports and configuration

We use:

- `pandas`, `numpy` for data handling.
- `matplotlib`, `seaborn` for plots.
- `scikit-learn` for modelling.
- `scipy` for a correlation check (imported in the cell where it is used).

We assume the dataset is available at:

```text
data/Churn_Modelling.csv
```


In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_PATH: Path = Path("data") / "Churn_Modelling.csv"

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found at {DATA_PATH.resolve()}. "
        "Please download the Bank Customer Churn CSV and place it under the 'data/' directory."
    )


## 2. Load and clean the data

We load the bank churn dataset and perform basic cleaning:

- Drop identifier columns.
- Ensure `Exited` exists and is encoded as 0/1.


In [None]:
def load_bank_churn_data(path: Path) -> pd.DataFrame:
    """Load the bank customer churn dataset from a CSV file.

    Args:
        path: Path to the CSV file.

    Returns:
        DataFrame containing the bank churn data.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the loaded DataFrame is empty.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path!s}")

    df: pd.DataFrame = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path!s}")

    return df


def clean_bank_churn_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Clean the bank customer churn dataset.

    - Drop identifier columns.
    - Ensure `Exited` exists and is integer 0/1.

    Args:
        raw_df: Raw bank churn DataFrame.

    Returns:
        Cleaned DataFrame.
    """
    df = raw_df.copy()

    id_cols: List[str] = ["RowNumber", "CustomerId", "Surname"]
    drop_cols: List[str] = [c for c in id_cols if c in df.columns]
    if drop_cols:
        df = df.drop(columns=drop_cols)
        print(f"Dropped identifier columns: {drop_cols}")

    if "Exited" not in df.columns:
        raise ValueError("Target column 'Exited' not found in DataFrame.")

    df["Exited"] = df["Exited"].astype(int)

    # Show any missing values
    missing = df.isna().sum()
    print("Missing values per column (non-zero only):")
    display(missing[missing > 0])

    return df


raw_df: pd.DataFrame = load_bank_churn_data(DATA_PATH)
df: pd.DataFrame = clean_bank_churn_data(raw_df)

print("Shape:", df.shape)
print(df["Exited"].value_counts(normalize=True).rename("Exited proportion"))

display(df.head())


We have a snapshot of customers, some of whom **exited** (churned) and others
who stayed. Next we simulate a **retention campaign experiment** on top of this.


## 3. Simulating a retention campaign (randomised experiment)

The dataset does not have an actual campaign, so we **simulate** one.

Conceptually, in a real RCT (randomised controlled trial):

- Each customer is randomly assigned to:
  - **Treatment** (`treatment = 1`): receives a retention offer.
  - **Control** (`treatment = 0`): no offer.
- After some time, we observe whether they churned (`churn_after_campaign`).

To simulate this realistically, we:

1. Build a **baseline churn risk model** from the original data.
2. Define a **heterogeneous treatment effect** `tau(x)` based on customer features.
3. For each customer construct potential churn probabilities:
   - `p_control(x)` – probability of churn with no offer.
   - `p_treated(x)` – probability of churn with an offer.
4. Randomly assign treatment and sample an observed outcome from the
   appropriate Bernoulli distribution.

This gives us synthetic RCT data `(X, treatment, churn_observed)` with known
"true" individual-level treatment effects.


### 3.1 Baseline churn risk model

We first fit a simple **Logistic Regression** as a baseline risk model
`P(Exited = 1 | X)`.


In [None]:
TARGET_COL: str = "Exited"

X_all: pd.DataFrame = df.drop(columns=[TARGET_COL])
y_all: pd.Series = df[TARGET_COL]

# Define categorical and numeric columns
categorical_cols: List[str] = [c for c in ["Geography", "Gender"] if c in X_all.columns]
numeric_cols: List[str] = [c for c in X_all.columns if c not in categorical_cols]

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

baseline_log_reg = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "clf",
            LogisticRegression(
                max_iter=1000,
                random_state=RANDOM_STATE,
                n_jobs=-1,
            ),
        ),
    ]
)

# Fit baseline model on all data (for simulation purposes)
baseline_log_reg.fit(X_all, y_all)

p_base: np.ndarray = baseline_log_reg.predict_proba(X_all)[:, 1]

print("Baseline churn probability summary:")
print(pd.Series(p_base).describe())


### 3.2 Define heterogeneous treatment effects

We now define a simple **treatment effect function** `tau(x)`:

- Base effect: the campaign reduces churn probability by 5 percentage points.
- The effect varies by profile, for example:
  - Customers with **2+ products** and **active** membership → an extra 10 points (15 in total; the campaign works best here).
  - Customers with a very low **credit score** (below 500) → 3 points less than they would otherwise get (2 in total outside the segment above).

We implement this by:

- Starting from the baseline probability `p_base`.
- Constructing `p_control` and `p_treated` per customer.
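- Clipping probabilities to `[0.01, 0.99]`, which means the realised effect
  shrinks for customers whose baseline risk is already close to the lower bound.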


In [None]:
def compute_potential_outcomes(
    df_features: pd.DataFrame,
    base_probs: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute potential churn probabilities under control and treatment.

    The function encodes a simple heterogeneous treatment effect rule.

    Args:
        df_features: Original feature DataFrame (without target).
        base_probs: Baseline churn probabilities from a model.

    Returns:
        Tuple of (p_control, p_treated, true_ite) arrays.
        - p_control: churn prob with no campaign.
        - p_treated: churn prob with campaign.
        - true_ite: p_control - p_treated (reduction in churn due to campaign).
    """
    p_base = np.asarray(base_probs, dtype=float)

    # Start from baseline as control probability
    p_control = np.clip(p_base, 0.01, 0.99)

    # Base treatment effect (absolute reduction in churn probability)
    base_effect = 0.05  # 5 percentage points

    # Extra effect for certain segments
    extra_effect = np.zeros_like(p_control)

    # Customers with 2+ products and active membership respond well
    if {"NumOfProducts", "IsActiveMember"}.issubset(df_features.columns):
        mask_good_segment = (
            (df_features["NumOfProducts"] >= 2) & (df_features["IsActiveMember"] == 1)
        )
        extra_effect[mask_good_segment.to_numpy()] += 0.10

    # Customers with very low credit score respond less
    if "CreditScore" in df_features.columns:
        mask_low_score = df_features["CreditScore"] < 500
        extra_effect[mask_low_score.to_numpy()] -= 0.03  # reduce effect

    # Treatment effect for each individual
    tau = base_effect + extra_effect
    tau = np.clip(tau, 0.0, 0.3)  # cap effect

    p_treated = np.clip(p_control - tau, 0.01, 0.99)

    true_ite = p_control - p_treated  # positive = reduction in churn

    return p_control, p_treated, true_ite


# Pass the feature frame without the target, as the docstring specifies.
p_control, p_treated, true_ite = compute_potential_outcomes(X_all, p_base)

print("True ITE (p_control - p_treated) summary:")
print(pd.Series(true_ite).describe())


### 3.3 Simulate treatment assignment and observed outcomes

We now:

1. Assign each customer **randomly** to treatment or control (`p=0.5`).
2. For each customer:
   - If `treatment = 0`, draw churn from `Bernoulli(p_control)`.
   - If `treatment = 1`, draw churn from `Bernoulli(p_treated)`.

This gives us:

- `treatment` – 0 or 1.
- `churn_observed` – 0 (stayed) or 1 (churned) **after** the simulated campaign.

We also keep `p_control`, `p_treated`, and `true_ite` for evaluation.


In [None]:
n: int = df.shape[0]

# Randomised treatment assignment
p_treat: float = 0.5
treatment = np.random.binomial(1, p_treat, size=n)

# Observed churn after campaign
churn_observed = np.where(
    treatment == 1,
    np.random.binomial(1, p_treated),
    np.random.binomial(1, p_control),
)

sim_df = df.copy()
sim_df["treatment"] = treatment
sim_df["churn_observed"] = churn_observed
sim_df["p_control"] = p_control
sim_df["p_treated"] = p_treated
sim_df["true_ite"] = true_ite

sim_df.head()


### 3.4 Quick randomisation and effect checks

We check:

- Whether treatment assignment is balanced.
- The **true ATE** implied by our potential outcomes.
- The **empirical ATE** estimated from the simulated data.
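
Because the empirical ATE is a difference between two sample churn rates, it only matches the true ATE up to sampling noise. The following is a minimal sketch of a normal-approximation confidence interval, run on hand-made toy arrays rather than the simulated data. The helper name `diff_in_means_ci` and the toy counts are assumptions made for illustration.

```python
import numpy as np


def diff_in_means_ci(y, t, z=1.96):
    """Difference in churn rates (treated - control) with a normal-approximation CI."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    y_treated, y_control = y[t == 1], y[t == 0]
    p1, p0 = y_treated.mean(), y_control.mean()
    diff = p1 - p0
    # Standard error of a difference between two independent proportions.
    se = np.sqrt(p1 * (1 - p1) / len(y_treated) + p0 * (1 - p0) / len(y_control))
    return diff, diff - z * se, diff + z * se


# Toy data (assumed): 100 treated with 20 churners, 100 control with 30 churners.
y_toy = np.r_[np.ones(20), np.zeros(80), np.ones(30), np.zeros(70)]
t_toy = np.r_[np.ones(100), np.zeros(100)].astype(int)

diff, ci_low, ci_high = diff_in_means_ci(y_toy, t_toy)
print(f"treated - control: {diff:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

On the notebook's data the analogous call would be `diff_in_means_ci(sim_df["churn_observed"], sim_df["treatment"])`. Since the true ATE is defined as a churn *reduction*, the interval should be compared against `-true_ate`.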


In [None]:
# Treatment balance
print(sim_df["treatment"].value_counts(normalize=True).rename("treatment_share"))

# True ATE from potential outcomes
true_ate = float(sim_df["true_ite"].mean())
print(f"True ATE (expected churn reduction): {true_ate:.4f}")

# Empirical ATE from simulated outcomes
emp_ate = float(sim_df.loc[sim_df["treatment"] == 1, "churn_observed"].mean() -
                sim_df.loc[sim_df["treatment"] == 0, "churn_observed"].mean())

print(f"Empirical ATE (treatment - control churn): {emp_ate:.4f}")
print(f"Empirical churn reduction (control - treatment): {-emp_ate:.4f}")


We now have a synthetic **RCT** with heterogeneous uplift.
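
Segment-level effects can be read off the same way as the ATE, by comparing treated and control churn rates within each segment. The following is a minimal sketch on a tiny hand-made frame. The helper name `segment_churn_reduction` is hypothetical, and the column names simply mirror those in `sim_df`.

```python
import pandas as pd


def segment_churn_reduction(
    frame, segment_col, treatment_col="treatment", outcome_col="churn_observed"
):
    """Churn rate by arm and the control - treated difference, per segment."""
    rates = (
        frame.groupby([segment_col, treatment_col])[outcome_col]
        .mean()
        .unstack(treatment_col)  # columns become the treatment values 0 and 1
    )
    out = pd.DataFrame({"churn_control": rates[0], "churn_treated": rates[1]})
    out["churn_reduction"] = out["churn_control"] - out["churn_treated"]
    out["n"] = frame.groupby(segment_col).size()
    return out


# Toy data (assumed): 8 active and 4 inactive customers.
toy = pd.DataFrame(
    {
        "IsActiveMember": [1] * 8 + [0] * 4,
        "treatment": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0],
        "churn_observed": [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    }
)

segment_effects = segment_churn_reduction(toy, "IsActiveMember")
print(segment_effects)
```

With the simulated data, `segment_churn_reduction(sim_df, "IsActiveMember")` would be the analogous call. Segment estimates are noisier than the overall ATE because each one uses fewer rows.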

Next, we build an uplift model to **learn where the campaign works best**.


## 4. Train–test split for uplift modelling

We want to train an uplift model and evaluate it on a **hold-out test set**.

We split the simulated dataset into train and test, keeping:

- Features `X` (original columns, excluding `Exited`).
- Treatment indicator `treatment`.
- Outcome `churn_observed`.
- `true_ite` for evaluation only.


In [None]:
# Features for uplift modelling (exclude original Exited and uplift-specific columns)
X_uplift = sim_df.drop(columns=[
    "Exited",
    "churn_observed",
    "treatment",
    "p_control",
    "p_treated",
    "true_ite",
])

y_uplift = sim_df["churn_observed"]  # post-campaign churn
T_uplift = sim_df["treatment"]
true_ite_all = sim_df["true_ite"]

X_train, X_test, y_train, y_test, T_train, T_test, ite_train, ite_test = train_test_split(
    X_uplift,
    y_uplift,
    T_uplift,
    true_ite_all,
    test_size=0.3,
    random_state=RANDOM_STATE,
    stratify=T_uplift,
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


We build a new **preprocessor** with the same structure as before (numeric
scaling + one-hot encoding), defined on the uplift feature set.


In [None]:
categorical_cols_uplift: List[str] = [c for c in ["Geography", "Gender"] if c in X_uplift.columns]
numeric_cols_uplift: List[str] = [c for c in X_uplift.columns if c not in categorical_cols_uplift]

print("Categorical cols (uplift):", categorical_cols_uplift)
print("Numeric cols (uplift):", numeric_cols_uplift)

numeric_transformer_u = Pipeline(steps=[("scaler", StandardScaler())])

categorical_transformer_u = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor_u = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer_u, numeric_cols_uplift),
        ("cat", categorical_transformer_u, categorical_cols_uplift),
    ]
)


## 5. Uplift model – two-model approach

There are several uplift modelling strategies. Here we use the classic
**two-model approach**:

1. Train one model on the **treated** group: `P(Y = 1 | X, treatment = 1)`.
2. Train another model on the **control** group: `P(Y = 1 | X, treatment = 0)`.
3. For a new customer with features `x`, estimate:
   - `p_treated_hat(x)` and `p_control_hat(x)`.
   - Uplift in terms of **churn reduction**:

   ```text
   uplift_hat(x) = p_control_hat(x) - p_treated_hat(x)
   ```

A large **positive** `uplift_hat(x)` means:

> "This customer is expected to churn much less if we treat them than if we do nothing."  
> → **High-priority target** for retention campaigns.


In [None]:
def fit_two_model_uplift(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    T_train: pd.Series,
    preprocessor: ColumnTransformer,
) -> Tuple[Pipeline, Pipeline]:
    """Fit two separate models: one for treated, one for control.

    Args:
        X_train: Training features.
        y_train: Training outcome (0/1 churn).
        T_train: Treatment indicator for training rows.
        preprocessor: ColumnTransformer for preprocessing.

    Returns:
        Tuple of (model_control, model_treated) pipelines.
    """
    treated_mask = T_train == 1
    control_mask = T_train == 0

    X_treated = X_train.loc[treated_mask]
    y_treated = y_train.loc[treated_mask]

    X_control = X_train.loc[control_mask]
    y_control = y_train.loc[control_mask]

    print("Treated samples:", X_treated.shape[0])
    print("Control samples:", X_control.shape[0])

    # Local import keeps this cell self-contained.
    from sklearn.base import clone

    def make_model(seed: int) -> Pipeline:
        """Build one outcome model with its own copy of the preprocessor."""
        return Pipeline(
            steps=[
                # A Pipeline fits its steps in place, so sharing one
                # preprocessor instance would let the second fit overwrite
                # the fitted state used by the first model.
                ("preprocess", clone(preprocessor)),
                (
                    "clf",
                    RandomForestClassifier(
                        n_estimators=300,
                        max_depth=None,
                        min_samples_split=4,
                        min_samples_leaf=2,
                        random_state=seed,
                        n_jobs=-1,
                    ),
                ),
            ]
        )

    model_treated = make_model(RANDOM_STATE)
    model_control = make_model(RANDOM_STATE + 1)

    model_treated.fit(X_treated, y_treated)
    model_control.fit(X_control, y_control)

    return model_control, model_treated


model_control, model_treated = fit_two_model_uplift(
    X_train=X_train,
    y_train=y_train,
    T_train=T_train,
    preprocessor=preprocessor_u,
)


In [None]:
def predict_uplift(
    model_control: Pipeline,
    model_treated: Pipeline,
    X: pd.DataFrame,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Predict control and treated churn probabilities and uplift.

    Args:
        model_control: Model trained on control group.
        model_treated: Model trained on treated group.
        X: Feature DataFrame for which to predict uplift.

    Returns:
        Tuple of (p_control_hat, p_treated_hat, uplift_hat) arrays.
        uplift_hat = p_control_hat - p_treated_hat (expected churn reduction).
    """
    p_control_hat = model_control.predict_proba(X)[:, 1]
    p_treated_hat = model_treated.predict_proba(X)[:, 1]

    uplift_hat = p_control_hat - p_treated_hat

    return p_control_hat, p_treated_hat, uplift_hat


p0_hat_test, p1_hat_test, uplift_hat_test = predict_uplift(
    model_control, model_treated, X_test
)

print("Predicted uplift summary:")
print(pd.Series(uplift_hat_test).describe())


We now have, for each test customer:

- `p0_hat_test` – predicted churn probability if **not treated**.
- `p1_hat_test` – predicted churn probability if **treated**.
- `uplift_hat_test` – estimated **reduction in churn** if treated.

We can compare this to the **true individual treatment effect** `true_ite` used
in the simulation.


In [None]:
# Compare predicted uplift to true ITE (simulation ground truth)

from scipy.stats import pearsonr

corr, p_value = pearsonr(uplift_hat_test, ite_test.to_numpy())
print(f"Correlation between predicted uplift and true ITE: {corr:.3f} (p={p_value:.3g})")

plt.scatter(ite_test, uplift_hat_test, alpha=0.3)
plt.xlabel("True ITE (churn reduction)")
plt.ylabel("Predicted uplift (churn reduction)")
plt.title("Predicted uplift vs simulated ground truth")
plt.axhline(0, color="black", linestyle="--")
plt.axvline(0, color="black", linestyle="--")
plt.show()


A positive correlation means the uplift model is **learning something** about
where the campaign works better or worse.

However, uplift is usually evaluated not just by correlation, but by how well
it **concentrates treatment effect** in the top-ranked customers.


## 6. Uplift-by-quantile analysis

A practical way to inspect uplift is to:

1. Rank customers in the **test set** by their predicted uplift.
2. Split them into quantiles (e.g., 5 or 10 groups).
3. For each quantile:
   - Compute **observed churn rates** for treated vs control customers.
   - Estimate **observed uplift**:

   ```text
   observed_uplift = churn_rate_control - churn_rate_treated
   ```

If the model is useful, the **top uplift quantiles** should show the **largest
observed uplift** (largest churn reduction when treated).


In [None]:
def uplift_by_quantile(
    uplift_scores: np.ndarray,
    y: pd.Series,
    T: pd.Series,
    n_quantiles: int = 5,
) -> pd.DataFrame:
    """Compute observed uplift by quantile of predicted uplift.

    Args:
        uplift_scores: Array of predicted uplift scores (higher is better).
        y: Observed outcome (1 = churn).
        T: Treatment indicator (1 = treated, 0 = control).
        n_quantiles: Number of quantile bins.

    Returns:
        DataFrame with metrics per quantile.
    """
    df_local = pd.DataFrame(
        {
            "uplift_hat": uplift_scores,
            "y": y.to_numpy(),
            "T": T.to_numpy(),
        }
    )

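    # Higher uplift = better → we rank descending (Q1 = highest predicted uplift).
    # Ranking with method="first" breaks ties, so qcut never sees duplicate bin edges.
    df_local["quantile"] = pd.qcut(
        df_local["uplift_hat"].rank(method="first", ascending=False),
        q=n_quantiles,
        labels=[f"Q{i+1}" for i in range(n_quantiles)],
    )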

    rows = []
    for q in df_local["quantile"].cat.categories:
        subset = df_local[df_local["quantile"] == q]
        if subset.empty:
            continue

        treated = subset[subset["T"] == 1]
        control = subset[subset["T"] == 0]

        churn_treated = treated["y"].mean() if not treated.empty else np.nan
        churn_control = control["y"].mean() if not control.empty else np.nan

        observed_uplift = churn_control - churn_treated

        rows.append(
            {
                "quantile": q,
                "n": len(subset),
                "n_treated": len(treated),
                "n_control": len(control),
                "churn_treated": churn_treated,
                "churn_control": churn_control,
                "observed_uplift": observed_uplift,
            }
        )

    return pd.DataFrame(rows)


uplift_q_df = uplift_by_quantile(
    uplift_scores=uplift_hat_test,
    y=y_test,
    T=T_test,
    n_quantiles=5,
)

uplift_q_df


In [None]:
# Plot observed uplift by quantile

plt.figure(figsize=(8, 5))
sns.barplot(data=uplift_q_df, x="quantile", y="observed_uplift")
plt.axhline(0, color="black", linestyle="--")
plt.ylabel("Observed uplift (churn reduction)")
plt.title("Observed uplift by predicted uplift quantile")
plt.show()


If the uplift model is effective, the **top quantiles** (e.g. Q1, Q2) should
show **higher positive observed uplift** than the lower quantiles.

This means that, if you could only treat a subset (e.g. just Q1), you would
get more churn reduction than treating a random subset of the same size.
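
The comparison against random targeting can be made concrete by measuring observed uplift among only the top-scored share of customers. The observed uplift over the whole population is what a random subset would achieve on average. The following is a minimal sketch on hand-made toy arrays; `uplift_at_top_fraction` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def uplift_at_top_fraction(scores, y, t, fraction):
    """Observed churn reduction (control - treated) among the top-scored fraction."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    k = max(1, int(np.ceil(fraction * len(scores))))
    top_idx = np.argsort(-scores, kind="stable")[:k]  # highest scores first
    y_top, t_top = y[top_idx], t[top_idx]
    if (t_top == 1).sum() == 0 or (t_top == 0).sum() == 0:
        return float("nan")  # need both arms to estimate uplift
    return float(y_top[t_top == 0].mean() - y_top[t_top == 1].mean())


# Toy data (assumed): the treatment only helps the four highest-scored customers.
scores_toy = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.05, 0.0]
t_toy = [1, 0, 1, 0, 1, 0, 1, 0]
y_toy = [0, 1, 0, 1, 1, 1, 1, 1]

uplift_top_half = uplift_at_top_fraction(scores_toy, y_toy, t_toy, 0.5)
uplift_everyone = uplift_at_top_fraction(scores_toy, y_toy, t_toy, 1.0)
print(uplift_top_half, uplift_everyone)
```

On the test set, the analogous call would pass `uplift_hat_test`, `y_test`, and `T_test`. Sweeping `fraction` from small to 1.0 gives a rough picture of how quickly the treatment effect is used up as targeting widens.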


## 7. Business interpretation – using uplift scores

In a real bank retention campaign, your constraints might be:

- Limited **budget** (you cannot treat everyone).
- Limited **operational capacity** (calls per month, emails with human follow-up).

The uplift model supports decisions such as:

1. **Who to target?**
   - Rank customers by `uplift_hat` (expected churn reduction if treated).
   - Choose the top `K` or top `X%` subject to budget.

2. **How to evaluate a campaign design?**
   - Compare uplift-by-quantile between different models or strategies.
   - Estimate **incremental churn reduction** (or revenue) when targeting
     only high-uplift customers vs everyone.

3. **How to combine with value (CLV)?**
   - Multiply predicted uplift (churn reduction) by customer value.
   - Rank customers by **expected incremental CLV**:

   ```text
   incremental_value ≈ uplift_hat(x) * CLV(x)
   ```

   - This is particularly powerful when combined with the CLV notebook.
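
The following is a minimal sketch of this value-weighted ranking under a contact budget. All numbers, the column names `uplift_hat` and `clv`, and the flat `cost_per_contact` are made-up assumptions for illustration.

```python
import pandas as pd

# Toy customers (assumed values): predicted churn reduction and customer value.
customers = pd.DataFrame(
    {
        "uplift_hat": [0.10, 0.02, 0.15, 0.05],
        "clv": [1000.0, 4000.0, 200.0, 3000.0],
    }
)
customers["incremental_value"] = customers["uplift_hat"] * customers["clv"]

budget = 10.0
cost_per_contact = 5.0
k = int(budget // cost_per_contact)  # number of contacts the budget allows

# Keep only customers whose expected gain exceeds the contact cost,
# then take the k most valuable ones.
targets = (
    customers[customers["incremental_value"] > cost_per_contact]
    .sort_values("incremental_value", ascending=False)
    .head(k)
)
target_ids = targets.index.tolist()
print(target_ids)
```

In this toy example the customer with the highest predicted uplift (index 2) is not selected, because their low value makes the expected incremental gain small.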

The key mindset shift is:

> Do not target everyone with high churn risk. Target those where
> the **treatment actually changes the outcome** in a meaningful way.


## 8. Limitations and possible extensions

This notebook uses a **simulated experiment**, which is great for learning but
comes with limitations:

- The true data-generating process is **hand-crafted**, not from a real RCT.
- We used a simple two-model uplift approach (also known as the T-learner); in
  practice you might use:
  - Other meta-learners (S-learner, X-learner).
  - Dedicated uplift models (e.g. uplift random forests, causal forests).
- We focused on a single outcome (churn). Real campaigns might track multiple
  KPIs (activation, product uptake, net revenue, etc.).
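
As one example of the meta-learners mentioned above, an S-learner fits a single model with the treatment flag as an extra feature, then scores each customer twice: once with the flag set to 0 and once with it set to 1. The following is a minimal sketch on synthetic data. The data-generating numbers and the choice of `GradientBoostingClassifier` are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 4000

# Synthetic RCT (assumed): two features, random 50/50 treatment assignment.
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)

# Churn probability is 0.5 without treatment; the treatment helps much more
# when the first feature is positive.
p_control_toy = np.full(n, 0.5)
p_treated_toy = np.where(x[:, 0] > 0, 0.10, 0.45)
y = rng.binomial(1, np.where(t == 1, p_treated_toy, p_control_toy))

# S-learner: one model, with the treatment flag appended as a feature.
s_model = GradientBoostingClassifier(random_state=0)
s_model.fit(np.column_stack([x, t]), y)

# Score every row under both counterfactual treatment values.
p_if_control = s_model.predict_proba(np.column_stack([x, np.zeros(n)]))[:, 1]
p_if_treated = s_model.predict_proba(np.column_stack([x, np.ones(n)]))[:, 1]
uplift_s = p_if_control - p_if_treated  # expected churn reduction

print("Mean S-learner uplift:", uplift_s.mean())
```

A known trade-off is that the single model may underuse the treatment flag when the effect is small relative to other signals, which is one reason the two-model approach is often preferred as a first baseline.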

Potential next steps:

1. Add **costs** and **budget constraints** to derive optimal targeting rules.
2. Integrate uplift with **CLV** to maximise **incremental LTV**.
3. Use libraries for **causal inference** / **uplift modelling** for more
   advanced estimators.
4. Simulate **non-randomised** policies and use causal methods to correct bias.

Even with these simplifications, this project demonstrates how to:

- Frame a churn retention campaign as a **treatment effect** problem.
- Build an **uplift model** to prioritise customers.
- Evaluate uplift in a way that reflects **incremental impact** rather than
  just classification accuracy.
