
# Modern A/B Testing Playbook — From Classical to Advanced Methods

This notebook is a **teaching and reference resource** for A/B testing. It moves from
basic statistical tests to the more modern techniques used in large-scale experimentation.

We cover:

1. **Core frequentist tools**
   - Proportion z-test, t-test on means, chi-square.
   - Confidence intervals, effect sizes, MDE / power.
2. **Permutation / randomization tests** for non-parametric inference.
3. **Regression-based A/B analysis**
   - Logistic regression, linear regression, covariate adjustment.
4. **Variance reduction**
   - CUPED (pre-period covariate) and regression adjustment.
5. **Sequential testing and peeking**
   - Type I error inflation.
   - Simple alpha-spending illustration.
6. **Bayesian A/B testing**
   - Beta–Binomial model for conversion.
   - Posterior decision metrics (P(B > A), expected loss).
7. **Multi-armed bandits (brief)**
   - Thompson sampling for Bernoulli arms, regret curves (simulation).
8. **Causal / observational setting**
   - IPW and doubly-robust (DR) estimation under confounding.
9. **Treatment effect heterogeneity and uplift modeling**
   - Logistic regression with a treatment interaction, T-learner, realized uplift by predicted-uplift bin.

All examples use **simulated data** with fixed seeds, so every section is reproducible.
To analyse your own experiment, map your dataset to the same column structure
(`user_id`, `group`, `converted`, `revenue`, `pre_activity`).
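
As a minimal sketch of that mapping, assume a hypothetical raw export with columns named `uid`, `variant`, `purchased`, `order_value`, and `sessions_prev_30d`. These names, and the helper `to_notebook_schema`, are illustrative only and do not come from any real schema. A rename plus a few type coercions is usually enough:

```python
from typing import Mapping

import pandas as pd

# Hypothetical raw column names -> column names used throughout this notebook.
COLUMN_MAP = {
    "uid": "user_id",
    "variant": "group",
    "purchased": "converted",
    "order_value": "revenue",
    "sessions_prev_30d": "pre_activity",
}


def to_notebook_schema(raw: pd.DataFrame, arm_labels: Mapping[str, str]) -> pd.DataFrame:
    """Rename columns and coerce dtypes to match the simulated dataset."""
    missing = set(COLUMN_MAP) - set(raw.columns)
    if missing:
        raise KeyError(f"Missing expected columns: {sorted(missing)}")
    out = raw.rename(columns=COLUMN_MAP)[list(COLUMN_MAP.values())].copy()
    # Map arbitrary arm labels (e.g. "A"/"B") onto "control"/"treatment".
    out["group"] = out["group"].map(dict(arm_labels))
    if out["group"].isna().any():
        raise ValueError("Found arm labels not covered by arm_labels.")
    out["converted"] = out["converted"].astype(int)
    out["revenue"] = out["revenue"].astype(float)
    out["pre_activity"] = out["pre_activity"].astype(float)
    return out


# Tiny made-up example, only to show the call shape.
raw_example = pd.DataFrame(
    {
        "uid": [1, 2, 3],
        "variant": ["A", "B", "A"],
        "purchased": [True, False, True],
        "order_value": [30, 0, 45],
        "sessions_prev_30d": [4, 1, 7],
    }
)
df_own = to_notebook_schema(raw_example, arm_labels={"A": "control", "B": "treatment"})
```

The result is bound to `df_own` rather than `df` so that it does not overwrite the simulated dataset used in later cells.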


## 0) Setup

In [None]:

from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (7, 4.5)
plt.rcParams["axes.grid"] = True

try:
    import statsmodels.api as sm  # type: ignore
except Exception as e:  # pragma: no cover
    sm = None
    print("statsmodels not available; some regression sections will be skipped:", e)

try:
    from sklearn.linear_model import LogisticRegression  # type: ignore
except Exception as e:  # pragma: no cover
    LogisticRegression = None
    print("scikit-learn not available; logistic sections will be skipped:", e)



## 1) Simulated experimental dataset

We simulate a simple two-arm experiment with the following columns:

- Binary outcome: `converted` (0/1).
- Continuous metric: `revenue` (0 if not converted).
- Treatment: `group ∈ {control, treatment}`.
- Covariate: `pre_activity` (pre-period proxy).


In [None]:

def simulate_experiment(
    n: int = 20_000,
    p_control: float = 0.10,
    lift_treatment: float = 0.02,
    mean_revenue: float = 100.0,
    revenue_sd: float = 40.0,
    seed: int | None = 123,
) -> pd.DataFrame:
    """Simulate a simple A/B test dataset."""
    rng = np.random.default_rng(seed)

    user_id = np.arange(n)
    group_flag = rng.binomial(1, 0.5, size=n)
    group = np.where(group_flag == 0, "control", "treatment")

    p_treat = p_control + lift_treatment
    p = np.where(group_flag == 0, p_control, p_treat)
    converted = rng.binomial(1, p)

    rev = rng.normal(loc=mean_revenue, scale=revenue_sd, size=n)
    rev = np.where(converted == 1, rev, 0.0)

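    # NOTE: simulation shortcut. Adding `converted` makes the covariate correlate with the
    # outcome, but it also ties the covariate to a post-assignment quantity. A real CUPED
    # covariate must be measured before assignment so that treatment cannot affect it.
    pre_activity = rng.normal(loc=0.0, scale=1.0, size=n) + converted * 0.7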

    df = pd.DataFrame(
        {
            "user_id": user_id,
            "group": group,
            "converted": converted.astype(int),
            "revenue": rev.astype(float),
            "pre_activity": pre_activity.astype(float),
        }
    )
    return df


df = simulate_experiment()
df.head()


## 2) Core frequentist tools — conversion

In [None]:

@dataclass(frozen=True)
class PropSummary:
    """Summary of a Bernoulli proportion."""
    p: float
    n: int
    x: int


def summarize_prop(x: int, n: int) -> PropSummary:
    """Validate and summarize a proportion sample."""
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must satisfy 0 <= x <= n.")
    return PropSummary(p=x / n, n=n, x=x)


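def invPhi(u: float) -> float:
    """Inverse standard normal CDF (quantile function)."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must be in (0,1).")
    # The stdlib `math` module has no erfcinv; NormalDist provides the quantile directly.
    from statistics import NormalDist
    return NormalDist().inv_cdf(u)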


def two_prop_ztest(
    x1: int,
    n1: int,
    x2: int,
    n2: int,
    two_sided: bool = True,
) -> Tuple[float, float]:
    """Two-sample z-test for proportions with pooled variance."""
    s1, s2 = summarize_prop(x1, n1), summarize_prop(x2, n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / s1.n + 1.0 / s2.n))
    if se == 0.0:
        raise ZeroDivisionError("Standard error is zero; check inputs.")
    z = (s2.p - s1.p) / se
    if two_sided:
        p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    else:
        p = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return float(z), float(p)


In [None]:

conv_by_group = (
    df.groupby("group")["converted"]
      .agg(["sum", "count", "mean"])
      .rename(columns={"sum": "x", "count": "n", "mean": "rate"})
)
conv_by_group


In [None]:

xA = int(conv_by_group.loc["control", "x"])
nA = int(conv_by_group.loc["control", "n"])
xB = int(conv_by_group.loc["treatment", "x"])
nB = int(conv_by_group.loc["treatment", "n"])

sA = summarize_prop(xA, nA)
sB = summarize_prop(xB, nB)

z_conv, p_conv = two_prop_ztest(xA, nA, xB, nB, two_sided=True)

alpha = 0.05
z_alpha = abs(invPhi(1.0 - alpha / 2.0))
diff_conv = sB.p - sA.p
se_diff_conv = math.sqrt(
    (sA.p * (1.0 - sA.p)) / sA.n + (sB.p * (1.0 - sB.p)) / sB.n
)
ci_lo_conv = diff_conv - z_alpha * se_diff_conv
ci_hi_conv = diff_conv + z_alpha * se_diff_conv

pd.DataFrame(
    {
        "arm": ["control", "treatment"],
        "n": [sA.n, sB.n],
        "x": [sA.x, sB.x],
        "rate": [sA.p, sB.p],
        "diff_B_minus_A": [diff_conv, diff_conv],
        "diff_CI95_lo": [ci_lo_conv, ci_lo_conv],
        "diff_CI95_hi": [ci_hi_conv, ci_hi_conv],
        "z_stat": [z_conv, z_conv],
        "p_value": [p_conv, p_conv],
    }
)
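
The overview lists MDE and power among the core tools. Below is a minimal sketch of the standard normal-approximation formulas, assuming equal allocation between arms and a two-sided z-test. The helper names `sample_size_two_props` and `approx_mde` are illustrative and not part of the code above.

```python
import math
from statistics import NormalDist


def sample_size_two_props(
    p_base: float, mde_abs: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate per-arm n needed to detect an absolute lift `mde_abs`."""
    p_alt = p_base + mde_abs
    if not (0.0 < p_base < 1.0 and 0.0 < p_alt < 1.0) or mde_abs == 0.0:
        raise ValueError("Rates must lie in (0, 1) and mde_abs must be non-zero.")
    p_bar = 0.5 * (p_base + p_alt)
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value under H0
    z_b = NormalDist().inv_cdf(power)              # quantile for the target power
    num = (
        z_a * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_b * math.sqrt(p_base * (1.0 - p_base) + p_alt * (1.0 - p_alt))
    ) ** 2
    return math.ceil(num / mde_abs**2)


def approx_mde(
    p_base: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Rough absolute MDE for a given per-arm n, using the baseline variance in both arms."""
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) * math.sqrt(2.0 * p_base * (1.0 - p_base) / n_per_arm)


# Defaults of simulate_experiment: 10% baseline, +2 points lift, about 10k users per arm.
n_needed = sample_size_two_props(p_base=0.10, mde_abs=0.02)
mde_10k = approx_mde(p_base=0.10, n_per_arm=10_000)
```

With these inputs the required sample per arm is well below the roughly 10,000 users per arm in the simulated dataset, which is consistent with the small p-value expected from the z-test above.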


## 3) t-test on revenue per user

In [None]:

def welch_ttest(
    x: np.ndarray,
    y: np.ndarray,
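) -> Tuple[float, float]:
    """Welch-style test for a difference in means (two-sided).

    The statistic uses unpooled variances. The p-value comes from a normal
    approximation to the t distribution, which is accurate for the large
    samples typical of A/B tests but not for small samples.
    """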
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = x.size, y.size
    if n1 < 2 or n2 < 2:
        raise ValueError("Need at least 2 observations per group.")
    m1, m2 = float(x.mean()), float(y.mean())
    v1, v2 = float(x.var(ddof=1)), float(y.var(ddof=1))
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0.0:
        raise ZeroDivisionError("Standard error is zero; check variance.")
    t_stat = (m2 - m1) / se
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t_stat) / math.sqrt(2.0))))
    return float(t_stat), float(p)


In [None]:

rev_control = df.loc[df["group"] == "control", "revenue"].to_numpy()
rev_treat = df.loc[df["group"] == "treatment", "revenue"].to_numpy()

t_rev, p_rev = welch_ttest(rev_control, rev_treat)

mean_rev_A = float(rev_control.mean())
mean_rev_B = float(rev_treat.mean())
diff_rev = mean_rev_B - mean_rev_A

varA = float(rev_control.var(ddof=1))
varB = float(rev_treat.var(ddof=1))
se_diff_rev = math.sqrt(varA / rev_control.size + varB / rev_treat.size)
ci_lo_rev = diff_rev - z_alpha * se_diff_rev
ci_hi_rev = diff_rev + z_alpha * se_diff_rev

pd.DataFrame(
    {
        "arm": ["control", "treatment"],
        "mean_revenue": [mean_rev_A, mean_rev_B],
        "diff_B_minus_A": [diff_rev, diff_rev],
        "diff_CI95_lo": [ci_lo_rev, ci_lo_rev],
        "diff_CI95_hi": [ci_hi_rev, ci_hi_rev],
        "t_stat": [t_rev, t_rev],
        "p_value": [p_rev, p_rev],
    }
)
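
`welch_ttest` above takes its p-value from the standard normal distribution. The exact Welch reference distribution is a t distribution with Welch-Satterthwaite degrees of freedom. Here is a small sketch of that degrees-of-freedom formula; the name `welch_df` is illustrative:

```python
def welch_df(v1: float, n1: int, v2: float, n2: int) -> float:
    """Welch-Satterthwaite degrees of freedom from sample variances and sizes."""
    if n1 < 2 or n2 < 2:
        raise ValueError("Need at least 2 observations per group.")
    a, b = v1 / n1, v2 / n2  # per-group variance of the sample mean
    if a + b == 0.0:
        raise ZeroDivisionError("Both variances are zero.")
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


# Sanity check: equal variances and equal sizes give 2 * (n - 1).
df_small = welch_df(4.0, 10, 4.0, 10)
df_large = welch_df(1200.0, 10_000, 1350.0, 10_000)
```

With thousands of users per arm the degrees of freedom are in the thousands, where the t and normal distributions are practically indistinguishable. For small samples the t reference distribution should be used instead.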


## 4) Permutation test for conversion

In [None]:

def permutation_test_diff_mean(
    y: np.ndarray,
    group_labels: np.ndarray,
    n_perm: int = 5000,
    seed: int | None = 1,
) -> float:
    """Permutation test for difference in means between two groups."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    g = np.asarray(group_labels)
    groups = np.unique(g)
    if groups.size != 2:
        raise ValueError("Need exactly two groups.")
    def diff_means(vals: np.ndarray, labels: np.ndarray) -> float:
        m0 = float(vals[labels == groups[0]].mean())
        m1 = float(vals[labels == groups[1]].mean())
        return m1 - m0
    observed = diff_means(y, g)
    diffs = np.empty(n_perm, dtype=float)
    for i in range(n_perm):
        perm_labels = rng.permutation(g)
        diffs[i] = diff_means(y, perm_labels)
    p_perm = float((np.abs(diffs) >= abs(observed)).mean())
    plt.figure()
    plt.hist(diffs, bins=40, density=True)
    plt.axvline(observed, linestyle="--", label="observed")
    plt.legend()
    plt.title("Permutation null distribution")
    plt.tight_layout()
    plt.show()
    return p_perm


p_perm_conv = permutation_test_diff_mean(
    df["converted"].to_numpy(dtype=float),
    df["group"].to_numpy(),
    n_perm=3000,
    seed=123,
)
p_perm_conv


## 5) Variance reduction — CUPED

In [None]:

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> Tuple[np.ndarray, float]:
    """Apply CUPED adjustment Y* = Y - θ (X - mean(X))."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if y.shape != x.shape:
        raise ValueError("y and x must have same shape.")
    vx = float(np.var(x, ddof=1))
    if vx == 0.0:
        return y.copy(), 0.0
    cov_yx = float(np.cov(y, x, ddof=1)[0, 1])
    theta = cov_yx / vx
    x_centered = x - float(np.mean(x))
    y_adj = y - theta * x_centered
    return y_adj, theta


y_rev = df["revenue"].to_numpy(dtype=float)
x_pre = df["pre_activity"].to_numpy(dtype=float)
y_rev_cuped, theta_hat = cuped_adjust(y_rev, x_pre)
df["revenue_cuped"] = y_rev_cuped
theta_hat


In [None]:

cuped_summary = (
    df.assign(revenue_raw=df["revenue"].astype(float))
      .groupby("group")[["revenue_raw", "revenue_cuped"]]
      .agg(["mean", "var", "count"])
)
cuped_summary
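
The size of the variance reduction follows from the choice of θ: with θ = Cov(Y, X) / Var(X), the adjusted metric satisfies Var(Y*) = Var(Y) · (1 − ρ²), where ρ is the correlation between Y and X. The sketch below checks that identity on made-up data using only the standard library. The variable names and the data-generating numbers are illustrative assumptions.

```python
import math
import random
import statistics

demo_rng = random.Random(0)
x_demo = [demo_rng.gauss(0.0, 1.0) for _ in range(5000)]              # pre-period covariate
y_demo = [2.0 + 0.8 * xi + demo_rng.gauss(0.0, 1.0) for xi in x_demo]  # outcome metric

mean_x = statistics.fmean(x_demo)
mean_y = statistics.fmean(y_demo)
var_x = statistics.variance(x_demo)
var_y = statistics.variance(y_demo)
cov_yx = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x_demo, y_demo)) / (
    len(x_demo) - 1
)

theta_demo = cov_yx / var_x
y_adj_demo = [yi - theta_demo * (xi - mean_x) for xi, yi in zip(x_demo, y_demo)]

rho = cov_yx / math.sqrt(var_x * var_y)
var_ratio = statistics.variance(y_adj_demo) / var_y  # equals 1 - rho**2 up to rounding
```

The adjustment leaves the overall mean unchanged because the covariate is centered, so only the variance changes.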


## 6) Bayesian A/B — Beta–Binomial

In [None]:

def sample_posterior_lift(
    xA: int,
    nA: int,
    xB: int,
    nB: int,
    alpha0: float = 1.0,
    beta0: float = 1.0,
    n_draws: int = 50_000,
    seed: int | None = 1,
) -> pd.DataFrame:
    """Draw from the posterior of conversion rates and their difference."""
    rng = np.random.default_rng(seed)
    alphaA = alpha0 + xA
    betaA = beta0 + nA - xA
    alphaB = alpha0 + xB
    betaB = beta0 + nB - xB
    pA = rng.beta(alphaA, betaA, size=n_draws)
    pB = rng.beta(alphaB, betaB, size=n_draws)
    lift = pB - pA
    return pd.DataFrame({"pA": pA, "pB": pB, "lift": lift})


post = sample_posterior_lift(xA, nA, xB, nB, n_draws=50_000, seed=2025)
prob_better = float((post["lift"] > 0).mean())
ci_lo, ci_hi = np.quantile(post["lift"], [0.025, 0.975])

plt.figure()
plt.hist(post["lift"], bins=50, density=True)
plt.axvline(0.0, linestyle="--")
plt.title("Posterior lift distribution (p_treat - p_control)")
plt.tight_layout()
plt.show()

{
    "posterior_prob_treatment_better": prob_better,
    "lift_cred_int_95": (ci_lo, ci_hi),
}
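
The overview also mentions **expected loss** as a posterior decision metric. One common definition is the posterior mean of the conversion rate you give up if the arm you ship turns out to be the worse one. This is a minimal standard-library sketch under the same Beta–Binomial model. The function name and the counts in the example call are hypothetical.

```python
import random
from typing import Tuple


def expected_loss(
    x_a: int,
    n_a: int,
    x_b: int,
    n_b: int,
    alpha0: float = 1.0,
    beta0: float = 1.0,
    n_draws: int = 20_000,
    seed: int = 1,
) -> Tuple[float, float]:
    """Monte Carlo expected loss of shipping A and of shipping B."""
    rng = random.Random(seed)
    loss_a = 0.0
    loss_b = 0.0
    for _ in range(n_draws):
        p_a = rng.betavariate(alpha0 + x_a, beta0 + n_a - x_a)
        p_b = rng.betavariate(alpha0 + x_b, beta0 + n_b - x_b)
        loss_a += max(p_b - p_a, 0.0)  # regret if we ship A but B is better
        loss_b += max(p_a - p_b, 0.0)  # regret if we ship B but A is better
    return loss_a / n_draws, loss_b / n_draws


# Hypothetical counts: 1000/10000 conversions in control, 1200/10000 in treatment.
loss_ship_a, loss_ship_b = expected_loss(1000, 10_000, 1200, 10_000)
```

A typical decision rule ships an arm once its expected loss falls below a tolerance chosen in advance. The right tolerance depends on the business context and is not prescribed here.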


## 7) Sequential testing and peeking (A/A simulation)

In [None]:

def simulate_null_stream(
    n: int = 10_000,
    p: float = 0.10,
    seed: int | None = None,
) -> pd.DataFrame:
    """Simulate an A/A test (no true effect) with Bernoulli outcomes."""
    rng = np.random.default_rng(seed)
    group_flag = rng.binomial(1, 0.5, size=n)
    group = np.where(group_flag == 0, "control", "treatment")
    y = rng.binomial(1, p, size=n)
    return pd.DataFrame({"group": group, "converted": y})


def peek_pvalues(df_stream: pd.DataFrame, look_step: int = 500) -> Tuple[list[int], list[float]]:
    """Compute p-values over time by looking every 'look_step' users."""
    look_indices: list[int] = []
    p_values: list[float] = []
    for n_curr in range(look_step, len(df_stream) + 1, look_step):
        sub = df_stream.iloc[:n_curr]
        tbl = (
            sub.groupby("group")["converted"]
               .agg(["sum", "count"])
               .rename(columns={"sum": "x", "count": "n"})
        )
        xA, nA = int(tbl.loc["control", "x"]), int(tbl.loc["control", "n"])
        xB, nB = int(tbl.loc["treatment", "x"]), int(tbl.loc["treatment", "n"])
        _, p = two_prop_ztest(xA, nA, xB, nB, two_sided=True)
        look_indices.append(n_curr)
        p_values.append(p)
    return look_indices, p_values


def simulate_type1_inflation(
    n_experiments: int = 200,
    n: int = 8_000,
    p: float = 0.10,
    look_step: int = 400,
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Compare Type I error under fixed-horizon vs naive sequential peeking."""
    rng = np.random.default_rng(2025)
    reject_fixed = 0
    reject_seq = 0
    for _ in range(n_experiments):
        df_stream = simulate_null_stream(n=n, p=p, seed=int(rng.integers(0, 10_000_000)))
        tbl = (
            df_stream.groupby("group")["converted"]
                     .agg(["sum", "count"])
                     .rename(columns={"sum": "x", "count": "n"})
        )
        xA, nA = int(tbl.loc["control", "x"]), int(tbl.loc["control", "n"])
        xB, nB = int(tbl.loc["treatment", "x"]), int(tbl.loc["treatment", "n"])
        _, p_fix = two_prop_ztest(xA, nA, xB, nB, two_sided=True)
        if p_fix < alpha:
            reject_fixed += 1
        looks, pvals = peek_pvalues(df_stream, look_step=look_step)
        if any(pv < alpha for pv in pvals):
            reject_seq += 1
    return reject_fixed / n_experiments, reject_seq / n_experiments


fixed_alpha, seq_alpha = simulate_type1_inflation()
{"fixed_horizon_alpha_hat": fixed_alpha, "naive_seq_alpha_hat": seq_alpha}
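
One remedy for the inflation shown above is **alpha spending**: decide in advance how much of the total α may be used by each look. The sketch below computes cumulative α for two widely used Lan-DeMets spending functions and converts them into per-look increments. Comparing each look's p-value with its increment is a deliberately conservative union-bound rule. Proper group-sequential boundaries account for the correlation between looks and are less strict. Function names are illustrative.

```python
import math
from statistics import NormalDist
from typing import Callable, List

_STD_NORMAL = NormalDist()


def obf_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: cumulative alpha at information fraction t."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1].")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: cumulative alpha at information fraction t."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must be in (0, 1].")
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


def per_look_alpha(
    n_looks: int,
    spending: Callable[[float, float], float] = obf_spending,
    alpha: float = 0.05,
) -> List[float]:
    """Alpha increment available at each of `n_looks` equally spaced looks."""
    cum = [spending(k / n_looks, alpha) for k in range(1, n_looks + 1)]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, n_looks)]


# 20 looks matches n=8_000 with look_step=400 in simulate_type1_inflation.
obf_increments = per_look_alpha(20, obf_spending)
pocock_increments = per_look_alpha(20, pocock_spending)
```

The O'Brien-Fleming-type function spends almost nothing at early looks and keeps most of α for the end, while the Pocock-type function spends more evenly.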


## 8) Causal / observational setting — IPW and DR (simulation)

In [None]:

if LogisticRegression is None:
    print("sklearn not available; skipping IPW/DR demo.")
else:
    def sigmoid(z: np.ndarray | float) -> np.ndarray | float:
        return 1.0 / (1.0 + np.exp(-z))

    def simulate_confounded(
        n: int = 20_000,
        seed: int | None = 123,
    ) -> pd.DataFrame:
        rng = np.random.default_rng(seed)
        d = 4
        X = rng.normal(size=(n, d))
        w_e = np.array([0.8, -0.6, 0.4, 0.2])
        bias_e = -0.1
        e = sigmoid(bias_e + X @ w_e)
        T = rng.binomial(1, e)
        w_y = np.array([0.5, 0.3, -0.2, 0.1])
        bias_y = -1.2
        tau_true = 0.3
        lin = bias_y + X @ w_y + tau_true * T
        p = sigmoid(lin)
        Y = rng.binomial(1, p)
        cols = {f"x{j+1}": X[:, j] for j in range(d)}
        df_sim = pd.DataFrame(cols)
        df_sim["T"] = T
        df_sim["Y"] = Y
        df_sim["e_true"] = e
        return df_sim

    def naive_ate(df_sim: pd.DataFrame) -> float:
        m1 = float(df_sim.loc[df_sim["T"] == 1, "Y"].mean())
        m0 = float(df_sim.loc[df_sim["T"] == 0, "Y"].mean())
        return m1 - m0

    def ipw_ate(df_sim: pd.DataFrame, e_col: str = "e_true") -> float:
        t = df_sim["T"].to_numpy()
        y = df_sim["Y"].to_numpy()
        e = np.clip(df_sim[e_col].to_numpy(), 1e-6, 1.0 - 1e-6)
        w1 = t / e
        w0 = (1.0 - t) / (1.0 - e)
        p1_hat = (w1 * y).sum() / max(w1.sum(), 1e-12)
        p0_hat = (w0 * y).sum() / max(w0.sum(), 1e-12)
        return float(p1_hat - p0_hat)

    def dr_logistic_ate(df_sim: pd.DataFrame, e_col: str = "e_true") -> float:
        feature_cols = [c for c in df_sim.columns if c.startswith("x")]
        X = df_sim[feature_cols].to_numpy()
        t = df_sim["T"].to_numpy()
        y = df_sim["Y"].to_numpy()
        e = np.clip(df_sim[e_col].to_numpy(), 1e-6, 1.0 - 1e-6)
        X1 = X[t == 1]
        y1 = y[t == 1]
        X0 = X[t == 0]
        y0 = y[t == 0]
        mdl1 = LogisticRegression(max_iter=1000).fit(X1, y1)
        mdl0 = LogisticRegression(max_iter=1000).fit(X0, y0)
        m1_hat = mdl1.predict_proba(X)[:, 1]
        m0_hat = mdl0.predict_proba(X)[:, 1]
        term = (m1_hat - m0_hat) + (t * (y - m1_hat) / e) - ((1.0 - t) * (y - m0_hat) / (1.0 - e))
        return float(np.mean(term))

    df_sim = simulate_confounded(n=30_000, seed=2025)
    ate_naive = naive_ate(df_sim)
    ate_ipw_true = ipw_ate(df_sim, e_col="e_true")
    ate_dr = dr_logistic_ate(df_sim, e_col="e_true")
    # A bare expression inside an if/else block is not displayed by the notebook, so print it.
    print({"ATE_naive": ate_naive, "ATE_IPW_true": ate_ipw_true, "ATE_DR_logistic": ate_dr})
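
For reference, the two estimators coded above can be written out as follows, with \(e_i\) the propensity score of user \(i\) and \(\hat m_t\) the outcome model fitted on arm \(t\). This restates `ipw_ate` (the normalized, or Hájek, form) and `dr_logistic_ate` (the augmented IPW form). It is not an additional method.

\[
\hat\tau_{\mathrm{IPW}} =
\frac{\sum_i T_i Y_i / e_i}{\sum_i T_i / e_i}
- \frac{\sum_i (1 - T_i) Y_i / (1 - e_i)}{\sum_i (1 - T_i) / (1 - e_i)}
\]

\[
\hat\tau_{\mathrm{DR}} =
\frac{1}{n} \sum_i \left[
\hat m_1(X_i) - \hat m_0(X_i)
+ \frac{T_i \,(Y_i - \hat m_1(X_i))}{e_i}
- \frac{(1 - T_i)\,(Y_i - \hat m_0(X_i))}{1 - e_i}
\right]
\]

Note that `tau_true = 0.3` in `simulate_confounded` is a shift on the log-odds scale, so both estimators target the corresponding difference in conversion probabilities rather than 0.3 itself.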



## 9) Treatment effect heterogeneity and uplift-style modeling

So far we have mostly focused on **average treatment effects** (ATE):

- A single number summarising the effect of treatment across all users.

In practice, the effect of a new feature or price often **varies by user segment**. For example:

- Highly engaged users might respond very differently from cold users.  
- Discount-sensitive users might show larger uplift from a promotion.

Here we build a simple **heterogeneous treatment effect (HTE)** model using:

- A logistic regression for `converted`.  
- An interaction between treatment and a covariate (`pre_activity`).  
- A derived **uplift curve**: predicted \(P(Y=1 \mid T=1, X=x) - P(Y=1 \mid T=0, X=x)\).


In [None]:

if sm is None:
    print("statsmodels is not available; skipping heterogeneity / uplift section.")
else:
    # Prepare data with treatment indicator and interaction
    df_het = df.copy()
    df_het["treat_flag"] = (df_het["group"] == "treatment").astype(int)
    df_het["interaction"] = df_het["treat_flag"] * df_het["pre_activity"]

    # Design matrix: intercept + main effects + interaction
    X = df_het[["treat_flag", "pre_activity", "interaction"]]
    X_sm = sm.add_constant(X)
    y = df_het["converted"].to_numpy()

    logit_het = sm.Logit(y, X_sm).fit(disp=False)
    print(logit_het.summary())



The key coefficient here is the **interaction term** on `treat_flag × pre_activity`:

- If it is **positive**, the treatment effect grows as `pre_activity` increases.  
- If it is **negative**, the treatment effect shrinks (or even flips) for high `pre_activity`.  

Logistic coefficients are on the **log-odds** scale. To make this easier to interpret, we look at
predicted **conversion probabilities** and **uplift** as a function of `pre_activity`.


In [None]:

if sm is not None:
    # Build a grid of pre-activity values (central 98% range to avoid extreme tails)
    pre_min = float(df_het["pre_activity"].quantile(0.01))
    pre_max = float(df_het["pre_activity"].quantile(0.99))
    grid = np.linspace(pre_min, pre_max, 100)

    # Construct design matrices for control vs treatment at each grid point
    X_ctrl = pd.DataFrame(
        {
            "const": 1.0,
            "treat_flag": np.zeros_like(grid),
            "pre_activity": grid,
            "interaction": np.zeros_like(grid),  # 0 * pre_activity
        }
    )
    X_treat = pd.DataFrame(
        {
            "const": 1.0,
            "treat_flag": np.ones_like(grid),
            "pre_activity": grid,
            "interaction": grid,  # 1 * pre_activity
        }
    )

    # Predicted conversion probabilities
    p_ctrl = logit_het.predict(X_ctrl)
    p_treat = logit_het.predict(X_treat)
    uplift = p_treat - p_ctrl

    uplift_df = pd.DataFrame(
        {
            "pre_activity": grid,
            "p_control": p_ctrl,
            "p_treatment": p_treat,
            "uplift": uplift,
        }
    )
    display(uplift_df.head())


In [None]:

if sm is not None:
    # Plot uplift as a function of pre_activity
    plt.figure()
    plt.plot(uplift_df["pre_activity"], uplift_df["uplift"])
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("pre_activity")
    plt.ylabel("uplift = P(Y=1 | T=1) - P(Y=1 | T=0)")
    plt.title("Predicted treatment uplift vs pre-activity")
    plt.tight_layout()
    plt.show()



This curve is a simple **uplift function**: it tells you how much you expect conversion to change
for a user with a given `pre_activity` if they receive treatment instead of control.

- Regions where uplift is **large and positive** are good candidates for **targeting** the treatment.  
- Regions where uplift is **near zero** or negative suggest that treatment is not helpful (or even harmful).


In [None]:

if sm is not None:
    # Derive coarse uplift segments by pre-activity quantiles
    df_seg = df_het.copy()
    df_seg["pre_segment"] = pd.qcut(df_seg["pre_activity"], 3, labels=["low", "mid", "high"])

    seg_summary = (
        df_seg.groupby(["pre_segment", "group"], observed=True)["converted"]
              .agg(["mean", "count"])
              .rename(columns={"mean": "conversion_rate"})
              .reset_index()
    )

    # Also compute segment-level uplift
    seg_pivot = (
        seg_summary.pivot(index="pre_segment", columns="group", values="conversion_rate")
    )
    seg_pivot["uplift_treat_minus_control"] = (
        seg_pivot["treatment"] - seg_pivot["control"]
    )
    display(seg_pivot)



### How this relates to uplift modeling

The model we used is a **single logistic regression** with an interaction term:

\[
\log \frac{P(Y=1 \mid T, X)}{1 - P(Y=1 \mid T, X)} =
\beta_0 + \beta_1 T + \beta_2 X + \beta_3 (T \cdot X).
\]

This is equivalent to:

- One base log-odds curve in \(X\) (through \(\beta_2\)).  
- A treatment effect that **depends on X** (through \(\beta_1 + \beta_3 X\)).

More advanced uplift approaches extend this idea:

- Two-model approach (the T-learner): one model for treated, one for control, and subtract predictions.  
- Direct uplift models and other meta-learners (e.g. interaction trees, causal forests, the S-learner).  
- Regularisation and non-linear features to capture complex patterns.

Even so, this simple logistic model with an interaction term is a practical first step
for exploring treatment heterogeneity in many real-world A/B tests.



## 10) Uplift modeling with a T-learner

In the previous section we used a **single logistic regression with an interaction term**
to model treatment effect heterogeneity.

Another popular approach is the **T-learner**:

- Fit one model for the treated group: \(m_1(x) = \mathbb{E}[Y \mid T=1, X=x]\).  
- Fit another model for the control group: \(m_0(x) = \mathbb{E}[Y \mid T=0, X=x]\).  
- Define the **uplift function** as:

\[
\tau(x) = m_1(x) - m_0(x),
\]

which estimates how much the treatment changes the outcome probability for a user with features \(X=x\).

In a randomized A/B test this is a way to **learn heterogeneity** in a flexible manner, and the same idea
extends to more complex learners (trees, forests, boosted models).


In [None]:

if LogisticRegression is None:
    print("scikit-learn not available; skipping T-learner section.")
else:
    from sklearn.model_selection import train_test_split

    def train_t_learner_logistic(
        df_in: pd.DataFrame,
        feature_cols: list[str],
        target_col: str = "converted",
        group_col: str = "group",
        treat_label: str = "treatment",
        test_size: float = 0.3,
        seed: int | None = 2025,
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        """Train a simple T-learner with logistic regression on a train/holdout split.

        Parameters
        ----------
        df_in : DataFrame
            Input experiment data with treatment, outcome, and features.
        feature_cols : list[str]
            Names of columns used as features X.
        target_col : str
            Name of binary outcome column.
        group_col : str
            Name of treatment group column (e.g., "group").
        treat_label : str
            Label for treated users in `group_col`, others are considered control.
        test_size : float
            Fraction of data reserved for evaluation.
        seed : int | None
            Random seed for splitting.

        Returns
        -------
        df_train, df_test : DataFrame
            Train and test splits with all original columns.
        """
        df_local = df_in.copy()
        # Binary treatment indicator
        df_local["treat_flag"] = (df_local[group_col] == treat_label).astype(int)

        # Train/test split
        df_train, df_test = train_test_split(
            df_local, test_size=test_size, random_state=seed, stratify=df_local["treat_flag"]
        )

        # Fit separate logistic models in train set
        df_train_t = df_train[df_train["treat_flag"] == 1]
        df_train_c = df_train[df_train["treat_flag"] == 0]

        X_t = df_train_t[feature_cols].to_numpy()
        y_t = df_train_t[target_col].to_numpy()

        X_c = df_train_c[feature_cols].to_numpy()
        y_c = df_train_c[target_col].to_numpy()

        mdl_t = LogisticRegression(max_iter=1000)
        mdl_c = LogisticRegression(max_iter=1000)

        mdl_t.fit(X_t, y_t)
        mdl_c.fit(X_c, y_c)

        # Store fitted models and feature list on df_train for reference (not used directly)
        df_train.attrs["t_learner_models"] = {
            "treated_model": mdl_t,
            "control_model": mdl_c,
            "feature_cols": feature_cols,
        }

        # Predict uplift on test set
        X_test = df_test[feature_cols].to_numpy()
        p1_hat = mdl_t.predict_proba(X_test)[:, 1]
        p0_hat = mdl_c.predict_proba(X_test)[:, 1]
        uplift_hat = p1_hat - p0_hat

        df_test = df_test.copy()
        df_test["p1_hat"] = p1_hat
        df_test["p0_hat"] = p0_hat
        df_test["uplift_hat"] = uplift_hat

        return df_train, df_test


    # Use pre_activity as the feature for this example
    feature_cols = ["pre_activity"]
    df_train_tl, df_test_tl = train_t_learner_logistic(
        df_in=df,
        feature_cols=feature_cols,
        target_col="converted",
        group_col="group",
        treat_label="treatment",
        test_size=0.3,
        seed=2025,
    )

    display(df_test_tl.head())



The test set now has, for each user:

- `p1_hat`: predicted probability of conversion **if treated**.  
- `p0_hat`: predicted probability of conversion **if in control**.  
- `uplift_hat = p1_hat - p0_hat`: the modelled **conditional uplift** for a user with these features, in absolute probability points.

Individual treatment effects are never observed directly, so next we check whether this uplift signal
tracks **realized** effects: we rank users by `uplift_hat`, bin them, and compare conversion between arms within each bin.


In [None]:

if LogisticRegression is not None:
    def evaluate_uplift_by_quantile(
        df_eval: pd.DataFrame,
        uplift_col: str = "uplift_hat",
        group_col: str = "group",
        target_col: str = "converted",
        n_bins: int = 5,
    ) -> pd.DataFrame:
        """Evaluate uplift by quantiles of predicted uplift.

        Parameters
        ----------
        df_eval : DataFrame
            Evaluation data (e.g., test set) containing uplift predictions, group, and outcome.
        uplift_col : str
            Column name containing uplift predictions.
        group_col : str
            Column name of treatment group (must contain 'control' and 'treatment').
        target_col : str
            Binary outcome column name.
        n_bins : int
            Number of bins/quantiles for ranking.

        Returns
        -------
        DataFrame
            For each bin: average predicted uplift, actual conversion by group,
            and realized uplift (treatment - control).
        """
        df_ev = df_eval.copy()

        # Higher uplift_hat = more likely to benefit from treatment
        df_ev["uplift_bin"] = pd.qcut(df_ev[uplift_col], n_bins, labels=False, duplicates="drop")

        summaries = []
        for b in sorted(df_ev["uplift_bin"].dropna().unique()):
            sub = df_ev[df_ev["uplift_bin"] == b]
            avg_uplift_hat = float(sub[uplift_col].mean())
            # Actual conversion by arm
            conv_tab = (
                sub.groupby(group_col)[target_col]
                   .agg(["mean", "count"])
                   .rename(columns={"mean": "conversion_rate"})
            )
            # Handle possible missing arm in rare small bins
            conv_control = float(conv_tab.loc["control", "conversion_rate"]) if "control" in conv_tab.index else float("nan")
            conv_treat = float(conv_tab.loc["treatment", "conversion_rate"]) if "treatment" in conv_tab.index else float("nan")
            realized_uplift = conv_treat - conv_control if (not math.isnan(conv_treat) and not math.isnan(conv_control)) else float("nan")

            summaries.append(
                {
                    "bin": int(b),
                    "n_users": int(sub.shape[0]),
                    "avg_uplift_hat": avg_uplift_hat,
                    "conv_control": conv_control,
                    "conv_treat": conv_treat,
                    "realized_uplift": realized_uplift,
                }
            )

        df_bins = pd.DataFrame(summaries).sort_values("bin")
        return df_bins


    uplift_bins = evaluate_uplift_by_quantile(df_test_tl, uplift_col="uplift_hat", n_bins=5)
    display(uplift_bins)


In [None]:

if LogisticRegression is not None:
    # Plot realized uplift vs predicted uplift bin
    plt.figure()
    plt.plot(uplift_bins["bin"], uplift_bins["realized_uplift"], marker="o")
    plt.xlabel("uplift_hat quantile bin (0 = lowest predicted uplift)")
    plt.ylabel("realized uplift (conversion_treat - conversion_control)")
    plt.title("T-learner uplift model: realized uplift by predicted uplift bin")
    plt.tight_layout()
    plt.show()



If the T-learner is capturing useful signal, you should see that:

- Bins with **higher predicted uplift** (`uplift_hat` larger) tend to have **larger realized uplift**.  
- Low uplift bins may have small or even negative realized uplift.

This is the core idea of uplift modeling:

1. Use historical randomized data to **learn a function** \(\tau(x)\) that predicts who benefits most.  
2. In future, **target** the treatment (e.g., discount, new feature) to users with high predicted uplift.  
3. Continue to log data and periodically **retrain** the uplift model.
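
Step 2 can be sketched as a simple top-fraction rule: rank users by predicted uplift and treat only the highest-scoring share, skipping anyone whose predicted uplift is not positive. The function name, the `min_uplift` parameter, and the example scores are illustrative assumptions, not a recommended production policy.

```python
import math
from typing import Dict, List


def select_top_uplift(
    scores: Dict[str, float], top_frac: float = 0.2, min_uplift: float = 0.0
) -> List[str]:
    """Return ids of the top `top_frac` users by predicted uplift, above `min_uplift`."""
    if not 0.0 < top_frac <= 1.0:
        raise ValueError("top_frac must be in (0, 1].")
    k = max(1, math.floor(top_frac * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest uplift first
    return [uid for uid in ranked[:k] if scores[uid] > min_uplift]


# Made-up predicted uplifts keyed by user id.
example_scores = {"u1": 0.031, "u2": -0.004, "u3": 0.012, "u4": 0.027, "u5": 0.001}
targeted = select_top_uplift(example_scores, top_frac=0.4)
```

In practice the fraction to target would be chosen from the realized-uplift-by-bin table above, together with the cost of delivering the treatment.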

In production systems you would typically:

- Use richer feature sets (RFM variables, geography, device, etc.).  
- Use more flexible learners (gradient boosting, random forests, neural nets).  
- Monitor stability and drift of uplift predictions over time.
