
# Modern A/B Testing Notebook — Multiple Public Case Studies (with Uplift)

This notebook aggregates **multiple well-known A/B case studies** and adds a **modern uplift** example:

1) **Udacity — Landing Page (`ab_data.csv`)**: user-level test, classic conversion analysis.  
2) **Udacity — Free Trial Screener (aggregates)**: GC/NC metrics, power/MDE.  
3) **Criteo Uplift Modeling (public sample)**: treatment/control ads with features, enabling **uplift evaluation** (Qini, uplift@K).

All sections come with Markdown explanations before/after code cells.


## Setup — Imports & Shared Statistical Helpers

In [None]:

from __future__ import annotations
from dataclasses import dataclass
from typing import Tuple, Dict, Optional
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['axes.grid'] = True
RANDOM_SEED = 7
np.random.seed(RANDOM_SEED)

@dataclass(frozen=True)
class ProportionSummary:
    p: float
    n: int
    x: int

def summarize_proportion(x: int, n: int) -> ProportionSummary:
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must be in [0, n].")
    return ProportionSummary(p=x / n, n=n, x=x)

def two_prop_ztest(x1: int, n1: int, x2: int, n2: int, two_sided: bool = True):
    s1 = summarize_proportion(x1, n1)
    s2 = summarize_proportion(x2, n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1/s1.n + 1/s2.n))
    if se == 0.0:
        raise ZeroDivisionError("SE=0; verify inputs.")
    z = (s1.p - s2.p) / se
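    # Standard normal CDF via erf; the one-sided p-value is for H1: p1 > p2.
    cdf_abs = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    cdf_z = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    p_val = 2 * (1 - cdf_abs) if two_sided else 1 - cdf_z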
    return z, p_val

def bootstrap_ci_diff(pA: float, pB: float, nA: int, nB: int, B: int = 5000, alpha: float = 0.05):
    # Parametric bootstrap: draw conversion counts from Binomial(n, p) for each arm,
    # then take percentile bounds of the simulated difference (B minus A).
    xA = np.random.binomial(nA, pA, size=B)
    xB = np.random.binomial(nB, pB, size=B)
    diffs = xB / nB - xA / nA
    lo = float(np.quantile(diffs, alpha/2))
    hi = float(np.quantile(diffs, 1 - alpha/2))
    return lo, hi

def required_n_two_proportions(pA: float, pB: float, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> int:
    def invPhi(u: float) -> float:
        # math has no erfcinv; use the standard-library normal quantile instead.
        from statistics import NormalDist
        return NormalDist().inv_cdf(u)
    z_alpha = abs(invPhi(1 - alpha/2)) if two_sided else abs(invPhi(1 - alpha))
    z_beta = abs(invPhi(power))
    pbar = 0.5 * (pA + pB)
    delta = abs(pB - pA)
    if delta == 0.0:
        raise ValueError("delta=0 implies infinite sample size to detect.")
    se = math.sqrt(2 * pbar * (1 - pbar))
    n = ((z_alpha + z_beta) * se / delta) ** 2
    return math.ceil(n)

def mde_for_n(pA: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> float:
    def invPhi(u: float) -> float:
        # math has no erfcinv; use the standard-library normal quantile instead.
        from statistics import NormalDist
        return NormalDist().inv_cdf(u)
    z_alpha = abs(invPhi(1 - alpha/2)) if two_sided else abs(invPhi(1 - alpha))
    z_beta = abs(invPhi(power))
    se = math.sqrt(2 * pA * (1 - pA))
    return float((z_alpha + z_beta) * se / math.sqrt(max(n_per_arm, 1)))


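**Helpers.** The pooled two-proportion z-test (`two_prop_ztest`), the parametric bootstrap CI (`bootstrap_ci_diff`), the per-arm sample size (`required_n_two_proportions`), and the MDE (`mde_for_n`) are defined once here so that later sections can reuse them.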
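
As a rough illustration of how the sample-size and MDE formulas relate, the sketch below recomputes both with the normal approximation. The function names `sample_size_per_arm` and `mde_at_n` and the 10% and 12% rates are hypothetical, and `statistics.NormalDist` is used for the normal quantile as an assumption of this sketch.

```python
import math
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """z_(1 - alpha/2) + z_(power) for a two-sided test."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def sample_size_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm n needed to detect |p_b - p_a|, using the average-rate variance."""
    p_bar = 0.5 * (p_a + p_b)
    delta = abs(p_b - p_a)
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return math.ceil(_z_sum(alpha, power) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2)


def mde_at_n(p_a: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute difference detectable at n_per_arm, using the baseline variance."""
    return _z_sum(alpha, power) * math.sqrt(2 * p_a * (1 - p_a) / n_per_arm)


# Hypothetical scenario: baseline 10%, hoped-for 12%.
n = sample_size_per_arm(0.10, 0.12)
mde = mde_at_n(0.10, n)
print(n, round(mde, 4))
```

The two numbers will not invert each other exactly, because the sample-size formula uses the average of the two rates for the variance while the MDE formula only knows the baseline rate.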

## 1) Udacity — Landing Page (`ab_data.csv`)

In [None]:

RAW_URL = "https://raw.githubusercontent.com/udacity/sdand-ab-testing-project/main/ab_data.csv"
df = pd.read_csv(RAW_URL)
mask_ok = ((df['group']=='control') & (df['landing_page']=='old_page')) | ((df['group']=='treatment') & (df['landing_page']=='new_page'))
df_clean = df[mask_ok].drop_duplicates('user_id', keep='first').copy()

grp = df_clean.groupby('group')['converted'].agg(['sum','count','mean']).rename(columns={'sum':'x','count':'n','mean':'rate'})
A = summarize_proportion(int(grp.loc['control','x']), int(grp.loc['control','n']))
B = summarize_proportion(int(grp.loc['treatment','x']), int(grp.loc['treatment','n']))
z, p = two_prop_ztest(A.x, A.n, B.x, B.n, two_sided=True)
ci_lo, ci_hi = bootstrap_ci_diff(A.p, B.p, A.n, B.n, B=3000, alpha=0.05)
abs_lift = B.p - A.p
rel_lift = abs_lift / A.p if A.p else float('inf')

pd.DataFrame({
    'arm':['control','treatment'],
    'n':[A.n,B.n],
    'x':[A.x,B.x],
    'rate':[A.p,B.p],
    'p_value(z-test)': [p,p],
    'abs_lift(B-A)':[abs_lift,abs_lift],
    'rel_lift':[rel_lift,rel_lift],
    'boot_CI_lo':[ci_lo,ci_lo],
    'boot_CI_hi':[ci_hi,ci_hi],
})


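**Explanation.** This is a classic pooled two-proportion z-test on conversion. The parametric bootstrap CI gives an interval for the absolute lift (treatment minus control) that does not rely on the pooled standard error. Rows with mismatched `group`/`landing_page` values and repeated `user_id`s are dropped before testing.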
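
A closed-form interval is a useful cross-check on the bootstrap. The sketch below computes a Wald (unpooled) interval for the difference in proportions. The function name `wald_ci_diff` and the counts are hypothetical and are not taken from `ab_data.csv`.

```python
import math
from statistics import NormalDist


def wald_ci_diff(x_a: int, n_a: int, x_b: int, n_b: int, alpha: float = 0.05):
    """Wald CI for p_b - p_a using the unpooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical counts: 1200/10000 conversions in control, 1260/10000 in treatment.
lo, hi = wald_ci_diff(1200, 10000, 1260, 10000)
print(f"95% Wald CI for B - A: [{lo:.4f}, {hi:.4f}]")
```

With arms of this size the Wald and bootstrap intervals should be close; a large gap between them usually points to a data or coding problem rather than a statistical subtlety.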

## 2) Udacity — Free Trial Screener (Aggregates)

In [None]:

totals = {
    "sanity": {
        "pageviews": {"control": 345543, "experiment": 344660},
        "clicks":    {"control": 28378,  "experiment": 28325},
    },
    "metrics": {
        "gross_conversion": {
            "clicks":      {"control": 17293, "experiment": 17260},
            "enrollments": {"control":  3785, "experiment":  3423},
        },
        "net_conversion": {
            "clicks":   {"control": 17293, "experiment": 17260},
            "payments": {"control":  2033, "experiment":  1945},
        }
    }
}

def run_ratio_test(xA, nA, xB, nB):
    A = summarize_proportion(xA, nA)
    B = summarize_proportion(xB, nB)
    z, p = two_prop_ztest(A.x, A.n, B.x, B.n, two_sided=True)
    ci_lo, ci_hi = bootstrap_ci_diff(A.p, B.p, A.n, B.n, B=3000, alpha=0.05)
    return A, B, z, p, ci_lo, ci_hi

# Gross Conversion
gc = totals["metrics"]["gross_conversion"]
A, B, z_gc, p_gc, lo_gc, hi_gc = run_ratio_test(gc["enrollments"]["control"], gc["clicks"]["control"],
                                                gc["enrollments"]["experiment"], gc["clicks"]["experiment"])

# Net Conversion
nc = totals["metrics"]["net_conversion"]
A2, B2, z_nc, p_nc, lo_nc, hi_nc = run_ratio_test(nc["payments"]["control"], nc["clicks"]["control"],
                                                  nc["payments"]["experiment"], nc["clicks"]["experiment"])

pd.DataFrame({
    'metric':['gross_conversion','net_conversion'],
    'control_rate':[A.p, A2.p],
    'treatment_rate':[B.p, B2.p],
    'p_value(z-test)':[p_gc, p_nc],
    'boot_CI_lo(B-A)':[lo_gc, lo_nc],
    'boot_CI_hi(B-A)':[hi_gc, hi_nc],
})


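**Explanation.** The screener aims to reduce gross conversion (GC, enrollments per click) while keeping net conversion (NC, payments per click) flat. We report two-sided z-test p-values and bootstrap CIs for the difference (experiment minus control) on both metrics.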
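
The `sanity` counts in `totals` are not used by the cell above. One way they could be checked is sketched below, assuming the experiment intended a 50/50 split (an assumption of this sketch). The helper name `invariant_check` is hypothetical.

```python
import math
from statistics import NormalDist


def invariant_check(n_control: int, n_experiment: int,
                    expected_share: float = 0.5, alpha: float = 0.05):
    """Return (observed control share, acceptance interval, passed?)."""
    total = n_control + n_experiment
    observed = n_control / total
    # Standard error of a share under the expected split (normal approximation).
    se = math.sqrt(expected_share * (1 - expected_share) / total)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    lo, hi = expected_share - margin, expected_share + margin
    return observed, (lo, hi), lo <= observed <= hi


# Counts copied from totals["sanity"].
pageviews = invariant_check(345543, 344660)
clicks = invariant_check(28378, 28325)
print("pageviews:", pageviews)
print("clicks:   ", clicks)
```

If an invariant fails this check, the GC and NC comparisons are hard to trust, because the two arms may not have been exposed to comparable traffic.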

## 3) Criteo Uplift Modeling — Modern Uplift A/B


**Why is this modern?** Many modern experiments focus on **heterogeneous treatment effects**: *who* benefits from treatment, not just whether treatment is better on average.  
The **Criteo Uplift Modeling** dataset contains treatment (`treatment`) and outcome (`conversion`) with user features, enabling **uplift** analysis.

We load a **public sample** from the CausalML repository (CSV.gz). We'll compute:
- Group-level conversion (classic A/B sanity).
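- A simple **uplift@K** curve using a logistic-regression score (scikit-learn only, no dedicated uplift library).
- A **Qini-like** summary: the area under the uplift@K curve, used as an approximation to the Qini coefficient.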
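
In symbols, the last two quantities can be sketched as follows. The notation is introduced here for illustration and does not come from the dataset documentation. Rank users by uplift score, and let $N_T(k)$ and $N_C(k)$ be the numbers of treated and control users among the top $k$, with $Y_T(k)$ and $Y_C(k)$ their conversions:

$$
\text{uplift@}k = \frac{Y_T(k)}{N_T(k)} - \frac{Y_C(k)}{N_C(k)},
\qquad
Q(k) = Y_T(k) - Y_C(k)\,\frac{N_T(k)}{N_C(k)}
$$

The Qini coefficient is commonly defined as the area between the Qini curve $Q(k)$ and the straight line that random targeting would produce. The code in this section integrates $\text{uplift@}k$ instead, so its result should be read as a rough stand-in rather than the standard coefficient.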


In [None]:

import pandas as pd
import numpy as np

# Public sample hosted by CausalML repo (CSV.GZ)
URL = "https://raw.githubusercontent.com/uber/causalml/master/examples/data/criteo_uplift.csv.gz"
upl = pd.read_csv(URL, compression='gzip')

# Expect columns (common sample): 'treatment', 'conversion' (0/1), and features.
expected = {'treatment','conversion'}
missing = expected - set(upl.columns)
if missing:
    raise ValueError(f"Expected columns missing: {missing}")

upl[['treatment','conversion']].head()


**Sanity.** We check that treatment/control exist and compute base conversion by group.

In [None]:

upl.groupby('treatment')['conversion'].agg(['mean','sum','count'])


### A simple uplift scoring and Qini/uplift@K

In [None]:

from sklearn.linear_model import LogisticRegression

# We'll fit two separate logistic models (T-learner style) just for a ranking score:
# P(y|X, T=1) and P(y|X, T=0). Uplift score ~ p1 - p0.
# NOTE: Only for demonstration; production uplift uses stronger learners & CV.

# Use numeric features only
feat_cols = [c for c in upl.columns if c not in ('treatment','conversion')]
X = upl[feat_cols].select_dtypes(include=[np.number]).fillna(0.0).values
y = upl['conversion'].values
t = upl['treatment'].values

X1, y1 = X[t==1], y[t==1]
X0, y0 = X[t==0], y[t==0]

clf1 = LogisticRegression(max_iter=1000)
clf0 = LogisticRegression(max_iter=1000)
clf1.fit(X1, y1)
clf0.fit(X0, y0)

p1 = clf1.predict_proba(X)[:,1]
p0 = clf0.predict_proba(X)[:,1]
uplift_score = p1 - p0

# Rank by uplift score desc and compute uplift@k curve
order = np.argsort(-uplift_score)
conv = y[order]
t_ord = t[order]

# Compute cumulative treatment vs control conversions as we include top-k users
cum_treat = np.cumsum((t_ord==1) * conv)
cum_ctrl  = np.cumsum((t_ord==0) * conv)

# Normalize by group counts in prefix to approximate incremental conversions
cnt_treat = np.cumsum((t_ord==1).astype(int))
cnt_ctrl  = np.cumsum((t_ord==0).astype(int))

# Avoid division by zero
rate_treat = np.where(cnt_treat>0, cum_treat/np.maximum(cnt_treat,1), 0.0)
rate_ctrl  = np.where(cnt_ctrl>0,  cum_ctrl/np.maximum(cnt_ctrl,1),  0.0)

uplift_at_k = rate_treat - rate_ctrl

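# Qini-like area: trapezoidal area under the uplift@k curve over the population fraction.
# np.trapz was renamed to np.trapezoid in NumPy 2.0; use whichever is available.
_trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
qini = float(_trapezoid(uplift_at_k, dx=1.0/len(uplift_at_k)))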

len(uplift_at_k), qini


**Plot uplift@K (percentile on x-axis)**

In [None]:

k = len(uplift_at_k)
x = 100.0 * np.arange(1, k + 1) / k  # the i-th prefix covers i/k of all users
plt.figure()
plt.plot(x, uplift_at_k)
plt.title("Uplift@K curve (T-learner logistic)")
plt.xlabel("Top-K% (ranked by uplift score)")
plt.ylabel("Estimated uplift (Δconv rate)")
plt.tight_layout()
plt.show()



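**Interpretation.** If the curve is above zero at a given K, the treated users in the top-K% converted at a higher rate than the control users in the same slice, which suggests that targeting that slice would **increase** conversions relative to control.  
The **Qini-like area** summarizes the curve in one number; higher is better. Because the models are scored on the same rows they were fit on, treat both the curve and the area as optimistic.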

> In production, prefer robust uplift estimators (e.g., DR-Learner, X-Learner) and cross-validation; consider constraints (budget, exposure).
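
For comparison with the Qini-like area above, here is a minimal NumPy sketch of the more common Qini curve (cumulative incremental conversions) and an unnormalized Qini coefficient. It runs on synthetic data rather than the Criteo sample. The function names, the data-generating process, and the choice to leave the coefficient unnormalized are all assumptions of this sketch.

```python
import numpy as np


def qini_curve(score, treatment, outcome):
    """Q(k) = Y_T(k) - Y_C(k) * N_T(k) / N_C(k) over prefixes ranked by score."""
    order = np.argsort(-score, kind="stable")
    t = treatment[order].astype(float)
    y = outcome[order].astype(float)
    y_t, y_c = np.cumsum(y * t), np.cumsum(y * (1 - t))
    n_t, n_c = np.cumsum(t), np.cumsum(1 - t)
    # Prefixes with no control users yet get a ratio of 0 to avoid dividing by zero.
    ratio = np.divide(n_t, n_c, out=np.zeros_like(n_t), where=n_c > 0)
    return y_t - y_c * ratio


def qini_coefficient(score, treatment, outcome):
    """Mean gap between the Qini curve and the random-targeting line (unnormalized)."""
    q = qini_curve(score, treatment, outcome)
    frac = np.arange(1, len(q) + 1) / len(q)
    return float(np.mean(q - q[-1] * frac))


# Synthetic data: treatment only helps users with x > 0.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
y = rng.binomial(1, 0.05 + 0.10 * t * (x > 0))

good = qini_coefficient(x, t, y)   # ranks the responsive users first
bad = qini_coefficient(-x, t, y)   # ranks them last
print(good, bad)
```

A score that puts responsive users first should yield a positive coefficient, and reversing the ranking should push it negative. That makes this a quick sanity test for any uplift metric implementation.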
