
# A/B Testing Playbook — Professional Notebook (Udacity + Criteo, with Modern Methods)

This notebook is a **professional, end‑to‑end** A/B testing playbook built on **well‑known public datasets**.
It serves as a **teaching** and **production** reference with **clear Markdown explanations** and **rigorous methods**.

**Datasets included**
1. **Udacity — Landing Page (`ab_data.csv`)** — user–level conversion test (old vs new page).  
2. **Udacity — Free Trial Screener** — aggregate metrics (Gross & Net Conversion).  
3. **Criteo Uplift Modeling** — treatment/control ads enabling **uplift/heterogeneous effects** analysis.

**What you’ll find here**
- Sanity checks (**SRM** by chi‑square), data validation and cleaning
- EDA & rate visualizations (matplotlib only)
- Frequentist tests: two‑proportion z‑test; **GLM(Logit)** with **day fixed effects**
- **Bootstrap** confidence intervals for absolute lift
- **CUPED** (variance reduction) with careful caveats and correct construction
- **Bayesian Beta–Binomial** sanity check
- **Power & MDE** planners, **achieved power**
- **Multiple testing** correction (Holm) when assessing several metrics
- **Modern uplift** evaluation (Criteo): uplift@K and Qini‑like area

> Every code block is preceded by Markdown instructions and followed by a short interpretation.


## 0. Setup — Libraries & Helpers

In [None]:

from __future__ import annotations
from dataclasses import dataclass
from typing import Dict, Tuple, Iterable
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Optional but standard for modeling
from statsmodels.api import Logit, add_constant, OLS  # type: ignore
from sklearn.linear_model import LogisticRegression  # used only in uplift demo

plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['axes.grid'] = True
RANDOM_SEED = 7
np.random.seed(RANDOM_SEED)


### Core statistical helpers

In [None]:

@dataclass(frozen=True)
class ProportionSummary:
    p: float
    n: int
    x: int

def summarize_proportion(x: int, n: int) -> ProportionSummary:
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must be in [0, n].")
    return ProportionSummary(p=x/n, n=n, x=x)

def two_prop_ztest(x1: int, n1: int, x2: int, n2: int, two_sided: bool = True):
    s1, s2 = summarize_proportion(x1, n1), summarize_proportion(x2, n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1/s1.n + 1/s2.n))
    if se == 0.0:
        raise ZeroDivisionError("SE=0; verify inputs.")
    z = (s1.p - s2.p) / se
    p_val = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))) if two_sided else (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return z, p_val

def chisq_srm(nA: int, nB: int) -> float:
    n = nA + nB
    expected = [n/2, n/2]
    observed = [nA, nB]
    chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
    return 2 * (1 - 0.5 * (1 + math.erf(math.sqrt(chi2) / math.sqrt(2))))

def bootstrap_ci_diff(pA: float, pB: float, nA: int, nB: int, B: int = 5000, alpha: float = 0.05):
    diffs = np.empty(B, dtype=float)
    for b in range(B):
        xA = np.random.binomial(nA, pA)
        xB = np.random.binomial(nB, pB)
        diffs[b] = xB/nB - xA/nA
    lo = float(np.quantile(diffs, alpha/2)); hi = float(np.quantile(diffs, 1 - alpha/2))
    return lo, hi

from statistics import NormalDist  # stdlib inverse normal CDF; `math` has no erfcinv

def required_n_two_proportions(pA: float, pB: float, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> int:
    inv_phi = NormalDist().inv_cdf
    z_alpha = abs(inv_phi(1 - alpha/2)) if two_sided else abs(inv_phi(1 - alpha))
    z_beta = abs(inv_phi(power))
    pbar = 0.5*(pA + pB)
    delta = abs(pB - pA)
    if delta == 0.0:
        raise ValueError("delta=0 implies infinite n.")
    # Pooled-variance approximation: Var(pB_hat - pA_hat) ~ 2 * pbar * (1 - pbar) / n
    se = math.sqrt(2 * pbar * (1 - pbar))
    n = ((z_alpha + z_beta) * se / delta)**2
    return int(math.ceil(n))

def mde_for_n(pA: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> float:
    inv_phi = NormalDist().inv_cdf
    z_alpha = abs(inv_phi(1 - alpha/2)) if two_sided else abs(inv_phi(1 - alpha))
    z_beta = abs(inv_phi(power))
    # Uses the baseline rate for both arms, so the MDE is an absolute difference
    se = math.sqrt(2 * pA * (1 - pA))
    return float((z_alpha + z_beta) * se / math.sqrt(max(n_per_arm,1)))

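def holm_correction(pvals, alpha: float = 0.05):
    """Holm step-down adjusted p-values, returned in the same order as `pvals`.

    `alpha` is kept for API compatibility; compare the adjusted p-values against it.
    Also returns `order`, the indices that sort the raw p-values ascending.
    """
    p = np.asarray(list(pvals), dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Step-down multipliers m, m-1, ..., 1 applied to the sorted p-values
    scaled = (m - np.arange(m)) * p[order]
    # Enforce monotonicity with a running maximum and cap at 1
    adj_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adj = np.empty(m, dtype=float)
    adj[order] = adj_sorted  # map back to the input order
    return adj, order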

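As a rough cross-check of the planners above, the sketch below recomputes a per-arm sample size for a two-sided two-proportion test, taking the normal quantiles from the standard library's `statistics.NormalDist`. The helper name `n_per_arm` and the example rates are illustrative assumptions, not part of the notebook's helpers; it uses the same pooled-variance approximation as `required_n_two_proportions`.

```python
import math
from statistics import NormalDist

def n_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-arm n for a two-sided two-proportion z-test (pooled-variance approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = 0.5 * (p_a + p_b)
    delta = abs(p_b - p_a)
    if delta == 0.0:
        raise ValueError("delta=0 implies infinite n.")
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2)

n_small = n_per_arm(0.10, 0.11)  # detect a 1-point absolute lift (hypothetical rates)
n_large = n_per_arm(0.10, 0.12)  # detect a 2-point absolute lift
print(n_small, n_large, round(n_small / n_large, 2))
```

Because the required sample size scales with `1 / delta**2`, doubling the detectable lift cuts the requirement to roughly a quarter.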

## 1) Udacity — Landing Page (`ab_data.csv`)

### Load, clean, SRM

In [None]:

RAW_URL = "https://raw.githubusercontent.com/udacity/sdand-ab-testing-project/main/ab_data.csv"
df = pd.read_csv(RAW_URL)
mask_ok = ((df['group']=='control') & (df['landing_page']=='old_page')) | ((df['group']=='treatment') & (df['landing_page']=='new_page'))
df = df[mask_ok].drop_duplicates('user_id', keep='first').copy()
nA = (df['group']=='control').sum(); nB = (df['group']=='treatment').sum()
p_srm = chisq_srm(nA, nB)
nA, nB, p_srm

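`chisq_srm` assumes a planned 50/50 split. When the planned allocation is unequal (for example 90/10), the expected counts change; a minimal sketch of that generalization follows. The helper name `srm_pvalue`, its `expected_share_a` parameter, and the counts are hypothetical.

```python
import math

def srm_pvalue(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square (1 df) SRM p-value for a planned split expected_share_a : (1 - expected_share_a)."""
    if not (0.0 < expected_share_a < 1.0):
        raise ValueError("expected_share_a must be in (0, 1).")
    n = n_a + n_b
    if n <= 0:
        raise ValueError("Total sample size must be positive.")
    exp_a = n * expected_share_a
    exp_b = n - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))

p_planned_90_10 = srm_pvalue(9000, 1000, expected_share_a=0.9)  # matches the plan exactly
p_small_imbalance = srm_pvalue(50_000, 50_900)                  # 50/50 plan, small imbalance
print(p_planned_90_10, p_small_imbalance)
```

With large samples, even a sub-1% imbalance can produce a small p-value, which is why SRM checks are usually run against a strict threshold before any metric is read.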

### Primary test + Bootstrap + MDE

In [None]:

grp = df.groupby('group')['converted'].agg(['sum','count','mean']).rename(columns={'sum':'x','count':'n','mean':'rate'})
A = summarize_proportion(int(grp.loc['control','x']), int(grp.loc['control','n']))
B = summarize_proportion(int(grp.loc['treatment','x']), int(grp.loc['treatment','n']))
z, p = two_prop_ztest(A.x, A.n, B.x, B.n, two_sided=True)
ci_lo, ci_hi = bootstrap_ci_diff(A.p, B.p, A.n, B.n, B=3000, alpha=0.05)
abs_lift = B.p - A.p
n_per_arm = min(A.n, B.n)
mde = mde_for_n(A.p, n_per_arm, 0.05, 0.8, True)
pd.DataFrame({'arm':['control','treatment'],'n':[A.n,B.n],'x':[A.x,B.x],'rate':[A.p,B.p],'p_value(z)':[p,p],
              'abs_lift(B-A)':[abs_lift,abs_lift],'boot_CI_lo':[ci_lo,ci_lo],'boot_CI_hi':[ci_hi,ci_hi],
              'MDE_abs_at_80%_power':[mde,mde]})

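The bootstrap interval above is parametric (it resamples binomial counts from the observed rates), so an analytic Wald interval for the difference makes a cheap cross-check. The sketch below uses the unpooled standard error; `wald_ci_diff` and the counts are illustrative assumptions, not values from `ab_data.csv`.

```python
import math
from statistics import NormalDist

def wald_ci_diff(x_a: int, n_a: int, x_b: int, n_b: int, alpha: float = 0.05):
    """Wald confidence interval for p_b - p_a using the unpooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = wald_ci_diff(1200, 10_000, 1260, 10_000)  # hypothetical counts
print(lo, hi)
```

If the Wald and bootstrap intervals disagree noticeably at these sample sizes, that usually points to a data or coding problem, not a statistical subtlety.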

### GLM(Logit) with day fixed effects

In [None]:

from statsmodels.api import add_constant, Logit  # type: ignore
df['date'] = pd.to_datetime(df['timestamp']).dt.date
X = pd.get_dummies(df[['group','date']], drop_first=True).rename(columns={'group_treatment':'treatment'}).astype(float)
Xc = add_constant(X)
y = df['converted'].astype(int).to_numpy()
model = Logit(y, Xc).fit(disp=False)
t_coef = model.params.get('treatment', float('nan')); t_se = model.bse.get('treatment', float('nan'))
# NaN never compares equal to itself, so `t_se not in (0, nan)` cannot detect it; `t_se > 0` is False for NaN
z_t = t_coef / t_se if t_se > 0 else float('nan')
p_t = 2 * (1 - 0.5*(1 + math.erf(abs(z_t)/math.sqrt(2)))) if not math.isnan(z_t) else float('nan')
pd.DataFrame({'term':['treatment'],'coef':[t_coef],'se':[t_se],'z':[z_t],'p_value':[p_t]})


### CUPED (day-control baseline) + OLS on adjusted outcome

In [None]:

from statsmodels.api import OLS  # type: ignore

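def cuped_transform(y, x_cov):
    """CUPED adjustment y - theta * (x - mean(x)), with theta = cov(y, x) / var(x)."""
    y = np.asarray(y, dtype=float); x_cov = np.asarray(x_cov, dtype=float)
    if y.shape != x_cov.shape: raise ValueError("Shapes must match.")
    # Use the same ddof for variance and covariance so theta is the OLS slope
    vx = np.var(x_cov, ddof=1); theta = 0.0 if vx==0 else float(np.cov(y, x_cov, ddof=1)[0,1] / vx)
    # Center the covariate so the adjusted outcome keeps the mean of y
    return (y - theta * (x_cov - x_cov.mean())), theta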

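# Caveat: this covariate is constant within a day and built from in-experiment control outcomes.
# With day fixed effects in X_fe below, theta * x is absorbed by the day dummies, so the treatment
# coefficient matches plain OLS with day effects. A true CUPED covariate is a pre-experiment,
# user-level metric.
df_c = df.copy()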
baseline = df_c[df_c['group']=='control'].groupby('date')['converted'].mean()
df_c = df_c.merge(baseline.rename('baseline_day_control'), left_on='date', right_index=True, how='left')
y = df_c['converted'].astype(float).to_numpy()
x = df_c['baseline_day_control'].fillna(baseline.mean()).astype(float).to_numpy()
y_adj, theta = cuped_transform(y, x)

X_fe = add_constant(pd.get_dummies(df_c[['group','date']], drop_first=True).rename(columns={'group_treatment':'treatment'}).astype(float))
mdl = OLS(y_adj, X_fe).fit()
coef_t = mdl.params.get('treatment', float('nan')); se_t = mdl.bse.get('treatment', float('nan'))
t_stat = coef_t / se_t if se_t > 0 else float('nan')  # `se_t > 0` is False for NaN as well
p_val = 2 * (1 - 0.5*(1 + math.erf(abs(t_stat)/math.sqrt(2)))) if not math.isnan(t_stat) else float('nan')
pd.DataFrame({'theta':[theta],'coef_treatment_after_CUPED':[coef_t],'se':[se_t],'p_value_approx':[p_val]})

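To see what CUPED buys when a genuine pre-experiment covariate exists, here is a small synthetic sketch. All data are simulated; the covariate `pre`, the effect size, and the noise levels are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
pre = rng.normal(10.0, 2.0, size=n)      # pre-experiment metric per user (synthetic)
treat = rng.integers(0, 2, size=n)       # random 50/50 assignment, independent of `pre`
y = 0.8 * pre + 0.1 * treat + rng.normal(0.0, 1.0, size=n)  # true effect = 0.1

# theta is the OLS slope of y on the covariate
theta = np.cov(y, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)
y_cuped = y - theta * (pre - pre.mean())

def diff_in_means(values: np.ndarray) -> float:
    return float(values[treat == 1].mean() - values[treat == 0].mean())

effect_raw = diff_in_means(y)
effect_cuped = diff_in_means(y_cuped)
var_ratio = float(np.var(y_cuped, ddof=1) / np.var(y, ddof=1))
print(effect_raw, effect_cuped, var_ratio)
```

Both estimators target the same effect; the adjusted outcome simply has a smaller variance, by a factor of about one minus the squared correlation between `y` and `pre`.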

### Bayesian Beta–Binomial

In [None]:

def beta_post_params(x, n, a0=1.0, b0=1.0):
    return a0+x, b0+(n-x)

def prob_B_gt_A(xA,nA,xB,nB, draws=50000):
    aA,bA = beta_post_params(xA,nA); aB,bB = beta_post_params(xB,nB)
    pA = np.random.beta(aA,bA,size=draws); pB = np.random.beta(aB,bB,size=draws)
    return float(np.mean(pB > pA))

p_B_gt_A = prob_B_gt_A(int(grp.loc['control','x']), int(grp.loc['control','n']),
                       int(grp.loc['treatment','x']), int(grp.loc['treatment','n']))
p_B_gt_A

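Beyond P(B > A), the same Beta posteriors give a credible interval for the lift and an expected loss from shipping B. The sketch below assumes Beta(1, 1) priors as above and uses hypothetical counts; `expected_loss_ship_b` is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(7)
x_a, n_a = 1200, 10_000   # hypothetical control conversions / users
x_b, n_b = 1260, 10_000   # hypothetical treatment conversions / users

# Posterior draws under Beta(1, 1) priors
post_a = rng.beta(1 + x_a, 1 + n_a - x_a, size=100_000)
post_b = rng.beta(1 + x_b, 1 + n_b - x_b, size=100_000)
lift = post_b - post_a

prob_b_better = float(np.mean(lift > 0))
cred_lo, cred_hi = (float(q) for q in np.quantile(lift, [0.025, 0.975]))
# Expected conversion rate given up if B is shipped but A is actually better
expected_loss_ship_b = float(np.mean(np.maximum(-lift, 0.0)))
print(prob_b_better, (cred_lo, cred_hi), expected_loss_ship_b)
```

Expected loss is often easier to act on than a probability alone, because it is in the same units as the metric and can be compared with a tolerance set by the business.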

## 2) Udacity — Free Trial Screener (Aggregates)

### Metrics, SRM, multiple testing

In [None]:

totals = {
    "sanity": {"pageviews": {"control": 345543, "experiment": 344660},
               "clicks":    {"control": 28378,  "experiment": 28325}},
    "metrics": {"gross_conversion": {"clicks": {"control": 17293, "experiment": 17260},
                                     "enrollments": {"control": 3785, "experiment": 3423}},
                "net_conversion":  {"clicks": {"control": 17293, "experiment": 17260},
                                     "payments": {"control": 2033, "experiment": 1945}}}
}

def ratio_test(xA,nA,xB,nB):
    A = summarize_proportion(xA,nA); B = summarize_proportion(xB,nB)
    z,p = two_prop_ztest(A.x,A.n,B.x,B.n,True); lo,hi = bootstrap_ci_diff(A.p,B.p,A.n,B.n,3000,0.05)
    return A,B,p,lo,hi

p_srm_page = chisq_srm(totals["sanity"]["pageviews"]["control"], totals["sanity"]["pageviews"]["experiment"])
p_srm_click = chisq_srm(totals["sanity"]["clicks"]["control"], totals["sanity"]["clicks"]["experiment"])

A_gc,B_gc,p_gc,lo_gc,hi_gc = ratio_test(totals["metrics"]["gross_conversion"]["enrollments"]["control"],
                                        totals["metrics"]["gross_conversion"]["clicks"]["control"],
                                        totals["metrics"]["gross_conversion"]["enrollments"]["experiment"],
                                        totals["metrics"]["gross_conversion"]["clicks"]["experiment"])
A_nc,B_nc,p_nc,lo_nc,hi_nc = ratio_test(totals["metrics"]["net_conversion"]["payments"]["control"],
                                        totals["metrics"]["net_conversion"]["clicks"]["control"],
                                        totals["metrics"]["net_conversion"]["payments"]["experiment"],
                                        totals["metrics"]["net_conversion"]["clicks"]["experiment"])
adj, order = holm_correction([p_gc,p_nc], 0.05)
pd.DataFrame({'metric':['gross_conversion','net_conversion'],
              'control_rate':[A_gc.p,A_nc.p],'treatment_rate':[B_gc.p,B_nc.p],
              'p_value_raw':[p_gc,p_nc],'p_value_Holm_adj':[adj[0],adj[1]],
              'boot_CI_lo(B-A)':[lo_gc,lo_nc],'boot_CI_hi(B-A)':[hi_gc,hi_nc]}), {'SRM_pageviews_p':p_srm_page,'SRM_clicks_p':p_srm_click}


### Power & MDE (Screener)

In [None]:

n_gc = min(totals["metrics"]["gross_conversion"]["clicks"]["control"],
           totals["metrics"]["gross_conversion"]["clicks"]["experiment"])
n_nc = min(totals["metrics"]["net_conversion"]["clicks"]["control"],
           totals["metrics"]["net_conversion"]["clicks"]["experiment"])
pA_gc = A_gc.p; pA_nc = A_nc.p
mde_gc = mde_for_n(pA_gc, n_gc); mde_nc = mde_for_n(pA_nc, n_nc)
pd.DataFrame({'metric':['gross_conversion','net_conversion'],
              'n_per_arm':[n_gc, n_nc],
              'baseline_rate(control)':[pA_gc, pA_nc],
              'MDE_abs_at_80%_power':[mde_gc, mde_nc]})


**CUPED note.** With only arm‑level totals, CUPED is not directly applicable. If you obtain daily aggregates, use day fixed effects or weighted GLM as a CUPED‑like strategy.

## 3) Criteo Uplift — Heterogeneous Effects

### Load data & quick sanity

In [None]:

URL = "https://raw.githubusercontent.com/uber/causalml/master/examples/data/criteo_uplift.csv.gz"
upl = pd.read_csv(URL, compression='gzip')
if not {'treatment','conversion'}.issubset(upl.columns):
    raise ValueError("Expected 'treatment' and 'conversion'.")
upl.groupby('treatment')['conversion'].agg(['mean','sum','count'])


### Uplift scoring (T‑learner logistic), uplift@K and Qini-like

In [None]:

from sklearn.linear_model import LogisticRegression
feat_cols = [c for c in upl.columns if c not in ('treatment','conversion')]
X = upl[feat_cols].select_dtypes(include=[np.number]).fillna(0.0).to_numpy()
y = upl['conversion'].to_numpy(); t = upl['treatment'].to_numpy()

clf1 = LogisticRegression(max_iter=1000); clf0 = LogisticRegression(max_iter=1000)
clf1.fit(X[t==1], y[t==1]); clf0.fit(X[t==0], y[t==0])
p1 = clf1.predict_proba(X)[:,1]; p0 = clf0.predict_proba(X)[:,1]
uplift_score = p1 - p0

order = np.argsort(-uplift_score); y_ord = y[order]; t_ord = t[order]
cum_treat = np.cumsum((t_ord==1) * y_ord); cum_ctrl  = np.cumsum((t_ord==0) * y_ord)
cnt_treat = np.cumsum((t_ord==1).astype(int)); cnt_ctrl  = np.cumsum((t_ord==0).astype(int))
rate_t = np.where(cnt_treat>0, cum_treat/np.maximum(cnt_treat,1), 0.0)
rate_c = np.where(cnt_ctrl>0,  cum_ctrl/np.maximum(cnt_ctrl,1),  0.0)
uplift_at_k = rate_t - rate_c
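# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available
_trapezoid = getattr(np, "trapezoid", None) or np.trapz
qini = float(_trapezoid(uplift_at_k, dx=1.0/len(uplift_at_k)))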

x = np.linspace(0,100,num=len(uplift_at_k))
plt.figure(); plt.plot(x, uplift_at_k); plt.title("Uplift@K (T-learner logistic)"); plt.xlabel("Top-K%"); plt.ylabel("Δ rate"); plt.tight_layout(); plt.show()
qini


**Reading.** Positive uplift among top‑ranked users indicates value in targeting. For production, prefer DR/X‑Learners with cross‑fitting and policy constraints.
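
The area computed above is "Qini-like" because it integrates the rate difference. A more conventional Qini curve tracks cumulative incremental conversions, rescaling control conversions to the treated group's size at each cutoff. The sketch below shows that variant on synthetic data; `qini_curve`, `qini_coefficient`, and the data-generating process are illustrative assumptions.

```python
import numpy as np

def qini_curve(score: np.ndarray, y: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Cumulative incremental conversions when targeting users in descending score order."""
    order = np.argsort(-score)
    y_ord, t_ord = y[order], t[order]
    cum_y_t = np.cumsum(y_ord * t_ord).astype(float)
    cum_y_c = np.cumsum(y_ord * (1 - t_ord)).astype(float)
    n_t = np.cumsum(t_ord).astype(float)
    n_c = np.cumsum(1 - t_ord).astype(float)
    # Rescale control conversions to the treated group's size at each cutoff
    scale = np.divide(n_t, n_c, out=np.zeros_like(n_t), where=n_c > 0)
    return cum_y_t - cum_y_c * scale

def qini_coefficient(score: np.ndarray, y: np.ndarray, t: np.ndarray) -> float:
    """Mean gap between the Qini curve and the straight line of random targeting."""
    q = qini_curve(score, y, t)
    random_line = np.linspace(q[-1] / len(q), q[-1], num=len(q))
    return float(np.mean(q - random_line))

rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(size=n)
t = rng.integers(0, 2, size=n)
p = 0.05 + 0.10 * t * (x > 0.5)   # treatment only helps users with x > 0.5
y = rng.binomial(1, p)

coef_informative = qini_coefficient(x, y, t)                  # score that ranks responders first
coef_random = qini_coefficient(rng.uniform(size=n), y, t)     # uninformative score
print(coef_informative, coef_random)
```

An informative score should sit well above the random-targeting line, while an uninformative score should hover around zero.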


# 5) Sequential Testing — Pocock & O'Brien–Fleming

**Motivation.** In online tests it is common to inspect results at multiple *looks* (peeking).  
Without correction, this **inflates** the type I error. **Group-sequential** methods (Pocock, OBF) adjust the critical thresholds at each *look*.

**Setup.** We assume a two-sided test (\(\alpha=0.05\)) and \(K\) *looks* at information fractions \(t_i\in(0,1]\), e.g. daily.


In [None]:

from __future__ import annotations
from typing import List, Tuple
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def z_alpha_two_sided(alpha: float = 0.05) -> float:
    """Return z_{1 - alpha/2} from the standard normal inverse CDF."""
    from statistics import NormalDist  # stdlib; `math` has no erfcinv
    return NormalDist().inv_cdf(1 - alpha / 2)

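def obrien_fleming_boundaries(information_fracs: List[float], alpha: float = 0.05) -> List[float]:
    """
    Approximate two-sided O'Brien-Fleming critical z per look i: z_i = z_{1-alpha/2} / sqrt(t_i).
    Symmetric boundaries ±z_i: very strict at early looks, close to the fixed-sample value at the end.
    Exact OBF designs use a constant slightly larger than z_{1-alpha/2}, so this shortcut
    spends marginally more than alpha overall.
    """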
    if not information_fracs:
        raise ValueError("information_fracs must be non-empty.")
    if not all(0 < t <= 1 for t in information_fracs):
        raise ValueError("Each information fraction t must be in (0,1].")
    z_star = z_alpha_two_sided(alpha)
    return [float(z_star / math.sqrt(t)) for t in information_fracs]

def pocock_boundaries(K: int, alpha: float = 0.05) -> List[float]:
    """
    Two-sided Pocock design: one constant critical value across K equally spaced looks.
    The constant grows with K; the values below are the standard tabulated constants
    for two-sided alpha=0.05 and 2<=K<=10.
    """
    if not (2 <= K <= 10):
        raise ValueError("Pocock implementation supports 2<=K<=10 looks.")
    if not math.isclose(alpha, 0.05):
        raise ValueError("Pocock table is only provided for two-sided alpha=0.05.")
    table = {2: 2.178, 3: 2.289, 4: 2.361, 5: 2.413, 6: 2.453,
             7: 2.485, 8: 2.512, 9: 2.535, 10: 2.555}
    return [table[K]] * K

def simulate_peeking(z_true: float, information_fracs: List[float], strategy: str = "OBF", alpha: float = 0.05, n_sims: int = 2000) -> pd.DataFrame:
    """
    Simulate z-statistics accumulating over looks.
    - z_true: mean of the z-statistic at the final look (effect size scaled).
    - Each path is built from independent Brownian increments, so z at look i has
      mean z_true*sqrt(t_i), sd 1, and the correct correlation with earlier looks.
    - information_fracs must be strictly increasing.
    - strategy: "OBF" or "Pocock"
    Returns a DataFrame with, per look, the proportion of simulations that first stop there.
    """
    if strategy not in ("OBF", "Pocock"):
        raise ValueError("strategy must be 'OBF' or 'Pocock'")
    K = len(information_fracs)
    if strategy == "OBF":
        bounds = obrien_fleming_boundaries(information_fracs, alpha=alpha)
    else:
        bounds = pocock_boundaries(K, alpha=alpha)

    t_arr = np.asarray(information_fracs, dtype=float)
    dt = np.diff(np.r_[0.0, t_arr])
    if np.any(dt <= 0):
        raise ValueError("information_fracs must be strictly increasing.")

    stops = np.zeros(K, dtype=int)
    for _ in range(n_sims):
        # Score process S(t_i): cumulative sum of independent N(z_true*dt, dt) increments
        s = np.cumsum(np.random.normal(loc=z_true * dt, scale=np.sqrt(dt)))
        z_path = s / np.sqrt(t_arr)
        for i in range(K):
            if abs(z_path[i]) >= bounds[i]:
                stops[i] += 1
                break

    prop = stops / n_sims
    out = pd.DataFrame({"look": np.arange(1, K+1), "info_frac": information_fracs, "crit_z": bounds, "stop_rate": prop})
    return out



### 5.1 Example: boundaries and stopping probability

Four *looks* (25%, 50%, 75%, 100%).  
We compare OBF vs. Pocock under a small true effect (\(z_{true}=1.5\)).


In [None]:

info = [0.25, 0.5, 0.75, 1.0]
obf = obrien_fleming_boundaries(info, alpha=0.05)
poc = pocock_boundaries(len(info), alpha=0.05)

df_obf = simulate_peeking(z_true=1.5, information_fracs=info, strategy="OBF", alpha=0.05, n_sims=2000)
df_poc = simulate_peeking(z_true=1.5, information_fracs=info, strategy="Pocock", alpha=0.05, n_sims=2000)

plt.figure()
plt.plot(info, obf, marker='o', label='OBF critical z')
plt.plot(info, poc, marker='s', label='Pocock critical z')
plt.title("Critical z by information fraction (two-sided α=0.05)")
plt.xlabel("Information fraction"); plt.ylabel("Critical |z|")
plt.legend(); plt.tight_layout(); plt.show()

plt.figure()
plt.plot(df_obf["look"], df_obf["stop_rate"], marker='o', label='OBF stop rate')
plt.plot(df_poc["look"], df_poc["stop_rate"], marker='s', label='Pocock stop rate')
plt.title("Stop rate per look (z_true=1.5)")
plt.xlabel("Look"); plt.ylabel("Proportion stopped at look")
plt.legend(); plt.tight_layout(); plt.show()

df_obf, df_poc



**Reading.** OBF is conservative early and more permissive at the end; Pocock keeps a constant boundary.  
Both control \(\alpha\) when the *looks* are pre-specified and the boundaries are followed, up to the approximation used for the OBF constants above.

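To make the inflation from unadjusted peeking concrete, the sketch below estimates the overall type I error under the null for a constant boundary applied at every look. It simulates correlated z-statistics from Brownian increments. `type1_error` is a hypothetical helper, and 2.361 is the tabulated Pocock constant for four equally spaced looks at two-sided α=0.05.

```python
import numpy as np

def type1_error(bound: float, info_fracs, n_sims: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo P(|z| crosses `bound` at any look) when the true effect is zero."""
    rng = np.random.default_rng(seed)
    t = np.asarray(info_fracs, dtype=float)
    dt = np.diff(np.r_[0.0, t])
    # Independent increments of a driftless Brownian motion observed at t_1 < ... < t_K
    steps = rng.normal(0.0, np.sqrt(dt), size=(n_sims, len(t)))
    z = np.cumsum(steps, axis=1) / np.sqrt(t)
    return float(np.mean(np.any(np.abs(z) >= bound, axis=1)))

looks = [0.25, 0.5, 0.75, 1.0]
naive_rate = type1_error(1.96, looks)     # reuse the fixed-sample threshold at every look
pocock_rate = type1_error(2.361, looks)   # Pocock constant for K=4
print(naive_rate, pocock_rate)
```

With four looks, reusing 1.96 should reject far more often than the nominal 5%, while the Pocock constant should bring the overall rate back close to 5%.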


# 6) Multi-Armed Bandits: Thompson Sampling vs. Fixed A/B

**Goal.** Maximize reward during the test, not only at the end.  
We compare **Thompson Sampling** (Beta-Bernoulli) with a fixed 50/50 policy in terms of **regret**.


In [None]:

from __future__ import annotations
from typing import Tuple
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def simulate_thompson_bernoulli(pA: float, pB: float, T: int = 10000, seed: int = 7) -> Tuple[np.ndarray, np.ndarray]:
    """
    Thompson Sampling for two Bernoulli arms.
    Returns cumulative-regret curves: TS vs. fixed 50/50.
    """
    rng = np.random.default_rng(seed)
    aA=bA=1.0; aB=bB=1.0  # Beta(1,1) priors
    best = max(pA, pB)
    regret_ts = np.zeros(T); regret_fixed = np.zeros(T)
    cum_reg_ts = 0.0; cum_reg_fixed = 0.0

    for t in range(T):
        thetaA = rng.beta(aA, bA); thetaB = rng.beta(aB, bB)
        arm = 0 if thetaA >= thetaB else 1
        rewardA = rng.binomial(1, pA); rewardB = rng.binomial(1, pB)
        reward = rewardA if arm==0 else rewardB
        if arm==0: aA += reward; bA += 1 - reward
        else:      aB += reward; bB += 1 - reward

        cum_reg_ts += (best - (pA if arm==0 else pB))
        regret_ts[t] = cum_reg_ts

        cum_reg_fixed += (best - (0.5*pA + 0.5*pB))
        regret_fixed[t] = cum_reg_fixed

    return regret_ts, regret_fixed

pA, pB = 0.10, 0.11
reg_ts, reg_fx = simulate_thompson_bernoulli(pA, pB, T=8000, seed=11)

plt.figure()
plt.plot(np.arange(1, len(reg_ts)+1), reg_ts, label="Thompson Sampling")
plt.plot(np.arange(1, len(reg_fx)+1), reg_fx, label="Fixed 50/50")
plt.title("Cumulative Regret — TS vs Fixed (pA=0.10, pB=0.11)")
plt.xlabel("Rounds"); plt.ylabel("Cumulative regret")
plt.legend(); plt.tight_layout(); plt.show()



**Note.** Bandits reduce regret but complicate inference. To estimate the average effect from bandit data, use inverse propensity weighting (IPW) and avoid *naive peeking*.



# 7) Decision Playbook: Parameterized Rules

**Input:** observed lift and CI, minimum acceptable **MDE**, benefit/cost, traffic, and horizon.  
**Output:** a `ship` / `hold` / `roll-back` recommendation plus supporting metrics.


In [None]:

from dataclasses import dataclass
from typing import Literal, Tuple

Decision = Literal["ship", "hold", "roll-back"]

@dataclass
class DecisionInputs:
    lift_abs: float
    ci_lo: float
    ci_hi: float
    baseline_rate: float
    mde_abs: float
    benefit_per_conversion: float
    cost_per_user_exposed: float
    traffic_per_day: int
    risk_aversion: float = 1.0
    min_days: int = 7

def decision_playbook(inp: DecisionInputs) -> Tuple[Decision, dict]:
    if not (0.0 <= inp.baseline_rate <= 1.0):
        raise ValueError("baseline_rate must be in [0,1].")
    if inp.mde_abs <= 0.0:
        raise ValueError("mde_abs must be positive.")
    if inp.traffic_per_day <= 0:
        raise ValueError("traffic_per_day must be positive.")

    adj_lift = max(inp.ci_lo * inp.risk_aversion, inp.lift_abs / 2.0)

    delta_conv_per_user = adj_lift
    delta_conversions_day = delta_conv_per_user * inp.traffic_per_day
    gross_benefit_day = delta_conversions_day * inp.benefit_per_conversion
    cost_day = inp.traffic_per_day * inp.cost_per_user_exposed
    net_benefit_horizon = (gross_benefit_day - cost_day) * inp.min_days

    meets_mde = abs(adj_lift) >= inp.mde_abs
    sig_positive = inp.ci_lo > 0.0
    sig_negative = inp.ci_hi < 0.0

    if sig_negative:
        decision: Decision = "roll-back"
    elif sig_positive and meets_mde and net_benefit_horizon > 0:
        decision = "ship"
    elif meets_mde and net_benefit_horizon > 0 and inp.lift_abs > 0:
        decision = "ship"
    else:
        decision = "hold"

    return decision, {
        "risk_adjusted_lift": adj_lift,
        "meets_mde": meets_mde,
        "sig_positive": sig_positive,
        "sig_negative": sig_negative,
        "net_benefit_horizon": net_benefit_horizon,
    }

# Example
example = DecisionInputs(
    lift_abs=0.002,
    ci_lo=-0.001, ci_hi=0.005,
    baseline_rate=0.12,
    mde_abs=0.003,
    benefit_per_conversion=50.0,
    cost_per_user_exposed=0.02,
    traffic_per_day=20000,
    risk_aversion=1.0,
    min_days=14,
)
decision, details = decision_playbook(example)
decision, details



## 6.1 Inference with Bandits: IPW (Inverse Propensity Weighting)

When allocation is **not fixed** (e.g. **Thompson Sampling**), the treatment probability varies over time and by user.
To obtain an approximately **unbiased** (consistent) estimate of the average treatment effect (ATE), we **weight** each observation by the **inverse of its propensity** (the probability of having received the observed arm).

**Basic IPW idea (binary outcome):**
\[
\widehat{\Delta}_{IPW}
= \frac{\sum_i w_i \, T_i \, Y_i}{\sum_i w_i \, T_i}
- \frac{\sum_i w_i \, (1-T_i) \, Y_i}{\sum_i w_i \, (1-T_i)},
\quad
w_i=\frac{1}{\Pr(A_i=T_i \mid \text{history}_i)}
\]

- \(T_i\in\{0,1\}\) is the treatment, \(Y_i\in\{0,1\}\) the outcome.
- \(\Pr(A_i=T_i \mid \cdot)\) is the **propensity** (the probability of selecting the arm) **at the time** of the decision.
- In TS, this propensity comes from the **posterior** at step \(i\) (we can log it during the test).

Below, we simulate a simple bandit that **logs the propensities as it runs** and estimate the ATE via **IPW** (simple and stabilized).


In [None]:

from __future__ import annotations
import numpy as np
import pandas as pd

def simulate_ts_with_propensity(pA: float, pB: float, T: int = 20000, seed: int = 7) -> pd.DataFrame:
    """
    Thompson Sampling (two Bernoulli arms) that LOGS the propensity of the arm actually served.
    Returns a DataFrame with columns: t, arm, reward, prop
    - arm ∈ {0,1}
    - prop = P(choose 'arm' at that step), estimated from the current Beta posterior.
    """
    rng = np.random.default_rng(seed)
    aA=bA=1.0; aB=bB=1.0
    rows = []
    for t in range(T):
        # Posterior draws used for the decision
        thetaA = rng.beta(aA, bA); thetaB = rng.beta(aB, bB)
        # The probability of choosing A is P(thetaA >= thetaB) under the current posterior.
        # Estimate it with a quick Monte Carlo (200 draws, so the estimate can be exactly 0 or 1):
        thetasA = rng.beta(aA, bA, size=200)
        thetasB = rng.beta(aB, bB, size=200)
        p_choose_A = float(np.mean(thetasA >= thetasB))
        p_choose_B = 1.0 - p_choose_A

        arm = 0 if thetaA >= thetaB else 1
        rewardA = rng.binomial(1, pA)
        rewardB = rng.binomial(1, pB)
        reward = rewardA if arm==0 else rewardB
        if arm==0:
            aA += reward; bA += 1 - reward
            prop = p_choose_A
        else:
            aB += reward; bB += 1 - reward
            prop = p_choose_B

        rows.append((t, arm, reward, prop))
    return pd.DataFrame(rows, columns=["t","arm","reward","prop"])

def ipw_ate(df: pd.DataFrame) -> dict:
    """
    Estimate the ATE via IPW (simple and stabilized).
    Expects columns: arm ∈ {0,1}, reward ∈ {0,1}, prop ∈ (0,1].
    """
    if not set(["arm","reward","prop"]).issubset(df.columns):
        raise ValueError("Columns arm, reward, prop are required.")
    # Inverse weights. The Monte Carlo propensities can be exactly 0, which would give
    # infinite weights, so clip at 0.0025 (half the resolution of a 200-draw estimate).
    # Clipping trades a little bias for a large reduction in variance.
    w = 1.0 / np.clip(df["prop"].to_numpy(), 0.0025, 1.0)
    t = df["arm"].to_numpy()
    y = df["reward"].to_numpy()

    # Self-normalized IPW for each arm
    w_t = w * t
    w_c = w * (1 - t)
    p1_ipw = (w_t * y).sum() / max(w_t.sum(), 1e-12)
    p0_ipw = (w_c * y).sum() / max(w_c.sum(), 1e-12)
    delta_ipw = p1_ipw - p0_ipw

    # Stabilized IPW: multiply the weights by the arm's marginal propensity.
    # The constant cancels in a self-normalized ratio, so these equal the values above.
    pi1 = float(t.mean())
    pi0 = 1.0 - pi1
    w_stab_t = w_t * pi1
    w_stab_c = w_c * pi0
    p1_stab = (w_stab_t * y).sum() / max(w_stab_t.sum(), 1e-12)
    p0_stab = (w_stab_c * y).sum() / max(w_stab_c.sum(), 1e-12)
    delta_stab = p1_stab - p0_stab

    return {
        "p1_ipw": p1_ipw, "p0_ipw": p0_ipw, "delta_ipw": delta_ipw,
        "p1_ipw_stab": p1_stab, "p0_ipw_stab": p0_stab, "delta_ipw_stab": delta_stab,
        "pi1_observed": pi1
    }

# Demo with a small lift
sim = simulate_ts_with_propensity(pA=0.10, pB=0.11, T=40000, seed=42)
res = ipw_ate(sim)
res



**Reading.** `delta_ipw` and `delta_ipw_stab` approximate the **average effect** under adaptive allocation (the two coincide here because the estimator is self-normalized).  
In production: always log the **propensities** per request (or estimate them with a faithful **policy model**) and report **intervals** (block/time bootstrap) for uncertainty.

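The sketch below illustrates why logged propensities matter. It uses a deliberately simple, non-adaptive logging policy whose treatment probability changes halfway through while the baseline rate drifts. The naive difference in means mixes the time trend into the effect, whereas IPW with the logged propensities recovers it. All numbers are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
half = n // 2
prop1 = np.r_[np.full(half, 0.2), np.full(n - half, 0.8)]    # logged P(arm=1) per request
base = np.r_[np.full(half, 0.05), np.full(n - half, 0.15)]   # control rate drifts over time
true_ate = 0.01

arm = rng.binomial(1, prop1)
reward = rng.binomial(1, base + true_ate * arm)

# Naive comparison ignores that arm 1 was served mostly in the high-baseline period
naive = float(reward[arm == 1].mean() - reward[arm == 0].mean())

# Self-normalized IPW using the propensity of the arm actually served
prop_chosen = np.where(arm == 1, prop1, 1.0 - prop1)
w = 1.0 / prop_chosen
ipw = float(np.sum(w * arm * reward) / np.sum(w * arm)
            - np.sum(w * (1 - arm) * reward) / np.sum(w * (1 - arm)))
print(naive, ipw)
```

A real bandit changes its propensities at every step instead of once, but the mechanism is the same: the weights undo the over-representation of an arm in periods where it was favored.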


## 5.2 Lan–DeMets Spending Functions: OBF and Pocock (generalization)

Instead of fixed tables, **Lan–DeMets** treats type I error control as a ***spending* function \(\alpha(t)\)** over the information fraction \(t\in(0,1]\).
Two classic choices (approximately) reproduce OBF and Pocock:

- **OBF-like:** \(\alpha(t) \approx 2 - 2\,\Phi\!\left(\frac{z_{\alpha/2}}{\sqrt{t}}\right)\), where \(z_{\alpha/2}=\Phi^{-1}(1-\alpha/2)\).  
  *Saves* \(\alpha\) early and **spends** it late.
- **Pocock-like:** \(\alpha(t) \approx \alpha \,\log\!\big(1 + (e-1)\,t\big)\).  
  **More uniform** spending over time.

For discrete *looks* \(t_1<\dots<t_K\), the *spending* per *look* is \(\alpha_i=\alpha(t_i)-\alpha(t_{i-1})\) with \(\alpha(t_0)=0\), and we can define an approximate **critical** threshold as
\(z_i \approx \Phi^{-1}\!\big(1-\alpha_i/2\big)\) (two-sided). This shortcut ignores the correlation between looks, so it is conservative after the first look.


In [None]:

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import norm  # SciPy is already installed as a statsmodels dependency

def Phi(z):
    """Standard normal CDF; accepts scalars or NumPy arrays (math.erf is scalar-only)."""
    return norm.cdf(z)

def Phi_inv(p):
    """Standard normal inverse CDF; accepts scalars or NumPy arrays (`math` has no erfcinv)."""
    return norm.ppf(p)

def spending_obf(t: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    z = Phi_inv(1 - alpha/2.0)
    return 2 - 2*Phi(z/np.sqrt(np.clip(t, 1e-12, 1.0)))

def spending_pocock(t: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    return alpha * np.log(1 + (math.e - 1)*t)

def ld_boundaries(info_fracs, alpha=0.05, kind="OBF"):
    t = np.asarray(info_fracs, dtype=float)
    if kind.upper() == "OBF":
        A = spending_obf(t, alpha=alpha)
    else:
        A = spending_pocock(t, alpha=alpha)
    A_prev = np.r_[0.0, A[:-1]]
    alpha_i = np.clip(A - A_prev, 1e-10, 1.0)  # incremental spending
    # approximate two-sided critical z for each look
    z_i = Phi_inv(1 - alpha_i/2.0)
    return pd.DataFrame({"look": np.arange(1, len(t)+1), "t": t, "alpha_cum": A, "alpha_inc": alpha_i, "crit_z": z_i})

# Example grid of information fractions
t = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
df_obf = ld_boundaries(t, alpha=0.05, kind="OBF")
df_poc = ld_boundaries(t, alpha=0.05, kind="POCOCK")

# Plot spending
plt.figure()
plt.plot(df_obf["t"], df_obf["alpha_cum"], marker='o', label="OBF-like α(t)")
plt.plot(df_poc["t"], df_poc["alpha_cum"], marker='s', label="Pocock-like α(t)")
plt.title("Lan–DeMets α-spending (two-sided α=0.05)")
plt.xlabel("Information fraction t"); plt.ylabel("Cumulative α(t)")
plt.legend(); plt.tight_layout(); plt.show()

# Plot critical z per look
plt.figure()
plt.plot(df_obf["t"], df_obf["crit_z"], marker='o', label="OBF-like critical z")
plt.plot(df_poc["t"], df_poc["crit_z"], marker='s', label="Pocock-like critical z")
plt.title("Approximate critical |z| per look via α-spending")
plt.xlabel("Information fraction t"); plt.ylabel("Critical |z|")
plt.legend(); plt.tight_layout(); plt.show()

df_obf, df_poc



**Practical notes.**
- The implementation above provides **useful approximations** for planning and communication.
- In regulated environments, use validated **group-sequential design** libraries that solve for the boundaries exactly.
- For **unbalanced** tests (1:m), replace the information fraction with a suitable measure (e.g. cumulative Fisher information).

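For the unbalanced case in the last note, one simple stand-in for the information fraction is the ratio of accrued to planned information for a difference in means, which is proportional to \(1/(1/n_A + 1/n_B)\) when the outcome variance is common and constant. The helper below is a hypothetical sketch under that assumption.

```python
def information_fraction(n_a_now: int, n_b_now: int, n_a_final: int, n_b_final: int) -> float:
    """Accrued / planned information for a two-arm difference (common, constant variance assumed)."""
    if min(n_a_now, n_b_now, n_a_final, n_b_final) <= 0:
        raise ValueError("All sample sizes must be positive.")
    info_now = 1.0 / (1.0 / n_a_now + 1.0 / n_b_now)
    info_final = 1.0 / (1.0 / n_a_final + 1.0 / n_b_final)
    return info_now / info_final

# Planned 1:3 design with 4,000 vs 12,000 users (hypothetical numbers)
t_on_plan = information_fraction(1_000, 3_000, 4_000, 12_000)  # interim keeps the 1:3 ratio
t_off_plan = information_fraction(2_000, 2_000, 4_000, 12_000) # same total users, balanced interim
print(t_on_plan, t_off_plan)
```

The two interims contain the same number of users but carry different amounts of information, which is why raw sample fractions can be misleading when allocation drifts from the plan.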


## 6.2 CIs with Bandits: **Temporal Block Bootstrap**

With **bandits**, observations are **dependent** over time (propensities change as the *posterior* is updated).
For more realistic confidence intervals, we use a **time-based block bootstrap**:

1. Choose a **block size** \(L\) (e.g. 200–1000 rounds) that covers the local dependence.  
2. Partition the time series into contiguous blocks.  
3. Resample blocks **with replacement** until a series of the same length is rebuilt.  
4. Recompute the estimator (IPW) on each *replicate* and take the \([2.5\%, 97.5\%]\) quantiles.

Below: the function `block_bootstrap_ipw` and a quick example on the `sim` simulation created in section 6.1.


In [None]:

import numpy as np
import pandas as pd

def block_bootstrap_ipw(df: pd.DataFrame, block_size: int = 500, B: int = 400, seed: int = 123) -> tuple[float, float]:
    """
    Temporal block bootstrap of delta_IPW.
    - df: must contain columns ['t','arm','reward','prop'] from the bandit simulation.
    - block_size: length of each contiguous time block.
    - B: number of bootstrap replicates.
    Returns (lo, hi), a ~95% percentile CI.
    """
    rng = np.random.default_rng(seed)

    # Sanity check
    req = {'t','arm','reward','prop'}
    if not req.issubset(df.columns):
        raise ValueError(f"df must include columns {req}")

    n = len(df)
    # Build contiguous blocks
    blocks = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        blocks.append(df.iloc[start:end])

    # Simple IPW (no stabilization) for speed
    def ipw_delta_local(d: pd.DataFrame) -> float:
        w = 1.0 / np.clip(d["prop"].to_numpy(), 1e-12, None)
        t = d["arm"].to_numpy()
        y = d["reward"].to_numpy()
        w_t = w * t
        w_c = w * (1 - t)
        p1 = (w_t * y).sum() / max(w_t.sum(), 1e-12)
        p0 = (w_c * y).sum() / max(w_c.sum(), 1e-12)
        return float(p1 - p0)

    # Bootstrap
    boots = np.empty(B, dtype=float)
    m = len(blocks)
    for b in range(B):
        # Sample blocks with replacement and concatenate
        idx = rng.integers(0, m, size=m)
        df_star = pd.concat([blocks[i] for i in idx], ignore_index=True).iloc[:n]
        boots[b] = ipw_delta_local(df_star)

    lo = float(np.quantile(boots, 0.025))
    hi = float(np.quantile(boots, 0.975))
    lo, hi

# Example: block-bootstrap CIs for the IPW estimate on the simulation (section 6.1)
ci_lo_ipw, ci_hi_ipw = block_bootstrap_ipw(sim, block_size=800, B=300, seed=777)
{"ipw_block_bootstrap_CI95": (ci_lo_ipw, ci_hi_ipw)}



**Practical notes.**
- **Sensitivity to block size**: increase the number of replicates \(B\) and vary `block_size` in *stress tests*.  
- For real logs, use a **block bootstrap by day/slot** (respecting traffic resets and seasonality).
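
The day-level variant mentioned in the notes can be sketched as follows. This is a minimal illustration, not the notebook's method: the `day` column, the helper names `day_block_bootstrap` and `ipw_delta`, and the toy log are all assumptions made up for this example. The toy log uses a fixed 50/50 split so that `prop = 0.5` is unambiguous.

```python
import numpy as np
import pandas as pd

def ipw_delta(d: pd.DataFrame) -> float:
    # Self-normalized IPW contrast between arm 1 and arm 0.
    w = 1.0 / np.clip(d["prop"].to_numpy(), 1e-12, None)
    t = d["arm"].to_numpy()
    y = d["reward"].to_numpy()
    p1 = (w * t * y).sum() / max((w * t).sum(), 1e-12)
    p0 = (w * (1 - t) * y).sum() / max((w * (1 - t)).sum(), 1e-12)
    return float(p1 - p0)

def day_block_bootstrap(df, stat, day_col="day", B=300, seed=0):
    """Resample whole days with replacement; return a 95% percentile CI for `stat`."""
    rng = np.random.default_rng(seed)
    days = [g for _, g in df.groupby(day_col, sort=True)]  # one block per day
    m = len(days)
    boots = np.empty(B, dtype=float)
    for b in range(B):
        idx = rng.integers(0, m, size=m)
        boots[b] = stat(pd.concat([days[i] for i in idx], ignore_index=True))
    return float(np.quantile(boots, 0.025)), float(np.quantile(boots, 0.975))

# Hypothetical toy log: 14 days, day-level baseline shifts, true lift of 0.02.
rng = np.random.default_rng(1)
day = np.repeat(np.arange(14), 2000)
day_effect = rng.normal(0.0, 0.02, size=14)[day]
arm = rng.binomial(1, 0.5, size=day.size)
reward = rng.binomial(1, np.clip(0.10 + day_effect + 0.02 * arm, 0.01, 0.99))
toy = pd.DataFrame({"day": day, "arm": arm, "reward": reward, "prop": 0.5})

lo_day, hi_day = day_block_bootstrap(toy, ipw_delta, B=200, seed=2)
```

Unlike fixed-size blocks, days can have different row counts, so each replicate may differ slightly in length; with few days the percentile interval is coarse and should be read with caution.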



## 6.3 **Doubly‑Robust (DR) Estimator**: Consistency under *misspecification*

The **DR** estimator combines **IPW** with an **outcome model** \(m_t(x)=\mathbb{E}[Y\mid X=x, T=t]\).
It is **consistent** if **either** the propensity \(e(x)=P(T=1\mid X)\) **or** the outcome model is correct (they need not both be).

**Form (binary treatment, known/estimated propensity \(e_i\))**:
\[
\widehat{\tau}_{DR}
= \frac{1}{n}\sum_i \Big[ \big(m_1(x_i)-m_0(x_i)\big)
+ \frac{T_i(Y_i - m_1(x_i))}{e_i}
- \frac{(1-T_i)(Y_i - m_0(x_i))}{1-e_i} \Big].
\]
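
A short sketch of why the double robustness holds, shown for the treated-arm part of the estimator (the control part is symmetric). It assumes unconfoundedness given \(X\); the symbols \(\mu_1(x)=\mathbb{E}[Y\mid X=x,T=1]\) and \(e^{*}(x)\) for the *true* outcome mean and propensity are notation introduced here for illustration, while \(m_1\) and \(e\) are the working models actually plugged in:
\[
\mathbb{E}\Big[\, m_1(X) + \frac{T\,(Y - m_1(X))}{e(X)} \,\Big|\, X=x \Big]
= \mu_1(x) + \Big(\frac{e^{*}(x)}{e(x)} - 1\Big)\big(\mu_1(x) - m_1(x)\big).
\]
The error term is a **product** of the propensity error and the outcome-model error, so it vanishes when either \(e = e^{*}\) or \(m_1 = \mu_1\).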

Below, we show:
1. A **simple outcome model** (constant per arm), robust and stable.  
2. A version with **logistic regression** (for when *features* \(X\) are available), developed in section 6.4.  
3. **CIs via block bootstrap** for \(\widehat{\tau}_{DR}\).


In [None]:

import numpy as np
import pandas as pd

def dr_constant_outcome(df: pd.DataFrame) -> float:
    """
    DR with outcome models that are constant per arm:
    m1(x)=mean(Y|T=1), m0(x)=mean(Y|T=0).
    Uses the logged propensities "prop", read here as e = P(arm=1).
    """
    req = {'arm','reward','prop'}
    if not req.issubset(df.columns):
        raise ValueError(f"df must include columns {req}")
    t = df['arm'].to_numpy()
    y = df['reward'].to_numpy()
    e = np.clip(df['prop'].to_numpy(), 1e-6, 1-1e-6)

    # Outcome models (constants)
    m1 = float(df.loc[df['arm']==1, 'reward'].mean())
    m0 = float(df.loc[df['arm']==0, 'reward'].mean())

    term = (m1 - m0) + (t * (y - m1) / e) - ((1 - t) * (y - m0) / (1 - e))
    return float(np.mean(term))

def block_bootstrap_dr(df: pd.DataFrame, block_size: int = 500, B: int = 300, seed: int = 123) -> tuple[float,float,float]:
    """
    Temporal block bootstrap for the DR estimator (constant per arm).
    Returns (tau_dr_hat, lo, hi).
    """
    rng = np.random.default_rng(seed)
    n = len(df)
    # Build contiguous blocks
    blocks = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        blocks.append(df.iloc[start:end])

    tau0 = dr_constant_outcome(df)

    boots = np.empty(B, dtype=float)
    m = len(blocks)
    for b in range(B):
        idx = rng.integers(0, m, size=m)
        df_star = pd.concat([blocks[i] for i in idx], ignore_index=True).iloc[:n]
        boots[b] = dr_constant_outcome(df_star)

    lo = float(np.quantile(boots, 0.025))
    hi = float(np.quantile(boots, 0.975))
    return float(tau0), lo, hi

tau_dr, lo_dr, hi_dr = block_bootstrap_dr(sim, block_size=800, B=300, seed=2024)
{"dr_tau_hat": tau_dr, "dr_block_bootstrap_CI95": (lo_dr, hi_dr)}



**Which one to use when?**
- **IPW** is simple when you have **reliable propensities** (logged).  
- **DR** offers **robustness**: if the outcome model is reasonable (even a simple one), you gain protection when the propensities are slightly misspecified, and vice versa.



## 6.4 Doubly‑Robust with **Logistic Outcome Models** (with user features)

In many real experiments you have per‑user covariates **X** (e.g., device, geo, prior activity).  
To improve efficiency and robustness, we can fit **two logistic outcome models**:
\(
m_1(x)=\Pr(Y=1\mid X=x, T=1),\quad
m_0(x)=\Pr(Y=1\mid X=x, T=0)
\)
and plug them into the DR formula:
\[
\widehat{\tau}_{DR}
= \frac{1}{n}\sum_i \Big[ (m_1(x_i)-m_0(x_i))
+ \frac{T_i(Y_i - m_1(x_i))}{e_i}
- \frac{(1-T_i)(Y_i - m_0(x_i))}{1-e_i} \Big].
\]

Below we **simulate a confounded scenario** where the treatment propensity depends on **X**, and the outcome also depends on **X** and **T**.  
We compare **IPW** vs **DR (logistic)** in terms of bias and variability.


In [None]:

from __future__ import annotations
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_confounded(n: int = 20000, seed: int = 123) -> pd.DataFrame:
    """
    Simulate binary outcome with confounding via X -> treatment and X -> outcome.
    - Generate features X ~ N(0,1)^d
    - Propensity e(x) = sigmoid(bias_e + w_e^T x)
    - Outcome logit: logit P(Y=1 | X, T) = bias_y + w_y^T x + tau_true * T
    Returns DataFrame with X columns, T, Y, and true e(x).
    """
    rng = np.random.default_rng(seed)
    d = 4
    X = rng.normal(size=(n, d))
    # Propensity model
    w_e = np.array([0.8, -0.6, 0.4, 0.2])
    bias_e = -0.2
    e = sigmoid(bias_e + X @ w_e)
    T = rng.binomial(1, e)
    # Outcome model
    w_y = np.array([0.5, 0.3, -0.2, 0.1])
    bias_y = -1.2
    tau_true = 0.12  # log-odds lift due to treatment
    # Convert to probability
    lin = bias_y + X @ w_y + tau_true * T
    p = sigmoid(lin)
    Y = rng.binomial(1, p)
    cols = {f"x{j+1}": X[:, j] for j in range(d)}
    df = pd.DataFrame(cols)
    df["T"] = T
    df["Y"] = Y
    df["e_true"] = e
    # For fair comparison, we assume we know/estimate e(x).
    # In practice, you'd fit a separate propensity model on (X,T).
    return df

def ipw_ate_from_df(df: pd.DataFrame, e_col: str = "e_true") -> float:
    t = df["T"].to_numpy(); y = df["Y"].to_numpy()
    e = np.clip(df[e_col].to_numpy(), 1e-6, 1-1e-6)
    w_t = t / e
    w_c = (1 - t) / (1 - e)
    p1 = (w_t * y).sum() / max(w_t.sum(), 1e-12)
    p0 = (w_c * y).sum() / max(w_c.sum(), 1e-12)
    return float(p1 - p0)

def dr_ate_logistic(df: pd.DataFrame, e_col: str = "e_true") -> float:
    """
    DR estimator with two logistic outcome models.
    """
    X = df[[c for c in df.columns if c.startswith("x")]].to_numpy()
    t = df["T"].to_numpy()
    y = df["Y"].to_numpy()
    e = np.clip(df[e_col].to_numpy(), 1e-6, 1-1e-6)

    # Fit outcome models separately by arm
    X1 = X[t == 1]; y1 = y[t == 1]
    X0 = X[t == 0]; y0 = y[t == 0]
    m1_lr = LogisticRegression(max_iter=1000).fit(X1, y1)
    m0_lr = LogisticRegression(max_iter=1000).fit(X0, y0)

    m1_hat = m1_lr.predict_proba(X)[:, 1]
    m0_hat = m0_lr.predict_proba(X)[:, 1]

    term = (m1_hat - m0_hat) + (t * (y - m1_hat) / e) - ((1 - t) * (y - m0_hat) / (1 - e))
    return float(np.mean(term))

# Single-run demo
df_sim = simulate_confounded(n=30000, seed=2025)
ate_ipw = ipw_ate_from_df(df_sim, e_col="e_true")
ate_dr  = dr_ate_logistic(df_sim, e_col="e_true")
{"IPW_ATE": ate_ipw, "DR_logistic_ATE": ate_dr}
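
The demo above passes the true propensity `e_true`. With observational logs the propensity is usually unknown and has to be estimated from `(X, T)`. The following is a minimal sketch of that step, assuming a logistic propensity model is adequate; the variable names (`prop_model`, `e_hat`) and the small inline data set are illustrative, and the coefficients simply mirror those in `simulate_confounded` so the fit can be checked against the truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Inline toy data with the same propensity coefficients as simulate_confounded.
rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=(n, 4))
w_e = np.array([0.8, -0.6, 0.4, 0.2])
e_true = 1.0 / (1.0 + np.exp(-(-0.2 + X @ w_e)))
T = rng.binomial(1, e_true)

# Fit P(T=1 | X) and clip to keep the inverse weights bounded.
prop_model = LogisticRegression(max_iter=1000).fit(X, T)
e_hat = np.clip(prop_model.predict_proba(X)[:, 1], 1e-6, 1 - 1e-6)

# Quick diagnostics: fit quality (only possible in simulation) and overlap.
mean_abs_err = float(np.mean(np.abs(e_hat - e_true)))
overlap_range = (float(e_hat.min()), float(e_hat.max()))
```

To use the estimate with the functions above, one option is to store it as a column (for example `df["e_hat"] = e_hat`) and pass `e_col="e_hat"`. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, which shrinks coefficients slightly.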



### Repeated simulation: bias/variance comparison

We repeat the simulation **R** times to inspect sampling variability and bias. The treatment effect in the DGP is a **log‑odds** lift of 0.12.  
Note: IPW/DR estimate a **difference in probabilities**, so their output is not directly comparable to 0.12; bias has to be judged against the ATE on the probability scale.  
The sign and **relative** performance (variance and bias) are the focus here.
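
To judge bias in absolute terms, the log‑odds lift can be converted into the implied ATE on the probability scale by Monte Carlo. The sketch below assumes the same outcome coefficients as `simulate_confounded` (copied by hand, so they must be kept in sync); the function name `true_ate_probability_scale` is made up for this example.

```python
import numpy as np

def true_ate_probability_scale(n_mc: int = 500_000, seed: int = 0) -> float:
    """Monte Carlo ATE = E[ P(Y=1 | X, T=1) - P(Y=1 | X, T=0) ] under the DGP."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_mc, 4))
    # Outcome-model parameters copied from simulate_confounded.
    w_y = np.array([0.5, 0.3, -0.2, 0.1])
    bias_y = -1.2
    tau_true = 0.12
    base = bias_y + X @ w_y
    p1 = 1.0 / (1.0 + np.exp(-(base + tau_true)))  # everyone treated
    p0 = 1.0 / (1.0 + np.exp(-base))               # no one treated
    return float(np.mean(p1 - p0))

ate_true = true_ate_probability_scale()
```

The means of the `ipw` and `dr_logit` columns can then be compared against `ate_true` rather than against 0.12.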


In [None]:

import numpy as np
import pandas as pd

def repeat_compare(R: int = 100, n: int = 20000, seed: int = 9) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    rows = []
    for r in range(R):
        df = simulate_confounded(n=n, seed=int(rng.integers(0, 10_000_000)))
        ipw = ipw_ate_from_df(df, e_col="e_true")
        dr  = dr_ate_logistic(df, e_col="e_true")
        rows.append((ipw, dr))
    out = pd.DataFrame(rows, columns=["ipw", "dr_logit"])
    return out

res = repeat_compare(R=120, n=20000, seed=17)
summary = res.agg(["mean","std","median","min","max"])
ax = res.plot(kind="hist", bins=40, alpha=0.6)
ax.set_title("Sampling distribution: IPW vs DR (logistic)")
ax.set_xlabel("Estimated ATE (probability difference)")
ax.set_ylabel("Frequency")
summary



**Takeaways.**
- **DR(logistic)** tends to have **lower variance** and often **smaller bias** than pure IPW, especially when outcome models are reasonably specified.
- With **misspecification**, DR remains **consistent** if *either* the propensity (here known) *or* the outcome model is correct.
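
The second takeaway can be checked numerically. The sketch below is a deliberately simple, hypothetical setup (one binary covariate, true ATE of 0.05) rather than the simulation above: it deliberately breaks the propensity, the outcome model, or both, and compares the resulting estimates. All names are illustrative.

```python
import numpy as np

def dr_ate(y, t, e, m1, m0):
    # Doubly-robust (AIPW) ATE with propensities e = P(T=1 | X).
    term = (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
    return float(np.mean(term))

def ipw_ate(y, t, e):
    # Self-normalized IPW contrast.
    w_t, w_c = t / e, (1 - t) / (1 - e)
    return float((w_t * y).sum() / w_t.sum() - (w_c * y).sum() / w_c.sum())

rng = np.random.default_rng(0)
n = 200_000
x = rng.binomial(1, 0.5, size=n)
e_true = np.where(x == 1, 0.8, 0.2)               # treatment depends on x
t = rng.binomial(1, e_true)
y = rng.binomial(1, 0.20 + 0.30 * x + 0.05 * t)   # true ATE = 0.05

def cell_means(arm):
    # Correct (saturated) outcome model: mean of y per (x, arm) cell.
    means = np.array([y[(x == v) & (t == arm)].mean() for v in (0, 1)])
    return means[x]

m1_ok, m0_ok = cell_means(1), cell_means(0)
m1_bad = np.full(n, y[t == 1].mean())             # wrong: ignores x
m0_bad = np.full(n, y[t == 0].mean())
e_bad = np.full(n, 0.5)                           # wrong: ignores x

results = {
    "ipw_wrong_e": ipw_ate(y, t, e_bad),
    "dr_wrong_e_right_m": dr_ate(y, t, e_bad, m1_ok, m0_ok),
    "dr_right_e_wrong_m": dr_ate(y, t, e_true, m1_bad, m0_bad),
    "dr_both_wrong": dr_ate(y, t, e_bad, m1_bad, m0_bad),
}
```

In this setup the two DR variants with one correct ingredient land near 0.05, while IPW with the wrong propensity and DR with both ingredients wrong inherit the confounding bias.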



## 8) **Experiment Launch Checklist** (Operational & Statistical)

Use this checklist to avoid preventable failures and to keep your results decision‑grade.

**Design & Instrumentation**
- ☑ Clearly define the **primary metric** (and its unit of analysis) and the **direction** of improvement.  
- ☑ Pre-register **stopping rules**: fixed‑horizon or **group‑sequential** (Pocock/OBF/Lan–DeMets).  
- ☑ Define **exposure eligibility** and **unit consistency** (user/session/pageview).  
- ☑ Implement **SRM checks** (sample ratio mismatch) on traffic split and funnels.  
- ☑ Log **timestamps**, **assignments**, **propensities** (if adaptive), and **covariates**.
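
As a reminder of what the SRM item involves, here is a minimal two-arm chi‑square check using only the standard library. The function name and the example counts are hypothetical; for one degree of freedom the chi‑square tail probability equals `erfc(sqrt(chi2 / 2))`, which avoids a SciPy dependency.

```python
import math

def srm_chi2_two_arms(n_control: int, n_treatment: int,
                      expected_treatment_share: float = 0.5) -> tuple[float, float]:
    """Chi-square goodness-of-fit test of the observed split against the planned split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function for 1 degree of freedom
    return chi2, p_value

# Hypothetical counts: a planned 50/50 split that came out 5000 vs 5300.
chi2_srm, p_srm = srm_chi2_two_arms(5000, 5300)
```

A very small p-value points to an assignment or logging problem that should be investigated before the metrics are read; the alert threshold itself is a team choice.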

**Power/MDE & Risks**
- ☑ Compute **MDE** at the baseline rate and traffic constraints; ensure business‑relevant effects are detectable.  
- ☑ Set a **maximum duration** to avoid state drift (seasonality, product changes).  
- ☑ If multiple KPIs, plan **multiplicity correction** (Holm or hierarchical testing).
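
A rough MDE calculation for the first item, as a sketch under simplifying assumptions: equal arm sizes, a two-sided test, and the baseline variance used for both arms (a common approximation). The function name and example inputs are illustrative.

```python
import math
from statistics import NormalDist

def mde_two_proportions(p_base: float, n_per_arm: int,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate absolute MDE for a two-proportion test with equal arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    se = math.sqrt(2 * p_base * (1 - p_base) / n_per_arm)
    return (z_alpha + z_power) * se

# Hypothetical inputs: 10% baseline conversion, 10,000 users per arm.
mde_abs = mde_two_proportions(0.10, 10_000)
```

If `mde_abs` is larger than the smallest effect the business cares about, the test needs more traffic, a longer run, or variance reduction.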

**Analysis Plan**
- ☑ Primary analysis: two‑proportion test **with CIs**; keep a **GLM with day fixed effects** as robustness.  
- ☑ If pre‑treatment covariates exist, consider **CUPED** or regression adjustment.  
- ☑ If adaptive allocation (**bandits**), use **IPW/DR** + **time‑block bootstrap** for uncertainty.  
- ☑ Guard against **peeking**: sequential boundaries / α‑spending. Store **look timestamps**.
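
For the peeking item, the amount of α that may be spent by each look can be tabulated with the standard Lan–DeMets spending functions. The sketch below computes only the **cumulative α spent** at each information fraction; turning that into z‑boundaries requires numerical integration that is not shown here. Function names and the look schedule are illustrative.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()

def obf_type_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), t in (0, 1]."""
    z = _NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _NORMAL.cdf(z / math.sqrt(t)))

def pocock_type_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: alpha * ln(1 + (e - 1) * t), t in (0, 1]."""
    return alpha * math.log(1 + (math.e - 1) * t)

# Hypothetical schedule: four equally spaced looks.
looks = [0.25, 0.50, 0.75, 1.00]
schedule = [(t, obf_type_spending(t), pocock_type_spending(t)) for t in looks]
```

The O'Brien–Fleming-type function spends very little α early and most of it at the final look, while the Pocock-type function spends it more evenly.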

**Decision Framework**
- ☑ Predefine **business rules** (MDE threshold, benefit/cost, risk tolerance).  
- ☑ Use the **Decision Playbook** function to formalise **ship / hold / roll‑back**.  
- ☑ For wins, plan a **ramp plan** (e.g., 10% → 25% → 50% → 100%) with monitoring gates.  
- ☑ For losses, document learnings and update hypotheses/backlog.

**Post‑Mortem & Governance**
- ☑ Archive notebook, raw exports, and code versions (commit hash, environment).  
- ☑ Summarise findings with **CIs**, effect sizes, and operational notes (incidents, outages).  
- ☑ Maintain an **Experiment Registry** with metadata (owner, metrics, dates, links, blocking rules).
