
# Cookie Cats A/B Testing Playbook (gate_30 vs gate_40)

This notebook is a self‑contained A/B testing analysis of the well‑known **Cookie Cats** mobile game dataset.
The experiment changes the level at which a gate is introduced (**30 vs 40**) and evaluates impact on **1‑day** and **7‑day** retention.

> **Reproducibility:** This notebook is written to be run end‑to‑end. Markdown is in **English** throughout.



## 0) What you'll learn & do

- Load and validate the Cookie Cats A/B dataset.
- Sanity checks including **SRM** (sample ratio mismatch).
- Explore distributions and retention metrics (1‑day and 7‑day).
- Frequentist inference: **two‑proportion z‑tests** and **bootstrap** confidence intervals.
- Model‑based view: **GLM (Logit)** with fixed effects.
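- **CUPED**‑style variance reduction using `sum_gamerounds` as the covariate. Strictly, CUPED calls for a pre‑experiment covariate; `sum_gamerounds` is measured during the observation window, so treat the adjusted results as illustrative.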
- **Power/MDE** planning helpers.
- Clean **executive summary** and a **Next steps** section (sequential testing / bandits).



## 1) Setup

We stick to NumPy, pandas, and matplotlib, and use `statsmodels` for the logistic regression (industry standard).


In [None]:

from __future__ import annotations
from dataclasses import dataclass
from typing import Tuple, Iterable
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (7, 4.5)
plt.rcParams['axes.grid'] = True

# Optional, used for GLM (logit)
try:
    from statsmodels.api import Logit, add_constant  # type: ignore
except Exception as e:
    Logit = None
    add_constant = None
    print("statsmodels not available; GLM cells will be skippable.", e)


### Helpers (proportions, tests, bootstrap, power/MDE)

In [None]:

@dataclass(frozen=True)
class PropSummary:
    p: float
    n: int
    x: int

def summarize_prop(x: int, n: int) -> PropSummary:
    if n <= 0: raise ValueError("n must be positive")
    if not (0 <= x <= n): raise ValueError("x must be in [0,n]")
    return PropSummary(p=x/n, n=n, x=x)

def two_prop_ztest(x1: int, n1: int, x2: int, n2: int, two_sided: bool = True):
    s1, s2 = summarize_prop(x1,n1), summarize_prop(x2,n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool*(1-p_pool)*(1/s1.n + 1/s2.n))
    if se == 0.0: raise ZeroDivisionError("SE=0; check inputs")
    z = (s2.p - s1.p)/se
    p = 2*(1 - 0.5*(1 + math.erf(abs(z)/math.sqrt(2)))) if two_sided else (1 - 0.5*(1 + math.erf(z/math.sqrt(2))))
    return z, p

def bootstrap_ci_diff(pA: float, pB: float, nA: int, nB: int, B: int = 5000, alpha: float = 0.05, seed: int | None = None):
    """Parametric bootstrap CI for pB - pA: redraw binomial counts at the observed rates."""
    rng = np.random.default_rng(seed)  # pass a seed for reproducible intervals
    diffs = rng.binomial(nB, pB, size=B)/nB - rng.binomial(nA, pA, size=B)/nA
    lo, hi = np.quantile(diffs, [alpha/2, 1-alpha/2])
    return float(lo), float(hi)

def chisq_srm(nA: int, nB: int) -> float:
    n = nA + nB
    exp = [n/2, n/2]; obs = [nA, nB]
    chi2 = sum((o-e)**2/e for o,e in zip(obs,exp))
    return 2 * (1 - 0.5*(1 + math.erf(math.sqrt(chi2)/math.sqrt(2))))

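def invPhi(u: float) -> float:
    # The math module has no erfcinv; use the stdlib normal quantile function instead.
    from statistics import NormalDist
    return NormalDist().inv_cdf(u)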

def required_n_two_proportions(pA: float, pB: float, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> int:
    z_alpha = abs(invPhi(1 - alpha/2)) if two_sided else abs(invPhi(1 - alpha))
    z_beta = abs(invPhi(power))
    pbar = 0.5*(pA+pB)
    delta = abs(pB - pA)
    if delta == 0.0: raise ValueError("delta=0 → infinite n")
    se = math.sqrt(2*pbar*(1-pbar))
    n = ((z_alpha + z_beta)*se/delta)**2
    return int(math.ceil(n))

def mde_for_n(pA: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> float:
    z_alpha = abs(invPhi(1 - alpha/2)) if two_sided else abs(invPhi(1 - alpha))
    z_beta = abs(invPhi(power))
    se = math.sqrt(2*pA*(1-pA))
    return float((z_alpha + z_beta) * se / math.sqrt(max(n_per_arm,1)))



## 2) Data: load & validate

**Where to get the data:**  
The Cookie Cats CSV (`cookie_cats.csv`) is widely mirrored in public repos.  
- If you have it locally, put it under `data/cookie_cats.csv`.  
- Otherwise, set `REMOTE_URL` to a raw CSV URL you trust (e.g., a public GitHub mirror).

**Expected columns (typical):**
- `userid` (unique player ID)
- `version` (`gate_30` or `gate_40`)
- `sum_gamerounds` (number of game rounds during observation window)
- `retention_1` (1‑day retention, 0/1)
- `retention_7` (7‑day retention, 0/1)
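
The column list above can be turned into a quick schema check before any analysis runs. This is a minimal sketch, not part of the original notebook: the helper name `check_cookie_cats_schema` and the toy frame are hypothetical, and the checks assume the columns behave as described in the list.

```python
import pandas as pd

# Column names follow the "Expected columns" list; the helper name is hypothetical.
EXPECTED_COLUMNS = ["userid", "version", "sum_gamerounds", "retention_1", "retention_7"]

def check_cookie_cats_schema(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need every column
    if frame["userid"].duplicated().any():
        problems.append("duplicate userid values")
    bad_versions = set(frame["version"].unique()) - {"gate_30", "gate_40"}
    if bad_versions:
        problems.append(f"unexpected version labels: {sorted(map(str, bad_versions))}")
    for col in ("retention_1", "retention_7"):
        # Accept 0/1 or booleans; NaN fails this test and is reported as non-binary.
        if not frame[col].isin([0, 1, True, False]).all():
            problems.append(f"{col} is not binary")
    if (frame["sum_gamerounds"] < 0).any():
        problems.append("negative sum_gamerounds")
    return problems

# Toy frame for illustration only (not real Cookie Cats rows).
toy = pd.DataFrame({
    "userid": [1, 2, 3],
    "version": ["gate_30", "gate_40", "gate_40"],
    "sum_gamerounds": [3, 38, 165],
    "retention_1": [False, True, True],
    "retention_7": [False, False, True],
})
print(check_cookie_cats_schema(toy))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every data quality issue at once.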


In [None]:

from pathlib import Path
import pandas as pd

LOCAL_PATHS = [Path("data/cookie_cats.csv"), Path("cookie_cats.csv")]
REMOTE_URL = None  # e.g., "https://raw.githubusercontent.com/....../cookie_cats.csv"

def load_cookie_cats() -> pd.DataFrame:
    for p in LOCAL_PATHS:
        if p.exists():
            return pd.read_csv(p)
    if REMOTE_URL:
        return pd.read_csv(REMOTE_URL)
    raise FileNotFoundError("Provide cookie_cats.csv locally (data/ or CWD) or set REMOTE_URL.")

df = load_cookie_cats()
df.head()


### 2.1 Basic hygiene & SRM

In [None]:

df = df.drop_duplicates(subset=["userid"], keep="first").copy()
assert set(df["version"].unique()) <= {"gate_30","gate_40"}, "Unexpected version labels."
nA = (df["version"]=="gate_30").sum()
nB = (df["version"]=="gate_40").sum()
p_srm = chisq_srm(nA, nB)
nA, nB, p_srm
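
To build intuition for how the SRM p-value reacts to imbalance, here is a small standalone sketch. The counts are hypothetical (not taken from the Cookie Cats file), `srm_pvalue` is an illustrative helper, and it assumes an intended 50/50 split by default.

```python
import math

def srm_pvalue(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square (1 df) goodness-of-fit p-value for the observed split."""
    n = n_a + n_b
    exp_a = n * expected_share_a
    exp_b = n * (1.0 - expected_share_a)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom, P(chi2 > c) = erfc(sqrt(c / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical arm sizes: a near-even split and a clearly skewed one.
balanced = srm_pvalue(50_000, 50_200)
skewed = srm_pvalue(50_000, 52_000)
print(f"balanced split p={balanced:.3f}, skewed split p={skewed:.2e}")
```

A very small p-value suggests the assignment or logging pipeline deserves a look before any retention numbers are interpreted.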



## 3) EDA & Metrics


In [None]:

summ = df.groupby("version")[["retention_1","retention_7","sum_gamerounds"]].agg(["mean","median","std","count"])
summ


In [None]:

plt.figure()
for v in ["gate_30","gate_40"]:
    vals = df.loc[df["version"]==v, "sum_gamerounds"].values
    plt.hist(vals, bins=40, alpha=0.5, label=v)
plt.title("sum_gamerounds by version")
plt.xlabel("sum_gamerounds"); plt.ylabel("count"); plt.legend(); plt.tight_layout(); plt.show()



## 4) Primary inference: 1‑day & 7‑day retention


In [None]:

def proportion_table(df: pd.DataFrame, col: str) -> pd.DataFrame:
    grp = df.groupby("version")[col].agg(["sum","count","mean"]).rename(columns={"sum":"x","count":"n","mean":"rate"})
    A = summarize_prop(int(grp.loc["gate_30","x"]), int(grp.loc["gate_30","n"]))
    B = summarize_prop(int(grp.loc["gate_40","x"]), int(grp.loc["gate_40","n"]))
    z, p = two_prop_ztest(A.x, A.n, B.x, B.n, two_sided=True)
    lo, hi = bootstrap_ci_diff(A.p, B.p, A.n, B.n, B=3000, alpha=0.05)
    abs_lift = B.p - A.p
    out = pd.DataFrame({
        "metric":[col, col],
        "arm":["gate_30","gate_40"],
        "n":[A.n, B.n],
        "x":[A.x, B.x],
        "rate":[A.p, B.p],
        "ztest_pvalue":[p, p],
        "abs_lift_B_minus_A":[abs_lift, abs_lift],
        "boot95_lo":[lo, lo],
        "boot95_hi":[hi, hi],
    })
    return out

t1 = proportion_table(df, "retention_1")
t7 = proportion_table(df, "retention_7")
t1, t7


### 4.1 GLM (logit) perspective (optional)

In [None]:

if Logit is not None and add_constant is not None:
    df_glm = df.copy()
    df_glm["treatment"] = (df_glm["version"]=="gate_40").astype(int)
    df_glm["rounds_bin"] = pd.qcut(df_glm["sum_gamerounds"], q=10, duplicates="drop")
    X = pd.get_dummies(df_glm[["treatment","rounds_bin"]], drop_first=True).astype(float)
    Xc = add_constant(X)
    y1 = df_glm["retention_1"].astype(int).to_numpy()
    y7 = df_glm["retention_7"].astype(int).to_numpy()
    mdl1 = Logit(y1, Xc).fit(disp=False)
    mdl7 = Logit(y7, Xc).fit(disp=False)
    coef_t1 = mdl1.params.get("treatment", float("nan")); se_t1 = mdl1.bse.get("treatment", float("nan"))
    # NaN never compares equal to itself, so test finiteness explicitly instead of using `in`.
    z_t1 = coef_t1 / se_t1 if math.isfinite(se_t1) and se_t1 > 0 else float("nan")
    p_t1 = 2*(1 - 0.5*(1 + math.erf(abs(z_t1)/math.sqrt(2)))) if not math.isnan(z_t1) else float("nan")
    coef_t7 = mdl7.params.get("treatment", float("nan")); se_t7 = mdl7.bse.get("treatment", float("nan"))
    z_t7 = coef_t7 / se_t7 if math.isfinite(se_t7) and se_t7 > 0 else float("nan")
    p_t7 = 2*(1 - 0.5*(1 + math.erf(abs(z_t7)/math.sqrt(2)))) if not math.isnan(z_t7) else float("nan")
    glm_summary = pd.DataFrame({"metric":["retention_1","retention_7"],
                                "coef_treatment":[coef_t1, coef_t7],
                                "se":[se_t1, se_t7],
                                "z":[z_t1, z_t7],
                                "p_value":[p_t1, p_t7]})
    display(glm_summary)  # a bare expression inside an if-block is not auto-displayed
else:
    print("statsmodels not available; skip GLM cells.")


## 5) CUPED (variance reduction)

In [None]:

def cuped_adjust(y: np.ndarray, x: np.ndarray):
    y = np.asarray(y, dtype=float); x = np.asarray(x, dtype=float)
    if y.shape != x.shape: raise ValueError("y and x must have same shape")
    vx = np.var(x, ddof=1)  # same ddof as np.cov below, so theta = cov(y, x) / var(x) is consistent
    if vx == 0.0: return y.copy(), 0.0
    theta = float(np.cov(y, x, ddof=1)[0,1] / vx)
    x_centered = x - float(np.mean(x))
    return y - theta * x_centered, theta

def cuped_on_metric(df: pd.DataFrame, metric: str, covariate: str = "sum_gamerounds") -> pd.DataFrame:
    d = df[["version", metric, covariate]].dropna().copy()
    y_adj, theta = cuped_adjust(d[metric].to_numpy().astype(float), d[covariate].to_numpy().astype(float))
    d["y_adj"] = y_adj
    grp = d.groupby("version")["y_adj"].agg(["mean","count"])
    A_mean, A_n = float(grp.loc["gate_30","mean"]), int(grp.loc["gate_30","count"])
    B_mean, B_n = float(grp.loc["gate_40","mean"]), int(grp.loc["gate_40","count"])
    pooled_var = float(np.var(y_adj, ddof=1))
    se = math.sqrt(pooled_var*(1/A_n + 1/B_n))
    z = (B_mean - A_mean)/se if se>0 else float("nan")
    p = 2*(1 - 0.5*(1 + math.erf(abs(z)/math.sqrt(2)))) if not math.isnan(z) else float("nan")
    return pd.DataFrame({"metric":[metric],"theta":[theta],"A_mean_adj":[A_mean],"B_mean_adj":[B_mean],"z":[z],"p_value":[p]})

cuped_1 = cuped_on_metric(df, "retention_1")
cuped_7 = cuped_on_metric(df, "retention_7")
cuped_1, cuped_7
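
To see the variance reduction that CUPED is meant to deliver, the following sketch applies the same adjustment to fully synthetic data. Everything here is an assumption made for illustration: the covariate `x_pre` is a simulated pre-experiment measurement, the true effect is set to +1.0, and `diff_and_se` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical pre-experiment covariate (e.g. activity before assignment).
x_pre = rng.gamma(shape=2.0, scale=10.0, size=n)
treat = rng.integers(0, 2, size=n)
# Outcome correlated with the covariate, plus an assumed true effect of +1.0.
y = 0.5 * x_pre + 1.0 * treat + rng.normal(0.0, 5.0, size=n)

# CUPED adjustment: theta = cov(y, x) / var(x), applied to the centered covariate.
theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
y_adj = y - theta * (x_pre - x_pre.mean())

def diff_and_se(values: np.ndarray, arm: np.ndarray) -> tuple[float, float]:
    """Difference in means (arm 1 minus arm 0) and its unpooled standard error."""
    a, b = values[arm == 0], values[arm == 1]
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return float(diff), float(se)

raw_diff, raw_se = diff_and_se(y, treat)
adj_diff, adj_se = diff_and_se(y_adj, treat)
print(f"raw:   diff={raw_diff:.3f}, se={raw_se:.3f}")
print(f"CUPED: diff={adj_diff:.3f}, se={adj_se:.3f}")
```

Because the covariate is independent of assignment in this simulation, the adjusted difference targets the same effect with a smaller standard error. That guarantee does not carry over to covariates that the treatment itself can move.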


## 6) Power & MDE
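
As a worked example of the planning arithmetic, the sketch below computes a per-arm sample size and an absolute MDE with the usual normal approximation for two proportions. The rates 0.18 and 0.20 are illustrative inputs, and the function names are hypothetical stand-ins that use `statistics.NormalDist` for the normal quantiles.

```python
import math
from statistics import NormalDist

def n_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n to detect p_b vs p_a with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = 0.5 * (p_a + p_b)
    delta = abs(p_b - p_a)
    if delta == 0.0:
        raise ValueError("p_a and p_b must differ")
    return math.ceil((z_alpha + z_beta) ** 2 * 2.0 * p_bar * (1.0 - p_bar) / delta ** 2)

def mde_abs(p_a: float, n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest absolute lift detectable with n users per arm at the given power."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * math.sqrt(2.0 * p_a * (1.0 - p_a) / n)

# Illustrative inputs: detect a lift from 18% to 20% retention.
n_needed = n_per_arm(0.18, 0.20)
print("users per arm:", n_needed)
print("MDE at that n:", round(mde_abs(0.18, n_needed), 4))
```

The two functions are inverses of each other up to the choice of variance term (average rate versus baseline rate), which is why the MDE at the computed n lands close to the 2-point lift that was planned for.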

In [None]:

pA_1 = df.loc[df["version"]=="gate_30","retention_1"].mean()
pA_7 = df.loc[df["version"]=="gate_30","retention_7"].mean()
n_per_arm_1 = min((df["version"]=="gate_30").sum(), (df["version"]=="gate_40").sum())
mde_1 = mde_for_n(pA_1, n_per_arm_1)
mde_7 = mde_for_n(pA_7, n_per_arm_1)
pd.DataFrame({"metric":["retention_1","retention_7"],
              "baseline_rate(A)":[pA_1, pA_7],
              "n_per_arm":[n_per_arm_1, n_per_arm_1],
              "MDE_abs_at_80%_power":[mde_1, mde_7]})


## 7) Executive summary (for decision)

## 8) Next steps


### 5.1 Sequential testing (Pocock / O'Brien–Fleming)

In mobile A/B tests, it is very common to **peek** at the results every day.
If you keep using a fixed two-sided 0.05 threshold at every look, you inflate your **Type I error**.

Group-sequential methods (e.g., **Pocock**, **O'Brien–Fleming**) give you a **schedule of critical z-values**
for a fixed number of planned looks. This lets you:

- Inspect results mid-test without losing Type I control.
- Possibly stop early for strong wins or clear losses.

Here we construct critical |z| boundaries for O'Brien–Fleming and Pocock for a few information fractions.
You can think of information fractions as “percentage of the total planned information / sample size”
if the allocation is reasonably stable over time.


In [None]:

from __future__ import annotations
from typing import List
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def Phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi_inv(p: float) -> float:
    """Inverse standard normal CDF (the math module has no erfcinv, so use statistics.NormalDist)."""
    from statistics import NormalDist
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0,1)")
    return NormalDist().inv_cdf(p)

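def obrien_fleming_boundaries(info_fracs: List[float], alpha: float = 0.05) -> List[float]:
    """
    Two-sided O'Brien–Fleming-style boundaries (approximation):
    critical |z_i| = z_alpha / sqrt(t_i), where t_i is the information fraction of look i.
    Very conservative early (higher |z|), more permissive near the final look.
    Note: exact OBF designs use a constant slightly above z_alpha (about 2.02 instead of
    1.96 for four equally spaced looks at alpha=0.05), so this version is mildly anti-conservative.
    """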
    if not info_fracs:
        raise ValueError("info_fracs must be a non-empty list.")
    if not all(0.0 < t <= 1.0 for t in info_fracs):
        raise ValueError("All information fractions must be in (0,1].")

    z_alpha = Phi_inv(1.0 - alpha / 2.0)  # two-sided
    return [float(z_alpha / math.sqrt(t)) for t in info_fracs]

def pocock_boundaries(K: int, alpha: float = 0.05) -> List[float]:
    """
    Two-sided Pocock boundaries for K equally spaced looks.
    Uses a constant critical |z| for all looks; the constant grows with K.
    Values are the standard tabulated constants for two-sided alpha=0.05.
    """
    if alpha != 0.05:
        raise ValueError("Only alpha=0.05 is tabulated here.")
    table = {2: 2.178, 3: 2.289, 4: 2.361, 5: 2.413, 6: 2.453,
             7: 2.485, 8: 2.512, 9: 2.535, 10: 2.555}
    if K not in table:
        raise ValueError("K must be between 2 and 10 for this table.")
    return [table[K]] * K

# Example: 4 looks at 25%, 50%, 75%, 100% of total information
info_fracs = [0.25, 0.50, 0.75, 1.00]
obf_crit = obrien_fleming_boundaries(info_fracs, alpha=0.05)
poc_crit = pocock_boundaries(len(info_fracs), alpha=0.05)

# Plot the boundaries
plt.figure()
plt.plot(info_fracs, obf_crit, marker="o", label="O'Brien–Fleming |z|")
plt.plot(info_fracs, poc_crit, marker="s", label="Pocock |z|")
plt.title("Critical |z| by information fraction (two-sided α=0.05)")
plt.xlabel("Information fraction")
plt.ylabel("Critical |z|")
plt.legend()
plt.tight_layout()
plt.show()

pd.DataFrame(
    {
        "look": range(1, len(info_fracs) + 1),
        "info_fraction": info_fracs,
        "OBF_crit_z": obf_crit,
        "Pocock_crit_z": poc_crit,
    }
)



**How to use this in practice**

- Choose in advance how many **looks** you want (for example, 4 looks at days 3, 5, 7, and 10).
  Map those to **information fractions** (e.g. 0.25, 0.5, 0.75, 1.0).
- At each look, compute your usual z-statistic for the primary metric, but compare `|z|`
  to the **look-specific** threshold instead of 1.96.
- With O'Brien–Fleming you almost never stop at very early looks unless the effect is very large.
- Pocock uses a constant threshold, so it is more willing to stop early. The price is a final-look
  threshold well above 1.96, which costs some power if the test runs to completion.
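
The procedure in the bullets above can be sketched as a small helper that walks through the looks in order and stops at the first boundary crossing. The z-statistics are made up for illustration, and `sequential_decision` is a hypothetical name; the thresholds use the simple 1.96 / sqrt(t) approximation.

```python
from typing import Optional, Sequence, Tuple

def sequential_decision(
    z_by_look: Sequence[float],
    crit_by_look: Sequence[float],
) -> Optional[Tuple[int, float]]:
    """Return (look_number, z) for the first look where |z| reaches its boundary, else None."""
    if len(z_by_look) > len(crit_by_look):
        raise ValueError("more observed looks than planned boundaries")
    for look, (z, crit) in enumerate(zip(z_by_look, crit_by_look), start=1):
        if abs(z) >= crit:
            return look, z
    return None

# Hypothetical z-statistics observed at looks 1..4 (not from the Cookie Cats data).
observed_z = [1.10, 2.20, 2.45, 2.60]
# O'Brien–Fleming-style thresholds 1.96 / sqrt(t) for t = 0.25, 0.50, 0.75, 1.00.
obf_like = [1.96 / t ** 0.5 for t in (0.25, 0.50, 0.75, 1.00)]

print("OBF-style stop:", sequential_decision(observed_z, obf_like))
print("naive 1.96 stop:", sequential_decision(observed_z, [1.96] * 4))
```

With these made-up numbers the naive rule stops one look earlier than the look-specific boundaries do, which is exactly the behaviour that inflates Type I error when it is repeated across many experiments.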



## 6) Bandits vs static A/B — Thompson Sampling (simulation)

So far in this notebook, we assumed a **static 50/50 split** between `gate_30` and `gate_40`.

Sometimes you want to **optimize reward during the test**, not just at the end.
A popular approach is a **multi-armed bandit**, such as **Thompson Sampling** for Bernoulli rewards:

- Arms = variants (e.g., `gate_30`, `gate_40`).
- Reward = retention or conversion (0/1 per user).
- At each step, sample a parameter for each arm from its posterior and choose the arm with the largest sample.
- This automatically trades off exploration and exploitation.

Below we simulate Thompson Sampling in a toy setting, not with the real Cookie Cats data
(because the real experiment was not run as a bandit).
The goal is to understand the mechanics and the **regret** comparison versus a fixed 50/50 policy.


In [None]:

from typing import Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def simulate_thompson_two_arms(
    pA: float,
    pB: float,
    T: int = 8000,
    seed: int | None = 7,
) -> Tuple[np.ndarray, np.ndarray, pd.DataFrame]:
    """
    Simulate Thompson Sampling for two Bernoulli arms (A and B).

    Parameters
    ----------
    pA, pB : float
        True success probabilities for arm A and arm B.
    T : int
        Number of rounds (users) to simulate.
    seed : int | None
        Random seed for reproducibility.

    Returns
    -------
    regret_ts : np.ndarray
        Cumulative regret under Thompson Sampling.
    regret_fixed : np.ndarray
        Cumulative regret under a static 50/50 policy.
    log_df : pd.DataFrame
        A log of the TS run with columns:
        - t: time step (0..T-1)
        - arm: 0 or 1
        - reward: 0 or 1
        - prop: estimated propensity of chosen arm at that step.
    """
    rng = np.random.default_rng(seed)

    # Beta(1,1) priors for each arm
    aA = bA = 1.0
    aB = bB = 1.0

    best = max(pA, pB)

    regret_ts = np.zeros(T, dtype=float)
    regret_fixed = np.zeros(T, dtype=float)
    cum_reg_ts = 0.0
    cum_reg_fixed = 0.0

    rows: list[tuple[int, int, int, float]] = []

    for t in range(T):
        # Sample one theta from each posterior to choose the arm
        thetaA = rng.beta(aA, bA)
        thetaB = rng.beta(aB, bB)

        # Approximate action probabilities using Monte Carlo under the current posterior
        # This is used to estimate propensities for IPW later.
        thetasA = rng.beta(aA, bA, size=200)
        thetasB = rng.beta(aB, bB, size=200)
        p_choose_A = float(np.mean(thetasA >= thetasB))
        p_choose_B = 1.0 - p_choose_A

        arm = 0 if thetaA >= thetaB else 1

        # Draw rewards for each arm for this "user"
        rewardA = rng.binomial(1, pA)
        rewardB = rng.binomial(1, pB)
        reward = rewardA if arm == 0 else rewardB

        # Update posterior of the chosen arm
        if arm == 0:
            aA += reward
            bA += 1 - reward
            prop = p_choose_A
        else:
            aB += reward
            bB += 1 - reward
            prop = p_choose_B

        # Regret vs always pulling the best arm
        chosen_p = pA if arm == 0 else pB
        cum_reg_ts += best - chosen_p
        regret_ts[t] = cum_reg_ts

        # Static 50/50 expected regret
        fixed_p = 0.5 * pA + 0.5 * pB
        cum_reg_fixed += best - fixed_p
        regret_fixed[t] = cum_reg_fixed

        rows.append((t, arm, reward, prop))

    log_df = pd.DataFrame(rows, columns=["t", "arm", "reward", "prop"])
    return regret_ts, regret_fixed, log_df

# Example: two variants with a small difference in 7-day retention
pA_demo = 0.18
pB_demo = 0.20

reg_ts, reg_fixed, log_ts = simulate_thompson_two_arms(pA_demo, pB_demo, T=8000, seed=11)

plt.figure()
plt.plot(np.arange(1, len(reg_ts) + 1), reg_ts, label="Thompson Sampling")
plt.plot(np.arange(1, len(reg_fixed) + 1), reg_fixed, label="Fixed 50/50")
plt.title("Cumulative regret — TS vs static policy")
plt.xlabel("Round")
plt.ylabel("Cumulative regret")
plt.legend()
plt.tight_layout()
plt.show()

log_ts.head()



### 6.1 Inference under adaptive allocation — IPW and DR

With **adaptive allocation** (bandits), the probability of receiving each arm depends on time
and on past data. If you ignore this and just compute simple differences in means,
your estimate of the treatment effect is generally **biased**.

Two important tools:

- **IPW (Inverse Propensity Weighting)**: reweight each observation by the inverse of the
  probability of receiving the arm it actually received.
- **DR (Doubly-Robust)**: combines IPW with an **outcome model** and remains consistent if
  either the propensity model *or* the outcome model is correctly specified.
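
A toy example can make the bias concrete. The sketch below is not a bandit: it assumes a simpler two-period setup in which the allocation probability and the baseline success rate both change over time, which is the same mechanism that distorts naive comparisons under adaptive allocation. All rates and probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_period = 20_000
# Assumed setup per period: (P(arm 1), success rate arm 0, success rate arm 1).
# The true lift is +0.02 in both periods.
periods = [(0.2, 0.10, 0.12), (0.8, 0.30, 0.32)]

arms, rewards, props = [], [], []
for p_arm1, rate0, rate1 in periods:
    a = rng.binomial(1, p_arm1, size=n_per_period)
    r = rng.binomial(1, np.where(a == 1, rate1, rate0))
    arms.append(a)
    rewards.append(r)
    # Propensity of the arm that was actually shown to each user.
    props.append(np.where(a == 1, p_arm1, 1.0 - p_arm1))

arm = np.concatenate(arms)
reward = np.concatenate(rewards)
prop = np.concatenate(props)

# Naive difference in means ignores that arm 1 was mostly served in the high-rate period.
naive = reward[arm == 1].mean() - reward[arm == 0].mean()

# Self-normalized IPW: weight each observation by 1 / propensity of its own arm.
w = 1.0 / prop
p1_hat = np.sum(w * reward * (arm == 1)) / np.sum(w * (arm == 1))
p0_hat = np.sum(w * reward * (arm == 0)) / np.sum(w * (arm == 0))
ipw = p1_hat - p0_hat

print(f"true lift = 0.020, naive = {naive:.3f}, IPW = {ipw:.3f}")
```

The naive estimate mixes the period effect into the treatment effect, while the weighted estimate rebalances each arm toward the full population.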


In [None]:

from typing import Dict

def ipw_ate_from_log(df: pd.DataFrame) -> float:
    """
    IPW estimate of average treatment effect from a bandit log.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain columns:
        - arm: 0 or 1
        - reward: 0/1
        - prop: estimated propensity of the chosen arm at each time step.

    Returns
    -------
    float
        Estimated difference in success probabilities (arm 1 minus arm 0).
    """
    required_cols = {"arm", "reward", "prop"}
    if not required_cols.issubset(df.columns):
        raise ValueError(f"df must contain columns {required_cols}")

    t = df["arm"].to_numpy()
    y = df["reward"].to_numpy()
    prop = np.clip(df["prop"].to_numpy(), 1e-6, 1.0)

    # Inverse propensity weights
    w = 1.0 / prop
    w1 = w * t
    w0 = w * (1 - t)

    p1_hat = (w1 * y).sum() / max(w1.sum(), 1e-12)
    p0_hat = (w0 * y).sum() / max(w0.sum(), 1e-12)

    return float(p1_hat - p0_hat)

def dr_ate_constant(df: pd.DataFrame) -> float:
    """
    Doubly-Robust ATE with constant outcome models for each arm.

    This is the simplest DR version:
    - m1(x) = mean reward for arm 1
    - m0(x) = mean reward for arm 0
    and we combine them with IPW residual corrections.
    """
    required_cols = {"arm", "reward", "prop"}
    if not required_cols.issubset(df.columns):
        raise ValueError(f"df must contain columns {required_cols}")

    t = df["arm"].to_numpy()
    y = df["reward"].to_numpy()
    prop = np.clip(df["prop"].to_numpy(), 1e-6, 1.0)

    # Outcome models
    m1 = float(df.loc[df["arm"] == 1, "reward"].mean())
    m0 = float(df.loc[df["arm"] == 0, "reward"].mean())

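    # DR term. `prop` is the propensity of the arm actually chosen, so both the
    # arm-1 and arm-0 residual corrections divide by `prop` (not by 1 - prop).
    term = (m1 - m0) + (t * (y - m1) / prop) - ((1.0 - t) * (y - m0) / prop)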
    return float(np.mean(term))

ate_ipw_demo: float = ipw_ate_from_log(log_ts)
ate_dr_demo: float = dr_ate_constant(log_ts)

{
    "true_diff": pB_demo - pA_demo,
    "IPW_ATE": ate_ipw_demo,
    "DR_ATE_const": ate_dr_demo,
}



**Interpretation**

- `true_diff` is the true success-probability difference we used to simulate the bandit.
- `IPW_ATE` and `DR_ATE_const` are two ways of recovering that effect under adaptive allocation.
- In real experiments:
  - You should log **propensities** whenever you use a bandit or any adaptive policy.
  - If you have user features \(X\), you can fit richer outcome models (e.g., logistic regression)
    and use a fully **doubly-robust** estimator.



## 7) Executive summary and decision template

When the notebook is run on the real `cookie_cats.csv`, you will have:

- Baseline 1-day and 7-day retention for `gate_30`.
- Absolute lifts and 95% CIs for `gate_40` vs `gate_30`.
- CUPED-adjusted tests (using `sum_gamerounds`).
- MDE at 80% power for each metric.

A concise, decision-grade executive summary should cover:

1. **Sanity checks**
   - SRM p-value for the user split between `gate_30` and `gate_40`.
   - Any data quality issues (missing fields, outliers, logging changes).

2. **Primary metrics and effect sizes**
   - 1-day retention difference with 95% CI and p-value.
   - 7-day retention difference with 95% CI and p-value.
   - CUPED results for the same metrics and how they compare to the raw tests.

3. **Power & MDE context**
   - Whether the observed CIs and MDE indicate that “no meaningful effect” is plausible,
     or whether the experiment was underpowered.

4. **Decision and rollout plan**
   - **Ship**: if the effect is positive, practically meaningful (above MDE) and robust across methods.
   - **Hold / rerun**: if CIs include both small positive and negative effects, or if power is insufficient.
   - **Roll back**: if there is strong evidence of harm (e.g. 7-day retention clearly lower).
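
The decision rules in item 4 can be written down as a small helper so that the same logic is applied to every metric. This is only a sketch of one possible encoding: the function name `recommend`, the use of the 95% CI bounds as the evidence threshold, and the example numbers are all assumptions, and real decisions would also weigh robustness across methods and business value.

```python
def recommend(ci_lo: float, ci_hi: float, lift: float, mde: float) -> str:
    """Map an absolute lift, its 95% CI, and the MDE to ship / hold / roll back."""
    if ci_lo > ci_hi:
        raise ValueError("ci_lo must not exceed ci_hi")
    if ci_hi < 0.0:
        return "roll back"       # the whole interval is below zero: evidence of harm
    if ci_lo > 0.0 and lift >= mde:
        return "ship"            # clearly positive and at least as large as the MDE
    return "hold / rerun"        # interval spans zero, or the lift is below the MDE

# Hypothetical inputs in absolute retention points (not real results).
print(recommend(ci_lo=-0.013, ci_hi=-0.003, lift=-0.008, mde=0.007))
print(recommend(ci_lo=0.004, ci_hi=0.016, lift=0.010, mde=0.007))
print(recommend(ci_lo=-0.002, ci_hi=0.009, lift=0.003, mde=0.007))
```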

A concrete narrative template:

> SRM check for the user split passed (p = ...), suggesting randomization is sound.  
> 1-day retention improved by Δ₁ = ... percentage points (95% CI [..., ...], p = ...).  
> 7-day retention improved by Δ₇ = ... percentage points (95% CI [..., ...], p = ...).  
> CUPED with `sum_gamerounds` yields similar direction and slightly tighter intervals (p = ...).  
> The MDE at 80% power for 7-day retention is ... pp, and the observed lift is (above / below) that threshold.  
> Given the estimated uplift and business value per retained user, we recommend **(ship / hold / roll back)**,
> with a rollout plan of (25% → 50% → 100%) and ongoing monitoring of retention and monetisation.



### 5.2 Lan–DeMets α-spending (OBF-like and Pocock-like)

Group-sequential tests can be expressed in terms of how they **spend Type I error** over time.
The Lan–DeMets framework defines a **spending function** \(\alpha(t)\) where \(t \in (0,1]\) is the
information fraction.

Two popular choices:

- **OBF-like spending**: very conservative early, spends most of \(\alpha\) near the end.  
- **Pocock-like spending**: spends \(\alpha\) more uniformly across looks.

For discrete looks \(0 < t_1 < \dots < t_K \leq 1\), the incremental spending at look \(i\) is
\(\alpha_i = \alpha(t_i) - \alpha(t_{i-1})\) (with \(\alpha(t_0) = 0\)).
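We can then approximate critical \(|z_i|\) by treating \(\alpha_i\) as a stand-alone two-sided error
budget at that look and computing \(z_i \approx \Phi^{-1}(1 - \alpha_i / 2)\). This ignores the correlation
between looks, so the resulting boundaries are conservative relative to the exact ones.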
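
Written out, two commonly used spending functions of this kind (assumed here to be the standard Lan–DeMets OBF-type and Pocock-type forms) are

\[
\alpha_{\mathrm{OBF}}(t) = 2 - 2\,\Phi\!\left(\frac{z_{\alpha/2}}{\sqrt{t}}\right),
\qquad
\alpha_{\mathrm{Pocock}}(t) = \alpha \,\ln\!\bigl(1 + (e - 1)\,t\bigr),
\]

where \(z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)\). Both spend the full budget at the end: \(\alpha_{\mathrm{OBF}}(1) = 2 - 2(1 - \alpha/2) = \alpha\) and \(\alpha_{\mathrm{Pocock}}(1) = \alpha \ln e = \alpha\). The per-look approximation controls the overall error by a union bound, because the increments telescope:

\[
P(\text{reject at any look}) \;\le\; \sum_{i=1}^{K} \alpha_i \;=\; \alpha(t_K) - \alpha(t_0) \;=\; \alpha(t_K) \;\le\; \alpha .
\]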


In [None]:

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def Phi_ld(z: float) -> float:
    """Standard normal CDF (duplicate name to avoid conflicts)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi_inv_ld(p: float) -> float:
    """Inverse standard normal CDF for Lan–DeMets helpers (math has no erfcinv; use NormalDist)."""
    from statistics import NormalDist
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0,1)")
    return NormalDist().inv_cdf(p)

def spending_obf_like(t: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """
    O'Brien–Fleming-like alpha-spending function (approximate).
    This function is small for small t and approaches alpha as t -> 1.
    """
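    z = Phi_inv_ld(1.0 - alpha / 2.0)
    t_clipped = np.atleast_1d(np.clip(t, 1e-6, 1.0))
    # OBF-like spending 2 - 2*Phi(z / sqrt(t)). math.erf/erfc only accept scalars, so
    # evaluate element by element; erfc(x / sqrt(2)) equals 2 - 2*Phi(x) and is accurate in the tail.
    return np.array([math.erfc(z / math.sqrt(2.0 * ti)) for ti in t_clipped])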

def spending_pocock_like(t: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """
    Pocock-like alpha-spending function (approximate).
    Spends alpha more uniformly over information fraction.
    """
    t_clipped = np.clip(t, 1e-6, 1.0)
    return alpha * np.log(1.0 + (math.e - 1.0) * t_clipped)

def lan_demets_boundaries(info_fracs, alpha: float = 0.05, kind: str = "OBF") -> pd.DataFrame:
    """
    Compute approximate Lan–DeMets-style alpha-spending and critical z per look.

    Parameters
    ----------
    info_fracs : sequence of float
        Monotone increasing information fractions in (0,1].
    alpha : float
        Global two-sided alpha.
    kind : {"OBF", "Pocock"}
        Spending function flavor.

    Returns
    -------
    DataFrame with columns: look, t, alpha_cum, alpha_inc, crit_z.
    """
    t = np.asarray(info_fracs, dtype=float)
    if not np.all((t > 0.0) & (t <= 1.0)):
        raise ValueError("Information fractions must all be in (0,1].")
    if not np.all(np.diff(t) > 0):
        raise ValueError("Information fractions must be strictly increasing.")

    if kind.upper() == "OBF":
        A = spending_obf_like(t, alpha=alpha)
    else:
        A = spending_pocock_like(t, alpha=alpha)

    A_prev = np.r_[0.0, A[:-1]]
    alpha_inc = np.clip(A - A_prev, 1e-10, 1.0)
    crit_z = np.array([Phi_inv_ld(1.0 - a_i / 2.0) for a_i in alpha_inc])

    return pd.DataFrame(
        {
            "look": np.arange(1, len(t) + 1),
            "t": t,
            "alpha_cum": A,
            "alpha_inc": alpha_inc,
            "crit_z": crit_z,
        }
    )

# Example: 5 looks at 20%, 40%, 60%, 80%, 100% info
t_grid = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
df_obf_ld = lan_demets_boundaries(t_grid, alpha=0.05, kind="OBF")
df_poc_ld = lan_demets_boundaries(t_grid, alpha=0.05, kind="Pocock")

# Plot cumulative alpha spending
plt.figure()
plt.plot(df_obf_ld["t"], df_obf_ld["alpha_cum"], marker="o", label="OBF-like α(t)")
plt.plot(df_poc_ld["t"], df_poc_ld["alpha_cum"], marker="s", label="Pocock-like α(t)")
plt.title("Lan–DeMets cumulative alpha-spending (two-sided α=0.05)")
plt.xlabel("Information fraction t")
plt.ylabel("Cumulative α(t)")
plt.legend()
plt.tight_layout()
plt.show()

# Plot critical z per look
plt.figure()
plt.plot(df_obf_ld["t"], df_obf_ld["crit_z"], marker="o", label="OBF-like critical |z|")
plt.plot(df_poc_ld["t"], df_poc_ld["crit_z"], marker="s", label="Pocock-like critical |z|")
plt.title("Lan–DeMets approximate critical |z| per look")
plt.xlabel("Information fraction t")
plt.ylabel("Critical |z|")
plt.legend()
plt.tight_layout()
plt.show()

df_obf_ld, df_poc_ld



**Interpretation.**

- The **OBF-like** spending function keeps cumulative \(\alpha(t)\) very low at early information fractions,
  then spends most of \(\alpha\) near \(t \approx 1\). This corresponds to very strict early boundaries.
- The **Pocock-like** spending function increases \(\alpha(t)\) more uniformly in \(t\), resulting in
  similar critical \(|z|\) thresholds across looks.
- In real systems, you would usually rely on a validated group-sequential design library that solves
  the exact boundaries, but this construction is very useful for planning and intuition.



### 5.3 Peeking without correction — Type I inflation (simulation)

To see *why* we bother with sequential corrections, we can simulate a scenario where the **null hypothesis is true**
(no difference between arms), and repeatedly:

1. Draw Bernoulli outcomes for A and B with the same true rate \(p\).  
2. Compute a standard two-proportion z-test at several interim looks (e.g. 25%, 50%, 75%, 100%).  
3. Declare "significance" as soon as any look passes the usual 1.96 threshold (two-sided 5%).

If we repeat this over many simulated experiments, we can estimate the **empirical Type I error**.
It will be **larger** than 5%, illustrating the **peeking problem**.


In [None]:

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def simulate_peeking_type1(
    true_p: float = 0.25,
    total_n_per_arm: int = 40000,
    looks: list[float] | None = None,
    n_experiments: int = 2000,
    seed: int | None = 123,
) -> pd.DataFrame:
    """
    Simulate Type I error under repeated peeking using naive 0.05 z-threshold.

    Parameters
    ----------
    true_p : float
        True success probability in both arms (null is true).
    total_n_per_arm : int
        Total planned sample size per arm.
    looks : list of float
        Fractions of the total sample size at which we peek (e.g. [0.25,0.5,0.75,1.0]).
    n_experiments : int
        Number of independent experiments to simulate.
    seed : int | None
        Random seed.

    Returns
    -------
    DataFrame with the share of experiments first rejected at each look.
    The overall Type I error (rejection at any look) is stored in df.attrs["type1_overall"].
    """
    if looks is None:
        looks = [0.25, 0.5, 0.75, 1.0]
    looks = sorted(looks)
    rng = np.random.default_rng(seed)

    reject_any = 0
    reject_by_look = np.zeros(len(looks), dtype=int)

    for exp_idx in range(n_experiments):
        # pre-generate outcomes under the null
        A = rng.binomial(1, true_p, size=total_n_per_arm)
        B = rng.binomial(1, true_p, size=total_n_per_arm)

        rejected_this_exp = False
        for i, f in enumerate(looks):
            n_look = int(total_n_per_arm * f)
            xA = int(A[:n_look].sum())
            xB = int(B[:n_look].sum())

            # pooled z-test under null
            p_pool = (xA + xB) / (2.0 * n_look)
            se = math.sqrt(p_pool * (1.0 - p_pool) * (2.0 / n_look))
            if se == 0.0:
                continue
            z = (xB / n_look - xA / n_look) / se
            pval = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

            if pval < 0.05 and not rejected_this_exp:
                rejected_this_exp = True
                reject_any += 1
                reject_by_look[i] += 1

    # Compute proportions
    type1_overall = reject_any / n_experiments
    per_look = reject_by_look / n_experiments
    df = pd.DataFrame(
        {
            "look_index": np.arange(1, len(looks) + 1),
            "info_fraction": looks,
            "reject_at_look": per_look,
        }
    )
    df.attrs["type1_overall"] = type1_overall
    return df

df_peek = simulate_peeking_type1(
    true_p=0.25, total_n_per_arm=40000,
    looks=[0.25, 0.5, 0.75, 1.0],
    n_experiments=2000, seed=2025,
)

overall_type1 = df_peek.attrs["type1_overall"]
display(df_peek)
overall_type1



If you run the simulation with reasonable settings (e.g. 4 looks, 2,000–5,000 experiments),
you will typically see an **overall Type I error** substantially **above 5%**: roughly 12–13% for four
equally spaced looks, and higher still with more looks.

This is why **sequential designs** (Pocock, O'Brien–Fleming, Lan–DeMets) matter: they make it possible to
peek **and** keep the nominal Type I error under control.
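
As a quick check of that claim, the sketch below reruns a null simulation with four equally spaced looks and compares the naive 1.96 threshold against the tabulated Pocock constant for four looks (2.361 at two-sided α = 0.05). It is a compact, vectorized variant written for illustration; the function name and the sample sizes are assumptions.

```python
import numpy as np

def any_look_rejection_rate(
    crit_z: float,
    n_experiments: int = 4000,
    n_per_look: int = 1000,
    n_looks: int = 4,
    p: float = 0.25,
    seed: int = 0,
) -> float:
    """Share of null experiments in which |z| reaches crit_z at any equally spaced look."""
    rng = np.random.default_rng(seed)
    # Successes accrued between consecutive looks, cumulated within each experiment.
    x_a = rng.binomial(n_per_look, p, size=(n_experiments, n_looks)).cumsum(axis=1)
    x_b = rng.binomial(n_per_look, p, size=(n_experiments, n_looks)).cumsum(axis=1)
    n = n_per_look * np.arange(1, n_looks + 1)   # per-arm sample size at each look
    p_pool = (x_a + x_b) / (2.0 * n)
    se = np.sqrt(p_pool * (1.0 - p_pool) * 2.0 / n)
    z = (x_b - x_a) / n / se
    return float((np.abs(z) >= crit_z).any(axis=1).mean())

naive_rate = any_look_rejection_rate(1.96)
pocock_rate = any_look_rejection_rate(2.361)   # Pocock constant for 4 looks, alpha = 0.05
print(f"naive 1.96 at every look: {naive_rate:.3f}")
print(f"Pocock 2.361 at every look: {pocock_rate:.3f}")
```

Both calls use the same seed, so the two thresholds are evaluated on identical simulated experiments and the gap between them reflects the boundary alone.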
