
# Sequential and Always-Valid Testing for A/B Experiments

This notebook focuses on **sequential / always-valid testing** in A/B experiments.

We compare three approaches on simulated data:

1. **Fixed-horizon test**  
   - Decide only once at the planned sample size (classical design).  
2. **Naive peeking**  
   - Look at the p-value many times and stop as soon as it is < 0.05.  
   - This inflates Type I error (false positive rate).  
3. **Mixture sequential test (mSPRT / e-value)**  
   - Uses an **always-valid** test statistic built from the z-score.  
   - You can stop **any time** you like and still control Type I error.

We use a **simplified but practical** mSPRT-style test for the difference in Bernoulli
conversion rates between two arms (control vs treatment). The construction:

- At each interim look, we compute the usual z-statistic \(Z_t\) for the difference in proportions.  
- Under the null, \(Z_t \approx \mathcal{N}(0, 1)\).  
- We define an **e-value** (likelihood ratio) based on a **mixture alternative**:

\[
E_t = \frac{f_\text{mix}(Z_t)}{f_0(Z_t)},
\]

where:
- \(f_0\) is the standard normal density,  
- \(f_\text{mix}\) is the marginal density of \(Z_t\) when its mean is drawn from a \(\mathcal{N}(0, \tau^2)\) prior, i.e. a normal density centered at 0 with variance \(1+\tau^2\).

This yields a closed-form expression:

\[
E_t = \frac{1}{\sqrt{1 + \tau^2}} 
\exp\left( \frac{\tau^2}{2(1 + \tau^2)} Z_t^2 \right).
\]

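Under \(H_0\), each \(E_t\) is nonnegative with \(\mathbb{E}_0[E_t] = 1\) (under the normal approximation), so it is an e-value.  
If the sequence \((E_t)\) is also a **nonnegative (super)martingale**, Ville's inequality gives, for any \(\alpha\in(0,1)\):

\[
\mathbb{P}_0\big(\sup_t E_t \ge 1/\alpha\big) \le \alpha,
\]

so the stopping rule “**reject when \(E_t \ge 1/\alpha\)**” is valid even if we peek all the time.

One caveat: the textbook mSPRT places the normal mixture on the effect size, so the mixing variance on the z-scale grows with the sample size, and that is what makes the sequence an exact martingale for normal data. Keeping \(\tau^2\) fixed on the z-scale, as this simplified version does, satisfies the martingale condition only heuristically, which is why Section 5 checks the Type I error by simulation.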
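
To get a feel for this rule, the closed form can be inverted: \(E_t \ge 1/\alpha\) is equivalent to \(|Z_t|\) exceeding a fixed threshold. The sketch below computes that threshold for a few values of \(\tau^2\); the helper name `z_threshold` is an illustrative assumption, not part of the notebook's code.

```python
import math


def z_threshold(alpha: float = 0.05, tau2: float = 1.0) -> float:
    """Smallest |z| with E(z) >= 1/alpha, obtained by inverting the closed form."""
    # E(z) >= 1/alpha  <=>  z^2 >= 2(1+tau2)/tau2 * ln(sqrt(1+tau2)/alpha)
    log_term = math.log(math.sqrt(1.0 + tau2) / alpha)
    return math.sqrt(2.0 * (1.0 + tau2) / tau2 * log_term)


for tau2 in (0.25, 1.0, 4.0):
    print(f"tau2={tau2:<5} |z| threshold = {z_threshold(0.05, tau2):.3f}")
```

With \(\alpha = 0.05\) and \(\tau^2 = 1\) the threshold is about 3.66, well above the fixed-horizon 1.96. That gap is the price paid for being allowed to reject at any interim look.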


## 0) Setup

In [None]:

from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple, Literal, Dict, Any

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (7, 4.5)
plt.rcParams["axes.grid"] = True



## 1) Proportion helpers and z-test

We re-use a typed summary class and a two-sample proportion z-test. These are standard
building blocks for A/B tests with a Bernoulli conversion metric.


In [None]:

@dataclass(frozen=True)
class PropSummary:
    """Summary of a Bernoulli proportion.

    Attributes
    ----------
    p : float
        Sample proportion x / n.
    n : int
        Sample size.
    x : int
        Number of successes.
    """
    p: float
    n: int
    x: int


def summarize_prop(x: int, n: int) -> PropSummary:
    """Validate and summarize a proportion sample.

    Parameters
    ----------
    x : int
        Number of successes, in [0, n].
    n : int
        Sample size, must be positive.

    Returns
    -------
    PropSummary
        Dataclass with p, n, x.
    """
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must satisfy 0 <= x <= n.")
    return PropSummary(p=x / n, n=n, x=x)


def two_prop_z(
    x1: int,
    n1: int,
    x2: int,
    n2: int,
) -> float:
    """Compute the z-statistic for a two-sample proportion test.

    Uses the usual pooled-variance estimate under H0: p1 = p2.

    Parameters
    ----------
    x1, n1, x2, n2 : int
        Success counts and sample sizes for arms 1 and 2.

    Returns
    -------
    float
        z-statistic (signed), where z > 0 means arm 2 has higher conversion.
    """
    s1, s2 = summarize_prop(x1, n1), summarize_prop(x2, n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / s1.n + 1.0 / s2.n))
    if se == 0.0:
        raise ZeroDivisionError("Standard error is zero; check inputs.")
    z = (s2.p - s1.p) / se
    return float(z)


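def two_prop_pvalue_from_z(z: float, two_sided: bool = True) -> float:
    """Compute a p-value from a z-statistic using the standard normal CDF.

    With two_sided=False the p-value is upper-tailed (alternative: z > 0,
    i.e. arm 2 converts better than arm 1).
    """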
    # standard normal CDF via erf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    if two_sided:
        p = 2.0 * min(cdf, 1.0 - cdf)
    else:
        p = 1.0 - cdf
    return float(p)
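
As a quick worked example of the pooled z-test, the standalone sketch below repeats the same arithmetic on made-up counts (100/1000 conversions in control, 120/1000 in treatment). The numbers are purely illustrative.

```python
import math

# Hypothetical counts: 100/1000 conversions in control, 120/1000 in treatment.
x1, n1 = 100, 1000
x2, n2 = 120, 1000

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se                                   # z > 0: treatment is higher

cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))     # standard normal CDF
p_two_sided = 2.0 * min(cdf, 1.0 - cdf)

print(f"z = {z:.3f}, two-sided p = {p_two_sided:.3f}")
```

Here a 2-point absolute lift on 1000 users per arm gives z of about 1.43 and a two-sided p-value of about 0.15, so the sample is too small to call the difference significant at the 5% level.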



## 2) Simulated streaming experiments

We simulate experiments as **streams of users** arriving over time.

Each user:

- Is randomly assigned to `group ∈ {control, treatment}` with probability 0.5 each.  
- Converts with probability:
  - `p_control` in the control arm,  
  - `p_treat` in the treatment arm.

We consider both:

- **A/A (null)** experiments: `p_control = p_treat`, to estimate Type I error.  
- **A/B (effect)** experiments: `p_treat > p_control`, to compare detection performance.


In [None]:

def simulate_stream(
    n: int,
    p_control: float,
    p_treat: float,
    seed: int | None = None,
) -> pd.DataFrame:
    """Simulate a streaming A/B experiment with Bernoulli conversion.

    Parameters
    ----------
    n : int
        Total number of users.
    p_control : float
        Conversion probability in control.
    p_treat : float
        Conversion probability in treatment.
    seed : int | None
        Random seed.

    Returns
    -------
    DataFrame
        Columns: group, converted. The default integer index gives arrival order.
    """
    rng = np.random.default_rng(seed)
    group_flag = rng.binomial(1, 0.5, size=n)
    group = np.where(group_flag == 0, "control", "treatment")

    p = np.where(group_flag == 0, p_control, p_treat)
    converted = rng.binomial(1, p)

    df = pd.DataFrame(
        {
            "group": group,
            "converted": converted.astype(int),
        }
    )
    return df



## 3) Fixed-horizon and naive-peeking tests

We revisit two simple strategies:

1. **Fixed-horizon**: compute the z-test and p-value once at the final sample size `n`.  
2. **Naive-peeking**: every `look_step` users, recompute the p-value and stop early if `p < α`.

Both use the same z-test; the only difference is **how often we look**.


In [None]:

def ztest_at_n(df_stream: pd.DataFrame, n: int, two_sided: bool = True) -> float:
    """Compute the z-test p-value on the first n users of the stream.

    Parameters
    ----------
    df_stream : DataFrame
        Streaming experiment data with columns group, converted.
    n : int
        Sample size at which to compute the test.
    two_sided : bool
        If True, use two-sided p-value.

    Returns
    -------
    float
        p-value at sample size n.
    """
    sub = df_stream.iloc[:n]
    tab = (
        sub.groupby("group")["converted"]
           .agg(["sum", "count"])
           .rename(columns={"sum": "x", "count": "n"})
    )
    x1, n1 = int(tab.loc["control", "x"]), int(tab.loc["control", "n"])
    x2, n2 = int(tab.loc["treatment", "x"]), int(tab.loc["treatment", "n"])
    z = two_prop_z(x1, n1, x2, n2)
    p = two_prop_pvalue_from_z(z, two_sided=two_sided)
    return p


def naive_peek_pvalues(
    df_stream: pd.DataFrame,
    look_step: int,
    two_sided: bool = True,
) -> Tuple[list[int], list[float]]:
    """Compute p-values over time by peeking every `look_step` users.

    Parameters
    ----------
    df_stream : DataFrame
        Streaming experiment data.
    look_step : int
        Frequency of interim looks.
    two_sided : bool
        If True, two-sided p-values.

    Returns
    -------
    look_sizes : list[int]
        Sample sizes at which we looked.
    p_values : list[float]
        p-values at each look.
    """
    look_sizes: list[int] = []
    p_values: list[float] = []

    n_total = df_stream.shape[0]
    for n in range(look_step, n_total + 1, look_step):
        p = ztest_at_n(df_stream, n, two_sided=two_sided)
        look_sizes.append(n)
        p_values.append(p)

    return look_sizes, p_values
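
Re-grouping the whole prefix at every look costs a full pass over the data per look. One possible alternative, sketched here under the assumption that the stream has the `group` and `converted` columns used above, is to take cumulative sums once and read off the counts at each look. The function name `cumulative_counts` and the tiny demo stream are hypothetical.

```python
import numpy as np
import pandas as pd


def cumulative_counts(df_stream: pd.DataFrame, look_step: int) -> pd.DataFrame:
    """Per-arm cumulative sample sizes and conversions at each look."""
    is_treat = (df_stream["group"] == "treatment").to_numpy()
    conv = df_stream["converted"].to_numpy()

    n2 = np.cumsum(is_treat)               # treatment users so far
    n1 = np.cumsum(~is_treat)              # control users so far
    x2 = np.cumsum(conv * is_treat)        # treatment conversions so far
    x1 = np.cumsum(conv * ~is_treat)       # control conversions so far

    # 0-based position of the last user included in each look
    idx = np.arange(look_step, len(df_stream) + 1, look_step) - 1
    return pd.DataFrame(
        {"n": idx + 1, "x1": x1[idx], "n1": n1[idx], "x2": x2[idx], "n2": n2[idx]}
    )


# Tiny illustrative stream (made-up data)
demo = pd.DataFrame(
    {
        "group": ["control", "treatment", "treatment", "control", "treatment", "control"],
        "converted": [1, 0, 1, 0, 1, 1],
    }
)
looks = cumulative_counts(demo, look_step=3)
print(looks)
```

Each row of the result holds the four counts needed by `two_prop_z` at that look, so the per-look work drops to a constant-time lookup.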



## 4) Mixture sequential test: an always-valid e-value

We now define an **always-valid sequential test** based on a **mixture likelihood ratio**
for the z-statistic.

Recall:

- Under \(H_0\), the z-statistic \(Z_t\) is approximately \(\mathcal{N}(0,1)\).  
- Under a mixture of normal alternatives centered at 0 with variance \(1 + \tau^2\),

  \[
  f_\text{mix}(z) = \frac{1}{\sqrt{2\pi (1 + \tau^2)}} 
  \exp\Big(-\frac{z^2}{2(1 + \tau^2)}\Big).
  \]

- Under the null, the density is \(f_0(z) = (2\pi)^{-1/2} \exp(-z^2/2)\).

The resulting **mixture likelihood ratio** (our e-value) is:

\[
E(z) = \frac{f_\text{mix}(z)}{f_0(z)}
= \frac{1}{\sqrt{1 + \tau^2}} 
  \exp\left( \frac{\tau^2}{2(1 + \tau^2)} z^2 \right).
\]

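Under \(H_0\), \(E(Z_t)\) is nonnegative with expectation 1 at every look (under the normal approximation).
Treating the sequence as a nonnegative martingale (exact for the textbook mSPRT, a heuristic when \(\tau^2\) is fixed on the z-scale as here), we define:

- The running maximum \(E^*_t = \max_{s \le t} E(Z_s)\).  
- A stopping rule: **reject** when \(E^*_t \ge 1/\alpha\).  
- An always-valid p-value at time t: \(p_t = \min(1, 1 / E^*_t)\).

By Ville's inequality, this controls the Type I error at level \(\alpha\) regardless of when we stop, to the extent that the martingale property holds.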
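
The expectation-1 property at a single look can be checked numerically, since \(E(z)\, f_0(z) = f_\text{mix}(z)\) integrates to 1. The sketch below does this with a plain Riemann sum on a fine grid; the grid limits and spacing are arbitrary choices for illustration.

```python
import numpy as np

tau2 = 1.0
z = np.linspace(-12.0, 12.0, 240_001)   # fine grid; mass beyond +/-12 is negligible
dz = z[1] - z[0]

phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)                       # f_0(z)
e = np.exp(tau2 / (2.0 * (1.0 + tau2)) * z**2) / np.sqrt(1.0 + tau2)   # E(z)

# Riemann sum of E(z) * f_0(z), which equals the mixture density f_mix(z)
expectation = float(np.sum(e * phi) * dz)
print(f"E_0[E(Z)] ~= {expectation:.6f}")
```

This only verifies the property for one look at a time; it says nothing about how the values at different looks depend on each other.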


In [None]:

def mixture_sprt_evalue(
    z: float,
    tau2: float = 1.0,
) -> float:
    """Compute the mixture SPRT e-value E(z) for a z-statistic.

    Parameters
    ----------
    z : float
        Z-statistic (approximately standard normal under H0).
    tau2 : float
        Prior variance of the normal mixture, expressed on the z-scale.

    Returns
    -------
    float
        E(z) = f_mix(z) / f_0(z).
    """
    if tau2 <= 0.0:
        raise ValueError("tau2 must be positive.")
    # E(z) = 1/sqrt(1+tau2) * exp( tau2/(2(1+tau2)) * z^2 )
    coef = 1.0 / math.sqrt(1.0 + tau2)
    exponent = (tau2 / (2.0 * (1.0 + tau2))) * (z ** 2)
    return float(coef * math.exp(exponent))


def always_valid_test_on_stream(
    df_stream: pd.DataFrame,
    look_step: int,
    alpha: float = 0.05,
    tau2: float = 1.0,
) -> Dict[str, Any]:
    """Run an always-valid mixture sequential test on a streaming A/B experiment.

    At each look (every `look_step` users), we compute the z-statistic for
    difference in proportions, then the e-value E(z), and track the running
    maximum E*.

    We reject when E* >= 1/alpha.

    Parameters
    ----------
    df_stream : DataFrame
        Streaming experiment data with columns group, converted.
    look_step : int
        Frequency of interim looks.
    alpha : float
        Desired Type I error level.
    tau2 : float
        Mixture prior variance for the e-value.

    Returns
    -------
    dict
        Keys:
        - rejected : bool, whether the test ever crossed the boundary.
        - stopping_n : int | None, sample size at first rejection (None if never).
        - look_sizes : list[int], sample sizes at each look.
        - e_values : list[float], e-value at each look.
        - e_running_max : list[float], running max E* at each look.
        - p_always_valid : list[float], always-valid p-value at each look.
    """
    look_sizes: list[int] = []
    e_values: list[float] = []
    e_running_max: list[float] = []
    p_always_valid: list[float] = []

    E_star = 0.0
    rejected = False
    stopping_n: int | None = None

    n_total = df_stream.shape[0]
    boundary = 1.0 / alpha

    for n in range(look_step, n_total + 1, look_step):
        sub = df_stream.iloc[:n]
        tab = (
            sub.groupby("group")["converted"]
               .agg(["sum", "count"])
               .rename(columns={"sum": "x", "count": "n"})
        )
        x1, n1 = int(tab.loc["control", "x"]), int(tab.loc["control", "n"])
        x2, n2 = int(tab.loc["treatment", "x"]), int(tab.loc["treatment", "n"])

        z = two_prop_z(x1, n1, x2, n2)
        e = mixture_sprt_evalue(z, tau2=tau2)

        E_star = max(E_star, e)
        p_ev = min(1.0, 1.0 / E_star if E_star > 0.0 else 1.0)

        look_sizes.append(n)
        e_values.append(e)
        e_running_max.append(E_star)
        p_always_valid.append(p_ev)

        if (not rejected) and (E_star >= boundary):
            rejected = True
            stopping_n = n

    return {
        "rejected": rejected,
        "stopping_n": stopping_n,
        "look_sizes": look_sizes,
        "e_values": e_values,
        "e_running_max": e_running_max,
        "p_always_valid": p_always_valid,
    }



## 5) Comparing Type I error under A/A (no true effect)

We now compare **false positive rates** of:

1. Fixed-horizon z-test at sample size `n_total`.  
2. Naive peeking (check p-value every `look_step` users, stop if p < α).  
3. Always-valid mixture test (reject when e-process crosses 1/α).

All experiments here are **A/A**: `p_control = p_treat = 0.10`.


In [None]:

def simulate_type1_rates(
    n_experiments: int = 500,
    n_total: int = 10_000,
    p: float = 0.10,
    look_step: int = 500,
    alpha: float = 0.05,
    tau2: float = 1.0,
    seed: int | None = 2025,
) -> Dict[str, float]:
    """Estimate Type I error for three strategies via A/A simulations.

    Parameters
    ----------
    n_experiments : int
        Number of simulated experiments.
    n_total : int
        Total users per experiment.
    p : float
        Conversion rate for both arms under H0.
    look_step : int
        Frequency of interim looks for peeking / always-valid test.
    alpha : float
        Nominal test level.
    tau2 : float
        Mixture prior variance for the e-value.
    seed : int | None
        Random seed.

    Returns
    -------
    dict
        Approximate Type I error for each method.
    """
    rng = np.random.default_rng(seed)
    fixed_rejects = 0
    peek_rejects = 0
    always_rejects = 0

    for _ in range(n_experiments):
        s = int(rng.integers(0, 10_000_000))
        df_stream = simulate_stream(n_total, p_control=p, p_treat=p, seed=s)

        # 1) Fixed-horizon
        p_fix = ztest_at_n(df_stream, n_total, two_sided=True)
        if p_fix < alpha:
            fixed_rejects += 1

        # 2) Naive peeking
        look_sizes, pvals = naive_peek_pvalues(df_stream, look_step=look_step, two_sided=True)
        if any(pv < alpha for pv in pvals):
            peek_rejects += 1

        # 3) Always-valid mixture test
        res_ev = always_valid_test_on_stream(
            df_stream, look_step=look_step, alpha=alpha, tau2=tau2
        )
        if res_ev["rejected"]:
            always_rejects += 1

    return {
        "fixed_horizon_alpha_hat": fixed_rejects / n_experiments,
        "naive_peek_alpha_hat": peek_rejects / n_experiments,
        "always_valid_alpha_hat": always_rejects / n_experiments,
    }


type1_estimates = simulate_type1_rates(
    n_experiments=300,
    n_total=8000,
    p=0.10,
    look_step=400,
    alpha=0.05,
    tau2=1.0,
    seed=2025,
)
type1_estimates



You should see something like:

- `fixed_horizon_alpha_hat` ≈ 0.05 (close to the nominal 5%, up to Monte Carlo noise).  
- `naive_peek_alpha_hat` **well above** 0.05 (inflated false positives).  
- `always_valid_alpha_hat` at or below 0.05, often well below it, because the boundary is conservative when we only look at a finite number of interim points.

This illustrates why naive peeking is dangerous, and how an always-valid test can
restore valid inference while still allowing flexible stopping rules.
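
When reading these estimates, it helps to keep the Monte Carlo noise in mind. A rough sketch of the binomial standard error for an estimated rejection rate follows; the helper name `mc_standard_error` is an illustrative assumption.

```python
import math


def mc_standard_error(rate: float, n_experiments: int) -> float:
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_experiments)


se_300 = mc_standard_error(0.05, 300)
print(f"SE at 300 runs: {se_300:.4f}")
print(
    "approx. 95% band around 0.05: "
    f"[{0.05 - 1.96 * se_300:.3f}, {0.05 + 1.96 * se_300:.3f}]"
)
```

With 300 simulated experiments, a true rate of 0.05 can plausibly show up anywhere between about 0.025 and 0.075, so increase `n_experiments` if a tighter comparison is needed.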



## 6) Power and stopping behavior under A/B (true effect)

We now simulate experiments with a **real treatment effect**, e.g.:

- `p_control = 0.10`,  
- `p_treat   = 0.12` (absolute lift of 2 percentage points).

We compare:

- Probability of detection (power).  
- Distribution of stopping sample sizes for the always-valid test.
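
As a reference point for the simulated power, the fixed-horizon power can be approximated analytically. The sketch below uses a simple normal approximation with the unpooled standard error and assumes the two arms are exactly equal in size, which the random assignment only satisfies approximately. The function names are illustrative.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power(p1: float, p2: float, n_per_arm: int, z_crit: float = 1.96) -> float:
    """Normal-approximation power of a two-sided two-proportion z-test."""
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    shift = abs(p2 - p1) / se
    # Probability that |Z| exceeds z_crit when Z ~ N(shift, 1)
    return norm_cdf(shift - z_crit) + norm_cdf(-shift - z_crit)


# Assumes roughly equal arms: 8000 users -> about 4000 per arm
power = approx_power(0.10, 0.12, 4000)
print(f"approx. fixed-horizon power: {power:.3f}")
```

For 0.10 vs 0.12 with about 4000 users per arm, this gives a power of roughly 0.8, which is the benchmark the sequential methods should be compared against.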


In [None]:

def simulate_power_and_stopping(
    n_experiments: int = 300,
    n_total: int = 10_000,
    p_control: float = 0.10,
    p_treat: float = 0.12,
    look_step: int = 500,
    alpha: float = 0.05,
    tau2: float = 1.0,
    seed: int | None = 42,
) -> Dict[str, Any]:
    """Compare power of fixed / naive / always-valid under a true effect.

    Parameters
    ----------
    n_experiments : int
        Number of simulated experiments.
    n_total : int
        Max users per experiment.
    p_control, p_treat : float
        Conversion probabilities in control and treatment (H1).
    look_step : int
        Frequency of interim looks.
    alpha : float
        Significance level.
    tau2 : float
        Mixture prior variance.
    seed : int | None
        Random seed.

    Returns
    -------
    dict
        Contains detection rates and stopping size summaries.
    """
    rng = np.random.default_rng(seed)
    fixed_detect = 0
    peek_detect = 0
    always_detect = 0
    always_stoppings: list[int] = []

    for _ in range(n_experiments):
        s = int(rng.integers(0, 10_000_000))
        df_stream = simulate_stream(n_total, p_control=p_control, p_treat=p_treat, seed=s)

        # Fixed-horizon
        p_fix = ztest_at_n(df_stream, n_total, two_sided=True)
        if p_fix < alpha:
            fixed_detect += 1

        # Naive peeking
        _, pvals = naive_peek_pvalues(df_stream, look_step=look_step, two_sided=True)
        if any(pv < alpha for pv in pvals):
            peek_detect += 1

        # Always-valid
        res_ev = always_valid_test_on_stream(
            df_stream, look_step=look_step, alpha=alpha, tau2=tau2
        )
        if res_ev["rejected"]:
            always_detect += 1
            if res_ev["stopping_n"] is not None:
                always_stoppings.append(res_ev["stopping_n"])
        else:
            # no detection, treat stopping at n_total
            always_stoppings.append(n_total)

    # Summaries
    detection_rates = {
        "fixed_horizon_power": fixed_detect / n_experiments,
        "naive_peek_power": peek_detect / n_experiments,
        "always_valid_power": always_detect / n_experiments,
    }

    stopping_series = pd.Series(always_stoppings)
    stopping_summary = stopping_series.describe(percentiles=[0.25, 0.5, 0.75])

    return {
        "detection_rates": detection_rates,
        "always_valid_stopping_sizes": stopping_series,
        "always_valid_stopping_summary": stopping_summary,
    }


res_power = simulate_power_and_stopping(
    n_experiments=300,
    n_total=8000,
    p_control=0.10,
    p_treat=0.12,
    look_step=400,
    alpha=0.05,
    tau2=1.0,
    seed=777,
)
res_power["detection_rates"], res_power["always_valid_stopping_summary"]


In [None]:

# Plot distribution of stopping sample size for the always-valid test
stopping_sizes = res_power["always_valid_stopping_sizes"]

plt.figure()
plt.hist(stopping_sizes, bins=20)
plt.xlabel("stopping sample size (n)")
plt.ylabel("count")
plt.title("Always-valid test: distribution of stopping sample sizes")
plt.tight_layout()
plt.show()



Typically you will see that:

- The always-valid test detects the effect with **lower power** than naive peeking at the
  same nominal α, and usually lower power than the fixed-horizon test at the same `n_total`.
  Naive peeking only looks better because its effective α is well above the nominal level.  
- Experiments **stop early** when the observed effect is strong enough, saving traffic and time.  
- The Type I error stays at or below α, as checked in the A/A simulation.

This makes always-valid tests attractive for real growth teams:

- You can monitor experiments continuously.  
- You can make decisions as soon as evidence is strong enough.  
- You keep a clean, interpretable notion of “5% false positive rate”.



## 7) How to use this in a real experimentation workflow

Practical recipe for a growth / product team:

1. **Choose a primary metric** (e.g. conversion).  
2. Decide a maximum exposure `n_total` and a look frequency `look_step`.  
3. Run an always-valid mixture test as in this notebook:
   - Track `E*` and always-valid p-value `p_t`.  
   - Stop and **declare success** once `p_t ≤ α` (equivalently `E* ≥ 1/α`) and guardrails are acceptable.  
4. If the experiment reaches `n_total` without crossing the boundary, treat the result as
   **non-significant** and decide whether to hold, rerun, or accept a small/no effect.
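
The recipe above can be sketched as a small monitoring loop over cumulative counts. This is a minimal illustration, not a production implementation: the function name `monitor`, the `Counts` tuple layout, and the example counts are all hypothetical, and guardrail checks are left out.

```python
import math
from typing import Iterable, Optional, Tuple

Counts = Tuple[int, int, int, int]  # (x_control, n_control, x_treat, n_treat)


def monitor(
    looks: Iterable[Counts], alpha: float = 0.05, tau2: float = 1.0
) -> Tuple[str, Optional[int], float]:
    """Return (decision, look number at stop, last always-valid p-value)."""
    e_star = 0.0
    p_t = 1.0
    for i, (x1, n1, x2, n2) in enumerate(looks, start=1):
        p_pool = (x1 + x2) / (n1 + n2)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
        if se == 0.0:
            continue  # all or no conversions so far: no information, skip this look
        z = (x2 / n2 - x1 / n1) / se
        e = math.exp(tau2 / (2 * (1 + tau2)) * z**2) / math.sqrt(1 + tau2)
        e_star = max(e_star, e)           # running maximum E*
        p_t = min(1.0, 1.0 / e_star)      # always-valid p-value
        if p_t <= alpha:
            return "reject", i, p_t
    return "non-significant", None, p_t


# Made-up cumulative counts at three looks
example_looks = [(40, 400, 48, 400), (80, 800, 115, 800), (120, 1200, 185, 1200)]
decision, stop_look, p_final = monitor(example_looks)
print(decision, stop_look, round(p_final, 4))
```

In this made-up example the boundary is crossed only at the third look. In a real pipeline the same loop would be fed from the experiment's aggregated counts and combined with guardrail metrics before any decision is made.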

Extensions you can add on top:

- Combine with **CUPED** or regression adjustment to reduce variance before computing z.  
- Add **Bayesian decision layers** on top of the always-valid test (e.g., require a minimum
  effect size in absolute terms).  
- Log all z and e-process values to support post-hoc audits and dashboards.
