
# A/B Testing with Guardrail Metrics and Multi-Metric Decision Rules

This notebook focuses specifically on **guardrail metrics** and **multi-metric decisions** in A/B tests.

We will:

1. Simulate an experiment with primary metrics (conversion and revenue).  
2. Add realistic **guardrails**: refund rate, support tickets, latency.  
3. Define a **frequentist decision rule** that combines primary and guardrails.  
4. Define a **Bayesian decision rule** using Beta–Binomial posteriors.  
5. Provide a reusable **decision matrix** template.


## 0) Setup

In [None]:

from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (7, 4.5)
plt.rcParams["axes.grid"] = True


@dataclass(frozen=True)
class PropSummary:
    """Summary of a Bernoulli proportion.

    Attributes
    ----------
    p : float
        Sample proportion x / n.
    n : int
        Sample size.
    x : int
        Number of successes.
    """
    p: float
    n: int
    x: int


def summarize_prop(x: int, n: int) -> PropSummary:
    """Validate and summarize a proportion sample."""
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must satisfy 0 <= x <= n.")
    return PropSummary(p=x / n, n=n, x=x)


def sample_posterior_lift(
    xA: int,
    nA: int,
    xB: int,
    nB: int,
    alpha0: float = 1.0,
    beta0: float = 1.0,
    n_draws: int = 50_000,
    seed: int | None = 1,
) -> pd.DataFrame:
    """Draw from the posterior of Bernoulli rates and their difference.

    Beta–Binomial model:

    - Prior: p ~ Beta(alpha0, beta0)
    - Data: x successes out of n
    - Posterior: p | data ~ Beta(alpha0 + x, beta0 + n - x)

    Parameters
    ----------
    xA, nA, xB, nB : int
        Successes and sample sizes for control (A) and treatment (B).
    alpha0, beta0 : float
        Beta prior hyperparameters (shared across arms).
    n_draws : int
        Number of Monte Carlo draws.
    seed : int | None
        Random seed.

    Returns
    -------
    DataFrame
        Columns: pA, pB, lift = pB - pA.
    """
    rng = np.random.default_rng(seed)

    alphaA = alpha0 + xA
    betaA = beta0 + nA - xA
    alphaB = alpha0 + xB
    betaB = beta0 + nB - xB

    pA_draws = rng.beta(alphaA, betaA, size=n_draws)
    pB_draws = rng.beta(alphaB, betaB, size=n_draws)
    lift = pB_draws - pA_draws

    return pd.DataFrame(
        {
            "pA": pA_draws,
            "pB": pB_draws,
            "lift": lift,
        }
    )



## 1) Simulated experiment with primary metrics

We create a simple experiment with:

- `group ∈ {control, treatment}`  
- `converted` (0/1)  
- `revenue` (0 for non-converters)  
- `pre_activity` (pre-period user proxy)


In [None]:

def simulate_experiment(
    n: int = 20_000,
    p_control: float = 0.10,
    lift_treatment: float = 0.02,
    mean_revenue: float = 100.0,
    revenue_sd: float = 40.0,
    seed: int | None = 123,
) -> pd.DataFrame:
    """Simulate a simple online experiment with binary conversion and revenue."""
    rng = np.random.default_rng(seed)

    user_id = np.arange(n)
    group_flag = rng.binomial(1, 0.5, size=n)
    group = np.where(group_flag == 0, "control", "treatment")

    p_treat = p_control + lift_treatment
    p = np.where(group_flag == 0, p_control, p_treat)
    converted = rng.binomial(1, p)

    # revenue: only for converters; Normal for illustration
    # (with mean 100 and sd 40, a small share of draws can be negative)
    rev = rng.normal(loc=mean_revenue, scale=revenue_sd, size=n)
    rev = np.where(converted == 1, rev, 0.0)

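    # pre-activity covariate, correlated with conversion by construction.
    # Simulation shortcut: it is derived from the outcome here, whereas a real
    # pre-period covariate would be measured before assignment.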
    pre_activity = rng.normal(loc=0.0, scale=1.0, size=n) + converted * 0.7

    df_sim = pd.DataFrame(
        {
            "user_id": user_id,
            "group": group,
            "converted": converted.astype(int),
            "revenue": rev.astype(float),
            "pre_activity": pre_activity.astype(float),
        }
    )
    return df_sim


df = simulate_experiment()
df.head()


In [None]:

# Basic primary metrics
conv_by_group = (
    df.groupby("group")["converted"]
      .agg(["sum", "count", "mean"])
      .rename(columns={"sum": "x", "count": "n", "mean": "rate"})
)
rev_by_group = df.groupby("group")["revenue"].agg(["mean", "std", "count"])

conv_by_group, rev_by_group



## 2) Adding guardrail metrics

We now augment the dataset with three guardrail metrics:

- `refund` — only possible for converters, slightly higher in treatment.  
- `support_ticket` — higher for low-activity users and in treatment.  
- `latency_ms` — slightly larger latency under treatment.


In [None]:

rng_guard = np.random.default_rng(2025)

df = df.copy()

# Refunds: only among converters; assume treatment slightly worse
mask_conv = df["converted"] == 1
mask_control = df["group"] == "control"
mask_treat = df["group"] == "treatment"

prob_refund = np.zeros(len(df), dtype=float)
prob_refund[mask_conv & mask_control] = 0.04  # ~4% refunds for converters in control
prob_refund[mask_conv & mask_treat] = 0.06    # ~6% for converters in treatment

df["refund"] = rng_guard.binomial(1, prob_refund)

# Support tickets: base + bump for low pre_activity + bump for treatment
base_support = 0.03
extra_low_activity = 0.02 * (df["pre_activity"] < 0.0).astype(float)
extra_treat = 0.01 * (df["group"] == "treatment").astype(float)

prob_support = base_support + extra_low_activity + extra_treat
prob_support = np.clip(prob_support, 0.0, 0.30)
df["support_ticket"] = rng_guard.binomial(1, prob_support)

# Latency: base 300 ms, treatment adds ~30ms plus noise
latency_noise = rng_guard.normal(loc=0.0, scale=30.0, size=len(df))
df["latency_ms"] = (
    300.0
    + 30.0 * (df["group"] == "treatment").astype(float)
    + latency_noise
)

guardrail_summary = (
    df.groupby("group")[["refund", "support_ticket", "latency_ms"]]
      .agg(["mean", "std", "count"])
)
guardrail_summary



## 3) Frequentist multi-metric decision rule

Assume:

- **Primary metric**: revenue per user (RPU).  
- **Guardrails**: `refund` and `support_ticket` (both “lower is better”).

We define a simple decision rule:

> **Ship** treatment if  
> 1. RPU difference (treatment − control) is **positive**, and  
> 2. Refund rate increase is at most 1 percentage point, and  
> 3. Support ticket rate increase is at most 0.5 percentage points.
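
Whether condition 2 passes depends on what "refund rate" is divided by. The implementation in this notebook divides refunds by all users in a group, not by converters. The sketch below is a small arithmetic check, not part of the original analysis. It uses only the expected values implied by the simulation parameters (10% and 12% conversion, 4% and 6% refunds among converters) rather than the simulated data, and the dictionary names are illustrative.

```python
# Expected rates implied by the simulation parameters (not the simulated data).
p_conv = {"control": 0.10, "treatment": 0.12}               # conversion rates
p_refund_given_conv = {"control": 0.04, "treatment": 0.06}  # refunds among converters

# Per-user refund rate = P(convert) * P(refund | convert).
per_user = {g: p_conv[g] * p_refund_given_conv[g] for g in p_conv}

diff_per_user = per_user["treatment"] - per_user["control"]
diff_per_converter = (
    p_refund_given_conv["treatment"] - p_refund_given_conv["control"]
)

max_refund_increase = 0.01  # the 1 percentage point tolerance from the rule

print(f"per-user increase:      {diff_per_user:.4f}",
      "within tolerance:", diff_per_user <= max_refund_increase)
print(f"per-converter increase: {diff_per_converter:.4f}",
      "within tolerance:", diff_per_converter <= max_refund_increase)
```

In expectation the per-user increase is about 0.32 percentage points, which passes the tolerance, while the per-converter increase is 2 percentage points, which does not. The threshold and the denominator need to be chosen together.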


In [None]:

# Primary metric: revenue per user
rev_group = (
    df.groupby("group")["revenue"]
      .agg(["mean", "var", "count"])
      .rename(columns={"mean": "mean_revenue", "count": "n"})
)
rev_group


In [None]:

# Primary RPU difference
mean_rev_ctrl = float(rev_group.loc["control", "mean_revenue"])
mean_rev_treat = float(rev_group.loc["treatment", "mean_revenue"])
diff_rpu = mean_rev_treat - mean_rev_ctrl

# Guardrail 1: refund rate per user
# (denominator is all users in the group, not only converters)
x_ref_ctrl = int(df.loc[df["group"] == "control", "refund"].sum())
n_ref_ctrl = int(df.loc[df["group"] == "control", "refund"].count())
x_ref_treat = int(df.loc[df["group"] == "treatment", "refund"].sum())
n_ref_treat = int(df.loc[df["group"] == "treatment", "refund"].count())

s_ref_ctrl = summarize_prop(x_ref_ctrl, n_ref_ctrl)
s_ref_treat = summarize_prop(x_ref_treat, n_ref_treat)
diff_refund = s_ref_treat.p - s_ref_ctrl.p

# Guardrail 2: support ticket rate
x_sup_ctrl = int(df.loc[df["group"] == "control", "support_ticket"].sum())
n_sup_ctrl = int(df.loc[df["group"] == "control", "support_ticket"].count())
x_sup_treat = int(df.loc[df["group"] == "treatment", "support_ticket"].sum())
n_sup_treat = int(df.loc[df["group"] == "treatment", "support_ticket"].count())

s_sup_ctrl = summarize_prop(x_sup_ctrl, n_sup_ctrl)
s_sup_treat = summarize_prop(x_sup_treat, n_sup_treat)
diff_support = s_sup_treat.p - s_sup_ctrl.p

# Decision thresholds (absolute differences)
max_refund_increase = 0.01   # allow up to +1 percentage point
max_support_increase = 0.005 # allow up to +0.5 percentage points

ship_frequentist = (
    (diff_rpu > 0.0)
    and (diff_refund <= max_refund_increase)
    and (diff_support <= max_support_increase)
)

{
    "diff_rpu_treat_minus_ctrl": diff_rpu,
    "diff_refund_rate": diff_refund,
    "diff_support_rate": diff_support,
    "max_refund_increase_allowed": max_refund_increase,
    "max_support_increase_allowed": max_support_increase,
    "ship_frequentist_rule": ship_frequentist,
}



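This rule is intentionally simple and transparent: it compares point estimates against fixed thresholds and ignores sampling noise.

In real experiments you might also require:

- RPU improvement to be **statistically significant**, and  
- Guardrail degradations to be **not statistically significant** or, better, shown to be below a practical threshold (a non-significant result alone can simply mean the test is underpowered).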
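
One way to express these two extra requirements is sketched below. It assumes large samples, so it uses normal approximations: a confidence interval for the difference in means with an unpooled (Welch-style) standard error, and a one-sided non-inferiority z-test for a guardrail proportion. The function names, the 5% level, and the summary statistics are all hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def diff_in_means_ci(mean_a, var_a, n_a, mean_b, var_b, n_b, conf=0.95):
    """Normal-approximation CI for mean_b - mean_a with unpooled variances."""
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = NormalDist().inv_cdf(0.5 + conf / 2.0)
    diff = mean_b - mean_a
    return diff, (diff - z * se, diff + z * se)


def noninferiority_p_value(x_a, n_a, x_b, n_b, margin):
    """One-sided z-test of H0: p_b - p_a >= margin vs H1: p_b - p_a < margin.

    A small p-value is evidence that the increase is below the margin.
    Uses the unpooled (Wald) standard error.
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0.0:
        raise ValueError("Standard error is zero; the test is undefined.")
    z = (p_b - p_a - margin) / se
    return NormalDist().cdf(z)


# Hypothetical summary statistics, for illustration only.
diff, (lo, hi) = diff_in_means_ci(
    mean_a=10.0, var_a=900.0, n_a=10_000,
    mean_b=12.0, var_b=1000.0, n_b=10_000,
)
primary_significant = lo > 0.0  # CI for the RPU difference excludes zero

p_ni = noninferiority_p_value(x_a=40, n_a=10_000, x_b=70, n_b=10_000, margin=0.01)
guardrail_ok = p_ni < 0.05  # increase shown to be below the 1 pp margin

ship = primary_significant and guardrail_ok
```

The non-inferiority framing puts the burden of proof on the treatment: the guardrail passes only when the data rule out a degradation as large as the margin.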

Next we build a Bayesian version using Beta–Binomial posteriors.



## 4) Bayesian multi-metric decision rule (primary + guardrail)

We use the same Beta–Binomial model for:

- Primary: conversion (for illustration).  
- Guardrail: refund rate (undesirable, lower is better).

Decision rule:

> **Ship** if  
> - \(P(p_{\text{treat}} - p_{\text{control}} > 0 \mid \text{data}) > 0.95\) (treatment increases conversion), **and**  
> - \(P(\text{refund}_{\text{treat}} - \text{refund}_{\text{control}} < 0.01 \mid \text{data}) > 0.90\)
>   (refund rate increase is probably less than 1 percentage point).


In [None]:

# Conversion counts for primary Bayesian metric
conv_by_group = (
    df.groupby("group")["converted"]
      .agg(["sum", "count"])
      .rename(columns={"sum": "x", "count": "n"})
)
xA = int(conv_by_group.loc["control", "x"])
nA = int(conv_by_group.loc["control", "n"])
xB = int(conv_by_group.loc["treatment", "x"])
nB = int(conv_by_group.loc["treatment", "n"])

post_conv = sample_posterior_lift(
    xA=xA,
    nA=nA,
    xB=xB,
    nB=nB,
    alpha0=1.0,
    beta0=1.0,
    n_draws=80_000,
    seed=2026,
)

prob_conv_positive = float((post_conv["lift"] > 0.0).mean())
conv_lo, conv_hi = np.quantile(post_conv["lift"], [0.025, 0.975])

prob_conv_positive, (conv_lo, conv_hi)


In [None]:

# Posterior for refund rate lift (guardrail)
post_refund = sample_posterior_lift(
    xA=x_ref_ctrl,
    nA=n_ref_ctrl,
    xB=x_ref_treat,
    nB=n_ref_treat,
    alpha0=1.0,
    beta0=1.0,
    n_draws=80_000,
    seed=2027,
)

refund_lift = post_refund["lift"]  # treat - control
refund_lo, refund_hi = np.quantile(refund_lift, [0.025, 0.975])

# Probability that refund increase is less than +1 percentage point
bayes_max_refund_increase = 0.01
prob_refund_within_band = float((refund_lift < bayes_max_refund_increase).mean())

{
    "refund_lift_CI95": (refund_lo, refund_hi),
    "bayesian_max_refund_increase": bayes_max_refund_increase,
    "prob_refund_increase_less_than_threshold": prob_refund_within_band,
}


In [None]:

# Combine into a Bayesian decision
conv_prob_threshold = 0.95
guardrail_prob_threshold = 0.90

ship_bayesian = (
    (prob_conv_positive > conv_prob_threshold)
    and (prob_refund_within_band > guardrail_prob_threshold)
)

{
    "P(conv_lift > 0)": prob_conv_positive,
    "P(refund_lift < 0.01)": prob_refund_within_band,
    "conv_prob_threshold": conv_prob_threshold,
    "guardrail_prob_threshold": guardrail_prob_threshold,
    "ship_bayesian_rule": ship_bayesian,
}



This Bayesian rule encodes both **upside appetite** and **risk tolerance**:

- Large `conv_prob_threshold` ⇒ more demanding on primary impact.  
- Large `guardrail_prob_threshold` and small allowed refund increase ⇒ more conservative on risk.

You can tune these thresholds per **business line** (e.g., payments vs marketing).
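
A minimal sketch of such per-line tuning follows. The `BayesThresholds` container, the `ship_bayesian_rule` helper, and every threshold value are hypothetical, not recommendations. The lift draws are synthetic Normal samples standing in for posterior draws so that the block runs on its own.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class BayesThresholds:
    """Hypothetical container for the three tunable knobs of the rule."""
    conv_prob_threshold: float
    guardrail_prob_threshold: float
    max_refund_increase: float


# Illustrative values only.
THRESHOLDS = {
    "payments": BayesThresholds(0.975, 0.95, 0.005),  # conservative on risk
    "marketing": BayesThresholds(0.90, 0.80, 0.01),   # more tolerant
}


def ship_bayesian_rule(conv_lift_draws, refund_lift_draws, t: BayesThresholds) -> bool:
    """Apply the two-part rule to arrays of posterior lift draws."""
    prob_conv_positive = float(np.mean(np.asarray(conv_lift_draws) > 0.0))
    prob_refund_within_band = float(
        np.mean(np.asarray(refund_lift_draws) < t.max_refund_increase)
    )
    return bool(
        prob_conv_positive > t.conv_prob_threshold
        and prob_refund_within_band > t.guardrail_prob_threshold
    )


# Synthetic stand-ins for posterior draws (not taken from the experiment).
rng = np.random.default_rng(0)
conv_draws = rng.normal(loc=0.02, scale=0.01, size=50_000)
refund_draws = rng.normal(loc=0.003, scale=0.004, size=50_000)

payments_ship = ship_bayesian_rule(conv_draws, refund_draws, THRESHOLDS["payments"])
marketing_ship = ship_bayesian_rule(conv_draws, refund_draws, THRESHOLDS["marketing"])
```

With these synthetic draws the same evidence passes the looser `marketing` thresholds and fails the stricter `payments` ones, which is the point of keeping the thresholds in one explicit, reviewable place.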



## 5) Decision matrix template

You can map experiment results into a **decision matrix** that combines primary and guardrail metrics:

| Case | Primary metric (e.g. conversion / RPU) | Guardrails (refund, support, latency) | Suggested action |
|------|----------------------------------------|----------------------------------------|------------------|
| A    | Clearly improved                       | Not degraded (within tolerances)      | **Ship** and monitor |
| B    | Clearly improved                       | Mild degradation but acceptable        | **Ship** with mitigation plan |
| C    | Neutral / unclear                      | Clean guardrails                       | **Hold / rerun** or gather more data |
| D    | Degraded                               | Clean guardrails                       | **Do not ship** |
| E    | Improved                               | Clearly degraded (beyond limits)       | **Do not ship**; investigate root cause |
| F    | Degraded                               | Degraded                               | **Do not ship**, consider rollback |

The frequentist or Bayesian rules you implement should map each experiment into exactly one
of these cases, using thresholds fixed before the experiment runs, so that decisions are consistent across teams and over time.
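
The table can be encoded as a lookup so that the mapping is explicit and testable. This is a sketch: the enum names and the `classify` helper are hypothetical, and how a metric gets labelled "improved", "mild", and so on is left to whichever statistical rule you adopt. Combinations that the table does not list (for example, a neutral primary metric with degraded guardrails) return `None` rather than a guessed action.

```python
from enum import Enum
from typing import Optional, Tuple


class Primary(Enum):
    IMPROVED = "improved"
    NEUTRAL = "neutral"
    DEGRADED = "degraded"


class Guardrail(Enum):
    CLEAN = "clean"        # not degraded, within tolerances
    MILD = "mild"          # mild but acceptable degradation
    DEGRADED = "degraded"  # beyond limits


# One entry per row of the decision matrix.
DECISION_MATRIX = {
    (Primary.IMPROVED, Guardrail.CLEAN): ("A", "Ship and monitor"),
    (Primary.IMPROVED, Guardrail.MILD): ("B", "Ship with mitigation plan"),
    (Primary.NEUTRAL, Guardrail.CLEAN): ("C", "Hold / rerun or gather more data"),
    (Primary.DEGRADED, Guardrail.CLEAN): ("D", "Do not ship"),
    (Primary.IMPROVED, Guardrail.DEGRADED): ("E", "Do not ship; investigate root cause"),
    (Primary.DEGRADED, Guardrail.DEGRADED): ("F", "Do not ship, consider rollback"),
}


def classify(primary: Primary, guardrail: Guardrail) -> Optional[Tuple[str, str]]:
    """Return (case, suggested action), or None if the table has no such row."""
    return DECISION_MATRIX.get((primary, guardrail))


case_e = classify(Primary.IMPROVED, Guardrail.DEGRADED)
unlisted = classify(Primary.NEUTRAL, Guardrail.DEGRADED)
```

Returning `None` for unlisted combinations makes gaps in the policy visible, so they can be resolved deliberately instead of ad hoc.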
