
# A/B Test Feasibility & Sample Size Reality Checker
This notebook helps product teams decide whether an A/B test is worth running at all.
Instead of focusing only on statistical formulas, it incorporates real-world constraints such as traffic volume, fixed timelines, and data loss to answer a more practical question:

 *Can this experiment realistically reach statistical power before we need to make a decision?*

**Problem Framing**:

**Why experiment feasibility matters**

In practice, many A/B tests are launched even though they are statistically impossible to conclude within the available time or traffic.

This leads to:


* Inconclusive results
* Misleading “wins”
* Wasted engineering and product effort

*This notebook provides a simple framework to **screen experiments before launch**, ensuring they are both **statistically sound** and **operationally feasible**.*

## **Imports**

In [3]:
import math

import pandas as pd
from scipy.stats import norm  # required later for the z-scores; fail fast if SciPy is missing

## **Experiment assumptions and constraints**

### **Defining the experiment inputs**

This section captures the **assumptions and constraints** that are typically known before an experiment starts.

These include:

* Baseline conversion rate
* Minimum Detectable Effect (MDE)
* Desired statistical power and significance level
* Daily traffic volume
* Maximum allowable experiment duration
* Real-world adjustments such as ramp-up periods and data loss

In [4]:
params = {
    # Experiment assumptions
    "baseline_cr": 0.06,
    "mde_abs": 0.005,
    "alpha": 0.05,
    "power": 0.80,
    "two_sided": True,

    # Real-world constraints
    "traffic_per_day": 12000,
    "allocation_treatment": 0.50,
    "max_days": 14,
    "ramp_days_excluded": 0,
    "data_loss_rate": 0.05,

    # Optional: safety clamp so p2 stays valid
    "min_cr": 1e-6,
    "max_cr": 1 - 1e-6,
}
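
Because every later step trusts these inputs, it can help to sanity-check them first. The sketch below is one possible way to do that; `validate_params` is a hypothetical helper that is not part of the original notebook, and the specific rules (for example, allowing a data loss rate of exactly 0) are assumptions.

```python
def validate_params(cfg: dict) -> list[str]:
    """Hypothetical helper: return a list of problems; an empty list means the inputs look sane."""
    problems = []
    # Rates and probabilities must lie strictly inside (0, 1).
    for key in ("baseline_cr", "alpha", "power", "allocation_treatment"):
        if not 0 < cfg[key] < 1:
            problems.append(f"{key} must be strictly between 0 and 1")
    # The target rate (baseline + absolute MDE) must differ from the baseline and stay valid.
    target = cfg["baseline_cr"] + cfg["mde_abs"]
    if cfg["mde_abs"] == 0 or not 0 < target < 1:
        problems.append("mde_abs must be non-zero and keep the target rate inside (0, 1)")
    # Assumption: losing all data (rate of 1) is treated as invalid, losing none is fine.
    if not 0 <= cfg["data_loss_rate"] < 1:
        problems.append("data_loss_rate must be in [0, 1)")
    if cfg["traffic_per_day"] <= 0:
        problems.append("traffic_per_day must be positive")
    if cfg["max_days"] <= 0 or cfg["ramp_days_excluded"] < 0:
        problems.append("max_days must be positive and ramp_days_excluded non-negative")
    return problems


# Same values as the params cell above, repeated so this sketch runs on its own.
example_cfg = {
    "baseline_cr": 0.06, "mde_abs": 0.005, "alpha": 0.05, "power": 0.80,
    "traffic_per_day": 12000, "allocation_treatment": 0.50,
    "max_days": 14, "ramp_days_excluded": 0, "data_loss_rate": 0.05,
}
issues = validate_params(example_cfg)
print(issues if issues else "Inputs look valid.")
```

Returning a list rather than raising on the first problem lets a team see every questionable input at once.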

## **Sample size calculation**

Here I calculate the minimum sample size required per arm using a two-sample test of proportions.

This uses a normal approximation, which is standard for experiment planning when sample sizes are large.

This answers:

 *How many users do we need in each variant to reliably detect the chosen effect size?*
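
For reference, the calculation in the code cell that follows can be written out as a formula. Here $p_1$ is the baseline rate, $p_2$ the target rate, and $z_q$ the $q$-quantile of the standard normal distribution; the notation is chosen for this note and does not appear in the original notebook.

$$
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})} + z_{\text{power}}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
$$

For a one-sided test, $z_{1-\alpha/2}$ is replaced by $z_{1-\alpha}$. The result is rounded up to the next whole user and applies to each arm separately.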

In [8]:
from scipy.stats import norm
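def required_n_two_proportions(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8, two_sided: bool = True) -> int:
    """Return the minimum sample size per arm for a two-sample test of proportions.

    Uses the normal approximation with equal-sized arms. p1 is the baseline
    rate and p2 the target rate; both must lie strictly between 0 and 1.
    """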
    if p1 <= 0 or p1 >= 1 or p2 <= 0 or p2 >= 1:
        raise ValueError("p1 and p2 must be in (0,1).")

    delta = abs(p2 - p1)
    if delta <= 0:
        raise ValueError("p2 must differ from p1 for sample size calculation.")

    # Z-scores
    z_alpha = norm.ppf(1 - alpha/2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)

    p_bar = (p1 + p2) / 2
    term1 = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term2 = z_beta  * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    n = ((term1 + term2) ** 2) / (delta ** 2)

    return int(math.ceil(n))
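
One way to sanity-check a sample size is to invert the same formula and ask what power a given `n` would deliver. The sketch below does this under the same normal approximation; `approx_power` is a hypothetical name, and it uses the standard library's `statistics.NormalDist` rather than `scipy.stats.norm` so that it runs without SciPy. For two-sided tests it ignores the negligible chance of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist


def approx_power(n_per_arm: int, p1: float, p2: float,
                 alpha: float = 0.05, two_sided: bool = True) -> float:
    """Hypothetical helper: approximate power of a two-proportion z-test with n_per_arm users per arm."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2) if two_sided else std_normal.inv_cdf(1 - alpha)

    delta = abs(p2 - p1)
    p_bar = (p1 + p2) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))          # pooled term under "no difference"
    alt_sd = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))     # unpooled term under the target effect

    # Solve the sample size formula for the power quantile instead of for n.
    z_power = (delta * math.sqrt(n_per_arm) - z_alpha * null_sd) / alt_sd
    return std_normal.cdf(z_power)


# With the default inputs (6.0% baseline, 6.5% target), the notebook reports 36,791 per arm.
print(f"Power at n=36,791: {approx_power(36_791, 0.06, 0.065):.4f}")
print(f"Power at n=20,000: {approx_power(20_000, 0.06, 0.065):.4f}")
```

The first figure should come out at or just above 0.80, since 36,791 is the formula's result rounded up; the second shows how much power is lost if the test stops early.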

## **Translating sample size into calendar time**

In this step, I translate the required sample size into:

1. Expected days of data collection
2. Total calendar time including ramp-up or excluded days

This allows me to directly compare statistical requirements against business constraints.
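
The arithmetic behind this translation is short enough to work through by hand. The sketch below uses the default inputs from this notebook and takes 36,791 users per arm as the required sample size (the figure the notebook reports for those inputs); the variable names are illustrative.

```python
import math

n_per_arm = 36_791            # required users per arm for the default inputs
traffic_per_day = 12_000
data_loss_rate = 0.05
allocation_treatment = 0.50
ramp_days_excluded = 0

# Users per day that actually produce usable data (about 11,400).
effective_traffic = traffic_per_day * (1 - data_loss_rate)

# The smaller arm fills up last, so it determines the runtime (about 5,700 per day).
smaller_arm_per_day = effective_traffic * min(allocation_treatment, 1 - allocation_treatment)

# Round up: a partial day still occupies a calendar day.
days_needed = math.ceil(n_per_arm / smaller_arm_per_day)
total_calendar_days = days_needed + ramp_days_excluded

print(days_needed, total_calendar_days)
```

Using the smaller arm matters for uneven splits: with a 10/90 allocation, the 10% arm would take nine times as long to fill as the 90% arm.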

In [14]:
def clamp(p: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, p))

def feasibility_check(cfg: dict) -> dict:
    p1 = cfg["baseline_cr"]
    p2 = clamp(p1 + cfg["mde_abs"], cfg["min_cr"], cfg["max_cr"])

    n_per_arm = required_n_two_proportions(
        p1=p1,
        p2=p2,
        alpha=cfg["alpha"],
        power=cfg["power"],
        two_sided=cfg["two_sided"]
    )

    alloc_t = cfg["allocation_treatment"]
    if alloc_t <= 0 or alloc_t >= 1:
        raise ValueError("allocation_treatment must be in (0,1).")
    alloc_c = 1 - alloc_t  # control gets the remaining share

    effective_traffic = cfg["traffic_per_day"] * (1 - cfg["data_loss_rate"])
    per_day_t = effective_traffic * alloc_t
    per_day_c = effective_traffic * alloc_c
    per_day_min_arm = min(per_day_t, per_day_c)

    if per_day_min_arm <= 0:
        raise ValueError("Effective traffic per day must be > 0 after allocation and data loss.")

    days_needed_for_n = int(math.ceil(n_per_arm / per_day_min_arm))
    total_calendar_days = days_needed_for_n + int(cfg["ramp_days_excluded"])

    # Verdict rule (simple + explainable)
    max_days = int(cfg["max_days"])
    if total_calendar_days <= max_days:
        status = "FEASIBLE"
        label = "Feasible"
    elif total_calendar_days <= int(math.ceil(1.5 * max_days)):
        status = "RISKY"
        label = "Risky"
    else:
        status = "NOT_FEASIBLE"
        label = "Not feasible"

    # Recommendations (actionable next steps)
    recs = []
    if status != "FEASIBLE":
        recs.append("Increase max test duration (days), if possible.")
        recs.append("Target a larger MDE (detect only bigger effects).")
        recs.append("Increase eligible traffic (wider audience, more surfaces, higher allocation).")
        recs.append("Improve tracking to reduce data loss (instrumentation QA).")

    return {
        "baseline_cr": p1,
        "target_cr": p2,
        "mde_abs": cfg["mde_abs"],
        "alpha": cfg["alpha"],
        "power": cfg["power"],
        "traffic_per_day": cfg["traffic_per_day"],
        "allocation_treatment": alloc_t,
        "data_loss_rate": cfg["data_loss_rate"],
        "ramp_days_excluded": cfg["ramp_days_excluded"],
        "n_required_per_arm": n_per_arm,
        "days_needed_for_n": days_needed_for_n,
        "total_calendar_days": total_calendar_days,
        "max_days": max_days,
        "status": status,
        "label": label,
        "recommendations": recs
    }

## **Feasibility verdict and recommendations**

### Assessing whether the experiment is viable

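Based on the estimated runtime and the maximum allowed duration (`max_days`), the experiment is classified as:

1. Feasible: the total calendar days fit within `max_days`, so the test can realistically reach power
2. Risky: the total exceeds `max_days` but stays within 1.5 × `max_days` (rounded up); results may be inconclusive at the deadline
3. Not feasible: the total exceeds 1.5 × `max_days`; the test is unlikely to reach power in time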

When an experiment is not feasible, the notebook also suggests practical next steps, such as increasing traffic, relaxing the MDE, or extending the runtime.
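
The verdict rule can be isolated in a few lines. The sketch below mirrors the thresholds used in `feasibility_check`; the function name `classify_runtime` and the `risky_factor` parameter are hypothetical (the notebook hard-codes 1.5).

```python
import math


def classify_runtime(total_calendar_days: int, max_days: int, risky_factor: float = 1.5) -> str:
    """Hypothetical helper: map an estimated runtime to FEASIBLE / RISKY / NOT_FEASIBLE."""
    if total_calendar_days <= max_days:
        return "FEASIBLE"
    # Up to risky_factor times the cap (rounded up) counts as borderline.
    if total_calendar_days <= math.ceil(risky_factor * max_days):
        return "RISKY"
    return "NOT_FEASIBLE"


for days in (7, 18, 40):
    print(days, classify_runtime(days, max_days=14))
```

The 1.5 multiplier is a judgment call rather than a statistical threshold; teams with hard deadlines may prefer a value closer to 1.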

In [15]:
result = feasibility_check(params)

print(result["label"])
print(f"- Baseline CR: {result['baseline_cr']:.3%}")
print(f"- Target CR:   {result['target_cr']:.3%} (MDE {result['mde_abs']:.2%} absolute)")
print(f"- Required n/arm: {result['n_required_per_arm']:,}")
print(f"- Days (data collection): {result['days_needed_for_n']} days")
print(f"- Ramp excluded: {result['ramp_days_excluded']} days")
print(f"- Total calendar days: {result['total_calendar_days']} (cap: {result['max_days']})")

if result["recommendations"]:
    print("\nNext steps:")
    for r in result["recommendations"]:
        print(f"• {r}")

Feasible
- Baseline CR: 6.000%
- Target CR:   6.500% (MDE 0.50% absolute)
- Required n/arm: 36,791
- Days (data collection): 7 days
- Ramp excluded: 0 days
- Total calendar days: 7 (cap: 14)


## Sensitivity analysis

Instead of relying on a single MDE assumption, this section explores a range of effect sizes and shows:

1. How required sample size changes
2. How long each scenario would take to run
3. Which effect sizes are feasible within the same time constraint

This helps product teams make informed trade-offs between ambition and realism.
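
The same question can also be asked in reverse: given the users available before the deadline, what is the smallest absolute MDE the test could detect? The sketch below searches for it by bisection. It is a standalone illustration, not part of the original notebook: `n_per_arm` and `smallest_feasible_mde` are hypothetical names, the search bounds are assumptions, and `statistics.NormalDist` stands in for `scipy.stats.norm` so the block runs without SciPy.

```python
import math
from statistics import NormalDist


def n_per_arm(p1: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided sample size per arm, same normal-approximation formula as the notebook."""
    p2 = p1 + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    term1 = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term2 = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil((term1 + term2) ** 2 / mde ** 2)


def smallest_feasible_mde(p1: float, n_available: int,
                          lo: float = 1e-4, hi: float = 0.05, tol: float = 1e-7) -> float:
    """Bisection on the MDE; assumes required n shrinks as the MDE grows within [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n_per_arm(p1, mid) <= n_available:
            hi = mid   # mid is detectable in time; try a smaller effect
        else:
            lo = mid   # mid needs too many users; move up
    return hi


# Default inputs: 12,000/day x 0.95 retained x 0.5 allocation x 14 days = 79,800 per arm.
n_available = 79_800
mde = smallest_feasible_mde(0.06, n_available)
print(f"Smallest detectable MDE within the cap: {mde * 100:.2f}pp")
```

The answer should fall between 0.3pp and 0.4pp, consistent with the sensitivity table, where 0.3pp exceeds the 14-day cap and 0.4pp fits within it.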

In [16]:
def sensitivity_table(cfg: dict, mde_list: list[float]) -> pd.DataFrame:
    rows = []
    for mde in mde_list:
        cfg2 = dict(cfg)
        cfg2["mde_abs"] = mde
        out = feasibility_check(cfg2)
        rows.append({
            "MDE (abs)": mde,
            "MDE (pp)": mde * 100,
            "n/arm": out["n_required_per_arm"],
            "days_needed": out["days_needed_for_n"],
            "calendar_days": out["total_calendar_days"],
            "feasible_within_cap": out["total_calendar_days"] <= out["max_days"],
            "verdict": out["status"]
        })
    df = pd.DataFrame(rows).sort_values("MDE (abs)")
    return df

mde_values = [0.002, 0.003, 0.004, 0.005, 0.0075, 0.010]
sens_df = sensitivity_table(params, mde_values)
sens_df

| | MDE (abs) | MDE (pp) | n/arm | days_needed | calendar_days | feasible_within_cap | verdict |
|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.2 | 224787 | 40 | 40 | False | NOT_FEASIBLE |
| 1 | 0.003 | 0.3 | 100670 | 18 | 18 | False | RISKY |
| 2 | 0.004 | 0.4 | 57057 | 11 | 11 | True | FEASIBLE |
| 3 | 0.005 | 0.5 | 36791 | 7 | 7 | True | FEASIBLE |
| 4 | 0.0075 | 0.75 | 16656 | 3 | 3 | True | FEASIBLE |
| 5 | 0.01 | 1.0 | 9540 | 2 | 2 | True | FEASIBLE |


## Summary and takeaways

### Key lessons
1. Not every experiment should be launched
2. Sample size must be evaluated alongside real-world constraints
3. Feasibility checks prevent wasted experimentation effort
4. Small changes in MDE or traffic can dramatically change outcomes

This framework can be used as a pre-experiment checklist for product, growth, and experimentation teams.

In [17]:
scenario_a = dict(params)
scenario_a.update({
    "baseline_cr": 0.045,
    "mde_abs": 0.005,
    "traffic_per_day": 2500,
    "max_days": 14,
    "ramp_days_excluded": 1,
    "data_loss_rate": 0.07,
})

res_a = feasibility_check(scenario_a)
res_a

{'baseline_cr': 0.045,
 'target_cr': 0.049999999999999996,
 'mde_abs': 0.005,
 'alpha': 0.05,
 'power': 0.8,
 'traffic_per_day': 2500,
 'allocation_treatment': 0.5,
 'data_loss_rate': 0.07,
 'ramp_days_excluded': 1,
 'n_required_per_arm': 28408,
 'days_needed_for_n': 25,
 'total_calendar_days': 26,
 'max_days': 14,
 'status': 'NOT_FEASIBLE',
 'label': 'Not feasible',
 'recommendations': ['Increase max test duration (days), if possible.',
  'Target a larger MDE (detect only bigger effects).',
  'Increase eligible traffic (wider audience, more surfaces, higher allocation).',
  'Improve tracking to reduce data loss (instrumentation QA).']}

In [18]:
scenario_b = dict(params)
scenario_b.update({
    "baseline_cr": 0.06,
    "mde_abs": 0.005,
    "traffic_per_day": 40000,
    "max_days": 14,
    "ramp_days_excluded": 0,
    "data_loss_rate": 0.03,
})

res_b = feasibility_check(scenario_b)
res_b

{'baseline_cr': 0.06,
 'target_cr': 0.065,
 'mde_abs': 0.005,
 'alpha': 0.05,
 'power': 0.8,
 'traffic_per_day': 40000,
 'allocation_treatment': 0.5,
 'data_loss_rate': 0.03,
 'ramp_days_excluded': 0,
 'n_required_per_arm': 36791,
 'days_needed_for_n': 2,
 'total_calendar_days': 2,
 'max_days': 14,
 'status': 'FEASIBLE',
 'label': 'Feasible',
 'recommendations': []}

In [19]:
def summary_row(name: str, out: dict) -> dict:
    return {
        "scenario": name,
        "baseline_cr": f"{out['baseline_cr']:.2%}",
        "mde(pp)": f"{out['mde_abs']*100:.2f}pp",
        "traffic/day": f"{out['traffic_per_day']:,}",
        "n/arm": f"{out['n_required_per_arm']:,}",
        "calendar_days": out["total_calendar_days"],
        "cap_days": out["max_days"],
        "verdict": out["label"]
    }

summary_df = pd.DataFrame([
    summary_row("Scenario A (low traffic)", res_a),
    summary_row("Scenario B (healthy traffic)", res_b),
])

summary_df

| | scenario | baseline_cr | mde(pp) | traffic/day | n/arm | calendar_days | cap_days | verdict |
|---|---|---|---|---|---|---|---|---|
| 0 | Scenario A (low traffic) | 4.50% | 0.50pp | 2500 | 28408 | 26 | 14 | Not feasible |
| 1 | Scenario B (healthy traffic) | 6.00% | 0.50pp | 40000 | 36791 | 2 | 14 | Feasible |
