# Basic Power Analysis 

Lecture 14 | CMU ANLP Fall 2025 | Instructor: Sean Welleck

Power analysis determines how many test examples are needed to detect a meaningful difference between two systems (e.g., model A vs. model B) at a chosen significance level and power.

*Based on the methodology in "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations" ([arXiv:2411.00640](https://arxiv.org/pdf/2411.00640))*

## Background

- **Minimum detectable effect (δ)**: The smallest true difference in mean score that we want to be able to detect
- **Power (1-β)**: The probability of detecting a difference of size δ when it truly exists (typically 0.80)
- **Significance level (α)**: The probability of declaring a difference when none exists, i.e., the false-positive rate (typically 0.05)
- **Required sample size (n)**: The number of paired test examples needed to reach that power


### Setup and utility functions

In [22]:
import numpy as np
import pandas as pd
from scipy.stats import norm

### Simulating paired binary scores

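For binary outcomes (correct/incorrect), we simulate paired scores by thresholding a pair of correlated standard normal variables. Here `rho` is the correlation of the latent normals; the correlation between the resulting binary scores is generally smaller.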

In [23]:
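def simulate_paired_scores_with_corr(n, mu_A, mu_B, rho, rng=None):
    """Simulate n paired binary scores with target accuracies mu_A and mu_B."""
    if rng is None:
        rng = np.random.default_rng()
    # Draw correlated latent standard normals, one pair per test item
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n)
    # Thresholds chosen so that P(z <= t) equals the target accuracy
    tA = norm.ppf(mu_A)
    tB = norm.ppf(mu_B)
    scores_A = (z[:, 0] <= tA).astype(int)
    scores_B = (z[:, 1] <= tB).astype(int)
    return scores_A, scores_B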
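
As a quick sanity check on this kind of simulator, the sketch below draws a large paired sample using the same latent-Gaussian thresholding idea and compares the empirical accuracies and the correlation of the binary scores with the inputs. It is a self-contained re-implementation for illustration: the helper name `simulate_pairs`, the sample size, and the seed are assumptions, not part of the lecture code.

```python
import numpy as np
from scipy.stats import norm

def simulate_pairs(n, mu_A, mu_B, rho, rng):
    """Illustrative re-implementation: threshold two correlated standard normals."""
    z1 = rng.standard_normal(n)
    # Construct z2 so that corr(z1, z2) = rho
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    s_A = (z1 <= norm.ppf(mu_A)).astype(int)
    s_B = (z2 <= norm.ppf(mu_B)).astype(int)
    return s_A, s_B

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
s_A, s_B = simulate_pairs(200_000, 0.86, 0.86, rho=0.8, rng=rng)

acc_A, acc_B = s_A.mean(), s_B.mean()
binary_corr = np.corrcoef(s_A, s_B)[0, 1]

print(f"accuracy A: {acc_A:.3f}, accuracy B: {acc_B:.3f}")
print(f"correlation of binary scores: {binary_corr:.3f} (latent rho = 0.8)")
```

The empirical accuracies should land close to 0.86, while the correlation of the 0/1 scores comes out below the latent value of 0.8, which matters when interpreting `rho` as "how similar the two systems are".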

### Estimating variance from pilot data

The variance of the per-item paired differences, ω² = Var(s_A − s_B), is the key input to the power analysis. We estimate it with the sample variance from a small pilot study.

In [24]:
def estimate_omega2(sA, sB):
    """Unbiased sample variance (ddof=1) of the per-item differences sA - sB."""
    d = sA.astype(float) - sB.astype(float)
    return float(np.var(d, ddof=1))
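
The variance of a difference decomposes as Var(s_A) + Var(s_B) − 2·Cov(s_A, s_B), which is why evaluating both systems on the same items helps: positive covariance between their scores shrinks ω². The following sketch checks that identity numerically on a tiny made-up example (the score arrays are illustrative, not lecture data).

```python
import numpy as np

# Toy paired scores (made-up data for illustration only)
s_A = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0], dtype=float)
s_B = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 1], dtype=float)

# Direct estimate: sample variance of the per-item differences
omega2_direct = np.var(s_A - s_B, ddof=1)

# Same quantity from the 2x2 sample covariance matrix
cov = np.cov(s_A, s_B, ddof=1)
omega2_decomp = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]

# What the variance would be if the covariance term were ignored
unpaired_var = cov[0, 0] + cov[1, 1]

print(f"omega^2 (direct):        {omega2_direct:.4f}")
print(f"omega^2 (decomposition): {omega2_decomp:.4f}")
print(f"Var(A) + Var(B) only:    {unpaired_var:.4f}")
```

On this toy data the paired variance is less than half of the sum of the two marginal variances, so a paired comparison would need correspondingly fewer items than one that ignores the pairing.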

### Computing required sample size

Given a desired minimum detectable effect size δ, significance level α, power, and variance estimate ω², we can compute the required number of test examples.
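
Under a normal approximation for the mean paired difference and a two-sided test, this computation corresponds to the standard sample-size formula below, where z_p denotes the p-quantile of the standard normal distribution. The second line is the condition it comes from: the effect δ, measured in units of the standard error ω/√n, must exceed the sum of the two quantiles.

```latex
n \;\ge\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\omega^{2}}{\delta^{2}}
\qquad\Longleftrightarrow\qquad
\frac{\delta}{\omega/\sqrt{n}} \;\ge\; z_{1-\alpha/2} + z_{1-\beta}
```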

In [26]:
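def required_n(delta: float, alpha: float, power: float, omega2: float) -> int:
    """Paired items needed to detect a mean difference delta (two-sided test)."""
    beta = 1 - power
    z_alpha2 = norm.ppf(1 - alpha / 2.0)  # two-sided critical value
    z_beta = norm.ppf(1 - beta)           # quantile for the target power
    n = (z_alpha2 + z_beta) ** 2 * omega2 / delta**2
    return int(np.ceil(n))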
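
The same relationship can be read in the other direction: for a fixed test-set size, what is the smallest difference we can expect to detect? A minimal sketch follows, assuming the same normal approximation. The helper `min_detectable_effect` and the value `omega2 = 0.10` are hypothetical choices for illustration, not part of the lecture code.

```python
import numpy as np
from scipy.stats import norm

def min_detectable_effect(n, alpha, power, omega2):
    """Hypothetical helper: smallest delta detectable with n paired items."""
    z_total = norm.ppf(1 - alpha / 2.0) + norm.ppf(power)
    return float(z_total * np.sqrt(omega2 / n))

# Illustrative inputs (omega2 is an assumed value, not a measured one)
alpha, power, omega2 = 0.05, 0.80, 0.10

# For alpha = 0.05 and power = 0.80 the two quantiles sum to about 2.80
z_total = norm.ppf(1 - alpha / 2.0) + norm.ppf(power)
print(f"z_(1-alpha/2) + z_(1-beta) = {z_total:.3f}")

mde = {n: min_detectable_effect(n, alpha, power, omega2) for n in [500, 1000, 2000, 4000]}
for n, delta in mde.items():
    print(f"n = {n:5d}  ->  minimum detectable effect = {delta:.4f}")
```

Because δ scales as 1/√n, quadrupling the test set only halves the detectable difference.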

## Example: Determining sample size for a model comparison

Suppose we have two models with similar accuracy (about 86%), and we want to detect differences as small as 1 to 3 percentage points. Let's use a pilot study to estimate the variance and determine the required sample size.

In [27]:
# Pilot study parameters
mu_A = 0.86  # System A accuracy
mu_B = 0.86  # System B accuracy  
n_pilot = 156  # Pilot study size

# Simulate pilot data with latent correlation rho=0.8 (no seed is set, so results vary between runs)
scores_A, scores_B = simulate_paired_scores_with_corr(n_pilot, mu_A, mu_B, rho=0.8)

# Estimate variance from pilot
omega2_hat = estimate_omega2(scores_A, scores_B)

print("Pilot study results:")
print(f"  Sample size: {n_pilot}")
print(f"  System A accuracy: {scores_A.mean():.4f}")
print(f"  System B accuracy: {scores_B.mean():.4f}")
print(f"  Estimated ω²: {omega2_hat:.4f}")

Pilot study results:
  Sample size: 156
  System A accuracy: 0.8718
  System B accuracy: 0.8718
  Estimated ω²: 0.1161
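
For binary paired scores each difference is −1, 0, or +1, so ω² is driven by the discordant pairs (items where exactly one system is correct). The printed summary is consistent with 18 discordant pairs out of 156, split evenly between the two systems since both accuracies match. The sketch below reconstructs that arithmetic; the counts are inferred from the printed numbers, not read from the raw pilot data.

```python
import numpy as np

n_pilot = 156
# Inferred (not observed) counts: items where only A, or only B, is correct
only_A, only_B = 9, 9

# Differences are +1 (only A correct), -1 (only B correct), 0 (both agree)
d = np.concatenate([
    np.ones(only_A),
    -np.ones(only_B),
    np.zeros(n_pilot - only_A - only_B),
])

omega2 = np.var(d, ddof=1)  # equals 18 / 155 because the mean difference is 0
print(f"Estimated omega^2 from inferred counts: {omega2:.4f}")
```

With so few discordant pairs, the estimate moves noticeably if even one or two items flip, so sample sizes derived from a pilot of this size are best read as rough guides.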


### Computing required sample sizes for different effect sizes

In [28]:
# Standard parameters
alpha = 0.05
power = 0.80

# Different minimum detectable differences to consider
deltas = [0.01, 0.015, 0.02, 0.03]

# Compute required n for each delta
rows = []
for delta in deltas:
    rows.append({
        "delta (min detectable difference)": delta,
        "required_n (paired items)": required_n(delta, alpha, power, omega2_hat)
    })

summary = pd.DataFrame(rows)
summary

| delta (min detectable difference) | required_n (paired items) |
|---|---|
| 0.01 | 9115 |
| 0.015 | 4052 |
| 0.02 | 2279 |
| 0.03 | 1013 |
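
One way to check that a computed sample size actually delivers the target power is a Monte Carlo simulation: repeatedly draw test sets of that size under a scenario with a true gap of δ, run the two-sided z-test on the paired differences, and count how often it rejects. The sketch below does this under assumed conditions (true accuracies 0.86 vs. 0.89, latent correlation 0.8, 2000 trials, a fixed seed). It re-estimates ω² for that scenario from a large simulated sample, so the resulting n need not match the table above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
alpha, power, delta, rho = 0.05, 0.80, 0.03, 0.8
mu_A, mu_B = 0.86, 0.89  # assumed true accuracies with a 3-point gap

def simulate_diffs(shape):
    """Per-item differences s_B - s_A from thresholded correlated normals."""
    z1 = rng.standard_normal(shape)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(shape)
    s_A = (z1 <= norm.ppf(mu_A)).astype(float)
    s_B = (z2 <= norm.ppf(mu_B)).astype(float)
    return s_B - s_A

# Large-sample estimate of omega^2 under this assumed scenario
omega2 = np.var(simulate_diffs(500_000), ddof=1)

z_crit = norm.ppf(1 - alpha / 2.0)
n = int(np.ceil((z_crit + norm.ppf(power)) ** 2 * omega2 / delta**2))

# Each row is one simulated test set of n paired items
n_trials = 2000
d = simulate_diffs((n_trials, n))
z_stat = d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(n))
empirical_power = float(np.mean(np.abs(z_stat) > z_crit))

print(f"omega^2 = {omega2:.4f}, required n = {n}")
print(f"empirical power over {n_trials} trials: {empirical_power:.3f}")
```

If the normal approximation is reasonable, the empirical rejection rate should sit near the 0.80 target, up to Monte Carlo noise of roughly one percentage point at 2000 trials.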
