# Power Analysis for A/B Testing with MLArena

This notebook demonstrates how to use MLArena's power analysis functionality to plan A/B tests with appropriate sample sizes.

Power analysis helps you answer critical questions like:
- How many users do I need to detect a meaningful difference?
- What's the smallest effect I can reliably detect with my current sample size?
- With a given sample size, what's the probability I'll detect an effect if it truly exists?

In [1]:
import pandas as pd
import numpy as np
import mlarena.utils.stats_utils as sut

# Set random seed for reproducibility
np.random.seed(42)

## 1. Power Analysis for Proportions

Let's say you're testing a new flow on a website and want to improve conversion from 5% to 6%.

### 1.1 Estimating Required Sample Size

In [2]:
# Question: How many users per group do I need to detect a 5% -> 6% conversion improvement with 80% power?
sample_size_result = sut.sample_size_proportion(
    baseline_rate=0.05,
    treatment_rate=0.06,
    power=0.8,
    alpha=0.05
)

print("Sample Size Analysis for Conversion Rate Test")
print("=" * 50)
print(f"Baseline conversion rate: {sample_size_result['baseline_rate']*100:.1f}%")
print(f"Target conversion rate: {sample_size_result['treatment_rate']*100:.1f}%")
print(f"Relative lift: {sample_size_result['relative_lift']*100:.1f}%")
print(f"Absolute lift: {sample_size_result['absolute_lift']*100:.1f} percentage points")
print(f"\nRequired sample size per group: {sample_size_result['sample_size_per_group']:,}")
print(f"Total sample size needed: {sample_size_result['total_sample_size']:,}")
print(f"Effect size (Cohen's h): {sample_size_result['effect_size']:.3f}")

Sample Size Analysis for Conversion Rate Test
Baseline conversion rate: 5.0%
Target conversion rate: 6.0%
Relative lift: 20.0%
Absolute lift: 1.0 percentage points

Required sample size per group: 8,143
Total sample size needed: 16,286
Effect size (Cohen's h): 0.044
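
The reported effect size and sample size can be cross-checked by hand. The sketch below is not MLArena's implementation. It assumes a two-sided test, equal group sizes, Cohen's h as the effect size, and the normal approximation $n = 2\,\big((z_{1-\alpha/2} + z_{\text{power}}) / h\big)^2$ per group. The function names `cohens_h` and `approx_n_per_group` are hypothetical helpers, not part of the library.

```python
import math
from statistics import NormalDist


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))


def approx_n_per_group(p1: float, p2: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Normal-approximation sample size per group (two-sided test, equal groups)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    h = cohens_h(p1, p2)
    return math.ceil(2 * ((z_alpha + z_power) / h) ** 2)


h = cohens_h(0.05, 0.06)
n = approx_n_per_group(0.05, 0.06)
print(f"Cohen's h: {h:.3f}")
print(f"Approximate sample size per group: {n:,}")
```

Under these assumptions the result should land very close to the 8,143 per group reported above. Small differences can come from rounding or from the exact method the library uses.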


### 1.2 Estimating Power

In [15]:
# Question: What if I can get 10,000 users per group? What's my power?
group_size = 10000
power_result = sut.power_analysis_proportion(
    baseline_rate=0.05,
    treatment_rate=0.06,
    sample_size_per_group=group_size
)

print("Power Analysis with Fixed Sample Size")
print("=" * 40)
print(f"Sample size per group: {group_size}")
print(f"Power to detect a 5% -> 6% improvement: {power_result['power']*100:.1f}%")
print(f"\nInterpretation: There's a {power_result['power']*100:.1f}% chance of detecting the improvement if it truly exists.")


Power Analysis with Fixed Sample Size
Sample size per group: 10000
Power to detect a 5% -> 6% improvement: 87.4%

Interpretation: There's a 87.4% chance of detecting the improvement if it truly exists.


In [16]:
# Question: What if I can only get 5,000 users per group? What's my power?
group_size = 5000
power_result = sut.power_analysis_proportion(
    baseline_rate=0.05,
    treatment_rate=0.06,
    sample_size_per_group=group_size
)

print("Power Analysis with Fixed Sample Size")
print("=" * 40)
print(f"Sample size per group: {group_size}")
print(f"Power to detect a 5% -> 6% improvement: {power_result['power']*100:.1f}%")
print(f"\nInterpretation: There's a {power_result['power']*100:.1f}% chance of detecting the improvement if it truly exists.")

Power Analysis with Fixed Sample Size
Sample size per group: 5000
Power to detect a 5% -> 6% improvement: 59.3%

Interpretation: There's a 59.3% chance of detecting the improvement if it truly exists.
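
The same normal approximation can be run in the other direction to estimate power from a fixed sample size. This is a minimal sketch under the same assumptions (two-sided test, equal groups, Cohen's h), not the library's own code. `approx_power` is a hypothetical helper, and it ignores the negligible chance of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist


def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test on proportions."""
    # Effect size: Cohen's h (absolute value, direction does not matter here)
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Power = P(test statistic exceeds the critical value when the effect is real)
    return NormalDist().cdf(h * math.sqrt(n_per_group / 2) - z_alpha)


for n in (5_000, 10_000):
    print(f"n = {n:>6,} per group -> approximate power {approx_power(0.05, 0.06, n):.1%}")
```

The two values should be close to the 59.3% and 87.4% shown above, which illustrates how quickly power drops when the sample is halved.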


## 2. Choosing the Right Power Level: Best Practices

Power is the probability of detecting a true effect when one exists. 80% power (0.8) is the conventional default, but the right choice depends on your context:

- **80% (0.8)**: Conventional default, a good balance of cost vs. risk
- **90% (0.9)**: Lower risk of missing a real effect, but requires roughly a third more samples than 80%
- **70% (0.7)**: Lower cost (about 20% fewer samples than 80%), but higher risk of missing real effects



In [5]:
# Demonstrate the cost of different power levels
print("Sample Size Requirements by Power Level")
print("=" * 45)
print("Scenario: Detect 5% -> 6% conversion improvement\n")

power_levels = [0.7, 0.8, 0.9, 0.95]
power_comparison = []

# Compute the 80%-power reference once instead of on every loop iteration
reference_n = sut.sample_size_proportion(0.05, 0.06, power=0.8)['sample_size_per_group']

for power in power_levels:
    result = sut.sample_size_proportion(0.05, 0.06, power=power)
    power_comparison.append({
        'Power': f"{power*100:.0f}%",
        'Sample per Group': f"{result['sample_size_per_group']:,}",
        'Total Sample': f"{result['total_sample_size']:,}",
        'Risk of Missing Effect': f"{(1-power)*100:.0f}%",
        'Relative to 80%': f"{result['sample_size_per_group'] / reference_n:.1f}x"
    })

power_df = pd.DataFrame(power_comparison)
print(power_df.to_string(index=False))

Sample Size Requirements by Power Level
Scenario: Detect 5% -> 6% conversion improvement

Power Sample per Group Total Sample Risk of Missing Effect Relative to 80%
  70%            6,403       12,806                    30%            0.8x
  80%            8,143       16,286                    20%            1.0x
  90%           10,901       21,802                    10%            1.3x
  95%           13,482       26,964                     5%            1.7x
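
The "Relative to 80%" column does not depend on the particular conversion rates. Under the normal approximation the effect size cancels out, leaving a ratio that depends only on alpha and the two power levels. The sketch below illustrates this, assuming a two-sided test. `relative_sample_size` is a hypothetical helper, not an MLArena function.

```python
from statistics import NormalDist


def relative_sample_size(power: float, reference_power: float = 0.8, alpha: float = 0.05) -> float:
    """Ratio of required sample sizes at two power levels (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_target = NormalDist().inv_cdf(power)
    z_reference = NormalDist().inv_cdf(reference_power)
    # The effect size appears in both numerator and denominator, so it cancels
    return ((z_alpha + z_target) / (z_alpha + z_reference)) ** 2


ratios = {p: relative_sample_size(p) for p in (0.7, 0.8, 0.9, 0.95)}
for p, r in ratios.items():
    print(f"power {p:.0%}: {r:.2f}x the sample needed at 80% power")
```

Because the ratio is independent of the effect size, the same multipliers are a reasonable rule of thumb for other scenarios at alpha = 0.05.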


## 3. Power Analysis for Numeric Metrics

Suppose we want to design an A/B test to detect a change in a numeric metric such as revenue or session duration.

### 3.1 Estimating Required Sample Size

In [6]:
# Question: How many users do I need to detect a "medium" effect size (Cohen's d = 0.5)?
sample_size_numeric_result = sut.sample_size_numeric(
    effect_size=0.5,  # Medium effect size
    power=0.8,
    alpha=0.05
)

print("Sample Size Analysis for Numeric Metrics")
print("=" * 45)
print(f"Effect size (Cohen's d): {sample_size_numeric_result['effect_size']}")
print(f"Required sample size per group: {sample_size_numeric_result['sample_size_per_group']}")
print(f"Total sample size needed: {sample_size_numeric_result['total_sample_size']}")

Sample Size Analysis for Numeric Metrics
Effect size (Cohen's d): 0.5
Required sample size per group: 34
Total sample size needed: 68


In [7]:
# Create a comparison table for different effect sizes
effect_sizes = [0.1, 0.2, 0.3, 0.5, 0.8]
results = []

for es in effect_sizes:
    result = sut.sample_size_numeric(effect_size=es, power=0.8)
    results.append({
        'Effect Size': es,
        'Interpretation': 'Very Small' if es < 0.2 else 'Small' if es < 0.5 else 'Medium' if es < 0.8 else 'Large',
        'Sample per Group': result['sample_size_per_group'],
        'Total Sample': result['total_sample_size']
    })

comparison_df = pd.DataFrame(results)
print("Sample Size Requirements by Effect Size")
print("=" * 45)
print(comparison_df.to_string(index=False))

Sample Size Requirements by Effect Size
 Effect Size Interpretation  Sample per Group  Total Sample
         0.1     Very Small               787          1574
         0.2          Small               199           398
         0.3          Small                90           180
         0.5         Medium                34            68
         0.8          Large                15            30
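
As a cross-check on these numbers, the sketch below computes normal-approximation sample sizes for two designs: two independent groups of equal size, and a one-sample (or paired) design. It is not MLArena's implementation. `approx_n` is a hypothetical helper, and a t-based calculation adds a few observations on top of these values at small sample sizes.

```python
import math
from statistics import NormalDist


def approx_n(effect_size: float, power: float = 0.8, alpha: float = 0.05,
             two_groups: bool = True) -> int:
    """Normal-approximation sample size per group for a two-sided test.

    two_groups=True  -> two independent groups of equal size
    two_groups=False -> one-sample or paired design
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    factor = 2 if two_groups else 1  # two groups contribute two sources of variance
    return math.ceil(factor * (z / effect_size) ** 2)


for d in (0.1, 0.2, 0.3, 0.5, 0.8):
    print(f"d = {d}: two independent groups ~{approx_n(d)} per group, "
          f"one-sample/paired ~{approx_n(d, two_groups=False)}")
```

The table above tracks the one-sample/paired column, whereas the standard formula for two independent groups asks for about twice as many users per group (around 63 to 64 for d = 0.5). Before sizing a two-group A/B test, it is worth confirming which design `sample_size_numeric` assumes in your installed version.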


### 3.2 Estimating Power



In [12]:
# Question: I want to detect even a small effect size of 0.2, but I can only get 100 users per group. What's my power?
group_size = 100
effect_size = 0.2
power_result = sut.power_analysis_numeric(
    effect_size=effect_size,
    alpha=0.05,
    sample_size_per_group=group_size
)

print("Power Analysis with Fixed Sample Size")
print("=" * 40)
print(f"Sample size per group: {group_size}")
print(f"Power to detect an effect size of {effect_size:.1f}: {power_result['power']*100:.1f}%")
print(f"\nInterpretation: There's a {power_result['power']*100:.1f}% chance of detecting the effect if it truly exists")

Power Analysis with Fixed Sample Size
Sample size per group: 100
Power to detect an effect size of 0.2: 50.8%

Interpretation: There's a 50.8% chance of detecting the effect if it truly exists


In [13]:
# Question: I only care about detecting a medium effect size of 0.5, and I have 100 users per group. What's my power?
group_size = 100
effect_size = 0.5
power_result = sut.power_analysis_numeric(
    effect_size=effect_size,
    alpha=0.05,
    sample_size_per_group=group_size
)

print("Power Analysis with Fixed Sample Size")
print("=" * 40)
print(f"Sample size per group: {group_size}")
print(f"Power to detect an effect size of {effect_size:.1f}: {power_result['power']*100:.1f}%")
print(f"\nInterpretation: There's a {power_result['power']*100:.1f}% chance of detecting the effect if it truly exists")

Power Analysis with Fixed Sample Size
Sample size per group: 100
Power to detect an effect size of 0.5: 99.9%

Interpretation: There's a 99.9% chance of detecting the effect if it truly exists


### 3.3 Calculating Effect Size

**Different Types of Effect Size for Numeric Targets**   
There’s no one-size-fits-all metric. Different experimental designs call for different measures of effect size.



| Effect Size Metric | Scenario                                      | Formula Highlights                                            |
| ------------------ | --------------------------------------------- | ------------------------------------------------------------- |
| **Cohen's d**      | Two independent groups                        | $(\text{mean}_1 - \text{mean}_2) / \text{pooled std}$         |
| **Paired d**       | Same group measured twice (pre/post)          | $(\text{mean}_\text{diff}) / \text{std}_\text{diff}$          |
| **Glass’s delta**  | Two groups, assume only control std known     | $(\text{mean}_1 - \text{mean}_2) / \text{std}_\text{control}$ |
| **Cohen’s f / η²** | More than two groups (ANOVA)                  | Variance between groups vs. within-group variance             |
| **f² (R² change)** | Regression / multiple predictors              | $(R^2 / (1 - R^2))$                                           |
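
To make the first three rows of the table concrete, here is a minimal sketch that computes Cohen's d, Glass's delta, and the paired d from raw samples using only the standard library. The function names and the data are made up for illustration and are not part of MLArena.

```python
import math
from statistics import mean, stdev


def cohens_d(a, b):
    """Two independent groups: mean difference over the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * stdev(a) ** 2 + (len(b) - 1) * stdev(b) ** 2) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def glass_delta(treatment, control):
    """Two groups: mean difference scaled by the control group's std only."""
    return (mean(treatment) - mean(control)) / stdev(control)


def paired_d(before, after):
    """Same units measured twice: mean of the differences over their std."""
    diffs = [y - x for x, y in zip(before, after)]
    return mean(diffs) / stdev(diffs)


# Made-up illustrative data (not from any real experiment)
control = [98.0, 102.0, 95.0, 105.0, 100.0]
treatment = [104.0, 108.0, 99.0, 110.0, 104.0]

print(f"Cohen's d:     {cohens_d(treatment, control):.3f}")
print(f"Glass's delta: {glass_delta(treatment, control):.3f}")
print(f"Paired d:      {paired_d(control, treatment):.3f}")  # treats the lists as pre/post pairs
```

The same raw numbers give quite different standardized effects depending on the design, which is why the effect size fed into a power analysis should match the test you plan to run.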


**Why `numeric_effectsize()` and What It Assumes**

In building A/B testing workflows, our typical situation looks like this:

- We have historical data from the control group
- We expect a specific improvement (e.g., +3 units)
- We need to quickly estimate the standardized effect size for power analysis

So we created `numeric_effectsize()`, a focused utility designed specifically for:

✅ Two-group A/B tests    
✅ Independent samples    
✅ Can compute pooled standard deviation if you provide `std1`, `std2`, `n1`, and `n2`    
✅ Otherwise, it assumes equal variance between the two groups by default    



In [10]:
# You’re planning to test a change in UI, and you believe it will improve the average purchase amount.

# Historical average: $100
# Expected increase: $5 (i.e., new mean = $105)
# Historical standard deviation: $20
# You want to compute the standardized effect size to plug into your power analysis.

# option 1: using means directly
d = sut.numeric_effectsize(mean1=105, mean2=100, std=20)
print(f"Cohen's d: {d:.3f}")

# option 2: using mean difference
d = sut.numeric_effectsize(mean_diff=5, std=20)
print(f"Cohen's d: {d:.3f}")

Cohen's d: 0.250
Cohen's d: 0.250
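
When the two groups have different standard deviations, the pooled standard deviation mentioned above combines them, weighted by degrees of freedom. The sketch below shows that calculation from summary statistics. `pooled_std` is a hypothetical helper and the inputs are invented; it is not MLArena's internal code.

```python
import math


def pooled_std(std1: float, std2: float, n1: int, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


# Hypothetical summary statistics: treatment std 22, control std 18, 1,000 users each
sd = pooled_std(std1=22, std2=18, n1=1000, n2=1000)
d = (105 - 100) / sd  # same $5 expected increase as in the example above

print(f"Pooled std: {sd:.2f}")
print(f"Cohen's d:  {d:.3f}")
```

With equal group sizes the pooled variance is just the average of the two variances, so the result stays close to the equal-variance case when the two standard deviations are similar.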


## 4. Integration with MLArena's A/B Testing Workflow

Combine power analysis with MLArena's existing A/B testing capabilities:

#### TODO: Add 1) Stratification Optimization, 2) Group Comparison beforehand, and 3) Group Comparison afterwards

In [11]:
# Example workflow combining power analysis with existing MLArena functions
from mlarena.utils.stats_utils import compare_groups, add_stratified_groups

# Step 1: Plan our test with power analysis
print("STEP 1: Plan Test with Power Analysis")
print("=" * 40)
target_sample = sut.sample_size_proportion(0.05, 0.06, power=0.8)
print(f"Need {target_sample['sample_size_per_group']:,} users per group")

# Step 2: Create properly stratified groups (if applicable)
print("\nSTEP 2: Create Stratified Groups")
print("=" * 35)
# Simulate user data with characteristics to stratify on
user_data = pd.DataFrame({
    'user_id': range(target_sample['total_sample_size']),
    'region': np.random.choice(['North', 'South', 'East', 'West'], target_sample['total_sample_size']),
    'user_segment': np.random.choice(['New', 'Returning', 'VIP'], target_sample['total_sample_size']),
    'metric1': np.random.normal(100, 15, target_sample['total_sample_size'])
})

# Add stratified A/B groups
stratified_data = add_stratified_groups(
    user_data, 
    stratifier_col=['region', 'user_segment'],
    group_labels=('control', 'treatment')
)

print(f"Created {len(stratified_data)} user assignments")
print("Group distribution:")
print(stratified_data['stratified_group'].value_counts())

# Step 3: After test completion, analyze results
print("\nSTEP 3: Analyze Test Results")
print("=" * 32)
# Simulate test completion with results
stratified_data['converted'] = np.where(
    stratified_data['stratified_group'] == 'control',
    np.random.binomial(1, 0.05, len(stratified_data)),
    np.random.binomial(1, 0.06, len(stratified_data))
)

# Use compare_groups to validate results
effect_size, summary = compare_groups(
    stratified_data.copy().assign(converted=stratified_data['converted'].astype(str)),
    'stratified_group', 
    ['converted'],
    cat_test='chi2'
)

print("Test results summary:")
print(summary[['target_var', 'p_value', 'effect_size', 'is_significant']].to_string(index=False))

control_rate = stratified_data[stratified_data['stratified_group']=='control']['converted'].mean()
treatment_rate = stratified_data[stratified_data['stratified_group']=='treatment']['converted'].mean()
print(f"\nControl conversion: {control_rate:.1%}")
print(f"Treatment conversion: {treatment_rate:.1%}")
print(f"Lift: {(treatment_rate-control_rate)/control_rate:.1%}")

STEP 1: Plan Test with Power Analysis
Need 8,143 users per group

STEP 2: Create Stratified Groups


Created 16286 user assignments
Group distribution:
stratified_group
treatment    8143
control      8143
Name: count, dtype: int64

STEP 3: Analyze Test Results
Test results summary:
target_var  p_value  effect_size  is_significant
 converted 0.000252     0.028681            True

Control conversion: 4.8%
Treatment conversion: 6.1%
Lift: 27.4%
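
For a 2x2 comparison like this one, a chi-square test without continuity correction is equivalent to a pooled two-proportion z-test (the chi-square statistic equals z squared). The sketch below runs that z-test with the standard library as an independent sanity check. The conversion counts are hypothetical, chosen only to roughly match the 4.8% and 6.1% rates above, and `two_proportion_ztest` is an illustrative helper rather than an MLArena function.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 391 of 8,143 control users and 497 of 8,143 treatment users convert
z, p_value = two_proportion_ztest(x1=391, n1=8143, x2=497, n2=8143)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.6f}")
```

The exact p-value depends on the real counts and on whether the chi-square test applies a continuity correction, so treat this as a rough check rather than a reproduction of the `compare_groups` output.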
