# A/B Test Analysis: Website Checkout Button Redesign

## Executive Summary

This notebook analyzes an A/B test comparing two checkout button designs for an e-commerce website. We'll determine whether the new design (Version B) significantly improves conversion rates compared to the current design (Version A).

**Business Question:** Should we roll out the new checkout button design to all users?

---

## 1. Setup & Data Generation

First, let's import our libraries and create realistic test data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### Generate Test Data

We'll simulate a 2-week A/B test with:
- **Control Group (A):** Current green "Checkout" button - 12% baseline conversion
- **Treatment Group (B):** New orange "Complete Purchase" button - 14.5% conversion (hypothesized improvement)
- **Sample Size:** 5,000 users per group

In [None]:
# Test parameters
n_control = 5000
n_treatment = 5000
control_conversion_rate = 0.12
treatment_conversion_rate = 0.145

# Generate conversion data (1 = converted, 0 = did not convert)
control_conversions = np.random.binomial(1, control_conversion_rate, n_control)
treatment_conversions = np.random.binomial(1, treatment_conversion_rate, n_treatment)

# Create DataFrame
df = pd.DataFrame({
    'user_id': range(1, n_control + n_treatment + 1),
    'group': ['Control'] * n_control + ['Treatment'] * n_treatment,
    'converted': np.concatenate([control_conversions, treatment_conversions])
})

# Preview data
print("Dataset Preview:")
print(df.head(10))
print(f"\nTotal Users: {len(df):,}")

---

## 2. Exploratory Data Analysis

Let's first understand our data before diving into statistical tests.

In [None]:
# Calculate conversion rates by group
conversion_summary = df.groupby('group').agg({
    'converted': ['sum', 'count', 'mean']
}).round(4)

conversion_summary.columns = ['Conversions', 'Total_Users', 'Conversion_Rate']
conversion_summary['Conversion_Rate_Pct'] = (conversion_summary['Conversion_Rate'] * 100).round(2)

print("Conversion Summary by Group:")
print(conversion_summary)

# Calculate absolute and relative lift
control_rate = conversion_summary.loc['Control', 'Conversion_Rate']
treatment_rate = conversion_summary.loc['Treatment', 'Conversion_Rate']

absolute_lift = treatment_rate - control_rate
relative_lift = (treatment_rate / control_rate - 1) * 100

print(f"\n📊 Key Metrics:")
print(f"Control Rate: {control_rate:.2%}")
print(f"Treatment Rate: {treatment_rate:.2%}")
print(f"Absolute Lift: {absolute_lift * 100:.2f} percentage points")
print(f"Relative Lift: {relative_lift:.2f}%")
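
The two lift figures answer different questions and are easy to confuse, since both are often written with a percent sign. As a small illustration, the sketch below uses the planned rates from Section 1 (12% and 14.5%) instead of the simulated ones; the helper name `lift` is hypothetical and not part of the notebook.

```python
def lift(control_rate, treatment_rate):
    """Return (absolute lift in percentage points, relative lift in percent)."""
    absolute_pp = (treatment_rate - control_rate) * 100
    relative_pct = (treatment_rate / control_rate - 1) * 100
    return absolute_pp, relative_pct


# Planned rates from the test design, not the simulated data
absolute_pp, relative_pct = lift(0.12, 0.145)
print(f"Absolute lift: {absolute_pp:.2f} percentage points")
print(f"Relative lift: {relative_pct:.2f}%")
```

The absolute lift is the gap between the two rates; the relative lift expresses that gap as a fraction of the control rate, which is why it looks much larger.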

In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of conversion rates
ax1 = axes[0]
rates = [control_rate * 100, treatment_rate * 100]
groups = ['Control\n(Current Button)', 'Treatment\n(New Button)']
colors = ['#3498db', '#e74c3c']

bars = ax1.bar(groups, rates, color=colors, alpha=0.7, edgecolor='black')
ax1.set_ylabel('Conversion Rate (%)', fontsize=12, fontweight='bold')
ax1.set_title('Conversion Rate by Group', fontsize=14, fontweight='bold')
ax1.set_ylim(0, max(rates) * 1.3)

# Add value labels on bars
for bar, rate in zip(bars, rates):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{rate:.2f}%',
             ha='center', va='bottom', fontsize=12, fontweight='bold')

# Stacked bar chart showing conversions vs non-conversions
ax2 = axes[1]
summary_data = df.groupby(['group', 'converted']).size().unstack(fill_value=0)
summary_pct = summary_data.div(summary_data.sum(axis=1), axis=0) * 100

summary_pct.plot(kind='bar', stacked=True, ax=ax2, 
                 color=['#95a5a6', '#27ae60'], alpha=0.8, edgecolor='black')
ax2.set_ylabel('Percentage of Users (%)', fontsize=12, fontweight='bold')
ax2.set_xlabel('')
ax2.set_title('Conversion Breakdown by Group', fontsize=14, fontweight='bold')
ax2.legend(['Did Not Convert', 'Converted'], loc='upper right')
ax2.set_xticklabels(['Control', 'Treatment'], rotation=0)

plt.tight_layout()
plt.show()

---

## 3. Hypothesis Testing

### Setting Up Our Hypotheses

Before we run the test, let's clearly state what we're testing:

**Null Hypothesis (H₀):** There is no difference in conversion rates between Control and Treatment.  
- Mathematically: `p_control = p_treatment`

**Alternative Hypothesis (H₁):** The Treatment group has a different conversion rate than Control.  
- Mathematically: `p_control ≠ p_treatment`

**Significance Level (α):** 0.05 (5%)  
- This means we accept a 5% chance of rejecting the null hypothesis when it is actually true (a false positive)

### Why a Two-Proportion Z-Test?

We're comparing conversion rates (proportions) between two independent groups. The two-proportion z-test is the appropriate statistical test for this scenario.

In [None]:
# Extract data for statistical test
control_conv = control_conversions.sum()
control_n = len(control_conversions)
treatment_conv = treatment_conversions.sum()
treatment_n = len(treatment_conversions)

# Calculate pooled proportion (used in z-test)
pooled_prob = (control_conv + treatment_conv) / (control_n + treatment_n)
pooled_se = np.sqrt(pooled_prob * (1 - pooled_prob) * (1/control_n + 1/treatment_n))

# Calculate z-statistic
z_stat = (treatment_rate - control_rate) / pooled_se

# Calculate p-value (two-tailed test); sf(x) equals 1 - cdf(x) but keeps precision for large |z|
p_value = 2 * stats.norm.sf(abs(z_stat))

print("="*60)
print("TWO-PROPORTION Z-TEST RESULTS")
print("="*60)
print(f"\nControl Group:")
print(f"  Conversions: {control_conv:,} out of {control_n:,} ({control_rate:.2%})")
print(f"\nTreatment Group:")
print(f"  Conversions: {treatment_conv:,} out of {treatment_n:,} ({treatment_rate:.2%})")
print(f"\nTest Statistics:")
print(f"  Z-statistic: {z_stat:.4f}")
print(f"  P-value: {p_value:.4f}")
print(f"  Significance level (α): 0.05")
print("\n" + "="*60)
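
To make the arithmetic of the test easy to check by hand, here is a minimal, dependency-free sketch of the same pooled z-test. The counts (600 of 5,000 and 725 of 5,000) are hypothetical values chosen to match the planned 12% and 14.5% rates, and the function name `two_proportion_ztest` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # shared rate assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 600/5000 (12.0%) vs 725/5000 (14.5%)
z, p = two_proportion_ztest(600, 5000, 725, 5000)
print(f"z = {z:.3f}, two-sided p = {p:.5f}")
```

The standard error uses the pooled proportion because the null hypothesis assumes both groups share one underlying rate.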

### Alternative: Two-Sample T-Test

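We can also cross-check the result with a two-sample t-test. On 0/1 data the group mean equals the conversion rate, and with samples this large the t-test gives nearly the same answer as the z-test.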

In [None]:
# Perform two-sample t-test
t_stat, t_pvalue = stats.ttest_ind(treatment_conversions, control_conversions)

print("TWO-SAMPLE T-TEST RESULTS")
print("="*60)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {t_pvalue:.4f}")
print("="*60)
print("\nNote: Both tests should yield similar conclusions.")
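
Why the two tests agree can be seen from the formulas: both divide the same difference in rates by a standard error, and only the variance estimate differs slightly. The sketch below computes both statistics from counts alone, assuming the equal-variance (Student) form of the t-test and the same hypothetical counts of 600/5,000 and 725/5,000; `z_and_t_from_counts` is an illustrative name.

```python
from math import sqrt


def z_and_t_from_counts(x1, n1, x2, n2):
    """Return (z, t) for 0/1 outcomes summarised as success counts."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p2 - p1

    # z-test: variance taken from the pooled proportion under H0
    pooled = (x1 + x2) / (n1 + n2)
    z = diff / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

    # Student's t-test: pooled sample variance. For 0/1 data the sum of
    # squared deviations within a group equals n * p * (1 - p).
    sp2 = (n1 * p1 * (1 - p1) + n2 * p2 * (1 - p2)) / (n1 + n2 - 2)
    t = diff / sqrt(sp2 * (1 / n1 + 1 / n2))
    return z, t


z_check, t_check = z_and_t_from_counts(600, 5000, 725, 5000)
print(f"z = {z_check:.4f}, t = {t_check:.4f}")
```

With thousands of users per group the two variance estimates are almost identical, so the statistics differ only in the third decimal place.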

---

## 4. Confidence Intervals

Confidence intervals tell us the **range of plausible values** for the true conversion rate difference. A 95% confidence interval means: "If we ran this experiment many times and built an interval each time, about 95% of those intervals would contain the true difference."

In [None]:
def calculate_ci(successes, n, confidence=0.95):
    """
    Calculate confidence interval for a proportion using normal approximation.
    """
    prop = successes / n
    z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
    se = np.sqrt(prop * (1 - prop) / n)
    margin = z_critical * se
    return prop - margin, prop + margin

# Calculate 95% confidence intervals for each group
control_ci = calculate_ci(control_conv, control_n)
treatment_ci = calculate_ci(treatment_conv, treatment_n)

# Calculate confidence interval for the DIFFERENCE in conversion rates
diff = treatment_rate - control_rate
se_diff = np.sqrt((control_rate * (1 - control_rate) / control_n) + 
                  (treatment_rate * (1 - treatment_rate) / treatment_n))
z_critical = stats.norm.ppf(0.975)  # ~1.96 for 95% confidence, consistent with calculate_ci
diff_ci = (diff - z_critical * se_diff, diff + z_critical * se_diff)

print("95% CONFIDENCE INTERVALS")
print("="*60)
print(f"\nControl Group Conversion Rate:")
print(f"  Point Estimate: {control_rate:.2%}")
print(f"  95% CI: [{control_ci[0]:.2%}, {control_ci[1]:.2%}]")

print(f"\nTreatment Group Conversion Rate:")
print(f"  Point Estimate: {treatment_rate:.2%}")
print(f"  95% CI: [{treatment_ci[0]:.2%}, {treatment_ci[1]:.2%}]")

print(f"\nDifference (Treatment - Control):")
print(f"  Point Estimate: {diff:.2%}")
print(f"  95% CI: [{diff_ci[0]:.2%}, {diff_ci[1]:.2%}]")
print("="*60)
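
The "95%" in a confidence interval describes the procedure across repeated experiments, not a single interval. A small simulation can make this concrete. The sketch below is a rough check under assumed true rates of 12% and 14.5%, with 1,000 users per group and 500 simulated experiments to keep it fast; the names `wald_diff_ci` and `coverage` are hypothetical.

```python
import random
from math import sqrt


def wald_diff_ci(x1, n1, x2, n2, z_crit=1.96):
    """Normal-approximation CI for the difference p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


def coverage(p1, p2, n, n_experiments, seed=0):
    """Share of simulated experiments whose CI contains the true difference."""
    rng = random.Random(seed)
    true_diff = p2 - p1
    hits = 0
    for _ in range(n_experiments):
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        low, high = wald_diff_ci(x1, n, x2, n)
        hits += low <= true_diff <= high
    return hits / n_experiments


rate = coverage(0.12, 0.145, n=1000, n_experiments=500)
print(f"Share of intervals containing the true difference: {rate:.1%}")
```

The printed share should land near 95%, with some wobble because only 500 experiments are simulated.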

In [None]:
# Visualize confidence intervals
fig, ax = plt.subplots(figsize=(12, 6))

# Data for plotting
groups = ['Control', 'Treatment']
estimates = [control_rate * 100, treatment_rate * 100]
ci_lower = [control_ci[0] * 100, treatment_ci[0] * 100]
ci_upper = [control_ci[1] * 100, treatment_ci[1] * 100]
errors_lower = [estimates[i] - ci_lower[i] for i in range(2)]
errors_upper = [ci_upper[i] - estimates[i] for i in range(2)]

# Create error bars
ax.errorbar(groups, estimates, 
            yerr=[errors_lower, errors_upper],
            fmt='o', markersize=12, capsize=10, capthick=2,
            linewidth=2, color='#2c3e50', ecolor='#34495e')

# Add point estimates as text
for i, (group, est) in enumerate(zip(groups, estimates)):
    ax.text(i, est + 0.3, f'{est:.2f}%', 
            ha='center', fontsize=11, fontweight='bold')

ax.set_ylabel('Conversion Rate (%)', fontsize=13, fontweight='bold')
ax.set_title('95% Confidence Intervals for Conversion Rates', 
             fontsize=15, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3)
ax.set_ylim(10, 16)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
if diff_ci[0] > 0:
    print("The confidence interval for the difference does NOT include zero.")
    print("This means we can be 95% confident that Treatment truly outperforms Control.")
elif diff_ci[1] < 0:
    print("The confidence interval for the difference lies entirely below zero.")
    print("This suggests that Treatment underperforms Control.")
else:
    print("The confidence interval for the difference includes zero.")
    print("This means we cannot rule out that there's no real difference between groups.")

---

## 5. Statistical Significance Interpretation

Let's interpret our p-value in plain English.

In [None]:
alpha = 0.05

print("\n" + "="*70)
print("STATISTICAL SIGNIFICANCE INTERPRETATION")
print("="*70)
print(f"\nP-value: {p_value:.4f}")
print(f"Significance level (α): {alpha}")

if p_value < alpha:
    print(f"\n✅ RESULT: Statistically Significant (p < {alpha})")
    print("\n📊 What this means in plain English:")
    print(f"   If there were truly NO difference between the buttons, we would")
    print(f"   see a difference this large or larger only {p_value*100:.2f}% of the time")
    print(f"   due to random chance alone.")
    print(f"\n   Since this is less than our {alpha*100:.0f}% threshold, we have strong")
    print(f"   evidence that the new button design genuinely improves conversions.")
    print(f"\n🎯 Recommendation: REJECT the null hypothesis.")
    print(f"   The Treatment button appears to be genuinely better.")
else:
    print(f"\n❌ RESULT: Not Statistically Significant (p >= {alpha})")
    print("\n📊 What this means in plain English:")
    print(f"   If there were truly NO difference between the buttons, we would")
    print(f"   see a difference this large or larger {p_value*100:.2f}% of the time")
    print(f"   due to random chance alone.")
    print(f"\n   Since this exceeds our {alpha*100:.0f}% threshold, we don't have enough")
    print(f"   evidence to conclude the new button is better.")
    print(f"\n🎯 Recommendation: FAIL TO REJECT the null hypothesis.")
    print(f"   We cannot confidently say the Treatment button is better.")

print("\n" + "="*70)
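
The plain-English reading of the p-value ("how often would chance alone produce a gap this large?") can be checked directly by simulating a world where the null hypothesis is true. The sketch below is illustrative only: it uses hypothetical counts of 120/1,000 and 150/1,000, picked so that the analytic two-sided p-value sits near 0.05 and the simulation runs quickly, and `simulated_p_value` is an assumed helper name.

```python
import random


def simulated_p_value(x1, x2, n, n_sims=1000, seed=0):
    """Share of null-world experiments with a gap at least as large as observed."""
    rng = random.Random(seed)
    pooled = (x1 + x2) / (2 * n)  # one shared conversion rate under H0
    observed_gap = abs(x2 - x1)
    extreme = 0
    for _ in range(n_sims):
        a = sum(rng.random() < pooled for _ in range(n))
        b = sum(rng.random() < pooled for _ in range(n))
        extreme += abs(b - a) >= observed_gap
    return extreme / n_sims


p_sim = simulated_p_value(120, 150, n=1000)
print(f"Simulated two-sided p-value: {p_sim:.3f}")
```

Both groups are drawn from the same rate, so every gap in the simulation is pure noise; the p-value is simply the fraction of those noise-only gaps that match or exceed the observed one.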

---

## 6. Effect Size & Practical Significance

**Statistical significance ≠ Practical significance**

Even if our test is statistically significant, we need to ask: "Is this difference large enough to matter in the real world?"

In [None]:
# Calculate effect size (Cohen's h for proportions)
def cohens_h(p1, p2):
    """
    Calculate Cohen's h effect size for two proportions.
    Small effect: h = 0.2
    Medium effect: h = 0.5
    Large effect: h = 0.8
    """
    return 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))

effect_size = abs(cohens_h(treatment_rate, control_rate))

# Determine effect size category using the benchmarks in the docstring above
if effect_size < 0.2:
    effect_category = "Negligible"
elif effect_size < 0.5:
    effect_category = "Small"
elif effect_size < 0.8:
    effect_category = "Medium"
else:
    effect_category = "Large"

print("EFFECT SIZE ANALYSIS")
print("="*60)
print(f"\nCohen's h: {effect_size:.4f} ({effect_category} effect)")
print(f"\nAbsolute difference: {absolute_lift:.2%}")
print(f"Relative lift: {relative_lift:.2f}%")
print("\n" + "="*60)
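
One property of Cohen's h worth seeing in numbers is that the same absolute difference produces a different h depending on the baseline, because the arcsine transform stretches rates near 0% and 100%. The sketch below compares a 2.5-point gap at a 12% baseline with the same gap at a 50% baseline; the second scenario is hypothetical, and `h_effect` is an illustrative stand-in for the notebook's function.

```python
from math import asin, sqrt


def h_effect(p1, p2):
    """Cohen's h for two proportions (scalar version)."""
    return 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))


h_low_baseline = h_effect(0.145, 0.12)   # 2.5-point gap near 12%
h_mid_baseline = h_effect(0.525, 0.50)   # same gap near 50% (hypothetical)

print(f"h at 12% baseline: {h_low_baseline:.4f}")
print(f"h at 50% baseline: {h_mid_baseline:.4f}")
```

Both values fall well below the 0.2 benchmark for a small effect, which is common for conversion tests and is one reason the business impact calculation matters alongside h.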

In [None]:
# Business impact calculation
print("\n💰 BUSINESS IMPACT PROJECTION")
print("="*60)

# Hypothetical business metrics
monthly_visitors = 100000
avg_order_value = 75

# Current state
current_monthly_conversions = monthly_visitors * control_rate
current_monthly_revenue = current_monthly_conversions * avg_order_value

# Projected state with new button
projected_monthly_conversions = monthly_visitors * treatment_rate
projected_monthly_revenue = projected_monthly_conversions * avg_order_value

# Incremental impact
additional_conversions = projected_monthly_conversions - current_monthly_conversions
additional_revenue = projected_monthly_revenue - current_monthly_revenue

print(f"\nAssumptions:")
print(f"  • Monthly website visitors: {monthly_visitors:,}")
print(f"  • Average order value: ${avg_order_value:.2f}")

print(f"\nCurrent Performance (Control):")
print(f"  • Conversions/month: {current_monthly_conversions:,.0f}")
print(f"  • Revenue/month: ${current_monthly_revenue:,.2f}")

print(f"\nProjected Performance (Treatment):")
print(f"  • Conversions/month: {projected_monthly_conversions:,.0f}")
print(f"  • Revenue/month: ${projected_monthly_revenue:,.2f}")

print(f"\n📈 Incremental Impact:")
print(f"  • Additional conversions/month: {additional_conversions:,.0f}")
print(f"  • Additional revenue/month: ${additional_revenue:,.2f}")
print(f"  • Additional revenue/year: ${additional_revenue * 12:,.2f}")
print("\n" + "="*60)
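
A point estimate of extra revenue hides the uncertainty in the measured lift. One way to show it, sketched below, is to push the confidence interval for the difference through the same revenue arithmetic. The counts (600/5,000 and 725/5,000) are hypothetical, the traffic and order-value figures are the notebook's own assumptions, and `revenue_lift_range` is an illustrative name.

```python
from math import sqrt


def revenue_lift_range(x_c, n_c, x_t, n_t, monthly_visitors,
                       avg_order_value, z_crit=1.96):
    """Return (low, point, high) monthly revenue lift from the CI of the difference."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    scale = monthly_visitors * avg_order_value  # revenue per unit of conversion rate
    bounds = (diff - z_crit * se, diff, diff + z_crit * se)
    return tuple(scale * d for d in bounds)


rev_low, rev_point, rev_high = revenue_lift_range(
    600, 5000, 725, 5000, monthly_visitors=100_000, avg_order_value=75
)
print(f"Monthly revenue lift: ${rev_point:,.0f} "
      f"(95% CI ${rev_low:,.0f} to ${rev_high:,.0f})")
```

Reporting the range alongside the point estimate makes it clear how much of the projection depends on sampling noise in a two-week test.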

---

## 7. Power Analysis & Sample Size

**Statistical Power** is the probability of detecting a real effect when it exists. Typically, we want at least 80% power.

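Let's check if our test had adequate power. Because the calculation below plugs in the observed effect, treat the result as a rough check: power analysis is most informative when done before the test, using a pre-specified minimum effect worth detecting.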

In [None]:
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize

# Calculate effect size (Cohen's h); treatment first so the value is positive
effect_size_power = proportion_effectsize(treatment_rate, control_rate)

# Calculate achieved power
achieved_power = zt_ind_solve_power(effect_size=effect_size_power,
                                    nobs1=control_n,
                                    alpha=0.05,
                                    ratio=treatment_n/control_n,
                                    alternative='two-sided')

# Calculate required sample size for 80% power
required_n = zt_ind_solve_power(effect_size=effect_size_power,
                                power=0.8,
                                alpha=0.05,
                                ratio=1.0,
                                alternative='two-sided')

print("STATISTICAL POWER ANALYSIS")
print("="*60)
print(f"\nActual sample size per group: {control_n:,}")
print(f"Achieved statistical power: {achieved_power:.2%}")
print(f"\nRequired sample size for 80% power: {required_n:,.0f} per group")

if achieved_power >= 0.8:
    print(f"\n✅ Our test had sufficient power to detect this effect.")
else:
    print(f"\n⚠️  Our test was underpowered. Consider collecting more data.")
    
print("\n" + "="*60)
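
For intuition about where power and sample-size numbers come from, the sketch below applies a common normal-approximation formula for two equal-sized groups, using Cohen's h as the effect size. It assumes the planned rates of 12% and 14.5%, ignores the negligible probability of rejecting in the wrong direction, and uses the hypothetical name `power_and_required_n`; its output should be close to, though not necessarily identical to, what statsmodels reports.

```python
from math import asin, sqrt
from statistics import NormalDist


def power_and_required_n(p1, p2, n_per_group, alpha=0.05, target_power=0.80):
    """Approximate power at n_per_group and the n needed for target_power."""
    h = abs(2 * (asin(sqrt(p1)) - asin(sqrt(p2))))  # Cohen's h
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = NormalDist().inv_cdf(target_power)

    # Expected z-statistic is h * sqrt(n / 2) for two equal groups
    power = NormalDist().cdf(h * sqrt(n_per_group / 2) - z_alpha)
    required_n = 2 * ((z_alpha + z_power) / h) ** 2
    return power, required_n


approx_power, approx_n = power_and_required_n(0.12, 0.145, n_per_group=5000)
print(f"Approximate power at 5,000 per group: {approx_power:.1%}")
print(f"Approximate n per group for 80% power: {approx_n:,.0f}")
```

Because required sample size scales with 1/h², halving the detectable effect roughly quadruples the number of users needed per group.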

---

## 8. Conclusion & Recommendations

### Summary of Findings

In [None]:
print("\n" + "="*70)
print(" " * 20 + "FINAL DECISION SUMMARY")
print("="*70)

print("\n📊 Test Results:")
print(f"   • Control conversion rate: {control_rate:.2%}")
print(f"   • Treatment conversion rate: {treatment_rate:.2%}")
print(f"   • Absolute improvement: {absolute_lift:.2%}")
print(f"   • Relative improvement: {relative_lift:.1f}%")

print(f"\n📈 Statistical Analysis:")
print(f"   • P-value: {p_value:.4f}")
print(f"   • Result: {'Statistically significant' if p_value < 0.05 else 'Not statistically significant'}")
print(f"   • 95% CI for difference: [{diff_ci[0]:.2%}, {diff_ci[1]:.2%}]")
print(f"   • Effect size: {effect_category} (Cohen's h = {effect_size:.3f})")
print(f"   • Statistical power: {achieved_power:.1%}")

print(f"\n💰 Business Impact:")
print(f"   • Additional monthly revenue: ${additional_revenue:,.2f}")
print(f"   • Projected annual revenue lift: ${additional_revenue * 12:,.2f}")

print("\n" + "="*70)
print("\n🎯 FINAL RECOMMENDATION:")
print("="*70)

if p_value < 0.05 and diff_ci[0] > 0:
    print("\n✅ PROCEED WITH ROLLOUT")
    print("\nThe new checkout button design shows statistically significant")
    print("improvement over the current design. Based on our analysis:")
    print("\n  1. The treatment group converted at a significantly higher rate")
    print(f"  2. We're 95% confident the true improvement is between")
    print(f"     {diff_ci[0]:.2%} and {diff_ci[1]:.2%}")
    print(f"  3. The projected annual revenue impact is ${additional_revenue * 12:,.2f}")
    print("\n  Recommendation: Roll out the new button design to all users.")
else:
    print("\n⚠️  DO NOT PROCEED - INSUFFICIENT EVIDENCE")
    print("\nWhile the treatment group showed higher conversion rates,")
    print("the difference is not statistically significant. This means:")
    print("\n  1. We cannot rule out random chance as the cause")
    print("  2. The observed difference might disappear with more data")
    print("\n  Options:")
    print("    • Extend the test to collect more data")
    print("    • Test a more dramatic design change")
    print("    • Keep the current design")

print("\n" + "="*70)

---

## 9. Key Learnings & Next Steps

### What We Learned About A/B Testing

1. **Statistical Significance vs. Practical Significance**  
   A result can be statistically significant but too small to matter in business terms (or vice versa in small samples).

2. **The Role of Sample Size**  
   Larger samples give us more confidence and allow us to detect smaller effects.

3. **P-values Tell One Part of the Story**  
   Always combine p-values with confidence intervals and effect sizes for complete understanding.

4. **Business Context Matters**  
   Even a small percentage improvement can translate to significant revenue at scale.

### Potential Next Steps

- **Segment Analysis:** Do results differ by user type, device, or traffic source?
- **Long-term Monitoring:** Continue tracking to ensure results hold over time
- **Additional Tests:** Test other elements (copy, color, placement)
- **Multi-variate Testing:** Test multiple changes simultaneously

---

## Appendix: Statistical Formulas Used

### Two-Proportion Z-Test
$$z = \frac{\hat{p}_2 - \hat{p}_1}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}}$$

where $\hat{p}_1$ and $\hat{p}_2$ are the observed Control and Treatment conversion rates, and $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion

### Confidence Interval for Proportion
$$CI = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
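
### Confidence Interval for the Difference in Proportions

Section 4 also computes an interval for the difference between the two rates. Transcribing that code into the notation used here, and assuming subscript 1 denotes Control and subscript 2 denotes Treatment, the interval uses the unpooled standard error:

$$CI = (\hat{p}_2 - \hat{p}_1) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$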

### Cohen's h (Effect Size for Proportions)
$$h = 2(\arcsin(\sqrt{p_1}) - \arcsin(\sqrt{p_2}))$$

---

*End of Analysis*