# Marketing A/B Test - Frequentist Statistical Analysis

**Author**: Analytics Team  
**Date**: November 2025  
**Version**: 1.0

---

## Executive Summary

This notebook applies frequentist statistical tests to determine whether the advertising campaign had a statistically significant impact on conversion rates. Several complementary methods are used so that the conclusion does not rest on a single test.

## Statistical Methods

1. **Two-Sample T-Test (Welch's)**: Tests for a difference in means between two groups without assuming equal variances
2. **Chi-Square Test for Independence**: Tests whether group assignment and conversion are independent
3. **Bootstrap Confidence Intervals**: Non-parametric confidence interval estimation using resampling
4. **Effect Size (Cohen's h/d)**: Measures the magnitude of the difference, independent of sample size
5. **Statistical Power Analysis**: Assesses the probability of detecting a true effect

## Interpretation Guidelines

- **p < 0.05**: Statistically significant difference
- **0.05 ≤ p < 0.10**: Marginally significant (proceed with caution)
- **p ≥ 0.10**: No significant difference detected
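
The thresholds above can be written as a small helper. This is only a sketch: the function name `classify_p_value` and its label strings are illustrative assumptions and are not used elsewhere in the notebook.

```python
def classify_p_value(p, alpha=0.05, marginal=0.10):
    """Map a p-value to the three categories in the guidelines (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < alpha:
        return "significant"
    if p < marginal:
        return "marginal"
    return "not significant"


# Boundary values fall into the upper category, matching "0.05 <= p < 0.10"
for p in (0.003, 0.05, 0.07, 0.10, 0.42):
    print(f"p = {p:.3f} -> {classify_p_value(p)}")
```

Keeping the cut-offs as parameters makes it easy to apply a stricter `alpha` without editing the interpretation logic.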


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from statsmodels.stats.power import TTestIndPower
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")


## 1. Data Preparation

Load the data and separate into test (ad) and control (psa) groups.


In [None]:
# Load data
df = pd.read_csv('marketing_AB.csv')

# Separate groups
ad_group = df[df['test group'] == 'ad']
psa_group = df[df['test group'] == 'psa']

# Extract conversion data (cast to int so a boolean column is treated as 0/1)
ad_conversions = ad_group['converted'].astype(int).values
psa_conversions = psa_group['converted'].astype(int).values

# Calculate conversion rates
cr_ad = ad_conversions.mean()
cr_psa = psa_conversions.mean()
n_ad = len(ad_conversions)
n_psa = len(psa_conversions)

print(f"Ad Group:  {n_ad:,} users, Conversion Rate: {cr_ad:.6f} ({cr_ad*100:.4f}%)")
print(f"PSA Group: {n_psa:,} users, Conversion Rate: {cr_psa:.6f} ({cr_psa*100:.4f}%)")
print(f"Difference: {cr_ad - cr_psa:.6f} ({(cr_ad - cr_psa)*100:.4f}%)")


## 2. Two-Sample T-Test

### Hypothesis Testing

**Null Hypothesis ($H_0$)**: There is no difference in conversion rates between ad and psa groups
$$H_0: \mu_{\text{ad}} = \mu_{\text{psa}}$$

**Alternative Hypothesis ($H_1$)**: There is a difference in conversion rates
$$H_1: \mu_{\text{ad}} \neq \mu_{\text{psa}}$$

### T-Statistic Formula

For unequal variances (Welch's t-test):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

where:
- $\bar{x}_1, \bar{x}_2$ are sample means
- $s_1^2, s_2^2$ are sample variances
- $n_1, n_2$ are sample sizes

### Degrees of Freedom (Welch's Approximation)

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

### Confidence Interval

For a $(1-\alpha)$ confidence interval:

$$CI = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2, df} \times SE$$

where $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
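
As a check on the formulas above, the following sketch computes the Welch statistic, degrees of freedom, and confidence interval by hand and compares them with `scipy.stats.ttest_ind`. The counts are hypothetical and chosen only for illustration; they are not taken from `marketing_AB.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (assumption, not from the dataset)
n1, conv1 = 5000, 130   # group 1: users, conversions
n2, conv2 = 4000, 72    # group 2: users, conversions

# Expand the counts into 0/1 outcome arrays
x1 = np.r_[np.ones(conv1), np.zeros(n1 - conv1)]
x2 = np.r_[np.ones(conv2), np.zeros(n2 - conv2)]

m1, m2 = x1.mean(), x2.mean()
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)   # sample variances

# Standard error, t-statistic, and Welch degrees of freedom
se = np.sqrt(v1 / n1 + v2 / n2)
t_manual = (m1 - m2) / se
df_manual = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)   # two-sided p-value

# 95% confidence interval for the difference in means
t_crit = stats.t.ppf(0.975, df_manual)
ci = (m1 - m2 - t_crit * se, m1 - m2 + t_crit * se)

# Reference result from SciPy
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_manual:.1f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}")
print(f"95% CI: [{ci[0]:.6f}, {ci[1]:.6f}]")
```

The manual and SciPy values should agree to floating-point precision, which confirms that the formulas match what `equal_var=False` computes.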


In [None]:
# Perform t-test (unequal variances - Welch's t-test)
t_stat, p_value = stats.ttest_ind(ad_conversions, psa_conversions, equal_var=False)

# Calculate standard errors
se_ad = np.std(ad_conversions, ddof=1) / np.sqrt(n_ad)
se_psa = np.std(psa_conversions, ddof=1) / np.sqrt(n_psa)
se_diff = np.sqrt(se_ad**2 + se_psa**2)

# Degrees of freedom (Welch's approximation)
var_ad = np.var(ad_conversions, ddof=1)
var_psa = np.var(psa_conversions, ddof=1)
df_welch = (se_ad**2 + se_psa**2)**2 / (se_ad**4/(n_ad-1) + se_psa**4/(n_psa-1))

# 95% Confidence interval
ci_95_lower = (cr_ad - cr_psa) - stats.t.ppf(0.975, df_welch) * se_diff
ci_95_upper = (cr_ad - cr_psa) + stats.t.ppf(0.975, df_welch) * se_diff

print("T-TEST RESULTS")
print("="*60)
print(f"T-statistic: {t_stat:.6f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of Freedom (Welch): {df_welch:.2f}")
print(f"95% CI for difference: [{ci_95_lower:.6f}, {ci_95_upper:.6f}]")
print(f"95% CI for difference (%): [{(ci_95_lower*100):.4f}%, {(ci_95_upper*100):.4f}%]")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print(f"\n✅ Statistically Significant (p < {alpha})")
    print("   We reject the null hypothesis. There is evidence of a difference.")
else:
    print(f"\n⚠️  Not Statistically Significant (p ≥ {alpha})")
    print("   We fail to reject the null hypothesis.")


In [None]:
# Visualization: T-Test Results
# (uses p_value, ci_95_lower and ci_95_upper from the t-test cell above)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution comparison
axes[0].hist(ad_conversions, bins=2, alpha=0.7, label='Ad Group', color='#3B82F6', edgecolor='black', linewidth=1.5)
axes[0].hist(psa_conversions, bins=2, alpha=0.7, label='PSA Group', color='#6B7280', edgecolor='black', linewidth=1.5)
axes[0].axvline(cr_ad, color='#3B82F6', linestyle='--', linewidth=2, label=f'Ad Mean: {cr_ad:.4f}')
axes[0].axvline(cr_psa, color='#6B7280', linestyle='--', linewidth=2, label=f'PSA Mean: {cr_psa:.4f}')
axes[0].set_xlabel('Conversion (0=No, 1=Yes)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Conversion Distribution Comparison', fontsize=14, fontweight='bold', pad=20)
axes[0].legend(fontsize=11)
axes[0].grid(alpha=0.3, linestyle='--')

# Confidence interval visualization
axes[1].errorbar(0, cr_ad - cr_psa, yerr=[[cr_ad - cr_psa - ci_95_lower], [ci_95_upper - (cr_ad - cr_psa)]],
                 fmt='o', markersize=12, capsize=10, capthick=2, color='green' if p_value < 0.05 else 'red',
                 label=f'Difference: {cr_ad - cr_psa:.6f}')
axes[1].axhline(y=0, color='black', linestyle='--', linewidth=1.5)
axes[1].set_xlim(-0.5, 0.5)
axes[1].set_ylabel('Difference in Conversion Rate', fontsize=12, fontweight='bold')
axes[1].set_title('95% Confidence Interval for Difference', fontsize=14, fontweight='bold', pad=20)
axes[1].set_xticks([])
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3, linestyle='--')
axes[1].text(0, ci_95_upper + 0.0001, f'Upper: {ci_95_upper:.6f}', ha='center', fontsize=10)
axes[1].text(0, ci_95_lower - 0.0001, f'Lower: {ci_95_lower:.6f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()


## 3. Chi-Square Test for Independence

Tests whether group assignment and conversion are independent.

### Chi-Square Statistic

$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where:
- $O_{ij}$ = Observed frequency in cell $(i,j)$
- $E_{ij}$ = Expected frequency in cell $(i,j)$

### Expected Frequencies

$$E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}$$

### Degrees of Freedom

$$df = (r-1) \times (c-1)$$

where $r$ = number of rows, $c$ = number of columns
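
The sketch below applies these three formulas to a hypothetical 2×2 table (the counts are assumptions for illustration, not values from the dataset) and compares the result with `chi2_contingency`. Passing `correction=False` matters here: for 2×2 tables SciPy applies Yates' continuity correction by default, which the plain formula above does not include.

```python
import numpy as np
from scipy import stats
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = groups, columns = (not converted, converted)
observed = np.array([[4870, 130],
                     [3928, 72]])

# Expected frequencies: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic and degrees of freedom
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# Reference result without Yates' continuity correction
chi2_ref, p_ref, dof_ref, expected_ref = chi2_contingency(observed, correction=False)

print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof_manual}, p = {p_manual:.6f}")
print(f"scipy:  chi2 = {chi2_ref:.4f}, dof = {dof_ref}, p = {p_ref:.6f}")
```

With the correction left on, SciPy reports a slightly smaller statistic and a slightly larger p-value than the hand calculation.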


In [None]:
# Create contingency table
contingency_table = pd.crosstab(df['test group'], df['converted'])
print("CONTINGENCY TABLE:")
print(contingency_table)

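# Perform chi-square test
# Note: for a 2x2 table SciPy applies Yates' continuity correction by default
# (correction=True), so the statistic is slightly smaller than the plain formula gives.
chi2, p_chi2, dof, expected = chi2_contingency(contingency_table)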

print(f"\nCHI-SQUARE TEST RESULTS")
print("="*60)
print(f"Chi-square statistic: {chi2:.6f}")
print(f"P-value: {p_chi2:.6f}")
print(f"Degrees of Freedom: {dof}")

print(f"\nExpected Frequencies:")
print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# Interpretation
if p_chi2 < 0.05:
    print(f"\n✅ Significant association (p < 0.05)")
    print("   Group assignment and conversion are not independent.")
else:
    print(f"\n⚠️  No significant association detected (p ≥ 0.05)")


## 4. Effect Size Calculation

Effect size measures the magnitude of the difference, independent of sample size.

### Cohen's h (for Proportions)

Cohen's h uses the arcsine transformation:

$$h = 2 \times (\arcsin(\sqrt{p_1}) - \arcsin(\sqrt{p_2}))$$

### Cohen's d (for Continuous Variables)

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$

where the pooled standard deviation is:

$$s_{\text{pooled}} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$

### Interpretation

- $|h| < 0.2$: Negligible effect
- $0.2 \leq |h| < 0.5$: Small effect
- $0.5 \leq |h| < 0.8$: Medium effect
- $|h| \geq 0.8$: Large effect
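
To give a feel for the scale, here is a small worked example with two hypothetical conversion rates (2.6% and 1.8%, assumed for illustration only). It uses the Bernoulli variance $p(1-p)$ and equal group sizes as simplifying assumptions when computing Cohen's d.

```python
import numpy as np

# Hypothetical conversion rates (assumption, not from the dataset)
p1, p2 = 0.026, 0.018

# Cohen's h via the arcsine transformation
h = 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))

# Cohen's d for 0/1 outcomes, assuming equal group sizes so the
# pooled variance is the average of the two Bernoulli variances
pooled_sd = np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
d = (p1 - p2) / pooled_sd

relative_lift = (p1 - p2) / p2

print(f"Relative lift: {relative_lift:.1%}")
print(f"Cohen's h:     {h:.4f}")
print(f"Cohen's d:     {d:.4f}")
```

Under these assumptions h and d come out close to each other, and both fall in the "negligible" band even though the relative lift is large. At low baseline rates, the conventional thresholds should therefore be read alongside the business value of the lift rather than on their own.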


In [None]:
# Cohen's h for proportions (arcsine transformation)
def cohens_h(p1, p2):
    """Calculate Cohen's h for two proportions"""
    h = 2 * (np.arcsin(np.sqrt(p1)) - np.arcsin(np.sqrt(p2)))
    return h

# Cohen's d (for continuous, using conversion rates as means)
pooled_std = np.sqrt(((n_ad - 1) * var_ad + (n_psa - 1) * var_psa) / (n_ad + n_psa - 2))
cohens_d = (cr_ad - cr_psa) / pooled_std if pooled_std > 0 else 0

# Cohen's h
cohens_h_value = cohens_h(cr_ad, cr_psa)

print("EFFECT SIZE METRICS")
print("="*60)
print(f"Cohen's h: {cohens_h_value:.6f}")
print(f"Cohen's d: {cohens_d:.6f}")

# Interpretation
def interpret_effect_size_h(h):
    abs_h = abs(h)
    if abs_h < 0.2:
        return "Negligible"
    elif abs_h < 0.5:
        return "Small"
    elif abs_h < 0.8:
        return "Medium"
    else:
        return "Large"

effect_interpretation = interpret_effect_size_h(cohens_h_value)
print(f"\nEffect Size Interpretation: {effect_interpretation} effect (|h| = {abs(cohens_h_value):.4f})")


## 5. Bootstrap Confidence Intervals

Bootstrap is a non-parametric method that resamples the data with replacement to estimate the sampling distribution.

### Bootstrap Procedure

1. Resample $n_1$ observations from group 1 with replacement
2. Resample $n_2$ observations from group 2 with replacement
3. Calculate the difference in means
4. Repeat $B$ times (typically 10,000)
5. Use percentiles of the bootstrap distribution for confidence intervals

### Bootstrap Confidence Interval

For a $(1-\alpha)$ confidence interval:

$$CI = [Q_{\alpha/2}, Q_{1-\alpha/2}]$$

where $Q_p$ is the $p$-quantile (the $100p$-th percentile) of the bootstrap distribution.
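
For 0/1 outcomes the procedure above has a much faster equivalent: resampling $n$ binary observations with replacement yields a count of ones that follows a Binomial($n$, $\hat{p}$) distribution, so each bootstrap rate can be drawn directly. The sketch below relies on that equivalence; the function name `bootstrap_ci_binary`, the seed, and the counts are assumptions for illustration.

```python
import numpy as np


def bootstrap_ci_binary(conv1, n1, conv2, n2, n_bootstrap=10_000, ci_level=0.95, seed=0):
    """Percentile bootstrap CI for a difference in conversion rates (binary data only)."""
    rng = np.random.default_rng(seed)
    p1_hat, p2_hat = conv1 / n1, conv2 / n2

    # Number of conversions in each resample is Binomial(n, p_hat)
    rates1 = rng.binomial(n1, p1_hat, size=n_bootstrap) / n1
    rates2 = rng.binomial(n2, p2_hat, size=n_bootstrap) / n2
    diffs = rates1 - rates2

    alpha = 1 - ci_level
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lower, upper, diffs


# Hypothetical counts: 130 of 5,000 vs 72 of 4,000 conversions
lower, upper, diffs = bootstrap_ci_binary(130, 5000, 72, 4000)
print(f"95% bootstrap CI: [{lower:.6f}, {upper:.6f}]")
```

This avoids materialising a full resampled array on every iteration, which matters when each group contains hundreds of thousands of rows. It only applies to binary outcomes; for other metrics the explicit resampling loop is still needed.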


In [None]:
def bootstrap_ci(data1, data2, n_bootstrap=10000, ci_level=0.95, seed=42):
    """Calculate a percentile bootstrap confidence interval for the difference in means"""
    rng = np.random.default_rng(seed)  # fixed seed so the interval is reproducible
    n1, n2 = len(data1), len(data2)
    differences = np.empty(n_bootstrap)

    for i in range(n_bootstrap):
        # Resample each group with replacement
        sample1 = rng.choice(data1, size=n1, replace=True)
        sample2 = rng.choice(data2, size=n2, replace=True)
        # Difference in resampled means
        differences[i] = sample1.mean() - sample2.mean()

    alpha = 1 - ci_level
    lower = np.percentile(differences, 100 * alpha / 2)
    upper = np.percentile(differences, 100 * (1 - alpha / 2))

    return lower, upper, differences

print("Running Bootstrap (10,000 iterations)...")
bootstrap_lower, bootstrap_upper, bootstrap_diffs = bootstrap_ci(
    ad_conversions, psa_conversions, n_bootstrap=10000, ci_level=0.95
)

print(f"\nBOOTSTRAP RESULTS")
print("="*60)
print(f"95% CI for difference: [{bootstrap_lower:.6f}, {bootstrap_upper:.6f}]")
print(f"95% CI for difference (%): [{(bootstrap_lower*100):.4f}%, {(bootstrap_upper*100):.4f}%]")
print(f"Bootstrap mean difference: {bootstrap_diffs.mean():.6f}")
print(f"Bootstrap std error: {bootstrap_diffs.std():.6f}")


## 6. Statistical Power Analysis

Statistical power is the probability of correctly rejecting a false null hypothesis.

### Power Formula

For a two-sample t-test:

$$\text{Power} = 1 - \beta = P(\text{reject } H_0 | H_1 \text{ is true})$$

Power depends on:
- Effect size (Cohen's d)
- Sample size ($n$)
- Significance level ($\alpha$)
- Type of test (one-tailed vs two-tailed)

### Required Sample Size

To achieve a desired power $(1-\beta)$:

$$n = \frac{2(z_{\alpha/2} + z_{\beta})^2 \sigma^2}{(\mu_1 - \mu_2)^2}$$

where $n$ is the sample size per group and $z_p$ is the upper-$p$ critical value of the standard normal distribution, i.e. its $(1-p)$ quantile.
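
A minimal sketch of the sample-size formula for conversion rates follows. It assumes equal group sizes and takes $\sigma^2$ to be the average of the two Bernoulli variances $p(1-p)$; the function name `required_n_per_group` and the example rates are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def required_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # upper alpha/2 critical value
    z_beta = norm.ppf(power)            # upper beta critical value, beta = 1 - power
    # Assumption: common variance approximated by the mean Bernoulli variance
    sigma2 = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    return 2 * (z_alpha + z_beta) ** 2 * sigma2 / (p1 - p2) ** 2


# Hypothetical rates: 2.6% vs 1.8%
n_80 = required_n_per_group(0.026, 0.018, power=0.80)
n_90 = required_n_per_group(0.026, 0.018, power=0.90)
print(f"n per group for 80% power: {int(np.ceil(n_80)):,}")
print(f"n per group for 90% power: {int(np.ceil(n_90)):,}")
```

Because the difference enters squared in the denominator, halving the detectable difference roughly quadruples the required sample size.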


In [None]:
# Calculate achieved power
# (post hoc: evaluated at the observed effect size, so read it as descriptive)
power_analysis = TTestIndPower()
achieved_power = power_analysis.power(
    effect_size=cohens_d,
    nobs1=n_ad,
    ratio=n_psa/n_ad,
    alpha=0.05,
    alternative='two-sided'
)

print("POWER ANALYSIS")
print("="*60)
print(f"Observed effect size (Cohen's d): {cohens_d:.6f}")
print(f"Sample size (Ad): {n_ad:,}")
print(f"Sample size (PSA): {n_psa:,}")
print(f"Achieved power: {achieved_power:.4f} ({achieved_power*100:.2f}%)")

# Calculate required sample size for 80% power (equal-sized groups)
required_n = power_analysis.solve_power(
    effect_size=cohens_d,
    power=0.80,
    ratio=1.0,
    alpha=0.05,
    alternative='two-sided'
)

print(f"\nRequired sample size per group (80% power): {int(np.ceil(required_n)):,}")
# The requirement is per group, so the smaller group is the binding one
if min(n_ad, n_psa) >= required_n:
    print(f"✅ Sample size is adequate")
else:
    print(f"⚠️  Sample size may be insufficient")


## Summary & Conclusions

The names in braces below are the variables computed in the cells above; Markdown does not render them. The code cell that follows prints their values and saves them to JSON.

### Statistical Test Results
- **T-Test P-value**: `{p_value:.6f}`
- **Chi-Square P-value**: `{p_chi2:.6f}`
- **Effect Size (Cohen's h)**: `{cohens_h_value:.6f}` (`{effect_interpretation}`)
- **Statistical Power**: `{achieved_power:.2%}`

### Key Findings
- `{significance_conclusion}`
- 95% Confidence Interval: `[{ci_95_lower:.6f}, {ci_95_upper:.6f}]`
- Bootstrap 95% CI: `[{bootstrap_lower:.6f}, {bootstrap_upper:.6f}]`

### Recommendations
Based on the frequentist analysis, we can `{recommendation}` the null hypothesis and conclude that `{conclusion}`.


In [None]:
# Summary and conclusions
import json

significance_conclusion = "Statistically Significant" if p_value < 0.05 else "Not Statistically Significant"
recommendation = "reject" if p_value < 0.05 else "fail to reject"
conclusion = "there is evidence of a difference in conversion rates" if p_value < 0.05 else "there is insufficient evidence of a difference"

print("\n" + "="*80)
print("FREQUENTIST ANALYSIS SUMMARY")
print("="*80)
print(f"\nStatistical Significance: {significance_conclusion}")
print(f"Effect Size: {effect_interpretation} (|h| = {abs(cohens_h_value):.4f})")
print(f"Statistical Power: {achieved_power:.2%}")
print(f"\nRecommendation: {recommendation.capitalize()} the null hypothesis")
print(f"Conclusion: {conclusion}")

# Save results
summary = {
    'ad_group_size': n_ad,
    'psa_group_size': n_psa,
    'ad_conversion_rate': cr_ad,
    'psa_conversion_rate': cr_psa,
    'absolute_lift': cr_ad - cr_psa,
    'relative_lift': (cr_ad - cr_psa) / cr_psa,
    't_statistic': t_stat,
    'p_value': p_value,
    'chi2_statistic': chi2,
    'chi2_p_value': p_chi2,
    'cohens_h': cohens_h_value,
    'cohens_d': cohens_d,
    'ci_95_lower': ci_95_lower,
    'ci_95_upper': ci_95_upper,
    'bootstrap_ci_lower': bootstrap_lower,
    'bootstrap_ci_upper': bootstrap_upper,
    'statistical_power': achieved_power,
    'required_sample_size': int(np.ceil(required_n)),
    'is_significant': p_value < 0.05
}

# Convert NumPy scalars (including np.bool_) to plain Python types,
# because json cannot serialize them directly
with open('frequentist_results.json', 'w') as f:
    json.dump({k: v.item() if isinstance(v, np.generic) else v
              for k, v in summary.items()}, f, indent=2)

print("\n✅ Results saved to 'frequentist_results.json'")
print("="*80)
