# End-to-End A/B Testing & Experimentation Framework

This project demonstrates how to design, analyze, and interpret an A/B experiment
to make a product launch decision using statistical evidence.

The experiment evaluates whether a new “Buy Now” button improves conversion rate
compared to the existing button.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats


## Experiment Design

**Decision**  
Should we launch the new “Buy Now” button to all users?

**Primary Metric**  
Conversion Rate (number of users who purchase / number of users who visit)

**Minimum Detectable Effect**  
+0.5 percentage points absolute increase in conversion rate (for example, from 10.0% to 10.5%)

**Unit of Experiment**  
Individual users
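
A minimum detectable effect only means something alongside a sample size, so it can help to check how many users the design would need. The sketch below uses the standard normal-approximation formula for a two-sided two-proportion test. The function names `sample_size_per_group` and `approx_power`, the 80% power target, and the 10% baseline are assumptions made for illustration, not part of the original design.

```python
import math
from scipy import stats


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group (normal approximation, two-sided test)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = stats.norm.ppf(power)            # quantile for the desired power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


def approx_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of the two-sided test for a given per-group sample size."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    se = math.sqrt(
        (p_control * (1 - p_control) + p_treatment * (1 - p_treatment)) / n_per_group
    )
    shift = abs(p_treatment - p_control) / se
    # Probability of landing in either rejection region under the alternative
    return stats.norm.sf(z_alpha - shift) + stats.norm.cdf(-z_alpha - shift)


# Assumed inputs: 10% baseline, +0.5 percentage point MDE
required_n = sample_size_per_group(0.10, 0.105)
power_at_10k = approx_power(0.10, 0.105, 10_000)

print("Required users per group:", required_n)
print("Approximate power with 10,000 per group:", round(power_at_10k, 3))
```

Under these assumptions the formula asks for roughly 58,000 users per group, and 10,000 users per group gives power of only about 20% for a 0.5 percentage point effect. A run at that size can still detect larger effects, but a non-significant result would say little about an uplift of the target size.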


In [2]:
np.random.seed(42)

n_users = 10000

control_cr = 0.10      # 10% baseline
treatment_cr = 0.105  # 10.5% (minimum meaningful uplift)

control = np.random.binomial(1, control_cr, n_users)
treatment = np.random.binomial(1, treatment_cr, n_users)

df = pd.DataFrame({
    "group": ["control"] * n_users + ["treatment"] * n_users,
    "converted": np.concatenate([control, treatment])
})

df.head()


| | group | converted |
|---|---|---|
| 0 | control | 0 |
| 1 | control | 1 |
| 2 | control | 0 |
| 3 | control | 0 |
| 4 | control | 0 |


In [3]:
df.groupby("group")["converted"].mean()


| group | converted |
|---|---|
| control | 0.0961 |
| treatment | 0.1078 |


## Statistical Testing

We use a two-proportion z-test because:
- The outcome is binary (conversion vs no conversion)
- We are comparing two independent groups
- Each group is large enough for the normal approximation to hold (10,000 users, with roughly 1,000 conversions each)


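**Decision rule** (fixed before looking at the data)

- Significance level: α = 0.05, two-sided
- If p-value < 0.05: reject the null hypothesis of equal conversion rates; the difference is statistically significant
- Otherwise: there is not enough evidence to conclude that the conversion rates differ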
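
For reference, the test statistic used in the code that follows can be written out explicitly. The symbols are notation assumed here for illustration: $x_C, x_T$ are the conversion counts, $n_C, n_T$ the group sizes, $\hat{p}_C = x_C / n_C$ and $\hat{p}_T = x_T / n_T$ the observed conversion rates, and $\Phi$ the standard normal CDF.

$$
\hat{p} = \frac{x_C + x_T}{n_C + n_T}, \qquad
z = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_C} + \dfrac{1}{n_T}\right)}}, \qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
$$

The pooled proportion $\hat{p}$ appears in the standard error because, under the null hypothesis, both groups share a single conversion rate.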

In [6]:
# Separate groups
control_data = df[df["group"] == "control"]["converted"]
treatment_data = df[df["group"] == "treatment"]["converted"]

# Counts
successes = np.array([control_data.sum(), treatment_data.sum()])
observations = np.array([len(control_data), len(treatment_data)])

# Conversion rates
cr_control = successes[0] / observations[0]
cr_treatment = successes[1] / observations[1]

# Pooled proportion
p_pool = successes.sum() / observations.sum()

# Standard error
se = np.sqrt(p_pool * (1 - p_pool) * (1/observations[0] + 1/observations[1]))

# Z-score
z = (cr_treatment - cr_control) / se

# Two-sided p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

print("Control CR:", cr_control)
print("Treatment CR:", cr_treatment)
print("Z-score:", z)
print("P-value:", p_value)


Control CR: 0.0961
Treatment CR: 0.1078
Z-score: 2.734179295256223
P-value: 0.00625359816640314


In [7]:
# Difference in conversion rates
diff = cr_treatment - cr_control

# Standard error of difference
se_diff = np.sqrt(
    (cr_control * (1 - cr_control)) / observations[0] +
    (cr_treatment * (1 - cr_treatment)) / observations[1]
)

# 95% confidence interval
z_critical = stats.norm.ppf(0.975)  # 1.96

ci_lower = diff - z_critical * se_diff
ci_upper = diff + z_critical * se_diff

print("Observed uplift:", diff)
print("95% CI lower bound:", ci_lower)
print("95% CI upper bound:", ci_upper)


Observed uplift: 0.011700000000000002
95% CI lower bound: 0.0033145614527193973
95% CI upper bound: 0.020085438547280607
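
The interval above relies on the normal approximation. As a rough cross-check, a bootstrap percentile interval can be computed from the same counts. This is a sketch under stated assumptions: resampling binary outcomes with replacement is simulated with binomial draws, and the number of resamples (10,000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed for reproducibility

n = 10_000                               # users per group
x_control, x_treatment = 961, 1078       # conversions implied by the reported rates
n_boot = 10_000                          # number of bootstrap resamples (assumption)

# Resampling n binary outcomes with replacement is equivalent to a binomial draw
boot_control = rng.binomial(n, x_control / n, size=n_boot) / n
boot_treatment = rng.binomial(n, x_treatment / n, size=n_boot) / n
boot_diff = boot_treatment - boot_control

boot_lower, boot_upper = np.percentile(boot_diff, [2.5, 97.5])

print("Bootstrap 95% CI lower bound:", boot_lower)
print("Bootstrap 95% CI upper bound:", boot_upper)
```

At this sample size the bootstrap bounds should land close to the normal-approximation bounds; a large gap would suggest the approximation is not appropriate for the data.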


## Experiment Summary

**Decision**  
Should we launch the new “Buy Now” button to all users?

**Primary Metric**  
Conversion Rate (purchases / visitors)

**Minimum Detectable Effect**  
+0.5 percentage points absolute increase in conversion rate

**Experiment Design**
- Type: A/B Test
- Unit of randomization: User
- Traffic split: 50% Control / 50% Treatment
- Sample size: 10,000 users per group

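**Results**
- Control CR: 9.61%
- Treatment CR: 10.78%
- Observed uplift: +1.17 percentage points absolute (about +12.2% relative)
- P-value: 0.0063 (two-sided)
- 95% CI for uplift: [+0.33, +2.01] percentage points

**Recommendation**  
Launch the new button. The difference is statistically significant (p = 0.0063 < 0.05), and the 95% CI lies entirely above zero, so a negative effect is unlikely. The observed uplift exceeds the 0.5 percentage point minimum detectable effect, although the CI lower bound (+0.33 percentage points) falls below it, so the true uplift may be smaller than the target.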


## Notes & Limitations

- This experiment uses simulated data; real-world effects may vary.
- External factors (seasonality, traffic quality) are not modeled.
- Results assume correct randomization and no sample contamination.
- Post-launch monitoring is recommended to validate long-term impact.
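
The randomization assumption above can be partially checked with a sample ratio mismatch (SRM) test, which compares the observed group sizes with the intended 50/50 split. The sketch below uses a chi-square goodness-of-fit test; the helper name `srm_check` and the 0.001 alert threshold are assumptions chosen for illustration.

```python
from scipy import stats


def srm_check(n_control, n_treatment, threshold=0.001):
    """Test observed group sizes against an intended 50/50 split.

    Returns the p-value and a flag that is True when the split looks suspicious.
    """
    # With no expected frequencies given, chisquare assumes equal group sizes
    result = stats.chisquare([n_control, n_treatment])
    return float(result.pvalue), bool(result.pvalue < threshold)


# The simulated experiment assigns exactly 10,000 users to each group
p_srm, srm_flag = srm_check(10_000, 10_000)
print("SRM p-value:", p_srm, "flagged:", srm_flag)

# Hypothetical imbalanced split, for comparison
p_bad, bad_flag = srm_check(10_000, 10_500)
print("SRM p-value:", p_bad, "flagged:", bad_flag)
```

A flagged split does not say what went wrong, only that assignment or logging should be investigated before the conversion results are trusted.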
