End-to-end experimentation framework that analyzes 500K simulated user sessions, applies frequentist and Bayesian statistical methods, and projects a $10.6M annual revenue lift, with a Bayesian posterior probability that rounds to 100% that the new checkout design beats the control.
An e-commerce company redesigned its checkout page to improve conversion — streamlined layout, mobile-optimized, reduced friction. Before rolling out to 100% of traffic, the product team needs to answer three questions:
- Is the improvement real or just random variation?
- How large is the effect — and is it practically meaningful?
- Which user segments benefit the most from the change?
Making the wrong call has a real cost:
- Ship a losing variant → lose revenue at scale
- Hold back a winning variant → leave millions on the table
- Ignore segment differences → miss optimization opportunities
This framework answers those questions with statistical rigor.
A complete A/B testing pipeline that generates realistic experiment data (500K users, industry-standard parameters), runs both frequentist (Z-test, T-test) and Bayesian statistical analysis, validates experiment integrity, performs segmentation analysis across device/country/channel/user type, and projects business impact at scale.
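As a rough illustration of the data-generation step, the sketch below simulates per-user revenue for one experiment arm using only the standard library. The function name `simulate_group`, the exponential order-value distribution, and the reduced sample size are assumptions made for this example; the notebook's actual generator and parameters may differ.

```python
import random


def simulate_group(n_users, conversion_rate, mean_order_value, seed):
    """Simulate per-user revenue for one experiment arm (hypothetical helper)."""
    rng = random.Random(seed)
    revenues = []
    for _ in range(n_users):
        converted = rng.random() < conversion_rate
        # Assumption: order values follow an exponential distribution
        # with the given mean; non-converting users generate no revenue.
        revenue = rng.expovariate(1 / mean_order_value) if converted else 0.0
        revenues.append(revenue)
    return revenues


# Illustrative parameters, smaller than the 250,000 users per group in the README.
control = simulate_group(50_000, 0.030, 85.0, seed=1)
treatment = simulate_group(50_000, 0.0345, 91.0, seed=2)

control_cr = sum(r > 0 for r in control) / len(control)
treatment_cr = sum(r > 0 for r in treatment) / len(treatment)
print(f"control CR={control_cr:.4f}, treatment CR={treatment_cr:.4f}")
```

Fixing the seed per arm keeps the generated dataset reproducible across runs.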
From experiment design to ship/no-ship decision, with p < 0.001, a posterior probability of treatment beating control that rounds to 100%, and $10.6M in projected annual revenue lift.

    ┌─────────────────────┐    ┌──────────────────────┐    ┌─────────────────────┐
    │ Synthetic Data      │───▶│ Python Analysis      │───▶│ MySQL DB            │
    │ 500K users          │    │ Stats · Bayesian     │    │ 4 structured tables │
    │ Industry params     │    │ Segmentation         │    │ ab_results          │
    └─────────────────────┘    └──────────────────────┘    │ ab_summary          │
                                                           │ statistical_results │
                                                           │ business_impact     │
                                                           └──────────┬──────────┘
                                                                      │
                                                       ┌──────────────▼──────────────┐
                                                       │ Power BI Dashboard          │
                                                       │ Experiment Results Monitor  │
                                                       └─────────────────────────────┘

| Step | Action | Technology | Business Value |
|---|---|---|---|
| 1 | Experiment design — sample size, power, duration | statsmodels | Ensure experiment is adequately powered before launch |
| 2 | Data generation — 500K users, realistic parameters | Python · numpy | Industry-standard CR, AOV, device and segment distributions |
| 3 | EDA — group balance, daily trends, segment breakdown | Python · pandas · seaborn | Validate randomization and understand baseline metrics |
| 4 | SRM check — sample ratio mismatch detection | scipy | Detect broken randomization before analysis |
| 5 | Z-test — conversion rate significance | scipy · statsmodels | Frequentist test for primary metric |
| 6 | T-test — revenue per user & AOV significance | scipy | Validate secondary metrics |
| 7 | Effect size — Cohen's h | scipy | Measure practical significance beyond p-value |
| 8 | Confidence intervals — 95% CI for all metrics | statsmodels | Quantify uncertainty around observed lifts |
| 9 | Bayesian analysis — Beta-Binomial model | numpy | P(Treatment > Control) with credible intervals |
| 10 | Segmentation — device, country, channel, user type | Python · pandas | Identify where the effect is strongest |
| 11 | Business impact — monthly/annual revenue projection | Python · pandas | Translate statistics into executive decisions |
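Step 1 (sample size and power) can be sketched with the standard normal-approximation formula for two proportions. The inputs here (3% baseline, 15% relative minimum detectable effect, α = 0.05, 80% power) are assumptions for illustration; the README does not state the inputs behind its own minimum of 23,993 per group, so this sketch need not reproduce that figure. `required_sample_size` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(p_control, relative_mde, alpha=0.05, power=0.80):
    """Users needed per group for a two-sided two-proportion test."""
    p_treatment = p_control * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_treatment - p_control) ** 2
    return ceil(n)


n_per_group = required_sample_size(0.03, 0.15)
n_smaller_effect = required_sample_size(0.03, 0.05)
print(n_per_group, n_smaller_effect)
```

Smaller detectable effects require sharply larger samples, which is why the minimum effect is worth fixing before launch.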
| Metric | Control | Treatment | Lift | Significant |
|---|---|---|---|---|
| Conversion Rate | 3.022% | 3.475% | +15.0% | ✅ p<0.001 |
| Avg Order Value | $85.40 | $91.16 | +6.7% | ✅ p<0.001 |
| Revenue per User | $2.58 | $3.17 | +22.8% | ✅ p<0.001 |
| Business Projection | Value |
|---|---|
| Extra conversions/month | +6,798 |
| Revenue lift/month | +$880,732 |
| Revenue lift/year | +$10,568,782 |
| Conservative estimate/year | +$5,284,391 |
| P(Treatment > Control) | 100.0% (rounded) |
| Recommendation | ✅ SHIP IT |
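The projection above is arithmetic on the revenue-per-user lift. The sketch below shows one way to compute it. The monthly traffic of 1.5M users is an assumption: the README does not state it, but 6,798 extra conversions divided by the 0.453pp conversion lift implies roughly that volume. The 50% haircut matches the ratio between the two annual figures in the table. Because the inputs are the rounded RPU values, the output is close to, but not identical to, the table.

```python
def project_revenue_lift(monthly_users, rpu_control, rpu_treatment, haircut=0.5):
    """Project monthly and annual revenue lift (hypothetical helper)."""
    monthly = monthly_users * (rpu_treatment - rpu_control)
    annual = monthly * 12
    return {
        "monthly": monthly,
        "annual": annual,
        # Conservative case: assume only a fraction of the lift persists.
        "conservative_annual": annual * haircut,
    }


# Assumed traffic of 1.5M users per month; RPU values from the results table.
impact = project_revenue_lift(1_500_000, 2.58, 3.17)
print({name: round(value) for name, value in impact.items()})
```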
Frequentist approach:
- Z-test for conversion rate (primary metric): p=1.60e-19
- T-test for revenue per user: p=2.82e-31
- T-test for average order value: p=8.62e-15
- Cohen's h effect size: about 0.026, well below the conventional 0.2 threshold for a small effect, but large in practical terms at this traffic volume
- 95% confidence interval for the absolute conversion-rate difference: [+0.355pp, +0.551pp], entirely positive
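The conversion-rate test, confidence interval, and Cohen's h can be sketched from summary counts with the standard library alone. The counts used here (7,555 and 8,688 conversions out of 250,000 each) are reconstructed from the rounded rates in the results table, so they are approximations rather than the experiment's exact data. `statsmodels.stats.proportion.proportions_ztest` offers a library version of the z-test.

```python
from math import asin, erfc, sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Two-sided z-test, CI for the difference, and Cohen's h."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error under the null hypothesis of equal rates.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability

    # Unpooled standard error for the confidence interval.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se_unpooled

    cohens_h = 2 * asin(sqrt(p_t)) - 2 * asin(sqrt(p_c))
    return {
        "z": z,
        "p_value": p_value,
        "ci": (diff - margin, diff + margin),
        "cohens_h": cohens_h,
    }


# Approximate counts reconstructed from the reported 3.022% and 3.475% rates.
result = two_proportion_ztest(7_555, 250_000, 8_688, 250_000)
print(f"z={result['z']:.2f}, p={result['p_value']:.2e}")
print(f"95% CI (pp): [{result['ci'][0] * 100:.3f}, {result['ci'][1] * 100:.3f}]")
print(f"Cohen's h={result['cohens_h']:.4f}")
```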
Bayesian approach (Beta-Binomial model):
- Prior: Beta(1,1), uninformative
- P(Treatment > Control): 100.0% (rounded)
- 95% credible interval for the relative lift: [+11.56%, +18.54%]
- Expected loss if shipping Treatment: 0.000000pp
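A minimal Monte Carlo sketch of the Beta-Binomial analysis follows. It uses `random.betavariate` from the standard library so that it runs without dependencies (a vectorized numpy version would draw from `numpy.random.Generator.beta` instead). The conversion counts are the same approximations reconstructed from the rounded rates, and the number of draws and the seed are arbitrary choices for this example.

```python
import random


def beta_binomial_summary(conv_c, n_c, conv_t, n_t, draws=20_000, seed=42):
    """Posterior comparison of two conversion rates under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins, total_loss, lifts = 0, 0.0, []
    for _ in range(draws):
        # Posterior is Beta(1 + conversions, 1 + non-conversions).
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        wins += rate_t > rate_c
        # Loss incurred if Treatment is shipped but Control is actually better.
        total_loss += max(rate_c - rate_t, 0.0)
        lifts.append(rate_t / rate_c - 1)
    lifts.sort()
    return {
        "prob_treatment_better": wins / draws,
        "expected_loss": total_loss / draws,
        "lift_ci": (lifts[int(0.025 * draws)], lifts[int(0.975 * draws)]),
    }


# Approximate counts reconstructed from the reported rates.
summary = beta_binomial_summary(7_555, 250_000, 8_688, 250_000)
print(summary)
```

With a finite number of draws, a reported probability of 100% means that no sampled draw favored Control, not that the probability is exactly 1.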
Experiment validation:
- Sample Ratio Mismatch (SRM): ✅ Passed
- Statistical power achieved: ≈100% (minimum required: 23,993 users per group; actual sample: 250,000 per group)
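The SRM check is a chi-square goodness-of-fit test on group sizes. With two groups there is one degree of freedom, so the p-value can be computed with `math.erfc` and no external library (`scipy.stats.chisquare` would give the same result). The 0.001 alert threshold is a commonly used convention and an assumption here, since the README does not state which threshold it applies; `srm_check` is a hypothetical helper name.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_share=0.5, threshold=0.001):
    """Chi-square test for sample ratio mismatch between two groups."""
    total = n_control + n_treatment
    expected_c = total * expected_share
    expected_t = total - expected_c
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value >= threshold


balanced = srm_check(250_000, 250_000)
# Hypothetical imbalanced split, for contrast only.
imbalanced = srm_check(250_000, 245_000)
print(balanced, imbalanced)
```

A failed check points to a problem in assignment or logging, so the metric results should not be trusted until it is explained.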
| Segment | Lift | Significant |
|---|---|---|
| Mobile (device) | +17.0% | ✅ |
| Social (channel) | +20.5% | ✅ |
| Australia (country) | +21.9% | ✅ |
| New users | +16.8% | ✅ |
| Loyal users | +16.3% | ✅ |
| Direct traffic | +4.8% | ❌ |
| UK (country) | +8.3% | ✅ (lower) |
Key insight: Mobile users show a +17.0% lift, consistent with the goal of the mobile-optimized design. The largest lifts in the table are Australia (+21.9%) and social traffic (+20.5%). Direct traffic shows no statistically significant effect (+4.8%); one plausible explanation is that users who already know the site are less affected by UX changes.
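Segment-level results like these can be produced by running the same two-proportion test within each segment. The sketch below does that for a dictionary of counts. The counts are hypothetical and chosen only to illustrate one significant and one non-significant segment; they are not the experiment's data, and `segment_lifts` is a made-up helper name.

```python
from math import erfc, sqrt


def segment_lifts(segments, alpha=0.05):
    """Relative lift and two-sided z-test p-value for each segment."""
    rows = []
    for name, (conv_c, n_c, conv_t, n_t) in segments.items():
        p_c, p_t = conv_c / n_c, conv_t / n_t
        pooled = (conv_c + conv_t) / (n_c + n_t)
        se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
        p_value = erfc(abs((p_t - p_c) / se) / sqrt(2))
        rows.append({
            "segment": name,
            "lift": p_t / p_c - 1,
            "p_value": p_value,
            "significant": p_value < alpha,
        })
    # Largest lift first, as in the table above.
    return sorted(rows, key=lambda row: row["lift"], reverse=True)


# Hypothetical counts: (control conversions, control users,
#                       treatment conversions, treatment users)
example_segments = {
    "mobile": (4_000, 140_000, 4_680, 140_000),
    "direct": (1_500, 40_000, 1_560, 40_000),
}
results = segment_lifts(example_segments)
for row in results:
    print(row)
```

When many segments are tested, some will cross the 0.05 threshold by chance, so segment findings are best treated as exploratory.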
Experiment Overview & EDA
Conversion rate by group (3.02% vs 3.48%), revenue per user ($2.58 vs $3.17), daily CR trend over 14 days, CR by device and user segment, and revenue distribution for converted users showing AOV shift from $85 to $91.
Bayesian Analysis & Segmentation
Beta-Binomial posterior distributions showing virtually no overlap between control and treatment, P(Treatment > Control) rounding to 100%, segmentation heatmap across all dimensions, and CR lift by device.
Executive Summary & Business Recommendations
Key metrics comparison, relative lift across CR/AOV/RPU, 12-month cumulative revenue projection ($5.3M–$10.6M range), statistical tests summary, top segments by lift, and business recommendation table.
| Layer | Technology | Purpose |
|---|---|---|
| Data generation | Python · numpy | Synthetic experiment data with industry parameters |
| Analysis | Python · pandas | Data manipulation, segmentation, business impact |
| Frequentist stats | scipy · statsmodels | Z-test, T-test, confidence intervals, power analysis |
| Bayesian stats | numpy (Beta-Binomial) | Probabilistic decision framework |
| Visualization | matplotlib · seaborn | Analysis charts and executive reporting |
| Database | MySQL 8.0 · SQLAlchemy | Structured storage with indexed tables |
| ETL | Python · pymysql | Automated data loading pipeline |
| Dashboard | Power BI · DAX | Interactive experiment results monitor |

    ab-testing-framework/
    │
    ├── notebooks/
    │   └── 01_ab_testing.ipynb     # Full analysis: design, stats, Bayesian, segmentation
    ├── scripts/
    │   └── load_to_mysql.py        # ETL: experiment results → MySQL
    ├── dashboard/
    │   └── ab_testing.pbix         # Power BI dashboard
    ├── data/                       # Generated experiment data (not tracked in git)
    ├── img/                        # Analysis and dashboard screenshots
    ├── .env.example                # Environment variables template
    ├── .gitignore
    ├── requirements.txt
    ├── LICENSE
    └── README_ES.md                # Spanish version

Andrés Navarro Data Analyst · Experimentation · Statistical Analysis · Python · SQL
Built to demonstrate production-grade experimentation capabilities — experiment design, frequentist and Bayesian statistical analysis, segmentation, and business impact quantification — skills directly applicable to e-commerce, fintech, SaaS, and any product-led company running data-driven experiments.