
# E‑commerce A/B Testing Playbook — New vs Old Landing Page

This notebook walks through a **complete analysis** of an A/B test from an e‑commerce website.
The company experimented with a **new landing page** and wants to know whether to:

- ship the new page,
- keep the old page,
- or keep testing because results are still too uncertain.

All narration is in **English**, and the notebook is meant to be run end‑to‑end.



## 0) Data and context

We will use the widely known **Udacity e‑commerce A/B test dataset** (`ab_data.csv`), which appears in many
tutorials and Kaggle projects.

Typical structure:

- `user_id` — unique visitor ID (integer).  
- `timestamp` — time of page view.  
- `group` — `"control"` or `"treatment"`.  
- `landing_page` — `"old_page"` or `"new_page"`.  
- `converted` — 1 if the visitor converted (paid), 0 otherwise.

Optionally, there is a `countries.csv` file with:

- `user_id`, `country` ∈ { `"US"`, `"CA"`, `"UK"` }.

> **Goal:** quantify the impact of the new page on **conversion rate**, explore **heterogeneity by country**,
> and provide a **decision recommendation** using both frequentist and Bayesian perspectives.



## 1) Setup

We use `numpy`, `pandas`, `matplotlib` and a few simple helper functions for proportions and tests.
For part of the analysis we also use `statsmodels` for logistic regression.


In [None]:

from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (7, 4.5)
plt.rcParams["axes.grid"] = True

# Optional: used for GLM (logit)
try:
    import statsmodels.api as sm  # type: ignore
except Exception as e:  # pragma: no cover
    sm = None
    print("statsmodels not available; GLM cells will be skipped.", e)



### 1.1 Helper functions (proportions, z-test, CIs, power)


In [None]:

@dataclass(frozen=True)
class PropSummary:
    """Summary of a Bernoulli proportion.

    Attributes
    ----------
    p : float
        Sample proportion x / n.
    n : int
        Sample size.
    x : int
        Number of successes.
    """
    p: float
    n: int
    x: int


def summarize_prop(x: int, n: int) -> PropSummary:
    """Validate and summarize a proportion sample.

    Parameters
    ----------
    x : int
        Number of successes (must be in [0, n]).
    n : int
        Total sample size (must be positive).

    Returns
    -------
    PropSummary
        Dataclass with p, n, x.
    """
    if n <= 0:
        raise ValueError("n must be positive.")
    if not (0 <= x <= n):
        raise ValueError("x must satisfy 0 <= x <= n.")
    return PropSummary(p=x / n, n=n, x=x)


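def invPhi(u: float) -> float:
    """Inverse standard normal CDF (quantile function).

    Uses ``statistics.NormalDist`` from the standard library, because the
    ``math`` module provides ``erf`` and ``erfc`` but no inverse of either.

    Parameters
    ----------
    u : float
        Probability in (0, 1).

    Returns
    -------
    float
        z such that Phi(z) = u.
    """
    from statistics import NormalDist  # local import keeps the helper self-contained

    if not 0.0 < u < 1.0:
        raise ValueError("u must be in (0,1).")
    return NormalDist().inv_cdf(u)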


def two_prop_ztest(
    x1: int,
    n1: int,
    x2: int,
    n2: int,
    two_sided: bool = True,
) -> Tuple[float, float]:
    """Two-sample z-test for proportions with pooled variance.

    Tests H0: p1 = p2 vs H1: p1 != p2 (two-sided by default).

    Parameters
    ----------
    x1, n1, x2, n2 : int
        Success counts and sample sizes for groups 1 and 2.
    two_sided : bool
        If True, compute a two-sided p-value. If False, right-sided (p2 > p1).

    Returns
    -------
    z : float
        z-statistic (signed).
    p_value : float
        Corresponding p-value.
    """
    s1, s2 = summarize_prop(x1, n1), summarize_prop(x2, n2)
    p_pool = (s1.x + s2.x) / (s1.n + s2.n)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / s1.n + 1.0 / s2.n))
    if se == 0.0:
        raise ZeroDivisionError("Standard error is zero; check inputs.")
    z = (s2.p - s1.p) / se
    # standard normal tail via erf
    if two_sided:
        p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    else:
        p = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return float(z), float(p)


def chisq_srm(nA: int, nB: int) -> float:
    """Chi-square SRM (sample ratio mismatch) test for a 50/50 split.

    Parameters
    ----------
    nA, nB : int
        Sample sizes for arms A and B.

    Returns
    -------
    float
        p-value of the chi-square test with 1 degree of freedom.
    """
    n = nA + nB
    if n <= 0:
        raise ValueError("nA + nB must be positive.")
    exp = [n / 2.0, n / 2.0]
    obs = [nA, nB]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    # With 1 degree of freedom, sqrt(chi2) is |z| for a standard normal,
    # so the chi-square tail equals the two-sided normal tail exactly.
    z = math.sqrt(chi2)
    p = math.erfc(z / math.sqrt(2.0))
    return float(p)


def mde_for_n(
    p_baseline: float,
    n_per_arm: int,
    alpha: float = 0.05,
    power: float = 0.8,
    two_sided: bool = True,
) -> float:
    """Compute absolute MDE for a given baseline and sample size per arm.

    Uses normal approximation for two-sample proportion test.

    Parameters
    ----------
    p_baseline : float
        Baseline conversion rate (between 0 and 1).
    n_per_arm : int
        Sample size per arm.
    alpha : float
        Significance level.
    power : float
        Desired power (1 - beta).
    two_sided : bool
        If True, uses two-sided z_{alpha/2}.

    Returns
    -------
    float
        Approximate minimal detectable effect (absolute difference in p).
    """
    if not 0.0 < p_baseline < 1.0:
        raise ValueError("p_baseline must be in (0,1).")
    if n_per_arm <= 0:
        raise ValueError("n_per_arm must be positive.")

    z_alpha = abs(invPhi(1.0 - alpha / 2.0)) if two_sided else abs(invPhi(1.0 - alpha))
    z_beta = abs(invPhi(power))
    se = math.sqrt(2.0 * p_baseline * (1.0 - p_baseline))
    return float((z_alpha + z_beta) * se / math.sqrt(n_per_arm))



## 2) Load and clean the e‑commerce A/B dataset

You need the classic `ab_data.csv` file in the working directory.

> If you do not have it yet, you can download it from:
> - Udacity's Data Analyst Nanodegree project, or
> - Multiple Kaggle notebooks that mirror the same CSV.

We will:

1. Read the CSV.  
2. Remove rows where **group** and **landing_page** do not match (data issue in the original file).  
3. Drop duplicate `user_id` rows.  
4. Confirm the group labels and page labels are as expected.
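
Before filtering, it can help to see how many rows fall into each group/page combination. The sketch below runs the same cleaning steps on a tiny hypothetical frame (the rows are invented for illustration and are not taken from `ab_data.csv`); `toy`, `expected_page` and `clean` are illustrative names.

```python
import pandas as pd

# Tiny hypothetical frame, invented for illustration only.
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 3, 4, 5],
        "group": ["control", "treatment", "control", "control", "treatment", "control"],
        "landing_page": ["old_page", "new_page", "old_page", "old_page", "old_page", "new_page"],
        "converted": [0, 1, 0, 0, 1, 0],
    }
)

# Cross-tabulation: the control/new_page and treatment/old_page cells are mismatches.
print(pd.crosstab(toy["group"], toy["landing_page"]))

# Keep rows whose page matches the page expected for their group.
expected_page = toy["group"].map({"control": "old_page", "treatment": "new_page"})
aligned = toy["landing_page"] == expected_page

# Then keep the first row per user.
clean = toy.loc[aligned].drop_duplicates(subset="user_id", keep="first")
print(len(toy), "rows before cleaning,", len(clean), "rows after")
```

Reporting the number of dropped rows alongside the results makes the cleaning decisions auditable.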


In [None]:

from pathlib import Path

DATA_PATH = Path("ab_data.csv")

if not DATA_PATH.exists():
    raise FileNotFoundError(
        "ab_data.csv not found in the current directory.\n"
        "Place the Udacity e-commerce A/B dataset here and re-run this cell."
    )

df_raw = pd.read_csv(DATA_PATH)
df_raw.head()



### 2.1 Data cleaning and SRM

We keep only rows where:

- `control` users see the `old_page`, and  
- `treatment` users see the `new_page`.

Then we check for duplicate users and run an **SRM test** on the group sizes.


In [None]:

df = df_raw.copy()

# Filter valid combinations: control-old_page, treatment-new_page
mask_valid = (
    ((df["group"] == "control") & (df["landing_page"] == "old_page"))
    | ((df["group"] == "treatment") & (df["landing_page"] == "new_page"))
)
df = df.loc[mask_valid].copy()

# Drop duplicate user_id if any (keep first occurrence)
df = df.drop_duplicates(subset=["user_id"], keep="first").reset_index(drop=True)

# Basic checks
print("Unique groups:", df["group"].unique())
print("Unique landing_page:", df["landing_page"].unique())

n_control = (df["group"] == "control").sum()
n_treat = (df["group"] == "treatment").sum()
p_srm = chisq_srm(n_control, n_treat)

n_control, n_treat, p_srm



**Reading SRM.**

- If the SRM p-value is very small (e.g., < 0.01), the 50/50 split may be compromised (bug in randomization or logging).  
- For this dataset, the split is typically close to 50/50 and SRM is not a concern.
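
To get a feel for how the SRM p-value reacts to imbalance, here is a minimal self-contained sketch with hypothetical arm sizes (the counts are made up and do not come from the dataset). It relies on the fact that, for one degree of freedom, the chi-square tail can be written with `math.erfc`; `srm_p_value` is an illustrative name.

```python
import math


def srm_p_value(n_a: int, n_b: int) -> float:
    """Chi-square(1) p-value for an intended 50/50 split (illustrative helper)."""
    total = n_a + n_b
    expected = total / 2.0
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    # For 1 degree of freedom: P(chi2_1 > c) = erfc(sqrt(c / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical arm sizes, chosen only to illustrate the two regimes.
p_balanced = srm_p_value(5000, 5000)  # perfectly balanced split
p_skewed = srm_p_value(5000, 5300)    # 300 extra users in one arm

print(f"balanced: p = {p_balanced:.4f}")
print(f"skewed:   p = {p_skewed:.4f}")
```

With around ten thousand users, a gap of a few hundred between arms is already enough to push the p-value below the 0.01 threshold mentioned above.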



## 3) Primary metric: conversion rate (frequentist view)

Our primary metric is the **conversion rate**:

\[
\text{CR} = \mathbb{P}(\text{converted} = 1).
\]

We compare `control` (old page) vs `treatment` (new page) using:

- A **two-proportion z-test**.  
- A **normal-approximation CI** around the difference in conversion rates.


In [None]:

# Aggregate conversions by group
conv_summary = (
    df.groupby("group")["converted"]
    .agg(["sum", "count", "mean"])
    .rename(columns={"sum": "x", "count": "n", "mean": "rate"})
)
conv_summary


In [None]:

# Extract A (control) and B (treatment)
xA = int(conv_summary.loc["control", "x"])
nA = int(conv_summary.loc["control", "n"])
xB = int(conv_summary.loc["treatment", "x"])
nB = int(conv_summary.loc["treatment", "n"])

sA = summarize_prop(xA, nA)
sB = summarize_prop(xB, nB)

z, p = two_prop_ztest(xA, nA, xB, nB, two_sided=True)

# Normal-approximation CI for the difference (B - A)
diff = sB.p - sA.p
alpha = 0.05
z_alpha = abs(invPhi(1.0 - alpha / 2.0))
se_diff = math.sqrt(
    (sA.p * (1.0 - sA.p)) / sA.n + (sB.p * (1.0 - sB.p)) / sB.n
)
ci_lo = diff - z_alpha * se_diff
ci_hi = diff + z_alpha * se_diff

pd.DataFrame(
    {
        "arm": ["control", "treatment"],
        "n": [sA.n, sB.n],
        "x": [sA.x, sB.x],
        "rate": [sA.p, sB.p],
        "diff_B_minus_A": [diff, diff],
        "diff_CI95_lo": [ci_lo, ci_lo],
        "diff_CI95_hi": [ci_hi, ci_hi],
        "z_stat": [z, z],
        "p_value": [p, p],
    }
)



**Interpretation.**

- The difference column (`diff_B_minus_A`) shows the **absolute lift** in conversion rate of the new page vs the old page.  
- The 95% CI tells us which lifts are compatible with the data under the normal approximation.  
- The p-value comes from the two-sided z-test of H0: *no difference* vs H1: *some difference*; it is the probability, under H0, of a z-statistic at least as extreme as the one observed.
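
As a worked illustration of reading these columns, the sketch below recomputes the difference, its unpooled normal-approximation CI, and the relative lift for hypothetical counts (the numbers are invented, not taken from the dataset). `diff_with_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def diff_with_ci(x_a, n_a, x_b, n_b, alpha=0.05):
    """Return (diff, ci_lo, ci_hi) for p_b - p_a using an unpooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 12.0% vs 11.8% conversion with 10,000 users per arm.
diff, lo, hi = diff_with_ci(1200, 10_000, 1180, 10_000)
relative_lift = diff / (1200 / 10_000)

print(f"absolute lift: {diff:+.4f}  (95% CI {lo:+.4f} to {hi:+.4f})")
print(f"relative lift: {relative_lift:+.2%}")
print("CI contains 0:", lo <= 0.0 <= hi)
```

A CI that straddles zero, as in this made-up case, means the data are compatible with both a small loss and a small gain.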

Next we cross‑check this with a **logistic regression** that can include **country** as a covariate if we have that file.



## 4) Logistic regression with country (if available)

If `countries.csv` is present, we will:

1. Merge it on `user_id`.  
2. Fit a logistic regression of `converted ~ treatment + country`.  
3. Interpret the coefficient on the `treatment` indicator and the country dummies.


In [None]:

countries_path = Path("countries.csv")
if countries_path.exists():
    countries = pd.read_csv(countries_path)
    df_merged = df.merge(countries, on="user_id", how="left")
    print(df_merged["country"].value_counts(dropna=False))
else:
    df_merged = df.copy()
    df_merged["country"] = "UNKNOWN"
    print("countries.csv not found; using a single dummy country 'UNKNOWN'.")

df_merged.head()


In [None]:

if sm is None:
    print("statsmodels not available; skipping logistic regression.")
else:
    # Build design matrix
    df_glm = df_merged.copy()
    df_glm["treatment"] = (df_glm["group"] == "treatment").astype(int)
    # One-hot encode country, drop first to avoid collinearity
    X = pd.get_dummies(df_glm[["treatment", "country"]], drop_first=True).astype(float)
    X = sm.add_constant(X)
    y = df_glm["converted"].astype(int)

    logit_model = sm.Logit(y, X).fit(disp=False)
    # A bare expression inside an if/else block is not displayed, so print it.
    print(logit_model.summary2().tables[1])



**Reading the logistic regression.**

- The coefficient on `treatment` (in log-odds) indicates the **direction** and strength of the new page effect.  
- If you have multiple countries, the country dummies capture **baseline differences** across markets.  
- You can also add interaction terms (e.g., `treatment × country`) to explore **heterogeneous effects**.
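
Because log-odds are hard to read directly, a common step is to exponentiate the treatment coefficient into an odds ratio and translate it into a conversion rate at a chosen baseline. The sketch below assumes a hypothetical coefficient and baseline rate; both values are invented for illustration and are not estimates from the dataset.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(eta: float) -> float:
    """Probability corresponding to a log-odds value."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical values, not estimates from the dataset.
beta_treatment = -0.015  # assumed coefficient on the treatment dummy
baseline_rate = 0.12     # assumed control conversion rate

odds_ratio = math.exp(beta_treatment)
treated_rate = inv_logit(logit(baseline_rate) + beta_treatment)
abs_lift = treated_rate - baseline_rate

print(f"odds ratio:    {odds_ratio:.4f}")
print(f"treated rate:  {treated_rate:.4f}")
print(f"absolute lift: {abs_lift:+.5f}")
```

An odds ratio below 1 corresponds to a negative coefficient, and the implied absolute lift depends on the baseline you plug in.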

Next we complement the frequentist view with a **Bayesian Beta–Binomial** analysis of conversion.



## 5) Bayesian A/B — Beta–Binomial model

For conversion data, a natural Bayesian model is:

- Likelihood: \(X_A \sim \text{Binomial}(n_A, p_A)\), \(X_B \sim \text{Binomial}(n_B, p_B)\).  
- Prior: \(p_A, p_B \sim \text{Beta}(1,1)\) (uniform).

Conjugacy gives posteriors:

\[
\begin{aligned}
p_A \mid \text{data} &\sim \text{Beta}(1 + x_A,\ 1 + n_A - x_A),\\
p_B \mid \text{data} &\sim \text{Beta}(1 + x_B,\ 1 + n_B - x_B).
\end{aligned}
\]

We can sample from these posteriors to estimate quantities like:

- \(\mathbb{P}(p_B > p_A \mid \text{data})\)  
- Distribution of the **lift** \(p_B - p_A\).  
- Risk metrics (probability that the lift is below a threshold, etc.).


In [None]:

def beta_posterior_params(x: int, n: int, a_prior: float = 1.0, b_prior: float = 1.0) -> Tuple[float, float]:
    """Posterior Beta parameters for a Binomial count with Beta(a_prior, b_prior) prior."""
    if n < 0 or x < 0:
        raise ValueError("x and n must be non-negative.")
    if x > n:
        raise ValueError("x must be <= n.")
    return a_prior + x, b_prior + (n - x)


def sample_posterior_lift(
    xA: int,
    nA: int,
    xB: int,
    nB: int,
    n_draws: int = 100000,
    seed: int | None = 123,
) -> pd.DataFrame:
    """Sample from the Beta posteriors for pA and pB and compute lifts.

    Parameters
    ----------
    xA, nA, xB, nB : int
        Success counts and sample sizes for A and B.
    n_draws : int
        Number of Monte Carlo draws.
    seed : int | None
        Random seed for reproducibility.

    Returns
    -------
    DataFrame
        Columns: pA, pB, lift (pB - pA).
    """
    rng = np.random.default_rng(seed)
    aA, bA = beta_posterior_params(xA, nA)
    aB, bB = beta_posterior_params(xB, nB)

    pA_draws = rng.beta(aA, bA, size=n_draws)
    pB_draws = rng.beta(aB, bB, size=n_draws)
    lift = pB_draws - pA_draws

    return pd.DataFrame({"pA": pA_draws, "pB": pB_draws, "lift": lift})


post_samples = sample_posterior_lift(xA, nA, xB, nB, n_draws=50000, seed=42)
post_samples.describe(percentiles=[0.025, 0.5, 0.975])


In [None]:

# Posterior probability that new page is better
prob_B_better = float((post_samples["lift"] > 0).mean())

# 95% credible interval for lift
ci_lo_bayes, ci_hi_bayes = np.quantile(post_samples["lift"], [0.025, 0.975])

# Plot posterior lift distribution
plt.figure()
plt.hist(post_samples["lift"], bins=60, density=True)
plt.axvline(0.0, linestyle="--")
plt.title("Posterior distribution of lift (p_B - p_A)")
plt.xlabel("lift")
plt.ylabel("density")
plt.tight_layout()
plt.show()

{
    "posterior_prob_new_better": prob_B_better,
    "lift_cred_int_95": (ci_lo_bayes, ci_hi_bayes),
}



**Bayesian reading.**

- `posterior_prob_new_better` is \(\mathbb{P}(p_B > p_A \mid \text{data})\).  
- The **credible interval** on the lift can be directly interpreted as:
  “given data and prior, there is 95% probability the true lift lies in this interval”.  
- This view is often easier to communicate to non‑statisticians than p-values.
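
With samples this large, each Beta posterior is close to normal, so a Monte Carlo estimate of \(\mathbb{P}(p_B > p_A \mid \text{data})\) can be sanity-checked in closed form. The sketch below is an approximation under that normality assumption and uses hypothetical counts; `prob_b_better_normal` is an illustrative name.

```python
import math
from statistics import NormalDist


def beta_mean_var(a: float, b: float):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


def prob_b_better_normal(x_a, n_a, x_b, n_b, a0=1.0, b0=1.0):
    """Normal approximation to P(p_B > p_A | data) under Beta(a0, b0) priors."""
    m_a, v_a = beta_mean_var(a0 + x_a, b0 + n_a - x_a)
    m_b, v_b = beta_mean_var(a0 + x_b, b0 + n_b - x_b)
    # The lift p_B - p_A is approximately Normal(m_b - m_a, v_a + v_b).
    return 1.0 - NormalDist(m_b - m_a, math.sqrt(v_a + v_b)).cdf(0.0)


# Hypothetical counts for illustration only.
p_worse = prob_b_better_normal(1200, 10_000, 1180, 10_000)  # B converts slightly less
p_equal = prob_b_better_normal(1200, 10_000, 1200, 10_000)  # identical data in both arms

print(f"B slightly worse: {p_worse:.3f}")
print(f"identical arms:   {p_equal:.3f}")
```

If the sampled estimate and this approximation disagree noticeably, that usually points to too few draws or to very small counts where the normal approximation is poor.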

Next we connect the statistical results to **business impact**, using a simple revenue model.



## 6) Business framing: revenue and decision rule

Assume:

- Average **revenue per conversion**: `rev_per_conv`.  
- Number of users exposed per day: `users_per_day`.  
- Horizon of interest: `H` days.  

If we ship the new page to all eligible traffic, the expected **incremental revenue** over horizon H is roughly:

\[
\Delta R \approx H \cdot \text{users\_per\_day} \cdot \text{rev\_per\_conv} \cdot \mathbb{E}[p_B - p_A].
\]

We estimate \(\mathbb{E}[p_B - p_A]\) under the **Bayesian posterior** as the mean of the sampled lifts.
We also look at **downside risk** (e.g., probability the lift is negative).


In [None]:

# Simple business parameters (edit as needed)
rev_per_conv = 50.0      # revenue per conversion in your currency
users_per_day = 20000    # daily traffic eligible for the test
H = 30                   # horizon in days

mean_lift = float(post_samples["lift"].mean())
prob_lift_negative = float((post_samples["lift"] < 0).mean())

# Expected incremental revenue over horizon H
delta_R = H * users_per_day * mean_lift * rev_per_conv

{
    "mean_lift": mean_lift,
    "prob_lift_negative": prob_lift_negative,
    "expected_delta_revenue_H": delta_R,
}



**Decision sketch.**

You could define a decision rule such as:

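- **Ship new page** if
  - posterior_prob_new_better > 0.95, and  
  - expected_delta_revenue_H clears a pre-agreed minimum (for example, the cost of rolling out and maintaining the new page), and  
  - downside risk (probability of negative lift) is below some tolerance (say < 0.1). With the 0.95 threshold above this holds automatically, so the check only binds if you relax the first criterion or measure risk against a stricter harm threshold.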

Otherwise you **hold** (or keep testing) until uncertainty shrinks or the effect becomes clearer.
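
The rule above can be written down as a small function so that thresholds are explicit and agreed before looking at results. This is only a sketch: `decide`, its parameter names, and the default thresholds are assumptions for illustration, and the example inputs are invented.

```python
def decide(
    prob_new_better: float,
    expected_delta_revenue: float,
    prob_lift_negative: float,
    min_prob: float = 0.95,
    min_revenue: float = 0.0,
    max_downside: float = 0.10,
):
    """Return ("ship" or "hold", per-criterion checks) for the sketched rule."""
    checks = {
        "prob_new_better": prob_new_better > min_prob,
        "expected_revenue": expected_delta_revenue > min_revenue,
        "downside_risk": prob_lift_negative < max_downside,
    }
    decision = "ship" if all(checks.values()) else "hold"
    return decision, checks


# Hypothetical inputs, invented for illustration.
print(decide(0.97, 15_000.0, 0.03))
print(decide(0.33, -60_000.0, 0.67))
```

Returning the individual checks alongside the decision makes it easy to report which criterion blocked a launch.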



## 7) Power and MDE recap

Finally, we sanity‑check whether the test is able to detect **business‑relevant** effects.

We use the baseline conversion rate of the control group and the realized sample size per arm
to compute the **MDE** at 80% power and 5% two‑sided alpha.


In [None]:

p_baseline = sA.p  # control conversion rate
n_per_arm = min(sA.n, sB.n)
mde_80 = mde_for_n(p_baseline, n_per_arm, alpha=0.05, power=0.8, two_sided=True)

{
    "baseline_rate_control": p_baseline,
    "n_per_arm": n_per_arm,
    "MDE_abs_at_80pct_power": mde_80,
}



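The MDE is the smallest true lift the test would detect with 80% probability. If the lift you care about is smaller than the MDE, the experiment is **underpowered** for it: a frequentist CI and posterior lift that sit well **within ±MDE** rule out large effects but cannot distinguish a small real lift from zero. In that case you might: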

- run the test longer,  
- aggregate more traffic, or  
- shift to more sensitive proxies (e.g. lead form submissions) if the outcome is truly rare.
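
To judge how much longer a test would need to run, the same normal approximation behind the MDE can be inverted to give the sample size per arm for a target absolute lift. The sketch below assumes equal arm sizes and a two-sided test; `n_per_arm_for_mde` is a hypothetical helper and the inputs are invented for illustration.

```python
import math
from statistics import NormalDist


def n_per_arm_for_mde(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect an absolute lift of mde_abs."""
    if not 0.0 < p_baseline < 1.0:
        raise ValueError("p_baseline must be in (0, 1).")
    if mde_abs <= 0.0:
        raise ValueError("mde_abs must be positive.")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = 2.0 * p_baseline * (1.0 - p_baseline)
    return math.ceil(variance * (z_alpha + z_beta) ** 2 / mde_abs ** 2)


# Hypothetical inputs: 12% baseline, target lifts of 1.0 and 0.5 percentage points.
n_1pt = n_per_arm_for_mde(0.12, 0.010)
n_half = n_per_arm_for_mde(0.12, 0.005)

print("users per arm for a 1.0 pt lift:", n_1pt)
print("users per arm for a 0.5 pt lift:", n_half)
```

Halving the target lift roughly quadruples the required sample, which is why small effects on a low baseline rate can take a long time to confirm.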



## 8) Executive summary template

Use this structure when writing up the final decision:

1. **Sanity checks**
   - SRM p-value, data cleaning decisions (invalid rows, duplicates).

2. **Main effect**
   - Conversion rates for control and treatment.  
   - Frequentist difference with 95% CI and p-value.  
   - Bayesian posterior probability that new page is better and 95% credible interval.

3. **Heterogeneity**
   - Any notable differences across countries or segments (if `countries.csv` was available).

4. **Business impact**
   - Approximate expected incremental revenue over H days.  
   - Discussion of downside risk (probability of harm).

5. **Decision**
   - Ship / hold / roll back and why.  
   - If ship: ramp plan (e.g., 25% → 50% → 100%) and monitoring.  
   - If hold: what additional data or changes are needed.

This keeps the analysis **decision‑oriented** rather than purely statistical.
