In [1]:
%%html
<style>
  table {
    margin-left: 0 !important;
  }
</style>

# A/B Testing
A/B testing is a controlled experiment that compares two versions (A and B) to determine which performs better on a chosen metric, such as conversion rate, click-through rate, or revenue.

- Key Points of A/B Testing:
    - Random assignment: Users or subjects are randomly split into two groups to test variant A versus variant B.
    - Hypotheses:
        -  Null Hypothesis ($H_0$): There is no difference between A and B.
        -  Alternative Hypothesis ($H_1$): There is a difference (e.g., B performs better than A).
    - Data collection: Collect data on user interaction with each variant.
    - Statistical analysis: Use hypothesis testing (t-tests, z-tests, or nonparametric tests) to determine if observed differences are statistically significant.
    - Decision making: If the null hypothesis is rejected with sufficient evidence, conclude that one variant is superior; otherwise, conclude only that the data do not show a significant difference (this does not prove the variants are equal).
- Practical Steps:
    - Formulate hypotheses.
    - Randomly assign users to groups.
    - Run the experiment, collecting relevant metrics.
    - Perform statistical tests to check for significant differences.
    - Implement the winning variant based on results.

A/B testing is widely used in product development, marketing, and UX research to optimize user engagement and business outcomes through data-driven decision-making. It is a real-world application of statistical hypothesis testing to practical experiments involving two variants.

#### 1. Understand the Experiment Design
- Clarify the goal: Increase conversion rate (the share of users who pay for the product).
- Identify groups: “Control” (old page) vs. “Treatment” (new page).
- Know your metrics: Focus on conversion rate, but also review related metrics like average order value or bounce rate for context.

#### 2. Load and Explore the Data
- Import all relevant data files (typically user-level logs or summaries of visits, conversions, variant assignments, etc.).
- Use .head(), .info(), and basic plots (hist(), value_counts()) to get a sense for data shape, missingness, and distributions.

#### 3. Check Data Integrity
- Ensure randomization: Each user should be assigned to only one group.
- Check for duplicates and missing values.
- Validate that test groups are balanced in size.
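
The integrity checks above can be sketched with pandas. This is a minimal illustration on made-up rows; the column names (`user_id`, `group`, `landing_page`) mirror the dataset loaded later in this notebook, and the mapping of `control` to `old_page` and `treatment` to `new_page` is an assumption about how assignment was intended to work.

```python
import pandas as pd

# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    "user_id":      [1, 2, 3, 3, 4, 5],
    "group":        ["control", "treatment", "control", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "old_page", "old_page"],
})

# 1. Missing values per column
missing = toy.isna().sum()

# 2. Users that appear more than once (possibly exposed to both variants)
dup_users = toy.loc[toy["user_id"].duplicated(keep=False), "user_id"].unique()

# 3. Rows where the assigned group and the page shown disagree (assumed mapping)
expected_page = toy["group"].map({"control": "old_page", "treatment": "new_page"})
mismatched = toy[toy["landing_page"] != expected_page]

# 4. Group sizes, to check balance
group_sizes = toy["group"].value_counts()

print("Missing values:", missing.to_dict())
print("Duplicated users:", list(dup_users))
print("Mismatched rows (user_id):", mismatched["user_id"].tolist())
print("Group sizes:", group_sizes.to_dict())
```

How to handle flagged rows (drop them, keep the first exposure, and so on) is a judgment call that should be decided before looking at the results.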

#### 4. Define Success Criteria
- Decide on the statistical significance threshold ($\alpha = 0.05$ is standard).
- Set the minimum detectable effect (MDE) your business cares about.
- Clarify sample size requirements for adequate statistical power.
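
To make the link between $\alpha$, power, and MDE concrete, here is a rough sketch of the standard normal-approximation formula for the sample size of a two-sided two-proportion test. The function name `sample_size_per_group` and the example inputs (12% baseline, 2-point absolute lift) are illustrative assumptions, not values from this notebook's data.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion z-test.

    p_base: baseline conversion rate; mde: absolute lift we want to detect.
    """
    p_new = p_base + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)            # quantile for desired power
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance_sum / mde ** 2
    return math.ceil(n)

n_required = sample_size_per_group(p_base=0.12, mde=0.02)
n_smaller_mde = sample_size_per_group(p_base=0.12, mde=0.01)
print(f"Users per group for a 2-point lift: {n_required}")
print(f"Users per group for a 1-point lift: {n_smaller_mde}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should reflect the smallest lift the business would actually act on.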

#### 5. Perform Exploratory Data Analysis (EDA)
- Visualize conversion rates in control vs. treatment groups.
- Plot histograms or boxplots for order value and other key metrics.
- Summarize basic statistics: mean, median, counts, proportions.

#### 6. Statistical Testing
- Calculate the observed difference in conversion rates.
- Use appropriate hypothesis tests (e.g., z-test for proportions, t-test if comparing means) to assess significance.
- Consider permutation tests if assumptions for parametric tests are questionable.
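
As a sketch of the z-test for proportions mentioned above, the pooled two-proportion statistic can be computed by hand with the standard library. The counts below are invented for illustration, and `two_proportion_ztest` is a hypothetical helper name rather than a library function.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120/1000 conversions in control, 150/1000 in treatment
z, p_value = two_proportion_ztest(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With these made-up counts the p-value lands close to the 0.05 threshold, which is the kind of borderline case where the robustness checks in step 8 are most useful.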

#### 7. Interpret Results
- Compare the p-value to the $\alpha$ threshold.
- If significant: consider business and practical impact.
- If not significant: review sample size and power, and consider whether to extend testing.

#### 8. Additional Testing and Validation Steps
- Common Additional Testing and Validation Steps:
    - Permutation (Randomization) Tests: Directly estimate the null distribution of your test statistic by permuting group labels. Confirms robustness if assumptions of traditional parametric tests (like normality) are questionable.
    - Bootstrap Confidence Intervals: Resample your observed data to estimate more robust confidence intervals for differences in conversion rate or other metrics.
    - Subgroup Analysis: Check if effects are consistent across different customer segments (e.g., geography, device type, user tenure) to rule out confounding or strange heterogeneity.
    - Test for Balance: Re-validate that the treatment and control groups are similar in covariates prior to treatment; imbalance could invalidate causal inference.
    - Holdout Validation or Split-Test Replication: Run a smaller version of the experiment (or leave out a random subset as a “holdout” group) to check if effects replicate.
    - Power Analysis Post-Hoc: Calculate the observed power of your test to help interpret non-significant results: is the test underpowered, or is there truly no effect?

Note: If your main A/B z-test yields a p-value just above 0.05, running a permutation test and bootstrap interval can confirm whether this result is robust or might vary with sampling noise. Subgroup analysis may reveal that the treatment only helps a specific user group, which would affect your rollout decision.
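
The permutation test and bootstrap interval described above might look like the following sketch. The conversion data are simulated (10% vs. 13% true rates are assumptions chosen for illustration), and 2,000 resamples is an arbitrary, fairly small number picked to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary outcomes (1 = converted); rates are made up for illustration
control = rng.binomial(1, 0.10, size=2000)
treatment = rng.binomial(1, 0.13, size=2000)
observed = treatment.mean() - control.mean()

# Permutation test: shuffle group labels to build the null distribution
pooled = np.concatenate([control, treatment])
n_perm = 2000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[len(control):].mean() - shuffled[:len(control)].mean()
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed))

# Bootstrap: resample each group with replacement, 95% percentile interval
n_boot = 2000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    boot_diffs[i] = (rng.choice(treatment, size=len(treatment)).mean()
                     - rng.choice(control, size=len(control)).mean())
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"Observed difference: {observed:.4f}")
print(f"Permutation p-value: {p_perm:.4f}")
print(f"95% bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the permutation p-value and the bootstrap interval tell the same story as the z-test, the conclusion does not hinge on the parametric approximation.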

#### 9. Make and Justify Recommendation
- Based on the statistical and practical analysis, advise whether to:
    - Roll out the new page,
    - Keep the old page,
    - Or continue/adjust the experiment. 

## What is a Hypothesis?
In statistics, a hypothesis is a specific, testable statement about a population parameter (like a mean, proportion, or variance).
In A/B testing, it's used to decide whether an observed difference between two groups (A and B) is real or just due to random chance.

#### The Two Competing Hypotheses
| Type                  | Symbol      | Meaning                                                              |
|-----------------------|-------------|----------------------------------------------------------------------|
| Null Hypothesis       | $H_0$     | Assumes no difference between A and B. Any observed difference is due to random variation. |
| Alternative Hypothesis | $H_1$ or $H_a$ | Assumes there is a real difference (the new variant changed the metric).                |

#### Example: A/B Test on Average Order Value (AOV)
You test whether a new marketing strategy (B) increases AOV compared to the current one (A).
- $H_0: \mu_A = \mu_B$ (no difference in mean AOV)
- $H_1: \mu_B > \mu_A$ (variant B increases mean AOV)

#### One-tailed vs Two-tailed Tests
| Test Type   | When to Use                                   | Example                             |
|-------------|----------------------------------------------|-----------------------------------|
| Two-tailed  | When any difference (increase or decrease) is interesting. | $H_1: \mu_A \ne \mu_B$        |
| One-tailed  | When you care about only one direction (e.g., increase).   | $H_1: \mu_B > \mu_A$           |

**⚠️ One-tailed tests have more power in the chosen direction, but they cannot detect an effect in the opposite direction, so the direction must be fixed before looking at the data.**
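
A small sketch of how the two kinds of p-value relate for the same test statistic. It assumes a z statistic computed as (B minus A) divided by the standard error, and `p_values_from_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (two-sided p, one-sided p for H1: B > A) for a z statistic."""
    std_normal = NormalDist()
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))
    one_sided_greater = 1 - std_normal.cdf(z)
    return two_sided, one_sided_greater

for z in (1.8, -1.8):
    two_sided, one_sided = p_values_from_z(z)
    print(f"z = {z:+.1f}: two-sided p = {two_sided:.3f}, one-sided p = {one_sided:.3f}")
```

When the effect is in the hypothesized direction, the one-sided p-value is half the two-sided one; when it is in the opposite direction, the one-sided p-value is close to 1 and the test says nothing about a possible harm.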

#### Decision Framework
| Step | Concept                       | Description                                                       |
| :--- | :---------------------------- | :---------------------------------------------------------------- |
| 1    | **State hypotheses**          | Define ($H_0$) and ($H_1$).                                       |
| 2    | **Choose significance level** | Typically ($\alpha$ = 0.05).                                      |
| 3    | **Compute test statistic**    | e.g., t, z, or U, depending on the test.                          |
| 4    | **Compute p-value**           | Probability of seeing data at least as extreme as yours if ($H_0$) were true. |
| 5    | **Decision rule**             | If ($p \le \alpha$), **reject ($H_0$)** → significant difference. |

#### Example (t-test)
```
from scipy import stats
import numpy as np

A = np.random.normal(50, 10, 100)
B = np.random.normal(52, 10, 100)

tstat, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"t = {tstat:.3f}, p = {pval:.4f}")
```

#### Output example:
t = -2.041, p = 0.043

**Interpretation:** <br>
Since p = 0.043 < 0.05, reject $H_0$ <br>
The difference between groups is statistically significant. <br>

#### Type I and Type II Errors
| Error Type        | Symbol | What It Means                                               | Example                                   |
| :---------------- | :----- | :---------------------------------------------------------- | :---------------------------------------- |
| **Type I Error**  | $\alpha$      | Rejecting ($H_0$) when it's true (false positive).          | You think B works better, but it doesn't. |
| **Type II Error** | $\beta$      | Failing to reject ($H_0$) when it's false (false negative). | You miss a real improvement.              |

**Power = $1 - \beta$ → probability of correctly detecting a real effect.**
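
One way to build intuition for $\alpha$ and power is to simulate many experiments and count how often the test rejects $H_0$. The sketch below uses a pooled two-proportion z-test; the conversion rates, group size, and number of simulations are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Z_CRIT = 1.959964  # two-sided critical value for alpha = 0.05

def rejection_rate(p_a, p_b, n, n_sims=5000):
    """Share of simulated experiments in which the z-test rejects H0."""
    conv_a = rng.binomial(n, p_a, size=n_sims)   # conversions in group A
    conv_b = rng.binomial(n, p_b, size=n_sims)   # conversions in group B
    rate_a, rate_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (rate_b - rate_a) / se
    return np.mean(np.abs(z) > Z_CRIT)

type_i_rate = rejection_rate(0.10, 0.10, n=2000)  # H0 true: false positives
power = rejection_rate(0.10, 0.13, n=2000)        # H0 false: true detections

print(f"Simulated Type I error rate: {type_i_rate:.3f}")
print(f"Simulated power:             {power:.3f}")
```

The first number should sit near the chosen $\alpha$, and one minus the second number is the simulated Type II error rate $\beta$ for this particular effect size and sample size.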

#### Hypothesis Testing in A/B Context
| **Metric**                  | **Null Hypothesis ($H_0$)** | **Alternative ($H_1$)** | **Test Used**         |
| :-------------------------- | :-------------------------- | :---------------------- | :-------------------- |
| Conversion Rate (CR)        | ($p_A = p_B$)               | ($p_A \ne p_B$)         | Two-proportion z-test |
| Average Order Value (AOV)   | ($\mu_A = \mu_B$)           | ($\mu_A \ne \mu_B$)     | t-test                |
| Orders per User (Frequency) | ($F_A = F_B$)               | ($F_A \ne F_B$)         | Mann–Whitney U test   |
| Revenue per Visitor (RPV)   | ($\mu_A = \mu_B$)           | ($\mu_A \ne \mu_B$)     | t-test                |

#### Example Summary (A/B test on AOV)
| Step                              | Result                                       |
| :-------------------------------- | :------------------------------------------- |
| ($H_0$): No difference in AOV     |                                              |
| ($H_1$): Strategy B increases AOV |                                              |
| $\alpha = 0.05$                   | Significance threshold                       |
| $p = 0.082$                       | Computed p-value                             |
| Decision: Fail to reject ($H_0$)  | No significant difference detected           |
| Interpretation                    | The data do not show that strategy B significantly improves AOV. |

#### What is AOV (Average Order Value)?
AOV stands for Average Order Value, a key e-commerce and marketing performance metric that measures the average amount of money customers spend per order.

**Formula:**
$$AOV = \frac {\text{Total Revenue}}{\text{Number of Orders}}$$

#### Why AOV Matters
| **Business Impact**        | **Explanation**                                                                      |
| :------------------------- | :----------------------------------------------------------------------------------- |
| **Revenue Growth Lever**   | Increasing AOV can grow total revenue without increasing customer acquisition costs. |
| **Pricing & Upselling**    | Measures how well your upsells, bundles, and promotions perform.                     |
| **Customer Value Insight** | Helps segment high-value vs. low-value customers.                                    |
| **Campaign ROI**           | Determines if a new marketing strategy increases purchase size.                      |

#### AOV in A/B Testing
In A/B testing, AOV is often used as a primary or secondary metric to assess the financial impact of design or pricing changes.
|                                              | **Null Hypothesis ($H_0$)** | **Alternative Hypothesis ($H_1$)** |
| :------------------------------------------- | :-------------------------- | :--------------------------------- |
| **Goal:** Test if new strategy increases AOV | ($\mu_A = \mu_B$)           | ($\mu_B > \mu_A$)                  |

*Where:*
- $\mu_A$: Mean AOV for Control group (A)
- $\mu_B$: Mean AOV for Variant group (B)

#### Statistical Test Used
| **Condition**                                      | **Test**                      | **Reason**                                           |
| :------------------------------------------------- | :---------------------------- | :--------------------------------------------------- |
| Large sample (n > 30/group), AOV roughly symmetric | **Two-sample t-test**         | Tests mean difference assuming approximate normality |
| Skewed AOV (common in e-commerce) or small samples | **Mann–Whitney U test**       | Non-parametric, does not assume normality            |
| Very large samples or bootstrapping available      | **Bootstrap mean difference** | Empirical confidence interval with few distributional assumptions |
```
# Sample Code
import numpy as np
from scipy import stats

# Example AOVs for A and B
A = np.random.normal(50, 15, 1000)
B = np.random.normal(52, 15, 1000)

# Welch's t-test (no equal variance assumption)
tstat, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"Mean AOV_A = {A.mean():.2f}, Mean AOV_B = {B.mean():.2f}")
print(f"t = {tstat:.3f}, p = {pval:.4f}")
```
**Output:**
Mean AOV_A = 50.12, Mean AOV_B = 52.17 <br>
t = -2.331, p = 0.020 <br>

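**✅ Interpretation:**
- $p < 0.05$ → reject $H_0$
- Mean AOV is significantly higher under strategy B. (The code runs a two-sided test; the negative t only reflects that the statistic is computed as A minus B.)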

#### Common Pitfalls
| **Pitfall**                                       | **Why It's a Problem**                          | **Better Approach**                                   |
| :------------------------------------------------ | :---------------------------------------------- | :---------------------------------------------------- |
| AOV is highly skewed (a few huge orders dominate) | Violates t-test assumptions                     | Use **log-transformed AOV** or **Mann–Whitney test**  |
| Ignoring conversion rate                          | High AOV but low conversions can reduce revenue | Analyze **Revenue per Visitor (RPV = CR × AOV)**      |
| Comparing cumulative AOV too early                | Early results fluctuate heavily                 | Use **fixed sample window** or **sequential testing** |
| Confusing AOV per user vs per order               | Users may place multiple orders                 | Clarify unit of analysis (order-level vs. user-level) |

#### Related Metrics
| **Metric**                      | **Formula**                               | **Interpretation**                       |
| :------------------------------ | :---------------------------------------- | :--------------------------------------- |
| **Conversion Rate (CR)**        | $\frac{\text{Orders}}{\text{Visitors}}$ | % of visitors who buy                    |
| **Revenue per Visitor (RPV)**   | $\text{CR} \times \text{AOV}$           | Average revenue contribution per visitor |
| **Orders per User (Frequency)** | $\frac{\text{Orders}}{\text{Users}}$    | Average number of orders per user        |
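
A quick arithmetic check of how these metrics fit together, using invented totals. With CR defined as orders per visitor, RPV computed directly as revenue per visitor equals CR times AOV.

```python
# Hypothetical totals for one variant (made-up numbers)
visitors = 10_000
orders = 400
total_revenue = 22_000.0

conversion_rate = orders / visitors        # CR: orders per visitor
aov = total_revenue / orders               # AOV: revenue per order
rpv = total_revenue / visitors             # RPV: revenue per visitor

print(f"CR  = {conversion_rate:.2%}")
print(f"AOV = {aov:.2f}")
print(f"RPV = {rpv:.2f}  (CR x AOV = {conversion_rate * aov:.2f})")
```

Because RPV folds both conversion and order size into one number, a variant can raise AOV and still lower RPV if it converts fewer visitors.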

#### ✅ Summary
| **Aspect**           | **Details**                                              |
| :------------------- | :------------------------------------------------------- |
| **Definition**       | Mean order value across all transactions                 |
| **Goal in A/B test** | Measure if variant increases transaction size            |
| **Typical test**     | Welch's t-test or Mann–Whitney U test                    |
| **Business insight** | AOV increase = users buying more expensive or more items |
| **Watch for**        | Skewness, low sample size, ignoring CR impact            |


## What is the Normality Assumption?
It's the assumption that the data (or more precisely, the sampling distribution of the test statistic) follows a normal (Gaussian) distribution.

**In formulas:**
- $X \sim N(\mu, \sigma^2)$
    - This matters because many classical statistical tests, like the t-test, are derived under this assumption.

#### A/B Testing Context
In an A/B test, you compare two versions (A and B) of something (e.g., webpage, email) to see which performs better on a metric (e.g., conversion rate, time spent, click-through rate).

Commonly used tests:
- Two-sample t-test ‚Äì for comparing means (e.g., average time on site)
- Z-test for proportions ‚Äì for comparing rates (e.g., conversion rates)
- Nonparametric tests ‚Äì if normality is violated

#### When the Normality Assumption Matters
| Case                                      | Is Normality Needed? | Why                                                                                                                                                         |
| ----------------------------------------- | -------------------- | ---------------------------------------------------------------------------------------------------------------|
| **Small sample sizes (n < 30)**           | ✅ Yes                | The t-test assumes data are approximately normal.                                                              |
| **Large sample sizes (n ≥ 30 per group)** | ❌ Not strictly       | Thanks to the **Central Limit Theorem (CLT)**, the sampling distribution of the mean (or proportion) becomes approximately normal, even if the data aren't. |
| **Proportion data (binary outcomes)**     | ❌ Not directly       | The Z-test for proportions uses the CLT; normality of the *underlying data* isn't required.      |
| **Highly skewed data or outliers**        | ⚠️ It depends        | The CLT converges more slowly for heavy-tailed data; consider a data transformation (e.g., log) or a nonparametric alternative.  |

#### In Practice
- For large-scale A/B tests (typical in web experiments), sample sizes are usually huge, so the normality assumption is rarely a concern.
- For small experiments, check normality:
    - Visual check (histogram, Q–Q plot)
    - Statistical tests (Shapiro–Wilk, Anderson–Darling)
- If the assumption fails, use:
    - Mann–Whitney U test (compares the two distributions by ranks rather than by means)
    - Bootstrap methods (nonparametric confidence intervals)

#### ✅ Summary
| Question                                 | Answer                                                                                       |
| ---------------------------------------- | -------------------------------------------------------------------------------------------- |
| Do you need normal data for A/B testing? | Usually, **no**, if you have large samples.                                                   |
| Why not?                                 | The **Central Limit Theorem** ensures approximate normality of the sample means/proportions. |
| When should you care?                    | When your sample is small or the data are extremely skewed.                                  |
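
The Central Limit Theorem claim can be checked with a short simulation. The sketch below draws right-skewed "revenue-like" data from an exponential distribution (an assumption made purely for illustration) and measures how the skewness of the sample mean shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(x):
    """Sample skewness: third standardized moment (0 for symmetric data)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Raw data: exponential distribution, strongly right-skewed
raw = rng.exponential(scale=50, size=100_000)
raw_skew = skewness(raw)
print(f"Skewness of raw data: {raw_skew:.2f}")

# Skewness of the mean across 5,000 simulated samples of each size n
skew_by_n = {}
for n in (5, 30, 500):
    sample_means = rng.exponential(scale=50, size=(5000, n)).mean(axis=1)
    skew_by_n[n] = skewness(sample_means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.2f}")
```

The raw data stay skewed no matter how many rows are collected; it is the distribution of the mean that approaches symmetry, and for heavy-tailed metrics it may need far more than 30 observations to get there.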

#### Normality assumption check
- Checking the normality assumption is an important step before applying tests like the t-test in A/B testing (especially with small samples).

**Step 1. Clarify What You're Testing for Normality**
- You're testing whether your sample data (e.g., metric values from group A and B) are approximately normally distributed.
- In A/B testing:
    - If your metric is continuous (e.g., time on site, revenue per user) → check normality directly.
    - If your metric is binary (e.g., converted / not converted) → no need to check; the approximate normality of the test statistic (the sample proportion) comes from the Central Limit Theorem.
      
**Step 2. Visual Checks**
- Histogram
    - Plot a histogram of your metric for each group:
        - Should look roughly bell-shaped (symmetrical, unimodal).
- Q–Q (Quantile–Quantile) Plot
    - Plots your data's quantiles vs. those of a theoretical normal distribution:
        - If points fall roughly along the straight reference line, the data are approximately normal.

```
# Python example:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Suppose `group_a` and `group_b` are arrays or Series of your metric
sns.histplot(group_a, kde=True)
plt.title("Histogram of Group A")
plt.show()

stats.probplot(group_a, dist="norm", plot=plt)
plt.title("Q-Q Plot for Group A")
plt.show()
```

**Step 3. Statistical Normality Tests**
Formal tests for normality (use with caution: with large samples they flag even trivial deviations from normality):

- Shapiro–Wilk Test
    - Most common for small to medium samples (n < 5000).

```
# Sample Python
from scipy.stats import shapiro

stat, p = shapiro(group_a)
print(f"Statistic={stat:.3f}, p={p:.3f}")
if p > 0.05:
    print("Sample looks normal (fail to reject H0).")
else:
    print("Sample does not look normal (reject H0).")

```

- Anderson–Darling Test (the Kolmogorov–Smirnov test is another option; only Anderson–Darling is shown here)
```
from scipy.stats import anderson

result = anderson(group_a)
print('Statistic: %.3f' % result.statistic)
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print(f"At {sl}% level: data looks normal.")
    else:
        print(f"At {sl}% level: data not normal.")

```
**Step 4. Interpret in Context**
- If the data look roughly normal, you can use a t-test.
- If not normal, but the sample is large (n ≥ 30), the Central Limit Theorem justifies approximate normality → still okay.
- If not normal and small sample, use:
    - Mann-Whitney U test (nonparametric)
    - Or bootstrap confidence intervals

**✅ Summary Table**
| Method                | Type        | When to Use | Interpretation                       |
| --------------------- | ----------- | ----------- | ------------------------------------ |
| Histogram             | Visual      | Always      | Rough shape                          |
| Q–Q Plot              | Visual      | Always      | Deviation from straight line         |
| Shapiro–Wilk          | Statistical | n < 5000    | p > 0.05 → no evidence against normality |
| Anderson–Darling      | Statistical | Any size    | Compare statistic to critical values |
| Large sample (n ≥ 30) | N/A         | N/A         | Normality assumption not critical    |


#### Normality Assumption:
- The Normality Assumption is one of the core statistical considerations in A/B testing, especially when you're using parametric tests like the t-test or z-test.
- In an A/B test, you compare the means of two groups (e.g., control vs. variant). If you're using a t-test (e.g., scipy.stats.ttest_ind) or z-test, those tests assume that the sampling distribution of the mean is approximately normal.
    - The test assumes that the averages (means) you observe across samples follow a normal (bell-shaped) distribution, not necessarily that your raw data are perfectly normal.
- Parametric tests like the t-test rely on the Central Limit Theorem (CLT), which says:
    - When sample sizes are large enough, the distribution of the sample mean tends to be normal, regardless of the shape of the raw data.
    - If your sample size is small, non-normality (e.g., skewed data) can bias your test results.
    - If your sample size is large, normality of raw data doesn't matter much because the CLT protects you.
- Case 1: Small Sample, Non-Normal Data
    - Suppose you have fewer than 30 users per group, and the metric is highly skewed (like revenue per user: many zeros, a few large spenders).
    - The t-test may not be reliable, because the mean of so few skewed observations is not yet approximately normal. In that case, you'd prefer a non-parametric test (e.g., Mann–Whitney U test).
- Case 2: Large Sample (Typical A/B Test)
    - If you have 1,000+ users per group:
        - Even if your revenue or AOV (Average Order Value) data are skewed, the distribution of the sample mean becomes approximately normal.
            - ✅ So you can safely use a t-test or z-test.

- $H_0$: The data are normally distributed.
- $H_1$: The data are not normally distributed.

If the p-value of the normality test is less than 0.05, reject $H_0$ and use a nonparametric test (Mann–Whitney U test); otherwise, use a parametric test (t-test).

| Condition              | Data Shape         | Sample Size          | Recommended Test      | Why                                   |
|------------------------|--------------------|----------------------|----------------------|--------------------------------------|
| Roughly normal data    | Symmetric          | Any                  | t-test               | Meets normality assumption directly  |
| Skewed data, small n   | Skewed, heavy tails| < 30–50 per group    | Mann–Whitney U test  | Doesn't assume normality             |
| Skewed data, large n   | Skewed             | ≥ 100–200 per group  | t-test (CLT applies) | Sampling distribution ≈ normal        |
| Proportions (e.g., conversion) | Binary outcome    | Large n              | z-test               | Proportion sampling distribution ≈ normal |

```
# Sample Code
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Example data
data_A = [50, 55, 52, 70, 120, 30, 45, 50, 48, 52]

# Histogram and Q-Q plot on separate axes (drawing both on one axis overlaps them)
fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(data_A, kde=True, ax=ax_hist)
stats.probplot(data_A, dist="norm", plot=ax_qq)
plt.show()

# Shapiro-Wilk normality test
stat, p = stats.shapiro(data_A)
print(f"Shapiro-Wilk p-value: {p:.4f}")
if p > 0.05:
    print("✅ Data looks normal (fail to reject H0).")
else:
    print("⚠️ Data is likely non-normal (reject H0).")
```
#### 🧮 Key Takeaways
| ✅ Do's                                         | ⚠️ Don'ts                                                         |
|------------------------------------------------|------------------------------------------------------------------|
| Use t-test if n ≥ 30 per group (CLT is your friend). | Don't assume raw data must be normal; it's the means that need to be. |
| For small or skewed samples, use Mann–Whitney U or bootstrap tests. | Don't apply t-tests blindly to highly skewed or bounded data (like conversion rates). |
| Always visualize distributions (histograms, Q-Q plots). | Don't rely only on normality tests for large samples, since they often flag trivial deviations. |


## Variance Homogeneity
Variance homogeneity (also called homoscedasticity) is one of the key statistical assumptions behind t-tests in A/B testing.
- Variance homogeneity (the equal-variance assumption) means:
    - The spread (variance or standard deviation) of your metric (e.g., AOV, conversion rate, time on site) is roughly the same across all groups being compared.

*In A/B testing terms:*
- $Var(A) \approx Var(B)$

#### Why It Matters
When you run Student's two-sample t-test, the test formula assumes that both groups have:
- Independent samples
- Normally distributed means (due to the Central Limit Theorem)
- Equal variances (homogeneity)

If the equal-variance assumption is violated:
- The standard error of the difference in means is misestimated.
- Your p-values and confidence intervals may become inaccurate.

**The Two Versions of t-test**
| **t-test variant**   | **Assumes equal variances?** | **When to use**                            |
| :------------------- | :--------------------------- | :----------------------------------------- |
| **Student's t-test** | ✅ Yes                        | When variances in both groups are similar  |
| **Welch's t-test**   | ❌ No                         | When variances differ (heteroscedasticity) |

**✅ Welch's t-test (equal_var=False in Python) is usually the safer default, since it doesn't require equal variances.**

```
# Example in Python
import numpy as np
from scipy import stats

# Simulate two groups
np.random.seed(42)
A = np.random.normal(50, 10, 500)   # mean=50, sd=10
B = np.random.normal(52, 20, 500)   # mean=52, sd=20 (different variance)

# Check variances
print(f"Var(A): {np.var(A, ddof=1):.2f}, Var(B): {np.var(B, ddof=1):.2f}")

# Levene's test for equal variances
stat, p = stats.levene(A, B)
print(f"Levene's test: W = {stat:.3f}, p = {p:.4f}")

# Choose an appropriate t-test
if p < 0.05:
    print("⚠️ Variances differ: use Welch's t-test (equal_var=False).")
else:
    print("✅ Variances are similar: standard t-test is fine.")

# Welch's t-test (default robust option)
t, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"Welch's t = {t:.3f}, p = {pval:.4f}")
```
**Output Example** <br>
Var(A): 95.92, Var(B): 381.21 <br>
Levene's test: W = 209.307, p = 0.0000 <br>
⚠️ Variances differ: use Welch's t-test (equal_var=False). <br>
Welch's t = -2.134, p = 0.0331 <br>

**✅ Interpretation:**
- The variances are significantly different (p < 0.05 from Levene's test).
- Therefore, use Welch's t-test, which adjusts degrees of freedom and handles unequal variances correctly.
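
To show what "adjusts degrees of freedom" means, here is a hand-rolled sketch of Welch's statistic and the Welch–Satterthwaite degrees of freedom. The helper name `welch_t` and the two small samples are made up for illustration; in practice `stats.ttest_ind(..., equal_var=False)` performs this calculation.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Variance of each sample mean: s^2 / n, using sample variance (ddof = 1)
    vm_a = variance(a) / len(a)
    vm_b = variance(b) / len(b)
    t_value = (mean(a) - mean(b)) / math.sqrt(vm_a + vm_b)
    dof = (vm_a + vm_b) ** 2 / (
        vm_a ** 2 / (len(a) - 1) + vm_b ** 2 / (len(b) - 1)
    )
    return t_value, dof

# Two tiny made-up samples: similar means, very different spread
sample_a = [48.0, 52.0, 50.0, 49.0, 51.0]
sample_b = [40.0, 65.0, 55.0, 45.0, 60.0, 50.0]

t_value, dof = welch_t(sample_a, sample_b)
print(f"Welch t = {t_value:.4f}, degrees of freedom = {dof:.2f}")
```

The degrees of freedom come out well below the pooled value of $n_A + n_B - 2 = 9$, because the high-variance group dominates the uncertainty; this is the correction that Student's t-test omits.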

#### How to Check Variance Homogeneity
| **Test / Method**                      | **Purpose**                                    | **Interpretation**              |
| :------------------------------------- | :--------------------------------------------- | :------------------------------ |
| **Levene's Test** (`stats.levene`)     | Most common; robust to non-normality           | $p > 0.05$ → no evidence of unequal variances |
| **Bartlett's Test** (`stats.bartlett`) | Sensitive to non-normality; use if data are normal | $p > 0.05$ → no evidence of unequal variances |
| **Visual Inspection**                  | Boxplots or spread plots                       | Compare spread of data visually |

```
# Example visualization (reuses arrays A and B from the simulation above):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"value": np.concatenate([A, B]),
                   "group": ["A"]*len(A) + ["B"]*len(B)})

sns.boxplot(data=df, x="group", y="value")
plt.title("Visual check for variance homogeneity")
plt.show()
```
#### Summary Table
| **Concept**      | **Symbol / Term**                  | **Interpretation**                          |
| :--------------- | :--------------------------------- | :------------------------------------------ |
| Equal variance   | ($\sigma^2_A = \sigma^2_B$)        | Assumed in Student's t-test                 |
| Unequal variance | ($\sigma^2_A \ne \sigma^2_B$)      | Violates homogeneity                        |
| Safe test        | Welch's t-test (`equal_var=False`) | Robust to variance differences              |
| Check test       | Levene's or Bartlett's test        | $p > 0.05$ → OK; $p < 0.05$ → use Welch     |

#### Practical Tips in A/B Testing
| **Scenario**                                         | **Recommendation**                                                      |
| :--------------------------------------------------- | :---------------------------------------------------------------------- |
| AOV or revenue metrics (often skewed, high variance) | Use **Welch's t-test** by default                                       |
| Conversion rate tests                                | Use **proportion z-test** (variance formula known analytically)         |
| Small samples with unequal variance                  | Consider **non-parametric test** (Mann–Whitney U)                       |
| Very large samples                                   | Variance differences have a minor impact, but still report the test type used |

**✅ In short:**
- Variance homogeneity = equal spread of values between groups.
- If violated → use Welch's t-test.
- Always test or visualize before deciding.

#### Conceptual Visualization: Variance Homogeneity in t-Tests
- We'd plot two bell curves (the sampling distributions of means for Group A and Group B):
1. Equal variance (homoscedastic case):
   - Both curves have similar spread (width).
   - The t-test assumes this scenario when computing pooled variance.
   - The pooled standard error is estimated correctly, so p-values are accurate.
2. Unequal variance (heteroscedastic case):
   - One curve is much wider (higher variance).
   - The assumption of equal spread breaks, so the standard error is misestimated.
   - Student's t-test can produce misleading p-values.
   - Welch's t-test corrects this by using separate variances and adjusted degrees of freedom.

| Scenario              | Visualization Idea                         | Interpretation                           |
| :-------------------- | :----------------------------------------- | :--------------------------------------- |
| **Equal variances**   | Two smooth bell curves with similar widths | ✅ t-test assumption holds                |
| **Unequal variances** | One narrow, one wide curve                 | ⚠️ Student’s t-test invalid; use Welch’s |
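
To make the "misleading p-values" point concrete, here is a small simulation sketch. It assumes a deliberately unfavorable setup (a small group with a large spread against a large group with a small spread) and identical true means, so every rejection is a false positive. The sample sizes, standard deviations, and number of simulations are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 2000

# Both groups share the same true mean, so H0 is true in every simulated experiment.
# Assumed setup: the smaller group has the larger spread.
small_wide = rng.normal(loc=0.0, scale=5.0, size=(n_sims, 20))
large_narrow = rng.normal(loc=0.0, scale=1.0, size=(n_sims, 100))

# axis=1 runs one test per simulated experiment (one row each)
_, p_student = stats.ttest_ind(small_wide, large_narrow, axis=1, equal_var=True)
_, p_welch = stats.ttest_ind(small_wide, large_narrow, axis=1, equal_var=False)

# Share of experiments that (wrongly) reject H0 at the 5% level
student_fpr = np.mean(p_student < 0.05)
welch_fpr = np.mean(p_welch < 0.05)

print(f"False positive rate, Student's t-test: {student_fpr:.3f}")
print(f"False positive rate, Welch's t-test:   {welch_fpr:.3f}")
```

Under these assumptions, Student's test should reject far more often than the nominal 5%, while Welch's test should stay close to it.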


## Start A/B Testing
In this A/B test, we are comparing the conversion rates between two groups: the control group (old web page) and the experimental group (new web page). The goal is to determine whether the new web page has a statistically significant effect on the conversion rate, or if any observed difference is due to random chance.

Null Hypothesis ($H_0$): The null hypothesis assumes there is no difference in conversion rates between the two groups. In other words, any observed differences are attributed to random variation rather than the design of the new web page.

Formally, the two-sided hypotheses are:
- $H_0: P_{control} = P_{experimental}$
- $H_1: P_{control} \ne P_{experimental}$

### EDA

In [2]:
import pandas as pd

# read data
df = pd.read_csv("/Users/sir/Downloads/Chrome/ecommerce_ab_testing_2022_dataset1/ab_data.csv")

In [3]:
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [5]:

df.nunique()

user_id         290585
timestamp        35993
group                2
landing_page         2
converted            2
dtype: int64

In [44]:
import numpy as np

from statsmodels.stats.power import zt_ind_solve_power

# --- 1. Define Statistical Parameters ---
ALPHA = 0.05       # Significance Level (Type I Error)
POWER = 0.80       # Statistical Power (1 - Type II Error)

# --- 2. Define Business Parameters ---
BASELINE_CR_A = 0.150      # Baseline Conversion Rate (e.g., 15.0%)
MIN_DETECTABLE_LIFT = 0.10 # Minimum Detectable Relative Lift (e.g., 10%)

# Calculate the minimum detectable conversion rate (p_B)
# CR_B_MIN = 0.150 * (1 + 0.10) = 0.165
# The treatment group must reach 16.5% conversion to be considered a meaningful lift.
CR_B_MIN = BASELINE_CR_A * (1 + MIN_DETECTABLE_LIFT)

# Calculate the raw effect (absolute difference between the two proportions)
# 0.165 - 0.150 = 0.015
# That is a 1.5 percentage point absolute difference.
effect_size = CR_B_MIN - BASELINE_CR_A

# Solve for the sample size per group needed to detect that lift with 80% power at 5% significance.
# Caution: zt_ind_solve_power expects a standardized effect size (difference / standard deviation).
# Passing the raw 0.015 overstates the required n; the next cell standardizes it first.
n_per_group = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=ALPHA,
    power=POWER,
    ratio=1.0,
    alternative='two-sided'
)
print("Required sample size per group:", round(n_per_group))

Required sample size per group: 69768


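**Interpretation:**
- You defined the relative lift you care about (10%).
- You converted it into an absolute difference (1.5 percentage points).
- The raw difference was passed directly as `effect_size`, so 69768 per group is an overestimate: `zt_ind_solve_power` expects the difference divided by the standard deviation.
- The next cell standardizes the difference for the z-test and solves again for the sample size per group needed to detect that lift with 80% power at 5% significance.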

In [45]:
p_pool = BASELINE_CR_A  # or use pooled average
std_effect_size = effect_size / np.sqrt(p_pool * (1 - p_pool))

print("Standardized effect size:", std_effect_size)

n_per_group = zt_ind_solve_power(
    effect_size=std_effect_size,
    alpha=ALPHA,
    power=POWER,
    ratio=1.0,
    alternative='two-sided'
)
print("Required sample size per group:", round(n_per_group))

Standardized effect size: 0.04200840252084033
Required sample size per group: 8895
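
A common alternative way to standardize a difference in proportions is Cohen's h, which statsmodels exposes as `proportion_effectsize`. The sketch below reuses the same baseline and lift as above; the variable names are illustrative, and the result is expected to land in the same range as the standardized calculation, not to match it exactly, because the two standardizations differ slightly.

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.150              # same baseline conversion rate as above
target_cr = baseline_cr * 1.10   # 10% relative lift -> 0.165

# Cohen's h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))
h = proportion_effectsize(target_cr, baseline_cr)

# Solve for the per-group sample size of a two-sided, two-sample z-test
n_per_group_h = NormalIndPower().solve_power(
    effect_size=h,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided',
)

print(f"Cohen's h: {h:.4f}")
print(f"Required sample size per group: {math.ceil(n_per_group_h)}")
```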


In [None]:



# # --- 3. Calculate Sample Size per Group ---

# # zt_ind_solve_power does not take the two proportions directly. It expects a
# # standardized effect size (for proportions, e.g. Cohen's h or the difference
# # divided by the standard deviation), so the standardized value is passed here.

# required_n_per_group = zt_ind_solve_power(
#         effect_size=std_effect_size,
#         nobs1=None, 
#         alpha=ALPHA,
#         power=POWER, 
#         ratio=1.0, 
#         alternative='two-sided'
# )

# print("--- Sample Size Requirements for Conversion Rate A/B Test ---")
# print(f"Target Baseline Conversion Rate (pA): {BASELINE_CR_A:.2%}")
# print(f"Minimum Detectable Relative Lift: {MIN_DETECTABLE_LIFT:.0%}")
# print(f"Minimum Target Conversion Rate (pB): {CR_B_MIN:.2%}")
# print("-" * 50)
# print(f"Required Sample Size per Group (N): {np.ceil(required_n_per_group):.0f} observations")
# print(f"Total Required Sample Size (N_A + N_B): {2 * np.ceil(required_n_per_group):.0f} observations")
# print("-" * 50)



In [12]:
df.agg('nunique')

user_id         290585
timestamp        35993
group                2
landing_page         2
converted            2
dtype: int64

In [13]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [14]:
df.isna().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [15]:
# user_id has repeats (294480 rows vs. 290585 unique ids);
# keep=False drops every row of a repeated user instead of keeping one copy
print("Original Dataset:", df.shape)
df = df.drop_duplicates(subset="user_id", keep=False)
print("New Dataset:", df.shape)

Original Dataset: (294480, 5)
New Dataset: (286690, 5)
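
The effect of `keep=False` is easy to miss, so here is a small sketch on a hypothetical toy frame (the ids and groups are made up for illustration). With `keep="first"` one row per user survives; with `keep=False` every user that appears more than once is removed entirely.

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice, once in each group
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "group": ["control", "control", "treatment", "treatment"],
})

kept_first = toy.drop_duplicates(subset="user_id", keep="first")
dropped_all = toy.drop_duplicates(subset="user_id", keep=False)

print(kept_first["user_id"].tolist())   # [1, 2, 3]
print(dropped_all["user_id"].tolist())  # [1, 3]
```

Dropping all copies is the more conservative choice when a repeated user may have seen both variants, at the cost of discarding some data.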


### Multiple ways for counts

In [16]:
# determine counts by group & landing page
%timeit df.groupby(['group', 'landing_page']).size()

15.4 ms ± 364 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
# determine counts by group & landing page
%timeit df.groupby(['group', 'landing_page']).agg(count=('landing_page', 'count'))

19.3 ms ± 491 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
# determine counts by group & landing page
%timeit df.groupby(['group', 'landing_page']).agg(count=('landing_page', 'size'))

16.1 ms ± 367 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [19]:
# determine counts by group & landing page using a lambda function
%timeit df.groupby(['group', 'landing_page']).agg({'landing_page': lambda x: x.value_counts()})

24.8 ms ± 42 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [20]:
# pivot-style table: unstack landing_page into columns
%timeit df.groupby(['group', 'landing_page']).size().unstack(fill_value=0)

15.9 ms ± 274 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [21]:
# determine counts by group & landing page
df.groupby(['group', 'landing_page']).size().unstack(fill_value=0)

landing_page,new_page,old_page
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,0,143293
treatment,143397,0


In [22]:
df.groupby(['group', 'landing_page']).size()

group      landing_page
control    old_page        143293
treatment  new_page        143397
dtype: int64

### Multiple ways for mean

In [23]:
# apply the mean aggregation to the converted column
df.groupby(['group', 'landing_page']).agg({'converted': 'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,converted
group,landing_page,Unnamed: 2_level_1
control,old_page,0.120173
treatment,new_page,0.118726


In [24]:
# only aggregating one column
df.groupby(['group', 'landing_page'])['converted'].mean()

group      landing_page
control    old_page        0.120173
treatment  new_page        0.118726
Name: converted, dtype: float64

In [25]:
# pivot this into a table (groups as rows, landing pages as columns)
df.pivot_table(index='group', columns='landing_page', values='converted', aggfunc='mean')

landing_page,new_page,old_page
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,,0.120173
treatment,0.118726,


## Check for Balance
- In an A/B test, you want the traffic split between control (old_page) and treatment (new_page) to be roughly equal.
- A badly skewed split (a sample ratio mismatch) usually points to a problem with randomization or logging, which would make any difference in conversion rates hard to attribute to the page itself.

In [26]:
# distribution of values
df.landing_page.value_counts(normalize = True)

landing_page
new_page    0.500181
old_page    0.499819
Name: proportion, dtype: float64

In [27]:
df.groupby('landing_page').size() / len(df)

landing_page
new_page    0.500181
old_page    0.499819
dtype: float64

In [28]:
# pretty dataframe
df['landing_page'].value_counts(normalize=True).to_frame('proportion')

Unnamed: 0_level_0,proportion
landing_page,Unnamed: 1_level_1
new_page,0.500181
old_page,0.499819


In [29]:
# pretty dataframe
df['landing_page'].value_counts(normalize=True).reset_index(name='proportion')

Unnamed: 0,landing_page,proportion
0,new_page,0.500181
1,old_page,0.499819


### Filter out any mismatched rows in the A/B test

In [30]:
df.query("(group == 'control' and landing_page == 'new_page') or \
          (group == 'treatment' and landing_page == 'old_page')")

Unnamed: 0,user_id,timestamp,group,landing_page,converted


#### Chi-Square Test of Independence
- The Chi-Square Test of Independence suits binary conversion data from an A/B test when you want to evaluate whether the conversion outcome is independent of the group.
- It is a classical method for testing whether the new page changes the conversion rate relative to the old page, by assessing the association between conversion and group assignment (old vs. new page).
- Hypotheses:
    - $H_0$: Conversion rate is independent of group (no difference).
    - $H_a$: Conversion rate depends on group (there is a difference).

In [None]:
from scipy import stats
import statsmodels.api as sm

# conversions (count_*) and visitors (n_*) for each landing page
count_old = df.loc[df['landing_page'] == 'old_page', 'converted'].sum()
n_old     = df.loc[df['landing_page'] == 'old_page', 'converted'].count()

count_new = df.loc[df['landing_page'] == 'new_page', 'converted'].sum()
n_new     = df.loc[df['landing_page'] == 'new_page', 'converted'].count()

In [None]:
# Contingency table: rows = groups, columns = conversion outcomes
table = np.array([[count_new, n_new - count_new],
                  [count_old, n_old - count_old]])

# chi-square test of independence
# (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p, dof, expected = stats.chi2_contingency(table)

print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of freedom: {dof}")
print("\nExpected frequencies:\n", expected)

ci_low, ci_high = sm.stats.confint_proportions_2indep(count_new, n_new, count_old, n_old, method='wald')
print(50*'=')
print(f"95% CI for difference in conversion rates: [{ci_low:.4f}, {ci_high:.4f}]")

#### Proportion Test
Null Hypothesis ($H_0$)
- $H_0:P_{old} = P_{new}$

Alternative Hypothesis ($H_1$)
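- Two-sided (default):
    - $H_1: P_{old} \ne P_{new}$
    - → The conversion rates are different.
- One-sided (larger):
    - $H_1: P_{new} > P_{old}$
    - → The new page has a higher conversion rate.
    - `z_stat, p_val = proportion.proportions_ztest([count_new, count_old], [n_new, n_old], alternative='larger')`
- One-sided (smaller):
    - $H_1: P_{new} < P_{old}$
    - → The new page has a lower conversion rate.
    - `z_stat, p_val = proportion.proportions_ztest([count_new, count_old], [n_new, n_old], alternative='smaller')`

Note: `alternative` compares the first sample with the second, so the new page is listed first in the one-sided calls.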

In [None]:
from statsmodels.stats import proportion

# prep for test (old page first, so z has the sign of p_old - p_new)
count = [count_old, count_new]
nobs  = [n_old, n_new]

# Two-sided test (default)
z_stat, p_val = proportion.proportions_ztest(count, nobs)
print(f"z-stat: {z_stat:.4f}, p-value: {p_val:.6f}")

# 95% CI for the difference in proportions (new - old), using 'wald' method
ci_low, ci_high = proportion.confint_proportions_2indep(count_new, n_new, count_old, n_old, method='wald')
print(f"95% CI (new - old): [{ci_low:.5f}, {ci_high:.5f}]")

**Interpretation:**
- Z-statistic (1.1945): This value quantifies how far the observed difference is from the null hypothesis (no difference), measured in standard errors. It is positive because the old page is listed first, so it reflects $P_{old} - P_{new}$. A z-score near 0 means little difference; large absolute values indicate differences that would be unusual if the null hypothesis were true.
- P-value (0.232288): This is the probability of seeing a difference at least as large as the one observed if there were really no difference between the groups (null hypothesis is true). Because 0.23 is much larger than 0.05:
    - Fail to reject the null hypothesis (i.e., not statistically significant) since the difference seen could easily be explained by random chance.
- 95% Confidence Interval for new minus old ([-0.00382, 0.00093]):
    - This interval contains 0, meaning the true difference might be negative, positive, or zero.
    - There is no statistically significant evidence that the conversion rate for the new page is higher or lower than for the old page.

Summary: With p-value = 0.23 and a CI spanning zero, you do not have evidence to support rolling out the new page, and cannot reject the “no effect” hypothesis.
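
To show where the z-statistic, p-value, and Wald interval come from, here is a hand-rolled sketch of the pooled two-proportion z-test. The helper name `two_proportion_ztest` and the counts passed to it are hypothetical and for illustration only; they are not this notebook's data.

```python
import math
from scipy import stats

def two_proportion_ztest(count1, n1, count2, n2):
    """Two-sided pooled z-test for p1 - p2, plus a 95% Wald CI for the difference."""
    p1, p2 = count1 / n1, count2 / n2
    # Pooled proportion under H0: p1 == p2
    p_pool = (count1 + count2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    p_value = 2 * stats.norm.sf(abs(z))
    # The Wald interval uses the unpooled standard error
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(0.975)
    diff = p1 - p2
    return z, p_value, (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

# Hypothetical counts, for illustration only
z, p_value, ci = two_proportion_ztest(count1=120, n1=1000, count2=100, n2=1000)
print(f"z = {z:.4f}, p-value = {p_value:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Note that the test statistic uses the pooled standard error (valid under the null), while the interval uses the unpooled one, which is why the two can occasionally disagree near the significance boundary.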

The proportion z-test and the chi-square test on a 2×2 table are mathematically equivalent ways of testing the difference between two proportions for a binary outcome. The key points on when to use each are:
- The proportion z-test is typically used when you want a direct test comparing two proportions and can leverage the normal approximation. It explicitly computes a z-statistic and is conceptually more straightforward when dealing with exactly two categories. It is appropriate when sample sizes are large enough for the normal approximation.
- The chi-square test (with 1 degree of freedom) tests the same hypothesis via a contingency table and evaluates the association between two categorical variables. Without a continuity correction, the chi-square statistic is the square of the z-statistic from the two-sided proportion z-test, and both tests yield the same p-value. Note that `scipy.stats.chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so pass `correction=False` to reproduce the z-test result exactly.

- Historically, the z-test was preferred for two-category cases due to a more straightforward lookup of critical values, but this distinction is now mostly academic given modern computing.
- Use the proportion z-test if you want a straightforward difference in proportions test with normal theory. Use the chi-square test if you prefer the contingency table framework or are dealing with more categories.
- Both tests require sufficiently large samples (expected counts usually ≥5) to ensure the chi-square approximation and normal approximation hold.

In summary, for two categories, it doesn't materially matter whether you use a proportion z-test or a chi-square test, as they are effectively the same test with equivalent results. The choice can come down to preference, presentation, or context of analysis.
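
The equivalence can be checked numerically. The sketch below uses hypothetical counts (not this notebook's data) and compares the squared z-statistic with the chi-square statistic computed without a continuity correction; the default, Yates-corrected call is included for contrast.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts, for illustration only
conversions = np.array([120, 100])
visitors = np.array([1000, 1000])

# Two-sided pooled z-test for two proportions
z_stat, z_p = proportions_ztest(conversions, visitors)

# 2x2 contingency table: rows = groups, columns = converted / not converted
table = np.column_stack([conversions, visitors - conversions])
chi2_raw, chi2_p_raw, _, _ = stats.chi2_contingency(table, correction=False)
chi2_yates, chi2_p_yates, _, _ = stats.chi2_contingency(table)  # default: Yates' correction

print(f"z^2 = {z_stat**2:.6f}, chi2 (no correction) = {chi2_raw:.6f}")
print(f"p (z-test) = {z_p:.6f}, p (chi2, no correction) = {chi2_p_raw:.6f}")
print(f"chi2 (Yates) = {chi2_yates:.6f}, p (Yates) = {chi2_p_yates:.6f}")
```

With large samples like the ones in this experiment, the continuity correction changes the result only slightly, but it is worth stating which variant was used when reporting.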