# A/B Testing - Frequentist approach
A/B testing is a controlled experiment that compares two versions (A and B) to determine which performs better on a chosen metric, such as conversion rate, click-through rate, or revenue.

- Key Points of A/B Testing:
    - Random assignment: Users or subjects are randomly split into two groups to test variant A versus variant B.
    - Hypotheses:
        -  Null Hypothesis ($H_0$): There is no difference between A and B.
        -  Alternative Hypothesis ($H_1$): There is a difference (e.g., B performs better than A).
    - Data collection: Collect data on user interaction with each variant.
    - Statistical analysis: Use hypothesis testing (t-tests, z-tests, or nonparametric tests) to determine if observed differences are statistically significant.
    - Decision making: If the null hypothesis is rejected with sufficient evidence, conclude that the variants differ and adopt the better one; otherwise the result is inconclusive, since failing to reject $H_0$ does not prove the variants are equal.
- Practical Steps:
    - Formulate hypotheses.
    - Randomly assign users to groups.
    - Run the experiment, collecting relevant metrics.
    - Perform statistical tests to check for significant differences.
    - Implement the winning variant based on results.

A/B testing is widely used in product development, marketing, and UX research to optimize user engagement and business outcomes through data-driven decision-making. It is a real-world application of statistical hypothesis testing designed for practical experiments involving two variants.

#### Statistical Errors and Power
When you run an A/B test, you are making a decision about the whole population based on a sample. Because of sampling variation, there are two primary mistakes (errors) you can make:
| Error Type   | Definition                                              | Consequence                                                                                   | Statistical Measure       |
|--------------|---------------------------------------------------------|----------------------------------------------------------------------------------------------|---------------------------|
| **Type I Error** | False Positive: Rejecting $(H_0)$ when $(H_0)$ is true | Deploying variant B thinking it is better when it is actually no better than control A. Loss of time/resources. | Significance Level $(\alpha)$ |
| **Type II Error** | False Negative: Failing to reject $(H_0)$ when $(H_1)$ is true | Failing to deploy a winning variant B because results were inconclusive. Loss of potential profit. | Type II error rate $(\beta)$; power is $(1 - \beta)$ |

##### Frequentist Analysis: P-Value vs. Confidence Interval
| Metric           | Purpose                | Output                         | Decision Rule                                |
|------------------|------------------------|--------------------------------|----------------------------------------------|
| P-value          | Significance (Pass/Fail) | Single number between 0 and 1  | If $(p < \alpha)$ (usually 0.05), reject $(H_0)$ and declare a winner. |
| Confidence Interval (CI) | Magnitude (How Much?)    | Range of values for the difference (e.g., [3.0%, 7.0%]) | If the range excludes zero, the result is significant.              |
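
To make the two outputs concrete, here is a minimal sketch that computes both for a difference in conversion rates. The counts are hypothetical, and the choice of a pooled standard error for the p-value and an unpooled (Wald) standard error for the interval is one common convention, not the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, chosen only for illustration
conv_a, n_a = 480, 10_000   # control: conversions, visitors
conv_b, n_b = 560, 10_000   # treatment: conversions, visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# p-value: z-test with a pooled standard error (appropriate under H0: p_A = p_B)
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se_pool
p_value = 2 * stats.norm.sf(abs(z))   # two-sided

# 95% CI for the difference: unpooled (Wald) standard error
se_diff = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)
ci_low, ci_high = diff - z_crit * se_diff, diff + z_crit * se_diff

print(f"diff = {diff:.4f}, z = {z:.3f}, p = {p_value:.4f}")
print(f"95% CI for diff: [{ci_low:.4f}, {ci_high:.4f}]")
```

The p-value answers "is there an effect?", while the interval shows how large the effect plausibly is, which is usually what the business decision depends on.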



#### 1. Understand the Experiment Design
- Clarify the goal: Increase conversion rate (the number of users who pay for the product).
- Identify groups: “Control” (old page) vs. “Treatment” (new page).
- Know your metrics: Focus on conversion rate, but also review related metrics like average order value or bounce rate for context.

#### 2. Load and Explore the Data
- Import all relevant data files (typically user-level logs or summaries of visits, conversions, variant assignments, etc.).
- Use `.head()`, `.info()`, and `value_counts()`, plus basic plots such as `hist()`, to get a sense of data shape, missingness, and distributions.

#### 3. Check Data Integrity
- Ensure randomization: Each user should be assigned to only one group.
- Check for duplicates and missing values.
- Validate that test groups are balanced in size.
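
The checks above can be sketched with pandas as follows. The DataFrame and its column names (`user_id`, `group`, `converted`) are assumptions made for illustration, and the chi-square test against a 50/50 split is one common way to check group balance (sample ratio mismatch).

```python
import pandas as pd
from scipy import stats

# Hypothetical assignment log; the column names are assumptions
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6, 6, 7, 8, 8],
    "group": ["control", "treatment", "control", "treatment", "control",
              "treatment", "treatment", "control", "control", "treatment"],
    "converted": [0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})

# 1. Exact duplicate rows (e.g., double-logged events)
n_dupes = int(df.duplicated().sum())

# 2. Users who appear in more than one group (broken randomization)
groups_per_user = df.groupby("user_id")["group"].nunique()
mixed_users = groups_per_user[groups_per_user > 1].index.tolist()

# 3. Missing values per column
n_missing = df.isna().sum()

# 4. Clean copy: drop duplicates and users seen in both groups
clean = df.drop_duplicates()
clean = clean[~clean["user_id"].isin(mixed_users)]

# 5. Group balance: chi-square test against an intended 50/50 split
counts = clean["group"].value_counts()
chi2, srm_p = stats.chisquare(counts.values)

print(f"duplicates: {n_dupes}, users in both groups: {mixed_users}")
print(f"group sizes: {counts.to_dict()}, balance p-value: {srm_p:.3f}")
```

A very small balance p-value would suggest the split deviates from the intended ratio, which is worth investigating before any outcome analysis.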

#### 4. Define Success Criteria
- Decide on the statistical significance threshold ($\alpha = 0.05$ is standard).
- Set the minimum detectable effect (MDE) your business cares about.
- Clarify sample size requirements for adequate statistical power.
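
As a rough illustration of how $\alpha$, power, and the MDE combine into a sample size, the sketch below uses the standard normal-approximation formula for a two-sided two-proportion test. The helper name `sample_size_per_group` and the baseline rate are hypothetical; a dedicated power library may give slightly different numbers.

```python
import math
from scipy import stats

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided two-proportion z-test.

    p_base: baseline conversion rate; mde: absolute lift to detect.
    """
    p_new = p_base + mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for alpha
    z_beta = stats.norm.ppf(power)            # quantile for the desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Assumed scenario: 10% baseline, detect an absolute lift of 2 points
n = sample_size_per_group(p_base=0.10, mde=0.02)
print(f"Users needed per group: {n}")
```

Because the MDE enters squared in the denominator, halving it roughly quadruples the required sample size.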

#### 5. Perform Exploratory Data Analysis (EDA)
- Visualize conversion rates in control vs. treatment groups.
- Plot histograms or boxplots for order value and other key metrics.
- Summarize basic statistics: mean, median, counts, proportions.

#### 6. Statistical Testing
- Calculate the observed difference in conversion rates.
- Use appropriate hypothesis tests (e.g., z-test for proportions, t-test if comparing means) to assess significance.
- Consider permutation tests if assumptions for parametric tests are questionable.

#### 7. Interpret Results
- Compare the p-value to the $\alpha$ threshold.
- If significant: consider business and practical impact.
- If not significant: review sample size and power, and consider whether to extend testing.

#### 8. Additional Testing and Validation Steps
- Common Additional Testing and Validation Steps:
    - Permutation (Randomization) Tests: Directly estimate the null distribution of your test statistic by permuting group labels. Confirms robustness if assumptions of traditional parametric tests (like normality) are questionable.
    - Bootstrap Confidence Intervals: Resample your observed data to estimate more robust confidence intervals for differences in conversion rate or other metrics.
    - Subgroup Analysis: Check if effects are consistent across different customer segments (e.g., geography, device type, user tenure) to rule out confounding or strange heterogeneity.
    - Test for Balance: Re-validate that the treatment and control groups are similar in covariates prior to treatment; imbalance could invalidate causal inference.
    - Holdout Validation or Split-Test Replication: Run a smaller version of the experiment (or leave out a random subset as a “holdout” group) to check if effects replicate.
    - Power Analysis Post-Hoc: Calculate the observed power of your test to help interpret non-significant results: is the test underpowered, or is there truly no effect? Observed power is determined by the p-value itself, so also report the smallest effect the test could reliably detect at the achieved sample size.

Note: If your main A/B z-test yields a p-value just above 0.05, running a permutation test and bootstrap interval can confirm whether this result is robust or might vary with sampling noise. Subgroup analysis may reveal that the treatment only helps a specific user group, which affects your rollout decision.
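
A minimal sketch of the permutation test and bootstrap interval described above, using NumPy only. The data are simulated with illustrative conversion rates, and the percentile bootstrap is just one of several bootstrap interval methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary outcomes (1 = converted); the rates are illustrative
a = rng.binomial(1, 0.10, size=2000)   # control
b = rng.binomial(1, 0.13, size=2000)   # treatment
observed = b.mean() - a.mean()

# Permutation test: shuffle group labels to build the null distribution
pooled = np.concatenate([a, b])
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
# Two-sided p-value with a +1 correction so it is never exactly zero
p_perm = (np.sum(np.abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)

# Percentile bootstrap: resample each group with replacement
n_boot = 5000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    boot_a = rng.choice(a, size=len(a), replace=True)
    boot_b = rng.choice(b, size=len(b), replace=True)
    boot_diffs[i] = boot_b.mean() - boot_a.mean()
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"observed diff = {observed:.4f}, permutation p = {p_perm:.4f}")
print(f"95% bootstrap CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the permutation p-value and the bootstrap interval agree with the z-test, the conclusion does not hinge on the parametric assumptions.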

#### 9. Make and Justify Recommendation
- Based on the statistical and practical analysis, advise whether to:
    - Roll out the new page,
    - Keep the old page,
    - Or continue/adjust the experiment. 


#### Summary of Your A/B Testing Workflow
| Step                        | Objective                                | Key Tools / Techniques                                                                            |
| :-------------------------- | :--------------------------------------- | :------------------------------------------------------------------------------------------------ |
| 1. Experiment Design        | Define hypothesis, groups, and metrics.  | Clarify goal (conversion ↑), define control/treatment, metric definitions.                        |
| 2. Load & Explore Data      | Understand data structure and quality.    | pandas, `.info()`, `.describe()`, histograms, missing value checks.                              |
| 3. Data Integrity Checks    | Ensure randomization and data quality.    | Check duplicates, assignment consistency, group balance.                                         |
| 4. Success Criteria         | Define decision thresholds.               | α (e.g., 0.05), Minimum Detectable Effect (MDE), power analysis (`statsmodels.stats.power`).     |
| 5. Exploratory Data Analysis (EDA) | Summarize and visualize metrics.   | Conversion rate by group, boxplots, histograms, correlations.                                    |
| 6. Statistical Testing      | Evaluate if observed differences are significant. | z-test (proportions), t-test (means), permutation or bootstrap if needed.                |
| 7. Interpret Results        | Draw meaningful conclusions.              | Compare p-value vs α, assess practical vs statistical significance.                              |
| 8. Validation / Robustness Checks | Strengthen confidence in findings.  | Permutation tests, bootstrapping, subgroup analysis, covariate balance.                          |
| 9. Recommendation           | Translate stats into business action.     | Rollout / hold / rerun decision.                                                                 |


## What is a Hypothesis?
In statistics, a hypothesis is a specific, testable statement about a population parameter (like a mean, proportion, or variance).
In A/B testing, it’s used to decide whether an observed difference between two groups (A and B) is real or just due to random chance.

#### The Two Competing Hypotheses
| Type                  | Symbol      | Meaning                                                              |
|-----------------------|-------------|----------------------------------------------------------------------|
| Null Hypothesis       | $H_0$     | Assumes no difference between A and B. Any observed difference is due to random variation. |
| Alternative Hypothesis | $H_1$ or $H_a$ | Assumes there is a real difference (the new variant changed the metric).                |

#### Example: A/B Test on Average Order Value (AOV)
You test whether a new marketing strategy (B) increases AOV compared to the current one (A).
- $H_0: \mu_A = \mu_B$ (no difference in mean AOV)
- $H_1: \mu_B > \mu_A$ (variant B increases mean AOV)

#### One-tailed vs Two-tailed Tests
| Test Type   | When to Use                                   | Example                             |
|-------------|----------------------------------------------|-----------------------------------|
| Two-tailed  | When any difference (increase or decrease) is interesting. | $H_1: \mu_A \ne \mu_B$        |
| One-tailed  | When you care about only one direction (e.g., increase).   | $H_1: \mu_B > \mu_A$           |

**⚠️ One-tailed tests are more powerful in the hypothesized direction but cannot detect an effect in the opposite direction, so fix the direction before looking at the data.**
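
The difference between the two test types can be seen directly in SciPy through the `alternative` argument of `ttest_ind` (available in SciPy 1.6 and later). The data below are simulated and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(50, 10, 200)   # control, simulated
b = rng.normal(52, 10, 200)   # variant, simulated

# Two-sided: H1 is mu_B != mu_A
t_two, p_two = stats.ttest_ind(b, a, equal_var=False)

# One-sided: H1 is mu_B > mu_A (the first argument is the "greater" sample)
t_one, p_one = stats.ttest_ind(b, a, equal_var=False, alternative="greater")

print(f"two-sided: t = {t_two:.3f}, p = {p_two:.4f}")
print(f"one-sided: t = {t_one:.3f}, p = {p_one:.4f}")
```

When the observed effect points in the hypothesized direction, the one-sided p-value is half the two-sided one; when it points the other way, the one-sided p-value exceeds 0.5 and the test can never reject.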

#### Decision Framework
| Step | Concept                       | Description                                                       |
| :--- | :---------------------------- | :---------------------------------------------------------------- |
| 1    | **State hypotheses**          | Define ($H_0$) and ($H_1$).                                       |
| 2    | **Choose significance level** | Typically ($\alpha$ = 0.05).                                      |
| 3    | **Compute test statistic**    | e.g., t, z, U, depending on the test.                             |
| 4    | **Compute p-value**           | Probability of seeing data at least as extreme as yours if ($H_0$) were true. |
| 5    | **Decision rule**             | If ($p \le \alpha$), **reject ($H_0$)** → significant difference. |

#### Example (t-test)
```
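from scipy import stats
import numpy as np

# Simulated metric values for control (A) and variant (B).
# No seed is set, so the numbers change from run to run.
A = np.random.normal(50, 10, 100)
B = np.random.normal(52, 10, 100)

# Welch's two-sample t-test (no equal-variance assumption), two-sided by default
tstat, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"t = {tstat:.3f}, p = {pval:.4f}")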
```

#### Output example:
t = -2.041, p = 0.043

**Interpretation:** <br>
Since p = 0.043 < 0.05, reject $H_0$ <br>
The difference between groups is statistically significant. <br>

#### Type I and Type II Errors
| Error Type        | Symbol | What It Means                                               | Example                                   |
| :---------------- | :----- | :---------------------------------------------------------- | :---------------------------------------- |
| **Type I Error**  | $\alpha$      | Rejecting ($H_0$) when it’s true (false positive).          | You think B works better, but it doesn’t. |
| **Type II Error** | $\beta$      | Failing to reject ($H_0$) when it’s false (false negative). | You miss a real improvement.              |

**Power = $1 - \beta$ → probability of correctly detecting a real effect.**
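
Both error rates can be estimated by simulation: run many experiments where $H_0$ is true and count false positives, then run many where a real effect exists and count detections. The sketch below assumes normal data with illustrative means and standard deviation; `rejection_rate` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 100, 2000

def rejection_rate(mean_a, mean_b, sd=10.0):
    """Share of simulated experiments in which Welch's t-test rejects H0."""
    a = rng.normal(mean_a, sd, size=(n_sims, n))
    b = rng.normal(mean_b, sd, size=(n_sims, n))
    _, p = stats.ttest_ind(a, b, axis=1, equal_var=False)
    return float(np.mean(p < alpha))

type_i_rate = rejection_rate(50, 50)   # H0 true: should be close to alpha
power = rejection_rate(50, 55)         # H1 true: estimates 1 - beta

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power: {power:.3f}")
```

The first rate should land near $\alpha$, and the second shows how often an effect of that size would be detected with this sample size.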

#### Hypothesis Testing in A/B Context
| **Metric**                  | **Null Hypothesis ($H_0$)** | **Alternative ($H_1$)** | **Test Used**         |
| :-------------------------- | :-------------------------- | :---------------------- | :-------------------- |
| Conversion Rate (CR)        | ($p_A = p_B$)               | ($p_A \ne p_B$)         | Two-proportion z-test |
| Average Order Value (AOV)   | ($\mu_A = \mu_B$)           | ($\mu_A \ne \mu_B$)     | t-test                |
| Orders per User (Frequency) | ($F_A = F_B$)               | ($F_A \ne F_B$)         | Mann–Whitney U test   |
| Revenue per Visitor (RPV)   | ($\mu_A = \mu_B$)           | ($\mu_A \ne \mu_B$)     | t-test                |

#### Example Summary (A/B test on AOV)
| Step                              | Result                                       |
| :-------------------------------- | :------------------------------------------- |
| ($H_0$): No difference in AOV     |                                              |
| ($H_1$): Strategy B increases AOV |                                              |
| $\alpha = 0.05$                   | Significance threshold                       |
| $p = 0.082$                       | Computed p-value                             |
| Decision: Fail to reject ($H_0$)  | No significant difference detected           |
| Interpretation                    | Strategy B didn’t significantly improve AOV. |

#### What is AOV (Average Order Value)?
AOV stands for Average Order Value, a key e-commerce and marketing performance metric that measures the average amount of money customers spend per order.

**Formula:**
$$\text{AOV} = \frac{\text{Total Revenue}}{\text{Number of Orders}}$$

#### Why AOV Matters
| **Business Impact**        | **Explanation**                                                                      |
| :------------------------- | :----------------------------------------------------------------------------------- |
| **Revenue Growth Lever**   | Increasing AOV can grow total revenue without increasing customer acquisition costs. |
| **Pricing & Upselling**    | Measures how well your upsells, bundles, and promotions perform.                     |
| **Customer Value Insight** | Helps segment high-value vs. low-value customers.                                    |
| **Campaign ROI**           | Determines if a new marketing strategy increases purchase size.                      |

#### AOV in A/B Testing
In A/B testing, AOV is often used as a primary or secondary metric to assess the financial impact of design or pricing changes.
|                                              | **Null Hypothesis ($H_0$)** | **Alternative Hypothesis ($H_1$)** |
| :------------------------------------------- | :-------------------------- | :--------------------------------- |
| **Goal:** Test if new strategy increases AOV | ($\mu_A = \mu_B$)           | ($\mu_B > \mu_A$)                  |

*Where:*
- $\mu_A$: Mean AOV for Control group (A)
- $\mu_B$: Mean AOV for Variant group (B)

#### Statistical Test Used
| **Condition**                                      | **Test**                      | **Reason**                                           |
| :------------------------------------------------- | :---------------------------- | :--------------------------------------------------- |
| Large sample (n > 30/group), AOV roughly symmetric | **Two-sample t-test**         | Tests mean difference assuming approximate normality |
| Skewed AOV (common in e-commerce) or small samples | **Mann–Whitney U test**       | Non-parametric, does not assume normality            |
| Very large samples or bootstrapping available      | **Bootstrap mean difference** | Empirical confidence interval with few distributional assumptions |
```
# Sample Code
import numpy as np
from scipy import stats

# Example AOVs for A and B
A = np.random.normal(50, 15, 1000)
B = np.random.normal(52, 15, 1000)

# Welch’s t-test (no equal variance assumption)
tstat, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"Mean AOV_A = {A.mean():.2f}, Mean AOV_B = {B.mean():.2f}")
print(f"t = {tstat:.3f}, p = {pval:.4f}")
```
**Output:**
Mean AOV_A = 50.12, Mean AOV_B = 52.17 <br>
t = -2.331, p = 0.020 <br>

**✅ Interpretation:**
- $p < 0.05$ → reject $H_0$
- Strategy B significantly increases AOV.

#### Common Pitfalls
| **Pitfall**                                       | **Why It’s a Problem**                          | **Better Approach**                                   |
| :------------------------------------------------ | :---------------------------------------------- | :---------------------------------------------------- |
| AOV is highly skewed (a few huge orders dominate) | Violates t-test assumptions                     | Use **log-transformed AOV** or **Mann–Whitney test**  |
| Ignoring conversion rate                          | High AOV but low conversions can reduce revenue | Analyze **Revenue per Visitor (RPV = CR × AOV)**      |
| Comparing cumulative AOV too early                | Early results fluctuate heavily                 | Use **fixed sample window** or **sequential testing** |
| Confusing AOV per user vs per order               | Users may place multiple orders                 | Clarify unit of analysis (order-level vs. user-level) |
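
For the skewness pitfall, the sketch below compares the three options on simulated order values. The lognormal parameters are assumptions chosen to give a long right tail; note that each test answers a slightly different question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated order values: lognormal gives the long right tail typical of AOV
a = rng.lognormal(mean=3.8, sigma=0.8, size=1500)   # control
b = rng.lognormal(mean=3.9, sigma=0.8, size=1500)   # variant

skew_raw = stats.skew(a)
skew_log = stats.skew(np.log(a))

# Welch's t-test on raw values compares arithmetic means
_, p_raw = stats.ttest_ind(a, b, equal_var=False)
# Welch's t-test on log values compares geometric means (a different question)
_, p_log = stats.ttest_ind(np.log(a), np.log(b), equal_var=False)
# Mann-Whitney U is rank-based and tests for a shift between distributions
_, p_mwu = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"skewness raw = {skew_raw:.2f}, after log = {skew_log:.2f}")
print(f"p (raw t) = {p_raw:.4f}, p (log t) = {p_log:.4f}, p (MWU) = {p_mwu:.4f}")
```

The log transform removes most of the skew, but a significant result on the log scale or from Mann–Whitney does not by itself prove that the arithmetic mean order value increased.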

#### Related Metrics
| **Metric**                      | **Formula**                               | **Interpretation**                       |
| :------------------------------ | :---------------------------------------- | :--------------------------------------- |
| **Conversion Rate (CR)**        | $\frac{\text{Orders}}{\text{Visitors}}$   | % of visitors who buy                    |
| **Revenue per Visitor (RPV)**   | $\text{CR} \times \text{AOV}$             | Average revenue contribution per visitor |
| **Orders per User (Frequency)** | $\frac{\text{Orders}}{\text{Users}}$      | Customer repeat rate                     |
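
A short worked example of how these metrics relate, using hypothetical totals for a single variant:

```python
# Hypothetical totals for one variant, for illustration only
visitors = 20_000
users_who_ordered = 750
orders = 900
revenue = 47_250.00

conversion_rate = orders / visitors       # CR = Orders / Visitors
aov = revenue / orders                    # AOV = Revenue / Orders
rpv = revenue / visitors                  # RPV computed directly
frequency = orders / users_who_ordered    # Orders per User

print(f"CR = {conversion_rate:.2%}, AOV = {aov:.2f}")
print(f"RPV = {rpv:.4f}, CR x AOV = {conversion_rate * aov:.4f}")
print(f"Orders per user = {frequency:.2f}")
```

RPV computed directly from revenue and visitors matches CR × AOV, which is why RPV captures both effects at once.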

#### ✅ Summary
| **Aspect**           | **Details**                                              |
| :------------------- | :------------------------------------------------------- |
| **Definition**       | Mean order value across all transactions                 |
| **Goal in A/B test** | Measure if variant increases transaction size            |
| **Typical test**     | Welch’s t-test or Mann–Whitney U test                    |
| **Business insight** | AOV increase = users buying more expensive or more items |
| **Watch for**        | Skewness, low sample size, ignoring CR impact            |


## What is the Normality Assumption?
The normality assumption states that the data, or more precisely the sampling distribution of the test statistic, follows a normal (Gaussian) distribution. It underpins many classical tests, such as the two-sample t-test used in A/B testing.

**In formulas:**
- $X \sim N(\mu, \sigma^2)$
    - This matters because many classical statistical tests, like the t-test, are derived under this assumption.

#### A/B Testing Context
In an A/B test, you compare two versions (A and B) of something (e.g., webpage, email) to see which performs better on a metric (e.g., conversion rate, time spent, click-through rate).

Commonly used tests:
- Two-sample t-test ‚Äì for comparing means (e.g., average time on site)
- Z-test for proportions ‚Äì for comparing rates (e.g., conversion rates)
- Nonparametric tests ‚Äì if normality is violated

#### When the Normality Assumption Matters
| Case                                      | Is Normality Needed? | Why                                                                                                                                                         |
| ----------------------------------------- | -------------------- | ---------------------------------------------------------------------------------------------------------------|
| **Small sample sizes (n < 30)**           | ✅ Yes                | The t-test assumes data are approximately normal.                                                              |
| **Large sample sizes (n ≥ 30 per group)** | ❌ Not strictly       | Thanks to the **Central Limit Theorem (CLT)**, the sampling distribution of the mean (or proportion) becomes approximately normal, even if the data aren’t. |
| **Proportion data (binary outcomes)**     | ❌ Not directly       | The Z-test for proportions uses the CLT; normality of the *underlying data* isn’t required.      |
| **Highly skewed data or outliers**        | ⚠️ Yes, and likely violated | Consider data transformation (e.g., log) or a nonparametric alternative.  |

**Why Normality Assumption Matters:**
- It enables the use of parametric tests with well-understood properties.
- When violated with small samples, parametric tests may be invalid.
- For large samples, tests remain robust despite non-normality.

**Practical Takeaways:**
- Check normality when sample sizes are small (tests like Shapiro-Wilk).
- For large A/B tests, normality concerns usually diminish.
- Use nonparametric tests if normality is violated and sample size is small.

This foundation explains why normality remains central in small-sample hypothesis testing but less restrictive for large-scale A/B tests due to the CLT.

#### In Practice
- For large-scale A/B tests (typical in web experiments), sample sizes are usually huge → normality assumption is not a concern.
- For small experiments, check normality:
    - Visual check (histogram, Q–Q plot)
    - Statistical tests (Shapiro–Wilk, Anderson–Darling)
- If the assumption fails, use:
    - Mann-Whitney U test (compares rank distributions rather than means)
    - Bootstrap methods (nonparametric confidence intervals)

#### ✅ Summary
| Question                                 | Answer                                                                                       |
| ---------------------------------------- | -------------------------------------------------------------------------------------------- |
| Do you need normal data for A/B testing? | Usually, **no**, if you have large samples.                                                   |
| Why not?                                 | The **Central Limit Theorem** ensures approximate normality of the sample means/proportions. |
| When should you care?                    | When your sample is small or the data are extremely skewed.                                  |
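
The Central Limit Theorem argument can be checked with a quick simulation. The sketch below draws heavily skewed data (an exponential distribution, chosen only as an illustrative stand-in for revenue-like metrics) and compares the skewness of the raw values with the skewness of sample means.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Heavily right-skewed data (exponential, theoretical skewness of 2)
raw = rng.exponential(scale=50.0, size=100_000)

# Means of 2,000 simulated experiments with n = 1,000 users each
sample_means = rng.exponential(scale=50.0, size=(2_000, 1_000)).mean(axis=1)

skew_raw = stats.skew(raw)
skew_means = stats.skew(sample_means)

print(f"Skewness of raw data:     {skew_raw:.2f}")
print(f"Skewness of sample means: {skew_means:.2f}")
```

The raw data stay strongly skewed while the sample means are close to symmetric, which is what lets the t-test and z-test work at large sample sizes.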

#### Normality assumption check
- Checking the normality assumption is an important step before applying tests like the t-test in A/B testing (especially with small samples).

**Step 1. Clarify What You’re Testing for Normality**
- You’re testing whether your sample data (e.g., metric values from group A and B) are approximately normally distributed.
- In A/B testing:
    - If your metric is continuous (e.g., time on site, revenue per user) → check normality directly.
    - If your metric is binary (e.g., converted / not converted) → no need to check; the approximate normality of the sample proportion comes from the Central Limit Theorem.
      
**Step 2. Visual Checks**
- Histogram
    - Plot a histogram of your metric for each group:
        - Should look roughly bell-shaped (symmetrical, unimodal).
- Q–Q (Quantile–Quantile) Plot
    - Plots your data’s quantiles vs. those of a theoretical normal distribution:
        - If points fall roughly along the straight reference line (the $45^\circ$ line when both axes are standardized) → data are approximately normal.

```
# Python example:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Suppose `group_a` and `group_b` are arrays or Series of your metric
sns.histplot(group_a, kde=True)
plt.title("Histogram of Group A")
plt.show()

stats.probplot(group_a, dist="norm", plot=plt)
plt.title("Q-Q Plot for Group A")
plt.show()
```

**Step 3. Statistical Normality Tests**
Formal tests for normality (use with caution: with large samples they flag even trivial deviations):

- Shapiro–Wilk Test
    - Most common for small to medium samples (n < 5000).

```
# Sample Python
from scipy.stats import shapiro

stat, p = shapiro(group_a)
print(f"Statistic={stat:.3f}, p={p:.3f}")
if p > 0.05:
    print("Sample looks normal (fail to reject H₀).")
else:
    print("Sample does not look normal (reject H₀).")

```

- Kolmogorov–Smirnov Test or Anderson–Darling Test
```
from scipy.stats import anderson

result = anderson(group_a)
print('Statistic: %.3f' % result.statistic)
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print(f"At {sl}% level: data looks normal.")
    else:
        print(f"At {sl}% level: data not normal.")

```
**Step 4. Interpret in Context**
- If the data look roughly normal, you can use a t-test.
- If not normal, but the sample is large (n ≥ 30), the Central Limit Theorem justifies approximate normality → still okay.
- If not normal and small sample, use:
    - Mann-Whitney U test (nonparametric)
    - Or bootstrap confidence intervals

**✅ Summary Table**
| Method                | Type        | When to Use | Interpretation                       |
| --------------------- | ----------- | ----------- | ------------------------------------ |
| Histogram             | Visual      | Always      | Rough shape                          |
| Q–Q Plot              | Visual      | Always      | Deviation from straight line         |
| Shapiro–Wilk          | Statistical | n < 5000    | p > 0.05 → no evidence against normality |
| Anderson–Darling      | Statistical | Any size    | Compare statistic to critical values |
| Large sample (n ≥ 30) | n/a         | n/a         | Normality assumption not critical    |


#### Normality Assumption:
- The Normality Assumption is one of the core statistical considerations in A/B testing, especially when you’re using parametric tests like the t-test or z-test.
- In an A/B test, you compare the means of two groups (e.g., control vs. variant). If you’re using a t-test (e.g., `scipy.stats.ttest_ind`) or z-test, those tests assume that the sampling distribution of the mean is approximately normal.
    - The test assumes that the averages (means) you observe across samples follow a normal (bell-shaped) distribution, not necessarily that your raw data are perfectly normal.
- Parametric tests like the t-test rely on the Central Limit Theorem (CLT), which says:
    - When sample sizes are large enough, the distribution of the sample mean tends to be normal, regardless of the shape of the raw data.
    - If your sample size is small, non-normality (e.g., skewed data) can bias your test results.
    - If your sample size is large, normality of raw data doesn’t matter much because the CLT protects you.
- Case 1: Small Sample, Non-Normal Data
    - Suppose you have only 30 users per group, and the metric is highly skewed (like revenue per user: many zeros, a few large spenders).
    - The t-test may not be reliable, because the data are not symmetric. In that case, you’d prefer a non-parametric test (e.g., Mann–Whitney U test).
- Case 2: Large Sample (Typical A/B Test)
    - If you have 1,000+ users per group:
        - Even if your revenue or AOV (Average Order Value) data are skewed, the distribution of the sample mean becomes approximately normal.
            - ✅ So you can safely use a t-test or z-test.

- $H_0$: The data are normally distributed.
- $H_1$: The data are not normally distributed.

If the normality test’s p-value is less than 0.05, $H_0$ is rejected and a nonparametric test (Mann-Whitney U test) is used; otherwise, a parametric test (t-test) is used.

| Condition              | Data Shape         | Sample Size          | Recommended Test      | Why                                   |
|------------------------|--------------------|----------------------|----------------------|--------------------------------------|
| Roughly normal data    | Symmetric          | Any                  | t-test               | Meets normality assumption directly  |
| Skewed data, small n   | Skewed, heavy tails| < 30–50 per group    | Mann–Whitney U test  | Doesn’t assume normality             |
| Skewed data, large n   | Skewed             | ≥ 100–200 per group  | t-test (CLT applies) | Sampling distribution ≈ normal        |
| Proportions (e.g., conversion) | Binary outcome    | Large n              | z-test               | Proportion sampling distribution ≈ normal |
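
The p-value rule stated above (normality rejected → Mann-Whitney U, otherwise t-test) can be sketched as a small helper. The function name `choose_and_run_test` is hypothetical, Welch's variant of the t-test is an assumption, and the helper deliberately ignores sample size, so for large samples the CLT row of the table may still justify a t-test.

```python
import numpy as np
from scipy import stats

def choose_and_run_test(a, b, alpha=0.05):
    """Pick Welch's t-test or Mann-Whitney U from Shapiro-Wilk results.

    Hypothetical helper: it mirrors the p-value rule in the text, nothing more.
    """
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)
    if min(p_norm_a, p_norm_b) < alpha:
        # Normality rejected in at least one group: use the rank-based test
        _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mann-whitney", p
    # Normality not rejected: use the parametric test
    _, p = stats.ttest_ind(a, b, equal_var=False)
    return "welch-t", p

# Simulated, strongly skewed data for illustration
rng = np.random.default_rng(3)
skewed_a = rng.lognormal(3.0, 1.0, 200)
skewed_b = rng.lognormal(3.1, 1.0, 200)

name, p_value = choose_and_run_test(skewed_a, skewed_b)
print(f"Selected test: {name}, p = {p_value:.4f}")
```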

```
# Sample Code
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Example data
data_A = [50, 55, 52, 70, 120, 30, 45, 50, 48, 52]

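# Histogram and Q-Q plot on separate axes so they do not overlap
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(data_A, kde=True, ax=axes[0])
stats.probplot(data_A, dist="norm", plot=axes[1])
plt.show()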

# Shapiro-Wilk normality test
stat, p = stats.shapiro(data_A)
print(f"Shapiro-Wilk p-value: {p:.4f}") 
if p > 0.05:
    print("✅ Data looks normal (fail to reject H0).")
else:
    print("⚠️ Data is likely non-normal (reject H0).")
```
#### 🧮 Key Takeaways
| ✅ Do’s                                         | ⚠️ Don’ts                                                         |
|------------------------------------------------|------------------------------------------------------------------|
| Use t-test if n ≥ 30 per group (CLT is your friend). | Don’t assume raw data must be normal; it’s the means that need to be. |
| For small or skewed samples, use Mann–Whitney U or bootstrap tests. | Don’t apply t-tests blindly to highly skewed or bounded data (like conversion rates). |
| Always visualize distributions (histograms, Q-Q plots). | Don’t rely only on normality tests for large samples, since they often flag trivial deviations. |


## Variance Homogeneity
Variance homogeneity (also called homoscedasticity) is one of the key statistical assumptions behind the classical t-test used in A/B testing.
- Variance homogeneity (or the equal variance assumption) means:
    - The spread (variance or standard deviation) of your metric (e.g., AOV, conversion rate, time on site) is roughly the same across all groups being compared.

*In A/B testing terms:*
- $Var(A) \approx Var(B)$

#### Why It Matters
When you run a classical (Student’s) two-sample t-test, the test formula assumes that both groups have:
- Independent samples
- Normally distributed means (due to the Central Limit Theorem)
- Equal variances (homogeneity)

If this assumption is violated:
- The standard error of the difference in means is misestimated.
- Your p-values and confidence intervals may become inaccurate.
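
A small simulation illustrates the consequence. In the sketch below $H_0$ is true (equal means), but the smaller group has the larger variance, a setup chosen deliberately because it is where the equal-variance t-test is known to misbehave. The group sizes and standard deviations are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
alpha, n_sims = 0.05, 4000

# H0 is true (equal means), but the smaller group has the larger variance
a = rng.normal(50, 30, size=(n_sims, 20))    # n = 20, sd = 30
b = rng.normal(50, 10, size=(n_sims, 100))   # n = 100, sd = 10

# Run both t-test variants on every simulated experiment (one per row)
_, p_student = stats.ttest_ind(a, b, axis=1, equal_var=True)
_, p_welch = stats.ttest_ind(a, b, axis=1, equal_var=False)

student_rate = float(np.mean(p_student < alpha))   # false positive rate
welch_rate = float(np.mean(p_welch < alpha))

print(f"False positive rate, equal-variance t-test: {student_rate:.3f}")
print(f"False positive rate, Welch's t-test:        {welch_rate:.3f}")
```

The equal-variance test rejects far more often than the nominal 5%, while the unequal-variance version stays close to it.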

**The Two Versions of t-test**
| **t-test variant**   | **Assumes equal variances?** | **When to use**                            |
| :------------------- | :--------------------------- | :----------------------------------------- |
| **Student’s t-test** | ✅ Yes                        | When variances in both groups are similar  |
| **Welch’s t-test**   | ❌ No                         | When variances differ (heteroscedasticity) |

**✅ It is generally safer to use Welch’s t-test (`equal_var=False` in Python), since it’s robust and doesn’t require equal variances.**

```
# Example in Python
import numpy as np
from scipy import stats

# Simulate two groups
np.random.seed(42)
A = np.random.normal(50, 10, 500)   # mean=50, sd=10
B = np.random.normal(52, 20, 500)   # mean=52, sd=20 (different variance)

# Check variances
print(f"Var(A): {np.var(A, ddof=1):.2f}, Var(B): {np.var(B, ddof=1):.2f}")

# Levene's test for equal variances
stat, p = stats.levene(A, B)
print(f"Levene's test: W = {stat:.3f}, p = {p:.4f}")

# Choose an appropriate t-test
if p < 0.05:
    print("⚠️ Variances differ - use Welch's t-test (equal_var=False).")
else:
    print("✅ Variances are similar - standard t-test is fine.")

# Welch's t-test (default robust option)
t, pval = stats.ttest_ind(A, B, equal_var=False)
print(f"Welch's t = {t:.3f}, p = {pval:.4f}")
```
**Output Example** <br>
Var(A): 95.92, Var(B): 381.21 <br>
Levene's test: W = 209.307, p = 0.0000 <br>
⚠️ Variances differ - use Welch's t-test (equal_var=False). <br>
Welch's t = -2.134, p = 0.0331 <br>

**✅ Interpretation:**
- The variances are significantly different (p < 0.05 from Levene's test).
- Therefore, use Welch's t-test, which adjusts degrees of freedom and handles unequal variances correctly.

#### How to Check Variance Homogeneity
| **Test / Method**                      | **Purpose**                                    | **Interpretation**              |
| :------------------------------------- | :--------------------------------------------- | :------------------------------ |
| **Levene's Test** (`stats.levene`)     | Most common; robust to non-normality           | $p > 0.05$ → no evidence against equal variances |
| **Bartlett's Test** (`stats.bartlett`) | Sensitive to normality; use if data are normal | $p > 0.05$ → no evidence against equal variances |
| **Visual Inspection**                  | Boxplots or spread plots                       | Compare spread of data visually |

```
# Example visualization:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"value": np.concatenate([A, B]),
                   "group": ["A"]*len(A) + ["B"]*len(B)})

sns.boxplot(data=df, x="group", y="value")
plt.title("Visual check for variance homogeneity")
plt.show()
```
#### Summary Table
| **Concept**      | **Symbol / Term**                  | **Interpretation**                          |
| :--------------- | :--------------------------------- | :------------------------------------------ |
| Equal variance   | $\sigma^2_A = \sigma^2_B$          | Assumed in Student's t-test                 |
| Unequal variance | $\sigma^2_A \ne \sigma^2_B$        | Violates homogeneity                        |
| Safe test        | Welch's t-test (`equal_var=False`) | Robust to variance differences              |
| Check test       | Levene's or Bartlett's test        | $p > 0.05$ → OK; $p < 0.05$ → use Welch |

#### Practical Tips in A/B Testing
| **Scenario**                                         | **Recommendation**                                                      |
| :--------------------------------------------------- | :---------------------------------------------------------------------- |
| AOV or revenue metrics (often skewed, high variance) | Use **Welch's t-test** by default                                       |
| Conversion rate tests                                | Use **proportion z-test** (variance formula known analytically)         |
| Small samples with unequal variance                  | Consider **non-parametric test** (Mann–Whitney U)                       |
| Very large samples                                   | Variance differences have a minor impact, but still report the test type used |
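
As a rough illustration of the small-sample row, the sketch below runs both Welch's t-test and the Mann–Whitney U test on two small, right-skewed samples. The lognormal data and the seed are assumptions made up for this example, not results from a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical small, right-skewed samples (e.g., order values)
a = rng.lognormal(mean=3.0, sigma=1.0, size=25)
b = rng.lognormal(mean=3.4, sigma=1.0, size=25)

# Welch's t-test compares means without assuming equal variances
t_stat, p_welch = stats.ttest_ind(a, b, equal_var=False)

# Mann-Whitney U compares the two samples via ranks,
# so a few extreme values have limited influence
u_stat, p_mwu = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"Welch's t-test: t = {t_stat:.3f}, p = {p_welch:.4f}")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_mwu:.4f}")
```

The two tests answer slightly different questions (difference in means versus a shift in ranks), so it is worth stating which one was used when reporting results.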

**✅ In short:**
- Variance homogeneity = equal spread of values between groups.
- If violated → use Welch's t-test.
- Always test or visualize before deciding.

#### Conceptual Visualization: Variance Homogeneity in t-Tests
- We'd plot two bell curves (the distributions of the metric in Group A and Group B):
1. Equal variance (homoscedastic case):
   - Both curves have similar spread (width).
   - The t-test assumes this scenario when computing pooled variance.
   - The pooled standard error is then a valid estimate, so p-values are accurate.
2. Unequal variance (heteroscedastic case):
   - One curve is much wider (higher variance).
   - The assumption of equal spread breaks, and the standard error is misestimated.
   - Student's t-test can produce misleading p-values.
   - Welch's t-test corrects this by using separate variances and adjusted degrees of freedom.

| Scenario              | Visualization Idea                         | Interpretation                           |
| :-------------------- | :----------------------------------------- | :--------------------------------------- |
| **Equal variances**   | Two smooth bell curves with similar widths | ✅ t-test assumption holds                |
| **Unequal variances** | One narrow, one wide curve                 | ⚠️ Student's t-test invalid; use Welch's |
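
The effect of the unequal-variance case on p-values can be checked with a small simulation. The sketch below assumes a hypothetical setup (equal true means, the smaller group having three times the standard deviation) and counts how often each test rejects at the 5% level; since there is no true difference, every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 5000

# Both groups share the same true mean, so every rejection is a false positive.
# Hypothetical setup: the smaller group has the larger spread.
a = rng.normal(loc=50, scale=10, size=(n_sims, 100))
b = rng.normal(loc=50, scale=30, size=(n_sims, 20))

# Run one test per simulated experiment (row-wise)
p_student = stats.ttest_ind(a, b, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(a, b, axis=1, equal_var=False).pvalue

rate_student = np.mean(p_student < 0.05)
rate_welch = np.mean(p_welch < 0.05)

print(f"False positive rate, Student's t-test: {rate_student:.3f}")
print(f"False positive rate, Welch's t-test:   {rate_welch:.3f}")
```

In this assumed scenario Student's t-test rejects far more often than the nominal 5%, while Welch's t-test stays close to it. The size of the gap depends on how unbalanced the group sizes and variances are.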


### Strategic & Statistical Robustness Testing
These methods check the stability of your result and provide alternative business-friendly metrics.

**Bayesian A/B Analysis**
- In essence, Bayesian A/B testing allows you to quantify your belief about which variant is better by using probability, instead of just focusing on p-values and confidence intervals.
    - Traditional (Frequentist): Asks, "Assuming the variants are the same (null hypothesis), what is the probability of seeing data at least as extreme as what we observed?"
    - Bayesian: Asks, "Given the data we have observed, what is the probability that Variant A is better than Variant B?"

- Key Advantages:
    - Incorporates Prior Knowledge: You can formally include what you already know (or believe) about the conversion rate before the test even starts.
    - Intuitive Results: The output is a clear, actionable probability (e.g., "There is a 98% chance that Variant B will generate a higher conversion rate than Variant A").
    - Faster Decision Making: You can often stop the test earlier because the stopping rule is based on reaching a sufficiently high probability of superiority or an acceptable expected loss, rather than a fixed sample size.

##### Step-by-Step Bayesian A/B Analysis
**1. Define the Prior Distribution**
- This is the most "Bayesian" step. A prior is a probability distribution representing your belief about the true conversion rate ($\theta$) of each variant before you see any data.
    - Common Choice: The Beta Distribution
        - The Beta distribution is the standard choice for modeling probabilities (like conversion rates) because its values are between 0 and 1.
        - It has two shape parameters, $\alpha$ and $\beta$; starting from a flat prior, they can be read as (number of successes + 1) and (number of failures + 1).
    - Choosing your Prior:
        - Informative Prior: If you have historical data (e.g., from previous tests), you set $\alpha$ and $\beta$ based on those past results.
        - Uninformative (Flat) Prior: If you have no prior knowledge, you use $\text{Beta}(1, 1)$, which assumes all conversion rates between 0% and 100% are equally likely. This is a good starting point.

$$\text{Prior}(\theta) = \text{Beta}(\alpha_{\text{prior}}, \beta_{\text{prior}})$$
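
A quick way to see what a prior implies is to inspect its mean and a central interval. The sketch below compares the flat prior with a hypothetical informative prior; the counts of 30 conversions and 970 non-conversions are assumed purely for illustration.

```python
from scipy import stats

# Flat prior: every conversion rate in [0, 1] is equally plausible
flat_prior = stats.beta(1, 1)

# Hypothetical informative prior: behaves like 30 prior conversions
# and 970 prior non-conversions on top of the flat prior
informative_prior = stats.beta(1 + 30, 1 + 970)

for name, prior in [("Flat", flat_prior), ("Informative", informative_prior)]:
    low, high = prior.interval(0.95)  # central 95% interval
    print(f"{name:<12} mean = {prior.mean():.4f}, 95% interval = [{low:.4f}, {high:.4f}]")
```

The flat prior spans almost the whole unit interval, while the informative one concentrates around roughly 3%.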

**2. Collect Data and Apply the Likelihood**
Run the A/B test and collect the data for each variant (A and B).
- Data for Variant A (and likewise for Variant B):
    - $N_A$: Total visitors (trials)
    - $k_A$: Total conversions (successes)
- Likelihood: The probability of observing $k_A$ successes out of $N_A$ trials, given the true conversion rate $\theta_A$, is modeled by the Binomial distribution.
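
As a small illustration of the likelihood, the sketch below evaluates the Binomial probability of one observed count under a few candidate conversion rates. The counts and the candidate rates are hypothetical.

```python
from scipy import stats

# Hypothetical observation: 200 conversions out of 10,000 visitors
n_visitors, conversions = 10_000, 200

# Likelihood of that observation under several candidate conversion rates
candidate_rates = [0.015, 0.020, 0.025]
likelihoods = [stats.binom.pmf(conversions, n_visitors, rate) for rate in candidate_rates]

for rate, lik in zip(candidate_rates, likelihoods):
    print(f"theta = {rate:.3f} -> P(k = {conversions} | theta) = {lik:.3e}")
```

The likelihood is highest for the candidate rate that matches the observed rate (200 / 10,000 = 2%) and falls off quickly on either side.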

**3. Calculate the Posterior Distribution**
This is where Bayes' Theorem comes into play. The posterior distribution represents your updated belief about the true conversion rate after seeing the data.
- The Math: Thanks to the mathematical convenience of the Beta-Binomial conjugate prior relationship, the posterior distribution is also a Beta Distribution.

$$\text{Posterior}(\theta) = \text{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}})$$

- The Update Rule:
    - $\alpha_{\text{posterior}} = \alpha_{\text{prior}} + k$ (prior successes + observed successes)
    - $\beta_{\text{posterior}} = \beta_{\text{prior}} + (N - k)$ (prior failures + observed failures)

Calculate a separate posterior distribution for both Variant A and Variant B.
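
The update rule can be sanity-checked numerically. The sketch below multiplies prior by likelihood on a grid, normalizes the result, and compares it with the closed-form Beta posterior; the prior and data are hypothetical and kept small for readability.

```python
import numpy as np
from scipy import stats

# Hypothetical prior and data, kept small so the curves are easy to inspect
alpha_prior, beta_prior = 2, 8
n_trials, successes = 100, 20

# Conjugate update: add successes to alpha and failures to beta
alpha_post = alpha_prior + successes
beta_post = beta_prior + (n_trials - successes)

# Brute-force check: posterior is proportional to prior x likelihood
n_grid = 2000
theta = (np.arange(n_grid) + 0.5) / n_grid  # midpoints of a grid on (0, 1)
unnormalized = (stats.beta.pdf(theta, alpha_prior, beta_prior)
                * stats.binom.pmf(successes, n_trials, theta))
grid_posterior = unnormalized / (unnormalized.sum() / n_grid)  # integrate to 1

exact_posterior = stats.beta.pdf(theta, alpha_post, beta_post)
max_gap = np.max(np.abs(grid_posterior - exact_posterior))

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Largest gap between grid and closed form: {max_gap:.2e}")
```

The gap should be negligible, which is the practical payoff of conjugacy: no integration is needed, only two additions.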

**4. Calculate Decision Metrics**<br>
Instead of a p-value, you get a distribution for the potential outcome of each variant. You use these distributions to calculate highly actionable metrics:

A. Probability of Superiority (PoS)
- What it is: The probability that the true conversion rate of one variant (e.g., B) is greater than the true conversion rate of another variant (A).
- How to Calculate: You simulate (or sample) thousands of times from both posterior distributions ($\theta_A$ and $\theta_B$) and count how often the sampled value for $\theta_B$ is greater than $\theta_A$.

$$\text{PoS}(\text{B} > \text{A}) = P(\theta_B > \theta_A \, | \, \text{Data})$$

- Decision: If $\text{PoS}(\text{B} > \text{A})$ is, say, 95% or higher, you have a strong reason to choose B.

B. Expected Loss (EL)
- What it is: The expected loss if you were to deploy the wrong variant. For example, the expected loss if you choose Variant A, but Variant B is actually better.
- Decision: You typically choose the variant that minimizes the expected loss.

**5. Decision and Stopping Rule** <br>
Unlike frequentist tests which require pre-calculating a sample size, Bayesian tests allow for continuous monitoring (though you should still let it run for a sufficient time, typically one or more full business cycles, to account for time-based variability).
- The Rule: Stop the test and declare a winner when the PoS reaches a predetermined threshold (e.g., 95%, 99%) AND the data size is sufficient to reflect typical user behavior (e.g., a week or two). 
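
The rule can be written as a small helper. This is a minimal sketch; the function name, the default threshold of 0.95, and the 14-day minimum are assumptions for illustration, not fixed standards.

```python
def should_stop(pos: float, days_run: int, pos_threshold: float = 0.95, min_days: int = 14) -> bool:
    """Return True only when both conditions of the stopping rule are met."""
    decisive = pos >= pos_threshold       # PoS has reached the chosen threshold
    ran_long_enough = days_run >= min_days  # covers at least the minimum duration
    return decisive and ran_long_enough


print(should_stop(pos=0.991, days_run=3))   # False: decisive, but too early
print(should_stop(pos=0.991, days_run=14))  # True
print(should_stop(pos=0.700, days_run=21))  # False: not decisive yet
```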

Example of comparing two Call-to-Action (CTA) button colors, Variant A (Control) and Variant B (Treatment), using the simple and effective Beta-Binomial approach.

Numerical Example: The CTA Button Test
| Variant          | Visitors ($N$) | Conversions ($k$) | Observed Conversion Rate ($k/N$) |
| :--------------- | :------------- | :---------------- | :------------------------------ |
| A (Control)      | 10,000         | 200               | 2.00%                           |
| B (Treatment)    | 10,000         | 250               | 2.50%                           |

Explanation:
- Conversion rate is calculated as $\frac{\text{Conversions } k}{\text{Visitors } N}$.
- Variant A had 2.00% conversion, Variant B had 2.50%, showing an observed uplift of 0.5 percentage points (a 25% relative uplift).
- This table summarizes the basic input data for statistical tests in an A/B experiment. Further significance testing would assess whether the difference is statistically meaningful.
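
For comparison with the Bayesian analysis that follows, the frequentist check mentioned in the last bullet can be sketched as a pooled two-proportion z-test on the counts from the table. This is one common way to run the test, not the only one.

```python
import numpy as np
from scipy import stats

# Counts from the table above
n_a, k_a = 10_000, 200
n_b, k_b = 10_000, 250

rate_a, rate_b = k_a / n_a, k_b / n_b

# Pooled rate under H0 (no difference between variants)
pooled = (k_a + k_b) / (n_a + n_b)
se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

z = (rate_b - rate_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided

print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```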

**Step 1: Define the Prior Distribution** <br>
Use a standard uninformative prior for both variants. This represents having no strong prior belief about the conversion rate before the test starts.
- Uninformative Prior: $\text{Beta}(1, 1)$
    - $\alpha_{\text{prior}} = 1$
    - $\beta_{\text{prior}} = 1$

**Step 2 & 3: Calculate the Posterior Distribution** <br>
Update the prior parameters ($\alpha$ and $\beta$) with the observed successes ($k$) and failures ($N-k$) for each variant.
- Variant A (Control)
    - Observed Successes ($k_A$): 200
    - Observed Failures ($N_A - k_A$): $10,000 - 200 = 9,800$
    - Posterior A: $\text{Beta}(\alpha_{\text{prior}} + k_A, \beta_{\text{prior}} + (N_A - k_A))$

$$\text{Posterior A} = \text{Beta}(1 + 200, 1 + 9,800) = \mathbf{\text{Beta}(201, 9801)}$$

- Variant B (Treatment)
    - Observed Successes ($k_B$): 250
    - Observed Failures ($N_B - k_B$): $10,000 - 250 = 9,750$
    - Posterior B: $\text{Beta}(\alpha_{\text{prior}} + k_B, \beta_{\text{prior}} + (N_B - k_B))$

$$\text{Posterior B} = \text{Beta}(1 + 250, 1 + 9,750) = \mathbf{\text{Beta}(251, 9751)}$$

**Step 4: Calculate Decision Metrics (Probability of Superiority)** <br>
Now we have two probability distributions, $\text{Posterior A}$ and $\text{Posterior B}$, that quantify our belief about the true conversion rate for each variant. The core task is to determine: $P(\text{CR}_B > \text{CR}_A \, | \, \text{Data})$ (i.e., the Probability of Superiority).

Since this calculation is complex to do by hand (it requires integrating the two Beta distributions), we use a Monte Carlo simulation (which is what modern A/B testing tools do):
1. Simulate: Draw 100,000 random samples from $\text{Posterior A}$ and 100,000 random samples from $\text{Posterior B}$. Each sample is a plausible true conversion rate for that variant.
2. Compare: For each of the 100,000 pairs, check if the sample from B is greater than the sample from A.
3. Count: Calculate the proportion of times B's sample was greater than A's sample.

| Comparison                       | Result                        |
| :------------------------------ | :------------------------------|
| Probability of Superiority       | $\approx \mathbf{99.1\%}$     |
| Probability of being Equal/Worse | $\approx 0.9\%$               |
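
As a cross-check on the Monte Carlo estimate, the same probability can be approximated by direct numerical integration of $P(\theta_B > \theta_A) = \int f_B(x)\, F_A(x)\, dx$. The sketch below uses a simple grid sum; the grid range of [0, 0.1] is an assumption that holds here because both posteriors put essentially all their mass well below 10%.

```python
import numpy as np
from scipy import stats

# Posterior parameters from Steps 2 & 3
alpha_a, beta_a = 201, 9801
alpha_b, beta_b = 251, 9751

# P(theta_B > theta_A) = integral of pdf_B(x) * cdf_A(x) dx.
# A fine grid on [0, 0.1] is assumed to capture all of the posterior mass.
x = np.linspace(0.0, 0.1, 200_001)
dx = x[1] - x[0]
integrand = stats.beta.pdf(x, alpha_b, beta_b) * stats.beta.cdf(x, alpha_a, beta_a)
pos_numeric = integrand.sum() * dx

print(f"P(B > A) by numerical integration: {pos_numeric:.4f}")
```

Because this calculation has no sampling noise, it is a convenient reference when judging whether the number of Monte Carlo samples is large enough.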

*Note:
Conversion Rate (CR) is the percentage of users who complete a desired action (a "conversion") out of the total number of users who had the opportunity to complete that action.*

$$\text{CR} = \frac{\text{Number of Conversions}}{\text{Total Number of Visitors or Trials}} \times 100$$

Examples of Conversions:
| Context       | Desired Action (Conversion)                           |
| :------------ | :---------------------------------------------------|
| E-commerce    | Making a purchase, adding an item to the cart.      |
| Lead Generation | Submitting a form, signing up for a newsletter.    |
| SaaS/App      | Starting a free trial, completing the onboarding process. |
| Content      | Clicking a specific call-to-action (CTA) button, downloading an asset. |


**Step 5: Decision**

The result is highly actionable:
- Based on the data we have collected, there is a roughly 99.1% probability that Variant B (the Treatment) has a higher true conversion rate than Variant A (the Control).
- This result is much more intuitive than a frequentist statement like "The p-value is below 0.05, so we reject the null hypothesis." With a PoS of about 99.1%, you have high confidence to declare Variant B the winner and roll it out.

In [2]:
import numpy as np
import scipy.stats as stats

# --- 1. Define the Data ---
# Variant A (Control) Data
N_A = 10000  # Total Visitors
k_A = 200    # Total Conversions

# Variant B (Treatment) Data
N_B = 10000  # Total Visitors
k_B = 250    # Total Conversions

# --- 2. Define Priors and Calculate Posteriors ---
# Using an uninformative prior: Beta(alpha=1, beta=1)
alpha_prior = 1
beta_prior = 1

# Calculate Posterior Parameters (alpha_posterior = alpha_prior + k)
# Posterior A: Beta(201, 9801)
alpha_A_post = alpha_prior + k_A
beta_A_post = beta_prior + (N_A - k_A)

# Posterior B: Beta(251, 9751)
alpha_B_post = alpha_prior + k_B
beta_B_post = beta_prior + (N_B - k_B)

print(f"Posterior A: Beta({alpha_A_post}, {beta_A_post})")
print(f"Posterior B: Beta({alpha_B_post}, {beta_B_post})")
print("-" * 30)

# --- 3. Monte Carlo Simulation for Probability of Superiority (PoS) ---
# We simulate the true conversion rates by sampling from the posterior distributions.
NUM_SAMPLES = 100000

# Sample from the Posterior Beta distributions
# Each sample represents a plausible true conversion rate for the variant
samples_A = stats.beta.rvs(alpha_A_post, beta_A_post, size=NUM_SAMPLES)
samples_B = stats.beta.rvs(alpha_B_post, beta_B_post, size=NUM_SAMPLES)

# Compare the samples:
# Check in how many simulations the CR of B was greater than the CR of A
b_beats_a = (samples_B > samples_A).sum()

# Calculate the Probability of Superiority
probability_of_superiority = b_beats_a / NUM_SAMPLES

# --- 4. Calculate Expected Uplift (Mean of the relative difference) ---
# Calculate the percentage difference for each sample pair
relative_uplift_samples = (samples_B - samples_A) / samples_A
expected_uplift = np.mean(relative_uplift_samples)

# --- 5. Output Results ---
print(f"Number of Samples: {NUM_SAMPLES}")
print(f"Variant B won in {b_beats_a} simulations.")
print(f"Probability of Superiority (PoS): {probability_of_superiority:.4f} ({probability_of_superiority*100:.2f}%)")
print(f"Expected Relative Uplift: {expected_uplift:.4f} ({expected_uplift*100:.2f}%)")

Posterior A: Beta(201, 9801)
Posterior B: Beta(251, 9751)
------------------------------
Number of Samples: 100000
Variant B won in 99126 simulations.
Probability of Superiority (PoS): 0.9913 (99.13%)
Expected Relative Uplift: 0.2543 (25.43%)


##### Code Explanation
1. Libraries: We use numpy for efficient array operations and scipy.stats (specifically stats.beta.rvs) to easily draw random samples from the Beta distribution.
2. Posterior Calculation: The core Bayesian update is just adding the successes and failures to the prior $\alpha$ and $\beta$ values.
3. Sampling: stats.beta.rvs(alpha, beta, size=NUM_SAMPLES) draws 100,000 values. We do this for both A and B.
4. Probability of Superiority (PoS):
    - (samples_B > samples_A) creates an array of True/False values.
    - .sum() counts the True values (where B beat A).
    - Dividing this count by NUM_SAMPLES gives the final probability.
5. Expected Uplift: This is another powerful metric. It tells you the expected percentage improvement you can anticipate if you deploy Variant B. In this case, it should be close to $(2.5\% - 2.0\%) / 2.0\% = 25\%$. 

This code provides the two most critical metrics for a business decision: Confidence (PoS) and Magnitude (Expected Uplift).
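
The mean uplift is a single number; the same posterior draws also show how uncertain it is. The sketch below, which assumes the posteriors from the cell above and an arbitrary seed, reports a central 95% credible interval for the relative uplift.

```python
import numpy as np

rng = np.random.default_rng(42)

# Posterior draws for each variant (same posteriors as in the cell above)
samples_a = rng.beta(201, 9801, size=100_000)
samples_b = rng.beta(251, 9751, size=100_000)

relative_uplift = (samples_b - samples_a) / samples_a

# Central 95% credible interval for the relative uplift
low, high = np.percentile(relative_uplift, [2.5, 97.5])
print(f"Mean uplift: {relative_uplift.mean():.1%}")
print(f"95% credible interval: [{low:.1%}, {high:.1%}]")
```

A wide interval is a reminder that an expected uplift of about 25% is compatible with both much smaller and much larger true effects.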

**The next step in Bayesian A/B analysis is to understand and calculate the Expected Loss.** <br>
This metric directly addresses the business risk associated with an experiment, which is arguably the most actionable part of the Bayesian approach.

**Expected Loss (EL)** <br>
The Expected Loss is the amount of potential profit you leave on the table if you choose the wrong variant. It quantifies the cost of not picking the true winner.

**Why is it Important?** <br>
In A/B testing, you usually have two goals:
- Confidence: Is Variant B actually better? (Answered by Probability of Superiority, PoS).
- Risk: How much would it hurt if I'm wrong? (Answered by Expected Loss, EL).

If the PoS is 99%, the decision is easy. But what if the PoS is only 70%? Expected Loss helps you decide if a 70% chance of a big gain is worth the 30% risk of a small loss.

**Calculating Expected Loss** <br>
We calculate the Expected Loss for each variant if we were to choose it and it turned out to be the loser. We then choose the variant that has the minimum Expected Loss.

Using our previous example where Variant B is performing better (has the higher mean conversion rate):
- EL (if we choose A): The expected loss if we deploy A, but B is actually the winner. This represents the missed opportunity of choosing B.

$$\text{EL}_{\text{choose A}} = E[\max(\text{CR}_B - \text{CR}_A, 0) \mid \text{Data}]$$

- EL (if we choose B): The expected loss if we deploy B, but A is actually the winner. This represents the cost of rolling out the worse version.

$$\text{EL}_{\text{choose B}} = E[\max(\text{CR}_A - \text{CR}_B, 0) \mid \text{Data}]$$

Python Code for Expected Loss

In [3]:
import numpy as np
import scipy.stats as stats

# --- Data from previous step ---
alpha_A_post = 201
beta_A_post = 9801
alpha_B_post = 251
beta_B_post = 9751
NUM_SAMPLES = 100000

# Sample from the Posterior Beta distributions
samples_A = stats.beta.rvs(alpha_A_post, beta_A_post, size=NUM_SAMPLES)
samples_B = stats.beta.rvs(alpha_B_post, beta_B_post, size=NUM_SAMPLES)

# --- Expected Loss Calculation ---

# 1. Loss if we choose A (and B is actually better)
# The loss is (CR_B - CR_A) only when CR_B > CR_A. Otherwise, the loss is 0.
loss_choose_A_samples = np.maximum(samples_B - samples_A, 0)
expected_loss_A = np.mean(loss_choose_A_samples)

# 2. Loss if we choose B (and A is actually better)
# The loss is (CR_A - CR_B) only when CR_A > CR_B. Otherwise, the loss is 0.
loss_choose_B_samples = np.maximum(samples_A - samples_B, 0)
expected_loss_B = np.mean(loss_choose_B_samples)

print(f"Expected Loss if we choose A (Control): {expected_loss_A:.6f}")
print(f"Expected Loss if we choose B (Treatment): {expected_loss_B:.6f}")

Expected Loss if we choose A (Control): 0.005007
Expected Loss if we choose B (Treatment): 0.000006


**Analysis of Expected Loss Results**
| Scenario                       | Metric                     | Value    | Interpretation                                                                                                                             |
| :-----------------------------| :--------------------------| :--------| :------------------------------------------------------------------------------------------------------------------------------------------|
| Loss if we choose A (Control) | $\text{EL}_{\text{choose A}}$ | 0.005007 | The expected shortfall in conversion rate from deploying A, averaged over the posterior (scenarios where A is actually better contribute zero). It amounts to about 0.5 percentage points of conversion rate missed per visitor, on average, by deploying A instead of B. |
| Loss if we choose B (Treatment) | $\text{EL}_{\text{choose B}}$ | 0.000006 | The expected shortfall from deploying B, averaged over the posterior. At about 0.0006 percentage points it is nearly zero, indicating low risk in choosing B. |

**The Business Decision** <br>
The primary goal of using the Expected Loss metric is to choose the variant with the minimum Expected Loss.
1. Compare:
$$\text{EL}_{\text{choose A}} \ (0.005007) \gg \text{EL}_{\text{choose B}} \ (0.000006)$$
2. Conclusion: The Expected Loss if you choose A is roughly 800 times higher than the Expected Loss if you choose B (the exact ratio is sensitive to Monte Carlo noise in the tiny $\text{EL}_{\text{choose B}}$).

Therefore, the decision is clear: Choose Variant B (Treatment).

**Contextualizing the Risk** <br>
These numbers are consistent with our earlier finding that the Probability of Superiority for B was $\approx 99.1\%$.
- The low $\text{EL}_{\text{choose B}}$ confirms that the risk of B being worse than A is negligible.
- The higher $\text{EL}_{\text{choose A}}$ confirms that the cost of sticking with A (the missed opportunity) is substantial and should be avoided.


#### Informative Priors
This is one of the most powerful features of Bayesian A/B testing that sets it apart from frequentist methods.

**Informative Priors: Leveraging Historical Data** <br>
In our previous example, we used an Uninformative Prior ($\text{Beta}(1, 1)$), which assumes we know nothing about the conversion rate before the test starts. An Informative Prior is a prior distribution that reflects genuine, existing knowledge about the system you are testing.

**How to Construct an Informative Prior** <br>
Construct an informative prior from the historical performance of the page, category, or similar test element.
1. Gather Historical Data: Look back at a stable period (e.g., the last 3 months) for the specific page or a page with a very similar function.
- $N_{\text{hist}}$: Total historical visitors.
- $k_{\text{hist}}$: Total historical conversions.

2. Calculate the Historical Average CR: $\text{CR}_{\text{hist}} = k_{\text{hist}} / N_{\text{hist}}$.

3. Use the Data as a Prior: Use the historical counts to define your new, informative prior distribution for the Control variant (Variant A).
$$\text{Prior}_{\text{Informative}}(\theta) = \text{Beta}(\alpha_{\text{hist}}, \beta_{\text{hist}})$$
- $\alpha_{\text{hist}} = k_{\text{hist}} + 1$
- $\beta_{\text{hist}} = (N_{\text{hist}} - k_{\text{hist}}) + 1$
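
The three steps can be wrapped in a small helper. This is a sketch under assumptions: the function name is hypothetical, the historical counts (600 conversions in 20,000 visitors) are illustrative, and the `weight` argument is an optional extra, not part of the recipe above, for down-weighting history so the prior is less dominant.

```python
def prior_from_history(k_hist: int, n_hist: int, weight: float = 1.0) -> tuple[float, float]:
    """Build Beta prior parameters from historical counts.

    weight is a hypothetical knob (not part of the recipe above): values
    below 1 shrink the historical counts so the prior is less dominant.
    """
    alpha = k_hist * weight + 1
    beta = (n_hist - k_hist) * weight + 1
    return alpha, beta


# Full-weight prior from 600 conversions in 20,000 visitors
alpha_full, beta_full = prior_from_history(600, 20_000)
# Same history, discounted to count as roughly 2,000 visitors
alpha_disc, beta_disc = prior_from_history(600, 20_000, weight=0.1)

for label, (a, b) in [("Full", (alpha_full, beta_full)), ("Discounted", (alpha_disc, beta_disc))]:
    print(f"{label:<10} Beta({a:g}, {b:g}), mean = {a / (a + b):.4f}, pseudo-observations = {a + b - 2:g}")
```

Both priors are centered near 3%, but the discounted one carries about a tenth of the pseudo-observations, so new test data moves it more easily.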

Example: Informative Prior vs. Uninformative Prior
- Let's say historically your page has an average conversion rate of 3.0% over 20,000 visitors (600 conversions).

| Prior Type          | α (Alpha) | β (Beta)                | Formula                       | Effect                                                                                     |
| :------------------ | :-------- | :---------------------- | :---------------------------- | :----------------------------------------------------------------------------------------- |
| Uninformative (Flat) | 1         | 1                       | $\text{Beta}(1, 1)$           | Little influence on results; the posterior is driven almost entirely by the new test data. |
| Informative         | 600 + 1   | (20000 − 600) + 1       | $\mathbf{\text{Beta}(601, 19401)}$ | Strongly centered around 3.0%; requires very compelling new data to shift the belief.     |

**Explanation:**
- Uninformative priors treat all conversion rates as equally likely, so the posterior is shaped almost entirely by the test data and is more sensitive to early noise.
- Informative priors encode historical knowledge (e.g., 600 successes out of 20,000 trials) and make the model conservative, requiring strong evidence for change.

**The Power of the Informative Prior**
- Faster Decisions: Because your prior distribution is already quite narrow and centered on a known good rate, you need less new data to confidently distinguish between a good variant and a bad variant. The test converges faster.
- Realistic Expectations for Control: The Control variant (A) is modeled as $\text{Beta}(\alpha_{\text{A\_prior}}, \beta_{\text{A\_prior}})$ rather than the flat $\text{Beta}(1, 1)$. This prevents temporary "noise" from making the control look better or worse than it realistically should be in the first few hours of a test.

In [4]:
# Python Implementation with Informative Priors
# In this example, we assume we have 6 months of stable data for our control page.
import numpy as np
import scipy.stats as stats

# --- 1. Define Historical Data for Informative Prior ---
# Historical performance of the Control page (Variant A) over a long period
N_HISTORICAL = 50000  # Total Historical Visitors
k_HISTORICAL = 1500   # Total Historical Conversions (3.0% CR)

# Calculate the Informative Prior parameters
alpha_A_prior_INF = k_HISTORICAL + 1
beta_A_prior_INF = (N_HISTORICAL - k_HISTORICAL) + 1

# Define the Uninformative Prior (for comparison)
alpha_A_prior_UNIF = 1
beta_A_prior_UNIF = 1

print(f"Informative Prior for A: Beta({alpha_A_prior_INF}, {beta_A_prior_INF})")
print(f"Uninformative Prior for A: Beta({alpha_A_prior_UNIF}, {beta_A_prior_UNIF})")
print("-" * 50)


# --- 2. Define New Test Data (Small Sample Size) ---
# We'll use a small sample to show the PRIORS' influence
N_TEST = 500
k_A_test = 8   # Observed CR = 8/500 = 1.6%
k_B_test = 14  # Observed CR = 14/500 = 2.8%


# --- 3. Calculate Posteriors (Comparing Priors) ---

# Posterior A with INFORMATIVE Prior
alpha_A_post_INF = alpha_A_prior_INF + k_A_test
beta_A_post_INF = beta_A_prior_INF + (N_TEST - k_A_test)

# Posterior A with UNINFORMATIVE Prior
alpha_A_post_UNIF = alpha_A_prior_UNIF + k_A_test
beta_A_post_UNIF = beta_A_prior_UNIF + (N_TEST - k_A_test)

# Posterior B (We'll keep the prior uninformative for the new variant B)
alpha_B_post = alpha_A_prior_UNIF + k_B_test
beta_B_post = beta_A_prior_UNIF + (N_TEST - k_B_test)

print("--- POSTERIORS (After 500 Test Visitors) ---")
print(f"A (Informative Prior): Beta({alpha_A_post_INF}, {beta_A_post_INF})")
print(f"A (Uninformative Prior): Beta({alpha_A_post_UNIF}, {beta_A_post_UNIF})")
print(f"B (Uninformative Prior): Beta({alpha_B_post}, {beta_B_post})")
print("-" * 50)


# --- 4. Monte Carlo Simulation for PoS (Informative Case) ---
NUM_SAMPLES = 100000

# Sample from the two relevant posterior distributions (A-INF vs B)
samples_A_inf = stats.beta.rvs(alpha_A_post_INF, beta_A_post_INF, size=NUM_SAMPLES)
samples_B = stats.beta.rvs(alpha_B_post, beta_B_post, size=NUM_SAMPLES)

# Calculate the Probability of Superiority
PoS_B_beats_A_INF = (samples_B > samples_A_inf).sum() / NUM_SAMPLES

print("--- RESULT WITH INFORMATIVE PRIOR ---")
print(f"Probability of Superiority (B > A): {PoS_B_beats_A_INF:.4f} ({PoS_B_beats_A_INF*100:.2f}%)")

Informative Prior for A: Beta(1501, 48501)
Uninformative Prior for A: Beta(1, 1)
--------------------------------------------------
--- POSTERIORS (After 500 Test Visitors) ---
A (Informative Prior): Beta(1509, 48993)
A (Uninformative Prior): Beta(9, 493)
B (Uninformative Prior): Beta(15, 487)
--------------------------------------------------
--- RESULT WITH INFORMATIVE PRIOR ---
Probability of Superiority (B > A): 0.4689 (46.89%)


**Bayesian A/B test interpretation** <br>
The analysis uses two alternative priors for Variant A (informative and uninformative) and an uninformative prior for B, each updated with 500 test visitors. The posteriors:
- A (informative): Beta(1509, 48993) → mean ≈ 1509 / 50502 ≈ 0.0299 (2.99%)
- A (uninformative): Beta(9, 493) → mean ≈ 9 / 502 ≈ 0.0179 (1.79%)
- B (uninformative): Beta(15, 487) → mean ≈ 15 / 502 ≈ 0.0299 (2.99%)
- Probability B > A (with informative prior for A): 0.4689
- Tiny or no difference: With the informative prior, A's posterior mean sits near the historical baseline (~3.0%), and B's posterior mean is essentially the same. A probability of superiority of 46.89% is close to 50%, which is inconclusive.
- Prior matters a lot at N=500: The informative prior (Beta(1501, 48501)) dominates A's posterior, pulling it toward ~3.0%. The uninformative prior leaves A more sensitive to the observed data; with 8 conversions (posterior Beta(9, 493)), its mean drops to ~1.8%.
- Decision-wise: You don't have enough evidence to claim B is better than A. Under most Bayesian decision rules, you'd keep testing.

**Interpretation of the Code's Logic**
1. Informative Prior: The Informative Prior for A is dominated by the 50,000 historical data points, centering the belief around the 3.0% long-term rate.
2. Test Data: The small test sample shows A performing poorly (1.6%) and B performing well (2.8%).
3. Posterior Comparison (Crucial Step):
    - A (Uninformative Prior): The flat prior quickly yields to the observed data, resulting in a posterior centered near 1.6–1.8% (8 successes out of 500 easily outweigh a Beta(1,1) prior). This makes B look like the likely winner early on.
    - A (Informative Prior): The many historical data points "pull" the posterior towards the long-term 3.0% rate. The small, noisy 1.6% result is dampened by the historical evidence, resulting in a posterior centered near 2.99% (a weighted average of 3.0% and 1.6%, dominated by the prior).
4. Result: Using the informative prior gives a more conservative and reliable result. It essentially tells the system, "I know the control is usually better than 1.6%, so I need more data to believe this current dip is real before I declare B a winner." This reduces the chance of acting on noisy, short-term data.
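
The "weighted average" in step 3 can be made explicit. The sketch below uses the informative prior and the test data for Variant A from the cell above and shows that the posterior mean equals a blend of the prior mean and the observed rate, with the prior's weight set by its pseudo-count $\alpha + \beta$.

```python
# Prior and test data from the cell above
alpha_prior, beta_prior = 1501, 48501
n_test, k_test = 500, 8

prior_mean = alpha_prior / (alpha_prior + beta_prior)
observed_rate = k_test / n_test

# Weight given to the prior grows with its pseudo-count (alpha + beta)
prior_weight = (alpha_prior + beta_prior) / (alpha_prior + beta_prior + n_test)

weighted_mean = prior_weight * prior_mean + (1 - prior_weight) * observed_rate
posterior_mean = (alpha_prior + k_test) / (alpha_prior + beta_prior + n_test)

print(f"Prior weight:   {prior_weight:.4f}")
print(f"Weighted mean:  {weighted_mean:.4f}")
print(f"Posterior mean: {posterior_mean:.4f}")
```

With about 50,000 pseudo-observations against 500 test visitors, the prior carries roughly 99% of the weight, which is why the posterior barely moves from 3.0%.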


##### Practical recommendations
- Define a decision threshold:
    - Superiority: Require P(B > A) ≥ 0.95 (or your business's threshold).
    - Lift-focused: Require P(p_B − p_A > L) ≥ 0.95 for a meaningful lift L (e.g., 0.3 percentage points).
- Use a ROPE:
    - Define a "region of practical equivalence," e.g., |p_B − p_A| < 0.002, and check P(difference ∈ ROPE). If high, treat variants as practically the same.
- Increase sample size:
    - With conversion ~3%, signals are weak at 500 visitors. Plan for several tens of thousands per arm to detect sub-percentage-point lifts with high confidence.
- Align priors to reality:
    - If A's historical rate is ~3.07%, your informative prior is consistent. Keep priors transparent and justify their weight (effective sample size).

- Report-ready summary: "With an informative prior for A and 500 visitors, the posterior probability that B outperforms A is 46.94%, which does not meet our decision threshold; we will continue the test until either P(B > A) ≥ 0.95 or the ROPE probability exceeds 0.8."
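
To put a rough number on "several tens of thousands per arm", the sketch below uses the standard frequentist two-proportion sample-size approximation as a planning guide. It is not a Bayesian calculation, and the significance level (0.05, two-sided) and power (0.80) are assumptions chosen for illustration; `sample_size_per_arm` is a hypothetical helper.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_base, lift, alpha=0.05, power=0.80):
    """Approximate visitors per arm for a two-sided two-proportion z-test.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / lift^2.
    """
    p_new = p_base + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / lift ** 2)


# Baseline of 3% with absolute lifts of 0.3, 0.5 and 1.0 percentage points
for lift in (0.003, 0.005, 0.010):
    n = sample_size_per_arm(0.03, lift)
    print(f"lift = {lift * 100:.1f} pp -> about {n:,} visitors per arm")
```

Under these assumptions, a 0.3 percentage point lift on a 3% baseline needs roughly 53,000 visitors per arm, about two orders of magnitude more than the 500 used here.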

In [5]:
import numpy as np

# --- Define posterior parameters ---
# Example: A (informative prior) Beta(1509, 48993), B (uninformative prior) Beta(15, 487)
alpha_A, beta_A = 1509, 48993
alpha_B, beta_B = 15, 487

# --- Sampling ---
n_samples = 100000
samples_A = np.random.beta(alpha_A, beta_A, n_samples)
samples_B = np.random.beta(alpha_B, beta_B, n_samples)

# --- Probability of Superiority ---
p_superiority = np.mean(samples_B > samples_A)

# --- ROPE (Region of Practical Equivalence) ---
# Define a practical equivalence threshold, e.g. ±0.002 (0.2 percentage points)
rope_threshold = 0.002
diff = samples_B - samples_A
p_rope = np.mean(np.abs(diff) < rope_threshold)

print(f"Probability of Superiority (B > A): {p_superiority:.4f}")
print(f"Probability difference in ROPE (|B - A| < {rope_threshold}): {p_rope:.4f}")


Probability of Superiority (B > A): 0.4690
Probability difference in ROPE (|B - A| < 0.002): 0.2041


**Bayesian A/B test readout**
- Probability of superiority: 0.4690
    - Below 0.5 and far from common decision thresholds (e.g., 0.9–0.95). No evidence B is better. The exact value shifts slightly between runs (e.g., 46.94% earlier) because the Monte Carlo sampling is unseeded.
- ROPE probability (|B − A| < 0.002): 0.2041
    - Low. There's only a ~20% chance that the difference is practically negligible, i.e., within ±0.2 percentage points.
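
The readout can be turned into an explicit decision rule. The following is a minimal sketch, assuming the thresholds quoted in the report-ready summary (0.95 for a win, 0.8 for the ROPE) and the same posteriors as the cell above; the function `decide` and the seed are illustrative choices, not part of the original analysis.

```python
import numpy as np


def decide(diff, min_lift=0.0, rope=0.002, win_threshold=0.95, rope_threshold=0.80):
    """Map posterior samples of (p_B - p_A) to a decision label."""
    p_win = float(np.mean(diff > min_lift))       # P(p_B - p_A > min_lift)
    p_rope = float(np.mean(np.abs(diff) < rope))  # P(|p_B - p_A| < rope)
    if p_win >= win_threshold:
        decision = "ship B"
    elif p_rope >= rope_threshold:
        decision = "treat A and B as practically equivalent"
    else:
        decision = "keep testing"
    return decision, p_win, p_rope


rng = np.random.default_rng(42)  # arbitrary seed, used only for reproducibility
n_samples = 100_000
# Posteriors from the cell above: B ~ Beta(15, 487), A ~ Beta(1509, 48993)
diff = rng.beta(15, 487, n_samples) - rng.beta(1509, 48993, n_samples)

# min_lift = 0.0 is the plain superiority rule; 0.003 is the lift-focused rule
for min_lift in (0.0, 0.003):
    decision, p_win, p_rope = decide(diff, min_lift=min_lift)
    print(f"min_lift={min_lift}: P(win)={p_win:.3f}, P(ROPE)={p_rope:.3f} -> {decision}")
```

With these posteriors neither threshold is met under either rule, so the sketch returns "keep testing", matching the conclusion above.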

**ROPE stands for Region of Practical Equivalence.**
- It's a concept used in Bayesian hypothesis testing to decide whether two parameters (like conversion rates in an A/B test) are practically the same, even if they're not mathematically identical.
- The ROPE is a small interval around zero (or around the null value) that represents differences too small to matter in practice.
- If the posterior distribution of the difference between groups falls mostly inside this interval, you conclude the groups are practically equivalent.

**Example in A/B testing**
- Suppose your baseline conversion rate is ~3%.
- You define a ROPE of ±0.002 (±0.2 percentage points).
- If the posterior probability that $p_B - p_A$ lies within [−0.002, +0.002] is high (say $\ge$ 0.8), you conclude A and B are practically the same.

**Why it matters**
- Avoids false positives: You don't declare a winner based on tiny, meaningless differences.
- Business relevance: It ties statistical decisions to practical impact.
- Flexibility: You choose the ROPE width based on what counts as "negligible" in your context (e.g., ±0.1 pp vs ±0.5 pp).
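
To see how much the conclusion depends on the chosen width, the sketch below recomputes the ROPE probability for three half-widths using the same posteriors as before. The widths and the seed are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
n_samples = 100_000
# Posterior samples of the difference p_B - p_A (B ~ Beta(15, 487), A ~ Beta(1509, 48993))
diff = rng.beta(15, 487, n_samples) - rng.beta(1509, 48993, n_samples)

rope_probs = {}
for half_width in (0.001, 0.002, 0.005):  # ±0.1 pp, ±0.2 pp, ±0.5 pp
    rope_probs[half_width] = float(np.mean(np.abs(diff) < half_width))
    print(f"ROPE ±{half_width * 100:.1f} pp: "
          f"P(|p_B - p_A| < {half_width}) = {rope_probs[half_width]:.3f}")
```

A wider ROPE always captures more posterior mass, so the width should be fixed from business considerations before looking at the data rather than tuned until the equivalence threshold is met.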