1. **Experimental Setup and Design**<br>
The goal is to determine whether the difference in AOV between the two groups is statistically significant, that is, whether a difference this large would be unlikely to arise from random chance alone if the two strategies truly produced the same AOV.

**A. Define the Variables**
- Metric (Dependent Variable): Average Order Value (AOV), calculated as Total Revenue / Total Number of Orders for each group.
- Treatments (Independent Variable):
    - Group A: Receives the new marketing Strategy A.
    - Group B: Receives the new marketing Strategy B.
    - Group C (optional but recommended control): Receives the existing/original marketing strategy to serve as a baseline.
- Null Hypothesis ($H_0$): There is no difference in true mean AOV between Strategy A and Strategy B ($\text{AOV}_A = \text{AOV}_B$).
- Alternative Hypothesis ($H_a$): The true mean AOV differs between Strategy A and Strategy B ($\text{AOV}_A \neq \text{AOV}_B$).
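
As a minimal sketch of the metric definition above, the snippet below computes AOV per group from order-level data. The `orders` table and its column names (`group`, `order_value`) are hypothetical stand-ins for whatever your order export looks like.

```python
import pandas as pd

# Hypothetical order-level data: one row per order, tagged with its test group.
orders = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B"],
    "order_value": [120.0, 80.0, 100.0, 90.0, 110.0],
})

# AOV = total revenue / total number of orders, computed separately per group.
summary = orders.groupby("group")["order_value"].agg(
    total_revenue="sum",
    n_orders="count",
)
summary["aov"] = summary["total_revenue"] / summary["n_orders"]

print(summary)
```

Note that the unit of analysis here is the order, not the user; if users can place several orders, their orders are not fully independent, which is worth keeping in mind when interpreting the test.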

**B. Sample Size Determination** <br>

Before running the test, you must calculate the required sample size ($N$) to detect a meaningful change.
1. Define the Minimum Detectable Effect (MDE): What is the smallest percentage lift in AOV you want to be able to detect reliably (e.g., 5%)?
2. Set Significance Level ($\alpha$): Typically set at 0.05 (5%), meaning you are willing to accept a 5% chance of a False Positive (concluding a difference exists when it doesn't).
3. Set Statistical Power ($1-\beta$): Typically set at 0.80 (80%), meaning you want an 80% chance of detecting the MDE if it actually exists.

The required duration of the experiment is determined by the time needed to collect the required number of orders ($N$) for each group.
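
The three inputs above can be turned into a rough per-group sample size with the standard normal-approximation formula for comparing two means, $n = 2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2$. The sketch below assumes equal group sizes and a common standard deviation; the function name `orders_per_group` and the example inputs (a baseline AOV of \$100 and a standard deviation of \$50) are illustrative assumptions, and in practice $\sigma$ would be estimated from historical order data.

```python
import math
from scipy.stats import norm

def orders_per_group(sigma, mde_abs, alpha=0.05, power=0.80):
    """Approximate orders needed per group for a two-sided test on means.

    sigma   : assumed standard deviation of order values (same in both groups)
    mde_abs : minimum detectable effect in currency units (not percent)
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = norm.ppf(power)           # quantile matching the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde_abs ** 2
    return math.ceil(n)

# Assumed inputs: baseline AOV of $100, so a 5% MDE is $5; sigma of $50.
baseline_aov = 100.0
n_required = orders_per_group(sigma=50.0, mde_abs=0.05 * baseline_aov)
print(n_required)
```

With these assumed inputs the formula asks for roughly 1,570 orders per group. Because $n$ scales with $\sigma^2 / \delta^2$, halving the MDE quadruples the required sample, which is why skewed, high-variance AOV data often needs long test durations.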

2. **Data Collection and Statistical Test**
<br>**A. Run the Experiment**
- Random Assignment: Randomly assign users to see either Strategy A or Strategy B. This is crucial to ensure the groups are comparable.
- Duration: Run the test until the required sample size ($N$) is reached. Do not peek at the results early, as this can inflate the Type I error rate.
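
One common way to implement random assignment is to hash a stable user identifier, so that each user always sees the same strategy without any stored state. This is a sketch under assumptions: the function `assign_group` and the experiment label `"aov_test_v1"` are hypothetical names, and a 50/50 split is assumed.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "aov_test_v1") -> str:
    """Deterministically map a user to 'A' or 'B' with roughly equal probability."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always lands in the same group.
print(assign_group("user-123"), assign_group("user-123"))

# Across many users the split should be close to 50/50.
groups = [assign_group(f"user-{i}") for i in range(1000)]
print(groups.count("A"), groups.count("B"))
```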

**B. Select the Statistical Test (The T-Test)** <br>

Since AOV is a continuous variable, an appropriate statistical test is the two-sample t-test for independent samples. This section uses Welch's version of the test, which does not assume that the two groups have equal variances. <br>

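Note on AOV Distributions: AOV data is often heavily right-skewed (a few large orders inflate the average). If the distributions are highly non-normal, you can either collect a large enough sample to rely on the Central Limit Theorem or use a non-parametric test such as the Mann-Whitney U test. Keep in mind that Mann-Whitney compares the distributions as a whole (whether one group tends to produce larger orders), not the means, so it answers a slightly different question than a test on AOV. <br>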
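
As a rough illustration of the note above, the sketch below checks skewness and runs a Mann-Whitney U test with `scipy.stats`. The data are simulated log-normal order values with assumed parameters, not real orders.

```python
import numpy as np
from scipy import stats

# Simulated, right-skewed order values (assumed parameters for illustration only).
rng = np.random.default_rng(0)
orders_a = rng.lognormal(mean=np.log(105), sigma=0.4, size=1500)
orders_b = rng.lognormal(mean=np.log(100), sigma=0.4, size=1550)

# Sample skewness well above 0 indicates a long right tail.
skew_a = stats.skew(orders_a)
skew_b = stats.skew(orders_b)
print(f"Skewness: A={skew_a:.2f}, B={skew_b:.2f}")

# Mann-Whitney U: a rank-based test that makes no normality assumption.
u_stat, p_mw = stats.mannwhitneyu(orders_a, orders_b, alternative="two-sided")
print(f"Mann-Whitney U={u_stat:.0f}, p-value={p_mw:.4g}")
```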

**C. Calculate Key Metrics** <br>

| **Metric** | **Group A (Strategy A)** | **Group B (Strategy B)** | **Formula / Description** |
|:------------|:----------------------:|:----------------------:|:---------------------------|
| **Total Orders (n)** | $n_A$ | $n_B$ | Number of orders (sample size per group) |
| **Total Revenue (R)** | $R_A$ | $R_B$ | Sum of order values per group |
| **Average Order Value (AOV, $\bar{x}$)** | $\bar{x}_A = \frac{R_A}{n_A}$ | $\bar{x}_B = \frac{R_B}{n_B}$ | Mean order value per group |
| **Standard Deviation (s)** | $s_A$ | $s_B$ | Variability of order values within each group |


3. **Analysis and Interpretation**

**A. Calculate the T-Statistic (Example Formula)**

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$

**B. Determine the P-Value**<br>
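Using the calculated $t$ statistic and the degrees of freedom, statistical software will output a **$p$-value**. For the unequal-variance (Welch) form of the statistic shown above, the degrees of freedom are estimated from $n_A$, $n_B$, $s_A$, and $s_B$ using the Welch-Satterthwaite approximation.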
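
To make the two steps above concrete, here is a sketch that computes the $t$ statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided $p$-value by hand from summary statistics, then cross-checks them against `scipy.stats.ttest_ind_from_stats`. The summary numbers are made up for illustration.

```python
import math
from scipy import stats

# Illustrative summary statistics (assumed values, not measured data).
mean_a, sd_a, n_a = 116.0, 49.0, 1500
mean_b, sd_b, n_b = 109.0, 46.0, 1550

# Squared standard error contributed by each group: s^2 / n.
var_a = sd_a ** 2 / n_a
var_b = sd_b ** 2 / n_b

# t statistic, matching the formula in section A.
t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))

# Two-sided p-value from the t distribution.
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against SciPy's implementation of Welch's test.
check = stats.ttest_ind_from_stats(
    mean_a, sd_a, n_a, mean_b, sd_b, n_b, equal_var=False
)
print(f"manual: t={t_stat:.4f}, df={df:.1f}, p={p_value:.3g}")
print(f"scipy : t={check.statistic:.4f}, p={check.pvalue:.3g}")
```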

**C. Decision Rule**<br>
Compare the $p$-value to your significance level ($\alpha = 0.05$):
- **If $p$-value $\leq 0.05$: Reject $H_0$**. The difference in AOV between Strategy A and Strategy B is statistically significant. The strategy with the higher observed AOV is the winner.
- **If $p$-value $> 0.05$: Fail to Reject $H_0$**. There is not enough evidence to conclude that the strategies differ in AOV. This does not show that they perform the same; the test may simply lack the power to detect a small difference. You might still pick one based on cost or other secondary metrics.
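
A confidence interval for the difference in AOV can complement the decision rule, since it shows how large or small the true difference could plausibly be. The sketch below builds a Welch-style interval by hand; the helper name `welch_ci` and the simulated data are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    var_a = a.var(ddof=1) / a.size  # squared standard error of group A's mean
    var_b = b.var(ddof=1) / b.size  # squared standard error of group B's mean
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1)
    )
    margin = stats.t.ppf(0.5 + confidence / 2, df) * se
    diff = a.mean() - b.mean()
    return diff - margin, diff + margin

# Simulated order values (assumed parameters, for illustration only).
rng = np.random.default_rng(1)
sample_a = rng.lognormal(mean=np.log(105), sigma=0.4, size=1500)
sample_b = rng.lognormal(mean=np.log(100), sigma=0.4, size=1550)

ci_low, ci_high = welch_ci(sample_a, sample_b)
print(f"95% CI for AOV_A - AOV_B: [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that excludes 0 corresponds to rejecting $H_0$ at the matching significance level; an interval that includes 0 but is wide is a sign that the test was underpowered rather than evidence that the strategies are equivalent.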

In [3]:
import numpy as np
import pandas as pd
from scipy import stats

# 1. Simulate the Data
# In a real-world scenario, you would load this data from your database (e.g., SQL/CSV)
# AOV data is often right-skewed, so we simulate it with a log-normal distribution

# --- Simulation Parameters ---
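# Group A is simulated with slightly higher order values than Group B.
# NOTE: TRUE_MEAN_* is passed to lognormal() as exp(mean of the log-values), so it
# sets the MEDIAN order value; the resulting mean is higher (median * exp(sigma**2 / 2)).
ORDERS_A = 1500
ORDERS_B = 1550
TRUE_MEAN_A = 105.00  # Strategy A: median order value of $105
TRUE_MEAN_B = 100.00  # Strategy B: median order value of $100
STD_DEV = 75.00       # Not used below; the spread is controlled by sigma=np.log(1.5)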

# Generate data: We use a Log-Normal distribution to simulate positive, skewed order values.
# Log-normal samples are strictly positive by construction, so no shift is needed.
np.random.seed(42) # for reproducibility

# Group A (Higher AOV)
data_a = np.random.lognormal(mean=np.log(TRUE_MEAN_A), sigma=np.log(1.5), size=ORDERS_A)
# Group B (Lower AOV)
data_b = np.random.lognormal(mean=np.log(TRUE_MEAN_B), sigma=np.log(1.5), size=ORDERS_B)

# --- 2. Calculate Key Metrics ---
aov_a = np.mean(data_a)
std_a = np.std(data_a, ddof=1) # ddof=1 for sample standard deviation
n_a = len(data_a)

aov_b = np.mean(data_b)
std_b = np.std(data_b, ddof=1)
n_b = len(data_b)

print("--- Data Summary ---")
print(f"Group A: Orders={n_a}, Mean AOV=${aov_a:.2f}, Std Dev=${std_a:.2f}")
print(f"Group B: Orders={n_b}, Mean AOV=${aov_b:.2f}, Std Dev=${std_b:.2f}")
print(f"Observed Lift (A vs B): {((aov_a - aov_b) / aov_b) * 100:.2f}%")
print("-" * 20)

# --- 3. Perform the Statistical Test (Two-Sample T-Test) ---

# We use the t-test from scipy.stats.
# equal_var=False performs Welch's t-test, which is generally safer 
# when the population variances (standard deviations) are unequal, 
# common in A/B testing.
t_stat, p_value = stats.ttest_ind(data_a, data_b, equal_var=False)

# --- 4. Interpretation ---
alpha = 0.05 # Significance level

print("--- T-Test Results ---")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
print("-" * 20)

print("--- Decision ---")
if p_value <= alpha:
    # The group with the higher observed mean is reported as the winner.
    winner = "A" if aov_a > aov_b else "B"
    print("Result: Reject Null Hypothesis (H_0).")
    print("Conclusion: The difference is statistically significant.")
    print(f"Recommendation: Implement Strategy {winner} as it generates a higher AOV.")
else:
    # Failing to reject H_0 is not proof that the strategies are equal.
    print("Result: Fail to Reject Null Hypothesis (H_0).")
    print("Conclusion: The difference in AOV is not statistically significant at the 5% level.")
    print("Recommendation: No AOV difference was detected; choose based on cost or other metrics.")

--- Data Summary ---
Group A: Orders=1500, Mean AOV=$116.17, Std Dev=$49.22
Group B: Orders=1550, Mean AOV=$108.66, Std Dev=$45.82
Observed Lift (A vs B): 6.92%
--------------------
--- T-Test Results ---
T-Statistic: 4.3624
P-Value: 0.0000
--------------------
--- Decision ---
Result: Reject Null Hypothesis (H_0).
Conclusion: The difference is statistically significant.
Recommendation: Implement Strategy A as it generates a higher AOV.


| **Element** | **Description** |
|:-------------|:----------------|
| **`stats.ttest_ind(data_a, data_b, equal_var=False)`** | Core function performing the two-sample *t*-test. It computes the *t*-statistic and two-tailed *p*-value to test for a difference in group means. |
| **`equal_var=False`** | Runs **Welch’s t-test**, which does **not assume equal variances** between groups. This makes it more robust for real-world A/B test data where variability often differs between A and B. |
| **P-Value** | The probability of observing a difference at least as extreme as the one measured, **assuming the Null Hypothesis ($H_0$) is true** — that is, there’s no actual difference between A and B. |
| **Decision Rule ($p \le 0.05$)** | If the *p*-value ≤ 0.05 (the significance level, $\alpha$), you **reject the Null Hypothesis** and conclude the difference in Average Order Value (AOV) is statistically significant. |
