# Hypothesis Testing - part 2

As stated in Part 1, this notebook defines and illustrates A/B testing, which is central to day-to-day business decisions, and then turns to the sensitivity of hypothesis tests. We will cover the Minimum Detectable Effect, CUPED, CUPAC, and other metric analysis techniques useful in A/B testing and data-driven decision making.

## Index:

1. [A/B Testing](#1-ab-testing)
2. [Minimum Detectable Effect](#2-minimum-detectable-effect-mde)
3. [Improving Testing Sensitivity](#3-improving-testing-sensitivity)
   1. [CUPED & CUPAC](#31-cuped-controlled-experiment-using-pre-experiment-data-and-cupac-control-using-predictions-as-covariates)
   2. [Delta Method and Ratio Metrics](#32-metric-analytics-delta-method-and-ratio-metrics)
4. [Extra Resources](#4-extra-resources)


**Libraries used:**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

## 1. A/B Testing

A/B testing compares two versions of a product, website, ad, or feature to determine which one performs better. Statistically it is a two-sample test, and because sample sizes are usually large it is typically run as a Z-test. It is widely used in marketing, UX design, and product optimization.

We compare Group A (the control group, which sees the baseline) with Group B (the treatment group, which sees the modified version). Users are split between the two groups at random to avoid bias; data is then collected and the test is run. Typical metrics include:

- Conversion Rate (CR): Percentage of users completing a desired action.
- Click-Through Rate (CTR): Percentage of users who clicked on a link.
- Bounce Rate: Percentage of users who leave without engaging.
- Revenue per User: How much revenue is generated per visitor.

The hypotheses are the same as in any two-sample test: either there is no significant difference between the two versions, or there is one. Because the samples are large and the metrics being compared are proportions, the Central Limit Theorem applies and the test statistic is:

$$
Z = \frac{p_B - p_A}{\sqrt{SE_A^2 + SE_B^2}}
$$

where $p_A$ and $p_B$ are the observed rates (**proportions**) in each group, $SE_A$ and $SE_B$ are the standard errors of those proportions, with $SE = \sqrt{p(1-p)/n}$ for each group, and the denominator is the standard error of the difference in proportions.

Here is an example case:

A company wants to test two call-to-action (CTA) buttons:

- A (Control): "Sign Up Now"
- B (Treatment): "Get Started"

The test runs for a period of time, with the following results:

| Group | Visitors | Conversions | Conversion Rate |
|-------|----------|------------|----------------|
| A     | 10,000  | 500        | 5%             |
| B     | 10,000  | 600        | 6%             |

This results in the following **Hypothesis Test**, with a significance level of 5%:

- Null Hypothesis (H₀): No difference between A and B.
- Alternative Hypothesis (H₁): A significant difference exists.

In [2]:
# Data from the scenario
n_A = 10000  # Number of visitors in group A
n_B = 10000  # Number of visitors in group B
conv_A = 500  # Conversions in group A
conv_B = 600  # Conversions in group B

# Compute conversion rates
p_A = conv_A / n_A
p_B = conv_B / n_B

# Compute standard error for each group
se_A = np.sqrt((p_A * (1 - p_A)) / n_A)
se_B = np.sqrt((p_B * (1 - p_B)) / n_B)

# Compute standard error of the difference
se_diff = np.sqrt(se_A**2 + se_B**2)

# Compute Z-score
z_score = (p_B - p_A) / se_diff

# Compute p-value (two-tailed test)
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

# Print results
print(f"Conversion Rate A: {p_A:.4f}")
print(f"Conversion Rate B: {p_B:.4f}")
print(f"Z-score: {z_score:.2f}")
print(f"p-value: {p_value:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print("Conclusion: Reject H₀. There is a significant difference between the two variations.")
else:
    print("Conclusion: Fail to reject H₀. No significant difference between the variations.")

Conversion Rate A: 0.0500
Conversion Rate B: 0.0600
Z-score: 3.10
p-value: 0.0019
Conclusion: Reject H₀. There is a significant difference between the two variations.
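
The statistic above uses each group's own standard error (unpooled). A common variant pools the two samples when estimating the standard error, since under H₀ both groups share a single conversion rate. The sketch below uses only the Python standard library to compute the pooled statistic for the same table and adds a 95% confidence interval for the difference. The function name `two_proportion_ztest` is illustrative, not a library API.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Pooled two-proportion z-test plus an unpooled CI for p_B - p_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Under H0 both groups share one rate, so pool the conversions.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval does not assume H0, so use the unpooled SE.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


z, p_value, ci = two_proportion_ztest(500, 10_000, 600, 10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"95% CI for p_B - p_A: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

With equal group sizes and similar rates, as here, the pooled and unpooled statistics nearly coincide; they differ more when the groups are unbalanced.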


## 2. Minimum Detectable Effect (MDE)

The Minimum Detectable Effect (MDE) is the smallest true effect size that a hypothesis test can detect with a given statistical power and significance level. Choosing the MDE matters because it drives the required sample size. A smaller MDE, meant for detecting small changes, needs a larger sample, which means more traffic, more time, and higher cost. A larger MDE needs a smaller sample but can miss smaller improvements that still matter. For two groups of equal size, the general formula is:

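$$
MDE = (Z_{1-\alpha/2} + Z_{power}) \times \sigma \times \sqrt{\frac{2}{n}}
$$

where:
- $Z_{1-\alpha/2}$ is the two-sided critical value for the chosen significance level $\alpha$.
- $Z_{power}$ is the z-score for the desired power.
- $\sigma$ is the standard deviation of the metric; for a proportion $p$, $\sigma = \sqrt{p(1-p)}$, so $\sigma\sqrt{2/n}$ is the standard error of the difference between the two groups.
- $n$ is the sample size per group.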

Let's see a scenario where a company is running a test for a new landing page:

In [3]:
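n = 5000  # Sample size per group
alpha = 0.05  # Significance level
power = 0.8  # Statistical power
z_alpha = stats.norm.ppf(1 - alpha/2)  # Two-sided critical value
z_power = stats.norm.ppf(power)  # z-score for the desired power
# Standard error of one group's rate, using p = 0.5 (the largest possible variance for a proportion)
se = np.sqrt((0.5 * (1 - 0.5)) / n)
# sqrt(2) turns the per-group SE into the SE of the difference between two equal-size groups
mde = (z_alpha + z_power) * se * np.sqrt(2)
print(f"Minimum Detectable Effect for the new landing page test: {mde:.4f}")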

Minimum Detectable Effect for the new landing page test: 0.0280


Under these conditions, the A/B test can reliably flag a change only if the new page's rate differs from the existing one by at least 2.8 percentage points.
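
In practice the formula is often inverted: fix the smallest effect worth detecting and solve for the per-group sample size, $n = 2\sigma^2 (Z_{1-\alpha/2} + Z_{power})^2 / MDE^2$, where $\sigma^2 = p(1-p)$ for a proportion. A minimal sketch under that assumption follows; the helper names, the 5% baseline, and the 1-percentage-point target are illustrative choices, not values from the scenario above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def mde_for_sample_size(n, baseline, alpha=0.05, power=0.8):
    """Smallest absolute difference detectable with n users per group."""
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    sigma = sqrt(baseline * (1 - baseline))  # std. dev. of a 0/1 outcome
    return z_total * sigma * sqrt(2 / n)


def sample_size_for_mde(mde, baseline, alpha=0.05, power=0.8):
    """Per-group sample size needed to detect an absolute difference `mde`."""
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline)
    return ceil(2 * variance * z_total**2 / mde**2)


# Reproduces the cell above (n = 5000, worst-case baseline of 0.5).
print(f"MDE at n=5000: {mde_for_sample_size(5000, 0.5):.4f}")

# Hypothetical target: detect a 1-percentage-point change on a 5% baseline.
n_needed = sample_size_for_mde(0.01, baseline=0.05)
print(f"Users needed per group: {n_needed}")
```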

## 3. Improving Testing Sensitivity

As we can see, choosing an MDE is a human decision, and real changes smaller than the MDE threshold are likely to go undetected. To reduce this risk, we can improve the sensitivity of the test with tactics such as:

- Increasing sample size.
- Reducing measurement noise.
- Using stratified sampling techniques.
- Controlling for confounding variables (e.g., using CUPED).

Here is a quick example, where we increase the sample size, based on the previous MDE calculation example:

In [7]:
original_sample_size = 5000
new_sample_size = 10000  # Doubling the sample size
new_mde = (z_alpha + z_power) * np.sqrt((0.5 * (1 - 0.5)) / new_sample_size) * np.sqrt(2)
print(f"MDE after doubling sample size: {new_mde:.4f}. Now the threshold is smaller ({new_mde*100:.4f}%), and so the test will be more sensitive to flag significant changes.")

MDE after doubling sample size: 0.0198. Now the threshold is smaller (1.9810%), and so the test will be more sensitive to flag significant changes.
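
Because the MDE is proportional to $1/\sqrt{n}$, doubling the sample shrinks it only by a factor of $\sqrt{2}$, and halving the MDE takes four times the traffic. The short sketch below tabulates this for the same settings (5% significance, 80% power, baseline of 0.5). The helper `mde` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def mde(n, baseline=0.5, alpha=0.05, power=0.8):
    """Absolute MDE for a proportion metric with n users per group."""
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_total * sqrt(baseline * (1 - baseline)) * sqrt(2 / n)


sizes = [5_000, 10_000, 20_000, 40_000]
for n in sizes:
    ratio = mde(n) / mde(sizes[0])
    print(f"n = {n:>6}: MDE = {mde(n):.4f} ({ratio:.2f}x the n=5000 value)")
```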


### 3.1 CUPED (Controlled-experiment Using Pre-Experiment Data) and CUPAC (Control Using Predictions As Covariates)

CUPED is a variance reduction technique that leverages pre-experiment data to improve test efficiency:

$$
Y_{adj} = Y - \theta (X - \bar{X})
$$

where $Y$ is the experiment metric, $X$ is a covariate measured before the experiment (often the same metric), and $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$. Here is a quick example:

In [8]:
X = np.random.normal(100, 10, 5000)  # Pre-experiment data
Y = X + np.random.normal(5, 5, 5000)  # Post-experiment metric

theta = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # ddof=1 matches np.cov's sample estimate
Y_adj = Y - theta * (X - np.mean(X))

print(f"Variance before CUPED: {np.var(Y):.2f}")
print(f"Variance after CUPED: {np.var(Y_adj):.2f}")

Variance before CUPED: 125.84
Variance after CUPED: 24.64
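
In an actual A/B test the adjustment is applied to both groups with a single $\theta$, and the gain shows up as a smaller standard error on the difference in means. The sketch below simulates that setting. The data, the true lift of 0.5, and the helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000  # users per group (simulated)

# Pre-experiment metric X and in-experiment metric Y for both groups.
x_a = rng.normal(100, 10, n)
x_b = rng.normal(100, 10, n)
y_a = x_a + rng.normal(5, 5, n)        # control
y_b = x_b + rng.normal(5, 5, n) + 0.5  # treatment: assumed true lift of 0.5

# Estimate theta once on the pooled data so both groups get the same adjustment.
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])
theta = np.cov(x_all, y_all)[0, 1] / np.var(x_all, ddof=1)


def cuped(y, x):
    """CUPED-adjusted metric, centred on the pooled covariate mean."""
    return y - theta * (x - x_all.mean())


def diff_and_se(a, b):
    """Difference in means (b - a) and its standard error."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return b.mean() - a.mean(), se


raw_diff, raw_se = diff_and_se(y_a, y_b)
adj_diff, adj_se = diff_and_se(cuped(y_a, x_a), cuped(y_b, x_b))

print(f"Raw:   diff = {raw_diff:.3f}, SE = {raw_se:.3f}, z = {raw_diff / raw_se:.2f}")
print(f"CUPED: diff = {adj_diff:.3f}, SE = {adj_se:.3f}, z = {adj_diff / adj_se:.2f}")
```

The variance shrinks by roughly a factor of $1 - \rho^2$, where $\rho$ is the correlation between $X$ and $Y$. In this simulation $\rho^2$ is about 0.8, so the standard error drops by a little more than half while the estimated difference stays centred on the same value.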


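CUPAC (Control Using Predictions As Covariates, the name used in the DoorDash article listed under Extra Resources) extends CUPED: instead of a single pre-experiment metric, the control variate is a prediction of the outcome built from features known before assignment. Like CUPED, it only reduces variance when the covariate is correlated with the outcome and its coefficient is estimated for that covariate. The cell below shows what happens otherwise: it reuses `theta` from the CUPED step on a randomly generated covariate that is independent of `Y`, so the adjustment adds noise and the variance goes up instead of down.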

In [9]:
covariate = np.random.normal(50, 10, 5000)
Y_adj_cupac = Y_adj - theta * (covariate - np.mean(covariate))
print(f"Variance after CUPAC: {np.var(Y_adj_cupac):.2f}")

Variance after CUPAC: 118.21
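
In CUPAC as described in the DoorDash article listed under Extra Resources, the covariate is a model's prediction of the outcome built from pre-assignment features, and its coefficient is estimated for that prediction rather than reused from another covariate. A minimal sketch on simulated data follows, assuming ordinary least squares as a stand-in for the predictive model. The feature weights, sample sizes, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_hist, n_exp = 5000, 5000
weights = np.array([2.0, -1.0, 0.5])  # hypothetical true feature effects


def simulate(n):
    """Features known before assignment and the outcome they partly explain."""
    features = rng.normal(0, 1, (n, 3))
    outcome = 50 + features @ weights + rng.normal(0, 1, n)
    return features, outcome


def add_intercept(features):
    """Prepend a column of ones so the linear model has an intercept."""
    return np.column_stack([np.ones(len(features)), features])


# 1) Fit the predictor on historical (pre-experiment) data only.
feat_hist, y_hist = simulate(n_hist)
coef, *_ = np.linalg.lstsq(add_intercept(feat_hist), y_hist, rcond=None)

# 2) Predict the outcome for experiment users from pre-assignment features.
feat_exp, y_exp = simulate(n_exp)
prediction = add_intercept(feat_exp) @ coef

# 3) Use the prediction as the CUPED covariate, with its own coefficient.
theta_pred = np.cov(prediction, y_exp)[0, 1] / np.var(prediction, ddof=1)
y_cupac = y_exp - theta_pred * (prediction - prediction.mean())

print(f"Variance before CUPAC: {y_exp.var(ddof=1):.2f}")
print(f"Variance after CUPAC:  {y_cupac.var(ddof=1):.2f}")
```

Fitting the predictor on data collected before the experiment keeps the covariate independent of the treatment assignment, which is what allows the adjustment to leave the treatment effect estimate unbiased.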


### 3.2 Metric Analytics: Delta Method and Ratio Metrics

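The Delta Method approximates the variance of a function of an estimator with a first-order Taylor expansion around its mean, which makes it useful for building confidence intervals for non-linear metrics. The cell below takes a simpler route: it log-transforms each observation and estimates the variance of the mean of the transformed values directly from the sample.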

In [10]:
original_metric = np.random.normal(1.5, 0.3, 5000)
transformed_metric = np.log(original_metric)
estimated_variance = np.var(transformed_metric) / len(original_metric)
print(f"Estimated variance using Delta Method: {estimated_variance:.6f}")

Estimated variance using Delta Method: 0.000009
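
The first-order Delta Method states that $\mathrm{Var}(g(\bar{X})) \approx g'(\mu)^2 \sigma^2 / n$; for $g = \log$ this gives $\sigma^2 / (\mu^2 n)$. The sketch below applies that formula to simulated data with the same parameters as the cell above and checks it against a brute-force simulation. The seed and the number of redraws are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.3, 5000
sample = rng.normal(mu, sigma, n)

# Delta method for g(x) = log(x): g'(x) = 1 / x, evaluated at the sample mean.
mean_hat = sample.mean()
var_of_mean = sample.var(ddof=1) / n
delta_var = (1 / mean_hat) ** 2 * var_of_mean

# Sanity check by simulation: variance of log(sample mean) over many redraws.
log_means = np.log(rng.normal(mu, sigma, (1000, n)).mean(axis=1))
simulated_var = log_means.var(ddof=1)

print(f"Delta-method variance of log(mean): {delta_var:.3e}")
print(f"Simulated variance of log(mean):    {simulated_var:.3e}")
```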


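Ratio metrics (e.g., revenue per user) are commonly used in A/B testing. Because both the numerator and the denominator are random, their variance cannot be read off as it can for a simple mean, so they require special handling (the Delta Method is a common choice) to get correct variances and p-values. The cell below takes the simplest route and runs a two-sample t-test on per-user ratios.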

In [11]:
# Hypothetical Scenario: Comparing revenue per user in an experiment
group_A_revenue = np.random.normal(100, 20, 5000)
group_B_revenue = np.random.normal(110, 20, 5000)
ratio_A = group_A_revenue / np.random.randint(1, 5, 5000)
ratio_B = group_B_revenue / np.random.randint(1, 5, 5000)

# T-test for ratio metrics
t_stat, p_value = stats.ttest_ind(ratio_A, ratio_B)
print(f"T-test result for ratio metrics: T-statistic = {t_stat:.4f}, p-value = {p_value:.4f}")

T-test result for ratio metrics: T-statistic = -8.3164, p-value = 0.0000
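
When the denominator also varies per user, a common alternative to averaging per-user ratios is to analyze the ratio of means and estimate its variance with the Delta Method, as in the paper listed under Extra Resources. For $n$ users with numerator $Y$ and denominator $X$:

$$
\mathrm{Var}\left(\frac{\bar{Y}}{\bar{X}}\right) \approx \frac{1}{n}\left(\frac{\sigma_Y^2}{\mu_X^2} - \frac{2\mu_Y\sigma_{XY}}{\mu_X^3} + \frac{\mu_Y^2\sigma_X^2}{\mu_X^4}\right)
$$

The sketch below applies this to simulated data. The revenue model, the use of sessions as the denominator, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000


def simulate_group(revenue_per_session):
    """Per-user sessions (1 to 4) and revenue that grows with sessions."""
    sessions = rng.integers(1, 5, n)
    revenue = sessions * revenue_per_session + rng.normal(0, 20, n)
    return revenue, sessions


def ratio_and_variance(num, den):
    """Ratio of means and its first-order Delta Method variance."""
    k = len(num)
    mean_num, mean_den = num.mean(), den.mean()
    var_num, var_den = num.var(ddof=1), den.var(ddof=1)
    cov = np.cov(num, den)[0, 1]
    ratio = mean_num / mean_den
    variance = (
        var_num / mean_den**2
        - 2 * mean_num * cov / mean_den**3
        + mean_num**2 * var_den / mean_den**4
    ) / k
    return ratio, variance


ratio_a, var_a = ratio_and_variance(*simulate_group(40.0))
ratio_b, var_b = ratio_and_variance(*simulate_group(42.0))

# Z-test on the difference between the two ratios of means.
z = (ratio_b - ratio_a) / np.sqrt(var_a + var_b)
print(f"Revenue per session: A = {ratio_a:.2f}, B = {ratio_b:.2f}, z = {z:.2f}")
```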


## 4. Extra Resources

Further reading on the topics covered:
- [Minimum Detectable Effect](https://splitmetrics.com/resources/minimum-detectable-effect-mde/)
- [Improving Testing Sensitivity](https://kdd.org/kdd2016/papers/files/adp0945-xieA.pdf)
- [CUPED](https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf)
- [CUPAC](https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/)
- [Delta Method](https://arxiv.org/pdf/1803.06336)