# AB TEST

**EX_1**

In [1]:
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Parameters
baseline_rate = 0.20  # Baseline open rate
new_rate = 0.23  # New open rate (after changing the subject line)
effect_size = proportion_effectsize(new_rate, baseline_rate)  # Cohen's h (positive because new_rate > baseline_rate)
alpha = 0.05  # Significance level
power = 0.80  # Desired power (80%)

# Calculate required sample size
power_analysis = NormalIndPower()
sample_size = power_analysis.solve_power(effect_size, power=power, alpha=alpha, ratio=1)
print("Required sample size per group:", sample_size)


Required sample size per group: 2940.557827324


This means you would need approximately 2,941 participants in each group (A and B), rounding 2940.56 up, to detect a lift from a 20% to a 23% open rate with 80% power at a 5% significance level.
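
As a cross-check on the printed figure, the same number can be approximated by hand. The sketch below is an assumption-laden illustration, not the statsmodels implementation: it uses only the Python standard library, the arcsine-based effect size (Cohen's h), and the usual normal-approximation formula for two equal groups, ignoring the negligible opposite-tail term of a two-sided test.

```python
import math
from statistics import NormalDist

baseline_rate = 0.20  # baseline open rate
new_rate = 0.23       # open rate expected with the new subject line
alpha = 0.05          # two-sided significance level
power = 0.80          # desired power

# Cohen's h: difference of arcsine-transformed proportions
h = 2 * (math.asin(math.sqrt(new_rate)) - math.asin(math.sqrt(baseline_rate)))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)          # quantile matching the target power

# Normal-approximation sample size per group for two equal-sized groups
n_per_group = 2 * ((z_alpha + z_power) / h) ** 2
n_rounded = math.ceil(n_per_group)  # always round up to whole participants

print(f"Cohen's h: {h:.4f}")
print(f"Required sample size per group: {n_per_group:.2f} (round up to {n_rounded})")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.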

**EX_2**

In [2]:
from statsmodels.stats.power import NormalIndPower

# Initialize power analysis object
power_analysis = NormalIndPower()

# Parameters
alpha = 0.05  # Significance level
power = 0.80  # Power level

# Effect sizes
effect_sizes = [0.2, 0.4, 0.5]

# Calculate sample sizes for each effect size
for effect_size in effect_sizes:
    sample_size = power_analysis.solve_power(effect_size, power=power, alpha=alpha, ratio=1)
    print(f"Required sample size for effect size {effect_size}: {sample_size:.2f}")


Required sample size for effect size 0.2: 392.44
Required sample size for effect size 0.4: 98.11
Required sample size for effect size 0.5: 62.79


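A small effect size means that the difference between the two groups (A and B) is relatively small, so a larger sample is needed to detect it reliably. The required sample size grows with the inverse square of the effect size: halving the effect size from 0.4 to 0.2 roughly quadruples the sample size per group, from about 98 to about 392.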
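
The pattern in these numbers follows from the standard normal-approximation formula for two equal-sized groups and a two-sided test. This is a sketch of the relationship, assuming that approximation, rather than a description of the library's exact computation; here $h$ is the effect size and $1 - \beta$ the power:

```latex
n_{\text{per group}} \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{h^2},
\qquad
\alpha = 0.05,\ 1-\beta = 0.80
\;\Rightarrow\;
n \approx \frac{2\,(1.960 + 0.842)^2}{h^2} \approx \frac{15.70}{h^2}
```

Substituting $h = 0.2$, $0.4$, and $0.5$ gives roughly 392, 98, and 63 per group, in line with the output above.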

**EX_3**

In [3]:
from statsmodels.stats.power import NormalIndPower

# Initialize power analysis object
power_analysis = NormalIndPower()

# Parameters
effect_size = 0.2  # Small effect size
alpha = 0.05  # Significance level

# Power levels
power_levels = [0.7, 0.8, 0.9]

# Calculate sample sizes for each power level
for power in power_levels:
    sample_size = power_analysis.solve_power(effect_size, power=power, alpha=alpha, ratio=1)
    print(f"Required sample size for power level {power}: {sample_size:.2f}")


Required sample size for power level 0.7: 308.60
Required sample size for power level 0.8: 392.44
Required sample size for power level 0.9: 525.37


Holding the effect size (0.2) and significance level (0.05) fixed, raising the target power from 0.7 to 0.9 increases the required sample size per group from about 309 to about 525.


**EX_4**

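Bonferroni Correction: Divide the overall significance level by the number of planned analyses (here, one per week) and compare every interim p-value against that adjusted threshold.

Pocock Boundary: Uses the same nominal significance level at every analysis, chosen from the number of planned analyses so that the overall Type I error stays at α. It is less strict than Bonferroni at each look.

O’Brien-Fleming Boundary: Starts with very conservative significance levels early in the test and gradually becomes more lenient, so the final analysis is tested at close to the overall α.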
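
To make the three rules concrete, the sketch below prints the nominal two-sided p-value threshold each one would apply at six weekly looks. The constants `POCOCK_Z` and `OBF_Z_FINAL` are assumptions: they are the values commonly tabulated for six equally spaced looks at a two-sided α of 0.05 and should be checked against a group-sequential reference or software before real use. Only the standard library is used.

```python
from statistics import NormalDist

norm = NormalDist()
alpha = 0.05  # overall two-sided significance level
looks = 6     # number of planned weekly analyses

def two_sided_p(z):
    """Nominal two-sided p-value threshold for a critical z-value."""
    return 2 * (1 - norm.cdf(z))

# ASSUMPTION: tabulated constants for 6 equally spaced looks, two-sided alpha = 0.05
POCOCK_Z = 2.453     # same critical value at every look
OBF_Z_FINAL = 2.053  # O'Brien-Fleming critical value at the final look

bonferroni = [alpha / looks] * looks
pocock = [two_sided_p(POCOCK_Z)] * looks
# O'Brien-Fleming: critical value at look k is OBF_Z_FINAL * sqrt(looks / k)
obrien_fleming = [
    two_sided_p(OBF_Z_FINAL * (looks / k) ** 0.5) for k in range(1, looks + 1)
]

for k in range(looks):
    print(
        f"Look {k + 1}: Bonferroni {bonferroni[k]:.4f}, "
        f"Pocock {pocock[k]:.4f}, O'Brien-Fleming {obrien_fleming[k]:.6f}"
    )
```

Under these assumptions, Bonferroni and Pocock stay flat across looks, while the O'Brien-Fleming threshold is tiny at the first look and approaches the overall α at the last one. These boundaries are designed for analyses of the data accumulated so far, not for each week's data in isolation.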

Bonferroni Correction Method

In [5]:
# Sequential testing with a Bonferroni-adjusted threshold (placeholder p-values)
import scipy.stats as stats

# Initialize variables
weeks = 6
alpha = 0.05
adjusted_alpha = alpha / weeks  # Bonferroni correction

def calculate_p_value(week):
    # Placeholder: replace this with your actual p-value calculation logic.
    # This example just returns a random p-value drawn from Uniform(0, 1).
    return stats.uniform.rvs()

# Monitor weekly
for week in range(1, weeks + 1):
    p_value = calculate_p_value(week)

    print(f"Week {week}, p-value: {p_value}")

    # Check if the p-value is below the adjusted threshold
    if p_value < adjusted_alpha:
        print(f"Significant result at week {week}. Stop the test.")
        break
    else:
        print("Continue testing.")

Week 1, p-value: 0.8376318262999883
Continue testing.
Week 2, p-value: 0.36077161203876695
Continue testing.
Week 3, p-value: 0.07795096872984342
Continue testing.
Week 4, p-value: 0.5634853453373877
Continue testing.
Week 5, p-value: 0.5618426434212065
Continue testing.
Week 6, p-value: 0.42466524373754966
Continue testing.


In [6]:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Simulate weekly data (you would replace these with actual numbers)
# Data format: (conversions in A, total in A, conversions in B, total in B)
weekly_data = {
    1: [120, 1000, 130, 1000],
    2: [150, 1000, 160, 1000],
    3: [180, 1000, 210, 1000],
    4: [210, 1000, 240, 1000],
    5: [230, 1000, 260, 1000],
    6: [250, 1000, 280, 1000]
}

def calculate_p_value(week):
    data = weekly_data[week]
    conversions_A, total_A, conversions_B, total_B = data

    # Perform Z-test for proportions
    counts = np.array([conversions_A, conversions_B])  # successes
    nobs = np.array([total_A, total_B])  # total participants
    stat, p_value = proportions_ztest(counts, nobs)

    return p_value

# Sequential Testing Code (as before)
weeks = 6
alpha = 0.05
adjusted_alpha = alpha / weeks  # Bonferroni correction

for week in range(1, weeks + 1):
    p_value = calculate_p_value(week)
    print(f"Week {week}, p-value: {p_value:.4f}")

    if p_value < adjusted_alpha:
        print(f"Significant result at week {week}. Stop the test.")
        break
    else:
        print("Continue testing.")


Week 1, p-value: 0.4990
Continue testing.
Week 2, p-value: 0.5367
Continue testing.
Week 3, p-value: 0.0904
Continue testing.
Week 4, p-value: 0.1082
Continue testing.
Week 5, p-value: 0.1188
Continue testing.
Week 6, p-value: 0.1285
Continue testing.


O’Brien-Fleming Boundary

In [8]:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Simulated weekly data: (conversions in A, total in A, conversions in B, total in B)
weekly_data = {
    1: [120, 1000, 130, 1000],
    2: [150, 1000, 160, 1000],
    3: [180, 1000, 210, 1000],
    4: [210, 1000, 240, 1000],
    5: [230, 1000, 260, 1000],
    6: [250, 1000, 280, 1000]
}

def calculate_p_value(week):
    data = weekly_data[week]
    conversions_A, total_A, conversions_B, total_B = data

    # Perform Z-test for proportions
    counts = np.array([conversions_A, conversions_B])  # successes
    nobs = np.array([total_A, total_B])  # total participants
    stat, p_value = proportions_ztest(counts, nobs)

    return p_value

# --- O'Brien-Fleming Boundary Method ---
print("---- O'Brien-Fleming Boundary Method ----")

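# Illustrative per-look p-value thresholds for 6 looks (stricter early, more lenient later).
# These are rough placeholders in the spirit of O'Brien-Fleming, not exact boundary values;
# derive exact thresholds from a group-sequential design before using them in practice.
obrien_fleming_boundaries = [0.0005, 0.0032, 0.0085, 0.0169, 0.0283, 0.0437]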

for week in range(1, 7):
    p_value = calculate_p_value(week)
    boundary = obrien_fleming_boundaries[week - 1]  # Select the boundary for the current week

    print(f"Week {week}, p-value: {p_value:.4f}, boundary: {boundary:.4f}")

    if p_value < boundary:
        print(f"Significant result at week {week}. Stop the test.")
        break
    else:
        print("Continue testing.")


---- O'Brien-Fleming Boundary Method ----
Week 1, p-value: 0.4990, boundary: 0.0005
Continue testing.
Week 2, p-value: 0.5367, boundary: 0.0032
Continue testing.
Week 3, p-value: 0.0904, boundary: 0.0085
Continue testing.
Week 4, p-value: 0.1082, boundary: 0.0169
Continue testing.
Week 5, p-value: 0.1188, boundary: 0.0283
Continue testing.
Week 6, p-value: 0.1285, boundary: 0.0437
Continue testing.


Pocock Boundary Method

In [9]:
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import norm

# Simulated weekly data: (conversions in A, total in A, conversions in B, total in B)
weekly_data = {
    1: [120, 1000, 130, 1000],
    2: [150, 1000, 160, 1000],
    3: [180, 1000, 210, 1000],
    4: [210, 1000, 240, 1000],
    5: [230, 1000, 260, 1000],
    6: [250, 1000, 280, 1000]
}

def calculate_p_value(week):
    data = weekly_data[week]
    conversions_A, total_A, conversions_B, total_B = data

    # Perform Z-test for proportions
    counts = np.array([conversions_A, conversions_B])  # successes
    nobs = np.array([total_A, total_B])  # total participants
    stat, p_value = proportions_ztest(counts, nobs)

    return p_value

# --- Pocock Boundary Method ---
print("---- Pocock Boundary Method ----")

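# Pocock's boundary uses the same critical z-value at every look.
# For 6 looks and a two-sided alpha of 0.05 the tabulated value is about 2.453
# (the often-quoted 2.41 applies to 5 looks).
alpha = 0.05  # overall two-sided significance level
z_pocock = 2.453
# Convert the z-value to a two-sided nominal p-value threshold
pocock_boundary = 2 * (1 - norm.cdf(z_pocock))

for week in range(1, 7):
    p_value = calculate_p_value(week)
    print(f"Week {week}, p-value: {p_value:.4f}, boundary: {pocock_boundary:.4f}")

    if p_value < pocock_boundary:
        print(f"Significant result at week {week}. Stop the test.")
        break
    else:
        print("Continue testing.")


---- Pocock Boundary Method ----
Week 1, p-value: 0.4990, boundary: 0.0142
Continue testing.
Week 2, p-value: 0.5367, boundary: 0.0142
Continue testing.
Week 3, p-value: 0.0904, boundary: 0.0142
Continue testing.
Week 4, p-value: 0.1082, boundary: 0.0142
Continue testing.
Week 5, p-value: 0.1188, boundary: 0.0142
Continue testing.
Week 6, p-value: 0.1285, boundary: 0.0142
Continue testing.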


**EX_5**

Initially, you believe that the new feature has a 50% chance of improving user engagement, and you make no strong assumptions about its engagement rate. This non-informative starting point can be modeled with a Beta distribution, a common choice in Bayesian A/B testing for proportions or probabilities.

The Beta distribution is parameterized by two values: α (alpha) and β (beta). Setting α = 1 and β = 1 gives a uniform prior: every engagement rate between 0 and 1 is equally plausible and the prior mean is 0.5, which reflects a neutral starting belief (50/50).
Thus, the prior belief can be modeled as:

$$\text{Prior} \sim \text{Beta}(\alpha = 1,\ \beta = 1)$$

This reflects your initial belief that the new feature is equally likely to improve or not improve engagement.

After collecting data, Bayesian analysis allows you to update your prior belief based on the observed data. This updated belief is called the posterior distribution.

Let’s say, after collecting data, your analysis shows a 65% probability that the new feature is better.
The posterior distribution is calculated by updating the prior with the observed data. In A/B testing, this typically involves adding the observed successes and failures to the parameters of the Beta distribution.

Assuming you observe x successes (e.g., users who engaged with the new feature) out of n total observations (e.g., total users exposed to the new feature), so that failures = n − x, the posterior distribution is updated as:

$$\text{Posterior} \sim \text{Beta}(\alpha + \text{successes},\ \beta + \text{failures})$$

Example:
Suppose that, after testing, the analysis gives a 65% probability that the new feature is better. That figure is the posterior probability that the new feature's engagement rate exceeds the control's, obtained by comparing the two posterior distributions. It is not the posterior mean of a single Beta distribution, which is (α + x) / (α + β + n).

Decision Based on Posterior:
A posterior probability of 65% means you are only moderately confident that the feature improves user engagement. Bayesian decision-making lets you incorporate this uncertainty into the decision.
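
One common way to arrive at a "probability that the new feature is better" is to draw samples from the two Beta posteriors and count how often the new feature's draw is higher. The sketch below illustrates that idea under stated assumptions: the engagement counts are hypothetical (chosen so the result lands near 65%), both groups start from the Beta(1, 1) prior, and only the standard library is used.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# HYPOTHETICAL counts, for illustration only: (engaged users, exposed users)
successes_a, users_a = 200, 1000  # control
successes_b, users_b = 207, 1000  # new feature

prior_alpha, prior_beta = 1, 1  # uniform Beta(1, 1) prior

# Posterior = Beta(alpha + successes, beta + failures)
post_a = (prior_alpha + successes_a, prior_beta + users_a - successes_a)
post_b = (prior_alpha + successes_b, prior_beta + users_b - successes_b)

# Monte Carlo estimate of P(rate_B > rate_A | data)
draws = 100_000
wins = sum(
    random.betavariate(*post_b) > random.betavariate(*post_a)
    for _ in range(draws)
)
prob_b_better = wins / draws

# Posterior mean of the new feature's engagement rate (a different quantity)
post_mean_b = post_b[0] / (post_b[0] + post_b[1])

print(f"P(new feature better): {prob_b_better:.3f}")
print(f"Posterior mean engagement rate (new feature): {post_mean_b:.3f}")
```

With these made-up counts the probability of being better comes out near 0.65 while the posterior mean engagement rate is about 0.21, which shows why the two numbers should not be confused.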

**Summary:**

- Prior Belief: Initially, you assume a 50% chance the new feature improves engagement (Beta(1, 1)).
- Posterior Distribution: After collecting data, you update this belief, with the posterior showing a 65% probability that the new feature is better.
- Decision: At 65%, you may proceed cautiously, but if the posterior were only 55%, it would be wise to collect more data before making a final decision.

**EX_6**

Traffic Adjustment Strategy:

- Layout C (best performer): Increase its traffic allocation, for example, to 50%.
- Layouts A and B: Reduce their traffic allocation, for example, to 25% each.

This approach is similar to multi-armed bandit algorithms, where traffic is gradually shifted toward better-performing variations, but some traffic is still reserved for exploration to avoid prematurely dismissing potential winners.

After the first week, the traffic allocation might look like this:

- Layout A: 25%
- Layout B: 25%
- Layout C: 50%

When reallocating, guard against premature allocation, the exploration vs. exploitation trade-off, delayed feedback, statistical significance issues, and user experience bias, using strategies such as epsilon-greedy or Bayesian methods.

Monitor Performance: At the end of each subsequent week, reassess the engagement metrics (e.g., conversion rate, click-through rate) for each layout.

Update Traffic Allocation: Adjust traffic based on performance updates. For example:

- If Layout C continues to outperform, you could increase its traffic allocation to 60-70%.
- If Layout A shows improvement, you might allocate more traffic back to Layout A (e.g., increase it to 30%).
- Continue reducing traffic to Layout B if its performance lags behind (e.g., down to 10-20%).

Convergence: Over time, as one layout consistently outperforms the others, you might allocate the majority of the traffic (e.g., 80-90%) to the best layout, while leaving a small percentage for exploration to confirm that the decision is robust.
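
The weekly reallocation loop described above can be sketched with a Thompson-sampling-style rule, one of the Bayesian methods mentioned earlier. Everything specific here is an assumption for illustration: the week-one counts are hypothetical, `EXPLORATION_FLOOR` is an arbitrary 10% minimum share per layout, and the resulting split will not match the illustrative 25/25/50 figures. Only the standard library is used.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# HYPOTHETICAL week-one results per layout: (conversions, visitors)
results = {"A": (100, 1000), "B": (90, 1000), "C": (130, 1000)}

EXPLORATION_FLOOR = 0.10  # assumed minimum traffic share kept for every layout

def allocate(results, draws=20_000, floor=EXPLORATION_FLOOR):
    """Return traffic shares based on each layout's probability of being best."""
    wins = {name: 0 for name in results}
    for _ in range(draws):
        # Sample a plausible conversion rate per layout from its Beta posterior
        # (uniform Beta(1, 1) prior updated with conversions and non-conversions)
        samples = {
            name: random.betavariate(1 + conv, 1 + visits - conv)
            for name, (conv, visits) in results.items()
        }
        wins[max(samples, key=samples.get)] += 1
    prob_best = {name: count / draws for name, count in wins.items()}
    # Reserve `floor` for each layout, then split the rest by P(best)
    remaining = 1 - floor * len(results)
    return {name: floor + remaining * p for name, p in prob_best.items()}

allocation = allocate(results)
for name, share in allocation.items():
    print(f"Layout {name}: {share:.1%}")
```

Re-running `allocate` each week with cumulative counts shifts traffic toward the leader while the floor keeps some exploration going, which addresses the premature-allocation concern raised above.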