In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import statsmodels.stats.api as sms
from scipy.stats import (ttest_1samp, shapiro, levene, ttest_ind, mannwhitneyu,
                         pearsonr, spearmanr, kendalltau, f_oneway, kruskal)
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multicomp import MultiComparison
import zipfile, requests, io
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import zt_ind_solve_power

# Exercise 1: Calculating Required Sample Size

In [2]:
# Inputs
p1 = 0.2
p2 = 0.23
alpha = 0.05
power = 0.8

# Calculate effect size (Cohen's h for proportions).
# The value is negative because p1 < p2; the sign does not affect a two-sided test.
effect_size = proportion_effectsize(p1, p2)

# Calculate sample size per group
analysis = NormalIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Binary data (CTR): Required sample size per group = {int(sample_size)} users")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Binary data (CTR): Required sample size per group = 2940 users
Effect size (Cohen's h) = -0.073
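
As a cross-check on the `solve_power` result, the same number can be reproduced from the normal-approximation formula n = 2 * ((z_(1 - alpha/2) + z_power) / h)^2 per group. The sketch below uses only the standard library; `cohens_h` and `n_per_group` are illustrative names, not statsmodels functions, and the formula ignores the negligible opposite-tail term that a two-sided solver includes. It rounds up rather than truncating, so it reports one more user than the `int()` output above.

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def n_per_group(h, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, equal-allocation z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return 2 * ((z_alpha + z_power) / h) ** 2


h = cohens_h(0.20, 0.23)
n = n_per_group(h)
print(f"Cohen's h = {h:.3f}")
print(f"Required sample size per group = {math.ceil(n)} users")
```

Rounding up keeps the achieved power at or above the target, whereas truncating leaves it marginally below.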


# Exercise 2: Understanding the Relationship Between Effect Size and Sample Size
Calculate the required sample size for the following effect sizes: 0.2, 0.4, and 0.5, keeping the significance level and power the same.

In [7]:
# Exercise effect size 0.2, entered here as an absolute lift of 0.02 (0.20 -> 0.22)
p1 = 0.2
p2 = 0.22

# Effect Size
effect_size = proportion_effectsize(p1, p2)

# Calculate Required Sample Size
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Binary data (CTR): Required sample size per group = {int(sample_size)} users")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Binary data (CTR): Required sample size per group = 6507 users
Effect size (Cohen's h) = -0.049


In [8]:
# Exercise effect size 0.4, entered here as an absolute lift of 0.04 (0.20 -> 0.24)
p1 = 0.2
p2 = 0.24

# Effect Size
effect_size = proportion_effectsize(p1, p2)

# Calculate Required Sample Size
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Binary data (CTR): Required sample size per group = {int(sample_size)} users")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Binary data (CTR): Required sample size per group = 1680 users
Effect size (Cohen's h) = -0.097


In [9]:
# Exercise effect size 0.5, entered here as an absolute lift of 0.05 (0.20 -> 0.25)
p1 = 0.2
p2 = 0.25

# Effect Size
effect_size = proportion_effectsize(p1, p2)

# Calculate Required Sample Size
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Binary data (CTR): Required sample size per group = {int(sample_size)} users")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Binary data (CTR): Required sample size per group = 1091 users
Effect size (Cohen's h) = -0.120


**How does the sample size change as the effect size increases? Explain why this happens.**

**ANSWER:** As the expected effect size increases, the required sample size decreases: moving from a 0.02 lift to a 0.05 lift cuts the requirement from 6,507 to 1,091 users per group. A large difference between two groups stands out clearly against sampling noise, so fewer observations are enough to detect it. The opposite holds when you want to detect minute differences; you need more observations to conclude with confidence that the difference isn't random noise.
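
The inverse-square relationship behind this answer shows up clearly when the exercise's effect sizes are plugged in directly. This sketch assumes 0.2, 0.4, and 0.5 are read as Cohen's h values (rather than the absolute lifts used in the cells above) and relies on the normal-approximation formula n = 2 * ((z_(1 - alpha/2) + z_power) / h)^2; `n_per_group` is a hypothetical helper, not a statsmodels function.

```python
import math
from statistics import NormalDist


def n_per_group(h, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided, equal-allocation z-test."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z_sum / h) ** 2


sizes = {h: n_per_group(h) for h in (0.2, 0.4, 0.5)}
for h, n in sizes.items():
    print(f"h = {h}: about {math.ceil(n)} users per group")

# Doubling the effect size cuts the required sample size to a quarter.
print(f"n(0.2) / n(0.4) = {sizes[0.2] / sizes[0.4]:.1f}")
```

Because h appears squared in the denominator, small effects are disproportionately expensive to detect.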

# Exercise 3: Exploring the Impact of Statistical Power
Imagine you are conducting an A/B test where you expect a small effect size of 0.2. You initially plan for a power of 0.8 but wonder how increasing or decreasing the desired power level impacts the required sample size. Calculate the required sample size for power levels of 0.7, 0.8, and 0.9, keeping the effect size at 0.2 and significance level at 0.05.

In [11]:
# Power Level: 0.7
p1 = 0.2
p2 = 0.22
alpha = 0.05
power = 0.7

effect_size = proportion_effectsize(p1, p2)
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Required Sample Size per Group: {int(sample_size)} observations")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Required Sample Size per Group: 5117 observations
Effect size (Cohen's h) = -0.049


In [12]:
# Power Level: 0.8
p1 = 0.2
p2 = 0.22
alpha = 0.05
power = 0.8

effect_size = proportion_effectsize(p1, p2)
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Required Sample Size per Group: {int(sample_size)} observations")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Required Sample Size per Group: 6507 observations
Effect size (Cohen's h) = -0.049


In [13]:
# Power Level: 0.9
p1 = 0.2
p2 = 0.22
alpha = 0.05
power = 0.9

effect_size = proportion_effectsize(p1, p2)
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1)

print(f"Required Sample Size per Group: {int(sample_size)} observations")
print(f"Effect size (Cohen's h) = {effect_size:.3f}")

Required Sample Size per Group: 8711 observations
Effect size (Cohen's h) = -0.049


**How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?**

**ANSWER**: As you increase the level of statistical power, the required sample size also increases: from 5,117 observations per group at a power of 0.7 to 6,507 at 0.8 and 8,711 at 0.9. Statistical power is the probability that your test will detect an effect when that effect actually exists.
In simpler terms, if there really IS a difference between your groups, what's the chance your test will successfully find it?

Thus, as you increase the chance that a test will DETECT a real difference as statistically significant, you also need more observations, because a larger sample reduces sampling noise. This matters when designing A/B tests because higher power has to be traded off against the extra time and traffic needed to collect the sample.
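
The same trade-off can be read in the other direction: for a fixed effect size, how much power does a given sample size buy? The sketch below assumes the normal approximation for a two-sided test with equal group sizes and ignores the negligible opposite-tail term; `achieved_power` is an illustrative helper, and the sample sizes are the ones printed in the cells above.

```python
import math
from statistics import NormalDist


def achieved_power(n_per_group, h, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected z-statistic under the alternative: |h| * sqrt(n / 2)
    expected_z = abs(h) * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(expected_z - z_alpha)


# Cohen's h for p1 = 0.20 vs p2 = 0.22, as in the cells above
h = 2 * math.asin(math.sqrt(0.20)) - 2 * math.asin(math.sqrt(0.22))

powers = {n: achieved_power(n, h) for n in (5117, 6507, 8711)}
for n, pw in powers.items():
    print(f"n = {n} per group -> power of about {pw:.2f}")
```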


# Exercise 4: Implementing Sequential Testing
You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.
- Define your stopping criteria.
- Decide how you would implement sequential testing in this scenario.
- At the end of week three, Version B has a p-value of 0.02. What would you do next?

**ANSWER**: 
- The stopping criterion is a p-value below 0.05.
- We would implement sequential testing as follows:
    - At the end of each week, we analyze the results of the A/B test. If the p-value is above 0.05, we run the test for an additional week and analyze again, repeating this for as long as the p-value stays above the stopping criterion. As soon as the p-value falls below 0.05, we stop testing and proceed with the appropriate business decision.
- At the end of week three, we would end the A/B test, because a p-value of 0.02 is below our 0.05 stopping criterion.
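
One thing worth checking before adopting a flat 0.05 threshold at every weekly look: each look is another chance to cross the threshold by luck, so the overall false-positive rate ends up above 0.05. The simulation below is a rough sketch under simplifying assumptions that are not part of the exercise: four equally sized weekly batches, no true difference between the versions, and a normal approximation in which the cumulative z-statistic after week k is the sum of k independent standard normals divided by sqrt(k). It compares the flat threshold with a Bonferroni-adjusted one (0.05 / 4 per look), which is a simple, conservative choice rather than the only option.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(per_look_alpha, looks=4, trials=20000, seed=0):
    """Share of simulated A/A tests that stop early at any look (two-sided)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    stopped = 0
    for _ in range(trials):
        cumulative = 0.0
        for k in range(1, looks + 1):
            cumulative += rng.gauss(0, 1)   # this week's evidence under no true effect
            z = cumulative / math.sqrt(k)   # z-statistic on all data collected so far
            if abs(z) > z_crit:
                stopped += 1
                break
    return stopped / trials


naive = false_positive_rate(per_look_alpha=0.05)
bonferroni = false_positive_rate(per_look_alpha=0.05 / 4)
print(f"Flat 0.05 at every look:      {naive:.3f}")
print(f"Bonferroni (0.0125 per look): {bonferroni:.3f}")
```

Under a per-look threshold of 0.0125, the week-three p-value of 0.02 would not yet justify stopping, so the choice of threshold directly changes the decision in this scenario.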

# Exercise 5: Applying Bayesian A/B Testing

You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.

- Describe how you would set up your prior belief.
- After collecting data, how does the updated belief (posterior distribution) influence your decision?
- What would you do if the posterior probability was only 55%?

**ANSWER**:
- Our prior belief is that the app version with the new feature (Version B) has a 50% chance of being better than the original version (Version A). In other words, we start with a neutral belief: either version is equally likely to be the better one.
- Our posterior belief is the probability that the new version is better than the original after analyzing the results of the test; in this case, 65%. That leaves a 35% chance that the new version is not better, which isn't wholly convincing but should be enough to attempt a small rollout or experimental phase. The key question is how confident we want to be before continuing with the new version.
- If the posterior were only 55%, we would want to run additional tests to see whether the posterior rises high enough to conclude that the new version is better.
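
A minimal sketch of how the prior and posterior in this answer could be computed, assuming a Beta-Binomial model with the same uniform Beta(1, 1) prior on each version's engagement rate (identical priors imply a 50% prior chance that B beats A). The counts are hypothetical, picked so the posterior lands in the neighborhood of the 65% in the exercise; `prob_b_beats_a` is an illustrative helper.

```python
import random


def prob_b_beats_a(succ_a, n_a, succ_b, n_b, prior=(1, 1), draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + succ_a, b0 + n_a - succ_a)
        rate_b = rng.betavariate(a0 + succ_b, b0 + n_b - succ_b)
        wins += rate_b > rate_a
    return wins / draws


# Before any data: both versions share the same prior, so neither is favored.
prior_prob = prob_b_beats_a(0, 0, 0, 0)

# Hypothetical data: 100 of 1000 users engaged on A, 105 of 1000 on B.
posterior_prob = prob_b_beats_a(100, 1000, 105, 1000)

print(f"Prior P(B > A):     {prior_prob:.2f}")
print(f"Posterior P(B > A): {posterior_prob:.2f}")
```

Whether a given posterior is enough to act on depends on a decision threshold chosen in advance, which is the "how confident do we want to be" question raised above.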

# Exercise 6: Implementing Adaptive Experimentation

You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.

- Explain how you would adjust the traffic allocation after the first week.
- Describe how you would continue to adapt the experiment in the following weeks.
- What challenges might you face with adaptive experimentation, and how would you address them?

**ANSWER**:
- After the first week, I would adjust the traffic to each layout in the following ways:
    - 60% to C
    - 20% to A
    - 20% to B
- If I notice that C continues to show higher engagement, I would divert more traffic there from layouts A and B, and potentially even phase out traffic to the lower performer between layouts A and B. From there I would eventually divert all traffic to the best performing layout.
- Early analyses are susceptible to noise, so we could divert more traffic to a layout that isn't actually better than the others. There is also a higher chance of inconsistent results week over week. These issues can be addressed by increasing the sample size, running the test for longer, or making smaller adjustments week over week until a clearer pattern emerges.
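
One way to make the weekly reallocation less ad hoc is to tie each layout's traffic share to the probability that it is the best, while keeping a minimum share per layout so that early noise cannot starve a layout of data. The sketch below assumes a Thompson-sampling-style rule with Beta(1, 1) priors; the engagement counts, the 10% floor, and the `allocate_traffic` helper are all hypothetical, and the resulting split will differ from the 60/20/20 chosen above.

```python
import random


def allocate_traffic(successes, trials, floor=0.10, draws=20000, seed=0):
    """Traffic shares: a fixed floor per layout plus a share proportional to P(layout is best)."""
    rng = random.Random(seed)
    k = len(successes)
    best_counts = [0] * k
    for _ in range(draws):
        # Draw one plausible engagement rate per layout from its Beta posterior
        rates = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, trials)]
        best_counts[rates.index(max(rates))] += 1
    p_best = [c / draws for c in best_counts]
    remaining = 1 - k * floor  # traffic left to distribute once every floor is covered
    return [floor + remaining * p for p in p_best]


# Hypothetical week-one results: engaged users out of visitors for layouts A, B, C
shares = allocate_traffic(successes=[100, 105, 130], trials=[1000, 1000, 1000])
for name, share in zip("ABC", shares):
    print(f"Layout {name}: {share:.0%} of traffic")
```

Re-running the function each week with cumulative counts shifts traffic gradually, and the floor keeps the weaker layouts measurable in case the early lead was noise.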