# 🌟 Exercise 1: Calculating Required Sample Size
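You are planning an A/B test to evaluate the impact of a new email subject line on the open rate. Based on past data, you expect a standardized effect size of 0.3, which sits between Cohen's conventional "small" (0.2) and "medium" (0.5) benchmarks. Note that the illustrative increase from 20% to 23% in the open rate corresponds to a much smaller standardized effect (Cohen's h ≈ 0.07), so the calculations here use 0.3 as given. You aim for an 80% chance (power = 0.8) of detecting this effect if it exists, with a 5% significance level (α = 0.05).  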

Calculate the required sample size per group using Python’s statsmodels library.  
What sample size is needed for each group to ensure your test is properly powered?  

We solve for the sample size with a z-test because it suits large samples with known variance and proportion-based outcomes, which is the typical A/B testing case with binary outcomes such as open rates or click-through rates. The z-test relies on the normal approximation, which holds well for large samples. The `effect_size` argument is a standardized difference (the difference in means divided by the standard deviation), not the raw difference in rates.

In [3]:
import statsmodels.stats.api as sms

# Define the parameters for the sample size calculation
effect_size = 0.3  # Small effect size based on increase in open rate
alpha = 0.05  # Significance level
power = 0.8  # Power of the test


# Calculate the sample size needed for each group
sample_size_per_group = sms.zt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power, alternative='two-sided')

sample_size_per_group

174.41912242947043

In [5]:
from statsmodels.stats.power import TTestIndPower

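# Same parameters as the z-test above. The t-test accounts for estimating the
# variance from the data, so it requires slightly more observations per group.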
effect_size = 0.3
alpha = 0.05
power = 0.8

# Calculate the sample size
analysis = TTestIndPower()
sample_size = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power)
print(f'Required sample size per group: {sample_size:.2f}')

Required sample size per group: 175.38
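
The effect size passed to the solvers above is a standardized one, so it helps to see what a raw change in open rate translates to. The sketch below converts the 20% to 23% example from Exercise 1 into Cohen's h and plugs it into the usual normal-approximation formula for two equal groups. It uses only the standard library; the helper names `cohens_h` and `n_per_group` are illustrative and not part of statsmodels (statsmodels offers `proportion_effectsize` in `statsmodels.stats.proportion` for the same conversion).

```python
import math
from statistics import NormalDist


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Two-sided z-test sample size per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2


h = cohens_h(0.23, 0.20)
n_required = n_per_group(h)
print(f"Cohen's h for 20% -> 23%: {h:.3f}")
print(f"Required sample size per group: {math.ceil(n_required)}")
```

Under these assumptions, a three-point lift in open rate needs a few thousand recipients per group, far more than the roughly 175 required for an effect size of 0.3.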


# 🌟 Exercise 2: Understanding the Relationship Between Effect Size and Sample Size
Using the same A/B test setup as in Exercise 1, you want to explore how changing the expected effect size impacts the required sample size.

Calculate the required sample size for the following effect sizes: 0.2, 0.4, and 0.5, keeping the significance level and power the same.
How does the sample size change as the effect size increases? Explain why this happens.

In [4]:
# Define the different effect sizes to evaluate
effect_sizes = [0.2, 0.4, 0.5]

# Calculate the required sample sizes for each effect size
sample_sizes = [sms.zt_ind_solve_power(effect_size=es, alpha=alpha, power=power, alternative='two-sided') for es in effect_sizes]

import pandas as pd

# Create a DataFrame to display the results
sample_size_df = pd.DataFrame({
    'Effect Size': effect_sizes,
    'Sample Size': sample_sizes
})

sample_size_df


|   | Effect Size | Sample Size |
|---|-------------|-------------|
| 0 | 0.2 | 392.443023 |
| 1 | 0.4 | 98.110757 |
| 2 | 0.5 | 62.790884 |
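
The pattern in these results follows from the standard normal-approximation formula for a two-sided test with two equal groups, shown here as a sketch, where $d$ is the standardized effect size and $z_q$ is the $q$-quantile of the standard normal distribution:

$$
n_{\text{per group}} \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{d^2}
$$

With $\alpha = 0.05$ and power $1-\beta = 0.8$, the numerator is about $2 \times (1.960 + 0.842)^2 \approx 15.70$, so $n \approx 15.70 / d^2$. The required sample size falls with the square of the effect size: doubling $d$ from 0.2 to 0.4 cuts the sample size to a quarter (392.44 to 98.11). Larger effects stand out more clearly from sampling noise, so fewer observations are needed to detect them.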


# 🌟 Exercise 3: Exploring the Impact of Statistical Power
Imagine you are conducting an A/B test where you expect a small effect size of 0.2. You initially plan for a power of 0.8 but wonder how increasing or decreasing the desired power level impacts the required sample size.

Calculate the required sample size for power levels of 0.7, 0.8, and 0.9, keeping the effect size at 0.2 and significance level at 0.05.
Question: How does the required sample size change with different levels of statistical power? Why is this understanding important when designing A/B tests?

In [6]:


# Define the parameters for the sample size calculation
effect_size = 0.2  # Small effect size
alpha = 0.05  # Significance level

# Define the different power levels to evaluate
power_levels = [0.7, 0.8, 0.9]

# Calculate the required sample sizes for each power level
sample_sizes_power = [sms.zt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power, alternative='two-sided') for power in power_levels]

# Create a DataFrame to display the results
sample_size_power_df = pd.DataFrame({
    'Power Level': power_levels,
    'Sample Size': sample_sizes_power
})

sample_size_power_df


|   | Power Level | Sample Size |
|---|-------------|-------------|
| 0 | 0.7 | 308.600198 |
| 1 | 0.8 | 392.443023 |
| 2 | 0.9 | 525.370971 |
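
Higher power requires more data, and the increase is not linear: moving from 0.8 to 0.9 costs more extra observations than moving from 0.7 to 0.8. One way to see the trade-off is to run the calculation in reverse and ask what power a given sample size buys. The sketch below does that with the normal approximation, ignoring the negligible probability of rejecting in the wrong direction; the function name `achieved_power` is illustrative.

```python
from statistics import NormalDist


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal groups."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic under the alternative hypothesis
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - z_crit)


# Sample sizes from the table above, rounded up to whole observations
for n in (309, 393, 526):
    print(f"n = {n}: power = {achieved_power(0.2, n):.3f}")
```

This matters at design time because traffic is usually the binding constraint: if only about 300 users per group are available, the test has roughly a 30% chance of missing a real effect of size 0.2.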


# 🌟 Exercise 4: Implementing Sequential Testing
You are running an A/B test on two versions of a product page to increase the purchase rate. You plan to monitor the results weekly and stop the test early if one version shows a significant improvement.  

Define your stopping criteria.  
Decide how you would implement sequential testing in this scenario.  
At the end of week three, Version B has a p-value of 0.02. What would you do next?  

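1. **Stopping Criteria**: Before the test starts, fix the maximum number of interim checks and an adjusted significance level for each one (e.g., alpha_adjusted = 0.015), derived from a method such as Pocock boundaries so that the overall false-positive rate stays at the planned level.

2. **Sequential Testing Implementation**: Run the analysis weekly, compare each p-value against the adjusted threshold rather than the unadjusted 0.05, and stop the test as soon as the p-value falls below that threshold.

3. **Next Step at Week Three (p = 0.02)**: Since the p-value (0.02) is higher than the adjusted threshold (0.015), continue the test and check again next week. Stopping here just because 0.02 < 0.05 would inflate the false-positive rate.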
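
The decision rule above can be sketched as a small function. The threshold of 0.015 comes from the answer above (a Pocock design with five looks at an overall α of 0.05 uses approximately 0.0158 per look). The cap of five looks and the p-values for weeks one and two are made-up placeholders; only the week-three value of 0.02 comes from the exercise.

```python
def sequential_decision(p_values, alpha_adjusted=0.015, max_looks=5):
    """Apply a constant per-look threshold to the p-values observed so far."""
    for week, p_value in enumerate(p_values, start=1):
        if p_value < alpha_adjusted:
            return f"stop at week {week}: significant"
    if len(p_values) >= max_looks:
        return "stop: maximum looks reached without significance"
    return "continue"


# Weeks 1 and 2 are hypothetical; week 3 is the p-value from the exercise
weekly_p_values = [0.20, 0.08, 0.02]
print(sequential_decision(weekly_p_values))
```

With these inputs the function returns `continue`, matching the answer to the third question.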


# 🌟 Exercise 5: Applying Bayesian A/B Testing
You’re testing a new feature in your app, and you want to use a Bayesian approach. Initially, you believe the new feature has a 50% chance of improving user engagement. After collecting data, your analysis suggests a 65% probability that the new feature is better.

Describe how you would set up your prior belief.  
After collecting data, how does the updated belief (posterior distribution) influence your decision?  
What would you do if the posterior probability was only 55%?  

1. **Prior Belief Setup**: Start with a neutral prior: the new feature is as likely to improve engagement as not (50%). In practice this means giving both variants the same prior distribution over the engagement rate, for example a uniform Beta(1, 1).

2. **Updated Belief (Posterior)**: After data collection, the posterior probability of 65% shifts belief toward the new feature. That supports a potential rollout, but how strongly depends on the decision threshold you set in advance and on the cost of a wrong call.

3. **If Posterior is 55%**: With only a 55% probability, the evidence is barely better than a coin flip, so you would keep collecting data before making a decision.
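
A minimal sketch of how such a posterior probability can be computed, assuming engagement is a binary outcome per user and using a Beta-Binomial model with the same Beta(1, 1) prior on both variants. The counts are invented for illustration and the function name `prob_b_beats_a` is hypothetical; the exercise does not specify a model or any data.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors."""
    rng = random.Random(seed)
    prior_a, prior_b = prior
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(prior_a + successes, prior_b + failures)
        rate_a = rng.betavariate(prior_a + conv_a, prior_b + n_a - conv_a)
        rate_b = rng.betavariate(prior_a + conv_b, prior_b + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: 200 of 1000 engaged users on A, 210 of 1000 on B
p_better = prob_b_beats_a(200, 1000, 210, 1000)
print(f"P(B is better than A) = {p_better:.2f}")
```

With no data at all, the same function returns a value near 0.5, which is the neutral prior belief from the first answer.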


# 🌟 Exercise 6: Implementing Adaptive Experimentation
You’re running a test with three different website layouts to increase user engagement. Initially, each layout gets 33% of the traffic. After the first week, Layout C shows higher engagement.

Explain how you would adjust the traffic allocation after the first week.  
Describe how you would continue to adapt the experiment in the following weeks.  
What challenges might you face with adaptive experimentation, and how would you address them?  

1. **Traffic Adjustment**: After the first week, allocate more traffic to Layout C while reducing traffic to the other layouts.

2. **Ongoing Adaptation**: Continue adjusting traffic weekly, sending more traffic to layouts with higher engagement and less to lower-performing ones.

3. **Challenges**: You might react to **early noise** (a layout that leads after one week purely by chance) or starve underperforming layouts of traffic, leaving **insufficient data** to judge them; address both by maintaining a **minimum traffic allocation** to every layout so the comparison stays reliable.
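
One common way to implement this kind of reallocation is Thompson sampling with a traffic floor, sketched below. This is a swapped-in technique chosen for illustration, not something the exercise prescribes. It assumes a binary engagement outcome and Beta(1, 1) priors; the week-one counts, the 10% floor, and the function name `thompson_allocation` are all hypothetical.

```python
import random


def thompson_allocation(stats, min_share=0.10, draws=10000, seed=0):
    """Traffic shares proportional to P(layout is best), with a floor per layout."""
    if min_share * len(stats) > 1:
        raise ValueError("min_share is too large for the number of layouts")
    rng = random.Random(seed)
    best_counts = dict.fromkeys(stats, 0)
    for _ in range(draws):
        # Draw one plausible engagement rate per layout from its Beta posterior
        sampled = {
            name: rng.betavariate(1 + engaged, 1 + visitors - engaged)
            for name, (engaged, visitors) in stats.items()
        }
        best_counts[max(sampled, key=sampled.get)] += 1
    # Reserve the floor for every layout, then split the rest by P(best)
    free_share = 1 - min_share * len(stats)
    return {
        name: min_share + free_share * best_counts[name] / draws
        for name in stats
    }


# Hypothetical week-one results: (engaged users, visitors) per layout
week_one = {"A": (100, 1000), "B": (105, 1000), "C": (130, 1000)}
allocation = thompson_allocation(week_one)
for name, share in allocation.items():
    print(f"Layout {name}: {share:.0%} of traffic")
```

Rerunning the function each week on the cumulative counts gives the ongoing adaptation, and the floor keeps data flowing to every layout so that an early lead caused by chance can still be corrected.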
