# Introduction

In this lesson, we will use simulation to understand some of the considerations for setting up an A/B test: sample size, power, and the false positive rate. But before we think about designing an A/B test, let us first remind ourselves how to conduct the test itself, **after** planning and collecting data.

Suppose that a media company currently has a weekly newsletter email and wants to see if using the recipient's first name in the email subject will cause more people to open the email (i.e., "Bob! Check out this week's updates" vs. "Check out this week's updates"). They randomly assign each of 100 recipients to receive one of the two email subjects and record whether or not each recipient opened the email. The first few rows of their data might look something like this:

|Email|Opened|
|:----|:-----|
|name|yes|
|name|no|
|control|yes|
|control|yes|
|name|no|

In order to run a hypothesis test to decide whether there is a significant difference in the open rate for these emails, we would run a Chi-Square test. To accomplish this, we would first create a contingency table for the `Email` and `Opened` variables in the above table:

```
import pandas as pd
# `data` has one row per recipient, with `Email` and `Opened` columns
x = pd.crosstab(data.Email, data.Opened)
print(x)
```

Output:

|Opened|no|yes|
|:-----|:-|:--|
|Email|||
|control|23|27|
|name|16|34|

We would then use this table to run a Chi-Square test and get a p-value:

```
from scipy.stats import chi2_contingency
chi2, pval, dof, expected = chi2_contingency(x)
print(pval)  # Output: 0.2186
```

Based on the p-value, we would make a decision about which email to use; a small p-value would provide evidence that the open rates are significantly different for the two groups, while a large p-value would suggest no significant difference.
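
To reproduce this p-value without the raw data, the sketch below rebuilds the contingency table from the counts printed above and runs the same test. The variable name `table` is ours and is not part of the lesson's code. Note that `chi2_contingency` applies Yates' continuity correction by default when the table is 2x2.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the contingency table printed above
# (rows: email subject, columns: whether the email was opened).
table = pd.DataFrame(
    {"no": [23, 16], "yes": [27, 34]},
    index=["control", "name"],
)

# correction=True (the default) applies Yates' continuity correction,
# which is used when the table has one degree of freedom.
chi2, pval, dof, expected = chi2_contingency(table)
print(f"p-value: {pval:.4f}")  # should agree with the 0.2186 reported above
```

With a threshold of 0.05, this p-value would not count as evidence of a difference in open rates.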

***
### Exercise

1. Run the code in the cell below to see the first five rows of data.

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv("ab_data.csv")
data.head()

Unnamed: 0,Web_Version,Purchased
0,A,no
1,A,no
2,A,yes
3,A,yes
4,A,yes


2. Suppose that you are running an A/B test comparing two versions of a checkout page (version A or version B) to see whether there is a significantly different purchase rate for one version compared to the other. Data from this experiment has been loaded for you in the dataframe named `data`. Use this data to create a contingency table and save the result as `ab_contingency`, then print out the result.

In [2]:
ab_contingency = pd.crosstab(data.Web_Version, data.Purchased)
ab_contingency

Purchased,no,yes
Web_Version,Unnamed: 1_level_1,Unnamed: 2_level_1
A,24,26
B,15,35


3. Use `ab_contingency` to run a Chi-Square test using `chi2_contingency()` and save the p-value as a variable named `pval`. Print out `pval`.

In [3]:
_, pval, _, _ = chi2_contingency(ab_contingency)
pval

0.10096676200907678

***

## Simulating Data for a Chi-Square test

In the last exercise, we used some data from an A/B test to run a Chi-Square test. In the next few exercises, we will build up a simulation to understand the considerations that go into choosing a sample size for that test.

Again consider the A/B test example comparing email subjects with and without the recipient's first name. Suppose we know that recipients have a 50% chance of opening the control email and a 65% chance of opening the name email (a 30% lift!).

Here we use **lift** to refer to the inherent relative difference between the open rates of our two groups. In the *A/B Testing: Sample Size Calculators* lesson, we learned that **minimum detectable effect** is the smallest size of the difference between the two groups that we want our test to be able to detect. If we set up our experiment with a minimum detectable effect of 20%, our statistical test should be able to detect a "lift" or "effect" of 20% or greater. In this lesson we are going to simulate data that has a lift of 30% to demonstrate how the inherent lift impacts the power of our statistical test.
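
The arithmetic behind the 65% figure can be written out as a short sketch. The helper names `apply_lift` and `relative_lift` are hypothetical and chosen only for illustration. The sketch assumes lift is a relative difference, which is how the exercise code in this lesson computes `name_rate`.

```python
def apply_lift(control_rate, lift):
    """Return the treatment rate implied by a relative lift."""
    treatment_rate = (1 + lift) * control_rate
    if not 0 <= treatment_rate <= 1:
        raise ValueError("lift pushes the treatment rate outside [0, 1]")
    return treatment_rate


def relative_lift(control_rate, treatment_rate):
    """Return the relative difference between two rates."""
    return (treatment_rate - control_rate) / control_rate


name_rate = apply_lift(control_rate=0.5, lift=0.3)
print(round(name_rate, 2))                      # 0.65
print(round(relative_lift(0.5, name_rate), 2))  # 0.3
```

The range check matters because a large lift on a high baseline rate can produce a probability above 1, which `np.random.choice` would reject.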

We can use the aforementioned probabilities to simulate a dataset of 100 email recipients as follows:

```
sample_control = np.random.choice(['yes', 'no'], size=50, p=[.5, .5])
sample_name = np.random.choice(['yes', 'no'], size=50, p=[.65, .35])
```

This gives us two simulated samples, of 50 recipients each, who hypothetically saw the name or control email subject. Each one looks something like `['yes' 'no' 'no' 'no' 'yes' 'yes' ...]`, where `'yes'` corresponds to an opened email.

Next, we can assemble these arrays into a data frame that looks a lot like the one we saw in exercise 1:

```
group = ['control'] * 50 + ['name'] * 50
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)
print(sim_data.head())
```

Output:

|Email|Opened|
|:----|:-----|
|control|no|
|control|yes|
|control|yes|
|control|no|
|control|no|

Because of how we created this data frame, all of the "control" observations will be listed first, followed by all of the "name" observations.
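
One quick sanity check on a simulated dataset is to compare the observed open rate in each group with the probabilities used to generate it. The sketch below is self-contained and uses NumPy's `default_rng` with an arbitrary seed instead of `np.random.choice`, so that the run is repeatable. That substitution and the name `observed_rates` are our assumptions, not part of the lesson's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

sample_control = rng.choice(["yes", "no"], size=50, p=[0.5, 0.5])
sample_name = rng.choice(["yes", "no"], size=50, p=[0.65, 0.35])

sim_data = pd.DataFrame({
    "Email": ["control"] * 50 + ["name"] * 50,
    "Opened": list(sample_control) + list(sample_name),
})

# Share of 'yes' outcomes within each group
observed_rates = (sim_data["Opened"] == "yes").groupby(sim_data["Email"]).mean()
print(observed_rates)
```

With only 50 recipients per group, the observed rates will usually differ from 0.50 and 0.65 by a few percentage points. That sampling noise is what the Chi-Square test has to see through.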

***
### Exercise

1. In the cell below, you see the code from the narrative, which can be used to simulate a dataset for a Chi-Square test. You will notice that we have replaced all hard-coded numbers with the following variables: `sample_size`, `control_rate`, and `name_rate` (which is calculated using `control_rate` and `lift`).

    Press "Run". Inspect the output. Does it look as expected?

In [4]:
import numpy as np
import pandas as pd

sample_size = 4
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

sample_control = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])

group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)
sim_data

Unnamed: 0,Email,Opened
0,control,no
1,control,no
2,name,no
3,name,yes


2. Press "Run" a few more times and notice how the data changes each time even though you have not changed the code. This happens because we have provided probabilities for the outcomes (opened or not) rather than specific values.
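
If you ever need the same simulated data on every run (for example, to share a result with a colleague), you can seed the random number generator. The sketch below uses NumPy's `default_rng`. The helper `simulate_opens` and the seed value `123` are hypothetical choices for illustration.

```python
import numpy as np


def simulate_opens(seed, size=2, rate=0.5):
    """Draw `size` yes/no outcomes from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.choice(["yes", "no"], size=size, p=[rate, 1 - rate])


first = simulate_opens(seed=123)
second = simulate_opens(seed=123)
print(first, second)
print(np.array_equal(first, second))  # True: same seed, same draws
```

For the exercises in this lesson we deliberately leave the generator unseeded, because the run-to-run variation is the thing we want to observe.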

***

## Determining Significance

Now that we have practiced simulating data for an A/B test, let us actually run a Chi-Square test for each simulated dataset and consider the decision we would make based on the outcome.

If we were really running this test, we would want to use the data to make a decision about whether to use the control (old) or name (new) email subject. To make that decision, we can use a significance threshold. For example, if we’re using a significance threshold of 0.05, we will "reject the null hypothesis" for any p-value less than 0.05. In this context, rejecting the null would mean that we conclude that there **is** a significant difference between the open rates for the two email subjects and therefore we **should** switch to the email subject that uses the recipient’s first name.

We can use the following Python statement to record whether a particular p-value is significant or not, based on a threshold of 0.05:

```
result = ('significant' if pval < 0.05 else 'not significant')
print(result)
```
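
The same rule can be wrapped in a small function so that the threshold becomes a parameter. This is only a sketch, and the function name `classify` is hypothetical. It keeps the strict "less than" comparison used above, so a p-value exactly equal to the threshold is labeled not significant.

```python
def classify(pval, significance_threshold=0.05):
    """Label a p-value using a strict 'less than' comparison."""
    return "significant" if pval < significance_threshold else "not significant"


print(classify(0.2186))  # not significant
print(classify(0.03))    # significant
print(classify(0.05))    # not significant: equal to the threshold, not below it
```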

***
### Exercise

1. The code from the previous exercises is provided for you in the cell below. This code generates a simulated dataset named `sim_data` and then runs a Chi-Square test for that data, saving the p-value as `pval`.

    An additional variable named `significance_threshold` has been defined for you, which is equal to the significance threshold for the test. After the p-value calculation, add a line of code that uses `significance_threshold` to determine whether the p-value is `'significant'` or `'not significant'`. Save the result as `result` and print it out.

In [5]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# pre-set values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# simulate a dataset
sample_control = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])

group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)

# run a chi-square test
ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
chi2, pval, dof, expected = chi2_contingency(ab_contingency, correction=False)
print(f"p-Value: {pval:0.4f}")

# determine significance here:
result = ('significant' if pval < significance_threshold else 'not significant')

print(f"Result: {result}")


p-Value: 0.1025
Result: not significant


## Estimating Power

In the last exercise, we learned how to simulate a dataset for a Chi-Square test, run the test, and then output a result: 'significant' or 'not significant'. In this exercise, we’ll repeat that process many times so that we can inspect the relative frequency of each outcome.

To do this, we will start by creating an empty list to store the results of our repeated experiments. Next, we will move all of our simulation code (to create a sample dataset, run a Chi-Square test, and determine a result) inside of a for-loop. In each iteration of the loop, we will append the outcome to our results list so that we can inspect it later.

The outline of the code looks something like this:

```
Set the sample size and open probabilities
Create an empty list named `results`


Repeat 100 times in a for-loop:
   Simulate a dataset
   Run a Chi-Square test
   Use the p-value to determine significance
   Append the result ('significant' or 'not significant') to `results`
```

Finally, we can inspect `results` by calculating the proportion of simulated tests where the result was `'significant'`:

```
results =  np.array(results)
print(np.sum(results == 'significant') / 100)
```
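
One possible way to package the outline above as a reusable function is sketched below. It is not the code used in the exercises. The function name `estimate_power` and its parameters are hypothetical, and it draws the number of opened emails per group directly from a binomial distribution rather than building a data frame of individual `'yes'`/`'no'` outcomes. The two approaches give counts with the same distribution.

```python
import numpy as np
from scipy.stats import chi2_contingency


def estimate_power(sample_size, control_rate, lift,
                   significance_threshold=0.05, n_sims=100, seed=None):
    """Estimate the share of simulated tests with a significant result."""
    rng = np.random.default_rng(seed)
    n = sample_size // 2                      # recipients per group
    name_rate = (1 + lift) * control_rate
    n_significant = 0

    for _ in range(n_sims):
        # Number of opened emails in each group
        opens_control = rng.binomial(n, control_rate)
        opens_name = rng.binomial(n, name_rate)
        table = np.array([
            [n - opens_control, opens_control],
            [n - opens_name, opens_name],
        ])
        # The test is undefined when a whole column is zero;
        # count such a dataset as 'not significant'.
        if (table.sum(axis=0) == 0).any():
            continue
        _, pval, _, _ = chi2_contingency(table)
        if pval < significance_threshold:
            n_significant += 1

    return n_significant / n_sims


print(estimate_power(sample_size=100, control_rate=0.5, lift=0.3, seed=1))
```

Dividing by `n_sims` rather than a hard-coded 100 keeps the proportion correct if you later change the number of simulated tests.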

***
### Exercise

1. In the cell below, we have copied over the code from the previous exercise and moved the simulation inside a for-loop as described in the narrative. We have also initialized an empty list named `results`.

    Below the determination of `result`, but still inside the for-loop, add a line of code to append `result` onto `results`.

In [6]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  chi2, pval, dof, expected = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results here:
print("Proportion of significant results:")

Proportion of significant results:


2. Outside of the for-loop, add a line of code to print the proportion of `results` that are `'significant'`. Press "Run" a few times (note: you will see slightly different numbers each time because this is a random process). Approximately what proportion of the results were significant (would have led us to switch to the new, name email subject)?

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
proportion_of_significant_results = results.count('significant') / len(results)
print(f"Proportion of significant results: {proportion_of_significant_results:0.0%}")

results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results: 26%
0.26
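
The run-to-run variation can be quantified. A proportion estimated from a number of independent simulated tests has a standard error of roughly `sqrt(p * (1 - p) / n_sims)`. The sketch below applies that standard formula to the 0.26 shown above. The function name is hypothetical, and the 10,000-run comparison is our own illustration rather than something the lesson runs.

```python
import math


def proportion_standard_error(p_hat, n_sims):
    """Standard error of a proportion estimated from n_sims independent runs."""
    return math.sqrt(p_hat * (1 - p_hat) / n_sims)


se_100 = proportion_standard_error(0.26, 100)
se_10000 = proportion_standard_error(0.26, 10_000)
print(f"{se_100:.3f}")    # 0.044
print(f"{se_10000:.4f}")  # 0.0044
```

With 100 simulated tests, estimates that differ by several percentage points between runs are therefore expected. Increasing the number of simulated tests narrows the spread at the cost of a longer run time.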


## False Positives and True Positives

In the previous exercise, we simulated 100 datasets and ran a Chi-Square test for each one, recording whether the results were 'significant' or 'not significant'. This allowed us to estimate the proportion of simulated datasets that led to a 'significant' result.

In general, we hope that the test reflects reality. We therefore want the result to be 'significant' if there really **is** a difference in the probability of an open for the two email subjects (lift > 0). In that case, the proportion of significant results is the true positive rate, also called the *power* of the test. Most sample size calculators aim for a power of 80%.

On the other hand, if there is no difference in the probability of an email being opened for the two email subjects (lift = 0), a 'significant' result would be a false positive (also called a type I error). This would lead us to invest time and resources into adding first names to email subjects when there is no real payoff in the long run.

***
### Exercise

1. The simulation code from the previous exercises is loaded for you in the cell below. We have included the code to print out the proportion of tests where a significant result was recorded. Currently, the simulation is set up so that there is a difference in the probability of an open for the two email subjects.

    Press "Run" a few times and inspect the proportion of significant tests (printed to the output terminal) each time. If we ran a test with the provided sample size (100), baseline conversion rate (50%) and lift (30%), approximately what percent of the time would we correctly observe a significant result? Note that this is the "power" of the test.

In [8]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.29


2. Now, change the value of `lift` so that the proportion of significant tests is equal to the false positive rate and press "Run" once more.

    Note that the proportion of significant tests should be approximately equal to the significance threshold if you have done this correctly.

*Hint: If the proportion of significant tests is equal to the **false positive rate**, that means the significant results are wrong and there is no difference between the groups. This means `lift` should be 0.*

*Remember that lift is the inherent difference between the groups. Here, a lift of 0% will mean that we are sampling from populations that have an equal probability of a “success”.*

In [9]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.05


***

## Trade Offs

At this point, let us return to the point of view of a product manager who is actually planning this A/B test. Suppose that the product manager wants to be able to accurately detect a lift of 30% (or higher), but also wants to avoid false positives (they do not want to change the email subjects unless there is actually a difference between them). To plan their test, the product manager needs to consider the following:

* Increasing the sample size increases the power of the test (the probability of detecting a difference if there **is** one); however, larger sample sizes require more time and resources.
* Increasing the significance threshold also increases the power of the test; however, it simultaneously increases the false positive rate (the probability of detecting a difference when there **is not** one).

Finally, if the product manager chooses a larger minimum detectable effect/lift, then they will be able to decrease the sample size without decreasing power. However, if they set up their test to detect a minimum lift of 30% (for example), they may not be able to detect smaller differences that are still meaningful.
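
These trade-offs can also be illustrated without simulation by using a normal approximation to the power of a two-sided test of two proportions. This is a different technique from the simulation used in this lesson, and the function name `approx_power` is hypothetical. Because the approximation ignores the continuity correction that `chi2_contingency` applies by default, treat its numbers as a rough guide; simulated values will typically come out somewhat lower.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(sample_size, control_rate, lift, significance_threshold=0.05):
    """Normal-approximation power of a two-sided test of two proportions."""
    n = sample_size / 2                       # recipients per group
    p1 = control_rate
    p2 = (1 + lift) * control_rate
    p_bar = (p1 + p2) / 2
    z_crit = NormalDist().inv_cdf(1 - significance_threshold / 2)
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    # Ignores the (tiny) chance of rejecting in the wrong direction.
    return NormalDist().cdf((abs(p2 - p1) - z_crit * se_null) / se_alt)


for size, alpha, lift in [(100, 0.05, 0.3), (500, 0.05, 0.3),
                          (500, 0.10, 0.3), (500, 0.10, 0.4)]:
    power = approx_power(size, 0.5, lift, alpha)
    print(f"n={size}, alpha={alpha}, lift={lift}: power ~ {power:.2f}")
```

Each row changes one setting relative to the previous row, and the approximate power rises at every step: first with a larger sample, then with a looser threshold, then with a larger lift.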

***
### Exercise

1. The simulation code from the previous exercises is provided for you in the cell below. Currently, the simulation is set up to use an open rate of 50% for the control email, and a lift of 30% for the name email subject. Set the sample size to 100, press "Run", and make note of the proportion of significant results (which is the power of the test).

In [10]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.27


2. Now increase the sample size to `500` and press "Run" again. Note that the power of the test also increases.

In [11]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 500
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.98


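3. Next, increase the significance threshold to `0.10`. On average, the power of the test increases even more. Because each run uses only 100 simulated tests, a single run can land slightly below the previous one by chance.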

In [12]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.1
sample_size = 500
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.94


4. Finally, increase the lift to 40%. Note that again, the power of the test increases.

In [13]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.1
sample_size = 500
lift = 0.4
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control']*int(sample_size / 2) + ['name']*int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  _, pval, _, _ = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
1.0


## Review

Congratulations! You have now learned how to use simulation to investigate the trade-offs for an A/B test sample-size calculation. As a recap, this lesson covered the following:

* The significance threshold for a test is equal to the false positive rate
* The power of a test is the probability of getting a significant result when there really is a difference between the groups
* Increasing sample size increases the power of a test
* Increasing the significance threshold increases power, but also increases the false positive rate
* Larger sample sizes are needed to detect smaller effect sizes

Two notes about the terminology in the sample size calculator:

* **Baseline conversion rate** is equivalent to our `control_rate` in the code.
* **Minimum detectable effect (MDE)** is the smallest effect size (or `lift`) that we want our test to be able to detect. If the MDE is larger than our true `lift`, power will decrease because our sample size might not be large enough to detect the difference between the two groups.
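
To make the link between these inputs and the required sample size concrete, the sketch below uses one common normal-approximation formula for comparing two proportions. The lesson does not state which formula the sample size calculator uses, so this is an assumption and its results may differ somewhat from the calculator's. The function name `approx_sample_size` is hypothetical, and the MDE is treated as a relative effect, like `lift` in the code.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_sample_size(baseline_rate, mde, significance_threshold=0.05, power=0.80):
    """Approximate total sample size (both groups) for a two-sided test."""
    p1 = baseline_rate
    p2 = (1 + mde) * baseline_rate            # relative effect, like `lift`
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - significance_threshold / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    per_group = ceil(numerator / (p2 - p1) ** 2)
    return 2 * per_group


print(approx_sample_size(0.5, 0.3))  # baseline 50%, MDE 30%
print(approx_sample_size(0.5, 0.4))  # baseline 50%, MDE 40%
```

The larger MDE requires fewer recipients, which matches the last bullet in the recap: smaller effects need larger samples.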

***
### Exercise

In [14]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 100
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  chi2, pval, dof, expected = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.16


1. As a final exercise, we have provided a <a href="a_b_test_calculator.html">sample size calculator</a> for an A/B test, along with the simulation code from the previous exercises. The calculator estimates the sample size needed to achieve 80% power. Plug in the following values to the sample size calculator:

    * Baseline rate: 50%
    * Minimum detectable effect: 30%
    * Significance threshold: 5%

    Then, set the sample size for the simulation code equal to the sample size indicated by the calculator. Press "Run" and inspect the proportion of tests that were significant. The proportion should be close to 0.80!

In [15]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 330
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  chi2, pval, dof, expected = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.79


2. Let us now examine how MDE impacts the power of our test. Change the MDE in the calculator to 40% so that you have:

    * Baseline rate: 50%
    * Minimum detectable effect: 40%
    * Significance threshold: 5%

    Update the `sample_size` in our simulator to match the new sample size given by the calculator. Press "Run" and inspect the proportion of tests that were significant. Now that our MDE is *larger* than our actual effect, what happens to our power?

*Hint: When the Minimum Detectable Effect is larger than our actual effect, power decreases as our test does not have a large enough sample size to detect the small effect (AKA lift).*

In [16]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# preset values
significance_threshold = 0.05
sample_size = 180
lift = 0.3
control_rate = 0.5
name_rate = (1 + lift) * control_rate

# initialize an empty list of results
results = []

# start the loop
for i in range(100):
  # simulate data:
  sample_control = np.random.choice(['yes', 'no'],  size=int(sample_size / 2), p=[control_rate, 1 - control_rate])
  sample_name = np.random.choice(['yes', 'no'], size=int(sample_size / 2), p=[name_rate, 1 - name_rate])
  group = ['control'] * int(sample_size / 2) + ['name'] * int(sample_size / 2)
  outcome = list(sample_control) + list(sample_name)
  sim_data = {"Email": group, "Opened": outcome}
  sim_data = pd.DataFrame(sim_data)

  # run the test
  ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
  chi2, pval, dof, expected = chi2_contingency(ab_contingency)
  result = ('significant' if pval < significance_threshold else 'not significant')

  # append the result to our results list:
  results.append(result)

# calculate proportion of significant results:
print("Proportion of significant results:")
results =  np.array(results)
print(np.sum(results == 'significant') / 100)

Proportion of significant results:
0.47
