# A/B Testing

## Reading

[Chapter 12: 12.1 - 12.3](https://inferentialthinking.com/chapters/12/Comparing_Two_Samples.html)<br><br>

Up until now we've done statistical analysis with one data sample to arrive at a conclusion about the data. But often data scientists also need to make decisions based on comparing two data samples with each other, such as when comparing two groups of patients who were given either a new treatment or a placebo.

In this notebook we learn how to find the similarities and differences between two samples through A/B testing.

---

## Comparing Two Samples

In data analytics, A/B testing is a statistical method to compare two versions of an experiment, process, or behavior. The steps in A/B testing include:
- Collecting and measuring data
- Formulating hypotheses
- Statistical analysis
- Interpreting the result

To see how A/B testing works, we use the textbook's dataset of newborns and their mothers. In this example we will see whether a mother being a smoker affects the baby's birth weight.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/baby.csv"
births = pd.read_csv(url)
print("First 5 rows:")
births.head()

---

### Collecting and Measuring Data

We use the `Birth Weight` and `Maternal Smoker` columns to create a DataFrame with 2 columns. Then we find the number of maternal smokers and non-smokers.

In [None]:
# create a new DataFrame with the 2 columns
birthweight_and_smoker = births[["Birth Weight", "Maternal Smoker"]].copy()
print("First 5 rows of birthweight_and_smoker:")
display(birthweight_and_smoker.head())

# find the number of smokers and nonsmokers
print("\nNumber of maternal smokers and nonsmokers:")
birthweight_and_smoker["Maternal Smoker"].value_counts()

For our A/B testing, we're comparing the birth weight of babies born to smokers and nonsmokers. Our A and B groups are the smokers and nonsmokers.

To visually compare the baby birth weights, we plot the baby birth weight distribution for smokers and nonsmokers.

Since the numbers of smokers (459) and nonsmokers (719) are not the same, we use the parameter `density=True` with the `plt.hist` function so that we can compare them fairly. Recall from the Module 3 Plots notebook that `density=True` normalizes each histogram so that the total area of its bars is 1, which means the bars show how each group's values are distributed instead of being a simple count of data. This way the two histograms are the _probability distributions_ and can be compared fairly, even though the groups differ in size.
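
The effect of `density=True` can be checked numerically. The following is a minimal sketch that uses made-up values rather than the births data; the group names, means, and bin edges are assumptions chosen only for illustration. It uses `np.histogram`, which accepts the same `density` parameter, to show that each normalized histogram has a total area of 1 regardless of group size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in values (NOT the births data): two groups of unequal size
group_a = rng.normal(loc=100, scale=15, size=459)
group_b = rng.normal(loc=110, scale=15, size=719)

bins = np.arange(0, 221, 10)  # shared bin edges, each bin 10 units wide
areas = {}
for name, values in [("A", group_a), ("B", group_b)]:
    counts, _ = np.histogram(values, bins=bins)                 # raw counts per bin
    density, _ = np.histogram(values, bins=bins, density=True)  # normalized bar heights
    areas[name] = float(np.sum(density * np.diff(bins)))        # sum of height * width
    print(name, "tallest bar (count):", counts.max(), "| total area:", round(areas[name], 6))
```

The raw counts depend on the group size, while the normalized areas do not, which is why the two density histograms can be overlaid on the same axes.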

In [None]:
plt.figure(figsize=(5, 3))
smokers = birthweight_and_smoker[birthweight_and_smoker["Maternal Smoker"] == True]
nonsmokers = birthweight_and_smoker[birthweight_and_smoker["Maternal Smoker"] == False]
plt.hist(smokers["Birth Weight"], density=True, alpha=0.4, label="Smoker")
plt.hist(nonsmokers["Birth Weight"], density=True, alpha=0.4, label="Nonsmoker")
plt.legend()
plt.title("Birth Weight Distribution")
plt.xlabel("Birth Weight")
plt.grid()
plt.show()

It looks like babies of nonsmokers tend to have higher birth weights than babies of smokers.

But since the dataset is a sample of all the births, how do we know that the difference in birth weights is due to smoking and not due to the mothers that happened to be chosen for the sample? It is possible that during the random selection of all the mothers, the sample happened to end up with more of the smokers who have lower birth weight babies than the general population of smokers.

To answer the question of whether smoking makes a difference in birth weights, we will continue with the next steps of A/B testing.

### Formulating Hypotheses

Recall from the Module 6 class notes that the _null_ hypothesis states that there's _no difference_ between the two samples. We have our two hypotheses:

> The null hypothesis: Babies generally have the same birth weights, regardless of whether the mother is a smoker or nonsmoker. Any difference in birth weights is only due to chance in the random sampling.

> The alternative hypothesis: Babies of smokers generally have lower birth weights than babies of nonsmokers.

---

### Statistical Analysis

- <u>The Test Statistic</u>

Since we're comparing the birth weights between the 2 groups, the test statistic is the difference between the average birth weights of the 2 groups.

We write a function to find this statistic, given 2 inputs: the DataFrame and the name of the column that differentiates the 2 groups, which is `Maternal Smoker` in this case.

The function returns the average birth weight of the smokers minus the average birth weight of the nonsmokers.

In [None]:
def find_statistic(dataframe, column):
  """Return the mean birth weight of the True group minus that of the False group."""
  smokers = dataframe[dataframe[column] == True]
  nonsmokers = dataframe[dataframe[column] == False]
  return np.mean(smokers["Birth Weight"]) - np.mean(nonsmokers["Birth Weight"])

# run the function to find the mean birth weight difference,
# using the observed data from the dataset
print(find_statistic(birthweight_and_smoker, "Maternal Smoker"))

The negative difference shows that, in this sample, babies of smokers have a lower average birth weight than babies of nonsmokers.
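
To make the arithmetic of the statistic concrete, here is a small sketch on a made-up five-row table. The helper name `difference_of_means` and its parameters are hypothetical; it is one possible generalization in which the value column is passed in rather than fixed to `Birth Weight`.

```python
import pandas as pd

def difference_of_means(dataframe, value_column, group_column):
    """Mean of value_column where group_column is True, minus the mean where it is False."""
    in_group = dataframe[group_column].astype(bool)
    true_mean = dataframe.loc[in_group, value_column].mean()
    false_mean = dataframe.loc[~in_group, value_column].mean()
    return true_mean - false_mean

# Tiny made-up table, for illustration only
toy = pd.DataFrame({
    "Birth Weight": [100, 110, 120, 130, 140],
    "Maternal Smoker": [True, True, False, False, False],
})

difference = difference_of_means(toy, "Birth Weight", "Maternal Smoker")
# (100 + 110) / 2 - (120 + 130 + 140) / 3 = 105 - 130
print(difference)  # -25.0
```

The sign of the result depends on the order of subtraction, so a negative value here means the `True` group has the lower average.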

- <u>The Random Permutation</u>

If we only use the given dataset, we would end up with the conclusion that babies of smokers generally have lower birth weights. But we need more testing to be more certain with our conclusion. Given that we only have one dataset to work with, we need to randomize the smokers and nonsmokers in the dataset by using a _random permutation_.

Since the null hypothesis states that there is no difference in the baby birth weights, whether the mother is a smoker or nonsmoker, we will create a new maternal smoker column that consists of randomly shuffled True and False values from the `Maternal Smoker` column.

Through random shuffling, the number of smokers and nonsmokers will still be the same, but each smoker or nonsmoker is at a new random location in the column. This means that each observed birth weight can be randomly paired with a smoker or nonsmoker in the new column. The shuffling is called the _random permutation_ of the `Maternal Smoker` column.

To randomly shuffle the `Maternal Smoker` column, we use the same `sample()` method to select data values from the column, but this time we use `replace=False` for no replacement and we use `frac=1` so 100% of the data are selected.

In [None]:
# random shuffle of Maternal Smoker
shuffled = birthweight_and_smoker["Maternal Smoker"].sample(frac=1, replace=False)
print("First 5 rows of shuffled column:")
shuffled.head()

The index values show us that the rows of `Maternal Smoker` have been shuffled randomly.

We combine the shuffled data with the original `birthweight_and_smoker` DataFrame.

In [None]:
# add the shuffled column to birthweight_and_smoker,
# and reset the index so it matches the index of the DataFrame
birthweight_and_smoker["Shuffled"] = shuffled.reset_index(drop=True)
print("\nFirst 10 rows of DataFrame with shuffled column:")
display(birthweight_and_smoker.head(10))
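
The `reset_index(drop=True)` step matters because pandas matches rows by index label when a Series is assigned as a new column. The sketch below uses a made-up four-row table (the column names `Label`, `Aligned`, and `Shuffled` are only for illustration) to show what happens with and without the reset.

```python
import pandas as pd

toy = pd.DataFrame({"Label": ["a", "b", "c", "d"]})
shuffled = toy["Label"].sample(frac=1, replace=False, random_state=1)

# Direct assignment: pandas lines rows up by index label,
# so every value returns to its original row and the shuffle is undone
toy["Aligned"] = shuffled

# Resetting the index first discards the old labels,
# so the values stay in their shuffled order
toy["Shuffled"] = shuffled.reset_index(drop=True)

print(toy)
```

The `Aligned` column always matches `Label` exactly, while the `Shuffled` column holds the same values in the shuffled order.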

Now we compare the difference in average birth weight for the original groups with the difference for the shuffled groups.

In [None]:
print("Original data:", find_statistic(birthweight_and_smoker, "Maternal Smoker"))
print("Shuffled data:", find_statistic(birthweight_and_smoker, "Shuffled"))

We see that the difference in average birth weights between the 2 groups is much closer to 0 for the shuffled data.

We need to repeat the shuffle and the calculation of the difference many times in a simulation.

- <u>The Simulation</u>

We loop 5000 times to shuffle the `Maternal Smoker` column, run the `find_statistic` function, save the statistic, and display their distribution.

In [None]:
# combine the shuffling and the difference calculation in a function

def one_sample(dataframe):
  # create a shuffled Maternal Smoker column
  shuffled = dataframe["Maternal Smoker"].sample(frac=1, replace=False)
  dataframe["Shuffled"] = shuffled.reset_index(drop=True)
  # calculate the birth weight difference
  return find_statistic(dataframe, "Shuffled")

# test the function by running it with a sample
print(one_sample(birthweight_and_smoker))

Running the code cell above several times produces slightly different results due to the chance of the random shuffle, but we note that the results are consistently much smaller in magnitude than the observed difference in birth weights.

Next we let the computer repeat the difference calculation of one sample, save the differences, and plot the distribution.

In [None]:
L = []
for i in range(5000):
    L.append(one_sample(birthweight_and_smoker))
statistic_array = np.array(L)
print(len(statistic_array))

In [None]:
plt.figure(figsize=(4, 3))
plt.hist(statistic_array, density=True, edgecolor='black')
plt.title("Birth Weight Difference Distribution")
plt.xlabel("Birth Weight Difference")
plt.grid()
plt.show()

We can see that the distribution is centered around 0. This is what the null hypothesis predicts: if smoking made no difference, the average birth weights of the smoker and nonsmoker groups would differ only by chance, so the simulated differences cluster around 0.

We also note that the observed difference in birth weights, which is -9.27, doesn't even appear on the x-axis range of -4 to +4 in the plot.

### Interpreting the Results

Because the observed birth weight difference of -9.27 is far from the differences predicted under the null hypothesis, which center around 0, we can conclude that the data rejects the null hypothesis.

We also calculate the p-value, which is the proportion of simulated differences that are at or past the observed difference, in the direction of the alternative hypothesis:

In [None]:
np.count_nonzero(statistic_array <= -9.27) / len(statistic_array)

We see that the p-value agrees with the fact that -9.27 is not within the range of the simulated difference. As discussed in the previous notebook, the very small p-value is highly statistically significant and therefore also rejects the null hypothesis. We can conclude that for this study that looks at smoking as a factor, babies born to nonsmoker mothers generally have higher birth weights than those of smokers.
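
The whole procedure can also be written as one self-contained sketch. The version below is an illustration under stated assumptions: it uses made-up measurements rather than the births data, shuffles the labels with NumPy's `rng.permutation` instead of `sample()`, and compares against the computed observed statistic rather than a typed-in number. The names `mean_difference`, `values`, and `labels` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up measurements for two groups (NOT the births data)
values = np.concatenate([rng.normal(100, 10, size=40), rng.normal(110, 10, size=60)])
labels = np.array([True] * 40 + [False] * 60)

def mean_difference(values, labels):
    """Mean of the True group minus mean of the False group."""
    return values[labels].mean() - values[~labels].mean()

observed = mean_difference(values, labels)

# Shuffle the labels many times and recompute the statistic each time
repetitions = 5000
simulated = np.array([
    mean_difference(values, rng.permutation(labels))
    for _ in range(repetitions)
])

# The alternative says the True group has the lower mean, so we look at the left tail
p_value = np.count_nonzero(simulated <= observed) / repetitions
print("Observed difference:", observed)
print("Empirical p-value:", p_value)
```

Shuffling a copy of the labels leaves the original data untouched, so the observed statistic can be recomputed at any point without being affected by the simulation.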

---

## Causality

One common application of A/B testing is with _randomized controlled experiments_, where the data come from 2 groups: a group that receives a treatment and a control group that receives no treatment. Since the subjects are assigned to the two groups randomly, if the outcomes of the two groups differ by more than we would expect from chance alone, then we have evidence of _causation_, which is that the treatment causes the different outcomes.

We use the same textbook dataset of patients with chronic back pain and their treatment. The name of the treatment is Botulinum Toxin A or BTA. In the trial, 31 patients were randomly assigned so that 15 people received the BTA treatment and 16 people received saline solution only. Eight weeks after the treatment, the patients reported whether they felt relief from back pain. In the dataset, a result of 1 means pain relief, and 0 means no pain relief.
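
As an illustration of what random assignment means, the following sketch splits 31 hypothetical patient IDs into groups of 15 and 16. This is only one possible way to randomize and is not a description of how the actual trial assigned its patients.

```python
import numpy as np

rng = np.random.default_rng(7)

patient_ids = np.arange(31)                   # hypothetical IDs 0..30
shuffled_ids = rng.permutation(patient_ids)   # random order of all patients

treatment_ids = shuffled_ids[:15]   # first 15 in the random order get the treatment
control_ids = shuffled_ids[15:]     # remaining 16 form the control group

print("Treatment:", sorted(treatment_ids))
print("Control:  ", sorted(control_ids))
```

Because chance alone decides the groups, no characteristic of a patient can influence which group the patient ends up in.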

In [None]:
url = "https://raw.githubusercontent.com/DeAnzaDataScience/CIS11/refs/heads/main/datasets_notes/bta.csv"
bta = pd.read_csv(url)
bta.Result = bta.Result.astype(int)  # convert input from float to int
print("First 5 rows:")
bta.head(5)

We check the number of patients in each group, and the number of patients who reported pain relief in each group.

In [None]:
bta.groupby("Group").Result.value_counts()

For each group we also find the proportion of people with pain relief, which happens to also be the average of the `Result` column, because the values of `Result` are only 0 and 1.

For example, for the control group:
- the proportion of people with pain relief is 2/16 = 0.125
- the average of the `Result` values is ((0 * 14)+(1 * 2))/16 = 0.125
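
The arithmetic above can be checked directly. This small sketch rebuilds the control group's 0 and 1 values from the counts just given (14 patients without relief, 2 with relief) instead of reading them from the dataset.

```python
import numpy as np

# Control group rebuilt from the counts above: 14 zeros and 2 ones
control = np.array([0] * 14 + [1] * 2)

proportion = np.count_nonzero(control == 1) / len(control)  # 2 / 16
average = control.mean()                                    # (0*14 + 1*2) / 16

print(proportion, average)  # 0.125 0.125
```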

In [None]:
bta.groupby("Group").Result.mean()

It looks like the treatment made a difference because 60% of patients in the Treatment group had pain relief, while only 12.5% of patients in the Control group had pain relief.

But just as in the previous example of birth weights and maternal smokers, for this case we can't tell whether the difference in the results was due to the treatment or due to chance during the random assignment of patients into the groups.

We will go through the same steps of A/B testing as before to see if the treatment causes the difference in outcome.

- <u>The Hypotheses</u>

> The null hypothesis: the treatment has no effect on pain relief. The distribution of the outcomes for all 31 patients is the same whether they are in the control group or treatment group.

> The alternative hypothesis: the distribution of the outcomes for all 31 patients is different, depending on whether they are in the control group or treatment group.

- <u>The Test Statistic</u>

Since the hypotheses are based on whether the outcomes of the two groups are different or the same, we set the test statistic to be the difference between the proportions of pain relief in the two groups. If the proportions are the same, then it means the outcomes are the same.

In the observed case above, the difference in proportions is: |0.6 - 0.125| = 0.475.<br>
We now write a function to calculate the statistic.

In [None]:
def get_statistic(column):
  # find the proportions of pain relief in each group
  proportions = bta.groupby(column).Result.mean().values
  # find the difference between the proportions
  return np.abs(proportions[0] - proportions[1])

# test with the observed data
print(get_statistic("Group"))

- <u>The Random Permutation</u>

Now we randomly shuffle the `Group` column and store the shuffled groups as a new column of the bta DataFrame.

In [None]:
def a_sample():
  # shuffling code
  shuffled = bta["Group"].sample(frac=1, replace=False)
  bta["Shuffled"] = shuffled.reset_index(drop=True)

  # difference calculation code
  return get_statistic("Shuffled")

# test the function by running it with the current sample
print(a_sample())

If we run the `a_sample()` function a few times, we see that the output varies due to the random shuffle, but it is usually much smaller than the observed difference of 0.475.

- <u>The Simulation</u>

We now write a loop to run `a_sample()` multiple times and plot the distribution of the outcome.

In [None]:
L = []
for i in range(10000):
    L.append(a_sample())
statistic_array = np.array(L)
print(len(statistic_array))

In [None]:
plt.figure(figsize=(4, 3))
plt.hist(statistic_array, density=True, bins = np.arange(0, 0.7, 0.1), edgecolor='black')
plt.title("Difference Distribution")
plt.xlabel("Difference")
plt.grid()
plt.show()

- <u>The Conclusion</u>

The observed difference of 0.475 is at the tail end of the distribution, which means the data rejects the null hypothesis.

The p-value is calculated below:

In [None]:
print(np.count_nonzero(statistic_array >= 0.475) / len(statistic_array))

The very small p-value is highly statistically significant and also rejects the null hypothesis. We can conclude that the treatment makes a difference in pain relief.
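
For reference, the same test can be sketched without the CSV file. The arrays below are rebuilt from the counts quoted in this notebook (16 control patients with 2 reporting relief, 15 treatment patients with 60%, or 9, reporting relief), which is an assumption rather than a read of `bta.csv`. The names `proportion_gap`, `results`, and `is_treatment` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rebuilt from the counts quoted in this notebook (not loaded from bta.csv):
# control: 2 with relief, 14 without; treatment: 9 with relief, 6 without
results = np.array([1] * 2 + [0] * 14 + [1] * 9 + [0] * 6)
is_treatment = np.array([False] * 16 + [True] * 15)

def proportion_gap(results, is_treatment):
    """Absolute difference between the two groups' proportions of pain relief."""
    return abs(results[is_treatment].mean() - results[~is_treatment].mean())

observed = proportion_gap(results, is_treatment)

# Shuffle the group labels and recompute the statistic each time
repetitions = 10_000
simulated = np.array([
    proportion_gap(results, rng.permutation(is_treatment))
    for _ in range(repetitions)
])

# A tiny tolerance guards against floating-point round-off in the comparison
p_value = np.count_nonzero(simulated >= observed - 1e-12) / repetitions
print("Observed difference:", observed)
print("Empirical p-value:", p_value)
```

Comparing against the computed `observed` value, rather than a typed-in 0.475, keeps the comparison consistent with how the simulated values are calculated.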

- <u>Causality</u>

Because the patients were randomly placed into the groups, our A/B test result above is evidence that there is _causation_: the treatment causes a difference in pain relief.

If the patients were _not_ randomly placed into the groups, and instead chose to be in a particular group, then our A/B test result above could only show that there is an _association_ between the treatment and pain relief; it could not show _causation_. This is because when the patients choose their own group, there could be other causes for the difference in pain relief. For example, a patient who believes that the treatment will work will most likely choose to be in the group to receive the treatment, and this person is more likely to report pain relief based on a personal belief in the treatment. Factors such as someone's belief in the treatment are called _confounding factors_, and they make it difficult to prove causation.

---

In this notebook we applied our sampling and simulation skills to A/B testing, which is used to compare 2 groups of data. We also saw the importance of randomization and how random assignment lets us determine causation.