# Frequentist A/B Testing

**Notebook by Carlos Santillán**

**Based on Vatsal's article in TowardsDataScience**

*The purpose of this notebook is to serve me as a reference guide when this material is needed*

**April 2022**


A/B testing is commonly used across industries to make decisions in different aspects of the business. From writing emails to choosing landing pages and implementing specific feature designs, A/B testing can be used to choose among options based on statistical analysis.

## What is A/B Testing?


Inferential statistics is often used to infer something about the population based on the observations on a sample of that population. A/B testing is the application of inferential statistics for researching user experience. It consists of randomized experiments with two variants, A and B [1]. 

Generally, by testing each variant against user response, one can either find statistical evidence that one variant performs better than the other or conclude that the difference between the two is not statistically significant.

Companies commonly run A/B tests to make decisions that improve the conversion rates of their users, improve the marketability of their products, increase their daily active users, and so on.

## Frequentist Approach

This is the more traditional approach to statistical inference, and it is commonly introduced in introductory university statistics courses. A simple outline of the approach is as follows:

1. Identify the null and alternative hypothesis
  - **Null hypothesis** : there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.
  - **Alternative hypothesis** : a hypothesis which contradicts the null hypothesis

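2. Calculate the sample size needed to detect the effect of interest at the chosen confidence level (usually 95%)


3. Calculate the test statistic and map it to a p-value


4. Reject or fail to reject the null hypothesis based on whether the p-value is smaller or larger than the significance level (equivalently, whether the test statistic lies beyond the critical value)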
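Step 2 can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch, not the article's own method: the function name `sample_size_per_variant`, the 10% baseline rate, the 12% target rate, and the 80% power are all assumptions chosen for illustration.

```python
import math
from scipy.stats import norm

def sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided
    test of two proportions (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile matching the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    n = (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2
    return math.ceil(n)

# Hypothetical rates: 10% baseline conversion, 12% hoped-for conversion
n_required = sample_size_per_variant(0.10, 0.12)
print(n_required)
```

Smaller expected differences between the variants drive the required sample size up quickly, because the difference enters the formula squared in the denominator.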

### Null & Alternative Hypothesis

Identifying a hypothesis to test generally comes from domain knowledge of the problem at hand.

The **null hypothesis is generally a statement regarding the population that is assumed to be true until the data provide evidence against it**. 

The **alternative hypothesis is a statement which contradicts the null hypothesis**. 

A simple example can be outlined in the following scenario: you want to increase the conversion rate of users visiting your website by adding a distinct feature. The null hypothesis would be that adding this feature to the website will have no impact on the conversion rate. The alternative hypothesis would be that adding this new feature will impact the conversion rate.
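Written formally, the scenario above might look as follows. The symbols $p_A$ (conversion rate without the feature) and $p_B$ (conversion rate with the feature) are notation assumed here for illustration; they do not appear in the original article.

$$H_{0}: p_{B} = p_{A} \qquad \text{versus} \qquad H_{1}: p_{B} \neq p_{A}$$

This is the two-sided form, which matches "will impact the conversion rate". If only an improvement were of interest, a one-sided alternative $H_{1}: p_{B} > p_{A}$ would be the usual choice.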

### Sample Mean Estimate

The sample mean estimate from a group of observations is essentially an estimate of the population mean. It can be represented by the following formula:

$$\hat{\mu} = \frac{1}{N} \sum^{N}_{i=1} x_{i}$$

Where:

- $N$ represents the total number of samples

- $x_{i}$ represents the value of the $i$-th observation (for example, the number of occurrences of an event for the $i$-th user)

In an ideal situation the difference between the sample mean estimates of the two variations (A and B) would be large. The larger this difference is relative to the variability of the data, the larger the test statistic, and the clearer the winner between the variations.
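As a small illustration of the sample mean estimate, the sketch below computes it for two variants. The data are simulated, and the conversion rates (10% and 12%), the sample sizes, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data: 1 = the user converted, 0 = the user did not
visits_a = rng.binomial(n=1, p=0.10, size=5000)
visits_b = rng.binomial(n=1, p=0.12, size=5000)

# Sample mean estimate: sum of the observations divided by their count
mean_a = visits_a.sum() / visits_a.size
mean_b = visits_b.sum() / visits_b.size
observed_diff = mean_b - mean_a

print(f"A: {mean_a:.4f}  B: {mean_b:.4f}  difference: {observed_diff:.4f}")
```

For 0/1 conversion data the sample mean is simply the observed conversion rate of each variant.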

### Confidence Intervals

A confidence interval is a range of values constructed so that, over repeated sampling, a specified proportion of such intervals (the confidence level, e.g. 95%) contain the true value of the parameter. For a mean, it is given by the following formula:

$$\left [ \hat{\mu}-t_{\alpha, N-1} \frac{\hat{\sigma}}{\sqrt{N}},\; \hat{\mu}+t_{\alpha, N-1} \frac{\hat{\sigma}}{\sqrt{N}} \right ]$$

Where:

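- $\hat{\mu}$ represents the sample mean estimate

- $t_{\alpha, N-1}$ is the critical value of the t-distribution with $N-1$ degrees of freedom for the chosen confidence level

- $\hat{\sigma}$ is the sample standard deviation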

- $N$ is the sample size
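The interval can be computed directly from the formula. In this sketch the function name `mean_confidence_interval` and the sample values are hypothetical, and the critical value is taken at $1 - \alpha/2$, which is an assumption about how the two-sided $t_{\alpha, N-1}$ in the formula is meant to be read.

```python
import numpy as np
from scipy.stats import t

def mean_confidence_interval(x, confidence=0.95):
    """Two-sided t-based confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu_hat = x.mean()
    sigma_hat = x.std(ddof=1)                 # sample standard deviation
    alpha = 1 - confidence
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # two-sided critical value
    margin = t_crit * sigma_hat / np.sqrt(n)
    return mu_hat - margin, mu_hat + margin

# Hypothetical measurements
sample = [12.1, 9.8, 11.4, 10.6, 10.9, 11.7, 9.5, 10.2]
lower, upper = mean_confidence_interval(sample)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Raising the confidence level or shrinking the sample widens the interval, since both increase the margin term.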

### Test Statistics

The test statistic is a standardized value that shows how far the observed sample estimate lies from the value assumed under the null hypothesis, measured in standard errors. There are various formulations of the test statistic depending on the sample size and other factors. A few variations of the formula are shown in the table below.

| **Test for** | $H_{0}$  | **Test Statistic**  | **Use when**  |
|---|---|---|---|
| Pop. mean $\mu$  | $\mu = \mu_{0}$  | $z =\frac{\bar{x}-\mu_{0}}{\frac{\sigma}{\sqrt{n}}}$  | Normal dist. or $n > 30$, $\sigma$ known  |
| Pop. mean $\mu$  | $\mu = \mu_{0}$  | $t =\frac{\bar{x}-\mu_{0}}{\frac{s}{\sqrt{n}}}$  | $n < 30$ and/or $\sigma$ unknown  |
| Pop. prop. $p$  | $p = p_{0}$  | $z=\frac{\hat{p}-p_{0}}{\sqrt{\frac{p_{0}(1-p_{0})}{n}}}$  | $n\hat{p} \geq 10,\; n(1-\hat{p}) \geq 10$|

Based on the value of the test statistic, one can map it to a p-value and either reject or fail to reject the null hypothesis depending on whether the p-value is below or above the chosen significance level.
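To make the mapping from test statistic to p-value concrete, the sketch below applies the t formula from the second row of the table and cross-checks it against `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean `mu_0` are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample with unknown population standard deviation
sample = np.array([10.4, 11.2, 9.7, 10.9, 11.5, 10.1, 10.8, 11.0])
mu_0 = 10.0   # hypothesized population mean under the null

n = sample.size
# t = (x_bar - mu_0) / (s / sqrt(n)), as in the table
t_stat = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under the null
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Cross-check against SciPy's built-in one-sample t-test
t_ref, p_ref = stats.ttest_1samp(sample, popmean=mu_0)
print(t_stat, p_value)
```

The manual calculation and the library call should agree up to floating-point error, which is a useful sanity check when writing the formulas by hand.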

### P-Value

In statistics, a *p-value* (also called probability value) is the probability, calculated under the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed.

- If the p-value is low, the observed data would be unlikely if the null hypothesis were true, and the experiment is statistically significant evidence for the alternative. 

In many fields, an experiment must have a p-value of less than 0.05 to be considered evidence for the alternative hypothesis. In short, **a low p-value means the data are hard to explain under the null hypothesis.** Note that the p-value is not the probability that the null hypothesis is true or false. Once a p-value is obtained, interpreting the result is fairly simple: compare it with the chosen threshold.
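Because the p-value is calculated assuming the null hypothesis is true, a threshold of 0.05 means that roughly 5% of experiments with no real effect will still come out "significant". The simulation below is a rough sketch of that idea; the number of users, the number of experiments, and the null proportion of 0.5 are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=0)

n_users = 1000          # users per simulated experiment (hypothetical)
n_experiments = 10_000  # number of simulated experiments (hypothetical)
p_null = 0.5            # true proportion, so the null hypothesis holds

# Simulate experiments in which the null hypothesis is exactly true
successes = rng.binomial(n=n_users, p=p_null, size=n_experiments)
p_hat = successes / n_users

# One-sample proportion z statistic and its two-sided p-value
z = (p_hat - p_null) / np.sqrt(p_null * (1 - p_null) / n_users)
p_values = 2 * norm.sf(np.abs(z))

# Share of experiments flagged as significant despite there being no effect
false_positive_rate = np.mean(p_values < 0.05)
print(false_positive_rate)
```

The printed share should land near 0.05, though not exactly on it, since the counts are discrete and the normal distribution is only an approximation.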

#### Coffee Cup Explanation

"A new study says that if you drink more than four cups a day, it might increase the risk of an early death." "At this coffee shop in Atlanta today, excitement is brewing over new research showing that morning cup of Joe may help you live longer. Just one cup a day lowered the chance by 12%. Bump that to three cups a day and the risk drops by 18%." So if you're drinking coffee right now, you're probably wondering which study to believe.

Well, today I'm going to show you how to judge the true strength of the evidence so that you can make sense of conflicting data. The key is to recognize when the evidence is exaggerated. Well, to start with, a scientific study is always more persuasive when the results are replicated or confirmed in a second independent study. A bullseye on the first shot might be lucky, but two in a row is probably not a random accident. Unfortunately, the two coffee studies totally contradict one another. 

In fact, there are so many examples of contradictory studies that one prominent scientist suggested that most published research could be false. For example, the FDA took a look at 22 cases where the results of phase 2 contradicted the results of phase 3. Some scientists even think we might have a replication crisis. 

Well, the main character in this controversy is the gatekeeper of P-values called statistical significance. A p-value is a value between 0 and 1 that tells you how surprising the data is if there's no effect other than random variation. Now, smaller p-values represent stronger evidence of a real effect. Let's look at an example from one of the early cholesterol drugs. Pravastatin reduces the risk of coronary events by 31% in patients with high cholesterol who've never had a heart attack. The p-value for Pravastatin is less than .001. This means that we would expect to see a 31% reduction in coronary events purely by chance, less than one in a thousand times. 

So this data's too surprising to be random luck. And it's evidence of a real drug effect. Now, the key point is that smaller p-values represent stronger evidence of a real drug effect. Now, this begs the question, when are p-values small enough to represent statistically significant evidence? Well, the traditional threshold for statistical significance is .05 or one in 20. Now, p-values less than .05 represent drug effects that are large enough to be surprising and worth a little bit more research, because we would expect that purely by chance less than one in 20 times. It's enough proof for publication. 

Now, p-values above .05 represent drug effects that are more likely to be random noise, and usually they wind up unpublished in the researcher's trash can. Now, some scientists believe the .05 threshold allows too many false discoveries to slip into the published literature. This sparked a debate on whether to ban p-values and statistical significance altogether. But p-values are not to blame. The replication crisis is a result of unrealistic expectations that grow out of exaggerated evidence. Here's the simplest guide I can give you for setting realistic expectations based on the evidence. 

P-values less than .05 or one in 20 are evidence of something surprising and they invite a second study. P-values less than one in a hundred represent strong evidence. And p-values less than one in a thousand represent proof beyond a reasonable doubt, and they probably don't need independent confirmation. This is why some drugs are approved just based on a single study. Now, let's look a little deeper at what to expect when the strength of evidence is right around .05 or one in 20 where replication is not as certain. Suppose we test a thousand drugs in independent clinical trials. 

Picture them as 10 blocks of a hundred little pills. Assume 10% are effective, the hundred in the white block, and assume 90% are ineffective, the 900 in the red blocks. Now, our gatekeeper correctly identifies most of the effective drugs for publication, but he sends a few effective drugs into the trash can by mistake. He filters out almost all of the ineffective drugs into the trash can, as he should, and only 5% of the ineffective drugs slip by him and are mistakenly published as meaningful discoveries. Now, bear in mind many of these false discoveries get cleaned up over time as we conduct more studies. 
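The arithmetic behind the thousand-drug example can be written out as a short sketch. The 10% share of effective drugs and the 5% slip-through rate come from the passage above; the 80% power (the share of effective drugs the gatekeeper correctly identifies) is an assumption standing in for "most", since the passage gives no number.

```python
n_drugs = 1000
share_effective = 0.10   # from the example: 10% of drugs are effective
alpha = 0.05             # from the example: 5% of ineffective drugs slip through
power = 0.80             # ASSUMPTION: share of effective drugs correctly detected

n_effective = round(n_drugs * share_effective)   # the 100 pills in the white block
n_ineffective = n_drugs - n_effective            # the 900 pills in the red blocks

true_discoveries = power * n_effective           # effective drugs that get published
false_discoveries = alpha * n_ineffective        # ineffective drugs that get published

# Share of published "discoveries" that are actually false
false_discovery_share = false_discoveries / (true_discoveries + false_discoveries)
print(true_discoveries, false_discoveries, round(false_discovery_share, 2))
```

Under these assumptions a sizeable minority of the published results are false discoveries, even though only 5% of the ineffective drugs slipped through. That is the sense in which a p-value near .05 invites a second study rather than settling the question.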

So statistical significance is not a guarantee of replication, but it helps us to identify most of the good drugs without becoming overwhelmed by too many false discoveries. Unfortunately, there are two silent killers of replication that exaggerate the evidence, and they trick the gatekeeper into allowing false discoveries to slip through. And these are multiplicity and selection bias. 

Let's start with multiplicity. We ask almost unlimited questions of the data from a single clinical trial. The number of end points, doses, subgroups and other factors quickly multiply, like bunny rabbits, into 20, 30, 40, 50, even hundreds of p-values. For example, in this list of 40 p-values, we naturally focus on the smallest ones like the one in red. It looks really impressive, but the true strength of evidence depends on the number of questions that we asked. When you take into account the 40 questions, the strength of evidence is no longer proof beyond doubt. 

It's simply something surprising and worth another look. Now, without the context of the number of questions, p-values can exaggerate the evidence and look stronger than they really are. Now, selection bias is the other silent killer of replication, and it really hurts the gatekeeper. When we ask multiple questions, we have a tendency to switch our focus after we see the data to the best outcomes. It's like shooting first and then painting a bullseye around the hole like that was our target all along. 

Now, let's look at an example from a drug for high blood pressure. PRAISE-1 and PRAISE-2 were clinical trials evaluating amlodipine in heart patients. It had a promising effect on mortality in one subgroup of patients. You might expect replication based on the p-value in red which is less than one in a thousand. But the PRAISE-2 trial did not replicate the mortality findings observed in the PRAISE-1 trial. Why not? What happened? Investigators switched end points and then focused on the best of eight subgroups in a clinical trial with no overall effect. So that p-value in red exaggerated the evidence and probably didn't support conducting a second trial.
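The multiplicity point made earlier in this explanation can be checked with a little arithmetic. This is a rough sketch that assumes the 40 questions are independent, and it uses the Bonferroni correction, a simple and conservative adjustment that the talk itself does not name. The value of `smallest_p` is hypothetical.

```python
n_questions = 40
alpha = 0.05

# Chance that at least one of 40 independent questions gives p < 0.05
# when there is no real effect anywhere
p_any_false_positive = 1 - (1 - alpha) ** n_questions

# Bonferroni: divide the threshold by the number of questions asked
bonferroni_threshold = alpha / n_questions

# HYPOTHETICAL smallest p-value in the list, adjusted for the 40 questions
smallest_p = 0.001
adjusted_p = min(1.0, smallest_p * n_questions)

print(round(p_any_false_positive, 3), bonferroni_threshold, adjusted_p)
```

With 40 questions, a p-value of one in a thousand adjusts to about one in 25: still surprising, but no longer proof beyond a reasonable doubt.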

## Python Example

The mortgage department of a large bank is interested in the nature of loans of first-time borrowers. 

This information will be used to tailor their marketing strategy. They believe that 50% of first-time borrowers take out smaller loans than other borrowers. 

They perform a hypothesis test to determine if the percentage is the same or different from 50%. They sample 100 first-time borrowers and find 53 of these loans are smaller than those of other borrowers. For the hypothesis test, they choose a 5% level of significance.

**Null Hypothesis**: $p = 0.50$

**Alternative Hypothesis**: $p \neq 0.50$

This will be run as a two-tailed test:

In [1]:

import numpy as np
from scipy.stats import norm

# calculate the cumulative probability and the critical z value for a two-tailed test
significance_level = 0.05
cumulative_prob = 1 - (significance_level / 2)

z_critical = norm.ppf(cumulative_prob)
print(z_critical)

1.959963984540054


Given that our significance level is 5% and that this is a two-tailed test, the cumulative probability used to look up the critical value is 1 - 0.05/2 = 0.975. Running the code above yields a critical z value of 1.96.

In [2]:
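# test statistic
n = 100         # sample size
p1 = 53 / 100   # observed sample proportion
p0 = 50 / 100   # hypothesized proportion under the null
p = 0.5         # proportion used for the standard error (equal to p0)
q = 1 - p

def test_statistic(n, p1, p0, p, q):
    '''
    Calculates the z test statistic for a one-sample proportion test
    (normal approximation)
    '''
    z = (p1 - p0) / np.sqrt((p * q) / n)
    return z

test_statistic(n, p1, p0, p, q)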

0.6000000000000005

Based on the code above, the test statistic is 0.6. This is close to zero, the mean of the standard normal distribution: the sample proportion lies only 0.6 standard errors from the hypothesized proportion.

The test statistic is within the critical values (-1.96 and 1.96), hence we fail to reject the null hypothesis. This implies that at a 5% level of significance we cannot reject the null hypothesis that 50% of first-time borrowers take out smaller loans than other borrowers.
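The same conclusion can be reached through the p-value rather than the critical value. The sketch below is an addition to the original example: it maps the test statistic of 0.6 to a two-sided p-value with `scipy.stats.norm.sf` and compares it to the 5% significance level.

```python
from scipy.stats import norm

z = 0.6        # test statistic from the example above
alpha = 0.05   # significance level chosen in the example

# Two-sided p-value: probability of a |z| at least this large under the null
p_value = 2 * norm.sf(abs(z))
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```

The p-value comes out at roughly 0.55, far above 0.05, so the decision matches the critical-value comparison: we fail to reject the null hypothesis.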

## Concluding Remarks

In summation, the frequentist approach to A/B testing is used to make a decision based on the statistical significance of the outcome favouring one of the two variants, A or B. This is done by identifying the null and alternative hypotheses associated with the test, determining the sample size, and calculating the test statistic at a chosen significance level. Once the test statistic is obtained, we can determine the p-value and conclude whether we reject or fail to reject the null hypothesis.