#### ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) 

## Lesson 2.05 | A/B Testing

Our goal is often to learn about a population (usually how the population behaves or why it behaves the way it does).

**Inferential statistics** focuses on generalizing results from a sample to a larger **population of interest**.
- Because we can't study populations directly, we have to take subsets of these populations called **samples**.
- We calculate some value on a sample, called a **statistic**.
- We use that statistic to make inferences about the corresponding population values, called **parameters**, so that we learn about our **population**.
 
When attempting to learn about these parameters, there are two main types of questions we ask:
- What is a range of likely values for my parameter?
- Is a particular value (or set of values) plausible for my parameter?

### Confidence Interval
**Summary**: **A confidence interval describes a set of likely values for the parameter based on a statistic.** Confidence intervals will be centered at our "best guess" and then include a margin of error. 
- The technical term for this "best guess" is a **point estimate**.
- The margin of error is our **standard error** (the standard deviation of a statistic) multiplied by a **multiplier**.


**Structure**: $$[\text{point estimate}] \pm [\text{multiplier}]\times[\text{standard error}]$$

**Example**: A 95% confidence interval for the average hours of sleep DSI students get each night is (5.5,7.5). We interpret this by saying that we are 95% confident that the average amount of sleep DSI students get each night is between 5.5 hours and 7.5 hours.
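As a minimal sketch of the structure above, the interval can be computed by hand with SciPy. The sample is the sleep data used later in this lesson, and the variable names are illustrative. Using a $t$ multiplier with $n - 1$ degrees of freedom is an assumption here (it is the usual choice when the population standard deviation is unknown). This small sample is not the one behind the (5.5, 7.5) example, so the resulting interval will differ.

```python
import numpy as np
import scipy.stats as stats

# Sleep data (hours per night) reused from later in this lesson.
sleep = np.array([6.5, 7, 7.5, 3, 4.5, 5, 5.5, 5.5, 6, 6, 7, 6.5, 6.5, 5, 6])

n = len(sleep)
point_estimate = sleep.mean()                    # our "best guess"
standard_error = sleep.std(ddof=1) / np.sqrt(n)  # standard deviation of the sample mean
multiplier = stats.t.ppf(0.975, df=n - 1)        # assumed: t multiplier for 95% confidence

margin_of_error = multiplier * standard_error
ci_lower = point_estimate - margin_of_error
ci_upper = point_estimate + margin_of_error
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
```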

### Hypothesis Test
**Summary**: We are going to come up with two hypotheses, called the null and alternative hypotheses. We'll assume the null hypothesis is true, then gather evidence and measure how likely our observed results would be if the null hypothesis were true.
- If the evidence strongly contradicts the null hypothesis, we'll reject the null hypothesis in favor of the alternative hypothesis.
- If the evidence does not strongly contradict the null hypothesis, we'll say that we can't conclude the null hypothesis is false. This is not the same as showing that the null hypothesis is true.

**Structure**: 
1. Construct a null hypothesis that you seek to contradict and its complement, the alternative hypothesis.
2. Specify a level of significance.
3. Calculate your point estimate.
4. Calculate your test statistic - this quantifies how far our observed results are from our expected results.
5. Find your $p$-value and make a conclusion.

**Example**: I believe DSI students get, on average, at least 8 hours of sleep each night. You believe my claim to be false and conduct a hypothesis test. Your null hypothesis is $H_0: \mu \geq 8$ and your alternative hypothesis is $H_A: \mu < 8$. You establish your significance level at $\alpha = 0.05$, then gather data, calculate your point estimate and your test statistic, and find your $p$-value to be 0.02. Because $p < \alpha$, you reject your null hypothesis and conclude that the data support the alternative hypothesis ($\mu < 8$).
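The five steps can be sketched by hand as follows. This is an illustration under stated assumptions: it uses the sleep data that appears later in this lesson (so it will not reproduce the $p$-value of 0.02 from the example), it assumes a $t$ distribution with $n - 1$ degrees of freedom for the test statistic, and the variable names are hypothetical.

```python
import numpy as np
import scipy.stats as stats

sleep = np.array([6.5, 7, 7.5, 3, 4.5, 5, 5.5, 5.5, 6, 6, 7, 6.5, 6.5, 5, 6])

# Step 1: H_0: mu >= 8 versus H_A: mu < 8.
mu_0 = 8

# Step 2: level of significance.
alpha = 0.05

# Step 3: point estimate (the sample mean).
n = len(sleep)
x_bar = sleep.mean()

# Step 4: test statistic, the distance from mu_0 in standard-error units.
std_err = sleep.std(ddof=1) / np.sqrt(n)
t_stat = (x_bar - mu_0) / std_err

# Step 5: lower-tail p-value, because H_A says mu is *less than* 8.
p_value = stats.t.cdf(t_stat, df=n - 1)
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.2e}, reject H_0: {reject_null}")
```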

### Intro to A/B Testing

**A/B Testing** is a term for a randomized experiment with two variants: A and B. We often use "experiment" and "A/B testing" interchangeably, but A/B testing is a term of art that is used mainly to describe cases where we only test two groups.

<details><summary> Why would we conduct a randomized experiment or an A/B test?
</summary>
```
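- Randomized experiments are the most reliable way to establish causation: random assignment balances confounding variables across groups, whereas observational methods need much stronger assumptions to support a causal claim.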
```
</details>

Experiments are the preferred way of drawing causal conclusions, as observational studies or surveys may not give us the level of certainty we might attain with experiments.

<details><summary> What are situations where we would not conduct an experiment or an A/B test?
</summary>
```
- Scenarios in which it is unethical to conduct an experiment.
- Cases where experiments are too expensive or time-consuming.
```
</details>

When designing an experiment, there are four main questions you want to answer:
1. What factor(s) will be changed?
2. Who will be part of the test group?
3. How long will the test run?
4. Why is this test truly necessary?

When designing an experiment, there are a few best practices to keep in mind.
1. Define your dependent variable (the variable you want to measure, your Y) before the experiment.
2. Only one variable should be changed, and it should be your independent variable (the variable you want to change).
    - There are types of experiments, like factorial designs, that can account for having multiple changed variables.
3. Randomly assign individuals to groups.
4. [Block](https://en.wikipedia.org/wiki/Blocking_(statistics)) your observations so that each group has a similar distribution of important variables.
5. Make sure there is a control group!
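Best practices 3 through 5 can be combined in a few lines. The sketch below is hypothetical: the participant IDs, the `"urban"`/`"rural"` blocking variable, and the even split within each block are all made up for illustration, and the seed is fixed only so the output is reproducible.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only

# Hypothetical participants and one blocking variable.
participant_ids = np.arange(12)
block = np.array(["urban"] * 6 + ["rural"] * 6)

assignment = {}
for block_name in np.unique(block):
    ids_in_block = participant_ids[block == block_name]
    shuffled = rng.permutation(ids_in_block)  # random order within the block
    half = len(shuffled) // 2
    # First half of the shuffled block is the control group, the rest is treatment.
    for pid in shuffled[:half]:
        assignment[int(pid)] = "control"
    for pid in shuffled[half:]:
        assignment[int(pid)] = "treatment"

for pid in participant_ids:
    print(pid, block[pid], assignment[int(pid)])
```

Randomizing within each block, rather than across everyone at once, is what guarantees that both groups contain the same number of urban and rural participants.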

Suppose that you've recently been hired as a data scientist to work for [Vera Bradley](https://en.wikipedia.org/wiki/Vera_Bradley), a luggage and handbag company. The team wants you to tell them how their store should be laid out. There is the current layout of the store (layout 1) and two other proposed layouts (layouts 2 and 3).
- In groups of 2-4 people, design an experiment that helps the team answer their question! In particular, answer the four main questions above and, for each best practice, describe how you would attempt to incorporate it.
- Each market will share answers!

---


### Analyzing Experimental Results

Depending on how we set up our experiment, there will be a specific way to analyze the data.
- One-Sample $t$-test
- Two-Sample $t$-test
- Matched Pairs $t$-test
- ANOVA

```python
import scipy.stats as stats
import numpy as np
```

#### One-Sample $t$-test
Earlier, we did a **one-sample $t$-test**. This is called a one-sample $t$-test because we compared the mean of one sample to a specific value.

$H_0: \mu \geq 8$

$H_A: \mu < 8$

```python
data = [6.5,7,7.5,3,4.5,5,5.5,5.5,6,6,7,6.5,6.5,5,6]

stats.ttest_1samp(data,8)
```

```
Ttest_1sampResult(statistic=-7.3329889726309601, pvalue=3.715806740608815e-06)
```
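By default, `ttest_1samp` reports a two-sided $p$-value, while the hypotheses above are one-sided ($H_A: \mu < 8$). A sketch of the one-sided version follows, assuming SciPy 1.6 or newer, where the `alternative` argument is available. When the test statistic falls on the side named by the alternative, as it does here, the one-sided $p$-value is half the two-sided one.

```python
import scipy.stats as stats

data = [6.5, 7, 7.5, 3, 4.5, 5, 5.5, 5.5, 6, 6, 7, 6.5, 6.5, 5, 6]

# Default: two-sided test (H_A: mu != 8).
two_sided = stats.ttest_1samp(data, 8)

# One-sided test matching H_A: mu < 8 (assumes SciPy >= 1.6).
one_sided = stats.ttest_1samp(data, 8, alternative='less')

print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

Either way, the conclusion at $\alpha = 0.05$ is the same for this sample: we reject the null hypothesis.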

#### Two-Sample $t$-test
When doing experiments, we're more likely to want to compare results from two groups instead of comparing one group to a number. This is where a **two-sample $t$-test** comes into play. 

Suppose that we had randomly split the DSI students into groups A and B, given coffee to group A and water to group B, then compared their sleeping habits.

<details><summary>
In this case, what do you think the hypotheses would be?
</summary>
```
H_0: mu_A = mu_B
H_A: mu_A != mu_B
```
</details>

Let's check out the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind) for a two-sample $t$-test.

```python
data = [6.5,7,7.5,3,4.5,5,5.5,5.5,6,6,7,6.5,6.5,5,6]
group_A = data[0:8]
group_B = data[8:]
```

```python
stats.ttest_ind(group_A, group_B)
```

```
Ttest_indResult(statistic=-0.9784163096277243, pvalue=0.34572599932044457)
```

<details><summary>
Suppose we established our significance level to be $\alpha = 0.05$. How would I interpret the results of this hypothesis test?
</summary>
```
Because p is greater than alpha, we fail to reject the null hypothesis and cannot conclude that the null hypothesis is false.
```
</details>
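By default, `ttest_ind` assumes that both groups come from populations with equal variance. If that assumption is doubtful, Welch's $t$-test can be requested with `equal_var=False`. The sketch below reuses the groups from above. Whether equal variances are reasonable for this data is not something the lesson states, so both versions are shown side by side.

```python
import scipy.stats as stats

data = [6.5, 7, 7.5, 3, 4.5, 5, 5.5, 5.5, 6, 6, 7, 6.5, 6.5, 5, 6]
group_A = data[0:8]
group_B = data[8:]

# Default: pooled-variance t-test (equal variances assumed).
pooled = stats.ttest_ind(group_A, group_B)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(group_A, group_B, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

For this sample both versions lead to the same decision at $\alpha = 0.05$: we fail to reject the null hypothesis.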

#### Matched Pairs $t$-test
From the documentation for the two-sample $t$-test, it was very clear that the two samples needed to be **independent** of one another. That is, changes in one group should not affect the other group.

However, this isn't always the case. Suppose we've developed a drug that helps to reduce the systolic blood pressure of individuals. 
- We recruit 10 individuals to participate in our study.
- We measure all 10 patients' systolic blood pressure.
- We administer the drug to all 10 patients over the course of eight weeks.
- We measure all 10 patients' systolic blood pressure again.

In this case, we want to compare the **pre-**drug values against the **post-**drug values to see if our drug had the intended effect. Here, our pre-drug values and post-drug values are certainly not independent of one another - we're taking our measurements on the same people!

This is where the **matched pairs $t$-test** comes into play. The matched pairs $t$-test is a way for us to take two dependent samples and compare their means. (Spoiler alert: it just takes sample 1, subtracts sample 2, and conducts a one-sample $t$-test on the differences against a mean of zero.)

<details><summary>
In this case, what do you think the hypotheses would be?
</summary>
```
H_0: mu_pre = mu_post
H_A: mu_pre != mu_post
```
</details>

Let's check out the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html) for a matched pairs $t$-test.

```python
systolic_pre = [130,135,139,153,160,142,136,128,144,155]
systolic_post = [125,134,140,151,160,140,138,124,141,148]
```

```python
stats.ttest_rel(systolic_pre, systolic_post)
```

```
Ttest_relResult(statistic=2.400108850942297, pvalue=0.039890766746082586)
```

<details><summary>
Suppose we established our significance level to be $\alpha = 0.05$. How would I interpret the results of this hypothesis test?
</summary>
```
Because p is less than alpha, we reject the null hypothesis and conclude that the alternative hypothesis is true.
```
</details>
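The "spoiler" about the matched pairs $t$-test can be checked directly. In this sketch, `differences` is an illustrative name; `ttest_rel(a, b)` works with `a - b`, so the differences are taken as pre minus post.

```python
import numpy as np
import scipy.stats as stats

systolic_pre = [130, 135, 139, 153, 160, 142, 136, 128, 144, 155]
systolic_post = [125, 134, 140, 151, 160, 140, 138, 124, 141, 148]

# Per-patient change: pre-drug minus post-drug.
differences = np.array(systolic_pre) - np.array(systolic_post)

paired = stats.ttest_rel(systolic_pre, systolic_post)
one_sample = stats.ttest_1samp(differences, 0)  # H_0: mean difference is 0

print("paired:    ", paired.statistic, paired.pvalue)
print("one-sample:", one_sample.statistic, one_sample.pvalue)
```

The two calls should report the same statistic and $p$-value, up to floating-point rounding.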

#### ANOVA
Suppose you're a data scientist for a factory and you have three assembly lines running. You want to identify if the lines are all performing at roughly the same level, or if at least one line is slower than the others.

In this case, we can't run a one-sample or a two-sample test... so we need to use ANOVA, or "Analysis of Variance," which will allow us to compare more than two **independent** samples.

Despite its name, ANOVA is a type of hypothesis test that tests whether or not all groups have the same population mean. It does so by comparing the variation between group means to the variation within groups.

<details><summary>
In this case, what do you think the hypotheses would be?
</summary>
```
H_0: mu_A = mu_B = mu_C = ...
H_A: at least one mu_i != mu_j.
```
</details>

Let's check out the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) for ANOVA.

```python
line_a = [6,5,4,8,7]
line_b = [6,4,3,7,4]
line_c = [1,3,2,4,4]
```

```python
stats.f_oneway(line_a,line_b,line_c)
```

```
F_onewayResult(statistic=5.6811594202898537, pvalue=0.018365036502725765)
```

<details><summary>
Suppose we established our significance level to be $\alpha = 0.05$. How would I interpret the results of this hypothesis test?
</summary>
```
Because p is less than alpha, we reject the null hypothesis and conclude that the alternative hypothesis is true.
```
</details>
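To see where the "variance" in ANOVA comes from, the $F$ statistic can be rebuilt by hand as the ratio of between-group to within-group variation. This is a sketch of the standard one-way ANOVA computation, with illustrative variable names; it should agree with `f_oneway` up to rounding.

```python
import numpy as np
import scipy.stats as stats

groups = [
    np.array([6, 5, 4, 8, 7]),  # line_a
    np.array([6, 4, 3, 7, 4]),  # line_b
    np.array([1, 3, 2, 4, 4]),  # line_c
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_value = stats.f.sf(f_stat, k - 1, n_total - k)  # upper-tail probability

print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```

Note that a significant result only says that at least one mean differs; it does not say which line is the odd one out.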