# Hypothesis Testing



## **What is statistical hypothesis testing?**

1. Start with a Scientific Question (yes/no)
2. Take the skeptical stance (Null hypothesis)
3. State the complement (Alternative)
4. Create a model of the situation, assuming the null hypothesis is true!
5. Decide how surprised you would need to be in order to change your mind

When we perform experiments, we typically do not have access to all the members of a population, and need to take samples of measurements to make inferences about the population.

A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample.

We test a hypothesis by determining the chance of obtaining a sample statistic if the null hypothesis regarding the population parameter is true.

> The goal of hypothesis testing is to make a decision about the value of a population parameter based on sample data.

**Some examples in the real world?** 

- Chemistry - do inputs from two different barley fields produce different yields?
- Astrophysics - do star systems with near-orbiting gas giants have hotter stars?
- Medicine - does a particular drug perform better in one population than in another?
- Sports - is LeBron the GOAT?



## Definitions 

**Significance Level $\alpha$**

> The significance level $\alpha$ is the threshold at which you're okay with rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is true.

> The most commonly used $\alpha$ in science is $\alpha = 0.05$. When you set $\alpha = 0.05$, you're saying "I'm okay with rejecting the null hypothesis if results at least as extreme as mine would occur less than 5% of the time by random chance alone, assuming the null hypothesis is true".

**p-values**

> The p-value is the probability of observing a test statistic at least as extreme as the one observed, by random chance, assuming that the null hypothesis is true.

> If $p \lt \alpha$, we reject the null hypothesis.

> If $p \geq \alpha$, we fail to reject the null hypothesis.

**We never "accept" a hypothesis: we either reject the null hypothesis in favor of the alternative, or we fail to reject it.**

**What if the experiment we perform fails to reject the null hypothesis?**

> We do not throw out failed experiments! We say "this methodology, with this data, does not produce significant results". Maybe we need more data!

**Type 1 Errors (False Positives) and Type 2 Errors (False Negatives)**

> Most tests for the presence of some factor are imperfect. And in fact most tests are imperfect in two ways: They will sometimes fail to predict the presence of that factor when it is after all present, and they will sometimes predict the presence of that factor when in fact it is not. Clearly, the lower these error rates are, the better, but it is not uncommon for these rates to be between 1% and 5%, and sometimes they are even higher than that. (Of course, if they're higher than 50%, then we're better off just flipping a coin to run our test!)

> Predicting the presence of some factor (i.e. counter to the null hypothesis) when in fact it is not there (i.e. the null hypothesis is true) is called a "false positive" (a Type 1 error). Failing to predict the presence of some factor (i.e. in accord with the null hypothesis) when in fact it is there (i.e. the null hypothesis is false) is called a "false negative" (a Type 2 error).
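To make the false-positive rate concrete, here is a small simulation sketch. It assumes a hypothetical normal population (mean 100, SD 15, sample size 40, all invented for illustration) in which the null hypothesis is true by construction, runs many two-tailed z-tests at $\alpha = 0.05$, and counts how often the null is wrongly rejected. Under these assumptions the rate should land close to $\alpha$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters (illustrative assumptions, not from the notes)
mu, sigma, n = 100, 15, 40
alpha = 0.05
n_experiments = 10_000

# Draw samples from a population where the null hypothesis (mean == mu) is TRUE
samples = rng.normal(mu, sigma, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# Two-tailed p-value for each simulated experiment
p_values = 2 * stats.norm.sf(np.abs(z))

# Every rejection here is a false positive (Type 1 error)
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

Because the null is true in every simulated experiment, each rejection counted here is a Type 1 error.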

**One-sample z-test**

> For large enough sample sizes ($n > 30$) with known population standard deviation $\sigma$, the test statistic of the sample mean $\bar x$ is given by the z-statistic, $$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$ where $\mu$ is the population mean.

> Our hypothesis test tries to answer the question of how likely we are to observe a z-statistic as extreme as our sample's given the null hypothesis that the sample and the population have the same mean, given a significance threshold of $\alpha$. This is a one-sample z-test.
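As a rough sketch, the z-statistic formula can be wrapped in a small helper. The function name `one_sample_z_test` and the numbers in the usage line are made up for illustration, and the p-value returned is the two-tailed one.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(x_bar, mu, sigma, n):
    """Return the z-statistic and two-tailed p-value for a one-sample z-test."""
    se = sigma / np.sqrt(n)           # standard error of the sample mean
    z = (x_bar - mu) / se             # how many standard errors x_bar is from mu
    p = 2 * stats.norm.sf(abs(z))     # probability of a z at least this extreme
    return z, p

# Invented example values: sample mean 52, null mean 50, sigma 8, n = 64
z, p = one_sample_z_test(x_bar=52, mu=50, sigma=8, n=64)
print(z, p)
```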

**One-sample t-test**

> For small sample sizes or samples with unknown population standard deviation, the test statistic of the sample mean is given by the t-statistic,$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$Here, $s$ is the sample standard deviation, which is used to estimate the population standard deviation, and $\mu$ is the population mean.

> Our hypothesis test tries to answer the question of how likely we are to observe a t-statistic as extreme as our sample's given the null hypothesis that the sample and population have the same mean, given a significance threshold of $\alpha$. This is a one-sample t-test.
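The sketch below computes the t-statistic by hand and checks it against `scipy.stats.ttest_1samp`. The sample values and the null mean of 5.0 are invented for illustration; the test uses $n - 1$ degrees of freedom.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (invented for illustration)
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7])
mu = 5.0  # population mean under the null hypothesis

n = len(sample)
s = sample.std(ddof=1)                       # sample standard deviation
t_manual = (sample.mean() - mu) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-tailed p-value

# SciPy's built-in one-sample t-test should agree with the manual version
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```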

**Two-sample t-tests** 

Sometimes, we are interested in determining whether two population means are equal. In this case, we use two-sample t-tests.

There are two types of two-sample t-tests: paired and independent (unpaired) tests.

What's the difference?

> **Paired tests**: How is a sample affected by a certain treatment? The individuals in the sample remain the same and you compare how they change after treatment.

> **Independent tests:** When we compare two different, unrelated samples to each other, we use an independent (or unpaired) two-sample t-test.

The test statistic for an unpaired two-sample t-test is slightly different than the test statistic for the one-sample t-test.

Assuming equal variances, the test statistic for a two-sample t-test is given by:

> $$ t = \frac{\bar{x_1} - \bar{x_2}}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$
where $s^2$ is the pooled sample variance,

> $$ s^2 = \frac{\sum_{i=1}^{n_1} \left(x_i - \bar{x_1}\right)^2 + \sum_{j=1}^{n_2} \left(x_j - \bar{x_2}\right)^2 }{n_1 + n_2 - 2} $$
Here, $n_1$ is the sample size of sample 1 and $n_2$ is the sample size of sample 2.

An independent two-sample t-test for samples of size $n_1$ and $n_2$ has $(n_1 + n_2 - 2)$ degrees of freedom.
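The pooled-variance formulas above can be sketched directly in NumPy and compared with `scipy.stats.ttest_ind(..., equal_var=True)`. The helper name `pooled_t_statistic` and the two small groups are hypothetical, chosen only to exercise the formula.

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(a, b):
    """Equal-variance two-sample t-statistic and its degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Sum of squared deviations within each sample
    ss1 = np.sum((a - a.mean()) ** 2)
    ss2 = np.sum((b - b.mean()) ** 2)
    s2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled sample variance
    t = (a.mean() - b.mean()) / np.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Invented data for illustration
group_1 = [12.1, 11.4, 13.0, 12.7, 11.9]
group_2 = [10.8, 11.2, 10.1, 11.6, 10.9, 11.0]

t, df = pooled_t_statistic(group_1, group_2)
t_scipy, p_scipy = stats.ttest_ind(group_1, group_2, equal_var=True)
print(t, df)
print(t_scipy, p_scipy)
```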

## Example 1 

Suppose that African elephants have weights distributed normally around a mean of 9000 lbs with a standard deviation of 900 lbs. Pachyderm Adventures has recently measured the weights of 35 Gabonese elephants and has calculated their average weight at 8637 lbs.

Is the average weight of Gabonese elephants different than the average weight of African elephants? Use significance level $\alpha = 0.05$.

**What are the null and alternative hypotheses? What is the significance level of the test?**

**What should be our test statistic? Are we running an upper, lower, or two-tailed test? Why?**

**What's the value of the critical test statistic that we should use for our test?**

## How do we set up a hypothesis test? 

**Regardless of the type of statistical hypothesis test you're performing, there are five main steps to executing them:**

1. Set up a null and alternative hypothesis
2. Choose a significance level $\alpha$ (or use the one assigned).
3. Determine the critical test statistic value or p-value. (Find the rejection region for the null hypothesis.)
4. Calculate the value of the test statistic.
5. Compare the test statistic value to the critical test statistic value to reject the null hypothesis or not.
![](https://github.com/learn-co-students/dsc-hypothesis_testing-seattle-102819/raw/633b48d10c99c4d75ba7d2ceadcf67b5f49c9c8d/images/hypothesis_test.png)


**Decision Rule:**

The decision rule tells us when we can reject the null hypothesis.

**It depends on 3 factors:**

1. The alternative hypothesis
    - Is this an upper-tailed, lower-tailed, or two-tailed test?
2. The test statistic
3. The level of significance $\alpha$.

**Upper-tailed test (right-tailed test):**

- The null hypothesis is rejected if the test statistic is greater than the critical value.

**Lower-tailed test (left-tailed test):**

- The null hypothesis is rejected if the test statistic is smaller than the critical value.

**Two-tailed test:**

- The null hypothesis is rejected if the test statistic is either larger than an upper critical value or smaller than a lower critical value.
    
[Awesome reference for everything Hypothesis testing related](https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/)
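The three decision rules can be sketched as one function for a z-test. The name `rejects_null` and its `tail` argument are illustrative choices, not part of any library.

```python
from scipy import stats

def rejects_null(z, alpha, tail):
    """Apply the decision rule for a z-test; tail is 'upper', 'lower', or 'two'."""
    if tail == "upper":
        return bool(z > stats.norm.ppf(1 - alpha))       # right-tail critical value
    if tail == "lower":
        return bool(z < stats.norm.ppf(alpha))           # left-tail critical value
    if tail == "two":
        critical = stats.norm.ppf(1 - alpha / 2)         # alpha split across both tails
        return bool(abs(z) > critical)
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

print(rejects_null(1.8, 0.05, "upper"))  # True: 1.8 exceeds about 1.645
print(rejects_null(1.8, 0.05, "two"))    # False: 1.8 is below about 1.96
```

The same z-statistic can lead to different decisions depending on the tail, which is why the alternative hypothesis has to be fixed before looking at the data.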

In [1]:
import numpy as np 
from scipy import stats

**What is the null/alternate hypothesis?** 

Null: The average weight of Gabonese elephants is the same as the average weight of African elephants 

Alternate: The average weight of Gabonese elephants is different than the average weight of African elephants 

alpha = 0.05 

Two-tailed test, because we're asking whether the weights differ in either direction (lower or higher). A z-test fits here: we are comparing a sample mean to a population mean, our sample size (35) is greater than 30, and we know the population SD.

In [2]:
#critical z-statistic 
alpha = 0.05 

# percent point function (ppf) is the inverse of the cumulative distribution
# function (CDF); it returns the quantile for a given probability
stats.norm.ppf(alpha/2), stats.norm.ppf(1-alpha/2)


(-1.9599639845400545, 1.959963984540054)

$$\text{z-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}, $$

where $\bar x$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.

In [3]:
n = 35 
sigma = 900 

x_bar = 8637
mu = 9000

se = sigma/np.sqrt(n)
z = (x_bar - mu)/se 
print(z)

-2.386152179183512


**Reject null or not?**

z = -2.39, which is < -1.96, thus we can reject the null hypothesis.
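The same conclusion can be reached through the p-value instead of the critical value. This sketch reuses the numbers from the example and doubles the lower-tail probability because the test is two-tailed.

```python
import numpy as np
from scipy import stats

# Values from the elephant example
n, sigma = 35, 900
x_bar, mu = 8637, 9000
alpha = 0.05

z = (x_bar - mu) / (sigma / np.sqrt(n))

# Two-tailed p-value: probability of a z at least this far from 0 in either direction
p_value = 2 * stats.norm.cdf(-abs(z))
print(p_value)
print(p_value < alpha)
```

The p-value comes out at roughly 0.017, which is below $\alpha = 0.05$, so both approaches agree.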

## Example 2 

The average number of scoops of ice cream sold by employees at Scoops Ahoy is 15 per hour with a standard deviation of 3.

Steve worked for 6 hours and averaged 18 scoops sold per hour.
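One possible setup, sketched under two assumptions: the standard deviation of 3 is treated as the population SD of scoops per hour, and Steve's 6 hours are treated as a sample of size 6 (small, so the z-test leans on hourly sales being roughly normal). The variable names mirror Example 1.

```python
import numpy as np

# Parameters read off the problem statement
mu = 15      # population mean scoops per hour
sigma = 3    # assumed to be the population standard deviation
n = 6        # hours Steve worked
x_bar = 18   # Steve's average scoops per hour

se = sigma / np.sqrt(n)   # standard error of the mean
z = (x_bar - mu) / se
print(z)
```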

In [None]:
#define parameters 


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 

x = np.linspace(-3,3,1000)
y = stats.norm.pdf(x, 0, 1)
z_ = x[x>z]
plt.plot(x, y)
plt.fill_between(z_, 0, stats.norm.pdf(z_, 0, 1))
plt.vlines(z,0,0.4)
plt.show()

In [None]:
#calculate p-value and interpret


**If we picked a random employee and observed their performance for 6 hours, there is only a 0.7% chance that they would sell more scoops on average than Steve.**
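The 0.7% figure can be reproduced with the upper-tail probability of the standard normal distribution. This is a sketch that again treats the standard deviation of 3 as the population SD.

```python
import numpy as np
from scipy import stats

# z-statistic for Steve's shift: (18 - 15) / (3 / sqrt(6))
z = (18 - 15) / (3 / np.sqrt(6))

# Upper-tail p-value: sf is the survival function, 1 - cdf
p_value = stats.norm.sf(z)
print(p_value)
```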

## Example 3 
A coffee shop relocates from Manhattan to Brooklyn and wants to make sure that all lattes are consistent before and after their move. They buy a new machine and hire a new barista. In Manhattan, lattes are made with 4 oz of espresso. A random sample of 25 lattes made in their new store in Brooklyn shows a mean of 4.6 oz and standard deviation of 0.22 oz. Are their lattes different now that they've relocated to Brooklyn? Use a significance level of $\alpha = 0.01$.

State null and alternative hypothesis:
1. Null: Lattes are the same 
2. Alternative: Lattes are different

What kind of test should we use? Run a 2-tailed, 1-sample t-test 

In [None]:
#define parameters 


In [None]:
#calculate critical statistic values 


**Can we reject the null hypothesis?**
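One way to answer this is sketched below. It takes the numbers from the problem statement, treats 0.22 oz as the sample standard deviation, and uses a t-distribution with $n - 1 = 24$ degrees of freedom.

```python
import numpy as np
from scipy import stats

mu = 4.0       # espresso per latte in Manhattan (oz)
x_bar = 4.6    # sample mean in Brooklyn (oz)
s = 0.22       # sample standard deviation (oz)
n = 25
alpha = 0.01

df = n - 1
t_stat = (x_bar - mu) / (s / np.sqrt(n))

# Two-tailed test: alpha is split evenly between the two tails
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(t_stat, t_crit)
print(abs(t_stat) > t_crit)
```

The t-statistic (about 13.6) is far beyond the critical value (about 2.80), so under these assumptions the null hypothesis is rejected.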


## Example 4 

The average annual earnings for an eSports athlete is 45,000 dollars with a standard deviation of 25,000 dollars. A recent study of 15 eSports athletes found their average annual earnings to be 55,000 dollars. A rising eSports star is trying to decide whether or not to drop out of college and pursue an eSports career. His decision will be swayed by whether or not he could expect to make more than 45,000 dollars a year in eSports. Design a hypothesis test to inform his decision with a 95% confidence level.

**What is the null/alternate hypothesis?** 

Null: Average salary of eSports athlete = 45,000

Alternate: Average salary of eSports athlete > 45,000
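A possible approach, sketched under several assumptions: the 25,000 dollar figure is treated as the known population SD, the 95% confidence level is read as $\alpha = 0.05$, and an upper-tailed z-test is used even though $n = 15$ is small (which only holds up if earnings are roughly normal).

```python
import numpy as np
from scipy import stats

mu = 45_000      # population mean earnings under the null
sigma = 25_000   # assumed known population standard deviation
n = 15
x_bar = 55_000   # sample mean from the study
alpha = 0.05     # from the 95% confidence level

z = (x_bar - mu) / (sigma / np.sqrt(n))

# Upper-tailed test: only unusually HIGH sample means count against the null
p_value = stats.norm.sf(z)
print(z, p_value)
print(p_value < alpha)
```

With these assumptions the p-value is about 0.06, slightly above 0.05, so the test would fail to reject the null hypothesis.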

In [None]:
#define parameters 


In [None]:
#calculate p-value 


## Example 4.5

You measure the delivery times of ten different restaurants in two different neighborhoods. You want to know if restaurants in the different neighborhoods have the same delivery times. It’s okay to assume both samples have equal variances. Set your significance threshold to 0.05.

delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]

delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]

Null: The delivery times are the same

Alternate: Delivery times are not the same

Type of test: Two-sided, two sample t-test 

In [None]:
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]

delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]
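A sketch of the test using `scipy.stats.ttest_ind` with `equal_var=True`, as the problem allows us to assume equal variances. The returned p-value is two-sided by default.

```python
from scipy import stats

delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]
alpha = 0.05

# Independent two-sample t-test with pooled variance (18 degrees of freedom)
t_stat, p_value = stats.ttest_ind(delivery_times_A, delivery_times_B, equal_var=True)
print(t_stat, p_value)
print(p_value < alpha)
```

If the p-value is at or above 0.05, we fail to reject the null hypothesis that the delivery times are the same.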

## Example - Two sample T-Test 

League of Legends and DOTA 2 are two similar multiplayer online battle arena games. There is a big debate over which game is better and which one requires more skill. To settle this question, two studies were run using an IQ test. The first study looked at 55 League of Legends players and found their average IQ to be 116 with a standard deviation of 10. The second study looked at 45 DOTA 2 players and found their average IQ to be 112 with a standard deviation of 12. Assuming a confidence level of 90%, can these studies conclude that one playerbase is more intelligent than the other?
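Since only summary statistics are given, one option is `scipy.stats.ttest_ind_from_stats`. This sketch assumes the reported standard deviations are sample standard deviations, that the two populations have equal variances (matching the pooled formula in the definitions), that the 90% confidence level means $\alpha = 0.10$, and that the test is two-tailed.

```python
from scipy import stats

alpha = 0.10  # from the 90% confidence level

# Two-sample t-test computed from summary statistics only
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=116, std1=10, nobs1=55,   # League of Legends study
    mean2=112, std2=12, nobs2=45,   # DOTA 2 study
    equal_var=True,                 # assumption: pooled variance
)
print(t_stat, p_value)
print(p_value < alpha)
```

Passing `equal_var=False` instead would run Welch's t-test, which drops the equal-variance assumption.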

## Summary

- A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample.

- Hypothesis tests consist of a null hypothesis and an alternative hypothesis.

- We test a hypothesis by determining the chance of obtaining a sample statistic if the null hypothesis regarding the population parameter is true.

- One-sample z-tests and one-sample t-tests are hypothesis tests for the population mean $\mu$.

- We use a one-sample z-test for the population mean when the population standard deviation is known and the sample size is sufficiently large. We use a one-sample t-test for the population mean when the population standard deviation is unknown or when the sample size is small.

- Two-sample t-tests are hypothesis tests for differences in two population means.