# Hypothesis Testing

Hypothesis testing is the process of weighing two competing hypotheses against each other and using statistics computed from sample data to decide between them. It is part of the branch of statistics known as inferential statistics.

In this lesson we will introduce some broad concepts related to hypothesis testing, and in future lessons we will dive into a few specific hypothesis tests. 


![Types of Stats](stats-types.jpg)

The terms covered in this lesson are summarized in the table below:

| Term                                            | Formula / Symbol | Description                                                 |
| ----                                            | ---------------- | -----------                                                 |
| **Null Hypothesis**                             | $H_0$            | The "default" hypothesis; usually no change, no effect, etc |
| **Alternative Hypothesis**                      | $H_1$ or $H_a$   | The "other" hypothesis; states there is some relationship                                                            |
| **Significance Level**, **False Positive Rate** | $\alpha$         | P(FP) = P(Type I Error)                                     |
| **Statistical Power**                           | $1 - \beta$      | P(Reject $H_0$ when $H_0$ is false)                         |
| **False Negative Rate**                         | $\beta$          | P(FN) = P(Type II Error)                                    |
| **p-value**                                     | $p$              | P(A result at least as extreme as the one observed \| $H_0$ is true) |



## Sample vs. Population

According to [Scribbr.com](https://www.scribbr.com/methodology/population-vs-sample/#:~:text=A%20population%20is%20the%20entire,t%20always%20refer%20to%20people.), "A **population** is the entire group that you want to draw conclusions about. A **sample** is the specific group that you will collect data from. The size of the sample is always less than the total size of the population. In research, a population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc."

Let's say we have a dataset that contains information about 600 statistics students: their hair and eye color and their sex. 

One thing we know is we have a sample of *some* population. 

What we don't know is what population it is a representative sample of. 


Some questions we may ask:

1. Is this a representative sample of *all* statistics students?

2. ...of *all* students?

3. ...of statistics students in a certain region of the world? 

4. ...of college statistics students in the US?

5. ...of people in the U.S.?

6. ...(and on and on)


Let's take, for example, the question of "Is this a representative sample of people in the U.S.?"

We can use statistical testing to evaluate whether this sample could plausibly have been drawn from the general U.S. population. My hypothesis is that it was not, because within this group of students I found that 36% have blue eyes. However, according to [heffingtons](https://heffingtons.com/interesting-facts-about-eye-color/), the estimated population proportion of blue eyes in the U.S. is 27%.

- Brown Eyes: 45%

- Blue Eyes: 27%

- Hazel Eyes: 18% (Note: Hazel eyes consist of shades of brown and green.)

- Green Eyes: 9%

- Other: 1%


But is this difference significant? That is, if I took a random sample of 600 people from across the U.S., what is the probability that at least 36% of that sample would have blue eyes? I'm going to say that if that probability is less than 5%, I will conclude that this is a sample of a different population. If it is greater than 5%, I will conclude that the 9 percentage point difference could plausibly be due to chance, and that I do not have enough evidence to say this is not a sample of the U.S. population as a whole.
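The blue-eyes question can be checked numerically. The sketch below is one possible approach, not the method prescribed by this lesson: it assumes the 600 students behave like a simple random sample in which each person independently has blue eyes with probability 0.27 (a binomial model), and it computes the exact one-sided tail probability using only the Python standard library. The function name `binom_upper_tail` and the variable names are made up for illustration.

```python
from math import exp, lgamma, log


def binom_upper_tail(k_observed, n, p):
    """Exact P(X >= k_observed) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for k in range(k_observed, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k); logs avoid overflow/underflow
        log_pmf = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p)
            + (n - k) * log(1 - p)
        )
        total += exp(log_pmf)
    return total


n = 600                          # sample size from the lesson
p_null = 0.27                    # U.S. blue-eye proportion under the null hypothesis
observed_blue = round(0.36 * n)  # 36% of 600 students
alpha = 0.05                     # the 5% cutoff chosen in the text

p_value = binom_upper_tail(observed_blue, n, p_null)
reject_null = p_value < alpha

print(f"observed blue-eyed students: {observed_blue}")
print(f"P(at least that many | p = {p_null}): {p_value:.3g}")
print("reject the null hypothesis" if reject_null else "fail to reject the null hypothesis")
```

Under these assumptions the tail probability falls far below the 5% cutoff, so the sample does not look like a random draw from the general U.S. population with respect to eye color.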


Another option is to look at hair and eye color together and look for relationships. Possible questions could be: 

- Are brown eyes more likely to be associated with brown or black hair?

- Are green eyes more likely to be associated with blond or red hair? 

Here we would be testing relationships, to see if two categorical variables are dependent on each other. In that case we could use something like a Chi-Square test for independence. 

## Null and Alternative Hypotheses

When performing formal statistical hypothesis testing, the question being asked needs to be phrased as a **null hypothesis** ($H_0$) and an **alternative hypothesis** ($H_a$). The null hypothesis is the "status quo" and usually reflects no change or no difference, while the alternative hypothesis says that there is a difference or change.

Some examples:

> - $H_0$: There is no difference between the average heights of right-handed and left-handed people.
> - $H_a$: There is a difference between the average heights of right-handed and left-handed people.

> - $H_0$: The amount of sleep a student gets the night before an exam makes no difference on the student's exam score.
> - $H_a$: Less sleep the night before an exam leads to a lower exam score.

The results of a hypothesis test will lead us to either **reject the null hypothesis** or **fail to reject the null hypothesis**. Strictly speaking, this does not tell us that the alternative hypothesis is true.

The alternative hypothesis can state either that there is a difference of any kind, or that the difference goes in one specific direction (greater than or less than). This tells us whether we are setting up a **two-tailed** test (for any difference) or a **one-tailed** test (for a difference in a specific direction).
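To make the one-tailed versus two-tailed distinction concrete, here is a small sketch. It assumes the test statistic follows a standard normal distribution when the null hypothesis is true (a z-test), which is only one of several possibilities and is not specified by this lesson. The value `z = 1.8` is a hypothetical test statistic chosen for illustration, and `p_values_from_z` is an illustrative helper name.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return one-tailed and two-tailed p-values for a z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return {
        # H_a: the true value is greater than the null value
        "one_tailed_greater": 1 - std_normal.cdf(z),
        # H_a: the true value is less than the null value
        "one_tailed_less": std_normal.cdf(z),
        # H_a: the true value differs from the null value in either direction
        "two_tailed": 2 * (1 - std_normal.cdf(abs(z))),
    }


z = 1.8  # hypothetical test statistic
for name, p in p_values_from_z(z).items():
    print(f"{name}: {p:.4f}")
```

For a positive statistic, the two-tailed p-value is twice the "greater" one-tailed p-value. With this hypothetical statistic, the one-tailed test falls below a 0.05 cutoff while the two-tailed test does not, which is why the direction of the alternative hypothesis should be decided before looking at the data.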

General hypothesis testing process:

0. Choose the right type of test for your data / question (elaborated on in future lessons)
1. Form hypotheses and set a desired confidence level
1. Calculate the appropriate test statistics and p-value
1. Conclude based on the above statistics

## Hypothesis Test Results 

Once we determine the correct hypothesis test, we choose a **confidence level**: the percent of the time that a **confidence interval** (a range of values built from our data) would contain the true value. By choosing a confidence level, we set our **significance level**, $\alpha$ (alpha), as well. $\alpha$ is defined as 1 minus our confidence level. Typical confidence levels are 95%, 99% and 99.9%, which correspond to $\alpha$ values of 0.05, 0.01 and 0.001.

### p-value

One of the values we will obtain from a hypothesis test is a **p-value**. The calculation differs depending on the specific type of test we are running, but is interpreted the same way.

The p-value is the probability of obtaining results at least as extreme as the ones we observed, assuming the null hypothesis is true.

For example, imagine we were testing the hypothesis that Codeup students who drink coffee have higher grades. Our hypotheses would be:

- $H_0$: There is no difference in grades between coffee drinkers and non-coffee drinkers.
- $H_a$: Coffee drinkers have higher grades than non-coffee drinkers.

Let's imagine we end up with a p-value of .05. This means that if it's true that there is no difference in grades, and we ran the experiment 20 times, we would expect about 1 out of the 20 experiments to show a difference at least as large as the one we observed, purely due to chance.

Based on our previously set significance level and our p-value, we decide whether to reject the null hypothesis. If our p-value is less than $\alpha$, we reject the null hypothesis; otherwise, we fail to reject the null hypothesis.

Note that p-values don't tell us anything about effect size.
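One way to see what a p-value measures is to compute one by brute force. The sketch below uses a permutation test, which is a different technique from the specific tests covered in later lessons, on made-up grades for the coffee example. The two lists of grades are hypothetical data invented for illustration, and `permutation_p_value` is an illustrative helper name. The idea: if coffee makes no difference, the group labels are arbitrary, so we shuffle them many times and count how often the shuffled difference in means is at least as large as the one we observed.

```python
import random

# Hypothetical grades, invented for illustration only
coffee = [88, 92, 79, 85, 94, 90, 83, 87]
no_coffee = [82, 78, 85, 80, 76, 84, 81, 79]


def mean_difference(group_a, group_b):
    """Mean of group_a minus mean of group_b."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=42):
    """One-tailed p-value for H_a: mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = mean_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random (null is true)
        if mean_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


observed_diff, p_value = permutation_p_value(coffee, no_coffee)
alpha = 0.05
print(f"observed difference in mean grade: {observed_diff:.3f}")
print(f"estimated p-value: {p_value:.4f}")
print("reject H_0" if p_value < alpha else "fail to reject H_0")
```

The p-value here is the fraction of shuffles that produced a difference at least as extreme as the observed one. The observed difference itself is the effect size, which is a separate quantity from the p-value.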

## Hypothesis Testing Errors

There are two types of errors we will encounter with hypothesis testing:

- A **type I** error is when we reject the null hypothesis, but, in reality, the null hypothesis is true.
- A **type II** error is when we fail to reject the null hypothesis when it is actually false.

The table below shows us the possible outcomes of a hypothesis test.

| &nbsp;               | $H_0$ is true                 | $H_0$ is false                 |
| -------------------- | -------------                 | --------------                 |
| Fail to reject $H_0$ | True Negative                 | False Negative (Type II Error) |
| Reject $H_0$         | False Positive (Type I Error) | True Positive                  |
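The four outcomes in the table can be observed in a simulation. The sketch below is an illustration under stated assumptions, not a procedure from this lesson: it draws normally distributed samples with a known standard deviation, runs a one-tailed z-test on each, and counts how often the null hypothesis is rejected. The sample size, the true means, and the function names are all hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def one_tailed_z_test_p_value(sample, null_mean, sigma):
    """p-value for H_a: true mean > null_mean, with known sigma."""
    z = (fmean(sample) - null_mean) / (sigma / sqrt(len(sample)))
    return 1 - NormalDist().cdf(z)


def rejection_rate(true_mean, null_mean=0.0, sigma=1.0, n=25,
                   alpha=0.05, n_experiments=2000, seed=0):
    """Fraction of simulated experiments in which H_0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if one_tailed_z_test_p_value(sample, null_mean, sigma) < alpha:
            rejections += 1
    return rejections / n_experiments


# H_0 is true (true mean equals the null mean): every rejection is a Type I error
type_i_rate = rejection_rate(true_mean=0.0)

# H_0 is false (true mean is 0.5): every rejection is a true positive
power = rejection_rate(true_mean=0.5)
type_ii_rate = 1 - power  # failures to reject are Type II errors

print(f"estimated Type I error rate (alpha): {type_i_rate:.3f}")
print(f"estimated power (1 - beta):          {power:.3f}")
print(f"estimated Type II error rate (beta): {type_ii_rate:.3f}")
```

When the null hypothesis is true, the rejection rate should land close to the chosen $\alpha$. When it is false, the rejection rate estimates the statistical power, and the remainder is the false negative rate $\beta$.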


## In Practice


Questions about real-world data are what we answer using hypothesis testing.

Thinking about the telco-churn dataset, here are some questions we might initially have of the data:

1. Do customers churn because their bills are too high? 

2. Is their internet too slow? Maybe there's something wrong with certain internet options? 

3. Do customers get charged more the longer they are there? 

Questions do not usually start off in an organized way. So let's restructure these questions into a form that makes it clear how to test them.

1. Do those who churn spend more than those who do not churn? 

2. Are certain internet types more or less likely to churn?

3. Is there a linear relationship between tenure and average monthly charges? 

Let's explicitly call out the associated variables and their datatype for these questions. 

1. Do those who churn (has_churned, boolean) spend more each month (avg_monthly_spend, numeric) than those who do not churn? 

2. Are customers with Fiber (has_fiber, boolean) more likely to churn (has_churned, boolean) than those without?

3. Are senior citizens (is_senior_citizen, boolean) more likely to churn (has_churned, boolean)?

4. Are customers without auto payment (has_autopayment, boolean) more likely to churn (has_churned, boolean)? 

5. Do customers who churn (has_churned, boolean) have lower tenure (tenure_months, numeric)? 

6. Is there a linear relationship between tenure (tenure_months, numeric) and total charges (ttl_charges, numeric)?

7. Is there a linear relationship between tenure (tenure_months, numeric) and average monthly charges (avg_monthly_charges, numeric)? 


The type of test we run depends on the question and the data types:

1. Do those who churn (has_churned, boolean) spend more each month (avg_monthly_spend, numeric) than those who do not churn? (boolean x numeric: comparison of means (t-test) across the 2 groups)

2. Are customers with Fiber (has_fiber, boolean) more likely to churn (has_churned, boolean) than those without? (boolean x boolean: comparison of proportions/relationships, e.g. a Chi-Square test for independence)

3. Are senior citizens (is_senior_citizen, boolean) more likely to churn (has_churned, boolean)? (boolean x boolean: comparison of proportions/relationships, e.g. a Chi-Square test for independence)

4. Are customers without auto payment (has_autopayment, boolean) more likely to churn (has_churned, boolean)? (boolean x boolean: comparison of proportions/relationships, e.g. a Chi-Square test for independence)

5. Do customers who churn (has_churned, boolean) have lower tenure (tenure_months, numeric)? (boolean x numeric: comparison of means (t-test) across the 2 groups)

6. Is there a linear relationship between tenure (tenure_months, numeric) and total charges (ttl_charges, numeric)? (numeric x numeric: linear correlation between two continuous variables (Pearson's correlation))

7. Is there a linear relationship between tenure (tenure_months, numeric) and average monthly charges (avg_monthly_charges, numeric)? (numeric x numeric: linear correlation between two continuous variables (Pearson's correlation))
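The mapping from data types to tests in the list above can be written down as a small lookup. This is only a sketch of the rule of thumb used in this lesson: `suggest_test` is a hypothetical helper, not part of any library, and real analyses also need to check each test's assumptions. The column names are the ones used in the questions above.

```python
def suggest_test(kind_a, kind_b):
    """Suggest a family of test from the data types of two variables."""
    kinds = frozenset([kind_a, kind_b])
    if kinds == {"boolean", "numeric"}:
        return "t-test (comparison of means across the 2 groups)"
    if kinds == {"boolean"}:
        return "Chi-Square test for independence (comparison of proportions)"
    if kinds == {"numeric"}:
        return "Pearson's correlation (linear correlation)"
    raise ValueError(f"no suggestion for types: {kind_a!r}, {kind_b!r}")


# (variable, type) pairs taken from the questions in this lesson
questions = [
    (("has_churned", "boolean"), ("avg_monthly_spend", "numeric")),
    (("has_fiber", "boolean"), ("has_churned", "boolean")),
    (("tenure_months", "numeric"), ("ttl_charges", "numeric")),
]

for (name_a, kind_a), (name_b, kind_b) in questions:
    print(f"{name_a} x {name_b}: {suggest_test(kind_a, kind_b)}")
```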

## Exercises

Do your work for this exercise in a Jupyter notebook named `hypothesis_testing.ipynb`.

For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

- Has the network latency gone up since we switched internet service providers?
- Is the website redesign any good?
- Is our television ad driving more sales?