<center><h1>Statistics for Machine Learning</h1></center>
&nbsp;
<center><h2>Hypothesis Testing and p-Values</h2></center>

### Overview


### Pre-requisites

This notebook builds on *previous notebook*

### Contents

Section 1 - What is Hypothesis Testing?

Section 2 - Setting up a Hypothesis Test

Section 3 - Coin Flips Example (Discrete)

Section 4 - Height Example (Continuous)

Section 5 - Glossary

Section 6 - References

### Loading the Data

The SOCR dataset from UCLA contains height and weight information for 25,000 18-year-olds, who form the population for this notebook.

```python
import math
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
```

```python
df = pd.read_csv('datasets/SOCR-HeightWeight.csv', sep=',')
heights = list(df['Height(Inches)'])
weights = list(df['Weight(Pounds)'])

heights.sort()
weights.sort()

print(f'Number of rows: {len(df)}')
```

```
Number of rows: 25000
```


<center><h2>Section 1 - What is Hypothesis Testing?</h2></center>

### 1.1 - Motivation for Hypothesis Testing

In *Sampling a Distribution & Bessel's Correction*, we showed that we can take a small number of samples from a population to estimate the population parameters. In *Population Parameters for Normal Distributions*, we saw how once the population parameters were known, we could answer questions such as 'what is the probability that a person in the population is taller than 70 inches?'.

This notebook will look at tackling slightly different questions. Here, a measurement will be given and the task is to figure out if the measurement is **significantly** different from a population. To explain this concept, we will work through two examples: an example with a discrete random variable, and an example with a continuous random variable.

The first example will consider a coin that is flipped 10 times and lands on heads every time - since the outcomes are either heads or tails, this random variable is discrete. Given all the combinations of heads and tails the coin could have landed on, is 10 heads in a row different enough (significantly different) from what a normal coin would land on? If it is, then the coin is unlikely to be a regular coin, and is more likely to be weighted or biased in some way. If 10 consecutive heads is not that uncommon, then the coin is probably just a regular coin.

The second example will consider the height of a person who is 78 inches tall - since height can be any decimal value, this random variable is continuous. Given the heights of people in the population, is 78 inches significantly different? If so, then the person is probably from a different population of people, but if not, the person is more likely to be from the same population (and just happens to be quite tall).

The test we perform to find out the answer to these questions is called a **Hypothesis Test**, and is one of the most important statistical tests. It can be used to determine if a new drug is better at curing illness, if safety measures have reduced road traffic collisions, and much more.

### 1.2 - Introduction to $p$-Values

The last section was deliberately vague, and made statements such as "different enough" and "significantly different". But how do we quantify if a measurement is "different enough"? The answer is by using **$p$-values**. The term $p$-value is short for probability value, and is a number between 0 and 1. This value is the probability of seeing a measurement at least as extreme as the one we made purely through random chance, assuming nothing is actually different about the item we are measuring.

This part is very important, and so here it is again for emphasis:
> A $p$-value is the probability of seeing a measurement at least as extreme as the one we made purely through random chance, assuming nothing is actually different about the item we are measuring.

If the $p$-value is very low (usually below 0.05), a measurement like ours would be very unlikely to arise through random chance alone, and so we have good evidence that the result was caused by external factors (such as the coin being biased, or the person being from a different population). If the $p$-value is high, the measurement is typical of the population data, and so we have no evidence to suggest there is anything different about the item we measured.

The method for calculating $p$-values is shown later in the notebook.

<center><h2>Section 2 - Setting up a Hypothesis Test</h2></center>

### 2.1 - Approach to a Hypothesis Test

To answer the question 'is a measurement significantly different?', it is actually easier to assume a measurement is not different, and then try to disprove the assumption. We ask the question this way around because $p$-values are calculated under the assumption that nothing is different: they give the probability that random chance alone would produce a measurement at least this extreme, and NOT the probability that the item (e.g. coin, height etc.) is different. Hence, we must assume the observation was due to random chance, then check whether the data give us grounds to reject that assumption.

### 2.2 - Hypothesis Test Overview

Below is an overview of the steps for setting up a hypothesis test. The next two sections will walk through examples for discrete and continuous random variables respectively. Note the overall method is the same for both, but the way you work through some of the steps is slightly different - see those sections for the details.

&nbsp;

**Step 1)** Write down the Null Hypothesis, $H_0$

The first step in performing a hypothesis test is to write down what we are going to test. In the coin flip example, this would be: 'The coin is not biased'. Because we always test if a result is NOT significantly different, we call this statement the **Null Hypothesis**. This emphasises the fact that the test is always checking that there is no difference. The null hypothesis is denoted $H_0$.

&nbsp;

**Step 2)** Write down the Alternative Hypothesis, $H_1$

The **Alternative Hypothesis** is the case where there IS a difference between the two items. For example, the alternative hypothesis for the coin flip would be: 'The coin is biased'. The alternative hypothesis is denoted $H_1$.

&nbsp;

**Step 3)** Find the $p$-value

Next, find the $p$-value - this will tell us how likely a measurement at least this extreme would be if only random chance were at work. If the probability is very low, we can conclude that the measurement is more likely explained by external factors (such as the coin being weighted). We can set the limit of what we consider 'very low' to different values depending on the problem. The mathematical name for this threshold is the **level of significance** and is represented by $\alpha$. The value of $\alpha$ is generally 0.05, and so $p$-values lower than this are rare enough for the measurement to be considered significant. However, in more critical experiments, such as medical trials, you might want a much lower value of $\alpha$ so that you can be more confident that the results are not due to random chance.

&nbsp;

**Step 4)** Make a Conclusion

If the $p$-value is less than the level of significance, then the measurement is significantly different and we have strong evidence against the null hypothesis. So we **reject the null hypothesis**. This is a key phrase that is commonly heard in statistics. If the $p$-value is not less than the level of significance, we do not have enough evidence to say the measurement is significantly different, and so we **fail to reject the null hypothesis**. Note that this does not prove the null hypothesis is true. For example, maybe the coin WAS biased, but we haven't flipped it enough times yet to find that out.
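The four steps can be sketched in code for the coin that lands on heads 10 times out of 10. This is only a minimal illustration: the variable names are made up for this example, $\alpha$ = 0.05 is assumed, and the $p$-value is taken as the total probability of every result that is no more likely than the observed one for a fair coin.

```python
from math import comb

# Steps 1 and 2: write down the hypotheses
h0 = "The coin is not biased"
h1 = "The coin is biased"

n_flips = 10
observed_heads = 10
alpha = 0.05  # assumed level of significance

# Probability of each possible number of heads (0 to 10) for a fair coin
probs = [comb(n_flips, k) / 2**n_flips for k in range(n_flips + 1)]

# Step 3: add up the probability of the observed result and of every
# result that is equally likely or rarer under the null hypothesis
p_observed = probs[observed_heads]
p_value = sum(p for p in probs if p <= p_observed)

# Step 4: compare against the level of significance
reject_h0 = p_value < alpha

print(f"p-value: {p_value:.6f}")
if reject_h0:
    print(f"Reject H0 ('{h0}') in favour of H1 ('{h1}')")
else:
    print(f"Fail to reject H0 ('{h0}')")
```

Here the only results as unlikely as 10 heads are 10 heads and 10 tails, so the $p$-value is 2/1024, which is well below 0.05.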

### 2.3 - Why use $\alpha$ < 0.05?

The value of $\alpha$ can be chosen depending on the experiment. Recall that a $p$-value is the probability of seeing a measurement at least as extreme as ours through random chance alone. This means that for $\alpha$ = 0.05, we accept that when nothing is actually different, we will still wrongly declare a significant difference up to 5% of the time.

In the examples of flipping a coin or measuring heights, this is probably an acceptable risk. But in the case of medical trials where people could live or die depending on the effectiveness of a drug, it is critical we are very confident in the results of an experiment. We can increase our confidence by decreasing the value of $\alpha$ so that such false alarms become much rarer. In cases such as these, $\alpha$ could be very small, for example $\alpha$ = 0.00001.
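To make the effect of $\alpha$ concrete, the sketch below works out how often a perfectly fair coin flipped 10 times would be wrongly flagged as biased for a few choices of $\alpha$. The function names are hypothetical and the 10-flip setup is an assumption carried over from the coin example; the calculation is exact rather than simulated.

```python
from math import comb

def fair_coin_probs(n_flips):
    """Probability of each number of heads (0..n_flips) for a fair coin."""
    return [comb(n_flips, k) / 2**n_flips for k in range(n_flips + 1)]

def p_value_for_heads(n_flips, heads):
    """Total probability of results no more likely than `heads` heads."""
    probs = fair_coin_probs(n_flips)
    return sum(p for p in probs if p <= probs[heads])

def false_rejection_rate(n_flips, alpha):
    """Chance that a fair coin gives a result with p-value below alpha."""
    probs = fair_coin_probs(n_flips)
    return sum(
        probs[k]
        for k in range(n_flips + 1)
        if p_value_for_heads(n_flips, k) < alpha
    )

for alpha in (0.05, 0.01, 0.00001):
    rate = false_rejection_rate(10, alpha)
    print(f"alpha = {alpha}: fair coin wrongly rejected with probability {rate:.5f}")
```

In each case the rate of wrongly rejecting the null hypothesis stays at or below $\alpha$, and it shrinks as $\alpha$ is lowered. The trade-off is that with a very small $\alpha$ and only 10 flips, no result at all is rare enough to reject the null hypothesis.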

### 2.4 - Calculating $p$-Values

The technical definition of a $p$-value has been held back so far to focus on the meaning and interpretation, more so than the mathematical formula. The value itself is simple to calculate, and is the sum of three parts:

&nbsp;

1) the probability of the specific outcome occurring

2) the probabilities of any equally likely outcomes occurring

3) the probabilities of any rarer outcomes occurring

&nbsp;

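If every outcome is equally likely and you have counted outcomes rather than added up probabilities, the next and final step is to divide the count by the total number of outcomes - this gives the $p$-value. If you have added up probabilities directly, the sum is already the $p$-value.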

&nbsp;

At first glance, the second and third terms might seem a bit confusing but these can be easily explained. What we are interested in with the $p$-value is how likely it is that random chance alone would produce a result at least as unusual as the one measured. To find this, we are not so much interested in if the coin flipped all heads or all tails - it doesn't matter which. We are solely interested in whether the coin is exhibiting significantly different behaviour than expected. So we can add the probability of 10 consecutive tails to the probability of 10 consecutive heads. This explains the second term.
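As a rough illustration of the three parts, the sketch below lists every possible sequence of 10 coin flips, counts the sequences that fall into each part, and divides by the total number of sequences. The function name and its return format are made up for this example.

```python
from collections import Counter
from itertools import product

def p_value_by_counting(n_flips, observed_heads):
    """Count equally likely flip sequences, then divide by the total."""
    # Every sequence of heads (H) and tails (T) is equally likely for a fair coin
    sequences = list(product("HT", repeat=n_flips))

    # Number of sequences that give each possible number of heads
    ways = Counter(seq.count("H") for seq in sequences)
    observed_ways = ways[observed_heads]

    # 1) the specific outcome
    specific = observed_ways
    # 2) other outcomes that are exactly as likely
    equally_likely = sum(
        w for heads, w in ways.items()
        if heads != observed_heads and w == observed_ways
    )
    # 3) outcomes that are rarer
    rarer = sum(w for w in ways.values() if w < observed_ways)

    total = len(sequences)
    return (specific + equally_likely + rarer) / total

print(p_value_by_counting(10, 10))  # 10 heads: only 10 tails is as rare
print(p_value_by_counting(10, 9))   # 9 heads: 1 head is as likely, 0 and 10 are rarer
```

For 10 heads, the count is 1 (ten heads) + 1 (ten tails) + 0 (nothing rarer) out of 1024 sequences. For 9 heads, it is 10 + 10 + 2 out of 1024.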

To explain the third term it is helpful to use an example. Imagine there is a bag of 1000 marbles where 500 are blue, 2 are yellow and the remaining 498 marbles are all unique colours (red, green etc.). It doesn't matter that a yellow marble has only a 2/1000 chance of being drawn, because each of the other 498 marbles has an even smaller 1/1000 chance of being drawn. The fact the yellow marble is 'rare' is not a special property here. Half the time a marble at least as 'rare' as yellow will be drawn. The $p$-value accounts for this by adding the third term. In this case, the $p$-value for drawing a yellow marble is 0.5 (2/1000 + 498/1000 = 500/1000), which is much larger than 0.05. This indicates that drawing a yellow marble is not a significant result, and is in fact fairly typical for that population.
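The marble calculation can be checked with a few lines of code. This is a sketch only: the bag is built to match the description above, and the colour labels for the 498 unique marbles are placeholders invented for the example.

```python
from collections import Counter

# Bag from the example: 500 blue, 2 yellow, 498 marbles of unique colours
bag = ["blue"] * 500 + ["yellow"] * 2 + [f"unique_{i}" for i in range(498)]
colour_counts = Counter(bag)
total_marbles = len(bag)

def colour_p_value(colour):
    """Chance of drawing a colour that is as likely or rarer than `colour`."""
    observed = colour_counts[colour]
    # Covers the colour itself, equally likely colours and rarer colours
    as_rare_or_rarer = sum(c for c in colour_counts.values() if c <= observed)
    return as_rare_or_rarer / total_marbles

print(colour_p_value("yellow"))    # 2 yellow + 498 unique = 500 of 1000
print(colour_p_value("unique_0"))  # only the 498 unique marbles count
print(colour_p_value("blue"))      # every marble is as likely or rarer
```

None of these values is below 0.05, so no single draw from this bag would count as significantly different.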

<center><h2>Section 3 - Coin Flips Example (Discrete)</h2></center>

<center><h2>Section 4 - Height Example (Continuous)</h2></center>

<center><h2>Section 4 - Conclusion</h2></center>

<center><h2>Section 5 - Glossary</h2></center>

**Alternative Hypothesis ($H_1$)**
> Definition

**Hypothesis Testing**
> Definition

**Level of Significance ($\alpha$)**
> Definition

**Null Hypothesis ($H_0$)**
> Definition

**Significantly Different**
> Definition

**$p$-Value**
> Definition

<center><h2>Section 6 - References</h2></center>

[1] Description - [Website domain](link)