## Hypothesis Testing

A hypothesis is a claim (an assumption) about a population parameter, such as:
- population mean
- population proportion

Point estimates and confidence intervals are basic inferential statistics that act as the foundation for another inference technique: statistical hypothesis testing. Statistical hypothesis testing is a framework for determining whether observed data deviates from what is expected. 

## Hypothesis Testing Foundations

Statistical hypothesis tests are built on a statement called the null hypothesis, $H_0$, which states the assumption to be tested. It is always a statement about a population parameter, not a sample statistic. The null hypothesis represents the status quo, that is, what we currently believe to be true, and we begin the test by assuming that it is true.

The purpose of a hypothesis test is to determine whether the sample data provide enough evidence to reject the null hypothesis.

If the data provide little evidence against the null hypothesis, we fail to reject it. If the observed data would be unlikely were the null hypothesis true, we reject the null hypothesis in favor of the alternative hypothesis.

The alternative hypothesis, $H_1$, is the opposite of the null hypothesis. It challenges the status quo and is generally the hypothesis the researcher is trying to support.

Once we have the null and alternative hypotheses in hand, we choose the significance level, $\alpha$.

The significance level is a probability threshold that determines when we reject the null hypothesis. It defines which values of the sample statistic count as unlikely if the null hypothesis is true.

**P-value**

The p-value is the probability of seeing a result as extreme as, or more extreme than, the one we observed, given that the null hypothesis is true.

If the p-value is lower than the significance level, we reject the null hypothesis in favor of the alternative.
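
As a rough illustration of the definition above, the sketch below converts a t-statistic into a two-sided p-value using the survival function of the t-distribution. The helper name `two_sided_p_value` and the example inputs (a statistic of -2.5 with 49 degrees of freedom) are made up for illustration and are not taken from the notebook.

```python
import scipy.stats as stats


def two_sided_p_value(t_stat, df):
    """Probability of a t-statistic at least as extreme as t_stat under the null."""
    # sf(x) = 1 - cdf(x); doubling the upper tail covers both directions.
    return 2 * stats.t.sf(abs(t_stat), df)


alpha = 0.05  # significance level, chosen before looking at the data
p_value = two_sided_p_value(t_stat=-2.5, df=49)  # hypothetical inputs

print(f"p-value: {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Because the t-distribution is symmetric, the sign of the statistic does not affect a two-sided p-value; only its magnitude does.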

The t-test is a statistical test used to determine whether the mean of a numeric data sample differs significantly from the population mean, or whether the means of two samples differ from one another.

![alternative text](https://miro.medium.com/v2/resize:fit:1400/1*ADioOHRLwtjmF_7huRLYSg.jpeg)

## One-Sample T-tests

A one-sample t-test checks whether a sample mean differs from the population mean. We use the t-statistic when the population standard deviation is unknown; otherwise we use a z-statistic.
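
For reference, the one-sample t-statistic is usually written as below, where $\bar{x}$ is the sample mean, $\mu_0$ is the population mean under the null hypothesis, $s$ is the sample standard deviation, and $n$ is the sample size. These symbols are introduced here for illustration and do not appear in the original notebook.

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1
$$

When the population standard deviation $\sigma$ is known, $s$ is replaced by $\sigma$ and the statistic is compared against the standard normal distribution instead.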

The Kaggle notebook I am following creates dummy age data for the population of voters in the entire country and for a sample of voters in Minnesota, and tests whether the average age of voters in Minnesota differs from that of the population.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math

In [3]:
np.random.seed(6)  # fix the seed so the results are reproducible

# Dummy population: a mixture of two shifted Poisson distributions.
# loc=18 shifts each distribution so the minimum age is 18 (voters must be 18+).
population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))

# Minnesota sample: same kind of mixture, but the first component uses mu=30 instead of 35.
minnesota_ages1 = stats.poisson.rvs(loc=18, mu=30, size=30)
minnesota_ages2 = stats.poisson.rvs(loc=18, mu=10, size=20)
minnesota_ages = np.concatenate((minnesota_ages1, minnesota_ages2))

print(f"Population ages mean: {population_ages.mean()}")
print(f"Minnesota ages mean: {minnesota_ages.mean()}")



Population ages mean: 43.000112
Minnesota ages mean: 39.26


Notice that we used a slightly different combination of distributions to generate the sample data for Minnesota, so we know that the two means are different. Let's conduct a t-test at a 5% significance level (95% confidence level) and see whether it correctly rejects the null hypothesis that the sample comes from the same distribution as the population, i.e. that the mean age of Minnesota voters equals the population mean.

To conduct a one-sample t-test, we can use the stats.ttest_1samp() function:

In [4]:
stats.ttest_1samp(a = minnesota_ages, popmean= population_ages.mean())

TtestResult(statistic=np.float64(-2.5742714883655027), pvalue=np.float64(0.013118685425061678), df=np.int64(49))

The test result shows that the test statistic (t-value) is -2.5743. It tells us how many standard errors the sample mean lies from the population mean stated in the null hypothesis.
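
To make the meaning of the t-statistic concrete, here is a minimal sketch that computes it by hand as the distance between the sample mean and the hypothesized mean, measured in standard errors, and compares the result with stats.ttest_1samp(). The small `sample` array and the value of `hypothesized_mean` are invented for illustration; they are not the Minnesota data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample and null-hypothesis mean, for illustration only.
sample = np.array([39, 45, 28, 31, 52, 47, 36, 41, 29, 44])
hypothesized_mean = 43.0

n = sample.size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation
standard_error = sample_std / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - hypothesized_mean) / standard_error

# Library result for comparison.
result = stats.ttest_1samp(a=sample, popmean=hypothesized_mean)

print(f"manual t-statistic: {t_manual:.4f}")
print(f"scipy t-statistic:  {result.statistic:.4f}")
```

The two printed values should agree, which shows that the library call is doing nothing more than this division before looking up the p-value.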

If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, i.e. beyond the t-critical values and so within the rejection region, we reject the null hypothesis. We can find these quantiles with stats.t.ppf(), the percent point function, which is the inverse of the CDF.

In [7]:
print(stats.t.ppf(q=0.025, df=49))  # lower critical value; df = n - 1 = 50 - 1
print(stats.t.ppf(q=0.975, df=49))  # upper critical value

-2.0095752371292397
2.0095752371292397


We can see that the test statistic for our sample (-2.5743) falls within the rejection region: it lies below the lower t-critical value of -2.0096 for a 5% significance level. Equivalently, the p-value of 0.0131 is below 0.05. We therefore reject the null hypothesis that the sample comes from the same distribution as the population, and conclude that we have sufficient evidence at the 5% significance level that the mean age of Minnesota voters differs from the population mean.