### Hypothesis testing

In the general setting, we start with a hypothesis about the population that we want to test, called the null hypothesis. We then construct an opposing hypothesis, called the alternative hypothesis. The hypothesis test tells us whether the data give us enough evidence to reject the null hypothesis in favor of the alternative, or whether we fail to reject it.

The whole process of hypothesis testing can be summed up in the steps below:

* Formulate the null hypothesis (denoted by H0) and the alternative hypothesis (denoted by H1)
* Determine the sample size
* Choose a significance level (denoted by α; a common value is 0.05)
* Collect your sample
* Compute the test statistic from the sample
* Decide whether to reject or fail to reject the null hypothesis

The test statistic computed from the sample is usually examined together with the p-value, which is the probability of observing a statistic at least as extreme as ours if the null hypothesis were true. A very small p-value suggests that what we are observing is unlikely to be due to chance alone.

There are many tests available; in this unit, we will look at two of them: t-tests and chi-square tests. We will continue working with our example from the previous unit, in which the variable population consists of Poisson random variables.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Build the population as a mixture of two Poisson samples
poisson1 = stats.poisson.rvs(mu=55, size=200000)
poisson2 = stats.poisson.rvs(mu=10, size=100000)
population = np.concatenate((poisson1, poisson2))

#### One-sample t-tests
This test determines whether the mean of a numerical sample differs significantly from a known or hypothesized population mean. To demonstrate this with an example, we will set up an artificial sample, called test_sample, that we create in such a way that it does indeed differ from the original population.

In [4]:
test_sample = np.concatenate(
    (stats.poisson.rvs(mu=50, size=200), stats.poisson.rvs(mu=10, size=100))
)

The main difference here is that the mean of the first part is 50 instead of 55 as in the original population (the size of test_sample is also much smaller than the size of population). Let’s now declare the null hypothesis:

H0: the dataset test_sample has the same mean as the dataset population

We will choose a significance level of 0.05, which we mentioned is common for this type of test. We can now compute the two outputs of this test, the t_statistic and the p_value, using the function in the stats module for one-sample t-tests, appropriately called stats.ttest_1samp(). Let’s give it a go.

In [5]:
t_statistic, p_value = stats.ttest_1samp(test_sample, popmean=population.mean())
t_statistic


-2.6163373876026763

In [6]:
p_value

0.00933935058143441

Now, what does this mean? The t_statistic is a standardized metric that tells us how far the sample mean deviates from the mean stated in the null hypothesis, measured in units of the standard error. The p_value means that, if the null hypothesis were true, we would see a deviation at least this large only about 0.93 percent of the time. This is quite a small percentage. Our decision on whether or not to reject the null hypothesis is based on comparing the p_value with the significance level. The rule is the following:

* if the p-value is less than the significance level, we reject the null hypothesis
* if the p-value is greater than the significance level, we fail to reject the null hypothesis

In our case, a p-value of 0.0093 is certainly less than our significance level of 0.05, therefore we can reject the null hypothesis. Of course, since we created this example, we knew from the start that our test_sample did indeed have a different mean, but the hypothesis test confirmed this for us.
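To see what the t-statistic and p-value measure, the following minimal sketch recomputes them by hand and compares the result with stats.ttest_1samp(). The names demo_sample and mu0 and the fixed seed are assumptions made for illustration; because the sample is freshly generated, the numbers will not match the cells above.

```python
import numpy as np
from scipy import stats

# Assumption: a seeded generator so that the sketch is reproducible.
rng = np.random.default_rng(0)

# Hypothetical sample built like test_sample (Poisson means 50 and 10).
demo_sample = np.concatenate(
    (rng.poisson(lam=50, size=200), rng.poisson(lam=10, size=100))
)

# Theoretical mean of the population mixture: (2/3) * 55 + (1/3) * 10 = 40.
mu0 = 40.0

n = demo_sample.size
# t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
t_manual = (demo_sample.mean() - mu0) / (demo_sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value: probability of a |t| at least this large under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library function should give the same two numbers.
t_scipy, p_scipy = stats.ttest_1samp(demo_sample, popmean=mu0)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The manual and library values should agree up to floating-point precision, which shows that the test only needs the sample mean, the sample standard deviation, and the sample size.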

#### Type I and II errors
* A type I error occurs if we reject the null hypothesis when it is actually true (this is also called a false positive).
* A type II error occurs if we fail to reject the null hypothesis when it is actually false (this is known as a false negative).
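
The link between the significance level and type I errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the seed, the number of trials, and the Poisson mean of 10 are illustrative choices, not values from the example above. Every simulated sample is drawn so that the null hypothesis is true, so each rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
alpha = 0.05
n_trials = 2000  # hypothetical number of repeated experiments

# Each row is one sample of size 100 whose true mean really is 10,
# so the null hypothesis "mean == 10" holds for every row.
null_samples = rng.poisson(lam=10, size=(n_trials, 100))

# Run one t-test per row.
_, p_values = stats.ttest_1samp(null_samples, popmean=10, axis=1)

# Fraction of tests that wrongly reject H0 (type I errors).
type_1_rate = np.mean(p_values < alpha)
print(type_1_rate)
```

The printed rate should land close to alpha, which is why the significance level is often described as the type I error rate we are willing to tolerate.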


#### Chi-square tests

The t-test we looked at before is great for comparing quantitative properties such as the mean. However, when we are dealing with qualitative (categorical) data, we use a different test known as the chi-square test. Let’s set up an example of a categorical population and a test sample.

In [7]:
data = pd.DataFrame(
    ["red"] * 50000 + ["blue"] * 30000 + ["green"] * 10000 + ["white"] * 10000
)
sample = pd.DataFrame(["red"] * 600 + ["blue"] * 300 + ["green"] * 70 + ["white"] * 60)

We created a test sample in this case which has a different distribution of colors than our population. We will now test the null hypothesis that the sample has the same distribution as the population.

In [8]:
data_count = pd.crosstab(index=data[0], columns="count")
data_count

| 0 | count |
|---|---|
| blue | 30000 |
| green | 10000 |
| red | 50000 |
| white | 10000 |


In [9]:
sample_count = pd.crosstab(index=sample[0], columns="count")
sample_count

| 0 | count |
|---|---|
| blue | 300 |
| green | 70 |
| red | 600 |
| white | 60 |


Now we must calculate the chi-square statistic which is given by the following formula

$$\chi^2 = \sum_{\text{categories}} \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

where the sum is over all the categories, observed is the count for the given category in our sample, and expected is the expected count based on the distribution of our original population. So in our case what are the expected counts for our sample?
    
They are obtained by taking the counts of the population, multiplying by the length of the sample, and then dividing by the length of the population:

In [10]:
expected_count = data_count * len(sample) / len(data)
expected_count

| 0 | count |
|---|---|
| blue | 309.0 |
| green | 103.0 |
| red | 515.0 |
| white | 103.0 |


Note that the expected counts are the counts that the sample would have if it were to have the exact same distribution of colors as the population.

We can now compute our chi-square

In [12]:
chi_square = (((sample_count - expected_count) ** 2) / expected_count).sum()
chi_square

col_0
count    42.815534
dtype: float64
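
To make the arithmetic behind this statistic explicit, the four per-category terms can be written out using the observed and expected counts from the tables above, in the order blue, green, red, white. The per-term values are rounded to three decimals.

```latex
\chi^2
  = \frac{(300 - 309)^2}{309}
  + \frac{(70 - 103)^2}{103}
  + \frac{(600 - 515)^2}{515}
  + \frac{(60 - 103)^2}{103}
  \approx 0.262 + 10.573 + 14.029 + 17.951
  \approx 42.82
```

Most of the statistic comes from green, red, and white, the categories whose sample counts deviate most from what the population proportions would predict.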

So we have a chi-square statistic of 42.82. But how can we interpret this? First we choose a significance level, for which we once again go with 0.05, and then compute the critical value of the chi-square distribution corresponding to this level.

In [13]:
stats.chi2.ppf(q=0.95, df=3)

7.814727903251179

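The degrees of freedom in this case are given by the number of categories minus one, here 4 - 1 = 3. Our chi-square statistic of 42.82 is far above the critical value of 7.81, which already tells us that the null hypothesis should be rejected.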

We can now compute the p-value corresponding to our chi-square statistic.



In [15]:
p_value = 1 - stats.chi2.cdf(x=chi_square, df=3)
p_value

array([2.69324241e-09])

The conclusion is made as before: our p-value is lower than our significance level, so we reject the null hypothesis. This is indeed the correct answer, since we picked the sample in such a way that it would have a different distribution of the categorical variable from our population.
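
As a cross-check, the same test can be run in one call with scipy.stats.chisquare, which takes the observed and expected counts and returns both the statistic and the p-value. The sketch below is self-contained and hard-codes the counts from the tables above; the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Counts in the order blue, green, red, white (from the tables above).
observed = np.array([300, 70, 600, 60])
population_counts = np.array([30000, 10000, 50000, 10000])

# Expected sample counts if the sample followed the population distribution.
expected = population_counts * observed.sum() / population_counts.sum()

alpha = 0.05
# By default, the degrees of freedom are the number of categories minus one.
chi2_stat, chi2_p = stats.chisquare(f_obs=observed, f_exp=expected)
critical = stats.chi2.ppf(q=1 - alpha, df=len(observed) - 1)

print(chi2_stat, critical, chi2_p)
print("reject H0" if chi2_p < alpha else "fail to reject H0")
```

Both decision routes agree here: the statistic exceeds the critical value, and the p-value is below the significance level.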