## <center> Hypothesis testing </center>

Point estimates and confidence intervals are basic inference tools that act as the foundation for another inference technique: statistical hypothesis testing. Statistical hypothesis testing is a framework for determining whether observed data deviate from what is expected under a stated hypothesis. Python's `scipy.stats` library contains an array of functions that make it easy to carry out hypothesis tests.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

### Tests on continuous data

#### One-Sample t-test

A one-sample t-test checks whether a sample mean differs from a hypothesized population mean.

To check the mean value of normally distributed data against a reference value, we typically use the one-sample t-test, which is based on the t-distribution. If we knew the mean and the standard deviation of a normally distributed population, we could calculate the corresponding standard error and use values from the normal distribution to determine how likely it is to find a certain value. In practice, however, we have to estimate the mean and standard deviation from the sample, and the t-distribution, which characterizes the distribution of sample means for normally distributed data, deviates slightly from the normal distribution.

Ex: The daily energy intake from 11 healthy women is [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770] kJ. Test the claim that the mean value is 7725, with a significance level of 0.05.

In [2]:
data = np.array([5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770])

t, p_val = stats.ttest_1samp(data, 7725, alternative='two-sided')
t, p_val

(-2.8207540608310198, 0.018137235176105812)

$H_0$ is rejected because the $p$-value is less than $\alpha$, so there is a significant difference between the recommended mean value and the observed mean value.
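The result of `ttest_1samp` can be reproduced by hand, which makes the role of the standard error and of the t-distribution explicit. The following is a minimal sketch; the names `mu0`, `se`, `t_manual`, and `p_manual` are illustrative and do not appear in the original analysis.

```python
import numpy as np
import scipy.stats as stats

data = np.array([5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770])
mu0 = 7725  # reference value being tested

n = data.size
se = data.std(ddof=1) / np.sqrt(n)   # estimated standard error of the mean
t_manual = (data.mean() - mu0) / se  # t statistic

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_manual, p_manual)
```

Both values should agree with the output of `stats.ttest_1samp` up to floating-point precision.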

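If the data are not normally distributed, the one-sample t-test should not be used. Instead, we can use a nonparametric test such as the Wilcoxon signed-rank test, which checks whether the data are distributed symmetrically around the reference value (it is often described as a test on the median rather than the mean). For the default two-sided test, the statistic returned by `stats.wilcoxon` (called `rank` below) is the smaller of the rank sums of the positive and the negative differences. Note that, in contrast to the one-sample t-test, this test checks for a difference from zero, so the reference value has to be subtracted from the data first: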


      (rank, pVal) = stats.wilcoxon(data-checkValue)

In [3]:
data = np.array([5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770])

# The statistic is a rank sum, not a t value, so it is named accordingly
rank, p_val = stats.wilcoxon(data - 7725)
rank, p_val

(8.0, 0.0244140625)
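To see where the statistic of 8 comes from, the signed ranks can be computed directly. This is a minimal sketch using `stats.rankdata`; the names `diff`, `w_plus`, and `w_minus` are illustrative.

```python
import numpy as np
import scipy.stats as stats

data = np.array([5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770])

diff = data - 7725                    # differences from the reference value
ranks = stats.rankdata(np.abs(diff))  # rank the absolute differences; ties get the average rank

w_plus = ranks[diff > 0].sum()        # rank sum of the positive differences
w_minus = ranks[diff < 0].sum()       # rank sum of the negative differences

print(w_plus, w_minus, min(w_plus, w_minus))  # 8.0 58.0 8.0
```

Only two of the eleven values lie above the reference value, so the positive rank sum is the smaller one and is the value that the two-sided test reports.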

#### Two-sample t-test

A two-sample t-test investigates whether the means of two independent data samples differ from one another. In a two-sample test, the null hypothesis is that the means of both groups are the same. Unlike the one-sample test, where we test against a known population parameter, the two-sample test only involves sample means.

##### Testing from stats
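When only descriptive statistics (mean, standard deviation, and size) are available for two independent samples, the t-test for their means can be carried out with `ttest_ind_from_stats`.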

In [4]:
# Suppose we have the summary data for two samples, as follows
# (the sample variance is the corrected, i.e. unbiased, sample variance):

#             Size    Mean    Variance
# Sample 1      45   204.4     13825.3
# Sample 2      55   130.0      8632.0

t, p_val = stats.ttest_ind_from_stats(mean1=204.4, std1=np.sqrt(13825.3), nobs1=45,
                                mean2=130.0, std2=np.sqrt(8632.0), nobs2=55,
                                alternative='two-sided')

t, p_val

(3.5349416415661556, 0.0006244158265099108)
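By default `ttest_ind_from_stats` assumes equal population variances and therefore pools the two sample variances. The sketch below reproduces that calculation from the summary table; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

n1, mean1, var1 = 45, 204.4, 13825.3
n2, mean2, var2 = 55, 130.0, 8632.0

df = n1 + n2 - 2  # degrees of freedom of the pooled test
# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the mean difference

t_manual = (mean1 - mean2) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

print(t_manual, p_manual)
```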

##### Paired t-test
Two values recorded from the same subject at different times are compared to each other.

Ex: Weekly mean man-hours lost due to accidents in ten plants, before and after new safety regulations were introduced. Test the claim that the new regulations had a significant impact, with a significance level of 0.05.

In [5]:
before = np.array([45, 73, 46, 124, 33, 57, 83, 34, 26, 17])
after = np.array([36, 60, 44, 119, 35, 51, 77, 29, 24, 11])

t, p_val = stats.ttest_rel(a=before, b=after, alternative='two-sided')
t, p_val

(4.03328398196115, 0.002958322868433042)

$H_0$ is rejected because the $p$-value is less than $\alpha$, so the new regulations had a significant impact. The positive $t$ statistic indicates that fewer man-hours were lost after the regulations were introduced.
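A paired t-test is equivalent to a one-sample t-test on the per-subject differences against zero. The following sketch checks this equivalence on the same data; the names `diff`, `t_diff`, and `p_diff` are illustrative.

```python
import numpy as np
import scipy.stats as stats

before = np.array([45, 73, 46, 124, 33, 57, 83, 34, 26, 17])
after = np.array([36, 60, 44, 119, 35, 51, 77, 29, 24, 11])

diff = before - after  # one difference per plant

# One-sample t-test of the differences against a mean of zero
t_diff, p_diff = stats.ttest_1samp(diff, 0)
# Paired t-test on the original measurements
t_rel, p_rel = stats.ttest_rel(before, after)

print(np.isclose(t_diff, t_rel), np.isclose(p_diff, p_rel))
```

Working with the differences removes the large plant-to-plant variation, which is why the paired test can detect a comparatively small change.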

##### t-Test between Independent Groups

An unpaired t-test, or t-test for two independent groups, compares two groups. An example would be the comparison of the effect of two medications given to two different groups of patients.
The basic idea is the same as for the one-sample t-test, but instead of the variance of the mean we now need the variance of the difference between the means of the two groups. Since the variance of a sum (or difference) of independent random variables equals the sum of the variances, the standard error of the difference is $\sqrt{s_1^2/n_1 + s_2^2/n_2}$, where $s_i^2$ and $n_i$ are the sample variance and the size of group $i$.

Ex: In a clinic, 15 lazy patients weigh [76, 101, 66, 72, 88, 82, 79, 73, 76, 85, 75, 64, 76, 81, 86] kg, and 15 sporty patients weigh [64, 65, 56, 62, 59, 76, 66, 82, 91, 57, 92, 80, 82, 67, 54] kg.
Are the lazy patients significantly heavier?

In [6]:
lazy = np.array([76, 101, 66, 72, 88, 82, 79, 73, 76, 85, 75, 64, 76, 81, 86])
sporty = np.array([64, 65, 56, 62, 59, 76, 66, 82, 91, 57, 92, 80, 82, 67, 54])

# equal_var=True (the default) performs the standard two-sample test, which assumes
# equal population variances; equal_var=False performs Welch's t-test, which does not.
t, p_val = stats.ttest_ind(a=lazy, b=sporty, equal_var=False, alternative='two-sided')
t, p_val

(2.0968730776547093, 0.046052661509003605)

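At $\alpha = 0.05$, $H_0$ is rejected because the $p$-value is less than $\alpha$, so there is a significant difference between the weights of the two groups. The test is two-sided, so it only shows that the weights differ; the positive $t$ statistic indicates that the lazy group is the heavier one.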
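Welch's t-test can be reproduced by hand: the standard error comes from the sum of the two variances of the means, and the degrees of freedom come from the Welch-Satterthwaite approximation. This is a minimal sketch with illustrative variable names.

```python
import numpy as np
import scipy.stats as stats

lazy = np.array([76, 101, 66, 72, 88, 82, 79, 73, 76, 85, 75, 64, 76, 81, 86])
sporty = np.array([64, 65, 56, 62, 59, 76, 66, 82, 91, 57, 92, 80, 82, 67, 54])

# Variance of each group mean: s_i^2 / n_i
v1 = lazy.var(ddof=1) / lazy.size
v2 = sporty.var(ddof=1) / sporty.size

se = np.sqrt(v1 + v2)  # standard error of the difference of the means
t_manual = (lazy.mean() - sporty.mean()) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (lazy.size - 1) + v2 ** 2 / (sporty.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

print(t_manual, p_manual)
```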

If the measurement values from two groups are not normally distributed, we have to resort to a nonparametric test. The most common nonparametric test for the comparison of two independent groups is the Mann–Whitney(–Wilcoxon) test. Watch out, because this test is sometimes also referred to as the Wilcoxon rank-sum test. This is different from the Wilcoxon signed-rank test! The test statistic for this test is commonly denoted by $U$:


        u_statistic, pVal = stats.mannwhitneyu(group1, group2)
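As an illustration, the call can be applied to the weight data from the previous example. This sketch sets `alternative='two-sided'` explicitly, because the default has differed between SciPy versions; no output is shown, because the exact values depend on how the installed version handles ties.

```python
import numpy as np
import scipy.stats as stats

lazy = np.array([76, 101, 66, 72, 88, 82, 79, 73, 76, 85, 75, 64, 76, 81, 86])
sporty = np.array([64, 65, 56, 62, 59, 76, 66, 82, 91, 57, 92, 80, 82, 67, 54])

# Nonparametric comparison of two independent groups
u_statistic, p_val = stats.mannwhitneyu(lazy, sporty, alternative='two-sided')
print(u_statistic, p_val)
```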

### Tests on categorical data

#### Chi-square test 
The chi-square test is the most common type of test for frequency tables. It checks whether the entries in the individual cells of a frequency table all come from the same distribution. In other words, it tests the null hypothesis $H_0$ that the results are independent of the row or column in which they appear. The alternative hypothesis $H_a$ does not specify the type of association, so close attention to the data is required to interpret the information provided by the test.

##### One-Way Chi-Square Test --- Goodness of fit
The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution.

$$ \chi^2 = \sum_{i=1}^{k} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i} $$

Ex: Test whether the observed frequencies are consistent with a Poisson distribution with parameter $\lambda = 4.6$, at a significance level of 0.01.

In [7]:
observed = np.array([3, 15, 47, 76, 68, 74, 46, 39, 15, 9, 5, 2, 0, 1])
expected = np.array([4, 18.4, 42.8, 65.2, 74.8, 69.2, 52.8, 34.8, 20, 10, 4.8, 2, 0.8, 0.4])

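# ddof=1 reduces the degrees of freedom from k - 1 to k - 2, as is done when
# one distribution parameter is estimated from the data.
chisq, p = stats.chisquare(f_obs=observed,
                           f_exp=expected,
                           ddof=1)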
chisq, p

(8.471484713365697, 0.7472860318162149)

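In this case we fail to reject $H_0$, because the $p$-value (about 0.75) is far greater than $\alpha = 0.01$: the data give no evidence against a Poisson distribution with $\lambda = 4.6$. Note that failing to reject $H_0$ does not prove that the data follow a Poisson distribution; it only means that the observed counts are consistent with one.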
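The goodness-of-fit statistic follows directly from the formula given earlier in this section. The sketch below evaluates it by hand and takes the $p$-value from the chi-square distribution with $k - 1 - \text{ddof} = 12$ degrees of freedom; the names `chisq_manual`, `df`, and `p_manual` are illustrative.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([3, 15, 47, 76, 68, 74, 46, 39, 15, 9, 5, 2, 0, 1])
expected = np.array([4, 18.4, 42.8, 65.2, 74.8, 69.2, 52.8, 34.8, 20, 10, 4.8, 2, 0.8, 0.4])

# Sum of squared deviations, each scaled by its expected count
chisq_manual = np.sum((observed - expected) ** 2 / expected)

k = observed.size
df = k - 1 - 1  # k - 1 categories, minus ddof=1 as in the call to stats.chisquare
p_manual = stats.chi2.sf(chisq_manual, df)  # upper-tail probability

print(chisq_manual, p_manual)
```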

##### Chi-Square Contingency Test

The chi-squared test of independence, or chi-square contingency test, checks whether two categorical variables are independent. It is based on a test statistic that measures the divergence of the observed data from the values that would be expected under the null hypothesis of no association.

Ex: Determine if there is a relationship between an employee's performance in the company's training program and their subsequent success on the job. Use a significance level of 0.01.

In [8]:
data = np.array([
    [23, 60, 29],
    [28, 79, 60],
    [9, 49, 63]
])

# Yates' continuity correction is only applied when dof == 1 (a 2x2 table),
# so correction=True has no effect on this 3x3 table.
chi2, p, dof, ex = stats.chi2_contingency(data, correction=True)
chi2, p, dof, ex

(20.178903582087926,
 0.00046038041384262443,
 4,
 array([[16.8 , 52.64, 42.56],
        [25.05, 78.49, 63.46],
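        [18.15, 56.87, 45.98]]))

$H_0$ is rejected because the $p$-value is less than $\alpha = 0.01$, so performance in the training program and subsequent success on the job are not independent.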
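The expected frequencies returned by `chi2_contingency` are the values implied by independence: each cell is its row total times its column total, divided by the grand total. This sketch rebuilds the expected table and the statistic by hand; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

data = np.array([
    [23, 60, 29],
    [28, 79, 60],
    [9, 49, 63]
])

row_totals = data.sum(axis=1)
col_totals = data.sum(axis=0)
total = data.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / total

chi2_manual = ((data - expected) ** 2 / expected).sum()
dof = (data.shape[0] - 1) * (data.shape[1] - 1)  # (rows - 1) * (columns - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

print(expected)
print(chi2_manual, p_manual, dof)
```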

##### Cochran’s Q Test
Cochran’s Q test is a hypothesis test where the response variable can take only two
possible outcomes (coded as 0 and 1). It is a nonparametric statistical test to verify
if k treatments have identical effects. 

The null hypothesis for the Cochran’s Q test is that there are no differences
between the variables. If the calculated probability p is below the selected significance
level, the null hypothesis is rejected, and it can be concluded that the
proportions in at least 2 of the variables are significantly different from each other.

Ex: Twelve subjects are asked to perform three tasks. The outcome of each task is
success or failure. The results are coded 0 for failure and 1 for success. In the
example, subject 1 was successful in task 2, but failed tasks 1 and 3.

In [9]:
from statsmodels.sandbox.stats.runs import cochrans_q

In [10]:
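# One row per task, one column per subject (0 = failure, 1 = success)
tasks = np.array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
                  [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
                  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]])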

df = pd.DataFrame(tasks.T, columns = ['Task1', 'Task2', 'Task3'])
Q, pVal = cochrans_q(df)
Q, pVal 

(8.666666666666666, 0.013123728736940971)
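The statistic can also be computed without statsmodels. With $k$ treatments, column totals $C_j$ (successes per task), row totals $R_i$ (successes per subject), and grand total $N$, the textbook formula is $Q = (k-1)\,\dfrac{k\sum_j C_j^2 - N^2}{kN - \sum_i R_i^2}$, which is compared against a chi-square distribution with $k - 1$ degrees of freedom. The following is a minimal sketch of that formula with illustrative variable names.

```python
import numpy as np
import scipy.stats as stats

# One row per task, one column per subject (0 = failure, 1 = success)
tasks = np.array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
                  [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
                  [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

x = tasks.T                  # subjects in rows, tasks in columns
k = x.shape[1]               # number of treatments (tasks)
col_totals = x.sum(axis=0)   # successes per task
row_totals = x.sum(axis=1)   # successes per subject
N = x.sum()                  # total number of successes

Q_manual = (k - 1) * (k * np.sum(col_totals ** 2) - N ** 2) / (k * N - np.sum(row_totals ** 2))
p_manual = stats.chi2.sf(Q_manual, k - 1)  # chi-square with k - 1 degrees of freedom

print(Q_manual, p_manual)
```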