# Hypothesis Testing - Part 3. Chi Squared



**Assumptions**  


The following Resources have been used:  


*This notebook is from a series on Hypothesis Testing* 
1. *Hypothesis Testing - Comparing Proportions (z test)*
2. *Hypothesis Testing - Comparing Means (t test)*  
3. ***Hypothesis Testing - Chi Sq***  
4. *Hypothesis Testing - ANOVA*  

#### Libraries

In [35]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import datasets

## 1. Chi Squared - Goodness of fit  
#### How well do our observations fit the expected values?

The chi-square goodness of fit test is appropriate when the following conditions are met: 
1. The sampling method is simple random sampling. 
2. The variable under study is categorical.
3. The expected count (note: expected, NOT observed) in each level of the variable is at least 5.

#### Example

Joe wondered whether the health clinic appointments at his clinic were evenly distributed from Monday through Friday, so he took a random sample of 500 appointments and recorded their days of the week. Here are his results:

|Day|Appointments|
|---|---|
|Monday|115|
|Tuesday|100|
|Wednesday|115|
|Thursday|100|
|Friday|70|

In [36]:
# Significance level
alpha = 0.05

# Observed appointments, Monday through Friday
observed = np.array([115, 100, 115, 100, 70])
#observed = np.array([115, 117, 115, 115, 110])
# H0: appointments are spread evenly across the five days
expected_ratio = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
expected = observed.sum() * expected_ratio

chisq, p = stats.chisquare(observed, expected)

print("Chisq stat: {:.4f} pvalue: {:.4f}".format(chisq,p))
if p <= alpha:
    print('Reject the H0: Data does not appear to come from the specified distribution')
else:
    print('Fail to reject H0: Data is consistent with the specified distribution')

Chisq stat: 13.5000 pvalue: 0.0091
Reject the H0: Data does not appear to come from the specified distribution
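
To make the statistic above less of a black box, the following sketch recomputes it by hand from the definition $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ and reads the p-value from the chi-squared distribution with $k - 1$ degrees of freedom, where $k$ is the number of categories. The variable names are chosen for illustration only; the result should agree with `stats.chisquare` for this example.

```python
import numpy as np
from scipy import stats

# Observed appointments, Monday through Friday (same data as the example)
observed = np.array([115, 100, 115, 100, 70])

# Under H0 every day is equally likely, so each expected count is total / k
expected = np.full(observed.shape, observed.sum() / observed.size)

# Chi-squared statistic from its definition: sum of (O - E)^2 / E
chisq_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for goodness of fit: number of categories minus one
dof = observed.size - 1

# p-value = upper tail probability of the chi-squared distribution
p_manual = stats.chi2.sf(chisq_manual, dof)

# Cross-check against SciPy's implementation
chisq_scipy, p_scipy = stats.chisquare(observed, expected)

print("manual: chisq={:.4f} p={:.4f}".format(chisq_manual, p_manual))
print("scipy:  chisq={:.4f} p={:.4f}".format(chisq_scipy, p_scipy))
```

Friday contributes most of the statistic here: $(70 - 100)^2 / 100 = 9$ of the total 13.5.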


## 2. Chi Squared - Homogeneity

If we compare several samples on one categorical variable, we are testing for homogeneity. If we have a single sample and compare two of its attributes (e.g. age and another variable), we are testing for independence.

**Homogeneity:** A larger chi-square statistic gives a smaller p-value. If the p-value is below the significance level, we conclude that the distributions are different.  
**Independence:** A larger chi-square statistic gives a smaller p-value. If the p-value is below the significance level, we conclude that the variables are dependent.
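
The link between the size of the statistic and the p-value can be illustrated with the chi-squared distribution directly. The sketch below assumes 2 degrees of freedom and a handful of made-up statistic values; both are illustrative choices rather than values taken from the examples in this notebook.

```python
from scipy import stats

alpha = 0.05
dof = 2  # assumed for illustration, e.g. a 3x2 table: (3 - 1) * (2 - 1)

# Statistic value above which the p-value drops below alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)
print("Critical value at alpha={}: {:.4f}".format(alpha, critical_value))

# Hypothetical statistics: the p-value shrinks as the statistic grows
for statistic in [1.0, 3.0, 6.0, 10.0]:
    p = stats.chi2.sf(statistic, dof)
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print("chisq={:5.1f}  p={:.4f}  {}".format(statistic, p, decision))
```

Comparing the statistic with the critical value and comparing the p-value with the significance level lead to the same decision.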

#### Example

|Age|Alerts|No alerts|Total|
|---|---|---|---|
|0-29|48|64|112|
|*(expected)*|56|56||
|30-59|33|27|60|
|*(expected)*|30|30||
|60+|19|9|28|
|*(expected)*|14|14||
|Total|100|100|200|

The row beneath each age group gives the expected counts under $H_0$. We want to use these results to carry out a chi-squared test for homogeneity. Assume that all conditions for inference are met.

In [37]:
# Rows: Alerts, No alerts; columns: age groups 0-29, 30-59, 60+
observed = [[48,33,19], [64,27,9]]
stat, p, dof, expected = stats.chi2_contingency(observed=observed)
print("chisq is {:.4f}".format(stat)) 
# interpret p-value
alpha = 0.05
print("p value is  {:.4f}".format(p)) 
if p <= alpha:
    print('Distributions are different (reject H0)')
else:
    print('No evidence that the distributions differ (fail to reject H0)')

chisq is 6.4571
p value is  0.0396
Distributions are different (reject H0)
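
The expected counts listed under each age group in the table above can be reproduced from the row and column totals as (row total × column total) / grand total, and the degrees of freedom as (rows − 1) × (columns − 1). The following is a minimal sketch with illustrative variable names; it also checks that transposing the table, as the cell above does relative to the printed table, leaves the statistic unchanged.

```python
import numpy as np
from scipy import stats

# Rows: age groups (0-29, 30-59, 60+); columns: Alerts, No alerts
observed = np.array([[48, 64],
                     [33, 27],
                     [19, 9]])

row_totals = observed.sum(axis=1)   # 112, 60, 28
col_totals = observed.sum(axis=0)   # 100, 100
grand_total = observed.sum()        # 200

# Expected count per cell under H0
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Degrees of freedom for an r x c table
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy, with the table in both orientations
stat, p, dof, expected = stats.chi2_contingency(observed)
stat_t, p_t, dof_t, _ = stats.chi2_contingency(observed.T)

print(expected_manual)
print("dof:", dof_manual)
print("chisq: {:.4f} (transposed: {:.4f})".format(stat, stat_t))
```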


In [38]:
observed = [[6,9,15,6], [78,55,133,58]]
stat, p, dof, expected = stats.chi2_contingency(observed=observed)
print("chisq is {:.4f}".format(stat)) 
# interpret p-value
alpha = 0.05
print("p value is  {:.4f}".format(p)) 
if p <= alpha:
    print('Distributions are different (reject H0)')
else:
    print('No evidence that the distributions differ (fail to reject H0)')

chisq is 1.9663
p value is  0.5794
No evidence that the distributions differ (fail to reject H0)


## 3. Chi Squared - Independence

$H_0$ - The two variables are independent  
$H_1$ - The two variables are not independent

In [39]:
fave_colour = pd.DataFrame(
    [
        [25,46,15],
        [15,44,15],
        [10,10,20]
    ],
    index=['Child','Young_Adult','Older_Adult'],
    columns=['Red','Blue','Green'])
fave_colour

| |Red|Blue|Green|
|---|---|---|---|
|Child|25|46|15|
|Young_Adult|15|44|15|
|Older_Adult|10|10|20|


In [40]:
stat, p, dof, expected = stats.chi2_contingency(observed=fave_colour)
print("chisq is {:.4f}".format(stat)) 
# interpret p-value
alpha = 0.05
print("p value is  {:.4f}".format(p)) 
if p <= alpha:
    print('Variables are not independent (reject H0)')
else:
    print('No evidence of dependence (fail to reject H0)')

chisq is 20.3928
p value is  0.0004
Variables are not independent (reject H0)


In [41]:
fave_snack = pd.DataFrame(
    [
        [20,40,50],
        [10,21,27],
        [4,8,11]
    ],
    index=['Child','Young_Adult','Older_Adult'],
    columns=['choc','crisps','fruit'])
fave_snack

| |choc|crisps|fruit|
|---|---|---|---|
|Child|20|40|50|
|Young_Adult|10|21|27|
|Older_Adult|4|8|11|


In [42]:
stat, p, dof, expected = stats.chi2_contingency(observed=fave_snack)
print("chisq is {:.4f}".format(stat)) 
# interpret p-value
alpha = 0.05
print("p value is  {:.4f}".format(p)) 
if p <= alpha:
    print('Variables are not independent (reject H0)')
else:
    print('No evidence of dependence (fail to reject H0)')

chisq is 0.0620
p value is  0.9995
No evidence of dependence (fail to reject H0)
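
The goodness of fit section lists a minimum expected count of 5 per category as a condition, and the same rule of thumb is commonly applied to the cells of a contingency table. As a sketch, the expected counts returned by `chi2_contingency` can be inspected for the snack table. The threshold variable `min_expected` and the name `small_cells` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

fave_snack = pd.DataFrame(
    [
        [20, 40, 50],
        [10, 21, 27],
        [4, 8, 11]
    ],
    index=['Child', 'Young_Adult', 'Older_Adult'],
    columns=['choc', 'crisps', 'fruit'])

# Rule-of-thumb threshold for expected counts (assumed, as in section 1)
min_expected = 5

# chi2_contingency returns the expected counts as a NumPy array
stat, p, dof, expected = stats.chi2_contingency(observed=fave_snack)

# (row, column) positions of the cells whose expected count is too small
small_cells = np.argwhere(expected < min_expected)

for i, j in small_cells:
    print("{} / {}: expected {:.2f}".format(
        fave_snack.index[i], fave_snack.columns[j], expected[i, j]))
```

For this table the Older_Adult / choc cell has an expected count of 23 × 34 / 191 ≈ 4.09, slightly below the threshold, so the result above should be read with some caution.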
