<h1>Chi Square Test </h1>
<br><b> 1. Goodness of fit </b>
<br><b> 2. Test of independence </b>

<b> Test statistic </b>
$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$
O = Observed frequency
<br>E = Expected frequency
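As a quick illustration of the formula, the sketch below computes the statistic by hand for three categories. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not part of the data used later.

```python
import numpy as np

# Hypothetical counts for three categories (illustrative only)
observed = np.array([18, 22, 20])   # O
expected = np.array([20, 20, 20])   # E

# Contribution of each category: (O - E)^2 / E
contributions = (observed - expected) ** 2 / expected

# The chi-square statistic is the sum of the contributions
chi_square = contributions.sum()

print(contributions)
print(chi_square)
```

Each category adds a non-negative term, so a larger statistic means the observed counts sit further from the expected ones.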

In [1]:
# loading packages
import numpy as np
from scipy import stats
import pandas as pd

<b> Problem : </b> Is the distribution of spoken languages in Paris similar to that of France as a whole?
<br><br><b>H0 : </b> The data fit the expected distribution
<br><b> Ha : </b> The data do not fit the expected distribution (the test is right-tailed)
<br><br><b> df </b>= number of categories minus 1

In [2]:
# creating data
france = pd.DataFrame( ['French'] * 100000 + ['Englis'] * 60000 + ['spanish'] * 50000 \
                      + ['italian'] * 15000 + ['others'] * 35000)

paris = pd.DataFrame(['French'] * 600 + ['Englis'] * 300 + ['spanish'] * 250 \
                      + ['italian'] * 75 + ['others'] * 150)

france_table = pd.crosstab (index = france[0], columns='count')
paris_table = pd.crosstab (index = paris[0], columns='count')
paris_table

col_0,count
0,
Englis,300
French,600
italian,75
others,150
spanish,250


<h3> Using formula </h3>

In [3]:
observed = paris_table # Observed values

france_ratio  = france_table / len(france)
expected  = france_ratio * len(paris) # Expected values

chi_square_stats = (((observed - expected)**2) / expected).sum() # chi square stats 
df = paris_table.shape[0] - 1 # degree of freedom

print('Degrees of freedom : ', df)
print(chi_square_stats)

Degrees of freedom :  4
col_0
count    18.194805
dtype: float64


In [4]:
alpha = 0.05
p_value = 1 - stats.chi2.cdf(x = chi_square_stats,
                             df = df)

print('----------- Result based on P Value ---------')
print('P_value :', p_value)
print('alpha :', alpha)
print ('Conclusion : Fail to reject H0' if p_value > alpha else 'Conclusion : Reject H0')

critical_value = stats.chi2.ppf(q = .95,  # critical value at the 95% confidence level
                                df = df)

# Reject H0 when the test statistic exceeds the critical value
print('')
print('----------- Result based on statistics Value ---------')
print('Statistics :', chi_square_stats.iloc[0])  # .iloc selects by position; the Series is labelled 'count'
print('Critical value :',critical_value )
print ('Conclusion : Fail to reject H0' if chi_square_stats.iloc[0] < critical_value else 'Conclusion : Reject H0')

----------- Result based on P Value ---------
P_value : [0.00113047]
alpha : 0.05
Conclusion : Reject H0

----------- Result based on statistics Value ---------
Statistics : 18.194805194805173
Critical value : 9.487729036781154
Conclusion : Reject H0
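The p-value and the critical value can also be read straight from the upper tail of the distribution. The sketch below reuses the statistic and degrees of freedom printed above and assumes `stats.chi2.sf` and `stats.chi2.isf` as replacements for `1 - cdf` and `ppf(1 - alpha)`. It is an equivalent route, not the one used in this notebook.

```python
from scipy import stats

# Values copied from the output above
statistic = 18.194805194805173
dof = 4
alpha = 0.05

# Survival function: P(X > statistic), equivalent to 1 - cdf
p_value = stats.chi2.sf(statistic, dof)

# Inverse survival function: the value whose upper-tail area is alpha,
# equivalent to ppf(1 - alpha)
critical_value = stats.chi2.isf(alpha, dof)

print('P_value :', p_value)
print('Critical value :', critical_value)
print('Reject H0' if p_value < alpha else 'Fail to reject H0')
```

Using `sf` avoids subtracting two numbers that are nearly equal, which matters when the p-value is very small.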


<h3> Using stats function </h3>

In [5]:
stats.chisquare(f_obs = observed,
                f_exp = expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
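The result above is wrapped in one-element arrays because the inputs are one-column DataFrames. A minimal sketch of the same test on plain 1-D arrays is shown below, with the counts typed in from the tables above. The variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Counts copied from the tables above, in the order
# Englis, French, italian, others, spanish
paris_counts = np.array([300, 600, 75, 150, 250])
france_counts = np.array([60000, 100000, 15000, 35000, 50000])

# Scale the France proportions to the Paris sample size
expected_counts = france_counts / france_counts.sum() * paris_counts.sum()

# With 1-D inputs the statistic and p-value come back as scalars
statistic, p_value = stats.chisquare(f_obs=paris_counts, f_exp=expected_counts)

print('Statistics :', statistic)
print('P_value :', p_value)
```

Note that `stats.chisquare` expects the observed and expected counts to have the same total, which is why the France proportions are rescaled to the Paris sample size first.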

<h2> Test of independence </h2>

<b>Problem :</b> Is there a relationship between voters' race and the political party they voted for?

<br><b>H0 :</b> There is no relationship between race and the party voted for
<br><b>Ha :</b> There is a relationship between race and the party voted for
<br><br><b>df =</b> (number of rows - 1) * (number of columns - 1)
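Under H0 the two variables are independent, so the expected count of each cell follows from the marginal totals alone. The symbols R, C and N below are introduced here for illustration: R is the total of row i, C the total of column j, and N the grand total.

$$
E_{ij} = \frac{R_i \times C_j}{N}
$$

The test statistic is then the same sum as before, taken over every cell of the table:

$$
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$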

<h3> Using formula </h3>

In [6]:
# Creating data
np.random.seed(10)
voter_race = np.random.choice(a = ['asian','black','hispanic','other','white'],
                              p =[0.05,0.15,0.25,0.05,0.5],
                              size = 1000)
pol_party = np.random.choice(a = ['democrat','independent','republican'],
                         p = [0.4,0.2,0.4],
                         size = 1000)

In [7]:
voters = pd.DataFrame({'race':voter_race, 'party':pol_party})
voters_tab = pd.crosstab(voters.race,voters.party,margins = True)
voters_tab.index = ['asian','black','hispanic','other','white','col_total']
voters_tab.columns = ['democrat','independent','republican','row_total']
voters_tab

,democrat,independent,republican,row_total
asian,21,7,32,60
black,65,25,64,154
hispanic,107,50,94,251
other,15,8,15,38
white,189,96,212,497
col_total,397,186,417,1000


In [8]:
observed = voters_tab.iloc[0:5,0:3]
expected = np.outer(voters_tab['row_total'][0:5], voters_tab.loc['col_total'][0:3]) / 1000
# e.g. first cell (asian, democrat): (60 * 397) / 1000 = 23.82
print(expected)

[[ 23.82   11.16   25.02 ]
 [ 61.138  28.644  64.218]
 [ 99.647  46.686 104.667]
 [ 15.086   7.068  15.846]
 [197.309  92.442 207.249]]


<b>H0 : </b> race and party are independent
<br><b> Ha : </b> race and party are dependent

In [9]:
chi_square_stats = (((observed - expected)**2)/expected).sum().sum()

df = (observed.shape[0] -1) * (observed.shape[1] - 1)
alpha = 0.05

p_value = 1 - stats.chi2.cdf(x = chi_square_stats, df = df)

print('----------- Result based on P Value ---------')
print('P_value :', p_value)
print('alpha :', alpha)
print ('Conclusion : Fail to reject H0' if p_value > alpha else 'Conclusion : Reject H0')

critical_value = stats.chi2.ppf(q = .95,  # taking critical value for 95% of confidence
                                df = df)

# Reject H0 only when the test statistic exceeds the critical value
print('')
print('----------- Result based on statistics Value ---------')
print('Statistics :', chi_square_stats)
print('Critical value :',critical_value )
print ('Conclusion : Fail to reject H0' if chi_square_stats < critical_value else 'Conclusion : Reject H0')

----------- Result based on P Value ---------
P_value : 0.518479392948842
alpha : 0.05
Conclusion : Fail to reject H0

----------- Result based on statistics Value ---------
Statistics : 7.169321280162059
Critical value : 15.50731305586545
Conclusion : Fail to reject H0


<h3> Using stats module function </h3>

In [10]:
result = stats.chi2_contingency(observed = observed)
chi_square_stats = result[0]  # test statistic
p_value = result[1]           # p-value

print('----------- Result based on P Value ---------')
print('P_value :', p_value)
print('alpha :', alpha)
print ('Conclusion : Fail to reject H0' if p_value > alpha else 'Conclusion : Reject H0')

----------- Result based on P Value ---------
P_value : 0.518479392948842
alpha : 0.05
Conclusion : Fail to reject H0
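`stats.chi2_contingency` also returns the degrees of freedom and the expected counts, which makes it easy to cross-check the manual calculation. The sketch below retypes the observed counts from the contingency table above as a plain array. The name `manual_expected` is illustrative only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above
# rows: asian, black, hispanic, other, white
# cols: democrat, independent, republican
observed = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

# The result unpacks into statistic, p-value, degrees of freedom, expected counts
statistic, p_value, dof, expected = stats.chi2_contingency(observed)

# Expected counts from the marginal totals, as in the manual calculation
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

print('Statistics :', statistic)
print('P_value :', p_value)
print('Degrees of freedom :', dof)
print('Expected counts match :', np.allclose(expected, manual_expected))
```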
