## Chi-Square Test
 - The chi-square distribution is denoted by $\large \chi^2 $. Tests based on it are non-parametric.
 - If a random variable $\large X \sim N(\mu,\sigma ^2)$, then the standard normal variate is $\large Z = \frac{X - \mu}{\sigma} \sim N(0,1)$.
 - The square of a standard normal variate, $\large Z^2 = \bigl[ \frac{X-\mu}{\sigma} \bigr] ^2 $, is a chi-square variate with 1 degree of freedom.
 - For $n$ independent normal variates $\large X_i \sim N(\mu_i,\sigma_i ^2)$, the sum $\large \chi^2 = \sum_{i=1}^{n} \bigl[ \frac{X_i - \mu_i}{\sigma_i} \bigr] ^2 $ is a chi-square variate with $n$ degrees of freedom.
 - In the chi-square goodness-of-fit test, the observed frequencies $O_i$ and the expected frequencies $E_i$ over $N$ categories must have the same total: $\large \sum_{i = 1}^{N} O_i = \sum_{i = 1}^{N}E_i$.
 - The chi-square test statistic is defined as $\large \chi ^2 = \sum_{i = 1}^{N} \bigl[ \frac {(O_i - E_i)^2}{E_i} \bigr]$.
 
 - The chi-square test of independence is applied when we have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
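The goodness-of-fit statistic defined above can be checked by hand on a small example. The sketch below uses made-up counts for 120 rolls of a die (illustrative data, not from this notebook) and compares the manual sum with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for 120 rolls of a six-sided die (illustrative only).
observed = np.array([18, 22, 16, 25, 19, 20])
# Under the null hypothesis of a fair die, each face is expected 120 / 6 = 20 times.
expected = np.full(6, observed.sum() / 6)

# The totals must agree: sum(O_i) == sum(E_i).
assert np.isclose(observed.sum(), expected.sum())

# Chi-square statistic: sum of (O_i - E_i)^2 / E_i over all categories.
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# scipy.stats.chisquare returns the same statistic and a p-value
# (degrees of freedom = number of categories - 1).
chi2_scipy, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print('Manual statistic:', chi2_manual)
print('SciPy statistic :', chi2_scipy)
print('p-value         :', p_value)
```

With these counts the squared deviations add up to 50, so the statistic is 50 / 20 = 2.5.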

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

In [2]:
tips_df = sns.load_dataset('tips')
tips_df.head()

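|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |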


In [3]:
tips_table = pd.crosstab(tips_df.sex, tips_df.smoker)
tips_table

| sex / smoker | Yes | No |
|---|---|---|
| Male | 60 | 97 |
| Female | 33 | 54 |


In [4]:
tips_table.values

array([[60, 97],
       [33, 54]], dtype=int64)

In [5]:
Observed_Values = tips_table.values
print('Observed Values:\n', Observed_Values)

Observed Values:
 [[60 97]
 [33 54]]


In [6]:
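# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies).
# For a 2x2 table (1 degree of freedom) it applies Yates' continuity correction by default,
# so this statistic differs from the uncorrected value computed by hand further down.
val = stats.chi2_contingency(tips_table)
print('Chi-Square test statistic:', val[0])
print('P-Value of test statistic:', val[1])
print('Degree of Freedom:', val[2])
print('Expected Values:\n', val[3])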

Chi-Square test statistic: 0.008763290531773594
P-Value of test statistic: 0.925417020494423
Degree of Freedom: 1
Expected Values:
 [[59.84016393 97.15983607]
 [33.15983607 53.84016393]]
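
The statistic reported by `chi2_contingency` for this 2x2 table includes Yates' continuity correction, which SciPy applies by default when there is 1 degree of freedom. A minimal sketch of how to switch it off with the `correction` parameter, using the observed counts from the crosstab above (the variable names are chosen here for illustration):

```python
import numpy as np
from scipy import stats

# Observed sex x smoker counts from the crosstab.
observed = np.array([[60, 97],
                     [33, 54]])

# Default behaviour: continuity correction is applied because dof == 1.
corrected = stats.chi2_contingency(observed)
# correction=False gives the plain Pearson statistic sum((O - E)^2 / E).
uncorrected = stats.chi2_contingency(observed, correction=False)

print('With correction   : statistic =', corrected[0], ' p-value =', corrected[1])
print('Without correction: statistic =', uncorrected[0], ' p-value =', uncorrected[1])
```

The uncorrected result is the one that a hand calculation of $\sum (O - E)^2 / E$ reproduces. The exact corrected value may depend on the installed SciPy version, so it is worth comparing both before drawing conclusions.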


In [7]:
Expected_Values = val[3]
# Print each row of observed counts next to the matching row of expected counts.
for O, E in zip(Observed_Values, Expected_Values):
    print(O, E)

[60 97] [59.84016393 97.15983607]
[33 54] [33.15983607 53.84016393]


In [8]:
no_of_rows = tips_table.shape[0]
no_of_columns = tips_table.shape[1]
# Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)
ddof = (no_of_rows-1) * (no_of_columns - 1)
alpha = 0.05  # significance level
print('Degrees of Freedom:', ddof)

Degrees of Freedom: 1


In [9]:
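from scipy.stats import chi2
# Pearson chi-square statistic without continuity correction: sum of (O - E)^2 / E over all cells.
# The inner sum adds the rows element-wise, giving one partial sum per column ...
chi_square = sum([(O-E)**2./E for O, E in zip(Observed_Values, Expected_Values)])
# ... and the outer sum adds those partial sums into a single number.
chi_square_statistics = sum(chi_square)
print('Chi-Square Statistics:',chi_square_statistics)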

Chi-Square Statistics: 0.001934818536627623
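
The expected values used in this calculation were taken from SciPy's output. As a sketch of where they come from, the block below rebuilds them from the row and column totals of the observed table under the independence assumption, $E_{ij} = (\text{row total}_i \times \text{column total}_j) / \text{grand total}$, and recomputes the statistic in vectorized form. The variable names are illustrative.

```python
import numpy as np

# Observed sex x smoker counts, copied from the crosstab shown earlier.
observed = np.array([[60, 97],
                     [33, 54]])

row_totals = observed.sum(axis=1)   # totals per sex
col_totals = observed.sum(axis=0)   # totals per smoker category
grand_total = observed.sum()        # total number of observations

# Under independence: E_ij = (row total i) * (column total j) / grand total.
expected = np.outer(row_totals, col_totals) / grand_total

# Uncorrected Pearson statistic summed over all cells.
chi_square_stat = ((observed - expected) ** 2 / expected).sum()

print('Expected values:\n', expected)
print('Chi-square statistic:', chi_square_stat)
```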


# Chi-Square Distribution Table

In [10]:
from IPython.display import IFrame
IFrame('../SDataset/chi-square-table.pdf', width=600, height=300)

In [11]:
critical_value = chi2.ppf(q = 1- alpha, df = ddof)
print('Critical Value:',critical_value)

Critical Value: 3.841458820694124
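
The critical value is the point that leaves an upper-tail probability of `alpha` under the chi-square distribution. A short sketch, with `alpha = 0.05` and 1 degree of freedom hard-coded as in this notebook, checks that relationship and ties it back to the definition of a chi-square variate with 1 degree of freedom as a squared standard normal:

```python
from scipy.stats import chi2, norm

alpha = 0.05
df = 1

# Quantile at 1 - alpha, as in the cell above.
critical_value = chi2.ppf(1 - alpha, df)
# The survival function (1 - CDF) at the critical value returns alpha.
upper_tail = chi2.sf(critical_value, df)
# isf inverts the survival function directly and gives the same critical value.
critical_isf = chi2.isf(alpha, df)
# With 1 degree of freedom, chi-square = Z^2, so the critical value is the
# square of the two-sided standard normal critical value (about 1.96).
z_squared = norm.ppf(1 - alpha / 2) ** 2

print('Critical value (ppf):', critical_value)
print('Upper-tail probability:', upper_tail)
print('Critical value (isf):', critical_isf)
print('Squared normal quantile:', z_squared)
```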


In [12]:
# p-value: upper-tail probability of a chi-square value at least as large as the observed statistic
p_value = 1-chi2.cdf(x = chi_square_statistics, df = ddof)
print('p_value:', p_value)
print('Significance level:', alpha)
print('Degree of Freedom:', ddof)

p_value: 0.964915107315732
Significance level: 0.05
Degree of Freedom: 1


In [13]:
Null_hypothesis = 'There is no association between sex and smoking status'
# Critical-value approach: reject H0 when the statistic reaches the critical value.
if chi_square_statistics < critical_value:
    print('Fail to reject H0: no evidence of a relationship between the 2 categorical variables')
else:
    print('Reject H0: there is a relationship between the 2 categorical variables')


# p-value approach: reject H0 when the p-value does not exceed alpha.
if p_value <= alpha:
    print('Reject H0: there is a relationship between the 2 categorical variables')
else:
    print('Fail to reject H0: no evidence of a relationship between the 2 categorical variables')

Fail to reject H0: no evidence of a relationship between the 2 categorical variables
Fail to reject H0: no evidence of a relationship between the 2 categorical variables
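
The steps of this notebook (statistic, degrees of freedom, critical value, p-value, decision) can be collected into one function. The helper below, `chi_square_independence`, is a hypothetical name introduced here for illustration; it is a minimal sketch that assumes the uncorrected Pearson statistic is wanted, so it passes `correction=False` to `chi2_contingency`.

```python
import numpy as np
from scipy import stats

def chi_square_independence(table, alpha=0.05):
    """Hypothetical helper: uncorrected chi-square test of independence."""
    statistic, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
    critical_value = stats.chi2.ppf(1 - alpha, dof)
    return {
        'statistic': float(statistic),
        'p_value': float(p_value),
        'dof': int(dof),
        'critical_value': float(critical_value),
        # Reject H0 (no association) when the p-value does not exceed alpha.
        'reject_h0': bool(p_value <= alpha),
    }

# Observed sex x smoker counts from the crosstab in this notebook.
result = chi_square_independence(np.array([[60, 97], [33, 54]]))
for key, value in result.items():
    print(key, ':', value)
```

For the tips data the function does not reject the null hypothesis, in line with the decision reached in the cells above.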
