### Chi-Square Test

The chi-square test of independence applies when you have two categorical variables (features) from a single population. It determines whether there is a statistically significant association between the two variables.

Null hypothesis H0: 

    No significant association between the two categorical variables
    
Alternate hypothesis H1: 

    Significant association between 2 categorical variables

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

In [2]:
df = sns.load_dataset("tips")
df.head()

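| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |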


In [3]:
df.shape

(244, 7)

Consider the two categorical variables `sex` and `smoker`. A cross-tabulation counts how many rows fall into each combination of their categories.

In [4]:
cat_df = pd.crosstab(df["sex"],df["smoker"])

In [5]:
cat_df

| sex \ smoker | Yes | No |
|---|---|---|
| Male | 60 | 97 |
| Female | 33 | 54 |


In [6]:
cat_df.values

array([[60, 97],
       [33, 54]], dtype=int64)

These counts are the observed values:

In [7]:
observed_values = cat_df.values

In [8]:
observed_values

array([[60, 97],
       [33, 54]], dtype=int64)

Let's apply the chi-square test using SciPy:

In [10]:
stats.chi2_contingency(cat_df)

Chi2ContingencyResult(statistic=0.0, pvalue=1.0, dof=1, expected_freq=array([[59.84016393, 97.15983607],
       [33.15983607, 53.84016393]]))
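The statistic of 0.0 reported above reflects Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables (`correction=True`). The following is a minimal sketch, reusing the observed counts from the crosstab instead of loading the dataset, that compares the corrected result with the uncorrected Pearson statistic. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab (rows: Male, Female; columns: Yes, No)
observed = np.array([[60, 97], [33, 54]])

# Default behaviour: Yates' continuity correction is applied because dof == 1
corrected = stats.chi2_contingency(observed)

# Plain Pearson chi-square statistic, without the continuity correction
uncorrected = stats.chi2_contingency(observed, correction=False)

# Index 0 is the test statistic, index 1 the p-value
print("corrected statistic:  ", corrected[0])
print("uncorrected statistic:", uncorrected[0])
print("uncorrected p-value:  ", uncorrected[1])
```

The uncorrected statistic is the one that a direct application of the chi-square formula to the observed and expected counts produces.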

We need the expected frequencies (`expected_freq`), which are the last element of the result:

In [11]:
expected_values = stats.chi2_contingency(cat_df)[-1]

In [12]:
expected_values

array([[59.84016393, 97.15983607],
       [33.15983607, 53.84016393]])
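To see where these expected values come from, here is a small sketch that rebuilds them from the table margins: under the null hypothesis, each expected count is the row total times the column total divided by the grand total. The observed counts are copied from the crosstab, and the variable names are illustrative.

```python
import numpy as np

# Observed counts copied from the crosstab (rows: Male, Female; columns: Yes, No)
observed = np.array([[60, 97], [33, 54]])

row_totals = observed.sum(axis=1)   # totals per sex
col_totals = observed.sum(axis=0)   # totals per smoker category
grand_total = observed.sum()        # total number of observations

# Expected count for each cell: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```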

In [13]:
cat_df.shape

(2, 2)

In [14]:
rows, cols = cat_df.shape

In [15]:
rows

2

In [16]:
cols

2

##### Degrees of Freedom

In [17]:
ddof = (rows-1)*(cols-1)
ddof

1

In [18]:
alpha = 0.05

##### Apply the chi-square formula

!["chi2 Formula"](chi2Formula.png)
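Assuming the image shows the usual Pearson chi-square statistic, it can be written out as follows. The symbols are not taken from the source: $O_{ij}$ is the observed count and $E_{ij}$ the expected count in row $i$ and column $j$ of an $r \times c$ table.

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$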

In [19]:
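# zip pairs each observed row with its expected row; sum() adds the rows element-wise, giving one partial sum per column
chi_square = sum([(o-e)**2./e for o,e in zip(observed_values, expected_values)])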

In [20]:
chi_square

array([0.00119737, 0.00073745])

In [21]:
chi_square_stats = chi_square[0] + chi_square[1]
chi_square_stats

0.001934818536627623
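The two steps above can also be collapsed into a single vectorized expression. This is a sketch that hard-codes the observed and expected arrays shown earlier (the expected values are rounded to eight decimals, so the result agrees only to about that precision). The name `chi_square_total` is illustrative.

```python
import numpy as np

# Observed and expected counts copied from the outputs shown earlier
observed = np.array([[60, 97], [33, 54]], dtype=float)
expected = np.array([[59.84016393, 97.15983607],
                     [33.15983607, 53.84016393]])

# Sum (O - E)^2 / E over every cell of the table in one step
chi_square_total = ((observed - expected) ** 2 / expected).sum()
print(chi_square_total)
```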

##### Critical value

In [24]:
critical_val = stats.chi2.ppf(q=1-alpha, df=ddof)

In [25]:
critical_val

3.841458820694124

##### Significance level

In [29]:
alpha

0.05

##### p_value

In [26]:
p_value = 1 - stats.chi2.cdf(x=chi_square_stats, df=ddof)

In [27]:
p_value

0.964915107315732
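As a cross-check, SciPy also exposes the right-tail probability directly through the survival function `sf` and its inverse `isf`, which avoids computing `1 - cdf` by hand. The sketch below hard-codes the statistic and degrees of freedom obtained above. The names `p_value_sf` and `critical_val_isf` are illustrative.

```python
from scipy import stats

chi_square_stats = 0.001934818536627623  # statistic computed earlier
ddof = 1                                 # degrees of freedom for a 2x2 table
alpha = 0.05

# Survival function: P(X > chi_square_stats), equivalent to 1 - cdf
p_value_sf = stats.chi2.sf(chi_square_stats, df=ddof)

# Inverse survival function: the value with right-tail probability alpha
critical_val_isf = stats.chi2.isf(alpha, df=ddof)

print(p_value_sf, critical_val_isf)
```

For very large statistics, `sf` keeps more precision than `1 - cdf`, because the subtraction from 1 loses digits when the p-value is tiny.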

In [30]:
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


The p-value is far above the significance level, so the data give no evidence of a significant association between the two categorical variables.

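Alternatively, you can compare `chi_square_stats` with `critical_val` to reach the same decision: the null hypothesis is rejected when the statistic is at least as large as the critical value.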

In [32]:
if chi_square_stats >= critical_val:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


Again, the test finds no significant association between the two categorical variables `sex` and `smoker`.