## Chi-Square Test
- The chi-square test of independence applies when you have two categorical variables measured on a single population. It is used to determine whether there is a statistically significant association between the two variables.
- E.g. find out whether the colour of a flyer affects whether people TAKE it or DO NOT TAKE it.
- o = observed frequency
- e = expected frequency (the count expected if the two variables were independent)

- $\huge \chi^2 = \sum \frac{(o-e)^2}{e}$

- In general, the number of degrees of freedom for $\chi^2$ is
- $\large (\text{number of rows} - 1) \times (\text{number of columns} - 1)$
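
As a rough illustration of the formula and the degrees-of-freedom rule, the sketch below works through the flyer example by hand. The counts are made up purely for illustration (they are not from any dataset), and the names `observed`, `expected`, and `dof` are assumptions of this sketch.

```python
import numpy as np

# Hypothetical counts (invented for illustration only):
# rows = flyer colour, columns = TAKEN / NOT TAKEN
observed = np.array([
    [30, 70],   # red
    [45, 55],   # blue
    [25, 75],   # green
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = row_totals * col_totals / n

# Chi-square statistic: sum of (o - e)^2 / e over every cell
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("Chi-square statistic:", chi_square)
print("Degrees of freedom:", dof)
```

With three colours and two outcomes, the table has (3 - 1) * (2 - 1) = 2 degrees of freedom.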

In [22]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
import seaborn as sns
from IPython.display import Image
from scipy.stats import chi2

In [6]:
dataset = sns.load_dataset('tips')
dataset.head()

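| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |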


In [7]:
dataset_table = pd.crosstab(dataset['sex'], dataset['smoker'])
dataset_table

| sex \ smoker | Yes | No |
|---|---|---|
| Male | 60 | 97 |
| Female | 33 | 54 |


In [12]:
# Observed values
observed_values = dataset_table.values
print("Observed values: \n", observed_values)

Observed values: 
 [[60 97]
 [33 54]]


In [14]:
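# Returns the chi-square test statistic, p-value, degrees of freedom, and expected frequencies.
# Note: for a 2x2 table (1 degree of freedom), chi2_contingency applies
# Yates' continuity correction by default (correction=True).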
val = stats.chi2_contingency(dataset_table)
val

(0.008763290531773594,
 0.925417020494423,
 1,
 array([[59.84016393, 97.15983607],
        [33.15983607, 53.84016393]]))

In [15]:
expected_values = val[3]
expected_values

array([[59.84016393, 97.15983607],
       [33.15983607, 53.84016393]])
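
The expected values above can also be derived directly from the row and column totals of the observed table. The following is a minimal sketch, assuming the same observed counts as the crosstab; the variable names are illustrative.

```python
import numpy as np

# Observed counts from the sex x smoker crosstab
observed = np.array([[60, 97],
                     [33, 54]])

row_totals = observed.sum(axis=1)   # totals per sex
col_totals = observed.sum(axis=0)   # totals per smoker category
n = observed.sum()                  # grand total

# Under independence, expected[i, j] = row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n
print(expected)
```

This should reproduce the array that `chi2_contingency` returns as its fourth element.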

In [21]:
# Degrees of freedom = (rows - 1) * (columns - 1), read from the table's shape
no_of_rows, no_of_columns = dataset_table.shape
degree_of_freedom = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom: ", degree_of_freedom)
alpha = 0.05

Degree of Freedom:  1


$\huge \chi^2 = \sum \frac{(o-e)^2}{e}$

In [29]:
# Pearson chi-square statistic (no continuity correction): sum of (o - e)^2 / e over all cells
chi_square_statistic = ((observed_values - expected_values) ** 2 / expected_values).sum()
print("Chi square statistic: ", chi_square_statistic)

Chi square statistic:  0.001934818536627623


In [30]:
critical_value = chi2.ppf(q=1-alpha, df=degree_of_freedom)
print("Critical value: ", critical_value)

Critical value:  3.841458820694124


In [31]:
# P value
p_value = 1-chi2.cdf(x=chi_square_statistic, df=degree_of_freedom)
print("P value: ", p_value)
print("Significance level: ", alpha)
print("Degree of Freedom: ", degree_of_freedom)

P value:  0.964915107315732
Significance level:  0.05
Degree of Freedom:  1
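
As an alternative to `1 - chi2.cdf(...)` and `chi2.ppf(1 - alpha, ...)`, SciPy also provides the survival function `chi2.sf` and its inverse `chi2.isf`, which compute the upper-tail quantities directly and tend to be more accurate for very small p-values. A short sketch, with the statistic copied from the output above:

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1
statistic = 0.001934818536627623  # Pearson statistic computed above

# Upper-tail probability P(X >= statistic), equivalent to 1 - cdf
p_value = chi2.sf(statistic, df=dof)

# Value whose upper-tail probability equals alpha, equivalent to ppf(1 - alpha)
critical_value = chi2.isf(alpha, df=dof)

# The two decision rules always agree: p <= alpha exactly when statistic >= critical value
reject_by_p = p_value <= alpha
reject_by_critical = statistic >= critical_value

print("P value:", p_value)
print("Critical value:", critical_value)
print("Reject H0:", reject_by_p, reject_by_critical)
```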


In [33]:
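# Decision rule 1: compare the test statistic with the critical value
if chi_square_statistic >= critical_value:
    print("Reject H0: there is evidence of an association between the 2 categorical variables")
else:
    print("Fail to reject H0: no significant evidence of an association between the 2 categorical variables")

# Decision rule 2 (equivalent): compare the p-value with the significance level
if p_value <= alpha:
    print("Reject H0: there is evidence of an association between the 2 categorical variables")
else:
    print("Fail to reject H0: no significant evidence of an association between the 2 categorical variables")

Fail to reject H0: no significant evidence of an association between the 2 categorical variables
Fail to reject H0: no significant evidence of an association between the 2 categorical variables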
