https://analyticsindiamag.com/a-beginners-guide-to-chi-square-test-in-python-from-scratch/

#### Step 1: Importing libraries to create an array and data frame:

In [1]:
import numpy as np
import pandas as pd

#### Step 2: Creating an array and converting that array to the data frame:

In [4]:
np.random.seed(10)

# Sample 1000 bottle types at fixed probabilities
type_bottle = np.random.choice(a=["paper", "cans", "glass", "others", "plastic"],
                               p=[0.05, 0.15, 0.25, 0.05, 0.5],
                               size=1000)

# Sample 1000 months at fixed probabilities
month = np.random.choice(a=["January", "February", "March"],
                         p=[0.4, 0.2, 0.4],
                         size=1000)

bottles = pd.DataFrame({"types": type_bottle,
                        "months": month})
bottles.head()

| | types | months |
|---|---|---|
| 0 | plastic | January |
| 1 | paper | March |
| 2 | plastic | February |
| 3 | plastic | March |
| 4 | others | January |


In [5]:
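bottles_tab = pd.crosstab(bottles.types, bottles.months, margins=True)
# Caution: pd.crosstab sorts row and column labels alphabetically, so names
# assigned by position (as below) only line up with the underlying categories
# if they are listed in that sorted order. The chi-square statistic itself
# does not depend on the labels.
bottles_tab.columns = ["January","February","March","row_totals"]
bottles_tab.index = ["paper","cans","glass","others","plastic","col_totals"]
observed = bottles_tab.iloc[0:5,0:3]   # Get table without totals for later use
bottles_tab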

| | January | February | March | row_totals |
|---|---|---|---|---|
| paper | 25 | 65 | 64 | 154 |
| cans | 50 | 107 | 94 | 251 |
| glass | 8 | 15 | 15 | 38 |
| others | 7 | 21 | 32 | 60 |
| plastic | 96 | 189 | 212 | 497 |
| col_totals | 186 | 397 | 417 | 1000 |
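A side note on labels: `pd.crosstab` returns its row and column labels in sorted order and names the margins `All`. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the data from this walkthrough) to show one possible way of renaming only the margin labels, so that each category label stays attached to its own counts.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "types":  ["plastic", "cans", "plastic", "glass", "cans", "plastic"],
    "months": ["March", "January", "January", "March", "March", "January"],
})

toy_tab = pd.crosstab(toy["types"], toy["months"], margins=True)

# Rename only the margins; category labels keep their sorted order
toy_tab = toy_tab.rename(columns={"All": "row_totals"},
                         index={"All": "col_totals"})
print(toy_tab)

# Table without totals, selected by label rather than by position
toy_observed = toy_tab.drop(index="col_totals", columns="row_totals")
```

Selecting by label instead of by position is a design choice that avoids depending on the order in which the categories happen to appear.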


In [9]:
observed

| | January | February | March |
|---|---|---|---|
| paper | 25 | 65 | 64 |
| cans | 50 | 107 | 94 |
| glass | 8 | 15 | 15 |
| others | 7 | 21 | 32 |
| plastic | 96 | 189 | 212 |


### Implementing the Chi-square test

How do we calculate the expected values?

 - The test needs two things: the observed counts and the expected counts.

To get the expected count for a cell, multiply its row total by its column total and divide by the total number of observations. This is the count we would expect in that cell if the two variables were independent.
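As a hand-checkable illustration of this rule, here is a minimal sketch on a made-up 2×2 table (`toy_counts` and its values are hypothetical). It assumes only NumPy and uses `np.outer` to form every row-total × column-total product at once.

```python
import numpy as np

# Hypothetical 2x2 table of observed counts
toy_counts = np.array([[10, 20],
                       [30, 40]])

row_totals = toy_counts.sum(axis=1)   # totals of each row: 30 and 70
col_totals = toy_counts.sum(axis=0)   # totals of each column: 40 and 60
n = toy_counts.sum()                  # total number of observations: 100

# expected[i, j] = row_totals[i] * col_totals[j] / n
toy_expected = np.outer(row_totals, col_totals) / n
print(toy_expected)
```

For example, the top-left cell is 30 × 40 / 100 = 12. The expected counts always add up to the same total as the observed counts.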

In [7]:
expected =  np.outer(bottles_tab["row_totals"][0:5],
                     bottles_tab.loc["col_totals"][0:3]) / 1000
expected = pd.DataFrame(expected)
 
expected.columns = ["January","February","March"]
expected.index = ["paper","cans","glass","others","plastic"]
 
expected

| | January | February | March |
|---|---|---|---|
| paper | 28.644 | 61.138 | 64.218 |
| cans | 46.686 | 99.647 | 104.667 |
| glass | 7.068 | 15.086 | 15.846 |
| others | 11.16 | 23.82 | 25.02 |
| plastic | 92.442 | 197.309 | 207.249 |


Here, we will write the formula in Python to calculate the chi-square statistic.

In [8]:
# pandas aligns on row and column labels when subtracting, so the labels of
# `expected` must match those of `observed` exactly; mismatched labels yield NaN
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()
print(round(chi_squared_stat, 4))

7.1693


Next, we will calculate the critical value and the p-value, which help us decide whether to reject the null hypothesis.

In [13]:
from scipy.stats import chi2
critical_value = chi2.ppf(q=0.95,  # Find the critical value for 95% confidence
                          df=8)    # df = degrees of freedom
print("Critical value:", critical_value)
p_value = 1 - chi2.cdf(x=chi_squared_stat,  # Find the p-value
                       df=8)
print("P value:", round(p_value, 4))

Critical value: 15.50731305586545
P value: 0.5185


Note: The degrees of freedom for a test of independence equal the product of (number of categories − 1) for each variable, that is, (rows − 1) × (columns − 1). In this case we have a 5×3 table, so df = (5 − 1) × (3 − 1) = 4 × 2 = 8.
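The degrees-of-freedom rule can be sketched as a small helper. The function name `independence_dof` is hypothetical and introduced only for illustration; `chi2.ppf` and `chi2.sf` are the scipy functions for the quantile and the survival function (1 − cdf).

```python
from scipy.stats import chi2

def independence_dof(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)

dof = independence_dof(5, 3)          # the 5x3 table from the text
critical = chi2.ppf(0.95, df=dof)     # cutoff at the 5% significance level

# By construction, the tail probability beyond the critical value is 5%
p_at_critical = chi2.sf(critical, df=dof)
print(dof, critical, p_at_critical)
```

A test statistic above `critical` corresponds to a p-value below 0.05, so comparing against the critical value and comparing the p-value against 0.05 lead to the same decision.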

### Quick method: the whole test in a single line of code

In [19]:
import scipy.stats  # import the stats submodule explicitly
scipy.stats.chi2_contingency(observed=observed)


(7.169321280162059,
 0.518479392948842,
 8,
 array([[ 28.644,  61.138,  64.218],
        [ 46.686,  99.647, 104.667],
        [  7.068,  15.086,  15.846],
        [ 11.16 ,  23.82 ,  25.02 ],
        [ 92.442, 197.309, 207.249]]))

Finally, chi2_contingency returns a test statistic of about 7.17 and a p-value of about 0.518, which is greater than the usual significance level of 0.05. Therefore, we fail to reject the null hypothesis that there is no relationship between the two variables. In other words, the test does not detect a significant relationship between bottle type and month.
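The decision step can be sketched as follows. The counts are copied from the observed table above; the variable names and the significance level `alpha = 0.05` are assumptions made for this illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the observed table above (rows x columns)
counts = np.array([[25,  65,  64],
                   [50, 107,  94],
                   [ 8,  15,  15],
                   [ 7,  21,  32],
                   [96, 189, 212]])

# The result unpacks into statistic, p-value, degrees of freedom, expected counts
stat, p, dof, expected_counts = chi2_contingency(counts)

alpha = 0.05  # assumed significance level
if p < alpha:
    decision = "reject the null hypothesis of independence"
else:
    decision = "fail to reject the null hypothesis of independence"

print(f"chi2 = {stat:.4f}, p = {p:.4f}, df = {dof}: {decision}")
```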

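### Yates’ Correction

Pearson’s chi-square statistic approximates discrete counts with the continuous chi-square distribution, which tends to overstate significance when counts are small, particularly in 2×2 tables. Frank Yates proposed a continuity correction for this, known as Yates’ correction or Yates’ chi-square test: 0.5 is subtracted from each absolute difference between observed and expected counts before squaring. In scipy, chi2_contingency applies this correction only when the degrees of freedom equal 1.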

In [20]:
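# "correction=True" (the default) requests Yates' correction. It only takes
# effect when df = 1; for this 5x3 table df = 8, so the result is identical
# to the earlier chi2_contingency call.
scipy.stats.chi2_contingency(observed, correction=True)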
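To see the correction actually change a result, here is a minimal sketch on a made-up 2×2 table, where df = 1. The table `toy_2x2` and all variable names are hypothetical. The hand computation assumes every |observed − expected| is at least 0.5, which holds for this table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table, so df = 1 and the correction takes effect
toy_2x2 = np.array([[12, 5],
                    [ 7, 9]])

stat_plain, p_plain, dof_2x2, exp_2x2 = chi2_contingency(toy_2x2, correction=False)
stat_yates, p_yates, _, _ = chi2_contingency(toy_2x2, correction=True)

# Yates' statistic by hand: shrink each |observed - expected| by 0.5 before squaring
manual_yates = (((np.abs(toy_2x2 - exp_2x2) - 0.5) ** 2) / exp_2x2).sum()

print("Uncorrected:", stat_plain, p_plain)
print("Yates:      ", stat_yates, p_yates)
print("By hand:    ", manual_yates)
```

The corrected statistic is smaller and its p-value larger, which makes the test more conservative for small tables.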