## Chi-Square Test
### The chi-square test is used mainly for two purposes:
### 1. Testing the independence of variables
### 2. Checking goodness of fit
\\
### 1. Independence of Variables
### Let us understand the concept with the help of an example (from *Statistics for Business and Economics* by Anderson, Sweeney, and Williams):
Alber’s manufactures and distributes three types of beer: light, regular, and dark. In an analysis of the
market segments for the three beers, the firm’s market research group raised the question
of whether preferences for the three beers differ among male and female beer drinkers. If
beer preference is independent of the gender of the beer drinker, one advertising campaign
will be initiated for all of Alber’s beers. However, if beer preference depends on the gender 
of the beer drinker, the firm will tailor its promotions to different target markets.
A test of independence addresses the question of whether the beer preference (light, 
regular, or dark) is independent of the gender of the beer drinker (male, female). The hypotheses for this test of independence are:\
H0:
Beer preference is independent of the gender of the beer drinker\
Ha:
Beer preference is not independent of the gender of the beer drinker
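
For reference, the statistic behind a test of independence can be written out as below. This is the standard textbook form, not something taken from the notebook's code; the symbols $f_{ij}$ (observed count in row $i$, column $j$), $e_{ij}$ (expected count), $r$, $c$ and $n$ (total sample size) are notation introduced here.

$$
e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}
$$

Under H0 the statistic approximately follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom. With two genders and three beer types this gives $(2-1)(3-1) = 2$.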




In [1]:
import pandas as pd
df=pd.read_excel("BeerPreference.xlsx")
df

Unnamed: 0,Beer Drinker,Preference,Gender
0,1,Regular,Male
1,2,Light,Female
2,3,Regular,Male
3,4,Regular,Male
4,5,Regular,Female
...,...,...,...
195,196,Light,Male
196,197,Regular,Male
197,198,Light,Male
198,199,Light,Male


In [2]:
#Prepare the contingency table
table=pd.pivot_table(df[['Gender','Preference']],index='Gender',columns='Preference',aggfunc=len)
table

Gender,Dark,Light,Regular
Female,8,39,21
Male,25,51,56
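
A contingency table of counts can also be built with `pd.crosstab`, which is designed for exactly this kind of cross-tabulation. The sketch below is only an illustration: the `sample` DataFrame is a small hypothetical stand-in for `BeerPreference.xlsx` (its values are invented), and `table_ct` is a name introduced here.

```python
import pandas as pd

# Hypothetical mini-sample standing in for BeerPreference.xlsx (values invented for illustration)
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Preference": ["Regular", "Light", "Regular", "Dark", "Light", "Regular"],
})

# crosstab counts how many rows fall into each (Gender, Preference) cell;
# combinations that never occur are filled with 0
table_ct = pd.crosstab(sample["Gender"], sample["Preference"])
print(table_ct)
```

On the real data, `pd.crosstab(df['Gender'], df['Preference'])` should give the same counts as the pivot table, assuming the column names shown in the first cell.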


In [3]:
from scipy.stats import chi2_contingency
chi2,p,dof,tab=chi2_contingency(table)
print(f"The p value based on the chi square value of {chi2:0.2f} is:{p:0.2f}")

The p value based on the chi square value of 6.45 is:0.04
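
To see where these numbers come from, the same test can be reproduced by hand from the contingency table. This is a minimal sketch, assuming the counts printed above; the names `observed`, `expected`, `stat`, `dof_manual` and `p_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above
# rows: Female, Male; columns: Dark, Light, Regular
observed = np.array([[8, 39, 21],
                     [25, 51, 56]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected count under independence: row total * column total / n
expected = row_totals * col_totals / n

# Chi-square statistic, degrees of freedom and upper-tail p-value
stat = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(stat, dof_manual)

# Compare with SciPy's built-in routine
stat_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(f"manual: chi2={stat:0.2f}, dof={dof_manual}, p={p_manual:0.2f}")
print(f"scipy : chi2={stat_scipy:0.2f}, dof={dof_scipy}, p={p_scipy:0.2f}")
```

The `expected` array is also worth inspecting on its own: the chi-square approximation is usually considered reliable only when every expected count is at least 5.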


### Since the p-value is less than 0.05, we reject the null hypothesis. Thus, beer preference and gender are not independent.

### 2. Checking Goodness of Fit
### A goodness-of-fit test checks whether the observed data are consistent with a hypothesized distribution.
### Let's consider an example that tests whether the given data follow a Poisson distribution.
### The data show the number of arrivals at a parking lot in 100 one-minute time intervals, along with their respective frequencies.
### H0: The data follow a Poisson distribution
### Ha: The data do not follow a Poisson distribution
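
The quantities involved in this test can be summarized as follows. This is the usual textbook formulation rather than anything specific to the notebook; the symbols $f_i$, $e_i$, $k$, $p$ and $n$ are notation introduced here.

$$
e(x) = n \, \frac{\mu^{x} e^{-\mu}}{x!},
\qquad
\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}
$$

Here $e(x)$ is the expected number of intervals with $x$ arrivals out of $n$ intervals, $f_i$ and $e_i$ are the observed and expected frequencies of category $i$, and $k$ is the number of categories after any merging. The statistic is compared against a chi-square distribution with $k - p - 1$ degrees of freedom, where $p$ is the number of parameters estimated from the data ($p = 1$ when the Poisson mean $\mu$ is estimated by the sample mean).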

In [10]:
df1=pd.read_csv("Chi_sq.csv")
df1

Unnamed: 0,Arrivals,Frequency
0,0,0
1,1,1
2,2,4
3,3,10
4,4,14
5,5,20
6,6,12
7,7,12
8,8,9
9,9,8
10,10,6
11,11,3
12,12,1


In [52]:
#The above table shows the observed frequencies. Now, find the expected frequencies
from scipy.stats import poisson
n=df1['Frequency'].sum() #total number of one-minute intervals
mu=(df1['Arrivals']*df1['Frequency']).sum()/n #sample mean, used as the estimate of the Poisson mean
#Expected frequency for each arrival count = n * P(X = arrivals) under Poisson(mu)
expfreq=[round(poisson.pmf(i,mu)*n,2) for i in df1['Arrivals']]
df2=pd.DataFrame({'Arrivals':df1['Arrivals'],'Observed Frequency':df1['Frequency'],'Expected Frequency':expfreq})
df2

Unnamed: 0,Arrivals,Observed Frequency,Expected Frequency
0,0,0,0.25
1,1,1,1.49
2,2,4,4.46
3,3,10,8.92
4,4,14,13.39
5,5,20,16.06
6,6,12,16.06
7,7,12,13.77
8,8,9,10.33
9,9,8,6.88
10,10,6,4.13
11,11,3,2.25
12,12,1,1.13


In [64]:
#The expected frequency should be at least 5 for each category, so we merge the categories at both ends.
print("Entries with Expected Frequency<5")
print(df2[df2['Expected Frequency']<5],"\n")
#Totals for arrivals 0 to 2
x1=df2.iloc[:3]['Observed Frequency'].sum()
y1=df2.iloc[:3]['Expected Frequency'].sum()
#Totals for arrivals 10 to 12
x2=df2.iloc[10:13]['Observed Frequency'].sum()
y2=df2.iloc[10:13]['Expected Frequency'].sum()
df_low=pd.DataFrame({'Arrivals':['<=2'],'Observed Frequency':[x1],'Expected Frequency':[y1]})
df_high=pd.DataFrame({'Arrivals':['>=10'],'Observed Frequency':[x2],'Expected Frequency':[y2]})
#DataFrame.append was removed in pandas 2.0, so the pieces are joined with pd.concat
df3=pd.concat([df_low,df2.iloc[3:10],df_high],ignore_index=True)
print("After merging entries")
print(df3)

Entries with Expected Frequency<5
    Arrivals  Observed Frequency  Expected Frequency
0          0                   0                0.25
1          1                   1                1.49
2          2                   4                4.46
10        10                   6                4.13
11        11                   3                2.25
12        12                   1                1.13 

After merging entries
  Arrivals  Observed Frequency  Expected Frequency
0      <=2                   5                6.20
1        3                  10                8.92
2        4                  14               13.39
3        5                  20               16.06
4        6                  12               16.06
5        7                  12               13.77
6        8                   9               10.33
7        9                   8                6.88
8     >=10                  10                7.51


In [65]:
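#Now we will perform the chi-square test
#Note: recent SciPy versions raise a ValueError here because the observed (100) and expected (99.12) totals differ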
import scipy.stats as st
st.chisquare(df3['Observed Frequency'],df3['Expected Frequency'])

Power_divergenceResult(statistic=3.7904463885065667, pvalue=0.8755178295443689)

### The result shows a p-value of about 0.876, which is greater than 0.05. Thus, we fail to reject the null hypothesis: the data are consistent with a Poisson distribution.
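
The goodness-of-fit test can also be set up so that the expected frequencies add up to the sample size and the degrees of freedom account for the estimated mean. The sketch below is one way to do that and is not the notebook's original method. It makes two assumptions: the last category is treated as open-ended (`>=10` covers every count from 10 upward, not only 10 to 12), and `ddof=1` is passed to `chisquare` because the Poisson mean was estimated from the data. The observed frequencies are copied from the tables above; all variable names are introduced here.

```python
import numpy as np
from scipy import stats

# Observed frequencies for 0..12 arrivals, copied from the tables above
arrivals = np.arange(13)
observed = np.array([0, 1, 4, 10, 14, 20, 12, 12, 9, 8, 6, 3, 1])
n = observed.sum()
mu_hat = (arrivals * observed).sum() / n  # sample mean as the Poisson mean estimate

# Merged categories: <=2, then 3..9 individually, then >=10
obs_binned = np.concatenate(([observed[:3].sum()], observed[3:10], [observed[10:].sum()]))
probs = np.concatenate((
    [stats.poisson.cdf(2, mu_hat)],               # P(X <= 2)
    stats.poisson.pmf(np.arange(3, 10), mu_hat),  # P(X = 3), ..., P(X = 9)
    [stats.poisson.sf(9, mu_hat)],                # P(X >= 10), including the tail beyond 12
))
exp_binned = n * probs  # totals n because the category probabilities sum to 1

# ddof=1 removes one extra degree of freedom for the estimated mean
result = stats.chisquare(obs_binned, exp_binned, ddof=1)
dof_gof = len(obs_binned) - 1 - 1
print(f"chi2={result.statistic:0.2f}, dof={dof_gof}, p={result.pvalue:0.3f}")
```

Under these assumptions the p-value is still well above 0.05, so the conclusion is unchanged: there is no evidence against the Poisson model.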