## Chi-square Test for Goodness of Fit

##### The chi-square goodness-of-fit test measures how far the observed values of an event differ from the expected values. It helps us check whether a variable follows a given distribution, or whether a sample is representative of a population.
The observed distribution is compared with the expected distribution.

Null hypothesis: The variable follows the expected distribution.

Alternative hypothesis: The variable deviates from the expected distribution.

![chisq.png](attachment:chisq.png)

In [1]:
# importing packages
import scipy.stats as stats
import numpy as np

We want to evaluate whether a student's observed study hours over one week are statistically consistent with the expected hours.

In [2]:
observed_data = [8, 6, 10, 7, 8, 11, 9]
expected_data = [9, 8, 11, 8, 10, 7, 6]

In [3]:
# Chi-Square Goodness of Fit Test
chi_square_test_statistic, p_value = stats.chisquare(observed_data, expected_data)

In [4]:
print('chi_square_test_statistic is : ' + str(chi_square_test_statistic))

print('p_value : ' + str(p_value))

chi_square_test_statistic is : 5.0127344877344875
p_value : 0.542180861413329


The p-value (about 0.54) is greater than 0.05, so we fail to reject the null hypothesis: the difference between the observed and expected values is not statistically significant.
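For reference, the p-value reported by `stats.chisquare` is the upper-tail probability of the chi-square distribution at the test statistic. The sketch below reproduces it with `stats.chi2.sf`, assuming the same data and `n - 1 = 6` degrees of freedom. The names `p_from_sf`, `alpha` and `reject_h0` are introduced here for illustration only.

```python
import scipy.stats as stats

observed_data = [8, 6, 10, 7, 8, 11, 9]
expected_data = [9, 8, 11, 8, 10, 7, 6]

statistic, p_value = stats.chisquare(observed_data, expected_data)

# Upper-tail (survival function) probability of the chi-square distribution
# with n - 1 degrees of freedom, evaluated at the test statistic
df = len(observed_data) - 1
p_from_sf = stats.chi2.sf(statistic, df)

alpha = 0.05  # assumed significance level
reject_h0 = p_from_sf <= alpha

print('p-value from chi2.sf : ' + str(p_from_sf))
print('reject H0 : ' + str(reject_h0))
```

Both routes should agree up to floating-point rounding, and with `alpha = 0.05` the null hypothesis is not rejected.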

An alternative approach is to compare the test statistic with the chi-square critical value.

In [5]:
# find Chi-Square critical value
print(stats.chi2.ppf(0.95, df=6))  # df = degrees of freedom = n - 1 = 6; 0.95 = 1 - alpha for alpha = 0.05

12.591587243743977


If the chi-square statistic is greater than the critical value, the null hypothesis is rejected; if it is less than or equal to the critical value, we fail to reject the null hypothesis. In the example above, the chi-square statistic is 5.0127344877344875 and the critical value is 12.591587243743977. Since the statistic is below the critical value, we fail to reject the null hypothesis.

In [6]:
# Without using the package (for conceptual understanding)
chi_square_test_statistic1 = 0
for i in range(len(observed_data)):
    # add (observed - expected)^2 / expected for each category
    chi_square_test_statistic1 += np.square(observed_data[i] - expected_data[i]) / expected_data[i]

print('chi-square value determined by formula : ' + str(chi_square_test_statistic1))

# find Chi-Square critical value
print('critical value: ' + str(stats.chi2.ppf(0.95, df=6)))

chi-square value determined by formula : 5.0127344877344875
critical value: 12.591587243743977
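
The loop above can also be written with NumPy arrays and followed directly by the critical-value decision. This is a sketch rather than part of the original notebook; the names `observed`, `expected`, `alpha`, `critical_value` and `decision` are illustrative.

```python
import numpy as np
import scipy.stats as stats

observed = np.array([8, 6, 10, 7, 8, 11, 9], dtype=float)
expected = np.array([9, 8, 11, 8, 10, 7, 6], dtype=float)

# Sum of (observed - expected)^2 / expected over all categories
chi_square_value = np.sum((observed - expected) ** 2 / expected)

alpha = 0.05                      # assumed significance level
df = len(observed) - 1            # degrees of freedom = n - 1
critical_value = stats.chi2.ppf(1 - alpha, df=df)

# Reject H0 only when the statistic exceeds the critical value
if chi_square_value > critical_value:
    decision = 'reject H0'
else:
    decision = 'fail to reject H0'

print('chi-square value : ' + str(chi_square_value))
print('critical value   : ' + str(critical_value))
print('decision         : ' + decision)
```

Keeping `alpha` as a variable makes it easy to repeat the comparison at a different significance level.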


## Chi-square Test for Independence

Used to determine dependence or independence between two categorical variables.

The Contingency Table:
A contingency table (also called a crosstab) is used in statistics to summarise the relationship between two or more categorical variables.
Here, we take a table that shows the number of men and women buying different types of pets.

In [7]:
#         dog  cat  bird  total
# men     207  282  241    730
# women   234  242  232    708
# total   441  524  473   1438

Null hypothesis (H0): There is no association between the two variables (they are independent).

Alternative hypothesis (H1): There is an association between the two variables.

In [8]:
from scipy.stats import chi2_contingency

In [9]:
# Defining the table
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

In [10]:
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Gender and Pet-choice are statistically dependent (reject H0)')
else:
    print('No significant evidence that Gender and Pet-choice are dependent (fail to reject H0)')

p value is 0.1031971404730939
No significant evidence that Gender and Pet-choice are dependent (fail to reject H0)


The test can also be carried out by hand: compute the expected value for each cell, build the table of expected values, and then sum the chi-square contributions of all cells. The package performs these steps for us.
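
For conceptual understanding, the manual route can be sketched as follows and checked against `chi2_contingency`. The names `expected_manual`, `chi_square_manual`, `dof_manual` and `p_manual` are illustrative. The sketch assumes no continuity correction: `chi2_contingency` applies Yates' correction only when the degrees of freedom equal 1, which is not the case for this 2 x 3 table.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

data = np.array([[207, 282, 241], [234, 242, 232]])

row_totals = data.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = data.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = data.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square_manual = ((data - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom = (rows - 1) * (columns - 1)
dof_manual = (data.shape[0] - 1) * (data.shape[1] - 1)
p_manual = chi2.sf(chi_square_manual, dof_manual)

# Compare with the packaged result
stat, p, dof, expected = chi2_contingency(data)
print('manual  : ' + str(chi_square_manual) + ', p = ' + str(p_manual))
print('package : ' + str(stat) + ', p = ' + str(p))
```

The two results should match up to floating-point rounding.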