# Chi-Square test

The Chi-Square test is a statistical method for analysing categorical data. Its two most common forms are the Chi-Square goodness-of-fit test, which compares the observed frequencies of one variable with an expected distribution, and the Chi-Square test of independence, which examines the association between two categorical variables.

This test makes four assumptions:

1: Both variables are categorical.

2: All observations are independent.
 
3: Cells in the contingency table are mutually exclusive.

4: The expected count should be 5 or greater in at least 80% of cells, and no cell should have an expected count below 1.
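The expected-count assumption can be checked before running the test. The snippet below is a minimal sketch: the 2x3 table is hypothetical (illustrative values only), and the extra check that no expected count falls below 1 is a commonly used companion rule that is assumed here. It uses `scipy.stats.contingency.expected_freq` to get the expected counts.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x3 table of observed counts (illustrative values only)
observed = np.array([[12, 5, 7],
                     [9, 3, 14]])

# Expected counts under independence
expected = expected_freq(observed)

# Share of cells whose expected count is at least 5
share_at_least_5 = (expected >= 5).mean()

# Assumed rule: at least 80% of cells >= 5 and no cell below 1
assumption_ok = share_at_least_5 >= 0.80 and expected.min() >= 1

print(expected)
print("Share of cells with expected count >= 5:", share_at_least_5)
print("Assumption satisfied:", assumption_ok)
```

In this hypothetical table only four of the six cells reach an expected count of 5, so the check fails and merging sparse categories or using an exact test might be considered.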

# Chi-Square test of independence

The Chi-Square test of independence is a statistical test used to determine whether there is a significant association between two categorical variables. Here's a guide to understanding and performing the Chi-Square test of independence:

Null Hypothesis (H0): Assumes that there is no association between the two categorical variables.

Alternative Hypothesis (H1): Assumes that there is an association between the two categorical variables.

Calculate Expected Frequencies: 

$$E_{ij} = \frac{\text{RowTotal}_i \times \text{ColumnTotal}_j}{\text{GrandTotal}}$$
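As a sketch of this formula, the expected frequencies can be computed by hand with NumPy. The counts are the gender by genre table analysed in this notebook; the variable names are illustrative.

```python
import numpy as np

# Observed counts: rows = Male, Female; columns = Action, Comedy, Drama
observed = np.array([[30, 10, 20],
                     [20, 30, 40]])

row_totals = observed.sum(axis=1)   # one total per row
col_totals = observed.sum(axis=0)   # one total per column
grand_total = observed.sum()

# E_ij = (row total i * column total j) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```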

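Compute the Chi-Square Statistic: Use the formula below, where $O_{ij}$ is the observed count and $E_{ij}$ is the expected count in row $i$, column $j$:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$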

Determine Degrees of Freedom (df): Calculate the degrees of freedom using the formula:

$$df = (\text{NumberOfRows} - 1) \times (\text{NumberOfColumns} - 1)$$
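The statistic, degrees of freedom and p-value can also be worked out step by step. This is a minimal sketch using the gender by genre counts from this notebook; it applies no continuity correction, which matches `chi2_contingency` for tables larger than 2x2.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = Male, Female; columns = Action, Comedy, Drama
observed = np.array([[30, 10, 20],
                     [20, 30, 40]])

# Expected counts under independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)

# p-value: upper-tail probability of the chi-square distribution
p_value = chi2.sf(chi2_stat, dof)

print("Chi-Square Statistic:", chi2_stat)
print("Degrees of Freedom:", dof)
print("p-value:", p_value)
```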

Conclusion:

If the χ2 statistic is greater than the critical value, or equivalently if the p-value is less than the significance level (commonly 0.05), reject the null hypothesis.

If the χ2 statistic is less than or equal to the critical value, or equivalently if the p-value is greater than or equal to the significance level, do not reject the null hypothesis.
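The critical-value route can be sketched with `scipy.stats.chi2.ppf`. The statistic and degrees of freedom below are the values obtained for the gender by genre table in this notebook; `alpha = 0.05` is the conventional significance level.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 2
chi2_stat = 13.194444444444445  # statistic for the gender by genre table

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value = chi2.ppf(1 - alpha, dof)

reject = chi2_stat > critical_value
print("Critical value:", round(critical_value, 4))
print("Reject H0:", reject)
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision.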

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

In [10]:
data = {'Action': [30, 20], 'Comedy': [10, 30], 'Drama': [20, 40]}
contingency_table = pd.DataFrame(data, index=['Male', 'Female'])

In [11]:
contingency_table

        Action  Comedy  Drama
Male        30      10     20
Female      20      30     40


In [12]:
# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)


In [13]:
print("\nChi-Square Test Statistic:", chi2)
print("p-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:")
print(expected)


Chi-Square Test Statistic: 13.194444444444445
p-value: 0.001364152090848616
Degrees of Freedom: 2
Expected Frequencies Table:
[[20. 16. 24.]
 [30. 24. 36.]]


In [14]:
# Interpret the p-value
alpha = 0.05
if p < alpha:
    print("\nSince the p-value is less than 0.05, we reject the null hypothesis.")
    print("There is a significant association between Gender and Preference.")
else:
    print("\nSince the p-value is not less than 0.05, we fail to reject the null hypothesis.")
    print("There is insufficient evidence of an association between Gender and Preference.")



Since the p-value is less than 0.05, we reject the null hypothesis.
There is a significant association between Gender and Preference.


# Chi-Square Test of Goodness of Fit


The Chi-Square test of goodness of fit is a statistical test used to determine whether a sample data set fits a population with a specific distribution. It's commonly used to see if an observed frequency distribution matches an expected frequency distribution.

Null Hypothesis (H0): The observed frequencies fit the expected distribution.

Alternative Hypothesis (H1): The observed frequencies do not fit the expected distribution.

Compute the Chi-Square Statistic: Use the formula:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Determine the Degrees of Freedom:

$$df = \text{number of categories} - 1$$

Conclusion:

If χ2 calculated > χ2 critical, or if the p-value < α, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.
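The goodness-of-fit steps can be sketched by hand before turning to `scipy.stats.chisquare`. The counts are the five weekday observations used in this notebook, and a uniform expected distribution is assumed, as in the examples.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts for five categories
observed = np.array([50, 60, 40, 47, 53])

# Uniform expected distribution: total count split equally across categories
expected = np.full(observed.shape, observed.sum() / observed.size)

# Chi-square statistic: sum of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of categories - 1
dof = observed.size - 1

alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)
p_value = chi2.sf(chi2_stat, dof)

print("Chi-Square Statistic:", round(chi2_stat, 4))
print("Critical value:", round(critical_value, 4))
print("p-value:", round(p_value, 4))
print("Reject H0:", chi2_stat > critical_value)
```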

# Method I

In [22]:
import pandas as pd
import numpy as np

In [50]:
expected_value = np.array([50, 50, 50, 50, 50])
observed = np.array([50, 60, 40, 47, 53])

In [51]:
from scipy.stats import chisquare

In [52]:
chi2_stat, p_val = chisquare(f_obs=observed, f_exp=expected_value)

In [53]:
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"P-Value: {p_val}")

Chi-Square Statistic: 4.359999999999999
P-Value: 0.3594720674366307


In [54]:
if p_val < 0.05:
    print("Reject the null hypothesis: The observed data does not fit the expected distribution.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence that the observed data departs from the expected distribution.")

Fail to reject the null hypothesis: There is no significant evidence that the observed data departs from the expected distribution.


# Method II

In [60]:
data = {
    'Category': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'Observed': [50,60,40,47,53]
}

In [63]:
df = pd.DataFrame(data)
df

    Category  Observed
0     Monday        50
1    Tuesday        60
2  Wednesday        40
3   Thursday        47
4     Friday        53


Calculate total observed and expected frequencies

In [64]:
total_observed = df['Observed'].sum()
total_observed

250

In [68]:
num_categories = df.shape[0]
num_categories

5

In [69]:
expected_frequency = total_observed / num_categories
expected_frequency

50.0

In [70]:
# Add expected frequencies to the DataFrame
df['Expected'] = expected_frequency

In [71]:
df

    Category  Observed  Expected
0     Monday        50      50.0
1    Tuesday        60      50.0
2  Wednesday        40      50.0
3   Thursday        47      50.0
4     Friday        53      50.0


In [72]:
from scipy.stats import chisquare

In [73]:
chi2_stat, p_val = chisquare(f_obs=df['Observed'], f_exp=df['Expected'])

In [77]:
print(f"Chi-Square Statistic: {round(chi2_stat, 4)}")
print(f"P-Value: {round(p_val, 4)}")

Chi-Square Statistic: 4.36
P-Value: 0.3595


In [78]:
if p_val < 0.05:
    print("Reject the null hypothesis: The observed data does not fit the expected distribution.")
else:
    print("Fail to reject the null hypothesis: There is no significant evidence that the observed data departs from the expected distribution.")

Fail to reject the null hypothesis: There is no significant evidence that the observed data departs from the expected distribution.
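Both methods build the expected counts explicitly. When the expected distribution is uniform, as it is here, `f_exp` can be left out because `scipy.stats.chisquare` then assumes all categories are equally likely. A minimal sketch with the same weekday counts:

```python
import numpy as np
from scipy.stats import chisquare

# Observed counts for Monday to Friday
observed = np.array([50, 60, 40, 47, 53])

# With f_exp omitted, chisquare uses equal expected frequencies
chi2_stat, p_val = chisquare(f_obs=observed)

print(f"Chi-Square Statistic: {round(chi2_stat, 4)}")
print(f"P-Value: {round(p_val, 4)}")
```

For a non-uniform expected distribution, `f_exp` has to be supplied, and its total should match the total of the observed counts.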
