<h1 style="color: red;">CHI-SQUARE TEST</h1>

The Chi-Square test is a statistical method used to determine whether there's a significant association between two categorical variables or to test the goodness of fit of observed data to a theoretical distribution. There are two common types of Chi-Square tests:

1. **Chi-Square Test for Independence**: Used to determine if there is an association between two categorical variables in a contingency table. For example, you might want to know if gender is associated with political preference.

2. **Chi-Square Goodness of Fit Test**: This test evaluates how well an observed distribution of data matches a theoretical distribution. For example, you could test if the roll of a die is fair by comparing observed frequencies to expected frequencies.

### Formula

For both tests, the basic formula for the Chi-Square statistic is:

\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]

Where:
- \(O_i\) = observed frequency
- \(E_i\) = expected frequency
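
As a quick illustration of the formula, here is a minimal sketch that computes the statistic by hand for the fair-die example mentioned earlier. The roll counts are made up for illustration and do not come from any real experiment.

```python
import numpy as np

# Hypothetical counts for faces 1..6 from 60 rolls (illustrative data only)
observed = np.array([8, 9, 12, 11, 6, 14])

# Under the fair-die hypothesis each face is expected 60 / 6 = 10 times
expected = np.full(6, observed.sum() / 6)

# Chi-square statistic: sum over categories of (O_i - E_i)^2 / E_i
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)  # about 4.2
```

With six categories the goodness-of-fit test has 5 degrees of freedom, so this value would be compared against the chi-square distribution with df = 5.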

### Steps for the Chi-Square Test for Independence:
1. **Create a Contingency Table**: Lay out the observed data for the two variables.
2. **Calculate Expected Frequencies**: These are based on the assumption that the variables are independent.
3. **Calculate Chi-Square Statistic**: Using the formula above.
4. **Determine Degrees of Freedom**: For a contingency table, degrees of freedom are calculated as:
   
   \[
   df = (rows - 1) \times (columns - 1)
   \]
5. **Compare with Critical Value**: Look up the critical value in a Chi-Square distribution table (based on your degrees of freedom and significance level, usually 0.05) or use the p-value.

If the calculated Chi-Square statistic exceeds the critical value or if the p-value is smaller than your significance level, you reject the null hypothesis, indicating that there is a statistically significant relationship between the variables.
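
The five steps above can be sketched on a small table. The 2x3 counts below are hypothetical and chosen only to keep the arithmetic easy to follow; the variable names are likewise illustrative.

```python
import numpy as np
from scipy import stats

# Step 1: hypothetical 2x3 contingency table (rows: groups, columns: preferences)
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

# Step 2: expected counts under independence = row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Step 3: chi-square statistic summed over every cell
chi_square = ((observed - expected) ** 2 / expected).sum()

# Step 4: degrees of freedom = (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Step 5: compare with the critical value at the 0.05 level, or use the p-value
critical = stats.chi2.ppf(0.95, df=dof)
p_value = stats.chi2.sf(chi_square, df=dof)
reject = chi_square > critical

print(chi_square, dof, critical, p_value, reject)
```

For these made-up counts the statistic is roughly 16.67 with 2 degrees of freedom, which exceeds the critical value of about 5.99, so the null hypothesis of independence would be rejected.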

### Assumptions:
- The data should be in the form of counts or frequencies, not percentages or ratios.
- Each observation is independent of the others.
- Expected frequency in each category should be at least 5 for the test to be valid.
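
One way to check the expected-frequency assumption before running the test is to compute the expected counts and flag any cell below 5. This is a minimal sketch on a hypothetical table; it uses `expected_freq` from `scipy.stats.contingency`.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 3x2 table with a sparse second column (illustrative data only)
observed = np.array([[12, 3],
                     [9, 2],
                     [15, 4]])

# Expected counts under independence
expected = expected_freq(observed)

# Flag cells that violate the "at least 5" rule of thumb
small = expected < 5
print(expected)
print("cells with expected count below 5:", small.sum())
```

Here all three cells in the second column fall below 5, so the chi-square approximation may be unreliable for this table. Merging sparse categories or using an exact test are common alternatives.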

In [2]:
import pandas as pd
import numpy as np 
import scipy.stats as stats

In [4]:
kenya=(['kikuyu']*100000 + ['luo']*50000 +['kalengin']*60000 +['kambas']*15000 +['other']*35000)
Nairobi=(['kikuyu']*600 + ['luo']*250 +['kalengin']*300 +['kambas']*75 +['other']*150)

kenyan_table=pd.crosstab(index=pd.Series(kenya),columns='count')
Nairobi_table=pd.crosstab(index=pd.Series(Nairobi),columns='count')

print('kenya')
print(kenyan_table)
print(' ')
print('Nairobi')
print(Nairobi_table)

kenya
col_0      count
row_0           
kalengin   60000
kambas     15000
kikuyu    100000
luo        50000
other      35000
 
Nairobi
col_0     count
row_0          
kalengin    300
kambas       75
kikuyu      600
luo         250
other       150


In [7]:
observed=Nairobi_table
kenya_ratios=kenyan_table/len(kenya)

print(kenya_ratios)
expected=kenya_ratios*len(Nairobi)
print(expected)
chi_squared_stats=(((observed-expected)**2)/expected).sum()
print(chi_squared_stats)

col_0        count
row_0             
kalengin  0.230769
kambas    0.057692
kikuyu    0.384615
luo       0.192308
other     0.134615
col_0          count
row_0               
kalengin  317.307692
kambas     79.326923
kikuyu    528.846154
luo       264.423077
other     185.096154
col_0
count    18.194805
dtype: float64


In [10]:
crit=stats.chi2.ppf(q=0.95,df=4)
print('critical value')
print(crit)
p_value=1-stats.chi2.cdf(x=chi_squared_stats,df=4)
print('p_value')
print(p_value)

critical value
9.487729036781154
p_value
[0.00113047]


<h3 style="color: red;">Conclusion</h3>

Since our chi-squared statistic (18.19) exceeds the critical value (9.49), and the p-value (0.0011) is below 0.05, we reject the null hypothesis that the two distributions are the same.
We can also carry out the chi-squared goodness-of-fit test in a single call with `stats.chisquare()`:

In [12]:
stats.chisquare(f_obs=observed, f_exp=expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))

<h2 style="color: red;">CHI-SQUARED TEST OF INDEPENDENCE</h2>

The Chi-Squared Test of Independence is used when you have two categorical variables and want to see if there's a relationship between them. For example, you might want to know if gender is related to voting preference. The test works by setting up a contingency table that shows the observed counts for each combination of the categories.

The null hypothesis (H₀) assumes that the two variables are independent—meaning one variable doesn't affect the other. The alternative hypothesis (H₁) suggests that there is a relationship between them.

The result is a chi-squared statistic, which you compare to a critical value from the chi-squared distribution table (based on the degrees of freedom and significance level). If the calculated value is greater than the critical value, you reject the null hypothesis, concluding that there is a significant association between the variables. Otherwise, you fail to reject the null hypothesis: the data do not provide enough evidence of an association, which is not the same as proving that the variables are independent.
                     

In [17]:
np.random.seed(10)
voter_race=np.random.choice (a=['asian','blacks','hispanic','other','white'],
            p=[0.05,0.15,0.25,0.05,0.5],
            size=1000)

voter_party=np.random.choice(a=['democrats','independents','republicans'],
            p=[0.4,0.2,0.4],
            size=1000)
voters=pd.DataFrame({'race':voter_race,
                     'party':voter_party})

voter_tab=pd.crosstab(voters.race,voters.party,margins=True)
voter_tab.columns=['democrats','independents','republicans','row_total']
voter_tab.index=['asian','blacks','hispanic','other','white','col_total']
observed=voter_tab.iloc[0:5,0:3]
voter_tab


           democrats  independents  republicans  row_total
asian             21             7           32         60
blacks            65            25           64        154
hispanic         107            50           94        251
other             15             8           15         38
white            189            96          212        497
col_total        397           186          417       1000


In [18]:
expected=np.outer(voter_tab['row_total'][0:5],
                  voter_tab.loc['col_total'][0:3])/1000
expected=pd.DataFrame(expected)
expected.columns=['democrats','independents','republicans']
expected.index=['asian','blacks','hispanic','other','white']
expected

          democrats  independents  republicans
asian        23.820        11.160       25.020
blacks       61.138        28.644       64.218
hispanic     99.647        46.686      104.667
other        15.086         7.068       15.846
white       197.309        92.442      207.249


In [20]:
chi_square_stat=(((observed-expected)**2)/expected).sum().sum()
chi_square_stat

np.float64(7.169321280162059)

In [23]:
stats.chi2_contingency(observed=observed)

Chi2ContingencyResult(statistic=np.float64(7.169321280162059), pvalue=np.float64(0.518479392948842), dof=8, expected_freq=array([[ 23.82 ,  11.16 ,  25.02 ],
       [ 61.138,  28.644,  64.218],
       [ 99.647,  46.686, 104.667],
       [ 15.086,   7.068,  15.846],
       [197.309,  92.442, 207.249]]))

In [21]:
crit=stats.chi2.ppf(q=0.95,df=8)
print('critical value')
print(crit)
# Use chi_square_stat from this test, not chi_squared_stats from the goodness-of-fit example
p_value=stats.chi2.sf(x=chi_square_stat,df=8)
print('p_value')
print(p_value)

critical value
15.507313055865453
p_value
0.518479392948842


 <h3 style="color: red;">Conclusion</h3>

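We fail to reject the null hypothesis that race and party preference are independent: the chi-squared statistic (7.17) is below the critical value (15.51), and the p-value reported by `chi2_contingency` (about 0.52) is well above 0.05.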
