## Chi-Squared Test

The chi-squared test is a common statistical test for categorical variables.

## Chi-Squared Goodness-Of-Fit Test

In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is the analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution.

For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members of your school match those of the entire U.S. population, or whether the browser preferences of your peers match those of internet users as a whole.

When working with categorical data, the values themselves aren't of much use for statistical testing because categories like 'male', 'female' and 'other' have no mathematical meaning. 

Statistical tests dealing with categorical variables focus on category counts or proportions instead of the actual values of the variables, as these provide a way to analyze the distribution of categories.
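
To make the idea of counts and proportions concrete, here is a minimal sketch using pandas. The `browsers` series is hypothetical data made up purely for illustration; it does not come from the notebook.

```python
import pandas as pd

# Hypothetical categorical sample (illustrative only, not from the notebook)
browsers = pd.Series(["chrome", "firefox", "chrome", "safari", "chrome", "firefox"])

# Number of observations in each category
counts = browsers.value_counts()

# Share of observations in each category (sums to 1)
proportions = browsers.value_counts(normalize=True)

print(counts)
print(proportions)
```

The category labels themselves never enter a calculation; only the tallies derived from them do.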

The Kaggle notebook I am working through generates some fake demographic data for the U.S. and Minnesota and walks through the chi-squared goodness-of-fit test to check whether the two distributions differ.

In [18]:
import numpy as np
import pandas as pd 
import scipy.stats as stats

In [19]:
national = pd.DataFrame(["white"]*100000 + ["hispanic"] *60000 + \
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")
minnesota_table = pd.crosstab(index=minnesota[0], columns="count")

print("National")
print(national_table)
print("="*17)
print( "Minnesota") 
print(minnesota_table)

National
col_0      count
0               
asian      15000
black      50000
hispanic   60000
other      35000
white     100000
=================
Minnesota
col_0     count
0              
asian        75
black       250
hispanic    300
other       150
white       600


The chi-squared goodness-of-fit test measures how well observed data match expected data.

Chi-squared tests are based on the so-called Chi-Squared statistic. You calculate the chi-squared statistic with the following formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$


In the formula, observed is the actual observed count for each category and expected is the expected count for that category based on the population distribution. Let's calculate the chi-squared statistic for our data to illustrate:

In [20]:
observed = minnesota_table

national_ratios = national_table/len(national) # get population ratios

expected = national_ratios * len(minnesota) # get expected counts

chi_squared_stat = (((observed - expected)**2)/expected).sum()

print(chi_squared_stat)

col_0
count    18.194805
dtype: float64
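
To see where the 18.19 comes from, the sketch below recomputes the statistic with plain NumPy arrays and prints each category's contribution. The counts are copied from the tables above; the variable names are my own and not part of the notebook.

```python
import numpy as np

categories = ["asian", "black", "hispanic", "other", "white"]

# Counts copied from the National and Minnesota tables above (same category order)
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
minnesota_counts = np.array([75, 250, 300, 150, 600])

# Expected counts: national proportions scaled to the Minnesota sample size
expected_counts = national_counts / national_counts.sum() * minnesota_counts.sum()

# Per-category terms of the chi-squared sum
contributions = (minnesota_counts - expected_counts) ** 2 / expected_counts

for name, obs, exp, term in zip(categories, minnesota_counts, expected_counts, contributions):
    print(f"{name:<9} observed={int(obs):4d} expected={float(exp):8.3f} contribution={float(term):6.3f}")

print("chi-squared statistic:", contributions.sum())
```

Most of the statistic comes from the "white" and "other" categories, where the observed counts are furthest from the expected ones.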


N.B. The chi-squared test assumes that none of the expected counts is less than 5.
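
A quick way to check this assumption before running the test is sketched below. It rebuilds the expected counts from the numbers in the tables above; the variable names and the threshold constant are illustrative choices, not part of the notebook.

```python
import numpy as np

# National counts from the table above: asian, black, hispanic, other, white
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
sample_size = 1375  # total number of rows in the Minnesota sample

expected_counts = national_counts / national_counts.sum() * sample_size

MIN_EXPECTED = 5  # rule-of-thumb threshold mentioned in the text
assumption_holds = bool((expected_counts >= MIN_EXPECTED).all())

print("Smallest expected count:", expected_counts.min())
print("All expected counts >= 5:", assumption_holds)
```

Here the smallest expected count is roughly 79, so the assumption is comfortably satisfied.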

Similar to the t-test, where we compared the t statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-squared test we compare the chi-squared statistic to a critical value based on the chi-squared distribution.

In [21]:
crit = stats.chi2.ppf(q = 0.95, df = 4) # critical value at the 95% confidence level; df = number of categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x = chi_squared_stat, df = 4) # finds the p-value

print("P value")
print(p_value)

Critical value
9.487729036781154
P value
[0.00113047]


Since our chi-squared statistic exceeds the critical value, we reject the null hypothesis that the two distributions are the same. Equivalently, the p-value (about 0.001) is below 0.05. There is sufficient evidence at the 5% significance level (95% confidence level) that the distribution of our sample differs from the distribution of the population.
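
The decision rule can be written out explicitly, as in the sketch below. It hard-codes the statistic printed earlier and uses `stats.chi2.sf`, the survival function, which is equivalent to `1 - cdf`. The names `alpha` and `reject_null` are illustrative.

```python
import scipy.stats as stats

chi_squared_stat = 18.194805  # statistic computed earlier for the Minnesota data
alpha = 0.05                  # significance level (95% confidence level)
df = 5 - 1                    # number of categories - 1

critical_value = stats.chi2.ppf(1 - alpha, df)
p_value = stats.chi2.sf(chi_squared_stat, df)  # same as 1 - cdf

# The two comparisons below always lead to the same decision
reject_null = chi_squared_stat > critical_value
print("Reject null (critical value):", reject_null)
print("Reject null (p-value):", p_value < alpha)
```

Comparing the statistic to the critical value and comparing the p-value to `alpha` are two views of the same test, so they cannot disagree.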

We can also carry out a chi-squared goodness-of-fit test in a single call using the SciPy function stats.chisquare():

In [22]:
stats.chisquare(f_obs = observed, f_exp = expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
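
The result above contains one-element arrays because `observed` and `expected` are single-column DataFrames. As a small variation, and not the notebook's own code, passing one-dimensional arrays should give plain scalar results:

```python
import numpy as np
import scipy.stats as stats

# Counts copied from the tables above: asian, black, hispanic, other, white
observed_counts = np.array([75, 250, 300, 150, 600])
national_counts = np.array([15000, 50000, 60000, 35000, 100000])

# Expected counts must sum to the same total as the observed counts
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

result = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

print("statistic:", result.statistic)
print("p-value:  ", result.pvalue)
```

By default `stats.chisquare` uses the number of categories minus one as the degrees of freedom, which matches the `df = 4` used in the manual calculation.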

## Chi-Squared Test of Independence 

Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another.

For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. 

On the other hand, your weight is likely related to your body fat percentage, so these variables are not independent.

The chi-squared test of independence tests whether two categorical variables are independent. It is commonly used to determine whether variables like education and political views vary with demographic factors like gender, race and religion.

In [23]:
np.random.seed(10)

voter_race = np.random.choice(a =["asian", "black", "hispanic", "other", "white"],
                              p = [0.05, 0.15, 0.25, 0.05, 0.5],
                              size = 1000)

voter_party = np.random.choice(a = ["democrat", "independent", "republican"],
                               p = [0.4,0.2,0.4],
                               size = 1000)

voters = pd.DataFrame({"race":voter_race,
                       "party": voter_party})

voter_tab = pd.crosstab(voters.race, voters.party, margins = True)

voter_tab.columns = ["democrat", "independent", "republican", "row_totals"]

voter_tab.index = ["asian", "black", "hispanic", "other", "white", "col_totals"]

observed = voter_tab.iloc[0:5, 0:3]
voter_tab



            democrat  independent  republican  row_totals
asian             21            7          32          60
black             65           25          64         154
hispanic         107           50          94         251
other             15            8          15          38
white            189           96         212         497
col_totals       397          186         417        1000


Note that we did not use the race data to inform the generation of the party data, so the two variables are independent by construction.

For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The main difference is that we have to calculate the expected count for each cell of a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells at once by taking the outer product of the row totals and column totals with np.outer and dividing by the number of observations.
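
As a worked instance of this rule, take the asian/democrat cell of the table above. The symbols are introduced here only for illustration: $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, and $N$ is the total number of observations.

$$E_{ij} = \frac{R_i \times C_j}{N}, \qquad E_{\text{asian},\,\text{democrat}} = \frac{60 \times 397}{1000} = 23.82$$

The rule follows from the definition of independence: if the two variables are independent, the probability of landing in a cell is the product of the row and column proportions, $(R_i/N)(C_j/N)$, and multiplying that probability by $N$ gives the expected count.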

In [24]:
expected =  np.outer(voter_tab["row_totals"][0:5],
                     voter_tab.loc["col_totals"][0:3]) / 1000

expected = pd.DataFrame(expected)

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected

          democrat  independent  republican
asian       23.820       11.160      25.020
black       61.138       28.644      64.218
hispanic    99.647       46.686     104.667
other       15.086        7.068      15.846
white      197.309       92.442     207.249


In [25]:
# Now we can follow the same steps as before to calculate the chi-squared statistic and p-value

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum() # .sum() is called twice: once to get the column sums, then again to add those into a total for the whole table

print(f"Chi-Squared Statistic: {chi_squared_stat}")

crit = stats.chi2.ppf(q = 0.95, 
                      df = 8)

print(f"Critical Value : {crit}")

p_value = 1 - stats.chi2.cdf(x = chi_squared_stat, df = 8) # df = (rows - 1) * (columns - 1) = (5 - 1) * (3 - 1) = 8

print(f"P value: {p_value}")


Chi-Squared Statistic: 7.169321280162059
Critical Value : 15.507313055865453
P value: 0.518479392948842


The chi-squared statistic is below the critical value and the p-value is high, so we fail to reject the null hypothesis that the variables are independent. This does not prove independence, but it is consistent with how the data were generated.
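
As a cross-check on the manual calculation, SciPy's `stats.chi2_contingency` runs the whole test of independence from the observed table. The sketch below hard-codes the observed counts from the table above; the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Observed counts from the table above.
# Rows: asian, black, hispanic, other, white
# Columns: democrat, independent, republican
observed_counts = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

# Returns the statistic, p-value, degrees of freedom and expected counts
chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(observed_counts)

print("Chi-squared statistic:", chi2_stat)
print("P value:", p_value)
print("Degrees of freedom:", dof)
print(expected_counts)
```

The function derives the degrees of freedom and the expected counts itself, so it is a convenient way to confirm both the `df = 8` and the `np.outer` calculation used above.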