## Chi-squared goodness-of-fit test

The chi-squared goodness-of-fit test is the categorical analog of the one-sample t-test: it tests whether the distribution of categorical sample data matches an expected distribution.

For instance, you could use a chi-squared goodness-of-fit test to check whether the browser preferences of your friends match those of internet users as a whole, or whether the spending habits of customers at your company match those of all customers in Ireland.

Let's generate some fake customer spending data for all retail stores in Ireland and for Primark, then walk through the chi-squared goodness-of-fit test to check whether the two distributions differ.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
ireland_retail = pd.DataFrame(["Jan"] * 100000 + ["Feb"] * 60000 +\
                              ["Sep"] * 50000+ ["Nov"] * 30000 +\
                                ["Other"] * 10000)

primark = pd.DataFrame(["Jan"] * 800 + ["Feb"] * 500 + \
                       ["Sep"] * 250 + ["Nov"] * 75 +\
                        ["Other"] * 150)

ireland_retail_table = pd.crosstab(index = ireland_retail[0], columns= "count")
primark_retail_table = pd.crosstab(index = primark[0], columns = "count")

print("Ireland retail")
print(ireland_retail_table)
print(" ")
print("Primark")
print(primark_retail_table)

Ireland retail
col_0   count
0            
Feb     60000
Jan    100000
Nov     30000
Other   10000
Sep     50000
 
Primark
col_0  count
0           
Feb      500
Jan      800
Nov       75
Other    150
Sep      250


In [3]:
observed_table = primark_retail_table
ireland_retail_ratios = ireland_retail_table / len(ireland_retail)
print(ireland_retail_ratios)

expected_table = ireland_retail_ratios * len(primark)
chi_squared_stat = (((observed_table - expected_table) ** 2) / expected_table).sum()
print(chi_squared_stat)

col_0  count
0           
Feb     0.24
Jan     0.40
Nov     0.12
Other   0.04
Sep     0.20
col_0
count    232.629108
dtype: float64
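
For reference, the statistic computed in the cell above follows the standard Pearson chi-squared formula. The symbols are introduced here for illustration: O_i is the observed count in category i, E_i is the expected count, and k is the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$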


Note: The chi-squared test assumes that none of the expected counts is less than 5.
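
This assumption can be checked before running the test. The following is a minimal sketch that rebuilds the expected counts from the proportions and the Primark sample size shown above; the variable names are assumptions made for this example.

```python
import numpy as np

# Population proportions printed above, in the order Feb, Jan, Nov, Other, Sep
population_ratios = np.array([0.24, 0.40, 0.12, 0.04, 0.20])

# Total number of Primark observations: 800 + 500 + 250 + 75 + 150
sample_size = 1775

# Expected count per category if Primark followed the population distribution
expected_counts = population_ratios * sample_size

# The rule of thumb requires every expected count to be at least 5
smallest_expected = expected_counts.min()
assumption_holds = bool((expected_counts >= 5).all())

print(expected_counts)
print(smallest_expected, assumption_holds)
```

If any expected count fell below 5, a common remedy is to merge sparse categories before testing.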

In a t-test, we compare the test statistic to a critical value from the t-distribution to determine whether the result is significant. Similarly, in a chi-squared test, we compare the chi-squared statistic to a critical value from the chi-squared distribution.

In [4]:
crit = stats.chi2.ppf(q = 0.95, # 95% confidence level (5% significance level)
                      df = 4)   # df = number of categories - 1

print("Critical value")
print(crit)

p_value = 1 -stats.chi2.cdf(x = chi_squared_stat,
                            df = 4)

print("P value")
print(p_value)

Critical value
9.487729036781154
P value
[0.]


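Since our chi-squared statistic exceeds the critical value and the p-value is extremely low (so low that 1 - cdf rounds to 0 in floating-point arithmetic), we reject the null hypothesis that the two distributions are the same.

The stats.chisquare() function performs the same test in a single call: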

In [5]:
stats.chisquare(f_obs = observed_table, # Array of observed counts
                f_exp = expected_table) # Array of expected counts

Power_divergenceResult(statistic=array([232.62910798]), pvalue=array([3.58577414e-49]))
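
The p-value printed as 0 earlier is a floating-point artifact: the CDF is so close to 1 that subtracting it from 1 loses the tail entirely. A small sketch of the difference, using SciPy's survival function stats.chi2.sf and the statistic printed above (the variable names are assumptions for this example):

```python
from scipy import stats

chi_squared_stat = 232.629108  # statistic printed in the goodness-of-fit example
df = 4                         # number of categories - 1

# Subtracting the CDF from 1 loses all precision in the far tail
p_from_cdf = 1 - stats.chi2.cdf(chi_squared_stat, df)

# The survival function computes the upper tail directly
p_from_sf = stats.chi2.sf(chi_squared_stat, df)

print(p_from_cdf)
print(p_from_sf)
```

For very small p-values, the survival function is generally the safer choice.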

## Chi-square test of independence

The chi-squared test of independence tests whether two categorical variables are independent. It is commonly used to determine whether variables such as education, political views and other preferences vary with demographic factors such as gender, race and religion. Let us create some fake customer spending data and perform a test of independence.

In [6]:
np.random.seed(10)

# Sample spending months at random with fixed probabilities
spending_month = np.random.choice(a = ["Jan", "Feb", "Sep", "Nov", "Other"],
                              p = [0.04, 0.15, 0.35, 0.06, 0.4],
                              size = 1000)

# Sample customer segments at random with fixed probabilities
customer_segment = np.random.choice(a = ["Low_class", "Middle_class", "Upper_class"],
                               p = [0.6, 0.3, 0.1],
                               size = 1000)

customers = pd.DataFrame({"month": spending_month,
                       "segment": customer_segment})

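# Note: pd.crosstab sorts categories alphabetically (Feb, Jan, Nov, Other, Sep),
# so the row labels assigned below are positional rather than the true months.
# The chi-squared statistic does not depend on the labels.
voter_tab = pd.crosstab(customers.month, customers.segment, margins = True)
voter_tab.columns = ["Low_class", "Middle_class", "Upper_class", "row_totals"]
voter_tab.index = ["Jan", "Feb", "Sep", "Nov", "Other", "col_totals"]
observed2d = voter_tab.iloc[0:5, 0:3] # keep the observed counts (without totals) for later use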
voter_tab

| | Low_class | Middle_class | Upper_class | row_totals |
|---|---|---|---|---|
| Jan | 85 | 54 | 12 | 151 |
| Feb | 25 | 24 | 4 | 53 |
| Sep | 35 | 23 | 1 | 59 |
| Nov | 228 | 115 | 53 | 396 |
| Other | 210 | 96 | 35 | 341 |
| col_totals | 583 | 312 | 105 | 1000 |


Because the month and segment values were sampled separately, with no link between them, the two variables are independent by construction.

To test for independence, we use the same chi-squared formula as in the goodness-of-fit test. The only difference is that the expected counts are calculated for a 2-dimensional table instead of a 1-dimensional one.

To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and divide by the total number of observations.
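
Written as a formula, with symbols introduced here for illustration (R_i is the total of row i, C_j is the total of column j, and N is the total number of observations):

$$
E_{ij} = \frac{R_i \, C_j}{N}
$$

For example, the first cell of the table above has row total 151 and column total 583, so its expected count is 151 × 583 / 1000 = 88.033.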

We use the np.outer function to multiply every row total by every column total, then divide by the total number of observations to get the expected counts for all cells in the table.

In [7]:
# np.outer computes the outer product of the row totals and the column totals,
# i.e. every row total multiplied by every column total

expected2d = np.outer(voter_tab["row_totals"].iloc[0 : 5],
                    voter_tab.loc["col_totals"].iloc[0 : 3]) / 1000

expected2d = pd.DataFrame(expected2d)
expected2d.columns = ["Low_class", "Middle_class", "Upper_class"]
expected2d.index = ["Jan", "Feb", "Sep", "Nov", "Other"]

expected2d

| | Low_class | Middle_class | Upper_class |
|---|---|---|---|
| Jan | 88.033 | 47.112 | 15.855 |
| Feb | 30.899 | 16.536 | 5.565 |
| Sep | 34.397 | 18.408 | 6.195 |
| Nov | 230.868 | 123.552 | 41.58 |
| Other | 198.803 | 106.392 | 35.805 |


Now, let us compute the chi-squared statistic and calculate the p-value.

In [8]:
chi_squared_stat2 = (((observed2d-expected2d)**2)/expected2d).sum().sum()

print(chi_squared_stat2)

# .sum() is called twice: first to total each column, then to add up those column totals

17.924643480097544


In [9]:
crit = stats.chi2.ppf(q = 0.95,
                      df = 8)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x = chi_squared_stat2, df = 8)

print("P value")
print(p_value)

Critical value
15.50731305586545
P value
0.021798595389257547


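Here, the chi-squared statistic (17.92) exceeds the critical value (15.51) and the p-value (0.022) is below 0.05, so we would reject the null hypothesis of independence at the 5% significance level. Because the two variables were generated independently, this rejection is a false positive (a Type I error), which a test at the 5% level produces in about 5% of samples.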
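
Because the month and segment columns were sampled independently, a test at the 5% level is expected to reject the null hypothesis in roughly 5% of repeated samples purely by chance. The sketch below illustrates this by repeating the sampling with the same probabilities. The seed, the number of repetitions and the variable names are assumptions chosen for illustration, and integer codes stand in for the category labels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, chosen for reproducibility
n_reps = 500                    # number of simulated samples (assumption)
rejections = 0

for _ in range(n_reps):
    # Draw month and segment codes independently, as in the example above
    month = rng.choice(5, size=1000, p=[0.04, 0.15, 0.35, 0.06, 0.40])
    segment = rng.choice(3, size=1000, p=[0.6, 0.3, 0.1])

    # Build the 5 x 3 table of observed counts
    table = np.zeros((5, 3), dtype=int)
    np.add.at(table, (month, segment), 1)

    # Run the test of independence and count rejections at the 5% level
    _, p, _, _ = stats.chi2_contingency(table)
    if p < 0.05:
        rejections += 1

rejection_rate = rejections / n_reps
print(rejection_rate)
```

A rejection rate near 0.05 indicates that a single significant result on independent data is within what the test's significance level allows.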

### Note
The degrees of freedom for a test of independence equal (number of rows - 1) multiplied by (number of columns - 1). In this case we have a 5 x 3 table, so df = 4 * 2 = 8.
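
As a minimal sketch, the degrees of freedom can be derived from the shape of the observed table instead of being hard-coded. The array below copies the observed counts from the table above; the variable names are assumptions for this example.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table above (totals excluded)
observed = np.array([[ 85,  54, 12],
                     [ 25,  24,  4],
                     [ 35,  23,  1],
                     [228, 115, 53],
                     [210,  96, 35]])

# Degrees of freedom = (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)

# SciPy reports the same value as the third element of its result
_, _, scipy_dof, _ = stats.chi2_contingency(observed)

print(dof, scipy_dof)
```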

We can also use the stats.chi2_contingency() function to conduct a test of independence automatically, given a table of observed counts.

In [10]:
stats.chi2_contingency(observed = observed2d)

Chi2ContingencyResult(statistic=17.924643480097544, pvalue=0.021798595389257592, dof=8, expected_freq=array([[ 88.033,  47.112,  15.855],
       [ 30.899,  16.536,   5.565],
       [ 34.397,  18.408,   6.195],
       [230.868, 123.552,  41.58 ],
       [198.803, 106.392,  35.805]]))

Chi-squared tests provide a way to investigate whether the distributions of categorical variables with the same categories differ, and whether two categorical variables depend on each other.