In [1]:
# Pearson’s chi-squared test is a statistical hypothesis test for independence between two categorical variables.
# It can be used for feature selection (and thus help with dimensionality reduction).
# In classification problems where the input variables are also categorical, we can use statistical tests
# to determine whether the output variable depends on a given input variable.
# If the two are independent, the input variable is likely irrelevant to the problem
# and is a candidate for removal from the dataset.

# Main concepts:
# Pairs of categorical variables can be summarized using a contingency table.
# The chi-squared test compares an observed contingency table to the table expected under independence
# and determines whether the categorical variables are independent.
# The test can be calculated and interpreted in Python with SciPy.

# Example:
#         fiction  non-fiction  Total
# Male        250           50    300
# Female      200         1000   1200
# Total       450         1050   1500

# Does an interest in fiction or non-fiction depend on gender, or are the two independent?
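
# In practice a contingency table is usually built from raw records rather than typed in by hand.
# A minimal sketch, assuming pandas is available: the column names `gender` and `genre` and the
# expansion into one row per reader are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Cell counts taken from the example table above
counts = {
    ('Male', 'fiction'): 250,
    ('Male', 'non-fiction'): 50,
    ('Female', 'fiction'): 200,
    ('Female', 'non-fiction'): 1000,
}

# Expand the counts into one (gender, genre) record per reader
records = [(gender, genre) for (gender, genre), c in counts.items() for _ in range(c)]
df = pd.DataFrame(records, columns=['gender', 'genre'])

# Cross-tabulate the two categorical columns; margins=True adds row and column totals
ct = pd.crosstab(df['gender'], df['genre'], margins=True, margins_name='Total')
print(ct)
```

# Note that crosstab sorts the categories alphabetically, so Female is listed before Male.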

In [2]:
from scipy.stats import chi2_contingency
from scipy.stats import chi2

In [3]:
# contingency table
table = [
    [250, 50],
    [200, 1000]
]
print(table)

[[250, 50], [200, 1000]]


In [4]:
# degrees of freedom:
# (rows - 1) * (cols - 1) = (2 - 1) * (2 - 1) = 1

# Expected frequencies under independence:
# e_ij = (count(A=a_i) * count(B=b_j)) / n
# example: (male, fiction) = (300 * 450) / 1500 = 90

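# Input: an array representing the contingency table for the two categorical variables.
# Returns: the test statistic and p-value for interpretation,
#          the degrees of freedom, and the table of expected frequencies.
# Note: for a 2x2 table (dof = 1), chi2_contingency applies Yates' continuity correction by default
# (correction=True), so the statistic is slightly smaller than the plain Pearson sum.
stat, p, dof, expected = chi2_contingency(table)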
print('dof=%d' % dof)
print('Expected frequencies:')
print(expected)

dof=1
Expected frequencies:
[[ 90. 210.]
 [360. 840.]]
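
# The expected-frequency formula can be checked by hand. A minimal sketch, assuming NumPy is available;
# the variable names below are illustrative. It also shows the effect of Yates' continuity correction,
# which chi2_contingency applies by default when dof = 1.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 50], [200, 1000]])

# Marginal totals, kept 2-D so they broadcast into a full table
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# e_ij = row_total_i * col_total_j / n
expected_manual = row_totals * col_totals / n

# Plain Pearson statistic: sum over cells of (observed - expected)^2 / expected
stat_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy without and with the continuity correction
stat_plain, _, _, expected_scipy = chi2_contingency(observed, correction=False)
stat_yates, _, _, _ = chi2_contingency(observed)

print(expected_manual)
print('manual=%.3f, scipy (no correction)=%.3f, scipy (default)=%.3f'
      % (stat_manual, stat_plain, stat_yates))
```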


In [5]:
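# interpret the test statistic
# prob is the confidence level (1 - alpha); the critical value is the chi-squared quantile at prob
prob = 0.999
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
# the chi-squared statistic is never negative, so abs() is not needed
if stat >= critical: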
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

probability=0.999, critical=10.828, stat=504.767
Dependent (reject H0)
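
# Comparing the statistic to a critical value and comparing the p-value to alpha are two views of the same decision.
# A minimal sketch of that equivalence, using the survival function of the chi-squared distribution;
# the variable names p_from_stat and same_decision are illustrative.

```python
from scipy.stats import chi2, chi2_contingency

table = [[250, 50], [200, 1000]]
stat, p, dof, expected = chi2_contingency(table)

alpha = 0.001

# Critical value: the point with upper-tail probability alpha,
# equivalent to chi2.ppf(1 - alpha, dof)
critical = chi2.isf(alpha, dof)

# p-value: upper-tail probability of the observed statistic
p_from_stat = chi2.sf(stat, dof)

print('critical=%.3f' % critical)
print('p from statistic=%.3e, p from chi2_contingency=%.3e' % (p_from_stat, p))

# Both decision rules agree
same_decision = (stat >= critical) == (p <= alpha)
print(same_decision)
```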


In [6]:
# Note:
# For 1 degree of freedom, the chi-squared value needed to reject the hypothesis at the 0.001 significance level is 10.828
# (taken from the table of upper percentage points of the chi-squared distribution, available in most statistics textbooks).
# Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent
# and conclude, with very strong evidence, that the two attributes are associated for the given group of people.

# interpret the p-value
# In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:
# If p-value <= alpha: significant result, reject the null hypothesis (H0); the variables are dependent.
# If p-value > alpha: result not significant, fail to reject H0; there is no evidence of dependence.

alpha = 1.0 - prob
# note: %.3f rounds a very small p-value to 0.000; the p-value itself is not exactly zero
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

significance=0.001, p=0.000
Dependent (reject H0)
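
# A p-value printed as 0.000 hides its magnitude, and a small p-value does not by itself measure how strong the association is.
# A minimal sketch, not part of the original analysis: print p in scientific notation and compute the phi coefficient,
# one common effect-size measure for 2x2 tables, taken here as sqrt(chi2 / n) from the uncorrected statistic.

```python
import math
from scipy.stats import chi2_contingency

table = [[250, 50], [200, 1000]]
n = sum(sum(row) for row in table)   # total number of observations

# Default call (with Yates' correction for dof = 1) for the p-value
stat_yates, p, dof, _ = chi2_contingency(table)

# Uncorrected Pearson statistic for the effect size
stat_plain, _, _, _ = chi2_contingency(table, correction=False)

# Scientific notation shows how small the p-value actually is
print('p=%.3e' % p)

# phi ranges from 0 (no association) to 1 (perfect association) for a 2x2 table
phi = math.sqrt(stat_plain / n)
print('phi=%.3f' % phi)
```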
