### The Chi-Squared Test for Machine Learning

A common problem in applied machine learning is determining whether input features are relevant to the outcome to be predicted.

This is the problem of feature selection.

The chi-squared test compares an observed contingency table with the table expected under independence and determines whether the categorical variables are independent.

https://machinelearningmastery.com/chi-squared-test-for-machine-learning/

### Contingency Table
A categorical variable is a variable that may take on one of a set of labels.

Does an interest in math or science depend on gender, or are they independent?

Sex,	Interest
Male,	Art
Female,	Math
Male,	Science
Male,	Math
...

          Science,   Math,   Art
Male           20,     30,    15
Female         20,     15,    30
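
A table like this can be tallied directly from raw (Sex, Interest) records. The sketch below is only an illustration: it uses the four sample rows listed above (the elided rows are not reconstructed), so its counts differ from the table, and `records` and `build_contingency_table` are hypothetical names introduced here.

```python
from collections import Counter

# Only the four sample rows shown above; the remaining rows are elided in the source.
records = [
    ("Male", "Art"),
    ("Female", "Math"),
    ("Male", "Science"),
    ("Male", "Math"),
]

def build_contingency_table(records, row_labels, col_labels):
    """Count how often each (row label, column label) pair occurs."""
    counts = Counter(records)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]

rows = ["Male", "Female"]
cols = ["Science", "Math", "Art"]
table = build_contingency_table(records, rows, cols)

# Print one row of counts per Sex label, in the column order given by cols
for label, row in zip(rows, table):
    print(label, row)
```

Fixing the row and column label order up front keeps the table layout stable even when some combinations never occur in the data.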

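Given the Sex/Interest example above, the number of observations for each category (male and female) may or may not be the same.
Nevertheless, we can calculate the expected frequency of observations in each Interest group and check whether splitting interests by Sex results in similar or different frequencies.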
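
The expected frequencies follow from the row and column totals: if the two variables were independent, each cell's expected count would be (row total * column total) / grand total. Here is a minimal sketch for the Sex/Interest table above; the helper name `expected_frequencies` is an assumption made for illustration.

```python
# Observed counts from the Sex / Interest table (columns: Science, Math, Art).
observed = [
    [20, 30, 15],  # Male
    [20, 15, 30],  # Female
]

def expected_frequencies(observed):
    """Expected count per cell if rows and columns were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

Both rows have the same total (65), so both get the same expected counts: 20 for Science and 22.5 each for Math and Art.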

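In terms of the test statistic and a critical value from the chi-squared distribution, the test can be interpreted as follows:

If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.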
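
As a rough illustration of this rule, the sketch below computes the Pearson chi-squared statistic by hand for the Sex/Interest table and compares it with the critical value. The 95% probability level is an assumption chosen for the example, and the variable names are illustrative.

```python
from scipy.stats import chi2

# Observed counts from the Sex / Interest table (columns: Science, Math, Art).
observed = [
    [20, 30, 15],  # Male
    [20, 15, 30],  # Female
]

# Expected counts under independence: row total * column total / grand total
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic: sum over cells of (observed - expected)^2 / expected
stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

dof = (len(observed) - 1) * (len(observed[0]) - 1)
prob = 0.95  # assumed probability level for this example
critical = chi2.ppf(prob, dof)

print('stat=%.3f, critical=%.3f' % (stat, critical))
if stat >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

For this table the statistic works out to 10.0, which is above the critical value of about 5.991 for 2 degrees of freedom, so the rule rejects H0.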

The degrees of freedom for the chi-squared distribution are calculated from the size of the contingency table as:

degrees of freedom: (rows - 1) * (cols - 1)
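
The formula can be written as a small helper. This is a sketch that assumes the table is a rectangular list of lists; `degrees_of_freedom` is a hypothetical name, not a SciPy function.

```python
def degrees_of_freedom(table):
    """(rows - 1) * (cols - 1) for a rectangular contingency table."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)

# The 2 x 3 Sex / Interest table has (2 - 1) * (3 - 1) degrees of freedom
print(degrees_of_freedom([[20, 30, 15], [20, 15, 30]]))
```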

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
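
The two interpretations are equivalent: the p-value is the upper-tail area of the chi-squared distribution beyond the statistic, so p <= alpha exactly when the statistic reaches the critical value for 1 - alpha. The sketch below checks this on the Sex/Interest table; alpha = 0.05 is an assumed significance level.

```python
from scipy.stats import chi2, chi2_contingency

# Observed counts from the Sex / Interest table (columns: Science, Math, Art).
table = [[20, 30, 15], [20, 15, 30]]
stat, p, dof, expected = chi2_contingency(table)

# Recompute the p-value as the survival function (1 - CDF) at the statistic
p_manual = chi2.sf(stat, dof)

alpha = 0.05  # assumed significance level for this example
critical = chi2.ppf(1.0 - alpha, dof)

print('p=%.4f, p_manual=%.4f' % (p, p_manual))
print('p-value rule rejects H0:', p <= alpha)
print('critical-value rule rejects H0:', stat >= critical)
```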

In [1]:
# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2

In [2]:
# contingency table
table = [	[10, 20, 30],
			[6,  9,  17]]
print(table)

[[10, 20, 30], [6, 9, 17]]


In [4]:
# run the test: returns the statistic, p-value, degrees of freedom, and expected frequencies
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)

dof=2
[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]


In [6]:
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))

probability=0.950, critical=5.991, stat=0.272


In [7]:
if abs(stat) >= critical:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

Independent (fail to reject H0)


In [8]:
# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

significance=0.050, p=0.873
Independent (fail to reject H0)


### Interpreting the Results

Running the example first prints the contingency table. 
The test is calculated and the degrees of freedom (dof) is reported as 2, which makes sense given:

degrees of freedom: (rows - 1) * (cols - 1)
degrees of freedom: (2 - 1) * (3 - 1)
degrees of freedom: 1 * 2
degrees of freedom: 2

[[10, 20, 30], [6, 9, 17]]

dof=2

[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]

probability=0.950, critical=5.991, stat=0.272
Independent (fail to reject H0)

significance=0.050, p=0.873
Independent (fail to reject H0)