SelectKBest, used with the chi-squared (χ2) score function chi2, selects features when both the features and the target column are categorical.

We calculate the chi-squared score between each feature and the target, then keep the desired number of features with the highest scores (equivalently, the lowest p-values).

The chi-squared (χ2) test is used in statistics to test whether two categorical variables are independent. In feature selection, we use it to test whether the values of a specific feature and the target are independent.

For each feature and target combination, a high χ2 score (or, equivalently, a low p-value) is evidence that the target column depends on the feature column.
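
For reference, the score discussed above is the standard Pearson chi-squared statistic. The symbols below are notation introduced here for illustration, not taken from the notebook: O_ij is the observed count in cell (i, j) of the contingency table, E_ij is the count expected if the two variables were independent, and N is the total number of rows.

```latex
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
```

As a quick check against the is_male vs. accident table later in this notebook, the expected count for is_male = 0 and accident = 0 is 201 × 781 / 927 ≈ 169.34, which agrees with the expected-frequency table printed for that test.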

url: https://www.youtube.com/watch?v=fMIwIKLGke0

In [3]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.feature_selection import SelectKBest, chi2

In [2]:
df = pd.read_excel('accident.xlsx')

In [4]:
df.head()

,is_adult,is_male,accident
0,1,0,1
1,1,1,1
2,1,1,0
3,1,1,0
4,1,0,0


In [5]:
adult_accident_crosstab = pd.crosstab(df['is_adult'], df['accident'], 
                                      margins=True)
adult_accident_crosstab

accident,0,1,All
is_adult,,,
1,781,146,927
All,781,146,927


In [6]:
gender_accident_crosstab = pd.crosstab(df['is_male'], df['accident'], 
                                       margins=True)
gender_accident_crosstab

accident,0,1,All
is_male,,,
0,154,47,201
1,627,99,726
All,781,146,927


In [7]:
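def check_categorical_dependency(crosstab_table, confidence_level):
    # Chi-squared test of independence on a contingency table.
    # Note: the degrees of freedom come from the table's shape, so a table
    # built with margins=True counts the 'All' row and column as categories.
    stat, p, dof, expected = stats.chi2_contingency(crosstab_table)
    print("Chi-Square Statistic value = {}".format(stat))
    print("P - Value = {}".format(p))
    alpha = 1.0 - confidence_level  # significance level
    if p <= alpha:
        print('Dependent (reject H0)')
    else:
        print('Independent (fail to reject H0)')
    return expected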

In [8]:
exp_table_1 = check_categorical_dependency(adult_accident_crosstab, 0.95)

Chi-Square Statistic value = 0.0
P - Value = 1.0
Independent (fail to reject H0)


In [9]:
pd.DataFrame(exp_table_1)

,0,1,2
0,781.0,146.0,927.0
1,781.0,146.0,927.0


In [10]:
exp_table_2 = check_categorical_dependency(gender_accident_crosstab, 0.95)

Chi-Square Statistic value = 11.270043347013548
P - Value = 0.023691007358727482
Dependent (reject H0)


In [11]:
pd.DataFrame(exp_table_2)

,0,1,2
0,169.343042,31.656958,201.0
1,611.656958,114.343042,726.0
2,781.0,146.0,927.0
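
One caveat about the tests above: the crosstabs were built with margins=True, so chi2_contingency sees the 'All' row and column as if they were extra categories. The sketch below hard-codes the is_male vs. accident counts from the crosstab output, so it does not need accident.xlsx, and compares the test with and without the totals. The variable names are illustrative, and correction=False is a choice made here to keep the two statistics directly comparable.

```python
import numpy as np
from scipy import stats

# Counts copied from the is_male vs. accident crosstab above.
observed = np.array([[154, 47],
                     [627, 99]])

# Rebuild the margins=True layout by appending row and column totals.
with_margins = np.zeros((3, 3), dtype=int)
with_margins[:2, :2] = observed
with_margins[:2, 2] = observed.sum(axis=1)
with_margins[2, :2] = observed.sum(axis=0)
with_margins[2, 2] = observed.sum()

stat_m, p_m, dof_m, _ = stats.chi2_contingency(with_margins)

# correction=False disables Yates' continuity correction, which SciPy
# otherwise applies when dof == 1, so the two statistics stay comparable.
stat, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print("with margins:    stat={:.4f}, dof={}, p={:.6f}".format(stat_m, dof_m, p_m))
print("without margins: stat={:.4f}, dof={}, p={:.6f}".format(stat, dof, p))
```

The statistic is the same in both cases because the totals match their expected values exactly, but the degrees of freedom drop from 4 to 1 once the totals are removed, which gives a smaller p-value. The conclusion (dependent) does not change for this table, though it could for a borderline one.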


# Feature Selection using Chi-Square

In [12]:
X = df[["is_adult", "is_male"]]

In [13]:
y = df[["accident"]]

In [14]:
X_new = SelectKBest(chi2, k=1).fit_transform(X, y)

In [15]:
X_new.shape

(927, 1)

In [16]:
pd.crosstab(np.squeeze(X_new), np.squeeze(y))

accident,0,1
row_0,,
0,154,47
1,627,99
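
The crosstab above matches the is_male table, which suggests that is_male is the column SelectKBest kept. A minimal sketch for confirming this by name is shown below. Since accident.xlsx may not be at hand, it rebuilds a stand-in dataset (demo_df, a name introduced here) from the crosstab counts printed earlier; the row order is arbitrary but the counts are the same.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in data rebuilt from the crosstab counts shown earlier.
# Keys are (is_male, accident); is_adult is 1 for every row.
counts = {(0, 0): 154, (0, 1): 47, (1, 0): 627, (1, 1): 99}
rows = [(1, male, acc) for (male, acc), n in counts.items() for _ in range(n)]
demo_df = pd.DataFrame(rows, columns=["is_adult", "is_male", "accident"])

X_demo = demo_df[["is_adult", "is_male"]]
y_demo = demo_df["accident"]

# Fit first (instead of fit_transform) so the selector can be inspected.
selector = SelectKBest(chi2, k=1).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected = list(X_demo.columns[selector.get_support()])

print(dict(zip(X_demo.columns, selector.scores_)))
print("selected:", selected)
```

Note that the scores reported by scikit-learn's chi2 are computed from per-class sums of each feature, so they generally differ from the chi2_contingency statistic on the full contingency table. The ranking of features is what SelectKBest uses.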
