# Feature selection for categorical data

- **Chi-squared:** The chi-squared test is a hypothesis test used to determine whether two categorical variables are related. Features can be ranked by their chi-squared statistic against the target, and those with the highest values are selected.
- **Mutual information:** Mutual information measures the dependency between two variables. It is zero if and only if the two random variables are independent, and higher values mean stronger dependency.
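
To make the two definitions concrete, here is a minimal sketch that computes both quantities by hand for a small contingency table. The counts are hypothetical and chosen only for illustration; the formulas are the standard Pearson chi-squared statistic and mutual information in nats.

```python
import numpy as np

# Hypothetical 2x2 contingency table (rows: feature level, columns: class).
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if the two variables were independent.
expected = row_totals @ col_totals / n

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Mutual information in nats: sum of p(x, y) * log(p(x, y) / (p(x) * p(y))).
p_xy = observed / n
p_indep = expected / n
mutual_info = (p_xy * np.log(p_xy / p_indep)).sum()

print(chi2_stat, mutual_info)
```

Both quantities compare the observed joint distribution with the one expected under independence, which is why both are zero when the observed and expected tables coincide.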

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import feature_selection
from sklearn.datasets import load_diabetes

In [2]:
df = sns.load_dataset('titanic')

In [3]:
df.head()

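| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |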


### Convert categorical data into numerical values 

In [3]:
# Remove the dependent feature (survived) from dataframe
# Apply feature selection only to independent features
X = df.drop(columns=['survived'])
y = df[['survived']]

In [4]:
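# Select the features to treat as categorical
# (sibsp and parch are small integer counts)
categories = ['sex', 'sibsp', 'alone', 'parch', 'embarked']
# Copy so that the later column assignments modify an independent dataframe
X = X[categories].copy()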
X.head()

| | sex | sibsp | alone | parch | embarked |
|---|---|---|---|---|---|
| 0 | male | 1 | False | 0 | S |
| 1 | female | 1 | False | 0 | C |
| 2 | female | 0 | True | 0 | S |
| 3 | female | 1 | False | 0 | S |
| 4 | male | 0 | True | 0 | S |


In [5]:
dic = {k:i for i, k in enumerate(X['embarked'].unique())}
X['embarked'] = X['embarked'].map(dic)
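
One thing to watch with the dictionary approach above: if the column contains missing values, `unique()` returns `NaN` as one of the levels, so it can end up with its own integer code without any notice. A possible alternative, sketched here on a hypothetical toy series rather than the Titanic column, is `pd.factorize`, which marks missing values with `-1` so they can be handled explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
embarked = pd.Series(['S', 'C', np.nan, 'Q', 'S'])

# factorize assigns codes in order of first appearance; NaN becomes -1.
codes, levels = pd.factorize(embarked)
n_missing = int((codes == -1).sum())

# chi2 in scikit-learn requires non-negative inputs, so -1 must be replaced,
# here with an explicit extra code for "missing" (one possible choice).
codes_filled = np.where(codes == -1, len(levels), codes)

print(list(levels), codes.tolist(), codes_filled.tolist(), n_missing)
```

Whether missing values should get their own category, be imputed, or be dropped depends on the analysis; the point of the sketch is only to make that decision visible.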

In [6]:
X['alone'] = X['alone'].astype(int)

In [7]:
X['sex'] = np.where(X['sex']=='male',1,0)

In [8]:
X.head()

| | sex | sibsp | alone | parch | embarked |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 1 | 0 | 0 |
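
Integer codes impose an arbitrary order and magnitude on a nominal feature such as `embarked`. One-hot encoding is a common alternative that avoids this. The following is a minimal sketch on a hypothetical toy dataframe, not the notebook's `X`.

```python
import pandas as pd

# Hypothetical toy data with one nominal column and one binary column.
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S'],
                    'sex': [1, 0, 0, 1]})

# One indicator column per level of `embarked`; other columns are kept as-is.
encoded = pd.get_dummies(toy, columns=['embarked'], dtype=int)

print(encoded)
```

Each indicator column is non-negative and binary, so it can be scored on its own, at the cost of one score per level instead of one per feature.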


### Perform chi-squared test

In [22]:
chi2 = feature_selection.chi2(X,y)

In [26]:
chi2_vals = chi2[0]
pvalues = chi2[1]

In [65]:
chi2_test = pd.DataFrame({'chi2':chi2_vals, 'p_values':pvalues})
# use the feature names as the index of the dataframe
chi2_test.index = X.columns
chi2_test.sort_values(by='chi2', ascending=False)

| | chi2 | p_values |
|---|---|---|
| sex | 92.702447 | 6.077838e-22 |
| alone | 14.640793 | 0.0001300685 |
| embarked | 14.124257 | 0.0001711228 |
| parch | 10.097499 | 0.001484707 |
| sibsp | 2.581865 | 0.1080942 |


A higher `chi2` score indicates a stronger association between a feature and the target, and a lower `p_values` entry indicates stronger evidence against independence. Here `sex` is by far the most important feature, while `sibsp` (p ≈ 0.11) is the only feature that is not significant at the 5% level.
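
Once the scores are available, the selection step itself can be automated with scikit-learn's `SelectKBest`. The sketch below uses synthetic data (the arrays `noise_a`, `informative`, and `noise_b` are hypothetical) so that it is clear which column should be kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic binary target and three non-negative integer features.
y = rng.integers(0, 2, size=200)
informative = y.copy()
informative[:20] = 1 - informative[:20]   # flip 10% of the values as noise
noise_a = rng.integers(0, 2, size=200)    # unrelated to y
noise_b = rng.integers(0, 3, size=200)    # unrelated to y
X = np.column_stack([noise_a, informative, noise_b])

# Keep the single feature with the highest chi-squared score.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
selected = selector.get_support(indices=True)

print(selector.scores_, selected)
```

Note that `chi2` in scikit-learn expects non-negative values such as booleans or counts, so the score of an integer-coded feature with more than two levels depends on which code each level received.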

### Mutual information

In [10]:
# Pass y as a 1-D array to avoid the column-vector conversion warning
mutual_info = feature_selection.mutual_info_classif(X, y.values.ravel())


In [14]:
df_mutual_info = pd.DataFrame(mutual_info, columns=['values'])
df_mutual_info.index = X.columns
df_mutual_info = df_mutual_info.sort_values(by='values', ascending=False)

In [15]:
df_mutual_info

| | values |
|---|---|
| sex | 0.147114 |
| alone | 0.021629 |
| parch | 0.008139 |
| sibsp | 0.001526 |
| embarked | 0.0 |
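
By default, `mutual_info_classif` treats the columns of a dense input as continuous and adds a small amount of random noise to them, so the scores above can change from run to run. Since all features here are categorical, it may be more appropriate to pass `discrete_features=True` and fix `random_state`. The following is a minimal sketch on synthetic data, not a rerun of the Titanic results.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic binary target, one feature identical to it, one unrelated feature.
y = rng.integers(0, 2, size=300)
copy_of_y = y.copy()
noise = rng.integers(0, 3, size=300)
X = np.column_stack([copy_of_y, noise])

# Treat every column as discrete and make the result reproducible.
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(scores)
```

With discrete features, the score of `copy_of_y` approaches the entropy of the target (about `ln 2` for a balanced binary target), while the unrelated feature scores close to zero.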
