### Fisher Score - Chi-Square Test for Feature Selection

Compute chi-squared stats between each non-negative feature and class.

- The score should be used to evaluate categorical variables in a classification task. 

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
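A minimal sketch of that selection step, assuming scikit-learn's `SelectKBest` with `chi2` as the scoring function. The tiny binary matrix `X_demo` and labels `y_demo` are made-up illustration data, not the Titanic columns used later:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data: 8 samples, 3 non-negative (binary) features.
X_demo = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Keep the single feature with the highest chi-squared statistic.
selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)
print(selector.scores_)        # one chi-squared statistic per feature
print(selector.get_support())  # boolean mask of the selected features
```

Here the first feature follows the class exactly, so it should be the one that is kept.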

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification. The chi-square statistic is commonly used for testing relationships between categorical variables.

It compares the observed distribution of the classes of the target Y across the categories of the feature against the distribution that would be expected if the target classes did not depend on the feature categories.
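The observed-versus-expected comparison can be worked through by hand. This is only a sketch on a made-up 2x2 contingency table (the counts are invented for illustration), using the textbook formula: the sum over all cells of (observed - expected)^2 / expected.

```python
import numpy as np

# Made-up contingency table: rows = feature categories, columns = target classes.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Counts expected if the feature and the target were independent.
expected = row_totals @ col_totals / total

# Sum of squared deviations, each scaled by its expected count.
chi_square = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(round(chi_square, 4))
```

For these counts the expected table is [[20, 20], [30, 30]] and the statistic comes out to about 16.67. The larger the gap between observed and expected counts, the larger the statistic.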

In [27]:
##import pandas as pd
##df = pd.read_csv('C:/Users/axagrawal/Desktop/Kaggle/Titanic/train.csv')
##df.head()
##df.shape
##df.columns
##df.info()

In [1]:
## Another way to load the Titanic dataset
## seaborn provides several built-in example datasets
import seaborn as sns
df = sns.load_dataset('titanic')

In [2]:
df.head()
##df.shape

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


In [4]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [5]:
### Take the categorical features ["sex","embarked","alone","pclass"] plus the output category "survived"

df = df[["sex","embarked","alone","pclass","survived"]].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


In [6]:
## Before applying chi-square, the categorical variables must be encoded as numbers, using one-hot encoding or any other technique
## Let's perform label encoding on the sex column
import numpy as np
df['sex']=np.where(df["sex"]=="male",1,0)
### Let's perform label encoding on embarked (missing values get their own code)
ordinal_label = {k:i for i, k in enumerate(df["embarked"].unique(),0)}
df['embarked'] = df["embarked"].map(ordinal_label)

In [7]:
enumerate(df["embarked"].unique(),0)

<enumerate at 0x1fbe1d26500>

In [8]:
ordinal_label

{'S': 0, 'C': 1, 'Q': 2, nan: 3}
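The integer codes above impose an arbitrary order on the ports, and since `chi2` in scikit-learn treats feature values as counts or frequencies, the size of each code can influence the statistic. One-hot encoding, mentioned in the comment above, is one way around that. The sketch below is an assumption about how it could be done with `pd.get_dummies`; the `ports` series is made-up sample data:

```python
import numpy as np
import pandas as pd

# Made-up port codes, including one missing value.
ports = pd.Series(["S", "C", "S", "Q", np.nan], name="embarked")

# One indicator column per category; dummy_na=True adds a column for missing values.
one_hot = pd.get_dummies(ports, prefix="embarked", dummy_na=True).astype(int)
print(one_hot)
```

Each row then has exactly one 1, and every resulting column is a non-negative boolean-style feature, which is the kind of input the chi-squared score expects.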

In [9]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,False,3,0
1,0,1,False,1,1
2,0,0,True,3,1
3,0,0,False,1,1
4,1,0,True,3,0


In [10]:
### Let's perform label encoding on alone (True -> 1, False -> 0)
df['alone']=np.where(df["alone"]==True,1,0)


In [11]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,1,3,1
3,0,0,0,1,1
4,1,0,1,3,0


In [12]:
### Split into train and test sets so that feature selection is fit on the training data only and the test set stays unseen for evaluation
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,y_test=train_test_split(df[['sex','embarked','alone','pclass']],df[["survived"]],test_size=0.3,random_state=0)

In [15]:
X_train.shape, Y_train.shape

((623, 4), (623, 1))

In [17]:
X_train.head()


Unnamed: 0,sex,embarked,alone,pclass
857,1,0,1,1
52,0,1,0,1
386,1,0,0,3
124,1,0,0,1
578,0,1,0,3


In [18]:
X_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

In [20]:
### We need to perform the chi-square test
### chi2 returns two arrays: the chi-squared statistics and the p-values (one entry per feature)
from sklearn.feature_selection import chi2
f_p_values=chi2(X_train,Y_train)
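To see what goes into these numbers, here is a NumPy sketch of the computation as I understand scikit-learn's `chi2` to perform it: the "observed" values are the per-class sums of each feature, and the "expected" values split each feature's total according to the class proportions. The arrays `X_demo` and `y_demo` are made-up data, and this is an illustration rather than the library's actual source:

```python
import numpy as np

# Made-up data: 8 samples, 3 non-negative features, 2 classes.
X_demo = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# One-hot encode the labels: one column per class.
classes = np.unique(y_demo)
Y = (y_demo[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X_demo               # per-class sum of each feature
feature_totals = X_demo.sum(axis=0)   # overall sum of each feature
class_prob = Y.mean(axis=0)           # share of samples in each class
expected = np.outer(class_prob, feature_totals)

# One chi-squared statistic per feature (summed over classes).
chi2_stats = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_stats)
```

Because the feature values are summed as if they were counts, the inputs must be non-negative, and the scale of an encoded feature affects its score.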

In [21]:
f_p_values

(array([63.55447864, 11.83961845,  9.03328564, 21.61080949]),
 array([1.55992554e-15, 5.79837058e-04, 2.65107556e-03, 3.33964360e-06]))

#### The first array in the output above holds the chi-squared statistics (not F-scores). The higher the statistic, the stronger the evidence that the feature depends on the class.
#### The second array holds the p-values. The lower the p-value, the stronger the evidence that the feature depends on the class.
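The two arrays carry the same information in different forms. As a rough check, assuming the p-values come from a chi-squared distribution with (number of classes - 1) = 1 degree of freedom, the tail probability for a statistic x is `erfc(sqrt(x / 2))`, which needs only the standard library:

```python
import math

# Chi-squared statistics copied from the output above (sex, embarked, alone, pclass).
chi2_stats = [63.55447864, 11.83961845, 9.03328564, 21.61080949]

# With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
p_values_check = [math.erfc(math.sqrt(x / 2)) for x in chi2_stats]
print(p_values_check)
```

The results should agree with the second array in the output above, which is why ranking by the highest statistic and ranking by the lowest p-value give the same order here.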

In [27]:
## Convert the array of chi-squared statistics into a pandas Series
import pandas as pd
pd.Series(f_p_values[0])

0    63.554479
1    11.839618
2     9.033286
3    21.610809
dtype: float64

In [28]:
pd.Series(f_p_values[1])


0    1.559926e-15
1    5.798371e-04
2    2.651076e-03
3    3.339644e-06
dtype: float64

In [30]:
p_values=pd.Series(f_p_values[1])
p_values.index=X_train.columns

In [31]:
p_values

sex         1.559926e-15
embarked    5.798371e-04
alone       2.651076e-03
pclass      3.339644e-06
dtype: float64

In [35]:
p_values.index

Index(['sex', 'embarked', 'alone', 'pclass'], dtype='object')

In [36]:
## sort_index sorts by index label, alphabetically: alone, embarked, pclass, sex
p_values.sort_index()

alone       2.651076e-03
embarked    5.798371e-04
pclass      3.339644e-06
sex         1.559926e-15
dtype: float64

In [38]:
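## Sort the index labels in descending alphabetical order
## (here this happens to match ascending p-value; in general use sort_values() to rank by p-value)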
p_values.sort_index(ascending = False)

sex         1.559926e-15
pclass      3.339644e-06
embarked    5.798371e-04
alone       2.651076e-03
dtype: float64
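Note that `sort_index` orders by column name, not by p-value, so the match with the importance ranking above is a coincidence of the names. A small sketch of ranking by the values themselves with `sort_values`, using the p-values copied from the output above (the name `p_values_demo` is just for this example):

```python
import pandas as pd

# p-values copied from the chi2 output above.
p_values_demo = pd.Series(
    [1.559926e-15, 5.798371e-04, 2.651076e-03, 3.339644e-06],
    index=["sex", "embarked", "alone", "pclass"],
)

# Smallest p-value first: strongest evidence of dependence on the class.
ranked = p_values_demo.sort_values()
print(ranked)
```

This gives the order sex, pclass, embarked, alone regardless of how the columns are named.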

#### Observation
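The sex column has the highest chi-squared statistic (63.55) and the lowest p-value (1.56e-15), so it shows the strongest dependence on the output feature, followed by pclass, embarked and alone.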

