**Connect with me on LinkedIn**: https://www.linkedin.com/in/dheerajkumar1997/

## Fisher Score: Chi Square

In this notebook, we compute the chi-squared statistic between each non-negative feature and the target class. This score is meant for evaluating categorical variables in a classification task.
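
To make the score less of a black box, here is a minimal sketch of how the statistic can be computed by hand and checked against `sklearn.feature_selection.chi2`. The helper `manual_chi2` and the toy arrays `X_toy` and `y_toy` are hypothetical names introduced only for illustration. The sketch assumes the usual formulation: for each feature, the observed per-class sums of the feature are compared with the sums expected if the feature were independent of the class.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import chi2


def manual_chi2(X, y):
    """Illustrative re-computation of the chi-squared score (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # One-hot encode the target: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)

    # Observed: sum of each feature within each class
    observed = Y.T @ X

    # Expected under independence: class frequency * total of each feature
    feature_count = X.sum(axis=0, keepdims=True)
    class_prob = Y.mean(axis=0, keepdims=True)
    expected = class_prob.T @ feature_count

    # Chi-squared statistic per feature, summed over classes
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)

    # p-value from the chi-squared distribution with (n_classes - 1) degrees of freedom
    p_value = stats.chi2.sf(stat, df=len(classes) - 1)
    return stat, p_value


# Toy data (made up for this sketch): two non-negative features, binary target
X_toy = np.array([[3, 1], [1, 0], [3, 0], [1, 0], [3, 1], [2, 1]])
y_toy = np.array([0, 1, 1, 1, 0, 0])

stat_manual, p_manual = manual_chi2(X_toy, y_toy)
stat_sklearn, p_sklearn = chi2(X_toy, y_toy)

print(stat_manual, stat_sklearn)
print(p_manual, p_sklearn)
```

Because the features enter the computation as sums, negative values would make the "counts" meaningless, which is why the score is restricted to non-negative features.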

In [1]:
# Import Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest, SelectPercentile
%matplotlib inline

In [2]:
# Load Dataset
df = pd.read_csv('titanic_dataset.csv')
df.shape

(891, 12)

In [3]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Encode categorical variables into numbers
df['Sex'] = np.where(df.Sex == 'male', 1, 0)
label = {k:i for i, k in enumerate(df['Embarked'].unique(),0)}
df['Embarked'] = df['Embarked'].map(label)

In [5]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,0


In [6]:
X = df[['Pclass', 'Sex', 'Embarked']]
X.head()

,Pclass,Sex,Embarked
0,3,1,0
1,1,0,1
2,3,0,0
3,1,0,0
4,3,1,0


In [7]:
y = df['Survived']
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [8]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((623, 3), (623,), (268, 3), (268,))

In [9]:
# Calculate the Fisher Score (chi2) between each feature and the target
fisher_score = chi2(X_train.fillna(0), y_train)
fisher_score

(array([20.13531511, 68.02020611, 10.86691227]),
 array([7.21520718e-06, 1.61828662e-16, 9.78976136e-04]))

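Two arrays are returned: the chi-squared statistics and the corresponding p-values, with one entry per feature. The larger the statistic (and so the smaller the p-value), the stronger the evidence that the feature and the target are not independent.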
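
The two arrays are linked: each p-value is the upper-tail probability of the corresponding statistic under a chi-squared distribution. The sketch below recomputes the p-values from the statistics printed above. It assumes the degrees of freedom equal the number of target classes minus one (here 1, since `Survived` is binary); the variable names are introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Chi-squared statistics copied from the output above (Pclass, Sex, Embarked)
chi2_stats = np.array([20.13531511, 68.02020611, 10.86691227])

# Assumption: degrees of freedom = number of target classes - 1
dof = 2 - 1

# Survival function = P(chi-squared variable > observed statistic)
p_from_stats = stats.chi2.sf(chi2_stats, df=dof)
print(p_from_stats)
```

The printed values should agree with the second array returned by `chi2`, up to rounding of the copied statistics.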

In [10]:
p_values = pd.Series(fisher_score[1])
p_values.index = X_train.columns
p_values.sort_values(ascending=False)

Embarked    9.789761e-04
Pclass      7.215207e-06
Sex         1.618287e-16
dtype: float64

The smaller the p-value, the stronger the evidence that the feature is associated with the target, i.e. `Survived` in the Titanic dataset. Sex has the smallest p-value, so by this score it is the most important of the three features.
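
The ranking can also be turned into an actual selection step with `SelectKBest`, which was imported at the top but not used. The following is a minimal sketch on made-up toy data rather than the Titanic columns; `X_toy`, `y_toy`, and the column names `informative` and `noise` are hypothetical and chosen so that one feature tracks the target exactly and the other is balanced across classes.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy binary target: 0, 1, 0, 1, ...
y_toy = np.array([0, 1] * 50)

# Toy non-negative features (made up for illustration)
X_toy = pd.DataFrame({
    "informative": y_toy,               # identical to the target
    "noise": np.array([0, 1, 1, 0] * 25),  # evenly spread across both classes
})

# Keep the single feature with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X_toy, y_toy)

selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
print(selector.scores_)
```

On real data, `k` is a judgment call; `SelectPercentile` offers the same idea expressed as a percentage of features to keep.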
