# Machine Learning Feature Selection (Filter Method)


### Common Techniques in Filter Methods

###### The filter method is a straightforward and efficient approach to feature selection in machine learning. By relying on statistical measures to assess the relevance of features, it provides a quick way to reduce the dimensionality of the dataset and improve model performance. However, it’s often beneficial to combine it with other methods (like wrapper or embedded methods) for more comprehensive feature selection.

#### 1. Variance Threshold
#### 2. Correlation Coefficient
#### 3. SelectKBest
#### 4. Chi-Square Test
#### 5. Information Gain
#### 6. ANOVA (Analysis of Variance)
#### 7. Mutual Information
#### 8. Univariate Selection



# 1. Variance Threshold


###### * Removes features with low variance, assuming that features with very low variance are less likely to be useful.

In [3]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

In [4]:
df = pd.DataFrame({"A":[1,2,4,1,2,4], 
                    "B":[4,5,6,7,8,9], 
                    "C":[0,0,0,0,0,0],
                    "D":[1,1,1,1,1,1]}) 

In [6]:
df.head()

Unnamed: 0,A,B,C,D
0,1,4,0,1
1,2,5,0,1
2,4,6,0,1
3,1,7,0,1
4,2,8,0,1


In [7]:
x=df.drop(['D'],axis=1)

In [8]:
y=df['D']

In [9]:
x.corrwith(y)

A   NaN
B   NaN
C   NaN
dtype: float64

In [11]:
df.corr()['D']

A   NaN
B   NaN
C   NaN
D   NaN
Name: D, dtype: float64

In [12]:
var=VarianceThreshold(threshold=0)

In [13]:
var.fit(df)

In [14]:
var.get_support()

array([ True,  True, False, False])

In [16]:
df.columns[var.get_support()]

Index(['A', 'B'], dtype='object')

In [17]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [19]:
constant_columns=[column for column in df.columns if column not in df.columns[var.get_support()]]

In [20]:
constant_columns

['C', 'D']

In [21]:
for i in constant_columns:
    print(i)

C
D


In [22]:
df.drop(constant_columns,axis=1)

Unnamed: 0,A,B
0,1,4
1,2,5
2,4,6
3,1,7
4,2,8
5,4,9
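
A non-zero threshold extends the same idea to quasi-constant columns. The sketch below is illustrative only: the column `Q` and the threshold of 0.2 are assumptions added for this example and do not come from the notebook's data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: C is constant, Q (hypothetical) is almost constant
toy = pd.DataFrame({"A": [1, 2, 4, 1, 2, 4],
                    "B": [4, 5, 6, 7, 8, 9],
                    "C": [0, 0, 0, 0, 0, 0],
                    "Q": [0, 0, 0, 0, 0, 1]})

# Keep only the columns whose variance is strictly greater than the threshold
selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

kept = toy.columns[selector.get_support()].tolist()
dropped = [col for col in toy.columns if col not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

`VarianceThreshold` compares raw variances, so the right threshold depends on the scale of each feature.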


In [23]:
data=pd.read_csv('nitheesh.csv')

In [24]:
data.head()

Unnamed: 0,name,height,weight,color,place
0,1,1,1,1,3
1,1,1,1,2,2
2,1,1,2,1,3
3,1,1,2,2,1
4,1,2,1,1,3


In [25]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

In [26]:
x=data.drop(labels=['place'],axis=1)
y=data.place

In [27]:
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.25,random_state=55)

In [28]:
var=VarianceThreshold(threshold=0)

In [29]:
var.fit(x_train)

In [30]:
var.get_support()

array([ True,  True,  True,  True])

In [31]:
sum(var.get_support())

4

In [32]:
constant=[col for col in x_train.columns if col not in x_train.columns[var.get_support()]]

In [33]:
constant
# The list is empty: there are no constant columns in this dataset

[]

In [34]:
x_train.drop(constant,axis=1)

Unnamed: 0,name,height,weight,color
1,1,1,1,2
20,3,2,1,1
5,1,2,1,2
8,2,1,1,1
7,1,2,2,2
13,2,2,1,2


# 2. Correlation Coefficient

##### 1. Pearson's Correlation: Measures the linear relationship between two variables. Features highly correlated with the target variable and not highly correlated with each other are preferred.
##### 2. Spearman's Rank Correlation: Measures the monotonic relationship between two variables, so it also suits non-linear relationships as long as they are consistently increasing or decreasing.
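
The difference between the two coefficients shows up on a relationship that is monotonic but not linear. The following is a minimal sketch on made-up data (the cubic relationship is an assumption chosen for illustration):

```python
import numpy as np
import pandas as pd

# x increases steadily; y grows as the cube of x (monotonic, not linear)
x = pd.Series(np.arange(1, 11), dtype=float)
y_cubic = x ** 3

pearson = x.corr(y_cubic, method="pearson")    # sensitive to linearity
spearman = x.corr(y_cubic, method="spearman")  # depends only on the ranks

print("Pearson :", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman reaches 1 because the ranks of `x` and `y_cubic` agree exactly, while Pearson stays below 1 because the points do not lie on a straight line.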

In [35]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [36]:
# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic = pd.read_csv(url)

In [37]:
print(titanic.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [38]:
titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [39]:
# Assign the result back instead of using inplace=True on a column selection
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])

In [40]:
label_enc = LabelEncoder()
titanic['Sex'] = label_enc.fit_transform(titanic['Sex'])
titanic['Embarked'] = label_enc.fit_transform(titanic['Embarked'])

In [41]:
correlation_matrix = titanic.corr()

In [42]:
print(correlation_matrix['Survived'].sort_values(ascending=False))

Survived    1.000000
Fare        0.257307
Parch       0.081629
SibSp      -0.035322
Age        -0.064910
Embarked   -0.167675
Pclass     -0.338481
Sex        -0.543351
Name: Survived, dtype: float64


In [43]:
selected_features = correlation_matrix['Survived'][abs(correlation_matrix['Survived']) > 0.1].index.tolist()

In [44]:
selected_features.remove('Survived')  # Remove the target variable itself
print("Selected features based on correlation:", selected_features)

Selected features based on correlation: ['Pclass', 'Sex', 'Fare', 'Embarked']


In [45]:
X_selected = titanic[selected_features]
y = titanic['Survived']

In [46]:
print("Shape of the dataset with selected features:", X_selected.shape)

Shape of the dataset with selected features: (891, 4)


In [48]:
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2


# Find the best feature using correlation

In [20]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest,f_classif

In [21]:
data={'age':[1,2,3,4,5],
     'income':[100,200,300,600,800],
     'loan_approved':[0,0,0,1,1]}

In [22]:
data=pd.DataFrame(data)

In [23]:
data

Unnamed: 0,age,income,loan_approved
0,1,100,0
1,2,200,0
2,3,300,0
3,4,600,1
4,5,800,1


In [25]:
correlation=data.corr()
correlation

Unnamed: 0,age,income,loan_approved
age,1.0,0.976187,0.866025
income,0.976187,1.0,0.939336
loan_approved,0.866025,0.939336,1.0


In [26]:
sorted_correlation = correlation.unstack().sort_values(ascending=False)
sorted_correlation

age            age              1.000000
income         income           1.000000
loan_approved  loan_approved    1.000000
age            income           0.976187
income         age              0.976187
               loan_approved    0.939336
loan_approved  income           0.939336
age            loan_approved    0.866025
loan_approved  age              0.866025
dtype: float64
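
Since features that are highly correlated with each other are largely redundant, one of each such pair can be dropped. The sketch below is one common way to do this, not the notebook's own method; the helper name `find_redundant_features` and the 0.9 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def find_redundant_features(features: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Same toy values as above, without the target column
features = pd.DataFrame({'age': [1, 2, 3, 4, 5],
                         'income': [100, 200, 300, 600, 800]})

to_drop = find_redundant_features(features, threshold=0.9)
print("Redundant features:", to_drop)
```

Which member of a pair to drop is a judgment call; this sketch simply keeps the column that appears first.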

#  3.  SelectKBest

# Use the iris dataset and select the best 2 features with SelectKBest

In [19]:
import pandas as pd
from sklearn.datasets import load_iris

In [3]:
data=load_iris()

In [2]:
from sklearn.feature_selection import SelectKBest,f_classif

In [4]:
data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
        ...

In [5]:
dir(data)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [6]:
x,y=data.data,data.target

In [7]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [8]:
data.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       ...

In [9]:
df=pd.DataFrame(data.data,columns=data.feature_names)

In [10]:
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [11]:
df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [12]:
df['target']=data.target

In [13]:
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [14]:
k_best=SelectKBest(score_func=f_classif,k=2)

In [15]:
k_best.fit_transform(x,y)

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1],
       [1.5, 0.2],
       [1.6, 0.2],
       [1.4, 0.1],
       [1.1, 0.1],
       [1.2, 0.2],
       [1.5, 0.4],
       [1.3, 0.4],
       [1.4, 0.3],
       [1.7, 0.3],
       [1.5, 0.3],
       [1.7, 0.2],
       [1.5, 0.4],
       [1. , 0.2],
       [1.7, 0.5],
       [1.9, 0.2],
       [1.6, 0.2],
       [1.6, 0.4],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.6, 0.2],
       [1.6, 0.2],
       [1.5, 0.4],
       [1.5, 0.1],
       [1.4, 0.2],
       [1.5, 0.2],
       [1.2, 0.2],
       [1.3, 0.2],
       [1.4, 0.1],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.3, 0.3],
       [1.3, 0.3],
       [1.3, 0.2],
       [1.6, 0.6],
       [1.9, 0.4],
       [1.4, 0.3],
       [1.6, 0.2],
       [1.4, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [4.7, 1.4],
       [4.5, 1.5],
       ...

In [16]:
select_indices=k_best.get_support(indices=True)

In [17]:
select_features=df.columns[select_indices]

In [18]:
select_features # these two are the best features in the iris dataset

Index(['petal length (cm)', 'petal width (cm)'], dtype='object')
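
To see why these two features win, the fitted selector's `scores_` and `pvalues_` attributes can be listed next to the feature names. This is a small sketch that refits the selector on the iris data; the `ranking` table is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
selector = SelectKBest(score_func=f_classif, k=2).fit(iris.data, iris.target)

# One row per feature: its F-score, p-value, and whether it was kept
ranking = pd.DataFrame({
    "feature": iris.feature_names,
    "f_score": selector.scores_,
    "p_value": selector.pvalues_,
    "selected": selector.get_support(),
}).sort_values("f_score", ascending=False)

print(ranking)
```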

# 4. Chi-Square Test

####  Evaluates the independence of categorical features with respect to the target variable. A higher chi-square score indicates a stronger relationship with the target.
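
A tiny made-up example helps to read the scores. Here one feature copies the target and the other is unrelated to it; both columns are assumptions for illustration. Note that scikit-learn's `chi2` expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Binary target and two binary features
target = np.array([0, 0, 0, 0, 1, 1, 1, 1])
related = target.copy()                          # identical to the target
unrelated = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # evenly spread over both classes

X_toy = np.column_stack([related, unrelated])
chi2_scores, p_values = chi2(X_toy, target)

print("chi2 scores:", chi2_scores)
print("p-values   :", p_values)
```

The related feature gets the higher score and the smaller p-value, while the unrelated one scores zero.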


In [27]:
import seaborn as sns
df=sns.load_dataset('titanic')

In [28]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [29]:
## Keep the categorical features and the target; copy so later assignments modify an independent frame
df=df[['sex','embarked','alone','pclass','survived']].copy()
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


In [30]:
import numpy as np
df['sex']=np.where(df['sex']=='male',1,0)

In [31]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,S,False,3,0
1,0,C,False,1,1
2,0,S,True,3,1
3,0,S,False,1,1
4,1,S,True,3,0


In [32]:
### 'sex' was encoded in the previous cell
### Now label-encode 'embarked': each distinct value gets an integer in order of first appearance
ordinal_label = {k: i for i, k in enumerate(df['embarked'].unique(), 0)}
df['embarked'] = df['embarked'].map(ordinal_label)

In [33]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,False,3,0
1,0,1,False,1,1
2,0,0,True,3,1
3,0,0,False,1,1
4,1,0,True,3,0


In [34]:
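### Caution: 'alone' is a boolean column, so comparing it with the string 'True' never matches
### and every row becomes 0 (this is why its chi2 p-value further down is NaN).
### df['alone'].astype(int) would encode it correctly.
df['alone']=np.where(df['alone']=='True',1,0)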

In [35]:
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,0,3,1
3,0,0,0,1,1
4,1,0,0,3,0


In [36]:
### A train/test split is usually done to avoid overfitting; the chi2 scores are then computed on the training set only
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df[['sex','embarked','alone','pclass']],
                                              df['survived'],test_size=0.3,random_state=100)

In [37]:
X_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

In [38]:
## Perform the chi2 test
### chi2 returns two arrays:
### the chi-square statistics and the p-values
from sklearn.feature_selection import chi2
f_p_values=chi2(X_train,y_train)

In [39]:
import pandas as pd
p_values=pd.Series(f_p_values[1])
p_values.index=X_train.columns
p_values

sex         5.306038e-16
embarked    5.999221e-03
alone                NaN
pclass      2.755149e-06
dtype: float64

In [40]:
# Sort by p-value (smallest first); NaN entries are placed last
p_values.sort_values(ascending=True)

sex         5.306038e-16
pclass      2.755149e-06
embarked    5.999221e-03
alone                NaN
dtype: float64

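# 5. Information Gain: select the best three features

#### The cells below score the features with `chi2` through `SelectKBest` (k=3). Information gain itself is equivalent to mutual information, which scikit-learn exposes as `mutual_info_classif` (see Mutual Information).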

In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

In [43]:
df=pd.read_csv('titanic.csv')

In [44]:
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [45]:
df.isnull().sum()

PassengerId      0
Name             0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Survived         0
dtype: int64

In [46]:
# Copy the selection so the assignments below modify an independent frame
df=df[['Sex','Embarked','SibSp','Pclass','Survived']].copy()

In [47]:
df.head()

Unnamed: 0,Sex,Embarked,SibSp,Pclass,Survived
0,male,S,1,3,0
1,female,C,1,1,1
2,female,S,0,3,1
3,female,S,1,1,1
4,male,S,0,3,0


In [48]:
# Encode Sex as a single 0/1 column (1 = male)
df['Sex']=pd.get_dummies(df['Sex'],drop_first=True)['male'].astype(int)
df.head()

Unnamed: 0,Sex,Embarked,SibSp,Pclass,Survived
0,1,S,1,3,0
1,0,C,1,1,1
2,0,S,0,3,1
3,0,S,1,1,1
4,1,S,0,3,0


In [49]:
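# Embarked is still a string column here, so restrict the correlation to numeric columns
df.corr(numeric_only=True)['Survived']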

Sex        -0.543351
SibSp      -0.035322
Pclass     -0.338481
Survived    1.000000
Name: Survived, dtype: float64

In [50]:
df.isnull().sum()
df['Embarked']=df['Embarked'].fillna(df.Embarked.mode()[0])

In [51]:
df['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [53]:
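# drop_first=True leaves two indicator columns (Q and S); only the Q indicator is stored here,
# so both S and C end up as 0
df['Embarked']=pd.get_dummies(df['Embarked'],drop_first=True)['Q'].astype(int)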

In [54]:
df.head()

Unnamed: 0,Sex,Embarked,SibSp,Pclass,Survived
0,1,0,1,3,0
1,0,0,1,1,1
2,0,0,0,3,1
3,0,0,1,1,1
4,1,0,0,3,0


In [55]:
co=df.corr()['Survived']

In [56]:
co.sort_values(ascending=True)

Sex        -0.543351
Pclass     -0.338481
SibSp      -0.035322
Embarked    0.003650
Survived    1.000000
Name: Survived, dtype: float64

In [57]:
x=df[['Sex','Embarked','SibSp','Pclass']]
y=df['Survived']

In [58]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=55)

In [59]:
k_best=SelectKBest(k=3,score_func=chi2)

In [60]:
x_new=k_best.fit_transform(x,y)

In [61]:
selected_feature_indices=k_best.get_support(indices=True)

In [62]:
selected_feature_indices

array([0, 2, 3], dtype=int64)

In [63]:
selected_feature=x.columns[selected_feature_indices]

In [64]:
selected_feature

Index(['Sex', 'SibSp', 'Pclass'], dtype='object')
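
For a discrete feature, information gain can also be computed directly as the drop in target entropy after splitting on that feature. The helpers `entropy` and `information_gain` below are hypothetical names written for this sketch, and the toy columns are made up; they are not part of scikit-learn or of the Titanic data.

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a discrete label column."""
    probs = labels.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(feature: pd.Series, target: pd.Series) -> float:
    """H(target) minus the weighted entropy of target within each feature value."""
    weights = feature.value_counts(normalize=True)
    conditional = sum(w * entropy(target[feature == v]) for v, w in weights.items())
    return entropy(target) - conditional

toy = pd.DataFrame({
    "sex":      [1, 1, 1, 1, 0, 0, 0, 0],   # separates the target perfectly
    "noise":    [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated to the target
    "survived": [0, 0, 0, 0, 1, 1, 1, 1],
})

ig_sex = information_gain(toy["sex"], toy["survived"])
ig_noise = information_gain(toy["noise"], toy["survived"])
print("IG(sex)  =", ig_sex)
print("IG(noise)=", ig_noise)
```

A feature that determines the target recovers the full target entropy (1 bit here), while an unrelated feature gains nothing.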



# 6. ANOVA (Analysis of Variance)

##### Used for continuous features. It tests whether the mean of a feature differs across the groups defined by a categorical target variable; a larger F-score means the between-group difference is large relative to the spread within each group.
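
The F-score that `f_classif` reports for a feature is the one-way ANOVA statistic computed over the target classes. As a rough check (a sketch, using `scipy.stats.f_oneway` as an independent reference), the petal length column of iris can be split by class and tested by hand:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Split one feature (petal length, column 2) into one group per class
petal_length = X_iris[:, 2]
groups = [petal_length[y_iris == label] for label in np.unique(y_iris)]

f_manual, p_manual = f_oneway(*groups)
f_sklearn, p_sklearn = f_classif(X_iris, y_iris)

print("f_oneway F-score :", f_manual)
print("f_classif F-score:", f_sklearn[2])
```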

In [65]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

In [66]:
data = load_iris()
X, y = data.data, data.target

In [67]:
k = 2  # Number of top features to select
anova_selector = SelectKBest(score_func=f_classif, k=k)
X_new = anova_selector.fit_transform(X, y)

In [68]:
feature_scores = anova_selector.scores_
selected_features_indices = anova_selector.get_support(indices=True)

In [69]:
print("Selected features shape:", X_new.shape)
print("ANOVA F-test scores for all features:", feature_scores)
print("Indices of selected features:", selected_features_indices)

Selected features shape: (150, 2)
ANOVA F-test scores for all features: [ 119.26450218   49.16004009 1180.16118225  960.0071468 ]
Indices of selected features: [2 3]


# 7. Mutual Information

#### Measures the amount of information obtained about one variable through another. It works well for both categorical and continuous features.

#### Mutual Information (MI) is a measure of the mutual dependence between two variables. In the context of feature selection for machine learning, it helps in identifying the relevance of input features with respect to the target variable. Here, we will use the Titanic dataset to implement mutual information for feature selection.
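
Unlike a correlation coefficient, mutual information also picks up dependence that is not monotonic. The sketch below uses synthetic data (the variables `x_signal`, `x_noise`, and the rule defining `target` are assumptions for illustration) where the target depends on the magnitude of a feature, so the linear correlation is close to zero while the MI score is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x_signal = rng.uniform(-1, 1, size=1000)   # drives the target through its magnitude
x_noise = rng.uniform(-1, 1, size=1000)    # independent of the target
target = (np.abs(x_signal) > 0.5).astype(int)

# Linear correlation misses the symmetric relationship
pearson_r = np.corrcoef(x_signal, target)[0, 1]

# Mutual information is estimated per feature; random_state fixes the estimator's noise
mi = mutual_info_classif(np.column_stack([x_signal, x_noise]), target, random_state=0)

print("Pearson r (signal):", round(pearson_r, 3))
print("MI (signal, noise):", mi.round(3))
```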

In [70]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [79]:
titanic_data=pd.read_csv('titanic.csv')

In [80]:
titanic_data

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1000,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,"Montvila, Rev. Juozas",2,male,27.0,0,0,211536,13.0000,,S,0
887,888,"Graham, Miss. Margaret Edith",1,female,19.0,0,0,112053,30.0000,B42,S,1
888,889,"Johnston, Miss. Catherine Helen ""Carrie""",3,female,,1,2,W./C. 6607,23.4500,,S,0
889,890,"Behr, Mr. Karl Howell",1,male,26.0,0,0,111369,30.0000,C148,C,1


In [81]:
titanic_data = titanic_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)

In [82]:
# Assign the result back instead of using inplace=True on a column selection
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

In [83]:
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)

In [84]:
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

In [85]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [86]:
mi_scores = mutual_info_classif(X_scaled, y, discrete_features='auto')

In [87]:
mi_scores_df = pd.DataFrame({'Feature': X.columns, 'MI Score': mi_scores})
mi_scores_df = mi_scores_df.sort_values(by='MI Score', ascending=False)

In [88]:
print(mi_scores_df)

       Feature  MI Score
6     Sex_male  0.181429
5         Fare  0.147400
1       Pclass  0.044179
3        SibSp  0.038003
4        Parch  0.019471
0  PassengerId  0.013811
2          Age  0.005620
7   Embarked_Q  0.000000
8   Embarked_S  0.000000


# 8. Univariate Selection

#### Univariate selection evaluates each feature individually with a statistical test of its relationship to the target variable. In scikit-learn, `SelectKBest` implements this by keeping the top k features ranked by the chosen test (for example `f_classif` or `chi2`).
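
When the selected features feed a model, placing `SelectKBest` inside a `Pipeline` makes the selection refit on each training fold, so the held-out data never influences which features are kept. This is a minimal sketch on the iris data; the choice of `LogisticRegression`, `k=2`, and 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X_iris, y_iris = load_iris(return_X_y=True)

# Selection and model are fitted together on each training fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```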

In [93]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, chi2
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [94]:
titanic_data=pd.read_csv('titanic.csv')

In [95]:
columns_to_drop = ['Name', 'Ticket', 'Cabin']
titanic_data = titanic_data.drop(columns=[col for col in columns_to_drop if col in titanic_data.columns])

In [96]:
# Assign the result back instead of using inplace=True on a column selection
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].mean())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

In [97]:
titanic_data = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)

In [98]:
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

In [99]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [100]:
k = 5  # Number of top features to select
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X_scaled, y)

In [101]:
selected_features = X.columns[selector.get_support()]
feature_scores = selector.scores_[selector.get_support()]


In [102]:
selected_features_df = pd.DataFrame({'Feature': selected_features, 'Score': feature_scores})
selected_features_df = selected_features_df.sort_values(by='Score', ascending=False)

In [103]:
print(selected_features_df)

      Feature       Score
3    Sex_male  372.405724
0      Pclass  115.031272
2        Fare   63.030764
4  Embarked_S   20.374460
1       Parch    5.963464
