Statistical Test: For each feature, a statistical test is performed to evaluate its relationship with the target variable. The type of test depends on the type of the target variable:

For classification tasks (where the target is categorical), tests like Chi-Square (for categorical features) or ANOVA F-test (for continuous features) are commonly used.
For regression tasks (where the target is continuous), the Pearson correlation coefficient (which captures linear relationships) or mutual information (which can also capture non-linear ones) can be used.
Rank Features: After the statistical tests, features are ranked based on their scores. Higher scores indicate a stronger relationship with the target variable.

Select Features: A cutoff is chosen, either a fixed number of top features or a minimum score, and only the features that meet it are kept for model training.
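The three steps above can be sketched by hand before reaching for a ready-made selector. This is a minimal illustration on synthetic data, not the diabetes dataset: the column names (strong, weak, noise), the sample size, and the score cutoff of 10.0 are all assumptions made up for the example. It uses scikit-learn's f_classif (the ANOVA F-test) because the features are continuous and the target is categorical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Synthetic data (hypothetical): a binary target and three continuous features
rng = np.random.default_rng(42)
n = 300
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "strong": rng.normal(loc=2.0 * y, scale=1.0),  # class means far apart
    "weak": rng.normal(loc=0.5 * y, scale=1.0),    # class means slightly apart
    "noise": rng.normal(size=n),                   # unrelated to the target
})

# Step 1: statistical test, one ANOVA F-score (and p-value) per feature
f_scores, p_values = f_classif(X, y)

# Step 2: rank features, higher score first
ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

# Step 3a: select a fixed number of top features
top_k = ranking.head(2).index.tolist()

# Step 3b: or select by a score cutoff (10.0 is an arbitrary value for illustration)
above_cutoff = ranking[ranking > 10.0].index.tolist()

print(ranking)
print("Top 2:", top_k)
print("Above cutoff:", above_cutoff)
```

For a regression target, f_regression and mutual_info_regression from sklearn.feature_selection can take the place of f_classif in step 1; the ranking and selection steps stay the same.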

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

import os

In [2]:
os.chdir("D:\\ML practice datasets")

In [3]:
df = pd.read_csv("diabetes.csv")
df.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


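SelectKBest: This scikit-learn transformer keeps the k features with the highest scores, as computed by the scoring function passed in score_func.
chi2: This is the scoring function used here. It computes a chi-square statistic between each feature and the target variable, and it requires non-negative feature values (such as counts or booleans) and a categorical target. Because the statistic is built from the raw feature sums, a column's score grows with its scale, which is worth keeping in mind when comparing columns measured in very different units.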
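To make the chi2 score less of a black box, here is a small sketch that recomputes it by hand and compares the result with scikit-learn. The tiny count matrix and labels are made up for the example. For each feature, the observed value is the sum of that feature within each class, and the expected value is what that sum would be if the feature total were split in proportion to the class frequencies.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up non-negative count features (4 samples, 3 features) and binary labels
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 1],
              [1, 5, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# One-hot encode the classes: shape (n_samples, n_classes)
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

# Observed: per-class sum of every feature, shape (n_classes, n_features)
observed = Y.T @ X

# Expected: feature totals split according to the class proportions
feature_total = X.sum(axis=0)
class_prob = Y.mean(axis=0)
expected = np.outer(class_prob, feature_total)

# Chi-square statistic per feature: sum over classes of (O - E)^2 / E
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# scikit-learn's chi2 returns the statistics and their p-values
sk_scores, p_values = chi2(X, y)

print("manual :", manual_scores)
print("sklearn:", sk_scores)
```

The first two columns are concentrated in one class each and get positive scores, while the third column is spread evenly across both classes and scores zero.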

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Score every feature against the target with the chi-square test; k=4 keeps the top 4
bestfeatures = SelectKBest(score_func=chi2, k=4)
fit = bestfeatures.fit(X, y)

# Pair each feature name with its chi-square score
feature_scores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
print(feature_scores.nlargest(4, 'Score'))  # print 4 best features



   Feature        Score
4  Insulin  2175.565273
1  Glucose  1411.887041
7      Age   181.303689
5      BMI   127.669343
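Printing the scores shows which features rank highest, but the fitted selector can also return the reduced dataset directly. The sketch below is one way to do that, assuming synthetic count data in place of diabetes.csv: the column names (signal_a, signal_b, noise_a, noise_b) and the split parameters are invented for the example. It fits the selector on the training split only, so the test rows play no part in choosing the features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Synthetic non-negative count data (hypothetical columns, not the diabetes data)
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "signal_a": rng.poisson(lam=2 + 6 * y),   # rate depends strongly on the class
    "signal_b": rng.poisson(lam=3 + 3 * y),   # rate depends on the class
    "noise_a": rng.poisson(lam=4, size=n),    # same rate in both classes
    "noise_b": rng.poisson(lam=4, size=n),    # same rate in both classes
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the selector on the training split only
selector = SelectKBest(score_func=chi2, k=2).fit(X_train, y_train)

# get_support() returns a boolean mask over the input columns
selected = X_train.columns[selector.get_support()]

# Apply the same column choice to both splits
X_train_sel = X_train[selected]
X_test_sel = X_test[selected]

print("Selected features:", list(selected))
print(X_train_sel.shape, X_test_sel.shape)
```

Indexing with the column names keeps the result as a DataFrame; selector.transform(X_test) would apply the same selection through the selector itself.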
