The goal of this project is to study feature selection methods for effective machine learning model training.

Let's generate data, build a logistic regression model, and evaluate the average accuracy.

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
from sklearn.datasets import make_classification

In [None]:
x_data_generated, y_data_generated = make_classification(scale=1)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(LogisticRegression(), x_data_generated, y_data_generated, scoring='accuracy').mean()

0.76

I will use statistical methods for feature selection: select features based on the correlation matrix, remove low-variance features using VarianceThreshold, then build the logistic regression model again and evaluate the average accuracy.

In [None]:
# Pass only the feature matrix: the second positional argument of pd.DataFrame is the index, not the target.
data = pd.DataFrame(x_data_generated)

In [None]:
data.shape

(100, 20)

In [None]:
corr = data.corr()

In [None]:
corr.style.background_gradient(cmap='RdYlGn')

feature,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,1.0,0.060049,0.038222,0.065344,0.162921,-0.184014,-0.012553,-0.12727,0.105779,0.002115,0.080021,0.052013,-0.026934,-0.038838,0.096511,0.051789,-0.027978,-0.060304,0.075137,0.073765
1,0.060049,1.0,-0.132698,-0.09377,0.168758,-0.076571,0.04905,0.097316,0.027742,0.127665,-0.020312,-0.123894,-0.184228,0.107929,0.014269,0.054378,-0.103446,0.099394,0.025584,0.118083
2,0.038222,-0.132698,1.0,-0.045122,0.112697,-0.071741,0.078896,-0.115383,0.070748,-0.139404,-0.015354,0.077814,-0.082135,-0.124304,-0.135378,0.102673,0.063235,0.051747,0.076876,0.08555
3,0.065344,-0.09377,-0.045122,1.0,0.14709,-0.006924,-0.127268,-0.039623,-0.100963,0.200769,0.033491,-0.090065,-0.198557,-0.025956,-0.029308,-0.157963,-0.048419,-0.274255,0.140142,-0.076963
4,0.162921,0.168758,0.112697,0.14709,1.0,0.037544,-0.079786,-0.291851,-0.051106,0.112206,0.029076,0.049959,-0.256341,0.011919,0.082976,-0.091933,-0.005678,-0.073995,-0.077312,0.046621
5,-0.184014,-0.076571,-0.071741,-0.006924,0.037544,1.0,-0.071562,-0.205127,0.013573,0.036082,0.06546,-0.11716,0.029221,-0.020173,0.179775,-0.047875,0.058298,-0.163573,0.02276,-0.029259
6,-0.012553,0.04905,0.078896,-0.127268,-0.079786,-0.071562,1.0,-0.16761,0.059185,-0.066809,-0.749775,-0.052965,0.003301,0.060168,0.133328,0.81385,-0.101642,0.098735,-0.013932,0.017641
7,-0.12727,0.097316,-0.115383,-0.039623,-0.291851,-0.205127,-0.16761,1.0,0.058655,0.088768,0.171125,-0.117349,0.023935,0.071926,-0.019313,-0.096492,0.040382,-0.019897,0.133483,-0.080199
8,0.105779,0.027742,0.070748,-0.100963,-0.051106,0.013573,0.059185,0.058655,1.0,0.025637,0.616158,-0.084419,0.038299,-0.009578,0.114451,0.628224,-0.021788,-0.106855,0.045951,0.013553
9,0.002115,0.127665,-0.139404,0.200769,0.112206,0.036082,-0.066809,0.088768,0.025637,1.0,0.069706,-0.049856,-0.106012,-0.041093,0.060134,-0.037148,0.021766,0.097349,0.116861,0.03193


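Among the rows shown, the strongest correlations are between features 6 and 15 (0.81), 6 and 10 (-0.75), 8 and 15 (0.63), and 8 and 10 (0.62). Features 3 and 9 correlate only weakly (0.20). I think we can remove features 15 and 8: that breaks up three of these four strongly correlated pairs.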
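Reading pairs off a 20x20 heatmap is error-prone, so here is a minimal sketch of listing them programmatically. The `random_state=0`, the 0.6 cutoff, and the names `frame`, `pairs` and `strong_pairs` are my assumptions for illustration. The notebook itself does not fix a seed, so the pairs found here will differ from the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Assumption: a fixed seed so the sketch is repeatable (the notebook does not set one).
X, _ = make_classification(random_state=0)
frame = pd.DataFrame(X)

corr_abs = frame.corr().abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair is listed once.
upper_mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
pairs = (
    corr_abs.where(upper_mask)
    .stack()
    .dropna()
    .sort_values(ascending=False)
)

cutoff = 0.6  # hypothetical cutoff for "strongly correlated"
strong_pairs = pairs[pairs > cutoff]
print(strong_pairs)
```

Each entry of `strong_pairs` is indexed by a pair of feature numbers, so the candidates for removal can be read directly from the index.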

In [None]:
data1 = data.drop(columns=[8, 15])

In [None]:
data1.shape

(100, 18)

In [None]:
x_cor = data1

In [None]:
cross_val_score(LogisticRegression(), x_cor, y_data_generated, scoring='accuracy').mean()

0.77

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
selector = VarianceThreshold(threshold=1)

In [None]:
data2 = selector.fit_transform(data1)

In [None]:
data2.shape

(100, 9)

In [None]:
data2 = pd.DataFrame(data2)

In [None]:
x = data2

In [None]:
y = y_data_generated

In [None]:
cross_val_score(LogisticRegression(), x, y, scoring='accuracy').mean()

0.7500000000000001
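To see why `threshold=1` discards about half of the columns, it helps to look at the per-feature variances. As far as I know, the uninformative features from `make_classification` are standard-normal noise, so their sample variance sits close to 1 and the cut falls almost at random among them. The sketch below assumes `random_state=0` and works on the full 20-feature matrix rather than on `data1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# Assumption: a fixed seed for repeatability; the notebook does not set one.
X, _ = make_classification(random_state=0)

selector = VarianceThreshold(threshold=1).fit(X)
variances = selector.variances_               # per-feature variance seen during fit
kept = selector.get_support(indices=True)     # indices of features with variance > threshold

for index, variance in enumerate(variances):
    status = "kept" if index in kept else "dropped"
    print(f"feature {index:2d}: variance = {variance:.3f} ({status})")
```

Variance says nothing about the relationship to the target, so a feature can be dropped here even when it is informative.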

I will perform feature selection based on analysis of variance: select the top 5 features using the scoring function for classification f_classif (SelectKBest(f_classif, k=5)), then build the logistic regression model again and evaluate the average accuracy.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
cross_val_score(LogisticRegression(), SelectKBest(f_classif, k=5).fit_transform(x_data_generated, y_data_generated), y_data_generated, scoring="accuracy").mean()

0.79
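One caveat: in the cell above, `SelectKBest` is fitted on the whole dataset before cross-validation, so each validation fold has already influenced which features were chosen, and the score may be optimistic. A minimal sketch of the alternative wraps the selector and the model in a pipeline so that selection is refitted inside every fold. The `random_state=0` and the names `pipe` and `scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: a fixed seed for repeatability; the notebook does not set one.
X, y = make_classification(random_state=0)

# The selector is fitted on the training part of each fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression())
scores = cross_val_score(pipe, X, y, scoring="accuracy")
print(scores.mean())
```

The same pattern applies to the other selectors in this notebook, since they are all fitted on the full data before scoring.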

I will perform feature selection with SelectFromModel, first using a logistic regression with L1 regularization as the selecting estimator; the selected features are then used as input to the logistic regression model. Next, I will perform feature selection using a RandomForest model and its built-in feature_importances_ attribute, then build the logistic regression model again and evaluate the average accuracy.

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = LogisticRegression(penalty='l1', solver='liblinear')

In [None]:
model.fit(x_data_generated, y_data_generated)

In [None]:
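# prefit=True tells SelectFromModel that the estimator is already fitted, so transform can be called directly.
selection_mlr = SelectFromModel(model, prefit=True)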

In [None]:
x_selected_mlr = selection_mlr.transform(x_data_generated)

In [None]:
x_selected_mlr.shape

(100, 13)

In [None]:
cross_val_score(LogisticRegression(), x_selected_mlr, y_data_generated, scoring='accuracy').mean()

0.7700000000000001
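L1 regularization drives some coefficients to exactly zero, and `SelectFromModel` keeps the features whose absolute coefficient reaches its threshold. The sketch below makes that visible by comparing the selector's mask with the coefficients. It assumes `random_state=0` and, unlike the cells above, lets `SelectFromModel` fit the estimator itself so that the `estimator_` and `threshold_` attributes are available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Assumption: a fixed seed for repeatability; the notebook does not set one.
X, y = make_classification(random_state=0)

# Here the selector fits the L1-penalized model itself.
selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))
selector.fit(X, y)

abs_coef = np.abs(selector.estimator_.coef_).ravel()  # one coefficient per feature
mask = selector.get_support()                         # True for kept features

print("threshold:", selector.threshold_)
print("kept features:", np.flatnonzero(mask))
print("zeroed coefficients:", int((abs_coef == 0).sum()))
```

How many features survive depends on the regularization strength `C`, which is left at its default here as in the notebook.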

In [None]:
model_rf = RandomForestClassifier()

In [None]:
model_rf.fit(x_data_generated, y_data_generated)

In [None]:
feature_importances = model_rf.feature_importances_

In [None]:
threshold = np.mean(feature_importances)

In [None]:
# prefit=True marks model_rf as already fitted, so transform can be called directly.
feature_selection_model_rf = SelectFromModel(model_rf, threshold=threshold, prefit=True)

In [None]:
x_selected_rf = feature_selection_model_rf.transform(x_data_generated)

In [None]:
x_selected_rf.shape

(100, 4)

In [None]:
cross_val_score(LogisticRegression(), x_selected_rf, y_data_generated, scoring='accuracy').mean()

0.8400000000000001
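The mean-importance threshold can be checked by hand: a feature is kept when its importance is at least the mean of all importances. The sketch below assumes `random_state=0` for both the data and the forest (neither is seeded in the notebook, so the selected set and the score above can change between runs), and the name `selected_by_hand` is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: fixed seeds for repeatability; the notebook does not set them.
X, y = make_classification(random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

importances = forest.feature_importances_
mean_importance = importances.mean()

# Manual selection: importance at or above the mean.
selected_by_hand = np.flatnonzero(importances >= mean_importance)

# The same rule through SelectFromModel on the already fitted forest.
selector = SelectFromModel(forest, prefit=True, threshold="mean")
selected_by_selector = selector.get_support(indices=True)

print("mean importance:", round(mean_importance, 4))
print("selected features:", selected_by_selector)
```

Because the importances sum to 1, the mean over 20 features is 0.05, so only features well above an even share are kept.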

I will perform feature selection using SequentialFeatureSelector, then build the logistic regression model again and evaluate the average accuracy.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector

In [None]:
sfs_forward = SequentialFeatureSelector(
    RandomForestClassifier(), n_features_to_select=10, direction="forward"
)
sfs_forward.fit(x_data_generated, y_data_generated)

In [None]:
x_tr = sfs_forward.transform(x_data_generated)

In [None]:
x_tr.shape

(100, 10)

In [None]:
cross_val_score(LogisticRegression(), x_tr, y_data_generated, scoring='accuracy').mean()

0.78
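The cells above pick features with a random forest but score them with logistic regression, so the chosen subset is tuned to a different model than the one being evaluated. As a variation, the sketch below swaps in `LogisticRegression` as the selecting estimator. This is my substitution, not the notebook's method, and `random_state=0` is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: a fixed seed for repeatability; the notebook does not set one.
X, y = make_classification(random_state=0)

# Forward selection: start from no features and greedily add the one that
# improves the cross-validated score of the estimator the most.
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=10, direction="forward"
)
sfs.fit(X, y)

selected_columns = sfs.get_support(indices=True)
X_selected = sfs.transform(X)

scores = cross_val_score(LogisticRegression(), X_selected, y, scoring="accuracy")
print("selected features:", selected_columns)
print("mean accuracy:", scores.mean())
```

Logistic regression also fits much faster than a forest, which matters because forward selection refits the estimator for every candidate feature at every step.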

To summarize the results, we will create a table with the following format:

| Feature Selection Method | Number of Features | Average Model Accuracy |

In [None]:
itog = pd.DataFrame(columns=["Feature Selection Method", "Number of Features", "Average Model Accuracy"])

In [None]:
# Each row: method, number of features kept, mean cross-validated accuracy from the cells above.
itog.loc[len(itog)]=['Original (no selection)', 20, 0.76]
itog.loc[len(itog)]=['Correlation matrix', 18, 0.77]
itog.loc[len(itog)]=['VarianceThreshold', 9, 0.75]
itog.loc[len(itog)]=['SelectKBest', 5, 0.79]
itog.loc[len(itog)]=['Logistic regression SelectFromModel', 13, 0.77]
itog.loc[len(itog)]=['RandomForestClassifier SelectFromModel', 4, 0.84]
itog.loc[len(itog)]=['SequentialFeatureSelector', 10, 0.78]

In [None]:
itog

| | Feature Selection Method | Number of Features | Average Model Accuracy |
|---|---|---|---|
| 0 | Original (no selection) | 20 | 0.76 |
| 1 | Correlation matrix | 18 | 0.77 |
| 2 | VarianceThreshold | 9 | 0.75 |
| 3 | SelectKBest | 5 | 0.79 |
| 4 | Logistic regression SelectFromModel | 13 | 0.77 |
| 5 | RandomForestClassifier SelectFromModel | 4 | 0.84 |
| 6 | SequentialFeatureSelector | 10 | 0.78 |
