
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
fedesoriano_company_bankruptcy_prediction_path = kagglehub.dataset_download('fedesoriano/company-bankruptcy-prediction')

print('Data source import complete.')


I am presenting some feature selection methods that can be used in classification and regression problems when a dataset has too many columns. Feature selection is important because it can:
* Enhance model performance
* Reduce overfitting
* Improve model interpretability
* Speed up model training


![Screenshot 2025-02-25 alle 11.32.08.jpg](attachment:9f06b297-b366-4032-9439-5b3b56c9f9cc.jpg)

---
### SUMMARY
1. [Feature Selection with Feature Importance](#1)
2. [Feature Selection with Mutual Information Score](#2)
3. [Feature Selection with Correlation Heatmap](#3)
4. [Feature Selection with Variance Threshold](#4)
5. [Sequential Feature Selection](#5)
6. [Classification with a Random Forest Classifier](#6)
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score

from warnings import simplefilter
simplefilter("ignore")

# 1. Feature Selection with Feature Importance
<a id="1"></a>

## 1.1 Brief EDA

Here, I am using a dataset with a very large number of columns. It describes the financial situation of companies and their bankruptcy status: 0 stands for no bankruptcy, 1 for bankruptcy.

In [None]:
data1 = pd.read_csv('/kaggle/input/company-bankruptcy-prediction/data.csv')

data1.head()

In [None]:
print(f'The dataset has {data1.shape[0]} rows and {data1.shape[1]} columns.')

In [None]:
print(f'The dataset has {data1.isna().sum().sum()} null values.')

print()

print(f'The dataset has {data1.duplicated().sum()} duplicate rows.')

In [None]:
cat_cols = [col for col in data1.columns if data1[col].dtypes == 'O']

print(f'There are {len(cat_cols)} categorical columns in the dataset.')

In [None]:
binary_cols = []

for col in data1.columns:
    if data1[col].nunique() == 2:
        binary_cols.append(col)

print(f'There are {len(binary_cols)} binary columns.')
print(f'They are: {binary_cols}.')

'Bankrupt?' is the target variable.

More details on the dataset, including a complete exploratory data analysis, can be found in **Ref. 1**.

Before carrying out classification, it is necessary to reduce the number of columns by dropping those that are unnecessary and/or redundant. To do that, I will make use of feature importance calculated with a random forest classifier.

## 1.2 Feature Selection with Feature Importance

First, I am defining *X* and *y* ...

In [None]:
X = data1.drop('Bankrupt?', axis=1)
y = data1['Bankrupt?']

... then, I am computing the feature importances by means of a random forest classifier.

In [None]:
# Random Forest Model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

# To sort the index in descending order, I multiply 'rf.feature_importances_' by -1
sorted_idx = (-rf.feature_importances_).argsort()

list_of_tuples = list(zip(X.columns[sorted_idx],
                          rf.feature_importances_[sorted_idx]))

feat_importance = pd.DataFrame(list_of_tuples,
                  columns=['feature', 'feature importance'])

##################

fig = plt.figure(figsize=(12,8))

fig = sns.barplot(data=feat_importance[feat_importance['feature importance'] > 0.015], x='feature', y='feature importance')
plt.title('Feature Importance > 0.015',fontsize=25)
plt.xticks(fontsize=8,rotation=60)

plt.tight_layout()

Now, I can get a list of the features with importance greater than a given threshold, like 0.01 or 0.02 ...

In [None]:
col_001 = feat_importance[feat_importance['feature importance'] > 0.01]['feature'].to_list()

col_02 = feat_importance[feat_importance['feature importance'] > 0.02]['feature'].to_list()

print('Features with importance > 0.01: ')
print(col_001)
print()
print('Features with importance > 0.02: ')
print(col_02)

... or I can directly get the 'small' *X*.

In [None]:
X[col_02].head()

The same feature selection method can be applied to regression by substituting the RandomForestClassifier with a RandomForestRegressor.
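
As a minimal sketch of the regression variant, the block below repeats the ranking with a RandomForestRegressor. The synthetic data from `make_regression`, the column names `f0` ... `f7` and the 0.05 threshold are assumptions made for illustration only; they are not part of the bankruptcy dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): 8 features, only 3 of which drive the target.
X_arr, y_reg = make_regression(n_samples=300, n_features=8, n_informative=3,
                               noise=0.1, random_state=0)
X_reg = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_reg, y_reg)

# Rank the features by impurity-based importance, highest first.
reg_importance = (pd.Series(reg.feature_importances_, index=X_reg.columns)
                    .sort_values(ascending=False))

# Keep the features above a hypothetical threshold.
selected_reg = reg_importance[reg_importance > 0.05].index.to_list()

print(reg_importance.round(3))
print(selected_reg)
```

Apart from the estimator class, the steps are the same as in the classification case.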

## 1.3 Feature Selection with Permutation-Based Importance

The same can be done by using permutation-based importance.

In [None]:
perm_importance = permutation_importance(rf, X, y)

sorted_idx = (-perm_importance.importances_mean).argsort()

list_of_tuples  = list(zip(X.columns[sorted_idx],
                           perm_importance.importances_mean[sorted_idx]))

perm_importance = pd.DataFrame(list_of_tuples,
                  columns=['feature','permutation importance'])

print(perm_importance.head())

In [None]:
plt.figure(figsize=(12,8))

sns.barplot(data=perm_importance[perm_importance['permutation importance'] > 0.0005], x='feature', y='permutation importance')

plt.title('Permutation-Based Importances > 0.0005', fontsize=25)
plt.xlabel('feature', fontsize=15)
plt.xticks(fontsize=8, rotation=45)
plt.ylabel('permutation importance', fontsize=15)

plt.tight_layout()

In [None]:
col_0005 = perm_importance[perm_importance['permutation importance'] > 0.0005]['feature'].to_list()

print('Features with permutation importance > 0.0005: ')
col_0005
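
Permutation importance is often computed on data the model has not seen during training, so that it reflects what the model needs in order to generalize. The sketch below shows that variant under stated assumptions: the data are synthetic (`make_classification`), and the names `X_demo`, `model` and `perm_demo` are hypothetical. `n_repeats` and `random_state` are existing parameters of `permutation_importance`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X_arr, y_demo = make_classification(n_samples=400, n_features=6, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3,
                                            stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each validation column 10 times and record the drop in accuracy.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

perm_demo = pd.DataFrame({"feature": X_demo.columns,
                          "mean": result.importances_mean,
                          "std": result.importances_std}
                         ).sort_values("mean", ascending=False)
print(perm_demo)
```

The `std` column gives a rough idea of how stable each importance is across the repeated shuffles.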

# 2. Feature Selection with Mutual Information Score
<a id="2"></a>

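An alternative to feature importance is to rank the features with a feature utility metric, i.e. a function that measures the association between a feature and the target, and then keep a smaller set of the most useful features. Mutual information is such a metric: it is zero when the feature and the target are independent and grows with the strength of their dependence, whether linear or not.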
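
To illustrate what such a metric can capture, here is a small sketch on toy data (an assumption, unrelated to the bankruptcy dataset). The target depends on the absolute value of one feature, so the Pearson correlation is close to zero while the mutual information score is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
x_useful = rng.normal(size=n)                 # drives the target, but not linearly
x_noise = rng.normal(size=n)                  # unrelated to the target
y_toy = (np.abs(x_useful) > 1).astype(int)    # 1 when x_useful is far from zero

X_toy = np.column_stack([x_useful, x_noise])

mi = mutual_info_classif(X_toy, y_toy, random_state=0)
pearson = [np.corrcoef(X_toy[:, k], y_toy)[0, 1] for k in range(2)]

print("MI scores:", mi.round(3))
print("Pearson r:", np.round(pearson, 3))
```

A ranking based on correlation alone would discard `x_useful`, whereas the mutual information score places it first.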

In [None]:
discrete_features = X.dtypes == int

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features).reset_index()

mi_scores.head()

Below, I am plotting the mutual information scores > 0.025.

In [None]:
ax = plt.figure(figsize=(12,9))

ax = sns.barplot(data=mi_scores[mi_scores['MI Scores'] > 0.025], x='index', y='MI Scores')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, size=7)
ax.set_title('Mutual Information Scores > 0.025', size=30)

plt.tight_layout()

The names of the scores above a given threshold can be used to create a list of the most relevant features.

In [None]:
mi_scores_0025 = mi_scores[mi_scores['MI Scores'] > 0.025]['index'].to_list()

mi_scores_0025

To improve the list of features, one should investigate possible interactions among the features and also experiment with the threshold (here set at 0.025). Investigating the correlations between the features is very important and I will discuss it in the following sections.

This method can also be applied to regression problems, by substituting the 'mutual_info_classif' function with 'mutual_info_regression'.

# 3. Feature Selection with Correlation Heatmap
<a id="3"></a>

I am plotting the correlation heatmap of *X* for each set of selected features that I have obtained with the previous methods.

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(X[col_02].corr(method='pearson'),annot=True,fmt='.2f',annot_kws={"fontsize":8},cmap='Blues')
plt.title('Correlation heatmap',fontsize=30)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(X[col_0005].corr(method='pearson'),annot=True,fmt='.2f',annot_kws={"fontsize":8},cmap='Reds')
plt.title('Correlation heatmap',fontsize=30)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(X[mi_scores_0025].corr(method='pearson'),annot=True,fmt='.2f',annot_kws={"fontsize":6},cmap='Greens')
plt.title('Correlation heatmap',fontsize=30)

plt.tight_layout()
plt.show()

Now, I will drop the columns with *r* > 0.9. Since each value of *r* involves two different columns, I will keep only one column from each pair: the one with the higher score (feature importance or mutual information score, depending on the method).

In the case of the first heatmap, the blue one, where the number of highly correlated features is low, I can drop the extra features 'by hand'. Only two of them exhibit a very high correlation (*r* > 0.9): they are ' Net profit before tax/Paid-in capital' and ' Persistent EPS in the Last Four Seasons'. I am keeping only one of them, the one with the higher feature importance.

In [None]:
feat_importance[feat_importance['feature importance'] > 0.02]

In [None]:
col_02.remove(' Net profit before tax/Paid-in capital')

col_02

This is the resulting dataframe.

In [None]:
X[col_02]

In the third case, that of the green correlation heatmap, the number of highly correlated features is larger and thus this procedure should be automated. Let's try to do this.

In [None]:
rows, cols = X[mi_scores_0025].shape
flds = list(X[mi_scores_0025].columns)

corr = X[mi_scores_0025].corr().values

cols_to_drop_list = []

for i in range(cols):
    for j in range(i+1, cols):
        if corr[i,j] > 0.9:
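            # .iloc[0] extracts the scalar; calling float() on a one-element Series is deprecated in recent pandas
            mi_scores_i = mi_scores.loc[mi_scores['index'] == flds[i], 'MI Scores'].iloc[0]
            mi_scores_j = mi_scores.loc[mi_scores['index'] == flds[j], 'MI Scores'].iloc[0]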
            print(flds[i], ' ', flds[j], ' ', corr[i,j])
            if mi_scores_i > mi_scores_j:
                cols_to_drop_list.append(flds[j])
            else:
                cols_to_drop_list.append(flds[i])

cols_to_drop_list

In [None]:
mi_scores_0025_cut = [x for x in mi_scores_0025 if x not in cols_to_drop_list]

len(mi_scores_0025), len(mi_scores_0025_cut)

I have cut almost half of the columns in the 'mi_scores_0025' list.
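
The pairwise procedure above can also be wrapped in a reusable helper. The sketch below is a variant, not the exact loop used above: it visits the columns from best to worst score and keeps a column only if it is not highly correlated with one already kept. It also uses the absolute value of *r*, so strong negative correlations are treated as redundant too. The function name `drop_correlated` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, scores, threshold=0.9):
    """Return the columns of `df` to keep.

    `scores` is a Series indexed by column name (feature importance,
    MI score, ...). Among columns whose absolute Pearson correlation
    exceeds `threshold`, only the best-scoring one is kept.
    """
    corr = df.corr().abs()
    kept = []
    # Visit the columns from the highest to the lowest score.
    for col in scores.loc[df.columns].sort_values(ascending=False).index:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept

# Toy data (assumption): 'a_copy' is a near-duplicate of 'a'.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a,
                    "a_copy": a + 0.01 * rng.normal(size=200),
                    "b": rng.normal(size=200)})
toy_scores = pd.Series({"a": 0.30, "a_copy": 0.20, "b": 0.10})

print(drop_correlated(toy, toy_scores))   # ['a', 'b']
```

With the objects defined in this notebook, a call would look like `drop_correlated(X[mi_scores_0025], mi_scores.set_index('index')['MI Scores'])`.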

# 4. Feature Selection with Variance Threshold
<a id="4"></a>

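The variance threshold method removes all features whose variance does not exceed a given threshold, under the assumption that features with a higher variance may contain more useful information. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

For quasi-constant features, which have the same value in a very large fraction of the samples, the threshold has a simple reading when the feature is binary (0/1): a threshold of about 0.01 drops a column in which 99% of the values are identical, since the variance of such a column is 0.99 × 0.01 ≈ 0.01. For continuous features the variance depends on the scale of the column, so the threshold has no such direct interpretation.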
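
The arithmetic can be checked on a toy example. In the sketch below (synthetic columns, hypothetical names), a binary column with 99% zeros has variance 0.0099 and is dropped at a threshold of 0.01, a binary column with 80% zeros has variance 0.16 and is kept, and a continuous column on a small scale is dropped purely because of its units.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

flag_99 = np.r_[np.ones(10), np.zeros(990)]    # 99% zeros: variance 0.99 * 0.01
flag_80 = np.r_[np.ones(200), np.zeros(800)]   # 80% zeros: variance 0.80 * 0.20
small_scale = rng.uniform(0, 0.1, size=n)      # continuous, values between 0 and 0.1

X_flags = np.column_stack([flag_99, flag_80, small_scale])

selector = VarianceThreshold(threshold=0.01).fit(X_flags)

print(selector.variances_.round(4))
print(selector.get_support())   # only the second column is kept
```

This is why it can help to look at the scale of each column before choosing a single threshold for the whole dataset.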

In [None]:
var_thr = VarianceThreshold(threshold = 0.1)
var_thr.fit(X)

var_thr.get_support()

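The values in the output mean:
* True: variance above the threshold (the feature is kept)
* False: variance at or below the threshold (the feature is dropped)

I am dropping the columns whose variance does not exceed 0.1, which for a binary feature corresponds to roughly 90% or more identical values (0.9 × 0.1 = 0.09).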

In [None]:
concol = [column for column in X.columns if column not in X.columns[var_thr.get_support()]]

for features in concol:
    print(features)

In [None]:
cols_after_var_thr = [x for x in X.columns if x not in concol]

len(cols_after_var_thr), len(X.columns)

The final number of columns after applying variance threshold is around 1/4 of the initial number.

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(X[cols_after_var_thr].corr(method='pearson'),annot=True,fmt='.2f',annot_kws={"fontsize":6},cmap='Blues')
plt.title('Correlation heatmap',fontsize=30)

plt.tight_layout()
plt.show()

# 5. Sequential Feature Selection
<a id="5"></a>

Sequential Feature Selector is a greedy procedure where, at each iteration, one chooses the best new feature to add to the selected features based on a cross-validation score. One starts with 0 features and chooses the single feature with the highest score. The procedure is repeated until one reaches the desired number of selected features.
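
The greedy loop described above can be written out by hand. The following is a minimal sketch on synthetic data, with a logistic regression as the estimator to keep it fast; the variable names are hypothetical and this is an illustration of the idea, not scikit-learn's internal code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 6 features, 3 of them informative.
X_sfs, y_sfs = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
estimator = LogisticRegression(max_iter=1000)

selected = []                               # indices of the chosen features
remaining = list(range(X_sfs.shape[1]))     # indices still available
n_to_select = 3

while len(selected) < n_to_select:
    # Cross-validated score of the current selection plus each candidate.
    cv_scores = {f: cross_val_score(estimator, X_sfs[:, selected + [f]],
                                    y_sfs, cv=5).mean()
                 for f in remaining}
    best = max(cv_scores, key=cv_scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, CV accuracy = {cv_scores[best]:.3f}")

print(sorted(selected))
```

Each round fits the estimator once per remaining candidate and per fold, which explains why the procedure becomes slow with many columns and an expensive estimator.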

In [None]:
sfs_forward = SequentialFeatureSelector(rf, n_features_to_select=7, direction="forward").fit(X, y)

selected_features_forw = [column for column in X.columns if column in X.columns[sfs_forward.get_support()]]

for features in selected_features_forw:
    print(features)

In [None]:
X[selected_features_forw]

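I have commented out the code below because it is quite slow: backward selection starts from all the features and removes one at a time, so it needs many more model fits. To make it run faster, one should substitute the random forest classifier with a faster estimator.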

In [None]:
'''sfs_backward = SequentialFeatureSelector(rf, n_features_to_select=7, direction="backward").fit(X, y)

selected_features_back = [column for column in X.columns if column in X.columns[sfs_backward.get_support()]]

for features in selected_features_back:
    print(features)'''
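
As a sketch of the substitution suggested above, backward selection is run below with a scaled logistic regression instead of the random forest. The synthetic data, the choice of estimator and `cv=3` are assumptions made for illustration; whether a linear model is appropriate for the bankruptcy data is a separate question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 features, 4 of them informative.
X_b, y_b = make_classification(n_samples=300, n_features=10, n_informative=4,
                               random_state=0)

# A cheap estimator: standardize the columns, then fit a logistic regression.
fast_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sfs_back_demo = SequentialFeatureSelector(fast_clf, n_features_to_select=7,
                                          direction="backward", cv=3)
sfs_back_demo.fit(X_b, y_b)

print(sfs_back_demo.get_support(indices=True))
```

`SequentialFeatureSelector` also accepts an `n_jobs` argument to evaluate the candidates in parallel, which may help further on a multi-core machine.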

# 6. Classification with a Random Forest Classifier
<a id="6"></a>

## 6.1 Classification with the 'Feature Importance' Dataset

In [None]:
def get_test_scores(model_name:str,preds,y_test_data):
    '''
    Generate a table of test scores.

    In:
        model_name (string): Your choice: how the model will be named in the output table
        preds: numpy array of test predictions
        y_test_data: numpy array of y_test data

    Out:
        table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy  = accuracy_score(y_test_data,preds)
    precision = precision_score(y_test_data,preds,average='macro')
    recall    = recall_score(y_test_data,preds,average='macro')
    f1        = f1_score(y_test_data,preds,average='macro')

    table = pd.DataFrame({'model': [model_name],'precision': [precision],'recall': [recall],
                          'F1': [f1],'accuracy': [accuracy]})

    return table
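
The function above uses `average='macro'`, i.e. the unweighted mean of the per-class scores. On an imbalanced target this differs noticeably from accuracy, as the small hand-made example below shows (the labels are invented for illustration).

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # one of the two positives is missed

acc = accuracy_score(y_true, y_pred)                          # 9 of 10 correct
recall_pos = recall_score(y_true, y_pred)                     # class 1 only: 1 of 2
recall_macro = recall_score(y_true, y_pred, average="macro")  # mean of 1.0 and 0.5

print(acc, recall_pos, recall_macro)   # 0.9 0.5 0.75
```

Accuracy looks high because the majority class dominates, while the macro recall is pulled down by the missed positive.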

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X[col_02], y, test_size=0.3, random_state=42)

In [None]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

rf_test_preds_FI = rf.predict(X_test)

rf_test_results_FI = get_test_scores('Random Forest (with Feature Importance)', rf_test_preds_FI, y_test)

rf_test_results_FI

The recall score is quite low. To improve it, one should under-/over-sample the training data. See **Ref. 2**.
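
One simple form of over-sampling is to draw minority-class rows with replacement until both classes have the same size. The sketch below does this with scikit-learn's `resample` utility on synthetic imbalanced data; the data and variable names are assumptions, and this is not necessarily the approach taken in Ref. 2.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced synthetic data (assumption): roughly 5% positives.
X_arr, y_arr = make_classification(n_samples=1000, n_features=5, weights=[0.95],
                                   random_state=0)
X_imb = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y_imb = pd.Series(y_arr, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3,
                                          stratify=y_imb, random_state=42)

# Over-sample the minority class in the training split only, so that the
# test split keeps the original class balance.
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)

X_tr_bal = pd.concat([majority, minority_up])
y_tr_bal = pd.Series([0] * len(majority) + [1] * len(minority_up),
                     index=X_tr_bal.index)

print(y_tr.value_counts().to_dict())
print(y_tr_bal.value_counts().to_dict())
```

The classifier would then be fitted on `X_tr_bal` and `y_tr_bal` and evaluated on the untouched `X_te` and `y_te`.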

## 6.2 Classification with the 'Forward Feature Selection' Dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X[selected_features_forw], y, test_size=0.3, random_state=42)

In [None]:
rf.fit(X_train, y_train)

rf_test_preds_FFS = rf.predict(X_test)

rf_test_results_FFS = get_test_scores('Random Forest (with Forward Feature Selection)', rf_test_preds_FFS, y_test)

rf_test_results_FFS

# References

1. Jacopo Ferretti, [*Causes of Stroke: Logistic Regression + Partial Dependence + SHAP*](https://www.kaggle.com/code/jacopoferretti/causes-of-stroke-log-regr-partial-dependence-shap), notebook on Kaggle.
2. Jacopo Ferretti, [*Company Bankruptcy: Classification with Feature Selection*](https://www.kaggle.com/code/jacopoferretti/company-bankruptcy-classif-w-feature-selection), notebook on Kaggle.
3. Ryan Holbrook and Alexis Cook, [*Feature Engineering*](https://www.kaggle.com/learn/feature-engineering), course on Kaggle.
4. Shelvi Garg, [*Dropping Constant Features using VarianceThreshold: Feature Selection -1*](https://medium.com/nerd-for-tech/removing-constant-variables-feature-selection-463e2d6a30d9), article on medium.com.
5. Manoj Kumar, Maria Telenczuk and Nicolas Hug, [*Model-based and sequential feature selection*](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py), scikit-learn.org.