In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

# Feature Selection

Feature selection is the process of selecting a subset of relevant and informative features from a larger set of available features for use in machine learning algorithms. The aim is to reduce the dimensionality of the data and improve the accuracy and efficiency of the model.

There are several techniques of feature selection. Let's take a look into a two most popular techniques.

## Forward Feature Selection

Forward feature selection involves starting with an empty set of features and iteratively adding one feature at a time based on their individual performance in predicting the outcome variable. This process continues until a stopping criterion is met, such as reaching a predefined number of features or a specific level of accuracy.

In [3]:
# Creating custom dataset for testing
X, y = make_classification(
    n_samples = 800, # total rows
    n_features= 10, # total columns
    n_informative = 5, # informative features
    n_redundant = 0,
    random_state = 42


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [5]:
X_train

array([[-1.36640003, -0.32488362,  0.65893535, ...,  1.0604942 ,
        -0.02333372, -0.18402613],
       [-0.0462497 , -0.73132461, -0.34477745, ..., -0.0687072 ,
        -0.83922872, -0.28752381],
       [-1.67506644, -0.89554588, -0.5129346 , ..., -0.49896076,
         1.55535909, -0.25275679],
       ...,
       [ 0.21992435,  1.30540243,  1.29242937, ..., -0.71444768,
         1.32562275, -2.1280591 ],
       [ 1.46819807, -1.67579178,  0.01398182, ...,  0.86466231,
         0.81370229, -1.31381612],
       [-1.49779863, -0.16046355,  0.077014  , ..., -0.43002836,
         2.33283789, -1.45072092]])

In [7]:
selected_feature= [] 
for i in range(X_train.shape[1]):
    best_acc = 0
    best_feature = None
    for j in range(X_train.shape[1]):
        if j not in selected_feature:
            features = selected_feature + [j]
            model = LogisticRegression()
            model.fit(X_train[:, features], y_train)
            accuracy = model.score(X_test[:, features], y_test)

        if accuracy > best_acc:
            best_acc = accuracy
            best_feature = j
    selected_feature.append(best_feature)

    print("Selected Feature (forward): ", selected_feature, "| Score: ", accuracy)

Selected Feature (forward):  [9] | Score:  0.66875
Selected Feature (forward):  [9, 0] | Score:  0.6125
Selected Feature (forward):  [9, 0, 5] | Score:  0.7375
Selected Feature (forward):  [9, 0, 5, 8] | Score:  0.7875
Selected Feature (forward):  [9, 0, 5, 8, 0] | Score:  0.775
Selected Feature (forward):  [9, 0, 5, 8, 0, 1] | Score:  0.775
Selected Feature (forward):  [9, 0, 5, 8, 0, 1, 2] | Score:  0.775
Selected Feature (forward):  [9, 0, 5, 8, 0, 1, 2, 3] | Score:  0.76875
Selected Feature (forward):  [9, 0, 5, 8, 0, 1, 2, 3, 4] | Score:  0.76875
Selected Feature (forward):  [9, 0, 5, 8, 0, 1, 2, 3, 4, 6] | Score:  0.76875


In [39]:
sliced_array = X[:, [4, 8, 2, 0]]
sliced_array

array([[ 0.59569725,  3.25494185, -0.01563761,  3.86239664],
       [ 0.20289324, -1.9777969 , -3.80487368, -2.75566887],
       [ 1.6694547 ,  0.81238553,  2.43884712,  1.37952235],
       ...,
       [-0.35891874,  0.72486005,  2.77181538,  1.33017686],
       [ 1.70220879, -0.02961941,  1.58535027, -1.19160029],
       [ 1.49761512,  0.43877996,  2.07032359,  1.46458697]])

<br>

## Backward Feature Selection

Backward feature selection, on the other hand, starts with all available features and iteratively removes one feature at a time based on their individual performance in predicting the outcome variable. This process continues until a stopping criterion is met, such as reaching a predefined number of features or a specific level of accuracy.

In [33]:
X, y = make_classification(
    n_samples=1000, 
    n_features=10,
    n_informative=5,
    n_redundant=0,
    random_state=42
)

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [35]:
selected_feature= list(range(X_train.shape[1]))
for i in range(X_train.shape[1] -1):
    worst_acc = 1
    worst_feature = None
    
    for j in selected_feature:
        
        features = selected_feature.copy()
        features.remove(j)
        
        model = LogisticRegression()
        model.fit(X_train[:, features], y_train)
        accuracy = model.score(X_test[:, features], y_test)

        if accuracy < worst_acc:
            worst_acc = accuracy
            worst_feature = j
            
    selected_feature.remove(worst_feature)

    print("Selected Feature (Backward): ", selected_feature, "| Score: ", accuracy)

Selected Feature (Backward):  [0, 1, 3, 4, 5, 6, 7, 8, 9] | Score:  0.825
Selected Feature (Backward):  [0, 1, 3, 4, 5, 6, 7, 9] | Score:  0.755
Selected Feature (Backward):  [0, 1, 3, 5, 6, 7, 9] | Score:  0.685
Selected Feature (Backward):  [0, 1, 3, 5, 6, 9] | Score:  0.62
Selected Feature (Backward):  [1, 3, 5, 6, 9] | Score:  0.395
Selected Feature (Backward):  [1, 3, 5, 9] | Score:  0.395
Selected Feature (Backward):  [1, 3, 5] | Score:  0.36
Selected Feature (Backward):  [1, 5] | Score:  0.41
Selected Feature (Backward):  [1] | Score:  0.4
