In [1]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

In [3]:
# Creating custom dataset for testing
X, y = make_classification(
    n_samples=800, # total rows
    n_features=10, # total columns
    n_informative=5, # informative features
    n_redundant=0,
    random_state = 42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, stratify=y, random_state = 42
)

## Feature Selection

Feature selection is the process of selecting a subset of relevant and informative features from a larger set of available features for use in machine learning algorithms. The aim is to reduce the dimensionality of the data and improve the accuracy and efficiency of the model.

There are several feature selection techniques. Let's look at two of the most popular.

### Forward Feature Selection

Forward feature selection involves **starting with an empty set of features and iteratively adding one feature at a time**. At each step, every remaining feature is tried together with the features already selected, and the one that gives the best model performance is added. This process continues until a stopping criterion is met, such as reaching a predefined number of features, reaching a specific level of accuracy, or seeing no further improvement.
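The stopping criterion mentioned above can be sketched in plain Python. This is a minimal illustration, not a library API: `forward_select`, `score_fn`, `max_features`, `min_gain`, and `toy_score` are hypothetical names. In practice `score_fn` would fit a model on the given columns and return a validation score.

```python
from typing import Callable, List, Sequence


def forward_select(
    n_features: int,
    score_fn: Callable[[Sequence[int]], float],
    max_features: int,
    min_gain: float = 0.0,
) -> List[int]:
    """Greedy forward selection with two stopping criteria.

    score_fn is a hypothetical callback: it takes a list of column
    indices and returns a validation score (higher is better).
    """
    selected: List[int] = []
    best_score = float("-inf")

    # Criterion 1: stop once max_features have been selected.
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break

        # Score each remaining feature together with the current subset.
        scores = {j: score_fn(selected + [j]) for j in candidates}
        best_j = max(scores, key=scores.get)

        # Criterion 2: stop when the best candidate adds no more than min_gain.
        if selected and scores[best_j] - best_score <= min_gain:
            break

        selected.append(best_j)
        best_score = scores[best_j]

    return selected


# Toy scorer (an assumption for illustration): only features 0 and 2 help.
weights = {0: 0.3, 2: 0.2}


def toy_score(cols: Sequence[int]) -> float:
    return 0.5 + sum(weights.get(c, 0.0) for c in cols)


chosen = forward_select(n_features=4, score_fn=toy_score, max_features=4)
print(chosen)
```

With the toy scorer, the loop adds feature 0, then feature 2, and stops because no remaining feature improves the score.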

In [4]:
X_train.shape

(640, 10)

In [7]:
selected_feature = []

for _ in range(X_train.shape[1]):
    best_acc = 0
    best_feature = None

    # Try each remaining feature together with the ones already selected.
    for j in range(X_train.shape[1]):

        if j not in selected_feature:

            features = selected_feature + [j]

            model = LogisticRegression()
            model.fit(X_train[:, features], y_train)
            # Note: scoring on X_test lets the test set guide the selection,
            # so these scores are optimistic. A validation split or
            # cross-validation on X_train avoids this.
            accuracy = model.score(X_test[:, features], y_test)

            if accuracy > best_acc:
                best_acc = accuracy
                best_feature = j

    selected_feature.append(best_feature)

    # Report the score of the feature that was actually added.
    print("Selected Feature (forward): ", selected_feature, " Score: ", best_acc)


### Backward Feature Selection

Backward feature selection, on the other hand, **starts with all available features and iteratively removes one feature at a time**. At each step, the feature whose removal hurts model performance the least (or improves it the most) is dropped. This process continues until a stopping criterion is met, such as reaching a predefined number of features or a specific level of accuracy.
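scikit-learn provides this procedure as `SequentialFeatureSelector`. The sketch below is one possible way to run backward elimination with it. The target of 5 features and the 5-fold cross-validation are assumptions chosen for illustration, and the dataset is rebuilt so the cell runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as at the top of the notebook.
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=5, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Backward elimination down to 5 features (5 is an assumed target),
# scored by 5-fold cross-validation on the training split only.
selector = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=5,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
selector.fit(X_train, y_train)

kept = selector.get_support(indices=True)  # column indices that survived
print("Kept features:", kept)

# The test split is used once, after selection is finished.
model = LogisticRegression().fit(selector.transform(X_train), y_train)
print("Test accuracy:", model.score(selector.transform(X_test), y_test))
```

Because the selector only sees `X_train`, the final test accuracy is not influenced by the selection itself. Setting `direction="forward"` runs forward selection with the same interface.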

In [13]:
selected_feature = list(range(X_train.shape[1]))

for _ in range(X_train.shape[1] - 1):
    best_acc = 0
    least_useful = None

    # Try dropping each remaining feature in turn.
    for j in selected_feature:

        features = selected_feature.copy()
        features.remove(j)

        model = LogisticRegression()
        model.fit(X_train[:, features], y_train)
        accuracy = model.score(X_test[:, features], y_test)

        # The feature whose removal leaves the highest accuracy is the
        # one the model needs least, so that is the one to drop.
        if accuracy > best_acc:
            best_acc = accuracy
            least_useful = j

    selected_feature.remove(least_useful)

    # Report the score of the subset that remains after the removal.
    print("Selected Feature (Backward): ", selected_feature, " Score: ", best_acc)


Both forward and backward feature selection have their own advantages and limitations. Forward selection is usually cheaper when only a few features are wanted, because it can stop early and the models it fits in the first rounds are small. Its weakness is that it never revisits a choice: a feature added early stays in the set even if later additions make it redundant, and features that are only useful in combination can be missed because each looks weak on its own.

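In contrast, backward selection starts from the full model, so it judges every feature in the presence of all the others and is better placed to keep features that only work together. The cost is that its first rounds fit the largest models, which is expensive when there are many features, and a feature removed early is never reconsidered even if it would have mattered in a smaller subset.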

Ultimately, the choice between forward and backward feature selection depends on the specific needs and characteristics of the dataset and the goals of the analysis.
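To make the cost side of that trade-off concrete, the sketch below counts how many models each greedy procedure fits to arrive at `k` features out of `n`, assuming one fit per candidate per round. The function names `fits_forward` and `fits_backward` are hypothetical helpers, not part of any library.

```python
def fits_forward(n: int, k: int) -> int:
    """Model fits needed to grow from 0 to k features out of n."""
    # Round i (0-based) has n - i candidates left to try.
    return sum(n - i for i in range(k))


def fits_backward(n: int, k: int) -> int:
    """Model fits needed to shrink from n features down to k."""
    # Round i (0-based) has n - i features that could be dropped.
    return sum(n - i for i in range(n - k))


n = 10  # number of columns in this notebook's dataset
for k in (3, 5, 8):
    print(f"k={k}: forward={fits_forward(n, k)}, backward={fits_backward(n, k)}")
```

For 10 features, a target of 3 needs 27 fits going forward and 49 going backward, a target of 5 needs 40 either way, and a target of 8 needs 52 going forward and 19 going backward. With cross-validation, each count is multiplied by the number of folds.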