### Step backward feature selection

Step Backward Feature Selection starts by fitting a model using all the features in the data set and determining its performance. 

Then, it trains one model for each subset obtained by leaving out a single feature, and removes the feature whose absence produces the best-performing model, that is, the feature the model misses the least.

In the third step, it repeats the procedure on the features remaining from step 2: it trains one model per subset with one further feature left out, and again removes the feature whose absence hurts performance the least.

The algorithm stops when a user-defined criterion is met. This criterion could be that removing another feature would decrease model performance beyond a certain threshold, or alternatively, as we show in this notebook, that a certain number of selected features has been reached.

The evaluation metric is chosen by the user; it can be, for example, the ROC-AUC (`roc_auc`) for classification or the R² (`r2`) for regression.
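The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MLXtend or Scikit-learn implementation: `backward_select`, `toy_score` and the `GAIN` table are hypothetical names, and the toy score simply adds up a fixed gain per feature instead of training a model.

```python
from typing import Callable, List, Sequence, Tuple


def backward_select(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    n_keep: int,
) -> Tuple[List[str], List[str]]:
    """Greedy backward elimination down to `n_keep` features.

    `score` maps a feature subset to a performance value (higher is better).
    Returns the kept features and the features in the order they were removed.
    """
    current = list(features)
    removed = []
    while len(current) > n_keep:
        # Try dropping each remaining feature in turn and score the rest.
        candidates = [
            (score(tuple(f for f in current if f != drop)), drop)
            for drop in current
        ]
        # Remove the feature whose absence leaves the best-scoring subset.
        _, drop = max(candidates, key=lambda c: c[0])
        current.remove(drop)
        removed.append(drop)
    return current, removed


# Toy score (an assumption for illustration): each feature adds a fixed gain.
GAIN = {"a": 0.30, "b": 0.02, "c": 0.20, "d": 0.01}


def toy_score(subset):
    return sum(GAIN[f] for f in subset)


kept, removed = backward_select(list(GAIN), toy_score, n_keep=2)
print(kept, removed)
```

In a real run, `score` would be the cross-validated metric of a model trained on the given subset, and the stopping rule here is the "number of features" criterion.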

Step Backward Feature Selection is called greedy because at every step it makes the locally best removal and never revisits earlier decisions. It still has to train one model per remaining feature in every round (n models, then n-1, then n-2, and so on), so it is computationally expensive and, if the feature space is big enough, sometimes infeasible.
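To get a feeling for the cost, we can count the model fits. The sketch below assumes one candidate per remaining feature in every round and one fit per cross-validation fold per candidate; `n_candidate_fits` is a hypothetical helper, and the two calls use the feature counts and `cv=2` that appear later in this notebook.

```python
def n_candidate_fits(n_features: int, n_keep: int, cv: int = 1) -> int:
    """Approximate number of model fits for greedy backward selection.

    Each round evaluates one candidate per remaining feature, and every
    candidate is fitted once per cross-validation fold.
    """
    candidates = sum(range(n_keep + 1, n_features + 1))
    return candidates * cv


clf_fits = n_candidate_fits(72, 65, cv=2)  # classification example
reg_fits = n_candidate_fits(34, 20, cv=2)  # regression example
print(clf_fits, reg_fits)
```

Note that the cost grows as we ask for fewer features, because every additional removal adds a full round of fits.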

Scikit-learn provides various criteria to stop the search:

* when a certain number of features is reached, as in MLXtend (arbitrary)
* when the performance does not improve beyond a certain threshold (ideal but expensive)
* when half of the features have been selected (arbitrary)

In [1]:
# %load_ext autoreload
# %autoreload 2

In [15]:
import pandas as pd

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score, r2_score
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SequentialFeatureSelector as SFS
# project-specific helper functions
from helper_fe_v2 import (
            get_full_datapath_nm,
            read_df_from_file,
            check_module_members,
            gen_correlation,
            do_bkwd_fwd_selection,
            yaml_path,
            read_yaml_conf,
            run_randomForestClassifier,
            run_randomForestRegressor
)

In [3]:
config = read_yaml_conf(yaml_path())
print ("yaml_ conf ", config ) 

yaml_ conf  {'write_file': True, 'base_dir': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments', 'full_config_file': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\py-projects.yaml', 'cur_kaggle_expt': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\feature-engineering\\fe-recipes', 'files': {'test_data_set2': 'fselect_dataset_2.csv', 'housing_data': 'housing_prices_train.csv', 'test_data_set1': 'fselect_dataset_1.csv', 'titanic_data': 'fe-cookbook-titanic.csv', 'comp_eda_file': 'none.csv'}, 'project_parms': {'use_mlxtnd': 'True'}, 'process_eda': {'main_file': 'fselect_dataset_2.csv', 'compre_file': 'none.csv', 'pairwise_analysis': 'on', 'show_html': 'False'}, 'RandomForestConfig': {'n_estimators': 200, 'rand_state': 39, 'max_depth': 4}}


## Classification

In [4]:
data = read_df_from_file ( config['files']['test_data_set2'], set_nrows=False, nrws=0 ) 
data.shape

Full path NM exists  C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_2.csv


(50000, 109)

In [5]:
data.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109
0,4.53271,3.280834,17.982476,4.404259,2.34991,0.603264,2.784655,0.323146,12.009691,0.139346,...,2.079066,6.748819,2.941445,18.360496,17.726613,7.774031,1.473441,1.973832,0.976806,2.541417
1,5.821374,12.098722,13.309151,4.125599,1.045386,1.832035,1.833494,0.70909,8.652883,0.102757,...,2.479789,7.79529,3.55789,17.383378,15.193423,8.263673,1.878108,0.567939,1.018818,1.416433
2,1.938776,7.952752,0.972671,3.459267,1.935782,0.621463,2.338139,0.344948,9.93785,11.691283,...,1.861487,6.130886,3.401064,15.850471,14.620599,6.849776,1.09821,1.959183,1.575493,1.857893
3,6.02069,9.900544,17.869637,4.366715,1.973693,2.026012,2.853025,0.674847,11.816859,0.011151,...,1.340944,7.240058,2.417235,15.194609,13.553772,7.229971,0.835158,2.234482,0.94617,2.700606
4,3.909506,10.576516,0.934191,3.419572,1.871438,3.340811,1.868282,0.439865,13.58562,1.153366,...,2.738095,6.565509,4.341414,15.893832,11.929787,6.954033,1.853364,0.511027,2.599562,0.811364


In [6]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 108), (15000, 108))

In [7]:
# remove correlated features to reduce the feature space

corr_features = gen_correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )

correlated features:  36
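The helper `gen_correlation` is not shown in this notebook. A common way to implement this kind of filter is sketched below, as an assumption about what such a helper might do rather than its actual code: flag every column whose absolute Pearson correlation with an earlier column exceeds the threshold. The function name `correlated_features` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_features(df: pd.DataFrame, threshold: float) -> set:
    """Return columns whose absolute Pearson correlation with an earlier
    column exceeds `threshold`; the earlier column of each pair is kept."""
    corr = df.corr().abs()
    to_drop = set()
    for i, col in enumerate(corr.columns):
        # Compare only against columns that come before `col`.
        if (corr.iloc[:i, i] > threshold).any():
            to_drop.add(col)
    return to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "b": rng.normal(size=200),                        # independent noise
})
print(correlated_features(toy, 0.8))
```

The correlations are computed on the training set only, as in the cell above, so that the test set does not influence which features are dropped.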


In [8]:
# remove the correlated features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 72), (15000, 72))

In [9]:
# within the SFS we indicate:

# 1) the algorithm we want to train, in this case random forests
# (note that I use few trees to speed things up)

# 2) the stopping criterion: we want to select 65 features

# 3) whether to perform step forward or step backward selection

# 4) the evaluation metric: in this case the roc_auc
# 5) the number of cross-validation folds

# this is going to take a while, do not despair

rf = RandomForestClassifier(n_estimators=10, n_jobs=4, random_state=0)


In [13]:
sfs = do_bkwd_fwd_selection(
    estimator=rf,
    k_features=65,  # the fewer features we keep, the longer backward selection takes
    forward=False,
    verbose=2,
    scoring='roc_auc',
    cv=2,
    path_to_yaml=yaml_path(),
)

sfs = sfs.fit(X_train, y_train)

Calling sklearn libs 
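The helper `do_bkwd_fwd_selection` takes MLXtend-style arguments (`k_features`, `forward`) but, judging by the message above, builds a Scikit-learn selector here. Its code is not shown, so the following is only a guess at how such a wrapper could translate the arguments; `make_selector` and `use_mlxtend` are hypothetical names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector as SklearnSFS


def make_selector(estimator, k_features, forward, scoring, cv, use_mlxtend=False):
    """Build a sequential selector from MLXtend-style arguments (hypothetical)."""
    if use_mlxtend:
        # Imported lazily so that mlxtend is only required when requested.
        from mlxtend.feature_selection import SequentialFeatureSelector as MlxSFS

        return MlxSFS(
            estimator,
            k_features=k_features,
            forward=forward,
            floating=False,
            scoring=scoring,
            cv=cv,
        )
    # Scikit-learn names the same ideas differently.
    return SklearnSFS(
        estimator,
        n_features_to_select=k_features,
        direction="forward" if forward else "backward",
        scoring=scoring,
        cv=cv,
    )


selector = make_selector(
    RandomForestClassifier(n_estimators=10, random_state=0),
    k_features=65,
    forward=False,
    scoring="roc_auc",
    cv=2,
)
print(selector.direction, selector.n_features_to_select)
```

The two libraries also expose the result differently: Scikit-learn through `get_support()` and `get_feature_names_out()`, MLXtend through attributes such as `k_feature_idx_` and `k_feature_names_`.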


In [16]:
# baseline: performance using all the features left after the correlation filter
pred = run_randomForestClassifier(X_train, X_test, y_train, y_test, yaml_path())

Train set
Random Forests roc-auc: 0.7119921185820277
Test set
Random Forests roc-auc: 0.6957598691250635
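The helper `run_randomForestClassifier` is not shown either. Based on its printed output and on the `RandomForestConfig` entry of the YAML file, it presumably does something along these lines; this is a sketch under that assumption, with a hypothetical function name and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_rf_classifier(X_train, X_test, y_train, y_test,
                           n_estimators=200, max_depth=4, random_state=39):
    """Fit a random forest, print train/test ROC-AUC, return test probabilities."""
    rf = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=random_state
    )
    rf.fit(X_train, y_train)
    for name, X, y in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        # ROC-AUC is computed from the predicted probability of the positive class.
        auc = roc_auc_score(y, rf.predict_proba(X)[:, 1])
        print(f"{name} set ROC-AUC: {auc:.4f}")
    return rf.predict_proba(X_test)


# Synthetic data, only to show the call.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
proba = evaluate_rf_classifier(X_tr, X_te, y_tr, y_te, n_estimators=20)
```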


In [17]:
selected_feat= sfs.get_feature_names_out()

selected_feat

array(['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7',
       'var_9', 'var_10', 'var_12', 'var_13', 'var_14', 'var_15',
       'var_16', 'var_18', 'var_19', 'var_20', 'var_21', 'var_22',
       'var_23', 'var_25', 'var_27', 'var_30', 'var_31', 'var_34',
       'var_35', 'var_36', 'var_37', 'var_38', 'var_40', 'var_45',
       'var_46', 'var_47', 'var_48', 'var_49', 'var_51', 'var_53',
       'var_55', 'var_56', 'var_58', 'var_60', 'var_62', 'var_63',
       'var_65', 'var_67', 'var_68', 'var_69', 'var_71', 'var_73',
       'var_77', 'var_78', 'var_79', 'var_81', 'var_83', 'var_86',
       'var_89', 'var_90', 'var_91', 'var_92', 'var_93', 'var_96',
       'var_98', 'var_99', 'var_103', 'var_107'], dtype=object)

In [18]:
# evaluate performance of algorithm built
# using selected features

run_randomForestClassifier(X_train[selected_feat],
                  X_test[selected_feat],
                  y_train, y_test, yaml_path())

Train set
Random Forests roc-auc: 0.7109038354859207
Test set
Random Forests roc-auc: 0.6952845877445597


array([[0.35004144, 0.64995856],
       [0.23308177, 0.76691823],
       [0.24920491, 0.75079509],
       ...,
       [0.18578571, 0.81421429],
       [0.35276133, 0.64723867],
       [0.34838792, 0.65161208]])

## Regression

In [19]:
config = read_yaml_conf(yaml_path())
print ("yaml_ conf ", config ) 

yaml_ conf  {'write_file': True, 'base_dir': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments', 'full_config_file': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\py-projects.yaml', 'cur_kaggle_expt': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\feature-engineering\\fe-recipes', 'files': {'test_data_set2': 'fselect_dataset_2.csv', 'housing_data': 'housing_prices_train.csv', 'test_data_set1': 'fselect_dataset_1.csv', 'titanic_data': 'fe-cookbook-titanic.csv', 'comp_eda_file': 'none.csv'}, 'project_parms': {'use_mlxtnd': 'False'}, 'process_eda': {'main_file': 'fselect_dataset_2.csv', 'compre_file': 'none.csv', 'pairwise_analysis': 'on', 'show_html': 'False'}, 'RandomForestConfig': {'n_estimators': 200, 'rand_state': 39, 'max_depth': 4}}


In [20]:
data = read_df_from_file ( config['files']['housing_data'], set_nrows=False, nrws=0 ) 
data.shape

Full path NM exists  C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\housing_prices_train.csv


(1460, 81)

In [21]:
# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how predictive they are of the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)
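If we wanted to keep the categorical variables instead of discarding them, they would first need a numerical encoding. A minimal sketch with made-up rows is shown below; the ordinal codes are one simple option among several, and the values are not taken from the housing data.

```python
import pandas as pd

# Made-up rows, only to illustrate the encoding step.
toy = pd.DataFrame({
    "LotArea": [100, 200, 300],
    "Street": ["Pave", "Grvl", "Pave"],  # categorical
    "SalePrice": [10, 20, 30],
})

encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Codes follow the sorted category labels; missing values become -1.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded["Street"].tolist())
print(encoded.dtypes)
```

After this step every column is numeric, so the feature selection below could be run on all of them.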

In [22]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 37), (438, 37))

In [23]:
# remove correlated features to reduce the feature space

corr_features = gen_correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )


correlated features:  3


In [24]:
# remove the correlated features
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((1022, 34), (438, 34))

In [25]:
# fill missing values with 0 so that the estimator receives no NaN
X_train.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

In [26]:
rf = RandomForestRegressor(n_estimators=10, n_jobs=4, random_state=10)

In [27]:
sfs = do_bkwd_fwd_selection(
    estimator=rf,
    k_features=20,  # the fewer features we keep, the longer backward selection takes
    forward=False,
    verbose=2,
    scoring='r2',
    cv=2,
    path_to_yaml=yaml_path(),
)

print(X_train.shape, y_train.shape)

sfs = sfs.fit(X_train, y_train)

Calling sklearn libs 
(1022, 34) (1022,)


In [28]:
sfs.get_feature_names_out()

array(['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'Fireplaces', 'GarageCars', 'WoodDeckSF', '3SsnPorch',
       'ScreenPorch', 'PoolArea'], dtype=object)

In [30]:
selected_feat= sfs.get_feature_names_out()

selected_feat

array(['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'Fireplaces', 'GarageCars', 'WoodDeckSF', '3SsnPorch',
       'ScreenPorch', 'PoolArea'], dtype=object)

In [31]:
import os

from helper_fe_v2 import run_randomForestRegressor

# evaluate performance of the algorithm built
# using the selected features
path_to_yaml = os.path.join(os.getenv('CURYAMLPATH'), os.getenv('CURYAMLFILE'))
pred = run_randomForestRegressor(X_train[selected_feat],
                  X_test[selected_feat],
                  y_train, y_test, path_to_yaml)

Train set
Random Forests roc-auc: 0.8629621682457452
Test set
Random Forests roc-auc: 0.8246015925536159


In [32]:
# baseline for comparison: evaluate performance
# using all the features
pred = run_randomForestRegressor(X_train, X_test, y_train, y_test, path_to_yaml)

Train set
Random Forests roc-auc: 0.8699152317492538
Test set
Random Forests roc-auc: 0.8190809813112794
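Note that the helper labels these scores "roc-auc", although the ROC-AUC is not defined for a continuous target such as `SalePrice`. Since the selection above was scored with `r2`, the values are presumably R²; this is an assumption, because the code of `run_randomForestRegressor` is not shown. A regression evaluator that reports R² explicitly could look like the sketch below, with a hypothetical function name and synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def evaluate_rf_regressor(X_train, X_test, y_train, y_test,
                          n_estimators=200, max_depth=4, random_state=39):
    """Fit a random forest regressor and return (train R2, test R2)."""
    rf = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=random_state
    )
    rf.fit(X_train, y_train)
    train_r2 = r2_score(y_train, rf.predict(X_train))
    test_r2 = r2_score(y_test, rf.predict(X_test))
    print(f"Train set R2: {train_r2:.4f}")
    print(f"Test set R2: {test_r2:.4f}")
    return train_r2, test_r2


# Synthetic data, only to show the call.
X, y = make_regression(
    n_samples=400, n_features=10, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
train_r2, test_r2 = evaluate_rf_regressor(X_tr, X_te, y_tr, y_te, n_estimators=20)
```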
