### Feature Selection

The input dataset consists of 24 independent variables.

The features used to train a machine learning model have a large influence on model performance, and irrelevant or partially relevant features can degrade it. Feature selection methods are therefore used in this project to reduce overfitting, improve accuracy, and reduce training time.

* Input: SMOTE + Tomek links class-balanced datasets: x_train_smtom.csv, y_train_smtom.csv, x_test_c4.csv and y_test_c4.csv
* Outcome: selected significant features:
  fs_set3 = ['LIMIT_BAL', 'PAY_AMT1', 'PAY_1', 'PAY_AMT2', 'AGE', 'SEX_2', 'PAY_3', 'MARRIAGE_2', 'BILL_AMT1', 'MARRIAGE_1', 'PAY_2', 'SEX_1', 'EDUCATION_2', 'EDUCATION_1']

In [2]:
# Import libraries
from __future__ import print_function
import time

import pandas as pd
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import random
import sklearn
import scipy
from collections import Counter

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Import feature selection libraries
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import RFE
from sklearn.feature_selection import chi2, SelectKBest, f_classif
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

# Import models from sklearn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score,roc_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import cohen_kappa_score
from sklearn import model_selection

In [3]:
x_train=pd.read_csv('x_train_smtom.csv')
x_test=pd.read_csv('x_test_c4.csv')

y_train=pd.read_csv('y_train_smtom.csv')
y_test=pd.read_csv('y_test_c4.csv')

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(31664, 29) (9000, 29) (31664, 1) (9000, 1)


### Filter Based methods:

#### 1. ANOVA Method

In [3]:
anova = SelectKBest(f_classif, k=12).fit(x_train, y_train)
print('Significant features from ANOVA method: {}'.format(x_train.columns[anova.get_support()]))

Significant features from ANOVA method: Index(['LIMIT_BAL', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
       'PAY_AMT1', 'PAY_AMT2', 'SEX_2', 'EDUCATION_1', 'MARRIAGE_2'],
      dtype='object')
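
To see why ANOVA keeps a given column, it can help to look at the F-statistics themselves rather than only the boolean mask. The following is a minimal sketch on synthetic data; the frame `X`, the target `y` and the column names `f0`..`f7` are hypothetical stand-ins for `x_train` and `y_train`, not the project files.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for x_train / y_train (assumption: not the project data)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(f_classif, k=3).fit(X, y)

# scores_ holds one F-statistic per column, pvalues_ the matching p-values
anova_scores = pd.DataFrame(
    {"F": selector.scores_, "p_value": selector.pvalues_}, index=X.columns
).sort_values("F", ascending=False)

selected = list(X.columns[selector.get_support()])
print(anova_scores)
print(selected)
```

The columns kept by `get_support()` are the `k` rows at the top of the sorted table, so the table also shows how close the first rejected feature was to the cut-off.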


#### 2. Information Gain/Mutual Information

In [4]:
# calculate the mutual information between the variables and the target
# this returns the mutual information value of each feature
# the smaller the value the less information the feature has about the target

mutual_info = SelectKBest(mutual_info_classif, k=12).fit(x_train.fillna(0), y_train)
print('Significant features from Mutual Information method: {}'.format(x_train.columns[mutual_info.get_support()]))

Significant features from Mutual Information method: Index(['LIMIT_BAL', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
      dtype='object')
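
The mutual information estimate in scikit-learn is randomized, so repeated runs can rank borderline features differently. A minimal sketch, assuming synthetic data in place of `x_train` and `y_train` (the names `X`, `y` and `f0`..`f7` are hypothetical), shows how to fix the seed and inspect the per-feature scores:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for x_train / y_train (assumption: not the project data)
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

# random_state makes the estimate reproducible across runs
mi = mutual_info_classif(X, y, random_state=0)

# Higher values mean the feature carries more information about the target
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores)
```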


### Wrapper based methods:

#### 3. Forward selection

In [6]:
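# Note: cv=0 disables cross-validation in mlxtend, so the scores logged below
# are computed on the training data itself and are therefore optimistic.
sfs = SFS(RandomForestClassifier(), k_features=(5,12), forward=True,
          floating=False, verbose=2, scoring='roc_auc', cv=0)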
sfs = sfs.fit(x_train, y_train)

print('Significant features from Forward Selection method: {}'.format(x_train.columns[list(sfs.k_feature_idx_)]))

[2020-10-18 14:06:10] Features: 1/12 -- score: 0.9871895092381068
[2020-10-18 14:13:00] Features: 2/12 -- score: 0.9969487224657541
[2020-10-18 14:18:47] Features: 3/12 -- score: 0.998149956570902
[... output truncated ...]
[2020-10-18 14:41:39] Features: 12/12 -- score: 0.9999916378160681

Significant features from Forward Selection method: Index(['LIMIT_BAL', 'AGE', 'PAY_1', 'BILL_AMT1', 'BILL_AMT5', 'BILL_AMT6',
       'PAY_AMT1', 'PAY_AMT3', 'PAY_AMT4', 'SEX_2', 'EDUCATION_2',
       'MARRIAGE_1'],
      dtype='object')
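
The near-perfect scores in this log come from scoring on the same data the model was fitted on (`cv=0`). As a hedged illustration of the cross-validated alternative, the sketch below uses scikit-learn's own `SequentialFeatureSelector` instead of mlxtend, with a logistic regression on synthetic data to keep it fast. The data, the estimator and the choice of 3 features and 3 folds are all assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for x_train / y_train (assumption: not the project data)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Each candidate feature is scored by 3-fold cross-validated ROC AUC,
# so the selection is not driven by training-set fit alone.
forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    scoring="roc_auc",
    cv=3,
).fit(X, y)

forward_features = list(X.columns[forward.get_support()])
print(forward_features)
```

Setting `direction="backward"` gives the corresponding backward elimination. With mlxtend, passing a positive `cv` to `SFS` has the same intent.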

#### 5. Backward elimination

In [7]:
sbs = SFS(RandomForestClassifier(n_jobs=4), k_features=(5,12), forward=False, 
           floating=False, verbose=2, scoring='roc_auc', cv=0)
sbs = sbs.fit(x_train, y_train)

print('Significant features from Backward Elimination method: {}'.format(x_train.columns[list(sbs.k_feature_idx_)]))

[2020-10-18 14:50:34] Features: 28/5 -- score: 0.9999967843891941
[2020-10-18 14:52:07] Features: 27/5 -- score: 0.9999967444932785
[2020-10-18 14:53:50] Features: 26/5 -- score: 0.9999971634003934
[... output truncated ...]
[2020-10-18 15:09:41] Features: 5/5 -- score: 0.9995041276793141

Significant features from Backward Elimination method: Index(['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
       'PAY_AMT4', 'SEX_2', 'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3',
       'MARRIAGE_2'],
      dtype='object')

#### 6. Stepwise method

In [8]:
sws = SFS(RandomForestClassifier(n_jobs=4), k_features=(5,12), forward=True, 
           floating=True, verbose=2,scoring='roc_auc', cv=0)
sws = sws.fit(np.array(x_train.fillna(0)), y_train)

print('Significant features from Stepwise selection method: {}'.format(x_train.columns[list(sws.k_feature_idx_)]))

[2020-10-18 15:10:17] Features: 1/12 -- score: 0.987106629462821
[2020-10-18 15:11:30] Features: 2/12 -- score: 0.9968978930743498
[... output truncated ...]
[2020-10-18 15:23:21] Features: 12/12 -- score: 0.9999889468365539

Significant features from Stepwise selection method: Index(['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_5', 'BILL_AMT1', 'BILL_AMT6',
       'PAY_AMT3', 'PAY_AMT4', 'SEX_1', 'SEX_2', 'EDUCATION_2', 'MARRIAGE_1'],
      dtype='object')

#### 7. Recursive Feature Elimination using Random forests

In [9]:
# RFE repeatedly fits a random forest and removes the least important feature
# at each iteration until n_features_to_select features remain.

rfe = RFE(RandomForestClassifier(), n_features_to_select=12)
rfe.fit(x_train.fillna(0), y_train)
print('Significant features from RFE method: {}'.format(x_train.columns[(rfe.get_support())]))

Significant features from RFE method: Index(['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
       'PAY_AMT6'],
      dtype='object')
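
Besides the boolean mask, a fitted `RFE` object records the order in which features were dropped in its `ranking_` attribute. The sketch below shows how to read it; it assumes synthetic data and a small forest (25 trees) purely to keep the example quick, and the names `X`, `y` and `f0`..`f7` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for x_train / y_train (assumption: not the project data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(RandomForestClassifier(n_estimators=25, random_state=0),
               n_features_to_select=3).fit(X, y)

# ranking_ == 1 marks the kept features; larger ranks were eliminated earlier
rfe_ranking = pd.Series(rfe_demo.ranking_, index=X.columns).sort_values()
rfe_features = list(X.columns[rfe_demo.support_])
print(rfe_ranking)
print(rfe_features)
```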


#### Shortlisting features based on their significance across different feature selection methods

In [48]:
# 1 full dataset + 3 subset combinations

fs_set1=['LIMIT_BAL','PAY_AMT1', 'PAY_1', 'PAY_AMT2', 'AGE', 'SEX_2',
         'PAY_AMT4','BILL_AMT6', 'BILL_AMT1', 'PAY_AMT3', 'PAY_2', 'EDUCATION_2', 'MARRIAGE_2']

fs_set2=['LIMIT_BAL','PAY_AMT1', 'PAY_1', 'PAY_AMT2', 'AGE', 'SEX_1',
         'PAY_5','PAY_AMT3', 'BILL_AMT1', 'PAY_4', 'PAY_2', 'EDUCATION_1','MARRIAGE_1']

fs_set3=['LIMIT_BAL','PAY_AMT1', 'PAY_1', 'PAY_AMT2', 'AGE', 'SEX_2',
         'PAY_3', 'MARRIAGE_2', 'BILL_AMT1', 'MARRIAGE_1', 'PAY_2', 'SEX_1', 'EDUCATION_2', 'EDUCATION_1']
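
The notebook does not record how these three sets were assembled. One plausible way to shortlist, shown here only as a sketch and not necessarily the procedure actually followed, is to count how many of the six methods selected each feature. The lists are copied from the outputs above; the variable names `method_results` and `votes` are hypothetical.

```python
from collections import Counter

# Features reported by each method in the outputs above
method_results = {
    "anova": ['LIMIT_BAL', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
              'PAY_AMT1', 'PAY_AMT2', 'SEX_2', 'EDUCATION_1', 'MARRIAGE_2'],
    "mutual_info": ['LIMIT_BAL', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_AMT1',
                    'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'],
    "forward": ['LIMIT_BAL', 'AGE', 'PAY_1', 'BILL_AMT1', 'BILL_AMT5', 'BILL_AMT6',
                'PAY_AMT1', 'PAY_AMT3', 'PAY_AMT4', 'SEX_2', 'EDUCATION_2', 'MARRIAGE_1'],
    "backward": ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
                 'PAY_AMT4', 'SEX_2', 'EDUCATION_1', 'EDUCATION_2', 'EDUCATION_3', 'MARRIAGE_2'],
    "stepwise": ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_5', 'BILL_AMT1', 'BILL_AMT6',
                 'PAY_AMT3', 'PAY_AMT4', 'SEX_1', 'SEX_2', 'EDUCATION_2', 'MARRIAGE_1'],
    "rfe": ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'BILL_AMT1', 'BILL_AMT2',
            'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT6'],
}

# One vote per method that selected the feature
votes = Counter(f for features in method_results.values() for f in features)

for feature, count in votes.most_common():
    print(f"{feature:12s} {count}")
```

Features chosen by most methods (for example `LIMIT_BAL`, selected by all six) are natural candidates for every subset, while the remaining slots can be varied between subsets.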

#### Functions to compare performance of ML models

In [49]:
classifier = [
    ensemble.AdaBoostClassifier(), ensemble.BaggingClassifier(), XGBClassifier(),
    ensemble.GradientBoostingClassifier(), ensemble.RandomForestClassifier(), tree.DecisionTreeClassifier(),
    linear_model.LogisticRegressionCV(), naive_bayes.GaussianNB(), neighbors.KNeighborsClassifier(),svm.SVC(probability=True)
    ]

In [50]:
def models_comparison(x_train, y_train, x_test, y_test, folds):
    """Fit every model in `classifier` and tabulate its metrics, best 'Test Accuracy' first."""
    time_start = time.time()
    classifier_compare = pd.DataFrame(columns=[])

    row_index = 0
    for alg in classifier:
        # Hard class predictions on the test set from a model fitted on the training set
        pred = alg.fit(x_train, y_train).predict(x_test)
        classifier_name = alg.__class__.__name__

        classifier_compare.loc[row_index, 'ML Algorithm'] = classifier_name
        # 'Train Accuracy' is the mean k-fold CV accuracy within the training set
        classifier_compare.loc[row_index, 'Train Accuracy'] = model_selection.cross_val_score(alg, x_train, y_train, cv=folds, scoring='accuracy').mean()
        # The next four metrics are k-fold CV scores within the test set: the model is
        # refitted on test folds, so they do not score the training-fitted model above
        classifier_compare.loc[row_index, 'Test Accuracy'] = model_selection.cross_val_score(alg, x_test, y_test, cv=folds, scoring='accuracy').mean()
        classifier_compare.loc[row_index, 'Precision'] = model_selection.cross_val_score(alg, x_test, y_test, cv=folds, scoring='precision').mean()
        classifier_compare.loc[row_index, 'Recall'] = model_selection.cross_val_score(alg, x_test, y_test, cv=folds, scoring='recall').mean()
        classifier_compare.loc[row_index, 'F1 score'] = model_selection.cross_val_score(alg, x_test, y_test, cv=folds, scoring='f1').mean()

        # ROC AUC is computed from hard labels here, so the curve has a single
        # operating point; predicted probabilities would give the usual AUC
        fpr, tpr, thresholds = roc_curve(y_test, pred)
        roc_auc = auc(fpr, tpr)
        classifier_compare.loc[row_index, 'ROC AUC'] = roc_auc
        classifier_compare.loc[row_index, 'Kappa'] = cohen_kappa_score(y_test, pred)
        classifier_compare.loc[row_index, 'GINI'] = (2 * roc_auc) - 1

        # Type II error = number of false negatives (defaulters predicted as non-defaulters)
        tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
        classifier_compare.loc[row_index, 'Type II error'] = fn
        row_index += 1

    classifier_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
    print('Time elapsed: {} seconds'.format(time.time() - time_start))
    return classifier_compare
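
Because `roc_curve` above receives hard 0/1 predictions, the resulting AUC reflects a single threshold. The sketch below contrasts that with the AUC computed from predicted probabilities. It uses synthetic data and a logistic regression as assumptions for illustration, and it does not reproduce the numbers in the tables that follow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project data (assumption)
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: only one point on the ROC curve is used
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from class-1 probabilities: the full ranking of test samples is used
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Gini coefficient derived from the AUC, as in models_comparison
gini = 2 * auc_from_proba - 1

print(auc_from_labels, auc_from_proba, gini)
```

For estimators without `predict_proba`, `decision_function` scores can be passed to `roc_auc_score` in the same way.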

### Selecting best set of features for modelling

In [51]:
#1. SET1 - fs_set1

models_comparison(x_train[fs_set1], y_train, x_test[fs_set1], y_test, 10)

Time elapsed: 6730.045720338821 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoostClassifier | 0.746785 | 0.825222 | 0.694577 | 0.339175 | 0.455215 | 0.657324 | 0.277372 | 0.314648 | 888.0 |
| 3 | GradientBoostingClassifier | 0.766366 | 0.824444 | 0.680547 | 0.351546 | 0.463653 | 0.676138 | 0.315427 | 0.352275 | 851.0 |
| 4 | RandomForestClassifier | 0.829909 | 0.819 | 0.640286 | 0.353608 | 0.460677 | 0.666666 | 0.324183 | 0.333333 | 982.0 |
| 2 | XGBClassifier | 0.807141 | 0.810556 | 0.609678 | 0.341237 | 0.436929 | 0.666773 | 0.319626 | 0.333546 | 964.0 |
| 1 | BaggingClassifier | 0.798675 | 0.804444 | 0.588075 | 0.324742 | 0.412576 | 0.643331 | 0.280681 | 0.286662 | 1061.0 |
| 6 | LogisticRegressionCV | 0.630022 | 0.785111 | 0.4 | 0.003608 | 0.007117 | 0.614175 | 0.146257 | 0.228351 | 527.0 |
| 9 | SVC | 0.620483 | 0.784444 | 0.0 | 0.0 | 0.0 | 0.612121 | 0.151249 | 0.224243 | 624.0 |
| 8 | KNeighborsClassifier | 0.747886 | 0.755556 | 0.35655 | 0.164948 | 0.225305 | 0.582598 | 0.127718 | 0.165197 | 932.0 |
| 5 | DecisionTreeClassifier | 0.746659 | 0.722111 | 0.3699 | 0.38866 | 0.376431 | 0.612395 | 0.193754 | 0.22479 | 995.0 |
| 7 | GaussianNB | 0.592914 | 0.623333 | 0.302993 | 0.57268 | 0.396037 | 0.582678 | 0.094384 | 0.165356 | 403.0 |


In [52]:
#2. SET2 - fs_set2

models_comparison(x_train[fs_set2], y_train, x_test[fs_set2], y_test, 10)

Time elapsed: 7039.1116988658905 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.756512 | 0.824 | 0.677886 | 0.352062 | 0.46304 | 0.672649 | 0.313163 | 0.345298 | 878.0 |
| 0 | AdaBoostClassifier | 0.739489 | 0.822333 | 0.679217 | 0.336598 | 0.449332 | 0.659531 | 0.284024 | 0.319062 | 894.0 |
| 4 | RandomForestClassifier | 0.817276 | 0.818444 | 0.635534 | 0.359278 | 0.454207 | 0.66164 | 0.316999 | 0.323279 | 1007.0 |
| 2 | XGBClassifier | 0.797982 | 0.811778 | 0.611099 | 0.353093 | 0.44716 | 0.656411 | 0.29522 | 0.312822 | 977.0 |
| 1 | BaggingClassifier | 0.79198 | 0.803333 | 0.561883 | 0.337113 | 0.423059 | 0.635546 | 0.26893 | 0.271092 | 1100.0 |
| 6 | LogisticRegressionCV | 0.621336 | 0.786333 | 0.552698 | 0.013918 | 0.026898 | 0.634759 | 0.178745 | 0.269518 | 526.0 |
| 9 | SVC | 0.608072 | 0.784444 | 0.0 | 0.0 | 0.0 | 0.609537 | 0.145561 | 0.219074 | 606.0 |
| 8 | KNeighborsClassifier | 0.709578 | 0.763222 | 0.382767 | 0.160309 | 0.225722 | 0.579468 | 0.125525 | 0.158935 | 973.0 |
| 7 | GaussianNB | 0.590798 | 0.746667 | 0.400299 | 0.344845 | 0.36944 | 0.582204 | 0.093177 | 0.164408 | 390.0 |
| 5 | DecisionTreeClassifier | 0.736395 | 0.726 | 0.372956 | 0.38866 | 0.381979 | 0.610591 | 0.19086 | 0.221182 | 1002.0 |


In [53]:
#3. SET3 - fs_set3

models_comparison(x_train[fs_set3], y_train, x_test[fs_set3], y_test, 10)

Time elapsed: 5632.3634288311005 seconds


|   | ML Algorithm | Train Accuracy | Test Accuracy | Precision | Recall | F1 score | ROC AUC | Kappa | GINI | Type II error |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | 0.827703 | 0.826111 | 0.694922 | 0.348969 | 0.464205 | 0.674049 | 0.379679 | 0.348099 | 1088.0 |
| 0 | AdaBoostClassifier | 0.816145 | 0.824111 | 0.690314 | 0.335567 | 0.450882 | 0.660911 | 0.355732 | 0.321822 | 1142.0 |
| 4 | RandomForestClassifier | 0.847189 | 0.815 | 0.634675 | 0.352062 | 0.449507 | 0.658014 | 0.335865 | 0.316028 | 1109.0 |
| 2 | XGBClassifier | 0.834021 | 0.810667 | 0.610752 | 0.341237 | 0.437355 | 0.653673 | 0.334275 | 0.307346 | 1147.0 |
| 1 | BaggingClassifier | 0.834147 | 0.800889 | 0.573999 | 0.325773 | 0.409159 | 0.63961 | 0.302885 | 0.279221 | 1187.0 |
| 7 | GaussianNB | 0.58426 | 0.787667 | 0.589918 | 0.052062 | 0.094959 | 0.581402 | 0.091618 | 0.162803 | 378.0 |
| 6 | LogisticRegressionCV | 0.622758 | 0.786222 | 0.403571 | 0.013402 | 0.025422 | 0.618214 | 0.154633 | 0.236427 | 552.0 |
| 9 | SVC | 0.603303 | 0.784444 | 0.0 | 0.0 | 0.0 | 0.607124 | 0.143568 | 0.214249 | 628.0 |
| 8 | KNeighborsClassifier | 0.69966 | 0.761 | 0.376889 | 0.164948 | 0.229108 | 0.580388 | 0.127224 | 0.160777 | 973.0 |
| 5 | DecisionTreeClassifier | 0.790783 | 0.726778 | 0.378659 | 0.389691 | 0.377648 | 0.616198 | 0.213998 | 0.232397 | 1069.0 |


### Conclusion

In [None]:
# Based on the performance of the machine learning models, fs_set3 shows the best results,
# especially for tree-based models. Hence SET3 is finalized for further processing.
# Models such as logistic regression, SVC, KNN and naive Bayes show a slight improvement
# after feature selection; however, it is not very significant.

# Selected significant features are --> 'LIMIT_BAL', 'PAY_AMT1', 'PAY_1', 'PAY_AMT2', 'AGE', 'SEX_2',
#          'PAY_3', 'MARRIAGE_2', 'BILL_AMT1', 'MARRIAGE_1', 'PAY_2', 'SEX_1', 'EDUCATION_2', 'EDUCATION_1'