#### My personal guide to feature selection

**Feature selection**, also called **variable selection**, is the process of selecting a subset of relevant features (variables) from all the features in a dataset to use when building a machine learning model.

Feature selection techniques are of 3 types:
1. Filter Methods
2. Wrapper Methods
3. Embedded Methods
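
To make the three families concrete, here is a minimal sketch with one sklearn selector per family. The synthetic data, the choice of estimators and the number of features to keep are all assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 200 rows, 8 features
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

# Filter: score each feature against the target, independently of any model
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest feature
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

# Embedded: selection comes from the model's own training (feature importances)
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

for name, selector in [("filter", filter_selector),
                       ("wrapper", wrapper_selector),
                       ("embedded", embedded_selector)]:
    # get_support() is a boolean mask over the columns; True means "kept"
    print(name, int(selector.get_support().sum()))
```

All three expose the same get_support() mask, so the rest of a pipeline does not need to know which family was used.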

**Filter Methods**
They consist of the following techniques:
1. Basic Methods
2. Univariate Methods
3. Information gain
4. Fisher Score
5. ANOVA F-value for feature selection
6. Correlation matrix with heatmap

1. **Basic Methods**
- Under basic methods we remove constant and quasi-constant features

1.1 **Remove constant features**

- Constant features are those that show the same value for all the observations in the dataset. They provide no information that would help an ML model predict the target
- To identify constant features we use the VarianceThreshold transformer from sklearn (sklearn.feature_selection)
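
As a quick cross-check that does not need sklearn, a constant column is simply one with a single unique value. The sketch below uses plain pandas on a made-up toy frame (the column names and values are assumptions); it is a different technique from VarianceThreshold, but it should flag the same columns on numeric data.

```python
import pandas as pd

# Toy data, assumed for illustration only
toy = pd.DataFrame({
    "age": [23, 35, 41, 52],
    "country": [1, 1, 1, 1],          # constant
    "balance": [0.0, 10.5, 0.0, 3.2],
    "flag": [0, 0, 0, 0],             # constant
})

# A column is constant when it holds exactly one distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['country', 'flag']
```

This approach also works on text columns, whereas VarianceThreshold expects numeric input.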

In [3]:
# Remove constant features

# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [13]:
# Import the dataset (first 35,000 training rows and first 15,000 test rows)
# Forward slashes work on Windows, macOS and Linux

X_train = pd.read_csv("datasets/santander-customer-satisfaction/train.csv", nrows = 35000)
X_test = pd.read_csv("datasets/santander-customer-satisfaction/test.csv", nrows = 15000)

In [14]:
# Drop the target label from X_train

X_train.drop(labels=["TARGET"], axis = 1, inplace = True)

In [15]:
# Checking the shape of training and test sets

X_train.shape, X_test.shape

((35000, 370), (15000, 370))

- VarianceThreshold from sklearn removes all features whose variance doesn't meet a given threshold. By default (threshold = 0), it removes all zero-variance features, i.e., features that have the same value in all samples
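
Before running this on the real data, the behaviour can be checked on a tiny example. This is a minimal sketch on a made-up frame (column names and values are assumptions), showing which columns survive the default threshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data, assumed for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [5.0, 5.0, 5.0, 5.0],   # same value everywhere, so zero variance
    "c": [0.0, 1.0, 0.0, 1.0],
})

selector = VarianceThreshold()   # default threshold is 0.0
selector.fit(toy)

# get_support() returns a boolean mask: True for columns that are kept
kept = toy.columns[selector.get_support()].tolist()
print(kept)  # ['a', 'c']

# Per-column population variance, for intuition; the zero marks the constant column
print(np.var(toy.to_numpy(), axis=0))
```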

In [16]:
# VarianceThreshold to find constant features

from sklearn.feature_selection import VarianceThreshold

cons_features = VarianceThreshold(threshold = 0)

cons_features.fit(X_train) 



VarianceThreshold(threshold=0)

In [19]:
# Count the non-constant features

print(sum(cons_features.get_support()))

# or

print(len(X_train.columns[cons_features.get_support()]))

319
319


In [21]:
# Let's list the constant features: 51 features have zero variance

# get_support() is True for kept columns, so invert the mask to get the dropped ones
constant_features = X_train.columns[~cons_features.get_support()].tolist()
print(len(constant_features))
constant_features

51


['ind_var2_0',
 'ind_var2',
 'ind_var18_0',
 'ind_var18',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var34_0',
 'ind_var34',
 'ind_var41',
 'ind_var46_0',
 'ind_var46',
 'num_var18_0',
 'num_var18',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var34_0',
 'num_var34',
 'num_var41',
 'num_var46_0',
 'num_var46',
 'saldo_var18',
 'saldo_var28',
 'saldo_var27',
 'saldo_var34',
 'saldo_var41',
 'saldo_var46',
 'delta_imp_amort_var18_1y3',
 'delta_imp_amort_var34_1y3',
 'imp_amort_var18_hace3',
 'imp_amort_var18_ult1',
 'imp_amort_var34_hace3',
 'imp_amort_var34_ult1',
 'imp_reemb_var13_hace3',
 'imp_reemb_var17_hace3',
 'imp_reemb_var33_hace3',
 'imp_trasp_var17_out_hace3',
 'imp_trasp_var33_out_hace3',
 'num_var2_0_ult1',
 'num_var2_ult1',
 'num_reemb_var13_hace3',
 'num_reemb_var17_hace3',
 'num_reemb_var33_hace3',
 'num_trasp_var17_out_hace3',
 'num_trasp_var33_out_hace3',
 'saldo_var2_ult1',
 'saldo_medio_var13_medio_hace3',
 'saldo_medio_var2

- There are 51 columns that show a single value for all the observations in the dataset.
- Now we transform both the training and test sets, which drops these columns

In [22]:
X_train = cons_features.transform(X_train)
X_test = cons_features.transform(X_test)

In [23]:
# New shape: we started with 370 columns and removed 51, leaving 319 columns

X_train.shape, X_test.shape

((35000, 319), (15000, 319))
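
One thing to keep in mind: by default transform returns a NumPy array, so the column names are lost. If the names are needed later, one option is to index the DataFrame with the support mask instead. A minimal sketch on made-up frames (names and values are assumptions):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy train/test frames, assumed for illustration only
train = pd.DataFrame({"x1": [1, 2, 3], "x2": [7, 7, 7], "x3": [0, 1, 0]})
test = pd.DataFrame({"x1": [4, 5], "x2": [7, 8], "x3": [1, 1]})

# Fit on the training set only, then reuse the same column selection for the test set
selector = VarianceThreshold(threshold=0).fit(train)
kept_columns = train.columns[selector.get_support()]

train_reduced = train[kept_columns]   # still a DataFrame, names preserved
test_reduced = test[kept_columns]

print(list(train_reduced.columns))  # ['x1', 'x3']
print(train_reduced.shape, test_reduced.shape)
```

Note that x2 is dropped from the test set as well even though it varies there, because the decision is made on the training data.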

1.2 **Remove quasi-constant features**

- Quasi-constant features are those in which almost all observations share the same value, for example 99.8% of the rows hold one value and only 0.2% differ. Such features usually carry very little information for predicting the target
- Here too we use VarianceThreshold from sklearn, this time with a threshold above zero
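
To see why a small non-zero threshold catches such features: a 0/1 feature that is 1 in a fraction p of the rows has variance p(1 - p), so with p = 0.002 the variance is about 0.002. The sketch below uses synthetic data, and the threshold of 0.01 is an assumed value for illustration, not a rule.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 1000
rng = np.random.default_rng(0)

# Quasi-constant binary feature: 998 zeros and 2 ones (99.8% identical values)
quasi = np.zeros(n)
quasi[:2] = 1

# A feature that varies freely, for contrast
informative = rng.normal(size=n)

X = np.column_stack([quasi, informative])

# Variance of the quasi-constant column is p * (1 - p) = 0.002 * 0.998, about 0.002
print(quasi.var())

# Threshold of 0.01 is an assumption; anything with variance at or below it is dropped
selector = VarianceThreshold(threshold=0.01).fit(X)
print(selector.get_support())  # quasi-constant column dropped, the other kept
```

Because the threshold is applied to the raw variance, it depends on each feature's scale: a quasi-constant feature with a few very large values can still have a high variance and slip through.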
