# Feature Selection

Feature selection is the process of choosing the attributes that have the greatest impact on the **problem** you are solving, that is, selecting the best features for building models.

<hr>

## Why Feature Selection?
* Higher accuracy
* Simpler models
* Reducing overfitting risk

## Feature Selection Techniques
### Filter methods
- Independent of the model
- Based on statistical scores
- Easy to understand
- Good for early feature removal
- Low computational requirements
- **E.g.:** Chi-square, information gain, correlation score, correlation matrix with heatmap.
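As a minimal sketch of a filter method, the snippet below scores every feature independently against the target with an ANOVA F-test and keeps the three highest-scoring ones. The synthetic data, the choice of `f_classif`, and `k=3` are assumptions made for illustration; they are not taken from the customer dataset used later.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration only).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Score each feature on its own; no model is trained.
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X_demo, y_demo)

# One score per feature, sorted from most to least related to the target.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
top_features = list(X_demo.columns[selector.get_support()])
print(scores.head())
print(top_features)
```

Because each feature is scored in isolation, this kind of filter is cheap, but it cannot see interactions between features.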

### Wrapper methods
- Compare different subsets of features and run the model on them
- Basically a search problem
- **E.g.:** Best-first search, random hill-climbing algorithm, forward selection, backward elimination
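To make the "search problem" view concrete, here is a hand-written greedy forward selection loop. It is only a sketch: the helper `forward_select`, the logistic regression model, the synthetic data, and the 3-fold cross-validation are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

def forward_select(X, y, n_select):
    """Greedily add the feature that gives the best mean CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = {}
        for j in remaining:
            cols = selected + [j]  # candidate subset to evaluate
            model = LogisticRegression(max_iter=1000)
            scores[j] = cross_val_score(model, X[:, cols], y, cv=3).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X_demo, y_demo, n_select=3)
print(chosen)
```

Every candidate subset requires refitting the model, which is why wrapper methods cost far more than filter methods.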

### Embedded methods
- Find features that contribute most to the accuracy of the model while it is created
- Regularization is the most common method: it penalizes higher complexity
- **E.g.:** LASSO, Elastic Net, Ridge regression (Ridge shrinks coefficients but, unlike LASSO, does not set them exactly to zero).
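A small sketch of an embedded method: an L1-penalized regression (LASSO) drives some coefficients to zero while it is being fitted, and `SelectFromModel` keeps the features whose coefficients survive. The synthetic regression data and `alpha=1.0` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data (an assumption for illustration only).
X_reg, y_reg = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

# Fitting the model and selecting features happen in the same step.
embedded = SelectFromModel(Lasso(alpha=1.0))
embedded.fit(X_reg, y_reg)

coefs = embedded.estimator_.coef_          # coefficients of the fitted LASSO
n_zeroed = int((coefs == 0).sum())         # features the penalty switched off
X_reduced = embedded.transform(X_reg)      # only the retained columns
print(n_zeroed, X_reduced.shape)
```

A larger `alpha` penalizes complexity more strongly and typically leaves fewer features.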

## Before Feature Selection
- Clean data
- Divide into training and testing sets
- Feature scaling
- Run feature selection on the training set only, so that no information from the test set leaks into the choice of features.
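The order of these steps can be sketched as follows: split first, fit the selector on the training rows only, then apply the same column choice to the test rows. The toy DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one useful column, one constant column, one target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "informative": rng.normal(size=200),
    "constant": np.ones(200),
    "target": rng.integers(0, 2, size=200),
})
X_toy = toy.drop(columns="target")
y_toy = toy["target"]

# 1. Split before any selection takes place.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# 2. Fit the selector on the training set only.
selector = VarianceThreshold()
selector.fit(X_tr)
kept = X_tr.columns[selector.get_support()]

# 3. Apply the same column choice to both sets.
X_tr_sel = X_tr[kept]
X_te_sel = X_te[kept]
print(list(kept))
```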

<hr>

## Filter Methods

In [1]:
import pandas as pd

In [2]:
data = pd.read_parquet('./data/customer_satisfaction.parquet')
data.head()

ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [3]:
data.shape

(76020, 370)

In [4]:
data['TARGET'].value_counts()/len(data)

TARGET
0    0.960431
1    0.039569
Name: count, dtype: float64

In [5]:
data.describe()

statistic,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
count,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,...,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0
mean,-1523.199277,33.212865,86.208265,72.363067,119.529632,3.55913,6.472698,0.412946,0.567352,3.160715,...,7.935824,1.365146,12.21558,8.784074,31.505324,1.858575,76.026165,56.614351,117235.8,0.039569
std,39033.462364,12.956486,1614.757313,339.315831,546.266294,93.155749,153.737066,30.604864,36.513513,95.268204,...,455.887218,113.959637,783.207399,538.439211,2013.125393,147.786584,4040.337842,2852.579397,182664.6,0.194945
min,-999999.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5163.75,0.0
25%,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67870.61,0.0
50%,2.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,106409.2,0.0
75%,2.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,118756.3,0.0
max,238.0,105.0,210000.0,12888.03,21024.81,8237.82,11073.57,6600.0,6600.0,8237.82,...,50003.88,20385.72,138831.63,91778.73,438329.22,24650.01,681462.9,397884.3,22034740.0,1.0


<hr>

**Check for Constant features**

In [6]:
data.columns[(data == data.iloc[0]).all()]

Index(['ind_var2_0', 'ind_var2', 'ind_var27_0', 'ind_var28_0', 'ind_var28',
       'ind_var27', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var27_0',
       'num_var28_0', 'num_var28', 'num_var27', 'num_var41', 'num_var46_0',
       'num_var46', 'saldo_var28', 'saldo_var27', 'saldo_var41', 'saldo_var46',
       'imp_amort_var18_hace3', 'imp_amort_var34_hace3',
       'imp_reemb_var13_hace3', 'imp_reemb_var33_hace3',
       'imp_trasp_var17_out_hace3', 'imp_trasp_var33_out_hace3',
       'num_var2_0_ult1', 'num_var2_ult1', 'num_reemb_var13_hace3',
       'num_reemb_var33_hace3', 'num_trasp_var17_out_hace3',
       'num_trasp_var33_out_hace3', 'saldo_var2_ult1',
       'saldo_medio_var13_medio_hace3'],
      dtype='object')

In [7]:
len(data.columns[(data == data.iloc[0]).all()])

34
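An alternative way to spot constant columns is to count distinct values per column. This is a sketch on a hypothetical toy frame, not a replacement for the check above. One difference to keep in mind: comparing against the first row never matches on `NaN` (since `NaN != NaN`), whereas `nunique()` ignores `NaN` by default.

```python
import pandas as pd

# Hypothetical toy frame: columns "b" and "c" never change.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 0, 0, 0],
    "c": [5.0, 5.0, 5.0, 5.0],
    "d": [1, 1, 2, 1],
})

# A column is constant when it holds exactly one distinct value.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)  # ['b', 'c']
```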

**Use scikit-learn**
* To remove constant and quasi-constant features
* `VarianceThreshold`: a feature selector that removes all low-variance features

In [8]:
from sklearn.feature_selection import VarianceThreshold

In [9]:
sel = VarianceThreshold()
sel.fit_transform(data)

array([[2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 3.92051700e+04, 0.00000000e+00],
       [2.00000000e+00, 3.40000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 4.92780300e+04, 0.00000000e+00],
       [2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 6.73337700e+04, 0.00000000e+00],
       ...,
       [2.00000000e+00, 2.30000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 7.40281500e+04, 0.00000000e+00],
       [2.00000000e+00, 2.50000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 8.42781600e+04, 0.00000000e+00],
       [2.00000000e+00, 4.60000000e+01, 0.00000000e+00, ...,
        0.00000000e+00, 1.17310979e+05, 0.00000000e+00]])

In [10]:
# sel -> features that are not constant
len(data.columns[sel.get_support()])

336

In [11]:
sel.get_feature_names_out()

array(['var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
       'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
       'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
       'imp_op_var40_efect_ult3', 'imp_op_var40_ult1',
       'imp_op_var41_comer_ult1', 'imp_op_var41_comer_ult3',
       'imp_op_var41_efect_ult1', 'imp_op_var41_efect_ult3',
       'imp_op_var41_ult1', 'imp_op_var39_efect_ult1',
       'imp_op_var39_efect_ult3', 'imp_op_var39_ult1',
       'imp_sal_var16_ult1', 'ind_var1_0', 'ind_var1', 'ind_var5_0',
       'ind_var5', 'ind_var6_0', 'ind_var6', 'ind_var8_0', 'ind_var8',
       'ind_var12_0', 'ind_var12', 'ind_var13_0', 'ind_var13_corto_0',
       'ind_var13_corto', 'ind_var13_largo_0', 'ind_var13_largo',
       'ind_var13_medio_0', 'ind_var13_medio', 'ind_var13', 'ind_var14_0',
       'ind_var14', 'ind_var17_0', 'ind_var17', 'ind_var18_0',
       'ind_var18', 'ind_var19', 'ind_var20_0', 'ind_var20',
       'ind_var24_0', 'ind_var24', 'ind_va

**Quasi-constant features**
- Features that take the same value for the great majority of the observations

In [12]:
# Variance threshold = 0.01 (for a 0/1 feature, roughly 99% of rows share one value)
selection = VarianceThreshold(threshold=0.01)
selection.fit(data)

VarianceThreshold(threshold=0.01)

In [13]:
len(selection.get_feature_names_out())

273

In [14]:
quasi_constant = [col for col in data.columns if col not in selection.get_feature_names_out()]
len(quasi_constant)

97
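To see how a variance threshold relates to "same value for most rows", recall that a 0/1 feature with a share `p` of ones has variance `p * (1 - p)`. The sketch below, on hypothetical toy columns, flags a flag that is 1 in only 0.5% of rows and also reports the share of the most frequent value per column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data (column names are assumptions).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "balanced_flag": rng.integers(0, 2, size=n),
    "rare_flag": (np.arange(n) < 5).astype(int),  # 0.5% ones, variance ~0.005
    "amount": rng.normal(loc=100, scale=20, size=n),
})

# Columns whose variance does not exceed the threshold are dropped.
selector = VarianceThreshold(threshold=0.01).fit(toy)
quasi_constant_toy = toy.columns[~selector.get_support()].tolist()

# Share of rows taken by the most frequent value in each column.
dominant_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(quasi_constant_toy)
print(dominant_share)
```

Variance depends on the scale of a feature, so a fixed threshold such as 0.01 is easiest to interpret for binary or similarly scaled columns.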

<hr>

### Correlation with color
- `corr()` computes the pairwise correlation of columns, excluding NA/null values.
    - For better readability use: `.style.background_gradient(cmap='Blues')`
- Good features are highly correlated with the target.
- Ideally, features should be correlated with the target but uncorrelated with each other.
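One way to act on the first half of that rule is to rank features by their absolute correlation with the target. The sketch below uses a hypothetical toy frame; the column names `signal` and `noise` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "signal" follows the target, "noise" does not.
rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "signal": target + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "TARGET": target,
})

# Absolute correlation of every feature with the target, strongest first.
target_corr = (
    toy.drop(columns="TARGET")
       .corrwith(toy["TARGET"])
       .abs()
       .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low score here does not prove that a feature is useless.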

In [15]:
train = data[selection.get_feature_names_out()]
train.shape

(76020, 273)

In [20]:
# train.corr().style.background_gradient(cmap='Blues')

#### Find correlated features
- The goal is to find and remove correlated features
- Calculate correlation matrix (assign it to `corr_matrix`)
- A feature is correlated with at least one earlier feature if the following is true
```Python
feature = 'imp_op_var39_comer_ult1'
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)]>0.8).any()
```
- Get all the correlated features by using a list comprehension

In [21]:
corr_matrix = train.corr()

In [24]:
feature = 'imp_op_var39_comer_ult3'
(corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)]>0.8).any()

True

In [25]:
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)]>0.8).any()]

In [26]:
len(corr_features)

149
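The check above compares the signed correlation with 0.8, so strongly negatively correlated pairs are not flagged. A common variation, sketched here on hypothetical toy data, takes absolute values and looks only at the upper triangle of the matrix so that each pair is examined once. Whether negative correlations should count is an assumption that depends on the use case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two near-duplicates of "a" and one unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base + rng.normal(scale=0.05, size=300),
    "a_negated": -base + rng.normal(scale=0.05, size=300),
    "b": rng.normal(size=300),
})

abs_corr = toy.corr().abs()

# Keep entries strictly above the diagonal; everything else becomes NaN.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# A column is dropped if it correlates strongly with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

As with the list comprehension above, the first feature of each correlated group is kept and the later ones are dropped.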

<hr>

## Wrapper Method
### Forward Selection
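- `SequentialFeatureSelector`: sequential feature selection for classification and regression
- The code below imports scikit-learn's implementation from `sklearn.feature_selection`; `mlxtend` offers an alternative implementation of the same name (install with `!pip install mlxtend`)
- To prepare, remove all quasi-constant and correlated features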
```Python
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']
```
- Create a small training set
```Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.9, random_state=42)
```
- Use the `SVC` model with the `SequentialFeatureSelector`

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector as SFS

In [29]:
X = data.drop(['TARGET'] + quasi_constant + corr_features, axis=1)
y = data['TARGET']

In [31]:
len(X.columns)

123

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.9,random_state=42)

In [37]:
sfs = SFS(SVC(),n_features_to_select=2,cv=2,n_jobs=8)

In [38]:
sfs.fit(X_train,y_train)

SequentialFeatureSelector(cv=2, estimator=SVC(), n_features_to_select=2,
                          n_jobs=8)
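Once a selector like this is fitted, the chosen columns can be read back with `get_support()`. The sketch below repeats the same pattern on a small synthetic dataset so that it runs quickly; the data, the column names, and the explicit `direction="forward"` (the scikit-learn default) are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Small synthetic dataset (an assumption for illustration only).
X_arr, y_small = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0,
    random_state=0,
)
X_small = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Forward selection: start empty, add one feature at a time.
sfs_demo = SequentialFeatureSelector(
    SVC(), n_features_to_select=2, direction="forward", cv=2
)
sfs_demo.fit(X_small, y_small)

# Boolean mask -> names of the selected columns.
selected = X_small.columns[sfs_demo.get_support()].tolist()
X_selected = sfs_demo.transform(X_small)
print(selected, X_selected.shape)
```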

**Good Score