# Feature Selection

Feature selection is the process of choosing the attributes that have the greatest impact on the **problem** you are solving, that is, selecting the best features for building models.

## Why Feature Selection?
* Higher accuracy
* Simpler models
* Lower risk of overfitting

## Feature Selection Techniques
### Filter methods
- Independent of the model
- Based on statistical scores
- Easy to understand
- Good for early feature removal
- Low computational requirements
- **E.g.:** Chi-square, information gain, correlation score, correlation matrix with heatmap.
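As an illustration of the filter idea, the sketch below scores each feature by its absolute correlation with the target and keeps those above a cutoff. The synthetic data, the column names `signal` and `noise`, and the `threshold` value are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: 'signal' drives the target, 'noise' is unrelated to it.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.1 * rng.normal(size=n) > 0).astype(int)

X = pd.DataFrame({"signal": signal, "noise": noise})
y = pd.Series(target, name="target")

# Filter step: score every feature on its own, without training a model.
scores = X.corrwith(y).abs().sort_values(ascending=False)

# Keep features whose score exceeds an illustrative threshold.
threshold = 0.3
selected = scores[scores > threshold].index.tolist()

print(scores)
print(selected)
```

Because no model is trained, this kind of scoring is cheap, which is why it suits an early pass over many features.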

### Wrapper methods
- Train the model on different subsets of features and compare the results
- Essentially a search problem over feature subsets
- **E.g.:** Best-first search, random hill-climbing, forward selection, backward elimination
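Forward selection, one of the wrapper approaches listed above, can be sketched with scikit-learn's `SequentialFeatureSelector`. Using scikit-learn, logistic regression as the estimator, and the synthetic dataset are all assumptions for illustration; the notebook itself does not prescribe them.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 2 are informative.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)

# Forward selection: start with no features and greedily add the one
# that most improves the cross-validated score, until 2 are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
selector.fit(X, y)

mask = selector.get_support()
selected_idx = [i for i, keep in enumerate(mask) if keep]
print(selected_idx)
```

Setting `direction="backward"` would instead start from all features and remove them one at a time (backward elimination). Either way the model is refitted many times, so wrapper methods cost far more than filter methods.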

### Embedded methods
- Identify the features that contribute most to the model's accuracy while the model is being trained
- Regularization is the most common approach; it penalizes model complexity
- **E.g.:** LASSO, Elastic Net, Ridge regression (note that Ridge shrinks coefficients but does not set them exactly to zero).
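To illustrate the embedded approach, the sketch below fits a LASSO model and treats features with non-zero coefficients as selected. The synthetic regression data and `alpha=0.1` are assumptions chosen for this example, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features affect y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# The L1 penalty drives coefficients of unhelpful features to exactly zero,
# so selection happens as a side effect of training.
model = Lasso(alpha=0.1)
model.fit(X, y)

selected = np.flatnonzero(model.coef_ != 0).tolist()
print(model.coef_)
print(selected)
```

A larger `alpha` penalizes complexity more strongly and generally leaves fewer non-zero coefficients.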

## Before Feature Selection
- Clean data
- Divide into training and testing sets
- Scale the features
- Perform feature selection on the training set only, to avoid overfitting and leaking information from the test set.
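The preparation steps above might look like the following sketch. It assumes scikit-learn, uses synthetic data in place of a cleaned dataset, and picks `SelectKBest` with `f_classif` purely as an example selector; the key point is that the scaler and the selector are fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for an already cleaned dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, random_state=0
)

# Split first, so the test set stays unseen by every later step.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit the selector on the training data only, then apply it to both splits.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_train_s, y_train)
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)

print(X_train_sel.shape, X_test_sel.shape)
```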

<hr>

## Filter Methods

In [1]:
import pandas as pd

In [2]:
data = pd.read_parquet('./data/customer_satisfaction.parquet')
data.head()

| ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | imp_op_var40_ult1 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 | 0 |
| 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 | 0 |
| 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 | 0 |
| 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.97 | 0 |
| 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |


In [3]:
data.shape

(76020, 370)

In [4]:
data['TARGET'].value_counts()/len(data)

TARGET
0    0.960431
1    0.039569
Name: count, dtype: float64
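The class proportions above can also be obtained directly with `value_counts(normalize=True)`, which avoids dividing by the length by hand. A small sketch on a toy Series (the values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy labels: 24 zeros and 1 one (hypothetical, not the real TARGET column).
labels = pd.Series([0] * 24 + [1], name="TARGET")

# normalize=True returns relative frequencies instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)
```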