# Backward Feature Elimination 

## Remove Low variance features
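### Why use it?
A feature with a very low variance is almost constant, so it carries little information and its impact on the target will be negligible.
### What does it do?
The variance measures how far a set of numbers is spread out from their average value. Features whose variance is zero (or below a chosen threshold) are candidates for removal. Because the variance depends on the unit of each feature, only values close to zero are really meaningful here.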
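
As a small illustration of that definition, the sketch below computes the variance by hand and compares it with pandas. The toy columns and the helper name `sample_variance` are made up for this example; it assumes the default pandas behaviour of dividing by n - 1 (sample variance).

```python
import pandas as pd

# Toy data (made up for illustration): one constant and one varying column.
toy = pd.DataFrame({
    "constant": [3, 3, 3, 3],
    "varying": [1, 2, 3, 6],
})

def sample_variance(values):
    """Average squared distance to the mean, using the n - 1 denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

# Hand-computed variance of each column.
manual = {name: sample_variance(toy[name].tolist()) for name in toy.columns}
print(manual)

# pandas gives the same values with DataFrame.var().
print(toy.var())
```

The constant column has a variance of exactly 0, which is the situation the test in this section is designed to detect.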
### How to do it?

In [5]:
import pandas as pd

# Load the enhanced data with no label and sort the columns by variance.

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

avocados_enhanced_no_label.var().sort_values()

# Which column(s) can be omitted according to this test?


day_of_week                    0.000000e+00
average_price                  1.621484e-01
type_id                        2.500137e-01
year                           8.834843e-01
month                          1.249007e+01
day                            7.702311e+01
week_of_year                   2.397525e+02
region_id                      2.428544e+02
day_of_year                    1.166599e+04
date_id                        8.729794e+07
total_extra_large_bags_sold    3.130385e+08
total_plu_4770_sold            1.154853e+10
total_large_bags_sold          5.951939e+10
total_small_bags_sold          5.567824e+11
total_bags_sold                9.726741e+11
total_plu_4225_sold            1.449906e+12
total_plu_4046_sold            1.600197e+12
total_volume_sold              1.192698e+13
dtype: float64

As we can see, day_of_week has a variance of zero: it never changes across the dataset, so it cannot help to predict the average price and can be dropped.
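
One possible way to automate this removal is scikit-learn's `VarianceThreshold`. The sketch below is not part of the original notebook: it uses a small made-up stand-in for the avocado data and a threshold of 0.0, which only removes constant columns.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy stand-in for the avocado data (values are made up for illustration).
toy = pd.DataFrame({
    "day_of_week": [6, 6, 6, 6],            # constant column
    "average_price": [1.1, 1.4, 0.9, 1.3],
    "type_id": [0, 1, 0, 1],
})

# threshold=0.0 keeps only the features whose variance is strictly greater than 0.
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

# get_support() returns a boolean mask aligned with the columns.
kept_columns = list(toy.columns[selector.get_support()])
dropped_columns = list(toy.columns[~selector.get_support()])
print("kept:", kept_columns)
print("dropped:", dropped_columns)
```

A higher threshold would also remove nearly constant features, but its value only makes sense once the features are on comparable scales.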

## Remove features with a high missing value ratio
### Why use it?
No dataset is perfect: values are sometimes missing (a sensor that stops working, a bug, ...).
We can decide that, above a given threshold, a feature with too many missing values isn't reliable.
### What does it do?
We compute the percentage of missing values for each column. If this percentage is too high for a given column, we remove that column from our dataset.
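
The rule described above can be sketched as follows. The toy columns and the 50% threshold (`MAX_MISSING_PERCENT`) are assumptions chosen for the example, not values taken from the avocado dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): "sensor_b" is missing in 3 rows out of 4.
toy = pd.DataFrame({
    "sensor_a": [1.0, 2.0, np.nan, 4.0],
    "sensor_b": [np.nan, np.nan, np.nan, 0.5],
    "sensor_c": [7.0, 8.0, 9.0, 10.0],
})

# Percentage of missing values per column (the mean of a boolean mask is a ratio).
missing_percent = toy.isnull().mean() * 100

# Hypothetical threshold: drop any column with more than 50% of missing values.
MAX_MISSING_PERCENT = 50
columns_to_drop = list(missing_percent[missing_percent > MAX_MISSING_PERCENT].index)
toy_reduced = toy.drop(columns=columns_to_drop)

print(missing_percent)
print(columns_to_drop)
```

The right threshold depends on the dataset and on how the remaining missing values will be handled afterwards (imputation, row removal, ...).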
### How to do it?

In [4]:
import pandas as pd

# Load the enhanced data with no label and calculate the percentage of missing values per column

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

avocados_enhanced_no_label.isnull().sum()/len(avocados_enhanced_no_label)*100

# Which column(s) can be omitted according to this test?

average_price                  0.0
total_volume_sold              0.0
total_plu_4046_sold            0.0
total_plu_4225_sold            0.0
total_plu_4770_sold            0.0
total_bags_sold                0.0
total_small_bags_sold          0.0
total_large_bags_sold          0.0
total_extra_large_bags_sold    0.0
year                           0.0
region_id                      0.0
type_id_x                      0.0
month_x                        0.0
month_label_x                  0.0
day_x                          0.0
day_of_week_x                  0.0
day_label_x                    0.0
day_of_year_x                  0.0
week_of_year_x                 0.0
date_id                        0.0
month_y                        0.0
month_label_y                  0.0
day_y                          0.0
day_of_week_y                  0.0
day_label_y                    0.0
day_of_year_y                  0.0
week_of_year_y                 0.0
type_id_y                      0.0
month               

Fortunately for us, none of our features contains missing values.

## Remove highly correlated features
### Why use it?
If two features are highly correlated with each other, it means that they carry similar information.
We can decide to remove one of them, in order to reduce the number of features carrying redundant information.
### What does it do?

We first create the absolute correlation matrix, then, for each column, we look for high correlation values with the previous columns.
If the current column is highly correlated with a previous one (except the target), there is no need to keep it.
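
A minimal sketch of this procedure on made-up data, assuming a threshold of 0.90: column "b" is an exact multiple of "a", so it is flagged, while "c" is kept.

```python
import numpy as np
import pandas as pd

# Toy data (made up): "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr_matrix = toy.corr().abs()

# Keep only the values strictly above the diagonal, so each pair appears once.
upper_mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper_matrix = corr_matrix.where(upper_mask)

# A column is dropped when it is highly correlated with any earlier column.
to_drop = [column for column in upper_matrix.columns if (upper_matrix[column] > 0.90).any()]
print(upper_matrix)
print(to_drop)  # ['b']
```

Because only earlier columns are inspected, the column order decides which member of a correlated pair survives: here "a" is kept because it comes first.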

### How to do it?

In [1]:
import pandas as pd

# Load the enhanced data with no label and display the absolute correlation matrix

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

avocados_enhanced_no_label.corr().abs()

Unnamed: 0,average_price,total_volume_sold,total_plu_4046_sold,total_plu_4225_sold,total_plu_4770_sold,total_bags_sold,total_small_bags_sold,total_large_bags_sold,total_extra_large_bags_sold,year,region_id,type_id,month,day,day_of_week,day_of_year,week_of_year,date_id
average_price,1.0,0.192752,0.208317,0.172928,0.179446,0.177088,0.17473,0.17294,0.117592,0.093197,0.011716,0.615845,0.162409,0.027386,,0.16338,0.146383,0.099925
total_volume_sold,0.192752,1.0,0.977863,0.974181,0.872202,0.963047,0.967238,0.88064,0.747157,0.017193,0.174176,0.232434,0.024689,0.009747,,0.025374,0.024217,0.016353
total_plu_4046_sold,0.208317,0.977863,1.0,0.92611,0.833389,0.920057,0.92528,0.838645,0.699377,0.003353,0.192073,0.225819,0.025803,0.010159,,0.026572,0.026268,0.002388
total_plu_4225_sold,0.172928,0.974181,0.92611,1.0,0.887855,0.905787,0.916031,0.810015,0.688809,0.009559,0.145726,0.232289,0.022108,0.012393,,0.023008,0.019965,0.010464
total_plu_4770_sold,0.179446,0.872202,0.833389,0.887855,1.0,0.792314,0.802733,0.698471,0.679861,0.036531,0.095252,0.210027,0.033424,0.009009,,0.033992,0.032542,0.038023
total_bags_sold,0.177088,0.963047,0.920057,0.905787,0.792314,1.0,0.994335,0.943009,0.804233,0.071552,0.175256,0.217788,0.022724,0.004988,,0.022975,0.023189,0.071117
total_small_bags_sold,0.17473,0.967238,0.92528,0.916031,0.802733,0.994335,1.0,0.902589,0.806845,0.063915,0.164702,0.220535,0.023126,0.00387,,0.023306,0.023766,0.06342
total_large_bags_sold,0.17294,0.88064,0.838645,0.810015,0.698471,0.943009,0.902589,1.0,0.710858,0.087891,0.198768,0.193177,0.020187,0.008352,,0.020654,0.019949,0.087647
total_extra_large_bags_sold,0.117592,0.747157,0.699377,0.688809,0.679861,0.804233,0.806845,0.710858,1.0,0.081033,0.082281,0.175483,0.012969,0.000319,,0.013014,0.015233,0.081029
year,0.093197,0.017193,0.003353,0.009559,0.036531,0.071552,0.063915,0.087891,0.081033,1.0,5.5e-05,3.2e-05,0.17705,0.004475,,0.175635,0.171874,0.999306


In [3]:
import pandas as pd
import numpy as np

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

avocado_corr_matrix = avocados_enhanced_no_label.corr().abs()

# Keep only the upper triangle of the correlation matrix, so that each pair of columns appears once

upper_matrix = avocado_corr_matrix.where(np.triu(np.ones(avocado_corr_matrix.shape), k=1).astype(bool))

# Create a list containing the columns to drop: those whose correlation with an earlier column is > 0.9
to_drop = [column for column in upper_matrix.columns if any(upper_matrix[column] > 0.90)]

print(upper_matrix)
print(to_drop)

                             average_price  total_volume_sold  \
average_price                          NaN           0.192752   
total_volume_sold                      NaN                NaN   
total_plu_4046_sold                    NaN                NaN   
total_plu_4225_sold                    NaN                NaN   
total_plu_4770_sold                    NaN                NaN   
total_bags_sold                        NaN                NaN   
total_small_bags_sold                  NaN                NaN   
total_large_bags_sold                  NaN                NaN   
total_extra_large_bags_sold            NaN                NaN   
year                                   NaN                NaN   
region_id                              NaN                NaN   
type_id_x                              NaN                NaN   
month_x                                NaN                NaN   
day_x                                  NaN                NaN   
day_of_week_x            

In [2]:
import pandas as pd
import numpy as np

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

# Order the columns by correlation with the target; here we choose 'average_price' as the target
avocado_features_names_ordered_by_corr = list(avocados_enhanced_no_label.corr().abs()['average_price'].sort_values(ascending=False).index)

# Create a new correlation matrix with the columns in that order, then keep only its upper triangle.
# With this order, when two features are highly correlated, the one less correlated with the target is dropped.

avocado_corr_matrix = avocados_enhanced_no_label[avocado_features_names_ordered_by_corr].corr().abs()
print(avocado_corr_matrix)
upper_matrix = avocado_corr_matrix.where(np.triu(np.ones(avocado_corr_matrix.shape), k=1).astype(bool))

# Create a list containing the columns to drop: those whose correlation with an earlier column is > 0.9

to_drop = [column for column in upper_matrix.columns if any(upper_matrix[column] > 0.90)]

print(upper_matrix)
print(to_drop)

                             average_price  type_id_y  type_id_x  \
average_price                     1.000000   0.615845   0.615845   
type_id_y                         0.615845   1.000000   1.000000   
type_id_x                         0.615845   1.000000   1.000000   
total_plu_4046_sold               0.208317   0.225819   0.225819   
total_volume_sold                 0.192752   0.232434   0.232434   
total_plu_4770_sold               0.179446   0.210027   0.210027   
total_bags_sold                   0.177088   0.217788   0.217788   
total_small_bags_sold             0.174730   0.220535   0.220535   
total_large_bags_sold             0.172940   0.193177   0.193177   
total_plu_4225_sold               0.172928   0.232289   0.232289   
day_of_year                       0.163380   0.000085   0.000085   
day_of_year_y                     0.163380   0.000085   0.000085   
day_of_year_x                     0.163380   0.000085   0.000085   
month_x                           0.162409   0.0

## Recursive feature elimination
### Why use it?
### What does it do?

https://scikit-learn.org/stable/modules/feature_selection.html#rfe

How the RFE algorithm works:
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/feature_selection/_rfe.py#L38
RFE fits the estimator (e.g. LinearRegression or LinearSVC), then ranks the features by the importance the estimator assigns to them (its coefficients, coef_).
The feature with the lowest importance is removed, and the process is repeated on the remaining features until the requested number of features is left.

The elimination isn't based on the model score, but on the coefficients computed by the model.
The features chosen by RFE therefore depend on the estimator, and the choice is not always relevant.
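
To make the loop concrete, here is a simplified re-implementation compared with scikit-learn's `RFE`. It is only a sketch: the helper `manual_rfe` is hypothetical, the data is synthetic (generated with `make_regression`, not the avocado dataset), and it assumes one feature is removed per iteration, which is the default `step` of `RFE`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (not the avocado dataset): 6 features, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)

def manual_rfe(X, y, n_features_to_select):
    """Repeatedly refit the model and drop the feature with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        estimator = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(estimator.coef_)))
        del remaining[weakest]
    return sorted(remaining)

manual_selection = manual_rfe(X, y, n_features_to_select=3)

# scikit-learn's RFE with its default step of one feature per iteration.
rfe_selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
sklearn_selection = sorted(np.flatnonzero(rfe_selector.support_).tolist())

print(manual_selection, sklearn_selection)
```

Note that nothing in the loop looks at a validation score: a feature is removed only because its coefficient is the smallest, which is why the scale of the features matters.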


### How to do it?

In [17]:
# RFE for regression or classification

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load the enhanced data with no label and create 2 sub-datasets:
# one with the target (for example 'average_price')
# one with the observed features
avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)
avocado_features = avocados_enhanced_no_label.drop('average_price', axis=1)
avocado_target = avocados_enhanced_no_label['average_price']

# Choose and create the estimator (regressor or classifier, depending on the target)
# Example: LinearRegression from sklearn.linear_model, but try others
linear_regression_estimator = LinearRegression()
# Create the selector with the number of features to select
rfe_selector = RFE(linear_regression_estimator, n_features_to_select=4)
# Fit the selector on the data
fitted_rfe_selector = rfe_selector.fit(avocado_features, avocado_target)


# Order the features by their ranking (1 = selected) and display them
features_kept = pd.DataFrame(list(avocado_features.columns), columns=["features"])
features_kept['ranking'] = fitted_rfe_selector.ranking_
features_kept.sort_values('ranking', inplace=True)

print(features_kept)



                       features  ranking
8                          year        1
12                          day        1
10                      type_id        1
11                        month        1
14                  day_of_year        2
7   total_extra_large_bags_sold        3
5         total_small_bags_sold        4
4               total_bags_sold        5
6         total_large_bags_sold        6
15                 week_of_year        7
9                     region_id        8
2           total_plu_4225_sold        9
0             total_volume_sold       10
1           total_plu_4046_sold       11
3           total_plu_4770_sold       12
16                      date_id       13
13                  day_of_week       14


In [8]:
# RFE for classification

from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
import pandas as pd

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',',index_col=0)
avocado_features = avocados_enhanced_no_label.drop('type_id', axis=1)
avocado_target = avocados_enhanced_no_label['type_id']

linear_svc_estimator = LinearSVC()

scaler = StandardScaler()
# Compute the mean and variance of each feature
scaler.fit(avocado_features)
# Remove the mean and scale to unit variance
scaled_avocado_features = scaler.transform(avocado_features)

# Create the selector with the number of features to select
rfe_selector = RFE(linear_svc_estimator, n_features_to_select=4)

# Fit the selector on the scaled data
fitted_rfe_selector = rfe_selector.fit(scaled_avocado_features, avocado_target)

features_kept = pd.DataFrame(list(avocado_features.columns), columns=["features"])
features_kept['ranking'] = fitted_rfe_selector.ranking_
features_kept.sort_values('ranking', inplace=True)


print(features_kept)

                       features  ranking
8   total_extra_large_bags_sold        1
2           total_plu_4046_sold        1
3           total_plu_4225_sold        1
4           total_plu_4770_sold        1
5               total_bags_sold        2
1             total_volume_sold        3
6         total_small_bags_sold        4
7         total_large_bags_sold        5
0                 average_price        6
14                  day_of_year        7
16                      date_id        8
11                        month        9
10                    region_id       10
9                          year       11
12                          day       12
15                 week_of_year       13
13                  day_of_week       14


# Forward Feature Selection

## High correlation with target
### Why use it?
### What does it do?
### How to do it?

In [8]:
import pandas as pd

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)

# Order the columns by their absolute correlation with the target, here 'average_price'
avocados_enhanced_no_label.corr().abs()['average_price'].sort_values(ascending=False)

# The target could also be 'type_id':
#avocados_enhanced_no_label.corr().abs()['type_id'].sort_values(ascending=False)

average_price                  1.000000
type_id                        0.615845
total_plu_4046_sold            0.208317
total_volume_sold              0.192752
total_plu_4770_sold            0.179446
total_bags_sold                0.177088
total_small_bags_sold          0.174730
total_large_bags_sold          0.172940
total_plu_4225_sold            0.172928
day_of_year                    0.163380
month                          0.162409
week_of_year                   0.146383
total_extra_large_bags_sold    0.117592
date_id                        0.099925
year                           0.093197
day                            0.027386
region_id                      0.011716
day_of_week                         NaN
Name: average_price, dtype: float64
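
Starting from a ranking like the one above, keeping the features most correlated with the target could look like the sketch below. The toy data, the `TARGET` name and the number of features to keep (`N_FEATURES`) are assumptions made for illustration. The constant column shows why day_of_week appears as NaN in the output: a correlation is undefined for a feature that never changes.

```python
import pandas as pd

# Toy data (made up): "constant" never changes, so its correlation is undefined (NaN).
toy = pd.DataFrame({
    "target":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "strong":   [2.1, 3.9, 6.2, 8.0, 9.8],
    "weak":     [5.0, 1.0, 4.0, 2.0, 3.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
})

TARGET = "target"
N_FEATURES = 1  # hypothetical number of features to keep

correlation_with_target = (
    toy.corr()[TARGET]
    .abs()
    .drop(TARGET)   # the target is always perfectly correlated with itself
    .dropna()       # constant columns have no defined correlation
    .sort_values(ascending=False)
)

# Keep the N_FEATURES features with the highest absolute correlation.
selected_features = list(correlation_with_target.index[:N_FEATURES])
print(correlation_with_target)
print(selected_features)  # ['strong']
```

This only captures linear relationships between one feature and the target, so a feature with a low score here can still be useful in combination with others.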

## Univariate feature selection
### Why use it?

When you want to select features based on a statistical / probabilistic approach.

Univariate feature selection uses statistical tests to select features. "Univariate" describes data that consists of observations on a single characteristic or attribute: each feature is examined individually to determine the strength of its relationship with the response variable. Examples of statistical tests that can be used to evaluate feature relevance are Pearson correlation, maximal information coefficient, distance correlation, ANOVA and chi-square. Chi-square is used to find the relationship between categorical variables, and ANOVA is preferred when the features are continuous.

https://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html#sphx-glr-auto-examples-feature-selection-plot-feature-selection-py

### What does it do?
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest.fit

SelectKBest scores every feature with the chosen test (here f_classif, the ANOVA F-value) and keeps the k features with the highest scores. In the scikit-learn example linked above, applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features, and thus improves classification.
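
To show what "each feature individually" means in practice, the sketch below scores two synthetic features with `f_classif` and checks the first score against a one-way ANOVA computed directly with `scipy.stats.f_oneway`. The data is randomly generated for the example and is not the avocado dataset.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (not the avocado dataset): two classes, two features.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
informative = np.where(y == 0, 0.0, 3.0) + rng.normal(size=100)  # class means differ
noise = rng.normal(size=100)                                       # unrelated to the class
X = np.column_stack([informative, noise])

# f_classif scores each column on its own with a one-way ANOVA F-test.
f_scores, p_values = f_classif(X, y)

# Same statistic computed directly for the first feature, one group per class.
reference = f_oneway(X[y == 0, 0], X[y == 1, 0])

# SelectKBest keeps the k columns with the highest scores.
selector = SelectKBest(f_classif, k=1).fit(X, y)

print(f_scores, p_values)
print(reference.statistic)
print(selector.get_support())
```

A high F-value with a p-value close to 0 means the feature's mean differs clearly between the classes, while a p-value close to 1 suggests the feature tells us nothing about the class on its own.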

### How to do it?

In [10]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Load the enhanced data with no label and create 2 sub-datasets:
# one with the target (for example 'type_id')
# one with the observed features

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',', index_col=0)
avocado_features = avocados_enhanced_no_label.drop(['average_price','day_of_week','type_id'], axis=1)
avocado_target = avocados_enhanced_no_label['type_id']

avocado_enhanced_columns = list(avocado_features.columns)

# f_classif: ANOVA F-value between label/feature, for classification tasks.
# f_regression: F-value between label/feature, for regression tasks.

# Create the selector and fit it on the data
select_k_best_selector = SelectKBest(f_classif, k=4)
select_k_best_selector.fit(avocado_features, avocado_target)

# Gather the feature scores
avocado_features_scores = pd.DataFrame(avocado_enhanced_columns, columns=["features"])

# Display the score and p-value of each feature
avocado_features_scores['scores'] = select_k_best_selector.scores_
avocado_features_scores['p_values'] = select_k_best_selector.pvalues_
avocado_features_scores.sort_values('scores', ascending=False, inplace=True)

print(avocado_features_scores)

# Which features are relevant according to this test, and why?


                       features       scores       p_values
0             total_volume_sold  1042.108793  2.185255e-222
2           total_plu_4225_sold  1040.734514  4.188721e-222
1           total_plu_4046_sold   980.486848  1.072492e-209
5         total_small_bags_sold   932.826142  7.489965e-200
4               total_bags_sold   908.583262  7.786117e-195
3           total_plu_4770_sold   842.045893  4.944173e-181
6         total_large_bags_sold   707.326363  6.208641e-153
7   total_extra_large_bags_sold   579.757145  3.911033e-126
9                     region_id     0.001426   9.698724e-01
12                  day_of_year     0.000131   9.908594e-01
10                        month     0.000131   9.908617e-01
13                 week_of_year     0.000114   9.914966e-01
14                      date_id     0.000023   9.961356e-01
8                          year     0.000017   9.966843e-01
11                          day     0.000003   9.986459e-01


## Feature Selection using SelectFromModel

### Why use it?

To keep only the features that really matter for the model/estimator.


### What does it do?

A feature is kept if the importance the estimator assigns to it (the absolute value of its coefficient) is high enough, i.e. above a threshold.
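
The sketch below illustrates this on synthetic data (generated with `make_regression`, not the avocado dataset). It assumes a linear model, for which the importance of a feature is the absolute value of its coefficient, and it disables the threshold with `threshold=-np.inf` so that only `max_features` limits the selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (not the avocado dataset): 6 features, 2 of them informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=2,
                       noise=1.0, random_state=0)

# threshold=-np.inf disables the threshold, so only max_features limits the selection.
selector = SelectFromModel(LinearRegression(), max_features=2, threshold=-np.inf)
selector.fit(X, y)

# Importance of each feature for a linear model: absolute value of its coefficient.
importances = np.abs(selector.estimator_.coef_)
top_two_by_importance = sorted(np.argsort(importances)[-2:].tolist())
selected = sorted(np.flatnonzero(selector.get_support()).tolist())

print(importances.round(3))
print(selected, top_two_by_importance)
```

Unlike RFE, the estimator is fitted only once here: the selection is read directly from that single set of coefficients.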

### How to do it?

In [15]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np

# Load the enhanced data with no label and create 2 sub-datasets:
# one with the target (for example 'average_price')
# one with the observed features

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',',index_col=0)

avocado_features = avocados_enhanced_no_label.drop('average_price', axis=1)

avocado_target = avocados_enhanced_no_label['average_price']

avocado_features_columns = list(avocado_features.columns)


# Choose and create the estimator / model
linear_regression_estimator = LinearRegression()

# Create the selector: keep at most 5 features; threshold=-np.inf disables the importance threshold
select_from_model_selector = SelectFromModel(linear_regression_estimator, max_features=5, threshold=-np.inf)

# Fit the selector on the data
select_from_model_selector.fit(avocado_features, avocado_target)

# Gather the selected features
avocado_features_selected = pd.DataFrame(avocado_features_columns, columns=["features"])

# Display the coefficient, the feature and its selected status
avocado_features_selected['coeff'] = select_from_model_selector.estimator_.coef_
avocado_features_selected['is_selected'] = select_from_model_selector.get_support()


avocado_features_selected.sort_values('is_selected', ascending=False, inplace=True)
print(avocado_features_selected)

# Which features are relevant according to this test, and why?


                       features         coeff  is_selected
14                  day_of_year -4.484500e-02         True
12                          day  4.588967e-02         True
11                        month  1.387611e+00         True
7   total_extra_large_bags_sold  1.402073e-02         True
10                      type_id  4.889550e-01         True
0             total_volume_sold -1.117643e-04        False
9                     region_id  3.503097e-04        False
15                 week_of_year -6.727884e-04        False
13                  day_of_week  2.220446e-16        False
8                          year -1.388070e-02        False
1           total_plu_4046_sold  1.116732e-04        False
6         total_large_bags_sold  1.401930e-02        False
5         total_small_bags_sold  1.401947e-02        False
4               total_bags_sold -1.390769e-02        False
3           total_plu_4770_sold  1.114250e-04        False
2           total_plu_4225_sold  1.118726e-04        Fal

In [2]:
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

avocados_enhanced_no_label = pd.read_csv('../ressources/avocados_enhanced_no_label.csv', sep=',',index_col=0)
avocado_features = avocados_enhanced_no_label.drop('type_id', axis=1)
avocado_target = avocados_enhanced_no_label['type_id']

avocado_features_columns = list(avocado_features.columns)

linear_svc_estimator = LinearSVC()

scaler = StandardScaler()
# Compute the mean and variance of each feature
scaler.fit(avocado_features)
# Remove the mean and scale to unit variance
scaled_avocado_features = scaler.transform(avocado_features)

select_from_model_selector = SelectFromModel(linear_svc_estimator, max_features=5, threshold=-np.inf)
# Fit the selector on the scaled data
select_from_model_selector.fit(scaled_avocado_features, avocado_target)

avocado_features_selected = pd.DataFrame(avocado_features_columns,columns=["features"])

avocado_features_selected['coeff'] = select_from_model_selector.estimator_.coef_[0]
avocado_features_selected['is_selected'] = select_from_model_selector.get_support()
avocado_features_selected.sort_values('coeff', inplace=True)
print(avocado_features_selected)

                       features      coeff  is_selected
8   total_extra_large_bags_sold -14.550717         True
2           total_plu_4046_sold  -7.631265         True
3           total_plu_4225_sold  -5.914269         True
4           total_plu_4770_sold  -5.703681         True
1             total_volume_sold  -4.610765         True
11                        month  -0.382896        False
16                      date_id  -0.061921        False
9                          year  -0.047116        False
12                          day  -0.040973        False
15                 week_of_year  -0.023039        False
13                  day_of_week   0.000000        False
10                    region_id   0.048802        False
14                  day_of_year   0.263533        False
0                 average_price   0.567460        False
5               total_bags_sold   1.466634        False
7         total_large_bags_sold   1.629446        False
6         total_small_bags_sold   1.750720      