---
layout: post
title:  "Feature Selection"
date:   2023-06-1 10:14:54 +0700
categories: MachineLearning
---

# Introduction

Feature selection is a machine learning technique that chooses only a subset of the available features to build a model. Not all features contribute equally to the prediction. For example, when predicting a house price, the size of the house can matter much more than its address. Selecting features reduces the number of variables, which lowers computation cost and training time and makes the model more parsimonious. More importantly, features differ in quality, so a procedure that keeps only the most informative ones helps the model focus on the underlying pattern in the data rather than on noise. This can reduce overfitting, help with the ultimate goal of generalization, and possibly make the model perform better.

Feature selection is part of feature engineering (which also covers creating new variables that make better sense), and it is part of the process of understanding your data qualitatively. So use logic and domain knowledge to evaluate each feature thoroughly before deciding to drop or add it. Also note that some feature selection algorithms score features individually, so they cannot take feature interactions into account. Handling interactions belongs to the broader task of feature engineering (combining features into meaningful new variables).

There are three main families of automatic feature selection methods:

- Filter method: score the relevance of each feature to the target variable using statistical measures such as the chi-squared test, information gain, or correlation coefficients.
- Wrapper method: use a model to evaluate candidate feature subsets. The search can be exhaustive or heuristic, such as forward selection, backward elimination, recursive feature elimination, or a genetic algorithm.
- Embedded method: feature selection is embedded into the construction of the final model, using models such as LASSO, elastic net, decision trees, or random forests. This can be more efficient than a wrapper method, because no separate search over feature subsets is needed.

# Filter method

## Chi squared test
This statistical test evaluates whether two categorical variables are associated, based on their observed frequencies. Intuitively, we start from the assumption that the variables are independent and compute the frequency we would expect for each combination of values under that assumption. If the observed frequencies differ strongly from the expected ones, the variables are likely dependent.

Let's say we have two categorical variables X and Y. Attribute X has two values, X1 and X2, and attribute Y likewise has two values, Y1 and Y2. To see whether X and Y are dependent, we tabulate their joint frequencies:

|  | X1 | X2 | Total |
|--|----|----|-------|
|Y1|120 |80  |200    |
|Y2|130 |60  |190    |
|Total|250|140|390   |

We interpret the table as follows: there are 120 observations that have both attributes X1 and Y1. Then we define the null hypothesis H0 as "there is no association between X and Y" and the alternative hypothesis H1 as "there is an association between X and Y".

In step 2, we calculate the expected frequency for each cell: $$ \text{Expected frequency} = \frac{\text{Row total} \times \text{Column total}}{\text{Total}} $$. For example, the expected count in the first cell is $$ E(X1,Y1)=\frac{200 \times 250}{390} \approx 128.2 $$. So under independence we expect to see about 128 observations that have the combination (X1,Y1). Here are the expected frequencies:

|  | X1 | X2 | Total |
|--|----|----|-------|
|Y1|128.2|71.8|200   |
|Y2|121.8|68.2|190   |
|Total|250|140|390   |

Then we calculate, for each cell, the squared difference between the observed and the expected frequency, relative to the expected frequency: $$ \frac{(\text{observed} - \text{expected})^2}{\text{expected}} $$

|  | X1 | X2 |
|--|----|----|
|Y1|0.525|0.938|
|Y2|0.553|0.987|

The chi-squared test statistic is the sum of these contributions: 0.525 + 0.938 + 0.553 + 0.987 ≈ 3.00. Now we calculate the degrees of freedom of this table: $$ (\text{rows} - 1) \times (\text{columns} - 1) = (2-1) \times (2-1)=1 $$

With 1 degree of freedom and a significance level alpha of 0.05, we look up the critical value in this table:

<img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/Chi_Sq_formula_3.jpg">

The critical value is 3.841, which is larger than the chi-squared statistic for this dataset, so we cannot reject the null hypothesis that there is no link between X and Y. Strictly speaking, this does not prove that X and Y are independent; it only means that the data do not provide enough evidence of an association at the 5% level.
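The arithmetic of this worked example can be checked with a few lines of plain Python. This is a minimal sketch for the 2x2 table in this section; the helper name `chi_squared_statistic` is made up for illustration, and the critical value 3.841 is the one read from the table for 1 degree of freedom and alpha = 0.05. Because the expected counts are kept unrounded, the statistic comes out close to 3.00.

```python
# Observed counts from the contingency table: rows are Y1, Y2; columns are X1, X2.
observed = [[120, 80],
            [130, 60]]


def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom, expected_counts) for a 2-D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


statistic, dof, expected = chi_squared_statistic(observed)
CRITICAL_VALUE = 3.841  # chi-squared critical value for dof = 1, alpha = 0.05
reject_h0 = statistic > CRITICAL_VALUE

print(f"chi2 = {statistic:.2f}, dof = {dof}, reject H0: {reject_h0}")
```

The same function works for larger tables, but the critical value has to be looked up again for the new degrees of freedom.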

Usually we use the chi-squared test for categorical variables. For example, a movie genre can take one of these values: action, comedy, thriller, children. We can also apply the chi-squared test to numerical variables, but we need to bin the data first. Binning a continuous variable is also called bucketing: we divide the range of the variable into smaller intervals that together cover the range. For example, we divide [0,10] into 3 buckets: [0,3], (3,6], and (6,10]. This gives us three categories to work with. Keep in mind that we lose some information when grouping data into buckets. Also, the chi-squared test works on frequencies, which are non-negative by nature. If your dataset has negative values, one crude remedy is to set them to zero, at the cost of discarding the information they carried.
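Here is a small sketch of the bucketing step using only the standard library. The bucket edges follow the [0,3], (3,6], (6,10] example; the sample values and the helper name `bucket_index` are invented for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the first two buckets; anything above 6 falls into the last bucket.
INNER_EDGES = [3, 6]
LABELS = ["[0,3]", "(3,6]", "(6,10]"]


def bucket_index(value):
    """Map a value in [0, 10] to the index of its right-closed bucket."""
    if not 0 <= value <= 10:
        raise ValueError(f"{value} is outside the range [0, 10]")
    return bisect_left(INNER_EDGES, value)


values = [0, 1.5, 3, 3.1, 6, 6.5, 10]  # made-up sample data
buckets = [LABELS[bucket_index(v)] for v in values]

# The bucket frequencies are what a chi-squared test would work with.
frequencies = Counter(buckets)
print(frequencies)
```

Cross-tabulating these bucket labels against the target gives a contingency table like the one used in the worked example.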

### Code example
In this example, we look at the Home Credit Default Risk dataset, which can be used to predict a client's inability to pay back a loan. Here we only demonstrate the chi-squared test for selecting the features most relevant to the target variable. There are 1166 features, of which 16 are categorical. In the chi-squared cells we fill in missing values with the most frequent value of each column (the later mutual information and ANOVA examples use the mean for numerical variables instead), and we turn negative values into 0. We then select the top 3 categorical variables, followed by the top 800 and the top 10 numerical variables. Note that sklearn's `chi2` does not bin numerical variables: it treats each non-negative feature value as a count, so for continuous features the scores should be read as a rough heuristic unless you bin the values yourself.

In [1]:
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True)


In [21]:
train = pd.read_csv('home-credit-manual-engineered-features/train_previous_raw.csv', nrows = 1000)
test = pd.read_csv('home-credit-manual-engineered-features/test_previous_raw.csv', nrows = 1000)

In [22]:
train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,client_installments_AMT_PAYMENT_min_sum,client_installments_AMT_PAYMENT_sum_mean,client_installments_AMT_PAYMENT_sum_max,client_installments_AMT_PAYMENT_sum_min,client_installments_AMT_PAYMENT_sum_sum,client_installments_counts_mean,client_installments_counts_max,client_installments_counts_min,client_installments_counts_sum,client_counts
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,175783.725,219625.695,219625.695,219625.695,4172888.0,19.0,19.0,19.0,361.0,19.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,1154108.295,453952.2204,1150977.33,80773.38,11348810.0,9.16,12.0,6.0,229.0,25.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,16071.75,21288.465,21288.465,21288.465,63865.39,3.0,3.0,3.0,9.0,3.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,994476.69,232499.719688,691786.89,25091.325,3719996.0,7.875,10.0,1.0,126.0,16.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,483756.39,172669.901591,280199.7,18330.39,11396210.0,13.606061,17.0,10.0,898.0,66.0


In [None]:
y_train = train['TARGET']
train = train.drop(['TARGET'], axis=1)

In [25]:
y_train = y_train.to_frame()
y_train

Unnamed: 0,TARGET
0,1
1,0
2,0
3,0
4,0
...,...
995,0
996,0
997,0
998,0


In [71]:
from sklearn.impute import SimpleImputer

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']

# Keep only the categorical (string) columns
X_train = train.select_dtypes(include=categoricals)
columns = X_train.columns

# Fill missing categories with the most frequent value of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train = imputer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)


In [73]:
from sklearn.preprocessing import LabelEncoder

X_train = X_train.apply(LabelEncoder().fit_transform)

In [39]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [79]:
chi2_selector = SelectKBest(chi2, k=3) # Select top 3 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)


['client_credit_SK_ID_CURR_sum_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_sum']


In [77]:
from sklearn.impute import SimpleImputer

# Keep only the numerical columns
X_train = train.select_dtypes(include=numerics)
columns = X_train.columns

# Fill missing values with the most frequent value of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_train = imputer.fit_transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)

# chi2 requires non-negative inputs, so set negative values to 0
X_train[X_train < 0] = 0

chi2_selector = SelectKBest(chi2, k=800) # Select top 800 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)

['SK_ID_CURR', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'DAYS_EMPLOYED', 'OWN_CAR_AGE', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'COMMONAREA_AVG', 'LANDAREA_AVG', 'NONLIVINGAREA_AVG', 'COMMONAREA_MODE', 'LANDAREA_MODE', 'NONLIVINGAREA_MODE', 'COMMONAREA_MEDI', 'LANDAREA_MEDI', 'NONLIVINGAREA_MEDI', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_18', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_RE

In [78]:
chi2_selector = SelectKBest(chi2, k=10) # Select top 10 features
X_train_kbest = chi2_selector.fit_transform(X_train, y_train)

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, chi2_selector.get_support()) if mask]
print(selected_features)

['client_credit_SK_ID_CURR_sum_sum', 'client_credit_AMT_BALANCE_sum_sum', 'client_credit_AMT_CREDIT_LIMIT_ACTUAL_sum_sum', 'client_credit_AMT_PAYMENT_CURRENT_sum_sum', 'client_credit_AMT_PAYMENT_TOTAL_CURRENT_sum_sum', 'client_credit_AMT_RECEIVABLE_PRINCIPAL_sum_sum', 'client_credit_AMT_RECIVABLE_sum_sum', 'client_credit_AMT_TOTAL_RECEIVABLE_sum_sum', 'client_installments_SK_ID_CURR_sum_sum', 'client_installments_AMT_PAYMENT_sum_sum']


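Some of the top 10 numerical variables are the balance of the client, the credit limit, and the principal receivable. Note that the three names printed by the first selector (`k=3`) are also numerical aggregates rather than categorical columns: the cell numbers show that it was last executed after `X_train` had been reassigned to the numerical frame, so it needs to be rerun right after the label-encoding step to obtain the top categorical variables.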

We can use `SelectKBest` in sklearn with several other scoring functions, such as mutual information and ANOVA. These scoring functions can handle negative feature values.

Mutual information is a non-negative quantity from information theory that measures the dependency between a feature and the target. It is equal to zero if and only if the two random variables are independent. A higher value means higher dependency: one variable becomes less uncertain once we know the value of the other. Mutual information makes no assumption about the nature of the relationship, so it can capture non-linear and non-monotonic relations as well, whereas an indicator such as the correlation coefficient can only capture linear relationships. It is defined as $$ I(X,Y) = H(X) - H(X\mid Y) $$, the difference between the entropy of X and the conditional entropy of X given Y. The result is in bits when the logarithm is taken in base 2, and in nats for the natural logarithm.

To use it, we only need to pass the corresponding scoring function to `SelectKBest`. On this 1000-row sample the estimated dependency of every feature is low, with a maximum of about 0.03. Some of the top variables are about previous loan status, installment payments, the actual credit limit, and the cash loan purpose. It makes sense to use these features to predict whether a client will default.
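To make the formula concrete, the sketch below computes the mutual information of two discrete variables from a table of joint counts, using base-2 logarithms so that the result is in bits. It reuses the 2x2 counts from the chi-squared section as sample input. The function names are illustrative, and sklearn's `mutual_info_classif` uses a different, nearest-neighbor based estimator for continuous features, so its scores will not match this calculation exactly.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def mutual_information(joint_counts):
    """I(X,Y) = H(X) - H(X|Y) in bits; rows index Y, columns index X."""
    total = sum(sum(row) for row in joint_counts)
    # Marginal distribution of X: column sums divided by the grand total.
    p_x = [sum(col) / total for col in zip(*joint_counts)]
    h_x = entropy(p_x)

    # H(X|Y) is the entropy of X within each row, weighted by P(Y = row).
    h_x_given_y = 0.0
    for row in joint_counts:
        row_total = sum(row)
        if row_total == 0:
            continue
        h_x_given_y += (row_total / total) * entropy([c / row_total for c in row])
    return h_x - h_x_given_y


dependent = [[120, 80], [130, 60]]   # counts from the chi-squared example
independent = [[20, 30], [40, 60]]   # rows are proportional, so X and Y are independent
copy = [[50, 0], [0, 50]]            # X is fully determined by Y

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(mi_dependent, mi_independent, mi_copy)
```

The weakly associated table gives a value just above zero, the proportional table gives zero, and the fully determined case gives one bit, which is the entropy of a fair binary variable.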

In [82]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=numerics)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)


In [83]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(X_train, y_train)   # X, y defined as before

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

# concatenate two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

# name the dataframe columns
featureScores.columns = ['Specs','Score']  

# print 10 best features
print(featureScores.nlargest(10,'Score'))  


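(sklearn emits a `DataConversionWarning` here because `y_train` is a one-column DataFrame; passing `y_train.values.ravel()` instead avoids it.)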


                                                  Specs     Score
299             previous_loans_NAME_PORTFOLIO_POS_count  0.030435
1130           client_installments_AMT_PAYMENT_mean_min  0.024786
913          client_credit_AMT_TOTAL_RECEIVABLE_sum_min  0.024512
574                 client_cash_CNT_INSTALMENT_min_mean  0.023053
741      client_credit_AMT_CREDIT_LIMIT_ACTUAL_mean_min  0.022686
768      client_credit_AMT_DRAWINGS_ATM_CURRENT_sum_max  0.022160
352   previous_loans_NAME_YIELD_GROUP_low_action_cou...  0.021542
157   previous_loans_NAME_CASH_LOAN_PURPOSE_Journey_...  0.021407
30                                         EXT_SOURCE_3  0.021029
965     client_credit_CNT_DRAWINGS_POS_CURRENT_mean_min  0.020707


ANOVA, short for analysis of variance, is a statistical test of whether the means of several populations are equal. It separates the observed variability of a dataset into a systematic component (differences between groups) and a random component (variation within groups). For feature selection, we group the values of a feature by target class: if the class means differ a lot relative to the spread inside each class, the feature helps to separate the classes. The ANOVA F-statistic for one feature is $$ F = \frac{\text{distance between classes}}{\text{compactness of classes}} $$. The larger this ratio, the better the feature contributes to separating the classes.

For example, suppose feature X takes the following values in two classes: Class 1: [800, 300, 1300] and Class 2: [1000, 2000].

- The mean of all points: (800 + 300 + 1300 + 1000 + 2000) / 5 = 1080.
- The mean of class 1: 800. The mean of class 2: 1500.
- The distance between classes: number of observations in class 1 * (mean 1 - mean)^2 + number of observations in class 2 * (mean 2 - mean)^2 = 3 * (800 - 1080)^2 + 2 * (1500 - 1080)^2 = 588000. With two classes there is 2 - 1 = 1 degree of freedom, so the value stays 588000.
- The compactness of the classes: $$ \frac{\text{squared deviations in class 1} + \text{squared deviations in class 2}}{(n_1 - 1) + (n_2 - 1)} = \frac{500000 + 500000}{2 + 1} \approx 333333 $$
- The F-statistic: 588000 / 333333 ≈ 1.76.
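The same calculation can be written as a short function. This is a plain-Python sketch of the one-way ANOVA F-statistic for one feature grouped by class (the function name `anova_f` is made up here). It divides the between-class sum of squares by its degrees of freedom and the pooled within-class sum of squares by its own degrees of freedom.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of numeric values."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-class sum of squares: class sizes times squared mean offsets.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-class sum of squares: squared deviations from each class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


class_1 = [800, 300, 1300]
class_2 = [1000, 2000]
f_value = anova_f([class_1, class_2])
print(round(f_value, 3))
```

A feature whose class means are far apart relative to the spread within each class gets a large F value and is ranked higher by the selector.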

In [98]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categoricals = ['object']
X_train = train.select_dtypes(include=numerics)
columns = X_train.columns
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X_train)
  
X_train = imputer.transform(X_train)
X_train = pd.DataFrame(data=X_train, columns=columns)

In [99]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=f_classif, k=10)
fit = bestfeatures.fit(X_train, y_train)   # X, y defined as before

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

# concatenate two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

# name the dataframe columns
featureScores.columns = ['Specs','Score']  

# print 10 best features
print(featureScores.nlargest(10,'Score'))  


                                                 Specs      Score
30                                        EXT_SOURCE_3  44.139397
29                                        EXT_SOURCE_2  39.448303
167  previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase...  13.564337
168  previous_loans_NAME_CASH_LOAN_PURPOSE_Purchase...  13.564337
146  previous_loans_NAME_CASH_LOAN_PURPOSE_Car repa...  12.410798
200    previous_loans_CODE_REJECT_REASON_HC_count_norm  11.547177
212   previous_loans_CODE_REJECT_REASON_XAP_count_norm  11.498345
288  previous_loans_NAME_GOODS_CATEGORY_Vehicles_co...  11.284404
186  previous_loans_NAME_CONTRACT_STATUS_Refused_co...  11.032108
350    previous_loans_NAME_YIELD_GROUP_high_count_norm  10.263603


(The run also emits warnings: `y_train` is passed as a column vector instead of a 1-D array, and many features are constant in this 1000-row sample, which makes the ratio `f = msb / msw` undefined for them.)



## Information gain
Information gain measures the reduction in entropy of the target after the dataset is split on a feature. For example, when choosing a split for a decision tree model, we prefer the feature that gives the maximum gain.
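As a small illustration, the sketch below computes the information gain of two candidate splits of a toy binary target. The labels and the splits are made up, and the helper names are not from any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [1, 1, 1, 0, 0, 0, 0, 0]            # made-up binary target
good_split = [[1, 1, 1], [0, 0, 0, 0, 0]]    # each child contains a single class
poor_split = [[1, 1, 0, 0], [1, 0, 0, 0]]    # children are still mixed

gain_good = information_gain(parent, good_split)
gain_poor = information_gain(parent, poor_split)
print(gain_good, gain_poor)
```

The split that produces pure children removes all the uncertainty about the target, so its gain equals the entropy of the parent; the mixed split gains much less.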

## Correlation coefficient
The correlation coefficient indicates the strength of the linear relationship between each input variable and the target variable. This method selects the features that are most highly correlated with the target.
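A minimal sketch of this idea, using the Pearson correlation coefficient and made-up data: each feature is scored against the target and the features are ranked by the absolute value of the score. The feature names and the helper `pearson` are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data: "a" tracks the target, "b" is inversely related, "c" is noisy.
features = {
    "a": [1, 2, 3, 4, 5],
    "b": [10, 8, 6, 4, 2],
    "c": [3, 1, 4, 1, 5],
}
target = [2, 4, 6, 8, 10]

# Rank by absolute correlation, since a strong negative correlation is also informative.
scores = {name: pearson(values, target) for name, values in features.items()}
ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
top_2 = ranking[:2]
print(scores, top_2)
```

Ranking by the absolute value keeps feature "b", which would be discarded if only positive correlations were considered.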

# Wrapper method

## Forward selection
This starts with an empty model. In each step, it considers all the remaining features and adds the one that improves the performance of the model the most. The process is repeated until adding more variables no longer improves the model.
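The loop can be sketched in a few lines. In the sketch below, `score` stands for any function that evaluates a feature subset, for example a cross-validated model score; here it is replaced by a hypothetical `toy_score` with made-up usefulness values so that the example runs on its own.

```python
def forward_selection(candidates, score):
    """Greedily add the feature that improves `score` most; stop when nothing helps."""
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every model that adds one more feature to the current selection.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_feature = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_feature] <= best_score:
            break  # no remaining feature improves the model
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected


# Hypothetical stand-in for a validation score: useful features add value,
# and every feature costs a small complexity penalty.
USEFULNESS = {"size": 0.5, "rooms": 0.2, "address": 0.0, "colour": 0.0}


def toy_score(features):
    return sum(USEFULNESS[f] for f in features) - 0.05 * len(features)


chosen = forward_selection(USEFULNESS, toy_score)
print(chosen)
```

With a real model, each call to `score` means fitting and validating the model once, which is why wrapper methods become expensive when there are many features.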

## Backward elimination
This starts with a model that uses all the features. It removes the least significant feature one at a time, until no further improvement is observed.
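A sketch of the elimination loop follows. As an assumption for illustration, the "least significant" feature is taken to be the one whose removal gives the best score, and `toy_score` is a hypothetical stand-in for a real validation score.

```python
def backward_elimination(features, score):
    """Greedily drop the feature whose removal helps `score` most; stop when none does."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > 1:
        # Score every model that leaves out exactly one of the current features.
        trial_scores = {f: score([g for g in selected if g != f]) for f in selected}
        least_useful = max(trial_scores, key=trial_scores.get)
        if trial_scores[least_useful] <= best_score:
            break  # removing any feature no longer improves the model
        selected.remove(least_useful)
        best_score = trial_scores[least_useful]
    return selected


# Hypothetical stand-in for a validation score: useful features add value,
# and every feature costs a small complexity penalty.
USEFULNESS = {"size": 0.5, "rooms": 0.2, "address": 0.0, "colour": 0.0}


def toy_score(features):
    return sum(USEFULNESS[f] for f in features) - 0.05 * len(features)


kept = backward_elimination(USEFULNESS, toy_score)
print(kept)
```

On this toy score the two uninformative features are dropped first, and the loop stops once removing either remaining feature would lower the score.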

## Recursive feature elimination (RFE)
This algorithm performs a greedy search for the best-performing feature subset. In each iteration, it fits a model, ranks the features by importance (for example by coefficient magnitude), and drops the least important ones. It then refits the model on the remaining features and repeats until the desired number of features is left. The features are then ranked based on the order of their elimination.


In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3) # Select top 3 features
fit = rfe.fit(X_train, y_train.values.ravel()) # pass y as a 1-D array

# To get the selected feature names
selected_features = [feature for feature, mask in zip(X_train.columns, rfe.support_) if mask]
print(selected_features)



# Embedded method
This method chooses the features as a part of the creation of the model.

## LASSO 
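LASSO adds an L1 penalty on the coefficients to the loss, so it prefers solutions with fewer non-zero coefficients. The coefficients of uninformative features are driven to exactly zero, and the features with non-zero coefficients are the selected ones.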
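To see why an L1 penalty produces exact zeros, consider the soft-thresholding operator, which is the closed-form LASSO solution for a single coefficient when the features are orthonormal. That is a simplifying assumption made here for illustration, and the coefficients and the penalty value below are made up. The ridge (L2) shrinkage under the same assumption is shown for comparison.

```python
def soft_threshold(ols_coef, penalty):
    """LASSO estimate of one coefficient under an orthonormal design."""
    magnitude = max(abs(ols_coef) - penalty, 0.0)
    return magnitude if ols_coef >= 0 else -magnitude


def ridge_shrink(ols_coef, penalty):
    """Ridge estimate under the same assumption, for comparison."""
    return ols_coef / (1.0 + penalty)


# Made-up least-squares coefficients for three features.
ols_coefs = {"size": 2.0, "rooms": -0.8, "address": 0.1}
penalty = 0.3

lasso_coefs = {name: soft_threshold(c, penalty) for name, c in ols_coefs.items()}
ridge_coefs = {name: ridge_shrink(c, penalty) for name, c in ols_coefs.items()}

# Features with a non-zero LASSO coefficient are the selected ones.
selected = [name for name, c in lasso_coefs.items() if c != 0]
print(lasso_coefs, ridge_coefs, selected)
```

The small coefficient is set to exactly zero by the L1 rule, while the ridge rule only shrinks it, which is why LASSO can act as a feature selector and ridge cannot.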


In [None]:
from sklearn.linear_model import LassoCV

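# LassoCV is a regressor; here it is fit on the 0/1 target only to find non-zero coefficients
lasso = LassoCV(cv=5)
lasso.fit(X_train, y_train.values.ravel()) # pass y as a 1-D array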

# To get the selected feature names
selected_features = [feature for feature, coef in zip(X_train.columns, lasso.coef_) if coef != 0]
print(selected_features)



## Elastic net

## Decision tree

A decision tree uses Gini impurity or entropy to estimate the quality of each split. Features used near the root affect the prediction of a larger fraction of the samples than features used near the leaves. The expected fraction of the samples that a feature's splits apply to, combined with the impurity decrease of those splits, can thus be used as an indicator of its importance.
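The sketch below works through this idea on a tiny hand-made tree. The tree, its two splits, and the feature names A and B are invented for illustration; the importance of a feature is taken to be the Gini impurity decrease of its split, weighted by the fraction of samples that reach that split, and then normalized to sum to 1.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: chance that two random draws from `labels` differ in class."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_samples):
    """Impurity decrease of one split, weighted by the fraction of samples reaching it."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_samples) * (gini(parent) - children)


root = [1, 1, 1, 1, 0, 0, 0, 0]  # all 8 training labels reach the root

# Split on feature A at the root.
a_left, a_right = [1, 1, 1, 1, 0], [0, 0, 0]
# Split on feature B one level down, applied to the left child of the root.
b_left, b_right = [1, 1, 1, 1], [0]

raw = {
    "A": weighted_impurity_decrease(root, a_left, a_right, len(root)),
    "B": weighted_impurity_decrease(a_left, b_left, b_right, len(root)),
}
total = sum(raw.values())
importance = {name: value / total for name, value in raw.items()}
print(importance)
```

Feature A is used at the root, where all samples pass through, and ends up with the larger share of the total impurity decrease in this toy tree.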

