## Filter Methods - Basics
### Filter Methods: Constant features, quasi-constant features, duplicates

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split # to separate the dataset into training and testing set
from sklearn.feature_selection import VarianceThreshold # to remove constant / quasi-constant features

We will be using the Santander Customer Satisfaction dataset from Kaggle:
https://www.kaggle.com/c/santander-customer-satisfaction/data

In [2]:
# load the Santander customer satisfaction dataset from Kaggle

data = pd.read_csv('training_data/santander.csv')
data.shape

(76020, 371)

In [3]:
data.head()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


Before selecting features, check the dataset for missing values.

In [5]:
# check for the presence of null data.
# The column comparisons used later in this notebook can compare NaN values
# between 2 columns, so in principle missing data are not a problem.
# In any case, there are no missing data in this dataset.
# List the columns that contain missing data:

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

Separate the dataset into training and testing sets

## Remove constant features

Constant features are those that show a single value for all the observations of the dataset, that is, the same value in every row. These features provide no information that allows a machine learning model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.
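As a minimal sketch of the definition above, the snippet below flags constant columns by counting distinct values instead of checking the standard deviation. The frame `toy_train` and its column names are hypothetical and only serve as an illustration; the advantage of `nunique` is that it also works for non-numeric columns, where `std()` is not defined.

```python
import pandas as pd

# Toy training frame (hypothetical data, not the Santander dataset)
toy_train = pd.DataFrame({
    "const_num": [7, 7, 7, 7],
    "const_str": ["a", "a", "a", "a"],
    "varying": [1, 2, 3, 4],
})

# A column is constant when it holds exactly one distinct value.
# dropna=False counts NaN as a value of its own.
toy_constant = [
    col for col in toy_train.columns
    if toy_train[col].nunique(dropna=False) == 1
]

print(toy_constant)  # ['const_num', 'const_str']
```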

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This avoids overfitting, because no information from the test set is used when deciding which features to keep.

In [6]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['TARGET'], axis=1),
    data['TARGET'],
    test_size=0.3,  # 30% of the observations will be allocated to the test set
    random_state=0)

X_train.shape, X_test.shape  # 370 variables remain after separating out TARGET

((53214, 370), (22806, 370))

In [7]:
# list comprehension capturing all the constant features (zero standard deviation)
constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

print(len(constant_features)) # print how many constant features are in the dataset 

38


In [8]:
# we can then drop these columns from the train and test sets

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((53214, 332), (22806, 332))

## Remove quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.
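One way to make "the great majority of the observations" concrete is to measure the share of rows taken by the most frequent value of each column. The sketch below does this on a hypothetical toy frame; the 0.99 cut-off is an assumption chosen for illustration, not a value prescribed by this notebook.

```python
import pandas as pd

# Toy training frame (hypothetical data)
toy_train = pd.DataFrame({
    "mostly_zero": [0] * 995 + [1] * 5,  # 99.5% zeros
    "balanced": [0, 1] * 500,            # 50% zeros
})

# Share of rows taken by the most frequent value in each column
dominant_share = toy_train.apply(
    lambda s: s.value_counts(normalize=True).max()
)

# Assumed cut-off: one value covers at least 99% of the rows
quasi_constant = dominant_share[dominant_share >= 0.99].index.tolist()

print(dominant_share)
print(quasi_constant)  # ['mostly_zero']
```

Unlike a variance-based rule, this check does not depend on the scale of the feature.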

### Using variance threshold from sklearn

Variance threshold from sklearn is a simple baseline approach to feature selection. It removes all features whose variance does not exceed a given threshold (we defined the threshold to be 0.01 in this case). By default, the threshold is 0, so it removes only zero-variance features, i.e., features that have the same value in all samples.
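To see how a threshold of 0.01 relates to the share of observations, recall that a 0/1 feature with a fraction p of ones has variance p(1 - p). The sketch below, on a hypothetical toy array, shows that a binary feature with 1% ones falls just under the threshold and is removed. Note that variance depends on the scale of the feature, so for non-binary features the threshold does not translate directly into a share of observations.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (hypothetical data) with two binary features
X_toy = np.column_stack([
    np.r_[np.ones(1), np.zeros(99)],   # 1% ones: variance 0.01 * 0.99
    np.r_[np.ones(50), np.zeros(50)],  # 50% ones: variance 0.5 * 0.5
])

toy_sel = VarianceThreshold(threshold=0.01).fit(X_toy)

print(toy_sel.variances_)     # per-feature variances learned during fit
print(toy_sel.get_support())  # True for features that are kept
```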

In [9]:
# remove quasi-constant features
sel = VarianceThreshold(
    threshold=0.01)  # for a binary feature, a variance of 0.01 means roughly 99% of observations share one value

sel.fit(X_train)  # fit finds the features with low variance

# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not quasi-constant

print(sum(sel.get_support())) # how many not quasi-constant?
print(len(X_train.columns[sel.get_support()])) # another way

268
268


#### What should be the value of the threshold so we can eliminate all the constant features?
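Since constant features have zero variance, a threshold of 0 (the default) is enough to remove them. A minimal sketch on a hypothetical toy array:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (hypothetical data): columns 0 and 2 are constant
X_toy = np.array([
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 3.0],
    [1.0, 0.0, 3.0],
])

const_sel = VarianceThreshold(threshold=0).fit(X_toy)

print(const_sel.get_support())  # [False  True False]
```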

In [10]:
# finally we can print the quasi-constant feature names
print(
    len([
        x for x in X_train.columns
        if x not in X_train.columns[sel.get_support()]
    ]))

[x for x in X_train.columns if x not in X_train.columns[sel.get_support()]]

64


['ind_var1',
 'ind_var6_0',
 'ind_var6',
 'ind_var13_largo',
 'ind_var13_medio_0',
 'ind_var13_medio',
 'ind_var14',
 'ind_var17_0',
 'ind_var17',
 'ind_var18_0',
 'ind_var18',
 'ind_var19',
 'ind_var20_0',
 'ind_var20',
 'ind_var29_0',
 'ind_var29',
 'ind_var30_0',
 'ind_var31_0',
 'ind_var31',
 'ind_var32_cte',
 'ind_var32_0',
 'ind_var32',
 'ind_var33_0',
 'ind_var33',
 'ind_var34_0',
 'ind_var34',
 'ind_var40',
 'ind_var39',
 'ind_var44_0',
 'ind_var44',
 'num_var6_0',
 'num_var6',
 'num_var13_medio_0',
 'num_var13_medio',
 'num_var18_0',
 'num_var18',
 'num_var29_0',
 'num_var29',
 'num_var33',
 'num_var34_0',
 'num_var34',
 'delta_imp_aport_var33_1y3',
 'delta_num_aport_var33_1y3',
 'ind_var7_emit_ult1',
 'ind_var7_recib_ult1',
 'num_aport_var33_hace3',
 'num_aport_var33_ult1',
 'num_var7_emit_ult1',
 'num_compra_var44_hace3',
 'num_meses_var13_medio_ult3',
 'num_meses_var17_ult3',
 'num_meses_var29_ult3',
 'num_meses_var33_ult3',
 'num_meses_var44_ult3',
 'num_reemb_var13_ult1',

We can see that 64 columns / variables are almost constant.

In [11]:
# percentage of observations showing each of the different values
X_train['ind_var31'].value_counts() / len(X_train)  # np.float was removed in recent NumPy versions; plain division already returns floats

0    0.996599
1    0.003401
Name: ind_var31, dtype: float64

We can see that more than 99% of the observations show a single value, 0, for this variable. Therefore, this feature is almost constant.

In [12]:
# look at the distribution of the observations among the different values of the variable
X_train['imp_op_var40_efect_ult1'].value_counts() / len(X_train)

0.0       0.999493
900.0     0.000094
60.0      0.000056
1800.0    0.000056
270.0     0.000038
600.0     0.000038
120.0     0.000038
87.9      0.000019
870.0     0.000019
6600.0    0.000019
930.0     0.000019
750.0     0.000019
150.0     0.000019
1710.0    0.000019
300.0     0.000019
210.0     0.000019
1200.0    0.000019
Name: imp_op_var40_efect_ult1, dtype: float64

In [13]:
features_to_keep = X_train.columns[sel.get_support()]

We will use the transform function to reduce the training and testing sets. See below.

In [14]:
# we can then remove the features like this
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((53214, 268), (22806, 268))

In [15]:
# sklearn transformations lead to numpy arrays
# here I transform the arrays back to dataframes
# please be mindful of getting the columns assigned
# correctly

X_train = pd.DataFrame(X_train)
X_train.columns = features_to_keep

X_test = pd.DataFrame(X_test)
X_test.columns = features_to_keep
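Note that `pd.DataFrame(array)` assigns a fresh 0..n-1 index, so the row labels produced by `train_test_split` are lost. If the index is needed later, for example to align the features with the target, a possible variant is to pass both the columns and the index when rebuilding the dataframe. The sketch below uses a hypothetical toy frame.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (hypothetical data) with a non-default index
toy = pd.DataFrame(
    {"keep": [0.0, 1.0, 2.0], "drop": [5.0, 5.0, 5.0]},
    index=[10, 20, 30],
)

toy_sel = VarianceThreshold(threshold=0).fit(toy)

# Rebuild the dataframe with the retained column names and the original index
toy_reduced = pd.DataFrame(
    toy_sel.transform(toy),
    columns=toy.columns[toy_sel.get_support()],
    index=toy.index,
)

print(toy_reduced)
```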

## Remove duplicated features

Datasets often contain two or more features that show the same values across all the observations. This means that these features are in essence identical. In addition, it is not unusual to introduce duplicated features after performing **one hot encoding** of categorical variables, particularly when using several high-cardinality variables.

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more easily interpretable machine learning models.

**Note**
Finding duplicated features by comparing every pair of columns is computationally costly in Python, because the number of comparisons grows quadratically with the number of features. Depending on the size of your dataset, you might not always be able to perform it.
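For datasets that fit comfortably in memory, a more compact alternative to an explicit pairwise loop is to transpose the dataframe and use `duplicated()`, which marks every row (here, every original column) that repeats an earlier one. This is a sketch on a hypothetical toy frame, with two caveats: transposing copies the data, and it upcasts mixed column types to a common type, so the result may differ from a strict `Series.equals` comparison when columns have different dtypes.

```python
import pandas as pd

# Toy frame (hypothetical data): 'b' and 'd' duplicate 'a'
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
    "d": [1, 2, 3],
})

# After transposing, each original column becomes a row;
# duplicated() keeps the first occurrence and flags the later copies.
dup_mask = toy.T.duplicated()
toy_duplicated = toy.columns[dup_mask].tolist()

print(toy_duplicated)  # ['b', 'd']

toy_unique = toy.drop(columns=toy_duplicated)
print(toy_unique.columns.tolist())  # ['a', 'c']
```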



In [17]:
# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  #to see the evolution of the loop
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            
            print(col_1)
            print(col_2)
            print()
            duplicated_feat.append(col_2)
            
len(duplicated_feat)

0
10
20
30
ind_var26_0
ind_var26

ind_var25_0
ind_var25

40
ind_var37_0
ind_var37

50
60
70
num_var26_0
num_var26

num_var25_0
num_var25

80
90
num_var32_0
num_var32

num_var37_0
num_var37

100
num_var40
num_var39

saldo_var6
saldo_var29

110
saldo_var13_medio
saldo_medio_var13_medio_ult1

120
130
delta_imp_reemb_var13_1y3
delta_num_reemb_var13_1y3

140
delta_imp_reemb_var17_1y3
delta_num_reemb_var17_1y3

delta_imp_trasp_var17_in_1y3
delta_num_trasp_var17_in_1y3

delta_imp_trasp_var17_out_1y3
delta_num_trasp_var17_out_1y3

delta_imp_trasp_var33_in_1y3
delta_num_trasp_var33_in_1y3

delta_imp_trasp_var33_out_1y3
delta_num_trasp_var33_out_1y3

150
160
170
180
190
200
210
220
230
240
250
260


16

In [20]:
# let's check that indeed those features are duplicated
# I select a random pair from above

X_train[['num_var40', 'num_var39']].head(50)

Unnamed: 0,num_var40,num_var39
0,0.0,0.0
1,0.0,0.0
2,0.0,0.0
3,0.0,0.0
4,0.0,0.0
5,0.0,0.0
6,0.0,0.0
7,0.0,0.0
8,0.0,0.0
9,0.0,0.0


In [21]:
X_train.drop(labels=duplicated_feat, axis=1, inplace=True)
X_test.drop(labels=duplicated_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((53214, 252), (22806, 252))