## Constant features

Constant features are those that show a single value for all the observations (rows) of the dataset. These features provide no information that allows a machine learning model to discriminate or predict a target.

Identifying and removing constant features is an easy first step towards feature selection and more easily interpretable machine learning models.

Here, I will demonstrate how to identify constant features using a dataset that I created for this course. 

To identify constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If using the VarianceThreshold, all our variables need to be numerical. If we do it manually however, we can apply the code to both numerical and categorical variables.

I will show 3 snippets of code, 1 where I use the VarianceThreshold and 2 manually coded alternatives.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [11]:
from pathlib import Path, PureWindowsPath

def get_full_path_nm(base_dir, path_add_ons, file_nm):
    """Build the full path to a data file and report whether it exists."""

    # The pieces are joined with backslashes and declared explicitly as a
    # Windows-style path, so that it can be converted below.
    incl_path = base_dir + path_add_ons + "\\" + file_nm
    filename = PureWindowsPath(incl_path)

    # Convert path to the right format for the current operating system
    correct_fnm = Path(filename)

    does_file_exist = correct_fnm.is_file()
    if does_file_exist:
        print("Full path NM exists ", correct_fnm)
    else:
        print("Full path NM does NOT exist ", correct_fnm)

    return correct_fnm, does_file_exist



In [12]:
# raw string, so that the backslashes are not read as escape sequences
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv(fnm)
data.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This keeps information from the test set out of the selection and helps avoid overfitting.
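
A minimal sketch of this practice on made-up toy data (the column names `x1` and `x2` are hypothetical, not taken from the course dataset): the constant feature is identified from the training rows alone, and that same decision is then applied to both sets.

```python
import pandas as pd

# Toy data, invented for illustration only
train = pd.DataFrame({"x1": [0, 0, 0], "x2": [1, 2, 3]})
test = pd.DataFrame({"x1": [0, 5], "x2": [4, 5]})

# Decide which features are constant by looking at the training set only
constant = [col for col in train.columns if train[col].nunique() == 1]

# Apply that same decision to both sets, so they keep identical columns
train_reduced = train.drop(columns=constant)
test_reduced = test.drop(columns=constant)

print(constant)
print(train_reduced.shape, test_reduced.shape)
```

Note that `x1` does vary in the toy test rows; the decision is still taken from the training rows, as it would be when the model meets unseen data.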

In [13]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),  # drop the target
    data['target'],  # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

### Using VarianceThreshold from Scikit-learn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance does not meet a certain threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.

In [14]:
sel = VarianceThreshold(threshold=0)

sel.fit(X_train)  # fit finds the features with zero variance

VarianceThreshold(threshold=0)

In [15]:
# get_support is a boolean vector that indicates which features are retained
# if we sum over get_support, we get the number of features that are not constant

# (go ahead and print the result of sel.get_support() to understand its output)

sum(sel.get_support())

266

In [16]:
# now let's print the number of constant features
# (see how we use ~ to exclude non-constant features)

constant = X_train.columns[~sel.get_support()]

len(constant)

34

We can see that 34 columns / variables are constant. This means that 34 variables show a single value for all the observations of the training set.

In [17]:
# let's print the constant variable names

constant

Index(['var_23', 'var_33', 'var_44', 'var_61', 'var_80', 'var_81', 'var_87',
       'var_89', 'var_92', 'var_97', 'var_99', 'var_112', 'var_113', 'var_120',
       'var_122', 'var_127', 'var_135', 'var_158', 'var_167', 'var_170',
       'var_171', 'var_178', 'var_180', 'var_182', 'var_195', 'var_196',
       'var_201', 'var_212', 'var_215', 'var_225', 'var_227', 'var_248',
       'var_294', 'var_297'],
      dtype='object')

In [18]:
# let's visualise the values of one of the constant variables
# as an example

X_train['var_23'].unique()

array([0], dtype=int64)

In [19]:
# we can do the same for every feature:

for col in constant:
    print(col, X_train[col].unique())

var_23 [0]
var_33 [0]
var_44 [0]
var_61 [0]
var_80 [0]
var_81 [0]
var_87 [0]
var_89 [0.]
var_92 [0]
var_97 [0]
var_99 [0]
var_112 [0]
var_113 [0]
var_120 [0]
var_122 [0]
var_127 [0]
var_135 [0]
var_158 [0]
var_167 [0]
var_170 [0]
var_171 [0]
var_178 [0.]
var_180 [0.]
var_182 [0]
var_195 [0]
var_196 [0]
var_201 [0]
var_212 [0]
var_215 [0]
var_225 [0]
var_227 [0.]
var_248 [0]
var_294 [0]
var_297 [0]


We then use the transform() method of the VarianceThreshold to reduce the training and testing sets to their non-constant features.

Note that VarianceThreshold returns a NumPy array without feature names, so we need to capture the names first, and reconstitute the dataframe in a later step.

In [20]:
# capture non-constant feature names

feat_names = X_train.columns[sel.get_support()]

In [21]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

We went from our original 300 variables to 266.

In [22]:
# X_train is a NumPy array
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [23]:
# reconstitute the dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_train.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_289,var_290,var_291,var_292,var_293,var_295,var_296,var_298,var_299,var_300
0,0.0,0.0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,2.97,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,2.79,85435.2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,5.7,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
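
As an alternative to rebuilding the dataframe from the array, the boolean mask returned by get_support() can be used to index the original dataframe directly, which keeps the column names. This is a minimal sketch on toy data; the names `toy`, `mask` and `reduced` are mine and not part of the course dataset.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: "a" and "c" are constant, "b" is not
toy = pd.DataFrame({
    "a": [0, 0, 0, 0],
    "b": [1, 2, 3, 4],
    "c": [7.5, 7.5, 7.5, 7.5],
})

sel = VarianceThreshold(threshold=0)
sel.fit(toy)

# get_support() returns one boolean per column: True means "retained"
mask = sel.get_support()

# Index the dataframe with the mask instead of calling transform()
reduced = toy.loc[:, mask]

print(reduced.columns.tolist())
```

The result is still a pandas dataframe, so no separate step is needed to capture and restore the feature names.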


### Manual code 1: only works with numerical variables

In the following cells, I will show an alternative to the VarianceThreshold transformer of sklearn, where we write the code to find constant variables ourselves, using the standard deviation from pandas.

In [24]:
# separate train and test (again, as we transformed the previous ones)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [25]:
# short and easy: find constant features

# in this dataset, all features are numeric,
# so this bit of code will suffice:

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

len(constant_features)

34

In [26]:
# drop these columns from the train and test sets:

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

We see that by removing constant features, we managed to reduce the feature space quite a bit.

Both the VarianceThreshold and the snippet of code I provided work with numerical variables. What can we do to find constant categorical variables?

One alternative is to encode the categories as numbers and then use the code above. But then you will put effort into pre-processing variables that are not informative.

The code below offers a better solution:

### Manual Code 2 - works also with categorical variables

In [27]:
# separate train and test (again, as we transformed the previous ones)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [28]:
# I will cast all the numeric features as object,
# to simulate that they are categorical

X_train = X_train.astype('O')
X_train.dtypes

var_1      object
var_2      object
var_3      object
var_4      object
var_5      object
            ...  
var_296    object
var_297    object
var_298    object
var_299    object
var_300    object
Length: 300, dtype: object

In [29]:
# to find variables that contain only 1 label/value
# we use the nunique() method from pandas, which returns the number
# of different values in a variable.

constant_features = [
    feat for feat in X_train.columns if X_train[feat].nunique() == 1
]

len(constant_features)

34

Same as before, we observe 34 variables that show only 1 value in all the observations of the training set. This shows how useful it is to look for constant variables at the beginning of any modeling exercise.

**Note:** by default, nunique() ignores missing values. If your variables contain missing values and you want them counted as a distinct value, pass dropna=False to nunique(). More details here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html
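
A small sketch of the difference, on an invented toy dataframe (the column names are hypothetical): column `b` has one label plus a missing value, so it looks constant with the default settings but not with dropna=False.

```python
import numpy as np
import pandas as pd

# Toy categorical data, invented for illustration only
toy = pd.DataFrame({
    "a": ["x", "x", "x"],      # truly constant
    "b": ["x", np.nan, "x"],   # one label plus a missing value
    "c": ["x", "y", "x"],      # two labels
})

# Default behaviour: missing values are ignored when counting unique values
default_constant = [col for col in toy.columns if toy[col].nunique() == 1]

# With dropna=False, NaN is counted as a value of its own
strict_constant = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]

print(default_constant)
print(strict_constant)
```

Which of the two behaviours is appropriate depends on whether missingness itself could be informative for the target.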

In [30]:
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

## Detect Constant & Quasi-Constant variables

In [32]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

df = pd.read_csv(fnm)
df.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [33]:
df.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
1,0,0,0.0,3.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.0,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.1,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0


In [34]:
from fast_ml.utilities import display_all
from fast_ml.feature_selection import get_constant_features

constant_features = get_constant_features(df)
constant_features.head(10)

Unnamed: 0,Desc,Var,Value,Perc
0,Constant,var_80,0.0,100.0
1,Constant,var_97,0.0,100.0
2,Constant,var_113,0.0,100.0
3,Constant,var_112,0.0,100.0
4,Constant,var_33,0.0,100.0
5,Constant,var_212,0.0,100.0
6,Constant,var_99,0.0,100.0
7,Constant,var_215,0.0,100.0
8,Constant,var_248,0.0,100.0
9,Constant,var_44,0.0,100.0


In [35]:
# Get all the constant features as a list

constant_features_list = constant_features.query("Desc=='Constant'")['Var'].to_list()
print(constant_features_list)

['var_80', 'var_97', 'var_113', 'var_112', 'var_33', 'var_212', 'var_99', 'var_215', 'var_248', 'var_44', 'var_122', 'var_92', 'var_89', 'var_87', 'var_81', 'var_167', 'var_61', 'var_225', 'var_227', 'var_120', 'var_23', 'var_127', 'var_196', 'var_297', 'var_195', 'var_294', 'var_135', 'var_201', 'var_182', 'var_180', 'var_178', 'var_158', 'var_171']


In [36]:
# Drop constant features from the data frame

print('Shape of Dataset before dropping the constant features: ', df.shape)
df.drop(columns = constant_features_list, inplace=True)
print('Shape of Dataset after dropping the constant features: ', df.shape)

Shape of Dataset before dropping the constant features:  (50000, 301)
Shape of Dataset after dropping the constant features:  (50000, 268)


In [37]:
# Quasi-constant features

constant_features = get_constant_features(df, threshold=0.99, dropna=False)
constant_features.head(10)

Unnamed: 0,Desc,Var,Value,Perc
0,Quasi Constant,var_124,0.0,99.998
1,Quasi Constant,var_104,0.0,99.998
2,Quasi Constant,var_66,0.0,99.998
3,Quasi Constant,var_223,0.0,99.998
4,Quasi Constant,var_67,0.0,99.998
5,Quasi Constant,var_69,0.0,99.998
6,Quasi Constant,var_129,0.0,99.998
7,Quasi Constant,var_73,0.0,99.998
8,Quasi Constant,var_247,0.0,99.998
9,Quasi Constant,var_36,0.0,99.998


In [39]:
quasi_constant_features_list = constant_features.query("Desc=='Quasi Constant'")['Var'].to_list()
print(quasi_constant_features_list)

['var_124', 'var_104', 'var_66', 'var_223', 'var_67', 'var_69', 'var_129', 'var_73', 'var_247', 'var_36', 'var_34', 'var_133', 'var_170', 'var_183', 'var_280', 'var_283', 'var_287', 'var_14', 'var_153', 'var_6', 'var_151', 'var_72', 'var_141', 'var_10', 'var_11', 'var_12', 'var_285', 'var_210', 'var_187', 'var_217', 'var_65', 'var_111', 'var_189', 'var_150', 'var_233', 'var_228', 'var_2', 'var_234', 'var_243', 'var_9', 'var_265', 'var_28', 'var_71', 'var_20', 'var_289', 'var_116', 'var_7', 'var_267', 'var_146', 'var_221', 'var_263', 'var_257', 'var_90', 'var_59', 'var_274', 'var_204', 'var_136', 'var_126', 'var_149', 'var_239', 'var_202', 'var_138', 'var_216', 'var_235', 'var_264', 'var_1', 'var_3', 'var_60', 'var_184', 'var_95', 'var_42', 'var_290', 'var_237', 'var_142', 'var_53', 'var_45', 'var_78', 'var_299', 'var_260', 'var_77', 'var_254', 'var_219', 'var_211', 'var_43', 'var_106', 'var_197', 'var_224', 'var_246', 'var_102', 'var_24', 'var_48', 'var_32', 'var_115', 'var_125', 'var_

In [40]:
print('Shape of Dataset before dropping the quasi constant features: ', df.shape)
df.drop(columns = quasi_constant_features_list, inplace=True)
print('Shape of Dataset after dropping the quasi constant features: ', df.shape)

Shape of Dataset before dropping the quasi constant features:  (50000, 268)
Shape of Dataset after dropping the quasi constant features:  (50000, 123)


## Duplicate Features 

In [41]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

df = pd.read_csv(fnm)
df.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [42]:
from fast_ml.utilities import display_all
from fast_ml.feature_selection import get_duplicate_features

duplicate_features = get_duplicate_features(df)
duplicate_features.head(10)

Unnamed: 0,Desc,feature1,feature2
0,Duplicate Values,var_66,var_69
1,Duplicate Values,var_23,var_135
2,Duplicate Values,var_23,var_167
3,Duplicate Values,var_23,var_171
4,Duplicate Values,var_149,var_239
5,Duplicate Values,var_143,var_296
6,Duplicate Values,var_23,var_182
7,Duplicate Values,var_23,var_195
8,Duplicate Values,var_23,var_196
9,Duplicate Values,var_23,var_201


In [43]:
duplicate_features_list = duplicate_features.query("Desc=='Duplicate Values'")['feature2'].to_list()
print(duplicate_features_list)

['var_69', 'var_135', 'var_167', 'var_171', 'var_239', 'var_296', 'var_182', 'var_195', 'var_196', 'var_201', 'var_215', 'var_225', 'var_248', 'var_294', 'var_297', 'var_183', 'var_104', 'var_223', 'var_148', 'var_106', 'var_216', 'var_227', 'var_287', 'var_289', 'var_180', 'var_199', 'var_178', 'var_158', 'var_212', 'var_127', 'var_80', 'var_151', 'var_116', 'var_269', 'var_232', 'var_263', 'var_122', 'var_250', 'var_33', 'var_44', 'var_61', 'var_285', 'var_97', 'var_120', 'var_99', 'var_112', 'var_92', 'var_113', 'var_87', 'var_81']


In [44]:
print('Shape of Dataset before dropping the duplicate values features: ', df.shape)
df.drop(columns = duplicate_features_list, inplace=True)
print('Shape of Dataset after dropping the duplicate values features: ', df.shape)

Shape of Dataset before dropping the duplicate values features:  (50000, 301)
Shape of Dataset after dropping the duplicate values features:  (50000, 251)


In [45]:
duplicate_features = get_duplicate_features(df)
duplicate_features.head(10)

Unnamed: 0,Desc,feature1,feature2
0,Duplicate Index,var_2,var_234
1,Duplicate Index,var_66,var_67
2,Duplicate Index,var_236,var_249
3,Duplicate Index,var_194,var_238
4,Duplicate Index,var_187,var_217
5,Duplicate Index,var_162,var_258
6,Duplicate Index,var_133,var_283
7,Duplicate Index,var_133,var_280
8,Duplicate Index,var_124,var_247
9,Duplicate Index,var_111,var_189


In [46]:
duplicate_index_features_list = duplicate_features.query("Desc=='Duplicate Index'")['feature2'].to_list()
print(duplicate_index_features_list)

['var_234', 'var_67', 'var_249', 'var_238', 'var_217', 'var_258', 'var_283', 'var_280', 'var_247', 'var_189', 'var_177', 'var_299', 'var_153', 'var_129', 'var_129', 'var_235', 'var_141', 'var_286', 'var_283', 'var_280', 'var_133', 'var_73', 'var_89', 'var_130', 'var_129', 'var_67', 'var_66', 'var_210', 'var_65', 'var_71', 'var_283']


In [47]:
print('Shape of Dataset before dropping the duplicate index features: ', df.shape)
df.drop(columns = duplicate_index_features_list, inplace=True)
print('Shape of Dataset after dropping the duplicate index features: ', df.shape)

Shape of Dataset before dropping the duplicate index features:  (50000, 251)
Shape of Dataset after dropping the duplicate index features:  (50000, 226)


## Quasi-constant features

Quasi-constant features are those that show the same value for the great majority of the observations of the dataset. In general, these features provide little, if any, information that allows a machine learning model to discriminate or predict a target. But there can be exceptions, so you should be careful when removing this type of feature.
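
To illustrate what such an exception could look like, here is a minimal sketch with invented data (the names `flag` and `target` are hypothetical): a feature that is 0 for 99.9% of the rows, yet whose rare value always coincides with the positive class.

```python
import pandas as pd

n = 10_000

# Invented data: "flag" is 1 in only 10 rows, and all of them are positives
flag = [1] * 10 + [0] * (n - 10)
target = [1] * 10 + [1] * 90 + [0] * (n - 100)
toy = pd.DataFrame({"flag": flag, "target": target})

# Share of the predominant value: the feature is clearly quasi-constant
top_share = toy["flag"].value_counts(normalize=True).iloc[0]

# Mean of the target for each value of the feature
target_rate = toy.groupby("flag")["target"].mean()

print(top_share)
print(target_rate)
```

In this toy case the feature would be removed by a 99.8% cut-off, even though its rare value is strongly associated with the target. Checking the target rate per value before dropping is one way to spot such cases.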

Identifying and removing quasi-constant features is an easy first step towards feature selection and more interpretable machine learning models.

Here, I will demonstrate how to identify quasi-constant features using a dataset that I created for this course. 

To identify quasi-constant features, we can use the VarianceThreshold from Scikit-learn, or we can code it ourselves. If we use the VarianceThreshold, all our variables need to be numerical. If we code it manually however, we can apply the code to both numerical and categorical variables.

I will show 2 snippets of code, 1 where I use the VarianceThreshold and 1 manually coded alternative.

In [48]:
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import VarianceThreshold

In [49]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv(fnm)
data.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.

In [50]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant features

First, I will remove constant features like I did in the previous lecture. This will allow a better visualisation of the quasi-constant ones.

In [51]:
# using the code from the previous lecture
# I remove 34 constant features

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

## Remove quasi-constant features

### Using the VarianceThreshold from sklearn

The VarianceThreshold from sklearn provides a simple baseline approach to feature selection. It removes all features whose variance does not meet a certain threshold. By default, it removes all zero-variance features, as we did in the previous notebook.

Here, we will change the default threshold to remove quasi-constant features or, better said, features with low variance.

Check the Scikit-learn docs for more details:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
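
To get a feel for what a variance threshold of 0.01 means: for a binary (0/1) feature in which a fraction p of the observations are 1, the variance is p(1 - p). A feature with 99% zeros therefore has a variance of 0.99 * 0.01 = 0.0099, just under the threshold. The small check below uses NumPy only, and assumes that the population variance (NumPy's default, ddof=0) is the quantity being compared against the threshold.

```python
import numpy as np

# Binary feature: 99 zeros and a single one (p = 0.01)
x = np.array([0] * 99 + [1])

p = x.mean()
variance = x.var()           # population variance (ddof=0)
closed_form = p * (1 - p)    # variance of a 0/1 feature

print(variance, closed_form)
print(variance < 0.01)
```

This relationship only holds for 0/1 features. For variables on other scales, the same proportion of the predominant value can correspond to a very different variance.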

In [52]:
sel = VarianceThreshold(threshold=0.01)  

sel.fit(X_train)  # fit finds the features with low variance

VarianceThreshold(threshold=0.01)

In [53]:
# get_support is a boolean vector that indicates which features 
# are retained, that is, which features have a higher variance than
# the threshold we indicated.

# If we sum over get_support, we get the number
# of features that are not quasi-constant

sum(sel.get_support())

215

In [54]:
# let's print the number of quasi-constant features

quasi_constant = X_train.columns[~sel.get_support()]

len(quasi_constant)

51

We can see that 51 columns / variables are almost constant. This means that 51 variables show predominantly one value for the majority of observations of the training set. Let's explore a few of these variables below.

In [55]:
# let's print the variable names
quasi_constant

Index(['var_1', 'var_2', 'var_7', 'var_9', 'var_10', 'var_19', 'var_28',
       'var_36', 'var_43', 'var_45', 'var_53', 'var_56', 'var_59', 'var_66',
       'var_67', 'var_69', 'var_71', 'var_104', 'var_106', 'var_116',
       'var_133', 'var_137', 'var_141', 'var_146', 'var_177', 'var_187',
       'var_189', 'var_194', 'var_197', 'var_198', 'var_202', 'var_218',
       'var_219', 'var_223', 'var_233', 'var_234', 'var_235', 'var_245',
       'var_247', 'var_249', 'var_250', 'var_251', 'var_256', 'var_260',
       'var_267', 'var_274', 'var_282', 'var_285', 'var_287', 'var_289',
       'var_298'],
      dtype='object')

In [56]:
# percentage of observations showing each of the different values
# of the variable

X_train['var_1'].value_counts() / len(X_train)


0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

We can see that > 99% of the observations show one value, 0. Therefore, this feature is fairly constant.

In [57]:
# let's explore another one

X_train['var_2'].value_counts() / len(X_train)


0    0.999971
1    0.000029
Name: var_2, dtype: float64

In [58]:
# capture feature names

feat_names = X_train.columns[sel.get_support()]

In [59]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 215), (15000, 215))

In [60]:
# transform the array into a dataframe

X_train = pd.DataFrame(X_train, columns=feat_names)
X_test = pd.DataFrame(X_test, columns=feat_names)

X_test.head()

Unnamed: 0,var_3,var_4,var_5,var_6,var_8,var_11,var_12,var_13,var_14,var_15,...,var_286,var_288,var_290,var_291,var_292,var_293,var_295,var_296,var_299,var_300
0,0.0,2.79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
3,0.0,2.76,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,2.94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
# separate train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

# remove constant features
# using the code from the previous lecture

constant_features = [
    feat for feat in X_train.columns if X_train[feat].std() == 0
]

X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

In [62]:
# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = X_train[feature].value_counts(
        normalize=True).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the observations
    # show 1 value?
    if predominant > 0.998:

        # if yes, add the variable to the list
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)

108
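
The same logic can be wrapped in a small reusable helper. This is only a sketch: the function name `quasi_constant_columns` and the toy columns are hypothetical, and it assumes no column is entirely missing (an all-NaN column would have no predominant value to read).

```python
import pandas as pd

def quasi_constant_columns(df, threshold=0.998):
    """Return columns whose most frequent value covers more than `threshold` of the rows."""
    # value_counts sorts by frequency in descending order, so the first entry
    # is the share of the predominant value. Missing values are ignored by default.
    top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
    return top_share[top_share > threshold].index.tolist()

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "mostly_zero": [0] * 999 + [1],   # 99.9% of rows show 0
    "balanced": [0, 1] * 500,         # 50% / 50%
})

found = quasi_constant_columns(toy)
print(found)
```

Keeping the threshold as a parameter makes it easy to try several cut-offs and see how many features each one would remove.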

In [63]:
# print the feature names

quasi_constant_feat

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 

In [64]:
# select one feature from the list

quasi_constant_feat[2]

'var_3'

In [65]:
X_train['var_3'].value_counts(normalize=True)

0.0000         0.999629
207901.3365    0.000029
15028.0560     0.000029
25905.4866     0.000029
35685.9459     0.000029
3583.3941      0.000029
52105.7901     0.000029
86718.0000     0.000029
861.0900       0.000029
2641.0164      0.000029
5209.9500      0.000029
10281.6000     0.000029
12542.3100     0.000029
27.3000        0.000029
Name: var_3, dtype: float64

The feature shows 0 for more than 99.9% of the observations, but it also shows a few different values for a tiny proportion of them. These values are large, so they increase the variance of the feature, which is why it was not captured by the VarianceThreshold in our previous cell. Yet, we can see that it is quasi-constant.

Keep in mind that the thresholds are arbitrary and decided by the user.
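
The effect of a few large values on the variance can be reproduced with a toy series. This is a sketch under simplified assumptions: the series below is invented (10,000 rows, 4 non-zero values borrowed from the output above) and is not the real var_3.

```python
import pandas as pd

# Toy feature: 9,996 zeros plus 4 large values
s = pd.Series([0.0] * 9996 + [207901.3365, 15028.0560, 86718.0, 27.3])

# Share of the predominant value
top_share = s.value_counts(normalize=True).iloc[0]

# Population variance, dominated by the few large values
variance = s.var(ddof=0)

print(top_share)
print(variance > 0.01)
```

The feature is quasi-constant by the "share of the predominant value" criterion, yet its variance is far above 0.01, so a variance-based filter would keep it.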

In [66]:
# finally, let's drop the quasi-constant features:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

## Duplicated features

Datasets often contain duplicated features, that is, features that, despite having different names, are identical.

In addition, we may often introduce duplicated features when performing **one hot encoding** of categorical variables, particularly if our datasets have many and/or highly cardinal categorical variables.

Identifying and removing duplicated, and therefore redundant features, is an easy first step towards feature selection and more interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using a dataset that I created for this course. 

Pandas has no dedicated function to find duplicated columns, so we need to write a bit of code to do so.

**Note**
Finding duplicated features can be a computationally costly operation in Python, therefore depending on the size of your dataset, you might not always be able to do it.

The method that I describe here to find duplicated features works for both **numerical and categorical** variables.
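
To make the one hot encoding case concrete, here is a minimal sketch on invented data (the `country` and `currency` columns are hypothetical). The transpose-based check at the end is a shortcut added for illustration only, not the method used in the rest of this notebook, and it can be memory-hungry on wide or large datasets.

```python
import pandas as pd

# Two categorical variables that carry the same information
toy = pd.DataFrame({
    "country": ["UK", "US", "UK", "FR"],
    "currency": ["GBP", "USD", "GBP", "EUR"],
})

# One hot encoding creates one 0/1 column per category
encoded = pd.get_dummies(toy, dtype=int)

# Transpose so that columns become rows, then flag repeated rows.
# The first occurrence is kept; later identical columns are marked True.
dup_mask = encoded.T.duplicated()
duplicated_cols = encoded.columns[dup_mask.to_numpy()].tolist()

print(encoded.columns.tolist())
print(duplicated_cols)
```

Each `currency_*` column is identical to one of the `country_*` columns, so half of the encoded features are redundant.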

In [67]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv(fnm)
data.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [68]:
# check the presence of missing data.
# (there are no missing data in this dataset)

[col for col in data.columns if data[col].isnull().sum() > 0]

[]

In [69]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## Remove constant and quasi-constant

In [70]:
# remove constant and quasi-constant features first:
# we can remove the 2 types of features together with this code
# (we used it in our previous notebook)

# create an empty list
quasi_constant_feat = []

# iterate over every feature
for feature in X_train.columns:

    # find the predominant value, that is the value that is shared
    # by most observations
    predominant = (X_train[feature].value_counts() / float(
        len(X_train))).sort_values(ascending=False).values[0]

    # evaluate the predominant value: do more than 99.8% of the observations
    # show 1 value?
    if predominant > 0.998:
        quasi_constant_feat.append(feature)

len(quasi_constant_feat)


142

In [71]:
# we can then drop these columns from the train and test sets:

X_train.drop(labels=quasi_constant_feat, axis=1, inplace=True)
X_test.drop(labels=quasi_constant_feat, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))
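
The same check can be wrapped in a small reusable helper. The following is a minimal sketch on a toy dataframe, not the course dataset: the function name `find_quasi_constant` and the toy columns are made up for illustration, and counting NaN as a value (`dropna=False`) is an assumption that differs slightly from the loop above.

```python
import pandas as pd


def find_quasi_constant(df, threshold=0.998):
    """Return the columns whose most frequent value is shown by
    more than `threshold` of the rows (hypothetical helper)."""
    flagged = []
    for col in df.columns:
        # value_counts sorts in descending order, so the first entry
        # is the share of the predominant value (NaN counted as a value)
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > threshold:
            flagged.append(col)
    return flagged


# toy data: one constant, one quasi-constant and one varied feature
toy = pd.DataFrame({
    "constant": [0] * 1000,
    "almost": [0] * 999 + [1],
    "varied": list(range(1000)),
})

flagged = find_quasi_constant(toy, threshold=0.998)
print(flagged)
```

Raising the threshold makes the check stricter, so fewer features are flagged as quasi-constant.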

## Remove duplicated features

To identify duplicated variables we need to iterate through all features of our dataset, and for each and every feature, try and find others that are identical, or duplicates.

We will create a dictionary of {variable: duplicated variables} pairs to identify them more easily throughout the demo. Keep in mind that in a dataset, there could be 2 or more features that are identical to each other.

In [72]:
# check for duplicated features in the training set:

# create an empty dictionary, where we will store 
# the groups of duplicates
duplicated_feat_pairs = {}

# create an empty list to collect features
# that were found to be duplicated
_duplicated_feat = []


# iterate over every feature in our dataset:
for i in range(0, len(X_train.columns)):
    
    # this bit helps me understand where the loop is at:
    if i % 10 == 0:  
        print(i)
    
    # choose 1 feature:
    feat_1 = X_train.columns[i]
    
    # check if this feature has already been identified
    # as a duplicate of another one. If it was, it should be stored in
    # our _duplicated_feat list.
    
    # If this feature was already identified as a duplicate, we skip it, if
    # it has not yet been identified as a duplicate, then we proceed:
    if feat_1 not in _duplicated_feat:
    
        # create an empty list as an entry for this feature in the dictionary:
        duplicated_feat_pairs[feat_1] = []

        # now, iterate over the remaining features of the dataset:
        for feat_2 in X_train.columns[i + 1:]:

            # check if this second feature is identical to the first one
            if X_train[feat_1].equals(X_train[feat_2]):

                # if it is identical, append it to the list in the dictionary
                duplicated_feat_pairs[feat_1].append(feat_2)
                
                # and append it to our monitor list for duplicated variables
                _duplicated_feat.append(feat_2)
                
                # done!

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
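
As an alternative to the nested loop, pandas can flag repeated columns after transposing the dataframe. This is only a sketch on a toy dataframe with made-up column names; transposing a large dataset can be memory hungry, so it may not be practical for wide data.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],  # copy of "a"
    "c": [3, 2, 1],
    "d": [1, 2, 3],  # another copy of "a"
})

# after transposing, each feature is a row, so duplicated() flags every
# feature that repeats an earlier one (the first occurrence is kept)
is_dup = toy.T.duplicated(keep="first")

# names of the repeated features
dup_cols = is_dup[is_dup].index.tolist()

# keep only the first occurrence of each feature
deduped = toy.loc[:, ~is_dup.values]

print(dup_cols)
print(list(deduped.columns))
```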


In [73]:
# let's explore our list of duplicated features
len(_duplicated_feat)

6

In [74]:
# these are the ones:

_duplicated_feat

['var_148', 'var_199', 'var_296', 'var_250', 'var_232', 'var_269']

In [75]:
# let's explore the dictionary we created:

duplicated_feat_pairs

{'var_4': [],
 'var_5': [],
 'var_8': [],
 'var_13': [],
 'var_15': [],
 'var_17': [],
 'var_18': [],
 'var_19': [],
 'var_21': [],
 'var_22': [],
 'var_25': [],
 'var_26': [],
 'var_27': [],
 'var_29': [],
 'var_30': [],
 'var_31': [],
 'var_35': [],
 'var_37': ['var_148'],
 'var_38': [],
 'var_41': [],
 'var_46': [],
 'var_47': [],
 'var_49': [],
 'var_50': [],
 'var_51': [],
 'var_52': [],
 'var_54': [],
 'var_55': [],
 'var_57': [],
 'var_58': [],
 'var_62': [],
 'var_63': [],
 'var_64': [],
 'var_68': [],
 'var_70': [],
 'var_74': [],
 'var_75': [],
 'var_76': [],
 'var_79': [],
 'var_82': [],
 'var_83': [],
 'var_84': ['var_199'],
 'var_85': [],
 'var_86': [],
 'var_88': [],
 'var_91': [],
 'var_93': [],
 'var_94': [],
 'var_96': [],
 'var_100': [],
 'var_101': [],
 'var_103': [],
 'var_105': [],
 'var_107': [],
 'var_108': [],
 'var_109': [],
 'var_110': [],
 'var_114': [],
 'var_117': [],
 'var_118': [],
 'var_119': [],
 'var_121': [],
 'var_123': [],
 'var_128': [],
 'var_131'

In [76]:
# let's explore the number of keys in our dictionary

# we see it is 152, because 6 of the 158 were duplicates,
# so they were not included as keys

print(len(duplicated_feat_pairs.keys()))

152


In [77]:
# print the features with their duplicates

# iterate over every feature in our dict:
for feat in duplicated_feat_pairs.keys():
    
    # if it has duplicates, the list should not be empty:
    if len(duplicated_feat_pairs[feat]) > 0:

        # print the feature and its duplicates:
        print(feat, duplicated_feat_pairs[feat])
        print()

var_37 ['var_148']

var_84 ['var_199']

var_143 ['var_296']

var_177 ['var_250']

var_226 ['var_232']

var_229 ['var_269']



In [78]:
# let's check that indeed those features are duplicated
# I select a pair from above

X_train[['var_37', 'var_148']].head(10)

       var_37  var_148
17967       0        0
32391       0        0
9341        0        0
7929        0        0
46544       0        0
4149        0        0
33426       0        0
3002        0        0
6974        0        0
16864       0        0


In [79]:
X_train['var_37'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [80]:
X_train['var_148'].unique()

array([ 0,  3,  6,  9, 12, 21, 33, 15], dtype=int64)

In [81]:
# let's explore parts of the dataframe where the values in
# these features are different from 0:

X_train[X_train['var_37'] != 0][['var_37', 'var_148']].head(10)

       var_37  var_148
37493       3        3
20251       6        6
4264        6        6
48480       3        3
31607       3        3
41172       3        3
13502       3        3
7759        3        3
46118       3        3
2638        3        3
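
The loop in cell In [72] compares features with `Series.equals`, which is stricter than an element-wise `==` in one respect and more lenient in another. The sketch below uses made-up toy series to show both cases: `equals` requires the same dtype, and it treats NaN values in the same position as equal.

```python
import numpy as np
import pandas as pd

ints = pd.Series([0, 3, 6])
floats = pd.Series([0.0, 3.0, 6.0])

# same values, different dtype
equals_diff_dtype = ints.equals(floats)
elementwise_diff_dtype = bool((ints == floats).all())

with_nan_1 = pd.Series([1.0, np.nan])
with_nan_2 = pd.Series([1.0, np.nan])

# NaN in the same position
equals_nan = with_nan_1.equals(with_nan_2)
elementwise_nan = bool((with_nan_1 == with_nan_2).all())

print(equals_diff_dtype, elementwise_diff_dtype)
print(equals_nan, elementwise_nan)
```

So two features holding the same numbers stored as int64 and float64 would not be reported as duplicates by the loop, while two features with missing values in the same rows would.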


In [82]:
# finally, to remove the duplicates, we retain only the features
# that are keys of the dictionary

# do you understand why? if not, go back to the loop in cell In [72]
# and try to determine the reason

X_train = X_train[list(duplicated_feat_pairs.keys())]
X_test = X_test[list(duplicated_feat_pairs.keys())]

X_train.shape, X_test.shape

((35000, 152), (15000, 152))
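
To see why keeping the dictionary keys is enough, here is a minimal pure-Python sketch of the same grouping logic on toy data. The function `group_duplicates` and the toy features are hypothetical; the point is that a feature only becomes a key if it was not already found to be a duplicate of an earlier one.

```python
def group_duplicates(columns):
    """columns: dict mapping feature name -> list of values.
    Mirrors the nested loop: returns {feature: [its duplicates]} and
    the set of features found to be duplicates (hypothetical helper)."""
    names = list(columns)
    groups = {}
    seen_as_duplicate = set()

    for i, first in enumerate(names):
        # a feature already flagged as a duplicate never becomes a key
        if first in seen_as_duplicate:
            continue

        groups[first] = []
        for second in names[i + 1:]:
            if columns[first] == columns[second]:
                groups[first].append(second)
                seen_as_duplicate.add(second)

    return groups, seen_as_duplicate


toy = {
    "a": [1, 2, 3],
    "b": [1, 2, 3],  # duplicate of "a"
    "c": [3, 2, 1],
    "d": [1, 2, 3],  # duplicate of "a"
    "e": [3, 2, 1],  # duplicate of "c"
}

groups, dups = group_duplicates(toy)
print(groups)
print(list(groups))
```

The keys are exactly the features that were never flagged as duplicates, so selecting them keeps one representative of each group.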

## Constant and Quasi-constant features with Feature-engine

In this notebook, we will remove constant and quasi-constant features using the DropConstantFeatures transformer from Feature-engine.

In [83]:
from feature_engine.selection import DropConstantFeatures

In [84]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv(fnm)
data.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [85]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [86]:
sel = DropConstantFeatures(tol=1, variables=None, missing_values='raise')

sel.fit(X_train)

DropConstantFeatures()

In [87]:
# list of constant features

sel.features_to_drop_

['var_23',
 'var_33',
 'var_44',
 'var_61',
 'var_80',
 'var_81',
 'var_87',
 'var_89',
 'var_92',
 'var_97',
 'var_99',
 'var_112',
 'var_113',
 'var_120',
 'var_122',
 'var_127',
 'var_135',
 'var_158',
 'var_167',
 'var_170',
 'var_171',
 'var_178',
 'var_180',
 'var_182',
 'var_195',
 'var_196',
 'var_201',
 'var_212',
 'var_215',
 'var_225',
 'var_227',
 'var_248',
 'var_294',
 'var_297']

In [88]:
# number of constant features

len(sel.features_to_drop_)

34

In [89]:
# let's explore the values of one of the constant features

X_train[sel.features_to_drop_[0]].unique()

array([0], dtype=int64)

In [90]:
# remove constant features from the data

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 266), (15000, 266))

The datasets now contain 34 fewer features.
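
For comparison, strictly constant features can also be found by hand by counting distinct values. This is a minimal sketch on a made-up toy dataframe; it is not how Feature-engine implements the transformer, and counting NaN as a value (`dropna=False`) is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "all_zero": [0, 0, 0, 0],
    "all_same_text": ["x", "x", "x", "x"],  # works for categorical values too
    "mixed": [0, 1, 0, 1],
})

# a feature is constant when it shows exactly one distinct value
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```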

## Remove quasi-constant features

In [91]:
sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')

sel.fit(X_train)

DropConstantFeatures(tol=0.998)

In [92]:
# number of quasi-constant features

len(sel.features_to_drop_)

108

In [93]:
# list of quasi-constant features

sel.features_to_drop_

['var_1',
 'var_2',
 'var_3',
 'var_6',
 'var_7',
 'var_9',
 'var_10',
 'var_11',
 'var_12',
 'var_14',
 'var_16',
 'var_20',
 'var_24',
 'var_28',
 'var_32',
 'var_34',
 'var_36',
 'var_39',
 'var_40',
 'var_42',
 'var_43',
 'var_45',
 'var_48',
 'var_53',
 'var_56',
 'var_59',
 'var_60',
 'var_65',
 'var_66',
 'var_67',
 'var_69',
 'var_71',
 'var_72',
 'var_73',
 'var_77',
 'var_78',
 'var_90',
 'var_95',
 'var_98',
 'var_102',
 'var_104',
 'var_106',
 'var_111',
 'var_115',
 'var_116',
 'var_124',
 'var_125',
 'var_126',
 'var_129',
 'var_130',
 'var_133',
 'var_136',
 'var_138',
 'var_141',
 'var_142',
 'var_146',
 'var_149',
 'var_150',
 'var_151',
 'var_153',
 'var_159',
 'var_183',
 'var_184',
 'var_187',
 'var_189',
 'var_197',
 'var_202',
 'var_204',
 'var_210',
 'var_211',
 'var_216',
 'var_217',
 'var_219',
 'var_221',
 'var_223',
 'var_224',
 'var_228',
 'var_233',
 'var_234',
 'var_235',
 'var_236',
 'var_237',
 'var_239',
 'var_243',
 'var_245',
 'var_246',
 'var_247',
 

In [94]:
# percentage of observations showing each of the different values
# of the variable

var = sel.features_to_drop_[0]

X_train[var].value_counts(normalize=True)

0    0.999629
3    0.000200
6    0.000143
9    0.000029
Name: var_1, dtype: float64

In [95]:
# let's explore another one

var = sel.features_to_drop_[2]

X_train[var].value_counts(normalize=True)

0.0000         0.999629
207901.3365    0.000029
15028.0560     0.000029
25905.4866     0.000029
35685.9459     0.000029
3583.3941      0.000029
52105.7901     0.000029
86718.0000     0.000029
861.0900       0.000029
2641.0164      0.000029
5209.9500      0.000029
10281.6000     0.000029
12542.3100     0.000029
27.3000        0.000029
Name: var_3, dtype: float64

In [96]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))
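
A quick sanity check of the counts reported above: removing the features in two passes should leave the same number of columns as the single manual pass with the 0.998 threshold, which flagged 142 features.

```python
n_features = 300        # columns after dropping the target
n_constant = 34         # dropped with tol=1
n_quasi_constant = 108  # dropped afterwards with tol=0.998

remaining = n_features - n_constant - n_quasi_constant
dropped_in_total = n_constant + n_quasi_constant

print(remaining)         # 158
print(dropped_in_total)  # 142
```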

## Duplicated features with Feature-engine

In this notebook, we will identify and remove duplicated features with Feature-engine.

In [97]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from feature_engine.selection import DropDuplicateFeatures, DropConstantFeatures

In [98]:
base_dir = r"c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments"
path_add_ons  = "\\input_data"
file_nm = "fselect_dataset_1.csv"

fnm, exists = get_full_path_nm (base_dir, path_add_ons, file_nm) 

# load our first dataset

# (feel free to write some code to explore the dataset and become
# familiar with it ahead of this demo)

data = pd.read_csv(fnm)
data.shape

Full path NM exists  c:\Users\Arindam Banerji\CopyFolder\IoT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [99]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [100]:
# remove constant and quasi-constant features first:
# we use Feature-engine for this

sel = DropConstantFeatures(tol=0.998, variables=None, missing_values='raise')

sel.fit(X_train)

DropConstantFeatures(tol=0.998)

In [101]:
# remove the quasi-constant features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 158), (15000, 158))

In [102]:
# set up the selector
sel = DropDuplicateFeatures(variables=None, missing_values='raise')

# find the duplicate features, this might take a while
sel.fit(X_train)

DropDuplicateFeatures(missing_values='raise')
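
One reason to drop constant and quasi-constant features before searching for duplicates is that it shrinks the search. The sketch below counts the number of distinct feature pairs, which is an upper bound on the comparisons made by a naive pairwise search such as the manual loop; it makes no claim about how Feature-engine performs the search internally.

```python
from math import comb

# number of distinct feature pairs before and after removing
# the 142 constant and quasi-constant features
pairs_all = comb(300, 2)
pairs_reduced = comb(158, 2)

print(pairs_all, pairs_reduced)  # 44850 12403
```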

In [103]:
# these are the groups of duplicated features:
# the features within each set are identical to each other

sel.duplicated_feature_sets_

[{'var_148', 'var_37'},
 {'var_199', 'var_84'},
 {'var_143', 'var_296'},
 {'var_177', 'var_250'},
 {'var_226', 'var_232'},
 {'var_229', 'var_269'}]

In [104]:
# these are the features that will be dropped
# 1 from each of the pairs above

sel.features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}

In [105]:
# number of duplicated features that will be dropped

len(sel.features_to_drop_)

6

In [106]:
# remove the duplicated features

X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((35000, 152), (15000, 152))

## Stack Feature selection in a Pipeline

We can perform both steps together by setting up the transformers within a pipeline.

In [107]:
# separate dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1), # drop the target
    data['target'], # just the target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [108]:
pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
])

pipe.fit(X_train)

Pipeline(steps=[('constant', DropConstantFeatures(tol=0.998)),
                ('duplicated', DropDuplicateFeatures())])

In [109]:
# remove features

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

X_train.shape, X_test.shape

((35000, 152), (15000, 152))

In [110]:
# we can navigate the pipeline transformers

len(pipe.named_steps['constant'].features_to_drop_)

142

In [111]:
pipe.named_steps['duplicated'].features_to_drop_

{'var_148', 'var_199', 'var_232', 'var_250', 'var_269', 'var_296'}
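
The fit-then-transform pattern of the pipeline can be sketched with plain pandas. Everything below is hypothetical (the step functions, `fit_steps`, `apply_steps` and the toy data) and is not the scikit-learn or Feature-engine implementation; it only illustrates that each step is fitted on the output of the previous one, and that the features to drop are learned from the train set and then applied unchanged to the test set.

```python
import pandas as pd


def constant_step(df, tol=0.998):
    """Columns whose most frequent value covers at least `tol` of the rows."""
    return [c for c in df.columns
            if df[c].value_counts(normalize=True, dropna=False).iloc[0] >= tol]


def duplicated_step(df):
    """Columns that exactly repeat an earlier column."""
    dup = df.T.duplicated()
    return dup[dup].index.tolist()


def fit_steps(train, steps):
    """Run each step on the output of the previous one and
    record which columns each step drops."""
    dropped = {}
    current = train
    for name, step in steps:
        dropped[name] = step(current)
        current = current.drop(columns=dropped[name])
    return dropped


def apply_steps(df, dropped):
    """Drop the columns learned during fitting, step by step."""
    for cols in dropped.values():
        df = df.drop(columns=cols)
    return df


train = pd.DataFrame({"k": [7, 7, 7, 7], "a": [1, 2, 3, 4],
                      "b": [1, 2, 3, 4], "c": [4, 3, 2, 1]})
# in the test set "k" is not constant and "b" differs from "a",
# but the selection learned on the train set is still applied
test = pd.DataFrame({"k": [7, 8], "a": [5, 6], "b": [5, 7], "c": [1, 2]})

dropped = fit_steps(train, [("constant", constant_step),
                            ("duplicated", duplicated_step)])
train_sel = apply_steps(train, dropped)
test_sel = apply_steps(test, dropped)

print(dropped)
print(list(train_sel.columns), list(test_sel.columns))
```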