<a href="https://colab.research.google.com/github/Priyo-prog/Statistics-and-Data-Science/blob/main/Feature%20Selection%20Complete/Filter%20Methods/duplicated_features_removal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Remove Duplicated Features**

Datasets often contain duplicated features, that is, features that are identical despite having different names.

In addition, we may introduce duplicated features when one-hot encoding categorical variables, particularly if our datasets have many categorical variables and/or categorical variables with high cardinality.
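As a small illustration of how one-hot encoding can produce identical columns, the sketch below encodes two made-up categorical variables. The `country` and `currency` columns are hypothetical and are not part of the course dataset; they are chosen so that one category of each always occurs together.

```python
import pandas as pd

# Hypothetical toy data: each country always uses the same currency
toy = pd.DataFrame({
    "country": ["US", "FR", "US", "DE", "FR"],
    "currency": ["USD", "EUR", "USD", "EUR", "EUR"],
})

# One-hot encode both categorical variables
encoded = pd.get_dummies(toy, dtype=int)

# country_US and currency_USD hold the same values under different names
same = encoded["country_US"].equals(encoded["currency_USD"])
print(encoded)
print("country_US duplicates currency_USD:", same)
```

Here the two encoded columns carry exactly the same information, so one of them can be dropped without losing anything.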

Identifying and removing duplicated, and therefore redundant, features is an easy first step towards feature selection and more interpretable machine learning models.

Here I will demonstrate how to identify duplicated features using a dataset that I created for this course.

pandas has no dedicated function for finding duplicated columns, so we need to write a bit of code to do so.

Note: finding duplicated features can be computationally costly in Python, because every feature may have to be compared with every other feature. Depending on the size of your dataset, you might not always be able to do it.
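One possible way to reduce this cost, sketched here as an assumption and not as the method used in this notebook, is to compute a fingerprint for each column and run the exact comparison only between columns that share a fingerprint. The helper names `column_fingerprint` and `find_duplicates_by_fingerprint` are hypothetical; the row hashes come from `pd.util.hash_pandas_object`.

```python
import hashlib
from collections import defaultdict

import pandas as pd


def column_fingerprint(col):
    """Hash the values of one column into a short hex string."""
    row_hashes = pd.util.hash_pandas_object(col, index=False)
    return hashlib.sha1(row_hashes.to_numpy().tobytes()).hexdigest()


def find_duplicates_by_fingerprint(df):
    """Map each retained column to the later columns identical to it."""
    # Group column names by fingerprint: one pass over the columns
    groups = defaultdict(list)
    for name in df.columns:
        groups[column_fingerprint(df[name])].append(name)

    duplicates = {}
    for names in groups.values():
        first, rest = names[0], names[1:]
        # Confirm candidates with an exact comparison against the first column.
        # Assumption: fingerprint collisions between different columns are
        # rare enough that checking only against `first` is acceptable.
        matches = [name for name in rest if df[first].equals(df[name])]
        if matches:
            duplicates[first] = matches
    return duplicates


# Hypothetical toy data with one numerical and one categorical duplicate
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["x", "y", "x"],
    "a_copy": [1, 2, 3],
    "c": [3, 2, 1],
    "b_copy": ["x", "y", "x"],
})
result = find_duplicates_by_fingerprint(toy)
print(result)
```

The exact `equals` check is kept as a final step so that a shared fingerprint alone never causes a feature to be dropped.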

The method I describe here for finding duplicated features works for both numerical and categorical variables.
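The comparison used later in this notebook relies on `Series.equals`. A minimal sketch of how it behaves on numerical and categorical columns is shown below; the toy column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with numerical and categorical columns
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "num_copy": [1.0, np.nan, 3.0],
    "cat": ["a", "b", "a"],
    "cat_copy": ["a", "b", "a"],
})

# Column names are ignored; NaNs in the same position count as equal
numeric_match = toy["num"].equals(toy["num_copy"])

# String columns are compared value by value
categorical_match = toy["cat"].equals(toy["cat_copy"])

# Columns with different content do not match
mixed_match = toy["num"].equals(toy["cat"])

# The comparison is dtype-sensitive: integers and floats with the
# same values are not considered equal
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.0, 3.0])
dtype_sensitive = ints.equals(floats)

print(numeric_match, categorical_match, mixed_match, dtype_sensitive)
```

Because of the dtype sensitivity, a feature stored once as integers and once as floats would not be flagged as duplicated by this approach.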


In [1]:
# Import important libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [2]:
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
filename = "/content/drive/MyDrive/Data Science/Feature Selection/dataset_1.csv"

In [6]:
df = pd.read_csv(filename)
df.head(5)

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_292,var_293,var_294,var_295,var_296,var_297,var_298,var_299,var_300,target
0,0,0,0.0,0.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
1,0,0,0.0,3.0,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
2,0,0,0.0,5.88,0.0,0,0,0,0,0,...,0.0,0,0,3,0,0,0,0.0,67772.7216,0
3,0,0,0.0,14.1,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0
4,0,0,0.0,5.76,0.0,0,0,0,0,0,...,0.0,0,0,0,0,0,0,0.0,0.0,0


In [8]:
# Create features and labels
X = df.drop(labels="target", axis=1)
y = df["target"]

In [9]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [11]:
# Find the constant features (zero standard deviation in the train set)

constant_features = [c_feat for c_feat in X_train.columns if X_train[c_feat].std() == 0]
len(constant_features)

34

In [12]:
# Remove the constant features
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

In [14]:
# Now identify the quasi-constant features
# Create an instance of VarianceThreshold with a variance cut-off of 0.01
sel = VarianceThreshold(threshold=0.01)
sel.fit(X_train)

In [15]:
sum(sel.get_support())

215

In [16]:
# Get the quasi-constant features
quasi_constants = X_train.columns[~sel.get_support()]

In [17]:
# Now remove the quasi-constant features
X_train.drop(labels=quasi_constants, axis=1, inplace=True)
X_test.drop(labels=quasi_constants, axis=1, inplace=True)

In [18]:
X_train.shape, X_test.shape

((35000, 215), (15000, 215))

In [20]:
len(quasi_constants)

51

## **Remove the duplicated features**

In [21]:
# Dictionary mapping each retained feature to the features that duplicate it
duplicate_feature_pairs = {}

# List of the duplicated features to drop
duplicate_features = []

In [24]:
for i, feature_1 in enumerate(X_train.columns):

  # Skip features already flagged as duplicates of an earlier feature
  if feature_1 not in duplicate_features:
    duplicate_feature_pairs[feature_1] = []

    # Compare only against the features that come after feature_1
    for feature_2 in X_train.columns[i+1:]:

      if X_train[feature_1].equals(X_train[feature_2]):

        # Add feature 2 to the dictionary entry of feature 1
        # to record the duplicate pair
        duplicate_feature_pairs[feature_1].append(feature_2)

        # And also add it to the duplicate feature list,
        # which can be dropped from the dataset later
        duplicate_features.append(feature_2)
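The loop above can also be wrapped in a reusable function. The following is a minimal sketch of the same pairwise comparison; the function name `find_duplicate_features` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def find_duplicate_features(df):
    """Return (pairs, duplicates) using pairwise Series.equals checks.

    pairs maps each retained feature to the later features identical to it;
    duplicates lists every feature that repeats an earlier one.
    """
    pairs = {}
    duplicates = []
    columns = list(df.columns)

    for i, feature_1 in enumerate(columns):
        # Skip features already flagged as duplicates of an earlier feature
        if feature_1 in duplicates:
            continue
        pairs[feature_1] = []

        # Compare only against the features that come after feature_1
        for feature_2 in columns[i + 1:]:
            if df[feature_1].equals(df[feature_2]):
                pairs[feature_1].append(feature_2)
                duplicates.append(feature_2)

    return pairs, duplicates


# Hypothetical toy data: "a" appears three times and "c" twice
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 1, 0],
    "a_copy": [1, 2, 3],
    "c": ["x", "y", "z"],
    "a_copy_2": [1, 2, 3],
    "c_copy": ["x", "y", "z"],
})
toy_pairs, toy_duplicates = find_duplicate_features(toy)
print(toy_pairs)
print(toy_duplicates)
```

Keeping the first occurrence and listing only the later copies means that dropping `toy_duplicates` leaves exactly one column per group of identical features.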

In [25]:
# Display the list of duplicated features
duplicate_features

['var_151',
 'var_183',
 'var_148',
 'var_216',
 'var_199',
 'var_296',
 'var_239',
 'var_263',
 'var_232',
 'var_269']

In [26]:
# Now display only the dictionary entries that have
# at least one duplicate

for feat, duplicates in duplicate_feature_pairs.items():

  if len(duplicates) > 0:
    print(feat, duplicates)

var_6 ['var_151']
var_34 ['var_183']
var_37 ['var_148']
var_60 ['var_216']
var_84 ['var_199']
var_143 ['var_296']
var_149 ['var_239']
var_221 ['var_263']
var_226 ['var_232']
var_229 ['var_269']
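The duplicate pairs are identified on the train set only. Before dropping the same columns from the test set, it may be worth checking that the pairs are also identical there. The sketch below shows one way to do this; the helper `mismatched_pairs` and the toy data are hypothetical, and the check is an optional addition, not a step from the original notebook.

```python
import pandas as pd


def mismatched_pairs(df, duplicate_feature_pairs):
    """List (feature, duplicate) pairs that are NOT identical in df."""
    return [
        (feature, duplicate)
        for feature, duplicates in duplicate_feature_pairs.items()
        for duplicate in duplicates
        if not df[feature].equals(df[duplicate])
    ]


# Hypothetical pairs found on a train set
toy_pairs = {"a": ["a_copy"], "b": ["b_copy"]}

# Hypothetical test set in which "b_copy" differs from "b"
toy_test = pd.DataFrame({
    "a": [1, 2],
    "a_copy": [1, 2],
    "b": [0, 1],
    "b_copy": [0, 0],
})

mismatches = mismatched_pairs(toy_test, toy_pairs)
print(mismatches)
```

With the notebook's variables, the equivalent call would be `mismatched_pairs(X_test, duplicate_feature_pairs)`, run before the duplicated columns are dropped from `X_test`. An empty list means the duplicates found on the train set hold in the test set as well.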


In [27]:
# Drop the duplicate features from dataset
X_train.drop(labels=duplicate_features, axis=1, inplace=True)
X_test.drop(labels=duplicate_features, axis=1, inplace=True)

In [28]:
X_train.shape, X_test.shape

((35000, 205), (15000, 205))