<a href="https://colab.research.google.com/github/RemyaRS/Feature-Selection/blob/main/Dropping_Features_having_Constant_Value_for_Feature_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [1]:
import pandas as pd

# Self Built Dataset

In [2]:
data = pd.DataFrame({"A" : [1,2,3,4,9], "B" : [3,5,8,2,6], "C" : [3,3,3,3,3], "D" : [2,2,2,2,2], "E" : [8,2,6,3,6]})
data.head()

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1 | 3 | 3 | 2 | 8 |
| 1 | 2 | 5 | 3 | 2 | 2 |
| 2 | 3 | 8 | 3 | 2 | 6 |
| 3 | 4 | 2 | 3 | 2 | 3 |
| 4 | 9 | 6 | 3 | 2 | 6 |


Columns C and D each hold a single constant value, so they carry no information that helps solve a problem statement or build a model.

Because every value in such a column is the same, its variance and standard deviation are both 0.
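As a quick check of this claim, the per-column variance and standard deviation can be computed directly with pandas. This is only a minimal sketch: it rebuilds the same toy frame under the illustrative name `toy`, and it uses the pandas defaults (sample statistics, `ddof=1`), which still give exactly 0 for a constant column.

```python
import pandas as pd

# Same toy frame as above, rebuilt so the snippet runs on its own.
toy = pd.DataFrame({"A": [1, 2, 3, 4, 9], "B": [3, 5, 8, 2, 6],
                    "C": [3, 3, 3, 3, 3], "D": [2, 2, 2, 2, 2],
                    "E": [8, 2, 6, 3, 6]})

variances = toy.var()   # sample variance (ddof=1) of each column
std_devs = toy.std()    # sample standard deviation of each column

# Columns whose variance is exactly zero are the constant ones.
zero_var_cols = variances[variances == 0].index.tolist()
print(zero_var_cols)
```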

To remove these columns, first create a selector that identifies the columns whose variance is 0.

In [3]:
from sklearn.feature_selection import VarianceThreshold
var_thresh = VarianceThreshold(threshold=0)

The selector var_thresh can now be fitted to the dataset as follows:

In [4]:
var_thresh.fit(data)
var_thresh.get_support()

array([ True,  True, False, False,  True])

Columns A, B and E have non-zero variance, so get_support() returns True for them; columns C and D have zero variance, so it returns False.
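The boolean array is positional, so it can help to pair each flag with its column name. The following sketch does that on a rebuilt copy of the toy frame; the names `toy`, `selector` and `support_by_column` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Rebuild the toy frame so the snippet is self-contained.
toy = pd.DataFrame({"A": [1, 2, 3, 4, 9], "B": [3, 5, 8, 2, 6],
                    "C": [3, 3, 3, 3, 3], "D": [2, 2, 2, 2, 2],
                    "E": [8, 2, 6, 3, 6]})

selector = VarianceThreshold(threshold=0).fit(toy)

# Pair every column name with its keep (True) / drop (False) flag.
support_by_column = dict(zip(toy.columns, selector.get_support()))
for name, keep in support_by_column.items():
    print(name, "keep" if keep else "drop")
```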

Next, build a list of the zero-variance columns; this list can then be used to drop those columns from the dataset.

In [5]:
const_col = [column for column in data.columns
             if column not in data.columns[var_thresh.get_support()]]

In [6]:
const_col

['C', 'D']
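The same list can probably be obtained more directly by negating the support mask instead of testing membership column by column. This is a sketch of that variant on a rebuilt toy frame (the names `toy`, `mask` and `const_cols` are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({"A": [1, 2, 3, 4, 9], "B": [3, 5, 8, 2, 6],
                    "C": [3, 3, 3, 3, 3], "D": [2, 2, 2, 2, 2],
                    "E": [8, 2, 6, 3, 6]})

selector = VarianceThreshold(threshold=0).fit(toy)
mask = selector.get_support()   # True for the columns that are kept

# `~mask` flips the flags, so it selects the zero-variance columns.
const_cols = toy.columns[~mask].tolist()
print(const_cols)
```

Both approaches should return the same names; the negated mask simply avoids the repeated `not in` lookups, which matters more on wide datasets.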

In [7]:
data = data.drop(const_col, axis=1)
data.head()

|   | A | B | E |
|---|---|---|---|
| 0 | 1 | 3 | 8 |
| 1 | 2 | 5 | 2 |
| 2 | 3 | 8 | 6 |
| 3 | 4 | 2 | 3 |
| 4 | 9 | 6 | 6 |
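For comparison, constant columns can also be found with pandas alone by counting distinct values per column. This is an alternative technique, not the VarianceThreshold method used in this notebook, and the names below are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 3, 4, 9], "B": [3, 5, 8, 2, 6],
                    "C": [3, 3, 3, 3, 3], "D": [2, 2, 2, 2, 2],
                    "E": [8, 2, 6, 3, 6]})

# A column with a single distinct value is constant.
distinct_counts = toy.nunique()
constant_by_nunique = distinct_counts[distinct_counts == 1].index.tolist()

toy_reduced = toy.drop(columns=constant_by_nunique)
print(constant_by_nunique, list(toy_reduced.columns))
```

Unlike VarianceThreshold, this check also works for non-numeric columns, which is one reason it may be handy as a cross-check.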


# Imported Dataset

Source : https://www.kaggle.com/competitions/santander-customer-satisfaction/data?select=train.csv

### Using Google Drive to import the dataset

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
dataset = pd.read_csv('/content/drive/MyDrive/train.csv')

In [11]:
dataset.head()

|   | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 | 0 |
| 1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 | 0 |
| 2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 | 0 |
| 3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.97 | 0 |
| 4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |


In [12]:
dataset.shape

(76020, 371)

### Separating the feature variables from the labels, because this feature selection method is applied to the feature variables only

In [13]:
x = dataset.drop(labels = ['TARGET'], axis=1)
y = dataset['TARGET']
x.head()

|   | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var29_ult3 | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 |
| 1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 |
| 2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 |
| 3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.97 |
| 4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |


In [14]:
x.shape

(76020, 370)

Now set the variance threshold

In [15]:
from sklearn.feature_selection import VarianceThreshold
var_thres = VarianceThreshold(threshold=0)

In [16]:
var_thres.fit(x)
var_thres.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

Columns with zero variance show False; all other columns show True.
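With hundreds of columns the raw mask is hard to read, so counting the flags may be more useful than printing them. The sketch below uses a small random frame as a stand-in for the real feature matrix; the frame `wide` and its `col_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical stand-in for a wide feature matrix: 6 columns, 2 of them constant.
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(100, 6)),
                    columns=[f"col_{i}" for i in range(6)])
wide["col_2"] = 0.0
wide["col_5"] = 1.0

selector = VarianceThreshold(threshold=0).fit(wide)
mask = selector.get_support()

n_kept = int(mask.sum())       # columns with non-zero variance
n_const = int((~mask).sum())   # columns with zero variance
print(f"kept: {n_kept}, constant: {n_const}")

# The fitted selector also exposes the variance it computed for each column.
variance_by_column = pd.Series(selector.variances_, index=wide.columns)
```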

As before, build a list of the zero-variance columns so that they can be dropped from the dataset.

In [17]:
const_column = [column for column in x.columns
             if column not in x.columns[var_thres.get_support()]]

In [18]:
const_column

['ind_var2_0',
 'ind_var2',
 'ind_var27_0',
 'ind_var28_0',
 'ind_var28',
 'ind_var27',
 'ind_var41',
 'ind_var46_0',
 'ind_var46',
 'num_var27_0',
 'num_var28_0',
 'num_var28',
 'num_var27',
 'num_var41',
 'num_var46_0',
 'num_var46',
 'saldo_var28',
 'saldo_var27',
 'saldo_var41',
 'saldo_var46',
 'imp_amort_var18_hace3',
 'imp_amort_var34_hace3',
 'imp_reemb_var13_hace3',
 'imp_reemb_var33_hace3',
 'imp_trasp_var17_out_hace3',
 'imp_trasp_var33_out_hace3',
 'num_var2_0_ult1',
 'num_var2_ult1',
 'num_reemb_var13_hace3',
 'num_reemb_var33_hace3',
 'num_trasp_var17_out_hace3',
 'num_trasp_var33_out_hace3',
 'saldo_var2_ult1',
 'saldo_medio_var13_medio_hace3']

In [19]:
x.shape

(76020, 370)

In [20]:
x_new = x.drop(const_column, axis=1)
x_new.shape

(76020, 336)
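An alternative to dropping by name is to let the fitted selector do the filtering with `fit_transform`. By default this returns a NumPy array, so the sketch below wraps the result back into a DataFrame. The frame `features` and its column names are invented for the example and are not taken from the Santander data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Small made-up feature frame; `f2` is constant.
features = pd.DataFrame({
    "f1": [0.5, 1.5, 2.5, 3.5],
    "f2": [7, 7, 7, 7],
    "f3": [1, 0, 1, 0],
})

selector = VarianceThreshold(threshold=0)
reduced_values = selector.fit_transform(features)   # array without constant columns

# Recover the surviving column names from the support mask.
kept_names = features.columns[selector.get_support()]
reduced = pd.DataFrame(reduced_values, columns=kept_names, index=features.index)
print(reduced.shape)
```

Keeping the fitted selector around has the advantage that the same `transform` can later be applied to another frame with the same columns, such as a held-out test set.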

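To also remove quasi-constant features, raise the threshold, for example from threshold=0 to threshold=0.1. Note that the threshold is an absolute variance value, not a percentage: with threshold=0.1, every feature whose variance does not exceed 0.1 is dropped, so the result depends on the scale of each feature.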
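The effect of a non-zero threshold can be illustrated on a small made-up frame. In this sketch the column `rare_flag` is 1 in one row out of ten, so its population variance is 0.1 × 0.9 = 0.09; the frame `demo` and the helper `dropped_columns` are hypothetical names introduced only for the example.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up data: `rare_flag` is almost constant (variance 0.09).
demo = pd.DataFrame({
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "balanced_flag": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "amount": [10, 12, 9, 30, 25, 14, 8, 40, 22, 18],
})

def dropped_columns(frame, threshold):
    """Return the columns VarianceThreshold would remove at this threshold."""
    selector = VarianceThreshold(threshold=threshold).fit(frame)
    return frame.columns[~selector.get_support()].tolist()

dropped_at_zero = dropped_columns(demo, 0)           # nothing is strictly constant
dropped_at_point_one = dropped_columns(demo, 0.1)    # the quasi-constant flag goes
print(dropped_at_zero, dropped_at_point_one)
```

Because the comparison is made on raw variances, a feature measured in large units will rarely fall under a small threshold; choosing the value usually requires looking at the feature scales first.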