
<h2><div style="font-family: Trebuchet MS; background-color: red; color: #FFFFFF; padding: 12px; line-height: 1.5;">📊 Constant Features 🤔</div> </h2>


Constant features show a single value for every observation in the dataset, that is, the same value in every row. They carry no information that a machine learning model could use to discriminate between observations or predict a target, so we identify and remove them.
To identify constant features, we can use the VarianceThreshold class from sklearn.
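
As a quick cross-check that does not depend on sklearn, a constant column can also be found by counting distinct values. The sketch below runs on a small made-up DataFrame (`toy` and its column names are hypothetical, not part of the Santander data):

```python
import pandas as pd

# Hypothetical toy data: column "b" holds a single value in every row.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0, 0, 1, 0],
})

# A column is constant when it has exactly one distinct value.
# dropna=False counts NaN as a value, so an all-NaN column is flagged too.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```

This approach also works for non-numeric columns, which VarianceThreshold cannot handle directly.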

In [196]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import collections
import os

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score,classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [81]:
#Load the train dataset. It contains more than 76,000 records, so we load only 10,000 to keep things fast.
df=pd.read_csv('../input/santander-customer-satisfaction/train.csv',nrows=10000)
df.shape

In [82]:
df.info()

In [83]:
#Find the missing data in the columns
[col for col in df.columns if df[col].isnull().sum()>0]
#After executing the above code we found that there is no missing data.

**It is good practice to select features by examining only the training set, so that no information from the test set leaks into the selection and the test score stays an honest estimate.**

In [124]:
# separate dataset into train and test
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(labels=['TARGET'], axis=1),df['TARGET'],test_size=0.3,random_state=0)
#Shape of training set and test set.
X_train.shape, X_test.shape

# I keep a copy of the dataset with all the variables
# to measure the performance of machine learning models
# at the end of the notebook
X_train_org=X_train.copy()
X_test_org=X_test.copy()

**Using Variance Threshold**<br>
VarianceThreshold from sklearn is a simple baseline approach to feature selection. It removes every feature whose variance does not exceed a given threshold. By default (threshold=0), it removes all zero-variance features, i.e., features that have the same value in all samples.
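
To make the behaviour concrete, here is a minimal sketch on a made-up DataFrame (`toy` and its columns are hypothetical). It shows how the boolean mask from `get_support()` maps back to column names:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: column "b" has zero variance.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [7, 7, 7, 7],
    "c": [0, 0, 1, 0],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

mask = selector.get_support()            # True = kept, False = removed
kept = toy.columns[mask].tolist()
removed = toy.columns[~mask].tolist()
print("kept:", kept)
print("removed:", removed)
```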

In [125]:
varModel=VarianceThreshold(threshold=0) #A threshold of 0 removes only features that have the same value in all samples.
varModel.fit(X_train)

In [126]:
constArr=varModel.get_support()
constArr
#get_support() returns True or False for each feature.
#True: Not a constant feature
#False: Constant feature (it contains the same value in all samples)

In [127]:
#Count the constant and non-constant features with collections.Counter.
collections.Counter(constArr)
#Non Constant feature:284
#Constant feature: 86

We can see that 86 features/columns hold a constant value, meaning they have the same value in all samples. Let's verify that by selecting some of the constant features and printing their value counts. We could also use the unique method.

In [128]:
#Print out the constant feature names (the columns where the mask is False)
constCol=X_train.columns[~constArr].tolist()
constCol

In [129]:
print(X_train['ind_var2_0'].value_counts())
print(X_train['ind_var13_medio'].value_counts())
print(X_train['ind_var27'].value_counts())

Constant features play no role in predicting the result, so we will remove them from both the training set and the test set. transform() would remove all the constant columns, but we will not use it here because it converts a DataFrame into a NumPy array, and we want to reuse the same training and test sets for the other feature selection methods. Instead, we drop the constant columns from the DataFrames directly.
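
One way to keep the DataFrame structure is to index the columns with the selector's mask instead of calling transform(). The sketch below uses two small made-up frames (`train` and `test` are hypothetical). The selection is decided on the training data only and then applied unchanged to the test data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "b" is constant in train but not in test.
train = pd.DataFrame({"a": [1, 2, 3], "b": [5, 5, 5]})
test = pd.DataFrame({"a": [4, 5], "b": [5, 6]})

# Fit on the training set only.
selector = VarianceThreshold(threshold=0).fit(train)
keep = train.columns[selector.get_support()]

# Column indexing returns DataFrames, so names and dtypes are preserved.
train_sel = train[keep]
test_sel = test[keep]
print(list(train_sel.columns), list(test_sel.columns))
```

Note that "b" is dropped from the test frame even though it varies there, because the decision comes from the training set alone.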

In [130]:
print('Shape before drop-->',X_train.shape, X_test.shape)
#X_train=varModel.transform(X_train)
#X_test=varModel.transform(X_test)
#Reassign instead of dropping in place; columns= already implies axis=1
X_train=X_train.drop(columns=constCol)
X_test=X_test.drop(columns=constCol)
print('Shape after drop-->',X_train.shape, X_test.shape)



<h2><div style="font-family: Trebuchet MS; background-color: red; color: #FFFFFF; padding: 12px; line-height: 1.5;">📊 Quasi-Constant Features  🤔</div> </h2>

Quasi-constant features show the same value for the great majority of the observations in the dataset. Such features are usually not considered when predicting the result.
To identify quasi-constant features, we can again use VarianceThreshold from sklearn. We will be using the same training set and test set.


<h3><div style="font-family: Trebuchet MS; background-color:#176BA0;; color: #FFFFFF; padding: 10px; line-height: 1.5;">1. | Using Variance Threshold 🌟 📚</div></h3>

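As before, VarianceThreshold removes every feature whose variance does not exceed the threshold. Here we raise the threshold from 0 to 0.01. For a binary 0/1 feature the variance is p(1-p), so a threshold of 0.01 roughly corresponds to features where about 99% of the samples share one value. For features on other scales the variance also depends on the magnitude of the values, so the 99% reading is only approximate.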
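
Because variance is only a proxy for "almost always the same value", it can help to measure the share of the most frequent value directly. The following is a minimal sketch on made-up data (`toy`, `dominant_share` and the 0.99 cut-off are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data with 200 rows.
toy = pd.DataFrame({
    "mostly_zero": [0] * 199 + [1],   # 99.5% of the rows are 0
    "balanced": [0, 1] * 100,         # 50% zeros, 50% ones
})

def dominant_share(series):
    """Fraction of rows taken by the most frequent value."""
    return series.value_counts(normalize=True, dropna=False).max()

shares = toy.apply(dominant_share)
quasi_cols = shares[shares >= 0.99].index.tolist()

# Population variance (ddof=0), the same quantity VarianceThreshold compares.
variances = toy.var(ddof=0)
print(shares)
print(variances)
print(quasi_cols)
```

For the binary column the variance is 0.005 * 0.995, which is below 0.01, so both views agree here.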

In [131]:
#Create variance threshold model
quasiModel=VarianceThreshold(threshold=0.01) #Flags features with variance of at most 0.01; for binary 0/1 features that is roughly 99% of samples sharing one value.
quasiModel.fit(X_train)

In [132]:
quasiArr=quasiModel.get_support()
quasiArr
#get_support() returns True or False for each feature.
#True: Not a quasi constant feature
#False: Quasi constant feature (variance of at most 0.01, roughly 99% identical values for a binary feature)

In [133]:
#To find total number of quasi constant and non quasi constant features we will be using collections.Counter function.
collections.Counter(quasiArr)
#Non quasi Constant feature:241
#Quasi constant feature: 43

We can see that 43 features/columns are quasi-constant, meaning they show the same value in about 99% of the samples. Let's verify that by selecting some of the quasi-constant features and printing their value counts.

In [134]:
#Print out the quasi constant feature names (the columns where the mask is False)
quasiCols=X_train.columns[~quasiArr].tolist()
quasiCols

In [135]:
totalSampleCount=len(X_train)
print(X_train['num_aport_var33_ult1'].value_counts()/totalSampleCount)
print(X_train['num_var29'].value_counts()/totalSampleCount)
print(X_train['num_venta_var44_ult1'].value_counts()/totalSampleCount)

We can see that more than 99% of the observations show a single value, 0. These features are therefore almost constant. Let's remove them from the training set and the test set.

In [136]:
print('Shape before drop-->',X_train.shape, X_test.shape)
#Reassign instead of dropping in place; columns= already implies axis=1
X_train=X_train.drop(columns=quasiCols)
X_test=X_test.drop(columns=quasiCols)
print('Shape after drop-->',X_train.shape, X_test.shape)




<h3><div style="font-family: Trebuchet MS; background-color:#176BA0;; color: #FFFFFF; padding: 10px; line-height: 1.5;">1. | Duplicated Features 🌟 📚</div></h3>

Datasets often contain two or more features that show the same values across all the observations, which means those features are in essence identical. In addition, it is not unusual to introduce duplicated features after one hot encoding of categorical variables, particularly when using several high-cardinality variables. <br>
Note: Finding duplicated features by pairwise comparison is computationally costly in Python (the number of comparisons grows quadratically with the number of columns), so depending on the size of your dataset, you might not always be able to perform it.
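
As an alternative to an explicit double loop, pandas can flag duplicated columns by transposing the frame and calling `duplicated()`. This is a sketch on made-up data (`toy` and its column names are hypothetical). Transposing copies the data, so it may still be expensive on very wide or mixed-dtype frames:

```python
import pandas as pd

# Hypothetical toy data: "x_copy" repeats "x" exactly.
toy = pd.DataFrame({
    "x": [1, 0, 1, 0],
    "x_copy": [1, 0, 1, 0],
    "y": [3, 4, 5, 6],
})

# After transposing, each original column is a row, so duplicated()
# marks every column that repeats an earlier one.
dup_mask = toy.T.duplicated()
dup_cols = dup_mask[dup_mask].index.tolist()

# Keep only the first occurrence of each set of identical columns.
deduped = toy.loc[:, ~dup_mask.values]
print(dup_cols)
print(list(deduped.columns))
```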

In [137]:
#Finds duplicated columns and returns each matching pair as a 'col1,col2' string in a list
def duplicateColumns(data):
    dupliCols=[]
    for i in range(0,len(data.columns)):
        col1=data.columns[i]
        for col2 in data.columns[i+1:]:
            if data[col1].equals(data[col2]):
                dupliCols.append(col1+','+col2)
    return dupliCols

In [138]:
duplCols=duplicateColumns(X_train)
duplCols

In [139]:
print('Total Duplicated columns',len(duplCols))

In [140]:
#Let's verify that the two columns are identical.
X_train[['ind_var1_0','ind_var40_0']]

In [141]:
#Take the second column of each pair as the duplicate to drop
dCols=[col.split(',')[1] for col in duplCols]
dCols

Remove the duplicated columns from the training set and the test set.

In [142]:
#Count the distinct duplicated columns (a column can appear in more than one pair)
len(set(dCols))

In [143]:
print('Shape of our data before applying filter technique-->',df.shape)
print('Shape before dropping duplicate columns-->',X_train.shape, X_test.shape)
X_train=X_train.drop(columns=dCols,axis=1)
X_test=X_test.drop(columns=dCols,axis=1)
print('Shape after dropping duplicate columns-->',X_train.shape, X_test.shape)

# I keep a copy of the dataset without the constant, quasi-constant and duplicated variables
# to measure the performance of machine learning models
# at the end of the notebook
X_train_fil=X_train.copy()
X_test_fil=X_test.copy()

**As you can see, after applying the constant, quasi-constant and duplicate filter methods we have removed 150 features (371-221=150) from our training and test sets. Let's do some more filtering.**




<h3><div style="font-family: Trebuchet MS; background-color:#176BA0;; color: #FFFFFF; padding: 10px; line-height: 1.5;">1. | Correlation 🌟 📚</div></h3>

Correlation measures the strength of the relationship between two features. Good feature subsets contain features that are highly correlated with the target, yet uncorrelated with each other.
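
Both kinds of correlation can be computed with pandas. The sketch below uses a small made-up DataFrame (`toy`, the feature names and the target values are hypothetical). `corrwith` gives the correlation of each feature with the target, and `corr` gives the feature-to-feature matrix:

```python
import pandas as pd

# Hypothetical toy data: f2 is an exact multiple of f1.
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],
    "f3": [5, 3, 4, 1, 2],
    "target": [1.1, 1.9, 3.2, 3.9, 5.1],
})

features = toy.drop(columns="target")

# Pearson correlation of each feature with the target.
target_corr = features.corrwith(toy["target"])

# Pearson correlation between every pair of features.
feature_corr = features.corr()

print(target_corr)
print(feature_corr)
```

Here f1 and f2 are both strongly related to the target but also perfectly correlated with each other, so keeping both adds no new information.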

I will use the House Prices dataset to show the correlation between columns.

In [144]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [145]:
houseDf=pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

In [146]:
houseDf.head()

In [147]:
houseDf.info()

In [148]:
#For now I will be dealing with numerical columns only.
colType = ['int64','float64']
#Select the columns which are either int64 or float64.
numCols=list(houseDf.select_dtypes(include=colType).columns)
#Copy the numerical columns into a new DataFrame so later changes do not affect houseDf.
data=houseDf[numCols].copy()

In [149]:
data.shape

In [150]:
#Check if there is any missing data.
data.isnull().sum().max()

In [151]:
#Fill missing data with 0 (reassign rather than modify in place)
data=data.fillna(0)

In [152]:
#Re-check if there is any missing data.
data.isnull().sum().max()

In [153]:
#Split our data in training and test set.
x_train,x_test,y_train,y_test=train_test_split(data.drop('SalePrice',axis=1),data['SalePrice'],test_size=.2,random_state=1)

In [154]:
# visualise correlated features
# I will build the correlation matrix, which examines the 
# correlation of all features (for all possible feature combinations)
# and then visualise the correlation matrix using seaborn
plt.figure(figsize=(20,15))
sns.heatmap(x_train.corr())
plt.show()

In the plot above, the light squares correspond to highly correlated features (>0.75). We can see that there are quite a few. The diagonal represents the correlation of a feature with itself, therefore the value is 1.
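
A heatmap becomes hard to read with many features, so it can help to list the strongest pairs explicitly. This is a minimal sketch on made-up data (`toy` and the 0.75 cut-off are illustrative assumptions). It keeps only the upper triangle of the absolute correlation matrix, so each pair appears once and the diagonal is excluded:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: f1 and f2 move together, f3 does not.
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],
    "f3": [2, 5, 1, 4, 3],
})

corr = toy.corr().abs()

# Mask everything except the strict upper triangle (k=1 skips the diagonal).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per feature pair, strongest first.
pairs = upper.stack().dropna().sort_values(ascending=False)
strong_pairs = pairs[pairs > 0.75]
print(strong_pairs)
```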




<h3><div style="font-family: Trebuchet MS; background-color:#176BA0;; color: #FFFFFF; padding: 10px; line-height: 1.5;">1. | Brute force approach 🌟 📚</div></h3>

In [155]:
def correlation(dataset,threshold):
    col_corr=set() #a set keeps each column name only once
    corr_matrix=dataset.corr() #pairwise correlation between columns
    for i in range(len(corr_matrix.columns)): #loop over the columns
        for j in range(i): #compare only with earlier columns (lower triangle)
            if abs(corr_matrix.iloc[i,j])>threshold: #absolute correlation above the threshold
                colName=corr_matrix.columns[i] #getting the column name
                col_corr.add(colName) #flag the later column of the correlated pair
    return col_corr #returning set of column names
col=correlation(x_train,0.75)
print('Correlated columns:',col)

We can see that 3 features are highly correlated with other features in the training set. This is a small dataset, which is why only 3 features are flagged. Let's now look for correlated features in the Santander Customer Satisfaction dataset.

In [156]:
#X_train is the training set from the Santander dataset.
scol=correlation(X_train,0.8)
print('Correlated columns:',scol)
print(len(scol))

**As you can see, 134 features are highly correlated with other features.**
Correlated features generally do not improve model performance, so it is usually better to remove them. Doing so also makes the learning algorithm faster.
Fewer features usually mean a significant improvement in speed and help mitigate the curse of dimensionality.
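
The nested loop can also be written in vectorized form. The sketch below is an assumed equivalent rather than the notebook's own function (`correlated_columns` and `toy` are hypothetical names). Like the loop, it flags a column when its absolute correlation with any earlier column exceeds the threshold:

```python
import numpy as np
import pandas as pd

def correlated_columns(dataset, threshold):
    """Return the set of columns correlated with an earlier column."""
    corr = dataset.corr().abs()
    # Strict upper triangle: column c only keeps rows for earlier columns.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {c for c in upper.columns if (upper[c] > threshold).any()}

# Hypothetical toy data: f2 closely follows f1, f3 does not.
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 11],
    "f3": [2, 5, 1, 4, 3],
})

to_drop = correlated_columns(toy, 0.75)
reduced = toy.drop(columns=list(to_drop))
print(to_drop)
print(list(reduced.columns))
```

Only the later column of each correlated pair is flagged, so one representative of every correlated group stays in the data.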






<h3><div style="font-family: Trebuchet MS; background-color:#176BA0;; color: #FFFFFF; padding: 10px; line-height: 1.5;">1. | Brute force  🌟 📚</div></h3>

In [157]:
print('Shape of our data before applying filter technique-->',df.shape)
print('Shape before dropping correlated columns-->',X_train.shape, X_test.shape)
X_train=X_train.drop(columns=scol,axis=1)
X_test=X_test.drop(columns=scol,axis=1)
print('Shape after dropping correlated columns-->',X_train.shape, X_test.shape)

In [206]:
# create a function to build random forests and compare performance in train and test set
def RandomForest(X_train, X_test, y_train, y_test):
    rf = RandomForestClassifier(n_estimators=200, random_state=1, max_depth=4)
    rf.fit(X_train, y_train)
    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

In [202]:
# original dataset result
RandomForest(X_train_org.drop(labels=['ID'], axis=1),
                  X_test_org.drop(labels=['ID'], axis=1),
                  Y_train, Y_test)

In [203]:
#Result after applying basic filter method on dataset.
RandomForest(X_train_fil.drop(labels=['ID'], axis=1),
                  X_test_fil.drop(labels=['ID'], axis=1),
                  Y_train, Y_test)

In [204]:
#Result after removing correlated features from filtered dataset.
RandomForest(X_train.drop(labels=['ID'], axis=1),
                  X_test.drop(labels=['ID'], axis=1),
                  Y_train, Y_test)

#### We can see that after applying the constant, quasi-constant, duplicate and correlated feature filter methods, we have removed **284** features (original feature count = 371, feature count after applying the filter methods = 87) and the performance of the model has also improved (0.7581 vs 0.7619).