
# Drop constant features

## Variance Threshold

This method removes features whose variance falls below a chosen threshold.
The idea is that a feature that barely varies across samples usually carries very little predictive power. With `threshold=0`, only constant features (the same value in every row) are removed.

Variance Threshold doesn’t consider the relationship of features with the target variable.
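
As a minimal sketch of the behaviour described above, the example below fits `VarianceThreshold` on a small made-up DataFrame. The `toy` frame, its column names, and the stricter cutoff of `0.2` are illustrative assumptions, not values from the dataset used in this notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "const" never changes, "flag" changes once, "age" varies widely.
toy = pd.DataFrame({
    "const": [1, 1, 1, 1, 1],
    "flag": [0, 0, 0, 0, 1],
    "age": [23, 35, 41, 52, 60],
})

# threshold=0 drops only zero-variance (constant) columns.
# Note that no target is passed to fit(): only the features are inspected.
selector = VarianceThreshold(threshold=0)
selector.fit(toy)
kept_zero = list(toy.columns[selector.get_support()])

# A higher cutoff also drops the near-constant "flag" column (variance 0.16).
selector_strict = VarianceThreshold(threshold=0.2)
selector_strict.fit(toy)
kept_strict = list(toy.columns[selector_strict.get_support()])

print(kept_zero)    # ['flag', 'age']
print(kept_strict)  # ['age']
```

Raising the threshold trades a smaller feature set against the risk of discarding rare but informative signals, so the cutoff is best chosen with the data's scale in mind.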

**Resources**

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

https://github.com/krishnaik06/Complete-Feature-Selection/blob/master/1.%20Feature%20Selection-%20Dropping%20Constant%20Features.ipynb

https://towardsdatascience.com/why-how-and-when-to-apply-feature-selection-e9c69adfabf2 

## Template

In [None]:
from sklearn.feature_selection import VarianceThreshold

# threshold=0 flags only zero-variance (constant) features
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(X_train)

In [None]:
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[var_thres.get_support()]]

In [None]:
print(len(constant_columns))

In [None]:
for column in constant_columns:
    print(column)

In [None]:
# drop() returns a new DataFrame, so assign the result back
X_train = X_train.drop(constant_columns, axis=1)
X_test = X_test.drop(constant_columns, axis=1)

# Correlations

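Remove features that are highly positively correlated with another feature, generally those with a correlation coefficient above 0.85. Such features carry largely redundant information.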
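
To make the idea concrete, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its column names are illustrative assumptions; the loop mirrors the pairwise check used in the template further down and keeps the earlier column of each correlated pair.

```python
import pandas as pd

# Hypothetical toy data: "b" is almost a rescaled copy of "a"; "c" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

corr = toy.corr()  # Pearson correlation by default

threshold = 0.85
to_drop = set()
for i in range(len(corr.columns)):
    for j in range(i):
        # Compare each column with the columns that come before it
        if corr.iloc[i, j] > threshold:
            to_drop.add(corr.columns[i])

reduced = toy.drop(columns=list(to_drop))

print(sorted(to_drop))        # ['b']
print(list(reduced.columns))  # ['a', 'c']
```

Only `b` is flagged, because it nearly duplicates `a`, while `c` is weakly correlated with both and is kept.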

**Resources**

https://github.com/krishnaik06/Complete-Feature-Selection/blob/master/2-Feature%20Selection-%20Correlation.ipynb

## Template

In [None]:
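import matplotlib.pyplot as plt
import seaborn as sns

# Calculate pairwise correlations between features
corr = X_train.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f")
plt.show()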

In [40]:
# The following function selects highly correlated features.
# For each correlated pair it flags the later column, so the earlier one is kept.

def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # Only positive correlations are flagged; wrap the value in abs() to also catch strong negative ones
            if corr_matrix.iloc[i, j] > threshold:
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr

In [None]:
corr_features = correlation(X_train, 0.85)   # 0.85 is a commonly used threshold
len(corr_features)

In [None]:
corr_features

In [None]:
# drop() returns a new DataFrame, so assign the result back
X_train = X_train.drop(list(corr_features), axis=1)
X_test = X_test.drop(list(corr_features), axis=1)