
# **Correlation**
Correlation Feature Selection evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the target, yet uncorrelated to each other".
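To make the hypothesis concrete, Hall's thesis scores a subset of k features with a merit, k * r_cf / sqrt(k + k * (k - 1) * r_ff), that rises with the average feature-target correlation (r_cf) and falls with the average feature-feature correlation (r_ff). The sketch below is only an illustration, not the procedure used in this notebook: it assumes absolute Pearson correlation as the correlation measure, and the function name `cfs_merit` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def cfs_merit(X, y):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = X.shape[1]
    # Average absolute correlation between each feature and the target
    r_cf = X.corrwith(y).abs().mean()
    if k == 1:
        return r_cf
    # Average absolute correlation over all distinct feature pairs
    corr = X.corr().abs().to_numpy()
    r_ff = corr[np.triu_indices(k, k=1)].mean()
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): the target depends on two independent features
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_noisy": a + 0.1 * rng.normal(size=2000),  # near-copy of "a"
})
target = pd.Series(a + b)

merit_complementary = cfs_merit(toy[["a", "b"]], target)
merit_redundant = cfs_merit(toy[["a", "a_noisy"]], target)
print(merit_complementary, merit_redundant)
```

The subset made of two uncorrelated, informative features should score higher than the subset that pairs a feature with its near-copy, which is the behaviour the hypothesis describes.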

References:

Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato.

Senliol, B., et al. Fast Correlation Based Filter (FCBF) with a different search strategy. Computer and Information Sciences.

I will demonstrate how to select features based on correlation using two procedures:

The first is a brute-force function that flags correlated features without giving any further insight into how they relate to each other.

The second finds groups of correlated features, which we can then explore to decide which feature to keep and which to discard.

Often, more than two features are correlated with each other, so we can find groups of three, four or more correlated features. Identifying these groups with the second procedure lets us choose, within each group, which feature to keep and which to remove.

Note

The most widely used measure of correlation is Pearson's correlation coefficient, which captures linear relationships and is the one I use in this notebook.
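As a quick illustration of what Pearson's coefficient measures, the sketch below computes it from its definition (covariance divided by the product of the standard deviations) and compares the result with pandas. The helper `pearson` and the toy columns are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def pearson(a, b):
    """Pearson's r: covariance divided by the product of standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator


rng = np.random.default_rng(0)
x = rng.normal(size=5000)
toy = pd.DataFrame({
    "x": x,
    "linear": 2 * x + rng.normal(scale=0.5, size=5000),  # linear in x plus noise
    "squared": x ** 2,                                   # depends on x, but not linearly
})

r_manual = pearson(toy["x"], toy["linear"])
r_pandas = toy["x"].corr(toy["linear"], method="pearson")
r_nonlinear = toy["x"].corr(toy["squared"], method="pearson")
print(r_manual, r_pandas, r_nonlinear)
```

The manual value and the pandas value should agree, and the coefficient for `squared` should be close to zero even though that column is fully determined by `x`. A low Pearson correlation therefore does not rule out a non-linear relationship.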

In [51]:
pip install feature_engine



In [52]:
# import important libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures

In [53]:
# Mount google drive
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [54]:
filename = "/content/drive/MyDrive/Data Science/Feature Selection/dataset_1.csv"

In [55]:
df = pd.read_csv(filename)

In [56]:
# Create features and labels
X = df.drop(labels="target", axis=1)
y = df["target"]

In [57]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## **Handling Constants, Quasi-Constants and Duplicates**

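Drop the constant, quasi-constant and duplicated features from the dataset using Feature-engine transformers inside a scikit-learn Pipeline. With `tol=0.998`, a feature is treated as quasi-constant when a single value accounts for roughly 99.8% or more of its observations.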

In [58]:
pipe = Pipeline(steps=[("const_drop", DropConstantFeatures(tol=0.998, variables=None, missing_values="raise")),
                       ("duplicate_drop", DropDuplicateFeatures(variables=None, missing_values="raise"))])

In [59]:
pipe.fit(X_train)

In [60]:
# Now drop the constants and duplicates
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [61]:
X_train.shape, X_test.shape

((35000, 152), (15000, 152))

## **Handling the correlation**

In [62]:
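# Brute-force method

def correlation(dataset, threshold):
  """Return the names of features whose absolute correlation with an
  earlier column exceeds `threshold`."""
  correlated_features = set()
  corr_matrix = dataset.corr()

  # Walk the lower triangle so that each pair is checked only once
  for i in range(len(corr_matrix.columns)):
    for j in range(i):
      # abs() treats a strong negative correlation like a positive one
      if abs(corr_matrix.iloc[i, j]) > threshold:
        # Flag the later column (i) of the pair
        colname = corr_matrix.columns[i]
        correlated_features.add(colname)

  return correlated_features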

In [63]:
correlated_features = correlation(X_train, 0.8)
len(correlated_features)

76

In [64]:
# Drop the correlated features from the dataset
X_train.drop(labels=correlated_features, axis=1, inplace=True)
X_test.drop(labels=correlated_features, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 76), (15000, 76))
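One consequence of the brute-force approach is that the feature it discards depends only on column order, not on how useful the feature is. The toy example below, with hypothetical column names and data, repeats the function so that it runs on its own and shows that swapping two correlated columns changes which one is flagged.

```python
import numpy as np
import pandas as pd


def correlation(dataset, threshold):
    """Flag the later column of every pair correlated above `threshold`."""
    correlated_features = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                correlated_features.add(corr_matrix.columns[i])
    return correlated_features


rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),  # strongly correlated with "a"
    "c": rng.normal(size=1000),            # unrelated to both
})

dropped_original_order = correlation(toy, 0.8)
dropped_swapped_order = correlation(toy[["b", "a", "c"]], 0.8)
print(dropped_original_order)  # {'b'}
print(dropped_swapped_order)   # {'a'}
```

This is the main motivation for looking at groups of correlated features instead: it lets us choose which member of each group to keep.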

## **Second approach: groups of correlated features**

In [65]:
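# Start again from the full feature set, because X_train and X_test
# were reduced in place by the brute-force approach
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape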

((35000, 300), (15000, 300))

In [66]:
pipe.fit(X_train)

In [67]:
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)
X_train.shape, X_test.shape

((35000, 152), (15000, 152))

In [68]:
corrmat = X_train.corr()
corrmat = corrmat.abs().unstack()
corrmat

var_4    var_4      1.000000
         var_5      0.055807
         var_8      0.194042
         var_13     0.033760
         var_15     0.491904
                      ...   
var_300  var_288    0.103559
         var_292    0.007170
         var_293    0.029480
         var_295    0.584980
         var_300    1.000000
Length: 23104, dtype: float64
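On a small frame it is easier to see what `unstack` does here: it turns the square correlation matrix into a Series indexed by pairs of feature names, with one entry per ordered pair, which is why 152 features give 152 × 152 = 23104 entries. The toy columns below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=1000),
    "c": rng.normal(size=1000),
})

# Square 3 x 3 matrix -> Series with a (feature, feature) MultiIndex
pairs = toy.corr().abs().unstack()
print(len(pairs))  # 9 entries: every ordered pair, including self-pairs

# Each pair appears twice, once in each order, with the same value
print(pairs[("a", "b")], pairs[("b", "a")])

# Self-pairs sit on the diagonal of the matrix and have correlation 1
print(pairs[("a", "a")])
```

Because every pair shows up in both orders and each feature is paired with itself, any filter applied to this Series has to account for the self-pairs.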

In [69]:
# Keep only the highly correlated pairs (absolute correlation of 0.8 or more)
corrmat = corrmat[corrmat >= 0.8]
# Drop values equal to 1, which removes each feature's correlation with itself
corrmat = corrmat[corrmat < 1]
corrmat

In [70]:
corrmat = pd.DataFrame(corrmat)
corrmat

In [71]:
# Move the pair of feature names from the index into regular columns
corrmat = pd.DataFrame(corrmat).reset_index()
corrmat

In [72]:
# Create the column labels for the dataframe
corrmat.columns = ["feature1", "feature2", "corr"]
corrmat.head()

In [80]:
# Find the groups of correlated features

grouped_feature_ls = []
correlated_groups = []

for feature in corrmat.feature1.unique():
  if feature not in grouped_feature_ls:

    # find all features correlated with this feature
    correlated_block = corrmat[corrmat.feature1 == feature]

    # record the features that have already been assigned to a group
    grouped_feature_ls = grouped_feature_ls + list(
        correlated_block.feature2.unique()) + [feature]

    # append the block of features to the list
    correlated_groups.append(correlated_block)


print(f"found {len(correlated_groups)} correlated groups, out of {X_train.shape[1]} features")
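Once the groups are known, one feature from each group has to be chosen. The notebook stops at identifying the groups; as one possible criterion (an assumption, not something the notebook prescribes), the sketch below keeps, within each group, the feature with the highest absolute correlation with the target. The toy data, the 0.8 threshold and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "a_noisy": a + 0.5 * rng.normal(size=n),  # correlated with "a"
    "b": b,
    "b_noisy": b + 0.5 * rng.normal(size=n),  # correlated with "b"
    "c": rng.normal(size=n),                  # not correlated with anything
})
target = pd.Series(a + b)

# Long-format table of highly correlated pairs, excluding self-pairs
pairs = toy.corr().abs().unstack()
pairs = pairs[(pairs >= 0.8) & (pairs < 1)].reset_index()
pairs.columns = ["feature1", "feature2", "corr"]

# Collect each group as a list of feature names
seen = set()
groups = []
for feature in pairs["feature1"].unique():
    if feature not in seen:
        partners = pairs.loc[pairs["feature1"] == feature, "feature2"]
        group = [feature] + partners.tolist()
        seen.update(group)
        groups.append(group)

# Keep the group member most correlated with the target; drop the rest
target_corr = toy.corrwith(target).abs()
to_drop = []
for group in groups:
    keep = target_corr[group].idxmax()
    to_drop += [f for f in group if f != keep]

selected = toy.drop(columns=to_drop)
print(groups)
print(to_drop)
print(list(selected.columns))
```

Other criteria are equally defensible, for example keeping the feature with the fewest missing values or the one ranked highest by a model fitted on each group. Whichever is used, the selection should be decided on the training set and then applied unchanged to the test set.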
