
# **Correlation**
Correlation-based Feature Selection (CFS) evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the target, yet uncorrelated to each other".
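
To make the hypothesis concrete: as I understand Hall's thesis, a subset of k features is scored with a merit heuristic, k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-target correlation and r_ff is the mean feature-feature correlation. The sketch below is only my illustration of that formula; the function name `cfs_merit` and the input numbers are hypothetical and do not come from the dataset used in this notebook.

```python
import math

def cfs_merit(k, mean_feature_target_corr, mean_feature_feature_corr):
    """Merit heuristic for a subset of k features (illustrative helper)."""
    return (k * mean_feature_target_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )

# Two hypothetical subsets of 5 features with the same relevance (0.5)
# but different redundancy among the features.
low_redundancy = cfs_merit(5, 0.5, 0.1)
high_redundancy = cfs_merit(5, 0.5, 0.9)

# The subset whose features are less correlated with each other scores higher.
print(round(low_redundancy, 3), round(high_redundancy, 3))
```

With equal relevance, the merit drops as the features become more correlated with each other, which is the intuition behind removing correlated features.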

References:

Hall, Mark A. "Correlation-based Feature Selection for Machine Learning." PhD thesis, University of Waikato, 1999.

Senliol, Baris, et al. "Fast Correlation Based Filter (FCBF) with a different search strategy." Computer and Information Sciences.

I will demonstrate how to select features based on correlation using two procedures:

The first is a brute-force function that flags every feature correlated with a feature that precedes it in the dataset, without telling us which features it is correlated with.

The second finds groups of correlated features, which we can then explore to decide which feature to keep and which ones to discard.

Often, more than two features are correlated with each other, so we can find groups of three, four or more correlated features. By identifying these groups with the second procedure, we can choose which feature to keep from each group and which ones to remove.
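
As a rough illustration of the group idea, here is a minimal sketch on synthetic data. The helper `correlated_groups`, the toy column names and the 0.8 threshold are my own assumptions for the example; this is one possible way to build the groups, not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Synthetic data: three near-copies of one signal, two of another,
# and one independent feature.
rng = np.random.default_rng(0)
base_a = rng.normal(size=500)
base_b = rng.normal(size=500)
toy = pd.DataFrame({
    "a1": base_a,
    "a2": base_a + rng.normal(scale=0.1, size=500),
    "a3": base_a + rng.normal(scale=0.1, size=500),
    "b1": base_b,
    "b2": base_b + rng.normal(scale=0.1, size=500),
    "c1": rng.normal(size=500),
})

def correlated_groups(dataset, threshold):
    """Group features linked by an absolute correlation above threshold."""
    corr = dataset.corr().abs()
    groups, assigned = [], set()
    for col in corr.columns:
        if col in assigned:
            continue
        # Grow the group until no new correlated feature is found
        group, frontier = {col}, [col]
        while frontier:
            current = frontier.pop()
            for name in corr.index[corr[current] > threshold]:
                if name not in group:
                    group.add(name)
                    frontier.append(name)
        assigned |= group
        # Features with no correlated partner are not reported as a group
        if len(group) > 1:
            groups.append(sorted(group))
    return groups

groups = correlated_groups(toy, 0.8)
print(groups)
```

Each returned list is a candidate group from which we would keep one feature, for example the one with the fewest missing values or the strongest relationship with the target.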

Note

The most widely used measure of correlation is Pearson's correlation coefficient, which captures linear relationships only. It is the one I use throughout this notebook.
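
For reference, here is a small sketch of what Pearson's coefficient computes, checked against pandas. The two series are made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values with an almost perfectly linear relationship
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson's r: co-variation of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses Pearson's method by default
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

`DataFrame.corr()` applies the same calculation to every pair of columns, which is what the correlation matrix used later in this notebook contains.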

In [23]:
%pip install feature_engine



In [24]:
# import important libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures

In [25]:
# Mount google drive
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
filename = "/content/drive/MyDrive/Data Science/Feature Selection/dataset_1.csv"

In [27]:
df = pd.read_csv(filename)

In [28]:
# Create features and labels
X = df.drop(labels="target", axis=1)
y = df["target"]

In [29]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((35000, 300), (15000, 300))

## **Handling Constants, Quasi-Constants and Duplicates**

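Drop the constant, quasi-constant and duplicate features from the dataset using Feature-engine transformers inside a scikit-learn Pipeline. With tol=0.998, a feature is treated as quasi-constant when a single value appears in roughly 99.8% of the observations or more.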
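
To show what these two checks look for, here is a pandas-only sketch on a toy DataFrame. It approximates my understanding of the transformers' behaviour and is not their actual implementation; the helper `quasi_constant_columns` and the column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "constant": [1] * 1000,          # one value everywhere
    "quasi": [0] * 999 + [1],        # one value in 99.9% of the rows
    "varied": list(range(1000)),
    "varied_copy": list(range(1000)),  # exact duplicate of "varied"
})

def quasi_constant_columns(dataset, tol):
    """Flag columns whose most frequent value covers at least tol of the rows."""
    flagged = []
    for col in dataset.columns:
        # value_counts sorts by frequency, so the first entry is the top share
        top_share = dataset[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged

flagged = quasi_constant_columns(toy, tol=0.998)

# Columns that repeat an earlier column: transpose, then mark duplicated rows
duplicates = toy.columns[toy.T.duplicated()].tolist()

print(flagged, duplicates)
```

Transposing is fine for a toy example but can be slow and memory-hungry on wide datasets, which is one reason to rely on the dedicated transformers for real data.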

In [30]:
pipe = Pipeline(steps=[
    ("const_drop", DropConstantFeatures(tol=0.998, variables=None, missing_values="raise")),
    ("duplicate_drop", DropDuplicateFeatures(variables=None, missing_values="raise")),
])

In [31]:
pipe.fit(X_train)

In [32]:
# Now drop the constants and duplicates
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [33]:
X_train.shape, X_test.shape

((35000, 152), (15000, 152))

## **Handling the correlation**

In [34]:
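# Brute-force method

def correlation(dataset, threshold):
  """Return the names of features correlated with any earlier feature."""
  correlated_features = set()
  # Pairwise Pearson correlation between all features
  corr_matrix = dataset.corr()

  for i in range(len(corr_matrix.columns)):
    # Compare feature i only with the features before it (lower triangle)
    for j in range(i):
      # Use the absolute value so strong negative correlations count too
      if abs(corr_matrix.iloc[i, j]) > threshold:
        colname = corr_matrix.columns[i]
        correlated_features.add(colname)

  return correlated_features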

In [35]:
correlated_features = correlation(X_train, 0.8)
len(correlated_features)

76

In [36]:
# Drop the correlated features from the dataset
X_train.drop(labels=correlated_features, axis=1, inplace=True)
X_test.drop(labels=correlated_features, axis=1, inplace=True)
X_train.shape, X_test.shape

((35000, 76), (15000, 76))
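
The nested loop in the brute-force function can be slow on wide datasets. A possible vectorised alternative, sketched below on synthetic data, masks the correlation matrix so that only the entries above the diagonal remain and then flags every column with a value above the threshold. The function name `correlated_to_earlier` and the toy columns are my own; I have only checked the equivalence on small examples like this one.

```python
import numpy as np
import pandas as pd

def correlated_to_earlier(dataset, threshold):
    """Flag features correlated with any feature that precedes them."""
    corr = dataset.corr().abs()
    # Keep only entries above the diagonal (row position < column position)
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # Masked entries are NaN, and NaN > threshold evaluates to False
    return {col for col in upper.columns if (upper[col] > threshold).any()}

# Synthetic data: a2 and a3 are noisy copies of a1, b2 is a noisy copy of b1
rng = np.random.default_rng(0)
base_a = rng.normal(size=500)
base_b = rng.normal(size=500)
toy = pd.DataFrame({
    "a1": base_a,
    "a2": base_a + rng.normal(scale=0.1, size=500),
    "a3": base_a + rng.normal(scale=0.1, size=500),
    "b1": base_b,
    "b2": base_b + rng.normal(scale=0.1, size=500),
    "c1": rng.normal(size=500),
})

to_drop = correlated_to_earlier(toy, 0.8)
remaining = toy.drop(columns=sorted(to_drop)).columns.tolist()
print(sorted(to_drop), remaining)
```

Like the loop version, this keeps whichever feature of a correlated set happens to come first in the column order and never looks at the target, so the choice of survivor is arbitrary.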