Reference : https://github.com/atulpatelDS/Youtube/blob/main/Feature_Engineering/Feature%20Selection%20using%20Correlation%20and%20Ranking%20Filter%20methods%20-Check%20Multi-collinearity-%20Tutorial%205.ipynb

**Correlation and Ranking Filter Methods for Feature Selection and Multicollinearity Checks**

1. Low correlation means there is no linear relationship; it does not mean the feature carries no information about the target. In real-world problems, therefore, we do not drop a feature just because it is uncorrelated with the target.
2. When two non-target features are highly correlated with each other, it is usually a good idea to remove one of them, because they are likely redundant.
3. For ordinal or binary features (for example, columns such as 'season', 'holiday', 'workingday', and 'weather'), correlation with the target does not tell you much. When such a feature is not correlated with the target, the most direct way to test whether it matters is to compare model performance with and without it. Even then, different features can have different importance for different algorithms.
4. If a feature is strongly correlated with the label, a linear function (or model) should be able to predict the label well from it. If it is not correlated, that says nothing about whether a non-linear model could perform well using this feature.
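
Points 1 and 4 can be illustrated with a small synthetic example. The sketch below assumes a made-up feature `x` and a target `y = x**2 + noise` (neither comes from the dataset used in this notebook, and `r2_of_linear_fit` is a hypothetical helper). The Pearson correlation is close to zero and a straight-line fit on `x` explains almost nothing, yet the same linear model fits well once it sees `x**2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends on x, but not linearly
x = rng.uniform(-3, 3, size=2000)
y = x**2 + rng.normal(scale=0.5, size=x.size)

def r2_of_linear_fit(feature, target):
    """R^2 of an ordinary least-squares fit: target ~ a * feature + b."""
    A = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    return 1.0 - residual.var() / target.var()

pearson_r = np.corrcoef(x, y)[0, 1]
r2_raw = r2_of_linear_fit(x, y)         # linear model on the raw feature
r2_squared = r2_of_linear_fit(x**2, y)  # same model on the transformed feature

print(f"Pearson r(x, y)     : {pearson_r:.3f}")
print(f"R^2 using x         : {r2_raw:.3f}")
print(f"R^2 using x squared : {r2_squared:.3f}")
```

Dropping `x` because of its low correlation would throw away nearly all of the predictive signal here, which is why a with/without comparison is a safer test than the correlation value alone.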

In [18]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

def MAE(y_true,y_pred):
    return round(mean_absolute_error(y_true,y_pred),2)

In [25]:
num = 400
target = 'y2'
df_fulldata = pd.read_csv("./src/generated_data.csv").drop(columns = ['trend_data','season_data','noise_data','actual','y','y1','actual_y1','actual_y2'])
df_train = df_fulldata.loc[0:df_fulldata.shape[0]-num-1]
df_test = df_fulldata.loc[df_fulldata.shape[0]-num:]

X_train = df_train.drop(columns=target)
y_train = df_train[target]
X_test = df_test.drop(columns=target)
y_test = df_test[target]

X_train.shape,X_test.shape

((1059, 84), (400, 84))

In [3]:
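scaler = StandardScaler()
pca = PCA(n_components=0.95)
# Separate scalers for features and target: refitting a single scaler on y
# would discard the parameters learned from X_train.
mm_X = MinMaxScaler()
mm = MinMaxScaler()  # target scaler; keeps the original name `mm`

X_mm_train = mm_X.fit_transform(X_train)
X_mm_test = mm_X.transform(X_test)

y_mm_train = mm.fit_transform(y_train.to_numpy().reshape(-1,1))
y_mm_test = mm.transform(y_test.to_numpy().reshape(-1,1))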

X_pca_train = pca.fit_transform(scaler.fit_transform(X_train))
X_pca_test = pca.transform(scaler.transform(X_test))

df_pca_train = pd.DataFrame(X_pca_train)
df_pca_test = pd.DataFrame(X_pca_test)

In [4]:
base_model = LinearRegression().fit(X_train, y_train)
# Returning the R^2 for the model
base_model_r2 = base_model.score(X_train, y_train)
y_pred = base_model.predict(X_train)
print(f'R^2: {base_model_r2:4f}')
print(f"MAE train : {MAE(y_train, y_pred)}")

base_model_r2 = base_model.score(X_test, y_test)
y_pred = base_model.predict(X_test)
print(f'R^2: {base_model_r2:4f}')
print(f"MAE test : {MAE(y_test, y_pred)}")

R^2: 0.877497
MAE train : 8.41
R^2: 0.479804
MAE test : 8.77


# Correlation
- Pearson’s correlation coefficient (measures linear relationships)
- Spearman’s rank correlation coefficient (measures monotonic relationships, linear or non-linear)
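
The difference between the two coefficients shows up on data that is monotonic but not linear. The following is a minimal sketch on made-up data (`y = exp(x)`); the helper names `pearson` and `spearman` are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of the ranks, which assumes there are no ties.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

x = np.linspace(0, 10, 200)
y = np.exp(x)  # strictly increasing, but far from a straight line

print(f"Pearson : {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman's coefficient is essentially 1, while Pearson's coefficient is noticeably lower because the relationship is not linear.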

In [27]:
# Compute pairwise feature correlations and drop one feature from each highly correlated pair
def correlation(df, threshold=0.8, method='spearman'):
    dataset = df.copy()
    col_corr = list() # Names of the dropped columns
    corr_matrix = dataset.corr(method=method)
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            # Only signed correlations >= threshold are flagged, so strong negative
            # correlations pass through (compare abs() to catch those as well).
            # Columns that were already dropped are not used to drop others.
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # name of the later column in the pair
                col_corr.append(colname)
                if colname in dataset.columns:
                    del dataset[colname] # drop the column from the working copy
    print(f"Final X shape : {dataset.shape}")
    return dataset.columns.values

In [34]:
feature_ls = correlation(X_train,0.8)

Final X shape : (1059, 25)
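
The greedy logic of this filter can be checked on a toy example. The sketch below is a variant, not the function used above: it is written with NumPy only, uses Pearson correlation via `np.corrcoef`, and compares absolute values so that strongly negative correlations are also caught. The function name `drop_correlated` and the column names are hypothetical.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.8):
    """Greedy filter: keep a column only if |r| with every kept column is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute Pearson correlations
    kept = []
    for i in range(X.shape[1]):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([
    a,                                     # "a"
    a + rng.normal(scale=0.1, size=500),   # "a_copy": near-duplicate of a
    -a + rng.normal(scale=0.1, size=500),  # "a_neg": strongly negatively correlated with a
    b,                                     # "b": independent of a
])
names = ["a", "a_copy", "a_neg", "b"]

selected = drop_correlated(X, names)
print(selected)
```

As in the function above, the first column of each correlated group is kept and later ones are dropped, so the result depends on column order.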


In [35]:
corr_model = LinearRegression().fit(X_train[feature_ls], y_train)
# Returning the R^2 for the model
corr_model_r2 = corr_model.score(X_train[feature_ls], y_train)
y_pred = corr_model.predict(X_train[feature_ls])
print(f'R^2: {corr_model_r2:4f}')
print(f"MAE train : {MAE(y_train, y_pred)}")

corr_model_r2 = corr_model.score(X_test[feature_ls], y_test)
y_pred = corr_model.predict(X_test[feature_ls])
print(f'R^2: {corr_model_r2:4f}')
print(f"MAE test : {MAE(y_test, y_pred)}")

R^2: 0.873993
MAE train : 8.6
R^2: 0.478898
MAE test : 8.81


# Corr by Demand project