# Task 3: Feature importance

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
import pickle
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV, LinearRegression, Ridge
from scipy.stats import pearsonr
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

## Load data

In [None]:
with open("/content/drive/MyDrive/bckrlab-exercises/task3_feature-importance_data.pickle", "rb") as f:
    X, y = pickle.load(f)
    # file = pickle.load(f)

In [None]:
f

<_io.BufferedReader name='/content/drive/MyDrive/bckrlab-exercises/task3_feature-importance_data.pickle'>

## Fit model

In [None]:
model = make_pipeline(StandardScaler(), RidgeCV())
model.fit(X, y)
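
Because the pipeline standardizes every feature before the ridge fit, the coefficients stored in `model["ridgecv"].coef_` are expressed per standard deviation of each feature, and `alpha_` records the penalty strength that cross-validation selected. The sketch below illustrates how both can be inspected. It runs on synthetic stand-in data from `make_regression` (an assumption, since the real `X, y` come from the pickle file), and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the data from the pickle file.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, n_informative=2, noise=5.0, random_state=0
)

# Same pipeline layout as in the notebook: scale first, then ridge with built-in CV.
model_demo = make_pipeline(StandardScaler(), RidgeCV())
model_demo.fit(X_demo, y_demo)

ridge_demo = model_demo["ridgecv"]
alpha_demo = ridge_demo.alpha_   # penalty strength chosen from the default alpha grid
coefs_demo = ridge_demo.coef_    # one coefficient per standardized feature

print("selected alpha:", alpha_demo)
print("coefficients (per standard deviation):", np.round(coefs_demo, 2))
```

A coefficient read this way answers "how much does the prediction change per standard deviation of the feature", which is worth keeping in mind when comparing binary and continuous features.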

## Analyse correlations and model coefficients

In [None]:
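# Pearson correlation (r) and p-value of each feature with the target,
# computed once per feature instead of twice
pearson_results = [pearsonr(X[:, i], y) for i in range(X.shape[1])]
correlations_r = [r for r, _ in pearson_results]
correlations_p = [p for _, p in pearson_results]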

In [None]:
correlations_p

[0.273966000990635,
 0.7581588469525398,
 2.160929482002356e-63,
 0.11493662368098062,
 1.0197250230967917e-81,
 4.339895126001623e-90,
 0.5063249445046959,
 0.7588043398937258,
 0.4429447125215523,
 0.6775740385810042]

In [None]:
correlations_r

[-0.034627457350699706,
 0.009748596817114225,
 0.4966096277457641,
 0.04988109058023978,
 0.5545795931016668,
 0.57770353097386,
 -0.021039567598160145,
 -0.009721743778694045,
 0.024288745633746216,
 0.013163881192390406]

In [None]:
df = pd.DataFrame({
        "coef": model["ridgecv"].coef_,
        "correlation r": correlations_r,
        "correlation p": correlations_p},
    index=[f"feature {i}" for i in range(X.shape[1])])
df

,coef,correlation r,correlation p
feature 0,-1.529528,-0.034627,0.273966
feature 1,0.516905,0.009749,0.7581588
feature 2,59.958697,0.49661,2.160929e-63
feature 3,0.05312,0.049881,0.1149366
feature 4,68.390912,0.55458,1.019725e-81
feature 5,72.159892,0.577704,4.339895e-90
feature 6,-2.134884,-0.02104,0.5063249
feature 7,39.197821,-0.009722,0.7588043
feature 8,40.087218,0.024289,0.4429447
feature 9,1.91571,0.013164,0.677574


In [None]:
# feature 7 and 8 are binary
np.unique(X[:,7]), np.unique(X[:,8])

(array([0., 1.]), array([0., 1.]))

In [None]:
# feature 7 and 8 are mutually exclusive
((X[:,7] == 1) & (X[:,8] == 1)).sum()

0

In [None]:
# feature 7 and 8 nearly cover the whole dataset!
((X[:,7] == 1) | (X[:,8] == 1)).sum() / X.shape[0]

0.99
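
The three checks above can be complemented by measuring how strongly the two indicators are tied to each other. The sketch below uses two hypothetical stand-in columns, `ind_a` and `ind_b`, built to match the properties observed for features 7 and 8 (binary, mutually exclusive, 99% joint coverage). The group sizes are an assumption, not values read from the real data.

```python
import numpy as np

n = 1000

# Hypothetical stand-ins for features 7 and 8 (assumed group sizes):
# 495 rows in group A, 495 rows in group B, 10 rows in neither group.
ind_a = np.zeros(n)
ind_a[:495] = 1.0
ind_b = np.zeros(n)
ind_b[495:990] = 1.0

overlap = int(((ind_a == 1) & (ind_b == 1)).sum())    # rows where both are 1
coverage = float(((ind_a == 1) | (ind_b == 1)).mean())  # 0.99 by construction
corr_ab = float(np.corrcoef(ind_a, ind_b)[0, 1])        # roughly -0.98

print("overlap:", overlap)
print("coverage:", coverage)
print("correlation between the two indicators:", round(corr_ab, 3))
```

Two indicators with this structure are almost perfectly negatively correlated, and their sum is nearly constant. On the real data the same quantity would be `np.corrcoef(X[:, 7], X[:, 8])[0, 1]`.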

## Tasks

Features 7 and 8 seem to be important features for the model (with coefficients > 30!). However, on closer inspection, they are both binary, mutually exclusive, and together cover nearly the whole dataset. They are also barely correlated with the outcome on their own. I would not expect both of them to have such high importance for the model, and on top of that both with a positive sign! What is going on?

Two variables are perfectly collinear if their correlation coefficient is ±1.0. The Ridge regression penalty is proportional to the sum of the squared coefficients, so it shrinks large coefficients and tends to spread weight across correlated features. Against this background, the two binary features (7 and 8) show high importance in the model despite their characteristics and the dataset's properties. Three possible explanations:
1.	Randomness in the dataset
The high correlation among features may just be a coincidence. Because the data are randomly generated, a few features may show a stronger correlation in this sample than they really have. For example, the age of a property does not necessarily influence its price, which depends more on location, the transport system, or inflation in the economy. Yet prices may appear to rise with age, which produces a high correlation even though age may have nothing to do with it.
2.	Collinearity among the features
Features 7 and 8 might be highly correlated with other features in the dataset, which would inflate their apparent influence. Ridge regression includes a regularization term that can distribute weight among correlated features instead of assigning it to a single one. If these binary features are related to the target only indirectly, not causally, Ridge regression might still assign high coefficients to them.
3.	Noise in the output
RidgeCV shrinks coefficients but does not reduce them to exactly zero, so it may partly fit noise in the target, which results in overfitting.
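
The collinearity explanation can be probed with a small simulation. The sketch below is one possible scenario, not a finding about the real data. It assumes two mutually exclusive indicators that together cover 99% of the rows, and a target that is much lower for the remaining 1%. The data-generating process, the effect sizes, and all `*_sim` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Hypothetical indicators: 495 rows in group A, 495 in group B, 10 in neither.
ind_a = np.zeros(n)
ind_a[:495] = 1.0
ind_b = np.zeros(n)
ind_b[495:990] = 1.0
x_cont = rng.normal(size=n)  # one continuous, informative feature

# Assumed data-generating process: rows in either group share the same
# large offset, so only the 1% "neither" rows have a lower baseline.
y_sim = 50.0 * x_cont + 400.0 * (ind_a + ind_b) + rng.normal(scale=10.0, size=n)

X_sim = np.column_stack([x_cont, ind_a, ind_b])
model_sim = make_pipeline(StandardScaler(), RidgeCV()).fit(X_sim, y_sim)

coefs_sim = model_sim["ridgecv"].coef_
corrs_sim = [np.corrcoef(X_sim[:, j], y_sim)[0, 1] for j in range(X_sim.shape[1])]

for name, coef, corr in zip(["x_cont", "ind_a", "ind_b"], coefs_sim, corrs_sim):
    print(f"{name}: coef={coef:.1f}, correlation with y={corr:.3f}")
```

Under these assumptions both indicators receive large positive coefficients while each one is only weakly correlated with the target on its own, which matches the pattern seen for features 7 and 8. Whether the real data behave this way could be checked by comparing the mean of `y` in rows where both features are 0 against the rest.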

