# Feature importance
### Author: Krzysztof Chmielewski

## What is feature importance and why is it used?
Feature importance measures how much each input feature contributes to a model's predictions.

Feature importance has many applications, such as:
- **Feature selection** - removing features with low contribution, which makes it possible to train smaller and faster models
- **Interpretation of a model** - helps to understand which features contribute the most to the model's predictions
- **Model/Feature engineering** - debugging, for example to see why features that should matter do not rank high enough. Conversely, it helps to spot and fix problems when a feature that should be barely relevant has suspiciously high importance.
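
As an illustration of the feature selection use case, here is a minimal sketch. It assumes a hypothetical dataset (`X_sel`, `y_sel`) in which only the first two of six columns carry signal, and it uses the mean importance as a cutoff, which is an arbitrary choice for this example rather than a recommended rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: six features, only the first two influence the target
X_sel = rng.standard_normal((500, 6))
y_sel = 5.0 * X_sel[:, 0] + 3.0 * X_sel[:, 1] + 0.1 * rng.standard_normal(500)

tree = DecisionTreeRegressor(random_state=0).fit(X_sel, y_sel)
importances_sel = tree.feature_importances_

# Keep only features whose importance is at least the mean importance
# (an arbitrary threshold chosen for this sketch)
keep_mask = importances_sel >= importances_sel.mean()
X_reduced = X_sel[:, keep_mask]

print("kept columns:", np.flatnonzero(keep_mask))
print("reduced shape:", X_reduced.shape)
```

A model retrained on `X_reduced` sees fewer inputs, which is where the size and speed savings would come from.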

In [133]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

In [None]:
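def create_synthetic_data(n: int = 1000, d: int = 5, noise: float = 0.1):
    """Generate a linear regression problem with random integer weights."""
    # Independent standard-normal features
    X = np.random.randn(n, d)

    # Integer weights drawn uniformly from 0..19 (the upper bound is exclusive)
    true_w = np.random.randint(0, 20, d)
    y = X @ true_w + noise * np.random.randn(n)

    return X, y, true_w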

In [135]:
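def permutation_importance(model, X, y, metric, n_repeats=1):
    """Normalized permutation importance for an error metric (lower is better)."""
    # Error of the model on the unshuffled data; metric follows (y_true, y_pred)
    baseline = metric(y, model.predict(X))
    n_features = X.shape[1]
    importances = np.zeros(n_features)

    for i in range(n_features):
        scores = []
        for _ in range(n_repeats):
            # Shuffle one column to break its relationship with the target
            X_perm = X.copy()
            X_perm[:, i] = np.random.permutation(X_perm[:, i])
            score = metric(y, model.predict(X_perm))
            # Importance is the increase in error caused by the shuffle
            scores.append(score - baseline)
        importances[i] = np.mean(scores)

    # Rescale so that the importances sum to 1
    return importances / np.sum(importances)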
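
For comparison, scikit-learn ships its own implementation in `sklearn.inspection.permutation_importance`. The sketch below is self-contained and uses hypothetical demo data (`X_demo`, `y_demo`, `w_demo`); the library function is imported under the alias `sk_permutation_importance` so it does not shadow the function defined above. scikit-learn expects a score where higher is better, hence the `neg_root_mean_squared_error` scorer. The final normalization is added here only to make the numbers comparable with the hand-written version.

```python
import numpy as np
from sklearn.inspection import permutation_importance as sk_permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical demo data with known weights
X_demo = rng.standard_normal((1000, 3))
w_demo = np.array([6.0, 3.0, 1.0])
y_demo = X_demo @ w_demo + 0.1 * rng.standard_normal(1000)

lin = LinearRegression().fit(X_demo, y_demo)

# importances_mean holds the average drop in score after shuffling each column
result = sk_permutation_importance(
    lin,
    X_demo,
    y_demo,
    scoring="neg_root_mean_squared_error",
    n_repeats=10,
    random_state=0,
)

# Normalize to sum to 1, mirroring the hand-written function
sk_importances = result.importances_mean / result.importances_mean.sum()
print(np.round(sk_importances, 2))
```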

In [136]:
X, y, true_w = create_synthetic_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(true_w)

[9 2 4 4 5]


In [137]:
model = LinearRegression()
model.fit(X_train, y_train)
importances = permutation_importance(model, X_test, y_test, root_mean_squared_error, n_repeats=10)
print(np.round(importances, 2))

[0.39 0.08 0.16 0.16 0.21]
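
These values can be sanity-checked against the weights printed earlier. The sketch below rests on two assumptions: the linear model recovers `true_w` almost exactly, so its baseline RMSE is roughly the noise level, and the features are independent with unit variance, so shuffling column `i` adds about `2 * w_i**2` to the mean squared error. The name `true_w_printed` is introduced here only to hold the printed weights.

```python
import numpy as np

true_w_printed = np.array([9, 2, 4, 4, 5])  # weights printed by the notebook
noise = 0.1  # default noise level of create_synthetic_data

# Assumed RMSE after shuffling column i: sqrt(noise^2 + 2 * w_i^2)
expected_rmse_perm = np.sqrt(noise**2 + 2 * true_w_printed**2)

# Increase over the assumed baseline RMSE (about `noise`), then normalize
expected_increase = expected_rmse_perm - noise
expected_importance = expected_increase / expected_increase.sum()
print(np.round(expected_importance, 2))
```

Because the noise is small, the RMSE-based importances end up roughly proportional to `|w_i|`, which is consistent with the measured values above up to sampling variation.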


In [138]:
dtr = DecisionTreeRegressor(random_state=0)
dtr.fit(X_train, y_train)
print(np.round(dtr.feature_importances_,2))

[0.61 0.01 0.1  0.1  0.17]
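
One possible reading of why the tree's ranking is more skewed toward the first feature: `feature_importances_` is based on impurity (variance) reduction, and for independent unit-variance features each one contributes about `w_i**2` to the variance of `y`. The sketch below compares the two normalizations using the printed weights. It is a rough heuristic under those assumptions, not an exact property of scikit-learn trees; `true_w_printed` is a name introduced for this example.

```python
import numpy as np

true_w_printed = np.array([9, 2, 4, 4, 5])  # weights printed by the notebook

# Share of the target variance attributable to each feature (heuristic)
variance_share = true_w_printed**2 / np.sum(true_w_printed**2)

# Share of the absolute weight, which the RMSE-based permutation scores track
abs_weight_share = np.abs(true_w_printed) / np.sum(np.abs(true_w_printed))

print("w^2 share:", np.round(variance_share, 2))
print("|w| share:", np.round(abs_weight_share, 2))
```

Squaring the weights amplifies the gap between strong and weak features, so the two methods can agree on the ranking while disagreeing on the magnitudes.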
