## Decima2 feature importances are more accurate than SHAP

This notebook generates a synthetic dataset with a known causal structure, trains a Random Forest Regressor on it, and then computes both Decima2 and SHAP feature importance explanations for the model. We show that Decima2 explanations are not only faster to compute but also recover the true causal structure in the data more faithfully than SHAP.

First, we import all relevant libraries.

In [15]:
import shap

from decima2 import model_feature_importance
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from decima2.utils.utils import feature_names
import time

## Function Design and Creation of Synthetic Data 
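We now design our dataset. We randomly generate three independent variables $x_{1}$, $x_{2}$ and $x_{3}$, all from the same distribution (normal with mean 0 and standard deviation 2). We then create $x_{4}$ and $x_{5}$ as directly causally dependent on $x_{2}$ and $x_{1}$ respectively: in the code below, $x_{4}$ is an exact copy of $x_{2}$ and $x_{5}$ is an exact copy of $x_{1}$.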

We then define our target function $f = 2x_{1} + 3x_{2} + x_{3}$ and train a Random Forest Regressor to learn it.

### What are the Actual Importances?

From this design we argue that $x_{1}$ and $x_{5}$ are equally important, as these features are completely interchangeable (and likewise $x_{2}$ and $x_{4}$). From a business perspective, if $x_{1}$ represents weekly sales and $x_{5}$ represents weekly transactions, the model could use either feature to the same effect when predicting $f$.

<it> Note that some people may disagree with this assumption and argue that $x_{1}$ should be assigned all the importance and $x_{5}$ none, as $x_{1}$ is the causal ancestor of $x_{5}$ and $x_{5}$ isn't even used by the function to compute $f$! This type of functionality will be offered in our next release of model explanations so you can tailor the explanations to your use case. </it>

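This gives us the following ground truth feature importance ordering, from most to least important, with $x_{2}$ and $x_{4}$ tied and $x_{1}$ and $x_{5}$ tied: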
<ul> 
   <li> $x_{2}$ </li> 
   <li> $x_{4}$ </li> 
    <li> $x_{1}$ </li>
    <li> $x_{5}$ </li>
   <li> $x_{3}$ </li>
</ul>
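
One way to sanity-check this ordering is to compare how much of the variance of $f$ each term contributes. The sketch below assumes that importance is proportional to the variance contributed by each term. This is an illustrative assumption, not necessarily how Decima2 or SHAP define importance. It uses the coefficients of $f$ and the standard deviation of 2 from the data generation code.

```python
# Coefficients of f = 2*x_1 + 3*x_2 + x_3
coefficients = {"x_1": 2.0, "x_2": 3.0, "x_3": 1.0}
feature_std = 2.0  # each independent feature is drawn with standard deviation 2

# For independent features, Var(c * x) = c^2 * Var(x)
variance_contribution = {
    name: (c * feature_std) ** 2 for name, c in coefficients.items()
}
total_variance = sum(variance_contribution.values())

# Rank the terms by the variance they contribute (assumed proxy for importance)
ranking = sorted(variance_contribution, key=variance_contribution.get, reverse=True)
for name in ranking:
    share = variance_contribution[name] / total_variance
    print(f"{name}: variance {variance_contribution[name]:.0f}, share {share:.2f}")
```

Under this assumption $x_{2}$ (and its copy $x_{4}$) comes first, followed by $x_{1}$ (and its copy $x_{5}$), with $x_{3}$ last.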


In [16]:

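def generate_data(n):
    # Three independent features, each drawn from a normal distribution (mean 0, std 2)
    x_1 = np.random.normal(0, 2, size=n)
    x_2 = np.random.normal(0, 2, size=n)
    x_3 = np.random.normal(0, 2, size=n)
    # Dependent features: exact copies of their causal parents
    x_4 = x_2
    x_5 = x_1
    # The target depends only on x_1, x_2 and x_3
    y = (2 * x_1) + (3 * x_2) + x_3
    X = pd.DataFrame({'X_1': x_1, 'X_2': x_2, 'X_3': x_3, 'X_4': x_4, 'X_5': x_5})
    return X, y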
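
As a quick check of the structure described above, the following sketch rebuilds the same five features with NumPy only and inspects their correlations. The fixed seed, the sample size of 1000 and the variable names `rng`, `features` and `corr` are assumptions made for illustration; the notebook itself does not seed the generator.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
n = 1000

# Same construction as generate_data: three independent features, two copies
x_1 = rng.normal(0, 2, size=n)
x_2 = rng.normal(0, 2, size=n)
x_3 = rng.normal(0, 2, size=n)
x_4 = x_2.copy()
x_5 = x_1.copy()

features = np.column_stack([x_1, x_2, x_3, x_4, x_5])
corr = np.corrcoef(features, rowvar=False)

# Copies are perfectly correlated with their parents; independent features are not
print("corr(x_2, x_4):", round(corr[1, 3], 3))
print("corr(x_1, x_5):", round(corr[0, 4], 3))
print("corr(x_1, x_2):", round(corr[0, 1], 3))
```

The perfect correlation within each pair is what makes $x_{1}$ and $x_{5}$ (and $x_{2}$ and $x_{4}$) interchangeable from the model's point of view.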

In [17]:
X, y = generate_data(10000)

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = RandomForestRegressor(max_depth=100, random_state=42)
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.9961376369273085

## Decima2 Feature Importance Explanations

We now generate Decima2 explanations for our test data and trained Random Forest Regressor.

In [19]:
st = time.time()
explanation_app = model_feature_importance(X_test,y_test,model,output='text')
et = time.time()
print(et-st)
explanation_app


2.655003070831299


| | Feature | Importance |
|---|---|---|
| 1 | X_2 | 1.84186 |
| 3 | X_4 | 1.84186 |
| 2 | X_3 | 1.05662 |
| 0 | X_1 | 0.83642 |
| 4 | X_5 | 0.83642 |


## SHAP explanations

We now generate SHAP explanations for our test data and trained Random Forest Regressor.

In [20]:
st = time.time()
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test,check_additivity=False)
et = time.time()
print(et-st)



59.592472076416016
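
The two wall-clock times printed in this notebook can be compared directly. The short sketch below copies the two recorded values (in seconds) and computes their ratio; the variable names are illustrative, and timings will vary between machines and runs.

```python
# Wall-clock times recorded in this notebook, in seconds
decima2_seconds = 2.655003070831299
shap_seconds = 59.592472076416016

# How many times longer the SHAP run took than the Decima2 run
speedup = shap_seconds / decima2_seconds
print(f"SHAP took {speedup:.1f}x as long as Decima2")
```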


In [21]:
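# Average the signed SHAP values over the test set to get one score per feature.
# Note: global SHAP importance is more commonly summarised as the mean absolute SHAP value.
attributions = shap_values.values.mean(axis=0)
attributions = attributions.reshape(X_test.shape[1])
feature_names(X_test,attributions)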


| | Feature | Importance |
|---|---|---|
| 0 | X_1 | 0.38393 |
| 4 | X_5 | 0.38209 |
| 2 | X_3 | 0.14086 |
| 3 | X_4 | 0.11148 |
| 1 | X_2 | 0.0929 |
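
The cell above aggregates SHAP values with a signed mean. Because positive and negative attributions can cancel out, a common alternative is the mean absolute SHAP value per feature. The sketch below shows the difference on a small hand-made array (`toy_shap_values` is hypothetical data, not output from this notebook), with rows as samples and columns as features.

```python
import numpy as np

# Hypothetical SHAP values: 4 samples (rows) x 2 features (columns)
toy_shap_values = np.array([
    [ 2.0, -0.5],
    [-2.0,  0.5],
    [ 1.0, -0.5],
    [-1.0,  0.5],
])

# Signed mean: positive and negative attributions cancel
signed_mean = toy_shap_values.mean(axis=0)

# Mean absolute value: measures the typical size of each feature's attribution
mean_abs = np.abs(toy_shap_values).mean(axis=0)

print("signed mean:  ", signed_mean)
print("mean absolute:", mean_abs)
```

Applied to the objects in this notebook, the equivalent aggregation would be `np.abs(shap_values.values).mean(axis=0)`. It may produce a different ranking from the one shown above.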


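We can see that our SHAP explanations take roughly 22x longer to compute than our Decima2 explanations (59.6 s versus 2.7 s)! We can also see that Decima2 correctly identifies $x_{2}$ and $x_{4}$ as the joint most important features and assigns identical importance within each interchangeable pair, although it places $x_{3}$ above $x_{1}$ and $x_{5}$, which differs from our ground truth ordering. The SHAP explanations, as computed here, rank $x_{2}$ and $x_{4}$ last rather than first, so they fail to identify the most important features even if we ignore our causal assumption from earlier.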
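
To put a number on how closely each ranking follows the ground truth, one option is to count the feature pairs from different ground truth tiers that each method orders correctly. The sketch below does this with the scores copied from the two tables above; `pairwise_agreement` is a hypothetical helper written for this comparison and is not part of Decima2 or SHAP.

```python
from itertools import combinations

# Ground truth tiers, from most to least important (tied features share a tier)
ground_truth_tiers = [{"X_2", "X_4"}, {"X_1", "X_5"}, {"X_3"}]

# Scores copied from the Decima2 and SHAP tables in this notebook
decima2_scores = {"X_2": 1.84186, "X_4": 1.84186, "X_3": 1.05662,
                  "X_1": 0.83642, "X_5": 0.83642}
shap_scores = {"X_1": 0.38393, "X_5": 0.38209, "X_3": 0.14086,
               "X_4": 0.11148, "X_2": 0.0929}


def pairwise_agreement(scores, tiers):
    """Return (correct, total) over feature pairs drawn from different tiers."""
    correct = 0
    total = 0
    # combinations() keeps the tier order, so the first tier is always the higher one
    for higher_tier, lower_tier in combinations(tiers, 2):
        for high in higher_tier:
            for low in lower_tier:
                total += 1
                if scores[high] > scores[low]:
                    correct += 1
    return correct, total


print("Decima2:", pairwise_agreement(decima2_scores, ground_truth_tiers))
print("SHAP:   ", pairwise_agreement(shap_scores, ground_truth_tiers))
```

With these scores, Decima2 orders 6 of the 8 cross-tier pairs correctly and SHAP orders 2 of 8.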