
Does Feature/Column Order of dataset matter while calculating SHAP values? #3601

Open
sambaths opened this issue Apr 3, 2024 · 3 comments

sambaths commented Apr 3, 2024

I'm seeing that the column order of the dataset matters when calculating the Shapley values. Is that expected?

Sample code

import numpy as np
import pandas as pd
import xgboost as xgb
import shap

# Generate synthetic dataset for multi-class classification
np.random.seed(0)
n_samples = 100
n_features = 5
n_classes = 3

# Generate features & targets
features = np.random.normal(0, 1, (n_samples, n_features))
targets = np.random.randint(0, n_classes, n_samples)

# Create DataFrame with features and target 
df_A_first = pd.DataFrame(features, columns=[f'feature_{i}' for i in range(n_features)])
df_A_first['target'] = targets

# Shuffle the feature columns to create a different column order
column_order = list(df_A_first.columns[:-1])
np.random.shuffle(column_order)

df_B_first = df_A_first[column_order + ['target']]

# Train an XGBoost classifier for each column order
models = []
for df in [df_A_first, df_B_first]:
    model = xgb.XGBClassifier()
    model.fit(df.drop('target', axis=1), df['target'])
    models.append(model)

# Create SHAP Explainer objects for each model
explainers = [shap.TreeExplainer(model) for model in models]

# Compute SHAP values for the first five rows of each dataset
shap_values_list = []
for df, exp in zip([df_A_first, df_B_first], explainers):
    shap_values_list.append(exp.shap_values(df.iloc[:5, :-1]))

# Compare SHAP values for each model
for i, shap_values in enumerate(shap_values_list):
    print(f"SHAP values with {'feature order A' if i == 0 else 'feature order B'}:")
    print(shap_values)

In the above code, the only difference between the two datasets is the column order:

df_A_first dataset column order - ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'target']
df_B_first dataset column order - ['feature_3', 'feature_2', 'feature_4', 'feature_0', 'feature_1', 'target']

Partial results (Shapley values for the first row of each dataset; rows are features, columns are classes):

# shap_values_list[0][0] - Shapley values for first row of df_A_first
array([[-0.76066595,  0.04272805,  0.95740676],
       [-0.7689634 ,  0.37519506, -0.08894491],
       [ 0.49594778, -1.2271469 ,  0.34900278],
       [-0.06256621, -0.17336567,  0.39783403],
       [-1.2227407 , -0.05358636,  1.1311452 ]], dtype=float32)
# shap_values_list[1][0] - Shapley values for first row of df_B_first
array([[-0.06908824, -0.07854801,  0.4850777 ],
       [ 0.44342878, -1.107977  ,  0.2700421 ],
       [-0.85943675, -0.00879675,  1.3649093 ],
       [-0.84970415,  0.03958162,  0.78879184],
       [-0.52355677,  0.41594473,  0.0060116 ]], dtype=float32)
CloseChoice (Collaborator) commented
The reason for this is that your XGBoost models are different, which comes from the small number of examples and the way XGBoost optimizes. If you could somehow keep the underlying trees fixed and just switch the features, then I would expect the same results. I tried a bit, but this is quite difficult to achieve even with sklearn models.

sambaths (Author) commented Apr 6, 2024

@CloseChoice Are you suggesting the reason for this is the small number of examples being used while building the model?

In my use case I have more data with many more columns (all categorical), but I'm still getting different SHAP results. Despite the different feature ordering, I get identical predictions and metrics from both XGBoost models; it is just the SHAP values that come out differently.

CloseChoice (Collaborator) commented Apr 8, 2024

The reason for the huge differences is the sample size. But the reason that differences exist at all is that the tree models themselves are different (not just the same model with a changed feature order). As I said, to check this thoroughly one would need to go deep into one of the decision-tree models and make sure the weights stay the same while only the feature order changes.

But judging from the shap implementation, I cannot see how an adjusted feature order could change SHAP values.
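This permutation invariance can be checked directly on exact Shapley values: for a fixed function, reordering the features only reorders the attributions. A stdlib-only sketch; the function, input point, and baseline below are made up for illustration:

```python
import itertools
import math

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x against a baseline, by enumerating
    all coalitions (exponential cost; fine for a handful of features)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                def value(coalition):
                    # Features in the coalition take their real value,
                    # the rest are set to the baseline
                    z = [x[j] if j in coalition else baseline[j] for j in range(n)]
                    return f(z)
                phi[i] += weight * (value(set(S) | {i}) - value(set(S)))
    return phi

# A fixed function of three features, including an interaction term
f = lambda z: 2 * z[0] + z[1] * z[2]
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)  # attributions in original order

# Present the same function with its inputs permuted: position k of the
# permuted vector holds original feature perm[k]
perm = [2, 0, 1]
f_perm = lambda z: f([z[perm.index(j)] for j in range(3)])
phi_perm = shapley_values(f_perm,
                          [x[k] for k in perm],
                          [baseline[k] for k in perm])

# Undo the permutation: the attributions match the original ones
phi_back = [phi_perm[perm.index(j)] for j in range(3)]
print(phi)       # → [2.0, 3.0, 3.0] (up to float rounding)
print(phi_back)  # → [2.0, 3.0, 3.0] (up to float rounding)
```

So if the model itself were truly identical up to feature order, the SHAP values would be too; the differences in the thread come from the fitted trees being different.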
