
Does Feature/Column Order of dataset matter while calculating SHAP values? #3601

Open
sambaths opened this issue Apr 3, 2024 · 3 comments

sambaths commented Apr 3, 2024

I'm seeing that the column order of the dataset matters when calculating the Shapley values. Is that expected?

Sample code

import numpy as np
import pandas as pd
import xgboost as xgb
import shap

# Generate synthetic dataset for multi-class classification
np.random.seed(0)
n_samples = 100
n_features = 5
n_classes = 3

# Generate features & targets
features = np.random.normal(0, 1, (n_samples, n_features))
targets = np.random.randint(0, n_classes, n_samples)

# Create DataFrame with features and target 
df_A_first = pd.DataFrame(features, columns=[f'feature_{i}' for i in range(n_features)])
df_A_first['target'] = targets

# Shuffle the feature columns to create a different column order
column_order = list(df_A_first.columns[:-1])
np.random.shuffle(column_order)

df_B_first = df_A_first[column_order + ['target']]

# Train an XGBoost classifier for each column order
models = []
for df in [df_A_first, df_B_first]:
    model = xgb.XGBClassifier()
    model.fit(df.drop('target', axis=1), df['target'])
    models.append(model)

# Create SHAP Explainer objects for each model
explainers = [shap.TreeExplainer(model) for model in models]

# Compute SHAP values for the first five rows of each dataset
shap_values_list = []
for df, exp in zip([df_A_first, df_B_first], explainers):
    shap_values_list.append(exp.shap_values(df.iloc[:5, :-1]))

# Compare SHAP values for each model
for i, shap_values in enumerate(shap_values_list):
    print(f"SHAP values with {'feature order A' if i == 0 else 'feature order B'}:")
    print(shap_values)

In the above code, the only difference between the two datasets is the column order:

df_A_first dataset column order - ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'target']
df_B_first dataset column order - ['feature_3', 'feature_2', 'feature_4', 'feature_0', 'feature_1', 'target']

Partial results (Shapley values for the first row of each dataset; rows are features, columns are classes):

# shap_values_list[0][0] - Shapley values for first row of df_A_first
array([[-0.76066595,  0.04272805,  0.95740676],
       [-0.7689634 ,  0.37519506, -0.08894491],
       [ 0.49594778, -1.2271469 ,  0.34900278],
       [-0.06256621, -0.17336567,  0.39783403],
       [-1.2227407 , -0.05358636,  1.1311452 ]], dtype=float32)
# shap_values_list[1][0] - Shapley values for first row of df_B_first
array([[-0.06908824, -0.07854801,  0.4850777 ],
       [ 0.44342878, -1.107977  ,  0.2700421 ],
       [-0.85943675, -0.00879675,  1.3649093 ],
       [-0.84970415,  0.03958162,  0.78879184],
       [-0.52355677,  0.41594473,  0.0060116 ]], dtype=float32)
CloseChoice (Collaborator) commented
The reason for this is that your XGBoost models are different, which comes from the small number of examples and the way XGBoost optimizes. If you could somehow keep the underlying trees fixed and just switch the features, then I would expect the same results. I tried a bit, but this is quite difficult to achieve even with sklearn models.

sambaths (Author) commented Apr 6, 2024

@CloseChoice Are you suggesting the reason for this is the small number of examples being used while building the model?

In my use case I have more data with many more columns (all categorical), but I'm still getting different SHAP results. Despite the different feature ordering, I get identical predictions and metrics from both XGBoost models; it is just the SHAP values that come out differently.

CloseChoice (Collaborator) commented Apr 8, 2024

The reason for the huge differences is the sample size. But the reason that differences exist at all is that the tree models themselves are different (not just the same model with a changed feature order). As I said, to check this thoroughly one would need to go deep into one of the decision-tree models and make sure the weights stay the same while only the feature order changes.

But judging from the shap implementation, I cannot see how an adjusted feature order could change SHAP values.
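This permutation invariance can be checked directly on exact Shapley values: for a fixed function, reordering the features only reorders the attributions. A stdlib-only sketch; the function, input point, and baseline below are made up for illustration:

```python
import itertools
import math

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x against a baseline, by enumerating
    all coalitions (exponential cost; fine for a handful of features)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                def value(coalition):
                    # Features in the coalition take their real value,
                    # the rest are set to the baseline
                    z = [x[j] if j in coalition else baseline[j] for j in range(n)]
                    return f(z)
                phi[i] += weight * (value(set(S) | {i}) - value(set(S)))
    return phi

# A fixed function of three features, including an interaction term
f = lambda z: 2 * z[0] + z[1] * z[2]
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)  # attributions in original order

# Present the same function with its inputs permuted: position k of the
# permuted vector holds original feature perm[k]
perm = [2, 0, 1]
f_perm = lambda z: f([z[perm.index(j)] for j in range(3)])
phi_perm = shapley_values(f_perm,
                          [x[k] for k in perm],
                          [baseline[k] for k in perm])

# Undo the permutation: the attributions match the original ones
phi_back = [phi_perm[perm.index(j)] for j in range(3)]
print(phi)       # → [2.0, 3.0, 3.0] (up to float rounding)
print(phi_back)  # → [2.0, 3.0, 3.0] (up to float rounding)
```

So if the model itself were truly identical up to feature order, the SHAP values would be too; the differences in the thread come from the fitted trees being different.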
