
Shap on pyspark doesn't work with a loaded model #2480

Open · BDon-Tan opened this issue Apr 6, 2022 · 1 comment · May be fixed by #2700 or #3384

BDon-Tan commented Apr 6, 2022

Hi all,
I followed the code in test_trees.py, but I get an error with a loaded PySpark model, while the freshly trained PySpark model works fine. The error is:

AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

It seems fully_defined_weighting becomes False when the saved model is loaded, whereas it is True when the model is freshly trained. Something appears to be wrong in the SingleTree class, but I couldn't fix it myself (I suspect node_sample_weight is the key).

In [84]: explainer_train.model.fully_defined_weighting
Out[84]: True

In [85]: explainer_model.model.fully_defined_weighting
Out[85]: False
                # ensure that the passed background dataset lands in every leaf
                if np.min(self.trees[i].node_sample_weight) <= 0:
                    self.fully_defined_weighting = False
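
Below is a quick diagnostic sketch (the names come from the repro code further down; the loop over explainer_model.model.trees mirrors the check quoted above) to see which parsed trees contain non-positive node_sample_weight entries:

import numpy as np

# Hypothetical diagnostic: report every parsed tree that trips the check above.
# `explainer_model` is the TreeExplainer built from the loaded model (see repro code below).
for i, tree in enumerate(explainer_model.model.trees):
    w = np.asarray(tree.node_sample_weight)
    if np.min(w) <= 0:
        print(f"tree {i}: min node_sample_weight = {w.min()} "
              f"({np.sum(w <= 0)} of {w.size} nodes not covered)")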

The code I used is below:

import numpy as np
import pandas as pd
import shap
import sklearn.datasets
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import GBTClassificationModel, GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a two-class iris DataFrame (first 100 rows, so only classes 0 and 1).
iris_sk = sklearn.datasets.load_iris()
iris = pd.DataFrame(data=np.c_[iris_sk['data'], iris_sk['target']],
                    columns=iris_sk['feature_names'] + ['target'])[:100]
col = ["sepal_length", "sepal_width", "petal_length", "petal_width", "type"]
iris = spark.createDataFrame(iris, col)

# Previously saved pipeline and GBT model, loaded from disk.
pipeline = PipelineModel.load("/user/tanbingdong/iris_pipeline")
model = GBTClassificationModel.load("/user/tanbingdong/iris_gbt.model")

# Train the same pipeline from scratch for comparison.
va = VectorAssembler(inputCols=col[:-1], outputCol="features")
si = StringIndexer(inputCol="type", outputCol="label")
gbt = GBTClassifier(labelCol="label", featuresCol="features")

pipe = Pipeline(stages=[va, si, gbt])
pipeline_train = pipe.fit(iris)

explainer_train = shap.TreeExplainer(pipeline_train.stages[-1])    # freshly trained model
explainer_pipeline = shap.TreeExplainer(pipeline.stages[-1])       # loaded pipeline
explainer_model = shap.TreeExplainer(model)                        # loaded model

X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100]

shap_values_train = explainer_train.shap_values(X)        # works
shap_values_pipeline = explainer_pipeline.shap_values(X)
shap_values_model = explainer_model.shap_values(X)        # raises the AssertionError above
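
As a temporary workaround, the error message itself suggests the interventional mode. A hedged sketch (same model and X as above; behaviour with a loaded GBTClassificationModel not verified):

# Workaround sketch: use interventional feature perturbation with an explicit
# background dataset instead of the default tree_path_dependent algorithm.
explainer_interventional = shap.TreeExplainer(
    model,                                    # the loaded GBTClassificationModel
    data=X,                                   # background data for the interventional algorithm
    feature_perturbation="interventional",
)
shap_values_interventional = explainer_interventional.shap_values(X)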
BDon-Tan changed the title from "Shap on pyspark" to "Shap on pyspark doesn't work with a loaded model" on Apr 6, 2022
weishengtoh linked a pull request on Sep 22, 2022 that will close this issue
weishengtoh commented:

Hi there, could you try amending the code as suggested in this pull request?

Hopefully this will fix your issue.

mriomoreno linked a pull request on Nov 13, 2023 that will close this issue