
Shap on pyspark doesn't work with a loaded model #2480

Open · BDon-Tan opened this issue Apr 6, 2022 · 1 comment · May be fixed by #2700 or #3384

BDon-Tan commented Apr 6, 2022

Hi all,
I followed the code in test_trees.py, but I get an error with a loaded PySpark model, while the freshly trained PySpark model works fine. The error is:

AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

It seems fully_defined_weighting becomes False when the saved model is loaded, whereas it is True when the model is freshly trained. Something appears to be wrong in the SingleTree class, but I couldn't fix it myself (I suspect node_sample_weight is the key).

In [84]: explainer_train.model.fully_defined_weighting
Out[84]: True

In [85]: explainer_model.model.fully_defined_weighting
Out[85]: False
                # ensure that the passed background dataset lands in every leaf
                if np.min(self.trees[i].node_sample_weight) <= 0:
                    self.fully_defined_weighting = False
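
Below is a quick diagnostic sketch (the names come from the repro code further down; the loop over explainer_model.model.trees mirrors the check quoted above) to see which parsed trees contain non-positive node_sample_weight entries:

import numpy as np

# Hypothetical diagnostic: report every parsed tree that trips the check above.
# `explainer_model` is the TreeExplainer built from the loaded model (see repro code below).
for i, tree in enumerate(explainer_model.model.trees):
    w = np.asarray(tree.node_sample_weight)
    if np.min(w) <= 0:
        print(f"tree {i}: min node_sample_weight = {w.min()} "
              f"({np.sum(w <= 0)} of {w.size} nodes not covered)")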

The code I used is below:

import numpy as np
import pandas as pd
import shap
import sklearn.datasets
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import GBTClassificationModel, GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a two-class iris DataFrame (first 100 rows, so only classes 0 and 1).
iris_sk = sklearn.datasets.load_iris()
iris = pd.DataFrame(data=np.c_[iris_sk['data'], iris_sk['target']],
                    columns=iris_sk['feature_names'] + ['target'])[:100]
col = ["sepal_length", "sepal_width", "petal_length", "petal_width", "type"]
iris = spark.createDataFrame(iris, col)

# Previously saved pipeline and GBT model, loaded from disk.
pipeline = PipelineModel.load("/user/tanbingdong/iris_pipeline")
model = GBTClassificationModel.load("/user/tanbingdong/iris_gbt.model")

# Train the same pipeline from scratch for comparison.
va = VectorAssembler(inputCols=col[:-1], outputCol="features")
si = StringIndexer(inputCol="type", outputCol="label")
gbt = GBTClassifier(labelCol="label", featuresCol="features")

pipe = Pipeline(stages=[va, si, gbt])
pipeline_train = pipe.fit(iris)

explainer_train = shap.TreeExplainer(pipeline_train.stages[-1])    # freshly trained model
explainer_pipeline = shap.TreeExplainer(pipeline.stages[-1])       # loaded pipeline
explainer_model = shap.TreeExplainer(model)                        # loaded model

X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100]

shap_values_train = explainer_train.shap_values(X)        # works
shap_values_pipeline = explainer_pipeline.shap_values(X)
shap_values_model = explainer_model.shap_values(X)        # raises the AssertionError above
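
As a temporary workaround, the error message itself suggests the interventional mode. A hedged sketch (same model and X as above; behaviour with a loaded GBTClassificationModel not verified):

# Workaround sketch: use interventional feature perturbation with an explicit
# background dataset instead of the default tree_path_dependent algorithm.
explainer_interventional = shap.TreeExplainer(
    model,                                    # the loaded GBTClassificationModel
    data=X,                                   # background data for the interventional algorithm
    feature_perturbation="interventional",
)
shap_values_interventional = explainer_interventional.shap_values(X)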
BDon-Tan changed the title from "Shap on pyspark" to "Shap on pyspark doesn't work with a loaded model" on Apr 6, 2022
weishengtoh linked a pull request on Sep 22, 2022 that will close this issue
weishengtoh commented:

Hi there, could you try amending the code as suggested in this pull request?

Hopefully this will fix your issue.

mriomoreno linked a pull request on Nov 13, 2023 that will close this issue