
Fix PySpark GBT Issue #2700

Open · wants to merge 2 commits into master
Conversation

weishengtoh commented Sep 22, 2022

Fix PySpark GBT Issue [fix #884, fix #2480]

ISSUE:

  • Running PySpark GBT models sometimes causes the shap package to fail with the error message:
    "The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional"."
  • Using feature_perturbation="interventional" as suggested does not work with PySpark models, because the predict function is not implemented for them:

# lines 1082 to 1085 in shap/shap/explainers/_tree.py
if self.model_type == "pyspark":
    # import pyspark
    # TODO: support predict for pyspark
    raise NotImplementedError("Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.")
  • The feature_perturbation="tree_path_dependent" option is failing due to a check in the code that is meant to ensure that the background dataset lands in every leaf:
# lines 1031 to 1033 in shap/shap/explainers/_tree.py
# ensure that the passed background dataset lands in every leaf
if np.min(self.trees[i].node_sample_weight) <= 0:
    self.fully_defined_weighting = False
  • By rights, this check should pass in all cases, since no background dataset is required for feature_perturbation="tree_path_dependent".

  • In some cases for PySpark GBT models, fully_defined_weighting is incorrectly set to False. fully_defined_weighting is determined by the values of node_sample_weight, which are set by the code below:

# line 1199 in shap/shap/explainers/_tree.py
self.node_sample_weight[index] = node.impurityStats().count() #weighted count of element trough this node
  • node.impurityStats() returns a GiniCalculator, and the method .count() should return a float rather than an int (see the Spark source).

[screenshot of the relevant Spark source]

  • However, if you create a PySpark GBT model and obtain the values of node.impurityStats().count(), you will notice that they have been rounded down to int.
  • node.impurityStats().count() should return the same value as sum([e for e in node.impurityStats().stats()]) if you follow the image above. It is, however, rounding the values down, and in some cases values greater than 0 and less than 1 are rounded down to 0.
  • This causes self.fully_defined_weighting to be set to False, even when the values are clearly not zero (a sketch demonstrating this follows below).
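
A minimal sketch (not part of the PR) of how the discrepancy can be observed on a fitted GBTClassificationModel named model, assuming PySpark's internal _call_java helper is available on the per-tree models:

# Inspect the root node of the first tree through PySpark's Java bridge.
# `_call_java` is an internal helper, so this is illustrative, not a supported API.
root = model.trees[0]._call_java("rootNode")   # Java Node object
stats = root.impurityStats()                   # Java ImpurityCalculator
print(stats.count())                           # rounded-down value shap currently uses
print(sum(e for e in stats.stats()))           # exact float sum, as in the proposed fix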

SOLUTION:

  • Avoid using node.impurityStats().count(). Replace it with sum([e for e in node.impurityStats().stats()]), which computes the same quantity but retains the value as a float.

venser12 commented Nov 13, 2023

@slundberg @thatlittleboy @CloseChoice Could anyone merge this PR?

Thanks!

CloseChoice (Collaborator) commented Nov 13, 2023

@venser12 Thanks for following this up. Could you please add a test for this?
You could write the model to a temporary directory and load it back from there for the test.
It would also be great if you could resolve the conflicts.

I will take a closer look later this week.
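
A minimal pytest sketch of the suggested round-trip test (illustrative only: the data and fixture usage are assumptions, and running it needs a working Java/Spark environment):

import numpy as np
import pytest

pytest.importorskip("pyspark")

from pyspark.ml.classification import GBTClassificationModel, GBTClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession


def test_pyspark_gbt_loaded_model(tmp_path):
    spark = SparkSession.builder.appName("shap-gbt-test").getOrCreate()
    try:
        df = spark.createDataFrame(
            [(float(i), float(i % 5), i % 2) for i in range(20)],
            ["a", "b", "label"],
        )
        features = VectorAssembler(inputCols=["a", "b"], outputCol="features").transform(df)
        model = GBTClassifier(featuresCol="features", labelCol="label").fit(features)

        # Round-trip through a temporary directory, as suggested above.
        path = str(tmp_path / "gbt_model")
        model.save(path)
        loaded = GBTClassificationModel.load(path)

        # Before the fix this could raise the "background dataset ... leaves" error.
        import shap
        explainer = shap.TreeExplainer(loaded)
        X = np.array(features.select("a", "b").collect())
        explainer.shap_values(X)
    finally:
        spark.stop()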

@@ -1196,7 +1196,7 @@ def buildTree(index, node):
             self.values[index] = [node.prediction()] #prediction for the node
         else:
             self.values[index] = [e for e in node.impurityStats().stats()] #for gini: NDarray(numLabel): 1 per label: number of item for each label which went through this node
-        self.node_sample_weight[index] = node.impurityStats().count() #weighted count of element trough this node
+        self.node_sample_weight[index] = sum([e for e in node.impurityStats().stats()]) #weighted count of element trough this node
Review comment (Collaborator):

Could you please explain why this is the correct way to do things? I am not deeply familiar with this code.

Reply:

You can find the answer above.

Reply:

It is because

 node.impurityStats().count()

rounds the values in PySpark models, so it returns zero for values that are not actually zero.

Reply:

If you can, please merge this PR. Thanks.

venser12 commented Nov 13, 2023

@CloseChoice, you can find the explanation at the top.

Here you have a test:

from pyspark.sql import SparkSession
import random

# Create the Spark session
spark = SparkSession.builder.appName("Shap").getOrCreate()

# Create sample data
data = [(random.randint(1, 100), random.randint(1, 50), random.randint(0, 1)) for _ in range(10)]

# Create the Spark DataFrame
df = spark.createDataFrame(data, ["numeric", "numeric_2", "label"])

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

# Assemble the feature columns
assembler = VectorAssembler(inputCols=["numeric", "numeric_2"], outputCol="features")

# GBTClassifier
gbt_classifier = GBTClassifier(featuresCol="features", labelCol="label")

# Pipeline with one stage per step
pipeline = Pipeline(stages=[assembler, gbt_classifier])

# Train, extract the fitted model, and save it
pipeline = pipeline.fit(df)
model = pipeline.stages[-1]
model.save("path_to_model")

import shap
import numpy as np
from pyspark.ml.classification import GBTClassificationModel

# Load the model back and compute SHAP values for one sample
loaded_model = GBTClassificationModel.load("path_to_model")
explainer = shap.Explainer(loaded_model)
explainer.shap_values(np.array(df.select("numeric", "numeric_2").collect()[0]))

mriomoreno commented Nov 13, 2023

@CloseChoice, I don't know why this branch has conflicts. We only changed one line. I tried to merge it manually and it worked:

[screenshot of the successful manual merge]

CloseChoice (Collaborator) commented Nov 13, 2023

Why do you use GBTClassifier for fitting and GBTClassificationModel for loading?

To fix the conflicts, please do:

git pull upstream master

(or whatever you set the GitHub remote to).

Edit:
If I am not mistaken, one would need to install Hadoop and set the HADOOP_HOME variable correctly to test this locally. Since we are not doing this, I would suggest that we do not test this in our CI.

mriomoreno commented Nov 14, 2023

@CloseChoice, I used GBTClassificationModel because the fitted Pipeline stage for the model is a GBTClassificationModel instead of a GBTClassifier:

[screenshot of the fitted pipeline stages]
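
To illustrate the estimator/model distinction, a hedged sketch (assuming the assembler, gbt_classifier, and df objects from the test above):

from pyspark.ml import Pipeline

# Fitting an estimator Pipeline yields a PipelineModel whose stages are fitted models.
fitted = Pipeline(stages=[assembler, gbt_classifier]).fit(df)
print(type(fitted.stages[-1]))
# <class 'pyspark.ml.classification.GBTClassificationModel'>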

Moreover, SHAP only supports GBTClassificationModel. It does not support GBTClassifier objects:

Exception: The passed model is not callable and cannot be analyzed directly with the given masker! Model: GBTClassifier_636da3917b9d

So even with GBTClassificationModel I'm getting the same error:

# GBTClassificationModel object
model = pipeline.stages[-1]

# Save the GBTClassificationModel
model.save("path_to_model")

import shap
import numpy as np
from pyspark.ml.classification import GBTClassificationModel

# Load the GBTClassificationModel
loaded_model = GBTClassificationModel.load("path_to_model")

[screenshot of the resulting error]

I get zero values for node weights that are clearly not zero, so we must change this in _tree.py.

I still think that changing that line will fix the error.

I also opened a new PR with no conflicts #3384

Thanks!

Development

Successfully merging this pull request may close these issues.

  • Shap on pyspark doesn't work with a loaded model
  • Error with Pyspark GBTClassifier
5 participants