
Adding Feature Value column to SHAP table. #1064

Merged

Conversation

@freddyaboulton (Contributor) commented Aug 14, 2020

Pull Request Description

Closes #1035

Docs Changes

Here

Demo for Regression Problem

[screenshot: example report output showing the new Feature Value column]


@codecov (bot) commented Aug 14, 2020

Codecov Report

Merging #1064 into main will not change coverage (+0.00%).
The diff coverage is 100.00%.


@@           Coverage Diff            @@
##             main    #1064    +/-   ##
========================================
  Coverage   99.91%   99.91%            
========================================
  Files         188      191     +3     
  Lines       10296    10852   +556     
========================================
+ Hits        10287    10843   +556     
  Misses          9        9            
Impacted Files Coverage Δ
..._understanding/prediction_explanations/__init__.py 100.00% <ø> (ø)
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/tests/pipeline_tests/test_graphs.py 100.00% <ø> (ø)
evalml/utils/__init__.py 100.00% <ø> (ø)
evalml/automl/automl_search.py 99.55% <100.00%> (+<0.01%) ⬆️
evalml/exceptions/exceptions.py 100.00% <100.00%> (ø)
evalml/model_understanding/__init__.py 100.00% <100.00%> (ø)
evalml/model_understanding/graphs.py 100.00% <100.00%> (ø)
...derstanding/prediction_explanations/_algorithms.py 100.00% <100.00%> (ø)
...tanding/prediction_explanations/_user_interface.py 100.00% <100.00%> (ø)
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2ecf3bd...184c87c.

@angela97lin (Contributor) left a comment:

This was a really good suggestion from the demos and this LGTM! 😁

@dsherry (Contributor) left a comment:

Great! I left a question about how the call to PipelineBase._transform is happening now, a suggestion to use a pandas dataframe instead of a dict for table_maker, a suggestion about rounding, and a suggested simplification for some test code. None of these are blocking, though.

@@ -47,14 +47,13 @@ def _compute_shap_values(pipeline, features, training_data=None):
if estimator.model_family == ModelFamily.BASELINE:
raise ValueError("You passed in a baseline pipeline. These are simple enough that SHAP values are not needed.")

pipeline_features = pipeline._transform(features)
@dsherry:
How are you able to delete this? Are you now calling PipelineBase._transform one step up the stack?

@freddyaboulton (Author):
Yea, moved it to _make_single_prediction_shap_table to only call it once.

row = [feature_name, display_text]
feature_value = pipeline_features[feature_name]
if pd.api.types.is_number(feature_value):
feature_value = np.round(feature_value, 2)
@dsherry:
Is the rounding necessary? I think '{0:.2f}'.format(feature_value) would be better. That way, you're guaranteed the output shows exactly two digits after the decimal point.

@freddyaboulton (Author):
You're 100% right! Much better than calling round.
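For reference, a quick stdlib comparison of the two approaches (independent of the PR code): rounding trims to at most two decimals but keeps a float, while a `.2f` format spec always renders exactly two decimal places as a string.

```python
value = 7.5

# Rounding keeps the value a float; trailing zeros are not preserved.
rounded = round(value, 2)            # still 7.5

# A fixed-point format spec renders exactly two decimal places as a string.
formatted = "{0:.2f}".format(value)  # "7.50"
```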

dtypes = ["t", "t", "f"] if include_shap_values else ["t", "t"]
alignment = ["c", "c", "c"] if include_shap_values else ["c", "c"]
dtypes = ["t", "f", "t", "f"] if include_shap_values else ["t", "t", "t"]
alignment = ["c", "c", "c", "c"] if include_shap_values else ["c", "c", "c"]
@dsherry:
What's this doing?

@freddyaboulton (Author):
These lines set the per-column formatting. Now that we round via string formatting rather than numpy, every column can be text, which will simplify them.
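The simplification being described might look like this sketch (the flag name comes from the diff; the column count is illustrative): once every value is pre-formatted to a string, the per-column special cases collapse to uniform lists.

```python
include_shap_values = True

# With all values formatted to strings up front, every column is plain text
# ("t"), so the dtype/alignment lists no longer need per-column cases.
n_cols = 4 if include_shap_values else 3
dtypes = ["t"] * n_cols
alignment = ["c"] * n_cols
```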

@@ -136,14 +142,18 @@ def _make_single_prediction_shap_table(pipeline, input_features, top_k=3, traini
shap_values = _compute_shap_values(pipeline, input_features, training_data)
normalized_shap_values = _normalize_shap_values(shap_values)

# We need a dict of type {column_name: feature value}
pipeline_features = pipeline._transform(input_features)
features_dict = dict(zip(pipeline_features.columns, *pipeline_features.values))
@dsherry:
Why do you need a dict? Why not pass in the pandas df and index into it? The fewer datastructures we have floating around, the better.

@freddyaboulton (Author):
I made this change!

@dsherry:
Thanks, great!
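A minimal sketch of the dataframe-indexing approach the reviewer suggested (column names hypothetical), assuming the frame has exactly one row by this point in the code:

```python
import pandas as pd

# Hypothetical single-row frame of transformed pipeline features.
pipeline_features = pd.DataFrame({"a": [10], "b": [20], "c": [30]})

# Instead of flattening to a {column: value} dict, index the frame directly:
# .iloc[0] selects the single row as a Series keyed by column name.
row = pipeline_features.iloc[0]
feature_value = row["b"]
```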

@@ -18,7 +18,11 @@
def compare_two_tables(table_1, table_2):
assert len(table_1) == len(table_2)
for row, row_answer in zip(table_1, table_2):
assert row.strip().split() == row_answer.strip().split()
# To make it easier to compare header underline
@dsherry:
Oh so this isn't fixed-length? Or something which you could easily calculate in the test?

@freddyaboulton (Author):
I got lazy but I fixed it now by setting the correct length in the answer and removing this line 😬
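For context, the helper under discussion (reconstructed from the diff) compares two rendered text tables line by line while ignoring spacing differences:

```python
def compare_two_tables(table_1, table_2):
    # Same number of lines, and the same whitespace-separated tokens per line.
    assert len(table_1) == len(table_2)
    for row, row_answer in zip(table_1, table_2):
        assert row.strip().split() == row_answer.strip().split()


# Passes: spacing differs between the tables, but the tokens match.
compare_two_tables(["a   b", "1 2"], ["a b", "1   2"])
```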

@@ -96,6 +100,7 @@ def test_explain_prediction(mock_normalize_shap_values,
pipeline = MagicMock()
pipeline.problem_type = problem_type
pipeline._classes = ["class_0", "class_1", "class_2"]
pipeline._transform.return_value = pd.DataFrame({"a": [10], "b": [20], "c": [30], "d": [40]})
@dsherry:
Cool. Could be nice to have the DF have more than one row, to make sure indexing works. Not sure if that's covered elsewhere

@freddyaboulton (Author):
By the time we get to this point in the code, the df is guaranteed to only have one row. I'll add a comment to the test to explain!

@dsherry:
Ah ok thanks


if include_string_features:
pipeline_features["a"] = "foo-feature"
pipeline_features["b"] = np.datetime64("2020-08-14")
@dsherry:
Nice, great that you included datetimes / something other than numeric and string

if include_shap_values:
new_answer = copy.deepcopy(answer)
for row in new_answer:
row.append(values[row[0]][0])
else:
new_answer = answer

assert _make_rows(values, values, top_k, include_shap_values) == new_answer
if include_string_features:
new_answer = copy.deepcopy(new_answer)
@dsherry:
This is a no-op, right?

@freddyaboulton (Author):
Yea, deleted this since I went with your suggestion below!

if row[0] == "a":
row[1] = "foo-feature"
elif row[0] == "b":
row[1] = "2020-08-14"
@dsherry:
I had to wrack my brain to remember if this would work in python!

Suggested simplification:

filtered_answer = []
for row in new_answer:
    val = row[1]
    if row[0] == "a":
        val = "foo-feature"
    elif row[0] == "b":
        val = "2020-08-14"
    filtered_answer.append((row[0], val))

@freddyaboulton (Author):
I went with your suggestion! Although it looks a little bit different because row has four values and so we want to append four values to filtered_answer, not just two values.
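Adapted to four-value rows, the loop might look like this sketch (the row contents are illustrative, not the actual test fixtures):

```python
# Hypothetical rows: [feature_name, feature_value, contribution_symbol, shap_value]
new_answer = [
    ["a", "1.00", "+", 0.50],
    ["b", "2.00", "-", -0.30],
    ["c", "3.00", "+", 0.10],
]

filtered_answer = []
for row in new_answer:
    val = row[1]
    if row[0] == "a":
        val = "foo-feature"
    elif row[0] == "b":
        val = "2020-08-14"
    # Carry all four values through, not just the first two.
    filtered_answer.append([row[0], val, row[2], row[3]])
```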

c 1.000 + 0.000
d -1.560 -- -2.560
e -1.800 -- -2.800
f -1.900 -- -2.900""".splitlines()
@dsherry:
These reports look good!

@freddyaboulton (Author):

@dsherry Thanks for the feedback - I addressed your comments!

@freddyaboulton freddyaboulton merged commit 2ae172a into main Aug 19, 2020
@dsherry dsherry mentioned this pull request Aug 25, 2020
@freddyaboulton freddyaboulton deleted the 1035-add-feature-values-column-to-explanation-reports branch October 22, 2020 18:28
Closes: Add a "Feature Values" column for prediction explanation reports (#1035)
3 participants