Dataframe export #211

Closed
wants to merge 20 commits into from

Conversation

lopuhin
Contributor

@lopuhin lopuhin commented Jun 5, 2017

Related to #196
Add format_as_dataframe and format_as_dataframes, as discussed here #196 (comment)

TODO:

  • fix class order for transition features
  • documentation

@codecov-io

codecov-io commented Jun 5, 2017

Codecov Report

Merging #211 into master will increase coverage by <.01%.
The diff coverage is 97.43%.

@@            Coverage Diff            @@
##           master    #211      +/-   ##
=========================================
+ Coverage    97.4%   97.4%   +<.01%     
=========================================
  Files          41      42       +1     
  Lines        2585    2663      +78     
  Branches      496     514      +18     
=========================================
+ Hits         2518    2594      +76     
  Misses         35      35              
- Partials       32      34       +2
Impacted Files Coverage Δ
eli5/__init__.py 85.29% <100%> (+1.96%) ⬆️
eli5/formatters/__init__.py 100% <100%> (ø) ⬆️
eli5/formatters/as_dataframe.py 97.14% <97.14%> (ø)

@lopuhin lopuhin changed the title [WIP] Dataframe export Dataframe export Jun 5, 2017
@lopuhin lopuhin requested a review from kmike June 5, 2017 11:59
@lopuhin
Contributor Author

lopuhin commented Jun 5, 2017

@kmike could you please review this? The build is finally green. I'm not sure about 13ee6b3: I had to add pandas to docs/requirements.txt, but lightdb/xgboost/lightning are not listed there, yet the docs still build, and those libraries are imported in a similar way as far as I could tell.

@@ -2,4 +2,5 @@ ipython
scipy
numpy > 1.9.0
scikit-learn >= 0.18
pandas
Contributor

Right; it could make sense to remove IPython from here, as it is also optional.

Contributor Author

@kmike I still don't understand the failure here: https://travis-ci.org/TeamHG-Memex/eli5/jobs/239548322#L508. It tries to import pandas and fails, and pandas is indeed imported at the module level, but the same is true for lightdb/xgboost/lightning, so I don't yet understand why they don't fail the docs build while pandas does.
On the other hand, don't we want to check all libraries' docs when doing the Travis docs build? So maybe it makes sense to include all optional ones here?

Contributor

@lopuhin unsupported libraries are mocked out here for docs: https://github.com/TeamHG-Memex/eli5/blob/master/docs/source/conf.py#L38
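The conf.py linked above mocks out optional dependencies so Sphinx can import eli5 modules without them. A minimal sketch of that approach (not eli5's actual conf.py; the module names here are illustrative):

```python
# Replace optional dependencies with MagicMock in sys.modules so the
# Sphinx autodoc build can import modules that use them, even when the
# real libraries are not installed in the docs build environment.
import sys
from unittest.mock import MagicMock

MOCK_MODULES = ['pandas', 'xgboost', 'lightning']  # illustrative names
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = MagicMock()

import pandas  # now resolves to the mock, even without pandas installed
```

After this runs, any module-level `import pandas` in the documented code succeeds, which is why adding pandas to the mock list fixes the docs build.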

Contributor Author

@kmike aha, thanks! That's what I was missing. Let me mock pandas there too.

Contributor

@kmike kmike Jun 5, 2017

It seems ipython is in requirements.txt because mocking IPython didn't work for some reason. But I'm not sure.

Contributor Author

Fixed in e5f4fba

it seems ipython is in requirements.txt because mocking IPython didn't work for some reason. But I'm not sure.

Yes, I just noticed; you likely already tried to mock it :)

neg=[],
)),
],
)
Contributor

I think it'd be good to add tests for format_as_dataframe(s) applied to explain_weights / explain_prediction results; the existing tests would still pass even if we changed the DataFrame output format.

Contributor Author

Agreed, that would be much more robust. I added most tests in b4bc427; I'm going to add CRF checks right in the existing CRF tests. By the way, they reveal a failure during export :)

Contributor Author

And CRF tests done in 83f1cfc, now everything is covered I think.



@format_as_dataframe.register(FeatureImportances)
def feature_importances_to_df(feature_importances):
Contributor

@kmike kmike Jun 5, 2017

What do you think about making such functions private, as they are implementation details of format_.. functions? Or do you see them as a part of public API?

Contributor Author

Right, we don't want to expose them; fixed in 1e235a2, thanks!
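The registration pattern in the diff above can be sketched with the standard library's functools.singledispatch: format_as_dataframe dispatches on the explanation type, and registered converters get a leading underscore so they stay out of the public API. This is a hedged, self-contained sketch; FeatureImportances and the dict return value are stand-ins, not eli5's real class or output.

```python
from functools import singledispatch

class FeatureImportances:
    """Stand-in for eli5's FeatureImportances explanation object."""
    def __init__(self, importances):
        self.importances = importances

@singledispatch
def format_as_dataframe(explanation):
    # Fallback for unregistered explanation types.
    raise TypeError('no DataFrame export for %r' % (explanation,))

@format_as_dataframe.register(FeatureImportances)
def _feature_importances_to_df(fi):
    # Leading underscore keeps the converter private, as discussed above.
    # Real code would build a pandas DataFrame; a dict keeps this runnable
    # without pandas.
    return {'weight': fi.importances}

table = format_as_dataframe(FeatureImportances([0.5, 0.3]))
```

Callers only ever touch format_as_dataframe; the per-type converters are an implementation detail of the dispatch table.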

@kmike
Contributor

kmike commented Jun 5, 2017

Looks good!

There is a gotcha though, which could make DataFrame support less useful and less easy to use: feature weights are filtered out at the explain_... stage, so by default the DataFrame contains only a few values, not all weights. This reduces the usefulness of the DataFrame export format, as there is not a lot one may want to do with e.g. only the top 20 features. So in most cases users should bump the limit to a very large value before using format_as_dataframes.

I wonder if it makes sense to add functions similar to the show_... functions, but tailored to DataFrames (explain_weights_df, explain_weights_dfs, etc.?). They would combine explain + format and change the default limit values.
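The helper idea could look roughly like the following. This is a hedged sketch of the shape under discussion, not eli5's implementation: explain_weights and format_as_table here are trivial stand-ins, and the real helper would call eli5.explain_weights and return a pandas DataFrame.

```python
def explain_weights(clf, top=20):
    """Stand-in for eli5.explain_weights: returns weights, truncated by top."""
    weights = sorted(clf['weights'], reverse=True)
    return weights if top is None else weights[:top]

def format_as_table(weights):
    """Stand-in for format_as_dataframe: one row per weight."""
    return [{'weight': w} for w in weights]

def explain_weights_df(clf, top=None, **kwargs):
    # The point of the helper: combine explain + format, and default
    # top to None so the exported table contains *all* weights.
    return format_as_table(explain_weights(clf, top=top, **kwargs))

clf = {'weights': [0.1, 0.9, 0.5]}  # toy stand-in for a fitted classifier
table = explain_weights_df(clf)
```

Flipping the default inside the helper sidesteps the gotcha: users who call explain_weights_df get everything, while the existing explain/format split keeps its current defaults.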

Without these helpers it'd be good to at least document this gotcha, and maybe provide an example of working with features using the DataFrame format, maybe even a small tutorial.

Some recipe ideas:

This all doesn't have to be a part of this pull request, but I think we should document or fix gotcha with limits before the release.

@lopuhin
Contributor Author

lopuhin commented Jun 6, 2017

@kmike yeah, that's a very important point.

I had another idea of how to address it, but I'm not sure how viable it is. Here it is:

  • make top=None the default in explain_weights
  • add a top argument to format_as_text and format_as_html, with the same default value as now, so they can do the filtering
  • show_weights will keep the current default

The major problem with this idea is that we can't do it naively without sacrificing performance: I just tried eli5.explain_weights(clf, vec=vec, top=None) for clf.coef_.shape (20, 101631), and it took 8 seconds vs. just 382 ms for top=30. So to do it efficiently we'd have to make some of the computation lazy, which would likely be harder to support. Or maybe it's possible to optimize it (say, make it 2x-4x faster), but I'm not sure that would be enough. Do you think it's worthwhile to try?
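Part of the asymmetry above is generic: keeping only a top-N is inherently cheaper than materializing and sorting everything. A small stdlib-only illustration (not eli5 code; the real cost there also includes building feature names and explanation objects for every weight):

```python
import heapq
import random

random.seed(0)
# Array size matches the coef_ width from the timing example above.
weights = [random.random() for _ in range(101631)]

# Top-N only: O(n log k), tracks just 30 candidates in a heap.
top30 = heapq.nlargest(30, weights)

# Everything: O(n log n), plus the cost of materializing the full result.
all_sorted = sorted(weights, reverse=True)
```

heapq.nlargest is documented to be equivalent to sorted(iterable, reverse=True)[:n], so the cheap path loses no accuracy for the top entries.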

I wonder if it makes sense to add functions similar to show_.. functions, but tailored to DataFrames (explain_weights_df, etc.?). They should combine explain + format, and change default limit values.

I like the idea of these helpers: they will also make DataFrame export more visible, and they are simpler to implement.

@kmike
Contributor

kmike commented Jun 29, 2017

I think it can be merged almost as-is, if we add a warning about the top argument to the format_as_dataframe(s) functions.

It won't be the final user-facing API (explain_weights_df(s)?), and there won't be tutorials, but it is already a good improvement if you don't have time to finish it at the moment.

@lopuhin
Contributor Author

lopuhin commented Jun 29, 2017

@kmike I don't think I have time to try my suggestion from #211 (comment) at the moment, but I can at least wrap it up on Friday: add helpers and warnings (it's a blessing that notebooks show warnings by default).

@kmike
Contributor

kmike commented Jun 29, 2017

@lopuhin sounds good, thanks!

@kmike
Contributor

kmike commented Jun 29, 2017

I was thinking about warnings in docstrings, but showing real warnings also makes sense. When should we show them? When the default top is not overridden and there are missing features?

(commit messages)
They combine explanation and export to DataFrame and set top to None by default.
Maybe it was due to a "cyclic" import from mypy's point of view, but it's still very strange.
@lopuhin lopuhin changed the title Dataframe export [WIP] Dataframe export Jun 30, 2017
@lopuhin lopuhin changed the title [WIP] Dataframe export Dataframe export Jun 30, 2017
@lopuhin
Contributor Author

lopuhin commented Jun 30, 2017

@kmike I added helpers, notes in the docs about missing features, and also raise warnings (e600f35) if any features are really missing.
I'm not entirely sure we should raise warnings in this case: it might be slightly annoying if you really want to export only the top N features (and while it's possible to detect when top was passed explicitly, that would complicate the code). If we decide it's better to go without them, they are in a separate commit (e600f35) that can be reverted.

@kmike
Contributor

kmike commented Jun 30, 2017

@lopuhin looks good, thanks! I'll merge it without a warning.

@kmike
Contributor

kmike commented Jun 30, 2017

Merged, thanks @lopuhin!

@kmike kmike closed this Jun 30, 2017
@lopuhin
Contributor Author

lopuhin commented Jun 30, 2017

Thanks @kmike !

@lopuhin lopuhin deleted the dataframe-export branch June 30, 2017 13:56
3 participants