Dataframe export #211

Closed
wants to merge 20 commits into from

Conversation

lopuhin
Contributor

@lopuhin lopuhin commented Jun 5, 2017

Related to #196
Add format_as_dataframe and format_as_dataframes, as discussed here #196 (comment)

TODO:

  • fix class order for transition features
  • documentation

@codecov-io

codecov-io commented Jun 5, 2017

Codecov Report

Merging #211 into master will increase coverage by <.01%.
The diff coverage is 97.43%.

@@            Coverage Diff            @@
##           master    #211      +/-   ##
=========================================
+ Coverage    97.4%   97.4%   +<.01%     
=========================================
  Files          41      42       +1     
  Lines        2585    2663      +78     
  Branches      496     514      +18     
=========================================
+ Hits         2518    2594      +76     
  Misses         35      35              
- Partials       32      34       +2
Impacted Files Coverage Δ
eli5/__init__.py 85.29% <100%> (+1.96%) ⬆️
eli5/formatters/__init__.py 100% <100%> (ø) ⬆️
eli5/formatters/as_dataframe.py 97.14% <97.14%> (ø)

@lopuhin lopuhin changed the title [WIP] Dataframe export Dataframe export Jun 5, 2017
@lopuhin lopuhin requested a review from kmike June 5, 2017 11:59
@lopuhin
Contributor Author

lopuhin commented Jun 5, 2017

@kmike could you please review this? The build is finally green. I'm not sure about 13ee6b3: I had to add pandas to docs/requirements.txt, but lightdb/xgboost/lightning are not listed there, yet the docs still build, and those libraries are imported in a similar way as far as I could tell.

@@ -2,4 +2,5 @@ ipython
scipy
numpy > 1.9.0
scikit-learn >= 0.18
pandas
Contributor

Right; it could make sense to remove IPython from here, as it is also optional.

Contributor Author

@kmike I still don't understand the failure here: https://travis-ci.org/TeamHG-Memex/eli5/jobs/239548322#L508. It tries to import pandas and fails, and pandas is indeed imported at the module level, but the same is true for lightdb/xgboost/lightning, so I don't yet understand why they don't fail the docs build while pandas does.
On the other hand, don't we want to check all libraries' docs when doing the Travis docs build? So maybe it makes sense to include all optional ones here?

Contributor

@lopuhin unsupported libraries are mocked out here for docs: https://github.com/TeamHG-Memex/eli5/blob/master/docs/source/conf.py#L38
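The conf.py linked above mocks out optional dependencies so Sphinx can import eli5 modules without them. A minimal sketch of that approach (not eli5's actual conf.py; the module names here are illustrative):

```python
# Replace optional dependencies with MagicMock in sys.modules so the
# Sphinx autodoc build can import modules that use them, even when the
# real libraries are not installed in the docs build environment.
import sys
from unittest.mock import MagicMock

MOCK_MODULES = ['pandas', 'xgboost', 'lightning']  # illustrative names
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = MagicMock()

import pandas  # now resolves to the mock, even without pandas installed
```

After this runs, any module-level `import pandas` in the documented code succeeds, which is why adding pandas to the mock list fixes the docs build.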

Contributor Author

@kmike aha, thanks! That's what I was missing. Let me mock pandas there too.

Contributor

@kmike kmike Jun 5, 2017

It seems ipython is in requirements.txt because mocking IPython didn't work for some reason. But I'm not sure.

Contributor Author

Fixed in e5f4fba

it seems ipython is in requirements.txt because mocking IPython didn't work for some reason. But I'm not sure.

Yes, I just noticed; you likely already tried to mock it :)

neg=[],
)),
],
)
Contributor

I think it'd be good to add tests for format_as_dataframe(s) applied to explain_weights / explain_prediction results; the existing tests would still pass even if we changed the DataFrame output format.

Contributor Author

Agreed, that would be much more robust. I added most tests in b4bc427; I'm going to add CRF checks right in the existing CRF tests. By the way, they reveal a failure during export :)

Contributor Author

And CRF tests done in 83f1cfc, now everything is covered I think.



@format_as_dataframe.register(FeatureImportances)
def feature_importances_to_df(feature_importances):
Contributor

@kmike kmike Jun 5, 2017

What do you think about making such functions private, as they are implementation details of format_.. functions? Or do you see them as a part of public API?

Contributor Author

Right, we don't want to expose them; fixed in 1e235a2, thanks!
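The registration pattern in the diff above can be sketched with the standard library's functools.singledispatch: format_as_dataframe dispatches on the explanation type, and registered converters get a leading underscore so they stay out of the public API. This is a hedged, self-contained sketch; FeatureImportances and the dict return value are stand-ins, not eli5's real class or output.

```python
from functools import singledispatch

class FeatureImportances:
    """Stand-in for eli5's FeatureImportances explanation object."""
    def __init__(self, importances):
        self.importances = importances

@singledispatch
def format_as_dataframe(explanation):
    # Fallback for unregistered explanation types.
    raise TypeError('no DataFrame export for %r' % (explanation,))

@format_as_dataframe.register(FeatureImportances)
def _feature_importances_to_df(fi):
    # Leading underscore keeps the converter private, as discussed above.
    # Real code would build a pandas DataFrame; a dict keeps this runnable
    # without pandas.
    return {'weight': fi.importances}

table = format_as_dataframe(FeatureImportances([0.5, 0.3]))
```

Callers only ever touch format_as_dataframe; the per-type converters are an implementation detail of the dispatch table.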

@kmike
Contributor

kmike commented Jun 5, 2017

Looks good!

There is a gotcha though, which could make DataFrame support less useful and less easy to use: feature weights are filtered out at the explain_... stage, so by default the DataFrame contains only a few values, not all weights. This reduces the usefulness of the DataFrame export format, as there is not a lot one may want to do with e.g. only the top 20 features. So in most cases users should bump the limit to a very large value before using format_as_dataframes.

I wonder if it makes sense to add functions similar to the show_... functions, but tailored to DataFrames (explain_weights_df, explain_weights_dfs, etc.?). They would combine explain + format and change the default limit values.
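The helper idea could look roughly like the following. This is a hedged sketch of the shape under discussion, not eli5's implementation: explain_weights and format_as_table here are trivial stand-ins, and the real helper would call eli5.explain_weights and return a pandas DataFrame.

```python
def explain_weights(clf, top=20):
    """Stand-in for eli5.explain_weights: returns weights, truncated by top."""
    weights = sorted(clf['weights'], reverse=True)
    return weights if top is None else weights[:top]

def format_as_table(weights):
    """Stand-in for format_as_dataframe: one row per weight."""
    return [{'weight': w} for w in weights]

def explain_weights_df(clf, top=None, **kwargs):
    # The point of the helper: combine explain + format, and default
    # top to None so the exported table contains *all* weights.
    return format_as_table(explain_weights(clf, top=top, **kwargs))

clf = {'weights': [0.1, 0.9, 0.5]}  # toy stand-in for a fitted classifier
table = explain_weights_df(clf)
```

Flipping the default inside the helper sidesteps the gotcha: users who call explain_weights_df get everything, while the existing explain/format split keeps its current defaults.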

Without these helpers it'd be good to at least document this gotcha, and maybe provide an example of working with features using the DataFrame format, maybe even a small tutorial.

Some recipe ideas:

This all doesn't have to be a part of this pull request, but I think we should document or fix gotcha with limits before the release.

@lopuhin
Contributor Author

lopuhin commented Jun 6, 2017

@kmike yeah, that's a very important point.

I had another idea of how to address it, but I'm not sure how viable it is. Here it is:

  • make top=None the default in explain_weights
  • add a top argument to format_as_text and format_as_html, with the same default value as now, so they can do the filtering
  • show_weights will keep the current default

The major problem with this idea is that we can't do it naively without sacrificing performance: I just tried eli5.explain_weights(clf, vec=vec, top=None) for clf.coef_.shape (20, 101631), and it took 8 seconds vs. just 382 ms for top=30. So to do it efficiently we'd have to make some of the computation lazy, which would likely be harder to support. Or maybe it's possible to optimize it (say, make it 2x-4x faster), but I'm not sure that would be enough. Do you think it's worthwhile to try?
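Part of the asymmetry above is generic: keeping only a top-N is inherently cheaper than materializing and sorting everything. A small stdlib-only illustration (not eli5 code; the real cost there also includes building feature names and explanation objects for every weight):

```python
import heapq
import random

random.seed(0)
# Array size matches the coef_ width from the timing example above.
weights = [random.random() for _ in range(101631)]

# Top-N only: O(n log k), tracks just 30 candidates in a heap.
top30 = heapq.nlargest(30, weights)

# Everything: O(n log n), plus the cost of materializing the full result.
all_sorted = sorted(weights, reverse=True)
```

heapq.nlargest is documented to be equivalent to sorted(iterable, reverse=True)[:n], so the cheap path loses no accuracy for the top entries.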

I wonder if it makes sense to add functions similar to show_.. functions, but tailored to DataFrames (explain_weights_df, etc.?). They should combine explain + format, and change default limit values.

I like the idea of these helpers: they will also make DataFrame export more visible, and they are simpler to implement.

@kmike
Contributor

kmike commented Jun 29, 2017

I think it can be merged almost as-is, if we add a warning about the top argument to the format_as_dataframe(s) functions.

It won't be the final user-facing API (explain_weights_df(s)?), and there won't be tutorials, but it is already a good improvement if you don't have time to finish it at the moment.

@lopuhin
Contributor Author

lopuhin commented Jun 29, 2017

@kmike I don't think I have time to try my suggestion from #211 (comment) at the moment, but I can at least wrap it up on Friday: add helpers and warnings (it's a blessing that notebooks show warnings by default).

@kmike
Contributor

kmike commented Jun 29, 2017

@lopuhin sounds good, thanks!

@kmike
Contributor

kmike commented Jun 29, 2017

I was thinking about warnings in docstrings, but showing real warnings also makes sense. When should we show them? When the default top is not overridden and there are missing features?

(commit messages)
They combine explanation and export to DataFrame and set top to None by default.
Maybe it was due to a "cyclic" import from mypy's point of view, but it's still very strange.
@lopuhin lopuhin changed the title Dataframe export [WIP] Dataframe export Jun 30, 2017
@lopuhin lopuhin changed the title [WIP] Dataframe export Dataframe export Jun 30, 2017
@lopuhin
Contributor Author

lopuhin commented Jun 30, 2017

@kmike I added helpers, notes in the docs about missing features, and also raise warnings (e600f35) if any features are really missing.
I'm not entirely sure we should raise warnings in this case: it might be slightly annoying if you really want to export only the top N features (and while it's possible to detect when top was passed explicitly, that would complicate the code). If we decide it's better to go without them, they are in a separate commit (e600f35) that can be reverted.

@kmike
Contributor

kmike commented Jun 30, 2017

@lopuhin looks good, thanks! I'll merge it without a warning.

@kmike
Contributor

kmike commented Jun 30, 2017

Merged, thanks @lopuhin!

@kmike kmike closed this Jun 30, 2017
@lopuhin
Contributor Author

lopuhin commented Jun 30, 2017

Thanks @kmike !

@lopuhin lopuhin deleted the dataframe-export branch June 30, 2017 13:56
3 participants