feature_importance with more permutations #29

Closed
tkonopka opened this issue Jul 1, 2019 · 8 comments
Labels
feature 💡 New feature or request

Comments

@tkonopka
Contributor

tkonopka commented Jul 1, 2019

Hi ModelOriented.

Very useful collection of tools in this package ecosystem. Thank you.

I came across ingredients because of the feature_importance function. It works well based on a single permutation, but the variability between runs is sometimes noticeable on small datasets. For example, runs on the Titanic dataset can disagree on the importance ordering of the second- and third-best features.

Would you be interested in including a new argument to set the number of permutations in feature_importance? The function could output the average dropout loss over those permutations. Returning averages would be compatible with the existing output format and hence with the rest of the package, for example, plots. I can send a pull request in this direction.
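
As a minimal sketch of the idea (loss_after_permutation() here is a hypothetical helper that shuffles one feature and recomputes the loss; it is not part of the package):

```r
# sketch: average the dropout loss over several permutations of one feature
# loss_after_permutation() is a hypothetical helper, not an ingredients function
average_dropout_loss <- function(explainer, variable, n_permutations = 10) {
  losses <- vapply(seq_len(n_permutations),
                   function(i) loss_after_permutation(explainer, variable),
                   numeric(1))
  mean(losses)
}
```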

@pbiecek added the feature 💡 New feature or request and help wanted labels Jul 2, 2019
@pbiecek
Member

pbiecek commented Jul 2, 2019

Good idea! Both the average and the uncertainty would be useful. From B runs one can get the min importance, max importance, 1st quartile, 3rd quartile, and average. Then on the plot we can add some information related to uncertainty.

@tkonopka
Contributor Author

tkonopka commented Jul 2, 2019

Implementing the average is straightforward and would preserve the format of the output object. For the uncertainty, the output format would have to change. Do you have a preference?

Some options:

  • Extend the data frame horizontally with new columns dropout_loss_[q0,q1,q2,q3,q4] for quartiles; q0 is min, q2 is median, q4 is max (or similar names). Pros: Format remains quite compact. Cons: some columns will repeat the same data when a user asks for one permutation; the summaries are fixed at quartiles.

  • Extend the data frame horizontally and report all permutations in columns dropout_loss_[1:n]. Pros: it is clear how many permutations are computed. Cons: summaries must be computed later in a plot/summary function; output can become bulky (but a custom print might help).

  • Report the output in a composite object with a simple data frame as $summary and the complete permutation results in $permutations. A print or summary generic can display $summary, and other functions can use the rest (see the sketch after this list). Pros: elegant. Cons: summaries like the quartiles must be computed later in plot/summary functions; adjustments are needed to downstream code, e.g. plot functions.

  • Just to keep this on the table: avoid uncertainties. The new feature can provide an average with a one-liner; motivated users can assemble summaries by running the function multiple times.
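
To make the third option concrete, a rough sketch of what such a composite object could look like (the class name, field names, and values below are purely illustrative):

```r
# sketch of a composite result for the third option (illustrative names and values)
result <- list(
  summary = data.frame(variable = c("fare", "age"),
                       dropout_loss = c(0.21, 0.18),
                       label = "rf_model"),
  permutations = data.frame(variable = rep(c("fare", "age"), times = 3),
                            permutation = rep(1:3, each = 2),
                            dropout_loss = runif(6, min = 0.15, max = 0.25),
                            label = "rf_model")
)
class(result) <- "feature_importance_composite"   # made-up class name

# a print method could show only the summary table
print.feature_importance_composite <- function(x, ...) {
  print(x$summary)
  invisible(x)
}
```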

@pbiecek
Member

pbiecek commented Jul 3, 2019

Good points.
The third option looks like the most flexible solution, with maybe just two small suggestions.

I would like to keep backward compatibility for existing functions, as there are other packages that may depend on the feature_importance function (like modelDown).
And I would keep the additional information in object attributes rather than in lists (even if this sometimes causes problems with dplyr). Not everybody likes attributes, but it is consistent with other packages like iBreakDown.

So, let's assume that there is an additional argument B (the number of replications) after n_sample.
In the output (a data.frame with variable/dropout_loss/label) I would report the average importance over the B replications.
In an additional attribute attr(., "B") we may keep B.
In another attribute attr(., "raw_permutations") we may keep a data frame with the dropout losses from each permutation, probably in the long format, which is more common in the tidyverse, with the particular permutations in consecutive rows (in a similar spirit to this: https://github.com/ModelOriented/iBreakDown/blob/master/R/local_attributions.R#L164).

Then the plot function for feature_importance_explainer could have an additional argument plot_range (or similar). If it is TRUE and B > 1, whiskers may be added to the plot.

With the default B = 1, old scripts will work without any change, and for B > 1 one may see the averages or add additional information about the range.
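
A rough sketch of how the output could be assembled under this design (one_permutation_run() is a hypothetical helper returning a data.frame with variable/dropout_loss/label for a single permutation; the exact column names are assumptions):

```r
# sketch: averages in the main data frame, raw results kept as attributes
# one_permutation_run() is a hypothetical helper, not code from the package
raw <- do.call(rbind, lapply(seq_len(B), function(b) {
  cbind(one_permutation_run(explainer), permutation = b)   # long format
}))

res <- aggregate(dropout_loss ~ variable + label, data = raw, FUN = mean)

attr(res, "B") <- B
if (B > 1) attr(res, "raw_permutations") <- raw
class(res) <- c("feature_importance_explainer", "data.frame")
```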

@tkonopka
Contributor Author

tkonopka commented Jul 4, 2019

Hadn't considered attributes. That works, thanks. Some code on this is now in a fork, branch "fi_permutations".

The function already has an argument n_sample to downsample rows in the dataset. That can coordinate in several ways with permutation of the feature values. Right now, the code in the branch downsamples once and then performs many permutations on that single smaller dataset. But it might be more useful to downsample at each round. In that case, there is a choice between "first subsample rows, then permute values in features" or "first permute values in features, then subsample rows". There is also a choice between sampling with replacement or without. In many cases the differences will be minor, but do you have a preference?
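
To make the two orderings explicit, a schematic sketch (permute_feature() stands in for shuffling a single column; data, variable and n_sample are placeholders):

```r
# schematic only; permute_feature() is not real code from the branch
# (a) subsample rows first, then permute the feature within the subsample
rows  <- sample(nrow(data), n_sample)        # or sample(..., replace = TRUE)
sub_a <- permute_feature(data[rows, ], variable)

# (b) permute the feature in the full data first, then subsample rows
perm  <- permute_feature(data, variable)
sub_b <- perm[sample(nrow(perm), n_sample), ]
```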

@tkonopka
Contributor Author

tkonopka commented Jul 7, 2019

I updated the fork and it can now run multiple permutations of feature values as well as subsample the dataset. The current implementation is to subsample, permute the feature values, then repeat the whole procedure B times.

Is it OK to send you a pull request with this?
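
For reference, from the user's perspective a call against the branch might look roughly like this (the explainer object is a placeholder built beforehand, e.g. with DALEX::explain(); exact defaults may differ):

```r
# assumes the interface discussed above; names and defaults are placeholders
library("ingredients")

fi <- feature_importance(explainer, B = 10, n_sample = 1000)
head(fi)                          # averaged dropout_loss per variable
attr(fi, "B")                     # number of permutations used
attr(fi, "raw_permutations")      # per-permutation results, if kept
plot(fi)
```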

@pbiecek
Member

pbiecek commented Jul 7, 2019

Thanks, it looks great.
I would just add two things:

An argument keep_raw_permutations = TRUE. If someone turns it to FALSE, or if B = 1, there is no need to keep raw_permutations.

The raw_permutations attribute is missing a _label_ column. It would be useful for the plot.

@tkonopka
Contributor Author

tkonopka commented Jul 8, 2019

Thanks. I added the label and the additional argument. I set the default to NULL, though, to allow different behavior for B=1 and B>1.
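
For the record, the NULL default could be resolved along these lines (a sketch, not necessarily the exact code in the branch):

```r
# sketch: keep the raw permutations by default only when B > 1
if (is.null(keep_raw_permutations)) {
  keep_raw_permutations <- (B > 1)
}
if (keep_raw_permutations) {
  attr(res, "raw_permutations") <- raw
}
```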

@pbiecek
Member

pbiecek commented Jul 8, 2019

looks great, thanks!

@pbiecek closed this as completed in e6c8d18 Jul 8, 2019