feature_importance with more permutations #29

Closed
tkonopka opened this issue Jul 1, 2019 · 8 comments
Labels
feature 💡 New feature or request

Comments

@tkonopka
Contributor

tkonopka commented Jul 1, 2019

Hi ModelOriented.

Very useful collection of tools in this package ecosystem. Thank you.

I came across ingredients because of the feature_importance function. It works well based on a single permutation, but the variability between runs is sometimes noticeable on small datasets. For example, runs on the Titanic dataset can disagree on the importance ordering of the second- and third-best features.

Would you be interested in including a new argument to set the number of permutations in feature_importance? The function could output the average dropout loss over those permutations. Returning averages would be compatible with the existing output format and hence with the rest of the package, for example, plots. I can send a pull request in this direction.
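
As a minimal sketch of the idea (loss_after_permutation() here is a hypothetical helper that shuffles one feature and recomputes the loss; it is not part of the package):

```r
# sketch: average the dropout loss over several permutations of one feature
# loss_after_permutation() is a hypothetical helper, not an ingredients function
average_dropout_loss <- function(explainer, variable, n_permutations = 10) {
  losses <- vapply(seq_len(n_permutations),
                   function(i) loss_after_permutation(explainer, variable),
                   numeric(1))
  mean(losses)
}
```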

@pbiecek added the feature 💡 New feature or request and help wanted labels Jul 2, 2019
@pbiecek
Member

pbiecek commented Jul 2, 2019

Good idea! Both the average and the uncertainty would be useful. From B runs one can get the min importance, max importance, 1st quartile, 3rd quartile, and average. Then on the plot we can add some information related to uncertainty.

@tkonopka
Contributor Author

tkonopka commented Jul 2, 2019

Implementing the average is straightforward and would preserve the format of the output object. For the uncertainty, the output format would have to change. Do you have a preference?

Some options:

  • Extend the data frame horizontally with new columns dropout_loss_[q0,q1,q2,q3,q4] for quartiles; q0 is min, q2 is median, q4 is max (or similar names). Pros: Format remains quite compact. Cons: some columns will repeat the same data when a user asks for one permutation; the summaries are fixed at quartiles.

  • Extend the data frame horizontally and report all permutations in columns dropout_loss_[1:n]. Pros: it is clear how many permutations are computed. Cons: summaries must be computed later in a plot/summary function; output can become bulky (but a custom print might help).

  • Report the output in a composite object with a simple data frame as $summary and the complete permutation results in $permutations. A print or summary generic can display $summary, and other functions can use the rest (see the sketch after this list). Pros: elegant. Cons: summaries like the quartiles must be computed later in plot/summary functions; adjustments are needed to downstream code, e.g. plot functions.

  • Just to keep this on the table: avoid uncertainties. The new feature can provide an average with a one-liner; motivated users can assemble summaries by running the function multiple times.
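
To make the third option concrete, a rough sketch of what such a composite object could look like (the class name, field names, and values below are purely illustrative):

```r
# sketch of a composite result for the third option (illustrative names and values)
result <- list(
  summary = data.frame(variable = c("fare", "age"),
                       dropout_loss = c(0.21, 0.18),
                       label = "rf_model"),
  permutations = data.frame(variable = rep(c("fare", "age"), times = 3),
                            permutation = rep(1:3, each = 2),
                            dropout_loss = runif(6, min = 0.15, max = 0.25),
                            label = "rf_model")
)
class(result) <- "feature_importance_composite"   # made-up class name

# a print method could show only the summary table
print.feature_importance_composite <- function(x, ...) {
  print(x$summary)
  invisible(x)
}
```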

@pbiecek
Member

pbiecek commented Jul 3, 2019

Good points.
The third option looks like the most flexible solution, with maybe just two small suggestions.

I would like to keep backward compatibility for existing functions, as there are other packages that may depend on the feature_importance function (like modelDown).
And I would keep the additional information in object attributes rather than in lists (even if this sometimes causes problems with dplyr). Not everybody likes attributes, but it is consistent with other packages like iBreakDown.

So, let's assume that there is an additional argument B (the number of replications) after n_sample.
In the output (a data.frame with variable/dropout_loss/label) I would report the average importance over the B replications.
In an additional attribute attr(., "B") we may keep B.
In another attribute attr(., "raw_permutations") we may keep a data frame with the dropout losses from each permutation, probably in the long format, which is more common in the tidyverse, with the particular permutations in consecutive rows (in a similar spirit to this: https://github.com/ModelOriented/iBreakDown/blob/master/R/local_attributions.R#L164).

Then the plot function for feature_importance_explainer could have an additional argument plot_range (or similar). If it is TRUE and B > 1, whiskers may be added to the plot.

With the default B = 1, old scripts will work without any change, and for B > 1 one may see the averages or add additional information about the range.
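
A rough sketch of how the output could be assembled under this design (one_permutation_run() is a hypothetical helper returning a data.frame with variable/dropout_loss/label for a single permutation; the exact column names are assumptions):

```r
# sketch: averages in the main data frame, raw results kept as attributes
# one_permutation_run() is a hypothetical helper, not code from the package
raw <- do.call(rbind, lapply(seq_len(B), function(b) {
  cbind(one_permutation_run(explainer), permutation = b)   # long format
}))

res <- aggregate(dropout_loss ~ variable + label, data = raw, FUN = mean)

attr(res, "B") <- B
if (B > 1) attr(res, "raw_permutations") <- raw
class(res) <- c("feature_importance_explainer", "data.frame")
```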

@tkonopka
Contributor Author

tkonopka commented Jul 4, 2019

Hadn't considered attributes. That works, thanks. Some code on this is now in a fork, branch "fi_permutations".

The function already has an argument n_sample to downsample rows in the dataset. That can coordinate in several ways with permutation of the feature values. Right now, the code in the branch downsamples once and then performs many permutations on that single smaller dataset. But it might be more useful to downsample at each round. In that case, there is a choice between "first subsample rows, then permute values in features" or "first permute values in features, then subsample rows". There is also a choice between sampling with replacement or without. In many cases the differences will be minor, but do you have a preference?
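
To make the two orderings explicit, a schematic sketch (permute_feature() stands in for shuffling a single column; data, variable and n_sample are placeholders):

```r
# schematic only; permute_feature() is not real code from the branch
# (a) subsample rows first, then permute the feature within the subsample
rows  <- sample(nrow(data), n_sample)        # or sample(..., replace = TRUE)
sub_a <- permute_feature(data[rows, ], variable)

# (b) permute the feature in the full data first, then subsample rows
perm  <- permute_feature(data, variable)
sub_b <- perm[sample(nrow(perm), n_sample), ]
```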

@tkonopka
Contributor Author

tkonopka commented Jul 7, 2019

I updated the fork and it can now run multiple permutations of feature values as well as subsample the dataset. The current implementation is to subsample, permute the feature values, then repeat the whole procedure B times.

Is it OK to send you a pull request with this?
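
For reference, from the user's perspective a call against the branch might look roughly like this (the explainer object is a placeholder built beforehand, e.g. with DALEX::explain(); exact defaults may differ):

```r
# assumes the interface discussed above; names and defaults are placeholders
library("ingredients")

fi <- feature_importance(explainer, B = 10, n_sample = 1000)
head(fi)                          # averaged dropout_loss per variable
attr(fi, "B")                     # number of permutations used
attr(fi, "raw_permutations")      # per-permutation results, if kept
plot(fi)
```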

@pbiecek
Member

pbiecek commented Jul 7, 2019

Thanks, it looks great.
I would just add two things:

An argument keep_raw_permutations = TRUE. If someone turns it to FALSE, or if B = 1, there is no need to keep raw_permutations.

The raw_permutations attribute is missing a _label_ column. It would be useful for the plot.

@tkonopka
Contributor Author

tkonopka commented Jul 8, 2019

Thanks. I added the label and the additional argument. I set the default to NULL, though, to allow different behavior for B=1 and B>1.
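
For the record, the NULL default could be resolved along these lines (a sketch, not necessarily the exact code in the branch):

```r
# sketch: keep the raw permutations by default only when B > 1
if (is.null(keep_raw_permutations)) {
  keep_raw_permutations <- (B > 1)
}
if (keep_raw_permutations) {
  attr(res, "raw_permutations") <- raw
}
```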

@pbiecek
Member

pbiecek commented Jul 8, 2019

looks great, thanks!

@pbiecek closed this as completed in e6c8d18 Jul 8, 2019