Want to contribute - Latest updated source? #318

Open
jonlachmann opened this issue Nov 24, 2022 · 9 comments
Comments

@jonlachmann

Hi!

I am interested in adding support for some new model classes, and saw the note in the readme saying that there is a lot of ongoing refactoring. I was wondering whether there is a branch it is preferred to develop against, to avoid having to refactor more than necessary once the ongoing refactoring is finished. I'm not really sure this fits as an issue, but it seemed the least inappropriate place to put the question.

Anyway, thanks for a well-written R package!

@martinju
Member

Hi, and thanks for reaching out and for your willingness to contribute.

If you want to contribute now, you should do so to the branch named "devel".
The plan is to make that branch the master within this year.

Currently, the changes are not well documented, but at least the basic readme example and most of the vignette are up to date in the new branch. For your particular case of adding new model classes: each model class now lives in its own file, see e.g. https://github.com/NorskRegnesentral/shapr/blob/devel/R/model_ranger.R
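
For concreteness, here is a rough sketch of what such a model-class file boils down to. The class name "mymodel" and its fields are hypothetical, and the generics follow the CRAN version (predict_model()/get_model_specs()), so the devel branch may differ slightly:

```r
# Hypothetical model class "mymodel"; generic names follow the CRAN version of
# shapr and may differ slightly in the devel branch.

predict_model.mymodel <- function(x, newdata) {
  # Must return one numeric prediction per row of newdata.
  # Replace this with the model class' own prediction routine.
  as.numeric(predict(x, newdata = as.data.frame(newdata)))
}

get_model_specs.mymodel <- function(x) {
  # Feature information shapr uses to check consistency between the model,
  # x_train and x_explain. "x$feature_names" is a made-up field.
  list(
    labels = x$feature_names,
    classes = setNames(rep("numeric", length(x$feature_names)), x$feature_names),
    factor_levels = setNames(vector("list", length(x$feature_names)), x$feature_names)
  )
}
```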

@jonlachmann
Author

Hi, thanks for clarifying which branch to work against!

My interest is in working with time series models, to explain the predictions at each step into the future. For now I have managed to do so by running the explanation function once for each future value. Do you have any suggestions on how to pass more complex x_explain values? What I would want is a "steps ahead" parameter to the predict function, which would give me, say, 3 explanations, one for each step into the future, using the same x_explain vector. The way I have solved it for now is to set this parameter on the model before passing it to the explain function, but it is a bit inefficient having to run the function once for each step.
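
Roughly, the workaround looks like this ("steps_ahead", "fitted_model", "p0" and the explain() argument layout are just illustrative; my prediction wrapper reads steps_ahead off the model object):

```r
library(shapr)

# Sketch of the current per-step workaround; "steps_ahead" is a field my own
# prediction wrapper reads from the model object, not anything shapr provides.
explanations <- lapply(1:3, function(h) {
  model_h <- fitted_model        # the fitted time series model
  model_h$steps_ahead <- h       # the wrapper returns the h-step-ahead forecast
  explain(
    model = model_h,
    x_explain = x_explain,       # the same x_explain for every horizon
    x_train = x_train,
    approach = "empirical",
    prediction_zero = p0
  )
})
```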

What would be even neater is to have the predict function return a vector containing multiple values that are all explained separately. That more or less implies supporting models with multiple outputs, which I think is a quite general case that would add a lot of usability to the package. Do you see any general problems with this train of thought?

@martinju
Member

Interesting application!
Currently, the only way to achieve this is the way you did it, which is clearly inefficient. Technically, I think it would be problematic to support multiple outputs, and I am not sure the usefulness outweighs the technical burden of supporting multiple-output prediction models directly. Please convince me if you disagree :-)

I see two possible ways to bypass the problem in your (and similar) cases:

  1. Allow passing of additional arguments to the prediction function on a per-row-in-x_explain basis.
  2. Passing the output of explain() as an argument to a new call of explain() with a new x_explain, reusing different parts of the previous explain object. [Note that the new version of the package merges what was previously done in shapr() into explain().] This way you could explain your lag predictions iteratively, like you are doing today, without the efficiency loss. This approach would also allow reducing the uncertainty of existing explanations by adding additional samples in the Monte Carlo integration of v(S), or additional subsets S in the kernelSHAP approximation, should that be desired.

The plan is to support point 2, although I have not decided exactly how to do it yet, and it will not be ready for the release this year. Which solution would you prefer?
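
Purely to illustrate point 2 (nothing like this exists yet, and the argument name below is made up), the interface could look something like:

```r
# Hypothetical only: "previous_explanation" is not an existing argument, and the
# final design may differ. The idea is that the second call could reuse e.g. the
# sampled subsets S and the Monte Carlo samples from the first call.
expl_h1 <- explain(model, x_explain = x_explain_h1, x_train = x_train,
                   approach = "empirical", prediction_zero = p0)

expl_h2 <- explain(model, x_explain = x_explain_h2, x_train = x_train,
                   approach = "empirical", prediction_zero = p0,
                   previous_explanation = expl_h1)  # made-up argument
```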

PS: In the devel version we have implemented a new approach for time series prediction explanation (see approach='timeseries' and an example in the new vignette).
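
Roughly along these lines (the group names and sizes are illustrative, and argument names may differ slightly from the vignette):

```r
# Sketch of the new timeseries approach: the columns of x_explain/x_train are
# the time points, and groups of consecutive time points are explained jointly.
expl_ts <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "timeseries",
  prediction_zero = p0,
  group = list(
    early  = paste0("t", 1:8),
    middle = paste0("t", 9:16),
    late   = paste0("t", 17:24)
  )
)
```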

@jonlachmann
Author

Yes, there are a lot of new things to take into consideration when explaining a forecast, for example which value should be used as prediction_zero. I have a basic test case for a VAR model working as of now, just trying things out to see what gives results that make sense. I will try to create a gist or something once it is in a more presentable form, to allow for more specific discussion, if that would be of interest.

I will have to look into the code for the different approaches to see how multiple outputs could be done; it may very well be that you are correct that it is quite hard to get working.

If point 1 were supported, handling the multiple steps would be as easy as giving the step to forecast as an additional argument alongside the observations. I am, however, not sure whether this would help computational performance.
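
To illustrate what I have in mind for point 1 (everything here is hypothetical: the per-row "step" argument, the predict_args idea in the commented call, and the forecast_row() placeholder):

```r
# Hypothetical illustration of point 1: the prediction function receives an
# extra per-row argument "step" selecting the forecast horizon.
predict_model.myforecast <- function(x, newdata, step) {
  vapply(seq_len(nrow(newdata)), function(i) {
    # forecast_row() is a placeholder for the model's own forecasting routine,
    # returning the step[i]-ahead prediction for row i
    forecast_row(x, newdata[i, , drop = FALSE], h = step[i])
  }, numeric(1))
}

# Imagined call, with the per-row arguments supplied alongside x_explain:
# explain(model, x_explain, x_train, approach = "empirical",
#         prediction_zero = p0, predict_args = list(step = c(1, 2, 3)))
```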

I will try to investigate the code a bit more closely to be able to give more informed answers! I will also look at the timeseries approach; is it ready to be used, or still under development?

@martinju
Member

Regarding multiple outputs, I am mostly worried about the bookkeeping. I am happy to discuss solutions if you give it a go.

From my perspective, points 1 and 2 would both be as computationally efficient as supporting multiple outputs directly; they would just be slightly less user friendly.

The timeseries approach should be ready for use. It is the simple method introduced in Section 4.3 of this paper: https://martinjullum.com/publication/jullum-2021-efficient/jullum-2021-efficient.pdf

You may also be interested in a recent paper which discusses similar (and more advanced?) methods for the same use case: https://arxiv.org/ftp/arxiv/papers/2211/2211.06507.pdf (I have not read it properly myself.)

@jonlachmann
Author

Thanks for the links to the papers; I will have a read when I get time. I have started some work on the code and made a small PR which I am not completely sure is useful, but I thought I should start somewhere. Let me know if you have any comments.

@jonlachmann
Author

I did manage to implement multiple outputs, and also got to know the package and the methods a bit. If you have a look here, https://github.com/jonlachmann/shapr/tree/output_size, you can maybe see whether I did something stupid or whether it seems reasonable.

@martinju
Member

Great! I just had a quick glance and couldn't spot any obvious issues. Could you upload a basic example script as well?

@jonlachmann
Author

Great to hear! I am not really used to working with data.table, so I learned a bit while getting it to work. I usually just let everything be a matrix, but I can see the benefits in this case, where more complicated selections and aggregations are needed.

I formatted and commented my test example and created a gist here: https://gist.github.com/jonlachmann/47bcfc4b1a072debd7278547992cd1a6. It should work provided you check out my branch and run it from inside that folder. Let me know if you get it to run, or if you have any questions.
