Restructure explain() for iterative estimation with convergence detection ++ #396

Draft: wants to merge 17 commits into master

Conversation

@martinju (Member) commented Jun 7, 2024

Very early draft. Lots of cleanup and reorganization remain, but the overall structure will probably be close to what we have here.

To be done in this PR (some items may be dropped and handled in separate PRs):

  • Add iterative KernelSHAP estimation: sample a set of coalitions at a time and estimate the Shapley values, iterating until a convergence threshold based on the estimated standard deviation is met (see the first sketch after this list).
    • for features
    • for groups
  • Add a bootstrapping approach for estimating the standard deviation (included in the first sketch below)
    • for features
    • for groups
  • Add paired sampling of coalitions (see the paired-sampling sketch below)
    • for features
    • for groups
  • Add reweighting of the sampled Shapley kernel weights
  • Restructure non-iterative (classical) approach as a special case of the iterative approach for code simplicity
  • Rewrite the batch computation so batches are computed within each iteration
  • Remove all traces of iterative arguments placed outside of the iter_list
  • Consider moving from n_batches to something like max_batch_size + min_n_batches as input arguments. The former controls the memory allocation (smaller means less memory consumption), while the latter ensures we get at least some progress updates during the computation. An internal function could then set the number of batches at each iteration to min(n_combinations, max(min_n_batches, ceiling(n_combinations / max_batch_size))) (see the batching sketch below).
  • Remove the stop() calls based on the number of features.
  • Add parallelization of the bootstrapping function (see the parallel-bootstrap sketch below)
  • Check that the new code structure works with parallelization (via future.apply::future_lapply)
  • Add a verbose argument: verbose = c("basic", "shapley", "vS_details"), with "basic" as the default, showing what is currently going on in the function, the file name of the temp file, and which iteration we are at (and, later, an estimate of the remaining computation time). NULL or "" should give no printout at all; "shapley" means printing intermediate Shapley value estimates; "vS_details" means printing results while estimating the vS functions (where this is done in more than a single step). See the verbose sketch below.
  • Add save_intermediate = c("to_disk", "in_memory") to allow the user to save results to file while doing the estimation. One may choose to save everything to disk, or just what is needed to continue adding coalitions if we are not yet happy with the results (see the save_intermediate sketch below).
  • Add tests for the iterative approach
  • Set new defaults in explain(): paired_sampling = TRUE, shapley_reweighting = "on_N" ++
  • Carefully update all test checking files in at least three steps: 1. ensure the classical Shapley estimates give the same answers (keeping the same order of setup_approach and shapley_setup), 2. update the Shapley values with the new ordering, 3. update the internal objects
  • Add new argument which is a list of iterative-specific arguments.
  • Update documentation
  • Update vignette
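
To make the first two items above concrete, here is a minimal, self-contained sketch of the iterative KernelSHAP loop with a bootstrap-based convergence check. Everything in it (the toy value function, the helpers `new_batch()` and `estimate_shapley()`, and the thresholds) is illustrative, not the actual shapr implementation.

```r
# Toy setup: an additive value function, so the exact Shapley values equal beta
set.seed(1)
p    <- 5
beta <- c(2, -1, 0.5, 3, -2)
v    <- function(S) sum(beta[S]) + rnorm(1, sd = 0.3)  # noisy v(S) evaluation

# Standard Shapley kernel weight for a coalition of size s
kernel_weight <- function(s) (p - 1) / (choose(p, s) * s * (p - s))

new_batch <- function(n) {
  lapply(seq_len(n), function(i) {
    S <- sample(p, sample(seq_len(p - 1), 1))  # random non-trivial coalition
    list(S = S, y = v(S))                      # evaluate v(S) once and store it
  })
}

estimate_shapley <- function(rows) {
  # Kernel-weighted least squares on coalition membership indicators
  Z <- t(vapply(rows, function(r) as.numeric(seq_len(p) %in% r$S), numeric(p)))
  y <- vapply(rows, function(r) r$y, numeric(1))
  w <- vapply(rows, function(r) kernel_weight(length(r$S)), numeric(1))
  coef(lm(y ~ Z - 1, weights = w))  # no intercept: v(empty set) = 0 in this toy
}

rows <- list()
repeat {
  # Sample a new batch of coalitions and pool it with the earlier ones
  rows <- c(rows, new_batch(50))
  phi  <- estimate_shapley(rows)

  # Bootstrap sd of the estimates by resampling the sampled coalitions
  boot   <- replicate(100, estimate_shapley(sample(rows, replace = TRUE)))
  sd_hat <- apply(boot, 1, sd)

  # Converged once every estimate's uncertainty drops below the threshold
  if (max(sd_hat) < 0.05 || length(rows) >= 2000) break
}
phi  # should be close to beta
```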
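Paired sampling can be sketched as follows: each sampled coalition is stored together with its complement, a standard variance-reduction trick for KernelSHAP. The function name and signature here are hypothetical.

```r
sample_paired_coalitions <- function(p, n_pairs) {
  pairs <- lapply(seq_len(n_pairs), function(i) {
    S <- sample(p, sample(seq_len(p - 1), 1))  # random non-trivial coalition S
    list(S, setdiff(seq_len(p), S))            # ...paired with its complement
  })
  unlist(pairs, recursive = FALSE)             # flatten to a list of coalitions
}

sample_paired_coalitions(p = 5, n_pairs = 3)
```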
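The proposed max_batch_size + min_n_batches rule translates directly into a small helper. The argument names follow the bullet above, but the function itself is only a sketch.

```r
get_n_batches <- function(n_combinations, max_batch_size, min_n_batches) {
  # max_batch_size bounds memory use (smaller = less memory per batch), while
  # min_n_batches guarantees a minimum number of progress updates per iteration
  min(n_combinations, max(min_n_batches, ceiling(n_combinations / max_batch_size)))
}

get_n_batches(n_combinations = 1000, max_batch_size = 100, min_n_batches = 10)  # 10
```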
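Parallelizing the bootstrap replicates could look roughly like this with future.apply (the future_lapply() referred to above). It reuses rows and estimate_shapley() from the first sketch; the worker count and overall structure are assumptions.

```r
library(future.apply)
plan(multisession, workers = 4)  # run bootstrap replicates on 4 local workers

boot <- future_lapply(seq_len(100), function(b) {
  estimate_shapley(sample(rows, replace = TRUE))  # one bootstrap replicate
}, future.seed = TRUE)  # future.seed = TRUE gives statistically sound parallel RNG

sd_hat <- apply(do.call(cbind, boot), 1, sd)
```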
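One hypothetical way to dispatch the proposed verbose levels; the level names come from the bullet above, while the helper say() is made up for illustration.

```r
say <- function(msg, level, verbose = "basic") {
  # NULL or "" means no printout at all
  if (is.null(verbose) || identical(verbose, "")) return(invisible(NULL))
  if (level %in% verbose) message(msg)
}

say("Starting iteration 3 of 20",  level = "basic",      verbose = c("basic", "shapley"))  # printed
say("Intermediate phi: 2.1, -0.9", level = "shapley",    verbose = c("basic", "shapley"))  # printed
say("vS batch 1/4 done",           level = "vS_details", verbose = "basic")                # silent
```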
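Finally, a rough illustration of save_intermediate = "to_disk": persist just enough state to resume adding coalitions later. The object names reuse the first sketch, and the file layout is purely an assumption.

```r
# Save the sampled coalitions plus current estimates to a temp file
state_file <- tempfile(fileext = ".rds")
saveRDS(list(rows = rows, phi = phi, sd_hat = sd_hat), state_file)

# Later session: reload and keep iterating from where we left off
state <- readRDS(state_file)
rows  <- c(state$rows, new_batch(50))
```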

Note: All non-exact methods fail now (including the Shapley value estimates) since shapley_setup is now called after setup_approach. All tests for Shapley values pass if these calls are put back in the original order (but we don't want that going forward).
