Treatment of categorical features in potential_interactions(): suggestion to use R squared instead of squared correlation #119
Comments
Very good idea, thanks a lot for the suggestion and the implementation. I will add it almost unchanged (X is always a data.frame, so sapply/vapply will always work):

```r
r_sq1 <- function(s, x) {
  tryCatch(
    summary(stats::lm(s ~ x))[["r.squared"]],
    error = function(e) NA
  )
}

r_sq <- function(s, x) {
  suppressWarnings(
    vapply(x, FUN = r_sq1, FUN.VALUE = numeric(1L), s = s, USE.NAMES = FALSE)
  )
}

r_sq(1:2, data.frame(x = c("A", "A"), y = c("A", "B")))
```

Without tryCatch(), the above example would fail. If you want to attempt a PR, that would be even better. You would need to add the change to the NEWS file.
One addition: we should probably work with adjusted R-squared to get a fairer selection. The bins (in x) can be relatively small, and then it seems unfair to invest 1 df in the non-factors but >1 df in the factors. See the example:

```r
r_sq1 <- function(s, x) {
  tryCatch(
    summary(stats::lm(s ~ x))[["adj.r.squared"]],
    error = function(e) NA
  )
}

r_sq <- function(s, x) {
  suppressWarnings(
    vapply(x, FUN = r_sq1, FUN.VALUE = numeric(1L), s = s, USE.NAMES = FALSE)
  )
}

r_sq(1:2, data.frame(x = c("A", "A"), y = c("A", "B")))      # NA NaN
r_sq(1:3, data.frame(x = c(1, 2, 2), y = c("A", "B", "B")))  # 0.5 0.5
```

It will also have an impact on the numeric features (but not on their order).
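For reference, the adjustment here is the standard textbook formula; a quick check of the 0.5 values in the example above (plain formula, not package code):

```r
# Adjusted R-squared from unadjusted R-squared, sample size n, and
# p regression terms (excluding the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Three observations, one regression term, unadjusted R^2 of 0.75:
adj_r2(0.75, n = 3, p = 1)  # 0.5
```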
Yeah, I was thinking the same, but wanted to illustrate the equivalence with what you're doing now for numerical features first, before addressing that imbalance in degrees of freedom. I think it would be even better if we move away from a correlation / R squared measure to something that is comparable across different […]:

```r
# Complicated case: we need to rely on a heuristic based on modelled variation
mod_var1 <- function(s, x) {
  tryCatch(
    mean(abs(stats::fitted(stats::lm(s ~ x)) - mean(s))),
    error = function(e) NA
  )
}

mod_var <- function(s, x) {
  suppressWarnings(
    vapply(x, FUN = mod_var1, FUN.VALUE = numeric(1L), s = s, USE.NAMES = FALSE)
  )
}
```
I think moving to the model sum of squares, corrected for the residual degrees of freedom, would be the best solution.
Or taking the square root of that if we want to bring it to the scale of the predictions and more in line with SHAP interaction values.
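A sketch of that idea (hypothetical helper names, mirroring the per-feature scorers above; not the final implementation):

```r
# Model sum of squares divided by the residual degrees of freedom,
# as a df-corrected measure of association (illustrative sketch only)
ss_model1 <- function(s, x) {
  tryCatch({
    fit <- stats::lm(s ~ x)
    ss_mod <- sum((stats::fitted(fit) - mean(s))^2)
    ss_mod / stats::df.residual(fit)
  }, error = function(e) NA_real_)
}

# Square root brings the measure to the scale of the SHAP values
ss_model1_root <- function(s, x) sqrt(ss_model1(s, x))

ss_model1(1:3, c(1, 2, 2))  # 1.5
```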
Thanks for the suggestions. I need to think about how this plays with the weighting regime over bins. For instance, we could, alternatively, use something like […]
I've thought about it a bit more and created a pull request with my suggestion. I believe a good heuristic would be to look at the difference between the mean squared error (MSE) of the SHAP values of […]. Here's an example using the new function: […]
Thinking out loud:
Weighted averages of (adjusted) R-squared (or the current Pearson correlation) across the bins are not appropriate, I believe, as they do not take the amount of variation in the SHAP values within each bin into account (your point 2). In the proposed new heuristic, we'd look in each bin at […], which I'd call the explained amount of variability (the difference between the MSE of a null model and the MSE of the model regressed on […]), or the average squared linear SHAP value of that model. On this scale, aggregating by taking a weighted average over the bins makes sense in my opinion. Finally, taking the square root puts it back on the scale of the SHAP values and makes it comparable to the sum of the absolute values of the exact SHAP interaction values; see my numerical example in a previous comment.
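The per-bin aggregation described here can be sketched as follows (illustrative helper names and bin handling, not the shapviz implementation):

```r
# Explained variability in one bin: MSE of a null model (intercept only)
# minus MSE of a linear model of the SHAP values s on the feature x
exp_var1 <- function(s, x) {
  tryCatch({
    fit <- stats::lm(s ~ x)
    mean((s - mean(s))^2) - mean(stats::residuals(fit)^2)
  }, error = function(e) NA_real_)
}

# Weighted average over bins, then square root to return to the SHAP scale
heuristic <- function(s, x, bins) {
  parts <- split(data.frame(s = s, x = x), bins)
  per_bin <- vapply(parts, function(d) exp_var1(d$s, d$x), numeric(1))
  w <- vapply(parts, nrow, integer(1)) / length(s)
  sqrt(sum(w * per_bin, na.rm = TRUE))
}

# Perfect linear relationship in two bins of two observations each:
heuristic(c(1, 2, 1, 2), c(1, 2, 1, 2), factor(c(1, 1, 2, 2)))  # 0.5
```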
Just like that.
After thinking about this again, switching to a completely different regime is too large a change. Additionally, people often use factor variables with many categories, leading to perfect alignment within each x bin (both on the color feature and the x feature). We can, however, think about how to parametrize the logic in potential_interactions().
In all cases, we would switch to adjusted R-squared, using the formula in […]. ping @RoelVerbelen
Partly implemented in https://github.com/ModelOriented/shapviz/pulls. The status is like this: […]
Hi @mayer79, apologies for my late reply and thanks for adding so much flexibility to the function! Exploring the new arguments, I've confirmed that the setting replicating my proposal above is this: […]

And I agree limiting […] makes sense. My only remaining suggestion for your consideration would be for […]. Thanks for implementing!
Nice! We need to play with these arguments. Lumping small categories is a good idea, actually for both the x variable and the color variable, right?
Yes, for both: to avoid splitting the data into too many bins, and to avoid regressing on too many factor levels.
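A minimal base-R sketch of what lumping could look like (hypothetical helper; the eventual shapviz argument names and defaults may differ):

```r
# Collapse levels that occur fewer than min_count times into one "Other" level
lump_rare <- function(f, min_count = 5L, other = "Other") {
  f <- factor(f)
  keep <- names(which(table(f) >= min_count))
  # Assigning the same name to several levels merges them
  levels(f)[!(levels(f) %in% keep)] <- other
  f
}

lump_rare(c("a", "a", "a", "b", "c"), min_count = 2L)
# levels "b" and "c" are merged into "Other"
```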
Thanks for providing a great SHAP visualisation package for R!
I'm looking into fast ways to surface interaction effects in H2O GBMs. Unfortunately, unlike xgboost, H2O does not provide SHAP interaction values, and hence shapviz relies on a heuristic in its potential_interactions() implementation, based on the weighted squared Pearson correlation between the SHAP values and other features' values. I think that's a reasonable approach, but it doesn't work well for unordered categorical features (where it converts them to their arbitrarily ordered factor level numbers using data.matrix()).

A natural extension of what you are doing now, which I believe would be more appropriate for categorical features, would be to consider the R squared of a linear regression model of the SHAP values on each of the other features. For continuous features, that would give you the exact same value you have now. For categorical features, it would measure the association between the unordered factor levels and the SHAP values in a way that's not constrained by the arbitrary factor level numbering.
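The claimed equivalence for continuous features is easy to verify numerically (a quick sanity check with simulated data, not package code):

```r
set.seed(1)
s <- rnorm(100)  # stands in for a column of SHAP values
x <- rnorm(100)  # a continuous feature

# R squared of a simple linear regression vs. squared Pearson correlation
r2_lm  <- summary(stats::lm(s ~ x))[["r.squared"]]
r2_cor <- stats::cor(s, x)^2

all.equal(r2_lm, r2_cor)  # TRUE: identical for a single continuous feature
```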
If you want to implement that, lines 230-233 would have to be replaced by: […]

Here's a full example using a public H2O data set: […]
Created on 2023-10-24 with reprex v2.0.2