---
title: "`shapr`: Explaining individual machine learning predictions with Shapley values"
author: "Camilla Lingjærde, Martin Jullum, Lars Henry Berge Olsen & Nikolai Sellereite"
output: rmarkdown::html_vignette
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{`shapr`: Explaining individual machine learning predictions with Shapley values}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  markdown:
    wrap: 72
---
> [Introduction](#intro)
> [Overview of Package](#overview)
> [The Kernel SHAP Method](#KSHAP)
> [Examples](#ex)
> [Advanced usage](#advanced)
> [Scalability and efficiency](#scalability)
> [Comparison to Lundberg & Lee's implementation](#compare)
<a id="intro"></a>
# Introduction
The `shapr` package implements an extended version of the Kernel SHAP
method for approximating Shapley values (@lundberg2017unified), in which
dependence between the features is taken into account
(@aas2019explaining). Estimation of Shapley values is of interest when
attempting to explain complex machine learning models. Among existing
work on interpreting individual predictions, Shapley values are regarded
as the only model-agnostic explanation method with a solid theoretical
foundation (@lundberg2017unified). Kernel SHAP is a computationally
efficient approximation to Shapley values in higher dimensions, but it
assumes independent features. @aas2019explaining extend the Kernel SHAP
method to handle dependent features, resulting in more accurate
approximations to the true Shapley values. See the
[paper](https://www.sciencedirect.com/sdfe/reader/pii/S0004370221000539/pdf)
(@aas2019explaining) for further details.
<a id="overview"></a>
<br>
# Overview of Package
## Functions
Here is an overview of the main functions. You can read their
documentation and see examples with `?function_name`.
| Function Name | Description |
|:---------------------|:-------------------------------------------------|
| `explain` | Computes kernel SHAP values for test data. |
| `explain_forecast` | Analogous to `explain`, but for explaining forecasts from time series models. |
| `plot.shapr` | Plots the individual prediction explanations. Uses the `ggplot2` and `ggbeeswarm` packages. |
: Main functions in the `shapr` package.
<a id="KSHAP"></a>
<br>
# The Kernel SHAP Method
Assume a predictive model $f(\boldsymbol{x})$ for a response value $y$
with features $\boldsymbol{x}\in \mathbb{R}^M$, trained on a training
set, and that we want to explain the predictions for new sets of data.
This may be done using ideas from cooperative game theory, letting a
single prediction take the place of the game being played and the
features the place of the players. Letting $N$ denote the set of all $M$
players, and $S \subseteq N$ be a subset of $|S|$ players, the
"contribution" function $v(S)$ describes the total expected sum of
payoffs the members of $S$ can obtain by cooperation. The Shapley value
(@Shapley53) is one way to distribute the total gains to the players,
assuming that they all collaborate. The amount that player $i$ gets is
then
$$\phi_i(v) = \phi_i = \sum_{S \subseteq N \setminus\{i\}} \frac{|S| ! (M-| S| - 1)!}{M!}(v(S\cup \{i\})-v(S)),$$
that is, a weighted mean over all subsets $S$ of players not containing
player $i$. @lundberg2017unified define the contribution function for a
certain subset $S$ of these features $\boldsymbol{x}_S$ as
$v(S) = \mbox{E}[f(\boldsymbol{x})|\boldsymbol{x}_S]$, the expected
output of the predictive model conditional on the feature values of the
subset. @lundberg2017unified call this type of Shapley value SHAP
(SHapley Additive exPlanation) values. Since the conditional
expectations can be written as
```{=tex}
\begin{equation}
\label{eq:CondExp}
E[f(\boldsymbol{x})|\boldsymbol{x}_S=\boldsymbol{x}_S^*] = E[f(\boldsymbol{x}_{\bar{S}},\boldsymbol{x}_S)|\boldsymbol{x}_S=\boldsymbol{x}_S^*] =
\int f(\boldsymbol{x}_{\bar{S}},\boldsymbol{x}_S^*)\,p(\boldsymbol{x}_{\bar{S}}|\boldsymbol{x}_S=\boldsymbol{x}_S^*)d\boldsymbol{x}_{\bar{S}},
\end{equation}
```
the conditional distributions
$p(\boldsymbol{x}_{\bar{S}}|\boldsymbol{x}_S=\boldsymbol{x}_S^*)$ are
needed to compute the contributions. The Kernel SHAP method of
@lundberg2017unified assumes feature independence, so that
$p(\boldsymbol{x}_{\bar{S}}|\boldsymbol{x}_S=\boldsymbol{x}_S^*)=p(\boldsymbol{x}_{\bar{S}})$.
If samples $\boldsymbol{x}_{\bar{S}}^{k}, k=1,\ldots,K$, from
$p(\boldsymbol{x}_{\bar{S}}|\boldsymbol{x}_S=\boldsymbol{x}_S^*)$ are
available, the above conditional expectation can be approximated by
```{=tex}
\begin{equation}
v_{\text{KerSHAP}}(S) = \frac{1}{K}\sum_{k=1}^K f(\boldsymbol{x}_{\bar{S}}^{k},\boldsymbol{x}_S^*).
\end{equation}
```
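As a hypothetical base-R sketch of this Monte Carlo average (the toy model `f`, the samples, and the conditioning value are placeholders chosen for illustration, not part of `shapr`):

```r
# Toy Monte Carlo estimate of v(S): average the model over samples of the
# unconditioned feature, holding the conditioned feature at its observed value.
f <- function(x1, x2) x1 + 2 * x2  # hypothetical model

set.seed(1)
K <- 1000
x_Sbar_samples <- rnorm(K)  # K samples of the unconditioned feature x1
x_S_star <- 1.5             # observed value of the conditioned feature x2

# v_KerSHAP(S) = (1/K) * sum_k f(x_Sbar^k, x_S^*)
v_S <- mean(f(x_Sbar_samples, x_S_star))
# For this linear f, v_S is close to E[x1] + 2 * 1.5 = 3
```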
In Kernel SHAP, $\boldsymbol{x}_{\bar{S}}^{k}, k=1,\ldots,K$ are sampled
from the $\bar{S}$-part of the training data, *independently* of
$\boldsymbol{x}_{S}$. This is motivated by using the training set as the
empirical distribution of $\boldsymbol{x}_{\bar{S}}$, and assuming that
$\boldsymbol{x}_{\bar{S}}$ is independent of
$\boldsymbol{x}_S=\boldsymbol{x}_S^*$. Due to the independence
assumption, if the features in a given model are highly dependent, the
Kernel SHAP method may give a completely wrong answer. This can be
avoided by estimating the conditional distribution
$p(\boldsymbol{x}_{\bar{S}}|\boldsymbol{x}_S=\boldsymbol{x}_S^*)$
directly and generating samples from this distribution. With this small
change, the contributions and Shapley values may then be approximated as
in the ordinary Kernel SHAP framework. @aas2019explaining propose three
different approaches for estimating the conditional distributions, all
of which are implemented: `empirical`, `gaussian` and `copula`. The package also
implements the `ctree` method from @redelmeier2020explaining, and the
`vaeac` method from @olsen2022using. The original `independence` approach
of @lundberg2017unified is also available. The methods may also be combined,
such that e.g. one method is used when conditioning on a small number of
features, while another method is used otherwise.
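To make the Shapley weighted-mean formula above concrete, here is a hypothetical base-R computation of exact Shapley values for a toy three-player game; the contribution function `v` is hand-picked for illustration and has nothing to do with any real model.

```r
# Exact Shapley values for a hypothetical 3-player cooperative game.
M <- 3
players <- 1:M
v_vals <- c(empty = 0, "1" = 10, "2" = 20, "3" = 30,
            "1,2" = 40, "1,3" = 50, "2,3" = 60, "1,2,3" = 90)
v <- function(S) {
  key <- if (length(S) == 0) "empty" else paste(sort(S), collapse = ",")
  unname(v_vals[key])
}
# All subsets of a vector, including the empty set
subsets_of <- function(x) {
  if (length(x) == 0) return(list(integer(0)))
  grid <- expand.grid(rep(list(c(FALSE, TRUE)), length(x)))
  lapply(seq_len(nrow(grid)), function(r) x[unlist(grid[r, ])])
}
shapley <- sapply(players, function(i) {
  sum(sapply(subsets_of(setdiff(players, i)), function(S) {
    w <- factorial(length(S)) * factorial(M - length(S) - 1) / factorial(M)
    w * (v(c(S, i)) - v(S))
  }))
})
# Efficiency property: the Shapley values sum to v(N) - v(empty set)
```

For this game the computation gives $\phi = (20, 30, 40)$, which indeed sums to $v(N) - v(\emptyset) = 90$.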
The `shapr` package also supports directly estimating the contribution
function using regression. We briefly introduce the regression-based
methods below, but we refer to the separate regression vignette
(Shapley value explanations using the regression paradigm) and
@olsen2024comparative for an in-depth explanation of the regression paradigm.
<a id="gaussian"></a>
## Multivariate Gaussian Distribution Approach
The first approach arises from the assumption that the feature vector
$\boldsymbol{x}$ stems from a multivariate Gaussian distribution with
some mean vector $\boldsymbol{\mu}$ and covariance matrix
$\boldsymbol{\Sigma}$. Under this assumption, the conditional
distribution
$p(\boldsymbol{x}_{\bar{\mathcal{S}}} |\boldsymbol{x}_{\mathcal{S}}=\boldsymbol{x}_{\mathcal{S}}^*)$
is also multivariate Gaussian\
$\text{N}_{|\bar{\mathcal{S}}|}(\boldsymbol{\mu}_{\bar{\mathcal{S}}|\mathcal{S}},\boldsymbol{\Sigma}_{\bar{\mathcal{S}}|\mathcal{S}})$,
with analytical expressions for the conditional mean vector
$\boldsymbol{\mu}_{\bar{\mathcal{S}}|\mathcal{S}}$ and covariance matrix
$\boldsymbol{\Sigma}_{\bar{\mathcal{S}}|\mathcal{S}}$, see
@aas2019explaining for details. Hence, instead of sampling from the
marginal empirical distribution of $\boldsymbol{x}_{\bar{\mathcal{S}}}$
approximated by the training data, we can sample from the Gaussian
conditional distribution, which is fitted using the training data. Using
the resulting samples
$\boldsymbol{x}_{\bar{\mathcal{S}}}^k, k=1,\ldots,K$, the conditional
expectations can be approximated as in Kernel SHAP.
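In the bivariate case, the closed-form conditional mean and variance reduce to the familiar scalar formulas $\mu_{\bar{\mathcal{S}}|\mathcal{S}} = \mu_{\bar{\mathcal{S}}} + \sigma_{\bar{\mathcal{S}}\mathcal{S}}\sigma_{\mathcal{S}\mathcal{S}}^{-1}(x_{\mathcal{S}}^* - \mu_{\mathcal{S}})$ and $\sigma^2_{\bar{\mathcal{S}}|\mathcal{S}} = \sigma_{\bar{\mathcal{S}}\bar{\mathcal{S}}} - \sigma_{\bar{\mathcal{S}}\mathcal{S}}^2\sigma_{\mathcal{S}\mathcal{S}}^{-1}$. A hypothetical base-R sketch (mean vector and covariance matrix hand-specified for illustration):

```r
# Sampling from p(x_Sbar | x_S = x_S_star) under a bivariate Gaussian,
# using the standard closed-form conditional mean and variance.
mu    <- c(0, 0)                      # (mu_Sbar, mu_S), hypothetical values
Sigma <- matrix(c(1, 0.8,
                  0.8, 1), nrow = 2)  # strong dependence between the features
x_S_star <- 1.5                       # observed value of the conditioned feature

mu_cond    <- mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x_S_star - mu[2])  # 1.2
sigma_cond <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]               # 0.36

set.seed(1)
x_Sbar_samples <- rnorm(10000, mean = mu_cond, sd = sqrt(sigma_cond))
```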
<a id="copula"></a>
## Gaussian Copula Approach
If the features are far from multivariate Gaussian, an alternative
approach is to instead represent the marginals by their empirical
distributions, and model the dependence structure by a Gaussian copula.
Assuming a Gaussian copula, we may convert the marginals of the training
data to Gaussian features using their empirical distributions, and then
fit a multivariate Gaussian distribution to these.
To produce samples from the conditional distribution
$p(\boldsymbol{x}_{\bar{\mathcal{S}}} |\boldsymbol{x}_{\mathcal{S}}=\boldsymbol{x}_{\mathcal{S}}^*)$,
we convert the marginals of $\boldsymbol{x}_{\mathcal{S}}$ to Gaussians
and sample from the conditional Gaussian distribution as above. While
other copulas may be used, the Gaussian copula has the benefit that we
may use the analytical expressions for the conditionals
$\boldsymbol{\mu}_{\bar{\mathcal{S}}|\mathcal{S}}$ and
$\boldsymbol{\Sigma}_{\bar{\mathcal{S}}|\mathcal{S}}$. Finally, we
convert the marginals of the samples back to their original
distributions, and use the resulting samples to approximate the
conditional expectations as in Kernel SHAP.
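A hypothetical base-R sketch of the two marginal transforms involved (the data are simulated for illustration; `shapr` handles this internally):

```r
# Gaussian copula marginal transforms: map a non-Gaussian margin to a
# standard Gaussian via its empirical CDF (ranks), and back again via
# the empirical quantile function.
set.seed(1)
x <- rexp(500)  # a clearly non-Gaussian marginal

# Forward transform: approximately standard Gaussian margin
z <- qnorm(rank(x) / (length(x) + 1))

# Backward transform: map new Gaussian draws back to the original margin
z_new <- rnorm(200)
x_new <- quantile(x, probs = pnorm(z_new), names = FALSE, type = 1)
```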
<a id="empirical"></a>
## Empirical Conditional Distribution Approach
If both the dependence structure and the marginal distributions of
$\boldsymbol{x}$ are very far from the Gaussian, neither of the two
aforementioned methods will work very well. Few methods exist for the
non-parametric estimation of conditional densities, and the classic
kernel estimator (@rosenblatt1956) for non-parametric density estimation
suffers greatly from the curse of dimensionality and does not provide a
way to generate samples from the estimated distribution. For such
situations, @aas2019explaining propose an empirical conditional approach
to sample approximately from
$p(\boldsymbol{x}_{\bar{\mathcal{S}}}|\boldsymbol{x}_{\mathcal{S}}^*)$.
The idea is to compute weights
$w_{\mathcal{S}}(\boldsymbol{x}^*,\boldsymbol{x}^i),\ i=1,...,n_{\text{train}}$
for all training instances based on their Mahalanobis distances (in the
$S$ subset only) to the instance $\boldsymbol{x}^*$ to be explained.
Instead of sampling from this weighted (conditional) empirical
distribution, @aas2019explaining suggests a more efficient variant,
using only the $K$ instances with the largest weights:
$$v_{\text{condKerSHAP}}(\mathcal{S}) = \frac{\sum_{k=1}^K w_{\mathcal{S}}(\boldsymbol{x}^*,
\boldsymbol{x}^{[k]}) f(\boldsymbol{x}_{\bar{\mathcal{S}}}^{[k]},
\boldsymbol{x}_{\mathcal{S}}^*)}{\sum_{k=1}^K w_{\mathcal{S}}(\boldsymbol{x}^*,\boldsymbol{x}^{[k]})}.$$
The number of samples $K$ to be used in the approximate prediction can
for instance be chosen such that the $K$ largest weights accounts for a
fraction $\eta$, for example $0.9$, of the total weight. If $K$ exceeds
a certain limit, for instance $5,000$, it might be set to that limit. A
bandwidth parameter $\sigma$ used to scale the weights, must also be
specified. This choice may be viewed as a bias-variance trade-off. A
small $\sigma$ puts most of the weight to a few of the closest training
observations and thereby gives low bias, but high variance. When
$\sigma \rightarrow \infty$, this method converges to the original
Kernel SHAP assuming feature independence. When the features are highly
dependent, a small $\sigma$ is typically needed so that the bias does
not dominate. @aas2019explaining show that a proper criterion
for selecting $\sigma$ is a small-sample-size corrected version of the
AIC known as AICc. As calculation of it is computationally intensive, an
approximate version of the selection criterion is also suggested.
Details are found in @aas2019explaining.
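A hypothetical base-R sketch of the weighting idea with a single conditioning feature (simulated data and a placeholder model; the actual implementation uses Mahalanobis distances and the AICc-based bandwidth selection):

```r
# Empirical conditional estimate of v(S): weight training rows by their
# distance to x_S^* in the conditioning feature, keep the K largest weights,
# and form a weighted average of the model output.
set.seed(1)
n_train <- 500
x1 <- rnorm(n_train)
x2 <- 0.8 * x1 + rnorm(n_train, sd = 0.6)  # x2 depends on x1
f  <- function(x1, x2) x1 + x2             # placeholder model

x1_star  <- 1     # conditioning value x_S^*
sigma_bw <- 0.2   # bandwidth: small -> low bias, high variance

# Gaussian kernel weights from the (scaled) squared distance in x_S
w <- exp(-((x1 - x1_star)^2 / var(x1)) / (2 * sigma_bw^2))

# Keep only the K training rows with the largest weights
K   <- 100
ord <- order(w, decreasing = TRUE)[1:K]

# Weighted average of f at (x_S fixed at x_S^*, x_Sbar from training rows)
v_S <- sum(w[ord] * f(x1_star, x2[ord])) / sum(w[ord])
# Close to E[f | x1 = 1] = 1 + 0.8 * 1 = 1.8 for this simulated data
```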
<a id="ctree"></a>
## Conditional Inference Tree Approach
The previous three methods can only handle numerical data. This means
that if the data contains categorical/discrete/ordinal features, the
features first have to be one-hot encoded. When the number of
levels/features is large, this is not feasible. An approach that handles
mixed (i.e., numerical, categorical, discrete, ordinal) features and both
univariate and multivariate responses is conditional inference trees
(@hothorn2006unbiased).
Conditional inference trees are a special tree-fitting procedure that
relies on hypothesis tests to choose both the splitting feature and the
splitting point. The tree fitting procedure is sequential: first a
splitting feature is chosen (the feature that is least independent of
the response), and then a splitting point is chosen for this feature.
This decreases the chance of being biased towards features with many
splits (@hothorn2006unbiased).
We use conditional inference trees (*ctree*) to model the conditional
distribution,
$p(\boldsymbol{x}_{\bar{\mathcal{S}}}|\boldsymbol{x}_{\mathcal{S}}^*)$,
found in the Shapley methodology. First, we fit a different conditional
inference tree to each conditional distribution. Once a tree is fit for
given dependent features, the end node of
$\boldsymbol{x}_{\mathcal{S}}^*$ is found. Then, we sample from this end
node and use the resulting samples,
$\boldsymbol{x}_{\bar{\mathcal{S}}}^k, k=1,\ldots,K$, when approximating
the conditional expectations as in Kernel SHAP. See
@redelmeier2020explaining for more details.
The conditional inference trees are fit using the *party* and *partykit*
packages (@partykit_package).
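A hypothetical sketch of the sampling step with the *partykit* package (simulated data; `shapr` performs this internally for every coalition $\mathcal{S}$):

```r
library(partykit)

# Simulate dependent features: we want to sample x2 given x1
set.seed(1)
train <- data.frame(x1 = rnorm(500))
train$x2 <- 0.8 * train$x1 + rnorm(500, sd = 0.6)

# Fit a conditional inference tree for p(x2 | x1)
fit <- partykit::ctree(x2 ~ x1, data = train)

# Terminal node of the instance to be explained (here x1* = 1) ...
node_star  <- predict(fit, newdata = data.frame(x1 = 1), type = "node")
# ... and of every training row
node_train <- predict(fit, newdata = train, type = "node")

# Sample x2 values from the training rows in the same terminal node
x2_samples <- sample(train$x2[node_train == node_star], size = 100, replace = TRUE)
```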
<a id="vaeac"></a>
## Variational AutoEncoder with Arbitrary Conditioning (vaeac) Approach
Another approach that supports mixed features is the Variational AutoEncoder
with Arbitrary Conditioning (@olsen2022using), abbreviated to `vaeac`.
The `vaeac` is an extension of the regular variational autoencoder
(@kingma2014autoencoding), but instead of giving a probabilistic representation
of the distribution $p(\boldsymbol{x})$ it gives a probabilistic representation
of the conditional distribution
$p(\boldsymbol{x}_{\bar{\mathcal{S}}} \mid \boldsymbol{x}_{\mathcal{S}})$,
for all possible feature subsets $\mathcal{S}\subseteq\mathcal{M}$ simultaneously,
where $\mathcal{M}$ is the set of all features. That is, only a single `vaeac`
model is needed to model all conditional distributions.
The `vaeac` consists of three neural networks: a *full encoder*, a *masked encoder*,
and a *decoder*. The encoders map the full and masked/conditional input representations,
i.e., $\boldsymbol{x}$ and $\boldsymbol{x}_{\mathcal{S}}$, respectively,
to latent probabilistic representations. Sampled instances from these latent probabilistic
representations are sent to the decoder, which maps them back to the feature space
and provides a samplable probabilistic representation for the unconditioned features
$\boldsymbol{x}_{\bar{\mathcal{S}}}$. The full encoder is only used during the
training phase of the `vaeac` model to guide the training process of the masked encoder,
as the former relies on the full input sample $\boldsymbol{x}$, which is not accessible
in the deployment phase (when we generate the Monte Carlo samples), as we only have access
to $\boldsymbol{x}_{\mathcal{S}}$. The networks are trained by minimizing a variational
lower bound; see Section 3 in @olsen2022using for an in-depth introduction to the
`vaeac` methodology. We use the `vaeac` model at the epoch which obtains the lowest
validation IWAE score to generate the Monte Carlo samples used in the Shapley value computations.
We fit the `vaeac` model using the *torch* package in $\textsf{R}$ (@torch). The main
parameters are the number of layers in the networks (`vaeac.depth`), the width of the layers
(`vaeac.width`), the number of dimensions in the latent space (`vaeac.latent_dim`),
the activation function between the layers in the networks (`vaeac.activation_function`),
the learning rate in the ADAM optimizer (`vaeac.lr`), the number of `vaeac` models to initiate
to remedy poorly initiated model parameter values (`vaeac.n_vaeacs_initialize`), and
the number of learning epochs (`vaeac.epochs`). Call `?shapr::setup_approach.vaeac` for
a more detailed description of the parameters.
There are additional extra parameters which can be set by including a named list in the call to
the `explain()` function. For example, we can change the batch size to 32 by including
`vaeac.extra_parameters = list(vaeac.batch_size = 32)` in the call to `explain()`. See `?shapr::vaeac_get_extra_para_default` for a description of the possible
extra parameters to the `vaeac` approach. We strongly encourage the user to specify the main and extra parameters to the `vaeac` approach at the correct place in the call to the `explain()` function. That is, the main parameters are directly entered to the `explain()` function, while the extra parameters are included in a named list called `vaeac.extra_parameters`. However, the `vaeac` approach will try to correct for misplaced and duplicated parameters and give warnings to the user.
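A hypothetical sketch of where the main and extra `vaeac` parameters go in a call to `explain()`; the `model`, `x_explain`, `x_train`, and `p0` objects are placeholders (as in the examples below), and the parameter values are chosen purely for illustration:

```r
# Main vaeac parameters are passed directly to explain(); extra parameters
# go in the named list vaeac.extra_parameters (values here are illustrative).
explanation_vaeac <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  prediction_zero = p0,
  vaeac.epochs = 10,              # main parameter: number of learning epochs
  vaeac.n_vaeacs_initialize = 4,  # main parameter: initializations to try
  vaeac.extra_parameters = list(
    vaeac.batch_size = 32         # extra parameter: batch size
  )
)
```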
## Categorical Approach
When the features are all categorical, we can estimate the conditional
expectations using basic statistical formulas. For example, if we have
three features, $x_1, x_2, x_3$ with three levels each (indicated as 1,
2, 3), and we are provided with a table of counts indicating how many
times each combination of feature values occurs, we can estimate the
marginal and conditional probabilities as follows. Marginal
probabilities are estimated by dividing the number of times a given
feature (or features) takes on a certain value in the data set with the
total number of observations in the data set. Conditional
probabilities (for example, $P(X_1 = 1 | X_2 = 1)$) are estimated by
first subsetting the data set to reflect the conditioning (i.e.,
extracting all rows where $X_2 = 1$), and then dividing the number of
times the feature on the left hand side of $|$ takes the given value in
this subset by the total number of observations in this subset. Once the
marginal and conditional probabilities are estimated for all
combinations of feature values, each conditional expectation can be
calculated. For example, the expected value of $X_1$ given $X_2 = 1$ and
$X_3 = 2$ is
$$E(X_1|X_2=1, X_3=2) = \sum_{x} x\, P(X_1 = x | X_2=1, X_3=2) = \sum_{x} x\, \frac{P(X_1 = x, X_2 = 1, X_3 = 2)}{P(X_2=1, X_3=2)}.$$
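A hypothetical base-R sketch of these count-based estimates on simulated categorical data:

```r
# Estimate P(X1 = x | X2 = 1, X3 = 2) and E(X1 | X2 = 1, X3 = 2) from counts.
set.seed(1)
n <- 1000
d <- data.frame(
  x1 = sample(1:3, n, replace = TRUE),
  x2 = sample(1:3, n, replace = TRUE),
  x3 = sample(1:3, n, replace = TRUE)
)

# Marginal probability P(X2 = 1): count / total observations
p_x2_1 <- sum(d$x2 == 1) / n

# Subset the data to reflect the conditioning X2 = 1, X3 = 2
sub <- d[d$x2 == 1 & d$x3 == 2, ]

# Conditional probabilities P(X1 = x | X2 = 1, X3 = 2) as relative frequencies
p_cond <- table(factor(sub$x1, levels = 1:3)) / nrow(sub)

# Conditional expectation E(X1 | X2 = 1, X3 = 2)
e_cond <- sum((1:3) * as.numeric(p_cond))
```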
<a id="Regression_approaches"></a>
## Separate and Surrogate Regression Approaches
Another paradigm for estimating the contribution function is the regression
paradigm. In contrast to the methods above, which belong to the Monte Carlo
paradigm, the regression-based methods use regression models to estimate the
contribution function $v(S) = E[f(\boldsymbol{x})|\boldsymbol{x}_S = \boldsymbol{x}_S^*]$ directly.
The separate regression method class fits a separate regression model for
each coalition $S$, while the surrogate regression method class fits a single
regression model to simultaneously predict the contribution function for all
coalitions. We refer to @olsen2024comparative for when one should use the
different paradigms, method classes, and methods.
In a separate vignette (Shapley value explanations using the regression paradigm),
we elaborate and demonstrate the regression paradigm.
We describe how to specify the regression model, enable automatic cross-validation
of the model's hyperparameters, and apply pre-processing steps to the data before
fitting the regression models. @olsen2024comparative divides the regression
paradigm into the separate and surrogate regression method classes. In the
separate vignette, we briefly introduce the two method classes. For an in-depth
explanation, we refer the reader to Sections 3.5 and 3.6 in @olsen2024comparative.
<a id="ex"></a>
<br>
# Examples {#examples}
`shapr` supports computation of Shapley values with any predictive model
which takes a set of numeric features and produces a numeric outcome.
Note that the ctree method takes both numeric and categorical variables;
see the "Advanced usage" section for an example of how this is done.
The following example shows how a simple `xgboost` model is trained
using the `airquality` dataset, and how `shapr` can be used to explain
the individual predictions. Note that the empirical conditional
distribution approach is the default (i.e. `approach = "empirical"`).
The Gaussian, Gaussian copula, ctree, vaeac, categorical, or independence
approaches can be used instead by setting the argument `approach` to
`"gaussian"`, `"copula"`, `"ctree"`, `"vaeac"`, `"categorical"` or
`"independence"` in the code below.
```r
library(xgboost)
library(data.table)
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Set seed for reproducibility
set.seed(123)
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nround = 20,
  verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)
# Computing the actual Shapley values with kernelSHAP accounting for feature dependence using
# the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#>
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Printing the Shapley values for the test data.
# For more information about the interpretation of the values in the table, see ?shapr::explain.
print(explanation$shapley_values)
#> none Solar.R Wind Temp Month
#> <num> <num> <num> <num> <num>
#> 1: 43.086 13.21173 4.7856 -25.572 -5.5992
#> 2: 43.086 -9.97277 5.8307 -11.039 -7.8300
#> 3: 43.086 -2.29162 -7.0534 -10.150 -4.4525
#> 4: 43.086 3.32546 -3.2409 -10.225 -6.6635
#> 5: 43.086 4.30396 -2.6278 -14.152 -12.2669
#> 6: 43.086 0.47864 -5.2487 -12.553 -6.6457
# Plot the resulting explanations for observations 1 and 6
plot(explanation, bar_plot_phi0 = FALSE, index_x_explain = c(1, 6))
```
![](figure_main/unnamed-chunk-2-1.png)
There are multiple plot options specified by the `plot_type` argument in
`plot`. The `waterfall` option shows the changes in the prediction score
due to each feature's contribution (their Shapley values):
```r
plot(explanation, plot_type = "waterfall", index_x_explain = c(1, 6))
```
![](figure_main/unnamed-chunk-3-1.png)
The other two plot options, `"beeswarm"` and `"scatter"`, can be useful
when you have many observations that you want to explain. For the
purpose of illustration, we explain the whole `airquality` dataset
(including the training data) for these plot types. The
`plot_type = "beeswarm"` summarises the distribution of the Shapley
values along the x-axis across all features. Each point gives the
Shapley value of a given instance, where the points are colored by the
feature value of that instance:
```r
x_explain_many <- data[, ..x_var]
explanation_plot <- explain(
  model = model,
  x_explain = x_explain_many,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
plot(explanation_plot, plot_type = "beeswarm")
```
![](figure_main/unnamed-chunk-4-1.png)
The `plot_type = "scatter"` plots the feature values on the x-axis and
Shapley values on the y-axis, as well as (optionally) a background
histogram (`scatter_hist`) showing the distribution of the feature data:
```r
plot(explanation_plot, plot_type = "scatter", scatter_hist = TRUE)
```
![](figure_main/unnamed-chunk-5-1.png)
We can use mixed (i.e., continuous, categorical, ordinal) data with `ctree` or `vaeac`.
Use `ctree` with mixed data in the following manner:
```r
# convert the month variable to a factor
data[, Month_factor := as.factor(Month)]
data_train_cat <- data[-ind_x_explain, ]
data_explain_cat <- data[ind_x_explain, ]
x_var_cat <- c("Solar.R", "Wind", "Temp", "Month_factor")
x_train_cat <- data_train_cat[, ..x_var_cat]
x_explain_cat <- data_explain_cat[, ..x_var_cat]
# Fitting an lm model here as xgboost does not handle categorical features directly
# (a workaround is shown in an example below)
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var_cat, collapse = " + ")))
model_lm_cat <- lm(lm_formula, data_train_cat)
p0 <- mean(y_train)
explanation_lm_cat <- explain(
  model = model_lm_cat,
  x_explain = x_explain_cat,
  x_train = x_train_cat,
  approach = "ctree",
  prediction_zero = p0
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Plot the resulting explanations for observations 1 and 6, excluding
# the no-covariate effect
plot(explanation_lm_cat, bar_plot_phi0 = FALSE, index_x_explain = c(1, 6))
```
![](figure_main/unnamed-chunk-6-1.png)
We can specify parameters used to build the conditional inference trees
in the following manner. Default values are based on
@hothorn2006unbiased.
```r
# Use the conditional inference tree approach
# We can specify parameters used to build the trees via mincriterion,
# minsplit, minbucket
explanation_ctree <- explain(
  model = model_lm_cat,
  x_explain = x_explain_cat,
  x_train = x_train_cat,
  approach = "ctree",
  prediction_zero = p0,
  ctree.mincriterion = 0.80,
  ctree.minsplit = 20,
  ctree.minbucket = 20
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Default parameters (based on (Hothorn, 2006)) are:
# mincriterion = 0.95
# minsplit = 20
# minbucket = 7
```
If **all** features are categorical, one may use the categorical
approach as follows:
```r
# For the sake of illustration, convert ALL features to factors
data[, Solar.R_factor := as.factor(cut(Solar.R, 10))]
data[, Wind_factor := as.factor(cut(Wind, 3))]
data[, Temp_factor := as.factor(cut(Temp, 2))]
data[, Month_factor := as.factor(Month)]
data_train_all_cat <- data[-ind_x_explain, ]
data_explain_all_cat <- data[ind_x_explain, ]
x_var_all_cat <- c("Solar.R_factor", "Wind_factor", "Temp_factor", "Month_factor")
x_train_all_cat <- data_train_all_cat[, ..x_var_all_cat]
x_explain_all_cat <- data_explain_all_cat[, ..x_var_all_cat]
# Fit an lm model here
lm_formula_all_cat <- as.formula(paste0(y_var, " ~ ", paste0(x_var_all_cat, collapse = " + ")))
model_lm_all_cat <- lm(lm_formula_all_cat, data_train_all_cat)
explanation_cat_method <- explain(
  model = model_lm_all_cat,
  x_explain = x_explain_all_cat,
  x_train = x_train_all_cat,
  approach = "categorical",
  prediction_zero = p0
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
```
Shapley values can be used to explain any predictive model. For
predictive models taking time series as input, `approach='timeseries'`
can be used. In such models, joint behavior of consecutive time points
is often more important for the outcome than the single time points.
Therefore, it makes sense to derive Shapley values for segments of the
time series instead of for each single time point. In `shapr` this can
be achieved through the `group` argument. Other optional parameters of
`approach='timeseries'` are `timeseries.fixed_sigma_vec` and
`timeseries.bounds` (a vector indicating upper and lower bounds of the
time series if necessary).
```r
# Simulate time series data with AR(1)-structure
set.seed(1)
data_ts <- data.frame(matrix(NA, ncol = 41, nrow = 100))
for (n in 1:100) {
set.seed(n)
e <- rnorm(42, mean = 0, sd = 1)
m_1 <- 0
for (i in 2:length(e)) {
m_1[i] <- 1 + 0.8 * m_1[i - 1] + e[i]
}
data_ts[n, ] <- m_1[-1]
}
data_ts <- data.table::as.data.table(data_ts)
x_var_ts <- paste0("X", 1:40)
y_var_ts <- "X41"
ind_x_explain <- 1:6
data_ts_train <- data_ts[-ind_x_explain]
# Creating a predictive model (for illustration just predicting the next point in the time series with a linear model)
lm_ts_formula <- as.formula(X41 ~ .)
model_lm_ts <- lm(lm_ts_formula, data_ts_train)
x_explain_ts <- data_ts[ind_x_explain, ..x_var_ts]
x_train_ts <- data_ts[-ind_x_explain, ..x_var_ts]
# Splitting the time series into 4 segments
group_ts <- list(
S1 = paste0("X", 1:10),
S2 = paste0("X", 11:20),
S3 = paste0("X", 21:30),
S4 = paste0("X", 31:40)
)
p0_ts <- mean(unlist(data_ts_train[, ..y_var_ts]))
explanation_timeseries <- explain(
model = model_lm_ts,
x_explain = x_explain_ts,
x_train = x_train_ts,
approach = "timeseries",
prediction_zero = p0_ts,
group = group_ts
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
```
## MSEv evaluation criterion
We can use the $\operatorname{MSE}_{v}$ criterion proposed by @frye2020shapley,
and later used by, e.g., @olsen2022using and @olsen2024comparative, to evaluate
and rank the approaches/methods. The $\operatorname{MSE}_{v}$ is given by
```{=tex}
\begin{align}
\label{eq:MSE_v}
\operatorname{MSE}_{v} = \operatorname{MSE}_{v}(\text{method } \texttt{q})
=
\frac{1}{N_\mathcal{S}} \sum_{\mathcal{S} \in \mathcal{P}^*(\mathcal{M})} \frac{1}{N_\text{explain}}
\sum_{i=1}^{N_\text{explain}} \left( f(\boldsymbol{x}^{[i]}) - {\hat{v}}_{\texttt{q}}(\mathcal{S}, \boldsymbol{x}^{[i]})\right)^2\!,
\end{align}
```
where ${\hat{v}}_{\texttt{q}}$ is the estimated contribution function using method $\texttt{q}$ and $N_\mathcal{S} = |\mathcal{P}^*(\mathcal{M})| = 2^M-2$, i.e., we have removed the empty ($\mathcal{S} = \emptyset$) and grand ($\mathcal{S} = \mathcal{M}$) combinations as they are method independent. This means that these two combinations do not influence the ranking of the methods, as the methods are not used to compute the contribution function for them.
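To make the double average in the formula concrete, here is a toy sketch with simulated numbers (all values are made up for illustration; this is not how `shapr` computes the criterion internally):

```r
# Toy illustration of the MSEv double average: M = 4 features gives
# N_S = 2^4 - 2 = 14 non-trivial coalitions.
set.seed(123)
N_explain <- 5
N_S <- 14
f_x <- rnorm(N_explain) # stand-in for the predictions f(x^[i])
v_hat <- matrix(rnorm(N_S * N_explain), nrow = N_S) # stand-in for v_hat_q(S, x^[i])
# Squared errors (f(x^[i]) - v_hat_q(S, x^[i]))^2 for every coalition/explicand pair
sq_err <- sweep(v_hat, 2, f_x)^2
# Average over explicands (inner sum), then over coalitions (outer sum)
MSEv <- mean(rowMeans(sq_err))
```
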
The motivation behind the
$\operatorname{MSE}_{v}$ criterion is that
$\mathbb{E}_\mathcal{S}\mathbb{E}_{\boldsymbol{x}} (v_{\texttt{true}}(\mathcal{S},\boldsymbol{x}) - \hat{v}_{\texttt{q}}(\mathcal{S}, \boldsymbol{x}))^2$
can be decomposed as
```{=tex}
\begin{align}
\label{eq:expectation_decomposition}
\begin{split}
\mathbb{E}_\mathcal{S}\mathbb{E}_{\boldsymbol{x}} (v_{\texttt{true}}(\mathcal{S}, \boldsymbol{x})- \hat{v}_{\texttt{q}}(\mathcal{S}, \boldsymbol{x}))^2
&=
\mathbb{E}_\mathcal{S}\mathbb{E}_{\boldsymbol{x}} (f(\boldsymbol{x}) - \hat{v}_{\texttt{q}}(\mathcal{S}, \boldsymbol{x}))^2 \\
&\phantom{\,\,\,\,\,\,\,}- \mathbb{E}_\mathcal{S}\mathbb{E}_{\boldsymbol{x}} (f(\boldsymbol{x})-v_{\texttt{true}}(\mathcal{S}, \boldsymbol{x}))^2,
\end{split}
\end{align}
```
see Appendix A in @covert2020understanding. The first term on the right-hand side of
the equation above can be estimated by $\operatorname{MSE}_{v}$, while the second
term is a fixed (unknown) constant not influenced by the approach $\texttt{q}$. Thus, a lower value
of $\operatorname{MSE}_{v}$ indicates that the estimated contribution function $\hat{v}_{\texttt{q}}$
is closer to the true counterpart $v_{\texttt{true}}$.
In `shapr`, we allow for weighting the combinations in the $\operatorname{MSE}_{v}$ evaluation criterion either
uniformly or by using the corresponding Shapley kernel weights (or the sampling frequencies when sampling of
combinations is used).
This is determined by the logical parameter `MSEv_uniform_comb_weights` in the `explain()` function, and the
default is to do uniform weighting, that is, `MSEv_uniform_comb_weights = TRUE`.
### Advantage:
An advantage of the $\operatorname{MSE}_{v}$ criterion is that $v_\texttt{true}$ is not involved.
Thus, we can apply it as an evaluation criterion to real-world data sets where the true
Shapley values are unknown.
### Disadvantages:
First, we can only use the $\operatorname{MSE}_{v}$ criterion to rank the methods and not assess
their closeness to the optimum since the minimum value of the $\operatorname{MSE}_{v}$ criterion
is unknown. Second, the criterion evaluates the contribution functions and not the Shapley values.
Note that @olsen2024comparative observed a relatively linear relationship between the
$\operatorname{MSE}_{v}$ criterion and the mean absolute error $(\operatorname{MAE})$ between the
true and estimated Shapley values in extensive simulation studies where the true Shapley values
were known. That is, a method that achieves a low $\operatorname{MSE}_{v}$ score also tends to
obtain a low $\operatorname{MAE}$ score, and vice versa.
### Confidence intervals
The $\operatorname{MSE}_{v}$ criterion can be written as
$\operatorname{MSE}_{v} = \frac{1}{N_\text{explain}}\sum_{i=1}^{N_\text{explain}} \operatorname{MSE}_{v,\text{explain }i}$.
We can therefore use the central limit theorem to compute an approximate
confidence interval for the $\operatorname{MSE}_{v}$ criterion. We have that
$\operatorname{MSE}_{v} \pm t_{\alpha/2}\frac{\operatorname{SD}(\operatorname{MSE}_{v})}{\sqrt{N_\text{explain}}}$
is an approximate $(1-\alpha)100\%$ confidence interval for the evaluation criterion,
where $t_{\alpha/2}$ is the upper $\alpha/2$ percentile of the $t$-distribution with $N_\text{explain}-1$ degrees of freedom.
Note that $N_\text{explain}$ should be large (rule of thumb is at least $30$) for the
central limit theorem to be valid. The quantities $\operatorname{MSE}_{v}$ and
$\frac{\operatorname{SD}(\operatorname{MSE}_{v})}{\sqrt{N_\text{explain}}}$ are returned by
the `explain()` function in the `MSEv` list of data tables. We can also compute similar
approximate confidence intervals for the $\operatorname{MSE}_{v}$ criterion for each
combination/coalition when only averaging over the observations. However, it does not
make sense in the other direction, i.e., when only averaging over the combinations for
each observation, as each combination is a different prediction task.
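The CLT argument above can be sketched with toy numbers (the per-explicand scores below are simulated stand-ins; in `shapr` the overall $\operatorname{MSE}_{v}$ and its standard error are returned directly in the `MSEv` list of data tables):

```r
# Approximate 95% CI for MSEv from per-explicand scores via the CLT
set.seed(1)
N_explain <- 30 # rule of thumb: at least 30 explicands
mse_per_explicand <- rchisq(N_explain, df = 3) # stand-in for MSE_{v, explain i}
MSEv <- mean(mse_per_explicand)
MSEv_se <- sd(mse_per_explicand) / sqrt(N_explain) # SD(MSEv)/sqrt(N_explain)
t_crit <- qt(0.975, df = N_explain - 1) # upper alpha/2 quantile, alpha = 0.05
CI <- c(lower = MSEv - t_crit * MSEv_se, upper = MSEv + t_crit * MSEv_se)
```
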
### MSEv examples
Start by explaining the predictions by using different methods and combining them into lists.
```r
# We use more explicands here for more stable confidence intervals
ind_x_explain_many <- 1:25
x_train <- data[-ind_x_explain_many, ..x_var]
y_train <- data[-ind_x_explain_many, get(y_var)]
x_explain <- data[ind_x_explain_many, ..x_var]
# Fitting a basic xgboost model to the training data
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 20,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)
# Independence approach
explanation_independence <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "independence",
prediction_zero = p0,
n_samples = 1e2,
n_batches = 5,
MSEv_uniform_comb_weights = TRUE
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
# Empirical approach
explanation_empirical <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
prediction_zero = p0,
n_samples = 1e2,
n_batches = 5,
MSEv_uniform_comb_weights = TRUE
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
# Gaussian 1e1 approach
explanation_gaussian_1e1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
prediction_zero = p0,
n_samples = 1e1,
n_batches = 5,
MSEv_uniform_comb_weights = TRUE
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
# Gaussian 1e2 approach
explanation_gaussian_1e2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
prediction_zero = p0,
n_samples = 1e2,
n_batches = 5,
MSEv_uniform_comb_weights = TRUE
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
# Combined approach
explanation_combined <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = c("gaussian", "empirical", "independence"),
prediction_zero = p0,
n_samples = 1e2,
n_batches = 5,
MSEv_uniform_comb_weights = TRUE
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
# Create a list of explanations with names
explanation_list_named <- list(
"Ind." = explanation_independence,
"Emp." = explanation_empirical,
"Gaus. 1e1" = explanation_gaussian_1e1,
"Gaus. 1e2" = explanation_gaussian_1e2,
"Combined" = explanation_combined
)
```
We can then compare the different approaches by creating plots of the $\operatorname{MSE}_{v}$ evaluation criterion.
```r
# Create the MSEv plots with approximate 95% confidence intervals
MSEv_plots <- plot_MSEv_eval_crit(explanation_list_named,
plot_type = c("overall", "comb", "explicand"),
CI_level = 0.95
)
# 5 plots are made
names(MSEv_plots)
#> [1] "MSEv_explicand_bar" "MSEv_explicand_line_point" "MSEv_combination_bar" "MSEv_combination_line_point" "MSEv_bar"
```
The main plot of interest is `MSEv_bar`, which displays the $\operatorname{MSE}_{v}$ evaluation criterion for each method, averaged over both the combinations/coalitions and test observations/explicands. However, we can also look at the other plots, where
we have only averaged over the observations or the combinations (both as bar and line plots).
```r
# The main plot of the overall MSEv averaged over both the combinations and observations
MSEv_plots$MSEv_bar
```
![](figure_main/unnamed-chunk-12-1.png)
```r
# The MSEv averaged over only the explicands for each combination
MSEv_plots$MSEv_combination_bar
```
![](figure_main/unnamed-chunk-12-2.png)
```r
# The MSEv averaged over only the combinations for each observation/explicand
MSEv_plots$MSEv_explicand_bar
```
![](figure_main/unnamed-chunk-12-3.png)
```r
# To see which coalition S each `id_combination` corresponds to,
# i.e., which features are conditioned on.
explanation_list_named[[1]]$MSEv$MSEv_combination[, c("id_combination", "features")]
#> id_combination features
#> <int> <list>
#> 1: 2 1
#> 2: 3 2
#> 3: 4 3
#> 4: 5 4
#> 5: 6 1,2
#> 6: 7 1,3
#> 7: 8 1,4
#> 8: 9 2,3
#> 9: 10 2,4
#> 10: 11 3,4
#> 11: 12 1,2,3
#> 12: 13 1,2,4
#> 13: 14 1,3,4
#> 14: 15 2,3,4
```
We can specify the `index_x_explain` and `id_combination` parameters in `plot_MSEv_eval_crit()` to only plot
certain test observations and combinations, respectively.
```r
# We can specify which test observations or combinations to plot
plot_MSEv_eval_crit(explanation_list_named,
plot_type = "explicand",
index_x_explain = c(1, 3:4, 6),
CI_level = 0.95
)$MSEv_explicand_bar
```
![](figure_main/unnamed-chunk-13-1.png)
```r
plot_MSEv_eval_crit(explanation_list_named,
plot_type = "comb",
id_combination = c(3, 4, 9, 13:15),
CI_level = 0.95
)$MSEv_combination_bar
```
![](figure_main/unnamed-chunk-13-2.png)
We can also alter the design of the plots, as demonstrated in the code below.
```r
bar_text_n_decimals <- 1
plot_MSEv_eval_crit(explanation_list_named) +
ggplot2::scale_x_discrete(limits = rev(levels(MSEv_plots$MSEv_bar$data$Method))) +
ggplot2::coord_flip() +
ggplot2::scale_fill_brewer(palette = "Paired") +
ggplot2::theme_minimal() + # This must be set before other theme calls
ggplot2::theme(
plot.title = ggplot2::element_text(size = 10),
legend.position = "bottom"
) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf(
paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""),
round(MSEv, bar_text_n_decimals)
)),
vjust = -0.35, # This number might need altering for different plot sizes
hjust = 1.1, # This number might need altering for different plot sizes
color = "black",
position = ggplot2::position_dodge(0.9),
size = 4
)
```
![](figure_main/unnamed-chunk-14-1.png)
## Main arguments in `explain`
When using `explain`, the default behavior is to use all feature
combinations in the Shapley formula. Kernel SHAP's sampling-based
approach may be used by specifying `n_combinations`, which is the number
of unique feature combinations to sample. If not specified, the exact
method is used. The computation time grows approximately exponentially
with the number of features. The training data and the model whose
predictions we wish to explain must be provided through the arguments
`x_train` and `model`. The data whose predicted values we wish to
explain must be given by the argument `x_explain`. Note that both
`x_train` and `x_explain` must be a `data.frame` or a `matrix`, and all
elements must be finite numerical values. Currently we do not support
missing values. The default approach when computing the Shapley values
is the empirical approach (i.e. `approach = "empirical"`). To use a
different approach, set `approach` to one of the other supported
approaches (e.g. `"gaussian"`, `"copula"`, `"ctree"`, `"categorical"`,
`"timeseries"`, or `"independence"`), or to a vector of approaches with
length equal to the number of features. If a vector, a combined approach
is used, and element `i` indicates the approach to use when conditioning
on `i` variables. For more details see [Combined approach](#combined) below.
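As a sketch, the sampling-based approach can be invoked as follows (this reuses `model`, `x_explain`, `x_train`, and `p0` from the examples above; `n_combinations = 10` is an arbitrary illustrative value, not a recommendation):

```r
# Sampling-based Kernel SHAP: estimate Shapley values from a sampled subset
# of feature combinations instead of all 2^M of them.
explanation_sampled <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
prediction_zero = p0,
n_combinations = 10 # number of unique feature combinations to sample
)
```
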
When computing the kernel SHAP values by `explain`, the maximum number
of samples to use in the Monte Carlo integration for every conditional
expectation is controlled by the argument `n_samples` (default equals
`1000`). The computation time grows approximately linearly with this
number. You will also need to pass a numeric value for the argument
`prediction_zero`, which represents the prediction value when not