[[source]](../api/alibi.explainers.html#alibi.explainers.PermutationImportance)

# Permutation Importance

## Overview

<a id="source_1"></a>
<a id="source_2"></a>

The permutation importance, initially proposed by [Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324)[[1]](#References) and further refined by [Fisher et al. (2019)](https://arxiv.org/abs/1801.01489)[[2]](#References), is a method to compute the global importance of a feature for a tabular dataset. The feature importance is computed from how much the model performance degrades when the values within a feature column are permuted. By inspecting the attribution received by each feature, a practitioner can understand which features the model relies on most to compute its predictions.

<img src="permutation_importance_intro_leave.png" alt="Permutation Importance using F1 score, Who's Going to Leave Next?" width="800"/>

**Figure 1**. Permutation Importance using $1 - F_1$ loss on "Who's Going to Leave Next?" dataset. Left figure displays the importance as the ratio between the permutation loss and the original loss. Right figure displays the importance as the difference between the permutation loss and the original loss.


Figure 1 displays the importance of each feature according to the $1 - F_1$ loss function, reported as the ratio between the permuted loss and the original loss (left plot) and as the difference between the permuted loss and the original loss (right plot). We can observe that the most important feature that the model relies on is `satisfaction_level`. Following that, three features have approximately the same importance, namely `average_montly_hours`, `last_evaluation` and `number_project`. Finally, `time_spend_company` completes the top 5. Features like `sales`, `salary`, `Work_accident` and `promotion_last_5years` receive an importance of 1 in the left plot and an importance of 0 in the right plot, which indicates that these features are not important to the model. For a more detailed analysis, please check the worked [example](../examples/permutation_importance_classification_leave.ipynb).


For pros & cons, see the [Permutation Importance](https://docs.seldon.io/projects/alibi/en/stable/overview/high_level.html#permutation-importance) section from the [Introduction](https://docs.seldon.io/projects/alibi/en/stable/overview/high_level.html) materials.

## Usage

TODO once the code review is completed.

## Theoretical exposition

<a id="source_1"></a>
[Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324)[[1]](#References) initially proposed the permutation feature importance for a random forest classifier as a method to compute the global importance of a feature as seen by the model. More precisely, consider a dataset with $M$ input features and a random forest classifier. After each tree is created, the values of the $m$-th feature in the [out-of-bag](https://en.wikipedia.org/wiki/Out-of-bag_error) (OOB) split are randomly permuted and the newly generated data is fed to the tree in question to obtain a new prediction. The result for each newly generated data instance from the OOB split is saved. The process is then repeated for all features $m = 1, 2, ..., M$. After the procedure is completed for every tree, the noised responses are compared with the true labels to give the misclassification rate. The importance of each feature is given by the percent increase in the misclassification rate compared with the OOB rate when all the features are left intact.

**The intuition behind the procedure described above is that an increase in the misclassification rate is an indication that a feature is important for the given model.**

<a id="source_2"></a>
Although the method was initially proposed for a random forest classifier, it can be easily generalized to any model and prediction task (e.g., classification or regression). [Fisher et al. (2019)](https://arxiv.org/abs/1801.01489)[[2]](#References) proposed a model-agnostic version of the permutation feature importance called *model reliance*, which is the one implemented in `alibi`.

### Notation

Before diving into the mathematical formulation of the model reliance, we first introduce some notation. Let $Z=(Y, X_1, X_2) \in \mathcal{Z}$ be an *iid* random variable with outcome $Y \in \mathcal{Y}$ and covariates (features) $X = (X_1, X_2) \in \mathcal{X}$, where the covariate subsets $X_1 \in \mathcal{X}_1$ and $X_2 \in \mathcal{X}_2$ may each be multivariate. The goal is to measure how much the model relies on $X_1$ to predict $Y$.



<a id="source_2"></a>
For a given prediction model $f$, [Fisher et al. (2019)](https://arxiv.org/abs/1801.01489)[[2]](#References) defined the *model reliance* as the percent increase in $f$'s expected loss when noise is added to $X_1$. Informally, this can be written as:

$$
MR(f) = \frac{\text{Expected loss of $f$ under noise}}{\text{Expected loss of $f$ without noise}}
$$

Note that there are certain properties that the noise must satisfy:

* It must render $X_1$ completely uninformative of the outcome $Y$.

* It must not alter the marginal distribution of $X_1$.
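
The two properties can be illustrated with a minimal sketch in which the noise is a random permutation of the $X_1$ column. The toy data, the `pearson` helper and the fixed seed are assumptions made for illustration only and are not part of `alibi`. The permuted column contains exactly the same values (same marginal), while its association with $Y$ is expected to be close to zero for a sample of this size.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


# Toy data (an assumption for illustration): Y depends linearly on X_1.
x1 = [float(i) for i in range(200)]
y = [2.0 * v + 1.0 for v in x1]

# The "noise" is a random permutation of the X_1 column; Y is left untouched.
rng = random.Random(0)
x1_permuted = list(x1)
rng.shuffle(x1_permuted)

same_marginal = sorted(x1_permuted) == sorted(x1)  # same values, new order
corr_before = pearson(x1, y)
corr_after = pearson(x1_permuted, y)
print(same_marginal, round(corr_before, 3), round(corr_after, 3))
```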


### Definition

Given the notation above, we can formally introduce the *model reliance*.

Let $Z^{(a)} = (Y^{(a)}, X_1^{(a)}, X_2^{(a)})$ and $Z^{(b)} = (Y^{(b)}, X_1^{(b)}, X_2^{(b)})$ be independent random variables, each following the same distribution as $Z = (Y, X_1, X_2)$. The expected loss of the model $f$ across pairs of observations $(Z^{(a)}, Z^{(b)})$ in which the values $X_1^{(a)}$ and $X_{1}^{(b)}$ have been switched is defined as:

$$
e_{\text{switch}}(f) = \mathbb{E}[L\{f, (Y^{(b)}, X_1^{(a)}, X_2^{(b)})\}]
$$

Note that the definition above uses the pair $(Y^{(b)}, X_2^{(b)})$ from $Z^{(b)}$, but the variable $X_1^{(a)}$ from $Z^{(a)}$, hence the name *switched*. It is important to understand that $X_1^{(a)}$ is independent of $(Y^{(b)}, X_2^{(b)})$, and thus we break the correlation of $X_1$ with the remaining features $X_2$ and with the output $Y$. An alternative interpretation of $e_{\text{switch}}(f)$ is the expected loss of $f$ when noise is added to $X_1$ in such a way that $X_1$ becomes completely uninformative of $Y$, but the marginal of $X_1$ is unchanged.

The reference quantity to compare $e_{\text{switch}}(f)$ against is the standard expected loss when the features are left intact (i.e., none of the feature values were switched). Formally, it can be written as:

$$
e_{\text{orig}}(f) = \mathbb{E}[L\{f, (Y, X_1, X_2)\}]
$$

Given the two quantities above, we can formally define $MR(f)$ as their ratio:

$$
MR(f) = \frac{e_{\text{switch}}(f)}{e_{\text{orig}}(f)}
$$

There are three possible cases to be analyzed:

* $MR(f) \gt 1$ indicates that the model $f$ relies on the feature $X_1$. For example, $MR(f) = 2$ means that the loss doubles when $X_1$ is permuted.

* $MR(f) = 1$ indicates that the model $f$ **does not** rely on the feature $X_1$. This means that the loss does not change when $X_1$ is permuted.

* $MR(f) \lt 1$ is an interesting case. Surprisingly, there exist models $f$ whose reliance is less than one. For example, this can happen if the model $f$ treats $X_1$ and $Y$ as positively correlated when in fact they are negatively correlated. In many cases, $MR(f) \lt 1$ implies the existence of a better-performing model $f^\prime$ satisfying $MR(f^\prime) = 1$ and $e_{\text{orig}}(f^\prime) \le e_{\text{orig}}(f)$. In other words, a model $f$ with $MR(f) \lt 1$ is typically suboptimal.
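
The three cases can be reproduced on a toy problem. The sketch below is illustrative only: the data, the three models and the helper `model_reliance` are hypothetical, and the switched loss is approximated by averaging over all pairs of distinct rows (taking $X_1$ from one row and $(Y, X_2)$ from another), which is one of the estimators discussed later in this document.

```python
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def model_reliance(f, y, x1, x2):
    """Ratio of the switched loss to the original loss for feature x1."""
    n = len(y)
    e_orig = sum(squared_loss(y[i], f(x1[i], x2[i])) for i in range(n)) / n
    # All pairs (i, j) with i != j: X_1 from row i, (Y, X_2) from row j.
    e_switch = sum(
        squared_loss(y[j], f(x1[i], x2[j]))
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return e_switch / e_orig


# Toy data (an assumption): the outcome equals the first feature.
x1 = [-2.0, -1.0, 1.0, 2.0]
x2 = [1.0, -1.0, -1.0, 1.0]
y = list(x1)

# Uses X_1 with the right sign, so switching X_1 hurts: MR > 1.
mr_relies = model_reliance(lambda a, b: 0.5 * a, y, x1, x2)
# Ignores X_1 entirely, so switching X_1 changes nothing: MR = 1.
mr_ignores = model_reliance(lambda a, b: b, y, x1, x2)
# Uses X_1 with the wrong sign, so switching X_1 helps: MR < 1.
mr_wrong_sign = model_reliance(lambda a, b: -a, y, x1, x2)

print(mr_relies, mr_ignores, mr_wrong_sign)
```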


<a id='source_3'></a>
An alternative definition of the model reliance, which uses the difference instead of the ratio, is given by:

$$
MR_{\text{difference}}(f) = e_{\text{switch}}(f) - e_{\text{orig}}(f).
$$

As emphasized by [Molnar (2020)](https://christophm.github.io/interpretable-ml-book/feature-importance.html)[[3]](#References), the advantage of using the ratio over the difference is that the results are comparable across different problems.
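
One way to read this claim is that the ratio does not depend on the scale of the loss, while the difference does. The sketch below uses made-up loss values (an assumption, not results from any experiment) and rescales them as if the target were measured in centimetres instead of metres under a squared loss.

```python
def mr_ratio(e_switch, e_orig):
    return e_switch / e_orig


def mr_difference(e_switch, e_orig):
    return e_switch - e_orig


# Hypothetical expected losses for some model and feature.
e_orig, e_switch = 0.25, 0.75

# Changing the target unit from metres to centimetres multiplies a
# squared loss by 100 ** 2.
scale = 100.0 ** 2

ratio_m = mr_ratio(e_switch, e_orig)
ratio_cm = mr_ratio(scale * e_switch, scale * e_orig)
diff_m = mr_difference(e_switch, e_orig)
diff_cm = mr_difference(scale * e_switch, scale * e_orig)

print(ratio_m, ratio_cm)  # identical: the ratio is unit-free
print(diff_m, diff_cm)    # the difference scales with the loss
```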

### Estimation of model reliance with U-statistics

For a given model $f$ and a dataset $Z = (Y, X_1, X_2)$, one has to estimate $MR(f)$. Estimating $e_{\text{orig}}(f)$ is straightforward through the empirical loss, formally given by:

$$
\hat{e}_{\text{orig}}(f) = \frac{1}{n} \sum_{i=1}^n L\{f, (Y^{(i)}, X_{1}^{(i)}, X_{2}^{(i)})\}.
$$

Estimating $e_{\text{switch}}(f)$ requires more care, because applying a naive permutation of the feature values can be a source of bias. To make concrete how the bias can be introduced, let us consider an example of four data instances

$$
\mathcal{Z} =  \{(Y^{(1)}, X_1^{(1)}, X_2^{(1)}), (Y^{(2)}, X_1^{(2)}, X_2^{(2)}), (Y^{(3)}, X_1^{(3)}, X_2^{(3)}), (Y^{(4)}, X_1^{(4)}, X_2^{(4)})\}.
$$

Note that naively applying the permutation $(1, 2, 4, 3)$ to the original dataset will only break the correlation for two instances out of four, and the rest will be left intact. Since the first two instances are left intact and follow the same data distribution that the model was trained on, we expect the error for those instances to be low (assuming the model did not overfit), which will bring down the estimate of $e_{\text{switch}}(f)$. Thus, permutations $\pi$ that have fixed points, i.e., elements such that $\pi(i) = i$, are a source of bias in our estimate.
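
The effect of the fixed points can be seen on a four-row toy example. This is a sketch under stated assumptions: the data, the model (which simply predicts $X_1$) and the helper `fixed_points` are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
def fixed_points(perm):
    """Indices i (0-based) that the permutation maps to themselves."""
    return [i for i, p in enumerate(perm) if p == i]


def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def model(x1_value):
    # Hypothetical model: predicts the value of X_1 itself.
    return x1_value


# Toy data (an assumption): Y equals X_1, so the original loss is zero.
x1 = [1.0, 2.0, 3.0, 4.0]
y = list(x1)

perm = [0, 1, 3, 2]  # 0-based form of the permutation (1, 2, 4, 3)
row_losses = [squared_loss(y[i], model(x1[perm[i]])) for i in range(len(y))]

# Rows 0 and 1 were not moved, so they still contribute the original loss.
naive_estimate = sum(row_losses) / len(row_losses)

# Averaging only over the rows that were actually moved gives a larger value.
moved = [i for i in range(len(perm)) if i not in fixed_points(perm)]
moved_only_estimate = sum(row_losses[i] for i in moved) / len(moved)

print(fixed_points(perm), row_losses, naive_estimate, moved_only_estimate)
```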

<a id="source_2"></a>
[Fisher et al. (2019)](https://arxiv.org/abs/1801.01489)[[2]](#References) proposed two alternative methods to compute an unbiased estimate using [U-statistics](https://en.wikipedia.org/wiki/U-statistic). The first is to perform a "switch" operation across all observed pairs, excluding pairings that are actually observed in the original dataset. Formally, it can be written as:

$$
\hat{e}_{\text{switch}}(f) = \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} L\{f, (Y^{(j)}, X_1^{(i)}, X_2^{(j)})\}.
$$

The computation of the $\hat{e}_{\text{switch}}(f)$ can be expensive because the summation is performed over all $n(n-1)$ possible pairs.
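
A direct, unoptimized sketch of this estimator is given below. It follows the definition of $e_{\text{switch}}(f)$, taking $X_1$ from one row and $(Y, X_2)$ from a different row. The function name `e_switch_hat`, the toy data and the toy model are assumptions made for illustration and do not reflect the `alibi` implementation. The nested loop makes the $n(n-1)$ cost explicit.

```python
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def e_switch_hat(f, loss, y, x1, x2):
    """Average loss over all ordered pairs of distinct rows (i, j)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if j != i:  # skip pairings observed in the original data
                total += loss(y[j], f(x1[i], x2[j]))
    return total / (n * (n - 1))


def model(x1_value, x2_value):
    # Hypothetical model: predicts X_1 and ignores X_2.
    return x1_value


# Toy data (an assumption): Y equals X_1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.0, 0.0, 0.0, 0.0]
y = list(x1)

estimate = e_switch_hat(model, squared_loss, y, x1, x2)
print(estimate)  # sum of squared differences over 12 pairs, i.e. 40 / 12
```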

If this estimation is prohibitively expensive due to the sample size, the following alternative estimator can be used:

$$
\hat{e}_{\text{divide}}(f) = \frac{1}{2\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} [L\{f, (Y^{(i)}, X_1^{(i + \lfloor n/2 \rfloor)}, X_2^{(i)})\} + L\{f, (Y^{(i + \lfloor n/2 \rfloor)}, X_1^{(i)}, X_2^{(i + \lfloor n/2 \rfloor)}) \}].
$$

Note that rather than summing over all possible pairs, the dataset is divided in half: the first-half values of $(Y, X_2)$ are matched with the second-half values of $X_1$, and the other way around. Besides being computationally light, this approach can provide confidence intervals by computing the estimates over multiple data splits.
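
The divide estimator can be sketched in a few lines. The names `e_divide_hat` and `e_divide_over_splits`, the toy data and the way the splits are generated (shuffling the row order with a seeded generator) are assumptions for illustration, not the `alibi` implementation. When $n$ is odd, the last row is simply not used.

```python
import random


def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def e_divide_hat(f, loss, y, x1, x2):
    """Match (Y, X_2) of each half with X_1 of the other half."""
    half = len(y) // 2
    total = 0.0
    for i in range(half):
        k = i + half
        total += loss(y[i], f(x1[k], x2[i]))
        total += loss(y[k], f(x1[i], x2[k]))
    return total / (2 * half)


def e_divide_over_splits(f, loss, y, x1, x2, n_splits=5, seed=0):
    """Repeat the estimate over several random row orders (data splits)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_splits):
        order = list(range(len(y)))
        rng.shuffle(order)
        y_s = [y[i] for i in order]
        x1_s = [x1[i] for i in order]
        x2_s = [x2[i] for i in order]
        estimates.append(e_divide_hat(f, loss, y_s, x1_s, x2_s))
    return estimates


def model(x1_value, x2_value):
    # Hypothetical model: predicts X_1 and ignores X_2.
    return x1_value


# Toy data (an assumption): Y equals X_1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.0, 0.0, 0.0, 0.0]
y = list(x1)

single = e_divide_hat(model, squared_loss, y, x1, x2)
several = e_divide_over_splits(model, squared_loss, y, x1, x2)
print(single, several)
```

The spread of the values in `several` gives a rough idea of the variability of the estimate across splits.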

<a id="source_1"></a>
<a id="source_2"></a>
We end our theoretical exposition by noting that both estimators above are unbiased for $e_{\text{switch}}(f)$, and either can be divided by $\hat{e}_{\text{orig}}(f)$ to obtain the estimate $\hat{MR}(f)$. Furthermore, one interesting observation is that the definition of $e_{\text{switch}}$ is very similar to the one proposed by [Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324)[[1]](#References). Formally, the approach described by [Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324)[[1]](#References) can be written as:

$$
\hat{e}_{\text{permute}}(f) = \frac{1}{n} \sum_{i=1}^n L\{f, (Y^{(i)}, X_1^{(\pi_l(i))}, X_2^{(i)})\},
$$

where $\pi_l \in \{\pi_1, ..., \pi_{n!}\}$ is one permutation from the set of all permutations of $(1, ..., n)$. The calculation proposed by [Fisher et al. (2019)](https://arxiv.org/abs/1801.01489)[[2]](#References) is proportional to the sum of losses over all $n!$ permutations, excluding the $n$ unique combinations of the rows of $X_1$ and the rows of $[Y, X_2]$ that appear in the original sample. As mentioned before, excluding those combinations is necessary to preserve the unbiasedness of $\hat{e}_{\text{switch}}(f)$.
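
This relationship can be checked by brute force on a small sample. In the sketch below, the toy data and model are hypothetical, and the permuted loss is taken to be the mean loss after permuting only the $X_1$ column. Enumerating all $n!$ permutations and skipping the rows that stay in place recovers the all-pairs switch estimate, while keeping them mixes in the original loss with weight $1/n$.

```python
from itertools import permutations
from math import isclose


def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def model(x1_value, x2_value):
    # Hypothetical toy model.
    return x1_value + x2_value


# Toy data (an assumption for illustration).
y = [1.0, 3.0, 4.0, 5.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.5, -0.5, 1.0, 0.0]
n = len(y)


def pair_loss(i, j):
    """Loss when row i keeps (Y, X_2) but receives X_1 from row j."""
    return squared_loss(y[i], model(x1[j], x2[i]))


e_orig = sum(pair_loss(i, i) for i in range(n)) / n
e_switch = sum(
    pair_loss(i, j) for i in range(n) for j in range(n) if i != j
) / (n * (n - 1))

all_perms = list(permutations(range(n)))

# Mean permuted loss over all n! permutations, fixed points included.
mean_e_permute = sum(
    sum(pair_loss(i, perm[i]) for i in range(n)) / n for perm in all_perms
) / len(all_perms)

# Same enumeration, skipping rows left in place (pairings seen in the sample).
moved_losses = [
    pair_loss(i, perm[i])
    for perm in all_perms
    for i in range(n)
    if perm[i] != i
]
mean_moved = sum(moved_losses) / len(moved_losses)

print(isclose(mean_moved, e_switch))
print(isclose(mean_e_permute, (n - 1) / n * e_switch + e_orig / n))
```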

## Examples

[Permutation Importance classification example ("Who's Going to Leave Next?")](../examples/permutation_importance_classification_leave.ipynb)

## References

<a id='References'></a>

[[1]](#source_1) Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.

[[2]](#source_2) Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. "All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously." J. Mach. Learn. Res. 20.177 (2019): 1-81.

[[3]](#source_3) Molnar, Christoph. Interpretable machine learning. Lulu.com, 2020.