# Accumulated Local Effects

## Overview

Accumulated Local Effects (ALE) is a method for computing feature effects based on the paper [Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models](https://arxiv.org/abs/1612.08468) by Apley and Zhu. The algorithm provides model-agnostic (*black box*) global explanations for classification and regression models on tabular data.

ALE addresses some key shortcomings of the more popular method of estimating feature effects via [Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) (PDP), which can be very misleading when features are correlated. In the following section we motivate the ALE definition and describe the exact issues it addresses.

## Motivation and definition

The following exposition largely follows [Apley and Zhu (2016)](https://arxiv.org/abs/1612.08468) and [Molnar (2019)](https://christophm.github.io/interpretable-ml-book/ale.html).

Given a predictive model $f(x)$, where $x=(x_1,\dots,x_d)$ is a vector of $d$ features, we are interested in computing the *feature effects* of each feature $x_i$ on the model $f(x)$. A feature effect of feature $x_i$ is some function $g(x_i)$ designed to disentangle the contribution of $x_i$ to the response $f(x)$. To simplify notation, in the following we consider the $d=2$ case and define the feature effect functions for the first feature $x_1$.

### Partial Dependence

Partial Dependence Plots (PDP) are a very common method for computing feature effects. The partial dependence (PD) function is defined as

$$
\text{PD}(x_1) = \mathbb{E}[f(x_1, X_2)] = \int p(x_2)f(x_1, x_2)dx_2,
$$

where $p(x_2)$ is the marginal distribution of $X_2$. To estimate the expectation, we can take the training set $X$, replace the first feature of every instance with $x_1$, and average the resulting predictions:

$$
\widehat{\text{PD}}(x_1) = \frac{1}{n}\sum_{j=1}^{n}f(x_1, x_{2, j}).
$$
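The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `partial_dependence`, the toy model `f` and the toy data are assumptions made for this example.

```python
import numpy as np

def partial_dependence(f, X, feature, grid):
    """Estimate PD at each grid value (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    pd_values = np.empty(len(grid))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value       # overwrite the feature for ALL instances
        pd_values[k] = f(X_mod).mean()  # average predictions over the dataset
    return pd_values

# Toy model and data, assumed purely for illustration: f(x) = x1 + 2 * x2
f = lambda X: X[:, 0] + 2 * X[:, 1]
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
pd_hat = partial_dependence(f, X, feature=0, grid=np.array([0.0, 1.0]))
```

With this toy data the mean of $x_2$ is 2, so the estimate is 4 at $x_1=0$ and 5 at $x_1=1$. Note that every row of the dataset is reused at every grid value, regardless of whether the resulting combination of features is realistic.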

The PD function attempts to calculate the effect of $x_1$ by averaging the effects of the other feature $x_2$ over its marginal distribution. This is problematic because by doing so we are averaging predictions of many *out of distribution* instances. For example, if $x_1$ and $x_2$ are a person's height and weight and $f$ predicts some other attribute of the person, then the PD function at a fixed height $x_1$ would average predictions of persons with height $x_1$ *and all possible weights* $x_2$ observed in the training set. Clearly, since height and weight are strongly correlated, this would lead to many unrealistic data points. Since the predictor $f$ has not been trained on such impossible data points, the predictions are no longer meaningful. In other words, an implicit assumption motivating the PD approach is that the features are uncorrelated. This is rarely the case in practice, which severely limits the usefulness of PDP.

An attempt to fix the issue with the PD function is to average over the conditional distribution instead of the marginal which leads to the following feature effect function:

$$
M(x_1) = \mathbb{E}[f(X_1, X_2)\vert X_1=x_1] = \int p(x_2\vert x_1)f(x_1,x_2)dx_2,
$$
where $p(x_2\vert x_1)$ is the conditional distribution of $X_2$ given $X_1=x_1$. To estimate this function from the training set $X$ we can compute

$$
\widehat{M}(x_1) = \frac{1}{n(x_1)}\sum_{j\in N(x_1)}f(x_1,x_{2,j}),
$$
where $N(x_1)$ is the subset of indices $j$ for which $x_{1,j}$ falls into some small neighbourhood of $x_1$ and $n(x_1)$ is the number of such instances.

While this refinement addresses the issue of the PD function averaging over impossible data points, the use of the $M(x_1)$ function as feature effects remains limited when the features are correlated. To go back to the example with people's height and weight, if we fix the height to be some particular value $x_1$ and calculate the effects according to $M(x_1)$, because of the correlation of height and weight the function value mixes effects of *both* features and estimates the **combined** effect. This is undesirable as we cannot attribute the value of $M(x_1)$ purely to height. Furthermore, suppose height doesn't actually have any effect on the prediction, only weight does. Because of the correlation between height and weight, $M(x_1)$ would still show an effect which can be highly misleading.
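The estimator $\widehat{M}(x_1)$ and the misleading effect described above can be illustrated with a small sketch. The function name `m_estimate`, the neighbourhood `width` parameter, and the synthetic height and weight data are all assumptions made for this example; in particular, the model `f` is deliberately constructed to ignore height entirely.

```python
import numpy as np

def m_estimate(f, X, grid, width, feature=0):
    """Estimate M at each grid value (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    m_values = np.full(len(grid), np.nan)  # NaN where the neighbourhood is empty
    for k, value in enumerate(grid):
        # N(x_1): instances whose feature value lies within width / 2 of the grid value
        in_nbhd = np.abs(X[:, feature] - value) <= width / 2
        if in_nbhd.any():
            X_nb = X[in_nbhd].copy()
            X_nb[:, feature] = value
            m_values[k] = f(X_nb).mean()
    return m_values

# Synthetic, perfectly correlated data (an assumption): weight = height - 100
heights = np.linspace(150.0, 200.0, 51)
weights = heights - 100.0
X = np.column_stack([heights, weights])

# Toy model that depends on weight only and never reads the height column
f = lambda X: X[:, 1]

m_hat = m_estimate(f, X, grid=np.array([160.0, 180.0]), width=4.0)
```

In this toy setup $\widehat{M}$ rises from 60 at a height of 160 to 80 at a height of 180, although `f` does not use height at all. The apparent effect comes entirely from the correlated weight feature.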

The following plot summarizes the two approaches for estimating the effect of $x_1$ at a particular value when $x_2$ is strongly correlated with $x_1$:

![PDP_M_estimation](pdp_m.png)

### ALE

ALE solves the problem of mixing effects from different features. As with the function $M(x_1)$, ALE uses the conditional distribution to average over other features, but instead of averaging the predictions directly, it averages *differences in predictions* to block the effect of correlated features:

\begin{align}
\text{ALE}(x_1) &= \int_{\min(x_1)}^{x_1}\mathbb{E}\left[\frac{\partial f(X_1,X_2)}{\partial X_1}\Big\vert X_1=z_1\right]dz_1 - c_1 \\
&= \int_{\min(x_1)}^{x_1}\int p(x_2\vert z_1)\frac{\partial f(z_1, x_2)}{\partial z_1}dx_2dz_1 - c_1,
\end{align}
where the constant $c_1$ is chosen such that the resulting ALE plot is vertically centered.
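
One way to approximate this definition numerically, in the spirit of the estimator described by Apley and Zhu, is to split the range of $x_1$ into bins, replace the partial derivative by the difference in predictions between the two edges of each bin, average those differences over the instances that fall into the bin, and accumulate the averages. The sketch below makes several assumptions that are not part of the text above: the function name `ale_1d`, quantile-based bin edges, the bin-membership convention, and centering by the count-weighted mean of the bin-midpoint values. It also assumes the feature takes at least two distinct values.

```python
import numpy as np

def ale_1d(f, X, feature, n_bins=10):
    """Binned finite-difference ALE sketch (hypothetical helper for illustration)."""
    X = np.asarray(X, dtype=float)
    x = X[:, feature]
    # Bin edges z_0 < ... < z_K taken from empirical quantiles (an assumption)
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Index of the bin [z_k, z_{k+1}) containing each instance; the maximum goes to the last bin
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_intervals - 1)

    # Move each instance to the lower and upper edge of its own bin only
    X_lo, X_hi = X.copy(), X.copy()
    X_lo[:, feature] = edges[bins]
    X_hi[:, feature] = edges[bins + 1]
    diffs = f(X_hi) - f(X_lo)  # differences in predictions

    # Average the differences within each bin (local effects), then accumulate
    counts = np.bincount(bins, minlength=n_intervals)
    sums = np.bincount(bins, weights=diffs, minlength=n_intervals)
    local = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    ale = np.concatenate([[0.0], np.cumsum(local)])

    # Vertical centering: one possible choice of the constant c_1
    ale -= np.sum(counts * 0.5 * (ale[:-1] + ale[1:])) / counts.sum()
    return edges, ale

# Synthetic, perfectly correlated features (an assumption for illustration)
x1 = np.linspace(0.0, 1.0, 101)
X = np.column_stack([x1, x1])

# Model using only the second feature: the ALE of the first feature stays flat
_, ale_second_only = ale_1d(lambda X: X[:, 1], X, feature=0)

# Linear model: the ALE of the first feature recovers its own slope of 1
edges_lin, ale_lin = ale_1d(lambda X: X[:, 0] + 2 * X[:, 1], X, feature=0)
```

Because each instance is only moved within its own bin, the other feature stays at values that were actually observed together with $x_1$, and any prediction change caused by the correlated feature cancels in the difference.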