# Integrated gradients method

## Overview

Integrated gradients is a method originally proposed in Sundararajan et al., ["Axiomatic Attribution for Deep Networks"](https://arxiv.org/abs/1703.01365) that aims to attribute an importance value to each input feature of a machine learning model based on the gradients of the model's output with respect to the input. In particular, integrated gradients defines an attribution value for each feature by considering the integral of the gradients taken along a straight path from a baseline $x^\prime$ to the input $x.$

## Integrated gradients method

The method is valid for both regression and classification models. In the case of a non-scalar output, such as in classification models or multi-target regression, the gradients should be calculated for one given element of the output. For classification models, the gradients usually refer to the output corresponding to the true class or to the class predicted by the model.

Let us consider an input instance $x,$ a baseline instance $x^\prime$ and a model $M: X \rightarrow Y$ which acts on the feature space $X$ and produces an output $y$ in the output space $Y.$ The attributions $A_i(x, x^\prime)$ for each feature $x_i$ with respect to the corresponding feature $x_i^\prime$ in the baseline are calculated as

$$A_i(x, x^\prime) = (x_i - x_i^\prime) \int_0^1 \frac{\partial M(x^\prime + \alpha (x - x^\prime))}{\partial x_i} d\alpha$$

where the integral is taken along a straight path from the baseline $x^\prime$ to the instance $x$ parameterized by the parameter $\alpha.$
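
In practice the integral has to be approximated numerically. As a sketch (not necessarily the exact scheme used by any given implementation), a Riemann sum with $m$ steps, as described in Sundararajan et al., reads

$$A_i(x, x^\prime) \approx (x_i - x_i^\prime) \frac{1}{m} \sum_{k=1}^{m} \frac{\partial M(x^\prime + \frac{k}{m} (x - x^\prime))}{\partial x_i}$$

where $m$ is a symbol introduced here for the number of steps; larger values give a more accurate approximation at the cost of more gradient evaluations.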

It is shown that such attributions satisfy the following axioms:

* Sensitivity axiom: if we consider a baseline $x^\prime$ which differs from the input instance $x$ in the value of one feature $x_i$ and yields a different prediction, the attribution given to feature $x_i$ must be non-zero.

* Implementation invariance axiom: an attribution method should be such that the attributions do not depend on the particular implementation of the model.

* Completeness axiom: the sum of the attributions over all features should be equal to the difference between the model's output at the instance $x$ and the model's output at the baseline $x^\prime$:
$$\sum_i A_i(x, x^\prime) = M(x) - M(x^\prime).$$
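
The attribution formula and the completeness axiom can be checked numerically on a toy model. The following is a minimal pure-Python sketch, not the alibi implementation: the logistic model, its weights `WEIGHTS`, and the helper names `integrated_gradients` and `completeness_gap` are all hypothetical, and the integral is approximated with a midpoint Riemann sum.

```python
import math

# Hypothetical toy model: logistic regression with fixed, made-up weights.
WEIGHTS = [1.0, -2.0, 0.5]


def model(x):
    """M(x) = sigmoid(w . x), a scalar output."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x):
    """Analytic gradient of M with respect to each input feature."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, n_steps=50):
    """Midpoint Riemann approximation of A_i(x, x')."""
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        # Point on the straight path from the baseline to the instance
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point)):
            avg_grad[i] += g / n_steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def completeness_gap(x, baseline, n_steps):
    """Sum of attributions minus (M(x) - M(x')); zero for an exact integral."""
    attributions = integrated_gradients(x, baseline, n_steps)
    return sum(attributions) - (model(x) - model(baseline))


x = [2.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline)
gap_coarse = completeness_gap(x, baseline, n_steps=5)
gap_fine = completeness_gap(x, baseline, n_steps=50)
print(attributions, gap_coarse, gap_fine)
```

With more steps the gap between the summed attributions and $M(x) - M(x^\prime)$ shrinks, which makes this gap a natural diagnostic for the quality of the numerical approximation.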


## Usage

The alibi implementation of the integrated gradients method is specific to TensorFlow and Keras models.

```python
import tensorflow as tf
from alibi.explainers import IntegratedGradients

# Load a trained TensorFlow/Keras model (replace the placeholder path with your own)
model = tf.keras.models.load_model("path_to_your_model")

ig = IntegratedGradients(model,
                         layer=None,
                         feature_names=None,
                         n_steps=50,
                         method="gausslegendre")
```

* `model`: A TensorFlow or Keras model.
* `layer`: Layer of the model with respect to which the gradients are calculated. If `None`, the gradients are calculated with respect to the input features.
* `feature_names`: Names of the features (optional).
* `n_steps`: The number of steps in the integral approximation.
* `method`: The method used for the integral approximation. Available methods are: `riemann_left`, `riemann_right`, `riemann_middle`, `riemann_trapezoid`, `gausslegendre`.
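
To give a feel for how the number of steps and the choice of quadrature rule trade off, the sketch below compares a trapezoidal rule with a 3-point Gauss-Legendre rule on a smooth one-dimensional integrand over $[0, 1]$. It is an illustration only: the function names `trapezoid_rule` and `gauss_legendre_3` and the integrand are hypothetical, and this is not alibi's code.

```python
import math


def trapezoid_rule(f, n_steps):
    """Trapezoidal approximation of the integral of f over [0, 1]."""
    h = 1.0 / n_steps
    total = 0.5 * (f(0.0) + f(1.0))
    for k in range(1, n_steps):
        total += f(k * h)
    return total * h


def gauss_legendre_3(f):
    """3-point Gauss-Legendre approximation of the integral of f over [0, 1]."""
    # Standard nodes and weights on [-1, 1], mapped to [0, 1] below
    nodes = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
    weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]
    return sum(0.5 * w * f(0.5 * (t + 1.0)) for t, w in zip(nodes, weights))


def integrand(alpha):
    """Stand-in for a gradient evaluated along the path (made up for the demo)."""
    return math.exp(alpha)


exact = math.e - 1.0  # closed-form integral of exp over [0, 1]
trapezoid_error = abs(trapezoid_rule(integrand, 3) - exact)
gauss_error = abs(gauss_legendre_3(integrand) - exact)
print(trapezoid_error, gauss_error)
```

Here the trapezoidal rule uses four evaluations of the integrand and the Gauss-Legendre rule three, yet the latter is far more accurate for this smooth integrand. For less smooth integrands the advantage can be smaller, so increasing the number of steps remains the general remedy for a poor approximation.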

```python 
explanation = ig.explain(X,
                         baselines=None,
                         target=None,
                         internal_batch_size=100,
                         return_convergence_delta=False,
                         return_predictions=False)

attributions = explanation.data['attributions']
```

* `X`: Instances for which the integrated gradients attributions are computed.
* `baselines`: Baselines (starting points of the path integral) for each instance. If an `np.ndarray` is passed, it must have the same shape as `X`. If not provided, all feature values of the baselines are set to 0.
* `target`: Target class for which the gradients are computed. It must be provided if the model output dimension is higher than 1. For regression models with a scalar output, `target` should not be provided. For classification models, `target` can be either the true classes or the classes predicted by the model.
* `internal_batch_size`: Batch size for the internal batching.
* `return_convergence_delta`: If set to `True`, convergence deltas for all examples are returned in the Explanation object.
* `return_predictions`: If set to `True`, the original predictions for all examples are returned in the Explanation object.

### Layer attributions

It is possible to calculate the integrated gradients attributions for the model's input features or for each element of an intermediate layer of the model. Specifically,

* If the parameter `layer` is set to its default value `None` as in the example above, the attributions are calculated for each input feature.
* If a layer of the model is passed, the attributions are calculated for each element of the layer passed.

Calculating attributions with respect to an internal layer of the model is particularly useful for models that take text as an input and use word-to-vector embeddings. In this case, the integrated gradients are calculated with respect to the embedding layer (see the [example](../examples/integrated_gradients_imdb.nblink) on the IMDB dataset).

### Baselines

Conceptually, baselines represent data points which do not contain information useful for the model's task, and they are used as a benchmark by the integrated gradients method. Common choices for the baselines are data points with all feature values set to zero (for example, a black image in the case of image classification) or set to random values.

However, the choice of the baselines can have a significant impact on the values of the attributions. For example, if we consider a simple binary image classification task where a model is trained to predict whether a picture was taken at night or during the day, using a black image as a baseline would be misleading: with such a baseline, all the dark pixels of the images would have zero attribution, while they are likely to be important for the task at hand.
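
The effect can be seen directly from the attribution formula: whenever $x_i = x_i^\prime$, the prefactor $(x_i - x_i^\prime)$ vanishes. The sketch below uses a hypothetical linear "daylight score" model, for which the gradient is constant and the integrated gradients attribution reduces to $w_i (x_i - x_i^\prime)$. The weights, pixel values and the function name `linear_ig` are made up for illustration.

```python
def linear_ig(weights, x, baseline):
    """Integrated gradients for a linear model M(x) = sum_i w_i * x_i.

    The gradient is the constant vector `weights`, so the path integral
    is trivial and each attribution equals w_i * (x_i - baseline_i).
    """
    return [w * (xi - b) for w, xi, b in zip(weights, x, baseline)]


# Hypothetical 4-pixel "image": two black pixels and two bright ones
weights = [1.0, 2.0, 1.0, 2.0]
image = [0.0, 0.0, 0.9, 0.8]

black_baseline = [0.0] * len(image)
gray_baseline = [0.5] * len(image)

attr_black = linear_ig(weights, image, black_baseline)
attr_gray = linear_ig(weights, image, gray_baseline)

print(attr_black[:2])  # [0.0, 0.0]: black pixels get no attribution
print(attr_gray[:2])   # [-0.5, -1.0]: the same pixels now count against "day"
```

With the gray baseline the black pixels receive negative attributions, reflecting that they lower the daylight score relative to the baseline, whereas the black baseline hides their contribution entirely.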

An extensive discussion of the impact of the baseline on integrated gradients attributions can be found in P. Sturmfels et al., ["Visualizing the Impact of Feature Attribution Baselines"](https://distill.pub/2020/attribution-baselines/).

### Targets

In the context of integrated gradients, the target variable specifies which element of the output should be considered when calculating the gradients. If the output of the model is a scalar, as in the case of single-target regression, a target is not necessary, and the gradients are calculated in a straightforward way.

If the output of the model is a vector, the target value specifies the position of the element in the output vector considered for the calculation of the gradients. In the case of a classification model, the target can be either the true class or the class predicted by the model for a given input.
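
As a minimal illustration of what the target selects, the sketch below takes a batch of made-up class probabilities, derives the predicted class for each row, and picks the corresponding output element, which is the scalar whose gradient would be integrated. The helper names `predicted_classes` and `select_target` and the numbers are hypothetical and not part of alibi.

```python
def predicted_classes(outputs):
    """Index of the largest output in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in outputs]


def select_target(outputs, targets):
    """For each row, pick the output element at the target position."""
    return [row[t] for row, t in zip(outputs, targets)]


# Made-up model outputs for two instances and three classes
outputs = [[0.1, 0.7, 0.2],
           [0.8, 0.1, 0.1]]

targets = predicted_classes(outputs)
selected = select_target(outputs, targets)

print(targets)   # [1, 0]
print(selected)  # [0.7, 0.8]
```

Using the true labels instead of the predicted classes only changes the `targets` list; the selection step is the same.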

## Examples

[MNIST dataset](../examples/integrated_gradients_mnist.nblink)

[ImageNet dataset](../examples/integrated_gradients_imagenet.nblink)

[IMDB dataset text classification](../examples/integrated_gradients_imdb.nblink)