# Integrated gradients method

## Overview

Integrated gradients is a method originally proposed by [Sundararajan et al.](https://arxiv.org/abs/1703.01365) that aims to attribute an importance value to each input feature of a machine learning model based on the gradients of the model's output with respect to the input. 
Roughly speaking, a larger gradient with respect to a given feature indicates that the output varies more when that feature is changed, which suggests the feature has a higher impact on the model's predictions.

However, this is not always the case: as pointed out in the [original paper](https://arxiv.org/abs/1703.01365), an attribution method should satisfy the sensitivity axiom, i.e. if a baseline input instance $x^\prime$ differs from the input instance of interest $x$ in the value of a single feature $x_i$ and the two instances yield different predictions, the attribution given to feature $x_i$ should be non-zero. Since the gradient can flatten (become zero) at the input, simply taking the value of the gradient as an attribution might break the sensitivity axiom.

Integrated gradients overcomes this problem by integrating the gradients along a given path from a baseline $x^\prime$ to $x$ and using the value of the path integral as the attribution. The original paper shows that such attributions satisfy the sensitivity axiom.
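
The sensitivity issue can be illustrated with the one-variable function $f(x) = 1 - \mathrm{ReLU}(1 - x)$ discussed in the original paper, with baseline $x^\prime = 0$ and input $x = 2$. The sketch below is plain Python and does not use alibi; the helper names (`relu`, `f`, `grad_f`) and the midpoint approximation with 50 steps are assumptions made for illustration only.

```python
def relu(z):
    return max(z, 0.0)

def f(x):
    # f rises linearly on [0, 1] and is flat for x >= 1
    return 1.0 - relu(1.0 - x)

def grad_f(x):
    # Derivative of f: 1 where x < 1, 0 where x > 1
    return 1.0 if x < 1.0 else 0.0

baseline, x = 0.0, 2.0

# The prediction changes between the baseline and the input ...
output_change = f(x) - f(baseline)

# ... but the plain gradient at the input is zero, so a gradient-only
# attribution would assign no importance to x (sensitivity is violated).
plain_gradient = grad_f(x)

# Integrated gradients: average the gradient along the straight path from
# the baseline to x (midpoint rule), then scale by (x - baseline).
n_steps = 50
avg_grad = sum(
    grad_f(baseline + (k + 0.5) / n_steps * (x - baseline))
    for k in range(n_steps)
) / n_steps
integrated_gradient = (x - baseline) * avg_grad

print(output_change, plain_gradient, integrated_gradient)
```

Here the plain gradient is zero although the output changes, while the path integral picks up the region where the gradient is non-zero and yields a non-zero attribution.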



## Integrated gradients method

The method is valid for both regression and classification models. In the case of a non-scalar output, such as in classification models or multi-target regression, the gradients should be calculated for one given element of the output. For classification models, the gradients usually refer to the output corresponding to the true class or to the class predicted by the model.

Let us consider an input instance $x,$ a baseline instance $x^\prime$ and a model $M: X \rightarrow Y$ which acts on the feature space $X$ and produces an output $y$ in the output space $Y.$ The attribution $A_i(x, x^\prime)$ for each feature $x_i$ with respect to the baseline value $x_i^\prime$ is calculated as

$$A_i(x, x^\prime) = (x_i - x_i^\prime) \int_0^1 \frac{\partial M(x^\prime + \alpha (x - x^\prime))}{\partial x_i} d\alpha$$

where the integral is taken along a straight path from the baseline $x^\prime$ to the instance $x$ parameterized by $\alpha.$
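
As a minimal numerical sketch of the formula above, the following plain-Python code approximates the integral with a midpoint Riemann sum. The toy model `model`, its hand-written gradient `model_grad` and the function `integrated_gradients` are hypothetical names introduced for illustration; this is not the alibi implementation, which obtains gradients from a TensorFlow model.

```python
def model(x):
    # Hypothetical scalar-output model M(x) = x0^2 + 3 * x0 * x1
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def model_grad(x):
    # Analytical gradient of the toy model with respect to each feature
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

def integrated_gradients(x, baseline, grad_fn, n_steps=50):
    n_features = len(x)
    avg_grad = [0.0] * n_features
    for k in range(n_steps):
        # Midpoint of the k-th sub-interval of [0, 1]
        alpha = (k + 0.5) / n_steps
        # Point on the straight path x' + alpha * (x - x')
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n_features):
            avg_grad[i] += grad[i] / n_steps
    # A_i = (x_i - x'_i) * integral of dM/dx_i along the path
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline, model_grad)

print(attributions)
print(sum(attributions), model(x) - model(baseline))
```

For this toy model the attributions sum to $M(x) - M(x^\prime)$, which is the completeness property described in the original paper.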


## Usage

The alibi implementation of the integrated gradients method is specific to TensorFlow and Keras models. It is possible to calculate the integrated gradients attributions for the model's input features or for each element of an intermediate layer of the model. Specifically,

* If the parameter `layer` is set to `None` as in the example below (`None` is the default value), the attributions are calculated for each input feature.
* If a layer of the model is passed, the attributions are calculated for each element of the layer.

Calculating attributions with respect to an internal layer of the model is particularly useful for models that take text as an input. In this case, the integrated gradients are calculated with respect to the embedding layer (see the [example](../examples/integrated_gradients_imdb.nblink) on the IMDB dataset).

```python 
import tensorflow as tf
from alibi.explainers import IntegratedGradients

model = tf.keras.models.load_model("path_to_your_model")

ig = IntegratedGradients(model,
                         layer=None,
                         n_steps=50,
                         method="gausslegendre")

```

* `model`: A TensorFlow or Keras model.
* `layer`: Layer of the model with respect to which the gradients are calculated. If `None`, gradients are calculated with respect to the input features.
* `n_steps`: The number of steps in the integral approximation.
* `method`: The method for the integral approximation. Available methods are: `riemann_left`, `riemann_right`, `riemann_middle`, `riemann_trapezoid`, `gausslegendre`.
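
To give an intuition for how the approximation rules named above differ, the sketch below applies the four Riemann rules and a fixed two-node Gauss-Legendre rule to the integral of $g(\alpha) = \alpha^2$ over $[0, 1]$, whose exact value is $1/3$. The functions `riemann` and `gauss_legendre_2pt` are hypothetical helpers written for this illustration, not alibi's internals, and the two-node rule is only a stand-in for a Gauss-Legendre rule with `n_steps` nodes.

```python
import math

def riemann(g, n_steps, rule):
    # Approximate the integral of g over [0, 1] with n_steps sub-intervals
    h = 1.0 / n_steps
    if rule == "left":
        return h * sum(g(k * h) for k in range(n_steps))
    if rule == "right":
        return h * sum(g((k + 1) * h) for k in range(n_steps))
    if rule == "middle":
        return h * sum(g((k + 0.5) * h) for k in range(n_steps))
    if rule == "trapezoid":
        return h * sum(0.5 * (g(k * h) + g((k + 1) * h)) for k in range(n_steps))
    raise ValueError(f"unknown rule: {rule}")

def gauss_legendre_2pt(g):
    # Two-node Gauss-Legendre rule mapped from [-1, 1] to [0, 1]:
    # nodes 0.5 -/+ 0.5 / sqrt(3), both with weight 0.5
    offset = 0.5 / math.sqrt(3.0)
    return 0.5 * (g(0.5 - offset) + g(0.5 + offset))

def g(alpha):
    return alpha ** 2

exact = 1.0 / 3.0
errors = {
    rule: abs(riemann(g, 10, rule) - exact)
    for rule in ("left", "right", "middle", "trapezoid")
}
errors["gausslegendre_2pt"] = abs(gauss_legendre_2pt(g) - exact)

for name, err in errors.items():
    print(f"{name}: {err:.2e}")
```

On this smooth integrand the middle and trapezoid rules are considerably more accurate than the left and right rules for the same number of steps, and the Gauss-Legendre rule is exact up to rounding because it integrates low-degree polynomials exactly.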

```python 
explanation = ig.explain(X,
                         baselines=None,
                         features_names=None,
                         target=None,
                         internal_batch_size=100,
                         return_convergence_delta=False,
                         return_predictions=False)

attributions = explanation.data['attributions']
```

* `X`: Instances for which integrated gradients attributions are computed.
* `baselines`: Baselines (starting points of the path integral) for each instance. If an `np.ndarray` is passed, it must have the same shape as `X`. If not provided, all feature values of the baselines are set to 0.
* `features_names`: Names of each feature (optional).
* `target`: Target class for which the gradients are computed. It must be provided if the model output dimension is higher than 1. For regression models, `target` should not be provided. For classification models, `target` can be either the true classes or the classes predicted by the model.
* `internal_batch_size`: Batch size for the internal batching.
* `return_convergence_delta`: If set to `True`, convergence deltas for all examples are returned in the Explanation object.
* `return_predictions`: If set to `True`, the original predictions for all examples are returned in the Explanation object.
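
The sketch below illustrates what a convergence delta can tell you about the choice of `n_steps`. It assumes the delta is the gap between the summed attributions and the change in model output between the baseline and the instance; the exact definition used by alibi is not confirmed here. The toy model, its gradient and all function names are hypothetical, and a left Riemann sum is used so that the approximation error is visible.

```python
def model(x):
    # Hypothetical scalar-output model M(x) = x0^3 + x0 * x1
    return x[0] ** 3 + x[0] * x[1]

def model_grad(x):
    # Analytical gradient of the toy model
    return [3.0 * x[0] ** 2 + x[1], x[0]]

def attributions_left_riemann(x, baseline, n_steps):
    diffs = [xi - b for xi, b in zip(x, baseline)]
    sums = [0.0] * len(x)
    for k in range(n_steps):
        # Left end point of the k-th sub-interval of [0, 1]
        alpha = k / n_steps
        point = [b + alpha * d for b, d in zip(baseline, diffs)]
        for i, grad_i in enumerate(model_grad(point)):
            sums[i] += grad_i / n_steps
    return [d * s for d, s in zip(diffs, sums)]

def convergence_delta(x, baseline, n_steps):
    # Assumed definition: summed attributions minus the output difference
    attrs = attributions_left_riemann(x, baseline, n_steps)
    return sum(attrs) - (model(x) - model(baseline))

x, baseline = [1.0, 2.0], [0.0, 0.0]
deltas = {n: convergence_delta(x, baseline, n) for n in (5, 50, 500)}

for n, delta in deltas.items():
    print(f"n_steps={n}: delta={delta:.5f}")
```

Under this assumption, a delta that stays far from zero suggests increasing `n_steps` or switching to a more accurate approximation method.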

## Examples

[MNIST dataset](../examples/integrated_gradients_mnist.nblink)

[ImageNet dataset](../examples/integrated_gradients_imagenet.nblink)

[IMDB dataset text classification](../examples/integrated_gradients_imdb.nblink)