[[source]](../api/alibi.confidence.linearity_measure.rst)

# Measuring the linearity of machine learning models

## Overview

Machine learning models generally combine linear and non-linear operations: neural networks may include several layers consisting of linear algebra operations followed by non-linear activation functions, while models based on decision trees are by nature highly non-linear. The linearity measure function and class provide an operational definition for the amount of non-linearity of a map acting on vector spaces. Roughly speaking, the amount of non-linearity of the map is defined by how much the output of the map applied to a linear superposition of input vectors differs from the linear superposition of the map's outputs for each individual vector. In the context of supervised learning, this definition is immediately applicable to machine learning models, which are fundamentally maps from an input vector space (the feature space) to an output vector space that may represent probabilities (for classification models) or actual values of quantities of interest (for regression models).

Given an input vector space $V$, an output vector space $W$ and a map $M: V \rightarrow W$,
the amount of non-linearity of the map $M$ in a region $\beta$ of the input space $V$ and relative to some coefficients $\alpha(\vec{v})$ is defined as
\begin{equation}\label{integral}
  L_{\beta, \alpha}^{(M)} = \left\| \int_{\beta} \alpha(\vec{v}) M(\vec{v}) d\vec{v} -
  M\left(\int_{\beta}\alpha(\vec{v})\vec{v}d\vec{v} \right) \right\|,
\end{equation}
where $\vec{v} \in V$ and $\|\cdot\|$ denotes the norm of a vector. 
If we consider a finite number of vectors, the amount of non-linearity can be defined as
\begin{equation}\label{discrete}
  L_{\beta, \alpha}^{(M)} = \left\| \sum_{i}^N \alpha_{i} M(\vec{v}_i)  -
  M\left(\sum_i^N \alpha_i\vec{v}_i \right) \right\|,
\end{equation}
where, with an abuse of notation, $\beta$ is no longer a continuous region in
the input space but a collection of input vectors $\{\vec{v}_i\}$ and $\alpha$ is no longer a function but a collection of real coefficients $\{\alpha_i \}$ with $i \in \{1, ..., N\}$. Note that the second expression may be interpreted as an approximation of the integral quantity defined in the first expression, where the vectors $\{\vec{v}_i\}$ are sampled uniformly in the region $\beta$.
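
The discrete definition can be illustrated with a minimal NumPy sketch. The helper `nonlinearity` and the two toy maps below are hypothetical names introduced only for this illustration; they are not part of the Alibi API. The sketch assumes the model accepts a batch of shape `(N, d)` and that the norm is the Euclidean norm.

```python
import numpy as np

def nonlinearity(model, vectors, alphas):
    """Discrete measure: || sum_i a_i M(v_i) - M(sum_i a_i v_i) ||."""
    vectors = np.asarray(vectors, dtype=float)    # shape (N, d)
    alphas = np.asarray(alphas, dtype=float)      # shape (N,)
    superposed_outputs = alphas @ model(vectors)  # sum_i a_i M(v_i)
    superposed_input = alphas @ vectors           # sum_i a_i v_i
    output_of_superposition = model(superposed_input[None, :])[0]
    return np.linalg.norm(superposed_outputs - output_of_superposition)

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))   # N = 5 input vectors in a 3-dimensional space
a = np.full(5, 1 / 5)         # equal coefficients summing to one
W = rng.normal(size=(3, 2))

linear_map = lambda x: x @ W      # a linear map (toy example)
quadratic_map = lambda x: x ** 2  # an element-wise non-linear map (toy example)

L_linear = nonlinearity(linear_map, V, a)
L_quadratic = nonlinearity(quadratic_map, V, a)
```

For the linear map the two terms coincide up to floating point error, so `L_linear` is numerically zero, while `L_quadratic` is strictly positive.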

## Application to machine learning models

In supervised learning, a model can be considered as a function $M$ mapping vectors from the input space (feature vectors) to vectors in the output space. The output space may represent probabilities in the case of a classification model or values of the target quantities in the case of a regression model. The definition of the linearity measure given above can be applied directly to a regression model, whether single-target or multi-target.

In the case of a classifier, let us denote by $\vec{z}$ the logits vector of the model such that the probabilities of the model $M$ are given by $softmax(\vec{z}).$ Since the activation function of the last layer is usually highly non-linear, it is convenient to apply the definition of linearity given above to the logits vector $\vec{z}.$
In the "white box" scenario, in which we have access to the internal architecture of the model, the vector $\vec{z}$ is accessible and the amount of non-linearity can be calculated immediately. On the other hand, if the only accessible quantities are the output probabilities (the "black box" scenario), we need to invert the last layer's activation function $f$ in order to retrieve $\vec{z}.$ In other words, we define a new map $M^\prime = f^{-1} \circ M(\vec{v})$ and consider $L_{\beta, \alpha}^{(M^\prime)}$ as a measure of the non-linearity of the model. The activation function of the last layer is usually a sigmoid function for binary classification tasks or a softmax function for multi-class classification.
The inversion of the sigmoid function does not present any particular challenge, and the map $M^\prime$ can be written as
\begin{equation}
  M^\prime = -\log \circ \Bigg(\frac{1-M(\vec{v})}{M(\vec{v})}\Bigg).
\end{equation}
On the other hand, the softmax probabilities $\vec{p}$ are defined in terms of the vector $\vec{z}$ as $p_j = e^{z_j}/\sum_k{e^{z_k}},$ where $z_j$ are the components of $\vec{z}$. The inverse $softmax^{-1}$ of the softmax function is thus defined only up to a constant $C$ which does not depend on $j$ but might depend on the input vector $\vec{v}.$ The inverse map $M^\prime = softmax^{-1} \circ M(\vec{v})$ is then given by:
\begin{equation}\label{MInvSoft}
M^\prime = \log \circ M(\vec{v}) + C(\vec{v}),
\end{equation}
where $C(\vec{v})$ is an arbitrary constant depending in general on the input vector $\vec{v}.$
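
The two inversions can be checked numerically with a short NumPy sketch. The helpers `sigmoid` and `softmax` are defined here only for illustration, and the logits are random toy values. The sketch shows that the sigmoid inverse recovers the logit exactly, whereas $\log \circ\, softmax$ recovers the logits only up to an offset that is shared by all components of one input but differs from input to input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax along the last axis, shifted for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Binary case: -log((1 - p) / p) recovers the logit.
z_binary = rng.normal(size=5)
p_binary = sigmoid(z_binary)
z_recovered = -np.log((1.0 - p_binary) / p_binary)

# Multi-class case: log(p) equals z plus an offset constant over components j.
z_multi = rng.normal(size=(5, 3))   # 5 inputs, 3 classes
p_multi = softmax(z_multi)
offset = np.log(p_multi) - z_multi  # plays the role of -C(v) for each input
spread_within_input = offset.max(axis=1) - offset.min(axis=1)
spread_across_inputs = offset[:, 0].max() - offset[:, 0].min()
```

Here `spread_within_input` is numerically zero for every input, while `spread_across_inputs` is not, which is the dependence of $C$ on $\vec{v}$ described above.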

Since in the black box scenario it is not possible to determine the value of the constants $C(\vec{v}),$ henceforth we will ignore them and define the amount of non-linearity of a machine learning model whose output is a probability distribution as
\begin{equation}\label{ConstantGaugeLin}
L_{\beta, \alpha}^{(\log \circ M)} = \left\| \sum_{i}^N \alpha_{i} \log \circ M(\vec{v}_i)  -
  \log \circ M\left(\sum_i^N \alpha_i\vec{v}_i \right)\right\|.
\end{equation}
It must be noted that the quantity above may in general be different from the "actual" amount of non-linearity of the model, i.e. the quantity calculated by accessing the activation vectors $\vec{z}$ directly.
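
A minimal sketch of this discrepancy, under the assumption of a toy classifier whose logits are exactly linear in the input: the measure computed on the logits is numerically zero, while the measure computed on $\log \circ M$ is not, because the ignored constants $C(\vec{v})$ vary with the input. All names below are hypothetical and used only for this illustration.

```python
import numpy as np

def softmax(z):
    """Softmax along the last axis, shifted for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

logits = lambda x: x @ W                     # white box: linear logits
log_probs = lambda x: np.log(softmax(x @ W)) # black box: log of the probabilities

V = rng.normal(size=(6, 4))   # N = 6 input vectors
alphas = np.full(6, 1 / 6)    # equal coefficients summing to one

def nonlinearity(fn):
    """|| sum_i a_i fn(v_i) - fn(sum_i a_i v_i) || for the vectors above."""
    superposed_input = (alphas @ V)[None, :]
    return np.linalg.norm(alphas @ fn(V) - fn(superposed_input)[0])

L_logits = nonlinearity(logits)
L_log_probs = nonlinearity(log_probs)
```

In this toy setting the whole of `L_log_probs` comes from the input-dependent normalisation term, so a non-zero black box value does not by itself imply non-linear logits.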

<!--- However, assuming that $C(\vec{v})= C$  $\forall i$ and $\sum_i \alpha_i = 1,$ it can be shown (see \Cref{app:gauge}) that  $L_{\beta, \alpha}^{M^\prime}$ does not depend on $C$ and is given by
\cref{ConstantGaugeLin}. A more detailed discussion about the role played by the constants $C$ can be found in \Cref{app:gauge}. 
\begin{figure}
    \begin{subfigure}[t]{0.45\textwidth}
        \includegraphics[width=1.0\linewidth]{Lmodels.pdf} 
        \caption{Model comparison}\label{fig:lmodels}
        %\includegraphics[width=\linewidth]{Logistic.png} 
        %\caption{Logistic regression} \label{fig:logistic}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.45\textwidth}
        \includegraphics[width=\linewidth]{xgboost.pdf} 
        \caption{Xgboost classifier} \label{fig:xgboost}
    \end{subfigure}
    \caption{(a) Comparison of the linearity measure $L$ averaged over the whole features space for various models trained on the iris dataset: random forest (RF), xgboost (XB), support vector machine (SM), neural network (NN) and logistic regression (LR). Note that the scale of the x axis is logarithmic. (b) Decision boundaries (left panel) and linearity measure (right panel) for a xgboost classifier in features space. The x and y axis in the plots represent the sepal length and the sepal width, respectively.  Different colours correspond to different predicted classes. The markers represents the data points in the training set.}
\end{figure}
--->

## Implementation

### Sampling
The sampling of vectors in a neighbourhood of the instance of interest $\vec{v}$ can be done in two different ways. The first method, *grid sampling*, consists of sampling according to a uniform distribution from a lattice of points centered at $\vec{v}.$ The density and the size of the lattice are controlled by the resolution parameter ```res``` and the size parameter ```epsilon```. This method is highly efficient and scalable from a computational point of view.
The second method, *knn*, consists of sampling from the same probability distribution the instance $\vec{v}$ was drawn from; this method is implemented by simply selecting the $K$ nearest neighbours to $\vec{v}$ from a training set, when this is available. The second method imposes the constraint that the neighbourhood of $\vec{v}$ must include only vectors from the training set, and as a consequence it will exclude out-of-distribution instances from the computation of linearity.
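
The two sampling strategies can be sketched as follows. This is an illustrative approximation, not the Alibi implementation: the helpers `sample_grid` and `sample_knn`, their signatures and their default values are assumptions, and the lattice is assumed to extend `epsilon` times the feature range on each side of $\vec{v}$ with `res` steps per side.

```python
import numpy as np

def sample_grid(v, feature_range, epsilon, res, nb_samples, rng):
    """Uniformly sample points of a lattice centered at v (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    feature_range = np.asarray(feature_range, dtype=float)  # shape (d, 2): min, max
    half_width = epsilon * (feature_range[:, 1] - feature_range[:, 0])
    # Integer lattice coordinates in {-res, ..., res} for each feature.
    steps = rng.integers(-res, res + 1, size=(nb_samples, v.shape[0]))
    return v + steps / res * half_width

def sample_knn(v, X_train, nb_samples):
    """Select the nb_samples training vectors closest to v (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.linalg.norm(X_train - v, axis=1)
    nearest = np.argsort(distances)[:nb_samples]
    return X_train[nearest]

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 10.0, size=(200, 2))
feature_range = np.stack([X_train.min(axis=0), X_train.max(axis=0)], axis=1)
v = np.array([5.0, 5.0])

grid_samples = sample_grid(v, feature_range, epsilon=0.05, res=10, nb_samples=20, rng=rng)
knn_samples = sample_knn(v, X_train, nb_samples=20)
```

The grid samples may fall anywhere on the lattice, including points never seen in training, whereas the knn samples are always rows of `X_train`.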

### Pairwise vs global linearity
In the Alibi implementation we propose two different methods to associate a value of the linearity measure with $\vec{v},$ both based on the integral and discrete definitions given in the Overview. The first method consists of measuring the *global* linearity in a region around $\vec{v}.$ This means that we need to sample $N$ vectors $\{\vec{v}_i\}$ from a region $\beta$ of the input space around $\vec{v}$ and apply
\begin{equation}
  L_{\beta, \alpha}^{(M)} = \left\| \sum_{i}^N \alpha_{i} M(\vec{v}_i)  -
  M\left(\sum_i^N \alpha_i\vec{v}_i \right) \right\|.
\end{equation}

The second method consists of measuring the *pairwise* linearity between the instance of interest and other vectors close to it, averaging over all such pairs. In other words, we sample $N$ vectors $\{\vec{v}_i\}$ from $\beta$ as in the global method, but in this case we calculate the amount of non-linearity $L_{(\vec{v},\vec{v}_i),\alpha}$ for every pair of vectors $(\vec{v}, \vec{v}_i)$ and average over all the pairs. Given two coefficients $\{\alpha_0, \alpha_1\}$ such that $\alpha_0 + \alpha_1 = 1,$ we can apply the discrete definition to each pair $(\vec{v}, \vec{v}_i)$ and define the pairwise linearity measure relative to the instance of interest $\vec{v}$ as
\begin{equation}\label{pairwiselin}
L^{(M)} = \frac{1}{N} \sum_{i=1}^N \left\|\alpha_0 M(\vec{v}) +  \alpha_1 M(\vec{v}_i) - M(\alpha_0 \vec{v} + \alpha_1 \vec{v}_i)\right\|.
\end{equation}

The two methods are slightly different from a conceptual point of view: the global linearity measure combines all $N$ vectors sampled in $\beta$ in a single superposition, and can be conceptually regarded as a direct approximation of the integral definition given in the Overview. Thus, the quantity is strongly linked to the model behavior in the whole region $\beta.$ On the other hand, the pairwise linearity measure is an average over pairs of superimposed vectors, with the instance of interest $\vec{v}$ included in each pair. For that reason, it is conceptually more tied to the instance $\vec{v}$ itself than to the region $\beta$ around it.
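
The difference between the two aggregations can be made concrete with a small NumPy sketch. The functions `global_nonlinearity` and `pairwise_nonlinearity` are hypothetical helpers written for this illustration, assuming a Euclidean norm and a model that accepts batches of shape `(N, d)`; they are not the Alibi implementation.

```python
import numpy as np

def global_nonlinearity(model, neighbours, alphas):
    """One superposition of all N sampled vectors."""
    superposed_input = (alphas @ neighbours)[None, :]
    return np.linalg.norm(alphas @ model(neighbours) - model(superposed_input)[0])

def pairwise_nonlinearity(model, v, neighbours, alpha_0=0.5, alpha_1=0.5):
    """Average over the N pairs (v, v_i), with alpha_0 + alpha_1 = 1."""
    mixed_inputs = alpha_0 * v + alpha_1 * neighbours                    # (N, d)
    mixed_outputs = alpha_0 * model(v[None, :]) + alpha_1 * model(neighbours)
    per_pair = np.linalg.norm(mixed_outputs - model(mixed_inputs), axis=1)
    return per_pair.mean()

rng = np.random.default_rng(0)
v = np.array([1.0, -2.0, 0.5])                      # instance of interest
neighbours = v + 0.1 * rng.normal(size=(10, 3))     # N = 10 sampled vectors
alphas = np.full(10, 1 / 10)
W = rng.normal(size=(3, 2))

linear_map = lambda x: x @ W      # toy linear model
quadratic_map = lambda x: x ** 2  # toy non-linear model

L_global_linear = global_nonlinearity(linear_map, neighbours, alphas)
L_pairwise_linear = pairwise_nonlinearity(linear_map, v, neighbours)
L_global_quadratic = global_nonlinearity(quadratic_map, neighbours, alphas)
L_pairwise_quadratic = pairwise_nonlinearity(quadratic_map, v, neighbours)
```

Both aggregations vanish numerically for the linear map and are positive for the quadratic one, but only the pairwise version evaluates the model at $\vec{v}$ itself in every term.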

## Usage

### LinearityMeasure class

Given a ```model``` class with a ```predict``` method that returns a probability distribution in the case of a classifier or numeric values in the case of a regressor, the linearity measure $L$ around an instance of interest $X$ can be calculated using the class ```LinearityMeasure``` as follows:

```python 
from alibi.confidence.model_linearity import LinearityMeasure

predict_fn = lambda x: model.predict(x)

lm = LinearityMeasure()
lm.fit(X_train)
L = lm.score(predict_fn, X)
```

Here $X_{train}$ are the feature vectors the model was trained with. The ```feature_range``` is inferred from $X_{train}$ in the ```fit``` step.

### linearity_measure function

Given a ```model``` class with a ```predict``` method that returns a probability distribution in the case of a classifier or numeric values in the case of a regressor, the linearity measure $L$ around an instance of interest $X$ can also be calculated using the ```linearity_measure``` function as follows:

```python
from alibi.confidence.model_linearity import linearity_measure, _infer_feature_range

predict_fn = lambda x: model.predict(x)
features_range = _infer_feature_range(X_train)

L = linearity_measure(predict_fn, X, feature_range=features_range)
```

Note that in this case the ```feature_range``` must be passed explicitly to the function, so it has to be inferred beforehand.

## Examples

[Iris dataset](../examples/linearity_measure_iris.nblink)

[Fashion MNIST dataset](../examples/linearity_measure_fashion_mnist.nblink)