# ProtoSelect

## Overview

[Bien and Tibshirani (2012)](https://arxiv.org/pdf/1202.5933.pdf) proposed ProtoSelect, a prototype selection method that aims to produce not only a condensed view of a dataset but also an interpretable model. Prototypes can be defined as instances that are representative of the entire reference data distribution. Formally, consider a dataset of training points $\mathcal{X} = \{x_1, ..., x_n \} \subset \mathbf{R}^p$ and their corresponding labels $\mathcal{Y} = \{y_1, ..., y_n\}$, where $y_i \in \{1, 2, ..., L\}$. ProtoSelect finds sets $\mathcal{P}_{l} \subseteq \mathcal{X}$ for each class $l$ such that the union of $\mathcal{P}_1, \mathcal{P}_2, ..., \mathcal{P}_L$ provides a distilled view of the reference dataset $(\mathcal{X}, \mathcal{Y})$.


Given the sets of prototypes, one can construct a simple interpretable classifier given by:
$$
    \hat{c}(x) = \arg\min_{l} \min_{z \in \mathcal{P}_l} d(x, z)
$$

Note that the classifier defined in the equation above is equivalent to 1-KNN if each set $\mathcal{P}_l$ consists only of instances belonging to class $l$.
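The decision rule above can be illustrated with a minimal NumPy sketch. It assumes that $d$ is the Euclidean distance and that the prototypes are stored as one array with a parallel array of class labels; the function name `nearest_prototype_predict` and the toy data are hypothetical and not part of *Alibi*.

```python
import numpy as np

def nearest_prototype_predict(X, prototypes, prototype_labels):
    """Assign each row of X the label of its closest prototype (Euclidean distance)."""
    # Pairwise distances, shape (n_samples, n_prototypes).
    dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=-1)
    # The argmin over all prototypes equals arg min_l min_{z in P_l} d(x, z).
    return prototype_labels[np.argmin(dists, axis=1)]

# Toy data: one prototype for class 0 and two prototypes for class 1.
prototypes = np.array([[0.0, 0.0], [4.0, 4.0], [5.0, 0.0]])
prototype_labels = np.array([0, 1, 1])
X_query = np.array([[0.5, -0.2], [4.2, 3.9], [4.0, 0.5]])
predictions = nearest_prototype_predict(X_query, prototypes, prototype_labels)
```

Storing the prototypes in a single labeled array avoids looping over the sets $\mathcal{P}_l$, since the inner minimum over each set and the outer minimum over classes collapse into one minimum over all prototypes.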


## ProtoSelect method

ProtoSelect is designed so that each prototype set satisfies a number of desired properties. For a set $\mathcal{P}_l \subset \mathcal{X}$, the neighborhood of a point $x_i \in \mathcal{P}_l$ is given by the points contained in an $\epsilon$-ball centered at $x_i$, denoted as $B(x_i, \epsilon)$. Thus, given a radius $\epsilon$ for a point $x_i$, we say that another point $x_j$ is covered by $x_i$ if $x_j$ is contained in the $\epsilon$-ball centered at $x_i$. A visualization of the prototype sets for various values of the radius $\epsilon$ is depicted in the following figure:

![ProtoSelect](protoselect_overview.png)

Bien and Tibshirani, *PROTOTYPE SELECTION FOR INTERPRETABLE CLASSIFICATION*, 2012
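The notion of coverage can be made concrete with a short sketch. It assumes the Euclidean distance and a closed ball (a point at distance exactly $\epsilon$ counts as covered); the helper `coverage_matrix` is a hypothetical name used only for illustration.

```python
import numpy as np

def coverage_matrix(X, Z, eps):
    """Boolean matrix C where C[i, j] is True if X[i] lies in B(Z[j], eps)."""
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return dists <= eps

# Four one-dimensional points; every point is also a candidate ball center.
X = np.array([[0.0], [1.0], [2.5], [6.0]])
covered = coverage_matrix(X, X, eps=1.5)

# Number of points covered by each candidate center (each center covers itself).
cover_counts = covered.sum(axis=0)
```

Increasing `eps` can only add entries to the coverage matrix, which is why larger radii tend to require fewer prototypes to cover the same data.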

A desirable prototype set $\mathcal{P}_l$ for a class $l$ would satisfy the following properties:
* covers as many training points of class $l$ as possible.
* covers as few training points of classes other than $l$ as possible.
* is sparse (i.e., contains as few prototype instances as possible).

Formally, let us first define $\alpha_{j}^{(l)} \in \{0, 1\}$ to indicate whether we select $x_j$ to be in $\mathcal{P}_l$. Then we can write the three properties as an integer program as follows:

$$\min_{\alpha_{j}^{(l)}, \xi_{i}, \nu_{i}} \sum_{i}{\xi_i} + \sum_{i}{\nu_i} + \lambda \sum_{j, l}\alpha_{j}^{(l)} \text{ such that}$$

$$\sum_{j: x_i \in B(x_j, \epsilon)} \alpha_{j}^{(y_i)} \ge 1 - \xi_i, \forall x_i \in \mathcal{X}, \text{ (a)}$$

$$\sum_{j: x_i \in B(x_j, \epsilon), l \neq y_i} \alpha_{j}^{(l)} \le 0 + \nu_i, \forall x_i \in \mathcal{X}, \text{ (b)}$$


$$ \alpha_{j}^{(l)} \in \{0, 1\} \quad \forall j, l,$$

$$ \xi_i, \nu_i \ge 0. $$

For each training point $x_i$, we introduce the slack variables $\xi_i$ and $\nu_i$. Before explaining the two constraints, note that $\sum_{j: x_i \in B(x_j, \epsilon)} \alpha_{j}^{(l)}$ counts the number of $\epsilon$-balls centered at prototypes $x_j \in \mathcal{P}_l$ that contain the instance $x_i$. Constraint (a) encourages each training point $(x_i, y_i)$ to be covered by the $\epsilon$-ball of at least one prototype belonging to the same class $y_i$. Constraint (b), on the other hand, encourages $x_i$ not to belong to any $\epsilon$-ball centered at a prototype belonging to a different class $l \ne y_i$.
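To see how the slack variables enter the objective, the following sketch evaluates the integer program for a fixed selection $\alpha$, setting each slack to the smallest value that satisfies its constraint. It is an illustration under stated assumptions rather than a solver: class labels are 0-indexed, the coverage matrix is precomputed, and the function name `integer_program_objective` is hypothetical.

```python
import numpy as np

def integer_program_objective(covered, y, alpha, lam):
    """Objective value for a fixed 0/1 selection `alpha`, with slacks at their minimum.

    covered[i, j] is True if x_i lies in B(x_j, eps); alpha[j, l] = 1 if x_j is in P_l.
    """
    n = covered.shape[0]
    # counts[i, l]: number of selected class-l prototypes whose ball contains x_i.
    counts = covered.astype(int) @ alpha
    same = counts[np.arange(n), y]       # left-hand side of constraint (a)
    other = counts.sum(axis=1) - same    # left-hand side of constraint (b)
    xi = np.maximum(0, 1 - same)         # smallest slack satisfying (a)
    nu = other                           # smallest slack satisfying (b)
    return xi.sum() + nu.sum() + lam * alpha.sum()

X = np.array([[0.0], [1.0], [2.5], [6.0]])
y = np.array([0, 0, 1, 1])               # classes are 0-indexed in this sketch
covered = np.abs(X - X.T) <= 1.5         # covered[i, j] for eps = 1.5
alpha = np.zeros((4, 2), dtype=int)
alpha[0, 0] = 1                          # x_0 is a prototype for class 0
alpha[3, 1] = 1                          # x_3 is a prototype for class 1
objective = integer_program_objective(covered, y, alpha, lam=0.5)
```

In this toy selection the point $x_2$ is not covered by any prototype of its own class ($\xi_2 = 1$), no point falls inside a ball of the wrong class, and the two prototypes cost $2\lambda = 1$, so the objective equals 2.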

Because the integer program defined above cannot be solved in polynomial time, the authors proposed two alternative solutions. The first one relaxes the objective and transforms the integer program into a linear program, for which post-processing is required to ensure feasibility of the solution. We refer the reader to the [paper](https://arxiv.org/pdf/1202.5933.pdf) for more details. The second one, recommended and implemented in *Alibi*, follows a greedy approach. Given the current state $(\mathcal{P}_1, ..., \mathcal{P}_L)$, the next iteration updates the state to $(\mathcal{P}_1, ..., \mathcal{P}_{l} \cup \{x_j\}, ..., \mathcal{P}_L)$, where the pair $(x_j, l)$ is selected to maximize the objective $\Delta\text{Obj}(x_j, l) = \Delta \xi(x_j,l) - \Delta\nu(x_j, l) - \lambda$, given that:

$$\Delta \xi(x_j, l) = \mid \mathcal{X}_l \cap (B(x_j, \epsilon) \setminus \cup_{x_{j^\prime} \in \mathcal{P}_l}B(x_{j^\prime}, \epsilon)) \mid \text{ (a)}$$

$$\Delta \nu(x_j, l) = \mid B(x_j, \epsilon) \cap (\mathcal{X} \setminus \mathcal{X}_l) \mid \text{ (b)}.$$

Note that $\Delta \xi(x_j, l)$ counts the number of new instances (i.e. not already covered by the existing prototypes) belonging to class $l$ that $x_j$ covers in its $\epsilon$-ball. On the other hand, $\Delta \nu(x_j, l)$ counts how many instances belonging to a class other than $l$ the element $x_j$ covers. Finally, $\lambda$ is the penalty/cost of adding a new prototype, which encourages sparsity (a lower number of prototypes). Intuitively, a good prototype for a class $l$ will cover as many new instances belonging to class $l$ as possible (i.e. maximize $\Delta \xi(x_j, l)$) and avoid covering elements outside the class $l$ (i.e. minimize $\Delta \nu(x_j, l)$). The prototype selection algorithm stops when all $\Delta\text{Obj}(x_j, l)$ are lower than 0.
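The greedy update can be sketched in a few lines of NumPy. This is a simplified illustration and not the *Alibi* implementation: it assumes the Euclidean distance, selects prototypes from the reference set itself, and stops as soon as the best gain is not strictly positive (a choice made here to avoid zero-gain additions). The function name `greedy_protoselect` and its arguments are hypothetical.

```python
import numpy as np

def greedy_protoselect(X, y, eps, lam, max_prototypes=None):
    """Greedy prototype selection; returns (candidate_index, class_label) pairs."""
    classes = np.unique(y)
    # covered[i, j]: x_i lies in B(x_j, eps).
    covered = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) <= eps
    is_class = y[:, None] == classes[None, :]
    # uncovered[i, l]: x_i has class l and is not yet covered by a prototype in P_l.
    uncovered = is_class.copy()
    # delta_nu[j, l]: points of classes other than l inside B(x_j, eps); constant.
    delta_nu = covered.T.astype(int) @ (~is_class).astype(int)
    limit = covered.size if max_prototypes is None else max_prototypes
    selected = []
    while len(selected) < limit:
        # delta_xi[j, l]: class-l points newly covered if x_j is added to P_l.
        delta_xi = covered.T.astype(int) @ uncovered.astype(int)
        obj = delta_xi - delta_nu - lam
        j, l = np.unravel_index(np.argmax(obj), obj.shape)
        if obj[j, l] <= 0:  # no remaining candidate improves the objective
            break
        selected.append((int(j), int(classes[l])))
        uncovered[covered[:, j], l] = False
    return selected

# Two well-separated one-dimensional clusters, one per class.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
selected = greedy_protoselect(X, y, eps=0.6, lam=0.5)
```

On this toy data the middle point of each cluster covers all three points of its class, so the sketch selects exactly one prototype per class and then stops because every remaining gain equals $-\lambda$.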

## Usage


```python
from alibi.prototypes import ProtoSelect
from alibi.utils.kernel import EuclideanDistance

# `eps` and `preprocess_fn` are user-defined; see the parameter descriptions below
explainer = ProtoSelect(kernel_distance=EuclideanDistance(), eps=eps, preprocess_fn=preprocess_fn)
```

* `kernel_distance`: Kernel to be used. Use `EuclideanDistance`.
* `eps`: Epsilon ball size.
* `lbd`: Penalty for each prototype. Encourages a lower number of prototypes to be selected.
* `batch_size`: Batch size to be used for kernel matrix computation.
* `preprocess_fn`: Preprocessing function for kernel matrix computation.
* `verbose`: Whether to display a progress bar while computing the prototypes.


Following the initialization, we need to fit the explainer.

```python
explainer = explainer.fit(X=X, X_labels=X_labels, Y=X)
```

* `X`: Reference dataset to be summarized.
* `X_labels`: Labels of the reference dataset.
* `Y`: Dataset to choose the prototypes from. If `None`, the prototypes will be selected from the reference dataset `X`.

Note that the reference set, `X`, and the dataset of potential prototypes, `Y`, do not have to be the same. This means that we can choose any set `Y` to select the prototypes from. Furthermore, note that we only need to specify the labels for the `X` set through `X_labels`, but not for `Y`. If the labels `X_labels` are missing, the method implicitly assumes that all the instances belong to the same class. This means that the second term in the objective, $\Delta\nu(x_j, l)$, will be 0. Thus, the algorithm will try to find prototypes that cover as many data instances as possible, with minimum overlap between their corresponding $\epsilon$-balls. In this setting, the quality of the solution is fully determined by the choice of the radius $\epsilon$. Note that one obtains a trivial summarization containing a single prototype if the chosen radius is large enough for the $\epsilon$-ball to cover all data instances, regardless of which instance is chosen as its center.


Finally, we can obtain the explanation by requesting the maximum number of prototypes to be returned:

```python
explanation = explainer.explain(num_prototypes=num_prototypes)
```

* `num_prototypes`: Maximum number of prototypes to be selected.

As previously mentioned, the algorithm stops when the objective is less than 0 for all the remaining instances in the set of potential prototypes. This means that the algorithm can return fewer prototypes than requested.

Another important observation is that the explanation returns the prototypes with their corresponding labels although no labels were provided for the set `Y`. This is possible since each prototype $x$ will belong to a prototype set $\mathcal{P}_l$, and thus we can assign a label $l$ to $x$. Following the explanation step, one can train an interpretable 1-KNN classifier on the returned prototypes even for an unlabeled dataset `Y`. 

## Hyperparameter selection

*Alibi* exposes a cross-validation hyperparameter selection method for the radius $\epsilon$ when the Euclidean distance is used. The method returns the $\epsilon$ radius value that achieves the best accuracy score on a 1-KNN classification task.

```python
from alibi.prototypes.protoselect import cv_protoselect_euclidean

cv = cv_protoselect_euclidean(refset=(X_ref, X_ref_labels),
                              protoset=(X_proto, ),          # passed as a tuple for consistency
                              valset=(X_val, X_val_labels),
                              num_prototypes=num_prototypes,
                              quantiles=(0., 0.4),
                              preprocess_fn=preprocess_fn)
```

The method API is flexible and allows for various arguments such as setting a predefined $\epsilon$-range, the number of equidistant bins, the number of cross-validation splits when the validation set is not provided, etc. We refer the reader to the documentation page for a full parameter description.

The best $\epsilon$-radius can be accessed through `cv['best_eps']`. The object also contains other metadata gathered throughout the hyperparameter search.
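To give some intuition for the `quantiles` argument and the equidistant bins, the sketch below shows one plausible way of building a grid of candidate radii from the pairwise Euclidean distances between the reference set and the candidate prototypes. This is an assumption made for illustration and should not be read as the exact procedure used by *Alibi*; the helper `epsilon_grid` and its `grid_size` argument are hypothetical.

```python
import numpy as np

def epsilon_grid(X_ref, X_proto, quantiles=(0.0, 0.4), grid_size=5):
    """Hypothetical helper: equidistant eps candidates between two distance quantiles."""
    # Flattened pairwise Euclidean distances between reference and candidate points.
    dists = np.linalg.norm(X_ref[:, None, :] - X_proto[None, :, :], axis=-1).ravel()
    low, high = np.quantile(dists, quantiles)
    return np.linspace(low, high, num=grid_size)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
grid = epsilon_grid(X_demo, X_demo)
```

Restricting the upper quantile keeps the search away from very large radii, for which a single prototype would cover most of the data and the summary would become trivial.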

## Data modalities

The method can be applied to any data modality by passing a `preprocess_fn` that returns a numerical feature representation of the data.
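As a minimal illustration, a `preprocess_fn` for images could simply flatten and rescale the pixel values. The function below is a hypothetical example that assumes 8-bit image arrays; in practice one would more likely use a pretrained feature extractor, as in the image example further down.

```python
import numpy as np

def preprocess_fn(images):
    """Hypothetical preprocessing: flatten uint8 images into float vectors in [0, 1]."""
    images = np.asarray(images, dtype=np.float32)
    return images.reshape(len(images), -1) / 255.0

# A batch of three 8x8 single-channel images with all pixels at maximum intensity.
batch = np.full((3, 8, 8, 1), 255, dtype=np.uint8)
features = preprocess_fn(batch)
```

The only requirement illustrated here is that the function maps a batch of raw instances to a two-dimensional numerical array with one row per instance, on which distances can be computed.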

## Prototypes visualization for image modality

As proposed by [Bien and Tibshirani (2012)](https://arxiv.org/pdf/1202.5933.pdf), one can view the prototypes in a 2D image scatter plot. The size of each prototype is proportional to the log of the number of correct-class training images covered by that prototype.

![ProtoSelect ImageNet](protoselect_imagenet.png)

Prototypes of a subsampled ImageNet dataset containing 10 classes using a ResNet50 pretrained feature extractor.

```python
import umap
from alibi.prototypes.protoselect import visualize_prototypes

# define 2D reducer
reducer = umap.UMAP(random_state=26)
reducer = reducer.fit(preprocess_fn(X_train))

# display prototypes in 2D
visualize_prototypes(explanation=explanation,
                     refset=(X_ref, X_ref_labels),
                     reducer=reducer.transform,
                     preprocess_fn=preprocess_fn)
```

* `explanation`: Explanation object.
* `refset`: Tuple, `(X_ref, X_ref_labels)`, consisting of the reference data instances with the corresponding reference labels.
* `reducer`: 2D reducer. Reduces the input feature representation to 2D. Note that the reducer operates directly on the input instances if `preprocess_fn=None`. If the `preprocess_fn` is specified, the reducer will be called on the feature representation obtained after calling `preprocess_fn` on the input instances.


Here we used a [UMAP](https://arxiv.org/abs/1802.03426) 2D reducer, but any other dimensionality reduction method will do. The `visualize_prototypes` method exposes other arguments to control how the images will be displayed. We refer the reader to the method's documentation for further details.

## Examples

[Tabular and image datasets](../examples/protoselect.ipynb)