# Quickstart

This notebook is a hands-on introduction that demonstrates the most important features of the `metriculous` library and explains its core concepts.

In [1]:
# %load_ext autoreload
# %autoreload 2

import numpy as np

## `ClassificationEvaluator`
Let's start with a demonstration of how `metriculous` can be used to evaluate and compare a set of machine learning models.
We will train and evaluate a small set of classifiers on the Iris dataset, which is included in Scikit-Learn.
The Iris dataset contains 150 flowers, each belonging to one of three classes: _setosa_, _versicolor_, _virginica_.

In this example we load the data, train a number of machine learning models, and compare them with the `ClassificationEvaluator` included in `metriculous`.

#### Load data

In [2]:
from sklearn.datasets import load_iris

iris = load_iris()
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [3]:
iris.data.shape

(150, 4)

In [4]:
iris.target.shape

(150,)

In [5]:
list(iris.target_names)

['setosa', 'versicolor', 'virginica']

#### Train models

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

train_indices, test_indices = train_test_split(
    np.arange(len(iris.data)), test_size=0.7, random_state=42
)

models = [
    (
        "LogisticRegression",
        LogisticRegression(multi_class="auto", solver="lbfgs", random_state=42),
    ),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
    ("Dummy", DummyClassifier(strategy="stratified", random_state=42)),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=42)),
]

for name, model in models:
    model.fit(iris.data[train_indices], iris.target[train_indices])

#### Compare models
`metriculous` provides a `Comparator` class that serves to evaluate a sequence of prediction objects against a known ground truth, and to compare them. A `Comparator` needs to be initialized with an `Evaluator` object that computes the actual performance metrics and creates charts for each of the prediction objects. A default `Evaluator` implementation named `ClassificationEvaluator` is included in `metriculous` and it aims to satisfy the most common requirements for classification problems.

Let's use the two components to evaluate and compare our Iris classifiers:

In [7]:
import metriculous

test_targets_one_hot = np.eye(len(iris.target_names))[iris.target[test_indices]]

metriculous.Comparator(
    metriculous.evaluators.ClassificationEvaluator(
        # Note: All initialization parameters are optional.
        class_names=list(iris.target_names),
        top_n_accuracies=[1, 2, 3],
        filter_quantities=lambda quantity_name: quantity_name
        != "Average Precision setosa vs Rest",
        class_label_rotation_x=np.pi / 4,
        class_label_rotation_y=np.pi / 4,
    ),
).compare(
    ground_truth=test_targets_one_hot,
    model_predictions=[
        model.predict_proba(iris.data[test_indices]) for name, model in models
    ],
    model_names=[name for name, model in models],
    # sample_weights=np.array([0.5, 2.0, 1.0])[iris.target[test_indices]],
).display()

| Quantity | LogisticRegression | DecisionTree | Dummy | RandomForest |
|---|---|---|---|---|
| Accuracy | 0.962 | 0.924 | 0.352 | 0.924 |
| ROC AUC Macro Average | 0.997 | 0.941 | 0.522 | 0.995 |
| ROC AUC Micro Average | 0.998 | 0.943 | 0.514 | 0.995 |
| F1-Score Macro Average | 0.959 | 0.918 | 0.348 | 0.918 |
| F1-Score Micro Average | 0.962 | 0.924 | 0.352 | 0.924 |
| Top-1 Accuracy | 0.962 | 0.924 | 0.352 | 0.924 |
| Top-2 Accuracy | 1.0 | 1.0 | 0.638 | 1.0 |
| Top-3 Accuracy | 1.0 | 1.0 | 1.0 | 1.0 |
| ROC AUC setosa vs Rest | 1.0 | 1.0 | 0.52 | 1.0 |
| ROC AUC versicolor vs Rest | 0.995 | 0.903 | 0.522 | 0.993 |

| Quantity | LogisticRegression | DecisionTree | Dummy | RandomForest |
|---|---|---|---|---|
| Mean KLD(P=target\|\|Q=prediction) | 0.189 | inf | inf | 0.146 |
| Log Loss | 1.316 | 2.632 | 22.368 | 2.632 |
| Brier Score Loss | 0.029 | 0.051 | 0.432 | 0.03 |
| Brier Score Loss (Soft Targets) | 0.029 | 0.051 | 0.432 | 0.03 |

| Quantity | LogisticRegression | DecisionTree | Dummy | RandomForest |
|---|---|---|---|---|
| Max Entropy | 0.769 | 0.0 | 0.0 | 0.913 |
| Mean Entropy | 0.392 | 0.0 | 0.0 | 0.139 |
| Min Entropy | 0.051 | 0.0 | 0.0 | 0.0 |
| Max Probability | 0.991 | 1.0 | 1.0 | 1.0 |
| Min Probability | 0.0 | 0.0 | 0.0 | 0.0 |
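
The `compare` call above receives one-hot encoded ground truth and one row of class probabilities per sample. As a minimal sketch of how the `np.eye(...)[...]` indexing, the commented-out `sample_weights` line, and a Top-N accuracy could work, the following uses small made-up arrays (not the Iris results). `top_n_accuracy` is a hypothetical helper written for this illustration and is not part of `metriculous`:

```python
import numpy as np

# Made-up integer labels for 4 samples and 3 classes (illustration only).
labels = np.array([0, 2, 1, 2])
n_classes = 3

# Indexing the identity matrix with the labels yields one one-hot row per sample.
one_hot = np.eye(n_classes)[labels]

# Indexing a per-class weight vector the same way yields one weight per sample.
class_weights = np.array([0.5, 2.0, 1.0])
sample_weights = class_weights[labels]


def top_n_accuracy(one_hot_targets, probabilities, n):
    """Fraction of samples whose true class is among the n most probable classes."""
    true_classes = np.argmax(one_hot_targets, axis=1)
    # argsort is ascending, so the last n columns hold the n most probable classes.
    top_n = np.argsort(probabilities, axis=1)[:, -n:]
    hits = (top_n == true_classes[:, None]).any(axis=1)
    return float(hits.mean())


# Made-up predicted probabilities, one row per sample.
probabilities = np.array(
    [
        [0.7, 0.2, 0.1],  # true class 0 is ranked first
        [0.5, 0.2, 0.3],  # true class 2 is ranked second
        [0.1, 0.8, 0.1],  # true class 1 is ranked first
        [0.6, 0.3, 0.1],  # true class 2 is ranked third
    ]
)

for n in (1, 2, 3):
    print(f"Top-{n} accuracy:", top_n_accuracy(one_hot, probabilities, n))
```

With as many allowed guesses as there are classes, every sample counts as a hit, which is why the Top-3 Accuracy row above is 1.0 even for the `Dummy` model.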


## Concepts & Components

The comparison we just saw is based on various building blocks that `metriculous` exposes to the user for customizability. Let's go through them one by one, starting with the simplest ones.

### `Quantity`
A `Quantity` is a simple data container designed to hold the result of a measurement and some additional information. A few examples:

In [8]:
q1 = metriculous.Quantity(name="Cross-entropy", value=0.731, higher_is_better=False)

q1

Quantity(name='Cross-entropy', value=0.731, higher_is_better=False, description=None)

In [9]:
q2 = metriculous.Quantity(
    name="Accuracy",
    value=0.93,
    higher_is_better=True,
    description="Fraction of correctly classified datapoints",
)

q2

Quantity(name='Accuracy', value=0.93, higher_is_better=True, description='Fraction of correctly classified datapoints')

In [10]:
q3 = metriculous.Quantity(
    name="Fraction of cat predictions",
    value=0.47,
    higher_is_better=None,
    description="Fraction of datapoints that were classified as class 'cat'",
)

q3

Quantity(name='Fraction of cat predictions', value=0.47, higher_is_better=None, description="Fraction of datapoints that were classified as class 'cat'")

### `Evaluation`

An `Evaluation` consists of a model name, a list of `Quantity`s, and a list of callables that
generate [Bokeh](https://bokeh.pydata.org/en/latest/) figures.
Optionally, you can specify a primary metric by passing the name of one of the quantities.
This is to indicate which quantity should be used for model selection.

In [11]:
from bokeh.plotting import figure


def make_figure(title):
    p = figure(title=title)
    p.line([0, 1, 2, 3], np.random.random(size=4), line_width=2)
    return p


evaluation = metriculous.Evaluation(
    model_name="MyModel",
    quantities=[q1, q2, q3],
    lazy_figures=[lambda: make_figure("Interesting Chart for MyModel")],
    primary_metric="Accuracy",
)

evaluation

Evaluation(model_name='MyModel', quantities=[Quantity(name='Cross-entropy', value=0.731, higher_is_better=False, description=None), Quantity(name='Accuracy', value=0.93, higher_is_better=True, description='Fraction of correctly classified datapoints'), Quantity(name='Fraction of cat predictions', value=0.47, higher_is_better=None, description="Fraction of datapoints that were classified as class 'cat'")], lazy_figures=[<function <lambda> at 0x11d3fd2f0>], primary_metric='Accuracy')
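
The `lazy_figures` field holds callables rather than finished figures, so a chart is only built when one of those callables is invoked. The following is a minimal sketch of that idea in plain Python. `make_expensive_chart` and the strings it returns are hypothetical stand-ins for functions that build real Bokeh figures:

```python
from typing import Callable, List

build_log: List[str] = []


def make_expensive_chart(title: str) -> str:
    """Hypothetical stand-in for a function that builds a Bokeh figure."""
    build_log.append(title)
    return f"<figure: {title}>"


# Wrapping each call in a lambda defers the work until the callable is invoked.
lazy_charts: List[Callable[[], str]] = [
    lambda: make_expensive_chart("Chart A"),
    lambda: make_expensive_chart("Chart B"),
]

# Nothing has been built at this point.
assert build_log == []

# The charts are created only when they are actually needed, e.g. for display.
charts = [make_chart() for make_chart in lazy_charts]
print(charts)
```

When such lambdas are created in a loop, bind the loop variable as a default argument (for example `lambda t=title: make_expensive_chart(t)`), because Python closures look up variables at call time rather than at definition time.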

### `Evaluator`
An `Evaluator` is an interface. Implementations are expected to implement the method `evaluate`, which has to return an `Evaluation`. The purpose of an `Evaluator` is to compare a model prediction to the ground truth, compute various `Quantity`s and `Figure`s, and return them as part of an `Evaluation` object.

Let's take a look at the code:

In [12]:
import inspect

print(inspect.getsource(metriculous.Evaluator))

class Evaluator:
    """
    Interface to be implemented by the user to compute quantities and charts that are
    relevant and applicable to the problem at hand.
    """

    def evaluate(
        self,
        ground_truth: Any,
        model_prediction: Any,
        model_name: str,
        sample_weights: Optional[Sequence[float]] = None,
    ) -> Evaluation:
        """Generates an Evaluation from ground truth and a model prediction."""
        raise NotImplementedError



We have already seen an `Evaluator` implementation that is shipped with `metriculous`: `ClassificationEvaluator`, which we used above to evaluate a list of Iris classifiers.
As a reminder, `ClassificationEvaluator` is a default implementation that aims to satisfy the most common requirements for classification problems.
More default implementations, such as `RegressionEvaluator`, will most likely be added to future versions of the library.

Even though those default `Evaluator`s can be customized to some degree by passing settings to the constructor, you will probably run into a project where you want to implement your own project-specific `Evaluator`. Reasons might include:
* you want to measure quantities or create figures that are not included in the default implementations, and it wouldn't make sense to add them to the library
* you want to pass in entirely different data structures, for example if your project is neither a classification nor a regression problem

Looking into the implementation of `metriculous.evaluators.ClassificationEvaluator` can be a good starting point in case you want to implement your own `Evaluator`.
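
As a rough sketch of what a project-specific `Evaluator` could look like, the following computes two regression quantities. To keep the example runnable without `metriculous` installed, `Quantity` and `Evaluation` are simplified local stand-ins that mirror the fields shown in the outputs above, and `SimpleRegressionEvaluator` is a hypothetical name. In a real project you would presumably subclass `metriculous.Evaluator` and return the library's own `Evaluation` instead:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

import numpy as np


@dataclass(frozen=True)
class Quantity:  # simplified stand-in mirroring the fields of metriculous.Quantity
    name: str
    value: float
    higher_is_better: Optional[bool] = None
    description: Optional[str] = None


@dataclass(frozen=True)
class Evaluation:  # simplified stand-in mirroring the fields of metriculous.Evaluation
    model_name: str
    quantities: List[Quantity]
    lazy_figures: List[Callable] = field(default_factory=list)
    primary_metric: Optional[str] = None


class SimpleRegressionEvaluator:
    """Hypothetical evaluator following the `evaluate` signature shown above."""

    def evaluate(
        self,
        ground_truth: np.ndarray,
        model_prediction: np.ndarray,
        model_name: str,
        sample_weights: Optional[Sequence[float]] = None,
    ) -> Evaluation:
        errors = np.asarray(model_prediction) - np.asarray(ground_truth)
        # np.average falls back to a plain mean when weights is None.
        mae = float(np.average(np.abs(errors), weights=sample_weights))
        rmse = float(np.sqrt(np.average(errors ** 2, weights=sample_weights)))
        return Evaluation(
            model_name=model_name,
            quantities=[
                Quantity("MAE", mae, higher_is_better=False),
                Quantity("RMSE", rmse, higher_is_better=False),
            ],
            primary_metric="RMSE",
        )


toy_evaluation = SimpleRegressionEvaluator().evaluate(
    ground_truth=np.array([1.0, 2.0, 3.0]),
    model_prediction=np.array([1.0, 2.0, 5.0]),
    model_name="Toy",
)
print(toy_evaluation.quantities)
```

Because `evaluate` accepts `Any` for the ground truth and the prediction, the same pattern should carry over to data structures other than arrays.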

### `Comparison`
A `Comparison` consists of a list of `Evaluation`s. It serves to compare a collection of models. By calling the `display` method in a Jupyter notebook you can display a table showing the `Quantity`s for all the models side by side, as well as the `Figure`s contained in the `Evaluation`s.

For a quick demonstration let's compare the `evaluation` defined above to another `Evaluation`.

In [13]:
from dataclasses import replace

evaluation_2 = metriculous.Evaluation(
    model_name="MyModel_2",
    quantities=[
        replace(q1, value=0.71),
        replace(q2, value=0.31),
        replace(q3, value=0.13),
    ],
    lazy_figures=[lambda: make_figure("Interesting Chart for MyModel_2")],
    primary_metric="Accuracy",
)

comparison = metriculous.Comparison([evaluation, evaluation_2])
comparison

Comparison(evaluations=[Evaluation(model_name='MyModel', quantities=[Quantity(name='Cross-entropy', value=0.731, higher_is_better=False, description=None), Quantity(name='Accuracy', value=0.93, higher_is_better=True, description='Fraction of correctly classified datapoints'), Quantity(name='Fraction of cat predictions', value=0.47, higher_is_better=None, description="Fraction of datapoints that were classified as class 'cat'")], lazy_figures=[<function <lambda> at 0x11d3fd2f0>], primary_metric='Accuracy'), Evaluation(model_name='MyModel_2', quantities=[Quantity(name='Cross-entropy', value=0.71, higher_is_better=False, description=None), Quantity(name='Accuracy', value=0.31, higher_is_better=True, description='Fraction of correctly classified datapoints'), Quantity(name='Fraction of cat predictions', value=0.13, higher_is_better=None, description="Fraction of datapoints that were classified as class 'cat'")], lazy_figures=[<function <lambda> at 0x11cb160d0>], primary_metric='Accuracy')])

In [14]:
comparison.display()

| Quantity | MyModel | MyModel_2 |
|---|---|---|
| Accuracy | 0.93 | 0.31 |

| Quantity | MyModel | MyModel_2 |
|---|---|---|
| Cross-entropy | 0.731 | 0.71 |

| Quantity | MyModel | MyModel_2 |
|---|---|---|
| Fraction of cat predictions | 0.47 | 0.13 |
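
The `primary_metric` and `higher_is_better` fields carry enough information to pick a winner programmatically. The following is a minimal sketch of such a selection, using plain dictionaries filled with the values from the tables above. `pick_best` is a hypothetical helper and not part of `metriculous`:

```python
from typing import Dict


def pick_best(scores: Dict[str, float], higher_is_better: bool) -> str:
    """Return the name of the model with the best score for the given direction."""
    choose = max if higher_is_better else min
    return choose(scores, key=lambda model_name: scores[model_name])


# Values copied from the comparison tables above.
accuracy = {"MyModel": 0.93, "MyModel_2": 0.31}
cross_entropy = {"MyModel": 0.731, "MyModel_2": 0.71}

best_by_accuracy = pick_best(accuracy, higher_is_better=True)
best_by_cross_entropy = pick_best(cross_entropy, higher_is_better=False)

print(best_by_accuracy)
print(best_by_cross_entropy)
```

The two quantities disagree here, which is the situation a primary metric is meant to resolve. A quantity with `higher_is_better=None`, such as the fraction of cat predictions, has no preferred direction and would not be a sensible selection criterion.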


### `Comparator`
Last but not least, there is the `Comparator` class. It is a convenience class that ties all previous building blocks together. It gets initialized with an `Evaluator` (such as `ClassificationEvaluator` in the example above) and can then be used to make a `Comparison`, which, in turn, can be displayed with a `display()` call.

Note that the `compare` method has a very similar signature to `Evaluator.evaluate`. The important difference is that `Evaluator.evaluate` receives just a single prediction object, whereas `Comparator.compare` receives a sequence of prediction objects – with each object coming from one of the models that you want to compare.
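
To illustrate the relationship between the two signatures, here is a conceptual sketch in which `compare` simply calls `evaluate` once per prediction object. This is an assumption made for illustration and not the library's actual implementation. `ToyEvaluator`, `ToyComparator`, and the generated default names are hypothetical, and evaluations are represented as plain dictionaries:

```python
from typing import Any, Dict, List, Optional, Sequence


class ToyEvaluator:
    """Hypothetical evaluator that handles exactly one prediction object."""

    def evaluate(
        self,
        ground_truth: Sequence[int],
        model_prediction: Sequence[int],
        model_name: str,
    ) -> Dict[str, Any]:
        correct = sum(t == p for t, p in zip(ground_truth, model_prediction))
        return {"model_name": model_name, "Accuracy": correct / len(ground_truth)}


class ToyComparator:
    """Hypothetical comparator that makes one `evaluate` call per prediction."""

    def __init__(self, evaluator: ToyEvaluator):
        self.evaluator = evaluator

    def compare(
        self,
        ground_truth: Sequence[int],
        model_predictions: Sequence[Sequence[int]],
        model_names: Optional[Sequence[str]] = None,
    ) -> List[Dict[str, Any]]:
        if model_names is None:
            # Assumed naming scheme, for illustration only.
            model_names = [f"Model_{i}" for i in range(len(model_predictions))]
        # The same ground truth is shared by all models.
        return [
            self.evaluator.evaluate(ground_truth, prediction, name)
            for prediction, name in zip(model_predictions, model_names)
        ]


results = ToyComparator(ToyEvaluator()).compare(
    ground_truth=[0, 1, 1, 2],
    model_predictions=[[0, 1, 1, 2], [0, 0, 1, 1]],
)
print(results)
```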

In [15]:
print(inspect.getsource(metriculous.Comparator))

class Comparator:
    """Can generate model comparisons after initialization with an Evaluator."""

    def __init__(self, evaluator: Evaluator):
        self.evaluator = evaluator

    def compare(
        self,
        ground_truth: Any,
        model_predictions: Sequence[Any],
        model_names=None,
        sample_weights: Optional[Sequence[float]] = None,
    ) -> Comparison:
        """Generates a Comparison from a list of predictions and the ground truth.

        Args:
            model_predictions:
                List with one prediction object per model to be compared.
            ground_truth:
                A single ground truth object.
            model_names:
                Optional list of model names. If `None` generic names will be generated.
            sample_weights:
                Optional sequence of floats to modify the influence of individual
                samples on the statistics that will be measured.

        Returns:
            A Comparison object