# Introduction

### Brief Overview

Is a training set something immutable and unexpandable? Active learning relates to situations where the answer is no. The training set size can be increased, but, of course, labelling of new examples is not costless.

The pool-based setup of active learning assumes that, in addition to a model and a training set, there is a fixed and known set of $n$ initially unlabelled examples. The goal is to select $k$ of them, $k < n$, such that disclosing their target values improves model quality the most.
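
As a rough illustration of this setup, the sketch below selects $k$ of $n$ pool objects given some usefulness scores. The function name `select_top_k` and the score values are made up for illustration; this is not code from `dsawl`.

```python
import numpy as np


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k pool objects with the highest scores."""
    n = len(scores)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    # Sort in ascending order, reverse, and keep the first k indices.
    return np.argsort(scores)[::-1][:k]


# Hypothetical usefulness scores for a pool of n = 5 unlabelled objects.
pool_scores = np.array([0.1, 0.9, 0.5, 0.7, 0.3])
selected = select_top_k(pool_scores, k=2)
print(selected)
```

Everything specific to a particular strategy is hidden in how the scores are computed; the selection step itself stays the same.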

There are other setups of active learning problems as well (e.g., how to synthesize feature representations of objects to be studied), but all of them are beyond the scope of this demo.

In this notebook, one can find answers to the following questions:
* How to use implementations of active learning strategies from the `dsawl` package?
* How does $\varepsilon$-greedy active learning perform relative to random selection?

### References

* An article that contains a review of approaches to active learning: [Yang, 2017](https://arxiv.org/pdf/1702.08540.pdf);
* An article about the EG-Active algorithm: [Bouneffouf, 2014](https://arxiv.org/abs/1408.2196).

# General Preparations

### Import Statements

In [1]:
import math
from copy import copy
from typing import List

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

from dsawl.active_learning.pool_based_sampling import CombinedSamplerFromPool
from dsawl.active_learning.utils import make_committee

### Notebook-level Settings

In [2]:
np.random.seed(361)

In [3]:
sns.set()

### User-defined Settings

In [4]:
# It is not a good practice to store binary files
# (like PNG images) in a Git repository, but for
# your local use you can set it to `True`.
draw_plots = False

# Dataset Generation

In this section, a synthetic dataset that is used by further examples is created. You can skip the details of this section if you are interested only in the user interface of the active learning utilities. If this is the case, go to the "Step-by-Step Tutorial" section.

In [5]:
dimensionality = 2
lower_bound = -2
upper_bound = 2
pool_size = 300

In [6]:
X_train_initial = np.array(
    [[1, -1],
     [2, -2],
     [3, -3],
     [-1, -1],
     [-2, -2],
     [-3, -3],
     [0, 1],
     [0, 2],
     [0, 3]]
)

In [7]:
X_new = np.random.uniform(
    lower_bound, upper_bound, size=(pool_size, dimensionality)
)

In [8]:
X_hold_out = np.random.uniform(
    lower_bound, upper_bound, size=(pool_size, dimensionality)
)

In [9]:
def compute_target(X: np.ndarray) -> np.ndarray:
    """
    Compute class labels for a simple classification problem where
    the 2D plane is split into three regions by rays that start
    from the origin, with an angle of 120 degrees between any
    pair of them.
    
    :param X:
        coordinates of points from the plane
    :return:
        labels of regions where points are located
    """
    
    def compute_target_for_row(x: np.ndarray) -> int:
        if x[0] > 0:
            return 1 if x[1] - math.tan(math.radians(30)) * x[0] > 0 else 2
        else:
            return 1 if x[1] + math.tan(math.radians(30)) * x[0] > 0 else 3
        
    y = np.apply_along_axis(compute_target_for_row, axis=1, arr=X)
    return y
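
The row-wise helper above can also be written without a Python-level loop. The sketch below is an alternative under the same geometry; the name `compute_target_vectorized` and the demo points are hypothetical and not part of the original notebook.

```python
import math

import numpy as np


def compute_target_vectorized(X: np.ndarray) -> np.ndarray:
    """Assign labels 1, 2, 3 to points of the plane without a row loop."""
    slope = math.tan(math.radians(30))
    # Both boundaries of the upper region are described by x1 = slope * |x0|.
    is_upper = X[:, 1] - slope * np.abs(X[:, 0]) > 0
    # Outside the upper region, the right half-plane is class 2,
    # and everything else is class 3.
    lower_labels = np.where(X[:, 0] > 0, 2, 3)
    return np.where(is_upper, 1, lower_labels)


X_demo = np.array([[1, -1], [-1, -1], [0, 1], [2, 0.5]])
labels = compute_target_vectorized(X_demo)
print(labels)
```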

In [10]:
y_train_initial = compute_target(X_train_initial)
y_new = compute_target(X_new)
y_hold_out = compute_target(X_hold_out)

In [11]:
if draw_plots:
    fig = plt.figure(figsize=(15, 15))
    ax = fig.add_subplot(111)
    for label, color in zip(range(1, 4), ['b', 'r', 'g']):
        curr_X = X_train_initial[y_train_initial == label, :]
        ax.scatter(curr_X[:, 0], curr_X[:, 1], c=color, marker='D')
    for label, color in zip(range(1, 4), ['b', 'r', 'g']):
        curr_X = X_new[y_new == label, :]
        ax.scatter(curr_X[:, 0], curr_X[:, 1], c=color)

# Step-by-Step Tutorial

As of now, the most convenient and comprehensive user interface is provided by the `CombinedSamplerFromPool` class. Its instances can exploit accumulated knowledge about the decision boundary of the model and can also take exploratory actions. Various approaches to exploitation and exploration are supported.

## Initialization

The `CombinedSamplerFromPool` class has two initialization arguments: `scorers` and `scorers_probabilities`. Let us discuss both of them.

#### An argument named `scorers`

The `scorers` argument defines a list of internal entities that rank new objects by the usefulness of their labels. The more valuable the label of an object is, the higher its rank should be. Technically, all scoring entities are instances of classes that inherit from a single class:
`dsawl.active_learning.pool_based_sampling.BaseScorer`.

Any instance that satisfies the above condition can be an element of `scorers`. However, the easiest and safest way to set `scorers` is to pass a list of strings that are recognized as names of pre-defined scorers.

If it is a classification problem, supported strings are:
* 'confidence' — the $i$-th object has score $-(\max_{j} \hat{p}_{ij})$ where $\hat{p}_{ij}$ is estimated (predicted) probability that the $i$-th object is an object of $j$-th class;
* 'margin'  — the $i$-th object has score $-(\max_{j} \hat{p}_{ij} - \max_{j \ne \hat{y}_i} \hat{p}_{ij})$ where $\hat{y}_i$ is predicted class of the $i$-th object, i.e., $\hat{y}_i = \arg \max_{j} \hat{p}_{ij}$;
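* 'entropy' — the $i$-th object has score $-\sum_{j} \hat{p}_{ij} \log \hat{p}_{ij}$, i.e., the entropy of the predicted distribution, so that more uncertain objects are ranked higher;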
* 'divergence' — the $i$-th object has score $\sum_{k}D_{KL}(\hat{p}_{ijk} \, \Vert \, \overline{p}_{ij})$ where there is a committee (i.e., list) of classifiers indexed by $k$, $\hat{p}_{ijk}$ is the probability, predicted by the $k$-th classifier, that the $i$-th object belongs to the $j$-th class, $\overline{p}_{ij}$ is the average of all $\hat{p}_{ijk}$ over $k$, and $D_{KL}$ is the Kullback-Leibler divergence between $\hat{p}_{ijk}$ and $\overline{p}_{ij}$ (both $\hat{p}_{ijk}$ and $\overline{p}_{ij}$ are considered to be distributions over class labels $j$).

Note that for a binary classification problem, the first three options result in the same ranking.
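
To make the first three formulas concrete, here is a minimal NumPy sketch that computes them from a matrix of predicted probabilities and checks the remark about binary problems. The function names and the probability values are illustrative assumptions, not the `dsawl` implementation, and entropy is taken with the usual minus sign so that higher scores mean higher uncertainty.

```python
import numpy as np


def confidence_scores(p: np.ndarray) -> np.ndarray:
    # Negative probability of the most likely class.
    return -p.max(axis=1)


def margin_scores(p: np.ndarray) -> np.ndarray:
    # Negative gap between the two highest class probabilities.
    sorted_p = np.sort(p, axis=1)
    return -(sorted_p[:, -1] - sorted_p[:, -2])


def entropy_scores(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # Entropy of each row; clipping avoids log(0).
    return -np.sum(p * np.log(np.clip(p, eps, 1.0)), axis=1)


# Made-up predicted probabilities for a binary problem.
p_binary = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.2, 0.8]])
rankings = [
    np.argsort(-scoring_fn(p_binary))
    for scoring_fn in (confidence_scores, margin_scores, entropy_scores)
]
for ranking in rankings:
    print(ranking)
```

All three rankings start with the object whose probabilities are closest to 0.5, because in the binary case each score is a monotone function of the larger of the two probabilities.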

If it is a regression problem, supported strings are:
* 'predictions_variance' — the $i$-th object has score $\mathrm{Var}_k \hat{y}_{ik}$ where there is a committee of regressors indexed by $k$ and $\hat{y}_{ik}$ is the target value predicted by the $k$-th regressor for the $i$-th object;
* 'target_variance' — the $i$-th object has a score equal to an estimate of the target's variance at it: $\max(\widehat{y^2}_i - \hat{y}_i^2, 0)$ where there is a pair of regressors such that the first one predicts the target itself, whereas the second one predicts the squared target.
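
Both regression formulas are short enough to sketch directly. The snippet below assumes that predictions are already available as NumPy arrays; the function names and the numbers are hypothetical, chosen only to illustrate the formulas.

```python
import numpy as np


def predictions_variance_scores(committee_predictions: np.ndarray) -> np.ndarray:
    # Input shape: (n_regressors, n_objects); variance is taken over regressors.
    return np.var(committee_predictions, axis=0)


def target_variance_scores(
        predicted_target: np.ndarray,
        predicted_squared_target: np.ndarray
        ) -> np.ndarray:
    # Var(y) = E[y^2] - (E[y])^2, clipped at zero because the two
    # regressors are fitted independently and may be inconsistent.
    return np.maximum(predicted_squared_target - predicted_target ** 2, 0)


committee_predictions = np.array([[1.0, 2.0], [3.0, 2.0]])
variance_scores = predictions_variance_scores(committee_predictions)

predicted_target = np.array([1.0, 2.0])
predicted_squared_target = np.array([2.0, 3.0])
target_scores = target_variance_scores(predicted_target, predicted_squared_target)
```

In the second example, the estimate for the second object would be negative (3 - 4), which is why the formula takes the maximum with zero.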

Finally, there are two strings for making exploratory actions:
* 'random' — all objects are ranked randomly;
* 'density' — the $i$-th object has score equal to the negative logarithm of the estimated density of the data distribution at the point corresponding to the $i$-th object; such scoring is designed for exploring outliers.
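
The idea behind 'density' scoring can be sketched with `sklearn.neighbors.KernelDensity`, whose `score_samples` method returns log-density. The data, the bandwidth, and the variable names below are arbitrary choices made for illustration; how `dsawl` wires the estimator internally is not shown here.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(361)
# Hypothetical already-observed objects concentrated around the origin.
X_known = rng.normal(size=(200, 2))

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X_known)

# Two pool objects: a typical one and a clear outlier.
X_pool = np.array([[0.0, 0.0], [5.0, 5.0]])
# `score_samples` returns log-density, so the score is its negation.
density_scores = -kde.score_samples(X_pool)
```

The outlier gets the higher score, so it would be picked first by an exploratory step.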

All of the above strings define scoring function, but do not define tools of scorers. The meaning of the word 'tools' depends on subclass of `BaseScorer` class:
* if a string is 'confidence', 'margin', or 'entropy', tools are a classifier;
* if a string is 'divergence', tools are a committee of classifiers;
* if a string is 'predictions_variance', tools are a committee of regressors;
* if a string is 'target_variance', tools are a pair of regressors;
* if a string is 'random', tools are `None`;
* if a string is 'density', tools are a density estimator (such as `sklearn.mixture.GaussianMixture` or `sklearn.neighbors.KernelDensity`).

If a scorer is created from a string, its tools must be passed explicitly. This can be done either with the `set_tools` method (properly trained tools are required) or with the `update_tools` method (just one bare estimator is needed, but training data must be provided too).

The cells below show two equivalent ways of passing and setting `scorers`.

In [12]:
sampler = CombinedSamplerFromPool(scorers=['confidence'])
clf = RandomForestClassifier()
clf.fit(X_train_initial, y_train_initial)
sampler.set_tools(tools=clf, scorer_id=0)

In [13]:
sampler = CombinedSamplerFromPool(scorers=['confidence'])
sampler.update_tools(
    X_train=X_train_initial,
    y_train=y_train_initial,
    est=RandomForestClassifier(),
    scorer_id=0
)

The difference between these two ways becomes more clear if tools must be something more complicated than just one estimator. For example, tools can be a committee.

In [14]:
sampler = CombinedSamplerFromPool(['divergence'])
clf = RandomForestClassifier()
committee = make_committee(clf, X_train_initial, y_train_initial)
sampler.set_tools(committee, scorer_id=0)

In [15]:
sampler = CombinedSamplerFromPool(['divergence'])
sampler.update_tools(X_train_initial, y_train_initial, RandomForestClassifier(), scorer_id=0)

So a sole estimator is passed only in the second case, whereas in the first case a committee of estimators is passed.
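
For intuition about what a committee is used for, the sketch below evaluates the 'divergence' formula from the list of scorers on already-predicted probabilities. The function name and the probability arrays are made up for illustration; this is not the `dsawl` implementation and says nothing about how `make_committee` trains its members.

```python
import numpy as np


def divergence_scores(committee_probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """
    Sum over committee members of KL divergence from the mean prediction.

    Input shape: (n_classifiers, n_objects, n_classes).
    """
    p = np.clip(committee_probs, eps, 1.0)
    mean_p = p.mean(axis=0, keepdims=True)
    # KL divergence of each member from the consensus, per object.
    kl = np.sum(p * np.log(p / mean_p), axis=2)
    return kl.sum(axis=0)


# Two classifiers, two objects, three classes (hypothetical predictions).
committee_probs = np.array([
    [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]],
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
])
committee_scores = divergence_scores(committee_probs)
```

The committee agrees on the first object, so its score is zero, while disagreement on the second object produces a positive score.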

One subtle issue is that the formulas for confidence, margin, entropy, and divergence make rigorous sense only when the probabilities predicted by a classifier are true probabilities, i.e., numerical quantifications of uncertainty. However, some classifiers return just ordinal degrees of their internal confidence in class labels. Although such numbers are called probabilities, they are not probabilities. To overcome this obstacle, it is recommended to calibrate predicted probabilities with Platt calibration or with isotonic regression. A class that can run either of these options is provided by the `sklearn` package.

In [16]:
sampler = CombinedSamplerFromPool(['confidence'])
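# `cv=3` is set explicitly: each class has only three training examples,
# so stratified splitting into more folds is impossible.
clf = CalibratedClassifierCV(RandomForestClassifier(), cv=3)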
clf.fit(X_train_initial, y_train_initial)
sampler.set_tools(clf, scorer_id=0)
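
The choice between the two calibration options is controlled by the `method` parameter of `CalibratedClassifierCV`: `'sigmoid'` corresponds to Platt calibration and `'isotonic'` to isotonic regression. The sketch below fits both on a made-up binary dataset; the dataset, the forest settings, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(361)
# Hypothetical dataset: two classes separated by the line x0 + x1 = 0.
X_demo = rng.uniform(-2, 2, size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

calibrated_probabilities = {}
for method in ('sigmoid', 'isotonic'):
    base_clf = RandomForestClassifier(n_estimators=20, random_state=361)
    calibrated_clf = CalibratedClassifierCV(base_clf, method=method, cv=3)
    calibrated_clf.fit(X_demo, y_demo)
    calibrated_probabilities[method] = calibrated_clf.predict_proba(X_demo[:5])
```

Either calibrated classifier can then be passed to `set_tools` in the same way as in the cell above.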

#### An argument named `scorers_probabilities`

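Now consider the `scorers_probabilities` argument. It must be a list of floats with one entry per scorer; judging by the example below, the $i$-th entry is the probability that the $i$-th scorer is the one used to pick an object.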
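
One plausible reading of these probabilities is that, for every object to be picked, a scorer is drawn at random according to them. The sketch below simulates such draws with NumPy; it is an assumption about the behaviour, not a description of the internals of `CombinedSamplerFromPool`.

```python
import numpy as np

rng = np.random.default_rng(361)

scorer_names = ['margin', 'random']
probabilities = [0.95, 0.05]

# Draw a scorer for each of 1000 hypothetical picks.
chosen_scorers = rng.choice(scorer_names, size=1000, p=probabilities)
share_of_exploration = np.mean(chosen_scorers == 'random')
```

Under this reading, about 5% of the picks are exploratory and the rest exploit the current decision boundary.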

In [17]:
epsilon_greedy_sampler = CombinedSamplerFromPool(
    ['margin', 'random'], [0.95, 0.05]
)
epsilon_greedy_sampler.update_tools(
    X_train_initial, y_train_initial, RandomForestClassifier(), scorer_id=0
)

In the above example, an $\varepsilon$-greedy strategy is implemented. Even after enough data are gathered, it still performs plenty of exploratory actions, which is a drawback of this strategy (at least in static environments). To fix this, the exploration probability should be decreased gradually. This can be done by calling the `set_scorers_probabilities` method.

In [18]:
epsilon_greedy_sampler.set_scorers_probabilities([0.99, 0.01])
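
A gradual decrease can follow any schedule. Below is a minimal sketch of an exponential one that produces lists suitable for `set_scorers_probabilities`; the function name, the decay rate, and the lower bound are hypothetical choices, not recommendations from `dsawl`.

```python
from typing import List


def decayed_probabilities(
        step: int,
        initial_epsilon: float = 0.05,
        decay_rate: float = 0.97,
        min_epsilon: float = 0.01
        ) -> List[float]:
    """Return [exploitation probability, exploration probability]."""
    # Exponential decay that never drops below `min_epsilon`.
    epsilon = max(min_epsilon, initial_epsilon * decay_rate ** step)
    return [1 - epsilon, epsilon]


schedule = [decayed_probabilities(step) for step in (0, 10, 1000)]
```

Inside a labelling loop, the result for the current step would be passed to `set_scorers_probabilities` before picking the next object.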

## Usage

Usage of a created instance is as simple as the next cell.

In [19]:
indices = sampler.pick_new_objects(X_new, n_to_pick=3)
X_new[indices, :]

array([[ 0.49560098, -0.02856764],
       [ 0.39816861, -0.40686834],
       [ 0.21951536, -0.37881001]])

# Illustrative End-to-End Example

Here, the $\varepsilon$-greedy strategy is compared with a benchmark based on random selection from a pool.

In [20]:
# Random Forest usually does not warp probabilities.
clf = RandomForestClassifier(n_estimators=20, random_state=361)

In [21]:
max_n_points_to_explore = 100

In [22]:
scorers = ['margin', 'random']
scorers_probabilities = [0.9, 0.1]

In [23]:
def report_accuracy_of_benchmark(
        n_new_points: int,
        clf: BaseEstimator,
        X_train_initial: np.ndarray, y_train_initial: np.ndarray,
        X_new: np.ndarray, y_new: np.ndarray,
        X_hold_out: np.ndarray, y_hold_out: np.ndarray
        ) -> float:
    """
    Compute accuracy of approach where `n_new_points` objects
    are picked from a pool at random, without active learning.
    """
    X_train = np.vstack((X_train_initial, X_new[:n_new_points, :]))
    y_train = np.hstack((y_train_initial, y_new[:n_new_points]))
    clf.fit(X_train, y_train)
    y_hold_out_hat = clf.predict(X_hold_out)
    return accuracy_score(y_hold_out, y_hold_out_hat)

In [24]:
def report_accuracy_of_epsilon_greedy_strategy(
        n_new_points: int,
        clf: BaseEstimator,
        scorers: List[str],
        scorers_probabilities: List[float],
        X_train_initial: np.ndarray, y_train_initial: np.ndarray,
        X_new: np.ndarray, y_new: np.ndarray,
        X_hold_out: np.ndarray, y_hold_out: np.ndarray
        ) -> float:
    """
    Compute accuracy of epsilon-greedy approach to active
    learning.
    """
    X_train = copy(X_train_initial)
    y_train = copy(y_train_initial)
    clf.fit(X_train, y_train)
    sampler = CombinedSamplerFromPool(
        scorers, scorers_probabilities
    )
    sampler.set_tools(clf, scorer_id=0)
    for i in range(n_new_points):
        indices = sampler.pick_new_objects(X_new, n_to_pick=1)
        X_train = np.vstack((X_train, X_new[indices, :]))
        y_train = np.hstack((y_train, y_new[indices]))
        sampler.update_tools(X_train, y_train, scorer_id=0)
        X_new = np.delete(X_new, indices, axis=0)
        y_new = np.delete(y_new, indices)
    clf = sampler.get_tools(0)
    y_hold_out_hat = clf.predict(X_hold_out)
    return accuracy_score(y_hold_out, y_hold_out_hat)
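
For readers who want to see the mechanics without `dsawl`, the same kind of loop can be sketched with plain NumPy and scikit-learn. Everything below (the function name, the toy data, the value of `epsilon`) is an illustrative assumption and not the code of `CombinedSamplerFromPool`; it also requires at least two classes in the initial training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def run_epsilon_greedy_loop(clf, X_train, y_train, X_pool, y_pool,
                            n_steps, epsilon=0.1, seed=361):
    """Label `n_steps` pool objects, one at a time, refitting after each."""
    rng = np.random.default_rng(seed)
    clf.fit(X_train, y_train)
    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Exploration: pick a random pool object.
            idx = int(rng.integers(len(X_pool)))
        else:
            # Exploitation: pick the object with the smallest margin.
            proba = np.sort(clf.predict_proba(X_pool), axis=1)
            idx = int(np.argmin(proba[:, -1] - proba[:, -2]))
        # Move the picked object from the pool to the training set.
        X_train = np.vstack((X_train, X_pool[idx:idx + 1]))
        y_train = np.hstack((y_train, y_pool[idx]))
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
        clf.fit(X_train, y_train)
    return clf, X_train, y_train


rng = np.random.default_rng(0)
X_pool_demo = rng.uniform(-2, 2, size=(50, 2))
y_pool_demo = (X_pool_demo[:, 0] > 0).astype(int)
X_start = np.array([[-1.0, 0.0], [1.0, 0.0]])
y_start = np.array([0, 1])

fitted_clf, X_final, y_final = run_epsilon_greedy_loop(
    RandomForestClassifier(n_estimators=20, random_state=361),
    X_start, y_start, X_pool_demo, y_pool_demo, n_steps=10
)
```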

In [25]:
benchmark_scores = [
    report_accuracy_of_benchmark(
        n, clf,
        X_train_initial, y_train_initial, X_new, y_new,
        X_hold_out, y_hold_out
    )
    for n in range(1, max_n_points_to_explore + 1)
]
sum(benchmark_scores)

93.703333333333276

In [26]:
epsilon_greedy_scores = [
    report_accuracy_of_epsilon_greedy_strategy(
        n, clf, scorers, scorers_probabilities,
        X_train_initial, y_train_initial, X_new, y_new,
        X_hold_out, y_hold_out
    )
    for n in range(1, max_n_points_to_explore + 1)
]
sum(epsilon_greedy_scores)

96.066666666666777

In [27]:
if draw_plots:
    fig = plt.figure(figsize=(15, 15))
    ax = fig.add_subplot(111)
    ax.plot(benchmark_scores)
    ax.plot(epsilon_greedy_scores, c='g')

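To conclude, active learning gives a noticeable gain over selecting objects randomly: summed over the numbers of new points from 1 to 100, hold-out accuracy is about 96.07 for the $\varepsilon$-greedy strategy versus about 93.70 for the benchmark, i.e., roughly 0.961 versus 0.937 on average.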

# Customized Extensions

Now suppose that none of the pre-defined strings is an appropriate choice for a user whose priorities are unusual. Exploration is important, and it is also more desirable to disclose a label of the first class than a label of the second or third class. Moreover, sampling objects exactly near the decision boundary is not important. Sounds strange, doesn't it? However, it is easy to meet these specifications with the `dsawl` package. Below, it is shown how to extend the standard functionality with your own code.

First of all, define a customized scoring function.

In [28]:
def compute_bayesian_scores(
        predicted_probabilities: np.ndarray,
        class_of_interest: int = 0
        ) -> np.ndarray:
    """
    Sample a label for each object from its predicted class
    distribution and return binary indicators of whether the
    sampled label is the class of interest.

    :param predicted_probabilities:
        class probabilities predicted by the classifier for
        each of the new objects, shape = (n_new_objects, n_classes);
        it is recommended to pass calibrated probabilities
    :param class_of_interest:
        ordinal number of class of interest, i.e., index of column
        with this class probabilities
    :return:
        indicators that labels sampled from predicted distributions
        are labels of the class of interest
    """
    n_classes = predicted_probabilities.shape[1]
    sampled_labels = []
    for distribution in predicted_probabilities:
        sampled_labels.append(np.random.choice(n_classes, p=distribution))
    sampled_labels = np.array(sampled_labels)
    result = (sampled_labels == class_of_interest).astype(int)
    return result
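
The sampling loop above can also be vectorized with inverse-CDF sampling. The sketch below is an alternative under that assumption; the function name and the optional `rng` argument are hypothetical, and, unlike the cell above, it draws from a NumPy `Generator` rather than from the global `np.random` state.

```python
import numpy as np


def compute_bayesian_scores_vectorized(
        predicted_probabilities: np.ndarray,
        class_of_interest: int = 0,
        rng: np.random.Generator = None
        ) -> np.ndarray:
    """Sample one label per row and flag rows where it is the class of interest."""
    rng = np.random.default_rng() if rng is None else rng
    cdf = np.cumsum(predicted_probabilities, axis=1)
    # Guard against rounding errors in the last cumulative value.
    cdf[:, -1] = 1.0
    uniform_draws = rng.random(size=(len(cdf), 1))
    # The sampled label is the first class whose CDF exceeds the draw.
    sampled_labels = (uniform_draws < cdf).argmax(axis=1)
    return (sampled_labels == class_of_interest).astype(int)


# With one-hot distributions the outcome is deterministic.
one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
indicators = compute_bayesian_scores_vectorized(one_hot, class_of_interest=0)
print(indicators)
```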

Then make a scorer. A single classifier is involved, so it can be `UncertaintyScorerForClassification`.

In [29]:
from dsawl.active_learning.scorers import UncertaintyScorerForClassification

scorer = UncertaintyScorerForClassification(
    scoring_fn=compute_bayesian_scores
)

Now everything is ready to apply the custom scorer.

In [30]:
sampler = CombinedSamplerFromPool(scorers=[scorer])
clf = RandomForestClassifier()
clf.fit(X_train_initial, y_train_initial)
sampler.set_tools(tools=clf, scorer_id=0)

In [31]:
indices = sampler.pick_new_objects(X_new, n_to_pick=3)
X_new[indices, :]

array([[ 0.24940934,  0.57662896],
       [-1.08723821,  1.14735078],
       [-1.55920436,  0.7766194 ]])

As shown above, it is easy to extend the active learning functionality.