# Sample Annotating

In supervised and semi-supervised machine learning, data must be labeled after an active learning algorithm has selected it. This tutorial shows how to build a simple annotation tool using [`ipyannotations`](https://ipyannotations.readthedocs.io/en/latest/index.html) and [`superintendent`](https://superintendent.readthedocs.io/en/latest/index.html). It assumes prior knowledge of our framework. If you are not familiar with it, try some of the basic [tutorials](https://scikit-activeml.github.io/scikit-activeml-docs/tutorials.html) first.

## Installation and Configuration

First, we'll need to install the necessary packages using pip.

In [None]:
!pip install ipyannotations
!pip install superintendent

In some [cases](https://ipyannotations.readthedocs.io/en/latest/installing.html), the front-end extension must be installed and configured as well.

In [None]:
!jupyter nbextension install --user --py ipyannotations
!jupyter nbextension enable --user --py ipyannotations
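
Before moving on, it can help to confirm that the installation succeeded. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the imports used in this tutorial.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used in this tutorial.
required = ["ipyannotations", "superintendent", "skactiveml"]
missing = missing_packages(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required packages are available.")
```

This only checks that the Python packages are importable; it does not verify that the notebook front-end extension is enabled.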

Now we can start by importing some packages.

In [9]:
import numpy as np
import math
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from skactiveml.utils import is_labeled
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling

from superintendent import Superintendent
from ipywidgets import widgets
from ipyannotations.images import ClassLabeller

import warnings
warnings.filterwarnings("ignore")

## The Annotation Widget Class

Now we define the class `DataLabeler`, which inherits from `Superintendent`. To adapt it to our framework, we have to overwrite the constructor and the methods `_annotation_iterator` and `retrain`.
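
The annotation loop in the class below is written as a Python generator: it pauses at `yield` after displaying a sample and resumes when a label is sent in. The following stand-alone sketch illustrates that pattern only. The function `annotation_loop` and its toy data are hypothetical, and the undo handling is simplified compared to the widget, where removing the label happens in a separate `_undo` method.

```python
def annotation_loop(samples, labels):
    """Yield once per displayed sample and receive the annotator's answer."""
    j = 0
    while j < len(samples):
        print(f"Displaying sample {samples[j]!r}")
        answer = yield  # Pause until a label (or None / "undo") is sent in.
        if answer == "undo":
            if j > 0:
                # Drop the previous label and step back one sample.
                labels.pop(samples[j - 1], None)
                j -= 1
            continue
        if answer is not None:
            labels[samples[j]] = answer
        # A None answer means "skip": move on without storing a label.
        j += 1
    yield "finished"


labels = {}
loop = annotation_loop(["a", "b", "c"], labels)
next(loop)              # Run until the first sample is displayed.
loop.send(1)            # Label "a", then "b" is displayed.
loop.send("undo")       # Remove the label of "a" and display it again.
loop.send(7)            # Relabel "a".
loop.send(None)         # Skip "b".
result = loop.send(3)   # Label "c"; the loop reports that it is done.
```

In the widget, the calls to `send` are triggered by the buttons of the labelling widget rather than written out by hand.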

In [10]:
class DataLabeler(Superintendent):
    """DataLabeler

    This class creates a widget for label assignments.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Training data set, usually complete, i.e. including the labeled and
        unlabeled samples.
    y : array-like of shape (n_samples)
        Labels of the training data set (possibly including unlabeled ones
        indicated by self.MISSING_LABEL).
    model : skactiveml.base.SkactivemlClassifier or skactiveml.base.SkactivemlRegressor
        Model implementing the method `fit`.
    query_strategy : skactiveml.base.QueryStrategy
        Query strategy used to select the next sample(s) to be labeled.
    labelling_widget : Optional[widgets.Widget]
        An input widget. This needs to follow the interface of the class
        superintendent.controls.base.SubmissionWidgetMixin
    batch_size : int, default=1
        The number of samples to be selected in one AL cycle.
    X_eval : array-like of shape (n_eval_samples, n_features), default=None
        Evaluation data set that is used by the `eval_method`. Only used if
        y_eval is specified.
    y_eval : array-like of shape (n_eval_samples), default=None
        Labels for the evaluation data set. Only used if X_eval is
        specified.
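    eval_method : callable, default=None
        A function that accepts three arguments - model, x, and y - and
        returns a validation score of the model. If None,
        sklearn.model_selection.cross_val_score is used.
    n_cycles : int, default=None
        The number of AL cycles. If None, ceil(len(X) / batch_size) is used.
    """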
    def __init__(
            self,
            X,
            y,
            model,
            query_strategy,
            labelling_widget,
            batch_size=1,
            X_eval=None,
            y_eval=None,
            eval_method = None,
            n_cycles=None,
            **kwargs,
    ):
        # Call the super constructor.
        try:
            super().__init__(
                labelling_widget=labelling_widget,
                eval_method=eval_method,
                **kwargs
            )
        except AttributeError:
            pass

        # Assign parameters.
        self.X = X
        self.y = y
        self.model = model
        self.X_eval = X_eval
        self.y_eval = y_eval
        self.batch_size = batch_size
        self.query_strategy = query_strategy
        self.n_cycles = n_cycles or math.ceil(len(X)/batch_size)
        
        self.labeled_indices = []
        self.candidates = np.arange(len(X))[~is_labeled(y)]

        # Generate the widgets.
        self.model_performance = widgets.HTML(value="")
        self.top_bar = widgets.HBox(
            [
                widgets.HBox(
                    [self.progressbar],
                    layout=widgets.Layout(width="50%", justify_content="space-between"),
                ),
                widgets.HBox(
                    [self.model_performance],
                    layout=widgets.Layout(width="50%"),
                ),
            ]
        )
        self.children = [self.top_bar, self.labelling_widget]

        # Start the annotation loop.
        self._begin_annotation()

    def _annotation_iterator(self):
        """The annotation loop."""
        self.children = [self.top_bar, self.labelling_widget]
        self.progressbar.bar_style = ""
        # Fit the model
        self.retrain()
        for i in range(self.n_cycles):
            # Query the next samples.
            # Flatten each sample to a feature vector, as done in `retrain`.
            idx = self.query_strategy.query(
                self.X.reshape(len(self.X), -1),
                self.y,
                self.model,
                candidates=self.candidates,
                fit_clf=False,
                batch_size=self.batch_size,
            )
            j = 0
            while j<len(idx):
                # Display and label the next sample.
                with self._render_hold_message("Loading..."):
                    self.labelling_widget.display(self.X[idx[j]])
                y = yield
                if y == 'undo':
                    j -= 2
                elif y is not None:
                    self.labeled_indices.append(idx[j])
                    self.y[idx[j]] = y
                else:  # Skip
                    self.candidates = self.candidates[self.candidates!=idx[j]]
                self.progressbar.value = np.sum(is_labeled(self.y))/(min(self.n_cycles*self.batch_size, len(self.X)))
                j += 1
            # Fit the model.
            self.retrain()
            # Break if all samples are labeled.
            if np.all(is_labeled(self.y)):
                break

        yield self._render_finished()

    def _undo(self):
        """Remove the most recently assigned label and step back one sample."""
        self.y[self.labeled_indices.pop()] = self.model.missing_label
        self._annotation_loop.send('undo')  # Re-display the previous sample.


    def retrain(self, button=None):
        """Re-train the model you passed when creating this widget.

        This calls the fit method of your model with the data that you've
        labelled. It will also score the classifier and display the
        performance.

        Parameters
        ----------
        button : widget.Widget, optional
            Optional & ignored; this is passed when invoked by a button.
        """
        with self._render_hold_message("Retraining..."):
            if self.X_eval is not None:
                X_eval = self.X_eval
                y_eval = self.y_eval
            else:
                # Fall back to the labeled part of the training data.
                X_eval = self.X[is_labeled(self.y)]
                y_eval = self.y[is_labeled(self.y)]

            # Fit the model.
            try:
                self.model.fit(self.X.reshape(len(self.X), -1), self.y)
            except ValueError as e:
                if str(e).startswith("This solver needs samples of at least 2"):
                    self.model_performance.value = "Not enough classes to retrain."
                    return
                else:
                    raise

            # Evaluate the model. By default, using cross validation. In sklearn this
            # clones the model, so it's OK to do after the model fit.
            try:
                if self.eval_method is not None:
                    performance = np.mean(
                        self.eval_method(self.model, X_eval.reshape(len(X_eval), -1), y_eval)
                    )
                else:
                    performance = np.mean(
                        cross_val_score(
                            self.model, X_eval.reshape(len(X_eval), -1), y_eval, cv=3, error_score=np.nan
                        )
                    )
            except ValueError as e:
                if "n_splits=" in str(e) \
                        or "Found array with 0 sample(s)" in str(e) \
                        or "cannot reshape array of size 0" in str(e):
                    self.model_performance.value = "Not enough labels to evaluate."
                    return
                else:
                    raise

            self.model_performance.value = f"Score: {performance:.3f}"
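
The `eval_method` parameter described in the docstring accepts any callable that takes a model, samples, and labels and returns a score. As a minimal sketch, the function below computes plain accuracy. Both `accuracy_eval` and the stand-in `MajorityModel` are hypothetical and exist only to keep the example self-contained; they are not part of `skactiveml` or `superintendent`.

```python
class MajorityModel:
    """Hypothetical stand-in model that always predicts the most frequent label."""

    def fit(self, x, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]


def accuracy_eval(model, x, y):
    """Return the fraction of samples in `x` whose prediction matches `y`."""
    predictions = model.predict(x)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


model = MajorityModel().fit([[0], [1], [2]], [1, 1, 0])
score = accuracy_eval(model, [[3], [4], [5], [6]], [1, 0, 1, 1])
print(f"Score: {score:.3f}")
```

A function like this could be passed as `eval_method=accuracy_eval`, ideally together with `X_eval` and `y_eval`, so that the score is computed on held-out data rather than on the samples the model was just fitted on.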

## Create Dataset
For this tutorial, we use the digits data set available through the `sklearn` package. The 8x8 images show handwritten digits from 0 to 9. Initially, all labels are set to `np.nan`, which marks the samples as unlabeled.

In [11]:
X = load_digits().data.reshape(-1, 8, 8)
y = np.full(shape=len(X), fill_value=np.nan)
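
The role of `np.nan` as the missing-label marker can be illustrated on a toy label vector. This is a sketch under the assumption that `is_labeled` treats `np.nan` entries as unlabeled; `~np.isnan(...)` is used here as a stand-in so the example needs only NumPy, and the names `y_toy`, `labeled_mask`, and `progress` are made up for illustration.

```python
import numpy as np

# Hypothetical toy labels: NaN marks a sample without a label.
y_toy = np.array([np.nan, 3.0, np.nan, 7.0, np.nan])

# Stand-in for is_labeled(y_toy): True where a label is present.
labeled_mask = ~np.isnan(y_toy)

# Indices of samples that can still be queried, as in the widget's constructor.
candidates = np.arange(len(y_toy))[~labeled_mask]

# Fraction of labeled samples, similar to the value shown in the progress bar.
progress = labeled_mask.sum() / len(y_toy)

print(candidates, progress)
```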

## Create And Start Annotation Process

The `MLPClassifier` of `sklearn` is used as the classifier and `UncertaintySampling` as the query strategy.
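
To give an intuition for what an uncertainty-based query strategy does, the sketch below selects the samples whose most probable class has the lowest predicted probability (the least-confidence criterion). This is an illustration of one common variant, not the implementation of `UncertaintySampling`; the function `least_confident_batch` and the probability matrix are hypothetical.

```python
import numpy as np


def least_confident_batch(probas, candidates, batch_size):
    """Return the `batch_size` candidates with the lowest top-class probability."""
    # Uncertainty is high when even the most probable class has low probability.
    uncertainty = 1.0 - probas[candidates].max(axis=1)
    # Sort candidates from most to least uncertain.
    order = np.argsort(-uncertainty, kind="stable")
    return candidates[order[:batch_size]]


# Hypothetical class probabilities for four samples and three classes.
probas = np.array([
    [0.9, 0.1, 0.0],
    [0.4, 0.3, 0.3],
    [0.6, 0.4, 0.0],
    [0.5, 0.5, 0.0],
])
candidates = np.array([0, 1, 2, 3])
selected = least_confident_batch(probas, candidates, batch_size=2)
print(selected)
```

With `batch_size=4`, as used in the cell below, four such samples would be presented for labeling before the model is retrained.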

In [12]:
pipe = Pipeline([('scaler', StandardScaler()), ('MLP', MLPClassifier())])
clf = SklearnClassifier(pipe, classes=range(10))

qs = UncertaintySampling()

labelling_widget = ClassLabeller(
    options=list(range(1, 10)) + [0], image_size=(100, 100))

data_labeler = DataLabeler(
    X=X,
    y=y,
    model=clf,
    query_strategy=qs,
    labelling_widget=labelling_widget,
    batch_size = 4,
    n_cycles=50
)
data_labeler
