> `SupervisableDataset` holds your data throughout the labeling process.
>
> :speedboat: Let's take a look at its core mechanisms.

-   <details open><summary>This page addresses **single components** of `hover`</summary>
    We are using code snippets to pick out parts of the annotation interface, so that the documentation can explain what they do.

    -   Please be aware that this is NOT how one would typically use `hover`.
    -   Typical usage deals with [recipes](../../tutorial/t1-active-learning) where the individual parts have been tied together.

</details>

-   <details open><summary>Dependencies for {== local environments ==}</summary>
    When you run the code locally, you may need to install additional packages.

    To render `bokeh` plots in Jupyter, you need:
```shell
    pip install jupyter_bokeh
```

    If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
```

</details>

## **Data Subsets**

We place unlabeled and labeled data in different subsets: "raw", "train", "dev", and "test". Unlabeled data starts in the "raw" subset and can be transferred to the other subsets once it is labeled.

`SupervisableDataset` uses a "population table", `dataset.pop_table`, to show the size of each subset:

In [1]:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for a fast, low-memory demonstration, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")

<br>

In [2]:
from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

show(dataset.pop_table, notebook_url=notebook_url)

<br>
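To make the idea of a population table concrete, here is a minimal pandas sketch that counts rows per label and subset. It is not `hover`'s implementation: the helper `population_counts`, the toy data, and the `"ABSTAIN"` placeholder for unlabeled rows are assumptions made for illustration.

```python
import pandas as pd

# Subset names used on this page, in display order.
SUBSETS = ["raw", "train", "dev", "test"]


def population_counts(df, label_key="label", subset_key="SUBSET"):
    """Count rows per (label, subset); hypothetical helper, not part of hover."""
    counts = df.groupby([label_key, subset_key]).size().unstack(fill_value=0)
    # Make sure every subset has a column, even when it is empty.
    return counts.reindex(columns=SUBSETS, fill_value=0)


# Toy data; "ABSTAIN" is assumed here as the placeholder for unlabeled rows.
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d", "e"],
        "label": ["ABSTAIN", "ABSTAIN", "sci", "sci", "rec"],
        "SUBSET": ["raw", "raw", "train", "dev", "train"],
    }
)

counts = population_counts(toy)
print(counts)
```

Each row of the result is a label and each column is a subset, which is roughly the information a population table needs to display.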

### **Transfer Data Between Subsets**

`COMMIT` and `DEDUP` are the mechanisms that `hover` uses to transfer data between subsets.

-   `COMMIT` copies selected points (to be discussed later) to a destination subset
    -   labeled-raw-only: `COMMIT` automatically detects which points are in the raw set with a valid label. Other points will not get copied.
    -   keep-last: you can commit the same point to the same subset multiple times and the last copy will be kept. This can be useful for revising labels before `DEDUP`.
-   `DEDUP` removes duplicates (identified by feature value) across subsets
    -   priority rule: test > dev > train > raw, i.e. test set data always gets kept during deduplication

-   <details open><summary>FAQ</summary>
    <details open><summary>Why does COMMIT only work on the raw subset?</summary>
        Most selections will happen through plots, where different subsets are on top of each other. This means selections can contain both unlabeled and labeled points.

        Way too often we find ourselves trying to view both the labeled and the unlabeled, but only moving the unlabeled "raw" points. So it's handy that COMMIT picks those points only.
    </details>

</details>
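The `COMMIT` and `DEDUP` rules described above can be approximated in plain pandas. This is a sketch under stated assumptions, not `hover`'s actual code: the functions `commit` and `dedup`, the `SUBSET_PRIORITY` mapping, and the `"ABSTAIN"` placeholder for "no valid label" are all hypothetical names chosen for the example.

```python
import pandas as pd

# Higher number wins during deduplication: test > dev > train > raw.
SUBSET_PRIORITY = {"raw": 0, "train": 1, "dev": 2, "test": 3}


def commit(df, selected_idx, dest, abstain="ABSTAIN"):
    """Copy selected rows into `dest`, but only raw rows with a valid label."""
    selected = df.loc[selected_idx]
    eligible = selected[
        (selected["SUBSET"] == "raw") & (selected["label"] != abstain)
    ]
    copied = eligible.assign(SUBSET=dest)
    return pd.concat([df, copied], ignore_index=True)


def dedup(df, feature_key="text"):
    """Drop duplicate features, keeping the last commit and the top subset."""
    # keep-last: within one subset, the most recent copy survives
    df = df.drop_duplicates(subset=[feature_key, "SUBSET"], keep="last")
    # priority rule: across subsets, the highest-priority copy survives
    ranked = df.assign(_rank=df["SUBSET"].map(SUBSET_PRIORITY))
    ranked = ranked.sort_values("_rank", kind="stable")
    kept = ranked.drop_duplicates(subset=[feature_key], keep="last")
    return kept.drop(columns="_rank").sort_index()


toy = pd.DataFrame(
    {
        "text": ["a", "b", "c"],
        "label": ["x", "ABSTAIN", "y"],
        "SUBSET": ["raw", "raw", "train"],
    }
)

# Select everything; only the labeled raw row "a" is copied to "dev".
committed = commit(toy, [0, 1, 2], dest="dev")

# Revise the label of "a" and commit again; DEDUP keeps the later copy.
committed.loc[0, "label"] = "z"
recommitted = commit(committed, [0], dest="dev")
result = dedup(recommitted)
print(result)
```

In this toy run, the unlabeled point "b" stays in "raw", the already-committed point "c" stays in "train", and "a" ends up only in "dev" with its revised label.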

These mechanisms correspond to buttons in `hover`'s annotation interface, which you have encountered in the quickstart:

In [3]:
from bokeh.layouts import row, column

show(column(
    row(
        dataset.data_committer,
        dataset.dedup_trigger,
    ),
    dataset.pop_table,
), notebook_url=notebook_url)


<br>

Of course, so far we have nothing to move, because there's no data selected. We shall now discuss selections.

## **Selection**

`hover` labels data points in bulk, which requires selecting groups of homogeneous data, i.e. points that are semantically similar or that should receive the same label. Being able to skim through what you selected gives you confidence about its homogeneity.

Normally, selection happens through a plot (`explorer`), as we have seen in the quickstart. For our purposes here, we will "cheat" and assign the selection programmatically:

In [4]:
dataset._callback_update_selection(dataset.dfs["raw"].loc[:10])

show(dataset.sel_table, notebook_url=notebook_url)

<br>

### **Edit Data Within a Selection**

Often the points selected are not perfectly homogeneous, i.e. some outliers belong to a different label from the selected group overall. It would be helpful to `EVICT` them, and `SupervisableDataset` has a button for it.

Sometimes you may also wish to edit data values on the fly. In `hover` this is called `PATCH`, and there is also a button for it.

-   by default, labels can be edited but feature values cannot.

Let's plot the aforementioned buttons along with the selection table. Toggle any number of rows in the table, then click the button to `EVICT` or `PATCH` those rows:

In [5]:
show(column(
    row(
        dataset.selection_evictor,
        dataset.selection_patcher,
    ),
    dataset.sel_table,
), notebook_url=notebook_url)


<br>
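For readers who prefer code to buttons, the effect of `EVICT` and `PATCH` can be sketched with pandas. This is only an illustration of the behavior described above, not `hover`'s implementation: the functions `evict` and `patch` and the `editable` parameter are hypothetical, and the sketch assumes that the selection shares its index with the dataset.

```python
import pandas as pd


def evict(selection, rows):
    """Remove the toggled rows from the selection; the dataset is untouched."""
    return selection.drop(index=rows)


def patch(dataset_df, selection, rows, editable=("label",)):
    """Write edited values of the toggled rows back to a copy of the dataset.

    Only columns listed in `editable` are written, mirroring the default
    where labels can be edited but feature values cannot.
    """
    patched = dataset_df.copy()
    cols = list(editable)
    patched.loc[rows, cols] = selection.loc[rows, cols].to_numpy()
    return patched


dataset_df = pd.DataFrame(
    {"text": ["a", "b", "c", "d"], "label": ["x", "x", "y", "x"]}
)

# Select the first three rows, then evict the outlier at index 2.
selection = dataset_df.loc[[0, 1, 2]].copy()
selection = evict(selection, [2])

# Edit row 1 in the selection; only the label edit is allowed through.
selection.loc[1, "label"] = "z"
selection.loc[1, "text"] = "B"
patched = patch(dataset_df, selection, [1])
print(patched)
```

Restricting the write-back to an explicit list of editable columns is one simple way to keep feature values stable while labels are being revised.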