> `Annotator` is an `explorer` which provides a map of your data colored by labels.
>
> :speedboat: Let's walk through its components and how they interact with the `dataset`.
>
> -   {== You will find many of these components again in other `explorer`s. ==}

-   <details open><summary>This page addresses **single components** of `hover`</summary>
    For illustration, we are using code snippets to pick out specific widgets so that the documentation can explain what they do.

    -   Please be aware that you won't need to get the widgets by code in an actual use case.
    -   Typical usage deals with [recipes](../../tutorial/t1-active-learning) where the individual parts have been tied together.

</details>

-   <details open><summary>Dependencies for {== local environments ==}</summary>
    When you run the code locally, you may need to install additional packages.

    To run the text embedding code on this page, you need:
```shell
    pip install spacy
    python -m spacy download en_core_web_md
```

    To render `bokeh` plots in Jupyter, you need:
```shell
    pip install jupyter_bokeh
```

    If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
```

</details>

## **Preparation**

As always, start with a ready-for-plot dataset:

In [1]:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for fast, low-memory demonstration purpose, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")

<br>

In [2]:
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

# any kwargs will be passed onto the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

Vectorizing: 100%|██████████| 957/957 [00:02<00:00, 339.86it/s]




<br>

## **Scatter Plot: Semantically Similar Points are Close Together**

`hover` labels data points in bulk, which requires selecting groups of homogeneous data.

The core of the annotator is a scatter plot and labeling widgets:

In [3]:
from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

from hover.recipes.subroutine import standard_annotator
from bokeh.layouts import row, column

annotator = standard_annotator(dataset)
show(column(
    row(annotator.annotator_input, annotator.annotator_apply),
    annotator.figure,
), notebook_url=notebook_url)

You are generating standalone HTML/JS output, but trying to use real Python
callbacks (i.e. with on_change or on_event). This combination cannot work.

Only JavaScript callbacks may be used with standalone output. For more
information on JavaScript callbacks with Bokeh, see:

    https://docs.bokeh.org/en/latest/docs/user_guide/interaction/callbacks.html

Alternatively, to use real Python callbacks, a Bokeh server application may
be used. For more information on building and running Bokeh applications, see:

    https://docs.bokeh.org/en/latest/docs/user_guide/server.html



<br>

### **Select Points on the Plot**

On the right of the scatter plot, you can find tap, polygon, and lasso tools which can select data points.

### **View Tooltips with Mouse Hover**

Embeddings are helpful but rarely perfect. This is why we have tooltips that show the detail of each point on mouse hover, allowing us to inspect points, discover patterns, and come up with new labels on the fly.

### **Show & Hide Subsets**

Showing labeled subsets can tell you which parts of the data have been explored and which have not. With toggle buttons, you can turn the display of any subset on or off.

In [4]:
show(annotator.data_key_button_group, notebook_url=notebook_url)


<br>

## **Make Consecutive Selections**

Ever selected multiple (non-adjacent) files in your file system using <kbd>Ctrl</kbd>/<kbd>Command</kbd>?

Similarly but more powerfully, you can make consecutive selections with a "keep selecting" option.

In [5]:
show(annotator.selection_option_box, notebook_url=notebook_url)

<br>

-   <details open><summary>Selection option values: what do they do?</summary>
    Basic set operations on your old & new selection. [Quick intro here](https://www.geeksforgeeks.org/python-set-operations-union-intersection-difference-symmetric-difference/)

    -   `none`: the default, where a new selection `B` simply replaces the old one `A`.
    -   `union`: `A ∪ B`, the new selection gets unioned with the old one.
        -   this resembles the <kbd>Ctrl</kbd>/<kbd>Command</kbd> mentioned above.
    -   `intersection`: `A ∩ B`, the new selection gets intersected with the old one.
        -   this is particularly useful when going beyond simple 2D plots.
    -   `difference`: `A ∖ B`, the new selection gets subtracted from the old one.
        -   this is for de-selecting outliers.
</details>
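
The option values above map directly onto Python's built-in set operators. Here is a minimal sketch, assuming selections are sets of row indices; the `combine_selection` helper and the index values are made up for illustration and are not part of `hover`:

```python
# Row indices of an old selection A and a new selection B (made-up values).
old_selection = {1, 2, 3, 4}
new_selection = {3, 4, 5}


def combine_selection(old, new, option="none"):
    """Combine two selections the way each option value describes."""
    if option == "none":
        return set(new)  # B simply replaces A
    if option == "union":
        return old | new  # A ∪ B
    if option == "intersection":
        return old & new  # A ∩ B
    if option == "difference":
        return old - new  # A ∖ B
    raise ValueError(f"unknown selection option: {option}")


for option in ("none", "union", "intersection", "difference"):
    combined = combine_selection(old_selection, new_selection, option)
    print(option, sorted(combined))
```

Note that `difference` is the only option where the order of `A` and `B` matters: the new selection is what gets removed from the old one.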

## **Change Plot Axes**

`hover` supports dynamically choosing which embedding dimensions to use for your 2D plot. This becomes nontrivial, and sometimes very useful, when we have a 3D embedding (or higher):

In [6]:
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=3)

annotator = standard_annotator(dataset)

show(column(
    row(annotator.dropdown_x_axis, annotator.dropdown_y_axis),
    annotator.figure,
), notebook_url=notebook_url)

Vectorizing: 100%|██████████| 957/957 [00:00<00:00, 1309608.13it/s]





<br>
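
To see why the axis choice only starts to matter at three or more dimensions, consider this small sketch. It assumes the embedding is a list of 3D points; the coordinates and the `project` helper are made up for illustration and are not `hover` internals:

```python
from itertools import combinations

# Three made-up points in a 3D embedding: (dim_0, dim_1, dim_2).
embedding_3d = [
    (0.1, 1.0, -2.0),
    (0.4, 0.8, -1.5),
    (0.9, 0.2, 3.0),
]


def project(points, x_dim, y_dim):
    """Keep only the two chosen dimensions of each point."""
    return [(point[x_dim], point[y_dim]) for point in points]


# Every unordered pair of distinct dimensions gives a different 2D view.
axis_pairs = list(combinations(range(3), 2))
views = {pair: project(embedding_3d, *pair) for pair in axis_pairs}

for pair, view in views.items():
    print(pair, view)
```

A 2D embedding has a single such pair, while a 3D embedding has three, and points that overlap in one view may separate in another.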

## **Text Search Widget: Include/Exclude**

Keywords or regular expressions can be great starting points for identifying a cluster of similar points based on domain expertise.

You may specify a *positive* regular expression to match and/or a *negative* one to exclude.

The `annotator` will amplify the sizes of positive-match data points and shrink those of negative matches.

In [7]:
show(row(annotator.search_pos, annotator.search_neg), notebook_url=notebook_url)


<br>
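
The positive/negative search described above can be sketched with Python's `re` module. This is only an illustration: the size values, the `point_size` helper, and the rule that a negative match takes precedence over a positive one are assumptions, not `hover`'s actual behavior.

```python
import re

# Made-up marker sizes; the real values used by the plot may differ.
BASE_SIZE, LARGE_SIZE, SMALL_SIZE = 5, 8, 2

texts = [
    "the new graphics card renders faster",
    "hockey playoffs start tonight",
    "graphics drivers crash during hockey streams",
]


def point_size(text, pos_pattern=None, neg_pattern=None):
    """Pick a marker size from optional positive and negative patterns."""
    # Assumption: a negative match wins when both patterns match.
    if neg_pattern and re.search(neg_pattern, text):
        return SMALL_SIZE
    if pos_pattern and re.search(pos_pattern, text):
        return LARGE_SIZE
    return BASE_SIZE


both = [point_size(t, pos_pattern=r"graphic", neg_pattern=r"hockey") for t in texts]
pos_only = [point_size(t, pos_pattern=r"graphic") for t in texts]
print(both)
print(pos_only)
```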

### **Preview: Use Search for Selection in Finder**

In a particular kind of plot called `finder` (search for it in the README!), the search widget can operate directly on your selection as a filter.

## **The Plot and The Dataset**

When we apply labels through the annotator plot, it's actually the `dataset` behind the plot that gets updated immediately. The plot itself is not kept in direct sync with the dataset, which is a design choice for performance. Instead, we use a trigger called `PUSH` to send the updated data entries to the plot.

### **PUSH: Synchronize from Dataset to Plots**

Below is the full interface of the `dataset`, where you can find a green "Push" button:

In [8]:
show(dataset.view(), notebook_url=notebook_url)


In a built-in `recipe`, the "Push" button will update the latest data to every `explorer` linked to the `dataset`.
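
As a rough mental model of this deferred synchronization, the sketch below uses two made-up classes, `ToyDataset` and `ToyPlot`. They are hypothetical and do not reflect how `hover` implements its dataset or explorers; they only show that labels land in the dataset first and reach the plots on push.

```python
class ToyPlot:
    """Hypothetical stand-in for an explorer: holds its own copy of labels."""

    def __init__(self):
        self.labels = {}


class ToyDataset:
    """Hypothetical stand-in for the dataset: the source of truth."""

    def __init__(self, labels):
        self.labels = dict(labels)
        self.linked_plots = []

    def link(self, plot):
        self.linked_plots.append(plot)

    def apply_label(self, index, label):
        # Labeling updates the dataset immediately, but not the plots.
        self.labels[index] = label

    def push(self):
        # Push copies the latest data to every linked plot.
        for plot in self.linked_plots:
            plot.labels = dict(self.labels)


toy_dataset = ToyDataset({0: "unlabeled", 1: "unlabeled"})
plot_a, plot_b = ToyPlot(), ToyPlot()
toy_dataset.link(plot_a)
toy_dataset.link(plot_b)
toy_dataset.push()

toy_dataset.apply_label(0, "sports")
label_before_push = plot_a.labels[0]  # the plot still holds the old label

toy_dataset.push()
label_after_push = plot_a.labels[0]  # now in sync with the dataset
print(label_before_push, "->", label_after_push)
```

Batching updates this way means a plot is redrawn once per push rather than once per labeling action.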