> Welcome to the basic use case of `hover`!
>
> :sunglasses: Let's say we want to label some data and call it a day.

-   <details open><summary>Dependencies for {== local environments ==}</summary>
    When you run the code locally, you may need to install additional packages.

    To run the text embedding code on this page, you need:
```shell
    pip install spacy
    python -m spacy download en_core_web_md
```

    To render `bokeh` plots in Jupyter, you need:
```shell
    pip install jupyter_bokeh
```

    If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
```

</details>

## **Ingredient 1 / 3: Raw Data**

Start with a spreadsheet loaded in `pandas`.

We turn it into a [`SupervisableDataset`](../../reference/core-dataset/#hover.core.dataset.SupervisableDataset) designed for labeling:

In [1]:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

example_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
# sample the data for a fast, low-memory demonstration
df_raw = pd.read_csv(example_csv_path).sample(1000)
df_raw["text"] = df_raw["text"].astype(str)

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data is available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"]().head(5)

Unnamed: 0,text,SUBSET,label
0,"Hello, I have a motherboard and a case for sal...",raw,ABSTAIN
1,[...]> [...]> If the data isn't there when t...,raw,ABSTAIN
2,Since the wiretap chip is being distributed in...,raw,ABSTAIN
3,: I'm looking at the following three SUV's; ...,raw,ABSTAIN
4,Can some people with cache cards PLEASE post ...,raw,ABSTAIN
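
If some of your data already has labels, the `SUBSET` column is where that shows up. Below is a minimal pandas-only sketch, assuming a hypothetical DataFrame where unlabeled rows have a missing `label`. Sending labeled rows to "train" is just one possible choice, and the `ABSTAIN` placeholder mirrors the output above rather than a documented requirement.

```python
import pandas as pd

# hypothetical toy data: only the first row has a label
df = pd.DataFrame(
    {
        "text": ["a cheap motherboard", "which SUV to buy", "cache card question"],
        "label": ["forsale", None, None],
    }
)

# labeled rows go to "train", unlabeled rows go to "raw"
df["SUBSET"] = df["label"].notna().map({True: "train", False: "raw"})

# fill missing labels with the placeholder seen in the output above
df["label"] = df["label"].fillna("ABSTAIN")

print(df)
```

The resulting DataFrame has the same `text` / `label` / `SUBSET` layout as `df_raw` before it is handed to `from_pandas`.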


-   <details open><summary>FAQ</summary>
    <details open><summary>What if I have multiple features?</summary>
        `feature_key` refers to the field that will be vectorized later on. That field can hold a JSON string that encloses multiple features.

        For example, suppose our data entries look like this:
```python
        {"f1": "foo", "f2": "bar", "non_feature": "abc"}
```

        We can put `f1` and `f2` in a JSON and convert the entries like this:
```python
        # could also keep f1 and f2 around
        {'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

    </details>

    <details open><summary>Can I use audio or image data?</summary>
        Yes! Please check out the "Guides" section of the documentation.
    </details>

</details>
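
For the multi-feature case in the FAQ above, here is a minimal sketch of packing two columns into one JSON string column with pandas. The column names `f1`, `f2`, and `feature` follow the FAQ example; the toy rows are made up for illustration.

```python
import json

import pandas as pd

# hypothetical entries with two feature fields and one non-feature field
df = pd.DataFrame(
    [
        {"f1": "foo", "f2": "bar", "non_feature": "abc"},
        {"f1": "baz", "f2": "qux", "non_feature": "def"},
    ]
)

# pack f1 and f2 into a single JSON string per row
df["feature"] = df[["f1", "f2"]].apply(
    lambda row: json.dumps(row.to_dict()), axis=1
)

# drop the originals here (they could also be kept around)
df_packed = df.drop(columns=["f1", "f2"])

print(df_packed.loc[0, "feature"])
```

With data shaped like this, `feature_key` would point at the `feature` column instead of `text`.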

## **Ingredient 2 / 3: Embedding**

A pre-trained embedding lets us group data points semantically.

In particular, let's define a `data -> embedding vector` function.

In [2]:
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"]().loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")

Text: Hello, I have a motherboard and a case for sale as a package. Both of them came from a CompuAdd computer I bought last August and am    presently upgrading. Here are the specs--  Motherboard ----------- Cyrix 486SL 25 MHz microprocessor Chips and Technology chipset (SCATsx V2.3.6 SLSLC) 8 SIMM banks for a maximum of 32 Megs of RAM BUILT-IN Floppy and Hard Drive Controllers BUILT-IN ports--1 Parallel, 2 Serial (9 and 25 pin) BUILT-IN Paradise SVGA controller with 1 meg of RAM (Windows drivers inc.)    -can do up to 1024x768 @ 256 colors    -based on the Western Digital WD90C31 chip Math co-processor slot 3 16-bit expansion slots and 2 8-bit expansion slots  Case ---- Low-Profile Desktop Very nice grey color 150 watt power supply Room for 2 floppies plus HD  Mouse ----- 3-button Microsoft-compatible Grey color matches case  All original manuals and documentation, video drivers, etc. are included.  You are probably wondering why I must sell the case with the motherboard. It is simpl

-   <details open><summary>Tips</summary>
    <details open><summary>Caching</summary>
        `dataset` by itself stores the original features but not the corresponding vectors.

        To avoid vectorizing the same feature again and again, we could simply do:
```python
        from functools import cache  # available in Python 3.9+

        @cache
        def vectorizer(feature):
            # put code here
            ...
```

        If you'd like to limit the size of the cache, something like `@lru_cache(maxsize=10000)` could help.

        Check out [functools](https://docs.python.org/3/library/functools.html) for more options.

    </details>
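
    To see the cache at work, here is a small self-contained sketch. `toy_vectorizer` and `call_count` are hypothetical names, and the three text statistics are only a stand-in for a real embedding model.

```python
from functools import lru_cache

import numpy as np

call_count = 0  # how many times the underlying computation actually runs


@lru_cache(maxsize=10000)
def toy_vectorizer(text):
    """Hypothetical stand-in for an embedding: three simple text statistics."""
    global call_count
    call_count += 1
    return np.array(
        [len(text), text.count(" "), sum(ch.isupper() for ch in text)],
        dtype=float,
    )


first = toy_vectorizer("Hello hover")
second = toy_vectorizer("Hello hover")  # same input, served from the cache

print(call_count)
print(toy_vectorizer.cache_info())
```

    Two things to keep in mind: the input must be hashable (a `str` is), and a cache hit returns the very same array object, so modifying it in place also changes what later calls get back.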

    <details open><summary>Vectorizing multiple features</summary>
        Suppose we have multiple features enclosed in a JSON:
```python
        # could also keep f1 and f2 around
        {'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

        Also, suppose we have individual vectorizers like this:
```python
        def vectorizer_1(feature_1):
            # put code here
            ...

        def vectorizer_2(feature_2):
            # put code here
            ...
```

        Then we can define a composite vectorizer:
```python
        import json
        import numpy as np

        def vectorizer(feature_json):
            data_dict = json.loads(feature_json)
            vectors = []
            for field, func in [
                ("f1", vectorizer_1),
                ("f2", vectorizer_2),
            ]:
                vectors.append(func(data_dict[field]))

            return np.concatenate(vectors)
```
    </details>

</details>
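
As a quick end-to-end check of the composite vectorizer idea from the tips above, here is a runnable sketch. The two toy vectorizers are hypothetical and only count characters; in practice each would wrap a real embedding.

```python
import json

import numpy as np


def vectorizer_1(feature_1):
    # hypothetical: [length, number of vowels]
    text = str(feature_1)
    vowels = sum(ch in "aeiou" for ch in text.lower())
    return np.array([len(text), vowels], dtype=float)


def vectorizer_2(feature_2):
    # hypothetical: [length, count of "a", count of "b"]
    text = str(feature_2)
    return np.array([len(text), text.count("a"), text.count("b")], dtype=float)


def vectorizer(feature_json):
    # decode the JSON feature, vectorize each field, and join the pieces
    data_dict = json.loads(feature_json)
    parts = [vectorizer_1(data_dict["f1"]), vectorizer_2(data_dict["f2"])]
    return np.concatenate(parts)


vec = vectorizer('{"f1": "foo", "f2": "bar"}')
print(vec.shape)  # (5,)
```

The composite vector has length 2 + 3 = 5 here. Each individual vectorizer should return a 1-D array of fixed length so that every data entry ends up with a vector of the same size.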

## **Ingredient 3 / 3: 2D Embedding**

We compute a 2D version of the pre-trained embedding to visualize the whole dataset.

Hover has built-in methods for calling [umap](https://umap-learn.readthedocs.io/en/latest/) or [ivis](https://bering-ivis.readthedocs.io/en/latest/).

-   <details open><summary>Dependencies (when in your own environment)</summary>
    The libraries for this step are not directly required by `hover`:

    -   for umap: `pip install umap-learn`
    -   for ivis: `pip install ivis[cpu]` or `pip install ivis[gpu]`

    `umap-learn` is installed in this demo environment.
</details>

In [3]:
# any kwargs will be passed on to the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

# this adds 'embed_2d_0' and 'embed_2d_1' columns to the DataFrames in dataset.dfs
dataset.dfs["raw"]().head(5)

Vectorizing: 100%|██████████| 986/986 [00:02<00:00, 411.63it/s]




Unnamed: 0,text,SUBSET,label,embed_2d_0,embed_2d_1
0,"Hello, I have a motherboard and a case for sal...",raw,ABSTAIN,8.171587,7.221344
1,[...]> [...]> If the data isn't there when t...,raw,ABSTAIN,11.794644,4.784071
2,Since the wiretap chip is being distributed in...,raw,ABSTAIN,9.848407,5.98604
3,: I'm looking at the following three SUV's; ...,raw,ABSTAIN,9.247622,5.13544
4,Can some people with cache cards PLEASE post ...,raw,ABSTAIN,9.28352,5.714137


## :sparkles: **Apply Labels**

We are ready for the annotation interface!

In [4]:
from hover.recipes.stable import simple_annotator

interactive_plot = simple_annotator(dataset)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='https://localhost:8888')

-   <details open><summary>Tips: annotation interface basics</summary>
    <details open><summary>Video guide</summary>
        <iframe width="560" height="315" src="https://www.youtube.com/embed/WYN2WduzJWg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

    </details>

    <details open><summary>Text guide</summary>
        There should be a `SupervisableDataset` board on the left and a `BokehDataAnnotator` on the right, each with a few buttons.

        === "SupervisableDataset"
            -   `push`: push `Dataset` updates to the bokeh plots.
            -   `commit`: add data entries selected in the `Annotator` to a specified subset.
            -   `dedup`: deduplicate across subsets by `feature` (last in gets kept).
            -   `export`: save your data (all subsets) in a specified format.

        === "BokehDataAnnotator"
            -   `raw`/`train`/`dev`/`test`: choose which subsets to display or hide.
            -   `apply`: apply the `label` input to the selected points in the `raw` subset only.

        We've essentially put the data into neighborhoods based on the vectorizer, but the quality (homogeneity of labels) of such neighborhoods can vary.

        -   hover over any data point to see its tooltip.
        -   take advantage of different selection tools to apply labels at appropriate scales.
        -   the search widget can also be useful.
            -    note that it does not select points but highlights them.
    </details>

</details>
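
To get a feel for what "last in gets kept" means for `dedup`, here is a pandas-only illustration. It is not hover's implementation: the two toy subsets and the concatenation order are assumptions made for the example.

```python
import pandas as pd

# hypothetical subsets that share the feature "bar"
train = pd.DataFrame(
    {"text": ["foo", "bar"], "label": ["A", "B"], "SUBSET": "train"}
)
raw = pd.DataFrame(
    {"text": ["bar", "baz"], "label": ["ABSTAIN", "ABSTAIN"], "SUBSET": "raw"}
)

# the concatenation order decides which copy counts as "last in"
combined = pd.concat([raw, train], ignore_index=True)

# keep only the last occurrence of each feature value
deduped = combined.drop_duplicates(subset="text", keep="last").reset_index(drop=True)

print(deduped)
```

Because `train` comes after `raw` in the concatenation, the labeled copy of "bar" survives and the unlabeled one is dropped.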