> Welcome to the basic use case of `hover`!
>
> :sunglasses: Let's say we want to label some data and call it a day.

-   <details open><summary>Dependencies for {== local environments ==}</summary>
    When you run the code locally, you may need to install additional packages.

    To run the text embedding code on this page, you need:
```shell
    pip install spacy
    python -m spacy download en_core_web_md
```

    To render `bokeh` plots in Jupyter, you need:
```shell
    pip install jupyter_bokeh
```

    If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
```

</details>

## **Ingredient 1 / 3: Raw Data**

Start with a spreadsheet loaded in `pandas`.

We turn it into a [`SupervisableDataset`](../../reference/core-dataset/#hover.core.dataset.SupervisableDataset) designed for labeling:

In [1]:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

example_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
# for a fast, low-memory demonstration, sample the data
df_raw = pd.read_csv(example_csv_path).sample(1000)
df_raw["text"] = df_raw["text"].astype(str)

# data is divided into 4 subsets: "raw" / "train" / "dev" / "test"
# this example assumes no labeled data is available, i.e. only "raw"
df_raw["SUBSET"] = "raw"

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df_raw, feature_key="text", label_key="label")

# each subset can be accessed as its own DataFrame
dataset.dfs["raw"]().head(5)

|   | text | SUBSET | label |
|---|------|--------|-------|
| 0 | I understand Caddy is working on one, double ... | raw | ABSTAIN |
| 1 | I'm looking for a PC that is small and doesn't... | raw | ABSTAIN |
| 2 | I just bought a new IDE hard drive for my syst... | raw | ABSTAIN |
| 3 | My GS came with XGT V4s and they are NOT all... | raw | ABSTAIN |
| 4 | I am a novice (at best) in working with pc net... | raw | ABSTAIN |
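
The cell above marks every row as `raw`. If some rows already carried labels, one possible way to fill in the `SUBSET` column is sketched below. This is an illustration only: plain dicts stand in for DataFrame rows, the `assign_subset` helper and the `"hardware"` label are hypothetical, and the `ABSTAIN` placeholder is taken from the output above.

```python
ABSTAIN = "ABSTAIN"  # placeholder label, as shown in the output above


def assign_subset(row, labeled_subset="train"):
    """Return a copy of `row` with 'label' and 'SUBSET' filled in (hypothetical helper)."""
    out = dict(row)
    label = out.get("label")
    if label is None or label == ABSTAIN:
        # no usable label: keep the row in the "raw" subset
        out["label"] = ABSTAIN
        out["SUBSET"] = "raw"
    else:
        # already labeled: send the row to the chosen labeled subset
        out["SUBSET"] = labeled_subset
    return out


rows = [
    {"text": "I just bought a new IDE hard drive", "label": "hardware"},  # made-up label
    {"text": "My GS came with XGT V4s", "label": None},
]
prepared = [assign_subset(row) for row in rows]
```

With a real DataFrame, the same rule could be applied column-wise before calling `from_pandas`.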


-   <details open><summary>FAQ</summary>
    <details open><summary>What if I have multiple features?</summary>
        `feature_key` refers to the field that will be vectorized later on, which can be a JSON that encloses multiple features.

        For example, suppose our data entries look like this:
```python
        {"f1": "foo", "f2": "bar", "non_feature": "abc"}
```

        We can put `f1` and `f2` in a JSON and convert the entries like this:
```python
        # could also keep f1 and f2 around
        {'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

    </details>

    <details open><summary>Can I use audio or image data?</summary>
        Yes! Please check out the "Guides" section of the documentation.
    </details>

</details>
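
As a rough illustration of the multiple-feature FAQ above, the conversion into a single JSON field could look like the sketch below. The `pack_features` helper is hypothetical and not part of `hover`; it only uses the standard `json` module.

```python
import json


def pack_features(entry, feature_fields=("f1", "f2"), feature_key="feature"):
    """Move the listed fields into one JSON string stored under `feature_key`."""
    # keep every non-feature field as-is
    packed = {k: v for k, v in entry.items() if k not in feature_fields}
    # a fixed field order keeps the JSON string identical for identical features
    packed[feature_key] = json.dumps({k: entry[k] for k in feature_fields})
    return packed


entry = {"f1": "foo", "f2": "bar", "non_feature": "abc"}
packed = pack_features(entry)
print(packed)
# {'non_feature': 'abc', 'feature': '{"f1": "foo", "f2": "bar"}'}
```

Serializing the fields in a fixed order matters if the resulting string is later compared for equality, for example when deduplicating by feature.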

## **Ingredient 2 / 3: Embedding**

A pre-trained embedding lets us group data points semantically.

In particular, let's define a `data -> embedding vector` function.

In [2]:
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

text = dataset.dfs["raw"]().loc[0, "text"]
vec = vectorizer(text)
print(f"Text: {text}")
print(f"Vector shape: {vec.shape}")

Text:  I understand Caddy is working on one, double battery, super high perf engine, more gauges, a bit 'stretched', etc, lots of communication equipment, the works. Color selection is limited though.   The problem is that the guy at 1600 Penn. Avenue is about to get it (Pres. Clinton) (Last time it was a Lincoln, this time a Caddy).    Not to my knowledge; I know GM does conversion work for things like  hot climates (i.e. the Chevy Caprices sold to the Middle East) but  things like that are always done by third parties, NOT the manufacturer. Maybe you will need to buy a specific package that has beefed-up everything, perhaps the police cruiser package on the Caprice/Crown Vic and start from there.    "And I wuz drivin' along in my armored Seville STS and this punk pulls out of nowhere with an RPG (Rocket Propelled Grenade) but the bulletproof windshield stopped him" :-) Don't think many people on the net have a need for bulletproof cars.   Check with local armored service companies/se
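
To see what `lru_cache` does for a vectorizer like the one above, here is a minimal sketch with a toy stand-in. `toy_vectorizer` is hypothetical and returns simple counts instead of a spaCy embedding; only the whitespace normalization and the caching decorator mirror the cell above.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=int(1e4))
def toy_vectorizer(text):
    # same whitespace normalization as the vectorizer above
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    # stand-in for a real embedding: (character count, word count)
    return (len(clean_text), len(clean_text.split()))


toy_vectorizer("hello   world")  # first call: computed
toy_vectorizer("hello   world")  # same raw string: served from the cache
toy_vectorizer("hello\nworld")   # different raw string: computed again

info = toy_vectorizer.cache_info()
print(info.hits, info.misses)  # 1 2
```

Note that the cache is keyed on the raw input, so two strings that normalize to the same text are still vectorized separately.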

-   <details open><summary>Tips</summary>
    <details open><summary>Caching</summary>
        `dataset` by itself stores the original features but not the corresponding vectors.

        To avoid vectorizing the same feature again and again, we could simply do:
```python
        from functools import cache  # available in Python 3.9+

        @cache
        def vectorizer(feature):
            # put code here
            ...
```

        If you'd like to limit the size of the cache, something like `@lru_cache(maxsize=10000)` could help.

        Check out [functools](https://docs.python.org/3/library/functools.html) for more options.

    </details>

    <details open><summary>Vectorizing multiple features</summary>
        Suppose we have multiple features enclosed in a JSON:
```python
        # could also keep f1 and f2 around
        {'feature': '{"f1": "foo", "f2": "bar"}', 'non_feature': 'abc'}
```

        Also, suppose we have individual vectorizers like this:
```python
        def vectorizer_1(feature_1):
            # put code here
            ...

        def vectorizer_2(feature_2):
            # put code here
            ...
```

        Then we can define a composite vectorizer:
```python
        import json
        import numpy as np

        def vectorizer(feature_json):
            data_dict = json.loads(feature_json)
            vectors = []
            for field, func in [
                ("f1", vectorizer_1),
                ("f2", vectorizer_2),
            ]:
                vectors.append(func(data_dict[field]))

            return np.concatenate(vectors)
```
    </details>

</details>
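
The composite vectorizer from the tips above can be tried end to end with toy stand-ins. In this sketch, `vectorizer_1` and `vectorizer_2` are made-up examples, and plain lists stand in for `numpy` arrays so that the snippet runs without extra packages; with real arrays, `np.concatenate` would take the place of `list.extend`.

```python
import json


def vectorizer_1(feature_1):
    # toy stand-in: text length as a 1-d vector
    return [float(len(feature_1))]


def vectorizer_2(feature_2):
    # toy stand-in: vowel count and non-vowel count as a 2-d vector
    vowels = sum(ch in "aeiou" for ch in feature_2.lower())
    return [float(vowels), float(len(feature_2) - vowels)]


def vectorizer(feature_json):
    data_dict = json.loads(feature_json)
    vector = []
    # a fixed field order keeps the layout of the output vector stable
    for field, func in [("f1", vectorizer_1), ("f2", vectorizer_2)]:
        vector.extend(func(data_dict[field]))
    return vector


vec = vectorizer('{"f1": "foo", "f2": "bar"}')
print(vec)  # [3.0, 1.0, 2.0]
```

The length of the composite vector is the sum of the lengths of the individual vectors, here 1 + 2 = 3.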

## **Ingredient 3 / 3: 2D Embedding**

We compute a 2D version of the pre-trained embedding to visualize the whole dataset.

Hover has built-in methods for calling [umap](https://umap-learn.readthedocs.io/en/latest/) or [ivis](https://bering-ivis.readthedocs.io/en/latest/).

-   <details open><summary>Dependencies (when in your own environment)</summary>
    The libraries for this step are not directly required by `hover`:

    -   for umap: `pip install umap-learn`
    -   for ivis: `pip install ivis[cpu]` or `pip install ivis[gpu]`

    `umap-learn` is installed in this demo environment.
</details>

In [3]:
# any kwargs will be passed on to the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

# the call above adds 'embed_2d_0' and 'embed_2d_1' columns to the DataFrames in dataset.dfs
dataset.dfs["raw"]().head(5)

Vectorizing: 100%|██████████| 984/984 [00:02<00:00, 352.91it/s]




|   | text | SUBSET | label | embed_2d_0 | embed_2d_1 |
|---|------|--------|-------|------------|------------|
| 0 | I understand Caddy is working on one, double ... | raw | ABSTAIN | 9.199659 | 6.410269 |
| 1 | I'm looking for a PC that is small and doesn't... | raw | ABSTAIN | 13.072384 | 4.275556 |
| 2 | I just bought a new IDE hard drive for my syst... | raw | ABSTAIN | 12.246441 | 6.590402 |
| 3 | My GS came with XGT V4s and they are NOT all... | raw | ABSTAIN | 11.719909 | 4.028717 |
| 4 | I am a novice (at best) in working with pc net... | raw | ABSTAIN | 9.488355 | 7.083333 |
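
One way to get a feel for the 2D coordinates is to check which points end up closest to each other in the plane. The sketch below does this for the five rows shown above, using only the standard library; the `nearest_neighbor` helper is hypothetical, and the coordinates will differ from run to run because the data is sampled randomly.

```python
import math

# (embed_2d_0, embed_2d_1) for the five rows shown above
points = {
    0: (9.199659, 6.410269),
    1: (13.072384, 4.275556),
    2: (12.246441, 6.590402),
    3: (11.719909, 4.028717),
    4: (9.488355, 7.083333),
}


def nearest_neighbor(idx, points):
    """Return the key of the point closest to points[idx] (Euclidean distance)."""
    origin = points[idx]
    others = (key for key in points if key != idx)
    return min(others, key=lambda key: math.dist(origin, points[key]))


neighbors = {idx: nearest_neighbor(idx, points) for idx in points}
print(neighbors)  # {0: 4, 1: 3, 2: 1, 3: 1, 4: 0}
```

Five points are far too few to judge an embedding, so this is only meant to show how the new columns can be used.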


## :sparkles: **Apply Labels**

We are ready for the annotation interface!

In [4]:
from hover.recipes.stable import simple_annotator

interactive_plot = simple_annotator(dataset)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='http://localhost:8888')  # adjust to your Jupyter server's address

-   <details open><summary>Tips: annotation interface basics</summary>
    <details open><summary>Video guide</summary>
        <iframe width="560" height="315" src="https://www.youtube.com/embed/WYN2WduzJWg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

    </details>

    <details open><summary>Text guide</summary>
        There should be a `SupervisableDataset` board on the left and a `BokehDataAnnotator` on the right, each with a few buttons.

        === "SupervisableDataset"
            -   `push`: push `Dataset` updates to the bokeh plots.
            -   `commit`: add data entries selected in the `Annotator` to a specified subset.
            -   `dedup`: deduplicate across subsets by `feature` (last in gets kept).
            -   `export`: save your data (all subsets) in a specified format.

        === "BokehDataAnnotator"
            -   `raw`/`train`/`dev`/`test`: choose which subsets to display or hide.
            -   `apply`: apply the `label` input to the selected points in the `raw` subset only.

        We've essentially put the data into neighborhoods based on the vectorizer, but the quality (homogeneity of labels) of such neighborhoods can vary.

        -   hover over any data point to see its tooltip.
        -   take advantage of different selection tools to apply labels at appropriate scales.
        -   the search widget can also be useful.
            -   note that it highlights points rather than selecting them.
    </details>

</details>
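
The "last in gets kept" rule described for `dedup` can be pictured with a small standalone sketch. This is not `hover`'s implementation: the `dedup_last_in` helper is hypothetical, plain dicts stand in for data entries, `"example_label"` is a made-up label, and the order in which entries from different subsets are combined is an assumption.

```python
def dedup_last_in(entries, feature_key="text"):
    """Keep one entry per feature value; later entries replace earlier ones."""
    latest = {}
    for entry in entries:
        feature = entry[feature_key]
        # drop any earlier entry so the newest one also moves to the end
        latest.pop(feature, None)
        latest[feature] = entry
    return list(latest.values())


entries = [
    {"text": "foo", "SUBSET": "raw", "label": "ABSTAIN"},
    {"text": "bar", "SUBSET": "raw", "label": "ABSTAIN"},
    {"text": "foo", "SUBSET": "train", "label": "example_label"},  # made-up label
]
deduped = dedup_last_in(entries)
```

Here the `"foo"` entry that was committed to `train` survives, and its unlabeled copy in `raw` is dropped.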