> Suppose we have custom functions for labeling or filtering data, much like the typical [`snorkel`](https://github.com/snorkel-team/snorkel) scenario.
>
> :speedboat: Let's see how these functions can be combined with `hover`.

-   <details open><summary>This page addresses **single components** of `hover`</summary>
    For illustration, we are using code snippets to pick out specific widgets so that the documentation can explain what they do.

    -   Please be aware that you won't need to get the widgets by code in an actual use case.
    -   Typical usage deals with [recipes](../../tutorial/t1-active-learning) where the individual parts have been tied together.

</details>

-   <details open><summary>Dependencies for {== local environments ==}</summary>
    When you run the code locally, you may need to install additional packages.

    To run the text embedding code on this page, you need:
```shell
    pip install spacy
    python -m spacy download en_core_web_md
```

    To use `snorkel` labeling functions, you need:
```shell
    pip install snorkel
```

    To render `bokeh` plots in Jupyter, you need:
```shell
    pip install jupyter_bokeh
```

    If you are using JupyterLab older than 3.0, use this instead ([reference](https://pypi.org/project/jupyter-bokeh/)):
```shell
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
```

</details>

## **Preparation**

As always, start with a ready-for-plot dataset:

In [1]:
from hover.core.dataset import SupervisableTextDataset
import pandas as pd

raw_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_raw.csv"
train_csv_path = "https://raw.githubusercontent.com/phurwicz/hover-gallery/main/0.5.0/20_newsgroups_train.csv"

# for a fast, low-memory demonstration, sample the data
df_raw = pd.read_csv(raw_csv_path).sample(400)
df_raw["SUBSET"] = "raw"
df_train = pd.read_csv(train_csv_path).sample(400)
df_train["SUBSET"] = "train"
df_dev = pd.read_csv(train_csv_path).sample(100)
df_dev["SUBSET"] = "dev"
df_test = pd.read_csv(train_csv_path).sample(100)
df_test["SUBSET"] = "test"

# build overall dataframe and ensure feature type
df = pd.concat([df_raw, df_train, df_dev, df_test])
df["text"] = df["text"].astype(str)

# this class stores the dataset throughout the labeling process
dataset = SupervisableTextDataset.from_pandas(df, feature_key="text", label_key="label")

<br>

In [2]:
import spacy
import re
from functools import lru_cache

# use your preferred embedding for the task
nlp = spacy.load("en_core_web_md")

# raw data (str in this case) -> np.array
@lru_cache(maxsize=int(1e+4))
def vectorizer(text):
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    return nlp(clean_text, disable=nlp.pipe_names).vector

# any kwargs will be passed on to the corresponding reduction
# for umap: https://umap-learn.readthedocs.io/en/latest/parameters.html
# for ivis: https://bering-ivis.readthedocs.io/en/latest/api.html
reducer = dataset.compute_nd_embedding(vectorizer, "umap", dimension=2)

Vectorizing: 100%|██████████| 959/959 [00:02<00:00, 337.86it/s]
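
The caching in the vectorizer above is keyed on the raw input string, before whitespace is cleaned. The sketch below illustrates that behavior with a toy function; `toy_vectorizer` and its tuple output are hypothetical stand-ins for a real embedding, chosen only so the example runs without `spacy`.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=int(1e4))
def toy_vectorizer(text):
    # collapse whitespace runs, as the vectorizer above does
    clean_text = re.sub(r"[\s]+", r" ", str(text))
    # hypothetical stand-in for an embedding: (character count, word count)
    return (len(clean_text), len(clean_text.split()))


toy_vectorizer("hello   world")  # first call: computed and cached
toy_vectorizer("hello   world")  # identical raw string: served from the cache
toy_vectorizer("hello world")    # different raw string: computed again

cache_stats = toy_vectorizer.cache_info()
print(cache_stats.hits, cache_stats.misses)
```

Two strings that differ only in whitespace produce the same cleaned text but occupy separate cache entries, so the cache mainly helps when the exact same text is vectorized more than once.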


<br>

## **Labeling Functions**

Labeling functions are functions that **take a `pd.DataFrame` row and return a label or abstain**.

Inside the function one can do many things, but let's start with simple keywords wrapped in regex:

-   <details open><summary>About the decorator @labeling_function</summary>
    ::: hover.utils.snorkel_helper.labeling_function
</details>

In [3]:
from hover.utils.snorkel_helper import labeling_function
from hover.module_config import ABSTAIN_DECODED as ABSTAIN
import re

@labeling_function(targets=["rec.autos"])
def auto_keywords(row):
    flag = re.search(
        r"(?i)(diesel|gasoline|automobile|vehicle|drive|driving)", row.text
    )
    return "rec.autos" if flag else ABSTAIN

@labeling_function(targets=["rec.sport.baseball"])
def baseball_keywords(row):
    flag = re.search(r"(?i)(baseball|stadium|\ bat\ |\ base\ )", row.text)
    return "rec.sport.baseball" if flag else ABSTAIN

@labeling_function(targets=["sci.crypt"])
def crypt_keywords(row):
    flag = re.search(r"(?i)(crypt|math|encode|decode|key)", row.text)
    return "sci.crypt" if flag else ABSTAIN

@labeling_function(targets=["talk.politics.guns"])
def guns_keywords(row):
    flag = re.search(r"(?i)(gun|rifle|ammunition|violence|shoot)", row.text)
    return "talk.politics.guns" if flag else ABSTAIN

@labeling_function(targets=["misc.forsale"])
def forsale_keywords(row):
    flag = re.search(r"(?i)(sale|deal|price|discount)", row.text)
    return "misc.forsale" if flag else ABSTAIN

LABELING_FUNCTIONS = [
    auto_keywords,
    baseball_keywords,
    crypt_keywords,
    guns_keywords,
    forsale_keywords,
]
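
Before wiring keyword patterns into `hover`, it can help to try the underlying logic on a few hand-written rows. The following is a minimal plain-Python sketch that does not use the `@labeling_function` decorator. `Row` is a hypothetical stand-in that mimics the `row.text` attribute access of a DataFrame row, and `ABSTAIN_TOY` is a placeholder string, not `hover`'s `ABSTAIN_DECODED`.

```python
import re
from collections import namedtuple

# hypothetical stand-ins, for illustration only
Row = namedtuple("Row", ["text"])
ABSTAIN_TOY = "ABSTAIN"


def auto_keywords_plain(row):
    # same pattern as auto_keywords, without the decorator
    flag = re.search(
        r"(?i)(diesel|gasoline|automobile|vehicle|drive|driving)", row.text
    )
    return "rec.autos" if flag else ABSTAIN_TOY


toy_rows = [
    Row("My diesel engine stalls in winter."),
    Row("The pitcher threw a curveball."),
    Row("Driving in snow is hard."),
]

toy_labels = [auto_keywords_plain(row) for row in toy_rows]
# fraction of rows where the function did not abstain
coverage = sum(label != ABSTAIN_TOY for label in toy_labels) / len(toy_rows)
print(toy_labels, coverage)
```

A quick check like this shows which rows a pattern fires on and how often it abstains, before any plot is involved.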

<br>

In [4]:
# we will come back to this block later on
# LABELING_FUNCTIONS.pop(-1)

<br>

### **Using a Function to Apply Labels**

Hover's `SnorkelExplorer` (`snorkel` for short) can take the labeling functions above and apply them to the areas of data that you choose. The widget below is responsible for labeling:

In [5]:
from bokeh.io import show, output_notebook

output_notebook()

# normally you would skip notebook_url or use your Jupyter address
notebook_url = 'localhost:8888'

from hover.recipes.subroutine import standard_snorkel

snorkel_plot = standard_snorkel(dataset)
snorkel_plot.subscribed_lf_list = LABELING_FUNCTIONS
show(snorkel_plot.lf_apply_trigger, notebook_url=notebook_url)

You are generating standalone HTML/JS output, but trying to use real Python
callbacks (i.e. with on_change or on_event). This combination cannot work.

Only JavaScript callbacks may be used with standalone output. For more
information on JavaScript callbacks with Bokeh, see:

    https://docs.bokeh.org/en/latest/docs/user_guide/interaction/callbacks.html

Alternatively, to use real Python callbacks, a Bokeh server application may
be used. For more information on building and running Bokeh applications, see:

    https://docs.bokeh.org/en/latest/docs/user_guide/server.html



<br>

### **Using a Function to Apply Filters**

Any function that labels is also a function that filters. The filter condition is `"keep if did not abstain"`. The widget below handles filtering:

In [6]:
show(snorkel_plot.lf_filter_trigger, notebook_url=notebook_url)


<br>

Unlike the toggled filters for `finder` and `softlabel`, filtering with functions works on a per-click basis. In other words, this particular filter doesn't persist when you select another area.
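
The "keep if did not abstain" condition can be sketched in plain Python as follows. This is only an illustration of the idea, not how the widget is implemented: `Row`, `ABSTAIN_TOY`, and `filter_by_function` are hypothetical names introduced for the example.

```python
import re
from collections import namedtuple

# hypothetical stand-ins, for illustration only
Row = namedtuple("Row", ["text"])
ABSTAIN_TOY = "ABSTAIN"


def forsale_keywords_plain(row):
    # same pattern as forsale_keywords, without the decorator
    flag = re.search(r"(?i)(sale|deal|price|discount)", row.text)
    return "misc.forsale" if flag else ABSTAIN_TOY


def filter_by_function(rows, func):
    # keep a row only if the function produced a label for it
    return [row for row in rows if func(row) != ABSTAIN_TOY]


selected = [
    Row("Bike for sale, good price."),
    Row("Who won the game last night?"),
    Row("Great discount on used books."),
]

kept = filter_by_function(selected, forsale_keywords_plain)
print([row.text for row in kept])
```

The helper holds no state between calls, which mirrors the per-click behavior: a new selection starts unfiltered until the function is applied to it again.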

## **Dynamic List of Functions**

Python lists are mutable, and we are going to take advantage of that for improvising and editing labeling functions on the fly.

Run the block below and open the resulting URL to launch a recipe.

-   Labeling functions are evaluated against the `dev` set.
    -   Hence you are advised to send the labels produced by these functions to the `train` set, not the `dev` set.
-   Come back and edit the list of labeling functions **in-place** in one of the code cells above.
    -   Then go to the launched app and refresh the functions!
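
Why the edit has to be **in-place** can be seen in a small plain-Python sketch. It assumes that the app keeps a reference to the list it was given, which is what the refresh workflow above suggests. `ToyRecipe` and the `lf_*` functions are hypothetical and are not part of `hover`.

```python
def lf_a(row):
    return "A"


def lf_b(row):
    return "B"


def lf_c(row):
    return "C"


class ToyRecipe:
    """Hypothetical stand-in for an app that holds on to the list it was given."""

    def __init__(self, lf_list):
        self.lf_list = lf_list  # same list object, not a copy

    def refresh(self):
        return [func.__name__ for func in self.lf_list]


functions = [lf_a, lf_b]
recipe = ToyRecipe(functions)

# in-place edits mutate the shared list, so a refresh picks them up
functions.pop(-1)
functions.append(lf_c)
after_inplace_edit = recipe.refresh()

# rebinding the name creates a new list that the recipe never sees
functions = [lf_b]
after_rebinding = recipe.refresh()

print(after_inplace_edit, after_rebinding)
```

Under that assumption, methods such as `append`, `pop`, `insert`, or slice assignment change what the app sees, while assigning a brand-new list to `LABELING_FUNCTIONS` does not.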

In [7]:
from hover.recipes.experimental import snorkel_crosscheck

interactive_plot = snorkel_crosscheck(dataset, LABELING_FUNCTIONS)

# ---------- NOTEBOOK MODE: for your actual Jupyter environment ---------
# this code will render the entire plot in Jupyter
# from bokeh.io import show, output_notebook
# output_notebook()
# show(interactive_plot, notebook_url='https://localhost:8888')

What's really cool is that in your local environment, this update-and-refresh operation can be done all in a notebook. So now you can

-   interactively evaluate and revise labeling functions
-   visually assign specific data regions to apply those functions

which can make labeling functions more accurate and more widely applicable.
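
As a rough illustration of what evaluating a labeling function against a labeled `dev` set can involve, the sketch below computes coverage (how often the function fires) and precision (how often it is right when it fires). It is not how `snorkel_crosscheck` computes anything. The rows, `Row`, `ABSTAIN_TOY`, and `evaluate` are all made up for the example.

```python
import re
from collections import namedtuple

# hypothetical stand-ins, for illustration only
Row = namedtuple("Row", ["text", "label"])
ABSTAIN_TOY = "ABSTAIN"


def crypt_keywords_plain(row):
    # same pattern as crypt_keywords, without the decorator
    flag = re.search(r"(?i)(crypt|math|encode|decode|key)", row.text)
    return "sci.crypt" if flag else ABSTAIN_TOY


dev_rows = [
    Row("How do I decode this cipher?", "sci.crypt"),
    Row("I lost my car key yesterday.", "rec.autos"),
    Row("Public-key cryptography basics.", "sci.crypt"),
    Row("The stadium was packed.", "rec.sport.baseball"),
]


def evaluate(func, rows):
    # pair each prediction with the known label, skipping abstentions
    fired = [
        (func(row), row.label) for row in rows if func(row) != ABSTAIN_TOY
    ]
    coverage = len(fired) / len(rows)
    precision = (
        sum(pred == gold for pred, gold in fired) / len(fired) if fired else None
    )
    return coverage, precision


coverage, precision = evaluate(crypt_keywords_plain, dev_rows)
print(coverage, precision)
```

Here the broad keyword `key` fires on a sentence about a car key, which is the kind of error that shows up in such a check and suggests tightening the pattern.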