# Interactive Labeling

In many applications, creating the label taxonomy for your problem is the hard part.
In this notebook we show some tricks that can help you with that.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import altair as alt

from datasets import load_dataset
import pandas as pd

from whatlies.language import UniversalSentenceLanguage, BytePairLanguage
from whatlies.transformers import Umap

from ipysheet.pandas_loader import from_dataframe, to_dataframe
from sklearn.metrics.pairwise import cosine_similarity

## Example Dataset

In [None]:
ds = load_dataset("bing_coronavirus_query_set", queries_by="state", start_date="2020-09-01", end_date="2020-09-30")

df = ds.data['train'].to_pandas()

In [None]:
us = (
    df
    .loc[lambda d: d['Country']=="United States"]
    .value_counts("Query")
    .reset_index(name="counts")  # name the count column explicitly; its default name differs across pandas versions
    .rename(columns={"Query": "query"})
)

us

## Trick 1: Align Regex Matches

Aligning the matches in one column makes it much easier to eyeball what a pattern actually matches.

In [None]:
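from rich.console import Console
from rich.markup import escape
import re

def print_matches_centered(texts, pattern, left=45, right=45, style="bold blue underline", max_lines=None):
    """Print texts that match `pattern`, aligned so every match starts in the same column."""
    console = Console(highlight=False)
    n_matches = 0

    max_lines = max_lines if max_lines else len(texts)

    # Shuffle so the printed lines are a random sample of the matches
    for text in pd.Series(texts).sample(frac=1):
        match = re.search(pattern, text)
        if not match:
            continue
        start, end = match.span()
        length = end - start
        # Left context: crop to `left` characters, or pad with spaces when it is shorter
        if start > left:
            prefix = text[(start - left):start]
        else:
            prefix = " " * (left - start) + text[:start]
        # Escape the query text so stray square brackets are not parsed as rich markup
        processed_text = (
            escape(prefix)
            + f"[{style}]" + escape(text[start:end]) + "[/]"
            + escape(text[end:(end + right - length)])
        )
        console.print(processed_text)

        n_matches += 1
        if n_matches >= max_lines:
            break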

In [None]:
print_matches_centered(us['query'],"mask",max_lines=20)
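
The alignment logic can also be checked without rich. The following is a minimal plain-text sketch of the same idea, not the function above: `center_on_match` and the toy `queries` list are hypothetical names introduced for illustration.

```python
import re

def center_on_match(text, pattern, left=20, right=20):
    """Return `text` cropped/padded so the match starts at column `left`, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    start = match.start()
    # Left context: keep at most `left` characters and pad with spaces if shorter.
    prefix = text[max(0, start - left):start].rjust(left)
    # The match plus its right context, cropped to `right` characters.
    suffix = text[start:start + right]
    return prefix + suffix

# Hypothetical toy queries; in the notebook this would be us['query'].
queries = ["face mask", "do masks work", "n95 mask near me", "quarantine rules"]
aligned = [center_on_match(q, "mask") for q in queries]
aligned = [line for line in aligned if line is not None]
for line in aligned:
    print(line)
```

Every printed line has its match starting at the same column, so differences in the surrounding context stand out when you scan down the output.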

## Load embeddings

We use the whatlies package as a convenience wrapper around our sentence embedders and dimensionality reducers.

In practice you want to try out different embeddings, different dimensionality reducers, and different hyperparameters for the latter.

Clustering in practice is like reading tea leaves: you need to stir the cup every now and then to see what new patterns emerge.

In [None]:
lang = BytePairLanguage("en")  # Use UniversalSentenceLanguage() for better results

embset = lang[[s for s in us['query']]]
embs = embset.to_X()

umapped = embset.transform(Umap(2)).to_X()  # Umap has kwargs that you can play with, or try PCA

us[['dim0','dim1']] = umapped
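
For a quick baseline to compare the UMAP layout against, a PCA projection can be computed with NumPy alone. This is a sketch under assumptions: `pca_2d` is a hypothetical helper, and `fake_embs` is random data standing in for the `embs` matrix above.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embs = rng.normal(size=(100, 25))  # hypothetical stand-in for `embs`
coords = pca_2d(fake_embs)
print(coords.shape)  # (100, 2)
```

Because PCA is linear and deterministic, it makes a useful sanity check: clusters that show up in both the PCA and the UMAP layout are less likely to be an artifact of one reducer's hyperparameters.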

## Plot

This plot is also available in the whatlies package, where it is called the brush_plot. We recreate it here so that we can edit it interactively more easily.

In [None]:
us['label'] = "Missing"  # initialize labels

In [None]:
x_axis='dim0'
y_axis='dim1'
x_label = "X"
y_label = "Y"
color="label"
tooltip=["query",'label','counts']
title="hello"
n_show=15

In [None]:
result = (
    alt.Chart(us)
    .mark_circle(size=60,opacity=.2)
    .encode(
        x=alt.X(x_axis, axis=alt.Axis(title=x_label)),
        y=alt.Y(y_axis, axis=alt.Axis(title=y_label)),
        tooltip=tooltip,
        color=alt.Color(":N", legend=None) if not color else alt.Color(color),
    )
    .properties(title=title)
)

brush = alt.selection(type="interval")
ranked_text = (
    alt.Chart(us)
    .mark_text()
    .encode(
        y=alt.Y("row_number:O", axis=None),
        color=alt.Color(":N", legend=None) if not color else alt.Color(color),
    )
    .transform_window(row_number="row_number()")
    .transform_filter(brush)
    .transform_window(rank="rank(row_number)")
    .transform_filter(alt.datum.rank < n_show)
)
text_plt = ranked_text.encode(text="query:N").properties(
    width=250, title="Text Selection"
)
result.add_selection(brush) | text_plt

## Assign labels

This is just an example.
Here we greedily assign labels: a row gets the label of the last pattern that matched it.
This isn't perfect. I'd prefer to assign each label to a separate column and then do some manual refinement afterwards.

In [None]:
# us['label'] = "Missing"  # Uncomment if you want to remove all previous labels

patterns = [
    "county",
    r"mask|shield|face\b|cover",  # raw string, so \b is a word boundary and not a backspace character
#     states_pattern,  # Defined below
    "quarantine",
    "football|nfl|ball"
]

for pat in patterns:
    us.loc[us['query'].str.contains(pat, regex=True), 'label'] = pat

# Share of rows per label, rounded to two decimals
us['label'].value_counts(normalize=True).round(2)
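
The alternative mentioned above, one column per label, could look roughly like this. It is a sketch on toy data: the `toy` frame, the `toy_patterns` names, and the `is_*` column names are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the `us` frame.
toy = pd.DataFrame(
    {"query": ["face mask", "county mask mandate", "nfl schedule", "covid symptoms"]}
)
toy_patterns = {
    "mask": r"mask|shield|face\b|cover",
    "county": r"county",
    "sports": r"football|nfl|ball",
}

# One boolean column per pattern, so overlapping matches are kept instead of overwritten.
for name, pat in toy_patterns.items():
    toy[f"is_{name}"] = toy["query"].str.contains(pat, regex=True)

flag_cols = [f"is_{name}" for name in toy_patterns]
toy["n_labels"] = toy[flag_cols].sum(axis=1)
print(toy)
```

Rows with `n_labels` greater than 1 are the ones that need a manual decision, and rows with `n_labels` equal to 0 show what the current patterns still miss.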

In [None]:
states = """Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
District of Columbia
Puerto Rico
Guam
American Samoa
U.S. Virgin Islands
Northern Mariana Islands
"""

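# re.escape keeps the dots in "U.S. Virgin Islands" literal instead of matching any character
states_pattern = "|".join(re.escape(s.lower()) for s in states.splitlines())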
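
A plain alternation of names also matches inside longer words. The sketch below shows one possible refinement with word boundaries; the short `state_names` subset and the names `naive_pattern` and `bounded_pattern` are hypothetical.

```python
import re

state_names = ["Indiana", "Maine", "U.S. Virgin Islands"]  # short hypothetical subset

naive_pattern = "|".join(s.lower() for s in state_names)
# Escape literal dots and require a word boundary on both sides of the name.
bounded_pattern = r"\b(?:" + "|".join(re.escape(s.lower()) for s in state_names) + r")\b"

query = "indianapolis colts schedule"
print(bool(re.search(naive_pattern, query)))    # True: "indiana" matches inside "indianapolis"
print(bool(re.search(bounded_pattern, query)))  # False
print(bool(re.search(bounded_pattern, "indiana covid cases")))  # True
```

Whether the stricter pattern is what you want depends on the data, so it is worth eyeballing the matches of both variants before settling on one.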

## Similarity Search

Sometimes you can't describe a rule as a regex pattern.
Here we show how you can write an example sentence, display the most similar sentences in an editable table, and then assign a label to the rows that you judge to be a good match.

Ideally you'd also add these as separate columns and then resolve which label(s) fit best.

In [None]:
ex = "face mask"

In [None]:
us['sims'] = cosine_similarity(lang[ex].vector.reshape((1,-1)),embs).reshape((-1,))
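
To make explicit what the similarity column contains, here is a small NumPy sketch of cosine similarity between one vector and every row of a matrix. `cosine_to_query` and the toy vectors are hypothetical; the cell above uses scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine_to_query(query_vec, matrix):
    """Cosine similarity between one vector and every row of `matrix`.

    Assumes neither the query nor any row is the all-zero vector.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    return (matrix @ query_vec) / norms

# Hypothetical 2-d "embeddings"; real sentence embeddings have many more dimensions.
toy_embs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
sims = cosine_to_query([1.0, 0.0], toy_embs)
top2 = np.argsort(-sims)[:2]  # indices of the two most similar rows
print(sims.round(3), top2)
```

A value of 1 means the row points in the same direction as the example, 0 means it is orthogonal, and -1 means it points the opposite way. Vector length does not matter.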

In [None]:
manual_label = (
    us
    .nlargest(50,columns="sims")
    .sort_values(by="dim0")
    .assign(relevant = "X")
    [["query","label","relevant"]]
)

In [None]:
sheet = from_dataframe(manual_label)

In [None]:
sheet

In [None]:
out = to_dataframe(sheet)

In [None]:
us.loc[
    out.query("relevant == 'X'").index.astype(int), "label"] = "manual_mask"

## Concluding remarks


I've come to believe that:

> the Labeling **IS** the Learning

Labeling is a great exercise for learning what is actually in your data.
For many business applications, creating the label taxonomy is actually the hard part.
That is where the tools discussed here are particularly useful.

Just don't forget to hand these crude labels over to someone (or yourself) who goes over them one by one to verify that they're correct.
A simple three-step process could be:

1. Bulk labeling with tools inspired by what's in this notebook
2. Manual verification of whether the labels are correct (prioritize checking observations where models disagree)
3. Manual correction of the mistakes identified in step 2, optionally updating the rule that created the label in step 1

Just because you need to label your data doesn't mean that you have to do it one by one.
This approach can be extended by zooming in on subsets of your data, e.g. those your current model predicts wrongly, or those where different models in your ensemble disagree.
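
One way to prioritize the manual check is to sort rows by how much the label sources disagree. The sketch below assumes each row has one vote per source (for example a regex rule, the similarity search, and a model); the `votes` data and the `disagreement` score are hypothetical.

```python
from collections import Counter

# Hypothetical votes: one label per source for each query.
votes = {
    "face mask": ["mask", "mask", "mask"],
    "county mask mandate": ["county", "mask", "mask"],
    "nfl covid protocol": ["sports", "Missing", "quarantine"],
}

def disagreement(labels):
    """Share of votes that differ from the most common label (0.0 means unanimous)."""
    top_count = Counter(labels).most_common(1)[0][1]
    return 1 - top_count / len(labels)

# Most contested rows first, so manual effort goes where the sources conflict.
review_queue = sorted(votes, key=lambda q: disagreement(votes[q]), reverse=True)
print(review_queue)
```

Unanimous rows end up at the bottom of the queue. They still deserve a spot check, since every source can be wrong in the same way.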