# Bulk Labelling as a Notebook

This notebook contains a convenient pattern to cluster and label new text data. The end goal is to discover intents that might be used in a virtual assistant setting. This is especially useful at an early stage and is part of the "iterate on your data" mindset.

## Dependencies 

You'll need to install a few things to get started. 

- [whatlies](https://rasahq.github.io/whatlies/)
- [human-learn](https://koaning.github.io/human-learn/)

You can install both tools by running this line in an empty cell:

```python
%pip install "whatlies[tfhub]" "human-learn"
```

We use `whatlies` to fetch embeddings and to handle the dimensionality reduction. We use `human-learn` for the interactive labelling interface. Feel free to check the documentation of both packages to learn more. 

## Let's go

To get started we'll first import a few tools.

In [1]:
import pathlib 
import numpy as np
from whatlies.language import CountVectorLanguage, UniversalSentenceLanguage, BytePairLanguage, SentenceTFMLanguage
from whatlies.transformers import Pca, Umap

Next we will load in some embedding frameworks. These can be very heavy, just so you know!

In [6]:
lang_cv  = CountVectorLanguage(10)
lang_use = UniversalSentenceLanguage()
lang_bp  = BytePairLanguage("en", dim=300, vs=200_000)

INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 160.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 320.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 480.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 640.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 770.00MB
INFO:absl:Downloading https://tfhub.dev/google/universal-sentence-encoder/4: 930.00MB
INFO:absl:Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.


Next we'll load in the texts that we'd like to embed/cluster. Feel free to provide another file here.

In [7]:
txt = pathlib.Path("nlu.md").read_text()
texts = list(set([t.replace(" - ", "") for t in txt.split("\n") if len(t) > 0 and t[0] != "#"]))
print(f"We're going to plot {len(texts)} texts.")

We're going to plot 1087 texts.


Keep in mind that it's better to start out with 1000 sentences or so. Much more might break the browser's memory in the next visual.
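The loading step and the size cap can be sketched together as one helper. This is a minimal sketch, not part of the original notebook: the name `load_texts`, the `max_texts` and `seed` parameters, and the assumption that every example is a markdown bullet starting with `- ` are all hypothetical.

```python
import random


def load_texts(raw, max_texts=1000, seed=42):
    """Parse a markdown-style file of examples into a deduplicated, size-capped list."""
    texts = set()
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and markdown headings such as "## intent:greet".
        if not line or line.startswith("#"):
            continue
        # Assumption: each example is a bullet like "- hello there".
        if line.startswith("- "):
            line = line[2:]
        texts.add(line)
    # Sort before sampling so the result is reproducible for a given seed.
    unique = sorted(texts)
    if len(unique) > max_texts:
        unique = random.Random(seed).sample(unique, max_texts)
    return unique


example = "## intent:greet\n- hello\n- hi there\n- hello\n\n## intent:bye\n- goodbye\n"
print(load_texts(example))
```

Under these assumptions you would call it as `texts = load_texts(pathlib.Path("nlu.md").read_text())`. Sampling with a fixed seed keeps the plots comparable between runs.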

## Showing Clusters 

![](pipeline.png)

The cell below will take the texts and pass them through different language backends. After this they will be mapped to a two-dimensional space by using [UMAP](https://umap-learn.readthedocs.io/en/latest/). It takes a while to plot everything (mainly because the universal sentence encoder and the transformer language models are heavy).

In [8]:
%%time

def make_plot(lang):
    return (lang[texts]
             .transform(Umap(2))
             .plot_interactive(annot=False)
             .properties(width=200, height=200, title=type(lang).__name__))

make_plot(lang_cv) | make_plot(lang_bp) | make_plot(lang_use)

CPU times: user 36.2 s, sys: 2.16 s, total: 38.4 s
Wall time: 17.5 s


What you see are three charts, one per language backend. You should notice that certain clusters have appeared. For your use case you might need to check which language backend makes the most sense.

## Note for Non-English 

The only model shown here that is English-specific is the universal sentence encoder (`lang_use`). All the other ones also support other languages. For more information check the [bytepair documentation](https://nlp.h-its.org/bpemb/) and the [sentence transformer documentation](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models).

## Towards Labelling 

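We'll now prepare a dataframe that we'll assign labels to. We'll do that by embedding the same texts with the universal sentence encoder, reducing them to two dimensions with UMAP, and collecting the result in a pandas dataframe with an empty `label` column.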

In [9]:
df = lang_use[texts].transform(Umap(2)).to_dataframe().reset_index()
df.columns = ['text', 'd1', 'd2']
df['label'] = ''
df.shape[0]

1087

We are now going to be labelling!

## Fancy interactive drawing!

We'll be using Vincent's infamous [human-learn library](https://koaning.github.io/human-learn/guide/drawing-features/custom-features.html) for this. First we'll need to instantiate some charts.

Next we get to draw! Drawing can be a bit tricky though, so pay attention. 

1. You'll want to double-click to start drawing. 
2. You can then click points together to form a polygon. 
3. Next you need to double-click to stop drawing. 

This allows you to draw polygons that can be used in the code below to fetch the examples that you're interested in.
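To build some intuition for how a drawn polygon can select points, here is a minimal sketch of a point-in-polygon test using ray casting. This is an illustration only and not necessarily how `human-learn` implements it; the function name and the example coordinates are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a horizontal ray from (x, y) crosses the polygon's edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Only crossings to the right of the point count.
            if x < x_cross:
                inside = not inside
    return inside


# A hypothetical rectangle drawn around a cluster, plus a few (d1, d2) points.
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
points = [(1.0, 1.0), (5.0, 1.0), (2.0, 2.5), (-1.0, 4.0)]
selected = [p for p in points if point_in_polygon(p[0], p[1], polygon)]
print(selected)  # [(1.0, 1.0), (2.0, 2.5)]
```

A point is inside when the ray crosses an odd number of edges, which is why the flag is toggled on every crossing.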

## Rerun

This is where we will start labelling. That also means that we might re-run this cell after we've added labels.

In [1]:
from hulearn.experimental.interactive import InteractiveCharts

charts = InteractiveCharts(df.loc[lambda d: d['label'] == ''], labels=['group'])

charts.add_chart(x='d1', y='d2')


We can now use this selection to retrieve a subset of rows. This is a quick verification to see if the points you selected indeed belong to the same cluster.

In [11]:
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())

df.pipe(tfm.pandas_pipe).loc[lambda d: d['group'] != 0].sample(10)

| | text | d1 | d2 | label | group |
|---|---|---|---|---|---|
| 144 | Find me a restaurant where I can eat. | 0.660732 | 10.245481 | | 1 |
| 817 | Do you find me a restaurant? | 0.134726 | 10.103104 | | 1 |
| 787 | Could you find a restaurant for me? | 0.234376 | 10.163591 | | 1 |
| 1066 | Can you find a restaurant for me? | 0.349077 | 10.148664 | | 1 |
| 979 | Show me how to find a restaurant | 0.786331 | 10.294456 | | 1 |
| 968 | Help me with finding this restaurant | 0.825996 | 10.560854 | | 1 |
| 198 | help me find restaurant | 0.872393 | 10.543851 | | 1 |
| 487 | I want to find some restauant nearby | 0.822904 | 10.086877 | | 1 |
| 1072 | I need to find this restaurant | 0.748952 | 10.569963 | | 1 |
| 794 | Would you find me a restaurant? | 0.180469 | 10.078395 | | 1 |


If you're confident that you'd like to assign a label, you can do so below. 

In [12]:
label_name = 'restaurants'

In [13]:
idx = df.pipe(tfm.pandas_pipe).loc[lambda d: d['group'] != 0].index

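# idx holds index labels, so select by label and by column name
# instead of relying on the position of the 'label' column.
df.loc[idx, 'label'] = label_name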

print(f"We just assigned {len(idx)} labels!")

We just assigned 55 labels!


That's it! You've just attached a label to a group of points! 

## Rerun 

You can now scroll up and label the clusters that aren't assigned yet. Once you're confident that this works, you can export by running the final code below.
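Before exporting, it can help to check how many rows still have an empty label. The sketch below works on a plain list of labels, such as what `df['label'].tolist()` would return; the helper name `label_progress` and the example labels are hypothetical.

```python
from collections import Counter


def label_progress(labels):
    """Count rows per label, treating the empty string as 'not labelled yet'."""
    counts = Counter(label if label else "(unlabelled)" for label in labels)
    total = len(labels)
    done = total - counts.get("(unlabelled)", 0)
    return counts, done, total


# Hypothetical stand-in for df['label'].tolist()
labels = ["restaurants", "", "restaurants", "", "", "greet"]
counts, done, total = label_progress(labels)
print(f"{done}/{total} rows labelled")
for name, n in counts.most_common():
    print(f"{name}: {n}")
```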

In [22]:
df.head()

| | text | d1 | d2 | label |
|---|---|---|---|---|
| 0 | What languages can you communicate in? | 2.727605 | -0.036231 | |
| 1 | what u can do? | 10.641682 | 5.367977 | |
| 2 | What exactly is my name? | 14.029011 | 1.462056 | |
| 3 | What does Rasa make? | 24.505444 | 6.782425 | |
| 4 | can you help me? | 8.584467 | 4.582135 | |


In [168]:
df.to_csv("first_order_labelled.csv")

## Final Notes

There are a few things to mention.

1. This method of labelling is great when you're working on version 0 of something. It'll get you a whole lot of data fast but it won't be high quality data. 
2. The use case for this method might be at the start of designing a virtual assistant. You've probably got data from social media that you'd like to use as a source of inspiration for intents. This is certainly a valid starting point but you should be aware that the language that folks use on a feedback form is different from the language used in a chatbox. Again, these labels are a reasonable starting point, but they should not be regarded as ground truth.
3. Labelling is only part of the goal here. Another big part is understanding the data. This is very much a qualitative/human task. You might be able to quickly label 1000 points in 5 minutes with this technique but you'll lack an understanding if you don't take the time for it. 