#  🧑‍💻 Annotating Data with Argilla 🧑‍💻

📅 _Data Science Summer School 2023, 22.08.2023_

👨‍🏫 By [Moritz Laurer](https://www.linkedin.com/in/moritz-laurer/).
For questions, reach out to: m.laurer@vu.nl

<a href="https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/6_annotation_interface_argilla.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Annotation interfaces for data centric AI

The notebook `data-quality-and-cleaning.ipynb` shows how we can automatically identify label issues with `CleanLab` in Python. However, for proper data annotation and cleaning, we need an annotation interface with more functionality.

There are several annotation interfaces for NLP use-cases, for example:
* [CleanLab Studio](https://cleanlab.ai/)
* [Galileo](https://www.rungalileo.io/)
* [LabelStudio](https://labelstud.io/)
* [Argilla](https://argilla.io/)

This notebook uses **Argilla**, because it is free, directly configurable via Python and fully open-source. I'm also an open-source contributor to Argilla (e.g. see my tutorial on Active Learning with Argilla [here](https://docs.argilla.io/en/latest/tutorials/notebooks/deploying-textclassification-colab-activelearning.html)).

Every interface provider has different advantages and disadvantages, so the best choice depends on your specific circumstances and I recommend comparing them for yourself.



## A brief overview of Argilla

Argilla is an annotation interface for NLP tasks. It supports text classification, token classification (e.g. NER, Named Entity Recognition), text generation tasks (e.g. summarization, translation) and some more specialised use-cases like RLHF (Reinforcement Learning from Human Feedback).

They have several demos hosted on [Argilla's Hugging Face Space](https://huggingface.co/argilla). Let's look at their demo "Argilla UI Demo Space". `username`: argilla, `password`: 12345678


This notebook is partly based on [Argilla's tutorial](https://docs.argilla.io/en/latest/tutorials/notebooks/monitoring-textclassification-cleanlab-explainability.html) on its CleanLab integration.

## Annotating data in your own interface




There are two main ways of [installing and running Argilla](https://docs.argilla.io/en/latest/getting_started/quickstart_installation.html): (1) locally on your own computer via Docker; (2) online on a Hugging Face Space. For this workshop in Google Colab, it's easier to run Argilla in a Hugging Face Space.

The only requirement for this is a (free) Hugging Face account. If you have an account, you can click on the button "Deploy on HF Spaces" below and set up your Space with a few clicks. Note: The free CPU hardware and free (ephemeral) storage are sufficient for testing. If you want to use this professionally, you probably need persistent storage, which currently costs $0.01/hour.

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

After creating our own Hugging Face Space, we can prepare the data for annotation and upload it to the HF Space.

In this notebook, we will use CleanLab again to identify potential label issues and we will then upload and manually correct the data.

### Install dependencies

In [None]:
#%pip install argilla datasets scikit-learn cleanlab -qqq
!pip install argilla~=1.14.0 -qqq
!pip install datasets~=2.14.0 -qqq
!pip install cleanlab~=2.4.0 -qqq
!pip install sentence-transformers~=2.2.2 -qqq


In [None]:
import argilla as rg
from argilla.labeling.text_classification import find_label_errors

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

### Prepare data

In [None]:
# load and prepare the data
dataset = load_dataset("dair-ai/emotion")["train"]

# you can also test another dataset for topic classification
#dataset = load_dataset("ag_news")
#dataset = dataset["train"].train_test_split(train_size=0.1)["train"]

print(dataset)


Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})


In [None]:
# encode texts with sbert model
sbert_embedder = SentenceTransformer("intfloat/e5-small-v2")

texts_embedded = sbert_embedder.encode(dataset["text"])



In [None]:
model = LogisticRegression(max_iter=400, random_state=42)

# get predicted probabilities for the whole dataset via cross validation
cv_probs = cross_val_predict(
    model,
    X=texts_embedded,
    y=dataset["label"],
    cv=5,
    method="predict_proba",
    n_jobs=-1,
)

cv_probs

array([[0.67178275, 0.17727986, 0.03677739, 0.04909162, 0.0465247 ,
        0.01854367],
       [0.19353656, 0.55332526, 0.07613966, 0.05944438, 0.09057712,
        0.02697702],
       [0.16019305, 0.06244397, 0.04829067, 0.68477152, 0.02200757,
        0.02229321],
       ...,
       [0.01565235, 0.93916582, 0.02440097, 0.00345771, 0.00918523,
        0.00813791],
       [0.24709814, 0.16544301, 0.0660439 , 0.4832586 , 0.01223457,
        0.02592178],
       [0.55738649, 0.11483681, 0.01276   , 0.11961905, 0.13981171,
        0.05558593]])
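Each row of `cv_probs` holds one probability per class, in the order of the dataset's label names. As a minimal sketch of how to read such a row, the snippet below picks the most probable label for the first three rows shown above (values rounded to three decimals). It uses plain Python lists so that it runs on its own; the label order is the one I assume for `dair-ai/emotion`, which the notebook reads from `dataset.features["label"].names` further down.

```python
from typing import Sequence, Tuple

# Assumed label order of dair-ai/emotion; in the notebook this comes from
# dataset.features["label"].names
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def top_prediction(probs: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable label and its probability for one row."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]


# First three rows of the cross-validated probabilities, rounded
rows = [
    [0.672, 0.177, 0.037, 0.049, 0.047, 0.019],
    [0.194, 0.553, 0.076, 0.059, 0.091, 0.027],
    [0.160, 0.062, 0.048, 0.685, 0.022, 0.022],
]

predictions = [top_prediction(row, LABELS) for row in rows]
for label, prob in predictions:
    print(f"{label}: {prob:.3f}")
```

On the real NumPy array, `cv_probs.argmax(axis=1)` gives the same class indices for all rows at once.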

In [None]:
# get classification labels
labels_text = dataset.features["label"].names
print(labels_text)

In [None]:
# create records for the train split
import argilla as rg

records = [
    rg.TextClassificationRecord(
        text=data["text"],
        prediction=list(zip(labels_text, prediction)),
        annotation=labels_text[data["label"]],
        metadata={"split": "train"},
        vectors={"sbert_vector": vector}
    )
    for data, prediction, vector in zip(dataset, cv_probs, texts_embedded.tolist())
]
print(len(records))
#records[:3]

16000


In [None]:
# get records with potential label errors
from argilla.labeling.text_classification import find_label_errors

records_with_label_error = find_label_errors(records)
print(len(records_with_label_error))
#records_with_label_error[:3]

2764
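To build some intuition for what a label-error search does, here is a deliberately simplified heuristic. It is not the algorithm behind `find_label_errors` (which relies on CleanLab) and only illustrates the general idea: flag a record when the model's top class disagrees with the given label, then rank flagged records by how little probability the model assigns to the given label. The function name and the toy numbers are made up for this sketch.

```python
from typing import List, Sequence


def rank_label_issue_candidates(
    probs: Sequence[Sequence[float]], given_labels: Sequence[int]
) -> List[int]:
    """Return indices of suspicious rows, most suspicious first.

    A row is flagged when the most probable class differs from the given label.
    Flagged rows are sorted by the probability of the given label (lowest first).
    """
    flagged = []
    for idx, (row, given) in enumerate(zip(probs, given_labels)):
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted != given:
            flagged.append((row[given], idx))
    return [idx for _, idx in sorted(flagged)]


# Toy out-of-sample probabilities for a 3-class problem (hypothetical values)
toy_probs = [
    [0.90, 0.05, 0.05],  # given 0, predicted 0: consistent
    [0.10, 0.80, 0.10],  # given 0, predicted 1: flagged, given-label prob 0.10
    [0.30, 0.30, 0.40],  # given 1, predicted 2: flagged, given-label prob 0.30
    [0.02, 0.03, 0.95],  # given 0, predicted 2: flagged, given-label prob 0.02
]
toy_labels = [0, 0, 1, 0]

candidates = rank_label_issue_candidates(toy_probs, toy_labels)
print(candidates)
```

The heuristic only works with out-of-sample probabilities, which is why the notebook uses `cross_val_predict` rather than predictions from a model fitted on the same rows.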


### Inspect and annotate data with Argilla

In [None]:
# Replace api_url with the URL of your own HF Space
# you find this URL by: (1) opening your HF Space, (2) clicking on the three dots in the top right, (3) clicking on "embed this space"
rg.init(
    api_url="https://moritzlaurer-argilla-workshop-demo.hf.space",
    api_key="admin.apikey", # Replace api_key if you configured a custom API key
    workspace="admin",
)
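If you share your notebook, you may not want to hard-code a custom API key. A minimal sketch, assuming you store the connection details in environment variables: the variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY` and `ARGILLA_WORKSPACE`, the helper function and the local default URL are choices made for this example, while the default key and workspace are the ones used in the cell above.

```python
import os
from typing import Dict


def argilla_connection_settings(
    default_url: str = "http://localhost:6900",  # assumed default for a local deployment
    default_key: str = "admin.apikey",
    default_workspace: str = "admin",
) -> Dict[str, str]:
    """Collect the arguments for rg.init from environment variables, with fallbacks."""
    return {
        "api_url": os.environ.get("ARGILLA_API_URL", default_url),
        "api_key": os.environ.get("ARGILLA_API_KEY", default_key),
        "workspace": os.environ.get("ARGILLA_WORKSPACE", default_workspace),
    }


settings = argilla_connection_settings()
print(settings["api_url"])

# In the notebook, where argilla is imported as rg:
# rg.init(**settings)
```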

In [None]:
# log data to the Argilla web app / HF space
dataset_name = "dataset_cleaning"

rg.log(records_with_label_error, name=dataset_name, workspace="admin")

Output()

BulkResponse(dataset='dataset_cleaning', processed=2764, failed=0)

By default, the records in the `records_with_label_error` list are ordered by their likelihood of containing a label error.
Each record also gets a metadata field called "label_error_candidate", which reflects its position in this list.
You can use this field in the *Argilla* web app to sort the records.
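The same ordering can be restored in Python, for example after loading records back in a different order. The sketch below uses a small stand-in class instead of real Argilla records so that it runs on its own, and it assumes that "label_error_candidate" stores an integer position where 0 is the most likely label error. `StandInRecord` and `sort_by_candidate_rank` are hypothetical names for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StandInRecord:
    """Hypothetical stand-in for an Argilla record (text and metadata only)."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def sort_by_candidate_rank(
    records: List[StandInRecord], key: str = "label_error_candidate"
) -> List[StandInRecord]:
    """Sort records so that the most likely label errors come first.

    Records without the metadata field are placed at the end.
    """
    return sorted(records, key=lambda r: r.metadata.get(key, float("inf")))


shuffled = [
    StandInRecord("text c", {"label_error_candidate": 2, "split": "train"}),
    StandInRecord("text a", {"label_error_candidate": 0, "split": "train"}),
    StandInRecord("no rank", {"split": "train"}),
    StandInRecord("text b", {"label_error_candidate": 1, "split": "train"}),
]

ordered = sort_by_candidate_rank(shuffled)
print([record.text for record in ordered])
```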

### Save your annotated dataset

After annotating and cleaning the data in the browser, you can then download it again and store it in any format you like.

In [None]:
records = rg.load(dataset_name)

dataset_cleaned = records.to_datasets()
print(dataset_cleaned)

# you can also upload the dataset to the HF hub, if you want
#dataset_cleaned.push_to_hub("<name of the dataset on the HF Hub>")

Dataset({
    features: ['text', 'inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'vectors', 'multi_label', 'explanation', 'id', 'metadata', 'status', 'event_timestamp', 'metrics'],
    num_rows: 2764
})
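The downloaded dataset contains all logged records, including those you did not review. A minimal sketch of how to keep only reviewed records and reduce them to text/label pairs is shown below. It works on plain dictionaries with the same column names as above (`text`, `annotation`, `status`) and assumes that reviewed records carry the status value "Validated"; please check the `status` column of your own download. The example rows are invented toy data.

```python
from typing import Any, Dict, List


def validated_examples(
    rows: List[Dict[str, Any]], validated_status: str = "Validated"
) -> List[Dict[str, str]]:
    """Keep rows with the given status and an annotation, as text/label pairs."""
    return [
        {"text": row["text"], "label": row["annotation"]}
        for row in rows
        if row["status"] == validated_status and row["annotation"] is not None
    ]


# Toy rows mimicking three columns of the downloaded dataset
rows = [
    {"text": "i feel great today", "annotation": "joy", "status": "Validated"},
    {"text": "i am not sure how i feel", "annotation": "sadness", "status": "Default"},
    {"text": "this makes me furious", "annotation": "anger", "status": "Validated"},
]

cleaned = validated_examples(rows)
print(len(cleaned))
```

On the actual `datasets.Dataset` object, a similar selection should be possible with `dataset_cleaned.filter(lambda row: row["status"] == "Validated")`.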


## Exercise and Questions

1. **Exercise:** (5 min)
    * Go to [Argilla's Hugging Face Space](https://huggingface.co/argilla) and click on their demo Space "Argilla UI Demo Space". `username`: argilla, `password`: 12345678
    * Choose any task & dataset you are interested in and annotate some data for around 5 minutes.

2. **Questions** (5 min)
    * Which types of label issues did you discover?
    * What was easy and what was difficult when annotating data yourself?
    * Reread the notebook.
    * **Post any questions in the chat**

3. **Break** (5 min)
