# 🏷️ Labelling a NER dataset for retraining with SpanMarker

In this example, we will show you how to integrate Argilla into a SpanMarker workflow: obtain a NER model, log its predictions to Argilla, manually mark predictions as correct or incorrect, and retrain your model. Human-in-the-loop!

SpanMarker is a framework for training Named Entity Recognition models using familiar encoders such as BERT and RoBERTa. It's built on top of 🤗 Transformers and is easy to use, and a pipeline that integrates SpanMarker with Argilla is a powerful combination. Let's take a look at how to do it.

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

### Install dependencies
Let's start by installing the required dependencies to run both Argilla and the remainder of this tutorial.

In [1]:
# %pip install "argilla" "span_marker" "datasets"

Let's now import the Argilla module for reading and writing data:

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the `URL` and `API_KEY`:

In [11]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="https://ignacioct-argilla.hf.space",
    api_key="owner.apikey",
    workspace="admin"
)

If you're running a private Hugging Face Space, you will also need to set the [HF_TOKEN](https://huggingface.co/settings/tokens) as follows:

In [12]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# # Replace workspace with the name of your workspace
# rg.init(
#     api_url="https://hf-space.hf.space", 
#     api_key="owner.apikey",
#     workspace="admin",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

## Loading the dataset

The dataset chosen is a version of the [CoNLL-2002 dataset](https://huggingface.co/datasets/tomaarsen/conll2002) that highlights four types of entities: persons, locations, organizations and names of miscellaneous entities. This is a tagged example instance:

```
[PER Wolff] , currently a journalist in [LOC Argentina] , played with [PER Del Bosque] in the final years of the seventies in [ORG Real Madrid] .
```

This version of the dataset contains instances in both Spanish and Dutch, about 35k in total. Let's load the Spanish subset and take a look.

In [13]:
from datasets import load_dataset
dataset = load_dataset("tomaarsen/conll2002", "es")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'document_id', 'sentence_id', 'tokens', 'pos_tags', 'ner_tags'],
        num_rows: 8323
    })
    validation: Dataset({
        features: ['id', 'document_id', 'sentence_id', 'tokens', 'pos_tags', 'ner_tags'],
        num_rows: 1915
    })
    test: Dataset({
        features: ['id', 'document_id', 'sentence_id', 'tokens', 'pos_tags', 'ner_tags'],
        num_rows: 1517
    })
})

In [14]:
dataset["test"][0]

{'id': '0',
 'document_id': 0,
 'sentence_id': 0,
 'tokens': ['La', 'Coruña', ',', '23', 'may', '(', 'EFECOM', ')', '.'],
 'pos_tags': [4, 28, 13, 59, 28, 21, 29, 22, 20],
 'ner_tags': [5, 6, 0, 0, 0, 0, 3, 0, 0]}

The dataset includes two kinds of token-level tags: PoS tags and NER tags. We will focus on the latter. The NER tags follow a BIO scheme: each token gets a B tag (the beginning of an entity), an I tag (a token inside an entity, after its first token), or an O tag (the token is outside any entity). This scheme is encoded in the `features` field of the dataset.

In [15]:
labels = dataset["train"].features["ner_tags"].feature.names
labels

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
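To make the BIO scheme concrete, here is a minimal sketch that groups the tags of the test example shown earlier back into entities. The helper `bio_to_spans` is a hypothetical name introduced for illustration; it is not part of `datasets` or SpanMarker.

```python
# Label names copied from the output above
labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def bio_to_spans(tokens, tag_ids, label_names):
    """Group BIO-tagged tokens into (entity_type, [tokens]) pairs."""
    spans = []
    current = None  # (entity_type, token_list) of the entity being built
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_names[tag_id]
        if tag.startswith("B-"):
            # A B tag always opens a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I tag continues the open entity of the same type
            current[1].append(token)
        else:
            # An O tag (or a stray I tag) closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans


tokens = ['La', 'Coruña', ',', '23', 'may', '(', 'EFECOM', ')', '.']
ner_tags = [5, 6, 0, 0, 0, 0, 3, 0, 0]
spans = bio_to_spans(tokens, ner_tags, labels)
print(spans)
```

Here `5, 6` decode to `B-LOC, I-LOC` (one location spanning two tokens) and `3` decodes to `B-ORG`.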

Let's create a dictionary that will come in handy later for translating lists of tag IDs into tag names.

In [16]:
tag_dict = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
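As an alternative to typing the mapping by hand, the same dictionary can be derived from the label list. This is a small sketch; the `labels` list is repeated here only so the snippet runs on its own.

```python
# Label names as returned by dataset["train"].features["ner_tags"].feature.names
labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# enumerate() pairs each label with its integer ID, starting at 0
tag_dict = dict(enumerate(labels))
print(tag_dict)
```

Deriving the mapping this way keeps it in sync with the dataset if the label set ever changes.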

## Loading a pretrained model for SpanMarker

To make the initial predictions, we need a pretrained model to use with SpanMarker. We will make our first predictions in a zero-shot fashion, with no prior training on our dataset. The model of choice, a [SpanMarker XLM-RoBERTa](https://huggingface.co/tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super) trained on the [FewNERD dataset](https://huggingface.co/datasets/DFKI-SLT/few-nerd), can be used directly on NER tasks, so we can go straight to the prediction phase. Let's load it and make our first predictions.

In [17]:
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super")

In [18]:
print(dataset['test'][0]["tokens"])
model.predict(dataset['test'][0]['tokens'])

['La', 'Coruña', ',', '23', 'may', '(', 'EFECOM', ')', '.']


[{'span': ['La', 'Coruña'],
  'label': 'location-GPE',
  'score': 0.929939329624176,
  'word_start_index': 0,
  'word_end_index': 2},
 {'span': ['EFECOM'],
  'label': 'organization-media/newspaper',
  'score': 0.7107717990875244,
  'word_start_index': 6,
  'word_end_index': 7}]

We've just made a zero-shot prediction, and it's quite accurate! Both entities are correctly classified.
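The model predicts fine-grained Few-NERD labels such as `location-GPE`, while the dataset uses four coarse CoNLL types, so checking a prediction against the gold tags requires a mapping between the two. Below is a rough sketch of such a check. The mapping `FEWNERD_TO_CONLL` and the helper `to_conll_label` are assumptions made for illustration (in particular, sending every other Few-NERD type to `MISC` is a simplification), and the predictions and gold spans are copied from the outputs above.

```python
# Hypothetical mapping from Few-NERD coarse types to CoNLL-2002 types
FEWNERD_TO_CONLL = {"person": "PER", "location": "LOC", "organization": "ORG"}


def to_conll_label(fewnerd_label):
    """Map a label like 'location-GPE' to a CoNLL type; unknown types become MISC."""
    coarse = fewnerd_label.split("-", 1)[0]
    return FEWNERD_TO_CONLL.get(coarse, "MISC")


# Predictions copied from the model output above (scores omitted)
predictions = [
    {"label": "location-GPE", "word_start_index": 0, "word_end_index": 2},
    {"label": "organization-media/newspaper", "word_start_index": 6, "word_end_index": 7},
]

# Gold entities read off the ner_tags of the same example: [5, 6, 0, 0, 0, 0, 3, 0, 0]
gold = {("LOC", 0, 2), ("ORG", 6, 7)}

predicted = {
    (to_conll_label(p["label"]), p["word_start_index"], p["word_end_index"])
    for p in predictions
}
print(predicted == gold)
```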

## Log predictions into Argilla

Now that we have seen how to make predictions with the model, let's go a step further and log some predictions into Argilla. We will work with a small subset of the training split to keep things fast.

We will use `TokenClassificationDataset`, as our `FeedbackDataset` does not support NER tasks yet. But stay tuned for future releases, as we are planning to include this functionality in `FeedbackDataset` as well! A `TokenClassificationDataset` is made of `TokenClassificationRecord` objects, which, among several parameters, allow us to log the raw text, the tokens of that text, our predictions, and some metadata with extra information. We will use the metadata to include the gold NER tags from the dataset, so we can use them as a hint in the annotation phase if we need to.

In [19]:
# Build records for the first 20 examples
records = []

for record in dataset["train"].select(range(20)):

    # Grouping up the raw text, the tokenized text and the predictions
    predictions = model.predict(record['tokens'])
    raw_text = " ".join(record["tokens"])
    tokenized_text = record['tokens']   # we assume the text is split by spaces

    # The predictions only contain the start and end word indexes, but we need
    # character indexes to build TokenClassificationRecords. To obtain them, we use
    # a quick solution that searches for the start and end characters of each word
    # and builds a list of (start, end) tuples
    word_indices = []
    current_index = 0
    for word in tokenized_text:
        start = raw_text.find(word, current_index)
        end = start + len(word)
        current_index = end
        word_indices.append((start, end))

    # Now we add these character indexes to the predictions so we can pass them to the record later.
    for p in predictions:
        p["start_char_index"] = word_indices[p['word_start_index']][0]
        p["end_char_index"] = word_indices[p['word_end_index']-1][1]

    # Building TokenClassificationRecord
    records.append(
        rg.TokenClassificationRecord(
            text=raw_text,
            tokens=tokenized_text,
            prediction=[(p["label"], p["start_char_index"], p["end_char_index"], p["score"]) for p in predictions],
            prediction_agent="tomaarsen/span-marker-xlm-roberta-base-fewnerd-fine-super",
            metadata={'dataset_gold_labels': [tag_dict[tag] for tag in record["ner_tags"]]}
        )
    )

# Log the records to Argilla
rg.log(records, name="conll2002_es", metadata={"split": "train"})

Output()

BulkResponse(dataset='conll2002_es', processed=20, failed=0)
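The word-to-character conversion in the snippet above can be checked in isolation. The sketch below is a variant that tracks offsets arithmetically instead of calling `str.find`, under the same assumption that the text is built with `" ".join(tokens)`. `token_char_offsets` is a hypothetical helper name.

```python
def token_char_offsets(tokens):
    """Return (start, end) character offsets of each token in " ".join(tokens)."""
    offsets = []
    position = 0
    for token in tokens:
        start = position
        end = start + len(token)
        offsets.append((start, end))
        position = end + 1  # skip the single space that joins tokens
    return offsets


tokens = ['La', 'Coruña', ',', '23', 'may', '(', 'EFECOM', ')', '.']
text = " ".join(tokens)
offsets = token_char_offsets(tokens)

# A prediction covering words [0, 2) maps to the characters of "La Coruña"
word_start_index, word_end_index = 0, 2
start_char = offsets[word_start_index][0]
end_char = offsets[word_end_index - 1][1]  # end index is exclusive, hence the -1
print(text[start_char:end_char])
```

Note that `word_end_index` is exclusive, which is why the end character is taken from the token at `word_end_index - 1`.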

After running the previous snippet, we should have our first 20 predictions loaded in Argilla, ready to be annotated. Let's take a moment to go through the records, check whether the predictions made by our zero-shot model are correct, and see if we can find more entities that the model missed. In each record, clicking the three-dot menu lets us see more information about that specific record. There we can access the gold labels that we passed as metadata, in case of doubt.

If you need help using the Argilla UI, you can go [here](https://docs.argilla.io/en/latest/reference/webapp/index.html) to find some documentation on the subject.

![title](https://i.imgur.com/my74hnL.png)
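Retraining needs token-level BIO tags again, whereas annotations on a `TokenClassificationRecord` are stored as `(label, start, end)` character spans. Here is a minimal sketch of the conversion back, assuming the text was built with `" ".join(tokens)` as above and that annotated spans align with token boundaries. `spans_to_bio` is a hypothetical helper, not part of Argilla or SpanMarker, and the example spans are made up to mirror the test sentence used earlier.

```python
def spans_to_bio(tokens, char_spans):
    """Convert (label, start_char, end_char) spans into one BIO tag per token."""
    # Character offsets of each token in " ".join(tokens)
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1  # account for the joining space

    tags = ["O"] * len(tokens)
    for label, span_start, span_end in char_spans:
        inside = False
        for i, (start, end) in enumerate(offsets):
            # A token belongs to the span if it lies fully within it
            if start >= span_start and end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


tokens = ['La', 'Coruña', ',', '23', 'may', '(', 'EFECOM', ')', '.']
annotation = [("LOC", 0, 9), ("ORG", 21, 27)]  # illustrative character spans
bio_tags = spans_to_bio(tokens, annotation)
print(bio_tags)
```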