
Incorrect tokenization in CoNLL 2003 export #15

Closed
david-waterworth opened this issue Nov 2, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@david-waterworth

I'm evaluating a number of different annotation interfaces and am currently looking at label-studio.

I noticed that when I exported as JSON and as CoNLL 2003 I got quite different results for the same string. The reason seems to be the tokenization. I'm trying to identify named entities in IoT device identification strings; the strings usually have delimiters like - or . in place of spaces (or none at all), e.g.

AHU-G1-V-1-1-Ctrl-Md

I tagged AHU-G1-V-1-1 and Ctrl-Md; when exported as JSON I get the correct tokens.

When I export as CoNLL 2003 it seems to apply some form of word tokenization, so it creates a single token: "AHU-G1-V-1-1-Ctrl-Md -X- _ B-Equip"

I'm not sure if this is a label-studio issue, but Prodigy, for example, applies tokenization before annotation to ensure that the start/end of the selection is aligned with a token.

@makseq
Member

makseq commented Nov 2, 2020

@david-waterworth can you provide correct and incorrect examples for one entity?

@david-waterworth
Author

david-waterworth commented Nov 2, 2020

@makseq

This is the "correct" output as JSON:

  {
    "completions": [
      {
        "created_at": 1604303915,
        "id": 32001,
        "lead_time": 8.224,
        "result": [
          {
            "from_name": "label",
            "id": "mqjil7Iq1m",
            "to_name": "text",
            "type": "labels",
            "value": {
              "end": 12,
              "labels": [
                "Equip"
              ],
              "start": 0,
              "text": "AHU-G1-V-1-1"
            }
          },
          {
            "from_name": "label",
            "id": "937NmbLsRD",
            "to_name": "text",
            "type": "labels",
            "value": {
              "end": 20,
              "labels": [
                "Point"
              ],
              "start": 13,
              "text": "Ctrl-Md"
            }
          }
        ]
      }
    ],
    "data": {
      "text": "AHU-G1-V-1-1-Ctrl-Md"
    },
    "id": 32
  }

This is the same example in CoNLL 2003

AHU-G1-V-1-1-Ctrl-Md -X- _ B-Equip

I'm not sure if CoNLL 2003 only applies to text which can be split into "words" by whitespace? I'm not sure how useful that is if you're using some sort of wordpiece tokenizer or, as in my case, custom tokenization.

i.e. "tokenization" might get tokenised into "token" and "ization</w>", so the model needs

token -X- _ B
ization</w> -X- _ I

(i.e. in general I'm not sure how you can map from character spans to BIO encoding without taking into account the tokenizer?)
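To illustrate the mapping being asked about: given tokens that carry their character offsets, character spans can be projected onto BIO tags. This is only a sketch with hypothetical helper names, not converter code; the token list is hand-written for the example.

```python
def spans_to_bio(tokens, spans):
    """Assign BIO tags to tokens given character-level entity spans.

    tokens: list of (token_text, start, end), end exclusive.
    spans:  list of (start, end, label) character spans, end exclusive.
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                # Token lies inside the span: B- for the first token, I- after.
                tag = ("B-" if tok_start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags

# Hypothetical tokens with offsets for "AHU-G1-V-1-1-Ctrl-Md".
tokens = [("AHU", 0, 3), ("-", 3, 4), ("G1", 4, 6), ("Ctrl-Md", 13, 20)]
spans = [(0, 12, "Equip"), (13, 20, "Point")]
print(spans_to_bio(tokens, spans))
# ['B-Equip', 'I-Equip', 'I-Equip', 'B-Point']
```

The key point is that the tags only come out right because the tokens' offsets are known; whitespace tokenization of the raw string never produces these boundaries.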

@niklub
Contributor

niklub commented Nov 2, 2020

Hi, @david-waterworth !

Indeed, currently only primitive tokenization is used (splitting into tokens on whitespace).
However, I think it could be extended to a custom tokenization scheme; take a look at https://github.com/heartexlabs/label-studio-converter/blob/master/label_studio_converter/utils.py#L20

Do you think that something like replacing text.split() -> text.split('-') would solve your problem?

@david-waterworth
Author

Hi @niklub

I think it would be close but not correct; the problem is that the '-' character is both a delimiter and an infix character in the tagged entities, and text.split('-') removes it. I think you may account for that, since you don't tag with 'O' if the next token is an 'I'? But my pre-tokeniser splits between characters based on whether the preceding and following characters are from the same class (alphabetic, numeric, or punctuation). Then I use a BPE tokeniser which may further split tokens (particularly longer strings).
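A rough sketch of that class-based pre-tokenisation (a hypothetical illustration of the idea described above, not the converter's code): split wherever two adjacent characters belong to different classes, keeping every character and its offsets.

```python
def char_class(c):
    """Coarse character class: alphabetic, digit, or other (punctuation etc.)."""
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "punct"

def pre_tokenize(text):
    """Split TEXT between adjacent characters of different classes,
    keeping every character (delimiters included) and its offsets."""
    tokens = []
    start = 0
    for i in range(1, len(text)):
        if char_class(text[i]) != char_class(text[i - 1]):
            tokens.append((text[start:i], start, i))
            start = i
    if text:
        tokens.append((text[start:], start, len(text)))
    return tokens

print(pre_tokenize("AHU-G1-V-1-1"))
# [('AHU', 0, 3), ('-', 3, 4), ('G', 4, 5), ('1', 5, 6), ...]
```

Because the delimiters survive as their own tokens with offsets, nothing is lost when mapping annotated character spans back onto tokens.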

I think my best option is to use the JSON export for now. Long term, being able to pass tokens as inputs to the annotation UI may help, i.e.

{
  "text": "AHU-G1-V-1-1-Ctrl-Md",
  "tokens": [{"text": "AHU", "start": 0, "end": 3}, ...]
}

@tomasohara

Hi, I noticed this when checking for conversion issues.

I actually worked on an XML format feature, which required tokenization that better matches the annotations. For this I used a combination of whitespace tokenization (via \S) and word-based tokenization (via \W):

import re

def word_and_whitespace_tokenize(text):
    """Return tokenization of TEXT around word boundaries and with whitespace separated.
    Note: this first splits with (\\W+) and then further splits the results with (\\S+)."""
    tokens = []
    for token in re.split(r"(\W+)", text):
        for sub_token in re.split(r"(\S+)", token):
            if sub_token:
                tokens.append(sub_token)
    return tokens
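As a quick check (the function is repeated here so the snippet runs standalone), the identifier from this thread tokenizes as:

```python
import re

def word_and_whitespace_tokenize(text):
    """Split on word boundaries (\\W+), then split the results on \\S+,
    keeping delimiters and whitespace as their own tokens."""
    tokens = []
    for token in re.split(r"(\W+)", text):
        for sub_token in re.split(r"(\S+)", token):
            if sub_token:
                tokens.append(sub_token)
    return tokens

print(word_and_whitespace_tokenize("AHU-G1-V-1-1-Ctrl-Md"))
# ['AHU', '-', 'G1', '-', 'V', '-', '1', '-', '1', '-', 'Ctrl', '-', 'Md']
```

Note that the '-' delimiters survive as their own tokens, so an entity span such as Ctrl-Md maps onto multiple tokens rather than disappearing into one whitespace-delimited blob.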

A proper solution would require using the annotated text to determine the tokenization (e.g., iterating through the text by the annotated spans).

The attached code shows how this fits in with the converter code (version 1.2), using a single file for convenience.
misc_converter.py.txt

I'll look into converting this into a pull request later this fall. It needs to be upgraded to the latest release, and it needs a way to handle overlapping annotations, which would otherwise lead to malformed XML.
