
Incorrect tokenization in CoNLL 2003 export #15

Closed
david-waterworth opened this issue Nov 2, 2020 · 5 comments
Labels
bug Something isn't working

Comments

@david-waterworth

I'm evaluating a number of different annotation interfaces and am currently looking at label-studio.

I noticed that when I exported as JSON and as CoNLL 2003 I got quite different results for the same string. The reason seems to be the tokenization. I'm trying to identify named entities in IoT device identification strings; the strings usually have delimiters like - or . in place of spaces (or none at all), e.g.

AHU-G1-V-1-1-Ctrl-Md

I tagged AHU-G1-V-1-1 and Ctrl-Md; when exported as JSON I get the correct tokens.

When I export as CoNLL 2003 it seems to apply some form of word tokenization, so it creates a single token: "AHU-G1-V-1-1-Ctrl-Md -X- _ B-Equip"

I'm not sure if this is a label-studio issue, but Prodigy, for example, applies tokenization before annotation to ensure that the start/end of the selection is aligned with a token.

@makseq
Member

makseq commented Nov 2, 2020

@david-waterworth can you provide correct and incorrect examples for one entity?

@david-waterworth
Author

david-waterworth commented Nov 2, 2020

@makseq

This is the "correct" output as JSON:

  {
    "completions": [
      {
        "created_at": 1604303915,
        "id": 32001,
        "lead_time": 8.224,
        "result": [
          {
            "from_name": "label",
            "id": "mqjil7Iq1m",
            "to_name": "text",
            "type": "labels",
            "value": {
              "end": 12,
              "labels": [
                "Equip"
              ],
              "start": 0,
              "text": "AHU-G1-V-1-1"
            }
          },
          {
            "from_name": "label",
            "id": "937NmbLsRD",
            "to_name": "text",
            "type": "labels",
            "value": {
              "end": 20,
              "labels": [
                "Point"
              ],
              "start": 13,
              "text": "Ctrl-Md"
            }
          }
        ]
      }
    ],
    "data": {
      "text": "AHU-G1-V-1-1-Ctrl-Md"
    },
    "id": 32
  }

This is the same example in CoNLL 2003

AHU-G1-V-1-1-Ctrl-Md -X- _ B-Equip

I'm not sure if CoNLL 2003 only applies to text which can be split into "words" by whitespace? I'm not sure how useful that is if you're using some sort of wordpiece tokenizer or, as in my case, custom tokenization.

i.e. "tokenization" might get tokenised into "token" and "ization</w>", so the model needs

token -X- _ B
ization</w> -X- _ I

(i.e. in general I'm not sure how you can map from character spans to BIO encoding without taking into account the tokenizer?)
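To illustrate the mapping being asked about: given tokens that carry their character offsets, character spans can be projected onto BIO tags. This is only a sketch with hypothetical helper names, not converter code; the token list is hand-written for the example.

```python
def spans_to_bio(tokens, spans):
    """Assign BIO tags to tokens given character-level entity spans.

    tokens: list of (token_text, start, end), end exclusive.
    spans:  list of (start, end, label) character spans, end exclusive.
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                # Token lies inside the span: B- for the first token, I- after.
                tag = ("B-" if tok_start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags

# Hypothetical tokens with offsets for "AHU-G1-V-1-1-Ctrl-Md".
tokens = [("AHU", 0, 3), ("-", 3, 4), ("G1", 4, 6), ("Ctrl-Md", 13, 20)]
spans = [(0, 12, "Equip"), (13, 20, "Point")]
print(spans_to_bio(tokens, spans))
# ['B-Equip', 'I-Equip', 'I-Equip', 'B-Point']
```

The key point is that the tags only come out right because the tokens' offsets are known; whitespace tokenization of the raw string never produces these boundaries.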

@niklub
Contributor

niklub commented Nov 2, 2020

Hi, @david-waterworth !

Indeed, currently only primitive tokenization is used (splitting into tokens on whitespace).
However, I think it could be extended to a custom tokenization scheme; take a look at https://github.com/heartexlabs/label-studio-converter/blob/master/label_studio_converter/utils.py#L20

Do you think that something like replacing text.split() -> text.split('-') would solve your problem?

@david-waterworth
Author

Hi @niklub

I think it would be close but not correct; the problem is that the '-' character is both a delimiter and an infix character in the tagged entities, and text.split('-') removes it. I think you may account for that, since you don't tag with 'O' if the next token is an 'I'? But my pre-tokeniser splits between characters based on whether the preceding and following characters are from the same class (alphabetic, numeric, or punctuation). Then I use a BPE tokeniser which may further split tokens (particularly longer strings).
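A rough sketch of that class-based pre-tokenisation (a hypothetical illustration of the idea described above, not the converter's code): split wherever two adjacent characters belong to different classes, keeping every character and its offsets.

```python
def char_class(c):
    """Coarse character class: alphabetic, digit, or other (punctuation etc.)."""
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return "punct"

def pre_tokenize(text):
    """Split TEXT between adjacent characters of different classes,
    keeping every character (delimiters included) and its offsets."""
    tokens = []
    start = 0
    for i in range(1, len(text)):
        if char_class(text[i]) != char_class(text[i - 1]):
            tokens.append((text[start:i], start, i))
            start = i
    if text:
        tokens.append((text[start:], start, len(text)))
    return tokens

print(pre_tokenize("AHU-G1-V-1-1"))
# [('AHU', 0, 3), ('-', 3, 4), ('G', 4, 5), ('1', 5, 6), ...]
```

Because the delimiters survive as their own tokens with offsets, nothing is lost when mapping annotated character spans back onto tokens.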

I think my best option is to use the JSON export for now. Long term, being able to pass tokens as inputs to the annotation UI may help, i.e.

{
  "text": "AHU-G1-V-1-1-Ctrl-Md",
  "tokens": [{"text": "AHU", "start": 0, "end": 3}, ...]
}

@tomasohara

Hi, I noticed this when checking for conversion issues.

I actually worked on an XML format feature, which required tokenization that better matches the annotations. For this I used a combination of whitespace tokenization (via \S) and word-based tokenization (via \W):

import re

def word_and_whitespace_tokenize(text):
    """Return tokenization of TEXT around word boundaries and with whitespace separated.
    Note: this first splits with (\\W+) and then further splits the results with (\\S+)."""
    tokens = []
    for token in re.split(r"(\W+)", text):
        for sub_token in re.split(r"(\S+)", token):
            if sub_token:
                tokens.append(sub_token)
    return tokens
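As a quick check (the function is repeated here so the snippet runs standalone), the identifier from this thread tokenizes as:

```python
import re

def word_and_whitespace_tokenize(text):
    """Split on word boundaries (\\W+), then split the results on \\S+,
    keeping delimiters and whitespace as their own tokens."""
    tokens = []
    for token in re.split(r"(\W+)", text):
        for sub_token in re.split(r"(\S+)", token):
            if sub_token:
                tokens.append(sub_token)
    return tokens

print(word_and_whitespace_tokenize("AHU-G1-V-1-1-Ctrl-Md"))
# ['AHU', '-', 'G1', '-', 'V', '-', '1', '-', '1', '-', 'Ctrl', '-', 'Md']
```

Note that the '-' delimiters survive as their own tokens, so an entity span such as Ctrl-Md maps onto multiple tokens rather than disappearing into one whitespace-delimited blob.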

A proper solution would require using the annotated text to determine the tokenization (e.g., iterating through the text by the annotated spans).

The attached code shows how this fits in with the converter code (version 1.2), using a single file for convenience.
misc_converter.py.txt

I'll look into converting this into a pull request later this fall. It needs to be upgraded to the latest release, and it needs a way to handle overlapping annotations, which would otherwise lead to malformed XML.
