Incorrect tokenization in CoNLL 2003 export #15
Comments
@david-waterworth can you provide correct and incorrect examples for one entity?
This is the "correct" output as json
This is the same example in CoNLL 2003
I'm not sure if CoNLL 2003 only applies to text that can be split into "words" by whitespace? I'm not sure how useful that is if you're using some sort of wordpiece tokenizer, or, as in my case, custom tokenization. E.g. "tokenization" might get tokenised into "token" "ization<\w>", so the model needs token -X- _ B (i.e. in general I'm not sure how you can map from character spans to a BIO encoding without taking the tokenizer into account?)
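The span-to-BIO mapping asked about here can be sketched as follows. This is a hypothetical helper, not anything from label-studio; it assumes the tokenizer reports each token's character offsets, which is exactly the information the comment says is needed:

```python
def spans_to_bio(tokens, spans):
    """Map character-level entity spans onto tokenizer output.

    tokens: list of (text, start, end) character offsets from the tokenizer
    spans:  list of (start, end, label) annotated entity spans
    """
    tags = []
    for text, t_start, t_end in tokens:
        tag = "O"
        for s_start, s_end, label in spans:
            # token lies entirely inside the span -> B- at the span start, I- after
            if t_start >= s_start and t_end <= s_end:
                tag = ("B-" if t_start == s_start else "I-") + label
                break
        tags.append((text, tag))
    return tags

tokens = [("AHU", 0, 3), ("-", 3, 4), ("G1", 4, 6)]
spans = [(0, 6, "Equip")]
spans_to_bio(tokens, spans)
# -> [('AHU', 'B-Equip'), ('-', 'I-Equip'), ('G1', 'I-Equip')]
```

A token that only partially overlaps a span gets "O" here; a real converter would need a policy for that case.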
Hi, @david-waterworth! Indeed, currently only primitive tokenization is used (splitting into tokens on whitespace). Do you think that something like replacing ...
Hi @niklub, I think it would be close but not correct; the problem is that the '-' character is both a delimiter and an infix character in the tagged entities, and text.split('-') removes it. You might be able to account for that, since you don't tag with 'O' if the next token is an 'I'? But my pre-tokeniser splits between characters based on whether the preceding and following characters are from the same class (alphabetic, numeric, or punctuation). Then I use a BPE tokeniser, which may split further (particularly longer strings) to avoid out-of-vocabulary tokens. I think my best option is to use the JSON export for now. Long term, being able to pass tokens as inputs to the annotation UI may help, i.e. {
Hi, I noticed this when checking for conversion issues. I actually worked on an XML export feature, which required tokenization that better matches the annotations. For this I used a combination of whitespace tokenization (via \S) and word-based tokenization (via \W):
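A combined \S / \W tokenizer along those lines could be sketched like this; it is an illustrative reconstruction of the described approach, not the attached converter code:

```python
import re

def tokenize(text):
    """Whitespace-split first (\\S+ runs), then break each run at
    non-word characters (\\W), keeping character offsets."""
    tokens = []
    for run in re.finditer(r"\S+", text):
        for sub in re.finditer(r"\w+|\W", run.group()):
            tokens.append((sub.group(), run.start() + sub.start()))
    return tokens

tokenize("AHU-G1 Ctrl")
# -> [('AHU', 0), ('-', 3), ('G1', 4), ('Ctrl', 7)]
```

Because offsets are preserved, tokens can be matched back against the annotated character spans.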
A proper solution would require using the annotated text to determine tokenization (e.g., iterating through the text by the annotated spans). The attached code shows how this fits in with the converter code (version 1.2), using a single file for convenience. I'll look into turning this into a pull request later this fall. It needs to be upgraded to the latest release, plus a way to handle overlapping annotations, which would otherwise lead to malformed XML.
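Iterating through the text by the annotated spans could be sketched as below. This is a hypothetical helper (not the attached code): it simply cuts the text at every span boundary, so no token can straddle an annotation:

```python
def split_on_spans(text, spans):
    """Cut text at every annotated span boundary.

    spans: list of (start, end) character offsets, assumed non-overlapping.
    Returns (piece, start_offset) pairs; whitespace-only pieces are dropped.
    """
    cuts = sorted({0, len(text)} | {edge for span in spans for edge in span})
    return [(text[a:b], a) for a, b in zip(cuts, cuts[1:]) if text[a:b].strip()]

split_on_spans("AHU-G1-V-1-1-Ctrl-Md", [(0, 12), (13, 20)])
# -> [('AHU-G1-V-1-1', 0), ('-', 12), ('Ctrl-Md', 13)]
```

Overlapping spans would produce boundaries inside other spans, which is exactly the malformed-XML hazard mentioned above.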
I'm evaluating a number of different annotation interfaces and am currently looking at label-studio.
I noticed that when I exported as JSON and as CoNLL 2003 I got quite different results for the same string. The reason seems to be the tokenization. I'm trying to identify named entities in IoT device identification strings; the strings usually have delimiters like - or . in place of spaces (or no delimiter at all), e.g.
AHU-G1-V-1-1-Ctrl-Md
I tagged AHU-G1-V-1-1 and Ctrl-Md; when exported as JSON I get the correct tokens.
When I export as CoNLL 2003 it seems to apply some form of word tokenization, so it creates a single token: "AHU-G1-V-1-1-Ctrl-Md -X- _ B-Equip"
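That behaviour is consistent with a plain whitespace split (assumed here; the exporter's actual code is not shown in the thread):

```python
text = "AHU-G1-V-1-1-Ctrl-Md"
tokens = text.split()  # whitespace tokenization: no whitespace in the identifier,
                       # so the whole string survives as a single token
print(tokens)  # ['AHU-G1-V-1-1-Ctrl-Md']
```

With only one token, the two tagged entities (AHU-G1-V-1-1 and Ctrl-Md) collapse into a single B- label.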
I'm not sure if this is a label-studio issue, but Prodigy, for example, applies tokenization before annotation to ensure that the start/end of a selection is aligned with a token boundary.
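That kind of pre-alignment can be checked with a small validator like the one below. This is an illustrative sketch (not Prodigy's or label-studio's API): it flags spans whose edges do not coincide with any token boundary.

```python
def spans_aligned(tokens, spans):
    """Return True if every span starts and ends on a token boundary.

    tokens: list of (text, start, end) character offsets
    spans:  list of (start, end) annotated character offsets
    """
    boundaries = set()
    for _, start, end in tokens:
        boundaries.update((start, end))
    return all(s in boundaries and e in boundaries for s, e in spans)

# with whitespace tokenization the entity span (0, 12) falls mid-token:
spans_aligned([("AHU-G1-V-1-1-Ctrl-Md", 0, 20)], [(0, 12)])   # -> False
# with delimiter-aware tokens the same span aligns:
spans_aligned([("AHU-G1-V-1-1", 0, 12), ("-", 12, 13),
               ("Ctrl-Md", 13, 20)], [(0, 12), (13, 20)])     # -> True
```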