Invalid CoNLL string when model output doesn't provide lemma #29
Comments
Thank you for your question. I think this is a reasonable request. Can you do a PR?
With pleasure! Can you please draw my attention to any test suites you'd want me to run on my end before the PR?
If you could add a test that ensures that your problem does not occur anymore (i.e. when
rominf added a commit to rominf/spacy_conll that referenced this issue on May 18, 2024
rominf added a commit to rominf/spacy_conll that referenced this issue on May 18, 2024
The issue is fixed by 22a14f7.
When a model fails to lemmatize a token and the `lemma` attribute of the spaCy token is unassigned, `spacy_conll` leaves that column blank in the CoNLL string. As the `spacy_conll` error message says when it attempts to convert the CoNLL string back into a spaCy document, an empty column is invalid in a CoNLL document. To avoid generating an invalid CoNLL string, I think we want to add a condition to the formatter to make sure the lemma attribute, and therefore the column's value, is never None. For example:
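A minimal sketch of such a condition, not the actual `spacy_conll` formatter code. The names `conll_field` and `format_token_line` are hypothetical helpers introduced only for illustration. The sketch assumes the CoNLL-U convention that an underscore (`_`) marks an unspecified value, and it treats both `None` and an empty string as "missing":

```python
from typing import Optional

EMPTY_FIELD = "_"  # CoNLL-U placeholder for an unspecified value


def conll_field(value: Optional[str]) -> str:
    """Return a value that is safe to write into a CoNLL-U column.

    Hypothetical helper: maps None, "" and whitespace-only strings to "_".
    """
    if value is None or not value.strip():
        return EMPTY_FIELD
    return value


def format_token_line(idx: int, form: str, lemma: Optional[str]) -> str:
    """Build one 10-column token line (hypothetical, simplified).

    Only ID, FORM and LEMMA are filled in; the remaining seven
    columns are set to "_" to keep the sketch short.
    """
    columns = [str(idx), conll_field(form), conll_field(lemma)]
    columns += [EMPTY_FIELD] * 7
    return "\t".join(columns)


if __name__ == "__main__":
    # A token whose lemma was never assigned still yields a full line.
    print(format_token_line(1, "VillaJakeF1", None))
```

With a guard like this applied to every column, a token without a lemma would produce `_` in the LEMMA column instead of an empty field.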
The model that failed to lemmatize a token and left it blank was `spacy-stanza`, English. Here is a text I encountered that contains a token stanza didn't lemmatize:

"VillaJakeF1 There are already a number of tools that can detect it. I’ve been using chatGPT a bit recently to get coding snippets, and I have to say a lot of it is either incomplete or incorrect. I wouldn’t want to rely on it for something as important as a thesis. But it is early days still"

I ran into this problem when, having produced a spaCy document with a `spacy-stanza` pipeline that included the CoNLL formatter, I gave `doc._.conll_str` to the CoNLL parser to turn the string back into a spaCy document. Because the formatter had rendered invalid CoNLL due to the missing lemma, the CoNLL parser failed.
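To see where a CoNLL string goes wrong before handing it to a parser, a small check along these lines may help. It is only a sketch: `find_empty_columns` is a hypothetical function, not part of `spacy_conll`, and the sample token line below is made up to mimic a missing lemma rather than copied from real model output.

```python
from typing import List, Tuple


def find_empty_columns(conll_str: str) -> List[Tuple[int, int]]:
    """Return (line_number, column_number) pairs, both 1-based,
    for every empty tab-separated field in a CoNLL string.

    Blank lines (sentence separators) and '#' comment lines are skipped.
    """
    problems = []
    for line_no, line in enumerate(conll_str.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        for col_no, field in enumerate(line.split("\t"), start=1):
            if field == "":
                problems.append((line_no, col_no))
    return problems


if __name__ == "__main__":
    # Illustrative line with an empty LEMMA column (third field).
    bad = "1\tVillaJakeF1\t\tPROPN\t_\t_\t0\troot\t_\t_"
    print(find_empty_columns(bad))
```

Running such a check on `doc._.conll_str` would point at the token and column that the formatter left blank.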