Invalid CoNLL string when model output doesn't provide lemma #29

Open
kat-kel opened this issue Sep 10, 2023 · 3 comments · May be fixed by #33

Comments


kat-kel commented Sep 10, 2023

When a model fails to lemmatize a token and the lemma attribute of the spaCy token is left unassigned, spacy_conll leaves that column blank in the CoNLL string. As the spacy_conll error message says when it tries to convert the CoNLL string back into a spaCy document, an empty column is invalid in a CoNLL document. To avoid generating an invalid CoNLL string, I think we should add a condition to the formatter so that the lemma attribute, and therefore the column's value, is never None or empty.

For example:

        token_conll = (
            token_idx,
            token.text,
            token.lemma_ if token.lemma_ else "_",
            token.pos_,
            token.tag_,
            str(token.morph) if token.has_morph and str(token.morph) else "_",
            head_idx,
            token.dep_,
            token._.conll_deps_graphs_field,
            token._.conll_misc_field,
        )

The model that failed to lemmatize a token and left it blank was the English spacy-stanza model. Here is a text I encountered that contains a token stanza didn't lemmatize: "VillaJakeF1 There are already a number of tools that can detect it. I’ve been using chatGPT a bit recently to get coding snippets, and I have to say a lot of it is either incomplete or incorrect. I wouldn’t want to rely on it for something as important as a thesis. But it is early days still"

I ran into this problem after producing a spaCy document with a spacy-stanza pipeline that included the CoNLL formatter. I passed doc._.conll_str to the CoNLL parser to turn the string back into a spaCy document, but because the missing lemma had made the formatter render invalid CoNLL, the parser failed.
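To make the failure mode concrete, here is a minimal sketch in plain Python with no spaCy dependency. The helper name `find_empty_columns` and the token values are hypothetical and not taken from spacy_conll or from the text above. The sketch only assumes that a CoNLL-U token line is a row of tab-separated fields in which an unspecified value is written as an underscore.

```python
def find_empty_columns(conll_line: str) -> list[int]:
    """Return the 1-based indices of blank columns in a tab-separated CoNLL line."""
    fields = conll_line.rstrip("\n").split("\t")
    return [i for i, field in enumerate(fields, start=1) if not field.strip()]


# Made-up token line whose lemma column (column 3) was left blank.
broken = "\t".join(["1", "example", "", "NOUN", "NN", "_", "0", "ROOT", "_", "_"])
# The same line with the underscore placeholder in the lemma column.
fixed = "\t".join(["1", "example", "_", "NOUN", "NN", "_", "0", "ROOT", "_", "_"])

print(find_empty_columns(broken))  # [3]
print(find_empty_columns(fixed))   # []
```

A check like this could run on doc._.conll_str before handing it to a parser, so that the blank column is reported by position rather than surfacing as a parse error.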

@BramVanroy
Owner

Thank you for your question. I think this is a reasonable request. Can you do a PR?

@kat-kel
Author

kat-kel commented Sep 11, 2023

With pleasure! Can you please draw my attention to any test suites you'd want me to run on my end before the PR?

@BramVanroy
Owner

If you could add a test that ensures that your problem does not occur anymore (i.e. when lemma_ is None, no error is raised), then that'd be great!
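One possible shape for such a test, as a rough sketch only: it does not use the project's real test fixtures or the actual formatter. `format_token_line` is a hypothetical, simplified stand-in that applies the proposed lemma fallback to a reduced set of columns, and the token is a `SimpleNamespace` with an empty `lemma_` rather than a real spaCy token.

```python
from types import SimpleNamespace


def format_token_line(token_idx: int, token, head_idx: int) -> str:
    # Hypothetical, simplified stand-in for the formatter: seven columns
    # only, not the real ten-column spacy_conll output.
    columns = (
        token_idx,
        token.text,
        token.lemma_ if token.lemma_ else "_",  # the fallback under test
        token.pos_,
        token.tag_,
        head_idx,
        token.dep_,
    )
    return "\t".join(str(col) for col in columns)


def test_missing_lemma_becomes_underscore():
    # Stand-in for a token the model failed to lemmatize (empty lemma_).
    token = SimpleNamespace(text="example", lemma_="", pos_="NOUN", tag_="NN", dep_="ROOT")
    fields = format_token_line(1, token, 0).split("\t")
    assert fields[2] == "_"  # lemma column holds the placeholder
    assert all(fields)       # no blank columns remain


test_missing_lemma_becomes_underscore()
```

In the real suite, the stand-in would presumably be replaced by a document produced by a pipeline with the CoNLL formatter, with doc._.conll_str passed back through the CoNLL parser to confirm that no error is raised.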

rominf added a commit to rominf/spacy_conll that referenced this issue May 18, 2024
@rominf rominf linked a pull request May 18, 2024 that will close this issue