French model is confused by "tu" #2251

Closed
randomstuff opened this issue Apr 23, 2018 · 6 comments
Labels
feat / lemmatizer Feature: Rule-based and lookup lemmatization feat / tagger Feature: Part-of-speech tagger lang / fr French language data and models models Issues related to the statistical models perf / accuracy Performance: accuracy

Comments

@randomstuff

spaCy (2.0.11) seems to be confused by the word "tu" in French:

import spacy
from IPython.display import display, HTML
from spacy import displacy
from jinja2 import Environment, DictLoader, select_autoescape

def show(doc):
    return display(HTML(displacy.render(doc, style='dep')))

loader = DictLoader({
    "words.html": """
        <table>
            <thead>
                <tr>
                    <th>Texte</th>
                    <th>Tag</th>
                    <th>Lemma</th>
                </tr>
            </thead>
            <tbody>
                {% for word in words %}
                <tr>
                    <td>{{ word.text   | escape }}</td>
                    <td>{{ word.tag_   | escape }}</td>
                    <td>{{ word.lemma_ | escape }}</td>
                </tr>
                {% endfor %}
            </tbody>
        </table>
    """
})

env = Environment(
    loader=loader,
    autoescape=select_autoescape(['html', 'xml']))

def word_analyze(doc): 
    html = env.get_template('words.html').render(words=doc)
    return HTML(html)

nlp = spacy.load("fr_core_news_md")
Texte Tag Lemma
Je PRON__Number=Sing|Person=1 Je
vais VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin aller
bien ADV___ bien
. PUNCT___ .
Texte Tag Lemma
Tu AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin Tu
vas AUX__Tense=Past|VerbForm=Part aller
bien ADV___ bien
. PUNCT___ .
Texte Tag Lemma
Comment ADV__PronType=Int Comment
vas VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin aller
- PUNCT___ -
tu AUX__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part taire
? PUNCT___ ?

The correct tag for "tu" should be "PRON__Number=Sing|Person=2" in both cases, and the lemma should be "tu"/"Tu".
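
To make the gap between a predicted and an expected tag easier to read, here is a minimal sketch that splits a tag into its part of speech and its morphological features and reports what differs. It assumes the `POS__Feat=Val|Feat=Val` layout seen in the tables above, with `_` meaning "no features"; `parse_tag` and `tag_diff` are hypothetical helper names, not part of spaCy.

```python
def parse_tag(tag):
    """Split a tag such as 'PRON__Number=Sing|Person=2' into (pos, features).

    Assumption: the tag follows the 'POS__Feat=Val|Feat=Val' layout shown in
    the issue, where a bare '_' after the separator means 'no features'.
    """
    pos, _, morph = tag.partition("__")
    features = {}
    for pair in morph.split("|"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            features[key] = value
    return pos, features


def tag_diff(predicted, expected):
    """Return {name: (predicted_value, expected_value)} for every difference."""
    p_pos, p_feats = parse_tag(predicted)
    e_pos, e_feats = parse_tag(expected)
    diff = {}
    if p_pos != e_pos:
        diff["POS"] = (p_pos, e_pos)
    for key in sorted(set(p_feats) | set(e_feats)):
        if p_feats.get(key) != e_feats.get(key):
            diff[key] = (p_feats.get(key), e_feats.get(key))
    return diff


# Tag reported above for "Tu" in "Tu vas bien." versus the expected tag.
print(tag_diff(
    "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    "PRON__Number=Sing|Person=2",
))
```

For this pair the part of speech, `Mood`, `Person`, `Tense` and `VerbForm` all differ; only `Number=Sing` agrees.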

@ines ines added performance models Issues related to the statistical models lang / fr French language data and models labels Apr 28, 2018
@GaneshBaronAloir

I have the same issue. Did you find a solution?

@ines
Member

ines commented Jul 2, 2018

@randomstuff @mrsaboteur Thanks for the reports – we're actually just working on improving the model tests, so we can run the new v2.1 models against examples like this.

I just created simplified test cases from the two examples above. Are those correct (do they describe the correct, intended behaviour), or did I miss something here?

doc = nlp("Tu vas bien.")
assert doc[0].tag_ == 'PRON__Number=Sing|Person=2'
doc = nlp("Comment vas-tu?")
assert doc[3].tag_ == 'PRON__Number=Sing|Person=2'
assert doc[3].lemma_ == 'tu'

@ines ines added feat / tagger Feature: Part-of-speech tagger feat / lemmatizer Feature: Rule-based and lookup lemmatization labels Jul 2, 2018
@randomstuff
Author

@ines, yes these tags for "tu" are correct.

@GaneshBaronAloir

@ines I confirm the tags are correct. Thank you for looking into it.

@ines
Member

ines commented Dec 14, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019