em-dash receives empty POS tag with 'en' models #1700

Closed
gabbard opened this issue Dec 8, 2017 · 12 comments
Labels
feat / tagger Feature: Part-of-speech tagger lang / en English language data and models models Issues related to the statistical models perf / accuracy Performance: accuracy

Comments

@gabbard

gabbard commented Dec 8, 2017

The EM DASH character receives empty POS tags, both coarse and fine, in the 'en' models for spaCy 2.0.3:

In 2.0.3:

>>> import spacy
>>> nlp = spacy.load('en')
>>> x = nlp('hello — world')
>>> x.print_tree()
[{'word': 'hello', 'lemma': 'hello', 'NE': '', 'POS_fine': 'UH', 'POS_coarse': 'INTJ', 'arc': 'ROOT', 'modifiers': [{'word': '—', 'lemma': '—', 'NE': '', 'POS_fine': '', 'POS_coarse': '', 'arc': 'punct', 'modifiers': []}, {'word': 'world', 'lemma': 'world', 'NE': '', 'POS_fine': 'NN', 'POS_coarse': 'NOUN', 'arc': 'npadvmod', 'modifiers': []}]}]

But in 1.9.0:

>>> import spacy
>>> nlp = spacy.load('en')
>>> x = nlp('hello — world')
>>> x.print_tree()
[{'word': 'hello', 'lemma': 'hello', 'NE': '', 'POS_fine': 'UH', 'POS_coarse': 'INTJ', 'arc': 'ROOT', 'modifiers': [{'word': '—', 'lemma': '—', 'NE': '', 'POS_fine': ':', 'POS_coarse': 'PUNCT', 'arc': 'punct', 'modifiers': []}, {'word': 'world', 'lemma': 'world', 'NE': '', 'POS_fine': 'NN', 'POS_coarse': 'NOUN', 'arc': 'appos', 'modifiers': []}]}]

Info about spaCy

  • spaCy version: 2.0.3
  • Platform: Darwin-17.2.0-x86_64-i386-64bit
  • Python version: 3.6.3
  • Models: en
@gabbard
Author

gabbard commented Dec 8, 2017

hmm - it's actually a little more complicated. If we take the 2.0.3 example above and then do:

>>> for word in x:
...     print(word)
...     print(word.pos)
...     print(word.pos_)
...     print(word.tag)
...     print(word.tag_)
...
hello
90
INTJ
3252815442139690129
UH
—
96
PUNCT
0

world
91
NOUN
15308085513773655218
NN

So pos_ is correctly PUNCT but tag_ is still empty.
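
To surface this across a whole document, one option is to list every token whose fine-grained tag is empty. The following is only a minimal sketch, not part of spaCy: `tokens_missing_tag` is a hypothetical helper that relies solely on the `text`, `pos_` and `tag_` attributes printed above, so it accepts a spaCy `Doc` or any iterable of token-like objects. The stand-in tokens mirror the 2.0.3 output so the snippet runs without a model.

```python
from types import SimpleNamespace


def tokens_missing_tag(doc):
    """Return (index, text, pos_) for tokens whose fine-grained tag_ is empty.

    Hypothetical helper: `doc` is any iterable of objects exposing
    `text`, `pos_` and `tag_`, such as a spaCy Doc.
    """
    return [(i, tok.text, tok.pos_) for i, tok in enumerate(doc) if not tok.tag_]


# Stand-in tokens mirroring the 2.0.3 output above ("\u2014" is EM DASH).
sample = [
    SimpleNamespace(text="hello", pos_="INTJ", tag_="UH"),
    SimpleNamespace(text="\u2014", pos_="PUNCT", tag_=""),
    SimpleNamespace(text="world", pos_="NOUN", tag_="NN"),
]
print(tokens_missing_tag(sample))
```

With a loaded pipeline, the `x` from the session above can be passed in place of `sample`.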

@ines ines added lang / en English language data and models feat / tagger Feature: Part-of-speech tagger labels Mar 27, 2018
@ines ines added the models Issues related to the statistical models label Jul 2, 2018
@adrianeboyd
Contributor

This problem doesn't seem to be English-specific. I've also tried several other languages (German, Spanish, Portuguese) and the tags for em dash are all empty.

@ines ines added perf / accuracy Performance: accuracy and removed performance labels Aug 15, 2018
@IsabelMeraner

IsabelMeraner commented Oct 4, 2018

I have the same problem with empty tags for German:

import spacy
nlp = spacy.load('de_core_news_sm')
doc = nlp(sentence_string)
print(doc)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

results in:

Der nach aussen und abwärts kippende Baum traf haargenau auf eine nahe Tanne , konnte nicht weiter fallen , sondern verklemmte sich so , dass auch der ganze Wurzelkuchen haltmachte .


Der     Xxx True True
nach     xxxx True True
aussen     xxxx True False
und     xxx True True
abwärts     xxxx True False
kippende     xxxx True False
Baum     Xxxx True False
traf     xxxx True False
haargenau     xxxx True False
auf     xxx True True
eine     xxxx True True
nahe     xxxx True False
Tanne     Xxxxx True False

... which means that only token.shape_, token.is_alpha, token.is_stop have an actual return value, all the others seem to be empty strings.

Do you have any suggestions on how to fix this issue or is there another workaround?

Thank you so much.

Info about spaCy

spaCy version: 1.9.0
Python version: 3.5.2
Models: de

@ines
Member

ines commented Oct 5, 2018

@IsabelMeraner

... which means that only token.shape_, token.is_alpha, token.is_stop have an actual return value, all the others seem to be empty strings.

That's strange and basically means that the model isn't predicting anything... 🤔 It makes sense that the lexical attributes all exist, because those aren't predicted by the model.

Could you run python -m spacy validate and check the output for the German model? Issues like this can potentially point to a version incompatibility.
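
The distinction drawn here (lexical attributes present, every predicted attribute empty) differs from a single mispredicted token and can be checked mechanically. The sketch below is an illustration under assumptions, not spaCy functionality: `predicts_nothing` is a hypothetical helper, and treating `pos_`, `tag_` and `dep_` as the predicted attributes is a simplification based on the output in this thread.

```python
from types import SimpleNamespace

# Attributes assumed to come from the statistical model (simplification).
PREDICTED_ATTRS = ("pos_", "tag_", "dep_")


def predicts_nothing(doc):
    """True if the doc has tokens and all of them have empty pos_, tag_ and dep_."""
    tokens = list(doc)
    return bool(tokens) and all(
        not getattr(tok, attr) for tok in tokens for attr in PREDICTED_ATTRS
    )


# Stand-in tokens shaped like the German output above (no model required).
broken = [
    SimpleNamespace(text="Der", pos_="", tag_="", dep_="", shape_="Xxx"),
    SimpleNamespace(text="nach", pos_="", tag_="", dep_="", shape_="xxxx"),
]
print(predicts_nothing(broken))
```

If this returns True for ordinary sentences, a model/library version mismatch is a more likely cause than an accuracy problem with one character.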

@IsabelMeraner

@ines

Could you run python -m spacy validate and check the output for the German model? Issues like this can potentially point to a version incompatibility.

I just upgraded to spacy version 2.0.12 in order to be able to run the validate option. That's what I get:

$ spacy --info

Info about spaCy
Models
Python version     3.5.2
spaCy version      2.0.12
Location           /Users/isabelmeraner/anaconda/lib/python3.5/site-packages/spacy
Platform           Darwin-17.7.0-x86_64-i386-64bit

$ python -m spacy validate

Installed models (spaCy v2.0.12)
/Users/isabelmeraner/anaconda/lib/python3.5/site-packages/spacy

TYPE        NAME                  MODEL                 VERSION
package     de-core-news-md       de_core_news_md       1.0.0    --> n/a
package     en-core-web-sm        en_core_web_sm        1.2.0    --> 2.0.0

Use the following commands to update the model packages:
python -m spacy download en_core_web_sm

Use the following commands to update the model packages:

It looks like spacy 2.0.12 is incompatible with the German model, the command to update the German model packages is empty (see above). If I try to download the model for German, I get a compatibility error:

$ python -m spacy download de-core-news-md
Compatibility error No compatible model found for 'de-core-news-md' (spaCy v2.0.12).

What would you suggest in order to use the German model for pos-tagging? Can you provide an updated version of "de-core-news-md" for the most recent version of spaCy v2.0.12?
Thank you so much!
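
For reference, the rows flagged in the validate output above can be pulled out programmatically. This is a rough sketch that assumes the table layout is exactly as pasted in this comment (TYPE, NAME, MODEL, VERSION, then `--> <suggested version>`); other spaCy versions may print it differently, and `incompatible_models` is a hypothetical helper, not a spaCy function.

```python
def incompatible_models(validate_output):
    """Collect rows that were marked with '-->' instead of a check mark.

    Assumption: the column layout matches the output pasted above.
    """
    rows = []
    for line in validate_output.splitlines():
        parts = line.split()
        if len(parts) >= 6 and parts[0] in ("package", "link") and parts[4] == "-->":
            rows.append(
                {"model": parts[2], "installed": parts[3], "suggested": parts[5]}
            )
    return rows


pasted = """\
TYPE        NAME                  MODEL                 VERSION
package     de-core-news-md       de_core_news_md       1.0.0    --> n/a
package     en-core-web-sm        en_core_web_sm        1.2.0    --> 2.0.0
"""
for row in incompatible_models(pasted):
    print(row["model"], row["installed"], "->", row["suggested"])
```

A suggested version of `n/a` corresponds to the case described here, where no compatible release of that model is offered for the installed spaCy version.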

@ines
Member

ines commented Oct 8, 2018

@IsabelMeraner Thanks for the update! I was confused for a second where the de_core_news_md model was coming from, since we've only been working on that for spaCy v2.1.0. But it looks like your environment still has the models for spaCy 1.x (!) installed – those are the linear models that aren't compatible with the neural network models in v2.x at all. (I'm surprised spaCy didn't raise a more serious error here, actually.)

So, in summary: for spaCy 2.0.12, there's currently only a small German model, de_core_news_sm. If you need a larger model with vectors, you can try the md model we've trained for the upcoming version, which is currently available on pip as spacy-nightly. (If you want to test the nightly build, make sure to use a virtual environment, though! You don't want to accidentally install the new version and models into your existing spaCy environment, because that can potentially mess things up!)

@IsabelMeraner

@ines Thanks a lot for the detailed explanation.

The linear models from the older spaCy versions must have been the reason for this behaviour: After using spaCy 2.0.12 and the compatible German model de_core_news_sm the tags are no longer empty!

One last question regarding the usage of the spacy-nightly -md model for German:
After installing the new model with the command pip install spacy-nightly I checked the compatibility using $ python -m spacy validate:

    Installed models (spaCy v2.1.0a1)
    /Users/isabelmeraner/anaconda/envs/spacy/lib/python3.6/site-packages/spacy

    TYPE        NAME                  MODEL                 VERSION
    package     en-core-web-sm        en_core_web_sm        2.1.0a0  ✔
    package     de-core-news-sm       de_core_news_sm       2.1.0a0  ✔
    link        de_core_news_sm       de_core_news_sm       2.1.0a0  ✔
    link        de                    de_core_news_sm       2.1.0a0  ✔
    link        en                    en_core_web_sm        2.1.0a0  ✔

However, the larger -md model is not listed here and I get an error as soon as I try to load it with
nlp = spacy.load('de_core_news_md'):

OSError: [E050] Can't find model 'de_core_news_md'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Do you have any suggestions here? Thank you very much.

@ines
Member

ines commented Oct 9, 2018

@IsabelMeraner Glad it's working now! And it looks like you don't actually have the _md model installed? In your spacy-nightly environment, try running python -m spacy download de_core_news_md!
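
A quick way to confirm which environment actually has a model is to ask Python whether the package is importable there. The snippet below is a sketch using only the standard library; it assumes the model was installed as a Python package (the `package` rows in the validate output), and `model_installed` is a hypothetical helper name.

```python
import importlib.util


def model_installed(package_name):
    """Return True if `package_name` is importable in the current environment.

    Assumption: the model was installed as a top-level Python package,
    not only as a shortcut link or a path on disk.
    """
    return importlib.util.find_spec(package_name) is not None


for name in ("de_core_news_sm", "de_core_news_md"):
    print(name, "installed" if model_installed(name) else "missing")
```

Running this inside each virtual environment shows where a given model is present before calling `spacy.load`.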

@IsabelMeraner

IsabelMeraner commented Oct 9, 2018

@ines Of course, it was only installed in the other environment. Now it's working fine with the bigger model. Thanks again for your help!

@kuchenrolle

I'm using spacy 2.0.13 and have the compatible model en_core_web_lg 2.0.0 installed, but I still get an empty tag for "—" in the sentence: "11.00am — Tony has been given an appointment at the local hospital."

I'm running it on Python 3.6.2 with IPython 6.1.0. And I get the same behaviour from the _sm model (also 2.0.0).

@ines
Member

ines commented Dec 14, 2018

Merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019

6 participants