Noun chunking inconsistency #2451

Closed
chozelinek opened this issue Jun 14, 2018 · 2 comments
Labels
lang / en English language data and models perf / accuracy Performance: accuracy

Comments

@chozelinek

The problem

I've noticed that doc.noun_chunks sometimes yields a noun chunk that is embedded in a longer one. So far I have only seen this behaviour in a few examples involving relative clauses introduced by "which".

Take the sentence "Including equity share of refineries in which the Group has a stake."

Both "the Group" and "in which the Group has a stake" are marked as noun chunks, which does not normally happen. Below are a few examples so you can reproduce and investigate this.

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_md')

text0 = "American company listed on NASDAQ in which the Group holds a 23.51% interest as of December 31, 2016."
text1 = "Including equity share of refineries in which the Group has a stake."
text2 = "Prices for oil and natural gas may fluctuate widely due to many\nfactors over which TOTAL has no control."
text3 = "This\nscope, which is different from the “operated domain” mentioned\nabove, includes all the assets in which the Group has a financial\ninterest or rights to production.\n "
text4 = "GHG emissions are also published on an equity interest basis, i.e.,\nby consolidating the Group share of the emissions of all assets in\nwhich the Group has a financial interest or rights to production.\n "
text5 = "From this profit, minus prior losses, if any, the following items are\ndeducted in the order indicated:\n 1) 5% to constitute the legal reserve fund, until said fund reaches\n10% of the share capital;\n 2) the amounts set by the Shareholders’ Meeting to fund reserves\nfor which it determines the allocation or use; and\n 3) the amounts that the Shareholders’ Meeting decides to retain.\n "

texts = [text0, text1, text2, text3, text4, text5]

for i, t in enumerate(texts):
    print('# Noun chunks in text {}:'.format(i))
    doc = nlp(t)
    # Print every noun chunk spaCy finds in this text
    for chunk in doc.noun_chunks:
        print(chunk)

These are my comments on the texts analyzed:

  • Text 0: "the Group" and "in which the Group holds a 23.51% interest".
  • Text 1: "the Group" and "in which the Group has a stake".
  • Text 2: "TOTAL" and "over which TOTAL has no control".
  • Text 3: "the Group" and "in which the Group has a financial".
  • Text 4: no issue in this example; this is the behaviour I expected.
  • Text 5: "it" and "for which it determines the allocation".
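
The overlaps listed above could also be checked programmatically instead of by eye. The following is only a minimal sketch: find_embedded_chunks is a hypothetical helper (not part of spaCy) that takes (start, end) token offsets with an exclusive end and reports every span lying entirely inside a different span. The offsets in the example are made up for illustration and are not real parser output.

```python
def find_embedded_chunks(spans):
    """Return (inner, outer) pairs where `inner` lies entirely inside `outer`.

    `spans` is a list of (start, end) token offsets, with `end` exclusive.
    Hypothetical helper for illustration; not a spaCy function.
    """
    embedded = []
    for inner in spans:
        for outer in spans:
            if inner == outer:
                continue  # a span is not considered embedded in itself
            if outer[0] <= inner[0] and inner[1] <= outer[1]:
                embedded.append((inner, outer))
    return embedded


# Illustrative offsets only: (2, 4) sits inside (0, 7); (9, 11) is separate.
example = [(0, 7), (2, 4), (9, 11)]
print(find_embedded_chunks(example))
```

With a real doc, the input could be built as [(c.start, c.end) for c in doc.noun_chunks], since each noun chunk is a Span with start and end token indices.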

Your Environment

  • spaCy version: 2.0.11
  • Platform: Darwin-17.6.0-x86_64-i386-64bit
  • Python version: 3.6.3
  • Models: en_core_web_md, fr_core_news_md, es_core_news_md, de_core_news_sm, pt_core_news_sm, fr_core_news_sm
@ines ines added performance lang / en English language data and models labels Jun 14, 2018
@ines ines added perf / accuracy Performance: accuracy and removed performance labels Aug 15, 2018
@ines
Member

ines commented Dec 14, 2018

The noun chunks depend on the part-of-speech tags and dependency parse, so this issue likely comes down to incorrect predictions made by the tagger or parser.
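
To see which predictions a given chunk is built from, something like the sketch below might help. describe_chunks is a hypothetical helper, not a spaCy function; it assumes only that the object passed in exposes noun_chunks, and that each chunk has a root token with text, tag_, dep_ and head attributes, as spaCy's Doc, Span and Token do.

```python
def describe_chunks(doc):
    """Collect, for each noun chunk, the chunk text and its root token's
    text, fine-grained tag, dependency label and syntactic head.

    Hypothetical helper for illustration; `doc` is expected to behave
    like a parsed spaCy Doc.
    """
    rows = []
    for chunk in doc.noun_chunks:
        root = chunk.root
        rows.append((chunk.text, root.text, root.tag_, root.dep_, root.head.text))
    return rows
```

Calling describe_chunks(nlp(text)) on one of the sentences above and printing each row should show whether an unexpected chunk goes back to a surprising tag or dependency label on its root token.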

I'm merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018
@lock

lock bot commented Jan 13, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019