Noun chunking inconsistency #2451

chozelinek · 2018-06-14T08:28:17Z

The problem

I've noticed that doc.noun_chunks sometimes yields a noun chunk that is embedded in a longer one. So far I have only seen this in a few examples involving relative clauses introduced by "which".

Take the sentence "Including equity share of refineries in which the Group has a stake."

Both "the Group" and "in which the Group has a stake" are returned as noun chunks, even though the first is contained in the second. This does not normally happen. Below are a few examples so you can reproduce and study the behaviour.

How to reproduce the behaviour

import spacy
nlp = spacy.load('en_core_web_md')

text0 = "American company listed on NASDAQ in which the Group holds a 23.51% interest as of December 31, 2016."
text1 = "Including equity share of refineries in which the Group has a stake."
text2 = "Prices for oil and natural gas may fluctuate widely due to many\nfactors over which TOTAL has no control."
text3 = "This\nscope, which is different from the “operated domain” mentioned\nabove, includes all the assets in which the Group has a financial\ninterest or rights to production.\n "
text4 = "GHG emissions are also published on an equity interest basis, i.e.,\nby consolidating the Group share of the emissions of all assets in\nwhich the Group has a financial interest or rights to production.\n "
text5 = "From this profit, minus prior losses, if any, the following items are\ndeducted in the order indicated:\n 1) 5% to constitute the legal reserve fund, until said fund reaches\n10% of the share capital;\n 2) the amounts set by the Shareholders’ Meeting to fund reserves\nfor which it determines the allocation or use; and\n 3) the amounts that the Shareholders’ Meeting decides to retain.\n "

texts = [text0, text1, text2, text3, text4, text5]

# Print every noun chunk spaCy finds in each text.
for i, text in enumerate(texts):
    print('# Noun chunks in text {}:'.format(i))
    doc = nlp(text)
    for chunk in doc.noun_chunks:
        print(chunk)

For each text, these are the embedded chunk and the longer chunk that contains it:

Text 0: "the Group" and "in which the Group holds a 23.51% interest".
Text 1: "the Group" and "in which the Group has a stake".
Text 2: "TOTAL" and "over which TOTAL has no control".
Text 3: "the Group" and "in which the Group has a financial".
Text 4: no issue in this example; this is the behaviour I expected.
Text 5: "it" and "for which it determines the allocation".
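
Embedded chunks like the ones listed above could also be flagged programmatically. The following is only a sketch: find_embedded_spans is a helper name made up for this illustration, it works on plain (start, end) token offsets with an exclusive end (the convention spaCy uses for Span.start and Span.end), and the offsets in the example are illustrative rather than real model output.

```python
def find_embedded_spans(spans):
    """Return (inner, outer) pairs where inner lies entirely within outer.

    `spans` is a list of (start, end) token offsets, with `end` exclusive.
    """
    embedded = []
    for inner in spans:
        for outer in spans:
            if inner == outer:
                continue
            # inner is embedded when outer covers its whole range
            if outer[0] <= inner[0] and inner[1] <= outer[1]:
                embedded.append((inner, outer))
    return embedded


# Illustrative offsets only (not real parser output): (7, 9) stands for
# "the Group" and (5, 12) for "in which the Group has a stake" in text 1.
example = [(1, 3), (4, 5), (5, 12), (7, 9)]
print(find_embedded_spans(example))
```

With a loaded pipeline, the offsets could presumably be collected as [(c.start, c.end) for c in doc.noun_chunks] and passed to the helper.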

Your Environment

spaCy version: 2.0.11
Platform: Darwin-17.6.0-x86_64-i386-64bit
Python version: 3.6.3
Models: en_core_web_md, fr_core_news_md, es_core_news_md, de_core_news_sm, pt_core_news_sm, fr_core_news_sm

ines · 2018-12-14T11:13:48Z

The noun chunks depend on the part-of-speech tags and dependency parse, so this issue likely comes down to incorrect predictions made by the tagger or parser.
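
One way to check this is to look at the tag, dependency label, and head predicted for each token around the "which" clause. The snippet below is a rough sketch: token_table is a name invented here, and it relies only on the text, pos_, dep_, and head attributes that spaCy tokens expose. The stand-in tokens carry invented annotations so the snippet runs without a model; they are not real parser output.

```python
from types import SimpleNamespace


def token_table(tokens):
    """Return (text, pos, dep, head text) rows, one per token."""
    return [(t.text, t.pos_, t.dep_, t.head.text) for t in tokens]


# Stand-in tokens with invented annotations (assumption: for illustration only).
has = SimpleNamespace(text="has", pos_="VERB", dep_="relcl")
has.head = has  # placeholder head so the example is self-contained
group = SimpleNamespace(text="Group", pos_="PROPN", dep_="nsubj", head=has)
the = SimpleNamespace(text="the", pos_="DET", dep_="det", head=group)

for row in token_table([the, group, has]):
    print("{:<8}{:<8}{:<8}{}".format(*row))
```

With a loaded pipeline, the same helper could be called as token_table(nlp(text1)) to see which predictions lead to the unexpected chunk.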

I'm merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

lock · 2019-01-13T16:59:08Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the performance and lang / en (English language data and models) labels on Jun 14, 2018

ines added the perf / accuracy (Performance: accuracy) label and removed the performance label on Aug 15, 2018

ines closed this as completed Dec 14, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019
