Spacy sentence splitting fails on long or complex sentences #2689

jeyendranbalakrishnan · 2018-08-20T21:27:52Z

Description

spaCy's sentence splitting incorrectly splits long or complex sentences.
In two examples I encountered, spaCy incorrectly split one long sentence after a comma and another after a closing parenthesis ')'.
I found incorrect splitting in other similar sentences too.
The two examples and steps to reproduce are described below.

Steps/Code to Reproduce

import spacy

# Load the large English model; NER is not needed for sentence splitting.
nlp = spacy.load('en_core_web_lg', disable=['ner'])

texts = [
    # Triple-quoted because the text contains both ' and " characters.
    """Definitely encourage you to continue making big bets in 2018. The new project seems like a great opportunity for us to invest in an area where the org needs better tooling. It says alot when you made the internal team swap to address this bet. It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself. I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.""",
    "Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling. There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ. So far, you've been a big help with this.",
]

for n, text in enumerate(texts):
    doc = nlp(text)
    print('Doc ', n, ':', sep='')
    # doc.sents yields the sentence spans set by the dependency parser.
    for i, sentence in enumerate(doc.sents):
        print(i, sentence, sep=': ')

Expected Results

Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
1: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
2: So far, you've been a big help with this.

Actual Results

Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline,
5: if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact)
1: requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
2: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
3: So far, you've been a big help with this.
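Both bad splits above share a pattern: the segment before the break does not end in sentence-final punctuation. A minimal stdlib-only sketch for flagging such segments in bulk follows. It assumes that "ends in '.', '!' or '?'" is a good enough heuristic for this kind of text, and the name `find_suspicious_breaks` is hypothetical, not part of spaCy.

```python
def find_suspicious_breaks(sentences):
    """Return (index, sentence) pairs for segments that do not end in
    '.', '!' or '?' (ignoring trailing closing quotes or brackets)."""
    closers = "\"')]"
    flagged = []
    for i, sent in enumerate(sentences):
        core = sent.strip().rstrip(closers)
        if not core.endswith(('.', '!', '?')):
            flagged.append((i, sent))
    return flagged


# Segments copied from the actual results above (Doc 0 items 4-5, Doc 1 items 0-1).
actual = [
    "I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline,",
    "if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.",
    "Continue helping us push back on smaller (lower impact)",
    "requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.",
]

for index, segment in find_suspicious_breaks(actual):
    print(index, segment, sep=': ')  # flags segments 0 and 2
```

With real pipeline output, the input list could be built as `[sent.text for sent in doc.sents]`. The heuristic will also flag legitimate segments such as headings or list items that carry no terminal punctuation.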

My Environment

Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.1
SciPy 1.1.0
Scikit-Learn 0.19.1
Spacy 2.0.11

jeyendranbalakrishnan · 2018-09-04T20:01:23Z

Hello, just checking to see if there is any feedback on this issue.
Thanks a lot!

ines · 2018-12-14T11:34:59Z

By default, spaCy uses the parser to set sentence boundaries. This is usually more accurate – however, depending on the data, it also means that it's affected by wrong predictions in the dependency parse. See here for details on how to customise the sentence segmentation and how to use a rule-based component instead: https://spacy.io/usage/linguistic-features#section-sbd

I'm also merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.
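To illustrate the rule-based idea mentioned above, here is a small stdlib-only sketch. It is not spaCy's component and makes no claim about how spaCy implements it; `split_sentences` is a hypothetical helper that assumes a sentence ends only at '.', '!' or '?' followed by whitespace or the end of the text. Under that rule, a comma or a closing parenthesis can never start a new sentence, which avoids both failures reported here, at the cost of mis-splitting abbreviations such as "e.g. this".

```python
import re

# One sentence = shortest run of characters ending in terminal punctuation
# (plus any closing quotes/brackets) that is followed by whitespace or the
# end of the text. The trailing "|$" catches a final unpunctuated fragment.
_SENTENCE = re.compile(r'''.+?(?:[.!?]+["')\]]*(?=\s|$)|$)''', re.DOTALL)


def split_sentences(text):
    """Split text on sentence-final punctuation only (hypothetical helper)."""
    return [m.group().strip() for m in _SENTENCE.finditer(text) if m.group().strip()]


# Shortened versions of the two problem sentences from this issue.
sample = (
    "I've seen come down the pipeline, if every stakeholder is happy, "
    "it's likely we are not taking those big bets. "
    "Continue helping us push back on smaller (lower impact) requests to keep time. "
    "So far, you've been a big help with this."
)

for i, sentence in enumerate(split_sentences(sample)):
    print(i, sentence, sep=': ')
```

Punctuation rules like this are predictable but blunt; the parser-based default can handle text without punctuation, which is why it is usually more accurate on well-formed input.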

lock · 2019-01-13T16:58:40Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "feat / parser" (Feature: Dependency Parser) and "perf / accuracy" (Performance: accuracy) labels Sep 5, 2018

ines closed this as completed Dec 14, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 13, 2019
