
Spacy sentence splitting fails on long or complex sentences #2689

Closed
jeyendranbalakrishnan opened this issue Aug 20, 2018 · 3 comments
Labels
feat / parser Feature: Dependency Parser perf / accuracy Performance: accuracy

Comments

@jeyendranbalakrishnan

Description

Spacy sentence splitting incorrectly splits long/complex sentences.
In two examples I encountered, Spacy incorrectly split one long sentence after a comma, and another long sentence after a closing parenthesis ')'.
I found incorrect splitting in other similar sentences too.
The two examples and steps to reproduce are described below.

Steps/Code to Reproduce

import spacy

# Load the large English model; NER is disabled because only the parser is needed here.
nlp = spacy.load('en_core_web_lg', disable=['ner'])
texts = [
    "Definitely encourage you to continue making big bets in 2018. The new project seems like a great opportunity for us to invest in an area where the org needs better tooling. It says alot when you made the internal team swap to address this bet. It's easy to think \"lets do it\", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself. I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.",
    "Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling. There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ. So far, you've been a big help with this."
]

# Print every sentence the parser-based segmenter produces for each text.
for n, text in enumerate(texts):
    doc = nlp(text)
    print('Doc ', n, ':', sep='')
    for i, sentence in enumerate(doc.sents):
        print(i, sentence, sep=': ')

Expected Results

Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline, if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact) requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
1: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
2: So far, you've been a big help with this.

Actual Results

Doc 0:
0: Definitely encourage you to continue making big bets in 2018.
1: The new project seems like a great opportunity for us to invest in an area where the org needs better tooling.
2: It says alot when you made the internal team swap to address this bet.
3: It's easy to think "lets do it", but committing to it by moving parts that existing stakeholders were previously happy with (I hope) is a big bet in itself.
4: I've developed the opinion that with our current team size and the kind of requests I've seen come down the pipeline,
5: if every stakeholder is perfectly happy, it's likely we are not really taking those big bets.
Doc 1:
0: Continue helping us push back on smaller (lower impact)
1: requests to keep time, not only for big bets, but also tech debt, better documentation, internal improvements and tooling.
2: There is delicate balance needed to keep helping our partners in the short term, while working for the long term objectives of XYZ.
3: So far, you've been a big help with this.

My Environment

Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.1
SciPy 1.1.0
Scikit-Learn 0.19.1
Spacy 2.0.11

@jeyendranbalakrishnan
Author

Hello, just checking to see if there is any feedback on this issue.
Thanks a lot!

@ines ines added feat / parser Feature: Dependency Parser perf / accuracy Performance: accuracy labels Sep 5, 2018
@ines
Member

ines commented Dec 14, 2018

By default, spaCy uses the parser to set sentence boundaries. This is usually more accurate – however, depending on the data, it also means that it's affected by wrong predictions in the dependency parse. See here for details on how to customise the sentence segmentation and how to use a rule-based component instead: https://spacy.io/usage/linguistic-features#section-sbd
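To illustrate the difference between the two approaches, here is a minimal plain-Python sketch of the rule-based idea: split only after sentence-final punctuation, never after a comma or a closing parenthesis. It is not spaCy's implementation and does not use the spaCy API. The helper name `split_on_punctuation` is hypothetical, and the rule is deliberately naive (it would also split after abbreviations such as "e.g."). For the actual spaCy component, refer to the documentation linked above.

```python
import re


def split_on_punctuation(text):
    """Split text after '.', '!' or '?' when whitespace follows.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


# Second text from the report above, which the parser split after ')'.
text = (
    "Continue helping us push back on smaller (lower impact) requests to keep "
    "time, not only for big bets, but also tech debt, better documentation, "
    "internal improvements and tooling. There is delicate balance needed to "
    "keep helping our partners in the short term, while working for the long "
    "term objectives of XYZ. So far, you've been a big help with this."
)

sentences = split_on_punctuation(text)
for i, sentence in enumerate(sentences):
    print(i, sentence, sep=': ')
```

Because the rule only looks at punctuation, the parenthesis and the commas in this text cannot trigger a boundary. The trade-off is that such a rule ignores syntax entirely, which is why the parser-based default is usually more accurate on well-formed text.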

I'm also merging this with #3052. We've now added a master thread for incorrect predictions and related reports – see the issue for more details.

@ines ines closed this as completed Dec 14, 2018