Fix sentence splitter: sentences ending with acronyms #24

Merged
merged 2 commits into from
Jun 21, 2019

Conversation

nickto
Contributor

@nickto nickto commented May 10, 2019

A sentence ending with an acronym was not distinguished from an initial
followed by a period in the middle of a sentence. E.g., "Adamson is not from USA.
They are from Europe" was treated as a single sentence because "USA." was
taken to be an initial.

  1. Added a requirement for the initial to be only one letter long: "A." in "A.
    Adamson" is still treated as an initial, while "USA" in "from USA. They" is not.
  2. Added a test for such a case.
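
The one-letter rule can be illustrated with a small Python sketch. This is not the regex used in `sentence_splitting.jl`; the names `split_sentences`, `INITIAL_ANY_LENGTH`, and `INITIAL_ONE_LETTER` are hypothetical, and the splitter is deliberately naive (it only breaks after `.`, `!`, or `?` followed by whitespace).

```python
import re

# Hypothetical patterns for illustration, not the package's actual regex.
# Before the fix: any run of capital letters plus a period counts as an initial.
INITIAL_ANY_LENGTH = re.compile(r"\b[A-Z]+\.$")
# After the fix: only a single capital letter plus a period counts as an initial.
INITIAL_ONE_LETTER = re.compile(r"\b[A-Z]\.$")


def split_sentences(text, initial_pattern):
    """Split after '.', '!' or '?' followed by whitespace,
    unless the text up to that point ends in an initial."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Candidate sentence, including the punctuation mark itself.
        candidate = text[start:match.start() + 1]
        if initial_pattern.search(candidate):
            continue  # looks like an initial, so do not split here
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


text = "Adamson is not from USA. They are from Europe."
print(split_sentences(text, INITIAL_ANY_LENGTH))  # one sentence: "USA." is mistaken for an initial
print(split_sentences(text, INITIAL_ONE_LETTER))  # two sentences
print(split_sentences("I met A. Adamson today.", INITIAL_ONE_LETTER))  # one sentence
```

With the any-length pattern, "USA." blocks the split; with the one-letter pattern, the text splits after "USA." while "A." in "A. Adamson" still counts as an initial.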

A sentence ending with an acronym was not distinguished from an initial
followed by a period in the middle of a sentence. Added a requirement
for the initial to be only one letter long, and added a test for such a
case.
@codecov-io

codecov-io commented May 10, 2019

Codecov Report

Merging #24 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #24   +/-   ##
=======================================
  Coverage   80.86%   80.86%           
=======================================
  Files           9        9           
  Lines         277      277           
=======================================
  Hits          224      224           
  Misses         53       53
Impacted Files Coverage Δ
src/sentences/sentence_splitting.jl 88.13% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d877dc...e3f8d93.

@oxinabox
Member

Good catch. Do you think it would be better to use \b rather than \s?
\b is a word boundary, whereas \s is whitespace.
I think \b would work better at the start of the string.

Incorporated the following suggestion and added tests:
JuliaText#24 (comment)

"This is A. A. Adamson." should be a single sentence, but the older regex
would split it into multiple sentences because it required a space before
the second A.
@nickto
Contributor Author

nickto commented May 13, 2019

Good catch. Do you think it would be better to use \b rather than \s?
\b is a word boundary, whereas \s is whitespace.
I think \b would work better at the start of the string.

Changed it to \b and added a test that fails when using \s, thanks!
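
For readers following along, the difference between \s and \b can be sketched in Python. The patterns and the helper `find_initials` are hypothetical and only approximate the idea; the regex in the package may look different. A pattern that requires whitespace on both sides consumes the space after the first initial, so the second of two consecutive initials has no leading space left to match, and an initial at the very start of the string has none at all. A word boundary consumes no characters, so both cases match.

```python
import re

# Hypothetical patterns for illustration; the package's actual regex may differ.
INITIAL_WITH_S = re.compile(r"\s[A-Z]\.\s")  # requires whitespace before the letter
INITIAL_WITH_B = re.compile(r"\b[A-Z]\.\s")  # requires only a word boundary


def find_initials(text, pattern):
    """Return the non-overlapping initials matched by `pattern`,
    with surrounding whitespace stripped."""
    return [m.group().strip() for m in pattern.finditer(text)]


consecutive = "This is A. A. Adamson."
leading = "A. Adamson is here."

print(find_initials(consecutive, INITIAL_WITH_S))  # ['A.'] (second initial missed)
print(find_initials(consecutive, INITIAL_WITH_B))  # ['A.', 'A.']
print(find_initials(leading, INITIAL_WITH_S))      # []
print(find_initials(leading, INITIAL_WITH_B))      # ['A.']
```

Because there is no word boundary between "S" and "A" in "USA.", the \b pattern still does not treat the end of an acronym as an initial.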

@oxinabox
Member

LGTM, will merge when tests pass

@oxinabox oxinabox merged commit c216aa9 into JuliaText:master Jun 21, 2019
@oxinabox
Member

Thanks!


3 participants