split_sentences - handling spaces after "." #32

Open
Ayushk4 opened this issue Jun 27, 2019 · 7 comments

Comments

@Ayushk4
Member

Ayushk4 commented Jun 27, 2019

It might be better if the empty second element in the resulting array weren't there.

julia> WordTokenizers.split_sentences("This is a sentence. ")
2-element Array{SubString{String},1}:
 "This is a sentence."
 "" 
@oxinabox
Member

I would be in favor of this

@RohitPingale

RohitPingale commented Oct 11, 2019

julia> WordTokenizers.split_sentences("This is a sentence. ")
SubString{String}["This is a sentence.", ""]
How about filtering the "" elements from the array?
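
A minimal sketch of what that filtering could look like, assuming it is applied to the array after splitting. The variable names are only for illustration, and this is not necessarily how the package does it:

```julia
# The result reported above, written out as a plain vector for illustration.
parts = ["This is a sentence.", ""]

# Drop every empty element; non-empty sentences keep their order.
filtered = filter(!isempty, parts)

# `split` can also skip empty pieces itself via the `keepempty` keyword.
direct = split("This is a sentence.\n", "\n"; keepempty=false)
```

Both approaches remove the empty string after the fact rather than preventing it from being produced.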

@oxinabox
Member

I think we should just change the regex so the empty string never appears in the first place.

@RohitPingale

RohitPingale commented Oct 11, 2019

sentences = replace(sentences, r"([?!.])\s" => Base.SubstitutionString("\\1\n"))
The problem is in this line. The regex itself works fine, but the SubstitutionString replaces the whitespace after every delimiter with \n. That is needed when there are two or more sentences, because this line is followed by sentences = split(sentences, "\n"). In the case above, though, the \n is not needed, because the delimiter is followed only by trailing whitespace.
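
To make this concrete, here is a small sketch that reproduces the two steps on the input from this issue and then tries one possible variant. The function name `split_sentences_sketch` and the lookahead pattern are assumptions for illustration only; they are not the change made in the linked PR. Note that the variant uses `\s+`, so it also collapses runs of whitespace between sentences.

```julia
# Reproduce the two steps quoted above on the input from this issue.
text = "This is a sentence. "
marked = replace(text, r"([?!.])\s" => Base.SubstitutionString("\\1\n"))
naive = split(marked, "\n")   # the trailing "\n" yields an empty last element

# Hypothetical variant: only insert "\n" when more text follows the
# whitespace (lookahead for a non-space character), then trim the tail.
function split_sentences_sketch(text::AbstractString)
    marked = replace(text, r"([?!.])\s+(?=\S)" => Base.SubstitutionString("\\1\n"))
    return split(rstrip(marked), "\n")
end

fixed = split_sentences_sketch(text)
```

With this variant the trailing space is never turned into a newline, so no empty element is produced for the input above.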

@oxinabox
Member

oxinabox commented Oct 11, 2019

Link to PR: #37

@RohitPingale

RohitPingale commented Oct 11, 2019

julia> WordTokenizers.split_sentences(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
7-element Array{SubString{String},1}:
" This is a sentence.Laugh Out Loud."
"Keep coding."
"No."
"Yes!"
"True!"
"ohh!ya!"
"me too."
I observed that a sentence with no space after the delimiter (which is, admittedly, grammatically incorrect) is not split into two separate sentences (as in "sentence.Laugh Out Loud." and "ohh!ya!"). Can this be considered an issue?
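
For illustration, a rough sketch of how the pattern could be relaxed so that whitespace after the delimiter becomes optional. The name `split_no_space_sketch` and the pattern are hypothetical, not the package's implementation. A pattern this loose would also split inside decimals such as "3.14", abbreviations such as "e.g.", and ellipses, so it only shows the idea:

```julia
# Hypothetical variant: treat ".", "?" or "!" as a sentence boundary even
# when no whitespace follows, as long as more text comes after it.
function split_no_space_sketch(text::AbstractString)
    # `\s*` makes the whitespace optional; the lookahead keeps the final
    # delimiter from producing a trailing empty element.
    marked = replace(strip(text), r"([?!.])\s*(?=\S)" => Base.SubstitutionString("\\1\n"))
    return split(marked, "\n")
end

result = split_no_space_sketch(" This is a sentence.Laugh Out Loud. Keep coding. No. Yes! True! ohh!ya! me too. ")
```

On the input above this yields nine elements instead of seven, with "Laugh Out Loud." and "ya!" as separate sentences.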

@oxinabox
Member

Yes, done.
