Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling ellipsis #1

Open
dzhelonkin opened this issue Aug 21, 2018 · 1 comment
Open

Handling ellipsis #1

dzhelonkin opened this issue Aug 21, 2018 · 1 comment

Comments

@dzhelonkin
Copy link

Hi!
Thank you for your contribution on nltk project. Your model handling Russian punctuation much better than other nltk models, but there is an issue with a ellipsis(...). Examples:

>>> import nltk
>>> sent_tokenize = nltk.data.load('tokenizers/punkt/russian.pickle')
>>> sent_tokenize.tokenize("Мама мыла раму… Папа мыл кларнет...")
['Мама мыла раму… Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму... Папа мыл кларнет...")
['Мама мыла раму... Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!!! Папа мыл кларнет...")
['Мама мыла раму!!!', 'Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!.. Папа мыл кларнет...")
['Мама мыла раму!..', 'Папа мыл кларнет...']

Is it work as designed (ex. 1 and ex. 2)? Ellipsis in Russian usually shows the end of a sentence, but maybe I am wrong.

@Mottl
Copy link
Owner

Mottl commented Aug 25, 2018

Hi,

I doubt we can fix that easy. Probably one can split sentences by ellipsis taking into account the case of the first letter of the next word following ellipsis. Need to have a deeper look how it could be done with PunktSentenceTokenizer class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants