Handling ellipsis #1

dzhelonkin · 2018-08-21T09:37:16Z

Hi!
Thank you for your contribution to the nltk project. Your model handles Russian punctuation much better than the other nltk models, but there is an issue with the ellipsis (...). Examples:

>>> import nltk
>>> sent_tokenize = nltk.data.load('tokenizers/punkt/russian.pickle')
>>> sent_tokenize.tokenize("Мама мыла раму… Папа мыл кларнет...")
['Мама мыла раму… Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму... Папа мыл кларнет...")
['Мама мыла раму... Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!!! Папа мыл кларнет...")
['Мама мыла раму!!!', 'Папа мыл кларнет...']
>>> sent_tokenize.tokenize("Мама мыла раму!.. Папа мыл кларнет...")
['Мама мыла раму!..', 'Папа мыл кларнет...']

Does this work as designed (examples 1 and 2)? In Russian an ellipsis usually marks the end of a sentence, but maybe I am wrong.

Mottl · 2018-08-25T17:06:11Z

Hi,

I doubt we can fix that easily. One could probably split sentences at an ellipsis by taking into account the case of the first letter of the word that follows it. It needs a deeper look at how this could be done with the PunktSentenceTokenizer class.
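
A minimal sketch of that idea, written as a post-processing step over the tokenizer output rather than a change to PunktSentenceTokenizer itself. The helper names `split_on_ellipsis` and `resplit` are hypothetical and not part of nltk. The assumption is that an ellipsis followed by whitespace and an uppercase letter ends a sentence, so a mid-sentence ellipsis followed by a proper noun would be split incorrectly.

```python
import re

# Whitespace that directly follows an ellipsis: either the single
# character "…" or three ASCII dots. Two separate lookbehinds are used
# because Python requires each lookbehind to have a fixed width.
_ELLIPSIS_GAP = re.compile(r"(?:(?<=…)|(?<=\.\.\.))\s+")


def split_on_ellipsis(sentence):
    """Split one string at ellipses followed by an uppercase letter.

    Hypothetical helper: a heuristic, not part of nltk.
    """
    parts = []
    start = 0
    for match in _ELLIPSIS_GAP.finditer(sentence):
        # First character of the word after the ellipsis ("" at end of text).
        next_char = sentence[match.end():match.end() + 1]
        if next_char.isupper():
            parts.append(sentence[start:match.start()])
            start = match.end()
    parts.append(sentence[start:])
    return parts


def resplit(sentences):
    """Apply split_on_ellipsis to every sentence the tokenizer returned."""
    return [part for s in sentences for part in split_on_ellipsis(s)]
```

With the tokenizer from the examples above, this could be applied as `resplit(sent_tokenize.tokenize(text))`. Sentences ending in `!!!` or `!..` are left alone, since the model already splits them.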
