# Segmentation and tokenization

This notebook demonstrates sentence segmentation and tokenization in `NLTK`, `spaCy`, and `Stanza`.  
English examples come first; code for Russian is given at the end.

### Contents:
1. [NLTK](#NLTK)
    - [NLTK: sentence segmentation](#NLTK:-sentence-segmentation)
    - [NLTK: tokenization](#NLTK:-tokenization)
    - [NLTK: working with parenthesized expressions](#NLTK:-working-with-parenthesized-expressions)
1. [spaCy](#spaCy)
    - [spaCy: sentence segmentation](#spaCy:-sentence-segmentation)
        - [Option 1. Dependency parser](#Option-1.-Dependency-parser)
        - [Option 2. Statistical sentence segmenter](#Option-2.-Statistical-sentence-segmenter)
        - [Option 3. Rule-based pipeline component](#Option-3.-Rule-based-pipeline-component)
        - [Option 4. Custom rule-based strategy](#Option-4.-Custom-rule-based-strategy)
    - [spaCy: tokenization](#spaCy:-tokenization)
1. [Stanza](#Stanza)
    - [Stanza: segmentation](#Stanza:-segmentation)
    - [Stanza: tokenization](#Stanza:-tokenization)
1. [Comparison](#Comparison)
    - [Segmentation](#Segmentation)
    - [Tokenization](#Tokenization)
1. [Russian](#Russian)
    - [Natasha](#Natasha)
    - [Russian: segmentation](#Russian:-segmentation)
    - [Russian: tokenization](#Russian:-tokenization)

# NLTK

`nltk.tokenize` package: https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, SExprTokenizer

These functions require the `Punkt` model to be installed.

Download NLTK modules and corpora (a separate downloader window should open):

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
texts = [
    'This is very simple sentence.',
    'This sentence, has comma!',
    'The first sentence. The second sentence.',
    'The first sentence.The second sentence without space.',
    'This sentence does not have a point at the end',
    "This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. ",
    'An example with "quotes"',
    'An example with references [1], [2, 3], and this[4]',
    'This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith',
    'this is an example of the names with lower letters: j. smith, j.smith, j.k. smith, j. k. smith, john k. smith, john k.smith, mr. smith, mr.smith',
    'An example with \n newline',
    'An example with dots... Between sentences...',
    'Another example with dots...without space... or with.',
    'An example with hyphen-based word, for example, Emilia-Romagna.',
    'Hyphen (or minus) on the -left and on the right- sides.',
    'Numbers in various forms: 1, 2.0, 3,5, 4,000, 5.000.000, 6,000,000, 7$, $8, 9.0$, $10.5, 11%, 12.5%', 
    '@remy: This is waaaaayyyy too much for you!!!!!! :))) :(',
    'Good muffins cost $3.88\nin New York.  Please buy me... two of them.\n\nThanks.',
    '"An example with direct speech?" - said a developer',
    'Another example: "Direct speech, the dot before the quote." And next sentence.',
    'Third example: "The dot after the quote". And next sentence.'
]

#### NLTK: sentence segmentation

In [4]:
for text in texts:
    sentences = sent_tokenize(text)
    print(text)
    for i, sentence in enumerate(sentences):
        print(f'\t{i}: {sentence}')
    print('\n')

This is very simple sentence.
	0: This is very simple sentence.


This sentence, has comma!
	0: This sentence, has comma!


The first sentence. The second sentence.
	0: The first sentence.
	1: The second sentence.


The first sentence.The second sentence without space.
	0: The first sentence.The second sentence without space.


This sentence does not have a point at the end
	0: This sentence does not have a point at the end


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
	0: This sentence doesn't have an apostrophe... wait...
	1: It has word's apostrophe and double  space.


An example with "quotes"
	0: An example with "quotes"


An example with references [1], [2, 3], and this[4]
	0: An example with references [1], [2, 3], and this[4]


This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith
	0: This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. S

#### NLTK: tokenization

In [5]:
for text in texts:
    tokens = word_tokenize(text)
    print(text)
    print(tokens)
    print('\n')

This is very simple sentence.
['This', 'is', 'very', 'simple', 'sentence', '.']


This sentence, has comma!
['This', 'sentence', ',', 'has', 'comma', '!']


The first sentence. The second sentence.
['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


The first sentence.The second sentence without space.
['The', 'first', 'sentence.The', 'second', 'sentence', 'without', 'space', '.']


This sentence does not have a point at the end
['This', 'sentence', 'does', 'not', 'have', 'a', 'point', 'at', 'the', 'end']


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
['This', 'sentence', 'does', "n't", 'have', 'an', 'apostrophe', '...', 'wait', '...', 'It', 'has', 'word', "'s", 'apostrophe', 'and', 'double', 'space', '.']


An example with "quotes"
['An', 'example', 'with', '``', 'quotes', "''"]


An example with references [1], [2, 3], and this[4]
['An', 'example', 'with', 'references', '[', '1', ']', ',', '[', '2', ',', '3', ']', 

#### NLTK: working with parenthesized expressions
[link](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.sexpr)

In [6]:
right_parentheses = '(a b (c d)) e f (g)'

In [7]:
SExprTokenizer().tokenize(right_parentheses)

['(a b (c d))', 'e', 'f', '(g)']

In [8]:
wrong_parentheses = 'c) d) e (f (g'

In [9]:
try:
    SExprTokenizer().tokenize(wrong_parentheses)
except ValueError as error:
    print(error)

Un-matched close paren at char 1
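
The behaviour shown above can be reproduced with a simple depth counter. The following is a minimal pure-Python sketch, not NLTK's implementation: the function name `sexpr_tokenize` and the wording of the error messages are assumptions made for illustration.

```python
def sexpr_tokenize(text):
    """Split text into top-level parenthesized groups and bare words."""
    tokens = []
    depth = 0      # current nesting depth of parentheses
    start = None   # index where the current top-level group opened
    word = []      # characters of the current bare word (outside parentheses)

    for pos, char in enumerate(text):
        if char == '(':
            if depth == 0:
                # A group starts: flush any bare word collected so far
                if word:
                    tokens.append(''.join(word))
                    word = []
                start = pos
            depth += 1
        elif char == ')':
            if depth == 0:
                raise ValueError(f'Unmatched close paren at char {pos}')
            depth -= 1
            if depth == 0:
                # The top-level group is complete: emit it as one token
                tokens.append(text[start:pos + 1])
        elif depth == 0:
            # Outside parentheses: whitespace separates bare words
            if char.isspace():
                if word:
                    tokens.append(''.join(word))
                    word = []
            else:
                word.append(char)

    if depth > 0:
        raise ValueError(f'Unmatched open paren at char {start}')
    if word:
        tokens.append(''.join(word))
    return tokens


balanced_tokens = sexpr_tokenize('(a b (c d)) e f (g)')
print(balanced_tokens)
```

Characters inside a group are never inspected individually, which is why nested groups such as `(c d)` stay inside their parent token. On the unbalanced string from the cell above, this sketch also raises `ValueError` at character 1.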


# spaCy

Installation: https://spacy.io/usage

* Tokenization: https://spacy.io/usage/linguistic-features#tokenization  
* Segmentation: https://spacy.io/usage/linguistic-features#sbd

### spaCy: sentence segmentation

spaCy offers four ways to segment text into sentences:
1. `Dependency parser`: a statistical syntactic parser that performs a full parse of the text, including sentence segmentation.
1. `Statistical sentence segmenter`: a fast statistical component that only detects sentence boundaries.
1. `Rule-based pipeline component`: a component that detects sentence boundaries using rules.
1. `Custom function`: a user-defined pipeline component that implements your own segmentation logic.

In [10]:
import spacy

In [11]:
def output_sentences(texts, nlp):
    for text in texts:
        doc = nlp(text)
        print(text)
        for i, sent in enumerate(doc.sents):
            print(f'\t{i}: {sent.text}')
        print('\n')

#### Option 1. Dependency parser

In [12]:
nlp = spacy.load("en_core_web_sm")

In [13]:
output_sentences(texts, nlp)

This is very simple sentence.
	0: This is very simple sentence.


This sentence, has comma!
	0: This sentence, has comma!


The first sentence. The second sentence.
	0: The first sentence.
	1: The second sentence.


The first sentence.The second sentence without space.
	0: The first sentence.
	1: The second sentence without space.


This sentence does not have a point at the end
	0: This sentence does not have a point at the end


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
	0: This sentence doesn't have an apostrophe...
	1: wait...
	2: It has word's apostrophe and double  space.


An example with "quotes"
	0: An example with "quotes"


An example with references [1], [2, 3], and this[4]
	0: An example with references [1], [2, 3], and this[4]


This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith
	0: This is an example of the names: J. Smith, J.Smith, J.K. Smith

#### Option 2. Statistical sentence segmenter

In [14]:
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

In [15]:
output_sentences(texts, nlp)

This is very simple sentence.
	0: This is very simple sentence.


This sentence, has comma!
	0: This sentence, has comma!


The first sentence. The second sentence.
	0: The first sentence.
	1: The second sentence.


The first sentence.The second sentence without space.
	0: The first sentence.
	1: The second sentence without space.


This sentence does not have a point at the end
	0: This sentence does not have a point at the end


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
	0: This sentence doesn't have an apostrophe... wait...
	1: It has word's apostrophe and double  space.


An example with "quotes"
	0: An example with "quotes"


An example with references [1], [2, 3], and this[4]
	0: An example with references [1], [2, 3], and this[4]


This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith
	0: This is an example of the names: J. Smith, J.Smith, J.K. Smith, J.

#### Option 3. Rule-based pipeline component

In [16]:
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1c7e7540380>

In [17]:
# All components in the spaCy pipeline (the list also includes disabled ones, such as 'senter')
nlp.component_names

['tok2vec',
 'tagger',
 'senter',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'sentencizer']

In [18]:
output_sentences(texts, nlp)

This is very simple sentence.
	0: This is very simple sentence.


This sentence, has comma!
	0: This sentence, has comma!


The first sentence. The second sentence.
	0: The first sentence.
	1: The second sentence.


The first sentence.The second sentence without space.
	0: The first sentence.
	1: The second sentence without space.


This sentence does not have a point at the end
	0: This sentence does not have a point at the end


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
	0: This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space.


An example with "quotes"
	0: An example with "quotes"


An example with references [1], [2, 3], and this[4]
	0: An example with references [1], [2, 3], and this[4]


This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith
	0: This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. 

#### Option 4. Custom rule-based strategy
See the [spaCy documentation](https://spacy.io/usage/linguistic-features#sbd-custom).
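
To show what such a custom rule changes, here is a minimal pure-Python sketch that places a sentence boundary after every token from a fixed set. It is not spaCy code: the function name `split_by_boundary_tokens` and the default boundary set are assumptions, and the token list is copied from the spaCy tokenization output for the sixth example text.

```python
def split_by_boundary_tokens(tokens, boundary_tokens=('.', '!', '?')):
    """Group tokens into sentences, closing a sentence after each boundary token."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if token in boundary_tokens:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without a closing boundary still form a sentence
        sentences.append(current)
    return sentences


# Tokens as printed by spaCy for the sixth example text
tokens = ['This', 'sentence', 'does', "n't", 'have', 'an', 'apostrophe', '...',
          'wait', '...', 'It', 'has', 'word', "'s", 'apostrophe', 'and',
          'double', ' ', 'space', '.']

default_rule = split_by_boundary_tokens(tokens)
with_ellipsis = split_by_boundary_tokens(tokens, boundary_tokens=('.', '!', '?', '...'))

print(len(default_rule), len(with_ellipsis))
```

With the default set the text stays a single sentence, similar to the Option 3 output; once `'...'` is treated as a boundary it is split into three, similar to the Option 1 output. Whether an ellipsis should end a sentence depends on the text, which is the kind of decision a custom rule makes explicit.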

### spaCy: tokenization

In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
for text in texts:
    doc = nlp(text)
    print(text)
    tokens = [str(token) for token in doc]
    print(tokens)
    print('\n')

This is very simple sentence.
['This', 'is', 'very', 'simple', 'sentence', '.']


This sentence, has comma!
['This', 'sentence', ',', 'has', 'comma', '!']


The first sentence. The second sentence.
['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


The first sentence.The second sentence without space.
['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', 'without', 'space', '.']


This sentence does not have a point at the end
['This', 'sentence', 'does', 'not', 'have', 'a', 'point', 'at', 'the', 'end']


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
['This', 'sentence', 'does', "n't", 'have', 'an', 'apostrophe', '...', 'wait', '...', 'It', 'has', 'word', "'s", 'apostrophe', 'and', 'double', ' ', 'space', '.']


An example with "quotes"
['An', 'example', 'with', '"', 'quotes', '"']


An example with references [1], [2, 3], and this[4]
['An', 'example', 'with', 'references', '[', '1', ']', ',', '[', '2', ',',

---
Suppose we do not want `Emilia-Romagna` to be split into separate tokens. To achieve this, we build a custom tokenizer on top of the existing one ([link 1](https://spacy.io/usage/linguistic-features#native-tokenizers), [link 2](https://stackoverflow.com/questions/51012476/spacy-custom-tokenizer-to-include-only-hyphen-words-as-tokens-using-infix-regex)):

In [21]:
import re
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Split inside a token only on the em dash and the tilde; the hyphen is left out,
    # so hyphenated words such as 'Emilia-Romagna' stay in one piece
    infix_re = re.compile(r'''[—~]''')
    # Keep the default prefix and suffix rules of the loaded pipeline
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

In [22]:
nlp = spacy.load("en_core_web_sm")

nlp.tokenizer = custom_tokenizer(nlp)

In [23]:
text = 'An example with hyphen-based word, for example, Emilia-Romagna.'
doc = nlp(text)
print(text)
tokens = [str(token) for token in doc]
print(tokens)

An example with hyphen-based word, for example, Emilia-Romagna.
['An', 'example', 'with', 'hyphen-based', 'word', ',', 'for', 'example', ',', 'Emilia-Romagna', '.']
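
The effect of the infix pattern can be checked with the standard `re` module alone. In the sketch below, `custom_infix` is the pattern from `custom_tokenizer`, while `hyphen_infix` is a simplified, hypothetical stand-in for a rule set that treats the hyphen as an infix; spaCy's actual default infix rules are more elaborate.

```python
import re

custom_infix = re.compile(r'[—~]')    # pattern used in custom_tokenizer: no hyphen
hyphen_infix = re.compile(r'[-—~]')   # hypothetical variant that also matches the hyphen


def infix_spans(pattern, word):
    """Return the (start, end) spans where the pattern would split the word."""
    return [match.span() for match in pattern.finditer(word)]


word = 'Emilia-Romagna'
spans_custom = infix_spans(custom_infix, word)
spans_hyphen = infix_spans(hyphen_infix, word)

print(spans_custom)
print(spans_hyphen)
```

An empty list means the tokenizer has no infix position at which to split the word, so it is kept whole. A non-empty list marks the characters that would become separate tokens.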


# Stanza

In Stanza, tokenization and sentence segmentation are performed by a single component, `TokenizeProcessor`.

In [24]:
import stanza

In [25]:
# Download the English models
stanza.download("en")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 11.4MB/s]
2021-05-17 21:52:18 INFO: Downloading default packages for language: en (English)...
2021-05-17 21:52:19 INFO: File exists: C:\Users\Евгений\stanza_resources\en\default.zip.
2021-05-17 21:52:22 INFO: Finished downloading models and saved to C:\Users\Евгений\stanza_resources.


In [26]:
nlp = stanza.Pipeline(lang='en', processors='tokenize')

2021-05-17 21:52:22 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-05-17 21:52:22 INFO: Use device: gpu
2021-05-17 21:52:22 INFO: Loading: tokenize
2021-05-17 21:52:24 INFO: Done loading processors!


### Stanza: segmentation

In [27]:
for text in texts:
    doc = nlp(text)
    print(text)
    for i, sent in enumerate(doc.sentences):
        print(f'\t{i}: {sent.text}')
    print('\n')

This is very simple sentence.
	0: This is very simple sentence.


This sentence, has comma!
	0: This sentence, has comma!


The first sentence. The second sentence.
	0: The first sentence.
	1: The second sentence.


The first sentence.The second sentence without space.
	0: The first sentence.
	1: The second sentence without space.


This sentence does not have a point at the end
	0: This sentence does not have a point at the end


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
	0: This sentence doesn't have an apostrophe... wait...
	1: It has word's apostrophe and double  space.


An example with "quotes"
	0: An example with "quotes"


An example with references [1], [2, 3], and this[4]
	0: An example with references [1], [2, 3], and this[4]


This is an example of the names: J. Smith, J.Smith, J.K. Smith, J. K. Smith, John K. Smith, John K.Smith, Mr. Smith, Mr.Smith
	0: This is an example of the names: J. Smith, J.Smith, J.K. Smith, J.

### Stanza: tokenization

In [28]:
for text in texts:
    doc = nlp(text)
    print(text)
    tokens = []
    for sentence in doc.sentences:
        tokens += [token.text for token in sentence.tokens]
    print(tokens)
    print('\n')

This is very simple sentence.
['This', 'is', 'very', 'simple', 'sentence', '.']


This sentence, has comma!
['This', 'sentence', ',', 'has', 'comma', '!']


The first sentence. The second sentence.
['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


The first sentence.The second sentence without space.
['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', 'without', 'space', '.']


This sentence does not have a point at the end
['This', 'sentence', 'does', 'not', 'have', 'a', 'point', 'at', 'the', 'end']


This sentence doesn't have an apostrophe... wait... It has word's apostrophe and double  space. 
['This', 'sentence', 'does', "n't", 'have', 'an', 'apostrophe', '...', 'wait', '...', 'It', 'has', 'word', "'s", 'apostrophe', 'and', 'double', 'space', '.']


An example with "quotes"
['An', 'example', 'with', '"', 'quotes', '"']


An example with references [1], [2, 3], and this[4]
['An', 'example', 'with', 'references', '[', '1', ']', ',', '[', '2', ',', '3',

# Comparison

In [29]:
# spaCy - Dependency parser
nlp_spacy1 = spacy.load("en_core_web_sm")

# spaCy - Statistical sentence segmenter
nlp_spacy2 = spacy.load("en_core_web_sm", exclude=["parser"])
nlp_spacy2.enable_pipe("senter")

# spaCy - Rule-based pipeline component
nlp_spacy3 = spacy.load("en_core_web_sm", exclude=["parser"])
nlp_spacy3.add_pipe("sentencizer")

# stanza
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize')

2021-05-17 21:52:26 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-05-17 21:52:26 INFO: Use device: gpu
2021-05-17 21:52:26 INFO: Loading: tokenize
2021-05-17 21:52:26 INFO: Done loading processors!


### Segmentation

In [30]:
for text in texts:
    
    print(f'{text}\n')
    
    # nltk
    print('\tnltk:')
    sentences_nltk = sent_tokenize(text)
    for i, sentence in enumerate(sentences_nltk):
        print(f'\t\t{i}: {sentence}')
    print('\n')
    
    # spaCy - Dependency parser
    print('\tspaCy1:')
    doc_spacy1 = nlp_spacy1(text)
    for i, sent in enumerate(doc_spacy1.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')
    
    # spaCy - Statistical sentence segmenter
    print('\tspaCy2:')
    doc_spacy2 = nlp_spacy2(text)
    for i, sent in enumerate(doc_spacy2.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')    

    # spaCy - Rule-based pipeline component
    print('\tspaCy3:')
    doc_spacy3 = nlp_spacy3(text)
    for i, sent in enumerate(doc_spacy3.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')    
    
    # stanza
    print('\tstanza:')
    doc_stanza = nlp_stanza(text)
    for i, sent in enumerate(doc_stanza.sentences):
        print(f'\t\t{i}: {sent.text}')
    print('\n')
    
    print('-----------------------------------------------------------------------')

This is very simple sentence.

	nltk:
		0: This is very simple sentence.


	spaCy1:
		0: This is very simple sentence.


	spaCy2:
		0: This is very simple sentence.


	spaCy3:
		0: This is very simple sentence.


	stanza:
		0: This is very simple sentence.


-----------------------------------------------------------------------
This sentence, has comma!

	nltk:
		0: This sentence, has comma!


	spaCy1:
		0: This sentence, has comma!


	spaCy2:
		0: This sentence, has comma!


	spaCy3:
		0: This sentence, has comma!


	stanza:
		0: This sentence, has comma!


-----------------------------------------------------------------------
The first sentence. The second sentence.

	nltk:
		0: The first sentence.
		1: The second sentence.


	spaCy1:
		0: The first sentence.
		1: The second sentence.


	spaCy2:
		0: The first sentence.
		1: The second sentence.


	spaCy3:
		0: The first sentence.
		1: The second sentence.


	stanza:
		0: The first sentence.
		1: The second sentence.


------------
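
Reading the full printout to spot differences is tedious, so it can help to list only the texts on which the tools disagree. The helper below is a sketch under that assumption: `find_disagreements` is a hypothetical name, and the two toy segmenters built on `re` only stand in for the real libraries so that the example runs on its own.

```python
import re


def find_disagreements(texts, segmenters):
    """Return (text, results) pairs for texts where the segmenters disagree.

    `segmenters` maps a display name to a callable that takes a string
    and returns a list of sentence strings.
    """
    disagreements = []
    for text in texts:
        results = {
            name: [sentence.strip() for sentence in segment(text)]
            for name, segment in segmenters.items()
        }
        # Segmenters agree when all of them produce the same sentence sequence
        if len({tuple(sentences) for sentences in results.values()}) > 1:
            disagreements.append((text, results))
    return disagreements


# Toy stand-ins for real segmenters
def split_after_space(text):
    # Boundary only where punctuation is followed by whitespace
    return re.split(r'(?<=[.!?])\s+', text)


def split_after_punct(text):
    # Boundary after any run of sentence-final punctuation
    return re.findall(r'[^.!?]+[.!?]*', text)


sample_texts = [
    'The first sentence. The second sentence.',
    'The first sentence.The second sentence without space.',
]
toy_segmenters = {'after_space': split_after_space, 'after_punct': split_after_punct}
differing = find_disagreements(sample_texts, toy_segmenters)

for text, results in differing:
    print(text)
    for name, sentences in results.items():
        print(f'\t{name}: {sentences}')
```

To apply it to the tools in this notebook, the dictionary could map `'nltk'` to `sent_tokenize` and, for example, `'spaCy1'` to `lambda text: [sent.text for sent in nlp_spacy1(text).sents]`.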

### Tokenization

In [31]:
# spacy
nlp_spacy = spacy.load("en_core_web_sm")

# stanza
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize')

2021-05-17 21:52:27 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-05-17 21:52:27 INFO: Use device: gpu
2021-05-17 21:52:27 INFO: Loading: tokenize
2021-05-17 21:52:27 INFO: Done loading processors!


In [32]:
for text in texts:
    
    print(f'{text}\n')
    
    # nltk
    print('\tnltk:')
    tokens_nltk = word_tokenize(text)
    print(f'\t{tokens_nltk}')
    print('\n')
    
    # spaCy
    print('\tspaCy:')
    doc_spacy = nlp_spacy(text)
    tokens_spacy = [str(token) for token in doc_spacy]
    print(f'\t{tokens_spacy}')
    print('\n')
       
    # stanza
    print('\tstanza:')
    doc_stanza = nlp_stanza(text)
    tokens_stanza = []
    for sentence in doc_stanza.sentences:
        tokens_stanza += [token.text for token in sentence.tokens]
    print(f'\t{tokens_stanza}')
    print('\n')
    
    print('-----------------------------------------------------------------------')

This is very simple sentence.

	nltk:
	['This', 'is', 'very', 'simple', 'sentence', '.']


	spaCy:
	['This', 'is', 'very', 'simple', 'sentence', '.']


	stanza:
	['This', 'is', 'very', 'simple', 'sentence', '.']


-----------------------------------------------------------------------
This sentence, has comma!

	nltk:
	['This', 'sentence', ',', 'has', 'comma', '!']


	spaCy:
	['This', 'sentence', ',', 'has', 'comma', '!']


	stanza:
	['This', 'sentence', ',', 'has', 'comma', '!']


-----------------------------------------------------------------------
The first sentence. The second sentence.

	nltk:
	['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


	spaCy:
	['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


	stanza:
	['The', 'first', 'sentence', '.', 'The', 'second', 'sentence', '.']


-----------------------------------------------------------------------
The first sentence.The second sentence without space.

	nltk:
	['The', 'first', 'sentence.
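
Token lists from two tools can also be compared programmatically. The sketch below uses `difflib.SequenceMatcher` from the standard library to report only the spans where two token sequences differ; `token_diff` is a hypothetical helper name, and the two lists are the NLTK and spaCy outputs shown earlier for the fourth example text.

```python
from difflib import SequenceMatcher


def token_diff(tokens_a, tokens_b):
    """Return pairs of differing token spans between two token sequences."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


tokens_nltk = ['The', 'first', 'sentence.The', 'second', 'sentence',
               'without', 'space', '.']
tokens_spacy = ['The', 'first', 'sentence', '.', 'The', 'second', 'sentence',
                'without', 'space', '.']

differences = token_diff(tokens_nltk, tokens_spacy)
print(differences)
```

For this pair the only difference is that NLTK keeps `sentence.The` as one token while spaCy splits it into three. An empty result means both tools produced identical tokens.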

# Russian

In [33]:
texts_rus = [
    'Пример простого предложения',
    'Пример простого предложения с точкой.',
    'Тоже пример предложения, но с восклицательным знаком!',
    'Первое предложение. Второе предложение.',
    'Первое предложение.Второе предложение без пробела.',
    'У этого предложения в конце нет точки',
    "В этом предложении есть апостроф' и двойной  пробел.",
    'Пример с "кавычками"',
    'Пример со ссылками [1], [2, 3], и вот так[4]',
    'Это пример с именами: С. Иванов, С.Иванов, С.Ю. Иванов, С.Ю.Иванов, С. Ю. Иванов, Сергей Ю. Иванов, Сергей Ю.Иванов, к.п.н. Иванов, проф.Иванов',
    'Это пример с именами строчными буквами: с. иванов, с.иванов, с.ю. иванов, с.ю.иванов, с. ю. иванов, сергей ю. иванов, сергей ю.иванов, проф. иванов, проф.иванов',
    'Пример с \n переносом',
    'Пример с многоточием... Между предложениями...',
    'Другой пример с многоточием...без пробелов... и с ними.',
    'Пример с дефисами: Санкт-Петербург, Кирово-Чепецк, кирово-чепецк, как-нибудь, как-то, Какой-то, Кто-то.',
    'Дефисы (или минус) -слева и справа-',
    'Числа: 1, 2.0, 3,5, 4,000, 5.000.000, 6,000,000, 7$, $8, 9.0$, $10.5, 11%, 12.5%', 
    '@john: Это суууупер!!!!!! :))) :(',
    'Классный ресторан :) Обязательно приду ещё раз!',
    'Хороший маффин стоит 3,88 \nв Нью-Йорке. Пожалуйста, купи мне... два.\n\nСпасибо.',
    '"Пример с прямой речью?" - говорит разработчик',
    'Вот ещё один пример: "Тоже прямая речь, точка до кавычек." И следующее предложение.',
    'Третий пример: "Точка после кавычек". И следующее предложение.',
    '10 млн. руб. или долл. - точку не нужно ставить после "млн", но ставят.',
    'Изъято 20 тыс. шт. патронов.'
]
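
The last two items of `texts_rus` contain abbreviations followed by a period (`млн.`, `руб.`, `тыс.`, `шт.`), which is a common source of segmentation errors. The following sketch, which is not taken from any of the libraries compared here, shows why: a naive punctuation rule splits after every period, while a rule with an abbreviation list does not. The names `naive_split`, `abbreviation_aware_split` and the contents of `ABBREVIATIONS` are assumptions for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {'тыс', 'млн', 'руб', 'долл', 'шт'}


def naive_split(text):
    """Split after every run of sentence-final punctuation."""
    chunks = re.findall(r'[^.!?]+[.!?]*', text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    """Merge naive chunks whose last word is a known abbreviation.

    Whitespace between merged chunks is normalized to a single space.
    """
    sentences = []
    buffer = ''
    for chunk in naive_split(text):
        buffer = f'{buffer} {chunk}'.strip()
        words = buffer.rstrip('.!?').split()
        if words and words[-1].lower() in abbreviations:
            # The period belongs to an abbreviation: keep accumulating
            continue
        sentences.append(buffer)
        buffer = ''
    if buffer:
        sentences.append(buffer)
    return sentences


example = 'Изъято 20 тыс. шт. патронов.'
print(naive_split(example))
print(abbreviation_aware_split(example))
```

The sketch has an obvious limitation: an abbreviation that really does end a sentence gets merged with the following one.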

### Natasha
The `Natasha` project provides libraries and components for basic Russian-language processing tasks: segmentation into tokens and sentences, morphological and syntactic analysis, and named entity extraction:
* [GitHub](https://natasha.github.io)
* [Article on Habr](https://habr.com/ru/post/516098/)
* [Installation instructions](https://github.com/natasha/natasha#install)

Tokenization and segmentation are [implemented](https://github.com/natasha/razdel) in the `Razdel` module.  
The `Natasha` project also compared [tokenizers](https://github.com/natasha/razdel#tokens) and [sentence segmenters](https://github.com/natasha/razdel#sentences).

In [34]:
from razdel import tokenize, sentenize

### Russian: segmentation

The `rusenttokenize` library (Russian only), which provides the `ru_sent_tokenize` function, is added to the comparison: https://pypi.org/project/rusenttokenize/

In [35]:
from rusenttokenize import ru_sent_tokenize

Russian [models](https://spacy.io/models/ru) in spaCy:
- ru_core_news_sm (16 MB): small
- ru_core_news_md (41 MB): medium
- ru_core_news_lg (491 MB): large

Installation: `python -m spacy download ru_core_news_sm`

In [36]:
spacy_model_name = 'ru_core_news_sm'
# spacy_model_name = 'ru_core_news_md'
# spacy_model_name = 'ru_core_news_lg'

In [37]:
# Download the Russian models for Stanza
stanza.download("ru")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 11.4MB/s]
2021-05-17 21:52:27 INFO: Downloading default packages for language: ru (Russian)...
2021-05-17 21:52:28 INFO: File exists: C:\Users\Евгений\stanza_resources\ru\default.zip.
2021-05-17 21:52:32 INFO: Finished downloading models and saved to C:\Users\Евгений\stanza_resources.


In [38]:
# spaCy - Dependency parser
nlp_spacy1 = spacy.load(spacy_model_name)

# spaCy - Statistical sentence segmenter
nlp_spacy2 = spacy.load(spacy_model_name, exclude=["parser"])
nlp_spacy2.enable_pipe("senter")

# spaCy - Rule-based pipeline component
nlp_spacy3 = spacy.load(spacy_model_name, exclude=["parser"])
nlp_spacy3.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1c8640fa5c0>

In [39]:
# stanza
nlp_stanza = stanza.Pipeline(lang='ru', processors='tokenize')

2021-05-17 21:52:34 INFO: Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |

2021-05-17 21:52:34 INFO: Use device: gpu
2021-05-17 21:52:34 INFO: Loading: tokenize
2021-05-17 21:52:34 INFO: Done loading processors!


In [40]:
for text in texts_rus:
    
    print(f'{text}\n')
    
    # nltk
    print('\tnltk:')
    sentences_nltk = sent_tokenize(text, language='russian')
    for i, sentence in enumerate(sentences_nltk):
        print(f'\t\t{i}: {sentence}')
    print('\n')
    
    # spaCy - Dependency parser
    print('\tspaCy1:')
    doc_spacy1 = nlp_spacy1(text)
    for i, sent in enumerate(doc_spacy1.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')
    
    # spaCy - Statistical sentence segmenter
    print('\tspaCy2:')
    doc_spacy2 = nlp_spacy2(text)
    for i, sent in enumerate(doc_spacy2.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')    

    # spaCy - Rule-based pipeline component
    print('\tspaCy3:')
    doc_spacy3 = nlp_spacy3(text)
    for i, sent in enumerate(doc_spacy3.sents):
        print(f'\t\t{i}: {sent.text}')
    print('\n')    
    
    # stanza
    print('\tstanza:')
    doc_stanza = nlp_stanza(text)
    for i, sent in enumerate(doc_stanza.sentences):
        print(f'\t\t{i}: {sent.text}')
    print('\n')
    
    # Natasha
    print('\tNatasha:')
    for i, sent in enumerate(sentenize(text)):
        print(f'\t\t{i}: {sent.text}')
    print('\n')
    
    # ru_sent_tokenize
    print('\tru_sent_tokenize:')
    sentences_rusenttokenize = ru_sent_tokenize(text)
    for i, sent in enumerate(sentences_rusenttokenize):
        print(f'\t\t{i}: {sent}')
    print('\n')
    
    print('-----------------------------------------------------------------------')

Пример простого предложения

	nltk:
		0: Пример простого предложения


	spaCy1:
		0: Пример простого предложения


	spaCy2:
		0: Пример простого предложения


	spaCy3:
		0: Пример простого предложения


	stanza:
		0: Пример простого предложения


	Natasha:
		0: Пример простого предложения


	ru_sent_tokenize:
		0: Пример простого предложения


-----------------------------------------------------------------------
Пример простого предложения с точкой.

	nltk:
		0: Пример простого предложения с точкой.


	spaCy1:
		0: Пример простого предложения с точкой.


	spaCy2:
		0: Пример простого предложения с точкой.


	spaCy3:
		0: Пример простого предложения с точкой.


	stanza:
		0: Пример простого предложения с точкой.


	Natasha:
		0: Пример простого предложения с точкой.


	ru_sent_tokenize:
		0: Пример простого предложения с точкой.


-----------------------------------------------------------------------
Тоже пример предложения, но с восклицательным знаком!

	nltk:
		0: Тоже пример предл

### Russian: tokenization

In [41]:
# spacy
nlp_spacy = spacy.load(spacy_model_name)  # same model as selected for segmentation above

# stanza
nlp_stanza = stanza.Pipeline(lang='ru', processors='tokenize')

2021-05-17 21:52:36 INFO: Loading these models for language: ru (Russian):
| Processor | Package   |
-------------------------
| tokenize  | syntagrus |

2021-05-17 21:52:36 INFO: Use device: gpu
2021-05-17 21:52:36 INFO: Loading: tokenize
2021-05-17 21:52:36 INFO: Done loading processors!


In [42]:
for text in texts_rus:
    
    print(f'{text}\n')
    
    # nltk
    print('\tnltk:')
    tokens_nltk = word_tokenize(text, language='russian')
    print(f'\t{tokens_nltk}')
    print('\n')
    
    # spaCy
    print('\tspaCy:')
    doc_spacy = nlp_spacy(text)
    tokens_spacy = [str(token) for token in doc_spacy]
    print(f'\t{tokens_spacy}')
    print('\n')
       
    # stanza
    print('\tstanza:')
    doc_stanza = nlp_stanza(text)
    tokens_stanza = []
    for sentence in doc_stanza.sentences:
        tokens_stanza += [token.text for token in sentence.tokens]
    print(f'\t{tokens_stanza}')
    print('\n')
    
    # Natasha
    print('\tNatasha:')
    tokens_natasha = []
    for sentence in sentenize(text):
        tokens_natasha += [token.text for token in tokenize(sentence.text)]
    print(f'\t{tokens_natasha}')
    print('\n')
    
    print('-----------------------------------------------------------------------')

Пример простого предложения

	nltk:
	['Пример', 'простого', 'предложения']


	spaCy:
	['Пример', 'простого', 'предложения']


	stanza:
	['Пример', 'простого', 'предложения']


	Natasha:
	['Пример', 'простого', 'предложения']


-----------------------------------------------------------------------
Пример простого предложения с точкой.

	nltk:
	['Пример', 'простого', 'предложения', 'с', 'точкой', '.']


	spaCy:
	['Пример', 'простого', 'предложения', 'с', 'точкой', '.']


	stanza:
	['Пример', 'простого', 'предложения', 'с', 'точкой', '.']


	Natasha:
	['Пример', 'простого', 'предложения', 'с', 'точкой', '.']


-----------------------------------------------------------------------
Тоже пример предложения, но с восклицательным знаком!

	nltk:
	['Тоже', 'пример', 'предложения', ',', 'но', 'с', 'восклицательным', 'знаком', '!']


	spaCy:
	['Тоже', 'пример', 'предложения', ',', 'но', 'с', 'восклицательным', 'знаком', '!']


	stanza:
	['Тоже', 'пример', 'предложения', ',', 'но', 'с', 'восклиц