# Sentence Segmentation
spaCy has its own rules for dividing text into sentences; with `en_core_web_sm`, the boundaries come from the dependency parser. Sometimes, however, we need to define our own.

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc.sents:
    print(f'{sent} |')

"Management is doing things right; leadership is doing the right things." |
-Peter Drucker |


## Add a segmentation rule

In [3]:
# token.i is the token's index in the doc. We will use it to mark where a new sentence starts.
for token in doc[:-1]:
    print(f'{token.i}: {token}')

0: "
1: Management
2: is
3: doing
4: things
5: right
6: ;
7: leadership
8: is
9: doing
10: the
11: right
12: things
13: .
14: "
15: -Peter


### Creating the rule
Our rule: start a new sentence after every semicolon (;).

In [4]:
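def set_custom_boundaries(doc):
    # Skip the last token: nothing follows it, so doc[token.i+1] would be out of range.
    for token in doc[:-1]:
        if token.text == ';':
            # The token right after a semicolon starts a new sentence.
            doc[token.i+1].is_sent_start = True
    return doc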
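
To make the index arithmetic concrete, here is a minimal sketch of the same rule applied to a plain list of token strings rather than a `Doc`. The helper name `sentence_start_indices` is hypothetical (it is not part of spaCy); it only reports which positions the rule would mark. It stops before the last token so that a trailing semicolon cannot index past the end.

```python
def sentence_start_indices(tokens):
    """Return the indices of tokens that directly follow a semicolon."""
    starts = []
    # Stop one short of the end: a trailing ';' has no following token.
    for i, text in enumerate(tokens[:-1]):
        if text == ';':
            starts.append(i + 1)
    return starts


# Token texts copied from the listing above (tokens 0-15).
tokens = ['"', 'Management', 'is', 'doing', 'things', 'right', ';',
          'leadership', 'is', 'doing', 'the', 'right', 'things', '.', '"',
          '-Peter']

print(sentence_start_indices(tokens))  # [7]
```

Index 7 is `leadership`, the token the rule marks as the start of a new sentence.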

### Adding the rule to the pipeline
We wrap the function in a spaCy component and add it to the pipeline before the parser, so the parser keeps the boundaries we set.

In [5]:
from spacy.language import Language

@Language.component('custom_boundaries')
def custom_boundaries_component(doc):
    return set_custom_boundaries(doc)

# Add the component to the pipeline
nlp.add_pipe('custom_boundaries', before='parser')
nlp.pipe_names

['tok2vec',
 'tagger',
 'custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

### Testing
This time, the text is split into three sentences instead of two.

In [6]:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker


## Changing spaCy's segmentation rules
Suppose we are processing a poem. Instead of splitting the text at periods, we want to split it at line breaks.

In [7]:
nlp = spacy.load('en_core_web_sm')  # reset pipeline

poem = '''Whispers in twilight's embrace,
Silent dreams, a celestial chase.
Stars dance, painting night's grace. Glimmers of hope guide our way.'''

poem_doc = nlp(poem)

In [13]:
# Default segmentation: sentences end at periods, not at line breaks.

for i, sent in enumerate(poem_doc.sents):
    print(f'{i+1}. {sent}')

1. Whispers in twilight's embrace,
Silent dreams, a celestial chase.

2. Stars dance, painting night's grace.
3. Glimmers of hope guide our way.


### Creating a rule
Our rule: start a new sentence only after a line break, so each line becomes one sentence.

**Note**: `yield` is memory-efficient, which helps when working with large objects such as long documents.
A function that contains `yield` is a generator: it produces its values one at a time, suspending execution after each value and resuming when `next()` is called.
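
As an illustration of the note, the sketch below is a generator that groups a plain list of token strings into lines, yielding one group at a time. The function name `split_tokens_on_newlines` and the shortened token list are hypothetical, and the code works on Python lists rather than on a spaCy `Doc`.

```python
def split_tokens_on_newlines(tokens):
    """Yield one list of tokens per line, lazily."""
    start = 0
    seen_newline = False
    for i, text in enumerate(tokens):
        if seen_newline:
            # The previous token was a line break: emit everything up to here.
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif text.startswith('\n'):
            seen_newline = True
    # Emit whatever remains after the last line break.
    if start < len(tokens):
        yield tokens[start:]


# Hypothetical, shortened token list; '\n' stands for a line-break token.
tokens = ['Silent', 'dreams', '.', '\n',
          'Stars', 'dance', '.', 'Glimmers', 'guide', '.']

lines = split_tokens_on_newlines(tokens)  # nothing has been computed yet
print(next(lines))  # the first line, built on demand
print(list(lines))  # the remaining lines
```

Each group is built only when it is requested, which is the behavior the note describes.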

In [14]:
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # is_sent_start can only be set on tokens, not on spans.
    # Mark every token explicitly so the parser adds no boundaries of its own.
    after_newline = True  # the first token always starts a sentence

    for word in doc:
        word.is_sent_start = after_newline
        after_newline = word.text.startswith('\n')

    return doc


In [15]:
nlp.add_pipe("set_custom_boundaries", before="parser")
poem_doc = nlp(poem)

for i, sent in enumerate(poem_doc.sents):
    print(f'{i+1}. {sent.text.strip()}')

1. Whispers in twilight's embrace,
2. Silent dreams, a celestial chase.
3. Stars dance, painting night's grace. Glimmers of hope guide our way.