# Sentence Segmentation
In **spaCy Basics** we saw briefly how Doc objects are divided into sentences.

In [1]:
import spacy

In [30]:
nlp = spacy.load('en_core_web_lg')

In [3]:
# From spaCy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [5]:
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


### `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*. That is, the sentence Spans are produced one at a time as you iterate over `doc.sents`, rather than being stored in a list. This means that, while you can print the second Doc token with `print(doc[1])`, you can't get the "second Doc sentence" with `print(doc.sents[1])`.

In [6]:
doc[0]

This

In [7]:
print(doc.sents[2])

TypeError: 'generator' object is not subscriptable

However, you *can* build a sentence collection by passing `doc.sents` to `list()`, just as you would with `range`.

In [8]:
list(doc.sents)[0]

This is the first sentence.
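The same behavior can be reproduced with a plain Python generator, without spaCy. In the sketch below, `fake_sents` is a hypothetical stand-in for `doc.sents` (it is not part of spaCy); it shows the subscript error, the `list()` workaround, and a lazy alternative using `itertools.islice`.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "This is the first sentence."
    yield "This is another sentence."
    yield "This is the last sentence."

# Subscripting a generator fails, just like doc.sents[2] does.
try:
    fake_sents()[2]
    error_message = None
except TypeError as err:
    error_message = str(err)
print(error_message)

# Materialize everything into a list when random access is needed...
sentences = list(fake_sents())
print(sentences[2])

# ...or pull a single item lazily without building the whole list.
third = next(islice(fake_sents(), 2, None))
print(third)
```

Building a list is the simpler choice for short documents; `islice` avoids creating every sentence when only one is needed.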

### `sents` are Spans
At first glance it looks like each `sent` contains text from the original Doc object. In fact they're just Spans with start and end token pointers.

In [9]:
type(list(doc.sents)[0])

spacy.tokens.span.Span
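To make the "start and end token pointers" idea concrete, here is a toy model in plain Python. `ToySpan` is a hypothetical class invented for illustration and is far simpler than spaCy's real `Span`; the only assumption carried over from spaCy is that a span records a `start` and an exclusive `end` index into the parent document's tokens.

```python
tokens = ["This", "is", "the", "first", "sentence", ".",
          "This", "is", "another", "sentence", "."]

class ToySpan:
    """Hypothetical, simplified stand-in for a Span: a view, not a copy."""

    def __init__(self, tokens, start, end):
        self.tokens = tokens  # reference to the shared token list
        self.start = start    # index of the first token in the span
        self.end = end        # index one past the last token (exclusive)

    def __iter__(self):
        return iter(self.tokens[self.start:self.end])

    def __len__(self):
        return self.end - self.start

first = ToySpan(tokens, 0, 6)
second = ToySpan(tokens, 6, 11)

print(first.start, first.end, len(first))
print(list(second))
```

Both spans share the same underlying token list, which is why a span is cheap to create. On a real spaCy sentence the corresponding values are available as `sent.start` and `sent.end`.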

## Adding Rules
spaCy's default segmentation relies on the dependency parse and end-of-sentence punctuation to determine sentence boundaries. We can add rules of our own, but they have to be added to the pipeline *before* the Doc object is created, because sentence start tokens are set while the Doc is being processed:

In [31]:
doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')


In [32]:
for sent in doc.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter Drucker


In [33]:
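# ADD A NEW RULE TO THE PIPELINE
def set_custom_boundaries(doc):
    # Skip the last token: it has no following token to mark
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

# spaCy v2 API: a plain function can be passed to add_pipe
nlp.add_pipe(set_custom_boundaries, before='parser')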

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has to run before the document is parsed. Here we can pass either `before='parser'` or `first=True`.

In [34]:
doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

In [35]:
for sent in doc.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter Drucker
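The logic of the semicolon rule can be traced without spaCy. The sketch below is an illustration only: `mark_sentence_starts` and `group_sentences` are hypothetical helpers, and a list of booleans stands in for the per-token `is_sent_start` flags. It also shows why the loop stops before the last token, since there is no following token to mark.

```python
def mark_sentence_starts(tokens):
    """Return one boolean per token: True where a new sentence should begin."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    # Skip the last token: there is no following token to mark.
    for i, text in enumerate(tokens[:-1]):
        if text == ";":
            starts[i + 1] = True
    return starts

def group_sentences(tokens, starts):
    """Group tokens into sentences using the start flags."""
    sentences = []
    for text, is_start in zip(tokens, starts):
        if is_start:
            sentences.append([])
        sentences[-1].append(text)
    return sentences

tokens = ["Management", "is", "doing", "things", "right", ";",
          "leadership", "is", "doing", "the", "right", "things", "."]
starts = mark_sentence_starts(tokens)
sentences = group_sentences(tokens, starts)
for sentence in sentences:
    print(sentence)
```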


### Why not change the token directly?
Why not simply set the `.is_sent_start` value to True on existing tokens?

In [37]:
# Try to change the .is_sent_start attribute:
doc[7].is_sent_start = True

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

spaCy refuses to change the attribute once the document has been parsed, to prevent inconsistencies in the data.

## Changing the Rules
In some cases we want to *replace* spaCy's default segmentation with our own set of rules. In this section we'll see how the default behavior breaks on periods. We'll then replace it with a segmenter that breaks on linebreaks.

In [39]:
nlp = spacy.load('en_core_web_sm')

In [40]:
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

In [41]:
# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

In [45]:
print(mystring)

This is a sentence. This is another.

This is a 
third sentence.


In [44]:
for sent in doc.sents:
    print(sent)

This is a sentence.
This is another.


This is a 
third sentence.


In [42]:
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']


In [46]:
# CHANGING THE RULES

In [47]:
from spacy.pipeline import SentenceSegmenter

In [53]:
def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            # The previous token was a newline: close the current sentence here
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'):
            seen_newline = True
    # Yield whatever remains as the final sentence
    yield doc[start:]

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

While the function `split_on_newlines` can be named anything we want, it's important to use the name `sbd` for the SentenceSegmenter.

In [55]:
doc = nlp(mystring)
for sent in doc.sents:
    print(sent)

This is a sentence. This is another.


This is a 

third sentence.


Here we see that periods no longer affect segmentation; only linebreaks do. This would be appropriate when working with a long list of tweets, for instance.
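To see how the newline strategy arrives at these three segments, the same algorithm can be run over a plain list of token strings. This is a sketch for illustration: `split_token_list` is a hypothetical pure-Python rewrite that uses list indices in place of `word.i`, and the token list is copied from the tokenized output shown earlier.

```python
def split_token_list(tokens):
    """Yield slices of `tokens`, starting a new slice after each newline token."""
    start = 0
    seen_newline = False
    for i, text in enumerate(tokens):
        if seen_newline:
            # The previous token was a newline: close the current segment.
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif text.startswith("\n"):
            seen_newline = True
    # Whatever remains forms the final segment.
    yield tokens[start:]

tokens = ["This", "is", "a", "sentence", ".",
          "This", "is", "another", ".", "\n\n",
          "This", "is", "a", "\n",
          "third", "sentence", "."]

segments = list(split_token_list(tokens))
for segment in segments:
    print(segment)
```

Each newline token stays attached to the end of the segment that precedes it, which matches the blank lines visible in the printed sentences.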