# Sentence Segmentation
In **spaCy Basics** we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules.

In [1]:
# Perform standard imports
import spacy
nlp = spacy.load(u"en_core_web_sm")

In [2]:
doc = nlp(u"This is the first sentence. This is another sentence. This is the last sentence.")

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


### `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*. That is, the sentence Spans are not built until you iterate over `doc.sents`, and they are produced one at a time rather than stored in a list. This means that, while you can print the second Doc token with `print(doc[1])`, you can't get the "second Doc sentence" with `print(doc.sents[1])`, because a generator does not support indexing.
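
The indexing failure can be reproduced with a minimal sketch. It assumes a blank English pipeline with spaCy's rule-based `sentencizer` component instead of `en_core_web_sm`, so that no trained model is needed; the names `nlp_demo` and `demo_doc` are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# so the sketch runs without downloading a trained model.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_doc = nlp_demo("This is the first sentence. This is another sentence.")

# Indexing tokens works because a Doc supports integer indexing
print(demo_doc[1])

# Indexing doc.sents fails because a generator is not subscriptable
try:
    print(demo_doc.sents[1])
    indexing_failed = False
except TypeError as err:
    indexing_failed = True
    print("TypeError:", err)
```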

However, you *can* build a sentence collection by running `doc.sents` and saving the result to a list:

In [4]:
doc_sents = [sent for sent in doc.sents]
doc_sents

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

<font color=green>**NOTE**: `list(doc.sents)` also works. We show a list comprehension as it allows you to pass in conditionals.</font>
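
As a rough illustration of passing a conditional to the comprehension, the sketch below keeps only sentences longer than three tokens. The blank pipeline with a `sentencizer`, the sample text, and the threshold are all assumptions chosen for the example.

```python
import spacy

# Assumption: blank English pipeline with the rule-based sentencizer
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_doc = nlp_demo("This is the first sentence. Short one. This is the last sentence.")

# Keep only sentences with more than three tokens (arbitrary threshold)
long_sents = [sent for sent in demo_doc.sents if len(sent) > 3]

for sent in long_sents:
    print(sent.text)
```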

In [5]:
doc_sents[1]

This is another sentence.

### `sents` are Spans
At first glance it looks like each `sent` holds its own copy of text from the original Doc object. In fact, each one is just a Span: a view into the Doc defined by start and end token indices.

In [6]:
type(doc_sents[1])

spacy.tokens.span.Span

In [7]:
print(doc_sents[1].start, doc_sents[1].end)

6 11
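
One way to convince yourself that a sentence is only a window onto the Doc is to rebuild it by slicing the Doc with the Span's own `start` and `end` values. This is a small sketch under the assumption of a blank English pipeline with a `sentencizer`; the variable names are illustrative.

```python
import spacy

# Assumption: blank English pipeline with the rule-based sentencizer
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_doc = nlp_demo(
    "This is the first sentence. This is another sentence. This is the last sentence."
)
second = list(demo_doc.sents)[1]

# Slicing the Doc with the Span's token indices yields the same tokens
same_span = demo_doc[second.start:second.end]

print(second.start, second.end)
print(same_span.text == second.text)
print(second.doc is demo_doc)  # the Span points back at its parent Doc
```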


Let's add a semicolon rule to the existing segmentation behavior. That is, whenever a semicolon is encountered, the next token should start a new sentence. First, let's look at how spaCy segments a sentence containing a semicolon by default.

In [8]:
# SPACY'S DEFAULT BEHAVIOR
doc = nlp(u'"Management is doing the right thing; leadership is doing the right things." -Peter drucker')

In [9]:
doc.text

'"Management is doing the right thing; leadership is doing the right things." -Peter drucker'

In [10]:
for sent in doc.sents:
    print(sent)
    print('\n')

"Management is doing the right thing; leadership is doing the right things."


-Peter drucker




## Adding Rules
By default, the `en_core_web_sm` pipeline determines sentence boundaries from the dependency parse. We can add rules of our own, but they have to be added to the pipeline *before* the Doc object is created, and they have to run ahead of the parser so that it respects the segment start tokens they set:

In [19]:
from spacy.language import Language

In [22]:
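# ADD A SEGMENTATION RULE
@Language.component('set_custom_boundaries')
def set_custom_boundaries(doc):
    # Stop at the second-to-last token so doc[token.i+1] always exists
    for token in doc[:-1]:
        if token.text == ';':
            # The token after a semicolon starts a new sentence
            doc[token.i+1].is_sent_start = True
    return doc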

nlp.add_pipe('set_custom_boundaries', before='parser')

nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

<font color=green>The new rule has to run before the document is parsed. Here we can pass either `before='parser'` or `first=True`.</font>
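
The effect of the placement arguments can be checked through `nlp.pipe_names`. The sketch below is illustrative only: it assumes a blank English pipeline in which a `sentencizer` stands in for the parser, and it registers a hypothetical do-nothing component named `demo_noop_boundaries`.

```python
import spacy
from spacy.language import Language

# Hypothetical placeholder component, used only to show placement
@Language.component("demo_noop_boundaries")
def demo_noop_boundaries(doc):
    return doc

def build_pipe_names(**placement):
    """Build a fresh pipeline and return its component order."""
    nlp_demo = spacy.blank("en")
    nlp_demo.add_pipe("sentencizer")  # stand-in for the parser (assumption)
    nlp_demo.add_pipe("demo_noop_boundaries", **placement)
    return nlp_demo.pipe_names

names_before = build_pipe_names(before="sentencizer")
names_first = build_pipe_names(first=True)
names_default = build_pipe_names()  # no argument: appended at the end

print(names_before)
print(names_first)
print(names_default)
```

In this two-component sketch `before=` and `first=True` give the same order. In a longer pipeline such as `en_core_web_sm`, `first=True` would place the component ahead of every other component rather than directly before the parser.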

In [25]:
doc4 = nlp(u'"Management is doing the right thing; leadership is doing the right things." -Peter drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing the right thing;
leadership is doing the right things."
-Peter drucker
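
To see which tokens were marked as sentence starts, you can inspect `token.is_sent_start` directly. The following is a sketch under different assumptions from the cell above: it uses a blank English pipeline with no parser, lets the rule-based `sentencizer` set the initial boundaries, and then runs a hypothetical `demo_semicolon_boundaries` component after it.

```python
import spacy
from spacy.language import Language

# Hypothetical component name; same semicolon rule as in the text
@Language.component("demo_semicolon_boundaries")
def demo_semicolon_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

# Assumption: no parser here, so the rule runs after the sentencizer
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")
nlp_demo.add_pipe("demo_semicolon_boundaries")

demo_doc = nlp_demo(
    "Management is doing the right thing; leadership is doing the right things."
)

# Tokens flagged as the first token of a sentence
starts = [token.text for token in demo_doc if token.is_sent_start]
sent_texts = [sent.text for sent in demo_doc.sents]

print(starts)
print(sent_texts)
```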


## Changing the Rules
In some cases we want to *replace* spaCy's default segmentation behavior with our own set of rules. In this section we'll see how the default pipeline breaks on periods. We'll then add a component that breaks on linebreaks instead.

In [67]:
nlp = spacy.load('en_core_web_sm')  # reset to the original

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']


In [30]:
from spacy.language import Language

In [68]:
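# CHANGE SEGMENTATION RULES
@Language.component('split_on_newlines')
def split_on_newlines(doc):
    # Stop at the second-to-last token so doc[token.i+1] always exists
    for token in doc[:-1]:
        if token.text == '\n':
            # The token after a single-newline token starts a new sentence
            # (this exact match does not cover a token such as '\n\n')
            doc[token.i+1].is_sent_start = True
        elif token.text == '.':
            # The token after a period is marked as not starting a sentence
            doc[token.i+1].is_sent_start = False
    return doc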

nlp.add_pipe('split_on_newlines', before='parser')

<function __main__.split_on_newlines(doc)>

In [69]:
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']
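
Note that `'\n\n'` is a single token, so a comparison against `'\n'` alone does not match it. A possible variant, sketched here under the assumption of a blank English pipeline with no parser or sentencizer, marks a sentence start after any token that contains a newline and explicitly clears the flag everywhere else. The component name `demo_newline_boundaries` is hypothetical.

```python
import spacy
from spacy.language import Language

# Hypothetical variant: split after any newline-containing token only
@Language.component("demo_newline_boundaries")
def demo_newline_boundaries(doc):
    for token in doc[:-1]:
        # True after tokens such as '\n' or '\n\n', False after all others
        doc[token.i + 1].is_sent_start = "\n" in token.text
    return doc

# Assumption: blank pipeline, so no other component sets boundaries
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("demo_newline_boundaries")

demo_doc = nlp_demo(
    "This is a sentence. This is another.\n\nThis is a \nthird sentence."
)

# Strip surrounding whitespace so newline tokens do not clutter the output
demo_sents = [sent.text.strip() for sent in demo_doc.sents]

for text in demo_sents:
    print(repr(text))
```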


In [70]:
for sentence in doc.sents:
    print(sentence)

This is a sentence.
This is another.


This is a 

third sentence.
