## **Sentence Segmentation**

Sentence segmentation is the process of deciding where sentences start and end, that is, dividing a paragraph into its individual sentences. In Python, we can implement this part of NLP with the spaCy library, which is built for Natural Language Processing.



In [2]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [3]:
# From spacy basics:
doc = nlp(u'This is the first sentence. This is another sentece. This is the last sentence. ')

In [4]:
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentece.
This is the last sentence.


### `Doc.sents` is a generator
It is important to note that `doc.sents` is a *generator*: it yields sentences one at a time rather than returning an indexable sequence. This means that, where you could print the second Doc token with `print(doc[1])`, you can't get the "second Doc sentence" with `print(doc.sents[1])`:

In [7]:
print(doc[1])

is


In [8]:
print(doc.sents[1])

TypeError: 'generator' object is not subscriptable

However, you *can* build a sentence collection by running `doc.sents` and saving the result to a list:

In [9]:
doc_sents = [sent for sent in doc.sents]
doc_sents

[This is the first sentence.,
 This is another sentece.,
 This is the last sentence.]

<font color=green>**NOTE**: `list(doc.sents)` also works. We show a list comprehension as it allows you to pass in conditionals.</font>

In [10]:
# Now you can access individual sentences:
print(doc_sents[1])

This is another sentece.
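
When only one sentence is needed, building the full list is optional. The sketch below uses `itertools.islice` to pull the n-th item out of any generator. The helper name `nth` is hypothetical, and `fake_sents` is a plain-Python stand-in for `doc.sents` so that the snippet runs without loading a model.

```python
from itertools import islice

def nth(iterable, n, default=None):
    """Return the n-th item (0-based) of any iterable, or `default` if it is too short."""
    return next(islice(iterable, n, None), default)

def fake_sents():
    # Stand-in for doc.sents: yields one sentence at a time, like a generator of Spans
    yield "This is the first sentence."
    yield "This is another sentece."
    yield "This is the last sentence."

second = nth(fake_sents(), 1)    # skips item 0, returns item 1
missing = nth(fake_sents(), 10)  # generator is exhausted, so the default is returned

print(second)
print(missing)
```

With a real Doc, `nth(doc.sents, 1)` should give the same Span as `doc_sents[1]`, without keeping the other sentences in memory.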


### `sents` are Spans
At first glance it looks like each `sent` contains text from the original Doc object. In fact they're just Spans with start and end token pointers.


In [11]:
type(doc_sents[1])

spacy.tokens.span.Span

In [12]:
print(doc_sents[1].start, doc_sents[1].end)

6 11


## Adding Rules
By default, spaCy's trained pipelines rely on the dependency parse to determine sentence boundaries (the separate rule-based `sentencizer` component uses end-of-sentence punctuation instead). We can add rules of our own, but they have to run *before* the Doc is parsed, as that is where the sentence start tokens are set:


In [13]:
# Sentence start tokens are set while the text passes through the nlp pipeline
doc2 = nlp(u'This is a sentence. This ')

In [14]:
for token in doc2:
    print(token.is_sent_start, token.text)

True This
False is
False a
False sentence
False .
True This


<font color=green>Notice we haven't run `doc2.sents`, and yet `token.is_sent_start` was set to True on two tokens in the Doc.</font>
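
Because the flags are already stored on the tokens, sentence spans can be derived from them directly. The following is a minimal sketch of that idea using plain lists rather than spaCy objects. `spans_from_flags` is a hypothetical helper, and the token and flag lists are copied from the output above.

```python
def spans_from_flags(is_sent_start):
    """Turn per-token sentence-start flags into (start, end) token offsets.

    Assumes the first token is flagged; tokens before the first flag are ignored.
    """
    starts = [i for i, flag in enumerate(is_sent_start) if flag]
    # Each sentence ends where the next one starts; the last ends at the final token
    ends = starts[1:] + [len(is_sent_start)]
    return list(zip(starts, ends))

tokens = ["This", "is", "a", "sentence", ".", "This"]
flags = [True, False, False, False, False, True]

spans = spans_from_flags(flags)
print(spans)
for start, end in spans:
    print(tokens[start:end])
```

The offsets follow the same half-open convention as the `start` and `end` attributes of a Span, so `tokens[start:end]` covers exactly one sentence.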

Let's add a semicolon to the existing segmentation rules. That is, whenever the pipeline encounters a semicolon, the next token should start a new sentence.



In [15]:
# spaCy's default behavior
doc3 = nlp(u'" Management is doing things right; leadership is doing the right things." -Peter Drucker')

In [16]:
for sent in doc3.sents:
    print(sent)

" Management is doing things right; leadership is doing the right things."
-Peter Drucker


In [17]:
# Add a new rule to the pipeline.
# In spaCy 3 the function has to be registered as a named component.
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token that follows every semicolon as a sentence start
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

In [20]:
nlp.add_pipe("set_custom_boundaries", before="parser")


In [21]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

<font color=green>The new rule has to run before the document is parsed. Here we can either pass the argument `before='parser'` or `first=True`.</font>


In [22]:
# Re-run the Doc object creation:
doc4 = nlp(u'" Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
    print(sent)

" Management is doing things right;
leadership is doing the right things."
-Peter Drucker


In [23]:
# doc3 was created before the rule was added, so its segmentation is unchanged
for sent in doc3.sents:
    print(sent)

" Management is doing things right; leadership is doing the right things."
-Peter Drucker


### Why not change the token directly?
Why not simply set the `.is_sent_start` value to True on existing tokens?


In [24]:
# Find the token we want to change:
doc3[7]

leadership

In [25]:
doc3[7].is_sent_start = True

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

<font color=green>spaCy refuses to change the sentence-start flag after the document is parsed, to prevent inconsistencies in the data.</font>

## Changing the Rules
In some cases we want to *replace* spaCy's default sentence segmentation with our own set of rules. In this section we'll see how the default pipeline breaks on periods. We'll then change this behavior with a component that breaks on line breaks.


In [26]:
nlp = spacy.load('en_core_web_sm') #reset to the original

In [27]:
mystring = u"This is a asentence. This is another .\n\nThis is a \nthird sntence."

In [28]:
# spaCy's default behavior:
doc = nlp(mystring)

In [29]:
for sent in doc.sents:
  print([token.text for token in sent])

['This', 'is', 'a', 'asentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sntence', '.']


In [32]:
#changing the rules
from spacy.language import Language

In [34]:
@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i+1].is_sent_start = True
    return doc

In [35]:
nlp.add_pipe("custom_sentencizer", before="parser")

In [36]:
doc = nlp("This is a sentence.\nThis is a new one.")
for sent in doc.sents:
    print(sent.text)

This is a sentence.

This is a new one.


In [37]:
def split_on_newlines(doc):
  start = 0
  seen_newline = False
  for word in doc:
    if seen_newline:
      yield doc[start:word.i]
      start = word.i
      seen_newline = False
    elif word.text.startswith('\n'): #handles multiple occurrence
      seen_newline = True
  yield doc[start:]  #handles the last group of tokens

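<font color=green>The function `split_on_newlines` can be named anything we want. In spaCy 2.x tutorials it was passed to a `SentenceSegmenter` that had to be registered under the name `sbd`; that class is not part of spaCy 3, so here the function is not added to the pipeline and the output below still comes from `custom_sentencizer`.</font>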

In [41]:
doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'asentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third']
['sntence', '.']
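
The output above still contains boundaries chosen by the parser, because `custom_sentencizer` only adds sentence-start flags and leaves the rest of the decision to the parse. To see what a purely newline-based split would look like, the sketch below ports the logic of `split_on_newlines` to plain token strings. The helper name `split_tokens_on_newlines` is hypothetical, and the token list is copied from the printed output rather than produced by spaCy.

```python
def split_tokens_on_newlines(tokens):
    """Start a new group at the token that follows a newline token."""
    start = 0
    seen_newline = False
    for i, text in enumerate(tokens):
        if seen_newline:
            # The previous token was a newline: close the current group here
            yield tokens[start:i]
            start = i
            seen_newline = False
        elif text.startswith("\n"):  # matches "\n" as well as "\n\n"
            seen_newline = True
    yield tokens[start:]  # the last group of tokens

# Tokens as printed above for `mystring` (an assumption: copied by hand, not computed)
tokens = ["This", "is", "a", "asentence", ".", "This", "is", "another", ".", "\n\n",
          "This", "is", "a", "\n", "third", "sntence", "."]

groups = list(split_tokens_on_newlines(tokens))
for group in groups:
    print(group)
```

Under this strategy periods no longer end a sentence: the first two sentences stay in one group, and only the line breaks produce new groups.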
