# **Sentence Segmentation**
Sentence segmentation is the NLP task of deciding where each sentence in a text starts and ends, that is, dividing a paragraph into its individual sentences. The idea looks simple: in English and some other languages, we can split the text whenever we see sentence-ending punctuation such as a period, question mark, or exclamation mark. In practice, abbreviations and unusual delimiters complicate this, as the examples below show.
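
To make the punctuation rule concrete, here is a minimal sketch in plain Python. The helper `naive_split` is a hypothetical name introduced for illustration and is not part of spaCy; it splits after `.`, `!`, or `?` whenever whitespace follows. The second input shows where such a rule goes wrong: the period that ends an abbreviation is mistaken for the end of a sentence.

```python
import re

def naive_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

simple = "This is a sentence. This is second sentence. This is last sentence."
tricky = "This is a sentence. This is second U.K. sentence. This is last sentence."

simple_sents = naive_split(simple)  # three sentences, as intended
tricky_sents = naive_split(tricky)  # four pieces: "U.K." is cut off from "sentence."

for piece in tricky_sents:
    print(piece)
```

A segmenter that looks at more than the punctuation character itself is what avoids this kind of false split.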

In [1]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [2]:
s1 = "This is a sentence. This is second sentence. This is last sentence."
doc1 = nlp(s1)

In [3]:
doc1.sents

<generator at 0x2171f721580>

In [4]:
for sent in doc1.sents:
  print(sent.text)

This is a sentence.
This is second sentence.
This is last sentence.


In [5]:
s3 = "This is a sentence. This is second U.K. sentence. This is last sentence."
doc3=nlp(s3)

In [6]:
for sent in doc3.sents:
  print(sent.text)

This is a sentence.
This is second U.K. sentence.
This is last sentence.
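
Each item yielded by `doc.sents` is a `Span`, so sentences can be collected into a list and inspected by token offset. The sketch below assumes a blank English pipeline with spaCy's rule-based `sentencizer` component rather than `en_core_web_sm`, purely so that it runs without a model download; its segmentation rules therefore differ from the parser-based results shown above. The names `nlp_rules` and `doc_rules` are chosen for this illustration.

```python
import spacy

# Assumption: blank pipeline + rule-based sentencizer instead of en_core_web_sm
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules(
    "This is a sentence. This is second sentence. This is last sentence."
)

# doc.sents is a generator, so materialize it to count or index sentences
sentences = list(doc_rules.sents)

for sent in sentences:
    # start and end are token offsets into the Doc (end is exclusive)
    print(sent.start, sent.end, sent.text)
```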


### **Custom Pipeline**

In [7]:
s2="This is. A sentence. | This is. Another sentence."
doc2=nlp(s2)

In [8]:
for sent in doc2.sents:
  print(sent.text)

This is.
A sentence.
| This is.
Another sentence.


In [9]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [10]:
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-2]):
        # Define sentence start if pipe + titlecase token
        if token.text == "|" and doc[i + 1].is_title:
            doc[i + 1].is_sent_start = True
        else:
            # Explicitly set sentence start to False otherwise, to tell
            # the parser to leave those tokens alone
            doc[i + 1].is_sent_start = False
    return doc

In [11]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser") 

<function __main__.custom_sentencizer(doc)>

In [12]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'custom_sentencizer',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [13]:
doc2=nlp(s2)
for sent in doc2.sents:
  print(sent.text)

This is. A sentence. |
This is. Another sentence.
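
The result above keeps the `|` at the end of the first sentence because the component marks the token after the pipe as a sentence start, not the pipe itself. The following sketch mirrors that rule on a plain list of strings, without spaCy; `group_by_pipe` is a hypothetical helper, and `str.istitle()` stands in for spaCy's `token.is_title`.

```python
def group_by_pipe(tokens):
    """Hypothetical helper: start a new group at a titlecase token that follows '|'."""
    groups = [[tokens[0]]] if tokens else []
    for prev, current in zip(tokens, tokens[1:]):
        if prev == "|" and current.istitle():
            groups.append([current])    # boundary: a new sentence starts here
        else:
            groups[-1].append(current)  # no boundary: stay in the same sentence
    return [" ".join(group) for group in groups]

tokens = ["This", "is", ".", "A", "sentence", ".", "|",
          "This", "is", ".", "Another", "sentence", "."]
pipe_groups = group_by_pipe(tokens)
print(pipe_groups)
```

Because every other position is treated as "not a boundary", the periods inside each group no longer split anything, which matches the two-sentence output of the component.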


### **Another Example of Custom Pipeline**

In [14]:
s2 = "This is a sentence; This is second sentence; This is last sentence."
doc2 = nlp(s2)

In [15]:
for sent in doc2.sents:
  print(sent.text)

This is a sentence; This is second sentence; This is last sentence.


In [16]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'custom_sentencizer',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [17]:
import spacy
from spacy.language import Language

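@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Skip the last token: there is no following token to mark
    for token in doc[:-1]:
        if token.text == ";":
            print(token.i)  # index of each semicolon found
            # Start a new sentence at the token after the semicolon
            doc[token.i + 1].is_sent_start = True
    return doc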

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("set_custom_boundaries", before="parser") 

<function __main__.set_custom_boundaries(doc)>

In [18]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [19]:
doc2=nlp(s2)
for sent in doc2.sents:
  print(sent.text)

4
9
This is a sentence;
This is second sentence;
This is last sentence.
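
The component above only ever sets `is_sent_start = True`, so the parser remains free to add further boundaries of its own. As a variation, the sketch below applies a similar rule on a blank English pipeline with no parser, so the custom component is the only source of boundaries. The component name `semicolon_boundaries_demo` and the variables `nlp_blank` and `doc_blank` are made up for this illustration.

```python
import spacy
from spacy.language import Language

@Language.component("semicolon_boundaries_demo")  # hypothetical component name
def semicolon_boundaries_demo(doc):
    for token in doc[:-1]:
        # Mark every following token explicitly: True after ';', False otherwise
        doc[token.i + 1].is_sent_start = token.text == ";"
    return doc

# Assumption: a blank pipeline, so no parser adds boundaries of its own
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("semicolon_boundaries_demo")

doc_blank = nlp_blank(
    "This is a sentence; This is second sentence; This is last sentence."
)
semicolon_sents = [sent.text for sent in doc_blank.sents]
print(semicolon_sents)
```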
