In [None]:
s1 = "This is a sentence. This is a second sentence. This is the last Sentence."
s2 = "This is a sentence; This is a second sentence; This is the last sentence."

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
doc1 = nlp(s1)

In [4]:
print(doc1)

This is a sentence. This is a second sentence. This is the last Sentence.


Doc.sents is a generator
Note that doc.sents is a generator: it yields sentences one at a time as you iterate and does not support indexing. So while you can print the second token of a Doc with print(doc[1]), print(doc.sents[1]) raises a TypeError. To pick a sentence by position, build a list first with list(doc.sents). Iterating over the generator works as expected:

In [5]:
for sent in doc1.sents:
    print(sent.text)

This is a sentence.
This is a second sentence.
This is the last Sentence.
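
The indexing limitation can be checked directly. The sketch below is a minimal illustration and not part of the original notebook: it assumes a blank English pipeline with spaCy's rule-based sentencizer (so no model download is needed), and the names demo_nlp, demo_doc and demo_sentences are made up for this example.

```python
import spacy

# Assumption: a blank pipeline plus the rule-based sentencizer stands in
# for the en_core_web_sm pipeline loaded earlier.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")
demo_doc = demo_nlp("This is a sentence. This is a second sentence. This is the last Sentence.")

# A generator cannot be indexed, so this attempt raises a TypeError.
try:
    demo_doc.sents[1]
    indexing_failed = False
except TypeError:
    indexing_failed = True
print(indexing_failed)

# Materialize the sentences in a list to access them by position.
demo_sentences = list(demo_doc.sents)
print(demo_sentences[1].text)
```

Building the list costs one full pass over the Doc, which is usually fine for short texts.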


In [6]:
s3 = "This is a sentence. This is a second U.K. sentence. This is the last Sentence."

In [7]:
doc3 = nlp(s3)

In [8]:
for sent in doc3.sents:
    print(sent.text)

This is a sentence.
This is a second U.K. sentence.
This is the last Sentence.


In [9]:
doc2 = nlp(s2)
for sent in doc2.sents:
    print(sent.text)

This is a sentence; This is a second sentence; This is the last sentence.


Here spaCy returns s2 as a single sentence: the parser does not treat
the semicolons as sentence boundaries.

In [10]:
# Set a custom sentence boundary. This is similar to Excel's Text to Columns feature,
# where a delimiter decides where to split, but applied to a more complex problem.

In [11]:
# New in spaCy v3.0: the @Language.component decorator turns a simple function into a pipeline component.
# It takes at least one argument, the name of the component factory. Use this name to add an instance of the component to the pipeline.
# The name can also be listed in the pipeline config, so pipelines that use the component can be saved, loaded and trained.
from spacy.language import Language

# Custom rule: the token after each semicolon starts a new sentence
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Skip the last token: it has no following token, so doc[token.i+1] would raise an IndexError
    for token in doc[:-1]:
        if token.text == ';':
            print(token.i)  # index of each semicolon found
            doc[token.i+1].is_sent_start = True  # mark the next token as a sentence start
    return doc
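
Why the loop stops one token early can be seen on a plain Python list. The following is a minimal sketch with no spaCy dependency; mark_sentence_starts and the token list are hypothetical stand-ins for the component and a tokenized Doc.

```python
def mark_sentence_starts(tokens, delimiter=";"):
    """Return one flag per token: True where a new sentence begins."""
    starts = [False] * len(tokens)
    if tokens:
        starts[0] = True  # the first token always opens a sentence
    # tokens[:-1] skips the final token, which has no successor to flag;
    # without the slice, a trailing delimiter would index past the end.
    for i, text in enumerate(tokens[:-1]):
        if text == delimiter:
            starts[i + 1] = True
    return starts

tokens = ["One", "clause", ";", "another", "clause", ";"]
flags = mark_sentence_starts(tokens)
print(flags)  # [True, False, False, True, False, False]
```

The trailing semicolon is the case the slice protects against: it is never inspected, so no out-of-range index is touched.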

In [12]:
#default nlp pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [13]:
# Add the component before the parser so the parser respects the boundaries it sets
nlp.add_pipe("set_custom_boundaries", before='parser')

<function __main__.set_custom_boundaries(doc)>

In [14]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_custom_boundaries',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer']

In [15]:
doc_2 = nlp(s2)
for sent in doc_2.sents:
    print(sent)

4
10
This is a sentence;
This is a second sentence;
This is the last sentence.
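
For a purely punctuation-based split, spaCy's built-in sentencizer component accepts a punct_chars setting, which may be simpler than a custom component. The sketch below is an alternative under stated assumptions, not the approach used above: it runs on a blank English pipeline rather than en_core_web_sm, and the names rule_nlp, rule_doc and rule_sentences are made up for this example.

```python
import spacy

# Assumption: a blank pipeline, so the sentencizer is the only component.
rule_nlp = spacy.blank("en")
# Treat semicolons as sentence-ending punctuation alongside the usual marks.
rule_nlp.add_pipe("sentencizer", config={"punct_chars": [".", "?", "!", ";"]})

rule_doc = rule_nlp("This is a sentence; This is a second sentence; This is the last sentence.")
rule_sentences = [sent.text for sent in rule_doc.sents]
for text in rule_sentences:
    print(text)
```

Because this pipeline has no parser, sentence boundaries come only from the listed characters, so an abbreviation such as "U.K." could be split where the parser-based pipeline above kept it intact.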
