#### About
Sentence segmentation is a core task in natural language processing: it splits a text into sentences, using either built-in or custom delimiters.

spaCy has built-in functionality for this task, which can be tweaked for custom use cases.
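
Before turning to spaCy, the idea of delimiter-based splitting can be sketched with the standard library alone. The helper name naive_split and its delimiters parameter are made up for this illustration; this is a rough sketch of the concept, not how spaCy works internally.

```python
import re


def naive_split(text, delimiters=".!?"):
    """Split text after any delimiter character that is followed by whitespace."""
    # Lookbehind keeps the delimiter attached to the sentence it ends.
    pattern = r"(?<=[" + re.escape(delimiters) + r"])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]


default_parts = naive_split(
    "This is Suraj. I am here to illustrate the first demo of sentence segmentation."
)
custom_parts = naive_split(
    "Often matters; It is because that's what defines our character.",
    delimiters=";.",
)
print(default_parts)
print(custom_parts)
```

A splitter like this breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a library that takes context into account.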

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')


In [2]:
text = "This is Suraj. I am here to illustrate the first demo of sentence segmentation."
doc = nlp(text)
for sent in doc.sents:
    print(sent)



This is Suraj.
I am here to illustrate the first demo of sentence segmentation.


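#### The text was split into two sentences at the full stop. Note that the default pipeline predicts sentence boundaries with its dependency parser rather than a fixed delimiter.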
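
For comparison, spaCy also ships a purely rule-based sentencizer component that splits on punctuation. The sketch below assumes a blank English pipeline (tokenizer only), so no trained model is needed; the variable names are chosen for this example.

```python
import spacy

# Assumption: a blank pipeline is enough here, since the sentencizer
# only needs tokens, not a tagger or parser.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")  # splits after ".", "!", "?" and similar punctuation

text = "This is Suraj. I am here to illustrate the first demo of sentence segmentation."
rule_based_sentences = [sent.text for sent in nlp_rules(text).sents]
print(rule_based_sentences)
```

On simple text like this, both approaches should agree; the parser-based default tends to matter more on text with unusual punctuation.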


In [4]:
sentences = []
for sent in doc.sents:
    sentences.append(sent)

# sent still refers to the last sentence from the loop, so sent[1] is its second token.
print(type(sent[1]))

<class 'spacy.tokens.token.Token'>


In [6]:
# Each sentence is a Span, so it exposes its start and end token indices in the Doc.
print(sentences[1].start, sentences[1].end)

4 16
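
The start and end values are token indices, with end being exclusive. A small sketch of how they relate to the Doc and to character offsets follows; it assumes a blank English pipeline with the rule-based sentencizer instead of the trained model used above.

```python
import spacy

nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

text = "This is Suraj. I am here to illustrate the first demo of sentence segmentation."
doc = nlp_rules(text)
second = list(doc.sents)[1]

# start/end are token indices into the Doc (end is exclusive).
print(second.start, second.end)
# start_char/end_char are character offsets into the original string.
print(second.start_char, second.end_char)

# Slicing the Doc by tokens or the string by characters recovers the same sentence.
same_by_tokens = doc[second.start:second.end].text == second.text
same_by_chars = text[second.start_char:second.end_char] == second.text
print(same_by_tokens, same_by_chars)
```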


However, we can also segment text based on custom delimiters. Let's have a look.

In [8]:
# Let's look at a different example.
text2 = "How we do things when no one is looking at it, Often matters; It is because that's what defines our character."
doc2 = nlp(text2)
sentences2 = []
for sent in doc2.sents:
    sentences2.append(sent)
    print(sent)


How we do things when no one is looking at it, Often matters; It is because that's what defines our character.


In [9]:
len(sentences2)

1

Clearly, the built-in segmentation kept the second text as a single sentence, ignoring the ; and , delimiters.
Let's build custom rules for it. See the <a href="https://spacy.io/usage/processing-pipelines#custom-components">spaCy documentation on custom pipeline components</a>.

In [17]:
from spacy.language import Language

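# The decorator registers the function as a pipeline component named "set_rules".
@Language.component("set_rules")
def set_rules(doc):
    # Skip the last token, since no token follows it.
    for token in doc[:-1]:
        if token.text in (";", ","):
            # Mark the token after the delimiter as the start of a new sentence.
            doc[token.i + 1].is_sent_start = True
    return doc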


In [19]:
# Add the component before the parser, so the parser respects the boundaries set here.
nlp.add_pipe("set_rules", before='parser')

<function __main__.set_rules(doc)>

In [20]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'set_rules',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [22]:
# re-run the doc object creation
doc2 = nlp(text2)
sentences2_modified = []
for sent in doc2.sents:
    sentences2_modified.append(sent)
    print(sent)

How we do things when no one is looking at it,
Often matters;
It is because that's what defines our character.


In [24]:
len(sentences2_modified)

3

Conclusion: the text is now segmented according to our custom rules.
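
If the delimiters should be configurable rather than hard-coded, one option is to register a factory instead of a plain component. The sketch below is only an illustration: the factory name delimiter_splitter and the delimiters setting are hypothetical, and it runs on a blank English pipeline so that no trained model is required.

```python
from typing import List

import spacy
from spacy.language import Language


# Hypothetical factory name and setting, chosen for this sketch only.
@Language.factory("delimiter_splitter", default_config={"delimiters": [";", ","]})
def create_delimiter_splitter(nlp, name, delimiters: List[str]):
    delimiter_set = set(delimiters)

    def split_on_delimiters(doc):
        # Skip the last token, since no token follows it.
        for token in doc[:-1]:
            if token.text in delimiter_set:
                doc[token.i + 1].is_sent_start = True
        return doc

    return split_on_delimiters


nlp_custom = spacy.blank("en")
# Override the default setting: split on ";" only, not on ",".
nlp_custom.add_pipe("delimiter_splitter", config={"delimiters": [";"]})

text2 = "How we do things when no one is looking at it, Often matters; It is because that's what defines our character."
custom_sentences = [sent.text for sent in nlp_custom(text2).sents]
print(custom_sentences)
```

With a trained pipeline, such a component would again be added before the parser, as in the cell above.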

Let's look at one more example, this time splitting on newlines with the rule-based sentencizer instead of the parser.

In [30]:
text3 = "Hi, This is Suraj. \n We are back with yet another illustration on the topic. \n"


In [31]:
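# Reload the model without the parser and rely on the rule-based sentencizer instead.
nlp = spacy.load('en_core_web_sm', exclude=["parser"])

# punct_chars replaces the default sentence-ending characters, here with a newline only.
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)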

for sent in nlp(text3).sents:
    print("next sentence")
    print(sent)

next sentence
Hi, This is Suraj. 
 We are back with yet another illustration on the topic. 
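
Note that "next sentence" is printed only once above, so the text was not actually split at the newline. A likely explanation is that the tokenizer bundles the newline with the following space into one whitespace token ("\n "), which does not exactly match the "\n" entry in punct_chars. One possible workaround, sketched here with a hypothetical component name and a blank English pipeline, is a custom rule that starts a new sentence after any token containing a newline.

```python
import spacy
from spacy.language import Language


# Hypothetical component name, used only in this sketch.
@Language.component("split_on_newline")
def split_on_newline(doc):
    for token in doc[:-1]:
        # Whitespace tokens may bundle the newline with spaces, e.g. "\n ".
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc


nlp_newline = spacy.blank("en")
nlp_newline.add_pipe("split_on_newline")

text3 = "Hi, This is Suraj. \n We are back with yet another illustration on the topic. \n"
doc3 = nlp_newline(text3)

# Inspect the whitespace tokens to see what the tokenizer produced.
whitespace_tokens = [token.text for token in doc3 if token.is_space]
print(whitespace_tokens)

# Strip the surrounding whitespace that stays attached to each sentence span.
newline_sentences = [sent.text.strip() for sent in doc3.sents]
print(newline_sentences)
```

Printing the whitespace tokens first is a quick way to check what a delimiter rule has to match before writing it.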



For sentence segmentation across multiple languages, see <a href="https://github.com/nipunsadvilkar/pySBD">pySBD</a>.