## Sentence Segmentation

**Sentence Segmentation** is the process of dividing a text into its individual sentences. This is a critical first step in many NLP tasks because the meaning of words often depends on their context within a single sentence.

In spaCy, sentence boundaries are set by the pipeline (in `en_core_web_sm`, by the dependency parser) and exposed through the `sents` property of a `Doc` object, which returns a generator that yields the detected sentences one at a time.

In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp(
    "a quick brown fox jumps over the lazy dog. The five boxing wizards jump quickly. Sphinx of black quartz, judge my vow."
)

In [5]:
for sent in doc.sents:
    print(sent)

a quick brown fox jumps over the lazy dog.
The five boxing wizards jump quickly.
Sphinx of black quartz, judge my vow.


In [6]:
type(doc.sents)

_cython_3_2_1.generator

A generator in Python is a special type of iterable that produces items one at a time strictly when asked, rather than computing them all at once and storing them in memory (like a list does).

Think of it as lazy evaluation:
- **A List** (`[]`) is like downloading an entire movie to your hard drive before watching it. It takes up memory immediately.
- **A Generator** is like **streaming** a movie. It only loads the specific scene you are watching right now.
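
The difference can be seen in plain Python, without spaCy. The sketch below is only an illustration: `squares` is a hypothetical helper, not part of any library. It shows that a generator produces values on request and stays small, while a list holds every value at once.

```python
import sys


def squares(limit):
    """Yield the squares 0, 1, 4, ... one value at a time (illustrative helper)."""
    for n in range(limit):
        yield n * n


as_list = [n * n for n in range(100_000)]  # every value is computed and stored now
as_gen = squares(100_000)                  # nothing is computed yet

print(next(as_gen))  # 0
print(next(as_gen))  # 1

# The generator object stays small regardless of how many items it can produce.
print(sys.getsizeof(as_gen) < sys.getsizeof(as_list))  # True
```

Each call to `next` resumes the function body until the next `yield`, which is why only the requested values are ever computed.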

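Why does **spaCy** use a generator for `doc.sents`?
Efficiency: each sentence is produced as a `Span` only when the loop asks for it, so no list of all sentences is built unless you request one.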

Because it is a generator, you cannot access items by index directly.

```python
# This raises a TypeError: the generator object is not subscriptable
first_sentence = doc.sents[0]
```

If you need a list (e.g., to access a specific index), you must explicitly convert it:

```python
# Convert the generator to a list (materializes every sentence in memory)
sentence_list = list(doc.sents)

# A list supports indexing
first_sentence = sentence_list[0]
```
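
If only the first sentence or the first few are needed, a full list is not required. The sketch below uses a hypothetical stand-in generator, `fake_sents`, in place of `doc.sents` so that it runs without a model; `next` and `itertools.islice` are standard Python.

```python
from itertools import islice


def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences lazily."""
    yield "A quick brown fox jumps over the lazy dog."
    yield "The five boxing wizards jump quickly."
    yield "Sphinx of black quartz, judge my vow."


# First sentence only: the generator stops after producing one item.
first_sentence = next(fake_sents())

# First two sentences: islice also stops early.
first_two = list(islice(fake_sents(), 2))

print(first_sentence)
print(first_two)
```

With a real `Doc`, the equivalent calls would be `next(doc.sents)` and `list(islice(doc.sents, 2))`. Each access to `doc.sents` starts a fresh generator, so iterating once does not exhaust it for later use.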


In [11]:
doc = nlp(
    '"The only way to do great work is to love what you do; if you haven\'t found it yet, keep looking." — Steve Jobs.'
)

In [12]:
doc.text

'"The only way to do great work is to love what you do; if you haven\'t found it yet, keep looking." — Steve Jobs.'

In [13]:
for sent in doc.sents:
    print(sent, "\n")

"The only way to do great work is to love what you do; if you haven't found it yet, keep looking." 

— Steve Jobs. 



In [None]:
# 1. Add a segmentation rule
from spacy.language import Language


@Language.component("semicolon_boundary")
def set_semicolon_boundary(doc):
    # Skip the last token: it has no following token to mark
    for token in doc[:-1]:
        # If the current token is a semicolon, mark the next token as a sentence start
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc


# Add the component before the dependency parser so the parser keeps the preset boundaries
if "semicolon_boundary" not in nlp.pipe_names:
    nlp.add_pipe("semicolon_boundary", before="parser")

In [None]:
# Test with the quote
doc = nlp(
    '"The only way to do great work is to love what you do; if you haven\'t found it yet, keep looking."'
)
for sent in doc.sents:
    print(sent)

"The only way to do great work is to love what you do;
if you haven't found it yet, keep looking."
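
The rule itself is simple enough to trace by hand. The sketch below is a plain-Python illustration of the idea, not spaCy's implementation: the token list and the `is_sent_start` flags are hypothetical stand-ins for a `Doc` and its token attributes.

```python
# Hypothetical stand-in for a tokenized Doc
tokens = ["Love", "what", "you", "do", ";", "keep", "looking", "."]

# One flag per token; True marks the first token of a sentence.
is_sent_start = [False] * len(tokens)
is_sent_start[0] = True

# Same rule as the component: the token after a semicolon opens a new sentence.
# The last token is skipped because nothing follows it.
for i, token in enumerate(tokens[:-1]):
    if token == ";":
        is_sent_start[i + 1] = True

# Group tokens into sentences wherever a start flag is set.
sentences = []
for token, starts in zip(tokens, is_sent_start):
    if starts:
        sentences.append([])
    sentences[-1].append(token)

print(sentences)
# [['Love', 'what', 'you', 'do', ';'], ['keep', 'looking', '.']]
```

The semicolon stays at the end of the first sentence, which matches the output of the spaCy component above.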


In [18]:
nlp.pipe_names

['tok2vec',
 'tagger',
 'semicolon_boundary',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner']

In [None]:
# 2. Change segmentation rules