
# **Sentence Segmentation**

---

## **1. Theory**

### **What is Sentence Segmentation?**

* **Definition**: Sentence segmentation is the process of splitting a text into **individual sentences**.
* It’s also called **sentence boundary detection**.
* Example:
  Text: *“I love NLP. It is fascinating. Let’s learn SpaCy and NLTK.”*
  Sentences:

  ```
  ["I love NLP.", "It is fascinating.", "Let's learn SpaCy and NLTK."]
  ```

---

### **Why is Sentence Segmentation important?**

* Prepares text for **downstream tasks** like:

  * **Tokenization** → Word-level processing
  * **POS tagging**
  * **NER**
  * **Sentiment analysis** (sentence-level sentiment)
  * **Text summarization** (sentence extraction)
* Helps **maintain context** when splitting large documents.
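
As a rough illustration of the last point, the sketch below packs whole sentences into size-limited chunks so that no chunk cuts a sentence in half. The names `split_sentences` and `chunk_by_sentence` and the `max_chars` budget are hypothetical, and the regex splitter is only a stand-in for a real segmenter such as NLTK or spaCy.

```python
import re

def split_sentences(text):
    # Deliberately simple splitter; a stand-in for NLTK or spaCy.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_sentence(text, max_chars):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "I love NLP. It is fascinating. Let's learn SpaCy and NLTK."
for chunk in chunk_by_sentence(text, max_chars=35):
    print(chunk)
# I love NLP. It is fascinating.
# Let's learn SpaCy and NLTK.
```

Splitting on a fixed character count instead would likely cut the third sentence mid-word, which is the loss of context that sentence-aware chunking avoids.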

---

### **Challenges**

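* Abbreviations: *“Dr. Smith is here.”* → Do not split after “Dr.”
* Decimal numbers: *“Price is $5.50. Amazing deal!”* → The period inside “5.50” is not a boundary, but the one after it is.
* Ellipses and repeated punctuation: *“Wait… what happened?”* → An ellipsis may or may not end a sentence.
* Quotations and dialogue: *“Hello,” he said. “How are you?”* → The boundary falls after the closing quotation mark, not directly after the “?”.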
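
The abbreviation and decimal cases can be reproduced with a naive splitter. The sketch below is illustrative only: `naive_split`, `abbreviation_aware_split`, and the small `ABBREVIATIONS` set are hypothetical names made up for this example, not part of NLTK or spaCy.

```python
import re

# Illustrative list only (an assumption); real text needs far more entries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        if not piece:
            continue
        candidate = f"{carry} {piece}" if carry else piece
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            # Probably not a real boundary: hold the text and keep reading.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text = "Dr. Smith is here. Price is $5.50. Amazing deal!"
print(naive_split(text))
# ['Dr.', 'Smith is here.', 'Price is $5.50.', 'Amazing deal!']
print(abbreviation_aware_split(text))
# ['Dr. Smith is here.', 'Price is $5.50.', 'Amazing deal!']
```

The naive version breaks after "Dr.", and the decimal point in `$5.50` survives only because no whitespace follows it. A hand-written abbreviation list does not scale well, which is one reason to prefer a library segmenter.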

---

## **2. Examples**

### **Sentence Segmentation with NLTK**

```python
import nltk
from nltk.tokenize import sent_tokenize

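# Download the Punkt sentence models (only needed once).
# Recent NLTK releases load 'punkt_tab' instead of 'punkt'; fetching both covers either case.
nltk.download('punkt')
nltk.download('punkt_tab')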

text = "Dr. Smith loves NLP. He works at OpenAI. Let's learn SpaCy and NLTK!"
sentences = sent_tokenize(text)

print("Sentences:")
for s in sentences:
    print(s)
```

**Output:**

```
Sentences:
Dr. Smith loves NLP.
He works at OpenAI.
Let's learn SpaCy and NLTK!
```

* **NLTK’s `sent_tokenize`** uses a pretrained **Punkt** model, which recognizes common abbreviations such as “Dr.” and therefore does not split after them.
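
Punkt learns its abbreviations from unannotated text rather than from a hand-written list. The toy sketch below gives a feel for that idea with a plain frequency heuristic: short words that almost always appear with a trailing period are treated as abbreviation candidates. This is a simplification and not Punkt's actual scoring, which is statistical and more involved. The function name `learn_abbreviations` and its thresholds are hypothetical.

```python
from collections import Counter

def learn_abbreviations(corpus, min_count=2, min_ratio=0.9, max_len=4):
    """Return short word types that almost always appear with a trailing period."""
    with_period = Counter()
    total = Counter()
    for token in corpus.split():
        word = token.rstrip(".").lower()
        if not word.isalpha():
            continue  # skip numbers, punctuation, and mixed tokens
        total[word] += 1
        if token.endswith("."):
            with_period[word] += 1
    return {
        word
        for word, count in total.items()
        if count >= min_count
        and len(word) <= max_len
        and with_period[word] / count >= min_ratio
    }

corpus = (
    "Dr. Smith met Dr. Jones at noon. Mr. Lee saw the dog at noon too. "
    "The dog saw Mr. Lee. Dr. Jones left."
)
print(sorted(learn_abbreviations(corpus)))
# ['dr', 'mr']
```

Ordinary words such as "noon" and "Lee" also occur without a period, so their ratio falls below the threshold, while "Dr." and "Mr." never appear without one.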

---

### **Sentence Segmentation with spaCy**

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith loves NLP. He works at OpenAI. Let's learn SpaCy and NLTK!"

doc = nlp(text)

print("Sentences:")
for sent in doc.sents:
    print(sent.text)
```

**Output:**

```
Sentences:
Dr. Smith loves NLP.
He works at OpenAI.
Let's learn SpaCy and NLTK!
```

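* In pretrained pipelines such as `en_core_web_sm`, spaCy sets sentence boundaries with the **dependency parser**, a statistical model, so boundaries follow the predicted syntax rather than punctuation alone.
* For faster segmentation without parsing, use the rule-based `sentencizer` component, which splits on sentence-final punctuation.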

---

### **Custom Sentencizer in spaCy**

```python
from spacy.lang.en import English

# Blank English pipeline: tokenizer only, no trained components
nlp = English()
# Add the rule-based sentencizer, which splits on sentence-final punctuation
nlp.add_pipe("sentencizer")
doc = nlp("Dr. Smith loves NLP. He works at OpenAI. Let's learn SpaCy and NLTK!")

for sent in doc.sents:
    print(sent.text)
```

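✅ This is faster than a full pipeline because no trained model is loaded or run. Use it when you don’t need dependency parses and punctuation-based boundaries are good enough.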

---

## **3. Interview-Style Q&A**

### **Basic Level**

**Q1. What is sentence segmentation in NLP?**
*A: Sentence segmentation is the process of splitting a text into individual sentences to prepare it for downstream NLP tasks.*

**Q2. Why is it important?**
*A: It maintains context, enables sentence-level processing, and prepares text for tokenization, POS tagging, NER, summarization, and sentiment analysis.*

---

### **Intermediate Level**

**Q3. How does NLTK handle sentence segmentation?**
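*A: NLTK’s `sent_tokenize` uses the Punkt tokenizer, an unsupervised algorithm that learns abbreviations, collocations, and frequent sentence starters from unannotated text. NLTK ships pretrained Punkt models, which use these learned statistics to decide whether a period ends a sentence.*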

**Q4. How does spaCy handle sentence segmentation?**
*A: In its pretrained pipelines, spaCy derives sentence boundaries from the dependency parser, a statistical model. When parsing is not needed, the rule-based `sentencizer` component can be used instead to split on sentence-final punctuation, which is faster.*

---

### **Advanced Level**

**Q5. What are common challenges in sentence segmentation?**
*A: Abbreviations (e.g., Dr., Mr.), decimal numbers, ellipses, quotations, and domain-specific punctuation can all confuse boundary detection.*

**Q6. When would you customize spaCy’s sentencizer?**
*A: When speed matters and dependency parses are not needed, for example when processing large volumes of text, or when domain-specific sentence boundaries need to be defined.*

---

