
# Lab: Rule-Based Sentence Tokenization

## Objective
In this lab, you will **design and implement a simple rule-based sentence tokenizer**, structured as a decision tree of if/else rules.

You will write a Python function that:
- Takes a **raw text corpus** as input
- Returns a **list of sentences**

This lab focuses on **reasoning about rules and edge cases**, not on using pre-built NLP libraries.



## Background

Sentence tokenization (sentence segmentation) is the task of identifying sentence boundaries.
A naive approach is to split on punctuation such as:
- `.`
- `!`
- `?`

However, this approach fails in many real-world cases:
- Abbreviations (e.g., *Dr.*, *Mr.*)
- Decimal numbers (e.g., *3.14*)
- Titles and acronyms
- Quoted text

Your goal is to design **reasonable heuristic rules** to handle these cases.
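
To make the failure modes above concrete, here is a minimal sketch of the naive approach. The helper name `naive_split` is hypothetical and is not part of the lab's required interface; it splits wherever `.`, `!`, or `?` is followed by whitespace.

```python
import re


def naive_split(text):
    # Split at whitespace that directly follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Dr. Smith paid $3.14. He left."
for piece in naive_split(sample):
    print(repr(piece))
```

On this sample the period after "Dr" is treated as a boundary, so the title ends up as a sentence of its own. The decimal survives only because its period is not followed by whitespace; a splitter that ignored the following character would break it too.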



## Task Description

You must implement the function:

```python
def rule_based_sentence_tokenizer(text):
    pass
```

### Requirements
- Input: a string containing multiple sentences
- Output: a list of sentence strings
- Detect candidate sentence boundaries (`.`, `!`, `?`)
- For each candidate, decide with explicit rules (if/else statements) whether it is a real sentence boundary
- Do not use NLP libraries (NLTK, spaCy, CoreNLP, etc.) to implement the tokenizer
- Compare your output with that of NLTK's Punkt tokenizer on the testing corpus below

Your solution does **not need to be perfect**, but it should handle common cases reasonably well.
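
As one possible starting point for the first requirement, the sketch below lists every candidate boundary together with a few characters of right-hand context, which is the information most rules will need. The function name `find_candidates` and the context width of five characters are assumptions chosen for illustration.

```python
import re


def find_candidates(text, context=5):
    """Return (position, punctuation, following text) for each '.', '!' or '?'."""
    candidates = []
    for match in re.finditer(r"[.!?]", text):
        pos = match.start()
        # Look at the characters that come right after the punctuation mark.
        following = text[pos + 1 : pos + 1 + context]
        candidates.append((pos, match.group(), following))
    return candidates


for pos, mark, following in find_candidates("It costs $9.99! Really?"):
    print(pos, repr(mark), repr(following))
```

Inspecting the context by eye is a quick way to spot which rules are needed: a digit right after a period suggests a decimal, while a space followed by a capital letter suggests a real boundary.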



## Testing Corpus

Use the following corpus to test your tokenizer.
It intentionally contains **edge cases**.

Pay attention to:
- Abbreviations
- Numbers
- Capitalization
- Quotation marks


In [9]:
test_corpus = """
Dr. Smith arrived at 5 p.m. He said, "This is unexpected.".
Apple released a new product today. It costs $999.99!
Is this the best option? Many people think so.
Mr. Johnson lives in the U.S. He works at Apple Inc.
The value of pi is approximately 3.14. It is used in math.
"""

print(test_corpus)


Dr. Smith arrived at 5 p.m. He said, "This is unexpected.".
Apple released a new product today. It costs $999.99!
Is this the best option? Many people think so.
Mr. Johnson lives in the U.S. He works at Apple Inc.
The value of pi is approximately 3.14. It is used in math.




## Your Task

1. Write a rule-based sentence tokenizer.
2. Apply it to the testing corpus.
3. Print each detected sentence on a new line.

💡 *Hint:* Start simple, then refine your rules.


In [10]:
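import re

# Abbreviations that may or may not end a sentence, depending on what follows.
ABBREVIATIONS = ["p.m.", "U.S.", "Inc."]

# Titles that normally precede a name.
TITLES = ["Dr.", "Mr."]


def is_EOS(token, next_token):
    """Return True if `token` ends a sentence, given the token that follows it."""

    # Decimal number (e.g. "3.14"): the period is not a boundary.
    if re.match(r"\d+\.\d+$", token):
        return False

    # Title followed by a capitalized word (a name): not a boundary.
    if token in TITLES and next_token[0].isupper():
        return False

    # Known abbreviation: a boundary only if the next token is capitalized.
    if token in ABBREVIATIONS:
        return next_token[0].isupper()

    # Any other token that ends with sentence punctuation is a boundary.
    if re.search(r"[.!?]$", token):
        return True

    return False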


def rule_based_sentence_tokenizer(text):
    # Split on whitespace; punctuation stays attached to its word,
    # and newlines never appear as tokens.
    tokens = re.findall(r"\S+", text)
    print("Tokens: ", tokens)

    sentences = []
    current_sentence = []
    for i, token in enumerate(tokens):
        current_sentence.append(token)
        # The last token has no successor; it is flushed after the loop.
        if i < len(tokens) - 1 and is_EOS(token, tokens[i + 1]):
            sentences.append(" ".join(current_sentence))
            current_sentence = []

    # Whatever remains forms the final sentence.
    if current_sentence:
        sentences.append(" ".join(current_sentence))

    return sentences

In [11]:
# Test your implementation

sentences = rule_based_sentence_tokenizer(test_corpus)
print(sentences)

for i, s in enumerate(sentences, 1):
    print(f"Sentence {i}: {s}")

Tokens:  ['Dr.', 'Smith', 'arrived', 'at', '5', 'p.m.', 'He', 'said,', '"This', 'is', 'unexpected.".', 'Apple', 'released', 'a', 'new', 'product', 'today.', 'It', 'costs', '$999.99!', 'Is', 'this', 'the', 'best', 'option?', 'Many', 'people', 'think', 'so.', 'Mr.', 'Johnson', 'lives', 'in', 'the', 'U.S.', 'He', 'works', 'at', 'Apple', 'Inc.', 'The', 'value', 'of', 'pi', 'is', 'approximately', '3.14.', 'It', 'is', 'used', 'in', 'math.']
['Dr. Smith arrived at 5 p.m.', 'He said, "This is unexpected.".', 'Apple released a new product today.', 'It costs $999.99!', 'Is this the best option?', 'Many people think so.', 'Mr. Johnson lives in the U.S.', 'He works at Apple Inc.', 'The value of pi is approximately 3.14.', 'It is used in math.']
Sentence 1: Dr. Smith arrived at 5 p.m.
Sentence 2: He said, "This is unexpected.".
Sentence 3: Apple released a new product today.
Sentence 4: It costs $999.99!
Sentence 5: Is this the best option?
Sentence 6: Many people think so.
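Sentence 7: Mr. Johnson lives in the U.S.
Sentence 8: He works at Apple Inc.
Sentence 9: The value of pi is approximately 3.14.
Sentence 10: It is used in math.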

In [12]:
# Compare with NLTK's Punkt tokenizer

from nltk.tokenize import sent_tokenize

nltk_sentences = sent_tokenize(test_corpus)
print("\nNLTK Sentences:")
for i, s in enumerate(nltk_sentences, 1):
    print(f"Sentence {i}: {s}")


NLTK Sentences:
Sentence 1: 
Dr. Smith arrived at 5 p.m.
Sentence 2: He said, "This is unexpected.".
Sentence 3: Apple released a new product today.
Sentence 4: It costs $999.99!
Sentence 5: Is this the best option?
Sentence 6: Many people think so.
Sentence 7: Mr. Johnson lives in the U.S.
Sentence 8: He works at Apple Inc.
Sentence 9: The value of pi is approximately 3.14.
Sentence 10: It is used in math.
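
The two outputs above can also be compared programmatically rather than by eye. The following is a minimal sketch, assuming whitespace differences should be ignored (Punkt keeps the corpus's leading newline in its first sentence); the helper name `compare_segmentations` is hypothetical. The sample lists reuse the first two sentences from each output.

```python
def compare_segmentations(predicted, reference):
    """Compare two sentence lists after collapsing all whitespace runs."""

    def normalize(sentence):
        return " ".join(sentence.split())

    pred = [normalize(s) for s in predicted]
    ref = [normalize(s) for s in reference]
    return {
        "identical": pred == ref,
        "only_in_predicted": [s for s in pred if s not in ref],
        "only_in_reference": [s for s in ref if s not in pred],
    }


mine = ["Dr. Smith arrived at 5 p.m.", 'He said, "This is unexpected.".']
punkt = ["\nDr. Smith arrived at 5 p.m.", 'He said, "This is unexpected.".']
print(compare_segmentations(mine, punkt))
```

Passing the full `sentences` and `nltk_sentences` lists instead of the samples gives a comparison over the whole testing corpus.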
