# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

In [1]:
# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)

Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.


Before: naive splitting with a regex and `str.split()`

In [2]:
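import re

# Split after every ., ! or ? that is followed by whitespace.
# Abbreviations such as "Dr." and "Jan." also trigger a split.
sentences = re.split(r'(?<=[.!?])\s+', text)
for s in sentences:
    print(s)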

Dr.
John Smith, Ph.D., earned $1,250.50 on Jan.
15, 2024, for his work at A.I.
Corp.
You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.


In [3]:
# Split on whitespace only; punctuation stays attached to the words
words = text.split()
print(words)

['Dr.', 'John', 'Smith,', 'Ph.D.,', 'earned', '$1,250.50', 'on', 'Jan.', '15,', '2024,', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue,', 'totaling', '€2.5M.']
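
Between whitespace splitting and a full NLP pipeline, a hand-written regex tokenizer can already keep URLs, e-mail addresses, amounts, and dotted acronyms in one piece while separating trailing punctuation. The sketch below shows one possible set of patterns, not a complete solution; `TOKEN_PATTERN`, `tokenize`, and the sample string are illustrative names and data introduced here.

```python
import re

# Alternatives are tried left to right, so the most specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s,]+[^\s.,;:!?]              # URLs, without trailing punctuation
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+             # e-mail addresses
  | [$€£]?\d+(?:,\d{3})*(?:\.\d+)?[%MKB]?    # amounts, numbers, percentages
  | (?:[A-Za-z]\.){2,}                       # dotted acronyms such as U.S.A.
  | \w+(?:-\w+)*                             # plain and hyphenated words
  | [^\w\s]                                  # any remaining punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of text in order of appearance."""
    return TOKEN_PATTERN.findall(text)


sample = "Visit https://www.ai-corp.co.uk/team/dr-smith. The U.S.A. paid $1,250.50, up 23.5%."
print(tokenize(sample))
```

Titles such as `Dr.` and `Ph.D.` are still split from their periods here, because recognizing abbreviations needs a word list rather than a pattern.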


After: sentence segmentation with spaCy

In [6]:
import spacy

# Load spaCy's small English pipeline and run it on the text
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Collect each detected sentence as plain text
sentences = [s.text for s in doc.sents]


sentences

['Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp.',
 'You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.',
 'The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.']

# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

## The Challenge

Given the file `shakes.txt` in the book folder, find the words that are most common in Shakespeare's book.
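
One way to approach the challenge is to lowercase the text, extract the words with a regular expression, and count them with `collections.Counter`. The following is a minimal sketch under those assumptions; the function name `most_common_words`, the word pattern, and the path `book/shakes.txt` in the comment are illustrative choices, not part of the exercise statement.

```python
import re
from collections import Counter


def most_common_words(text, n=10):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    # Lowercase first so "The" and "the" are counted together;
    # the pattern keeps internal apostrophes ("o'er") inside one token.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words).most_common(n)


# For the exercise, the text would come from the file instead
# (path is an assumption):
# text = open("book/shakes.txt", encoding="utf-8").read()
sample = "To be, or not to be, that is the question."
print(most_common_words(sample, n=2))
```

On a real corpus the top of the list is likely to be dominated by function words such as "the" and "and", so filtering a stop-word list before counting may give more informative results.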