# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [36]:
# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)


Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.


In [37]:
# Import "re" (regular expressions), as we saw earlier
import re

## 1. Sentences

In [38]:
# No special symbols like in the S02 exercises:
# split on whitespace that follows . ! or ? and precedes a capital letter
sentence_pattern = r'(?<=[.!?])\s+(?=[A-Z])'

sentences = re.split(sentence_pattern, text)

print("New sentence")
for s in sentences:
    print("-", s)

New sentence
- Dr.
- John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I.
- Corp.
- You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
- The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.
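
The output above breaks after `Dr.` and `A.I.` because the pattern treats every period followed by a capital letter as a sentence end. One possible refinement, sketched below, keeps the same split pattern and then merges a piece back into the next one when it ends with an abbreviation that rarely closes a sentence. The `NON_TERMINAL` set and the `split_sentences` helper are hypothetical names, and the list is hand-picked for this sample rather than exhaustive.

```python
import re

# Hypothetical, hand-picked list of abbreviations that rarely end a sentence.
NON_TERMINAL = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Jan.", "Feb.", "A.I."}

def split_sentences(text, non_terminal=NON_TERMINAL):
    """Split on the naive pattern, then glue back pieces cut after an abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece.strip():
            continue  # skip empty pieces (for example, empty input)
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1]
        if last_word not in non_terminal:
            # The piece ends with real terminal punctuation: close the sentence.
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = (
    "Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work "
    "at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit "
    "https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based "
    "company reported a 23.5% increase in Q3 revenue, totaling €2.5M."
)

for s in split_sentences(sample):
    print("-", s)
```

On this sample the sketch yields three sentences. It still relies on a fixed list, so an abbreviation such as `Corp.` that also happens to end a sentence is only handled correctly because it is left out of the list.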


## 2. Words (tokens)

In [39]:
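url = r'https?://\S+'                      # URLs, up to the next whitespace
email = r'\S+@\S+'                         # anything around an @ sign
abr = r'(?:[A-Z]\.)+'                      # dotted acronyms such as A.I. or U.S.A.
money = r'[€$£]\d+(?:\.\d+)?[A-Za-z]?'     # amounts such as €2.5M (no thousands separator)
number = r'\d+(?:,\d+)*(?:\.\d+)?%?'       # integers, decimals, percentages
word = r'[A-Za-z]+(?:-[A-Za-z]+)*'         # plain and hyphenated words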


pattern = f'{url}|{email}|{abr}|{money}|{number}|{word}'

tokens = re.findall(pattern, text)

print("tokens")
print(tokens)


tokens
['Dr', 'John', 'Smith', 'Ph', 'D.', 'earned', '$1', '250.50', 'on', 'Jan', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info', 'The', 'U.S.A.', 'based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q', '3', 'revenue', 'totaling', '€2.5M']
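
The token list above shows a few weak spots: `$1,250.50` is cut at the comma, `Ph.D.` and `Q3` are split in two, and `Dr` and `Jan` lose their periods. The sketch below is one way to tighten the patterns, assuming amounts use commas as thousands separators and that a small, hand-picked set of abbreviations (`known_abbr`, a hypothetical name) is acceptable for this text.

```python
import re

url = r'https?://\S+'
email = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'               # local part, @, dotted domain
money = r'[€$£]\d+(?:,\d{3})*(?:\.\d+)?[A-Za-z]?'     # allows 1,250.50 and 2.5M
number = r'\d+(?:,\d{3})*(?:\.\d+)?%?'
known_abbr = r'\b(?:Dr|Ph\.D|Jan|Corp)\.'             # hand-picked, not exhaustive
acronym = r'(?:[A-Z]\.)+'
word = r'[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*'      # letters first, digits allowed after

# Order matters: the more specific patterns must come before the generic ones.
pattern = '|'.join([url, email, money, number, known_abbr, acronym, word])

sample = (
    "Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work "
    "at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit "
    "https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based "
    "company reported a 23.5% increase in Q3 revenue, totaling €2.5M."
)

tokens = re.findall(pattern, sample)
print(tokens)
```

With these patterns `$1,250.50`, `Ph.D.`, `Dr.`, `Jan.` and `Q3` each come out as a single token. `U.S.A.-based` is still split into `U.S.A.` and `based`, and the URL pattern would swallow trailing punctuation if a URL ended a sentence, so this remains a sketch rather than a general tokenizer.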


# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

## The Challenge

Given the file `shakes.txt` in the book folder, find the most common words in Shakespeare's book.

In [40]:
import re

# Read the whole Shakespeare text and lowercase it
with open("shakes.txt", "r", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"[a-z']+", text)  # no special characters, just words (letters and apostrophes)

word_counts = {}

for w in words:
    if w in word_counts:
        word_counts[w] += 1
    else:
        word_counts[w] = 1

In [None]:
sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

print("common words")
for word, count in sorted_words[:5]:
    print(word, count)

common words
the 27801
and 26834
i 20296
to 19749
of 18299
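
The manual dictionary loop and the `sorted` call above can also be expressed with `collections.Counter` from the standard library. The sketch below wraps the same regex in a hypothetical helper, `top_words`, and runs it on a short inline string instead of `shakes.txt` so that it works without the file.

```python
import re
from collections import Counter

def top_words(text, n=5):
    """Return the n most frequent words as (word, count) pairs."""
    # Same pattern as above: runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

sample = "To be, or not to be, that is the question."
for word, count in top_words(sample, 2):
    print(word, count)
```

To apply it to the corpus, pass the contents of `shakes.txt` as `text`. Because the pattern keeps apostrophes, forms such as `'tis` and stray quote marks are counted as they appear; stripping leading and trailing apostrophes would be a separate design choice.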
