# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [10]:
# Sample text with challenging elements
import re
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)
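
# Naive approach: split on any whitespace that follows '.', '!' or '?'.
# Abbreviations such as "Dr." or "Jan." will be treated as sentence ends.
sentences = re.split(r'(?<=[.!?])\s+', text)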

print("Sentences:")
for s in sentences:
    print("-", s)

Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.
Sentences:
- Dr.
- John Smith, Ph.D., earned $1,250.50 on Jan.
- 15, 2024, for his work at A.I.
- Corp.
- You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
- The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.
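
The output above breaks after `Dr.`, `Jan.`, and `A.I.` because the pattern treats every period followed by whitespace as a sentence end. One possible refinement, shown as a minimal sketch, is to split at the same candidate positions and then merge a piece back into the next one when it ends in a known abbreviation or a dotted acronym. The names `ABBREVIATIONS`, `ACRONYM`, `split_sentences`, and `sample` are illustrative assumptions, and the abbreviation list is hand-picked for this text only.

```python
import re

# Assumption: a tiny, hand-picked abbreviation list for this sample only.
# "corp." is deliberately left out because here it really ends a sentence.
ABBREVIATIONS = {"dr.", "ph.d.", "jan."}

# Dotted acronyms such as "A.I." or "U.S.A." (two or more letter-dot pairs).
ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}$")

def split_sentences(text):
    """Split at '.', '!' or '?' plus whitespace, then undo unlikely splits."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue
        # Re-join with a single space; the original whitespace is not preserved.
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1]
        if last_word.lower() in ABBREVIATIONS or ACRONYM.search(last_word):
            continue  # probably not a real boundary, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = (
    "Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work "
    "at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit "
    "https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based "
    "company reported a 23.5% increase in Q3 revenue, totaling €2.5M."
)

for s in split_sentences(sample):
    print("-", s)
```

This heuristic recovers the three sentences of the sample, but it is fragile: an abbreviation or acronym that genuinely ends a sentence (for example a sentence ending in `A.I.`) would be merged with the following one, so a larger abbreviation list or a trained sentence splitter would likely be needed for other texts.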


In [11]:
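# Naive approach: keep only runs of word characters.
# All punctuation is dropped, so amounts, emails, URLs and acronyms get fragmented.
tokens = re.findall(r'\b\w+\b', text)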

print("\nTokens:")
print(tokens)


Tokens:
['Dr', 'John', 'Smith', 'Ph', 'D', 'earned', '1', '250', '50', 'on', 'Jan', '15', '2024', 'for', 'his', 'work', 'at', 'A', 'I', 'Corp', 'You', 'can', 'reach', 'him', 'at', 'j', 'smith', 'ai', 'corp', 'co', 'uk', 'or', 'visit', 'https', 'www', 'ai', 'corp', 'co', 'uk', 'team', 'dr', 'smith', 'for', 'more', 'info', 'The', 'U', 'S', 'A', 'based', 'company', 'reported', 'a', '23', '5', 'increase', 'in', 'Q3', 'revenue', 'totaling', '2', '5M']
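
The token list above shows how `\b\w+\b` fragments `$1,250.50`, the email address, the URL, and acronyms such as `U.S.A.`. A common alternative, sketched below, is a single regular expression whose alternatives are tried in priority order, with the most specific patterns first. `TOKEN_PATTERN`, `tokenize`, and `sample` are hypothetical names, and the abbreviation alternative is a hand-picked assumption that only covers this sample.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s]+[\w/]                     # URLs, ending in a word char or slash
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+             # email addresses
  | [$€£]?\d+(?:,\d{3})*(?:\.\d+)?[%\w]*     # amounts, percentages, plain numbers
  | (?:Dr|Ph\.D|Jan|Corp)\.                  # hand-picked abbreviations (assumption)
  | (?:[A-Za-z]\.){2,}                       # dotted acronyms such as A.I. or U.S.A.
  | \w+(?:[-']\w+)*                          # ordinary words, incl. hyphenated ones
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return all tokens matched by TOKEN_PATTERN; other punctuation is dropped."""
    return TOKEN_PATTERN.findall(text)

sample = (
    "Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work "
    "at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit "
    "https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based "
    "company reported a 23.5% increase in Q3 revenue, totaling €2.5M."
)

print(tokenize(sample))
```

With this sketch the amount, email, URL, and percentage each survive as one token, while `U.S.A.-based` is still split into `U.S.A.` and `based`. Whether that split is desirable depends on the downstream task, which is part of what makes tokenization a design decision rather than a purely mechanical step.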


# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

## The Challenge

Given the file `shakes.txt` in the book folder, find the most common words in Shakespeare's text.

In [14]:
import re
from collections import Counter

file_path = r'C:\Users\Anxo\Downloads\shakes.txt'

try:
    with open(file_path, 'r', encoding='utf-8') as f:
        # 1. Read and lowercase the text
        content = f.read().lower()
        
        # 2. Extract only the words
        words = re.findall(r'\b\w+\b', content)
        
        # 3. Count the 10 most frequent words
        top_words = Counter(words).most_common(10)

    print(f"Analysis complete for: {file_path}")
    print("-" * 30)
    for word, count in top_words:
        print(f"{word}: {count}")

except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}. Check the filename!")
except Exception as e:
    print(f"An error occurred: {e}")

Analysis complete for: C:\Users\Anxo\Downloads\shakes.txt
------------------------------
the: 27843
and: 26847
i: 22538
to: 19883
of: 18307
a: 14800
you: 13928
my: 12489
that: 11563
in: 11183
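
All ten words in the output above are function words ("the", "and", "to", ...), which say little about the content of the text. One possible follow-up, sketched here, is to filter out such stopwords before counting. The names `STOPWORDS` and `top_content_words` are hypothetical, and the stopword set is an assumption built only from the ten words printed above; a practical list would be much longer. The sketch runs on a short in-memory string rather than on `shakes.txt`.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword set, taken from the top-10 output above.
STOPWORDS = {"the", "and", "i", "to", "of", "a", "you", "my", "that", "in"}

def top_content_words(text, n=10, stopwords=STOPWORDS):
    """Count lowercase word tokens, ignoring any word found in `stopwords`."""
    words = re.findall(r"\b\w+\b", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

sample = "To be, or not to be, that is the question."
for word, count in top_content_words(sample, n=3):
    print(f"{word}: {count}")
```

To apply the same idea to the corpus, the string read from `shakes.txt` could be passed in place of `sample`.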
