# 1. What is Tokenization?

## 📖 What is Tokenization?

**Tokenization** is the process of breaking down text into smaller units called "tokens" - these can be words, subwords, characters, or even sentences.

**Token Examples:**
- **Text**: "Hello, world!"
- **Word tokens**: ["Hello", ",", "world", "!"]
- **Subword tokens**: ["Hello", ",", "wo", "##rld", "!"]
- **Character tokens**: ["H", "e", "l", "l", "o", ",", " ", "w", "o", "r", "l", "d", "!"]
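
The word-level and character-level examples above can be reproduced with the standard library alone. This is a minimal sketch: the regex is just one simple choice of word pattern, and the subword level is left out because it needs a trained vocabulary.

```python
import re

text = "Hello, world!"

# Word-level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including the space
char_tokens = list(text)

print(word_tokens)
print(char_tokens)
```

Joining the character tokens with `"".join(char_tokens)` gives back the original string, which is the simplest case of a reversible tokenization.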

**Key Concepts:**
- **Input**: Raw text string
- **Output**: List of tokens + metadata (positions, IDs, attention masks)
- **Mostly reversible**: Text can usually be reconstructed from tokens, but normalization (lowercasing, accent removal) makes the round trip lossy
- **Language-dependent**: Different rules for different languages

## 🎯 Why Use Tokenization?

### **Advantages:**
1. **Machine-Readable Format** - Convert text to numerical IDs for ML models
2. **Vocabulary Management** - Handle millions of words with fixed vocabulary
3. **OOV Handling** - Subword tokenization handles unknown words
4. **Language Agnostic** - Subword algorithms are learned from raw text, so the same method works across languages
5. **Feature Engineering** - Tokens become features for ML
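
To make the first three advantages concrete, here is a small sketch of a fixed-size vocabulary that maps tokens to integer IDs and sends anything unseen to an `[UNK]` entry. The helper names `build_vocab` and `encode` are hypothetical and chosen for illustration only; real tokenizers build their vocabularies with BPE or WordPiece rather than raw word counts.

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size):
    """Map the most frequent tokens to integer IDs; ID 0 is reserved for [UNK]."""
    vocab = {"[UNK]": 0}
    for token, _ in Counter(corpus_tokens).most_common(max_size - 1):
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token by its ID, falling back to the [UNK] ID."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

train_tokens = "the cat sat on the mat the cat sat".split()
vocab = build_vocab(train_tokens, max_size=4)

# "dog" was never seen, so it is encoded with the [UNK] ID
ids = encode("the dog sat".split(), vocab)
print(vocab)
print(ids)
```

With a word-level vocabulary the unknown word collapses into a single `[UNK]` ID and its content is lost, which is the gap subword tokenization is meant to close.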

### **Disadvantages:**
1. **Information Loss** - May lose subtle nuances
2. **Context Dependency** - Same word, different meanings
3. **Language Specific** - Needs different strategies per language
4. **Ambiguity** - "New York" vs ["New", "York"]
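
One common way to reduce the "New York" problem is to merge known multi-word expressions after word tokenization. The sketch below assumes a hand-written phrase list and a hypothetical helper `merge_multiword`; it is a greedy longest-match pass, not a full entity recognizer.

```python
def merge_multiword(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens."""
    max_len = max((len(p) for p in phrases), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two words
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

phrases = {("New", "York"), ("New", "York", "City")}
tokens = "I moved to New York City from New York".split()
merged = merge_multiword(tokens, phrases)
print(merged)
```

Trying longer phrases first is what keeps "New York City" from being cut down to "New York" plus a stray "City".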

## ⏱️ When to Use Tokenization

### ✅ **Use When:**

**1. Text Classification**
- Example: Spam detection, sentiment analysis
- Why: Need to convert text to features
- Benefit: Model can learn from word patterns

**2. Named Entity Recognition (NER)**
- Example: Extract names, dates, locations
- Why: Need word-level or subword-level processing
- Use case: "Apple Inc. in Cupertino" → ["Apple", "Inc.", "in", "Cupertino"], with entity labels then assigned per token

**3. Machine Translation**
- Example: English to French translation
- Why: Need to map words/subwords across languages
- Benefit: Subword tokens handle morphology

**4. Question Answering**
- Example: BERT for SQuAD dataset
- Why: Need to align question and context tokens
- Use case: Find answer span in token positions

**5. Text Generation**
- Example: GPT-3, ChatGPT
- Why: Generate text token by token
- Use case: Predict next token given previous tokens

**6. Search and Information Retrieval**
- Example: Google Search, Elasticsearch
- Why: Index documents by tokens
- Benefit: Fast keyword matching
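
For the search use case, indexing by token usually means building an inverted index. The following is a toy sketch with hypothetical helpers (`tokenize`, `build_index`, `search`); production engines add stemming, ranking, and on-disk storage on top of the same idea.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and keep word characters only (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every query token (AND semantics)."""
    token_sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()

docs = [
    "Tokenization splits text.",
    "Search engines index text.",
    "Models read token IDs.",
]
index = build_index(docs)
print(search(index, "text"))
print(search(index, "index text"))
```

The query has to go through the same `tokenize` function as the documents; otherwise "Text" and "text" would never meet in the index.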

### ❌ **Don't Use When:**

**1. Simple String Matching**
- Problem: Overhead for exact matches
- Better: Use `str.find()`, the `in` operator, or regex
- Why: Tokenization adds unnecessary complexity

**2. Character-Level Tasks Only**
- Problem: Word tokenization loses information
- Better: Use character embeddings directly
- Why: Some tasks need character-level features

**3. Very Short Text**
- Problem: Single word doesn't need tokenization
- Better: Process directly
- Why: "Yes" → ["Yes"] adds no value

**4. Binary/Non-Text Data**
- Problem: Tokenization is for text
- Better: Use appropriate encoding (images → pixels)
- Why: Not designed for non-linguistic data

## 📊 How It Works

**Tokenization Pipeline:**
1. **Normalization**: Lowercase, remove accents, etc.
2. **Pre-tokenization**: Split by whitespace/punctuation
3. **Tokenization**: Apply algorithm (BPE, WordPiece, etc.)
4. **Post-processing**: Add special tokens ([CLS], [SEP])
5. **Conversion**: Map tokens to IDs from vocabulary
6. **Padding/Truncation**: Make sequences same length

**Example Flow:**
```
Text: "I love NLP!"
  ↓ Normalize
"i love nlp!"
  ↓ Pre-tokenize
["i", "love", "nlp", "!"]
  ↓ Tokenize (subword)
["i", "love", "nl", "##p", "!"]
  ↓ Add special tokens
["[CLS]", "i", "love", "nl", "##p", "!", "[SEP]"]
  ↓ Convert to IDs
[101, 1045, 2293, 17953, 2361, 999, 102]
```
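
The six pipeline steps can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the tiny `VOCAB` and its IDs are made up (they are not BERT's), and `wordpiece` is a simplified greedy longest-match-first splitter in the style of WordPiece inference, not the library implementation.

```python
import re

# Toy vocabulary; tokens and IDs are invented for this example
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "love": 5, "nl": 6, "##p": 7, "!": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

def encode(text, vocab, max_len=8):
    normalized = text.lower()                                  # 1. normalize
    words = re.findall(r"\w+|[^\w\s]", normalized)             # 2. pre-tokenize
    tokens = [p for w in words for p in wordpiece(w, vocab)]   # 3. subword split
    tokens = ["[CLS]"] + tokens[:max_len - 2] + ["[SEP]"]      # 4. special tokens, truncating if needed
    ids = [vocab[t] for t in tokens]                           # 5. map tokens to IDs
    pad = max_len - len(ids)                                   # 6. pad to a fixed length
    mask = [1] * len(ids) + [0] * pad
    return tokens, ids + [vocab["[PAD]"]] * pad, mask

tokens, ids, mask = encode("I love NLP!", VOCAB)
print(tokens)
print(ids)
print(mask)
```

The attention mask marks real tokens with 1 and padding with 0, so a model can ignore the padded positions.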

## 🌍 Real-World Applications

1. **Search Engines** - Google, Bing tokenize queries and documents
2. **Chatbots** - ChatGPT, Alexa tokenize user input
3. **Translation** - Google Translate, DeepL
4. **Sentiment Analysis** - Twitter sentiment, product reviews
5. **Email Filtering** - Gmail spam detection
6. **Content Moderation** - Facebook, YouTube content filtering
7. **Voice Assistants** - Siri, Alexa process transcribed speech
8. **Legal Tech** - Contract analysis, e-discovery
9. **Healthcare** - Clinical notes processing, medical coding
10. **Finance** - News sentiment analysis, fraud detection

## 💡 Key Insights

✅ Tokenization is one of the **first steps** in almost every NLP pipeline  
✅ Choice of tokenizer affects model performance significantly  
✅ **Subword tokenization** (BPE, WordPiece) dominates modern NLP  
✅ Always use **same tokenizer** for training and inference  
✅ **Special tokens** ([CLS], [SEP], [PAD]) are model-specific; they must match what the model was trained with  
✅ Fast tokenizers (Rust-based) are typically much faster than pure-Python ones, especially on batches  
✅ Vocabulary size is a tradeoff: larger = better coverage and shorter sequences, but a bigger embedding table  
✅ **Context matters**: "bank" (river) vs "bank" (financial) map to the same token; the model has to disambiguate  
✅ Multilingual models use shared subword vocabulary  
✅ Save tokenizer config with model for reproducibility
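
The advice about using the same tokenizer and saving its configuration can be illustrated with a toy round trip. This sketch assumes a hypothetical JSON file name and a made-up three-entry vocabulary; real libraries ship their own save/load helpers, which should be preferred.

```python
import json
import os
import tempfile

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
config = {"lowercase": True, "vocab": vocab}

def encode(text, cfg):
    """Whitespace tokenizer driven entirely by the saved configuration."""
    if cfg["lowercase"]:
        text = text.lower()
    unk = cfg["vocab"]["[UNK]"]
    return [cfg["vocab"].get(t, unk) for t in text.split()]

# Save the configuration next to the (hypothetical) model weights, then reload it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f)
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

ids_train = encode("Hello world", config)
ids_infer = encode("Hello world", restored)

# Same vocabulary, but a different normalization setting at inference time
ids_mismatched = encode("Hello world", {**restored, "lowercase": False})

print(ids_train, ids_infer, ids_mismatched)
```

Changing a single normalization flag is enough to turn a known word into `[UNK]`, which is why the configuration belongs with the model.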

In [None]:
# TOKENIZATION - COMPLETE INTRODUCTION

print("="*80)
print("TOKENIZATION FUNDAMENTALS - COMPREHENSIVE GUIDE")
print("="*80)

# NOTE: Install required libraries first
# pip install nltk spacy transformers
# python -m spacy download en_core_web_sm

import re
from collections import Counter

# 1. BASIC TOKENIZATION CONCEPTS
print("\n1. BASIC TOKENIZATION CONCEPTS")
print("-"*80)

text = "Hello, world! Natural Language Processing is amazing."
print(f"Original text: {text}")

# Method 1: Simple split by whitespace
tokens_whitespace = text.split()
print(f"\nWhitespace tokenization: {tokens_whitespace}")
print(f"  Token count: {len(tokens_whitespace)}")
print(f"  Problem: Punctuation attached to words!")

# Method 2: Split by whitespace and punctuation
tokens_punct = re.findall(r'\w+|[^\w\s]', text)
print(f"\nPunctuation-aware tokenization: {tokens_punct}")
print(f"  Token count: {len(tokens_punct)}")
print(f"  Better: Punctuation separated!")

# Method 3: Split only words (ignore punctuation)
tokens_words = re.findall(r'\w+', text)
print(f"\nWord-only tokenization: {tokens_words}")
print(f"  Token count: {len(tokens_words)}")
print(f"  Issue: Lost punctuation information!")

# 2. CHARACTER-LEVEL TOKENIZATION
print("\n2. CHARACTER-LEVEL TOKENIZATION")
print("-"*80)

text_short = "Hello"
char_tokens = list(text_short)
print(f"Text: {text_short}")
print(f"Character tokens: {char_tokens}")
print(f"Token count: {len(char_tokens)}")
print(f"\nUse case: Language modeling, spell checking, OCR")

# Reconstruct text from characters
reconstructed = ''.join(char_tokens)
print(f"Reconstructed: {reconstructed}")
print(f"Match: {reconstructed == text_short}")

# 3. SENTENCE TOKENIZATION
print("\n3. SENTENCE TOKENIZATION")
print("-"*80)

paragraph = """Natural Language Processing is exciting. It enables computers to understand text.
Machine learning models need tokenization. This is a fundamental step!"""

print(f"Original paragraph:\n{paragraph}")

# Simple sentence split (basic)
sentences_basic = paragraph.split('.')
sentences_basic = [s.strip() for s in sentences_basic if s.strip()]
print(f"\nBasic sentence split:")
for i, sent in enumerate(sentences_basic, 1):
    print(f"  {i}. {sent}")

# Better: Regex-based sentence tokenization
sentences_regex = re.split(r'[.!?]+', paragraph)
sentences_regex = [s.strip() for s in sentences_regex if s.strip()]
print(f"\nRegex sentence split:")
for i, sent in enumerate(sentences_regex, 1):
    print(f"  {i}. {sent}")

# 4. TOKENIZATION WITH NLTK
print("\n4. TOKENIZATION WITH NLTK")
print("-"*80)

try:
    import nltk
    # Download required data (run once)
    # nltk.download('punkt')
    # nltk.download('punkt_tab')
    
    from nltk.tokenize import word_tokenize, sent_tokenize
    
    text_nltk = "Dr. Smith works at N.Y.U. He earned $50,000 last year!"
    print(f"Text: {text_nltk}")
    
    # Word tokenization
    tokens_nltk = word_tokenize(text_nltk)
    print(f"\nNLTK word tokens: {tokens_nltk}")
    print(f"  Handles: abbreviations (Dr., N.Y.U.), currency ($), punctuation")
    
    # Sentence tokenization
    paragraph_nltk = "I love NLP. It's amazing! Do you agree? Yes, I do."
    sentences_nltk = sent_tokenize(paragraph_nltk)
    print(f"\nNLTK sentence tokens:")
    for i, sent in enumerate(sentences_nltk, 1):
        print(f"  {i}. {sent}")
    
except ImportError:
    print("NLTK not installed. Install with: pip install nltk")
except LookupError:
    print("NLTK data not downloaded. Run: nltk.download('punkt') (newer NLTK releases need 'punkt_tab')")

# 5. TOKENIZATION WITH SPACY
print("\n5. TOKENIZATION WITH SPACY")
print("-"*80)

try:
    import spacy
    
    # Load English model (run once: python -m spacy download en_core_web_sm)
    nlp = spacy.load('en_core_web_sm')
    
    text_spacy = "Apple Inc. is looking at buying U.K. startup for $1 billion."
    print(f"Text: {text_spacy}")
    
    # Process text
    doc = nlp(text_spacy)
    
    # Extract tokens with attributes
    print(f"\nspaCy tokens with attributes:")
    for token in doc:
        print(f"  {token.text:15s} | POS: {token.pos_:10s} | Lemma: {token.lemma_:15s} | Stop: {token.is_stop}")
    
    # Just token text
    tokens_spacy = [token.text for token in doc]
    print(f"\nToken list: {tokens_spacy}")
    
    # Sentence segmentation
    text_sents = "I love NLP. It's powerful. What do you think?"
    doc_sents = nlp(text_sents)
    print(f"\nspaCy sentences:")
    for i, sent in enumerate(doc_sents.sents, 1):
        print(f"  {i}. {sent.text}")
    
except ImportError:
    print("spaCy not installed. Install with: pip install spacy")
except OSError:
    print("spaCy model not downloaded. Run: python -m spacy download en_core_web_sm")

# 6. SUBWORD TOKENIZATION (SIMPLE BPE CONCEPT)
print("\n6. SUBWORD TOKENIZATION CONCEPT")
print("-"*80)

# Simulate simple subword tokenization
text_rare = "unhappiness"
print(f"Word: {text_rare}")
print(f"\nProblem: Rare word, might not be in vocabulary")
print(f"\nSolution: Split into subwords")
print(f"  Subwords: ['un', 'happiness'] or ['un', 'happi', 'ness']")
print(f"  Benefit: Common subwords likely in vocabulary")
print(f"  Meaning: Preserved through subword composition")

# Manual demonstration
subwords = ['un', 'happi', 'ness']
print(f"\nSubword tokens: {subwords}")
print(f"  'un' → negative prefix")
print(f"  'happi' → root word (happy)")
print(f"  'ness' → noun suffix")

# 7. TRANSFORMER TOKENIZATION (BERT)
print("\n7. TRANSFORMER TOKENIZATION (BERT)")
print("-"*80)

try:
    from transformers import BertTokenizer
    
    # Load pre-trained BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    text_bert = "Tokenization is fundamental for NLP!"
    print(f"Text: {text_bert}")
    
    # Tokenize
    tokens = tokenizer.tokenize(text_bert)
    print(f"\nBERT tokens: {tokens}")
    print(f"  Note: ## prefix indicates subword continuation")
    
    # Convert to IDs
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"\nToken IDs: {token_ids}")
    
    # Full encoding (includes special tokens)
    # Plain Python lists are returned, so PyTorch does not need to be installed
    encoding = tokenizer(text_bert,
                        add_special_tokens=True,
                        truncation=True)
    
    print(f"\nFull encoding:")
    print(f"  Input IDs: {encoding['input_ids']}")
    print(f"  Attention mask: {encoding['attention_mask']}")
    
    # Decode back to text
    decoded = tokenizer.decode(encoding['input_ids'])
    print(f"\nDecoded text: {decoded}")
    print(f"  [CLS] = Classification token (start)")
    print(f"  [SEP] = Separator token (end)")
    
    # Vocabulary info
    print(f"\nVocabulary size: {tokenizer.vocab_size:,}")
    print(f"Max length: {tokenizer.model_max_length:,} tokens")
    
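except ImportError:
    print("Transformers not installed. Install with: pip install transformers")
except OSError:
    # from_pretrained raises OSError when the files cannot be downloaded or found
    print("Could not load 'bert-base-uncased'. Check the network connection or local cache.")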

# 8. TOKENIZATION STATISTICS
print("\n8. TOKENIZATION STATISTICS")
print("-"*80)

corpus = """Machine learning is a subset of artificial intelligence. 
It enables computers to learn from data. Deep learning is a subset of machine learning.
Neural networks are the foundation of deep learning."""

print(f"Corpus:\n{corpus}")

# Simple tokenization
tokens_stats = re.findall(r'\w+', corpus.lower())

print(f"\nStatistics:")
print(f"  Total tokens: {len(tokens_stats)}")
print(f"  Unique tokens: {len(set(tokens_stats))}")

# Token frequency
token_freq = Counter(tokens_stats)
print(f"\nTop 10 most frequent tokens:")
for token, count in token_freq.most_common(10):
    print(f"  {token:15s}: {count} times")

# Average token length
avg_length = sum(len(t) for t in tokens_stats) / len(tokens_stats)
print(f"\nAverage token length: {avg_length:.2f} characters")

# 9. PRACTICAL EXAMPLE: TEXT PREPROCESSING PIPELINE
print("\n9. PRACTICAL EXAMPLE: TEXT PREPROCESSING PIPELINE")
print("-"*80)

def preprocess_and_tokenize(text):
    """Complete preprocessing and tokenization pipeline"""
    
    # Step 1: Lowercase
    text_lower = text.lower()
    
    # Step 2: Remove special characters (keep alphanumeric and spaces)
    text_clean = re.sub(r'[^a-zA-Z0-9\s]', '', text_lower)
    
    # Step 3: Tokenize
    tokens = text_clean.split()
    
    # Step 4: Remove stop words (simplified list)
    stop_words = {'is', 'a', 'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for'}
    tokens_filtered = [t for t in tokens if t not in stop_words]
    
    return tokens_filtered

sample_text = "Natural Language Processing is the key to understanding text!"
print(f"Original: {sample_text}")

processed_tokens = preprocess_and_tokenize(sample_text)
print(f"Processed tokens: {processed_tokens}")
print(f"\nSteps applied:")
print(f"  1. Lowercased")
print(f"  2. Removed punctuation")
print(f"  3. Split into words")
print(f"  4. Removed stop words")

# 10. COMPARISON OF TOKENIZATION METHODS
print("\n10. COMPARISON OF TOKENIZATION METHODS")
print("-"*80)

test_text = "Don't split contractions! Email: user@example.com"
print(f"Test text: {test_text}")

# Method 1: Split
method1 = test_text.split()
print(f"\nMethod 1 (split): {method1}")
print(f"  Problem: Punctuation attached, email not handled")

# Method 2: Regex word boundaries
method2 = re.findall(r'\b\w+\b', test_text)
print(f"\nMethod 2 (regex \\b\\w+\\b): {method2}")
print(f"  Problem: Lost punctuation, split contractions")

# Method 3: Regex with punctuation
method3 = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", test_text)
print(f"\nMethod 3 (regex with contractions): {method3}")
print(f"  Better: Keeps contractions, separates punctuation")

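# Method 4: NLTK (if available)
try:
    from nltk.tokenize import word_tokenize
    method4 = word_tokenize(test_text)
    print(f"\nMethod 4 (NLTK): {method4}")
    print("  Treebank-style: contractions are split into two tokens and punctuation is separated")
    print("  Caveat: the email address may still be split at '@'")
except (ImportError, LookupError):
    # Catch only the expected failures instead of silencing every exception
    print("\nMethod 4 (NLTK): skipped (NLTK or its 'punkt' data is not available)")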

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Tokenization converts text into processable units (tokens)")
print("✓ Different methods: whitespace, regex, NLTK, spaCy, transformers")
print("✓ Character-level: Fine-grained, small vocab, long sequences")
print("✓ Word-level: Intuitive, OOV problems")
print("✓ Subword-level: Best of both worlds (BPE, WordPiece)")
print("✓ NLTK: Good for traditional NLP tasks")
print("✓ spaCy: Fast, production-ready, includes POS tagging")
print("✓ Transformers (BERT, GPT): State-of-the-art, subword tokenization")
print("✓ Choose tokenizer based on: task, language, vocabulary size")
print("✓ Always use same tokenizer for training and inference!")
print("="*80)

# 2. Whitespace Tokenization

## 📖 What is Whitespace Tokenization?

**Whitespace tokenization** splits text into tokens based on whitespace characters (spaces, tabs, newlines).

**Method:**
```python
tokens = text.split()  # Split on any whitespace
```

**Example:**
- Input: `"Hello world! How are you?"`
- Output: `["Hello", "world!", "How", "are", "you?"]`

**Key Features:**
- Simplest tokenization method
- Fast and memory efficient
- Language agnostic for space-delimited languages
- Preserves punctuation with words

## 🎯 Why Use Whitespace Tokenization?

### **Advantages:**
1. **Simplicity** - One line of code: `text.split()`
2. **Speed** - Among the fastest tokenization methods
3. **No Dependencies** - Built-in Python function
4. **Predictable** - Easy to understand behavior
5. **Memory Efficient** - Minimal overhead

### **Disadvantages:**
1. **Punctuation Issues** - "world!" includes punctuation
2. **Contractions** - "don't" stays as one token
3. **No Special Handling** - URLs, emails treated naively
4. **Language Limitations** - Fails for languages without spaces (Chinese, Japanese)
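
The last limitation is easy to see directly. In this small sketch the Chinese string is an illustrative rendering of "I love natural language processing"; since it contains no spaces, `split()` returns it unchanged as a single token.

```python
english = "I love natural language processing"
chinese = "我爱自然语言处理"  # written without spaces between words

english_tokens = english.split()
chinese_tokens = chinese.split()
chinese_chars = list(chinese)   # character fallback, not real word segmentation

print(english_tokens)
print(chinese_tokens)
print(chinese_chars)
```

Splitting into characters at least yields usable units, but recovering actual words needs a dedicated segmenter.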

## ⏱️ When to Use Whitespace Tokenization

### ✅ **Use When:**

**1. Quick Prototyping**
- Example: Initial data exploration
- Why: Get quick token count, basic stats
- Benefit: No setup required

**2. Pre-cleaned Text**
- Example: Text already processed (punctuation removed)
- Why: Whitespace split is sufficient
- Use case: "hello world how are you" → ["hello", "world", "how", "are", "you"]

**3. Simple Bag-of-Words Models**
- Example: Basic text classification
- Why: Don't need sophisticated tokenization
- Performance: Good enough for simple tasks

**4. Code/Log Processing**
- Example: Parse log files already space-separated
- Why: Structure already well-defined
- Use case: "2024-12-11 ERROR Failed" → ["2024-12-11", "ERROR", "Failed"]

**5. Performance-Critical Applications**
- Example: Process millions of documents quickly
- Why: Typically far faster than rule-based or model-based tokenizers
- Benefit: Low latency, high throughput

### ❌ **Don't Use When:**

**1. Need Punctuation Separation**
- Problem: "world!" vs "world" treated differently
- Better: Use NLTK or regex tokenization
- Why: Punctuation affects meaning and frequency counts

**2. Handling Contractions**
- Problem: "don't" should be ["do", "n't"] or ["do", "not"]
- Better: Use TreeBank tokenizer
- Why: Contractions need special handling

**3. Production NLP Systems**
- Problem: Too naive for real-world text
- Better: Use spaCy or transformers
- Why: Need robust handling of edge cases

**4. Languages Without Spaces**
- Problem: Chinese, Japanese have no word boundaries
- Better: Use language-specific tokenizers (Jieba for Chinese)
- Why: Whitespace tokenization fails completely

**5. Academic/Research Work**
- Problem: Rarely matches the tokenization used in published work
- Better: Use established tokenizers (NLTK, spaCy)
- Why: Need consistency with published benchmarks
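
As an illustration of the contraction case above, a whitespace split can be followed by a small regex pass that separates common English clitics. This is a simplified sketch with a hypothetical helper `split_contraction`; the real Treebank rules cover many more cases, and tokens with trailing punctuation such as "don't." are deliberately not handled here.

```python
import re

# Simplified pattern: a stem followed by n't or one of 's, 're, 've, 'll, 'd, 'm
CONTRACTION = re.compile(r"(?i)^(\w+?)(n't|'(?:s|re|ve|ll|d|m))$")

def split_contraction(token):
    """Split a contraction into stem and clitic; return other tokens unchanged."""
    match = CONTRACTION.match(token)
    return [match.group(1), match.group(2)] if match else [token]

text = "I don't know"
tokens = [piece for tok in text.split() for piece in split_contraction(tok)]
print(tokens)
```

The lazy `\w+?` lets the stem stop early so that "n't" is taken as a unit instead of leaving a dangling "'t".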

## 📊 How It Works

**Algorithm:**
1. Scan text left to right
2. When whitespace found, split
3. Collect non-whitespace sequences as tokens
4. Return list of tokens

**Whitespace Characters:**
- Space: ` `
- Tab: `\t`
- Newline: `\n`
- Carriage return: `\r`
- Also form feed (`\f`), vertical tab (`\v`), and Unicode spaces such as the non-breaking space

**Time Complexity:** O(n) where n is text length  
**Space Complexity:** O(n) for token storage
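
The four algorithm steps can be written out by hand to show where the O(n) cost comes from. This is an explanatory sketch only (the function name `whitespace_tokenize` is hypothetical); in practice the built-in `str.split()` does the same job much faster.

```python
def whitespace_tokenize(text):
    """Single left-to-right pass that collects runs of non-whitespace characters."""
    tokens, current = [], []
    for ch in text:
        if ch.isspace():
            if current:                      # a token just ended
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # flush the final token
        tokens.append("".join(current))
    return tokens

sample = "  Hello\tworld!\n How  are you? "
manual_tokens = whitespace_tokenize(sample)
print(manual_tokens)
```

Each character is visited once and appended at most once, which gives the linear time and space bounds stated above.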

## 🌍 Real-World Applications

1. **Log Analysis** - Parse server logs, error messages
2. **Data Exploration** - Quick token counts for EDA
3. **Simple Search** - Basic keyword search engines
4. **Word Clouds** - Generate visualizations quickly
5. **Code Parsing** - Split code into identifiers
6. **Delimited Data** - Split tab/space-separated values
7. **Command Line Parsing** - Split simple shell commands (no quoted arguments)
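
For command lines in particular, a plain whitespace split breaks as soon as an argument is quoted. The sketch below contrasts it with the standard library's `shlex.split`, which follows shell-like quoting rules; the example command is made up.

```python
import shlex

command = 'grep -r "hello world" src'

naive = command.split()             # cuts the quoted argument in two
quote_aware = shlex.split(command)  # keeps "hello world" together

print(naive)
print(quote_aware)
```

So whitespace splitting is fine for simple commands, and `shlex` is the safer choice once quotes or escapes can appear.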

## 💡 Key Insights

✅ Among the fastest tokenization methods (a single built-in call)  
✅ Use `.split()` with no args (splits on any whitespace)  
✅ Use `.split(' ')` to split only on spaces; consecutive spaces then produce empty-string tokens  
✅ With no args, `.split()` collapses multiple consecutive whitespace characters  
✅ With no args, `.split()` also ignores leading/trailing whitespace  
✅ Good for **initial exploration**, not production  
✅ Combine with `.lower()` for case-insensitive matching  
✅ `.strip()` first is only needed when splitting on an explicit separator such as `' '`
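
The difference between `.split()` and `.split(' ')` is easiest to see on a string with repeated and mixed whitespace. A short sketch:

```python
text = "  a  b\tc "

default_split = text.split()      # any run of whitespace is one separator
space_split = text.split(" ")     # every single space is a separator

print(default_split)
print(space_split)
```

With an explicit `' '` separator, leading, trailing, and doubled spaces show up as empty strings and the tab stays inside a token, so the no-argument form is usually what is wanted for tokenization.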

In [None]:
# WHITESPACE TOKENIZATION - COMPLETE EXAMPLE

print("="*80)
print("WHITESPACE TOKENIZATION - COMPREHENSIVE GUIDE")
print("="*80)

import time
from collections import Counter

# 1. BASIC WHITESPACE TOKENIZATION
print("\n1. BASIC WHITESPACE TOKENIZATION")
print("-"*80)

text = "Hello world! How are you today?"
print(f"Text: {text}")

# Simple split
tokens = text.split()
print(f"\nTokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"\nObservation: Punctuation attached to words (world!, today?)")

# 2. DIFFERENT WHITESPACE CHARACTERS
print("\n2. HANDLING DIFFERENT WHITESPACE TYPES")
print("-"*80)

# Multiple spaces
text_spaces = "Hello    world     with    spaces"
print(f"Text with multiple spaces: '{text_spaces}'")
tokens_spaces = text_spaces.split()
print(f"Tokens: {tokens_spaces}")
print(f"Note: Multiple spaces treated as single separator")

# Tabs
text_tabs = "Hello\tworld\twith\ttabs"
print(f"\nText with tabs: '{text_tabs}'")
tokens_tabs = text_tabs.split()
print(f"Tokens: {tokens_tabs}")

# Newlines
text_newlines = "Hello\nworld\nwith\nnewlines"
print(f"\nText with newlines: '{text_newlines}'")
tokens_newlines = text_newlines.split()
print(f"Tokens: {tokens_newlines}")

# Mixed whitespace
text_mixed = "Hello\t  world\n  with   mixed\r\nwhitespace"
print(f"\nText with mixed whitespace: '{text_mixed}'")
tokens_mixed = text_mixed.split()
print(f"Tokens: {tokens_mixed}")
print(f"Note: .split() handles all whitespace types automatically")

# 3. SPLIT WITH SPECIFIC SEPARATOR
print("\n3. SPLIT WITH SPECIFIC SEPARATOR")
print("-"*80)

text_sep = "apple,banana,orange,grape"
print(f"CSV text: {text_sep}")

# Split on comma
tokens_comma = text_sep.split(',')
print(f"Split on comma: {tokens_comma}")

# Split only on single space (not tab, newline)
text_space = "hello world\twith\ttabs"
print(f"\nText: '{text_space}'")
tokens_space_only = text_space.split(' ')
print(f"Split on space only: {tokens_space_only}")
print(f"Note: Tabs preserved in tokens")

# 4. LEADING/TRAILING WHITESPACE
print("\n4. LEADING/TRAILING WHITESPACE HANDLING")
print("-"*80)

text_edges = "   Hello world   "
print(f"Text with edge spaces: '{text_edges}'")

# Without strip
tokens_no_strip = text_edges.split()
print(f"Tokens: {tokens_no_strip}")
print(f"Note: .split() automatically removes leading/trailing whitespace")

# Explicit strip (redundant but clear)
tokens_strip = text_edges.strip().split()
print(f"With .strip(): {tokens_strip}")
print(f"Same result!")

# 5. CASE SENSITIVITY
print("\n5. CASE SENSITIVITY IN TOKENIZATION")
print("-"*80)

text_case = "Python python PYTHON PyThOn"
print(f"Text: {text_case}")

# Case-sensitive
tokens_case = text_case.split()
print(f"\nCase-sensitive tokens: {tokens_case}")
print(f"Unique tokens: {len(set(tokens_case))}")

# Case-insensitive
tokens_lower = text_case.lower().split()
print(f"\nLowercase tokens: {tokens_lower}")
print(f"Unique tokens: {len(set(tokens_lower))}")
print(f"Note: All variants become 'python'")

# 6. PERFORMANCE COMPARISON
print("\n6. PERFORMANCE COMPARISON")
print("-"*80)

# Generate large text
large_text = " ".join(["word"] * 100000)  # 100K words
print(f"Text size: {len(large_text):,} characters, ~100,000 words")

# Whitespace tokenization
# perf_counter() is a high-resolution timer intended for measuring short durations
start = time.perf_counter()
tokens_ws = large_text.split()
time_ws = time.perf_counter() - start
print(f"\nWhitespace tokenization: {time_ws:.6f} seconds")
print(f"Tokens: {len(tokens_ws):,}")

# Compare with a manual character-by-character loop (slower)
start = time.perf_counter()
tokens_loop = []
current = ""
for char in large_text:
    if char == ' ':
        if current:
            tokens_loop.append(current)
            current = ""
    else:
        current += char
if current:
    tokens_loop.append(current)
time_loop = time.perf_counter() - start
print(f"\nManual loop: {time_loop:.6f} seconds")
print(f"Speedup: {time_loop/time_ws:.1f}x faster with .split()")

# 7. HANDLING PUNCTUATION
print("\n7. PUNCTUATION HANDLING")
print("-"*80)

text_punct = "Hello, world! How are you? I'm fine."
print(f"Text: {text_punct}")

tokens_punct = text_punct.split()
print(f"\nTokens: {tokens_punct}")
print(f"\nIssues:")
print(f"  - 'Hello,' includes comma")
print(f"  - 'world!' includes exclamation")
print(f"  - 'you?' includes question mark")
print(f"  - 'I'm' is single token (contraction)")

# Remove punctuation manually
import string
text_no_punct = text_punct.translate(str.maketrans('', '', string.punctuation))
tokens_clean = text_no_punct.split()
print(f"\nAfter removing punctuation: {tokens_clean}")
print(f"Note: 'I'm' became 'Im' (lost apostrophe)")

# 8. FREQUENCY ANALYSIS
print("\n8. TOKEN FREQUENCY ANALYSIS")
print("-"*80)

text_freq = """Machine learning is amazing. Machine learning transforms data.
Data is the new oil. Machine learning needs data."""

print(f"Text:\n{text_freq}")

# Tokenize and normalize
tokens_freq = text_freq.lower().split()
print(f"\nTotal tokens: {len(tokens_freq)}")

# Count frequencies
token_counts = Counter(tokens_freq)
print(f"Unique tokens: {len(token_counts)}")

print(f"\nTop 5 most frequent tokens:")
for token, count in token_counts.most_common(5):
    print(f"  {token:15s}: {count} times")

# 9. PRACTICAL EXAMPLE: LOG FILE PARSING
print("\n9. PRACTICAL EXAMPLE: LOG FILE PARSING")
print("-"*80)

log_entries = [
    "2024-12-11 10:30:45 INFO User login successful",
    "2024-12-11 10:31:12 ERROR Database connection failed",
    "2024-12-11 10:31:45 WARNING Disk space low",
    "2024-12-11 10:32:01 INFO User logout"
]

print("Log entries:")
for entry in log_entries:
    print(f"  {entry}")

print(f"\nParsed logs:")
for entry in log_entries:
    # maxsplit=3 keeps the free-text message intact as the fourth field;
    # the name 'timestamp' avoids shadowing the 'time' module imported above
    date, timestamp, level, message = entry.split(maxsplit=3)
    
    print(f"  Date: {date}, Time: {timestamp}, Level: {level}, Message: {message}")

# Count log levels
levels = [entry.split()[2] for entry in log_entries]
level_counts = Counter(levels)
print(f"\nLog level summary:")
for level, count in level_counts.items():
    print(f"  {level}: {count}")

# 10. LIMITATIONS DEMONSTRATION
print("\n10. LIMITATIONS OF WHITESPACE TOKENIZATION")
print("-"*80)

examples = [
    ("don't", "Contraction not split"),
    ("user@example.com", "Email kept as one token"),
    ("http://example.com", "URL kept as one token"),
    ("$100.50", "Currency symbol attached"),
    ("Dr. Smith", "Abbreviation period attached"),
    ("New York", "Multi-word entity split")
]

print("Problematic cases:")
for text, issue in examples:
    tokens = text.split()
    print(f"  '{text}' → {tokens}")
    print(f"    Issue: {issue}")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Whitespace tokenization: text.split()")
print("✓ Among the fastest tokenization methods (built-in Python)")
print("✓ Handles all whitespace types (space, tab, newline)")
print("✓ Automatically removes leading/trailing whitespace")
print("✓ Good for: quick exploration, simple tasks, pre-cleaned text")
print("✓ Limitations: punctuation attached, no special handling")
print("✓ Use .lower() for case-insensitive matching")
print("✓ Combine with punctuation removal for cleaner tokens")
print("✓ Not suitable for: production NLP, complex text, contractions")
print("="*80)