# 1. Tokenization

## 📖 What is Tokenization?

**Tokenization** is the process of breaking down text into smaller units called **tokens**. These tokens can be words, sentences, subwords, or characters.

**Types of Tokenization:**
```
1. Word Tokenization
   Input: "Hello, how are you?"
   Output: ["Hello", ",", "how", "are", "you", "?"]

2. Sentence Tokenization
   Input: "Hello! How are you? I'm fine."
   Output: ["Hello!", "How are you?", "I'm fine."]

3. Subword Tokenization (BPE, WordPiece)
   Input: "unhappiness"
   Output: ["un", "##happiness"] (WordPiece-style) or ["un", "happi", "ness"] (BPE-style; exact pieces depend on the learned vocabulary)

4. Character Tokenization
   Input: "Hello"
   Output: ["H", "e", "l", "l", "o"]
```

**Key Concepts:**
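- **Vocabulary**: The set of all unique tokens a model or pipeline knows
- **OOV (Out-of-Vocabulary)**: Tokens that do not appear in the vocabulary
- **Subword Tokenization**: Handles OOV words by breaking them into smaller, known parts
- **BPE (Byte Pair Encoding)**: Iteratively merges the most frequent adjacent symbol pair
- **WordPiece**: Used by BERT; similar to BPE, but picks merges by likelihood gain rather than raw frequency
- **SentencePiece**: Language-agnostic; trains directly on raw text and treats whitespace as an ordinary symbol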
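
To make the vocabulary and OOV ideas concrete, here is a minimal sketch in plain Python. The `<unk>` placeholder and the helper names `build_vocab` and `encode` are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_freq=1, unk_token="<unk>"):
    """Map each token seen at least min_freq times to an integer ID."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {unk_token: 0}  # reserve ID 0 for out-of-vocabulary tokens
    for tok in sorted(counts):
        if counts[tok] >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="<unk>"):
    """Replace each token with its ID, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

docs = [["i", "love", "nlp"], ["nlp", "is", "fun"]]
vocab = build_vocab(docs)
ids = encode(["i", "love", "tokenizers"], vocab)
print(vocab)
print(ids)  # the unseen word "tokenizers" maps to ID 0
```

With a word-level vocabulary like this, every unseen word collapses to the same ID, which is the information loss that subword tokenization tries to avoid.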

## 🎯 Why Use Tokenization?

### **Advantages:**
1. **First Step in NLP** - Nearly every text-processing pipeline starts with it
2. **Standardization** - Converts text into processable units
3. **Feature Extraction** - Tokens become features for ML models
4. **Handles Multiple Languages** - Subword and character tokenizers can be trained on any language
5. **Vocabulary Control** - Limits model size via subword tokenization

### **Challenges:**
1. **Language-Specific Rules** - English ≠ Chinese ≠ Arabic
2. **Ambiguity** - "New York" (1 token or 2?)
3. **Special Cases** - Contractions ("don't"), hyphenated words
4. **OOV Problem** - New/rare words not in vocabulary

## ⏱️ When to Use Different Tokenization Types

### ✅ **Word Tokenization - Use When:**

**1. Traditional ML Models**
- Example: Naive Bayes, Logistic Regression for text classification
- Why: Simple features, interpretable
- Models expect word-level features

**2. Bag-of-Words / TF-IDF**
- Example: Document similarity, search engines
- Why: Word counts are meaningful
- Each word becomes a dimension

**3. Small, Controlled Vocabulary**
- Example: Domain-specific chatbot (banking terms)
- Why: Limited set of words, no rare terms
- Vocabulary < 10K words

**4. Keyword Extraction**
- Example: SEO analysis, document tagging
- Why: Whole words are meaningful units
- Multiword terms such as "machine learning" may need phrase detection on top of word tokens

### ✅ **Sentence Tokenization - Use When:**

**1. Summarization**
- Example: Extract 3 most important sentences
- Why: Sentences are atomic units of meaning
- Preserve complete thoughts

**2. Translation**
- Example: Machine translation systems
- Why: Translate sentence-by-sentence
- Context stays within sentence boundaries

**3. Question Answering**
- Example: Find sentence containing answer
- Why: Answers typically span 1-2 sentences
- Granular enough for precise answers

### ✅ **Subword Tokenization - Use When:**

**1. Transformer Models (BERT, GPT)**
- Example: Using pre-trained BERT
- Why: These models use subword tokenization
- WordPiece/BPE is built-in

**2. Handling OOV Words**
- Example: Social media text ("coooool", "lololol")
- Why: Break rare words into known subwords
- "unhappiness" → "un" + "happiness"

**3. Morphologically Rich Languages**
- Example: German, Finnish, Turkish
- Why: Words have many inflections
- "unübersetzbarkeiten" → multiple subwords

**4. Large Vocabulary Control**
- Example: Multilingual models
- Why: Keep vocab size manageable (roughly 30K-50K for monolingual models such as BERT and GPT-2; multilingual vocabularies are usually larger)
- Balance between word and character level

### ❌ **Don't Use Word Tokenization When:**

**1. High OOV Rate**
- Problem: Medical texts with rare terms
- Better: Subword tokenization
- Why: Word tokenization fails on unknowns

**2. Character-Level Features Matter**
- Problem: Spell checking, DNA sequences
- Better: Character tokenization
- Why: Need to see individual characters

**3. Using Pre-trained Transformers**
- Problem: Want to use BERT embeddings
- Better: Use BERT's tokenizer (WordPiece)
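- Why: The model's embeddings are indexed by its own tokenizer's vocabulary, so tokens from another tokenizer map to wrong or unknown IDs

## 📊 How It Works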

**Word Tokenization Algorithm:**
1. Split on whitespace
2. Handle punctuation (keep or remove)
3. Apply language-specific rules
4. Return list of word tokens
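
The word tokenization steps above can be sketched with a single regular expression. This is a minimal illustration, not how NLTK or spaCy work internally; the pattern and the name `simple_word_tokenize` are assumptions made for this example, and the "language-specific rule" here is simply to keep internal apostrophes and hyphens inside a word.

```python
import re

# A word (optionally with internal apostrophes or hyphens), OR a single
# non-space, non-word character such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['\-]\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Return word and punctuation tokens in their original order."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize("Hello, how are you? Don't over-think it.")
print(tokens)
```

Unlike NLTK's `word_tokenize`, this sketch keeps "Don't" as one token; which behavior is preferable depends on the downstream task.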

**Subword Tokenization (BPE) Algorithm:**
1. Start with a vocabulary of individual characters
2. Count adjacent symbol pairs and find the most frequent one
3. Merge that pair into a new token
4. Repeat until the target vocab size is reached
5. Example: "low" + "est" → "lowest" (learned merge)
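
The merge loop can be sketched as follows. This is a simplified illustration under stated assumptions: the toy word-frequency dictionary is made up, the function name `learn_bpe` is hypothetical, and the end-of-word markers and byte-level handling used by production tokenizers are omitted.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Represent each word as a tuple of symbols, starting from characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy data
merges, segmented = learn_bpe(word_freqs, num_merges=2)
print(merges)
print(list(segmented))
```

On this toy corpus the first two merges build the suffix "est" as a single symbol, because "e"+"s" and then "es"+"t" are the most frequent adjacent pairs.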

**Sentence Tokenization:**
1. Detect sentence boundaries (. ! ?)
2. Handle abbreviations (Dr., Mr., etc.)
3. Consider context ("Ph.D." not sentence end)
4. Return list of sentences
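
A rule-based version of these sentence tokenization steps might look like the sketch below. The abbreviation list is a deliberately tiny, hypothetical example and `simple_sent_tokenize` is an illustrative name; trained tokenizers such as NLTK's Punkt learn abbreviations from data instead.

```python
# Hypothetical, deliberately small abbreviation list for illustration.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "ph.d.", "etc."}

def simple_sent_tokenize(text):
    """Split after tokens ending in . ! or ? unless they are known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith earned his Ph.D. in 2020. He is happy! Are you?"
sentences = simple_sent_tokenize(text)
for s in sentences:
    print(s)
```

The sketch has known gaps: it misses a sentence that ends with an abbreviation, and it does not split when the closing punctuation is followed by a quotation mark.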

## 🌍 Real-World Applications

1. **Search Engines** - Tokenize queries and documents (Google, Bing)
2. **Chatbots** - Tokenize user messages (Siri, Alexa)
3. **Translation** - Sentence tokenization (Google Translate)
4. **Sentiment Analysis** - Word tokenization (Twitter sentiment)
5. **Text Classification** - Spam detection, category classification
6. **Information Extraction** - Named entity recognition
7. **Question Answering** - Tokenize questions and contexts
8. **Text Summarization** - Sentence tokenization
9. **Code Analysis** - Tokenize programming languages
10. **Speech Recognition** - Tokenize transcribed text

## 💡 Key Insights

✅ **Use pre-built tokenizers** (NLTK, spaCy, transformers)  
✅ **Match tokenizer to model** - BERT needs WordPiece  
✅ **Consider language** - Chinese/Japanese need special tokenizers  
✅ **Handle contractions** - "don't" → "do" + "n't" or "don't"?  
✅ **Preserve important tokens** - "New York" as single entity  
✅ **Lowercase after tokenization** - Tokenizers use case cues such as "U.S." and sentence-initial capitals  
✅ **Subword for production** - Handles OOV gracefully  
✅ **Sentence boundaries matter** - Use robust sentence tokenizer  
✅ **Benchmark speed** - Tokenization can be a bottleneck  
✅ **Version tokenizers** - Changes affect downstream models

In [None]:
# TOKENIZATION - COMPLETE EXAMPLE

print("="*80)
print("TOKENIZATION - COMPREHENSIVE GUIDE")
print("="*80)

import nltk
import re
from collections import Counter

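# Download required NLTK data ('punkt' for older NLTK, 'punkt_tab' for newer releases)
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        print(f"Downloading NLTK {resource} tokenizer...")
        nltk.download(resource, quiet=True)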

# Sample text for demonstration
sample_text = """
Natural Language Processing (NLP) is amazing! It's a field of AI that focuses on 
the interaction between computers and humans. Dr. Smith published a paper on this 
topic in 2023. He said, "NLP will revolutionize technology." Key techniques include:
tokenization, stemming, and lemmatization. Visit www.nlp-tutorial.com for more info.
"""

# 1. WORD TOKENIZATION
print("\n1. WORD TOKENIZATION")
print("-"*80)

# Method 1: Simple split (naive)
simple_tokens = sample_text.split()
print("Simple split (naive):")
print(f"Tokens: {simple_tokens[:10]}")
print(f"Total tokens: {len(simple_tokens)}")
print("Issues: Punctuation attached, doesn't handle special cases\n")

# Method 2: Regex-based
regex_tokens = re.findall(r'\b\w+\b', sample_text)
print("Regex-based (\\b\\w+\\b):")
print(f"Tokens: {regex_tokens[:10]}")
print(f"Total tokens: {len(regex_tokens)}")
print("Issues: Removes all punctuation, splits contractions (It's -> It, s)\n")

# Method 3: NLTK word_tokenize (recommended)
from nltk.tokenize import word_tokenize
nltk_tokens = word_tokenize(sample_text)
print("NLTK word_tokenize (recommended):")
print(f"Tokens: {nltk_tokens[:15]}")
print(f"Total tokens: {len(nltk_tokens)}")
print("Benefits: Handles punctuation, contractions, special cases\n")

# Method 4: spaCy (if installed)
try:
    import spacy
    try:
        nlp = spacy.load('en_core_web_sm')
        doc = nlp(sample_text)
        spacy_tokens = [token.text for token in doc]
        print("spaCy tokenizer:")
        print(f"Tokens: {spacy_tokens[:15]}")
        print(f"Total tokens: {len(spacy_tokens)}")
        print("Benefits: Includes POS tags, dependencies, named entities\n")
    except OSError:
        print("spaCy model not installed. Run: python -m spacy download en_core_web_sm\n")
except ImportError:
    print("spaCy not installed. Run: pip install spacy\n")

# 2. SENTENCE TOKENIZATION
print("\n2. SENTENCE TOKENIZATION")
print("-"*80)

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(sample_text)
print(f"Number of sentences: {len(sentences)}\n")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence.strip()}")

print("\nHarder cases with abbreviations:")
complex_text = "Dr. Smith earned his Ph.D. in 2020. He works at A.I. Corp. The company is great!"
complex_sentences = sent_tokenize(complex_text)
for i, sent in enumerate(complex_sentences, 1):
    print(f"  {i}. {sent}")

# 3. SUBWORD TOKENIZATION (FIXED-LENGTH CHUNKING, NOT REAL BPE)
print("\n3. SUBWORD TOKENIZATION (FIXED-LENGTH CHUNKS)")
print("-"*80)

# Toy stand-in for subword tokenization: real BPE learns its merges from
# corpus statistics, whereas this simply cuts words into fixed-size pieces.
def simple_chunk_tokenize(word, max_subword_len=3):
    """Split a word into consecutive chunks of at most max_subword_len characters."""
    if len(word) <= max_subword_len:
        return [word]
    return [word[i:i + max_subword_len]
            for i in range(0, len(word), max_subword_len)]

test_words = ['unhappiness', 'unbelievable', 'preprocessing', 'tokenization']
print("Fixed-length chunk examples:\n")
for word in test_words:
    subwords = simple_chunk_tokenize(word, max_subword_len=4)
    print(f"  {word:15s} → {subwords}")

print("\nBenefits of real (learned) subword tokenization:")
print("  ✓ Handles OOV words (e.g., 'supercalifragilistic')")
print("  ✓ Smaller vocabulary size")
print("  ✓ Captures morphology (prefixes/suffixes)")

# 4. TRANSFORMER TOKENIZATION (BERT-STYLE)
print("\n4. TRANSFORMER TOKENIZATION (BERT/GPT)")
print("-"*80)

try:
    from transformers import BertTokenizer, GPT2Tokenizer
    
    # BERT tokenizer (WordPiece)
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    test_sentence = "Tokenization is preprocessing text."
    
    bert_tokens = bert_tokenizer.tokenize(test_sentence)
    print("BERT (WordPiece) tokenization:")
    print(f"  Input: {test_sentence}")
    print(f"  Tokens: {bert_tokens}")
    print(f"  Token IDs: {bert_tokenizer.convert_tokens_to_ids(bert_tokens)}")
    
    # GPT-2 tokenizer (BPE)
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    gpt2_tokens = gpt2_tokenizer.tokenize(test_sentence)
    print("\nGPT-2 (BPE) tokenization:")
    print(f"  Input: {test_sentence}")
    print(f"  Tokens: {gpt2_tokens}")
    print(f"  Token IDs: {gpt2_tokenizer.convert_tokens_to_ids(gpt2_tokens)}")
    
    # Handling OOV with subword tokenization
    oov_word = "supercalifragilisticexpialidocious"
    print(f"\nHandling OOV word: '{oov_word}'")
    print(f"  BERT tokens: {bert_tokenizer.tokenize(oov_word)}")
    print(f"  GPT-2 tokens: {gpt2_tokenizer.tokenize(oov_word)}")
    
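except (ImportError, OSError) as exc:
    # ImportError: transformers is missing; OSError: tokenizer files could not be loaded
    print(f"Transformer tokenizers unavailable: {exc}")
    print("Run: pip install transformers (the first run also needs network access)")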

# 5. CHARACTER TOKENIZATION
print("\n5. CHARACTER TOKENIZATION")
print("-"*80)

text = "Hello!"
char_tokens = list(text)
print(f"Text: {text}")
print(f"Character tokens: {char_tokens}")
print(f"Total characters: {len(char_tokens)}")
print("\nUse cases:")
print("  ✓ Spell checking")
print("  ✓ Character-level language models")
print("  ✓ DNA/protein sequence analysis")
print("  ✓ Handwriting recognition")

# 6. TOKENIZATION COMPARISON
print("\n6. TOKENIZATION METHOD COMPARISON")
print("-"*80)

comparison_text = "She's reading Dr. Johnson's book about AI."

print(f"Input: \"{comparison_text}\"\n")

methods = {
    'Simple split': comparison_text.split(),
    'Regex \\w+': re.findall(r'\w+', comparison_text),
    'NLTK word_tokenize': word_tokenize(comparison_text),
    'Character-level': list(comparison_text.replace(' ', ''))
}

for method, tokens in methods.items():
    print(f"{method:20s}: {tokens}")

# 7. VOCABULARY BUILDING
print("\n7. VOCABULARY BUILDING FROM TOKENS")
print("-"*80)

corpus = [
    "I love natural language processing",
    "Natural language processing is amazing",
    "I love learning about NLP"
]

# Tokenize all documents
all_tokens = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    all_tokens.extend(tokens)

# Build vocabulary
vocab = sorted(set(all_tokens))
token_freq = Counter(all_tokens)

print(f"Corpus: {len(corpus)} documents")
print(f"Total tokens: {len(all_tokens)}")
print(f"Vocabulary size: {len(vocab)}")
print(f"\nVocabulary: {vocab}")
print(f"\nTop 5 most frequent tokens:")
for token, freq in token_freq.most_common(5):
    print(f"  '{token}': {freq} times")

# 8. PRACTICAL APPLICATION: PREPROCESSING PIPELINE
print("\n8. PRACTICAL APPLICATION: TEXT PREPROCESSING PIPELINE")
print("-"*80)

def preprocess_text(text, lowercase=True, remove_punct=False):
    """
    Complete preprocessing pipeline
    """
    # Tokenize
    tokens = word_tokenize(text)
    
    # Lowercase
    if lowercase:
        tokens = [t.lower() for t in tokens]
    
    # Remove punctuation-only tokens (keeps contraction pieces like "'s" and "n't")
    if remove_punct:
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    
    return tokens

demo_text = "Hello! This is a TEST sentence. It's quite simple."

print(f"Original text: {demo_text}\n")
print(f"Default: {preprocess_text(demo_text)}")
print(f"No lowercase: {preprocess_text(demo_text, lowercase=False)}")
print(f"Remove punct: {preprocess_text(demo_text, remove_punct=True)}")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Tokenization breaks text into processable units")
print("✓ Word tokenization: Traditional ML (Naive Bayes, TF-IDF)")
print("✓ Sentence tokenization: Summarization, translation")
print("✓ Subword tokenization: Transformers (BERT, GPT), handles OOV")
print("✓ Character tokenization: Spell checking, char-level models")
print("✓ Use NLTK/spaCy for robust tokenization")
print("✓ Match tokenizer to downstream model")
print("✓ BERT uses WordPiece, GPT uses BPE")
print("✓ Build vocabulary from tokenized corpus")
print("="*80)

# 2. Lowercasing & Case Normalization

## 📖 What is Lowercasing?

**Lowercasing** is the process of converting all text characters to lowercase to reduce vocabulary size and treat words like "Apple" and "apple" as the same token.

**Types of Case Normalization:**
```
1. Full Lowercasing
   Input: "Apple Inc. Makes iPhones"
   Output: "apple inc. makes iphones"

2. Selective Lowercasing (Preserve Acronyms)
   Input: "NASA launched the Artemis mission"
   Output: "NASA launched the artemis mission"

3. Title Case
   Input: "natural language processing"
   Output: "Natural Language Processing"

4. Sentence Case
   Input: "HELLO WORLD. HOW ARE YOU?"
   Output: "Hello world. How are you?"
```
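
Sentence case is the least obvious of the four, so here is a minimal sketch of it. The function name `sentence_case` is an assumption for illustration; the pattern only handles ASCII letters and does not restore proper nouns or the pronoun "I".

```python
import re

def sentence_case(text):
    """Lowercase everything, then capitalize the first letter of each sentence."""
    lowered = text.lower()
    # Uppercase a letter at the start of the text, or after . ! ? plus whitespace.
    return re.sub(
        r"(^\s*|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )

result = sentence_case("HELLO WORLD. HOW ARE YOU?")
print(result)
```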

**Key Concepts:**
- **Case-Insensitive Matching**: "Apple" = "apple" = "APPLE"
- **Vocabulary Reduction**: "The" and "the" become single token
- **Information Loss**: "Apple" (company) vs "apple" (fruit)
- **Language-Specific**: Turkish İ/i, German ß
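
The language-specific point can be seen directly in Python's built-in string methods. The sketch below compares `str.lower()` with `str.casefold()`, which is intended for caseless matching; the example words are chosen for illustration.

```python
# str.lower() vs str.casefold() on characters with special case mappings.
german = "Stra\u00dfe"  # "Straße"
print(german.lower())     # ß is already lowercase, so it is left unchanged
print(german.casefold())  # casefold() maps ß to "ss"
same_word = german.casefold() == "STRASSE".casefold()
print(same_word)

turkish_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = turkish_i.lower()
print(len(lowered))  # 2: "i" followed by a combining dot above (U+0307)
```

Python's case methods are locale-independent, so they do not apply Turkish rules (for example, uppercase "I" becomes "i", not dotless "ı"). Text in such languages may need a locale-aware library or explicit mapping.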

## 🎯 Why Use Lowercasing?

### **Advantages:**
1. **Smaller Vocabulary** - Can reduce unique tokens substantially (30-50% is a rough figure; measure it on your corpus)
2. **Better Generalization** - Model sees "car" and "Car" as same
3. **Simpler Matching** - Case-insensitive search
4. **Consistent Features** - No duplicate features for case variants
5. **Faster Training** - Smaller vocab = faster model

### **Disadvantages:**
1. **Loses Information** - "Apple Inc." vs "apple fruit"
2. **Acronym Confusion** - "US" (country) vs "us" (pronoun)
3. **Named Entity Issues** - "Paris" (city) vs "paris" (word)
4. **Sentiment Loss** - "AMAZING!!!" vs "amazing" (intensity lost)

## ⏱️ When to Use Lowercasing

### ✅ **Use Lowercasing When:**

**1. Text Classification (Sentiment, Spam)**
- Example: Email spam detection
- Why: Case doesn't affect spam/not-spam
- "FREE MONEY" and "free money" both spam

**2. Search Engines**
- Example: Google search
- Why: Users type queries in any case
- "Python tutorial" = "python tutorial" = "PYTHON TUTORIAL"

**3. Bag-of-Words / TF-IDF**
- Example: Document similarity
- Why: "Machine" and "machine" should count together
- Reduces vocabulary size

**4. Small Datasets**
- Example: 500 training samples
- Why: Not enough data to learn case patterns
- "The" appears 50 times, "the" appears 200 times → merge

**5. Language Models (Informal Text)**
- Example: Twitter sentiment analysis
- Why: Social media text has inconsistent casing
- "lol", "LOL", "Lol" all mean same thing

**6. Keyword Matching**
- Example: Filter support tickets by keywords
- Why: Keywords may appear in any case
- Match "refund", "Refund", "REFUND"

### ❌ **Don't Use Lowercasing When:**

**1. Named Entity Recognition (NER)**
- Problem: Detect people, places, organizations
- Better: Keep original case
- Why: "Apple" (company) vs "apple" (fruit)
- Example: "Washington" (person/city) vs "washington" (word)

**2. Part-of-Speech Tagging**
- Problem: Tag word types (noun, verb, etc.)
- Better: Preserve case
- Why: "Polish" (adjective) vs "polish" (verb)
- Example: "Turkey" (proper noun) vs "turkey" (common noun)

**3. Machine Translation**
- Problem: Translate German to English
- Better: Keep case
- Why: German nouns are always capitalized, so case separates nouns from other word classes
- Example: "Essen" (food, noun) vs "essen" (to eat, verb)

**4. Case Carries Meaning**
- Problem: Acronyms, proper nouns
- Better: Selective lowercasing
- Why: "US" ≠ "us", "IT" (info tech) ≠ "it"

**5. Sentiment Analysis (Intensity)**
- Problem: Detect emotion strength
- Better: Preserve case as feature
- Why: "LOVE IT!!!" stronger than "love it"

**6. Question Answering**
- Problem: Answer "Who is the president?"
- Better: Keep case
- Why: Answer is proper noun ("Joe Biden")
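
For the sentiment-intensity case above, one option is to lowercase the text but record the original casing as separate features. The sketch below is a minimal illustration; the feature names and the function `case_features` are assumptions, not an established recipe.

```python
def case_features(text):
    """Return the lowercased text plus simple features describing the original casing."""
    # Only consider tokens that contain at least one letter.
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    # "Shouted" words are fully uppercase and have more than one letter,
    # so the pronoun "I" is not counted.
    shouted = [w for w in words
               if w.isupper() and sum(ch.isalpha() for ch in w) > 1]
    return {
        "text": text.lower(),
        "all_caps_ratio": len(shouted) / len(words) if words else 0.0,
        "exclamation_count": text.count("!"),
    }

loud = case_features("LOVE IT!!!")
quiet = case_features("love it")
print(loud)
print(quiet)
```

The model then gets the smaller lowercased vocabulary while the intensity signal survives as numeric features.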

## 📊 How It Works

**Simple Lowercasing:**
```python
text.lower()  # Python built-in
```

**Selective Lowercasing (Preserve Acronyms):**
```python
def selective_lowercase(text):
    tokens = text.split()
    return ' '.join(
        token if token.isupper() and len(token) > 1  # Keep acronyms
        else token.lower()
        for token in tokens
    )
```

## 🌍 Real-World Applications

1. **Search Engines** - Case-insensitive search (Google, Bing)
2. **Email Filters** - Spam detection
3. **Chatbots** - User intent classification
4. **Sentiment Analysis** - Product reviews (when case doesn't matter)
5. **Text Classification** - Topic categorization
6. **Information Retrieval** - Document search
7. **Autocomplete** - Suggest terms regardless of case

## 💡 Key Insights

✅ **Lowercase AFTER tokenization** - Preserve "U.S." before lowercasing  
✅ **Consider task requirements** - NER needs case, spam detection doesn't  
✅ **Test both ways** - A/B test with and without lowercasing  
✅ **Preserve acronyms** - "USA", "NASA", "FBI" if important  
✅ **Language-specific** - Handle German ß, Turkish İ  
✅ **Check pre-trained models** - BERT has cased/uncased versions  
✅ **Document decision** - Note if model is case-sensitive  
✅ **Vocabulary size impact** - Measure before/after  
✅ **Sentiment intensity** - Keep ALL CAPS if analyzing emotion  
✅ **Combine with other preprocessing** - Stop words, stemming

In [None]:
# LOWERCASING & CASE NORMALIZATION - COMPLETE EXAMPLE

print("="*80)
print("LOWERCASING & CASE NORMALIZATION - COMPREHENSIVE GUIDE")
print("="*80)

import re
from collections import Counter

# Sample texts demonstrating case importance
sample_texts = [
    "Apple Inc. makes iPhones in California.",
    "I love eating apples from the farmers market.",
    "NASA launched the Artemis mission.",
    "The US president visited Paris, France.",
    "AMAZING product!!! HIGHLY RECOMMENDED!!!",
    "The IT department uses IT infrastructure."
]

# 1. BASIC LOWERCASING
print("\n1. BASIC LOWERCASING")
print("-"*80)

for text in sample_texts[:3]:
    lowercased = text.lower()
    print(f"Original:   {text}")
    print(f"Lowercased: {lowercased}")
    print()

# 2. VOCABULARY SIZE COMPARISON
print("\n2. VOCABULARY SIZE IMPACT")
print("-"*80)

corpus = " ".join(sample_texts)

# Original vocabulary
original_tokens = corpus.split()
original_vocab = set(original_tokens)

# Lowercased vocabulary
lowercased_tokens = [t.lower() for t in original_tokens]
lowercased_vocab = set(lowercased_tokens)

print(f"Original vocabulary size: {len(original_vocab)}")
print(f"Lowercased vocabulary size: {len(lowercased_vocab)}")
print(f"Reduction: {len(original_vocab) - len(lowercased_vocab)} tokens")
print(f"Percentage: {(1 - len(lowercased_vocab)/len(original_vocab))*100:.1f}% smaller")

print(f"\nOriginal vocab (first 10): {sorted(list(original_vocab))[:10]}")
print(f"Lowercased vocab (first 10): {sorted(list(lowercased_vocab))[:10]}")

# 3. INFORMATION LOSS DEMONSTRATION
print("\n3. INFORMATION LOSS EXAMPLES")
print("-"*80)

ambiguous_cases = [
    ("Apple Inc.", "apple inc.", "Company vs fruit"),
    ("US president", "us president", "Country vs pronoun"),
    ("Paris, France", "paris, france", "City vs common word"),
    ("IT department", "it department", "Information Tech vs pronoun"),
    ("Polish the car", "polish the car", "Nationality vs verb"),
    ("READ THIS!!!", "read this!!!", "Emphasis lost")
]

print("Case matters for meaning:\n")
for original, lowercased, explanation in ambiguous_cases:
    print(f"  Original:   '{original}'")
    print(f"  Lowercased: '{lowercased}'")
    print(f"  Issue: {explanation}\n")

# 4. SELECTIVE LOWERCASING (PRESERVE ACRONYMS)
print("\n4. SELECTIVE LOWERCASING (PRESERVE ACRONYMS)")
print("-"*80)

def selective_lowercase(text):
    """
    Lowercase text but preserve acronyms (all caps, 2+ letters)
    """
    tokens = text.split()
    result = []
    
    for token in tokens:
        # Remove punctuation for checking
        word = re.sub(r'[^A-Za-z]', '', token)
        
        # Keep if acronym (all uppercase, 2+ letters)
        if word.isupper() and len(word) >= 2:
            result.append(token)
        else:
            result.append(token.lower())
    
    return ' '.join(result)

test_texts = [
    "NASA launched the Artemis mission to the Moon.",
    "The FBI and CIA work for the US government.",
    "I work in IT at IBM and use SQL daily.",
    "The CEO of Apple Inc. announced new iPhones."
]

print("Preserving acronyms:\n")
for text in test_texts:
    print(f"Original:  {text}")
    print(f"Selective: {selective_lowercase(text)}")
    print(f"Full low:  {text.lower()}")
    print()

# 5. CASE-SENSITIVE VS CASE-INSENSITIVE COMPARISON
print("\n5. CASE-SENSITIVE VS CASE-INSENSITIVE COMPARISON")
print("-"*80)

documents = [
    "Apple makes great products. I love Apple.",
    "I ate an apple today. The apple was delicious.",
    "Apple Inc. is a technology company."
]

# Case-sensitive word count (regex strips attached punctuation such as "Apple.")
case_sensitive_counts = Counter()
for doc in documents:
    words = re.findall(r"\w+", doc)
    case_sensitive_counts.update(words)

# Case-insensitive word count
case_insensitive_counts = Counter()
for doc in documents:
    words = [w.lower() for w in re.findall(r"\w+", doc)]
    case_insensitive_counts.update(words)

print("Case-sensitive counts:")
for word in ['Apple', 'apple']:
    print(f"  '{word}': {case_sensitive_counts.get(word, 0)}")

print("\nCase-insensitive counts:")
print(f"  'apple': {case_insensitive_counts['apple']}")

print("\nConclusion:")
print("  Case-sensitive: Distinguishes 'Apple' (company) from 'apple' (fruit)")
print("  Case-insensitive: Treats both as same word (higher count)")

# 6. IMPACT ON MACHINE LEARNING
print("\n6. IMPACT ON MACHINE LEARNING FEATURES")
print("-"*80)

from sklearn.feature_extraction.text import CountVectorizer

sample_docs = [
    "The Quick Brown Fox",
    "the quick brown fox",
    "THE QUICK BROWN FOX"
]

# Case-sensitive vectorizer
vec_case_sensitive = CountVectorizer(lowercase=False)
X_case_sensitive = vec_case_sensitive.fit_transform(sample_docs)

# Case-insensitive vectorizer
vec_case_insensitive = CountVectorizer(lowercase=True)
X_case_insensitive = vec_case_insensitive.fit_transform(sample_docs)

print("Case-sensitive features:")
print(f"  Vocabulary size: {len(vec_case_sensitive.vocabulary_)}")
print(f"  Features: {sorted(vec_case_sensitive.vocabulary_.keys())}")

print("\nCase-insensitive features:")
print(f"  Vocabulary size: {len(vec_case_insensitive.vocabulary_)}")
print(f"  Features: {sorted(vec_case_insensitive.vocabulary_.keys())}")

print("\nImpact:")
print(f"  Case-sensitive: 12 features (The, the, THE, Quick, quick, QUICK, ...)")
print(f"  Case-insensitive: 4 features (the, quick, brown, fox)")
print(f"  ✓ 3x smaller feature space!")

# 7. BEST PRACTICES
print("\n7. BEST PRACTICES & RECOMMENDATIONS")
print("-"*80)

recommendations = {
    "Text Classification (Spam, Sentiment)": "✓ Use lowercasing",
    "Named Entity Recognition (NER)": "✗ Keep original case",
    "Part-of-Speech Tagging": "✗ Keep original case",
    "Machine Translation": "✗ Keep original case",
    "Search Engines": "✓ Use lowercasing",
    "Bag-of-Words / TF-IDF": "✓ Use lowercasing",
    "Question Answering": "⚠️ Selective (preserve proper nouns)",
    "Sentiment with Intensity": "⚠️ Consider keeping ALL CAPS",
    "Chatbots (Intent Classification)": "✓ Use lowercasing",
    "Code Analysis": "✗ Keep original case (camelCase matters)"
}

print("Task-specific recommendations:\n")
for task, recommendation in recommendations.items():
    print(f"  {task:40s} → {recommendation}")

# 8. PRACTICAL FUNCTION
print("\n8. PRACTICAL PREPROCESSING FUNCTION")
print("-"*80)

def normalize_case(text, strategy='lowercase'):
    """
    Normalize text case based on strategy
    
    Args:
        text: Input text
        strategy: 'lowercase', 'uppercase', 'titlecase', 'preserve_acronyms'
    """
    if strategy == 'lowercase':
        return text.lower()
    
    elif strategy == 'uppercase':
        return text.upper()
    
    elif strategy == 'titlecase':
        return text.title()
    
    elif strategy == 'preserve_acronyms':
        return selective_lowercase(text)
    
    else:
        return text  # No change

demo_text = "NASA and the FBI work with US agencies."

print(f"Original: {demo_text}\n")
for strategy in ['lowercase', 'uppercase', 'titlecase', 'preserve_acronyms']:
    result = normalize_case(demo_text, strategy)
    print(f"{strategy:20s}: {result}")

print("\n" + "="*80)
print("SUMMARY")
print("="*80)
print("✓ Lowercasing can shrink the vocabulary substantially (roughly 30-50%; corpus-dependent)")
print("✓ Use for: Text classification, search, bag-of-words")
print("✗ Avoid for: NER, POS tagging, machine translation")
print("✓ Lowercase AFTER tokenization to preserve 'U.S.'")
print("✓ Consider preserving acronyms (NASA, FBI, IT)")
print("✓ Test with and without lowercasing")
print("✓ Check pre-trained models (BERT cased vs uncased)")
print("✓ Document your case normalization strategy")
print("✓ Information loss: 'Apple Inc.' vs 'apple fruit'")
print("="*80)