# Exercise Notebook Part 1

**Practice exercises for Part 1** - Follows **Learning Notebook Part 1**

This notebook covers **Exercises 1-8** (foundational NLP preprocessing and basic search):

1. **Exercise 1**: Text preprocessing with regex (URLs, emails, etc.) + sub-exercises (1a-1d)
2. **Exercise 2**: Tokenization techniques comparison + sub-exercises (2a-2e)
3. **Exercise 3**: Term Frequency (TF) calculation / Bag of Words (BoW)
4. **Exercise 4**: TF-based keyword search
5. **Exercise 5**: Compare preprocessing approaches
6. **Exercise 6**: Stemming and Lemmatization + sub-exercises (6a-6d)
7. **Exercise 7**: Advanced regex patterns + sub-exercises (7a-7d)
8. **Exercise 8**: Handling special cases in preprocessing + sub-exercises (8a-8e)

**📝 Note**: Exercise Notebook Part 2 will cover more advanced topics, including TF-IDF, similarity search, document clustering, and RAG (Retrieval-Augmented Generation).

**Important Pipeline Order:**
```
Preprocessing → Tokenization → Vectorization (BoW/TF) → Keyword Search
```

**Instructions**: Complete each exercise by filling in the code cells marked with `# TODO`

**Note**:
- **TF-IDF** (IDF calculation and full TF-IDF) is in **Exercise Notebook Part 2**
- **Similarity search** and **RAG** are in **Exercise Notebook Part 2**
- Remember that TF-IDF is **lexical** (it matches word forms, not meaning). True semantic search (understanding meaning and synonyms) requires embeddings (Class 3). **Semantic = meaning**.
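
To make the word-based limitation concrete, here is a minimal sketch of exact keyword matching. The function name `shares_a_word` and the sample sentence are illustrative assumptions, not part of the course material.

```python
def shares_a_word(query, document):
    """Return True if the query and the document have at least one word in common."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return bool(query_words & document_words)

# Hypothetical sample description
document = "a captivating romance movie about a secret quest"

print(shares_a_word("movie", document))  # the exact word is present
print(shares_a_word("film", document))   # a synonym of "movie", but a different string
```

The second query finds nothing because word-based matching compares strings, so "film" and "movie" are unrelated as far as it can tell.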


In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity


In [2]:
# Load movie data
# If running in Google Colab and data file doesn't exist, download it from GitHub
import os

if not os.path.exists('data/movies.csv'):
    print("Data file not found. Downloading from GitHub...")
    os.makedirs('data', exist_ok=True)
    import urllib.request
    url = 'https://raw.githubusercontent.com/samsung-ai-course/8th-9th-edition/main/Chapter%202%20-%20Natural%20Language%20Processing/Class%201%20%26%202%20-%20NLP%20and%20Search/data/movies.csv'
    urllib.request.urlretrieve(url, 'data/movies.csv')
    print("✓ Data file downloaded successfully!")

df = pd.read_csv('data/movies.csv')
print(f"Loaded {len(df)} movies")
df.head()


Data file not found. Downloading from GitHub...
✓ Data file downloaded successfully!
Loaded 10000 movies


|   | movie_id | title | description | genre | rating |
|---|----------|-------|-------------|-------|--------|
| 0 | 1 | Edge of Code | A compelling romance film about a young advent... | Romance | 7.1 |
| 1 | 2 | Storm of Secret | This captivating romance movie follows a quest... | Romance | 6.3 |
| 2 | 3 | Under Warrior Redux | In this captivating war story, a secret organi... | War | 7.3 |
| 3 | 4 | Quest of Secret | A compelling fantasy film about a determined d... | Fantasy | 8.3 |
| 4 | 5 | Key of Game | A exploration adventure film about a master th... | Adventure | 6.2 |


## Exercise 1: Text Cleaning with Regex

Complete the `clean_text` function to:
1. Remove URLs (starting with http:// or https://)
2. Remove email addresses
3. Remove phone numbers (format: (555) 123-4567 or 555-123-4567)
4. Remove extra whitespace
5. Convert to lowercase


In [3]:
# --- Clean Text Function ---
def clean_text(text):
    """
    Cleans text by removing URLs, emails, phone numbers, extra whitespace,
    and converting to lowercase.
    """
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove emails
    text = re.sub(r'\S+@\S+\.\S+', '', text)

    # Remove phone numbers ((555) 123-4567 or 555-123-4567)
    text = re.sub(r'\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Convert to lowercase
    text = text.lower()

    return text


# Apply cleaning to description column
df['clean_description'] = df['description'].apply(clean_text)
df[['description', 'clean_description']].head()

|   | description | clean_description |
|---|-------------|-------------------|
| 0 | A compelling romance film about a young advent... | a compelling romance film about a young advent... |
| 1 | This captivating romance movie follows a quest... | this captivating romance movie follows a quest... |
| 2 | In this captivating war story, a secret organi... | in this captivating war story, a secret organi... |
| 3 | A compelling fantasy film about a determined d... | a compelling fantasy film about a determined d... |
| 4 | A exploration adventure film about a master th... | a exploration adventure film about a master th... |
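
The movie descriptions shown above contain no URLs, emails, or phone numbers, so only the lowercasing is visible. A minimal sketch on a noisy string makes the other steps observable. The function `clean_text_demo` repeats the same five regex steps so that the snippet runs on its own, and the sample string is a made-up assumption.

```python
import re

def clean_text_demo(text):
    """Same five steps as clean_text, repeated so this snippet is self-contained."""
    text = re.sub(r'http\S+|www\S+', '', text)                           # URLs
    text = re.sub(r'\S+@\S+\.\S+', '', text)                             # emails
    text = re.sub(r'\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b', '', text)  # phone numbers
    text = re.sub(r'\s+', ' ', text).strip()                             # extra whitespace
    return text.lower()

# Hypothetical noisy input
noisy = "Contact us at info@example.com or (555) 123-4567. Visit https://example.com/movies   NOW!"
print(clean_text_demo(noisy))
```

Removing a pattern can leave stray punctuation behind (here, a lone period after "or"), which is one reason cleaning is usually followed by tokenization and filtering.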


### Exercise 1a: Extract Specific Patterns from Text

Instead of removing patterns, sometimes we want to **extract** them for analysis.
Complete the functions to extract dates, prices, hashtags, and anything else you think is relevant.

**Goal**: Practice pattern extraction and understand when to extract vs remove.


In [4]:
def extract_dates(text):
    # Matches formats like 2025-11-12, 11/12/2025, Nov 12 2025
    return re.findall(r'\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4}|[A-Z][a-z]{2,8}\s\d{1,2}\s\d{4})\b', text)

def extract_prices(text):
    # Matches $12.99, €10, 10 USD
    return re.findall(r'[$€£]\d+(?:\.\d{2})?|\d+\s?(?:USD|EUR|GBP)', text)

def extract_hashtags(text):
    return re.findall(r'#\w+', text)

def extract_mentions(text):
    return re.findall(r'@\w+', text)

# Example usage
sample = "New movie releases on Nov 12 2025! Tickets cost €10. #Cinema @MovieFan"
print("Dates:", extract_dates(sample))
print("Prices:", extract_prices(sample))
print("Hashtags:", extract_hashtags(sample))
print("Mentions:", extract_mentions(sample))

Dates: ['Nov 12 2025']
Prices: ['€10']
Hashtags: ['#Cinema']
Mentions: ['@MovieFan']


### Exercise 1b: Normalize Text Content

Sometimes we want to **normalize** text (standardize variations) rather than remove it.
Complete functions to normalize contractions, abbreviations, and special characters.

**Goal**: Understand when normalization improves text consistency for NLP.

---

## What is Text Normalization?

**Normalization** in NLP means converting different variations of text into a **standard, consistent form** while preserving the meaning. Unlike **removal** (which deletes content), normalization **transforms** text to reduce variation and improve consistency.

### Why Normalize?

Text often contains multiple ways to express the same thing:
- **Contractions**: "don't" vs "do not", "I'm" vs "I am", "can't" vs "cannot"
- **Abbreviations**: "Dr." vs "Doctor", "U.S.A." vs "USA" vs "United States"
- **Special characters**: "café" vs "cafe", "naïve" vs "naive", "résumé" vs "resume"
- **Punctuation variations**: "Mr." vs "Mr", "e.g." vs "eg"
- **Number formats**: "1,000" vs "1000", "50%" vs "50 percent"

### Normalization vs Removal

| Approach | Example | When to Use |
|----------|---------|-------------|
| **Normalization** | "don't" → "do not" | Preserve meaning, reduce vocabulary size |
| **Removal** | Remove URLs, emails | Content is noise, not useful for analysis |

### Benefits of Normalization

1. **Reduces Vocabulary Size**: "don't" and "do not" are both counted as "do not"
2. **Improves Matching**: Search for "cannot" will also match "can't"
3. **Consistency**: Same concept represented the same way across documents
4. **Better Statistics**: TF-IDF counts are more accurate when variations are unified

### Examples of Normalization

```text
# Contractions
"don't" → "do not"
"I'm" → "I am"
"won't" → "will not"
"it's" → "it is" (or "it has" depending on context)

# Abbreviations
"Dr. Smith" → "Doctor Smith"
"U.S.A." → "USA" (or "United States of America")
"e.g." → "for example"
"i.e." → "that is"

# Special Characters
"café" → "cafe"
"naïve" → "naive"
"résumé" → "resume"

# Punctuation
"Mr." → "Mr"
"Mrs." → "Mrs"
"Ph.D." → "PhD"
```

### When NOT to Normalize

- **Proper nouns**: "U.S.A." (country) vs "USA" (abbreviation) - context matters
- **Domain-specific terms**: "AI" vs "artificial intelligence" - may have different meanings
- **Sentiment analysis**: "don't" vs "do not" - contractions can carry different emotional weight
- **Preserving original format**: When exact text matching is required

### Implementation Strategy

1. **Contractions**: Use a dictionary mapping contractions to full forms
2. **Abbreviations**: Create abbreviation dictionaries (context-dependent)
3. **Special Characters**: Use Unicode normalization (NFD/NFC) or character mapping
4. **Punctuation**: Remove or standardize punctuation marks consistently

**Key Insight**: Normalization is a trade-off between consistency and information preservation. Choose normalization strategies based on your specific NLP task!
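
For the special-characters step, a minimal sketch with the standard library's `unicodedata` module shows what NFC and NFD do and how accents can be dropped. The helper name `strip_accents` is an assumption made for illustration.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD), then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = unicodedata.normalize("NFC", "caf\u00e9")   # "é" stored as one code point
decomposed = unicodedata.normalize("NFD", composed)    # "e" followed by a combining accent

print(len(composed), len(decomposed))   # the same visible word, different lengths
print(composed == decomposed)           # the two forms do not compare equal
print(strip_accents("café naïve résumé"))
```

Because the two forms look identical but compare unequal, it is worth normalizing to a single form before counting or matching tokens, even when the accents themselves are kept.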

---

### 💡 PS: Useful Frameworks for Normalization

Instead of manually creating endless dictionaries for contractions, abbreviations, and special characters, consider using established NLP frameworks:

- **spaCy**: Provides built-in text normalization, lemmatization, and tokenization
  ```python
  import spacy
  nlp = spacy.load("en_core_web_sm")
  doc = nlp("I don't think it's working")
  # Access normalized tokens, lemmas, etc.
  ```

- **NLTK**: Offers tokenizers, stemmers, lemmatizers, stop-word lists, and other text processing utilities (it has no built-in contraction expander)
  ```python
  from nltk.tokenize import word_tokenize
  from nltk.corpus import stopwords
  ```

- **TextBlob**: Simple API for common NLP tasks including normalization
  ```python
  from textblob import TextBlob
  blob = TextBlob("I don't like it")
  ```

- **Unidecode**: Transliterates Unicode text to plain ASCII (removes accents and maps special characters to close ASCII equivalents)
  ```python
  from unidecode import unidecode
  unidecode("café")  # Returns "cafe"
  ```

- **contractions**: Python library specifically for expanding contractions
  ```python
  import contractions
  contractions.fix("don't")  # Returns "do not"
  ```

**Note**: While these frameworks are helpful, understanding the underlying concepts (as you'll practice in this exercise) is crucial for customizing normalization for your specific use case!


In [5]:
import unicodedata

# 1️⃣ Contractions normalization
contractions_dict = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "let's": "let us"
}

def expand_contractions(text):
    # Treat the typographic apostrophe (’) like a straight one so "don’t" matches
    text = text.replace("\u2019", "'")
    pattern = re.compile(
        r'\b(' + '|'.join(map(re.escape, contractions_dict)) + r')\b',
        re.IGNORECASE,
    )
    # Dictionary keys are lowercase, so look up the lowercased match
    return pattern.sub(lambda m: contractions_dict[m.group().lower()], text)

# 2️⃣ Abbreviation normalization
abbrev_dict = {
    "u.s.a.": "usa",
    "dr.": "doctor",
    "e.g.": "for example",
    "i.e.": "that is"
}

def normalize_abbreviations(text):
    # Escape the dots so "." is literal. Lookarounds replace \b because
    # \b after a trailing "." only matches when a word character follows.
    pattern = re.compile(
        r'(?<!\w)(' + '|'.join(map(re.escape, abbrev_dict)) + r')(?!\w)',
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: abbrev_dict[m.group().lower()], text)

# 3️⃣ Remove special characters (normalize Unicode)
def normalize_special_chars(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

# Combine all
def normalize_text(text):
    text = expand_contractions(text)
    text = normalize_abbreviations(text)
    text = normalize_special_chars(text)
    return text

example = "I’m Dr. Smith and I don’t like café culture in the U.S.A."
print(normalize_text(example))

i am doctor Smith and I do not like cafe culture in the usa


### Exercise 1c: Clean HTML and Markdown

Real-world text often contains HTML tags or markdown formatting.
Complete functions to remove HTML tags while preserving text content, and clean markdown.

**Goal**: Handle structured text formats commonly found in web content.


In [6]:
from bs4 import BeautifulSoup
import re

def clean_html(text):
    """Remove HTML tags and keep text content."""
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")

def clean_markdown(text):
    """Remove markdown formatting like **bold**, _italic_, [links](url), etc."""
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)  # [text](url)
    text = re.sub(r'(\*\*|__)(.*?)\1', r'\2', text)        # bold
    text = re.sub(r'(\*|_)(.*?)\1', r'\2', text)           # italic
    text = re.sub(r'`{1,3}.*?`{1,3}', '', text)            # code blocks
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)  # headers (only at line start, so hashtags survive)
    return text.strip()

html_sample = "<p>This is a <b>bold</b> paragraph with <a href='url'>link</a>.</p>"
md_sample = "This is **bold**, _italic_, and a [link](https://example.com)."

print("HTML cleaned:", clean_html(html_sample))
print("Markdown cleaned:", clean_markdown(md_sample))

HTML cleaned: This is a  bold  paragraph with  link .
Markdown cleaned: This is bold, italic, and a link.
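
The doubled spaces in the HTML output come from `separator=" "`, which inserts a space at every tag boundary. As an alternative sketch that uses only the standard library (a swapped-in technique, not the BeautifulSoup approach above), `html.parser.HTMLParser` can collect the text nodes, after which whitespace is collapsed. The names `TextExtractor` and `strip_tags` are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Join the text nodes as they appeared, then collapse runs of whitespace
    return " ".join("".join(parser.chunks).split())

html_sample = "<p>This is a <b>bold</b> paragraph with <a href='url'>link</a>.</p>"
print(strip_tags(html_sample))
```

Joining without a separator keeps "link." intact, but it would also glue together adjacent block elements such as `<p>a</p><p>b</p>`, so the right choice depends on the markup you expect.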


### Exercise 1d: Compare Cleaning Strategies

Compare the impact of different cleaning approaches on vocabulary size and text quality.
This helps understand when to apply different cleaning techniques.

**Goal**: Measure the practical impact of preprocessing choices.


In [8]:
def clean_text(text):
    """
    Remove URLs, emails, phone numbers, extra whitespace, and lowercase text.
    """
    if not isinstance(text, str):
        return ""
    text = re.sub(r'http\S+|www\S+', '', text)                    # URLs
    text = re.sub(r'\S+@\S+\.\S+', '', text)                      # Emails
    text = re.sub(r'\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b', '', text)  # Phones
    text = re.sub(r'\s+', ' ', text).strip()                      # Extra spaces
    return text.lower()


# ============================================================
# Exercise 1b helpers: Normalization
# ============================================================
contractions_dict = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "let's": "let us"
}

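def normalize_contractions(text):
    # Treat the typographic apostrophe like a straight one, and escape the keys
    text = text.replace("\u2019", "'")
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, contractions_dict)) + r')\b', re.IGNORECASE)
    return pattern.sub(lambda m: contractions_dict[m.group().lower()], text)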


abbrev_dict = {
    "u.s.a.": "usa",
    "dr.": "doctor",
    "e.g.": "for example",
    "i.e.": "that is"
}

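def normalize_abbreviations(text):
    # Escape the dots so "." is literal; lookarounds replace \b, which
    # only matches after a trailing "." when a word character follows
    pattern = re.compile(r'(?<!\w)(' + '|'.join(map(re.escape, abbrev_dict)) + r')(?!\w)', re.IGNORECASE)
    return pattern.sub(lambda m: abbrev_dict[m.group().lower()], text)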

In [9]:
def compare_cleaning_strategies(df, column='description', sample_size=10):
    """
    Compare different cleaning strategies and measure their impact.

    Strategies to compare:
    1. No cleaning (baseline)
    2. Basic cleaning (lowercase, whitespace)
    3. Full cleaning (remove URLs, emails, etc.)
    4. Full cleaning + normalization (contractions, abbreviations)

    Metrics to measure:
    - Vocabulary size (unique words)
    - Average document length
    - Number of tokens
    """
    from collections import Counter

    sample_texts = df[column].head(sample_size).tolist()

    results = {}

    # Strategy 1: No cleaning
    vocab_no_clean = set()
    total_tokens_no_clean = 0
    for text in sample_texts:
        tokens = text.split()
        vocab_no_clean.update(tokens)
        total_tokens_no_clean += len(tokens)

    results['No Cleaning'] = {
        'vocab_size': len(vocab_no_clean),
        'total_tokens': total_tokens_no_clean,
        'avg_length': total_tokens_no_clean / len(sample_texts)
    }

    # Strategy 2: Basic cleaning (lowercase, whitespace)
    vocab_basic = set()
    total_tokens_basic = 0
    for text in sample_texts:
        # TODO: Apply basic cleaning (lowercase, normalize whitespace)
        cleaned = text.lower()
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()
        tokens = cleaned.split()
        vocab_basic.update(tokens)
        total_tokens_basic += len(tokens)

    results['Basic Cleaning'] = {
        'vocab_size': len(vocab_basic),
        'total_tokens': total_tokens_basic,
        'avg_length': total_tokens_basic / len(sample_texts)
    }

    # Strategy 3: Full cleaning (use your clean_text function)
    vocab_full = set()
    total_tokens_full = 0
    for text in sample_texts:
        # TODO: Apply full cleaning (remove URLs, emails, etc.)
        cleaned = clean_text(text)  # Use your function from Exercise 1
        tokens = cleaned.split()
        vocab_full.update(tokens)
        total_tokens_full += len(tokens)

    results['Full Cleaning'] = {
        'vocab_size': len(vocab_full),
        'total_tokens': total_tokens_full,
        'avg_length': total_tokens_full / len(sample_texts)
    }

    # Strategy 4: Full cleaning + normalization
    vocab_norm = set()
    total_tokens_norm = 0
    for text in sample_texts:
        # TODO: Apply full cleaning + normalization
        cleaned = clean_text(text)
        cleaned = normalize_contractions(cleaned)
        cleaned = normalize_abbreviations(cleaned)
        tokens = cleaned.split()
        vocab_norm.update(tokens)
        total_tokens_norm += len(tokens)

    results['Full + Normalization'] = {
        'vocab_size': len(vocab_norm),
        'total_tokens': total_tokens_norm,
        'avg_length': total_tokens_norm / len(sample_texts)
    }

    # Display comparison
    print("=" * 70)
    print("Cleaning Strategy Comparison")
    print("=" * 70)
    print(f"{'Strategy':<25} {'Vocab Size':<15} {'Total Tokens':<15} {'Avg Length':<15}")
    print("-" * 70)
    for strategy, metrics in results.items():
        print(f"{strategy:<25} {metrics['vocab_size']:<15} {metrics['total_tokens']:<15} {metrics['avg_length']:<15.2f}")

    print("\n" + "=" * 70)
    print("Key Insights:")
    print("=" * 70)
    print(f"Vocabulary reduction: {results['No Cleaning']['vocab_size']} → {results['Full + Normalization']['vocab_size']}")
    print(f"Reduction percentage: {(1 - results['Full + Normalization']['vocab_size'] / results['No Cleaning']['vocab_size']) * 100:.1f}%")
    print("\nWhen to use each strategy:")
    print("  - No cleaning: Preserve original text for exact matching")
    print("  - Basic cleaning: Simple normalization, fast processing")
    print("  - Full cleaning: Remove noise, reduce vocabulary size")
    print("  - Full + Normalization: Maximum consistency, best for NLP tasks")

    return results

# Run comparison
comparison_results = compare_cleaning_strategies(df, sample_size=10)


Cleaning Strategy Comparison
Strategy                  Vocab Size      Total Tokens    Avg Length     
----------------------------------------------------------------------
No Cleaning               80              172             17.20          
Basic Cleaning            77              172             17.20          
Full Cleaning             77              172             17.20          
Full + Normalization      77              172             17.20          

Key Insights:
Vocabulary reduction: 80 → 77
Reduction percentage: 3.7%

When to use each strategy:
  - No cleaning: Preserve original text for exact matching
  - Basic cleaning: Simple normalization, fast processing
  - Full cleaning: Remove noise, reduce vocabulary size
  - Full + Normalization: Maximum consistency, best for NLP tasks


## Exercise 2: Tokenization

Complete the `tokenize_text` function to:
1. Tokenize text into words
2. Filter out very short tokens (length < 3)
3. Optionally filter stop words


In [10]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

# ============================================================
# Exercise 2: Tokenize Text
# ============================================================
def tokenize_text(text, min_length=3, remove_stopwords=True):
    """
    Tokenizes text into words, filters short tokens, and optionally removes stop words.
    """
    if not isinstance(text, str):
        return []

    # Basic tokenization using regex
    tokens = re.findall(r'\b\w+\b', text.lower())  # split on word boundaries

    # Filter by min_length
    tokens = [t for t in tokens if len(t) >= min_length]

    # Optionally remove stop words
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]

    return tokens

# Example
sample = "This is a simple example: it shows how tokenization works."
print(tokenize_text(sample))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['simple', 'example', 'shows', 'tokenization', 'works']


### Exercise 2a: Compare Tokenization Methods

Compare different tokenization approaches to understand their impact.
Different methods split text differently, affecting vocabulary size and token quality.

**Goal**: Understand trade-offs between different tokenization techniques.


In [12]:
import spacy
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nlp = spacy.load("en_core_web_sm")

def compare_tokenization_methods(text):
    """
    Compare regex-based, NLTK, and spaCy tokenization outputs.
    """
    nltk_tokens = nltk.word_tokenize(text)
    regex_tokens = re.findall(r'\b\w+\b', text.lower())
    spacy_tokens = [token.text for token in nlp(text)]

    print(f"Text: {text}")
    print("\n--- Regex Tokenization ---")
    print(regex_tokens)
    print("\n--- NLTK Tokenization ---")
    print(nltk_tokens)
    print("\n--- spaCy Tokenization ---")
    print(spacy_tokens)

    print(f"\nVocabulary sizes:")
    print(f"Regex: {len(set(regex_tokens))}, NLTK: {len(set(nltk_tokens))}, spaCy: {len(set(spacy_tokens))}")

# Example
compare_tokenization_methods("Mr. Smith’s café is open 24/7 in the U.S.A.!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Text: Mr. Smith’s café is open 24/7 in the U.S.A.!

--- Regex Tokenization ---
['mr', 'smith', 's', 'café', 'is', 'open', '24', '7', 'in', 'the', 'u', 's', 'a']

--- NLTK Tokenization ---
['Mr.', 'Smith', '’', 's', 'café', 'is', 'open', '24/7', 'in', 'the', 'U.S.A.', '!']

--- spaCy Tokenization ---
['Mr.', 'Smith', '’s', 'café', 'is', 'open', '24/7', 'in', 'the', 'U.S.A.', '!']

Vocabulary sizes:
Regex: 12, NLTK: 12, spaCy: 11
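
The plain `\b\w+\b` pattern splits "24/7" and "U.S.A." into fragments. One possible refinement, sketched here under the assumption that periods, apostrophes, and slashes *inside* a token should be kept, is to allow those characters between runs of word characters. The pattern and the name `regex_tokenize_keep_internal` are hypothetical, not part of NLTK or spaCy.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# ".", "'", "’" (U+2019) or "/" plus more word characters, repeated.
TOKEN_PATTERN = re.compile(r"\w+(?:[.'\u2019/]\w+)*")

def regex_tokenize_keep_internal(text):
    """Tokenize while keeping punctuation that sits between word characters."""
    return TOKEN_PATTERN.findall(text)

sample_text = "Mr. Smith\u2019s caf\u00e9 is open 24/7 in the U.S.A.!"
print(regex_tokenize_keep_internal(sample_text))
```

Trailing punctuation is still dropped (the final period of "U.S.A." and the period of "Mr."), so this remains a heuristic; library tokenizers handle such cases with dedicated exception lists.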


### Exercise 2b: Test Impact of min_length Threshold

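Test how different minimum token length thresholds affect vocabulary size and token quality.
Shorter tokens (like "a", "I", "it") are often less informative but may be important in some contexts.
Note that `tokenize_text` removes stop words by default, which already drops most very short tokens; pass `remove_stopwords=False` to isolate the effect of `min_length`.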

**Goal**: Understand the trade-off between vocabulary size and token informativeness.


In [13]:
def test_min_length_impact(texts, lengths=(1, 2, 3, 4, 5)):
    """
    Test how min_length affects vocabulary size and token count.
    """
    for L in lengths:
        all_tokens = [token for text in texts for token in tokenize_text(text, min_length=L)]
        vocab_size = len(set(all_tokens))
        print(f"min_length={L}: {len(all_tokens)} tokens, vocab size={vocab_size}")

# Example
test_min_length_impact(df['description'].head(5))

min_length=1: 61 tokens, vocab size=38
min_length=2: 61 tokens, vocab size=38
min_length=3: 61 tokens, vocab size=38
min_length=4: 60 tokens, vocab size=37
min_length=5: 48 tokens, vocab size=32


### Exercise 2c: Test Impact of Stop Word Removal

Compare tokenization with and without stop word removal.
Stop words are common but may be important for some tasks (e.g., sentiment analysis).

**Goal**: Understand when stop word removal helps vs. hurts NLP tasks.


In [14]:
def compare_stopword_removal(texts):
    """
    Compare tokenization with and without stop word removal.
    """
    all_tokens_no_stop = [t for text in texts for t in tokenize_text(text, remove_stopwords=False)]
    all_tokens_stop = [t for text in texts for t in tokenize_text(text, remove_stopwords=True)]

    print(f"With stop words: {len(set(all_tokens_no_stop))} unique tokens")
    print(f"Without stop words: {len(set(all_tokens_stop))} unique tokens")
    print(f"Reduction: {(1 - len(set(all_tokens_stop))/len(set(all_tokens_no_stop)))*100:.1f}%")

# Example
compare_stopword_removal(df['description'].head(10))

With stop words: 71 unique tokens
Without stop words: 62 unique tokens
Reduction: 12.7%
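
A small sketch of a case where stop word removal hurts: NLTK's English list includes negations such as "not", so two sentences with opposite sentiment can collapse to the same tokens. The tiny hand-picked set `TOY_STOP_WORDS` below is an assumption that stands in for the full list so that the snippet needs no downloads.

```python
# Hypothetical subset of a stop word list (the real NLTK list is much longer)
TOY_STOP_WORDS = {"this", "is", "not", "the", "a"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in TOY_STOP_WORDS]

positive = remove_stop_words("This movie is good")
negative = remove_stop_words("This movie is not good")

print(positive)
print(negative)
print(positive == negative)  # the negation has disappeared
```

For keyword search this loss is usually harmless, but for sentiment analysis it is a good reason to keep negations out of the stop word set.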


### Exercise 2d: Tokenize Different Text Types

Compare tokenization on different types of text (titles vs descriptions).
Different text types have different characteristics and may need different preprocessing.

**Goal**: Understand how text type affects tokenization decisions.


In [15]:
def compare_text_types(df):
    """
    Compare tokenization characteristics between movie titles and descriptions.
    """
    title_tokens = [t for text in df['title'] for t in tokenize_text(text)]
    desc_tokens = [t for text in df['description'] for t in tokenize_text(text)]

    print(f"Titles: {len(set(title_tokens))} unique tokens, avg length {len(title_tokens)/len(df):.1f}")
    print(f"Descriptions: {len(set(desc_tokens))} unique tokens, avg length {len(desc_tokens)/len(df):.1f}")

    print(f"\nVocabulary overlap: {len(set(title_tokens) & set(desc_tokens))} shared words")

# Example
compare_text_types(df)

Titles: 73 unique tokens, avg length 2.0
Descriptions: 169 unique tokens, avg length 11.7

Vocabulary overlap: 13 shared words


### Exercise 2e: Create N-grams from Tokens

After tokenization, create n-grams (unigrams, bigrams, trigrams) to capture word order.
Compare vocabulary size and sparsity of different n-gram approaches.

**Goal**: Understand how n-grams capture context and affect vocabulary size.


In [16]:
from nltk import ngrams
from collections import Counter

def create_ngrams(tokens, n=2):
    """
    Create n-grams (bigrams, trigrams, etc.) from a list of tokens.
    """
    return ['_'.join(gram) for gram in ngrams(tokens, n)]

def compare_ngrams(texts, n_values=(1, 2, 3)):
    """
    Compare vocabulary size and sparsity across different n-gram models.
    """
    for n in n_values:
        all_ngrams = []
        for text in texts:
            tokens = tokenize_text(text)
            grams = create_ngrams(tokens, n) if n > 1 else tokens
            all_ngrams.extend(grams)
        vocab_size = len(set(all_ngrams))
        print(f"{n}-grams: {len(all_ngrams)} total, {vocab_size} unique")

# Example
compare_ngrams(df['description'].head(10))

1-grams: 111 total, 62 unique
2-grams: 101 total, 78 unique
3-grams: 91 total, 75 unique
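
A document with k tokens yields k - n + 1 n-grams, which is consistent with the totals above dropping by 10 (one per document in the sample) each time n grows by one. The same sliding window can be sketched without NLTK by zipping shifted copies of the token list. The name `make_ngrams` and the sample tokens are illustrative assumptions.

```python
def make_ngrams(tokens, n):
    """Slide a window of size n over tokens and join each window with '_'."""
    shifted = (tokens[i:] for i in range(n))   # tokens, tokens[1:], tokens[2:], ...
    return ["_".join(gram) for gram in zip(*shifted)]

# Hypothetical token list with k = 5 tokens
tokens = ["young", "adventurer", "finds", "secret", "map"]

for n in (1, 2, 3):
    grams = make_ngrams(tokens, n)
    print(n, len(grams), grams)
```

`zip` stops at the shortest shifted copy, which is what produces exactly k - n + 1 windows.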


## Exercise 3: Bag of Words (BoW) / Term Frequency (TF) Calculation

Implement Term Frequency (TF) calculation manually. The raw word counts behind TF are what a Bag of Words (BoW) vector stores!

**Note**: This is Part 1 of vectorization. In **Exercise Notebook Part 2**, you'll learn IDF and full TF-IDF calculation.

**What you'll implement:**
1. Count how many times a word appears in a document (the raw count used in Bag of Words vectors)
2. Normalize by document length: TF = count(term) / total_terms_in_document
3. Build a Bag of Words vector for each document: one raw count per vocabulary word

**Key Point**: BoW is just counting words, and TF divides those counts by document length. Together they are the foundation for TF-IDF in Part 2!


In [17]:
from collections import Counter

def calculate_tf(term, document_tokens):
    """
    Calculate Term Frequency: count(term) / total_terms
    """
    # Count how many times the term appears in the document
    term_count = document_tokens.count(term)

    # Divide by the total number of words in the document
    total_terms = len(document_tokens)

    # Avoid division by zero
    if total_terms == 0:
        return 0.0

    return term_count / total_terms


def create_bow_vector(document_tokens, vocabulary):
    """
    Create a Bag of Words vector for a document.
    Each position corresponds to the count of that word in the document.
    """
    # Count how many times each token appears
    token_counts = Counter(document_tokens)

    # Build the vector: each position holds the count of one vocabulary word
    bow_vector = [token_counts[word] for word in vocabulary]

    return bow_vector


# Test with simple example
docs = [
    ["natural", "language", "processing"],
    ["machine", "learning", "natural"],
    ["deep", "learning", "language"]
]

# Build vocabulary
all_words = set()
for doc in docs:
    all_words.update(doc)
vocab = sorted(list(all_words))

print("Vocabulary:", vocab)
print("\nTF of 'natural' in doc 0:", calculate_tf("natural", docs[0]))
print("TF of 'language' in doc 0:", calculate_tf("language", docs[0]))

print("\nBag of Words vectors:")
for i, doc in enumerate(docs):
    bow_vector = create_bow_vector(doc, vocab)
    print(f"Doc {i}: {doc} → {bow_vector}")

print("\n💡 Note: These are simple word counts (BoW). In Part 2, you'll learn TF-IDF!")

Vocabulary: ['deep', 'language', 'learning', 'machine', 'natural', 'processing']

TF of 'natural' in doc 0: 0.3333333333333333
TF of 'language' in doc 0: 0.3333333333333333

Bag of Words vectors:
Doc 0: ['natural', 'language', 'processing'] → [0, 1, 0, 0, 1, 1]
Doc 1: ['machine', 'learning', 'natural'] → [0, 0, 1, 1, 1, 0]
Doc 2: ['deep', 'learning', 'language'] → [1, 1, 1, 0, 0, 0]

💡 Note: These are simple word counts (BoW). In Part 2, you'll learn TF-IDF!
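
Every word in the test documents above occurs exactly once, so raw counts and TF values look uninformative. A minimal sketch with a repeated word shows the difference between the two. The helper `tf_table` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter

def tf_table(document_tokens):
    """Return {term: count / total_terms} for every term in the document."""
    total = len(document_tokens)
    if total == 0:
        return {}
    counts = Counter(document_tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical document in which "secret" appears twice
doc_tokens = ["secret", "quest", "secret", "map"]

print(Counter(doc_tokens))   # raw counts, as stored in a BoW vector
print(tf_table(doc_tokens))  # the same counts divided by the document length (4)
```

The TF values of a non-empty document always sum to 1, which makes documents of different lengths comparable.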


## Exercise 4: TF-Based Keyword Search

Implement a keyword search that ranks results by Term Frequency (TF).
For multiple query words, combine their TF scores.


In [18]:
def search_by_tf(query, documents, vocab):
    """
    Rank documents by the sum of Term Frequencies (TF) of query terms.
    Args:
        query (str): search query (e.g., "machine learning")
        documents (list[list[str]]): list of tokenized documents
        vocab (list[str]): full vocabulary (sorted); not used in scoring, kept for a consistent signature
    Returns:
        list of tuples: (doc_index, score) sorted by descending score
    """
    query_terms = query.lower().split()
    results = []

    for i, doc_tokens in enumerate(documents):
        score = 0
        for term in query_terms:
            score += calculate_tf(term, doc_tokens)
        results.append((i, score))

    # Sort by score descending
    results = sorted(results, key=lambda x: x[1], reverse=True)
    return results


# 🔹 Example
query = "natural learning"
results = search_by_tf(query, docs, vocab)

print(f"Query: '{query}'\n")
for idx, score in results:
    print(f"Doc {idx}: {docs[idx]} → TF Score = {score:.3f}")

Query: 'natural learning'

Doc 1: ['machine', 'learning', 'natural'] → TF Score = 0.667
Doc 0: ['natural', 'language', 'processing'] → TF Score = 0.333
Doc 2: ['deep', 'learning', 'language'] → TF Score = 0.333
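
Because TF divides by document length, a very short document can outrank a longer one that mentions the query term more often. The sketch below illustrates this with two made-up documents; the one-line `tf` helper mirrors the formula count(term) / total_terms and is repeated so that the snippet is self-contained.

```python
def tf(term, tokens):
    """Term frequency: occurrences of term divided by the document length."""
    return tokens.count(term) / len(tokens) if tokens else 0.0

# Hypothetical documents
short_doc = ["learning"]
long_doc = ["machine", "learning", "and", "deep", "learning",
            "share", "many", "learning", "methods", "today"]

print("short:", tf("learning", short_doc))  # 1 occurrence out of 1 token
print("long: ", tf("learning", long_doc))   # 3 occurrences out of 10 tokens
```

This length sensitivity is one of the limitations of pure TF ranking, along with the fact that common and rare words are weighted equally.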


## Exercise 5: Compare Preprocessing Approaches

Compare search results with and without preprocessing (stop-word removal, lowercasing).


In [19]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def preprocess_text(text, remove_stopwords=True, lowercase=True):
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stops]
    return tokens


# 🔹 Example comparison
text = "The Natural Language Processing course is amazing!"
print("Original tokens:", text.split())
print("Preprocessed:", preprocess_text(text))

Original tokens: ['The', 'Natural', 'Language', 'Processing', 'course', 'is', 'amazing!']
Preprocessed: ['natural', 'language', 'processing', 'course', 'amazing!']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Exercise 6: Stemming and Lemmatization

Stemming and lemmatization are preprocessing techniques that reduce words to a base form:
- **Stemming**: Quick, rule-based suffix stripping (e.g., "running" → "run"; "better" stays "better"). The result is not always a real word.
- **Lemmatization**: Dictionary-based reduction to a valid base form (e.g., "better" → "good" when the word is tagged as an adjective)

**When to use**: Useful for reducing vocabulary size and handling word variations (running/ran/run).

Complete the exercises below to understand both approaches!


### Exercise 6a: Implement Stemming

Complete the `apply_stemming` function using NLTK's PorterStemmer or SnowballStemmer.


In [20]:
from nltk.stem import PorterStemmer

def apply_stemming(tokens):
    """
    Apply stemming using PorterStemmer.
    """
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]


# 🔹 Example
tokens = ["running", "runs", "easily", "fairness"]
print("Before stemming:", tokens)
print("After stemming:", apply_stemming(tokens))

Before stemming: ['running', 'runs', 'easily', 'fairness']
After stemming: ['run', 'run', 'easili', 'fair']


### Exercise 6b: Implement Lemmatization

Complete the `apply_lemmatization` function using NLTK's WordNetLemmatizer.


In [21]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

def apply_lemmatization(tokens):
    """
    Apply lemmatization using WordNetLemmatizer.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]


# 🔹 Example
tokens = ["running", "better", "feet"]
print("Before lemmatization:", tokens)
print("After lemmatization:", apply_lemmatization(tokens))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Before lemmatization: ['running', 'better', 'feet']
After lemmatization: ['running', 'better', 'foot']


### Exercise 6c: Compare Stemming vs Lemmatization

Compare the results of stemming and lemmatization on the same text.
Analyze when each approach is better.


In [22]:
tokens = ["running", "runs", "better", "feet", "studies", "studying"]

print("\nOriginal:", tokens)
print("Stemmed:", apply_stemming(tokens))
print("Lemmatized:", apply_lemmatization(tokens))


Original: ['running', 'runs', 'better', 'feet', 'studies', 'studying']
Stemmed: ['run', 'run', 'better', 'feet', 'studi', 'studi']
Lemmatized: ['running', 'run', 'better', 'foot', 'study', 'studying']


### Exercise 6d: Complete Preprocessing Pipeline with Stemming/Lemmatization

Create a complete preprocessing function that includes optional stemming or lemmatization.
Compare search results with and without stemming/lemmatization.


In [26]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Make sure the required NLTK resources are available
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

def preprocess_text_complete(text,
                             remove_stop_words=True,
                             min_length=3,
                             stem=False,
                             lemmatize=False):
    """
    Complete preprocessing pipeline with optional stemming/lemmatization.

    Args:
        text: Raw text input
        remove_stop_words: Whether to remove stop words
        min_length: Minimum token length
        stem: If True, apply stemming
        lemmatize: If True, apply lemmatization (overrides stem if both True)

    Returns:
        list: Preprocessed tokens
    """

    # Step 1: Clean text (use your clean_text function from Exercise 1 if defined)
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)    # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)                 # keep only letters
    text = re.sub(r"\s+", " ", text).strip()             # normalize spaces

    # Step 2: Tokenize
    tokens = word_tokenize(text)

    # Step 3: Remove stop words
    if remove_stop_words:
        stop_words = set(stopwords.words('english'))
        tokens = [t for t in tokens if t not in stop_words]

    # Step 4: Filter by length
    tokens = [t for t in tokens if len(t) >= min_length]

    # Step 5: Apply stemming or lemmatization
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    elif stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]

    return tokens


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [27]:
# Test on a movie description
sample_desc = df.loc[0, 'description']
print("Original description:", sample_desc[:200])

# Test different preprocessing combinations
tokens_basic = preprocess_text_complete(sample_desc, stem=False, lemmatize=False)
tokens_stemmed = preprocess_text_complete(sample_desc, stem=True, lemmatize=False)
tokens_lemmatized = preprocess_text_complete(sample_desc, stem=False, lemmatize=True)

print("\nBasic tokens:", tokens_basic[:20])
print("\nStemmed tokens:", tokens_stemmed[:20])
print("\nLemmatized tokens:", tokens_lemmatized[:20])

print("\nVocabulary sizes:")
print("Basic:", len(set(tokens_basic)))
print("Stemmed:", len(set(tokens_stemmed)))
print("Lemmatized:", len(set(tokens_lemmatized)))

Original description: A compelling romance film about a young adventurer. an epic adventure that spans continents and generations. a touching love story that will warm your heart.

Basic tokens: ['compelling', 'romance', 'film', 'young', 'adventurer', 'epic', 'adventure', 'spans', 'continents', 'generations', 'touching', 'love', 'story', 'warm', 'heart']

Stemmed tokens: ['compel', 'romanc', 'film', 'young', 'adventur', 'epic', 'adventur', 'span', 'contin', 'gener', 'touch', 'love', 'stori', 'warm', 'heart']

Lemmatized tokens: ['compelling', 'romance', 'film', 'young', 'adventurer', 'epic', 'adventure', 'span', 'continent', 'generation', 'touching', 'love', 'story', 'warm', 'heart']

Vocabulary sizes:
Basic: 15
Stemmed: 14
Lemmatized: 15
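
The exercise also asks how stemming or lemmatization changes search results. The sketch below is one way to look at that, under stated assumptions: `search` and `crude_stem` are hypothetical helpers, the token lists are made up to resemble the description above, and `crude_stem` is a deliberately simple suffix stripper rather than the Porter algorithm.

```python
def crude_stem(token):
    """Hypothetical suffix stripper used only for this illustration."""
    for suffix in ("ers", "er", "es", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def search(query_tokens, docs, normalize=lambda t: t):
    """Rank documents by the fraction of their tokens that match a query term."""
    query_terms = {normalize(t) for t in query_tokens}
    results = []
    for idx, doc in enumerate(docs):
        normalized = [normalize(t) for t in doc]
        hits = sum(1 for t in normalized if t in query_terms)
        results.append((idx, hits / len(normalized) if normalized else 0.0))
    return sorted(results, key=lambda x: x[1], reverse=True)

docs = [
    ["young", "adventurer", "epic", "adventure"],
    ["touching", "love", "story"],
]
query = ["adventures"]

print("No normalization:", search(query, docs))
print("With crude stem: ", search(query, docs, normalize=crude_stem))
```

Without normalization the query `adventures` matches nothing. With the stemmer, `adventures`, `adventurer`, and `adventure` all reduce to `adventur`, so the first document scores 0.5. In the notebook, `PorterStemmer().stem` or `WordNetLemmatizer().lemmatize` could be passed as `normalize` instead.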


## Summary: What You've Practiced in Part 1

✅ **Exercise 1**: Text cleaning with regex (URLs, emails, phone numbers)  
  - **1a**: Extract specific patterns (dates, prices, hashtags, mentions)  
  - **1b**: Normalize text content (contractions, abbreviations, special chars)  
  - **1c**: Clean HTML and Markdown  
  - **1d**: Compare cleaning strategies  

✅ **Exercise 2**: Tokenization with stop word removal  
  - **2a**: Compare tokenization methods (split, regex, word boundaries)  
  - **2b**: Test impact of min_length threshold  
  - **2c**: Test impact of stop word removal  
  - **2d**: Tokenize different text types (titles vs descriptions)  
  - **2e**: Create n-grams from tokens  

✅ **Exercise 3**: Bag of Words (BoW) / Term Frequency (TF) calculation  
✅ **Exercise 4**: TF-based keyword search  
✅ **Exercise 5**: Compare preprocessing approaches  
✅ **Exercise 6**: Stemming and Lemmatization (6a, 6b, 6c, 6d)  
✅ **Exercise 7**: Advanced regex patterns  
✅ **Exercise 8**: Handling special cases in preprocessing  

**Key Takeaways:**
- Preprocessing quality directly affects search results
- Different tokenization methods have different trade-offs (speed vs. accuracy vs. context)
- min_length and stop word removal significantly impact vocabulary size
- Text type matters: titles vs descriptions need different approaches
- N-grams capture word order but increase vocabulary size and sparsity
- Stemming vs Lemmatization: Choose based on speed vs accuracy needs
- Regex is powerful for cleaning but handle edge cases carefully
- Bag of Words (BoW/TF) is simple word counting - foundation for TF-IDF!


## Exercise 7: Advanced Regex Patterns

Practice more complex regex patterns for text preprocessing.

**Goal**: Master advanced regex techniques commonly used in NLP preprocessing.

**What you'll practice:**
1. Named groups and non-capturing groups
2. Lookahead and lookbehind assertions
3. Complex pattern matching (dates, currencies, numbers)
4. Regex substitution with callbacks
5. Handling unicode and special characters

**Why this matters**: Real-world text contains complex patterns that require sophisticated regex to extract or clean properly.
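
As a warm-up for the first technique, here is a small sketch of named and non-capturing groups. It covers only numeric year-first dates, and the pattern name `ISO_DATE` and the sample sentence are made up for illustration.

```python
import re

# Named groups label each component; the non-capturing group (?:...) groups the
# separator alternatives without creating an extra numbered group.
ISO_DATE = re.compile(
    r"\b(?P<year>\d{4})(?:-|/)(?P<month>\d{2})(?:-|/)(?P<day>\d{2})\b"
)

sample = "It premiered on 2024-03-20 and was re-released on 2025/01/05."

# groupdict() returns the named groups of each match as a dictionary
dates = [m.groupdict() for m in ISO_DATE.finditer(sample)]
print(dates)
```

Each match yields a dictionary with the keys `year`, `month`, and `day`. The pattern does not validate ranges (it would accept month `13`), which is one of the edge cases worth handling in a fuller solution.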


In [None]:
# Exercise 7a: Extract Dates in Different Formats

def extract_dates(text):
    """
    Extract dates in various formats:
    - "January 1, 2024" or "Jan 1, 2024"
    - "2024-01-01" or "01/01/2024"
    - "1st January 2024"
    
    Use named groups to extract day, month, year separately.
    
    Args:
        text: Input text containing dates
    
    Returns:
        list: List of tuples (day, month, year) or dicts with named groups
    """
    # TODO: Create regex pattern(s) to match different date formats
    # TODO: Use named groups (?P<name>pattern) to extract components
    # TODO: Return list of matches
    
    dates = []
    # Hint: Use re.finditer() or re.findall() with named groups
    
    return dates

# Test cases
test_text = """
    The movie was released on January 15, 2024.
    It premiered on 2024-03-20 in theaters.
    The sequel came out on 04/15/2024.
    The original was on 1st January 2020.
"""

# TODO: Extract dates and print results
# dates = extract_dates(test_text)
# for date in dates:
#     print(date)


In [None]:
# Exercise 7b: Extract Currency and Numbers

def extract_currency(text):
    """
    Extract currency amounts in different formats:
    - "$100", "$1,000.50", "$1M", "$1.5B"
    - "€50", "£200", "¥1000"
    - "100 dollars", "fifty euros"

    Args:
        text: Input text

    Returns:
        list: List of currency amounts with their symbols/units
    """
    # TODO: Create regex patterns to match currency formats
    # TODO: Extract amount and currency symbol/unit
    # TODO: Handle different currency symbols ($, €, £, ¥)
    # TODO: Handle abbreviations (M = million, B = billion, K = thousand)

    currencies = []

    return currencies

def extract_numbers(text):
    """
    Extract numbers in various formats:
    - Integers: "100", "1,000", "1 million"
    - Decimals: "3.14", "1,234.56"
    - Percentages: "50%", "25.5 percent"
    - Ordinals: "1st", "2nd", "3rd", "4th"

    Args:
        text: Input text

    Returns:
        dict: Dictionary with keys 'integers', 'decimals', 'percentages', 'ordinals'
    """
    # TODO: Extract different number types
    # TODO: Use named groups or separate patterns for each type

    numbers = {
        'integers': [],
        'decimals': [],
        'percentages': [],
        'ordinals': []
    }

    return numbers

# Test cases
test_text = """
    The movie grossed $150 million at the box office.
    It cost $50M to produce and made €75.5M worldwide.
    The rating was 8.5/10 with 95% positive reviews.
    It ranked 1st in its opening weekend, 2nd overall.
    The budget was approximately 1.5 billion dollars.
"""

# TODO: Extract currencies and numbers
# currencies = extract_currency(test_text)
# numbers = extract_numbers(test_text)
# print("Currencies:", currencies)
# print("Numbers:", numbers)
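
One possible starting point for the symbol-prefixed formats is sketched below. It is not a full solution: `parse_symbol_amounts`, `CURRENCY`, and `MULTIPLIERS` are hypothetical names, and spelled-out forms such as "100 dollars" or "$150 million" are not handled.

```python
import re

# Symbol, amount with optional thousands separators and decimals, optional magnitude suffix
CURRENCY = re.compile(
    r"(?P<symbol>[$€£¥])(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)(?P<suffix>[KMB])?\b"
)
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_symbol_amounts(text):
    """Return (symbol, numeric value) pairs for amounts such as $1,000.50 or €75.5M."""
    results = []
    for m in CURRENCY.finditer(text):
        value = float(m.group("amount").replace(",", ""))
        # group("suffix") is None when no K/M/B follows, so the multiplier defaults to 1
        value *= MULTIPLIERS.get(m.group("suffix"), 1)
        results.append((m.group("symbol"), value))
    return results

print(parse_symbol_amounts("It cost $50M to produce, made €75.5M, and tickets were $1,000.50."))
```

For "$150 million" this sketch would return 150.0, because the word "million" is outside the pattern. Extending it with a separate alternative for number words is left to the exercise.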


In [None]:
# Exercise 7c: Lookahead and Lookbehind Patterns

def extract_quoted_text(text):
    """
    Extract text within quotes, handling nested quotes.
    Use lookahead/lookbehind to ensure proper matching.

    Args:
        text: Input text with quoted strings

    Returns:
        list: List of quoted text (without the quotes)
    """
    # TODO: Use positive lookbehind (?<=...) and positive lookahead (?=...)
    # TODO: Match text between quotes (single or double)
    # TODO: Handle escaped quotes inside strings

    quoted = []

    return quoted

def extract_context_words(text, target_word, context_size=3):
    """
    Extract words around a target word using lookahead/lookbehind.

    Example: For "machine learning" in "I love machine learning and AI",
             extract "love" (before) and "and" (after) with context_size=1

    Args:
        text: Input text
        target_word: Word to find context for
        context_size: Number of words before and after to extract

    Returns:
        dict: {'before': [words], 'after': [words], 'target': word}
    """
    # TODO: Use lookbehind to capture words before target
    # TODO: Use lookahead to capture words after target
    # TODO: Return context words

    context = {
        'before': [],
        'after': [],
        'target': target_word
    }

    return context

# Test cases
test_text = """
    The director said "This is the best movie I've ever made."
    He added, 'It's a masterpiece' and everyone agreed.
    The phrase "machine learning" appears often in AI discussions.
"""

# TODO: Extract quoted text and context
# quoted = extract_quoted_text(test_text)
# context = extract_context_words(test_text, "machine", context_size=2)
# print("Quoted text:", quoted)
# print("Context:", context)
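
Before applying lookarounds to quotes, it can help to see them on a simpler case. The sketch below uses a made-up sentence and shows a positive lookbehind, a positive lookahead, and a negative lookbehind. In each case the asserted character is checked but not included in the match.

```python
import re

text = "Budget: $50 million, rating: 8 out of 10, gross: $150 million."

# Positive lookbehind: digits that directly follow a dollar sign
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Positive lookahead: words that are directly followed by a colon
labels = re.findall(r"\b\w+(?=:)", text)

# Negative lookbehind: digit runs that do NOT follow a dollar sign or another digit
other_numbers = re.findall(r"(?<![\$\d])\d+", text)

print("Dollar amounts:", dollar_amounts)
print("Labels:", labels)
print("Other numbers:", other_numbers)
```

One caveat for the quoted-text exercise: a lookaround-only pattern such as `(?<=")[^"]+(?=")` never consumes the quote characters, so on input with two quoted strings it also matches the text between the first closing quote and the second opening quote. Testing on input with more than one quoted string exposes this.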


In [None]:
# Exercise 7d: Regex Substitution with Callbacks

def anonymize_emails(text):
    """
    Replace email addresses with "[EMAIL]" placeholder.
    Use re.sub() with a function callback.

    Args:
        text: Input text

    Returns:
        str: Text with emails anonymized
    """
    # TODO: Use re.sub() with a function that replaces email pattern
    # Pattern: something@domain.com
    # Replace with: "[EMAIL]"

    return text

def normalize_whitespace_advanced(text):
    """
    Normalize whitespace, but preserve intentional line breaks.
    - Replace multiple spaces with single space
    - Replace multiple newlines (2+) with double newline (paragraph break)
    - Preserve single newlines (line breaks)

    Args:
        text: Input text

    Returns:
        str: Normalized text
    """
    # TODO: Use re.sub() with callbacks to handle different whitespace patterns
    # TODO: Preserve intentional formatting while cleaning up excessive whitespace

    return text

def format_numbers_readable(text):
    """
    Format large numbers to be more readable.
    - "1000000" → "1,000,000"
    - "1500" → "1,500"
    - But preserve decimals: "1234.56" → "1,234.56"

    Use re.sub() with a callback function to format numbers.

    Args:
        text: Input text

    Returns:
        str: Text with formatted numbers
    """
    # TODO: Find numbers in text
    # TODO: Use callback function to add commas every 3 digits
    # TODO: Preserve decimal points

    return text

# Test cases
test_text = """
    Contact us at info@example.com or support@company.org.
    The movie made 1500000 dollars in its first week.
    It had 5000 viewers on opening day.
    The budget was 2500000.50 dollars.
"""

# TODO: Test anonymization and formatting
# anonymized = anonymize_emails(test_text)
# formatted = format_numbers_readable(test_text)
# print("Anonymized:", anonymized)
# print("Formatted:", formatted)
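
The callback form of `re.sub()` passes each match object to a function and inserts whatever string the function returns. A minimal sketch for the number-formatting case follows. `add_thousands_separators` is a hypothetical name, and the choice to format only runs of four or more digits is an assumption.

```python
import re

def add_thousands_separators(text):
    """Insert commas into runs of 4+ digits, leaving any decimal part untouched."""
    def _format(match):
        integer_part = match.group("int")
        decimal_part = match.group("dec") or ""
        return f"{int(integer_part):,}{decimal_part}"

    # The lookbehind skips digits that continue a number (after a digit, "." or ",")
    return re.sub(r"(?<![\d.,])(?P<int>\d{4,})(?P<dec>\.\d+)?", _format, text)

print(add_thousands_separators(
    "It made 1500000 dollars on a budget of 2500000.50; 999 seats, year 2024."
))
```

Note that the year `2024` becomes `2,024` and that `int()` would drop leading zeros, so a fuller solution needs extra rules for years, IDs, and similar values.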


## Exercise 8: Handling Special Cases in Preprocessing

Handle edge cases in text preprocessing that are common in real-world data.

**Goal**: Make preprocessing robust to handle messy, real-world text data.

**What you'll handle:**
1. Mixed encoding issues (unicode, emojis, special characters)
2. Inconsistent capitalization (acronyms, proper nouns)
3. Numbers and units (measurements, percentages, dates)
4. Abbreviations and contractions
5. Whitespace inconsistencies and formatting artifacts
6. Missing or corrupted data (NaN, None, empty strings)


In [None]:
# Exercise 8a: Handle Unicode and Special Characters

def clean_unicode_text(text):
    """
    Clean text with unicode issues:
    - Normalize unicode characters (é → e, ñ → n)
    - Remove or replace emojis
    - Handle special quote characters (“ ” → ", ‘ ’ → ')
    - Remove zero-width spaces and other invisible characters

    Args:
        text: Input text with unicode issues

    Returns:
        str: Cleaned text
    """
    import unicodedata

    # TODO: Normalize unicode (NFD or NFC)
    # TODO: Remove or replace emojis
    # TODO: Normalize quotes and special characters
    # TODO: Remove invisible characters

    return text

def handle_mixed_encoding(text):
    """
    Handle text that may have encoding issues (latin-1, utf-8, etc.)
    - Try to decode with different encodings
    - Replace problematic characters with approximations
    - Handle encoding errors gracefully

    Args:
        text: Potentially corrupted text

    Returns:
        str: Cleaned text with proper encoding
    """
    # TODO: Handle encoding errors
    # TODO: Replace problematic characters
    # Hint: Use .encode() and .decode() with error handling

    return text

# Test cases
test_text = """
    The movie "Inception" was amazing! 🎬
    It's a sci-fi masterpiece with great actors.
    The director's name is Christopher Nolan.
    Rating: 9/10 ⭐⭐⭐⭐⭐
"""

# TODO: Clean unicode and test
# cleaned = clean_unicode_text(test_text)
# print("Cleaned:", cleaned)
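
A hedged sketch of three of the steps (quotes, invisible characters, accents) using only the standard library is shown below. `strip_accents_and_invisibles` is a hypothetical helper, the list of zero-width characters is not exhaustive, and emoji handling is left out.

```python
import re
import unicodedata

def strip_accents_and_invisibles(text):
    """Fold accented letters to ASCII, straighten curly quotes, drop zero-width characters."""
    # Map curly quotes to their straight equivalents
    quote_map = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
    text = text.translate(quote_map)
    # Remove zero-width space, zero-width (non-)joiners, and the byte-order mark
    text = re.sub("[\u200b\u200c\u200d\ufeff]", "", text)
    # NFD splits "é" into "e" plus a combining accent; dropping combining marks leaves "e"
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents_and_invisibles("Am\u00e9lie \u201cse\u00f1or\u201d zero\u200bwidth"))
```

Stripping accents changes meaning in some languages, so whether to apply this step depends on the corpus. For English movie descriptions it is usually harmless.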


In [None]:
# Exercise 8b: Handle Inconsistent Capitalization

def smart_lowercase(text, preserve_acronyms=True, preserve_proper_nouns=False):
    """
    Convert text to lowercase intelligently:
    - Option 1: Preserve acronyms (NASA, AI, USA → keep uppercase)
    - Option 2: Preserve proper nouns (names, places) → more complex!

    Args:
        text: Input text
        preserve_acronyms: If True, keep acronyms uppercase
        preserve_proper_nouns: If True, try to preserve proper nouns (challenging!)

    Returns:
        str: Smartly lowercased text
    """
    # TODO: If preserve_acronyms, detect acronyms (all caps, 2+ chars)
    # TODO: Convert rest to lowercase
    # TODO: If preserve_proper_nouns, use heuristics (capitalized words that are NOT at the start of a sentence)
    # Note: Full proper noun detection requires NER (Named Entity Recognition) - not covered here!

    return text

def handle_title_case(text):
    """
    Normalize title case inconsistencies.
    - "Star Wars" vs "STAR WARS" vs "star wars" → "star wars"
    - But preserve intentional capitalization when needed

    Args:
        text: Input text

    Returns:
        str: Normalized text
    """
    # TODO: Handle different capitalization styles
    # TODO: Convert to consistent lowercase (or preserve known proper nouns)

    return text

# Test cases
test_text = """
    The movie AI is about artificial intelligence.
    NASA scientists worked on the film.
    The director is Christopher Nolan, not CHRIS NOLAN.
    The film STAR WARS is a classic.
"""

# TODO: Test smart lowercase
# smart_lower = smart_lowercase(test_text, preserve_acronyms=True)
# print("Smart lowercase:", smart_lower)
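
The "all caps, 2+ chars" rule from the TODO would also keep shouted words such as "STAR WARS" in uppercase. One alternative, sketched here as an assumption rather than the intended solution, is an explicit allowlist. `KNOWN_ACRONYMS` and `lowercase_except_acronyms` are hypothetical names.

```python
import re

# Hypothetical allowlist of acronyms to keep in uppercase
KNOWN_ACRONYMS = {"NASA", "AI", "USA"}

def lowercase_except_acronyms(text, acronyms=KNOWN_ACRONYMS):
    """Lowercase every alphabetic word unless it appears in the acronym allowlist."""
    def _convert(match):
        word = match.group(0)
        return word if word in acronyms else word.lower()

    return re.sub(r"[A-Za-z]+", _convert, text)

print(lowercase_except_acronyms("NASA scientists praised the AI film STAR WARS."))
```

The trade-off is maintenance: an allowlist never keeps a shouted word by mistake, but it misses any acronym that has not been added to it.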


In [None]:
# Exercise 8c: Handle Missing and Corrupted Data

def preprocess_robust(text):
    """
    Robust preprocessing that handles:
    - None/NaN values
    - Empty strings
    - Whitespace-only strings
    - Very long strings (truncate if needed)
    - Non-string types (convert to string)

    Args:
        text: Potentially problematic input

    Returns:
        str: Cleaned text or empty string if invalid
    """
    import pandas as pd
    import numpy as np

    # TODO: Check if text is None or NaN
    # TODO: Check if text is empty or whitespace-only
    # TODO: Convert to string if not already
    # TODO: Handle edge cases (too long, wrong type, etc.)

    if text is None or (isinstance(text, float) and np.isnan(text)):
        return ""

    # TODO: Continue with cleaning...

    return str(text) if text else ""

def batch_preprocess_robust(texts):
    """
    Preprocess a list of texts, handling missing/corrupted entries.

    Args:
        texts: List of texts (may contain None, NaN, etc.)

    Returns:
        list: List of cleaned texts (same length, invalid entries become empty strings)
    """
    # TODO: Process each text with preprocess_robust
    # TODO: Maintain same length as input
    # TODO: Log or track which entries were invalid

    cleaned = []
    invalid_indices = []

    for i, text in enumerate(texts):
        cleaned_text = preprocess_robust(text)
        if not cleaned_text:
            invalid_indices.append(i)
        cleaned.append(cleaned_text)

    if invalid_indices:
        print(f"Warning: {len(invalid_indices)} invalid entries found at indices: {invalid_indices[:10]}...")

    return cleaned

# Test cases
test_texts = [
    "Normal text here",
    None,
    "",
    "   ",  # whitespace only
    "Valid text with content",
    float('nan'),
    "Another valid entry",
    12345,  # number instead of string
    "Good text"
]

# TODO: Test robust preprocessing
# cleaned_texts = batch_preprocess_robust(test_texts)
# print("Cleaned texts:", cleaned_texts)
# print(f"Valid entries: {sum(1 for t in cleaned_texts if t)}/{len(cleaned_texts)}")
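
The scaffold above returns a whitespace-only string unchanged (so it counts as valid) and turns the number `0` into an empty string. The sketch below shows one way to handle both cases with the standard library. `to_clean_string` and its `max_chars` default are hypothetical choices.

```python
import math

def to_clean_string(value, max_chars=10_000):
    """Return a stripped string, or "" for None, NaN, and empty/whitespace-only input."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    # str() covers non-string types; strip() turns whitespace-only input into ""
    text = str(value).strip()
    # Truncate very long inputs to a fixed maximum length
    return text[:max_chars]

samples = ["Normal text", None, "", "   ", float("nan"), 12345, 0]
print([to_clean_string(s) for s in samples])
```

With this version `"   "` becomes `""` and is reported as invalid, while `0` is kept as `"0"`.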


In [None]:
# Exercise 8d: Handle Numbers and Units in Text

def normalize_numbers_and_units(text):
    """
    Normalize numbers and units for better text processing:
    - "100 years" → "100_years" or "[NUMBER] years" (preserve context)
    - "50%" → "50_percent" or "[PERCENTAGE]"
    - "3.5 stars" → "3.5_stars" or "[RATING]"

    Options:
    1. Replace with placeholders: "[NUMBER]", "[PERCENTAGE]", etc.
    2. Keep as-is but mark: "100_years" (replace space with underscore)
    3. Remove entirely: "100 years" → "years"

    Args:
        text: Input text

    Returns:
        str: Text with normalized numbers/units
    """
    # TODO: Detect numbers with units (years, dollars, percent, etc.)
    # TODO: Normalize format (choose one approach above)
    # TODO: Handle different number formats (integers, decimals, percentages)

    return text

def extract_numeric_metadata(text):
    """
    Extract numeric metadata (ratings, years, amounts) and store separately.
    This allows keeping text clean while preserving important numeric information.

    Args:
        text: Input text

    Returns:
        dict: {
            'text': cleaned text (numbers removed or replaced),
            'ratings': [list of ratings],
            'years': [list of years],
            'amounts': [list of monetary amounts],
            'percentages': [list of percentages]
        }
    """
    # TODO: Extract different types of numbers
    # TODO: Remove or replace them in text
    # TODO: Return both cleaned text and extracted metadata

    metadata = {
        'text': text,
        'ratings': [],
        'years': [],
        'amounts': [],
        'percentages': []
    }

    return metadata

# Test cases
test_text = """
    The movie was released in 2010 and grossed $800 million.
    It has a rating of 8.7/10 with 95% positive reviews.
    The runtime is 148 minutes and it won 4 Oscars.
"""

# TODO: Test number normalization
# normalized = normalize_numbers_and_units(test_text)
# metadata = extract_numeric_metadata(test_text)
# print("Normalized:", normalized)
# print("Metadata:", metadata)
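
A minimal sketch of the placeholder option (option 1) follows. `replace_numbers_with_placeholders` and the `[YEAR]` placeholder are hypothetical, and treating any four-digit number starting with 19 or 20 as a year is a heuristic that will misfire on counts such as "2010 tickets".

```python
import re

def replace_numbers_with_placeholders(text):
    """Replace percentages, 4-digit years, and remaining numbers with placeholder tokens."""
    # Order matters: the most specific patterns run first
    text = re.sub(r"\b\d+(?:\.\d+)?%", "[PERCENTAGE]", text)
    text = re.sub(r"\b(?:19|20)\d{2}\b", "[YEAR]", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "[NUMBER]", text)
    return text

print(replace_numbers_with_placeholders(
    "Released in 2010, rated 8.7/10 with 95% positive reviews over 148 minutes."
))
```

If the generic number pattern ran first, `2010` and `95` would already be replaced by `[NUMBER]` and the more specific placeholders would never apply.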


In [None]:
# Exercise 8e: Complete Robust Preprocessing Pipeline

def preprocess_robust_pipeline(text,
                               handle_unicode=True,
                               handle_capitalization=True,
                               handle_numbers=True,
                               handle_missing=True):
    """
    Complete robust preprocessing pipeline that handles all edge cases.

    Pipeline:
    1. Handle missing/corrupted data
    2. Handle unicode and special characters
    3. Handle capitalization (smart lowercase)
    4. Handle numbers and units (normalize or extract)
    5. Basic cleaning (URLs, emails, etc. - from Exercise 1)
    6. Normalize whitespace

    Args:
        text: Raw input text
        handle_unicode: Whether to clean unicode
        handle_capitalization: Whether to apply smart lowercase
        handle_numbers: Whether to normalize numbers/units
        handle_missing: Whether to handle missing data

    Returns:
        str: Fully preprocessed text
    """
    # TODO: Step 1: Handle missing data
    if handle_missing:
        text = preprocess_robust(text)
        if not text:
            return ""

    # TODO: Step 2: Handle unicode
    if handle_unicode:
        text = clean_unicode_text(text)

    # TODO: Step 3: Handle capitalization
    if handle_capitalization:
        text = smart_lowercase(text, preserve_acronyms=True)

    # TODO: Step 4: Handle numbers (optional - normalize or extract)
    if handle_numbers:
        # Option: Normalize or extract metadata
        text = normalize_numbers_and_units(text)

    # TODO: Step 5: Basic cleaning (from Exercise 1)
    # text = clean_text(text)  # Use your function from Exercise 1

    # TODO: Step 6: Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test on real movie data
print("Testing robust preprocessing on movie descriptions:")
print("=" * 70)

# Test on a few movie descriptions (handle potential missing data)
for idx in range(min(5, len(df))):
    original = df.loc[idx, 'description']
    if pd.isna(original):
        print(f"\nMovie {idx}: [MISSING DATA]")
        continue

    processed = preprocess_robust_pipeline(original)

    print(f"\nMovie {idx}:")
    print(f"Original (first 100 chars): {str(original)[:100]}...")
    print(f"Processed (first 100 chars): {processed[:100]}...")
    print(f"Length: {len(str(original))} → {len(processed)} chars")

print("\n" + "=" * 70)
print("💡 Key Insight: Robust preprocessing handles real-world data issues!")
print("   Always test your preprocessing on actual data to find edge cases.")
