# Module 01: Text Preprocessing and Tokenization

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 100 minutes  
**Prerequisites**: [Module 00: Introduction to NLP](00_introduction_to_nlp.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Apply advanced text cleaning techniques using regular expressions
2. Understand and implement modern tokenization strategies (BPE, WordPiece)
3. Handle special text elements (URLs, mentions, emojis, hashtags)
4. Build production-ready text preprocessing pipelines
5. Compare different tokenization methods and their use cases

## Why Preprocessing Matters

Text preprocessing is the **foundation** of any NLP pipeline. Poor preprocessing can:
- Introduce noise and reduce model accuracy
- Create inconsistent representations
- Waste computational resources
- Cause failures in production

**Good preprocessing**:
- Standardizes inputs
- Reduces vocabulary size
- Improves model generalization
- Handles edge cases gracefully

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter

# NLP libraries
import nltk
from nltk.tokenize import word_tokenize, TweetTokenizer
import spacy

# Hugging Face tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import AutoTokenizer

# Visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

# Random seed
np.random.seed(42)

print("✓ Libraries imported successfully!")

In [None]:
# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by word_tokenize in newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Load spaCy model (install with: python -m spacy download en_core_web_sm)
try:
    nlp = spacy.load('en_core_web_sm')
    print("✓ spaCy model loaded successfully!")
except OSError:
    print("⚠ spaCy model not found. Install with: python -m spacy download en_core_web_sm")
    nlp = None

## 1. Advanced Text Cleaning with Regular Expressions

Regular expressions (regex) are powerful tools for pattern matching and text manipulation.

### 1.1 Cleaning Social Media Text

Social media text contains many special elements that need careful handling.

In [None]:
# Sample social media text
social_media_text = """
@john_doe Check out this amazing article! https://example.com/article123 
#NLP #MachineLearning #AI 🚀🔥
Email me at contact@example.com for more info!!!
Price: $99.99 (50% OFF) - Limited time only!!!
RT @jane_smith: This is sooo cool 😍😍😍
"""

print("Original text:")
print(social_media_text)

In [None]:
def clean_social_media_text(text, remove_urls=True, remove_mentions=True, 
                           remove_hashtags=False, remove_emojis=True,
                           remove_emails=True, normalize_whitespace=True):
    """
    Clean social media text with configurable options.
    
    Parameters:
    -----------
    text : str
        Input text to clean
    remove_urls : bool
        Whether to remove URLs
    remove_mentions : bool
        Whether to remove @mentions
    remove_hashtags : bool
        Whether to remove #hashtags (keep False to preserve topics)
    remove_emojis : bool
        Whether to remove emoji characters
    remove_emails : bool
        Whether to remove email addresses
    normalize_whitespace : bool
        Whether to normalize multiple spaces to single space
        
    Returns:
    --------
    str : Cleaned text
    """
    
    # Remove URLs
    if remove_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove email addresses
    if remove_emails:
        text = re.sub(r'\S+@\S+', '', text)
    
    # Remove @mentions
    if remove_mentions:
        text = re.sub(r'@\w+', '', text)
    
    # Remove or clean hashtags
    if remove_hashtags:
        text = re.sub(r'#\w+', '', text)
    else:
        # Keep hashtag content but remove the # symbol
        text = re.sub(r'#(\w+)', r'\1', text)
    
    # Remove emojis
    if remove_emojis:
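        # Emoji pattern covering the most common emoji blocks (not exhaustive).
        # Ranges are kept narrow so that non-emoji scripts (e.g. CJK) are not stripped.
        emoji_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # emoticons
            "\U0001F300-\U0001F5FF"  # symbols & pictographs
            "\U0001F680-\U0001F6FF"  # transport & map symbols
            "\U0001F1E0-\U0001F1FF"  # flags
            "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
            "\U00002600-\U000026FF"  # miscellaneous symbols
            "\U00002700-\U000027BF"  # dingbats
            "]+", flags=re.UNICODE
        )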
        text = emoji_pattern.sub(r'', text)
    
    # Remove RT (retweet indicator)
    text = re.sub(r'\bRT\b', '', text)
    
    # Normalize whitespace
    if normalize_whitespace:
        text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

# Test the cleaning function
cleaned_text = clean_social_media_text(social_media_text)
print("Cleaned text:")
print(cleaned_text)
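
The cleaning function above removes email addresses before @mentions, and that order matters: the mention pattern `@\w+` also matches the middle of an email address. The following is a minimal sketch of the effect, using two hypothetical helper functions (`strip_emails_then_mentions` and `strip_mentions_then_emails`) that reuse the same two patterns in opposite orders.

```python
import re

# Same patterns as in clean_social_media_text, precompiled for reuse
EMAIL_RE = re.compile(r'\S+@\S+')
MENTION_RE = re.compile(r'@\w+')
SPACE_RE = re.compile(r'\s+')

def strip_emails_then_mentions(text):
    """Remove emails first, then mentions (the order used above)."""
    text = EMAIL_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()

def strip_mentions_then_emails(text):
    """Reverse order: the mention pattern eats part of the email first."""
    text = MENTION_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()

sample = "@john_doe email contact@example.com today"
good = strip_emails_then_mentions(sample)
bad = strip_mentions_then_emails(sample)

print("emails first:  ", good)
print("mentions first:", bad)  # leaves a fragment of the address behind
```

When rules overlap like this, applying the more specific pattern first is usually the safer choice.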

**Exercise 1**: Custom text cleaner

Modify the cleaning function to:
1. Replace repeated punctuation ("!!!", "???") with single instances
2. Expand contractions ("don't" → "do not", "won't" → "will not")
3. Remove or replace price mentions ("$99.99") with a token like "[PRICE]"

In [None]:
# YOUR CODE HERE
def advanced_clean_text(text):
    """
    Apply advanced cleaning including:
    - Repeated punctuation normalization
    - Contraction expansion
    - Price tokenization
    """
    # Hint: Use re.sub() with appropriate patterns
    pass

# Test your function
test_text = "This is sooo cool!!! It won't cost $99.99 anymore!"
# Expected output: "This is sooo cool! It will not cost [PRICE] anymore!"
# (Collapsing elongated words such as "sooo" into "so" is an optional extra.)

### 1.2 Unicode Normalization

Text from different sources may use different Unicode representations. Normalization ensures consistency.

In [None]:
import unicodedata

# Examples of Unicode variations
text1 = "caf\u00e9"   # é as a single precomposed character (U+00E9)
text2 = "cafe\u0301"  # é as e + combining acute accent (U+0065 U+0301)

print(f"Text 1: {text1} (length: {len(text1)})")
print(f"Text 2: {text2} (length: {len(text2)})")
print(f"Are they equal? {text1 == text2}")

# Normalize both to NFC (Canonical Decomposition followed by Canonical Composition)
normalized1 = unicodedata.normalize('NFC', text1)
normalized2 = unicodedata.normalize('NFC', text2)

print(f"\nAfter normalization: {normalized1 == normalized2}")

In [None]:
def normalize_unicode(text, form='NFC'):
    """
    Normalize Unicode text.
    
    Forms:
    - NFC: Canonical Decomposition + Composition (recommended)
    - NFD: Canonical Decomposition
    - NFKC: Compatibility Decomposition + Composition
    - NFKD: Compatibility Decomposition
    """
    return unicodedata.normalize(form, text)

# Test with accented characters
test_texts = ["naïve café", "Zürich", "señor"]
for text in test_texts:
    normalized = normalize_unicode(text)
    print(f"{text:15} → {normalized:15} (length: {len(text)} → {len(normalized)})")
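
The four normalization forms listed in the docstring differ mainly in whether they also fold *compatibility* characters (ligatures, full-width digits). Here is a small sketch of that difference using only the standard library; `compare_forms` and the sample strings are illustrative names chosen here, not part of the notebook's pipeline.

```python
import unicodedata

def compare_forms(text):
    """Return the text normalized under each of the four Unicode forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

samples = {
    "ligature": "\ufb01ne",              # starts with U+FB01, the "fi" ligature
    "fullwidth": "\uff11\uff12\uff13",   # full-width digits 1, 2, 3
    "combining": "cafe\u0301",           # e followed by a combining acute accent
}

for name, text in samples.items():
    forms = compare_forms(text)
    lengths = {form: len(value) for form, value in forms.items()}
    print(f"{name:10} lengths per form: {lengths}")

# Canonical forms (NFC/NFD) leave the ligature alone;
# compatibility forms (NFKC/NFKD) replace it with plain letters.
print(compare_forms("\ufb01ne")["NFKC"])
```

NFKC is often convenient for search and matching, but it discards distinctions that some applications want to keep, so NFC remains the safer default.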

## 2. Modern Tokenization Strategies

While simple word tokenization works for basic cases, modern NLP uses more sophisticated methods.

### 2.1 Comparison of Tokenization Methods

Let's compare different tokenization approaches on the same text.

In [None]:
sample_text = "The quick-brown fox jumps over the lazy dog. It's running at 25mph!"

# Method 1: Simple split on whitespace
simple_tokens = sample_text.split()

# Method 2: NLTK word tokenizer
nltk_tokens = word_tokenize(sample_text)

# Method 3: Tweet tokenizer (preserves hashtags, mentions)
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(sample_text)

# Method 4: spaCy tokenizer
if nlp:
    doc = nlp(sample_text)
    spacy_tokens = [token.text for token in doc]
else:
    spacy_tokens = ["spaCy not available"]

# Compare results
comparison_df = pd.DataFrame({
    'Method': ['Simple Split', 'NLTK', 'TweetTokenizer', 'spaCy'],
    'Token Count': [len(simple_tokens), len(nltk_tokens), 
                   len(tweet_tokens), len(spacy_tokens)],
    'Sample Tokens': [
        str(simple_tokens[:5]),
        str(nltk_tokens[:5]),
        str(tweet_tokens[:5]),
        str(spacy_tokens[:5])
    ]
})

print(comparison_df.to_string(index=False))

**Observations**:
- Simple split leaves punctuation attached to words ("dog.", "25mph!")
- NLTK and spaCy handle contractions better
- Different tokenizers make different decisions about hyphenated words
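
To make these observations concrete without any NLP library, here is a rough sketch of a regex-based tokenizer next to a plain whitespace split. The function names and the `keep_compounds` flag are hypothetical, introduced only for this illustration; real tokenizers such as NLTK's or spaCy's apply many more rules.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def regex_tokenize(text, keep_compounds=True):
    """Split into words and single punctuation marks.

    keep_compounds=True keeps hyphenated words and contractions whole;
    keep_compounds=False splits them at the hyphen or apostrophe.
    """
    if keep_compounds:
        pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
    else:
        pattern = r"\w+|[^\w\s]"
    return re.findall(pattern, text)

text = "The quick-brown fox. It's running at 25mph!"

ws_tokens = whitespace_tokenize(text)
kept_tokens = regex_tokenize(text, keep_compounds=True)
split_tokens = regex_tokenize(text, keep_compounds=False)

print(len(ws_tokens), ws_tokens)
print(len(kept_tokens), kept_tokens)
print(len(split_tokens), split_tokens)
```

Neither setting of `keep_compounds` is "correct"; which one is preferable depends on the downstream task.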

### 2.2 Subword Tokenization: BPE (Byte Pair Encoding)

**Why subword tokenization?**

Word-level tokenization has problems:
- **Large vocabulary**: English has 170,000+ words
- **Out-of-vocabulary (OOV)**: Can't handle new or misspelled words
- **Morphology**: "run", "running", "runner" treated as completely different

**BPE Solution**: Break words into subword units (the splits below are illustrative; the actual pieces depend on the merges learned from the training corpus)
- "running" → ["run", "ning"]
- "unhappiness" → ["un", "happi", "ness"]
- "coronavirus" (new word) → ["coron", "avirus"] (can be understood from parts)

**Used by**: GPT, GPT-2, RoBERTa, BART
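
To see where subword units come from, the sketch below learns BPE merges from scratch on a tiny word list: start from single characters, count adjacent pairs, and repeatedly merge the most frequent pair. It is a simplified teaching version (no end-of-word marker, no byte-level handling, and a tie-breaking rule chosen here for determinism); the function names are hypothetical and not part of the `tokenizers` library.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Return the ordered list of merges learned from a list of words."""
    word_freqs = {tuple(w): f for w, f in Counter(corpus_words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges

def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols

words = ["run"] * 5 + ["running"] * 4 + ["runner"] * 3 + ["jumping"] * 2
merges = learn_bpe(words, num_merges=2)
print("Merges:", merges)

# Words that never appeared in the corpus still get segmented
for unseen in ["runs", "rerun"]:
    print(unseen, "->", apply_bpe(unseen, merges))
```

With only two merges the frequent stem "run" already becomes a single unit, and unseen words fall back to that unit plus individual characters instead of an unknown token.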

In [None]:
# Train a simple BPE tokenizer
# First, create training data
training_corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Natural language processing is amazing",
    "Machine learning transforms how we process text",
    "Deep learning models require lots of data",
    "Transformers revolutionized natural language understanding",
    "BERT and GPT are popular transformer models",
    "Fine-tuning pre-trained models saves time and resources",
    "Tokenization is a crucial preprocessing step",
] * 10  # Repeat for better training

# Save to file (Tokenizer.train reads from files; train_from_iterator accepts in-memory text)
with open('/tmp/training_data.txt', 'w', encoding='utf-8') as f:
    for text in training_corpus:
        f.write(text + '\n')

print(f"Training corpus: {len(training_corpus)} sentences")

In [None]:
# Initialize BPE tokenizer
bpe_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = Whitespace()

# Train BPE with small vocabulary
trainer = BpeTrainer(
    vocab_size=100,  # Small vocab for demonstration
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

bpe_tokenizer.train(['/tmp/training_data.txt'], trainer)
print("✓ BPE tokenizer trained!")

# Get vocabulary
vocab = bpe_tokenizer.get_vocab()
print(f"Vocabulary size: {len(vocab)}")
print(f"\nSample vocabulary (first 20 tokens):")
print(list(vocab.keys())[:20])

In [None]:
# Test BPE tokenization
test_sentences = [
    "Natural language processing",
    "Transformers are revolutionary",  # 'revolutionary' might be split
    "Preprocessing text data",
]

print("BPE Tokenization Results:\n")
for sentence in test_sentences:
    encoding = bpe_tokenizer.encode(sentence)
    print(f"Input: {sentence}")
    print(f"Tokens: {encoding.tokens}")
    print(f"IDs: {encoding.ids}")
    print()

**Exercise 2**: Analyze BPE behavior

Test the BPE tokenizer on words that weren't in the training data:
1. "coronavirus" (new word)
2. "antidisestablishmentarianism" (very long word)
3. "happily" vs "unhappily" (morphological variants)

Observe how BPE breaks them into subwords.

In [None]:
# YOUR CODE HERE
oov_words = ["coronavirus", "antidisestablishmentarianism", "happily", "unhappily"]

# Tokenize each and observe the subword breakdown

### 2.3 WordPiece Tokenization

**WordPiece** is similar to BPE but uses a different merging criterion (likelihood-based).

**Used by**: BERT, DistilBERT, ELECTRA

**Key difference**: Instead of frequency-based merging, WordPiece maximizes the likelihood of the training data.
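
One commonly described way to approximate this likelihood criterion is to score each candidate pair as `count(pair) / (count(first) * count(second))` instead of using the raw pair count. The sketch below contrasts the two selection rules on toy data. Treat the scoring formula as an assumption for illustration: the original WordPiece training procedure was not published in full, and the function names here are hypothetical.

```python
from collections import Counter

def pair_and_symbol_counts(word_freqs):
    """Count adjacent pairs and individual symbols, weighted by word frequency."""
    pair_counts, symbol_counts = Counter(), Counter()
    for symbols, freq in word_freqs.items():
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return pair_counts, symbol_counts

def best_pair_bpe(word_freqs):
    """BPE-style choice: the most frequent adjacent pair."""
    pair_counts, _ = pair_and_symbol_counts(word_freqs)
    return max(pair_counts, key=lambda p: (pair_counts[p], p))

def best_pair_wordpiece(word_freqs):
    """WordPiece-style choice (assumed formula): pair count normalized
    by the counts of its two parts."""
    pair_counts, symbol_counts = pair_and_symbol_counts(word_freqs)

    def score(p):
        return pair_counts[p] / (symbol_counts[p[0]] * symbol_counts[p[1]])

    return max(pair_counts, key=lambda p: (score(p), p))

# Toy corpus: words as tuples of characters mapped to their frequencies
toy_word_freqs = {
    ("t", "h", "e"): 10,
    ("t", "h", "a", "t"): 6,
    ("q", "u"): 2,
}

bpe_choice = best_pair_bpe(toy_word_freqs)
wp_choice = best_pair_wordpiece(toy_word_freqs)
print("BPE merges first:      ", bpe_choice)
print("WordPiece merges first:", wp_choice)
```

The normalized score favors pairs whose parts rarely occur apart, even when the pair itself is less frequent than a common pair such as "t" + "h".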

In [None]:
# Initialize WordPiece tokenizer
wordpiece_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
wordpiece_tokenizer.pre_tokenizer = Whitespace()

# Train WordPiece
wp_trainer = WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

wordpiece_tokenizer.train(['/tmp/training_data.txt'], wp_trainer)
print("✓ WordPiece tokenizer trained!")

In [None]:
# Compare BPE vs WordPiece
test_text = "Preprocessing transformers for natural language understanding"

bpe_encoding = bpe_tokenizer.encode(test_text)
wp_encoding = wordpiece_tokenizer.encode(test_text)

print("Input text:", test_text)
print("\nBPE tokens:", bpe_encoding.tokens)
print("WordPiece tokens:", wp_encoding.tokens)
print(f"\nToken count - BPE: {len(bpe_encoding.tokens)}, WordPiece: {len(wp_encoding.tokens)}")

### 2.4 Using Pre-trained Tokenizers

In practice, we use tokenizers from pre-trained models like BERT, GPT-2, etc.

In [None]:
# Load BERT tokenizer (WordPiece)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Load GPT-2 tokenizer (BPE)
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

print("✓ Pre-trained tokenizers loaded!")

In [None]:
# Compare BERT vs GPT-2 tokenization
test_text = "The unhappiest preprocessing experience!"

bert_tokens = bert_tokenizer.tokenize(test_text)
gpt2_tokens = gpt2_tokenizer.tokenize(test_text)

print(f"Input: {test_text}\n")
print(f"BERT (WordPiece): {bert_tokens}")
print(f"GPT-2 (BPE): {gpt2_tokens}")

# Get IDs (what the model actually sees)
# Note: encode() adds special tokens by default, so BERT's IDs include [CLS] and [SEP]
bert_ids = bert_tokenizer.encode(test_text)
gpt2_ids = gpt2_tokenizer.encode(test_text)

print(f"\nBERT IDs: {bert_ids}")
print(f"GPT-2 IDs: {gpt2_ids}")

**Key Observations**:
- BERT uses `##` prefix for subword continuations
- GPT-2 uses `Ġ` prefix for spaces (byte-level BPE)
- Both can handle OOV words by breaking them into subwords
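
The two marker conventions can be undone with a few lines of plain Python, which is a handy way to check that you are reading token lists correctly. The sketch below is an illustration only: the token lists are hand-written examples rather than real model output, the helper names are hypothetical, and the byte-level helper handles only the space marker (real byte-level BPE remaps other bytes as well, so prefer the tokenizer's own `decode` in practice).

```python
def merge_wordpiece(tokens, prefix="##"):
    """Rebuild words from WordPiece-style tokens.

    A token starting with the prefix continues the previous word.
    """
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words

def merge_byte_level(tokens, space_marker="\u0120"):
    """Rebuild words from GPT-2-style tokens.

    The marker (U+0120, displayed as Ġ) means the token was preceded by a space.
    """
    text = "".join(tokens).replace(space_marker, " ")
    return text.split()

# Hand-written token lists that follow each convention
wordpiece_style = ["un", "##happi", "##est", "day"]
byte_level_style = ["The", "\u0120un", "happ", "iest", "\u0120day"]

wp_words = merge_wordpiece(wordpiece_style)
bl_words = merge_byte_level(byte_level_style)
print(wp_words)
print(bl_words)
```

Note the asymmetry: WordPiece marks the pieces that *continue* a word, while byte-level BPE marks the pieces that *start* a new word after a space.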

**Exercise 3**: Tokenization comparison

Compare how BERT and GPT-2 tokenize these challenging cases:
1. "COVID-19"
2. "don't", "won't", "I'm"
3. "antidisestablishmentarianism"
4. "🚀 rocket emoji"

Explain the differences you observe.

In [None]:
# YOUR CODE HERE
challenging_texts = [
    "COVID-19",
    "don't won't I'm",
    "antidisestablishmentarianism",
    "🚀 rocket emoji"
]

# Compare BERT and GPT-2 tokenization for each

## 3. Building a Production-Ready Preprocessing Pipeline

Let's combine everything into a robust, reusable pipeline.

In [None]:
class TextPreprocessor:
    """
    Production-ready text preprocessing pipeline.
    """
    
    def __init__(self, 
                 lowercase=True,
                 remove_urls=True,
                 remove_mentions=True,
                 remove_hashtags=False,
                 remove_emojis=False,
                 normalize_unicode=True,
                 tokenizer_name='bert-base-uncased'):
        """
        Initialize preprocessor with configuration.
        
        Parameters:
        -----------
        lowercase : bool
            Convert text to lowercase
        remove_urls : bool
            Remove URLs from text
        remove_mentions : bool
            Remove @mentions
        remove_hashtags : bool
            Remove #hashtags
        remove_emojis : bool
            Remove emoji characters
        normalize_unicode : bool
            Apply Unicode normalization
        tokenizer_name : str
            Name of Hugging Face tokenizer to use
        """
        self.lowercase = lowercase
        self.remove_urls = remove_urls
        self.remove_mentions = remove_mentions
        self.remove_hashtags = remove_hashtags
        self.remove_emojis = remove_emojis
        self.normalize_unicode = normalize_unicode
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        
    def clean(self, text):
        """
        Apply cleaning rules to text.
        """
        if self.normalize_unicode:
            text = unicodedata.normalize('NFC', text)
        
        if self.lowercase:
            text = text.lower()
        
        if self.remove_urls:
            text = re.sub(r'https?://\S+|www\.\S+', '', text)
        
        if self.remove_mentions:
            text = re.sub(r'@\w+', '', text)
        
        if self.remove_hashtags:
            text = re.sub(r'#\w+', '', text)
        
        if self.remove_emojis:
            emoji_pattern = re.compile(
                "["
                "\U0001F600-\U0001F64F"
                "\U0001F300-\U0001F5FF"
                "\U0001F680-\U0001F6FF"
                "\U0001F1E0-\U0001F1FF"
                "]+", flags=re.UNICODE
            )
            text = emoji_pattern.sub(r'', text)
        
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text)
        
        return text.strip()
    
    def tokenize(self, text, return_tensors=None):
        """
        Tokenize text using configured tokenizer.
        """
        return self.tokenizer(
            text,
            padding=True,
            truncation=True,
            return_tensors=return_tensors
        )
    
    def preprocess(self, text, return_tokens=False):
        """
        Complete preprocessing pipeline: clean + tokenize.
        """
        cleaned_text = self.clean(text)
        
        if return_tokens:
            return cleaned_text, self.tokenizer.tokenize(cleaned_text)
        else:
            return cleaned_text
    
    def batch_preprocess(self, texts, return_tensors='pt'):
        """
        Preprocess a batch of texts.
        """
        cleaned_texts = [self.clean(text) for text in texts]
        return self.tokenizer(
            cleaned_texts,
            padding=True,
            truncation=True,
            return_tensors=return_tensors
        )

print("✓ TextPreprocessor class defined!")

In [None]:
# Test the preprocessor
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_urls=True,
    remove_mentions=True,
    remove_emojis=True,
    tokenizer_name='bert-base-uncased'
)

# Test on social media text
test_text = """
@john Check out this NLP tutorial! https://example.com #NLP #AI 🚀
It's really amazing and helpful!!!
"""

cleaned, tokens = preprocessor.preprocess(test_text, return_tokens=True)

print("Original text:")
print(test_text)
print("\nCleaned text:")
print(cleaned)
print("\nTokens:")
print(tokens)

In [None]:
# Batch processing example
batch_texts = [
    "I love natural language processing!",
    "@user This is an amazing tutorial https://example.com",
    "#NLP #MachineLearning #DeepLearning 🔥",
]

# Process batch (the default return_tensors='pt' requires PyTorch to be installed)
batch_output = preprocessor.batch_preprocess(batch_texts)

print("Batch processing results:")
print(f"Input IDs shape: {batch_output['input_ids'].shape}")
print(f"Attention mask shape: {batch_output['attention_mask'].shape}")
print("\nFirst text tokens:")
print(preprocessor.tokenizer.convert_ids_to_tokens(batch_output['input_ids'][0]))
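
The shapes printed above come from padding: with `padding=True`, shorter sequences are extended to the length of the longest one in the batch, and the attention mask records which positions hold real tokens. Here is a rough pure-Python sketch of that idea, assuming right-side padding and a pad ID of 0; `pad_batch` is a hypothetical helper and the token IDs are made up for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad ID sequences to a common length and build attention masks.

    Mask value 1 marks a real token, 0 marks padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up ID sequences of different lengths
toy_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])

for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor.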

**Exercise 4**: Custom preprocessing pipeline

Extend the `TextPreprocessor` class to:
1. Add a method to handle contractions expansion
2. Add statistics tracking (number of URLs removed, mentions removed, etc.)
3. Add a method to save/load configuration from JSON

Test your extended preprocessor on a sample dataset.

In [None]:
# YOUR CODE HERE
class ExtendedTextPreprocessor(TextPreprocessor):
    """
    Extended preprocessor with additional features.
    """
    pass

## Summary

### Key Concepts Covered:

1. **Advanced Text Cleaning**:
   - Regular expressions for pattern matching
   - Handling social media elements (URLs, mentions, hashtags, emojis)
   - Unicode normalization for consistency

2. **Tokenization Methods**:
   - Word-level: Simple but limited
   - Subword tokenization: BPE and WordPiece
   - Pre-trained tokenizers from BERT, GPT-2
   - Trade-offs: vocabulary size vs OOV handling

3. **Production Pipeline**:
   - Configurable preprocessing
   - Batch processing support
   - Integration with Hugging Face tokenizers
   - Reusable and maintainable design

### Important Takeaways:

- **Always clean before tokenizing**: Garbage in, garbage out
- **Use subword tokenization**: Better OOV handling, smaller vocabulary
- **Match tokenizer to model**: BERT uses WordPiece, GPT uses BPE
- **Batch processing**: Much faster than processing one at a time
- **Make it configurable**: Different tasks need different preprocessing

### What's Next?

In **Module 02: Word Embeddings**, we'll learn:
- How to convert tokens into dense vector representations
- Word2Vec, GloVe, and FastText algorithms
- Semantic relationships in vector space
- Visualizing embeddings with t-SNE
- Limitations that led to contextual embeddings

### Additional Resources:

- **Hugging Face Tokenizers**: [huggingface.co/docs/tokenizers](https://huggingface.co/docs/tokenizers)
- **BPE Paper**: [Neural Machine Translation of Rare Words](https://arxiv.org/abs/1508.07909)
- **Regular Expressions**: [regex101.com](https://regex101.com/) (interactive tester)
- **Unicode Normalization**: [unicode.org/reports/tr15](https://www.unicode.org/reports/tr15/)