[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/rudyhendrawn/data-course/blob/main/data-prep/text-data/01_text_data_preprocessing.ipynb)


# 1. Overview & Learning Outcomes

**Learning objectives**
- Understand the role of text preprocessing in a classical NLP pipeline.
- Apply practical steps: canonicalization, cleaning, normalization, tokenization, stopword management, lemmatization/stemming, and simple negation/emoji handling.
- Build a small, reusable preprocessing pipeline and measure its impact on feature spaces.

**What is text preprocessing?**  

Text preprocessing is a crucial step in natural language processing (NLP) that converts raw, unstructured text into a clean, consistent format suitable for machine learning models. It addresses common issues such as inconsistent encodings, noise from HTML tags or URLs, variations in case and punctuation, and linguistic complexities such as inflections or slang. Standardizing text shrinks the vocabulary (and with it the feature space), which often improves model accuracy and interpretability. However, it must be tailored to the task: over-aggressive cleaning can remove important signals (e.g., negators in sentiment analysis). Typical pipeline:
> Ingestion → Canonicalization → Structural Cleaning → Normalization → Tokenization → Stopwords → Lemma/Stemming → (Optional) Negation/Emoji/Slang/Spelling → Vectorization → Modeling.
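
The stages above can be chained as plain functions applied in order. The sketch below is a minimal illustration using only the standard library; the step functions (`canonicalize`, `strip_urls`, `lowercase`, `collapse_whitespace`) and the `make_pipeline` helper are hypothetical stand-ins for the fuller versions built later in this notebook.

```python
import re
import unicodedata
from functools import reduce

def canonicalize(text: str) -> str:
    # Fold Unicode compatibility variants into one form
    return unicodedata.normalize("NFKC", text)

def strip_urls(text: str) -> str:
    # Replace http/https URLs with a space
    return re.sub(r"https?://\S+", " ", text)

def lowercase(text: str) -> str:
    return text.lower()

def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def make_pipeline(*steps):
    """Return a function that applies each step to the output of the previous one."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

preprocess = make_pipeline(canonicalize, strip_urls, lowercase, collapse_whitespace)
cleaned = preprocess("Visit  https://example.com/docs for API examples")
tokens = cleaned.split()
print(cleaned)
print(tokens)
```

Because each step takes and returns a string, steps can be reordered, dropped, or swapped per task without touching the others.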



# 2. Setup & Imports

**Learning objectives**
- Ensure required libraries are installed and imported.
- Load `en_core_web_sm` with a safe, guarded download.

> Notes:  
> - All examples are tiny and deterministic.  
> - Charts use matplotlib (single plot per figure, default colors).


In [None]:
# # Install (guarded). Re-run the cell if installation occurs.
# import sys, subprocess, importlib

# def pip_install(package, module=None):
#     """Install `package` only if `module` (default: same name) cannot be imported."""
#     try:
#         importlib.import_module(module or package)
#     except ImportError:
#         print(f"Installing {package}...")
#         subprocess.check_call([sys.executable, "-m", "pip", "install", "--quiet", package])

# # Required (pip name -> import name; they differ for scikit-learn)
# for pkg, mod in {"spacy": "spacy", "scikit-learn": "sklearn", "regex": "regex",
#                  "matplotlib": "matplotlib", "pandas": "pandas"}.items():
#     pip_install(pkg, mod)

# # Optional (beautifulsoup4 is imported as bs4)
# for pkg, mod in {"ftfy": "ftfy", "emoji": "emoji", "beautifulsoup4": "bs4", "nltk": "nltk"}.items():
#     try:
#         pip_install(pkg, mod)
#     except Exception as e:
#         print(f"Optional package {pkg} not installed: {e}")


In [None]:
import random
import numpy as np
import pandas as pd
import regex as re
import unicodedata
import matplotlib.pyplot as plt

import spacy
from spacy.cli import download as spacy_download

# Seed everything for determinism where possible
random.seed(42)
np.random.seed(42)

# Load en_core_web_sm with guarded download
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:  # spaCy raises OSError when the model package is not installed
    print("Downloading spaCy model en_core_web_sm...")
    spacy_download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

print("Setup complete.")

# 3. A Tiny Teaching Corpus & Ground Truth Labels

**Learning objectives**
- Build a miniature corpus with diverse noise patterns for teaching.
- Provide optional labels for later comparison.

This section introduces a small, curated dataset designed to illustrate common text preprocessing challenges in natural language processing (NLP). The corpus consists of 12 sample documents: three each for the thematic categories technology (tech), travel, and food, plus three mixed, noisy extras labeled "misc". Using three examples per group keeps the corpus balanced and varied.

To simulate real-world text data, we intentionally inject diverse noise patterns that preprocessing steps will address:
- **Structural noise**: HTML tags (e.g., `<b>`, `<i>`), URLs (e.g., `https://example.com`), emails (e.g., `travel.agent@example.org`), mentions (e.g., `@trusted_agent`), and hashtags (e.g., `#vacation`, `#noodles`).
- **Linguistic noise**: Emojis (e.g., `😋`), emoticons (e.g., `:)`), elongations (e.g., "Soooo relaxing"), slang, and contractions (e.g., "I'm", "I'd").
- **Formatting noise**: Smart quotes (e.g., “ ”), extra whitespace, and inconsistent punctuation.

The corpus is kept tiny (12 documents) for educational purposes, allowing quick experimentation and visualization without computational overhead. Each document is paired with a ground truth label (e.g., "tech", "travel", "food", or "misc") to facilitate later evaluations, such as measuring preprocessing impact on classification tasks or feature space dimensionality.

By working with this corpus, you'll see how preprocessing transforms raw, noisy text into cleaner, more uniform representations with less surface variability, which typically helps downstream models. The examples are deterministic and reproducible, which makes the corpus well suited to learning and debugging.

In [None]:
corpus = [
    # tech
    "AI systems learn from data to improve performance over time.",
    "The new GPU accelerates deep-learning workloads; C++ and Python interop is common.",
    "Visit <b>our docs</b> at https://example.com/docs for API examples & usage.",
    # travel
    "I loved the beaches in Bali!!! Soooo relaxing :) #vacation",
    "Book flights via email: travel.agent@example.org \u2014 or DM @trusted_agent",  # \u2014 is an em dash
    "Paris is great in spring; museums were not crowded.",
    # food
    "This ramen was *so* good, but not cheap. I'd go again! 😋",
    "Check out our menu & deals at http://noodles.example/menu #noodles",
    "I dislike overly sweet desserts; they’re not my style.",
    # mixed/noisy extras
    "“Smart-quotes” and   extra   spaces\tshould be normalized.",
    "HTML <i>tags</i> should be stripped (or safely handled).",
    "I'm sooo happppy about this!!!"
]

labels = [
    "tech","tech","tech",
    "travel","travel","travel",
    "food","food","food",
    "misc","misc","misc"
]

import pandas as pd
df = pd.DataFrame({"text": corpus, "label": labels})
df.head()


# 4. Ingestion & Canonicalization (Unicode, encodings)

**Learning objectives**
- Normalize Unicode to reduce spurious variability (e.g., smart quotes vs straight quotes).
- Fix common mojibake and odd spacing.
- Understand the importance of canonicalization in preventing encoding-related errors that can skew NLP models.

Canonicalization is the process of converting text into a standard, consistent form to eliminate variations that arise from different encodings, character representations, or formatting quirks. In NLP, this step is essential because raw text often contains inconsistencies such as smart quotes (“ ”) instead of straight quotes (" "), accented characters in multiple forms, or mojibake (garbled text from encoding mismatches, e.g., "cafÃ©" instead of "café"). These variations can inflate vocabulary size, degrade tokenization, and hurt model performance because semantically identical strings are treated as different.

We primarily use Unicode normalization via Python's `unicodedata` module:
- **NFC (Normalization Form Canonical Composition)**: Recomposes characters into their canonical precomposed forms, so that "e" followed by a combining accent and the single code point "é" become identical. It is a safe default for most text processing.
- **NFKC (Normalization Form Compatibility Composition)**: Goes further by first replacing compatibility characters with their plain equivalents (e.g., full-width Latin letters or ligatures become standard letters) and then recomposing. It is often preferred for NLP because it folds these equivalences and shrinks the vocabulary, but it is lossy.
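
The difference between the two forms is easiest to see on a few characters. This is a small illustration using only the standard library; the sample strings are made up for the demonstration.

```python
import unicodedata

# "e" followed by a combining acute accent: two code points that render as one glyph
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9
print(len(decomposed), len(composed))

samples = {
    "ligature": "\ufb01le",       # "file" written with the fi ligature
    "fullwidth": "\uff21\uff29",  # full-width "AI"
    "superscript": "x\u00b2",     # "x" followed by superscript two
}
results = {}
for name, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    results[name] = (nfc, nfkc)
    print(f"{name:12} NFC={nfc!r} NFKC={nfkc!r}")
```

NFC leaves all three samples unchanged, while NFKC folds them to plain ASCII. The superscript case shows the cost: the exponent becomes an ordinary digit, so NFKC is a poor fit where such distinctions carry meaning.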

Additionally, we apply light repairs with regex to handle common issues like excessive whitespace, fancy dashes (em dash U+2014, en dash U+2013), and quote variants. If the `ftfy` (Fix Text For You) library is installed, we leverage it for more robust fixes, such as correcting encoding errors, removing control characters, and normalizing line breaks. Without `ftfy`, we fall back to `unicodedata` and regex, which cover basic cases but may miss complex mojibake.
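
To make mojibake concrete, the sketch below reproduces one common mix-up (UTF-8 bytes decoded as Latin-1) and reverses it by hand. `fix_utf8_as_latin1` is a hypothetical helper written for this illustration; it handles only this single case and is not a substitute for `ftfy`.

```python
def fix_utf8_as_latin1(text: str) -> str:
    """Undo one specific mix-up: UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean): leave the text alone
        return text

original = "caf\u00e9"                                  # "café"
garbled = original.encode("utf-8").decode("latin-1")    # simulate the bad decode
repaired = fix_utf8_as_latin1(garbled)

print(repr(garbled))
print(repr(repaired))
print(repr(fix_utf8_as_latin1(original)))  # clean input passes through unchanged
```

The `try`/`except` matters: applying the round trip blindly to text that was never garbled raises an error or corrupts it, which is why a guarded repair (or a library that scores candidate fixes) is preferable.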

This step ensures that downstream preprocessing (e.g., tokenization) operates on clean, uniform text, improving reproducibility and accuracy in tasks like classification or clustering. Over-normalization can sometimes remove useful signals (e.g., in stylistic analysis), so balance is key: always tailor the step to your task.


In [None]:
try:
    import ftfy
    HAS_FTFY = True
except Exception:
    HAS_FTFY = False

import regex as re
import unicodedata

def canonicalize_text(s: str, use_ftfy: bool = True) -> str:
    """Canonicalize text with optional ftfy and Unicode normalization."""
    if use_ftfy and HAS_FTFY:
        s = ftfy.fix_text(s)
    # Normalize Unicode (NFKC often good to fold compatibilities)
    s = unicodedata.normalize("NFKC", s)
    # Replace fancy quotes/dashes with ASCII where sensible
    s = s.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'").replace("\u2014", "-").replace("\u2013", "-")
    # Collapse excessive whitespace
    s = re.sub(r"\s+", " ", s).strip()
    return s

demo = ["“Smart-quotes” and   extra   spaces\tshould be normalized.",
        "I'm sooo happppy about this!!!"]
[canonicalize_text(x) for x in demo]


# 5. Structural Cleaning (HTML, emails, URLs, mentions, hashtags)

**Learning objectives**
- Remove or mask structural artifacts that typically do not carry semantic content for many tasks.
- Provide configurable behavior (remove vs mask).
- Understand the trade-offs between removal and masking to preserve or eliminate specific information based on the NLP task.

Structural cleaning targets non-linguistic elements in text that can introduce noise or irrelevant variability, such as HTML tags, URLs, email addresses, social media mentions, and hashtags. These artifacts often stem from web scraping, social media data, or formatted documents and may not contribute to the core semantic meaning in tasks like sentiment analysis or topic modeling. For instance, stripping HTML tags prevents parsing errors and reduces dimensionality, while handling URLs and emails avoids treating them as regular words.

The approach can be configured: **removal** deletes these elements entirely, which is aggressive and suitable for general-purpose cleaning where they add no value. **Masking** replaces them with placeholders (e.g., "__URL__", "__EMAIL__"), preserving their presence for tasks where their existence matters (e.g., detecting spam or link-heavy content). Masking is the more flexible option: it retains positional information without inflating the vocabulary.

In practice, we use libraries like BeautifulSoup for robust HTML stripping and regex for pattern-based removal/masking of URLs, emails, mentions, and hashtags. This step is typically applied after canonicalization to ensure consistent input. Over-removal can lose context (e.g., in social media analysis where hashtags indicate topics), so always align with your task's requirements.

> Caution: In some tasks (e.g., link classification), URLs or emails may be informative; prefer masking to removal.
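
The vocabulary effect of the two modes can be checked on a toy example. This sketch uses the standard-library `re` module and a simplified URL pattern; the two documents and the `vocabulary` helper are invented for illustration.

```python
import re

URL = re.compile(r"https?://\S+")

docs = [
    "See https://example.com/a now",
    "See https://example.com/b now",
]

def vocabulary(texts):
    """Set of whitespace-separated tokens across all texts."""
    return {tok for text in texts for tok in text.split()}

raw_vocab = vocabulary(docs)                                       # each URL is its own token
masked_vocab = vocabulary(URL.sub(" __URL__ ", d) for d in docs)   # all URLs share one token
removed_vocab = vocabulary(URL.sub(" ", d) for d in docs)          # URLs leave no trace

print(len(raw_vocab), len(masked_vocab), len(removed_vocab))
```

Masking collapses every distinct URL into one feature that still records "a link was here", whereas removal discards that signal altogether.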


In [None]:
try:
    from bs4 import BeautifulSoup
    HAS_BS4 = True
except Exception:
    HAS_BS4 = False

URL_RE = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def strip_html(text: str) -> str:
    if HAS_BS4:
        return BeautifulSoup(text, "html.parser").get_text(" ")
    # Regex fallback (simplistic)
    return re.sub(r"<[^>]+>", " ", text)

def structural_clean(text: str, mask=True) -> str:
    t = strip_html(text)
    if mask:
        t = URL_RE.sub(" __URL__ ", t)
        t = EMAIL_RE.sub(" __EMAIL__ ", t)
        t = MENTION_RE.sub(" __MENTION__ ", t)
        t = HASHTAG_RE.sub(" __HASHTAG__ ", t)
    else:
        t = URL_RE.sub(" ", t)
        t = EMAIL_RE.sub(" ", t)
        t = MENTION_RE.sub(" ", t)
        t = HASHTAG_RE.sub(" ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return t

samples = [
    "Visit <b>our docs</b> at https://example.com/docs for API examples & usage.",
    "Book flights via email: travel.agent@example.org \u2014 or DM @trusted_agent",  # \u2014 is an em dash
    "Check out our menu & deals at http://noodles.example/menu #noodles"
]
[structural_clean(s, mask=True) for s in samples]


# 6. Normalization (case, accents, punctuation, digits, whitespace)

**Learning objectives**
- Apply case-folding, accent stripping, and configurable punctuation/digit handling.
- Visualize token frequency before/after normalization.

Normalization is a key step in text preprocessing that standardizes text to reduce variability and improve consistency for downstream NLP tasks. It addresses inconsistencies in case, accents, punctuation, digits, and whitespace that can inflate vocabulary sizes or cause models to treat similar terms as distinct. For example, "Apple" and "apple" might be semantically identical in many contexts, but case sensitivity could split them into separate tokens.

- **Case-folding**: Converting all text to lowercase (or uppercase) to eliminate case-based variations. This is crucial for tasks like search or classification where case doesn't carry meaning, but should be avoided in cases like named entity recognition where capitalization indicates proper nouns.
- **Accent stripping**: Removing diacritical marks (e.g., é → e) using Unicode normalization (NFD to decompose, then filter out combining marks). This reduces dimensionality by treating accented and non-accented forms as equivalent, but may not be suitable for languages where accents change meaning.
- **Punctuation and digit handling**: Configurably removing or preserving punctuation and digits. Punctuation often adds noise in bag-of-words models but is vital for tasks like sentiment analysis (e.g., "not good" vs "not good!"). Digits are typically stripped unless they represent meaningful quantities (e.g., in financial text).
- **Whitespace normalization**: Collapsing multiple spaces, tabs, or newlines into single spaces and trimming edges to prevent tokenization artifacts.
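
Two details from the list above are worth seeing in isolation: Python's `str.casefold()` is a more aggressive fold than `str.lower()`, and accent stripping can merge words that differ in meaning. The sketch uses only the standard library; `strip_accents` is an illustrative helper, and the French words are examples chosen for the demonstration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose (NFD), drop combining marks (category Mn), recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

# lower() keeps the German sharp s; casefold() expands it to "ss"
word = "Stra\u00dfe"  # "Straße"
print(word.lower(), word.casefold(), "STRASSE".casefold())

# Three distinct French words collapse to the same string once accents are removed
french = ["c\u00f4te", "c\u00f4t\u00e9", "cote"]  # coast, side, rating
stripped = [strip_accents(w) for w in french]
print(stripped)
```

For matching across spellings, `casefold()` is usually the better fold; for accent stripping, the collision above is the reason to keep it configurable.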

These transformations are applied after canonicalization and structural cleaning to ensure clean input. The impact can be visualized by comparing token frequencies before and after normalization, which often shows a smaller vocabulary and therefore denser (less sparse) vectorized representations. Over-normalization can remove useful signals (e.g., in stylistic analysis), so configurations should align with the task: for example, keep punctuation for sentiment analysis and strip it for topic modeling. Always test on a subset to measure effects on model performance.


In [None]:
from collections import Counter
import matplotlib.pyplot as plt

def normalize_text(text: str, lower=True, strip_accents=True, keep_punct=False, keep_digits=False) -> str:
    t = text
    if lower:
        t = t.lower()
    if strip_accents:
        # Decompose into base + diacritics, drop combining marks
        t = unicodedata.normalize("NFD", t)
        t = "".join(ch for ch in t if unicodedata.category(ch) != "Mn")
        t = unicodedata.normalize("NFC", t)
    if not keep_digits:
        t = re.sub(r"\d+", " ", t)
    if not keep_punct:
        # Remove punctuation and stray underscores, but keep whole placeholders such as
        # __URL__ (which is __url__ by this point when lower=True)
        t = re.sub(r"(__[A-Za-z]+__)|[^\w\s]|_", lambda m: m.group(1) or " ", t)
    # Collapse whitespace
    t = re.sub(r"\s+", " ", t).strip()
    return t

raw_tokens = [w.text for w in nlp(df.text.iloc[0])]
norm_example = normalize_text(structural_clean(corpus[2]))
print("Raw tokens (doc0):", raw_tokens[:12], "...")
print("Normalized example:", norm_example)

# Simple frequency before/after on the whole corpus
before_tokens = [w.text for doc in nlp.pipe(df.text.tolist()) for w in doc if not w.is_space]
after_tokens = []
for txt in df.text:
    t = canonicalize_text(txt)
    t = structural_clean(t, mask=True)
    t = normalize_text(t, lower=True, strip_accents=True, keep_punct=False, keep_digits=False)
    after_tokens.extend([w.text for w in nlp(t) if not w.is_space])

def plot_top(freqs, title):
    items = freqs.most_common(15)
    labels = [k for k, _ in items]
    values = [v for _, v in items]
    plt.figure()
    plt.bar(range(len(labels)), values)
    plt.xticks(range(len(labels)), labels, rotation=45, ha="right")
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_top(Counter(before_tokens), "Top tokens BEFORE normalization")
plot_top(Counter(after_tokens), "Top tokens AFTER normalization")


# 7. Tokenization: Sentence-, Word-, and Rule-based

**Learning objectives**
- Understand the fundamentals of tokenization in NLP, including its role in breaking down text into manageable units.
- Compare sentence-level tokenization (segmenting text into sentences) with word-level tokenization (splitting sentences into words or subwords).
- Implement and apply custom tokenizer rules to handle special cases, such as preserving compound terms like `C++` or hyphenated words like `e-mail` as single tokens.
- Recognize the trade-offs between rule-based tokenization and more advanced methods (e.g., subword tokenization in transformers).

Tokenization is a foundational step in text preprocessing that involves dividing raw text into smaller, meaningful units called tokens. These tokens can represent words, subwords, punctuation, or even entire sentences, depending on the granularity required for the task. Effective tokenization ensures that downstream processes like vectorization, modeling, and analysis operate on consistent, interpretable elements, reducing noise and improving model performance. Poor tokenization can lead to inflated vocabularies, loss of context, or misinterpretation of phrases (e.g., treating "New York" as two separate tokens when it should be one entity).

### Sentence Tokenization
Sentence tokenization, also known as sentence segmentation, splits text into individual sentences. This is crucial for tasks that require understanding document structure, such as summarization, question-answering, or sentiment analysis at the sentence level. Libraries like spaCy use rule-based approaches combined with machine learning models to detect sentence boundaries based on punctuation (e.g., periods, exclamation marks) and linguistic cues (e.g., capitalization after punctuation). For example:
- Input: "I love Paris. It's beautiful!"
- Output: ["I love Paris.", "It's beautiful!"]

Challenges include handling abbreviations (e.g., "Dr." not ending a sentence) or informal text with ellipses. Over-segmentation can fragment related ideas, while under-segmentation might merge unrelated sentences.
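
The abbreviation problem is easy to reproduce with a purely punctuation-based splitter. The function below is a deliberately naive sketch using the standard-library `re` module, not how spaCy segments sentences.

```python
import re

def naive_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

ok = naive_sentences("I love Paris. It's beautiful!")
broken = naive_sentences("Dr. Smith arrived. He was late.")

print(ok)
print(broken)
```

The first input splits as intended, but the second is over-segmented because the period in "Dr." is treated as a sentence boundary. Handling such cases needs an abbreviation list or a trained model.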

### Word Tokenization
Word tokenization breaks sentences into words, subwords, or tokens, often treating punctuation as separate elements. This is the most common form of tokenization for bag-of-words models or embeddings. SpaCy's tokenizer employs a combination of rules, prefix/suffix patterns, and exception lists to handle complexities like contractions ("don't" → ["do", "n't"]) or compound words. For instance:
- Input: "The new GPU accelerates deep-learning workloads."
- Output: ["The", "new", "GPU", "accelerates", "deep", "-", "learning", "workloads", "."]

Advanced variants include subword tokenization (e.g., WordPiece in BERT or Byte-Pair Encoding in GPT-2), which splits rare words into smaller units to manage out-of-vocabulary issues. Rule-based tokenization is fast and deterministic but may struggle with domain-specific jargon or multilingual text.
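
Subword splitting can be sketched with greedy longest-match segmentation, the scheme used by WordPiece-style tokenizers. The vocabulary below is a tiny hypothetical one made up for this example (real vocabularies are learned from data and hold tens of thousands of pieces), and `wordpiece_like` is an illustrative function, not a library API.

```python
def wordpiece_like(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedily take the longest vocabulary piece at each position.

    Pieces that do not start the word carry a '##' prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##ize", "un", "##known"}
for w in ["tokenization", "unknown", "gpu"]:
    print(w, "->", wordpiece_like(w, toy_vocab))
```

A rare word such as "tokenization" is covered by two known pieces instead of becoming a single out-of-vocabulary token.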

### Custom Rules and Special Cases
To enhance tokenization, custom rules can be added to spaCy's tokenizer to preserve specific patterns as single tokens. This prevents over-splitting of meaningful units, such as programming languages ("C++"), hyphenated terms ("e-mail"), or domain terms. For example, registering a special case guarantees that "C++" stays one token regardless of how the general punctuation rules would treat it. This is done via `nlp.tokenizer.add_special_case()`, specifying the exact orthography and desired token structure. Note that special cases match the exact string, including case. Custom rules improve accuracy for technical or specialized corpora but require careful tuning to avoid conflicts with general rules.
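
The same idea can be shown without spaCy: in a regex tokenizer, protected terms are simply tried before the generic word and punctuation patterns. This is a standard-library sketch; `regex_tokenize` and its `protected` parameter are hypothetical names for this illustration.

```python
import re

def regex_tokenize(text: str, protected=()) -> list:
    """Tokenize into words and punctuation, keeping protected terms whole."""
    # Longest protected terms first so a shorter one cannot pre-empt a longer one
    alternatives = [re.escape(term) for term in sorted(protected, key=len, reverse=True)]
    alternatives += [r"\w+", r"[^\w\s]"]
    return re.findall("|".join(alternatives), text)

text = "C++ interop via e-mail is OK!"
default_tokens = regex_tokenize(text)
custom_tokens = regex_tokenize(text, protected=("C++", "e-mail"))

print(default_tokens)
print(custom_tokens)
```

Ordering is the design choice that matters here: alternation tries patterns left to right, so the exceptions must come before the general rules.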

In summary, tokenization bridges raw text and structured data, with sentence tokenization providing high-level structure and word tokenization enabling fine-grained analysis. Custom rules add flexibility for edge cases, but always validate on your dataset to ensure tokens align with semantic intent. Over-reliance on defaults can miss nuances, so iterate and test for your NLP task.


In [None]:
from spacy.symbols import ORTH

# Custom rules: keep "C++" and "e-mail" intact (matching is exact, including case)
SPECIAL_TOKENS = ("C++", "e-mail")
for text in SPECIAL_TOKENS:
    nlp.tokenizer.add_special_case(text, [{ORTH: text}])

def spacy_tokenize(doc_text, lemma=False, keep_alpha=True, preserve_case=False):
    """Tokenize with spaCy.

    keep_alpha=True keeps only alphabetic tokens (plus the special cases above);
    set it to False to also keep punctuation, digits, and symbols.
    """
    doc = nlp(doc_text)
    tokens = []
    for t in doc:
        if keep_alpha and not t.text.isalpha() and t.text not in SPECIAL_TOKENS:
            continue
        tok = t.lemma_ if lemma else t.text
        tok = tok if preserve_case else tok.lower()
        tokens.append(tok)
    return tokens

text_ex = "The new GPU accelerates deep-learning. C++ interop via e-mail is OK!"
print("Sentence segmentation:")
for sent in nlp(text_ex).sents:
    print("-", sent.text)

print("\nWord tokens:", [t.text for t in nlp(text_ex)])
print("Custom tokenizer + lemma:", spacy_tokenize(text_ex, lemma=True))


# 8. Stopwords: Defaults, Custom Lists, Domain Terms

**Learning objectives**
- Understand when to remove vs keep stopwords.
- Use spaCy defaults and extend with domain-specific lists. Keep negators if desired.

Stopwords are common words that appear frequently in text but often carry little semantic value, such as articles ("the", "a"), prepositions ("in", "on"), and auxiliary verbs ("is", "be"). In NLP preprocessing, removing stopwords helps reduce the dimensionality of feature spaces, improves computational efficiency, and focuses models on more informative terms. For example, in topic modeling or search indexing, stopwords can be safely discarded to avoid noise.

However, the decision to remove stopwords depends on the task:
- **Remove for general tasks**: In bag-of-words models or clustering, where high-frequency common words dilute signals, removal is beneficial.
- **Keep for sentiment or context-sensitive tasks**: Stopwords like negators ("not", "no", "never") can flip meanings (e.g., "not good" vs. "good"). Retaining them preserves sentiment polarity or negation scope.
- **Domain considerations**: Default lists may not cover specialized jargon. For instance, in technical texts, words like "subject" or "http" might be noise and should be added to custom stopword lists.

SpaCy provides a built-in English stopword list (`spacy.lang.en.stop_words.STOP_WORDS`), which includes around 300 common words. You can extend this with domain-specific terms (e.g., `{"subject", "re", "api"}` for email corpora) or exclude certain words like negators to customize behavior. Always test on your dataset to ensure removal doesn't strip essential context: over-removal can harm performance in nuanced tasks like question-answering or sentiment analysis.


In [None]:
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP

DOMAIN_STOPS = {"subject", "re", "http", "https", "api"}
NEGATORS = {"no", "not", "never"}

def remove_stopwords(tokens, keep_negators=True):
    result = []
    for tok in tokens:
        if keep_negators and tok in NEGATORS:
            result.append(tok)
            continue
        if tok in EN_STOP or tok in DOMAIN_STOPS:
            continue
        result.append(tok)
    return result

toks = spacy_tokenize("Paris is not cheap, but it is beautiful!", lemma=True)
print("Before:", toks)
print("After stopword removal (keep negators):", remove_stopwords(toks, keep_negators=True))

# 9. Lemmatization vs. Stemming: Trade-offs & Demos

**Learning objectives**
- Compare linguistic lemmatization with algorithmic stemming.
- Understand when each is appropriate and their respective trade-offs in NLP preprocessing.

Lemmatization and stemming are both techniques used to reduce words to their base forms, helping to normalize text by grouping together inflected or derived forms of a word. This reduces vocabulary size, improves model efficiency, and enhances the ability to match related terms in tasks like search, classification, or clustering. However, they differ significantly in approach, accuracy, and computational cost.

### Lemmatization
Lemmatization is a linguistic approach that reduces words to their canonical or dictionary form (the lemma), taking into account the word's context, part of speech (POS), and morphological analysis. It uses language-specific rules and often relies on pre-trained models or lexicons to ensure the output is a valid word.

- **How it works**: For example, "running" (verb) becomes "run", "better" (comparative adjective) becomes "good", and "studies" (noun) becomes "study". The result depends on POS (e.g., "running" as a verb in "she is running" vs. "running" as an adjective in "running water"), so lemma quality depends on the accuracy of the tagger.
- **Pros**: More accurate and linguistically sound, producing real words that preserve semantic meaning. Ideal for tasks requiring high precision, such as information retrieval or sentiment analysis.
- **Cons**: Computationally intensive, as it requires POS tagging and access to linguistic resources. Slower on large datasets.
- **When to use**: In applications where accuracy outweighs speed, like academic research, legal text analysis, or when working with morphologically rich languages.

### Stemming
Stemming is an algorithmic approach that reduces words to a root form by heuristically stripping affixes (mostly suffixes), without considering context or POS. It is rule-based and needs no dictionary or tagger, using algorithms like Porter, Snowball, or Lancaster. The rules themselves are written per language (Snowball, for example, ships separate stemmers for many languages).

- **How it works**: For example, "running" becomes "run" and "studies" becomes "studi", while "better" is left as "better" by Porter because no suffix rule links it to "good". It may produce non-words (stems) that aren't valid dictionary entries.
- **Pros**: Fast and lightweight, requiring minimal resources. Effective for reducing dimensionality in large corpora.
- **Cons**: Less accurate, as it can over-stem (e.g., "university" and "universe" both stem to "univers") or under-stem, leading to inconsistencies. Ignores linguistic nuances.
- **When to use**: In high-volume, real-time applications like web search engines or preliminary data exploration, where speed is prioritized over precision.

### Trade-offs and Comparison
- **Accuracy vs. Speed**: Lemmatization provides better semantic accuracy but is slower; stemming is quicker but cruder.
- **Output Quality**: Lemmatization yields valid words; stemming often results in stems that may not be interpretable.
- **Resource Requirements**: Lemmatization needs POS taggers and models (e.g., spaCy); stemming relies on simple rules.
- **Task Suitability**: Use lemmatization for nuanced tasks (e.g., topic modeling with semantic coherence); use stemming for broad matching (e.g., keyword extraction in big data).
- **Example Comparison**:
    - Words: "better", "running", "studies"
    - Lemmatization (spaCy, given suitable POS tags): "good", "run", "study"
    - Stemming (Porter or Snowball via NLTK): "better", "run", "studi"

In practice, the choice depends on your dataset size, language, and task requirements. For English, spaCy offers robust lemmatization, while NLTK provides stemming options. Always evaluate on a sample to measure impact on vocabulary reduction and model performance.
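
The contrast between the two approaches can be reduced to a toy example: a stemmer applies suffix rules blindly, while a lemmatizer looks up a word together with its part of speech. Both `toy_stem` and `toy_lemma` below are deliberately simplistic, hypothetical helpers; they are not the Porter algorithm or spaCy's lemmatizer.

```python
# Suffix rules tried in order: (suffix to strip, replacement)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary, no context."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

# A lemmatizer needs (word, POS) to pick the right dictionary form
TOY_LEMMAS = {
    ("running", "VERB"): "run",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def toy_lemma(word: str, pos: str) -> str:
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("running", "VERB"), ("studies", "NOUN"), ("better", "ADJ"), ("better", "ADV")]:
    print(f"{word:8} {pos:4} stem={toy_stem(word):8} lemma={toy_lemma(word, pos)}")
```

The toy stemmer yields the non-word "runn" for "running"; Porter-family stemmers add a rule that undoubles the final consonant to reach "run". No suffix rule, however, can map "better" to "good", which requires lexical knowledge.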

> This demo uses spaCy lemmatization. If NLTK is installed, a stemming example is shown; otherwise, we only print a note.


In [None]:
# Example sentences for demonstration
examples = [
    "The running ponies are better than studies on horses.",
    "I am not running to the store for better food."
]

# Stemming is optional: use NLTK's Snowball stemmer when it is available
try:
    from nltk.stem import SnowballStemmer
    stemmer = SnowballStemmer("english")
except ImportError:
    stemmer = None

for i, sentence in enumerate(examples, 1):
    print(f"\nExample {i}: '{sentence}'")

    # Tokenize using spaCy
    tokens = spacy_tokenize(sentence, lemma=False, keep_alpha=True, preserve_case=False)
    print("Tokens:", tokens)

    # Remove stopwords
    no_stops = remove_stopwords(tokens, keep_negators=True)
    print("After stopword removal:", no_stops)

    # Lemmatize the remaining tokens (re-parsed without their original context)
    lemmas = [w.lemma_ for w in nlp(" ".join(no_stops))]
    print("Lemmas:", lemmas)

    # Stemming (if NLTK available)
    if stemmer is not None:
        stems = [stemmer.stem(w) for w in no_stops]
        print("Stems: ", stems)
    else:
        print("NLTK not installed for stemming demo (optional).")

# 10. Handling Negation & Contractions

**Learning objectives**
- Expand common English contractions to standardize text and improve tokenization accuracy.
- Implement a simple negation marking scheme to preserve sentiment polarity in downstream tasks like sentiment analysis.

Handling negation and contractions is a nuanced step in text preprocessing that addresses linguistic phenomena where word forms can alter meaning or introduce ambiguity. These elements are particularly critical in tasks like sentiment analysis, where subtle changes (e.g., "not good" vs. "good") can flip polarity, or in information retrieval, where expanded forms ensure consistent matching.

### Expanding Contractions
Contractions are shortened forms of words created by omitting letters and replacing them with an apostrophe (e.g., "don't" for "do not"). They are common in informal text like social media or conversational data but can complicate tokenization and normalization. Expanding them converts contractions back to their full forms, reducing variability and ensuring that models treat "don't" and "do not" equivalently.

- **Why expand?** Unexpanded contractions may be split incorrectly during tokenization (e.g., "don't" as "don" and "'t"), leading to inflated vocabularies or missed semantic connections. Expansion standardizes text, making it easier for lemmatization or vectorization.
- **Implementation**: Use a dictionary of common contractions (e.g., "don't" → "do not") and apply regex-based replacement to match whole words case-insensitively. This handles variations like "Don't" or "DON'T" without over-matching partial strings.
- **Trade-offs**: Expansion increases text length slightly but improves consistency. For languages with fewer contractions, this step may be optional. Always test on your corpus to avoid introducing artifacts (e.g., in formal texts where contractions are rare).
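
One practical pitfall is the apostrophe itself: a dictionary keyed on the straight apostrophe will not match contractions typed with a curly one (U+2019), which is why canonicalization should run first. The sketch below uses the standard-library `re` module and a two-entry dictionary invented for the demonstration; `expand` is a hypothetical helper.

```python
import re

TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TOY_CONTRACTIONS)) + r")\b",
    flags=re.IGNORECASE,
)

def expand(text: str) -> str:
    """Replace known contractions; expansions are emitted in lowercase."""
    return PATTERN.sub(lambda m: TOY_CONTRACTIONS[m.group(0).lower()], text)

curly = "I don\u2019t think it\u2019s ready."   # typographic apostrophes
straight = curly.replace("\u2019", "'")         # what canonicalization produces

print(expand(curly))     # unchanged: the dictionary keys do not match
print(expand(straight))
```

Running the expansion on uncanonicalized text fails silently, so the order of pipeline steps matters as much as the dictionary.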

### Negation Marking
Negation involves words like "not", "no", or "never" that can invert the meaning of subsequent terms (e.g., "not happy" conveys unhappiness). In bag-of-words models, this scope is often lost, causing models to misclassify sentiment. Negation marking appends a suffix (e.g., "_NEG") to tokens within the negation scope, explicitly signaling the inversion.

- **Why mark negation?** It preserves polarity for sentiment tasks, preventing models from treating "not good" as positive. Without marking, vectorized representations might group "good" and "not good" similarly, harming accuracy.
- **Simple scheme**: Identify negators and mark all following tokens until punctuation (e.g., ".", "!", "?", ",", ";", ":"). This resets the scope at sentence boundaries or clauses, approximating linguistic negation rules without complex parsing.
- **Trade-offs**: This heuristic is lightweight and effective for short texts but may over-mark in complex sentences (e.g., "I not only like it, but love it"). For advanced needs, consider dependency parsing. Retain negators themselves to maintain context, and evaluate on labeled data to ensure it boosts performance without noise.

In practice, apply contraction expansion before tokenization, and negation marking after stopword removal to focus on content words (keeping the negators and the punctuation tokens that the marking relies on). This step enhances model robustness, especially in opinion mining or review analysis, but should be tuned to your task: skip it if negation isn't relevant (e.g., in topic modeling). Always combine with other preprocessing steps for a cohesive pipeline.


In [None]:
CONTRACTIONS = {
    "don't": "do not", "doesn't": "does not", "didn't": "did not",
    "can't": "can not", "won't": "will not", "isn't": "is not",
    "aren't": "are not", "wasn't": "was not", "weren't": "were not",
    "shouldn't": "should not", "couldn't": "could not", "wouldn't": "would not",
    "it's": "it is", "i'm": "i am", "they're": "they are", "we're": "we are"
}

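import re  # repeated here so the cell also runs on its own

# Compile once instead of on every call; \b restricts matches to whole words.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Replace known contractions with their (lower-case) expansions."""
    def repl(m):
        return CONTRACTIONS.get(m.group(0).lower(), m.group(0))
    return _CONTRACTION_RE.sub(repl, text)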

def mark_negation(tokens):
    """
    Append '_NEG' to tokens that occur after a negation word until punctuation.
    Punctuation here is any token matching [. ! ? , ; :]
    """
    marked = []
    negate = False
    for t in tokens:
        if t.lower() in {"not", "no", "never"}:  # case-insensitive, so "Not" also opens a scope
            negate = True
            marked.append(t)
            continue
        if re.match(r"[\.!\?,;:]", t):
            negate = False
            marked.append(t)
            continue
        marked.append(t + "_NEG" if negate else t)
    return marked

samples = [
    "I don't like overly sweet desserts, but I do like ramen.",
    "She isn't happy with the results, but she isn't sad either.",
    "They can't go now, but they won't stay forever.",
    "I'm not sure if it's good, but it's not bad."
]

for sample in samples:
    expanded = expand_contractions(sample)
    toks = spacy_tokenize(expanded, lemma=True, keep_alpha=False)  # keep punctuation to reset negation
    print("Expanded: ", expanded)
    print("Tokens:   ", toks)
    print("Negation: ", mark_negation(toks))
    print()

# 11. Emojis, Emoticons, Elongations, and Slang

**Learning objectives**
- Map emojis to text (if `emoji` installed), handle emoticons and elongations.
- Apply a tiny slang dictionary replacement.

Emojis, emoticons, elongations, and slang are common elements in informal text, especially from social media, chats, or user-generated content. These can introduce noise or convey nuanced sentiment/emotion that models need to handle appropriately. Preprocessing these elements standardizes them for better NLP performance, reducing variability while preserving meaning where possible. This step is optional but valuable for tasks like sentiment analysis or topic modeling on casual corpora.

### Handling Emojis
Emojis are pictorial symbols (e.g., 😋 for delicious food) that add emotional or contextual cues. Left untreated, they tend to be handled as noise or special characters, which can cause tokenization issues. If the `emoji` library is available, we can "demojize" them, i.e., convert them to descriptive text (e.g., 😋 → ":face_savoring_food:"). This maps visual elements to words, allowing models to process them as tokens. Without the library, the fallback in this notebook leaves emojis as-is. Demojizing preserves sentiment (e.g., positive emojis in reviews) but adds a vocabulary entry for each distinct emoji.
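
If the `emoji` package cannot be installed, one standard-library alternative is to use Unicode character names. This is a rough sketch of a swapped-in technique, not what `emoji.demojize` does: the helper name `demojize_fallback` is hypothetical, the names come from the Unicode database (so they differ from the `emoji` package's aliases), and the category test `"So"` (Symbol, other) also catches non-emoji symbols such as ©.

```python
import unicodedata

def demojize_fallback(text: str) -> str:
    out = []
    for ch in text:
        # "So" = Symbol, other: covers most single-code-point emojis,
        # but also symbols like the copyright sign.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            out.append(f":{name.lower().replace(' ', '_')}:" if name else ch)
        else:
            out.append(ch)
    return "".join(out)

print(demojize_fallback("Yum \U0001F600"))  # U+1F600 is GRINNING FACE
```

Multi-code-point emojis (skin-tone modifiers, ZWJ sequences, flags) are not handled as a unit by this sketch, which is one reason a dedicated library is the better choice when available.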

### Handling Emoticons
Emoticons are text-based facial expressions (e.g., :) for happy, :( for sad) that mimic emotions. They are often inconsistent in format (e.g., :), :-), ;) ). We use a regex to detect them and replace each one with a placeholder such as "__EMOTICON__". The placeholder records that an emoticon was present and keeps its characters from being split apart during tokenization. Note that a single shared placeholder does not distinguish happy from sad emoticons; use separate placeholders if polarity matters for your task.

### Normalizing Elongations
Elongations involve repeated characters for emphasis (e.g., "Soooo relaxing" for "So relaxing"). These can inflate token uniqueness. We apply regex to limit repeats (e.g., to 2 max), normalizing "Soooo" to "Soo". This reduces noise while retaining emphasis, improving model generalization without stripping stylistic intent.
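
A quick, self-contained sketch of the repeat-limiting regex on a few words. The helper name `squeeze_repeats` is an assumption for illustration; the pattern is the same idea as described above.

```python
import re

ELONG_DEMO_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times

def squeeze_repeats(text: str, max_repeat: int = 2) -> str:
    # Replace each long run with max_repeat copies of the repeated character.
    return ELONG_DEMO_RE.sub(lambda m: m.group(1) * max_repeat, text)

for word in ["Soooo", "cooool", "good", "happppy", "yesss!!!"]:
    print(word, "->", squeeze_repeats(word))
```

Legitimate double letters ("good") are untouched, while "Soooo" ends up as "Soo" rather than the dictionary form "So". The result is more consistent, but it is not spelling correction.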

### Replacing Slang
Slang abbreviations (e.g., "imo" for "in my opinion") are prevalent in informal text. A small dictionary maps them to full forms, standardizing language and aiding comprehension. This is lightweight and domain-specific, so expand the dict for your corpus. It helps in tasks where precise meaning matters, like opinion mining.
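
Dictionary lookup on whitespace-split words misses slang that is attached to punctuation (e.g., "imo,"). A possible variant, sketched here with a hypothetical `replace_slang_regex` helper and a demo dictionary, matches on word boundaries instead:

```python
import re

# Demo dictionary (assumption); extend it for your own corpus.
DEMO_SLANG = {"imo": "in my opinion", "idk": "i do not know", "btw": "by the way"}

_SLANG_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, DEMO_SLANG)) + r")\b",
    flags=re.IGNORECASE,
)

def replace_slang_regex(text: str) -> str:
    # \b keeps "imo" inside "limo" from matching; punctuation is left in place.
    return _SLANG_RE.sub(lambda m: DEMO_SLANG[m.group(0).lower()], text)

print(replace_slang_regex("IMO, the ramen was great. Btw, idk about dessert!"))
```

The replacement text is lower-case regardless of the input casing, which is usually acceptable when lower-casing is part of normalization anyway.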

These transformations are applied after basic cleaning (e.g., canonicalization) and before tokenization. They enhance consistency in noisy data but should be tuned: e.g., keep emojis for sentiment, remove for factual tasks. Always test on samples to ensure they don't distort meaning. In the demo below, we chain these steps on a sample text.


In [None]:
try:
    import emoji as emoji_lib
    HAS_EMOJI = True
except Exception:
    HAS_EMOJI = False

# Emoticon pattern (non-capturing groups): ":)" ":-)" ";)" ";-)" ":(" ":-(" and ":D"
EMOTICON_RE = re.compile(r"(?:[:;]-?\))|(?::-?\()|(?::D)")

ELONG_RE = re.compile(r"(.)\1{2,}")  # 3+ repeats

SLANG = {
    "imo": "in my opinion",
    "idk": "i do not know",
    "btw": "by the way"
}

def handle_emojis(text: str) -> str:
    if HAS_EMOJI:
        return emoji_lib.demojize(text, language="en")
    return text  # fallback: leave as-is with a note

def handle_emoticons(text: str) -> str:
    return EMOTICON_RE.sub(" __EMOTICON__ ", text)

def normalize_elongations(text: str, max_repeat=2) -> str:
    return ELONG_RE.sub(lambda m: m.group(1) * max_repeat, text)

def replace_slang(text: str) -> str:
    words = text.split()
    return " ".join([SLANG.get(w.lower(), w) for w in words])

s = "I'm sooo happppy about this!!! :) imo"
s1 = handle_emojis(s)
s2 = handle_emoticons(s1)
s3 = normalize_elongations(s2, max_repeat=2)
s4 = replace_slang(s3)
print("Original: ", s)
print("Step1 emoji->text:", s1 if HAS_EMOJI else "emoji lib not available, skipping mapping")
print("Step2 emoticons:  ", s2)
print("Step3 elongation: ", s3)
print("Step4 slang:      ", s4)


# 12. Spelling & Noise Reduction (Optional)

**Learning objectives**
- Understand the challenges and risks associated with spelling correction in NLP preprocessing.
- Explore a lightweight, naive approach using a whitelist to filter out potential noise or misspellings.
- Recognize when to apply or skip spelling correction based on task and data characteristics.

Spelling correction aims to fix typos, misspellings, or informal variations in text to standardize it for better model performance. For example, converting "teh" to "the" or "happppy" to "happy" can reduce vocabulary noise and improve token consistency. However, naive approaches (e.g., simple edit-distance algorithms or rule-based fixes) carry significant risks: they may over-correct valid terms (e.g., "teh" could be a name or slang), introduce errors in domain-specific jargon, or fail on context-dependent ambiguities. In noisy, informal datasets like social media, correction can distort meaning or remove stylistic elements (e.g., elongations for emphasis). Computational cost is another factor, as full correction requires dictionaries or models that slow down preprocessing.

In production, specialized libraries like `pyspellchecker`, `autocorrect`, or transformer-based tools (e.g., via Hugging Face) are generally preferred, often combined with domain dictionaries to handle technical terms. For this educational demo, we use a very lightweight, illustrative method: a tiny whitelist of allowed words combined with a length heuristic that drops tokens of two characters or fewer unless they are whitelisted. This simulates noise reduction without any real correction, which highlights the concept's limitations. It is not recommended for real applications but serves to demonstrate pitfalls; always test on your corpus to avoid unintended data loss. If your data is clean or spelling isn't a major issue, skip this step entirely to preserve authenticity.
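
For contrast with the whitelist filter, the sketch below performs an actual (still naive) correction using string similarity from the standard library's `difflib`. The vocabulary `DEMO_VOCAB`, the helper `closest_word`, and the cutoff of 0.8 are assumptions chosen for illustration, not a recommended configuration.

```python
import difflib

# Tiny demo vocabulary (assumption); a real one would come from a dictionary or corpus.
DEMO_VOCAB = ["about", "cheap", "good", "happy", "not", "ramen", "so", "sweet", "this"]

def closest_word(token: str, vocab=DEMO_VOCAB, cutoff: float = 0.8) -> str:
    if token in vocab:
        return token
    # get_close_matches returns the best candidates whose similarity is >= cutoff.
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

for tok in ["hapy", "ths", "ramen", "chap"]:
    print(tok, "->", closest_word(tok))
```

The last case shows the over-correction risk described above: "chap" is a valid word, but because it is missing from the vocabulary it is rewritten to "cheap".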


In [None]:
WHITELIST = {"so", "happy", "about", "this", "ramen", "good", "cheap", "not", "sweet"}

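def naive_correct(tokens):
    # Extremely naive: keep a token only if it is longer than 2 characters or whitelisted.
    # Nothing is actually corrected; misspellings such as "hapy" pass through unchanged.
    return [t for t in tokens if len(t) > 2 or t in WHITELIST]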

examples = [
    "I'm soo hapy about ths ramen!",
    "Ths is not good, but cheap.",
    "Soo sweet and happy about it!",
    "Ramen is good, not cheap tho."
]

for ex in examples:
    tokens = spacy_tokenize(ex, lemma=False, keep_alpha=True)
    print(f"Original: {ex}")
    print("Tokens:", tokens)
    print("Naive correction (demo only):", naive_correct(tokens))
    print()


# 13. Frequency-based Filtering & N-gram Construction

**Learning objectives**
- Understand how frequency-based parameters (`min_df`, `max_df`) filter terms to reduce noise and dimensionality in text vectorization.
- Explore n-gram construction (`ngram_range`) to capture word sequences beyond single tokens, improving context for tasks like classification or topic modeling.
- Observe the effects of these parameters on vocabulary size, feature sparsity, and model performance using scikit-learn vectorizers like `CountVectorizer` and `TfidfVectorizer`.

Frequency-based filtering and n-gram construction are advanced techniques in text preprocessing that refine the feature space created by vectorizers. They help balance between capturing meaningful signals and avoiding overfitting or computational inefficiency, especially in large or noisy corpora.

### N-gram Construction
N-grams are contiguous sequences of n items (typically words or tokens) from the text. They extend beyond single words (unigrams) to include phrases, providing richer context:
- **Unigrams (1-grams)**: Individual tokens, e.g., ["the", "cat", "sat"].
- **Bigrams (2-grams)**: Pairs of consecutive tokens, e.g., ["the cat", "cat sat"].
- **Trigrams (3-grams)**: Triples, e.g., ["the cat sat"].
- Higher-order n-grams capture more context, but the number of distinct n-grams (and therefore the dimensionality) grows rapidly with n.

The `ngram_range` parameter in scikit-learn vectorizers (e.g., `CountVectorizer`, `TfidfVectorizer`) specifies the range of n-grams to generate. For example:
- `ngram_range=(1,1)`: Only unigrams.
- `ngram_range=(1,2)`: Unigrams and bigrams.
- `ngram_range=(2,3)`: Bigrams and trigrams.
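
As a rough illustration of what an `ngram_range` setting produces, here is a plain-Python sketch. The helper `word_ngrams` is hypothetical; it imitates the space-joined word n-grams that the vectorizers use as feature names and is not scikit-learn's implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of size n over the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["the", "cat", "sat"]
print(word_ngrams(tokens, (1, 1)))
print(word_ngrams(tokens, (1, 2)))
print(word_ngrams(tokens, (2, 3)))
```

Three tokens yield 3 unigrams, 2 bigrams, and 1 trigram, so `(1,2)` already gives 5 features for this one sentence.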

Including n-grams can improve accuracy in tasks like sentiment analysis (e.g., "not good" as a bigram preserves negation), but it also expands the vocabulary, which increases sparsity. On small datasets, higher-order n-grams are prone to overfitting because most of them occur only once; on larger datasets they are more likely to help.

### Frequency-based Filtering
To manage vocabulary size and remove uninformative terms, vectorizers apply document frequency thresholds:
- **`min_df` (Minimum Document Frequency)**: Filters out terms that appear in fewer than `min_df` documents. This removes rare or noisy terms (e.g., typos or unique jargon) that don't generalize well.
	- Can be an integer (absolute count) or float (fraction of total documents, e.g., 0.01 for 1%).
	- Example: `min_df=2` excludes terms appearing in only 1 document.
- **`max_df` (Maximum Document Frequency)**: Filters out terms that appear in more than `max_df` documents, targeting overly common terms that act like stopwords (e.g., "the" in most documents).
	- Can be an integer or float (e.g., 0.9 for 90% of documents).
	- Example: `max_df=0.8` removes terms in over 80% of docs, similar to custom stopword lists.

These parameters reduce the feature space by eliminating low-information terms, improving computational efficiency and model interpretability. However, overly aggressive filtering (e.g., high `min_df` or low `max_df`) can discard useful signals, especially in imbalanced datasets. Always tune based on your corpus size and task, e.g., lower thresholds for sparse data, higher for noisy social media text.
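
The thresholds can be sketched in plain Python to show which terms survive. The helper `filter_by_document_frequency` and the toy documents are assumptions for illustration; the sketch follows the integer-count versus float-proportion convention described above rather than reproducing scikit-learn's code.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    n_docs = len(tokenized_docs)
    # Count each term once per document (document frequency, not raw count).
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Floats are proportions of the corpus; integers are absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

docs = [
    ["ramen", "good", "cheap"],
    ["ramen", "sweet"],
    ["ramen", "good"],
    ["ramen", "sweet", "hapy"],
]
print(filter_by_document_frequency(docs))                        # no filtering
print(filter_by_document_frequency(docs, min_df=2))              # drops one-off terms
print(filter_by_document_frequency(docs, min_df=2, max_df=0.8))  # also drops "ramen"
```

Here "ramen" appears in every document, so `max_df=0.8` removes it much like a stopword, while `min_df=2` removes the typo "hapy" along with the rare but valid "cheap".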

### Observing Effects
In practice, experiment with these parameters to monitor changes:
- **Vocabulary Size**: Increases with wider `ngram_range` (more n-grams) but decreases with stricter `min_df`/`max_df`.
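- **Sparsity**: The fraction of zero entries in the vectorized matrix, i.e., 1 minus the density (the ratio of non-zero elements to all elements). Higher-order n-grams and looser filters add many mostly-zero columns, which increases sparsity and the number of features, potentially requiring more memory.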
- **Performance**: Use metrics like accuracy or F1-score on a holdout set to evaluate. For instance, bigrams might boost sentiment tasks but slow training.
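
Density and sparsity are easy to mix up, so here is the arithmetic on a small dense matrix. The helper `density_and_sparsity` and the toy counts are illustrative assumptions; with a SciPy sparse matrix the same density is `X.nnz / (X.shape[0] * X.shape[1])`.

```python
def density_and_sparsity(matrix):
    n_cells = len(matrix) * len(matrix[0])
    nnz = sum(1 for row in matrix for value in row if value != 0)
    density = nnz / n_cells          # share of non-zero cells
    return density, 1.0 - density    # sparsity is the complement

counts = [
    [1, 0, 0, 2],
    [0, 0, 1, 0],
    [0, 3, 0, 0],
]
density, sparsity = density_and_sparsity(counts)
print(f"density={density:.3f} sparsity={sparsity:.3f}")
```

With 4 non-zero cells out of 12, the density is one third and the sparsity is two thirds.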

The code in the following cell demonstrates this by vectorizing the corpus with `ngram_range=(1,2)`, `min_df=1`, and `max_df=1.0`, then inspecting shapes, densities, and top TF-IDF terms. Try modifying these values (e.g., set `min_df=2` or `ngram_range=(1,3)`) to see the impact on outputs. This hands-on approach reinforces how preprocessing choices directly influence downstream NLP models.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

def spacy_tokenizer_for_vectorizer(doc):
    return remove_stopwords(spacy_tokenize(doc, lemma=True, keep_alpha=True))

raw_texts = df["text"].tolist()

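# token_pattern=None avoids scikit-learn's warning that the default pattern is unused when a custom tokenizer is supplied.
count_vec = CountVectorizer(tokenizer=spacy_tokenizer_for_vectorizer, token_pattern=None, ngram_range=(1,2), min_df=1, max_df=1.0)
Xc = count_vec.fit_transform(raw_texts)
tfidf_vec = TfidfVectorizer(tokenizer=spacy_tokenizer_for_vectorizer, token_pattern=None, ngram_range=(1,2), min_df=1, max_df=1.0)
Xt = tfidf_vec.fit_transform(raw_texts)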

print("CountVectorizer shape:", Xc.shape, "density:", Xc.nnz / (Xc.shape[0]*Xc.shape[1]))
print("TfidfVectorizer shape:", Xt.shape, "density:", Xt.nnz / (Xt.shape[0]*Xt.shape[1]))

feature_names = tfidf_vec.get_feature_names_out()
print("Sample features:", feature_names[:20])

def top_k_tfidf(doc_idx=0, k=10):
    row = Xt[doc_idx].toarray().ravel()
    inds = np.argsort(-row)[:k]
    return [(feature_names[i], float(row[i])) for i in inds if row[i] > 0]

print("Top TF-IDF terms for doc 0:", top_k_tfidf(0, k=10))


# 14. Building a Reusable Preprocessing Pipeline (Function & Class)

**Learning objectives**
- Implement a configurable function and a small class to preprocess text consistently.
- Add simple unit-like checks to validate preprocessing outputs.

Building a reusable preprocessing pipeline is essential for maintaining consistency, reproducibility, and efficiency in NLP workflows. Instead of applying preprocessing steps ad-hoc in each script or notebook, encapsulating them into a configurable function or class allows you to standardize transformations across datasets, experiments, and team members. This approach minimizes errors, facilitates debugging, and supports iterative development, e.g., easily toggling options like lemmatization or emoji handling without rewriting code.

The pipeline typically chains steps from earlier sections: canonicalization, structural cleaning, normalization, tokenization, stopword removal, lemmatization, and optional handling of negations, emojis, etc. A class-based design (e.g., with `fit` and `transform` methods, inspired by scikit-learn) enables fitting on training data (e.g., for corpus-level statistics) and transforming new texts uniformly. A simpler function can suffice for stateless preprocessing. Configuration via a dictionary keeps this flexible; e.g., set `{"lemma": True, "keep_negators": False}` to customize behavior per task.

To ensure reliability, include "unit-like checks": lightweight assertions or prints that verify outputs, such as checking for expected token counts, absence of unwanted elements (e.g., no raw URLs if masking is enabled), or vocabulary size reductions. These act as sanity checks, helping catch issues early without full model evaluation. In the demo below, we implement both a class (`Preprocessor`) and a helper function (`preprocess`), applying them to the corpus and inspecting results. This promotes best practices like modularity and testing, making your NLP pipeline robust and scalable. Always document configurations and test on edge cases (e.g., empty strings, mixed languages) to avoid surprises in production.
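
A minimal sketch of what such unit-like checks could look like. `toy_preprocess` is a deliberately small stand-in assumed for illustration (URL masking, lower-casing, whitespace collapsing), not the notebook's `Preprocessor`. The checks take the preprocessing function as an argument so they can be pointed at other pipelines, though the specific assertions may need adjusting.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def toy_preprocess(text: str) -> str:
    # Stand-in pipeline: mask URLs, lower-case, collapse whitespace.
    text = URL_RE.sub(" __url__ ", text)
    return " ".join(text.lower().split())

def run_checks(preprocess_fn) -> bool:
    assert preprocess_fn("") == "", "empty input should stay empty"
    out = preprocess_fn("Visit https://example.com NOW")
    assert "http" not in out, "raw URL leaked through masking"
    assert out == out.lower(), "output should be lower-cased"
    assert preprocess_fn(out) == out, "preprocessing should be idempotent"
    return True

print(run_checks(toy_preprocess))
```

Checks like these run in milliseconds and fail loudly, which makes them a cheap guard whenever a configuration option or regex changes.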


In [None]:
class Preprocessor:
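    # Defaults used for any key the caller does not override.
    DEFAULT_CONFIG = {
        "mask": True, "lower": True, "strip_accents": True,
        "keep_punct": False, "keep_digits": False,
        "lemma": True, "keep_alpha": True, "keep_negators": True,
        "handle_emoji": True, "handle_emoticon": True, "normalize_elong": True
    }

    def __init__(self, config=None):
        # Merge overrides into the defaults so a partial dict such as
        # {"lemma": True, "keep_negators": False} does not raise KeyError in transform().
        self.config = {**self.DEFAULT_CONFIG, **(config or {})}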

    def fit(self, corpus):
        # Placeholder for corpus-level fitting (e.g., building slang dicts). Not needed here.
        return self

    def transform(self, corpus):
        out = []
        for text in corpus:
            t = canonicalize_text(text)
            t = structural_clean(t, mask=self.config["mask"])
            t = handle_emojis(t) if self.config["handle_emoji"] else t
            t = handle_emoticons(t) if self.config["handle_emoticon"] else t
            if self.config["normalize_elong"]:
                t = normalize_elongations(t, max_repeat=2)
            t = normalize_text(t,
                               lower=self.config["lower"],
                               strip_accents=self.config["strip_accents"],
                               keep_punct=self.config["keep_punct"],
                               keep_digits=self.config["keep_digits"])
            tokens = spacy_tokenize(t, lemma=self.config["lemma"], keep_alpha=self.config["keep_alpha"], preserve_case=False)
            tokens = remove_stopwords(tokens, keep_negators=self.config["keep_negators"])
            out.append(" ".join(tokens))
        return out

def preprocess(text, cfg=None):
    return Preprocessor(cfg).fit([text]).transform([text])[0]

# Quick checks
pp = Preprocessor().fit(df.text.tolist())
processed = pp.transform(df.text.tolist())
print("Before preprocessing:")
for i, text in enumerate(df.text.head()):
    print(f"{i+1}: {text}")
print("\nAfter preprocessing:")
for i, proc in enumerate(processed[:5]):
    print(f"{i+1}: {proc}")


# 15. Measuring Impact: Before/After Feature Spaces

**Learning objectives**
- Compare vocabulary size and sparsity before vs after preprocessing.
- Visualize token counts (two separate figures).


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

raw_texts = df["text"].tolist()
raw_vec = TfidfVectorizer(tokenizer=lambda s: [t.text.lower() for t in nlp(s) if not t.is_space], ngram_range=(1,1), min_df=1)
X_raw = raw_vec.fit_transform(raw_texts)

proc_texts = processed
proc_vec = TfidfVectorizer(tokenizer=lambda s: s.split(), ngram_range=(1,1), min_df=1)
X_proc = proc_vec.fit_transform(proc_texts)

# nnz / (rows * cols) is the density (share of non-zero cells); sparsity is 1 - density.
print("Raw vocab size:", len(raw_vec.get_feature_names_out()), "Density:", X_raw.nnz / (X_raw.shape[0]*X_raw.shape[1]))
print("Proc vocab size:", len(proc_vec.get_feature_names_out()), "Density:", X_proc.nnz / (X_proc.shape[0]*X_proc.shape[1]))

# Plot top token counts before
raw_counts = Counter([t for s in raw_texts for t in [w.text.lower() for w in nlp(s) if not w.is_space]])
def plot_top(freqs, title):
    items = freqs.most_common(15)
    labels = [k for k, _ in items]
    values = [v for _, v in items]
    plt.figure()
    plt.bar(range(len(labels)), values)
    plt.xticks(range(len(labels)), labels, rotation=45, ha="right")
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_top(raw_counts, "Top token counts BEFORE (raw)")

# Plot top token counts after
proc_counts = Counter([t for s in proc_texts for t in s.split()])
plot_top(proc_counts, "Top token counts AFTER (processed)")



# 16. Common Pitfalls, Checklists, and Best Practices

**Learning objectives**
- Recognize common errors and develop a practical checklist.

**Pitfalls**
- *Data leakage:* Fit vectorizers/transforms on training data only.
- *Over-cleaning:* Removing negators or sentiment-bearing punctuation.
- *Language mismatch:* Use language-appropriate models/stopwords.
- *Domain shift:* Build domain/custom stopword lists (e.g., "subject", "http" in emails).
- *Reproducibility:* Fix seeds and record configuration.
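
The data-leakage pitfall can be illustrated without any library: learn the vocabulary from the training split only, then reuse it unchanged on the test split. The helpers `fit_vocabulary` and `transform_counts` are hypothetical names for this sketch; with scikit-learn the equivalent pattern is `fit` (or `fit_transform`) on the training data and `transform` on the test data.

```python
def fit_vocabulary(train_docs):
    # Vocabulary is learned from training documents only.
    vocab = sorted({tok for doc in train_docs for tok in doc.split()})
    return {term: idx for idx, term in enumerate(vocab)}

def transform_counts(docs, vocab):
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:  # terms unseen during fitting are ignored
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

train = ["good ramen", "cheap ramen"]
test = ["good sushi"]

vocab = fit_vocabulary(train)
print(vocab)
print(transform_counts(test, vocab))
```

The test-only word "sushi" gets no column, which mirrors what the model will face on genuinely new text.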

**Checklist**
- Define your task and what signal matters (e.g., sentiment vs topics).
- Decide removal vs masking for URLs/emails/mentions/hashtags.
- Set a consistent normalization policy (case, accents, digits, punctuation).
- Choose lemma vs stemming and whether to keep negators.
- Keep a compact, documented preprocessing class or config.
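
One possible way to keep the configuration compact and recordable is a frozen dataclass that can be dumped next to experiment results. The class name `PreprocessConfig` and its fields are assumptions for illustration and do not match the notebook's dictionary keys exactly.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PreprocessConfig:
    """Hypothetical preprocessing settings; frozen so a run cannot mutate them."""
    lower: bool = True
    strip_accents: bool = True
    lemma: bool = True
    keep_negators: bool = True   # keep "not", "no", "never" for sentiment tasks
    mask_urls: bool = True       # mask rather than remove URLs

sentiment_cfg = PreprocessConfig()
topic_cfg = PreprocessConfig(keep_negators=False)

# Serialize with sorted keys so the recorded config is stable across runs.
print(json.dumps(asdict(topic_cfg), sort_keys=True))
```

Storing this JSON string with each experiment makes it possible to tell later which preprocessing choices produced which results.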



# 17. Mini Review & Exercises

**Learning objectives**
- Self-assess understanding via short prompts and hands-on tasks.

**Review (short answer)**
1. When would you prefer masking over removing URLs?  
2. Why can removing stopwords harm sentiment analysis?  
3. Give an example where digits should be kept.  
4. Compare lemmatization vs stemming with one example.  
5. What is the risk of using `max_df=1.0` with bigrams on tiny corpora?  

**Exercises**
1. Add a custom tokenizer rule (e.g., keep `U.S.` as one token) and show its effect on tokens.  
2. Tune `min_df`/`max_df` in TF-IDF and report vocabulary size changes.  
3. Implement a variant of negation marking that stops at commas only, and compare token outputs on a few sentences.
