# Advanced Text Preprocessing Fundamentals for NLP 🧬

**Description**: Dive deep into advanced text preprocessing, solving challenges like code-switching, emoji handling, context-aware normalization, multi-lingual pipelines, social/text dialects, dependency parsing for preprocessing, specialized regular expressions, and transformer-friendly formatting. 🧠🛠️

***

## Table of Contents
- 1. Why Advanced Preprocessing? 💡
- 2. Unicode, Text Noise & Exotic Scripts 🌐
- 3. Regex Mastery & Adversarial Patterns 🧩
- 4. Social Text & Emoji Handling 💬🙂
- 5. Code-switching & Multilingual Contexts 🔀🌍
- 6. POS, Morphology & Dependency-driven Cleaning 🧠🧭
- 7. Context-aware Normalization (Rule-based & ML-based) ⚖️🤖
- 8. Handling Outlier/Niche Text: URLs, Dates, Numbers, Code Snippets 📎📅🔢💻

***

## 1. Why Advanced Preprocessing? 💡
Traditional cleaning is not enough for real-world applications (social media, legal/medical, code-mixed, low-resource, web-scraped text).  
Focus on preserving signal while minimizing noise so downstream models remain accurate and robust. 🎯

***

In [1]:
"""
Text analytics setup: imports, model loading, and tokenization utilities.

This module centralizes common NLP dependencies (spaCy, NLTK, etc.), provides
safe loading for spaCy language models with helpful guidance, and initializes a
tweet-friendly tokenizer for robust social/text processing.

Usage:
- Ensure spaCy models are installed (see one-time setup below).
- Import EN_NLP / ES_NLP for language processing pipelines.
- Use tknzr for Twitter-aware tokenization.
"""

# Standard library
from collections import Counter
import re

# Third-party: data and NLP
import numpy as np
import pandas as pd
import emoji
import langid
import spacy
from spacy.language import Language  # for type hints

# NLTK: tokenization & stopwords
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer
from nltk.corpus import stopwords

# Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt


# -----------------------------------------------------------------------------
# One-time setup (uncomment and run if models/data are missing)
# -----------------------------------------------------------------------------
# spaCy small English and Spanish models:
# !python -m spacy download en_core_web_sm
# !python -m spacy download es_core_news_sm
#
# NLTK resources (if you get LookupError for tokenizers/stopwords):
# import nltk
# nltk.download("punkt")
# nltk.download("punkt_tab")  # recent NLTK releases look up this resource for tokenizers
# nltk.download("stopwords")


# -----------------------------------------------------------------------------
# Configuration
# -----------------------------------------------------------------------------
# Matplotlib style for consistent, clean visuals (adjust to preference)
plt.style.use("seaborn-v0_8-whitegrid")

# spaCy model names centralized as constants for easy change/override
EN_MODEL_NAME = "en_core_web_sm"
ES_MODEL_NAME = "es_core_news_sm"


def _load_spacy_model(name: str) -> Language:
    """
    Load a spaCy model by name with a helpful error if it's not installed.

    Parameters:
        name: The registered spaCy model name (e.g., "en_core_web_sm").

    Returns:
        A spaCy Language pipeline.

    Raises:
        OSError: If the model is not installed, with guidance to install it.
    """
    try:
        return spacy.load(name)
    except OSError as exc:
        raise OSError(
            f"spaCy model '{name}' is not installed. "
            f"Install it with: python -m spacy download {name}"
        ) from exc


# -----------------------------------------------------------------------------
# NLP Pipelines & Tokenizer
# -----------------------------------------------------------------------------
# English and Spanish NLP pipelines (small models; fast and lightweight)
EN_NLP: Language = _load_spacy_model(EN_MODEL_NAME)
ES_NLP: Language = _load_spacy_model(ES_MODEL_NAME)

# Tweet-aware tokenizer that handles mentions, hashtags, emojis, and URLs well
tknzr = TweetTokenizer()

# Optional: pre-load stopwords for convenience (wrap in try to avoid LookupError)
try:
    EN_STOPWORDS = set(stopwords.words("english"))
except LookupError:
    EN_STOPWORDS = set()
    # Tip: run the NLTK downloads in the setup block above if this is empty.


# -----------------------------------------------------------------------------
# Quick sanity checks (comment out in production)
# -----------------------------------------------------------------------------
# doc = EN_NLP("Hello world! This is a quick spaCy sanity check.")
# print([t.text for t in doc])
# print(tknzr.tokenize("Testing @mentions, #hashtags, and emojis 😄 http://example.com"))

## 2. Unicode, Text Noise & Exotic Scripts 🌐
Normalize Unicode (NFC/NFKC), strip zero-widths, and manage mixed scripts safely for consistent text flow.  
Sanitize HTML/XML, normalize whitespace, and define policies for removing or replacing non-text artifacts. 🧹
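
The HTML side of this policy can be sketched with the standard library alone. The names `_TextExtractor` and `strip_html` are hypothetical, and the choice to drop `<script>`/`<style>` content and join text chunks with a space is an assumption, not a rule from this notebook. For untrusted or badly broken markup, a dedicated HTML library is probably the safer choice.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(raw: str) -> str:
    """Remove tags, decode entities, and collapse all whitespace runs to one space."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text buffered after the last tag
    # Joining with a space keeps adjacent blocks apart; it can also split a word
    # that is interrupted by an inline tag, which is an accepted trade-off here.
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


print(strip_html("<p>Caf&eacute;&nbsp;M&uuml;nster</p><script>x=1</script> <b>open</b>\n now"))
```

Note that `\s` in a Python `str` pattern also matches the non-breaking space, so `&nbsp;` ends up as an ordinary space.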

***

In [2]:
sample_text = "Café Münster — Привет! — تَجْرِبة 🧑🏽‍💻\u202e\n\n"
print('RAW:', sample_text.encode('unicode_escape'))

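# Normalize Unicode, then drop control/format characters (incl. bidi and zero-width)
import unicodedata

def clean_unicode(text):
    # NFKC folds compatibility forms (e.g., full-width letters, ligatures)
    text = unicodedata.normalize("NFKC", text)
    # Drop every "C*" category: Cc (controls, including \n and \t), Cf (format
    # characters such as bidi overrides and zero-width joiners), plus Cs/Co/Cn
    text = ''.join(ch for ch in text if not unicodedata.category(ch).startswith('C'))
    # Redundant after the filter above (all of these are Cf); kept as an explicit safeguard.
    # Side effect: removing U+200D (ZWJ) splits joined emoji sequences into separate glyphs.
    text = re.sub(r'[\u200e\u202e\u200b\u200c\u200d]', '', text)
    return text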

print('UNICODE CLEAN:', clean_unicode(sample_text))

RAW: b'Caf\\xe9 M\\xfcnster \\u2014 \\u041f\\u0440\\u0438\\u0432\\u0435\\u0442! \\u2014 \\u062a\\u064e\\u062c\\u0652\\u0631\\u0650\\u0628\\u0629 \\U0001f9d1\\U0001f3fd\\u200d\\U0001f4bb\\u202e\\n\\n'
UNICODE CLEAN: Café Münster — Привет! — تَجْرِبة 🧑🏽💻
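
The cleaned output above has lost both the trailing newlines and the zero-width joiner that held the emoji sequence together. Where those matter downstream, one possible variant is an explicit allow-list. This is a sketch: `KEEP` and `clean_unicode_keep` are hypothetical names, and keeping U+200D everywhere is an assumption that may not suit every script or threat model.

```python
import unicodedata

# Characters exempt from the "drop all C* categories" rule (assumed policy)
KEEP = {"\n", "\t", "\u200d"}  # newline, tab, zero-width joiner


def clean_unicode_keep(text: str) -> str:
    """NFKC-normalize, then drop control/format characters except those in KEEP."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("C")
    )


technologist = "\U0001f9d1\U0001f3fd\u200d\U0001f4bb"  # person + skin tone + ZWJ + laptop
print(clean_unicode_keep(technologist) == technologist)
print(repr(clean_unicode_keep("a\u200bb\u202e\n")))
```

A stricter policy could keep U+200D only when it sits between two emoji code points.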


## 3. Regex Mastery & Adversarial Patterns 🧩
Use explicit, compiled patterns with unit tests; avoid greedy matches and catastrophic backtracking.  
Design for adversarial inputs (nested brackets, quotes, long runs) and validate with fuzz tests. 🧪
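
A table-driven check is one lightweight way to pin a pattern's behavior before it goes into a pipeline. The sketch below applies it to the US-style phone pattern used in this section. The cases and the helper names `first_phone` and `run_cases` are illustrative assumptions.

```python
import re

# Same US-style phone pattern as used in this section
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[\s.-])?"   # optional country code
    r"(?:\(?\d{3}\)?[\s.-])"   # area code, optional parentheses
    r"\d{3}[\s.-]\d{4}"        # local number
)

# (input, expected first match or None) pairs, including negative cases
CASES = [
    ("(555) 123-4567", "(555) 123-4567"),
    ("555.123.4567", "555.123.4567"),
    ("+1 555-123-4567", "+1 555-123-4567"),
    ("5551234567", None),   # no separators: this pattern rejects it by design
    ("123-4567", None),     # missing area code
]


def first_phone(text: str):
    """Return the first phone-like match in text, or None."""
    match = PHONE_RE.search(text)
    return match.group(0) if match else None


def run_cases():
    """Return the list of failing cases as (input, expected, actual)."""
    return [
        (text, expected, first_phone(text))
        for text, expected in CASES
        if first_phone(text) != expected
    ]


print("failures:", run_cases())
```

Negative cases are worth as much as positive ones here, since they document what the pattern deliberately does not match.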

***

In [5]:
"""
Robust extractors for emails, URLs (incl. Markdown), phone numbers, and emojis.
- Handles Markdown links: [label](https://example.com)
- Avoids trailing punctuation in URLs (e.g., ')', ']', '>')
- Uses compiled regex for performance and clarity
- Updated for emoji>=2.0 (no get_emoji_regexp); uses emoji_list with a safe fallback
"""

import re
import emoji

# Sample text
text = (
    "Contact: [john.doe@email.com](mailto:john.doe@email.com), "
    "Call: (555) 123-4567, "
    "Visit: [https://domain.com](https://domain.com) @user #hashtag 😊"
)

# -----------------------------
# Compiled regex patterns
# -----------------------------
# Email: reasonably permissive for extraction (not strict validation)
EMAIL_RE = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# URL (raw): stop at whitespace or common closing punctuation in prose/Markdown
URL_RE = re.compile(r"https?://[^\s\])>]+")

# Markdown links: capture label and target URL separately
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)]+)\)")

# Phone (US-like): optional parentheses, separators (space, dash, dot), optional country code
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[\s.-])?"        # optional country code, e.g., +1
    r"(?:\(?\d{3}\)?[\s.-])"        # area code with optional parentheses
    r"\d{3}[\s.-]\d{4}"             # local number
)

# -----------------------------
# Emoji extraction (emoji>=2.0)
# -----------------------------
def extract_emojis(txt: str):
    """
    Prefer emoji.emoji_list (emoji>=2). Fallback to a compiled regex built
    from emoji.EMOJI_DATA keys if emoji_list is unavailable.
    """
    if hasattr(emoji, "emoji_list"):
        # [{'emoji': '😊', 'match_start': 123, 'match_end': 124}, ...] -> just the glyphs
        return [d["emoji"] for d in emoji.emoji_list(txt)]
    else:
        # Older environments: build a regex from known emojis; sort by length to match sequences first
        try:
            from emoji import EMOJI_DATA
            emoji_re = re.compile(
                "|".join(re.escape(e) for e in sorted(EMOJI_DATA.keys(), key=len, reverse=True))
            )
            return emoji_re.findall(txt)
        except Exception:
            return []

# -----------------------------
# Extraction
# -----------------------------
# 1) Emails
emails = EMAIL_RE.findall(text)

# 2) URLs
#    - Extract Markdown URLs first (both label and target)
md_links = MD_LINK_RE.findall(text)  # list of tuples: (label, url)
md_urls = [u for _, u in md_links]

#    - Extract raw URLs not wrapped in Markdown
raw_urls = URL_RE.findall(text)

#    - Combine and de-duplicate while preserving order
seen = set()
urls = []
for u in md_urls + raw_urls:
    if u not in seen:
        urls.append(u)
        seen.add(u)

# 3) Phones
phones = PHONE_RE.findall(text)

# 4) Emojis (emoji>=2 preferred path)
emojis = extract_emojis(text)

# -----------------------------
# Output
# -----------------------------
print(f"Emails: {emails}")
print(f"URLs: {urls}")
print(f"Phones: {phones}")
print(f"Emojis: {emojis}")

Emails: ['john.doe@email.com', 'john.doe@email.com']
URLs: ['https://domain.com']
Phones: ['(555) 123-4567']
Emojis: ['😊']
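
To make the fuzz-testing advice concrete, here is a minimal sketch that throws seeded random noise at the raw URL pattern and checks simple invariants on every match. The alphabet, trial count, and function names are assumptions, and `example.com` is only a placeholder host.

```python
import random
import re
import string

URL_RE = re.compile(r"https?://[^\s\])>]+")  # same raw-URL pattern as above
FORBIDDEN = set(" \t\n])>")                  # characters a match must never contain


def fuzz_url_pattern(trials: int = 500, seed: int = 0) -> int:
    """Feed random noise around a URL and assert invariants on every match."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    alphabet = string.ascii_letters + string.digits + "[]()<>/:.?&= \n"
    for _ in range(trials):
        noise = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 80)))
        sample = f"{noise} https://example.com/{noise}"
        for match in URL_RE.finditer(sample):
            url = match.group(0)
            assert url.startswith(("http://", "https://"))
            assert not FORBIDDEN & set(url)
    return trials


def survives_long_runs(n: int = 100_000) -> bool:
    """Long single-character runs should match in one linear pass."""
    adversarial = "https://" + "a" * n + ")" * n
    match = URL_RE.search(adversarial)
    return match is not None and len(match.group(0)) == len("https://") + n


print(fuzz_url_pattern(), survives_long_runs())
```

The pattern has a single quantifier over a negated character class, so there are no nested repetitions for the regex engine to backtrack through. Patterns shaped like `(a+)+b` are the ones to avoid.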


## 4. Social Text & Emoji Handling 💬🙂
Adopt emoji-aware tokenizers; map emojis to sentiment/intent where useful instead of blanket removal.  
Preserve hashtags and mentions as features; separate display needs from modeling via normalization. #️⃣👤

***

In [6]:
import emoji

def normalize_emojis(text):
    # Empty delimiters drop the ':' wrappers, but adjacent emoji names run together;
    # pass delimiters=(" ", " ") if each name should become its own token.
    return emoji.demojize(text, delimiters=("", ""))

def clean_social(text):
    text = re.sub(r'@\w+', ' user ', text)  # Replace mentions with a generic token
    text = re.sub(r'#(\w+)', r'\1', text)  # Remove hashtag but keep keyword
    text = re.sub(r'https?://\S+', '', text)  # Drop URLs
    # Map reaction slang to one token, including elongated forms ("lool", "lmaooo")
    text = re.sub(r"\b(?:lo+l|omg|lmao+)\b", "laugh", text, flags=re.I)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # Squeeze runs of 3+ repeated chars to 2
    return text

tweet = "Loving this! 😂😍🔥 Visit https://t.co/abc123 @nlp_guru #NLP #AI lmaooo"
print("Norm EMOJI:", normalize_emojis(tweet))
print("Cleaned Social:", clean_social(normalize_emojis(tweet)))

Norm EMOJI: Loving this! face_with_tears_of_joysmiling_face_with_heart-eyesfire Visit https://t.co/abc123 @nlp_guru #NLP #AI lmaooo
Cleaned Social: Loving this! face_with_tears_of_joysmiling_face_with_heart-eyesfire Visit  user  NLP AI laugh
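
As an alternative to rewriting the text, mentions, hashtags, and emoji can be pulled out as separate features while the raw string stays untouched for display. The sketch below assumes a tiny hand-made polarity table. `EMOJI_POLARITY` and `social_features` are hypothetical, and the scores are illustrative rather than taken from any published lexicon.

```python
import re

# Hypothetical polarity table; covers single-code-point emoji only
EMOJI_POLARITY = {
    "😂": 1,
    "😍": 1,
    "🔥": 1,
    "😡": -1,
    "😢": -1,
}

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def social_features(text: str) -> dict:
    """Extract mentions, lowercased hashtags, and a summed emoji polarity score."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        # Character-wise lookup: multi-code-point sequences are not scored here
        "emoji_score": sum(EMOJI_POLARITY.get(ch, 0) for ch in text),
    }


tweet = "Loving this! 😂😍🔥 Visit https://t.co/abc123 @nlp_guru #NLP #AI lmaooo"
print(social_features(tweet))
# -> {'mentions': ['nlp_guru'], 'hashtags': ['nlp', 'ai'], 'emoji_score': 3}
```

Keeping features separate from the normalized text lets the model see both the cleaned wording and the social signals.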


## 5. Code-switching & Multilingual Contexts 🔀🌍
Detect language at segment or sentence level and route through language-specific pipelines.  
For code-mixed text, combine tokenization strategies and locale-aware normalization without collapsing scripts. 🧭

***

In [7]:
texts = ["I need ayuda with this task.",
         "Vamos to the park after work.",
         "Bonjour! How are you doing?"]

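# Route each text to a language-specific pipeline; unknown languages fall back to English
PIPELINES = {"en": EN_NLP, "es": ES_NLP}

for t in texts:
    # langid returns an unnormalized log-probability score, not a 0-1 probability
    lang, score = langid.classify(t)
    print(f"{repr(t)} -- LANG: {lang} ({score:.2f})")
    doc = PIPELINES.get(lang, EN_NLP)(t)
    print([{'text': token.text, 'pos': token.pos_} for token in doc])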

'I need ayuda with this task.' -- LANG: en (-96.63)
[{'text': 'I', 'pos': 'PRON'}, {'text': 'need', 'pos': 'VERB'}, {'text': 'ayuda', 'pos': 'NOUN'}, {'text': 'with', 'pos': 'ADP'}, {'text': 'this', 'pos': 'DET'}, {'text': 'task', 'pos': 'NOUN'}, {'text': '.', 'pos': 'PUNCT'}]
'Vamos to the park after work.' -- LANG: en (-110.23)
[{'text': 'Vamos', 'pos': 'PROPN'}, {'text': 'to', 'pos': 'ADP'}, {'text': 'the', 'pos': 'DET'}, {'text': 'park', 'pos': 'NOUN'}, {'text': 'after', 'pos': 'ADP'}, {'text': 'work', 'pos': 'NOUN'}, {'text': '.', 'pos': 'PUNCT'}]
'Bonjour! How are you doing?' -- LANG: en (-41.62)
[{'text': 'Bonjour', 'pos': 'NOUN'}, {'text': '!', 'pos': 'PUNCT'}, {'text': 'How', 'pos': 'SCONJ'}, {'text': 'are', 'pos': 'AUX'}, {'text': 'you', 'pos': 'PRON'}, {'text': 'doing', 'pos': 'VERB'}, {'text': '?', 'pos': 'PUNCT'}]
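
All three sentences above are labeled `en` as a whole, so the Spanish and French words are never routed anywhere else. A segment-level pass could look roughly like the sketch below. It assumes tiny hand-written word lists (`ES_WORDS` and `FR_WORDS` are hypothetical stand-ins for a real lexicon or token-level classifier) and groups contiguous tokens that share a language tag.

```python
import re
from itertools import groupby

# Hypothetical mini-lexicons; a real system would use far larger resources
ES_WORDS = {"ayuda", "vamos", "gracias", "favor"}
FR_WORDS = {"bonjour", "merci"}


def tag_token(token: str, default: str = "en") -> str:
    """Look a word up in the mini-lexicons, falling back to the default language."""
    low = token.lower()
    if low in ES_WORDS:
        return "es"
    if low in FR_WORDS:
        return "fr"
    return default


def segment_by_language(text: str, default: str = "en"):
    """Split text into (language, segment) runs of contiguous same-language tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    tagged = []
    current = default
    for tok in tokens:
        if re.match(r"\w", tok):
            current = tag_token(tok, default)
        # Punctuation inherits the language of the preceding word
        tagged.append((current, tok))
    return [
        (lang, " ".join(tok for _, tok in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]


print(segment_by_language("I need ayuda with this task."))
# -> [('en', 'I need'), ('es', 'ayuda'), ('en', 'with this task .')]
```

Each segment could then be sent to the matching pipeline, with the default pipeline handling languages that have no model loaded.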


## 6. POS, Morphology & Dependency-driven Cleaning 🧠🧭
Lemmatize with POS awareness; avoid lowercasing when casing carries signal (NER, acronyms).  
Use dependencies to trim boilerplate spans while keeping relation-bearing tokens intact. 🧱

***

In [8]:
"""
Lemmatize verbs and mask PII spans (PERSON, ORG) using spaCy.

- Verbs are lemmatized while other tokens remain unchanged.
- PII masking replaces full entity spans (e.g., "John Smith") with a single placeholder (e.g., "<PERSON>").
- Character-span masking preserves original whitespace and punctuation.
"""

from typing import Iterable, Tuple
import spacy

# Assumes EN_NLP is already loaded elsewhere:
# EN_NLP = spacy.load("en_core_web_sm")

TEXT = "Dr. Adams prescribed 10mg Ibuprofen to John Smith at 4pm."
MASK_LABELS = ("PERSON", "ORG")
PLACEHOLDER_BY_LABEL = {
    "PERSON": "<PERSON>",
    "ORG": "<ORG>",
}
DEFAULT_PLACEHOLDER = "<MASK>"


def lemmatize_verbs(doc: spacy.tokens.Doc) -> str:
    """Return space-joined tokens with only verbs lemmatized.

    Joining on single spaces follows spaCy's tokenization, so "10mg" comes back as "10 mg".
    """
    return " ".join(tok.lemma_ if tok.pos_ == "VERB" else tok.text for tok in doc)


def _entity_spans_for_labels(
    doc: spacy.tokens.Doc, labels: Iterable[str]
) -> Iterable[Tuple[int, int, str]]:
    """
    Collect (start_char, end_char, label_) for entities matching given labels.
    spaCy ensures entity spans do not overlap; sorting reverse avoids index shift on replacement.
    """
    spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents if ent.label_ in labels]
    spans.sort(key=lambda x: x[0], reverse=True)
    return spans


def mask_pii_spans(
    text: str,
    spans: Iterable[Tuple[int, int, str]],
    placeholder_by_label: dict,
    default_placeholder: str = "<MASK>",
) -> str:
    """
    Replace each entity span with a single placeholder token.
    Replacement occurs from right to left to keep indices valid.
    """
    masked = text
    for start, end, label in spans:
        repl = placeholder_by_label.get(label, default_placeholder)
        masked = masked[:start] + repl + masked[end:]
    return masked


# Process
doc = EN_NLP(TEXT)

# 1) Lemmatize only verbs
lemmas_str = lemmatize_verbs(doc)

# 2) Mask PII entity spans
spans = _entity_spans_for_labels(doc, MASK_LABELS)
masked_text = mask_pii_spans(TEXT, spans, PLACEHOLDER_BY_LABEL, DEFAULT_PLACEHOLDER)

print("Lemmas (verbs lemmatized):", lemmas_str)
print("Masked text:", masked_text)


Lemmas (verbs lemmatized): Dr. Adams prescribe 10 mg Ibuprofen to John Smith at 4 pm .
Masked text: Dr. <PERSON> prescribed 10mg Ibuprofen to <PERSON> at 4pm.
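
The casing advice can be illustrated without a parser. The sketch below lowercases ordinary words but leaves all-caps tokens of two or more letters alone, on the assumption that they are acronyms. The function name is hypothetical, and the heuristic will also preserve shouted words such as "HELLO" and will lowercase mixed-case terms such as "mRNA".

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")


def lowercase_preserving_acronyms(text: str) -> str:
    """Lowercase words unless they look like acronyms (2+ letters, all uppercase)."""

    def _fold(match: re.Match) -> str:
        word = match.group(0)
        if len(word) >= 2 and word.isupper():
            return word  # treat as an acronym and keep the casing signal
        return word.lower()

    return WORD_RE.sub(_fold, text)


print(lowercase_preserving_acronyms("The FDA approved Ibuprofen in the US."))
# -> "the FDA approved ibuprofen in the US."
```

When a POS tagger or NER model is available, its labels are likely a better guide than this surface heuristic.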


## 7. Context-aware Normalization (Rule-based & ML-based) ⚖️🤖
Blend curated rules (domain shorthands, units) with ML normalizers for ambiguous cases.  
Version rulesets, log diffs, and A/B their impact on downstream metrics before rollout. 📈

***

In [10]:
"""
Smart text normalization:
- Expands English contractions (e.g., "I'm" -> "I am").
- Normalizes common clinical shorthand (pt, pls, d/c, stat).
- Preserves possessives (e.g., "pt's" -> "patient's").
- Cleans extra spaces before punctuation.
"""

import re
import contractions
from typing import Literal

def smart_normalize(
    text: str,
    dc_meaning: Literal["discontinue", "discharge"] = "discontinue"
) -> str:
    """
    Normalize text by expanding contractions and replacing domain-specific slang.

    Parameters
    ----------
    text : str
        Input text to normalize.
    dc_meaning : {"discontinue", "discharge"}, optional
        Intended meaning for the shorthand "d/c" (default: "discontinue").

    Returns
    -------
    str
        Normalized text.
    """
    # 1) Expand standard English contractions
    s = contractions.fix(text)

    # 2) Domain-specific replacements (case-insensitive)
    #    Order matters (handle possessive "pt's" before bare "pt")
    rules = [
        (re.compile(r"\bpt's\b", flags=re.IGNORECASE), "patient's"),
        (re.compile(r"\bpt\b", flags=re.IGNORECASE), "patient"),
        (re.compile(r"\bpls\b", flags=re.IGNORECASE), "please"),
        (re.compile(r"\b(?:d\s*/\s*c)\b", flags=re.IGNORECASE), dc_meaning),
        (re.compile(r"\bstat\b", flags=re.IGNORECASE), "immediately"),
    ]
    for pattern, replacement in rules:
        s = pattern.sub(replacement, s)

    # 3) Tidy whitespace around punctuation and collapse multiple spaces
    s = re.sub(r"\s+([.,!?;:])", r"\1", s)   # no space before punctuation
    s = re.sub(r"\s{2,}", " ", s).strip()    # collapse runs of spaces

    return s


# Example
print(smart_normalize("I'm not sure if pt's dose is high. Pls d/c stat!"))
# -> "I am not sure if patient's dose is high. please discontinue immediately!"
# Note: replacements are inserted in lowercase, so sentence-initial "Pls" becomes "please".

I am not sure if patient's dose is high. please discontinue immediately!
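
Versioning rulesets and logging diffs can start very small. The sketch below attaches a version string to a tuple of compiled rules and uses `difflib` to record which tokens a ruleset changed. `RuleSet`, `RULES_V1`, `apply_ruleset`, and `token_diff` are hypothetical names, and the two rules are a cut-down example rather than the full list above.

```python
import difflib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleSet:
    """A named, immutable collection of (compiled pattern, replacement) pairs."""
    version: str
    rules: tuple


RULES_V1 = RuleSet(
    version="2024.1",  # illustrative version label
    rules=(
        (re.compile(r"\bpt\b", re.IGNORECASE), "patient"),
        (re.compile(r"\bpls\b", re.IGNORECASE), "please"),
    ),
)


def apply_ruleset(text: str, ruleset: RuleSet) -> str:
    """Apply every rule in order and return the normalized text."""
    for pattern, replacement in ruleset.rules:
        text = pattern.sub(replacement, text)
    return text


def token_diff(before: str, after: str):
    """List (operation, old tokens, new tokens) for every non-equal span."""
    old, new = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old, b=new)
    return [
        (tag, " ".join(old[i1:i2]), " ".join(new[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


raw = "pls check pt vitals"
normalized = apply_ruleset(raw, RULES_V1)
print(RULES_V1.version, token_diff(raw, normalized))
# -> 2024.1 [('replace', 'pls', 'please'), ('replace', 'pt', 'patient')]
```

Storing the version next to each diff makes it possible to trace a downstream metric change back to a specific rule update.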


## 8. Handling Outlier/Niche Text: URLs, Dates, Numbers, Code Snippets 📎📅🔢💻
Prefer dedicated parsers for URLs, dates, and numbers; avoid brittle regex where standards exist.  
Fence code blocks and tokenize separately to prevent leakage into language features. 🔒
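
One way to keep code out of the language features is to stash it behind placeholders before tokenizing. The sketch below assumes inline code is delimited by backticks, which the example sentence in this section is not. `INLINE_CODE_RE` and `split_prose_and_code` are hypothetical names.

```python
import re

# Assumes inline code is wrapped in single backticks on one line
INLINE_CODE_RE = re.compile(r"`[^`\n]+`")


def split_prose_and_code(text: str):
    """Replace each inline code span with <CODE_n> and return (prose, snippets)."""
    snippets = []

    def _stash(match: re.Match) -> str:
        snippets.append(match.group(0).strip("`"))
        return f"<CODE_{len(snippets) - 1}>"

    prose = INLINE_CODE_RE.sub(_stash, text)
    return prose, snippets


prose, snippets = split_prose_and_code("Use `print('Hello, world!')` to debug.")
print(prose)
print(snippets)
```

The prose can then go through the normal NLP pipeline while the snippets are handled by a code-aware tokenizer or kept verbatim.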

***

In [11]:
import re

example = "Glucose: 98mg/dl @ 7:45am on 2023-07-15. Use print('Hello, world!') to debug."

# Compile patterns
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_RE = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\s?(?:am|pm)?\b", flags=re.IGNORECASE)

# Simple, non-greedy print(...) extraction (handles common cases without nesting)
PRINT_RE = re.compile(r"print\([^)]*?\)")

# Extract
find_dates = DATE_RE.findall(example)
find_times = TIME_RE.findall(example)
find_code  = PRINT_RE.findall(example)

print("Dates:", find_dates)
print("Times:", find_times)
print("Code:", find_code)

Dates: ['2023-07-15']
Times: ['7:45am']
Code: ["print('Hello, world!')"]
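
The regexes above only find candidates. `DATE_RE`, for instance, would accept `2023-02-30`. A hedged sketch of the parser-first approach is to let the regex propose spans and let standard-library parsers decide. `parse_iso_dates` and `url_parts` are hypothetical helper names.

```python
import re
from datetime import date
from urllib.parse import urlsplit

DATE_CANDIDATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def parse_iso_dates(text: str):
    """Return real calendar dates only; impossible dates are skipped."""
    found = []
    for candidate in DATE_CANDIDATE_RE.findall(text):
        try:
            found.append(date.fromisoformat(candidate))
        except ValueError:
            continue  # e.g. month 13 or February 30
    return found


def url_parts(url: str):
    """Split an http(s) URL into scheme, lowercased host, and path, or return None."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return None
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


print(parse_iso_dates("Seen on 2023-07-15 and 2023-02-30."))
print(url_parts("https://Domain.com/a?b=1"))
print(url_parts("mailto:john.doe@email.com"))
```

The same propose-then-validate split applies to numbers and units, where a locale-aware parser can reject strings that merely look numeric.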
