# Text Preprocessing Notebook

This notebook covers various text preprocessing techniques used in Natural Language Processing (NLP).
Each topic includes its description/theory followed by the implementation code.

---

# PART 1: TEXT PREPROCESSING TECHNIQUES - THEORY & CONCEPTS

This section explains each preprocessing technique with descriptions, advantages, and disadvantages.

---

## 1. HTML Tag Removal

**What it does:** Removes HTML markup from text scraped from web pages.

**When needed:** Processing text from websites, web scraping results, emails with HTML formatting.

**How:** Uses regex pattern `<.*?>` to match and remove all HTML tags.

**Advantages:**
- Simple regex pattern
- Fast to execute
- Handles standard opening, closing, and self-closing tags

**Disadvantages:**
- Leaves the content between tags untouched, including unwanted `<script>` and `<style>` bodies
- Doesn't decode HTML entities (e.g., `&nbsp;` remains)
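A minimal sketch of the regex approach, using only the standard library. The `re.DOTALL` flag (so a tag may span lines) and the `html.unescape` step (which addresses the entity limitation above) are additions assumed here, not part of the basic pattern; the function name is illustrative.

```python
import html
import re

# Non-greedy match of everything between "<" and ">".
# DOTALL is an assumption added here so tags that span lines are also matched.
TAG_RE = re.compile(r"<.*?>", flags=re.DOTALL)

def remove_html_tags(text: str) -> str:
    """Strip HTML tags, then decode entities such as &amp; and &nbsp;."""
    without_tags = TAG_RE.sub("", text)
    return html.unescape(without_tags)

sample = "<p>Fish &amp; chips are <b>great</b>!</p>"
print(remove_html_tags(sample))
```

For messy real-world pages, a proper HTML parser is usually a safer choice than a regex.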

---

## 2. URL Removal

**What it does:** Removes web addresses and URLs from text.

**When needed:** Processing social media data, web-scraped content, user-generated content with links.

**How:** Uses regex patterns to match common URL formats (http, https, www).

**Advantages:**
- Removes noisy web references
- Improves focus on actual content

**Disadvantages:**
- URLs can carry useful information in some contexts (e.g., the linked domain)
- Patterns can miss URL variants (e.g., bare domains such as `example.com`)
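A minimal sketch of URL removal with one regex. The exact pattern and the whitespace cleanup are assumptions chosen for illustration; the pattern only catches strings starting with `http://`, `https://`, or `www.`.

```python
import re

# Match "http://", "https://", or "www." followed by any run of non-space characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def remove_urls(text: str) -> str:
    """Replace URLs with a space, then collapse the leftover whitespace."""
    return " ".join(URL_RE.sub(" ", text).split())

sample = "Docs at https://example.com/a?b=1 and www.example.org today"
print(remove_urls(sample))
```

Because `\S+` is greedy, punctuation glued to the end of a URL (such as a closing period) is removed along with it.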

---

## 3. Punctuation Removal

**What it does:** Removes punctuation marks (such as `.`, `,`, `!`, `?`, `;`, `:`, `'`, `"`) from text.

**When needed:** Text classification, sentiment analysis, when punctuation doesn't add meaning.

**How:** Two methods: a character-by-character loop (slow) or a translation table via `str.translate` (fast).

**Advantages:**
- Reduces feature space
- Faster model training
- Works well with bag-of-words models

**Disadvantages:**
- May lose sentiment/emphasis information
- Punctuation matters for some NLP tasks (e.g., POS tagging)
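The two methods can be sketched side by side. Both use `string.punctuation`, which covers ASCII punctuation only; the function names are illustrative.

```python
import string

def remove_punct_loop(text: str) -> str:
    """Loop-based: one full pass over the text per punctuation character."""
    for ch in string.punctuation:
        text = text.replace(ch, "")
    return text

# Translation table built once: maps every punctuation character to "delete".
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_table(text: str) -> str:
    """Table-based: a single pass over the text."""
    return text.translate(PUNCT_TABLE)

sample = "Hello, world! It's 5:30..."
print(remove_punct_loop(sample))
print(remove_punct_table(sample))
```

Both functions return the same string; the table version avoids rescanning the text once per punctuation character.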

---

## 4. Chat Word/Slang Treatment

**What it does:** Converts internet/text slang abbreviations to their full forms.

**When needed:** Processing social media, SMS, chat logs, informal user-generated content.

**How:** Uses a dictionary mapping slang to expanded forms (LOL → Laughing Out Loud).

**Advantages:**
- Improves text clarity
- Standardizes informal language

**Disadvantages:**
- Requires comprehensive slang dictionary
- Misses new/emerging slang
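A minimal sketch of dictionary-based expansion. The three-entry `CHAT_WORDS` table is a hypothetical stand-in for a full slang dictionary.

```python
# Hypothetical, tiny slang dictionary; a real one would have many more entries.
CHAT_WORDS = {
    "LOL": "Laughing Out Loud",
    "BRB": "Be Right Back",
    "IMO": "In My Opinion",
}

def expand_chat_words(text: str) -> str:
    """Replace each whitespace-separated token found in the dictionary."""
    return " ".join(CHAT_WORDS.get(word.upper(), word) for word in text.split())

sample = "lol that was great IMO"
print(expand_chat_words(sample))
```

Tokens with attached punctuation (for example `LOL!`) are not matched by this lookup, so removing punctuation or tokenizing first helps.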

---

## 5. Spelling Correction

**What it does:** Identifies and corrects misspelled words and typos.

**When needed:** User-generated content, OCR output, noisy text, typo-prone sources.

**How:** Uses a word-frequency dictionary and edit distance to find the closest correctly spelled word.

**Advantages:**
- Improves text quality/consistency
- Helps with vocabulary standardization

**Disadvantages:**
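- May "correct" slang, names, or intentional misspellings
- Computationally expensive
- Context-blind: each word is checked in isolation, so a valid but wrong word (e.g., `their` for `there`) goes unnoticed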
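The frequency-plus-edit-distance idea can be sketched with a toy corrector that considers only candidates one edit away. The `WORD_COUNTS` table is invented for illustration; a real corrector would load counts from a large corpus.

```python
from collections import Counter

# Hypothetical word frequencies; stand-in for counts from a large corpus.
WORD_COUNTS = Counter({"the": 50, "word": 12, "spelling": 10, "text": 9, "correct": 8})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word: str) -> set:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word: str) -> str:
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print([correct(w) for w in ["speling", "teh", "wrd", "xyzzy"]])
```

Generating every single-edit variant of every unknown word is what makes this approach expensive on long texts.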

---

## 6. Stop Word Removal

**What it does:** Removes common words that contribute little semantic meaning.

**When needed:** Text classification, information retrieval, feature reduction.

**Common stop words:** the, is, a, an, and, or, but, in, at, to, for, etc.

**Advantages:**
- Reduces feature dimensionality
- Improves model focus on meaningful words
- Faster training and inference

**Disadvantages:**
- May lose important context (e.g., removing `not` flips the meaning of `not good`)
- Different languages need different lists
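A minimal sketch of stop word filtering. The `STOP_WORDS` set below contains only the example words listed above; library-provided lists are much longer.

```python
# Small illustrative stop word set taken from the examples above.
STOP_WORDS = {"the", "is", "a", "an", "and", "or", "but", "in", "at", "to", "for"}

def remove_stop_words(text: str) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

sample = "The cat is in the garden and it looks happy"
print(remove_stop_words(sample))
```

Using a set keeps each membership check constant-time, which matters on large corpora.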

---

## 7. Emoji Handling

**What it does:** Either removes emojis or converts them to textual descriptions.

**When needed:** Processing social media, chat data with emojis.

**Two approaches:**
1. **Emoji Removal:** Completely eliminates emojis (noise removal)
2. **Demojization:** Converts emojis to text (❤️ → `:red_heart:`)

**Advantages:**
- Removes visual noise or preserves emoji context
- Makes emojis interpretable by ML models

**Disadvantages:**
- Removal loses sentiment info; demojization creates unusual tokens
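Both approaches can be sketched with the standard library. The Unicode ranges in `EMOJI_RE` are an approximation and do not cover every emoji, and the `EMOJI_NAMES` table is a hand-written, hypothetical stand-in for the mapping a dedicated emoji library would provide.

```python
import re

# Approximate emoji ranges (an assumption; not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats (includes U+2764)
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

# Hypothetical two-entry name table for illustration only.
EMOJI_NAMES = {
    "\u2764\uFE0F": ":red_heart:",
    "\U0001F600": ":grinning_face:",
}

def remove_emojis(text: str) -> str:
    """Delete every character that falls in the emoji ranges."""
    return EMOJI_RE.sub("", text)

def demojize(text: str) -> str:
    """Replace each known emoji with its textual name."""
    for emoji_char, name in EMOJI_NAMES.items():
        text = text.replace(emoji_char, name)
    return text

sample = "Great job \U0001F600 \u2764\uFE0F"
print(remove_emojis(sample).strip())
print(demojize(sample))
```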

---

## 8. Tokenization

**What it does:** Splits text into smaller meaningful units (tokens: words, subwords, characters).

**When needed:** Nearly all NLP tasks; it is a fundamental preprocessing step.

**Four methods:**
1. **Simple split()** - Fast but keeps punctuation attached
2. **Regex** - Complex patterns for edge cases
3. **NLTK** - Linguistically aware, handles contractions
4. **spaCy** - Industrial-strength, maintains offsets

**Advantages:**
- Prepares text for further processing
- Different methods for different use cases

**Disadvantages:**
- Different tokenizers produce different results
- Affects downstream performance
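The first two methods can be compared with the standard library alone. The regex below is one possible pattern, assumed for illustration; NLTK and spaCy are left out of this sketch because they require extra installs and model data.

```python
import re

def whitespace_tokenize(text: str) -> list:
    """Method 1: split on whitespace; punctuation stays attached to words."""
    return text.split()

# Method 2: decimal numbers, words with an optional apostrophe part,
# or any single non-space, non-word character.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text: str) -> list:
    """Split punctuation into separate tokens while keeping contractions whole."""
    return TOKEN_RE.findall(text)

sample = "Don't panic, it's only $5.99!"
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```

The two outputs differ on the same sentence (`panic,` versus `panic` and `,`), which illustrates why tokenizer choice affects later steps.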

---

## 9. Stemming

**What it does:** Reduces words to their root form (stem) by removing suffixes.

**When needed:** Information retrieval, grouping related words.

**How:** Uses rule-based algorithms (e.g., the Porter stemmer) that strip common suffixes.

**Examples:**
- walking, walks, walked → walk
- running, runs → run

**Advantages:**
- Fast computation
- Reduces vocabulary size
- Groups related words

**Disadvantages:**
- Produces non-words (happy → happi)
- Cannot distinguish different parts of speech
- Over-stemming can merge different words (e.g., `universe` and `university` → `univers`)
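To show the flavor of rule-based suffix stripping, here is a toy stemmer. It is not the Porter algorithm, only a few hand-picked rules assumed for illustration that happen to reproduce the examples above.

```python
VOWELS = "aeiou"
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def simple_stem(word: str) -> str:
    """Toy rule-based stemmer: a rough imitation, not the Porter algorithm."""
    word = word.lower()
    # Final "y" becomes "i" when the rest of the word contains a vowel (happy -> happi).
    if word.endswith("y") and any(c in VOWELS for c in word[:-1]):
        return word[:-1] + "i"
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant left by the suffix (runn -> run).
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

words = ["walking", "walks", "walked", "running", "runs", "happy"]
print([simple_stem(w) for w in words])
```

The rules never consult a dictionary, which is why the output can be a non-word such as `happi`.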

---

## 10. Lemmatization

**What it does:** Reduces words to their dictionary form (lemma) using linguistic knowledge.

**When needed:** When you need valid English words or when semantic understanding is important.

**How:** Uses morphological analysis and vocabularies (spaCy, NLTK).

**Examples:**
- walking, walks, walked → walk
- am, are, is → be (context-aware!)

**Advantages:**
- Produces valid English words
- Better preserves semantic meaning
- Context-aware when given part-of-speech information

**Disadvantages:**
- Slower than stemming
- Needs good vocabulary/morphology database
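A minimal sketch of lookup-based lemmatization keyed on part of speech. The `LEMMAS` table and the POS labels are hypothetical; they stand in for the vocabulary and morphology data that spaCy or NLTK would supply.

```python
# Hypothetical (word, part-of-speech) -> lemma table for illustration only.
LEMMAS = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("walking", "VERB"): "walk",
    ("walks", "VERB"): "walk",
    ("walked", "VERB"): "walk",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word and its part of speech; fall back to the word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("is", "VERB"))
print(lemmatize("meeting", "VERB"))
print(lemmatize("meeting", "NOUN"))
```

The same surface form maps to different lemmas depending on its part of speech, which is the sense in which lemmatization uses context.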

---

# PART 2: IMPLEMENTATION - CODE EXAMPLES

This section contains runnable code for each preprocessing technique.

---
