# Tokenization & Text Preprocessing (NLP) — Detailed Guide

## 1) Why this matters

Before any model can use text, you turn raw strings into **tokens** and (optionally) normalize them. Good preprocessing improves downstream accuracy, speed, and robustness; bad preprocessing throws away meaning (e.g., removing “not”).

---

## 2) Tokenization: ways to split text

### A) Sentence tokenization

Split document → sentences.

* Use cases: summarization, translation, sentence-level classification.
* Tools: `spaCy` (`Doc.sents`), `nltk.sent_tokenize`, `bltk`/IndicNLP for Indic langs.

### B) Word tokenization (rule-based)

Split sentences → words with language rules (punctuation, clitics, contractions).

* Pros: human-interpretable, fast
* Cons: OOV issues, vocab explodes with morphology
* Tools: `spaCy` (`Doc`), `nltk.word_tokenize`

### C) Subword tokenization (modern, default for transformers)

Break words into frequent pieces:

* **BPE** (GPT), **WordPiece** (BERT), **Unigram** (SentencePiece; T5/XLM-R), **byte-level BPE** (GPT-2/3, RoBERTa).
* Pros: tiny OOV rate; handles misspellings/morphology; small vocab
* Cons: less human-readable
* Tools: `tokenizers` (HF), `sentencepiece`, model-specific fast tokenizers

### D) Character/byte tokenization

Every char/byte = token.

* Pros: zero OOV, multilingual
* Cons: long sequences, slower training; context modeling harder
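
As a rough illustration of the trade-off above, the following sketch compares character-level and byte-level tokens for the same string. The sample text is an arbitrary choice, written with escape sequences so the code points are unambiguous.

```python
# "naïve 🙂" written with escapes: U+00EF (ï) and U+1F642 (🙂)
text = "na\u00efve \U0001F642"

# Character-level: one token per Unicode code point.
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte (values 0-255), so the base
# vocabulary never exceeds 256 symbols regardless of script.
byte_tokens = list(text.encode("utf-8"))

print(len(char_tokens))  # 7
print(len(byte_tokens))  # 11 (ï takes 2 bytes, the emoji takes 4)
```

The byte sequence is longer than the character sequence, which is the "long sequences" cost listed above.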

**Guideline:**

* Classical ML (TF-IDF): word tokens (possibly lemma).
* Transformers: subword tokenizer shipped with the model.
* Morphologically rich or noisy text: subword or char-level.

---

## 3) Text preprocessing: what to do (and when)

There’s no one-size-fits-all. Choose by task & model.

### Core steps (often safe)

1. **Unicode normalization** (`NFKC`) – make accents/width forms consistent.
2. **Whitespace normalization** – collapse multiple spaces; keep sentence breaks if needed.
3. **Lowercasing** – only if your model is uncased (e.g., `bert-base-uncased`). Keep case for NER/sentiment unless you are using an uncased model.

### Task-dependent steps

* **Punctuation**:

  * Keep for sentiment/emotion/NER; remove only for bag-of-words tasks where symbols add noise.
* **Stopwords**:

  * Remove for retrieval/TF-IDF; **keep** for transformers/sequence tasks (they carry syntax).
* **Numbers**:

  * Normalize to `<NUM>` for classical models to reduce sparsity; keep raw for transformers.
* **URLs, mentions, hashtags**:

  * Map to placeholders (`<URL>`, `<USER>`) and split hashtags (`#Hashtag` → `hashtag` + the word) for classical models; transformers usually handle the raw forms fine.
* **Emojis/emoticons**:

  * Convert to text (`🙂`→“smiley\_positive”) for sentiment; keep otherwise.
* **Contractions & negation**:

  * Expand (`don’t`→`do not`) to preserve negation for classical models; transformers OK either way—**never drop “not”**.
* **Spelling correction**:

  * Use carefully: it can distort named entities/usernames; helpful for ASR/OCR noise.
* **Stemming vs Lemmatization**:

  * Stemming (Porter/Snowball) = fast, rough.
  * Lemmatization (spaCy/WordNet) = slower, linguistically correct. Prefer lemmatization for classical features.
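
The placeholder mapping for URLs, mentions, and hashtags described above could be sketched with regular expressions as follows. The patterns and the helper name `add_placeholders` are illustrative assumptions, not a standard API; the mention pattern, for instance, would also match the domain part of an email address.

```python
import re

# Illustrative patterns; real social-media text usually needs more care.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def add_placeholders(text: str) -> str:
    """Replace URLs and mentions with placeholders and split hashtags."""
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    text = HASHTAG_RE.sub(r"hashtag \1", text)  # '#Word' -> 'hashtag Word'
    return text

print(add_placeholders("@sam loved it! #MustWatch https://ex.com/r?id=1"))
# <USER> loved it! hashtag MustWatch <URL>
```

Note that some word tokenizers split `<URL>` into `<`, `URL`, `>`; if that matters, a bracket-free placeholder may be safer.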

### Language/domain specifics

* Indic/Japanese/Chinese require specialized tokenizers (e.g., MeCab, jieba, IndicNLP).
* Code-mixed text (Hindi+English): use multilingual models or custom rules.

---

## 4) Practical pipelines (code)

### A) Classic ML pipeline (word tokens + lemmatize)

```python
import re, unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Newer NLTK releases may also ask for the 'punkt_tab' resource.
for pkg in ('punkt', 'wordnet', 'omw-1.4', 'stopwords'):
    nltk.download(pkg, quiet=True)

# word_tokenize splits "don't" into "do" + "n't", so "n't" must count as negation
NEGATIONS = {"not", "no", "n't"}
STOP = set(stopwords.words('english')) - NEGATIONS
LEM = WordNetLemmatizer()

def normalize(text: str) -> str:
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def preprocess_for_bow(text: str):
    text = normalize(text).lower()
    # Bracket-free placeholders: word_tokenize would split '<URL>' into '<', 'URL', '>'
    text = re.sub(r'https?://\S+|www\.\S+', ' urltoken ', text)
    text = re.sub(r'@\w+', ' usertoken ', text)
    tokens = nltk.word_tokenize(text)
    kept = []
    for t in tokens:
        if t in NEGATIONS:  # preserve negation, mapping "n't" to "not"
            kept.append("not" if t == "n't" else t)
        elif t.isalpha() and t not in STOP:  # drops punctuation, emojis, stopwords
            kept.append(LEM.lemmatize(t))
    return kept

print(preprocess_for_bow("I don't like this movie 😞. Visit https://ex.com"))
# ['not', 'like', 'movie', 'visit', 'urltoken']
```

### B) Transformers (use the model’s tokenizer; minimal cleaning)

```python
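from transformers import AutoTokenizer  # return_tensors="pt" below also requires PyTorch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "I don't like this movie 😞. Visit https://ex.com"
# Truncate/pad to a fixed length of 32 tokens and return PyTorch tensors
enc = tok(text, truncation=True, padding="max_length", max_length=32, return_tensors="pt")
# Inspect the first 15 subword tokens of the single encoded example
print(tok.convert_ids_to_tokens(enc["input_ids"][0][:15].tolist()))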
```

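**Tip:** Avoid aggressive cleaning; the tokenizer splits URLs into subwords and applies lowercasing itself for uncased models. Characters missing from the vocabulary (many emojis, in BERT's WordPiece vocabulary) may come out as `[UNK]`, so inspect the tokens if emojis matter for your task.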

### C) spaCy sentence + word tokens + lemmas

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs. I don't like this movie.")
sents = [sent.text for sent in doc.sents]
words = [(t.text, t.lemma_, t.pos_) for t in doc]
```

---

## 5) Evaluating preprocessing choices

* **Sentiment**: keep emojis, punctuation like “!” and negations; avoid stopword removal.
* **NER**: keep casing, punctuation; no stemming; minimal cleaning.
* **IR/keyword search**: lowercase, remove stopwords, lemmatize; maybe normalize numbers/URLs.
* **Topic modeling**: lowercase, remove stopwords, lemmatize; keep nouns/adjectives.
* **ASR/OCR noisy text**: consider light spell-correction and aggressive normalization.

Always **A/B test**: train with/without a step and compare metrics (accuracy/F1/MAP).

---

## 6) Subword tokenizers in practice (quick notes)

* **BPE**: merges frequent pairs; used by GPT/RoBERTa (often byte-level).
* **WordPiece**: chooses pieces maximizing likelihood; used by BERT.
* **Unigram (SentencePiece)**: probabilistic; used by XLM-R/T5.
* **Byte-level BPE**: works on raw bytes → robust to any script/emoji.

**Implication:** Pre-tokenization is minimal (whitespace + punctuation split); **do not** alter input heavily.
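
To make "merges frequent pairs" concrete, here is a toy sketch of BPE merge learning on a four-word corpus. It is a simplification under stated assumptions: no end-of-word marker, no byte-level handling, and ties broken by first occurrence. The function name `learn_bpe` and the word counts are made up for illustration; production tokenizers (e.g. the HF `tokenizers` library) work differently in detail.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first wins ties)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the frequent suffix `est` has become a single symbol, which is how subword vocabularies pick up common morphemes.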

---

## 7) Common pitfalls & fixes

* **Over-cleaning** (removing “not”, emojis, punctuation) → hurts sentiment/intent.
* **Stopword removal for transformers** → unnecessary and often harmful, since function words carry syntax.
* **Mixed casing with uncased models** → fine; with cased models, don’t lowercase.
* **Language mismatch** → use correct language model/tokenizer.
* **Leaking labels in preprocessing** → never condition cleaning on labels.

---

## 8) Quick checklists

**Classical ML (TF-IDF)**

* [ ] Unicode normalize, lowercase
* [ ] Replace URLs/users/numbers with placeholders
* [ ] Tokenize words → lemmatize
* [ ] Remove stopwords (but keep negation)
* [ ] Optional: bigrams/trigrams
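
The optional n-gram step could look like the minimal sketch below; `ngrams` is a hypothetical helper written for illustration, and the token list is an assumed output of the earlier cleaning steps.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "like", "movie"]
print(ngrams(tokens, 2))  # ['not like', 'like movie']
print(ngrams(tokens, 3))  # ['not like movie']
```

Bigrams such as `not like` let a bag-of-words model see negation that single tokens lose.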

**Transformers**

* [ ] Minimal cleaning (Unicode/whitespace)
* [ ] Use the exact pretrained tokenizer
* [ ] Truncate/pad to max length
* [ ] Keep emojis, punctuation, URLs
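
For intuition about "truncate/pad to max length", the sketch below applies both to a plain list of token ids and builds the matching attention mask. The ids and the helper `pad_or_truncate` are hypothetical; in practice the pretrained tokenizer does this for you, and it also keeps special tokens intact when truncating, which this simplification does not.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Cut `ids` to `max_length`, then right-pad; also return the attention mask."""
    ids = ids[:max_length]
    mask = [1] * len(ids)              # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding

print(pad_or_truncate([101, 7592, 2088, 102], max_length=6))
# ([101, 7592, 2088, 102, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate([101, 7592, 2088, 102], max_length=3))
# ([101, 7592, 2088], [1, 1, 1])
```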

---



# Basic text cleaning

**Basic text cleaning** is one of the most important preprocessing steps in NLP, applied before tokenization, vectorization, or modeling.

---

### **Why Is Text Cleaning Needed?**

Raw text data often comes with noise:

* Extra spaces, punctuation, or numbers.
* Case inconsistencies (e.g., "Hello" vs "hello").
* Stopwords like *the, is, at* that may not add much meaning.
* Special symbols, emojis, or HTML tags.

If not cleaned, models may interpret these inconsistencies as different tokens, which hurts performance.

---

### **Steps in Basic Text Cleaning**

1. **Lowercasing**

   * Convert all text into lowercase for consistency.
   * Example: `"Hello NLP World!" → "hello nlp world!"`.

   ```python
   text = "Hello NLP World!"
   text = text.lower()
   print(text)  # hello nlp world!
   ```

---

2. **Removing Punctuation**

   * Punctuation often adds little for bag-of-words tasks, but it carries signal for sentiment, NER, and transformer inputs, so remove it only when the task allows.
   * Example: `"Hello, world!" → "Hello world"`

   ```python
   import string
   text = "Hello, NLP world!"
   cleaned = text.translate(str.maketrans('', '', string.punctuation))
   print(cleaned)  # Hello NLP world
   ```

---

3. **Removing Extra Whitespace**

   * Multiple spaces, tabs, or newlines are normalized to a single space.

   ```python
   text = "This   is   NLP\n"
   cleaned = " ".join(text.split())
   print(cleaned)  # This is NLP
   ```

---

4. **Removing Numbers (if not useful)**

   * Sometimes digits don’t matter (e.g., product reviews).
   * Example: `"I bought 2 phones"` → `"I bought phones"`

   ```python
   import re
   text = "I bought 2 phones for 500 dollars"
   cleaned = re.sub(r'\d+', '', text)
   print(cleaned)  # I bought  phones for  dollars
   ```

---

5. **Removing Stopwords**

   * Stopwords = very frequent words like *is, the, and*.
   * Useful to reduce noise for bag-of-words or TF-IDF.
   * Example: `"This is an example"` → `"example"`

   ```python
   import nltk
   from nltk.corpus import stopwords
   nltk.download('stopwords')

   stop_words = set(stopwords.words('english'))
   text = "This is an example of text cleaning"
   words = [w for w in text.split() if w.lower() not in stop_words]
   print(words)  # ['example', 'text', 'cleaning']
   ```

---

6. **Removing Special Characters / HTML tags**

   * Example: `"Hello @user! <br> NLP is #awesome"` → `"Hello NLP is awesome"`.

   ```python
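   import re
   text = "Hello @user! <br> NLP is #awesome"
   cleaned = re.sub(r'<.*?>', ' ', text)          # remove HTML tags
   cleaned = re.sub(r'@\w+', ' ', cleaned)        # remove @mentions
   cleaned = re.sub(r'[^a-zA-Z\s]', '', cleaned)  # keep only letters and whitespace
   cleaned = " ".join(cleaned.split())            # collapse leftover spaces
   print(cleaned)  # Hello NLP is awesome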
   ```

---

7. **Stemming or Lemmatization (Optional in Cleaning)**

   * Stemming: Reduces words to root (e.g., "playing" → "play").
   * Lemmatization: Converts to dictionary form (e.g., "better" → "good").
   * Helps reduce vocabulary size.

---

### **Example Putting It Together**

```python
import re
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def clean_text(text):
    # lowercase
    text = text.lower()
    # remove html tags
    text = re.sub(r'<.*?>', '', text)
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # remove numbers
    text = re.sub(r'\d+', '', text)
    # remove extra whitespace
    text = " ".join(text.split())
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

sample = "Hello!!! I bought 2 Phones <br> and it is Amazing!!!"
print(clean_text(sample))
```

**Output:**

```
hello bought phones amazing
```

---

📌 So, **basic text cleaning in NLP = lowercasing + removing punctuation, numbers, stopwords, special symbols, and extra whitespace**.
After this step, tokenization and feature extraction become much more reliable.



# Tokenization



## 1. What is Tokenization?

* **Definition**: Tokenization is the process of breaking down a large piece of text (a sentence, paragraph, or document) into smaller units called **tokens**.
* **Tokens** can be:

  * Words → e.g., `"I love NLP"` → `[I, love, NLP]`
  * Subwords → e.g., `"unhappiness"` → `[un, happiness]`
  * Characters → `"cat"` → `[c, a, t]`
  * Sentences → `"I love NLP. It is fun."` → `[I love NLP, It is fun]`

The main goal is to convert **raw text** into manageable chunks for further processing.

---

## 2. Why is Tokenization Important in NLP?

* Computers cannot directly understand raw text; they need **structured input**.
* Tokenization is the **first step** before:

  * Building vocabulary
  * Converting words into numerical representations (word embeddings, one-hot encoding)
  * Training models for tasks like classification, translation, sentiment analysis, etc.
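
The first two steps listed above (building a vocabulary and turning tokens into numbers) can be sketched in plain Python. The token list is an arbitrary example and `one_hot` is a helper defined here for illustration.

```python
tokens = ["i", "love", "nlp", "i", "love", "python"]

# Build the vocabulary: each distinct token gets an id in order of first appearance.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Map the token sequence to integer ids.
ids = [vocab[tok] for tok in tokens]

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

print(vocab)  # {'i': 0, 'love': 1, 'nlp': 2, 'python': 3}
print(ids)    # [0, 1, 2, 0, 1, 3]
print(one_hot(vocab["nlp"], len(vocab)))  # [0, 0, 1, 0]
```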

---

## 3. Types of Tokenization

1. **Word Tokenization**

   * Splits text into words.
   * Example: `"Natural Language Processing is cool!"`
     → `["Natural", "Language", "Processing", "is", "cool", "!"]`
   * Tools: `nltk.word_tokenize()`, `spaCy`.

2. **Sentence Tokenization**

   * Splits text into sentences.
   * Example: `"I love NLP. It's amazing!"`
     → `["I love NLP.", "It's amazing!"]`
   * Tools: `nltk.sent_tokenize()`.

3. **Character Tokenization**

   * Breaks text into individual characters.
   * Example: `"NLP"` → `["N", "L", "P"]`
   * Used in speech recognition, language modeling.

4. **Subword Tokenization (Modern NLP models)**

   * Breaks words into **sub-parts**.
   * Example: `"unhappiness"` → `[un, happiness]`
   * Handles **out-of-vocabulary (OOV)** words better.
   * Algorithms:

     * **Byte Pair Encoding (BPE)** → Used in GPT
     * **WordPiece** → Used in BERT
     * **Unigram LM** (via the SentencePiece library) → Used in T5, XLNet
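
As a rough illustration of how a trained subword vocabulary is applied, the sketch below implements greedy longest-match-first splitting in the style of WordPiece, where `##` marks a piece that continues a word. The toy vocabulary and the function name `wordpiece_tokenize` are assumptions made for this example; a real BERT vocabulary has roughly 30,000 entries and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "happy", "##happi", "##ness", "##s"}
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```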

---

## 4. Tokenization Challenges

* **Ambiguity**: `"Let's eat, grandma"` vs `"Let's eat grandma"`
* **Punctuation**: Should `"isn't"` → `["is", "n't"]` or `["isn't"]`?
* **Languages**:

  * English: spaces make tokenization easier.
  * Chinese, Japanese: words are not separated by spaces, so tokenization is harder.
* **Compound words**: `"New York"` should stay together.
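
A small sketch of why these cases are hard: plain whitespace splitting glues punctuation onto words, while a hand-written regular expression can separate it and keep contractions and decimals whole. The pattern `TOKEN_RE` is an illustrative assumption, not what NLTK or spaCy use, and it does not solve the multiword problem (`New York` still becomes two tokens).

```python
import re

text = "Isn't New York great? It's $5.50!"

# Naive approach: split on whitespace only.
print(text.split())
# ["Isn't", 'New', 'York', 'great?', "It's", '$5.50!']

# Regex approach: numbers (with optional decimals), words (with an optional
# apostrophe part), or any single non-space punctuation character.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")
print(TOKEN_RE.findall(text))
# ["Isn't", 'New', 'York', 'great', '?', "It's", '$', '5.50', '!']
```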

---

## 5. Python Examples

### Word Tokenization (NLTK)

```python
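import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt data; newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

text = "I love learning NLP with Python!"
tokens = word_tokenize(text)
print(tokens)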
```

Output:

```
['I', 'love', 'learning', 'NLP', 'with', 'Python', '!']
```

### Sentence Tokenization

```python
from nltk.tokenize import sent_tokenize

text = "I love NLP. It's fun and powerful."
sentences = sent_tokenize(text)
print(sentences)
```

Output:

```
['I love NLP.', "It's fun and powerful."]
```

### Tokenization with spaCy

```python
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("FastAPI makes building APIs easy!")
tokens = [token.text for token in doc]
print(tokens)
```

Output:

```
['FastAPI', 'makes', 'building', 'APIs', 'easy', '!']
```

---

## 6. Summary

* Tokenization = breaking text into units (**words, sentences, characters, or subwords**).
* It is the **foundation of NLP preprocessing**.
* Modern NLP (transformers like BERT, GPT) prefers **subword tokenization**.



# Normalization

Normalization in NLP refers to transforming text into a consistent, standardized form so that different variations of words or text can be treated as the same. This is important because natural language is messy, and models or algorithms need uniform inputs to perform well.

Here are the main techniques of normalization in NLP:

---

### 1. **Case Normalization (Lowercasing/Uppercasing)**

* Converts all text into the same case (usually lowercase).
* Example:

  * "Apple", "APPLE", "apple" → "apple"

Why? It avoids treating words as different just because of case.

---

### 2. **Removing Punctuation and Special Characters**

* Cleans symbols that are not meaningful for analysis (unless they are required, like in sentiment analysis with "!" or emojis).
* Example:

  * "Hello!!! How are you???" → "Hello How are you"

---

### 3. **Stopword Removal**

* Stopwords are common words that carry little meaning in tasks like classification.
  Examples: *the, is, in, and, of, to, a*
* Example:

  * "This is a good book" → "good book"

---

### 4. **Stemming**

* Reduces a word to its **root form** by chopping suffixes.
* It is rule-based and may produce non-dictionary words.
* Example:

  * "playing", "played", "plays" → "play"
  * "studies" → "studi"
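
To show the "chop suffixes by rule" idea, here is a deliberately crude suffix stripper. It is not the Porter or Snowball algorithm (those apply many ordered rules with extra conditions); the suffix list and the minimum stem length of 3 are arbitrary assumptions for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def crude_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["playing", "played", "plays", "studies", "sing"]])
# ['play', 'play', 'play', 'studi', 'sing']
```

As with real stemmers, the result can be a non-dictionary string such as `studi`.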

---

### 5. **Lemmatization**

* Similar to stemming, but it uses a **dictionary (lexicon)** to return valid root words (lemmas).
* It considers the **part of speech (POS)** to provide context-aware roots.
* Example:

  * "better" → "good"
  * "studies" → "study"

---

### 6. **Handling Numbers**

* Depending on the task, numbers can be removed, normalized, or replaced with a token.
* Example:

  * "I have 2 apples" → "I have NUM apples"
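
Replacing numbers with a token might be sketched as follows. The pattern treats digit groups joined by `.` or `,` as one number, which is an assumption that suits English-style formatting; `mask_numbers` is a hypothetical helper name.

```python
import re

# One or more digits, optionally followed by ',ddd' or '.ddd' groups.
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(text: str, placeholder: str = "NUM") -> str:
    """Replace every number in `text` with `placeholder`."""
    return NUM_RE.sub(placeholder, text)

print(mask_numbers("I have 2 apples and paid 1,250.50 rupees"))
# I have NUM apples and paid NUM rupees
```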

---

### 7. **Unicode Normalization**

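* Converts characters into a consistent Unicode form (`NFC`, `NFKC`, etc.) so that visually identical strings compare equal. It does not remove accents; that is a separate step (see "Handling Accents and Diacritics").
* Example:

  * "café" typed as `e` + combining accent (U+0065 U+0301) → "café" with the single code point U+00E9 (`NFC`)
  * Full-width "ＡＢＣ" → "ABC" (`NFKC`)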
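
A minimal sketch of Unicode normalization with Python's standard `unicodedata` module. The strings use escape sequences so the exact code points are visible; the sample words are arbitrary.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render identically but are not equal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# NFKC additionally folds compatibility forms: full-width letters, ligatures.
print(unicodedata.normalize("NFKC", "\uff21\uff22\uff23 \ufb01t"))  # ABC fit
```

Normalization makes equivalent spellings identical; the accent itself is still there afterwards.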

---

### 8. **Expanding Contractions**

* Converts contractions into their full form.
* Example:

  * "don't" → "do not"
  * "I'm" → "I am"
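
Contraction expansion is often done with a lookup table, as in this sketch. The table is tiny and hypothetical, the output is lowercased for matched words, and ambiguous forms are resolved naively (`it's` could also mean "it has"), so treat it as a starting point only.

```python
import re

# Small illustrative table; real lists contain a hundred or more entries.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Expand known contractions; curly apostrophes are treated as straight."""
    text = text.replace("\u2019", "'")
    return PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure it's fine, but I don\u2019t like it"))
# i am sure it is fine, but I do not like it
```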

---

### 9. **Handling Accents and Diacritics**

* Removes or standardizes accents for consistency.
* Example:

  * "résumé" → "resume"
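
One common way to remove diacritics, sketched with the standard `unicodedata` module: decompose each character (`NFD`), then drop the combining marks. `strip_accents` is a helper name chosen for this example. Be aware that this changes meaning in languages where accents distinguish words.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("r\u00e9sum\u00e9 na\u00efve"))  # resume naive
```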

---

### Summary

Normalization ensures that text variations like case, tenses, or suffixes don’t confuse NLP models.
It often involves:

* Lowercasing
* Removing stopwords
* Stemming/Lemmatization
* Cleaning punctuation, numbers, and special symbols


