# ✅ What is **Normalization** in NLP?

**Normalization** means **cleaning and standardizing text** to make it uniform before processing.

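It is essential because raw text is often messy: the same word can appear with different casing, attached punctuation, or irregular spacing.

Normalization ensures that these surface variants of the same word are treated as the same token.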

---

# ✅ Components of Normalization

### 1️⃣ **Lowercasing**

* Convert all text to lowercase.
* Example: `"The Cat Is Cute"` → `"the cat is cute"`
* Why: Makes `"The"`, `"the"`, `"THE"` equivalent.
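
A minimal sketch of lowercasing with Python's built-in `str.lower()`. The sample list `variants` is made up for illustration.

```python
# Lowercasing maps different casings of a word to a single form.
variants = ["The", "the", "THE"]
lowered = [w.lower() for w in variants]

print(lowered)            # ['the', 'the', 'the']
print(len(set(lowered)))  # 1, because all three variants collapse to one form
```

For text that is not only English, `str.casefold()` is a more aggressive alternative to `str.lower()` and may be worth considering.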

---

### 2️⃣ **Punctuation Cleaning**

* Remove or handle punctuation marks: `. , ! ? @ # $ %`
* Example: `"Hello, world!"` → `"Hello world"`
* Why: Punctuation usually adds little meaning for NLP tasks such as text classification and keyword extraction.
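
One possible way to strip punctuation without regular expressions is `str.translate` with the standard library's `string.punctuation`. This is a sketch, assuming ASCII punctuation is all you need to remove; typographic quotes and other Unicode symbols are not covered by `string.punctuation`.

```python
import string

text = "Hello, world!"

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
cleaned = text.translate(table)

print(cleaned)  # Hello world
```

Note that this also removes apostrophes, so `"don't"` becomes `"dont"`. Whether that is acceptable depends on the task.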

---

### 3️⃣ **Number Handling (Optional)**

* Remove numbers if irrelevant: `"I have 2 cats"` → `"I have cats"`
* Sometimes numbers are kept if they carry meaning (like dates or quantities).
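
The two choices above can be sketched with `re.sub`. The `[NUM]` placeholder is a hypothetical convention chosen for this example, not a standard; it is one way to record that a number was present without keeping its value.

```python
import re

text = "I have 2 cats"

# Option A: drop digit runs, then tidy the double space left behind.
removed = re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()

# Option B: replace each digit run with a placeholder token (hypothetical name).
masked = re.sub(r"\d+", "[NUM]", text)

print(removed)  # I have cats
print(masked)   # I have [NUM] cats
```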

---

### 4️⃣ **Extra Whitespace Removal**

* Replace multiple spaces with a single space
* Remove leading/trailing spaces
* Example: `"I   love  NLP "` → `"I love NLP"`
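
A short sketch of why this step matters: splitting on a single space produces empty tokens wherever spaces repeat, while collapsing whitespace first avoids them. The variable names are illustrative.

```python
import re

text = "I   love  NLP "

# Splitting on a single space leaves empty strings where spaces repeat.
naive_tokens = text.split(" ")

# Collapse any run of whitespace into one space, then trim both ends.
normalized = re.sub(r"\s+", " ", text).strip()

print(naive_tokens)           # ['I', '', '', 'love', '', 'NLP', '']
print(normalized)             # I love NLP
print(normalized.split(" "))  # ['I', 'love', 'NLP']
```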

---

### When to do Normalization

* **Early in preprocessing**, usually **before tokenization**, lemmatization, and stopword removal.
* Order recommendation:

```
Raw text
   ↓
Normalization (lowercase + clean punctuation + numbers + spaces)
   ↓
Tokenization
   ↓
Stopword removal
   ↓
Stemming / Lemmatization
   ↓
Vectorization / embeddings
```
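
The first three stages of this pipeline can be sketched with the standard library alone. The helper names `normalize` and `preprocess`, the `keep_numbers` flag, and the tiny `STOPWORDS` set are all assumptions made for illustration; stemming, lemmatization, and vectorization are left out.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"i", "a", "the", "is", "and", "have"}

def normalize(text, keep_numbers=True):
    """Lowercase, strip punctuation, optionally drop digits, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # drop punctuation and symbols
    if not keep_numbers:
        text = re.sub(r"\d+", "", text)        # drop digit runs
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def preprocess(text, keep_numbers=True):
    """Normalization -> tokenization -> stopword removal."""
    tokens = normalize(text, keep_numbers).split()
    return [t for t in tokens if t not in STOPWORDS]

sample = "I have 2 Cats, and the cats are CUTE!"
print(preprocess(sample))                      # ['2', 'cats', 'cats', 'are', 'cute']
print(preprocess(sample, keep_numbers=False))  # ['cats', 'cats', 'are', 'cute']
```

Keeping number handling behind a flag reflects that it is optional and task dependent.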



### Example 1: Simple Text Normalization

```python
import re  # regular expressions: patterns to search, match, extract, or replace text

text = "This is a simple Example, showing punctuation cleaning!  "

# 1. Lowercase
text = text.lower()

# 2. Remove punctuation (keeps letters, digits, underscores, and whitespace)
text = re.sub(r'[^\w\s]', '', text)

# 3. Collapse repeated whitespace and trim both ends
text = re.sub(r'\s+', ' ', text).strip()

print(text)
```

```
this is a simple example showing punctuation cleaning
```


### Example 2: Normalization + Tokenization + Stopword Removal

```python
import re

import nltk
from nltk.corpus import stopwords

# Fetch the stopword lists on first use (no-op if already downloaded)
nltk.download('stopwords', quiet=True)

text = "This is a simple Example, showing punctuation cleaning!"

# 1. Lowercase
text = text.lower()

# 2. Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# 3. Tokenize with a simple whitespace split
#    (use an NLTK or spaCy tokenizer for more complex text)
words = text.split()

# 4. Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w not in stop_words]

print(filtered_words)
```

```
['simple', 'example', 'showing', 'punctuation', 'cleaning']
```


### Example 3: Using spaCy for Normalization + Lemmatization

```python
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "This is a simple Example, showing punctuation cleaning!"

# Lowercase (spaCy handles tokenization automatically)
doc = nlp(text.lower())

# Remove stopwords and lemmatize
cleaned = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]

print(cleaned)
```

```
['simple', 'example', 'show', 'punctuation', 'cleaning']
```


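* `token.is_stop` → `True` for stopwords; filtering on `not token.is_stop` removes them
* `token.is_alpha` → `True` only for tokens made of alphabetic characters; filtering on it removes punctuation and numbers
* `token.lemma_` → the lemma (base form) of the token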


## ✅ Key Notes

1. **Always lowercase** first to unify word forms.
2. **Remove punctuation** early so it does not stay attached to words during tokenization.
3. **Whitespace cleanup** prevents empty tokens.
4. **Numbers** can be removed or kept depending on your task.
5. Normalization **makes later NLP steps such as tokenization, stopword removal, and lemmatization more consistent and accurate**.

---
