# Stemming and Lemmatization

**Objective:** Reduce words to their base or root forms to normalize text and minimize redundancy in NLP tasks.

### Why?
Words like *running*, *runs*, and *ran* are all inflected forms of the same word, *run*. Stemming and lemmatization map such variants to a shared **root form** so that a model treats them as one term.

---
## Stemming
Stemming strips affixes (in practice mostly suffixes) with heuristic rules to reach a base form of a word, which is often not a real dictionary word.

Example: *studies → studi*, *running → run*
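To make the idea of rule-based suffix stripping concrete, here is a minimal toy sketch. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the threshold `MIN_STEM_LEN`, and the function `toy_stem` are hypothetical names chosen for illustration only.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The rule list and the minimum stem length are hypothetical choices.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # jumping -> jump
    ("ed", ""),    # jumped  -> jump
    ("s", ""),     # cats    -> cat
]
MIN_STEM_LEN = 3  # refuse to strip if the remaining stem would be too short


def toy_stem(word: str) -> str:
    """Apply the first suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LEN:
                return stem
    return word  # no rule applied


for w in ["studies", "jumping", "jumped", "cats", "sing", "running"]:
    print(f"{w:>8} -> {toy_stem(w)}")
```

Note that this sketch turns *running* into *runn*. Real stemmers add further rules (for example, undoing doubled consonants) to reach *run*, which is why the library implementations are used in the rest of this notebook.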

In [None]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data; recent NLTK releases may need 'punkt_tab' instead

text = "The runners were running swiftly and studying various phenomena."
words = word_tokenize(text)

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

print("Porter Stemmer:")
print([porter.stem(w) for w in words])

print("\nSnowball Stemmer:")
print([snowball.stem(w) for w in words])

print("\nLancaster Stemmer:")
print([lancaster.stem(w) for w in words])

### Comparison of Common Stemmers
- **PorterStemmer** → The classic and most widely used algorithm; relatively conservative.
- **SnowballStemmer** → A revised version of Porter (also known as Porter2) that supports several languages.
- **LancasterStemmer** → Very aggressive; often over-trims words.
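One way to see these differences is to line the three stemmers up word by word. The sketch below assumes `nltk` is installed (the stemmers need no extra data downloads); the word list and the helper `stem_table` are hypothetical and chosen only for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Illustrative sample; swap in words from your own corpus.
sample_words = ["running", "studies", "generously", "maximum", "fairly"]


def stem_table(words, stemmers):
    """Return one row per word: the word followed by each stemmer's output."""
    return [[w] + [s.stem(w) for s in stemmers.values()] for w in words]


rows = stem_table(sample_words, stemmers)
header = ["word"] + list(stemmers)
for row in [header] + rows:
    print("".join(f"{cell:<14}" for cell in row))
```

Reading across a row shows where the algorithms agree and where the more aggressive ones cut further.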

---
## Lemmatization
Lemmatization converts words to their **base dictionary form (lemma)** using a vocabulary and morphological analysis.

Unlike stemming, it takes the word's **Part of Speech (POS)** into account and returns valid dictionary words.

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "better", "studies", "children"]

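# Without a POS argument, lemmatize() treats every word as a noun,
# so 'running', 'ran', and 'better' come back unchanged here.
for w in words:
    print(f"{w:>10} → {lemmatizer.lemmatize(w)}")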

### Lemmatization with POS Tags
By default the WordNet lemmatizer treats every word as a noun, so passing the correct **POS tag** improves the results. For instance, *better* tagged as an adjective lemmatizes to *good*.

In [None]:
print(lemmatizer.lemmatize('running', pos='v'))  # verb
print(lemmatizer.lemmatize('better', pos='a'))   # adjective
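In practice the POS usually comes from a tagger that emits Penn Treebank tags (such as `VBG` or `JJR`), which have to be mapped to WordNet's single-letter codes. The following is a minimal sketch of that mapping; the helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the first-letter rule is a common simplification rather than an exact correspondence.

```python
# WordNet POS codes; these match wordnet.NOUN, VERB, ADJ and ADV in NLTK.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("V"):
        return VERB
    if tag.startswith("R"):
        return ADV
    return NOUN  # fall back to noun, the lemmatizer's own default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos=...) for each (word, penn_tag) pair."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-written tags for illustration; a tagger would normally supply them.
tagged = [("children", "NNS"), ("running", "VBG"), ("better", "JJR")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK, a call along the lines of `lemmatize_tagged(nltk.pos_tag(words), lemmatizer.lemmatize)` should connect this to the lemmatizer above, assuming the tagger resource has been downloaded first.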

---
## Lemmatization using spaCy
spaCy runs POS tagging as part of its pipeline and uses the predicted tags during lemmatization, so no manual POS argument is needed.

In [None]:
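import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)  # `text` is the sentence defined in the stemming cell

for token in doc:
    print(f"{token.text:<15} → POS: {token.pos_:<6} Lemma: {token.lemma_}")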

---
## Comparison of Stemming vs Lemmatization

| Word | Stemming (Porter) | Lemmatization (WordNet, correct POS) |
|------|-------------------|--------------------------------------|
| studies | studi | study |
| running | run | run |
| children | children | child |
| better | better | good |
---
## Summary
- **Stemming** → Fast but crude (cuts off endings).
- **Lemmatization** → Slower but linguistically accurate.
- Libraries used: `nltk`, `spaCy`.

---
**Next:** `04-Bag_of_Words_and_TFIDF.ipynb`: learn how to convert processed text into numerical features for ML models.