
# **Stemming and Lemmatization**

---

## **1. Theory**

### **Stemming**

* **Definition**: Process of reducing words to a **stem** (an approximate root form) by stripping affixes, usually suffixes, with heuristic rules.
* Often produces **non-dictionary words**.
* Example:

  * *“studies” → “studi”*
  * *“playing” → “play”*

👉 Algorithms: **Porter Stemmer, Snowball Stemmer, Lancaster Stemmer** (all available in NLTK).
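
To see why suffix stripping yields non-dictionary forms, here is a minimal sketch of a rule-based stemmer. It is a toy, not the Porter algorithm: the rule table `TOY_RULES`, the function `toy_stem`, and the `min_stem_len` guard are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# Each rule is (suffix, replacement); the first matching rule wins.
TOY_RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # playing -> play
    ("ed", ""),     # played  -> play
    ("s", ""),      # cats    -> cat
]

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if enough characters remain before it."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word  # no rule applied

for w in ["studies", "playing", "played", "cats", "sing"]:
    print(w, "->", toy_stem(w))
```

The length guard keeps short words such as "sing" intact. Real stemmers apply many more rules and conditions, so their output differs from this sketch on other words.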

---

### **Lemmatization**

* **Definition**: Process of reducing words to their **base/dictionary form (lemma)** using **linguistic knowledge** (vocabulary + grammar).
* Produces **valid dictionary words** for the words it knows; unknown words are typically returned unchanged.
* Example:

  * *“studies” → “study”*
  * *“better” → “good”* (when tagged as an adjective)

👉 Requires **POS tagging** for accuracy, because the lemma of a word depends on its part of speech.

👉 spaCy has a lemmatizer integrated into its pipeline.
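
The dependence on part of speech can be sketched with a tiny lookup table. This is a simplified illustration, not how any particular library is implemented: `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, and a real lemmatizer relies on a full lexicon such as WordNet plus morphological rules.

```python
# Toy lemmatizer: a lookup keyed by (word, POS).
# The lexicon below is hypothetical and covers only a handful of words.
TOY_LEXICON = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("saw", "NOUN"): "saw",
    ("saw", "VERB"): "see",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return TOY_LEXICON.get(key, word.lower())

print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("better", "ADV"))    # well
print(toy_lemmatize("meeting", "NOUN"))  # meeting
print(toy_lemmatize("meeting", "VERB"))  # meet
```

The same surface form maps to different lemmas depending on the tag, which is why a wrong or missing POS leads to a wrong lemma.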

---

### **Key Difference**

* **Stemming**: Fast, rule-based, less accurate.
* **Lemmatization**: Slower, linguistically accurate.

---

## **2. Examples**

### **NLTK Example**

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the WordNet data used by the lemmatizer
nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "better", "played"]

print("Stemming Results:")
for w in words:
    print(w, "->", stemmer.stem(w))

print("\nLemmatization Results:")
for w in words:
    # No POS is passed, so each word is lemmatized as a noun (the default)
    print(w, "->", lemmatizer.lemmatize(w))
```

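**Output:**

```
Stemming Results:
studies -> studi
studying -> studi
better -> better
played -> play

Lemmatization Results:
studies -> study
studying -> studying
better -> better
played -> played
```

Because `lemmatize` is called without a `pos` argument, every word is treated as a noun, so “studying”, “better”, and “played” come back unchanged. Passing `pos="v"` for the verbs and `pos="a"` for the adjective yields “study”, “play”, and “good”.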

---

### **spaCy Example**

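```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "studies studying better played"
doc = nlp(text)

# The pipeline tags each token, then derives the lemma from the predicted tag
for token in doc:
    print(token.text, "->", token.lemma_)
```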

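**Output** (illustrative; lemmas depend on the POS tags the model predicts, so results can vary across model versions, especially for a bare word list like this one):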

```
studies -> study
studying -> study
better -> good
played -> play
```

---

## **3. Interview-Style Q&A**

### **Basic Level**

**Q1. What is stemming?**
*A: Stemming reduces words to their base form by chopping off affixes, often resulting in non-dictionary words.*

**Q2. What is lemmatization?**
*A: Lemmatization reduces words to their valid dictionary form (lemma) using vocabulary and grammar rules.*

---

### **Intermediate Level**

**Q3. Difference between stemming and lemmatization?**
*A: Stemming is rule-based, fast, and may generate non-dictionary words. Lemmatization is linguistically accurate, slower, and produces valid dictionary words.*

**Q4. Why do we need POS tags in lemmatization?**
*A: Because the lemma of a word depends on its part of speech. For example, “better” as an adjective has the lemma “good”, but NLTK's `WordNetLemmatizer` treats every word as a noun unless a POS is passed, so it returns “better” unchanged.*

---

### **Advanced Level**

**Q5. Which is better for production use: stemming or lemmatization?**
*A: It depends on the task. Lemmatization is preferred where linguistic accuracy matters, such as chatbots or machine translation. Stemming remains common in search and information retrieval, where speed and recall matter more than producing valid words.*

**Q6. How do modern NLP models handle word normalization?**
*A: Modern deep learning models (e.g., BERT, GPT) rely on subword tokenization and embeddings, so explicit stemming/lemmatization is often unnecessary, though preprocessing can still help for simpler ML models.*
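
To illustrate how subword tokenization can stand in for explicit normalization, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary `TOY_VOCAB` and the function `toy_wordpiece` are hypothetical; real models learn vocabularies of tens of thousands of pieces from data.

```python
# Toy greedy longest-match subword tokenizer in the style of WordPiece.
# The vocabulary is hypothetical and far smaller than a real model's.
TOY_VOCAB = {"study", "stud", "play", "##ing", "##ies", "##ed", "##s"}

def toy_wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

for w in ["studying", "studies", "played", "plays"]:
    print(w, "->", toy_wordpiece(w))
```

Inflected forms end up sharing pieces such as `play` or `##ing`, so the model can relate them through shared embeddings without a separate stemming or lemmatization step.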

---

## **4. Comparison Table**

| **Aspect**        | **Stemming**                                | **Lemmatization**                                  |
| ----------------- | ------------------------------------------- | -------------------------------------------------- |
| **Approach**      | Rule-based (suffix stripping)               | Linguistic (dictionary + grammar)                  |
| **Output**        | May not be a valid word (*studies → studi*) | A valid dictionary word (*studies → study*)        |
| **Accuracy**      | Low to Medium                               | High                                               |
| **Speed**         | Fast                                        | Slower                                             |
| **Best Use Case** | Search engines, information retrieval       | Chatbots, machine translation, text classification |
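
The search use case can be sketched with a small inverted index keyed by stems. This is a simplified illustration, not a production design: the `docs` corpus, the `index` structure, and the `search` function are assumptions, and only NLTK's `PorterStemmer` is a real library call.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: document id -> text
docs = {
    1: "She studies linguistics",
    2: "He was studying all night",
    3: "They played football",
}

# Inverted index keyed by stem, so inflected forms share one entry
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)

def search(query):
    """Return ids of documents containing any inflected form of the query word."""
    return index.get(stemmer.stem(query.lower()), set())

print(search("study"))    # matches documents 1 and 2 via the stem "studi"
print(search("playing"))  # matches document 3 via the stem "play"
```

The stem "studi" is not a dictionary word, but that does not matter here because it is only used as an internal index key and never shown to the user.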

---

