Stemming in NLP – Interview Theory Sheet

#### 1. What is stemming?
### Answer:
Stemming is the process of reducing inflected or derived words to their root/base form, usually by stripping suffixes. It is a type of text normalization primarily used to improve information retrieval and reduce dimensionality in NLP tasks.
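To make "stripping suffixes" concrete, here is a deliberately tiny sketch in plain Python. The suffix list, the `toy_stem` name, and the minimum stem length are all hypothetical choices made for illustration; this is not the Porter algorithm or any NLTK stemmer, which apply many more rules and conditions.

```python
# Hypothetical rule list for illustration only; longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["connected", "connection", "connections", "connecting", "bus"]:
    print(w, "->", toy_stem(w))
```

The minimum-length guard is what keeps a short word such as "bus" from being cut down to "bu"; real stemmers use more careful conditions for the same purpose.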

#### 2. Why is stemming important?
### Answer:
Stemming helps group similar terms for better generalization. For instance:

"connect", "connected", "connection" → "connect"  
This improves the performance of search engines, classifiers, and topic models by treating variants of a word as one.
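The grouping effect can be sketched by bucketing surface forms under their stem. The word list below is an assumed example, not data from any corpus; only `PorterStemmer` from NLTK is used.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Assumed sample vocabulary for illustration.
words = ["connect", "connected", "connecting", "connection", "connections",
         "run", "running", "runs"]

# Bucket every surface form under the stem it reduces to.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

print(f"{len(words)} surface forms -> {len(groups)} stems")
for stem, variants in groups.items():
    print(stem, variants)
```

A smaller set of distinct stems means a smaller feature space for a classifier or topic model, and a single index entry for a search engine.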

#### 3. Common Stemming Algorithms in NLTK
| Stemmer           | Description                                 | Language Support | Aggressiveness |
|-------------------|---------------------------------------------|------------------|----------------|
| PorterStemmer     | Rule-based suffix stripper (Porter, 1980)   | English          | Medium         |
| LancasterStemmer  | More aggressive, faster but less accurate   | English          | High           |
| SnowballStemmer   | Newer, more accurate, multilingual support  | Multilingual     | Balanced       |
| RegexpStemmer     | Custom stemming using regex rules           | Customizable     | Custom         |

### 🧾 Syntax and Examples

```python
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# 'punkt' is only required for tokenization; the stemmers themselves need no corpus data
nltk.download('punkt')

# Example – PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("connected"))    # Output: connect
print(stemmer.stem("connection"))   # Output: connect

# Example – LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem("connected"))    # Output: connect
print(stemmer.stem("connection"))   # Output: connect
print(stemmer.stem("connections"))  # Output: connect

# Example – SnowballStemmer (Multilingual)
stemmer = SnowballStemmer("english")
print(stemmer.stem("connected"))    # Output: connect
print(stemmer.stem("nationality"))  # Output: nation
```

Output:

```text
[nltk_data] Downloading package punkt to C:\Users\Suraj
[nltk_data]     Khodade\AppData\Roaming\nltk_data...
connect
connect
connect
connect
connect
connect
nation
[nltk_data]   Package punkt is already up-to-date!
```
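As a small sketch of the multilingual side of `SnowballStemmer`, the class exposes a `languages` tuple, and the constructor is expected to raise `ValueError` for a name that is not in it. The language name "klingon" below is just a made-up unsupported value.

```python
from nltk.stem import SnowballStemmer

# Languages this NLTK build ships Snowball rules for.
print(SnowballStemmer.languages)

# The language is chosen once, at construction time.
english = SnowballStemmer("english")
print(english.stem("connected"))

# An unsupported name is rejected up front rather than silently ignored.
try:
    SnowballStemmer("klingon")
except ValueError as err:
    print("unsupported:", err)
```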


### Comparison Table

| Feature              | **PorterStemmer**            | **LancasterStemmer**            | **SnowballStemmer**                  | **RegexpStemmer**                 |
| -------------------- | ---------------------------- | ------------------------------- | ------------------------------------ | --------------------------------- |
| **Language Support** | English                      | English                         | 15+ languages                        | Custom                            |
| **Aggressiveness**   | Moderate                     | High                            | Balanced                             | Customizable                      |
| **Accuracy**         | Good                         | Lower (over-stemming)           | Better than Porter                   | Depends on pattern                |
| **Use Case**         | General text                 | Performance-sensitive           | Multilingual processing              | Domain-specific rules             |
| **Syntax**           | `PorterStemmer().stem(word)` | `LancasterStemmer().stem(word)` | `SnowballStemmer("lang").stem(word)` | `RegexpStemmer(regex).stem(word)` |
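`RegexpStemmer` is the only stemmer in the table without a usage example, so here is a minimal sketch. The pattern and the `min=4` threshold follow the example in the NLTK documentation; a domain-specific pattern would replace them in practice.

```python
from nltk.stem import RegexpStemmer

# Remove any match of the pattern; words shorter than `min` characters are left alone.
stemmer = RegexpStemmer(r"ing$|s$|e$|able$", min=4)

for word in ["cars", "was", "compute", "advisable"]:
    print(word, "->", stemmer.stem(word))
# cars -> car
# was -> was        (shorter than min, so untouched)
# compute -> comput
# advisable -> advis
```

Because the stemmer only substitutes the regex with an empty string, it has no notion of vowels or word structure; the `min` argument is the main guard against mangling short words.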


### Stemming vs Lemmatization

| Aspect          | **Stemming**            | **Lemmatization**                         |
| --------------- | ----------------------- | ----------------------------------------- |
| Output          | May not be a valid word | Produces valid dictionary word            |
| Method          | Rule-based stripping    | Uses vocabulary and grammar (POS tagging) |
| Speed           | Fast                    | Slower (needs linguistic resources)       |
| Example         | `"studies"` → `"studi"` | `"studies"` → `"study"`; `"better"` → `"good"` (as adjective) |
| Library in NLTK | `PorterStemmer`, etc.   | `WordNetLemmatizer`                       |


#### Q1. When would you prefer stemming over lemmatization?
### Answer:
Prefer stemming for performance-sensitive applications such as search indexing, where speed is critical and exact dictionary forms are not essential.
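A rough sketch of the search-indexing case: both documents and queries are stemmed, so a query matches documents that contain a different inflection of the same word. The documents, the `search` helper, and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Assumed toy corpus: document id -> text.
docs = {
    1: "the server connected to the database",
    2: "connection pooling reduces latency",
    3: "caching improves latency",
}

# Inverted index keyed by stem rather than by surface form.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)

def search(query: str) -> list[int]:
    """Return ids of documents containing any form of the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))

print(search("connections"))  # [1, 2]
```

Neither document contains the literal token "connections", yet both are retrieved because all three forms share the stem "connect".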

#### Q2. Which stemmer is best for aggressive stemming?
### Answer:
LancasterStemmer, but it may lead to over-stemming and reduce precision.
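One way to see the difference in aggressiveness is to run the same words through both stemmers side by side. The word list is an assumed sample; "cement" and "maximum" are taken from the Lancaster examples in the NLTK documentation, where they stem to "cem" and "maxim".

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Assumed sample words for comparison.
words = ["cement", "maximum", "presumably"]

rows = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

print(f"{'word':<12}{'porter':<12}{'lancaster':<12}")
for word, p, l in rows:
    print(f"{word:<12}{p:<12}{l:<12}")
```

Shorter Lancaster stems merge more words together, which helps recall but makes unrelated words more likely to collide.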

#### Q3. What are the trade-offs in using stemming?
### ✅ Pros:
- Reduces vocabulary size  
- Increases recall  
- Faster normalization  

### ❌ Cons:
- May remove meaningful distinctions (e.g., "universe" and "university" both become "univers")  
- Produces non-dictionary forms  
- Rules are language-specific, so a stemmer built for one language does not transfer to another  
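The first two drawbacks can be checked directly with `PorterStemmer`. This is only a short demonstration on three hand-picked words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Three words with clearly different meanings.
words = ["universe", "university", "universal"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)

# All three collapse to the same non-dictionary stem.
print(len(set(stems.values())), "distinct stem(s)")
```

A search for "university" over a stemmed index would therefore also return documents about the universe, which is the precision cost of the recall gain.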

#### 🏁 Final Tips
- Porter = standard for academic use  
- Lancaster = aggressive and fast  
- Snowball = accurate + multilingual  
- Use stemming in IR, search, topic modeling  
- Use lemmatization in NLP pipelines, language understanding tasks  
