# Stemming in NLP

## What is Stemming?

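**Stemming** is the process of reducing a word to its **base or root form**, known as the **stem**, usually by stripping suffixes with heuristic rules.
The stem does not have to be a real word; it only needs to be the same for related words so that they can be grouped together.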

- Example (Porter stemmer):
  - "running", "runs" → "run" (the irregular form "ran" stays "ran")
  - "happiness" → "happi", "happily" → "happili"
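
As a rough illustration of the idea, here is a minimal suffix-stripping sketch in plain Python. The suffix list and the `toy_stem` helper are made up for this example and are far cruder than any real stemmer; the point is only that rule-based stripping can yield non-words ("runn", "happi") and cannot handle irregular forms such as "ran".

```python
# Hypothetical suffix list for illustration; real stemmers use much richer rules.
SUFFIXES = ("ness", "ing", "ly", "s")

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["runs", "running", "happiness", "ran"]:
    print(w, "->", toy_stem(w))
# runs -> run
# running -> runn
# happiness -> happi
# ran -> ran
```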

## Why Use Stemming?

- Reduces vocabulary size by mapping inflected variants of a word to a single token.
- Helps match related terms in text (e.g., in search engines and text classification).
- Useful for tasks like **information retrieval**, **text mining**, and **sentiment analysis**.
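
The sketch below shows both effects on a tiny, made-up document collection: the vocabulary shrinks, and a query for one form of a word finds documents that contain other forms. `crude_stem` and the `documents` dictionary are hypothetical stand-ins, not part of NLTK.

```python
from collections import defaultdict

def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips 'ing', 'ed', or 's'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

documents = {
    1: "she walks to work",
    2: "he walked home",
    3: "they enjoy walking and jumping",
}

# Build an inverted index keyed by stem instead of by surface form.
index = defaultdict(set)
raw_vocabulary = set()
for doc_id, text in documents.items():
    for token in text.split():
        raw_vocabulary.add(token)
        index[crude_stem(token)].add(doc_id)

print(len(raw_vocabulary), "surface forms ->", len(index), "stems")  # 12 surface forms -> 10 stems
print(sorted(index[crude_stem("walk")]))                             # [1, 2, 3]
```

A search for "walk" reaches all three documents even though none of them contains that exact token.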

## Limitations of Stemming

- May produce strings that are not real words (e.g., Porter turns "fairly" into "fairli").
- Can over-stem (different, unrelated words are reduced to the same stem) or under-stem (related words end up with different stems).
- Less accurate than lemmatization, which uses a vocabulary and grammatical analysis to return dictionary forms.

## Bonus Tip
- Use stemming when speed and simplicity matter more than getting valid dictionary words.
- Use lemmatization when accuracy and word meaning matter.
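
To make the trade-off concrete, the following sketch contrasts a rule-only stemmer with a dictionary lookup. Both `strip_suffix` and the four-entry `LEMMA_TABLE` are hypothetical; a real lemmatizer relies on a full lexicon and usually on part-of-speech information.

```python
# Hypothetical mini-lexicon; a real lemmatizer consults a full dictionary
# and usually the word's part of speech.
LEMMA_TABLE = {"ran": "run", "running": "run", "runs": "run", "mice": "mouse"}

def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)

def strip_suffix(word: str) -> str:
    """Hypothetical rule-only stemmer: removes a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "runs", "ran", "mice"]:
    print(f"{w:8} stem={strip_suffix(w):6} lemma={toy_lemmatize(w)}")
```

The rule-only function needs no data and is trivially fast, but it leaves "ran" and "mice" untouched and turns "running" into "runn". The lookup returns proper dictionary forms, at the cost of needing the dictionary in the first place.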

```python
# Sample words reused by every stemmer below
words = ["running", "ran", "runs", "easily", "fairly", "programmers", "programs", "history", "historical"]
```

### PorterStemmer (Classic, Rule-based)
- A widely used, rule-based stemmer that applies a series of steps to strip suffixes from English words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stem every word in the sample list
stems = [stemmer.stem(word) for word in words]
print(stems)
```

```
['run', 'ran', 'run', 'easili', 'fairli', 'programm', 'program', 'histori', 'histor']
```
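
To give a sense of what one of those steps looks like, here is a sketch of the plural-handling rules from step 1a of Porter's original algorithm (SSES → SS, IES → I, SS → SS, S → nothing). The function name `porter_step_1a` is made up for this example, and NLTK's implementation adds refinements of its own, so treat this as an illustration of the idea rather than a copy of the library's code.

```python
def porter_step_1a(word: str) -> str:
    """Plural handling from step 1a of Porter's original algorithm."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "programmers"]:
    print(w, "->", porter_step_1a(w))
```

Later steps then work on the result; for instance, "programmers" leaves this step as "programmer" before further suffixes are considered.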


### LancasterStemmer (More aggressive)
- A more aggressive stemmer than Porter, often leading to shorter stems and possible over-stemming.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
# Stem every word in the sample list
stems = [stemmer.stem(word) for word in words]
print(stems)
```

```
['run', 'ran', 'run', 'easy', 'fair', 'program', 'program', 'hist', 'hist']
```
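
Grouping the sample words by the stems printed above makes the difference between the two stemmers easier to see. The stem lists in this sketch are copied from the Porter and Lancaster outputs in this notebook rather than recomputed, and `group_by_stem` and `merged_groups` are helper names invented for the example.

```python
from collections import defaultdict

words = ["running", "ran", "runs", "easily", "fairly",
         "programmers", "programs", "history", "historical"]

# Stems copied from the PorterStemmer and LancasterStemmer outputs shown above.
porter = ['run', 'ran', 'run', 'easili', 'fairli', 'programm', 'program', 'histori', 'histor']
lancaster = ['run', 'ran', 'run', 'easy', 'fair', 'program', 'program', 'hist', 'hist']

def group_by_stem(words, stems):
    """Map each stem to the list of words that were reduced to it."""
    groups = defaultdict(list)
    for word, stem in zip(words, stems):
        groups[stem].append(word)
    return dict(groups)

def merged_groups(groups):
    """Keep only the stems shared by more than one word."""
    return {stem: ws for stem, ws in groups.items() if len(ws) > 1}

print(merged_groups(group_by_stem(words, porter)))
# {'run': ['running', 'runs']}
print(merged_groups(group_by_stem(words, lancaster)))
# {'run': ['running', 'runs'], 'program': ['programmers', 'programs'], 'hist': ['history', 'historical']}
```

On this list Porter keeps "programmers"/"programs" and "history"/"historical" apart, which counts as under-stemming if you want them matched, while Lancaster merges both pairs. The same aggressiveness is what can merge unrelated words on other vocabularies.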


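### RegexpStemmer (Custom, Regex-based)
- A customizable stemmer that removes every substring matching a user-defined regular expression (typically a set of suffixes). Words shorter than `min` characters are returned unchanged.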

```python
from nltk.stem import RegexpStemmer

# Strip a trailing "ing", "s", "e", or "able"; skip words shorter than 4 characters
stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
stems = [stemmer.stem(word) for word in words]
print(stems)
```

```
['runn', 'ran', 'run', 'easily', 'fairly', 'programmer', 'program', 'history', 'historical']
```
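
The behaviour above can be approximated with the standard `re` module alone, which may help when reasoning about what a given pattern will do. `regexp_stem` is a hypothetical helper written for this example, based on the assumption that the stemmer simply deletes every match of the pattern once the length check passes.

```python
import re

def regexp_stem(word: str, pattern: str, min_length: int = 0) -> str:
    """Delete every match of `pattern`, unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

words = ["running", "ran", "runs", "easily", "fairly",
         "programmers", "programs", "history", "historical"]

print([regexp_stem(w, r"ing$|s$|e$|able$", min_length=4) for w in words])
# ['runn', 'ran', 'run', 'easily', 'fairly', 'programmer', 'program', 'history', 'historical']
```

Because the pattern only removes text and never repairs what is left, "running" becomes "runn"; handling doubled consonants would need an extra rule.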


### SnowballStemmer (More advanced, supports multiple languages)
- The English Snowball stemmer (also known as Porter2) is a revision of the Porter algorithm with more consistent stemming rules. The same class also provides stemmers for a number of other languages.

```python
from nltk.stem import SnowballStemmer

# The language is chosen by name when the stemmer is created
stemmer = SnowballStemmer("english")
stems = [stemmer.stem(word) for word in words]
print(stems)
```

```
['run', 'ran', 'run', 'easili', 'fair', 'programm', 'program', 'histori', 'histor']
```
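
Since multi-language support is the main practical difference, the sketch below lists the languages NLTK ships and applies the same interface to a few of them. The sample words are arbitrary choices for this example, and no output is shown for the non-English stems because they depend on each language's rule set.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor.
print(SnowballStemmer.languages)

# Arbitrary sample words; the interface is identical for every language.
samples = {
    "english": ["running", "fairly"],
    "german": ["laufen", "gelaufen"],
    "spanish": ["corriendo", "corren"],
}

stems_by_language = {}
for language, sample_words in samples.items():
    stemmer = SnowballStemmer(language)
    stems_by_language[language] = [stemmer.stem(w) for w in sample_words]

for language, stems in stems_by_language.items():
    print(language, stems)
```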
