## 🌿 Stemming in NLTK

📖 Definition:
Stemming reduces a word to a stem by stripping affixes, usually suffixes, with heuristic rules. The resulting stem need not be a dictionary word (for example, "computing" becomes "comput"). It is a text-normalization step used before downstream NLP tasks such as search, classification, and clustering.
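
To make the normalization concrete, the sketch below groups a few inflected forms by their Porter stem, which is the effect that search and clustering pipelines rely on. The helper name `group_by_stem` is an illustrative assumption, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words):
    """Map each Porter stem to the list of input words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(
    ["optimization", "optimizing", "optimizes", "finalize", "finalized"]
)
for stem, members in groups.items():
    print(f"{stem}: {members}")
```

The stems act only as shared keys for related word forms; they are not meant to be read as words.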

### 🌿 Stemming Techniques in NLTK
| Stemmer              | Type                      | Description                                                                 |
| -------------------- | ------------------------- | --------------------------------------------------------------------------- |
| **PorterStemmer**    | Rule-based                | One of the oldest and most widely used; suffix stripping with fixed rules.  |
| **LancasterStemmer** | Rule-based (aggressive)   | More aggressive and faster; uses an iterative rule set to strip suffixes.   |
| **SnowballStemmer**  | Rule-based (multilingual) | Successor to Porter (its English stemmer is "Porter2"); supports multiple languages. |
| **RegexpStemmer**    | Regex-based               | Allows custom suffix stripping using regular expressions.                   |
| **ISRIStemmer**      | Rule-based (Arabic only)  | Specialized for Arabic morphology using root-based heuristics.              |


### 🏁 Summary of Differences
| Stemmer       | Strength                   | Limitation               | Use Case                   |
| ------------- | -------------------------- | ------------------------ | -------------------------- |
| **Porter**    | Stable, interpretable      | May miss complex forms   | Classic NLP preprocessing  |
| **Lancaster** | Fast, compact              | Over-stemming possible   | Space-limited environments |
| **Snowball**  | Best balance, multilingual | English output close to, but not identical to, Porter | General-purpose NLP |
| **Regexp**    | Fully customizable         | Naive; no linguistic rules | Domain-specific stemming   |
| **ISRI**      | Specialized Arabic support | Only for Arabic          | Arabic NLP systems         |
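
Snowball's English stemmer and Porter agree on most words but not all. The following is a minimal sketch that keeps only the words on which the two disagree; the sample list and the name `disagreements` are chosen here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

sample = ["running", "easily", "fairly", "optimization", "finally"]

# Keep only the words whose Porter and Snowball stems differ.
disagreements = {
    word: (porter.stem(word), snowball.stem(word))
    for word in sample
    if porter.stem(word) != snowball.stem(word)
}

for word, (p_stem, s_stem) in disagreements.items():
    print(f"{word}: Porter={p_stem}, Snowball={s_stem}")
```

Running a check like this on a sample of your own vocabulary is a cheap way to decide whether the choice between the two matters for a given corpus.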

### 🎯 Recommendation Matrix

| Use Case                        | Recommended Stemmer |
| ------------------------------- | ------------------- |
| General English NLP             | `SnowballStemmer`   |
| Academic NLP Research           | `PorterStemmer`     |
| Performance-Constrained Systems | `LancasterStemmer`  |
| Custom Rule Scenarios           | `RegexpStemmer`     |
| Arabic Language Projects        | `ISRIStemmer`       |
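
One way to apply the matrix in code is a small lookup that returns a stemmer for a use case. This is only a sketch: the use-case keys, `STEMMER_FACTORIES`, and `make_stemmer` are hypothetical names, and the regular expression for the custom case is a placeholder.

```python
from nltk.stem import (
    LancasterStemmer,
    PorterStemmer,
    RegexpStemmer,
    SnowballStemmer,
)
from nltk.stem.isri import ISRIStemmer

# Hypothetical use-case keys; the mapping mirrors the matrix above.
STEMMER_FACTORIES = {
    "general_english": lambda: SnowballStemmer("english"),
    "academic": PorterStemmer,
    "performance_constrained": LancasterStemmer,
    "custom_rules": lambda: RegexpStemmer("ing$|ed$|ly$", min=4),
    "arabic": ISRIStemmer,
}


def make_stemmer(use_case):
    """Return a new stemmer for a use-case key; raise KeyError listing valid keys."""
    try:
        factory = STEMMER_FACTORIES[use_case]
    except KeyError:
        valid = sorted(STEMMER_FACTORIES)
        raise KeyError(f"unknown use case {use_case!r}; expected one of {valid}") from None
    return factory()


stemmer = make_stemmer("general_english")
print(stemmer.stem("running"))
```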


In [33]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
from nltk.stem.isri import ISRIStemmer

# Initialize stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
# Strip one of the listed suffixes; words shorter than 4 characters are left unchanged
regexp = RegexpStemmer('ing$|ed$|ly$|tion$|er$', min=4)
isri = ISRIStemmer()  # Arabic only; not applied to the English words below

words = [
    "compute", "computing", "computer", "computed", "computation", "computationally",
    "connect", "connected", "connection", "connecting", "connectional",
    "quickly", "happily", 
    "national", "nationalist", "nationalism",
    "relational", "organizational", "communication", "running"
]

# Apply each stemmer
output = {
    "Original": words,
    "Porter": [porter.stem(w) for w in words],
    "Lancaster": [lancaster.stem(w) for w in words],
    "Snowball": [snowball.stem(w) for w in words],
    "Regexp": [regexp.stem(w) for w in words],
}

# Display results
for i, word in enumerate(words):
    print("=" * 60)
    print(f"Original   : {word}")
    print(f"Porter     : {output['Porter'][i]}")
    print(f"Lancaster  : {output['Lancaster'][i]}")
    print(f"Snowball   : {output['Snowball'][i]}")
    print(f"Regexp     : {output['Regexp'][i]}")


Original   : compute
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : compute
Original   : computing
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : comput
Original   : computer
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : comput
Original   : computed
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : comput
Original   : computation
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : computa
Original   : computationally
Porter     : comput
Lancaster  : comput
Snowball   : comput
Regexp     : computational
Original   : connect
Porter     : connect
Lancaster  : connect
Snowball   : connect
Regexp     : connect
Original   : connected
Porter     : connect
Lancaster  : connect
Snowball   : connect
Regexp     : connect
Original   : connection
Porter     : connect
Lancaster  : connect
Snowball   : connect
Regexp     : connec
Original   : connecting
Porter     : connect
Lanca

In [34]:
from nltk.stem import PorterStemmer

In [35]:
stemming = PorterStemmer()

In [36]:
words = [
    "running", "ran", "easily", "fairly",
    "ArithmeticError", "arithmetic", "arithmeticity",
    "optimization", "optimizing", "optimizes",
    "finally", "finalize", "finalized",
]

In [37]:
for word in words:
    print(f"{word} -> {stemming.stem(word)}")

running -> run
ran -> ran
easily -> easili
fairly -> fairli
ArithmeticError -> arithmeticerror
arithmetic -> arithmet
arithmeticity -> arithmet
optimization -> optim
optimizing -> optim
optimizes -> optim
finally -> final
finalize -> final
finalized -> final
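
The `ArithmeticError -> arithmeticerror` line shows that `PorterStemmer.stem` lowercases its input by default. A minimal sketch of the consequence, with an illustrative list of variants: differently cased spellings collapse to a single stem.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Differently cased spellings of the same word.
variants = ["running", "Running", "RUNNING"]
stems = {stemmer.stem(v) for v in variants}
print(stems)
```

If case carries meaning in your data (identifiers, acronyms), consider keeping the original token alongside its stem.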


In [38]:
stemming.stem('congratulations')

'congratul'

In [39]:
from nltk.stem import RegexpStemmer

In [40]:
reg_stemming = RegexpStemmer('ing$|ly$|ed$|s$', min=4)

In [41]:
for word in words:
    print(f"{word} -> {reg_stemming.stem(word)}")

running -> runn
ran -> ran
easily -> easi
fairly -> fair
ArithmeticError -> ArithmeticError
arithmetic -> arithmetic
arithmeticity -> arithmeticity
optimization -> optimization
optimizing -> optimiz
optimizes -> optimize
finally -> final
finalize -> finalize
finalized -> finaliz
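
The `min` argument sets the shortest word length that `RegexpStemmer` will touch; anything shorter is returned unchanged. The sketch below compares two thresholds on the same pattern. The names `loose` and `strict` and the word list are illustrative only.

```python
from nltk.stem import RegexpStemmer

pattern = "ing$"
loose = RegexpStemmer(pattern, min=4)   # words of 4+ characters are eligible
strict = RegexpStemmer(pattern, min=5)  # words shorter than 5 characters are skipped

for word in ["king", "sing", "running"]:
    print(f"{word}: min=4 -> {loose.stem(word)!r}, min=5 -> {strict.stem(word)!r}")
```

A threshold that is too low mangles short words such as "king", while the pattern alone never repairs doubled consonants, which is why "running" stays "runn" at either setting.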


In [42]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
for word in words:
    print(f"{word} -> {snowball_stemmer.stem(word)}")

running -> run
ran -> ran
easily -> easili
fairly -> fair
ArithmeticError -> arithmeticerror
arithmetic -> arithmet
arithmeticity -> arithmet
optimization -> optim
optimizing -> optim
optimizes -> optim
finally -> final
finalize -> final
finalized -> final


In [43]:
# SnowballStemmer is already imported above; use a separate name for the German stemmer
german_stemmer = SnowballStemmer("german")
for word in ["laufen", "läuft", "gelaufen", "laufend"]:
    print(f"{word} -> {german_stemmer.stem(word)}")

laufen -> lauf
läuft -> lauft
gelaufen -> gelauf
laufend -> laufend
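
Before constructing a stemmer for an arbitrary language, it can help to check `SnowballStemmer.languages`, the class attribute that lists the accepted language names. The wrapper `snowball_or_none` below is a hypothetical convenience, not an NLTK function.

```python
from nltk.stem import SnowballStemmer

# Class attribute listing the language names the constructor accepts.
print(SnowballStemmer.languages)


def snowball_or_none(language):
    """Return a SnowballStemmer for `language`, or None if NLTK does not support it."""
    if language not in SnowballStemmer.languages:
        return None
    return SnowballStemmer(language)


print(snowball_or_none("german").stem("laufen"))
print(snowball_or_none("klingon"))
```

Returning `None` is a design choice for this sketch; calling `SnowballStemmer` directly with an unsupported name raises `ValueError` instead.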
