Stemming and its Methods

What is Stemming?

* Stemming refers to the process of reducing a word to its base or root form, which may not correspond to a valid dictionary entry.
* Example:
  * running → run
  * studies → studi
* The primary purpose of stemming is to treat morphologically similar words as equivalent during tasks such as information retrieval (IR), search engine indexing, text mining, and sentiment analysis.
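
To make the retrieval use case concrete, here is a minimal sketch of an inverted index keyed on stems rather than surface forms. The mini-corpus, the `build_index` and `search` helpers, and the choice of NLTK's `PorterStemmer` are illustrative assumptions, not a description of any particular search engine.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus: document id -> text.
documents = {
    0: "running cars",
    1: "played eating",
    2: "working cars",
}


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that share the query word's stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


index = build_index(documents)
print(search(index, "car"))  # looks up the stem shared with "cars"
print(search(index, "run"))  # looks up the stem shared with "running"
```

Because both the indexed tokens and the query pass through the same stemmer, a query only has to share a stem with a document word, not match it exactly.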

Porter Stemmer

* The Porter Stemmer is one of the most widely used rule-based stemming algorithms. It was published in 1980 by Martin Porter.
* The algorithm applies a fixed sequence of rules, in several steps, to strip or rewrite common suffixes such as -ing, -ed, and -s.
* An advantage of the Porter Stemmer is its simplicity and speed, which make it effective for processing English text.
* A limitation is that it can over-stem (merge unrelated words into one stem) or under-stem (leave related forms under different stems). In the output below, for example, eating is reduced to eat while eaten is left unchanged.
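
The idea of ordered suffix rules can be illustrated with a deliberately simplified sketch. The rule list, the `MIN_STEM_LENGTH` threshold, and the `toy_stem` name are hypothetical. This is not the Porter algorithm, which applies many more rules across several steps and adds conditions on the shape of the remaining stem.

```python
# Toy suffix rules, tried in order; NOT the real Porter rule set.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("s", "")]

# Assumed guard: never strip a suffix if fewer than 3 characters would remain.
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
            return word[:remaining] + replacement
    return word


for word in ["working", "played", "cars", "running", "eaten"]:
    print(word, "->", toy_stem(word))
```

With these rules `toy_stem("running")` yields `runn`. The real Porter algorithm reaches `run` because it includes further rules, such as removing a doubled final consonant after a suffix has been stripped.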

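In [8]:
from nltk.stem import PorterStemmer

In [10]:
# Create one stemmer instance and reuse it for every word
P_stemming = PorterStemmer()

In [12]:
# Sample words shared by all the stemmer examples in this section
words = ["running", "played", "cars", "happiness", "working", "eating", "eaten"]

In [18]:
# Print each word next to its Porter stem
for word in words:
    print(word+"---->"+P_stemming.stem(word))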

running---->run
played---->play
cars---->car
happiness---->happi
working---->work
eating---->eat
eaten---->eaten
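
The output above shows under-stemming in action: eating and eaten are forms of the same verb but receive different stems. A small helper makes such splits and merges easy to spot by grouping words under their stems. The `group_by_stem` function is a hypothetical helper, and the universe/university/universal trio is included only because it is a commonly cited over-stemming case; inspect the printed groups rather than taking the result for granted.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stem_fn):
    """Group words under the stem that stem_fn assigns to them."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


porter = PorterStemmer()

# Related forms that land under different stems indicate under-stemming.
related = group_by_stem(["eating", "eaten"], porter.stem)

# Words with different meanings that land under one stem indicate over-stemming.
distinct = group_by_stem(["universe", "university", "universal"], porter.stem)

print(related)
print(distinct)
```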


Snowball Stemmer

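* Snowball is a stemming framework, also developed by Martin Porter. Its English stemmer (often called Porter2) is a revised version of the original Porter algorithm.
* It supports multiple languages, including English, French, and German. In NLTK the language is passed to the constructor, as in SnowballStemmer("english").
* The English Snowball stemmer is generally regarded as more consistent than the original Porter Stemmer and is slightly more aggressive on some words, although on the short word list used here the two produce identical output.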

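In [20]:
from nltk.stem import SnowballStemmer
# The language name selects which Snowball stemmer to use
stemmer = SnowballStemmer("english")

In [24]:
# Print each word next to its Snowball stem
for word in words:
    print(word+"---->"+stemmer.stem(word))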

running---->run
played---->play
cars---->car
happiness---->happi
working---->work
eating---->eat
eaten---->eaten
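
Since the word list above does not separate the two stemmers, a side-by-side run on a few extra words can help. The sketch below also prints `SnowballStemmer.languages`, the tuple of language names the NLTK class accepts. The extra words (two adverbs) are an assumption chosen because adverbs are a commonly cited case where the two English stemmers may disagree; no particular output is claimed for them here.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Language names accepted by the SnowballStemmer constructor.
print(SnowballStemmer.languages)

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Two words from the list above plus two adverbs for comparison.
for word in ["running", "happiness", "fairly", "sportingly"]:
    print(f"{word:12} porter={porter.stem(word):12} snowball={snowball.stem(word)}")
```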


Regular Expression Stemmer (RegexpStemmer)

* The Regular Expression Stemmer removes word endings that match user-defined regular expression patterns.
* Users can specify custom suffix patterns to tailor the stemming process. In NLTK, words shorter than the min argument are left unchanged.
* Advantage: The approach is fully customizable.
* Disadvantage: The method is crude and may not generalize across word forms. In the output below, running becomes runn and played is left untouched, because the pattern has no rule for doubled consonants or for -ed.

In [28]:
from nltk.stem import RegexpStemmer
# Strip a trailing -ing, -s, -e, or -able; skip words shorter than 3 characters
R_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=3)

In [30]:
# Print each word next to its regex-based stem
for word in words:
    print(word+"---->"+R_stemmer.stem(word))

running---->runn
played---->played
cars---->car
happiness---->happines
working---->work
eating---->eat
eaten---->eaten
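
The behaviour described in this section can be approximated with the standard `re` module alone, which also makes the effect of customizing the pattern easy to see. The `regexp_stem` function, its `min_length` parameter, and the extended pattern are hypothetical illustrations, not NLTK's API.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Delete every match of pattern from word; shorter words are left unchanged."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


words = ["running", "played", "cars", "happiness", "working", "eating", "eaten"]

# Same pattern as the RegexpStemmer example in this section.
basic = "ing$|s$|e$|able$"
# Hypothetical extension that also strips a trailing -ed.
extended = "ing$|ed$|s$|e$|able$"

for word in words:
    print(word, regexp_stem(word, basic, 3), regexp_stem(word, extended, 3))
```

Adding `ed$` fixes played, but running still comes out as runn, which shows how each new case needs its own hand-written rule.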

Lancaster Stemmer

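* The Lancaster Stemmer (also known as the Paice/Husk stemmer) is the most aggressive of the stemmers covered here and often removes substantial portions of a word.
* As a result, it may generate stems that no longer resemble the original word, which reduces interpretability.
* Advantage: The Lancaster Stemmer is very fast, and its aggressiveness can merge forms the others keep apart. On the word list below it is the only stemmer that maps eaten to eat.
* Disadvantage: The same aggressiveness often results in over-stemming.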

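In [33]:
from nltk.stem import LancasterStemmer
# Create the Lancaster stemmer with its default rule set
L_stemmer = LancasterStemmer()

In [35]:
# Print each word next to its Lancaster stem
for word in words:
    print(word+"---->"+L_stemmer.stem(word))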

running---->run
played---->play
cars---->car
happiness---->happy
working---->work
eating---->eat
eaten---->eat
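
To compare the dictionary-free stemmers directly, the individual runs above can be combined into one table. This is a minimal sketch; the `stemmers` and `table` names are illustrative, and counting distinct stems is only a rough, assumed proxy for how aggressively each stemmer merges word forms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

words = ["running", "played", "cars", "happiness", "working", "eating", "eaten"]

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# One row per word, one column per stemmer.
table = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

print(f"{'word':12}" + "".join(f"{name:12}" for name in stemmers))
for word, row in table.items():
    print(f"{word:12}" + "".join(f"{row[name]:12}" for name in stemmers))

# Fewer distinct stems means more word forms were merged together.
for name in stemmers:
    distinct_stems = {row[name] for row in table.values()}
    print(name, len(distinct_stems))
```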
