## Stemming
Stemming is the process of reducing a word to its stem, the base form to which prefixes and suffixes attach, by stripping those affixes. Unlike a lemma, which is the dictionary form of a word, a stem does not have to be a valid word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).
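To make the idea concrete, here is a deliberately naive suffix stripper. It is only a sketch: the helper `naive_stem` and its suffix list are hypothetical and far cruder than the rule sets real stemmers use.

```python
# Hypothetical helper, not part of NLTK: strip the first matching suffix.
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip one known suffix if at least `min_stem_len` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["eating", "eats", "played", "writing", "history"]:
    print(w, "->", naive_stem(w))
```

Note that "writing" becomes "writ", which is not a dictionary word. That is acceptable for a stem, as long as related forms end up with the same one.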

In [2]:
# Classification problem:
# decide whether a product review (comment) is positive or negative.
# Inflected forms in reviews should ideally map to one base form:
# eating, eat, eaten ---> eat; going, gone, goes ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### Stemming algorithms
Porter Stemmer – Most common, rule-based, lightweight.

Snowball Stemmer – Also called Porter2; a revision of Porter that fixes some of its known issues.

Lancaster Stemmer – More aggressive, may over-stem words.

RegexpStemmer (short for Regular Expression Stemmer) – An NLTK stemmer that removes whatever a user-supplied regular expression matches, typically word endings, instead of applying a fixed set of linguistic rules as the Porter and Snowball stemmers do.
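As a quick orientation, the four stemmers listed above can be run side by side. This is a minimal sketch, assuming NLTK is installed; the dictionary `stemmers` and the helper `stem_all` are illustrative names, and the regular expression is the one used in the RegexpStemmer section of this notebook.

```python
from nltk.stem import (
    LancasterStemmer,
    PorterStemmer,
    RegexpStemmer,
    SnowballStemmer,
)

# One instance of each stemmer, keyed by a short label (labels are illustrative).
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
    "regexp": RegexpStemmer("ing$|s$|e$|able$", min=4),
}


def stem_all(word):
    """Return the stem each stemmer produces for `word`."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for w in ["fairly", "general", "eating"]:
    print(w, stem_all(w))
```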

### PorterStemmer

In [8]:
from nltk.stem import PorterStemmer

In [9]:
stemming=PorterStemmer()

In [10]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [11]:
stemming.stem('congratulations')

'congratul'

In [12]:
stemming.stem("sitting")

'sit'
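For the review-classification use case, the practical benefit of stemming is a smaller vocabulary. The sketch below groups the example words by their Porter stem; `group_by_stem` is a hypothetical helper written for this illustration, not an NLTK function.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[stemmer.stem(w)].append(w)
    return dict(groups)


words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

groups = group_by_stem(words, PorterStemmer())
for stem, members in groups.items():
    print(f"{stem:8} <- {members}")
print(len(words), "words ->", len(groups), "stems")
```

"eaten" stays in a group of its own because the Porter rules do not strip "-en", which is one reason stemming only approximates the mapping to a base form.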

### RegexpStemmer class
NLTK provides the RegexpStemmer class for building a stemmer from a regular expression. It takes a single regular expression and removes every part of the word that matches it, so whether prefixes or suffixes are stripped depends on how the pattern is anchored. The min argument sets a minimum word length: shorter words are returned unchanged. Let us see an example.

In [13]:
from nltk.stem import RegexpStemmer

In [14]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [15]:
reg_stemmer.stem('eating')

'eat'

In [16]:
reg_stemmer.stem('ingeating')

'ingeat'
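The behaviour above follows from the `$` anchors: only a trailing "ing" matches, so the leading "ing" in "ingeating" survives. The following sketch reproduces that logic with the standard `re` module. The function `regexp_stem` and the `both_ends` pattern are assumptions made for illustration and are not NLTK code.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern`, unless the word is shorter than `min_length`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


suffix_only = r"ing$|s$|e$|able$"  # same pattern as in the cells above
both_ends = r"^ing|ing$"           # hypothetical pattern that also strips a leading "ing"

print(regexp_stem("eating", suffix_only, 4))     # eat
print(regexp_stem("ingeating", suffix_only, 4))  # ingeat
print(regexp_stem("ingeating", both_ends, 4))    # eat
print(regexp_stem("is", suffix_only, 4))         # is (too short, left unchanged)
```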

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is a revised version of the Porter stemmer that fixes some of its issues.

In [17]:
from nltk.stem import SnowballStemmer

In [18]:
snowballsstemmer=SnowballStemmer('english')

In [19]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [21]:
# Porter stemmer
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [22]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [23]:
snowballsstemmer.stem('goes')

'goe'

In [24]:
stemming.stem('goes')

'goe'
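The cells above show that the two stemmers agree on many words and differ on others, such as adverbs ending in "-ly". A small helper can list the differences for any word list. This is a sketch; `disagreements` is a hypothetical name, not an NLTK function.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def disagreements(words):
    """Return {word: (porter_stem, snowball_stem)} for words the two stem differently."""
    result = {}
    for w in words:
        p, s = porter.stem(w), snowball.stem(w)
        if p != s:
            result[w] = (p, s)
    return result


sample = ["fairly", "sportingly", "goes", "eating"]
print(disagreements(sample))
```

Neither stemmer handles the irregular form "goes" well: both return "goe". Cases like this usually call for lemmatization rather than a different stemmer.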

### Lancaster Stemmer
The Lancaster Stemmer strips suffixes (and sometimes prefixes) very aggressively, applying its rules repeatedly until none matches.

As a result, it often removes more characters from a word than stemmers like Porter or Snowball do.

This can lead to over-stemming, where semantically different words are reduced to the same stem, or where the stem ends up very short or even meaningless.

In [25]:
from nltk.stem import LancasterStemmer, PorterStemmer

# Initialize stemmers
lancaster = LancasterStemmer()
porter = PorterStemmer()

words = ["maximum", "maximums", "general", "generally", "generation", "generate", "playing", "played"]

print("Original Word - Lancaster Stemmer - Porter Stemmer")
for word in words:
    print(f"{word:12} - {lancaster.stem(word):15} - {porter.stem(word)}")


Original Word - Lancaster Stemmer - Porter Stemmer
maximum      - maxim           - maximum
maximums     - maximum         - maximum
general      - gen             - gener
generally    - gen             - gener
generation   - gen             - gener
generate     - gen             - gener
playing      - play            - play
played       - play            - play
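One rough way to quantify "more aggressive" is to compare the average stem length each stemmer produces on the same words. The sketch below does this for the list used above; `mean_stem_length` is a hypothetical helper, and average length is only a crude proxy for aggressiveness.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def mean_stem_length(stemmer, words):
    """Average number of characters left after stemming each word."""
    stems = [stemmer.stem(w) for w in words]
    return sum(len(s) for s in stems) / len(stems)


sample = ["maximum", "maximums", "general", "generally",
          "generation", "generate", "playing", "played"]

print("Lancaster:", mean_stem_length(LancasterStemmer(), sample))
print("Porter:   ", mean_stem_length(PorterStemmer(), sample))
```

A lower average for Lancaster is consistent with the table above, where "general" and its relatives shrink to "gen" rather than "gener".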
