**STEMMING**

The stem of the verb wait is wait: it is the part that is common to all its inflected variants.

wait (infinitive)

wait (imperative)

waits (present, 3rd person, singular)

wait (present, other persons and/or plural)

waited (simple past)

waited (past participle)

waiting (progressive)
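One rough way to see that "wait" is the shared part is to compute the longest common prefix of the inflected forms. The sketch below uses only the Python standard library. The `forms` list and variable names are illustrative, and the common-prefix trick is an assumption that only holds for regular verbs like this one (it would fail for go / went / gone).

```python
import os.path

# Inflected forms of "wait" taken from the list above
forms = ["wait", "waits", "waited", "waiting"]

# os.path.commonprefix compares strings character by character,
# so it works on plain words as well as on file paths
common = os.path.commonprefix(forms)
print(common)  # prints: wait
```

Real stemmers do not compare word forms like this; they apply suffix rules to one word at a time.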


**Inflection** is a process of word formation in which a word is modified to express grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and definiteness.

Sometimes the spelling also changes when a suffix is added to form a new word.

beauty, duty + -ful → beautiful, dutiful (-y changes to i)

heavy, ready + -ness → heaviness, readiness (-y changes to i)

able, possible + -ity → ability, possibility (-le changes to il)

permit, omit + -ion → permission, omission (-t changes to ss)
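These spelling changes are why simply cutting off a suffix does not always give back the original word. Here is a minimal sketch of naive suffix stripping; the `strip_suffix` helper and its suffix list are hypothetical and written only for this illustration, not taken from any library.

```python
def strip_suffix(word, suffixes=("ness", "ful", "ity", "ion")):
    """Remove the first matching suffix; no spelling repair is attempted."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# (base word, derived word) pairs from the examples above
pairs = [
    ("beauty", "beautiful"),
    ("heavy", "heaviness"),
    ("able", "ability"),
    ("permit", "permission"),
]

stripped = {derived: strip_suffix(derived) for _, derived in pairs}
for base, derived in pairs:
    # The stripped form keeps the changed spelling (i, il, ss), not the base word
    print(f"{derived:12} -> {stripped[derived]:8} (base word: {base})")
```

None of the stripped forms equals its base word, which is why stemmers return stems such as "beauti" rather than dictionary words.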

**OverStemming**

Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive.

universal

university

universe

All three words above are stemmed to "univers", which is wrong because their meanings differ.
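This can be checked with a few lines. The sketch below assumes NLTK is installed and uses its `PorterStemmer`; the variable names are illustrative, and other stemmers may behave differently on these words.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Three words with different meanings
words = ["universal", "university", "universe"]
over_stems = {word: porter.stem(word) for word in words}

for word, stem in over_stems.items():
    print(f"{word:12} -> {stem}")

# Over-stemming: different words collapse to a single stem
print("distinct stems:", len(set(over_stems.values())))
```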

**UnderStemming**

Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative. For example, the following words share a root but are often left with different stems:

alumnus

alumni

alumnae
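The same kind of check works for under-stemming. This is a minimal sketch assuming NLTK's `PorterStemmer`; the variable names are illustrative, and the exact stems depend on the stemmer and its version, so the code only counts how many distinct stems come back.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Three forms of the same Latin-derived noun
words = ["alumnus", "alumni", "alumnae"]
under_stems = {word: porter.stem(word) for word in words}

for word, stem in under_stems.items():
    print(f"{word:12} -> {stem}")

# Under-stemming: related words end up with more than one stem
print("distinct stems:", len(set(under_stems.values())))
```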


**Truncating Stemming Algorithms**

*1. Porter Stemmer*

This is one of the most common stemmers and also one of the gentlest. It is fast but not very precise.

In [None]:
pip install nltk

In [1]:
import nltk
from nltk.stem.porter import PorterStemmer

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
porterStemmer = PorterStemmer()

In [4]:
sentence = "Provision Maximum multiply owed caring on go gone going was this"

In [5]:
wordList = nltk.word_tokenize(sentence)

stemWords = [porterStemmer.stem(word) for word in wordList]

print(' '.join(stemWords))

provis maximum multipli owe care on go gone go wa thi


**Problem?**

Compare the input with the output: "was" becomes "wa" and "this" becomes "thi". Neither is a sensible stem, which shows the limited precision of the algorithm. The Snowball stemmer was developed to improve on this.

*2. Snowball Stemmer*

Snowball is a framework for writing stemmers; its English stemmer is also known as the Porter2 stemmer.
It refines the original Porter algorithm, which makes it more precise on large datasets.

In [None]:
from nltk.stem.snowball import SnowballStemmer

snowBallStemmer = SnowballStemmer("english")

sentence = "Provision Maximum multiply owed caring on go gone going was this"
wordList = nltk.word_tokenize(sentence)

stemWords = [snowBallStemmer.stem(word) for word in wordList]

print(' '.join(stemWords))

Because Snowball is an improvement over the Porter stemmer, it handles "was" and "this" gracefully and leaves them intact instead of truncating them to "wa" and "thi".
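To make the difference visible side by side, here is a small sketch that runs both stemmers on the two problem words. It assumes NLTK is installed; the `comparison` dictionary is just an illustrative name.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Map each word to its (Porter stem, Snowball stem) pair
comparison = {
    word: (porter.stem(word), snowball.stem(word))
    for word in ["was", "this"]
}

print(f"{'word':8}{'porter':10}{'snowball':10}")
for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:8}{porter_stem:10}{snowball_stem:10}")
```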

Ignoring stopwords while stemming

In [7]:
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer

stemmer2 = SnowballStemmer("english", ignore_stopwords=True)

sentence = "Provision Maximum multiply owed caring on go gone going was this"
wordList = nltk.word_tokenize(sentence)

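# Use stemmer2 so that ignore_stopwords=True takes effect:
# stopwords such as "on", "was" and "this" are returned unchanged
stemWords = [stemmer2.stem(word) for word in wordList]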

print(' '.join(stemWords))

provis maximum multipli owe care on go gone go was this


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Language support

In [8]:
print(" ".join(SnowballStemmer.languages))

arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish


*3. Lancaster Stemmer*

It is a very aggressive algorithm.

It trims the working set down heavily, which has pros and cons: sometimes you may want this for your datasets, but most of the time you will avoid it.

In [9]:
from nltk.stem.lancaster import LancasterStemmer

lancasterStemmer = LancasterStemmer()

sentence = "Provision Maximum multiply owed caring on go gone going was this"
wordList = nltk.word_tokenize(sentence)

stemWords = [lancasterStemmer.stem(word) for word in wordList]

print(' '.join(stemWords))

provid maxim multiply ow car on go gon going was thi


The aggression shows in the input "caring": it was reduced to "car", which is an altogether different English word.
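One way to put a number on how much a stemmer trims the working set is to count the distinct stems it produces for a group of related words. The sketch below assumes NLTK is installed; the word list and the `distinct` dictionary are illustrative choices, and the counts will vary with the words you pick.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# A small group of related words
words = ["friend", "friendship", "friends", "friendships"]

stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer()}

# Number of distinct stems each stemmer leaves behind
distinct = {
    name: len({stemmer.stem(word) for word in words})
    for name, stemmer in stemmers.items()
}

for name, count in distinct.items():
    print(f"{name:10} {count} distinct stems from {len(words)} words")
```

A lower count means a smaller vocabulary, but also a higher risk of merging words that should stay apart.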

Small comparison

In [10]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer                          
friend              friend              friend              friend                        friend                                  
friendship          friendship          friendship          friend                        friendship                              
friends             friend              friend              friend                        friend                                  
friendships         friendship          friendship          friend                        friendship                              
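The Regexp Stemmer column comes from a much simpler idea than the other three: delete whatever matches a pattern. Below is a rough pure-Python sketch of that idea using the same pattern as above. The `regexp_stem` function is hypothetical, and it assumes the stemmer does nothing more than remove the match from words that have at least `min` characters.

```python
import re


def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Delete whatever matches `pattern`, leaving short words alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


samples = ["friends", "friendships", "caring", "was"]
regexp_results = {word: regexp_stem(word) for word in samples}

for word, stem in regexp_results.items():
    print(f"{word:14} -> {stem}")
```

Because "was" is shorter than four characters it is left untouched, while "caring" loses "ing" with no check on what remains.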
