# Stemming

-> When we search text for a certain keyword, it helps if the search also returns variations of the word.

-> For example, searching for "boat" might also return "boats" and "boating". Here "boat" would be the stem for [boat, boater, boating, boats].

-> Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached.

-> This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required.
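
To make the "chop letters off the end" idea concrete, here is a minimal sketch of a stem-based keyword search. The suffix list and the names `crude_stem` and `search` are hypothetical and chosen only for this illustration; this is not how any real stemmer is implemented.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ing", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(keyword, words):
    """Return every word that shares a crude stem with the keyword."""
    target = crude_stem(keyword)
    return [w for w in words if crude_stem(w) == target]


print(search("boat", ["boat", "boater", "boating", "boats", "ship"]))
print(search("run", ["run", "runs", "ran", "running"]))
```

The first search finds all four "boat" variants. The second finds only "run" and "runs": "ran" has no suffix to strip and "running" is cut to "runn", which is the kind of exception that plain suffix chopping cannot handle.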



# Porter's Algorithm


In [30]:
"""

-> One of the most common and effective stemming tools is
Porter's Algorithm, developed by Martin Porter in 1980.

-> The algorithm employs 5 phases of word reduction,
each with its own set of mapping rules.

-> In the 1st phase, simple suffix mapping rules are defined such as:

S1 to S2:

1.) SSES -> SS
2.) IES -> I
3.) SS -> SS
4.) S ->



word to stem:

caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat

-> From a given set of stemming rules, only one rule is applied:
the one whose S1 matches the longest suffix of the word.

-> Thus caresses reduces to caress but not to cares.
That way, different types of words are not confused and
mapped to the wrong stem.

-> Some of the more sophisticated phases consider
the length/complexity of the word (its measure, m) before applying a rule.
For example,

S1 --> S2
(m>0) ATIONAL -> ATE
(m>0) EED -> EE


WORD --> stem

relational -> relate
national -> national

agreed -> agree
feed -> feed


"""

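The rules quoted above can be sketched in a few lines. This is only an illustration of the longest-suffix rule and the (m>0) condition, not the full Porter algorithm; the names `step_1a`, `is_consonant`, `measure` and `apply_measured_rule` are hypothetical helpers, and the measure follows Porter's [C](VC)^m[V] definition.

```python
# Phase 1 rules quoted above, as (S1, S2) pairs.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word):
    """Apply only the rule whose S1 is the longest matching suffix."""
    by_length = sorted(STEP_1A_RULES, key=lambda rule: len(rule[0]), reverse=True)
    for s1, s2 in by_length:
        if word.endswith(s1):
            return word[: -len(s1)] + s2
    return word


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_is_vowel:
            m += 1
        prev_is_vowel = not consonant
    return m


def apply_measured_rule(word, s1, s2):
    """Replace S1 with S2 only if the remaining stem has m > 0."""
    if word.endswith(s1):
        stem = word[: -len(s1)]
        if measure(stem) > 0:
            return stem + s2
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))

for w, s1, s2 in [("relational", "ational", "ate"), ("national", "ational", "ate"),
                  ("agreed", "eed", "ee"), ("feed", "eed", "ee")]:
    print(w, "->", apply_measured_rule(w, s1, s2))
```

In "national" and "feed" the part left after removing the suffix ("n", "f") has m = 0, so the rule does not fire and the word is returned unchanged.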
# Snowball

In [31]:
"""

Snowball is the name of a stemming language also developed by Martin Porter.

The algorithm used here is more accurately called the "English Stemmer"
or "Porter2 Stemmer".

It offers a slight improvement over the original Porter Stemmer, both
in logic and speed.


"""




In [32]:
import nltk

In [33]:
from nltk.stem.porter import PorterStemmer

In [34]:
p_stem = PorterStemmer()

In [35]:
words = ["run", "runner", "ran", "runs", "Love", "Malav", "Maya", "Mrugesh",
         "fairly", "easily", "notably"]

In [36]:
for word in words:
    print(word + "--->"+p_stem.stem(word))
    print("\n")



run--->run


runner--->runner


ran--->ran


runs--->run


Love--->love


Malav--->malav


Maya--->maya


Mrugesh--->mrugesh


fairly--->fairli


easily--->easili


notably--->notabl




In [37]:
from nltk.stem.snowball import SnowballStemmer

In [38]:
s_stem = SnowballStemmer(language='english')

In [39]:
for word in words:
    print(word + "--->"+s_stem.stem(word))
    print("\n")



run--->run


runner--->runner


ran--->ran


runs--->run


Love--->love


Malav--->malav


Maya--->maya


Mrugesh--->mrugesh


fairly--->fair


easily--->easili


notably--->notabl
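
Since the two outputs are mostly identical, it can help to print only the words on which the stemmers disagree. The sketch below assumes a hypothetical helper, `disagreements`, that accepts any mapping from a label to a stemming callable; the NLTK part uses the same `PorterStemmer` and `SnowballStemmer` classes as the cells above and is skipped if NLTK is not installed.

```python
def disagreements(words, stemmers):
    """Return {word: {label: stem}} for words whose stems differ.

    `stemmers` maps a label to any callable that takes a word
    and returns its stem.
    """
    result = {}
    for word in words:
        stems = {label: stem(word) for label, stem in stemmers.items()}
        if len(set(stems.values())) > 1:
            result[word] = stems
    return result


if __name__ == "__main__":
    try:
        from nltk.stem.porter import PorterStemmer
        from nltk.stem.snowball import SnowballStemmer
    except ImportError:
        print("NLTK is not installed; skipping the comparison.")
    else:
        stemmers = {
            "porter": PorterStemmer().stem,
            "snowball": SnowballStemmer(language="english").stem,
        }
        words = ["run", "runner", "ran", "runs", "fairly", "easily", "notably"]
        for word, stems in disagreements(words, stemmers).items():
            print(word, stems)
```

Going by the outputs recorded above, "fairly" is the only word in this list that the two stemmers treat differently ("fairli" versus "fair").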




In [40]:
# Both stemmers follow a fixed set of algorithmic rules to reduce each
# word to a root form. The result (e.g. "fairli") is not always a
# dictionary word.



In [41]:

words = ['Adventure', 'Harmony', 'Breeze', 'Enigma', 'Velvet', 'Radiance',
         'Whimsy', 'Infinite', 'Reverie', 'Cascade']

for word in words:
    print(word + " ------->"+p_stem.stem(word))

print(" ")

for word in words:
    print(word + " ------->"+s_stem.stem(word))

Adventure ------->adventur
Harmony ------->harmoni
Breeze ------->breez
Enigma ------->enigma
Velvet ------->velvet
Radiance ------->radianc
Whimsy ------->whimsi
Infinite ------->infinit
Reverie ------->reveri
Cascade ------->cascad
 
Adventure ------->adventur
Harmony ------->harmoni
Breeze ------->breez
Enigma ------->enigma
Velvet ------->velvet
Radiance ------->radianc
Whimsy ------->whimsi
Infinite ------->infinit
Reverie ------->reveri
Cascade ------->cascad
