# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the **stem** for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision [here](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.

Since spaCy has no stemmer, we'll use another popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk, visit https://www.nltk.org/.

## Porter Stemmer

One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/), developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are applied, such as rewriting a trailing "sses" as "ss".
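
To make the idea of suffix mapping rules concrete, here is a minimal sketch of step 1a from Porter's original definition, written in plain Python. The names `STEP_1A_RULES` and `porter_step_1a` are made up for this illustration, and nltk's own implementation may differ in its details, so treat this as a teaching aid rather than a replacement for the library.

```python
# Step 1a of Porter's original algorithm as (suffix, replacement) pairs.
# The rules are ordered so that the longest matching suffix is tried first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            # Only the first (longest) matching rule is applied
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats"]:
    print(word + " --> " + porter_step_1a(word))
```

The later phases apply further rules of the same kind, many of them conditional on the shape of the remaining stem.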

In [2]:
!pip install nltk



In [3]:
# Import the toolkit and the Porter Stemmer class
import nltk

# Import the class by name rather than using a wildcard import
from nltk.stem.porter import PorterStemmer

In [4]:
p_stemmer = PorterStemmer()

In [14]:
words = ['run', 'running', 'ran', 'runs', 'runner', 'passively', 'Casually', 'Casuality', 'fairly']

In [15]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
running --> run
ran --> ran
runs --> run
runner --> runner
passively --> passiv
Casually --> casual
Casuality --> casual
fairly --> fairli


<font color=green>Note that the stemmer leaves "runner" unchanged rather than reducing it to "run", and that the irregular past tense "ran" is also left as is. The adverbs "passively" and "fairly" are stemmed to the unusual roots "passiv" and "fairli".</font>
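
The keyword-search motivation from the introduction can be sketched with these stems. The example below is only an illustration: the `documents` dictionary is a made-up mini-corpus and `search` is a hypothetical helper, not part of nltk. It indexes each document by the stems of its words, so a query matches any word sharing the same stem.

```python
from collections import defaultdict
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-corpus, invented for this illustration
documents = {
    0: "She runs every morning",
    1: "Running is good exercise",
    2: "He ran a marathon",
}

# Map each stem to the set of document ids containing a word with that stem
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[stemmer.stem(token)].add(doc_id)

def search(keyword):
    """Return the sorted ids of documents sharing a stem with the keyword."""
    return sorted(index.get(stemmer.stem(keyword), set()))

print(search("run"))  # [0, 1]
```

Document 2 is not returned for "run" because the stemmer leaves "ran" unchanged, which is exactly the kind of irregular form that stemming cannot handle.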

## Snowball Stemmer
This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed. Since **nltk** uses the name SnowballStemmer, we'll use it here.

In [9]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
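
As a quick check of which values the language parameter accepts, the class exposes a `languages` tuple. The sketch below prints it and shows what happens with an unsupported name. The name `'klingon'` is just a deliberately invalid example.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names accepted by the constructor
print(SnowballStemmer.languages)

# An unsupported language name raises a ValueError
try:
    SnowballStemmer(language='klingon')
except ValueError as err:
    print(err)
```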

In [10]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [11]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


In [None]:
words = ['run', 'running', 'ran', 'runs', 'runner', 'passively', 'Casually', 'Casuality']

In [13]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
running --> run
ran --> ran
runs --> run
runner --> runner
passively --> passiv
Casually --> casual
Casuality --> casual


<font color=green>On these words the stemmer performed the same as the Porter Stemmer. In the earlier list it handled "fairly" more appropriately, returning "fair", although "easily" was still stemmed to "easili".</font>

## Trying a few more words

In [16]:
words = ['constantly', 'constant']

In [17]:
print('Porter Stemmer:')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

Porter Stemmer:
constantly --> constantli
constant --> constant


In [18]:
print('Porter2 Stemmer:')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

Porter2 Stemmer:
constantly --> constant
constant --> constant


In [22]:
words = ['difference', 'indifferent', 'different', 'differentiation', 'defer']

In [23]:
print('Porter Stemmer:')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

Porter Stemmer:
difference --> differ
indifferent --> indiffer
different --> differ
differentiation --> differenti
defer --> defer


In [24]:
print('Porter2 Stemmer:')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

Porter2 Stemmer:
difference --> differ
indifferent --> indiffer
different --> differ
differentiation --> differenti
defer --> defer
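
Rather than running each stemmer in a separate cell, the two can be compared side by side. This is a small convenience sketch, and `compare_stemmers` is a hypothetical helper name introduced here. It flags the words on which the two stemmers disagree.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
porter2 = SnowballStemmer(language='english')

def compare_stemmers(words):
    """Return (word, Porter stem, Porter2 stem) tuples for each word."""
    return [(w, porter.stem(w), porter2.stem(w)) for w in words]

rows = compare_stemmers(['fairly', 'constantly', 'different', 'defer'])
for word, p_stem, s_stem in rows:
    # Mark rows where the two algorithms produce different stems
    flag = '' if p_stem == s_stem else '  <-- differs'
    print(f'{word:<12} {p_stem:<12} {s_stem:<12}{flag}')
```

Based on the outputs above, "fairly" and "constantly" should be flagged, while "different" and "defer" should not.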


___
Stemming has its drawbacks. If given the token `saw`, stemming might always return `saw`, whereas lemmatization would likely return either `see` or `saw` depending on whether the use of the token was as a verb or a noun. As an example, consider the following:

In [25]:
phrase = 'I am meeting him tomorrow at the meeting'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> i
am --> am
meeting --> meet
him --> him
tomorrow --> tomorrow
at --> at
the --> the
meeting --> meet


In [26]:
phrase = 'I saw him today'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> i
saw --> saw
him --> him
today --> today


In [27]:
phrase = 'I saw him yesterday'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I --> i
saw --> saw
him --> him
yesterday --> yesterday


In [28]:
phrase = "I'm seeing him now"
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

I'm --> i'm
seeing --> see
him --> him
now --> now
