# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the **stem** for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. In fact, spaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization. For those interested, there's some background on this decision [here](https://github.com/explosion/spaCy/issues/327). We discuss the virtues of *lemmatization* in the next section.

Since spaCy has no stemmer, we'll use another popular NLP tool called **NLTK**, which stands for *Natural Language Toolkit*. For more information on NLTK, visit https://www.nltk.org/
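Before bringing in a library, the "chop off letters from the end" idea can be made concrete with a deliberately naive sketch. The suffix list and the name `naive_stem` are invented for this illustration and are not how NLTK or Porter's algorithm work; the point is only to show why plain suffix stripping handles the `boat` family but breaks on other words.

```python
# A hand-picked, hypothetical suffix list (not taken from any real stemmer).
SUFFIXES = ["ing", "er", "s"]

def naive_stem(word):
    """Strip the longest matching suffix, if any, and return the rest."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require at least one letter to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

for word in ["boat", "boater", "boating", "boats", "ring", "ran"]:
    print(word + " -----> " + naive_stem(word))
```

The four `boat` variants all collapse to `boat`, but `ring` is cut down to `r` and `ran` is never connected to `run`. Cases like these are why real stemmers add conditions to their rules.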

## Porter Stemmer

One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined, such as:

![image.png](attachment:image.png)

From a given set of stemming rules only one rule is applied: the one whose suffix S1 is the longest match for the word. Thus, `caresses` reduces to `caress` (SSES is the longest matching suffix), while `cares` reduces to `care`.
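The longest-match selection can be sketched in a few lines of Python. The rule table below is step 1a from Porter's paper (SSES → SS, IES → I, SS → SS, S → nothing). The names `STEP_1A_RULES` and `apply_step_1a` are made up for this illustration and are not part of NLTK, and this covers only one rule set, not the full algorithm.

```python
# Step 1a suffix rules from Porter's 1980 paper, written as (S1, S2) pairs.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]

def apply_step_1a(word):
    """Apply the single rule whose suffix S1 is the longest match for word."""
    matches = [(s1, s2) for s1, s2 in STEP_1A_RULES if word.endswith(s1)]
    if not matches:
        return word  # no rule applies
    s1, s2 = max(matches, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(s1)] + s2

for word in ["caresses", "ponies", "caress", "cats"]:
    print(word + " -----> " + apply_step_1a(word))
```

Note that `caress` matches both SS and S. Because SS is longer, the word is left unchanged instead of losing its final `s`.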

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:

![image.png](attachment:image.png)

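Here `m` is the "measure" of the stem: roughly, the number of vowel-consonant sequences it contains. The condition `m>0` means the rule fires only when the stem left after removing the suffix contains at least one such sequence, so the rule is applied to all but the most basic stems.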
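As a rough sketch of how the measure could be computed, the code below follows the definition in Porter's paper: a word has the form `[C](VC)^m[V]`, where `C` is a run of consonants and `V` a run of vowels, and `y` counts as a vowel only when it follows a consonant. The helper names `is_consonant`, `measure` and `apply_eed_rule` are hypothetical. The single rule shown, `(m>0) EED -> EE`, is taken from the paper and may not be one of the rules in the figure above.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences (m) in a stem of the form [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m

def apply_eed_rule(word):
    """(m>0) EED -> EE: rewrite only if the stem before 'eed' has m > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

for word in ["tree", "trouble", "private"]:
    print(word, "has measure", measure(word))

for word in ["feed", "agreed"]:
    print(word + " -----> " + apply_eed_rule(word))
```

With this sketch, `feed` is left alone because its stem `f` has measure 0, while `agreed` becomes `agree` because its stem `agr` has measure 1.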

spaCy does not include a stemmer, opting instead for lemmatization. NLTK, however, does include a stemmer.

In [2]:
import nltk


## Porter Stemmer in NLTK

In [None]:
from nltk.stem.porter import PorterStemmer

In [4]:
p_stemmer = PorterStemmer()

In [13]:
words = ['run','runner','ran','runs','easily','fairly','fairness']

In [20]:
# run stemmer to see the stem of each word
print('WORD:     STEM:')
for word in words:
    print(word + ' -----> ' + p_stemmer.stem(word))

WORD:     STEM:
run -----> run
runner -----> runner
ran -----> ran
runs -----> run
easily -----> easili
fairly -----> fairli
fairness -----> fair


## Snowball Stemmer

In [15]:
from nltk.stem.snowball import SnowballStemmer

In [16]:
s_stemmer = SnowballStemmer(language='english')

In [19]:
# run stemmer to see the stem of each word
print('WORD:     STEM:')
for word in words:
    print(word + ' -----> ' + s_stemmer.stem(word))

WORD:     STEM:
run -----> run
runner -----> runner
ran -----> ran
runs -----> run
easily -----> easili
fairly -----> fair
fairness -----> fair


### Compare Porter Stemmer vs. Snowball Stemmer

In [21]:
words = ['generous','generation','generously','generate']

In [22]:
# PORTER STEMMER
print('WORD:     STEM:')
for word in words:
    print(word + ' -----> ' + p_stemmer.stem(word))

WORD:     STEM:
generous -----> gener
generation -----> gener
generously -----> gener
generate -----> gener


In [23]:
# SNOWBALL STEMMER 
print('WORD:     STEM:')
for word in words:
    print(word + ' -----> ' + s_stemmer.stem(word))

WORD:     STEM:
generous -----> generous
generation -----> generat
generously -----> generous
generate -----> generat
