# Stemming
Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the **stem** for [boat, boater, boating, boats].

Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached.

We'll use a popular NLP tool called **nltk**, which stands for *Natural Language Toolkit*. For more information on nltk, visit https://www.nltk.org/

## Porter Stemmer

One of the most common - and effective - stemming tools is [*Porter's Algorithm*](https://tartarus.org/martin/PorterStemmer/) developed by Martin Porter in [1980](https://tartarus.org/martin/PorterStemmer/def.txt). The algorithm employs five phases of word reduction, each with its own set of mapping rules. In the first phase, simple suffix mapping rules are defined, such as:

![stemming1.png](https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/stemming1.png)

From a given set of stemming rules only one rule is applied, based on the longest suffix S1. Thus, `caresses` reduces to `caress` but not `cares`.
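The longest-suffix selection can be sketched in a few lines of plain Python. This is only an illustration, not nltk's implementation: the rule list is the step 1a table from Porter's paper (`SSES -> SS`, `IES -> I`, `SS -> SS`, `S ->` nothing), and the names `STEP_1A_RULES` and `apply_longest_suffix_rule` are made up for this example.

```python
# Step 1a suffix rules from Porter's paper, as (suffix, replacement) pairs.
# The names used here are illustrative and not part of nltk.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest_suffix_rule(word, rules=STEP_1A_RULES):
    """Apply only the rule whose suffix is the longest match for the word."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w + " --> " + apply_longest_suffix_rule(w))
```

Because `sses` is longer than `s`, `caresses` is handled by the `SSES -> SS` rule and never reaches the plain `S` rule, which is why it does not become `caresse` or `cares`.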

More sophisticated phases consider the length/complexity of the word before applying a rule. For example:

![stemming2.png](https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/stemming2.png)

Here `m>0` describes the "measure" of the stem, such that the rule is applied to all but the most basic stems.
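To make the measure concrete, the sketch below counts vowel-consonant sequences following the definition in Porter's paper, where every word has the form `[C](VC)^m[V]`. The helper names `is_consonant`, `measure` and `apply_eed_rule` are illustrative and are not part of nltk. The `(m>0) EED -> EE` rule used for the demonstration comes from step 1b of the paper and is not necessarily the rule pictured above. Input is assumed to be lowercase.

```python
import re

def is_consonant(word, i):
    """Porter's definition: any letter other than a, e, i, o, u,
    and other than a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' counts as a vowel only when the previous letter is a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of vowel-consonant (VC) sequences in the stem."""
    letters = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    # Collapse runs of consonants and vowels into single C and V symbols
    collapsed = re.sub(r"v+", "V", re.sub(r"c+", "C", letters))
    return collapsed.count("VC")

def apply_eed_rule(word):
    """(m>0) EED -> EE: rewrite only when the stem before 'eed' has m > 0."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-3] + "ee"
    return word

for w in ["tree", "trouble", "troubles"]:
    print(w + ": m = " + str(measure(w)))
for w in ["feed", "agreed"]:
    print(w + " --> " + apply_eed_rule(w))
```

Here `feed` is left alone because its stem `f` has measure 0, while `agreed` becomes `agree` because the stem `agr` has measure 1.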

In [None]:
# Import the toolkit and the Porter Stemmer class
import nltk

from nltk.stem.porter import PorterStemmer

In [None]:
p_stemmer = PorterStemmer()

In [None]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [None]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


<font color=green>Note that the stemmer leaves "runner" unchanged, as if it were an unrelated noun rather than a form of "run", and that the irregular past tense "ran" is also left alone. The adverbs "easily" and "fairly" are reduced to the unusual stems "easili" and "fairli".</font>
___

## Snowball Stemmer - Italian
The name is somewhat of a misnomer, as Snowball is the name of a stemming language developed by Martin Porter rather than a single algorithm. Its English stemmer, more accurately called the "English Stemmer" or "Porter2 Stemmer", offers a slight improvement over the original Porter stemmer, both in logic and speed. Snowball also defines stemmers for other languages, and **nltk** exposes all of them under the name SnowballStemmer, so we'll use that name here with the Italian stemmer.
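Since the cells that follow work on Italian words, a short side-by-side on English words may help show what the "Porter2" improvement looks like. The snippet below is a sketch that assumes nltk is installed; `SnowballStemmer.languages` lists the languages nltk's Snowball implementation accepts, and the variable names `porter_en` and `snowball_en` are chosen only for this example.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Languages accepted by the `language` parameter
print(SnowballStemmer.languages)

porter_en = PorterStemmer()
snowball_en = SnowballStemmer(language='english')

# Compare the original Porter stemmer with the English Snowball (Porter2) stemmer
for word in ['running', 'easily', 'fairly']:
    print(word + ' --> ' + porter_en.stem(word) + ' | ' + snowball_en.stem(word))
```

The two stemmers should agree on most words; the visible difference in this list is "fairly", which Porter reduces to "fairli" and the English Snowball stemmer reduces to "fair".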

In [None]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='italian')

In [None]:
words = ['divertente','diverte','divertito','divertire','mio','miei']
# words = ['generous','generation','generously','generate']

In [None]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

divertente --> divertent
diverte --> divert
divertito --> divert
divertire --> divert
mio --> mio
miei --> mie


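<font color=green>In this case the Italian stemmer maps the verb forms "diverte", "divertito" and "divertire" to the common stem "divert", while "divertente" is reduced to "divertent". The possessive "mio" is left unchanged and "miei" becomes "mie", so these two forms do not share a stem.</font>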
___

## Exercises

In [None]:
words = ['consigliammo']

In [None]:
print('Porter Stemmer:')
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

Porter Stemmer:
consigliammo --> consigliammo


In [None]:
print('Snowball Stemmer (Italian):')
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

Snowball Stemmer (Italian):
consigliammo --> consigl


___
Stemming has its drawbacks. Given the token `saw`, a stemmer will typically return `saw` every time, whereas a lemmatizer would likely return either `see` or `saw` depending on whether the token was used as a verb or a noun. Stemming rules are also language-specific: the example below applies the English Porter stemmer to an Italian sentence, and most tokens come back unchanged.

In [None]:
phrase = 'Giocare a bowling con i miei amici mi diverte molto'
for word in phrase.split():
    print(word+' --> '+p_stemmer.stem(word))

Giocare --> giocar
a --> a
bowling --> bowl
con --> con
i --> i
miei --> miei
amici --> amici
mi --> mi
diverte --> divert
molto --> molto


### Stemming and Lemmatization

💡 “Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes….Stemmers use language-specific rules, but they require less knowledge than a lemmatizer…”

💡 “Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma….a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words…”

In [None]:
## import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# Import packages
import pandas as pd
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
# Instantiate stemmers and lemmatiser
porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatiser = WordNetLemmatizer()
# Create function that normalises text using all three techniques
def normalise_text(words, pos='v'):
    """Stem and lemmatise each word in a list. Return output in a dataframe."""
    normalised_text = pd.DataFrame(index=words, columns=['Porter', 'Lancaster', 'Lemmatiser'])
    for word in words:
        normalised_text.loc[word,'Porter'] = porter.stem(word)
        normalised_text.loc[word,'Lancaster'] = lancaster.stem(word)
        normalised_text.loc[word,'Lemmatiser'] = lemmatiser.lemmatize(word, pos=pos)
    return normalised_text

💡 PorterStemmer: One of the most commonly used stemmers. It is based on the Porter Stemming Algorithm. For more information, check out the official web page: https://tartarus.org/martin/PorterStemmer/

💡 LancasterStemmer: It is based on the Lancaster Stemming Algorithm and can sometimes stem more aggressively than PorterStemmer.

In [None]:
normalise_text(['apples', 'pears', 'tasks', 'children', 'earrings', 
                'dictionary', 'marriage', 'connections', 'universe', 'university'], pos='n')

| Word | Porter | Lancaster | Lemmatiser |
|---|---|---|---|
| apples | appl | appl | apple |
| pears | pear | pear | pear |
| tasks | task | task | task |
| children | children | childr | child |
| earrings | ear | ear | earring |
| dictionary | dictionari | dict | dictionary |
| marriage | marriag | marry | marriage |
| connections | connect | connect | connection |
| universe | univers | univers | universe |
| university | univers | univers | university |


In [None]:
normalise_text(['wrote', 'thinking', 'remembered', 'relies', 'ate', 'gone', 'won',
                'ran', 'swimming', 'mistreated'], pos='v')

| Word | Porter | Lancaster | Lemmatiser |
|---|---|---|---|
| wrote | wrote | wrot | write |
| thinking | think | think | think |
| remembered | rememb | rememb | remember |
| relies | reli | rely | rely |
| ate | ate | at | eat |
| gone | gone | gon | go |
| won | won | won | win |
| ran | ran | ran | run |
| swimming | swim | swim | swim |
| mistreated | mistreat | mist | mistreat |
