# Stemming and Lemmatization with NLTK

This notebook gives an overview of stemming and lemmatization and what each method can do for you. The Natural Language Toolkit (NLTK) provides several implementations of both, so we will compare them against each other to give you a better idea of which one suits your needs.


## Importing Libraries

Before we begin, we will import all the required libraries for the notebook.

In [1]:
# Stemmers provided by NLTK
from nltk.stem.regexp import RegexpStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Lemmatizer provided by NLTK (requires the WordNet corpus: run nltk.download('wordnet') once)
from nltk.stem import WordNetLemmatizer


## What is Stemming?

Stemming is the process of reducing the inflected or derived forms of a word to a common root, called the stem. A simple example is removing the plural suffix "s" from a word. The resulting stem is not guaranteed to be a dictionary word. Stemming is useful in many applications, for example when streamlining a corpus or reducing variation in a body of text.
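
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The function name `toy_stem` and its suffix list are illustrative assumptions and are not part of NLTK; real stemmers apply far more careful rules.

```python
import re

# Hypothetical suffix list, for illustration only.
SUFFIX_PATTERN = re.compile(r"(ing|ed|s)$")

def toy_stem(word, min_length=3):
    """Lowercase a word and strip one common suffix if the word is long enough."""
    word = word.lower()
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)

for form in ["Working", "Works", "Worked", "Work"]:
    print(form, "->", toy_stem(form))
```

A naive rule like this also mangles words such as "sing", which it reduces to "s". This is one reason to prefer an established stemmer over hand-written rules.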

## Stemming

We will first demonstrate how to use PorterStemmer, as it is one of the more popular stemmers. All other stemmers are initialized and used in the same way, so switching from one stemmer to the next only requires changing the class name.


In [2]:
# Initialize the stemmer to use
portStem = PorterStemmer()

# Try out the stemmer on variations of the word "Work"
print("Stemming 'Working' \t= " + portStem.stem("Working"))
print("Stemming 'Works' \t= " + portStem.stem("Works"))
print("Stemming 'Worked' \t= " + portStem.stem("Worked"))
print("Stemming 'Work' \t= " + portStem.stem("Work"))

Stemming 'Working' 	= work
Stemming 'Works' 	= work
Stemming 'Worked' 	= work
Stemming 'Work' 	= work


As you can see, the stemmer reduced the different forms of the word "Work" to the stem "work", lowercasing the input along the way. Next we will showcase the other stemmers found in NLTK. The differences between the stemmers are described in the documentation here:
http://www.nltk.org/api/nltk.stem.html

The stemmers happen to agree on the following examples (apart from letter case), but this is not always the case. It is therefore worth researching the different stemmers before choosing one.

Feel free to skip this part and proceed to the lemmatization section if the previous stemmer fits your needs.

In [3]:
# Lancaster Stemmer
lanStem = LancasterStemmer()
print("LancasterStemmer 'Working' \t= " + lanStem.stem("Working"))
print("LancasterStemmer 'Works' \t= " + lanStem.stem("Works"))
print("LancasterStemmer 'Worked' \t= " + lanStem.stem("Worked"))
print("LancasterStemmer 'Work' \t= " + lanStem.stem("Work"))
print("\n========================================\n") 

# SnowballStemmer (also supports languages other than English)
snowStem = SnowballStemmer("english") # Choose a language
print("SnowballStemmer 'Working' \t= " + snowStem.stem("Working"))
print("SnowballStemmer 'Works' \t= " + snowStem.stem("Works"))
print("SnowballStemmer 'Worked' \t= " + snowStem.stem("Worked"))
print("SnowballStemmer 'Work'  \t= " + snowStem.stem("Work"))
print("\n========================================\n") 

# RegexpStemmer (stems via regular expressions; min is the minimum word length to stem)
RegStem = RegexpStemmer('ing$|s$|ed$', min=3)
print("RegexpStemmer 'Working' \t= " + RegStem.stem("Working"))
print("RegexpStemmer 'Works'   \t= " + RegStem.stem("Works"))
print("RegexpStemmer 'Worked'   \t= " + RegStem.stem("Worked"))
print("RegexpStemmer 'Work'    \t= " + RegStem.stem("Work"))
print("\n========================================\n") 


LancasterStemmer 'Working' 	= work
LancasterStemmer 'Works' 	= work
LancasterStemmer 'Worked' 	= work
LancasterStemmer 'Work' 	= work


SnowballStemmer 'Working' 	= work
SnowballStemmer 'Works' 	= work
SnowballStemmer 'Worked' 	= work
SnowballStemmer 'Work'  	= work


RegexpStemmer 'Working' 	= Work
RegexpStemmer 'Works'   	= Work
RegexpStemmer 'Worked'   	= Work
RegexpStemmer 'Work'    	= Work
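
Because every stemmer above exposes the same `stem` method, the comparison can also be written as a loop. The following is a small sketch rather than part of the original notebook; the `stemmers` dictionary and the `compare_stemmers` helper are names introduced here for illustration.

```python
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.regexp import RegexpStemmer
from nltk.stem.snowball import SnowballStemmer

# Illustrative mapping of label -> stemmer instance (the labels are our own).
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Regexp": RegexpStemmer("ing$|s$|ed$", min=3),
}

def compare_stemmers(words):
    """Return {label: [stem of each word]} for every stemmer in `stemmers`."""
    return {label: [s.stem(w) for w in words] for label, s in stemmers.items()}

results = compare_stemmers(["Working", "Works", "Worked", "Work"])
for label, stems in results.items():
    print(f"{label:10} {stems}")

# SnowballStemmer lists the languages it supports as a class attribute.
print(SnowballStemmer.languages)
```

Note that RegexpStemmer only removes the matching suffix and leaves letter case untouched, which is why its results keep the capital "W".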




## Lemmatization 

Lemmatization is the process of grouping the different inflected forms of a word into a single item. 

At first glance this seems very similar to stemming. The two are indeed closely related, but there are subtle differences. Stemmers work without knowledge of the context in which a word is used, whereas lemmatization makes use of a vocabulary and morphological analysis. For this reason, stemming is faster than lemmatization on larger bodies of text.

For additional information, follow this link to the documentation: http://www.nltk.org/api/nltk.stem.html
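
The role of the vocabulary can be illustrated with a deliberately tiny sketch in plain Python. The lookup table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for a real lexical database such as WordNet. They only show why a dictionary lookup can handle irregular forms that suffix stripping cannot.

```python
# Hypothetical mini-vocabulary mapping (word, part of speech) -> lemma.
# A real lemmatizer consults a full lexical database instead.
TOY_LEMMAS = {
    ("mice", "n"): "mouse",
    ("geese", "n"): "goose",
    ("went", "v"): "go",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up in the vocabulary; return it unchanged if it is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("mice"))           # mouse
print(toy_lemmatize("went", pos="v"))  # go
print(toy_lemmatize("went"))           # went (not listed as a noun, so unchanged)
```

The last call shows that the result depends on the part of speech supplied with the word, which is information a stemmer never uses.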


In [4]:
# Initialize the lemmatizer
wordLemmatizer = WordNetLemmatizer()

print("WordNetLemmatizer 'Working' \t= " + wordLemmatizer.lemmatize("Working"))
print("WordNetLemmatizer 'Works'   \t= " + wordLemmatizer.lemmatize("Works"))
print("WordNetLemmatizer 'Worked'   \t= " + wordLemmatizer.lemmatize("Worked"))
print("WordNetLemmatizer 'Work'    \t= " + wordLemmatizer.lemmatize("Work"))

WordNetLemmatizer 'Working' 	= Working
WordNetLemmatizer 'Works'   	= Works
WordNetLemmatizer 'Worked'   	= Worked
WordNetLemmatizer 'Work'    	= Work


## Showcase the difference between lemmatization and stemming

Now that both methods have been demonstrated, we will provide an example that shows their differences in more detail. We use the Porter stemmer as the stemming example and the lemmatizer with its default settings.

In [5]:
# Initialize Stemmer
portStem = PorterStemmer()

# Initialize Lemmatizer
wordLemmatizer = WordNetLemmatizer()

# List of words to test on
Words = ["operate" ,"operating" ,"operates", "operation" ,"operative", "operatives" ,"operational"]
StemWords = []
LemmWords = []

# Stem and lemmatize each word in the list
for word in Words:
    StemWords.append(portStem.stem(word))
    LemmWords.append(wordLemmatizer.lemmatize(word))
    
# Print the results of each entry
for original, stemmed, lemmatized in zip(Words, StemWords, LemmWords):
    print("Original  Word: " + original)
    print("Stemmed   Word: " + stemmed)
    print("Lemmatize Word: " + lemmatized)
    print("\n")

Original  Word: operate
Stemmed   Word: oper
Lemmatize Word: operate


Original  Word: operating
Stemmed   Word: oper
Lemmatize Word: operating


Original  Word: operates
Stemmed   Word: oper
Lemmatize Word: operates


Original  Word: operation
Stemmed   Word: oper
Lemmatize Word: operation


Original  Word: operative
Stemmed   Word: oper
Lemmatize Word: operative


Original  Word: operatives
Stemmed   Word: oper
Lemmatize Word: operative


Original  Word: operational
Stemmed   Word: oper
Lemmatize Word: operational
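
One practical consequence of the output above is that stemming collapses all seven forms into a single token, while the lemmatizer (with its default settings) keeps most of them distinct. A small sketch of measuring that reduction with the Porter stemmer follows; `vocabulary_size` is a helper name introduced here.

```python
from nltk.stem.porter import PorterStemmer

words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]

def vocabulary_size(tokens):
    """Number of distinct tokens in a list."""
    return len(set(tokens))

stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in words]

print("Distinct words:", vocabulary_size(words))
print("Distinct stems:", vocabulary_size(stems))
```

The smaller vocabulary comes at a cost: "oper" is not a word, and the distinction between, say, "operation" and "operative" is lost.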




## Conclusion 

In this notebook we went over the different stemming and lemmatization methods found in NLTK. Additionally, we showcased the differences between them.