# Stemming and Lemmatization with NLTK

This notebook gives an overview of what stemming and lemmatization are and what they can do for you. Since NLTK (Natural Language Toolkit) provides several ways to do both, we compare them against each other to give you a better idea of which one suits your needs.

Written Janu

## Importing Libraries

Before we begin, we import all the libraries needed for the notebook.

In [26]:
# Stemmers provided by NLTK
from nltk.stem.regexp import RegexpStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Lemmatizer provided by NLTK
from nltk.stem import WordNetLemmatizer


## What is Stemming?

Stemming is the process of reducing words to their base or root form. A simple example is removing the plural "s" from words. This is useful in many applications, such as streamlining a corpus or reducing variation in a body of text.
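
To make the idea concrete, here is a deliberately naive suffix stripper in plain Python. It is only a sketch: the function name `naive_stem`, the suffix list, and the minimum stem length are assumptions made for illustration, and this is not how NLTK's stemmers work internally.

```python
def naive_stem(word, suffixes=("ing", "ed", "s"), min_stem_len=3):
    """Lowercase the word and strip the first matching suffix.

    A suffix is only removed if at least `min_stem_len` characters remain,
    which stops short words such as "bus" from being cut down.
    """
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered


for w in ["Working", "Works", "Worked", "Work", "bus"]:
    print(w, "->", naive_stem(w))
```

Real stemmers such as Porter's apply many ordered rules with extra conditions on the remaining stem, which is why the rest of this notebook relies on NLTK's implementations.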

## Stemming

We will first demonstrate how to use PorterStemmer, since it is a popular choice. All the other stemmers are initialized and used in the same way, so switching from one stemmer to another only requires swapping the class (and, for some, its constructor arguments).

In [23]:
# Initialize the stemmer to use
portStem = PorterStemmer()

# Try out the stemmer on variations of the word "Work"
print("Stemming 'Working' \t= " + portStem.stem("Working"))
print("Stemming 'Works' \t= " + portStem.stem("Works"))
print("Stemming 'Worked' \t= " + portStem.stem("Worked"))
print("Stemming 'Work' \t= " + portStem.stem("Work"))

Stemming 'Working' 	= work
Stemming 'Works' 	= work
Stemming 'Worked' 	= work
Stemming 'Work' 	= work


As you can see, stemming converted the different forms of the word "Work" back into "work". Next we showcase the other stemmers found in NLTK. The differences between the stemmers are described in their documentation:
http://www.nltk.org/api/nltk.stem.html

Although the stemmers mostly agree on the examples below, they can produce different results for other words, so it is worth researching which stemmer suits your data before choosing one.
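
As a rough illustration that the stemmers can disagree, the sketch below runs the same words through the Porter, Lancaster, and Snowball stemmers and flags the words where the results differ. The word list and variable names are assumptions chosen for illustration, and the exact stems may vary between NLTK versions, so no output is shown.

```python
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# All three stemmers expose the same .stem(word) method
stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
}

# Hypothetical sample words, picked only to provoke disagreement
sample_words = ["generously", "maximum", "fairly", "working"]

# Map each word to the stem produced by each stemmer
results = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in sample_words
}

for word, stems in results.items():
    agree = len(set(stems.values())) == 1
    note = "(all agree)" if agree else "(stemmers differ)"
    print(f"{word:<12} {stems} {note}")
```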

Feel free to skip this part to the lemmatization section if the previous stemmer fits your needs.

In [24]:
# Lancaster Stemmer
lanStem = LancasterStemmer()
print("LancasterStemmer 'Working' \t= " + lanStem.stem("Working"))
print("LancasterStemmer 'Works' \t= " + lanStem.stem("Works"))
print("LancasterStemmer 'Worked' \t= " + lanStem.stem("Worked"))
print("LancasterStemmer 'Work' \t= " + lanStem.stem("Work"))
print("\n========================================\n") 

# SnowballStemmer (also supports languages other than English)
snowStem = SnowballStemmer("english") # Choose a language
print("SnowballStemmer 'Working' \t= " + snowStem.stem("Working"))
print("SnowballStemmer 'Works' \t= " + snowStem.stem("Works"))
print("SnowballStemmer 'Worked' \t= " + snowStem.stem("Worked"))
print("SnowballStemmer 'Work'  \t= " + snowStem.stem("Work"))
print("\n========================================\n") 

# RegexpStemmer (stems via regular expressions; words shorter than `min` are left unchanged)
RegStem = RegexpStemmer('ing$|s$|ed$', min=3)
print("RegexpStemmer 'Working' \t= " + RegStem.stem("Working"))
print("RegexpStemmer 'Works'   \t= " + RegStem.stem("Works"))
print("RegexpStemmer 'Worked'   \t= " + RegStem.stem("Worked"))
print("RegexpStemmer 'Work'    \t= " + RegStem.stem("Work"))
print("\n========================================\n") 


LancasterStemmer 'Working' 	= work
LancasterStemmer 'Works' 	= work
LancasterStemmer 'Worked' 	= work
LancasterStemmer 'Work' 	= work


SnowballStemmer 'Working' 	= work
SnowballStemmer 'Works' 	= work
SnowballStemmer 'Worked' 	= work
SnowballStemmer 'Work'  	= work


RegexpStemmer 'Working' 	= Work
RegexpStemmer 'Works'   	= Work
RegexpStemmer 'Worked'   	= Work
RegexpStemmer 'Work'    	= Work
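
Unlike the other stemmers, the RegexpStemmer output above keeps the capital "W": it only deletes whatever the pattern matches and does no lowercasing. The sketch below reuses the same pattern and `min=3` to show two further consequences. The variable name `reg_stem` and the extra test words are assumptions chosen for illustration.

```python
from nltk.stem.regexp import RegexpStemmer

reg_stem = RegexpStemmer('ing$|s$|ed$', min=3)

# Words shorter than `min` characters are returned untouched
print(reg_stem.stem("is"))    # is

# Longer words lose any matching suffix, even when it is not a real inflection
print(reg_stem.stem("sing"))  # s
print(reg_stem.stem("bus"))   # bu

# Case is preserved, so lowercase the input yourself if you need that
print(reg_stem.stem("Working".lower()))  # work
```

This makes the regular-expression stemmer easy to customize but also easy to over-apply, so the pattern and `min` value need to be chosen with the target vocabulary in mind.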




## Lemmatization

Lemmatization is the process of grouping the different inflected forms of a word so that they can be treated as a single item, the word's lemma.

At first glance this seems very similar to stemming as discussed earlier. Stemming and lemmatization are indeed quite similar, but there is a subtle difference: a stemmer works without knowledge of the context in which a word is used and simply strips affixes by rule, whereas lemmatization makes use of a vocabulary and morphological analysis to return a dictionary form of the word.

For additional information look at the documentation at: http://www.nltk.org/api/nltk.stem.html

### Note:
Because a stemmer only applies rules and does not look words up in a vocabulary, stemming is generally faster than lemmatization on large bodies of text.

In [27]:
# Initialize the lemmatizer (lemmatize() treats each word as a noun unless a `pos` is given)
wordLemmatizer = WordNetLemmatizer()

print("WordNetLemmatizer 'Working' \t= " + wordLemmatizer.lemmatize("Working"))
print("WordNetLemmatizer 'Works'   \t= " + wordLemmatizer.lemmatize("Works"))
print("WordNetLemmatizer 'Worked'   \t= " + wordLemmatizer.lemmatize("Worked"))
print("WordNetLemmatizer 'Work'    \t= " + wordLemmatizer.lemmatize("Work"))

WordNetLemmatizer 'Working' 	= Working
WordNetLemmatizer 'Works'   	= Works
WordNetLemmatizer 'Worked'   	= Worked
WordNetLemmatizer 'Work'    	= Work


## Showcase the difference between lemmatization and stemming

Now that both methods have been demonstrated, we will look at an example that shows their difference in more detail, using the Porter stemmer as the stemming example.

In [50]:
# Initialize Stemmer
portStem = PorterStemmer()

# Initialize Lemmatizer
wordLemmatizer = WordNetLemmatizer()

# List of words to test on
Words = ["operate" ,"operating" ,"operates", "operation" ,"operative", "operatives" ,"operational"]
StemWords = []
LemmWords = []

# Stem and lemmatize each word in the list
# (lemmatize() is called without a part of speech, so each word is treated as a noun)
for word in Words:
    StemWords.append(portStem.stem(word))
    LemmWords.append(wordLemmatizer.lemmatize(word))

# Print the results for each word
for original, stemmed, lemma in zip(Words, StemWords, LemmWords):
    print("Original  Word: " + original)
    print("Stemmed   Word: " + stemmed)
    print("Lemmatize Word: " + lemma)
    print("\n")

Original  Word: operate
Stemmed   Word: oper
Lemmatize Word: operate


Original  Word: operating
Stemmed   Word: oper
Lemmatize Word: operating


Original  Word: operates
Stemmed   Word: oper
Lemmatize Word: operates


Original  Word: operation
Stemmed   Word: oper
Lemmatize Word: operation


Original  Word: operative
Stemmed   Word: oper
Lemmatize Word: operative


Original  Word: operatives
Stemmed   Word: oper
Lemmatize Word: operative


Original  Word: operational
Stemmed   Word: oper
Lemmatize Word: operational
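
One way to read the results above is that the Porter stemmer collapses all seven words into a single group, while the lemmatizer (with its default noun setting) only merges "operatives" into "operative". The sketch below makes the stemmer's grouping explicit; the variable names are hypothetical and the word list is the one used above.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()
words = ["operate", "operating", "operates", "operation",
         "operative", "operatives", "operational"]

# Group the original words under the stem they reduce to
groups = defaultdict(list)
for word in words:
    groups[port_stem.stem(word)].append(word)

for stem, members in groups.items():
    print(f"{stem}: {members}")
```

The trade-off is that words with different meanings, such as "operation" and "operative", end up in the same group, which reduces variation at the cost of precision.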




## Conclusion 

In this notebook we went over the different stemmers and the lemmatizer found in NLTK, and showcased the differences between them.