* Lemmatization is the process of reducing a word to its dictionary form (its lemma). It is different from stemming and usually takes longer to compute. The following definition makes the difference explicit.

* The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.

* NLTK provides the **WordNetLemmatizer** class, which is a thin wrapper around the WordNet corpus.
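To make the contrast with stemming concrete before running NLTK, here is a minimal pure-Python sketch. It is a toy model, not how NLTK works internally: the suffix list, the mini-lexicon, and the names `toy_stem` and `toy_lemmatize` are all made up for illustration.

```python
# Toy illustration only: the suffix list and the lexicon are invented for this example.
SUFFIXES = ("ies", "es", "s", "ing", "ed")

# Hypothetical mini-lexicon mapping inflected forms to dictionary forms.
LEXICON = {
    "feet": "foot",
    "studies": "study",
    "bats": "bat",
}

def toy_stem(word):
    """Chop off the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    return LEXICON.get(word, word)

for w in ("studies", "feet", "bats"):
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix chopper turns "studies" into the non-word "stud" and cannot handle the irregular plural "feet", while the lookup returns "study" and "foot". The price is that a lookup only works for words the knowledge base contains.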

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup if the WordNet data is missing: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

In [2]:
# Lemmatize single words

print(lemmatizer.lemmatize("workers"))
print(lemmatizer.lemmatize("beeches"))

worker
beech


* Next, take a longer text. First split it into tokens, then apply the lemmatizer to each token in turn.

In [3]:
text = "Let’s lemmatize a simple sentence. We first tokenize the sentence into words using nltk.word_tokenize and then we will call lemmatizer.lemmatize() on each word. "
word_list = nltk.word_tokenize(text)
print(word_list)

['Let', '’', 's', 'lemmatize', 'a', 'simple', 'sentence', '.', 'We', 'first', 'tokenize', 'the', 'sentence', 'into', 'words', 'using', 'nltk.word_tokenize', 'and', 'then', 'we', 'will', 'call', 'lemmatizer.lemmatize', '(', ')', 'on', 'each', 'word', '.']


In [4]:
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

Let ’ s lemmatize a simple sentence . We first tokenize the sentence into word using nltk.word_tokenize and then we will call lemmatizer.lemmatize ( ) on each word .
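Only "words" changed in the output above, and "using" was left as it is. The reason is that `lemmatize` treats every word as a noun unless it is given a different `pos` argument; it accepts the WordNet letters "n", "v", "a" and "r". One common approach is to map Penn Treebank tags (the tag set returned by `nltk.pos_tag`) to those letters. The sketch below shows such a mapping; the helper name `penn_to_wordnet` is hypothetical, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written (token, tag) pairs standing in for real tagger output.
tagged = [("We", "PRP"), ("tokenize", "VBP"), ("words", "NNS"), ("using", "VBG")]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

With real tagger output, each token could then be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`. This assumes the tagger data required by `nltk.pos_tag` is installed.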


* The first example used the WordNet lemmatizer from the NLTK library. The same operations can be performed with TextBlob, and they give similar results.

In [8]:
# pip install textblob

from textblob import TextBlob, Word

In [9]:
word = 'stripes'
w = Word(word)
w.lemmatize()

'stripe'

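* Applying `lemmatize` to the word ‘stripes’ removes the ‘s’ suffix and returns ‘stripe’, the dictionary form of the word. Now let’s do the same on a sentence. Called without a part-of-speech argument, `lemmatize` treats every word as a noun, so in the output below ‘bats’ and ‘feet’ are reduced while ‘are’, ‘hanging’ and ‘striped’ stay unchanged.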

In [10]:
text = "The striped bats are hanging on their feet for best"
sent = TextBlob(text)
" ".join([w.lemmatize() for w in sent.words])

'The striped bat are hanging on their foot for best'