# Lemmatization in Python
Lemmatization reduces a word, or each word of a search query, to its canonical dictionary form. That dictionary form is called a ‘lemma’, and the process of finding it is called lemmatization. Lemmatization takes the word's part of speech into account, so that the different inflected forms of a word are recognized as a single item. The NLTK lemmatization method is based on WordNet’s built-in morph function (`morphy`). Text preprocessing often includes stemming, lemmatization, or both, and the two terms are easy to confuse.

In [9]:
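import nltk
from nltk.stem import WordNetLemmatizer

# Requires the NLTK tokenizer and WordNet data (e.g. via nltk.download).
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    # lemmatize() treats each word as a noun unless a pos argument is given
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))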

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
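
In the output above, `studying` comes back unchanged while `studies` becomes `study`. By default `lemmatize` treats every word as a noun, and `studying` is itself a noun entry, so no shorter form is looked up. The sketch below imitates the idea behind the morph function in plain Python: it applies a few suffix rules per part of speech and only accepts a result that is found in a vocabulary. The `VOCAB` and `RULES` tables and the `toy_lemmatize` name are made up for illustration and are far smaller than WordNet's real data.

```python
# Tiny stand-in vocabulary (hypothetical; the real lookup uses WordNet).
VOCAB = {
    "n": {"study", "cry", "studying"},
    "v": {"study", "cry"},
}

# A few suffix rewrite rules per part of speech: (suffix, replacement).
RULES = {
    "n": [("s", ""), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("ing", ""), ("ed", "")],
}


def toy_lemmatize(word, pos="n"):
    """Return the shortest known form reachable from `word`, or `word` itself."""
    candidates = []
    # The word may already be a dictionary form for this part of speech.
    if word in VOCAB[pos]:
        candidates.append(word)
    # Try each suffix rule and keep only results found in the vocabulary.
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if stem in VOCAB[pos]:
                candidates.append(stem)
    return min(candidates, key=len) if candidates else word


for token in ["studies", "studying", "cries", "cry"]:
    print(token, toy_lemmatize(token, pos="n"), toy_lemmatize(token, pos="v"))
```

With the real lemmatizer, the same effect is obtained by passing the part of speech, for example `wordnet_lemmatizer.lemmatize("studying", pos="v")`.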


In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize  # needed for word_tokenize() below

wordnet_lemmatizer = WordNetLemmatizer()
new_text = "The NLTK Lemmatization method is based on WorldNet’s built-in morph function. Text preprocessing includes both stemming as well as lemmatization. The cries is not good."
words = word_tokenize(new_text)

for w in words:
    print(w, " : ", wordnet_lemmatizer.lemmatize(w))

The  :  The
NLTK  :  NLTK
Lemmatization  :  Lemmatization
method  :  method
is  :  is
based  :  based
on  :  on
WorldNet  :  WorldNet
’  :  ’
s  :  s
built-in  :  built-in
morph  :  morph
function  :  function
.  :  .
Text  :  Text
preprocessing  :  preprocessing
includes  :  includes
both  :  both
stemming  :  stemming
as  :  a
well  :  well
as  :  a
lemmatization  :  lemmatization
.  :  .
The  :  The
cries  :  cry
is  :  is
not  :  not
good  :  good
.  :  .


# Stemming

Stemming is a rule-based text normalization technique that strips affixes (mainly suffixes) from a word to reduce it to a root form, called the stem. Because it does not consider the context or part of speech of a word, stemming is faster than lemmatization, but the resulting stem is not always a valid dictionary word.
Mapping the different forms of a word in a sentence to one stem also shortens lookups, for example in a search index, because only a single key has to be matched.
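
To make "rule-based" concrete, here is a minimal sketch of suffix stripping in plain Python. It is not the Porter algorithm used in the cells below: the suffix list, the minimum stem length, and the `toy_stem` name are all assumptions chosen for illustration.

```python
# Ordered suffix rules: the first matching suffix is stripped.
# These rules are illustrative only; they are not the Porter algorithm.
SUFFIXES = ["ing", "ers", "er", "es", "s"]
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Lowercase `word` and strip the first matching suffix if the stem stays long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for token in ["visiting", "programs", "programmers", "cries"]:
    print(token, ":", toy_stem(token))
```

No dictionary is consulted at any point, which is why this kind of procedure is fast and also why a stem such as `cri` need not be a real word.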

In [1]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = "Hello Rana Saeed, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)
ps = PorterStemmer()
for w in words:
    # stem() lowercases the token and strips its suffixes
    root_word = ps.stem(w)
    print(root_word)

hello
rana
saeed
,
you
have
to
build
a
veri
good
site
and
i
love
visit
your
site
.


# Stemming words using NLTK

In [5]:
# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]

for w in words:
    print(w, " : ", ps.stem(w))


program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm


# Stemming words from sentences

In [6]:
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)

for w in words:
    print(w, " : ", ps.stem(w))


Programmers  :  programm
program  :  program
with  :  with
programming  :  program
languages  :  languag
