## Stemming vs Lemmatization

Stemming and lemmatization both reduce a word to a base form, but they work differently. Stemming applies a set of language-specific rules to strip affixes. Lemmatization also reduces a word to its base form, but it can take context (such as the word's part of speech) into account. Here are some examples:

Stemming will do this:

-   Eating -> Eat
-   Solving -> Solv
-   Done -> Done

Lemmatization will do this:

-   Eating -> Eat
-   Solving -> Solve
-   Done -> Do

Lemmatization looks better than stemming here, so why not always use it? Stemming runs faster because it only applies a set of rules, whereas lemmatization is slower because, on top of applying rules, it also looks up base words in a dictionary. Each approach therefore has its own advantage.
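
The difference between "rules only" and "rules plus a dictionary" can be sketched in a few lines of plain Python. This is a toy illustration, not the algorithm that NLTK or spaCy implements: the names `SUFFIX_RULES`, `LEMMA_DICT`, `toy_stem`, and `toy_lemmatize` are made up for this example, and the rules and dictionary cover only the three words listed above.

```python
# Toy illustration only: these are NOT the algorithms NLTK or spaCy implement.

# Hypothetical suffix rules, tried in order; the first match is stripped.
SUFFIX_RULES = ["ing", "ed", "s"]

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_DICT = {"eating": "eat", "solving": "solve", "done": "do"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["Eating", "Solving", "Done"]:
    print(w, " | ", toy_stem(w), " | ", toy_lemmatize(w))
```

The stemmer never needs to know that "solv" is not a word or that "done" is a form of "do"; the lemmatizer gets both right, but only because someone put those entries in its dictionary.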


### Stemming & Lemmatization with NLTK


In [1]:
import nltk

nltk.download("punkt")
nltk.download("wordnet")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer


In [3]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()


In [4]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

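# For each word, print the Porter stem, the Snowball stem,
# and the WordNet lemma (in that order)
for word in words:
    print(word, " | ", porter_stemmer.stem(word))
    print(word, " | ", snowball_stemmer.stem(word))
    # "v" tells the lemmatizer to treat the word as a verb
    print(word, " | ", lemmatizer.lemmatize(word, "v"))

    print()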


eating  |  eat
eating  |  eat
eating  |  eat

eats  |  eat
eats  |  eat
eats  |  eat

eat  |  eat
eat  |  eat
eat  |  eat

ate  |  ate
ate  |  ate
ate  |  eat

adjustable  |  adjust
adjustable  |  adjust
adjustable  |  adjustable

rafting  |  raft
rafting  |  raft
rafting  |  raft

ability  |  abil
ability  |  abil
ability  |  ability

meeting  |  meet
meeting  |  meet
meeting  |  meet
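
The output above prints three unlabeled lines per word (Porter, Snowball, then the WordNet lemmatizer), which can be hard to scan. One possible way to get a labeled table is sketched below. `compare_table` is a hypothetical helper, not part of NLTK, and the two stand-in transforms exist only so the sketch runs without NLTK installed.

```python
from typing import Callable, Dict, List


def compare_table(words: List[str], transforms: Dict[str, Callable[[str], str]]) -> List[str]:
    """Return one formatted row per word, with one column per transform."""
    header = ["word"] + list(transforms)
    rows = [header] + [[w] + [fn(w) for fn in transforms.values()] for w in words]
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return [
        " | ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


# Stand-in transforms so the sketch runs without NLTK installed.
# In the notebook, pass porter_stemmer.stem, snowball_stemmer.stem,
# and lambda w: lemmatizer.lemmatize(w, "v") instead.
demo_transforms = {
    "upper": str.upper,
    "first3": lambda w: w[:3],
}

for line in compare_table(["eating", "ate"], demo_transforms):
    print(line)
```

With the real stemmers and lemmatizer passed in, each word takes one row and each column carries the name of the tool that produced it.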



### Lemmatization with spaCy

spaCy itself does not include a stemmer, as it prefers lemmatization.


In [5]:
import spacy

In [6]:
nlp = spacy.load("en_core_web_sm")

In [7]:
text = "eating eats eat ate adjustable rafting ability meeting better"
docs = nlp(text)

for token in docs:
    print(token, " | ", token.lemma_)


eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


Exercise 1:

-   Convert this list of words into base forms using stemming and lemmatization, and observe the transformations
-   Write a short note on the words whose base forms differ between stemming and lemmatization


In [8]:
# using stemming in nltk
lst_words = [
    "running",
    "painting",
    "walking",
    "dressing",
    "likely",
    "children",
    "whom",
    "good",
    "ate",
    "fishing",
]


In [9]:
for word in lst_words:
    print(word, " | ", porter_stemmer.stem(word))


running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish


In [10]:
# using lemmatization in spacy
# (note: this sentence contains "who", whereas lst_words above contains "whom")
doc = nlp("running painting walking dressing likely children who good ate fishing")

for token in doc:
    print(token, " | ", token.lemma_)


running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
who  |  who
good  |  good
ate  |  eat
fishing  |  fishing
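
For the second part of the exercise, the words whose base forms differ can be picked out programmatically. The sketch below copies the results from the two outputs above into plain dictionaries rather than calling NLTK or spaCy again; `differing` is a hypothetical helper name, and "whom"/"who" is left out because the two cells used different input words.

```python
# Results copied from the two outputs above (Porter stemmer vs. spaCy lemmas).
# "whom"/"who" is left out because the two cells used different input words.
stemmed = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "like", "children": "children",
    "good": "good", "ate": "ate", "fishing": "fish",
}
lemmatized = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "likely", "children": "child",
    "good": "good", "ate": "eat", "fishing": "fishing",
}


def differing(stems: dict, lemmas: dict) -> dict:
    """Return {word: (stem, lemma)} for words where the two results disagree."""
    return {
        word: (stems[word], lemmas[word])
        for word in stems
        if word in lemmas and stems[word] != lemmas[word]
    }


for word, (stem, lemma) in differing(stemmed, lemmatized).items():
    print(f"{word}: stem={stem!r}, lemma={lemma!r}")
```

Four words disagree: "likely", "children", "ate", and "fishing". The first three show the lemmatizer keeping or recovering a real word where the rule-based stemmer does not; for "fishing", spaCy kept the word unchanged in this run while the stemmer stripped the suffix.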


Exercise 2:

-   Convert the given text into its base form using both stemming and lemmatization


In [11]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""


In [12]:
# using stemming in nltk
from nltk.tokenize import word_tokenize

# step1: Word tokenizing
tokens = word_tokenize(text)

# step2: getting the base form for each token using stemmer
stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]

# step3: joining all words in a list into string using 'join()'
" ".join(stemmed_tokens)


'latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .'
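
The stemmed output above contains tokens such as `girl.sh`, `playing.sh`, and `too.besid`. The input text has no space after some periods, so those word pairs appear to have been kept as single tokens and stemmed as one unit. One possible fix is to normalize the text before tokenizing. The sketch below is a rough heuristic, not an NLTK feature: `add_missing_spaces` is a hypothetical helper, and its regular expression would also split abbreviations such as "e.g." and domain names, so it only suits simple text like this example.

```python
import re


def add_missing_spaces(text: str) -> str:
    """Insert a space after . ! ? or , when a letter follows immediately."""
    return re.sub(r"([.!?,])(?=[A-Za-z])", r"\1 ", text)


sample = "Latha is very multi talented girl.She is good at many skills."
print(add_missing_spaces(sample))
```

Passing the normalized text to `word_tokenize` instead of the raw text should give the stemmer separate tokens for "girl" and "She".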

In [13]:
# using lemmatization in spacy


# step1: Creating the object for the given text
doc = nlp(text)

# step2: getting the base form for each token using spacy 'lemma_'
lemmatized = [token.lemma_ for token in doc]

# step3: joining all words in a list into string using 'join()'
" ".join(lemmatized)


'Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . she also like eat Pav Bhagi . she have a \n habit of fishing and swim too . besides all this , she be a wonderful at cook too . \n'