#### Stemming in NLTK

In [1]:
from nltk.stem import PorterStemmer

In [2]:
stemmer = PorterStemmer()

In [3]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


#### Lemmatization in spaCy

In [4]:
import spacy

In [5]:
nlp = spacy.load("en_core_web_lg")

# Alternative input to try (the output below is for the word list on the next line):
# doc = nlp("Mando talked for 3 hours although talking isn't his thing")
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")

for token in doc:
    print(token, "|", token.lemma_)

eating | eating
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | rafting
ability | ability
meeting | meet
better | well


#### Customizing Lemma

In [6]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x280e74c96d0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x280e74ca2d0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x280e74f4890>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x280e76e2250>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x280e76cf650>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x280e74f4970>)]

In [8]:
ar = nlp.get_pipe('attribute_ruler')

# map the exact token texts "Bro" and "Brah" to the lemma "Brother"
ar.add([[{"TEXT":"Bro"}], [{"TEXT":"Brah"}]], {"LEMMA":"Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhausted


In [9]:
doc[6]

Brah

In [10]:
doc[6].lemma_

'Brother'
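
Conceptually, the rule added above overrides the lemma of any token whose text matches one of the listed patterns exactly. The sketch below is a plain-Python analogue of that idea, not spaCy's implementation: `apply_lemma_overrides`, `LEMMA_OVERRIDES`, and the sample pairs are illustrative names and data introduced here.

```python
# Plain-Python analogue of a text-to-lemma override; not spaCy's implementation.
LEMMA_OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}  # same mapping as the rule above

def apply_lemma_overrides(pairs, overrides):
    """Replace the lemma of any (text, lemma) pair whose text has an override."""
    return [(text, overrides.get(text, lemma)) for text, lemma in pairs]

# Hypothetical (text, lemma) pairs as they might look before the rule is applied.
pairs = [("Bro", "Bro"), (",", ","), ("you", "you"), ("Brah", "Brah"), ("am", "be")]

for text, lemma in apply_lemma_overrides(pairs, LEMMA_OVERRIDES):
    print(text, "|", lemma)
```

Like the `TEXT` pattern, the dictionary lookup is case-sensitive, so a lowercase "bro" would keep its original lemma. In spaCy, matching on the `LOWER` attribute instead of `TEXT` is the usual way to make such a rule case-insensitive.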

#### Let's import the necessary libraries and create the objects

In [11]:
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
nltk.download("all")

import spacy
nlp = spacy.load("en_core_web_lg")

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]    |

* Convert this list of words into base forms using stemming and lemmatization, and observe the transformations
* Write a short note on the words whose base forms differ between stemming and lemmatization

In [12]:
# using stemming in nltk

lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

for word in lst_words:
    print(f"{word} | {stemmer.stem(word)}")

running | run
painting | paint
walking | walk
dressing | dress
likely | like
children | children
whom | whom
good | good
ate | ate
fishing | fish


In [13]:
# using lemmatization in spacy

doc = nlp("running painting walking dressing likely children whom good ate fishing")

for token in doc:
    print(token, "|", token.lemma_)

running | run
painting | paint
walking | walking
dressing | dress
likely | likely
children | child
whom | whom
good | good
ate | eat
fishing | fishing
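
Rather than comparing the two outputs by eye, the differences can be picked out programmatically. The sketch below hard-codes the stem and lemma outputs printed by the two cells above, so it runs without NLTK or spaCy. The helper name `differing_words` is introduced here for illustration.

```python
# Outputs copied from the stemming and lemmatization cells above.
stems = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "like", "children": "children",
    "whom": "whom", "good": "good", "ate": "ate", "fishing": "fish",
}
lemmas = {
    "running": "run", "painting": "paint", "walking": "walking",
    "dressing": "dress", "likely": "likely", "children": "child",
    "whom": "whom", "good": "good", "ate": "eat", "fishing": "fishing",
}

def differing_words(stems, lemmas):
    """Return (word, stem, lemma) for every word where the two disagree."""
    return [(w, stems[w], lemmas[w]) for w in stems if stems[w] != lemmas[w]]

for word, stem, lemma in differing_words(stems, lemmas):
    print(f"{word} | stem: {stem} | lemma: {lemma}")
```

With live pipelines, the two dictionaries could instead be built from `stemmer.stem(word)` and `token.lemma_`.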


#### Observations

The words whose stem and lemma differ are:
* walking
* likely
* children
* ate
* fishing

('painting' is reduced to 'paint' by both methods, so it is not in this list.)

Stemming reaches a base form by stripping suffixes such as 'ing' and 'ly', so it reduces 'walking', 'likely', and 'fishing' to 'walk', 'like', and 'fish', while spaCy leaves these three words unchanged here. Note that 'like' is not really the base form of 'likely', so the stemmer over-strips in that case.

Lemmatization uses dictionary knowledge of the language (and, in spaCy, the part of speech assigned to each token) rather than suffix rules alone, so it maps the irregular forms 'children' and 'ate' to 'child' and 'eat', while the stemmer leaves both unchanged.
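
The contrast between suffix stripping and dictionary lookup can be made concrete with a toy example. This is a deliberately simplified sketch, not the Porter algorithm and not spaCy's lemmatizer: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all assumptions made for illustration.

```python
# Toy illustration only: this is NOT the Porter algorithm or spaCy's lemmatizer.
SUFFIXES = ("ing", "ly", "s")                       # hypothetical, tiny suffix list
LEMMA_TABLE = {"children": "child", "ate": "eat"}   # hypothetical lookup table

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["fishing", "walking", "children", "ate"]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules cannot reach 'child' or 'eat' because neither irregular form ends in a listed suffix, while the lookup table only handles the words it has been given.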

In [14]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [16]:
# using stemming in nltk

# step1: word tokenization
all_word_tokens = nltk.word_tokenize(text)

# step2: getting the base form for each token using stemmer
all_base_words = []

for token in all_word_tokens:
    base_form = stemmer.stem(token)
    all_base_words.append(base_form)

# step3: joining all words in a list into string using 'join()'
final_base_text = ' '.join(all_base_words)
print(final_base_text)

latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .
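
The fused tokens in the output above ('girl.sh', 'playing.sh', 'too.besid') appear to come from the missing space after some periods in the input text, which leaves two words joined around the period as a single token. One possible fix, sketched below with only the standard library, is to insert a space after any period that is directly followed by a letter before tokenizing. The helper name `add_space_after_periods` is introduced here for illustration.

```python
import re

def add_space_after_periods(text):
    """Insert a space after any period that is directly followed by a letter."""
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)

sample = "Latha is very multi talented girl.She is good at many skills"
print(add_space_after_periods(sample))
```

This rule is intentionally crude: it leaves decimals such as 3.5 alone, but it would also split abbreviations like "e.g." and domain names, so it only suits plain prose like the sample text.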


In [17]:
# using lemmatization in spacy

# step1: creating the object for the given text
doc = nlp(text)
all_base_words = []

# step2: getting the base form for each token using spacy 'lemma'
for token in doc:
    base_word = token.lemma_
    all_base_words.append(base_word)

# step3: joining all words in a list into string using 'join()'
final_base_text = ' '.join(all_base_words)
print(final_base_text)

Latha be very multi talented girl . she be good at many skill like dancing , run , singing , play . she also like eat Pav Bhagi . she have a 
 habit of fishing and swim too . besides all this , she be a wonderful at cook too . 

