Stemming and lemmatization are two techniques for reducing a word to its base form.

<h3>Stemming in NLTK</h3>

Stemming applies a fixed set of rules to strip affixes from a word. The Porter stemmer used here strips suffixes only, not prefixes.

In [1]:
import nltk

In [3]:
from nltk.stem import PorterStemmer

In [4]:
stemmer = PorterStemmer()  # create an instance of the PorterStemmer class

In [10]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

In [12]:
for word in words:
    # The stemmer has no knowledge of the language; it only applies suffix rules.
    # So "ability" becomes "abil" and the irregular form "ate" is left unchanged.
    # It is still valuable because it is fast.
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
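
To make "a fixed set of rules" concrete, here is a minimal sketch of suffix stripping. It is not the Porter algorithm, which has many more rules and conditions; `TOY_RULES`, `toy_stem`, and `min_stem_len` are hypothetical names invented for this illustration.

```python
# Toy suffix rules, tried in order. These are hand-picked for illustration
# and are NOT the rules used by NLTK's PorterStemmer.
TOY_RULES = ["ing", "able", "ity", "s"]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    # No rule matched: irregular forms such as "ate" pass through unchanged.
    return word


words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]
toy_stems = [toy_stem(w) for w in words]
for word, stem in zip(words, toy_stems):
    print(word, "|", stem)
```

Like the real stemmer, this sketch never consults a dictionary, which is why it can produce a non-word such as "abil" and cannot connect "ate" to "eat".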


<h3>Lemmatization in spaCy</h3>

In [None]:
import spacy

Lemmatization uses knowledge of the language to derive the correct base word (the lemma).

In [13]:
nlp = spacy.load("en_core_web_sm")

In [19]:
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")

In [21]:
for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma)  # token.lemma: unique hash ID of the lemma string

eating | eat | 9837207709914848172
eats | eat | 9837207709914848172
eat | eat | 9837207709914848172
ate | eat | 9837207709914848172
adjustable | adjustable | 6033511944150694480
rafting | rafting | 1196139325854331
ability | ability | 11565809527369121409
meeting | meeting | 14798207169164081740
better | well | 4525988469032889948
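
The difference from stemming can be sketched with a toy lemmatizer that consults a small vocabulary before changing a word. This is only an illustration of the idea, not how spaCy's lemmatizer is implemented (spaCy also uses part-of-speech information); `IRREGULAR`, `KNOWN_LEMMAS`, and `toy_lemma` are hypothetical names, and the tables are hand-written for this example.

```python
# Hand-written exception table for irregular forms (illustrative only).
IRREGULAR = {"ate": "eat", "became": "become", "better": "well", "is": "be", "am": "be"}
# Tiny vocabulary of valid base words; a suffix is stripped only if the
# result is in this set, so the output is always a real word.
KNOWN_LEMMAS = {"eat", "talk", "hour"}
SUFFIXES = ["ing", "ed", "s"]


def toy_lemma(word):
    """Return a base form using an exception table plus vocabulary-checked rules."""
    lowered = word.lower()
    if lowered in IRREGULAR:
        return IRREGULAR[lowered]
    for suffix in SUFFIXES:
        if lowered.endswith(suffix):
            candidate = lowered[: -len(suffix)]
            if candidate in KNOWN_LEMMAS:
                return candidate
    # Unknown word: leave it alone rather than produce a non-word.
    return lowered


for word in ["eating", "eats", "ate", "ability", "better"]:
    print(word, "|", toy_lemma(word))
```

Because every change is checked against known words, "ability" stays "ability" instead of becoming "abil", and the exception table handles "ate" and "better".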


In [23]:
doc = nlp("Mando talked for 3 hours although talking isn't his thing he became talkative")

for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma) 

Mando | Mando | 7837215228004622142
talked | talk | 13939146775466599234
for | for | 16037325823156266367
3 | 3 | 602994839685422785
hours | hour | 9748623380567160636
although | although | 343236316598008647
talking | talk | 13939146775466599234
is | be | 10382539506755952630
n't | n't | 2043519015752540944
his | his | 2661093235354845946
thing | thing | 2473243759842082748
he | he | 1655312771067108281
became | become | 12558846041070486771
talkative | talkative | 13364764166055324990


<h3>Customizing the lemmatizer</h3>

In [24]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [25]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")  # by default the model does not know slang such as "Bro" and "Brah"

In [26]:
for token in doc:
    print(token, "|", token.lemma_)

Bro | bro
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brah
, | ,
do | do
n't | n't
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [27]:
doc[0]

Bro

In [29]:
doc[0].lemma_

'bro'

In [33]:
# Add a rule to the pipeline's attribute ruler that maps the exact tokens
# "Bro" and "Brah" to the lemma "Brother".
ar = nlp.get_pipe("attribute_ruler")

ar.add([[{"TEXT":"Bro"}], [{"TEXT":"Brah"}]], {"LEMMA":"Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted") 

for token in doc:
    print(token, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | n't
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
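
The effect of the rule above can be pictured as an override table that is consulted before the normal lemmatizer. The sketch below is a plain-Python analogy, not spaCy code; `LEMMA_OVERRIDES` and `lemma_with_overrides` are hypothetical names, and `str.lower` merely stands in for a real base lemmatizer. It also shows one consequence of matching on the exact text, as the `"TEXT"` key does: a lowercase "bro" is not covered by a rule written for "Bro".

```python
# Override table keyed on the exact token text (case-sensitive), mirroring
# the {"TEXT": ...} patterns used above. Illustrative only.
LEMMA_OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}


def lemma_with_overrides(token_text, base_lemmatizer):
    """Return the override if the exact text matches, else fall back to the base lemmatizer."""
    if token_text in LEMMA_OVERRIDES:
        return LEMMA_OVERRIDES[token_text]
    return base_lemmatizer(token_text)


tokens = ["Bro", ",", "you", "wanna", "go", "?", "Brah", "bro"]
# str.lower is a stand-in for a real lemmatizer in this sketch.
lemmas = [lemma_with_overrides(t, str.lower) for t in tokens]
for token, lemma in zip(tokens, lemmas):
    print(token, "|", lemma)
```

If case-insensitive matching is wanted, one option is to key the table on the lowercased text instead.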
