Stemming and lemmatization are two popular techniques for reducing a word to its base form. Stemming applies a fixed set of rules to strip affixes (the Porter stemmer used here removes suffixes), so the result is not always a real word, whereas lemmatization uses knowledge of the language to produce a correct base word (the lemma). Stemming is demonstrated with NLTK (spaCy does not support stemming), and lemmatization with spaCy.
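To make the difference concrete before turning to the real libraries, here is a minimal toy sketch in plain Python. It is not the Porter algorithm and not spaCy's lemmatizer: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all assumptions made up for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT spaCy's lemmatizer.

# Hypothetical rule set: suffixes to strip, checked in this order.
SUFFIXES = ("ing", "s")

# Hypothetical lookup table standing in for "language knowledge".
LEMMA_TABLE = {"eating": "eat", "eats": "eat", "ate": "eat"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the known base form, or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get(word, word)


for word in ["eating", "eats", "ate", "meeting"]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based function cannot relate "ate" to "eat" because no suffix matches, and it blindly trims "meeting" to "meet". The lookup-based function handles the irregular form and leaves "meeting" alone, but only because the table was written that way.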

In [1]:
import spacy
import nltk

In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [4]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]
for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


In [12]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, "|", token.lemma_)

eating | eat
eats | eat
eat | eat
ate | eat
adjustable | adjustable
rafting | raft
ability | ability
meeting | meeting
better | well


In [7]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Mando talked for 3 hours although talking isn't his thing")
for token in doc:
    print(token, "|", token.lemma_)

Mando | Mando
talked | talk
for | for
3 | 3
hours | hour
although | although
talking | talk
is | be
n't | not
his | his
thing | thing


## Customizing the lemmatizer

In [10]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [11]:
# Use the attribute ruler to assign the lemma "Brother" to the slang tokens "Bro" and "Brah"
ar = nlp.get_pipe("attribute_ruler")
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
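The idea behind this customization can be sketched in plain Python: check an exact-text override table first and fall back to the regular lemmatizer only when there is no match. This is a conceptual sketch, not spaCy's internal implementation. The names `OVERRIDES`, `fallback_lemma`, and `lemma_with_overrides` are hypothetical, and the fallback simply lowercases the token as a stand-in for a real lemmatizer.

```python
# Conceptual sketch only: not how spaCy implements the attribute ruler.

# Hypothetical override table keyed on the exact token text.
OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}


def fallback_lemma(token):
    """Stand-in for a real lemmatizer (assumption: just lowercase the token)."""
    return token.lower()


def lemma_with_overrides(token):
    """Prefer an exact-text override, otherwise use the fallback lemmatizer."""
    if token in OVERRIDES:
        return OVERRIDES[token]
    return fallback_lemma(token)


for token in ["Bro", ",", "you", "Brah", "bro"]:
    print(token, "|", lemma_with_overrides(token))
```

Because the table is keyed on exact text, lowercase "bro" is not overridden. The same applies to a `TEXT` pattern in spaCy, which matches the verbatim token text; a `LOWER` pattern matches case-insensitively.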
