### Stemming and Lemmatization

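Stemming -> Applies fixed rules, such as stripping the suffixes "-able" or "-ing", to reduce a word to a base form (stem). The stem is not guaranteed to be a dictionary word.

Lemmatization -> Uses linguistic knowledge, such as vocabulary and part of speech, to map a word to its dictionary base form (lemma).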
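To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm and not spaCy's lemmatizer: the suffix list `SUFFIXES`, the lookup table `LEMMA_TABLE`, and both helper functions are hypothetical names invented for this illustration.

```python
# Toy illustration only: this is NOT the Porter algorithm or spaCy's lemmatizer.
SUFFIXES = ("able", "ing", "s")  # hypothetical rule list, checked in order

# Hypothetical hand-written lookup table standing in for linguistic knowledge.
LEMMA_TABLE = {"ate": "eat", "eating": "eat", "eats": "eat", "better": "well"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["eating", "eats", "ate", "adjustable", "better"]:
    print(word, "|", toy_stem(word), "|", toy_lemmatize(word))
```

Suffix stripping cannot reach "eat" from the irregular form "ate", while the lookup can. The lookup, in turn, only knows the words someone put in the table.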

In [1]:
import nltk
import spacy

#### Stemming in NLTK

In [7]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [8]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


#### Lemmatization in spaCy

spaCy -> Does not support stemming; it provides lemmatization only.

In [25]:
nlp = spacy.load("en_core_web_sm")

In [26]:
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")

# token.lemma_ is the lemma as text; token.lemma is the integer hash of that text
for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma)

eating | eat | 9837207709914848172
eats | eat | 9837207709914848172
eat | eat | 9837207709914848172
ate | eat | 9837207709914848172
adjustable | adjustable | 6033511944150694480
rafting | raft | 7154368781129989833
ability | ability | 11565809527369121409
meeting | meeting | 14798207169164081740
better | well | 4525988469032889948
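
The third column above is the integer hash that spaCy stores for each lemma string, which is why every "eat" row shows the same number. A minimal sketch of that two-way mapping, assuming spaCy is installed (no trained pipeline is needed). The standalone `StringStore` is created here only for illustration; a loaded pipeline keeps its own store at `nlp.vocab.strings`.

```python
from spacy.strings import StringStore

# Standalone store for illustration; a pipeline exposes one as nlp.vocab.strings.
store = StringStore()
lemma_id = store.add("eat")  # add() returns the integer hash of the string

print(lemma_id)                  # the hash ID
print(store[lemma_id])           # hash -> string
print(store["eat"] == lemma_id)  # string -> hash gives the same ID back
```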


In [27]:
doc = nlp("Mando talked for 3 hours although talking isn't his thing. He became talkative.")

for token in doc:
    print(token, "|", token.lemma_, "|", token.lemma)

Mando | Mando | 7837215228004622142
talked | talk | 13939146775466599234
for | for | 16037325823156266367
3 | 3 | 602994839685422785
hours | hour | 9748623380567160636
although | although | 343236316598008647
talking | talk | 13939146775466599234
is | be | 10382539506755952630
n't | not | 447765159362469301
his | his | 2661093235354845946
thing | thing | 2473243759842082748
. | . | 12646065887601541794
He | he | 1655312771067108281
became | become | 12558846041070486771
talkative | talkative | 13364764166055324990
. | . | 12646065887601541794


In [28]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [30]:
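# Customize lemmas through the attribute_ruler component
ar = nlp.get_pipe('attribute_ruler')

# Assign the lemma "Brother" to the exact (case-sensitive) tokens "Bro" and "Brah"
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})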

cus_doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in cus_doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust
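
The `TEXT` patterns above match the exact token text, so a lowercase "bro" would not be rewritten. One possible variation, sketched below, matches on `LOWER` instead. It assumes spaCy is installed and uses a blank English pipeline (`nlp_blank`, a name chosen for this example) so that no trained model is required; because that pipeline has no lemmatizer, tokens not covered by a rule come back with an empty lemma.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("attribute_ruler")

# LOWER compares against the lowercased token text,
# so "Bro", "bro" and "BRO" all match the same pattern.
ruler.add(
    patterns=[[{"LOWER": "bro"}], [{"LOWER": "brah"}]],
    attrs={"LEMMA": "Brother"},
)

doc = nlp_blank("bro, you wanna go? Brah, don't say no!")
lemmas = [(token.text, token.lemma_) for token in doc]
for text, lemma in lemmas:
    print(text, "|", lemma)
```

In the full `en_core_web_sm` pipeline the same `LOWER` patterns can be added to the existing `attribute_ruler`, as in the cell above.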
