## Stemming & Lemmatization

Both techniques reduce a given word to its base form.

1. talking --> talk
2. eating  --> eat
3. adjustable --> adjust

Using fixed rules to obtain the base word (for example, stripping suffixes such as "able" or "ing") is called <b>Stemming</b>.
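To make the idea of fixed rules concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy, not the Porter algorithm used later in this notebook; the names `SUFFIXES` and `toy_stem` and the three-character minimum are assumptions made for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, not the Porter algorithm.
SUFFIXES = ("able", "ing")  # hypothetical rule list, taken from the examples above


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ["talking", "eating", "adjustable", "ate"]:
    print(word, "-->", toy_stem(word))
```

A rule list like this has no way to map an irregular form such as "ate" to "eat", which is where lemmatization comes in.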

When we instead use knowledge of the language (linguistic knowledge) to derive the base word, it is called <b>Lemmatization</b>.

1. ate --> eat
2. hidden --> hide
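The two examples above cannot be produced by stripping a suffix; they need a lookup into what the language actually contains. A minimal sketch of that idea follows. The `IRREGULAR_FORMS` table and `toy_lemmatize` function are hypothetical; real lemmatizers rely on large lexicons and part-of-speech information rather than a two-entry dictionary.

```python
# Toy lookup-based lemmatizer: illustrative only.
IRREGULAR_FORMS = {"ate": "eat", "hidden": "hide"}  # hypothetical mini-lexicon


def toy_lemmatize(word: str) -> str:
    """Return the known base form, or the lowercased word if it is not listed."""
    lowered = word.lower()
    return IRREGULAR_FORMS.get(lowered, lowered)


for word in ["ate", "hidden", "talk"]:
    print(word, "-->", toy_lemmatize(word))
```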

Stemming is not supported by <b>spaCy</b>, but both stemming and lemmatization are supported by <b>NLTK</b>.

In [1]:
import nltk
import spacy

#### 1. Stemming using NLTK

In [2]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

In [3]:
word_list = ['Run', 'Ran', 'Running', 'Runner', 'Book', 'Books', 'Bookstore', 'Bookshelf', 'Jump', 'Jumped', 'Jumping', 'Jumper']

for word in word_list:
    print(word, "-----", stemmer.stem(word))

Run ----- run
Ran ----- ran
Running ----- run
Runner ----- runner
Book ----- book
Books ----- book
Bookstore ----- bookstor
Bookshelf ----- bookshelf
Jump ----- jump
Jumped ----- jump
Jumping ----- jump
Jumper ----- jumper


#### 2. Lemmatization using Spacy

In [4]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("Run Ran Running Runner Book Books Bookstore Bookshelf Jump Jumped Jumping Jumper")

for word in doc:
    print(word, "-----", word.lemma_)

Run ----- Run
Ran ----- Ran
Running ----- Running
Runner ----- Runner
Book ----- Book
Books ----- Books
Bookstore ----- Bookstore
Bookshelf ----- Bookshelf
Jump ----- Jump
Jumped ----- Jumped
Jumping ----- Jumping
Jumper ----- Jumper

Note that every lemma above is identical to its token. A likely cause (an assumption, not verified here) is that these capitalized, context-free words are tagged as proper nouns, whose lemmas spaCy leaves unchanged. Lowercase words inside a real sentence, as in section 4, are lemmatized as expected.


In [6]:
# lemma_ is the lemma as text; lemma is its integer hash ID in the vocabulary.
for word in doc:
    print(word, "-----", word.lemma_, "---ID--->", word.lemma)

Run ----- Run ---ID---> 16173034612666007580
Ran ----- Ran ---ID---> 7920933193498127490
Running ----- Running ---ID---> 3515731218577779878
Runner ----- Runner ---ID---> 8055516351581086986
Book ----- Book ---ID---> 2282789789192439378
Books ----- Books ---ID---> 14374249698548956695
Bookstore ----- Bookstore ---ID---> 4802154648493700311
Bookshelf ----- Bookshelf ---ID---> 1977927564827388121
Jump ----- Jump ---ID---> 8517166225963065883
Jumped ----- Jumped ---ID---> 5367936027753836856
Jumping ----- Jumping ---ID---> 13951021020171923904
Jumper ----- Jumper ---ID---> 13831740166223223947
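
The integer IDs printed above are hashes that spaCy keeps in the vocabulary's string store, and the mapping works in both directions. The sketch below shows that round trip. It uses a blank English pipeline (`nlp_blank`, a name chosen here) so that no trained model is needed; the string "run" is just an example value.

```python
import spacy

nlp_blank = spacy.blank("en")  # blank pipeline: no trained model needed for this sketch
string_store = nlp_blank.vocab.strings

lemma_id = string_store.add("run")   # string -> integer hash
lemma_text = string_store[lemma_id]  # integer hash -> string

print(lemma_id, "<->", lemma_text)
```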


#### 3. Lemmatization using NLTK

In [9]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [17]:
doc = "Run Ran Running Runner Book Books Bookstore Bookshelf Jump Jumped Jumping Jumper"
tokens = word_tokenize(doc)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
lemmatized_tokens

['Run',
 'Ran',
 'Running',
 'Runner',
 'Book',
 'Books',
 'Bookstore',
 'Bookshelf',
 'Jump',
 'Jumped',
 'Jumping',
 'Jumper']

In [18]:
for token, lemma in zip(tokens, lemmatized_tokens):
    print(token, lemma)

Run Run
Ran Ran
Running Running
Runner Runner
Book Book
Books Books
Bookstore Bookstore
Bookshelf Bookshelf
Jump Jump
Jumped Jumped
Jumping Jumping
Jumper Jumper


#### 4. Customizing Lemmatization

In [19]:
doc = nlp('''"Bro, did you catch the game last night?" asked Tom, turning to his younger sibling. "Yeah, bruh, it was intense," replied Alex, grinning at his older brother.''')

In [20]:
for token in doc:
    print(token,"-----",token.lemma_)

" ----- "
Bro ----- bro
, ----- ,
did ----- do
you ----- you
catch ----- catch
the ----- the
game ----- game
last ----- last
night ----- night
? ----- ?
" ----- "
asked ----- ask
Tom ----- Tom
, ----- ,
turning ----- turn
to ----- to
his ----- his
younger ----- young
sibling ----- sibling
. ----- .
" ----- "
Yeah ----- yeah
, ----- ,
bruh ----- bruh
, ----- ,
it ----- it
was ----- be
intense ----- intense
, ----- ,
" ----- "
replied ----- reply
Alex ----- Alex
, ----- ,
grinning ----- grin
at ----- at
his ----- his
older ----- old
brother ----- brother
. ----- .


In [21]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [25]:
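# Fetch the attribute ruler, which can set token attributes such as the lemma.
ar = nlp.get_pipe('attribute_ruler')

# Map the exact token texts "Bro" and "bruh" to the lemma "brother".
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "bruh"}]], {"LEMMA": "brother"})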

In [26]:
doc = nlp('''"Bro, did you catch the game last night?" asked Tom, turning to his younger sibling. "Yeah, bruh, it was intense," replied Alex, grinning at his older brother.''')
for token in doc:
    print(token,"-----",token.lemma_)

" ----- "
Bro ----- brother
, ----- ,
did ----- do
you ----- you
catch ----- catch
the ----- the
game ----- game
last ----- last
night ----- night
? ----- ?
" ----- "
asked ----- ask
Tom ----- Tom
, ----- ,
turning ----- turn
to ----- to
his ----- his
younger ----- young
sibling ----- sibling
. ----- .
" ----- "
Yeah ----- yeah
, ----- ,
bruh ----- brother
, ----- ,
it ----- it
was ----- be
intense ----- intense
, ----- ,
" ----- "
replied ----- reply
Alex ----- Alex
, ----- ,
grinning ----- grin
at ----- at
his ----- his
older ----- old
brother ----- brother
. ----- .
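
The `TEXT` patterns used above match only the exact spellings "Bro" and "bruh", so a lowercase "bro" would not be caught. One possible variation is to match on the `LOWER` attribute, which compares against the lowercased token text. The sketch below uses a blank English pipeline (`nlp_blank`, a name chosen here) so that it runs without a trained model; the sample sentence is made up, and tokens that no rule touches are left without a lemma in this setup.

```python
import spacy

# A blank pipeline keeps the sketch model-free: only the tokenizer
# and the attribute ruler run here.
nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("attribute_ruler")

# LOWER matches on the lowercased token text, so "Bro", "bro" and "BRUH" all match.
patterns = [[{"LOWER": "bro"}], [{"LOWER": "bruh"}]]
ruler.add(patterns=patterns, attrs={"LEMMA": "brother"})

doc = nlp_blank("Bro, bro and BRUH")
for token in doc:
    print(token.text, "-----", token.lemma_)
```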
