# 1. Stemming

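Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces words to a root or base form (called the stem), typically by chopping off prefixes or suffixes with fixed rules. Because no dictionary is consulted, the stem is not guaranteed to be a valid word (for example, "ability" becomes "abil").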
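To make the idea of suffix chopping concrete, here is a minimal sketch of a naive stemmer. The suffix list and the `naive_stem` helper are hypothetical and exist only for illustration; this is not the Porter algorithm, which applies several stages of rules with extra conditions.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ing", "able", "ity", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: irregular forms such as "ate" are left untouched.
    return word


for word in ["eating", "eats", "eat", "ate", "adjustable", "ability"]:
    print(word, "|", naive_stem(word))
```

The sketch has no notion of meaning or grammar, which is why an irregular form like "ate" is never mapped to "eat" and a result like "abil" is not a real word.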


In [17]:
# Example: stemming with NLTK's PorterStemmer

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

# Print each word next to its stem
for word in words:
    print(word, '|', stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


# 2. Lemmatization

Lemmatization is the process of reducing a word to its base or dictionary form, known as a lemma. Unlike stemming, lemmatization uses linguistic knowledge (like part of speech and word meaning) to ensure that the reduced form is a valid word.
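As a rough illustration of why part of speech matters, the sketch below looks lemmas up in a tiny hand-written table keyed by word and part-of-speech tag. The table and the `toy_lemmatize` helper are hypothetical; real lemmatizers rely on much larger lexicons, rules, or trained models.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse part-of-speech tag).
LEMMA_TABLE = {
    ("ate", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the table's lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


# The same surface form can have different lemmas depending on its part of speech.
for word, pos in [("ate", "VERB"), ("meeting", "VERB"), ("meeting", "NOUN"), ("better", "ADJ")]:
    print(word, pos, "|", toy_lemmatize(word, pos))
```

Every value in the table is a dictionary word, which is the property that separates lemmatization from stemming.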


In [32]:
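# Example: lemmatization with spaCy

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


# Document 1: isolated word forms

doc1 = nlp("eating eats eat ate adjustable rafting ability meeting better")

# token.lemma_ is the lemma string; token.lemma is that string's integer hash
for token in doc1:
    print(token, '|', token.lemma_, '|', token.lemma)


# Document 2: a full sentence

doc2 = nlp("Mando talked for 3 hours although talking isn't his thing")

print('\n')

for token in doc2:
    print(token, '|', token.lemma_, '|', token.lemma)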


eating | eat | 9837207709914848172
eats | eat | 9837207709914848172
eat | eat | 9837207709914848172
ate | eat | 9837207709914848172
adjustable | adjustable | 6033511944150694480
rafting | raft | 7154368781129989833
ability | ability | 11565809527369121409
meeting | meet | 6880656908171229526
better | well | 4525988469032889948


Mando | Mando | 7837215228004622142
talked | talk | 13939146775466599234
for | for | 16037325823156266367
3 | 3 | 602994839685422785
hours | hour | 9748623380567160636
although | although | 343236316598008647
talking | talk | 13939146775466599234
is | be | 10382539506755952630
n't | not | 447765159362469301
his | his | 2661093235354845946
thing | thing | 2473243759842082748
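The third column in the output above is `token.lemma`, the integer hash that spaCy stores in place of the lemma string. The following sketch shows the round trip between a string and its hash using a standalone `StringStore`, so no model download is assumed; with a loaded pipeline the same lookup goes through `nlp.vocab.strings`.

```python
from spacy.strings import StringStore

# A standalone string store; a loaded pipeline exposes one as nlp.vocab.strings.
store = StringStore()

# Adding a string returns its 64-bit hash.
lemma_hash = store.add("eat")
print(lemma_hash)

# Lookup works in both directions.
print(store[lemma_hash])           # hash -> string
print(store["eat"] == lemma_hash)  # string -> hash
```

Identical lemmas always share one hash, which is why "talked" and "talking" show the same number in the output above.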


## 2.1. Customizing the lemmatizer

In [36]:
# List the components of the loaded pipeline, in processing order

pipeline = nlp.pipe_names
print('Pipeline :', pipeline)


Pipeline : ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [50]:
# Customize the attribute_ruler: give the tokens "Bro" and "Brah" the lemma "Brother"

ruler = nlp.get_pipe("attribute_ruler")

# Each inner list is one single-token pattern; TEXT matches the exact token text
ruler.add([[{"TEXT": "Brah"}], [{"TEXT": "Bro"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token.text, "|", token.lemma_)


# Check the two overridden tokens by index

word1 = doc[0].lemma_
print('\nBro :', word1)

word2 = doc[6].lemma_
print('\nBrah :', word2)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust

Bro : Brother

Brah : Brother
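The patterns above match the exact token text, so a lowercase "bro" would not be rewritten. One possible variation, sketched here on a blank English pipeline so that no trained model is assumed, matches on the lowercased text with the `LOWER` attribute and the `IN` operator. The example sentence is made up for illustration.

```python
import spacy

# Blank pipeline: tokenizer only, plus the attribute_ruler added below.
nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("attribute_ruler")

# One single-token pattern that matches any casing of "bro" or "brah".
ruler.add(
    patterns=[[{"LOWER": {"IN": ["bro", "brah"]}}]],
    attrs={"LEMMA": "Brother"},
)

doc = nlp_blank("bro, you coming? BRAH, say yes!")

# Collect the lemma assigned to each token text.
lemmas = {token.text: token.lemma_ for token in doc}
print(lemmas["bro"], "|", lemmas["BRAH"])
```

In this blank pipeline only the matched tokens receive a lemma, because no lemmatizer component is present; in the full `en_core_web_sm` pipeline the remaining tokens would still be handled by the lemmatizer.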
