<center><h1>Stemming and Lemmatization</h1></center>

In [1]:
import nltk
import spacy

<b>Stemming in NLTK</b>

<b>Porter Stemmer</b>

In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemmer = PorterStemmer()

In [4]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting", "running",
         "flies", "better", "children", "leaves", "studies", "playing"]

In [5]:
for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
running | run
flies | fli
better | better
children | children
leaves | leav
studies | studi
playing | play


<b>Lancaster Stemmer</b>

In [6]:
from nltk.stem import LancasterStemmer

In [7]:
lancaster = LancasterStemmer()

In [8]:
for word in words:
    print(word, "|", lancaster.stem(word))

eating | eat
eats | eat
eat | eat
ate | at
adjustable | adjust
rafting | raft
ability | abl
meeting | meet
running | run
flies | fli
better | bet
children | childr
leaves | leav
studies | study
playing | play


<b>Snowball Stemmer</b>

In [9]:
from nltk.stem import SnowballStemmer

In [10]:
snowball = SnowballStemmer("english")

In [11]:
for word in words:
    print(word, "|", snowball.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
running | run
flies | fli
better | better
children | children
leaves | leav
studies | studi
playing | play


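- Porter Stemmer: The most commonly used of the three, but it sometimes over-stems and produces non-word stems such as "abil" and "studi".
- Lancaster Stemmer: The most aggressive; it often cuts words too short ("better" becomes "bet", "children" becomes "childr").
- Snowball Stemmer: A refinement of the Porter algorithm (also called Porter2) that supports multiple languages. On this word list its output is identical to Porter's.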
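
To see where the stemmers diverge, it can help to list only the words on which they disagree. The following is a minimal sketch, not part of the original notebook: `stem_disagreements` is a hypothetical helper that accepts any mapping of labels to stemming functions, so it should work with the `stem` methods of the objects created above.

```python
def stem_disagreements(words, stemmers):
    """Return {word: {stemmer_name: stem}} for words whose stems differ.

    `stemmers` maps a label to any callable that turns a word into a stem,
    e.g. {"porter": stemmer.stem, "lancaster": lancaster.stem}.
    """
    differing = {}
    for word in words:
        stems = {name: stem(word) for name, stem in stemmers.items()}
        # Keep the word only if at least two stemmers produced different stems.
        if len(set(stems.values())) > 1:
            differing[word] = stems
    return differing


# Hypothetical usage with the objects defined earlier in this notebook:
# stem_disagreements(words, {"porter": stemmer.stem,
#                            "lancaster": lancaster.stem,
#                            "snowball": snowball.stem})
```

Judging by the outputs printed above, this would report "ate", "ability", "better", "children", and "studies", the five words where Lancaster differs from the other two.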

<b>Lemmatization in NLTK</b>

In [12]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [13]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [14]:
lemmatizer = WordNetLemmatizer()

In [15]:
# Lemmatizing words as nouns (the default when no POS is given)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized (default as noun):", lemmatized_words)

# Lemmatizing with explicit POS tags
print("Lemmatized (as verb):", lemmatizer.lemmatize("running", pos='v'))  # run
print("Lemmatized (as adjective):", lemmatizer.lemmatize("better", pos='a'))  # good

Lemmatized (default as noun): ['eating', 'eats', 'eat', 'ate', 'adjustable', 'rafting', 'ability', 'meeting', 'running', 'fly', 'better', 'child', 'leaf', 'study', 'playing']
Lemmatized (as verb): run
Lemmatized (as adjective): good


In [16]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

In [17]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Hp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True
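
The imports and tagger download above are the pieces that POS-aware lemmatization needs: `pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects WordNet's single-letter codes. Below is a minimal sketch of a bridge between the two. `penn_to_wordnet` and `lemmatize_tokens` are hypothetical helper names, not NLTK functions, and the tagger resource name may differ between NLTK versions.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun, the lemmatizer's default


def lemmatize_tokens(tokens):
    """Lemmatize a list of tokens using the POS tag predicted for each one."""
    # Imported here so the tag mapping above can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
            for token, tag in pos_tag(tokens)]
```

Calling `lemmatize_tokens(word_tokenize(sentence))` would apply this to raw text; the lemmas depend on the tags the tagger assigns, so they can vary with context.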

<b>spaCy doesn't support stemming</b>

<b>Lemmatization in spaCy</b>

In [18]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    # lemma_ is the lemma as text; lemma is its integer hash in the vocabulary
    print(token, " | ", token.lemma_, " | ", token.lemma)

eating  |  eat  |  9837207709914848172
eats  |  eat  |  9837207709914848172
eat  |  eat  |  9837207709914848172
ate  |  eat  |  9837207709914848172
adjustable  |  adjustable  |  6033511944150694480
rafting  |  raft  |  7154368781129989833
ability  |  ability  |  11565809527369121409
meeting  |  meet  |  6880656908171229526
better  |  well  |  4525988469032889948
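
The last column above is `token.lemma`, the integer hash under which spaCy stores the lemma string, which is why every form of "eat" shares one number. A small sketch of how the hash and the string relate, assuming the `nlp` object loaded above: `lemma_hash_check` is a hypothetical helper that looks each hash up in `nlp.vocab.strings` and compares the result with `lemma_`.

```python
def lemma_hash_check(nlp, text):
    """Return (token text, lemma string, hash-resolves-to-lemma) per token."""
    rows = []
    for token in nlp(text):
        # nlp.vocab.strings maps a hash back to the string it was built from.
        resolved = nlp.vocab.strings[token.lemma]
        rows.append((token.text, token.lemma_, resolved == token.lemma_))
    return rows


# Hypothetical usage with the pipeline loaded in this notebook:
# lemma_hash_check(nlp, "eating eats eat ate")
```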


<b>Customizing the lemmatizer</b>

In [19]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [20]:
ar = nlp.get_pipe('attribute_ruler')

# Map the slang tokens "Bro" and "Brah" to the lemma "Brother"
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [21]:
doc[6]

Brah

In [22]:
doc[6].lemma_

'Brother'