<h3>Stemming in NLTK</h3>

Stemming reduces words to a root or base form by stripping suffixes according to heuristic rules. Collapsing variants such as "eating" and "eats" into one form shrinks the vocabulary and lets tasks like text search and information retrieval treat them as the same term. NLTK (the Natural Language Toolkit) provides several stemming algorithms, and the Porter stemmer is one of the most commonly used. Because a stemmer only manipulates the surface form of a word, its output is not always a valid word ("ability" becomes "abil"), and irregular forms such as "ate" are left unchanged.

In [4]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [10]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet
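
To make the idea of rule-based suffix removal concrete, here is a minimal sketch of a toy suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix stripper: an illustration only, NOT the Porter algorithm.
# The suffix list and the minimum stem length are hypothetical choices.
SUFFIXES = ("able", "ing", "ity", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied, so the word is returned unchanged (e.g. "ate").
    return word


words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]
for word in words:
    print(word, "|", toy_stem(word))
```

For these eight words the sketch happens to print the same stems as the Porter output above. It also shows why "ate" stays "ate": no suffix rule can relate an irregular past tense to "eat".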


<h3>Lemmatization in spaCy</h3>

Lemmatization transforms words into their base or dictionary forms, called lemmas. Unlike stemming, which chops off word endings by rule, lemmatization draws on vocabulary and grammatical information such as part of speech, so its output is a valid word and irregular forms can be resolved ("ate" becomes "eat"). spaCy performs lemmatization as part of its text processing pipeline, which makes it useful for tasks such as text normalization and information retrieval.

In [24]:
import spacy

In [25]:
nlp = spacy.load("en_core_web_sm")

# Alternative input to try: "Mando talked for 3 hours although talking isn't his thing"
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  rafting
ability  |  ability
meeting  |  meeting
better  |  well
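
The output above leaves "meeting" unchanged and maps "better" to "well", which follows from lemmas depending on part of speech. The sketch below illustrates that dependency with a tiny hand-written lookup table. It is not spaCy's implementation. The table contents, the tag names, and the function `toy_lemmatize` are hypothetical.

```python
# Toy lemmatizer: a hypothetical lookup table, not spaCy's implementation.
# Keys are (lowercased word, coarse part-of-speech tag).
LEMMA_TABLE = {
    ("ate", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


for word, pos in [("ate", "VERB"), ("meeting", "NOUN"), ("meeting", "VERB"),
                  ("better", "ADJ"), ("better", "ADV")]:
    print(word, pos, "|", toy_lemmatize(word, pos))
```

The same surface form gets a different lemma depending on its tag, so the lemma a real pipeline prints for a word like "better" depends on the part of speech its tagger assigns in that sentence.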


<h3>Customizing the lemmatizer</h3>

Customizing the lemmatizer in spaCy means overriding its default output for specific words, typically slang or domain-specific terms that the trained pipeline does not handle the way you need. You write rules that say which tokens should receive which lemma. In the cells below this is done through the pipeline's attribute_ruler component, which maps token patterns to attribute values such as LEMMA.

In [26]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [29]:
ar = nlp.get_pipe('attribute_ruler')

# Two single-token patterns (exact, case-sensitive text match) share one lemma.
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


In [35]:
doc[6]

Brah

In [36]:
doc[6].lemma_

'Brother'
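
The rule above matches on "TEXT", which is case-sensitive, so it would not fire on "bro" or "BRO". As a hedged sketch of one way to cover those variants, the example below matches on the "LOWER" token attribute instead. It assumes spaCy v3 is installed. It uses a blank English pipeline (tokenizer only) so that no trained model is needed, and the names `nlp_blank` and `ruler` are introduced here purely for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, so no trained model is required.
nlp_blank = spacy.blank("en")
ruler = nlp_blank.add_pipe("attribute_ruler")

# LOWER compares against the lowercased token text, so each pattern
# covers every capitalization of the word ("Bro", "bro", "BRO", ...).
ruler.add(
    patterns=[[{"LOWER": "bro"}], [{"LOWER": "brah"}]],
    attrs={"LEMMA": "brother"},
)

doc_blank = nlp_blank("Bro and BRO and brah")
lemmas = {token.text: token.lemma_ for token in doc_blank}
print(lemmas["Bro"], lemmas["BRO"], lemmas["brah"])
```

In a full pipeline such as en_core_web_sm, the same `add` call can be made on the existing component returned by `nlp.get_pipe('attribute_ruler')`, as in the cell above.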