# Stemming and Lemmatization in NLP 

Stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their base
or root forms, which helps in normalizing and standardizing text data. While both techniques aim to achieve similar goals,
they operate differently and have distinct characteristics.

### Stemming:
Stemming is the process of removing affixes from words to derive their stems, which are the base or root forms.
Stemming algorithms apply heuristic rules to chop off prefixes or suffixes from words, resulting in the stem. 
The goal of stemming is to map different inflected or derived forms of a word to the same root form, 
thereby reducing variation and improving information retrieval or text analysis tasks.

For example, the Porter Stemmer reduces "running" to "run" and "cats" to "cat," but leaves "better" as "better,"
because suffix rules cannot relate it to "good." Stemming is typically a rule-based process and may not always produce
valid words as stems (the Porter Stemmer turns "hobby" into "hobbi"), but it is computationally efficient and
straightforward to implement.
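
The rule-based idea can be illustrated with a toy suffix stripper. This is a minimal sketch and not the Porter algorithm: the rule list, the `MIN_STEM_LENGTH` threshold, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only, NOT the Porter algorithm).
# Rules are tried in order; the first one that leaves a long-enough stem wins.
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # hobbies -> hobbi
    ("ss", "ss"),    # protect words such as "class"
    ("ing", ""),     # jumping -> jump
    ("ed", ""),      # jumped -> jump
    ("s", ""),       # cats -> cat
]
MIN_STEM_LENGTH = 3  # hypothetical guard against over-stripping short words


def toy_stem(word):
    """Return a crude stem by stripping the first matching suffix."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


for w in ["cats", "hobbies", "running", "sing", "better"]:
    print(w, "->", toy_stem(w))
```

With these rules "running" becomes "runn" rather than "run", which hints at why real stemmers carry extra rules such as undoubling consonants.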

One of the most commonly used stemming algorithms is the Porter Stemmer, developed by Martin Porter. 
NLTK provides an implementation of the Porter Stemmer, along with other stemming algorithms like the 
Lancaster Stemmer and Snowball Stemmer.

In [1]:
# NLTK provides several stemmers. This cell compares three of them;
# the sections below look at the first two in more detail.
# 1. Porter stemmer
# 2. Lancaster stemmer
# 3. Snowball stemmer

import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Sample words
words = ['running', 'cats', 'better', 'running']

# Initialize stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer("english")  # You need to specify the language for SnowballStemmer

# Stem words using each stemmer
porter_stemmed_words = [porter_stemmer.stem(word) for word in words]
lancaster_stemmed_words = [lancaster_stemmer.stem(word) for word in words]
snowball_stemmed_words = [snowball_stemmer.stem(word) for word in words]

# Print stemmed words
print("Porter Stemmer:", porter_stemmed_words)
print("Lancaster Stemmer:", lancaster_stemmed_words)
print("Snowball Stemmer:", snowball_stemmed_words)

Porter Stemmer: ['run', 'cat', 'better', 'run']
Lancaster Stemmer: ['run', 'cat', 'bet', 'run']
Snowball Stemmer: ['run', 'cat', 'better', 'run']


### Porter Stemmer

In [2]:

from nltk.stem import PorterStemmer

# Initialize PorterStemmer
porter_stemmer = PorterStemmer()

# Stem the word "hobby"
stemmed_word = porter_stemmer.stem("hobby")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: hobbi


In [3]:
from nltk.stem import PorterStemmer

# Initialize PorterStemmer
porter_stemmer = PorterStemmer()

# Stem the word "hobbies"
stemmed_word = porter_stemmer.stem("hobbies")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: hobbi


In [4]:
from nltk.stem import PorterStemmer

# Initialize PorterStemmer
porter_stemmer = PorterStemmer()

# Stem the word "computer"
stemmed_word = porter_stemmer.stem("computer")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: comput


In [5]:
from nltk.stem import PorterStemmer

# Initialize PorterStemmer
porter_stemmer = PorterStemmer()

# Stem the word "computation"
stemmed_word = porter_stemmer.stem("computation")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: comput
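
Mapping related forms to one stem is what helps retrieval: a query for one form can match text containing another. The following is a minimal sketch of that grouping. The names `group_by_stem` and `PORTER_STEMS` are hypothetical, and the stems are copied from the Porter outputs above so that the block runs without NLTK.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Group words that share a stem; `stem` is any callable mapping word -> stem."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


# Stems copied from the Porter outputs shown above (assumption: used as a stand-in stemmer).
PORTER_STEMS = {
    "hobby": "hobbi",
    "hobbies": "hobbi",
    "computer": "comput",
    "computation": "comput",
}

groups = group_by_stem(PORTER_STEMS, PORTER_STEMS.get)
print(groups)
```

In practice the `stem` argument could be `PorterStemmer().stem` instead of the lookup table.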


### Lancaster Stemmer

In [6]:
from nltk.stem import LancasterStemmer

# Initialize LancasterStemmer
lancaster_stemmer = LancasterStemmer()

# Stem the word "hobby"
stemmed_word = lancaster_stemmer.stem("hobby")

# Print the stemmed word
print("Stemmed word:", stemmed_word)


Stemmed word: hobby


In [7]:
from nltk.stem import LancasterStemmer

# Initialize LancasterStemmer
lancaster_stemmer = LancasterStemmer()

# Stem the word "hobbies"
stemmed_word = lancaster_stemmer.stem("hobbies")

# Print the stemmed word
print("Stemmed word:", stemmed_word)


Stemmed word: hobby


In [8]:
from nltk.stem import LancasterStemmer

# Initialize LancasterStemmer
lancaster_stemmer = LancasterStemmer()

# Stem the word "computer"
stemmed_word = lancaster_stemmer.stem("computer")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: comput


In [9]:
from nltk.stem import LancasterStemmer

# Initialize LancasterStemmer
lancaster_stemmer = LancasterStemmer()

# Stem the word "computation"
stemmed_word = lancaster_stemmer.stem("computation")

# Print the stemmed word
print("Stemmed word:", stemmed_word)

Stemmed word: comput


In [10]:
sentence = "I was going to the office on my time bike when i saw car passing by hit the tree."
tokens = nltk.word_tokenize(sentence)  # word_tokenize already returns a list
tokens

['I',
 'was',
 'going',
 'to',
 'the',
 'office',
 'on',
 'my',
 'time',
 'bike',
 'when',
 'i',
 'saw',
 'car',
 'passing',
 'by',
 'hit',
 'the',
 'tree',
 '.']

In [11]:
sentence

'I was going to the office on my time bike when i saw car passing by hit the tree.'

In [12]:
from nltk.stem import SnowballStemmer, LancasterStemmer, PorterStemmer
import nltk

# Tokenize the sentence first; iterating over the raw string would stem one character at a time
sentence = "I was going to the office on my time bike when i saw car passing by hit the tree."
tokens = nltk.word_tokenize(sentence)

# Initialize stemmers
snowball_stemmer = SnowballStemmer("english")
lancaster_stemmer = LancasterStemmer()
porter_stemmer = PorterStemmer()

# Stem tokens using each stemmer
snowball_stemmed = [snowball_stemmer.stem(token) for token in tokens]
lancaster_stemmed = [lancaster_stemmer.stem(token) for token in tokens]
porter_stemmed = [porter_stemmer.stem(token) for token in tokens]

# Print stemmed tokens
print("Snowball Stemmer:", ' '.join(snowball_stemmed))
print("Lancaster Stemmer:", ' '.join(lancaster_stemmed))
print("Porter Stemmer:", ' '.join(porter_stemmed))


Snowball Stemmer: i was go to the offic on my time bike when i saw car pass by hit the tree .
Lancaster Stemmer: i was going to the off on my tim bik when i saw car pass by hit the tre .
Porter Stemmer: i wa go to the offic on my time bike when i saw car pass by hit the tree .
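
A common pitfall in loops like the one above is passing a raw sentence where a list of tokens is expected: iterating over a Python string visits single characters, so each "token" handed to the stemmer is one letter. The sketch below shows the difference and adds a small guard. The helper name `stem_tokens` is hypothetical, and `str.upper` and `str.split` are stand-ins for a real stemmer and tokenizer so that the block needs no external data.

```python
def stem_tokens(tokens, stem):
    """Apply `stem` to each token, rejecting a raw string passed by mistake."""
    if isinstance(tokens, str):
        raise TypeError("expected a list of tokens; tokenize the string first")
    return [stem(token) for token in tokens]


sentence = "cats running"

# Iterating over a string visits characters, not words.
print(list(sentence)[:4])  # ['c', 'a', 't', 's']

# str.split() is a crude whitespace tokenizer, used here only as a stand-in.
tokens = sentence.split()
print(stem_tokens(tokens, str.upper))  # ['CATS', 'RUNNING']
```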


### Lemmatization:
Lemmatization, on the other hand, determines the lemma, or canonical dictionary form, of a word. Unlike stemming,
which simply chops off affixes, lemmatization requires access to a dictionary or lexicon that maps words to their
lemmas, allowing it to produce valid words as output. The correct lemma often depends on the word's part of speech,
which a lemmatizer either infers from context or takes as an input.

For example, the adjective "better" would be lemmatized to "good," the verb "running" to "run," and the noun "cats"
to "cat." Because the resulting base forms are valid words, lemmatization is suitable for applications where
linguistic accuracy is crucial.

NLTK provides lemmatization functionality through the WordNet Lemmatizer, which is based on WordNet,
a lexical database of English.
The WordNet Lemmatizer looks words up with WordNet's built-in morphy function and returns the input unchanged
if no lemma is found. Its `lemmatize` method treats every word as a noun unless a part of speech is passed
through the `pos` argument.
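
The dictionary-lookup idea can be sketched without WordNet. This toy version is only an illustration under stated assumptions: `TOY_LEXICON` is a tiny hand-written table, `toy_lemmatize` is a hypothetical name, and a real lemmatizer covers far more words and morphological rules.

```python
# Tiny hand-written lexicon keyed by (word, part of speech); entries are illustrative.
TOY_LEXICON = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("cats", "n"): "cat",
    ("wolves", "n"): "wolf",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the lexicon; return the word unchanged if absent."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("running"))           # running (no noun entry)
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("better", pos="a"))   # good
```

Keying the table on both the word and its part of speech is the design choice that lets "running" resolve differently as a noun and as a verb.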

In [13]:
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()

In [14]:
print(lemma.lemmatize('running'))

running


In [15]:
print(lemma.lemmatize('runs'))

run


In [16]:
print(lemma.lemmatize('run'))

run


In [17]:
print(lemma.lemmatize('running',pos='v'))

run


In [18]:
print(lemma.lemmatize('runs',pos='v'))

run


In [19]:
print(lemma.lemmatize('run',pos='v'))

run
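
The cells above pass `pos='v'` by hand. When tags come from a part-of-speech tagger, they usually need to be translated into WordNet's single-letter codes first. The following is a minimal sketch, assuming Penn Treebank tags as input; the function name `penn_to_wordnet_pos` is hypothetical, and falling back to noun for every other tag is a simplifying assumption.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag


for tag in ["VBG", "NNS", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can be passed as the `pos` argument of `lemma.lemmatize(word, pos=...)`. The tags themselves could come from `nltk.pos_tag`, which needs its tagger model downloaded beforehand.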


In [20]:
# Stemming
import nltk
from nltk.stem.porter import PorterStemmer

text = "Bring King Going Anything Sing Ring Nothing Thing."
porter_stemmer = PorterStemmer()
tokenization = nltk.word_tokenize(text)

# Stem each token and print the result
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))

Stemming for Bring is bring
Stemming for King is king
Stemming for Going is go
Stemming for Anything is anyth
Stemming for Sing is sing
Stemming for Ring is ring
Stemming for Nothing is noth
Stemming for Thing is thing
Stemming for . is .


In [21]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample text
text = "Bring King Going Anything Sing Ring Nothing Thing."

# Initialize WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenize the text into words
tokenization = nltk.word_tokenize(text)

# Lemmatize each word and print the result
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemma for Bring is Bring
Lemma for King is King
Lemma for Going is Going
Lemma for Anything is Anything
Lemma for Sing is Sing
Lemma for Ring is Ring
Lemma for Nothing is Nothing
Lemma for Thing is Thing
Lemma for . is .


In [22]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample text
text = "The quick brown foxes are jumping over the lazy dogs"

# Initialize WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Tokenize the text into words
tokenization = nltk.word_tokenize(text)

# Lemmatize each word and print the result
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))


Lemma for The is The
Lemma for quick is quick
Lemma for brown is brown
Lemma for foxes is fox
Lemma for are is are
Lemma for jumping is jumping
Lemma for over is over
Lemma for the is the
Lemma for lazy is lazy
Lemma for dogs is dog


In [23]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample text
text = "The quick brown foxes are jumping over the lazy dogs"

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Initialize WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize each word in the text
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in words]

# Print the lemmatized words
print("Lemmatized words:")
for word, lemma in zip(words, lemmatized_words):
    print("{} -> {}".format(word, lemma))


Lemmatized words:
The -> The
quick -> quick
brown -> brown
foxes -> fox
are -> are
jumping -> jumping
over -> over
the -> the
lazy -> lazy
dogs -> dog


In [24]:
import nltk
from nltk.stem import WordNetLemmatizer

# Sample text
text = "The wolves were running through the forests and barking loudly at the moon."

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Initialize WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize each word in the text
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in words]

# Print the lemmatized words
print("Lemmatized words:")
for word, lemma in zip(words, lemmatized_words):
    print("{} -> {}".format(word, lemma))


Lemmatized words:
The -> The
wolves -> wolf
were -> were
running -> running
through -> through
the -> the
forests -> forest
and -> and
barking -> barking
loudly -> loudly
at -> at
the -> the
moon -> moon
. -> .
