# What is Stemming in NLP?

* Stemming is a technique for reducing a word to its stem (root form) by removing affixes such as suffixes and prefixes. It is an important preprocessing step in Natural Language Understanding (NLU) and Natural Language Processing (NLP).


* The goal of stemming is to reduce the different forms of a word to a common base form, which makes many downstream tasks easier.


* This can be useful for text analysis tasks such as sentiment analysis, information retrieval, and document clustering, where the goal is to identify the underlying meaning of the words in a text.


* Example: "eats" and "eating" can both be stemmed to "eat". Irregular forms such as "eaten" are not always reduced; whether they are depends on the stemmer.
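
To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The `toy_stem` function and its three-suffix list are hypothetical and far cruder than any real stemmer; they only illustrate how different surface forms collapse onto one base form.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny rule set; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Group word forms by the stem they share.
groups = defaultdict(list)
for w in ["eats", "eating", "eat", "plays", "played", "playing"]:
    groups[toy_stem(w)].append(w)

print(dict(groups))
# {'eat': ['eats', 'eating', 'eat'], 'play': ['plays', 'played', 'playing']}
```

Grouping by stem like this is what lets a search for "play" also match documents that contain "played" or "playing".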

![maxresdefault.jpg](attachment:maxresdefault.jpg)

# Limitations of Stemming


* A stemmer does not process a whole text at once; it works on individual words, which is why tokenization is needed before stemming.


* It is important to note that stemming is not a perfect process, and there can be cases where the stemmed form of a word does not accurately capture its meaning. However, it can be a useful tool in many NLU and NLP applications.


* Example: the word "amiable" might be stemmed to "amiabl" and "single" to "singl", neither of which is an English word.

# Types of Stemmers


There are several types of stemmers used in Natural Language Processing (NLP), each with its own approach to stemming words.
    
    
* Porter Stemmer


* Snowball Stemmer


* Lancaster Stemmer


* Regular Expression Stemmer


These algorithms use different rules and heuristics to identify the base form of a word, but they all aim to reduce the different forms of a word to a common base form.
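
In NLTK, all four of these stemmers expose the same `stem(word)` method, so they can be swapped freely. The sketch below assumes NLTK is installed; the regular expression passed to `RegexpStemmer` is an illustrative choice, not a recommended rule set.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, RegexpStemmer, SnowballStemmer

# One instance of each stemmer type, keyed by a short label.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
    "regexp": RegexpStemmer("ing$|ed$|s$", min=4),  # illustrative pattern
}

# Every stemmer is called the same way: stem(word) -> str.
results = {name: stemmer.stem("playing") for name, stemmer in stemmers.items()}
print(results)
```

On an easy word like "playing" all four agree on "play"; the sections that follow show words where their outputs diverge.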

# Porter Stemmer

* The Porter Stemmer is an algorithm for stemming words in Natural Language Processing (NLP). It was developed by Martin Porter in 1980 and is one of the most widely used stemmers in NLP.


* The Porter Stemmer was designed for the English language. Porter later created the Snowball framework, which provides stemmers in the same style for other languages, including French, German, Italian, Portuguese, Spanish, Dutch, and Swedish.


* The goal of the Porter Stemmer is to reduce words to their base or root form by removing any suffixes that might be attached to them. For example, the word "jumping" could be stemmed to "jump", while "jumps" could also be stemmed to "jump".


* Limitation: it removes only suffixes, not prefixes.
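
The Porter algorithm applies its suffix rules in a fixed sequence of steps. As an illustration, the sketch below implements only the first rule group (step 1a, which handles plurals) as described in Porter's 1980 paper. The function name `porter_step_1a` is hypothetical, and this is not the full algorithm; NLTK's `PorterStemmer` applies several further steps and some extensions of its own.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I
    if word.endswith("ss"):
        return word        # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S    -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The rules are tried longest suffix first, so "caresses" is handled by the `sses` rule rather than by the plain `s` rule.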

In [1]:
from nltk.stem import PorterStemmer

PorterStemmer().stem('unacceptable')

'unaccept'

In [2]:
stemmer = PorterStemmer()  # create a PorterStemmer object

stemmer.stem('eating')

'eat'

In [3]:
# A stemmer treats its input as a single word, so a whole sentence must be tokenized before stemming.

stemmer.stem('eating writing playing')

'eating writing play'
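
The cell above returns the sentence almost unchanged because the stemmer sees it as one long word. A minimal fix, sketched below, is to split the text into tokens first and stem each one. The regular-expression tokenizer and the helper name `stem_text` are assumptions made for illustration; NLTK's own tokenizers would work as well.

```python
import re

from nltk.stem import PorterStemmer


def stem_text(sentence):
    """Split a sentence into alphabetic tokens and stem each token."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[A-Za-z]+", sentence)  # simple illustrative tokenizer
    return [stemmer.stem(token) for token in tokens]


stems = stem_text("eating writing playing")
print(stems)  # ['eat', 'write', 'play']
```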

In [4]:
text = ['unacceptable', 'eating', 'EATS', 'eaten', 'playing', 'played', 'history', 'very', 'orderly', 'lovely',
        'writing', 'writes', 'programming', 'programs', 'fairly', 'sportingly', 'a', 'an']

for x in text:
    print(x+"----->"+stemmer.stem(x))

unacceptable----->unaccept
eating----->eat
EATS----->eat
eaten----->eaten
playing----->play
played----->play
history----->histori
very----->veri
orderly----->orderli
lovely----->love
writing----->write
writes----->write
programming----->program
programs----->program
fairly----->fairli
sportingly----->sportingli
a----->a
an----->an


In [5]:
# By default the output is lowercased even when the input is upper case (see 'EATS' above).

for x in text:
    print(x+"----->"+stemmer.stem(x).upper(),'\n')

unacceptable----->UNACCEPT 

eating----->EAT 

EATS----->EAT 

eaten----->EATEN 

playing----->PLAY 

played----->PLAY 

history----->HISTORI 

very----->VERI 

orderly----->ORDERLI 

lovely----->LOVE 

writing----->WRITE 

writes----->WRITE 

programming----->PROGRAM 

programs----->PROGRAM 

fairly----->FAIRLI 

sportingly----->SPORTINGLI 

a----->A 

an----->AN 



In [6]:
for x in text:
    print(stemmer.stem(x))

unaccept
eat
eat
eaten
play
play
histori
veri
orderli
love
write
write
program
program
fairli
sportingli
a
an


In [7]:
Porter_Stemmer = [stemmer.stem(x) for x in text]

print(Porter_Stemmer)

len(Porter_Stemmer)

['unaccept', 'eat', 'eat', 'eaten', 'play', 'play', 'histori', 'veri', 'orderli', 'love', 'write', 'write', 'program', 'program', 'fairli', 'sportingli', 'a', 'an']


18

# Lancaster Stemmer

* The Lancaster Stemmer is an algorithm for stemming words in Natural Language Processing (NLP). It was developed by Chris Paice at Lancaster University in the United Kingdom in 1990.


* Like the Porter Stemmer, the Lancaster Stemmer aims to reduce words to their base or root form by removing any suffixes that might be attached to them.


* The Lancaster Stemmer is more aggressive than the Porter Stemmer: it strips more suffixes and tends to produce shorter stems, which can be useful in certain NLP tasks.


* It is important to note that like all stemmers, the Lancaster Stemmer is not perfect and can sometimes result in stemmed words that do not accurately capture the meaning of the original word. 


* Limitation: it removes only suffixes, not prefixes, and it is more prone to overstemming (removing too much of a word).

In [8]:
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

In [9]:
for x in text:
    print(x+"----->"+lancaster.stem(x))

unacceptable----->unacceiv
eating----->eat
EATS----->eat
eaten----->eat
playing----->play
played----->play
history----->hist
very----->very
orderly----->ord
lovely----->lov
writing----->writ
writes----->writ
programming----->program
programs----->program
fairly----->fair
sportingly----->sport
a----->a
an----->an


In [10]:
Lancaster_Stemmer = [lancaster.stem(x) for x in text]

print(Lancaster_Stemmer)

len(Lancaster_Stemmer)

['unacceiv', 'eat', 'eat', 'eat', 'play', 'play', 'hist', 'very', 'ord', 'lov', 'writ', 'writ', 'program', 'program', 'fair', 'sport', 'a', 'an']


18
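
One rough way to see how much more aggressive the Lancaster Stemmer is: compare how many characters each stemmer keeps for the same words. The sketch below hard-codes the two output lists printed above, so it runs without NLTK. Total stem length is an informal measure chosen for illustration, not a standard metric.

```python
# Outputs copied from the Porter and Lancaster cells above.
porter_stems = ['unaccept', 'eat', 'eat', 'eaten', 'play', 'play', 'histori', 'veri', 'orderli',
                'love', 'write', 'write', 'program', 'program', 'fairli', 'sportingli', 'a', 'an']
lancaster_stems = ['unacceiv', 'eat', 'eat', 'eat', 'play', 'play', 'hist', 'very', 'ord',
                   'lov', 'writ', 'writ', 'program', 'program', 'fair', 'sport', 'a', 'an']

# Total number of characters each stemmer leaves behind.
porter_chars = sum(len(stem) for stem in porter_stems)
lancaster_chars = sum(len(stem) for stem in lancaster_stems)
print(porter_chars, lancaster_chars)  # 92 73

# Pairs where the Lancaster stem is strictly shorter than the Porter stem.
shorter = [(por, lan) for por, lan in zip(porter_stems, lancaster_stems) if len(lan) < len(por)]
print(len(shorter))  # 8
```

On this word list the Lancaster stem is never longer than the Porter stem, and it is shorter for 8 of the 18 words.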

# Regex Stemmer


* A regex stemmer is a type of stemmer that uses regular expressions to identify and remove affixes (both prefixes and suffixes) from words. Any substring that matches the regular expression is removed.


* A Regex Stemmer can be customized to match the specific affixes and word forms present in a given language or corpus. This can make it more effective at handling irregular or rare word forms. 


* One limitation of a regex stemmer is that it may be less accurate than other stemmers that use more sophisticated algorithms to identify and remove suffixes. 


* Unlike the other stemmers shown here, it does not lowercase its input; the output keeps the original case.
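
The behaviour described above can be sketched with the standard `re` module alone. The helper `make_regex_stemmer` and its `min_length` and `flags` parameters are hypothetical names introduced for this illustration. It assumes that every match is simply deleted and that words shorter than the minimum length are left untouched.

```python
import re


def make_regex_stemmer(pattern, min_length=0, flags=0):
    """Return a function that deletes every match of `pattern` from a word."""
    compiled = re.compile(pattern, flags)

    def stem(word):
        # Words shorter than min_length are returned untouched.
        if len(word) < min_length:
            return word
        return compiled.sub("", word)

    return stem


rules = r"^un|ing$|ingly$|ly$|s$|able$|ed$"
stem = make_regex_stemmer(rules, min_length=2)
stem_ignoring_case = make_regex_stemmer(rules, min_length=2, flags=re.IGNORECASE)

print(stem("unacceptable"))        # accept
print(stem("EATS"))                # EATS (the lowercase rule "s$" does not match "S")
print(stem_ignoring_case("EATS"))  # EAT
```

Because the rules are case-sensitive by default, upper-case input passes through unchanged unless the pattern is compiled with `re.IGNORECASE` or the word is lowercased first.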

In [11]:
from nltk.stem import RegexpStemmer

# min=2: words shorter than 2 characters are returned unchanged
reg_stemmer = RegexpStemmer('^un|ing$|ingly$|ly$|s$|able$|ed$', min=2)

In [12]:
reg_stemmer.stem('unacceptable')

'accept'

In [13]:
for x in text:
    print(x+"----->"+reg_stemmer.stem(x))

unacceptable----->accept
eating----->eat
EATS----->EATS
eaten----->eaten
playing----->play
played----->play
history----->history
very----->very
orderly----->order
lovely----->love
writing----->writ
writes----->write
programming----->programm
programs----->program
fairly----->fair
sportingly----->sport
a----->a
an----->an


In [14]:
Regex_Stemmer = [reg_stemmer.stem(x) for x in text]

print(Regex_Stemmer)

len(Regex_Stemmer)

['accept', 'eat', 'EATS', 'eaten', 'play', 'play', 'history', 'very', 'order', 'love', 'writ', 'write', 'programm', 'program', 'fair', 'sport', 'a', 'an']


18

# Snowball Stemmer


* The Snowball Stemmer was developed by Martin Porter and his team in 2001 and is a more modern algorithmic approach to stemming. Its English stemmer is also known as Porter2.


* Snowball Stemmer is available in multiple languages like Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.


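* One advantage of the Snowball Stemmer is that it is generally more accurate than the original Porter Stemmer, which can sometimes overstem or understem words. In this notebook's outputs, for example, Snowball reduces "fairly" to "fair", where the Porter Stemmer produces "fairli".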

In [15]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.stem import SnowballStemmer

SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')
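
Each name in this tuple can be passed to `SnowballStemmer` to get a stemmer for that language. The sketch below assumes NLTK is installed. The sample words are arbitrary choices for illustration, and their stems are printed rather than stated here because they depend on each language's rules.

```python
from nltk.stem import SnowballStemmer

# Arbitrary sample words, one per language (illustrative only).
samples = {
    "english": "running",
    "spanish": "corriendo",
    "german": "laufend",
}

stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)  # language must be in SnowballStemmer.languages
    stems[language] = stemmer.stem(word)

for language, stem in stems.items():
    print(f"{language}: {samples[language]} -> {stem}")
```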

In [17]:
snow_stemmer = SnowballStemmer(language='english', ignore_stopwords=False)

In [18]:
# Like the other stemmers, it treats a whole string as one word and does not stem the words inside it.

snow_stemmer.stem('eating eats eaten')

'eating eats eaten'

In [19]:
for x in text:
    print(x+"----->"+snow_stemmer.stem(x))

unacceptable----->unaccept
eating----->eat
EATS----->eat
eaten----->eaten
playing----->play
played----->play
history----->histori
very----->veri
orderly----->order
lovely----->love
writing----->write
writes----->write
programming----->program
programs----->program
fairly----->fair
sportingly----->sport
a----->a
an----->an


In [20]:
Snowball_Stemmer = [snow_stemmer.stem(x) for x in text]

print(Snowball_Stemmer)

len(Snowball_Stemmer)

['unaccept', 'eat', 'eat', 'eaten', 'play', 'play', 'histori', 'veri', 'order', 'love', 'write', 'write', 'program', 'program', 'fair', 'sport', 'a', 'an']


18
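
To see where Snowball departs from the original Porter Stemmer, the two result lists can be compared element by element. The sketch below hard-codes the word list and the outputs printed above so that it runs on its own; the variable names are illustrative.

```python
words = ['unacceptable', 'eating', 'EATS', 'eaten', 'playing', 'played', 'history', 'very', 'orderly',
         'lovely', 'writing', 'writes', 'programming', 'programs', 'fairly', 'sportingly', 'a', 'an']
porter_out = ['unaccept', 'eat', 'eat', 'eaten', 'play', 'play', 'histori', 'veri', 'orderli',
              'love', 'write', 'write', 'program', 'program', 'fairli', 'sportingli', 'a', 'an']
snowball_out = ['unaccept', 'eat', 'eat', 'eaten', 'play', 'play', 'histori', 'veri', 'order',
                'love', 'write', 'write', 'program', 'program', 'fair', 'sport', 'a', 'an']

# Keep only the words on which the two stemmers disagree.
differences = [(w, p, s) for w, p, s in zip(words, porter_out, snowball_out) if p != s]

for word, porter_stem, snowball_stem in differences:
    print(f"{word}: porter={porter_stem}, snowball={snowball_stem}")
# orderly: porter=orderli, snowball=order
# fairly: porter=fairli, snowball=fair
# sportingly: porter=sportingli, snowball=sport
```

All three differences involve the "-ly" ending, which Snowball strips and the Porter Stemmer rewrites to "-li".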

# Comparing the Results of All Stemmers

In [21]:
import pandas as pd 

df = pd.DataFrame({'Text': text, 'Porter_Stemmer': Porter_Stemmer, 'Snowball_Stemmer': Snowball_Stemmer,
                   'Lancaster_Stemmer': Lancaster_Stemmer, 'Regex_Stemmer': Regex_Stemmer})

df

Unnamed: 0,Text,Porter_Stemmer,Snowball_Stemmer,Lancaster_Stemmer,Regex_Stemmer
0,unacceptable,unaccept,unaccept,unacceiv,accept
1,eating,eat,eat,eat,eat
2,EATS,eat,eat,eat,EATS
3,eaten,eaten,eaten,eat,eaten
4,playing,play,play,play,play
5,played,play,play,play,play
6,history,histori,histori,hist,history
7,very,veri,veri,very,very
8,orderly,orderli,order,ord,order
9,lovely,love,love,lov,love
