# What Is Stemming?

Stemming is a natural language processing technique that reduces inflected words to their root forms (stems). It is a common text-normalization step when preprocessing words, sentences, and documents.

# Why Is Stemming Important?

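English has many variants of a single term, for example "connect", "connects", "connected", and "connecting". When these variants appear in a text corpus, an NLP or machine learning model treats each one as a separate token, which fills the vocabulary with redundant entries and can make the model less effective. Stemming collapses such variants into a single form.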
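As a rough illustration of this redundancy, the sketch below counts distinct tokens before and after stemming. It assumes NLTK is installed and uses the PorterStemmer class covered in the next section; the token list is a made-up sample, not data from a real corpus.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample tokens: six surface forms of the same term.
tokens = ["connect", "connects", "connected", "connecting", "connection", "connections"]

porter = PorterStemmer()
stems = [porter.stem(token) for token in tokens]

# Each distinct string is a separate vocabulary entry for a model.
raw_vocabulary = set(tokens)
stemmed_vocabulary = set(stems)

print("distinct tokens before stemming:", len(raw_vocabulary))
print("distinct tokens after stemming: ", len(stemmed_vocabulary))
print("stems:", sorted(stemmed_vocabulary))
```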

# Types of Stemmers in NLTK

## Porter Stemmer – PorterStemmer()

In [1]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
words = ['Connects','Connecting','Connections','Connected','Connection','Connectings','Connect']
for word in words:
    print(word,"--->",porter.stem(word))

Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect


## Snowball Stemmer – SnowballStemmer()

The Snowball stemmer for English, also known as the “English Stemmer” or “Porter2 Stemmer”, is a revised version of the original Porter algorithm. It is generally regarded as somewhat faster and more accurate than the original Porter Stemmer.

In [2]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",snowball.stem(word))

generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat
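
To make the difference from the original Porter algorithm concrete, the following sketch runs both stemmers over the same words. It assumes NLTK is installed; the dictionary names are illustrative only. The Porter results are deliberately not listed here, so run the snippet to compare them.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

words = ["generous", "generate", "generously", "generation"]

# Map each word to its stem under both algorithms.
porter_stems = {word: porter.stem(word) for word in words}
snowball_stems = {word: snowball.stem(word) for word in words}

for word in words:
    print(f"{word:12} porter={porter_stems[word]:10} snowball={snowball_stems[word]}")

# SnowballStemmer.languages lists the languages the class supports.
print(SnowballStemmer.languages)
```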


## Lancaster Stemmer – LancasterStemmer()

In [3]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
words = ['eating', 'eats', 'eaten', 'puts', 'putting']
for word in words:
    print(word, "---->", lancaster.stem(word))

eating ----> eat
eats ----> eat
eaten ----> eat
puts ----> put
putting ----> put


## Regexp Stemmer – RegexpStemmer()

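RegexpStemmer identifies morphological affixes using a regular expression and removes any substring that matches it. The min argument sets a minimum word length: shorter words are returned unchanged, which is why "was" and "bee" pass through untouched in the example.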

In [4]:
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass','was','bee','computer','advisable']
for word in words:
    print(word,"--->",regexp.stem(word))

mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis
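
The behaviour shown above can be approximated with Python's standard re module alone. The function below is a simplified sketch, not NLTK's actual source; the name regexp_stem and its parameters are made up for illustration.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Strip every substring matching `pattern`, unless the word is too short."""
    # Words shorter than min_length are returned unchanged.
    if len(word) < min_length:
        return word
    # Matching substrings are replaced with the empty string, i.e. discarded.
    return re.sub(pattern, "", word)


pattern = r"ing$|s$|e$|able$"
words = ["mass", "was", "bee", "computer", "advisable"]
results = {word: regexp_stem(word, pattern, min_length=4) for word in words}

for word, stem in results.items():
    print(word, "--->", stem)
```

With the same pattern and minimum length as the cell above, this reproduces the output shown there.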


## Porter vs Snowball vs Lancaster vs Regex Stemming in NLTK

In [5]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))

Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer                          
friend              friend              friend              friend                        friend                                  
friendship          friendship          friendship          friend                        friendship                              
friends             friend              friend              friend                        friend                                  
friendships         friendship          friendship          friend                        friendship                              
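
One way to summarize the table is to count how many distinct stems each stemmer produces for the same words; fewer distinct stems means more aggressive conflation. The sketch below assumes NLTK is installed, and the dictionary names are illustrative.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer(language="english"),
    "Lancaster": LancasterStemmer(),
    "Regexp": RegexpStemmer("ing$|s$|e$|able$", min=4),
}
word_list = ["friend", "friendship", "friends", "friendships"]

# For each stemmer, collect the set of distinct stems it produces.
distinct_stems = {
    name: {stemmer.stem(word) for word in word_list}
    for name, stemmer in stemmers.items()
}

for name, stems in distinct_stems.items():
    print(f"{name:10} {len(stems)} distinct: {sorted(stems)}")
```

For this word list, Lancaster maps all four words to a single stem, while the other three keep "friend" and "friendship" apart, matching the table above.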
