## Stemming and its types - Text Preprocessing using NLTK

### Stemming
Stemming is the process of reducing a word to its word stem by stripping affixes such as suffixes and prefixes. The resulting stem is not always a valid word, unlike a lemma, which is the dictionary form of a word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

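Example: Classification Problem

Suppose we want to classify whether a comment on a product is a positive or a negative review.
The reviews contain inflected forms such as eating, eat, eaten and going, gone, goes. Stemming aims to map these to the word stems eat and go, so the classifier sees fewer distinct features.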
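
To see why this helps a classifier, consider counting word features before and after stemming. The sketch below uses a deliberately naive, hypothetical `toy_stem` function that only strips a few suffixes. It is not NLTK's algorithm and is only meant to show how inflected forms collapse into one feature.

```python
from collections import Counter

def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least two characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

tokens = ["eating", "eats", "eat", "going", "goes", "go"]

raw_counts = Counter(tokens)
stemmed_counts = Counter(toy_stem(t) for t in tokens)

print(len(raw_counts))   # 6 distinct features before stemming
print(stemmed_counts)    # Counter({'eat': 3, 'go': 3})
```

Six distinct tokens become two features. Real stemmers such as the ones in the following sections use far more careful rules than this toy version.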

In [1]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer

In [2]:
from nltk.stem import PorterStemmer
stemming=PorterStemmer()
for word in words:
    print(f"{word}---->{stemming.stem(word)}")

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [4]:
# Stemming does not work well for every word; some are reduced to stems that are not real words
stemming.stem("Congratulations")

'congratul'

### RegexpStemmer class
NLTK provides the RegexpStemmer class, which makes it easy to implement stemming based on regular expressions. It takes a single regular expression and removes every part of the word that matches it; words shorter than the min argument are left unchanged. Let us see an example.

In [5]:
from nltk.stem import RegexpStemmer
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)
reg_stemmer.stem('eating')

'eat'

In [7]:
# The $ anchors the pattern to the end of the word, so only the trailing "ing" is removed
reg_stemmer.stem("ingeating")

'ingeat'
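
The behaviour above can be approximated with Python's standard `re` module. The sketch below assumes the stemmer simply deletes every regex match from words that are at least `min` characters long. `regexp_stem` and `min_length` are hypothetical names used for illustration, not part of NLTK.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` from `word`, skipping short words."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

anchored = r"ing$|s$|e$|able$"

print(regexp_stem("eating", anchored, 4))     # eat
print(regexp_stem("ingeating", anchored, 4))  # ingeat
print(regexp_stem("was", anchored, 4))        # was (shorter than min_length)
print(regexp_stem("ingeating", r"ing", 4))    # eat (no anchor, every "ing" is removed)
```

Dropping the `$` anchor changes the result for "ingeating", which is why the pattern in the cell above anchors each alternative to the end of the word.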

### Snowball Stemmer
The Snowball stemmer for English is also known as the Porter2 stemming algorithm. It is an improved version of the Porter stemmer that fixes some of its issues.

In [8]:
from nltk.stem import SnowballStemmer
snowball_stemmer=SnowballStemmer('english')

In [10]:
for word in words:
    print(f"{word}---->{snowball_stemmer.stem(word)}")

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [12]:
# With the Porter stemmer: wrong stems
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [13]:
# The Snowball stemmer gives better stems for the same words
snowball_stemmer.stem("fairly"),snowball_stemmer.stem("sportingly")

('fair', 'sport')

In [14]:
# The Snowball stemmer can also give an incorrect stem for some words
snowball_stemmer.stem("goes")

'goe'
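
To compare the two stemmers side by side on the same words, a small helper along these lines may be convenient. `compare_stemmers` is a hypothetical name introduced here. The sketch assumes NLTK is installed and uses only the `PorterStemmer` and `SnowballStemmer` classes already shown above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) tuples for each word."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

rows = compare_stemmers(["fairly", "sportingly", "goes"])

# Print a simple aligned table: word, Porter stem, Snowball stem.
for word, porter_stem, snowball_stem in rows:
    print(f"{word:<12}{porter_stem:<12}{snowball_stem}")
```

Running a list of domain-specific words through such a helper is a quick way to decide which stemmer suits a given dataset before building a pipeline around it.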