## Stemming
Stemming is the process of reducing a word to its stem, the base form that remains after affixes (suffixes and prefixes) are stripped. The stem need not be a valid dictionary word, which distinguishes it from a lemma. Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP).
- Stemming is used in text pre-processing.
- For use cases like chatbots, stemming is usually not a good fit, because its results are inconsistent: some words are stemmed well and others are not.

In [39]:
# Stemming can be used in a classification problem, e.g. to check whether a product review is positive or negative
# Word groups that may appear in reviews, and the stem each group should ideally share:
# [eating, eat, eaten] --> eat
# [going, gone, goes] --> go

words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

### PorterStemmer

In [40]:
from nltk.stem import PorterStemmer

stemming = PorterStemmer()

for word in words:
    print(word + " --> " + stemming.stem(word))

eating --> eat
eats --> eat
eaten --> eaten
writing --> write
writes --> write
programming --> program
programs --> program
history --> histori
finally --> final
finalized --> final
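
For a classification task, what matters is less whether a stem is a real word and more whether related word forms collapse onto the same token. The following is a minimal sketch of that idea, assuming the same word list as above; the `groups` dictionary is a name introduced here for illustration only.

```python
from nltk.stem import PorterStemmer

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

stemmer = PorterStemmer()

# Map each stem to the original words that collapse onto it.
groups = {}
for word in words:
    groups.setdefault(stemmer.stem(word), []).append(word)

for stem, members in groups.items():
    print(f"{stem:8} <- {members}")

# A smaller vocabulary means fewer features for a downstream classifier.
print(len(words), "distinct words ->", len(groups), "distinct stems")
```

Given the PorterStemmer output shown above, the ten words reduce to six stems. "eaten" stays in its own group, which illustrates that stemming does not merge every related form.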


In [41]:
# When stemming is applied, some words do not yield a meaningful stem (e.g. history --> histori). This is a major disadvantage of stemming.
# To avoid this problem and get a valid base form of a word, lemmatization is used.

In [42]:
# Further PorterStemmer examples: 'congratulations' yields a stem that is not a valid word, while 'sitting' is stemmed correctly

In [43]:
stemming.stem('congratulations')

'congratul'

In [44]:
stemming.stem("sitting")

'sit'

### RegexpStemmer class
NLTK provides the RegexpStemmer class, which implements a simple regular-expression stemmer. It takes a single regular expression and removes every part of the word that matches it, so the pattern itself decides whether prefixes, suffixes, or both are stripped. For example:

In [45]:
from nltk.stem import RegexpStemmer

reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

# with $, the pattern is matched only at the end of the word and, if present, removed
# without $, the pattern is matched anywhere in the word and every occurrence is removed
# with $ at the start of the pattern (e.g. '$ing'), nothing is removed, because the pattern can never match

In [46]:
# min is the minimum word length for stemming: words shorter than min are returned unchanged

In [47]:
reg_stemmer.stem('Eating')

'Eat'

In [48]:
reg_stemmer.stem('ingeating')

'ingeat'

In [49]:
reg_stemmer.stem('Houses')

'House'

In [50]:
reg_stemmer.stem("Acceptable")

'Accept'
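
The behaviour in the cells above can be approximated with the standard library alone. This is a rough sketch, assuming the stemmer simply deletes every regex match from words that are at least `min` characters long; `regexp_stem` and `min_length` are hypothetical names used for illustration, not part of NLTK.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Hypothetical stand-in for a regex-based stemmer."""
    # Words shorter than the minimum length are returned unchanged.
    if len(word) < min_length:
        return word
    # Every match of the pattern is deleted from the word.
    return re.sub(pattern, "", word)

PATTERN = r"ing$|s$|e$|able$"

for w in ["Eating", "ingeating", "Houses", "Acceptable", "is"]:
    print(w, "-->", regexp_stem(w, PATTERN, min_length=4))

# Without the $ anchor, "ing" is removed wherever it occurs.
print(regexp_stem("ingeating", r"ing"))
```

The `$` anchor is what keeps the leading "ing" in "ingeating" intact, and the length check is what leaves a short word such as "is" untouched even though it ends in "s".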

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes some of its known issues.

In [51]:
from nltk.stem import SnowballStemmer

# SnowballStemmer supports multiple languages
# on many words it produces cleaner stems than PorterStemmer
snowballstemmer = SnowballStemmer('english')

for word in words:
    print(word + " --> " + snowballstemmer.stem(word))

eating --> eat
eats --> eat
eaten --> eaten
writing --> write
writes --> write
programming --> program
programs --> program
history --> histori
finally --> final
finalized --> final


In [52]:
# Examples of a case where "snowball stemmer" performs better than "porter stemmer"

In [53]:
stemming.stem("fairly"), stemming.stem("sportingly")

('fairli', 'sportingli')

In [54]:
snowballstemmer.stem("fairly"), snowballstemmer.stem("sportingly")

('fair', 'sport')

In [55]:
snowballstemmer.stem('goes')

'goe'

In [56]:
stemming.stem('goes')

'goe'
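
To make such comparisons easier to repeat on other words, the two stemmers can be run side by side. The sketch below assumes only the NLTK classes already used in this section; `compare_stemmers` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) for each word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

for word, p, s in compare_stemmers(["fairly", "sportingly", "goes"]):
    # Flag the words on which the two algorithms disagree.
    note = "" if p == s else "  <-- differ"
    print(f"{word:12} porter={p:12} snowball={s}{note}")
```

As the cells above show, the two stemmers disagree on "fairly" and "sportingly" but both return "goe" for "goes", so Snowball improves on Porter without removing every stemming error.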