# `Stemming`

- Stemming is a text normalization process that reduces words to their <b>`root form`</b> or <b>`base stem`</b>. Unlike a lemma, a stem does not have to be a valid dictionary word.
- For example, `running` and `runs` are both reduced to `run`.
- The goal is to simplify and group similar words to improve text processing and analysis.

In [1]:
import nltk
words = ["Running","Runner","Ran","Eats","Eating","Eaten", "Happily",
         "Happiness","Happier","Studying","Student","Studies","Talked",
         "Talking","Talk"]

## `1. Porter Stemmer`

- Uses a set of predefined rules to <b>`remove suffixes`</b> from words.

In [2]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in words:
    print(f'{word} ----> {stemmer.stem(word)}')

Running ----> run
Runner ----> runner
Ran ----> ran
Eats ----> eat
Eating ----> eat
Eaten ----> eaten
Happily ----> happili
Happiness ----> happi
Happier ----> happier
Studying ----> studi
Student ----> student
Studies ----> studi
Talked ----> talk
Talking ----> talk
Talk ----> talk


- A major disadvantage of stemming is that the stem is not always a real word: <b>`Studying`</b> and <b>`Studies`</b> are both reduced to <b>`studi`</b>.
- The expected root is <b>`study`</b>, with a 'y'.
- The two words are still grouped under the same stem, but the stem itself is not a valid English word.

In [3]:
stemmer.stem('congratulate')

'congratul'

- <b>`congratul`</b> is not a meaningful English word.
- This is one of the major drawbacks of stemming.
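To see what this drawback does and does not affect, the sketch below groups the sample words by their stem. It reuses the stems printed in the Porter output above rather than recomputing them, and `group_by_stem` is a hypothetical helper name introduced only for this illustration.

```python
from collections import defaultdict

# Stems copied from the Porter output shown earlier (not recomputed here).
porter_stems = {
    "Running": "run", "Runner": "runner", "Ran": "ran",
    "Eats": "eat", "Eating": "eat", "Eaten": "eaten",
    "Happily": "happili", "Happiness": "happi", "Happier": "happier",
    "Studying": "studi", "Student": "student", "Studies": "studi",
    "Talked": "talk", "Talking": "talk", "Talk": "talk",
}

def group_by_stem(stems):
    """Collect the words that share a stem (hypothetical helper)."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return dict(groups)

groups = group_by_stem(porter_stems)
print(groups["studi"])  # ['Studying', 'Studies']
print(groups["talk"])   # ['Talked', 'Talking', 'Talk']
print(groups["run"])    # ['Running']
```

A stem such as `studi` still works as a grouping key even though it is not a word. The grouping also shows a second limitation: `Runner` and `Ran` do not end up in the same group as `Running`.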

## `2. RegexpStemmer`

- Uses regular expressions to <b>`match and remove suffixes or patterns`</b> from words.

In [4]:
from nltk.stem import RegexpStemmer
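# Remove a trailing 'ing', 's', 'e' or 'able', or a leading 'ing';
# words shorter than min=4 characters are returned unchanged.
regexstemmer = RegexpStemmer('ing$|s$|e$|able$|^ing', min=4)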
for word in words:
    print(f'{word} ----> {regexstemmer.stem(word)}')

Running ----> Runn
Runner ----> Runner
Ran ----> Ran
Eats ----> Eat
Eating ----> Eat
Eaten ----> Eaten
Happily ----> Happily
Happiness ----> Happines
Happier ----> Happier
Studying ----> Study
Student ----> Student
Studies ----> Studie
Talked ----> Talked
Talking ----> Talk
Talk ----> Talk
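The output above can be reproduced with the standard `re` module. This is a minimal sketch, not NLTK's implementation: it assumes the stemmer simply deletes every match of the pattern from any word that has at least `min` characters. `regex_stem` and `min_length` are hypothetical names.

```python
import re

def regex_stem(word, pattern, min_length=0):
    """Delete every match of `pattern` unless the word is shorter than `min_length`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

PATTERN = r"ing$|s$|e$|able$|^ing"

for word in ["Running", "Ran", "Studies", "Talked", "ingesting"]:
    print(f"{word} ----> {regex_stem(word, PATTERN, min_length=4)}")
# Running ----> Runn
# Ran ----> Ran
# Studies ----> Studie
# Talked ----> Talked
# ingesting ----> est

# 'Talked' is untouched because the pattern has no rule for 'ed'; adding one fixes it.
print(regex_stem("Talked", PATTERN + r"|ed$", min_length=4))  # Talk
```

The stemmer only knows the patterns it is given, so results such as `Runn` and `Studie` come from the chosen expression, not from any linguistic rule.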


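## `3. Snowball Stemmer`

- An improved version of the Porter stemmer (also known as <b>`Porter2`</b>) that supports several languages; the language is chosen when the stemmer is created.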

In [6]:
from nltk.stem import SnowballStemmer
snowstemmer = SnowballStemmer('english')
for word in words:
    print(f'{word} ---> {snowstemmer.stem(word)}')

Running ---> run
Runner ---> runner
Ran ---> ran
Eats ---> eat
Eating ---> eat
Eaten ---> eaten
Happily ---> happili
Happiness ---> happi
Happier ---> happier
Studying ---> studi
Student ---> student
Studies ---> studi
Talked ---> talk
Talking ---> talk
Talk ---> talk


### `-> Difference between Snowball and Porter Stemmers`

In [9]:
stemmer.stem('fairly'), stemmer.stem('sportingly')

('fairli', 'sportingli')

In [10]:
snowstemmer.stem('fairly'), snowstemmer.stem('sportingly')

('fair', 'sport')

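- The Snowball stemmer handles the `-ly` suffix better than the Porter stemmer here: it reduces `fairly` to `fair` and `sportingly` to `sport`, whereas Porter produces `fairli` and `sportingli`.
- On the sample word list above, both stemmers give identical results.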
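To find such differences without checking words one by one, the two stemmers can be run side by side. The sketch below uses the same `PorterStemmer` and `SnowballStemmer('english')` objects as the cells above; `differing_stems` is a hypothetical helper name, and the word list is only an illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

def differing_stems(words):
    """Return {word: (porter_stem, snowball_stem)} for words where the two disagree."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    result = {}
    for word in words:
        p, s = porter.stem(word), snowball.stem(word)
        if p != s:
            result[word] = (p, s)
    return result

diffs = differing_stems(["fairly", "sportingly", "running", "talked"])
for word, (p, s) in diffs.items():
    print(f"{word}: porter={p}, snowball={s}")
# fairly: porter=fairli, snowball=fair
# sportingly: porter=sportingli, snowball=sport
```

Words on which both stemmers agree, such as `running` and `talked`, are left out of the result.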