## Stemming 

Stemming is an important step in a natural language processing (NLP) pipeline. A stemmer takes tokenized words as input.
- Stemming reduces words to their stem (base) form.
- The drawback of stemming is that it produces an intermediate representation of the word, which may or may not be a meaningful word itself.
- We use stemming where the exact meaning of each word is not our concern.
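
To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy, not the Porter algorithm: the `SUFFIXES` list and the `toy_stem` helper are hypothetical names invented for this illustration, and real stemmers apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, ordered so longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ies", "ing", "er", "es", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("computer", "computations", "histories", "history"):
    print(w, "->", toy_stem(w))
# computer -> comput
# computations -> comput
# histories -> histor
# history -> history
```

The stems `comput` and `histor` are not dictionary words, which is the drawback described above. The toy rules also fail to map `history` and `histories` to the same stem, a gap that real stemmers close with additional rules.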


### Stemming algorithms 
- Porter’s Stemmer : One of the oldest and most widely used stemmers. It is relatively gentle and generally has a low error rate, but it only handles English words.
- Snowball Stemmer : An improved version of Porter’s Stemmer (its English variant is also known as Porter2). The Snowball Stemmer also supports several non-English languages and is generally faster.
- Lancaster Stemmer : More aggressive than the other two stemmers. It is fast, but its aggressive rules can reduce short words to stems that are hard to interpret, so its output is usually less useful than the Snowball Stemmer's.


#### Application : 
- Search engines 
- Domain vocabularies 
- Email spam classification etc.


## Lemmatization 

Lemmatization is similar to stemming, but it takes the context of a word into account and returns a valid dictionary form (the lemma), so words with the same meaning are linked to one word. Lemmatization is often preferred over stemming when meaning matters, because it performs a morphological (structural) analysis of the words.
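
As a rough illustration of how context changes the result, the sketch below looks words up by their part of speech. The `LEMMA_TABLE` entries and the `toy_lemmatize` helper are hypothetical and tiny. In practice a full lexical resource would be used, for example NLTK's `WordNetLemmatizer`, whose `lemmatize(word, pos=...)` method needs the WordNet data to be downloaded first.

```python
# Toy dictionary-based lemmatizer. The lookup table is hypothetical and
# deliberately tiny; a real lemmatizer relies on a full lexical database.
LEMMA_TABLE = {
    ("histories", "n"): "history",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("was", "v"): "be",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary form for (word, pos); fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("histories"))         # history
```

Unlike a stem, every value returned here is a real word, and the same surface form (`meeting`) maps to different lemmas depending on its part of speech.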


#### Application : 
- Chatbots 
- Question answering applications, etc.



## Stemming code

In [1]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [8]:
# Use case to show how each stemmer converts words : instantiate all three stemmers

from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

print('****Porter Stemmer****')

print(porter.stem('hobby'))
print(porter.stem('hobbies'))
print(porter.stem('computer'))
print(porter.stem('computations'))
print(porter.stem('history'))
print(porter.stem('histories'))

print('****Lancaster Stemmer****')

print(lancaster.stem('hobby'))
print(lancaster.stem('hobbies'))
print(lancaster.stem('computer'))
print(lancaster.stem('computations'))
print(lancaster.stem('history'))
print(lancaster.stem('histories'))

print('****Snowball Stemmer****')

print(snowball.stem('hobby'))
print(snowball.stem('hobbies'))
print(snowball.stem('computer'))
print(snowball.stem('computations'))
print(snowball.stem('history'))
print(snowball.stem('histories'))

****Porter Stemmer****
hobbi
hobbi
comput
comput
histori
histori
****Lancaster Stemmer****
hobby
hobby
comput
comput
hist
hist
****Snowball Stemmer****
hobbi
hobbi
comput
comput
histori
histori


In [11]:
# Problem solving :
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them.""" 
               

In [18]:
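# Convert the paragraph to lower case and split it into word tokens.
# word_tokenize needs the NLTK tokenizer data ('punkt', or 'punkt_tab' on newer NLTK releases).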
words = nltk.word_tokenize(paragraph.lower())

# check output for all stemmers :

for stemmer in (lancaster, porter, snowball):
    # Label each result with the stemmer's class name
    print(f"{type(stemmer).__name__} Looks like :")
    output = [stemmer.stem(w) for w in words]
    print(" ".join(output))
    print()

LancasterStemmer Looks like :
i hav three vis for ind . in 3000 year of our hist , peopl from al ov the world hav com and invad us , capt our land , conqu our mind . from alexand onward , the greek , the turk , the mog , the portugues , the brit , the french , the dutch , al of them cam and loot us , took ov what was our . yet we hav not don thi to any oth nat . we hav not conqu anyon . we hav not grab their land , their cult , their hist and tri to enforc our way of lif on them .

PorterStemmer Looks like :
i have three vision for india . in 3000 year of our histori , peopl from all over the world have come and invad us , captur our land , conquer our mind . from alexand onward , the greek , the turk , the mogul , the portugues , the british , the french , the dutch , all of them came and loot us , took over what wa our . yet we have not done thi to ani other nation . we have not conqu

In [25]:
# Next, clean the data further by removing stopwords and punctuation.

import string
from nltk.corpus import stopwords

stop_word = stopwords.words("english")
punch = string.punctuation


clean_list = []

# set() removes duplicate tokens but does not preserve their original order
for word in set(words):
    if word not in stop_word and word not in punch:
        clean_list.append(word)

for stemmer in (lancaster, porter, snowball):
    # Label each result with the stemmer's class name
    print(f"{type(stemmer).__name__} Looks like :")
    output = [stemmer.stem(w) for w in clean_list]
    print(" ".join(output))
    print()

LancasterStemmer Looks like :
us mind 3000 loot anyon land year french brit lif don mog tri grab invad capt ind cult peopl three alexand turk greek cam conqu took com portugues way onward nat vis world land dutch hist yet enforc

PorterStemmer Looks like :
us mind 3000 loot anyon land year french british life done mogul tri grab invad captur india cultur peopl three alexand turk greek came conquer took come portugues way onward nation vision world land dutch histori yet enforc

SnowballStemmer Looks like :
us mind 3000 loot anyon land year french british life done mogul tri grab invad captur india cultur peopl three alexand turk greek came conquer took come portugues way onward nation vision world land dutch histori yet enforc
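
Because `set(words)` does not keep tokens in their original order, the cleaned output above is shuffled relative to the paragraph. If order matters, one possible variant is sketched below. The `STOP_WORDS` set is a small hypothetical stand-in for `stopwords.words("english")`, and `clean_tokens` is a helper name introduced only for this example.

```python
import string

# Hypothetical mini stopword list; in the notebook this role is played by
# stopwords.words("english").
STOP_WORDS = {"we", "have", "not", "any", "other", "to", "this", "our", "the", "and"}


def clean_tokens(tokens):
    """Drop stopwords and punctuation, de-duplicate, and keep first-seen order."""
    kept = (t for t in tokens if t not in STOP_WORDS and t not in string.punctuation)
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(kept))


tokens = "yet we have not done this to any other nation . we have not conquered anyone .".split()
print(clean_tokens(tokens))
# ['yet', 'done', 'nation', 'conquered', 'anyone']
```

The resulting list can be passed to any of the stemmers in place of `clean_list`.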

