## Stemming 

Stemming is a common step in a natural language processing (NLP) pipeline. The stemmer takes tokenized words as input.
- Stemming reduces a word to its stem (base) form, usually by stripping suffixes.
- The drawback is that the stem is an intermediate representation that may not be a real, meaningful word (for example, the Porter stemmer turns "history" into "histori").
- Use stemming when the exact meaning of the word is not a concern.
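
To see why a stem may not be a real word, here is a minimal sketch of naive suffix stripping. It is a toy illustration only, not the Porter algorithm; the `SUFFIXES` rule list and the `toy_stem` function are hypothetical names made up for this example.

```python
# Naive suffix stripping: a toy illustration, NOT the Porter algorithm.
# The rule list below is hypothetical and far smaller than any real stemmer's.
SUFFIXES = ("ations", "ies", "ing", "er", "es", "s", "y")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("hobby", "hobbies", "computer", "computations"):
    print(w, "->", toy_stem(w))
```

Both "hobby" and "hobbies" collapse to the same string, which is what a search index needs, even though that string is not a dictionary word.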


### Stemming algorithms 
- Porter’s Stemmer algorithm: It is often credited with producing good output and a low error rate compared with other stemmers, but it only applies to English words.
- Snowball Stemmer: It is an improved version of Porter’s Stemmer. The Snowball Stemmer also supports several non-English languages and is reported to be faster.
- Lancaster Stemmer: It is more aggressive than the other two stemmers. It is fast, but its output can be confusing for short words, and overall it is generally considered less effective than the Snowball Stemmer.
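
The differences between the three stemmers are easier to judge side by side. The sketch below assumes NLTK is installed; the `stemmers` dictionary and the `compare_stems` helper are illustrative names, not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer, keyed by a short label (labels are arbitrary).
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

def compare_stems(word):
    """Return each stemmer's output for a single word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}

# Languages for which NLTK ships a Snowball stemmer.
print(SnowballStemmer.languages)

for w in ("history", "computations"):
    print(w, compare_stems(w))
```

On a word like "history" the more aggressive Lancaster stemmer cuts further than the other two.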


#### Applications
- Search engines
- Domain vocabularies
- Email spam classification, etc.


## Lemmatization 

Lemmatization converts a word to its root (dictionary) form, called the lemma. It is similar to stemming, but it takes the word's context, such as its part of speech, into account, so related forms of a word are linked to one meaningful word. Lemmatization is often preferred over stemming because it performs a morphological (structural) analysis of the word.
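
To make the role of context concrete, here is a minimal sketch of dictionary-based lemmatization keyed by part of speech. The `LEXICON` table and the `toy_lemmatize` function are hypothetical and hand-written for illustration; a real lemmatizer relies on a full lexical database rather than a five-entry table.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
# 'v' = verb, 'n' = noun, 'a' = adjective.
LEXICON = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("running", "n"): "running",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); fall back to the lower-cased word."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())

print(toy_lemmatize("running", pos="v"))  # treated as a verb
print(toy_lemmatize("running", pos="n"))  # treated as a noun
print(toy_lemmatize("Mice"))
```

The same surface form can map to different lemmas depending on the part of speech, which is the information a stemmer never looks at.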


#### Applications
- Chatbots
- Question-answering applications, etc.


#### When to use stemming and when to use lemmatization
- It depends entirely on the use case.
- If the real meaning of the word is not important, use stemming; otherwise use lemmatization.
- Stemming is faster than lemmatization.
 

## Stemming code

In [3]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [4]:
# Compare how each stemmer transforms the same words: create one instance of each stemmer

from nltk.stem import PorterStemmer,LancasterStemmer,SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

print('****Porter Stemmer****')

print(porter.stem('hobby'))
print(porter.stem('hobbies'))
print(porter.stem('computer'))
print(porter.stem('computations'))
print(porter.stem('history'))
print(porter.stem('histories'))

print('****Lancaster Stemmer****')

print(lancaster.stem('hobby'))
print(lancaster.stem('hobbies'))
print(lancaster.stem('computer'))
print(lancaster.stem('computations'))
print(lancaster.stem('history'))
print(lancaster.stem('histories'))

print('****Snowball Stemmer****')

print(snowball.stem('hobby'))
print(snowball.stem('hobbies'))
print(snowball.stem('computer'))
print(snowball.stem('computations'))
print(snowball.stem('history'))
print(snowball.stem('histories'))

****Porter Stemmer****
hobbi
hobbi
comput
comput
histori
histori
****Lancaster Stemmer****
hobby
hobby
comput
comput
hist
hist
****Snowball Stemmer****
hobbi
hobbi
comput
comput
histori
histori


In [5]:
# Worked example: a sample paragraph to stem
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them.""" 
               

In [8]:
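# Lower-case the paragraph and split it into word tokens
# (word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first)
words = nltk.word_tokenize(paragraph.lower())

# Stem every token with each stemmer and print the result: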

for stemmer in (lancaster,porter,snowball):
    print(f"{stemmer} Looks like :")
    output = [stemmer.stem(w) for w in words]
    print(" ".join(output))
    print()

<LancasterStemmer> Looks like :
i hav three vis for ind . in 3000 year of our hist , peopl from al ov the world hav com and invad us , capt our land , conqu our mind . from alexand onward , the greek , the turk , the mog , the portugues , the brit , the french , the dutch , al of them cam and loot us , took ov what was our . yet we hav not don thi to any oth nat . we hav not conqu anyon . we hav not grab their land , their cult , their hist and tri to enforc our way of lif on them .

<PorterStemmer> Looks like :
i have three vision for india . in 3000 year of our histori , peopl from all over the world have come and invad us , captur our land , conquer our mind . from alexand onward , the greek , the turk , the mogul , the portugues , the british , the french , the dutch , all of them came and loot us , took over what wa our . yet we have not done thi to ani other nation . we have not conquer anyon . we have not grab their land , their cultur , their histori and tri to enforc our way o

In [29]:
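# Next, clean the data further by removing stop words and punctuation.
# (Requires the NLTK 'stopwords' corpus: nltk.download('stopwords'))

import string
from nltk.corpus import stopwords

stop_word = stopwords.words("english")
punch = string.punctuation  # string of ASCII punctuation characters


# Keep each distinct token once; note that set() does not preserve word order.
clean_list = []

for word in set(words):
    if word not in stop_word and word not in punch:
        clean_list.append(word)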

for stemmer in (lancaster,porter,snowball):
    print(f"{stemmer} Looks like :")
    print()
    output = [stemmer.stem(w) for w in clean_list]
    print(" ".join(output))
    print()

<LancasterStemmer> Looks like :

three mog cult tri way nat 3000 cam capt com turk loot yet lif onward brit vis year portugues ind us don grab land land conqu invad anyon world dutch enforc mind peopl alexand french took greek hist

<PorterStemmer> Looks like :

three mogul cultur tri way nation 3000 came captur come turk loot yet life onward british vision year portugues india us done grab land land conquer invad anyon world dutch enforc mind peopl alexand french took greek histori

<nltk.stem.snowball.SnowballStemmer object at 0x00000198A8127F40> Looks like :

three mogul cultur tri way nation 3000 came captur come turk loot yet life onward british vision year portugues india us done grab land land conquer invad anyon world dutch enforc mind peopl alexand french took greek histori
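
Because the cell above iterates over `set(words)`, the cleaned tokens come out in an arbitrary order rather than in sentence order. If order matters, one possible alternative is to deduplicate with `dict.fromkeys`, as in this small sketch. The `tokens` list is hand-written sample data, not the notebook's `words` variable.

```python
# Hand-written sample tokens (an assumption for illustration).
tokens = ["we", "have", "not", "conquered", "anyone", ".",
          "we", "have", "not", "grabbed"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without shuffling the tokens.
unique_in_order = list(dict.fromkeys(tokens))
print(unique_in_order)
```

Whether to deduplicate at all depends on the task: frequency-based features need the repeated tokens, so this step would be skipped there.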



## Lemmatization code

In [7]:
# Download the WordNet corpus used by the lemmatizer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\monik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [11]:
# Import the lemmatizer class
from nltk.stem import WordNetLemmatizer

In [12]:
# Show how the lemmatizer transforms a few words:
lemma = WordNetLemmatizer()

print(lemma.lemmatize('running'))
print(lemma.lemmatize('runs'))
print(lemma.lemmatize('ran'))

# 'running' and 'ran' should both reduce to 'run', but they did not. Why?
# lemmatize() treats a word as a noun by default (pos='n'); pass pos='v' to lemmatize it as a verb.

running
run
ran


In [15]:
print(lemma.lemmatize('running',pos = 'v'))   
print(lemma.lemmatize('runs',pos = 'v'))
print(lemma.lemmatize('ran',pos = 'v'))
print()
print(lemma.lemmatize('see',pos = 'v'))
print(lemma.lemmatize('sees',pos = 'v'))
print(lemma.lemmatize('seeing',pos = 'v'))

# With pos='v', each word is reduced to its root form.

run
run
run

see
see
see


In [21]:
# Worked example: the same sample paragraph, this time lemmatized

paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them.""" 
               

In [28]:
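def cleaning_text_lemm(data):
    """Lower-case, tokenize, drop stop words and punctuation, then lemmatize."""
    clean_data = []
    for word in nltk.word_tokenize(data.lower()):
        if word not in stop_word and word not in punch:
            # Every token is lemmatized as a verb, so nouns such as 'visions' stay unchanged.
            clean_data.append(lemma.lemmatize(word, pos='v'))
    return " ".join(clean_data)

cleaning_text_lemm(paragraph)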

'three visions india 3000 years history people world come invade us capture land conquer mind alexander onwards greeks turks moguls portuguese british french dutch come loot us take yet do nation conquer anyone grab land culture history try enforce way life'
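
The result above still contains plural nouns such as 'visions' and 'years', because every token was lemmatized as a verb. A common workaround is to tag each token with its part of speech and convert the tag into the single-letter code the lemmatizer expects. The sketch below shows only the conversion step; `treebank_to_wordnet_pos` is a hypothetical helper, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def treebank_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'NNS', 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hypothetical (token, tag) pairs, written by hand for illustration;
# in practice they could come from a tagger such as nltk.pos_tag.
tagged = [("visions", "NNS"), ("invaded", "VBD"), ("looted", "VBD"), ("years", "NNS")]

for token, tag in tagged:
    print(token, tag, "->", treebank_to_wordnet_pos(tag))
```

The returned letter could then be passed as the `pos` argument of `lemma.lemmatize(token, pos=...)` instead of the fixed `'v'`.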

## The pipeline so far
Pipeline = text ---> lower case ---> word tokenization ---> stop-word removal ---> punctuation removal ---> stemming or lemmatization
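
The pipeline can be sketched as a single function with a pluggable final step. This is a dependency-free approximation under stated assumptions: a regular expression stands in for `nltk.word_tokenize`, `STOP_WORDS` is a tiny hand-picked subset rather than NLTK's English list, and `preprocess` and its `normalize` parameter are hypothetical names.

```python
import re
import string

# Tiny illustrative stop-word subset (an assumption, not NLTK's full list).
STOP_WORDS = {"i", "have", "for", "in", "of", "our", "the", "and", "us"}

def preprocess(text, normalize=lambda w: w, stop_words=STOP_WORDS):
    """text -> lower case -> tokens -> drop stop words and punctuation -> normalize."""
    # Simple regex tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    kept = [t for t in tokens
            if t not in stop_words and t not in string.punctuation]
    # 'normalize' is the stemming or lemmatization step.
    return [normalize(t) for t in kept]

print(preprocess("I have three visions for India."))
```

A stemmer or lemmatizer plugs in through `normalize`, for example `preprocess(text, normalize=porter.stem)` with the `porter` instance created earlier.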