<a href="https://colab.research.google.com/github/SatishDeshbhratar/NLPSelfLearning/blob/main/Day_2_Stemming_and_Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Day 2 - Stemming and Lemmatization



## Stemming Words Code Examples

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base most of the time; it often includes the removal of derivational affixes.

In [1]:
from nltk.stem import PorterStemmer, LancasterStemmer

# Create one instance of each stemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Stem a few related words with each stemmer
print("Porter Stemmer")
print("cats => ",porter.stem("cats"))
print("trouble => ",porter.stem("trouble"))
print("troubling =>", porter.stem("troubling"))
print("troubled => ",porter.stem("troubled"))
print("Lancaster Stemmer")
print("cats => ",lancaster.stem("cats"))
print("trouble => ",lancaster.stem("trouble"))
print("troubling =>",lancaster.stem("troubling"))
print("troubled => ",lancaster.stem("troubled"))

Porter Stemmer
cats =>  cat
trouble =>  troubl
troubling => troubl
troubled =>  troubl
Lancaster Stemmer
cats =>  cat
trouble =>  troubl
troubling => troubl
troubled =>  troubl
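
To make the "chops off the ends of words" idea concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is a toy and not the Porter or Lancaster algorithm: the suffix list, the `min_stem_len` parameter, and the name `toy_stem` are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer; NOT the Porter or Lancaster algorithm.
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["cats", "trouble", "troubling", "troubled"]:
    print(w, "=>", toy_stem(w))
```

Rules this crude over-stem easily: `toy_stem("news")` returns `new`, for instance. Real stemmers add many more rules and conditions to reduce such mistakes, but they remain heuristics.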


### Stemming a Complete Sentence

In [3]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize

def stemSentence(sentence):
    # Split the sentence into word tokens, stem each one with the
    # Porter stemmer created earlier, and rejoin them with spaces
    token_words = word_tokenize(sentence)
    stem_sentence = [porter.stem(word) for word in token_words]
    return " ".join(stem_sentence)

sentence="Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."
x=stemSentence(sentence)
print(x)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
python are veri intellig and work veri pythonli and now they are python their way to success . 


## Lemmatization

As described above, stemming is a heuristic that chops off word endings and gets it right only most of the time; the result need not be a real word (for example, troubl or veri).

Lemmatization usually refers to doing things properly, using a vocabulary and morphological analysis of words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma. If confronted with the token "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw" depending on whether the token was used as a verb or a noun.

For example, runs, running, and ran are all forms of the word run, so run is the lemma of all three. Because lemmatization returns an actual word of the language, it is used wherever valid words are required.

NLTK provides WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.
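
The idea of a vocabulary lookup that depends on part of speech can be sketched with a plain dictionary. This is only an illustration: the table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical, hand-written stand-ins and say nothing about how WordNet stores or searches its entries.

```python
# Toy lemma lookup keyed by (token, part of speech).
# The table is a hand-written stand-in for a real vocabulary such as WordNet.
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("runs", "v"): "run",
    ("running", "v"): "run",
    ("ran", "v"): "run",
}

def toy_lemmatize(token, pos):
    """Return the dictionary form for (token, pos), or the token itself if unknown."""
    return TOY_LEMMAS.get((token.lower(), pos), token)

print(toy_lemmatize("saw", "v"))  # the verb reading
print(toy_lemmatize("saw", "n"))  # the noun reading
print([toy_lemmatize(w, "v") for w in ["runs", "running", "ran"]])
```

The same token maps to different lemmas depending on the part of speech, which is why a lemmatizer needs that information to give the right answer.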


In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"

# Tokenize, then drop punctuation tokens. Build a new list instead of
# calling remove() on the list being iterated, which can skip elements.
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word)))


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Word                Lemma               
He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun         

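In the output above, most words are returned unchanged, and "was" and "has" are even mangled into "wa" and "ha". This happens because lemmatize treats every word as a noun unless told otherwise (its pos parameter defaults to "n"): the noun "hours" is correctly reduced to "hour", the verbs are left alone, and "was" and "has" lose their final s as if they were plural nouns.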

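You need to supply the context in which each word should be lemmatized, namely its part of speech (POS), by passing the pos parameter to wordnet_lemmatizer.lemmatize. The next cell passes pos="v", which treats every word as a verb: the verbs are reduced to their base forms, while the noun "hours" is now left unchanged.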


In [6]:
print("Lemmatization with POS Tagging ")
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    print ("{0:20}{1:20}".format(word,wordnet_lemmatizer.lemmatize(word, pos="v")))


Lemmatization with POS Tagging 
Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
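
Passing pos="v" for every word fixes the verbs but no longer handles nouns such as "hours". One common approach is to choose the POS per word from a tagger's output. The sketch below shows only the mapping step, from Penn Treebank tags to the single-letter codes that lemmatize accepts; the function name `treebank_to_wordnet_pos` and the hand-written tag list are assumptions for illustration, not tagger output.

```python
# Map Penn Treebank tags to the single-letter POS codes accepted by
# WordNetLemmatizer.lemmatize: "a" adjective, "v" verb, "r" adverb, "n" noun.
def treebank_to_wordnet_pos(tag, default="n"):
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("RB"):
        return "r"
    if tag.startswith("NN"):
        return "n"
    return default  # fall back to noun, the lemmatizer's own default

# Hand-written (word, tag) pairs standing in for real tagger output
tagged = [("He", "PRP"), ("was", "VBD"), ("running", "VBG"),
          ("long", "JJ"), ("hours", "NNS")]
for word, tag in tagged:
    print("{0:20}{1:20}".format(word, treebank_to_wordnet_pos(tag)))
```

In the notebook, the tags could come from nltk.pos_tag(sentence_words) (which needs its tagger model downloaded first), and each mapped code could then be passed as the pos argument of wordnet_lemmatizer.lemmatize.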


## Token Count

In [7]:
from nltk.tokenize import word_tokenize
# Import Counter
from collections import Counter

article = """If men wish to live, then they are forced to kill others. The entire struggle for survival is a
conquest of the means of existence, which in turn results in the elimination of others from these
same sources of subsistence. As long as there are peoples on this earth, there will be nations
against nations and they will be forced to protect their vital rights in the same way as the
individual is forced to protect his rights.
One is either the hammer or the anvil. We confess that it is our purpose to prepare the
German people again for the role of the hammer. We admit freely and openly that if our
movement is victorious, we will be concerned day and night with the question of how to produce
the armed forces, which are forbidden us by the peace treaty [Treaty of Versailles]. We solemnly
confess that we consider everyone a scoundrel who does not try day and night to figure out a way
to violate this treaty, for we have never recognized this treaty...We will take every step which
strengthens our arms, which augments the number of our forces, and which increases the strength
of our people. We confess further that we will dash anyone to pieces who should dare hinder us
in this undertaking...Our rights will be protected only when the German Reich [country] is again
supported by the point of the German dagger... """
# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))


[('the', 18), ('of', 10), ('to', 9), ('we', 9), (',', 8), ('.', 7), ('is', 6), ('will', 6), ('our', 6), ('which', 5)]
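
The counts above are dominated by punctuation and function words such as "the", "of", and "to". A minimal sketch of one way to filter them out before counting follows; the stop word list is a short hand-picked assumption, the sample tokens are hand-written, and `count_content_tokens` is a hypothetical helper name.

```python
from collections import Counter

# A short hand-picked stop word list, used here only for illustration
STOP_WORDS = {"the", "of", "to", "we", "is", "will", "our", "which", "and", "a"}

def count_content_tokens(tokens):
    """Count lowercase alphabetic tokens that are not in STOP_WORDS."""
    kept = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    return Counter(kept)

# Hand-written tokens standing in for word_tokenize output
sample_tokens = ["The", "cat", "saw", "the", "other", "cat", ".", ","]
print(count_content_tokens(sample_tokens).most_common(1))
```

In the notebook, the same function could be called on the tokens list produced by word_tokenize(article). A fuller stop word list would likely be needed for real text.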
