# <font color='#00d2d3'>  NLTK: <br> Stemming: Porter Stemmer, Lancaster Stemmer and Snowball <br> Lemmatization</font>

<font color='#e84393' > stem(token): strips affixes from the token and returns the stem. <br/> token (str): the token that should be stemmed. </font>

In [None]:
#import the nltk package
import nltk

# <font color='#00d2d3' > 1. Porter Stemmer Algorithm - “An algorithm for suffix stripping.” <br/> 

One of the most widely used stemming algorithms is the simple and efficient Porter algorithm, which is based on a series of simple cascaded rewrite rules. The algorithm contains rules like these: <br/>
<font color='#e84393' > ATIONAL → ATE (e.g., relational → relate)<br/>
ING → ε if the stem contains a vowel (e.g., motoring → motor)<br/>
SSES → SS (e.g., grasses → grass) <br> </font> 
Porter is a commonly used, relatively gentle stemmer, but its stems are often not real words (for example, 'bundles' becomes 'bundl'), so it is not the best choice when precision matters. 
For more information on the Porter Stemmer algorithm, see 
https://tartarus.org/martin/PorterStemmer/
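The three rewrite rules quoted above can be sketched in a few lines of plain Python. This is only a toy illustration, not NLTK's `PorterStemmer`: the names `TOY_RULES`, `contains_vowel` and `toy_suffix_strip` are made up for this example, and the real algorithm cascades many more rules over several steps.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    """Return True if the remaining stem has at least one vowel."""
    return any(ch in VOWELS for ch in stem)

# Hypothetical mini rule list: (suffix, replacement, condition on the stem).
TOY_RULES = [
    ("ational", "ate", lambda stem: True),   # ATIONAL -> ATE
    ("sses", "ss", lambda stem: True),       # SSES -> SS
    ("ing", "", contains_vowel),             # ING -> (nothing) if stem has a vowel
]

def toy_suffix_strip(word):
    """Apply the first rule whose suffix and condition both match."""
    word = word.lower()
    for suffix, replacement, condition in TOY_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    return word  # no rule applied

for w in ["relational", "motoring", "grasses", "sing"]:
    print(w, "=>", toy_suffix_strip(w))
```

Note that 'sing' is left untouched: removing 'ing' would leave the stem 's', which contains no vowel, so the condition on the rule blocks it.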

In [None]:
from nltk.stem import PorterStemmer

In [None]:
porter = PorterStemmer()

In [None]:
print(porter.stem('bundles'))

bundl


In [None]:
print(porter.stem('tokenization'))

token


In [None]:
print(porter.stem('Illustrator'))    #'Illustrator' is already a base-form noun, so stemming it does not help; we will address this later on

illustr


# <font color='#00d2d3' > 2. Lancaster - A word stemmer based on the Lancaster stemming algorithm <br/> 

<font color='#00d2d3' > 
An iterative, rule-based algorithm. About 120 rules are indexed by the last letter of a suffix. <br/>
On each iteration, it looks for an applicable rule using the last character of the word. Each rule specifies either a deletion or a replacement of an ending. <br/>
If there is no such rule, it terminates. It also terminates if the word starts with a vowel and only two letters are left, or if the word starts with a consonant and only three characters are left. Otherwise, the rule is applied and the process repeats. <br/>
It is an aggressive algorithm that can give low precision. </font>
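The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not NLTK's `LancasterStemmer`: the rule table `TOY_RULES` is a hypothetical handful of rules (the real table has about 120), the function names are invented for this example, and the outputs will not match NLTK's in general.

```python
VOWELS = set("aeiou")

# Hypothetical mini rule table, keyed by the last letter of the word.
# Each rule is (ending, replacement); an empty replacement means deletion.
TOY_RULES = {
    "s": [("ies", "y"), ("s", "")],
    "n": [("ation", ""), ("ion", "")],
    "g": [("ing", "")],
}

def is_acceptable(stem):
    """Minimum length check: 2 letters if it starts with a vowel, else 3."""
    if not stem:
        return False
    min_len = 2 if stem[0] in VOWELS else 3
    return len(stem) >= min_len

def toy_iterative_stem(word):
    """Keep applying rules until none fires or the stem would get too short."""
    word = word.lower()
    while word:
        for ending, replacement in TOY_RULES.get(word[-1], []):
            if word.endswith(ending):
                candidate = word[: len(word) - len(ending)] + replacement
                if is_acceptable(candidate):
                    word = candidate
                    break  # a rule fired: start a new iteration
        else:
            break  # no rule applied: stop
    return word

print(toy_iterative_stem("bunnies"))    # bunny  (replacement rule: ies -> y)
print(toy_iterative_stem("motorings"))  # motor  (two iterations: s, then ing)
print(toy_iterative_stem("sing"))       # sing   (stem 's' would be too short)
```

Because the loop repeats until no rule applies, a single word can lose several endings in a row, which is where the aggressiveness comes from.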

In [None]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

In [None]:
print(lancaster.stem('organization'))

org


In [None]:
print(porter.stem('organization'))

organ


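This shows that Lancaster is more aggressive than Porter: it cuts 'organization' down to 'org', whereas Porter stops at 'organ'.
Lancaster rules do not only delete suffixes; some replace an ending so that the stem remains a real word, as the next example shows.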

In [None]:
print(lancaster.stem('Bunnies'))

bunny


In [None]:
print(porter.stem('Bunnies'))

bunni


In [None]:
print(lancaster.stem('Illustrator'))

illust


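# <font color='#00d2d3' > 3. Snowball - A modified version of the Porter Stemmer </font> <br/> 

<font color='#00d2d3' > Snowball (also known as Porter2) was developed by Martin Porter, the author of the original Porter Stemmer, as a refinement of that algorithm. </font> <br/> 

On small examples like the ones below we might not see much difference, but it is generally considered more precise than the Porter Stemmer, which matters more on large datasets. 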

In [None]:
from nltk.stem import SnowballStemmer
snow = SnowballStemmer('english')             #can stem for multiple languages

In [None]:
print(snow.stem('Organization'))
print(snow.stem('Digitize'))
print(snow.stem('favourable'))
print(snow.stem('limitation'))
print(snow.stem('Illustrator'))

organ
digit
favour
limit
illustr


# Comparing the three methods

In [None]:
bundle = ['bunnies','organization','polarize','jaguar','stabalize','destabilize','democratic','kingdoms','dramatic','favourable']
print(bundle)

['bunnies', 'organization', 'polarize', 'jaguar', 'stabalize', 'destabilize', 'democratic', 'kingdoms', 'dramatic', 'favourable']


In [None]:
#structure of columns
#20 is the padding
print("{0:20}{1:20}{2:20}{3:20}".format("Word","Porter","Lancaster","Snowball"))   

for word in bundle:
  print("{0:20}{1:20}{2:20}{3:20}".format(word,porter.stem(word),lancaster.stem(word),snow.stem(word)))

Word                Porter              Lancaster           Snowball            
bunnies             bunni               bunny               bunni               
organization        organ               org                 organ               
polarize            polar               pol                 polar               
jaguar              jaguar              jagu                jaguar              
stabalize           stabal              stab                stabal              
destabilize         destabil            dest                destabil            
democratic          democrat            democr              democrat            
kingdoms            kingdom             kingdom             kingdom             
dramatic            dramat              dram                dramat              
favourable          favour              favo                favour              


We can see that: <br>
  Lancaster is the most aggressive of the three. <br>
  It keeps iterating over its rules and can strip so much that the result no longer makes linguistic sense (for example, 'destabilize' becomes 'dest').<br>
  This is called overstemming.

Thus, Porter and Snowball are more popular, as the stems they produce make much more sense.
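One rough way to spot overstemming is to group words by their stem and look at which distinct words collapse together. The sketch below is an illustration only: `group_by_stem` and `crude_stem` are hypothetical helpers written for this example, and `crude_stem` (which just keeps the first four letters) stands in for an overly aggressive stemmer.

```python
from collections import defaultdict

def group_by_stem(words, stem):
    """Return {stem: [words]} for every stem shared by more than one word."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return {s: ws for s, ws in groups.items() if len(ws) > 1}

def crude_stem(word):
    """Hypothetical, deliberately aggressive stemmer: keep the first 4 letters."""
    return word[:4]

words = ["organ", "organization", "organic", "kingdom"]
print(group_by_stem(words, crude_stem))
# {'orga': ['organ', 'organization', 'organic']}
```

Any callable that maps a word to a stem can be passed as the second argument, so the same helper could be tried with `porter.stem`, `lancaster.stem` or `snow.stem` from the cells above.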

# <font color='#fd79a8'>  Lemmatization  

### Libraries used : NLTK's WordNet, spaCy, TextBlob, Pattern & Stanford CoreNLP

In [None]:
import nltk   #library
from nltk.stem import WordNetLemmatizer    #lemmatizer class
nltk.download('wordnet')       #WordNet is a lexical database of nouns, verbs, adjectives and adverbs
nltk.download('omw-1.4')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

[Wordnet Resource Search](http://wordnetweb.princeton.edu/perl/webwn)

In [None]:
#creating an object of WordNetLemmatizer class.
lem = WordNetLemmatizer()
token = "stockings"

result_lemma = lem.lemmatize(token)
print(token," => ", result_lemma)

stockings  =>  stocking


In [None]:
porter.stem("stockings")        #the stemmer over-strips: 'stock' is not the base form of 'stockings'

'stock'

The lemmatizer looks the word up in WordNet and, since no part of speech is given, returns its noun lemma.

### Lemmatizing an entire sentence

In [None]:
string1 = "The girls sang louder. The bankers banked at other banks." 
string2 = "These were better shoes for her feet. The grocer was stocking the shelves at the grocery" 


Tokenizing sentences

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
tokens = nltk.word_tokenize(string1)
print("Tokenized Sentence : ")
print(tokens)

#Lemmatize the tokenized sentences
lemmatized_tokens = ' '.join(lem.lemmatize(w) for w in tokens)
print("\nLemmatized Sentence :\n",lemmatized_tokens)


Tokenized Sentence : 
['The', 'girls', 'sang', 'louder', '.', 'The', 'bankers', 'banked', 'at', 'other', 'banks', '.']
Lemmatized Sentence :
 The girl sang louder . The banker banked at other bank .


In [None]:
#lemmatizing without tokenizing: iterating over a string yields single characters
lemmatized_tokens = ' '.join(lem.lemmatize(w) for w in string1)
print("\nLemmatized Sentence :\n",lemmatized_tokens)


Lemmatized Sentence :
 T h e   g i r l s   s a n g   l o u d e r .   T h e   b a n k e r s   b a n k e d   a t   o t h e r   b a n k s .


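Iterating over the string directly passes one character at a time to the lemmatizer, so nothing is actually lemmatized. The results are much better when we pass tokens instead of the raw sentence.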
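The spaced-out output above comes from how Python iterates over strings, not from the lemmatizer itself. A minimal sketch (the variable names are just for illustration):

```python
text = "The girls sang."

# A for-loop or generator over a string visits one character at a time.
chars = [c for c in text]
print(chars[:5])

# Splitting first gives whole words, which is what a lemmatizer expects.
words = text.split()
print(words)
```

So `lem.lemmatize(w) for w in string1` calls the lemmatizer once per character, and `' '.join(...)` then puts a space between every character.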

In [None]:
lemmatized_tokens = ' '.join(lem.lemmatize(word) for word in string1.split())
print("\nLemmatized Sentence :\n",lemmatized_tokens)


Lemmatized Sentence :
 The girl sang louder. The banker banked at other banks.


The split method also gives us tokens, but note that it keeps punctuation attached to words ('banks.' is left unchanged above) and does not separate clitics such as n't and 'd, which we would want.

In [None]:
sentence = "Split method also gives us tokens but we know that it doesn't give us expanded clitics which we'd want"

lemmatized_tokens = ' '.join(lem.lemmatize(word) for word in sentence.split())
print("\nLemmatized Sentence :\n",lemmatized_tokens)


Lemmatized Sentence :
 Split method also give u token but we know that it doesn't give u expanded clitics which we'd want


Thus, lemmatization should be done after tokenization.
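To make the difference between splitting and tokenizing concrete, here is a small regex-based tokenizer. It is only a rough stand-in for `word_tokenize`, written for this example: `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names, and the pattern handles just the `n't` and apostrophe clitics plus punctuation.

```python
import re

# Order matters: try "word before n't", then "n't", then clitics like 'd,
# then plain words, then any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into words, clitics and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

sentence = "The banks. It doesn't, we'd"
print(sentence.split())
# ['The', 'banks.', 'It', "doesn't,", "we'd"]
print(simple_tokenize(sentence))
# ['The', 'banks', '.', 'It', 'does', "n't", ',', 'we', "'d"]
```

With the punctuation and clitics separated, a token such as 'banks' can be looked up on its own, which `split()` prevents by leaving it as 'banks.'.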

In [None]:
tokens = nltk.word_tokenize(string1)
print("Tokenized Sentence : ")
print(tokens)

#Lemmatize the tokenized sentences
lemmatized_tokens = ' '.join(lem.lemmatize(w) for w in tokens)
print("\nLemmatized Sentence :\n",lemmatized_tokens)

Tokenized Sentence : 
['The', 'girls', 'sang', 'louder', '.', 'The', 'bankers', 'banked', 'at', 'other', 'banks', '.']

Lemmatized Sentence :
 The girl sang louder . The banker banked at other bank .


Problem :<br>
'sang' is a verb, so its lemma should be 'sing'.
<br>
Also, 'banked' is a verb whose lemma is 'bank'. 

Thus, without part-of-speech information, the lemmatizer handles verbs poorly.

In [None]:
tokens2 = nltk.word_tokenize(string2)
print("Tokenized Sentence : ")
print(tokens2)

#Lemmatize the tokenized sentences
lemmatized_tokens2 = ' '.join(lem.lemmatize(w) for w in tokens2)
print("\nLemmatized Sentence :\n",lemmatized_tokens2)

Tokenized Sentence : 
['These', 'were', 'better', 'shoes', 'for', 'her', 'feet', '.', 'The', 'grocer', 'was', 'stocking', 'the', 'shelves', 'at', 'the', 'grocery']

Lemmatized Sentence :
 These were better shoe for her foot . The grocer wa stocking the shelf at the grocery


Problem:<br>
were => were, correct lemma: be (verb)<br>
was => wa, correct lemma: be (verb)<br>
stocking => stocking, correct lemma: stock (verb)

<br>
Expected output:
<br>These be better shoe for her foot.
<br>The grocer be stock the shelf at the grocery.

The lemmatizer has to take the part of speech into account before lemmatizing.

<font color='#fd79a8'> A second argument, pos, must be passed to lemmatize() to specify the part of speech of each token, for example lem.lemmatize('sang', pos='v') for a verb. </font>
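A minimal sketch of how part-of-speech tags could be wired into lemmatization. WordNet uses the single-letter codes 'n', 'v', 'a' and 'r', while taggers such as `nltk.pos_tag` return Penn Treebank tags, so a small mapping is needed. The helper names `penn_to_wordnet` and `lemmatize_tagged` are invented for this example, and `toy_lemmatize` with its tiny lookup table is a hypothetical stand-in so the cell runs without any downloads.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, Penn tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Hypothetical stand-in for a real lemmatizer, for illustration only.
TOY_LEMMAS = {("was", "v"): "be", ("stocking", "v"): "stock", ("shelves", "n"): "shelf"}

def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)

tagged = [("The", "DT"), ("grocer", "NN"), ("was", "VBD"),
          ("stocking", "VBG"), ("the", "DT"), ("shelves", "NNS")]
print(lemmatize_tagged(tagged, toy_lemmatize))
# ['The', 'grocer', 'be', 'stock', 'the', 'shelf']
```

With NLTK, `lem.lemmatize` could be passed in place of `toy_lemmatize`, and the tagged pairs could come from `nltk.pos_tag(tokens)`, assuming the tagger data has been downloaded.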