<center><u><H1>Stemming and Lemmatization</H1></u></center><br>
Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing. They are used to prepare text, words, and documents for further processing.<br>


## Stemming:
Stemming reduces inflected words to a common root form (the stem). A group of related words is mapped to the same stem, even if that stem is not itself a valid word in the language.
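To make the idea concrete, here is a minimal sketch of suffix stripping. It is a toy, not the Porter, Lancaster, or Snowball algorithm: the `SUFFIXES` rule list, the `toy_stem` name, and the `min_stem_len` parameter are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only; real stemmers use far richer rules).
SUFFIXES = ("ing", "ed", "es", "e", "s")  # hypothetical rule list, longest first


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["argue", "argued", "argues", "arguing"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['argu', 'argu', 'argu', 'argu']
```

All four forms collapse to `argu`, which is not an English word. That is acceptable for stemming, since the stem only needs to be shared by the related forms.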

In [1]:
import nltk

In [2]:
from nltk import PorterStemmer, LancasterStemmer, SnowballStemmer

In [4]:
def words_stemmer(words, type="PorterStemmer", lang="english", encoding="utf8"):
    # Map each supported stemmer name to a callable that builds it.
    # The parameter name `type` shadows the built-in; it is kept for compatibility.
    stemmers = {
        "PorterStemmer": PorterStemmer,
        "LancasterStemmer": LancasterStemmer,
        "SnowballStemmer": lambda: SnowballStemmer(lang),
    }

    # Unknown (or False) stemmer name: return the input tokens unchanged.
    if type not in stemmers:
        return words

    stemmer = stemmers[type]()
    # Stem each token, encode it, and join the results into one bytes string.
    return b" ".join(stemmer.stem(w).encode(encoding) for w in words)

In [5]:
words = "caring cares carefully cared"

In [7]:
wt = nltk.word_tokenize(words)
wt

['caring', 'cares', 'carefully', 'cared']

In [8]:
print("Original:", words)
print("Porter: ", words_stemmer(wt, "PorterStemmer"))
print("Lancaster: ", words_stemmer(wt, "LancasterStemmer"))
print("Snowball: ", words_stemmer(wt, "SnowballStemmer"))

Original: caring cares carefully cared
Porter:  b'care care care care'
Lancaster:  b'car car car car'
Snowball:  b'care care care care'
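The `b'...'` prefix in the output appears because `words_stemmer` encodes each stem and returns a bytes object. A minimal sketch of turning such a result back into text follows; it uses a literal copied from the Porter output rather than a live call, and the variable names are only illustrative.

```python
# Bytes value copied from the Porter output above (not produced by a live call here).
stemmed = b"care care care care"

# Decode with the same encoding that was used to encode the stems ("utf8" by default).
text = stemmed.decode("utf8")

# Split on whitespace to recover the individual stems as ordinary strings.
tokens = text.split()
print(text)
print(tokens)
```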


## Lemmatization:
Lemmatization, unlike stemming, reduces inflected words to a root that belongs to the language. This root is called the lemma: the canonical, or dictionary, form of a set of words (plural: lemmas or lemmata).<br>
Lemmatization relies on morphological analysis of each word, so it needs detailed dictionaries that the algorithm can search to link an inflected form back to its lemma.
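The dictionary lookup can be sketched with a tiny hand-written table. This is only an illustration of the principle: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, and a real lemmatizer such as the WordNet one consults a far larger lexicon plus morphological rules.

```python
# Hypothetical miniature lemma dictionary keyed by (word form, part of speech).
LEMMA_TABLE = {
    ("cares", "n"): "care",
    ("caring", "v"): "care",
    ("cared", "v"): "care",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("mice"))            # mouse
print(toy_lemmatize("better", "a"))     # good
print(toy_lemmatize("better"))          # better (no noun entry, so unchanged)
print(toy_lemmatize("carefully", "r"))  # carefully
```

The two calls on `better` show why the part of speech matters: the same surface form can map to different lemmas, or to none, depending on how it is used.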

In [24]:
from nltk.stem import WordNetLemmatizer

In [25]:
wlem = WordNetLemmatizer()

In [26]:
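# Tokenize a text string, lemmatize each token using its part of speech,
# and return the lemmas joined into one bytes string.
# (find_pos is defined in a later cell and must be run before calling this.)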
def words_lemmatizer(text, encoding="utf8"):
    words = nltk.word_tokenize(text)
    lemma_words = []
    wl = WordNetLemmatizer()
    
    for w in words:
        pos = find_pos(w)
        lemma_words.append(wl.lemmatize(w, pos).encode(encoding))
    return b" ".join(lemma_words)    

In [27]:
# WordNet part-of-speech codes:
# n    NOUN
# v    VERB
# a    ADJECTIVE
# s    ADJECTIVE SATELLITE
# r    ADVERB

In [28]:
def find_pos(word):
    # Tag the word on its own (no sentence context) and keep the Penn Treebank tag.
    tag = nltk.pos_tag(nltk.word_tokenize(word))[0][1]
    # Adjective tags: "JJ", "JJR", "JJS"
    if tag.startswith('J'):
        return 'a'
    # Adverb tags: "RB", "RBR", "RBS" (the particle tag "RP" also starts with R)
    elif tag.startswith('R'):
        return 'r'
    # Verb tags: "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"
    elif tag.startswith('V'):
        return 'v'
    # Everything else, including noun tags "NN", "NNS", "NNP", "NNPS", is treated as a noun
    else:
        return 'n'
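The tag mapping can also be separated from the tagger so that it is easy to check on its own. The sketch below assumes Penn Treebank tags as input; `penn_to_wordnet` is a hypothetical helper name. It matches the full `RB` prefix for adverbs, because a check on the first letter alone would also classify the particle tag `RP` as an adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'r', 'v' or 'n')."""
    if tag.startswith("J"):    # JJ, JJR, JJS
        return "a"
    if tag.startswith("RB"):   # RB, RBR, RBS (excludes RP, the particle tag)
        return "r"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    return "n"                 # nouns and every other tag default to noun


for tag in ["JJR", "RBS", "RP", "VBG", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Falling back to `'n'` for unrecognized tags is a design choice for this sketch; it keeps the function total so that every token can still be passed to a lemmatizer.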

In [30]:
print("Lemmatized: ", words_lemmatizer(words))

Lemmatized:  b'care care carefully care'


### Getting synonyms and antonyms for a given word with WordNet

In [31]:
# WordNet is a large lexical database of English words that are linked together
# by their semantic relationships.
# It groups words into sets of synonyms (synsets) based on their meanings.

In [32]:
from nltk.corpus import wordnet

In [33]:
s = wordnet.synsets("suitable")
print("Definition: ", s[0].definition())
print("Example: ", s[0].examples())

Definition:  meant or adapted for an occasion or use
Example:  ['a tractor suitable (or fit) for heavy duty', 'not an appropriate (or fit) time for flippancy']


In [34]:
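synonyms = []
antonyms = []
# Walk every synset (sense) of "better" and collect its lemma names as synonyms.
for synset in wordnet.synsets("better"):
    for lemma in synset.lemmas():
        synonyms.append(lemma.name())
        # Keep only the first antonym listed for this lemma, if any.
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

# Convert to sets to drop duplicates before printing.
print("synonyms: \n", set(synonyms))
print("antonyms: \n", set(antonyms))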

synonyms: 
 {'proficient', 'skilful', 'full', 'in_effect', 'estimable', 'honorable', 'meliorate', 'effective', 'improve', 'bettor', 'just', 'practiced', 'beneficial', 'ripe', 'sound', 'undecomposed', 'wagerer', 'punter', 'amend', 'dear', 'right', 'better', 'well', 'respectable', 'intimately', 'best', 'safe', 'easily', 'comfortably', 'near', 'honest', 'upright', 'advantageously', 'skillful', 'salutary', 'secure', 'ameliorate', 'adept', 'expert', 'serious', 'unspoilt', 'unspoiled', 'in_force', 'considerably', 'dependable', 'break', 'good', 'substantially'}
antonyms: 
 {'ill', 'worsen', 'worse', 'evil', 'badly', 'disadvantageously', 'bad'}


## References:

http://www.nltk.org/api/nltk.stem.html

http://en.wikipedia.org/wiki/Stemming

https://wordnet.princeton.edu/wordnet/man/wndb.5WN.html#sect3

http://www.nltk.org/howto/wordnet.html