## Python – Lemmatization Approaches with Examples

The following is a step-by-step guide to several lemmatization approaches in Python, with examples and code for each. If you are new to the topic, work through the sections in order; otherwise, feel free to jump straight to the approach you need.

### What is Lemmatization? 

Lemmatization goes further than stemming. Rather than simply chopping off word endings, it draws on a language's full vocabulary and applies a **morphological analysis** to each word, aiming to remove inflectional endings only and to return the base or dictionary form of the word, which is known as the **lemma**.
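To make the contrast with stemming concrete, here is a minimal sketch. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written only for illustration, and the small `LEMMA_TABLE` stands in for a real vocabulary; an actual lemmatizer consults a full dictionary together with morphological rules.

```python
# Toy illustration only: neither function is a real stemmer or lemmatizer.

# Hypothetical mini-vocabulary mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "feet": "foot",
    "geese": "goose",
    "babies": "baby",
    "studies": "study",
}

def toy_stem(word):
    """Crude stemming: chop a known suffix without checking the result is a word."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup: return the recorded lemma, or the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["babies", "feet", "geese"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix chopper turns 'babies' into the non-word 'bab' and cannot handle irregular plurals such as 'feet', while the lookup returns 'baby' and 'foot'.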

**TIP: Converting your text to lowercase before lemmatizing is usually a good idea, since dictionary-based lemmatizers typically match lowercase forms.**
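Lowercasing can be folded into a small normalization step that runs before the tokens reach a lemmatizer. The sketch below uses only the standard library; `normalize` is a hypothetical helper, and its regex tokenization is a simplification that drops digits, punctuation and apostrophes.

```python
import re

def normalize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

print(normalize("The Cats were Sitting on the Mat."))
# ['the', 'cats', 'were', 'sitting', 'on', 'the', 'mat']
```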

**Various Approaches to Lemmatization:** 

We will go over **9 different approaches** to lemmatization, each with examples and a code implementation.

1. WordNet
2. WordNet (with POS tag)
3. TextBlob
4. TextBlob (with POS tag)
5. spaCy
6. TreeTagger
7. Pattern
8. Gensim
9. Stanford CoreNLP

### 1. Wordnet Lemmatizer

Wordnet is a publicly available lexical database that provides semantic relationships between words. It was originally built for English, and wordnets for many other languages have since been made available. It is one of the earliest and most commonly used lemmatization techniques.

- It is available through the nltk library in Python.
- Wordnet links words through semantic relations (e.g. synonyms).
- It groups synonyms in the form of synsets.
    - synset: a group of data elements that are semantically equivalent.


**How to use:**

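1. Install the nltk package: in your Anaconda prompt or terminal, type:

**pip install nltk**

2. Download the required nltk data: in your Python console, run the following. The 'punkt' tokenizer data is needed by nltk.word_tokenize, which the sentence examples use. Recent nltk releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead; the error message names the resource to download.

**import nltk**

**nltk.download('wordnet')**

**nltk.download('punkt')**

**nltk.download('averaged_perceptron_tagger')**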

**Code**

In [1]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Create WordNetLemmatizer object
obj = WordNetLemmatizer()

In [6]:
# single word lemmatization examples
list1 = ['kites', 'babies', 'dogs', 'flying', 'smiling', 
         'driving', 'died', 'tried', 'feet']

for words in list1:
    print(words + "  -----> " +obj.lemmatize(words))

kites  -----> kite
babies  -----> baby
dogs  -----> dog
flying  -----> flying
smiling  -----> smiling
driving  -----> driving
died  -----> died
tried  -----> tried
feet  -----> foot


**Sentence lemmatization examples**

In [7]:
# sentence lemmatization examples
string = 'the cat is sitting with the bats on the striped mat under many flying geese'

# Converting the string into tokens
# (word_tokenize needs the 'punkt' data: nltk.download('punkt'))
list2 = nltk.word_tokenize(string)
print(list2)
#> ['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on',
# 'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']

lemmatized_string = ' '.join([obj.lemmatize(words) for words in list2])

print(lemmatized_string) 
#> the cat is sitting with the bat on the striped mat under many flying goose


['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on', 'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']
the cat is sitting with the bat on the striped mat under many flying goose


### 2. Wordnet Lemmatizer (with POS tag) 

In the above approach, the Wordnet results were not up to the mark: words like 'sitting' and 'flying' remained unchanged after lemmatization. This is because lemmatize() treats every word as a noun unless it is told otherwise. To overcome this, we use POS (Part of Speech) tags: each word is passed to the lemmatizer along with a tag that defines its type (verb, noun, adjective, etc.).

**For Example**

| Word     | POS Tag (Type) | Lemmatized Word |
|----------|----------------|-----------------|
| driving  | verb ('v')      | drive           |
| dogs     | noun ('n')      | dog             |
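The table can be read as a lookup keyed by both the word and its POS tag. The sketch below mimics that idea with a hand-written dictionary; `TOY_LEMMAS` and `toy_pos_lemmatize` are hypothetical names used only for illustration and are not part of nltk. Like Wordnet's lemmatizer, the helper assumes a noun when no tag is given.

```python
# Toy illustration: a hand-written table, not Wordnet itself.
# Keys are (word, POS) pairs; 'n' = noun, 'v' = verb.
TOY_LEMMAS = {
    ("driving", "v"): "drive",
    ("dogs", "n"): "dog",
}

def toy_pos_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_pos_lemmatize("driving"))       # noun by default, no entry -> driving
print(toy_pos_lemmatize("driving", "v"))  # verb entry found -> drive
print(toy_pos_lemmatize("dogs", "n"))     # dog
```

The same surface form 'driving' gives two different results depending on the tag, which is why supplying the right POS matters.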


In [8]:
# WORDNET LEMMATIZER (with appropriate pos tags)
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
 

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [9]:
lemmatizer = WordNetLemmatizer()

In [10]:
# Define function to lemmatize each word with its POS tag
# POS_TAGGER_FUNCTION : TYPE 1

def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
    
sentence = 'the cat is sitting with the bats on the striped mat under many badly flying geese'
# tokenize the sentence and find the POS tag for each token
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence)) 

In [11]:
print(pos_tagged)

[('the', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('with', 'IN'), ('the', 'DT'), ('bats', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('striped', 'JJ'), ('mat', 'NN'), ('under', 'IN'), ('many', 'JJ'), ('badly', 'RB'), ('flying', 'VBG'), ('geese', 'JJ')]


In [12]:
# we use our own pos_tagger function to make things simpler to understand.
wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
print(wordnet_tagged)

[('the', None), ('cat', 'n'), ('is', 'v'), ('sitting', 'v'), ('with', None), ('the', None), ('bats', 'n'), ('on', None), ('the', None), ('striped', 'a'), ('mat', 'n'), ('under', None), ('many', 'a'), ('badly', 'r'), ('flying', 'v'), ('geese', 'a')]


In [13]:
lemmatized_sentence = []
for word, tag in wordnet_tagged:
    if tag is None:
        # if there is no available tag, append the token as is
        lemmatized_sentence.append(word)
    else:        
        # else use the tag to lemmatize the token
        lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)
 
print(lemmatized_sentence)

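the cat be sit with the bat on the striped mat under many badly fly geese

Note that 'geese' is left unchanged: the tagger labelled it as an adjective ('JJ') rather than a plural noun, so the lemmatizer looked it up as an adjective. The lemmatized output is only as good as the POS tags it is given.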
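The if/elif chain in pos_tagger can also be written as a lookup on the first letter of the tag. The sketch below is one possible alternative, not a replacement for the function above; `TAG_PREFIX_TO_WORDNET` and `to_wordnet_pos` are hypothetical names, and the single-letter values are the ones shown in the wordnet_tagged output ('a', 'v', 'n', 'r'), written as plain strings so that the snippet runs without nltk.

```python
# First letter of a Penn Treebank tag -> Wordnet POS letter.
# The letters match the values printed in the wordnet_tagged output.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(penn_tag):
    """Return the Wordnet POS letter for a Penn Treebank tag, or None if unmapped."""
    return TAG_PREFIX_TO_WORDNET.get(penn_tag[:1])

sample = [("the", "DT"), ("cat", "NN"), ("is", "VBZ"),
          ("badly", "RB"), ("striped", "JJ")]
print([(word, to_wordnet_pos(tag)) for word, tag in sample])
# [('the', None), ('cat', 'n'), ('is', 'v'), ('badly', 'r'), ('striped', 'a')]
```

Tags with no entry, such as 'DT', map to None, so the calling loop can leave those tokens untouched exactly as before.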


### 3. TextBlob 

TextBlob is a Python library for processing textual data. It provides a simple API for common NLP tasks.



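**Install the TextBlob package:**
In your Anaconda prompt or terminal, type:

**pip install textblob**

TextBlob relies on nltk corpora. If they are not installed yet, download them with:

**python -m textblob.download_corpora**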

In [29]:
from textblob import TextBlob, Word

my_word='coolings'
 

In [30]:
## create a Word object
w = Word(my_word)

In [31]:
print(w.lemmatize())

cooling


In [32]:
sentence = 'the bats saw the cats with stripes hanging upside down by their feet.'
 

In [33]:
S = TextBlob(sentence)

In [35]:
lemmatized_sentence=' '.join([w.lemmatize() for w in S.words])

In [36]:
print(lemmatized_sentence)

the bat saw the cat with stripe hanging upside down by their foot


### 4. TextBlob (with POS tag) 

Without POS tags, TextBlob shows the same limitation as the plain Wordnet approach: verbs such as 'hanging' are left unchanged. To overcome this, we use one of the more powerful features of the TextBlob module, **'Part of Speech'** tagging.

In [38]:
from textblob import TextBlob

# Define function to lemmatize each word with its POS tag

# POS_TAGGER_FUNCTION : TYPE 2
def pos_tagger(sentence):
    sent = TextBlob(sentence)
    # map the first letter of each tag to a Wordnet POS;
    # any other tag falls back to noun ('n')
    tag_dict = {"J": 'a', "N": 'n', "V": 'v', "R": 'r'}
    words_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    lemma_list = [wd.lemmatize(tag) for wd, tag in words_tags]
    return lemma_list

In [39]:
# Lemmatize
sentence = "the bats saw the cats with stripes hanging upside down by their feet"
lemma_list = pos_tagger(sentence)
lemmatized_sentence = " ".join(lemma_list)
print(lemmatized_sentence)

the bat saw the cat with stripe hang upside down by their foot


In [43]:
# for comparison: the same sentence lemmatized without POS tags
t_blob = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in t_blob.words])
print(lemmatized_sentence)
#> the bat saw the cat with stripe hanging upside down by their foot


the bat saw the cat with stripe hanging upside down by their foot
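To see exactly what the POS tags changed, the two outputs can be compared token by token. This is a small standard-library sketch; `changed_tokens` is a hypothetical helper, and it assumes both strings contain the same number of whitespace-separated tokens.

```python
def changed_tokens(before, after):
    """Return (position, before_token, after_token) for tokens that differ."""
    pairs = zip(before.split(), after.split())
    return [(i, b, a) for i, (b, a) in enumerate(pairs) if b != a]

# The two outputs printed above, copied as plain strings.
without_pos = "the bat saw the cat with stripe hanging upside down by their foot"
with_pos = "the bat saw the cat with stripe hang upside down by their foot"

print(changed_tokens(without_pos, with_pos))
# [(7, 'hanging', 'hang')]
```

Only 'hanging' differs: with its verb tag it is reduced to 'hang', while the nouns were already handled by the default lookup.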
