## Lemmatization

Lemmatization maps a word to its lemma, the dictionary (base) form of the word. For example, "eating" and "ate" both map to "eat".

In [2]:
import nltk
from nltk.corpus import wordnet
import spacy

In [3]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Lemmatization using SpaCy

In [4]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("eating etas eta ate adjustable rafting ability meeting better")

for token in doc:
    print(f"{token} => {token.lemma_}")

eating => eat
etas => eta
eta => eta
ate => eat
adjustable => adjustable
rafting => raft
ability => ability
meeting => meeting
better => well


### Lemmatization using NLTK `WordNetLemmatizer`

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

In [6]:
lemmatizer = WordNetLemmatizer()

The **lemmatize** method takes two arguments: the word itself and its POS (part-of-speech) tag. If no tag is given, the word is treated as a noun.

<ins>**POS tags**</ins>

Noun => n    
Verb => v  
Adjective => a  
Adverb => r  
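
To see why the POS tag matters, here is a minimal sketch that uses a hand-written lookup table instead of WordNet. The table `TOY_LEMMAS` and the helper `lookup_lemma` are hypothetical names made up for illustration. The entries are written by hand to resemble what a dictionary-based lemmatizer would return, and the default of `"n"` mirrors the noun default of `lemmatize`.

```python
# Toy stand-in for a dictionary lookup: keys are (word, POS) pairs.
# These entries are hand-written for illustration, not read from WordNet.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("better", "a"): "good",
    ("better", "r"): "well",
}


def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


# The same surface form can resolve to different lemmas depending on the tag.
for pos in ("n", "v", "a", "r"):
    print(f"meeting/{pos} => {lookup_lemma('meeting', pos)}")
    print(f"better/{pos} => {lookup_lemma('better', pos)}")
```

When a word has no entry under the requested tag, it comes back unchanged. This is also why passing `pos='v'` for every word leaves adverbs such as `fairly` untouched.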

In [7]:
words = ["goes", "fairly", "sportingly", "eating", "eats", "eaten", "writing", "programming", "programs", "finally", "history"]

In [8]:
for word in words:
    print(f"{word} => {lemmatizer.lemmatize(word, pos='v')}")

goes => go
fairly => fairly
sportingly => sportingly
eating => eat
eats => eat
eaten => eat
writing => write
programming => program
programs => program
finally => finally
history => history


In [9]:
# Function to map POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# List of words (each word is tagged on its own, without sentence context)
words = ["eating", "ate", "faulting", "rigorous", "Programs", "Subodh"]

# Process each word
for word in words:
    pos_type = pos_tag([word])
    print(f"{word}: {pos_type}")

    # Extract POS label
    pos_label = pos_type[0][1]

    # Convert the Penn Treebank tag to a WordNet tag
    wordnet_pos = get_wordnet_pos(pos_label)

    # Lemmatize only if the tag is a valid WordNet POS tag
    if wordnet_pos:
        lemma = lemmatizer.lemmatize(word, pos=wordnet_pos)
        print(f"Lemmatized form of {word}: {lemma}")
    else:
        print(f"No lemmatization for {word}")

eating: [('eating', 'VBG')]
Lemmatized form of eating: eat
ate: [('ate', 'NN')]
Lemmatized form of ate: ate
faulting: [('faulting', 'VBG')]
Lemmatized form of faulting: fault
rigorous: [('rigorous', 'JJ')]
Lemmatized form of rigorous: rigorous
Programs: [('Programs', 'NNS')]
Lemmatized form of Programs: Programs
Subodh: [('Subodh', 'NN')]
Lemmatized form of Subodh: Subodh
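
Two things in this output are worth noting. `Programs` comes back unchanged, which appears to be because the WordNet lookup is case-sensitive, and any word whose tag maps to `None` is skipped entirely. One possible adjustment, sketched here with only the standard library, is to lowercase the word and fall back to the noun tag. The helper name `prepare_for_lemmatizer` and the table `TAG_PREFIX_TO_WORDNET` are hypothetical; the single-letter codes are the ones listed in the POS tags table.

```python
# Map the first letter of a Penn Treebank tag to a WordNet POS code.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def prepare_for_lemmatizer(word, treebank_tag):
    """Return (lowercased word, WordNet POS code), defaulting to noun."""
    pos = TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")
    return word.lower(), pos


print(prepare_for_lemmatizer("Programs", "NNS"))
print(prepare_for_lemmatizer("the", "DT"))
```

The returned pair can be passed to `lemmatizer.lemmatize(word, pos=pos)`. Lowercasing also changes proper nouns such as `Subodh`, so whether to apply it depends on the task.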


#### Performance: Lemmatization vs Stemming
Although lemmatization finds the root more accurately than stemming, it applies more complex rules and consults a dictionary that records the relations between word forms, so it takes more time to produce a result.
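
The trade-off can be illustrated with a toy comparison. This is a sketch under stated assumptions, not how NLTK implements either technique: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, `VOCAB`, `EXCEPTIONS`, and `RULES` are all made up for this example, and real stemmers and WordNet use far larger rule sets and dictionaries. The stemmer strips a suffix and stops, while the lemmatizer first checks an exception table and then accepts a candidate only if it appears in the vocabulary.

```python
# Hypothetical suffix list for the toy stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical dictionary data for the toy lemmatizer.
VOCAB = {"eat", "write", "go", "program"}
EXCEPTIONS = {"ate": "eat", "eaten": "eat", "went": "go"}
# Each rule is (suffix to remove, text to append in its place).
RULES = (("ing", ""), ("ing", "e"), ("es", ""), ("s", ""))


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Use the exception table, then rules validated against VOCAB."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCAB:  # the extra dictionary check
                return candidate
    return word


for w in ("eating", "writing", "goes", "ate"):
    print(f"{w}: stem={toy_stem(w)} lemma={toy_lemmatize(w)}")
```

The extra exception lookup and the vocabulary check per candidate are where the additional work comes from; they are also what lets the lemmatizer return `write` and `eat` where the stemmer returns `writ` and `ate`.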