
# Lemmatization with NLTK


In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

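text = "The cats were running in the garden."

# Split the sentence into word and punctuation tokens.
tokens = word_tokenize(text)
# lemmatize() treats each token as a noun unless a POS is passed.
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

print(lemmatized_tokens)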

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


['The', 'cat', 'were', 'running', 'in', 'the', 'garden', '.']


In this output, we can see that:

"cats" is reduced to its lemma "cat" (noun).

"running" remains "running": when no POS tag is given, the lemmatizer treats every word as a noun, so the verb form is not reduced to "run".

"were" is left unchanged for the same reason.

## Improving Lemmatization with Part of Speech (POS) Tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

sentence = "The children are running towards a better place."

tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

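def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'n', 'r')."""
    if tag.startswith('J'):
        return 'a'  # adjective
    elif tag.startswith('V'):
        return 'v'  # verb
    elif tag.startswith('N'):
        return 'n'  # noun
    elif tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # fall back to noun, the lemmatizer's default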

lemmatized_tokens = []

for word, tag in pos_tags:
    # Keep these forms of "be" as they are; lemmatized as verbs they would become "be".
    if word.lower() in ('is', 'am', 'are'):
        lemmatized_tokens.append(word)
    else:
        lemmatized_tokens.append(lemmatizer.lemmatize(word, get_wordnet_pos(tag)))

print(" ".join(lemmatized_tokens))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


The child are run towards a good place .

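The tag conversion done by get_wordnet_pos can also be written as a lookup table, which may be easier to extend. The sketch below is one possible variant and not part of the original notebook: the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical, and it assumes the Penn Treebank tags that `pos_tag` produces by default.

```python
# Hypothetical table: first letter of a Penn Treebank tag -> WordNet POS code.
PENN_PREFIX_TO_WORDNET = {
    "J": "a",  # adjectives (JJ, JJR, JJS)
    "V": "v",  # verbs (VB, VBD, VBG, ...)
    "N": "n",  # nouns (NN, NNS, NNP, ...)
    "R": "r",  # adverbs (RB, RBR, RBS)
}


def penn_to_wordnet(tag, default="n"):
    """Return the WordNet POS code for a Penn Treebank tag.

    Tags with an unmapped first letter (for example DT or IN) fall back
    to `default`, mirroring the noun fallback used in the cell above.
    """
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

It can be passed to the lemmatizer in place of get_wordnet_pos, for example `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.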

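With POS tags supplied, each token is lemmatized according to its part of speech:

"children" is lemmatized to "child" (noun).

"running" is lemmatized to "run" (verb).

"better" is lemmatized to "good" (adjective).

"are" is left unchanged because the loop skips it explicitly; lemmatized as a verb, it would become "be".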