# ***Engr.Muhammad Javed***

# 5. Lemmatization

Lemmatization reduces a word to its dictionary form (lemma). Unlike stemming, it always returns a valid word, but it needs the word's part of speech (POS) to do so accurately.

## Lemma vs Stem
- **Stem:** 'studying' -> 'studi' (suffix stripped by rule; the result need not be a real word)
- **Lemma:** 'studying' -> 'study' (a valid dictionary word)
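
To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy only: `toy_stem`, `toy_lemma`, `SUFFIXES`, and `LEMMA_TABLE` are hypothetical names invented for illustration and are not NLTK's `PorterStemmer` or `WordNetLemmatizer`. The stemmer blindly strips a suffix, while the lemmatizer looks the word up in a small hand-written dictionary.

```python
# Toy illustration only: hypothetical helpers, not NLTK's algorithms.
SUFFIXES = ("ing", "es", "s")  # assumed suffix list, checked in this order
LEMMA_TABLE = {                # assumed mini-dictionary of word -> lemma
    "studies": "study",
    "rocks": "rock",
    "corpora": "corpus",
}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "rocks", "corpora"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemma(w)}")
```

The stem of 'studies' comes out as the non-word 'studi', and the irregular plural 'corpora' is left untouched by suffix stripping, whereas the dictionary lookup returns 'study' and 'corpus'.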

## Why Is It Better Than Stemming?
- It produces meaningful words: a lemma is a valid dictionary entry, whereas a stem such as 'studi' is not.

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Ensure corpora are downloaded
try:
    nltk.data.find('corpora/wordnet')
    nltk.data.find('corpora/omw-1.4')
except LookupError:
    nltk.download('wordnet')
    nltk.download('omw-1.4')
    
lemmatizer = WordNetLemmatizer()

print("Rocks:", lemmatizer.lemmatize("rocks"))
print("Corpora:", lemmatizer.lemmatize("corpora"))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Rocks: rock
Corpora: corpus


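## Using POS Tags
`lemmatize()` treats every word as a noun unless told otherwise, so 'better' comes back unchanged. To lemmatize correctly, we need to tell the lemmatizer the part of speech: as an adjective, 'better' -> 'good'.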
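
The effect of the POS argument can be sketched without NLTK. The table below is hand-written for illustration and is not generated by WordNet; `TOY_LEMMAS` and `toy_lemmatize_with_pos` are hypothetical names. The letters 'n', 'v', and 'a' are the values of `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADJ`.

```python
# Hand-written (word, pos) -> lemma table; an illustration, not WordNet data.
TOY_LEMMAS = {
    ("better", "a"): "good",     # adjective reading
    ("better", "n"): "better",   # noun reading (the default POS)
    ("meeting", "v"): "meet",    # verb reading
    ("meeting", "n"): "meeting", # noun reading
}

def toy_lemmatize_with_pos(word, pos="n"):
    """Look up (word, pos); fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("better", "n"), ("better", "a"),
                  ("meeting", "n"), ("meeting", "v")]:
    print(f"{word!r} as {pos!r} -> {toy_lemmatize_with_pos(word, pos)!r}")
```

The same surface form maps to different lemmas depending on the POS, which is why the lemmatizer cannot pick the right one from the word alone.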

In [2]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Newer NLTK releases load the tagger from 'averaged_perceptron_tagger_eng';
# older ones use 'averaged_perceptron_tagger'. Make sure both are available.
for resource in ('averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger'):
    try:
        nltk.data.find(f'taggers/{resource}')
    except LookupError:
        nltk.download(resource)

word = "better"
print(f"Lemmatizing '{word}' without POS:", lemmatizer.lemmatize(word))
print(f"Lemmatizing '{word}' with POS:", lemmatizer.lemmatize(word, get_wordnet_pos(word)))

Lemmatizing 'better' without POS: better

Note: if only `averaged_perceptron_tagger` is downloaded, NLTK versions that look for `taggers/averaged_perceptron_tagger_eng` stop at the second `print` with `LookupError: Resource averaged_perceptron_tagger_eng not found`, which is why the cell requests both resource names. The second line depends on the tag `nltk.pos_tag` assigns to the isolated word; the lemma is 'good' when 'better' is tagged as an adjective (`wordnet.ADJ`).
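
A tagger sees no context when it is given a single word, so in practice the tags usually come from a whole tagged sentence. The sketch below applies the same first-letter mapping used by `get_wordnet_pos` to a list of (word, tag) pairs. The pairs are hand-written stand-ins for tagger output, not the result of running `nltk.pos_tag`, and `PENN_TO_WORDNET` and `penn_to_wordnet` are hypothetical names.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
# 'a', 'n', 'v', 'r' are the values of wordnet.ADJ, NOUN, VERB, ADV.
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")

# Hand-written tags standing in for tagger output (an assumption).
tagged_sentence = [("The", "DT"), ("cats", "NNS"), ("are", "VBP"),
                   ("running", "VBG"), ("faster", "RBR")]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged_sentence]
print(pairs)
```

Each resulting pair can then be passed to `lemmatizer.lemmatize(word, pos)`.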
