<a href="https://colab.research.google.com/github/TAruna-SP/NLP/blob/week-1/Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Today we covered lemmatization, a more advanced alternative to stemming.

*   Lemmatization reduces a word to its dictionary form (its lemma) while keeping its meaning, e.g. computers -> computer. Stemming instead yields 'comput', which is not a real word.
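
To make the contrast concrete before bringing in NLTK, here is a toy sketch. It is not the Porter algorithm and not WordNet: `TOY_SUFFIXES`, `TOY_LEMMAS`, `toy_stem` and `toy_lemmatize` are made-up names for illustration only. The idea is just that a stemmer strips suffixes by rule, while a lemmatizer looks the word up in a vocabulary.

```python
# Toy illustration only: neither function is NLTK's actual algorithm.
TOY_SUFFIXES = ("ers", "ing", "s")  # hypothetical suffix rules
TOY_LEMMAS = {"computers": "computer", "helps": "help"}  # hypothetical mini-dictionary


def toy_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words come back unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["computers", "helps"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based version is cheap but can leave fragments behind, while the lookup version only ever returns entries that exist in its dictionary.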


In [7]:
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("Downloads complete. Lemmatizer ready.")

Downloads complete. Lemmatizer ready.


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [4]:
# Use the filtered_words from Day 3
# Let's make sure we have the list. If not, re-create it:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = ['natural', 'language', 'processing', 'nlp', 'amazing', 'helps', 'computers', 'understand', 'human', 'language', 'think', 'fascinating']

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered_words]

print("The Problem with Stemming:")
for original, stem in zip(filtered_words, stemmed):
    print(f"  {original:15} -> {stem:15}")
print()

The Problem with Stemming:
  natural         -> natur          
  language        -> languag        
  processing      -> process        
  nlp             -> nlp            
  amazing         -> amaz           
  helps           -> help           
  computers       -> comput         
  understand      -> understand     
  human           -> human          
  language        -> languag        
  think           -> think          
  fascinating     -> fascin         



In [5]:
# Basic lemmatization: with no pos argument, lemmatize() defaults to pos='n' (noun)
lemmatized_noun = [lemmatizer.lemmatize(w) for w in filtered_words]
print("Basic Lemmatization (as Nouns):")
for original, lemma in zip(filtered_words, lemmatized_noun):
    print(f"  {original:15} -> {lemma:15}")
print()

Basic Lemmatization (as Nouns):
  natural         -> natural        
  language        -> language       
  processing      -> processing     
  nlp             -> nlp            
  amazing         -> amazing        
  helps           -> help           
  computers       -> computer       
  understand      -> understand     
  human           -> human          
  language        -> language       
  think           -> think          
  fascinating     -> fascinating    



In [8]:
from nltk import pos_tag
# First, tag each word with its POS (the tagger sees only the filtered list, not the full sentence)
tagged_words = pos_tag(filtered_words)
print("Part-of-Speech Tags:")
print(tagged_words)
print()

Part-of-Speech Tags:
[('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'JJ'), ('amazing', 'NN'), ('helps', 'VBZ'), ('computers', 'NNS'), ('understand', 'VBP'), ('human', 'JJ'), ('language', 'NN'), ('think', 'VBP'), ('fascinating', 'VBG')]
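
The tags above are Penn Treebank tags. As a reading aid, the sketch below decodes only the six tags that appear in this output; `TAG_MEANINGS` is an illustrative name, and the list is copied from the printed result rather than recomputed. Note that some tags look doubtful ('nlp' as JJ, 'amazing' as NN), which is likely because the tagger received isolated words without their sentence context.

```python
# Meanings of the Penn Treebank tags seen in the output above (not the full tag set).
TAG_MEANINGS = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
}

# Copied from the printed output of the previous cell.
tagged_words = [
    ("natural", "JJ"), ("language", "NN"), ("processing", "NN"), ("nlp", "JJ"),
    ("amazing", "NN"), ("helps", "VBZ"), ("computers", "NNS"),
    ("understand", "VBP"), ("human", "JJ"), ("language", "NN"),
    ("think", "VBP"), ("fascinating", "VBG"),
]

for word, tag in tagged_words:
    print(f"{word:15} {tag:5} {TAG_MEANINGS.get(tag, 'not listed here')}")
```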



In [9]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun

# Now lemmatize each word using the POS assigned by the tagger
lemmatized_correct = []
for word, tag in tagged_words:
    wordnet_pos = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(word, pos=wordnet_pos)
    lemmatized_correct.append(lemma)

print("Advanced Lemmatization (with correct POS):")
for (original, tag), lemma in zip(tagged_words, lemmatized_correct):
    print(f"  {original:15} ({tag:5}) -> {lemma:15}")

Advanced Lemmatization (with correct POS):
  natural         (JJ   ) -> natural        
  language        (NN   ) -> language       
  processing      (NN   ) -> processing     
  nlp             (JJ   ) -> nlp            
  amazing         (NN   ) -> amazing        
  helps           (VBZ  ) -> help           
  computers       (NNS  ) -> computer       
  understand      (VBP  ) -> understand     
  human           (JJ   ) -> human          
  language        (NN   ) -> language       
  think           (VBP  ) -> think          
  fascinating     (VBG  ) -> fascinate      
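
One thing this output makes visible: 'processing' and 'amazing' were tagged NN, so they were looked up as nouns and came back unchanged, while 'fascinating' (VBG) was treated as a verb and became 'fascinate'. The sketch below groups the words by the WordNet category their tag maps to. It runs without NLTK by assuming the one-letter codes 'a', 'v', 'n', 'r' that NLTK uses for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`; `FIRST_LETTER_TO_POS` and `to_wordnet_pos` are illustrative names, and the tags are copied from the output above.

```python
from collections import defaultdict

# Assumed one-letter WordNet POS codes, keyed by the first letter of the Treebank tag.
FIRST_LETTER_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Dictionary-based equivalent of the if/elif mapping; defaults to noun."""
    return FIRST_LETTER_TO_POS.get(treebank_tag[:1], "n")


# Copied from the tagger output shown earlier.
tagged_words = [
    ("natural", "JJ"), ("language", "NN"), ("processing", "NN"), ("nlp", "JJ"),
    ("amazing", "NN"), ("helps", "VBZ"), ("computers", "NNS"),
    ("understand", "VBP"), ("human", "JJ"), ("language", "NN"),
    ("think", "VBP"), ("fascinating", "VBG"),
]

groups = defaultdict(list)
for word, tag in tagged_words:
    groups[to_wordnet_pos(tag)].append(word)

for pos, words in groups.items():
    print(pos, words)
```

Seen this way, the quality of the lemmas depends directly on the quality of the tags: a word placed in the wrong group is looked up under the wrong part of speech.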


In [10]:
print("\n=== FINAL COMPARISON ===")
print(f"{'Word':<15} | {'Stem':<10} | {'Lemma (Smart)':<15}")
print("-" * 45)
for word, stem, lemma in zip(filtered_words, stemmed, lemmatized_correct):
    print(f"{word:<15} | {stem:<10} | {lemma:<15}")


=== FINAL COMPARISON ===
Word            | Stem       | Lemma (Smart)  
---------------------------------------------
natural         | natur      | natural        
language        | languag    | language       
processing      | process    | processing     
nlp             | nlp        | nlp            
amazing         | amaz       | amazing        
helps           | help       | help           
computers       | comput     | computer       
understand      | understand | understand     
human           | human      | human          
language        | languag    | language       
think           | think      | think          
fascinating     | fascin     | fascinate      
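
To put a number on the comparison, the sketch below counts how often each method changed the input word. The three lists are copied from the table above rather than recomputed, and the variable names (`stem_changed`, `lemma_changed`, `agree`) are illustrative.

```python
# Columns copied from the final comparison table.
words = ["natural", "language", "processing", "nlp", "amazing", "helps",
         "computers", "understand", "human", "language", "think", "fascinating"]
stems = ["natur", "languag", "process", "nlp", "amaz", "help",
         "comput", "understand", "human", "languag", "think", "fascin"]
lemmas = ["natural", "language", "processing", "nlp", "amazing", "help",
          "computer", "understand", "human", "language", "think", "fascinate"]

# Count rows where the output differs from the input word.
stem_changed = sum(w != s for w, s in zip(words, stems))
lemma_changed = sum(w != l for w, l in zip(words, lemmas))
# Count rows where both methods produced the same result.
agree = sum(s == l for s, l in zip(stems, lemmas))

print(f"Stemmer changed    {stem_changed} of {len(words)} words")
print(f"Lemmatizer changed {lemma_changed} of {len(words)} words")
print(f"Both agree on      {agree} of {len(words)} words")
```

On this small list the stemmer is the more aggressive of the two, and the lemmatizer only changes a word when it finds a dictionary form for the tagged part of speech.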
