# ✅ What is **Lemmatization**?

**Lemmatization = converting a word to its dictionary/base form (lemma) using grammar rules**

Unlike stemming, it **doesn’t just chop endings**; it considers:

* Part-of-speech (POS) → verb, noun, adjective
* Dictionary meaning

### Examples:

| Word    | Lemma |
| ------- | ----- |
| playing | play  |
| studies | study |
| better  | good  |
| mice    | mouse |
| running | run   |

* More accurate than stemming
* Slower than stemming, because it needs dictionary lookups and POS information, but useful for **NLP modeling**
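
To make the contrast concrete, here is a minimal sketch in plain Python. It is not how NLTK or spaCy work internally: `naive_stem`, `toy_lemmatize`, and the tiny `TOY_LEXICON` are hypothetical names invented for this illustration, with the lexicon standing in for a real dictionary such as WordNet.

```python
# Toy illustration only: a tiny hand-written lexicon stands in for a real
# dictionary such as WordNet. All names here are hypothetical.

def naive_stem(word):
    """Chop a few common suffixes with no knowledge of grammar or meaning."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# (word, part of speech) -> lemma; POS codes: n=noun, v=verb, a=adjective, r=adverb
TOY_LEXICON = {
    ("playing", "v"): "play",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("mice", "n"): "mouse",
    ("running", "v"): "run",
}

def toy_lemmatize(word, pos):
    """Look the word up under the given POS; fall back to the word itself."""
    return TOY_LEXICON.get((word, pos), word)

samples = [("studies", "n"), ("running", "v"), ("mice", "n"),
           ("better", "a"), ("better", "r")]
for word, pos in samples:
    print(f"{word:8} stem={naive_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The suffix chopper turns "studies" into "stud" and "running" into "runn", and it cannot touch irregular forms like "mice". The lookup returns real words, and the same surface form ("better") can map to different lemmas depending on the POS it is given.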

---

### Why use Lemmatization?

* Reduces words to their base form accurately
* Useful for **text classification, NLP pipelines, chatbots**
* Returns real dictionary words, so the meaning of the text stays intact
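
As a rough illustration of why this helps text classification, the sketch below counts features before and after mapping tokens to lemmas. The `TOY_LEMMAS` lookup is a hypothetical stand-in for a real lemmatizer, and the token list is made up for the example.

```python
from collections import Counter

# Hypothetical mini-lookup standing in for a real lemmatizer.
TOY_LEMMAS = {
    "plays": "play", "played": "play", "playing": "play",
    "studies": "study", "studied": "study",
}

tokens = ["play", "plays", "played", "playing", "study", "studies", "studied"]

# Count surface forms as-is, then count them again after mapping to lemmas.
raw_counts = Counter(tokens)
lemma_counts = Counter(TOY_LEMMAS.get(t, t) for t in tokens)

print(len(raw_counts), "distinct surface forms")
print(len(lemma_counts), "distinct lemmas:", dict(lemma_counts))
```

Seven different surface forms collapse into two lemmas, so a downstream model sees one feature per word instead of one per inflection.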


# ✅ Lemmatization with NLTK

### ▶ Install NLTK (if not installed)

In [None]:
pip install nltk

### ▶ Import and download necessary resources

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
# POS tagger model; newer NLTK releases (3.9+) may need
# 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')

### ▶ Use WordNet Lemmatizer

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Simple example
words = ["playing", "studies", "better", "running", "mice"]

lemmas = [lemmatizer.lemmatize(word) for word in words]
print("Default (noun) lemma:", lemmas)

Default (noun) lemma: ['playing', 'study', 'better', 'running', 'mouse']


⚠ Note: The default POS is **noun**, so “playing” and “running” were not reduced to “play” or “run”.

In [None]:
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]  # pos='v' treats every word as a verb
print("Verb lemma:", lemmas)

Verb lemma: ['play', 'study', 'better', 'run', 'mice']


### ▶ Use POS tags for accurate lemmatization

In [21]:
from nltk import pos_tag
# pos_tag() takes a list of words and assigns POS tags like NN, VBG, RBR, etc.

words = ["playing", "studies", "better", "running", "mice"]
pos_tags = pos_tag(words)

print(pos_tags)

[('playing', 'VBG'), ('studies', 'NNS'), ('better', 'RBR'), ('running', 'VBG'), ('mice', 'NN')]


In [20]:
from nltk.corpus import wordnet

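# Convert a Penn Treebank tag (as returned by pos_tag) to a WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):    # adjectives: JJ, JJR, JJS
        return wordnet.ADJ
    elif tag.startswith('V'):  # verbs: VB, VBD, VBG, ...
        return wordnet.VERB
    elif tag.startswith('N'):  # nouns: NN, NNS, NNP, ...
        return wordnet.NOUN
    elif tag.startswith('R'):  # adverbs: RB, RBR, RBS (RP also matches)
        return wordnet.ADV
    else:
        return wordnet.NOUN    # fall back to the lemmatizer's default POS


for word, pos in pos_tags:
    print(word, get_wordnet_pos(pos))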

playing v
studies n
better r
running v
mice n
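
The same first-letter mapping can also be written as a lookup table. The sketch below is one possible alternative, not part of NLTK: `penn_to_wordnet` and `PENN_PREFIX_TO_WORDNET` are names made up for illustration, and the function returns the single-letter codes directly (the values printed above) instead of the `wordnet` constants, so it runs without any downloads.

```python
# Single-letter WordNet POS codes: "a" adjective, "v" verb, "n" noun, "r" adverb.
# The dictionary and function names are hypothetical, chosen for this sketch.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code by its first letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["VBG", "NNS", "RBR", "JJR", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

A table keeps the fallback explicit in one place (`default="n"`) and makes it easy to add or change a mapping without touching the control flow.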


In [None]:
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
print("With POS tagging:", lemmas)

With POS tagging: ['play', 'study', 'well', 'run', 'mouse']


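✅ Much better! Each word is now reduced to the lemma that matches its part of speech. “better” becomes “well” here rather than “good” because the tagger labeled it an adverb (RBR); lemmatizing it as an adjective (`pos='a'`) is what yields “good”.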

# ✅ Lemmatization with spaCy (Easier & Cleaner)

In [None]:
import spacy

# Load the small English pipeline (tokenization, POS tagging, NER, etc.).
# Install it once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "I was playing with mice and running faster. She is better at chess."
doc = nlp(text)

lemmas = [token.lemma_ for token in doc]
print(lemmas)

['I', 'be', 'play', 'with', 'mouse', 'and', 'run', 'fast', '.', 'she', 'be', 'well', 'at', 'chess', '.']


* spaCy runs POS tagging automatically as part of its pipeline → no extra steps needed
* The lemma of each token is available directly as `token.lemma_`
* Well suited to production pipelines



## 🔹 **Quick Comparison: Stemming vs Lemmatization**

| Feature  | Stemming                    | Lemmatization            |
| -------- | --------------------------- | ------------------------ |
| Method   | Suffix-chopping rules       | Dictionary + grammar     |
| Accuracy | Low/rough                   | High/accurate            |
| Output   | May not be a real word      | Real dictionary word     |
| Speed    | Fast                        | Slower                   |
| Use case | Search, quick preprocessing | NLP modeling, production |


