# 📌 Topic: Lemmatization

### What you will learn
- What lemmatization is and how it differs from stemming
- How the WordNet Lemmatizer works
- When to use lemmatization vs. stemming
- Practical applications: accuracy over speed
- How lemmatization preserves word meaning

### Why this matters
Lemmatization is the **smarter, slower cousin of stemming**. Instead of blindly stripping suffixes, it uses a dictionary (WordNet) to find the actual base form of a word (its lemma). It is slower, but given the right part of speech it returns real English words and preserves semantic meaning, which makes it useful for accuracy-sensitive NLP tasks like sentiment analysis, NER, and question answering.

---

## What is Lemmatization?

**Lemmatization** is the process of reducing words to their base form (called the **lemma**) using morphological analysis and a dictionary.

### Example:
- "connecting", "connected", "connects" → "connect"
- "better" → "good" (when treated as an adjective; a stemmer leaves it as "better")
- "was", "is", "are" → "be"
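
To make the definition concrete, here is a minimal sketch of the two ingredients it names: a lookup table for irregular forms, and suffix rules whose results are accepted only if they appear in a vocabulary. The `VOCAB`, `IRREGULAR`, and `SUFFIX_RULES` tables and the `toy_lemmatize` function are tiny hypothetical stand-ins for illustration, not WordNet's actual data or code.

```python
# Toy illustration only: a hand-made vocabulary and rule set, not WordNet's data.
VOCAB = {"connect", "good", "be", "pony"}

# Irregular forms cannot be derived by rules, so they are looked up directly.
IRREGULAR = {"better": "good", "was": "be", "is": "be", "are": "be"}

# (suffix, replacement) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemmatize(word: str) -> str:
    """Return a lemma for `word`, or the word itself if nothing matches."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in VOCAB:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept a candidate only if it is a known word. This check is
            # what separates dictionary-based lemmatization from blind stripping.
            if candidate in VOCAB:
                return candidate
    return word


for w in ["connecting", "connected", "connects", "better", "was", "ponies", "blorps"]:
    print(f"{w} -> {toy_lemmatize(w)}")
```

Note that the made-up word "blorps" comes back unchanged: stripping the "s" gives "blorp", which is not in the vocabulary, so the candidate is rejected. A pure suffix stripper would have returned "blorp".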

### Key differences from stemming:

| Feature | Stemming | Lemmatization |
|---------|----------|----------------|
| **Approach** | Rule-based suffix stripping | Dictionary lookup |
| **Speed** | ⚡ Fast | 🐢 Slow |
| **Output** | May be non-words | Real dictionary words (given the right POS) |
| **Accuracy** | Lower | ✅ Higher |
| **"better"** | "better" | "good" |
| **"was"** | "wa" (non-word) | "be" |

### Why choose lemmatization?
1. **Semantically correct**: "better" is the comparative form of "good", and the lemmatizer maps it back to "good"
2. **Real words**: Given the right POS, the output is a dictionary word, not a truncated stem like "poni" or "wa"
3. **Better for downstream tasks**: NER, sentiment analysis benefit from accurate lemmas
4. **Improves interpretability**: You're working with actual English words

In [None]:
# Lemmatization reduces a word to its base form (lemma) while preserving semantic meaning
# Unlike stemming, it uses a dictionary to find the actual root word
# Example: "better" → "good" (when treated as an adjective)
# Example: "was" → "be" (when treated as a verb; a stemmer gives "wa")

In [None]:
# Import NLTK and download the WordNet corpus
# WordNet is a large lexical database of English words with relationships
# It contains lemmas, definitions, and relationships between words
import nltk
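nltk.download('wordnet')  # Download once per environment
nltk.download('punkt')  # Tokenizer models used by word_tokenize later in this notebook
nltk.download('averaged_perceptron_tagger')  # For part-of-speech tagging (optional but helpful)
# Note: newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; if a LookupError appears, download the resource named in the error message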

# Import the WordNetLemmatizer
# This lemmatizer uses WordNet to find the correct base form of words
from nltk.stem import WordNetLemmatizer

## Creating a Lemmatizer Instance

To use lemmatization, we create an instance of the `WordNetLemmatizer` class. This object has a `.lemmatize()` method that converts words to their lemmas.

In [None]:
# Create an instance of the WordNetLemmatizer
# This object will use WordNet to look up lemmas
lemmatizer = WordNetLemmatizer()

## Basic Lemmatization Example

Let's lemmatize the same word family we stemmed before (connect, connecting, connected, etc.) and compare results.

In [None]:
# List of connect-related words to lemmatize
connect_words = ["connecting", "connected", "connectivity", "connects"]

# Lemmatize each word
# Note: with no pos argument, .lemmatize() treats every word as a noun
print("WordNet Lemmatizer Output:")
print("=" * 45)
for word in connect_words:
    # .lemmatize() looks the word up in WordNet and returns it unchanged if no lemma is found
    lemma = lemmatizer.lemmatize(word)
    print(f"{word:15} → {lemma}")

# Notice: because the default POS is noun, verb forms such as "connecting" and
# "connected" come back unchanged here. Passing pos='v' (covered below) reduces
# them to "connect". Whatever comes back is a real word, never a truncated stem.

## Advanced Examples: Where Lemmatization Shines

Lemmatization is smarter for tricky cases where stemming fails or produces non-words.

In [None]:
# Demonstrate where lemmatization outperforms stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Test words where stemming and lemmatization differ significantly
# Each word is paired with its WordNet POS letter, because the default (noun)
# would not resolve "better", "was", or "arguing"
test_words = [("better", "a"), ("ponies", "n"), ("was", "v"), ("arguing", "v"), ("universal", "a")]

print("Stemming vs. Lemmatization Comparison:")
print("=" * 60)
print(f"{'Word':<15} | {'Stemmed':<15} | {'Lemmatized':<15} | {'Status':<10}")
print("-" * 60)

for word, pos in test_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)

    # Check if results differ
    status = "✓ Same" if stem == lemma else "✓ Different"
    print(f"{word:<15} | {stem:<15} | {lemma:<15} | {status:<10}")

print("\n✓ Key insight: 'better' → 'good' (lemmatization) is semantically correct!")
print("✓ Key insight: 'ponies' → 'pony' (lemmatization) is a real word!")

## Advanced: Part-of-Speech (POS) Tagging with Lemmatization

Lemmatizers work even better when you tell them the **part of speech** (noun, verb, adjective, etc.). This helps disambiguate words with multiple lemmas.

### Example:
- "leads" (verb) → "lead"
- "leads" (noun) → "lead" (same in this case)
- "reading" (verb) → "read", but "reading" (noun) → "reading", so here the POS changes the result

### POS Tags:
- `v` = verb
- `n` = noun
- `a` = adjective
- `r` = adverb

In [None]:
# Lemmatization with part-of-speech hints
lemmatizer = WordNetLemmatizer()

# Same word can have different lemmas depending on part of speech
# Syntax: lemmatizer.lemmatize(word, pos='v')  # 'v' for verb, 'n' for noun, 'a' for adjective

print("Lemmatization with POS Tags:")
print("=" * 50)

# Example 1: "running"
print("\n'running' (as verb):")
print(f"  Without POS: {lemmatizer.lemmatize('running')}")
print(f"  With POS (v): {lemmatizer.lemmatize('running', pos='v')}")

# Example 2: "better"
print("\n'better' (as adjective):")
print(f"  Without POS: {lemmatizer.lemmatize('better')}")
print(f"  With POS (a): {lemmatizer.lemmatize('better', pos='a')}")

# Example 3: "studies"
print("\n'studies' (as verb vs noun):")
print(f"  As verb (v): {lemmatizer.lemmatize('studies', pos='v')}")
print(f"  As noun (n): {lemmatizer.lemmatize('studies', pos='n')}")

## Complete Workflow: Tokenization → Lemmatization

In practice, you'll combine tokenization with lemmatization. Here's the full pipeline:

In [None]:
from nltk.tokenize import word_tokenize

# Sample text
text = "The better artists were arguing about the universal laws"

# Step 1: Tokenize into words
tokens = word_tokenize(text.lower())  # Lowercase first

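# Step 2: Lemmatize each token
# Note: no pos is passed here, so every token is treated as a noun;
# "better", "were", and "arguing" will come through unchanged
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]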

# Display the pipeline
print("Complete Preprocessing Pipeline:")
print("=" * 60)
print(f"\nOriginal text:")
print(f"  {text}")

print(f"\nTokenized (lowercase):")
print(f"  {tokens}")

print(f"\nAfter lemmatization:")
print(f"  {lemmatized}")

print(f"\nKey transformations (with an explicit POS):")
print(f"  'better' → '{lemmatizer.lemmatize('better', pos='a')}'")
print(f"  'artists' → '{lemmatizer.lemmatize('artists', pos='n')}'")
print(f"  'arguing' → '{lemmatizer.lemmatize('arguing', pos='v')}'")
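
The pipeline above lemmatizes every token as a noun. A common way to improve on that with NLTK is to tag the tokens first and translate the Penn Treebank tags that `nltk.pos_tag` returns into the four WordNet letters. The sketch below shows that translation step. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the lemmatizer is passed in as a function (with a small stand-in table here) so the logic can be tried without any NLTK data installed.

```python
from typing import Callable, Iterable, List, Tuple


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun, the lemmatizer's default


def lemmatize_tagged(
    tagged_tokens: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize (token, penn_tag) pairs using a lemmatize(word, pos) function."""
    return [lemmatize(token, penn_to_wordnet(tag)) for token, tag in tagged_tokens]


# Stand-in lemmatizer (assumption: illustration only, not WordNet).
_DEMO_TABLE = {
    ("better", "a"): "good",
    ("artists", "n"): "artist",
    ("were", "v"): "be",
    ("arguing", "v"): "argue",
}


def demo_lemmatize(word: str, pos: str) -> str:
    return _DEMO_TABLE.get((word, pos), word)


# Hand-written tags in the shape that nltk.pos_tag produces.
tagged = [("the", "DT"), ("better", "JJR"), ("artists", "NNS"),
          ("were", "VBD"), ("arguing", "VBG")]
print(lemmatize_tagged(tagged, demo_lemmatize))
```

With NLTK itself, the call would look roughly like `lemmatize_tagged(nltk.pos_tag(tokens), lambda w, p: lemmatizer.lemmatize(w, pos=p))`, assuming the tagger resource has been downloaded.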

## When to Use Lemmatization

### ✅ Good use cases:
1. **Sentiment Analysis**: "better" → "good" preserves semantic meaning
2. **Named Entity Recognition**: Helps identify base forms in context
3. **Information Extraction**: Accurate base forms for database storage
4. **Small to medium datasets**: When accuracy matters more than speed
5. **Text analysis tools**: Tools where interpretability is important

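### ❌ When NOT to use:
1. **Billions of documents**: Often too slow (lemmatization can be many times slower than stemming; the exact factor depends on the implementation and the data)
2. **Real-time applications**: When responses must come back within a tight latency budget (for example, under 100 ms)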
3. **Modern neural networks**: Transformers learn morphology themselves (no need to lemmatize)
4. **Character-level models**: They work at sub-word level

### Rule of thumb:
**Use lemmatization for accuracy-critical NLP tasks.** Use stemming for speed-critical applications.
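
Speed comparisons like the one above depend heavily on the implementation and the data, so it is worth measuring on your own text. The helper below is a minimal timing sketch using only the standard library. `seconds_per_word` is a hypothetical name, and `strip_s` and `table_lookup` are placeholders standing in for a real stemmer and lemmatizer.

```python
import time
from typing import Callable, Sequence


def seconds_per_word(fn: Callable[[str], str], words: Sequence[str], repeats: int = 5) -> float:
    """Return the best observed average time per word over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for word in words:
            fn(word)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(words))
    return best


# Placeholder normalizers (assumption: swap in stemmer.stem and
# lemmatizer.lemmatize from the cells above to measure the real thing).
def strip_s(word: str) -> str:
    return word[:-1] if word.endswith("s") else word


def table_lookup(word: str) -> str:
    return {"ponies": "pony", "arguing": "argue"}.get(word, word)


words = ["connecting", "ponies", "arguing", "universal"] * 1000

t_strip = seconds_per_word(strip_s, words)
t_lookup = seconds_per_word(table_lookup, words)
print(f"strip_s:      {t_strip:.2e} s/word")
print(f"table_lookup: {t_lookup:.2e} s/word")
```

When timing the WordNet lemmatizer, call it once before measuring: NLTK loads the corpus lazily, so the first call includes a one-time loading cost.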

## Limitations of WordNet Lemmatization

Even lemmatization has limitations:

### ⚠️ Problem 1: Default POS is noun
Without specifying POS, WordNet assumes the word is a noun, which can be wrong.

```python
lemmatizer.lemmatize('running')  # Returns 'running' (assumes noun)
lemmatizer.lemmatize('running', pos='v')  # Returns 'run' (correct!)
```

### ⚠️ Problem 2: WordNet is English-focused
It works well for English but has limited support for other languages.

### ⚠️ Problem 3: Domain-specific words not in WordNet
Technical terms, slang, or neologisms won't lemmatize correctly.

### ✅ Solutions:
- Use spaCy's lemmatizer (automatic POS tagging)
- Build domain-specific lemmatization dictionaries
- For non-English, use language-specific tools
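
A domain-specific dictionary can be as simple as a lookup table consulted before the general-purpose lemmatizer. The sketch below assumes a hypothetical `DOMAIN_LEMMAS` table and `make_domain_lemmatizer` helper. The fallback is any function that takes a word and a POS letter; an identity function is used here so the example runs without NLTK.

```python
from typing import Callable, Dict

# Hypothetical domain entries that a general dictionary is unlikely to cover.
DOMAIN_LEMMAS: Dict[str, str] = {
    "dockerized": "dockerize",
    "embeddings": "embedding",
    "tokenizers": "tokenizer",
}


def make_domain_lemmatizer(
    overrides: Dict[str, str],
    fallback: Callable[[str, str], str],
) -> Callable[..., str]:
    """Build a lemmatizer that checks `overrides` before calling `fallback`."""
    def lemmatize(word: str, pos: str = "n") -> str:
        key = word.lower()
        if key in overrides:
            return overrides[key]
        return fallback(word, pos)
    return lemmatize


# Identity fallback (assumption: illustration only). With NLTK you could pass
# lambda w, p: WordNetLemmatizer().lemmatize(w, pos=p) instead.
lemmatize = make_domain_lemmatizer(DOMAIN_LEMMAS, lambda word, pos: word)

print(lemmatize("Dockerized"))    # found in the override table
print(lemmatize("running", "v"))  # not in the table, handled by the fallback
```

Keeping the overrides in a plain dictionary makes them easy to review and extend as new domain terms show up.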

## Preview: spaCy's Superior Lemmatization

spaCy is a modern NLP library that **automatically tags part-of-speech** before lemmatization, so you do not have to supply POS tags yourself as you do with the WordNet Lemmatizer.

```python
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("The better artists were arguing")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
# Expected output (exact lemmas can vary with the model version):
# The → the
# better → good (spaCy tags it as an adjective automatically)
# artists → artist
# were → be
# arguing → argue
```

We'll cover spaCy in a future notebook.

## Key Takeaways

1. **Lemmatization is smarter than stemming**: Uses dictionaries, not just rules
2. **Produces real words**: Given the right POS, the output is a dictionary word rather than a truncated stem
3. **Semantically correct**: "better" → "good", not "better"
4. **Slower but more accurate**: Trade-off is speed vs. quality
5. **POS tags improve accuracy**: Tell the lemmatizer whether the word is a verb, noun, adjective, or adverb
6. **WordNet Lemmatizer has limitations**: The default POS is noun, and it only works well for English
7. **spaCy is better for production**: Automatic POS tagging + better accuracy

## Next Steps:
- Learn about **part-of-speech (POS) tagging**
- Explore **spaCy for production-quality lemmatization**
- Compare lemmatization across different tasks (sentiment analysis vs. text classification)

## Summary: Stemming vs. Lemmatization

| Aspect | Stemming | Lemmatization |
|--------|----------|----------------|
| **Algorithm** | Rule-based suffix stripping | Dictionary lookup + morphology |
| **Speed** | ⚡⚡⚡ Very fast | 🐢 Slower |
| **Output quality** | May be non-words | Real dictionary words (given the right POS) |
| **Semantic accuracy** | Lower | ✅ Higher |
| **"better"** | "better" | "good" |
| **"ponies"** | "poni" | "pony" |
| **Best for** | Large-scale search, IR | Sentiment analysis, NER, accuracy-critical |
| **Implementation** | Porter, Snowball | WordNet, spaCy |

**Remember**: Choose based on your task's requirements for speed vs. accuracy.