### **What is Lemmatization?**

**Lemmatization** is the process of reducing a word to its base or dictionary form, called a **lemma**. Unlike **stemming**, which merely chops off prefixes or suffixes to crudely get to a root form (often resulting in non-real words), lemmatization considers the context and the morphological analysis of the word. This results in more accurate base forms, making lemmatization more sophisticated and meaningful than stemming.

For example:
- A crude stemmer might chop "better" down to "bett," whereas a lemmatizer that knows the word is an adjective returns "good," the correct dictionary form (lemma).
- Both approaches reduce "running" to "run," but only lemmatization handles irregular forms: a stemmer leaves "ran" unchanged, while a lemmatizer maps it to "run."

### **Lemmatization vs. Stemming**
- **Lemmatization** is context-aware and uses a dictionary or lexicon to find the base form of a word.
- **Stemming** is simpler but more error-prone, as it cuts off word endings using heuristic rules and often results in invalid words.

| **Feature**               | **Lemmatization**              | **Stemming**                  |
|---------------------------|-------------------------------|-------------------------------|
| **Output**                | Returns dictionary form (lemma) | Returns word stem (may not be a valid word) |
| **Complexity**            | More complex (dictionary lookup plus morphological analysis) | Simpler (heuristic suffix stripping) |
| **Accuracy**              | More accurate, context-aware   | Less accurate, sometimes incorrect |
| **Language Dependence**   | Language-specific (relies on dictionaries) | Language-specific rules, but no dictionary required |
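
To make the "may not be a valid word" row concrete, here is a minimal sketch of heuristic suffix stripping. It is a deliberately crude toy, not the Porter algorithm or any NLTK stemmer; the `SUFFIXES` list and the `crude_stem` name are made up for illustration.

```python
# Toy suffix stripper: a crude illustration, NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, checked in order.
SUFFIXES = ("ing", "ies", "ed", "es", "ly", "s")

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["running", "studies", "happily", "went", "cats"]:
    print(f"{word} -> {crude_stem(word)}")
```

With these rules, "running" becomes "runn" and "happily" becomes "happi" (neither is a word), while the irregular form "went" is left untouched because no rule applies to it.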

### **How Lemmatization Works**
1. **Morphological Analysis**: Lemmatization examines the structure of words and uses part-of-speech tagging to understand the role the word plays in a sentence.
2. **Dictionary Lookup**: It refers to a lexicon (a dictionary) to find the base form of the word.
3. **Handling Irregular Forms**: Lemmatization can accurately handle irregular word forms, such as "went" (lemmatized to "go") or "better" (lemmatized to "good").
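
The three steps above can be sketched in plain Python. This is a toy under stated assumptions, not how NLTK is implemented: `LEXICON`, `EXCEPTIONS`, `RULES`, and `toy_lemmatize` are hypothetical names, and the data is a tiny hand-written stand-in for a real lexicon such as WordNet.

```python
# Toy lemmatizer: hypothetical data, far smaller than a real lexicon.
# Known base forms, grouped by part of speech ('n', 'v', 'a').
LEXICON = {
    "v": {"go", "run", "eat", "fly", "write"},
    "n": {"cat", "history", "program"},
    "a": {"good", "happy"},
}
# Irregular forms are looked up directly, keyed by (word, POS).
EXCEPTIONS = {
    ("went", "v"): "go",
    ("ran", "v"): "run",
    ("ate", "v"): "eat",
    ("better", "a"): "good",
}
# Suffix rewrite rules per POS: (suffix to remove, replacement).
RULES = {
    "v": [("ing", ""), ("ing", "e"), ("es", ""), ("s", ""), ("ed", "")],
    "n": [("s", "")],
    "a": [("ier", "y"), ("er", "")],
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma of `word` for the given POS, or the word itself."""
    word = word.lower()
    # Step 3: irregular forms come from the exception table.
    if (word, pos) in EXCEPTIONS:
        return EXCEPTIONS[(word, pos)]
    # Step 2: a word already in the lexicon is its own lemma.
    if word in LEXICON.get(pos, ()):
        return word
    # Step 1: try each rewrite rule, accepting only candidates the lexicon knows.
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON.get(pos, ()):
                return candidate
    return word

for word, pos in [("went", "v"), ("writing", "v"), ("happier", "a"),
                  ("better", "a"), ("better", "n")]:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

Because every candidate is checked against the lexicon, the function never invents a non-word. The last two calls also show why the POS matters: "better" maps to "good" as an adjective but is returned unchanged when treated as a noun.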

### **Classes in NLTK for Lemmatization**

The **NLTK (Natural Language Toolkit)** library provides the `WordNetLemmatizer` class for lemmatization, and the separate **TextBlob** library wraps it in a simpler interface. This section covers both:

1. **WordNet Lemmatizer**
2. **TextBlob Lemmatizer (an external library built on top of NLTK)**

### **1. WordNet Lemmatizer**
The **WordNet Lemmatizer** is the most common tool used in NLTK for lemmatization. It relies on the **WordNet** corpus, which is a large lexical database of English, to determine the lemma of a word.

#### Key Features:
- Handles different parts of speech (POS).
- Returns the lemma of a word by consulting the WordNet corpus.
- Takes into account the POS tag for better accuracy (verbs, nouns, adjectives, etc.).

#### Example Code:
```python
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads: the WordNet corpus and the POS tagger model.
# Some newer NLTK releases look for 'averaged_perceptron_tagger_eng' instead.
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Tag a single word and map its Penn Treebank tag to a WordNet POS constant."""
    # Tagging one word without sentence context is unreliable; it keeps the demo short.
    tag = pos_tag([word])[0][1][0].upper()

    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV,
    }

    # Fall back to noun, which is also lemmatize()'s default POS
    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatize words with appropriate POS tag
words = ["running", "better", "happier", "ate", "flying"]
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# Display original and lemmatized words
for original, lemmatized in zip(words, lemmatized_words):
    print(f"Original: {original} -> Lemmatized: {lemmatized}")
```

#### Explanation:
1. **POS Tagging**: Since lemmatization depends on the part of speech (POS), it is useful to identify whether a word is a noun, verb, adjective, or adverb. The `get_wordnet_pos()` function helps map regular POS tags (from NLTK's `pos_tag()` function) to the format that the **WordNet Lemmatizer** accepts.
2. **Lemmatization**: The `lemmatize()` method from `WordNetLemmatizer` uses WordNet’s dictionary to return the base form of the word based on its POS.

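#### Example Output:
The output below assumes the tagger labels each isolated word as intended (for example, "better" as an adjective). Tagging single words without sentence context is unreliable, so actual results may differ; if "better" is tagged as an adverb, its lemma is "well".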
```
Original: running -> Lemmatized: run
Original: better -> Lemmatized: good
Original: happier -> Lemmatized: happy
Original: ate -> Lemmatized: eat
Original: flying -> Lemmatized: fly
```

#### Explanation of Results:
- "running" → "run": Verbs are reduced to their base form.
- "better" → "good": The irregular comparative is correctly lemmatized to its adjective root.
- "happier" → "happy": Comparatives and superlatives are reduced to the base form.
- "ate" → "eat": The irregular past tense is lemmatized to the base form.
- "flying" → "fly": The present participle is reduced to the base verb.
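
The tag mapping inside `get_wordnet_pos()` can also be written as a standalone function that takes a tag rather than a word, so that a whole sentence can be tagged once, in context, and each tag converted afterwards. The sketch below uses only plain Python: `penn_to_wordnet` is a hypothetical helper name, and the `tagged` list is hand-written to look like `pos_tag()` output rather than produced by the tagger.

```python
# WordNet's POS codes are single letters: 'n', 'v', 'a', 'r'
# (the values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV).
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag such as 'VBD' or 'JJR' to a WordNet POS letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper(), default)

# Hand-written (word, tag) pairs in the shape pos_tag() returns.
tagged = [("The", "DT"), ("children", "NNS"), ("ate", "VBD"),
          ("better", "JJR"), ("food", "NN")]

for word, tag in tagged:
    print(f"{word:10} {tag:4} -> {penn_to_wordnet(tag)}")
```

Each resulting letter can be passed as the `pos` argument of `lemmatizer.lemmatize()`. Keying on the first letter is a simplification: any tag that does not start with J, N, V, or R falls back to noun.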

### **2. TextBlob Lemmatizer**
While **TextBlob** isn't part of NLTK directly, it builds on top of NLTK and simplifies tasks like lemmatization and POS tagging. If you're working in environments where you want quick and easy lemmatization, **TextBlob** can be a good choice.

#### Installation:
You can install TextBlob via pip, then download the corpora it needs (including WordNet):
```bash
pip install textblob
python -m textblob.download_corpora
```

#### Example Code:
```python
from textblob import Word

# Lemmatize words using TextBlob
words = ["running", "better", "happier", "ate", "flying"]
lemmatized_words = [Word(word).lemmatize() for word in words]

# Display original and lemmatized words
for original, lemmatized in zip(words, lemmatized_words):
    print(f"Original: {original} -> Lemmatized: {lemmatized}")
```

#### Example Output:
```
Original: running -> Lemmatized: running
Original: better -> Lemmatized: better
Original: happier -> Lemmatized: happier
Original: ate -> Lemmatized: ate
Original: flying -> Lemmatized: flying
```

#### Explanation:
TextBlob's `Word.lemmatize()` uses the same WordNet lemmatizer as NLTK, but it treats every word as a noun unless a POS is given, which is why nothing changes above. Passing the POS, as in `Word("running").lemmatize("v")` or `Word("better").lemmatize("a")`, yields the same lemmas as NLTK's **WordNetLemmatizer** ("run" and "good").

---

### **Other Important Classes and Libraries for Lemmatization**

1. **spaCy**:
   spaCy offers a high-performance NLP pipeline and is widely used for industrial applications. Its lemmatizer runs as part of the pipeline and uses the POS tag predicted for each token in context, so no manual tag is needed.
   
   Installation:
   ```bash
   pip install spacy
   python -m spacy download en_core_web_sm
   ```
   
   Example Code:
   ```python
   import spacy
   nlp = spacy.load("en_core_web_sm")
   
   # Process text using spaCy
   doc = nlp("running better happier ate flying")
   
   # Lemmatize each word in the text
   for token in doc:
       print(f"Original: {token.text}, Lemma: {token.lemma_}")
   ```

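   Output (may vary with the model version, because each lemma depends on the predicted POS tag; here "better" was treated as an adverb, giving "well"):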
   ```
   Original: running, Lemma: run
   Original: better, Lemma: well
   Original: happier, Lemma: happy
   Original: ate, Lemma: eat
   Original: flying, Lemma: fly
   ```

2. **Stanford CoreNLP**:
   This is another robust library for lemmatization and other NLP tasks. It provides high-quality results but is more resource-intensive.

3. **Pattern Library**:
   The Pattern library also includes a lemmatizer. It is more lightweight than spaCy and is useful for simple tasks.

---

### **Conclusion**

Lemmatization is a critical step in many **NLP tasks** such as **text classification**, **information retrieval**, and **machine translation**, where reducing words to their base form can help normalize the input and improve model performance. Compared to stemming, lemmatization is more accurate and context-aware, making it preferable for tasks requiring semantic understanding.

- **WordNet Lemmatizer** in **NLTK** is the go-to tool for lemmatization, especially when combined with proper **POS tagging**.
- **TextBlob** provides a simpler interface over the same WordNet data, but it also needs a POS argument to handle verbs and irregular adjectives.
- More advanced libraries like **spaCy** provide high-performance lemmatization suited for production-level NLP tasks.

By choosing the appropriate tool and understanding its strengths, you can effectively incorporate lemmatization into your **NLP pipeline** for better text processing and analysis.

In [1]:
! pip install nltk





In [5]:
# Lemmatization is commonly used in Q&A systems, chatbots, and text summarization
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer


In [6]:
lemmatizer=WordNetLemmatizer()

In [7]:
# POS codes accepted by lemmatize():
# noun - 'n' (the default), verb - 'v', adjective - 'a', adverb - 'r'
lemmatizer.lemmatize("going",pos='v')

'go'

In [8]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [9]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [13]:
lemmatizer.lemmatize("goes",pos='v')

'go'

In [11]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')