
# using ***nltk***

## **Wordnet Lemmatizer**
Lemmatization is similar to stemming, but its output, called a lemma, is a root word rather than a root stem, which is what stemming produces. After lemmatization we get a valid word that means the same thing as the original.

NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example:


In [38]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [39]:
from nltk.stem import WordNetLemmatizer

In [40]:
word_net_lemma = WordNetLemmatizer()

In [41]:
word_net_lemma.lemmatize("history")

'history'

Lemmatizes a word using WordNet's built-in morphy function.<br>
**Returns the input word unchanged if it cannot be found in WordNet.**

**POS**: The part-of-speech tag. Valid options are:

- pos = "n" for nouns,
- pos = "v" for verbs,
- pos = "a" for adjectives,
- pos = "r" for adverbs and
- pos = "s" for satellite adjectives.

**Returns**: The lemma of the word for the given pos.

**By default pos = "n", so for
<br>going ---> the output will be going<br> since it is treated as a noun.**

In [42]:
word_net_lemma.lemmatize("going")

'going'

In [43]:
word_net_lemma.lemmatize("going",pos='v')

'go'
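
Since the result depends on the pos argument, a common pattern is to derive it from a part-of-speech tagger instead of hard-coding it. The block below is a minimal sketch of that idea, not part of NLTK: the helper names `penn_to_wordnet_pos` and `lemmatize_tagged` are hypothetical, and it assumes the tags follow the Penn Treebank convention (for example `VBG`, `NNS`, `JJ`, `RB`). The lemmatizer is passed in as a callable so that the helper itself has no dependencies.

```python
def penn_to_wordnet_pos(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet pos letter.

    Hypothetical helper: tags that are not nouns, verbs, adjectives or
    adverbs fall back to `default`, mirroring the lemmatizer's own default.
    """
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    if tag.startswith("NN"):
        return "n"  # nouns: NN, NNS, NNP, NNPS
    return default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs.

    `lemmatize` is any callable accepting (word, pos=...), such as the
    lemmatize method of a WordNetLemmatizer instance.
    """
    return [
        lemmatize(word, pos=penn_to_wordnet_pos(tag))
        for word, tag in tagged_tokens
    ]
```

With the lemmatizer created earlier, this could be called as `lemmatize_tagged(nltk.pos_tag(tokens), word_net_lemma.lemmatize)`. Note that `nltk.pos_tag` needs a tagger model to be downloaded first, and the resource name depends on the installed NLTK version.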

In [44]:
words = ["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [45]:
for word in words:
    print(f"{word}---->{word_net_lemma.lemmatize(word, pos='v')}")

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [46]:
word_net_lemma.lemmatize("fairly",pos="r"),word_net_lemma.lemmatize("sportingly",pos="r")


('fairly', 'sportingly')

The WordNet lemmatizer takes more time than stemming because it looks words up in a dictionary instead of only stripping suffixes.
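
One way to check the speed difference on your own machine is a small timing helper. This is a rough sketch rather than a rigorous benchmark: `seconds_per_word` is a hypothetical name, and it simply averages wall-clock time over repeated calls of whatever function is passed in.

```python
import time


def seconds_per_word(transform, words, repeats=1000):
    """Average wall-clock seconds that transform(word) takes per word."""
    start = time.perf_counter()
    for _ in range(repeats):
        for word in words:
            transform(word)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(words))
```

For example, `seconds_per_word(PorterStemmer().stem, words)` can be compared with `seconds_per_word(word_net_lemma.lemmatize, words)`, where `PorterStemmer` comes from `nltk.stem`. NLTK loads the WordNet corpus lazily, so it is worth calling the lemmatizer once before timing it to keep the one-off load out of the measurement.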

Applications include:<br>
**Q&A, chatbots, text summarization**

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>

<h2>Disadvantages of Lemmatization</h2>

<ul>
    <li><strong>Computational Overhead:</strong> Lemmatization requires more computational resources compared to stemming due to dictionary lookups and morphological analysis.</li>
    <li><strong>Dependency on Language Resources:</strong> Lemmatization relies heavily on comprehensive and accurate language resources, which may be lacking for less common languages or specialized domains.</li>
    <li><strong>Context Sensitivity:</strong> The correct lemma depends on the word's part of speech and context, so the base form can be ambiguous without that information, especially for words with multiple meanings or forms.</li>
</ul>

</body>
</html>


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stemming vs Lemmatization</title>
<style>
    table {
        border-collapse: collapse;
        width: 100%;
    }
    th, td {
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }
    th {
        background-color: #f2f2f2;
    }
</style>
</head>
<body>

<h2>Stemming vs Lemmatization</h2>

<table>
    <tr>
        <th>Feature</th>
        <th>Stemming</th>
        <th>Lemmatization</th>
    </tr>
    <tr>
        <td>Process</td>
        <td>Reduces words to their root or base form by removing suffixes.</td>
        <td>Determines the dictionary form or lemma of a word based on its intended meaning.</td>
    </tr>
    <tr>
        <td>Output</td>
        <td>May produce non-existent words or stems that are not lexicographically correct.</td>
        <td>Produces valid dictionary words (lemmas); a word that is not found in the dictionary is returned unchanged.</td>
    </tr>
    <tr>
        <td>Complexity</td>
        <td>Simple and faster compared to lemmatization.</td>
        <td>More complex and slower due to dictionary lookups and morphological analysis.</td>
    </tr>
    <tr>
        <td>Accuracy</td>
        <td>Less accurate compared to lemmatization as it may result in ambiguity or loss of meaning.</td>
        <td>More accurate as it considers the context and meaning of words.</td>
    </tr>
    <tr>
        <td>Examples</td>
        <td>Running -> Run, Books -> Book</td>
        <td>Running -> Running (as a noun, the default) or Run (as a verb), Books -> Book</td>
    </tr>
</table>

</body>
</html>
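
To make the Process and Output rows of the table concrete, here is a deliberately tiny illustration of the two mechanisms. It is a toy, not the Porter stemmer or the WordNet lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix list and the four-entry dictionary are all made up for this sketch.

```python
# Toy rule set: suffixes are tried in this order and the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

# Toy dictionary keyed by (word, pos); a real lemmatizer uses WordNet instead.
TOY_LEMMAS = {
    ("writing", "v"): "write",
    ("finalized", "v"): "finalize",
    ("eaten", "v"): "eat",
    ("mice", "n"): "mouse",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos="n"):
    """Look the word up for the given pos; return it unchanged if not found."""
    return TOY_LEMMAS.get((word, pos), word)
```

Here `toy_stem("writing")` gives `"writ"`, which is not a word, while `toy_lemmatize("writing", pos="v")` gives `"write"`. Irregular forms show the other side: `toy_stem("eaten")` leaves the word untouched because no suffix rule applies, whereas the dictionary lookup maps it to `"eat"`. With the default pos, `toy_lemmatize("writing")` returns `"writing"` unchanged, for the same reason the notebook's lemmatizer returns going for going.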


# using ***spaCy***

In [47]:
import spacy

# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")

# Given text
text = "The cats are chasing mice and jumping over fences."

# Process the text with spaCy
doc = nlp(text)

# Lemmatize each token in the text
lemmas = [token.lemma_ for token in doc]

# Print the lemmas
print("Original text:", text)
print("Lemmatized text:", " ".join(lemmas))


Original text: The cats are chasing mice and jumping over fences.
Lemmatized text: the cat be chase mouse and jump over fence .
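
Unlike the NLTK call, no pos argument is passed here. As far as I understand the English pipelines, the lemmatizer relies on the part-of-speech tags that the pipeline assigns to each token. The sketch below lists each token's text, coarse POS tag and lemma side by side, and optionally drops punctuation such as the trailing "." in the output above. `lemma_rows` is a hypothetical helper; it only uses the token attributes `text`, `pos_`, `lemma_` and `is_punct`.

```python
def lemma_rows(doc, skip_punct=True):
    """Return (text, pos, lemma) tuples for the tokens of a spaCy Doc."""
    rows = []
    for token in doc:
        if skip_punct and token.is_punct:
            continue  # leave out tokens such as "." and ","
        rows.append((token.text, token.pos_, token.lemma_))
    return rows
```

Calling `lemma_rows(doc)` on the `doc` created above returns one tuple per word, which makes it easier to see which tag led to which lemma.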


In [48]:
nlp("fairly")[0].lemma_

'fairly'