# Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item, called the lemma. It is similar to stemming, but it takes the word's context and part of speech into account, so forms that share a meaning are linked to one base word.
Text preprocessing often includes stemming, lemmatization, or both. The two terms are frequently confused, and some people treat them as the same thing. Lemmatization is usually preferred when accuracy matters because it performs a morphological analysis of each word and returns a real word, whereas stemming only strips affixes.
Examples of lemmatization:

-> rocks : rock

-> corpora : corpus

-> better : good

# 1. Rule-Based Lemmatization
Rule-based lemmatization involves the application of predefined rules to derive the base or root form of a word. Unlike machine learning-based approaches, which learn from data, rule-based lemmatization relies on linguistic rules and patterns.

Here’s a simplified example of rule-based lemmatization for English verbs:

Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.

Example:

Word: “walked”

Rule Application: Remove “-ed”

Result: “walk”
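
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a complete lemmatizer: the rule table `SUFFIX_RULES` and the function name `rule_based_lemma` are hypothetical, and only the “-ed” rule comes from the example above; the other two rules are added as assumptions to show how a rule list grows.

```python
import re

# Hypothetical, deliberately tiny rule set: (suffix pattern, replacement).
# Rules are tried in order and the first one that matches is applied.
SUFFIX_RULES = [
    (re.compile(r"ies$"), "y"),  # "studies" -> "study" (assumed extra rule)
    (re.compile(r"ed$"), ""),    # "walked"  -> "walk"  (the rule from the text)
    (re.compile(r"s$"), ""),     # "rocks"   -> "rock"  (assumed extra rule)
]

def rule_based_lemma(word):
    """Return the word with the first matching suffix rule applied."""
    word = word.lower()
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word  # no rule matched, so the word is returned unchanged

print(rule_based_lemma("walked"))
print(rule_based_lemma("rocks"))
```

Rules like these are cheap to apply but blind to exceptions: the same “-ed” rule turns “hoped” into “hop”, and an irregular form such as “went” is left unchanged.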

# 2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map words to their corresponding base forms or lemmas. Each word is matched against the dictionary entries to find its lemma. This method handles irregular forms such as “went” that suffix rules cannot derive, but it only covers words that are present in the dictionary.

Suppose we have a dictionary with lemmatized forms for some words:

‘running’ -> ‘run’

‘better’ -> ‘good’

‘went’ -> ‘go’

When we apply dictionary-based lemmatization to a text like “I was running to become a better athlete, and then I went home,” the resulting lemmatized form would be: “I was run to become a good athlete, and then I go home.”
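
The lookup described above can be sketched as follows. The dictionary holds only the three entries from the example, and the names `LEMMA_DICT` and `dictionary_lemmatize` are hypothetical, chosen for illustration.

```python
import re

# The three dictionary entries from the example above
LEMMA_DICT = {"running": "run", "better": "good", "went": "go"}

def dictionary_lemmatize(text):
    """Replace every alphabetic word with its dictionary lemma, if it has one."""
    def lookup(match):
        word = match.group(0)
        # Words missing from the dictionary are returned unchanged
        return LEMMA_DICT.get(word.lower(), word)
    # Only runs of letters are looked up, so spaces and punctuation are preserved
    return re.sub(r"[A-Za-z]+", lookup, text)

sentence = "I was running to become a better athlete, and then I went home."
print(dictionary_lemmatize(sentence))
# I was run to become a good athlete, and then I go home.
```

Note that “was” stays as it is because the dictionary has no entry for it, which shows how coverage depends entirely on the lookup table.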


# 3. Machine Learning-Based Lemmatization
Machine learning-based lemmatization leverages computational models to automatically learn the relationships between words and their base forms. Unlike rule-based or dictionary-based approaches, machine learning models, such as neural networks or statistical models, are trained on large text datasets to generalize patterns in language.

Example:

Consider a machine learning-based lemmatizer trained on diverse texts. When encountering the word ‘went,’ the model, having learned patterns, predicts the base form as ‘go.’ Similarly, for ‘happier,’ the model deduces ‘happy’ as the lemma. The advantage lies in the model’s ability to adapt to varied linguistic nuances and handle irregularities, making it robust for lemmatizing diverse vocabularies.
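
To make the idea of learning from data concrete, here is a toy sketch under strong assumptions. It is not a neural network and not how any particular library works: it simply counts suffix rewrite rules observed in a handful of made-up (word, lemma) pairs and applies the most frequent rule for the longest matching suffix. The training pairs and the names `edit_rule`, `train`, and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

# Made-up training data; a real model would be trained on a large annotated corpus
TRAINING_PAIRS = [
    ("easier", "easy"), ("funnier", "funny"),
    ("walked", "walk"), ("jumped", "jump"),
    ("cats", "cat"), ("dogs", "dog"),
    ("went", "go"),
]

def edit_rule(word, lemma):
    """Return (suffix_to_strip, suffix_to_add) that turns word into lemma."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]

def train(pairs):
    """Count how often each suffix rewrite is observed in the training pairs."""
    rules = defaultdict(Counter)
    for word, lemma in pairs:
        strip, add = edit_rule(word, lemma)
        rules[strip][add] += 1
    return rules

def predict(rules, word):
    """Apply the most frequent learned rewrite for the longest known suffix."""
    for start in range(len(word)):  # start=0 is the whole word, i.e. the longest suffix
        suffix = word[start:]
        if suffix in rules:
            add, _count = rules[suffix].most_common(1)[0]
            return word[:start] + add
    return word  # nothing learned applies, so the word is returned unchanged

rules = train(TRAINING_PAIRS)
print(predict(rules, "happier"))  # "ier" -> "y" was learned from "easier" and "funnier"
print(predict(rules, "talked"))   # "ed" -> "" was learned from "walked" and "jumped"
```

The sketch generalizes to “happier” and “talked” even though neither appears in the training pairs. It recovers “went” only because that exact pair was seen during training, whereas the models described above also draw on context and much larger datasets.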


# Implementation of Lemmatization
# 1. NLTK

In [3]:
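import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data that the lemmatizer looks words up in
nltk.download('wordnet')

lm = WordNetLemmatizer()

# With no pos argument, lemmatize() treats the word as a noun
print(f"rocks----->{lm.lemmatize('rocks')}")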

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...


rocks----->rock


In [4]:
print(f"cars----->{lm.lemmatize('cars')}")

cars----->car


In [6]:
# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lm.lemmatize("better", pos="a"))

better : good


# 2. spaCy

In [2]:
import spacy

# Load the small English pipeline
# (install it once with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Define a sample text
text = "The quick brown foxes are jumping over the lazy dogs."

# Process the text using spaCy
doc = nlp(text)

# Collect the lemma of every token (punctuation included)
lem_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence
lemmatized_text = ' '.join(lem_tokens)

# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)

Original Text: The quick brown foxes are jumping over the lazy dogs.
Lemmatized Text: the quick brown fox be jump over the lazy dog .
