# Lemmatization

As we saw in the previous chapter, we can explain to the machine which words are similar and how different they are.

However, some "different" words are only variations of the same word and should not be treated as separate entries.

Let's take an example:

Imagine that you are asked to build a model to classify books in two categories: _cooking_ and _cars_. You will use the most frequent words of the book to build your algorithm.

In that case, you don't really want to distinguish between `apple` and `apples`, or between `wheel` and `wheels`. You would rather treat `apple` and `apples` as variations of `apple`.

To fix that, we will apply **lemmatization**. This approach reduces each word to its base form (called its **lemma**). The lemma corresponds to the headword in a dictionary:


**apple** (noun) : `a round fruit (usually with a green or red skin) which can be eaten (plural: apples)`
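
To make the idea concrete, here is a minimal sketch of how mapping words to their lemma merges word counts. The `LEMMAS` table and the `to_lemma` helper are hypothetical and hand-written for this illustration; a real lemmatizer relies on a full dictionary and grammar rules.

```python
from collections import Counter

# Hypothetical, hand-written lemma table (illustration only).
LEMMAS = {"apples": "apple", "wheels": "wheel"}


def to_lemma(word):
    """Return the lemma of a word, or the word itself if it is not in the table."""
    return LEMMAS.get(word, word)


words = ["apple", "apples", "wheel", "wheels", "apples"]

raw_counts = Counter(words)
lemma_counts = Counter(to_lemma(w) for w in words)

print(raw_counts)    # 'apple' and 'apples' are counted separately
print(lemma_counts)  # both forms are counted under 'apple'
```

With the lemma counts, a frequency-based classifier sees one strong signal for `apple` instead of two weaker ones.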

 


## Still confused?
Let's see how it works in a practical case.

First, read [this article](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

Then, try to apply what you have learned by using spaCy or NLTK.

**Pro tip:** Most lemmatizers work on a single word, not on a whole sentence. Remember to tokenize your sentence first.

**Pro tip:** If you run into SSL issues when downloading `nltk` resources, [check this](https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed).
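
The tip about tokenizing first can be illustrated with a small sketch. It assumes a hypothetical word-level lookup (`lemmatize_word`, backed by a hand-written `LEMMAS` table) and a simple regular-expression tokenizer; neither is the NLTK or spaCy implementation.

```python
import re

# Hypothetical word-level lemma table (illustration only).
LEMMAS = {"children": "child", "games": "game", "plays": "play"}


def lemmatize_word(word):
    """Look up a single word; anything unknown is returned unchanged."""
    return LEMMAS.get(word, word)


sentence = "those children play those games"

# Passing the whole sentence: no entry matches, so nothing changes.
untouched = lemmatize_word(sentence)

# Tokenizing first: each word is looked up on its own.
tokens = re.findall(r"\w+", sentence)
lemmatized = " ".join(lemmatize_word(token) for token in tokens)

print(untouched)   # those children play those games
print(lemmatized)  # those child play those game
```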

In [None]:
# Can you lemmatize this sentence with spaCy and/or NLTK?

my_sentence = "Those children are playing. this game, those games, I play he plays"


### Normalizing the sentence

In [6]:
import string
import contractions
import re
import spacy
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download NLTK resources
import nltk
nltk.download('punkt')
nltk.download('wordnet')

# Example sentence
my_sentence = "Those children are playing. this game, those games, I play he plays"
print("Original Sentence:")
print(my_sentence)

# Text normalization using lowercasing
normalized_sentence = my_sentence.lower()
print("\nAfter Lowercasing:")
print(normalized_sentence)

# Text normalization: removing punctuation
normalized_sentence = normalized_sentence.translate(str.maketrans("", "", string.punctuation))
print("\nAfter Removing Punctuation:")
print(normalized_sentence)

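# Text normalization: handling contractions
# Note: punctuation removal above also strips apostrophes (e.g. "we'll" becomes "well"),
# which can hide contractions from this step. On text that contains contractions,
# consider running this step before removing punctuation.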
normalized_sentence = contractions.fix(normalized_sentence)
print("\nAfter Handling Contractions:")
print(normalized_sentence)

# Tokenization and Lemmatization using NLTK
words = word_tokenize(normalized_sentence)
lemmatizer = WordNetLemmatizer()
normalized_sentence_nltk = ' '.join([lemmatizer.lemmatize(word) for word in words])
print("\nAfter Tokenization and Lemmatization (NLTK):")
print(normalized_sentence_nltk)

# Tokenization and Lemmatization using SpaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(normalized_sentence)
normalized_sentence_spacy = ' '.join([token.lemma_ for token in doc])
print("\nAfter Tokenization and Lemmatization (SpaCy):")
print(normalized_sentence_spacy)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\becode\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\becode\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original Sentence:
Those children are playing. this game, those games, I play he plays

After Lowercasing:
those children are playing. this game, those games, i play he plays

After Removing Punctuation:
those children are playing this game those games i play he plays

After Handling Contractions:
those children are playing this game those games i play he plays

After Tokenization and Lemmatization (NLTK):
those child are playing this game those game i play he play

After Tokenization and Lemmatization (SpaCy):
those child be play this game those game I play he play
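
The two results above differ on `are` and `playing`. NLTK's `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so verb forms are left alone unless you pass `pos="v"`, whereas spaCy relies on the part-of-speech tags assigned by its pipeline. The sketch below mimics that behaviour with a hypothetical hand-written table keyed by `(word, pos)`; it is a toy stand-in, not either library's implementation.

```python
# Hypothetical lemma table keyed by (word, part of speech); illustration only.
LEMMAS = {
    ("children", "n"): "child",
    ("games", "n"): "game",
    ("playing", "v"): "play",
    ("are", "v"): "be",
}


def lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself when unknown."""
    return LEMMAS.get((word, pos), word)


print(lemmatize("playing"))           # playing (treated as a noun)
print(lemmatize("playing", pos="v"))  # play
print(lemmatize("are", pos="v"))      # be
print(lemmatize("children"))          # child
```

The same word can therefore come out differently depending on the part of speech the lemmatizer assumes.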


### Using NLTK

In [12]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download NLTK resources
# import nltk
# nltk.download('punkt')
# nltk.download('wordnet')

# Sentence to lemmatize
# my_sentence = "Those children are playing. this game, those games, I play he plays"

# Tokenize the sentence (the lemmatizer expects one word at a time)
words = word_tokenize(normalized_sentence_spacy)

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word and join the results with spaces
lemmatized_sentence_nltk = ' '.join([lemmatizer.lemmatize(word) for word in words])

print(normalized_sentence_spacy)
print("Lemmatized Sentence (NLTK):")
print(lemmatized_sentence_nltk)


those child be play this game those game I play he play
Lemmatized Sentence (NLTK):
those child be play this game those game I play he play


### Using spaCy


In [13]:
import spacy

# Load the English NLP model
nlp = spacy.load("en_core_web_sm")

# Sentence to lemmatize
# my_sentence = "Those children are playing. this game, those games, I play he plays"

# Process the sentence with SpaCy
# nlp model tokenizes the input sentence into individual words and assigns various linguistic annotations to each token, 
# such as part-of-speech tags, dependency relationships, and lemmas.
doc = nlp(normalized_sentence_spacy)

# Lemmatize and join the results
lemmatized_sentence_spacy = ' '.join([token.lemma_ for token in doc])

print("Lemmatized Sentence (SpaCy):")
print(lemmatized_sentence_spacy)


Lemmatized Sentence (SpaCy):
those child be play this game those game I play he play


What are the differences between the two tools?

## Conclusion
There are multiple libraries that allow you to do lemmatization. Each has its own particularities.
There are also other techniques to "simplify" words, like [Stemming](https://medium.com/swlh/introduction-to-stemming-vs-lemmatization-nlp-8c69eb43ecfe). Feel free to explore those that seem relevant to your use case.

![stemming vs lemmatization](https://miro.medium.com/max/2050/1*ES5bt7IoInIq2YioQp2zcQ.png)
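
As a rough sketch of the difference, the snippet below compares a deliberately crude suffix-stripping stemmer with a dictionary lookup. Both `crude_stem` and the `LEMMAS` table are hypothetical and written for this illustration; real stemmers, such as the Porter stemmer, use more elaborate rules.

```python
# Hypothetical lemma table (illustration only).
LEMMAS = {"studies": "study", "studying": "study", "better": "good"}


def crude_stem(word):
    """Strip a few common suffixes without checking that the result is a real word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Return the dictionary form of a word, or the word itself when unknown."""
    return LEMMAS.get(word, word)


for word in ("studies", "studying", "better"):
    print(word, "->", crude_stem(word), "|", lemmatize(word))
```

The stemmer is cheap but can produce non-words such as `studi`, while the lookup returns real dictionary forms and can handle irregular cases such as `better`.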
