#### Day68

# Lemmatizing

Now that you’re up to speed on parts of speech, you can circle back to lemmatizing. Like stemming, **lemmatizing** reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

**Note:** A **lemma** is a word that represents a whole group of words, and that group of words is called a **lexeme**.

For example, if you were to look up <a href= "https://www.merriam-webster.com/dictionary/blending">the word “blending” in a dictionary</a>, then you’d need to look at the entry for “blend,” but you would find “blending” listed in that entry.

In this example, “blend” is the **lemma**, and “blending” is part of the **lexeme**. So when you lemmatize a word, you are reducing it to its lemma.
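
The relationship between a lemma and its lexeme can be sketched with a tiny hand-written lookup table. This is only an illustration: the `LEXEMES` table and the `toy_lemma` helper are hypothetical names made up for this sketch, and the list of forms is typed in by hand rather than taken from NLTK or WordNet.

```python
# Toy lexeme table: each lemma maps to the group of word forms it represents.
# Hand-written for illustration only; a real lemmatizer consults a lexical database.
LEXEMES = {
    "blend": {"blend", "blends", "blended", "blending"},
}


def toy_lemma(word):
    """Return the lemma whose lexeme contains `word`, or `word` itself if unknown."""
    for lemma, forms in LEXEMES.items():
        if word in forms:
            return lemma
    return word


print(toy_lemma("blending"))  # blend
print(toy_lemma("scarves"))   # scarves (not in the toy table, so returned unchanged)
```

A real lemmatizer does the same kind of lookup, just against a far larger resource and with rules for regular inflections.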

Here’s how to import the relevant parts of NLTK in order to start lemmatizing:

In [10]:
from nltk.stem import WordNetLemmatizer

Create a lemmatizer to use:

In [11]:
lemmatizer = WordNetLemmatizer()

Let’s start with lemmatizing a plural noun:

In [12]:
lemmatizer.lemmatize("scarves")

'scarf'

In [13]:
lemmatizer.lemmatize("scarf")

'scarf'

"scarves" gave you 'scarf', so that’s already a bit more sophisticated than what you would have gotten with the Porter stemmer, which is 'scarv'. Next, create a string with more than one word to lemmatize:

In [14]:
string_for_lemmatizing = "The friends of DeSoto love scarves."


Now tokenize that string by word:

In [15]:
from nltk.tokenize import word_tokenize

words = word_tokenize(string_for_lemmatizing)

Here’s your list of words:

In [16]:
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

Create a list containing every word from the words list after it has been lemmatized:

In [17]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

Here’s the list you got:

In [18]:
lemmatized_words

['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']

In [27]:
# For comparison with stemming

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()
snowball = SnowballStemmer(language='english')
print(words)  # original tokens
print([lancaster.stem(word) for word in words])
print([porter.stem(word) for word in words])
print([snowball.stem(word) for word in words])
print([lemmatizer.lemmatize(word) for word in words])
print([lemmatizer.lemmatize(word.casefold()) for word in words])  # lowercase each token with casefold() first

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']
['the', 'friend', 'of', 'desoto', 'lov', 'scarv', '.']
['the', 'friend', 'of', 'desoto', 'love', 'scarv', '.']
['the', 'friend', 'of', 'desoto', 'love', 'scarv', '.']
['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']
['the', 'friend', 'of', 'desoto', 'love', 'scarf', '.']


That looks right. The plurals 'friends' and 'scarves' became the singulars 'friend' and 'scarf'.

But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing "worst":

In [29]:
lemmatizer.lemmatize("worst")

'worst'

You got the result 'worst' because lemmatizer.lemmatize() assumed that <a href = "https://www.merriam-webster.com/dictionary/worst">"worst" was a noun</a>. You can make it clear that you want "worst" to be an adjective:

In [30]:
lemmatizer.lemmatize("worst", pos="a")


'bad'

The default value of the pos parameter is 'n' for noun, but you made sure that "worst" was treated as an adjective by passing pos="a". As a result, you got 'bad', which looks very different from your original word and is nothing like what you’d get if you were stemming. This is because "worst" is the <a href = "https://www.merriam-webster.com/dictionary/superlative" >superlative</a> form of the adjective 'bad', and lemmatizing reduces superlatives as well as <a href = "https://www.merriam-webster.com/dictionary/comparative" >comparatives</a> to their lemmas.
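
To see how much the pos argument matters for a given word, one option is to call the lemmatizer once per part of speech and compare the results. The sketch below assumes a hypothetical helper, `lemmas_by_pos`, that accepts any function with the same `(word, pos=...)` call shape as `lemmatizer.lemmatize`. So that it runs without the WordNet data, it is demonstrated with `stub_lemmatize`, a stand-in that knows only the "worst" result shown above and returns every other input unchanged.

```python
# WordNet part-of-speech letters: noun, verb, adjective, adverb
WORDNET_POS = ("n", "v", "a", "r")


def lemmas_by_pos(word, lemmatize):
    """Call `lemmatize(word, pos=...)` once per POS letter and collect the results."""
    return {pos: lemmatize(word, pos=pos) for pos in WORDNET_POS}


def stub_lemmatize(word, pos="n"):
    """Stand-in for lemmatizer.lemmatize (an assumption made for this sketch only).

    It knows just one result from this section and echoes anything else back,
    so its 'v' and 'r' answers say nothing about what WordNet would return.
    """
    known = {("worst", "a"): "bad"}
    return known.get((word, pos), word)


print(lemmas_by_pos("worst", stub_lemmatize))
# {'n': 'worst', 'v': 'worst', 'a': 'bad', 'r': 'worst'}
```

With NLTK's WordNet data available, passing `lemmatizer.lemmatize` in place of the stub should give the real per-POS lemmas.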

Now that you know how to use NLTK to tag parts of speech, you can try tagging your words before lemmatizing them to avoid mixing up <a href = "https://en.wikipedia.org/wiki/Homograph" >homographs</a>, or words that are spelled the same but have different meanings and can be different parts of speech.
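
One way to connect tagging to lemmatizing is to translate each tag into the letter that the pos parameter expects. The sketch below assumes the tags follow the Penn Treebank convention (for example 'JJS', 'VBG', 'NNS'). The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the tagged tokens in the demo are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):  # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # everything else falls back to the lemmatizer's default


def lemmatize_tagged(tagged_words, lemmatize):
    """Lemmatize (word, tag) pairs with a function shaped like lemmatizer.lemmatize."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_words]


# Hand-tagged tokens, used here only to show the tag-to-POS mapping
tagged = [("the", "DT"), ("worst", "JJS"), ("scarves", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('the', 'n'), ('worst', 'a'), ('scarves', 'n')]
```

Assuming the tagger and WordNet data are installed, something like `lemmatize_tagged(nltk.pos_tag(words), lemmatizer.lemmatize)` should apply this mapping to a tokenized sentence. Falling back to 'n' mirrors the lemmatizer's own default, so untagged or unusual categories behave as they did before.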

# Example

In [31]:
lemmatizer.lemmatize("best")

'best'

In [32]:
lemmatizer.lemmatize("best",pos='a')

'best'

In [33]:
lemmatizer.lemmatize("better")

'better'

In [35]:
lemmatizer.lemmatize("better",pos='a')

'good'

In [36]:
lemmatizer.lemmatize("super")

'super'

In [37]:
lemmatizer.lemmatize("super",pos='a')

'super'

In [39]:
lemmatizer.lemmatize("least")

'least'

In [38]:
lemmatizer.lemmatize("least",pos='a')

'least'

In [40]:
lemmatizer.lemmatize("little")

'little'

In [41]:
lemmatizer.lemmatize("little",pos='a')

'little'

In [42]:
lemmatizer.lemmatize("less")

'le'

In [43]:
lemmatizer.lemmatize("less",pos='a')

'less'

In [45]:
sentence = 'the cat is sitting with the bats on the striped mat under many flying geese'

In [46]:
tokens = word_tokenize(sentence)

In [48]:
print(tokens)

['the', 'cat', 'is', 'sitting', 'with', 'the', 'bats', 'on', 'the', 'striped', 'mat', 'under', 'many', 'flying', 'geese']


In [54]:
lemmas = [lemmatizer.lemmatize(word) for word in tokens]  # default pos='n'
print(lemmas)
print(' '.join(lemmas))

['the', 'cat', 'is', 'sitting', 'with', 'the', 'bat', 'on', 'the', 'striped', 'mat', 'under', 'many', 'flying', 'goose']
the cat is sitting with the bat on the striped mat under many flying goose
