# Lemmatization


* Lemmatization is a pre-processing technique in NLP that reduces an individual word to its dictionary base form. It is used to identify similarities between words. The output of lemmatization is called a lemma.


* The goal of lemmatization is to group together the different inflected forms of a word that share the same meaning. For example, 'run', 'running' and 'ran' can all be reduced to 'run' because they are forms of the same word. By contrast, 'information' and 'informative' are not reduced to 'inform'; each is a dictionary word in its own right, so it is left unchanged.


* Lemmatization is a more advanced and customizable technique than stemming: stemming removes affixes from a word without considering its POS (part of speech), whereas lemmatization can lemmatize a word according to its POS, e.g. noun, verb, adverb or adjective.


* Lemmatization is also more useful when dealing with languages that have complex inflectional morphology, where the same word can have multiple forms depending on its tense, gender, number, and other grammatical features.


* Lemmatization uses a vocabulary and morphological analysis to remove inflectional endings and return the base (dictionary) form of a word.


* Uses of lemmatization: NLP applications such as search engines, chatbots, text classification, and information retrieval.


* Lemmatization helps a program treat the different forms of a word as the same term, which makes the meaning of a text easier to analyze.

![stemmin_lemm_ex-1.png](attachment:stemmin_lemm_ex-1.png)
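The idea of "vocabulary plus morphological analysis" can be illustrated with a toy sketch. This is not NLTK's implementation: the vocabulary, exception list and suffix rules below are tiny hand-made assumptions, and `toy_lemmatize` and `naive_stem` are hypothetical names used only for illustration.

```python
# Toy data, assumed for illustration only (a real lemmatizer uses a full dictionary).
VOCAB = {"run", "foot", "program", "computer", "play", "eat", "write"}
EXCEPTIONS = {"feet": "foot", "ran": "run", "eaten": "eat"}  # irregular forms
SUFFIX_RULES = [("ies", "y"), ("ing", "e"), ("ing", ""),
                ("ed", "e"), ("ed", ""), ("es", ""), ("s", "")]
STEM_SUFFIXES = ("ing", "ed", "es", "s")


def toy_lemmatize(word):
    """Return a lemma only if the candidate is a known vocabulary word."""
    if word in EXCEPTIONS:          # 1. irregular forms come from a lookup table
        return EXCEPTIONS[word]
    if word in VOCAB:               # 2. already a dictionary form
        return word
    for suffix, replacement in SUFFIX_RULES:   # 3. try each suffix rule
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in VOCAB:  # accept only real words
                return candidate
    return word                     # 4. unknown words are left unchanged


def naive_stem(word):
    """Strip the first matching suffix without checking any vocabulary."""
    for suffix in STEM_SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


for w in ["feet", "writes", "playing", "computers", "informative"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

The contrast to look for is a word such as 'writes': the naive stemmer cuts it to 'writ', which is not a word, while the vocabulary check lets the lemmatizer settle on 'write'.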

In [1]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [3]:
lemmatizer.lemmatize('feet')

'foot'

In [4]:
text = ['unacceptable', 'eating','EATS', 'eaten', 'playing','played','running', 'ran', 'run', 'lovely', 'writing', 
        'writes', 'programming', 'programs', 'feet', 'sportingly', 'feet','information','informative', 'computers',
       'better','corpora', 'played']


In [5]:
Word_Net_Lemmatizer = [lemmatizer.lemmatize(x) for x in text]

print(Word_Net_Lemmatizer)

len(Word_Net_Lemmatizer)

['unacceptable', 'eating', 'EATS', 'eaten', 'playing', 'played', 'running', 'ran', 'run', 'lovely', 'writing', 'writes', 'programming', 'program', 'foot', 'sportingly', 'foot', 'information', 'informative', 'computer', 'better', 'corpus', 'played']


23

In [6]:
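# Upper-casing the input shows that the lookup is case-sensitive:
# none of the words below is reduced to its lemma.
for x in text:
    print(x+"----->"+lemmatizer.lemmatize(x.upper()))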

unacceptable----->UNACCEPTABLE
eating----->EATING
EATS----->EATS
eaten----->EATEN
playing----->PLAYING
played----->PLAYED
running----->RUNNING
ran----->RAN
run----->RUN
lovely----->LOVELY
writing----->WRITING
writes----->WRITES
programming----->PROGRAMMING
programs----->PROGRAMS
feet----->FEET
sportingly----->SPORTINGLY
feet----->FEET
information----->INFORMATION
informative----->INFORMATIVE
computers----->COMPUTERS
better----->BETTER
corpora----->CORPORA
played----->PLAYED
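Because upper-case input comes back unchanged, one common workaround is to lower-case each word before lemmatizing it. The sketch below assumes a hypothetical helper, `lemmatize_case_insensitive`, that accepts any lemmatizing function; `demo_lemmatize` is a small stand-in lookup so that the example runs without downloading WordNet.

```python
def lemmatize_case_insensitive(words, lemmatize_fn):
    """Lower-case every word before handing it to the lemmatizing function."""
    return [lemmatize_fn(word.lower()) for word in words]


# Stand-in for a real lemmatizer (assumption, for demonstration only).
DEMO_LOOKUP = {"feet": "foot", "corpora": "corpus"}


def demo_lemmatize(word):
    return DEMO_LOOKUP.get(word, word)


result = lemmatize_case_insensitive(["FEET", "Corpora", "EATS"], demo_lemmatize)
print(result)
```

With NLTK, `lemmatizer.lemmatize` could be passed in place of `demo_lemmatize`. Lower-casing also discards the capitalization of proper nouns, so whether it is appropriate depends on the task.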


## The 'pos' parameter

'pos' stands for part of speech (POS). This parameter specifies which part of speech a word should be lemmatized as. By default it is set to 'n' (noun). The other options are 'a' for adjective, 'r' for adverb and 'v' for verb.

In [7]:
# noun

Word_Net_Lemmatizer_noun = [lemmatizer.lemmatize(x, pos='n') for x in text]

print(Word_Net_Lemmatizer_noun)

len(Word_Net_Lemmatizer_noun)

['unacceptable', 'eating', 'EATS', 'eaten', 'playing', 'played', 'running', 'ran', 'run', 'lovely', 'writing', 'writes', 'programming', 'program', 'foot', 'sportingly', 'foot', 'information', 'informative', 'computer', 'better', 'corpus', 'played']


23

In [8]:
# verb

Word_Net_Lemmatizer_verb = [lemmatizer.lemmatize(x, pos='v') for x in text]

print(Word_Net_Lemmatizer_verb)

len(Word_Net_Lemmatizer_verb)


23

In [9]:
# adverb

Word_Net_Lemmatizer_adverb = [lemmatizer.lemmatize(x, pos='r') for x in text]

print(Word_Net_Lemmatizer_adverb)

len(Word_Net_Lemmatizer_adverb)


23

In [10]:
# adjective

Word_Net_Lemmatizer_adjective = [lemmatizer.lemmatize(x, pos='a') for x in text]

print(Word_Net_Lemmatizer_adjective)

len(Word_Net_Lemmatizer_adjective)


23

# Compare the results of the lemmatizer across the different POS tags

In [11]:
import pandas as pd 

df = pd.DataFrame({'Text': text, 'Word_Net_Lemmatizer_noun': Word_Net_Lemmatizer_noun,
                   'Word_Net_Lemmatizer_verb': Word_Net_Lemmatizer_verb, 'Word_Net_Lemmatizer_adverb':Word_Net_Lemmatizer_adverb,
                   'Word_Net_Lemmatizer_adjective':Word_Net_Lemmatizer_adjective})

df

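|   | Text | Word_Net_Lemmatizer_noun | Word_Net_Lemmatizer_verb | Word_Net_Lemmatizer_adverb | Word_Net_Lemmatizer_adjective |
|---|------|--------------------------|--------------------------|----------------------------|-------------------------------|
| 0 | unacceptable | unacceptable | unacceptable | unacceptable | unacceptable |
| 1 | eating | eating | eat | eating | eating |
| 2 | EATS | EATS | EATS | EATS | EATS |
| 3 | eaten | eaten | eat | eaten | eaten |
| 4 | playing | playing | play | playing | playing |
| 5 | played | played | play | played | played |
| 6 | running | running | run | running | running |
| 7 | ran | ran | run | ran | ran |
| 8 | run | run | run | run | run |
| 9 | lovely | lovely | lovely | lovely | lovely |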
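To see at a glance which POS setting actually changes a word, the rows of the table can be compared against the original text. The sketch below copies the ten rows shown above into a plain dictionary; `pos_that_change` is a hypothetical helper name, and pandas is not needed for it.

```python
POS_NAMES = ("noun", "verb", "adverb", "adjective")

# word -> lemmas for (noun, verb, adverb, adjective), copied from the table above
TABLE = {
    "unacceptable": ("unacceptable", "unacceptable", "unacceptable", "unacceptable"),
    "eating": ("eating", "eat", "eating", "eating"),
    "EATS": ("EATS", "EATS", "EATS", "EATS"),
    "eaten": ("eaten", "eat", "eaten", "eaten"),
    "playing": ("playing", "play", "playing", "playing"),
    "played": ("played", "play", "played", "played"),
    "running": ("running", "run", "running", "running"),
    "ran": ("ran", "run", "ran", "ran"),
    "run": ("run", "run", "run", "run"),
    "lovely": ("lovely", "lovely", "lovely", "lovely"),
}


def pos_that_change(table):
    """Map each word to the POS settings whose lemma differs from the word itself."""
    changed = {}
    for word, lemmas in table.items():
        hits = [name for name, lemma in zip(POS_NAMES, lemmas) if lemma != word]
        if hits:
            changed[word] = hits
    return changed


changed = pos_that_change(TABLE)
for word, pos_list in changed.items():
    print(word, "changes with pos =", pos_list)
```

For these ten rows only the verb setting changes anything, which matches the fact that the affected words are all verb forms.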


# POS(Part Of Speech)


Verbs, adverbs, nouns, and adjectives are the four parts of speech covered by WordNet, and therefore the four that the lemmatizer can be asked to use.


* Verbs: A verb is a word that describes an action, occurrence, or state of being. Examples of verbs include "run", "jump", "eat", "sleep", "think", "feel", etc.


* Adverbs: An adverb is a word that modifies or describes a verb, an adjective, or another adverb. Many adverbs end in -ly, and examples include "quickly", "slowly", "happily", "sadly", "loudly", etc.


* Nouns: A noun is a word that represents a person, place, thing, or idea. Examples of nouns include "dog", "cat", "car", "house", "book", "love", "happiness", etc.


* Adjectives: An adjective is a word that describes or modifies a noun or pronoun. Adjectives provide more information about the noun or pronoun, such as its size, shape, color, or quality. Examples of adjectives include "big", "small", "red", "blue", "happy", "sad", "smart", etc.

# Part-of-speech (POS) Tagging 

Part-of-speech (POS) tagging is the process of identifying the grammatical part of speech of each word in a sentence, such as noun, verb, adjective or adverb. In NLTK (Natural Language Toolkit), you can perform POS tagging with the pos_tag() function, which takes a list of tokens and returns a list of (word, tag) pairs.

In [12]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [13]:
nltk.pos_tag(text)

[('unacceptable', 'JJ'),
 ('eating', 'VBG'),
 ('EATS', 'NNP'),
 ('eaten', 'VBP'),
 ('playing', 'VBG'),
 ('played', 'VBD'),
 ('running', 'VBG'),
 ('ran', 'NN'),
 ('run', 'VB'),
 ('lovely', 'RB'),
 ('writing', 'VBG'),
 ('writes', 'NNS'),
 ('programming', 'VBG'),
 ('programs', 'NNS'),
 ('feet', 'NNS'),
 ('sportingly', 'RB'),
 ('feet', 'NNS'),
 ('information', 'NN'),
 ('informative', 'JJ'),
 ('computers', 'NNS'),
 ('better', 'RBR'),
 ('corpora', 'NNS'),
 ('played', 'VBD')]
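The tags returned by pos_tag() are Penn Treebank tags, while the lemmatizer's 'pos' parameter expects 'n', 'v', 'a' or 'r', so the two need a small translation step before they can be combined. The sketch below is one way to do that; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and `demo_lemmatize` is a stand-in (with lemmas taken from the comparison table earlier) so that the example runs without WordNet.

```python
def penn_to_wordnet(tag, default="n"):
    """Translate a Penn Treebank tag into a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    if tag.startswith("NN"):
        return "n"   # noun
    return default   # anything else falls back to the default


def lemmatize_tagged(tagged_words, lemmatize_fn):
    """Lemmatize (word, tag) pairs, passing each word's POS to lemmatize_fn."""
    return [lemmatize_fn(word, penn_to_wordnet(tag)) for word, tag in tagged_words]


# Stand-in lemmatizer (assumption, for demonstration only).
DEMO_LEMMAS = {("eating", "v"): "eat", ("played", "v"): "play", ("running", "v"): "run"}


def demo_lemmatize(word, pos="n"):
    return DEMO_LEMMAS.get((word, pos), word)


# A few (word, tag) pairs taken from the pos_tag output above
tagged = [("eating", "VBG"), ("played", "VBD"), ("ran", "NN"), ("informative", "JJ")]
lemmas = lemmatize_tagged(tagged, demo_lemmatize)
print(lemmas)
```

With NLTK, `lemmatizer.lemmatize` could be passed as `lemmatize_fn`. Note that 'ran' was tagged NN above, so it would be lemmatized as a noun and left unchanged; tagging isolated words rather than full sentences makes such mistakes more likely.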

# Commonly used POS tags in NLTK (Penn Treebank tag set) and their meanings:

* CC: coordinating conjunction


* CD: cardinal number


* DT: determiner


* EX: existential there


* FW: foreign word


* IN: preposition/subordinating conjunction


* JJ: adjective


* JJR: adjective, comparative


* JJS: adjective, superlative


* LS: list marker


* MD: modal (could, will, would, etc.)


* NN: noun, singular or mass


* NNS: noun, plural


* NNP: proper noun, singular


* NNPS: proper noun, plural


* PDT: predeterminer


* POS: possessive ending


* PRP: personal pronoun


* PRP$: possessive pronoun


* RB: adverb


* RBR: adverb, comparative


* RBS: adverb, superlative


* RP: particle


* SYM: symbol


* TO: to


* UH: interjection


* VB: verb, base form


* VBD: verb, past tense


* VBG: verb, gerund/present participle


* VBN: verb, past participle


* VBP: verb, non-3rd person singular present


* VBZ: verb, 3rd person singular present


* WDT: wh-determiner


* WP: wh-pronoun


* WP$: possessive wh-pronoun
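
The list above can also be kept as a small lookup table so that tagger output is easier to read. The sketch below includes only a subset of the tags, and `explain_tags` is a hypothetical helper name.

```python
# A subset of the tag meanings listed above
TAG_MEANINGS = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund/present participle",
    "VBP": "verb, non-3rd person singular present",
}


def explain_tags(tagged_words, meanings=TAG_MEANINGS):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, meanings.get(tag, "unknown tag")) for word, tag in tagged_words]


# A few (word, tag) pairs taken from the pos_tag output earlier
explained = explain_tags([("unacceptable", "JJ"), ("eating", "VBG"), ("feet", "NNS")])
for word, tag, meaning in explained:
    print(f"{word:15} {tag:5} {meaning}")
```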