
#### About
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. <a href="https://en.wikipedia.org/wiki/Lemmatisation"> Link </a>

For example:
- Worked -> Work
- Working -> Work
- Works -> Work

* Why do we need lemmatisation?

- We need lemmatisation to reach the base form of each word in a sentence. It can also reduce computational overhead. Suppose a text contains work, works, worked and working. Because an NLP pipeline typically converts each distinct word into a vector, we would end up with four vectors for these words. After lemmatisation, all four map to the base form work, so we need just one.
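
The vocabulary reduction described above can be sketched in plain Python. This is a minimal illustration, not how spaCy works internally: the lookup table `TOY_LEMMAS` and the helper `lemmatise` are hypothetical names written by hand for this example.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only.
# A real lemmatiser (such as spaCy's) uses rules, lookups and POS tags instead.
TOY_LEMMAS = {"works": "work", "worked": "work", "working": "work"}


def lemmatise(tokens, table=TOY_LEMMAS):
    """Map each token to its base form; unknown tokens pass through lowercased."""
    return [table.get(token.lower(), token.lower()) for token in tokens]


tokens = ["work", "works", "worked", "working"]

# Distinct entries before and after lemmatisation.
raw_vocab = sorted(set(tokens))
lemma_vocab = sorted(set(lemmatise(tokens)))

# How many surface forms collapsed onto each lemma.
lemma_counts = Counter(lemmatise(tokens))

print(len(raw_vocab), raw_vocab)      # four distinct surface forms
print(len(lemma_vocab), lemma_vocab)  # one distinct lemma
print(lemma_counts)
```

With one vector per distinct vocabulary entry, the four surface forms need four vectors, while the lemmatised text needs only one.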

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
text = "The boy was going for a trip where he could say that he hiked, danced, sung, swam, surfed and cooked."

In [None]:
output = nlp(text)
for token in output:
    print("Text - {} and its lemma is {}".format(token.text, token.lemma_))

Text - The and its lemma is the
Text - boy and its lemma is boy
Text - was and its lemma is be
Text - going and its lemma is go
Text - for and its lemma is for
Text - a and its lemma is a
Text - trip and its lemma is trip
Text - where and its lemma is where
Text - he and its lemma is he
Text - could and its lemma is could
Text - say and its lemma is say
Text - that and its lemma is that
Text - he and its lemma is he
Text - hiked and its lemma is hike
Text - , and its lemma is ,
Text - danced and its lemma is danced
Text - , and its lemma is ,
Text - sung and its lemma is sung
Text - , and its lemma is ,
Text - swam and its lemma is swam
Text - , and its lemma is ,
Text - surfed and its lemma is surfed
Text - and and its lemma is and
Text - cooked and its lemma is cook
Text - . and its lemma is .
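
In the output above, several past-tense verbs come back unchanged. A small sketch for spotting such cases follows, using (text, lemma) pairs copied by hand from that output. The helper name `unchanged_forms` is hypothetical. An unchanged lemma is not always an error (for example, `trip` is already a base form), so the check is only meaningful on words known to be inflected.

```python
# (text, lemma) pairs for the inflected verbs, copied from the printed output above.
verb_pairs = [
    ("hiked", "hike"),
    ("danced", "danced"),
    ("sung", "sung"),
    ("swam", "swam"),
    ("surfed", "surfed"),
    ("cooked", "cook"),
]


def unchanged_forms(pairs):
    """Return the surface forms whose lemma equals the text, ignoring case."""
    return [text for text, lemma in pairs if text.lower() == lemma.lower()]


missed = unchanged_forms(verb_pairs)
print(missed)
```

Here four of the six verbs (`danced`, `sung`, `swam`, `surfed`) were not reduced to their base forms.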


For better accuracy, we can use LemmInflect, which according to its project page outperforms the lemmatisers in NLTK, spaCy, Stanford CoreNLP and CLiPS.

<a href="https://github.com/bjascob/LemmInflect"> Link </a>

In [None]:
!pip install lemminflect

Collecting lemminflect
  Downloading lemminflect-0.2.3-py3-none-any.whl (769 kB)
Installing collected packages: lemminflect
Successfully installed lemminflect-0.2.3


LemmInflect also works as a spaCy extension.
<a href="https://spacy.io/universe/project/lemminflect"> Link </a>

In [None]:
# Let's evaluate an example.
# Importing lemminflect registers the `._.lemma()` extension method on spaCy tokens.

import lemminflect
doc = nlp('He went to a trip to later brag that he hiked, swam, danced, sang, ran and cooked.')

for token in doc:
    print("Text - {} and its lemma is {}".format(token.text, token._.lemma()))

Text - He and its lemma is He
Text - went and its lemma is go
Text - to and its lemma is to
Text - a and its lemma is a
Text - trip and its lemma is trip
Text - to and its lemma is to
Text - later and its lemma is later
Text - brag and its lemma is brag
Text - that and its lemma is that
Text - he and its lemma is he
Text - hiked and its lemma is hike
Text - , and its lemma is ,
Text - swam and its lemma is swam
Text - , and its lemma is ,
Text - danced and its lemma is danced
Text - , and its lemma is ,
Text - sang and its lemma is sang
Text - , and its lemma is ,
Text - ran and its lemma is run
Text - and and its lemma is and
Text - cooked and its lemma is cook
Text - . and its lemma is .
