___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Lemmatization
In contrast to stemming, lemmatization goes beyond simple word reduction: it draws on a language's full vocabulary to apply a *morphological analysis* to words. In English, for example, the lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence. The examples below use spaCy's Korean pipeline, `ko_core_news_lg`, whose lemmas show each word's morphemes joined by `+`.
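To make the contrast with stemming concrete, here is a minimal toy sketch. The suffix list, the lookup table and the function names `naive_stem` and `toy_lemma` are hypothetical and exist only for illustration; a real lemmatizer relies on a full vocabulary and a part-of-speech tagger rather than a four-entry dictionary.

```python
# Toy illustration only: not how spaCy lemmatizes.
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list for the toy stemmer

# Hypothetical lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def naive_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        # Require a few leading characters so short words like 'was' survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word, pos):
    """Look the word up by part of speech; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)

samples = [("was", "AUX"), ("mice", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]
for word, pos in samples:
    print(f"{word:8} {pos:5} stem={naive_stem(word):8} lemma={toy_lemma(word, pos)}")
```

The stemmer treats both uses of 'meeting' identically and cannot relate 'mice' to 'mouse', while the lookup keyed on part of speech can tell them apart.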

In [1]:
# Perform standard imports:
import spacy
nlp = spacy.load('ko_core_news_lg')

In [2]:
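# English gloss: "I am a runner who is running today. It's because I like running."
doc1 = nlp("나는 오늘 달리기를 하는 주자입니다. 달리기를 좋아하기 때문입니다.")

# token.lemma is the integer hash of the lemma; token.lemma_ is its readable string
for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)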

나는 	 PRON 	 2916077967634857145 	 나+는
오늘 	 NOUN 	 17203112817557272742 	 오늘
달리기를 	 NOUN 	 1386013593075866660 	 달리+기+를
하는 	 VERB 	 7016145617310600694 	 하+는
주자입니다 	 VERB 	 6669919514168566518 	 주자입니다
. 	 PUNCT 	 12646065887601541794 	 .
달리기를 	 NOUN 	 1386013593075866660 	 달리+기+를
좋아하기 	 VERB 	 10208331703961701902 	 좋아하+기
때문입니다 	 VERB 	 14436976343481477712 	 때문+이+ㅂ니다
. 	 PUNCT 	 12646065887601541794 	 .


<font color=green>In the output above, both occurrences of `달리기를` ("running", with an object particle) map to the same lemma `달리+기+를` and therefore share one hash value (...66660), so the lemma string only needs to be stored once.</font>
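The pairing of an integer hash with a lemma string can be pictured as a small lookup table. The sketch below is a toy stand-in, not spaCy's implementation: the class `ToyStringStore` is hypothetical, and it derives keys from a truncated SHA-256 digest purely for illustration, so its keys will not match the hash values printed above.

```python
import hashlib

class ToyStringStore:
    """Hypothetical hash-to-string table (illustration only, not spaCy's code)."""

    def __init__(self):
        self._by_key = {}

    def add(self, text):
        # Derive a deterministic 64-bit key from the UTF-8 bytes of the string.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_key[key] = text
        return key

    def lookup(self, key):
        # Recover the readable string from its integer key.
        return self._by_key[key]

store = ToyStringStore()
first = store.add("달리+기+를")
second = store.add("달리+기+를")  # same string, so the same key comes back
other = store.add("오늘")

print(first == second)
print(store.lookup(first))
print(first == other)
```

In spaCy itself, the string behind a hash can be recovered from the vocabulary, for example with `nlp.vocab.strings[token.lemma]`.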

### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [3]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

Here we're using an **f-string** to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.
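As a quick check of how these format specs behave, the sketch below applies the same widths to plain values. The helper name `format_row` is hypothetical; the sample values are copied from the output earlier in this notebook.

```python
def format_row(text, pos, lemma_hash, lemma):
    # Same format spec as show_lemmas: minimum widths of 12, 6 and 22 characters.
    # Strings are left-aligned by default; integers need '<' to be left-aligned.
    return f'{text:{12}} {pos:{6}} {lemma_hash:<{22}} {lemma}'

print(format_row('.', 'PUNCT', 12646065887601541794, '.'))
print(format_row('오늘', 'NOUN', 17203112817557272742, '오늘'))

# Without '<', an integer is right-aligned within its field:
print(repr(f'{42:{6}}'))
print(repr(f'{42:<{6}}'))
```

Field widths count characters, not display columns. Hangul characters are usually drawn twice as wide as ASCII characters in a terminal, which is why the Korean rows may look slightly offset even though each field has the requested number of characters.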

In [4]:
# English gloss: "I saw 18 mice today!"
doc2 = nlp("나는 오늘 18마리의 생쥐들을 봤어!")

show_lemmas(doc2)

나는           PRON   2916077967634857145    나+는
오늘           NOUN   17203112817557272742   오늘
18마리의        NUM    9593124817331040547    18+마리+의
생쥐들을         NOUN   11049081752070338458   생쥐+들+을
봤어           VERB   16316130839343428056   봤어
!            PUNCT  17494803046312582752   !


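<font color=green>Notice that `생쥐들을` ("mice", as an object) is split into the noun `생쥐`, the plural marker `들` and the particle `을`, and `18마리의` is split into `18+마리+의`, yet the past-tense verb `봤어` ("saw") is returned unchanged rather than reduced to its dictionary form `보다` ("to see").</font>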
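If you want to work with the individual morphemes rather than the joined string, a small helper can split the lemma. This is a minimal sketch that assumes `+` is used purely as a separator, as the output above suggests; the function names `split_morphemes` and `first_morpheme` are hypothetical, and treating the first piece as the content word is a simplification.

```python
def split_morphemes(lemma):
    """Split a '+'-joined lemma string into a list of morphemes."""
    # A token that is itself '+' would otherwise split into two empty strings.
    if lemma == '+':
        return [lemma]
    return lemma.split('+')

def first_morpheme(lemma):
    """Return the leading morpheme, which is often the content word."""
    return split_morphemes(lemma)[0]

# Lemma strings copied from the output above
for lemma in ['생쥐+들+을', '18+마리+의', '봤어']:
    print(lemma, '->', split_morphemes(lemma), '| first:', first_morpheme(lemma))
```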

In [5]:
# English gloss: "I'm going to meet him at the meeting tomorrow!"
doc3 = nlp("내일 미팅에서 그를 만날 예정이야!")

show_lemmas(doc3)

내일           NOUN   17486585633085380082   내일
미팅에서         ADV    5343797126385849736    미팅+에서
그를           PRON   16292183577316820414   그+를
만날           VERB   17838407554557962817   만나+ㄹ
예정이야         VERB   3092413078387078107    예정+이+야
!            PUNCT  17494803046312582752   !


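<font color=green>Here `미팅에서` ("at the meeting") is analyzed as the noun `미팅` plus the particle `에서`, and the token as a whole is tagged `ADV`, while the verb `만날` ("will meet") is reduced to the stem `만나` plus the ending `ㄹ`.</font>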

In [8]:
# English gloss: "That's an enormous automobile."
doc4 = nlp("그것은 엄청난 자동차에요.")

show_lemmas(doc4)

그것은          PRON   2639474868024230844    그것+은
엄청난          ADJ    4138740274229642590    엄청나+ㄴ
자동차에요        VERB   4662446312921723792    자동차에요
.            PUNCT  12646065887601541794   .


<font color=green>Note that lemmatization does *not* reduce words to their most basic synonym - that is, `엄청난` ("enormous") becomes `엄청나+ㄴ`, not a word meaning "big", and `자동차에요` ("is an automobile") isn't replaced by a word meaning "car".</font>

We should point out that although lemmatization looks at surrounding text to determine a given word's part of speech, it does not categorize phrases. In an upcoming lecture we'll investigate *word vectors and similarity*.

## Next up: Stop Words