<a href="https://colab.research.google.com/github/KILjungjoon/nlp_tools/blob/main/Lemmatizer_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. What is 'Lemmatization'?

* Lemma: the canonical (dictionary) form of a word.
* Lemmatization: the process of converting a word to its base form, the lemma.
> caring → care,  struck → strike,  boys → boy
* The same word can have different lemmas depending on how it is used. To get the right one, identify the word's part-of-speech (POS) tag in that specific context and extract the lemma for that tag.
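
To make the last point concrete, here is a minimal sketch of a POS-dependent lookup. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical names with hand-written entries, not part of any library; real lemmatizers use a full lexicon and morphological rules.

```python
# Toy lemma table keyed by (word, POS letter); entries are hand-written for illustration only.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # "a long meeting" (noun)
    ("meeting", "v"): "meet",     # "we are meeting today" (verb)
    ("leaves", "n"): "leaf",      # "the leaves fell" (noun)
    ("leaves", "v"): "leave",     # "she leaves early" (verb)
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the word itself."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("leaves", "n"))  # leaf
print(toy_lemmatize("leaves", "v"))  # leave
```

The same surface form maps to two different lemmas, which is why the POS-aware variants later in this notebook tend to do better than the plain ones.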

In [1]:
sentence = """Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, 
supplemented by social assistance programmes for the needy through a many-helping-hands approach."""

# 2. WordNet Lemmatizer with NLTK

In [14]:
# Some problems found; the result is imperfect.
# NLTK
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

word_list = nltk.word_tokenize(sentence)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])

print(sentence)
print(lemmatized_output)

Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
Our model ha hitherto been to provide a basic social safety net through subsidised education , healthcare and housing , supplemented by social assistance programme for the needy through a many-helping-hands approach .
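
Spotting the differences between the two printed lines by eye is error-prone. A small helper can list only the tokens that changed. This is a sketch under the assumption that both token lists are aligned one-to-one; `changed_tokens` is a hypothetical helper, and the short lists below are hand-copied from the output above rather than produced by NLTK.

```python
def changed_tokens(original_tokens, lemmatized_tokens):
    """Return (original, lemma) pairs for tokens the lemmatizer changed."""
    if len(original_tokens) != len(lemmatized_tokens):
        raise ValueError("token lists must be aligned one-to-one")
    return [(o, l) for o, l in zip(original_tokens, lemmatized_tokens) if o != l]


# Hand-copied excerpt of the tokens above (not generated here).
original = ["Our", "model", "has", "hitherto", "been", "programmes"]
lemmatized = ["Our", "model", "ha", "hitherto", "been", "programme"]
print(changed_tokens(original, lemmatized))
# [('has', 'ha'), ('programmes', 'programme')]
```

In the cell above, `word_list` and `lemmatized_output.split()` could be passed in directly. The result makes the problem visible: 'has' is wrongly reduced to 'ha', while 'been' and 'subsidised' are left untouched because every token is treated as a noun by default.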




# 3. WordNet Lemmatizer with appropriate POS tag

In [13]:
# Accuracy improved a lot! No errors.
# Lemmatize with POS Tag
from nltk.corpus import wordnet
import nltk
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
print(sentence)
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])

foot
Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
['Our', 'model', 'have', 'hitherto', 'be', 'to', 'provide', 'a', 'basic', 'social', 'safety', 'net', 'through', 'subsidise', 'education', ',', 'healthcare', 'and', 'housing', ',', 'supplement', 'by', 'social', 'assistance', 'programme', 'for', 'the', 'needy', 'through', 'a', 'many-helping-hands', 'approach', '.']
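
Note that `get_wordnet_pos` tags each word in isolation, so the tagger never sees the surrounding context. One possible alternative is to tag the whole sentence once and then map each Penn Treebank tag to a WordNet POS letter. The sketch below shows only the mapping step; `penn_to_wordnet` is a hypothetical helper, and the `tagged` pairs are hand-written stand-ins for what `nltk.pos_tag(nltk.word_tokenize(sentence))` might return.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    # 'a', 'n', 'v', 'r' are the values of wordnet.ADJ, NOUN, VERB and ADV.
    return {"J": "a", "N": "n", "V": "v", "R": "r"}.get(tag[:1].upper(), "n")


# Hand-written (token, tag) pairs; assumed, not produced by a tagger here.
tagged = [("has", "VBZ"), ("been", "VBN"), ("basic", "JJ"),
          ("programmes", "NNS"), ("hitherto", "RB"), ("the", "DT")]
pairs = [(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
print(pairs)
```

Each `(token, pos)` pair could then be passed to `lemmatizer.lemmatize(token, pos)`. Whether sentence-level tagging actually changes the result for this particular sentence has not been checked here.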




# 4. spaCy Lemmatization

In [10]:
# Decent. Only 'subsidised' is wrong.
# spaCy
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp(sentence)
print(sentence)
print(" ".join([token.lemma_ for token in doc]))

Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
our model have hitherto be to provide a basic social safety net through subsidised education , healthcare and housing , supplement by social assistance programme for the needy through a many - help - hand approach .


# 5. TextBlob Lemmatizer

In [15]:
!pip install textblob



In [16]:
# Errors on: has, been, subsidised, supplemented

from textblob import TextBlob, Word

# Lemmatize a word
word = 'stripes'
w = Word(word)
print(w.lemmatize())

# Lemmatize a sentence
sent = TextBlob(sentence)
print(sentence)
print(" ".join([w.lemmatize() for w in sent.words]))

stripe
Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
Our model ha hitherto been to provide a basic social safety net through subsidised education healthcare and housing supplemented by social assistance programme for the needy through a many-helping-hands approach


# 6. TextBlob Lemmatizer with appropriate POS tag

In [17]:
# Good! No errors.
# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
print(sentence)
print(lemmatize_with_postag(sentence))

Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
Our model have hitherto be to provide a basic social safety net through subsidise education healthcare and housing supplement by social assistance programme for the needy through a many-helping-hands approach


# 7. Pattern Lemmatizer

In [18]:
!pip install pattern


In [27]:
# Good! No errors.

from pattern.en import lemma, lexeme


print(sentence)
print(" ".join([lemma(wd) for wd in sentence.split()]))

# Lexeme's for each word 
print([lexeme(wd) for wd in sentence.split()])

Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, supplemented by social assistance programmes for the needy through a many-helping-hands approach.
our model have hitherto be to provide a basic social safety net through subsidise education, healthcare and housing, supplement by social assistance programme for the needy through a many-helping-hand approach.
[['our', 'ours', 'ouring', 'oured'], ['model', 'models', 'modelling', 'modelled'], ['have', 'has', 'having', 'had', "haven't", "hasn't", "hadn't"], ['hitherto', 'hithertos', 'hithertoing', 'hithertoed'], ['be', 'am', 'are', 'is', 'being', 'was', 'were', 'been', 'am not', "aren't", "isn't", "wasn't", "weren't"], ['to', 'tos', 'toing', 'toed'], ['provide', 'provides', 'providing', 'provided'], ['a', 'as', 'aing', 'aed'], ['basic', 'basices', 'basicking', 'basicked'], ['social', 'socials', 'socialing', 'socialed'], ['safety', 'safeties', 'safetying', 'safetied'], ['net

In [29]:
# Obtain the lemmas by parsing the sentence: the sentence has to be passed in directly as a string.
from pattern.en import parse
print(parse("""Our model has hitherto been to provide a basic social safety net through subsidised education, healthcare and housing, 
supplemented by social assistance programmes for the needy through a many-helping-hands approach.""", lemmata=True, tags=False, chunks=False))

Our/PRP$/our model/NN/model has/VBZ/have hitherto/RB/hitherto been/VBN/be to/TO/to provide/VB/provide a/DT/a basic/JJ/basic social/JJ/social safety/NN/safety net/NN/net through/IN/through subsidised/JJ/subsidised education/NN/education ,/,/, healthcare/NN/healthcare and/CC/and housing/NN/housing ,/,/, supplemented/VBN/supplement by/IN/by social/JJ/social assistance/NN/assistance programmes/NNS/programme for/IN/for the/DT/the needy/JJ/needy through/IN/through a/DT/a many-helping-hands/NNS/many-helping-hand approach/NN/approach ././.
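
The parsed output above packs each token as `word/TAG/lemma`. To get the lemmas back as a plain list, the string can be split apart. This is a minimal sketch that assumes exactly that three-field layout (as produced with `lemmata=True, tags=False, chunks=False`); `parse_pattern_output` is a hypothetical helper, and `sample` is a hand-copied excerpt of the output above.

```python
def parse_pattern_output(tagged_text):
    """Split 'word/TAG/lemma' tokens into (word, tag, lemma) tuples."""
    triples = []
    for token in tagged_text.split():
        # rsplit keeps any '/' inside the word itself in the first field.
        word, tag, lemma = token.rsplit("/", 2)
        triples.append((word, tag, lemma))
    return triples


# Hand-copied excerpt of the parse() output above.
sample = "Our/PRP$/our model/NN/model has/VBZ/have been/VBN/be ,/,/, programmes/NNS/programme ././."
triples = parse_pattern_output(sample)
print(" ".join(lemma for _, _, lemma in triples))
# our model have be , programme .
```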


# 8.Stanford CoreNLP Lemmatization
* Skipped because of a server connection problem.

In [2]:
!pip install stanfordcorenlp



In [None]:
from stanfordcorenlp import StanfordCoreNLP
import json

# Connect to the CoreNLP server we just started
nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)

# Define properties needed to get lemma
props = {'annotators': 'pos,lemma',
         'pipelineLanguage': 'en',
         'outputFormat': 'json'}

parsed_str = nlp.annotate(sentence, properties=props)
parsed_dict = json.loads(parsed_str)
parsed_dict
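
Since this cell was never run here, the following is only a sketch of how the lemmas might be pulled out of `parsed_dict`. It assumes the usual CoreNLP JSON layout of `sentences`, each holding a list of `tokens` with a `lemma` field. `extract_lemmas` is a hypothetical helper, and `sample_response` is a hand-written stand-in, not real server output.

```python
def extract_lemmas(parsed_dict):
    """Collect the 'lemma' field of every token in every sentence."""
    return [token["lemma"]
            for sent in parsed_dict.get("sentences", [])
            for token in sent.get("tokens", [])
            if "lemma" in token]


# Hand-written stand-in for a CoreNLP JSON response (assumed layout).
sample_response = {
    "sentences": [
        {"index": 0,
         "tokens": [
             {"index": 1, "word": "has", "lemma": "have", "pos": "VBZ"},
             {"index": 2, "word": "been", "lemma": "be", "pos": "VBN"},
         ]}
    ]
}
print(extract_lemmas(sample_response))
# ['have', 'be']
```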

# 9. Gensim Lemmatize
* Skipped: the call fails with a "generator raised StopIteration" error.

In [None]:
from gensim.utils import lemmatize
print(sentence)
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]

# 10. TreeTagger
* Skipped: the setup (downloads, etc.) is cumbersome.

In [10]:
!pip install treetaggerwrapper



In [None]:
import treetaggerwrapper as ttpw
tagger = ttpw.TreeTagger(TAGLANG='en', TAGDIR='/Users/ecom-selva.p/Documents/MLPlus/11_Lemmatization/treetagger')
tags = tagger.tag_text(sentence)
print(sentence)
# Each tag line is tab-separated; the lemma is the last field.
lemmas = [t.split('\t')[-1] for t in tags]
print(lemmas)

---
< Conclusion > The three methods that produced no errors are:
* Section 3: WordNet Lemmatizer with appropriate POS tag
* Section 6: TextBlob Lemmatizer with appropriate POS tag
* Section 7: Pattern Lemmatizer
---
< Reference >
* https://www.machinelearningplus.com/nlp/lemmatization-examples-python/