### Semantic analysis
- Morphological analysis: tokenization, n-grams, POS tagging, ...
- Syntactic analysis: syntax => skipped here

#### Terminology
- Stem (어간): the part that does not change under inflection
- Ending (어미): the part that changes  
  e.g. 달리다, 달리는, 달리고, 달렸더니  
  달리 / -ㅆ / 더 / 니  


- Root (어근): the part that carries the core meaning
- Affix (접사): attaches to a root and modifies its meaning (prefix, suffix, infix, etc.)  
  e.g. 과일 / 햇과일, 밟다 / 짓밟다

#### Stemming (어간 추출)
e.g. 달리  
- Rule-based.  
  e.g. s -> drop s, es -> drop es (plural forms)
- Does not take part of speech (POS) into account.  
  e.g. running[n, v] -> run[v]  
  e.g. running[n] -> run[v]: the POS information may be lost, which can be critical
- Likely to return strings that are not real words.  
  e.g. stable, stability -> stab(?), stabl(?)
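
To make the three points above concrete, here is a minimal sketch of a rule-based stemmer. The rule table and the names `SUFFIX_RULES` and `toy_stem` are made up for illustration; this is not the Porter or Lancaster rule set, only the same general idea of ordered suffix rewriting with no POS lookup.

```python
# Toy suffix rules, tried in order. Hypothetical: NOT the Porter/Lancaster rules.
SUFFIX_RULES = [
    ("ies", "i"),   # ponies  -> poni
    ("es", ""),     # boxes   -> box
    ("s", ""),      # cats    -> cat
    ("ing", ""),    # cooking -> cook
    ("ed", ""),     # cooked  -> cook
    ("le", "l"),    # stable  -> stabl (not a real word)
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching suffix rule; part of speech is never consulted."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # Skip rules that would leave a very short stem.
            if len(stem) >= min_stem_len:
                return stem
    return word


for w in ["cats", "boxes", "cooking", "cooked", "stable", "running"]:
    print(f"{w:>8} -> {toy_stem(w)}")
```

Note that `running` comes back as `runn` and `stable` as `stabl`: the rules only rewrite characters, so nothing guarantees that the result is a dictionary word or that it keeps the original part of speech.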

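#### Lemmatization (표제어 추출)
e.g. 달리다  
- Dictionary-based: WordNet (NLTK)
- Because the lemma is looked up in a dictionary, accuracy is high.
- (*Important) Takes POS into account.  
  e.g. running[v] -> run[v], running[n] -> running[n]: the POS information is preserved.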
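
The dictionary-plus-POS idea can be sketched with a plain lookup table. The table `LEMMA_DICT` and the function `toy_lemmatize` are hypothetical stand-ins for WordNet; in NLTK the corresponding tool is `WordNetLemmatizer`, whose `lemmatize(word, pos)` method takes the POS tag as its second argument.

```python
# Tiny hand-written lemma dictionary keyed by (word, POS).
# A hypothetical stand-in for WordNet, just large enough for the demo.
LEMMA_DICT = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("ran", "v"): "run",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary lemma for (word, pos); unknown pairs come back unchanged."""
    return LEMMA_DICT.get((word.lower(), pos), word)


print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("walked", "v"))   # walked (not in the toy dictionary)
```

The same surface form maps to different lemmas depending on the POS tag, and every result is a real word because it comes out of the dictionary. The cost is coverage: a word missing from the dictionary is returned as-is.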

In [1]:
import nltk
# Porter, Lancaster, Regexp

In [2]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

In [3]:
word = "cooking cookery cooked cooks cookly".split()
word

['cooking', 'cookery', 'cooked', 'cooks', 'cookly']

In [4]:
[porter.stem(w) for w in word]

['cook', 'cookeri', 'cook', 'cook', 'cookli']

In [6]:
# Lancaster
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

In [7]:
[lancaster.stem(w) for w in word]

['cook', 'cookery', 'cook', 'cook', 'cook']

In [8]:
porter.stem("stable")

'stabl'

In [9]:
lancaster.stem("stable")

'stabl'

In [27]:
# RegexpStemmer
# A stemmer whose rules you can customize yourself
from nltk.stem import RegexpStemmer

regexp = RegexpStemmer('ed|ly|s')   # rule customizing

In [12]:
[regexp.stem(w) for w in word]

['cooking', 'cookery', 'cook', 'cook', 'cook']

In [20]:
# Q. Print each word in word_list next to its Porter and Lancaster stems

word_list = "friend friends friendship friendships stable stabilize destabilized understanding misunderstanding football".split()

for w in word_list:   # loop variable `w`, so the `word` list defined earlier is not overwritten
  print("Result: {0:^20}{1:^20}{2:^20}".format(w, porter.stem(w), lancaster.stem(w)))

# So in practice, several stemmers are usually combined

Result:        friend              friend              friend       
Result:       friends              friend              friend       
Result:      friendship          friendship            friend       
Result:     friendships          friendship            friend       
Result:        stable              stabl               stabl        
Result:      stabilize             stabil              stabl        
Result:     destabilized          destabil              dest        
Result:    understanding         understand          understand     
Result:   misunderstanding     misunderstand       misunderstand    
Result:       football            footbal             footbal       


In [32]:
"""
Q. Write a function.
1. porter -> lancaster -> regexp (rules of your choice)
Pass the word on to the next stemmer only if it was not changed.

2. Apply every stemmer in sequence
"""

def stemming1(word:str):
  stemmed = porter.stem(word)
  if stemmed == word:
    stemmed = lancaster.stem(word)

    if stemmed == word:
      stemmed = regexp.stem(word)

  return stemmed

def stemming2(word:str):
  word = porter.stem(word)
  word = lancaster.stem(word)
  word = regexp.stem(word)
  return word

# My answer
def stemming3(word:str):
  porter_word, lancaster_word, regexp_word = porter.stem(word), lancaster.stem(word), regexp.stem(word)
  if word != porter_word:
    word = porter_word
  elif porter_word != lancaster_word:
    word = lancaster_word
  elif lancaster_word != regexp_word:
    word = regexp_word
  return word
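
The two strategies from the question can also be written once for any list of stemmers. This is only a sketch: `first_change`, `chain_all`, and the two stand-in stemmers `strip_s` and `strip_er` are hypothetical names chosen so the example runs without NLTK.

```python
from typing import Callable, Sequence

Stemmer = Callable[[str], str]


def first_change(word: str, stemmers: Sequence[Stemmer]) -> str:
    """Strategy 1: return the result of the first stemmer that changes the word."""
    for stem in stemmers:
        stemmed = stem(word)
        if stemmed != word:
            return stemmed
    return word


def chain_all(word: str, stemmers: Sequence[Stemmer]) -> str:
    """Strategy 2: feed the output of each stemmer into the next one."""
    for stem in stemmers:
        word = stem(word)
    return word


# Hypothetical stand-in stemmers (one rule each).
def strip_s(w: str) -> str:
    return w[:-1] if w.endswith("s") else w


def strip_er(w: str) -> str:
    return w[:-2] if w.endswith("er") else w


stemmers = [strip_s, strip_er]
print(first_change("doers", stemmers))  # doer
print(chain_all("doers", stemmers))     # do
print(first_change("doer", stemmers))   # do
```

With the stemmers from this notebook, the list would be `[porter.stem, lancaster.stem, regexp.stem]`. Chaining is the more aggressive option, since every stemmer gets a chance to shorten the word further.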

In [31]:
words = "does doing done doer doers did doinglyly ingdo".split()
[stemming1(w) for w in words]

['doe', 'do', 'don', 'doer', 'doer', 'did', 'doinglyli', 'ingdo']

In [35]:
[stemming2(w) for w in words]

['doe', 'do', 'don', 'doer', 'doer', 'did', 'doingl', 'ingdo']

In [34]:
words = "does doing done doer doers did doinglyly ingdo".split()
regexp.stem("lydo")

'do'
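
The result `'do'` for `"lydo"` shows that the pattern `'ed|ly|s'` is not limited to the end of the word: a match at the start is removed too. The sketch below reproduces that behavior with the standard `re` module and compares it with a pattern anchored by `$`. It assumes the stemmer deletes every match of the pattern, which is consistent with the output above. NLTK's own documentation example for `RegexpStemmer` uses anchored alternatives such as `ing$|s$|e$|able$`.

```python
import re

unanchored = re.compile(r"ed|ly|s")      # same pattern as in the cell above
anchored = re.compile(r"(?:ed|ly|s)$")   # matches only a suffix at the very end

for w in ["lydo", "doinglyly", "cooked", "stable"]:
    print(f"{w:>10} -> {unanchored.sub('', w):<8} | {anchored.sub('', w)}")
```

With the unanchored pattern, `stable` loses its leading `s` and becomes `table`, and `doinglyly` loses both `ly` occurrences. The anchored pattern leaves `lydo` and `stable` alone and strips only one trailing `ly`.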

In [51]:
"""
Quiz. tokenize + stemming
- input: sentence
- output: sentence
"""
from nltk.tokenize import word_tokenize
# nltk.download('punkt')

def token_stem(sent:str):
  # tokenize
  tokens = word_tokenize(sent)
  # stemming
  stemmed_tokens = [porter.stem(w) for w in tokens]
  
  # join
  sent = " ".join(stemmed_tokens)
  return sent

def token_stem2(sent:str, stemmer=PorterStemmer()):
  sent = " ".join([stemmer.stem(w) for w in word_tokenize(sent)])
  return sent

In [48]:
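# A whole sentence is treated as one "word": it is lowercased and only its final suffix is stripped
porter.stem("I am a student")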

'i am a stud'

In [50]:
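token_stem("I am a student")   # ?????? Expected a lowercase "i"; why is it unchanged?
# Possible cause (an assumption about this NLTK version): tokens of length <= 2 are returned as-is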

'I am a student'

In [52]:
token_stem2("I am a studently", LancasterStemmer())

'i am a stud'
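
The cells above show that stemming a whole sentence and stemming it token by token give different results, and that lowercasing depends on the stemmer. One way to make the pipeline explicit is to lowercase before stemming. The sketch below uses hypothetical helpers (`simple_tokenize`, `stem_sentence`, `strip_plural`) and a regex tokenizer as a rough stand-in for `word_tokenize`, so that it runs without NLTK data.

```python
import re
from typing import Callable, List


def simple_tokenize(sent: str) -> List[str]:
    """Split into word and punctuation tokens (rough stand-in for word_tokenize)."""
    return re.findall(r"\w+|[^\w\s]", sent)


def stem_sentence(sent: str, stem: Callable[[str], str], lowercase: bool = True) -> str:
    """Tokenize, optionally lowercase, stem each token, and join with spaces."""
    tokens = simple_tokenize(sent)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return " ".join(stem(t) for t in tokens)


# Hypothetical one-rule stemmer, used only to make the sketch runnable.
def strip_plural(token: str) -> str:
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


print(stem_sentence("I like Cats and dogs.", strip_plural))
# i like cat and dog .
print(stem_sentence("I like Cats and dogs.", strip_plural, lowercase=False))
# I like Cat and dog .
```

Any callable that maps a token to a stem can be passed as `stem`, for example `porter.stem` or `LancasterStemmer().stem` from the cells above.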

### Snowball Stemmer
Multilingual support (13 languages)

In [53]:
# SnowballStemmer ships with NLTK (nltk.stem), so no extra install is needed.
# The PyPI package named `snowball` is an unrelated project.
# !pip install snowball

In [54]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [58]:
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer('spanish')
spanish_words = "corro corres corre corremas correis corren".split()

snow = [snowball.stem(w) for w in spanish_words]
pt = [porter.stem(w) for w in spanish_words]

print(snow)
print(pt)

['corr', 'corr', 'corr', 'correm', 'correis', 'corr']
['corro', 'corr', 'corr', 'correma', 'correi', 'corren']


### Korean
Among the KoNLPy taggers, only Okt provides stemming

In [60]:
!pip install konlpy
from konlpy.tag import Okt, Kkma, Komoran
komoran = Komoran()


In [62]:
okt = Okt()

ko_words = "달리고 달려서 달렸더니 달리니까 달린"
okt.morphs(ko_words, stem=True)   # the input must be a single string; stem=True also applies stemming

['달리', '고', '달다', '달리다', '달리', '니까', '달리다']

In [63]:
okt.pos(ko_words, stem=True)

[('달리', 'Noun'),
 ('고', 'Josa'),
 ('달다', 'Verb'),
 ('달리다', 'Verb'),
 ('달리', 'Noun'),
 ('니까', 'Josa'),
 ('달리다', 'Verb')]

In [None]:
# komoran.morphs(ko_words, stem=True) -> raises an error because Komoran has no stemming option