# Simple POS Tagging Using KoNLPy

* Official site : http://konlpy.org/
* How to install (Ubuntu, CentOS, OS X, Windows) : http://konlpy.org/en/v0.4.4/install/

---

## Environment

* Kernel : Python 2.7
* OS : Windows 10
* CSV file encoding : UTF-8

---

KoNLPy has five POS tagging classes (see http://konlpy.org/en/v0.4.4/morph/ for a comparison):
* Kkma
* Komoran
* Hannanum
* Twitter
* Mecab

Example of using a POS tagger:

In [3]:
from konlpy.tag import Hannanum
hannanum = Hannanum()

print hannanum.pos(u'안녕하세요, 저는 아루베입니다.')

[(u'\uc548\ub155', u'N'), (u'\ud558', u'X'), (u'\uc138', u'E'), (u'\uc694', u'J'), (u',', u'S'), (u'\uc800', u'N'), (u'\ub294', u'J'), (u'\uc544\ub8e8\ubca0\uc785\ub2c8', u'N'), (u'\uc774', u'J'), (u'\ub2e4', u'E'), (u'.', u'S')]


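Because the result is a list, Python 2 prints the repr of each unicode string, so the Korean text shows up as \uXXXX escapes and is hard to read. Printing each word and its tag on a separate line shows the actual Korean characters. Note that the next cell uses a slightly different sentence, written without spaces and with the name in Latin letters, so its output differs from the one above.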

In [13]:
result = hannanum.pos(u'안녕하세요,저는ahroobe입니다.')
for word,pos in result:
    print word,pos

안녕하세요,저는ahroobe입니 N
이 J
다 E
. S
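
As an alternative to the loop above, the pairs can be joined into a single readable line. This is only a minimal sketch: `format_tagged` is a hypothetical helper, not part of KoNLPy, and the list is the first four pairs copied from the earlier Hannanum output rather than a fresh call to the tagger.

```python
from __future__ import print_function

# First four (word, tag) pairs copied from the printed Hannanum result above.
tagged = [(u'\uc548\ub155', u'N'), (u'\ud558', u'X'),
          (u'\uc138', u'E'), (u'\uc694', u'J')]


def format_tagged(pairs, sep=u'/'):
    """Join (word, tag) pairs into one line such as 'word/TAG word/TAG'.

    Hypothetical helper for illustration; not part of KoNLPy.
    """
    return u' '.join(word + sep + tag for word, tag in pairs)


line = format_tagged(tagged)
print(line)
```

Because a single unicode string is printed instead of a list, the output is `안녕/N 하/X 세/E 요/J` rather than a row of escape sequences.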


Of the five taggers, Twitter has the simplest tag set. Tagging the same sentence with the Twitter tagger gives a different result.

POS tag comparison chart : https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ_-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0

In [7]:
from konlpy.tag import Twitter
twitter = Twitter()

In [10]:
result = twitter.pos(u'안녕하세요,저는ahroobe입니다.')
for word,pos in result:
    print word,pos

안녕하세 Adjective
요 Eomi
, Punctuation
저 Noun
는 Josa
ahroobe Alpha
입니 Adjective
다 Eomi
. Punctuation
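
Since the Twitter tags are plain English names, it is easy to filter or count tokens by tag. The sketch below assumes the result shown above and hard-codes it as a list, so it runs without KoNLPy installed. The variable names are illustrative only.

```python
from __future__ import print_function
from collections import Counter

# (word, tag) pairs copied from the Twitter output above.
tagged = [
    (u'안녕하세', u'Adjective'), (u'요', u'Eomi'), (u',', u'Punctuation'),
    (u'저', u'Noun'), (u'는', u'Josa'), (u'ahroobe', u'Alpha'),
    (u'입니', u'Adjective'), (u'다', u'Eomi'), (u'.', u'Punctuation'),
]

# Keep only the words tagged as nouns.
nouns = [word for word, tag in tagged if tag == u'Noun']

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(u' '.join(nouns))
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With real data, `tagged` would be the return value of `twitter.pos(...)` instead of a hard-coded list.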


## Creating an example CSV file

In [22]:
import csv
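# The csv library makes reading and writing CSV files more convenient.

# In Python 2 the csv module expects binary mode ('wb'); on Windows, text mode adds an extra '\r' to each row.
with open('./sample.csv','wb') as writefile: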
    writer = csv.writer(writefile)
    writer.writerow(['temp','temp','안녕하세요로로로롤~~이런것도가능할까요?저는띄어쓰기같은거몰라요'])
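
The cell above is written for Python 2.7. As a rough sketch only, assuming Python 3 instead (the notebook itself does not use it), the file would be opened in text mode with an explicit encoding and `newline=''`, as the `csv` module documentation recommends. The temporary directory is an assumption made so the example does not overwrite any existing `sample.csv`.

```python
import csv
import os
import tempfile

text = '안녕하세요로로로롤~~이런것도가능할까요?저는띄어쓰기같은거몰라요'

# Hypothetical location: a fresh temporary directory instead of './sample.csv'.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

# Python 3: text mode, explicit UTF-8, and newline='' so csv controls line endings.
with open(path, 'w', encoding='utf-8', newline='') as writefile:
    writer = csv.writer(writefile)
    writer.writerow(['temp', 'temp', text])

# Read the file back to check that the Korean text survived the round trip.
with open(path, 'r', encoding='utf-8', newline='') as readfile:
    rows = list(csv.reader(readfile))

print(rows[0][2] == text)
```

In Python 3 the values come back as `str`, so no `.decode('utf-8')` step is needed after reading.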

## Loading the CSV file and analyzing it

In [26]:
with open('./sample.csv','rb') as readfile:  # binary mode, as the Python 2 csv module expects
    reader = csv.reader(readfile)
    # next(reader) # If there is a header row, skip it.
    for row in reader:
        # Get a certain column value. Here we take the third column (index 2)
        # and decode it from UTF-8 bytes to unicode.
        data = row[2].decode('utf-8')
        result = twitter.pos(data)
        for word,pos in result:
            print word,pos
        

안녕하세 Adjective
요로로 Noun
로 Josa
롤 Noun
~~ Punctuation
이런 Adjective
것 PreEomi
도 Eomi
가능할 Adjective
까 Eomi
요 Noun
? Punctuation
저 Noun
는 Josa
띄어쓰기 Noun
같은 Adjective
거 Eomi
몰라 Verb
요 Eomi
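
The loading loop above can also be wrapped in a reusable function. The following is a sketch under several assumptions: it targets Python 3 rather than the notebook's Python 2.7, `tag_column` and `whitespace_tagger` are hypothetical names, and `whitespace_tagger` is only a stand-in so the example runs without KoNLPy. The sample sentence is made up for the example. In practice `twitter.pos` would be passed as the `tagger` argument.

```python
import csv
import os
import tempfile


def tag_column(path, column, tagger, has_header=False):
    """Yield (word, tag) pairs for one column of every row in a UTF-8 CSV file."""
    with open(path, 'r', encoding='utf-8', newline='') as readfile:
        reader = csv.reader(readfile)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            for word, tag in tagger(row[column]):
                yield word, tag


def whitespace_tagger(text):
    """Stand-in for twitter.pos: labels each whitespace-separated token 'Token'."""
    return [(token, 'Token') for token in text.split()]


# Write a small sample file into a temporary directory (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(path, 'w', encoding='utf-8', newline='') as writefile:
    csv.writer(writefile).writerow(['temp', 'temp', '저는 띄어쓰기 몰라요'])

pairs = list(tag_column(path, 2, whitespace_tagger))
for word, tag in pairs:
    print(word, tag)
```

Passing the tagger as an argument keeps the CSV handling separate from the choice of tagger, so the same function could be tried with any of the five classes.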
