# Introduction to the NLTK natural language processing package

- Sample corpora and dictionaries
- Tokenizing
- Morphological analysis (stemming/lemmatizing)
- Part-of-speech tagging
- Syntax parsing

## Sample corpora
A collection of sample documents.

Download commands: <br>
nltk.download('averaged_perceptron_tagger')<br>
nltk.download("gutenberg")<br>
nltk.download('punkt')<br>
nltk.download('reuters')<br>
nltk.download("stopwords")<br>
nltk.download("taggers")<br>
nltk.download("webtext")<br>
nltk.download("wordnet")

In [2]:
import nltk
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\student\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [3]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")
print(emma_raw[:1302])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

## Tokenizing

In [9]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [10]:
from nltk.tokenize import word_tokenize
word_tokenize(emma_raw[50:100])

['Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']

In [12]:
from nltk.tokenize import RegexpTokenizer
t = RegexpTokenizer(r"[\w]+")  # raw string so that \w is not treated as a string escape
t.tokenize(emma_raw[50:100])

['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a']
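
As a rough sketch of what the `[\w]+` pattern does, the same split can be reproduced with the standard `re` module alone. The sample string below is typed in by hand to mirror the slice `emma_raw[50:100]`; it is an assumption, not read from the corpus.

```python
import re

# Hand-typed sample mirroring the slice tokenized above (an assumption,
# not read from the Gutenberg corpus).
sample = "Emma Woodhouse, handsome, clever, and rich, with a"

# \w+ matches runs of letters, digits, and underscores, so punctuation
# and whitespace act as separators and are dropped.
tokens = re.findall(r"\w+", sample)
print(tokens)
```

Unlike `word_tokenize`, this approach discards the commas instead of returning them as separate tokens.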

## Morphological analysis

Identifies the structure of linguistic attributes such as roots, prefixes/suffixes, and parts of speech (POS).

- stemming (root extraction)
- lemmatizing (restoring the base form)
- POS tagging

### Stemming and lemmatizing

In [14]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
st.stem("eating")

'eat'

In [23]:
from nltk.stem import LancasterStemmer
st = LancasterStemmer()
st.stem("designed")

'design'

In [25]:
from nltk.stem import RegexpStemmer
st = RegexpStemmer("ing")
st.stem("cooking")

'cook'
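
A pattern such as `"ing"` is not anchored to the end of the word, and `RegexpStemmer` appears to delete the pattern wherever it matches. The sketch below imitates that behavior with `re.sub` from the standard library; `strip_pattern` is a hypothetical helper written for illustration, not part of NLTK.

```python
import re

def strip_pattern(word, pattern):
    """Hypothetical helper: delete every match of `pattern` from `word`."""
    return re.sub(pattern, "", word)

# Unanchored: every "ing" is removed, including the one inside the stem.
unanchored = strip_pattern("singing", "ing")
# Anchored with $: only a trailing "ing" is removed.
anchored = strip_pattern("singing", "ing$")

print(unanchored)
print(anchored)
print(strip_pattern("cooking", "ing$"))
```

Anchoring the pattern (`"ing$"`) is usually the safer choice when the goal is to strip a suffix.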

In [28]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
print(lm.lemmatize("cooking"))
print(lm.lemmatize("cooking", pos="v"))
print(lm.lemmatize("cookbooks"))

cooking
cook
cookbook


In [29]:
print(WordNetLemmatizer().lemmatize("believes"))
print(LancasterStemmer().stem("believes"))

belief
believ


## POS tagging

POS(part-of-speech)

Part-of-Speech Tagset <br>
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.htm<br>
http://www.ibm.com/support/knowledgecenter/ko/SS5RWK_3.5.0/com.ibm.discovery.es.ta.doc/iiysspostagset.htm

In [31]:
from nltk.tag import pos_tag

tagged_list = pos_tag(word_tokenize(emma_raw[:100]))
tagged_list

[('[', 'NNS'),
 ('Emma', 'NNP'),
 ('by', 'IN'),
 ('Jane', 'NNP'),
 ('Austen', 'NNP'),
 ('1816', 'CD'),
 (']', 'NNP'),
 ('VOLUME', 'NNP'),
 ('I', 'PRP'),
 ('CHAPTER', 'VBP'),
 ('I', 'PRP'),
 ('Emma', 'NNP'),
 ('Woodhouse', 'NNP'),
 (',', ','),
 ('handsome', 'NN'),
 (',', ','),
 ('clever', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('rich', 'JJ'),
 (',', ','),
 ('with', 'IN'),
 ('a', 'DT')]

In [32]:
from nltk.tag import untag
untag(tagged_list)

['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']
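
`untag` drops the tag from each `(word, tag)` pair. The same idea, plus filtering by tag, can be sketched in plain Python. The `tagged` list below copies a few pairs from the `pos_tag` output above; `NNP` is the Penn Treebank tag for a singular proper noun.

```python
# A few (word, tag) pairs copied from the pos_tag output above.
tagged = [("Emma", "NNP"), ("by", "IN"), ("Jane", "NNP"),
          ("Austen", "NNP"), ("1816", "CD")]

# Equivalent of untag: keep only the words.
words = [word for word, tag in tagged]

# Keep only the words tagged as proper nouns.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(words)
print(proper_nouns)
```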

# Introduction to the KoNLPy Korean language processing package

http://konlpy.org/ko/latest/ <br>
https://github.com/konlpy/konlpy

* Kkma<br>
    http://kkma.snu.ac.kr/<br>
* Hannanum<br>
    http://semanticweb.kaist.ac.kr/hannanum/<br>
* Twitter<br>
    https://github.com/twitter/twitter-korean-text/<br>
* Komoran<br>
    http://www.shineware.co.kr/?page_id=835<br>
* Mecab<br>
    https://bitbucket.org/eunjeon/mecab-ko-dic

## Korean corpora

In [53]:
!pip install JPype1-0.6.2-cp36-cp36m-win_amd64.whl
!pip install konlpy



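Requirement 'JPype1-0.6.2-cp36-cp36m-win_amd64.whl' looks like a filename, but the file does not exist

The wheel file must be present in the current working directory for this command to work; `pip install JPype1` installs the package from PyPI instead.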

In [44]:
import konlpy
from konlpy.corpus import kolaw
kolaw.fileids()

['constitution.txt']

In [45]:
c = kolaw.open('constitution.txt').read()
print(c[:100])

대한민국헌법

유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 대한민국임시정부의 법통과 불의에 항거한 4·19민주이념을 계승하고, 조국의 민주개혁과 평화적 통일의


In [46]:
from konlpy.corpus import kobill
kobill.fileids()

['1809890.txt',
 '1809891.txt',
 '1809892.txt',
 '1809893.txt',
 '1809894.txt',
 '1809895.txt',
 '1809896.txt',
 '1809897.txt',
 '1809898.txt',
 '1809899.txt']

In [47]:
d = kobill.open('1809890.txt').read()
print(d[:100])

지방공무원법 일부개정법률안

(정의화의원 대표발의 )

 의 안
 번 호

9890

발의연월일 : 2010.  11.  12.  

발  의  자 : 정의화․이명수․김을동 

이


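## Korean processing utilities

The `pprint` utility function from `konlpy.utils` prints data structures containing Korean text in readable form.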

In [48]:
x = [u"한글", {u"한글 키": [u"한글 밸류1", u"한글 밸류2"]}]
print(x)

['한글', {'한글 키': ['한글 밸류1', '한글 밸류2']}]


In [49]:
from konlpy.utils import pprint
pprint(x)

['한글', {'한글 키': ['한글 밸류1', '한글 밸류2']}]


## Morphological analysis

Tagger classes:
- Kkma
- Hannanum
- Twitter
- Komoran
- Mecab

Methods shared by the taggers:
- morphs: extract morphemes
- nouns: extract nouns
- pos: POS tagging

In [62]:
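import numpy as np
# Import only the taggers that are used instead of a wildcard import.
# Note: Twitter was renamed Okt in KoNLPy 0.5.0.
from konlpy.tag import Hannanum, Kkma, Twitter

hannanum = Hannanum()
kkma = Kkma()
twitter = Twitter()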

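NameError: name 'jpype' is not defined

This error is raised when the JPype1 package that KoNLPy depends on could not be imported, which is consistent with the failed install above.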

## Noun extraction
Use the `nouns` method.

In [57]:
pprint(hannanum.nouns(c[:65]))

NameError: name 'hannanum' is not defined

# Document preprocessing

The process of extracting feature vectors from documents.

## BOW (Bag of Words)

Build a fixed vocabulary $\{W_1, W_2, \ldots, W_m\}$ from the words that make up the whole document set $\{D_1, D_2, \ldots, D_n\}$, <br>
then mark, for each individual document $D_i$, whether each word in the vocabulary appears in it.
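
A minimal sketch of this idea in plain Python, using made-up three-word documents. The names `docs`, `vocabulary`, and `to_indicator` are illustrative assumptions, and the whitespace split stands in for a real tokenizer.

```python
docs = ["the first document", "the second document", "a third one"]

# Fixed vocabulary: every distinct word in the document set, sorted so
# that each word gets a stable column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_indicator(doc):
    """Return 1 for each vocabulary word present in `doc`, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]

vectors = [to_indicator(doc) for doc in docs]
print(vocabulary)
for vec in vectors:
    print(vec)
```

Word order is lost in this representation; only presence in the document is recorded.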

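## Document preprocessing in scikit-learn

The `feature_extraction.text` module provides:
- CountVectorizer:<br>
    Counts the words in a document set and builds a count matrix.
- TfidfVectorizer:<br>
    Counts the words in a document set and builds a matrix in which the counts are reweighted by TF-IDF.
- HashingVectorizer: <br>
    Uses the hashing trick to build a count matrix quickly.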
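
To illustrate the hashing trick, the sketch below maps each word to a column with a hash function instead of a stored vocabulary. It uses `zlib.crc32` and 8 buckets purely for illustration; scikit-learn's actual hash function, default number of features, and sign handling differ, and `hashed_counts` is a hypothetical helper.

```python
import zlib

def hashed_counts(doc, n_buckets=8):
    """Hypothetical helper: count words into n_buckets columns by hash."""
    counts = [0] * n_buckets
    for word in doc.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(word.encode("utf-8")) % n_buckets
        counts[index] += 1
    return counts

vec = hashed_counts("this is the second second document")
print(vec)
```

No vocabulary has to be stored or fitted, at the cost that different words can collide in the same column.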

In [63]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',    
]
vect = CountVectorizer()
vect.fit(corpus)
vect.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'last': 4,
 'one': 5,
 'second': 6,
 'the': 7,
 'third': 8,
 'this': 9}

In [64]:
vect.transform(['This is the second document.']).toarray()

array([[0, 1, 0, 1, 0, 0, 1, 1, 0, 1]], dtype=int64)

In [66]:
vect.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [67]:
vect.transform(corpus).toarray()

array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]], dtype=int64)
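
The count matrix above can be reweighted with TF-IDF. The sketch below applies the textbook formula tf × log(N / df) in plain Python to the same matrix. `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match these.

```python
import math

# Count matrix copied from the output above (rows: documents, columns: words).
counts = [
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
]
n_docs = len(counts)

# Document frequency: number of documents in which each word appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# Textbook IDF; a word that appears in every document gets weight 0.
idf = [math.log(n_docs / d) for d in df]

# Multiply each count by the IDF weight of its column.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]
print([round(w, 3) for w in idf])
```

The word `the` (column 7) appears in every document, so its weight drops to zero, while rare words such as `second` are weighted up.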