# NLTK

- corpus
- tokenizing
- morpheme
- POS tagging


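## Corpus
A corpus is a collection of sample documents assembled for natural language analysis. The corpora bundled with NLTK are used here for basic practice.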

In [1]:
import nltk
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [2]:
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")
print(emma_raw[:1302])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

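## Tokenizing

Tokenizing is splitting a long string into smaller units for analysis. For English text, sentences or words can serve as tokens, or the units can be defined with a regular expression.
- token: one of the units the string is split into
- tokenizer: a function that splits a string into tokens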

In [3]:
from nltk.tokenize import sent_tokenize
# Split the text into sentences; boundaries are found from sentence-ending punctuation such as periods
print(sent_tokenize(emma_raw[:1000])[3])

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
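
The sentence above contains "Mr." but was not cut there, so a period alone is not treated as a sentence boundary. The following is a minimal sketch of that problem using only the standard `re` module. The sample text is copied from the excerpt above; the `ABBREVIATIONS` tuple and the `split_sentences` helper are hypothetical names made up for this illustration and do not describe how `sent_tokenize` works internally.

```python
import re

text = ("Sixteen years had Miss Taylor been in Mr. Woodhouse's family, "
        "less as a governess than a friend, very fond of both daughters, "
        "but particularly of Emma.  Between _them_ it was more the intimacy "
        "of sisters.")

# Naive rule: a sentence ends at every period that is followed by whitespace.
naive = re.split(r"(?<=\.)\s+", text)

# Hypothetical abbreviation list (an assumption for this sketch only).
ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.")

def split_sentences(text):
    """Split on periods, then re-join pieces that ended in a known abbreviation."""
    sentences = []
    for piece in re.split(r"(?<=\.)\s+", text):
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

print(len(naive))                  # pieces produced by the naive rule
print(len(split_sentences(text)))  # pieces after the abbreviation check
```

The naive rule breaks the text after "Mr.", while the abbreviation check keeps that sentence whole.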


In [4]:
from nltk.tokenize import word_tokenize  # split into words
word_tokenize(emma_raw[50:100])

['Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']

In [5]:
from nltk.tokenize import RegexpTokenizer
t = RegexpTokenizer(r"\w+")  # custom pattern: each run of word characters is one token
t.tokenize(emma_raw[50:100])

['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a']
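
To experiment with token patterns outside NLTK, the same idea can be sketched with the standard `re` module. The `sample` string below is retyped from the tokens shown above and is assumed to match `emma_raw[50:100]` closely enough for illustration. The second pattern is an illustrative alternative that keeps punctuation marks as separate tokens.

```python
import re

# Assumed to correspond to the slice tokenized above.
sample = "Emma Woodhouse, handsome, clever, and rich, with a"

# Runs of word characters only; punctuation is dropped.
words_only = re.findall(r"\w+", sample)

# A run of word characters OR a single non-space, non-word character,
# so each punctuation mark becomes its own token.
with_punct = re.findall(r"\w+|[^\w\s]", sample)

print(words_only)
print(with_punct)
```

Choosing the pattern decides what counts as a token, which is why a regular-expression tokenizer is easy to customize.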

---

## Morpheme
In linguistics, a morpheme is the smallest unit of language that carries meaning. In NLP, morphemes are commonly used as tokens. The following tasks derive from morphological analysis:
- stemming
- lemmatizing
- Part-Of-Speech (POS) tagging


In [6]:
words = ['flies', 'dies', 'denied', 'plotted', 'meeting', 
         'itemization', 'sensational', 'traditional']

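### stemming
Stemming cuts the affixes off a word and keeps only the stem. The rules are mechanical, so a stem need not be a dictionary word.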

In [7]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
[st.stem(w) for w in words]

['fli', 'die', 'deni', 'plot', 'meet', 'item', 'sensat', 'tradit']

In [8]:
from nltk.stem import LancasterStemmer
st = LancasterStemmer()
[st.stem(w) for w in words]

['fli', 'die', 'deny', 'plot', 'meet', 'item', 'sens', 'tradit']
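
Stems such as 'fli' come from mechanical rules, not from a dictionary. The toy function below is a rough sketch of that idea only: the `SUFFIXES` tuple and `toy_stem` are hypothetical and far simpler than the Porter or Lancaster rules, so the stems it produces differ from the outputs above.

```python
words = ['flies', 'dies', 'denied', 'plotted', 'meeting',
         'itemization', 'sensational', 'traditional']

# Hypothetical suffix list, longest first; real stemmers use many more rules.
SUFFIXES = ("ization", "ational", "ing", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word  # no rule applied

print([toy_stem(w) for w in words])
```

Because no dictionary is consulted, results like 'flie' or 'plott' are accepted as stems. Real stemmers add further rules (for example, undoing doubled consonants), but they make the same kind of trade-off.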

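### lemmatizing
Lemmatizing unifies the different forms of a word into a single base form: the lemma, or dictionary form. `WordNetLemmatizer.lemmatize` treats a word as a noun unless `pos` is given, so a verb form such as "denied" comes back unchanged by default.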

In [9]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
[lm.lemmatize(w) for w in words]

['fly',
 'dy',
 'denied',
 'plotted',
 'meeting',
 'itemization',
 'sensational',
 'traditional']

In [10]:
lm.lemmatize("denied", pos="n"), lm.lemmatize("denied", pos="v")

('denied', 'deny')
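
The result above depends on the part of speech because a lemma is looked up for a word in a given grammatical role. The sketch below imitates that with a tiny hand-written table. `LEMMAS` and `toy_lemmatize` are hypothetical and are not how WordNet stores or searches its data; they only show why the same string can map to different lemmas.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("denied", "v"): "deny",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("meeting", "v"): "meet",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("denied"), toy_lemmatize("denied", pos="v"))
print(toy_lemmatize("meeting"), toy_lemmatize("meeting", pos="v"))
```

With the default noun reading, "denied" and "meeting" have no entry and are returned unchanged; read as verbs, they map to "deny" and "meet".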

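## POS tagging

POS tagging labels each word with its part of speech, a category assigned according to grammatical function, form, and meaning. Some of the Penn Treebank tags that appear below:

- NN noun (singular or mass)
- PRP personal pronoun
- CD cardinal number
- DT determiner
- VBP verb (present tense, not third-person singular)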

In [11]:
from nltk.tag import pos_tag
x = ["volume", "I", "chapter", "1", "I", "am", "a", "boy", "."]
tagged_list = pos_tag(x)
tagged_list

[('volume', 'NN'),
 ('I', 'PRP'),
 ('chapter', 'VBP'),
 ('1', 'CD'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('a', 'DT'),
 ('boy', 'NN'),
 ('.', '.')]
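
Tags like these can feed the `pos` argument of `lemmatize` used in In [10], which takes WordNet's single-letter codes ("n", "v", "a", "r"), not Penn Treebank tags. The helper below is one possible mapping, sketched under the assumption that anything other than an adjective, verb, or adverb should fall back to the noun default. `penn_to_wordnet` is a hypothetical name, not an NLTK function, and the tagged list is copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # nouns and everything else (assumed fallback)

# Copied from the pos_tag output above.
tagged_list = [('volume', 'NN'), ('I', 'PRP'), ('chapter', 'VBP'), ('1', 'CD'),
               ('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('boy', 'NN'), ('.', '.')]

wordnet_ready = [(word, penn_to_wordnet(tag)) for word, tag in tagged_list]
print(wordnet_ready)
```

Each pair could then be passed on as `lm.lemmatize(word, pos=code)`. Note that a tagging mistake, such as 'chapter' tagged VBP above, carries over into the lemmatizer's input.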

In [12]:
from nltk.tag import untag
untag(tagged_list)

['volume', 'I', 'chapter', '1', 'I', 'am', 'a', 'boy', '.']

#### Text analysis in scikit-learn expects tokens to be strings, not tuples, so join each (word, tag) pair into a single string.

In [13]:
def tokenizer(doc):
    # Tag the list of tokens, then join each (word, tag) pair as "word/TAG"
    return ["/".join(p) for p in pos_tag(doc)]
tokenizer(x)

['volume/NN',
 'I/PRP',
 'chapter/VBP',
 '1/CD',
 'I/PRP',
 'am/VBP',
 'a/DT',
 'boy/NN',
 './.']
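
If the (word, tag) pairs are needed again later, the joined strings can be split back. The sketch below uses plain Python only, with the tagged list copied from the output above. `split_token` is a hypothetical helper. It splits on the last "/" so that a token such as './.' or a word that itself contains a slash still separates correctly.

```python
# Copied from the pos_tag output above.
tagged_list = [('volume', 'NN'), ('I', 'PRP'), ('chapter', 'VBP'), ('1', 'CD'),
               ('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('boy', 'NN'), ('.', '.')]

# Same joining step as the tokenizer above, without calling the tagger.
joined = ["/".join(pair) for pair in tagged_list]

def split_token(token):
    """Split 'word/TAG' on the last slash and return a (word, tag) tuple."""
    word, tag = token.rsplit("/", 1)
    return (word, tag)

restored = [split_token(token) for token in joined]
print(restored == tagged_list)
```

Splitting on the first slash instead would break on words that contain "/", which is why `rsplit` with a limit of 1 is used here.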

---