# Natural Language Processing with NLTK

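- NLTK (Natural Language Toolkit) is a Python package for natural language processing. If you installed Anaconda, NLTK is already included by default.
- NLTK is one of the most widely used tools for preprocessing English text data.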

In [1]:
!pip install nltk



In [2]:
# Gutenberg corpus

import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\ilifo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [3]:
from nltk.corpus import gutenberg

In [16]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [17]:
# Load the raw text of Hamlet as a single string
hamlet = gutenberg.raw('shakespeare-hamlet.txt')

In [18]:
type(hamlet)

str

In [19]:
print(hamlet[:500])

[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not


In [7]:
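# Tokenizer models used by sent_tokenize and word_tokenize.
# Recent NLTK releases (3.9+) look for 'punkt_tab' instead; download that
# resource if the tokenizers raise a LookupError.
nltk.download('punkt')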

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ilifo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

#### Tokenization
A function that splits a string into tokens is called a tokenizer. A tokenizer takes a string as input and returns a list of token strings.
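
To make the "string in, list of token strings out" contract concrete, here is a minimal sketch of a toy tokenizer built on the standard library only. The function name `simple_tokenize` and its regular expression are assumptions made for illustration; this is not how NLTK's tokenizers are implemented, and it handles contractions such as "Who's" differently from `word_tokenize`.

```python
import re

def simple_tokenize(text):
    """Toy tokenizer (hypothetical helper): split text into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Who's there? Nay answer me.")
print(tokens)
```

The input is one string and the result is a plain Python list of strings, which is the same shape that the NLTK tokenizers in the following cells return.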

#### Sentence tokenization

In [8]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(hamlet[:1000]))

['[The Tragedie of Hamlet by William Shakespeare 1599]\n\n\nActus Primus.', 'Scoena Prima.', 'Enter Barnardo and Francisco two Centinels.', 'Barnardo.', "Who's there?", 'Fran.', 'Nay answer me: Stand & vnfold\nyour selfe\n\n   Bar.', 'Long liue the King\n\n   Fran.', 'Barnardo?', 'Bar.', 'He\n\n   Fran.', 'You come most carefully vpon your houre\n\n   Bar.', "'Tis now strook twelue, get thee to bed Francisco\n\n   Fran.", "For this releefe much thankes: 'Tis bitter cold,\nAnd I am sicke at heart\n\n   Barn.", 'Haue you had quiet Guard?', 'Fran.', 'Not a Mouse stirring\n\n   Barn.', 'Well, goodnight.', 'If you do meet Horatio and\nMarcellus, the Riuals of my Watch, bid them make hast.', 'Enter Horatio and Marcellus.', 'Fran.', 'I thinke I heare them.', "Stand: who's there?", 'Hor.', 'Friends to this ground\n\n   Mar.', 'And Leige-men to the Dane\n\n   Fran.', 'Giue you good night\n\n   Mar.', "O farwel honest Soldier, who hath relieu'd you?", 'Fra.', "Barnardo ha's my place: giue you go
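
The sentences in the output above still contain the play's line breaks and indentation (`\n` and runs of spaces). One possible cleanup, shown here as a sketch rather than a required step, is to collapse all whitespace before or after tokenizing. The variable names `excerpt` and `normalized` are made up for this example, and the excerpt is copied from the text printed earlier.

```python
# A short excerpt with the same line breaks and indentation as the raw text.
excerpt = "Nay answer me: Stand & vnfold\nyour selfe\n\n   Bar. Long liue the King"

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space collapses newlines and indentation.
normalized = " ".join(excerpt.split())
print(normalized)
```

This only changes whitespace. It does not stop speaker labels such as `Bar.` from being treated as sentence endings, which is why they show up as separate "sentences" in the output above.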

#### Word tokenization

In [9]:
from nltk.tokenize import word_tokenize
print(word_tokenize(hamlet[:100]))

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']', 'Actus', 'Primus', '.', 'Scoena', 'Prima', '.', 'Enter', 'Barnardo', 'a']


In [10]:
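import nltk
# Model used by pos_tag. Recent NLTK releases (3.9+) look for
# 'averaged_perceptron_tagger_eng' instead; download that resource
# if pos_tag raises a LookupError.
nltk.download('averaged_perceptron_tagger')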

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ilifo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

#### Part-of-speech tagging

In [11]:
from nltk.tag import pos_tag
sentence = "You come most carefully vpon your houre"
tagged_list = pos_tag(word_tokenize(sentence))
tagged_list

[('You', 'PRP'),
 ('come', 'VBP'),
 ('most', 'RBS'),
 ('carefully', 'RB'),
 ('vpon', 'VB'),
 ('your', 'PRP$'),
 ('houre', 'NN')]
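
The tags in this output come from the Penn Treebank tagset. As a reading aid, the sketch below pairs each tag with a short gloss. The dictionary `TAG_GLOSSES` is a hand-written helper for this example, not part of NLTK, and the tagged tokens are copied from the output above so that no model download is needed to run it.

```python
# Short glosses for the Penn Treebank tags that appear in the output above.
TAG_GLOSSES = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "RBS": "adverb, superlative",
    "RB": "adverb",
    "VB": "verb, base form",
    "PRP$": "possessive pronoun",
    "NN": "noun, singular or mass",
}

# Tagged tokens copied from the notebook output.
tagged_list = [
    ("You", "PRP"), ("come", "VBP"), ("most", "RBS"), ("carefully", "RB"),
    ("vpon", "VB"), ("your", "PRP$"), ("houre", "NN"),
]

# Print each word, its tag, and the gloss in aligned columns.
for word, tag in tagged_list:
    print(f"{word:<10}{tag:<6}{TAG_GLOSSES.get(tag, 'unknown')}")
```

Note that `vpon` is an archaic spelling of the preposition "upon", yet it is tagged `VB`. A likely explanation is that the tagger was trained on modern English and has to guess for unfamiliar spellings, so tags on early modern text like this should be treated with some caution.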

#### Extracting only the word list, without tags

In [13]:
from nltk.tag import untag
untag(tagged_list)

['You', 'come', 'most', 'carefully', 'vpon', 'your', 'houre']
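
Because each tagged token is an ordinary `(word, tag)` tuple, the same result can be obtained with a list comprehension. The sketch below is plain Python and does not call NLTK; the tagged tokens are copied from the earlier output, and the names `words` and `tags` are chosen for this example.

```python
# Tagged tokens copied from the notebook output.
tagged_list = [
    ("You", "PRP"), ("come", "VBP"), ("most", "RBS"), ("carefully", "RB"),
    ("vpon", "VB"), ("your", "PRP$"), ("houre", "NN"),
]

# Keep only the first element of each (word, tag) pair.
words = [word for word, _tag in tagged_list]

# The same pattern extracts the tags instead.
tags = [tag for _word, tag in tagged_list]

print(words)
print(tags)
```

Writing the comprehension by hand is useful when you also want to transform the words at the same time, for example lowercasing them.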

#### Building a list with only the 'NN' tags removed

In [14]:
tagged_list_new = [(word, tag) for word, tag in tagged_list if tag != 'NN']

In [15]:
tagged_list_new

[('You', 'PRP'),
 ('come', 'VBP'),
 ('most', 'RBS'),
 ('carefully', 'RB'),
 ('vpon', 'VB'),
 ('your', 'PRP$')]
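
The comparison `!= 'NN'` removes only singular common nouns. The Penn Treebank tagset also has `NNS`, `NNP`, and `NNPS` for plural and proper nouns, and those would be kept. If the goal is to drop every noun, one option is to match on the tag prefix, as sketched below. The example tokens and their tags are hand-written for illustration and are not tagger output.

```python
# Hand-tagged example (hypothetical, not produced by pos_tag).
example = [("Friends", "NNS"), ("to", "TO"), ("this", "DT"), ("ground", "NN")]

# Exact match: removes only the tag 'NN', so the plural noun survives.
without_nn = [(word, tag) for word, tag in example if tag != "NN"]

# Prefix match: removes NN, NNS, NNP, and NNPS alike.
without_nouns = [(word, tag) for word, tag in example if not tag.startswith("NN")]

print(without_nn)
print(without_nouns)
```

Which variant is appropriate depends on whether plural and proper nouns should count as nouns for the task at hand.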