# 5. Categorizing and Tagging Words

Word classes such as nouns, verbs, and adjectives are useful in many language processing tasks. The goal of this chapter is to answer the following questions.

1. What are lexical categories, and how are they used in natural language processing?

2. What is a good Python data structure for storing words and their categories?

3. How can we automatically tag each word of a text with its word class?

In other words: how do we tag a text automatically (by morpheme or otherwise)?

## 5.1 Using a Tagger 

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word.

In [1]:
import nltk

In [2]:
text= nltk.word_tokenize("And now for something completely different")

In [3]:
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

Another example, this time with words that are spelled the same but belong to different word classes:

In [4]:
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

In [5]:
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]

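Notice that refuse and permit each appear both as a verb and as a noun.

refUSE is a verb meaning "deny", while REFuse is a noun meaning "trash".

Lexical categories like "noun" seem to have their uses, but the details may be obscure to many readers. Many of these categories arise from a superficial analysis of the distribution of words in a text.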

In [6]:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

The text.similar() method finds the words that are used in contexts similar to those of a given word.

In [7]:
text.similar('woman')

man time day year car moment world house family boy child country job
state girl way war place word week


In [8]:
text.similar('bought')

made done said put found had seen left given heard brought was been
got set told that in took called


In [9]:
text.similar('over')

in on to of and for with from at by that into as up out down through
all is about
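
To make the idea of "similar context" concrete, here is a minimal pure-Python sketch. It treats the context of a word as the pair (previous word, next word) and ranks other words by how many such contexts they share with the target. The function name `similar_words` and the toy sentence are made up for illustration, and NLTK's own implementation is more elaborate, so this is only an approximation of what text.similar() does.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with target."""
    contexts = defaultdict(set)              # word -> set of (previous, next) pairs
    padded = [None] + list(tokens) + [None]  # None marks the text boundaries
    for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
        contexts[word].add((prev, nxt))
    target_contexts = contexts[target]
    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

# Hypothetical toy text, not taken from the Brown Corpus.
tokens = "the man saw the dog . the woman saw the cat . the man fed the cat".split()
print(similar_words(tokens, "woman"))
```

On this toy text, "man" is the only word that shares a context ("the _ saw") with "woman", which mirrors how nouns end up grouped with nouns in the real output above.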


A tagger can also model our knowledge of unknown words; for example, we can guess that <i>scrobbling</i> is probably a verb, with the root <i>scrobble</i>, and likely to occur in contexts like <i>he was scrobbling</i>.

## 5.2 Tagged Corpora 

### Representing Tagged Tokens 

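NLTK represents a tagged token as a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token with the <strong>str2tuple()</strong> function.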

In [10]:
tagged_token = nltk.tag.str2tuple('fly/NN')
tagged_token

('fly', 'NN')

In [11]:
print(tagged_token[0], tagged_token[1])

fly NN


In [14]:
sent = '''
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
interest/NN of/IN both/ABX governments/NNS ''/'' ./.
'''

In [16]:
[nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'),
 ('grand', 'JJ'),
 ('jury', 'NN'),
 ('commented', 'VBD'),
 ('on', 'IN'),
 ('a', 'AT'),
 ('number', 'NN'),
 ('of', 'IN'),
 ('other', 'AP'),
 ('topics', 'NNS'),
 (',', ','),
 ('AMONG', 'IN'),
 ('them', 'PPO'),
 ('the', 'AT'),
 ('Atlanta', 'NP'),
 ('and', 'CC'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('purchasing', 'VBG'),
 ('departments', 'NNS'),
 ('which', 'WDT'),
 ('it', 'PPS'),
 ('said', 'VBD'),
 ('``', '``'),
 ('ARE', 'BER'),
 ('well', 'QL'),
 ('operated', 'VBN'),
 ('and', 'CC'),
 ('follow', 'VB'),
 ('generally', 'RB'),
 ('accepted', 'VBN'),
 ('practices', 'NNS'),
 ('which', 'WDT'),
 ('inure', 'VB'),
 ('to', 'IN'),
 ('the', 'AT'),
 ('best', 'JJT'),
 ('interest', 'NN'),
 ('of', 'IN'),
 ('both', 'ABX'),
 ('governments', 'NNS'),
 ("''", "''"),
 ('.', '.')]
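
The conversion from a `word/TAG` string to a tuple is simple enough to sketch by hand. The function below is a hypothetical stand-in named `split_tagged_token`, not NLTK's code; it assumes the tag follows the last `/` in the string and that the tag is uppercased, which matches the `NP-tl` → `NP-TL` behaviour visible in the output above.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/tag' at the last separator and uppercase the tag."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)          # no separator, so there is no tag
    return (word, tag.upper())

print(split_tagged_token("fly/NN"))
print([split_tagged_token(t) for t in "Fulton/NP-tl County/NN-tl ./.".split()])
```

Splitting at the last separator rather than the first keeps tokens that themselves contain a slash intact.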

###  Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part of speech. For example, opening a file from the Brown Corpus in a text editor shows something like this:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


Note that the part-of-speech tags appear in uppercase once the corpus is read through NLTK.

In [18]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.

In [22]:
nltk.corpus.nps_chat.tagged_words()

[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

### Nouns

Nouns can appear after determiners and adjectives, and can be the subject or object of a verb.

### Exploring Tagged Corpora 

Finding "verb to verb" sequences:

In [24]:
from nltk.corpus import brown

def process(sentence):
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 =='TO' and t3.startswith('V')):
            print (w1, w2, w3)

In [27]:
brown.tagged_sents()[:1]

[[('The', 'AT'),
  ('Fulton', 'NP-TL'),
  ('County', 'NN-TL'),
  ('Grand', 'JJ-TL'),
  ('Jury', 'NN-TL'),
  ('said', 'VBD'),
  ('Friday', 'NR'),
  ('an', 'AT'),
  ('investigation', 'NN'),
  ('of', 'IN'),
  ("Atlanta's", 'NP$'),
  ('recent', 'JJ'),
  ('primary', 'NN'),
  ('election', 'NN'),
  ('produced', 'VBD'),
  ('``', '``'),
  ('no', 'AT'),
  ('evidence', 'NN'),
  ("''", "''"),
  ('that', 'CS'),
  ('any', 'DTI'),
  ('irregularities', 'NNS'),
  ('took', 'VBD'),
  ('place', 'NN'),
  ('.', '.')]]

In [25]:
for tagged_sent in brown.tagged_sents():
    process(tagged_sent)

combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
expected to approve
expected to make
intends to make
seek to set
like to see
designed to provide
get to hear
expects to tell
expected to give
prefer to pay
required to obtain
permitted to teach
designed to reduce
Asked to elaborate
got to go
raised to pay
scheduled to go
cut to meet
needed to meet
hastened to add
found to prevent
continue to insist
compelled to make
made to remove
revamped to give
want to risk
appear to spark
fails to consider
plans to call
going to examine
plans to name
come to pass
voted to accept
happens to hold
authorized to adopt
hesitated to prosecute
try to make
decided to spend
taken to preserve
left to preserve
stand to bring
decided to seek
trying to induce
proposing to make
decided to run
directed to investigate
expected to pass
expected to make
expected to encounter
hopes to pass
came to pay
expected to receive
understood to follow
wanted to vote
decide
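
The same pattern match can be tried on a single hand-tagged sentence without loading the corpus. This sketch reuses the tagged output of the "refuse permit" sentence from section 5.1 and builds trigrams with `zip`; the function name `verb_to_verb` is made up for illustration, and it returns the matches instead of printing them so that they are easy to inspect.

```python
def verb_to_verb(tagged_sent):
    """Return (w1, w2, w3) triples matching the pattern verb + 'to' + verb."""
    matches = []
    # Zipping three staggered copies of the list yields consecutive trigrams.
    trigrams = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    for (w1, t1), (w2, t2), (w3, t3) in trigrams:
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            matches.append((w1, w2, w3))
    return matches

tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]
print(verb_to_verb(tagged))
```

Only "refuse to permit" matches here: in "us to obtain" the first word is a pronoun, so the `startswith('V')` test on its tag fails.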

Now let's find words whose part-of-speech tag is highly ambiguous. Understanding why such words are tagged as they are in each context helps clarify the distinctions between the tags.

In [37]:
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

In [38]:
data = nltk.ConditionalFreqDist((word.lower(), tag) 
                                for (word,tag) in brown_news_tagged)

In [39]:
for word in sorted(data.conditions()):
    if len(data[word]) > 3:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))

best ADJ NOUN VERB ADV
close ADV ADJ VERB NOUN
open ADJ VERB NOUN ADV
present ADJ ADV NOUN VERB
that ADP DET PRON ADV
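
What the conditional frequency distribution is doing here can be sketched with the standard library alone: a dictionary keyed by word whose values count how often each tag occurs with that word. The helper name `tags_by_word` and the toy data are hypothetical; the tags follow the universal tagset used above.

```python
from collections import Counter, defaultdict

def tags_by_word(tagged_words):
    """Map each lowercased word to a Counter of the tags it occurs with."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word.lower()][tag] += 1
    return tag_counts

# Hypothetical toy data, not taken from the Brown Corpus.
tagged_words = [('Close', 'VERB'), ('the', 'DET'), ('door', 'NOUN'),
                ('a', 'DET'), ('close', 'ADJ'), ('call', 'NOUN'),
                ('close', 'ADJ'), ('friends', 'NOUN')]
counts = tags_by_word(tagged_words)

# Keep only words seen with more than one tag, most frequent tag first.
ambiguous = {word: [tag for tag, _ in c.most_common()]
             for word, c in counts.items() if len(c) > 1}
print(ambiguous)
```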


## 5.3 Mapping Words to Properties Using Python Dictionaries 

As we have seen, a tagged word of the form (word, tag) is an association between a word and a part-of-speech tag.

We can also write programs that assign a tag to a word. This process can be thought of as <strong>mapping</strong> from words to tags, and in Python such a mapping is most naturally stored in a <strong>dictionary</strong>.

In [40]:
pos = {}

In [41]:
pos['colorless'] = 'ADJ'

In [42]:
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

In [43]:
pos

{'colorless': 'ADJ', 'furiously': 'ADV', 'ideas': 'N', 'sleep': 'V'}

In [44]:
pos['ideas']

'N'

In [45]:
list(pos)

['sleep', 'ideas', 'furiously', 'colorless']

In [46]:
sorted(pos)

['colorless', 'furiously', 'ideas', 'sleep']

In [47]:
[w for w in pos if w.endswith('s')]

['ideas', 'colorless']

In [49]:
for word in sorted(pos):
    print(word + ":", pos[word])

colorless: ADJ
furiously: ADV
ideas: N
sleep: V


In [50]:
for key, val in sorted(pos.items()):
    print(key + ":", val)

colorless: ADJ
furiously: ADV
ideas: N
sleep: V
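
One caveat about the key order shown for list(pos) above: that output comes from an older Python, where dictionary order was arbitrary. Since Python 3.7, dictionaries preserve insertion order, so the same cells give a different (and predictable) result today. A small sketch, rebuilding the same dictionary:

```python
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

insertion_order = list(pos)   # keys in the order they were added (Python 3.7+)
alphabetical = sorted(pos)    # keys in alphabetical order, on any version
print(insertion_order)
print(alphabetical)
```

If the order matters, sorted() remains the safe choice, because it does not depend on how the dictionary was built.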


### Incrementally Updating a Dictionary 

In [57]:
from collections import defaultdict
counts = defaultdict(int)

In [58]:
counts

defaultdict(int, {})

In [59]:
for (word, tag) in brown.tagged_words(categories='news'):
    counts[tag]+=1

In [60]:
counts['CS']

1509

In [56]:
list(counts)

['ABX',
 'VB-HL',
 'WRB',
 ':-HL',
 'MD-HL',
 'IN-TL',
 'CD-HL',
 'AP-HL',
 'TO',
 'BEM',
 'RB$',
 'CC-HL',
 'FW-IN-TL',
 '``',
 'FW-PP$-NC',
 'PPSS+HVD',
 'NNS',
 "''",
 '.',
 'RBR',
 'AT-HL',
 'MD*',
 'NPS',
 'RP-HL',
 'WPS',
 'FW-IN+NN',
 'RP',
 'ABN',
 'NP+BEZ',
 'PN-HL',
 'CD',
 'PN',
 ',-HL',
 'JJR-HL',
 'NN-NC',
 'BEZ*',
 'PPS',
 'JJR',
 'DOD',
 'TO-TL',
 'DO',
 'QL-TL',
 'NP$-TL',
 'FW-CC',
 'NNS$-TL',
 'WPS+BEZ',
 'BEZ',
 'NR-TL',
 'RBT',
 ')-HL',
 'PPS+BEZ',
 "'",
 'VBN-HL',
 'JJS-TL',
 'BE-HL',
 'CS-HL',
 'WPO',
 'FW-VB',
 'QLP',
 'BE',
 'DTS',
 ')',
 'FW-WDT',
 'BEDZ',
 'DT',
 'NNS$-HL',
 'PN$',
 'AT',
 'NNS-HL',
 'RB+BEZ',
 'BED*',
 'HVD-HL',
 'PPO',
 'AP',
 'PP$$',
 'IN-HL',
 'PPSS+HV',
 'NPS-TL',
 'HVD*',
 'FW-IN+NN-TL',
 'MD',
 'VBG-HL',
 'NN$-HL',
 'UH-TL',
 'NP$',
 'CC-TL',
 'EX+BEZ',
 'VBD',
 'BER-TL',
 'HVD',
 'DT+BEZ',
 '*',
 'DT$',
 'FW-JJ',
 'DOZ*',
 'ABN-HL',
 'ABL',
 'NP-TL-HL',
 'JJT-HL',
 'FW-CD',
 'PPS+BEZ-HL',
 'NN$-TL',
 'PPSS+MD',
 'PPSS+BER',
 'AT-TL',
 

In [61]:
from operator import itemgetter

In [62]:
sorted(counts.items(), key=itemgetter(1), reverse=True)

[('NN', 13162),
 ('IN', 10616),
 ('AT', 8893),
 ('NP', 6866),
 (',', 5133),
 ('NNS', 5066),
 ('.', 4452),
 ('JJ', 4392),
 ('CC', 2664),
 ('VBD', 2524),
 ('NN-TL', 2486),
 ('VB', 2440),
 ('VBN', 2269),
 ('RB', 2166),
 ('CD', 2020),
 ('CS', 1509),
 ('VBG', 1398),
 ('TO', 1237),
 ('PPS', 1056),
 ('PP$', 1051),
 ('MD', 1031),
 ('AP', 923),
 ('NP-TL', 741),
 ('``', 732),
 ('BEZ', 730),
 ('BEDZ', 716),
 ("''", 702),
 ('JJ-TL', 689),
 ('PPSS', 602),
 ('DT', 589),
 ('BE', 525),
 ('VBZ', 519),
 ('NR', 495),
 ('RP', 482),
 ('QL', 468),
 ('PPO', 412),
 ('WPS', 395),
 ('NNS-TL', 344),
 ('WDT', 343),
 ('WRB', 328),
 ('BER', 328),
 ('OD', 309),
 ('HVZ', 301),
 ('--', 300),
 ('NP$', 279),
 ('HV', 265),
 ('HVD', 262),
 ('*', 256),
 ('BED', 252),
 ('NPS', 215),
 ('BEN', 212),
 ('NN$', 210),
 ('DTI', 205),
 ('NP-HL', 186),
 ('ABN', 183),
 ('NN-HL', 171),
 ('IN-TL', 164),
 ('EX', 161),
 (')', 151),
 ('(', 148),
 ('JJR', 145),
 (':', 137),
 ('DTS', 136),
 ('JJT', 100),
 ('CD-TL', 96),
 ('NNS-HL', 92),
 ('

In [63]:
[t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]

['NN',
 'IN',
 'AT',
 'NP',
 ',',
 'NNS',
 '.',
 'JJ',
 'CC',
 'VBD',
 'NN-TL',
 'VB',
 'VBN',
 'RB',
 'CD',
 'CS',
 'VBG',
 'TO',
 'PPS',
 'PP$',
 'MD',
 'AP',
 'NP-TL',
 '``',
 'BEZ',
 'BEDZ',
 "''",
 'JJ-TL',
 'PPSS',
 'DT',
 'BE',
 'VBZ',
 'NR',
 'RP',
 'QL',
 'PPO',
 'WPS',
 'NNS-TL',
 'WDT',
 'WRB',
 'BER',
 'OD',
 'HVZ',
 '--',
 'NP$',
 'HV',
 'HVD',
 '*',
 'BED',
 'NPS',
 'BEN',
 'NN$',
 'DTI',
 'NP-HL',
 'ABN',
 'NN-HL',
 'IN-TL',
 'EX',
 ')',
 '(',
 'JJR',
 ':',
 'DTS',
 'JJT',
 'CD-TL',
 'NNS-HL',
 'PN',
 'RBR',
 'VBN-TL',
 'ABX',
 'NN$-TL',
 'IN-HL',
 'DOD',
 'DO',
 'BEG',
 ',-HL',
 'VBN-HL',
 'CD-HL',
 'AT-TL',
 'NNS$',
 'JJS',
 "'",
 'CC-TL',
 'JJ-HL',
 'MD*',
 'VBZ-HL',
 'PPL',
 'PPS+BEZ',
 'PPSS+MD',
 'OD-TL',
 'DOZ',
 'VB-HL',
 'NR$',
 'WP$',
 'FW-NN',
 'ABL',
 'PPLS',
 'NNS$-TL',
 ')-HL',
 'PPSS+BER',
 '.-HL',
 '(-HL',
 'PPSS+HV',
 'HVN',
 'PPSS+BEM',
 'DO*',
 'NPS$',
 'FW-NN-TL',
 'NPS-TL',
 'VBG-TL',
 'DOD*',
 'RB-HL',
 'AT-HL',
 'NR-TL',
 'HVG',
 'FW-IN',
 'BEM',
 

In [64]:
pair = ('NP', 8336)

In [65]:
pair[1]

8336

In [66]:
itemgetter(1)(pair)

8336
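
The counting and sorting idiom from this section can be tried on a handful of tagged words without loading the corpus. The toy list below is hypothetical; the pattern (a defaultdict(int) for counting, then sorted() with itemgetter(1) as the key) is the same as above.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical toy data using Brown-style tags.
tagged_words = [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'),
                ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'),
                ('the', 'AT')]

counts = defaultdict(int)        # missing keys start at int() == 0
for word, tag in tagged_words:
    counts[tag] += 1

# itemgetter(1) picks the count out of each (tag, count) pair.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(ranked[:2])
```

Because missing keys default to zero, the loop needs no "have I seen this tag before?" check.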

## 5.4 Automatic Tagging 

In the rest of this chapter we will explore various ways to automatically add part-of-speech tags to text.

In [67]:
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

### The Default Tagger 

The simplest possible tagger assigns the same tag to every token. This may seem worse than doing nothing, but it establishes an important baseline for tagger performance.

In [68]:
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]

In [71]:
nltk.FreqDist(tags).max()

'NN'

Now we create a tagger that tags everything as NN.

In [72]:
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'

In [73]:
tokens = nltk.word_tokenize(raw)

In [76]:
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]

Unsurprisingly, this tagger performs poorly.

In [75]:
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028
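
The score above is the fraction of tokens whose predicted tag equals the gold tag. A minimal sketch of that computation for a default tagger follows; `default_tag` and `accuracy` are hypothetical helper names rather than NLTK functions, and the two gold sentences are toy data.

```python
def default_tag(tokens, tag='NN'):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

def accuracy(tagger, gold_sents):
    """Fraction of tokens for which the tagger's tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        for (_, guess), (_, truth) in zip(tagger(words), gold):
            correct += (guess == truth)
            total += 1
    return correct / total

# Hypothetical toy gold standard using Brown-style tags.
gold = [[('the', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
        [('the', 'AT'), ('election', 'NN')]]
print(accuracy(default_tag, gold))
```

Two of the five toy tokens are nouns, so the default tagger scores 0.4 here; on the Brown news text, where about one token in eight is tagged NN, the same logic gives the 13% reported above.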

As we will see later, default taggers help improve the robustness of a language processing system.

### The Regular Expression Tagger 

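The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in <i>ed</i> is the past tense of a verb, and that any word ending in <i>'s</i> is a possessive noun.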

In [77]:
patterns = [
     (r'.*ing$', 'VBG'),               # gerunds
     (r'.*ed$', 'VBD'),                # simple past
     (r'.*es$', 'VBZ'),                # 3rd singular present
     (r'.*ould$', 'MD'),               # modals
     (r'.*\'s$', 'NN$'),               # possessive nouns
     (r'.*s$', 'NNS'),                 # plural nouns
     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'.*', 'NN')                     # nouns (default)
] 

In [78]:
regexp_tagger = nltk.RegexpTagger(patterns)

In [79]:
regexp_tagger.tag(brown_sents[3])

[('``', 'NN'),
 ('Only', 'NN'),
 ('a', 'NN'),
 ('relative', 'NN'),
 ('handful', 'NN'),
 ('of', 'NN'),
 ('such', 'NN'),
 ('reports', 'NNS'),
 ('was', 'NNS'),
 ('received', 'VBD'),
 ("''", 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('jury', 'NN'),
 ('said', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('considering', 'VBG'),
 ('the', 'NN'),
 ('widespread', 'NN'),
 ('interest', 'NN'),
 ('in', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('number', 'NN'),
 ('of', 'NN'),
 ('voters', 'NNS'),
 ('and', 'NN'),
 ('the', 'NN'),
 ('size', 'NN'),
 ('of', 'NN'),
 ('this', 'NNS'),
 ('city', 'NN'),
 ("''", 'NN'),
 ('.', 'NN')]

In [80]:
regexp_tagger.evaluate(brown_tagged_sents)

0.20326391789486245

The regular expression tagger is correct about 20% of the time.
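
The patterns are tried in order and the first one that matches wins, which explains errors such as "was" and "this" being tagged NNS: both end in <i>s</i> and nothing earlier in the list matches them. A rough pure-Python sketch of that behaviour, using the `re` module and the same pattern list (the function name `regexp_tag` is hypothetical, and this is an approximation of RegexpTagger rather than its actual code):

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r".*'s$", 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                    # nouns (default)
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break                 # first match wins; later patterns are ignored
    return tagged

print(regexp_tag(['was', 'received', 'this', 'jury', '42']))
```

Because the final catch-all pattern matches any string, every token receives some tag, and the order of the list determines which guess is made when several patterns could apply.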

### The Lookup Tagger 

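Many high-frequency words do not have the NN tag. Let's find the most likely tag for each of the 100 most frequent words and use that lookup table as the model for a tagger.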

In [81]:
fd = nltk.FreqDist(brown.words(categories='news'))

In [82]:
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

In [87]:
most_freq_words = fd.most_common(100)

In [89]:
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)

In [90]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags)

In [91]:
baseline_tagger.evaluate(brown_tagged_sents)

0.45578495136941344

In [92]:
sent = brown.sents(categories='news')[3]
baseline_tagger.tag(sent)

[('``', '``'),
 ('Only', None),
 ('a', 'AT'),
 ('relative', None),
 ('handful', None),
 ('of', 'IN'),
 ('such', None),
 ('reports', None),
 ('was', 'BEDZ'),
 ('received', None),
 ("''", "''"),
 (',', ','),
 ('the', 'AT'),
 ('jury', None),
 ('said', 'VBD'),
 (',', ','),
 ('``', '``'),
 ('considering', None),
 ('the', 'AT'),
 ('widespread', None),
 ('interest', None),
 ('in', 'IN'),
 ('the', 'AT'),
 ('election', None),
 (',', ','),
 ('the', 'AT'),
 ('number', None),
 ('of', 'IN'),
 ('voters', None),
 ('and', 'CC'),
 ('the', 'AT'),
 ('size', None),
 ('of', 'IN'),
 ('this', 'DT'),
 ('city', None),
 ("''", "''"),
 ('.', '.')]

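Accuracy has risen to about 46%, but a closer look shows that most words were assigned None, because they are not among the 100 most frequent words. In these cases we would like to fall back to the default tag NN, a process known as backoff.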

In [93]:
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                    backoff=nltk.DefaultTagger('NN'))
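
Conceptually, a lookup tagger with backoff is a dictionary lookup with a fallback value. The sketch below shows that idea with a plain dict; `lookup_tag` and the four-entry model are hypothetical, standing in for the 100-word model built from the corpus.

```python
# Hypothetical miniature model: word -> most likely tag.
likely_tags = {'the': 'AT', 'of': 'IN', 'was': 'BEDZ', ',': ','}

def lookup_tag(tokens, model, default=None):
    """Tag known words from the model; unknown words get the default tag."""
    return [(token, model.get(token, default)) for token in tokens]

sent = ['the', 'size', 'of', 'the', 'city']
without_backoff = lookup_tag(sent, likely_tags)
with_backoff = lookup_tag(sent, likely_tags, default='NN')
print(without_backoff)
print(with_backoff)
```

Without a default, unknown words come back as None; with the NN default they are at least given the corpus-wide most common tag, which is what the backoff argument achieves above.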

In [95]:
def performance(cfd, wordlist):    
    lt = dict((word, cfd[word].max()) for word in wordlist)    
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():    
    import pylab    
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))    
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))    
    sizes = 2 ** pylab.arange(15)    
    perfs = [performance(cfd, words_by_freq[:size]) 
                                              for size in sizes]    
    pylab.plot(sizes, perfs, '-bo')    
    pylab.title('Lookup Tagger Performance with Varying Model Size')    
    pylab.xlabel('Model Size')    
    pylab.ylabel('Performance')    
    pylab.show() 

In [96]:
display()

## 5.5 N-Gram Tagging

### Unigram Tagger

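The unigram tagger is based on a simple statistical algorithm: for each token, it assigns the tag that is most likely for that particular token. For example, it assigns the tag JJ to any occurrence of the word <i>frequent</i>, because <i>frequent</i> is used as an adjective more often than as a verb.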

In [97]:
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

In [98]:
unigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'QL'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

In [99]:
unigram_tagger.evaluate(brown_tagged_sents) 

0.9349006503968017

### Separating the Training and Testing Data 

In [100]:
size = int(len(brown_tagged_sents)*0.9)

In [101]:
size

4160

In [102]:
train_sets = brown_tagged_sents[:size]
test_sets = brown_tagged_sents[size:]

In [104]:
unigram_tagger = nltk.UnigramTagger(train_sets)
unigram_tagger.evaluate(test_sets)

0.8129173726701884

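The score is lower than before, but because the tagger is now evaluated on sentences it has not seen during training, we get a more realistic picture of how useful the unigram tagger is.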
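
The effect of holding out test data can be reproduced on a tiny scale. The sketch below trains a unigram model (most frequent tag per word) on the first part of a toy corpus and scores it on both the training and the held-out sentences. The helper names, the toy corpus, and the 75/25 split are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Return a dict mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens tagged correctly; unseen words count as wrong."""
    pairs = [(word, tag) for sent in tagged_sents for word, tag in sent]
    return sum(model.get(word) == tag for word, tag in pairs) / len(pairs)

# Hypothetical toy corpus using Brown-style tags.
corpus = [
    [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('a', 'AT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'AT'), ('bird', 'NN'), ('barks', 'VBZ')],
]
size = int(len(corpus) * 0.75)
train, test = corpus[:size], corpus[size:]

model = train_unigram(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(train_acc, test_acc)
```

The model is perfect on its own training data but misses "bird", which appears only in the held-out sentence, so the test score is lower. Scoring on the training data rewards memorization, which is why the earlier 93% figure was too optimistic.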

### General N-Gram Tagging

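An n-gram tagger is a generalization of the unigram tagger: its context is the current token together with the part-of-speech tags of the n-1 preceding tokens.

- 1-gram tagger = unigram tagger (considers only the current token, with no surrounding context)
- 2-gram tagger = bigram tagger
- 3-gram tagger = trigram tagger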

In [106]:
bigram_tagger = nltk.BigramTagger(train_sets)

In [108]:
bigram_tagger.tag(brown_sents[2007])

[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

In [109]:
unseen_sent = brown_sents[4203]

In [110]:
bigram_tagger.tag(unseen_sent) 

[('The', 'AT'),
 ('population', 'NN'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('Congo', 'NP'),
 ('is', 'BEZ'),
 ('13.5', None),
 ('million', None),
 (',', None),
 ('divided', None),
 ('into', None),
 ('at', None),
 ('least', None),
 ('seven', None),
 ('major', None),
 ('``', None),
 ('culture', None),
 ('clusters', None),
 ("''", None),
 ('and', None),
 ('innumerable', None),
 ('tribes', None),
 ('speaking', None),
 ('400', None),
 ('separate', None),
 ('dialects', None),
 ('.', None)]

In [112]:
bigram_tagger.evaluate(test_sets) 

0.10206319146815508

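The bigram tagger manages to tag every word in a sentence it saw during training, but does very badly on an unseen sentence. As soon as it meets a word it has not seen (here 13.5), it assigns None. Every following word then has None as its preceding tag, a context the tagger did not see for those words in training, so they are tagged None as well.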

N-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, t<sub>n-1</sub> and preceding tags are set to None.


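As n grows, the contexts become more specific, and so does the chance that the data we want to tag contains contexts that were absent from the training data. This is known as the sparse data problem, and it is pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of the results.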
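
The way a single unseen word leaves the rest of the sentence untagged can be reproduced with a small hand-written bigram model. In this sketch the context is the pair (previous tag, current word), with None as the previous tag at the start of a sentence. `train_bigram` and `bigram_tag` are hypothetical helpers that simplify what BigramTagger does, and the one-sentence training set is toy data.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None                    # no context crosses a sentence boundary
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(model, words):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev_tag = [], None
    for word in words:
        tag = model.get((prev_tag, word))  # None if this context was never seen
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Hypothetical one-sentence training set using Brown-style tags.
train = [[('the', 'AT'), ('population', 'NN'), ('is', 'BEZ'), ('large', 'JJ')]]
model = train_bigram(train)

seen = bigram_tag(model, ['the', 'population', 'is', 'large'])
unseen = bigram_tag(model, ['the', 'Congo', 'is', 'large'])
print(seen)
print(unseen)
```

In the second sentence "is" and "large" were both seen in training, but never with None as the preceding tag, so they stay untagged once "Congo" fails. This is the coverage side of the trade-off described above.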

## 5.7 How to Determine the Category of a Word 

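### Morphological Clues

The internal structure of a word is an important clue to its category. For example, the suffix -ness turns an adjective into a noun (happy → happiness). Similarly, the suffix -ment turns a verb into a noun (govern → government).

English verbs can also be morphologically complex.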

### Syntactic Clues

Another source of information is the typical contexts in which a word can occur. For example, assuming we have already determined the category of nouns, an adjective is likely to appear immediately before a noun.

### Semantic Clues

Finally, the meaning of a word is an important clue to its lexical category.