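Part-of-speech (POS) tagging labels each word in a text according to its meaning, its context, and its grammatical properties. It is a basic task in natural language processing.
This section introduces the POS tagging features of NLTK.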

A POS (part-of-speech) tagger marks the grammatical role of each word in a text. The example below uses NLTK's pretrained POS tagging model.

In [None]:
import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

In [None]:
nltk.corpus.brown.tagged_words()  # some corpora shipped with NLTK, such as Brown, are already POS-tagged

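The rest of this section walks through the taggers available in NLTK and how to use them.

The simplest tagger in NLTK is DefaultTagger, which assigns the same tag to every word. Unsurprisingly, it performs worst of the taggers covered here.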

In [None]:
tokens = word_tokenize("I do not like green eggs.")
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

In [None]:
from nltk.corpus import brown
brown_tag_sents = brown.tagged_sents(categories='news')
default_tagger.evaluate(brown_tag_sents)  # tagging accuracy is very low
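
To make the baseline concrete, here is a minimal plain-Python sketch of what a default tagger amounts to: pick one tag (usually the most frequent one in the data) and assign it everywhere. The tiny `tagged_sents` sample and the helper names are made up for illustration. This is not NLTK's implementation.

```python
from collections import Counter

# Hypothetical hand-tagged sample; a real run would use brown.tagged_sents().
tagged_sents = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "AT"), ("cat", "NN"), ("saw", "VBD"), ("the", "AT"), ("bird", "NN")],
    [("time", "NN"), ("flies", "VBZ")],
]

# Count how often each tag occurs and take the most frequent one as the default.
tag_counts = Counter(tag for sent in tagged_sents for (_, tag) in sent)
default_tag = tag_counts.most_common(1)[0][0]


def default_tag_all(tokens, tag):
    """Assign the same tag to every token, as a default tagger does."""
    return [(token, tag) for token in tokens]


def default_accuracy(tagged_sents, tag):
    """Fraction of tokens whose gold tag equals the single default tag."""
    gold = [t for sent in tagged_sents for (_, t) in sent]
    return sum(t == tag for t in gold) / len(gold)


print(default_tag)
print(default_tag_all(["I", "do", "not"], default_tag))
print(default_accuracy(tagged_sents, default_tag))
```

Even with the best possible choice of default tag, accuracy cannot exceed the share of tokens that carry that tag, which is why this tagger is mainly useful as a fallback.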

The Regular Expression Tagger matches each word against a list of regular expressions and assigns the tag of the first pattern that matches.

In [None]:
patterns = [
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # simple past
     (r'.*es$', 'VBZ'),                 # 3rd singular present
     (r'.*ould$', 'MD'),                # modals
     (r'.*\'s$', 'NN$'),                # possessive nouns
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'.*', 'NN')                      # nouns (default)
]

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(tokens)
regexp_tagger.evaluate(brown_tag_sents)
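
Because the first matching pattern wins, the order of the patterns matters. The sketch below reimplements that idea with the standard `re` module on a reduced pattern list. The function `regexp_tag` is a hypothetical helper for illustration, not the NLTK class.

```python
import re

# A reduced version of the pattern list above, kept short for illustration.
patterns = [
    (r".*ing$", "VBG"),  # gerunds
    (r".*'s$", "NN$"),   # possessive nouns
    (r".*s$", "NNS"),    # plural nouns
    (r".*", "NN"),       # default: noun
]


def regexp_tag(tokens, patterns):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    result = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                result.append((token, tag))
                break
    return result


print(regexp_tag(["running", "dog's", "dogs", "dog"], patterns))

# Putting the plural rule before the possessive rule changes the result for "dog's".
swapped = [patterns[0], patterns[2], patterns[1], patterns[3]]
print(regexp_tag(["dog's"], swapped))
```

This is why the possessive rule is listed before the plural rule: a more specific pattern has to come before a more general one that would also match.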

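A Lookup Tagger builds a lookup table that maps each of the most frequent words in the text to its most likely tag, then uses the table to tag text. Words that are not in the table receive a default tag. In NLTK this is done by passing the table to UnigramTagger through its model argument.
As the number of words in the lookup table grows, the accuracy of the Lookup Tagger improves.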

In [None]:
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
lookup_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
lookup_tagger.evaluate(brown_tag_sents)
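
The effect of the table size can be seen in a small plain-Python sketch. The corpus `tagged_words` and the helpers `build_lookup` and `lookup_accuracy` are hypothetical and stand in for the Brown data and the NLTK classes used above.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus, flattened to (word, tag) pairs.
tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("the", "AT"), ("cat", "NN"),
    ("the", "AT"), ("cat", "NN"), ("ran", "VBD"), ("a", "AT"), ("dog", "NN"),
    ("saw", "VBD"), ("a", "AT"), ("saw", "NN"),
]


def build_lookup(tagged_words, k):
    """Map each of the k most frequent words to its most frequent tag."""
    word_freq = Counter(word for word, _ in tagged_words)
    tags_by_word = defaultdict(Counter)
    for word, tag in tagged_words:
        tags_by_word[word][tag] += 1
    return {
        word: tags_by_word[word].most_common(1)[0][0]
        for word, _ in word_freq.most_common(k)
    }


def lookup_accuracy(tagged_words, table, default="NN"):
    """Accuracy of the table, falling back to a default tag for unknown words."""
    correct = sum(table.get(word, default) == tag for word, tag in tagged_words)
    return correct / len(tagged_words)


# Accuracy on this sample as the lookup table grows.
for k in (0, 1, 3, 5, 6):
    table = build_lookup(tagged_words, k)
    print(k, round(lookup_accuracy(tagged_words, table), 3))
```

With `k=0` the table is empty and the result equals the default tagger. Accuracy here is measured on the same data the table was built from, so it is an optimistic estimate.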

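UnigramTagger is based on a simple algorithm: for each token, assign the tag that is most likely for that particular token. It can be trained by passing it POS-tagged sentences (the lookup tagger above is effectively the same model, built from only the most frequent words).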

In [None]:
size = int(len(brown_tag_sents) * 0.9)
train_sents = brown_tag_sents[:size]
test_sents = brown_tag_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
print("train accuracy:",unigram_tagger.evaluate(train_sents))
print("test accuracy:",unigram_tagger.evaluate(test_sents))

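N-gram tagging generalizes unigram tagging: the tag for the current word is chosen using the word itself together with the tags of the n-1 preceding words as context.
Note that an n-gram tagger performs poorly on words and sentences it has not seen, because it has no matching context to draw on.
As n grows, each context occurs only a few times and the data becomes sparse: tags assigned in contexts that were seen in training tend to be precise, but coverage drops.
BigramTagger is used as the example here.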

In [None]:
brown_sents = brown.sents()
unseen_sent = brown_sents[4203]
bigram_tagger = nltk.BigramTagger(train_sents)

print(bigram_tagger.tag(brown_sents[2007]))
print(bigram_tagger.tag(unseen_sent))
print(bigram_tagger.evaluate(test_sents))
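
The poor behaviour on unseen sentences follows directly from how the context is built. The sketch below is a plain-Python approximation of a bigram tagger keyed on (previous tag, current word). The training data and the helper names are hypothetical, and the code is a simplification rather than NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("runs", "VBZ")],
]


def train_bigram(tagged_sents):
    """Map (previous tag, word) to the tag most often seen in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # there is no previous tag at the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def bigram_tag(tokens, model):
    """Tag left to right; a context never seen in training yields None."""
    tagged, prev = [], None
    for word in tokens:
        tag = model.get((prev, word))
        tagged.append((word, tag))
        prev = tag
    return tagged


model = train_bigram(train)
print(bigram_tag(["the", "dog", "runs"], model))
print(bigram_tag(["the", "fox", "runs"], model))
```

In the second sentence the unknown word "fox" gets `None`, and so does "runs", even though "runs" appeared in training: it was only ever seen after an `NN` tag, never after `None`. One unseen word can therefore derail the rest of the sentence.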

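NLTK lets you combine taggers through backoff: the more accurate tagger does the tagging where it can, and when it cannot find a tag for a word it falls back to a tagger with wider coverage.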

In [None]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
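
The backoff chain above can be pictured as a sequence of lookups tried in order. The following plain-Python sketch uses hand-built, hypothetical tables in place of trained models to show the order in which the three levels are consulted.

```python
def backoff_tag(tokens, bigram_model, unigram_model, default="NN"):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev = [], None
    for word in tokens:
        tag = bigram_model.get((prev, word))
        if tag is None:
            tag = unigram_model.get(word)
        if tag is None:
            tag = default
        tagged.append((word, tag))
        prev = tag  # the next context uses whichever tag was finally chosen
    return tagged


# Hypothetical tables; in practice these would be learned from train_sents.
bigram_model = {(None, "the"): "AT", ("AT", "dog"): "NN", ("NN", "runs"): "VBZ"}
unigram_model = {"the": "AT", "dog": "NN", "runs": "VBZ", "fast": "RB"}

print(backoff_tag(["the", "fox", "runs", "fast"], bigram_model, unigram_model))
```

Here "fox" is unknown to both tables and receives the default `NN`, which in turn lets the bigram table tag "runs". "fast" has no bigram context and is resolved by the unigram table. Every token ends up with a tag, which is the point of chaining.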

NLTK provides CRFTagger, a tagger based on the conditional random field (CRF) algorithm.

In [None]:
crf_tagger = nltk.CRFTagger()  # requires the python-crfsuite package
crf_tagger.train(train_sents,'model.crf.tagger')  # the trained model is written to this file
crf_tagger.evaluate(test_sents)

NLTK provides PerceptronTagger, a tagger based on the perceptron model.

In [None]:
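perceptron_tagger = nltk.PerceptronTagger(load=False)  # start untrained instead of loading the pretrained model
perceptron_tagger.train(list(train_sents), 'model.perceptron.tagger')  # save under a perceptron-specific name, not the CRF model file
perceptron_tagger.evaluate(test_sents)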
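
For intuition about what the perceptron learns, here is a heavily simplified sketch of a perceptron tagger in plain Python. The feature set, the training data, and all function names are assumptions made for illustration. NLTK's tagger uses a much richer feature set and averages the weights, neither of which is shown here.

```python
from collections import defaultdict

TAGS = ["AT", "NN", "VBZ"]

# Hypothetical training sentences.
train_data = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]


def features(word, prev_tag):
    """A tiny, made-up feature set: the word, its suffix, and the previous tag."""
    return [f"word={word.lower()}", f"suffix={word[-2:]}", f"prev={prev_tag}"]


def predict(weights, feats, tags):
    """Return the tag with the highest summed feature weight."""
    scores = {t: sum(weights[f][t] for f in feats) for t in tags}
    # Iterate tags in sorted order so ties are broken deterministically.
    return max(sorted(tags), key=lambda t: scores[t])


def train(tagged_sents, tags, epochs=5):
    """On each mistake, reward the gold tag and penalize the guessed tag."""
    weights = defaultdict(lambda: defaultdict(float))
    for _ in range(epochs):
        for sent in tagged_sents:
            prev = "<s>"
            for word, gold in sent:
                feats = features(word, prev)
                guess = predict(weights, feats, tags)
                if guess != gold:
                    for f in feats:
                        weights[f][gold] += 1.0
                        weights[f][guess] -= 1.0
                prev = gold
    return weights


def tag(weights, tokens, tags):
    """Tag left to right, feeding each predicted tag into the next context."""
    tagged, prev = [], "<s>"
    for word in tokens:
        guess = predict(weights, features(word, prev), tags)
        tagged.append((word, guess))
        prev = guess
    return tagged


weights = train(train_data, TAGS)
print(tag(weights, ["the", "fox", "sleeps"], TAGS))
```

The word "fox" never appears in the training data, yet the previous-tag feature is enough to tag it as a noun. This ability to generalize from features is what distinguishes this family of taggers from the table lookups above.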

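The Brill Tagger performs transformation-based tagging. The overall idea is simple: guess a tag for every word, then go back and fix the mistakes. The Brill Tagger is trained on correctly tagged data and learns an ordered list of transformation rules. Applying these rules one after another turns wrong tags into better ones.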

In [None]:
from nltk.tbl import demo as brill_demo
brill_demo.demo()  # a short demonstration of the Brill Tagger training process
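
To illustrate what a single transformation rule does, the sketch below applies one hand-written rule of the classic form "change tag A to tag B when the previous tag is C". The rule, the initial tagging, and the helper `apply_rule` are hypothetical. In a real Brill tagger the rules are learned from training data, and the rule templates are more varied.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding token carries prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, current = result[i]
        if current == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result


# Initial guess, e.g. from a unigram tagger: "increase" is wrongly tagged as a noun.
initial = [("I", "PPSS"), ("want", "VB"), ("to", "TO"), ("increase", "NN"), ("sales", "NNS")]

# An ordered rule list with one rule: NN becomes VB after TO.
rules = [("NN", "VB", "TO")]

corrected = initial
for from_tag, to_tag, prev_tag in rules:
    corrected = apply_rule(corrected, from_tag, to_tag, prev_tag)

print(corrected)
```

Training consists of repeatedly picking the rule that fixes the most errors on the tagged training data, adding it to the list, and applying it before searching for the next rule.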