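Part-of-speech (POS) tagging labels each word in a text according to its meaning, its context, and its grammatical properties. It is one of the basic tasks in natural language processing.
This section introduces the POS tagging features of NLTK.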

A POS (part-of-speech) tagger labels the grammatical role of each word in a text. The example below uses NLTK's pre-trained POS tagging model.

In [1]:
import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('completely', 'RB'),
 ('different', 'JJ')]

In [2]:
nltk.corpus.brown.tagged_words() # some of the corpora shipped with NLTK are already POS-tagged

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

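The following sections introduce the taggers available in NLTK and how to use them.

The simplest tagger in NLTK is DefaultTagger, which assigns the same tag to every word. Unsurprisingly, it performs the worst of the taggers shown here.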

In [3]:
tokens = word_tokenize("I do not like green eggs.")
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('.', 'NN')]

In [4]:
from nltk.corpus import brown
brown_tag_sents = brown.tagged_sents(categories='news')
default_tagger.accuracy(brown_tag_sents) # the tagging accuracy is very low

0.13089484257215028
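
The score above is a token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag. A minimal pure-Python sketch of that computation follows. The names `token_accuracy` and `tag_all_nn` and the toy `gold` data are assumptions made for illustration; this is not NLTK's implementation.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


def tag_all_nn(words):
    # Mimics a default tagger that assigns 'NN' to every token.
    return [(word, 'NN') for word in words]


# Hypothetical hand-tagged toy data (Brown-style tags), for illustration only.
gold = [
    [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('a', 'AT'), ('cat', 'NN')],
]
nn_accuracy = token_accuracy(tag_all_nn, gold)
print(nn_accuracy)
```

Only the two nouns out of five tokens are tagged correctly here, which mirrors why a default tagger scores so low on real text.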

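The regular expression tagger (RegexpTagger) matches words against regular-expression patterns and assigns the corresponding tag to every word that matches a given pattern. The patterns are tried in order, so the catch-all `.*` pattern must come last.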

In [5]:
patterns = [
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # simple past
     (r'.*es$', 'VBZ'),                 # 3rd singular present
     (r'.*ould$', 'MD'),                # modals
     (r'.*\'s$', 'NN$'),                # possessive nouns
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
     (r'.*', 'NN')                      # nouns (default)
]

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(tokens)
regexp_tagger.accuracy(brown_tag_sents)

0.20186168625812995
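
To make the role of pattern order concrete, here is a small sketch of first-match-wins regular-expression tagging written with the standard `re` module. The helper `regexp_tag` and the shortened pattern lists are assumptions for illustration, not NLTK code.

```python
import re

# Hypothetical, shortened pattern list; the first matching pattern wins.
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*es$', 'VBZ'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),
]


def regexp_tag(words, patterns):
    """Tag each word with the tag of the first pattern that matches it."""
    compiled = [(re.compile(regex), tag) for regex, tag in patterns]
    tagged = []
    for word in words:
        tag = next((t for rx, t in compiled if rx.match(word)), None)
        tagged.append((word, tag))
    return tagged


regexp_result = regexp_tag(['likes', 'eggs', 'running', 'green'], patterns)
print(regexp_result)

# Moving the more general '.*s$' pattern first changes the tag for 'likes'.
reordered = [patterns[2], patterns[1], patterns[0], patterns[3]]
reordered_result = regexp_tag(['likes'], reordered)
print(reordered_result)
```

Because `.*s$` also matches every word ending in `es`, it has to come after the more specific `.*es$` pattern for the latter to have any effect.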

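A lookup tagger builds a lookup table from the tags of the most frequent words in a text and uses that table for tagging. Words that are not in the table receive a default tag. In NLTK this can be implemented with UnigramTagger by passing the table as `model`.
As the number of words in the lookup table grows, the accuracy of the lookup tagger improves.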

In [6]:
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
lookup_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
lookup_tagger.accuracy(brown_tag_sents)

0.5817769556656125
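
The lookup-table idea can also be sketched without NLTK, which may help show what the `FreqDist` and `ConditionalFreqDist` calls are computing. The function names and the toy corpus below are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus, for illustration only.
tagged_words = [
    ('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ'),
    ('the', 'AT'), ('run', 'NN'), ('the', 'AT'),
    ('run', 'VB'), ('run', 'VB'), ('dog', 'NN'),
]


def build_lookup(tagged_words, n_words):
    """Map each of the n_words most frequent words to its most frequent tag."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(n_words)}


def lookup_tag(words, table, default='NN'):
    """Use the table where possible and fall back to the default tag."""
    return [(word, table.get(word, default)) for word in words]


table = build_lookup(tagged_words, n_words=2)
lookup_result = lookup_tag(['the', 'run', 'barks'], table)
print(table)
print(lookup_result)
```

Raising `n_words` puts more words in the table, so fewer tokens fall through to the default tag.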

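UnigramTagger implements a simple algorithm: each token is assigned the tag that is most likely for that particular token. It can be trained by passing it POS-tagged sentences. (The lookup tagger above is essentially the same tagger, restricted to the most frequent words.)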

In [7]:
size = int(len(brown_tag_sents) * 0.9)
train_sents = brown_tag_sents[:size]
test_sents = brown_tag_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
print("train accuracy:", unigram_tagger.accuracy(train_sents))
print("test accuracy:", unigram_tagger.accuracy(test_sents))

train accuracy: 0.9353630649241612
test accuracy: 0.8121200039868434


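N-gram tagging generalizes unigram tagging: the tag for a word is chosen from the word itself together with the tags of the preceding n-1 words as context.
Note that an n-gram tagger performs poorly on words and sentences it has not seen, because the training data contains no matching context. Once it meets an unseen context it assigns None, and every following token in the sentence then has an unseen context as well.
As n grows, each context occurs less often and the data becomes sparse: tags assigned in contexts that were seen tend to be more accurate, but coverage drops.
BigramTagger is used here as the example.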

In [8]:
brown_sents = brown.sents()
unseen_sent = brown_sents[4203]  # lies beyond the 90% training split
bigram_tagger = nltk.BigramTagger(train_sents)

print(bigram_tagger.tag(brown_sents[2007]))  # a sentence from the training data
print(bigram_tagger.tag(unseen_sent))        # tags become None after the first unseen context
print(bigram_tagger.accuracy(test_sents))

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'), ('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None), ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None), ('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None), ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None), ('separate', None), ('dialects', None), ('.', None)]
0.10206319146815508

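NLTK lets taggers be combined: a more accurate tagger does the tagging where it can, and when it cannot find a tag for a word it backs off to a tagger with wider coverage.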

In [9]:
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.accuracy(test_sents)

0.8452108043456593
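
The following pure-Python sketch shows why a bigram tagger alone collapses on unseen contexts and how a backoff chain repairs that. It assumes a context made of the previous n-1 tags plus the current word. The function names and toy sentences are illustrative assumptions, and the code is a simplification rather than NLTK's implementation.

```python
from collections import Counter, defaultdict


def train_context_table(tagged_sents, n):
    """Map (previous n-1 tags, word) to the tag seen most often in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        for i, (word, tag) in enumerate(sent):
            context = (tuple(tags[max(0, i - (n - 1)):i]), word)
            counts[context][tag] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_with_backoff(words, tables, default):
    """tables is a list of (n, table), ordered from most to least specific."""
    history = []
    for word in words:
        tag = None
        for n, table in tables:
            context = (tuple(history[max(0, len(history) - (n - 1)):]), word)
            tag = table.get(context)
            if tag is not None:
                break
        history.append(tag if tag is not None else default)
    return list(zip(words, history))


# Hypothetical toy training data, for illustration only.
train = [
    [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]
unigram = train_context_table(train, 1)
bigram = train_context_table(train, 2)
sentence = ['the', 'bird', 'sleeps']

bigram_only = tag_with_backoff(sentence, [(2, bigram)], default=None)
with_backoff = tag_with_backoff(sentence, [(2, bigram), (1, unigram)], default='NN')
print(bigram_only)
print(with_backoff)
```

In the bigram-only run the unseen word `bird` gets None, which in turn makes the context for `sleeps` unseen. With backoff, `bird` falls through to the default tag and the bigram table can be used again for `sleeps`.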

NLTK provides CRFTagger, a tagger that performs POS tagging with the conditional random field (CRF) algorithm.

In [11]:
crf_tagger = nltk.CRFTagger()
crf_tagger.train(train_sents[:10], 'model.crf.tagger') # train on only 10 sentences as a quick demo, because full training takes too long
crf_tagger.accuracy(test_sents)

0.5103159573407754
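
A CRF tagger works on features extracted for each token. As a rough illustration of what such a feature function can look like, the sketch below takes a token list and an index and returns a list of string features. The feature set and the name `word_features` are assumptions for illustration and are not NLTK's default features.

```python
import re


def word_features(tokens, idx):
    """Return a list of string features for tokens[idx] (hypothetical feature set)."""
    token = tokens[idx]
    features = ['WORD_' + token.lower()]
    if token[0].isupper():
        features.append('CAPITALIZED')
    if re.search(r'\d', token):
        features.append('HAS_DIGIT')
    features.append('SUF3_' + token[-3:])
    # Neighbouring words give the model context beyond the token itself.
    prev_word = tokens[idx - 1].lower() if idx > 0 else '<s>'
    next_word = tokens[idx + 1].lower() if idx + 1 < len(tokens) else '</s>'
    features.append('PREV_' + prev_word)
    features.append('NEXT_' + next_word)
    return features


sentence = ['The', 'Congo', 'is', '13.5', 'million']
features_for_congo = word_features(sentence, 1)
features_for_number = word_features(sentence, 3)
print(features_for_congo)
print(features_for_number)
```

Assuming the installed NLTK version accepts a `feature_func` argument on `CRFTagger`, a function with this shape could be passed as `nltk.CRFTagger(feature_func=word_features)`; check the documentation of your version before relying on this.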

NLTK provides PerceptronTagger, a tagger that performs POS tagging with a perceptron model.

In [13]:
perceptron_tagger = nltk.PerceptronTagger()
perceptron_tagger.train(train_sents[:10], 'model.perceptron.tagger') # separate save path, so the CRF model file is not overwritten
perceptron_tagger.accuracy(test_sents)

0.8686335094189176
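
The core of a perceptron tagger is a simple error-driven update: predict a tag from summed feature weights, and when the prediction is wrong, raise the weights for the true tag and lower them for the guessed tag. The sketch below shows that rule on toy data. The feature strings and function names are assumptions for illustration, and the weight averaging used by an averaged perceptron is left out.

```python
from collections import defaultdict


def predict(weights, features, classes):
    """Return the class with the highest summed feature weight."""
    scores = {c: 0.0 for c in classes}
    for feat in features:
        for c, w in weights[feat].items():
            scores[c] += w
    # Iterate in sorted order so that ties are broken deterministically.
    return max(sorted(classes), key=lambda c: scores[c])


def update(weights, features, truth, guess):
    """Perceptron rule: reward the true class and penalise the wrong guess."""
    if truth == guess:
        return
    for feat in features:
        weights[feat][truth] += 1.0
        weights[feat][guess] -= 1.0


# Hypothetical training examples: (features of a token, gold tag).
examples = [
    (['word=the', 'suffix=the'], 'AT'),
    (['word=dog', 'suffix=dog'], 'NN'),
    (['word=barks', 'suffix=rks'], 'VBZ'),
]
classes = {'AT', 'NN', 'VBZ'}
weights = defaultdict(lambda: defaultdict(float))

for epoch in range(3):
    for feats, truth in examples:
        update(weights, feats, truth, predict(weights, feats, classes))

predictions = [predict(weights, feats, classes) for feats, _ in examples]
print(predictions)
```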

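The Brill tagger performs transformation-based POS tagging. The overall idea is simple: guess a tag for each word, then go back and fix the mistakes. A Brill tagger is trained on correctly tagged data and maintains a list of transformation rules that successively turn wrong tags into more correct ones.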

In [14]:
from nltk.tbl import demo as brill_demo
brill_demo.demo() # a brief demonstration of how a Brill tagger is trained

Loading tagged data from treebank... 
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
    Accuracy on test set: 0.8358
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
    Found 12799 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]
  12  12   0   0  | VBP->VB if Pos:TO@[-1]
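
Rules such as `NN->VB if Pos:TO@[-1]` in the output above read as "change NN to VB if the tag one position to the left is TO". The sketch below applies rules of this shape to a list of baseline tags. The function `apply_rule` and the example tag sequences are assumptions for illustration and do not reproduce NLTK's rule classes.

```python
def apply_rule(tags, from_tag, to_tag, trigger_tag, offsets):
    """Rewrite from_tag to to_tag where trigger_tag occurs at any relative offset."""
    original = list(tags)  # conditions are checked against the tags before this rule fires
    result = list(tags)
    for i, tag in enumerate(original):
        if tag != from_tag:
            continue
        for off in offsets:
            j = i + off
            if 0 <= j < len(original) and original[j] == trigger_tag:
                result[i] = to_tag
                break
    return result


# Hypothetical baseline tags for "I want to run": 'run' was guessed as a noun.
baseline_1 = ['PRP', 'VBP', 'TO', 'NN']
fixed_1 = apply_rule(baseline_1, 'NN', 'VB', 'TO', offsets=[-1])

# Hypothetical baseline tags for "He can run": rule VBP->VB if Pos:MD@[-2,-1].
baseline_2 = ['PRP', 'MD', 'VBP']
fixed_2 = apply_rule(baseline_2, 'VBP', 'VB', 'MD', offsets=[-2, -1])

print(fixed_1)
print(fixed_2)
```

During training, each candidate rule is scored by how many tags it fixes minus how many it breaks, which is what the Score, Fixed, and Broken columns in the demo output report.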
  