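This section covers working with corpora through NLTK's corpus module.
The corpus module gives access to the corpora distributed with NLTK, which cover many languages and text types and can be used for natural language processing tasks.
Many text corpora carry linguistic annotations such as POS tags, named entities, syntactic structures, and semantic roles.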

In [1]:
import nltk
from nltk.corpus import gutenberg
gutenberg.fileids() # file identifiers in the corpus

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [2]:
for fileid in gutenberg.fileids():
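    num_chars = len(gutenberg.raw(fileid)) # raw text as one string, so this counts characters (whitespace included)
    num_words = len(gutenberg.words(fileid)) # the text as a sequence of word tokens
    num_sents = len(gutenberg.sents(fileid)) # the text as a sequence of sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid))) # distinct case-folded tokens
    # average word length, average sentence length, average uses per vocabulary item
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)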

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
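
The three columns printed above are the average word length (characters per token), the average sentence length (tokens per sentence), and the average number of times each vocabulary item is used. The sketch below repeats the same arithmetic on a tiny hand-made sample instead of a real corpus. `corpus_stats` and the sample text are illustrative and not part of NLTK.

```python
def corpus_stats(raw, sents):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    words = [w for sent in sents for w in sent]
    vocab = {w.lower() for w in words}  # case-folded, as in the cell above
    return (
        round(len(raw) / len(words)),
        round(len(words) / len(sents)),
        round(len(words) / len(vocab)),
    )

# Made-up sample: the raw string and the same text as tokenized sentences.
raw = "The cat sat. The cat ran."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "."]]

stats = corpus_stats(raw, sents)
print(stats)
```

Because the character count is taken from the raw string, whitespace is included, so the first figure slightly overstates the true average word length.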


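A corpus can be accessed as a list of words or a list of sentences, and its texts can be selected by category, by file name, and so on.
NLTK's text corpora are organized in several different ways:
+ isolated texts with no organization among them
+ non-overlapping categories, such as genre
+ overlapping categories, such as topic
+ organization over time, such as the inaugural address corpus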

In [3]:
from nltk.corpus import brown
categories = brown.categories()
news_text = brown.words(categories='news') # select texts by category
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ":", fdist[m], end = ' ')

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389 
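
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the modal-verb count above can be sketched without downloading any corpus. The token list here is made up for illustration; only the counting pattern mirrors the cell above.

```python
from collections import Counter

# Made-up tokens standing in for brown.words(categories='news').
tokens = ["You", "can", "go", ",", "but", "she", "will", "stay",
          "and", "he", "Will", "too"]

# Lower-case before counting so "Will" and "will" are merged.
counts = Counter(w.lower() for w in tokens)

modals = ["can", "could", "may", "might", "must", "will"]
modal_counts = {m: counts[m] for m in modals}  # absent words count as 0
print(modal_counts)
```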

NLTK provides corpora in many languages. Note that corpora in different languages may use different character encodings.

In [4]:
print(nltk.corpus.indian.words()[:10]) # corpus of Indian languages
print(nltk.corpus.udhr.fileids()) # udhr: the Universal Declaration of Human Rights in over 300 languages

['মহিষের', 'সন্তান', ':', 'তোড়া', 'উপজাতি', '৷', 'বাসস্থান-ঘরগৃহস্থালি', 'তোড়া', 'ভাষায়', 'গ্রামকেও']
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1', 'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1', 'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', 'Amarakaeri-Latin1', 'Amuesha-Yanesha-UTF8', 'Arabela-Latin1', 'Arabic_Alarabia-Arabic', 'Asante-UTF8', 'Ashaninca-Latin1', 'Asheninca-Latin1', 'Asturian_Bable-Latin1', 'Aymara-Latin1', 'Balinese-Latin1', 'Bambara-UTF8', 'Baoule-UTF8', 'Basque_Euskara-Latin1', 'Batonu_Bariba-UTF8', 'Belorus_Belaruski-Cyrillic', 'Belorus_Belaruski-UTF8', 'Bemba-Latin1', 'Bengali-UTF8', 'Beti-UTF8', 'Bichelamar-Latin1', 'Bikol_Bicolano-Latin1', 'Bora-Latin1', 'Bosnian_Bosanski-Cyrillic', 'Bosnian_Bosanski-Latin2', 'Bosnian_Bosanski-UTF8', 'Breton-Latin1', 'Bugisnese-Latin1', 'Bulgarian_Balgarski-Cyrillic', 'Bulgarian_Balgarski-UTF8', 'Cakchiquel-Latin1', 'Campa_Paj

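NLTK also lets you load your own texts.
Use ``PlaintextCorpusReader`` to load plain-text files directly and access them as a corpus.
If you have a local copy of the ``Penn Treebank``, you can access it with ``BracketParseCorpusReader``.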

In [5]:
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import BracketParseCorpusReader
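
The cell above only imports the readers. As a minimal sketch of how ``PlaintextCorpusReader`` might be used, the example below writes two small files to a temporary directory and reads them back. The file names and contents are invented for illustration. Only `words()` is called, on the assumption that the Punkt sentence model needed by `sents()` may not be installed.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Hypothetical corpus: two tiny plain-text files.
    samples = [("a.txt", "Hello world. This is a test."),
               ("b.txt", "Another small file.")]
    for name, text in samples:
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(text)

    # The second argument is a regular expression selecting the corpus files.
    reader = PlaintextCorpusReader(root, r".*\.txt")
    file_ids = reader.fileids()
    # Materialize the tokens while the temporary directory still exists.
    words_a = list(reader.words("a.txt"))

print(file_ids)
print(words_a)
```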

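The corpora provided by NLTK come in several forms:
+ word lists
+ tables, where each row holds a word and its properties
+ Toolbox files, which consist of a set of entries, each made up of one or more fields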

In [6]:
from nltk.corpus import stopwords # stop words
stopwords.words('english') # example of a word list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [7]:
from nltk.corpus import cmudict # CMU Pronouncing Dictionary, an example of a table corpus
print(len(cmudict.entries()))
for entry in cmudict.entries()[42371:42379]:
    print(entry)

133737
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])
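
Each entry is a (word, phoneme list) pair, and a word with several pronunciations appears in several entries. A minimal sketch of grouping the pronunciations by word is shown below, using a few entries copied from the output above rather than the full dictionary. NLTK also offers `cmudict.dict()` for the same purpose.

```python
from collections import defaultdict

# A few entries copied from the printed output.
entries = [
    ("fir", ["F", "ER1"]),
    ("fire", ["F", "AY1", "ER0"]),
    ("fire", ["F", "AY1", "R"]),
    ("firearm", ["F", "AY1", "ER0", "AA2", "R", "M"]),
]

# Map each word to the list of all its pronunciations.
pronunciations = defaultdict(list)
for word, phones in entries:
    pronunciations[word].append(phones)

print(len(pronunciations["fire"]))
print(pronunciations["fire"])
```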


In [8]:
from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[:5]

[('kaa',
  [('ps', 'V'),
   ('pt', 'A'),
   ('ge', 'gag'),
   ('tkp', 'nek i pas'),
   ('dcsv', 'true'),
   ('vx', '1'),
   ('sc', '???'),
   ('dt', '29/Oct/2005'),
   ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
   ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
   ('xe', 'Apoka is gagging from food while talking.')]),
 ('kaa',
  [('ps', 'V'),
   ('pt', 'B'),
   ('ge', 'strangle'),
   ('tkp', 'pasim nek'),
   ('arg', 'O'),
   ('vx', '2'),
   ('dt', '07/Oct/2006'),
   ('ex', 'Rera rauroro rera kaarevoi.'),
   ('xp', 'Em i holim pas em na nekim em.'),
   ('xe', 'He is holding him and strangling him.'),
   ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'),
   ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'),
   ('xe',
    "They strangled the man's neck with rope because he was very stubborn and arrogant."),
   ('ex',
    'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'),
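
Each Toolbox entry is a (headword, fields) pair, where the fields are a list of (marker, value) tuples. A marker such as 'ex' can occur several times in one entry, so converting the fields to a dict would drop data. The sketch below collects every value for a given marker. `field_values` is a hypothetical helper, and the entry is an abbreviated copy of the first one printed above.

```python
# Abbreviated copy of the first 'kaa' entry from the output.
entry = ("kaa", [
    ("ps", "V"),
    ("pt", "A"),
    ("ge", "gag"),
    ("tkp", "nek i pas"),
    ("ex", "Apoka ira kaaroi aioa-ia reoreopaoro."),
    ("xe", "Apoka is gagging from food while talking."),
])

def field_values(entry, marker):
    """Return all values stored under `marker`, in their original order."""
    _headword, fields = entry
    return [value for name, value in fields if name == marker]

print(field_values(entry, "ge"))
print(field_values(entry, "xe"))
```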
 

WordNet is a semantically oriented dictionary of English, and NLTK ships it as a corpus. The rest of this section explores WordNet.

In [14]:
# WordNet can be used to look up a word's senses, synonyms, and antonyms
from nltk.corpus import wordnet
print(wordnet.synsets('motorcar')) # synsets (senses) of the word
print(wordnet.synset('car.n.01').lemma_names()) # synonyms in this synset
print(wordnet.synset('car.n.01').definition()) # definition of this sense

[Synset('car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
a motor vehicle with four wheels; usually propelled by an internal combustion engine


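WordNet synsets correspond to abstract concepts, which do not always have a corresponding word in English. These concepts are linked together in a hierarchy.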

In [15]:
motorcar = wordnet.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms() # more specific concepts (hyponyms) of car
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())

['Model_T',
 'S.U.V.',
 'SUV',
 'Stanley_Steamer',
 'ambulance',
 'beach_waggon',
 'beach_wagon',
 'bus',
 'cab',
 'compact',
 'compact_car',
 'convertible',
 'coupe',
 'cruiser',
 'electric',
 'electric_automobile',
 'electric_car',
 'estate_car',
 'gas_guzzler',
 'hack',
 'hardtop',
 'hatchback',
 'heap',
 'horseless_carriage',
 'hot-rod',
 'hot_rod',
 'jalopy',
 'jeep',
 'landrover',
 'limo',
 'limousine',
 'loaner',
 'minicar',
 'minivan',
 'pace_car',
 'patrol_car',
 'phaeton',
 'police_car',
 'police_cruiser',
 'prowl_car',
 'race_car',
 'racer',
 'racing_car',
 'roadster',
 'runabout',
 'saloon',
 'secondhand_car',
 'sedan',
 'sport_car',
 'sport_utility',
 'sport_utility_vehicle',
 'sports_car',
 'squad_car',
 'station_waggon',
 'station_wagon',
 'stock_car',
 'subcompact',
 'subcompact_car',
 'taxi',
 'taxicab',
 'tourer',
 'touring_car',
 'two-seater',
 'used-car',
 'waggon',
 'wagon']

In [16]:
paths = motorcar.hypernym_paths()
[synset.name() for synset in paths[0]] # path from the root of the hierarchy down to car

['entity.n.01',
 'physical_entity.n.01',
 'object.n.01',
 'whole.n.02',
 'artifact.n.01',
 'instrumentality.n.03',
 'container.n.01',
 'wheeled_vehicle.n.01',
 'self-propelled_vehicle.n.01',
 'motor_vehicle.n.01',
 'car.n.01']

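Synsets in the WordNet hierarchy are also connected by further, more complex relations:
+ part meronyms: the components that the thing consists of
+ substance meronyms: the substances that the thing is made of
+ member holonyms: the larger groups that the thing is a member of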

In [17]:
print(wordnet.synset('tree.n.01').part_meronyms())
print(wordnet.synset('tree.n.01').substance_meronyms())
print(wordnet.synset('tree.n.01').member_holonyms())

[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
[Synset('forest.n.01')]


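WordNet also records entailment relations between verbs, as well as antonyms, which are defined between lemmas rather than synsets.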

In [18]:
print(wordnet.synset('walk.v.01').entailments()) # what the verb entails
print(wordnet.lemma('horizontal.a.01.horizontal').antonyms()) # antonyms

[Synset('step.v.01')]
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]


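Because synsets are linked in a network, different words can share a common hypernym. The lowest common hypernym of two synsets gives an indication of how closely their meanings are related.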

In [20]:
right = wordnet.synset('right_whale.n.01')
minke = wordnet.synset('minke_whale.n.01')
tortoise = wordnet.synset('tortoise.n.01')
print(right.lowest_common_hypernyms(minke))
print(right.lowest_common_hypernyms(tortoise))

[Synset('baleen_whale.n.01')]
[Synset('vertebrate.n.01')]
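
The idea behind a lowest common hypernym can be sketched on a toy hierarchy: walk from one node up to the root, then climb from the other node until the two paths meet. The tree below is a simplified, single-parent stand-in written by hand, not data read from WordNet, where a synset can have several hypernyms. `path_to_root` and `lowest_common_ancestor` are illustrative names.

```python
# Toy hierarchy: child -> parent, with None marking the root.
parent = {
    "right_whale": "baleen_whale",
    "minke_whale": "rorqual",
    "rorqual": "baleen_whale",
    "baleen_whale": "whale",
    "whale": None,
}

def path_to_root(node):
    """List the node followed by all its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def lowest_common_ancestor(a, b):
    """Return the deepest node that lies on both paths to the root."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # visited from deepest to shallowest
        if node in ancestors_of_a:
            return node
    return None

lca = lowest_common_ancestor("right_whale", "minke_whale")
print(lca)
```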


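Since the hierarchy moves from general concepts to increasingly specific ones, a synset's depth in the hierarchy indicates how specific or general the concept is.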

In [23]:
print(wordnet.synset('right_whale.n.01').min_depth())
print(wordnet.synset('minke_whale.n.01').min_depth())
print(wordnet.synset('tortoise.n.01').min_depth())

15
16
13


The similarity of two synsets can likewise be measured from the path that connects them in the hierarchy.

In [24]:
print(right.path_similarity(minke))
print(right.path_similarity(tortoise))

0.25
0.07692307692307693
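
NLTK's `path_similarity` is 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets in the hypernym hierarchy. Working backwards from the scores above, 0.25 corresponds to d = 3 and 0.0769... to d = 12. The sketch below checks that arithmetic. The distances are inferred from the printed scores rather than read from WordNet, and the function name is illustrative.

```python
def path_similarity_from_distance(distance):
    """Similarity score for two nodes that are `distance` edges apart."""
    return 1 / (distance + 1)

sim_minke = path_similarity_from_distance(3)      # right whale vs. minke whale
sim_tortoise = path_similarity_from_distance(12)  # right whale vs. tortoise
print(sim_minke)
print(sim_tortoise)
```

A distance of 0 gives a score of 1.0, so a synset is maximally similar to itself, and the score shrinks toward 0 as the path grows longer.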
