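# Chapter 3: Processing Raw Text
This chapter covers how to:
1. Write programs that access local and web files to obtain a corpus
2. Split documents into individual words and punctuation marks for corpus analysis
3. Format output and save the results to a file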

## 3.1 Accessing Text from the Web and from Disk

### 3.1.1 Electronic Books

In [2]:
import nltk, re, pprint
from nltk import word_tokenize

In [3]:
# Read a file from the web
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"  # the original address has changed; adjust it to match the current page
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(type(raw))
print(len(raw))
raw[1:75]

<class 'str'>
1176965


'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

In [4]:
# Tokenize with NLTK
tokens = word_tokenize(raw[1:])  # the slice drops the first character; tokenizing yields a list of words and punctuation marks
print(type(tokens))
print(len(tokens))
tokens[:10]

<class 'list'>
257726


['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

In [5]:
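# Working with the token list
text = nltk.Text(tokens)
print(type(text))
print(text[1021:1059])
print(text.collocations())  # word sequences that often occur together; collocations() prints them itself and returns None, so print() also shows None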

<class 'nltk.text.Text'>
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.']
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
None


In [6]:
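# Forward search with find() and reverse search with rfind(); both return -1 when the string is not found
print(raw.find("PART I"))
print(raw.rfind("End of Project Gutenberg's Crime"))
raw = raw[5336:1157741]  # keep only the body of the book; these offsets are specific to this download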
print(raw.find("PART I"))

5336
-1
0
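
The slice bounds above are hard-coded for one particular download. As a sketch only, the same trimming can be driven by the search results themselves, falling back to the full text when a marker is missing. The helper name `trim_between`, the sample string and the end marker are made up for illustration.

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from start_marker up to (not including) the last end_marker."""
    start = raw.find(start_marker)   # -1 if the marker is missing
    end = raw.rfind(end_marker)      # -1 if the marker is missing
    if start == -1:
        start = 0                    # keep everything from the beginning
    if end == -1:
        end = len(raw)               # keep everything up to the end
    return raw[start:end]

# Hypothetical miniature "book" with a header and a footer
sample = ("License header\n"
          "PART I\nOn an exceptionally hot evening...\n"
          "End of Project\nfooter")
body = trim_between(sample, "PART I", "End of Project")
print(body)
```

Checking for -1 matters because slicing with -1 would silently drop the last character instead of raising an error.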


### 3.1.2 Dealing with HTML

In [7]:
# Fetch the full HTML content of a web page
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [8]:
# Extract the raw text
from bs4 import BeautifulSoup
raw = BeautifulSoup(html,'lxml').get_text()
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

In [9]:
# Show each occurrence of the word 'gene' in the text, together with its context
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
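
The tag-stripping step performed by `get_text()` can be approximated with the standard library when BeautifulSoup is not available. The following is a rough sketch, not a replacement for a real HTML library; the class name `TextExtractor`, the helper `html_to_text` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0   # nesting depth of script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace
    return " ".join(" ".join(parser.parts).split())

sample_html = ("<html><head><style>p {color: red}</style></head>"
               "<body><h1>Blondes</h1><p>to die out in <b>200</b> years</p></body></html>")
print(html_to_text(sample_html))
```

As with the BBC page above, navigation menus and other page furniture would still end up in the text, so slicing the token list afterwards remains necessary.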


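### 3.1.3 Processing Search Engine Results
The web can be regarded as a huge unannotated corpus.  
Advantage:  
Scale  
Limitations:  
The permitted range of searches is restricted  
Search engines generally only allow searching for single words or word strings, plus wildcards  
Search engines return inconsistent results, which differ across times and regions  
The markup in the results a search engine returns may change unpredictably  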

### 3.1.4 Processing RSS Feeds

In [10]:
# Universal Feed Parser, a third-party Python library, gives access to the content of a blog
import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
print(llog['feed']['title'])
print(len(llog.entries))

Language Log
13


In [11]:
# Access the second blog post
post = llog.entries[1]
post.title

'Year Hare Affair'

In [12]:
# Extract the content of the post
content = post.content[0].value
content[:70]

"<p>That's the abbreviated title of a popular webcomic by Lin Chao 林超.\xa0"

In [13]:
raw = BeautifulSoup(content,'lxml').get_text()
word_tokenize(raw)

['That',
 "'s",
 'the',
 'abbreviated',
 'title',
 'of',
 'a',
 'popular',
 'webcomic',
 'by',
 'Lin',
 'Chao',
 '林超',
 '.',
 'The',
 'full',
 'title',
 'in',
 'Chinese',
 'is',
 'Nà',
 'nián',
 'nà',
 'tù',
 'nàxiē',
 'shì',
 '那年那兔那些事',
 '(',
 'lit.',
 ',',
 '``',
 'that',
 'year',
 'that',
 'rabbit',
 'those',
 'affairs',
 "''",
 ';',
 'i.e.',
 ',',
 '``',
 'The',
 'story',
 'of',
 'that',
 'rabbit',
 'that',
 'happened',
 'in',
 'that',
 'year',
 "''",
 ')',
 'From',
 'the',
 'beginning',
 'of',
 'the',
 'Wikipedia',
 'article',
 ':',
 'The',
 'comic',
 'uses',
 'animals',
 'as',
 'an',
 'allegory',
 'for',
 'nations',
 'and',
 'sovereign',
 'states',
 'to',
 'represent',
 'political',
 'and',
 'military',
 'events',
 'in',
 'history',
 '.',
 'The',
 'goal',
 'of',
 'this',
 'project',
 'was',
 'to',
 'promote',
 'nationalistic',
 'pride',
 'in',
 'young',
 'people',
 ',',
 'and',
 'focuses',
 'on',
 'appreciation',
 'for',
 'China',
 "'s",
 'various',
 'achievements',
 'since',
 't

### 3.1.5 Reading Local Files

In [14]:
import os

In [15]:
# List the contents of the current directory
os.listdir('.')

['.android',
 '.astropy',
 '.bash_history',
 '.conda',
 '.continuum',
 '.defaults-0.1.0.ini',
 '.eclipse',
 '.idlerc',
 '.ipynb_checkpoints',
 '.ipython',
 '.jupyter',
 '.keras',
 '.m2',
 '.matplotlib',
 '.oracle_jre_usage',
 '.p2',
 '.PyCharmCE2017.3',
 '.spyder-py3',
 '.ssh',
 '.theanorc.txt',
 '.VirtualBox',
 '1.baseline+script.ipynb',
 '3D Objects',
 'Anaconda3',
 'AppData',
 'Application Data',
 'chg.txt',
 'Contacts',
 'Cookies',
 'data',
 'Desktop',
 'Documents',
 'Downloads',
 'dump.raw.txt',
 'DUTIRTone',
 'Favorites',
 'gbdt.csv',
 'IntelGraphicsProfiles',
 'Links',
 'Local Settings',
 'logs',
 'logs1',
 'lr.csv',
 'matplotlib.ipynb',
 'MiCloud',
 'MNIST_data',
 'Music',
 'My Documents',
 'NetHood',
 'nlp',
 'nlp.ipynb',
 'NLP2.ipynb',
 'notebooks',
 'NTUSER.DAT',
 'ntuser.dat.LOG1',
 'ntuser.dat.LOG2',
 'NTUSER.DAT{ea49ceec-2e07-11e8-81b7-b8ec32391728}.TxR.0.regtrans-ms',
 'NTUSER.DAT{ea49ceec-2e07-11e8-81b7-b8ec32391728}.TxR.1.regtrans-ms',
 'NTUSER.DAT{ea49ceec-2e07-11e8-8

In [16]:
# Open and read a file
f = open('chg.txt')
f.read()

'汉皇重色思倾国，御宇多年求不得。\n杨家有女初长成，养在深闺人未识。\n天生丽质难自弃，一朝选在君王侧。\n'

In [17]:
# Open chg.txt, read it line by line, and strip surrounding whitespace
f = open('chg.txt', 'r')
for line in f:
    print(line.strip())

汉皇重色思倾国，御宇多年求不得。
杨家有女初长成，养在深闺人未识。
天生丽质难自弃，一朝选在君王侧。
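
The cells above leave the file open after reading. A common alternative, sketched here under the assumption that the file is UTF-8 encoded, is a `with` block that closes the file automatically and names the encoding explicitly. The temporary file and its contents are made up so that the example is self-contained.

```python
import os
import tempfile

lines_to_write = ["first line", "second line", "third line"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")   # hypothetical file name

    # Write a small sample file
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines_to_write) + "\n")

    # Read it back line by line; the file is closed when the block ends
    with open(path, encoding="utf-8") as f:
        stripped = [line.strip() for line in f]

print(stripped)
```

Naming the encoding avoids depending on the platform default, which differs between systems.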


### 3.1.6 Extracting Text from PDF, MS Word and Other Binary Formats
PDF and MS Word files can be read with third-party libraries such as pypdf and pywin32.

### 3.1.7 Capturing User Input

In [18]:
s = input("Enter some text: ")

Enter some text: this is a test


In [19]:
print("You typed", len(word_tokenize(s)), "words.")

You typed 4 words.


## 3.2 Strings: Text Processing at the Lowest Level

### 3.2.1 Basic Operations with Strings

In [20]:
monty = 'Monty Python'
circus = "Monty Python's Flying Circus"
print(monty)
print(circus)
circus = 'Monty Python\'s Flying Circus'
print(circus)

Monty Python
Monty Python's Flying Circus
Monty Python's Flying Circus


In [21]:
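# What the backslash is for: without the escape, the apostrophe ends the string early and Python reports a syntax error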
circus = 'Monty Python's Flying Circus'
print(circus)

SyntaxError: invalid syntax (<ipython-input-21-b0616ecb83be>, line 2)

In [22]:
# String literals that span several lines, and how line breaks show up in the output
couplet1 = "Shall I compare thee to a Summer's day?"\
            "Thou are more lovely and more temperate:" 
couplet2 = ("Rough winds do shake the darling buds of May,"
            "And Summer's lease hath all too short a date:")
couplet3 = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""  # triple single quotes work as well
print(couplet1)
print(couplet2)
print(couplet3)

Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:
Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:
Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:


In [23]:
# Adding and multiplying strings
very_add = 'very' + 'very' + 'very'
print(very_add)
very_mul = 'very'*3
print(very_mul)
# Strings do not support subtraction or division

veryveryvery
veryveryvery


### 3.2.2 Printing Strings

In [24]:
print(monty)
grail = 'Holy Grail'
print(monty + grail)
print(monty, grail)
print(monty, "and the", grail)

Monty Python
Monty PythonHoly Grail
Monty Python Holy Grail
Monty Python and the Holy Grail


### 3.2.3 Accessing Individual Characters

In [25]:
print(monty[0])
print(monty[-1])
sent = 'colorless green ideas sleep furiously'
# Loop over the characters of the string
for char in sent:
    print(char, end=' ')

M
n
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y 

In [26]:
# Normalize to lowercase and filter out non-alphabetic characters
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
# Count how often each character occurs
print(fdist.most_common(5))
[char for (char, count) in fdist.most_common()]

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]


['e',
 't',
 'a',
 'o',
 'n',
 'i',
 's',
 'h',
 'r',
 'l',
 'd',
 'u',
 'm',
 'c',
 'w',
 'f',
 'g',
 'p',
 'b',
 'y',
 'v',
 'k',
 'q',
 'j',
 'x',
 'z']

### 3.2.4 Accessing Substrings

In [27]:
print(monty[6:10])
print(monty[-12:-7])
print(monty[:5])
print(monty[6:])

Pyth
Monty
Monty
Python


In [28]:
phrase = 'And now for something completely different'
# The in operator tests whether a string contains a given substring
if 'thing' in phrase:
    print('found "thing"')
monty.find('Python')

found "thing"


6

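### 3.2.5 More Operations on Strings

s.find(t) index of the first occurrence of t in s (-1 if not found)  
s.rfind(t) index of the last occurrence of t in s (-1 if not found)  
s.index(t) like s.find(t), but raises ValueError if not found  
s.rindex(t) like s.rfind(t), but raises ValueError if not found  
s.join(text) joins the strings in text into one string, using s as the separator  
s.split(t) splits s into a list wherever t is found (whitespace by default)  
s.splitlines() splits s into a list of strings, one per line  
s.lower() a lowercased version of s  
s.upper() an uppercased version of s  
s.title() a titlecased version of s (first letter of each word capitalized)  
s.strip() a copy of s without leading or trailing whitespace  
s.replace(t, u) replaces every instance of t in s with u  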
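
The methods listed above can be tried on a single sample string. This is a small illustrative sketch; the sample text is arbitrary.

```python
s = "  Monty Python's Flying Circus \n"

clean = s.strip()            # copy without leading/trailing whitespace
words = clean.split()        # split on whitespace by default

print(clean.find('Python'))  # index of the first match
print(clean.find('Spam'))    # -1 because 'Spam' does not occur
print(clean.rfind('y'))      # index of the last 'y'
print('-'.join(words))       # glue the words back together with '-'
print(clean.lower())
print(clean.upper())
print('holy grail'.title())
print(clean.replace('Flying', 'Walking'))
print('line one\nline two'.splitlines())

# index() behaves like find(), but raises instead of returning -1
try:
    clean.index('Spam')
except ValueError as err:
    print('index() raised ValueError:', err)
```

Because strings are immutable, every one of these calls returns a new value and leaves `s` unchanged.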

### 3.2.6 The Difference between Lists and Strings
Strings are immutable  
Lists are mutable

In [29]:
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']
print(query[2])
print(beatles[2])
print(query[:2])
print(beatles[:2])
print(query + " I don't")

o
George
Wh
['John', 'Paul']
Who knows? I don't


In [30]:
# A list and a string cannot be concatenated
beatles + 'Brian'

TypeError: can only concatenate list (not "str") to list

In [31]:
beatles + ['Brian']

['John', 'Paul', 'George', 'Ringo', 'Brian']
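
The mutability difference stated at the start of this section can be seen directly: assigning to a list position works, while assigning to a string position raises a TypeError. A minimal sketch:

```python
beatles = ['John', 'Paul', 'George', 'Ringo']
query = 'Who knows?'

# Lists are mutable: an element can be replaced in place
beatles[0] = 'John Lennon'
print(beatles)

# Strings are immutable: item assignment is rejected
try:
    query[0] = 'F'
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# The usual workaround is to build a new string
new_query = 'F' + query[1:]
print(new_query)
```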

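## 3.3 Text Processing with Unicode

### 3.3.1 What is Unicode
Unicode is a character encoding standard, developed by an international organization, that is designed to cover the characters and symbols of all the world's writing systems. The Unicode code space runs from 0x0000 to 0x10FFFF and is divided into 17 groups called planes; each plane holds 65,536 code points, for 1,114,112 in total.  
The code space therefore has room for over a million characters. Each character is assigned a number, called a code point. In Python, a code point is written as \uXXXX, where XXXX is a four-digit hexadecimal number; code points above 0xFFFF need the eight-digit form \UXXXXXXXX.  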
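
To make the numbers above concrete, the sketch below prints the code point and the UTF-8 encoding of a few sample characters, and checks the plane arithmetic. The choice of sample characters is arbitrary.

```python
# One character each that needs 1, 2, 3 and 4 bytes in UTF-8
samples = ['a', '\u0144', '\u6c49', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    print(repr(ch), 'U+{:04X}'.format(ord(ch)), encoded, len(encoded), 'byte(s)')

# 17 planes of 65,536 code points each cover 0x0000 through 0x10FFFF
total_code_points = 17 * 65536
print(total_code_points, total_code_points == 0x10FFFF + 1)
```

The last sample lies outside the first plane, which is why it has to be written with the eight-digit \U escape.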

### 3.3.2 Extracting Encoded Text from Files

In [32]:
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
# Open the Polish sample file using the 'latin2' encoding
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [34]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))  # convert every non-ASCII character to its two-digit \xXX or four-digit \uXXXX escape

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [1]:
ord('ń')

324

In [2]:
nacute = '\u0144'
nacute

'ń'

In [3]:
nacute.encode('utf8')

b'\xc5\x84'

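### 3.3.3 Using Your Local Encoding in Python

Include the string '# -*- coding: <coding> -*-' as the first or second line of the file. Python 3 assumes UTF-8 source files by default, so the declaration is only needed for other encodings.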

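## 3.4 Regular Expressions for Detecting Word Patterns
Regular expressions are typically used to search for, and replace, text that fits a given pattern (rule).  
They require the re library, loaded with import re.

### 3.4.1 Using Basic Metacharacters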

In [5]:
import re,nltk
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
# Use the regular expression «ed$» to find words ending in ed
[w for w in wordlist if re.search('ed$', w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

In [6]:
# The wildcard «.» matches any single character
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

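### 3.4.2 Ranges and Closures
re.search scans through the whole string and stops at the first match  
. wildcard, matches any single character (except a newline by default)  
^abc matches a string that starts with abc  
abc$ matches a string that ends with abc  
[abc] matches one character from the set  
[A-Z0-9] matches one character from a range  
ed|ing|s matches one of the specified strings (disjunction)  
“*” zero or more of the preceding item, e.g. a*, [a-z]* (also known as Kleene closure)  
“+” one or more of the preceding item, e.g. a+, [a-z]+  
“?” zero or one of the preceding item (i.e. optional), e.g. a?, [a-z]?  
{n} exactly n repeats, where n is a non-negative integer  
{n,} at least n repeats  
{,n} no more than n repeats  
{m,n} at least m and no more than n repeats  
a(b|c)+ parentheses indicate the scope of the operators  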
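
The operators summarized above can be tried without downloading any corpus. The sketch below applies a handful of them to a small word list; the list and the helper name `matching` are made up for illustration.

```python
import re

# A small made-up word list, so the example runs without any NLTK corpus
words = ['abed', 'walked', 'walking', 'cats', 'gold', 'golf',
         'haha', 'ahh', '1999', '10-day', '3.14', 'US$']

def matching(pattern):
    """Return the words in which re.search finds the pattern."""
    return [w for w in words if re.search(pattern, w)]

print(matching(r'ed$'))                     # anchored at the end
print(matching(r'^..l.$'))                  # wildcard positions
print(matching(r'^[ghi][mno][jlk][def]$'))  # one character from each set
print(matching(r'^[ha]+$'))                 # one or more of h or a
print(matching(r'^[0-9]{4}$'))              # exactly four digits
print(matching(r'^[0-9]+\.[0-9]+$'))        # escaped literal dot
print(matching(r'^[0-9]+-[a-z]{3,5}$'))     # bounded repetition
print(matching(r'(ed|ing|s)$'))             # disjunction inside parentheses
print(matching(r'^[A-Z]+\$$'))              # escaped literal dollar sign
```

Writing the patterns as raw strings (the r prefix) keeps backslashes such as `\.` and `\$` from being interpreted by Python before the regular expression engine sees them.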

In [7]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]

['gold', 'golf', 'hold', 'hole']

In [8]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
# «+» means one or more instances of the preceding item

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [9]:
[w for w in chat_words if re.search('^[ha]+$', w)]

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

In [10]:
wsj = sorted(set(nltk.corpus.treebank.words()))

In [11]:
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 '0.82',
 '0.84',
 '0.9',
 '0.95',
 '0.99',
 '1.01',
 '1.1',
 '1.125',
 '1.14',
 '1.1650',
 '1.17',
 '1.18',
 '1.19',
 '1.2',
 '1.20',
 '1.24',
 '1.25',
 '1.26',
 '1.28',
 '1.35',
 '1.39',
 '1.4',
 '1.457',
 '1.46',
 '1.49',
 '1.5',
 '1.50',
 '1.55',
 '1.56',
 '1.5755',
 '1.5805',
 '1.6',
 '1.61',
 '1.637',
 '1.64',
 '1.65',
 '1.7',
 '1.75',
 '1.76',
 '1.8',
 '1.82',
 '1.8415',
 '1.85',
 '1.8500',
 '1.9',
 '1.916',
 '1.92',
 '10.19',
 '10.2',
 '10.5',
 '107.03',
 '107.9',
 '109.73',
 '11.10',
 '11.5',
 '11.57',
 '11.6',
 '11.72',
 '11.95',
 '112.9',
 '113.2',
 '116.3',
 '116.4',
 '116.7',
 '116.9',
 '118.6',
 '12.09',
 '12.5',
 '12.52',
 '12.68',
 '12.7',
 '12.82',
 '12.97',
 '120.7',
 '1206.26',
 '121.6',
 '126.1',
 '126.15',
 '127.03',
 '129.91',
 '13.1',
 '13.15',
 '13.5',
 '13.50',
 '13.625',
 '13.65',
 '13.73',
 '13.8',
 '13.90',
 '130.6',
 '130.7',
 '

In [12]:
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]

['C$', 'US$']

In [13]:
[w for w in wsj if re.search('^[0-9]{4}$', w)]

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 '1975',
 '1976',
 '1977',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2005',
 '2009',
 '2017',
 '2019',
 '2029',
 '3057',
 '8300']

In [14]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [15]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [16]:
[w for w in wsj if re.search('(ed|ing)$', w)]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Boeing',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Funded',
 'Funding',
 'Generalized',
 'Germany-based',
 'Getting',
 'Guaranteed',
 'Having',
 'Heating',
 'Heightened',
 'Holding',
 'Housing',
 'Illumin

## 3.5 Useful Applications of Regular Expressions

### 3.5.1 Extracting Word Pieces

In [17]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)  # find all the vowels in a word

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [18]:
len(re.findall(r'[aeiou]', word))

16

In [19]:
# Find all sequences of two or more vowels in a text and determine their relative frequency
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]
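
The same counting idea works with the standard library alone. In this sketch `collections.Counter` stands in for `nltk.FreqDist`, and the word list is made up; it also shows that «{2,}» captures runs longer than two vowels as a single piece.

```python
import re
from collections import Counter

# Hypothetical sample words; any list of strings would do
words = ['beautiful', 'queue', 'audio', 'cooperation', 'idea']

# Every maximal run of two or more vowels, across all words
counts = Counter(vs for w in words
                 for vs in re.findall(r'[aeiou]{2,}', w))

print(counts.most_common())
```

Note that 'beautiful' contributes the three-vowel run 'eau' rather than 'ea' and 'au', because the pattern matches greedily.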

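### 3.5.2 Doing More with Word Pieces

In [20]:
# English text is highly redundant: it stays readable even when word-internal vowels are left out.
# The regular expression keeps any word-initial or word-final vowel sequence, plus every consonant.
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)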
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


In [21]:
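# Combine regular expressions with a conditional frequency distribution and tabulate the frequency of each consonant-vowel pair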
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


In [23]:
# Build an index to look up the list of words that contain a given pair
cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']


['kasuari']
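
The index built above maps each consonant-vowel pair to the words containing it. A rough standard-library equivalent, assuming a `defaultdict(list)` is an acceptable stand-in for `nltk.Index`, looks like this; the three sample words are chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical miniature word list standing in for the Rotokas lexicon
sample_words = ['kasuari', 'kaapo', 'visu']

# Map each consonant-vowel pair to the words in which it occurs
cv_index = defaultdict(list)
for w in sample_words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        cv_index[cv].append(w)

print(cv_index['su'])
print(cv_index['ka'])
```

Looking up a pair that never occurred simply returns an empty list, since a `defaultdict` creates missing entries on demand.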

### 3.5.3 Finding Word Stems

In [24]:
# Strip off a suffix
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word  # reached only when no suffix matched

In [25]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['ing']

In [26]:
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

['processing']

In [27]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')

[('process', 'ing')]

In [28]:
# .* is greedy: it grabs as much of the word as it can
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('processe', 's')]

In [29]:
# .*? is non-greedy: it grabs as little of the word as it can
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')

[('process', 'es')]

In [30]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')

[('language', '')]

In [33]:
# Extract the stems of the words in a text
import nltk
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

### 3.5.4 Searching Tokenized Text

In [34]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")  # the parentheses mean only the word between 'a' and 'man' is shown, not the whole phrase

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


In [35]:
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")  # matches whole three-word phrases ending in 'bro'

you rule bro; telling you bro; u twizted bro


In [36]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


## 3.6 Normalizing Text
For example: stripping all affixes, stemming, and similar tasks

### 3.6.1 Stemmers

In [38]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

In [39]:
# Porter stemmer
porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [40]:
# Lancaster stemmer
lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

In [41]:
# Index a text using a stemmer
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

In [42]:
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


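### 3.6.2 Lemmatization

In [43]:
# WordNetLemmatizer maps each word to its dictionary form (lemma), removing an affix only when the result is a known word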
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']

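## 3.7 Regular Expressions for Tokenizing Text

### 3.7.1 Simple Approaches to Tokenization
Symbol Function (the equivalences hold for ASCII text)  
\b word boundary (zero width)  
\d any decimal digit (equivalent to [0-9])  
\D any non-digit character (equivalent to [^0-9])  
\s any whitespace character (equivalent to [ \t\n\r\f\v])  
\S any non-whitespace character (equivalent to [^ \t\n\r\f\v])  
\w any alphanumeric character (equivalent to [a-zA-Z0-9_])  
\W any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])  
\t the tab character  
\n the newline character  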
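
A short sketch of the character classes listed above, applied to an arbitrary sample string:

```python
import re

# Made-up sample containing a tab, a newline, digits and punctuation
text = "Room 42\tis on floor 3.\nCall 555-0199 now"

digits = re.findall(r'\d+', text)         # runs of digits
pieces = re.split(r'\s+', text)           # split on any whitespace
fl_words = re.findall(r'\bfl\w*', text)   # words that start with 'fl'
cleaned = re.sub(r'\W+', ' ', text)       # replace non-word runs by a space

print(digits)
print(pieces)
print(fl_words)
print(cleaned)
```

Splitting on «\s+» keeps punctuation attached to the neighbouring word, while substituting «\W+» throws punctuation away; neither is a complete tokenizer, which is the point the following cells develop.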

In [44]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""

In [45]:
re.split(r' ', raw)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone\nthough),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very\nwell',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [46]:
re.split(r'[ \t\n]+', raw)  # «[ \t\n]+» matches one or more spaces, tabs (\t) or newlines (\n)

["'When",
 "I'M",
 'a',
 "Duchess,'",
 'she',
 'said',
 'to',
 'herself,',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though),',
 "'I",
 "won't",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL.',
 'Soup',
 'does',
 'very',
 'well',
 'without--Maybe',
 "it's",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 "hot-tempered,'..."]

In [47]:
re.split(r'\W+', raw)

['',
 'When',
 'I',
 'M',
 'a',
 'Duchess',
 'she',
 'said',
 'to',
 'herself',
 'not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 'I',
 'won',
 't',
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 'Maybe',
 'it',
 's',
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 'tempered',
 '']

In [48]:
re.findall(r'\w+|\S\w*', raw)

["'When",
 'I',
 "'M",
 'a',
 'Duchess',
 ',',
 "'",
 'she',
 'said',
 'to',
 'herself',
 ',',
 '(not',
 'in',
 'a',
 'very',
 'hopeful',
 'tone',
 'though',
 ')',
 ',',
 "'I",
 'won',
 "'t",
 'have',
 'any',
 'pepper',
 'in',
 'my',
 'kitchen',
 'AT',
 'ALL',
 '.',
 'Soup',
 'does',
 'very',
 'well',
 'without',
 '-',
 '-Maybe',
 'it',
 "'s",
 'always',
 'pepper',
 'that',
 'makes',
 'people',
 'hot',
 '-tempered',
 ',',
 "'",
 '.',
 '.',
 '.']

In [49]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']


### 3.7.2 NLTK's Regular Expression Tokenizer
nltk.regexp_tokenize() is more efficient for this task and avoids the need for special treatment of parentheses

In [52]:
text = 'That U.S.A. poster-print costs $12.40...'
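pattern = r'''(?x)          # set flag to allow verbose regexps
   (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
 | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
 | \w+(?:[-']\w+)*         # words with optional internal hyphens or apostrophes
 | \.\.\.                  # ellipsis
 | (?:[.,;"'?():_`-])      # these are separate tokens; the hyphen goes last so it is literal
'''
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']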

### 3.7.3 Further Issues with Tokenization

1. No single solution works well across all domains  
2. Contractions, such as “didn't”  
3. Ambiguity  
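
For the contraction problem, one possible treatment is to split a token into a base and a clitic, in the spirit of the «do n't» split visible in the concordance output earlier in this chapter. The sketch below is only an illustration: the pattern, the suffix list and the helper name `split_contraction` are assumptions, not NLTK's implementation.

```python
import re

# Base (as short as possible) followed by one known English clitic
CONTRACTION = re.compile(r"^(\w+?)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def split_contraction(token):
    """Split a contraction into [base, clitic]; return [token] otherwise."""
    match = CONTRACTION.match(token)
    return list(match.groups()) if match else [token]

tokens = ["I", "didn't", "know", "it's", "won't", "I'M"]
for t in tokens:
    print(t, '->', split_contraction(t))
```

The result for «won't» (a base of «wo») shows why this is a further issue rather than a solved one: a purely pattern-based split does not recover the underlying word «will».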

## 3.8 Segmentation
Tokenization is one instance of a more general segmentation problem

### 3.8.1 Sentence Segmentation

In [54]:
# Punkt sentence segmenter: splits raw text into sentences
import pprint
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])

['"Nonsense!"',
 'said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\n'
 'railway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\n'
 'is because they know that whatever place they have taken a ticket\n'
 'for that place they will reach.',
 'It is because after they have\n'
 'passed Sloane Square they know that the next station must be\n'
 'Victoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\n'
 'their eyes like stars and their souls again in Eden, if the next\n'
 'station were unaccountably Baker Street!"',
 '"It is you who are unpoetical," replied the poet Syme.']
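
To see why a trained segmenter such as Punkt is useful, compare it with a naive rule that splits after every full stop, exclamation mark or question mark followed by whitespace. The function name `naive_sent_split` and the sample sentence are made up for this sketch.

```python
import re

def naive_sent_split(text):
    """Split wherever whitespace follows '.', '!' or '?'."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'Dr. Smith arrived. "Nonsense!" said Gregory. Was he right?'
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The rule wrongly cuts after the abbreviation «Dr.», which is the kind of case a sentence segmenter has to handle.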


### 3.8.2 Word Segmentation

In [56]:
# Split a text string into words, given a bit string in which each 1 marks a character that ends a word
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [57]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

In [58]:
segment(text, seg2)

['do',
 'you',
 'see',
 'the',
 'kitty',
 'see',
 'the',
 'doggy',
 'do',
 'you',
 'like',
 'the',
 'kitty',
 'like',
 'the',
 'doggy']

In [59]:
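# Objective function: scores a segmentation by the size of the lexicon plus the amount of information needed to reconstruct the source text from it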
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"
segment(text, seg3)

['doyou',
 'see',
 'thekitt',
 'y',
 'see',
 'thedogg',
 'y',
 'doyou',
 'like',
 'thekitt',
 'y',
 'like',
 'thedogg',
 'y']

In [60]:
print(evaluate(text, seg1))
print(evaluate(text, seg2))
print(evaluate(text, seg3))

64
48
47
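
The three scores can be broken down into their two components to see where the differences come from. This sketch repeats `segment` so that it runs on its own; `score_parts` is a hypothetical helper that returns the same quantities `evaluate` adds together.

```python
def segment(text, segs):
    """Split text after every position where segs has a '1'."""
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words

def score_parts(text, segs):
    """Return (text_size, lexicon_size, total); lower totals are better."""
    words = segment(text, segs)
    text_size = len(words)                                # number of word tokens
    lexicon_size = sum(len(w) + 1 for w in set(words))    # letters plus one delimiter per distinct word
    return text_size, lexicon_size, text_size + lexicon_size

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"

for name, segs in [('seg1', seg1), ('seg2', seg2), ('seg3', seg3)]:
    print(name, score_parts(text, segs))
```

With few boundaries the lexicon is large because every "word" is long and unique; with the linguistically correct boundaries the lexicon shrinks sharply, and the third segmentation trades a slightly larger lexicon for fewer tokens.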


In [62]:
from random import randint
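# Non-deterministic search using simulated annealing: start by searching phrase segmentations only; randomly flip 0s and 1s
# in proportion to the "temperature"; each iteration lowers the temperature, so the perturbations become smaller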
def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs

In [63]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
62 ['doyou', 'seethekittyseet', 'hedoggy', 'doyoulikethe', 'kittyl', 'iket', 'hedoggy']
61 ['doyou', 'se', 'etheki', 'ttyseet', 'hedoggy', 'doyou', 'lik', 'etheki', 't', 't', 'yl', 'ik', 'et', 'hedoggy']
59 ['doyou', 'seet', 'heki', 'ttyse', 'et', 'hedoggy', 'doyou', 'lik', 'et', 'heki', 't', 't', 'ylik', 'et', 'hedoggy']
59 ['doyou', 'seet', 'heki', 'ttyse', 'et', 'hedoggy', 'doyou', 'lik

'0000101000001001010000000010000100100000100100100000000'
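
Because `flip_n` draws positions with `randint`, every run of `anneal` follows a different path and may end at a different segmentation. A minimal sketch, assuming reproducibility is wanted: seeding the `random` module before the call fixes the sequence of flips. The short string `'0000100'` is an arbitrary example, not taken from the run above.

```python
import random
from random import randint

def flip(segs, pos):
    # Toggle the boundary marker at position pos.
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    # Toggle n randomly chosen positions (a position may be picked twice).
    for i in range(n):
        segs = flip(segs, randint(0, len(segs) - 1))
    return segs

segs = '0000100'
print(flip(segs, 0))                    # position 0 toggled
print(flip(flip(segs, 0), 0) == segs)   # flipping twice restores the input

random.seed(0)                          # assumption: any fixed seed will do
first = flip_n(segs, 3)
random.seed(0)
second = flip_n(segs, 3)
print(first == second)                  # same seed, same perturbation
```

Calling `random.seed(...)` once before `anneal(text, seg1, 5000, 1.2)` should make its printed trace repeatable in the same way.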

## 3.9 Formatting: From Lists to Strings

### 3.9.1 From Lists to Strings

In [64]:
# join() only works on a list of strings
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
print(' '.join(silly))
print(';'.join(silly))
print(''.join(silly))

We called him Tortoise because he taught us .
We;called;him;Tortoise;because;he;taught;us;.
WecalledhimTortoisebecausehetaughtus.
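
The restriction to lists of strings is easy to trip over when joining counts or other numbers. The sketch below uses an arbitrary list of integers to show the `TypeError` and one common workaround, converting each item with `str()` first. It also checks that `split()` undoes a space-separated `join()`.

```python
counts = [3, 4, 1]

try:
    ' '.join(counts)                 # fails: items are ints, not strings
except TypeError as err:
    print('TypeError:', err)

joined = ' '.join(str(n) for n in counts)   # convert each item first
print(joined)

silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
roundtrip = ' '.join(silly).split()  # split() on whitespace reverses the join
print(roundtrip == silly)
```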


### 3.9.2 Strings and Formats

In [65]:
word = 'cat'
sentence = """hello
world"""
print(word)
print(sentence)

cat
hello
world


In [66]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
    print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

In [67]:
for word in sorted(fdist):
    print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

In [68]:
'{} wants a {} {}'.format('Lee', 'sandwich', 'for lunch')

'Lee wants a sandwich for lunch'

In [69]:
template = 'Lee wants a {} right now'
menu = ['sandwich', 'spam fritter', 'pancake']
for snack in menu:
    print(template.format(snack))

Lee wants a sandwich right now
Lee wants a spam fritter right now
Lee wants a pancake right now
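
On Python 3.6 and later, formatted string literals (f-strings) are an alternative to `str.format()`. This is a sketch of the same menu loop written that way. Note that an f-string is evaluated where it is written, so a reusable template such as `template` above still needs `format()`.

```python
menu = ['sandwich', 'spam fritter', 'pancake']

# f-string version: the expression inside {} is evaluated in place
lines = [f'Lee wants a {snack} right now' for snack in menu]

# str.format() version with a reusable template, for comparison
template = 'Lee wants a {} right now'
same = [template.format(snack) for snack in menu]

for line in lines:
    print(line)
print(lines == same)
```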


### 3.9.3 Lining Things Up

In [70]:
'{:6}'.format('dog')  # left-aligned (the default for strings)

'dog   '

In [71]:
'{:>6}'.format('dog')  # right-aligned

'   dog'
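
A few more format specifications are useful when building tables. The sketch below shows centering, a width passed in as an argument, the default right-alignment of numbers, and fixed decimal places. The values are arbitrary examples.

```python
centered = '{:^7}'.format('dog')                  # pad on both sides
width = 6
padded = '{:{width}}'.format('dog', width=width)  # width supplied as a parameter
number = '{:6}'.format(41)                        # numbers right-align by default
rounded = '{:.4f}'.format(3.14159265)             # four digits after the point

print(repr(centered))
print(repr(padded))
print(repr(number))
print(repr(rounded))
```

The default right-alignment of numbers is why a plain `'{:6}'` is enough for integer table cells, while string headings need `'{:>6}'` to line up with them.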

In [72]:
# Tabulate word frequencies across different sections of the Brown Corpus
def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  # row heading
        for word in words:                                        # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ') # print table cell
        print()                                                   # end the row

from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Category            can  could    may  might   must   will 
news                 93     86     66     38     50    389 
religion             82     59     78     12     54     71 
hobbies             268     58    131     22     83    264 
science_fiction      16     49      4     12      8     16 
romance              74    193     11     51     45     43 
humor                16     30      8      8      9     13 
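
The same layout logic can be tried without downloading a corpus. The sketch below is a stand-alone variant that returns the table as a string instead of printing it. `format_table` and the toy counts are invented for illustration and are not Brown Corpus figures. A plain dict of `collections.Counter` objects stands in for the `ConditionalFreqDist`.

```python
from collections import Counter

def format_table(counts, words, categories):
    """Hypothetical helper: counts maps category -> Counter of words."""
    lines = []
    # header row: left-aligned row label, right-aligned column headings
    header = '{:16}'.format('Category')
    header += ''.join('{:>7}'.format(word) for word in words)
    lines.append(header)
    for category in categories:
        row = '{:16}'.format(category)
        # a Counter returns 0 for words it has never seen
        row += ''.join('{:>7}'.format(counts[category][word]) for word in words)
        lines.append(row)
    return '\n'.join(lines)

# Toy data, made up for this example
toy = {
    'a': Counter('can may can'.split()),
    'b': Counter('may may will'.split()),
}
table = format_table(toy, ['can', 'may', 'will'], ['a', 'b'])
print(table)
```

For real NLTK data, `cfd.tabulate(conditions=genres, samples=modals)` produces a comparable table directly.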


### 3.9.4 Writing Results to a File

In [73]:
output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
    print(word, file=output_file)

In [75]:
print(str(len(words)), file=output_file)
output_file.close()  # close the file so buffered output is flushed to disk

In [76]:
len(words)

2789
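
A `with` block is a common alternative for file output, since it closes the file automatically when the block ends, even if an error occurs. The sketch below uses a small made-up vocabulary in place of the Genesis word set and writes to a temporary directory so it does not overwrite anything. The path and word set are assumptions for illustration only.

```python
import os
import tempfile

# Stand-in vocabulary; the notebook uses the Genesis corpus here
words = {'beginning', 'God', 'created', 'heaven', 'earth'}
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

with open(path, 'w', encoding='utf8') as output_file:
    for word in sorted(words):
        print(word, file=output_file)      # one word per line
    print(len(words), file=output_file)    # print() converts the int itself

# Read the file back to confirm what was written
with open(path, encoding='utf8') as f:
    lines = f.read().splitlines()
print(lines)
```

`sorted()` puts capitalised words first because uppercase letters sort before lowercase ones.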

### 3.9.5 Text Wrapping

In [77]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',','more', 'is', 'said', 'than', 'done', '.']
for word in saying:
    print(word, '(' + str(len(word)) + '),', end=' ')

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1), 

In [78]:
# Line wrapping with the textwrap module
from textwrap import fill
fmt = '%s (%d),'  # named fmt to avoid shadowing the built-in format()
pieces = [fmt % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)

After (5), all (3), is (2), said (4), and (3), done (4), , (1), more
(4), is (2), said (4), than (4), done (4), . (1),
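
`fill()` wraps at 70 characters unless told otherwise. As a short sketch, the `width` argument sets a narrower limit, and `wrap()` returns the wrapped lines as a list instead of one string. The width of 40 is an arbitrary choice for the example.

```python
from textwrap import fill, wrap

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']
output = ' '.join('{} ({}),'.format(word, len(word)) for word in saying)

narrow = fill(output, width=40)   # one string with embedded newlines
print(narrow)

lines = wrap(output, width=40)    # the same lines as a list of strings
print(len(lines))
```

Lines break only at whitespace, so a token such as `After (5),` may be split between `After` and `(5),`, but never in the middle of either piece.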
