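NLTK implements its tokenizers on top of several different algorithms. They include sentence tokenizers, a range of word tokenizers, and a few special-purpose tokenizers, and further tokenizers are built by extending the basic word and sentence tokenizers.  
This section introduces the basic word tokenizers and compares their output.  
The basic word tokenizers in NLTK include:
+ RegexpTokenizer: tokenizes with a user-supplied regular expression
+ ReppTokenizer: the REPP tokenizer, see http://anthology.aclweb.org/P/P12/P12-2.pdf#page=406
+ ToktokTokenizer: a general-purpose tokenizer
+ TreebankWordTokenizer: tokenizes with regular expressions following the Penn Treebank conventions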

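NLTK uses PunktSentenceTokenizer to split text into sentences.  
It is trained with an unsupervised algorithm that builds a model of abbreviations, collocations, and words that tend to start sentences, and then uses that model to locate sentence boundaries.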

In [1]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)
print(tokenized[:2])

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.", 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.']


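NLTK also provides simple tokenizers: casual_tokenize, a convenience wrapper around the Twitter-aware TweetTokenizer, and tokenizers that split on lines, spaces, or tabs (LineTokenizer, SpaceTokenizer, TabTokenizer).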

In [2]:
text = """Is 9.5 or 525,600 my favorite number?
May I help you?
"""
print(nltk.tokenize.casual_tokenize(text))
print(nltk.tokenize.LineTokenizer().tokenize(text))

['Is', '9.5', 'or', '525,600', 'my', 'favorite', 'number', '?', 'May', 'I', 'help', 'you', '?']
['Is 9.5 or 525,600 my favorite number?', 'May I help you?']
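
For comparison, splitting on whitespace alone leaves punctuation attached to the neighbouring word. The following is a minimal sketch that uses only Python's built-in string methods rather than NLTK; it is meant to approximate what purely whitespace-based and line-based splitting do on this input, not to reproduce any NLTK class exactly.

```python
text = """Is 9.5 or 525,600 my favorite number?
May I help you?
"""

# Split on any run of whitespace: punctuation stays glued to the word.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Split on line breaks and drop empty lines.
line_tokens = [line for line in text.splitlines() if line.strip()]
print(line_tokens)
```

Here 'number?' and 'you?' remain single tokens, whereas the casual_tokenize output above separates the question marks.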


In [3]:
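regexp_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string so the backslashes reach the regex engine unchanged
print(regexp_tokenizer.tokenize(text))  # tokens are the successive matches of the pattern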

['Is', '9', '.5', 'or', '525', ',600', 'my', 'favorite', 'number', '?', 'May', 'I', 'help', 'you', '?']
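
The numbers are split ('9', '.5') because regex alternation is tried left to right: `\w+` matches the digits first and stops at the dot or comma, and the remainder is then picked up by `\S+`. The sketch below uses the standard `re` module directly, on the assumption that matching the pattern with `re.findall` behaves like the tokenizer for this input. The `numbers_first` pattern is a hypothetical variant, not part of NLTK, that places a number rule before `\w+`.

```python
import re

text = "Is 9.5 or 525,600 my favorite number?\nMay I help you?\n"

# Pattern from the cell above: \w+ wins before the dot or comma is reached.
original = r'\w+|\$[\d\.]+|\S+'
# Hypothetical variant: try digits with optional .ddd / ,ddd groups first.
numbers_first = r'\d+(?:[.,]\d+)*|\w+|\$[\d\.]+|\S+'

tokens_original = re.findall(original, text)
tokens_numbers_first = re.findall(numbers_first, text)

print(tokens_original)
print(tokens_numbers_first)
```

With the number rule first, '9.5' and '525,600' stay intact while the rest of the tokens are unchanged.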


In [4]:
toktok = nltk.tokenize.ToktokTokenizer()  # Toktok general-purpose tokenizer
text = 'Is 9.5 or 525,600 my favorite number?'
print(toktok.tokenize(text, return_str=False))

['Is', '9.5', 'or', '525,600', 'my', 'favorite', 'number', '?']


In [5]:
s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
print(nltk.tokenize.TreebankWordTokenizer().tokenize(s))

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']


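NLTK also includes some special-purpose tokenizers:
+ SExprTokenizer: splits a string into parenthesized expressions
+ SyllableTokenizer: splits a word into syllables
+ TextTilingTokenizer: segments a document into sections, one per subtopic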

In [6]:
text = '(a (b c)) d e (f)'
print(nltk.tokenize.SExprTokenizer().tokenize(text))  # split on top-level parentheses

['(a (b c))', 'd', 'e', '(f)']
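
The idea behind this result can be sketched by tracking parenthesis depth: a token ends when the depth returns to zero, and whitespace separates tokens only outside parentheses. The function `split_top_level` below is a hypothetical pure-Python illustration of that idea, not NLTK's implementation, and it simply raises an error on unbalanced input.

```python
def split_top_level(text):
    """Split text into top-level parenthesized groups and bare words."""
    tokens, buf, depth = [], [], 0
    for ch in text:
        if ch == '(':
            # A new top-level group ends any bare word in progress.
            if depth == 0 and buf:
                tokens.append(''.join(buf))
                buf = []
            depth += 1
            buf.append(ch)
        elif ch == ')':
            if depth == 0:
                raise ValueError("unmatched closing parenthesis")
            depth -= 1
            buf.append(ch)
            # Depth back to zero closes the current group.
            if depth == 0:
                tokens.append(''.join(buf))
                buf = []
        elif ch.isspace() and depth == 0:
            # Whitespace only separates tokens outside parentheses.
            if buf:
                tokens.append(''.join(buf))
                buf = []
        else:
            buf.append(ch)
    if depth != 0:
        raise ValueError("unmatched opening parenthesis")
    if buf:
        tokens.append(''.join(buf))
    return tokens


result = split_top_level('(a (b c)) d e (f)')
print(result)
```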


In [7]:
from nltk.tokenize import SyllableTokenizer
SSP = SyllableTokenizer()
SSP.tokenize("justification")

['jus', 'ti', 'fi', 'ca', 'tion']

In [8]:
from nltk.corpus import brown
import numpy
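tt = nltk.tokenize.TextTilingTokenizer(demo_mode=True)  # demo_mode=True also returns the intermediate scores
text = brown.raw()[:4000]
# s: gap scores, ss: smoothed scores, d: depth scores, b: segment boundaries
s, ss, d, b = tt.tokenize(text)
b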

[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
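
The boundary list is easier to read as positions. The short sketch below copies the output above into a literal and lists the indices marked with 1, on the assumption that each entry corresponds to one candidate gap in the text and that a 1 marks a detected subtopic boundary.

```python
# Boundary flags copied from the output above.
b = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

# Indices of the gaps flagged as boundaries.
boundary_positions = [i for i, flag in enumerate(b) if flag]
print(boundary_positions)

# Under the assumption above, n boundaries split the text into n + 1 segments.
print(len(boundary_positions) + 1)
```

For this excerpt that gives three boundaries and therefore four segments.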