# Chapter 1: Tokenizing Text and WordNet Basics

## Tokenizing text into sentences

In [17]:
para1 = "Hello World. It's good to see you. Thanks for buying this book."
# sent_tokenize splits a paragraph into sentences
from nltk.tokenize import sent_tokenize
sent_tokenize(para1)
# The paragraph is split into three sentences.
# How does sent_tokenize do this?
# It uses the nltk.tokenize.punkt module, whose pre-trained model recognizes where sentences begin and end.

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
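To see why a trained model helps, here is a minimal sketch of a purely rule-based splitter. `naive_sent_split` is a hypothetical helper written for this note, not part of NLTK. It breaks after `.`, `!` or `?` whenever whitespace follows, which works on the paragraph above but goes wrong on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when whitespace follows (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

para = "Hello World. It's good to see you. Thanks for buying this book."
print(naive_sent_split(para))
# ['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

# The same rule wrongly breaks after the abbreviation "Dr."
print(naive_sent_split("Dr. Li wrote this. It is short."))
# ['Dr.', 'Li wrote this.', 'It is short.']
```

Cases like the second one are what a trained sentence tokenizer is meant to handle better than a fixed rule.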

In [11]:
para2 = '大家好，我叫李白。这是我的好朋友，杜甫。我们都是唐朝著名的诗人，哈哈哈！' 
sent_tokenize(para2)
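# Caveat: sent_tokenize leaves this Chinese paragraph unsplit.
# The default English Punkt model does not seem to treat the full-width 。 and ！ as sentence boundaries.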

['大家好，我叫李白。这是我的好朋友，杜甫。我们都是唐朝著名的诗人，哈哈哈！']
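One possible workaround for Chinese text, sketched here with the standard `re` module rather than NLTK, is to split on the full-width sentence terminators directly. `split_chinese_sentences` is a hypothetical helper, and the choice of terminators (。！？) is an assumption that may need extending for real text.

```python
import re

# A sentence is a run of non-terminator characters followed by any terminators.
# The terminator set 。！？ is an assumption for this sketch.
CHINESE_SENTENCE = re.compile(r"[^。！？]+[。！？]*")

def split_chinese_sentences(text):
    """Return sentences with their closing punctuation attached."""
    return [s.strip() for s in CHINESE_SENTENCE.findall(text) if s.strip()]

para2 = '大家好，我叫李白。这是我的好朋友，杜甫。我们都是唐朝著名的诗人，哈哈哈！'
print(split_chinese_sentences(para2))
# ['大家好，我叫李白。', '这是我的好朋友，杜甫。', '我们都是唐朝著名的诗人，哈哈哈！']
```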

In [18]:
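# If there is a lot of text to split, it is more efficient to load the Punkt pickle once and reuse the tokenizer.
# Note: recent NLTK releases ship the Punkt models as punkt_tab, so this pickle path may not be available there.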
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(para1)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

In [19]:
# For text in another language, load that language's pickle file instead.
# Spanish:
para3 = 'hola amigo. estoy bien.'
spanish_tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
spanish_tokenizer.tokenize(para3)

['hola amigo.', 'estoy bien.']

## Tokenizing sentences into words

In [1]:
from nltk.tokenize import word_tokenize
word_tokenize('hello world.')

['hello', 'world', '.']

In [2]:
# word_tokenize relies on the TreebankWordTokenizer rules (newer NLTK versions may use a variant of this class),
# so the code below gives the same result here.
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize('hello world.')
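# How it works: tokens are separated on whitespace and punctuation.
# By default, punctuation marks are kept as separate tokens.
# Class hierarchy: TokenizerI is the base interface, implemented by PunktWordTokenizer, TreebankWordTokenizer, and RegexpTokenizer
# (PunktWordTokenizer is no longer available in recent NLTK releases).
# RegexpTokenizer in turn has two subclasses: WordPunctTokenizer and WhitespaceTokenizer.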

['hello', 'world', '.']

In [3]:
# Separating contractions
word_tokenize("can't")
# The contraction is split into 'ca' and "n't" rather than kept as one token.
# If that is not acceptable, a regexp-based tokenizer is an alternative.

['ca', "n't"]
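The `['ca', "n't"]` result follows the Treebank convention of separating the clitic from its host word. The sketch below imitates that convention with a single regular expression. It is a simplified illustration, not NLTK's actual rule set, and `split_contractions` and the clitic list are assumptions made for this example.

```python
import re

# Simplified, illustrative clitic list; the real Treebank rules cover more cases.
CONTRACTION = re.compile(r"(\w+?)(n't|'ll|'re|'ve|'s|'m|'d)", re.IGNORECASE)

def split_contractions(text):
    """Split whitespace-separated words, breaking contractions into host and clitic."""
    tokens = []
    for word in text.split():
        match = CONTRACTION.fullmatch(word)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(word)
    return tokens

print(split_contractions("can't"))
# ['ca', "n't"]
print(split_contractions("I don't think it's ready"))
# ['I', 'do', "n't", 'think', 'it', "'s", 'ready']
```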

In [4]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("can't is a contraction")

['can', "'", 't', 'is', 'a', 'contraction']
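The output above can be reproduced with the standard `re` module. As far as I can tell, WordPunctTokenizer is a RegexpTokenizer built on the pattern `\w+|[^\w\s]+`, which matches either a run of word characters or a run of punctuation. That is why the apostrophe becomes a token of its own. `word_punct_tokenize` is a hypothetical stand-in written for this sketch.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Return alphanumeric runs and punctuation runs as separate tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("can't is a contraction"))
# ['can', "'", 't', 'is', 'a', 'contraction']
```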

## Tokenizing sentences using regular expressions

In [5]:
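# Regular expressions make tokenization more complex and can slow it down,
# so use them only when the results of the previous tokenizers are unacceptable.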
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning for \w
tokenizer.tokenize("Can't is a contraction.")

["Can't", 'is', 'a', 'contraction']

In [6]:
# The helper function regexp_tokenize gives the same result
from nltk.tokenize import regexp_tokenize
regexp_tokenize("Can't is a contraction.", r"[\w']+")

["Can't", 'is', 'a', 'contraction']
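A pattern can describe either the tokens themselves, as above, or the gaps between them (RegexpTokenizer exposes the second mode through its `gaps=True` argument). The sketch below shows both ideas with the standard `re` module only, so it is an approximation of the behaviour rather than NLTK's own code.

```python
import re

text = "Can't is a contraction."

# Match the tokens: runs of word characters and apostrophes.
tokens = re.findall(r"[\w']+", text)
print(tokens)
# ["Can't", 'is', 'a', 'contraction']

# Match the gaps instead: split on whitespace and drop empty strings.
# The trailing period stays attached to the last word.
by_gaps = [t for t in re.split(r"\s+", text) if t]
print(by_gaps)
# ["Can't", 'is', 'a', 'contraction.']
```

Matching tokens drops the punctuation entirely, while matching gaps keeps it attached to neighbouring words, so the choice depends on whether punctuation matters downstream.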