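To obtain feature inputs in numeric form, you first need to clean, normalize, and preprocess the raw text data. Text corpora and raw text are usually neither accurate nor well formatted. Text processing uses a variety of techniques to convert raw text into well-defined sequences of linguistic components that have a standard structure and notation.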

The following are some of the mainstream text preprocessing techniques explored in this chapter:
- Tokenization
- Tagging
- Chunking
- Stemming
- Lemmatization

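## Text tokenization
A token is the smallest independent text component that carries definite syntactic and semantic meaning.

A paragraph of text or a text document has several components, including sentences, which can be further broken down into clauses, phrases, and words.

The most popular tokenization techniques are sentence tokenization and word tokenization, which break a text corpus into sentences and each sentence into words.

Text tokenization can therefore be defined as the process of breaking down or splitting text data into smaller, meaningful components.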

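### Sentence tokenization
Sentence tokenization is the process of splitting a text corpus into sentences.

There are several techniques for sentence tokenization; the basic ones look for specific delimiters between sentences.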
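
The delimiter-based approach can be sketched with the standard `re` module alone. This is only a naive illustration and not how NLTK's tokenizers work: the helper name `naive_sentence_split` and the choice of `.`, `?`, and `!` as delimiters are assumptions made for this example.

```python
import re

def naive_sentence_split(text):
    """Split text after '.', '?' or '!' when whitespace follows (hypothetical helper)."""
    # The lookbehind keeps the delimiter attached to the sentence it ends.
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    return [part for part in parts if part]

sentences = naive_sentence_split(
    "Python is powerful! Is it easy to learn? Most people think so.")
print(sentences)
```

A splitter this simple also breaks after abbreviations such as "Dr.", which is one reason trained tokenizers are usually preferred.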

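We will use the NLTK framework, which provides several interfaces for sentence tokenization. We will focus mainly on the following sentence tokenizers:
- sent_tokenize
- PunktSentenceTokenizer
- RegexpTokenizer
- Pre-trained sentence tokenization models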

Before splitting text into sentences, we need some text to test the operation on.

In [1]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

In [2]:
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = '''we will discuss briefly about the basic syntax, structure and design philosophies, 
                There is a defined hierarchical syntax for Python code which you should remember when writing code!
                Python is a really powerful programming language!'''

In [3]:
print len(alice)

144395


In [5]:
print alice[0:100]

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


nltk.sent_tokenize is the default sentence tokenization function recommended by nltk. Internally it uses an instance of the PunktSentenceTokenizer class.

In [6]:
# The following snippet shows the basic usage of this function on the sample texts:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

print 'Total sentences in sample_text:', len(sample_sentences)
print 'Sample text sentences :-'
pprint(sample_sentences)
print '\nTotal sentences in alice:', len(alice_sentences)
print 'First 5 sentences in alice:-'
pprint(alice_sentences[0:5])

Total sentences in sample_text: 2
Sample text sentences :-
['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
[u"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 u"Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'",
 u'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up an

### Sentence tokenization of German text

In [7]:
from nltk.corpus import europarl_raw

In [9]:
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# Length of the text
print len(german_text)
# First 100 characters of the text
print german_text[0:100]

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [11]:
# Use the default sent_tokenize tokenizer
german_sentences_def = default_st(text=german_text, language='german')
# Load the pre-trained German tokenizer from the nltk resources
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)
# The German tokenizer is also of type PunktSentenceTokenizer
print type(german_tokenizer)

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>


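The output above shows that german_tokenizer is an instance of PunktSentenceTokenizer that is specialized for German.

Next, compare the results of the default tokenizer with those of the German tokenizer.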

In [12]:
print german_sentences_def == german_sentences
for sent in german_sentences[0:5]:
    print sent

True
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .


In [13]:
# The default PunktSentenceTokenizer class can also be used directly for sentence tokenization.
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']


### Sentence tokenization with a regular expression pattern: the RegexpTokenizer class

In [14]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 '                Python is a really powerful programming language!']
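
The pattern used above is dense, so the sketch below rebuilds it piece by piece with the standard `re` module and comments on each lookbehind. The sample sentence is made up for this illustration, and `re.split` is used here only as a stand-in for the tokenizer's gap mode.

```python
import re

# Same pattern as above, split into commented pieces for readability.
SENTENCE_BOUNDARY = re.compile(
    r'(?<!\w\.\w.)'        # not right after a dotted abbreviation such as "U.S." or "e.g."
    r'(?<![A-Z][a-z]\.)'   # not right after a short title such as "Dr." or "Mr."
    r'(?<![A-Z]\.)'        # not right after a single initial such as "J."
    r'(?<=\.|\?|\!)'       # must directly follow '.', '?' or '!'
    r'\s'                  # the single whitespace character treated as the gap
)

text = "Dr. Smith works at the U.S. office. He likes Python! Do you?"
pieces = SENTENCE_BOUNDARY.split(text)
print(pieces)
```

Because the gap is a single `\s`, only the newline after "code!" is consumed in the sample text, which is why the second sentence in the output above keeps its leading spaces.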


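### Word tokenization
Word tokenization is the process of splitting a sentence into its constituent words.

A sentence is a collection of words; word tokenization essentially splits a sentence into a list of words from which the sentence can be rebuilt.

Word tokenization matters in many processes, especially text cleaning and normalization, where operations such as stemming and lemmatization are applied to each word based on its stem and token information.

The main word tokenization interfaces in nltk:
- word_tokenize
- TreebankWordTokenizer
- RegexpTokenizer
- Tokenizers inherited from RegexpTokenizer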

In [15]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print words

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


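nltk.word_tokenize is the default word tokenizer recommended by nltk. In the NLTK version used here, it is a wrapper around an instance of the TreebankWordTokenizer class.

TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to split text. Its main features include:

- Splitting off periods that appear at the end of a sentence
- Splitting off commas and single quotes when they are followed by whitespace
- Splitting most punctuation characters into separate tokens
- Splitting standard contractions, for example "don't" into "do" and "n't"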
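
To make the contraction rule concrete, here is a heavily simplified sketch that uses only the `re` module. It is not NLTK's actual rule set: the helper name `split_contractions` and the short list of clitics it handles are assumptions chosen for illustration.

```python
import re

def split_contractions(sentence):
    """Rough imitation of Treebank-style contraction splitting (hypothetical helper)."""
    # "wasn't" -> "was n't"
    sentence = re.sub(r"(?i)(n't)\b", r" \1", sentence)
    # "that's" -> "that 's", "you'll" -> "you 'll", and similar clitics
    sentence = re.sub(r"(?i)'(s|ll|re|ve|d|m)\b", r" '\1", sentence)
    return sentence.split()

tokens = split_contractions(
    "The brown fox wasn't that quick and he couldn't win the race")
print(tokens)
```

For this sentence the sketch yields the same tokens as the word_tokenize output shown earlier, but it ignores punctuation and every other rule the real tokenizer applies.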

In [16]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


**The following code performs word tokenization with regular expressions**

In [17]:
TOKEN_PATTERN = r'\w+' # pattern that matches the tokens themselves
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)
words = regex_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [18]:
GAP_PATTERN = r'\s+' # pattern that matches the gaps between tokens
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,gaps=True)
words = regex_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
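
The `gaps` flag decides whether the pattern describes the tokens themselves or the separators between them. The two modes correspond roughly to `re.findall` and `re.split` in the standard library, as this NLTK-independent sketch suggests; the variable names are chosen for this example only.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# gaps=False: the pattern matches the tokens themselves.
tokens_by_match = re.findall(r'\w+', sentence)

# gaps=True: the pattern matches the separators; empty pieces are dropped.
tokens_by_gap = [piece for piece in re.split(r'\s+', sentence) if piece]

print(tokens_by_match)
print(tokens_by_gap)
```

Matching `\w+` cuts contractions at the apostrophe, while splitting on whitespace keeps them intact, which mirrors the difference between the two outputs above.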


In [19]:
word_indices = list(regex_wt.span_tokenize(sentence))
print word_indices
print [sentence[start:end] for start, end in word_indices]

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
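
Token spans are simply the start and end offsets of each match in the original string. A minimal sketch of the same idea with `re.finditer`, assuming whitespace-delimited tokens as in the cell above:

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Each match object records where its token starts and ends in the original string.
spans = [(match.start(), match.end()) for match in re.finditer(r'\S+', sentence)]
print(spans[:4])

# Slicing with a span recovers the token text.
print([sentence[start:end] for start, end in spans[:4]])
```

Spans are useful whenever tokens have to be mapped back to their position in the source text, for example for highlighting or annotation.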


In [20]:
# WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into separate alphabetic and non-alphabetic tokens
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']


In [21]:
# WhitespaceTokenizer splits a sentence into words on whitespace characters such as tabs, newlines, and spaces
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


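## Text normalization
Text normalization, often also called text cleaning or text transformation, covers the following techniques:
- Text cleaning
- Case conversion
- Spelling correction
- Stopword removal
- Stemming
- Lemmatization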

In [22]:
import re
import string
from pprint import pprint
corpus = ["the brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
         "@@You'll (learn) a **lot** in the book.Python is  an amazing language!@@"]

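### Text cleaning
To extract meaningful text from sources such as HTML, you can parse the HTML with the BeautifulSoup library, or use regular expressions, XPath, and the lxml library to parse XML data. Older NLTK releases offered a clean_html() function for this purpose, but in NLTK 3.x it is no longer implemented and directs users to BeautifulSoup instead.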
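
As a dependency-free illustration of the idea, the sketch below strips markup with the standard library's HTML parser. It is only a minimal stand-in for a real HTML library: the names `_TextExtractor` and `strip_html` and the sample markup are assumptions made for this example.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return ' '.join(' '.join(self._chunks).split())

def strip_html(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()

sample_html = ("<html><head><style>p {color: red;}</style></head><body>"
               "<p>Hello <b>world</b></p><script>var x = 1;</script>"
               "<p>Second paragraph.</p></body></html>")
print(strip_html(sample_html))
```

Joining text chunks with spaces is a simplification; a dedicated library copes far better with malformed markup, entities, and spacing around inline tags.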

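### Tokenizing text
Whether you tokenize first or remove unwanted characters first depends on the problem you are solving and the data you are working with.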

In [24]:
def tokenize_text(text):
    '''
    Take text data, extract its sentences, and split each sentence into tokens.
    The tokens can be words, special characters, or punctuation marks.
    '''
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

In [25]:
token_list = [tokenize_text(text) for text in corpus]
pprint(token_list)

[[['the',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book.Python',
   'is',
   'an',
   'amazing',
   'language',
   '!'],
  ['@', '@']]]


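### Removing special characters
An important task in text normalization is removing unnecessary and special characters, such as special symbols or punctuation marks.

The reason is that punctuation and special characters often carry little meaning when we analyze text and extract NLP or machine learning features or information from it.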

In [26]:
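def remove_characters_after_tokenization(tokens):
    '''
    Build a regex character class from string.punctuation, which holds
    the ASCII punctuation and special characters.
    Use the regex sub method to delete those characters from every token,
    then use filter to drop tokens that end up empty.
    '''
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation))) # re.escape keeps every character literal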
    filtered_tokens = filter(None, [pattern.sub('',token) for token in tokens])
    return filtered_tokens

In [28]:
filtered_list_1 = [filter(None, [remove_characters_after_tokenization(tokens) for tokens in sentence_tokens]) 
                   for sentence_tokens in token_list]

In [29]:
print filtered_list_1

[[['the', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'bookPython', 'is', 'an', 'amazing', 'language']]]
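
The cleaned tokens above show two side effects of stripping punctuation after tokenization: contraction pieces lose their apostrophe ("n't" becomes "nt"), and "book.Python" collapses into a single token. The sketch below applies the same kind of pattern to a few individual tokens to make this visible; the constant name `PUNCTUATION` and the token list are chosen for this example only.

```python
import re
import string

# re.escape makes characters such as ']', '\\', '^' and '-' literal inside the class.
PUNCTUATION = re.compile('[{}]'.format(re.escape(string.punctuation)))

cleaned = {}
for token in ["n't", "**lot**", "book.Python", "$", "199"]:
    cleaned[token] = PUNCTUATION.sub('', token)
    print(repr(token) + ' -> ' + repr(cleaned[token]))
```

Tokens that consist only of punctuation, such as "$", become empty strings, which is why the function above filters out empty tokens afterwards.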


In [32]:
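def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    '''
    Remove special characters before tokenization.
    '''
    sentence = sentence.strip()
    if keep_apostrophes:
        # Keep apostrophes and periods (and anything else not listed); remove only these symbols.
        # Inside [...] the '|' is a literal character, so pipe characters are removed as well.
        PATTERN = r'[?|$|&|*|%|@|(|)|~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        # Keep only ASCII letters, digits, and spaces.
        PATTERN =  r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN,r'',sentence)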
    return filtered_sentence

In [33]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
print filtered_list_2

['the brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the bookPython is  an amazing language']


In [36]:
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True) for sentence in corpus]
print cleaned_corpus

["the brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book.Python is  an amazing language!"]


### Expanding contractions
A contraction is a shortened form of a word or syllable.