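To turn text into numeric feature input, you first need to clean, normalize, and preprocess the raw text data. Text corpora and raw text are usually neither well-formed nor standardized. Text processing uses a range of techniques to convert raw text into well-defined sequences of linguistic components that have a standard structure and notation.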

This chapter explores the following mainstream text preprocessing techniques:
- Tokenization
- Tagging
- Chunking
- Stemming
- Lemmatization

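## Text tokenization
A token is the smallest independent text component that carries definite syntactic and semantic meaning.

A paragraph of text or a text document has several components, including sentences, which can be further broken down into clauses, phrases, and words.

The most popular tokenization techniques are sentence tokenization and word tokenization, which break a text corpus into sentences and each sentence into words.

Tokenization can therefore be defined as the process of breaking down or splitting text data into smaller, meaningful components.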

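### Sentence tokenization
Sentence tokenization is the process of splitting a text corpus into sentences.

There are several ways to perform sentence tokenization; the basic techniques look for specific delimiters between sentences.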
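
The delimiter idea can be sketched with the standard library alone. The function below is a hypothetical helper written for Python 3, not an NLTK API; it assumes that a sentence ends at `.`, `!`, or `?` followed by whitespace, which is far cruder than the NLTK tokenizers used in the rest of this section.

```python
import re

def naive_sent_tokenize(text):
    """Split text at whitespace that follows a sentence-ending delimiter."""
    # The lookbehind keeps the delimiter attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sentences = naive_sent_tokenize("Python is great! Is it fast? It depends.")
```

The weakness of such a rule shows up quickly: `naive_sent_tokenize("Dr. Smith arrived.")` splits after `Dr.`, which is the kind of case that trained tokenizers such as Punkt are designed to handle.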

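We will use the NLTK framework, which provides several interfaces for sentence tokenization. We will focus mainly on the following sentence tokenizers:
- sent_tokenize
- PunktSentenceTokenizer
- RegexpTokenizer
- Pre-trained sentence tokenization models

Before splitting text into sentences, we need some text to test on.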

In [1]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

In [2]:
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = '''we will discuss briefly about the basic syntax, structure and design philosophies, 
                There is a defined hierarchical syntax for Python code which you should remember when writing code!
                Python is a really powerful programming language!'''

In [3]:
print len(alice)

144395


In [4]:
print alice[0:100]

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


nltk.sent_tokenize is the sentence tokenization function that NLTK recommends by default. Internally it uses an instance of the PunktSentenceTokenizer class.

In [5]:
# basic usage of the function on the sample text
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

print 'Total sentences in sample_text:', len(sample_sentences)
print 'Sample text sentences :-'
pprint(sample_sentences)
print '\nTotal sentences in alice:', len(alice_sentences)
print 'First 5 sentences in alice:-'
pprint(alice_sentences[0:5])

Total sentences in sample_text: 2
Sample text sentences :-
['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
First 5 sentences in alice:-
[u"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 u"Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'",
 u'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up an

### Sentence tokenization for German text

In [6]:
from nltk.corpus import europarl_raw

In [7]:
german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
# length of the text
print len(german_text)
# first 100 characters of the text
print german_text[0:100]

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [8]:
# use the default sent_tokenize tokenizer
german_sentences_def = default_st(text=german_text, language='german')
# load the pre-trained German tokenizer from the NLTK resources
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)
# the German tokenizer is also a PunktSentenceTokenizer
print type(german_tokenizer)

<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>


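The output above shows that german_tokenizer is an instance of PunktSentenceTokenizer trained specifically for German.

Next, compare the results of the default tokenizer with those of the German tokenizer.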

In [9]:
print german_sentences_def == german_sentences
for sent in german_sentences[0:5]:
    print sent

True
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .


In [10]:
# a PunktSentenceTokenizer with default settings also makes sentence tokenization easy
punkt_st = nltk.tokenize.PunktSentenceTokenizer()
sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 'Python is a really powerful programming language!']


### Splitting sentences with a regular expression pattern: the RegexpTokenizer class

In [11]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

['we will discuss briefly about the basic syntax, structure and design philosophies, \n                There is a defined hierarchical syntax for Python code which you should remember when writing code!',
 '                Python is a really powerful programming language!']


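### Word tokenization
Word tokenization is the process of splitting a sentence into its constituent words.

A sentence is a collection of words; word tokenization essentially splits a sentence into a list of words, from which the sentence can be rebuilt.

Word tokenization matters in many processes, especially text cleaning and normalization, where operations such as stemming and lemmatization are applied to each individual word based on its stem and token information.

The main word tokenization interfaces in nltk:
- word_tokenize
- TreebankWordTokenizer
- RegexpTokenizer
- Tokenizers inherited from RegexpTokenizer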

In [12]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print words

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


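nltk.word_tokenize is NLTK's default and recommended word tokenizer. It is essentially a wrapper around an instance of the TreebankWordTokenizer class.

TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to split text. Its main features include:

- Splitting off periods that appear at the end of a sentence
- Splitting off commas and single quotes that are followed by whitespace
- Separating most punctuation characters into independent tokens
- Splitting standard contractions, for example 'don't' into 'do' and 'n't'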
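
To get a feel for the last rule in the list, the contraction split can be imitated with a single regular expression. This is only a sketch of that one behavior under the assumption that `n't` is the only contraction of interest; it is not how TreebankWordTokenizer is implemented, and the function name is hypothetical.

```python
import re

# Alternatives are tried left to right: a word stem that is followed by n't,
# the n't suffix itself, any other word, or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def split_negative_contractions(sentence):
    """Tokenize a sentence, splitting words such as wasn't into was + n't."""
    return _TOKEN_RE.findall(sentence)

tokens_sketch = split_negative_contractions(
    "The brown fox wasn't that quick and he couldn't win the race")
```

For the sample sentence this produces the same tokens as the word_tokenize output shown above, but other contractions such as `'ll` or `'s` would need additional alternatives.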

In [13]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


**The following code performs word tokenization with regular expressions**

In [14]:
TOKEN_PATTERN = r'\w+' # pattern that matches the tokens themselves (gaps=False)
regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)
words = regex_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [15]:
GAP_PATTERN = r'\s+' # pattern that matches the gaps between tokens (gaps=True)
regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN,gaps=True)
words = regex_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


In [16]:
word_indices = list(regex_wt.span_tokenize(sentence))
print word_indices
print [sentence[start:end] for start, end in word_indices]

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


In [17]:
# WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into separate alphabetic and non-alphabetic tokens
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']


In [18]:
# WhitespaceTokenizer splits a sentence into words on whitespace characters such as tabs, newlines, and spaces
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print words

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


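## Text normalization
Text normalization, also often called text cleaning or transformation, covers the following techniques:
- Text cleaning
- Case conversion
- Word correction
- Stopword removal
- Stemming
- Lemmatization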

In [19]:
import re
import string
from pprint import pprint
corpus = ["the brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal! I just bought a phone for $199",
         "@@You'll (learn) a **lot** in the book.Python is  an amazing language!@@"]

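### Text cleaning
To extract meaningful text from sources such as HTML, you can use nltk's clean_html() function, parse the HTML with the BeautifulSoup library, or use regular expressions, XPath, and the lxml library to parse XML data. Note that in NLTK 3.x clean_html() is no longer implemented: calling it raises NotImplementedError and points to BeautifulSoup's get_text() instead.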
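
When neither BeautifulSoup nor lxml is available, a rough tag stripper can be built on the standard library's `html.parser` module. This is a minimal sketch for Python 3 and not a replacement for a real HTML parser: the class and function names are hypothetical, and it assumes that the contents of `script` and `style` elements should be dropped.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r'\s+', ' ', ''.join(parser.parts)).strip()

cleaned = strip_html('<html><head><style>p {color: red}</style></head>'
                     '<body><p>Hello <b>world</b>!</p>'
                     '<script>var x = 1;</script></body></html>')
```

Because text nodes are simply concatenated, adjacent block elements with no whitespace between them run together, which is one reason a dedicated library is usually the better choice.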

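### Text tokenization
Whether you tokenize before or after removing unnecessary characters depends on the problem you are solving and the data you are handling.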

In [20]:
def tokenize_text(text):
    '''
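    Take text data, extract the sentences from it, and then split each sentence into tokens.
    The tokens can be words, special characters, or punctuation.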
    '''
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]
    return word_tokens

In [21]:
token_list = [tokenize_text(text) for text in corpus]
pprint(token_list)

[[['the',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book.Python',
   'is',
   'an',
   'amazing',
   'language',
   '!'],
  ['@', '@']]]


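### Removing special characters
An important task in text normalization is removing unnecessary and special characters, such as special symbols or punctuation.

The reason is that punctuation and special characters usually carry little meaning when we analyze text and extract features or information with NLP and machine learning.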

In [22]:
def remove_characters_after_tokenization(tokens):
    '''
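    string.punctuation contains all the special characters/symbols; a regular expression pattern is built from it.
    The pattern's sub method removes the special characters,
    and filter drops the tokens that end up empty.
    '''
    # escape the punctuation so that it can sit safely inside a character class
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))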
    filtered_tokens = filter(None, [pattern.sub('',token) for token in tokens])
    return filtered_tokens

In [23]:
filtered_list_1 = [filter(None, [remove_characters_after_tokenization(tokens) for tokens in sentence_tokens]) 
                   for sentence_tokens in token_list]

In [24]:
print filtered_list_1

[[['the', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'bookPython', 'is', 'an', 'amazing', 'language']]]


In [25]:
def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    '''
    Remove special characters before tokenization.
    '''
    sentence = sentence.strip()
    if keep_apostrophes:
        # keep apostrophes and periods
        PATTERN = r'[?|$|&|*|%|@|(|)|~]'
        filtered_sentence = re.sub(PATTERN, r'', sentence)
    else:
        PATTERN =  r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN,r'',sentence)
    return filtered_sentence

In [26]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
print filtered_list_2

['the brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal I just bought a phone for 199', 'Youll learn a lot in the bookPython is  an amazing language']


In [27]:
cleaned_corpus = [remove_characters_before_tokenization(sentence, keep_apostrophes=True) for sentence in corpus]
print cleaned_corpus

["the brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal! I just bought a phone for 199", "You'll learn a lot in the book.Python is  an amazing language!"]


### Expanding contractions
Contractions are shortened forms of words or syllables. For example, "is not" is contracted to "isn't".

In [28]:
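# import the mapping from contractions to their expanded forms
# (contractions is presumably a local module that defines CONTRACTION_MAP; it is not shown here)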
from contractions import CONTRACTION_MAP

def expand_contractions(sentence, contraction_mapping):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), flags=re.IGNORECASE|re.DOTALL)
    
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
                                if contraction_mapping.get(match) \
                                else contraction_mapping.get(match.lower())
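        # NOTE: the next line assigns to a misspelled name (expaned_contraction), so the
        # case-restored value is discarded and the form stored in the mapping is returned
        # (hence 'you will' rather than 'You will' in the output below).
        expaned_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction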
    
    expanded_sentence = contractions_pattern.sub(expand_match, sentence)
    return expanded_sentence

In [29]:
expanded_corpus = [expand_contractions(sentence, CONTRACTION_MAP) for sentence in cleaned_corpus]
print expanded_corpus

['the brown fox was not that quick and he could not win the race', 'Hey that is a great deal! I just bought a phone for 199', 'you will learn a lot in the book.Python is  an amazing language!']
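
Since the contents of CONTRACTION_MAP are not shown, the same technique can be tried out with a tiny stand-in. The sketch below is written for Python 3 and uses a hypothetical three-entry map; it also restores the case of the first character, so a sentence-initial contraction keeps its capital letter.

```python
import re

# Hypothetical sample map for illustration; a real map would have many more entries.
SAMPLE_CONTRACTION_MAP = {"isn't": "is not", "you'll": "you will", "can't": "cannot"}

def expand_contractions_sketch(sentence, mapping=SAMPLE_CONTRACTION_MAP):
    """Replace every contraction found in mapping with its expanded form."""
    pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in mapping)),
                         flags=re.IGNORECASE)

    def expand_match(match_obj):
        match = match_obj.group(0)
        expanded = mapping[match.lower()]
        # keep the original case of the first character
        return match[0] + expanded[1:]

    return pattern.sub(expand_match, sentence)

expanded_sample = expand_contractions_sketch("You'll see it isn't hard")
```

The keys are assumed to be lowercase; the case-insensitive flag takes care of matching capitalized contractions in the text.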


### Case conversion

In [30]:
print corpus[0].lower()

the brown fox wasn't that quick and he couldn't win the race


In [31]:
print corpus[0].upper()

THE BROWN FOX WASN'T THAT QUICK AND HE COULDN'T WIN THE RACE


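### Removing stopwords
Stopwords are words that carry little or no meaning.

They are usually removed from text during processing so that the words with the most meaning and context remain.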

In [32]:
def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

In [33]:
expanded_corpus_tokens = [tokenize_text(text) for text in expanded_corpus]
filtered_list_3 = [[remove_stopwords(tokens) for tokens in sentence_tokens] for sentence_tokens in expanded_corpus_tokens]
print filtered_list_3

[[['brown', 'fox', 'quick', 'could', 'win', 'race']], [['Hey', 'great', 'deal', '!'], ['I', 'bought', 'phone', '199']], [['learn', 'lot', 'book.Python', 'amazing', 'language', '!']]]


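**Note that negation words such as not and no in the first sentence were removed as well. These words should usually be kept so that the meaning of a sentence is not distorted in applications such as sentiment analysis.**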
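
One way to keep negations is to subtract them from the stopword list before filtering. The sketch below is written for Python 3 and assumes a small hypothetical stopword set so that it runs without the NLTK corpus; with NLTK, the set returned by `nltk.corpus.stopwords.words('english')` could be passed in instead.

```python
# Hypothetical sample sets for illustration only.
SAMPLE_STOPWORDS = {'the', 'was', 'not', 'that', 'and', 'he', 'no', 'is', 'a'}
NEGATIONS = {'no', 'not', 'nor'}

def remove_stopwords_keep_negations(tokens, stopwords=SAMPLE_STOPWORDS, keep=NEGATIONS):
    """Drop stopwords from tokens, except for the words listed in keep."""
    effective_stopwords = set(stopwords) - set(keep)
    return [token for token in tokens if token.lower() not in effective_stopwords]

kept = remove_stopwords_keep_negations(
    ['the', 'brown', 'fox', 'was', 'not', 'that', 'quick',
     'and', 'he', 'could', 'not', 'win', 'the', 'race'])
```

Unlike remove_stopwords above, this version lowercases each token before the lookup, which is a design choice of the sketch rather than something the original function does.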

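### Word correction
One of the main challenges in text normalization is the presence of incorrect words in the text.
- Correcting repeated characters, for example 'finalllllyyyyyy' to 'finally'
- Correcting spelling mistakes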

In [34]:
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1

while True:
    #remove one repeated character
    new_word = repeat_pattern.sub(match_substitution, old_word)
    if new_word != old_word:
        print 'Step: {} Word: {}'.format(step, new_word)
        step += 1 #update step
        #update old word to last substituted state
        old_word = new_word
        continue
    else:
        print 'Final word:', new_word
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Step: 4 Word: finaly
Final word: finaly


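**The final result above ('finaly') is not correct. Next we bring in WordNet to check each intermediate result semantically: as soon as the current word is a valid word, the correction stops.**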

In [35]:
from nltk.corpus import wordnet
old_word = 'finalllyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
while True:
    # semantic check: stop as soon as the word is found in WordNet
    if wordnet.synsets(old_word):
        print 'Final correct word:', old_word
        break
    #remove one repeated character
    new_word = repeat_pattern.sub(match_substitution, old_word)
    if new_word != old_word:
        print 'Step: {} Word: {}'.format(step, new_word)
        step += 1
        old_word = new_word
        continue
    else:
        print 'Final word:', new_word
        break

Step: 1 Word: finalllyy
Step: 2 Word: finallly
Step: 3 Word: finally
Final correct word: finally


In [36]:
# wrap the logic above in a function
from nltk.corpus import wordnet

def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word!=old_word else new_word
    
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

In [37]:
sample_sentence = 'My schooool is realllllyyy amaaaazingggg'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]
print sample_sentence_tokens

print remove_repeated_characters(sample_sentence_tokens)

['My', 'schooool', 'is', 'realllllyyy', 'amaaaazingggg']
['My', 'school', 'is', 'really', 'amazing']


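#### Correcting spelling mistakes
For more on the correction algorithm, see http://norvig.com/spell-correct.html

Our main goal is, given a word, to find the most likely correct form of that word.

The approach is to generate a set of candidate words that are close to the input word and to pick the most likely word from that set as the correct one.

The final result is chosen from the candidates based on edit distance.

A list of the most common words can be downloaded from http://norvig.com/big.txt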

In [38]:
import re, collections
def tokens(text):
    '''
    Get all words from the corpus
    '''
    return re.findall('[a-z]+', text.lower())

WORDS = tokens(file('big.txt').read())
WORDS_COUNTS = collections.Counter(WORDS)

#top 10 words in the corpus
print WORDS_COUNTS.most_common(10)

[('the', 80030), ('of', 40025), ('and', 38313), ('to', 28766), ('in', 22050), ('a', 21155), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681)]


In [39]:
# define three functions that compute the sets of words at edit distance 0, 1, and 2 from the input word

def edits0(word):
    '''
    Edit distance 0
    '''
    return {word}

def edits1(word):
    '''
    Edit distance 1
    '''
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    def splits(word):
        return [(word[:i], word[i:]) for i in range(len(word)+1)]
    
    pairs = splits(word)
    deletes = [a+b[1:] for (a,b) in pairs if b] # deletion: the result is one character shorter
    transposes = [a+b[1]+b[0]+b[2:] for (a,b) in pairs if len(b)>1] # swap two adjacent characters
    replaces = [a+c+b[1:] for (a,b) in pairs for c in alphabet if b] # replace the character at each position
    inserts = [a+c+b for (a,b) in pairs for c in alphabet]  # insertion: the result is one character longer
    return set(deletes+transposes+replaces+inserts)

def edits2(word):
    '''
    Edit distance 2
    '''
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
    

In [40]:
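# return the subset of the candidates produced by the edit functions that appear in the vocabulary dictionary WORDS_COUNTS
def known(words):
    '''
    Return the set of valid words among the candidates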
    '''
    return {w for w in words if w in WORDS_COUNTS}

In [41]:
# get the most likely candidates based on their edit distance from the input word
word = 'fianlly'
edits0(word)

{'fianlly'}

In [42]:
known(edits0(word))

set()

In [43]:
edits1(word)

{'afianlly',
 'aianlly',
 'bfianlly',
 'bianlly',
 'cfianlly',
 'cianlly',
 'dfianlly',
 'dianlly',
 'efianlly',
 'eianlly',
 'faanlly',
 'faianlly',
 'fainlly',
 'fanlly',
 'fbanlly',
 'fbianlly',
 'fcanlly',
 'fcianlly',
 'fdanlly',
 'fdianlly',
 'feanlly',
 'feianlly',
 'ffanlly',
 'ffianlly',
 'fganlly',
 'fgianlly',
 'fhanlly',
 'fhianlly',
 'fiaally',
 'fiaanlly',
 'fiablly',
 'fiabnlly',
 'fiaclly',
 'fiacnlly',
 'fiadlly',
 'fiadnlly',
 'fiaelly',
 'fiaenlly',
 'fiaflly',
 'fiafnlly',
 'fiaglly',
 'fiagnlly',
 'fiahlly',
 'fiahnlly',
 'fiailly',
 'fiainlly',
 'fiajlly',
 'fiajnlly',
 'fiaklly',
 'fiaknlly',
 'fiallly',
 'fially',
 'fialnlly',
 'fialnly',
 'fiamlly',
 'fiamnlly',
 'fianally',
 'fianaly',
 'fianblly',
 'fianbly',
 'fianclly',
 'fiancly',
 'fiandlly',
 'fiandly',
 'fianelly',
 'fianely',
 'fianflly',
 'fianfly',
 'fianglly',
 'fiangly',
 'fianhlly',
 'fianhly',
 'fianilly',
 'fianily',
 'fianjlly',
 'fianjly',
 'fianklly',
 'fiankly',
 'fianlaly',
 'fianlay',
 'fi

In [44]:
known(edits1(word))

{'finally'}

In [45]:
known(edits2(word))

{'faintly', 'finally', 'finely', 'frankly'}

In [46]:
# give higher priority to words with a smaller edit distance
candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])

In [47]:
candidates

{'finally'}

If several candidates are at the same edit distance, max(candidates, key=WORDS_COUNTS.get) picks the one that occurs most often in the vocabulary dictionary WORDS_COUNTS as the valid word.
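
The tie-breaking step can be seen in isolation with a small counter. The counts below are made up for illustration and do not come from big.txt; the candidate set is the one returned by known(edits2(word)) above.

```python
from collections import Counter

# Hypothetical frequencies, for illustration only.
SAMPLE_COUNTS = Counter({'finally': 120, 'frankly': 45, 'faintly': 30, 'finely': 15})

candidates_at_distance_2 = {'faintly', 'finally', 'finely', 'frankly'}

# max() calls SAMPLE_COUNTS.get on each candidate and keeps the one with the highest count.
best_candidate = max(candidates_at_distance_2, key=SAMPLE_COUNTS.get)

# With a single fallback candidate nothing is compared, so an unknown word is returned as is.
fallback = max(['zzzz'], key=SAMPLE_COUNTS.get)
```

Here `best_candidate` is `'finally'` because it has the highest count among the four candidates.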

In [48]:
def correct(word):
    '''
    Get the best correct spelling for the input word
    '''
    candidates = (known(edits0(word)) or known(edits1(word)) or known(edits2(word)) or [word])
    return max(candidates, key=WORDS_COUNTS.get)

In [49]:
correct('fianlly')

'finally'

In [50]:
correct('FIANLLY') # the function is case-sensitive and cannot correct words that are not lowercase

'FIANLLY'

In [51]:
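# improve the function above so that it can correct both uppercase and lowercase words
def correct_match(match):
    '''
    Remember the original case of the word,
    convert all letters to lowercase, correct the spelling,
    and finally use case_of to restore the original case.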
    '''
    word = match.group(0)
    def case_of(text):
        return (str.upper if text.isupper() else
               str.lower if text.islower() else
               str.title if text.istitle() else
               str)
    
    return case_of(word)(correct(word.lower()))

def correct_text_generic(text):
    return re.sub('[a-zA-Z]+', correct_match, text)

In [52]:
print correct_text_generic('fianlly')

finally


In [53]:
print correct_text_generic('FIANLLY')

FINALLY


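The approach above has a weakness: a word whose correct form does not appear in the vocabulary dictionary may not be corrected. Ready-made libraries are an alternative, for example the suggest function of the pattern library: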

In [54]:
from pattern.en import suggest
print suggest('fianlly')

print suggest('flaot')

[('finally', 1.0)]
[('flat', 0.85), ('float', 0.15)]


Other options include the PyEnchant library, which is based on the enchant library, and aspell-python, a Python wrapper around the popular GNU Aspell.

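### Stemming
A **morpheme** is the smallest independent unit in any natural language. Morphemes consist of stems and affixes.

**Affixes** are units such as prefixes and suffixes that are attached to a stem to change its meaning or to create a new word.

The **stem** is also often called the base form of a word. New words can be created by attaching affixes to a stem, a process known as inflection.

The reverse process, obtaining the base form of a word from its inflected form, is called **stemming**.

nltk has several stemmer implementations. They live in the stem module, and each of them implements the StemmerI interface defined in the nltk.stem.api module.

One of the most popular stemmers is the Porter stemmer.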

In [55]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

print ps.stem('jumping'), ps.stem('jumps'), ps.stem('jumped')

jump jump jump


In [56]:
print ps.stem('lying')

lie


In [57]:
print ps.stem('strange')

strang


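**The Lancaster stemmer is based on the Lancaster stemming algorithm and is also commonly known as the Paice/Husk stemmer.**

**It is an iterative stemmer with over 120 rules that specify how to remove or replace affixes to obtain the stem.**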

In [58]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()

print ls.stem('jumping'), ls.stem('jumps'), ls.stem('jumped')

jump jump jump


In [59]:
print ls.stem('lying')
print ls.stem('strange')

lying
strange


**RegexpStemmer lets you build your own stemmer from user-defined rules.**

**It uses regular expressions to identify morphological affixes in words and removes any part that matches.**

In [60]:
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$', min=4)

print rs.stem('jumping'),rs.stem('jumps'), rs.stem('jumped')

print rs.stem('lying')

print rs.stem('strange')

jump jump jump
ly
strange


**Use SnowballStemmer to stem words in other languages.**

In [61]:
from nltk.stem import SnowballStemmer
# German
ss = SnowballStemmer('german')
print 'Supported Languages:', SnowballStemmer.languages
print ss.stem('autobahnen')
print ss.stem('springen')

Supported Languages: (u'danish', u'dutch', u'english', u'finnish', u'french', u'german', u'hungarian', u'italian', u'norwegian', u'porter', u'portuguese', u'romanian', u'russian', u'spanish', u'swedish')
autobahn
spring


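The Porter stemmer is the most commonly used stemmer today, but in practice you should still choose a stemmer based on the problem at hand and validate it through trial and error.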

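### Lemmatization
Lemmatization is very similar to stemming: affixes are removed to obtain the base form of a word. Here the base form is called the **root word** rather than the **stem**.

The root word, also known as the **lemma**, is always present in the dictionary.

Lemmatization is considerably slower than stemming because it involves an additional step: the root form or lemma is formed by removing affixes **if and only if the lemma is present in the dictionary**.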
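
The difference made by the dictionary check can be shown with a toy example. The sketch below is not how NLTK's stemmers or WordNetLemmatizer work internally; it assumes a hypothetical four-word lexicon and a fixed suffix list, and only contrasts unconditional suffix stripping with stripping that must land on a known word.

```python
# Hypothetical mini-lexicon standing in for a real dictionary such as WordNet.
SAMPLE_LEXICON = {'jump', 'car', 'strange', 'run'}
SUFFIXES = ('ing', 'ed', 's', 'e')

def toy_stem(word):
    """Strip the first matching suffix unconditionally (stemmer-like behavior)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word, lexicon=SAMPLE_LEXICON):
    """Strip a suffix only if the result is in the lexicon; otherwise return the word unchanged."""
    if word in lexicon:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in lexicon:
            return word[:-len(suffix)]
    return word

stemmed = toy_stem('strange')           # 'strang', which is not a dictionary word
lemmatized = toy_lemmatize('strange')   # 'strange', because the word itself is in the lexicon
```

The extra lookups are what make the lemmatizing variant slower, and they are also why its output is always either a lexicon entry or the untouched input.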

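nltk has a robust lemmatization module that uses WordNet together with the word's syntax and semantics to obtain the root word or lemma.

Parts of speech mainly comprise three entities (nouns, verbs, and adjectives) that are the most common in natural language. The second argument of lemmatize in the cells below ('n', 'v', or 'a') selects among them.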

In [62]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

print wnl.lemmatize('cars','n')
print wnl.lemmatize('men','n')

car
men


In [63]:
print wnl.lemmatize('running','v')

run


In [64]:
print wnl.lemmatize('ate','v')
print wnl.lemmatize('saddest','a')
print wnl.lemmatize('fancier','a')

eat
sad
fancy


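WordNetLemmatizer takes a word and its part of speech, compares them against the WordNet corpus, and recursively removes affixes until it finds a match in WordNet, which yields the base form or lemma of the input word. If no match is found, the input word is returned unchanged.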

In [65]:
# lemmatization has no effect if the part of speech is wrong
print wnl.lemmatize('ate','n')
print wnl.lemmatize('fancier','v')

ate
fancier


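## Understanding text syntax and structure
This section focuses on the following techniques:
- Parts-of-speech (POS) tagging
- Shallow parsing
- Dependency-based parsing
- Constituency-based parsing

Setup for this section:

 1. Download spaCy's English language model.
 2. Install the Stanford Parser, a Java-based language parser developed at Stanford University that helps us parse sentences to understand their underlying structure. We will use the Stanford Parser together with nltk to perform dependency-based and constituency-based parsing.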
 
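### POS tagging
Parts of speech are specific lexical categories to which words are assigned based on their syntactic context and role. The process of classifying words and labeling them with POS tags is called parts-of-speech tagging or POS tagging.

For more information about the various POS tags and their notation, see www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Penn-Treebank-Tagset.pdf

#### Recommended POS taggers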


In [66]:
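# the first approach is nltk's recommended pos_tag() function, which is based on the Penn Treebank
# (tagset='universal' below maps the tags to the universal tagset)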
sentence = 'The brown fox is quick and he is jumping over the lazy dog'

tokens = nltk.word_tokenize(sentence)

tagged_sent = nltk.pos_tag(tokens, tagset='universal')
print tagged_sent

[('The', u'DET'), ('brown', u'ADJ'), ('fox', u'NOUN'), ('is', u'VERB'), ('quick', u'ADJ'), ('and', u'CONJ'), ('he', u'PRON'), ('is', u'VERB'), ('jumping', u'VERB'), ('over', u'ADP'), ('the', u'DET'), ('lazy', u'ADJ'), ('dog', u'NOUN')]


In [67]:
# the second approach uses the pattern module to get the POS tags of a sentence
from pattern.en import tag
tagged_sent = tag(sentence)
print tagged_sent

[(u'The', u'DT'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'is', u'VBZ'), (u'quick', u'JJ'), (u'and', u'CC'), (u'he', u'PRP'), (u'is', u'VBZ'), (u'jumping', u'VBG'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]


In [68]:
# the third approach is to build your own POS tagger
from nltk.corpus import treebank
data = treebank.tagged_sents()
train_data = data[:3500]
test_data = data[3500:]

print train_data[0]

[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]


**Evaluating tagger performance**

In [69]:
from nltk.tag import DefaultTagger
dt = DefaultTagger('NN')

print dt.evaluate(test_data)
print dt.tag(tokens)

0.145415819537
[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NN'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NN'), ('jumping', 'NN'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


In [70]:
# use regular expressions and RegexpTagger to try to build a better-performing tagger
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$','VBG'),  #gerunds
    (r'.*ed$','VBD'),   #simple past
    (r'.*es$', 'VBZ'),  #3rd singular present
    (r'.*ould$','MD'),  #modals
    (r'.*\'s$','NN$'),  #possessive nouns
    (r'.*s$','NNS'),    #plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$','CD'), #cardinal numbers
    (r'.*', 'NN')       # nouns (default) ...
    
]

In [71]:
rt = RegexpTagger(patterns)
print rt.evaluate(test_data)

0.240391131765


In [72]:
print rt.tag(tokens)

[('The', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('is', 'NNS'), ('quick', 'NN'), ('and', 'NN'), ('he', 'NN'), ('is', 'NNS'), ('jumping', 'VBG'), ('over', 'NN'), ('the', 'NN'), ('lazy', 'NN'), ('dog', 'NN')]


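An n-gram is a sequence of n contiguous items from a sequence of text or speech. The items can be words, phonemes, letters, characters, or syllables. The UnigramTagger, BigramTagger, and TrigramTagger used next are n-gram taggers for n = 1, 2, and 3.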
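
The definition translates directly into a few lines of code. The helper below is a minimal sketch with a hypothetical name; it works on any sequence, so the items can be words in a list or characters in a string.

```python
def ngrams_sketch(items, n):
    """Return all runs of n contiguous items as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

word_bigrams = ngrams_sketch(['the', 'brown', 'fox', 'jumps'], 2)
char_trigrams = ngrams_sketch('quick', 3)
```

A sequence shorter than n simply yields an empty list.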

In [74]:
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

ut = UnigramTagger(train_data)
bt = BigramTagger(train_data)
tt = TrigramTagger(train_data)

#testing performance of unigram tagger
print ut.evaluate(test_data)
print ut.tag(tokens)

#testing performance of bigram tagger
print bt.evaluate(test_data)
print bt.tag(tokens)

#testing performance of trigram tagger
print tt.evaluate(test_data)
print tt.tag(tokens)

0.861361215994
[('The', u'DT'), ('brown', None), ('fox', None), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', u'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', None), ('dog', None)]
0.134669377481
[('The', u'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]
0.0806467228192
[('The', u'DT'), ('brown', None), ('fox', None), ('is', None), ('quick', None), ('and', None), ('he', None), ('is', None), ('jumping', None), ('over', None), ('the', None), ('lazy', None), ('dog', None)]


In [76]:
# combine all the taggers by chaining them: each tagger in the list uses the previously built tagger as its backoff tagger
def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

ct = combined_tagger(train_data=train_data, taggers=[UnigramTagger, BigramTagger, TrigramTagger], backoff=rt)
print ct.evaluate(test_data)
print ct.tag(tokens)

0.910155871817
[('The', u'DT'), ('brown', 'NN'), ('fox', 'NN'), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', 'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', 'NN'), ('dog', 'NN')]
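
The backoff idea behind the combined tagger can be illustrated without NLTK. The sketch below is a toy under stated assumptions, not NLTK's implementation: it uses a hypothetical one-sentence training set, a unigram lookup table, a short list of suffix rules, and a default tag, and each stage is consulted only when the previous one has no answer.

```python
import re
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def make_backoff_tagger(unigram_table, suffix_rules, default_tag):
    """Build a tagger that tries the table, then the suffix rules, then the default tag."""
    def tag(tokens):
        tagged = []
        for token in tokens:
            token_tag = unigram_table.get(token)
            if token_tag is None:
                # back off to the first suffix rule that matches, else to the default tag
                token_tag = next((rule_tag for pattern, rule_tag in suffix_rules
                                  if re.search(pattern, token)), default_tag)
            tagged.append((token, token_tag))
        return tagged
    return tag

# Hypothetical training data and rules, for illustration only.
SAMPLE_TRAIN = [[('the', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('quick', 'JJ')]]
SAMPLE_RULES = [(r'ing$', 'VBG'), (r'ed$', 'VBD')]

toy_tagger = make_backoff_tagger(train_unigram(SAMPLE_TRAIN), SAMPLE_RULES, 'NN')
toy_tags = toy_tagger(['the', 'fox', 'is', 'jumping'])
```

This mirrors what happens in the output above: words seen in training keep their learned tag, while unseen words such as 'brown' or 'fox' end up with whatever the backoff stages assign.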


In [78]:
'''
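For the final tagger, we train it with a supervised classification algorithm.
The ClassifierBasedPOSTagger class lets us train a tagger with the supervised machine learning algorithm passed in the classifier_builder parameter. The class's feature_detector() method is used to generate various features from the training data.
Here the classifier is NaiveBayesClassifier, which uses Bayes' theorem to build a probabilistic classifier and assumes that the features are independent.
'''
# build a classification-based POS tagger and evaluate it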
from nltk.classify import NaiveBayesClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

nbt = ClassifierBasedPOSTagger(train=train_data, classifier_builder=NaiveBayesClassifier.train)
print nbt.evaluate(test_data)
print nbt.tag(tokens)

0.930680607997
[('The', u'DT'), ('brown', u'JJ'), ('fox', u'NN'), ('is', u'VBZ'), ('quick', u'JJ'), ('and', u'CC'), ('he', u'PRP'), ('is', u'VBZ'), ('jumping', u'VBG'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'VBG')]


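## Shallow parsing
Shallow parsing, also known as light parsing or chunking, is a technique for analyzing the structure of a sentence: it breaks the sentence down into its smallest constituents and then groups them into higher-level phrases.

In shallow parsing, the main focus is on identifying these phrases or chunks. Its main purpose is to obtain semantically meaningful phrases and to observe the relations among them.

### Recommended shallow parsers
Here we use the pattern package to create a shallow parser that extracts meaningful chunks from sentences.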

In [79]:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'

from pattern.en import parsetree
tree = parsetree(sentence)

print tree

[Sentence('The/DT/B-NP/O brown/JJ/I-NP/O fox/NN/I-NP/O is/VBZ/B-VP/O quick/JJ/B-ADJP/O and/CC/O/O he/PRP/B-NP/O is/VBZ/B-VP/O jumping/VBG/I-VP/O over/IN/B-PP/B-PNP the/DT/B-NP/I-PNP lazy/JJ/I-NP/I-PNP dog/NN/I-NP/I-PNP')]


In [80]:
for sentence_tree in tree:
    print sentence_tree.chunks

[Chunk('The brown fox/NP'), Chunk('is/VP'), Chunk('quick/ADJP'), Chunk('he/NP'), Chunk('is jumping/VP'), Chunk('over/PP'), Chunk('the lazy dog/NP')]


In [81]:
for sentence_tree in tree:
    for chunk in sentence_tree.chunks:
        print chunk.type, '->', [(word.string, word.type) for word in chunk.words]

NP -> [(u'The', u'DT'), (u'brown', u'JJ'), (u'fox', u'NN')]
VP -> [(u'is', u'VBZ')]
ADJP -> [(u'quick', u'JJ')]
NP -> [(u'he', u'PRP')]
VP -> [(u'is', u'VBZ'), (u'jumping', u'VBG')]
PP -> [(u'over', u'IN')]
NP -> [(u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]


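The prefixes I, O, and B belong to the IOB notation, which is widely used in chunking; they stand for Inside, Outside, and Beginning.

A B- prefix before a tag marks the beginning of a chunk, and an I- prefix marks a token inside a chunk. The O tag marks a token that does not belong to any chunk.

The B- tag matters most when a chunk is directly followed by another chunk of the same type with no O tag in between: the B- tag is what marks where the new chunk starts.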
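
Reading chunks back out of IOB tags takes only a small loop. The function below is a minimal sketch with a hypothetical name; it assumes the input is a list of (word, IOB tag) pairs such as ('The', 'B-NP'), which is a simplification of the slash-separated format that pattern prints.

```python
def iob_to_chunks(tagged_tokens):
    """Group (word, iob_tag) pairs into (chunk_type, [words]) tuples."""
    chunks = []
    current_type, current_words = None, []
    for word, iob in tagged_tokens:
        if iob == 'O':
            prefix, chunk_type = 'O', None
        else:
            prefix, chunk_type = iob.split('-', 1)
        # A chunk ends at an O tag, at a B- tag, or when the chunk type changes.
        starts_new = prefix == 'B' or (prefix == 'I' and chunk_type != current_type)
        if prefix == 'O' or starts_new:
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
        if prefix != 'O':
            if not current_words:
                current_type = chunk_type
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks

sample_iob = [('The', 'B-NP'), ('brown', 'I-NP'), ('fox', 'I-NP'), ('is', 'B-VP'),
              ('quick', 'B-ADJP'), ('and', 'O'), ('he', 'B-NP'),
              ('is', 'B-VP'), ('jumping', 'I-VP')]
sample_chunks = iob_to_chunks(sample_iob)
```

For the first part of the sample sentence this recovers the same NP, VP, and ADJP groups that the chunk listing above shows, and the token tagged O ('and') is left out of every chunk.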


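Next, we build some general-purpose functions to better parse and visualize shallow-parsed sentence trees; they can be reused when parsing other sentences.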

In [84]:
from pattern.en import parsetree,Chunk
from nltk.tree import Tree

def create_sentence_tree(sentence, lemmatize=False):
    sentence_tree = parsetree(sentence,
                             relations=True,
                             lemmata=lemmatize) # if you want to lemmatize the tokens
    return sentence_tree[0]

In [85]:
#get various constituents of the parse tree
def get_sentence_tree_constituents(sentence_tree):
    return sentence_tree.constituents()

#process the shallow parsed tree into an easy to understand format
def process_sentence_tree(sentence_tree):
    tree_constituents = get_sentence_tree_constituents(sentence_tree)
    processed_tree = [
        (item.type,
        [
            (w.string,w.type)
            for w in item.words
        ]
        )
        if type(item) == Chunk
        else ('-',
             [
                 (item.string, item.type)
             ]
             )
             for item in tree_constituents
    ]
    return processed_tree

In [86]:
#print the sentence tree using nltk's Tree syntax
def print_sentence_tree(sentence_tree):
    processed_tree = process_sentence_tree(sentence_tree)
    processed_tree = [
        Tree(item[0],
            [
                Tree(x[1],[x[0]])
                for x in item[1]
            ]
            )
        for item in processed_tree
    ]
    
    tree = Tree('S', processed_tree)
    print tree
    
def visualize_sentence_tree(sentence_tree):
    processed_tree = process_sentence_tree(sentence_tree)
    processed_tree = [
        Tree(item[0],
            [
                Tree(x[1],[x[0]])
                for x in item[1]
            ]
            )
        for item in processed_tree
    ]
    tree = Tree('S', processed_tree)
    tree.draw()

In [89]:
t = create_sentence_tree(sentence)
print t
print 
pt = process_sentence_tree(t)
print pt

Sentence('The/DT/B-NP/O/NP-SBJ-1 brown/JJ/I-NP/O/NP-SBJ-1 fox/NN/I-NP/O/NP-SBJ-1 is/VBZ/B-VP/O/VP-1 quick/JJ/B-ADJP/O/O and/CC/O/O/O he/PRP/B-NP/O/NP-SBJ-2 is/VBZ/B-VP/O/VP-2 jumping/VBG/I-VP/O/VP-2 over/IN/B-PP/B-PNP/O the/DT/B-NP/I-PNP/O lazy/JJ/I-NP/I-PNP/O dog/NN/I-NP/I-PNP/O')

[(u'NP', [(u'The', u'DT'), (u'brown', u'JJ'), (u'fox', u'NN')]), (u'VP', [(u'is', u'VBZ')]), (u'ADJP', [(u'quick', u'JJ')]), ('-', [(u'and', u'CC')]), (u'NP', [(u'he', u'PRP')]), (u'VP', [(u'is', u'VBZ'), (u'jumping', u'VBG')]), (u'PP', [(u'over', u'IN')]), (u'NP', [(u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')])]


In [90]:
print_sentence_tree(t)

(S
  (NP (DT The) (JJ brown) (NN fox))
  (VP (VBZ is))
  (ADJP (JJ quick))
  (- (CC and))
  (NP (PRP he))
  (VP (VBZ is) (VBG jumping))
  (PP (IN over))
  (NP (DT the) (JJ lazy) (NN dog)))


In [94]:
visualize_sentence_tree(t)

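### Building Your Own Shallow Parser
We will build our own shallow parsers using techniques such as regular expressions and tagging-based learners.

In nltk, we can use the **treebank corpus** (loaded below as `treebank_chunk`), which comes with **chunk annotations**.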

In [93]:
from nltk.corpus import treebank_chunk
data = treebank_chunk.chunked_sents()
train_data = data[:4000]
test_data = data[4000:]

print train_data[7]

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


In [98]:
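'''
Our data points are sentences annotated with phrase (chunk) tags and POS tags, which will help in training shallow parsers.
For shallow parsing with regular expressions, we will use the concepts of chunking and chinking.
With chunking, we specify particular patterns that identify what we want to chunk, or segment, in a sentence.
Chinking is the reverse of chunking: we specify particular tokens that must not belong to any chunk, and then form the necessary chunks from everything except those tokens.
'''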

sample_sentence = 'the quick fox jumped over the lazy dog'

from nltk.chunk import RegexpParser
from pattern.en import tag

tagged_simple_sent = tag(sample_sentence)
print tagged_simple_sent

chunk_grammar = '''NP:{<DT>?<JJ>*<NN.*>}'''
rc = RegexpParser(chunk_grammar)
c = rc.parse(tagged_simple_sent)
print c

[(u'the', u'DT'), (u'quick', u'JJ'), (u'fox', u'NN'), (u'jumped', u'VBD'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
(S
  (NP the/DT quick/JJ fox/NN)
  jumped/VBD
  over/IN
  (NP the/DT lazy/JJ dog/NN))


In [100]:
#illustrate NP chunking based on explicit chink patterns
#the chink rule must sit on its own line; on the same line it is swallowed by the '#' comment
chink_grammar = '''
NP: {<.*>+} # chunk everything as NP
}<VBD|IN>+{ # chink sequences of VBD and IN
'''
rc = RegexpParser(chink_grammar)
c = rc.parse(tagged_simple_sent)
print c

(S
  (NP the/DT quick/JJ fox/NN)
  jumped/VBD
  over/IN
  (NP the/DT lazy/JJ dog/NN))


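Remember that a chunk is a sequence of tokens grouped together into a single unit, whereas a chink is a token or sequence of tokens that is excluded from the chunks.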
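
To make the distinction concrete, here is a minimal pure-Python sketch of chinking that does not use nltk. The helper name `chink_split` and the choice of `VBD` and `IN` as chink tags are assumptions made for illustration: every token starts out inside one big chunk, and the chunk is split wherever a chink tag occurs.

```python
def chink_split(tagged_tokens, chink_tags):
    """Split tagged tokens into chunks by excluding tokens with a chink tag.

    Each returned item is either a list of (word, tag) pairs (a chunk)
    or a single (word, tag) pair (a chinked token outside any chunk).
    """
    result = []
    current_chunk = []
    for word, tag in tagged_tokens:
        if tag in chink_tags:
            # A chink closes the chunk that was being built, if any.
            if current_chunk:
                result.append(current_chunk)
                current_chunk = []
            result.append((word, tag))
        else:
            current_chunk.append((word, tag))
    if current_chunk:
        result.append(current_chunk)
    return result


tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'),
          ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
pieces = chink_split(tagged, chink_tags={'VBD', 'IN'})
for piece in pieces:
    print(piece)
```

The two lists that come back are the chunks ("the quick fox" and "the lazy dog"); the bare tuples for "jumped" and "over" are the chinks.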

We will now train a more general regular-expression-based shallow parser and test its performance on our held-out treebank test data. Internally, several steps are carried out to perform this parsing:

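- First, the Tree structures that nltk uses to represent parsed sentences are converted into ChunkString objects.
- Next, a RegexpParser object is created from the defined chunking and chinking rules.
- Finally, objects of the ChunkRule and ChinkRule classes are used to build the complete shallow-parsed tree with the necessary chunks.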
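
The idea behind the first step can be sketched without nltk: encode the POS tags as a single string such as `<DT><JJ><NN>` and let an ordinary regular expression find the spans to chunk. This is only a simplified imitation of that idea, not nltk's actual ChunkString implementation. The function name `regex_chunk` and the hand-written pattern are assumptions, and the pattern must be built from whole `<TAG>` units and must never match the empty string.

```python
import re


def regex_chunk(tagged_tokens, label, tag_regex):
    """Chunk tokens whose POS-tag sequence matches tag_regex."""
    # Encode the tags as one string and remember which token starts
    # at each '<' offset in that string.
    tag_string = ''
    token_at_offset = {}
    for index, (_, tag) in enumerate(tagged_tokens):
        token_at_offset[len(tag_string)] = index
        tag_string += '<%s>' % tag
    token_at_offset[len(tag_string)] = len(tagged_tokens)  # end sentinel

    result = []
    next_token = 0  # first token that has not been emitted yet
    for match in re.finditer(tag_regex, tag_string):
        first = token_at_offset[match.start()]
        last = token_at_offset[match.end()]
        result.extend(tagged_tokens[next_token:first])    # tokens outside chunks
        result.append((label, tagged_tokens[first:last]))  # the chunk itself
        next_token = last
    result.extend(tagged_tokens[next_token:])
    return result


# Hand-written counterpart of the tag pattern <DT>?<JJ>*<NN.*>
NP_PATTERN = r'(?:<DT>)?(?:<JJ>)*<NN[^>]*>'

tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'),
          ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
chunks = regex_chunk(tagged, 'NP', NP_PATTERN)
for item in chunks:
    print(item)
```

Writing `[^>]*` instead of `.*` keeps the wildcard from running across tag boundaries, which is the main thing a tag-pattern translation has to take care of.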

In [101]:
tagged_sentence = tag(sentence)
print tagged_sentence

grammar = '''
NP: {<DT>?<JJ>?<NN.*>}
ADJP: {<JJ>}
ADVP: {<RB.*>}
PP: {<IN>}
VP: {<MD>?<VB.*>+}
'''

rc = RegexpParser(grammar)
c = rc.parse(tagged_sentence)
print c
print rc.evaluate(test_data)

[(u'The', u'DT'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'is', u'VBZ'), (u'quick', u'JJ'), (u'and', u'CC'), (u'he', u'PRP'), (u'is', u'VBZ'), (u'jumping', u'VBG'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
(S
  (NP The/DT brown/JJ fox/NN)
  (VP is/VBZ)
  (ADJP quick/JJ)
  and/CC
  he/PRP
  (VP is/VBZ jumping/VBG)
  (PP over/IN)
  (NP the/DT lazy/JJ dog/NN))
ChunkParse score:
    IOB Accuracy:  54.5%%
    Precision:     25.0%%
    Recall:        52.5%%
    F-Measure:     33.9%%
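
To read these scores, it helps to see how chunk-level precision, recall and F-measure can be computed. The following is a simplified sketch rather than nltk's own scoring code; the function names and the toy tag sequences are assumptions. A predicted chunk counts as correct only if its span and its label both match a gold chunk.

```python
def iob_to_spans(iob_tags):
    """Return the set of (start, end, label) chunk spans encoded by IOB tags."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(iob_tags) + ['O']):  # sentinel closes the last chunk
        starts_chunk = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != label)
        if start is not None and (tag == 'O' or starts_chunk):
            spans.add((start, i, label))
            start, label = None, None
        if starts_chunk:
            start, label = i, tag[2:]
    return spans


def chunk_scores(gold_tags, predicted_tags):
    """Compute chunk-level precision, recall and F-measure."""
    gold = iob_to_spans(gold_tags)
    predicted = iob_to_spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / float(len(predicted)) if predicted else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: the gold data marks only NP chunks, while the prediction
# also emits a VP and a PP chunk.
gold = ['B-NP', 'I-NP', 'I-NP', 'O', 'O', 'B-NP', 'I-NP', 'I-NP']
predicted = ['B-NP', 'I-NP', 'I-NP', 'B-VP', 'B-PP', 'B-NP', 'I-NP', 'I-NP']
precision, recall, f_measure = chunk_scores(gold, predicted)
print('precision=%.2f recall=%.2f f_measure=%.2f' % (precision, recall, f_measure))
```

The F-measure is the harmonic mean of precision and recall, which agrees with the reported figures: 2 × 25.0 × 52.5 / (25.0 + 52.5) ≈ 33.9. The toy example also hints at one likely reason for the low precision: the sample tree from `treebank_chunk` shown earlier marks only NP chunks, so any VP, PP, ADJP or ADVP chunk produced by the grammar would count against it.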


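Next, we will build a shallow parser from the chunked and tagged treebank training data.

We will use two chunking utility functions:
- tree2conlltags, which extracts a triple for each token: the word, its POS tag, and its chunk tag;
- conlltags2tree, which generates a parse tree from those triples.

Remember that the chunk tags use the IOB format mentioned earlier: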

In [102]:
from nltk.chunk.util import tree2conlltags,conlltags2tree

train_sent = train_data[7]
print train_sent

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


In [103]:
wtc = tree2conlltags(train_sent)
wtc

[(u'A', u'DT', u'B-NP'),
 (u'Lorillard', u'NNP', u'I-NP'),
 (u'spokewoman', u'NN', u'I-NP'),
 (u'said', u'VBD', u'O'),
 (u',', u',', u'O'),
 (u'``', u'``', u'O'),
 (u'This', u'DT', u'B-NP'),
 (u'is', u'VBZ', u'O'),
 (u'an', u'DT', u'B-NP'),
 (u'old', u'JJ', u'I-NP'),
 (u'story', u'NN', u'I-NP'),
 (u'.', u'.', u'O')]

In [104]:
tree = conlltags2tree(wtc)
print tree

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)
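
The round trip between IOB triples and a chunked structure can also be sketched in plain Python. This is an illustrative stand-in for the two nltk functions, not their implementation; the names `group_iob` and `to_iob` and the nested-tuple representation are assumptions.

```python
def group_iob(triples):
    """Group (word, pos, iob) triples into (label, [(word, pos), ...]) chunks."""
    grouped = []
    open_label = None  # label of the chunk currently being extended, if any
    for word, pos, iob in triples:
        if iob == 'O':
            grouped.append((word, pos))
            open_label = None
        elif iob.startswith('I-') and iob[2:] == open_label:
            grouped[-1][1].append((word, pos))
        else:
            # 'B-X' (or a stray 'I-X') starts a new chunk labelled X.
            open_label = iob[2:]
            grouped.append((open_label, [(word, pos)]))
    return grouped


def to_iob(grouped):
    """Flatten the grouped structure back into (word, pos, iob) triples."""
    triples = []
    for item in grouped:
        if isinstance(item[1], list):  # a (label, tokens) chunk
            label, tokens = item
            for i, (word, pos) in enumerate(tokens):
                prefix = 'B-' if i == 0 else 'I-'
                triples.append((word, pos, prefix + label))
        else:  # a plain (word, pos) token outside any chunk
            word, pos = item
            triples.append((word, pos, 'O'))
    return triples


wtc = [('A', 'DT', 'B-NP'), ('Lorillard', 'NNP', 'I-NP'),
       ('spokewoman', 'NN', 'I-NP'), ('said', 'VBD', 'O'), (',', ',', 'O'),
       ('``', '``', 'O'), ('This', 'DT', 'B-NP'), ('is', 'VBZ', 'O'),
       ('an', 'DT', 'B-NP'), ('old', 'JJ', 'I-NP'), ('story', 'NN', 'I-NP'),
       ('.', '.', 'O')]
grouped = group_iob(wtc)
for item in grouped:
    print(item)
print(to_iob(grouped) == wtc)
```

The `B-` prefix is what lets two adjacent chunks with the same label stay separate; without it, the boundary between them would be lost when flattening.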


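Next, we define a function conll_tag_chunks() that extracts the POS tags and chunk tags from chunk-annotated sentences. The same cell also defines combined_tagger(), which trains a sequence of taggers so that each one uses the previously trained tagger as its backoff.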

In [105]:
def conll_tag_chunks(chunk_sents):
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t,c) for (w,t,c) in sent] for sent in tagged_sents]

def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff
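
The backoff chain that combined_tagger() builds can be pictured with a toy n-gram tagger written in plain Python. This is a sketch under simplifying assumptions, not how nltk's taggers are implemented: the class `TinyNgramTagger` and the tiny training set are made up for illustration. As in the chunker, the "tokens" are POS tags and the "tags" to predict are IOB chunk tags.

```python
from collections import Counter, defaultdict


class TinyNgramTagger(object):
    """Toy tagger: context = (previous n-1 assigned tags, current token)."""

    def __init__(self, train_sents, n, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for token, tag in sent:
                counts[self._context(token, history)][tag] += 1
                history.append(tag)
        # Keep only the most frequent tag for each context.
        self.table = dict((context, tags.most_common(1)[0][0])
                          for context, tags in counts.items())

    def _context(self, token, history):
        previous = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (previous, token)

    def choose_tag(self, token, history):
        context = self._context(token, history)
        if context in self.table:
            return self.table[context]
        if self.backoff is not None:
            # Unseen context: defer to the less specific tagger.
            return self.backoff.choose_tag(token, history)
        return None

    def tag(self, tokens):
        tagged, history = [], []
        for token in tokens:
            tag = self.choose_tag(token, history)
            tagged.append((token, tag))
            history.append(tag)
        return tagged


train = [
    [('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('VBD', 'O'),
     ('DT', 'B-NP'), ('NN', 'I-NP')],
    [('PRP', 'B-NP'), ('VBZ', 'O'), ('VBG', 'O')],
]
unigram = TinyNgramTagger(train, n=1)
bigram = TinyNgramTagger(train, n=2, backoff=unigram)
result = bigram.tag(['PRP', 'VBD', 'DT', 'JJ', 'NN'])
print(result)
```

In this run the bigram table has never seen `VBD` directly after a `B-NP` tag, so that decision falls through to the unigram tagger; the other four tags come from the bigram table. A POS tag that neither tagger has seen gets `None`.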

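We now define a class NGramTagChunker that takes chunked sentences as training input and extracts their WTC triples, that is, (word, POS tag, chunk tag) triples.

It then trains a BigramTagger with a UnigramTagger as its backoff tagger. We also define a parse() function to perform shallow parsing on new sentences.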

In [107]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.chunk import ChunkParserI

class NGramTagChunker(ChunkParserI):
    
    def __init__(self, train_sentences, tagger_classes=[UnigramTagger, BigramTagger]):
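        '''
        Train a shallow parser with n-gram taggers, based on the WTC triples of sentences.
        Takes a list of training sentences annotated with chunked parse-tree metadata.
        Uses conll_tag_chunks() to derive the (POS tag, chunk tag) pairs from the WTC triples of every chunked tree,
        then trains a bigram tagger on those pairs, with a unigram tagger as its backoff tagger,
        and stores the trained model in self.chunk_tagger.
        
        The tagger_classes parameter lets you plug in other n-gram taggers.
        '''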
        train_sent_tags = conll_tag_chunks(train_sentences)
        self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)
        
    def parse(self, tagged_sentence):
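        '''
        Perform shallow parsing on a new sentence; evaluate() also relies on this method when scoring test data.
        Takes a POS-tagged sentence as input, separates out the POS tags,
        and uses the trained self.chunk_tagger to get the IOB chunk tag for each of them.
        The chunk tags are then combined with the original sentence tokens,
        and conlltags2tree() builds the final shallow-parsed tree.
        '''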
        if not tagged_sentence:
            return None
        pos_tags = [tag for word, tag in tagged_sentence]
        chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
        chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
        wpc_tags = [(word, pos_tag,chunk_tag) for ((word, pos_tag), chunk_tag) in zip(tagged_sentence, chunk_tags)]
        
        return conlltags2tree(wpc_tags)

In [108]:
#train the shallow parser
ntc = NGramTagChunker(train_data)

#test parser performance on test data
print ntc.evaluate(test_data)

#parse our sample sentence
tree = ntc.parse(tagged_sentence)
print tree

ChunkParse score:
    IOB Accuracy:  99.6%%
    Precision:     98.4%%
    Recall:       100.0%%
    F-Measure:     99.2%%
(S
  (NP The/DT brown/JJ fox/NN)
  is/VBZ
  (NP quick/JJ)
  and/CC
  (NP he/PRP)
  is/VBZ
  jumping/VBG
  over/IN
  (NP the/DT lazy/JJ dog/NN))


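Now let's train and evaluate our parser on the conll2000 corpus.

The conll2000 corpus is a larger corpus that contains excerpts from The Wall Street Journal.

We will train our parser on the first 7,500 sentences and test its performance on the remaining 3,448 sentences.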

In [109]:
from nltk.corpus import conll2000
wsj_data = conll2000.chunked_sents()
train_wsj_data = wsj_data[:7500]
test_wsj_data = wsj_data[7500:]
print train_wsj_data[10]

(S
  (NP He/PRP)
  (VP reckons/VBZ)
  (NP the/DT current/JJ account/NN deficit/NN)
  (VP will/MD narrow/VB)
  (PP to/TO)
  (NP only/RB #/# 1.8/CD billion/CD)
  (PP in/IN)
  (NP September/NNP)
  ./.)


In [111]:
#train the shallow parser
tc = NGramTagChunker(train_wsj_data)

#test performance on the test data
print tc.evaluate(test_wsj_data)

ChunkParse score:
    IOB Accuracy:  89.4%%
    Precision:     80.8%%
    Recall:        85.9%%
    F-Measure:     83.3%%
