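Tokenization is the process of splitting text into a set of meaningful pieces. These pieces are called tokens; for example, a passage can be split into words or into sentences. Depending on the task at hand,
you can define your own rules for splitting the input text into meaningful tokens.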
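
As a rough illustration of task-specific tokenization, the sketch below uses only Python's standard `re` module. The function name `tokenize_custom` and its `keep_punctuation` flag are hypothetical names chosen for this example; they are not part of NLTK.

```python
import re

def tokenize_custom(text, keep_punctuation=True):
    """Split text into word tokens; optionally keep punctuation as separate tokens."""
    # \w+(?:'\w+)? matches a word with an optional apostrophe part (e.g. "Let's");
    # [^\w\s] matches any single punctuation character.
    word = r"\w+(?:'\w+)?"
    pattern = word + r"|[^\w\s]" if keep_punctuation else word
    return re.findall(pattern, text)

sample = "Let's see how it works!"
print(tokenize_custom(sample))
print(tokenize_custom(sample, keep_punctuation=False))
```

Unlike the NLTK tokenizers used in the next cell, this sketch keeps contractions such as `Let's` as a single token, which may or may not suit a given task.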

In [5]:
text = "Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action."

# Sentence tokenization
from nltk.tokenize import sent_tokenize

sent_tokenize_list = sent_tokenize(text)
print("\nSentence tokenizer:")
print(sent_tokenize_list)

# Basic word tokenization
from nltk.tokenize import word_tokenize

print("\nWord tokenizer:")
print(word_tokenize(text))

# To split punctuation marks into separate tokens, use the WordPunct tokenizer
from nltk.tokenize import WordPunctTokenizer

word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))




Sentence tokenizer:
['Are you curious about tokenization?', "Let's see how it works!", 'We need to analyze a couple of sentences with punctuations to see it in action.']

Word tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'s", 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']

Word punct tokenizer:
['Are', 'you', 'curious', 'about', 'tokenization', '?', 'Let', "'", 's', 'see', 'how', 'it', 'works', '!', 'We', 'need', 'to', 'analyze', 'a', 'couple', 'of', 'sentences', 'with', 'punctuations', 'to', 'see', 'it', 'in', 'action', '.']


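**Stemming text data**  
When processing text documents, you will encounter different forms of the same word.  
For example, play appears as play, plays, playing, player, and so on. These form a family of words that share the same meaning.  
In text analysis it is very useful to extract the base form of such words, because it helps us gather statistics for analyzing the whole text.  
The goal of stemming is to reduce the different forms of a word to a common base form. Stemming uses heuristic rules that chop off the ends of words to obtain that base form.  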
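
To make the "chop off the ends of words" idea concrete, here is a deliberately tiny suffix-stripping sketch. The suffix list and the name `toy_stem` are assumptions made for illustration; this is not the Porter, Lancaster, or Snowball algorithm, each of which applies many more rules.

```python
# Hypothetical suffix list for illustration only; real stemmers use many more rules.
SUFFIXES = ("ing", "es", "ed", "s")  # order matters: the first match wins

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["play", "plays", "playing", "played", "wolves"]:
    print(w, "->", toy_stem(w))
```

Even this toy version shows the typical trade-off: `plays`, `playing`, and `played` all collapse to `play`, but `wolves` becomes `wolv`, which is not a real word.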

In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

# Define some words to stem
words = ['table', 'probably', 'wolves', 'playing', 'is', 
        'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

# Names of the stemmers being compared (used later as column headers)
stemmers = ['PORTER', 'LANCASTER', 'SNOWBALL']
# Initialize the three stemmer objects
stemmer_porter = PorterStemmer()
stemmer_lancaster = LancasterStemmer()
stemmer_snowball = SnowballStemmer('english')
# Row format for printing the results as a neatly aligned table
formatted_row = '{:>16}' * (len(stemmers) + 1)  # '>' right-aligns, '<' left-aligns
print('\n', formatted_row.format('WORD', *stemmers), '\n')
for word in words:
    stemmed_words = [stemmer_porter.stem(word), 
            stemmer_lancaster.stem(word), stemmer_snowball.stem(word)]
    print(formatted_row.format(word, *stemmed_words))

Ranked by strictness: Lancaster > Snowball > Porter. Snowball is generally a good default choice.

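In the previous section, we saw that the base forms produced by stemming are not always meaningful words.  
Lemmatization solves this problem by analyzing words using vocabulary and grammatical information.  
The following cell shows how to use lemmatization to reduce words to their base forms.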
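
The sketch below illustrates the dictionary-lookup idea behind lemmatization in a very simplified form. The mini-lexicon `LEXICON` and the function `toy_lemmatize` are hypothetical and exist only for this example; the WordNet lemmatizer relies on a far larger lexical database plus morphological rules.

```python
# Hypothetical mini-lexicon for illustration; keys are (word, part of speech).
LEXICON = {
    ("wolves", "n"): "wolf",
    ("beaches", "n"): "beach",
    ("is", "v"): "be",
    ("playing", "v"): "play",
    ("dreamt", "v"): "dream",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged if unknown."""
    return LEXICON.get((word.lower(), pos), word)

for w in ["wolves", "is", "playing"]:
    print(w, "| noun:", toy_lemmatize(w, "n"), "| verb:", toy_lemmatize(w, "v"))
```

The point of the sketch is that the result depends on the part of speech: `wolves` only maps to `wolf` when treated as a noun, and `is` only maps to `be` when treated as a verb.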

In [10]:
from nltk.stem import WordNetLemmatizer

words = ['table', 'probably', 'wolves', 'playing', 'is', 
        'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']

# Compare two lemmatization modes: treating each word as a noun vs. as a verb
lemmatizers = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
lemmatizer_wordnet = WordNetLemmatizer()

formatted_row = '{:>24}' * (len(lemmatizers) + 1)
print('\n', formatted_row.format('WORD', *lemmatizers), '\n')
for word in words:
    lemmatized_words = [lemmatizer_wordnet.lemmatize(word, pos='n'),
           lemmatizer_wordnet.lemmatize(word, pos='v')]
    print(formatted_row.format(word, *lemmatized_words))


                     WORD         NOUN LEMMATIZER         VERB LEMMATIZER 

                   table                   table                   table
                probably                probably                probably
                  wolves                    wolf                  wolves
                 playing                 playing                    play
                      is                      is                      be
                     dog                     dog                     dog
                     the                     the                     the
                 beaches                   beach                   beach
                grounded                grounded                  ground
                  dreamt                  dreamt                   dream
                envision                envision                envision


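**Splitting text into chunks**  
Chunking means dividing the input text into blocks based on some arbitrary condition.  
Unlike tokenization, chunking has no constraints, and the resulting chunks do not need to be meaningful.  
When processing very large text documents, you need to split the text into chunks to make the next stage of analysis manageable.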

In [11]:
import numpy as np
from nltk.corpus import brown

# Split the text into chunks of num_words words each
def splitter(data, num_words):
    words = data.split(' ')
    output = []

    cur_count = 0
    cur_words = []
    for word in words:
        cur_words.append(word)
        cur_count += 1
        if cur_count == num_words:
            output.append(' '.join(cur_words))
            cur_words = []
            cur_count = 0

    output.append(' '.join(cur_words))  # remaining words (an empty string if the word count divides evenly)

    return output 


# Load data from the Brown corpus, using the first 10,000 words
data = ' '.join(brown.words()[:10000])

# Number of words in each chunk
num_words = 1700

text_chunks = splitter(data, num_words)

print("Number of text chunks =", len(text_chunks))

Number of text chunks = 6
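
For comparison, the same idea can be sketched more compactly with list slicing. The name `split_into_chunks` is hypothetical, and this is an alternative sketch rather than a drop-in replacement: it splits on any whitespace and, unlike `splitter`, does not emit a trailing empty chunk when the word count is an exact multiple of `num_words`, so swapping it into other cells may change the number of chunks.

```python
def split_into_chunks(text, num_words):
    """Split text into chunks of at most num_words whitespace-separated words."""
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    words = text.split()
    # Step through the word list in strides of num_words
    return [' '.join(words[i:i + num_words])
            for i in range(0, len(words), num_words)]

sample = ' '.join('w{}'.format(i) for i in range(10))
print(split_into_chunks(sample, 4))
```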


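**Bag-of-words model**  
To work with text documents containing millions of words, we need to convert them into some numerical representation so that machine learning algorithms can learn from the data.
These algorithms require numerical data in order to analyze it and output useful information.  
This is where the bag-of-words model comes in. It learns a vocabulary from all the words in all the documents, and then models each document by building a histogram of the words it contains.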
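
The "vocabulary plus per-document histogram" idea can be sketched with the standard library alone. The three toy documents and the variable names below are made up for illustration; the notebook itself uses scikit-learn's `CountVectorizer` for the real data.

```python
from collections import Counter

# Made-up toy documents for illustration
toy_docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]

# One word-count histogram per document
histograms = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary is learned from every word in every document
toy_vocab = sorted(set().union(*histograms))

# Document-term matrix: one row per document, one column per vocabulary word
toy_matrix = [[hist[word] for word in toy_vocab] for hist in histograms]

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each row is the numerical representation of one document: the count of every vocabulary word in it, with zeros for words that do not occur.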

In [15]:
import numpy as np
from nltk.corpus import brown

data = ' '.join(brown.words()[:10000])
num_words = 2000

chunks = []
counter = 0

text_chunks = splitter(data, num_words)

# Build a dictionary (index and text) for each text chunk
for text in text_chunks:
    chunk = {'index': counter, 'text': text}
    chunks.append(chunk)
    counter += 1

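# Extract the document-term matrix, which records how often each word occurs in each document
from sklearn.feature_extraction.text import CountVectorizer
# Create the vectorizer: keep words found in at least 5 documents and in at most 95% of documents
vectorizer = CountVectorizer(min_df=5, max_df=.95)
# Extract the document-term matrix
doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])
# Extract the vocabulary from the vectorizer object
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vocab = np.array(vectorizer.get_feature_names_out())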
print("\nVocabulary:")
print(vocab)

# Print the document-term matrix
print("\nDocument term matrix:")
chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']
formatted_row = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_row.format('Word', *chunk_names), '\n')
for word, item in zip(vocab, doc_term_matrix.T):
    # 'item' is a compressed sparse matrix row; item.data holds only its non-zero counts
    output = [str(x) for x in item.data]
    print(formatted_row.format(word, *output))



Vocabulary:
['about' 'after' 'against' 'aid' 'all' 'also' 'an' 'and' 'are' 'as' 'at'
 'be' 'been' 'before' 'but' 'by' 'committee' 'congress' 'did' 'each'
 'education' 'first' 'for' 'from' 'general' 'had' 'has' 'have' 'he'
 'health' 'his' 'house' 'in' 'increase' 'is' 'it' 'last' 'made' 'make'
 'may' 'more' 'no' 'not' 'of' 'on' 'one' 'only' 'or' 'other' 'out' 'over'
 'pay' 'program' 'proposed' 'said' 'similar' 'state' 'such' 'take' 'than'
 'that' 'the' 'them' 'there' 'they' 'this' 'time' 'to' 'two' 'under' 'up'
 'was' 'were' 'what' 'which' 'who' 'will' 'with' 'would' 'year' 'years']

Document term matrix:

         Word     Chunk-0     Chunk-1     Chunk-2     Chunk-3     Chunk-4 

       about           1           1           1           1           3
       after           2           3           2           1           3
     against           1           2           2           1           1
         aid           1           1           1           3           5
         all       