在第三章通过建立我们自己的分析器和标签器,你可以体会到机器学习模式是数据和算法的组合.

机器学习的优势是一旦我们训练好了模型,就可以直接在新的以前未见到的数据上应用这个模型.

NLP的一个最有挑战的问题是文本分类,它基于每个文件的内在属性或特性,尽量将文本文件划分到不同的类别.

该技术可用在不同的领域,包括垃圾邮件识别和新闻分类.

文本分类具有很多划分方法,本章主要介绍两种基于文档内容类型的分类:

- 基于内容的分类
    根据文本内容主题或题目的属性或权重来进行文档分类的.
- 基于请求的分类
    受到用户需求的影响,其目标是特定的用户群和读者.这类分类受到特殊策略和思想的控制
    
## 文本规范化处理

定义一个规范化模块以处理文本文档规范化,并在后面建立分类器时使用这个处理模块.

我们将在模块中实现和使用下面的规范化技术:

- 扩展缩写词
- 通过词形还原实现文本处理规范化
- 去除特殊字符与符号
- 去除停用词

In [1]:
from contractions import CONTRACTION_MAP #缩写映射
import re
import nltk
import string
from nltk.stem import WordNetLemmatizer

stopword_list = nltk.corpus.stopwords.words('english') #停用词
wnl = WordNetLemmatizer() #词形还原

def tokenize_text(text):
    '''
    实现词语切分,并去除分割后符号中的多余空格
    '''
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    return tokens

In [2]:
def expand_contractions(text, contraction_mapping):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), flags=re.IGNORECASE|re.DOTALL)
    
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
                                if contraction_mapping.get(match) \
                                else contraction_mapping.get(match.lower())
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
    
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'","",expanded_text)
    return expanded_text

In [3]:
from pattern.en import tag
from nltk.corpus import wordnet as wn

#Annotate text tokens with POS tags
def pos_tag_text(text):
    '''
    词形还原函数
    '''
    #convert Penn treebank tag to wordnet tag
    def penn_to_wn_tags(pos_tag):
        if pos_tag.startwith('J'):
            return wn.ADJ
        elif pos_tag.startwith('V'):
            return wn.VERB
        elif pos_tag.startwith('N'):
            return wn.NOUN
        elif pos_tag.startwith('R'):
            return wn.ADV
        else:
            return None
        
    tagged_text = tag(text)
    tagged_lower_text = [(word.lower(), penn_to_wn_tags(pos_tag))
                         for word, pos_tag in tagged_text]
    return tagged_lower_text

#lemmatize text based on POS tags
def lemmatize_text(text):
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag else word for word,pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text

In [4]:
def remove_special_characters(text):
    '''
    实现特殊符号和字符的去除
    '''
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [5]:
#文本处理流水线
def normalize_corpus(corpus, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        text = expand_contractions(text, CONTRACTION_MAP)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        normalized_corpus.append(text)
        if tokenize:
            text = tokenize_text(text)
            normalized_corpus.append(text)
            
    return normalized_corpus

## 特征提取
下面将介绍和实现如下特征提取技术:

- 词袋模型
- TF-IDF模型
- 高级词向量模型

特征提取的实现可以分为两个模块.

### 词袋模型
词袋模型是从文本文档中提取特征最简单但又最有效的技术.

这个模型的本质是将文本文档转化成向量,这个向量表示在文档空间中全部不同的单词在该文档中出现的频率.

该模型可以是n元分词词袋模型,它计算不同的n元分词在每个文档中的出现频率.

In [6]:
CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese'
]

new_doc = ['loving this blue sky today']

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus, ngram_range=(1,1)):
    '''
    实现基于词袋的特征提取模块
    '''
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [8]:
#build bow vectorizer and get features
bow_vectorizer, bow_features = bow_extractor(CORPUS)

features = bow_features.todense()
print features

[[0 0 1 0 1 0 1 0 1]
 [1 1 1 0 2 0 2 0 0]
 [0 1 1 0 1 0 1 1 1]
 [0 0 1 1 0 1 0 0 0]]


In [9]:
#extract features from new document using built vectorizer
new_doc_features = bow_vectorizer.transform(new_doc)
new_doc_features = new_doc_features.todense()

In [10]:
print new_doc_features

[[0 0 1 0 0 0 1 0 0]]


In [11]:
#print the features name
feature_names = bow_vectorizer.get_feature_names()
print feature_names

[u'and', u'beautiful', u'blue', u'cheese', u'is', u'love', u'sky', u'so', u'the']


In [12]:
import pandas as pd
def display_features(features, feature_names):
    df = pd.DataFrame(data=features, columns=feature_names)
    print df
    
display_features(features, feature_names)

   and  beautiful  blue  cheese  is  love  sky  so  the
0    0          0     1       0   1     0    1   0    1
1    1          1     1       0   2     0    2   0    0
2    0          1     1       0   1     0    1   1    1
3    0          0     1       1   0     1    0   0    0


In [13]:
#build bow vectorizer and get features
bow_vectorizer, bow_features = bow_extractor(CORPUS,ngram_range=(1,2))

features = bow_features.todense()
feature_names = bow_vectorizer.get_feature_names()

print features
print feature_names

[[0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 0 1]
 [1 1 1 0 1 1 0 0 2 1 1 0 0 0 2 2 0 0 0 0 0]
 [0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0]
 [0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0]]
[u'and', u'and sky', u'beautiful', u'beautiful sky', u'blue', u'blue and', u'blue cheese', u'cheese', u'is', u'is beautiful', u'is blue', u'is so', u'love', u'love blue', u'sky', u'sky is', u'so', u'so blue', u'the', u'the beautiful', u'the sky']


### TF-IDF模型
词袋模型存在一些问题,语料库全部文档中出现次数较多的单词将会拥有较高的频率,而这些并不是有意义的单词,

而那些低频率的单词恰恰是能携带更多文档信息的单词.而TF-IDF可以解决这个问题.TF-IDF代表的是词频-逆文档频率,是两个度量的组合;

**该技术最初是作为显示搜索引擎用户查询结果排序函数的一个度量,现在称为信息检索和文本特征提取的一部分**

**词频**由tf表示,由词袋模型计算得出.任何文档的词频是该词在特定文档出现的原始频率值.

**逆文档频率**由idf表示,是每个单词的文档频率的逆.该值由语料库中全部文档数量除以每个单词的文档频率,然后对结果应用对数运算变换其比例.(在这里需要对每个词频加1,意味着词汇表中每个单词至少在训练语料库中出现过一次,从而避免了除0的错误,称作**平滑逆文档频率**).

$idf(t) = 1+\log \frac{C}{1+df(t)}$

最终使用的tfidf是归一化后的值.

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer

def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2', smooth_idf=True, use_idf=True)
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

In [14]:
import numpy as np

feature_names = bow_vectorizer.get_feature_names()

#build tfidf transformer and show train corpus tfidf features
tfidf_trans, tfidf_features = tfidf_transformer(bow_features)
features = np.round(tfidf_features.todense(),2)
display_features(features, feature_names)

    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00


In [15]:
#show tfidf features for new_doc using built tfidf transformer

nd_tfidf = tfidf_trans.transform(new_doc_features)
nd_features = np.round(nd_tfidf.todense(),2)
display_features(nd_features, feature_names)

   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0


In [16]:
import scipy.sparse as sp
from numpy.linalg import norm
feature_names = bow_vectorizer.get_feature_names()

#compute term frequency
tf = bow_features.todense()
tf = np.array(tf, dtype='float64')

#show term frequency
display_features(tf, feature_names)

   and  beautiful  blue  cheese   is  love  sky   so  the
0  0.0        0.0   1.0     0.0  1.0   0.0  1.0  0.0  1.0
1  1.0        1.0   1.0     0.0  2.0   0.0  2.0  0.0  0.0
2  0.0        1.0   1.0     0.0  1.0   0.0  1.0  1.0  1.0
3  0.0        0.0   1.0     1.0  0.0   1.0  0.0  0.0  0.0


In [17]:
#从词袋模型特征矩阵中获取DF(每个单词的文档频率)

#build the document frequency matrix
df = np.diff(sp.csc_matrix(bow_features, copy=True).indptr)
df = 1 + df #to smoothen idf later

#show document frequencies
display_features([df], feature_names)

   and  beautiful  blue  cheese  is  love  sky  so  the
0    2          3     5       2   4     2    4   2    3


In [18]:
#compute inverse document frequencies
total_docs = 1+len(CORPUS)
idf = 1.0+np.log(float(total_docs)/df)

#show inverse_document frequencies
display_features([np.round(idf,2)], feature_names)

#compute idf diagonal matrix
total_features = bow_features.shape[1]

idf_diag = sp.spdiags(idf, diags=0, m=total_features, n=total_features) #变成对角矩阵
idf = idf_diag.todense()

#print the idf diagonal matrix
print np.round(idf, 2)

    and  beautiful  blue  cheese    is  love   sky    so   the
0  1.92       1.51   1.0    1.92  1.22  1.92  1.22  1.92  1.51
[[ 1.92  0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    1.51  0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    1.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    1.92  0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    1.22  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    1.92  0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    1.22  0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    1.92  0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    1.51]]


In [19]:
#compute tfidf feature matrix
tfidf = tf*idf

#show tfidf feature matrix
display_features(np.round(tfidf,2),feature_names)

    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00   1.0    0.00  1.22  0.00  1.22  0.00  1.51
1  1.92       1.51   1.0    0.00  2.45  0.00  2.45  0.00  0.00
2  0.00       1.51   1.0    0.00  1.22  0.00  1.22  1.92  1.51
3  0.00       0.00   1.0    1.92  0.00  1.92  0.00  0.00  0.00


In [20]:
#tfidf归一化
#compute L2 norm
norms = norm(tfidf, axis=1)

#print norms for each document
print np.round(norms,2)

#compute normalized tfidf
norm_tfidf = tfidf/norms[:, None]

#show final tfidf feature matrix
display_features(np.round(norm_tfidf, 2), feature_names)

[ 2.5   4.35  3.5   2.89]
    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00


In [21]:
#compute new doc term freqs from bow freqs
nd_tf = new_doc_features
nd_tf = np.array(nd_tf, dtype='float64')

#compute tfidf using idf matrix from train corpus
nd_tfidf = nd_tf*idf
nd_norms = norm(nd_tfidf, axis=1)
norm_nd_tfidf = nd_tfidf / nd_norms[:, None]

#show new_doc tfidf feature vector
display_features(np.round(norm_nd_tfidf, 2), feature_names)

   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0


In [22]:
#直接从原始文档中计算文档基于tfidf的特征向量

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_extractor(corpus, ngram_range=(1,1)):
    vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

In [23]:
#build tfidf vectorizer and get training corpus feature vectors
tfidf_vectorizer, tdidf_features = tfidf_extractor(CORPUS)
display_features(np.round(tdidf_features.todense(),2), feature_names)

    and  beautiful  blue  cheese    is  love   sky    so   the
0  0.00       0.00  0.40    0.00  0.49  0.00  0.49  0.00  0.60
1  0.44       0.35  0.23    0.00  0.56  0.00  0.56  0.00  0.00
2  0.00       0.43  0.29    0.00  0.35  0.00  0.35  0.55  0.43
3  0.00       0.00  0.35    0.66  0.00  0.66  0.00  0.00  0.00


In [24]:
#get tfidf feature vector for the new document
nd_tfidf = tfidf_vectorizer.transform(new_doc)
display_features(np.round(nd_tfidf.todense(),2), feature_names)

   and  beautiful  blue  cheese   is  love   sky   so  the
0  0.0        0.0  0.63     0.0  0.0   0.0  0.77  0.0  0.0


### 高级词向量模型
word2vec模型由谷歌公司2013年发布,它是一个基于神经网络的实现,使用CBOW和skip-gram两种结构学习单词的分布式向量表示.

gensim库里的word2vec的python实现基本思想是:提供一些文档语料作为输入,会得到词向量特征所谓输出.

在模型内部,建立基于输入文档的**词汇表**,通过前面提到的各种技术学习**单词的向量表示**,一旦学习完成,

就建立一个可用于从文档中提取单词向量的模型.使用如平均值和tfidf加权等方法,**使用词向量计算文档的平均向量**.

在训练语料建立模型时,我们将主要关注下面的参数:
- size:该参数用于设定词向量的维度,可以是几十到几千.可以尝试不同的维度,以获得最好的效果.
- window: 该参数用于设定语境或窗口尺寸,指定了训练时对算法来说可算做上下文的单词窗口长度.
- min_count: 该参数指定单词表中单词在语料中出现的最小次数.这个参数有助于去除一些文档中出现次数最少的不重要的单词
- sample: 该参数用于对单词出现的频率进行下采样,其理想值在0.01到0.0001之间

建立模型之后,将**基于一些加权策略来定义和实现两种词向量与文档结合的技术**.
- 平均词向量
- TF-IDF加权词向量

In [25]:
import gensim
import nltk

#tokenize corpora
TOKENIZED_CORPUS = [nltk.word_tokenize(sentence) for sentence in CORPUS]
tokenized_new_doc = [nltk.word_tokenize(sentence) for sentence in new_doc]

#build the word2vec model on our training corpus
model = gensim.models.Word2Vec(TOKENIZED_CORPUS, size=10, window=10, min_count=2, sample=1e-3)

Using TensorFlow backend.


**开始实现特征提取技**

In [26]:
#平均词向量,上面的模型为单词表中每个单词创建一个单词表示,我们可以输入下面的代码来查看它们.
print model['sky']

print model['blue']

[ 0.01608407 -0.04819566  0.04227461 -0.03011346  0.0254148   0.01728328
  0.0155535   0.00774884 -0.02752112  0.01646519]
[-0.0472235   0.01662185 -0.01221706 -0.04724348 -0.04384995  0.00193343
 -0.03163504 -0.03423524  0.02661656  0.03033725]


In [27]:
#使用平均词向量的方法计算文档向量
import numpy as np
#define function to average word vectors for a text document 

def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype='float64')
    nwords = 0
    
    for word in words:
        if word in vocabulary:
            nwords = nwords+1.
            feature_vector = np.add(feature_vector, model[word])
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector

#generalize above function for a corpus of documents
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features) for tokenized_sentence in corpus]
    return np.array(features)

In [28]:
#get averaged word vectors for our training CORPUS
avg_word_vec_features = averaged_word_vectorizer(corpus=TOKENIZED_CORPUS, model=model, num_features=10)
print np.round(avg_word_vec_features, 3)

[[ 0.006 -0.01   0.015 -0.014  0.004 -0.006 -0.024 -0.007 -0.001 -0.   ]
 [-0.008 -0.01   0.021 -0.019 -0.002 -0.002 -0.011  0.002  0.003 -0.001]
 [-0.003 -0.007  0.008 -0.02  -0.001 -0.004 -0.014 -0.015  0.002 -0.01 ]
 [-0.047  0.017 -0.012 -0.047 -0.044  0.002 -0.032 -0.034  0.027  0.03 ]]


In [29]:
#get averaged word vectors for our test new_doc
nd_avg_word_vec_features = averaged_word_vectorizer(corpus=tokenized_new_doc, model=model, num_features=10)
print np.round(nd_avg_word_vec_features,3)

[[-0.016 -0.016  0.015 -0.039 -0.009  0.01  -0.008 -0.013 -0.     0.023]]


**TF-IDF加权平均词向量**

In [30]:
#define function to compute tfidf weighted averaged word vector for a document
def tfidf_wtd_avg_word_vectors(words, tfidf_vector, tfidf_vocabulary, model, num_features):
    word_tfidfs = [tfidf_vector[0, tfidf_vocabulary.get(word)] if tfidf_vocabulary.get(word) else 0 for word in words]
    word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
    
    feature_vector = np.zeros((num_features,), dtype='float64')
    vocabulary = set(model.wv.index2word)
    wts = 0.
    for word in words:
        if word in vocabulary:
            word_vector = model[word]
            weighted_word_vector = word_tfidf_map[word]*word_vector
            wts = wts + word_tfidf_map[word]
            feature_vector = np.add(feature_vector, weighted_word_vector)
    
    if wts:
        feature_vector = np.divide(feature_vector, wts)
        
    return feature_vector

#generalize above function for a corpus of documents
def tfidf_weighted_averaged_word_vectorizer(corpus, tfidf_vectors, tfidf_vocabulary, model, num_features):
    docs_tfidfs = [(doc, doc_tfidf) for doc, doc_tfidf in zip(corpus, tfidf_vectors)]
    features = [tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary, model, num_features) 
               for tokenized_sentence, tfidf in docs_tfidfs]
    return np.array(features)

In [31]:
#get tfidf weights and vocabulary from earlier results and compute result
corpus_tfidf = tdidf_features
vocab = tfidf_vectorizer.vocabulary_
wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=TOKENIZED_CORPUS, tfidf_vectors=corpus_tfidf,
                                                                    tfidf_vocabulary=vocab, model=model,num_features=10)
print np.round(wt_tfidf_word_vec_features, 3)

[[ 0.011 -0.011  0.014 -0.011  0.007 -0.007 -0.024 -0.008 -0.004 -0.004]
 [-0.    -0.014  0.028 -0.014  0.004 -0.003 -0.012  0.011 -0.001 -0.002]
 [-0.001 -0.008  0.007 -0.019  0.001 -0.004 -0.012 -0.018  0.001 -0.014]
 [-0.047  0.017 -0.012 -0.047 -0.044  0.002 -0.032 -0.034  0.027  0.03 ]]


In [32]:
#compute avgd word vector for test new_doc
nd_wt_tfidf_word_vec_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_new_doc, tfidf_vectors=nd_tfidf,
                                                                        tfidf_vocabulary=vocab,model=model, num_features=10)
print np.round(nd_wt_tfidf_word_vec_features, 3)

[[-0.012 -0.019  0.018 -0.038 -0.006  0.01  -0.006 -0.011 -0.003  0.023]]


### 分类算法
分类算法是有监督的机器学习算法.分类算法整体有三个过程:
- 训练是有监督学习算法分析和尝试推理训练数据的模式.
- 评估包括测试模型的预测性能,检验它在训练数据集上训练和学习效果如何
- 调优也称为超参数调优或优化.

### 多项式朴素贝叶斯
该算法是主流的朴素贝叶斯算法的一个特例,用于超过两类的预测和分类任务.朴素贝叶斯算法是一个将贝叶斯定义应用于实践的有监督学习算法.

**它有一个"朴素"的假设条件:每个特征与其他特征之间相互独立.(特征之间的相互独立)**

在给定特征X的条件下,y发生的概率,基于朴素假设条件,我们可以把这个概率公式表示成后验概率公式$后验概率=\frac{步验X似然值}{证据}$,这里步验表示为P(y),似然值为$\prod_{i=1}^{n}P(x_i|y)$,证据表示为P(X).

从这个方程, 我们通过结合MAP(最大后验概率)决策规则就可以建立贝叶斯分类器,MAP代表最大后验概率.

该方法在训练数据不充足时也能很好的工作,但是当出现多特征时,会导致**维度灾难**.

朴素贝叶斯通过解耦类变量--相关条件的特征分布的方法解决这个问题,从而**使各分布作为单一维度的分布独立估计**.

sklearn中的MultinomialNB类中提供了一个多项式朴素贝叶斯的优秀实现.

### 评估分类模型
分类模型的性能一般基于模型对新数据的预测输出结果.

有多个指标可以判定模型的预测性能,但我们主要关注以下几个指标:
- 准确率(Accuracy)
- 精确率(precision)
- 召回率(recall)
- F1 score

In [33]:
from sklearn import metrics
import numpy as np
import pandas as pd
from collections import Counter

actual_labels = ['spam','ham','spam','spam','spam',
                'ham','ham','spam','ham','spam',
                'spam','ham','ham','ham','spam',
                'ham','ham','spam','spam','ham']
predicted_labels = ['spam','spam','spam','ham','spam',
                   'spam','ham','ham','spam','spam',
                   'ham','ham','spam','ham','ham',
                   'ham','spam','ham','spam','spam']
ac = Counter(actual_labels)
pc = Counter(predicted_labels)

print 'Actual counts:', ac.most_common()
print 'Predicted counts:', pc.most_common()

Actual counts: [('ham', 10), ('spam', 10)]
Predicted counts: [('spam', 11), ('ham', 9)]


In [34]:
#建立一个混淆矩阵
cm = metrics.confusion_matrix(y_true=actual_labels, y_pred=predicted_labels,labels=['spam','ham'])
print pd.DataFrame(data=cm, columns=pd.MultiIndex(levels=[['Predicted:'],['spam','ham']], labels=[[0,0],[0,1]]),
                  index=pd.MultiIndex(levels=[['Actual:'],['spam','ham']], labels=[[0,0],[0,1]]))

             Predicted:    
                   spam ham
Actual: spam          5   5
        ham           6   4


In [35]:
positive_class = 'spam'
true_positive = 5.
false_positive=6.
false_negative=5.
true_negative=4.

In [36]:
#准确率计算
accuracy = np.round(metrics.accuracy_score(y_true=actual_labels,y_pred=predicted_labels),2)
accuracy_manual = np.round((true_positive+true_negative)/(true_positive+true_negative+false_negative+false_positive),2)
print 'Accuracy:', accuracy
print 'Manually computed accuracy:', accuracy_manual

Accuracy: 0.45
Manually computed accuracy: 0.45


In [37]:
#精确率计算
precision = np.round(metrics.precision_score(y_true=actual_labels,y_pred=predicted_labels,pos_label=positive_class),2)
precision_manual = np.round((true_positive)/(true_positive+false_positive),2)
print 'precision:', precision
print 'Manually computed precision:', precision_manual

precision: 0.45
Manually computed precision: 0.45


In [38]:
#召回率计算,召回率的定义是正类中被正确预测的实例数量.也被称为命中率,覆盖率或灵敏度
recall = np.round(metrics.recall_score(y_true=actual_labels,y_pred=predicted_labels,pos_label=positive_class),2)
recall_manual = np.round((true_positive)/(true_positive+false_negative),2)
print 'recall:', recall
print 'Manually computed recall:', recall_manual

recall: 0.5
Manually computed recall: 0.5


In [39]:
#F1 score是另一个准确性指标,通过计算准确率和召回率的调和平均值得到.
f1_score = np.round(metrics.f1_score(y_true=actual_labels,y_pred=predicted_labels, pos_label=positive_class),2)
f1_score_manual = np.round((2*precision*recall)/(precision+recall), 2)
print 'F1 score:', f1_score
print 'Manually computed f1_score:', f1_score_manual

F1 score: 0.48
Manually computed f1_score: 0.47


## 建立一个多分类系统
- 数据集:使用sklearn下载的20个新闻组数据集.
- 数据集组成:包括分散在20个不同类别或主题的18000个新闻组帖子,这就构建了20类分类问题.
- 数据预处理:从文档中去除文件头,文件尾和引用,同时剔除空文档

In [40]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

def get_data():
    data = fetch_20newsgroups(subset='all', 
                             shuffle=True,
                             remove=('headers','footers','quotes')) #获取新闻数据,掐头去尾去引用
    return data

def prepare_datasets(corpus, labels, test_data_proportion=0.3): #划分训练集和测试集
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels, test_size=0.33, random_state=42)
    return train_X, test_X, train_Y, test_Y

def remove_empty_docs(corpus, labels): #剔除空文档
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip(): #判断是否是空文档,即只包含空格的文档
            filtered_corpus.append(doc)
            filtered_labels.append(label)
            
    return filtered_corpus, filtered_labels

In [41]:
#get the data
dataset = get_data()
print dataset.target_names

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [42]:
#get corpus of documents and their corresponding labels
corpus,labels = dataset.data, dataset.target
corpus,labels = remove_empty_docs(corpus, labels)

#see sample document and its label index, name
print 'Sample document:', corpus[10]
print 'Class label:', labels[10]
print 'Actual class label:', dataset.target_names[labels[10]]

Sample document: the blood of the lamb.

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent
Class label: 19
Actual class label: talk.religion.misc


In [43]:
#prepare train and test datasets
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus, labels, test_data_proportion=0.3)

In [44]:
#对数据集进行规范化处理,就是词性还原,删除无效字符,删除停用词
from normalization import normalize_corpus

norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)

In [45]:
#使用前面建立好的特征提取模块从文档中提取特征.分别建立词袋模型,TF-IDF模型,平均词向量模型和TF-IDF加权平均词向量模型,并比较它们的性能

#词袋模型
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

#TF-IDF模型
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

#文档分词
tokenized_train = [nltk.word_tokenize(text) for text in norm_train_corpus]
tokenized_test = [nltk.word_tokenize(text) for text in norm_test_corpus]

#建立word2vec模型
model = gensim.models.Word2Vec(tokenized_train, size=500, window=100, min_count=30, sample=1e-3)

#averaged word vector features
avg_wv_train_features = averaged_word_vectorizer(corpus=tokenized_train, model=model, num_features=500)
avg_wv_test_features = averaged_word_vectorizer(corpus=tokenized_test, model=model, num_features=500)

#tfidf weighted averaged word vector features
vocab = tfidf_vectorizer.vocabulary_
tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_train,
                                                                 tfidf_vectors=tfidf_train_features,
                                                                 tfidf_vocabulary=vocab, model=model,
                                                                 num_features=500)
tfidf_wv_test_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_test,
                                                                tfidf_vectors=tfidf_test_features,
                                                                tfidf_vocabulary=vocab, model=model,
                                                                num_features=500)

In [47]:
#前面代码实现了从文本文档中提取了全部必要的特征之后,基于前面讨论的四个指标,我们定义一个函数用来评估分类模型
from sklearn import metrics
import numpy as np

def get_metrics(true_labels, predicted_labels):
    print 'Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels),2)
    print 'Precision:', np.round(metrics.precision_score(true_labels, predicted_labels,average='weighted'),2)
    print 'Recall:', np.round(metrics.recall_score(true_labels, predicted_labels, average='weighted'), 2)
    print 'F1 Score:', np.round(metrics.f1_score(true_labels,predicted_labels, average='weighted'), 2)

In [48]:
#开始训练分类模型
def train_predict_evaluate_model(classifier, train_features, train_labels, test_features, test_labels):
    #build model
    classifier.fit(train_features, train_labels)
    #predict using model
    predictions = classifier.predict(test_features)
    #evaluate model prediciton performance
    get_metrics(true_labels=test_labels, predicted_labels=predictions)
    return predictions

In [49]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

mnb = MultinomialNB()
svm = SGDClassifier(loss='hinge', n_iter=100)

In [51]:
#Multinomial Naive Bayes with bag of words features
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                  train_features=bow_train_features,
                                                  train_labels=train_labels,
                                                  test_features=bow_test_features,
                                                  test_labels=test_labels)
print '-'*50
#support Vector Machine with bag of words features
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                  train_features=bow_train_features,
                                                  train_labels=train_labels,
                                                  test_features=bow_test_features,
                                                  test_labels=test_labels)
print '-'*50
#Multinomial Naive Bayes with tfidf features
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                    train_features=tfidf_train_features,
                                                    train_labels=train_labels,
                                                    test_features=tfidf_test_features,
                                                    test_labels=test_labels)
print '-'*50
#support Vector Machine with tfidf features
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                    train_features=tfidf_train_features,
                                                    train_labels=train_labels,
                                                    test_features=tfidf_test_features,
                                                    test_labels=test_labels)
print '-'*50
#support vector machine with averaged word vector features
svm_svgwv_predictions = train_predict_evaluate_model(classifier=svm,
                                                    train_features=avg_wv_train_features,
                                                    train_labels=train_labels,
                                                    test_features=avg_wv_test_features,
                                                    test_labels=test_labels)
print '-'*50
#support vector machine with tfidf weighted averaged word vector features
svm_tfidfwv_predictions = train_predict_evaluate_model(classifier=svm,
                                                      train_features=tfidf_wv_train_features,
                                                      train_labels=train_labels,
                                                      test_features=tfidf_wv_test_features,
                                                      test_labels=test_labels)

Accuracy: 0.67
Precision: 0.73
Recall: 0.67
F1 Score: 0.65
--------------------------------------------------




Accuracy: 0.62
Precision: 0.66
Recall: 0.62
F1 Score: 0.63
--------------------------------------------------
Accuracy: 0.72
Precision: 0.78
Recall: 0.72
F1 Score: 0.71
--------------------------------------------------




Accuracy: 0.77
Precision: 0.77
Recall: 0.77
F1 Score: 0.77
--------------------------------------------------




Accuracy: 0.56
Precision: 0.56
Recall: 0.56
F1 Score: 0.54
--------------------------------------------------




Accuracy: 0.54
Precision: 0.55
Recall: 0.54
F1 Score: 0.52


从上面的输出结果来看,使用TF-IDF特征的SVM模型获得了最好的结果.

我们可以建立svm tf-idf模型的混淆矩阵,以便了解模型性能不好的具体分类的情况

In [52]:
cm = metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index=range(0,20), columns=range(0,20))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,155,2,0,1,1,1,2,2,4,0,7,4,3,3,5,33,4,9,8,19
1,1,225,7,7,7,15,7,0,2,1,0,2,5,5,4,2,4,0,3,0
2,1,17,225,18,7,19,9,1,1,0,0,3,5,2,1,2,1,1,1,0
3,1,8,23,224,11,3,11,1,1,1,1,3,6,3,2,0,1,0,0,0
4,0,5,7,16,226,5,6,2,3,1,0,4,8,2,4,1,1,0,1,0
5,0,20,19,2,1,273,0,0,1,0,0,0,4,3,1,0,0,1,0,0
6,0,2,5,14,11,1,269,12,4,2,1,1,9,1,3,1,1,1,1,0
7,3,5,2,0,1,4,4,248,19,1,3,2,9,2,3,0,4,4,2,0
8,2,1,0,3,3,2,5,27,251,4,3,3,1,3,2,3,1,3,5,0
9,2,1,1,0,2,2,4,3,6,277,13,2,2,1,2,3,2,1,1,0


从混淆矩阵上可以看到,很多标签为0的文档被错误地分类到类标签15里面.

同样对于类标签18的很多文档被错误地分类到类标签16里面.

很多类标签标识19的文档被错误分类到类型标识为15里面.

In [53]:
class_names = dataset.target_names
print class_names[0],'->',class_names[15]
print class_names[18],'->', class_names[16]
print class_names[19],'->',class_names[15]

alt.atheism -> soc.religion.christian
talk.politics.misc -> talk.politics.guns
talk.religion.misc -> soc.religion.christian


从前面输出的可以发现,错误分类与实际分类存在相关性,导致它们的特征也存在相关性,所以会出现较高的误分类率.

下面我们对误分类的情况进行分析

In [54]:
import re
num = 0
for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
    if label==0 and predicted_label==15:
        print 'Actual Label:', class_names[label]
        print 'Predicted Label:', class_names[predicted_label]
        print 'Document:-'
        print re.sub('\n',' ',document)
        print
        num += 1
        if num == 4:
            break

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
I would like a list of Bible contadictions from those of you who dispite being free from Christianity are well versed in the Bible. 

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
  They spent quite a bit of time on the wording of the Constitution.  They picked words whose meanings implied the intent.  We have already looked in the dictionary to define the word.  Isn't this sufficient?   But we were discussing it in relation to the death penalty.  And, the Constitution need not define each of the words within.  Anyone who doesn't know what cruel is can look in the dictionary (and we did).

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
 That's very interesting.    I wonder, are women's reactions recorded after a frustrating night with a man?   Is that considered to be important?

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Docum

In [55]:
num=0
for document ,label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
    if label == 18 and predicted_label==16:
        print 'Actual Label:', class_names[label]
        print 'Predicted Label:', class_names[predicted_label]
        print 'Document:-'
        print re.sub('\n',' ',document)
        print
        num += 1
        if num == 4:
            break

Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
After the initial gun battle was over, they had 50 days to come out peacefully. They had their high priced lawyer, and judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn't come out (even after they negotiated coming out after the radio sermon) that doesn't include the Davidians wanting to commit suicide/murder/general mayhem?

Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire.  Today's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies.  At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today's paper quotes the government as saying, no, they didn't have a