# 朴素贝叶斯
计算和比较$p(c_1|x,y)$和$p(c_2|x,y)$，其代表含义为
    给定某个由x、y表示的数据点，该数据来自$c_i$的概率为多少
    
其计算公式为：

$p(c_i|x, y) = \frac{p(x, y|c_i)p(c_i)}{p(x, y)}$

朴素指的是我们对样本做两个假设：
* 样本特征之间相互独立
* 每个特征同等重要

朴素贝叶斯通常两种实现方式：
* 基于贝努利模型实现
* 基于多项式模型实现

本次采用贝努利模型实现

## 使用py进行文本分类

### Step 1:从文本中构建词向量

词集模型（set-of-words model）：将每个词的出现与否作为一个特征，每个词只出现一次

In [1]:
def load_dataset():
        posting_list = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_vec = [0, 1, 0, 1, 0, 1]    #1 is abusive, 0 not
        return posting_list, class_vec

In [2]:
def create_vocablist(dataset):
    vocabset = set([])
    for doc in dataset:
        vocabset = vocabset | set(doc)
    return list(vocabset)

In [3]:
def setOfWords2Vec(vocablist, inputset):
    return_vec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            return_vec[vocablist.index(word)] = 1
        else:
            print('the word : %s is not in my vocablist' %word)
    return return_vec

In [4]:
dataset, label = load_dataset()

In [5]:
vocablist = create_vocablist(dataset)
print(vocablist)

In [6]:
train_data = []
for i in range(len(dataset)):
    train_data.append(setOfWords2Vec(vocablist, dataset[i]))
print(train_data)

## Step 2: 训练算法：从词向量计算概率
计算$p(c_i|w) = \frac{p(w|c_i)p(c_i)}{p(w)}$ 

w为一个向量

$p(c_i)$：类别i中 文档数/总文档数

$p(w|c_i)$：用朴素贝叶斯假设，将w视为独立特征，

就有：
$p(w_0,w_1,w_2...w_N|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)...p(w_N|c_i)$

因为$p(w_0|c_i)p(w_1|c_i)p(w_2|c_i)...p(w_N|c_i)$因子很小，会出现程序下溢，所以我们采用取对数的方式进行计算

同时计算乘积时如果其中一个因子为零，其乘积结果也为零，所以取分子为1，分母为2


In [7]:
import numpy as np
def train_bayes(train_data, train_label):
    num_data = len(train_data)
    num_feat = len(train_data[0])
    p_absive = sum(train_label) / float(num_data)
    p0_num = np.ones(num_feat)
    p0_sumword = 2.0
    p1_num = np.ones(num_feat)
    p1_sumword = 2.0
    for i in range(num_data):
        if train_label[i] == 1:
            p1_num += train_data[i]
            p1_sumword += sum(train_data[i])
        else:
            p0_num += train_data[i]
            p0_sumword += sum(train_data[i])
    p1_vec = np.log(p1_num / p1_sumword)
    p0_vec = np.log(p0_num / p0_sumword)
    return p0_vec, p1_vec,p_absive

In [8]:
train_bayes(train_data, label)

(array([-2.56494936, -2.56494936, -3.25809654, -3.25809654, -3.25809654,
        -3.25809654, -2.56494936, -2.56494936, -2.56494936, -3.25809654,
        -3.25809654, -2.56494936, -2.56494936, -2.56494936, -2.56494936,
        -2.56494936, -3.25809654, -1.87180218, -2.56494936, -2.56494936,
        -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.15948425,
        -3.25809654, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
        -2.56494936, -3.25809654]),
 array([-3.04452244, -3.04452244, -2.35137526, -1.94591015, -2.35137526,
        -2.35137526, -3.04452244, -2.35137526, -3.04452244, -2.35137526,
        -2.35137526, -3.04452244, -3.04452244, -1.94591015, -3.04452244,
        -3.04452244, -2.35137526, -3.04452244, -3.04452244, -3.04452244,
        -3.04452244, -2.35137526, -3.04452244, -2.35137526, -2.35137526,
        -2.35137526, -3.04452244, -1.65822808, -3.04452244, -3.04452244,
        -3.04452244, -2.35137526]),
 0.5)

In [9]:
def classify_bayes(test_vec, p0_vec, p1_vec, p_class1):
    p1 = sum(test_vec * p1_vec) + np.log(p_class1)
    p0 = sum(test_vec * p0_vec) + np.log(1.0 - p_class1)
    if p1 > p0:
        return 1
    else:
        return 0

## Step 3： 测试算法

In [10]:
def test():
    # 准备数据
    dataset,label = load_dataset()
    vocab_list = create_vocablist(dataset)
    # 将单词转化为向量
    data_vec = []
    for doc in dataset:
        data_vec.append(setOfWords2Vec(vocab_list, doc))
    # 训练算法 
    p0_vec,p1_vec,p_absive = train_bayes(data_vec, label)
    # 测试算法
    test1 = ['love', 'my', 'dalmation']
    test1_vec = setOfWords2Vec(vocab_list, test1)
    test2 = ['stupid', 'garbage']
    test2_vec = setOfWords2Vec(vocab_list, test2)
    print('test1 classify as %d' %classify_bayes(test1_vec, p0_vec, p1_vec, p_absive))
    print('test2 classify as %d' %classify_bayes(test2_vec, p0_vec, p1_vec, p_absive))

In [11]:
test()

test1 classify as 0
test2 classify as 1


## 示例：过滤垃圾邮件

### Step 1：准备数据

词袋模型（bags-of-words model）：指一个出现多次的单词可能不会表达某种信息，所以在词袋中每个词可以出现多次

In [12]:
def bagOfWords2Vec(vocablist, inputset):
    return_vec = [0] * len(vocablist)
    for word in inputset:
        if word in vocablist:
            return_vec[vocablist.index(word)] += 1
    return return_vec

In [13]:
def text_parse(text):
    import re
    list = re.split(r'\W+',text)
    return [word .lower() for word in list if len(word) > 2]

In [14]:
def load_data():
    doc_list = []
    label = []
    full_text = []
    for i in range(1, 26):
        word_list = text_parse(open('email/spam/%d.txt' %i, 'r', encoding="ISO-8859-1").read())
        doc_list.append(word_list)
        full_text.extend(word_list)
        label.append(1)
        word_list = text_parse(open('email/ham/%d.txt' %i, 'r', encoding="ISO-8859-1").read())
        doc_list.append(word_list)
        full_text.extend(word_list)
        label.append(0)
    return doc_list, full_text, label

In [15]:
dataset, full_set, label = load_data()

In [34]:
# 随机构建训练集
train_set = range(50)
testset = []
for i in range(10):
    rand_index = int(np.random.uniform(0, len(train_set)))
    testset.append(train_set[rand_index])
    del(list(train_set)[rand_index])
vocab_list = create_vocablist(dataset)
train_vec = []
train_label = []
for index in train_set:
    train_vec.append(bagOfWords2Vec(vocab_list, dataset[index]))
    train_label.append(label[index])

### Step 2: 训练算法

In [35]:
p0_v,p1_v,p_spam = train_bayes(np.array(train_vec), np.array(label))

### Step 3:测试算法

In [39]:
def test_email():
    error = 0
    for index in testset:
        word_vec = setOfWords2Vec(vocab_list,dataset[index])
        if classify_bayes(np.array(word_vec), p0_v, p1_v, p_spam) != label[index]:
            error += 1
            print("classification error", dataset[index])
    print('error rate is %f' %(float(error) / len(testset)))
    return float(error) / len(testset)

In [40]:
for i in range(10):
    error += test_email()
print(error/10.0)

error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
error rate is 0.000000
0.0
