# Exercise 3: Spam classification on spambase with sklearn's MultinomialNB

Tasks:
1. Classify the spambase emails with MultinomialNB
2. Compute accuracy, precision, recall, and F1 score under ten-fold cross-validation

# 1. Load the dataset

In [1]:
import numpy as np
# Each row is one email: columns 0-56 are the features, column 57 is the label
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=",")
spamx = spambase[:, :57]
spamy = spambase[:, 57]
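
The feature/label split above can be tried out without the data file. The following is a minimal sketch on a made-up in-memory CSV (the names `toy_csv`, `data`, `X`, and `y` are illustrative assumptions, not part of the exercise); it only assumes, as in the cell above, that the label sits in the last column.

```python
import io
import numpy as np

# Hypothetical two-row CSV: three feature columns followed by a label column
toy_csv = io.StringIO("0.1,0.0,5,1\n0.0,0.3,2,0\n")
data = np.loadtxt(toy_csv, delimiter=",")

# Everything except the last column is a feature; the last column is the label
n_features = data.shape[1] - 1
X = data[:, :n_features]
y = data[:, n_features]

print(X.shape, y.shape)
```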

# 2. Import the model

Here we use multinomial naive Bayes.

In [2]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict

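# 3. Preprocess the data

Multinomial naive Bayes is a naive Bayes classifier for discrete features. Here we convert every feature to a binary value (1 if the original value is nonzero, 0 otherwise); binary features are also discrete.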

In [3]:
spamx_binary = (spamx != 0).astype('float64')

In [4]:
spamx_binary

array([[ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  0.,  1., ...,  1.,  1.,  1.],
       ..., 
       [ 1.,  0.,  1., ...,  1.,  1.,  1.],
       [ 1.,  0.,  0., ...,  1.,  1.,  1.],
       [ 0.,  0.,  1., ...,  1.,  1.,  1.]])

This gives us the binary version of the dataset.
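
To make the binarization rule concrete, here is a small sketch on a made-up array (`toy` and `toy_binary` are illustrative names, and the values are not taken from spambase). It applies the same `!= 0` comparison as the cell above.

```python
import numpy as np

# Made-up feature values: zeros stay 0, every nonzero value becomes 1
toy = np.array([[0.0, 0.64, 3.0],
                [0.21, 0.0, 0.0]])

# The comparison yields a boolean array; astype turns it into 0.0 / 1.0
toy_binary = (toy != 0).astype('float64')
print(toy_binary)
```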

# 4. Compute the accuracy, precision, recall, and F1 score of MultinomialNB under ten-fold cross-validation


Accuracy|Precision|Recall|F1 score
-|-|-|-
0.906976744186 | 0.865435356201 | 0.904578047435 | 0.884573894283
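
As a reminder of what the four numbers measure, the sketch below computes them from confusion-matrix counts for a binary problem in which label 1 is the positive (spam) class. The helper `binary_metrics` and the toy labels are illustrative assumptions, not part of the exercise, and the helper does not guard against division by zero.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)     # spam predicted as spam
    fp = np.sum(~y_true & y_pred)    # non-spam predicted as spam
    fn = np.sum(y_true & ~y_pred)    # spam predicted as non-spam
    tn = np.sum(~y_true & ~y_pred)   # non-spam predicted as non-spam
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)       # assumes at least one predicted positive
    recall = tp / (tp + fn)          # assumes at least one true positive
    f1 = 2 * precision * recall / (precision + recall)
    return float(accuracy), float(precision), float(recall), float(f1)

# Toy labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
acc, prec, rec, f1 = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                    [1, 1, 0, 0, 0, 1, 0, 1])
print(acc, prec, rec, f1)
```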

In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = MultinomialNB()

# Out-of-fold prediction for every sample under 10-fold cross-validation
# (cross_val_predict was imported in In [2])
prediction = cross_val_predict(model, spamx_binary, spamy, cv=10)

# Each metric is computed once over the pooled out-of-fold predictions
a = accuracy_score(spamy, prediction)
p = precision_score(spamy, prediction)
r = recall_score(spamy, prediction)
f = f1_score(spamy, prediction)

print(a, p, r, f)



0.906976744186 0.865435356201 0.904578047435 0.884573894283
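
The numbers above come from pooling the out-of-fold predictions and scoring them once. An alternative way to report ten-fold results is to score each fold separately and average. The sketch below shows that variant with `cross_validate` on randomly generated binary data (the names `X_demo`, `y_demo`, and `scores` and the synthetic labeling rule are assumptions for illustration); its output is not comparable to the spambase figures.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for the binarized features: 200 samples, 10 binary columns
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype('float64')
# Made-up rule: positive when at least two of the first three features are set
y_demo = (X_demo[:, :3].sum(axis=1) >= 2).astype(int)

metric_names = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(MultinomialNB(), X_demo, y_demo, cv=10,
                        scoring=metric_names)

# One score per fold; report the mean and standard deviation across folds
for name in metric_names:
    fold_scores = scores['test_' + name]
    print(name, fold_scores.mean(), fold_scores.std())
```

Per-fold averages and pooled scores generally differ slightly, so it helps to state which one a reported number refers to.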
