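# Problem 2: Spam classification on spambase with sklearn's BernoulliNB

Tasks:
1. Classify the spambase emails with BernoulliNB
2. Compute accuracy, precision, recall, and F1 under 10-fold cross-validation

# 1. Load the dataset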

In [1]:
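import numpy as np
# Each row is one email: 57 feature columns followed by the label column
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=",")
spamx = spambase[:, :57]  # features
spamy = spambase[:, 57]   # label: 1 = spam, 0 = not spam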

# 2. Import the model

Here we use the naive Bayes classifier for Bernoulli-distributed features.

In [2]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_predict
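
To make concrete what a Bernoulli naive Bayes model computes, the sketch below recomputes the class log-posteriors by hand from a fitted model's `feature_log_prob_` and `class_log_prior_` attributes and compares them with `predict_log_proba`. The data is a small synthetic binary matrix invented for illustration (it is not spambase), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data (illustrative only, not spambase)
rng = np.random.default_rng(0)
X_demo = (rng.random((300, 6)) > 0.5).astype('float64')
y_demo = (X_demo[:, 0] + X_demo[:, 1] + rng.random(300) > 1.5).astype(int)

demo_model = BernoulliNB().fit(X_demo, y_demo)

# log P(x_i = 1 | y) and log P(x_i = 0 | y), each of shape (n_classes, n_features)
log_p = demo_model.feature_log_prob_
log_not_p = np.log1p(-np.exp(log_p))

# Joint log-likelihood: sum_i [x_i log p_iy + (1 - x_i) log(1 - p_iy)] + log P(y)
joint = X_demo @ log_p.T + (1 - X_demo) @ log_not_p.T + demo_model.class_log_prior_

# Normalize over classes to get log-posteriors
manual_log_proba = joint - np.logaddexp.reduce(joint, axis=1, keepdims=True)

matches = np.allclose(manual_log_proba, demo_model.predict_log_proba(X_demo))
print(matches)
```

Note that a feature equal to 0 still contributes to the likelihood through the `log(1 - p)` term, which is what distinguishes the Bernoulli model from the multinomial one.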

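# 3. Preprocess the data

Bernoulli naive Bayes models every feature as a binary (Bernoulli) variable, so we binarize all features here: any nonzero value becomes 1, and zeros stay 0.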

In [3]:
spamx_binary = (spamx != 0).astype('float64')

In [4]:
spamx_binary

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 0., 1., ..., 1., 1., 1.],
       ...,
       [1., 0., 1., ..., 1., 1., 1.],
       [1., 0., 0., ..., 1., 1., 1.],
       [0., 0., 1., ..., 1., 1., 1.]])

This gives us the binarized dataset.
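
As a side note, `BernoulliNB` has a `binarize` parameter (default `0.0`) that thresholds features internally, mapping values greater than the threshold to 1. For non-negative features, which is the assumption made here, this should be equivalent to the manual `!= 0` conversion. The sketch below checks that on synthetic data; the arrays `X_raw` and `y_raw` are made up for illustration and are not spambase.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic non-negative features with many exact zeros (illustrative only)
rng = np.random.default_rng(0)
X_raw = rng.random((200, 5)) * (rng.random((200, 5)) > 0.5)
y_raw = (X_raw[:, 0] > 0).astype(int)

# Manual binarization, as done in the cell above
X_bin = (X_raw != 0).astype('float64')

# Fit once on pre-binarized data, once on raw data with built-in thresholding
pred_manual = BernoulliNB().fit(X_bin, y_raw).predict(X_bin)
pred_builtin = BernoulliNB(binarize=0.0).fit(X_raw, y_raw).predict(X_raw)

same = np.array_equal(pred_manual, pred_builtin)
print(same)
```

Binarizing explicitly, as this notebook does, keeps the preprocessing visible; relying on `binarize` is shorter but would treat negative values as 0.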

# 4. Compute the accuracy, precision, recall, and F1 of BernoulliNB under 10-fold cross-validation


Accuracy|Precision|Recall|F1
-|-|-|-
0.8845903064551185|0.8824582338902148|0.8157749586321015|0.8478073946689596

In [13]:
model = BernoulliNB()
prediction = cross_val_predict(model, X = spamx_binary, y = spamy, cv = 10)
print(prediction)

[1. 1. 1. ... 0. 0. 0.]
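
`cross_val_predict` returns one out-of-fold prediction per sample, so metrics computed from it are pooled over all folds. An alternative reading of "10-fold cross-validated" metrics is the mean of the per-fold scores, which `cross_validate` can report; the two can differ slightly. The sketch below shows that variant on a synthetic dataset generated with `make_classification`, a stand-in chosen for illustration, so its numbers say nothing about spambase.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (illustrative only, not spambase)
X_syn, y_syn = make_classification(n_samples=500, n_features=20, random_state=0)
X_syn_binary = (X_syn > 0).astype('float64')

metric_names = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(BernoulliNB(), X_syn_binary, y_syn,
                        cv=10, scoring=metric_names)

# Each 'test_<metric>' entry holds one score per fold; report the mean
fold_means = {name: scores['test_' + name].mean() for name in metric_names}
for name, value in fold_means.items():
    print(f"{name}: {value:.4f}")
```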


In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [15]:
print(accuracy_score(spamy, prediction))
print(precision_score(spamy, prediction))
print(recall_score(spamy, prediction))
print(f1_score(spamy, prediction))

0.8845903064551185
0.8824582338902148
0.8157749586321015
0.8478073946689596
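
For reference, the four metrics can be derived from the confusion-matrix counts. The sketch below works this out on a tiny hand-made label vector (the arrays `y_true` and `y_pred` are invented for illustration) and checks the hand calculation against the sklearn functions used above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made example: 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # of predicted spam, how much is spam
recall = tp / (tp + fn)                      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```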
