# Problem 1: Classifying spam in spambase with sklearn's GaussianNB, BernoulliNB, and MultinomialNB

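Tasks:
1. Classify the spambase emails with GaussianNB, BernoulliNB, and MultinomialNB.
2. Compute the accuracy, precision, recall, and F1 score of each model under 10-fold cross-validation.
3. Compare how the three algorithms perform on spambase, based on what accuracy, precision, recall, and F1 mean in practice and on how the four values compare.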
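
As a reminder of what the four metrics measure, here is a minimal sketch that computes them from confusion-matrix counts, with spam as the positive class. The helper `binary_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from spambase, and the helper assumes every denominator is nonzero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0, so no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all emails classified correctly
    precision = tp / (tp + fp)                  # share of flagged emails that really are spam
    recall = tp / (tp + fn)                     # share of real spam that gets flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1


# Hypothetical counts for illustration only (not spambase results).
acc, pre, rec, f1 = binary_metrics(tp=80, fp=20, fn=10, tn=90)
print(f"accuracy={acc:.4f} precision={pre:.4f} recall={rec:.4f} f1={f1:.4f}")
```

For a spam filter, low precision means legitimate mail gets flagged as spam, while low recall means spam reaches the inbox.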

# 1. Load the dataset

In [1]:
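import numpy as np
# Each row is one email: 57 numeric features followed by the label (1 = spam, 0 = not spam)
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=",")
spamx = spambase[:, :57]  # feature matrix
spamy = spambase[:, 57]   # class labels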
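
The slicing above relies on the file having exactly 58 columns with a 0/1 label in the last one. A minimal sketch of a defensive version follows; `split_features_labels` is a hypothetical helper, and the array it is called on here is synthetic stand-in data, not the real file.

```python
import numpy as np


def split_features_labels(data, n_features=57):
    """Split a 2-D array into (X, y), assuming the label column follows the features."""
    if data.ndim != 2 or data.shape[1] != n_features + 1:
        raise ValueError(f"expected {n_features + 1} columns, got shape {data.shape}")
    X, y = data[:, :n_features], data[:, n_features]
    if not np.isin(y, (0, 1)).all():
        raise ValueError("labels must be 0 or 1")
    return X, y


# Synthetic stand-in with the same column layout as spambase (assumption for the demo).
rng = np.random.default_rng(0)
labels = np.array([[0], [1], [0], [1], [1], [0]], dtype=float)
demo = np.hstack([rng.random((6, 57)), labels])

X_demo, y_demo = split_features_labels(demo)
print(X_demo.shape, y_demo.shape, "spam fraction:", y_demo.mean())
```

In the notebook the same helper could be applied to `spambase` in place of the two slicing lines.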

# 2. Import the models

In [2]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_predict

# 3. Compute accuracy, precision, recall, and F1 for GaussianNB, BernoulliNB, and MultinomialNB under 10-fold cross-validation

In [3]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# The three naive Bayes variants, in the order they are reported below
models = [
    ("GaussianNB", GaussianNB()),
    ("MultinomialNB", MultinomialNB()),
    ("BernoulliNB", BernoulliNB()),
]

for name, model in models:
    # Out-of-fold predictions from 10-fold cross-validation, pooled over all folds
    pred = cross_val_predict(model, spamx, spamy, cv=10)

    # Score the pooled predictions against the true labels (spam = 1 is the positive class)
    print(name + ":")
    print("Accuracy:", format(accuracy_score(spamy, pred), '.6f'))
    print("Precision:", format(precision_score(spamy, pred), '.6f'))
    print("Recall:", format(recall_score(spamy, pred), '.6f'))
    print("F1 Score:", format(f1_score(spamy, pred), '.6f'))

GaussianNB:
Accuracy: 0.821778
Precision: 0.700444
Recall: 0.956977
F1 Score: 0.808858
MultinomialNB:
Accuracy: 0.786351
Precision: 0.732363
Recall: 0.721456
F1 Score: 0.726869
BernoulliNB:
Accuracy: 0.883938
Precision: 0.881336
Recall: 0.815223
F1 Score: 0.846991


Algorithm|Accuracy|Precision|Recall|F1
-|-|-|-|-
GaussianNB|0.821778|0.700444|0.956977|0.808858
MultinomialNB|0.786351|0.732363|0.721456|0.726869
BernoulliNB|0.883938|0.881336|0.815223|0.846991
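
The figures in the table are computed once over the pooled out-of-fold predictions from `cross_val_predict`. A different view is to score each fold separately and report the mean and spread, which `cross_validate` supports through a list of scorer names. The sketch below runs on synthetic non-negative data standing in for `spamx` and `spamy` (an assumption, so the block runs without the data file); its numbers say nothing about spambase and will generally differ slightly from pooled scores.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for spamx / spamy: sparse, non-negative features whose
# presence rate and scale depend on the class (assumption for the demo).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
present = rng.random((400, 10)) < (0.3 + 0.4 * y_demo[:, None])
X_demo = rng.exponential(scale=1.0 + y_demo[:, None], size=(400, 10)) * present

scoring = ["accuracy", "precision", "recall", "f1"]
models = [
    ("GaussianNB", GaussianNB()),
    ("MultinomialNB", MultinomialNB()),
    ("BernoulliNB", BernoulliNB()),
]

fold_summary = {}
for name, model in models:
    # One score per fold and per metric, stored under the key "test_<metric>"
    scores = cross_validate(model, X_demo, y_demo, cv=10, scoring=scoring)
    fold_summary[name] = {
        metric: (scores[f"test_{metric}"].mean(), scores[f"test_{metric}"].std())
        for metric in scoring
    }
    print(name + ":")
    for metric, (mean, std) in fold_summary[name].items():
        print(f"  {metric}: {mean:.4f} +/- {std:.4f}")
```

The per-fold standard deviation gives a rough sense of whether the gaps between models are larger than the fold-to-fold noise.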

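Analysis: BernoulliNB has the highest accuracy (0.884), precision (0.881), and F1 (0.847). The spambase features are not binary to begin with: they are word and character frequencies plus capital-letter run lengths. BernoulliNB binarizes its inputs (sklearn's default threshold is 0), so it effectively models only whether a word or character appears in an email, and that presence/absence signal seems to suit this dataset well. GaussianNB has by far the highest recall (0.957) but the lowest precision (0.700): it catches almost all spam, yet roughly 30% of the emails it flags are legitimate, which is a costly kind of error for a spam filter. A likely cause is that the features are non-negative, contain many zeros, and are heavily skewed, so a Gaussian fits them poorly. MultinomialNB has the lowest accuracy (0.786), recall (0.721), and F1 (0.727). It assumes count-like features, whereas these columns are frequencies and run lengths on very different scales, so the large-valued run-length features probably dominate its likelihood. Overall, BernoulliNB looks like the best fit of the three for this task.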
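
To make the binary-feature assumption of BernoulliNB concrete, the sketch below shows that sklearn's `BernoulliNB` thresholds continuous inputs at `binarize` (default `0.0`) before fitting, so it gives the same predictions as binarizing by hand and passing `binarize=None`. The data is synthetic and only mimics the sparse, non-negative shape of frequency features (an assumption, not spambase itself).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic sparse, non-negative features (assumption for the demo).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
present = rng.random((200, 8)) < (0.2 + 0.5 * y_demo[:, None])
X_demo = rng.exponential(size=(200, 8)) * present

# Default behaviour: values above binarize=0.0 become 1, everything else 0.
implicit = BernoulliNB().fit(X_demo, y_demo)

# Same thing done by hand, with the internal thresholding switched off.
X_binary = (X_demo > 0.0).astype(float)
explicit = BernoulliNB(binarize=None).fit(X_binary, y_demo)

same = np.array_equal(implicit.predict(X_demo), explicit.predict(X_binary))
print("identical predictions:", same)
```

In other words, the model discards how often a word occurs and keeps only whether it occurs at all; other values of `binarize` would move that cut-off.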