# 0.1 Naive Bayes and KNN: Simple Classification Algorithms
## 0.1.1 Naive Bayes
**Bayes' theorem**:
\begin{equation}
P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}
\end{equation}
**The "naive" assumption**: all features are assumed to be **conditionally independent** given the class, that is:
\begin{equation}
P(X|y) = P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y)
\end{equation}
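
To make the two formulas concrete, here is a minimal sketch that applies them by hand to a tiny binary dataset. The toy data, the function name `naive_bayes_posterior`, and the Laplace smoothing term `alpha` are assumptions made for illustration; they are not part of the original notes.

```python
from collections import Counter

# Toy data (made up for illustration): each row is (x1, x2), labels are 0 or 1.
X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
y = [1, 1, 1, 0, 0, 0]

def naive_bayes_posterior(X, y, x_new, alpha=1.0):
    """Return P(y | x_new) for every class under the naive independence assumption."""
    n = len(y)
    class_counts = Counter(y)
    scores = {}
    for c in sorted(class_counts):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        score = class_counts[c] / n  # prior P(y)
        for j, value in enumerate(x_new):
            matches = sum(1 for r in rows if r[j] == value)
            # Laplace-smoothed estimate of P(x_j | y); the 2 is the number of
            # values a binary feature can take.
            score *= (matches + alpha) / (len(rows) + 2 * alpha)
        scores[c] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {c: s / evidence for c, s in scores.items()}

posterior = naive_bayes_posterior(X, y, x_new=(1, 1))
print(posterior)  # class 1 gets probability 6/7, class 0 gets 1/7
```

The product over features inside the loop is the "naive" factorization, and dividing by `evidence` is the denominator of Bayes' theorem.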

## 0.1.2 K-Nearest Neighbors (KNN)
**Distance metrics**\
Euclidean distance
\begin{equation}
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\quad \text{or} \quad
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\end{equation}
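
As a quick check of the Euclidean formula, the sketch below computes the distance between the first two iris samples with NumPy. The helper name `euclidean_distance` is an assumption for illustration.

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

# First two rows of the iris data: they differ by 0.2 and 0.5 in the sepal features.
d = euclidean_distance([5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2])
print(round(d, 4))  # sqrt(0.04 + 0.25) = sqrt(0.29)

# The same value is available from NumPy's vector norm.
d_norm = float(np.linalg.norm(np.array([5.1, 3.5, 1.4, 0.2]) - np.array([4.9, 3.0, 1.4, 0.2])))
```
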
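Hamming distance (the number of positions at which two equal-length sequences differ; $\delta(a_i, b_i)$ is 1 if $a_i = b_i$ and 0 otherwise)
\begin{equation}
d_H(a, b) = \sum_{i=1}^{n} \bigl(1 - \delta(a_i, b_i)\bigr)
\quad \text{or} \quad
d_H(a, b) = \sum_{i=1}^{n}[a_i \ne b_i]
\end{equation}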
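
Counting mismatched positions translates directly into code. The following is a minimal sketch; the function name `hamming_distance` and the example inputs are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ai != bi for ai, bi in zip(a, b))

print(hamming_distance("karolin", "kathrin"))        # 3
print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # 2
```
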
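**Choosing K** If K is too small, predictions are sensitive to noise; if K is too large, distant points join the vote, so distance stops mattering and predictions drift toward the majority class.\
**Decision rule** Basic KNN (majority vote among the K nearest neighbors); distance-weighted KNN (closer neighbors get a larger vote).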
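
The two decision rules can disagree on the same query. The sketch below is a small from-scratch KNN written for illustration; `knn_predict`, the inverse-distance weighting `1 / d`, and the toy data are assumptions, not a reference implementation.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict the label of x from its k nearest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Basic KNN: every neighbor counts once.
        # Distance-weighted KNN: a neighbor's vote is 1 / distance.
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy 1-D data (made up): one class-0 point close to the query, two class-1 points farther away.
X_toy = [[1.0], [2.0], [2.1]]
y_toy = [0, 1, 1]
query = [1.1]

print(knn_predict(X_toy, y_toy, query, k=3))                 # 1: two votes against one
print(knn_predict(X_toy, y_toy, query, k=3, weighted=True))  # 0: the close neighbor dominates
print(knn_predict(X_toy, y_toy, query, k=1))                 # 0: only the nearest point votes
```

With `k=3` the majority vote follows the two distant points, while distance weighting and `k=1` both follow the single nearby point, which is the trade-off described above.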

## 0.1.3 Iris Dataset Example

In [2]:
# Load the dataset
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(iris.data.shape)
df.head()  # first five rows

(150, 4)


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [3]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

iris_x = iris.data
iris_y = iris.target
print("Dataset shape:", iris_x.shape)
print("Labels:", iris.target_names)
print("Features:", iris.feature_names)

# No random_state is set, so the split (and every score below) changes from run to run
x_train, x_test, y_train, y_test = train_test_split(iris_x, iris_y, test_size=0.3)
print("Training set size:", len(x_train))
print("Test set size:", len(x_test))

Dataset shape: (150, 4)
Labels: ['setosa' 'versicolor' 'virginica']
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Training set size: 105
Test set size: 45
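
Because the split above is unseeded, rerunning the notebook gives different train and test sets. A reproducible variant might look like the sketch below; the seed value `42` and the use of `stratify` are assumptions added for illustration and were not used to produce the outputs in this section.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.3,
    random_state=42,       # arbitrary seed, chosen only for illustration
    stratify=iris.target,  # keep the three classes balanced in both subsets
)

print("train class counts:", Counter(y_train.tolist()))
print("test class counts:", Counter(y_test.tolist()))
```

With stratification each class contributes 35 training and 15 test samples, so no class is under-represented by chance.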


In [12]:
# GaussianNB: models each feature as a per-class Gaussian

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(x_train, y_train)
GNB_pred = clf.predict(x_test)

print("True labels:", y_test)
print("Predicted:", GNB_pred)
print("Training score: {:.4f}".format(clf.score(x_train, y_train)))
print("Test score: {:.4f}".format(clf.score(x_test, y_test)))

True labels: [2 2 0 1 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Predicted: [2 2 0 1 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Training score: 0.9429
Test score: 1.0000
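
A test score above the training score usually reflects a lucky split rather than a better model, since only 45 samples are being scored. One way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 folds, a choice made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Each of the 5 folds serves once as the held-out set; the rest is used for training.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy: {:.4f}".format(scores.mean()))
```
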


In [13]:
# MultinomialNB: intended for count-like features (e.g. word counts);
# the iris measurements are continuous but non-negative, so it still fits

from sklearn.naive_bayes import MultinomialNB
clf2 = MultinomialNB()
clf2.fit(x_train, y_train)
MNB_pred = clf2.predict(x_test)

print("True labels:", y_test)
print("Predicted:", MNB_pred)
print("Training score: {:.4f}".format(clf2.score(x_train, y_train)))
print("Test score: {:.4f}".format(clf2.score(x_test, y_test)))

True labels: [2 2 0 1 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Predicted: [2 2 0 2 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 1 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Training score: 0.9714
Test score: 0.9556


In [14]:
# BernoulliNB: expects binary features. With the default binarize=0.0 every
# iris measurement (all positive) becomes 1, so the features carry no
# information and the model always predicts the most frequent training class

from sklearn.naive_bayes import BernoulliNB
clf3 = BernoulliNB()
clf3.fit(x_train, y_train)
BNB_pred = clf3.predict(x_test)

print("True labels:", y_test)
print("Predicted:", BNB_pred)
print("Training score: {:.4f}".format(clf3.score(x_train, y_train)))
print("Test score: {:.4f}".format(clf3.score(x_test, y_test)))

True labels: [2 2 0 1 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Predicted: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0]
Training score: 0.3429
Test score: 0.3111
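
The constant prediction comes from binarization: `BernoulliNB` thresholds its inputs at `binarize=0.0` by default, and every iris measurement is positive. One possible workaround, sketched below, is to binarize each feature at its training-set median before fitting. The median threshold, the seed, and the stratified split are assumptions chosen for illustration, so the printed accuracy will not match the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Default behaviour: thresholding at 0.0 maps every measurement to True (1).
default_values = np.unique(x_train > 0.0)
print("Distinct values after default binarization:", default_values)

# Assumed alternative: threshold each feature at its training-set median.
thresholds = np.median(x_train, axis=0)
x_train_bin = (x_train > thresholds).astype(int)
x_test_bin = (x_test > thresholds).astype(int)

clf = BernoulliNB(binarize=None)  # inputs are already 0/1, so skip internal thresholding
clf.fit(x_train_bin, y_train)
accuracy = clf.score(x_test_bin, y_test)
print("Test accuracy with median thresholds: {:.4f}".format(accuracy))
```

Four binary features still discard most of the information in the measurements, so this variant is expected to trail `GaussianNB` on this dataset.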


In [17]:
# KNeighborsClassifier: defaults are n_neighbors=5 and weights="uniform" (majority vote)

from sklearn.neighbors import KNeighborsClassifier
clf4 = KNeighborsClassifier()
clf4.fit(x_train, y_train)
KNN_pred = clf4.predict(x_test)

print("True labels:", y_test)
print("Predicted:", KNN_pred)
print("Training score: {:.4f}".format(clf4.score(x_train, y_train)))
print("Test score: {:.4f}".format(clf4.score(x_test, y_test)))

True labels: [2 2 0 1 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 2 2 0 2 1 2 1]
Predicted: [2 2 0 2 1 0 2 0 0 1 1 2 1 2 0 0 0 2 1 0 1 2 0 1 2 1 0 0 1 2 2 1 2 2 0 1 1
 0 1 2 0 2 1 2 1]
Training score: 0.9810
Test score: 0.9556
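
The choice of K and of the decision rule can be compared empirically rather than fixed in advance. The sketch below scores a few candidate values with 5-fold cross-validation for both uniform and distance-weighted voting. The candidate list and the fold count are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

results = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    for weights in ("uniform", "distance"):  # basic KNN vs. distance-weighted KNN
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        results[(k, weights)] = cross_val_score(clf, iris.data, iris.target, cv=5).mean()

for (k, weights), acc in sorted(results.items()):
    print("k={:2d} weights={:8s} mean accuracy={:.4f}".format(k, weights, acc))

best = max(results, key=results.get)
print("Best setting:", best)
```

On a dataset this small the differences between settings are often within noise, so the "best" setting should be read as a rough guide.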
