# Naive Bayes


## Basics

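Let the input space $X\subseteq R^n$ be a set of n-dimensional vectors, and let the output space be the set of class labels $Y=\{c_1,c_2,\cdots,c_K\}$. \
The input is a feature vector $x\in X$, and the output is a class label $y\in Y$.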

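First, recall the prior and posterior probabilities:
* Prior probability: $P(Y=c_k)$, the probability of class $c_k$ before the input $x$ is observed.

* Posterior probability: $P(Y=c_k|X=x)$, the probability of class $c_k$ after the input $x$ is observed.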


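The naive Bayes model rests on the conditional independence assumption: $$p(x|y)=\prod\limits_{i=1}^np(x_i|y)$$

That is: $$x_i\perp x_j|y,\forall\  i\ne j$$

Applying Bayes' theorem to a single observation then gives: $$p(y|x)=\frac{p(x|y)p(y)}{p(x)}=\frac{\prod\limits_{i=1}^np(x_i|y)p(y)}{p(x)}$$

The input $x$ is assigned to the class $y$ with the largest posterior probability. Since $p(x)$ is the same for every class, it can be dropped from the maximization: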

$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X_{j}=x^{(j)} | Y=c_{k}\right)$$
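The decision rule above can be sketched on a toy problem with two classes and two binary features. The priors and conditional probabilities below are made-up numbers chosen only for illustration; they do not come from the iris data, and the function names are hypothetical.

```python
# Hypothetical toy model: 2 classes, 2 binary features (all numbers are invented).
prior = {"c1": 0.6, "c2": 0.4}
# likelihood[c][j][v] = P(X_j = v | Y = c)
likelihood = {
    "c1": [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}],
    "c2": [{0: 0.1, 1: 0.9}, {0: 0.5, 1: 0.5}],
}

def unnormalized_posterior(x):
    """Return P(Y=c) * prod_j P(X_j = x_j | Y=c) for every class c."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, v in enumerate(x):
            score *= likelihood[c][j][v]
        scores[c] = score
    return scores

def classify(x):
    """Pick the class with the largest unnormalized posterior."""
    scores = unnormalized_posterior(x)
    return max(scores, key=scores.get)

x = (1, 1)
scores = unnormalized_posterior(x)
evidence = sum(scores.values())  # p(x), the normalizing constant
posterior = {c: s / evidence for c, s in scores.items()}
print(classify(x), posterior)
```

Dividing by `evidence` rescales every class score by the same factor, so it changes the posterior values but not which class attains the maximum.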

Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.
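
The equivalence follows from a short derivation. The symbols $L$ (loss), $f$ (decision function), and $R_{\exp}$ (expected risk) are notation assumed here for illustration; they do not appear elsewhere in this notebook. The 0-1 loss is

$$L(Y,f(X))=\begin{cases}1,&Y\ne f(X)\\0,&Y=f(X)\end{cases}$$

and the expected risk, written as a conditional expectation, is

$$R_{\exp}(f)=E_X\left[\sum_{k=1}^{K}L(c_k,f(X))\,P(c_k|X)\right]$$

Minimizing pointwise for each $X=x$ gives

$$f(x)=\arg\min_{y\in Y}\sum_{k=1}^{K}L(c_k,y)\,P(c_k|X=x)=\arg\min_{y\in Y}\left(1-P(Y=y|X=x)\right)=\arg\max_{c_k}P(Y=c_k|X=x)$$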


Models:

- Gaussian model
- Multinomial model
- Bernoulli model
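
For reference, the three variants differ only in the form assumed for the class-conditional distribution. The parameter symbols $\mu_{k,i}$, $\sigma_{k,i}$, and $\theta_{k,i}$ are notation assumed here, following common textbook conventions.

Gaussian (continuous features):

$$p(x_i|y=c_k)=\frac{1}{\sqrt{2\pi\sigma_{k,i}^2}}\exp\left(-\frac{(x_i-\mu_{k,i})^2}{2\sigma_{k,i}^2}\right)$$

Bernoulli (binary features, $x_i\in\{0,1\}$):

$$p(x_i|y=c_k)=\theta_{k,i}^{x_i}(1-\theta_{k,i})^{1-x_i}$$

Multinomial (count features, $x_i\in\{0,1,2,\cdots\}$ and $\sum_{i=1}^n\theta_{k,i}=1$):

$$p(x|y=c_k)\propto\prod\limits_{i=1}^n\theta_{k,i}^{x_i}$$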

## Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from collections import Counter
import math

In [2]:
# data
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
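    # keep the first 100 rows only (classes 0 and 1), giving a two-class problem
    data = np.array(df.iloc[:100, :])
    # split into features (all columns but the last) and labels (last column)
    return data[:, :-1], data[:, -1]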

In [3]:
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [4]:
X_test[0], y_test[0]

(array([6.4, 2.9, 4.3, 1.3]), 1.0)

## Manual implementation

In [5]:
class NaiveBayes:
    
    def __init__(self,):
        self.category = None
    
    def mu(self,X):
        """
        Mean (expected value)
        """
        return np.mean(X)
    
    def std(self,X):
        """
        Standard deviation (population form, dividing by N)
        """
        mu = self.mu(X)
        return np.sqrt(np.sum(np.power(X-mu,2))/len(X)*1.0 )
    
    def gaussian_probobility(self,x,mu,sigma):
        """
        Gaussian probability density
        """
        exponent = np.exp(-1.0*(np.power(x-mu,2)/(2*np.power(sigma,2))))
        return 1.0 / (np.sqrt(2*np.pi) * sigma) * exponent
    
    

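    def fit(self,X,y):
        """
        Fit the model: estimate the class priors and the per-feature
        mean and standard deviation of each class
        """
        labels = np.unique(y)                  # class labels
        data = {label:[] for label in labels} # samples belonging to each class
        for x,label in zip(X,y):
            data[label].append(x)
        
        # prior P(Y=c_k), estimated as the class frequency
        self.prior = {
            label:len(value)/len(y)
            for label,value in data.items()
        }
        # per-class list of (mean, std), one pair per feature
        self.category = {
            label:self.summarize(value)
            for label,value in data.items()
        }
        
        return "train done!"
            
    def summarize(self,train_data):
        """
        Compute (mean, std) for each feature column of the training data
        """
        summaries = [(self.mu(x),self.std(x)) for x in zip(*train_data)]
        return summaries
            
    def calculate_probabilities(self,data):
        """
        Compute P(Y=c_k) * prod_i p(x_i|Y=c_k) for every class,
        which is proportional to the posterior
        """
        probabilities = {}
        for label,value in self.category.items():
            probabilities[label] = self.prior[label]
            for i in range(len(value)):
                mu,std = value[i]
                # multiply by the Gaussian density of the i-th feature
                probabilities[label] *= self.gaussian_probobility(data[i],mu,std)
        return probabilities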

    def predict(self,X_test):
        """
        Predict the class of a single sample
        """
        label = sorted(
            self.calculate_probabilities(X_test).items(),
            key=lambda x: x[-1]
        )[-1][0] # sort by probability and take the class with the largest value
        
        return label
        
    def score(self,X_test,y_test):
        """
        Accuracy on the given test set
        """
        count = 0
        for X,y in zip(X_test,y_test):
            label = self.predict(X)
            if label == y:
                count = count + 1
        return count / len(X_test) * 1.0
    
nb = NaiveBayes()
nb.fit(X_train,y_train)

'train done!'

In [6]:
nb.predict([4.4,  3.2,  1.3,  0.2])

0.0

In [7]:
nb.score(X_test, y_test)

1.0
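
Multiplying many small densities can underflow to zero once the number of features grows, so practical implementations usually sum log-densities instead. The following is a minimal vectorized sketch of that idea, not a replacement for the class above: the function names (`fit_gaussian_nb`, `log_joint`, `predict_classes`) and the synthetic two-blob data are made up for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class log prior, feature means, and feature variances."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    mean = np.array([X[y == c].mean(axis=0) for c in classes])  # (k, d)
    var = np.array([X[y == c].var(axis=0) for c in classes])    # (k, d)
    return classes, log_prior, mean, var

def log_joint(X, log_prior, mean, var):
    """log P(Y=c) + sum_j log N(x_j; mean_cj, var_cj), shape (n_samples, n_classes)."""
    X = np.asarray(X, dtype=float)[:, None, :]                  # (n, 1, d)
    log_pdf = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)  # (n, k, d)
    return log_prior + log_pdf.sum(axis=2)

def predict_classes(X, classes, log_prior, mean, var):
    """Return the class with the largest log joint probability for each row."""
    return classes[np.argmax(log_joint(X, log_prior, mean, var), axis=1)]

# Synthetic, made-up data: two well-separated Gaussian blobs in 4 dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(5.0, 1.0, (50, 4))])
y_demo = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X_demo, y_demo)
pred = predict_classes(X_demo, *params)
accuracy = (pred == y_demo).mean()
print(accuracy)
```

Because the logarithm is monotonic, taking the argmax of the log joint selects the same class as taking the argmax of the product.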

## sklearn implementation

In [8]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [9]:
gnb.score(X_test,y_test)

1.0

In [10]:
gnb.predict([[4.4,  3.2,  1.3,  0.2]])

array([0.])
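
Besides the predicted label, a fitted `GaussianNB` also exposes the estimated priors and the posterior probabilities, which makes the link to the formulas in the first section explicit. The sketch below is self-contained and, as an assumption made for brevity, fits on all 100 rows instead of the random training split used above; the `_demo` variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
# same two classes as in the notebook (first 100 rows)
X_demo, y_demo = iris.data[:100], iris.target[:100]

gnb_demo = GaussianNB().fit(X_demo, y_demo)
proba = gnb_demo.predict_proba([[4.4, 3.2, 1.3, 0.2]])

print(gnb_demo.classes_)      # class labels, in the column order of proba
print(gnb_demo.class_prior_)  # estimated P(Y = c_k)
print(proba)                  # posterior P(Y = c_k | x); each row sums to 1
```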

In [11]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB # Bernoulli and multinomial models

In [12]:
bnb = BernoulliNB()
bnb.fit(X_train,y_train)
bnb.score(X_test,y_test)

0.4666666666666667
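
The near-chance score has a simple explanation: `BernoulliNB` binarizes its inputs with the threshold `binarize=0.0` by default, and every iris measurement is positive, so all features become 1 and carry no class information. The sketch below illustrates this. The threshold 2.5 is a hand-picked value for illustration only, and, as a simplifying assumption, the model is fitted and scored on all 100 rows instead of the random split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB

iris = load_iris()
X_demo, y_demo = iris.data[:100], iris.target[:100]

# With the default threshold 0.0, every feature value maps to 1.
all_positive = bool((X_demo > 0).all())
default_pred = BernoulliNB().fit(X_demo, y_demo).predict(X_demo)

# A hand-picked threshold (an assumption): petal length falls on opposite
# sides of 2.5 for the two classes, so the binarized feature is informative.
thresholded = BernoulliNB(binarize=2.5).fit(X_demo, y_demo)
thresholded_acc = thresholded.score(X_demo, y_demo)

print(all_positive, np.unique(default_pred), thresholded_acc)
```

With the default threshold every sample receives the same binarized vector, so the model predicts a single class for all inputs, which is consistent with an accuracy close to 0.5 on two roughly balanced classes.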

In [13]:
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
mnb.score(X_test,y_test)

1.0