# Chapter 4: Naive Bayes

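The naive Bayes method learns the joint probability distribution $P(X,Y)$ from the training data and then derives the posterior probability distribution $P(Y|X)$.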

Concretely, the training data are used to estimate $P(X|Y)$ and $P(Y)$, and the joint distribution follows from the product rule:
$$P(X,Y)=P(Y)P(X|Y)$$

The probabilities can be estimated by **maximum likelihood estimation** or by **Bayesian estimation**.
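The difference between the two estimators can be sketched on a toy discrete feature. This is a minimal sketch, not the chapter's implementation: the function names `estimate_prior` and `estimate_conditional`, the smoothing parameter `lam`, and the toy data are assumptions introduced for illustration. With `lam=0` the estimates are relative frequencies (maximum likelihood); with `lam>0` they are smoothed Bayesian estimates, and `lam=1` corresponds to Laplace smoothing.

```python
from collections import Counter

def estimate_prior(labels, lam=0.0):
    """Estimate P(Y=c) for every class c.

    lam=0 gives the maximum likelihood estimate (relative frequency);
    lam>0 gives the smoothed Bayesian estimate (lam=1 is Laplace smoothing).
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: (cnt + lam) / (n + k * lam) for c, cnt in counts.items()}

def estimate_conditional(values, labels, lam=0.0):
    """Estimate P(X=a | Y=c) for one discrete feature.

    Assumption: the set of possible feature values is the set observed in the data.
    """
    domain = sorted(set(values))
    class_counts = Counter(labels)
    pair_counts = Counter(zip(values, labels))
    return {
        (a, c): (pair_counts[(a, c)] + lam) / (class_counts[c] + len(domain) * lam)
        for c in class_counts
        for a in domain
    }

# Hypothetical toy data: one feature with values 'S'/'M'/'L', labels +1/-1.
toy_values = ['S', 'M', 'M', 'S', 'S', 'L']
toy_labels = [-1, -1, 1, 1, -1, 1]

prior_mle = estimate_prior(toy_labels)
cond_mle = estimate_conditional(toy_values, toy_labels)
cond_bayes = estimate_conditional(toy_values, toy_labels, lam=1.0)
```

Under maximum likelihood the unseen pair `('L', -1)` gets probability 0, which would zero out the whole product for that class. The smoothed estimate assigns it 1/6 instead.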

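The basic assumption of naive Bayes is conditional independence: given the class, the features are independent of one another,
$$\begin{aligned} P(X=x | Y=c_{k}) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned}$$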


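This is a strong assumption. It greatly reduces the number of conditional probabilities in the model, which simplifies both learning and prediction. Naive Bayes is therefore efficient and easy to implement. Its drawback is that classification performance is not necessarily high.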
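To get a feel for the size of that reduction, the sketch below counts the conditional-probability table entries with and without the independence assumption. It assumes $K$ classes and that feature $j$ takes $S_j$ distinct values. The function names and the example sizes are hypothetical, and the counts are table entries rather than free parameters.

```python
from math import prod

def count_full_joint(num_classes, feature_sizes):
    """Entries needed to tabulate P(X=x | Y=c) for every joint value x: K * prod(S_j)."""
    return num_classes * prod(feature_sizes)

def count_naive(num_classes, feature_sizes):
    """Entries needed under conditional independence: K * sum(S_j)."""
    return num_classes * sum(feature_sizes)

# Hypothetical setting: 2 classes, 10 features with 3 possible values each.
sizes = [3] * 10
full = count_full_joint(2, sizes)   # 2 * 3**10
naive = count_naive(2, sizes)       # 2 * 30
```

The first count grows exponentially with the number of features, the second only linearly.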

Naive Bayes classifies by applying Bayes' theorem to the learned joint probability model:
$$P(Y | X)=\frac{P(X, Y)}{P(X)}=\frac{P(Y) P(X | Y)}{\sum_{Y} P(Y) P(X | Y)}$$
 
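The input $x$ is assigned to the class $y$ with the largest posterior probability. Because the denominator is the same for every class, it can be dropped:
$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$$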
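The decision rule can be traced on a small discrete example. The sketch below assumes the probability tables are already known. The function `posterior`, the table layout, and all numbers are hypothetical and chosen only for illustration.

```python
def posterior(priors, conditionals, x):
    """Return P(Y=c | X=x) for every class c under the naive Bayes model.

    priors:       {c: P(Y=c)}
    conditionals: {c: [ {value: P(X^(j)=value | Y=c)} for each feature j ]}
    """
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, value in enumerate(x):
            p *= conditionals[c][j][value]
        joint[c] = p                       # P(Y=c) * prod_j P(x^(j) | Y=c)
    evidence = sum(joint.values())         # P(X=x), the shared denominator
    return {c: p / evidence for c, p in joint.items()}

# Hypothetical probability tables: two classes, two binary features.
toy_priors = {"spam": 0.4, "ham": 0.6}
toy_conditionals = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.5, 1: 0.5}],
    "ham":  [{0: 0.7, 1: 0.3}, {0: 0.9, 1: 0.1}],
}

post = posterior(toy_priors, toy_conditionals, x=(1, 1))
predicted = max(post, key=post.get)   # class with the largest posterior
```

Dividing by `evidence` rescales the scores so they sum to 1 but does not change which class is largest, which is why the argmax formula omits it.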

Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.
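A short derivation makes this equivalence concrete. The notation is introduced here for illustration: $f$ is the decision function, $K$ is the number of classes, and $L(Y, f(X))$ is the 0-1 loss, equal to 1 when $Y \neq f(X)$ and 0 otherwise. The expected risk is $R_{\exp}(f)=E[L(Y, f(X))]$. Taking the expectation over $Y$ given $X$ first, it is enough to minimize the conditional risk separately for each $X=x$:

$$\begin{aligned} f(x) &=\arg \min _{y} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(Y=c_{k} | X=x\right) \\ &=\arg \min _{y} \sum_{c_{k} \neq y} P\left(Y=c_{k} | X=x\right) \\ &=\arg \min _{y}\left(1-P\left(Y=y | X=x\right)\right) \\ &=\arg \max _{c_{k}} P\left(Y=c_{k} | X=x\right) \end{aligned}$$

The second line uses the fact that the 0-1 loss only counts the classes that differ from the prediction $y$, and the third uses the fact that the posterior probabilities sum to 1.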

Models (choices for the class-conditional distribution):
- **Gaussian** distribution
- Multinomial distribution
- Bernoulli distribution
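For the Gaussian model, each feature is assumed to be normally distributed within a class. Writing $\mu_{k,j}$ and $\sigma_{k,j}$ for the mean and standard deviation of feature $j$ in class $c_k$ (notation introduced here, not in the original text), the class-conditional density is:

$$p\left(x^{(j)} | Y=c_{k}\right)=\frac{1}{\sqrt{2 \pi}\, \sigma_{k,j}} \exp \left(-\frac{\left(x^{(j)}-\mu_{k,j}\right)^{2}}{2 \sigma_{k,j}^{2}}\right)$$

This is a density rather than a probability, so individual values may exceed 1. That is harmless for classification, because only the comparison between classes matters. The `gaussian_probability` method in this chapter evaluates this expression.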

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
import math

# loading dataset iris
def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
    # keep only the first 100 rows, i.e. the two classes with labels 0 and 1
    data = np.array(df.iloc[:100, :])
    return data[:, :-1], data[:, -1]

# processing data for model
X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print("The training data shape: X-{}, y-{}".format(np.shape(X_train), np.shape(y_train)))
print("The test data shape: X-{}, y-{}".format(np.shape(X_test), np.shape(y_test)))

The training data shape: X-(70, 4), y-(70,)
The test data shape: X-(30, 4), y-(30,)


## Reference: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

In [10]:
# Naive Bayes classifier that assumes each feature follows a Gaussian distribution within a class.
class NaiveBayes:
  def __init__(self):
    self.model = None
    
  # compute the sample mean (the estimate of the mathematical expectation)
  @staticmethod
  def mean(X):
    return sum(X) / float(len(X))
  
  # compute the population standard deviation (divides by N, not N - 1)
  def standard_deviation(self, X):
    average = self.mean(X)
    return math.sqrt(sum([pow(x - average, 2) for x in X]) / float(len(X)))
  
  # compute Probability density function
  def gaussian_probability(self, x, mean, standard_deviation):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(standard_deviation, 2))))
    return (1 / (math.sqrt(2 * math.pi) * standard_deviation)) * exponent
  
  # compute (mean, standard deviation) for each feature column
  def summarize(self, train_data):
    summaries = [(self.mean(i), self.standard_deviation(i)) for i in zip(*train_data)]
    return summaries
  
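  # group the samples by label, then compute each feature's mean and standard deviation per class
  def fit(self, X, y):
    labels = list(set(y))
    data = {label: [] for label in labels}
    for f, label in zip(X, y):
      data[label].append(f)
    self.model = {label: self.summarize(value) for label, value in data.items()}
    # class priors P(Y=c_k), estimated as relative frequencies
    self.priors = {label: len(value) / float(len(y)) for label, value in data.items()}
    return "Gaussian Naive Bayes done."
  
  # compute the unnormalized posterior P(Y=c_k) * prod_j p(x^(j) | Y=c_k) for each class
  def calculate_probability(self, input_data):
    probabilities = {}
    for label, value in self.model.items():
      # start from the class prior so the score matches the decision rule
      probabilities[label] = self.priors[label]
      for i in range(len(value)):
        mean, standard_deviation = value[i]
        probabilities[label] *= self.gaussian_probability(input_data[i], mean, standard_deviation)
    return probabilities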
  
  # classification: return the label with the largest score
  def predict(self, X_test):
    label = sorted(self.calculate_probability(X_test).items(), key=lambda x : x[-1])[-1][0]
    return label
  
  # compute accuracy on a labeled test set
  def accuracy(self, X_test, y_test):
    num_correct = 0
    for X, y in zip(X_test, y_test):
      label = self.predict(X)
      if label == y:
        num_correct += 1
    return num_correct / float(len(X_test))
  
# ---------------------------------------
# TEST
model = NaiveBayes()
model.fit(X_train, y_train)

print("The classification of a sample: {}".format(model.predict([4.4,  3.2,  1.3,  0.2])))
print("The accuracy of the classifier on the test set: {}".format(model.accuracy(X_test, y_test)))

The classification of a sample: 0.0
The accuracy of the classifier on the test set: 1.0
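The `calculate_probability` method multiplies one density per feature, which is fine for four features but can underflow to 0.0 when there are many. A common alternative is to add log-densities instead. The following is a minimal sketch of that idea, not part of the original class: the helper names `gaussian_log_density` and `log_scores` and the per-class statistics are hypothetical.

```python
import math

def gaussian_log_density(x, mean, std):
    """Log of the normal density N(x; mean, std**2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_scores(x, summaries, priors):
    """Return log P(Y=c) + sum_j log p(x^(j) | Y=c) for every class c.

    summaries: {c: [(mean, std) for each feature]}  (same layout as NaiveBayes.model)
    priors:    {c: P(Y=c)}
    """
    return {
        c: math.log(priors[c]) + sum(
            gaussian_log_density(xj, m, s) for xj, (m, s) in zip(x, stats)
        )
        for c, stats in summaries.items()
    }

# Hypothetical per-class statistics for two features (not fitted to iris).
toy_summaries = {0: [(5.0, 0.35), (1.5, 0.17)], 1: [(5.9, 0.52), (4.3, 0.47)]}
toy_class_priors = {0: 0.5, 1: 0.5}

scores = log_scores([4.9, 1.4], toy_summaries, toy_class_priors)
best = max(scores, key=scores.get)   # the logarithm is monotonic, so the argmax is unchanged
```

Because the logarithm is strictly increasing, the class with the largest log-score is also the class with the largest product.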


In [14]:
# sklearn instances
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB, BernoulliNB  # multinomial and Bernoulli models

clf_GaussianNB = GaussianNB()
clf_GaussianNB.fit(X_train, y_train)
print("The Gaussian Naive Bayes classification of a sample: {}".format(clf_GaussianNB.predict([[4.4,  3.2,  1.3,  0.2]])))
print("The accuracy of the Gaussian Naive Bayes classifier on the test set: {}".format(clf_GaussianNB.score(X_test, y_test)))
print("--------------------------------------------------------------------------")
clf_MultinomialNB = MultinomialNB()
clf_MultinomialNB.fit(X_train, y_train)
print("The Multinomial Naive Bayes classification of a sample: {}".format(clf_MultinomialNB.predict([[4.4,  3.2,  1.3,  0.2]])))
print("The accuracy of the Multinomial Naive Bayes classifier on the test set: {}".format(clf_MultinomialNB.score(X_test, y_test)))
print("--------------------------------------------------------------------------")
clf_BernoulliNB = BernoulliNB()
clf_BernoulliNB.fit(X_train, y_train)
print("The Bernoulli Naive Bayes classification of a sample: {}".format(clf_BernoulliNB.predict([[4.4,  3.2,  1.3,  0.2]])))
print("The accuracy of the Bernoulli Naive Bayes classifier on the test set: {}".format(clf_BernoulliNB.score(X_test, y_test)))

The Gaussian Naive Bayes classification of a sample: [0.]
The accuracy of the Gaussian Naive Bayes classifier on the test set: 1.0
--------------------------------------------------------------------------
The Multinomial Naive Bayes classification of a sample: [0.]
The accuracy of the Multinomial Naive Bayes classifier on the test set: 1.0
--------------------------------------------------------------------------
The Bernoulli Naive Bayes classification of a sample: [1.]
The accuracy of the Bernoulli Naive Bayes classifier on the test set: 0.4
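The low Bernoulli score is plausibly a preprocessing effect rather than a weakness of the model family. scikit-learn's `BernoulliNB` binarizes its inputs, and its `binarize` parameter defaults to 0.0. Every iris measurement is positive, so every feature becomes 1 and the samples become indistinguishable. The sketch below mimics that thresholding in plain Python. The helper `binarize`, the second sample, and the threshold 2.5 are assumptions chosen for illustration.

```python
def binarize(row, threshold=0.0):
    """Map each value to 1 if it is strictly greater than threshold, else 0."""
    return [1 if v > threshold else 0 for v in row]

sample_a = [4.4, 3.2, 1.3, 0.2]   # the sample used in the cells above
sample_b = [6.4, 3.2, 4.5, 1.5]   # illustrative measurements of a larger flower

# With the default threshold both samples collapse to the same vector.
default_a = binarize(sample_a)
default_b = binarize(sample_b)

# With a threshold inside the data range the two samples differ again.
split_a = binarize(sample_a, threshold=2.5)
split_b = binarize(sample_b, threshold=2.5)
```

When all inputs look identical, the classifier can only fall back on the class priors and predicts the same class for every sample, which would be consistent with an accuracy near the share of that class in the test set. Passing a threshold, for example `BernoulliNB(binarize=2.5)`, is one way to check this explanation. The value 2.5 is arbitrary, not tuned.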
