**<font color = black size=6>实验九:贝叶斯分类</font>**

In [1]:
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

**<font color = blue size=4>第一部分:实验任务</font>**

1.朴素贝叶斯

<img src='./Naive Bayes Classifier Pseudocode.jpg'>

<span style="color:purple">该数据集(train_mushroom.csv)为分类数据集，为蘑菇的特征信息以及是否有毒。包括了13个特征以及一个标签(即为label类型,代表是否有毒)。label='p'代表有毒，label='e'代表无毒。</span>

<span style="color:purple">1) 将训练数据集'train_mushroom.csv'和'test_mushroom.csv'载入并转换为你需要的格式</span>

In [2]:
train_df = pd.read_csv('./train_mushroom.csv')
test_df = pd.read_csv('./test_mushroom.csv')

print(train_df.shape[0])
print(test_df.shape[0])
print(train_df[:3])
print(test_df[:3])

500
100
  cap-shape cap-surface cap-color bruises odor gill-spacing gill-size  \
0         s           n         t       p    f            n         k   
1         s           y         t       a    f            b         k   
2         s           w         t       l    f            b         n   

  gill-color ring-number ring-type spore-print-color population habitat label  
0          e           p         k                 s          u       x     p  
1          e           p         n                 n          g       x     e  
2          e           p         n                 n          m       b     e  
  cap-shape cap-surface cap-color bruises odor gill-spacing gill-size  \
0         k           y         n       f    s            c         n   
1         b           f         g       f    n            w         b   
2         b           s         g       f    n            w         b   

  gill-color ring-number ring-type spore-print-color population habitat label  
0     

2) 计算每个标签值y对应的先验概率P(y)
$$P(y)=\frac{|D_y|}{|D|}$$
其中$D_y$为标签值为y的样本集合，$|D_y|$为这个集合的样本个数；D为所有样本集合，|D|为所有样本个数



In [3]:
# 计算每个标签值y对应的先验概率P(y)
train_len = len(train_df)
probability_y = {}
couter_y = Counter(train_df['label'])
print(couter_y)

for label, count in couter_y.items():
    probability_y[label] = count / train_len

print(probability_y)

Counter({'e': 295, 'p': 205})
{'p': 0.41, 'e': 0.59}


<span style="color:purple">3) 
对于数据集中的每个特征的非重复特征值$x_i$，计算给定标签值y时特征值$x_i$的条件概率$P(x_i│y)$,
$$P(x_i│y)=\frac{|D_{x_i,y}|}{|D_y|}$$
$D_{x_i,y}$为标签值为y，特征值为$x_i$的样本集合；$|D_{x_i,y}|$为该集合的样本个数
</span>

In [4]:

from collections import defaultdict

# 遍历数据集D中的每个特征，将每个特征的非重复值取出
features = train_df.columns[:-1]
unique_values = defaultdict(set)#如果访问一个不存在的键，那么就会返回空的set
print(unique_values)

for feature in features:
    unique_values[feature].update(train_df[feature].unique())
    
print(unique_values)

# 根据标签值将数据集D分为两个子数据集，分别包括所有标签值为p的样本和所有标签值为e的样本
Dp = train_df[train_df['label'] == 'p']
De = train_df[train_df['label'] == 'e']

print(f'len of Dp:{len(Dp)} len of De is:{len(De)}')

# 定义条件概率的字典
conditional_probabilities_p = {}
conditional_probabilities_e = {}

#以特征cap-shape为例。Dp中cap-shape的非重复值集合为['b' 'c' 'f' 'k' 's' 'x' 'y']，计算条件概率P(cap-shape='b'|label='p'),P(cap-shape='c'|label='p'),...,P(cap-shape='y'|label='p')，
#上述对cap-shape特征操作完成后，按照同样的步骤对Dp中的剩余12个特征进行同样的操作

# 遍历Dp的每个特征，分别求出该特征每个特征值的条件概率
for feature in features:
    probabilities = {}

    # get unique key of label
    for value in unique_values[feature]:
        count_p = Dp[Dp[feature] == value].shape[0]
        probabilities[value] = count_p / Dp.shape[0]


    conditional_probabilities_p[feature] = probabilities

# 同样的操作对De进行计算
for feature in features:
    probabilities = {}

    for value in unique_values[feature]:
        count_e = De[De[feature] == value].shape[0]
        probabilities[value] = count_e / De.shape[0]

    conditional_probabilities_e[feature] = probabilities

print(conditional_probabilities_p)
print(conditional_probabilities_e)



defaultdict(<class 'set'>, {})
defaultdict(<class 'set'>, {'cap-shape': {'s', 'f', 'y', 'k', 'b', 'x', 'c'}, 'cap-surface': {'w', 'n', 's', 'f', 'y', 'g'}, 'cap-color': {'w', 'n', 'u', 'p', 'r', 'f', 'y', 'b', 'g', 't', 'e', 'c'}, 'bruises': {'n', 'l', 'p', 'f', 'a', 't'}, 'odor': {'n', 'f', 'y', 's'}, 'gill-spacing': {'w', 'n', 'c', 'b'}, 'gill-size': {'w', 'n', 'h', 'p', 'k', 'b', 'g'}, 'gill-color': {'w', 'h', 'p', 'r', 'b', 'g', 't', 'e'}, 'ring-number': {'p', 't', 'o', 'e'}, 'ring-type': {'n', 'u', 'l', 'p', 'f', 'k', 'e'}, 'spore-print-color': {'w', 'n', 's', 'h', 'r', 'y', 'a', 'v'}, 'population': {'u', 's', 'p', 'y', 'd', 'm', 'g', 'v', 'c'}, 'habitat': {'w', 'u', 'l', 's', 'p', 'f', 'b', 'd', 'm', 'g', 'x'}})
len of Dp:205 len of De is:295
{'cap-shape': {'s': 0.05853658536585366, 'f': 0.36097560975609755, 'y': 0.11219512195121951, 'k': 0.0, 'b': 0.03902439024390244, 'x': 0.424390243902439, 'c': 0.004878048780487805}, 'cap-surface': {'w': 0.1073170731707317, 'n': 0.063414634146

<span style="color:purple">4) 
编写函数，给定样本$x=(x_1,...,x_i,...,x_d)$以及标签y, 计算其后验概率    
输入：样本x，标签y  
输出：样本x对应标签y的后验概率  
计算后验概率公式:
$P(y)\prod_{i=1}^{d}P(x_i|y)$   
    
<span style="color:purple">例:  
特征和标签：(cap-shape, cap-surface,..., habitat), label  
输入: [k, y, n, f, s, c, n, b, o, e, w, v, d], p  
输出: P(label='p') $\times$ P(cap-shape='k'|label='p') $\times$ ... $\times$ P(habitat='d'|label='p')</span>

In [5]:
def pro(a:pd.Series,label):
    # init
    posterior_probability = 0.
    
    # calculate P(y)
    P_y = probability_y[label]
    posterior_probability = P_y
    # print(P_y)
    
    # calculate |-|P(xi|y)
    conditional_probabilities = {}
    if(label=='p'):
        conditional_probabilities = conditional_probabilities_p
    elif(label=='e'):
        conditional_probabilities = conditional_probabilities_e
    else:
        raise Exception("not found label of a!")
    
    assert isinstance(conditional_probabilities,dict)
    
    for feature in features:
        xi = a[feature]
        # print('xi',xi)
        # print(feature)
        # print(conditional_probabilities[feature])
        if xi not in conditional_probabilities[feature]:
            _p = 0
        else: 
            _p = conditional_probabilities[feature][xi]
        posterior_probability *= _p
        # print(f'condition of {feature} and {xi} and {label} is {_p}')
    
    # print('posterior_probability',posterior_probability)
    return posterior_probability

<span style="color:purple">5) 对测试集中的每个样本a，利用上个步骤所编写的函数，分别求所有可能的标签对应的后验概率，后验概率更大的对应标签即为预测标签。最后与测试集本身标签进行比较计算出准确率</span>

In [6]:
acc_num = 0
# total_num = 0
labels_unique = train_df['label'].unique()
print(labels_unique)
for index,a in test_df.iterrows():
    
    # print(a)
    posterior_probability = [pro(a,label) for label in labels_unique] 
    # print(posterior_probability)
    # except the [0,0]
    if posterior_probability == [0.,0.]:
        continue
    #
    max_p = max(posterior_probability)
    max_index = posterior_probability.index(max_p)
    max_label = 'p' if max_p == 0 else 'e'
    if max_label == a['label']:
        acc_num += 1
    
    # total_num += 1

# print(total_num)
# acc_rate = acc_num/total_num
acc_rate = acc_num/test_df.size
print(f'acc_rate is : {acc_rate}') 
    
    
    

    
    

['p' 'e']
acc_rate is : 0.010714285714285714


2.引入拉普拉斯平滑

<span style="color:purple">1) 首先，请判断是否有某个特征值和某个类没有在训练集中同时出现而使得条件概率为零。如果无，则无需进行下列实验；如果有，请在上个实验的基础上引入拉普拉斯平滑</span>

In [7]:
''' 
有某个特征值和某个类没有在训练集中同时出现而使得条件概率为零
因为在之前的查找条件概率的时候就发现有不存在的情况,
所以加上了当发现条件概率不存在是直接认为是0

if xi not in conditional_probabilities[feature]:
    _p = 0
else: 
    _p = conditional_probabilities[feature][xi]
    posterior_probability *= _p

所以需要引入拉普拉斯平滑
'''



' \n有某个特征值和某个类没有在训练集中同时出现而使得条件概率为零\n因为在之前的查找条件概率的时候就发现有不存在的情况,\n所以加上了当发现条件概率不存在是直接认为是0\n\nif xi not in conditional_probabilities[feature]:\n    _p = 0\nelse: \n    _p = conditional_probabilities[feature][xi]\n    posterior_probability *= _p\n\n所以需要引入拉普拉斯平滑\n'

<span style="color:purple">2) 引入拉普拉斯平滑后计算每个标签y对应的先验概率P(y)
$$P(y)=\frac{|D_y|+1}{|D|+N}$$
其中$D_y$为标签为y的样本集合；$|D_y|$为标签为y的集合的样本个数；D为所有样本集合；|D|为所有样本个数;N为标签取值的个数

</span>

In [8]:
'''
计算每个标签值y对应的先验概率P(y)
'''
train_len = len(train_df)
probability_y = {}
couter_y = Counter(train_df['label'])
print(couter_y)
N = len(couter_y.keys())
print(N)

for label, count in couter_y.items():
    probability_y[label] = (count+1) / (train_len+N)

print(probability_y)



Counter({'e': 295, 'p': 205})
2
{'p': 0.4103585657370518, 'e': 0.5896414342629482}


<span style="color:purple">3) 计算数据集中的每个特征的非重复特征值$x_i$对应标签y的条件概率$P(x_i│y)$,
    $$P(x_i│y)=\frac{|D_{x_i,y}|+1}{|D_y|+N_i}$$
$D_{x_i,y}$为标签为$y$，特征为$x_i$的样本集合；$|D_{x_i,y}|$为该样本个数;$N_i$为第$i$个特征取值的个数
</span>

In [9]:

'''
计算条件概率

--计算条件概率时需要注意引入拉普拉斯平滑--
'''
from collections import defaultdict

# 遍历数据集D中的每个特征，将每个特征的非重复值取出
features = train_df.columns[:-1]
unique_values = defaultdict(set)#如果访问一个不存在的键，那么就会返回空的set
print(unique_values)

for feature in features:
    unique_values[feature].update(train_df[feature].unique())
    
print(unique_values)

# 根据标签值将数据集D分为两个子数据集，分别包括所有标签值为p的样本和所有标签值为e的样本
Dp = train_df[train_df['label'] == 'p']
De = train_df[train_df['label'] == 'e']

print(f'len of Dp:{len(Dp)} len of De is:{len(De)}')

# 定义条件概率的字典
conditional_probabilities_p = {}
conditional_probabilities_e = {}

#以特征cap-shape为例。Dp中cap-shape的非重复值集合为['b' 'c' 'f' 'k' 's' 'x' 'y']，计算条件概率P(cap-shape='b'|label='p'),P(cap-shape='c'|label='p'),...,P(cap-shape='y'|label='p')，
#上述对cap-shape特征操作完成后，按照同样的步骤对Dp中的剩余12个特征进行同样的操作

# 遍历Dp的每个特征，分别求出该特征每个特征值的条件概率
for feature in features:
    probabilities = {}

    # get unique key of label
    for value in unique_values[feature]:
        Ni = train_df[feature].unique().shape[0]
        # print('Ni',Ni)
        count_p = Dp[Dp[feature] == value].shape[0]
        probabilities[value] = (count_p+1) / (Dp.shape[0]+Ni)


    conditional_probabilities_p[feature] = probabilities

# 同样的操作对De进行计算
for feature in features:
    probabilities = {}

    for value in unique_values[feature]:
        Ni = train_df[feature].unique().shape[0]
        count_e = De[De[feature] == value].shape[0]
        probabilities[value] = (count_e+1) / (De.shape[0]+Ni)

    conditional_probabilities_e[feature] = probabilities

print(conditional_probabilities_p)
print(conditional_probabilities_e)


defaultdict(<class 'set'>, {})
defaultdict(<class 'set'>, {'cap-shape': {'s', 'f', 'y', 'k', 'b', 'x', 'c'}, 'cap-surface': {'w', 'n', 's', 'f', 'y', 'g'}, 'cap-color': {'w', 'n', 'u', 'p', 'r', 'f', 'y', 'b', 'g', 't', 'e', 'c'}, 'bruises': {'n', 'l', 'p', 'f', 'a', 't'}, 'odor': {'n', 'f', 'y', 's'}, 'gill-spacing': {'w', 'n', 'c', 'b'}, 'gill-size': {'w', 'n', 'h', 'p', 'k', 'b', 'g'}, 'gill-color': {'w', 'h', 'p', 'r', 'b', 'g', 't', 'e'}, 'ring-number': {'p', 't', 'o', 'e'}, 'ring-type': {'n', 'u', 'l', 'p', 'f', 'k', 'e'}, 'spore-print-color': {'w', 'n', 's', 'h', 'r', 'y', 'a', 'v'}, 'population': {'u', 's', 'p', 'y', 'd', 'm', 'g', 'v', 'c'}, 'habitat': {'w', 'u', 'l', 's', 'p', 'f', 'b', 'd', 'm', 'g', 'x'}})
len of Dp:205 len of De is:295
{'cap-shape': {'s': 0.06132075471698113, 'f': 0.35377358490566035, 'y': 0.11320754716981132, 'k': 0.0047169811320754715, 'b': 0.04245283018867924, 'x': 0.41509433962264153, 'c': 0.009433962264150943}, 'cap-surface': {'w': 0.10900473933649289

<span style="color:purple">4) 对测试集中的每个样本a，利用上个步骤所编写的函数，分别求所有可能的标签对应的后验概率，后验概率更大的对应标签即为预测标签，最后与测试集本身标签进行比较计算出准确率</span>

In [11]:
'''
引入拉普拉斯平滑后
'''

acc_num = 0
total_num = 0
labels_unique = train_df['label'].unique()
print(labels_unique)
for index,a in test_df.iterrows():
    
    # print(a)
    posterior_probability = [pro(a,label) for label in labels_unique] 
    print(posterior_probability)
    # except the [0,0]
    if posterior_probability == [0.,0.]:
        continue
    #
    max_p = max(posterior_probability)
    max_index = posterior_probability.index(max_p)
    max_label = 'p' if max_p == 0 else 'e'
    if max_label == a['label']:
        acc_num += 1
    
    total_num += 1

print(total_num)
acc_rate = acc_num/total_num
print(f'acc_rate is : {acc_rate}') 
    

['p' 'e']
[7.665999837752774e-10, 2.8787165683937546e-16]
[6.287548144266044e-13, 4.6079146900350027e-23]
[0.0, 0.0]
[3.6325045385043913e-08, 4.3543611958897136e-17]
[0.0, 0.0]
[0.0, 0.0]
[2.602043675746354e-09, 7.132104026295093e-17]
[3.009424752811098e-13, 3.4835835056664624e-21]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 0.0]
[9.038701972402352e-11, 3.2004554789789377e-17]
[8.057357186827239e-11, 2.4181219174507533e-16]
[0.0, 0.0]
[7.762405116377833e-14, 2.902986254722052e-21]
[8.982211634665776e-13, 1.382374407010501e-22]
[0.0, 0.0]
[9.914338216093691e-09, 6.592260941621698e-15]
[1.5970832995318278e-08, 5.839682181598759e-17]
[4.3385769467531276e-11, 3.2004554789789377e-17]
[0.0, 0.0]
[1.2533864292914792e-09, 9.49461348500534e-16]
[9.914338216093691e-09, 6.592260941621698e-15]
[3.009424752811099e-12, 4.976547865237804e-22]
[2.3887998069920504e-10, 2.2860396278420984e-18]
[4.776864687001741e-13, 3.135225155099815e-20]
[2.9812221591260784e-10, 4.0302031957512556e-16]
[1.6052734702986573e-10, 5.3340

**<font color = blue size=4>第二部分:作业提交</font>**

一、实验课下课前提交完成代码，如果下课前未完成，请将已经完成的部分进行提交，未完成的部分于之后的实验报告中进行补充  
要求:  
1)文件格式为：学号-姓名.ipynb  
2)【不要】提交文件夹、压缩包、数据集等无关文件，只需提交单个ipynb文件即可，如果交错请到讲台前联系助教，删掉之前的错误版本后再进行提交

二、实验报告截止日期： 【11月17日 14:20】
要求：  
1)文件格式为：学号-姓名.pdf  
2)【不要】提交文件夹、压缩包、代码文件、数据集等任何与实验报告无关的文件，只需要提交单个pdf文件即可  
3)文件命名时不需要额外添加“实验几”等额外信息，按照格式提交  
4)每周的实验报告提交地址会变化，且有时间限制，提交时间为下周的实验课开始时，请注意及时提交。

实验九(贝叶斯分类)的实验报告上交地址:https://send2me.cn/ufVNphux/T9yuatQDc00TVw  

三、课堂课件获取地址:https://www.jianguoyun.com/p/DRLiP2oQp5WhChjB86YFIAA  
实验内容获取地址:https://www.jianguoyun.com/p/DbLessAQp5WhChjD86YFIAA