Original data

In [12]:
import pandas as pd
data = pd.DataFrame(data={'fruit': ["banana", "apple", "banana", "apple", "banana","apple", "banana",
                                    "apple", "apple", "apple", "banana", "banana", "apple", "banana",],
                          'tasty': ["yes", "no", "yes", "yes", "yes", "yes", "yes",
                                    "yes", "yes", "yes", "yes", "no", "no", "no"],
                          'size': ["large", "large", "large", "small", "large", "large", "large",
                                    "small", "large", "large", "large", "large", "small", "small"]})
print(data)

     fruit tasty   size
0   banana   yes  large
1    apple    no  large
2   banana   yes  large
3    apple   yes  small
4   banana   yes  large
5    apple   yes  large
6   banana   yes  large
7    apple   yes  small
8    apple   yes  large
9    apple   yes  large
10  banana   yes  large
11  banana    no  large
12   apple    no  small
13  banana    no  small


Building the Bayesian network

In [13]:
from pgmpy.models import BayesianModel  # note: later pgmpy releases rename this class to BayesianNetwork

model = BayesianModel([('fruit', 'tasty'), ('size', 'tasty')])  # fruit -> tasty <- size
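
The edge list fully determines which nodes condition on which. As a minimal sketch in plain Python (the `parents` dictionary is a helper introduced here for illustration, not part of the pgmpy API), the parent set of each node can be read off the edges like this:

```python
# Edge list copied from the model definition: (parent, child) pairs.
edges = [("fruit", "tasty"), ("size", "tasty")]

# Collect the parents of every node; nodes without incoming edges keep an empty list.
parents = {}
for parent, child in edges:
    parents.setdefault(parent, [])
    parents.setdefault(child, []).append(parent)

for node, node_parents in parents.items():
    print(node, "<-", node_parents)
```

`fruit` and `size` have no parents, so their distributions are unconditional, while `tasty` needs one distribution per (`fruit`, `size`) combination.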



Counting the observations for each state

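Below, ``ParameterEstimator`` is used to count how often each state occurs in the original data.

Counting the fruit type directly (``pe.state_counts('fruit')``) gives 7 apples and 7 bananas.

Counting taste (``pe.state_counts('tasty')``) is conditional on the parent nodes ``fruit`` and ``size``: among the large apples, for example, 3 are tasty and 1 is not, and so on for the other combinations.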

In [14]:
from pgmpy.estimators import ParameterEstimator
pe = ParameterEstimator(model, data)
print("\n", pe.state_counts('fruit'))  # unconditional
print("\n", pe.state_counts('tasty'))  # conditional on fruit and size


         fruit
apple       7
banana      7

 fruit apple       banana      
size  large small  large small
tasty                         
no      1.0   1.0    1.0   1.0
yes     3.0   2.0    5.0   0.0
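
The counts above can be reproduced without pgmpy. The following is a minimal sketch using only the standard library; the three lists repeat the columns of the DataFrame, and the names `fruit_counts` and `tasty_counts` are chosen here for illustration.

```python
from collections import Counter

# The same 14 rows as in the DataFrame, one list per column.
fruit = ["banana", "apple", "banana", "apple", "banana", "apple", "banana",
         "apple", "apple", "apple", "banana", "banana", "apple", "banana"]
tasty = ["yes", "no", "yes", "yes", "yes", "yes", "yes",
         "yes", "yes", "yes", "yes", "no", "no", "no"]
size = ["large", "large", "large", "small", "large", "large", "large",
        "small", "large", "large", "large", "large", "small", "small"]

# Unconditional counts for a node without parents.
fruit_counts = Counter(fruit)

# Counts of tasty for every (fruit, size) parent configuration.
tasty_counts = Counter(zip(fruit, size, tasty))

print(dict(fruit_counts))
for key in sorted(tasty_counts):
    print(key, tasty_counts[key])
```

A combination that never occurs in the data, such as a tasty small banana, is simply absent from the `Counter` and reads back as 0.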


## Probability distribution estimation (Bayesian network parameter estimation) · Maximum likelihood estimation (MLE)

Next we perform maximum likelihood estimation (MLE).

From the data, we can estimate the probability distribution of each node.

In [15]:
from pgmpy.estimators import MaximumLikelihoodEstimator
mle = MaximumLikelihoodEstimator(model, data)
print(mle.estimate_cpd('fruit'))  # distribution of fruit; it has no parents, so it is unconditional
print(mle.estimate_cpd('tasty'))  # distribution of tasty; it has two parents (fruit and size), so it is conditional on both

+---------------+-----+
| fruit(apple)  | 0.5 |
+---------------+-----+
| fruit(banana) | 0.5 |
+---------------+-----+
+------------+--------------+-----+---------------+
| fruit      | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size       | size(large)  | ... | size(small)   |
+------------+--------------+-----+---------------+
| tasty(no)  | 0.25         | ... | 1.0           |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75         | ... | 0.0           |
+------------+--------------+-----+---------------+
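
For discrete data, the maximum likelihood estimate of a CPD column is the relative frequency of each state within that parent configuration. A minimal sketch, starting from the state counts printed earlier (the `counts` dictionary and the `mle` helper are names introduced here for illustration):

```python
# State counts of tasty per (fruit, size), copied from the state_counts output.
counts = {
    ("apple", "large"): {"no": 1, "yes": 3},
    ("apple", "small"): {"no": 1, "yes": 2},
    ("banana", "large"): {"no": 1, "yes": 5},
    ("banana", "small"): {"no": 1, "yes": 0},
}

def mle(column):
    """Normalize one column of counts into relative frequencies."""
    total = sum(column.values())
    return {state: n / total for state, n in column.items()}

cpd = {parent_config: mle(column) for parent_config, column in counts.items()}

for parent_config, column in cpd.items():
    print(parent_config, column)
```

The two columns visible in the table are reproduced: 0.25 / 0.75 for large apples and 1.0 / 0.0 for small bananas. The latter rests on a single observation, which is why the estimate is so extreme.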


In [16]:
para = mle.get_parameters()  # get_parameters returns the estimated CPDs of all nodes
print(para[1])

+------------+--------------+-----+---------------+
| fruit      | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size       | size(large)  | ... | size(small)   |
+------------+--------------+-----+---------------+
| tasty(no)  | 0.25         | ... | 1.0           |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75         | ... | 0.0           |
+------------+--------------+-----+---------------+


## Probability distribution estimation (Bayesian network parameter estimation) · Bayesian estimation (BPE)

Bayesian estimation combines prior CPDs with the sample data: it starts from a probability estimate given by the prior CPD and then corrects it using the data.

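Two priors are commonly used in Bayesian estimation: K2 and BDeu.
The K2 prior simply adds 1 to the count of every state. A somewhat more sensible choice is the BDeu (Bayesian Dirichlet equivalent uniform) prior. For BDeu we specify an equivalent sample size N; the pseudo-counts then correspond to having observed N uniform samples, spread evenly over the states of the variable and its parent configurations.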
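
Both priors amount to adding pseudo-counts before normalizing. The sketch below works this out by hand from the state counts printed earlier. It assumes that BDeu divides the equivalent sample size evenly over all cells of the table (parent configurations times states); that assumption is not taken from the pgmpy source, but it reproduces the values in the printed pgmpy output of this notebook. The `smoothed_cpd` helper is a name introduced here for illustration.

```python
# State counts of tasty per (fruit, size), copied from the state_counts output.
counts = {
    ("apple", "large"): {"no": 1, "yes": 3},
    ("apple", "small"): {"no": 1, "yes": 2},
    ("banana", "large"): {"no": 1, "yes": 5},
    ("banana", "small"): {"no": 1, "yes": 0},
}

def smoothed_cpd(column, pseudo_count):
    """Add pseudo_count to every state count, then normalize the column."""
    total = sum(column.values()) + pseudo_count * len(column)
    return {state: (n + pseudo_count) / total for state, n in column.items()}

# K2: one pseudo-count per cell.
k2 = {cfg: smoothed_cpd(col, 1.0) for cfg, col in counts.items()}

# BDeu (assumed form): N spread uniformly over all cells of the table.
equivalent_sample_size = 10
n_cells = sum(len(col) for col in counts.values())  # 4 parent configurations x 2 states
bdeu_pseudo_count = equivalent_sample_size / n_cells  # 10 / 8 = 1.25
bdeu = {cfg: smoothed_cpd(col, bdeu_pseudo_count) for cfg, col in counts.items()}

print("K2:  ", k2[("apple", "large")], k2[("banana", "small")])
print("BDeu:", bdeu[("apple", "large")], bdeu[("banana", "small")])
```

For small bananas, where the data contain a single observation, the probability of "no" drops from the maximum likelihood value of 1.0 to roughly 0.67 under K2 and 0.64 under BDeu with N = 10.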

In [17]:
from pgmpy.estimators import BayesianEstimator
est = BayesianEstimator(model, data)

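# Use BDeu as the prior with an equivalent sample size of 10. Judging by the printed values,
# the 10 pseudo-samples are spread uniformly over all 4 parent configurations x 2 tasty states,
# i.e. 10 / 8 = 1.25 pseudo-counts per cell
# (small bananas: P(no) = (1 + 1.25) / (1 + 2.5) ≈ 0.643, as in the output).
print(est.estimate_cpd('tasty', prior_type='BDeu', equivalent_sample_size=10))
# Use K2 as the prior: every count is increased by 1.
print(est.estimate_cpd('tasty', prior_type='K2'))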


+------------+---------------------+-----+---------------------+
| fruit      | fruit(apple)        | ... | fruit(banana)       |
+------------+---------------------+-----+---------------------+
| size       | size(large)         | ... | size(small)         |
+------------+---------------------+-----+---------------------+
| tasty(no)  | 0.34615384615384615 | ... | 0.6428571428571429  |
+------------+---------------------+-----+---------------------+
| tasty(yes) | 0.6538461538461539  | ... | 0.35714285714285715 |
+------------+---------------------+-----+---------------------+
+------------+--------------------+-----+--------------------+
| fruit      | fruit(apple)       | ... | fruit(banana)      |
+------------+--------------------+-----+--------------------+
| size       | size(large)        | ... | size(small)        |
+------------+--------------------+-----+--------------------+
| tasty(no)  | 0.3333333333333333 | ... | 0.6666666666666666 |
+------------+--------------------+--