# Chapter 15: Dealing with Dimensions

## 15.3 The hierarchical version

We have solved a simplified version of the problem: if we know how many species there are, we can estimate the prevalence of each one.

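Now let's return to the original problem: estimating the total number of species. To solve it, I define a meta-Suite, which is a Suite whose hypotheses are themselves Suites. The top-level Suite holds hypotheses about the number of species; the bottom level holds hypotheses about the prevalence of each species.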

Here is the class definition:

<pre>
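import thinkbayes

class Species(thinkbayes.Suite):

    def __init__(self, ns):
        # ns: sequence of possible values for the number of species

        # build one Dirichlet object per candidate number of species
        hypos = [thinkbayes.Dirichlet(n) for n in ns]

        thinkbayes.Suite.__init__(self, hypos)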
</pre>

And here is the code that creates the top-level Suite:

<pre>
ns = range(3, 30)
suite = Species(ns)
</pre>
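To make the two levels concrete, here is a minimal sketch in plain Python that does not use thinkbayes. It assumes that the top level starts as a uniform distribution over n and that each Dirichlet(n) starts with all n parameters set to 1. `build_hierarchy` is a hypothetical helper written for this illustration, not part of the library.

```python
def build_hierarchy(ns):
    """Return (prior, params) for a two-level model.

    Hypothetical helper for illustration; not part of thinkbayes.
    """
    ns = list(ns)
    # Top level: every candidate number of species is equally likely.
    prior = {n: 1.0 / len(ns) for n in ns}
    # Bottom level: assumed starting point of one Dirichlet parameter
    # (set to 1) per hypothetical species.
    params = {n: [1] * n for n in ns}
    return prior, params


prior, params = build_hierarchy(range(3, 30))
print(len(prior))       # number of hypotheses about n
print(len(params[5]))   # the n = 5 hypothesis tracks 5 prevalences
```

Each key of `prior` plays the role of one hypothesis in the top-level Suite, and the matching entry of `params` plays the role of the Dirichlet object attached to it.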

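ns is the list of possible values for n. We have seen 3 species, so there must be at least that many. I chose an upper bound that seems reasonable, but we will check later that the probability of exceeding this bound is low. At least initially, we assume that every value in this range is equally likely.

To update a hierarchical model, you have to update all levels. Usually you update the bottom level first and work up, but in this case we can update the top level first: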

<pre>
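# class Species

    def Update(self, data):
        # update the top level first: the distribution over n
        thinkbayes.Suite.Update(self, data)
        # then update each Dirichlet sub-hypothesis with the same data
        for hypo in self.Values():
            hypo.Update(data)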
</pre>
Species.Update invokes Update in the parent class, then loops through the sub-hypotheses and updates them.

Now all we need is a likelihood function:

<pre>
# class Species

    def Likelihood(self, data, hypo):
        dirichlet = hypo
        like = 0
        # add up the likelihood over 1000 random draws of prevalences
        for i in range(1000):
            like += dirichlet.Likelihood(data)
        return like
</pre>

data is a sequence of observed counts; hypo is a Dirichlet object. Species.Likelihood calls Dirichlet.Likelihood 1000 times and returns the total.

Why call it 1000 times? Because Dirichlet.Likelihood doesn't actually compute the likelihood of the data under the whole Dirichlet distribution. Instead, it draws one sample from the hypothetical distribution and computes the likelihood of the data under the sampled set of prevalences.

Here's what it looks like:
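
<pre>
# class Dirichlet

    def Likelihood(self, data):
        m = len(data)
        # more species observed than this hypothesis allows
        if self.n < m:
            return 0
        x = data
        # one random set of prevalences (assumed to be a NumPy array)
        p = self.Random()
        q = p[:m]**x
        return q.prod()
</pre>
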
The length of data is the number of species observed. If we see more species than we thought existed, the likelihood is 0.
Otherwise we select a random set of prevalences, $p$, and compute the multinomial PMF, which is

$$c_x \, p_1^{x_1} \cdots p_n^{x_n}$$

$p_i$ is the prevalence of the $i$th species, and $x_i$ is the observed number. The first term, $c_x$, is the multinomial coefficient; I leave it out of the computation because it is a multiplicative factor that depends only on the data, not the hypothesis, so it gets normalized away (see http://en.wikipedia.org/wiki/Multinomial_distribution).
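As a quick check on the claim that a data-only factor gets normalized away, the sketch below computes normalized posteriors with and without the multinomial coefficient. The two sets of prevalences and the equal priors are made up for illustration, and the helper names are hypothetical.

```python
from math import factorial


def multinomial_coefficient(counts):
    """Return (sum of counts)! / product of (count!)."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)
    return coef


def product_term(p, x):
    """Return the product of p_i ** x_i, without the coefficient."""
    like = 1.0
    for p_i, x_i in zip(p, x):
        like *= p_i ** x_i
    return like


def normalize(weights):
    """Scale the weights so they add up to 1."""
    total = sum(weights)
    return [w / total for w in weights]


counts = [3, 2, 1]
# Two hypothetical sets of prevalences with equal priors.
hypotheses = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]

c = multinomial_coefficient(counts)
without_c = [product_term(p, counts) for p in hypotheses]
with_c = [c * like for like in without_c]

print(normalize(without_c))
print(normalize(with_c))
```

Both lines print the same posterior (up to floating-point rounding), because the coefficient multiplies every hypothesis by the same amount.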
$m$ is the number of observed species. We only need the first $m$ elements of $p$; for the others, $x_i$ is 0, so $p_i^{x_i}$ is 1, and we can leave them out of the product.
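The sampling approach can be sketched without thinkbayes or NumPy. The code below is an illustration under stated assumptions, not the library's implementation: every Dirichlet parameter is taken to be 1, samples are drawn by normalizing gamma variates, and the function names are hypothetical. It averages rather than sums the sampled likelihoods, which differs only by a constant factor, so that the result can be compared with the closed-form expectation for a Dirichlet whose parameters are all 1.

```python
import random
from math import factorial


def sample_prevalences(n, rng):
    """Draw one sample from a Dirichlet with all n parameters equal to 1."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def sampled_likelihood(n, data, rng):
    """Likelihood of the counts under one random set of prevalences."""
    m = len(data)
    if n < m:
        return 0.0
    p = sample_prevalences(n, rng)
    like = 1.0
    # Only the first m prevalences matter; the rest have x_i = 0.
    for p_i, x_i in zip(p[:m], data):
        like *= p_i ** x_i
    return like


def mean_likelihood(n, data, rng, iters=1000):
    """Average the sampled likelihood over iters random draws."""
    total = sum(sampled_likelihood(n, data, rng) for _ in range(iters))
    return total / iters


def exact_likelihood(n, data):
    """Expected product of p_i ** x_i when all n parameters are 1."""
    if n < len(data):
        return 0.0
    numerator = factorial(n - 1)
    for x in data:
        numerator *= factorial(x)
    return numerator / factorial(n - 1 + sum(data))


rng = random.Random(0)
data = [3, 2, 1]
for n in [3, 4, 5]:
    print(n, mean_likelihood(n, data, rng), exact_likelihood(n, data))
```

With 1000 draws the sampled estimate is noisy; increasing `iters` brings it closer to the closed-form value at the cost of more computation.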