# Chapter 15: Dealing with Dimensions

## 15.1 Lions and tigers and bears

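Let's start with a simple problem. Suppose there are only three species: lions, tigers and bears. Suppose we visit a wildlife preserve and see 3 lions, 2 tigers and 1 bear. If we have an equal chance of observing any animal in the preserve, the number of each species we see is governed by the multinomial distribution. If the prevalences of lions, tigers and bears are p_lion, p_tiger and p_bear, the likelihood of seeing 3 lions, 2 tigers and 1 bear is proportional to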

$$ p_{\text{lion}}^3 \cdot p_{\text{tiger}}^2 \cdot p_{\text{bear}}^1 $$
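As a minimal sketch of the likelihood above, the following computes the full multinomial probability, including the multinomial coefficient that the proportionality hides. The function name `multinomial_likelihood` and the two hypotheses compared are illustrative assumptions, not part of thinkbayes.

```python
from math import factorial

def multinomial_likelihood(counts, prevalences):
    """Probability of observing `counts` given per-species `prevalences`.

    Hypothetical helper for illustration; not part of thinkbayes.
    """
    # Multinomial coefficient: n! / (k1! * k2! * ...)
    coeff = factorial(sum(counts))
    for k in counts:
        coeff //= factorial(k)
    # Product term: the part the likelihood is proportional to
    prod = 1.0
    for k, p in zip(counts, prevalences):
        prod *= p ** k
    return coeff * prod

counts = [3, 2, 1]                      # lions, tigers, bears
observed_rates = [1 / 2, 1 / 3, 1 / 6]  # assumed hypothesis: observed rates
equal_rates = [1 / 3, 1 / 3, 1 / 3]     # assumed hypothesis: equal prevalence

like_observed = multinomial_likelihood(counts, observed_rates)
like_equal = multinomial_likelihood(counts, equal_rates)
print(like_observed, like_equal)
```

Because the coefficient is the same for every hypothesis, it cancels when hypotheses are compared, which is why the text only needs the product term.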

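An approach that is tempting, but not correct, is to use beta distributions, as in Section 4.5 for the coin problem, to describe the prevalence of each species separately. For example, we saw 3 lions and 3 non-lions; if we think of that as 3 heads and 3 tails, then the posterior distribution of p_lion is: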

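import thinkbayes as tb

def beta_mle(positive, negative, title):
    # Start from the default uniform prior, Beta(1, 1)
    beta = tb.Beta()
    # Update with (successes, failures) for this species
    beta.Update((positive, negative))
    pmf = beta.MakePmf()
    # Report the most likely prevalence as a percentage
    print(title, "MLE:", pmf.MaximumLikelihood()*100, "%")

beta_mle(3, 3, "lions")
beta_mle(2, 4, "tigers")
beta_mle(1, 5, "bears")

lions MLE: 50.0 %
tigers MLE: 33.0 %
bears MLE: 17.0 %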


The maximum likelihood estimate for p_lion is the observed rate, 50%. Similarly, the MLEs for p_tiger and p_bear are 33% and 17%.
But there are two problems:

1. We have implicitly used a prior for each species that is uniform from 0 to 1, but since we know that there are three species, that prior is not correct. The right prior should have a mean of 1/3, and there should be zero likelihood that any species has a prevalence of 100%.
2. The distributions for each species are not independent, because the prevalences have to add up to 1. To capture this dependence, we need a joint distribution for the three prevalences.

We can use a Dirichlet distribution to solve both of these problems (see http://en.wikipedia.org/wiki/Dirichlet_distribution). In the same way we used the beta distribution to describe the distribution of bias for a coin, we can use a Dirichlet distribution to describe the joint distribution of p_lion, p_tiger and p_bear.
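The second problem listed above can be seen numerically. With a uniform prior, the posterior for a species seen k times out of n is Beta(1 + k, 1 + n - k), whose mean is (k + 1) / (n + 2). The sketch below, which uses a hypothetical helper `beta_posterior_mean` rather than anything from thinkbayes, shows that the three separately computed means do not add up to 1.

```python
from fractions import Fraction

def beta_posterior_mean(successes, failures):
    """Mean of Beta(1 + successes, 1 + failures), i.e. a uniform prior updated.

    Hypothetical helper for illustration; not part of thinkbayes.
    """
    alpha = 1 + successes
    beta = 1 + failures
    return Fraction(alpha, alpha + beta)

# (seen, not seen) for each species, treated as three separate coin problems
observations = {"lions": (3, 3), "tigers": (2, 4), "bears": (1, 5)}

means = {name: beta_posterior_mean(s, f) for name, (s, f) in observations.items()}
total = sum(means.values())

for name, mean in means.items():
    print(name, mean)
print("sum of means:", total)
```

The total comes out to 9/8 rather than 1, so the three independent estimates cannot all be prevalences of the same population.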
The Dirichlet distribution is the multi-dimensional generalization of the beta distribution. Instead of two possible outcomes, like heads and tails, the Dirichlet distribution handles any number of outcomes: in this example, three species.

If there are n outcomes, the Dirichlet distribution is described by n parameters, written α1 through αn.
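For reference, the text does not spell out the density itself. The standard form, written here with p_1 through p_n for the prevalences (notation assumed for this note), is:

$$ f(p_1, \dots, p_n; \alpha_1, \dots, \alpha_n) = \frac{\Gamma\left(\sum_{i=1}^{n} \alpha_i\right)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \prod_{i=1}^{n} p_i^{\alpha_i - 1}, \qquad p_i \ge 0, \quad \sum_{i=1}^{n} p_i = 1 $$

With n = 2 this reduces to the beta density with parameters α1 and α2, and with every αi equal to 1 it is uniform over all prevalence vectors that sum to 1.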


Here’s the definition, from thinkbayes.py, of a class that represents a Dirichlet distribution:
class Dirichlet(object):
    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
n is the number of dimensions; initially the parameters are all 1. I use a numpy array to store the parameters so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution for each prevalence is a beta distribution, which we can compute like this:
    def MarginalBeta(self, i):
        alpha0 = self.params.sum()
        alpha = self.params[i]
        return Beta(alpha, alpha0-alpha)
i is the index of the marginal distribution we want. alpha0 is the sum of the parameters; alpha is the parameter for the given species.
In the example, the prior marginal distribution for each species is Beta(1, 2). We can compute the prior means like this:
    # tb is the alias from "import thinkbayes as tb" above
    dirichlet = tb.Dirichlet(3)
    for i in range(3):
        beta = dirichlet.MarginalBeta(i)
        print(beta.Mean())
As expected, the prior mean prevalence for each species is 1/3.
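The same calculation can be sketched without thinkbayes, assuming only the rule stated above: the marginal for species i is Beta(αi, α0 - αi), and the mean of Beta(a, b) is a / (a + b). The helper names `marginal_beta_params` and `beta_mean` are hypothetical.

```python
from fractions import Fraction

def marginal_beta_params(params, i):
    """Return (alpha, beta) of the marginal beta distribution for index i."""
    alpha0 = sum(params)   # sum of all Dirichlet parameters
    alpha = params[i]      # parameter for the species of interest
    return alpha, alpha0 - alpha

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

prior_params = [1, 1, 1]   # lions, tigers, bears
prior_marginals = [marginal_beta_params(prior_params, i) for i in range(3)]
prior_means = [beta_mean(a, b) for a, b in prior_marginals]

print(prior_marginals)
print(prior_means)
```

Using `Fraction` keeps the result exact, so each prior mean appears as 1/3 rather than a rounded decimal.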
To update the Dirichlet distribution, we add the observations to the parameters like this:
    def Update(self, data):
        m = len(data)
        self.params[:m] += data
Here data is a sequence of counts in the same order as params, so in this example, it should be the number of lions, tigers and bears.

Figure 15.1: Distribution of prevalences for three species (x-axis: Prevalence; curves: lions, tigers, bears).

data can be shorter than params; in that case there are some species that have not been observed.
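The update rule, including the case where data is shorter than params, can be sketched with plain lists. The function `update` below is a hypothetical stand-in for the `Update` method, and unlike the method it returns a new list instead of modifying the parameters in place.

```python
def update(params, data):
    """Add observed counts to the leading Dirichlet parameters.

    Hypothetical list-based stand-in for Dirichlet.Update; returns a new list.
    """
    m = len(data)
    if m > len(params):
        raise ValueError("data has more entries than params")
    # Counts are added position by position; species beyond the end of
    # data have not been observed, so their parameters stay unchanged.
    return [a + k for a, k in zip(params, data)] + params[m:]

prior = [1, 1, 1]                  # lions, tigers, bears
full = update(prior, [3, 2, 1])    # all three species observed
partial = update(prior, [3, 2])    # no count supplied for bears

print(full)
print(partial)
```

In the partial case the bear parameter keeps its prior value of 1, which matches the behaviour of `self.params[:m] += data`.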
Here’s code that updates dirichlet with the observed data and computes the posterior marginal distributions.
    data = [3, 2, 1]  # counts of lions, tigers and bears
    dirichlet.Update(data)
    for i in range(3):
        beta = dirichlet.MarginalBeta(i)
        pmf = beta.MakePmf()
        print(i, pmf.Mean())
Figure 15.1 shows the results. The posterior mean prevalences are 44%, 33%, and 22%.
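These posterior means can be cross-checked by sampling from the joint distribution. The sketch below assumes the standard construction of a Dirichlet sample, which normalizes independent gamma draws; it uses only the Python standard library, and `sample_dirichlet` is a hypothetical helper, not a thinkbayes function.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one prevalence vector from Dirichlet(params).

    Normalizes independent Gamma(alpha_i, 1) draws so they sum to 1.
    """
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

posterior_params = [4, 3, 2]   # prior [1, 1, 1] plus counts [3, 2, 1]
alpha0 = sum(posterior_params)
exact_means = [a / alpha0 for a in posterior_params]

rng = random.Random(0)         # fixed seed so the run is repeatable
num_samples = 20000
sums = [0.0] * len(posterior_params)
for _ in range(num_samples):
    sample = sample_dirichlet(posterior_params, rng)
    for j, p in enumerate(sample):
        sums[j] += p
sampled_means = [s / num_samples for s in sums]

print(exact_means)
print(sampled_means)
```

Every sampled vector sums to 1, so the dependence between the three prevalences that the separate beta distributions missed is built into the joint distribution.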
