
# Model evidence
Suppose we wish to compare a set of $L$ models $\{\mathcal{M}_i\}$, where $i = 1,\ldots, L$. Here a model refers to a probability distribution over the observed data $\mathcal{D}= \{\mathbf{X},\mathbf{t}\}$.

$p(\mathcal{M}_i|\mathcal{D})$ denotes the probability that the data set $\mathcal{D}$ was generated by model $\mathcal{M}_i$, one of the $L$ candidate models. Given a training set $\mathcal{D}$, we wish to evaluate the posterior distribution
$$p(\mathcal{M}_i|\mathcal{D})\propto p(\mathcal{M}_i)p(\mathcal{D}|\mathcal{M}_i) \tag{3.66}$$
where
- $p(\mathcal{M}_i)$ denotes the prior, which allows us to express a preference for different models. For simplicity, we assume that all models have equal prior probability.
- $p(\mathcal{D}|\mathcal{M}_i)$, which is called *model evidence*, expresses the preference shown by the data for different models. The model evidence is sometimes also called the *marginal likelihood* because it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out.

$$p(\mathcal{D}|\mathcal{M}_i) = \int p(\mathcal{D}|\mathbf{w},\mathcal{M}_i)p(\mathbf{w}|\mathcal{M}_i)d\mathbf{w} \tag{3.68}$$
where $\mathbf{w}$ is a set of parameters that governs the model $\mathcal{M}_i$.

Now, for convenience, we make the following simplifications:
- Consider a model governed by a single parameter $w$.
- Omit the common dependence on the model $\mathcal{M}_i$ to keep the notation uncluttered.
- <font color='Red'>Assume the prior distribution $p(w)$ is flat with width $\Delta w_{prior}$, so that $p(w)=\frac{1}{\Delta w_{prior}}$.</font>
- <font color='Red'>Assume the posterior distribution $p(w|\mathcal{D})$ is sharply peaked around the most probable value $w_{MAP}$, with width $\Delta w_{posterior}=\frac{1}{p(w_{MAP}|\mathcal{D})}$.</font>

*<font color='Red'>The discussion in this article is based on these two assumptions.</font>*

Then (3.68) simplifies to
$$\begin{align*}
p(\mathcal{D}) &= \int p(\mathcal{D}|w)p(w)dw \\
&\simeq \frac{1}{\Delta w_{prior}}\int p(\mathcal{D}|w)dw\qquad \text{since } p(w)=\frac{1}{\Delta w_{prior}} \text{ is constant}\\
&= \frac{1}{\Delta w_{prior}}\int \frac{p(\mathcal{D}|w)}{p(w|\mathcal{D})}p(w|\mathcal{D})dw\\
&= \frac{1}{\Delta w_{prior}}\frac{p(\mathcal{D}|w_{MAP})}{p(w_{MAP}|\mathcal{D})}\int p(w|\mathcal{D})dw\qquad \text{since } \frac{p(\mathcal{D}|w)}{p(w|\mathcal{D})}=\frac{p(\mathcal{D})}{p(w)} \text{ is constant for a flat prior}\\
&= p(\mathcal{D}|w_{MAP})\frac{\Delta w_{posterior}}{\Delta w_{prior}}\qquad \text{since } \int p(w|\mathcal{D})dw=1 \text{ and } \Delta w_{posterior}=\frac{1}{p(w_{MAP}|\mathcal{D})}\tag{3.70}
\end{align*}$$

and so taking logs we obtain
<font color='Red'>$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|w_{MAP})+\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right) \tag{3.71}$$
where
- The first term represents the fit to the data given by the most probable parameter values. It is obtained by assuming a form for the likelihood $p(\mathcal{D}|w)$ (for example Gaussian) and evaluating it on the data set $\mathcal{D}$ at $w_{MAP}$.
- The second term penalizes the model according to its complexity. Because $\Delta w_{posterior}<\Delta w_{prior}$ this term is negative, and it increases in magnitude as the ratio $\Delta w_{posterior}/\Delta w_{prior}$ gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, then the penalty term is large.</font>
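
As a rough numerical check of (3.70) and (3.71), the sketch below uses a toy one-parameter model that is an assumption of this example, not part of the discussion above: the data are drawn as $t_n \sim \mathcal{N}(w, \sigma^2)$ with known $\sigma$, and the prior on $w$ is flat on $[-10, 10]$. The names `log_likelihood`, `prior_width` and `post_width` are hypothetical. For this model the posterior is Gaussian, so $\Delta w_{posterior} = 1/p(w_{MAP}|\mathcal{D})$ is available in closed form and can be compared with a brute-force integration of the evidence.

```python
import numpy as np

def log_likelihood(w, t, sigma):
    """ln p(D|w) for the assumed toy model t_n ~ N(w, sigma^2), for an array of w."""
    resid = t[None, :] - np.atleast_1d(w)[:, None]
    return (-0.5 * np.sum(resid**2, axis=1) / sigma**2
            - 0.5 * t.size * np.log(2 * np.pi * sigma**2))

rng = np.random.default_rng(0)
sigma = 1.0                                   # assumed known noise level
t = rng.normal(0.5, sigma, size=20)           # synthetic data set D

# Flat prior p(w) = 1 / prior_width on [-10, 10] (an assumption of this sketch)
prior_width = 20.0
w_grid = np.linspace(-10.0, 10.0, 20001)
dw = w_grid[1] - w_grid[0]

# Left-hand side of (3.71): ln p(D) by brute-force integration of p(D|w) p(w)
lik = np.exp(log_likelihood(w_grid, t, sigma))
log_evidence = np.log(np.sum(lik) * dw / prior_width)

# Right-hand side of (3.71): the posterior is Gaussian with variance sigma^2 / N,
# so w_MAP is the sample mean and 1 / p(w_MAP|D) = sqrt(2 pi sigma^2 / N)
w_map = t.mean()
post_width = np.sqrt(2 * np.pi * sigma**2 / t.size)
log_fit = log_likelihood(w_map, t, sigma)[0]
log_occam = np.log(post_width / prior_width)

print(f"ln p(D) by integration : {log_evidence:.6f}")
print(f"fit + complexity terms : {log_fit + log_occam:.6f}")
print(f"complexity penalty     : {log_occam:.6f}")
```

With a flat prior that comfortably contains the posterior, the two values should agree closely, and the complexity term should be negative because the posterior is narrower than the prior.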

For a model having a set of $M$ parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio $\Delta w_{posterior}/\Delta w_{prior}$, we obtain
$$\ln p(\mathcal{D})\simeq \ln p(\mathcal{D}|\mathbf{w}_{MAP})+M\ln\left(\frac{\Delta w_{posterior}}{\Delta w_{prior}}\right) \tag{3.72}$$
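As we increase the complexity of the model,
- The first term will typically increase, because a more complex model is better able to fit the data.
- The second term will decrease (become more negative) due to the dependence on $M$.

The optimal model complexity, as measured by the evidence, is a trade-off between these two competing terms.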
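
The trade-off between fit and complexity can be illustrated with a small experiment. The sketch below does not use the approximation (3.72). Instead it computes the exact log evidence for polynomial models under assumptions made only for this example: a Gaussian prior $\mathbf{w}\sim\mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$, Gaussian noise with known precision $\beta$, and synthetic data from a straight line. The function name `log_evidence` and the values of `alpha` and `beta` are hypothetical choices.

```python
import numpy as np

def log_evidence(x, t, order, alpha, beta):
    """Exact ln p(t | order) for a polynomial model with w ~ N(0, I / alpha)
    and Gaussian noise of precision beta (both are assumptions of this sketch)."""
    Phi = np.vander(x, order + 1, increasing=True)   # columns 1, x, ..., x^order
    # Marginalizing over w gives t ~ N(0, C) with C = I / beta + Phi Phi^T / alpha
    C = np.eye(x.size) / beta + Phi @ Phi.T / alpha
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
t = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)     # synthetic straight-line data

alpha, beta = 1.0, 100.0                             # assumed prior and noise precisions
log_ev = {M: log_evidence(x, t, M, alpha, beta) for M in range(6)}
for M, value in log_ev.items():
    print(f"order {M}: ln p(D) = {value:.2f}")
```

The constant model (order 0) cannot fit the data and gets a very low evidence. For higher orders one would typically expect the evidence to stop improving once the extra parameters no longer buy a better fit, although the exact numbers depend on the random seed and on the assumed `alpha` and `beta`.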

------------

# Insight into model selection
From a sampling perspective, the marginal likelihood can be viewed as the probability of generating the data set $\mathcal{D}$ from a model whose parameters are sampled at random from the prior. To generate a particular data set from a specific model, we first choose the values of the parameters from the prior distribution $p(\mathbf{w})$, and then for these parameter values we sample the data from $p(\mathcal{D}|\mathbf{w})$. A simple model has little variability and so will generate data sets that are fairly similar to each other. Its distribution $p(\mathcal{D})$ is therefore confined to a relatively small region. By contrast, a complex model can generate a great variety of different data sets, and so its distribution $p(\mathcal{D})$ is spread over a large region of the space of data sets. <font color='Red'>Because each distribution $p(\mathcal{D}|\mathcal{M}_i)$ is normalized, a particular data set $\mathcal{D}_0$ receives the highest evidence from the simplest model that is still able to generate it.</font>
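
The sampling view described above can be sketched in a few lines. The example assumes polynomial models with a standard normal prior on the weights and a small fixed noise level, none of which is specified in the text, and uses the average per-point variance across generated data sets as a crude, hypothetical measure of how widely $p(\mathcal{D})$ is spread.

```python
import numpy as np

def sample_datasets(order, n_sets, x, noise_std, rng):
    """Draw w from an assumed N(0, I) prior, then draw t from p(D|w)."""
    Phi = np.vander(x, order + 1, increasing=True)        # columns 1, x, ..., x^order
    w = rng.normal(0.0, 1.0, size=(n_sets, order + 1))     # one parameter vector per data set
    noise = rng.normal(0.0, noise_std, size=(n_sets, x.size))
    return w @ Phi.T + noise                               # shape (n_sets, len(x))

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 10)

spread = {}
for order in (0, 1, 3):
    data = sample_datasets(order, 2000, x, 0.1, rng)
    # Average variance across data sets at each input point
    spread[order] = data.var(axis=0).mean()
    print(f"order {order}: average variance of generated data = {spread[order]:.3f}")
```

Under these assumptions the more complex models generate more varied data sets, which is the sense in which their probability mass is spread over a larger region of data space.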

These results hold only if we make appropriate assumptions about the form of the model and choose an appropriate prior.

--------
# Conclusion
- The Bayesian framework avoids the problem of over-fitting because the model evidence accounts for both the fit to the data and a penalty on model complexity.
- The Bayesian framework allows models to be compared on the basis of the training data alone.
- However, a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading.
  - If we consider an infinitely wide flat prior (for example, a Gaussian prior in the limit of infinite variance), then $\Delta w_{prior} \to \infty$ and the evidence goes to zero, as can be seen from (3.70). In this case, it is better to evaluate the Bayes factor between two models, defined as the ratio $p(\mathcal{D}|\mathcal{M}_i)/p(\mathcal{D}|\mathcal{M}_j)$.
- In a practical application, it is wise to set aside an independent test set of data on which to evaluate the overall performance of the final system.
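
To illustrate the point about very wide flat priors, the sketch below applies (3.70) to two hypothetical one-parameter models that share the same flat prior width: model A assumes $t_n \sim \mathcal{N}(w, \sigma^2)$ and model B assumes $t_n \sim \mathcal{N}(w x_n, \sigma^2)$. Both models, the data, and the helper names are assumptions made for this example. Because both posteriors are Gaussian, their widths $1/p(w_{MAP}|\mathcal{D})$ are known in closed form.

```python
import numpy as np

def gauss_log_lik(t, mean, sigma):
    """ln p(D|w_MAP) for independent Gaussian noise with known sigma."""
    return (-0.5 * np.sum((t - mean) ** 2) / sigma**2
            - 0.5 * t.size * np.log(2 * np.pi * sigma**2))

def log_evidence_370(log_fit, post_width, prior_width):
    """ln p(D) according to (3.70): ln p(D|w_MAP) + ln(post_width / prior_width)."""
    return log_fit + np.log(post_width) - np.log(prior_width)

rng = np.random.default_rng(3)
sigma = 0.5                                           # assumed known noise level
x = np.linspace(0.0, 1.0, 15)
t = 1.5 * x + rng.normal(0.0, sigma, size=x.size)     # synthetic data

# Model A: constant mean w. Model B: mean w * x_n. Both have one parameter.
w_a = t.mean()
w_b = (x @ t) / (x @ x)
width_a = np.sqrt(2 * np.pi * sigma**2 / t.size)      # 1 / p(w_MAP|D) for model A
width_b = np.sqrt(2 * np.pi * sigma**2 / (x @ x))     # 1 / p(w_MAP|D) for model B

prior_widths = np.array([1e1, 1e3, 1e6])              # increasingly wide flat priors
log_ev_a = log_evidence_370(gauss_log_lik(t, w_a, sigma), width_a, prior_widths)
log_ev_b = log_evidence_370(gauss_log_lik(t, w_b * x, sigma), width_b, prior_widths)
log_bayes_factor = log_ev_b - log_ev_a

for pw, ea, eb, bf in zip(prior_widths, log_ev_a, log_ev_b, log_bayes_factor):
    print(f"prior width {pw:9.0f}: ln p(D|A) = {ea:8.2f}, "
          f"ln p(D|B) = {eb:8.2f}, ln Bayes factor = {bf:6.2f}")
```

Each log evidence keeps falling as the prior widens, while the log Bayes factor stays fixed. This cancellation relies on both models having the same number of parameters and the same prior width, which is an assumption of the sketch.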