# The Laplace Approximation
As we shall see in Section 4.5, the Bayesian treatment of logistic regression is more complex than that of linear regression: we cannot integrate exactly over the parameter vector $\mathbf{w}$ <font color='red'>since the posterior distribution is no longer Gaussian. It is therefore necessary to introduce some form of approximation.</font>

The Laplace approximation is a simple <font color='red'>Gaussian approximation</font> to a probability density.

## 1-Dimensional distribution

Consider first the case of a single continuous variable $z$, and suppose the distribution $p(z)$ is defined by

$$p(z) = \frac{1}{Z}f(z) \tag{4.125}$$

where
- $p(z)$ is the distribution that we want to approximate.
- $Z=\int f(z)\,dz$ is the normalization coefficient, which is assumed to be unknown.
- $f(z)$ is a known function; in the Bayesian setting it plays the role of the unnormalized posterior distribution.

We want to find a Gaussian approximation $q(z)$ to $p(z)$, centered on a mode of $p(z)$:

<font color='red'>$$q(z)=\mathcal{N}(z|z_0, A^{-1})\approx p(z)$$</font>

where
- $z_0$ is a mode (local maximum) of the distribution, so it satisfies $p'(z_0) = 0$, or equivalently
<font color='red'>$$\left. \frac{df(z)}{dz}\right|_{z=z_0} = 0 \tag{4.126}$$</font>
- A Gaussian distribution has the property that its logarithm is a quadratic function of the variables. We therefore consider a **Taylor expansion** of $\ln f(z)$ centered on the mode $z_0$, so that
$$\ln f(z)\simeq \ln f(z_0)-\frac{1}{2}A(z-z_0)^2 \tag{4.127}$$
where the first-order term vanishes because $\left.\frac{d\ln f(z)}{dz}\right|_{z=z_0} =\frac{f'(z_0)}{f(z_0)} = 0$, and $A$ is the negative second derivative of $\ln f(z)$ at $z_0$:
<font color='red'>$$A = -\left. \frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0} \tag{4.128}$$</font>

Taking the exponential of (4.127), we obtain
$$f(z) \simeq f(z_0)\exp\left\{-\frac{A}{2}(z-z_0)^2\right\} \tag{4.129}$$
Normalizing this quadratic form using the standard result for a Gaussian, we obtain the distribution
$$q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left\{-\frac{A}{2}(z-z_0)^2\right\} \tag{4.130}$$

which is the Laplace approximation of the distribution $p(z)$.

*Note that the Gaussian approximation will only be well defined if its precision $A>0$, in other words the stationary point $z_0$ must be a local maximum, so that the second derivative of $f(z)$ at the point $z_0$ is negative.*
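As a concrete illustration, the following minimal sketch applies (4.126)-(4.130) to the unnormalized density $f(z)=\exp(-z^2/2)\,\sigma(20z+4)$, where $\sigma$ is the logistic sigmoid. This choice of $f$, the function names, and the use of Newton's method to locate the mode are assumptions made for the example; they are not part of the derivation above.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_f(z):
    # ln f(z) for the illustrative choice f(z) = exp(-z^2/2) * sigmoid(20z + 4)
    return -0.5 * z * z + math.log(sigmoid(20.0 * z + 4.0))

def dlog_f(z):
    # First derivative of ln f(z); it vanishes at the mode, cf. (4.126)
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

def d2log_f(z):
    # Second derivative of ln f(z); its negative at the mode is A, cf. (4.128)
    s = sigmoid(20.0 * z + 4.0)
    return -1.0 - 400.0 * s * (1.0 - s)

def laplace_1d(z_init=0.0, tol=1e-12, max_iter=100):
    """Return (z0, A) so that q(z) = N(z | z0, 1/A)."""
    z = z_init
    for _ in range(max_iter):
        step = dlog_f(z) / d2log_f(z)  # Newton step on the stationarity condition
        z -= step
        if abs(step) < tol:
            break
    A = -d2log_f(z)
    if A <= 0:
        raise ValueError("stationary point is not a local maximum")
    return z, A

def q(z, z0, A):
    # Laplace approximation (4.130)
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

z0, A = laplace_1d()
print(f"mode z0 = {z0:.4f}, precision A = {A:.4f}")
```

The check `A <= 0` mirrors the note above: the approximation is only defined when the stationary point found is a local maximum.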

## M-Dimensional distribution

In the same manner, an $M$-dimensional distribution $p(\mathbf{z}) = f(\mathbf{z})/Z$ has a multivariate Gaussian approximation that takes the form

<font color='red'>$$q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}}\exp\left\{-\frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)\right\} = \mathcal{N}(\mathbf{z}|\mathbf{z}_0, \mathbf{A}^{-1}) \tag{4.134}$$</font>

where
- $\mathbf{z}_0$ satisfies
$$\left. \nabla  f(\mathbf{z})\right|_{\mathbf{z}=\mathbf{z}_0} = 0\Rightarrow \color{red}{\left. \nabla \ln f(\mathbf{z})\right|_{\mathbf{z}=\mathbf{z}_0}= \left.\frac{\nabla f(\mathbf{z})}{f(\mathbf{z})}\right|_{\mathbf{z}=\mathbf{z}_0} = 0}$$
- $\mathbf{A}$ is the $M\times M$ precision matrix, given by the negative Hessian of $\ln f(\mathbf{z})$ evaluated at the mode:
<font color='red'>$$\mathbf{A} = -\nabla\nabla\ln f(\mathbf{z})|_{\mathbf{z}=\mathbf{z}_0} \tag{4.132}$$</font>
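The sketch below shows one way the ingredients of (4.132) and (4.134) might be computed numerically. It assumes a hypothetical two-dimensional log density `log_f` whose mode is known to be the origin, estimates $\mathbf{A}$ by central finite differences, and uses a Cholesky factorization both to check that $\mathbf{A}$ is positive definite and to obtain $\ln|\mathbf{A}|$. The density, the step size `h`, and all function names are illustrative assumptions.

```python
import math

def log_f(z):
    # Hypothetical unnormalized log density on R^2 with its mode at the origin
    z1, z2 = z
    return -(z1 * z1 + z1 * z2 + z2 * z2) - 0.25 * (z1 ** 4 + z2 ** 4)

def neg_hessian(log_f, z0, h=1e-4):
    """Central finite-difference estimate of A = -grad grad ln f at z0, cf. (4.132)."""
    M = len(z0)
    A = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            def shifted(si, sj):
                z = list(z0)
                z[i] += si * h
                z[j] += sj * h
                return log_f(z)
            d2 = (shifted(1, 1) - shifted(1, -1)
                  - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
            A[i][j] = -d2
    return A

def cholesky(A):
    """Lower-triangular L with A = L L^T; raises if A is not positive definite."""
    M = len(A)
    L = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("A is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def log_q(z, z0, A):
    """ln N(z | z0, A^{-1}) as in (4.134)."""
    M = len(z0)
    L = cholesky(A)
    log_det_A = 2.0 * sum(math.log(L[i][i]) for i in range(M))
    d = [z[i] - z0[i] for i in range(M)]
    quad = sum(d[i] * A[i][j] * d[j] for i in range(M) for j in range(M))
    return 0.5 * log_det_A - 0.5 * M * math.log(2.0 * math.pi) - 0.5 * quad

z0 = [0.0, 0.0]
A = neg_hessian(log_f, z0)
print("A =", A)
print("ln q(z0) =", log_q(z0, z0, A))
```

For this particular `log_f` the quartic terms do not contribute at the origin, so the estimate should be close to the matrix with rows $(2, 1)$ and $(1, 2)$.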

--------------

# Model comparison and BIC

Following the same steps as in (4.127)-(4.129), we obtain the approximation in $M$ dimensions:

$$f(\mathbf{z})\simeq f(\mathbf{z}_0)\exp\left\{-\frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)\right\} \tag{4.133}$$

Integrating both sides gives an approximation to the normalization constant $Z$:

$$\begin{align*}
Z &= \int f(\mathbf{z})d\mathbf{z}\\
&\simeq f(\mathbf{z}_0)\int \exp\left\{-\frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T\mathbf{A}(\mathbf{z}-\mathbf{z}_0)\right\}d\mathbf{z}\\
&=f(\mathbf{z}_0)\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} \tag{4.135}
\end{align*}$$

where we have used the standard result for the normalization of a Gaussian, as in (4.134), to evaluate the integral.
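A classic one-dimensional check of (4.135) is the Gamma integral. Taking $f(z) = z^k e^{-z}$ on $z>0$ (an illustrative choice, not from the text above), the exact normalizer is $Z=\Gamma(k+1)=k!$. The mode is $z_0=k$ and $A = 1/k$, so (4.135) with $M=1$ gives $Z\simeq k^k e^{-k}\sqrt{2\pi k}$, which is Stirling's formula. The sketch below compares the two; the function names are assumptions for the example.

```python
import math

def log_Z_laplace(k):
    """ln Z from (4.135) with M = 1 for f(z) = z**k * exp(-z), z > 0."""
    z0 = float(k)                 # mode: d/dz ln f = k/z - 1 = 0
    A = k / (z0 * z0)             # A = -d^2/dz^2 ln f at z0, i.e. k / z0^2
    log_f_z0 = k * math.log(z0) - z0
    return log_f_z0 + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(A)

def log_Z_exact(k):
    # The integral of z**k * exp(-z) over z > 0 is Gamma(k + 1)
    return math.lgamma(k + 1.0)

def ratio(k):
    return math.exp(log_Z_laplace(k) - log_Z_exact(k))

for k in (1, 5, 50):
    print(f"k = {k:3d}: Z_laplace / Z_exact = {ratio(k):.4f}")
```

The ratio moves toward 1 as $k$ grows. Note that the Gaussian integral in (4.135) runs over the whole real line while this $f$ lives on $z>0$, which is one reason the estimate is cruder for small $k$.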

The comparison procedure is the same as the one used for linear regression in Section 3.4. <font color='red'>The point is to find an expression for the *model evidence*, whose value can be used to judge how well a model explains the data.</font>

Now consider a dataset $\mathcal{D}$ and a set of models $\{\mathcal{M}_i\}$ having parameters $\{\theta_i\}$. For each model we define a likelihood function $p(\mathcal{D}|\theta_i,\mathcal{M}_i)$. If we introduce a prior $p(\theta_i|\mathcal{M}_i)$ over the parameters, then we are interested in computing the model evidence $p(\mathcal{D}|\mathcal{M}_i)$ for various models. From now on we omit the conditioning on $\mathcal{M}_i$ to keep the notation uncluttered. From Bayes' theorem the model evidence is given by

$$p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)d\theta \tag{4.136}$$

As noted above, $f(\theta)$ denotes the unnormalized posterior distribution, so we identify

$$f(\theta) = p(\mathcal{D}|\theta)p(\theta),\qquad Z = \int f(\theta)d\theta = p(\mathcal{D})$$

Applying (4.135) with $\mathbf{z}_0 = \theta_{MAP}$, the mode of the posterior distribution, the log model evidence takes the form

$$\begin{align*}
\ln p(\mathcal{D}) =\ln Z
& \simeq\ln f(\theta_{MAP})+\frac{M}{2}\ln(2\pi)-\frac{1}{2}\ln|\mathbf{A}|\\
& = \ln p(\mathcal{D}|\theta_{MAP})+\underbrace{\ln p(\theta_{MAP}) +\frac{M}{2}\ln(2\pi)-\frac{1}{2}\ln|\mathbf{A}|}_{\text{Occam factor}} \tag{4.137}
\end{align*}$$

where 
- The first term represents the log likelihood evaluated using the optimized parameters.
- The remaining terms comprise the 'Occam factor' which penalizes model complexity.
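To see (4.137) at work on a case where the exact evidence is known, the sketch below uses a Bernoulli likelihood with a $\mathrm{Beta}(a,b)$ prior on a single parameter $\theta$ (so $M=1$). For a specific sequence with `h` ones and `t` zeros, the exact evidence is $B(a+h,\,b+t)/B(a,b)$. The model, the data counts, and all names are assumptions chosen for illustration.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_exact(h, t, a, b):
    # Exact ln p(D) for a sequence with h ones and t zeros under a Beta(a, b) prior
    return log_beta(a + h, b + t) - log_beta(a, b)

def log_evidence_laplace(h, t, a, b):
    """Laplace estimate of ln p(D) from (4.137) with M = 1.

    Returns (log_evidence, log_likelihood_at_map, log_occam_factor).
    Assumes h + a > 1 and t + b > 1 so that the mode lies inside (0, 1).
    """
    alpha, beta = h + a - 1.0, t + b - 1.0
    theta = alpha / (alpha + beta)                      # theta_MAP
    log_lik = h * math.log(theta) + t * math.log(1.0 - theta)
    log_prior = ((a - 1.0) * math.log(theta)
                 + (b - 1.0) * math.log(1.0 - theta)
                 - log_beta(a, b))
    # A = -d^2/dtheta^2 ln f(theta) at theta_MAP, with f = likelihood * prior
    A = alpha / theta ** 2 + beta / (1.0 - theta) ** 2
    log_occam = log_prior + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(A)
    return log_lik + log_occam, log_lik, log_occam

h, t, a, b = 30, 20, 2.0, 2.0
approx, log_lik, log_occam = log_evidence_laplace(h, t, a, b)
exact = log_evidence_exact(h, t, a, b)
print(f"Laplace ln p(D) = {approx:.4f}, exact ln p(D) = {exact:.4f}")
print(f"log-likelihood at MAP = {log_lik:.4f}, log Occam factor = {log_occam:.4f}")
```

Returning the two pieces separately makes the decomposition in (4.137) visible: the log-likelihood at $\theta_{MAP}$ plus the log of the Occam factor, which is negative for these settings.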