# Approximating the expectation of the log of the joint probability

In the previous chapter, we worked out an approximation of the expectation of the log of the approximate posterior, so the free energy functional simplifies to this:

$$F[Q(\Theta)] = \mathbb{E}[ln P(y, \Theta)] - [-\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{n}{2}]$$

So now, we will follow exactly the same logic, but to approximate $\mathbb{E}[ln P(y, \Theta)]$. With everything we have seen up until now, this one should be easy!

## Quadratic approximation of the log of the joint probability

In order to approximate the log of the joint, we will use the same technique as we used for the log of the approximate posterior: the quadratic approximation at the mode. So same as before:

$$ln P(y, \Theta) \approx ln P(y, \mu) -\frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)$$

$$\Sigma^{-1} = -\delta_{\mu, \mu} ln P(y, \mu)$$

Note the minus sign: at the mode, the Hessian of the log joint is negative definite, so it is its negative that we call $\Sigma^{-1}$. One thing you might wonder about is why the $y$ remains: why do we have $ln P(y, \mu)$ and not $ln P(y_0, \mu)$? That's because $y$ is the observed data, which is fixed. We can't approximate the joint probability by adjusting the observed data. That's in contrast with $\Theta$, which we are trying to estimate. What we are doing with the quadratic approximation is approximating the function around a certain value of $\Theta$.
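To make the quadratic approximation at the mode a bit more tangible, here is a minimal numerical sketch. Everything in it is an assumption made for illustration: the toy function `log_joint` (chosen so that its mode is known to be at $(1, -2)$), the helper `numerical_hessian`, and the finite-difference step size. It is not the model used in this book.

```python
import numpy as np

def log_joint(theta):
    """Toy log joint with a known mode at (1, -2); hypothetical, for illustration only."""
    a, b = theta
    return -np.cosh(a - 1.0) - (b + 2.0) ** 2

def numerical_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = eps
            e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

mu = np.array([1.0, -2.0])                 # the mode of the toy function
hessian = numerical_hessian(log_joint, mu)
sigma_inv = -hessian                       # negative Hessian at the mode

def quadratic_approx(theta):
    """ln P(y, mu) - 1/2 (theta - mu)^T Sigma^{-1} (theta - mu)."""
    d = np.asarray(theta, dtype=float) - mu
    return log_joint(mu) - 0.5 * d @ sigma_inv @ d

theta_near = mu + np.array([0.1, 0.1])
theta_far = mu + np.array([2.0, 2.0])
print("near the mode:", log_joint(theta_near), quadratic_approx(theta_near))
print("far from it:  ", log_joint(theta_far), quadratic_approx(theta_far))
```

Close to the mode the two numbers should nearly coincide, while further away the approximation drifts from the true function, which is why the choice of the expansion point matters.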

## What of the Laplace approximation?

In the function above, we have used the quadratic method to approximate the log of the joint probability. Doesn't that imply that we are approximating the joint probability as a Gaussian? Indeed, if we were to take the exponential of the approximation above, we would be doing just that. But in the case of the joint probability, we don't need to approximate it, because we know what it is: it's just the numerator of Bayes' theorem. And we have seen before that it's the product of the likelihood and the prior:

$$P(y, \Theta) = P(y|\Theta)P(\Theta)$$

We do know how to calculate that. So in the approximation above, we don't need the approximation formula to figure out what $ln P(y, \mu)$ is equal to, we can simply calculate it. But then, what's the point of using the quadratic approximation in the first place? Remember, we are not using the quadratic approximation to find the distribution, but rather to find the expectation of its log. For the joint probability, that is the only thing we use the quadratic approximation for. It was only for the posterior that the quadratic approximation of the log at the mode was also handy to realize that the approximate posterior itself is a Gaussian.


## Calculating the expectation of the quadratic approximation of the log joint probability

So now, we can use the exact same trick as before, trace trick and everything:

$$
\begin{align}
\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] 
&\approx ln P(y, \mu)  -\frac{1}{2}\mathbb{E}_{Q(\Theta)}
\big[(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\big] \\
&= ln P(y, \mu)  -\frac{1}{2}\mathbb{E}_{Q(\Theta)}
\big[tr\big((\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\big)\big]\\
&= ln P(y, \mu)  -\frac{1}{2}tr\bigg(\Sigma^{-1}\mathbb{E}_{Q(\Theta)}
\big[(\Theta-\mu)(\Theta-\mu)^T\big]\bigg)\\
&= ln P(y, \mu)  -\frac{1}{2}tr(\Sigma^{-1}\Sigma)\\
&= ln P(y, \mu)  -\frac{n}{2}
\end{align}
$$
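If you would rather check the result of this derivation numerically than take the trace trick on faith, a quick Monte Carlo sketch can help. The values of `mu` and `Sigma` below are made up for illustration, and the check assumes that $Q(\Theta)$ is a Gaussian with mean $\mu$ and covariance $\Sigma$: the sample average of the quadratic term should then land close to $n$, the number of parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mode and covariance of Q(Theta) for a model with n = 3 parameters.
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)        # symmetric positive definite by construction
Sigma_inv = np.linalg.inv(Sigma)

# Draw samples from Q(Theta) and evaluate (Theta - mu)^T Sigma^{-1} (Theta - mu) for each.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
d = samples - mu
quad = np.einsum("ij,jk,ik->i", d, Sigma_inv, d)

mean_quad = quad.mean()
print("Monte Carlo estimate of the expectation:", mean_quad)
print("number of parameters n:", mu.size)
```

With that in hand, the approximated expectation is just the log joint at the mode minus half of this number.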


## Quadratic approximation of the log joint probability $ln P(y, \Theta)$

We can try to approximate the log of the joint probability using the quadratic approximation. Why would we want to do that? For the joint probability as we have it now, we can't really calculate the expectation because, again, we have to deal with the integral. But maybe we can calculate the expectation of the quadratic approximation of the log joint probability.

So to approximate the log of the joint probability with the quadratic approximation, we have this:

$$ln P(y, \Theta) \approx ln P(y, \Theta_0) + \nabla ln P(y, \Theta_0)^T(\Theta-\Theta_0) + \frac{1}{2}(\Theta-\Theta_0)^T H_{ln P}(y, \Theta_0)(\Theta-\Theta_0)$$

Or something like that. I have just replaced $f(x)$ by the actual function, and $x_0$ by $\Theta_0$. One thing you might wonder about is why the $y$ remains: why do we have $ln P(y, \Theta_0)$ and not $ln P(y_0, \Theta_0)$? That's because $y$ is the observed data, which is fixed. We can't approximate the joint probability by adjusting the observed data. That's in contrast with $\Theta$, which we are trying to estimate. What we are doing with the quadratic approximation is approximating the function around a certain value of $\Theta$.

There is actually a little notation issue with the formula above. I have written the Hessian as $H_{ln P}$. That's a little bit unconventional, because it would typically be written as $H_f$, with the function defined in a separate formula ("where f=..."). There is however a different notation that you can adopt, which is the following:

$$[H_f(x_0)] = \delta_{x_i, x_j}f(x_0)$$

These two expressions mean the same thing (hence the equals sign), but the latter shows more clearly which function we are computing the Hessian of. So we can replace the $H$ in the formula above by that notation. Again, it doesn't change anything, it just makes it clearer and easier to read:

$$ln P(y, \Theta) \approx ln P(y, \Theta_0) + \nabla ln P(y, \Theta_0)^T(\Theta-\Theta_0) + \frac{1}{2}(\Theta-\Theta_0)^T [\delta_{\Theta_0, \Theta_0} ln P(y, \Theta_0)](\Theta-\Theta_0)$$ 

One other thing we need to do is decide what $\Theta_0$ should be. This is basically a set of values, one for each of the parameters, from which to start approximating. If you remember the 3D plots of multivariate distributions we saw before, $\Theta_0$ is just a point on that distribution, which corresponds to a given value for each of the parameters. But which point should we choose? One thing to note about the quadratic approximation (and any kind of finite Taylor series expansion) is that the approximation is most accurate around the point we start from (i.e. around $\Theta_0$, no matter which $\Theta_0$ we choose). And remember that the reason we are approximating the log of the joint probability using the quadratic approximation is that perhaps we can solve the expectation (i.e. the integral) of the quadratic approximation of the log of the joint probability. In other words, our final goal is to approximate the expectation of the log of the joint probability by calculating the expectation of the quadratic approximation of the log joint probability (I know, it's a mouthful).

Accordingly, we should choose our $\Theta_0$ to be the point of the distribution whose surroundings have the strongest impact on the expectation. So what is the point in a probability distribution that has the strongest impact on the expectation? The mode! The mode is the point in your distribution with the highest probability. And because the expectation is a weighted average, the mode of the distribution (the point with the highest probability) has the strongest influence. This is why we will take $\Theta_0$ to be $\mu$, where $\mu$ stands for the mode. So we can rewrite the function above like this to make it more explicit:

$$ln P(y, \Theta) \approx ln P(y, \mu) + \nabla ln P(y, \mu)^T(\Theta-\mu) + \frac{1}{2}(\Theta-\mu)^T [\delta_{\mu, \mu} ln P(y, \mu)](\Theta-\mu)$$

In simple words, we have: "The log of the joint probability is approximately equal to the log of the joint probability at the mode, plus the first order derivative (i.e. the gradient) of the log of the joint probability at the mode (with some scaling), plus the second order derivative (i.e. the Hessian) of the log of the joint probability at the mode (with some scaling)". But using the mode of the distribution as our point to approximate from has another important advantage. If you think of a simple normal distribution, what is the gradient at the mode? Remember, the gradient is the slope of the tangent line at a particular point, just in multidimensional space.

Since the mode of a probability distribution is a peak, the gradient there is equal to... 0. So by choosing the mode, we can actually get rid of the gradient term, because it is zero at that point! So we now have:
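The claim that the gradient vanishes at the mode is easy to check numerically. The sketch below uses the log of a two-dimensional Gaussian (with made-up `mu` and `Sigma`, and the normalization constant dropped since it does not affect the gradient) and a hypothetical helper `numerical_gradient` based on central finite differences.

```python
import numpy as np

# Hypothetical mode and covariance of a 2D Gaussian.
mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def log_density(theta):
    """Log of an unnormalized Gaussian density centered on mu."""
    d = theta - mu
    return -0.5 * d @ Sigma_inv @ d

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

grad_at_mode = numerical_gradient(log_density, mu)
grad_elsewhere = numerical_gradient(log_density, mu + np.array([1.0, 0.0]))
print("gradient at the mode:      ", grad_at_mode)
print("gradient away from the mode:", grad_elsewhere)
```

At the mode both components should be numerically indistinguishable from zero, whereas one unit away from it they are clearly not.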

$$ln P(y, \Theta) \approx ln P(y, \mu) + \frac{1}{2}(\Theta-\mu)^T [\delta_{\mu, \mu} ln P(y, \mu)](\Theta-\mu)$$

That's a little bit easier to deal with. And remember that the Hessian ($\delta_{\mu, \mu} ln P(y, \mu)$) is a matrix. Because the mode is a maximum, this matrix is negative definite, so we will give a name to its negative, $\Sigma^{-1}$, and rewrite the function like this (again, don't worry too much about the why now, it will become obvious down the line):

$$ln P(y, \Theta) \approx ln P(y, \mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)$$

$$\Sigma^{-1} = -\delta_{\mu, \mu} ln P(y, \mu)$$

## Expectation of the quadratic approximation of the log joint probability $ln P(y, \Theta)$

Okay, so we now have written a quadratic approximation of the log of the joint probability. And the reason we did that is because maybe we can compute the Expectation of the approximation easily, which we couldn't do for the actual function. So now let's try to compute the expectation of the approximation:

$$\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] \approx \mathbb{E}_{Q(\Theta)}\bigg[ln P(y, \mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\bigg]$$

Again, the oldest trick in the book, we can take out of the Expectation all bits that don't depend on $\Theta$:

$$\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] \approx ln P(y, \mu) -\frac{1}{2}\mathbb{E}_{Q(\Theta)}\bigg[(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\bigg]$$

Okay, that looks a bit better, but we still have the expectation in here. Remember how I said in the beginning that we would be using some more advanced math in this chapter? Well, that's what we will do now. You don't actually need to understand that part, but we might write a chapter to explain how it works (drop an issue on GitHub if you'd like us to). We will simply rephrase the derivation from ["A primer on variational Laplace"](https://doi.org/10.1016/j.neuroimage.2023.120310):

"To simplify these expressions, note that each quadratic term inside the square brackets is a scalar. This means we can use the ‘trace trick’, $tr(ABC) = tr(CAB)$. Applying this gives the simpler expressions:"
$$
\begin{align}
\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] &\approx ln P(y, \mu) -\frac{1}{2}\mathbb{E}_{Q(\Theta)}\bigg[tr\big((\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\big)\bigg] \\
&= ln P(y, \mu) -\frac{1}{2}tr\bigg(\mathbb{E}_{Q(\Theta)}\big[(\Theta-\mu)(\Theta-\mu)^T \Sigma^{-1}\big]\bigg)
\end{align}
$$

With the above, we simply reordered what's inside the bracket. We also took the trace operator (tr) out of the expectation, because the expectation of a trace is equal to the trace of an expectation. One more thing we can now do is take $\Sigma^{-1}$ out of the expectation, because it doesn't contain $\Theta$ either:

$$\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] \approx ln P(y, \mu) -\frac{1}{2}tr\bigg(\mathbb{E}_{Q(\Theta)}\big[(\Theta-\mu)(\Theta-\mu)^T\big]\Sigma^{-1}\bigg)$$
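The trace trick itself can be verified on random numbers. In the sketch below, the column vector `d` plays the role of $(\Theta-\mu)$ and the matrix `S` plays the role of $\Sigma^{-1}$; both are randomly generated stand-ins, not quantities from our model. Note the order of the factors: $d^T d$ is a single number, while $d\,d^T$ is an $n \times n$ matrix, and it is the latter that appears inside the trace.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

d = rng.normal(size=(n, 1))            # column vector, stands in for (Theta - mu)
B = rng.normal(size=(n, n))
S = B @ B.T + n * np.eye(n)            # symmetric positive definite, stands in for Sigma^{-1}

scalar_form = (d.T @ S @ d).item()     # (Theta - mu)^T Sigma^{-1} (Theta - mu), a 1x1 matrix
trace_form = np.trace(d @ d.T @ S)     # tr((Theta - mu)(Theta - mu)^T Sigma^{-1})

print("quadratic form:", scalar_form)
print("trace form:    ", trace_form)
print("shape of d d^T:", (d @ d.T).shape, " shape of d^T d:", (d.T @ d).shape)
```

The two numbers agree, which is all the trace trick says: a scalar is equal to its own trace, and the factors inside a trace can be rotated cyclically.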


Okay, we are making progress. The term inside the expectation brackets probably looks familiar from previous chapters. Indeed, the expectation of the outer product $(\Theta-\mu)(\Theta-\mu)^T$ is the formula of the covariance matrix we saw before. And one thing we haven't explained (but we will below) is that $\Sigma$ is in fact also the covariance matrix. So in fact, 

$$\mathbb{E}_{Q(\Theta)}\big[(\Theta-\mu)(\Theta-\mu)^T\big] = \Sigma$$

So here we are, we have gotten rid of the Expectation:

$$\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] \approx ln P(y, \mu) -\frac{1}{2}tr\big(\Sigma\Sigma^{-1}\big)$$

But we aren't done with tricks just yet. We still have this weird trace operator thingy. And we have the covariance matrix multiplied by its inverse. In matrix algebra, a matrix multiplied by its inverse is equal to the identity matrix, which is a matrix with 1 along the diagonal and 0 elsewhere. So in order to solve the operation above, we need to compute the trace of the identity matrix. Since the trace is the sum of the diagonal elements, the trace of the identity matrix is the number of rows (or columns) of the matrix. In other words, $tr\big(\Sigma\Sigma^{-1}\big)$ simplifies to $n$, which is the number of parameters of our model. So we can simplify the formula even further:

$$\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] \approx ln P(y, \mu) -\frac{n}{2}$$

Again, to put that in simple words, the approximation of the expectation of the log of the joint probability is equal to the log of the joint probability at the mode, minus half the number of parameters. So complicated math along the way, but now we have a very clear and simple formula for the first part of the free energy!


## Concretely, what are we supposed to calculate?

Okay, so the formula above is the approximation of the expectation of the log of the joint probability. So back to our simple penguins example. Our model is the linear model, which has three parameters: $\Theta = [\beta_0, \beta_1, \sigma^2]$. So $n$ is 3, as simple as that! Now, we need to find the mode of the distribution, so that we can calculate the log of the joint probability at that point and get one half of the free energy functional. If we know what the mode of the joint probability is, we can compute $ln P(y, \mu)$, because there is nothing problematic about the function of the joint probability. If you remember from the previous chapters, the joint probability is this:

$$P(y, \Theta) = P(y|\Theta) P(\Theta)$$

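So it's the likelihood, times the prior. And we have defined the following in the previous chapters (careful: in the likelihood, $n$ is the number of observations, not the number of parameters, and $X_i$ is the $i$-th row of the design matrix):

$$P(y| \Theta) = \bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\bigg)^n\prod_{i=1}^{n}exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg)$$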

And

$$P(\Theta) = P(\boldsymbol{\beta}) \times P(\sigma^2)$$

Where (in the prior on $\boldsymbol{\beta}$, $\mu$ and $\Sigma$ denote the prior mean and covariance):
- $P(\boldsymbol{\beta})$ is a multivariate normal distribution: $\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\big(-\frac{1}{2}(\boldsymbol{\beta} - \mu)^T\Sigma^{-1}(\boldsymbol{\beta}-\mu)\big)$
- $P(\sigma^2)$ is an inverse gamma distribution: $\frac{\lambda^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1}exp(-\frac{\lambda}{\sigma^2})$

So if we want to calculate the log of the joint probability at the mode, we can simply compute the product of the prior and the likelihood at that point and take the log of that. In fact, since it is the log we are interested in, we can simplify the function a bit, simply by making use of the fact that with the log, multiplications become additions and things like that. By now, you are probably a bit tired of reading equations, so I will simply provide the simplified formula here. The full derivation is in the last section of this chapter, and I strongly encourage you to give it a quick look, just to see that it is nothing crazy, just cleaning up and simplifying things (below, $X_i$ is the $i$-th row of the design matrix $\boldsymbol{X}$, and $\lambda$ is the scale parameter of the inverse gamma prior):

$$
\begin{align}
ln P(y, \Theta) &= ln\Bigg(
    \bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\bigg)^n\prod_{i=1}^{n}exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg) \times
    \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\big(-\frac{1}{2}(\boldsymbol{\beta} - \mu)^T\Sigma^{-1}(\boldsymbol{\beta}-\mu)\big) \times
    \frac{\lambda^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1}exp(-\frac{\lambda}{\sigma^2}) 
    \Bigg)\\
&=
-\frac{n}{2}ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y-\boldsymbol{X\beta})^T(y-\boldsymbol{X\beta}) \\
&\quad-\frac{p}{2}ln(2\pi) - \frac{1}{2}ln|\Sigma| - \frac{1}{2}(\boldsymbol{\beta} -\mu)^T\Sigma^{-1}(\boldsymbol{\beta} - \mu) \\
&\quad+ \alpha ln\lambda - ln\Gamma(\alpha) - (\alpha + 1)ln\sigma^2-\frac{\lambda}{\sigma^2}
\end{align}
$$
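To show that this formula really is something you can just compute, here is a minimal sketch of it as a Python function. The function name `log_joint`, its argument names, and the synthetic data are all assumptions made for this illustration (they are not the penguins data): `prior_mean` and `prior_cov` are the mean and covariance of the normal prior on $\boldsymbol{\beta}$, and `alpha` and `lam` are the shape and scale of the inverse gamma prior on $\sigma^2$.

```python
import math
import numpy as np

def log_joint(y, X, beta, sigma2, prior_mean, prior_cov, alpha, lam):
    """Log joint of the linear model: log-likelihood + log prior on beta + log prior on sigma^2."""
    n = y.shape[0]      # number of observations
    p = beta.shape[0]   # number of regression coefficients
    resid = y - X @ beta
    log_lik = -0.5 * n * math.log(2 * math.pi * sigma2) - (resid @ resid) / (2 * sigma2)
    diff = beta - prior_mean
    _, logdet = np.linalg.slogdet(prior_cov)   # log of the determinant of the prior covariance
    log_prior_beta = (-0.5 * p * math.log(2 * math.pi) - 0.5 * logdet
                      - 0.5 * diff @ np.linalg.solve(prior_cov, diff))
    log_prior_sigma2 = (alpha * math.log(lam) - math.lgamma(alpha)
                        - (alpha + 1) * math.log(sigma2) - lam / sigma2)
    return log_lik + log_prior_beta + log_prior_sigma2

# Synthetic stand-in data and prior settings (all values hypothetical).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5), rng.normal(size=5)])
beta = np.array([0.3, -1.2])
sigma2 = 0.7
y = X @ beta + rng.normal(scale=math.sqrt(sigma2), size=5)
prior_mean = np.zeros(2)
prior_cov = np.array([[2.0, 0.3],
                      [0.3, 1.0]])
alpha, lam = 3.0, 2.0

value = log_joint(y, X, beta, sigma2, prior_mean, prior_cov, alpha, lam)
print("log joint at the chosen parameter values:", value)
```

Working with the summed log terms rather than the product of densities is also the numerically safer choice, since the product of many small probabilities can underflow for larger datasets.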

So still a beefy formula, but nothing unsolvable. You know what your data $y$ are, as well as the design matrix of your model $\boldsymbol{X}$. You also know the parameters of your priors, such as $\alpha$ and $\lambda$. So if you know what the values for $\mu$ are, i.e. the values of $\boldsymbol{\beta}$ and $\sigma^2$ that yield the max result, you can just calculate what that max result is. But how do we know what these values should be? How do we know which values of $\boldsymbol{\beta}$ and $\sigma^2$ to choose, so that we can solve the term $\mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)]$ (or the approximation thereof) of the free energy functional?

Before answering this question, we have to explain what the mode of the joint probability is. You may have noticed that the symbol we used is the same as the symbol we've used in the previous chapter for the mode of the Approximate posterior $Q(\Theta)$. You may have thought that it is just because $\mu$ is the generic symbol used for the mode of a distribution, so you may have thought that despite the symbol being the same, the mode of the approximate posterior and the mode of the joint probability might not be the same. Well, as it turns out, the mode of the joint probability 




. I guess by now you are probably a bit sick of seeing mathematical formulae and derivations. So for this one, I'll skip straight to the simplified formulae of the log of the joint likelihood. It is simply us


Now we need to find the mode of the product of the two to calculate the log of the probability at the mode. In this case, we can in fact use the log to our advantage. We can look for the values of $\boldsymbol{\beta}$ and $\sigma^2$ that maximize the log of the joint distribution, because those are also the values that maximize the joint probability (the logarithm is a monotonically increasing function). And with the log, we can convert multiplications to additions, which makes life a bit easier.
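Here is a small sketch of why maximizing the log is enough. It uses a made-up one-parameter unnormalized density, $\theta^3 e^{-2\theta}$ (whose mode is known to be at $1.5$), and a brute-force grid search, which is only meant as an illustration and not as the method used in this book.

```python
import numpy as np

# Grid of candidate values for a single hypothetical parameter theta > 0.
theta = np.linspace(0.01, 10.0, 100_000)

density = theta ** 3 * np.exp(-2.0 * theta)          # toy unnormalized density
log_density = 3.0 * np.log(theta) - 2.0 * theta      # its log: the product became a sum

mode_from_density = theta[np.argmax(density)]
mode_from_log = theta[np.argmax(log_density)]

print("mode found on the density:    ", mode_from_density)
print("mode found on the log density:", mode_from_log)
```

Both searches should point at the same grid value, close to the analytical mode, because taking the log changes the height of the curve but not the location of its peak.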

$$
\begin{align}
ln P(y, \Theta) &= ln(P(y| \Theta) \times P(\Theta)) \\
&= ln(P(y| \Theta) \times P(\boldsymbol{\beta}) \times P(\sigma^2))\\
&= ln(P(y| \Theta)) + ln(P(\boldsymbol{\beta})) + ln(P(\sigma^2))
\end{align}
$$

When taking the log of each of the probability distributions, we can also simplify each of them by replacing the multiplications by additions (writing $X_i$ for the $i$-th row of the design matrix):

$$
\begin{align}
ln P(y| \Theta) &= ln\bigg(\bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\bigg)^n\prod_{i=1}^{n}exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg)\bigg) \\
&= ln\bigg(\bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\bigg)^n\bigg) + ln\bigg(\prod_{i=1}^{n}exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg)\bigg) \\
&=-\frac{n}{2}ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y-\boldsymbol{X\beta})^T(y-\boldsymbol{X\beta})
\end{align}
$$

And I've just used a little trick to get rid of the product when taking the log of it.
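For completeness, the trick can be spelled out in a few lines. The only notation added here is $X_i$, which I use for the $i$-th row of the design matrix. The log of a product is the sum of the logs, and the log undoes the exponential, so:

$$
\begin{align}
ln\bigg(\prod_{i=1}^{n}exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg)\bigg)
&= \sum_{i=1}^{n} ln\bigg(exp\bigg(-\frac{(y_i-X_i\boldsymbol{\beta})^2}{2\sigma^2}\bigg)\bigg) \\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-X_i\boldsymbol{\beta})^2 \\
&= -\frac{1}{2\sigma^2}(y-\boldsymbol{X\beta})^T(y-\boldsymbol{X\beta})
\end{align}
$$

The last step just uses the fact that a sum of squares is the dot product of a vector with itself. The first term follows the same idea: $ln\big((2\pi\sigma^2)^{-n/2}\big) = -\frac{n}{2}ln(2\pi\sigma^2)$.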


$$ln P(y, \Theta) \approx ln P(y, \mu) + \frac{1}{2}(\Theta-\mu)^T H_f(y, \mu)(\Theta-\mu)$$

You may have a couple of questions. First, where does the $\mu$ come from? The $\mu$ here stands for the mode of the joint probability, i.e. the most likely values of the distribution. With the quadratic approximation, we approximate the function from a specific point, and we are choosing this point to be the mode of the distribution. Why? That's because the approximation is most accurate around the point we are approximating from. And again, we are trying to approximate the joint probability to eventually approximate the expectation. The mode of a distribution and the points around it are the ones that bear the strongest weight on the expectation. Indeed, these are the most probable values and therefore the most represented in the distribution. So if we have to pick a point to approximate from, we want that point to be the best suited for our ultimate goal of computing the expectation, which is why we take the mode of the joint probability.

Now you have surely noticed as well that there is a term missing in the above: the first order derivative, what we had represented as $\nabla$ above. The reason it is missing is also one of the reasons we are choosing to approximate our function around $\mu$. Remember that the $\nabla$ term stands for the derivative, and the derivative is the slope of the tangent at a given point. With a probability distribution, at the mode (i.e. the peak), the derivative is 0.

One of the things you might wonder about is why we still have the term $y$ everywhere. The $\Theta$ term was replaced by $\mu$, i.e. the mode of the distribution, in all the various places where we had $x_0$ in the quadratic approximation formula, but we didn't do that for the $y$ term. That's because $y$ is not a variable, it is fixed. So it is not something we can play around with, and therefore it stays right where it is.

In the paper ["A primer on variational Laplace"](https://doi.org/10.1016/j.neuroimage.2023.120310), you'll see that instead of $H_f(y, \mu)$, they are using a different notation. The formula looks like this:

$$ln P(y, \Theta) \approx ln P(y, \mu) + \frac{1}{2}(\Theta-\mu)^T [\delta_{\mu, \mu} ln P(y, \mu)](\Theta-\mu)$$

These two notations mean exactly the same thing. The $\delta$ symbol corresponds to the $H$ symbol of the Hessian matrix in the following way:

$$[H_f(x_0)] = \delta_{x_i, x_j}f(x_0)$$

The latter notation is a bit nicer, because then we can clearly see the formula $ln P(y, \mu)$, instead of having it hidden as $f$ under the Hessian $H$ symbol. We will therefore adopt that notation instead. But again, remember that it changes nothing, it just makes it easier to understand what each bit of the function is.

So our approximation for the log of the joint probability is the formula above, which we can simplify a little bit. Since the Hessian at the mode is negative definite, we write its negative as $\Sigma^{-1}$:

$$ln P(y, \Theta) \approx ln P(y, \mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)$$

Where:

$$\Sigma^{-1} = -[\delta_{\mu, \mu} ln P(y, \mu)]$$

If we were to take the exponential of that approximation, we would get an approximation of the joint probability ($P(y, \Theta)$), which would be a scaled multivariate normal distribution as per the Laplace approximation. It's really easy to see in the formula, and that's why we have written the negative Hessian as $\Sigma^{-1}$. But in fact, we will not do that.


## Laplace approximation
So in our particular case, the function we are trying to approximate is this:

$$f(x) = ln P(x)$$

Where $P(x)$ is either $P(y, \Theta)$ or $Q(\Theta)$

So if we want to take the Quadratic approximation, we have something like this:

$$f(x) \approx ln P(x_0) + \nabla f(x_0)^T(x-x_0) + \frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0)$$

One important thing to know about Taylor series expansions is that the approximation is most accurate around the point we have chosen to approximate from. So we should choose the point $x_0$ wisely. And remember that in our case, we are trying to approximate the function $ln P(x)$ so that we can calculate the expectation of that approximated function, as an approximation of the expectation of the true function. So which point $x_0$ should we choose? We want to approximate the function around the point that will make the largest difference to the expectation. If we have inaccurate values around points that don't matter so much for the expectation, that's not too big a deal. But if our approximation is inaccurate around points that have a strong impact on the expectation, that's a bit of a problem. So what point should we choose?

The answer is the **mode**, because the mode is the point that has the largest impact on the expectation. Indeed, the mode is basically the value with the highest probability, so it has the largest impact on your expectation. So we set $x_0$ to be the mode of the distribution. 

This has a nice consequence: at the mode of the distribution, the gradient (i.e. derivative) is 0. The gradient is the slope of the tangent line to our distribution at a particular point (or a vector of slopes for functions with vector input). And if you think of a normal distribution, the slope at the peak is 0, it's just a flat line. So that means that near the mode, the quadratic approximation of the log of a distribution simplifies to this:

$$f(x) \approx ln P(x_0) + \frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0)$$

Because:

$$\nabla f(x_0)^T(x-x_0) = 0$$

That's a bit simpler to deal with. But something even more interesting happens when we take the exponential of it all. We have defined our function as this:

$$f(x) = ln P(x)$$

Because that's what we are trying to find the expectation for. But if we take the exponential of $f(x)$, like so:

$$G(x) = exp(f(x)) = P(x)$$

Then we have the following approximation for $G(x)$:

$$G(x) \approx exp[f(x_0) + \frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0)]$$

With the exponential, a sum in the exponent becomes a product. So we have:

$$G(x) \approx exp[f(x_0)] \cdot exp[\frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0)]$$

And because $exp[f(x)] = G(x)$, we can rewrite the above as:

$$G(x) \approx G(x_0) \cdot exp[\frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0)]$$

Another important thing that happens when we approximate our probability distribution close to its mode is that the term $H_f(x_0)$ is negative definite. Remember it is a matrix, not a single number. As I mentioned above, this term captures the curvature of the probability distribution at that point. Since we are at the peak of the distribution (i.e. the max), the curvature is negative. If you think of a mountain, when you are at the top, if you look around, you don't see anywhere that goes up. 

Now we will again do something that seems useless. Since $H_f(x_0)$ is negative definite, we can write:

$$\Lambda = -H_f(x_0)$$

And accordingly, we can rewrite the equation like so:

$$G(x) \approx G(x_0) \cdot exp[-\frac{1}{2}(x-x_0)^T \Lambda(x-x_0)]$$

What has changed? Now we have the exponentiated expression that is negative, because we defined $\Lambda$ to be the negative of the Hessian matrix at $x_0$. 

## ... The quadratic approximation of a probability distribution near its mode is a multivariate normal distribution

Why did we do all of that? Well, the formula above might remind you of some other function we have seen before... The multivariate normal distribution:

$$P(\boldsymbol{\beta}) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\big(-\frac{1}{2}(\boldsymbol{\beta} - \mu)^T\Sigma^{-1}(\boldsymbol{\beta}-\mu)\big)$$

In the multivariate normal distribution, we have an exponentiated term that gets divided by something ($(2\pi)^{p/2}|\Sigma|^{1/2}$). That something is just a normalization constant to ensure that the whole integrates to 1. In the quadratic approximation of our probability distribution, we have some exponentiated term multiplied by some number $G(x_0)$ (which is a single number, remember).

If we look at the exponentiated terms, they are almost exactly the same across both formulas, the difference being that in the multivariate normal we have the term $\Sigma^{-1}$, while in the quadratic approximation we have $\Lambda$. But if you remember, both these terms are matrices of the same size. So if we match the two formulas, we have:

$$\Lambda = \Sigma^{-1}$$
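A one-dimensional worked case may help to see this. It is only a sketch, with $m$ and $s^2$ as names I introduce here for the mean and variance. Take $P(x)$ to be a univariate normal distribution and follow the same steps as above:

$$
\begin{align}
f(x) &= ln P(x) = -\frac{1}{2}ln(2\pi s^2) - \frac{(x-m)^2}{2s^2} \\
\nabla f(x) &= -\frac{x-m}{s^2}, \quad \text{which is 0 at the mode } x_0 = m \\
H_f(x_0) &= -\frac{1}{s^2} \\
\Lambda &= -H_f(x_0) = \frac{1}{s^2}
\end{align}
$$

In one dimension the covariance matrix is just the variance $s^2$, so $\Lambda$ is indeed the inverse of the covariance. The curvature at the peak is negative, and the sharper the peak (the larger $\Lambda$), the smaller the variance.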

The exponentiated term is the same between the two formulas, so the only difference is the number by which we multiply it. In the multivariate normal distribution, this is the normalization constant that ensures that it all integrates to 1. In the case of the exponential of the quadratic approximation, we don't quite know what it is. But since we multiply the exponentiated term by that instead of the normalization constant, we say that the exponential of the quadratic approximation of a probability distribution is a **scaled multivariate normal distribution**.

This is what the Laplace approximation is. It is basically the observation that the exponential of the quadratic approximation of the log of a probability distribution near its mode is a scaled multivariate normal distribution (i.e. a Gaussian). That's just the way it is.

What you might wonder is what we should do about the $G(x_0)$ term. Should we calculate it? Should we somehow figure out if it's equal to the normalization constant? The truth is that we don't quite care about it. Since it is a constant, it won't impact the expectation, no matter what we set it as. However, when $G(x)$ is a probability distribution, this value has to be the constant that ensures that the whole integrates to 1. So the only valid option is the following:

$$G(x_0) = \frac{1}{(2*\pi)^{p/2}|\mathcal{\Sigma}|^{1/2}}$$
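To see the Laplace approximation at work on something that is not already a Gaussian, here is a hedged one-dimensional sketch. The choice of a Gamma density with shape `k = 10` and rate `rate = 2` is purely an assumption for illustration, as is the simple Riemann sum used for the integral. We take the mode, compute the second derivative of the log density there, and compare the Gaussian normalization constant with the true height of the density at its mode.

```python
import math
import numpy as np

k, rate = 10.0, 2.0                      # hypothetical Gamma shape and rate

def log_p(x):
    """Log density of a Gamma(k, rate) distribution."""
    return k * math.log(rate) - math.lgamma(k) + (k - 1) * np.log(x) - rate * x

x0 = (k - 1) / rate                      # mode of the Gamma density
hessian = -(k - 1) / x0 ** 2             # second derivative of log_p at the mode
Lambda = -hessian                        # positive, since the Hessian is negative at a peak
Sigma = 1.0 / Lambda                     # variance of the Gaussian approximation

# Integrate the normalized Gaussian approximation with a simple Riemann sum.
sd = math.sqrt(Sigma)
x = np.linspace(x0 - 10 * sd, x0 + 10 * sd, 200_001)
dx = x[1] - x[0]
gauss_const = 1.0 / math.sqrt(2 * math.pi * Sigma)
integral = np.sum(gauss_const * np.exp(-0.5 * Lambda * (x - x0) ** 2)) * dx

true_peak = math.exp(log_p(x0))          # G(x_0) if we used the true density value
print("integral of the Gaussian approximation:", integral)
print("Gaussian normalization constant:", gauss_const)
print("true density at the mode:       ", true_peak)
```

The Gaussian integrates to 1 by construction, and for this example its normalization constant comes out close to, but not exactly equal to, the true density at the mode. That small gap is the price of approximating a skewed distribution by a symmetric one.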


## Assumption of Gaussianity under the Laplace approximation

There is an important implication to all of this. If we approximate the approximate posterior $Q(\Theta)$ using the quadratic approximation, it implies that we approximate the posterior as a Gaussian. In other words, **we will assume that the posterior is a normal distribution**. This is why you will often read that the free energy method for Bayesian statistics has an "assumption of Gaussianity" about the posterior, or that the method assumes that the posterior is a normal distribution. This is all true, but it is important to understand where this "assumption" comes from. It comes from the methodological choice to approximate $Q(\Theta)$ using the Laplace approximation, i.e. **quadratic approximation of the log approximate posterior around its mode**. It doesn't come from a general assumption that every posterior will look like a normal distribution, but from the fact that around its mode, the posterior will look like one.


# Approximating the log of the approximate posterior $ln Q(\Theta)$

We can do the same thing for the log of the approximate posterior, where $\Sigma^{-1}$ is again the negative of the Hessian at the mode:

$$ln Q(\Theta) \approx ln Q(\mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)$$