# Bayesian inference for the Gaussian

Bayesian inference proceeds in three steps:
1. Assume a prior distribution. 
2. Multiply the likelihood function by the prior.
3. Normalize to obtain the posterior distribution.

*In practice, steps 2 and 3 are usually combined.*  
<font color='red'>Carrying out this procedure directly can require a lot of calculation. The calculations are greatly simplified if we choose a **conjugate form** for the prior distribution. (A conjugate prior has the same functional form as the likelihood function, viewed as a function of the parameter, so that the posterior has the same functional form as the prior.)</font>
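
The three steps can be checked numerically on a grid. The following is a minimal sketch rather than part of the original notes: the toy data, the $\mathcal{N}(0, 4)$ prior, the known variance of 1, and the helper names `normal_pdf` and `grid_posterior` are all illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def grid_posterior(grid, step, prior_pdf, likelihood):
    """Return posterior density values on an evenly spaced grid."""
    # Step 2: multiply the likelihood by the prior at every grid point.
    unnormalized = [likelihood(m) * prior_pdf(m) for m in grid]
    # Step 3: normalize so that the Riemann sum of the density equals one.
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]

# Hypothetical toy data; the variance (1.0) is treated as known.
data = [1.2, 0.8, 1.5, 1.1]
step = 0.01
grid = [-5 + step * i for i in range(1001)]  # candidate values of mu in [-5, 5]

posterior = grid_posterior(
    grid,
    step,
    prior_pdf=lambda m: normal_pdf(m, 0.0, 4.0),  # Step 1: assumed prior
    likelihood=lambda m: math.prod(normal_pdf(x, m, 1.0) for x in data),
)
posterior_mean = sum(m * p for m, p in zip(grid, posterior)) * step
```

A grid like this scales poorly with the number of parameters, which is one practical reason to prefer a conjugate prior whenever a closed-form posterior is available.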

## Unknown Mean, Known Variance
Consider a single Gaussian random variable $x$. We shall suppose that the variance $\sigma^2$ is known, and the task is to infer the mean $\mu$ given a set of $N$ observations $\mathbf{X}=\big\{ x_1,\cdots,x_N \big\}$. The likelihood function, that is, the probability of the observed data given $\mu$, viewed as a function of $\mu$, is given by
$$p(\mathbf{X}|\mu)=\prod_{n=1}^Np(x_n|\mu)=\frac{1}{(2\pi\sigma^2)^{N/2}}exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2\right\}$$
Again we emphasize that the likelihood function $p(\mathbf{X}|\mu)$ is not a probability distribution over $\mu$ and is not normalized. 

### Prior
We then introduce a Gaussian prior distribution $p(\mu)$. The reason is that the likelihood function, viewed as a function of $\mu$, is the exponential of a quadratic form in $\mu$, which is the same form as a Gaussian, so the product of the likelihood function and the prior is again Gaussian. Such a prior is called a conjugate prior.  
The prior distribution is given by
<font color='red'>$$p(\mu)=\mathcal{N}(\mu|\mu_0,\sigma_0^2)$$</font>
### Posterior
The posterior distribution is given by
$$\begin{align*}
p(\mu|\mathbf{X})&\propto p(\mathbf{X}|\mu)p(\mu)\\
p(\mu|\mathbf{X})&=\mathcal{N}(\mu|\mu_N,\sigma_N^2)
\end{align*}$$
We need to find the mean $\mu_N$ and variance $\sigma_N^2$ of the posterior distribution, which we do by completing the square in $\mu$. 

$$\begin{align*}
p(\mu|\mathbf{X})\propto p(\mathbf{X}|\mu)p(\mu)&=
\frac{1}{(2\pi\sigma^2)^{N/2}}exp\left\{ -\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2\right \}
\cdot\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\mu)^2-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp\left\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n^2-2x_n\mu+\mu^2)-\frac{1}{2\sigma_0^2}(\mu^2-2\mu\mu_0+\mu_0^2)\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp
\left\{-\frac{1}{2}\left[\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2
-\left(\frac{2x_1+\cdots+2x_N}{\sigma^2}+\frac{2\mu_0}{\sigma_0^2}\right)\mu
+\left(\frac{x_1^2+\cdots+x_N^2}{\sigma^2}+\frac{\mu_0^2}{\sigma_0^2}\right)\right]
\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp
\left\{-\frac{1}{2}\left[\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2
-2\left (\frac{\sum_{n=1}^Nx_n}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\mu
+const\right]
\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp
\left\{-\frac{1}{2}\left[\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2
-2\left (\frac{N\mu_{ML}}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\mu
+const\right]
\right\}\qquad \mu_{ML}=\frac{1}{N}\sum_{n=1}^Nx_n\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp
\left\{-\frac{1}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\left[\mu^2
-2\left (\frac{N\mu_{ML}}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\right)\left/\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\right.\mu
+const\right]
\right\}\\
&=\frac{1}{(2\pi\sigma^2)^{N/2}}\frac{1}{(2\pi \sigma_0^2)^{1/2}}exp
\left\{-\frac{1}{2}\left(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\left(\mu-\mu_N \right )^2
\right\}\cdot exp(const)\qquad \mu_N=\frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0+\frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{ML}\\
\Rightarrow\quad p(\mu|\mathbf{X})&=\frac{1}{(2\pi \sigma_N^2)^{1/2}}exp\left\{-\frac{1}{2\sigma_N^2}(\mu-\mu_N)^2\right\}\qquad \frac{1}{\sigma_N^2}=\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}
\end{align*}$$

Thus, the posterior distribution is given by
$$p(\mu|\mathbf{X})=\mathcal{N}(\mu|\mu_N,\sigma_N^2)$$
where 
<font color='red'>$$\begin{align*}
\mu_N&=\frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0+\frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{ML}\\
\frac{1}{\sigma_N^2}&=\frac{1}{\sigma_0^2}+\frac{N}{\sigma^2}
\end{align*}$$</font>
Some conclusions follow from the form of the posterior mean and variance.
- The posterior mean $\mu_N$ is a compromise between the prior mean $\mu_0$ and the maximum likelihood solution $\mu_{ML}$.
  - If $N=0$, that is, before any data have been observed, the posterior mean equals the prior mean $\mu_0$.
  - If $N\to\infty$, the posterior mean tends to the maximum likelihood solution $\mu_{ML}$.
- The posterior variance is $\sigma_N^2$, and its inverse $\frac{1}{\sigma_N^2}$ is the posterior precision. Precisions are additive: each observation adds the data precision $\frac{1}{\sigma^2}$ to the prior precision $\frac{1}{\sigma_0^2}$.
  - If $N=0$, the posterior precision equals the prior precision.
  - If $N\to\infty$, the posterior precision grows without bound, and the posterior variance goes to zero.
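
The update for $\mu_N$ and $\sigma_N^2$ is short enough to write out directly. The sketch below assumes plain Python lists for the data; the function name `posterior_known_variance` and the example numbers are hypothetical and chosen only to illustrate the two limiting cases.

```python
def posterior_known_variance(data, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of mu for a Gaussian with known variance.

    data      -- list of observations x_1..x_N (may be empty)
    sigma2    -- known variance of the observations
    mu0       -- prior mean
    sigma0_2  -- prior variance
    """
    n = len(data)
    # Maximum likelihood estimate of the mean; its weight is zero when N = 0,
    # so any placeholder value works in that case.
    mu_ml = sum(data) / n if n else 0.0
    denom = n * sigma0_2 + sigma2
    mu_n = (sigma2 / denom) * mu0 + (n * sigma0_2 / denom) * mu_ml
    # Precisions add: prior precision plus N times the data precision.
    precision_n = 1.0 / sigma0_2 + n / sigma2
    return mu_n, 1.0 / precision_n

# N = 0: the posterior is the prior.
prior_only = posterior_known_variance([], sigma2=1.0, mu0=0.0, sigma0_2=4.0)

# A few hypothetical observations pull the mean towards mu_ML = 1.15.
mu_n, sigma_n2 = posterior_known_variance(
    [1.2, 0.8, 1.5, 1.1], sigma2=1.0, mu0=0.0, sigma0_2=4.0
)

# Many observations: the posterior mean approaches mu_ML, the variance shrinks.
mu_big, var_big = posterior_known_variance(
    [1.15] * 100000, sigma2=1.0, mu0=0.0, sigma0_2=4.0
)
```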
  


## Known Mean, Unknown Variance
For a Gaussian distribution with known mean and unknown variance, it turns out to be most convenient to work with the precision $\lambda\equiv \frac{1}{\sigma^2}$. The likelihood function for $\lambda$ takes the form
$$p(\mathbf{X}|\lambda)=\prod_{n=1}^N\mathcal{N}(x_n|\mu,\lambda^{-1})\propto \lambda^{N/2}exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N(x_n-\mu)^2\right\}$$

### Prior
Viewed as a function of the unknown precision $\lambda$, the likelihood is the product of a power of $\lambda$ and the exponential of a linear function of $\lambda$. The corresponding conjugate prior should therefore have the same form. This corresponds to the gamma distribution, which is defined by 
<font color='red'>$$Gam(\lambda|a, b)=\frac{1}{\Gamma(a)}b^a\lambda^{a-1}exp(-b\lambda)$$</font>
Here $\Gamma(a)$ is the gamma function, defined by $\Gamma(x)=\int_0^{\infty}u^{x-1}e^{-u}du$, which ensures that the distribution is correctly normalized. The parameters must satisfy $a>0$ and $b>0$.  
The mean and variance of the gamma distribution are given by
$$\begin{align*}
\mathbb{E}[\lambda] &= \frac{a}{b}\\
var[\lambda] &= \frac{a}{b^2}
\end{align*}$$

### Posterior
Consider a prior distribution $Gam(\lambda|a_0, b_0)$. If we multiply by the likelihood function, then we obtain a posterior distribution
$$p(\lambda|\mathbf{X})\propto \lambda^{a_0-1}\lambda^{N/2}exp\left\{-b_0\lambda-\frac{\lambda}{2}\sum_{n=1}^N(x_n-\mu)^2\right\}$$
which we recognize as a gamma distribution of the form $Gam(\lambda|a_N, b_N)$ where
$$\begin{align*}
a_N&=a_0+\frac{N}{2}\\
b_N&=b_0+\frac{1}{2}\sum_{n=1}^N(x_n-\mu)^2=b_0+\frac{N}{2}\sigma_{ML}^2
\end{align*}$$
where 
- $\sigma_{ML}^2=\frac{1}{N}\sum_{n=1}^N(x_n-\mu)^2$ is the maximum likelihood estimator of the variance.
- Observing $N$ data points increases the parameter $a$ by $\frac{N}{2}$, so we can interpret the prior parameter $a_0$ as $2a_0$ effective *prior observations*.
- Similarly, the $N$ data points contribute $\frac{N}{2}\sigma_{ML}^2$ to the parameter $b$, so we can interpret $b_0$ as arising from the $2a_0$ effective prior observations having variance $\frac{2b_0}{2a_0}=\frac{b_0}{a_0}$.
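
The gamma update amounts to two additions. The following sketch assumes the mean $\mu$ is known and the data are a plain Python list; the function name `gamma_posterior` and the example values are hypothetical.

```python
def gamma_posterior(data, mu, a0, b0):
    """Posterior parameters (a_N, b_N) for the precision of a Gaussian with known mean.

    data -- list of observations x_1..x_N
    mu   -- known mean
    a0   -- shape parameter of the Gam(lambda | a0, b0) prior
    b0   -- rate parameter of the Gam(lambda | a0, b0) prior
    """
    n = len(data)
    a_n = a0 + n / 2
    # Half the sum of squared deviations from the known mean, i.e. (N/2) * sigma_ML^2.
    b_n = b0 + 0.5 * sum((x - mu) ** 2 for x in data)
    return a_n, b_n

# Hypothetical data with known mean 1.0 and a Gam(1, 1) prior on the precision.
a_n, b_n = gamma_posterior([1.2, 0.8, 1.5, 1.1], mu=1.0, a0=1.0, b0=1.0)

# Posterior mean of the precision, using E[lambda] = a / b.
expected_precision = a_n / b_n
```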


## Unknown Mean, Unknown Variance

### Prior
To find a conjugate prior, we consider the dependence of the likelihood function on $\mu$ and $\lambda$
$$\begin{align*}
p(\mathbf{X}|\mu,\lambda)&=\prod_{n=1}^N\left(\frac{\lambda}{2\pi}\right)^{1/2}exp\left\{-\frac{\lambda}{2}(x_n-\mu)^2\right\}\\
&\propto \left[\lambda^{1/2}exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^Nexp\left\{\lambda\mu\sum_{n=1}^Nx_n-\frac{\lambda}{2}\sum_{n=1}^Nx_n^2\right\}
\end{align*}$$
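where both $\mu$ and $\lambda$ are unknown. A conjugate prior should have the same functional dependence on $\mu$ and $\lambda$ as the likelihood, so it takes the form below, where $\beta$, $c$ and $d$ are constants: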
$$\begin{align*}
p(\mu,\lambda) &\propto \left[\lambda^{1/2}exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^{\beta}exp\{c\lambda\mu-d\lambda\}\\
&=exp\left\{-\frac{\beta\lambda}{2}(\mu-c/\beta)^2\right\}\lambda^{\beta/2}exp\left\{-\left(d-\frac{c^2}{2\beta}\right)\lambda\right\}
\end{align*}$$
After normalizing, the prior distribution takes the form
<font color='red'>$$\begin{align*}p(\mu,\lambda) &= \mathcal{N}(\mu|c/\beta,(\beta\lambda)^{-1})\cdot Gam(\lambda|(1+\beta)/2,d-c^2/2\beta)\\
&= p(\mu|\lambda)p(\lambda)\end{align*} $$</font>
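This distribution is called the *Gaussian-gamma* (or *normal-gamma*) distribution. Note that it is not simply the product of an independent Gaussian prior over $\mu$ and a gamma prior over $\lambda$, because the precision of $\mu$ is a linear function of $\lambda$.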
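
To make the factorization $p(\mu|\lambda)p(\lambda)$ concrete, the log-density of a Gaussian-gamma distribution can be evaluated as the sum of a Gaussian term and a gamma term. This is a minimal sketch: the parameter names `mu0`, `beta`, `a`, `b` follow the generic form $\mathcal{N}(\mu|\mu_0,(\beta\lambda)^{-1})\,Gam(\lambda|a,b)$ and are assumptions of this example, as is the function name.

```python
import math

def log_gaussian_gamma(mu, lam, mu0, beta, a, b):
    """Log-density of N(mu | mu0, (beta*lam)^-1) * Gam(lam | a, b).

    Requires lam > 0, beta > 0, a > 0, b > 0.
    """
    # Gaussian factor p(mu | lam): its precision beta*lam depends on lam.
    log_normal = (
        0.5 * math.log(beta * lam / (2 * math.pi))
        - 0.5 * beta * lam * (mu - mu0) ** 2
    )
    # Gamma factor p(lam).
    log_gamma = (
        a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(lam) - b * lam
    )
    return log_normal + log_gamma

# Evaluate at an arbitrary point with illustrative parameter values.
value = log_gaussian_gamma(mu=0.0, lam=1.0, mu0=0.0, beta=1.0, a=1.0, b=1.0)
```

Working in log space avoids underflow when $\lambda$ or $(\mu-\mu_0)^2$ is large.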

---------------
# Multivariate Gaussian distribution prior
### Gaussian distribution
When the mean is unknown and the precision is known, the conjugate prior for the mean is again Gaussian.
$$p(\mathbf{\mu})=\mathcal{N}(\mathbf{\mu}|\mathbf{\mu}_0,\Sigma_0)$$

### Wishart distribution
When the mean is known and the precision matrix $\Lambda$ is unknown, the conjugate prior is the Wishart distribution.
$$\mathcal{W}(\Lambda|W,v)=B|\Lambda|^{(v-D-1)/2}exp\left(-\frac{1}{2}Tr(W^{-1}\Lambda)\right)$$
where $v$ is called the number of degrees of freedom of the distribution, $W$ is a $D\times D$ scale matrix, and $Tr(\cdot)$ denotes the trace of a matrix. The normalization constant B is given by
$$B(W,v)=|W|^{-v/2}\left(2^{vD/2}\pi^{D(D-1)/4}\prod_{i=1}^D\Gamma\left(\frac{v+1-i}{2}\right)\right)^{-1}$$
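
A quick sanity check on the normalization constant is the one-dimensional case. For $D=1$ the Wishart density, written with the standard factor $exp\left(-\frac{1}{2}Tr(W^{-1}\Lambda)\right)$, reduces to the gamma distribution $Gam(\lambda|v/2, 1/(2W))$. The sketch below compares the two log-densities; the function names and the test point are illustrative assumptions.

```python
import math

def log_wishart_1d(lam, w, v):
    """Log-density of the Wishart distribution for D = 1 (scalar lam, scalar w)."""
    # log B(W, v) with D = 1: the power of pi is D(D-1)/4 = 0 and the product
    # over i has the single term Gamma(v / 2).
    log_b = -0.5 * v * math.log(w) - 0.5 * v * math.log(2) - math.lgamma(v / 2)
    # |Lambda|^{(v - D - 1)/2} = lam^{(v - 2)/2}, and Tr(W^-1 Lambda) = lam / w.
    return log_b + 0.5 * (v - 2) * math.log(lam) - lam / (2 * w)

def log_gamma_pdf(lam, a, b):
    """Log-density of Gam(lam | a, b) with shape a and rate b."""
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(lam) - b * lam

# The two log-densities should agree at any lam > 0.
w, v, lam = 2.0, 3.0, 0.7
difference = log_wishart_1d(lam, w, v) - log_gamma_pdf(lam, v / 2, 1 / (2 * w))
```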

### Gaussian-Wishart distribution
When both the mean and the precision are unknown, the conjugate prior is the Gaussian-Wishart (or normal-Wishart) distribution.
$$p(\mathbf{\mu},\Lambda|\mathbf{\mu}_0,\beta,W,v)=\mathcal{N}(\mathbf{\mu}|\mathbf{\mu}_0, (\beta\Lambda)^{-1})\mathcal{W}(\Lambda|W,v)$$
