## Priors

A prior is an unconditional probability of a parameter, the assumed probability distribution of the parameter before evidence is considered. In the Bayesian model, the posterior probability distribution is an update to the prior, given the information from the data.

How do we choose a prior? If we have past information about the parameter, we can use an _informative_ prior. When no such information is available, we use an _uninformative_ prior.

---
### Jeffreys prior

Like the uniform prior, the Jeffreys prior is an uninformative prior. However, a uniform prior on a parameter does not stay uniform under a valid reparameterization: after a change of variables, the distribution is no longer equiprobable. The uniform prior is therefore uninformative in one parameterization but informative in another. The Jeffreys prior avoids this because it is invariant to the parameterization used to define it. That is, deriving the prior directly in the new coordinates gives the same distribution as transforming the prior from the old coordinates.
<br><br>
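Under a change of variables from $\theta$ to $\phi$, a density transforms as $p_{\phi}(\phi) = p_{\theta}(\theta)\left\lvert \frac{d\theta}{d\phi}\right\rvert$. A rule for constructing priors is invariant under reparameterization if the prior it produces for $\phi$ equals the prior it produces for $\theta$ transformed by this formula.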
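
To make the non-invariance of the uniform prior concrete, here is a minimal sketch that assumes $\theta \sim \text{Uniform}(0,1)$ and the reparameterization $\phi = \theta^2$. Both choices are illustrative and not taken from the notes. The change-of-variables formula gives $p_\phi(\phi) = 1/(2\sqrt{\phi})$, which is clearly not flat.

```python
import numpy as np

def density_after_squaring(phi):
    """Density of phi = theta**2 when theta ~ Uniform(0, 1).

    theta = sqrt(phi), so |d theta / d phi| = 1 / (2 sqrt(phi)) and
    p_phi(phi) = p_theta(theta) * |d theta / d phi| = 1 / (2 sqrt(phi)).
    """
    phi = np.asarray(phi, dtype=float)
    return 1.0 / (2.0 * np.sqrt(phi))

# Empirical check with a fixed seed (illustrative sample size).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=200_000)
phi = theta ** 2

# P(phi < 0.25) = P(theta < 0.5) = 0.5, even though [0, 0.25) is only a
# quarter of the unit interval: mass piles up near phi = 0.
frac_below_quarter = np.mean(phi < 0.25)
```

A flat prior on $\theta$ therefore puts about half of its mass on the first quarter of the $\phi$ range, which is an informative statement about $\phi$.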
<br><br>
The Fisher information transforms under parameterization:
$$
I_\phi(\phi) = I_\theta(\theta)\left(\frac{d\theta}{d\phi} \right)^2
$$
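This is not the transformation rule for a density, but taking the square root gives it:
$$
\sqrt{I_\phi(\phi)} = \sqrt{I_\theta(\theta)}\left\lvert\frac{d\theta}{d\phi}\right\rvert
$$
So we define the Jeffreys prior as $p_\theta(\theta)\propto \sqrt{I_\theta(\theta)}$, which transforms like a density and is therefore invariant under reparameterization.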
<br><br>
The Fisher information measures the amount of information that a random variable $X$ carries about an unknown parameter $\theta$ (the expectation is the integral over the values of $X$ with $\theta$ held fixed):
$$
I(\theta) = -E\left[\frac{d^2}{d\theta^2}\log f(X\vert \theta) \vert \theta\right]
$$
<br>
This is the expected curvature of the log-likelihood: the more sharply $\log f(X\vert \theta)$ curves as a function of $\theta$, the more informative the data are about $\theta$. In terms of the MLE, we solve $\frac{d}{d\theta}\log(L(\theta\vert X)) = 0$ such that $\frac{d^2}{d\theta^2}\log(L(\theta\vert X))<0$. A sharply "peaked" likelihood at the MLE corresponds to high Fisher information.
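
As a rough numerical illustration of the definitions above, the sketch below computes the Fisher information of a Bernoulli model by finite differences and checks that $\sqrt{I}$ transforms like a density under $\phi = \operatorname{logit}(\theta)$. The Bernoulli model, the logit reparameterization, and all function names are assumptions chosen for the example.

```python
import math

def log_f(x, theta):
    """Bernoulli log-likelihood log f(x | theta) for x in {0, 1}."""
    return x * math.log(theta) + (1 - x) * math.log(1.0 - theta)

def second_derivative(g, t, h=1e-4):
    """Central finite-difference approximation of g''(t)."""
    return (g(t + h) - 2.0 * g(t) + g(t - h)) / (h * h)

def sigmoid(phi):
    """Inverse of the logit map: theta = 1 / (1 + exp(-phi))."""
    return 1.0 / (1.0 + math.exp(-phi))

def fisher_information(theta):
    """I(theta) = -E[d^2/dtheta^2 log f(X | theta)], summing over X in {0, 1}."""
    total = 0.0
    for x in (0, 1):
        prob_x = math.exp(log_f(x, theta))  # f(x | theta) with theta held fixed
        total -= prob_x * second_derivative(lambda t: log_f(x, t), theta)
    return total

def fisher_information_logit(phi):
    """Fisher information computed directly in the phi = logit(theta) coordinates."""
    theta = sigmoid(phi)
    total = 0.0
    for x in (0, 1):
        prob_x = math.exp(log_f(x, theta))
        total -= prob_x * second_derivative(lambda p: log_f(x, sigmoid(p)), phi)
    return total

theta = 0.3
phi = math.log(theta / (1.0 - theta))
dtheta_dphi = theta * (1.0 - theta)  # derivative of the sigmoid at phi

jeffreys_theta = math.sqrt(fisher_information(theta))            # unnormalized
jeffreys_phi_direct = math.sqrt(fisher_information_logit(phi))   # built in phi
jeffreys_phi_transformed = jeffreys_theta * abs(dtheta_dphi)     # moved from theta
```

For the Bernoulli model $I(\theta) = 1/(\theta(1-\theta))$, so the Jeffreys prior is proportional to $\theta^{-1/2}(1-\theta)^{-1/2}$, the kernel of a $\text{Beta}(1/2, 1/2)$ distribution. The two values computed for $\phi$ agree up to finite-difference error, which is the invariance property.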

<br><br><br>
Sources:
* https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture6.pdf

---
### Inverse-gamma prior

**Lemma:** If $x_i\vert \mu,\sigma^2 \sim N(\mu,\sigma^2)$ independently with $\mu$ known, and $\sigma^2\sim IG(\alpha,\beta)$, then
$$
\sigma^2\vert x_1,x_2,\ldots,x_n \sim IG\left(\alpha+\frac{n}{2},\ \beta+\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^2\right)
$$
where the inverse gamma distribution is
$$
f(x\vert \alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}(1/x)^{\alpha+1}\exp(-\beta/x)
$$
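
The conjugate update in the lemma amounts to two additions. A minimal sketch, assuming a known mean and using the sum of squared deviations $\sum_i (x_i-\mu)^2$ in the scale update; the function name and the toy data are hypothetical.

```python
def inverse_gamma_posterior(alpha, beta, data, mu):
    """Posterior (alpha, beta) for sigma^2 under an IG(alpha, beta) prior.

    Assumes data are i.i.d. N(mu, sigma^2) with mu known.
    """
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)  # sum of squared deviations
    return alpha + n / 2.0, beta + sum_sq / 2.0

# Toy example: prior IG(2, 1), three observations, known mean 2.
alpha_post, beta_post = inverse_gamma_posterior(
    alpha=2.0, beta=1.0, data=[1.0, 2.0, 3.0], mu=2.0
)

# The mean of IG(a, b) is b / (a - 1) for a > 1.
posterior_mean = beta_post / (alpha_post - 1.0)
```

Here the posterior is $IG(3.5, 2)$ with mean $0.8$: each observation adds $1/2$ to the shape and half its squared deviation to the scale.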
<br><br>
Why does this matter?
<br>
A popular kernel in Gaussian process regression (GPR) is the squared exponential:
$$
\kappa(\mathbf{x}_i,\mathbf{x}_j) = \sigma_f^2 \exp\left(-\frac{1}{2l^2}
  (\mathbf{x}_i - \mathbf{x}_j)^T
  (\mathbf{x}_i - \mathbf{x}_j)\right)
$$

The (marginal) likelihood, or evidence of the joint normal distribution defining a GPR with $0$ as the mean function is the following:
$$
p(y\vert X,\theta) = (2\pi)^{-\frac{n}{2}}\lvert K_y\rvert^{-\frac{1}{2}}\exp\left[-\frac{1}{2}y^TK_y^{-1}y\right]
$$
Since we must specify priors for the parameters of the covariance function, which typically include positive scale parameters (such as the signal variance $\sigma_f^2$ and the squared length scale $l^2$), the inverse-gamma prior is a convenient choice: its support is $(0,\infty)$, so the hyperparameter is guaranteed to stay positive.
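
To show how such a prior enters in practice, the sketch below evaluates an unnormalized log posterior for the signal variance $\sigma_f^2$: the GPR log marginal likelihood plus an inverse-gamma log prior. It assumes $K_y = K + \sigma_n^2 I$ with a noise standard deviation $\sigma_n$, which the notes do not define, and uses the $-\frac{1}{2}\log\lvert K_y\rvert$ determinant term of the multivariate normal. All function names and the toy data are hypothetical.

```python
import math
import numpy as np

def squared_exponential_kernel(X1, X2, sigma_f, length):
    """kappa(x_i, x_j) = sigma_f^2 * exp(-||x_i - x_j||^2 / (2 l^2))."""
    diff = X1[:, None, :] - X2[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length ** 2)

def log_marginal_likelihood(X, y, sigma_f, length, sigma_n):
    """log p(y | X, theta) for a zero-mean GP, via a Cholesky factorization."""
    n = X.shape[0]
    # Assumption: K_y is the kernel matrix plus i.i.d. Gaussian noise variance.
    K_y = squared_exponential_kernel(X, X, sigma_f, length) + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    half_log_det = np.sum(np.log(np.diag(L)))            # 0.5 * log|K_y|
    return -0.5 * y @ alpha - half_log_det - 0.5 * n * math.log(2.0 * math.pi)

def log_inverse_gamma_pdf(x, alpha, beta):
    """Log density of IG(alpha, beta); -inf outside the support x > 0."""
    if x <= 0:
        return -math.inf
    return (alpha * math.log(beta) - math.lgamma(alpha)
            - (alpha + 1.0) * math.log(x) - beta / x)

def log_posterior_unnormalized(X, y, sigma_f_sq, length, sigma_n, alpha, beta):
    """Log marginal likelihood plus the IG log prior on sigma_f^2."""
    log_prior = log_inverse_gamma_pdf(sigma_f_sq, alpha, beta)
    if log_prior == -math.inf:
        return -math.inf  # non-positive variances have zero prior mass
    sigma_f = math.sqrt(sigma_f_sq)
    return log_marginal_likelihood(X, y, sigma_f, length, sigma_n) + log_prior

# Toy data, purely illustrative.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.5, -0.2, 0.1])
value = log_posterior_unnormalized(
    X, y, sigma_f_sq=1.0, length=1.0, sigma_n=0.1, alpha=2.0, beta=1.0
)
```

Because the prior returns $-\infty$ for non-positive values, an optimizer or sampler working with this log posterior never accepts a negative variance.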

<br><br><br>
Sources:
* https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf