# Bayesian Monte Carlo Parameter Inference

In Bayesian statistical inference we want to find the distribution of $\theta$ given data $z_{i=1}^n$. That is, we want to find 

$$ p(\theta | z_{i=1}^n) $$ 

When using Monte Carlo Bayesian parameter inference, we sample from $p(\theta | z_{i=1}^n)$. The random samples we draw from $p(\theta | z_{i=1}^n)$ allow us to calculate estimates of expected value, mode, variance, quantiles, marginal distributions, and other summary statistics of $\theta$. 
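
As a rough illustration of how samples turn into summary statistics, the sketch below uses stand-in draws from a normal distribution in place of real posterior samples. The distribution $N(2, 0.5^2)$ and the helper name `summarize` are assumptions made only for this example; in practice the samples would come from a sampler.

```python
import random
import statistics

def summarize(samples):
    """Monte Carlo estimates of common summary statistics from 1-D samples."""
    ordered = sorted(samples)
    n = len(ordered)

    def quantile(q):
        # Empirical quantile: the order statistic at fraction q of the sample.
        return ordered[min(n - 1, int(q * n))]

    return {
        "mean": statistics.fmean(ordered),
        "variance": statistics.variance(ordered),
        "median": statistics.median(ordered),
        "interval_95": (quantile(0.025), quantile(0.975)),
    }

rng = random.Random(0)
# Stand-in for posterior draws (hypothetical: theta ~ N(2, 0.5^2)).
theta_samples = [rng.gauss(2.0, 0.5) for _ in range(10_000)]
summary = summarize(theta_samples)
print(summary)
```

The estimates get closer to the true values of the underlying distribution as the number of samples grows, which is what makes the sampling approach usable even when no closed form is available.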

This sampling approach is very different from the approach where we try to determine those quantities analytically from $p(\theta | z_{i=1}^n)$. Analytical derivation of those quantities is generally only possible if $p(\theta | z_{i=1}^n)$ is of a relatively simple form. In Bayesian parameter inference for more complicated statistical and machine learning models, $p(\theta | z_{i=1}^n)$ is generally not of a simple form, and we need to use sampling methods to get summary statistics of $p(\theta | z_{i=1}^n)$. 

To sample from $p(\theta | z_{i = 1}^n)$ we need an expression for it. Since we are in the Bayesian inference setting, Bayes' theorem gives us 

$$ p(\theta | z_{i = 1}^n) = \frac{p(z_{i = 1}^n | \theta)p(\theta)}{p(z_{i = 1}^n)} $$

And since we are going to use the Metropolis-Hastings sampling algorithm, which only needs the target density up to a normalizing constant, it's enough to have

$$ p(\theta | z_{i = 1}^n) \propto p(z_{i = 1}^n | \theta)p(\theta) $$
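
To make the role of the proportionality concrete, here is a minimal sketch of a random-walk Metropolis sampler (the special case of Metropolis-Hastings with a symmetric proposal) for a one-dimensional $\theta$. The function names, the Gaussian proposal, the step size, and the toy target are all assumptions chosen for illustration, not part of the setup above. The target is only known up to an additive constant in log space, which cancels in the acceptance ratio.

```python
import math
import random

def metropolis_hastings(log_unnorm_density, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density."""
    rng = random.Random(seed)
    theta = theta0
    log_p = log_unnorm_density(theta)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal, so the proposal densities cancel.
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_unnorm_density(proposal)
        # Accept with probability min(1, p_new / p_old). Any normalizing
        # constant appears in both terms and drops out of the ratio.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples

def log_target(theta):
    # Toy target (hypothetical): N(1, 2^2) up to an arbitrary constant (+5.0).
    return -0.5 * ((theta - 1.0) / 2.0) ** 2 + 5.0

draws = metropolis_hastings(log_target, theta0=0.0, n_samples=20_000, step=2.0)
kept = draws[2_000:]  # discard an initial burn-in segment
post_mean = sum(kept) / len(kept)
post_var = sum((d - post_mean) ** 2 for d in kept) / len(kept)
print(post_mean, post_var)
```

Working with log densities avoids numerical underflow when the likelihood is a product over many data points. The burn-in length and step size here are ad hoc choices and would normally need tuning.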


Let's first look at the prior distribution of $\theta$ 

$$ p(\theta) $$

We have to specify this for our parameter $\theta = (a, b, \sigma^2)$ in order to sample from $p(\theta | z_{i=1}^n)$. This is the prior, which encodes our knowledge about the parameters of the data generating process before seeing the data. Specifying it is mandatory in the Bayesian inference setting. 
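
As a sketch of what specifying $p(\theta)$ can look like in code, the function below evaluates a log prior for $\theta = (a, b, \sigma^2)$. The choice of independent priors, $a, b \sim N(0, 10^2)$ and $\sigma^2 \sim \text{Exponential}(1)$, is purely an assumption for illustration; the text above does not prescribe any particular prior.

```python
import math

def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def log_prior(theta):
    """Log prior of theta = (a, b, sigma2) under hypothetical independent priors."""
    a, b, sigma2 = theta
    if sigma2 <= 0:
        return -math.inf  # a variance must be positive
    # Assumed priors: a, b ~ Normal(0, 10^2); sigma2 ~ Exponential(rate=1),
    # whose log density is -sigma2 for sigma2 > 0.
    return (log_normal_pdf(a, 0.0, 10.0)
            + log_normal_pdf(b, 0.0, 10.0)
            - sigma2)

print(log_prior((2.0, 1.0, 0.5)))
```

Returning $-\infty$ outside the support means a sampler that works with log densities will never accept a proposal with a non-positive variance.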




## Unsupervised Learning

## Supervised Learning

Let's first look at 

$$ p(z_{i = 1}^n | \theta) = p((x, y)_{i = 1}^n | \theta) $$ 

The parameter $\theta = (\theta_1, \theta_2)$ describes the conditional distribution of $Y$ given $X = x$. This means that the conditional distribution $Y | X = x$ only partly describes the joint distribution of $Z = (X, Y)$. The parameter $\theta$, however, doesn't say anything about the distribution of $X$. To fully describe $p(z_{i = 1}^n | \theta) = p((x, y)_{i = 1}^n | \theta)$ we need the marginal distribution of $X$ under $\theta$, that is, $p(x_{i=1}^n | \theta)$. Then we can describe $p(z_{i = 1}^n | \theta)$ using the chain rule of probability as

$$ p(z_{i = 1}^n | \theta) = p((x,y)_{i = 1}^n | \theta) = p(y_{i = 1}^n | x_{i = 1}^n, \theta) p(x_{i=1}^n | \theta) $$

However, in the supervised learning setting we often assume that $X$, which describes the independent variables, is actually not random. That is, we set

$$ p(z_{i = 1}^n | \theta) = p(y_{i = 1}^n | \theta; x_{i=1}^n) $$

And consequently we sample from

$$ p(\theta | y_{i = 1}^n; x_{i=1}^n) \propto p(y_{i = 1}^n | \theta; x_{i=1}^n)p(\theta) $$
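
The unnormalized posterior above can be sketched in code as a log likelihood plus a log prior, with $x_{i=1}^n$ entering only as fixed inputs. The model used here, $y_i = a x_i + b + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ and $\theta = (a, b, \sigma^2)$, is an assumption for illustration, as are the toy data and the placeholder flat prior.

```python
import math

def log_likelihood(theta, xs, ys):
    """log p(y | theta; x) for the assumed model y_i = a*x_i + b + N(0, sigma2) noise."""
    a, b, sigma2 = theta
    if sigma2 <= 0:
        return -math.inf
    n = len(ys)
    # xs are treated as fixed numbers, not as random variables.
    sq_resid = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * sq_resid / sigma2

def log_unnorm_posterior(theta, xs, ys, log_prior):
    """log p(y | theta; x) + log p(theta), i.e. the posterior up to a constant."""
    return log_likelihood(theta, xs, ys) + log_prior(theta)

def flat_prior(theta):
    # Placeholder (hypothetical) improper prior: flat wherever sigma2 > 0.
    return 0.0 if theta[2] > 0 else -math.inf

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 7.0]  # made-up data, roughly y = 2x + 1

good = log_unnorm_posterior((2.0, 1.0, 0.1), xs, ys, flat_prior)
bad = log_unnorm_posterior((0.0, 0.0, 0.1), xs, ys, flat_prior)
print(good, bad)
```

A function like `log_unnorm_posterior` is what a Metropolis-Hastings sampler would evaluate at each proposed $\theta$; no density for $X$ appears anywhere in it.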

From a technical point of view, treating $X$ as deterministic makes our life easier because there is one less source of randomness to take into account. Treating $X$ as random might make sense if we have some justified assumption about the distribution of $X$. For the examples here, and probably for most applications of Bayesian parameter inference in supervised learning settings, $X$ is treated as deterministic. I could imagine that this approach leads to some kind of underestimation of uncertainty in the Bayesian inference setting. How severe that is, is difficult for me to say at this point. 

Note that treating $X$ as uniformly distributed over some interval $[c, d] \subset \mathbb{R}$ and calculating the marginal distribution $p(y_{i=1}^n | \theta)$ wouldn't work because we would have 

$$ p(y_{i=1}^n | \theta) = \int_c^d{p(y_{i=1}^n | x_{i=1}^n, \theta) p(x_{i=1}^n | \theta)}dx \propto \int_c^d{p(y_{i=1}^n | x_{i=1}^n, \theta)}dx = const + n_{\theta_2}(\omega) $$

That is, integrating out $x_{i=1}^n$ would leave us with the marginal distribution for $y_{i=1}^n$, in which the functional relationship between $X$ and $Y$ would be lost and we would only be left with the random component $n_{\theta_2}(\omega)$.


## Statistical Inference Alternatives

Note that we could also use a different statistical inference approach to estimate $\theta$, such as maximum likelihood estimation. Remember the difference between $\hat{\theta}_{MAP}$ and $\hat{\theta}_{MLE}$:

$$ \hat{\theta}_{MAP} = \underset{\theta \in \Theta}{\arg\max}\; p(\theta | z_{i=1}^n) $$

$$ \hat{\theta}_{MLE} = \underset{\theta \in \Theta}{\arg\max}\; L(z_{i=1}^n; \theta) $$

$L$ is the likelihood function. Viewed as a function of $\theta$, $L$ is not a probability distribution over $\theta$, so in maximum likelihood inference we can't infer parameters by drawing random samples of $\theta$ the way we do from the posterior in the Bayesian inference setting.
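
The difference between the two estimators can be seen in a small worked case where both have closed forms. The sketch below assumes i.i.d. normal data with unknown mean $\theta$ and known variance $\sigma^2$, plus a normal prior $\theta \sim N(\mu_0, \tau^2)$. This model, the function names, and the numbers are all assumptions chosen for illustration.

```python
def mle_mean(data):
    """MLE of a normal mean: the sample average."""
    return sum(data) / len(data)

def map_mean(data, sigma2, mu0, tau2):
    """MAP estimate of a normal mean with known variance sigma2 and prior N(mu0, tau2).

    Obtained by setting the derivative of the log posterior to zero:
    theta = (tau2 * sum(data) + sigma2 * mu0) / (n * tau2 + sigma2).
    """
    n = len(data)
    return (tau2 * sum(data) + sigma2 * mu0) / (n * tau2 + sigma2)

data = [2.1, 1.9, 2.4, 2.0]  # made-up observations
theta_mle = mle_mean(data)
theta_map = map_mean(data, sigma2=1.0, mu0=0.0, tau2=1.0)
print(theta_mle, theta_map)
```

With these numbers the MAP estimate is pulled from the sample mean toward the prior mean $\mu_0 = 0$. As $\tau^2$ grows, the prior becomes less informative and the MAP estimate approaches the MLE.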