# Integrating Over Parameters, Tractability, MLE, MAP, Bayesian, and Probability Theory

## Probability Theory Basics

**Rule I**: $\displaystyle p(a)=\int p(a,b)\,db$

**Rule II**: $\displaystyle p(a\mid b)=\frac{p(a,b)}{p(b)}$

**Bayes Rule**: $\displaystyle p(a\mid b)=\frac{p(b\mid a)\cdot p(a)}{p(b)}$
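The three rules can be checked numerically on a small discrete example, where the integral in Rule I becomes a sum. The sketch below uses a made-up joint table `joint` (an assumption for illustration only) and verifies that Bayes rule gives the same conditional as Rule II.

```python
# Hypothetical joint distribution p(a, b) with a in {0, 1} and b in {0, 1, 2}.
# The numbers are invented for illustration and sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}

def marginal_a(a):
    """Rule I (discrete form): p(a) = sum over b of p(a, b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """Rule I applied to the other variable: p(b) = sum over a of p(a, b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    """Rule II: p(a | b) = p(a, b) / p(b)."""
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    """Rule II with the roles swapped: p(b | a) = p(a, b) / p(a)."""
    return joint[(a, b)] / marginal_a(a)

def bayes_a_given_b(a, b):
    """Bayes rule: p(a | b) = p(b | a) * p(a) / p(b)."""
    return cond_b_given_a(b, a) * marginal_a(a) / marginal_b(b)

# Both routes to p(a=1 | b=0) should agree.
print(cond_a_given_b(1, 0), bayes_a_given_b(1, 0))
```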

## Maximum Likelihood Estimation (MLE)

A method for estimating the parameters of a distribution from observed data.

We are given a set of points $\mathcal{D}=\{y_n\}_{n=1}^N$.
We _choose to model_ the points with a normal distribution $\mathcal{N}(\mu,\sigma)$, so the density of a new point $y$ is $p(y;\mu,\sigma)=\mathcal{N}(y;\mu,\sigma)$.
The goal is to be able to compute $p(y;\mu,\sigma)$ for any $y$.
To do that we need to estimate $\mu$ and $\sigma$, and that is what we use MLE for.

The data is all the information we have, so we want $p(\mathcal{D};\mu,\sigma)$ to be high.
We assume the data is i.i.d. given the parameters $\mu$ and $\sigma$.
We can therefore write

$$p(\mathcal{D};\mu,\sigma)=\prod_{n=1}^Np(y_n;\mu,\sigma)$$

What we actually look for are the parameters.
We are interested in the particular set of parameters for which $p(\mathcal{D};\mu,\sigma)$ is maximized.
Let $\mu^\star$ and $\sigma^\star$ denote this maximizer.
We search for
\begin{align}
\mu^\star=&\arg\max_\mu p(\mathcal{D};\mu,\sigma)\\
=&\arg\max_\mu\prod_{n=1}^Np(y_n;\mu,\sigma)\\
=&\arg\max_\mu\log\prod_{n=1}^Np(y_n;\mu,\sigma)\\
=&\arg\max_\mu\sum_{n=1}^N\log p(y_n;\mu,\sigma)
\end{align}

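We can apply a $\log$ to the product because the logarithm is strictly increasing, so it does not change the maximizer. It also turns the product into a sum, which is easier to differentiate.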
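Besides simplifying the math, the sum of logs is also what one would typically evaluate on a computer. The following sketch, which assumes $\sigma$ is the standard deviation and uses made-up data, shows that both objectives pick the same candidate for $\mu$, and that the plain product underflows to zero once there are many points.

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal density; sigma is assumed to be the standard deviation."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, mu, sigma):
    """Plain product of densities, as in the first lines of the derivation."""
    result = 1.0
    for y in data:
        result *= normal_pdf(y, mu, sigma)
    return result

def log_likelihood(data, mu, sigma):
    """Sum of log-densities, as in the last line of the derivation."""
    const = -math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return sum(const - 0.5 * ((y - mu) / sigma) ** 2 for y in data)

# On a small (invented) data set both objectives select the same candidate.
small = [0.2, 0.5, 0.3]
candidates = [0.0, 0.25, 0.5, 1.0]
best_by_product = max(candidates, key=lambda m: likelihood(small, m, 1.0))
best_by_log_sum = max(candidates, key=lambda m: log_likelihood(small, m, 1.0))

# On a larger (invented) data set the product underflows, the log-sum does not.
large = [0.1 * (i % 7) for i in range(2000)]
print(likelihood(large, 0.3, 1.0), log_likelihood(large, 0.3, 1.0))
```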

The same formulation can be written down for $\sigma^\star$ by maximizing over $\sigma$ instead of $\mu$.

In order to _actually_ get a value for $\mu^\star$ we replace $p(y_n;\mu,\sigma)$ with the definition of our model.
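Above it is a normal distribution, so, reading $\sigma$ as the standard deviation, it would be

$$p(y_n;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y_n-\mu)^2}{2\sigma^2}\right)\,.$$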
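For the normal model the maximizer can be worked out by hand. The derivation below is a sketch under the assumption that $\sigma$ denotes the standard deviation: substitute the normal density into the log-likelihood, differentiate with respect to $\mu$, and set the derivative to zero.

\begin{align}
\sum_{n=1}^N\log p(y_n;\mu,\sigma)=&-N\log\sigma-\frac{N}{2}\log(2\pi)-\frac{1}{2\sigma^2}\sum_{n=1}^N(y_n-\mu)^2\\
\frac{\partial}{\partial\mu}\sum_{n=1}^N\log p(y_n;\mu,\sigma)=&\frac{1}{\sigma^2}\sum_{n=1}^N(y_n-\mu)=0\\
\Rightarrow\quad\mu^\star=&\frac{1}{N}\sum_{n=1}^Ny_n
\end{align}

So $\mu^\star$ is the sample mean. Repeating the same steps for $\sigma$ gives $(\sigma^\star)^2=\frac{1}{N}\sum_{n=1}^N(y_n-\mu^\star)^2$.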

Depending on the model, there may be a closed-form solution for computing the maximizer (i.e., the parameterization for which the data is most likely), or one can resort to gradient-based optimization (for example, gradient descent on the negative log-likelihood) to find it.
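Both routes can be compared on the normal model. The sketch below is illustrative only: the data is synthetic, `sigma` is treated as the standard deviation, and the step size and iteration count are arbitrary choices. It computes the closed-form estimates and then recovers $\mu^\star$ by gradient ascent on the average log-likelihood.

```python
import math
import random

random.seed(0)
# Synthetic data; the "true" mean 2.0 and standard deviation 1.5 are assumptions.
data = [random.gauss(2.0, 1.5) for _ in range(500)]
n = len(data)

# Closed-form maximum likelihood estimates for the normal model.
mu_closed = sum(data) / n
sigma_closed = math.sqrt(sum((y - mu_closed) ** 2 for y in data) / n)

def grad_mu(mu, sigma):
    """Derivative of the average log-likelihood with respect to mu."""
    return sum(y - mu for y in data) / (n * sigma ** 2)

# Gradient ascent on mu, with sigma held fixed for simplicity.
mu = 0.0
learning_rate = 0.5  # arbitrary, small enough to converge for this data
for _ in range(200):
    mu += learning_rate * grad_mu(mu, sigma_closed)

print(mu_closed, mu)
```

Ascending the log-likelihood and descending its negative are the same procedure; the sign convention is a matter of taste.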

## Maximum A Posteriori (MAP)

A method for estimating the parameters of a distribution from observed data.
It works just like MLE, except that the probabilistic formulation is a bit different.

Previously, with MLE, we modeled $p(\mathcal{D};\mu,\sigma)$, that is, the probability of the data given the parameters.
We then looked for the parameters for which this probability is the highest.
Now we model the probability $p(\mu\mid\mathcal{D})$, that is, the probability of the parameters given the data.
Note that $\sigma$ is also a parameter; it is left out for readability.

Again, analogous to MLE, we seek the argument that maximizes the probability.
Mathematically speaking, that is

$$\mu^\star=\arg\max_\mu p(\mu\mid\mathcal{D})\,.$$

Following Bayes rule, 

$$p(\mu\mid\mathcal{D})=\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\,.$$

We can plug the rewritten version of $p(\mu\mid\mathcal{D})$ into the optimization objective from above:

\begin{align}
\mu^\star=&\arg\max_\mu p(\mu\mid\mathcal{D})\\
=&\arg\max_\mu\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\\
=&\arg\max_\mu p(\mathcal{D}\mid\mu)\cdot p(\mu)
\end{align}

Note that the denominator $p(\mathcal{D})$ can be dropped because it does not depend on $\mu$.

Just like we did with MLE, we first expand $p(\mathcal{D}\mid\mu)$ into a product over all data samples available to us, and second, apply a $\log$ to ease the optimization, without affecting the position of the maximum (the best parameter configuration).

\begin{align}
\mu^\star=&\arg\max_\mu p(\mathcal{D}\mid\mu)\cdot p(\mu)\\
=&\arg\max_\mu\left(\prod_{n=1}^Np(y_n\mid\mu)\right)\cdot p(\mu)\\
=&\arg\max_\mu\log\left(\left(\prod_{n=1}^N p(y_n\mid\mu)\right)\cdot p(\mu)\right)\\
=&\arg\max_\mu\left[\sum_{n=1}^N\log p(y_n\mid\mu)+\log p(\mu)\right]
\end{align}

This is very similar to MLE, with the only difference that there is a new component in the optimization problem that we need to consider.
It is the probability of the parameters, $p(\mu)$, also known as the **prior**.

This is where we have to make an explicit assumption about the parameter itself, before looking at the data, for example by saying that $\mu$ is normally distributed around a value we consider plausible.
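As a concrete sketch, assume a normal likelihood with known standard deviation `sigma` and a normal prior on $\mu$ with mean `mu0` and standard deviation `tau`. These names and all numbers below are hypothetical. For this particular pairing the MAP objective is quadratic in $\mu$, so the maximizer has a closed form, which the code cross-checks against a brute-force grid search.

```python
import math

def log_normal_pdf(x, mean, std):
    """Log of the normal density with the given mean and standard deviation."""
    return -math.log(std) - 0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mean) / std) ** 2

def map_objective(mu, data, sigma, mu0, tau):
    """Sum of log-likelihood terms plus the log-prior (the MAP objective)."""
    log_lik = sum(log_normal_pdf(y, mu, sigma) for y in data)
    log_prior = log_normal_pdf(mu, mu0, tau)
    return log_lik + log_prior

def map_closed_form(data, sigma, mu0, tau):
    """Maximizer of the objective above for a normal likelihood and normal prior."""
    precision_data = len(data) / sigma ** 2
    precision_prior = 1.0 / tau ** 2
    weighted_sum = sum(data) / sigma ** 2 + mu0 * precision_prior
    return weighted_sum / (precision_data + precision_prior)

data = [2.1, 1.7, 2.6, 2.3]       # invented observations
sigma, mu0, tau = 1.0, 0.0, 1.0   # assumed known noise level and prior

mu_mle = sum(data) / len(data)
mu_map = map_closed_form(data, sigma, mu0, tau)

# Brute-force check: evaluate the objective on a grid and take the best point.
grid = [i / 1000 for i in range(-1000, 4001)]
mu_grid = max(grid, key=lambda m: map_objective(m, data, sigma, mu0, tau))

print(mu_mle, mu_map, mu_grid)
```

The MAP estimate lands between the prior mean and the MLE; with more data the likelihood terms dominate and the two estimates move closer together.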

## Bayesian Modeling

Instead of just estimating a single value per parameter (like above just a scalar for the mean of our normal distribution), we now model the parameters as distributions themselves.

We keep the symbol $\mu$, but it now denotes a random variable rather than a fixed unknown value.

What we are interested in is modeling the probability of a new (test) data point $y$ given all the information we have, namely the data $\mathcal{D}$. It is called the **posterior predictive distribution**.

$$p(y\mid\mathcal{D})=\int p(y\mid\mu,\mathcal{D})\cdot p(\mu\mid\mathcal{D})\,d\mu$$

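We constructed the integral using Rule I and Rule II, both conditioned on $\mathcal{D}$: Rule I gives $p(y\mid\mathcal{D})=\int p(y,\mu\mid\mathcal{D})\,d\mu$, and Rule II factors the integrand into $p(y\mid\mu,\mathcal{D})\cdot p(\mu\mid\mathcal{D})$.
Integrating over $\mu$ is a choice, since one could marginalize over any variable, but it is a _helpful_ one.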

There are now two components in the integral.
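The first is $p(y\mid\mu,\mathcal{D})$, which equals $p(y\mid\mu)$ under the i.i.d. assumption: once $\mu$ is known, the data $\mathcal{D}$ carries no further information about $y$.
It is something we can compute by evaluating the density of our chosen model (here the normal distribution) at $y$.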

As for the second part, the so-called **posterior distribution of the parameters** $p(\mu\mid\mathcal{D})$, we can again apply Bayes rule as done in MAP above:

$$p(\mu\mid\mathcal{D})=\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\,.$$

This time around we cannot just discard $p(\mathcal{D})$, because we are not looking for the $\arg\max_\mu$ but for the actual probability $p(\mu\mid\mathcal{D})$.
By Rule I, $p(\mathcal{D})=\int p(\mathcal{D}\mid\mu)\cdot p(\mu)\,d\mu$.
We either need to compute this integral exactly, approximate it, or choose a conjugate prior, for which the posterior is known in closed form.
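Two of these options can be compared on a toy problem. The sketch below assumes a normal likelihood with known standard deviation `sigma` and a normal prior on $\mu$ with mean `mu0` and standard deviation `tau`, which form a conjugate pair. All names and numbers are hypothetical. It approximates $p(\mathcal{D})$ and the posterior predictive with a Riemann sum over a grid of $\mu$ values, and compares the result with the known closed form.

```python
import math

def normal_pdf(x, mean, std):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

data = [2.1, 1.7, 2.6, 2.3]       # invented observations
sigma, mu0, tau = 1.0, 0.0, 1.0   # assumed known noise level and prior

# Numerical route: discretize mu and replace integrals by Riemann sums.
step = 0.001
grid = [-10.0 + i * step for i in range(20001)]

def likelihood(mu):
    """p(D | mu) as a product over the i.i.d. data points."""
    return math.prod(normal_pdf(y, mu, sigma) for y in data)

unnormalized = [likelihood(mu) * normal_pdf(mu, mu0, tau) for mu in grid]
evidence = sum(unnormalized) * step            # approximates p(D)
posterior = [u / evidence for u in unnormalized]  # approximates p(mu | D)

def predictive_numeric(y):
    """Approximates the integral of p(y | mu) * p(mu | D) over mu."""
    return sum(normal_pdf(y, mu, sigma) * p for mu, p in zip(grid, posterior)) * step

# Conjugate route: the posterior over mu is again normal.
precision = 1.0 / tau ** 2 + len(data) / sigma ** 2
post_mean = (mu0 / tau ** 2 + sum(data) / sigma ** 2) / precision
post_std = math.sqrt(1.0 / precision)

def predictive_closed_form(y):
    """Posterior predictive: normal with the two variances added."""
    return normal_pdf(y, post_mean, math.sqrt(post_std ** 2 + sigma ** 2))

print(predictive_numeric(2.0), predictive_closed_form(2.0))
```

The grid approach only works here because $\mu$ is a single scalar; with many parameters the integral generally has to be approximated by other means.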