**Expectation Maximization** <br>

We want to find the parameters $\theta$ that maximize the log-likelihood given some observed data $x$.

\begin{align}
\argmax_{\theta} \mathrm{l}(\theta)=\argmax_{\theta} \log p(x\mid\theta)
\end{align}

If, however, the full data is given by $(z,x)$, where $z$ is a hidden (latent) random variable whose distribution $p(z)$ is unknown, the likelihood of the observed data is obtained by marginalizing over $z$:

\begin{align}
\log p(x\mid\theta)=\log \sum_{z\in Z} p(x,z\mid\theta)
\end{align}

Optimizing this expression is difficult because the logarithm is applied to the sum, so the expression does not decompose into separate terms. In this scenario an analytic solution is generally not available and we resort to a technique called expectation maximization (EM). We introduce an arbitrary probability distribution $q(z)$ over the hidden variable (with $q(z)>0$ wherever $p(x,z\mid\theta)>0$) and use Jensen's inequality to get:

\begin{align}
\log p(x\mid\theta) &= \log \sum_{z\in Z} p(x,z\mid\theta) \\
&= \log \sum_{z\in Z} q(z) \frac{p(x,z\mid\theta)}{q(z)} \\
&= \log \mathrm{E}_{z \sim q(z)}  \left[ \frac{p(x,z\mid\theta)}{q(z)} \right] \\
&\underset{\mathrm{Jensen's}}{\geq} \mathrm{E}_{z \sim q(z)}  \left[ \log \frac{p(x,z\mid\theta)}{q(z)} \right] \\
&=\underbrace{\mathrm{E}_{z \sim q(z)}  \log p(x,z\mid\theta)}_{\equiv Q(\theta)} -  \mathrm{E}_{z \sim q(z)} \log q(z) \\
&\equiv \mathrm{ELBO}
\end{align}
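
The bound can be checked numerically. The following is a minimal sketch, assuming a made-up joint table `p_xz` for one fixed observation and a latent variable with three states; the names `elbo` and `random_distribution` are illustrative only. It draws many arbitrary distributions $q(z)$ and records how far the ELBO lies below $\log p(x\mid\theta)$.

```python
import math
import random

# Hypothetical toy model: joint probabilities p(x, z | theta) for one fixed
# observation x and a latent z with three states (numbers are made up).
p_xz = [0.10, 0.25, 0.05]

# Exact log-likelihood: logarithm of the sum over z.
log_px = math.log(sum(p_xz))


def elbo(q, p_joint):
    """E_{z~q}[log p(x,z|theta) - log q(z)], skipping states with q(z) = 0."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, p_joint) if qz > 0.0)


def random_distribution(k, rng):
    """Draw an arbitrary distribution q(z) over k states."""
    w = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(w)
    return [wi / total for wi in w]


rng = random.Random(0)
gaps = [log_px - elbo(random_distribution(len(p_xz), rng), p_xz)
        for _ in range(1000)]
min_gap = min(gaps)  # Jensen's inequality predicts min_gap >= 0
print(f"log p(x) = {log_px:.4f}, smallest gap to the ELBO = {min_gap:.4f}")
```

The gap should never be negative, whatever $q(z)$ is drawn.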

Since the second term, $-\mathrm{E}_{z \sim q(z)} \log q(z)$ (the entropy of $q$), is independent of $\theta$, maximizing the $\mathrm{ELBO}$ with respect to $\theta$ is equivalent to maximizing $Q(\theta)$. To evaluate $Q(\theta)$ we need to compute the expectation under the distribution $q(z)$, which is still unknown. However, by rearranging the expression for the $\mathrm{ELBO}$ we find:

\begin{align}
\mathrm{ELBO} &=\mathrm{E}_{z \sim q(z)}  \left[ \log \frac{p(x,z\mid\theta)}{q(z)} \right] \\
&=\mathrm{E}_{z \sim q(z)}  \left[ \log \frac{p(x,z\mid\theta)p(z\mid x,\theta)}{q(z)p(z\mid x,\theta)} \right] \\
&=\mathrm{E}_{z \sim q(z)}  \left[ \log \frac{p(x\mid\theta)p(z\mid x,\theta) p(z\mid x,\theta)}{q(z)p(z\mid x,\theta)} \right] \\
&=\mathrm{E}_{z \sim q(z)}  \left[ \log \frac{p(x\mid\theta) p(z\mid x,\theta)}{q(z)} \right] \\
&= \log p(x\mid\theta) - \mathrm{KL}(q(z) \mid\mid p(z\mid x,\theta))
\end{align}
In the last step we used that $\log p(x\mid\theta)$ is independent of $z$ and that $q(z)$ sums to one. Since the Kullback-Leibler divergence is non-negative and vanishes only when both distributions coincide, the optimal $q(z)$ is given by $p(z\mid x,\theta)$: in this case the divergence is zero and the $\mathrm{ELBO}$ is maximized, becoming equal to $\log p(x\mid\theta)$. <br>
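
The decomposition of the ELBO into log-likelihood minus KL divergence can also be verified on a small example. This is a sketch under assumptions: the joint table `p_xz` and the distribution `q_arbitrary` are invented numbers, and the helper names are hypothetical.

```python
import math

# Hypothetical joint probabilities p(x, z | theta) for one fixed x (made up).
p_xz = [0.10, 0.25, 0.05]
p_x = sum(p_xz)                       # marginal likelihood p(x | theta)
log_px = math.log(p_x)
posterior = [p / p_x for p in p_xz]   # p(z | x, theta)


def elbo(q):
    """E_{z~q}[log(p(x,z|theta) / q(z))]."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, p_xz) if qz > 0.0)


def kl(q, p):
    """KL(q || p) for two discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)


q_arbitrary = [0.6, 0.3, 0.1]
kl_arbitrary = kl(q_arbitrary, posterior)

# ELBO + KL should reproduce log p(x | theta) for any q.
recovered = elbo(q_arbitrary) + kl_arbitrary

# With q equal to the posterior the KL term vanishes and the bound is tight.
tight_gap = log_px - elbo(posterior)
```

For the arbitrary $q$ the KL term is strictly positive, while for $q(z)=p(z\mid x,\theta)$ the gap should vanish up to floating-point error.
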
We now have two objectives: (1) maximizing $Q(\theta)$ with respect to $\theta$, and (2) choosing $q(z)=p(z\mid x,\theta)$. The expectation maximization algorithm alternates between these two steps until convergence:
* E-Step: compute $p(z\mid x,\theta)$ for the current $\theta$ and use it as $q(z)$.
* M-Step: update $\theta \leftarrow \argmax_{\theta} Q(\theta)$, keeping $q(z)$ fixed. <br>

The initial $\theta$ can be picked at random.
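
The alternation of the two steps can be sketched in code. The example below is a minimal illustration, not a general implementation: it assumes a toy model that is not taken from the text (an equal-weight mixture of two unit-variance Gaussians in which only the two means are unknown), and all function names and the synthetic data are hypothetical.

```python
import math
import random


def log_likelihood(xs, mus):
    """log p(x | theta) for an equal-weight mixture of unit-variance Gaussians."""
    total = 0.0
    for x in xs:
        comps = [0.5 * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)
                 for m in mus]
        total += math.log(sum(comps))
    return total


def e_step(xs, mus):
    """q(z) = p(z | x, theta): posterior component probabilities per data point."""
    resp = []
    for x in xs:
        w = [math.exp(-0.5 * (x - m) ** 2) for m in mus]
        s = sum(w)
        resp.append([wi / s for wi in w])
    return resp


def m_step(xs, resp):
    """argmax of Q(theta) for this toy model: responsibility-weighted means."""
    k = len(resp[0])
    return [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(k)]


def em(xs, mus, max_iter=200, tol=1e-9):
    """Repeat E- and M-step until the log-likelihood stops improving."""
    history = [log_likelihood(xs, mus)]
    for _ in range(max_iter):
        resp = e_step(xs, mus)
        mus = m_step(xs, resp)
        history.append(log_likelihood(xs, mus))
        if history[-1] - history[-2] < tol:
            break
    return mus, history


# Synthetic data (assumption): 200 points around -2 and 200 points around 3.
rng = random.Random(0)
xs = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
      + [rng.gauss(3.0, 1.0) for _ in range(200)])

# Random initial theta, as described above.
mus_hat, history = em(xs, mus=[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)])
```

The list `history` stores the log-likelihood after each iteration; for EM it should never decrease, which makes it a convenient sanity check.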

***Example: Gaussian Mixture Model (GMM)*** <br>

In the GMM the probability of observing the event $X_i=x$ is given by a mixture of $n$ Gaussian distributions $P(X_i=x\mid Z_i=k)$, $k=1,...,n$, each weighted by a factor $\pi_k$ describing the probability that $x_i$ was drawn from the $k$'th mixture component.

\begin{align}
P(X_i=x)=\sum_{k=1}^{n} \pi_{k}P(X_i=x\mid Z_i=k) \quad \mathrm{with} \quad \sum_{k=1}^{n} \pi_k =1
\end{align}
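
The mixture formula corresponds to a two-stage sampling procedure: first draw the hidden component $Z_i$ with probabilities $\pi_k$, then draw $X_i$ from the selected Gaussian. A minimal sketch, assuming made-up weights, means and standard deviations (the function name `sample_gmm` is hypothetical):

```python
import random


def sample_gmm(num_samples, weights, means, stds, rng):
    """Draw (x_i, z_i) pairs: z_i ~ Categorical(weights), x_i ~ N(mean, std^2)."""
    samples, labels = [], []
    for _ in range(num_samples):
        # Pick the hidden component index k with probability pi_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        labels.append(k)
        # Draw the observation from the k'th Gaussian.
        samples.append(rng.gauss(means[k], stds[k]))
    return samples, labels


rng = random.Random(0)
weights = [0.3, 0.7]   # pi_k, must sum to one
means = [-2.0, 3.0]    # mu_k
stds = [0.5, 1.0]      # sigma_k
xs, zs = sample_gmm(5000, weights, means, stds, rng)

# Empirical share of points drawn from the first component; close to pi_1.
fraction_first = zs.count(0) / len(zs)
```

In practice only `xs` is observed; the labels `zs` play the role of the hidden variable.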

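Since we are dealing with Gaussian distributions, $P(X_i=x\mid Z_i=k)=\mathcal{N}(x\mid\mu_k,\sigma^{2}_{k})$. For $N$ independent observations $x_1,...,x_N$ and parameters $\theta=\{\pi_k,\mu_k,\sigma^{2}_{k}\}_{k=1}^{n}$ the log-likelihood is given by:

\begin{align}
l(\theta)&=\log \prod_{i=1}^{N}\sum_{k=1}^{n} \pi_{k}\mathcal{N}(x_i\mid\mu_k,\sigma^{2}_{k}) \\
&= \sum_{i=1}^{N} \log \sum_{k=1}^{n} \pi_{k}\mathcal{N}(x_i\mid\mu_k,\sigma^{2}_{k})
\end{align}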
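
The log-likelihood of a Gaussian mixture can be evaluated directly from this formula. The sketch below assumes one-dimensional data and strictly positive weights; it uses the log-sum-exp trick for numerical stability, which is an implementation choice and not part of the derivation. The function names and the example data are hypothetical.

```python
import math


def log_normal_pdf(x, mu, var):
    """Log density of a one-dimensional Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def gmm_log_likelihood(xs, weights, means, variances):
    """Sum over data points of log sum_k pi_k * N(x_i | mu_k, sigma_k^2)."""
    total = 0.0
    for x in xs:
        terms = [math.log(w) + log_normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # log-sum-exp: subtract the largest term before exponentiating.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


xs = [-2.1, -1.9, 3.2, 2.7]   # made-up observations
ll_mixture = gmm_log_likelihood(xs, [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0])

# With a single component the mixture reduces to an ordinary Gaussian.
ll_single = gmm_log_likelihood(xs, [1.0], [0.0], [1.0])
ll_direct_single = sum(log_normal_pdf(x, 0.0, 1.0) for x in xs)
```

For this bimodal toy data the two-component mixture should attain a higher log-likelihood than the single standard Gaussian.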

If we knew from which of the Gaussians a given data point was sampled, we could find an analytic solution to this problem, since it would reduce to a separate MLE of a single Gaussian distribution for each component, with $\pi_k=N_k/\sum_{j} N_j$, where $N_k$ is the number of samples from the $k$'th Gaussian. Since these assignments are hidden, we apply EM to find the solution:

* E-step: Compute the posterior probability of the 