# Generalized EM algorithm

For the setup, assume we have the parameterized distributions $p(z)$ and $p_{\theta}(x|z)$, where $z$ denotes a latent variable. Our goal is to infer the unobserved latent variable given observed data $\{x_i\}_{i\in [n]}$. Formally, we want to estimate the posterior

$$p(z=j|x) = \frac{p_{\theta}(x|z=j)p(z=j)}{\sum_k p_{\theta}(x|z=k)p(z=k)}$$

Here we assume that $z$ takes finitely many values; the case of infinitely many values is discussed in the next section. If we know the distributions $p(z)$ and $p_{\theta}(x|z)$, then computing the posterior is straightforward. The real problem is therefore how to estimate the parameters $\theta$ given only partially observed data.
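As a small illustration of the posterior formula above, here is a minimal sketch that applies Bayes' rule when $z$ takes two values. The unit-variance Gaussian likelihood, the helper names `gaussian_pdf` and `posterior`, and the numeric values are assumptions chosen for illustration; the notes do not fix a particular model.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x; an assumed stand-in for p_theta(x | z=j)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior(x, prior, means):
    """Return the list [p(z=j | x)] over all j, computed with Bayes' rule."""
    # Numerator terms p_theta(x | z=j) p(z=j), one per latent value j.
    joint = [p * gaussian_pdf(x, m) for p, m in zip(prior, means)]
    # Denominator: sum over all latent values k, i.e. the marginal p(x).
    evidence = sum(joint)
    return [v / evidence for v in joint]

prior = [0.5, 0.5]    # p(z=j), assumed known
means = [-2.0, 2.0]   # theta: one mean per latent value (hypothetical values)

post_at_zero = posterior(0.0, prior, means)
post_at_two = posterior(2.0, prior, means)
```

With the parameters known, the posterior is a single normalization, which is why the hard part is estimating $\theta$ rather than computing $p(z=j|x)$.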

To do so, a natural starting point is the log-likelihood

$$\log p(x) = \log \sum_j p_\theta(x, z=j) = \log \sum_j p_\theta (x|z=j)p(z=j)$$

One way of optimizing over $\theta$ is to use a combination of maximum likelihood estimation and gradient descent. Here we instead discuss the EM approach, as in the previous section. The difficulty in optimizing this expression is that the sum sits inside the logarithm, so the logarithm cannot be applied to each term separately. We can mitigate this by weighting each term with an auxiliary distribution. Given any distribution $q(z=j)$ that is positive wherever $p_\theta(x, z=j)$ is positive, we have

$$\begin{align*}
   \log p(x) &= \log \sum_j p_\theta(x, z=j)\\
   &=  \log \sum_j p_\theta(x, z=j) \frac{q(z=j)}{q(z=j)}\\
   &\geq \sum_j \log\bigg(\frac{p_\theta(x, z=j)}{q(z=j)}\bigg)q(z=j)
\end{align*}$$

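where the third line follows from Jensen's inequality applied to the concave logarithm. The lower bound holds for every such $q$, so we can choose $q$ to make it tight. Recall that equality holds in Jensen's inequality when the random variable inside the expectation is constant almost everywhere; here that variable is the ratio $p_\theta(x, z=j)/q(z=j)$. Therefore, we choose $q(z=j)$ so that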

$$q(z=j)\propto p_\theta(x, z=j)\propto p(z=j|x)$$

Choosing $q(z=j)$ to be the posterior $p(z=j|x)$ therefore gives us a tight bound

$$\begin{align*}
   \log p(x)
   &= \sum_j \log\bigg(\frac{p_\theta(x, z=j)}{p(z=j|x)}\bigg)p(z=j|x)\\
   &= \sum_j \log p_\theta(x, z=j)p(z=j|x) - \sum_j p(z=j|x)\log p(z=j|x)
\end{align*}$$
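The bound and its tightness can be checked numerically. The sketch below assumes a two-component, unit-variance Gaussian model with hypothetical parameter values, and compares the lower bound for a uniform $q$ against the bound obtained with $q$ set to the posterior; the function names are illustrative only.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x; an assumed stand-in for p_theta(x | z=j)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def joint(x, prior, means):
    """Return [p_theta(x, z=j)] = [p_theta(x | z=j) p(z=j)] over all j."""
    return [p * gaussian_pdf(x, m) for p, m in zip(prior, means)]

def lower_bound(x, q, prior, means):
    """Jensen lower bound: sum_j q(z=j) log( p_theta(x, z=j) / q(z=j) )."""
    return sum(qj * math.log(pj / qj)
               for qj, pj in zip(q, joint(x, prior, means)) if qj > 0)

# Hypothetical observation and parameters.
x, prior, means = 1.0, [0.3, 0.7], [-2.0, 2.0]

joint_values = joint(x, prior, means)
log_px = math.log(sum(joint_values))                 # exact log p(x)

uniform_q = [0.5, 0.5]                               # an arbitrary choice of q
posterior_q = [v / sum(joint_values) for v in joint_values]  # q = p(z | x)

gap_uniform = log_px - lower_bound(x, uniform_q, prior, means)
gap_posterior = log_px - lower_bound(x, posterior_q, prior, means)
```

Under these assumptions `gap_uniform` is strictly positive, while `gap_posterior` vanishes up to floating-point error, matching the claim that the posterior makes the bound tight.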

We can now optimize the log-likelihood iteratively. At the beginning, we randomly initialize the parameters. In the E step, we use the old parameters $\theta_{old}$ to compute the posterior over the latent variable:

$$p_{\text{old}}(z=j|x) = \frac{p_{\theta_{old}}(x|z=j)p(z=j)}{\sum_k p_{\theta_{old}}(x|z=k)p(z=k)}$$

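In the M step, we determine new parameters $\theta_{new}$ by maximizing the lower bound on the log-likelihood over $\theta$ while holding $p_{\text{old}}(z=j|x)$ fixed. The term of the bound that involves only $p_{\text{old}}(z=j|x)$ does not depend on $\theta$, so it can be dropped: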

$$\theta_{\text{new}} = \underset{\theta}{\arg\max}\; \sum_j \log p_{\theta}(x, z=j)p_{\text{old}}(z=j|x)$$
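The two steps can be sketched end to end on a toy problem. This is a minimal sketch under stated assumptions, not a general implementation: $p(z)$ is fixed and known, $p_\theta(x|z=j)$ is a unit-variance Gaussian whose mean is the only parameter, the objective is summed over the data points $x_i$, and the data and initial values are made up. For this particular model the M step has a closed form, namely the responsibility-weighted average of the data.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) at x; an assumed stand-in for p_theta(x | z=j)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def e_step(data, prior, means):
    """Compute p_old(z=j | x_i) for every data point x_i under the current means."""
    resp = []
    for x in data:
        joint = [p * gaussian_pdf(x, m) for p, m in zip(prior, means)]
        total = sum(joint)
        resp.append([v / total for v in joint])
    return resp

def m_step(data, resp, n_components):
    """Maximize sum_i sum_j p_old(z=j | x_i) log p_theta(x_i, z=j) over the means.

    For unit-variance Gaussians the maximizer is the weighted average of the data.
    """
    new_means = []
    for j in range(n_components):
        weight = sum(r[j] for r in resp)
        new_means.append(sum(r[j] * x for r, x in zip(resp, data)) / weight)
    return new_means

def log_likelihood(data, prior, means):
    """Sum over data points of log sum_j p_theta(x_i | z=j) p(z=j)."""
    return sum(math.log(sum(p * gaussian_pdf(x, m) for p, m in zip(prior, means)))
               for x in data)

data = [-2.1, -1.9, -2.3, -1.7, 1.8, 2.2, 2.0, 1.9]  # made-up observations
prior = [0.5, 0.5]                                   # p(z), assumed fixed and known
means = [-0.5, 0.5]                                  # initial theta (hypothetical)

history = [log_likelihood(data, prior, means)]
for _ in range(20):
    resp = e_step(data, prior, means)           # E step with theta_old
    means = m_step(data, resp, len(prior))      # M step gives theta_new
    history.append(log_likelihood(data, prior, means))
```

Tracking `history` is a convenient sanity check: with an exact M step the log-likelihood should never decrease from one iteration to the next.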