# Latent variable models

In latent variable models, we assume that the observed data $\mathbf{x}$ depend on some unobserved variables $\mathbf{z}$, called latent variables. The Bayes net structure is shown below. Given a prior distribution $p(\mathbf{z})$ and a parameterized conditional distribution $p_{\theta}(\mathbf{x}|\mathbf{z})$, the log-likelihood objective becomes

$$\log p(\mathbf{x}) = \log \int p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z} = \log \mathbb{E}_{\mathbf{z}\sim p(\mathbf{z})}[p_{\theta}(\mathbf{x}|\mathbf{z})]$$

How can we optimize this objective? One way is to draw samples $\mathbf{z}_1, \dots, \mathbf{z}_K \sim p(\mathbf{z})$ from the prior and approximate the expectation by a sample average

$$\log p(\mathbf{x}) \approx \log \frac{1}{K}\sum_{i=1}^K p_{\theta}(\mathbf{x}|\mathbf{z}_i)$$
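As a minimal sketch of this estimator, consider a hypothetical one-dimensional model that is not from the text: prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x|z \sim \mathcal{N}(z, \sigma^2)$ with an assumed $\sigma = 0.5$. For this toy model the marginal is $\mathcal{N}(0, 1+\sigma^2)$, so the estimate can be compared with the exact value. All function names are illustrative.

```python
import math
import random

SIGMA = 0.5  # assumed observation noise of the toy likelihood (hypothetical)

def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)

def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))

def naive_log_marginal(x, num_samples, rng):
    """Estimate log p(x) by averaging p(x|z_i) over prior samples z_i ~ N(0, 1)."""
    log_likelihoods = []
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                              # sample from the prior
        log_likelihoods.append(log_normal_pdf(x, z, SIGMA))  # log p(x|z_i)
    return log_mean_exp(log_likelihoods)

def exact_log_marginal(x):
    """For this toy model the marginal of x is N(0, 1 + SIGMA^2)."""
    return log_normal_pdf(x, 0.0, math.sqrt(1.0 + SIGMA ** 2))

rng = random.Random(0)
x = 1.0
estimate = naive_log_marginal(x, 20000, rng)
exact = exact_log_marginal(x)
print(f"naive estimate: {estimate:.4f}, exact: {exact:.4f}")
```

The average is taken in log space (log-mean-exp) to avoid underflow when individual likelihoods are very small.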

However, this method becomes inefficient as the dimensionality of the latent space grows, because most samples from the prior land in regions where $p_{\theta}(\mathbf{x}|\mathbf{z})$ is negligible. A more efficient way is to use importance sampling. Note that the integral can be rewritten as

$$
\begin{align*}
    \log p(\mathbf{x}) &= \log \int p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}\\
    &=\log \int \frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}q(\mathbf{z})d\mathbf{z}\\
    &= \log \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}\bigg]
\end{align*}
$$
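The identity above can be checked numerically. The sketch below assumes a hypothetical toy model (prior $z \sim \mathcal{N}(0,1)$, likelihood $x|z \sim \mathcal{N}(z, 0.5^2)$) and an arbitrarily chosen Gaussian proposal $q = \mathcal{N}(0.7, 0.6^2)$. None of these choices come from the text.

```python
import math
import random

SIGMA = 0.5  # assumed observation noise of the toy likelihood (hypothetical)

def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)

def importance_log_marginal(x, q_mean, q_std, num_samples, rng):
    """Estimate log p(x) using samples from the proposal q = N(q_mean, q_std^2)."""
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)                    # z ~ q(z)
        log_w = (log_normal_pdf(x, z, SIGMA)            # log p(x|z)
                 + log_normal_pdf(z, 0.0, 1.0)          # + log p(z)
                 - log_normal_pdf(z, q_mean, q_std))    # - log q(z)
        log_weights.append(log_w)
    # Log of the average importance weight, computed stably.
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights) / num_samples)

rng = random.Random(0)
x = 1.0
is_estimate = importance_log_marginal(x, q_mean=0.7, q_std=0.6, num_samples=5000, rng=rng)
exact = log_normal_pdf(x, 0.0, math.sqrt(1.0 + SIGMA ** 2))  # marginal is N(0, 1 + SIGMA^2)
print(f"importance-sampling estimate: {is_estimate:.4f}, exact: {exact:.4f}")
```

The proposal here is deliberately wider than the region where the integrand has mass, which keeps the importance weights bounded.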

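where $q(\mathbf{z})$ can be any distribution that is nonzero wherever $p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})$ is nonzero. Because the logarithm is concave, Jensen's inequality lets us move it inside the expectation, which gives a lower bound on the objective.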

$$
\begin{align*}
    \log p(\mathbf{x})
    &= \log \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}\bigg]\\
    &\geq \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\log \frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}\bigg]\\
\end{align*}
$$
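The effect of moving the logarithm inside the expectation can be seen on a finite sample. The sketch below uses the same kind of hypothetical toy model (prior $\mathcal{N}(0,1)$, likelihood $\mathcal{N}(z, 0.5^2)$) and an assumed proposal $q = \mathcal{N}(0, 1.5^2)$. It compares the average of the log-weights with the log of the average weight.

```python
import math
import random

SIGMA = 0.5  # assumed observation noise of the toy likelihood (hypothetical)

def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)

def sample_log_weights(x, q_mean, q_std, num_samples, rng):
    """Return log [p(x|z) p(z) / q(z)] for samples z ~ q = N(q_mean, q_std^2)."""
    out = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        out.append(log_normal_pdf(x, z, SIGMA)
                   + log_normal_pdf(z, 0.0, 1.0)
                   - log_normal_pdf(z, q_mean, q_std))
    return out

rng = random.Random(1)
log_w = sample_log_weights(1.0, q_mean=0.0, q_std=1.5, num_samples=2000, rng=rng)

# Log inside the expectation: Monte Carlo estimate of the lower bound.
lower_bound = sum(log_w) / len(log_w)

# Log outside the expectation: importance-sampling estimate of log p(x).
m = max(log_w)
log_marginal = m + math.log(sum(math.exp(w - m) for w in log_w) / len(log_w))

gap = log_marginal - lower_bound
print(f"lower bound: {lower_bound:.4f}, log-marginal estimate: {log_marginal:.4f}")
```

On any finite sample the gap is nonnegative, since Jensen's inequality also applies to the empirical average.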

Ideally, we want the lower bound to be as tight as possible. We claim that the bound holds with equality when $q(\mathbf{z})$ is chosen to be the posterior distribution $p(\mathbf{z}|\mathbf{x})$.

````{prf:theorem} ELBO
:label: my-theorem 

The lower bound

$$\log p(\mathbf{x})\geq \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\log \frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}\bigg]$$

holds with equality when $q(\mathbf{z}) = p(\mathbf{z}|\mathbf{x})$.

````

````{prf:proof}
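Recall that Jensen's inequality holds with equality when the random variable inside the expectation is constant almost everywhere. Here, this means that the ratio $p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})/q(\mathbf{z})$ does not depend on $\mathbf{z}$, that is,

$$q(\mathbf{z}) \propto p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \propto p(\mathbf{z}|\mathbf{x})$$

Since both $q(\mathbf{z})$ and the posterior integrate to one, $q(\mathbf{z}) = p(\mathbf{z}|\mathbf{x})$ and equality holds. Below we provide another proof using the KL divergence.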
````

````{prf:proof}
Another way of proving the theorem is to rewrite the lower bound using $p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z}) = p(\mathbf{z}|\mathbf{x})p(\mathbf{x})$:

$$
\begin{align*}
    \log p(\mathbf{x})
    &= \log \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{q(\mathbf{z})}\bigg]\\
    &\geq \mathbb{E}_{\mathbf{z}\sim q(\mathbf{z})} \bigg[\log \frac{p(\mathbf{z}|\mathbf{x})p(\mathbf{x})}{q(\mathbf{z})}\bigg]\\
    &= -\text{KL}(q(\mathbf{z})||p(\mathbf{z}|\mathbf{x})) + \log p(\mathbf{x})
\end{align*}
$$

The last step uses the fact that $\log p(\mathbf{x})$ does not depend on $\mathbf{z}$. The bound becomes tight when $\text{KL}(q(\mathbf{z})||p(\mathbf{z}|\mathbf{x}))=0$, which happens exactly when $q(\mathbf{z}) = p(\mathbf{z}|\mathbf{x})$.

````
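The KL form of the gap can be verified in closed form on a hypothetical toy model that is not from the text: prior $z \sim \mathcal{N}(0,1)$, likelihood $x|z \sim \mathcal{N}(z, \sigma^2)$ with an assumed $\sigma = 0.5$, and Gaussian $q = \mathcal{N}(m, s^2)$. In this model the posterior is $\mathcal{N}\big(x/(1+\sigma^2),\ \sigma^2/(1+\sigma^2)\big)$, and every quantity below has an analytic expression. The function names are illustrative.

```python
import math

SIGMA = 0.5  # assumed observation noise of the toy likelihood (hypothetical)

def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at value."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)

def lower_bound(x, m, s):
    """Closed-form E_q[log p(x|z) p(z) / q(z)] for q = N(m, s^2)."""
    expected_log_lik = (-0.5 * math.log(2 * math.pi * SIGMA ** 2)
                        - ((x - m) ** 2 + s ** 2) / (2 * SIGMA ** 2))
    kl_to_prior = 0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s)  # KL(q || N(0, 1))
    return expected_log_lik - kl_to_prior

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

x = 1.0
log_px = log_normal_pdf(x, 0.0, math.sqrt(1.0 + SIGMA ** 2))  # exact log p(x)
post_mean = x / (1.0 + SIGMA ** 2)
post_std = math.sqrt(SIGMA ** 2 / (1.0 + SIGMA ** 2))

# Gap between log p(x) and the bound for two choices of q.
gap_at_posterior = log_px - lower_bound(x, post_mean, post_std)
gap_elsewhere = log_px - lower_bound(x, 0.0, 1.0)
kl_elsewhere = kl_gauss(0.0, 1.0, post_mean, post_std)

print(f"gap with q = posterior: {gap_at_posterior:.2e}")
print(f"gap with q = prior: {gap_elsewhere:.4f}, KL(q || posterior): {kl_elsewhere:.4f}")
```

With $q$ equal to the posterior the gap vanishes up to floating-point error, and for any other $q$ it matches $\text{KL}(q(\mathbf{z})||p(\mathbf{z}|\mathbf{x}))$.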

Theorem 1 suggests that choosing $q(\mathbf{z}) = p(\mathbf{z}|\mathbf{x})$ gives equality. In practice, however, the posterior is hard to compute because its normalizing constant is an integral over the whole latent space, which is usually intractable:

$$p(\mathbf{z}|\mathbf{x}) = \frac{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{\int p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}}$$

Therefore, we instead approximate $p(\mathbf{z}|\mathbf{x})$ with a member of a parameterized family of distributions $q_{\phi}(\mathbf{z}|\mathbf{x}) \in \mathcal{Q}$. We want to choose the $q_{\phi}$ that maximizes the evidence lower bound (ELBO), which is the member of $\mathcal{Q}$ closest to the posterior. In other words, we want to minimize $\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p_{\theta}(\mathbf{z}|\mathbf{x}))$. This KL cannot be evaluated directly because it contains the posterior, so we rewrite it as

$$
\begin{align*}
    \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p_{\theta}(\mathbf{z}|\mathbf{x})) &= \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log\frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p_{\theta}(\mathbf{z}|\mathbf{x})} d\mathbf{z}\\
    &= \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log\frac{q_{\phi}(\mathbf{z}|\mathbf{x})p(\mathbf{x})}{p_{\theta}(\mathbf{x}, \mathbf{z})} d\mathbf{z}\\
    &= \int q_{\phi}(\mathbf{z}|\mathbf{x}) (\log p(\mathbf{x})+\log\frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p_{\theta}(\mathbf{x}, \mathbf{z})}) d\mathbf{z}\\
    &= \log p(\mathbf{x}) + \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log\frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})} d\mathbf{z}\\
    &= \log p(\mathbf{x}) + \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p(\mathbf{z})) - \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]
\end{align*}
$$

Therefore, the lower bound can be written as

$$
\begin{align*}
\log p(\mathbf{x}) &\geq -\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p_{\theta}(\mathbf{z}|\mathbf{x})) + \log p(\mathbf{x})\\
&= -\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p(\mathbf{z})) + \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]
\end{align*}
$$

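Neither term on the right-hand side involves the intractable posterior or $\log p(\mathbf{x})$, so both can be evaluated or estimated by sampling from $q_{\phi}$. Our optimization problem is therefore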

$$\max_{\theta, \phi} -\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p(\mathbf{z})) + \mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$$
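To illustrate the maximization over $\phi$, the sketch below runs plain gradient ascent on a hypothetical toy model that is not from the text: prior $z \sim \mathcal{N}(0,1)$, likelihood $x|z \sim \mathcal{N}(z, \sigma^2)$ with $\sigma = 0.5$ held fixed, and $q_{\phi} = \mathcal{N}(m, s^2)$ with $\phi = (m, \log s)$. Both terms of the objective are available in closed form for this model, so the gradients are written out by hand. This is a simplification: $\theta$ is not trained here, and a realistic model would need stochastic gradient estimates for both $\theta$ and $\phi$.

```python
import math

SIGMA = 0.5  # assumed, fixed likelihood noise; theta is not trained in this sketch

def objective_gradients(x, m, rho):
    """Gradients of -KL(q || p(z)) + E_q[log p(x|z)] for q = N(m, s^2), s = exp(rho).

    Up to a constant, the objective is
        -((x - m)^2 + s^2) / (2 SIGMA^2) - 0.5 (s^2 + m^2 - 1) + log s.
    """
    s2 = math.exp(2.0 * rho)
    grad_m = (x - m) / SIGMA ** 2 - m
    grad_rho = 1.0 - s2 / SIGMA ** 2 - s2
    return grad_m, grad_rho

def fit_q(x, steps=500, lr=0.05):
    """Gradient ascent on phi = (m, rho), starting from q = prior."""
    m, rho = 0.0, 0.0
    for _ in range(steps):
        grad_m, grad_rho = objective_gradients(x, m, rho)
        m += lr * grad_m
        rho += lr * grad_rho
    return m, math.exp(rho)

x = 1.0
m_hat, s_hat = fit_q(x)

# Exact posterior of the toy model, for comparison.
post_mean = x / (1.0 + SIGMA ** 2)
post_var = SIGMA ** 2 / (1.0 + SIGMA ** 2)
print(f"fitted q: mean={m_hat:.4f}, var={s_hat ** 2:.4f}")
print(f"posterior: mean={post_mean:.4f}, var={post_var:.4f}")
```

Because the Gaussian family contains the true posterior in this toy model, the maximizer of the objective coincides with it. For a family $\mathcal{Q}$ that does not contain the posterior, the optimum would only be the closest member in the KL sense.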