# Introduction to Latent Variable Models and Variational Methods

## CSCI E-82A
### Stephen Elston

## Latent Variable Models

We refer to a probabilistic model with **hidden variables** as a **latent variable model**. A latent variable model has three components:  

- **Visible or observed variables, $\nu$:** Data can be acquired for these variables from **emissions** of the values.
- **Hidden, unobserved, or latent variables, $h$:** The actual value of these variables is not observable and can only be estimated. 
- **Model parameters, $\theta$:** A vector of parameters which must be estimated for the model. 

In general, our goal is to find the joint distribution:   

$$p(\nu, h; \theta)$$

## Mixture Models

A mixture model allows us to represent complex probability distributions. There are many real-world cases where a single distribution would not be an accurate representation. For example:

- Missing value problems may require a mixture of distributions. 
- An unscrupulous casino may alternate between using fair and 'loaded' dice. The distributions of numbers shown by these two types of dice are quite different. An observer trying to model the full distribution will need to use a mixture of the two.  
- Returns of many financial assets are dependent on overall market conditions. These returns might follow a specific log-normal distribution for a period of time, and then, once investor sentiment changes, a different distribution.   
- Response rates to a promotional email offer might represent distributions for several populations. The offer might be for men's running shoes. One population of respondents is expected to be male athletes. However, there are other potential buyers who might be purchasing the shoes on behalf of a male athlete. The response rates of these populations could be quite different, and from just an email address there is no way to know which population each response comes from. 

Let $\nu_i$ be a real-valued vector of observed values in $\mathbb{R}^d$, and $h_i \in \{1, 2, 3, \ldots, K \}$ be a discrete-valued hidden variable. The DAG of the mixture model factorizes as: 

$$p(\nu, h) = p(\nu\ |\ h) p(h)$$

Here, $p(h = k) = \pi_k$ is the probability of the $k$th of the $K$ components of the mixture, with $\sum_{k=1}^{K} \pi_k = 1$. 
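The factorization $p(\nu, h) = p(\nu\ |\ h) p(h)$ suggests a simple way to generate data: first draw the hidden component, then draw the observation from that component. The following is a minimal sketch of this idea using the casino example, assuming a discrete observation (a die face) rather than a real-valued vector. The function name `sample_mixture`, the mixing probabilities, and the loaded-die probabilities are all hypothetical values chosen for illustration.

```python
import numpy as np

def sample_mixture(pi, component_probs, n, rng):
    """Ancestral sampling: draw h ~ p(h), then nu ~ p(nu | h)."""
    pi = np.asarray(pi, dtype=float)
    component_probs = np.asarray(component_probs, dtype=float)
    K, n_faces = component_probs.shape
    # Hidden component index for each draw, coded here as 0, ..., K-1
    h = rng.choice(K, size=n, p=pi)
    # Observed die face (1..n_faces), drawn from the chosen component
    nu = np.array([rng.choice(n_faces, p=component_probs[k]) for k in h]) + 1
    return nu, h

# Hypothetical numbers: the casino uses the fair die 80% of the time
pi = [0.8, 0.2]
fair = np.full(6, 1 / 6)
loaded = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

rng = np.random.default_rng(0)
nu, h = sample_mixture(pi, [fair, loaded], n=5000, rng=rng)

# Marginal p(nu) = sum_k pi_k p(nu | h = k), compared with observed frequencies
marginal = pi[0] * fair + pi[1] * loaded
empirical = np.bincount(nu, minlength=7)[1:] / nu.size
```

An observer only sees `nu`. The array `h` is available here only because the data are simulated, which is exactly the information a latent variable model must estimate.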

### Gaussian Mixture Model

One of the most widely used mixture models is the **mixture of Gaussian distributions**. GMMs are used in many applications, including engineering, medicine, and robot navigation. 

As the name implies, the GMM is a mixture of *K* individual Gaussian distributions, where the probability of the $k$th distribution is $\pi_k \in \{\pi_1, \pi_2, \ldots, \pi_K \}$. Each of the *K* distributions has a location parameter, $\mu_k \in \{\mu_1, \mu_2, \ldots, \mu_K\}$, and a covariance parameter, $\Sigma_k \in \{\Sigma_1, \Sigma_2, \ldots, \Sigma_K\}$. The parameter vector for one component of the latent variable model is then:  

$$\theta_k = (\pi_k, \mu_k, \Sigma_k)$$

The conditional probability distribution for a single component of the GMM can then be written:   

$$p(\nu\ |\ h = k) = \mathcal{N}(\nu; \mu_k, \Sigma_k)$$

What we actually observe is the marginal distribution of the visible variables. For the GMM we can find this marginal distribution as follows:  

$$p(\nu) = \sum_{k=1}^{K} p(\nu\ |\ h = k) p(h = k) =  \sum_{k=1}^{K} \pi_k \mathcal{N}(\nu; \mu_k, \Sigma_k)$$

Here, the hidden variable is marginalized out. The right-hand term is the expectation of $p(\nu\ |\ h)$ over the components of the mixture of Gaussians.
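As an illustration, the marginal density of a GMM can be evaluated directly from the weighted sum above. The following is a minimal sketch, assuming the univariate case, so each covariance $\Sigma_k$ reduces to a standard deviation. The function names `normal_pdf` and `gmm_pdf` and the parameter values are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian with mean mu and std. deviation sigma."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

def gmm_pdf(x, pi, mu, sigma):
    """Marginal density: sum over k of pi_k * N(x; mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)[..., None]   # trailing axis indexes k
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    return np.sum(pi * normal_pdf(x, mu, sigma), axis=-1)

# Hypothetical two-component mixture
pi = [0.3, 0.7]
mu = [-2.0, 1.5]
sigma = [0.5, 1.0]

grid = np.linspace(-10.0, 10.0, 4001)
density = gmm_pdf(grid, pi, mu, sigma)

# Riemann-sum check that the mixture density integrates to about 1
area = np.sum(density) * (grid[1] - grid[0])
```

Because the mixing probabilities sum to one and each component is a proper density, the mixture is itself a proper density, which the numerical integral `area` confirms approximately.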

## Variational Bayes 

We need a way to perform inference to find the parameter vector, $\theta$, of the latent variable model, $p(\nu, h; \theta)$. As has already been stated, exact inference is generally intractable for latent variable problems. However, there are practical approximate methods, which often work well.     

The Monte Carlo method has been applied to latent variable problems for decades. More recently **variational methods** have been gaining popularity. There are several key differences between Monte Carlo methods and variational methods:    

- Variational methods are **computationally more efficient** than Monte Carlo methods. This fact has led to the growth in the use of variational methods.
- We can readily determine when a variational approximation method has converged. This is generally not the case with Monte Carlo methods. 
- Variational methods use **local optimization**, and there is no guarantee that the **global optimum** will ever be found, whereas Monte Carlo methods will generally find the globally optimal solution eventually. This local convergence property is the price we pay for the efficiency of variational methods.  

### Review of Kullback-Leibler Divergence

Variational methods are based on the Kullback-Leibler divergence. Let's review some of the properties of the KL divergence.   

The KL divergence between two distributions, $p(x)$ and $q(x)$ is written:

$$\mathbb{D}_{KL}(P \parallel Q) = \sum_{x} p(x)\ \log_b \frac{p(x)}{q(x)}$$   

Some key properties of the KL divergence include:   

- $\mathbb{D}_{KL}(P \parallel Q) \ge 0$ for all $p(x)$ and $q(x)$.
- $\mathbb{D}_{KL}(P \parallel Q) = 0$ if and only if $p(x)= q(x)$.
- KL divergence is not symmetric, so in general $\mathbb{D}_{KL}(P \parallel Q) \ne \mathbb{D}_{KL}(Q \parallel P)$. This is why the term **divergence** is applied, and why this quantity cannot be considered a distance metric. 
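These properties are easy to check numerically for discrete distributions. The following is a minimal sketch that uses the standard non-negative form, $\sum_x p(x) \log_b \frac{p(x)}{q(x)}$, and assumes $q(x) > 0$ wherever $p(x) > 0$. The function name `kl_divergence` and the two example distributions are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """D_KL(P || Q) = sum_x p(x) log_b(p(x) / q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0          # terms with p(x) = 0 contribute zero by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.8, 0.1, 0.1])

kl_pq = kl_divergence(p, q)   # divergence of Q from P
kl_qp = kl_divergence(q, p)   # arguments swapped
kl_pp = kl_divergence(p, p)   # a distribution compared with itself
```

Here `kl_pq` and `kl_qp` are both positive but differ from each other, and `kl_pp` is zero, matching the three properties listed above.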

### The Variational Lower Bound

Our problem is to find the full vector of parameters, $\theta$, using just the data from the visible variables, $\nu$. The problem of finding the posterior distribution of $\theta$ given $\nu$ can be formulated as:   

$$p(\theta\ |\ \nu) \propto p(\nu\ |\ \theta) p(\theta) \propto \sum_h p(\nu, h\ |\ \theta) p(\theta)$$  

Here, $p(\theta)$ is the prior distribution of $\theta$. Since our goal is to find the value of $\theta$ that maximizes the posterior, $p(\theta\ |\ \nu)$, we can work with proportional relationships and therefore do not have to deal with the troublesome normalization constant, $p(\nu)$.

The variational approximation assumes that the joint posterior distribution of the hidden variables and the parameters can be factorized as follows:  

$$p(h, \theta\ |\ \nu) \approx q(h) q(\theta)$$

The variational approximation is achieved by finding the factors $q(h)$ and $q(\theta)$ that minimize the KL divergence between $q(h) q(\theta)$ and $p(h, \theta\ |\ \nu)$. Using these terms, the expanded KL divergence, and the properties stated in the previous section we find:  

$$\mathbb{D}_{KL}(q(h) q(\theta) \parallel p(h, \theta\ |\ \nu)) = 
\mathbb{E}_{q(h)} \big[ log(q(h)) \big] + 
\mathbb{E}_{q(\theta)} \big[ log(q(\theta)) \big] -
\mathbb{E}_{q(h) q(\theta)} \big[ log(p(h, \nu, \theta)) \big] +
log(p(\nu))
\ge 0$$

Rearranging these terms we find the bound on $log(p(\nu))$:

$$log(p(\nu)) \ge 
-\mathbb{E}_{q(h)} \big[ log(q(h)) \big] -
\mathbb{E}_{q(\theta)} \big[ log(q(\theta)) \big] +
\mathbb{E}_{q(h) q(\theta)} \big[ log(p(h, \nu, \theta)) \big]$$  

From the above, you can see that $log(p(\nu))$ does not depend on $q(h)$ or $q(\theta)$, so maximizing the lower bound with respect to $q(h)$ and $q(\theta)$ is equivalent to minimizing the KL divergence. Given the aforementioned factorization, this optimization can be performed coordinate-wise, making the problem tractable. 
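The relationship between the bound, the KL divergence, and $log(p(\nu))$ can be verified numerically on a toy problem. The following is a minimal sketch, assuming both $h$ and $\theta$ take a small number of discrete values so that every expectation is a finite sum. The joint table, the sizes `n_h` and `n_theta`, and the randomly chosen factors are hypothetical and carry no meaning beyond the check itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_theta = 3, 4   # hypothetical sizes of the discrete h and theta spaces

# Hypothetical joint p(h, nu, theta) evaluated at the fixed observed nu.
# Entries are positive and sum to p(nu) = 0.2.
joint = rng.random((n_h, n_theta))
joint *= 0.2 / joint.sum()
log_evidence = np.log(joint.sum())        # log p(nu)
posterior = joint / joint.sum()           # p(h, theta | nu)

# An arbitrary factorized approximation q(h) q(theta)
q_h = rng.dirichlet(np.ones(n_h))
q_theta = rng.dirichlet(np.ones(n_theta))
q = np.outer(q_h, q_theta)

# Lower bound: -E[log q(h)] - E[log q(theta)] + E[log p(h, nu, theta)]
elbo = (-np.sum(q_h * np.log(q_h))
        - np.sum(q_theta * np.log(q_theta))
        + np.sum(q * np.log(joint)))

# KL divergence between q(h) q(theta) and the true posterior
kl = np.sum(q * np.log(q / posterior))
```

In this sketch `elbo + kl` equals `log_evidence` up to floating-point error, so any change to the factors that raises the bound must lower the KL divergence by the same amount.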

### Steps in the Variational Algorithm

There are two alternating steps in the variational algorithm.  

- We would like to maximize the likelihood of $q(h)$. But, by definition, we cannot know the actual values of the hidden variables, $h$. However, we can compute an updated estimate, $q^{new}(h)$, using the values of the observed data, $\nu$, and the current estimates of $q(\theta)$ and $q^{old}(h)$. This process is often referred to as **hallucinating data**, since data for the hidden variables is manufactured. This process is also known as the **E-step** (expectation step), since the update is computed from expected values under the current estimate of $q(\theta)$.  
  
- Likewise, using the values of the observed data, $\nu$, and the current estimates of $q^{old}(\theta)$ and $q(h)$, we can compute an updated estimate, $q^{new}(\theta)$. This process is also known as the **M-step** (maximization step), since it maximizes the bound with respect to $q(\theta)$.  
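The alternation between the two steps can be sketched on a toy problem. The example below assumes both $h$ and $\theta$ are discrete, and it uses the standard mean-field coordinate updates, $q^{new}(h) \propto \exp\big(\mathbb{E}_{q(\theta)}[log(p(h, \nu, \theta))]\big)$ and $q^{new}(\theta) \propto \exp\big(\mathbb{E}_{q(h)}[log(p(h, \nu, \theta))]\big)$. These update formulas are not stated in the text above and are an assumption of this sketch, as are the helper names `normalize_log` and `elbo`, the table sizes, and the random joint table.

```python
import numpy as np

def normalize_log(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def elbo(q_h, q_theta, log_joint):
    """Lower bound on log p(nu) for a factorized q(h) q(theta)."""
    return (-np.sum(q_h * np.log(q_h))
            - np.sum(q_theta * np.log(q_theta))
            + q_h @ log_joint @ q_theta)

rng = np.random.default_rng(2)
n_h, n_theta = 3, 5   # hypothetical sizes of the discrete h and theta spaces
joint = rng.random((n_h, n_theta))
joint *= 0.2 / joint.sum()      # hypothetical p(h, nu, theta) at the observed nu
log_joint = np.log(joint)

q_h = np.full(n_h, 1.0 / n_h)              # initial q(h)
q_theta = np.full(n_theta, 1.0 / n_theta)  # initial q(theta)

history = [elbo(q_h, q_theta, log_joint)]
for _ in range(50):
    # E-step: update q(h) from the expectation of the log joint under q(theta)
    q_h = normalize_log(log_joint @ q_theta)
    # M-step: update q(theta) from the expectation of the log joint under q(h)
    q_theta = normalize_log(q_h @ log_joint)
    history.append(elbo(q_h, q_theta, log_joint))

log_evidence = np.log(joint.sum())
```

Each update maximizes the bound with respect to one factor while the other is held fixed, so the values in `history` never decrease and never exceed `log_evidence`. The bound can stall at a local optimum, consistent with the local convergence property of variational methods noted earlier.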

**The E-step:**   



**The M-step:**   

