(Q3) Variational Inference

$p(x_n|\tau, \lambda_1, \lambda_2) = \tau \mathcal{N}(x_n|0, \lambda_1^{-1}) + (1 - \tau)\mathcal{N}(x_n|0, \lambda_2^{-1})$

We have the following priors on the parameters

$\tau \sim \text{Beta}(\alpha_0, \alpha_0)$

$\lambda_1 \sim \text{Gamma}(a_0, b_0)$

$\lambda_2 \sim \text{Gamma}(c_0, d_0)$

(A) Define the model using latent variables $\mathbf{z} = \{z_i\}^N_{i=1}$ 

We formulate the model using latent variables $\mathbf{z} = (z_{1},...,z_{N})$, where each $z_n$ explicitly specifies the component
responsible for generating observation $x_n$. In detail,

$z_n = (z_{n1}, z_{n2})^T = 
\begin{cases} 
(1,0)^T, & (x_n \text{ is from } \mathcal{N}(x_n|0,\lambda_1^{-1})) \\
(0,1)^T, & (x_n \text{ is from } \mathcal{N}(x_n|0,\lambda_2^{-1}))
\end{cases}
$

and place a prior on the latent variables

$p(\mathbf{z}|\tau) = \prod^N_{n=1} \tau^{z_{n1}} (1- \tau)^{z_{n2}}$

The likelihood in the latent variable model is given by

$p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2) = \prod^N_{n=1} \mathcal{N}(x_n|0,\lambda_1^{-1})^{z_{n1}} \mathcal{N}(x_n|0,\lambda_2^{-1})^{z_{n2}}$

The joint distribution of all observed (${\mathbf{x}}$) and unobserved variables ($\mathbf{z}, \tau, \lambda_1,\lambda_2$) factorizes as follows

$p(\mathbf{x},\mathbf{z},\tau,\lambda_1,\lambda_2) = p(\tau) p(\lambda_1) p(\lambda_2) p(\mathbf{z}|\tau) p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)$

and the log of the joint distribution can correspondingly be written as

$\log p(\mathbf{x},\mathbf{z},\tau,\lambda_1,\lambda_2) = \log p(\tau) + \log p(\lambda_1) + \log p(\lambda_2) + \log p(\mathbf{z}|\tau) + \log p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)$

We approximate the posterior distribution $p(\mathbf{z},\tau,\lambda_1,\lambda_2|\mathbf{x}) $ using the factorized variational distribution $q(\mathbf{z}) q(\tau) q(\lambda_1) q(\lambda_2)$
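The factorization of the joint distribution above can be read as a recipe for ancestral sampling: draw the parameters from their priors, then the assignments, then the observations. The sketch below illustrates this with the standard library only. The function name `sample_mixture` and the default hyperparameter values are hypothetical choices for illustration, not values given in the exercise. The Gamma priors are taken in the shape/rate form.

```python
import random

def sample_mixture(n, alpha0=2.0, a0=2.0, b0=1.0, c0=2.0, d0=1.0, seed=0):
    """Ancestral sampling from p(tau) p(lambda_1) p(lambda_2) p(z|tau) p(x|z, lambda_1, lambda_2).

    Hyperparameter defaults are illustrative assumptions only.
    """
    rng = random.Random(seed)
    tau = rng.betavariate(alpha0, alpha0)      # tau ~ Beta(alpha0, alpha0)
    # random.gammavariate expects (shape, scale), so the rate is inverted.
    lam1 = rng.gammavariate(a0, 1.0 / b0)      # lambda_1 ~ Gamma(a0, b0)
    lam2 = rng.gammavariate(c0, 1.0 / d0)      # lambda_2 ~ Gamma(c0, d0)
    z, x = [], []
    for _ in range(n):
        from_first = rng.random() < tau        # z_n1 = 1 with probability tau
        lam = lam1 if from_first else lam2
        z.append((1, 0) if from_first else (0, 1))
        x.append(rng.gauss(0.0, lam ** -0.5))  # standard deviation = lambda^{-1/2}
    return tau, lam1, lam2, z, x

tau, lam1, lam2, z, x = sample_mixture(50)
```

Each $z_n$ is returned as a one-hot pair, matching the $(z_{n1}, z_{n2})^T$ encoding used above.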

################################################################################

(B) Derive the variational update for $\lambda_2$. You can assume the mean-field approximation

$q(\mathbf{z}, \tau, \lambda_1, \lambda_2) = q(\lambda_1)q(\lambda_2)q(\tau)\prod_n q(z_n)$

and assume the other factors are given by

$q(\tau) = \text{Beta}(\tau|\alpha_n,\beta_n)$

$q(z_{n1}) = \text{Bernoulli}(z_{n1}| r_{n1})$, where $r_{n1}$ is the updated responsibility

$q(\lambda_1) = \text{Gamma}(\lambda_1|a_n, b_n)$

Update of factor $q(\lambda_2)$

$\log q^*(\lambda_2) = E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}, \mathbf{z}, \tau, \lambda_1, \lambda_2)] + C$

=> $\log q^*(\lambda_2) = E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\tau) + \log p(\lambda_1) + \log p(\lambda_2) + \log p(\mathbf{z}|\tau) + \log p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)] + C$

To derive the variational update for $\lambda_2$, we need to find the optimal $q(\lambda_2)$ that minimizes the KL divergence $\mathrm{KL}(q \,\|\, p)$ between the approximating distribution $q(\mathbf{z})q(\tau)q(\lambda_1)q(\lambda_2)$ and the true posterior $p(\mathbf{z},\tau,\lambda_1,\lambda_2|\mathbf{x})$, with the other factors held fixed. This can be done by applying the coordinate ascent variational inference (CAVI) algorithm.

The CAVI update for $q(\lambda_2)$ is given by taking the expectation of the log joint distribution with respect to all other factors and then exponentiating the result. We need to keep only the terms that depend on $\lambda_2$. The remaining terms are constant with respect to this factor and can be absorbed into the constant $C$.

=> $\log q^*(\lambda_2) = E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\lambda_2)] + E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)] + C$

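Since $\log p(\lambda_2)$ does not depend on $\mathbf{z}$, $\tau$, or $\lambda_1$, its expectation is the term itself:

=> $\log q^*(\lambda_2) = \log p(\lambda_2) + E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)] + C$ (Eq. 1)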



According to the exercise, the prior is $p(\lambda_2) = \text{Gamma}(\lambda_2|c_0, d_0)$. Expanding the Gamma density for $\lambda_2$, we have:

$\log p(\lambda_2) = \log \text{Gamma}(\lambda_2|c_0, d_0) = \log \left[ \dfrac{{d_0}^{c_0}}{\Gamma(c_0)} \lambda_2^{c_0-1} \exp(-d_0 \lambda_2) \right] $

=> $\log p(\lambda_2) = c_0 \log d_0 - \log \Gamma(c_0) + (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 $ 


We can absorb the terms that are independent of $\lambda_2$ into the constant $C$:

=> $\log p(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + C$ (Eq. 2)


Additionally, we have
 
$E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1,\lambda_2)] = E_{q(\mathbf{z})q(\tau)q(\lambda_1)} \left[ \log \prod^N_{n=1} \mathcal{N}(x_n|0, \lambda_1^{-1})^{z_{n1}} \mathcal{N}(x_n| 0, \lambda_2^{-1})^{z_{n2}} \right]$
    
=> $E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1, \lambda_2)] = E_{q(\mathbf{z})q(\tau)q(\lambda_1)}
\left[\sum^N_{n=1} \left( z_{n1}\log\mathcal{N}(x_n|0, \lambda_1^{-1}) + z_{n2}\log\mathcal{N}(x_n| 0, \lambda_2^{-1}) \right)\right]$

The first term inside the sum is independent of $\lambda_2$ and can be absorbed into the constant $C$:
    
=> $E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1, \lambda_2)]  = E_{q(\mathbf{z})q(\tau)q(\lambda_1)}
[\sum^N_{n=1} z_{n2}\log\mathcal{N}(x_n| 0, \lambda_2^{-1})] + C$

Under $q(z_n)$ we have $E_{q(z_n)} [z_{n2}] = r_{n2} = 1 - r_{n1}$, the responsibility of component 2 for observation $x_n$ (since $z_{n2} = 1 - z_{n1}$ and $z_{n1}$ is Bernoulli with parameter $r_{n1}$). By linearity of expectation:

=> $E_{q(\mathbf{z})q(\tau)q(\lambda_1)} [\log p(\mathbf{x}|\mathbf{z},\lambda_1, \lambda_2)]  = 
\sum^N_{n=1} r_{n2}\log\mathcal{N}(x_n| 0, \lambda_2^{-1}) + C$ (Eq. 3)
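As a small illustration of the responsibility-weighted sum just derived, the following sketch evaluates $\sum_n r_{n2} \log \mathcal{N}(x_n|0, \lambda_2^{-1})$ for a given value of $\lambda_2$. The function name `expected_loglik_lambda2` and the plain-list inputs are assumptions made for this example.

```python
import math

def expected_loglik_lambda2(x, r2, lam2):
    """Responsibility-weighted Gaussian log-likelihood for component 2.

    x    : observations x_n
    r2   : responsibilities r_n2 = E[z_n2]
    lam2 : a positive precision value lambda_2
    """
    total = 0.0
    for xn, rn2 in zip(x, r2):
        # log N(x_n | 0, 1/lam2) = 0.5*log(lam2) - 0.5*log(2*pi) - 0.5*lam2*x_n^2
        log_normal = 0.5 * math.log(lam2) - 0.5 * math.log(2.0 * math.pi) - 0.5 * lam2 * xn ** 2
        total += rn2 * log_normal
    return total
```

Observations with $r_{n2} = 0$ contribute nothing, which is why only the points attributed to component 2 influence the update for $\lambda_2$.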

Substituting equations (2) and (3) into equation (1), we have:
    
$\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + \sum^N_{n=1} r_{n2}\log\mathcal{N}(x_n| 0, \lambda_2^{-1}) + C$

=> $\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + \sum^N_{n=1} r_{n2}\log\left[(2 \pi \lambda_2^{-1})^{-1/2} \exp\left(-\dfrac{1}{2} (x_n - 0)^2 (\lambda_2^{-1})^{-1}\right)\right] + C$

=> $\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + \sum^N_{n=1} r_{n2}\log\left[(2 \pi)^{-1/2} \lambda_2^{1/2} \exp\left(-\dfrac{1}{2} x_n^2 \lambda_2\right)\right] + C$

=> $\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + \sum^N_{n=1} r_{n2} \left[ -\dfrac{1}{2} \log (2 \pi) + \dfrac{1}{2} \log \lambda_2 - \dfrac{1}{2} x_n^2 \lambda_2 \right] + C$

=> $\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 - \dfrac{1}{2} \sum^N_{n=1} r_{n2} \log (2 \pi) + \dfrac{1}{2} \sum^N_{n=1} r_{n2} \log \lambda_2 - \dfrac{1}{2} \sum^N_{n=1} r_{n2} x_n^2 \lambda_2 + C$

Absorbing the constant term $-\dfrac{1}{2} \sum^N_{n=1} r_{n2} \log (2 \pi)$ into $C$:

=> $\log q^*(\lambda_2) = (c_0 - 1) \log \lambda_2 - d_0 \lambda_2 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} \log \lambda_2 - \dfrac{1}{2} \sum^N_{n=1} r_{n2} x_n^2 \lambda_2 + C$

=> $\log q^*(\lambda_2) = (c_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} - 1) \log \lambda_2 - (d_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} x_n^2 ) \lambda_2 + C$

=> $q^*(\lambda_2) \propto \exp\left[ \left(c_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} - 1\right) \log \lambda_2 - \left(d_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} x_n^2 \right) \lambda_2 \right]$, which is the kernel of a Gamma distribution. This is expected, because the Gamma prior on the precision $\lambda_2$ is conjugate to the zero-mean Gaussian likelihood, so the optimal factor stays in the Gamma family:

$q^*(\lambda_2) = \text{Gamma}(\lambda_2|c_N, d_N)$

where

$c_N = c_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2}$

$d_N = d_0 + \dfrac{1}{2} \sum^N_{n=1} r_{n2} x_n^2$
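The update can be sketched in a few lines of Python, together with a numerical sanity check that the unnormalized $\log q^*(\lambda_2)$ differs from the $\text{Gamma}(c_N, d_N)$ log-density only by a constant. The function names, the toy data, and the hyperparameter values $c_0 = 2$, $d_0 = 1$ are illustrative assumptions and are not part of the exercise.

```python
import math

def update_q_lambda2(x, r2, c0, d0):
    """Return (c_N, d_N) for q*(lambda_2) = Gamma(c_N, d_N), shape/rate form."""
    c_n = c0 + 0.5 * sum(r2)
    d_n = d0 + 0.5 * sum(r * xn ** 2 for r, xn in zip(r2, x))
    return c_n, d_n

def gamma_logpdf(lam, shape, rate):
    """Log-density of Gamma(shape, rate) at lam > 0."""
    return shape * math.log(rate) - math.lgamma(shape) + (shape - 1.0) * math.log(lam) - rate * lam

def unnormalized_log_q(lam, x, r2, c0, d0):
    """log p(lambda_2) + sum_n r_n2 log N(x_n | 0, 1/lambda_2), i.e. log q* up to a constant."""
    expected_loglik = sum(
        r * (0.5 * math.log(lam) - 0.5 * math.log(2.0 * math.pi) - 0.5 * lam * xn ** 2)
        for r, xn in zip(r2, x)
    )
    return gamma_logpdf(lam, c0, d0) + expected_loglik

# Toy data and responsibilities (made up for illustration).
x = [0.5, -1.2, 2.0, 0.1]
r2 = [0.9, 0.2, 0.7, 0.5]
c_n, d_n = update_q_lambda2(x, r2, c0=2.0, d0=1.0)

# If the derivation is right, this offset is the same for every lambda_2.
offsets = [
    unnormalized_log_q(lam, x, r2, 2.0, 1.0) - gamma_logpdf(lam, c_n, d_n)
    for lam in (0.3, 1.0, 2.5)
]
```

Note that the sign of the $x_n^2$ term matters here: with a positive sign the rate $d_N$ would shrink as data arrive, and the offsets in the check above would no longer agree.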
