(4) Variational Bayes

Suppose you are given data $(y_n, x_n)$, where $y_n \in \mathbb{R}$ and $x_n \in \mathbb{R}^2$ for all $n=1,\dots,N$. We model this using a linear regression model

$$y_n = ax_{n1} + bx_{n2} + \epsilon_n, \quad n = 1,\dots,N$$

$$ \epsilon_n \overset {\text{i.i.d}} \sim \mathcal{N}(0,1) $$

Prior distributions for the parameters are

$$ a \sim \mathcal{N}(0,1), \text{ and} $$
$$ b \sim \mathcal{N}(0,1) $$

Assume a variational mean-field distribution $q(a,b)=q(a)q(b)$ for the parameters of the model, where the factors are assumed to be of the form

$$ q(a) = \mathcal{N}(a| \mu_a, \sigma_a^2) $$
$$ q(b) = \mathcal{N}(b| \mu_b, \sigma_b^2) $$

Derive the variational update for the factor $q(a)$

Update of factor $q(a)$. Note that this exercise does not need the latent variable representation of the model; we can use the joint distribution of the model directly.

Step 1: Write down the log joint distribution of the model based on the priors and likelihoods

$$ p(\mathbf{x}, \mathbf{y}, a, b) = p(\mathbf{y} | \mathbf{x}, a, b) \, p(a) \, p(b) $$

$$ \Rightarrow \log p(\mathbf{x}, \mathbf{y}, a, b) = \sum_{n=1}^N [\log p(y_n | x_n, a, b)] + \log p(a) + \log p(b) $$

Since $\epsilon_n \sim \mathcal{N}(0,1)$, each $y_n$ is Gaussian with mean $ax_{n1} + bx_{n2}$ and the same variance as the error, namely 1.

Substituting in the expressions for the likelihood and prior distributions, we get:


$$ \log p(\mathbf{x}, \mathbf{y}, a, b) = \sum_{n=1}^N [\log \mathcal{N}(y_n|ax_{n1} + bx_{n2}, 1)] + \log \mathcal{N}(a|0,1) + \log \mathcal{N}(b|0,1) $$
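The log joint above can be evaluated numerically, which is a convenient way to sanity-check later algebra. The following is a minimal sketch using only the Python standard library; the helper names `log_normal_pdf` and `log_joint` and the toy data are hypothetical and not part of the exercise.

```python
import math


def log_normal_pdf(value, mean, variance):
    """Log density of N(value | mean, variance)."""
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - 0.5 * (value - mean) ** 2 / variance)


def log_joint(xs, ys, a, b):
    """Sum of the N Gaussian log-likelihood terms plus the two N(0, 1) log-priors."""
    log_lik = sum(
        log_normal_pdf(y, a * x1 + b * x2, 1.0)
        for (x1, x2), y in zip(xs, ys)
    )
    return log_lik + log_normal_pdf(a, 0.0, 1.0) + log_normal_pdf(b, 0.0, 1.0)


# Hypothetical toy data: three observations with two features each.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [0.5, -0.2, 0.4]
print(log_joint(xs, ys, a=0.5, b=-0.2))
```

Each call costs one pass over the data, so the function is cheap enough to compare against hand-derived expressions for several values of $a$ and $b$.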


Step 2: To derive the variational update for the factor $q(a)$, we can use the coordinate ascent variational inference (CAVI) algorithm. CAVI maximizes the ELBO (equivalently, minimizes the KL divergence) with respect to one factor at a time while holding the others fixed. The optimal factor is obtained by taking the expectation of the log joint with respect to all variational factors except the one being updated. Here we update $q(a)$, so the expectation is over $q(b)$:

$\log q^*(a) = E_{q(b)} [\log p(\mathbf{x}, \mathbf{y}, a, b)] + C$

=> $\log q^*(a) = E_{q(b)} [\sum_{n=1}^N [\log p(y_n | x_n, a, b)] + \log p(a) + \log p(b)] + C$

We need to keep only the terms that depend on $a$. The remaining terms are constant with respect to this factor and can be absorbed into the constant $C$.

=> $\log q^*(a) = E_{q(b)} [\log p(a)] +  E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] + C$

=> $\log q^*(a) = \log p(a) + E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] + C$ (E.q 1)

According to the exercise, we have $p(a) = \mathcal{N}(0,1)$ as the prior. 

=> $\log p(a) = \log \mathcal{N}(a|0,1) = -\dfrac{1}{2} \log 2 \pi - \dfrac{1}{2} \dfrac{(a-0)^2}{1} = -\dfrac{1}{2} \log 2 \pi - \dfrac{1}{2} a^2 $

We can drop the term that is independent of $a$ 

=> $\log p(a) = - \dfrac{1}{2} a^2 + C $ (E.q 2)


Additionally, we have
 
$E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] = E_{q(b)} \left[ \sum^N_{n=1} \left( -\dfrac{1}{2} \log 2 \pi - \dfrac{1}{2} (y_n - ax_{n1} - bx_{n2})^2 \right) \right]$

Dropping all terms not depending on $a$, we have:

$E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] = E_{q(b)} \left[ \sum^N_{n=1} \left(- \dfrac{1}{2} (-2ax_{n1}y_{n} + a^2 x_{n1}^2 + 2ax_{n1}bx_{n2}) \right) \right] + C$

$E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] = E_{q(b)} \left[ \sum^N_{n=1} \left(ax_{n1}y_{n} - \dfrac{1}{2} a^2 x_{n1}^2 - ax_{n1}bx_{n2} \right) \right] + C$

$E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] = \sum^N_{n=1} \left(ax_{n1}y_{n} - \dfrac{1}{2} a^2 x_{n1}^2 - ax_{n1} E_{q(b)}\left[b\right]x_{n2} \right) + C$, where $E_{q(b)}\left[b\right] = \mu_b$ is the mean of the current variational factor $q(b)$, not the prior mean $0$.

=> $E_{q(b)} [\log p(\mathbf{y}|\mathbf{x},a,b)] = \sum^N_{n=1} \left(ax_{n1}(y_{n} - \mu_b x_{n2}) - \dfrac{1}{2} a^2 x_{n1}^2 \right) + C$ (E.q 3)

Step 3: Plugging (2) and (3) into equation (1), we have:

$\log q^*(a) = - \dfrac{1}{2} a^2 + \sum^N_{n=1} \left(ax_{n1}(y_{n} - \mu_b x_{n2}) - \dfrac{1}{2} a^2 x_{n1}^2 \right) + C $

=> $\log q^*(a) = - \dfrac{1}{2} a^2 + a \sum^N_{n=1} x_{n1}(y_{n} - \mu_b x_{n2}) - \dfrac{1}{2} a^2 \sum^N_{n=1} x_{n1}^2 + C $

=> $\log q^*(a) = - \dfrac{1}{2} a^2 ( \sum^N_{n=1} x_{n1}^2 + 1) + a \sum^N_{n=1} x_{n1}(y_{n} - \mu_b x_{n2}) + C $

Step 4: Identify the closed form of the variational update for $q(a)$. Both the prior and the likelihood are Gaussian, so they are conjugate, and $\log q^*(a)$ is quadratic in $a$. The optimal factor is therefore Gaussian.

Complete the square using the general form $-\dfrac{1}{2}\mathbf{z}^T A \mathbf{z} + \mathbf{c}^T \mathbf{z}$:

If $\log q^*(\mathbf{z}) = -\dfrac{1}{2}\mathbf{z}^T A \mathbf{z} + \mathbf{c}^T \mathbf{z} + C$ => $q^*(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{m}, \mathbf{S})$

where $\mathbf{S} = A^{-1}$ and $\mathbf{m} = A^{-1}\mathbf{c}$

Here $\mathbf{z} = a$ is a scalar, with $A = \sum^N_{n=1} x_{n1}^2 + 1$ and $\mathbf{c} = \sum^N_{n=1} x_{n1}(y_{n} - \mu_b x_{n2})$:

=> $\log q^*(a) = - \dfrac{1}{2} a^2 ( \sum^N_{n=1} x_{n1}^2 + 1) + a \sum^N_{n=1} x_{n1}(y_{n} - \mu_b x_{n2}) + C $

Thus we have the final update for the factor $q(a)$ as:

$q(a) = \mathcal{N}(a|m_a, s_a^2)$

where

$m_a = s_a^2 \sum^N_{n=1} x_{n1} (y_n - \mu_b x_{n2})$

and

$s_a^2 = (\sum^N_{n=1} x_{n1}^2 + 1)^{-1}$
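The update can be exercised numerically. The following is a minimal sketch, not part of the original exercise, that runs CAVI on synthetic data. It takes $E_{q(b)}[b] = \mu_b$, the current mean of $q(b)$, so the cross term $-\mu_b x_{n1} x_{n2}$ enters the mean of $q(a)$. The update for $q(b)$ is not derived in this exercise; it is assumed here to follow by symmetry, with the roles of the two feature columns swapped. The names `update_factor` and `cavi`, the data, and the iteration count are all hypothetical choices.

```python
import random


def update_factor(xs, ys, own_col, other_col, other_mean):
    """Gaussian factor update: returns (mean, variance) for one coefficient,
    given the current variational mean of the other coefficient."""
    precision = 1.0 + sum(x[own_col] ** 2 for x in xs)  # prior precision 1 + sum of squares
    linear = sum(
        x[own_col] * (y - other_mean * x[other_col]) for x, y in zip(xs, ys)
    )
    variance = 1.0 / precision
    return variance * linear, variance


def cavi(xs, ys, num_iters=200):
    """Alternate the q(a) and q(b) updates for a fixed number of sweeps."""
    mu_a, mu_b = 0.0, 0.0  # initial variational means
    var_a = var_b = 1.0
    for _ in range(num_iters):
        mu_a, var_a = update_factor(xs, ys, 0, 1, mu_b)
        mu_b, var_b = update_factor(xs, ys, 1, 0, mu_a)
    return (mu_a, var_a), (mu_b, var_b)


# Synthetic data from assumed "true" coefficients (illustrative values only).
random.seed(0)
true_a, true_b = 1.5, -0.7
xs = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(50)]
ys = [true_a * x1 + true_b * x2 + random.gauss(0.0, 1.0) for x1, x2 in xs]

(mu_a, var_a), (mu_b, var_b) = cavi(xs, ys)
print("q(a): mean", mu_a, "variance", var_a)
print("q(b): mean", mu_b, "variance", var_b)
```

For this Gaussian model the variances do not depend on the other factor, so only the means change between sweeps. The means converge to the exact posterior mean, while each factor variance is no larger than the corresponding exact marginal posterior variance.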

Suppose the data $x = (x_n)_{n=1}^N$ are distributed i.i.d. as $x_n \sim \text{Exp}(\lambda)$, and assume the prior $\lambda \sim \text{Gamma}(\alpha, \beta)$. Derive the marginal likelihood $p(x|\alpha, \beta)$. Hint: the Gamma prior is conjugate to the exponential likelihood.

The result, derived step by step below, is:

$p(x|\alpha,\beta) = \frac{\Gamma(N+\alpha)}{(\beta+\sum_{n=1}^N x_n)^{N+\alpha}} \frac{\beta^\alpha}{\Gamma(\alpha)}$

The likelihood of the data is:

$p(x|\lambda) = \prod_{n=1}^N \lambda e^{-\lambda x_n} = \lambda^N e^{-\lambda \sum_{n=1}^N x_n}$

The prior distribution for $\lambda$ is given by:

$p(\lambda|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}$

Using Bayes' theorem, the posterior distribution for $\lambda$ is given by:

$p(\lambda|x,\alpha,\beta) \propto p(x|\lambda)p(\lambda|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{N+\alpha-1}e^{-(\beta+\sum_{n=1}^N x_n)\lambda}$

This is the kernel of a Gamma distribution with parameters $N+\alpha$ and $\beta+\sum_{n=1}^N x_n$.
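The conjugate update amounts to simple bookkeeping on the two Gamma parameters. A minimal sketch, assuming the shape/rate parameterization of the Gamma distribution; the function name `gamma_posterior` and the sample values are hypothetical.

```python
def gamma_posterior(xs, alpha, beta):
    """Posterior (shape, rate) for lambda under a Gamma(alpha, beta) prior
    and i.i.d. exponential observations xs."""
    return alpha + len(xs), beta + sum(xs)


# Hypothetical observations and prior parameters.
shape, rate = gamma_posterior([0.5, 1.2, 0.3], alpha=2.0, beta=1.0)
print("posterior shape:", shape, "rate:", rate, "mean:", shape / rate)
```

The shape grows by the number of observations and the rate by their sum, so the posterior mean `shape / rate` moves toward $N / \sum_n x_n$ as more data arrive.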

The marginal likelihood is given by:

$p(x|\alpha,\beta) = \int p(x|\lambda)p(\lambda|\alpha,\beta)d\lambda$

Substituting the expressions for the likelihood and prior, we get:

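$p(x|\alpha,\beta) = \int_0^\infty \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{N+\alpha-1}e^{-(\beta+\sum_{n=1}^N x_n)\lambda} d\lambda$

The integrand is an unnormalized Gamma density. Using $\int_0^\infty \lambda^{k-1} e^{-r\lambda} d\lambda = \frac{\Gamma(k)}{r^k}$ with $k = N+\alpha$ and $r = \beta+\sum_{n=1}^N x_n$, the integral evaluates to:

$p(x|\alpha,\beta) = \frac{\Gamma(N+\alpha)}{(\beta+\sum_{n=1}^N x_n)^{N+\alpha}} \frac{\beta^\alpha}{\Gamma(\alpha)}$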
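The closed form can be checked against a direct numerical integration of likelihood times prior. The following is a minimal sketch using only the standard library; the function names, the midpoint rule, the integration bound `upper`, and the sample values are illustrative assumptions, not part of the exercise.

```python
import math


def log_marginal_likelihood(xs, alpha, beta):
    """Closed-form log p(x | alpha, beta) for exponential data with a Gamma prior."""
    n, total = len(xs), sum(xs)
    return (math.lgamma(n + alpha) - (n + alpha) * math.log(beta + total)
            + alpha * math.log(beta) - math.lgamma(alpha))


def marginal_likelihood_numeric(xs, alpha, beta, upper=40.0, steps=200_000):
    """Midpoint-rule approximation of the integral of p(x | lam) p(lam) over (0, upper)."""
    n, total = len(xs), sum(xs)
    log_const = alpha * math.log(beta) - math.lgamma(alpha)
    h = upper / steps
    acc = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoint of the i-th subinterval
        log_integrand = (log_const + (n + alpha - 1.0) * math.log(lam)
                         - (beta + total) * lam)
        acc += math.exp(log_integrand)
    return acc * h


# Hypothetical observations and prior parameters.
xs = [0.5, 1.2, 0.3]
closed_form = math.exp(log_marginal_likelihood(xs, alpha=2.0, beta=1.0))
numeric = marginal_likelihood_numeric(xs, alpha=2.0, beta=1.0)
print("closed form:", closed_form)
print("numeric    :", numeric)
```

Working in log space with `math.lgamma` avoids overflow in $\Gamma(N+\alpha)$ when $N$ is large. The truncation at `upper` is only adequate when the integrand has decayed by then, which holds for these small sample values.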
