## VFE

We seek a distribution over the noise-free latent variables $f$ that approximates the true posterior: $q(f)\approx p(f \mid y)$. To do this we use the standard decomposition of the log marginal likelihood from variational inference:

$$
\begin{aligned}
\log p(y) &= \overbrace{\mathbb{E}_{q(f)}\left[\log \frac{p(y,f)}{q(f)} \right]}^{\mathcal{L}(q)} +\overbrace{\mathbb{E}_{q(f)}\left[\log \frac{q(f)}{p(f \mid y)}\right]}^{KL\left[q(f) \mid\mid p(f \mid y)\right]}\\
&\geq \mathcal{L}(q)
\end{aligned}
$$

Since $\log p(y)$ does not depend on $q$ and the KL term is non-negative, maximizing $\mathcal{L}(q)$ is equivalent to minimizing the Kullback-Leibler divergence between $q$ and the true posterior $p(f\mid y)$.
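
As a sanity check of this decomposition, here is a minimal sketch for a scalar model in which every term is available in closed form. The model (prior $f \sim \mathcal{N}(0, k)$, likelihood $y \mid f \sim \mathcal{N}(f, \sigma^2)$, variational family $q(f) = \mathcal{N}(m, v)$), the function names, and the numerical values are illustrative assumptions, not part of the derivation above.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL[N(m1, v1) || N(m2, v2)] for scalar Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def log_evidence(y, k, s2):
    """log p(y) for prior f ~ N(0, k) and likelihood y | f ~ N(f, s2)."""
    return -0.5 * math.log(2 * math.pi * (k + s2)) - 0.5 * y ** 2 / (k + s2)

def elbo(y, k, s2, m, v):
    """L(q) = E_q[log p(y | f)] - KL[q(f) || p(f)] for q(f) = N(m, v)."""
    exp_loglik = -0.5 * math.log(2 * math.pi * s2) - ((y - m) ** 2 + v) / (2 * s2)
    return exp_loglik - kl_gauss(m, v, 0.0, k)

def kl_to_posterior(y, k, s2, m, v):
    """KL[q(f) || p(f | y)] using the exact scalar Gaussian posterior."""
    post_var = 1.0 / (1.0 / k + 1.0 / s2)
    post_mean = post_var * y / s2
    return kl_gauss(m, v, post_mean, post_var)

# Arbitrary illustrative values (assumptions).
y, k, s2 = 1.3, 2.0, 0.5
m, v = 0.4, 0.7

gap = log_evidence(y, k, s2) - elbo(y, k, s2, m, v)
print("log p(y) - L(q):   ", gap)
print("KL[q || posterior]:", kl_to_posterior(y, k, s2, m, v))
```

The two printed numbers should agree up to floating-point error, and the gap vanishes only when $q$ equals the exact posterior.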


Let $u$ be a subset of the latent variables, $u \subset f$, so that $f = \{f_{\ne u}, u\}$. We can rewrite $\mathcal{L}$ as:

$$
\begin{aligned}
\mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid f)p(f)}{q(f)} \right] \\
&= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid f)p(f_{\ne u} \mid u) p(u)}{q(f)} \right] \\
\end{aligned}
$$

We can also rewrite the true posterior $p(f\mid y)$ using the subset $u$ of $f$:
$$
p(f\mid y) = p(f_{\ne u} \mid u,y)p(u \mid y)
$$

The Titsias approximation to the posterior chooses $q$ as:
$$
q(f) = \overbrace{p(f_{\ne u} \mid u)}^{\mathcal{N}(f_{\ne u} \mid K_{f_{\ne u}u}K_{uu}^{-1}u,\: K_{f_{\ne u}f_{\ne u}} - Q_{f_{\ne u}f_{\ne u}} )}q(u)
$$
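Here $Q_{ab} = K_{au}K_{uu}^{-1}K_{ub}$. We see that **this approximation removes the dependency on the data ($y$) in the first term**: the exact factor $p(f_{\ne u} \mid u, y)$ is replaced by the prior conditional $p(f_{\ne u} \mid u)$. The second term $q(u)$ is left free.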
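
The conditional $p(f_{\ne u} \mid u)$ used here is the usual Gaussian conditional. A minimal scalar sketch, assuming a zero-mean prior over a single latent value `f1` and a single inducing value `u` with hypothetical kernel entries `k11`, `k1u`, `kuu`, checks that $p(f_1, u) = p(f_1 \mid u)\,p(u)$ with mean $K_{fu}K_{uu}^{-1}u$ and variance $K_{ff} - Q_{ff}$:

```python
import math

def log_normal(x, mean, var):
    """Log density of a scalar Gaussian N(x | mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint(f1, u, k11, k1u, kuu):
    """log N([f1, u] | 0, [[k11, k1u], [k1u, kuu]]), written out for the 2x2 case."""
    det = k11 * kuu - k1u ** 2
    quad = (kuu * f1 ** 2 - 2 * k1u * f1 * u + k11 * u ** 2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def log_conditional_times_marginal(f1, u, k11, k1u, kuu):
    """log p(f1 | u) + log p(u) using the Gaussian conditioning formulas."""
    cond_mean = k1u / kuu * u          # K_{fu} K_{uu}^{-1} u
    cond_var = k11 - k1u ** 2 / kuu    # K_{ff} - Q_{ff}
    return log_normal(f1, cond_mean, cond_var) + log_normal(u, 0.0, kuu)

# Arbitrary illustrative values (assumptions); the 2x2 covariance is positive definite.
f1, u = 0.3, -0.7
k11, k1u, kuu = 1.5, 0.8, 1.2

print(log_joint(f1, u, k11, k1u, kuu))
print(log_conditional_times_marginal(f1, u, k11, k1u, kuu))
```

Both lines should print the same value, which is the factorization $p(f) = p(f_{\ne u} \mid u)\,p(u)$ in its smallest non-trivial instance.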

Substituting this $q$ in the bound $\mathcal{L}$ leads to:

$$
\begin{aligned}
\require{cancel}
\mathcal{L}(q) &= \mathbb{E}_{q(f)}\left[\log \frac{p(y\mid f)\cancel{p(f_{\ne u} \mid u)} p(u)}{\cancel{p(f_{\ne u} \mid u)}q(u)}\right] \\
&= \mathbb{E}_{q(f)}\left[\log p(y\mid f)\right] + \mathbb{E}_{p(f_{\ne u} \mid u)q(u)}\left[\log \frac{p(u)}{q(u)}\right] \\
&= \mathbb{E}_{q(f)}\left[\log p(y\mid f)\right] + \int \int \left[\log \frac{p(u)}{q(u)}\right] p(f_{\ne u} \mid u)q(u) df_{\ne u} du \\
&= \mathbb{E}_{q(f)}\left[\log p(y\mid f)\right] + \int \left[\log \frac{p(u)}{q(u)}\right] q(u) \overbrace{\left(\int p(f_{\ne u} \mid u) df_{\ne u} \right)}^{=1}  du \quad \text{Trick (1)}\\
&= \mathbb{E}_{q(f)}\left[\log p(y\mid f)\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&= \mathbb{E}_{q(f)}\left[\log \prod_{i=1}^N p(y_i\mid f_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \quad \text{(step 6)} \\
&= \mathbb{E}_{q(f_1,..,f_N)}\left[\sum_{i=1}^N \log  p(y_i\mid f_i)\right] - KL\left[q(u)\mid\mid p(u)\right]\\
&= \sum_{i=1}^N \mathbb{E}_{q(f_1,..,f_N)}\left[\log  p(y_i\mid f_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&= \sum_{i=1}^N \mathbb{E}_{q(f_1,..,f_N)}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- f_i)^2\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&= \sum_{i=1}^N \int \left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- f_i)^2\right]q(f_i)\overbrace{\int q(f_1,..,f_{i-1},f_{i+1},..,f_N \mid f_i) d(f_1,..,f_{i-1},f_{i+1},..,f_N)}^{=1}df_i - KL\left[q(u)\mid\mid p(u)\right] \\
&= \sum_{i=1}^N \mathbb{E}_{q(f_i)}\left[\frac{-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i- f_i)^2\right]  - KL\left[q(u)\mid\mid p(u)\right]\\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(f_i)}\left[ - \frac{1}{2\sigma^2}(y_i- f_i)^2\right]  - KL\left[q(u)\mid\mid p(u)\right] \\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{q(f_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- f_i)^2\right]\right]  - KL\left[q(u)\mid\mid p(u)\right] \\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(f_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i- f_i)^2\right]\right]  - KL\left[q(u)\mid\mid p(u)\right] \\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(f_i \mid u)}\left[ - \frac{1}{2\sigma^2}(y_i^2+ f_i^2 -2y_i f_i)\right]\right]  - KL\left[q(u)\mid\mid p(u)\right] \\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ \mathbb{E}_{p(f_i \mid u)}[f_i^2] -2y_i \mathbb{E}_{p(f_i \mid u)}[f_i]\right)\right]  - KL\left[q(u)\mid\mid p(u)\right] \quad \text{Using }p(f_i \mid u ) = \mathcal{N}(f_i \mid K_{f_iu}K_{uu}^{-1}u, K_{f_if_i} - Q_{f_if_i}) \\
&= \frac{-N}{2}\log(2\pi\sigma^2)+\sum_{i=1}^N \mathbb{E}_{q(u)}\left[- \frac{1}{2\sigma^2}\left(y_i^2+ K_{f_if_i} - Q_{f_if_i} + K_{f_iu}K_{uu}^{-1}uu^\top K_{uu}^{-1}K_{uf_i} -2y_i K_{f_iu}K_{uu}^{-1}u\right)\right]  - KL\left[q(u)\mid\mid p(u)\right]   \\
\end{aligned}
$$

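If we assume $q(u) = \mathcal{N}(u \mid m,S)$, the remaining expectations over $q(u)$ are available in closed form, since $\mathbb{E}_{q(u)}[u] = m$ and $\mathbb{E}_{q(u)}[uu^\top] = S + mm^\top$.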
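
With a Gaussian $q(u)$, each per-point expectation in the final line of the bound can be evaluated directly. The sketch below is a scalar illustration with a single inducing variable. It assumes $q(u) = \mathcal{N}(m, S)$, a zero-mean prior $p(u) = \mathcal{N}(0, K_{uu})$, and hypothetical argument names. It computes the data-fit term in two ways: from the expanded expression using $\mathbb{E}[u] = m$ and $\mathbb{E}[u^2] = S + m^2$, and from the marginal $q(f_i)$.

```python
import math

def term_expanded(y_i, k_ii, k_iu, k_uu, m, S, s2):
    """Per-point term from the expanded expression, with E[u] = m and E[u^2] = S + m^2."""
    a = k_iu / k_uu              # K_{f_i u} K_{uu}^{-1}
    q_ii = k_iu ** 2 / k_uu      # Q_{f_i f_i}
    inner = y_i ** 2 + k_ii - q_ii + a ** 2 * (S + m ** 2) - 2 * y_i * a * m
    return -inner / (2 * s2)

def term_marginal(y_i, k_ii, k_iu, k_uu, m, S, s2):
    """Same term via the marginal q(f_i) = N(a m, k_ii - q_ii + a^2 S)."""
    a = k_iu / k_uu
    mean = a * m
    var = k_ii - k_iu ** 2 / k_uu + a ** 2 * S
    return -((y_i - mean) ** 2 + var) / (2 * s2)

def bound(y, k_diag, k_fu, k_uu, m, S, s2):
    """L(q) = -N/2 log(2 pi s2) + sum of per-point terms - KL[q(u) || p(u)]."""
    const = -0.5 * len(y) * math.log(2 * math.pi * s2)
    fit = sum(term_expanded(y_i, k_ii, k_iu, k_uu, m, S, s2)
              for y_i, k_ii, k_iu in zip(y, k_diag, k_fu))
    kl = 0.5 * (math.log(k_uu / S) + (S + m ** 2) / k_uu - 1.0)
    return const + fit - kl

# Arbitrary illustrative values (assumptions).
y = [1.0, -0.4, 0.7]
k_diag = [1.0, 1.0, 1.0]     # K_{f_i f_i}
k_fu = [0.5, 0.2, 0.8]       # K_{f_i u}
k_uu, m, S, s2 = 1.0, 0.3, 0.5, 0.1

print(bound(y, k_diag, k_fu, k_uu, m, S, s2))
```

In the matrix case the same pattern applies with $a^2 (S + m^2)$ replaced by $K_{f_iu}K_{uu}^{-1}(S + mm^\top)K_{uu}^{-1}K_{uf_i}$.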

Alternative derivation from step (6):

$$
\begin{aligned}
\mathcal{L}(q)&= \mathbb{E}_{q(f)}\left[\log \prod_{i=1}^N p(y_i\mid f_i)\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&= \mathbb{E}_{p(f_1,..,f_N\mid u)q(u)}\left[\log p(y \mid f)\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&= \mathbb{E}_{q(u)}\left[\mathbb{E}_{p(f \mid u)}\left[\log p(y \mid f)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\
&\underbrace{\leq}_{\text{Jensen's}} \mathbb{E}_{q(u)}\left[\log \mathbb{E}_{p(f\mid u)}\left[ p(y \mid f)\right]\right] - KL\left[q(u)\mid\mid p(u)\right] \\
\end{aligned}
$$

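We know that $p(f \mid u)=\mathcal{N}(f \mid K_{fu}K_{uu}^{-1}u,\: K_{ff}-Q_{ff})$ and $p(y\mid f)=\mathcal{N}(y \mid f,\sigma^2 I)$. The expectation is therefore a convolution of two Gaussians:

$$
\mathbb{E}_{p(f \mid u)}\left[ p(y \mid f)\right] = \int \mathcal{N}(y \mid f,\sigma^2 I)\,\mathcal{N}(f \mid K_{fu}K_{uu}^{-1}u,\: K_{ff}-Q_{ff})\, df = \mathcal{N}(y \mid K_{fu}K_{uu}^{-1}u,\: K_{ff}-Q_{ff}+\sigma^2 I)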
$$
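
A minimal scalar sketch of this expectation, assuming a Gaussian likelihood $p(y \mid f) = \mathcal{N}(y \mid f, \sigma^2)$ and writing `mu` for $K_{fu}K_{uu}^{-1}u$ and `c` for $K_{ff} - Q_{ff}$ (both hypothetical names with arbitrary values). It integrates $p(y \mid f)\,p(f \mid u)$ numerically, compares the result with the Gaussian-convolution closed form $\mathcal{N}(y \mid \mu, c + \sigma^2)$, and checks the direction of the Jensen step used in the alternative derivation.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def integrate(func, lo, hi, n=20000):
    """Composite trapezoidal rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (func(lo) + func(hi))
    for j in range(1, n):
        total += func(lo + j * h)
    return total * h

# Arbitrary illustrative values (assumptions).
y, mu, c, s2 = 0.9, 0.2, 0.6, 0.3

# Integrate over f on a range wide enough to cover essentially all of p(f | u).
lo, hi = mu - 12 * math.sqrt(c), mu + 12 * math.sqrt(c)
numeric = integrate(lambda f: normal_pdf(y, f, s2) * normal_pdf(f, mu, c), lo, hi)
closed_form = normal_pdf(y, mu, c + s2)

# E_{p(f|u)}[log p(y | f)] in closed form, for comparison with log E_{p(f|u)}[p(y | f)].
expected_log = -0.5 * math.log(2 * math.pi * s2) - ((y - mu) ** 2 + c) / (2 * s2)

print("numeric integral:", numeric)
print("closed form:     ", closed_form)
print("E[log p] <= log E[p]:", expected_log <= math.log(closed_form))
```

The numerical integral and the closed form should agree closely. The inequality holds because the logarithm is concave.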