# Variational Autoencoder

This project explores and implements the ideas of the seminal deep learning paper "Auto-Encoding Variational Bayes". Link to the paper: [arxiv](https://arxiv.org/abs/1312.6114).

### Significance

Why is this paper important?

The paper introduces the Variational Autoencoder (VAE), a powerful model for generative tasks, that is, generating new data samples similar to those in a given dataset. The paper develops mathematical solutions to the problem of using an autoencoder to generate new samples. Since autoencoders are generally trained in an unsupervised fashion, without relying on explicit data labels, they offer a versatile technique that can be adapted to a wide range of problems in deep learning.
VAEs, on their own, are usually not state-of-the-art for specific tasks. However, their adaptability makes them a key building block in larger model architectures. For example, Stable Diffusion, a widely used architecture for image generation, relies on a VAE as one of its core building blocks.

### Problem Statement

The goal of a VAE model is to learn a distribution over a latent variable $z$ that explains the observed data $x$. There are two key challenges in this:
1. Computing the posterior distribution $p(z|x)$, which is often intractable due to the complexity of the marginal likelihood of the data $p(x)$.
2. Efficiently learning the model parameters through optimization.

### Proposed Solutions

The authors propose using **variational inference**, a framework for approximating intractable posterior distributions, combined with an efficient reparameterization trick to enable gradient-based optimization. This approach allows for training both the generative model and the approximate posterior simultaneously. The resulting model, the VAE, combines probabilistic modeling with neural networks, making it scalable and efficient.
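
As a rough illustration of the reparameterization trick mentioned above, the sketch below draws a latent sample as a deterministic function of the distribution parameters plus independent noise. It assumes a one-dimensional Gaussian approximate posterior. The names `reparameterize`, `mu`, and `log_var` are illustrative choices, not taken from the paper or from any particular library.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).

    Assumption: a 1-D Gaussian posterior parameterized by its mean and
    log-variance (hypothetical names, chosen for illustration).
    """
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)  # the randomness lives here, independent of mu and sigma
    return mu + sigma * eps


rng = random.Random(0)
samples = [reparameterize(2.0, math.log(0.25), rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)  # should land close to 2.0 and 0.25
```

Because `eps` does not depend on `mu` or `log_var`, the sample is a differentiable function of both parameters, which is what allows gradients to pass through the sampling step.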

#### Background

Data that we observe can be thought of as being represented or generated by a hidden latent variable, which cannot be observed directly. An example of this concept can be seen in Plato's "Allegory of the Cave." In Plato's work, a group of people are chained inside a cave and can only see shadows on the wall created by unseen objects passing before a fire. The shadows—a two-dimensional projection—are the data that can be observed, while the hidden three-dimensional objects are the real driving force behind the observations; they are the latent variable.

This latent variable can be either of higher dimensionality, as in Plato's Cave, or of lower dimensionality than the observed data. However, in the field of generative modeling, we generally seek to learn lower-dimensional latent representations. Reducing the dimensionality of the data can be seen as a form of compression and also has the potential to lead to semantically meaningful structures in the latent space in relation to the observed data.

### Key Ideas

#### Evidence Lower Bound - ELBO

Mathematically, the observed data $x$ and the latent variable $z$ can be modeled by a joint distribution:
$$p(x, z)$$
In generative modeling, one approach is to learn a model that maximizes the likelihood $p(x)$ of all observed data points $x$. This simply means that a trained model should assign high likelihood to real observed data and low likelihood to everything else. To access the likelihood of the observed data $p(x)$, we could manipulate the joint distribution of observed data and latent variable $p(x, z)$ in two different ways:
* The first is to marginalize out the latent variable $z$ by integrating the joint probability over all possible values of $z$:
    $$p(x) = \int p(x, z)dz$$
    The problem is that computing this integral is intractable for complex models. Reasons include the potentially high dimensionality of the latent space and the lack of closed-form solutions for integrals involving neural networks.

* The second way to get $p(x)$ on its own is to use the chain rule of probability:
    $$P(X_1, X_2, \dots, X_n) = \prod_{i=1}^n P(X_i \mid X_1, X_2, \dots, X_{i-1})$$
    For the two variables $x$ and $z$ this gives
    $$p(x,z) = p(x)p(z|x)$$
    and dividing by $p(z|x)$ we get
    $$p(x) = \frac{p(x,z)}{p(z|x)}$$

    The problem here is that it requires access to the true posterior $p(z|x)$ (the ground truth latent encoder), which is not available: the latent variable is by definition not directly observable, so there is no ground truth mapping from $x$ to $z$.
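
To make the two routes to $p(x)$ concrete, here is a minimal sketch on a toy discrete model, where the integral becomes a finite sum. The joint probabilities are made up for illustration, and the helper names `marginal_x` and `posterior_z_given_x` are hypothetical.

```python
# Toy joint distribution p(x, z) over binary x and a three-valued z.
# All numbers are made up for illustration; they only need to sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.25, (0, 2): 0.05,
    (1, 0): 0.30, (1, 1): 0.05, (1, 2): 0.25,
}
z_values = [0, 1, 2]


def marginal_x(x):
    """p(x) = sum_z p(x, z): the discrete analogue of the integral over z."""
    return sum(joint[(x, z)] for z in z_values)


def posterior_z_given_x(z, x):
    """p(z | x) = p(x, z) / p(x); cheap here, intractable for deep models."""
    return joint[(x, z)] / marginal_x(x)


x = 1
p_x = marginal_x(x)
# Chain-rule route: p(x) = p(x, z) / p(z | x) gives the same value for every z.
via_chain_rule = [joint[(x, z)] / posterior_z_given_x(z, x) for z in z_values]
print(p_x, via_chain_rule)
```

Note that the posterior in this toy example is itself computed from the marginal, so the second route only works because the first one is tractable here. With a neural network in the model, neither quantity is available directly.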


The key idea for solving these problems is to use the two equations for $p(x)$ above to derive a term called the Evidence Lower Bound (ELBO), which places a lower bound on the evidence, the log likelihood of the observed data. This gives us an objective for the latent variable model: maximize the probability of the observed data indirectly by maximizing the ELBO, the lower bound on the log of that probability. Below is the equation for the ELBO and its connection to the evidence:
$$ \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \right] $$

$$\log p(x) \ge \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \right] $$

where we introduce $q_{\phi}(z \mid x)$ as a flexible approximate variational distribution with parameters $\phi$ in place of the original intractable distribution $p(z|x)$. The idea is that a model is trained to estimate the true intractable distribution of the latent variables given the observations $x$.
As the model is trained, its parameters are optimized and its estimate of the true posterior distribution becomes better and better.

#### ELBO Derivation

Below we derive the ELBO:

$$p(x) = \int p(x, z)dz$$
Take the log of both sides:
$$ \log p(x) = \log \int p(x, z)dz$$
Multiply the integrand by one in the form of $\frac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)}$ to introduce the approximate variational distribution, writing the joint as $p_{\theta}(x, z)$ to make the model parameters $\theta$ explicit:
$$ \log p(x) = \log \int q_{\phi}(z \mid x) \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \, dz $$
Now, the expectation of an arbitrary deterministic function $f(z)$ under a distribution $q(z)$ is given by the integral
$$ \mathbb{E}_{q(z)}[f(z)] = \int f(z) q(z) \, dz $$
Therefore, the integral expression for the evidence can be rewritten as an expectation of the function $\frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}$ under the distribution $q_{\phi}(z \mid x)$:
$$ \log p(x) = \log \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \right] $$
Because the logarithm is concave, Jensen's inequality tells us that the log of an expectation is always greater than or equal to the expectation of the log:
$$ \log \mathbb{E}[f(z)] \geq \mathbb{E}[\log f(z)] $$
so we can bring the log inside the expectation, at the cost of turning the equality into an inequality:
$$ \log p(x) \ge \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \right] $$

Thus we have derived the lower bound of the evidence. Now we can explore the ELBO term further.
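
The bound can be checked numerically on a small discrete model, where the evidence is a finite sum and can be computed exactly. This is only a sketch: the prior, the likelihood values, and the candidate distributions `q_rough` and `q_exact` are made-up illustrative numbers.

```python
import math

# Hypothetical toy model: prior p(z) and likelihood p(x | z) for one fixed x.
prior = [0.5, 0.3, 0.2]          # p(z) for z = 0, 1, 2
likelihood = [0.10, 0.60, 0.30]  # p(x | z) for the observed x

joint = [pz * px_z for pz, px_z in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                       # p(x)
log_evidence = math.log(evidence)


def elbo(q):
    """E_q[log p(x, z) / q(z | x)] for a discrete q over z."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint) if qz > 0)


q_rough = [1 / 3, 1 / 3, 1 / 3]              # an arbitrary approximate posterior
q_exact = [pxz / evidence for pxz in joint]  # the true posterior p(z | x)

print(log_evidence, elbo(q_rough), elbo(q_exact))
```

The ELBO for `q_rough` falls strictly below the log evidence, while for `q_exact` the two coincide: the bound is tight exactly when the variational distribution matches the true posterior.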

If we start with the equation of the ELBO:
$$\mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} \right]$$
we can apply the chain rule of probability to substitute $p_{\theta}(x, z)$ with $p_{\theta}(x|z)p(z)$:
$$\mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z \mid x)} \right]$$
Next, since the log of a product is a sum of logs and expectation is linear, we can split the expression like this:
$$\mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log p_{\theta}(x|z) \right] + \mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log \frac{p(z)}{q_{\phi}(z \mid x)} \right] $$

Here we can recognize the second term: it is the negative of the Kullback–Leibler (KL) divergence between $q_{\phi}(z \mid x)$ and $p(z)$. KL divergence provides a way to measure how close two distributions are. It can be thought of as a distance between two distributions since, like a distance, it is always non-negative. There is a caveat with this interpretation, though: unlike a distance, KL divergence is not symmetric. The KL divergence from A to B is in general not the same as from B to A. The definition of KL divergence is:
$$ D_{\text{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{p(x)}{q(x)} \right] $$
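
The properties of KL divergence described above can be seen in a short numerical sketch for discrete distributions. The function name `kl_divergence` and the two example distributions `a` and `b` are illustrative choices.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute zero.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


a = [0.8, 0.1, 0.1]
b = [0.4, 0.4, 0.2]

print(kl_divergence(a, b), kl_divergence(b, a), kl_divergence(a, a))
```

Both directions give a positive value, the two values differ from each other, and the divergence of a distribution from itself is zero.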


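If we substitute $z$ for $x$ in the general definition, with $P = q_{\phi}(z \mid x)$ and $Q = p(z)$, the fraction inside the log is the reciprocal of the one in the second term. Flipping it changes the sign, so the second term equals $-D_{\text{KL}}(q_{\phi}(z \mid x) \parallel p(z))$ and the ELBO becomes:
$$
\underset{\text{Reconstruction term}}{\mathbb{E}_{q_{\phi}(z \mid x)} \left[ \log p_{\theta}(x \mid z) \right]}
\underset{\text{KL Divergence term}}{- D_{\text{KL}}(q_{\phi}(z \mid x) \parallel p(z))}
$$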
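
As a sanity check on this decomposition, the sketch below evaluates the ELBO in both its ratio form and its split form on a toy discrete model and confirms that they agree. All probabilities are made-up illustrative values.

```python
import math

# Hypothetical toy quantities for one fixed observation x and z in {0, 1, 2}.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x | z)
q = [0.2, 0.5, 0.3]              # q(z | x), an arbitrary approximate posterior

# Ratio form: E_q[log p(x, z) / q(z | x)]
elbo_ratio = sum(
    qz * math.log(pz * px_z / qz) for qz, pz, px_z in zip(q, prior, likelihood)
)

# Split form: reconstruction term minus KL(q(z | x) || p(z))
reconstruction = sum(qz * math.log(px_z) for qz, px_z in zip(q, likelihood))
kl_to_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
elbo_split = reconstruction - kl_to_prior

print(elbo_ratio, elbo_split)
```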



We can gain a better intuition about this expression if we look at the larger picture. The variational distribution $q_{\phi}(z \mid x)$ is parameterized by a neural network that is trained to approximate the intractable posterior distribution. This network is the encoder: it transforms inputs into a distribution over possible latents. Similarly, $p_{\theta}(x|z)$ is another network, trained to convert a given latent vector $z$ into an observation $x$. This second network is the decoder. Each term in the ELBO expression above measures the performance of one of these two networks:
* The reconstruction term measures the performance of the decoder: it ensures that the learned distribution models effective latents from which the original data can be regenerated.
* The KL divergence term measures the performance of the encoder: it ensures that the learned variational distribution stays as similar as possible to the prior belief held over the latent variables.

As already discussed, we want the trained model to maximize the probability of observed samples. Since the evidence is the log of that probability, maximizing the evidence lower bound works toward that goal. From the final expression of the ELBO it is evident that maximizing it means maximizing the first (reconstruction) term while minimizing the KL divergence in the second term.
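
In practice the ELBO is usually turned into a loss by negating it. The following is a minimal sketch of that loss for a single data point, under assumptions that are not stated in the text above: a diagonal Gaussian encoder, a standard normal prior $p(z)$ (which gives the KL term a closed form), and a Bernoulli decoder over binary data. All function names and input values are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, probs):
    """log p(x | z) for binary x, given decoder output probabilities."""
    return sum(
        xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi) for xi, pi in zip(x, probs)
    )


def negative_elbo(x, probs, mu, log_var):
    """Loss to minimize: -(reconstruction term) + KL term."""
    reconstruction = bernoulli_log_likelihood(x, probs)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return -reconstruction + kl


x = [1, 0, 1, 1]                        # a made-up binary observation
probs = [0.9, 0.2, 0.8, 0.7]            # made-up decoder outputs for one sampled z
mu, log_var = [0.5, -0.3], [0.0, -1.0]  # made-up encoder outputs

print(negative_elbo(x, probs, mu, log_var))
```

Here the reconstruction term is a single-sample estimate of the expectation over $q_{\phi}(z \mid x)$, since `probs` stands in for the decoder output at one sampled $z$. A real implementation would compute these quantities with an automatic differentiation library over batches of data.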