# Variational autoencoders

In this project, we explore the idea behind variational autoencoders (VAEs) and implement one in TensorFlow to generate images that resemble MNIST handwritten digits.

## Table of Contents
- [1 - Introduction](#1)
- [2 - Autoencoders](#2)
    - [2.1 - Dimension Reduction using Autoencoders](#2.1)
- [3 - Variational autoencoders](#3)
    - [3.1 - Theory behind VAEs](#3.1)
        - [3.1.1 - Evidence Lower Bound](#3.1.1)
        - [3.1.2 - Variational Inference](#3.1.2)
        - [3.1.3 - Reparameterization Trick](#3.1.3)
        - [3.1.5 - Hand-written Image Generator](#3.1.5)

<a name='1'></a>
## Introduction
Variational autoencoders are deep generative models that encode inputs into a lower-dimensional space, sample randomly from that latent (encoded) space, and decode the sample to generate new data. More precisely, variational autoencoders can be viewed in two ways:
1. VAEs are a type of autoencoder. However, there is a key difference between AEs and VAEs, which is discussed once autoencoders have been explained.
2. VAEs are probabilistic generative models that work with the joint distribution of observed and latent variables.

<a name='2'></a>
## Autoencoders

<a name='2.1'></a>
### Dimension Reduction using Autoencoders

<a name='3'></a>
## Variational autoencoders

<a name='3.1'></a>
### Theory behind VAEs

<a name='3.1.1'></a>
#### Evidence Lower Bound
The evidence lower bound (ELBO) is the key quantity used to approximate a posterior distribution. It is a lower bound on the log marginal likelihood of the data given the model parameters. Maximizing the ELBO is equivalent to minimizing the Kullback-Leibler (KL) divergence between the approximating distribution and the true posterior distribution, as shown later in this section.

Before deriving this lower bound, let us define the problem and the notation. Suppose we have two random variables: $X$ (the observation, or real data) and $Z$ (the hidden, or latent, variable). $X$ and $Z$ are distributed according to a joint distribution $p(x,z;\theta)$, where $\theta$ parameterizes the distribution.

Next we need to define the evidence. The evidence is simply the log-likelihood $\log p(x;\theta)$ of the observation $x$ for a fixed $\theta$. Intuitively, the likelihood function measures how well the model and its parameter $\theta$ agree with the observations; a high likelihood indicates that the model fits the given data well. The goal is therefore to find a lower bound for $\log p(x;\theta)$. Let $q(z)$ be any distribution over $Z$ that is positive wherever $p(x,z;\theta)$ is. Marginalizing over $z$ then gives the lower bound as follows:

$$\begin{align} \log p(x;\theta) &= \log\int_{z} p(x,z;\theta) \, dz \\ &= \log\int_{z} p(x,z;\theta) \frac{q(z)}{q(z)} \, dz \\ &= \log E_{Z\sim q}\left[\frac{p(x,Z;\theta)}{q(Z)}\right] \\ & \geq^* E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)}\right]\end{align}$$

$$\begin{align} \Rightarrow \ \mathrm{ELBO} = E_{Z\sim q}\left[\log \frac{p(x,Z;\theta)}{q(Z)}\right] \quad (1) \end{align}$$

\* : This inequality is the result of [Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality). Since $\log$ is a concave function, $\log(E[Y]) \geq E[\log(Y)]$ for any positive random variable $Y$; here $Y = \frac{p(x,Z;\theta)}{q(Z)}$.
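
To make the bound concrete, here is a minimal NumPy sketch that estimates the ELBO in (1) by Monte Carlo. The toy model ($z \sim N(0,1)$, $x|z \sim N(z,1)$), the observed value `x = 1.5`, and the helper names `log_normal` and `elbo_estimate` are assumptions chosen for illustration; they are not part of this project's code. For this model the evidence is available in closed form, because marginally $x \sim N(0, 2)$, so the estimate can be compared against it.

```python
import numpy as np

def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((v - mean) / std) ** 2

def elbo_estimate(x, q_mean, q_std, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_{Z~q}[log p(x, Z) - log q(Z)]."""
    rng = np.random.default_rng(seed)
    z = rng.normal(q_mean, q_std, size=n_samples)            # Z ~ q
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z) + log p(x|z)
    log_q = log_normal(z, q_mean, q_std)
    return np.mean(log_joint - log_q)

x = 1.5  # hypothetical observation
log_evidence = log_normal(x, 0.0, np.sqrt(2.0))  # exact log p(x) for this toy model

# q equal to the prior: a valid but loose choice.
elbo_poor = elbo_estimate(x, q_mean=0.0, q_std=1.0)
# q equal to the exact posterior N(x/2, 1/2): the bound becomes tight.
elbo_exact = elbo_estimate(x, q_mean=x / 2, q_std=np.sqrt(0.5))

print("log evidence:", log_evidence)
print("ELBO, q = prior:", elbo_poor)
print("ELBO, q = posterior:", elbo_exact)
```

With $q$ set to the prior, the estimate falls below the log evidence. With $q$ set to the exact posterior, the ratio $\frac{p(x,Z;\theta)}{q(Z)}$ is the constant $p(x;\theta)$, so Jensen's inequality holds with equality.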

Now let us show why maximizing the ELBO minimizes the KL divergence between the approximating distribution and the true posterior distribution. Suppose we want to find the distribution $q$ that best approximates $p(z|x;\theta)$. (VAEs need an approximation to $p(z|x;\theta)$, which is why the argument is stated for this distribution.)

$$\begin{align} \mathrm{KL}(q(z) \ || \ p(z|x;\theta)) &= E_{Z \sim q}\left[\log \frac{q(Z)}{p(Z|x;\theta)}\right] \\ &= E_{Z \sim q}[\log q(Z)] - E_{Z \sim q}[\log p(Z|x;\theta)] \\ &= E_{Z \sim q}[\log q(Z)] - E_{Z \sim q}\left[\log \frac {p(Z,x;\theta)}{p(x;\theta)}\right] \\ &= E_{Z \sim q}[\log q(Z)] - E_{Z \sim q}[\log p(Z,x;\theta)] + E_{Z \sim q}[\log p(x;\theta)] \\ &= E_{Z \sim q}[\log p(x;\theta)] - E_{Z \sim q}\left[\log\frac{p(x,Z;\theta)}{q(Z)}\right] \\ &=^{*,(1)} \log p(x;\theta) - \mathrm{ELBO} \\ &= \text{evidence} - \mathrm{ELBO}\end{align}$$

\* : Notice that $\log p(x;\theta)$ does not depend on $Z$, so it acts as a constant and comes out of the expectation.

Now, notice that when $\theta$ is fixed and we search for the $q$ that minimizes the KL divergence, the evidence is a constant. Because the ELBO enters with a negative sign, maximizing the ELBO minimizes the KL divergence. Moreover, since the KL divergence is non-negative, the ELBO never exceeds the evidence, and the two coincide when $q(z) = p(z|x;\theta)$.
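
The identity $\mathrm{KL} = \text{evidence} - \mathrm{ELBO}$ can be checked numerically on a small discrete example, where every sum is exact. The sketch below assumes a latent variable with three states; the numbers in `prior`, `likelihood`, and `q` are made up for illustration, and the function names `elbo` and `kl` are hypothetical helpers.

```python
import numpy as np

# Hypothetical model with a 3-state latent variable and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z = k)
likelihood = np.array([0.10, 0.40, 0.70])  # p(x | z = k) for the observed x

joint = prior * likelihood                 # p(x, z = k)
evidence = np.log(joint.sum())             # log p(x)
posterior = joint / joint.sum()            # p(z = k | x)

def elbo(q):
    """E_{Z~q}[log p(x, Z) - log q(Z)] for a discrete q."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.6, 0.3, 0.1])              # an arbitrary approximating distribution

print("KL(q || posterior):", kl(q, posterior))
print("evidence - ELBO:   ", evidence - elbo(q))
print("ELBO at q = posterior equals evidence:", np.isclose(elbo(posterior), evidence))
```

The first two printed values agree, and choosing `q = posterior` closes the gap completely, which is the discrete analogue of the derivation above.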

<a name='3.1.2'></a>
#### Variational Inference

In many practical scenarios, computing the exact posterior distribution is intractable due to the complexity of the model or the size of the data. Variational inference provides a framework to approximate this posterior distribution with a simpler distribution chosen from a parameterized family of distributions. 
The essence of variational inference is to pose the problem of approximating the posterior distribution as an optimization problem. The goal is to find the member of the chosen family of distributions that minimizes the Kullback-Leibler (KL) divergence from the true posterior distribution.
By maximizing the Evidence Lower Bound (ELBO), which is a lower bound on the log marginal likelihood of the data given the model parameters, variational inference seeks to find the best approximation to the true posterior distribution given the model and the observed data.

When a model has both hidden ($Z$) and observed ($X$) variables and we want the posterior distribution of $Z$ given $X$, variational inference helps us find a reasonable approximation. But first, let us see why $p(z|x)$ cannot be computed explicitly. Using Bayes' theorem, $p(z|x)$ can be written as follows:
$$\begin{align} p(z|x) = \frac{p(z,x)}{p(x)} = \frac{p(x|z) \times p(z)}{\int_z p(z,x) dz} \end{align}$$

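The important point is that the denominator, an integral over every possible value of $z$, often has no closed form and becomes too expensive to evaluate numerically when $z$ is high-dimensional. Therefore, variational inference is used.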

In this technique, we choose a family of distributions, called the variational family, and look for the member $q$ for which $q(z)$ is closest to $p(z|x)$. The members of the family are indexed by a variational parameter $\phi$, so the goal is to find the $\phi$ that minimizes the KL divergence between $q_{\phi}(z)$ and $p(z|x)$. Conveniently, by the result shown in [Evidence Lower Bound](#3.1.1), we can maximize the ELBO instead of minimizing the KL divergence. The ELBO only involves the joint $p(x,z)$ and $q_{\phi}(z)$, so the intractable $p(x)$ never has to be computed.
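
As a rough illustration of variational inference as an optimization problem, the sketch below fits a Gaussian variational family $q_{\phi}(z) = N(m, s^2)$, with $\phi = (m, \log s)$, by gradient ascent on the ELBO. Everything specific here is an assumption made for the example: the toy model ($z \sim N(0,1)$, $x|z \sim N(z,1)$), the observation `x = 1.5`, the step size, and the function names `elbo` and `fit_q`. The model is chosen so that the ELBO and its gradients have closed forms and the true posterior, $N(x/2, 1/2)$, is known for comparison. Plain NumPy is used instead of TensorFlow to keep the example short.

```python
import numpy as np

def elbo(x, m, s):
    """Closed-form ELBO for q(z) = N(m, s^2) under z ~ N(0,1), x|z ~ N(z,1)."""
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
    expected_log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)   # entropy of q
    return expected_log_prior + expected_log_lik + entropy

def fit_q(x, steps=500, lr=0.1):
    """Gradient ascent on the ELBO over phi = (m, log s)."""
    m, log_s = 0.0, 0.0                    # start from q = N(0, 1)
    for _ in range(steps):
        s = np.exp(log_s)
        grad_m = x - 2.0 * m               # d ELBO / d m
        grad_log_s = 1.0 - 2.0 * s**2      # d ELBO / d log s
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, np.exp(log_s)

x = 1.5  # hypothetical observation
m_opt, s_opt = fit_q(x)
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0  # exact log p(x)

print("fitted mean, std:", m_opt, s_opt)
print("true posterior mean, std:", x / 2, np.sqrt(0.5))
print("ELBO at optimum vs log evidence:", elbo(x, m_opt, s_opt), log_evidence)
```

Because the true posterior belongs to the chosen family in this toy case, the optimizer recovers it and the ELBO reaches the evidence. In general the family does not contain the posterior, and a gap equal to the remaining KL divergence persists. Optimizing $\log s$ rather than $s$ keeps the standard deviation positive without a constraint.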

<a name='3.1.3'></a>
#### Reparameterization Trick