# **Variational Autoencoder**
* The VAE is one of the building blocks of Stable Diffusion. We will introduce the model, how it works, its architecture and the maths behind VAEs. Once you understand VAEs, you will have covered more than 50 percent of the maths that you need for Stable Diffusion.
## **What is an Autoencoder?**
* An AE is a model made of two smaller models: an encoder and a decoder. They are joined together by a bottleneck **Z**.
* The goal of the encoder is to take some input and convert it into a lower-dimensional representation (let's call it **Z**). If we then give this lower-dimensional representation, **Z**, as input to the decoder, we hope that the model will reproduce the original data.
    * We want to compress the original data into a lower dimension. It is like compressing a `.jpg` file into a `.zip` file and then decompressing it to obtain the `.jpg` file again.
    * The difference between the AE and compression is that the AE is a NN that may not reproduce the exact original input, but will try to reproduce as much of it as possible.
    *  ***What makes a good AE?*** 
        * The code should be as small as possible (it is the lower-dimensional representation of the data), and the reconstructed input should be as close as possible to the original input.
    * ***What is the problem with AE?*** 
        * The problem with AEs is that the code learned by the model doesn't make sense. That is, the model just learns a mapping between the input data and **Z**, but it doesn't learn any semantic relationship that clearly distinguishes different input images (the code learned for a picture X could be very similar to the code learned for an unrelated picture Y). The model doesn't capture any semantic relationship between the data: this is why the **Variational Autoencoder** was introduced.
## **What about the VAE?**
* In the VAE we learn a latent space (not a code), which represents a multivariate distribution over the data, and we hope that this multivariate distribution (this latent space) also captures the semantic relationships between the data.

* For example, we hope that all the **X-nature** pictures have a similar representation in this latent space and also all the **Y-nature** pictures have a similar representation as well as all **W-nature**.
    * The most important thing we want from this VAE is to be able to sample from this latent space to generate new data.

## **Sampling the latent space**
* Note that when you use `Python` to generate a random number between 1 and 100, you are actually ***sampling*** from a pseudo-random distribution. It is called a uniform distribution because every number has an equal probability of being chosen.
    * In the same way, we can sample from the **latent space** in order to generate a new random vector, give it to the decoder and generate new data. 
    * Following this idea, if we sample from the latent space of a VAE that was trained on food pictures, and we happen to sample something exactly in between three pictures (eggs, flour and basil), we hope to get something that is also similar in meaning to these three pictures (pasta with basil). It means that the model has somehow captured the relationships between the data it was trained on, so it can generate new data.
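The two kinds of sampling described above can be sketched side by side. This is only an illustration: `LATENT_DIM` and `toy_decoder` are hypothetical names invented here, and the "decoder" is a fixed linear map standing in for a trained neural network.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Uniform sampling: every integer from 1 to 100 is equally likely.
n = random.randint(1, 100)

# Latent sampling: draw each coordinate of z from a standard normal N(0, 1).
LATENT_DIM = 4  # hypothetical latent size, chosen only for illustration
z = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def toy_decoder(latent):
    """Hypothetical stand-in for a trained decoder: maps a latent vector to 8 values."""
    # A fixed linear map; a real decoder would be a neural network with learned weights.
    return [
        sum(zj * (i + j + 1) / 10.0 for j, zj in enumerate(latent))
        for i in range(8)
    ]


# Generating "new data" means decoding a freshly sampled latent vector.
new_sample = toy_decoder(z)
```

With a trained VAE, the only change would be that the decoder carries learned weights, so nearby latent vectors decode to semantically similar outputs.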
## **Why is it called *latent space*?**

* We model our data **as coming from** a variable **X that we can observe**, and this variable X is **conditioned on** another random variable **Z that is not visible** to us (it is hidden): ***latent means hidden***. Z is a higher-level, abstract representation of the data, and we want to learn this abstract representation.
    * We will model this hidden variable as a multivariate Gaussian with a mean and a variance.

## **Math introduction**
* The VAE is a key component of Stable Diffusion models. Concepts like the ELBO also appear in Stable Diffusion, so understanding VAEs will make it easier to understand Stable Diffusion. We will need the following concepts:

    * **Expectation of a random variable:**
    $$
    \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, dP(x)
    $$
    * **Chain Rule of Probability:**
    $$
    P(X,Y) = P(X\mid Y)P(Y)
    $$
    * **Bayes' Theorem:**
    $$
    P(X \mid Y) = \frac{P(Y \mid X) P(X)}{P(Y)}
    $$
    * **Kullback-Leibler Divergence:** This is a very important concept in ML. It is a divergence measure that lets you measure the ***distance between two probability distributions***. Given two probability distributions $P$ and $Q$, the KL divergence tells you how far apart they are. It is not a true distance metric because it is not symmetric: $D_{\text{KL}}(P \parallel Q)  \neq D_{\text{KL}}(Q \parallel P)$ in general. But like a distance metric, it is never negative, $D_{\text{KL}}(P \parallel Q) \geq 0$, and $D_{\text{KL}}(P \parallel Q) = 0$ if and only if $P=Q$.
    $$
    D_{\text{KL}}(P \parallel Q) = \int P(x) \log \left(\frac{P(x)}{Q(x)}\right) \, dx
    $$
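The four concepts above can be checked numerically on a small discrete example, where the integrals become sums. The joint table and the distributions `P` and `Q` below are made-up numbers, chosen only for illustration.

```python
import math

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}


def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)


def p_y(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)


def p_x_given_y(x, y):
    return joint[(x, y)] / p_y(y)


def p_y_given_x(y, x):
    return joint[(x, y)] / p_x(x)


# Expectation: E[X] = sum over x of x * P(x).
e_x = sum(x * p_x(x) for x in (0, 1))

# Chain rule: P(X, Y) = P(X | Y) P(Y).
chain = p_x_given_y(1, 0) * p_y(0)

# Bayes' theorem: P(X | Y) = P(Y | X) P(X) / P(Y).
bayes = p_y_given_x(0, 1) * p_x(1) / p_y(0)


def kl(p, q):
    """Discrete KL divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.5, 0.5]
Q = [0.9, 0.1]
kl_pq = kl(P, Q)  # D_KL(P || Q)
kl_qp = kl(Q, P)  # D_KL(Q || P), a different number: KL is not symmetric
```

Both divergences come out positive but unequal, and `kl(P, P)` is exactly zero, matching the three properties listed above.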
## **The Model**
We saw before that we want to model our data as coming from a random variable that we call **X**, which is conditioned on a hidden or latent variable called **Z**.

We can define the likelihood of our data as the marginalization over the joint probability with respect to the latent space:

$$
p(x) = \int p(x,z) \, dz
$$

But that integral is **intractable** because we would need to evaluate it over all possible values of the latent variable **Z**.
* ***What does it mean to be intractable?*** It means that in theory we can calculate it, but in practice it is so slow and so computationally expensive that it is not worth it.
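To get a feel for what evaluating this integral involves, here is a minimal sketch in a toy model where the answer is known. The model is an assumption made only for illustration: $p(z) = N(0,1)$ and $p(x \mid z) = N(z, 1)$, for which $p(x) = N(0, 2)$ in closed form. The function names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return math.exp(-((v - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_mc(x, num_samples, rng):
    """Estimate p(x) = integral of p(x|z) p(z) dz by averaging p(x|z) over z ~ p(z)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)
    return total / num_samples


rng = random.Random(0)
x = 1.5
estimate = marginal_mc(x, 100_000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # closed form, available only in this toy model
```

Here a hundred thousand samples are enough because **Z** has a single dimension. With a high-dimensional **Z** and a neural-network decoder, almost all sampled values of **Z** contribute next to nothing to the average, which is the practical meaning of "intractable".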

And using the ***Chain Rule of Probability***:
$$
p(x) = \frac{p(x,z)}{p(z\mid x)}
$$
We are trying to find $p(x)$ (the probability distribution over our data), but for that we would need the ***ground truth*** $p(z\mid x)$ (the probability distribution over the latent space given our data), which we don't have. $p(z\mid x)$ is also something we want to learn, so we cannot use this relationship.
* In order to have a tractable $p(x)$ we need a tractable $p(z\mid x)$
* In order to have a tractable $p(z\mid x)$ we need a tractable $p(x)$

We will try to approximate $p(x)$ or $p(z\mid x)$. To do so, we start from:

$$
\log{p_\theta (x)} = \log{p_\theta (x)}
$$

Considering that:

$$
\int q_{\varphi}(z\mid x) dz = 1
$$

And:

$$
p_\theta(x) = \frac{p_\theta(x,z)}{p_\theta(z\mid x)}
$$

We can prove that:
$$
\log{p_\theta (x)} = \mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right] + D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z\mid x))
$$

Where, from the beginning, we assume:

$$
p_\theta(z\mid x) \approx q_\varphi(z\mid x)
$$
Here $p_\theta(z\mid x)$ is what we want to find (we assume it is parametrized by $\theta$, which we don't know). However, if we can find something that approximates it (and that has its own parameters), we get a better idea of the original distribution we want to know: ***$q_\varphi(z\mid x)$ is a surrogate***.
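For reference, the decomposition of $\log{p_\theta(x)}$ stated in this section can be derived in a few lines, using only the two facts listed before it (that $q_\varphi(z\mid x)$ integrates to 1, and the chain rule) together with the definition of the KL divergence:

$$
\begin{aligned}
\log{p_\theta(x)} &= \log{p_\theta(x)} \int q_\varphi(z\mid x) \, dz \\
&= \int q_\varphi(z\mid x) \log{p_\theta(x)} \, dz = \mathbb{E}_{q_\varphi}\left[\log{p_\theta(x)}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x,z)}{p_\theta(z\mid x)}}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x,z)\, q_\varphi(z\mid x)}{q_\varphi(z\mid x)\, p_\theta(z\mid x)}}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right] + \mathbb{E}_{q_\varphi}\left[\log{\frac{q_\varphi(z\mid x)}{p_\theta(z\mid x)}}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right] + D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z\mid x))
\end{aligned}
$$

The fourth line multiplies and divides by $q_\varphi(z\mid x)$ inside the logarithm, which is what lets the surrogate enter the expression.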
## ***ELBO***
Recall that:
$$
\log{p_\theta (x)} = \mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right] + D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z\mid x))
$$
We call $\mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]$ the **ELBO (Evidence Lower Bound)**.
* ***What can we infer from this expression?*** Remembering that $D_{KL}$ is always greater than or equal to zero:
$$
\log{p_\theta (x)} \geq \mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]
$$ 

So $\mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]$ is a *lower bound* of $\log{p_\theta (x)}$. If we maximize that *lower bound*, we push $\log{p_\theta (x)}$ up as well, since it can never fall below the bound. Furthermore, using the Chain Rule of Probability inside the $\log$ of the ELBO, we can prove that:
$$
\log{p_\theta (x)} \geq \mathbb{E}_{q_{\varphi}}\left[\log{p_\theta (x \mid z)}\right]-D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z))
$$ 
Let's view this as:
$$
A \geq B - C
$$
To maximize $A$, we need to maximize $B$ and minimize $C$ at the same time. That is, maximizing the ELBO means:
* Maximizing $B$: maximizing the reconstruction likelihood of the decoder. $\log{p_\theta (x \mid z)}$ is something that, given **Z**, gives us the probability distribution over **X**, so here we are talking about the decoder: ***the model is learning to maximize the reconstruction quality of the sample x given its latent representation Z***
* Minimizing $C$: minimizing the distance between the learned distribution and the prior belief we have over the latent variable. $D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z))$ contains $p_\theta(z)$, which is what we want our **Z** space to look like (a multivariate Gaussian), and $q_\varphi(z\mid x)$, which is the distribution learned by the model. The model is minimizing the distance between what it learns as the **Z** space and what we want the **Z** space to look like: ***so it is making the Z space look like a multivariate Gaussian***
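The step from the ELBO to the $B - C$ form can be written out explicitly. It only uses the chain rule $p_\theta(x,z) = p_\theta(x\mid z)\,p_\theta(z)$ and the definition of the KL divergence:

$$
\begin{aligned}
\mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]
&= \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(x\mid z)\, p_\theta(z)}{q_\varphi(z\mid x)}}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{p_\theta(x\mid z)}\right] + \mathbb{E}_{q_\varphi}\left[\log{\frac{p_\theta(z)}{q_\varphi(z\mid x)}}\right] \\
&= \mathbb{E}_{q_\varphi}\left[\log{p_\theta(x\mid z)}\right] - D_{KL}(q_\varphi(z\mid x) \parallel p_\theta(z))
\end{aligned}
$$

The sign flips in the last line because the fraction inside the logarithm is the inverse of the one in the KL definition.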
## **Maximizing the ELBO: A little introduction to estimators**
When and how can you maximize something that has a stochastic quantity inside (here, the probability distribution $q_\varphi(z\mid x)$)?
* To maximize a function we usually take the gradient and adjust the weights of the model so they move along the gradient direction.
* To minimize a function we take the gradient and adjust the weights of the model so they move against the gradient direction.

The problem is that we do not calculate the true gradient when we train our model. What we actually use is called **Stochastic Gradient Descent**.
* **Stochastic Gradient Descent:**

    To minimize a function exactly, you would need to evaluate it over the whole dataset (all the training data you have, not only a single batch). Usually all the data doesn't fit in RAM/GPU memory, so we only calculate it on a mini-batch, and by doing so we get a sample from a distribution over the possible gradients.

    When we use *Stochastic Gradient Descent* and evaluate the gradient of our loss function, we do not get the true gradient; we get a noisy estimate of it. Averaged over enough mini-batches (for example over the entire training set, so one epoch), these estimates converge to the true gradient.
    
    Because the gradient in *Stochastic Gradient Descent* is stochastic, it **has a mean and a variance**.
    
    * The variance in *Stochastic Gradient Descent* is small enough that we can use *SGD* in practice. The problem arises when we do the same with the quantity $\mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]$: estimating this stochastic quantity gives us an ***estimator*** that has a high variance.
        * SGD is stochastic because we choose the minibatch randomly from our dataset and we then average the loss over the minibatch.
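The claim that a mini-batch gradient has a mean and a variance can be checked on a toy problem. The dataset, the loss $L(w) = \text{mean}((w - x_i)^2)$ and the batch size of 32 are assumptions made only for this sketch.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical 1-D dataset; the loss is L(w) = mean((w - x_i)^2).
data = [rng.gauss(3.0, 2.0) for _ in range(10_000)]
w = 0.0


def gradient(weight, samples):
    """Gradient of the mean squared distance between weight and the samples."""
    return sum(2.0 * (weight - x) for x in samples) / len(samples)


# "True" gradient: evaluated over the whole dataset.
true_grad = gradient(w, data)

# Stochastic gradients: each one evaluated on a random mini-batch of 32 samples.
batch_grads = [gradient(w, rng.sample(data, 32)) for _ in range(2_000)]

mean_grad = statistics.mean(batch_grads)  # close to true_grad
std_grad = statistics.stdev(batch_grads)  # spread around the true gradient
```

Each individual mini-batch gradient is off by a noticeable amount, but their average lands close to the full-dataset gradient, which is why this level of variance is tolerable in practice.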
## **How to maximize the ELBO?**
* Kingma and Welling show that there is an estimator for the ELBO, but it exhibits a very high variance. An estimator with a high variance could return a gradient very different from the one we expect. It could take you away from the minimum, or direct you to a random place: this is not what we want. ***We cannot use an estimator with high variance***.
* That estimator is, however, ***unbiased***: even if at every step it may not be equal to the true expectation, on average it converges to it. But because it is stochastic, it also has a variance, and that variance happens to be too high for practical use.
* **How do we run back-propagation on a quantity that is stochastic?**  
Remember that we need to sample from our Z space to calculate the loss. We cannot calculate the derivative of the sampling operation, and `PyTorch` cannot do that either. ***We need a new estimator***.
* The idea is to take the source of randomness outside of the model. This is called the ***reparameterization trick***.
## **The Reparametrization Trick**
It means that we take the stochastic component outside of **Z** and create a new variable $\epsilon$ (a fixed random source distributed as $N(0,1)$) that will now be our ***stochastic node***. We sample from $\epsilon$, combine it with the parameters learned by the model ($\mu$ and $\sigma^2$ of the multivariate Gaussian that we are trying to learn), and then we run back propagation through it.
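The difference in variance between the two kinds of estimator can be seen on a toy problem. Two assumptions are made here: the objective is $f(z) = z^2$ with $z \sim N(\mu, \sigma^2)$, whose true gradient with respect to $\mu$ is $2\mu$; and the high-variance estimator is taken to be the score-function (log-derivative) estimator, which these notes do not name explicitly.

```python
import random
import statistics

rng = random.Random(0)
mu, sigma = 1.0, 1.0
num_samples = 50_000

score_grads = []    # score-function estimator: f(z) * d/dmu log q(z)
reparam_grads = []  # reparameterized estimator: d/dmu f(mu + sigma * eps)

for _ in range(num_samples):
    eps = rng.gauss(0.0, 1.0)  # noise drawn outside the "model"
    z = mu + sigma * eps       # reparameterized sample, z ~ N(mu, sigma^2)
    score_grads.append(z ** 2 * (z - mu) / sigma ** 2)
    reparam_grads.append(2.0 * z)

true_grad = 2.0 * mu  # d/dmu E[z^2] = d/dmu (mu^2 + sigma^2)

mean_score = statistics.mean(score_grads)
mean_reparam = statistics.mean(reparam_grads)
var_score = statistics.variance(score_grads)
var_reparam = statistics.variance(reparam_grads)
```

Both estimators average out to the same gradient, so both are unbiased, but in this toy setting the score-function estimator has a several times larger variance than the reparameterized one.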
## **Running backpropagation on the reparametrized model**
When we run back propagation, we calculate the loss function and then its gradient, but the fact that the **Z** node is random or stochastic prevents us from running back propagation through it, because we don't know how to calculate the gradient of the sampling operation.

However, if we move the randomness out of the **Z** node into another node, we can run back propagation and update the parameters of our model. Back propagation may also calculate the gradient along the $Z$-$\epsilon$ connection, but we just discard it because we don't care about it.
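A minimal PyTorch sketch of this idea follows. The tensors `mu` and `log_var` stand in for encoder outputs, and the loss `(z ** 2).sum()` is a hypothetical placeholder chosen so the gradients are easy to check by hand; none of these values come from a real model.

```python
import torch

torch.manual_seed(0)

# Stand-ins for the encoder outputs (hypothetical values).
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

# The noise node: sampled outside the learned parameters, no gradient needed.
eps = torch.randn_like(mu)

# Z is now a deterministic function of (mu, log_var) once eps is fixed.
sigma = torch.exp(0.5 * log_var)
z = mu + sigma * eps

# Placeholder loss so that d(loss)/d(mu) = 2z is easy to verify by hand.
loss = (z ** 2).sum()
loss.backward()

grad_mu = mu.grad            # gradient reached mu through z
grad_log_var = log_var.grad  # gradient reached log_var through sigma
```

Because `eps` is created without `requires_grad`, autograd treats it as a constant, and the gradient flows only into `mu` and `log_var`.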

Now we can actually run back propagation, and the new estimator that we found has lower variance ***(a new estimator!)***:
$$
L(\theta,\varphi,x)= \mathbb{E}_{q_{\varphi}}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]=\mathbb{E}_{p(\epsilon)}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right]
$$

Where $\epsilon \sim p(\epsilon)$, $z = g(\varphi,x,\epsilon)$ and $\mathbb{E}_{p(\epsilon)}\left[\log{\frac{p_\theta(x,z)}{q_\varphi(z\mid x)}}\right] \approx \hat{L}(\theta,\varphi,x)=\log{\frac{p_\theta (x,z)}{q_{\varphi}(z\mid x)}}$

Note that we replaced the expectation $\mathbb{E}_{q_{\varphi}}$ with $\mathbb{E}_{p(\epsilon)}$, so the randomness now comes from our noise source ($\epsilon$). Our new estimator, $\hat{L}(\theta,\varphi,x)=\log{\frac{p_\theta (x,z)}{q_{\varphi}(z\mid x)}}$, is called a ***Monte Carlo estimator***.

* The Monte Carlo Estimator is unbiased.
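The single-sample estimator $\hat{L}$ can be tried out in a toy model where everything is known in closed form. The model is an assumption made for illustration only: $p(z) = N(0,1)$ and $p(x\mid z) = N(z,1)$, so that $p(x) = N(0,2)$ and the true posterior is $p(z\mid x) = N(x/2, 1/2)$. The function names are hypothetical.

```python
import math
import random


def log_normal(v, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)


def elbo_sample(x, q_mean, q_std, eps):
    """One Monte Carlo sample of log p(x, z) - log q(z | x), with z = g(eps)."""
    z = q_mean + q_std * eps  # reparameterized sample from q(z | x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    log_q = log_normal(z, q_mean, q_std ** 2)
    return log_joint - log_q


rng = random.Random(0)
x = 1.5
log_px = log_normal(x, 0.0, 2.0)  # exact log-evidence of the toy model

# Case 1: q is the true posterior, so every single sample equals log p(x).
exact_q = [
    elbo_sample(x, x / 2, math.sqrt(0.5), rng.gauss(0.0, 1.0)) for _ in range(5)
]

# Case 2: a mismatched q = N(0, 1); the average sits below log p(x).
rough_q = [elbo_sample(x, 0.0, 1.0, rng.gauss(0.0, 1.0)) for _ in range(20_000)]
elbo_rough = sum(rough_q) / len(rough_q)
```

The gap between `log_px` and `elbo_rough` is the KL divergence between the mismatched $q$ and the true posterior, which is the decomposition from the ELBO section seen numerically.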
## **Summary**
* We have found a quantity called the **ELBO**; by maximizing it we actually learn the latent space.
* We also found an ***estimator*** for this **ELBO** that allows back propagation to be run.
    * The next step is to combine all this knowledge to simulate what the network will actually do.
        * We run the input through the encoder. The encoder ($q_{\varphi}(z \mid x)$) is something that, given our image, gives us the latent representation.
        * Then we sample from the noise source ($\epsilon$), which is outside the model (it is inside neither **Z** nor the NN). `PyTorch` has a function for this: `torch.randn_like(tensor)` returns noise with the same shape as the given tensor, sampled from a distribution with zero mean and unit variance, $N(0,1)$.
        * We combine this noisy sample with the parameters learned by the model ($\mu$ and $\log{(\sigma^2)}$, which describe the **latent space**).
        * We pass the result through the decoder, which given **Z** gives us back **X** ($\log{p_\theta (x\mid z)}$), and then we calculate the loss between the reconstructed sample ($X'$) and the original sample ($X$).
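The four steps above can be sketched as a single forward pass. This is a minimal illustration, not the architecture used in Stable Diffusion: `TinyVAE`, the layer sizes and the single linear layers are all assumptions made to keep the example short.

```python
import torch
from torch import nn

torch.manual_seed(0)

INPUT_DIM, LATENT_DIM = 16, 2  # hypothetical sizes


class TinyVAE(nn.Module):
    """Illustrative VAE with one linear layer for each of encoder and decoder."""

    def __init__(self):
        super().__init__()
        # The encoder outputs mu and log(sigma^2) side by side.
        self.encoder = nn.Linear(INPUT_DIM, 2 * LATENT_DIM)
        self.decoder = nn.Linear(LATENT_DIM, INPUT_DIM)

    def forward(self, x):
        # 1. Run the input through the encoder.
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        # 2. Sample from the noise source, outside the learned parameters.
        eps = torch.randn_like(mu)
        # 3. Combine the noise with the learned parameters.
        z = mu + torch.exp(0.5 * log_var) * eps
        # 4. Decode Z back into a reconstruction of X.
        x_recon = self.decoder(z)
        return x_recon, mu, log_var


model = TinyVAE()
x = torch.rand(4, INPUT_DIM)  # a batch of 4 made-up inputs
x_recon, mu, log_var = model(x)
```

The reconstruction `x_recon` is what gets compared with `x` in the loss, while `mu` and `log_var` feed the KL term.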
## **Loss function**
$$
\mathcal{L}(\theta, \varphi; x) = -\mathbb{E}_{q_{\varphi}(z\mid x)}\left[\log p_{\theta}(x\mid z)\right] + D_{KL}\left(q_{\varphi}(z\mid x) \parallel p(z)\right)
$$
The loss function is basically the negative of the **ELBO**, so minimizing the loss maximizes the ELBO. We can see that it is made of two components:
* The $D_{KL}$ term tells how far the learned distribution is from what we want our distribution to look like (it measures the $D_{KL}$ between what we want our **Z** space to look like and the **Z** space actually learned by the model).
* The other term is the quality of the reconstruction (we can just use the MSE loss, which basically evaluates pixel by pixel how our reconstructed sample differs from the original sample).
    * **How do we combine the sampled $\epsilon$-noise with the parameters learned by the model?**: Since we chose the model to be Gaussian, and we also chose the noise to be Gaussian, we can combine them like this:
    $$
    z = \mu + \sigma \cdot \epsilon
    $$
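    * **Why do we learn $\log{\sigma^2}$ instead of $\sigma^2$?**: If we learned $\sigma^2$ directly, we would have to force the model to output a positive quantity. $\log{\sigma^2}$ can take any real value, and $\sigma^2 = \exp(\log{\sigma^2})$ is then always positive.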
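The two loss components can be sketched in plain Python. Two assumptions are made here: the prior is a standard normal and the learned distribution is a diagonal Gaussian, for which the KL term has the closed form $-\frac{1}{2}\sum(1 + \log{\sigma^2} - \mu^2 - \sigma^2)$; and the two terms are simply added with equal weight. The function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(
        -0.5 * (1.0 + lv - m * m - math.exp(lv)) for m, lv in zip(mu, log_var)
    )


def mse(x, x_recon):
    """Mean squared error between the original and the reconstructed sample."""
    return sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)


def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction term plus KL term (equal weighting is an assumption)."""
    return mse(x, x_recon) + kl_to_standard_normal(mu, log_var)


# log_var may be any real number; exp(log_var) is always a valid variance.
log_var_examples = [-3.0, 0.0, 2.5]
variances = [math.exp(lv) for lv in log_var_examples]

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # q equals the prior
kl_shift = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])  # mean shifted by 1
loss = vae_loss([0.2, 0.8], [0.25, 0.7], [1.0, 0.0], [0.0, 0.0])
```

The KL term is zero only when the learned distribution matches the prior, and it grows as $\mu$ moves away from 0 or $\sigma^2$ moves away from 1.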

