## Preface
(Basically notes  from here:)<br>
Guide by original authors: https://arxiv.org/pdf/1906.02691.pdf

A major division in ML is generative vs. discriminative models. Let's consider both in the context of modeling fruit pictures.<br>
*Convention:*<br>
*x - the image of the fruit*
*z - the label of the fruit (also the distribution of that fruit)*
*   **discriminative** - model learns p(z|x)<br>
*   **generative** - model learns p(x|z)<br>
Generative models can still be used for classification, but doing so requires Bayes' rule, which can be computationally expensive. Generative models also tend to make more assumptions about the underlying structure of the data.
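To make the Bayes' rule step concrete, here is a minimal sketch of classifying with a generative model. The fruit labels, the single "color" feature, and every probability below are made-up assumptions for illustration; none of them come from the guide.

```python
# Toy generative classifier: we know p(x|z) and p(z), and use Bayes' rule
# to recover p(z|x). All numbers are invented for illustration.

prior = {"apple": 0.5, "banana": 0.3, "lime": 0.2}  # p(z)

# p(x|z): probability of observing color x given fruit z
likelihood = {
    "apple": {"red": 0.7, "yellow": 0.1, "green": 0.2},
    "banana": {"red": 0.0, "yellow": 0.9, "green": 0.1},
    "lime": {"red": 0.0, "yellow": 0.1, "green": 0.9},
}


def posterior(x):
    """Return p(z|x) for every label z via Bayes' rule."""
    joint = {z: likelihood[z][x] * prior[z] for z in prior}  # p(x|z) p(z)
    evidence = sum(joint.values())  # p(x) = sum_z p(x|z) p(z)
    return {z: joint[z] / evidence for z in joint}


post = posterior("green")
prediction = max(post, key=post.get)  # most probable label given x
print(post, prediction)
```

The sum in `evidence` is cheap here because there are only three labels. When z is continuous and high-dimensional, that sum becomes an integral, which is where the cost comes from.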

## Questions
*   Why is Bayes' rule intractable in machine learning?
*   Will AEs converge to VAEs if the dimensionality of the latent space is small?

# Variational Autoencoder
Original Paper: https://arxiv.org/pdf/1312.6114.pdf <br>
Guide by original authors: https://arxiv.org/pdf/1906.02691.pdf <br>
Shoutouts:
* https://www2.bcs.rochester.edu/sites/jacobslab/cheat_sheet/VariationalAutoEncoder.pdf
* https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

$z \to x \to p(z|x) \to \hat{z} \to p(x|\hat{z}) \to \hat{x}$

## KL Divergence
Motivation: a distance-like measure that quantifies the difference between two distributions. It is not a true metric, because it is not symmetric. <br>
Key features:
* Independent of normalization factor
* Independent of scale factors as well

## Derivation 1

| | D1 | D2 |
| --- | --- | --- |
|p(a)| .8| .3|
|p(b)| .1| .3|
|p(c)| .1| .4|

Some sequence X **taken from D1**: abcba

Ratio to see distribution similarity: $\frac{p(X|D1)}{p(X|D2)}$
* closer to 1 means the distributions are more similar

Probability p(X|D1) = $p_{D1}(a)^{n_a} * p_{D1}(b)^{n_b} * p_{D1}(c)^{n_c}$ <br>
Probability p(X|D2) = $p_{D2}(a)^{n_a} * p_{D2}(b)^{n_b} * p_{D2}(c)^{n_c}$
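As a quick check of the two formulas above, here is a small sketch that evaluates them for the example sequence. The helper name `sequence_likelihood` is my own, and the symbols are assumed to be drawn independently.

```python
from collections import Counter

D1 = {"a": 0.8, "b": 0.1, "c": 0.1}
D2 = {"a": 0.3, "b": 0.3, "c": 0.4}


def sequence_likelihood(seq, dist):
    """p(X|D) for an i.i.d. sequence: product over symbols of p(s)^(n_s)."""
    counts = Counter(seq)  # n_a, n_b, n_c
    prob = 1.0
    for symbol, n in counts.items():
        prob *= dist[symbol] ** n
    return prob


X = "abcba"
p_d1 = sequence_likelihood(X, D1)  # 0.8^2 * 0.1^2 * 0.1^1
p_d2 = sequence_likelihood(X, D2)  # 0.3^2 * 0.3^2 * 0.4^1
ratio = p_d1 / p_d2
print(p_d1, p_d2, ratio)
```

For this particular short sequence the ratio comes out below 1, because "abcba" happens to be an atypical draw from D1 (too few a's). The ratio only becomes informative for long sequences.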

Generalization:

$$\frac{p_{D1}(a)^{n_a} * p_{D1}(b)^{n_b} * p_{D1}(c)^{n_c}}{p_{D2}(a)^{n_a} * p_{D2}(b)^{n_b} * p_{D2}(c)^{n_c}}$$

Now raise the ratio to the power $\frac{1}{n}$ and take the log, to normalize for scaling factors and the sequence length:

$$log\Bigg(\bigg(\frac{p_{D1}(a)^{n_a} * p_{D1}(b)^{n_b} * p_{D1}(c)^{n_c}}{p_{D2}(a)^{n_a} * p_{D2}(b)^{n_b} * p_{D2}(c)^{n_c}}\bigg)^{\frac{1}{n}}\Bigg)$$

Now simplify:

$$\frac{n_a}{n}log(p_{D1}(a)) + \frac{n_b}{n}log(p_{D1}(b)) + \frac{n_c}{n}log(p_{D1}(c)) - \frac{n_a}{n}log(p_{D2}(a)) - \frac{n_b}{n}log(p_{D2}(b)) - \frac{n_c}{n}log(p_{D2}(c))$$

Recall that we are drawing X from D1, so as $n \to \infty$ the ratio $\frac{n_a}{n} \to p_{D1}(a)$ (and likewise for b and c).<br>
**Note:** If we instead assumed X was drawn from D2, the ratio would converge to $p_{D2}(a)$. This is why KL divergence is **asymmetric**: $D_{KL}(D1 || D2) \neq D_{KL}(D2 || D1)$
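The limit can be checked numerically. The sketch below draws a long sequence and computes the per-symbol log-likelihood ratio, once with X drawn from D1 and once with the roles of D1 and D2 swapped. The function name `avg_log_ratio`, the seed, and the sample size are my own choices.

```python
import math
import random

D1 = {"a": 0.8, "b": 0.1, "c": 0.1}
D2 = {"a": 0.3, "b": 0.3, "c": 0.4}


def avg_log_ratio(source, p, q, n, rng):
    """Draw n symbols from `source` and return (1/n) * log(p(X) / q(X))."""
    symbols = list(source)
    weights = [source[s] for s in symbols]
    draws = rng.choices(symbols, weights=weights, k=n)
    # log of a product of ratios = sum of per-symbol log ratios
    return sum(math.log(p[s] / q[s]) for s in draws) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
from_d1 = avg_log_ratio(D1, D1, D2, 200_000, rng)  # X ~ D1, ratio D1/D2
from_d2 = avg_log_ratio(D2, D2, D1, 200_000, rng)  # X ~ D2, ratio D2/D1
print(from_d1, from_d2)
```

The two averages settle on different values (roughly 0.54 and 0.59 nats for these tables), which is the asymmetry in action.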

$$p_{D1}(a)log(p_{D1}(a)) + p_{D1}(b)log(p_{D1}(b)) + p_{D1}(c)log(p_{D1}(c)) - p_{D1}(a)log(p_{D2}(a)) - p_{D1}(b)log(p_{D2}(b)) - p_{D1}(c)log(p_{D2}(c))$$ <br>

re-arrange:

$$p_{D1}(a)log(p_{D1}(a)) - p_{D1}(a)log(p_{D2}(a)) + p_{D1}(b)log(p_{D1}(b)) - p_{D1}(b)log(p_{D2}(b)) + p_{D1}(c)log(p_{D1}(c)) - p_{D1}(c)log(p_{D2}(c))$$ <br>

Simplify:

$$p_{D1}(a)log(\frac{p_{D1}(a)}{p_{D2}(a)}) +p_{D1}(b)log(\frac{p_{D1}(b)}{p_{D2}(b)}) + p_{D1}(c)log(\frac{p_{D1}(c)}{p_{D2}(c)}) $$

Generalize to M classes instead of just 3:

$$ \sum_{i=1}^{M} p_{D1}(i)log(\frac{p_{D1}(i)}{p_{D2}(i)}) $$
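The sum translates almost directly into code. This is a minimal sketch using natural logs (so the result is in nats); the zero-probability handling is a common convention that I am adding, not something covered in these notes.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(i) * log(p(i) / q(i)), in nats.

    Terms with p(i) == 0 contribute 0 by convention. If q(i) == 0 where
    p(i) > 0, the divergence is infinite.
    """
    total = 0.0
    for i, p_i in p.items():
        if p_i == 0.0:
            continue
        if q[i] == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q[i])
    return total


D1 = {"a": 0.8, "b": 0.1, "c": 0.1}
D2 = {"a": 0.3, "b": 0.3, "c": 0.4}

kl_12 = kl_divergence(D1, D2)  # D_KL(D1 || D2)
kl_21 = kl_divergence(D2, D1)  # D_KL(D2 || D1), a different number
kl_11 = kl_divergence(D1, D1)  # a distribution compared with itself
print(kl_12, kl_21, kl_11)
```

Comparing a distribution with itself gives 0, and swapping the arguments changes the result.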

Generalize to continuous distributions instead of discrete:

$$\int_{-\infty}^{\infty} p_{D1}(x)log(\frac{p_{D1}(x)}{p_{D2}(x)}) dx$$

or
$$\int_{x \sim D1} p_{D1}(x)log(\frac{p_{D1}(x)}{p_{D2}(x)}) dx$$

Equivalent to 

$$D_{KL}(D1 || D2) = \mathbb{E}_{x \sim D1} [log(\frac{p_{D1}(x)}{p_{D2}(x)})]$$
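The expectation form suggests a simple estimator: sample x from D1 and average the log ratio. Below is a sketch for two 1-D Gaussians, chosen only because their KL divergence has a well-known closed form to compare against. The specific means, standard deviations, and sample size are arbitrary assumptions.

```python
import math
import random


def normal_logpdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)


def kl_gaussians_closed_form(m1, s1, m2, s2):
    """Analytic D_KL(N(m1, s1^2) || N(m2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5


def kl_monte_carlo(m1, s1, m2, s2, n, rng):
    """Estimate E_{x ~ D1}[log p_D1(x) - log p_D2(x)] from n samples of D1."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(m1, s1)  # x ~ D1
        total += normal_logpdf(x, m1, s1) - normal_logpdf(x, m2, s2)
    return total / n


rng = random.Random(0)
exact = kl_gaussians_closed_form(0.0, 1.0, 1.0, 2.0)
estimate = kl_monte_carlo(0.0, 1.0, 1.0, 2.0, 100_000, rng)
print(exact, estimate)
```

The sampled estimate should land close to the closed-form value, with the gap shrinking as the number of samples grows.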


# Take 1

**Motivation:**
<br>
Our purpose is to learn the generative
process, i.e., p(x|z) (we assume p(z) is known). A good p(x|z) would assign high probabilities to
observed x; hence, we can learn a good p(x|z) by maximizing the probability of observed data,
i.e., p(x). Assuming that p(x|z) is parameterized by θ, the optimization
problem is to maximize the following:
<br>

$$p_\theta(x) = \int_zp_\theta(x|z)p(z)dz $$



The integrand is the joint density $p_\theta(x, z) = p_\theta(x|z)p(z)$, and integrating it over $z$ marginalizes the latent variable out. Here x is our samples, which we imagine as having been "drawn" given an "imaginary" latent z. In other words, $p_\theta(x)$ is the probability (density) the model assigns to the data that we have. Since our samples obviously exist, we want $p_\theta(x)$ to be as large as possible.<br>
Unfortunately, taking this integral over $z$ is intractable (i.e., very computationally expensive) when dealing with high dimensions.<br>
<br>
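To get a feel for the integral, here is a sketch that estimates $p_\theta(x)$ by plain sampling in a one-dimensional toy model where the answer is known. The model itself (z ~ N(0, 1), x|z ~ N(z, 1), hence p(x) = N(0, 2)) is an assumption chosen for illustration, not the VAE model.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def marginal_likelihood_mc(x, n, rng):
    """Estimate p(x) = integral of p(x|z) p(z) dz by averaging p(x|z) over z ~ p(z)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)         # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)  # p(x|z) = N(x; z, 1)
    return total / n


rng = random.Random(0)
x = 1.5
estimate = marginal_likelihood_mc(x, 100_000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # known answer for this toy model
print(estimate, exact)
```

In one dimension this works fine. The usual explanation for why it breaks down in high dimensions is that almost every z sampled from the prior gives a negligible $p_\theta(x|z)$, so a reliable estimate needs an impractical number of samples.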
The solution is to take a step back and look at a different part of the VAE: the posterior **$p_\theta(z|x)$**, which is what the encoder is meant to represent. We can't compute it directly, since by Bayes' rule it depends on the intractable $p_\theta(x)$. Instead, the best we can do is try to approximate it with our own version **$q_\phi(z|x)$**.<br>
To do this we use **Variational Inference** with **KL divergence**. That is, we want to minimize $D_{KL}(q_\phi(z|x) || p_\theta(z|x))$:
$$
\begin{align}
D_{KL}(q_\phi(z|x) || p_\theta(z|x)) & = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p_\theta(z|x)}\big)dz \\
& = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)p_\theta(x)}{p_\theta(z,x)}\big)dz \\
& = \int_{z}q_\phi(z|x)\bigg(log\big(\frac{q_\phi(z|x)}{p_\theta(z,x)}\big) + log\big(p_\theta(x)\big)\bigg)dz \\
& = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p_\theta(z,x)}\big)dz + \int_{z}q_\phi(z|x)log\big(p_\theta(x)\big)dz \\
D_{KL}(q_\phi(z|x) || p_\theta(z|x)) & = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p_\theta(z,x)}\big)dz + log\big(p_\theta(x)\big)
\end{align}
$$
Now substitute:
$$-\mathcal{L}(\phi, \theta) = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p_\theta(z,x)}\big)dz$$

$$
\begin{align}
D_{KL}(q_\phi(z|x) || p_\theta(z|x)) & =  - \mathcal{L}(\phi, \theta) + log\big(p_\theta(x)\big) \\
D_{KL}(q_\phi(z|x) || p_\theta(z|x)) + \mathcal{L}(\phi, \theta) & =  log\big(p_\theta(x)\big)
\end{align}
$$

Since KL divergence is always $\geq 0$, we can rewrite this as:
$$\mathcal{L}(\phi, \theta) \leq  log\big(p_\theta(x)\big)$$

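Reminder that our original objective was to maximize $p_\theta(x)$, which is equivalent to maximizing $log\big(p_\theta(x)\big)$ because log is monotonic. We can do this by maximizing $\mathcal{L}(\phi, \theta)$, which is a lower bound on $log\big(p_\theta(x)\big)$ (commonly called the evidence lower bound, or ELBO).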
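The bound and the identity behind it can be verified exactly on a tiny model where every integral is just a two-term sum. The binary latent, the single fixed observation, and all probabilities below are invented for illustration.

```python
import math

# Toy model with a binary latent z and one fixed observation x.
# All numbers are made up for illustration.
p_z = [0.6, 0.4]          # prior p(z)
p_x_given_z = [0.2, 0.7]  # likelihood p(x|z) of the observed x, for z = 0, 1

p_joint = [p_x_given_z[z] * p_z[z] for z in range(2)]  # p(z, x)
p_x = sum(p_joint)                                      # p(x)
posterior = [pj / p_x for pj in p_joint]                # true p(z|x)


def elbo(q):
    """L = sum_z q(z) * log(p(z, x) / q(z))."""
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in range(2))


def kl(q, p):
    """D_KL(q || p) for two distributions over z."""
    return sum(q[z] * math.log(q[z] / p[z]) for z in range(2))


q_guess = [0.5, 0.5]  # an arbitrary approximate posterior q(z|x)
log_px = math.log(p_x)
gap = log_px - elbo(q_guess)         # should equal D_KL(q || p(z|x))
elbo_at_posterior = elbo(posterior)  # bound is tight when q = p(z|x)
print(log_px, elbo(q_guess), kl(q_guess, posterior), elbo_at_posterior)
```

The gap between $log(p_\theta(x))$ and $\mathcal{L}$ is exactly the KL term, and it closes when q matches the true posterior.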

### Useful formulas:

$$p(a|b) = \frac{p(b|a)p(a)}{p(b)}$$
$$p(a|b)p(b) = p(a,b) = p(b,a) = p(b|a)p(a)$$

$$\mathbb{E}_{x \sim D1} [f(x)] = \int_x p_{D1}(x)f(x)dx$$

If $f$ does not depend on $z$ (because $q_\phi(z|x)$ integrates to 1 over $z$):

$$\int_{z}q_\phi(z|x)f\,dz = f$$

The maximization problem is therefore to maximize $\mathcal{L}(\phi, \theta)$ with respect to $\phi$ and $\theta$, where:

$$-\mathcal{L}(\phi, \theta) = \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p_\theta(z,x)}\big)dz$$
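One common way to make this objective workable, sketched here as an aside rather than as part of the derivation in these notes, is to flip the sign and split the joint as $p_\theta(z,x) = p_\theta(x|z)p(z)$, with $p(z)$ the known prior assumed earlier:

$$
\begin{align}
\mathcal{L}(\phi, \theta) & = \int_{z}q_\phi(z|x)log\big(\frac{p_\theta(x|z)p(z)}{q_\phi(z|x)}\big)dz \\
& = \int_{z}q_\phi(z|x)log\big(p_\theta(x|z)\big)dz - \int_{z}q_\phi(z|x)log\big(\frac{q_\phi(z|x)}{p(z)}\big)dz \\
& = \mathbb{E}_{z \sim q_\phi(z|x)} [log(p_\theta(x|z))] - D_{KL}(q_\phi(z|x) || p(z))
\end{align}
$$

Read this way, the first term rewards reconstructing x from z sampled out of $q_\phi(z|x)$, and the second term keeps $q_\phi(z|x)$ close to the prior.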
