**Summary of "Auto-Encoding Variational Bayes" (Kingma and Welling, 2014)**

### **Introduction**
The paper "Auto-Encoding Variational Bayes" introduces the Variational Autoencoder (VAE), a generative model that addresses the problem of efficient approximate inference in probabilistic models with latent variables.

#### **Problem Statement**
Latent variable models are powerful tools for unsupervised learning, where the goal is to infer latent (hidden) variables \( z \) that explain the observed data \( x \). The key challenges in using such models are:
1. Computing the posterior distribution \( p(z|x) \), which is often intractable because it depends on the marginal likelihood \( p(x) \), an integral over all values of \( z \) that rarely has a closed form.
2. Efficiently learning model parameters through optimization.

#### **Proposed Solution**
The authors propose using **variational inference**, a framework for approximating intractable posterior distributions, combined with an efficient reparameterization trick to enable gradient-based optimization. This approach allows for training both the generative model and the approximate posterior simultaneously. The resulting model, the VAE, combines probabilistic modeling with neural networks, making it scalable and efficient.

---

### **Background**
The VAE builds on the framework of probabilistic latent variable models. In these models:

1. **Latent Variables:** A latent variable \( z \) is sampled from a prior distribution \( p(z) \).
2. **Generative Process:** The observed data \( x \) is generated from a conditional likelihood \( p(x|z) \), parameterized by the latent variable \( z \).

The marginal likelihood of the data is given by:
\[
p(x) = \int p(x|z)p(z)dz.
\]

The posterior distribution \( p(z|x) \) is needed for inference but is generally intractable because computing \( p(x) \) requires integrating over \( z \). The authors approximate \( p(z|x) \) with a variational distribution \( q(z|x) \), parameterized by a neural network.
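
To make the integral concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is not taken from the paper): \( p(z) = \mathcal{N}(0, 1) \) and \( p(x|z) = \mathcal{N}(x; z, 1) \). In this special case the marginal is known exactly, \( p(x) = \mathcal{N}(x; 0, 2) \), so a naive Monte Carlo estimate that samples \( z \) from the prior can be checked against it. The function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginal_likelihood_mc(x, num_samples=50_000, seed=0):
    """Estimate p(x) = E_{p(z)}[p(x|z)] by sampling z from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # z ~ p(z) = N(0, 1)  (toy assumption)
        total += normal_pdf(x, z, 1.0)  # p(x|z) = N(x; z, 1) (toy assumption)
    return total / num_samples


def marginal_likelihood_exact(x):
    """Closed form for this toy model only: p(x) = N(x; 0, 2)."""
    return normal_pdf(x, 0.0, 2.0)
```

Sampling from the prior works here because the latent space is one-dimensional. With a neural-network likelihood and many latent dimensions, most prior samples explain a given \( x \) poorly, which is one reason an approximate posterior is introduced instead.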

---

### **Methodology**

#### **Evidence Lower Bound (ELBO)**
The key idea is to maximize a lower bound on the marginal likelihood \( \log p(x) \), known as the Evidence Lower Bound (ELBO):
\[
\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z)),
\]
where:
- The first term, \( \mathbb{E}_{q(z|x)}[\log p(x|z)] \), is the expected log-likelihood of the data under the approximate posterior, also called the **reconstruction term**.
- The second term, \( D_{KL}(q(z|x) \| p(z)) \), is the Kullback-Leibler (KL) divergence between the approximate posterior \( q(z|x) \) and the prior \( p(z) \), which acts as a regularizer.

The ELBO can be written as:
\[
\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q(z|x; \phi)}[\log p(x|z; \theta)] - D_{KL}(q(z|x; \phi) \| p(z)).
\]
Here, \( \theta \) are the parameters of the generative model \( p(x|z) \), and \( \phi \) are the parameters of the variational distribution \( q(z|x) \).
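
As a rough numerical illustration of the bound, the sketch below estimates the ELBO on an illustrative toy model (an assumption for this example, not the paper's setup): \( p(z) = \mathcal{N}(0, 1) \), \( p(x|z) = \mathcal{N}(x; z, 1) \), and a Gaussian \( q(z|x) \) with a hand-picked mean and variance. The reconstruction term is estimated by sampling from \( q \), and the KL term uses the closed form for two univariate Gaussians. Function names are hypothetical.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def elbo_estimate(x, q_mean, q_var, num_samples=20_000, seed=0):
    """Monte Carlo ELBO for p(z)=N(0,1), p(x|z)=N(x; z, 1), q(z|x)=N(q_mean, q_var)."""
    rng = random.Random(seed)
    q_std = math.sqrt(q_var)
    recon = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)                  # z ~ q(z|x)
        recon += -0.5 * LOG_2PI - 0.5 * (x - z) ** 2  # log p(x|z)
    recon /= num_samples
    # KL( N(q_mean, q_var) || N(0, 1) ) in closed form.
    kl = 0.5 * (q_mean ** 2 + q_var - 1.0 - math.log(q_var))
    return recon - kl


def log_marginal_exact(x):
    """log p(x) for this toy model only, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(4.0 * math.pi) - x ** 2 / 4.0
```

For this toy model the exact posterior is \( \mathcal{N}(x/2, 1/2) \), so `elbo_estimate(1.0, 0.5, 0.5)` should land close to `log_marginal_exact(1.0)`, while a mismatched choice such as `elbo_estimate(1.0, 0.0, 1.0)` should fall clearly below it.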

#### **Optimization of the ELBO**
The authors propose to optimize the ELBO jointly with respect to both \( \theta \) and \( \phi \). Two ingredients make this practical:
1. **Reparameterization Trick:** Sampling \( z \sim q(z|x; \phi) \) directly hides the dependence of \( z \) on \( \phi \), so gradients cannot be backpropagated through the sampling step. The authors instead sample noise from a fixed, simple distribution (e.g., a standard normal) and transform it deterministically:
   \[
   z = \mu(x; \phi) + \sigma(x; \phi) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
   \]
   Here, \( \mu(x; \phi) \) and \( \sigma(x; \phi) \) are outputs of the encoder network, and \( \odot \) denotes element-wise multiplication.

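2. **Stochastic Gradient Descent:** Because the noise \( \epsilon \) does not depend on \( \phi \), the gradient of the ELBO can be estimated with a few Monte Carlo samples and backpropagated through \( z \). The paper reports that a single sample per data point suffices when the minibatch is large enough, which enables the use of standard gradient-based optimizers.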
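
The effect of the reparameterization can be checked on a scalar example. The sketch below is illustrative only: it uses the hypothetical test function \( f(z) = z^2 \) in place of the ELBO integrand, for which \( \mathbb{E}[f(z)] = \mu^2 + \sigma^2 \) and the true gradients are \( 2\mu \) and \( 2\sigma \). Writing \( z = \mu + \sigma \epsilon \) lets the chain rule produce gradient estimates from ordinary samples of \( \epsilon \).

```python
import random


def reparam_gradients(mu, sigma, num_samples=50_000, seed=0):
    """Pathwise gradient estimates of E[f(z)], f(z) = z**2, z ~ N(mu, sigma**2)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu or sigma
        z = mu + sigma * eps       # deterministic, differentiable transform
        df_dz = 2.0 * z            # derivative of f(z) = z**2
        grad_mu += df_dz * 1.0     # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples
```

Calling `reparam_gradients(1.5, 0.5)` should return values near the analytic gradients \( 2\mu = 3 \) and \( 2\sigma = 1 \), up to Monte Carlo error.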

---

### **Model Architecture**
The VAE consists of two main components:

1. **Encoder (Inference Model):** Approximates the posterior distribution \( q(z|x; \phi) \). This is implemented as a neural network that outputs the mean \( \mu \) and variance \( \sigma^2 \) of a Gaussian distribution.
   \[
   q(z|x; \phi) = \mathcal{N}(z; \mu(x; \phi), \text{diag}(\sigma^2(x; \phi))).
   \]

2. **Decoder (Generative Model):** Models the likelihood \( p(x|z; \theta) \). Given a latent variable \( z \), the decoder generates data \( x \) through a neural network parameterized by \( \theta \).
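
With a diagonal Gaussian encoder as above, and assuming the prior is the standard normal \( p(z) = \mathcal{N}(0, I) \) (the choice made in the paper, though not stated explicitly in this summary), the KL term of the ELBO has a closed form and needs no sampling. The sketch below takes the log-variance as input, a common parameterization for numerical stability that is an assumption here; the function name is hypothetical.

```python
import math


def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv  # per-dimension: mu^2 + sigma^2 - 1 - log sigma^2
        for m, lv in zip(mu, log_var)
    )
```

The result is zero when the encoder output matches the prior (`mu` all zeros, `log_var` all zeros) and grows as the mean moves away from zero or the variance moves away from one.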

---

### **Derivation of the ELBO**
To derive the ELBO, start with the marginal likelihood:
\[
\log p(x) = \log \int p(x|z)p(z)dz.
\]
Introducing the variational distribution \( q(z|x) \):
\[
\log p(x) = \log \int q(z|x) \frac{p(x|z)p(z)}{q(z|x)} dz.
\]
Applying Jensen's inequality:
\[
\log p(x) \geq \int q(z|x) \log \frac{p(x|z)p(z)}{q(z|x)} dz = \mathcal{L}(x; \theta, \phi).
\]
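Splitting the logarithm separates the likelihood from the ratio of prior to approximate posterior:
\[
\mathcal{L}(x; \theta, \phi) = \int q(z|x) \log p(x|z) dz + \int q(z|x) \log \frac{p(z)}{q(z|x)} dz.
\]
The ELBO therefore consists of two terms:
1. **Reconstruction Term** (the negative of the reconstruction loss):
   \[
   \mathbb{E}_{q(z|x)}[\log p(x|z)].
   \]
2. **Negative KL Divergence:**
   \[
   -D_{KL}(q(z|x) \| p(z)).
   \]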
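
The size of the gap in Jensen's inequality can also be made explicit. As a worked step using only the definitions above, substituting Bayes' rule \( p(x|z)p(z) = p(z|x)p(x) \) into the bound gives:
\[
\mathcal{L}(x; \theta, \phi) = \int q(z|x) \log \frac{p(z|x)p(x)}{q(z|x)} dz = \log p(x) - D_{KL}(q(z|x) \| p(z|x)).
\]
Since the KL divergence is non-negative, this recovers \( \log p(x) \geq \mathcal{L}(x; \theta, \phi) \), with equality exactly when \( q(z|x) \) matches the true posterior \( p(z|x) \). Maximizing the ELBO over \( \phi \) therefore pushes the approximate posterior toward the true one.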

---

### **Training Algorithm**
The training process involves:
1. Sampling a minibatch of data points \( x \).
2. Passing \( x \) through the encoder to compute \( \mu(x; \phi) \) and \( \sigma(x; \phi) \).
3. Sampling \( z \) using the reparameterization trick.
4. Passing \( z \) through the decoder to compute \( \log p(x|z; \theta) \).
5. Computing the KL divergence term.
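6. Backpropagating through the ELBO estimate to obtain gradients with respect to \( \theta \) and \( \phi \), then taking a gradient ascent step (equivalently, a descent step on the negative ELBO).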
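
The steps above can be sketched end to end on a deliberately tiny model. Everything in this example is an illustrative assumption rather than the paper's setup: the data are synthetic scalars, the decoder is \( p(x|z) = \mathcal{N}(x; wz, 1) \), the encoder is \( q(z|x) = \mathcal{N}(z; ax, s^2) \) with \( s = e^{\rho} \), and the gradients are derived by hand instead of by automatic differentiation. For this model the optimum is known (\( |w| = 1 \), \( aw = 1/2 \), \( s^2 = 1/2 \)), which makes the result easy to sanity-check.

```python
import math
import random


def train_toy_vae(steps=3000, batch_size=32, lr=0.01, seed=0):
    """Train a 1-D linear-Gaussian VAE with hand-derived reparameterized gradients.

    Prior p(z) = N(0, 1); decoder p(x|z) = N(x; w*z, 1) with theta = w;
    encoder q(z|x) = N(z; a*x, s**2), s = exp(rho), with phi = (a, rho).
    """
    rng = random.Random(seed)
    w, a, rho = 0.5, 0.1, 0.0
    for _ in range(steps):
        g_w = g_a = g_rho = 0.0
        for _ in range(batch_size):
            # 1. Sample a data point (synthetic: latent plus noise, so x ~ N(0, 2)).
            x = rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0)
            # 2. Encoder outputs the mean and standard deviation of q(z|x).
            mu, s = a * x, math.exp(rho)
            # 3. Reparameterized sample of z.
            eps = rng.gauss(0.0, 1.0)
            z = mu + s * eps
            # 4. Decoder residual; log p(x|z) = -0.5 * r**2 + const.
            r = x - w * z
            # 5-6. Chain-rule gradients of log p(x|z) - KL(q(z|x) || p(z)).
            g_w += r * z
            g_a += r * w * x - mu * x              # KL contributes a * x**2
            g_rho += r * w * s * eps - (s * s - 1.0)  # KL contributes s**2 - 1
        # Gradient ascent on the minibatch-averaged ELBO estimate.
        w += lr * g_w / batch_size
        a += lr * g_a / batch_size
        rho += lr * g_rho / batch_size
    return w, a, math.exp(2.0 * rho)
```

With the default arguments, the returned values should settle near \( w \approx 1 \), \( a \approx 0.5 \), and \( s^2 \approx 0.5 \), up to stochastic-gradient noise. A practical VAE would replace the scalar maps with neural networks and the hand-derived gradients with automatic differentiation.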

---

### **Experiments**
The authors evaluate the VAE on several benchmark datasets, including MNIST and Frey Faces, showing:
- Visually plausible generated samples, along with visualizations of the learned latent manifold.
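- Robustness to overfitting: adding superfluous latent dimensions did not lead to overfitting, which the authors attribute to the regularizing effect of the variational bound.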

---

### **Contributions**
1. **Reparameterization Trick:** A novel way to compute gradients for stochastic latent variables, making variational inference scalable.
2. **VAE Framework:** A combination of probabilistic modeling and deep learning for efficient and scalable unsupervised learning.

---

### **Conclusion**
The Variational Autoencoder (VAE) introduces a principled and practical framework for generative modeling, leveraging variational inference and neural networks. It combines probabilistic latent variable models with scalable optimization techniques, laying the groundwork for further advancements in deep generative models.

