<a href="https://colab.research.google.com/github/Annanyas/Chess-State-Detection/blob/master/VAE_Inverse_Autoregressive_Flow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
### [Link to the code](https://drive.google.com/drive/folders/1--3VHjRRUErwYbmgGpM_EVrtpE7GcTzL?usp=sharing)
#### Variational Autoencoders (VAEs)
Variational Autoencoders (VAEs) are a class of generative models that learn a probabilistic mapping from a latent space to observed data. A VAE models the data distribution \( p(x) \) by introducing a latent variable \( z \) and optimizing the Evidence Lower Bound (ELBO) on the log-likelihood:

$$
\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \| p(z))
$$

Here, \( q(z|x) \) is the approximate posterior, \( p(z) \) is the prior, and \( \text{KL} \) represents the Kullback-Leibler divergence. VAEs rely on reparameterization for efficient gradient-based optimization, but their performance can be limited by the choice of a simple Gaussian posterior.

#### Normalizing Flows
Normalizing Flows extend the flexibility of \( q(z|x) \) by applying a sequence of invertible transformations \( f_k \) to a simple base distribution \( u \). The transformed variable \( z \) is computed as:

$$
z = f_K \circ f_{K-1} \circ \dots \circ f_1(u)
$$

The log-density of \( z \) follows from the change of variables formula. Writing \( z_0 = u \) and \( z_k = f_k(z_{k-1}) \), so that \( z = z_K \):

$$
\log q(z) = \log q(u) - \sum_{k=1}^K \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
$$
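
As a rough illustration of the change of variables formula, the sketch below pushes a standard normal sample through two elementwise affine steps and subtracts each log-determinant. The affine steps and their parameter values are assumptions chosen for illustration (they are not the flows used later in this notebook); they are convenient because the result can be checked against a closed form.

```python
import numpy as np

def standard_normal_logpdf(u):
    """Log-density of N(0, I), summed over the last axis."""
    return -0.5 * np.sum(u**2 + np.log(2.0 * np.pi), axis=-1)

def affine_flow(u, scale, shift):
    """One invertible step z = scale * u + shift and its log|det Jacobian|."""
    z = scale * u + shift
    log_det = np.sum(np.log(np.abs(scale)))  # Jacobian is diag(scale)
    return z, log_det

rng = np.random.default_rng(0)
u = rng.standard_normal(3)
log_q = standard_normal_logpdf(u)

# Two hypothetical flow steps (scale, shift) with arbitrary illustrative values.
steps = [
    (np.array([2.0, 0.5, 1.5]), np.array([0.1, -0.3, 0.0])),
    (np.array([0.8, 3.0, 1.0]), np.array([1.0, 0.0, -2.0])),
]
z = u
for scale, shift in steps:
    z, log_det = affine_flow(z, scale, shift)
    log_q = log_q - log_det  # subtract each step's log|det Jacobian|

# Closed-form check: the composition is itself affine, so z is Gaussian.
total_scale = steps[0][0] * steps[1][0]
total_shift = steps[0][1] * steps[1][0] + steps[1][1]
expected = (standard_normal_logpdf((z - total_shift) / total_scale)
            - np.sum(np.log(total_scale)))
print(log_q, expected)
```

The two printed values should agree, since the accumulated log-density and the closed-form Gaussian density describe the same distribution.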

While Normalizing Flows provide flexible posteriors, a general flow is computationally expensive in high-dimensional latent spaces: computing the determinant of an unstructured \( D \times D \) Jacobian costs \( O(D^3) \).

#### Motivation for Inverse Autoregressive Flow (IAF)
IAF addresses the limitations of simple posteriors by leveraging autoregressive models to parameterize flows efficiently. Because each step is an invertible autoregressive transformation, its Jacobian is triangular and the log-determinant reduces to a sum over dimensions. IAF therefore provides a flexible posterior with efficient sampling and efficient density evaluation of the samples it draws, which is what the ELBO requires. This is particularly advantageous for high-dimensional latent spaces, where standard Normalizing Flows may struggle.

#### Key Contributions
1. Introduces IAF to enhance posterior flexibility in VAEs.
2. Demonstrates improved log-likelihoods on complex datasets.
3. Provides an efficient framework for scaling Normalizing Flows in high-dimensional latent spaces.

This framework bridges the gap between simple Gaussian posteriors and computationally demanding flows, making it suitable for generative tasks across diverse domains.


In [None]:
# Code to set up the assignment
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/

Mounted at /content/drive
/content/drive/MyDrive



In [None]:
!pip3 install --upgrade --no-deps git+https://github.com/dlsys10714/mugrade.git
!pip3 install pybind11

Collecting git+https://github.com/dlsys10714/mugrade.git
  Cloning https://github.com/dlsys10714/mugrade.git to /tmp/pip-req-build-1p6izbtb
  Running command git clone --filter=blob:none --quiet https://github.com/dlsys10714/mugrade.git /tmp/pip-req-build-1p6izbtb
  Resolved https://github.com/dlsys10714/mugrade.git to commit 656cdc2b7ad5a37e7a5347a7b0405df0acd72380
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: mugrade
  Building wheel for mugrade (setup.py) ... done
  Created wheel for mugrade: filename=mugrade-1.2-py3-none-any.whl size=3935 sha256=5c5e2e5576c5f467dae9e07e5437c7b56de89e7351109afbc30290bc9a8ca071
  Stored in directory: /tmp/pip-ephem-wheel-cache-e37t7j4q/wheels/8b/ba/3a/621da1207eab160c01968c5e0bd1266f505b9e3f8010376d61
Successfully built mugrade
Installing collected packages: mugrade
Successfully installed mugrade-1.2
Collecting pybind11
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Download

In [None]:
!make

-- Found pybind11: /usr/local/lib/python3.10/dist-packages/pybind11/include (found version "2.13.6")
-- Found cuda, building cuda backend
Thu Dec 12 17:47:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-------------------

In [None]:
import sys
sys.path.append('/content/drive/MyDrive/10714_1/hw4/python')
sys.path.append('/content/drive/MyDrive/10714_1/hw4/apps')
import needle as ndl
import needle.nn as nn
import numpy as np
import time
import matplotlib.pyplot as plt

Using needle backend


## Setup Overview
We aim to train variational autoencoders using the MNIST dataset, which contains grayscale images of handwritten digits (28x28 pixels). Each image is flattened into a vector of size 784. The two models will learn a probabilistic latent representation using:

**Case 1**: Mean-Field Approximation: A simpler, factorized approach to approximating the posterior.

**Case 2**: Inverse Autoregressive Flow (IAF): A more flexible posterior approximation using autoregressive transformations.

## Common Network Setup
Both cases share a foundational architecture for the encoder (variational posterior) and the decoder (generative model). Here's the structural breakdown:

### Encoder Network (Inference Network):
- **Input**: A flattened image of size $784$.
- **Hidden Layer**: Fully connected layer of size $512$ with ReLU activation.
- **Output Layer**:
  - **Mean ($\mu$)**: Outputs a vector of size $L$, where $L$ is the latent space dimension.
  - **Log-Variance ($\log \sigma^2$)**: Outputs another vector of size $L$.

The latent vector $z$ is sampled using the reparameterization trick:
$$
z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$
Here, $\sigma = \text{softplus}(\log \sigma^2)$ ensures positivity: the encoder's second output is passed through a softplus rather than exponentiated.

### Decoder Network (Generative Network):
- **Input**: A sampled latent vector $z$ of size $L$.
- **Hidden Layer**: Fully connected layer of size $512$ with ReLU activation.
- **Output Layer**: Outputs logits of size $784$; a sigmoid maps them to per-pixel Bernoulli probabilities for the reconstruction.

**Reconstruction Loss**: Uses the binary cross-entropy between input and reconstructed pixels.
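
The binary cross-entropy can be computed directly from the decoder logits without forming probabilities first. The following is a minimal NumPy sketch of that idea, not the notebook's needle implementation; the function name and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def bernoulli_log_prob_from_logits(x, logits):
    """log p(x | z) for Bernoulli pixels, summed over pixels.

    This is the negative binary cross-entropy, written in a form that
    stays finite for very large or very small logits.
    """
    bce = (np.maximum(logits, 0.0) - logits * x
           + np.log1p(np.exp(-np.abs(logits))))
    return -np.sum(bce, axis=-1)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(4, 784)).astype(np.float64)  # stand-in binarized images
logits = rng.standard_normal((4, 784))                     # stand-in decoder outputs
log_px_given_z = bernoulli_log_prob_from_logits(x, logits)

# Naive formula for comparison; adequate only for moderate logits.
p = 1.0 / (1.0 + np.exp(-logits))
naive = np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p), axis=-1)
print(np.allclose(log_px_given_z, naive))
```

The stable form matters once training pushes logits far from zero, where `np.log(p)` or `np.log(1 - p)` in the naive version can underflow to `-inf`.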

## Training Details
- **Objective**: Maximize the Evidence Lower Bound (ELBO), or equivalently minimize the negative ELBO as the loss. The ELBO is given by:
$$
\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{D}_{KL}(q(z|x) \| p(z))
$$
- **Optimizer**: Adam with learning rate $10^{-3}$.
- **Batch Size**: $128$.
- **Dataset**: Preprocessed MNIST data, normalized to $[0, 1]$.




In [None]:
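# Download the datasets you will be using for this assignment.
# Note: this cell fetches CIFAR-10; the VAE experiments described in this notebook use MNIST.

import urllib.request
import os

os.makedirs("./data", exist_ok=True)  # urlretrieve fails if the target directory is missing
if not os.path.isdir("./data/cifar-10-batches-py"):
    urllib.request.urlretrieve("https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz", "./data/cifar-10-python.tar.gz")
    !tar -xvzf './data/cifar-10-python.tar.gz' -C './data'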

# Case 1: Mean-Field Approximation

### Overview

In **Mean-Field Approximation**, the goal is to approximate the posterior distribution \( q(z|x) \) of the latent variables \( z \) given the input \( x \). The key assumption is that the latent variables are independent of each other given \( x \). To achieve this, we model \( q(z|x) \) as a **multivariate Gaussian distribution** with a **diagonal covariance matrix**. This **diagonal Gaussian approximation** simplifies computations but cannot represent correlations between latent dimensions.

The variational posterior distribution $q(z|x)$ is given by:

$$
q(z|x) = \mathcal{N}(\mu(x), \text{diag}(\sigma^2(x)))
$$

Here, $\mu(x)$ is the mean vector and $\sigma^2(x)$ is the variance vector for the latent variables, both of which are parameterized by the encoder network.

### Encoder Network

The encoder network is responsible for computing the **mean** $\mu$ and **log-variance** $\log \sigma^2$ of the latent space. The network has the following structure:

1. **Input**: The input to the encoder is the flattened image \( x \), which is a vector of size \( 784 \) (for 28x28 MNIST images).

2. **Hidden Layer**: A fully connected layer with \( 512 \) neurons, followed by a ReLU activation function.

3. **Output Layer**:
   - **Mean vector** $\mu(x)$ of size $L$, where $L$ is the number of latent variables.
   - **Log-variance vector** $\log \sigma^2(x)$ of size $L$.

Thus, the encoder outputs two vectors, one for the mean and one for the log-variance.

### Reparameterization Trick

The reparameterization trick is applied to sample the latent variables \( z \). The reparameterization allows backpropagation through the sampling process. The latent variable \( z \) is sampled as follows:

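$$
z = \mu(x) + \sigma(x) \cdot \epsilon
$$

where:
- $\mu(x)$ is the mean vector produced by the encoder.
- $\sigma(x) = \text{softplus}(\log \sigma^2(x))$ is the standard deviation. The softplus keeps it positive, so the encoder's second output effectively serves as an unconstrained scale parameter; a literal log-variance would instead give $\sigma = \exp(\tfrac{1}{2} \log \sigma^2)$.
- $\epsilon \sim \mathcal{N}(0, I)$ is a standard Gaussian noise vector.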
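
The sampling step can be sketched in NumPy as follows. This is an illustrative sketch rather than the notebook's needle code; `scale_arg` is a hypothetical name for the encoder's second output, and the values of `mu` and `scale_arg` are made up.

```python
import numpy as np

def softplus(a):
    """Numerically stable log(1 + exp(a)); the result is always positive."""
    return np.logaddexp(0.0, a)

def reparameterize(mu, scale_arg, rng):
    """Sample z = mu + sigma * eps with sigma = softplus(scale_arg)."""
    sigma = softplus(scale_arg)
    eps = rng.standard_normal(mu.shape)  # all randomness lives in eps
    return mu + sigma * eps, sigma

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # hypothetical encoder mean output
scale_arg = np.array([0.0, 1.0])  # hypothetical encoder scale output
samples = np.stack([reparameterize(mu, scale_arg, rng)[0] for _ in range(20000)])
sigma = softplus(scale_arg)

print(samples.mean(axis=0), samples.std(axis=0), sigma)
```

Because `z` is a deterministic function of `mu`, `sigma`, and the noise `eps`, an autodiff framework can differentiate `z` with respect to the encoder outputs while treating `eps` as a constant.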

### Decoder Network

The decoder takes the sampled latent vector \( z \) and tries to reconstruct the input image \( x \). The decoder network has the following structure:

1. **Input**: The sampled latent vector \( z \), which has size \( L \).

2. **Hidden Layer**: A fully connected layer with \( 512 \) neurons, followed by a ReLU activation function.

3. **Output Layer**: The output is a vector of size \( 784 \), representing the reconstructed image.

### ELBO (Evidence Lower Bound)

The model is trained by maximizing the **ELBO** (equivalently, minimizing the negative ELBO as a loss), which is a lower bound on the marginal log-likelihood of the data. The ELBO has two terms:
1. **Reconstruction Loss**: Measures the mismatch between the input \( x \) and the reconstructed output $\hat{x}$.
   $$
   \text{Reconstruction Loss} = -\mathbb{E}_{q(z|x)}[\log p(x|z)]
   $$
   This term is typically computed using **binary cross-entropy** between the input and output.

2. **KL Divergence**: The second term is the **Kullback-Leibler (KL) divergence** between the variational posterior \( q(z|x) \) and the prior \( p(z) \). Here the prior is the standard normal distribution \( p(z) = \mathcal{N}(0, I) \). The closed-form expression for the KL divergence between a diagonal Gaussian and the standard normal is:
   
   $$
   D_{KL}(q(z|x) \| p(z)) = \frac{1}{2} \sum_{i=1}^L \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right)
   $$

   This term penalizes the difference between the approximate posterior and the prior, encouraging the model to learn a latent space close to a standard Gaussian distribution.

Thus, the total ELBO is:

$$
\text{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))
$$
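
The closed-form KL term above can be sanity-checked numerically. The sketch below, with made-up values for $\mu$ and $\sigma$, compares the formula against a Monte Carlo average of $\log q(z|x) - \log p(z)$ over samples from $q$; the helper names are assumptions for illustration.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    var = sigma**2
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var), axis=-1)

def diag_normal_logpdf(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
                  - 0.5 * ((z - mu) / sigma) ** 2, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])    # illustrative values
sigma = np.array([0.8, 1.2, 1.0])  # illustrative values
kl_exact = gaussian_kl_to_standard_normal(mu, sigma)

# Monte Carlo estimate: average of log q(z|x) - log p(z) over z ~ q(z|x).
z = mu + sigma * rng.standard_normal((200000, 3))
kl_mc = np.mean(diag_normal_logpdf(z, mu, sigma) - diag_normal_logpdf(z, 0.0, 1.0))
print(kl_exact, kl_mc)
```

The sample-based form is the one that still applies when the posterior is no longer Gaussian, which is the situation in the flow case.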

### Latent Space Size

The latent space size \( L \) determines the dimensionality of the latent variables \( z \). Common choices for \( L \) are 32 or 128, depending on the complexity of the dataset and the model. A larger latent space allows for more expressive latent representations but also increases the model complexity.

### Training Procedure

1. **Compute** $\mu(x)$ and $\log \sigma^2(x)$ using the encoder.
2. **Sample** \( z \) using the reparameterization trick.
3. **Pass** \( z \) through the decoder to obtain the reconstructed image $\hat{x}$.
4. **Compute** the ELBO by calculating the reconstruction loss and KL divergence.
5. **Update** the model parameters by maximizing the ELBO (minimizing the negative ELBO) using an optimizer like Adam.
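
Steps 1 to 4 of the procedure above can be sketched as a single forward pass. This is a NumPy-only illustration with randomly initialized weights and random stand-in data, under the layer sizes stated earlier (784, 512, and a latent size of 32); the function names are hypothetical, and step 5 is omitted because it needs an autodiff framework such as needle.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def init_linear(n_in, n_out, rng):
    """Small random weights and zero bias; purely illustrative initialization."""
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

def elbo_forward(x, params, rng):
    (w1, b1), (w_mu, b_mu), (w_sc, b_sc), (w3, b3), (w4, b4) = params
    # 1. Encoder: hidden layer with ReLU, then mean and scale heads.
    h = np.maximum(x @ w1 + b1, 0.0)
    mu = h @ w_mu + b_mu
    sigma = softplus(h @ w_sc + b_sc)
    # 2. Reparameterized sample of the latent vector.
    z = mu + sigma * rng.standard_normal(mu.shape)
    # 3. Decoder: hidden layer with ReLU, then pixel logits.
    logits = np.maximum(z @ w3 + b3, 0.0) @ w4 + b4
    # 4. ELBO = log p(x|z) - KL(q(z|x) || p(z)), one sample per input.
    log_px_z = -np.sum(np.maximum(logits, 0.0) - logits * x
                       + np.log1p(np.exp(-np.abs(logits))), axis=-1)
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2), axis=-1)
    return log_px_z - kl

rng = np.random.default_rng(0)
latent = 32
params = [init_linear(784, 512, rng), init_linear(512, latent, rng),
          init_linear(512, latent, rng), init_linear(latent, 512, rng),
          init_linear(512, 784, rng)]
x = rng.integers(0, 2, size=(128, 784)).astype(np.float64)  # stand-in for a binarized batch
elbo = elbo_forward(x, params, rng)
loss = -elbo.mean()  # the quantity an optimizer such as Adam would minimize
print(elbo.shape, loss)
```

The loss is the negative mean ELBO over the batch, which matches the convention of minimizing a loss while maximizing the bound.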

---

In [None]:
! python train_variational_autoencoder_pytorch.py --variational "mean-field"

Step 0         	Train ELBO estimate: -551.311	Validation ELBO estimate: -535.389	Validation log p(x) estimate: -390.758	Speed: 4.92e+07 examples/s
Step 9999      	Test ELBO estimate: -543.788	Test log p(x) estimate: -395.208	
Total time: 11.61 minutes


# Case 2: IAF

### 1. **IAF Overview**

Inverse Autoregressive Flow (IAF) is a technique used to improve the expressiveness of the variational posterior $q(z|x)$ in variational inference. It is a normalizing flow that transforms a simple distribution (like a Gaussian) into a more complex distribution by applying a series of invertible transformations. These transformations are autoregressive: within a step, the shift and scale applied to dimension \( i \) are computed only from dimensions \( 1, \dots, i-1 \) of that step's input.

IAF can be viewed as a mechanism for refining a simple latent distribution into one that more accurately represents the true posterior distribution in a probabilistic model. The idea is to start with a simple distribution (such as a Gaussian) and iteratively apply a series of transformations to it, with each transformation acting on the output of the previous one.

### IAF Flow Equation

Given a simple initial distribution $q(z_0|x)$ (e.g., Gaussian), the transformed latent variable \( z \) is generated by applying a series of invertible autoregressive transformations. Mathematically, this is expressed as:

$$
z_t = f_t(z_{t-1}, \theta_t), \quad t = 1, 2, \dots, T
$$

Where:
- $f_t$ represents the autoregressive transformation at flow step \( t \),
- $\theta_t$ are the parameters of the transformation,
- \( T \) is the total number of transformations applied, so the final latent variable is \( z = z_T \).

In each step, the transformation modifies the distribution of the latent variable in a way that makes it closer to the true posterior distribution.

---

## Code Explanation

### 1. **InverseAutoregressiveFlow Class**

The `InverseAutoregressiveFlow` class defines a single block of the IAF. It utilizes the **Masked Autoencoder for Distribution Estimation (MADE)** network to learn the transformation.

The class first creates the transformation network, typically a masked autoencoder, to compute the parameters of the transformation, which are the mean \( m \) and scale \( s \) for each input. The scale is then passed through a sigmoid function, ensuring that the transformation is controlled by a value between 0 and 1.
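
The masking idea behind MADE can be sketched independently of the notebook's code. The example below builds masks for one hidden layer using the usual degree-based rule and checks that output \( i \) can only be reached from inputs \( j < i \). The function name, layer sizes, and random degree assignment are assumptions for illustration, not the contents of `python/nn/mask.py`.

```python
import numpy as np

def made_masks(dim, hidden, rng):
    """Masks for a one-hidden-layer MADE over `dim` inputs (requires dim >= 2)."""
    in_deg = np.arange(1, dim + 1)               # input i has degree i
    hid_deg = rng.integers(1, dim, size=hidden)  # hidden degrees in 1..dim-1
    out_deg = np.arange(1, dim + 1)              # output i has degree i
    # A hidden unit may see inputs whose degree is <= its own degree.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)   # (hidden, dim)
    # An output may see hidden units whose degree is strictly smaller.
    mask_out = (out_deg[:, None] > hid_deg[None, :]).astype(float)  # (dim, hidden)
    return mask_in, mask_out

rng = np.random.default_rng(0)
dim, hidden = 4, 16
mask_in, mask_out = made_masks(dim, hidden, rng)

# Entry (i, j) counts the masked paths from input j to output i.
connectivity = mask_out @ mask_in
print(connectivity)
```

The printed matrix is strictly lower triangular: the first output has no input paths at all (it is driven by its bias only), and no output depends on its own input, which is what makes the Jacobian of the resulting flow triangular.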

#### Equation for the Transformation:

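The final transformation applied to the input \( x \) is given by:

$$
z = \sigma(s) \cdot x + (1 - \sigma(s)) \cdot m
$$

Where:
- $\sigma(s)$ is the sigmoid function applied to the scale \( s \) (not the standard deviation used in the mean-field section),
- \( x \) is the input,
- \( m \) is the mean generated by the transformation network,
- \( z \) is the transformed output.

Because \( m_i \) and \( s_i \) depend only on \( x_1, \dots, x_{i-1} \), the Jacobian \( \partial z / \partial x \) is triangular with diagonal entries \( \sigma(s_i) \), so its log-determinant is \( \sum_i \log \sigma(s_i) \).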
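
A single gated step of this form can be checked numerically. In the sketch below, the MADE network is replaced by a strictly lower-triangular linear map, which is a simplifying assumption that keeps the autoregressive property; the function names and random parameters are illustrative only. The analytic log-determinant, the sum of the log-sigmoid gates, is compared with a finite-difference Jacobian.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def iaf_step(x, w_m, w_s, b_m, b_s):
    """z = sigmoid(s) * x + (1 - sigmoid(s)) * m with autoregressive m and s."""
    # Strictly lower-triangular weights: m_i and s_i depend only on x_j with j < i.
    m = np.tril(w_m, k=-1) @ x + b_m
    s = np.tril(w_s, k=-1) @ x + b_s
    gate = sigmoid(s)
    z = gate * x + (1.0 - gate) * m
    log_det = np.sum(np.log(gate))  # triangular Jacobian with diagonal `gate`
    return z, log_det

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x."""
    jac = np.zeros((x.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        jac[:, j] = (f(x + dx) - f(x - dx)) / (2.0 * eps)
    return jac

rng = np.random.default_rng(0)
dim = 4
params = (rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim)),
          rng.standard_normal(dim), rng.standard_normal(dim))
x = rng.standard_normal(dim)

z, log_det = iaf_step(x, *params)
jac = numerical_jacobian(lambda v: iaf_step(v, *params)[0], x)
numeric_log_det = np.log(np.abs(np.linalg.det(jac)))
print(log_det, numeric_log_det)
```

The two values should agree closely, and the entries of `jac` above the diagonal should be numerically zero, which is why the log-determinant costs only a sum over dimensions.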

### 2. **FlowSequential Class**

The `FlowSequential` class applies multiple IAF blocks in sequence and accumulates the log-probabilities.

This class works by iterating through each block (IAF transformation) in a sequential manner, transforming the input at each step and accumulating the log-probabilities of the transformations. The final output is the transformed latent variable, along with the total log-probability of the transformation sequence.
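
The chaining logic can be sketched as follows. `GatedStep` is a hypothetical stand-in for one IAF block (a triangular linear map instead of MADE), the positive gate bias is an assumption chosen so that each step starts close to the identity, and the `FlowSequential` shown here is a NumPy illustration rather than the needle class itself.

```python
import numpy as np

class GatedStep:
    """Hypothetical stand-in for one IAF block, using triangular linear maps."""

    def __init__(self, dim, rng):
        self.w_m = np.tril(rng.standard_normal((dim, dim)), k=-1)
        self.w_s = np.tril(rng.standard_normal((dim, dim)), k=-1)
        self.b_s = np.full(dim, 2.0)  # assumed bias: gate starts near 1

    def __call__(self, x):
        m = x @ self.w_m.T
        s = x @ self.w_s.T + self.b_s
        gate = 1.0 / (1.0 + np.exp(-s))
        z = gate * x + (1.0 - gate) * m
        return z, -np.log(gate)  # per-dimension adjustment to log q

class FlowSequential:
    """Apply steps in order and accumulate their log-probability adjustments."""

    def __init__(self, steps):
        self.steps = list(steps)

    def __call__(self, x):
        total = np.zeros_like(x)
        for step in self.steps:
            x, log_prob = step(x)
            total = total + log_prob
        return x, total

rng = np.random.default_rng(0)
dim = 3
flow = FlowSequential([GatedStep(dim, rng) for _ in range(2)])
z0 = rng.standard_normal((5, dim))  # batch of samples from the base posterior
zT, log_q_adjustment = flow(z0)
print(zT.shape, log_q_adjustment.sum(axis=-1))
```

With this sign convention, the log-density of the final sample is the base log-density of `z0` plus `log_q_adjustment` summed over dimensions.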

### 2. **Key Components of the IAF Layer**

The IAF layer consists of several important components that help to refine the posterior distribution through a series of steps. These are explained below:

#### 2.1. **Latent Variable (z) Transformation**

- The core idea behind IAF is to transform a simple distribution (usually a Gaussian) into a more complex one by applying a series of transformations.
- Initially, we assume that the latent variable \( z \) follows a simple distribution, such as a standard Gaussian. Over multiple layers, we refine this distribution using invertible and autoregressive transformations, making the posterior \( q(z|x) \) more expressive and capable of modeling more complex data distributions.

#### 2.2. **Autoregressive Transformations**

- **Autoregressive flows** are the transformations that the latent variables undergo. Within each transformation, the parameters for one latent dimension are **conditioned** on the preceding dimensions, which introduces non-linearity and complexity into the distribution.
- These transformations make the posterior distribution more flexible by introducing dependencies between the latent variables, which are initially assumed to be independent.
- In practice, this means that with each additional transformation step, the distribution of the latent variable can become more complex and better able to capture the structure in the data.

#### 2.3. **Upsampling and Downsampling**

- **Upsampling** and **downsampling** layers belong to convolutional variants of IAF, where the latent variables \( z \) are feature maps whose resolution changes at different stages of the model. The fully connected MNIST models described in the Setup Overview do not use them.
  - **Downsampling** typically reduces the resolution of the input, and it's often applied when dealing with high-dimensional data such as images. This allows the model to capture more abstract and global features of the data.
  - **Upsampling** helps recover fine-grained details by increasing the resolution, often after the data has been transformed through several layers of the flow.
- These operations let such models capture both local and global patterns in the data.

#### 2.4. **Normalizing Flows**

- One important aspect of IAF is **normalizing flows**, where each step applies a transformation to the latent space in a way that the distribution becomes increasingly complex.
- **Invertibility** is key to normalizing flows, meaning that each transformation must be reversible so that the model can sample and compute likelihoods effectively.
- The transformations involve **parameterized functions** that model the distribution of the latent variable. In the case of IAF, these functions are autoregressive networks such as **MADE** (used in the code described above) or **masked convolutional networks**.

#### 2.5. **Prior and Posterior Distributions**

- In IAF, the **prior distribution** is usually a simple distribution (e.g., a standard Gaussian) that is applied to the latent variable \( z \).
- The **posterior distribution**, on the other hand, is the distribution \( q(z|x) \) that we aim to refine. Initially, the posterior may be a simple Gaussian, but as we apply autoregressive transformations, it becomes more complex and better suited to model the data.
- The objective is to make the posterior distribution match the true posterior as closely as possible.

#### 2.6. **KL Divergence**

- In the ELBO, the **Kullback-Leibler (KL) divergence** term measures how much the approximate posterior \( q(z|x) \) deviates from the prior \( p(z) \).
- With an IAF posterior this KL term generally has no closed form, so it is estimated from samples as \( \log q(z|x) - \log p(z) \), where \( \log q(z|x) \) includes the accumulated log-determinant terms of the flow. It acts as a regularizer in the loss function during training, and maximizing the ELBO as a whole is equivalent to minimizing the KL divergence between \( q(z|x) \) and the true posterior.

#### 2.7. **Reparameterization Trick**

- **Reparameterization** allows for backpropagation through the latent variable sampling process. In variational inference, when sampling from the posterior, the gradients cannot flow directly through the sampling operation.
- The reparameterization trick solves this by expressing the latent variable as a deterministic transformation of a simple noise variable (typically Gaussian). This enables efficient training using gradient-based methods.
- In IAF, the noise is sampled once to produce the initial latent variable \( z_0 \). Every subsequent flow step is a deterministic, differentiable function of its input, so gradients reach the parameters of the flow through ordinary backpropagation.


### 3. **Intuition Behind IAF**

Imagine trying to fit a very complex, unknown distribution with a simple one. A single Gaussian distribution has limited flexibility to represent complex structures in data, like images or natural language. However, by applying **autoregressive transformations** iteratively, IAF introduces enough flexibility to model complex distributions by modifying the latent variables step by step.

This process is akin to "warping" a simple shape (the Gaussian) into something more complex that matches the data better, but with the added benefit of being able to sample and compute likelihoods efficiently.

### Conclusion

The **IAF layer** is an advanced technique that uses **autoregressive flows** and **invertible transformations** to transform a simple distribution into a complex one. By introducing non-linear dependencies between latent variables, it enhances the expressiveness of variational inference models, making them more capable of modeling complex data distributions.


The code is in `python/nn/nn_basic.py`, `train_vae.py`, and `python/nn/mask.py`.

In [None]:
! python train_variational_autoencoder_pytorch.py --variational "flow"

Step 0         	Train ELBO estimate: -579.209	Validation ELBO estimate: -402.550	Validation log p(x) estimate: -358.586	Speed: 1.67e+07 examples/s
Step 9999      	Test ELBO estimate: -404.351	Test log p(x) estimate: -359.817	
Total time: 30.10 minutes


### Analysis of Mean-Field vs. Inverse Autoregressive Flow (IAF) Models

The comparison between the **Mean-Field** and **Inverse Autoregressive Flow (IAF)** models reveals important insights into their performance, behavior during training, and computational efficiency.

#### 1. **Train ELBO**:
   - The **Mean-Field** model logs a **train ELBO estimate** of **-551.311** at step 0.
   - The **IAF model** logs a **train ELBO estimate** of **-579.209** at step 0. Both values are taken at the very start of training, so they mostly reflect the initialization and say little about the quality of either posterior family.

#### 2. **Validation ELBO**:
   - The **Mean-Field** model shows a **validation ELBO estimate** of **-535.389** at step 0 and a **test ELBO estimate** of **-543.788** at step 9999.
   - The **IAF model** shows a **validation ELBO estimate** of **-402.550** at step 0 and a **test ELBO estimate** of **-404.351** at step 9999, about 139 higher than the mean-field test ELBO. The logs contain only step 0 and the final test evaluation, and those two numbers come from different data splits, so they do not show the training trajectory or indicate over- or underfitting.

#### 3. **Validation Log p(x)**:
   - The **Mean-Field** model has a **validation log p(x) estimate** of **-390.758** at step 0 and a **test log p(x) estimate** of **-395.208** at step 9999.
   - The **IAF model** has a **validation log p(x) estimate** of **-358.586** at step 0 and a **test log p(x) estimate** of **-359.817** at step 9999, about 35 higher than the mean-field model. Higher (less negative) values are better, so in these runs the **IAF model** fits the held-out data better, consistent with its more expressive posterior.

#### 4. **Speed**:
   - The **Mean-Field** model reports a **training speed** of **4.92e+07 examples per second** (about 49.2 million) and a total time of **11.61 minutes**, which can be attributed to its simpler architecture. This makes it the better choice when training time is constrained.
   - On the other hand, the **IAF model** reports **1.67e+07 examples per second** (about 16.7 million) and a total time of **30.10 minutes**, reflecting the additional computation required for the autoregressive transformations. That is roughly 2.9 times lower throughput and 2.6 times longer wall-clock time.
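
The gaps between the two runs can be recomputed directly from the logged values. The numbers below are copied from the two training logs in this notebook; the dictionary layout and variable names are just an illustrative way to organize them.

```python
# Final test metrics and timings copied from the two training logs above.
runs = {
    "mean-field": {"test_elbo": -543.788, "test_log_px": -395.208,
                   "examples_per_s": 4.92e7, "minutes": 11.61},
    "flow": {"test_elbo": -404.351, "test_log_px": -359.817,
             "examples_per_s": 1.67e7, "minutes": 30.10},
}

mf, iaf = runs["mean-field"], runs["flow"]
elbo_gap = iaf["test_elbo"] - mf["test_elbo"]              # positive: IAF is better
log_px_gap = iaf["test_log_px"] - mf["test_log_px"]        # positive: IAF is better
speed_ratio = mf["examples_per_s"] / iaf["examples_per_s"] # mean-field throughput advantage
time_ratio = iaf["minutes"] / mf["minutes"]                # IAF wall-clock overhead

print(f"Test ELBO gap (IAF - mean-field):     {elbo_gap:.3f}")
print(f"Test log p(x) gap (IAF - mean-field): {log_px_gap:.3f}")
print(f"Throughput ratio (mean-field / IAF):  {speed_ratio:.2f}")
print(f"Total time ratio (IAF / mean-field):  {time_ratio:.2f}")
```

Each run is a single training run, so these differences describe this particular experiment rather than a statistically averaged comparison.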

### **Conclusion**:
In conclusion, the **IAF model** demonstrates a more **expressive posterior distribution** in these runs, as evidenced by its higher test ELBO (-404.351 versus -543.788) and higher test log p(x) estimate (-359.817 versus -395.208). However, the **Mean-Field model** is more **computationally efficient** and provides **faster training** (11.61 versus 30.10 minutes), making it more suitable for applications where speed is a priority. The **IAF model** may be preferred when accuracy and a more expressive posterior are crucial, but it requires more computational resources and time for optimization. Both models have their strengths, and the choice between them depends on the specific task requirements and the trade-off between accuracy and efficiency.
