# Generative Vision Modeling - Variational Auto Encoders
Oğuzhan Ercan - x.com/oguzhannercan

In this chapter, we will study variational auto encoders. We investigate them because they give us a way to go from an image to a latent space. The "Latent Space" will be discussed in more detail in the DDPM and DDIM sections. In this notebook, we will start by building an Auto Encoder; after that, we will build a Variational Auto Encoder and introduce vector quantization to these models.

### What is Auto Encoder

_A standard Autoencoder (AE) is a neural network architecture composed of two main parts: an encoder and a decoder. The encoder takes an input and compresses it into a lower-dimensional representation called the latent space or code. This compressed representation is then fed into the decoder, which attempts to reconstruct the original input from it. The primary goal of an AE is to learn efficient and useful representations of the input data by minimizing the reconstruction error between the input and the output. - Gemini 2.0_ 

![Auto Encoder Architecture](media/ae.png)

_Figure 1: Auto Encoder Architecture_

### What is Variational Auto Encoder

_A Variational Autoencoder (VAE) is a generative model that learns a probabilistic distribution over the latent space of the input data. Unlike standard autoencoders that learn a fixed encoding, VAEs encode the input into parameters of a probability distribution, typically a Gaussian. This allows for generating new data points by sampling from this learned latent distribution and decoding it back to the input space. VAEs are particularly useful for tasks like generating realistic images and other complex data. - Gemini 2.0_

![Variational Auto Encoder Architecture](media/vae-gaussian.png)

_Figure 2: Variational Auto Encoder Architecture_

As seen in Figure 2, a VAE differs from an AE in how the latent z is obtained: instead of predicting z directly, the encoder predicts the mean and variance of a Gaussian distribution for each input and then samples z from it, while the KL term pulls that distribution toward a standard Gaussian prior. Below, we will show the differences between AE and VAE, then we will implement them.


| **Aspect**            | **Autoencoder (AE)**                                   | **Variational Autoencoder (VAE)**                       |
|-----------------------|-------------------------------------------------------|--------------------------------------------------------|
| **Objective**         | Minimize reconstruction error $$  \lVert x - \hat{x} \rVert^2  $$ | Minimize reconstruction error + KL divergence $$  D_{\text{KL}}(q(z \mid x) \parallel p(z))  $$ |
| **Latent Space**      | Deterministic: $$  z = f(x)  $$                         | Probabilistic: $$  z \sim \mathcal{N}(\mu, \sigma^2)  $$ |
| **Loss Function**     | $$  \mathcal{L} = \lVert x - \hat{x} \rVert^2  $$       | $$  \mathcal{L} = \lVert x - \hat{x} \rVert^2 - \frac{1}{2} \sum (1 + \log \sigma^2 - \mu^2 - e^{\log \sigma^2})  $$ |
| **Accuracy**          | High reconstruction fidelity                         | Moderate fidelity due to regularization               |
| **Use Cases**         | Data compression, denoising, feature extraction      | Generative modeling, data synthesis, anomaly detection |
| **Generative Ability**| No                                                   | Yes, via sampling $$  z \sim p(z)  $$                    |
| **Complexity**        | Simpler optimization                                 | Increased complexity with KL term                     |
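
The VAE column of the table can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the implementation used later in this notebook: the names `reparameterize` and `vae_loss` are hypothetical, the encoder is assumed to output `mu` and `logvar` (the log of the variance), and the reconstruction term is a summed squared error to match the loss in the table.

```python
import torch

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, I), so that gradients
    # can flow through mu and logvar (the reparameterization trick).
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term: squared L2 distance, summed over all elements.
    recon = torch.sum((x - x_hat) ** 2)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - torch.exp(logvar))
    return recon + kl

# Hypothetical shapes: a batch of 2 images and a 16-dimensional latent.
x = torch.randn(2, 3, 64, 64)
mu = torch.zeros(2, 16)
logvar = torch.zeros(2, 16)

z = reparameterize(mu, logvar)
# Perfect reconstruction and a posterior equal to the prior: both terms vanish.
loss = vae_loss(x, x, mu, logvar)
print(z.shape, loss.item())
```

When `mu` is zero and `logvar` is zero the predicted distribution already equals the standard Gaussian prior, so the KL term contributes nothing; any deviation from the prior makes it positive.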

### Architectural Design Choices

In the following sections, we will briefly describe the encoder - decoder architectures.

For simplicity, we will first build an Encoder - Decoder architecture with basic convolution layers; after that, we will build stronger architectures that can capture better features and details.

As seen in Figure 1, an auto encoder takes a data sample as input; in our case this will be an image. More specifically, we will be working on the ImageNet dataset. The encoder applies a series of transformations, after which the number of channels increases while the height and width decrease, which is typical of a stack of strided conv2d layers with appropriate hyperparameters. So we need to decide on the shape of the encoder's output, which we call the latent vector. A typical latent vector for a basic autoencoder is Batch_Size x 512 x 16 x 16; here we will take an image with shape 3 x 64 x 64 (channels x height x width) and convert it to a latent of shape 512 x 4 x 4, that is, 1 x 512 x 4 x 4 with a batch size of 1.
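
The shrinking spatial size follows from the standard output-size formula for a convolution with dilation 1: `out = floor((in + 2 * padding - kernel) / stride) + 1`. The helper below is a small illustrative sketch (`conv2d_out` is a hypothetical name, not a PyTorch function); it assumes the kernel size 4, stride 2, and padding 1 used by the encoder in this notebook.

```python
def conv2d_out(size, kernel, stride, padding):
    # Output length along one spatial axis for a convolution with dilation 1.
    return (size + 2 * padding - kernel) // stride + 1

size = 64
sizes = [size]
for _ in range(4):  # four stride-2 convolutions
    size = conv2d_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [64, 32, 16, 8, 4]
```

Each layer halves the height and width, so four such layers take a 64 x 64 image down to 4 x 4.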




In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self,):
        super(Encoder, self).__init__()
        
        # bx3x64x64 -> bx64x32x32 
        self.conv1 = nn.Conv2d(3, 64, 4, 2, 1)
        self.bn1 = nn.BatchNorm2d(64)
        # bx64x32x32 -> bx128x16x16
        self.conv2 = nn.Conv2d(64, 128, 4, 2, 1)
        self.bn2 = nn.BatchNorm2d(128)
        # bx128x16x16 -> bx256x8x8
        self.conv3 = nn.Conv2d(128, 256, 4, 2, 1)
        self.bn3 = nn.BatchNorm2d(256)
        # bx256x8x8 -> bx512x4x4
        self.conv4 = nn.Conv2d(256, 512, 4, 2, 1)
        self.bn4 = nn.BatchNorm2d(512)
        

    def forward(self, x):
        x = F.leaky_relu(self.bn1(self.conv1(x)), 0.2)
        x = F.leaky_relu(self.bn2(self.conv2(x)), 0.2)
        x = F.leaky_relu(self.bn3(self.conv3(x)), 0.2)
        x = F.leaky_relu(self.bn4(self.conv4(x)), 0.2)
        return x

In [2]:
tensor = torch.randn(1, 3, 64, 64)
encoder = Encoder()
print(encoder(tensor).shape)

torch.Size([1, 512, 4, 4])


In [3]:
class Decoder(nn.Module):
    def __init__(self,):
        super(Decoder, self).__init__()
        
        # bx512x4x4 -> bx256x8x8
        self.conv1 = nn.ConvTranspose2d(512, 256, 4, 2, 1)
        self.bn1 = nn.BatchNorm2d(256)
        # bx256x8x8 -> bx128x16x16
        self.conv2 = nn.ConvTranspose2d(256, 128, 4, 2, 1)
        self.bn2 = nn.BatchNorm2d(128)
        # bx128x16x16 -> bx64x32x32
        self.conv3 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.bn3 = nn.BatchNorm2d(64)
        # bx64x32x32 -> bx3x64x64
        self.conv4 = nn.ConvTranspose2d(64, 3, 4, 2, 1)
        

    def forward(self, x):
        x = F.leaky_relu(self.bn1(self.conv1(x)), 0.2)
        x = F.leaky_relu(self.bn2(self.conv2(x)), 0.2)
        x = F.leaky_relu(self.bn3(self.conv3(x)), 0.2)
        x = torch.tanh(self.conv4(x))
        return x

In [4]:
tensor = torch.randn(1, 512, 4, 4)
decoder = Decoder()
print(decoder(tensor).shape)

torch.Size([1, 3, 64, 64])
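
The decoder shapes printed above follow from the output-size formula for a transposed convolution with dilation 1 and no output padding: `out = (in - 1) * stride - 2 * padding + kernel`. The helper below is an illustrative sketch (`conv_transpose2d_out` is a hypothetical name, not a PyTorch function), using the kernel size 4, stride 2, and padding 1 from the `Decoder` class.

```python
def conv_transpose2d_out(size, kernel, stride, padding):
    # Output length along one spatial axis for a transposed convolution
    # with dilation 1 and output_padding 0.
    return (size - 1) * stride - 2 * padding + kernel

size = 4
sizes = [size]
for _ in range(4):  # four stride-2 transposed convolutions
    size = conv_transpose2d_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

With these hyperparameters each transposed convolution exactly doubles the height and width, mirroring the halving done by the encoder.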


In [5]:
class AE(nn.Module):
    def __init__(self,):
        super(AE, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        
    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

In [7]:
tensor = torch.randn(1, 3, 64, 64)
ae = AE()
print(ae(tensor).shape)

torch.Size([1, 3, 64, 64])
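
To get a feel for how strong this bottleneck is, one can count the number of values per sample on each side of the encoder. The snippet below is only a back-of-the-envelope sketch using the shapes printed above; it counts raw values and says nothing about how much information the latent actually retains.

```python
import math

image_shape = (3, 64, 64)    # channels x height x width
latent_shape = (512, 4, 4)   # encoder output per sample

n_image = math.prod(image_shape)
n_latent = math.prod(latent_shape)

print(n_image, n_latent, n_image / n_latent)  # 12288 8192 1.5
```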


Above, we built an auto encoder that first encodes an image with shape 3x64x64 into a latent vector with shape 512x4x4, then decodes it back to image space. Now we will train it with a few samples.

In [None]:
from data.dataloader import get_dataloader
from torch.optim import Adam
from torch.nn import L1Loss
