"there are many available options for these design choices, leading to various trade-offs in terms of the learned representation"

- Learning the `log variance` instead of the variance of the latent distribution helps training stability (the network output is unconstrained, whereas a variance must stay positive):
    https://stats.stackexchange.com/questions/353220/why-in-variational-auto-encoder-gaussian-variational-family-we-model-log-sig
- Extend the standard VAE to a beta-VAE. Maybe something interesting will come up.
- A VAE learns representations in a small latent space. Why not try squeezing a ResNet structure inside it?
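
As a small illustration of the first point, here is a minimal sketch, assuming a plain linear layer as the variance head (the names `logvar_head`, `h` and `std` are made up for this example). The layer may output any real number, and exponentiating it always gives a strictly positive standard deviation, so no clamping is needed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical head that predicts log(sigma^2) from some features h
logvar_head = nn.Linear(8, 2)
h = torch.randn(4, 8)

log_var = logvar_head(h)          # unconstrained: negative values are fine
std = torch.exp(0.5 * log_var)    # sigma = exp(log(sigma^2) / 2), always > 0

print("min log_var:", log_var.min().item())
print("min std:    ", std.min().item())
```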

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


In [11]:
# check if the GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# VAE model

interestingly:

In the forward pass, there is a step of sampling $\varepsilon$ from a standard normal
distribution. In my opinion this turns a stochastic process into a deterministic
one: from all possible generated images $\hat{x}$ given an input image $x$,
we sample only one, using (more or less) normally distributed latent
variables. Backprop cannot pass through a sampling step directly, which is why we
need the reparametrization trick: the randomness is moved into $\varepsilon$, and $z$
becomes a deterministic function of the mean and the log variance.
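
To make the reparametrization trick concrete, here is a minimal sketch with made-up tensors (the 3-dimensional `mean` and `log_var` are assumptions for this example, not outputs of the model in this notebook). Once $\varepsilon$ is drawn, $z$ is an ordinary differentiable function of the mean and log variance, so gradients reach both.

```python
import torch

torch.manual_seed(0)

# stand-ins for the encoder outputs (assumed values, for illustration only)
mean = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)

epsilon = torch.randn(3)            # the only random part, needs no gradient
std = torch.exp(0.5 * log_var)
z = mean + std * epsilon            # deterministic in mean and log_var given epsilon

z.sum().backward()
print(mean.grad)      # dz/dmean = 1 for every component
print(log_var.grad)   # dz/dlog_var = 0.5 * std * epsilon
```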

By analogy, in reinforcement learning with an MDP model, each time we sample one
sequence (a so-called episode) until termination. Out of all possible outcomes,
we see only one outcome per episode, and we train the model based on this
outcome.

This makes me wonder: in a VAE, if we **reuse** an image we have already trained on,
could we extract more information from it simply by resampling?
The image resampled from the latent space
is most likely different from the previous one, though similar,
so the cross-entropy value should be different as well.
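
The resampling idea can be sketched as follows. This is only an illustration under assumptions: `decoder` is a single linear layer standing in for a real decoder, `x` is a fake image with pixel values in [0, 1], and `mean` / `log_var` are pretend encoder outputs. Two draws of $\varepsilon$ for the same input give two different latent vectors and therefore two different cross-entropy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

decoder = nn.Linear(2, 784)      # stand-in decoder, not the model defined in this notebook
x = torch.rand(1, 784)           # fake "image" with pixel values in [0, 1]
mean = torch.zeros(1, 2)         # pretend encoder output for x
log_var = torch.zeros(1, 2)

def sample_loss():
    # draw a fresh epsilon, decode, and score the reconstruction
    epsilon = torch.randn_like(mean)
    z = mean + torch.exp(0.5 * log_var) * epsilon
    logits = decoder(z)
    return F.binary_cross_entropy_with_logits(logits, x, reduction="sum")

loss_a = sample_loss()
loss_b = sample_loss()
print(loss_a.item(), loss_b.item())   # same x, two different losses
```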

Furthermore, in a classification problem, when we choose the most probable class
from the softmaxed linear output layer as the label, we are also "sampling",
i.e. turning something stochastic into something deterministic.
However, in the learning phase we keep the stochasticity in the cross entropy.

In [None]:
class VariationalAutoEncoder(nn.Module):
    """ implementation of the Variational AutoEncoder

    Args:
        - device: the device to run the model on
        - input_size: the size of the input data
        - hidden_size: the size of the hidden layer
        - z_space_size: the size of the latent space
    """

    def __init__(self, device,
                 input_size: int, z_space_size: int,
                 hidden_size: int = 256,
                 ) -> None:
        super(VariationalAutoEncoder, self).__init__()

        self.device = device

        # encoder: from x to two channels: mean and log_var
        self.encoder = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_size, z_space_size),
            nn.LeakyReLU(0.2)
            )

        # mean and log-variance layers from the encoder features to the latent space
        # (latent size follows z_space_size instead of a hardcoded 2)
        self.mean_layer = nn.Linear(z_space_size, z_space_size)
        self.logvar_layer = nn.Linear(z_space_size, z_space_size)

        # decoder: from z to x^hat
        self.decoder = nn.Sequential(
            nn.Linear(z_space_size, z_space_size),
            nn.LeakyReLU(0.2),
            nn.Linear(z_space_size, hidden_size),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_size, input_size),
            # nn.Sigmoid()
            )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    # reparameterization trick, the randomness is pushed outside the network
    # z = mu + sigma * epsilon, where epsilon ~ N(0, 1)
    def reparameterize(self, mean, log_var):
        # like mean, epsilon is a vector of the size of the latent space;
        # randn_like already creates it on the same device as mean
        epsilon = torch.randn_like(mean)
        # standard deviation (not variance): sigma = exp(0.5 * log(sigma^2))
        std = torch.exp(0.5 * log_var)
        # the covariance is diagonal, so multiplying epsilon by the diagonal
        # matrix of standard deviations reduces to an element-wise product
        z = mean + std * epsilon
        return z

    def forward(self, x):
        # put everything together
        x = self.encode(x)
        mean = self.mean_layer(x)
        log_var = self.logvar_layer(x)
        z = self.reparameterize(mean, log_var)
        x_hat = self.decode(z)
        return x_hat, mean, log_var

In [47]:
# test the dimensionality of the model
model = VariationalAutoEncoder(device, 784, 2)
model.to(device)
test_tensor = torch.randn(1, 784).to(device)
# a single forward pass returns the reconstruction, the mean and the log variance
x_hat, mean, log_var = model(test_tensor)
print("input space size: ", test_tensor.shape)
print("latent space size: ", mean.shape)
print("output space size: ", x_hat.shape)

input space size:  torch.Size([1, 784])
latent space size:  torch.Size([1, 2])
output space size:  torch.Size([1, 784])


# Loss function

reconstruction: the ability to copy the original image

$-log($

In [None]:
# reconstruction loss + beta * KL divergence (regularization)
def loss_function():
    # reconstruction loss =
    raise NotImplementedError
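
One possible way to write "reconstruction loss + beta * KL divergence" is sketched here. It is a sketch under stated assumptions, not the loss this notebook ends up using: the helper name `beta_vae_loss` is hypothetical, the inputs are assumed to lie in [0, 1], and the decoder is assumed to return logits (its `nn.Sigmoid()` is commented out), so binary cross-entropy with logits serves as the reconstruction term. The KL term is the closed form for a diagonal Gaussian against a standard normal prior.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat_logits, mean, log_var, beta=1.0):
    """Hypothetical helper: reconstruction term + beta * KL term, summed over the batch."""
    # assumption: x is in [0, 1] and x_hat_logits are raw decoder outputs (no Sigmoid)
    reconstruction = F.binary_cross_entropy_with_logits(
        x_hat_logits, x, reduction="sum")
    # closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )
    kl = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return reconstruction + beta * kl

# toy inputs: with mean = 0 and log_var = 0 the KL term vanishes
x = torch.rand(4, 784)
x_hat_logits = torch.zeros(4, 784)
mean = torch.zeros(4, 2)
log_var = torch.zeros(4, 2)

loss = beta_vae_loss(x, x_hat_logits, mean, log_var, beta=4.0)
print(loss.item())
```

With `beta=1.0` this is the standard VAE objective, and larger values of `beta` weight the regularization term more heavily.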
