<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2022/blob/master/Seminars/lab_08_02_VAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Autoencoders

# Theory

## Variational Autoencoder

### Problem setting

A set of independent and identically distributed samples from the true data distribution is given: $x_i \sim p_{true}(x)$, $i = 1, \dots, N$.

The problem is to build a probabilistic model $p_\theta(x)$ of the true data distribution $p_{true}(x)$.

The model $p_\theta(x)$ must be able to evaluate the probability density function (p. d. f.) at a given $x$ and to sample $x \sim p_\theta(x)$.

### Probabilistic model
$z \in \mathbb{R}^d$ is a latent variable.

The generative process of VAE:
1. Sample $z \sim p(z)$.
2. Sample $x \sim p_\theta(x | z)$.

The parameters of distribution $p_\theta(x | z)$ are obtained using a neural network with weights $\theta$ and $z$ as an input.
This network is called the generator or decoder.

The generative process above induces the following model p. d. f. for $x$:

$$p_\theta(x) = \mathbb{E}_{z \sim p(z)} p_\theta(x | z)$$
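
The two-step generative process can be sketched as ancestral sampling. This is a minimal illustration, not the model trained later in the notebook: `toy_decoder`, the dimensions `d` and `D`, the standard normal prior and the Bernoulli output distribution are all assumptions made for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

d, D = 2, 6  # toy latent and data dimensions (assumed for illustration)

# Hypothetical decoder: maps z to the Bernoulli parameter of each component of x
toy_decoder = nn.Sequential(
    nn.Linear(d, 16), nn.Tanh(), nn.Linear(16, D), nn.Sigmoid()
)

def ancestral_sample(decoder, num_samples, latent_dim):
    """Draw x by first sampling z ~ N(0, I) and then x ~ p(x | z)."""
    with torch.no_grad():
        z = torch.randn(num_samples, latent_dim)  # step 1: z ~ p(z)
        probs = decoder(z)                        # parameters of p(x | z)
        x = torch.bernoulli(probs)                # step 2: x ~ p(x | z)
    return z, x

z, x = ancestral_sample(toy_decoder, num_samples=5, latent_dim=d)
print(z.shape, x.shape)
```

Averaging $p_\theta(x | z)$ over many such draws of $z$ is exactly the expectation in the formula above.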

### Model parameterization

The prior distribution over the latent variables is the standard normal distribution: $p(z) = \mathcal{N}(z | 0, I)$.

The distributions on the components of $x$ are conditionally independent given $z$: $p_\theta(x | z) = \prod\limits_{i = 1}^D p_\theta(x_i | z)$.

If the $i$-th component is real-valued, we can use a Gaussian generative distribution: $p_\theta(x_i | z) = \mathcal{N}(x_i | \mu_i(z, \theta), \sigma^2_i(z, \theta))$.
Here $\mu(z, \theta)$ and $\sigma(z, \theta)$ are deterministic functions defined by neural networks with parameters $\theta$.

If the $i$-th component is categorical, we can use a categorical generative distribution: $p_\theta(x_i | z) = Cat(Softmax(\omega_i(z, \theta)))$, where $\omega_i(z, \theta)$ is also a deterministic function described by a neural network.

Binary components are a special case of categorical ones. For them the categorical distribution reduces to a Bernoulli distribution with a single parameter.

_Tip:_ some pixels are black in the whole MNIST train set, so likelihood maximization pushes the predicted probability of these pixels being black towards 1.
As a result, the weights for these pixels grow without bound.
To avoid divergence of the training procedure, we may add clipping to the generative network, e. g. a layer that clips its input into the range $[-10, 10]$ before the final activation.
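
A small numerical sketch of why the clipping helps. The logit value `50.0` and the helper `bernoulli_log_prob` are made up for illustration; the point is only that a saturated sigmoid gives an infinite log-likelihood penalty, while a clipped one stays finite.

```python
import torch

logit = torch.tensor([50.0])   # a pixel the network is extremely confident about
target = torch.tensor([0.0])   # ...while the observed value disagrees

def bernoulli_log_prob(x, p):
    """Log-probability of binary x under Bernoulli(p)."""
    return x * torch.log(p) + (1 - x) * torch.log(1 - p)

# Without clipping the sigmoid rounds to exactly 1.0 in float32
p_raw = torch.sigmoid(logit)
# With clipping to [-10, 10] the probability stays strictly inside (0, 1)
p_clipped = torch.sigmoid(torch.clamp(logit, -10.0, 10.0))

ll_raw = bernoulli_log_prob(target, p_raw)          # log(0): minus infinity
ll_clipped = bernoulli_log_prob(target, p_clipped)  # finite, close to -10
print(ll_raw.item(), ll_clipped.item())
```

With the clipping in place, the per-pixel log-likelihood is bounded below by roughly $-10$, so a single confident mistake cannot blow up the loss.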

### Variational lower bound

To fit the model to data we maximize the marginal log-likelihood $\log p_\theta(x)$ of the train set.

However, $\log p_\theta(x)$ cannot be optimized directly: the logarithm contains an integral over a high-dimensional space, which can neither be computed analytically nor estimated numerically with sufficient accuracy in a reasonable amount of time.

So to perform optimization we maximize the _variational lower bound_ (VLB) on the log-likelihood instead:
$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z | x)} \log p_\theta(x) = 
\mathbb{E}_{z \sim q_\phi(z | x)} \log \frac{p_\theta(x, z) q_\phi(z | x)}{q_\phi(z | x) p_\theta(z | x)} = 
\mathbb{E}_{z \sim q_\phi(z | x)} \log \frac{p_\theta(x, z)}{q_\phi(z | x)} + KL(q_\phi(z | x) || p_\theta(z | x))$$
$$\log p_\theta(x) \geqslant \mathbb{E}_{z \sim q_\phi(z | x)} \log \frac{p_\theta(x | z)p(z)}{q_\phi(z | x)} = 
\mathbb{E}_{z \sim q_\phi(z | x)} \log p_\theta(x | z) - KL(q_\phi(z | x) || p(z)) = L(x; \phi, \theta)
\to \max\limits_{\phi, \theta}$$

$q_\phi(z | x)$ is called a proposal, recognition or variational distribution. It is usually defined as a Gaussian whose parameters are produced by a neural network with weights $\phi$ that takes $x$ as an input:
$q_\phi(z | x) = \mathcal{N}(z | \mu_\phi(x), \sigma^2_\phi(x)I)$.
Usually the neural network outputs $\log\sigma_\phi(x)$ or $\log(\exp(\sigma_\phi(x)) - 1)$ (the inverse softplus) instead of $\sigma_\phi(x)$ itself, so $\sigma_\phi(x)$ is always positive by design and is also less sensitive to scale.
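
The two parameterizations of $\sigma_\phi(x)$ can be compared on a few made-up raw network outputs. This is only an illustration; the values in `raw` are arbitrary.

```python
import torch
import torch.nn.functional as F

raw = torch.tensor([-3.0, 0.0, 2.5])  # unconstrained network outputs (made-up values)

# Option 1: the network output is interpreted as log(sigma)
sigma_from_log = torch.exp(raw)

# Option 2: the network output is interpreted as log(exp(sigma) - 1),
# so sigma is obtained by applying softplus
sigma_from_softplus = F.softplus(raw)

# Applying the inverse softplus recovers the raw output
raw_back = torch.log(torch.expm1(sigma_from_softplus))

print(sigma_from_log, sigma_from_softplus, raw_back)
```

Both options map any real number to a positive $\sigma$; softplus grows linearly for large inputs, while the exponential grows much faster.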

### Discussion of VLB

One can show that the gap between the VLB $L(x; \phi, \theta)$ and the log-likelihood $\log p_\theta(x)$ itself is the KL-divergence between the proposal and the posterior distributions over $z$: $KL(q_\phi(z | x) || p_\theta(z | x))$.
The maximum of $L(x; \phi, \theta)$ for fixed $\theta$ is achieved when $q_\phi(z | x) = p_\theta(z | x)$.
However, $p_\theta(z | x)$ is intractable, so instead of computing it numerically, the VLB is optimized w. r. t. $\phi$ using backpropagation and the reparameterization trick (see below).
The closer $q_\phi(z | x)$ is to $p_\theta(z | x)$, the tighter the VLB.
The true posterior distribution $p_\theta(z | x)$ often cannot be described by a single Gaussian, so in general the gap between the VLB and the log-likelihood does not reach zero.
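
The relation between the gap and the KL-divergence can be checked in a toy model where everything is available in closed form. The model below ($z \sim \mathcal{N}(0, 1)$, $x | z \sim \mathcal{N}(z, 1)$) is an assumption chosen for the example, not the VAE from this notebook: its evidence is $\mathcal{N}(x | 0, 2)$ and its posterior is $\mathcal{N}(z | x/2, 1/2)$.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model: marginally x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

def kl_gauss(m_q, s_q, m_p, s_p):
    """KL(N(m_q, s_q^2) || N(m_p, s_p^2)) for scalar Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5

def vlb(x, m, s):
    """VLB for the proposal q(z | x) = N(m, s^2), using closed-form expectations."""
    reconstruction = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    return reconstruction - kl_gauss(m, s, 0.0, 1.0)

x = 1.3
posterior = (x / 2, math.sqrt(0.5))  # exact posterior of the toy model
proposals = [(0.0, 1.0), (0.5, 0.9), posterior]

rows = []
for m, s in proposals:
    gap = log_evidence(x) - vlb(x, m, s)
    kl_post = kl_gauss(m, s, *posterior)
    rows.append((gap, kl_post))
    print(f"m={m:.2f} s={s:.2f} VLB={vlb(x, m, s):.4f} gap={gap:.4f} KL={kl_post:.4f}")
```

For every proposal the printed gap coincides with $KL(q || p_\theta(z | x))$, and it vanishes only when the proposal equals the exact posterior.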

The first term of the VLB, $\mathbb{E}_{z \sim q_\phi(z | x)} \log p_\theta(x | z)$, is called the reconstruction loss.
The model described by this term is an autoencoder with one stochastic layer which tries to restore the input object $x$.
If $q_\phi(z | x)$ is a delta function, then the autoencoder with a stochastic layer turns into an ordinary autoencoder.
That is why $q_\phi(z | x)$ and $p_\theta(x | z)$ are called the encoder and the decoder respectively.

The term $KL(q_\phi(z | x) || p(z))$ is called the regularizer.
It pulls $q_\phi(z | x)$ towards the prior $p(z)$, so samples $z \sim q_\phi(z | x)$ stay close to $0$.
But, as described above, as a part of the VLB it also forces $q_\phi(z | x)$ to be close to $p_\theta(z | x)$, which is even more important.
One can put a coefficient in front of the KL-divergence or even use a different regularizer.
Naturally, after such a change the optimized objective usually loses its direct connection to the log-likelihood of the initial probabilistic model.
This decreases the interpretability of the model and voids the theoretical guarantees.

KL-divergence between two Gaussians can be computed analytically, which improves the speed and stability of optimization procedure.
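
As a sanity check of the analytic formula, the sketch below compares a hand-written closed-form KL for diagonal Gaussians with `torch.distributions.kl_divergence` and with a Monte-Carlo estimate. The function name `kl_diag_gaussians` and the random test parameters are made up for the example.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

def kl_diag_gaussians(q_mu, q_sigma, p_mu, p_sigma):
    """Closed-form KL(q || p) for n pairs of diagonal Gaussians, shape n."""
    per_dim = (torch.log(p_sigma) - torch.log(q_sigma)
               + (q_sigma ** 2 + (q_mu - p_mu) ** 2) / (2 * p_sigma ** 2) - 0.5)
    return per_dim.sum(dim=1)

q_mu = torch.randn(4, 3)
q_sigma = torch.rand(4, 3) + 0.1   # keep standard deviations positive
p_mu = torch.zeros(4, 3)
p_sigma = torch.ones(4, 3)

closed_form = kl_diag_gaussians(q_mu, q_sigma, p_mu, p_sigma)
reference = kl_divergence(Normal(q_mu, q_sigma), Normal(p_mu, p_sigma)).sum(dim=1)

# Monte-Carlo estimate of E_q[log q(z) - log p(z)] for comparison
z = Normal(q_mu, q_sigma).sample((50000,))               # 50000 x 4 x 3
log_ratio = Normal(q_mu, q_sigma).log_prob(z) - Normal(p_mu, p_sigma).log_prob(z)
monte_carlo = log_ratio.sum(dim=-1).mean(dim=0)          # shape 4

print(closed_form, reference, monte_carlo)
```

The closed form matches the library value up to floating-point error, while the Monte-Carlo estimate is noisy even with tens of thousands of samples, which is the practical reason to prefer the analytic expression.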

### Reparameterization trick
We use stochastic gradient ascent in order to maximize $L(x; \phi, \theta)$.

The gradient of the reconstruction loss w. r. t. $\theta$ is computed using backpropagation.
$$\frac{\partial}{\partial \theta} L(x; \phi, \theta) = \mathbb{E}_{z \sim q_\phi(z | x)} \frac{\partial}{\partial \theta} \log p_\theta(x | z)$$

The gradient w. r. t. $\phi$ can be computed using the reparameterization trick for the reconstruction loss, while the KL term is differentiated directly:
$$\varepsilon \sim \mathcal{N}(\varepsilon | 0, I)$$
$$z = \mu + \sigma \varepsilon \Rightarrow z \sim \mathcal{N}(z | \mu, \sigma^2I)$$
$$\frac{\partial}{\partial \phi} L(x; \phi, \theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(\varepsilon | 0, I)} \frac{\partial}{\partial \phi} \log p_\theta(x | \mu_\phi(x) + \sigma_\phi(x) \varepsilon) - \frac{\partial}{\partial \phi} KL(q_\phi(z | x) || p(z))$$
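
A minimal sketch of the trick with automatic differentiation. The toy objective $f(z) = z^2$ stands in for $\log p_\theta(x | z)$ and is an assumption made so that the exact answer is known: $\mathbb{E} f(z) = \mu^2 + \sigma^2$, hence the gradients are $2\mu$ and $2\sigma$.

```python
import torch

torch.manual_seed(0)

mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(0.5, requires_grad=True)

eps = torch.randn(200_000)   # noise that does not depend on the parameters
z = mu + sigma * eps         # z ~ N(mu, sigma^2), differentiable in mu and sigma

# Monte-Carlo estimate of E[f(z)] with the toy objective f(z) = z^2
estimate = (z ** 2).mean()
estimate.backward()

# Analytic gradients for comparison: 2 * mu = 2.0 and 2 * sigma = 1.0
print(mu.grad.item(), sigma.grad.item())
```

Sampling `z` directly from a distribution object without the explicit `mu + sigma * eps` form would cut the computation graph, and no gradient would reach `mu` and `sigma`.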

### Log-likelihood estimation

Model log-likelihood $\log p_\theta(x) = \log \mathbb{E}_{z \sim p(z)} p_\theta(x | z)$ is estimated using the hold-out validation set.

The likelihood can be estimated using the Monte-Carlo method:

$$z_i \sim p(z), i = 1, \dots, K$$
$$p_\theta(x) \approx \frac{1}{K} \sum\limits_{i = 1}^K p_\theta(x | z_i)$$

The estimate above is unbiased, but it is of little use to us directly, because we need the log-likelihood rather than the likelihood itself.

For log-likelihood estimation the averaging is also performed inside the logarithm:
$$\log p_\theta(x) \approx \log \frac{1}{K} \sum\limits_{i = 1}^K p_\theta(x | z_i),\,\,\,\,z_i \sim p(z)$$

Note that this estimate is now biased: by Jensen's inequality, its expectation does not exceed $\log p_\theta(x)$.
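
In practice the values $\log p_\theta(x | z_i)$ are large negative numbers, so the average inside the logarithm has to be computed in a numerically stable way. Below is a small sketch using only the standard library; the three log-probabilities are made-up values.

```python
import math

def log_mean_exp(log_values):
    """Stable log(mean(exp(v))) for a list of log-probabilities."""
    m = max(log_values)
    shifted_mean = sum(math.exp(v - m) for v in log_values) / len(log_values)
    return m + math.log(shifted_mean)

# Made-up values of log p(x | z_i) for K = 3 latent samples
log_p = [-1000.0, -1001.0, -1002.0]

# Naive averaging of probabilities underflows to zero in double precision,
# so taking its logarithm is impossible
naive_mean = sum(math.exp(v) for v in log_p) / len(log_p)

# Subtracting the maximum first keeps every exponent in a safe range
stable = log_mean_exp(log_p)
print(naive_mean, stable)
```

The stable result lies slightly below the largest log-probability, as expected for the logarithm of an average.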


![img](https://blog.bayeslabs.co/assets/img/vae-gaussian.png)

## Dataset

In [None]:
import numpy as np  # used by the Logger and plotting utilities below
import torchvision
from torchvision.datasets import MNIST
from torch.utils.data import Dataset, DataLoader
import torch
from torch import nn
from torch import optim

In [None]:
class MNISTDataset(Dataset):
    def __init__(self, X, y=None, device='cuda'):
        self.device = device
        self.X, self.y = self.preprocess_data(X, y)
        
    def preprocess_data(self, X, y):
        # as_tensor avoids the copy-construct warning when X is already a tensor
        X_preproc = (torch.as_tensor(X, dtype=torch.float) / 255.
                     ).reshape(-1, 28 * 28).to(self.device)

        if y is None:
            return X_preproc, None

        return X_preproc, torch.as_tensor(y).to(self.device)
        
    def __len__(self):
        return self.X.shape[0]
    
    def __getitem__(self, idx):
        if (self.y is None):
            return self.X[idx]
        
        return self.X[idx], self.y[idx]

In [None]:
BATCH_SIZE = 128

# .data / .targets replace the deprecated train_data / train_labels attributes;
# the variables are not called `train` / `test` to avoid clashing with train() below
mnist_train = MNIST('mnist', download=True, train=True)
train_ds = MNISTDataset(mnist_train.data, mnist_train.targets)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)

mnist_test = MNIST('mnist', download=True, train=False)
test_ds = MNISTDataset(mnist_test.data, mnist_test.targets)
test_dl = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

def show_images(x):
    plt.figure(figsize=(10, 10))
    x = x.view(-1, 1, 28, 28).cpu()
    mtx = torchvision.utils.make_grid(x, nrow=10, pad_value=1)
    plt.imshow(mtx.permute([1, 2, 0]).numpy(), cmap='gray')
    plt.axis('off')

show_images(train_ds[:10][0])

## Utils

In [None]:
from IPython.display import clear_output

class Logger:
    def __init__(self):
        self.train_loss_batch = []
        self.train_loss_epoch = []

        self.test_loss_batch = []
        self.test_loss_epoch = []

        self.test_LLMC_batch = []
        self.test_LLMC_epoch = []

        self.train_batches_per_epoch = 0
        self.test_batches_per_epoch = 0
        self.test_LLMC_batches_per_epoch = 0

        self.epoch_counter = 0

    def fill_train(self, loss):
        self.train_loss_batch.append(loss)
        self.train_batches_per_epoch += 1

    def fill_test(self, loss):
        self.test_loss_batch.append(loss)
        self.test_batches_per_epoch += 1

    def fill_test_LLMC(self, loss):
        self.test_LLMC_batch.append(loss)
        self.test_LLMC_batches_per_epoch += 1

    def finish_epoch(self, make_plot=True):
        self.train_loss_epoch.append(np.mean(
            self.train_loss_batch[-self.train_batches_per_epoch:]
        ))
        self.test_loss_epoch.append(np.mean(
            self.test_loss_batch[-self.test_batches_per_epoch:]
        ))
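        # LLMC is optional: record NaN (without a NumPy warning) if it was not computed
        self.test_LLMC_epoch.append(
            np.mean(self.test_LLMC_batch[-self.test_LLMC_batches_per_epoch:])
            if self.test_LLMC_batches_per_epoch > 0 else float('nan')
        )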
        self.train_batches_per_epoch = 0
        self.test_batches_per_epoch = 0
        self.test_LLMC_batches_per_epoch = 0
    
        if make_plot:
            clear_output()
  
        print("epoch #{} \t train_loss: {:.8} \t test_loss: {:.8} \t LLMC: {:.8}".format(
                  self.epoch_counter,
                  self.train_loss_epoch[-1],
                  self.test_loss_epoch [-1],
                  self.test_LLMC_epoch [-1]
              ))
    
        self.epoch_counter += 1

        if make_plot:
            plt.figure(figsize=(11, 5))

            plt.subplot(1, 2, 1)
            plt.plot(self.train_loss_batch, label='train loss')
            plt.xlabel('# batch iteration')
            plt.ylabel('loss')
            plt.legend()

            plt.subplot(1, 2, 2)
            plt.plot(self.train_loss_epoch, label='average train loss')
            plt.plot(self.test_loss_epoch , label='average test loss' )
            plt.plot(self.test_LLMC_epoch , label='average test LLMC' )
            plt.legend()
            plt.xlabel('# epoch')
            plt.ylabel('loss')
            plt.show();

In [None]:
def train(model, optimizer, dl_train, dl_test, n_epochs, calc_LLMC=False, K=50):
    logger = Logger()
    
    for i_epoch in range(n_epochs):
        model.train()
        for batch_X, _ in dl_train:
            optimizer.zero_grad()
            
            loss = model.calc_loss(batch_X)
            loss.backward()
            optimizer.step()

            logger.fill_train(loss.item())
            
        model.eval()
        with torch.no_grad():
            for batch_X, _ in dl_test:
                loss = model.calc_loss(batch_X)
                if (calc_LLMC):
                    LLMC = compute_log_likelihood_monte_carlo(batch_X, model, bernoulli_log_likelihood, K)
                    logger.fill_test_LLMC(-LLMC)
                logger.fill_test(loss.item())

        logger.finish_epoch()

In [None]:
n = 15
digit_size = 28

from scipy.stats import norm
import numpy as np

grid_x = norm.ppf(np.linspace(0.05, 0.95, n))
grid_y = norm.ppf(np.linspace(0.05, 0.95, n))

def draw_manifold(generator):
    figure = np.zeros((digit_size * n, digit_size * n))
    for i, yi in enumerate(reversed(grid_x)):
        for j, xi in enumerate(grid_y):
            z_sample = np.array([[xi, yi]])

            x_decoded = generator(z_sample)
            digit = x_decoded
            figure[i * digit_size: (i + 1) * digit_size,
                   j * digit_size: (j + 1) * digit_size] = digit
    plt.figure(figsize=(10, 10))
    plt.imshow(figure, cmap='gray')
    plt.axis('off')
    plt.show()

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

def draw_latent_space(model, data, target, use_TSNE=False):
    model.eval()
    with torch.no_grad():
        z = model.encode(data).cpu().numpy()
    if (use_TSNE):
        z = TSNE(2).fit_transform(z)

    plt.figure(figsize=(7, 6))
    plt.scatter(z[:, 0], z[:, 1], c=target.cpu().numpy(), cmap='gist_rainbow', alpha=0.75)
    plt.colorbar()
    plt.show()

## Autoencoder

In [None]:
class AE(nn.Module):
    def __init__(self, d, D):
        """
        Initialize model weights.
        Input: d, int - the dimensionality of the latent space.
        Input: D, int - the dimensionality of the object space.
        """
        super().__init__()
        self.d = d
        self.D = D
        self.encoder = nn.Sequential(
            nn.Linear(self.D, 200),
            nn.LeakyReLU(),
            nn.Linear(200, self.d)
        )
        self.decoder = nn.Sequential(
            nn.Linear(self.d, 200),
            nn.LeakyReLU(),
            nn.Linear(200, self.D),
            nn.Sigmoid()
        )

    def encode(self, x):
        """
        Generate a latent code given the objects.
        Input: x, Tensor of shape n x D.
        Return: Tensor of shape n x d.
        """
        return self.encoder(x)

    def decode(self, z):
        """
        Generate objects given the latent representations.
        Input: z, Tensor of shape n x d - the latent representations.
        Return: Tensor of shape n x D.
        """
        return self.decoder(z)

    def calc_loss(self, batch):
        """
        Compute batch loss. Batch loss is an average of per-object losses.
        Per-object loss is a sum of reconstruction L2-error and
        L2-regularization of the latent representations.

        Input: batch, Tensor of shape n x D.
        Return: Tensor, scalar - loss function for the batch.
        """
        <YOUR_CODE>
        return loss

    def generate_samples(self, num_samples):
        """
        Decode latent codes sampled from the standard normal distribution.
        Input: num_samples, int - number of samples to be generated.
        Return: Tensor of shape num_samples x D.
        """
        return self.decode(torch.randn((num_samples, self.d)).cuda())
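
One possible reading of the loss described in the `calc_loss` docstring, written as a standalone function on toy modules. This is a sketch under assumptions rather than the reference solution: the name `ae_loss_sketch`, the weight `reg_weight` of the latent penalty and the toy encoder and decoder are all hypothetical, since the docstring does not fix them.

```python
import torch
from torch import nn

def ae_loss_sketch(encoder, decoder, batch, reg_weight=1e-3):
    """
    Average over the batch of: squared reconstruction error
    plus an L2 penalty on the latent code (weight is an assumption).
    """
    z = encoder(batch)                                      # n x d latent codes
    reconstruction = decoder(z)                             # n x D reconstructions
    rec_error = ((reconstruction - batch) ** 2).sum(dim=1)  # per-object L2 error
    latent_penalty = (z ** 2).sum(dim=1)                    # per-object L2 regularizer
    return (rec_error + reg_weight * latent_penalty).mean()

torch.manual_seed(0)
toy_encoder = nn.Linear(8, 2)                                # hypothetical tiny encoder
toy_decoder = nn.Sequential(nn.Linear(2, 8), nn.Sigmoid())   # hypothetical tiny decoder
toy_batch = torch.rand(5, 8)

loss = ae_loss_sketch(toy_encoder, toy_decoder, toy_batch)
loss.backward()
print(loss.item())
```

The penalty on `z` keeps the codes near the origin, which matters later when new objects are generated by decoding standard normal noise.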

### Training models

In [None]:
AE_d2 = AE(2, 28*28).cuda()
optimizer = torch.optim.Adam(AE_d2.parameters(), lr=1e-3)

train(AE_d2, optimizer, train_dl, test_dl, n_epochs=25)

In [None]:
AE_d10 = AE(10, 28*28).cuda()
optimizer = torch.optim.Adam(AE_d10.parameters(), lr=1e-3)

train(AE_d10, optimizer, train_dl, test_dl, n_epochs=25)

### Evaluating results

Visual inspection

In [None]:
show_images(AE_d2.generate_samples(20).detach().cpu())

In [None]:
show_images(AE_d10.generate_samples(20).detach().cpu())

In [None]:
def sample_like(model, data):
    with torch.no_grad():
        z = model.encode(data)
        z_samples = torch.randn(64, z.shape[-1]).cuda() * 1.0 + z
        alike_gen = model.decode(z_samples).squeeze().cpu().numpy().reshape(8, 8, 28, 28)

    plt.figure(figsize=(8, 8))
    plt.imshow(np.transpose(alike_gen, (0, 2, 1, 3)).reshape(28*8, 28*8), cmap='gray')
    plt.axis('off')

In [None]:
sample_like(AE_d2, train_ds[0][0].unsqueeze(0))

In [None]:
sample_like(AE_d10, train_ds[0][0].unsqueeze(0))

Latent space visualization (from the decoder's side)

In [None]:
def draw_manifold_ae(model):
    generator = lambda z: model.decode(torch.from_numpy(z).float().cuda()).view(28, 28).data.cpu().numpy()
    return draw_manifold(generator)

In [None]:
draw_manifold_ae(AE_d2)

Latent space visualization (from the encoder's side)

In [None]:
draw_latent_space(AE_d2, test_ds[::10][0], test_ds[::10][1])

In [None]:
draw_latent_space(AE_d10, test_ds[::10][0], test_ds[::10][1], use_TSNE=True)

## Probabilistic utils

In [None]:
def bernoulli_log_likelihood(x_true, x_distr):
    """
    Compute log-likelihood of objects x_true under the component-wise
    Bernoulli distributions generated by the model.
    Each object from x_true has K corresponding distributions from x_distr.
    The log-likelihood must be computed for each pair of an object
    and a distribution corresponding to that object.

    Input: x_true, Tensor of shape n x D.
    Input: x_distr, Tensor of shape n x K x D - parameters of component-wise
           Bernoulli distributions.
    Return: Tensor of shape n x K - log-likelihood for each pair of an object
            and a corresponding distribution.
    """
    x_true_ = x_true.unsqueeze(1)
    return (x_true_*torch.log(x_distr) + (1-x_true_) * torch.log(1 - x_distr)).sum(dim=-1)


def kl(q_mu, q_sigma, p_mu, p_sigma):
    """
    
    Compute KL-divergence KL(q || p) between n pairs of Gaussians
    with diagonal covariance matrices.

    Input: q_mu, p_mu, Tensor of shape n x d - mean vectors for n Gaussians.
    Input: q_sigma, p_sigma, Tensor of shape n x d - standard deviation
           vectors for n Gaussians.
    Return: Tensor of shape n - each component is KL-divergence between
            a corresponding pair of Gaussians.
    """
    log_part = torch.log(p_sigma) - torch.log(q_sigma)
    sqr_sum = (q_mu - p_mu)*(q_mu - p_mu) + q_sigma*q_sigma
    div_part = 2*p_sigma*p_sigma
    KL = log_part + sqr_sum/div_part - 0.5
    return KL.sum(dim=1)

## Variational Autoencoder

In [None]:
class ClampLayer(nn.Module):
    def __init__(self, min=None, max=None):
        super().__init__()
        self.min = min
        self.max = max

    def forward(self, input):
        return torch.clamp(input, self.min, self.max)


class Reshape(torch.nn.Module):
    def __init__(self, *shape):
        super().__init__()
        self.shape = shape

    def forward(self, x):
        return x.reshape(x.shape[0], *self.shape)

In [None]:
class VAE(nn.Module):
    def __init__(self, d, D):
        """
        Initialize model weights.
        Input: d, int - the dimensionality of the latent space.
        Input: D, int - the dimensionality of the object space.
        """
        super().__init__()
        self.d = d
        self.D = D
        self.proposal_network = nn.Sequential(
            nn.Linear(self.D, 400),
            nn.LeakyReLU(),

            nn.Linear(400, 150),
            nn.LeakyReLU(),

            nn.Linear(150, 80),
            nn.LeakyReLU(),
        )
        self.proposal_mu_head = nn.Linear(80, self.d)
        self.proposal_sigma_head = nn.Sequential(
            nn.Linear(80, self.d),
            nn.Softplus()
        )
        self.generative_network = nn.Sequential(
            nn.Linear(self.d, 80),
            nn.LeakyReLU(),

            nn.Linear(80, 150),
            nn.LeakyReLU(),

            nn.Linear(150, 400),
            nn.LeakyReLU(),

            nn.Linear(400, self.D),
            ClampLayer(-10, 10),
            nn.Sigmoid()
        )

    def proposal_distr(self, x):
        """
        Generate proposal distribution over z.
        Note that sigma is positive by design of neural network.
        Input: x, Tensor of shape n x D.
        Return: tuple(Tensor, Tensor),
                Each Tensor is a matrix of shape n x d.
        """
        hidden = self.proposal_network(x)  # shared features, computed once
        mu = self.proposal_mu_head(hidden)
        sigma = self.proposal_sigma_head(hidden)
        return mu, sigma

    def prior_distr(self, x):
        """
        Generate prior distribution over z.

        Input: x, Tensor of shape n x D.
        Return: tuple(Tensor, Tensor),
                Each Tensor is a matrix of shape n x d.
        """
        n = x.size()[0]
        mu = torch.zeros((n, self.d)).cuda()
        sigma = torch.ones((n, self.d)).cuda()
        return mu, sigma

    def sample_latent(self, mu, sigma, K=1):
        """
        Generate samples from Gaussians with diagonal covariance matrices in latent space.
        Samples must be differentiable w. r. t. parameters of distribution!
        Use reparametrization trick.
        Input: mu, Tensor of shape n x d - mean vectors for n Gaussians.
        Input: sigma, Tensor of shape n x d - standard deviation vectors
               for n Gaussians.
        Input: K, int - number of samples from each Gaussian.
        Return: Tensor of shape n x K x d.
        """
        n = mu.size()[0]
        eps = torch.randn((n, K, self.d)).cuda()
        return eps * sigma.unsqueeze(1) + mu.unsqueeze(1)

    def generative_distr(self, z):
        """
        Compute a tensor of parameters of Bernoulli distribution over x
        given a tensor of latent representations.
        Input: z, Tensor of shape n x K x d - tensor of latent representations.
        Return: Tensor of shape n x K x D - parameters of Bernoulli distribution.
        """
        n, K, _ = z.size()
        return self.generative_network(z.view(-1, self.d)).view(n, K, self.D)

    def calc_loss(self, batch):
        """
        Compute the loss for a batch: the negative VLB averaged over the batch's objects
        (negative because the optimizer in train() minimizes the loss).
        VLB must be differentiable w. r. t. model parameters, so use reparametrization!
        Input: batch, Tensor of shape n x D.
        Return: Tensor, scalar - negative VLB.
        """
        <YOUR_CODE>
        return loss

    def generate_samples(self, num_samples):
        """
        Generate samples from the model.
        Tip: for visual quality you may return the parameters of Bernoulli distribution instead
        of samples from it.
        Input: num_samples, int - number of samples to generate.
        Return: Tensor of shape num_samples x D.
        """
        mu, sigma = self.prior_distr(torch.zeros(size=(num_samples, )))
        z = self.sample_latent(mu, sigma)  # num_samples x 1 x d
        # drop the K=1 axis so the result matches the documented shape
        return self.generative_distr(z).squeeze(1)
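
A possible shape of the VLB-based loss, written as a standalone function so that it can be run on toy modules. It is a sketch under assumptions, not the reference solution: the name `negative_vlb_sketch`, the single latent sample per object and the toy heads are hypothetical, and the sign is chosen so that minimizing the returned value maximizes the VLB.

```python
import torch
from torch import nn

def negative_vlb_sketch(batch, mu, sigma, decoder):
    """
    batch: n x D tensor with values in [0, 1].
    mu, sigma: n x d parameters of q(z | x), sigma > 0.
    decoder: callable mapping n x d latents to n x D Bernoulli parameters.
    """
    eps = torch.randn_like(mu)
    z = mu + sigma * eps                     # reparameterized sample, one per object
    probs = decoder(z)
    # Reconstruction term: Bernoulli log-likelihood summed over components
    log_lik = (batch * torch.log(probs)
               + (1 - batch) * torch.log(1 - probs)).sum(dim=1)
    # Analytic KL(N(mu, sigma^2 I) || N(0, I)) summed over latent dimensions
    kl_to_prior = (-torch.log(sigma) + 0.5 * (sigma ** 2 + mu ** 2) - 0.5).sum(dim=1)
    return (kl_to_prior - log_lik).mean()

torch.manual_seed(0)
n, D, d = 4, 8, 2
toy_mu_head = nn.Linear(D, d)                                   # hypothetical
toy_sigma_head = nn.Sequential(nn.Linear(D, d), nn.Softplus())  # hypothetical
toy_decoder = nn.Sequential(nn.Linear(d, D), nn.Sigmoid())      # hypothetical
toy_batch = torch.rand(n, D)

loss = negative_vlb_sketch(toy_batch, toy_mu_head(toy_batch),
                           toy_sigma_head(toy_batch), toy_decoder)
loss.backward()
print(loss.item())
```

Inside the class the same steps could be expressed through `proposal_distr`, `prior_distr`, `sample_latent`, `generative_distr` and the helpers `bernoulli_log_likelihood` and `kl`.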

### Log-likelihood estimates

In [None]:
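def log_mean_exp(data):
    """Stable log(mean(exp(data))) over the last dim: n x K -> n."""
    max_ = torch.max(data, dim=-1, keepdim=True).values
    data_exp = torch.exp(data - max_)
    # squeeze so the sum has shape n instead of broadcasting to n x n
    return torch.log(data_exp.mean(dim=-1)) + max_.squeeze(-1)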

def compute_log_likelihood_monte_carlo(batch, model, generative_log_likelihood, K):
    """
    Monte-Carlo log-likelihood estimation for a batch.

    Input: batch, Tensor of shape n x D for VAE
    Input: model, Module - object with methods prior_distr, sample_latent and generative_distr,
           described in VAE class.
    Input: generative_log_likelihood, function which takes batch and distribution parameters
           produced by the generative network.
    Input: K, int - number of latent samples.
    Return: float - average log-likelihood estimate for the batch.
    """
    mu, sigma = model.prior_distr(batch)
    z = model.sample_latent(mu, sigma, K=K)
    params = model.generative_distr(z)
    log_likelihood = log_mean_exp(generative_log_likelihood(batch, params))
    return log_likelihood.mean().data.item()

### Training models

In [None]:
VAE_d2 = VAE(2, 28*28).cuda()
optimizer = torch.optim.Adam(VAE_d2.parameters(), lr=7e-4)

train(VAE_d2, optimizer, train_dl, test_dl, n_epochs=25, calc_LLMC=True)

In [None]:
VAE_d10 = VAE(10, 28*28).cuda()
optimizer = torch.optim.Adam(VAE_d10.parameters(), lr=7e-4)

train(VAE_d10, optimizer, train_dl, test_dl, n_epochs=25, calc_LLMC=True, K=50)

### Evaluating results

Visual inspection

In [None]:
show_images(VAE_d2.generate_samples(20).detach().cpu())

In [None]:
show_images(VAE_d10.generate_samples(20).detach().cpu())

In [None]:
def sample_like(model, data):
    with torch.no_grad():
        mu, sigma = model.proposal_distr(data)
        z = model.sample_latent(mu, sigma, K=64)
        alike_gen = model.generative_distr(z).squeeze().cpu().numpy().reshape(8, 8, 28, 28)

    plt.figure(figsize=(8, 8))
    plt.imshow(np.transpose(alike_gen, (0, 2, 1, 3)).reshape(28*8, 28*8), cmap='gray')
    plt.axis('off')

In [None]:
sample_like(VAE_d2, train_ds[0][0].unsqueeze(0))

In [None]:
sample_like(VAE_d10, train_ds[0][0].unsqueeze(0))

Latent space visualization (from the decoder's side)

In [None]:
def draw_manifold_vae(model):
    generator = lambda z: model.generative_distr(torch.from_numpy(z).unsqueeze(1).float().cuda()).view(28, 28).data.cpu().numpy()
    return draw_manifold(generator)

In [None]:
draw_manifold_vae(VAE_d2)

Latent space visualization (from the encoder's side)

In [None]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

def draw_latent_space_VAE(model, data, target, use_TSNE=False):
    model.eval()
    with torch.no_grad():
        z = model.sample_latent(*model.proposal_distr(data)).squeeze().cpu().numpy()
    if (use_TSNE):
        z = TSNE(2).fit_transform(z)

    plt.figure(figsize=(7, 6))
    plt.scatter(z[:, 0], z[:, 1], c=target.cpu().numpy(), cmap='gist_rainbow', alpha=0.75)
    plt.colorbar()
    plt.show()

In [None]:
draw_latent_space_VAE(VAE_d2, test_ds[::10][0], test_ds[::10][1])

In [None]:
draw_latent_space_VAE(VAE_d10, test_ds[::10][0], test_ds[::10][1], use_TSNE=True)