# Part 8

## Lecture 8
Autoencoders are used for:
- learning compressed representations to store large datasets
- pre-training for feature learning
- density estimation
- initializing weights
- generating new data samples

**Size of hidden layer**:
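- Undercomplete 
    - *h < x* (hidden layer smaller than the input)
    - Compresses the input; works well for inputs that resemble the training samples
- Overcomplete 
    - *h > x* (hidden layer larger than the input)
    - No compression, but useful for representation learning
    - The network could simply copy the input; this can be prevented with **regularization**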

<img src="./image/over_and_undercomplete.png" height="300" />

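- Denoising autoencoder
    - Corrupt the input on purpose and train the network to recover the clean input
    - This acts as **regularization** and makes the representation robust to noise
    - $g(f(x+e)) \approx x$, with $f$ the encoder, $g$ the decoder and $e$ the added noise
- Contractive autoencoder
    - Penalize unwanted variations
        - if x changes slightly, h does not change much; **robust**
    - The penalty $\omega(h)$, the Frobenius norm of the Jacobian of h with respect to x, measures:
        - **how much the activations change when the input changes**.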
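The two regularizers above can be sketched in PyTorch as follows. This is a minimal illustration under assumptions, not the lecture's implementation: the layer sizes, the noise level `noise_std` and the penalty weight `lam` are hypothetical values, and the penalty is taken as the squared Frobenius norm.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# Hypothetical toy sizes: 4 input features, 2 hidden units
encoder = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())  # f
decoder = nn.Linear(2, 4)                               # g
mse = nn.MSELoss()

x = torch.rand(8, 4)  # batch of clean inputs

# Denoising: corrupt the input, but compare the reconstruction to the clean x
noise_std = 0.1  # assumed noise level
x_noisy = x + noise_std * torch.randn_like(x)
denoising_loss = mse(decoder(encoder(x_noisy)), x)

# Contractive: squared Frobenius norm of the Jacobian dh/dx for one sample
def contractive_penalty(x_single):
    J = jacobian(encoder, x_single, create_graph=True)  # shape (2, 4)
    return (J ** 2).sum()

lam = 1e-2  # assumed penalty weight
recon_loss = mse(decoder(encoder(x)), x)
penalty = torch.stack([contractive_penalty(xi) for xi in x]).mean()
contractive_loss = recon_loss + lam * penalty
```

Both losses are scalars that can be passed to an optimizer. The contractive penalty is computed per sample here for readability, which would be slow for large batches.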

**Generate samples**?
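- A plain autoencoder is not suited for this: its latent space is not continuous, so decoding an arbitrary latent point can give meaningless output
- A variational autoencoder is: its hidden layer describes a distribution from which latent vectors are sampled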

<img src="./image/Encoders.png" height="200" />

**Generative Adversarial Networks**
- **Goal**: 
    - generate new sample images
        - that the discriminator accepts as real (discriminator output > 0.5)
- _sample from high dimensional training distribution_
    - not possible directly; instead sample random noise and learn a transformation to the training distribution
- networks:
    - **Generator**: create real looking images
        - Aims to fool the discriminator: pushes $D(G(z))$ towards 1 for fake data, i.e. minimizes $\log(1 - D(G(z)))$
        - **Only needs random noise as input**
    - **Discriminator**: judge if real or fake
        - Aims to output $D(x)$ close to 1 for real data and $D(G(z))$ close to 0 for fake data
- Evaluating likelihoods: Higher likelihood for images does not necessarily mean visually better
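The opposing objectives of the two networks can be made concrete with a few numbers. The sketch below assumes the standard minimax formulation, where `d_real` stands for $D(x)$ and `d_fake` for $D(G(z))$; the function names are hypothetical and the inputs are hand-picked probabilities, not outputs of a trained model.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of log D(x) + log(1 - D(G(z))); the discriminator minimizes this."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Minimax generator objective log(1 - D(G(z))); the generator minimizes this."""
    return math.log(1.0 - d_fake)

# A confident, correct discriminator has a low loss
print(discriminator_loss(d_real=0.9, d_fake=0.1))

# The generator is better off when the discriminator is fooled (D(G(z)) near 1)
print(generator_loss(0.9) < generator_loss(0.1))

# When real and fake cannot be told apart, D outputs 0.5 everywhere
print(discriminator_loss(0.5, 0.5))
```

When the discriminator outputs 0.5 everywhere, its loss equals $2 \ln 2$, which corresponds to the point where it can no longer distinguish generated from real samples.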

**VAE** vs **AE**:
- The VAE keeps all encoded points close to the origin of the latent space, so decoding points from that region yields meaningful reproductions
    - The latent space is continuous, which is **not** the case for an AE.
- Only the VAE penalizes the **structure of the latent space**
- Both VAE and AE penalize **reconstruction** errors

## Assignment 8: (Variational) auto-encoders
Generating "meaningful" samples using unsupervised learning. Unsupervised learners could be used to exploit (hidden) useful structure in data.

**Dimension reduction**: 
- Reduce the number of features without losing the most important information. 
    <!-- - $$g: \mathbb{R}^n \rightarrow \mathbb{R}^k \quad \quad n \gg k$$ -->
    - Example: conversion of *bmp* to *jpeg*.

<img src="./image/dimension_reduction.png" height="250" />

**The Auto-encoder**:
- The output ($y =: \hat{x}$) is compared to the input, so the network learns both how to encode the input signal and how to decode it back. 
- It is the composition of an encoder $g$ and a decoder $h$: $$y = h(g(x)).$$
- Loss: $$f_L = L \left( x, y \right) = L \left( x, h(g(x)) \right).$$
- $L$ is usually chosen to be the MSE loss.
- Assumes that not all information in the data is equally important: there is always some noise, and the data can be reduced to a representation that contains only the most important information.

<img src="./image/auto-encoder.png" height="250" />

**Latent space**:
- The encoded vector lives in a low-dimensional space. 
    - Often called: **Latent space**
    - Potential for generating new data samples
- Gaps in the latent space can cause generated examples to be "gibberish"; this happens when only the reconstructions are penalized
- Variational auto-encoders (VAEs) also penalize the structure of the latent space, which makes generated examples less "gibberish".

<img src="./image/auto-encoder-architecture.png" height="250" />

**Architecture**:
- "to 100" means that the layer reduces its input to 100 variables (dimensions)
- Use the MSE loss, because reconstruction is a regression problem and not a classification problem.

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
from torchinfo import summary


#encoder
class Encoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim):
        super(Encoder, self).__init__()
        self.linear1 = nn.Linear(s_img*s_img, hdim[0])
        self.linear2 = nn.Linear(hdim[0], hdim[1])
        self.linear3 = nn.Linear(hdim[1], latent_dims)
        self.relu    = nn.ReLU()

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.linear3(x)

        return x

#decoder
class Decoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim):
        super(Decoder, self).__init__()
        self.s_img   = s_img  # stored so forward() does not rely on a global variable
        self.linear1 = nn.Linear(latent_dims, hdim[1])
        self.linear2 = nn.Linear(hdim[1], hdim[0])
        self.linear3 = nn.Linear(hdim[0], s_img*s_img)
        self.relu    = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, z):
        z = self.relu(self.linear1(z))
        z = self.relu(self.linear2(z))
        z = self.sigmoid(self.linear3(z))  # pixel values in [0, 1]
        z = z.reshape((-1, 1, self.s_img, self.s_img))

        return z

#autoencoder
class Autoencoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim = [100, 50]):
        super(Autoencoder, self).__init__()
        self.encoder = Encoder(latent_dims, s_img, hdim)
        self.decoder = Decoder(latent_dims, s_img, hdim)

    def forward(self, x):
        z = self.encoder(x)
        y = self.decoder(z)

        return y


# Hyperparameters and input shape used for the summary
n_samples, in_channels, s_img, latent_dims = 3, 1, 28, 2
hdim = [100, 50] #choose hidden dimension
bias = False #not used by the model above (nn.Linear keeps its default bias)

model_ouput = summary(
    Autoencoder(latent_dims, s_img, hdim),
    (n_samples, in_channels, s_img, s_img),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
├─Encoder: 1-1                           [3, 1, 28, 28]   [3, 2]           --               --               --
|    └─linear1.weight                                                      [100, 784]
|    └─linear2.weight                                                      [50, 100]
|    └─linear3.weight                                                      [2, 50]
|    └─Linear: 2-1                       [3, 784]         [3, 100]         [784, 100]       78,500           235,200
|    └─ReLU: 2-2                         [3, 100]         [3, 100]         --               --               --
|    └─Linear: 2-3                       [3, 100]         [3, 50]          [100, 50]        5,050            15,000
|    └─ReLU: 2-4                         [3, 50]          [3, 50]          --               --               --
|    └─Linear: 2-5                       [3, 50]          
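A training step for such an autoencoder with the MSE loss could look like the sketch below. It is a self-contained illustration under assumptions: the `nn.Sequential` model is a compact stand-in with the same layer sizes as the `Encoder`/`Decoder` pair above, the batch is random data instead of real images, and the optimizer settings are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

s_img, latent_dims = 28, 2

# Compact stand-in for the encoder/decoder pair (same layer sizes)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(s_img * s_img, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, latent_dims),              # encoder output z
    nn.Linear(latent_dims, 50), nn.ReLU(),
    nn.Linear(50, 100), nn.ReLU(),
    nn.Linear(100, s_img * s_img), nn.Sigmoid(),
    nn.Unflatten(1, (1, s_img, s_img)),
)

x = torch.rand(16, 1, s_img, s_img)  # random stand-in for a batch of images in [0, 1]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed settings
loss_fn = nn.MSELoss()

losses = []
for step in range(5):
    optimizer.zero_grad()
    y = model(x)           # y = h(g(x))
    loss = loss_fn(y, x)   # L(x, h(g(x))): the input is its own target
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

No labels are involved: the input serves as its own target, which is what makes the procedure unsupervised.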

**Variational Auto Encoders**:
- Penalize both the reconstruction and the structure of the latent space.
- **Reparameterization trick**:
    - Used to allow backpropagation through the random sampling step
    - $$z = \mu_x +  \sigma_x \zeta= g_1(x) + g_2 (x) \zeta$$ where $\zeta$ is randomly sampled from $\mathcal{N} (0, I)$.
- _Encoder_: generates a distribution from which z is sampled. 
    - Yields a **continuous** and **complete** latent space.
    - This is because the encoder outputs a range of possible values, from which we randomly sample to feed the decoder model.
- _Decoder_: generates new data from z
- The latent space is regularized by penalizing the learned distribution.
    - the reproduction term maximizes the reconstruction likelihood
    - the regularization term encourages the learned distribution to be close to the prior distribution, which we assume to be a standard Gaussian (mean 0, variance 1).

$$
\begin{aligned}
f_{L} &= \underbrace{L (x, h(z))}_{\text{reproduction term}} + \underbrace{R \left(\mathcal{N} (\mu_x, \sigma_x), \mathcal{N} (0, I) \right)}_{\text{regularization term}} \\
&= L (x, h(z)) + R \left(\mathcal{N} (g_1 (x), g_2(x)), \mathcal{N} (0, I) \right)
\end{aligned}
$$
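The reparameterization trick and the regularization term can be checked on a small example. The sketch assumes that $R$ is the KL divergence to $\mathcal{N}(0, I)$, written in its closed form for a diagonal Gaussian; the tensors `mu` and `log_sigma` are hypothetical stand-ins for the encoder outputs $g_1(x)$ and $\log g_2(x)$.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for a batch of 3 inputs and a 2-D latent space
mu = torch.zeros(3, 2, requires_grad=True)
log_sigma = torch.zeros(3, 2, requires_grad=True)
sigma = torch.exp(log_sigma)  # exp keeps sigma positive

# Reparameterization: the randomness lives in zeta, so z is differentiable in mu and sigma
zeta = torch.randn_like(mu)
z = mu + sigma * zeta

# Closed-form KL divergence between N(mu, sigma^2) and N(0, I)
kl = (0.5 * (sigma**2 + mu**2 - 1) - torch.log(sigma)).sum()

# Backpropagate through both the sample and the regularization term
(z.sum() + kl).backward()
print(kl.item())       # zero here, since mu = 0 and sigma = 1 match the prior
print(mu.grad.shape)   # gradients reach the encoder outputs
```

Because $\zeta$ is sampled independently of the network, the gradient of $z$ with respect to $\mu_x$ is 1 and with respect to $\sigma_x$ is $\zeta$, which is what lets backpropagation pass through the sampling step.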

<img src="./image/var_encoder.png" height="300" />

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
from torchinfo import summary


#encoder
class VarEncoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim):
        super(VarEncoder, self).__init__()
        
        #layers for g2 (output log sigma)
        self.linear1_1 = nn.Linear(s_img*s_img, hdim[0])
        self.linear2_1 = nn.Linear(hdim[0], hdim[1])
        self.linear3_1 = nn.Linear(hdim[1], latent_dims)

        #layers for g1 (output mu)
        self.linear1_2 = nn.Linear(s_img*s_img, hdim[0])
        self.linear2_2 = nn.Linear(hdim[0], hdim[1])
        self.linear3_2 = nn.Linear(hdim[1], latent_dims)

        self.relu    = nn.ReLU()

        #distribution setup: standard normal used to sample zeta
        self.N = torch.distributions.Normal(0, 1)
        self.kl = 0

    def kull_leib(self, mu, sigma):
        #closed-form KL divergence between N(mu, sigma^2) and N(0, I)
        return (0.5 * (sigma**2 + mu**2 - 1) - torch.log(sigma)).sum()

    def reparameterize(self, mu, sig):
        return mu + sig*self.N.sample(mu.shape)

    def forward(self, x):
        x = torch.flatten(x, start_dim=1)
        
        x1 = self.relu(self.linear1_1(x))
        x1 = self.relu(self.linear2_1(x1))

        x2 = self.relu(self.linear1_2(x))
        x2 = self.relu(self.linear2_2(x2))

        sig = torch.exp(self.linear3_1(x1))
        mu = self.linear3_2(x2)

        #reparameterize to find z
        z = self.reparameterize(mu, sig)

        #loss between N(0,I) and learned distribution
        self.kl = self.kull_leib(mu, sig)

        return z

#decoder
class Decoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim):
        super(Decoder, self).__init__()
        self.s_img   = s_img  # stored so forward() does not rely on a global variable
        self.linear1 = nn.Linear(latent_dims, hdim[1])
        self.linear2 = nn.Linear(hdim[1], hdim[0])
        self.linear3 = nn.Linear(hdim[0], s_img*s_img)
        self.relu    = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, z):
        z = self.relu(self.linear1(z))
        z = self.relu(self.linear2(z))
        z = self.sigmoid(self.linear3(z))  # pixel values in [0, 1]
        z = z.reshape((-1, 1, self.s_img, self.s_img))

        return z

#autoencoder
class VarAutoencoder(nn.Module):
    def __init__(self, latent_dims, s_img, hdim = [100, 50]):
        super(VarAutoencoder, self).__init__()
        self.encoder = VarEncoder(latent_dims, s_img, hdim)
        self.decoder = Decoder(latent_dims, s_img, hdim)

    def forward(self, x):
        z = self.encoder(x)
        y = self.decoder(z)

        return y


# Hyperparameters and input shape used for the summary
n_samples, in_channels, s_img, latent_dims = 3, 1, 28, 2
hdim = [100, 50] #choose hidden dimension
bias = False #not used by the model above (nn.Linear keeps its default bias)

model_ouput = summary(
    VarAutoencoder(latent_dims, s_img, hdim),
    (n_samples, in_channels, s_img, s_img),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
├─VarEncoder: 1-1                        [3, 1, 28, 28]   [3, 2]           --               --               --
|    └─linear1_1.weight                                                    [100, 784]
|    └─linear2_1.weight                                                    [50, 100]
|    └─linear3_1.weight                                                    [2, 50]
|    └─linear1_2.weight                                                    [100, 784]
|    └─linear2_2.weight                                                    [50, 100]
|    └─linear3_2.weight                                                    [2, 50]
|    └─Linear: 2-1                       [3, 784]         [3, 100]         [784, 100]       78,500           235,200
|    └─ReLU: 2-2                         [3, 100]         [3, 100]         --               --               --
|    └─Linear: 2-3              
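Once a VAE is trained, new samples can be generated by drawing latent vectors from the prior and passing them through the decoder only. The sketch below illustrates that step under assumptions: the `nn.Sequential` decoder is an untrained stand-in with the same layer sizes as the `Decoder` above, so the resulting images are noise rather than meaningful samples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

s_img, latent_dims = 28, 2

# Untrained stand-in with the same layer sizes as the Decoder
decoder = nn.Sequential(
    nn.Linear(latent_dims, 50), nn.ReLU(),
    nn.Linear(50, 100), nn.ReLU(),
    nn.Linear(100, s_img * s_img), nn.Sigmoid(),
    nn.Unflatten(1, (1, s_img, s_img)),
)

n_new = 4
with torch.no_grad():
    z = torch.randn(n_new, latent_dims)  # z ~ N(0, I), the prior the VAE is regularized towards
    samples = decoder(z)                 # pixel intensities in [0, 1]

print(samples.shape)
```

Sampling from $\mathcal{N}(0, I)$ is only sensible because the regularization term pushes the encoded distributions towards that prior; for a plain autoencoder the same procedure may land in gaps of the latent space.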
