![](pics/header.png)

# Deep Learning: Autoencoder

Kevin Walchko

---

These notes come from Udacity's Deep Learning Nanodegree.

## Why?

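An autoencoder learns to reproduce its input at its output after passing it through a smaller, compressed representation. This makes it useful for: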

![](pics/autoencoder.png)

- Compression: like PCA or other dimensionality reduction methods, an autoencoder can reduce the input down to the minimum amount of information needed to reconstruct it
- Denoising: since the autoencoder learns the minimum set of information (compression above), it can remove unnecessary information (noise) and reproduce the original
- Image reconstruction: similar to denoising, but instead of random additive values, it works with missing information (a damaged image or missing color planes) to reconstruct a full, colorized image
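As a minimal sketch of how denoising training data could be prepared (the random `images` batch and the `noise_factor` value are assumptions for illustration, not values from the course), the inputs are corrupted with additive noise while the clean images are kept as the reconstruction target:

```python
import torch

torch.manual_seed(0)

# Stand-in batch of grayscale images in [0, 1]; real data would come from a DataLoader.
images = torch.rand(8, 1, 28, 28)

noise_factor = 0.5  # assumed noise strength, tune for your data

# Corrupt the inputs with additive Gaussian noise, then clamp to the valid pixel range.
noisy_images = images + noise_factor * torch.randn_like(images)
noisy_images = torch.clamp(noisy_images, 0.0, 1.0)

# In training, the model would be fed `noisy_images` and the loss
# would be measured against the clean `images`.
print(noisy_images.shape)
```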

## Error Calculation

Mean squared error (MSE) is a good choice when comparing pixel quantities rather than class probabilities. 

```python
criterion = nn.MSELoss()
```
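A small worked example, with made-up pixel values, shows what the criterion computes: the squared difference per element, averaged over all elements.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()  # default reduction is the mean over all elements

# Made-up 2x2 "images" for illustration only.
target = torch.tensor([[0.0, 0.5], [1.0, 0.25]])
output = torch.tensor([[0.0, 0.0], [1.0, 0.75]])

loss = criterion(output, target)
# Squared errors are 0, 0.25, 0, 0.25, so the mean is 0.5 / 4 = 0.125
print(loss.item())
```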

## Transpose Convolution

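A strided convolution (or a convolution followed by pooling) can be thought of as downsampling an image. To upsample an image, you can use a transposed convolution, which maps each input value to a patch of output values through a learned kernel. It reverses the change in spatial shape made by a regular convolution, but it is not its mathematical inverse.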

![](pics/transposed-conv.gif)
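A minimal sketch of the shape change, assuming a 2x2 kernel with a stride of 2 (the channel counts and input size are chosen only for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical compressed representation: batch of 1, 4 channels, 7x7 pixels.
x = torch.rand(1, 4, 7, 7)

# A kernel size of 2 with a stride of 2 doubles the height and width.
up = nn.ConvTranspose2d(in_channels=4, out_channels=16, kernel_size=2, stride=2)

y = up(x)
print(y.shape)  # torch.Size([1, 16, 14, 14])
```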

## Alternative to Transposed Convolution

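Transposed convolution can leave checkerboard artifacts in your image. Thus, an alternative is to:

1. Upsample with interpolation: `F.interpolate(x, scale_factor=2, mode='nearest')` (called `F.upsample` in older PyTorch versions)
1. Apply a regular convolution followed by a ReLU: `F.relu(conv(x))`, where `conv` is an `nn.Conv2d` layer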

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchinfo import summary

# define the NN architecture
class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        ## encoder layers ##
        # conv layer (depth from 1 --> 16), 3x3 kernels
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        # conv layer (depth from 16 --> 4), 3x3 kernels
        self.conv2 = nn.Conv2d(16, 4, 3, padding=1)
        # pooling layer to reduce x-y dims by two; kernel and stride of 2
        self.pool = nn.MaxPool2d(2, 2)

        ## decoder layers ##
        # conv layer (depth from 4 --> 16), 3x3 kernels
        self.conv4 = nn.Conv2d(4, 16, 3, padding=1)
        # conv layer (depth from 16 --> 1), 3x3 kernels
        self.conv5 = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        ## encoder ##
        # conv layer with relu activation, then maxpooling
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        # second conv layer with relu activation, then maxpooling
        x = F.relu(self.conv2(x))
        x = self.pool(x)  # compressed representation

        ## decoder ##
        # upsample, followed by a conv layer, with relu activation function
        # `F.interpolate` replaces the deprecated `F.upsample`
        x = F.interpolate(x, scale_factor=2, mode='nearest')
        x = F.relu(self.conv4(x))
        # upsample again, then apply a sigmoid so outputs lie in [0, 1]
        x = F.interpolate(x, scale_factor=2, mode='nearest')
        x = torch.sigmoid(self.conv5(x))  # `F.sigmoid` is deprecated

        return x

# initialize the NN
model = ConvAutoencoder()
summary(model)
```

```text
Layer (type:depth-idx)                   Param #
ConvAutoencoder                          --
├─Conv2d: 1-1                            160
├─Conv2d: 1-2                            580
├─MaxPool2d: 1-3                         --
├─Conv2d: 1-4                            592
├─Conv2d: 1-5                            145
Total params: 1,477
Trainable params: 1,477
Non-trainable params: 0
```

## Style Transfer

- The Gram matrix (G) holds non-localized information about the style captured at a given layer:
    - given a block of feature maps (d x h x w), vectorize each feature map: 8x4x4 -> 8x16
    - now, multiply by the transpose: (8x16)(16x8) = 8x8 = G
    - G(1,2) measures the similarity between feature map 1 and feature map 2 of that layer
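The steps above can be sketched in a few lines. The `gram_matrix` helper and the random feature block are illustrative assumptions; in practice the features would come from a layer of a pretrained network.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Turn a (d, h, w) block of feature maps into a (d, d) Gram matrix."""
    d, h, w = features.shape
    # Vectorize each feature map: (d, h, w) -> (d, h*w)
    flat = features.reshape(d, h * w)
    # Multiply by the transpose: (d, h*w) x (h*w, d) -> (d, d)
    return flat @ flat.t()

torch.manual_seed(0)
features = torch.rand(8, 4, 4)  # stand-in for one layer's feature maps
G = gram_matrix(features)
print(G.shape)  # torch.Size([8, 8])
```

Each entry `G[i, j]` is the dot product of the flattened feature maps `i` and `j`, so the matrix is symmetric and discards where in the image a feature occurred.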

## Style Loss

Style loss is calculated from the MSE between the target and style Gram matrices. This loss is decreased by <u>only</u> changing the target image.

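$$
\begin{aligned}
\mathcal{L}_{style} &= a \sum_i w_i (T_{s,i} - S_{s,i})^2 \\
\mathcal{L}_{content} &= \frac{1}{2} \sum (T_c - C_c)^2 \\
\mathcal{L}_{total} &= \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{style}
\end{aligned}
$$

Here $T_{s,i}$ and $S_{s,i}$ are the Gram matrices of the target and style images at layer $i$, $w_i$ is the weight given to that layer, $a$ is a constant scale factor, and $T_c$ and $C_c$ are the content representations of the target and content images.

Typically $\alpha < \beta$, and the balance between the two terms is described by the ratio $\frac{\alpha}{\beta}$. As $\frac{\alpha}{\beta}$ decreases, there is less content ($\alpha$) and more style ($\beta$) in the resulting image.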
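A rough sketch of how these losses could be combined, assuming random tensors as stand-ins for real network features. The helper names, the layer names, the layer weights, `a = 1.0`, and the `alpha` and `beta` values are all illustrative assumptions rather than values from the course.

```python
import torch

def gram_matrix(features):
    # (d, h, w) block of feature maps -> (d, d) Gram matrix
    d, h, w = features.shape
    flat = features.reshape(d, h * w)
    return flat @ flat.t()

def style_loss(target_feats, style_feats, layer_weights, a=1.0):
    # a * sum_i w_i * (T_{s,i} - S_{s,i})^2, summed over Gram matrix entries
    total = torch.tensor(0.0)
    for name, w_i in layer_weights.items():
        T = gram_matrix(target_feats[name])
        S = gram_matrix(style_feats[name])
        total = total + w_i * torch.sum((T - S) ** 2)
    return a * total

def content_loss(target_content, content_content):
    # 1/2 * sum (T_c - C_c)^2
    return 0.5 * torch.sum((target_content - content_content) ** 2)

torch.manual_seed(0)
layer_weights = {"layer1": 1.0, "layer2": 0.5}  # hypothetical w_i
target_feats = {name: torch.rand(8, 4, 4) for name in layer_weights}
style_feats = {name: torch.rand(8, 4, 4) for name in layer_weights}
content_feats = torch.rand(8, 4, 4)  # stand-in content representation

alpha, beta = 1.0, 1e3  # assumed weights with alpha < beta
l_style = style_loss(target_feats, style_feats, layer_weights)
l_content = content_loss(target_feats["layer2"], content_feats)
total_loss = alpha * l_content + beta * l_style
print(l_content.item(), l_style.item(), total_loss.item())
```

In an actual style transfer loop, the target features would be recomputed from the target image at every step, and only that image would receive gradient updates.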