# Transposed (Fractionally-strided) Convolutional Layers

## Why transposed convolution is needed

For some tasks, such as semantic segmentation, the output must have the same spatial size as the input.

Transposed convolutional layers increase (upsample) the spatial dimensions of intermediate feature maps, so that we can get back to the original input size even after the input has been downsampled by CNN layers.

## The Deconvolution Operation

Deconvolution with a $2 \times 2$ input, a $2 \times 2$ kernel, stride 1 and no padding:
![](./trans_conv.svg)

Suppose the input has size $n_h \times n_w$ and the kernel has size $k_h \times k_w$. With stride 1 and no padding, the output size is
$$(n_h+k_h-1, n_w+k_w-1).$$

In [7]:
import torch
from torch import nn

In [2]:
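def deconv_2d(X, K):
    """Transposed convolution of a 2D input X with a 2D kernel K (stride 1, no padding)."""
    k_h, k_w = K.shape
    n_h, n_w = X.shape
    Y = torch.zeros(size=(n_h + k_h - 1, n_w + k_w - 1))

    for i in range(n_h):
        for j in range(n_w):
            # Each input element scales the kernel and adds it to a k_h x k_w output window.
            Y[i:i + k_h, j:j + k_w] += K * X[i, j]

    return Y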

In [3]:
X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
deconv_2d(X, K)

tensor([[ 0.,  0.,  1.],
        [ 0.,  4.,  6.],
        [ 4., 12.,  9.]])

In [4]:
# PyTorch deconv API
X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)
tconv = torch.nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
tconv.weight.data = K
tconv(X)

tensor([[[[ 0.,  0.,  1.],
          [ 0.,  4.,  6.],
          [ 4., 12.,  9.]]]], grad_fn=<SlowConvTranspose2DBackward>)

### Deconv with padding

Padding in a Deconv works very differently from padding in a Conv:
1. it is applied to the output, not to the input;
2. it removes rows and columns from the output instead of adding them.

If the padding size is $(p_h, p_w)$, then $p_h$ rows are removed from both the top and the bottom of the output, and $p_w$ columns are removed from both the left and the right of the output. Thus the output size with padding $(p_h, p_w)$ is
$$(n_h+k_h-2p_h-1, n_w+k_w-2p_w-1).$$

In [5]:
X = torch.ones((3, 3))
K = torch.ones((3, 3))
X = X.reshape((1, 1, 3, 3))
K = K.reshape((1, 1, 3, 3))
# Without padding: the full 5 x 5 output
tconv_no_padding = torch.nn.ConvTranspose2d(1, 1, kernel_size=3, bias=False)
tconv_no_padding.weight.data = K
Y_no_padding = tconv_no_padding(X)
print(Y_no_padding)
# With padding (2, 1): 2 rows are cropped from the top and bottom, 1 column from the left and right
tconv = torch.nn.ConvTranspose2d(1, 1, kernel_size=3, padding=(2, 1), bias=False)
tconv.weight.data = K
Y = tconv(X)
print(Y)

tensor([[[[1., 2., 3., 2., 1.],
          [2., 4., 6., 4., 2.],
          [3., 6., 9., 6., 3.],
          [2., 4., 6., 4., 2.],
          [1., 2., 3., 2., 1.]]]], grad_fn=<SlowConvTranspose2DBackward>)
tensor([[[[6., 9., 6.]]]], grad_fn=<SlowConvTranspose2DBackward>)
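
The cropping can also be reproduced by hand. The following is a minimal sketch, assuming stride 1 and a single channel; `deconv_2d_padded` is a hypothetical helper written for illustration and is not part of PyTorch. It computes the full output first and then slices away the padded rows and columns.

```python
import torch

def deconv_2d_padded(X, K, padding=(0, 0)):
    """Naive stride-1 transposed convolution, then crop by the padding."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    p_h, p_w = padding
    Y = torch.zeros(n_h + k_h - 1, n_w + k_w - 1)
    for i in range(n_h):
        for j in range(n_w):
            Y[i:i + k_h, j:j + k_w] += X[i, j] * K
    # Remove p_h rows from the top and bottom, p_w columns from the left and right.
    return Y[p_h:Y.shape[0] - p_h, p_w:Y.shape[1] - p_w]

Y_manual = deconv_2d_padded(torch.ones(3, 3), torch.ones(3, 3), padding=(2, 1))
print(Y_manual)  # tensor([[6., 9., 6.]])
```

The result is the middle row of the unpadded $5 \times 5$ output with its first and last columns dropped, which matches the `ConvTranspose2d` output with `padding=(2, 1)`.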


### Deconv with stride
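Unlike in a Conv, where the stride is applied to the input, the stride in a Deconv is applied to the intermediate results: the scaled copy of the kernel produced by each input element is shifted by the stride in the output. A larger stride therefore makes the output larger.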

![](./trans_conv_stride2.svg)

Suppose the stride is $(s_h, s_w)$. Along the height, the intermediate results of every two vertically adjacent input elements are placed $s_h-1$ rows further apart than with stride 1. There are $n_h-1$ such adjacent pairs, so the output height is $n_h+k_h-1+(s_h-1)(n_h-1)$, which is $s_h(n_h-1)+k_h$. The width follows in the same way.

The output size is 
$$(s_h \times (n_h-1)+k_h, s_w \times (n_w-1)+k_w).$$
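
The stride formula can be checked with a small sketch. It assumes a single channel and no padding; `deconv_2d_strided` is a hypothetical helper for illustration that extends the naive loop by shifting each window by the stride. The result is compared against `nn.ConvTranspose2d` with the same kernel.

```python
import torch
from torch import nn

def deconv_2d_strided(X, K, stride=(1, 1)):
    """Naive transposed convolution with stride and no padding."""
    n_h, n_w = X.shape
    k_h, k_w = K.shape
    s_h, s_w = stride
    Y = torch.zeros(s_h * (n_h - 1) + k_h, s_w * (n_w - 1) + k_w)
    for i in range(n_h):
        for j in range(n_w):
            # The window for input element (i, j) starts at (i * s_h, j * s_w).
            Y[i * s_h:i * s_h + k_h, j * s_w:j * s_w + k_w] += X[i, j] * K
    return Y

X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
Y_strided = deconv_2d_strided(X, K, stride=(2, 2))
print(Y_strided.shape)  # torch.Size([4, 4])

# Reference result from PyTorch with the same kernel
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
with torch.no_grad():
    tconv.weight.copy_(K.reshape(1, 1, 2, 2))
    Y_ref = tconv(X.reshape(1, 1, 2, 2))
print(torch.allclose(Y_strided, Y_ref[0, 0]))  # True
```

With a $2 \times 2$ input, a $2 \times 2$ kernel and stride 2, the formula gives $2 \times (2-1) + 2 = 4$ in each dimension, and the four kernel copies do not overlap.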

### Deconv for multiple channels

For multiple channels, as in a CNN, if the input has $c_i$ channels, the kernel has size $c_i \times k_h \times k_w$. If the output has $c_o$ channels, the kernel has size $c_o \times c_i \times k_h \times k_w$.
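
As a quick check of the channel dimensions, the sketch below builds a multi-channel layer and prints its shapes. The channel counts and input size are arbitrary values chosen for illustration. Note that PyTorch stores the `ConvTranspose2d` weight with the input channels first, that is, as $(c_i, c_o, k_h, k_w)$ when `groups=1`, so the first two axes are swapped relative to the ordering written above.

```python
import torch
from torch import nn

c_i, c_o, k_h, k_w = 3, 5, 2, 2  # illustrative values
tconv_mc = nn.ConvTranspose2d(c_i, c_o, kernel_size=(k_h, k_w), bias=False)
print(tconv_mc.weight.shape)  # torch.Size([3, 5, 2, 2]), i.e. (c_i, c_o, k_h, k_w)

X_mc = torch.rand(1, c_i, 4, 4)
Y_mc = tconv_mc(X_mc)
print(Y_mc.shape)  # torch.Size([1, 5, 5, 5])
```

The spatial size follows the stride-1, no-padding formula ($4 + 2 - 1 = 5$), while the channel dimension changes from $c_i$ to $c_o$.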

## A trick
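A Deconv with the same kernel size, padding and stride as a Conv, and with the input and output channel counts swapped, maps the Conv output back to the shape of the Conv input. Only the shape is restored, not the values, and the match is exact when $n + 2p - k$ is divisible by the stride, as in the example below.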

In [8]:
X = torch.rand(size=(1, 10, 16, 16))
conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape

True
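
To see why the shapes match, the size formulas can be combined into two small functions. This is a sketch: `conv_out_size` and `tconv_out_size` are hypothetical helper names, and the formulas assume dilation 1. The second input size, 17, is chosen so that the convolution's floor division discards a remainder; in that case the same parameters no longer give back the original size, and PyTorch's `output_padding` argument can be used to add the missing row and column.

```python
import torch
from torch import nn

def conv_out_size(n, k, p, s):
    """Output length of a convolution along one dimension (dilation 1)."""
    return (n + 2 * p - k) // s + 1

def tconv_out_size(n, k, p, s, output_padding=0):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return s * (n - 1) + k - 2 * p + output_padding

for n in (16, 17):
    m = conv_out_size(n, k=5, p=2, s=3)
    print(n, m, tconv_out_size(m, k=5, p=2, s=3))
# 16 6 16
# 17 6 16

# For n = 17 one row and one column are lost; output_padding=1 restores them.
X17 = torch.rand(1, 10, 17, 17)
conv17 = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv17 = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3,
                             output_padding=1)
print(tconv17(conv17(X17)).shape == X17.shape)  # True
```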