# Masked vs cropped implementation for Gated PixelCNN

Hi all, in this notebook we will compare the masked implementation of the convolutions from the Gated PixelCNN with the alternative suggested in the paper: regular convolution operations combined with appropriate cropping and padding to achieve the same result.
Let's check it out!

First, we will check whether both implementations produce the same result. For this, we will create a 5x5 matrix filled with ones as our input example.

In [1]:
import math

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow import nn
from tensorflow.keras import initializers

In [2]:
test_ones_2d = np.ones([1, 5, 5, 1], dtype='float32')

In [3]:
print(test_ones_2d[0,:,:,0].squeeze())

[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]


Now, let's copy the masked implementation that we have been using for our Gated PixelCNN models.

# Masked convolutions

In [4]:
class MaskedConv2D(keras.layers.Layer):
    """Convolutional layers with masks extended to work with Gated PixelCNN.

    Convolutional layers with simple implementation of masks type A and B for
    autoregressive models. Extended version to work with the vertical and horizontal
    stacks from the Gated PixelCNN model.

    Arguments:
    mask_type: one of `"V"`, `"A"` or `"B"`.
    filters: Integer, the dimensionality of the output space (i.e. the number of output
        filters in the convolution).
    kernel_size: An integer or tuple/list of 2 integers, specifying the height and width
        of the 2D convolution window.
        Can be a single integer to specify the same value for all spatial dimensions.
    strides: An integer or tuple/list of 2 integers, specifying the strides of the
        convolution along the height and width.
        Can be a single integer to specify the same value for all spatial dimensions.
        Specifying any stride value != 1 is incompatible with specifying any
        `dilation_rate` value != 1.
    padding: one of `"valid"` or `"same"` (case-insensitive).
    kernel_initializer: Initializer for the `kernel` weights matrix.
    bias_initializer: Initializer for the bias vector.
    """

    def __init__(self,
                 mask_type,
                 filters,
                 kernel_size,
                 strides=1,
                 padding='same',
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros'):
        super(MaskedConv2D, self).__init__()

        assert mask_type in {'A', 'B', 'V'}
        self.mask_type = mask_type

        self.filters = filters

        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size)
        self.kernel_size = kernel_size

        self.strides = strides
        self.padding = padding.upper()
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

    def build(self, input_shape):
        kernel_h, kernel_w = self.kernel_size

        self.kernel = self.add_weight(name='kernel',
                                      shape=(kernel_h,
                                             kernel_w,
                                             int(input_shape[-1]),
                                             self.filters),
                                      initializer=self.kernel_initializer,
                                      trainable=True)

        self.bias = self.add_weight(name='bias',
                                    shape=(self.filters,),
                                    initializer=self.bias_initializer,
                                    trainable=True)

        mask = np.ones(self.kernel.shape, dtype=np.float32)

        # Get centre of the filter for even or odd dimensions
        if kernel_h % 2 != 0:
            center_h = kernel_h // 2
        else:
            center_h = (kernel_h - 1) // 2

        if kernel_w % 2 != 0:
            center_w = kernel_w // 2
        else:
            center_w = (kernel_w - 1) // 2

        if self.mask_type == 'V':
            mask[center_h + 1:, :, :, :] = 0.
        else:
            mask[:center_h, :, :] = 0.
            mask[center_h, center_w + (self.mask_type == 'B'):, :, :] = 0.
            mask[center_h + 1:, :, :] = 0.

        self.mask = tf.constant(mask, dtype=tf.float32, name='mask')

    def call(self, inputs):
        # Zero out the masked weights on every call so they never contribute
        masked_kernel = tf.math.multiply(self.mask, self.kernel)
        x = nn.conv2d(inputs,
                      masked_kernel,
                      strides=[1, self.strides, self.strides, 1],
                      padding=self.padding)
        x = nn.bias_add(x, self.bias)
        return x
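
To make the three mask patterns easier to inspect on their own, here is a minimal NumPy sketch of the masking logic from `build`, reduced to the two spatial dimensions. The helper name `build_mask` is hypothetical and not part of the notebook; it assumes the same centre convention as the layer above.

```python
import numpy as np


def build_mask(mask_type, kernel_h, kernel_w):
    """Return the 2D spatial mask for mask types 'V', 'A' and 'B' (hypothetical helper)."""
    mask = np.ones((kernel_h, kernel_w), dtype=np.float32)

    # (k - 1) // 2 matches both the odd and the even branch used in build()
    center_h = (kernel_h - 1) // 2
    center_w = (kernel_w - 1) // 2

    if mask_type == 'V':
        # Vertical stack: keep the rows up to and including the centre row
        mask[center_h + 1:, :] = 0.0
    else:
        # Horizontal stack: keep only the centre row ...
        mask[:center_h, :] = 0.0
        mask[center_h + 1:, :] = 0.0
        # ... and within it, the pixels to the left (type 'B' also keeps the centre)
        mask[center_h, center_w + (mask_type == 'B'):] = 0.0
    return mask


mask_v = build_mask('V', 3, 3)
mask_a = build_mask('A', 1, 3)
mask_b = build_mask('B', 1, 3)

print(mask_v)
print(mask_a)
print(mask_b)
```

These are the same kernel sizes used in the cells that follow, so the printed masks can be compared directly with the `MASK` outputs there.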

With this implementation, we will recreate all the convolutional operations that occur inside the Gated Block. These operations are:

- Vertical stack
- Vertical to horizontal stack
- Horizontal stack - convolution layer with mask type "A"
- Horizontal stack - convolution layer with mask type "B"



[Figure: Gated PixelCNN gated block]


## Vertical stack

In [5]:
mask_type = 'V'
kernel_size = (3, 3)

conv = MaskedConv2D(mask_type=mask_type,
                    filters=1,
                    kernel_size=kernel_size,
                    padding='same',
                    kernel_initializer='ones',
                    bias_initializer='zeros')

result_v = conv(test_ones_2d)

print('MASK')
print(conv.mask.numpy().squeeze())
print('')
print('OUTPUT')
print(result_v.numpy().squeeze())

MASK
[[1. 1. 1.]
 [1. 1. 1.]
 [0. 0. 0.]]

OUTPUT
[[2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]


## Vertical to horizontal stack

In [6]:
padding = keras.layers.ZeroPadding2D(padding=((1, 0), 0))
cropping = keras.layers.Cropping2D(cropping=((0, 1), 0))

x = padding(result_v)
result = cropping(x)

print('INPUT')
print(result_v.numpy().squeeze())
print('')
print('OUTPUT')
print(result.numpy().squeeze())

INPUT
[[2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]

OUTPUT
[[0. 0. 0. 0. 0.]
 [2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]
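
The pad-then-crop pair above amounts to shifting the feature map down by one row. In this notebook the vertical mask keeps the centre row, so row `i` of the vertical output already depends on row `i` of the input; after the shift, row `i` only carries information from rows above it. A minimal NumPy sketch of the same shift on a 2D array (the helper name `shift_down` is an assumption for illustration):

```python
import numpy as np


def shift_down(feature_map):
    """Pad one row of zeros on top and drop the last row of a 2D array."""
    padded = np.pad(feature_map, ((1, 0), (0, 0)))  # zero padding by default
    return padded[:-1, :]


# Output of the vertical stack on the 5x5 input of ones
vertical_out = np.array([[2, 3, 3, 3, 2],
                         [4, 6, 6, 6, 4],
                         [4, 6, 6, 6, 4],
                         [4, 6, 6, 6, 4],
                         [4, 6, 6, 6, 4]], dtype=np.float32)

shifted = shift_down(vertical_out)
print(shifted)
```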


## Horizontal stack - convolution layer with mask type "A"

In [7]:
mask_type = 'A'
kernel_size = (1, 3)

conv = MaskedConv2D(mask_type=mask_type,
                    filters=1,
                    kernel_size=kernel_size,
                    padding='same',
                    kernel_initializer='ones',
                    bias_initializer='zeros')

result = conv(test_ones_2d)

print('MASK')
print(conv.mask.numpy().squeeze())
print('')
print('OUTPUT')
print(result.numpy().squeeze())

MASK
[1. 0. 0.]

OUTPUT
[[0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]]


## Horizontal stack - convolution layer with mask type "B"

In [8]:
mask_type = 'B'
kernel_size = (1, 3)

conv = MaskedConv2D(mask_type=mask_type,
                    filters=1,
                    kernel_size=kernel_size,
                    padding='same',
                    kernel_initializer='ones',
                    bias_initializer='zeros')

result = conv(test_ones_2d)

print('MASK')
print(conv.mask.numpy().squeeze())
print('')
print('OUTPUT')
print(result.numpy().squeeze())

MASK
[1. 1. 0.]

OUTPUT
[[1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]]


Using the results of the masked approach as reference, let's check the cropped method.

# Cropped and padded convolutions

## Vertical stack

First, let's check this operation step by step: some strategic padding, followed by a convolution in "valid" mode, achieves the same result as the masked version.

In [9]:
kernel_h = 2
kernel_w = 3

kernel_size = (kernel_h, kernel_w)

padding = keras.layers.ZeroPadding2D(padding=((kernel_h - 1, 0),
                                              ((kernel_w - 1) // 2, (kernel_w - 1) // 2)))

res = padding(test_ones_2d)

conv = keras.layers.Conv2D(filters=1,
                           kernel_size=kernel_size,
                           strides=1,
                           padding='valid',
                           kernel_initializer='ones',
                           bias_initializer='zeros')

result_v = conv(res)

print('INPUT')
print(test_ones_2d.squeeze())
print('')
print('PADDED INPUT')
print(res.numpy().squeeze())
print('')
print('CONV FILTER')
print(conv.weights[0].numpy().squeeze())
print('')
print('OUTPUT')
print(result_v.numpy().squeeze())

INPUT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

PADDED INPUT
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 1. 1. 1. 1. 1. 0.]
 [0. 1. 1. 1. 1. 1. 0.]]

CONV FILTER
[[1. 1. 1.]
 [1. 1. 1.]]

OUTPUT
[[2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]


Now, let's implement a layer that includes all the previous operations.

In [10]:
class VerticalConv2D(keras.layers.Conv2D):
    """https://github.com/JesseFarebro/PixelCNNPP/blob/master/layers/VerticalConv2D.py"""

    def __init__(self,
                 filters,
                 kernel_size,
                 **kwargs):
        if not isinstance(kernel_size, tuple):
            kernel_size = (kernel_size // 2 + 1, kernel_size)

        super(VerticalConv2D, self).__init__(filters, kernel_size, **kwargs)

        self.pad = tf.keras.layers.ZeroPadding2D(
            (
                (kernel_size[0] - 1, 0),  # Top, Bottom
                (kernel_size[1] // 2, kernel_size[1] // 2),  # Left, Right
            )
        )

    def call(self, inputs):
        inputs = self.pad(inputs)
        output = super(VerticalConv2D, self).call(inputs)

        return output

In [11]:
kernel_h = 2
kernel_w = 3

kernel_size = (kernel_h, kernel_w)

conv = VerticalConv2D(filters=1,
                      kernel_size=kernel_size,
                      strides=1,
                      padding='valid',
                      kernel_initializer='ones',
                      bias_initializer='zeros')

result_v = conv(test_ones_2d)

print('INPUT')
print(test_ones_2d.squeeze())
print('')
print('CONV FILTER')
print(conv.weights[0].numpy().squeeze())
print('')
print('OUTPUT')
print(result_v.numpy().squeeze())

INPUT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

CONV FILTER
[[1. 1. 1.]
 [1. 1. 1.]]

OUTPUT
[[2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]


## Vertical to horizontal stack
For this operation, the implementation stays the same.

In [12]:
padding = keras.layers.ZeroPadding2D(padding=((1, 0), 0))
cropping = keras.layers.Cropping2D(cropping=((0, 1), 0))

x = padding(result_v)
result = cropping(x)

print('INPUT')
print(result_v.numpy().squeeze())
print('')
print('OUTPUT')
print(result.numpy().squeeze())

INPUT
[[2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]

OUTPUT
[[0. 0. 0. 0. 0.]
 [2. 3. 3. 3. 2.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]
 [4. 6. 6. 6. 4.]]


## Horizontal stack - convolution layer with mask type "A"
Again, let's check each operation step by step.

In [13]:
kernel_size = (1, 1)
conv = keras.layers.Conv2D(filters=1,
                           kernel_size=kernel_size,
                           strides=1,
                           kernel_initializer='ones',
                           bias_initializer='zeros')

padding = keras.layers.ZeroPadding2D(padding=(0, (1, 0)))
cropping = keras.layers.Cropping2D(cropping=(0, (0, 1)))

res = conv(test_ones_2d)
res_2 = padding(res)
res_3 = cropping(res_2)

print('INPUT')
print(test_ones_2d.squeeze())
print('')
print('CONV FILTER')
print(conv.weights[0].numpy().squeeze())
print('')
print('CONVOLUTION RESULT')
print(res.numpy().squeeze())
print('')
print('PADDED RESULT')
print(res_2.numpy().squeeze())
print('')
print('CROPPED RESULT')
print(res_3.numpy().squeeze())

INPUT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

CONV FILTER
1.0

CONVOLUTION RESULT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

PADDED RESULT
[[0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]]

CROPPED RESULT
[[0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1.]]


Note: since our test input has just one channel and the kernel weight is 1, the 1x1 convolution appears to make no change.
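
With a weight other than 1 the effect of the 1x1 convolution becomes visible. The NumPy sketch below compares a masked 1x3 type "A" kernel `[w, 0, 0]` under "same" padding with the convolve, pad and crop sequence. The weight `w` and the random input are assumptions for illustration, and the bias is taken to be zero as in the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
width = image.shape[1]
w = 0.7  # hypothetical single weight, bias assumed to be zero

# Masked type "A": 1x3 kernel [w, 0, 0] with "same" padding,
# so each output pixel only sees its left neighbour
same_padded = np.pad(image, ((0, 0), (1, 1)))
masked_out = (w * same_padded[:, 0:width]
              + 0.0 * same_padded[:, 1:width + 1]
              + 0.0 * same_padded[:, 2:width + 2])

# Cropped version: 1x1 convolution, pad a zero column on the left, crop the last column
conv_out = w * image
cropped_out = np.pad(conv_out, ((0, 0), (1, 0)))[:, :-1]

print(np.allclose(masked_out, cropped_out))
```

One caveat worth keeping in mind: with a non-zero bias the two versions would differ in the first column, since the masked layer adds the bias everywhere while padding after the convolution leaves zeros there.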

## Horizontal stack - convolution layer with mask type "B"
The steps for the mask type "B" convolution layer are a little different.

In [14]:
kernel_size = (1, 2)
kernel_h, kernel_w = kernel_size

padding = keras.layers.ZeroPadding2D(padding=(((kernel_h - 1) // 2, (kernel_h - 1) // 2),
                                              (kernel_w - 1, 0)))
conv = keras.layers.Conv2D(filters=1,
                           kernel_size=kernel_size,
                           strides=1,
                           padding='valid',
                           kernel_initializer='ones',
                           bias_initializer='zeros')

res = padding(test_ones_2d)
result = conv(res)

print('INPUT')
print(test_ones_2d.squeeze())
print('')
print('PADDED INPUT')
print(res.numpy().squeeze())
print('')
print('CONV FILTER')
print(conv.weights[0].numpy().squeeze())
print('')
print('RESULT')
print(result.numpy().squeeze())

INPUT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

PADDED INPUT
[[0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1.]]

CONV FILTER
[1. 1.]

RESULT
[[1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]]
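
As with the other operations, random values give a stricter check than an input of ones. This NumPy sketch compares a masked 1x3 type "B" kernel `[w_left, w_center, 0]` under "same" padding with a 1x2 kernel applied after padding one zero column on the left. The weights and the input are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
width = image.shape[1]
w_left, w_center = 0.5, -1.5  # hypothetical weights

# Masked type "B": 1x3 kernel [w_left, w_center, 0] with "same" padding
same_padded = np.pad(image, ((0, 0), (1, 1)))
masked_out = (w_left * same_padded[:, 0:width]
              + w_center * same_padded[:, 1:width + 1]
              + 0.0 * same_padded[:, 2:width + 2])

# Cropped version: 1x2 kernel [w_left, w_center] in "valid" mode after left padding
left_padded = np.pad(image, ((0, 0), (1, 0)))
cropped_out = (w_left * left_padded[:, 0:width]
               + w_center * left_padded[:, 1:width + 1])

print(np.allclose(masked_out, cropped_out))
```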


In this case, we also implemented a layer encapsulating these operations.

In [15]:
class HorizontalConv2D(keras.layers.Conv2D):
    def __init__(self,
                 filters,
                 kernel_size,
                 **kwargs):
        if not isinstance(kernel_size, tuple):
            kernel_size = (kernel_size // 2 + 1,) * 2

        super(HorizontalConv2D, self).__init__(filters, kernel_size, **kwargs)
        self.pad = tf.keras.layers.ZeroPadding2D(
            (
                (kernel_size[0] - 1, 0),  # (Top, Bottom)
                (kernel_size[1] - 1, 0),  # (Left, Right)
            )
        )

    def call(self, inputs):
        inputs = self.pad(inputs)
        outputs = super(HorizontalConv2D, self).call(inputs)

        return outputs

In [16]:
kernel_size = (1, 2)
conv = HorizontalConv2D(filters=1,
                        kernel_size=kernel_size,
                        strides=1,
                        kernel_initializer='ones',
                        bias_initializer='zeros')

result = conv(test_ones_2d)

print('INPUT')
print(test_ones_2d.squeeze())
print('')
print('CONV FILTER')
print(conv.weights[0].numpy().squeeze())
print('')
print('RESULT')
print(result.numpy().squeeze())

INPUT
[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]

CONV FILTER
[1. 1.]

RESULT
[[1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]
 [1. 2. 2. 2. 2.]]
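
Matching outputs on one input do not by themselves show that the layer is causal. A simple way to probe this is to perturb a single input pixel and list which output pixels change. The sketch below does that in NumPy for a left-padded 1 x k convolution; `horizontal_conv` is a hypothetical stand-in for the layer above, restricted to one channel and one kernel row.

```python
import numpy as np


def horizontal_conv(image, weights):
    """1 x k convolution with (k - 1) zero columns padded on the left."""
    k = weights.shape[0]
    width = image.shape[1]
    padded = np.pad(image, ((0, 0), (k - 1, 0)))
    return sum(weights[b] * padded[:, b:b + width] for b in range(k))


rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
weights = np.array([0.5, -1.5])  # hypothetical 1x2 kernel

baseline = horizontal_conv(image, weights)

# Perturb one pixel and find every output position that reacts to it
perturbed_image = image.copy()
perturbed_image[2, 2] += 1.0
changed = ~np.isclose(horizontal_conv(perturbed_image, weights), baseline)
rows, cols = np.nonzero(changed)
affected = list(zip(rows.tolist(), cols.tolist()))

print(affected)
```

Only the perturbed position and its right neighbour in the same row react, so no output pixel depends on anything to its right. This is the behaviour expected of a type "B" layer.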


# Execution time
Now we will compare the time it takes to perform each convolutional operation.

In [17]:
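import time


def measure_time(conv_fn):
    """Return the mean and standard deviation of the run time of `conv_fn`, in seconds."""
    exec_time = []
    n_iter = 100
    for _ in range(n_iter):
        # Generate a fresh random batch outside the timed region
        test_input = np.random.rand(128, 256, 256, 1).astype('float32')
        start = time.perf_counter()  # monotonic clock, better suited to short intervals
        conv_fn(test_input)
        exec_time.append(time.perf_counter() - start)
    exec_time = np.array(exec_time, dtype='float32')
    return exec_time.mean(), exec_time.std()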
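
The first call to a function decorated with `tf.function` also traces the graph, so it tends to be slower than later calls and can inflate both the mean and the standard deviation. A hedged variant of the timing helper that discards a few warm-up calls is sketched below. The name `measure_time_warm` and its `make_input`, `n_iter` and `n_warmup` parameters are assumptions, and the usage example times a plain NumPy function instead of a convolution.

```python
import time

import numpy as np


def measure_time_warm(fn, make_input, n_iter=100, n_warmup=3):
    """Time fn(make_input()) n_iter times after n_warmup untimed calls."""
    for _ in range(n_warmup):
        fn(make_input())  # untimed: absorbs one-off setup such as graph tracing

    exec_time = []
    for _ in range(n_iter):
        test_input = make_input()  # built outside the timed region
        start = time.perf_counter()
        fn(test_input)
        exec_time.append(time.perf_counter() - start)

    exec_time = np.array(exec_time, dtype='float64')
    return exec_time.mean(), exec_time.std()


# Usage with a stand-in function that records how often it is called
calls = []


def dummy_fn(x):
    calls.append(x.shape)
    return x.sum()


def make_small_batch():
    return np.random.rand(8, 16, 16, 1).astype('float32')


mean_s, std_s = measure_time_warm(dummy_fn, make_small_batch, n_iter=10, n_warmup=2)
print(f"{mean_s:.8f} +- {std_s:.8f} seconds over {len(calls)} calls")
```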

## Vertical stack

In [18]:
mask_type = 'V'
kernel_size = (3, 3)
masked_conv = MaskedConv2D(mask_type=mask_type,
                           filters=32,
                           kernel_size=kernel_size,
                           padding='same',
                           kernel_initializer='ones',
                           bias_initializer='zeros')

@tf.function
def test_masked_fn(x):
    _ = masked_conv(x)
    

masked_time = measure_time(test_masked_fn)
# ----------------------------------------------------------------

kernel_size = (2, 3)
cropped_conv = VerticalConv2D(filters=32,
                              kernel_size=kernel_size,
                              strides=1,
                              padding='valid',
                              kernel_initializer='ones',
                              bias_initializer='zeros')

@tf.function
def test_cropped_fn(x):
    _ = cropped_conv(x)

cropped_time = measure_time(test_cropped_fn)
# ----------------------------------------------------------------

print("Vertical stack")
print(f"Masked convolution:         {masked_time[0]:.8f} +- {masked_time[1]:.8f} seconds")
print(f"Cropped padded convolution: {cropped_time[0]:.8f} +- {cropped_time[1]:.8f} seconds")

Vertical stack
Masked convolution:         0.01410292 +- 0.00891058 seconds
Cropped padded convolution: 0.01386628 +- 0.00675169 seconds


## Horizontal stack - convolution layer with mask type "A"

In [19]:
mask_type = 'A'
kernel_size = (1, 3)
masked_conv = MaskedConv2D(mask_type=mask_type,
                           filters=1,
                           kernel_size=kernel_size,
                           padding='same',
                           kernel_initializer='ones',
                           bias_initializer='zeros')

@tf.function
def test_masked_fn(x):
    _ = masked_conv(x)
    
masked_time = measure_time(test_masked_fn)
# ----------------------------------------------------------------

kernel_size = (1, 1)
conv = keras.layers.Conv2D(filters=1,
                           kernel_size=kernel_size,
                           strides=1,
                           kernel_initializer='ones',
                           bias_initializer='zeros')

padding = keras.layers.ZeroPadding2D(padding=(0, (1, 0)))
cropping = keras.layers.Cropping2D(cropping=(0, (0, 1)))

@tf.function
def test_cropped_fn(x):
    x = conv(x)
    x = padding(x)
    x = cropping(x)

cropped_time = measure_time(test_cropped_fn)
# ----------------------------------------------------------------

print("Horizontal stack - convolution layer with mask type 'A'")
print(f"Masked convolution:         {masked_time[0]:.8f} +- {masked_time[1]:.8f} seconds")
print(f"Cropped padded convolution: {cropped_time[0]:.8f} +- {cropped_time[1]:.8f} seconds")

Horizontal stack - convolution layer with mask type 'A'
Masked convolution:         0.01360846 +- 0.00381987 seconds
Cropped padded convolution: 0.01365352 +- 0.00476047 seconds


## Horizontal stack - convolution layer with mask type "B"


In [20]:
mask_type = 'B'
kernel_size = (1, 3)
masked_conv = MaskedConv2D(mask_type=mask_type,
                           filters=1,
                           kernel_size=kernel_size,
                           padding='same',
                           kernel_initializer='ones',
                           bias_initializer='zeros')

@tf.function
def test_masked_fn(x):
    _ = masked_conv(x)
    
masked_time = measure_time(test_masked_fn)
# ----------------------------------------------------------------

kernel_size = (1, 2)
cropped_conv = HorizontalConv2D(filters=1,
                                kernel_size=kernel_size,
                                strides=1,
                                kernel_initializer='ones',
                                bias_initializer='zeros')

@tf.function
def test_cropped_fn(x):
    _ = cropped_conv(x)

cropped_time = measure_time(test_cropped_fn)
# ----------------------------------------------------------------

print("Horizontal stack - convolution layer with mask type 'B'")
print(f"Masked convolution:         {masked_time[0]:.8f} +- {masked_time[1]:.8f} seconds")
print(f"Cropped padded convolution: {cropped_time[0]:.8f} +- {cropped_time[1]:.8f} seconds")

Horizontal stack - convolution layer with mask type 'B'
Masked convolution:         0.01353339 +- 0.00374499 seconds
Cropped padded convolution: 0.01384839 +- 0.00734248 seconds


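Although the cropped version looks slightly faster for the vertical convolution (about 0.0139 s versus 0.0141 s on average), the gap is far smaller than the standard deviation of either measurement, so the difference does not look significant. For both horizontal convolutions the masked version is marginally faster, again well within the noise.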
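
One rough way to put a number on "not significant" is a Welch-style t statistic computed from the printed means and standard deviations. The sketch below copies those figures from the outputs above and assumes 100 independent timings per configuration, as in `measure_time`. It is approximate, since `np.std` returns the population standard deviation and timings are rarely normally distributed.

```python
import math

N_ITER = 100  # number of timed iterations in measure_time

# ((masked mean, masked std), (cropped mean, cropped std)), copied from the printed results
results = {
    'vertical': ((0.01410292, 0.00891058), (0.01386628, 0.00675169)),
    'horizontal A': ((0.01360846, 0.00381987), (0.01365352, 0.00476047)),
    'horizontal B': ((0.01353339, 0.00374499), (0.01384839, 0.00734248)),
}


def welch_t(mean_a, std_a, mean_b, std_b, n):
    """Welch-style t statistic for two samples of equal size n."""
    standard_error = math.sqrt(std_a ** 2 / n + std_b ** 2 / n)
    return (mean_a - mean_b) / standard_error


t_values = {
    name: welch_t(*masked, *cropped, N_ITER)
    for name, (masked, cropped) in results.items()
}

for name, t in t_values.items():
    print(f"{name}: t = {t:+.2f}")
```

All three statistics stay well below 1 in absolute value, far from the value of about 2 usually taken as a threshold for a difference worth reporting.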

# REFERENCES

https://wiki.math.uwaterloo.ca/statwiki/index.php?title=STAT946F17/Conditional_Image_Generation_with_PixelCNN_Decoders#Gated_PixelCNN

https://www.slideshare.net/suga93/conditional-image-generation-with-pixelcnn-decoders

https://www.youtube.com/watch?v=1BURwCCYNEI