# Pixel Recurrent Neural Networks


Link: https://arxiv.org/abs/1601.06759

Authors: Aäron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu

Institution: Google DeepMind

Publication: arXiv

Date: 2016

## Background Materials

- PixelRNN review from Magenta https://github.com/tensorflow/magenta/blob/master/magenta/reviews/pixelrnn.md
- Deep Residual Learning for Image Recognition https://arxiv.org/abs/1512.03385

## What is this paper about?

Two-dimensional, pixel-by-pixel image generation with deep neural network architectures based on RNN and CNN variants.

## What is the motivation of this research?


- building generative models that are complex and expressive yet also tractable and scalable is hard
  - intractable approach: stochastic latent variable models such as the VAE, which often come with an intractable inference step
  - tractable approach: autoregressive models such as NADE
    - but these are not expressive enough to model highly nonlinear and long-range correlations
- Theis & Bethge (2015) showed very promising results with two-dimensional RNNs on grayscale images and textures


## What makes this paper different from previous research?

- architectural novelties: fast two-dimensional recurrent layers and effective use of residual connections
- achieved log-likelihood scores considerably better than the previous state of the art

## How does this paper achieve it?


### Model


#### Generating an Image Pixel by Pixel

The joint distribution $p(x)$ of an image $x$ formed of $n \times n$ pixels is written as a product of conditional distributions over the pixels, taken row by row:

$p(x) = \prod_{i=1}^{n^2}p(x_i\lvert x_1,...,x_{i-1})$

Each pixel $x_i$ is in turn jointly determined by three values, one for each of the color channels R, G and B.

$p(x_i\lvert x_1,...,x_{i-1}) = p(x_i\lvert \boldsymbol{x}_{<i}) = p(x_{i,R}\lvert \boldsymbol{x}_{<i})\,p(x_{i,G} \lvert \boldsymbol{x}_{<i}, x_{i,R})\,p(x_{i,B} \lvert \boldsymbol{x}_{<i}, x_{i,R}, x_{i,G})$

Each of the colors is thus conditioned on the other channels as well as on all the previously generated pixels.
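The factorization above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: `conditional` is a hypothetical callable standing in for the network, and `uniform_conditional` is a placeholder model that ignores its context.

```python
import numpy as np

def sample_image(conditional, n, rng):
    """Sample an n x n RGB image pixel by pixel in raster-scan order.

    `conditional(context, row, col, channel)` is a hypothetical callable that
    returns a length-256 probability vector for one channel value.
    """
    image = np.zeros((n, n, 3), dtype=np.int64)
    for row in range(n):
        for col in range(n):
            for channel in range(3):  # R, then G, then B
                probs = conditional(image, row, col, channel)
                image[row, col, channel] = rng.choice(256, p=probs)
    return image

def log_likelihood(conditional, image):
    """Sum the log conditionals of the chain rule, in the same order."""
    n = image.shape[0]
    context = np.zeros_like(image)  # holds only the values seen so far
    total = 0.0
    for row in range(n):
        for col in range(n):
            for channel in range(3):
                probs = conditional(context, row, col, channel)
                value = image[row, col, channel]
                total += np.log(probs[value])
                context[row, col, channel] = value
    return total

def uniform_conditional(context, row, col, channel):
    # Placeholder model (assumption): ignores the context entirely.
    return np.full(256, 1.0 / 256)

rng = np.random.default_rng(0)
sample = sample_image(uniform_conditional, n=4, rng=rng)
ll = log_likelihood(uniform_conditional, sample)
```

With the uniform placeholder, every one of the $4 \times 4 \times 3$ conditionals contributes $\log(1/256)$, so the total depends only on the number of channel values.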

#### Pixels as Discrete Variables

$p(x)$ is modeled as a discrete distribution. Each channel variable $x_{i,*}$ takes one of 256 distinct values (where $*$ is R, G or B).

The discrete distribution has the advantage of being arbitrarily multimodal without a prior on its shape, as shown by the softmax activations below.

<img src="img/Pixel_Recurrent_Neural_Networks_Figure6.png" width=300>
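A 256-way softmax and its negative log-likelihood can be sketched as below. The function names and the bits-per-dimension helper are illustrative assumptions, not taken from the paper's code; the bimodal logits only show that nothing forces the distribution to have a single peak.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def bits_per_dim(logits, targets):
    """Average negative log2-likelihood.

    logits: (N, 256) array, targets: (N,) integer values in [0, 255].
    """
    probs = softmax(logits)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log2(picked).mean())

# Uniform logits: every value has probability 1/256.
uniform_logits = np.zeros((5, 256))
targets = np.array([0, 17, 128, 200, 255])
uniform_bits = bits_per_dim(uniform_logits, targets)

# A bimodal distribution: two separated intensity values share most of the mass.
bimodal_logits = np.zeros(256)
bimodal_logits[[10, 240]] = 10.0
bimodal_probs = softmax(bimodal_logits)
```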


### Pixel Recurrent Neural Networks


#### Row LSTM

The Row LSTM processes the image row by row from top to bottom.

The input-to-state component is first computed for the entire two-dimensional input map using a one-dimensional convolution of size $k \times 1$.

The convolution is masked to include only the valid context, as in the figure below (kernel size = 3).

<img src="img/Pixel_Recurrent_Neural_Networks_v2_Figure2c.png" width="200">

For the state-to-state component of the LSTM layer, the new hidden state $h_i$ and cell state $c_i$ are obtained from the previous states $h_{i-1}$ and $c_{i-1}$ as follows:

$c_i = f_i \odot c_{i-1} + i_i \odot g_i$ (from LSTM definition)

$h_i = o_i \odot \tanh(c_i)$ (from LSTM definition)

$[o_i, f_i, i_i, g_i] = \sigma(K^{ss} \circledast h_{i-1} + K^{is} \circledast x_i)$ (modified from LSTM definition)

where $g_i$ is the content gate (the new input), $x_i$ is row $i$ of the input map, of size $h \times n \times 1$, $\circledast$ represents the convolution operation and $\odot$ the element-wise multiplication. $K^{ss}$ and $K^{is}$ are the kernel weights of the state-to-state and input-to-state components. The activation $\sigma$ is the logistic sigmoid for the gates $o_i$, $f_i$ and $i_i$, and $\tanh$ for the content gate $g_i$.

Each step computes the new state for an entire row of the input map.
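One step of these equations can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the input-to-state convolution is computed per row rather than precomputed for the whole map, its mask is omitted, and the shapes and helper names (`conv1d_same`, `row_lstm_step`) are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution along a row with zero padding.

    x: (c_in, n), kernel: (c_out, c_in, k) with odd k. Returns (c_out, n).
    """
    c_out, c_in, k = kernel.shape
    pad = k // 2
    n = x.shape[1]
    padded = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, n))
    for offset in range(k):
        out += kernel[:, :, offset] @ padded[:, offset:offset + n]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def row_lstm_step(x_row, h_prev, c_prev, k_is, k_ss):
    """Compute the new states for one entire row (masking omitted)."""
    gates = conv1d_same(h_prev, k_ss) + conv1d_same(x_row, k_is)
    o, f, i, g = np.split(gates, 4, axis=0)
    o, f, i = sigmoid(o), sigmoid(f), sigmoid(i)  # gates use the sigmoid
    g = np.tanh(g)                                # content gate uses tanh
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

hidden, c_in, n, k = 3, 2, 5, 3
rng = np.random.default_rng(0)
k_is = rng.normal(size=(4 * hidden, c_in, k))
k_ss = rng.normal(size=(4 * hidden, hidden, k))
x_row = rng.normal(size=(c_in, n))
h0 = np.zeros((hidden, n))
c0 = np.zeros((hidden, n))
h1, c1 = row_lstm_step(x_row, h0, c0, k_is, k_ss)
```

Because both convolutions act on a whole row at once, all $n$ positions of the row are updated in a single step.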


#### Diagonal BiLSTM

The Diagonal BiLSTM is able to capture the entire available context. Each of the two directions of the layer scans the image in a diagonal fashion, from one corner at the top to the opposite corner at the bottom.

<img src="img/Pixel_Recurrent_Neural_Networks_v2_Figure2r.png" width="200">

To make it easy to apply convolutions along diagonals, the input map is first skewed as shown below.

<img src="img/Pixel_Recurrent_Neural_Networks_Figure3.png" width="300">
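The skewing operation offsets each row by one position relative to the previous row, turning an $n \times n$ map into an $n \times (2n-1)$ map. A minimal NumPy sketch for a single square feature map (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def skew(x):
    """Shift row r to the right by r positions: (n, n) -> (n, 2n - 1)."""
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for r in range(n):
        out[r, r:r + n] = x[r]
    return out

def unskew(y):
    """Inverse of skew: (n, 2n - 1) -> (n, n)."""
    n = y.shape[0]
    return np.stack([y[r, r:r + n] for r in range(n)])

x = np.arange(9).reshape(3, 3)
skewed = skew(x)
restored = unskew(skewed)
```

After skewing, pixel $(r, c)$ sits in column $r + c$, so each column of the skewed map holds one diagonal of the original map and a column-wise convolution acts along diagonals.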

For each of the two directions, the input-to-state component is simply a $1\times1$ convolution $K^{is}$.

The state-to-state recurrent component is then computed with a column-wise convolution $K^{ss}$ that has a kernel of size $2 \times 1$.

The output is calculated with the same equations as for the Row LSTM.

The output feature map is skewed back into an $n \times n$ map. The right output map is then shifted down by one row, to prevent the layer from seeing future pixels, and added to the left output map.

The Diagonal BiLSTM has the advantage that it uses a minimal $2 \times 1$ convolutional kernel. A larger kernel is not particularly helpful because the Diagonal BiLSTM already has a global receptive field.


#### Residual Connections

PixelRNNs of up to 12 layers were trained.

To increase convergence speed and propagate signals more directly through the network, residual connections (He et al., 2015) were used.
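A residual connection around a layer can be sketched as below. The sketch assumes an input map with $2h$ features, an inner layer that outputs $h$ features, and a $1 \times 1$ convolution that restores $2h$ features before the input is added back. `toy_inner` is a hypothetical stand-in for the recurrent layer.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (c_in, rows, cols), weight is (c_out, c_in)."""
    return np.einsum("oc,crk->ork", weight, x)

def residual_block(x, inner, w_up):
    """Return x + conv1x1(inner(x)); w_up maps h features back to 2h."""
    return x + conv1x1(inner(x), w_up)

def toy_inner(x):
    # Hypothetical stand-in for the recurrent layer: 2h features in, h out.
    h = x.shape[0] // 2
    return np.tanh(x[:h])

h = 2
rng = np.random.default_rng(0)
x = rng.normal(size=(2 * h, 3, 3))
w_up = rng.normal(size=(2 * h, h))
y = residual_block(x, toy_inner, w_up)

# With a zero projection the block reduces to the identity path.
y_identity = residual_block(x, toy_inner, np.zeros((2 * h, h)))
```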


#### Masked Convolution

Two types of masks are used for PixelRNN.

<img src="img/Pixel_Recurrent_Neural_Networks_Figure2r.png" width="200">

Mask A is applied only to the first convolutional layer and restricts the connections to the neighboring pixels and to the colors of the current pixel that have already been predicted.

Mask B is applied to all subsequent layers and relaxes the restriction of mask A by also allowing the connection from a color to itself.
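The two masks can be built as in the sketch below. It assumes that the features of each layer are split into three contiguous groups for R, G and B; implementations differ on this convention, and `make_mask` and `color_group` are illustrative names.

```python
import numpy as np

def color_group(index, num_features):
    """Map a feature index to 0 (R), 1 (G) or 2 (B).

    Assumption: features form three contiguous, equally sized groups.
    """
    return index * 3 // num_features

def make_mask(kernel_size, c_in, c_out, mask_type):
    """Return a (c_out, c_in, k, k) mask of zeros and ones."""
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    center = kernel_size // 2
    mask = np.zeros((c_out, c_in, kernel_size, kernel_size))
    mask[:, :, :center, :] = 1        # all rows above the current pixel
    mask[:, :, center, :center] = 1   # pixels to the left in the current row
    for out_idx in range(c_out):
        for in_idx in range(c_in):
            out_color = color_group(out_idx, c_out)
            in_color = color_group(in_idx, c_in)
            # Mask A: only colors already predicted. Mask B: also the color itself.
            if in_color < out_color or (mask_type == "B" and in_color == out_color):
                mask[out_idx, in_idx, center, center] = 1
    return mask

mask_a = make_mask(3, c_in=3, c_out=3, mask_type="A")
mask_b = make_mask(3, c_in=3, c_out=3, mask_type="B")
```

The mask is multiplied element-wise with the kernel weights before each convolution, so masked connections contribute nothing to the output.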


#### PixelCNN

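The Row and Diagonal LSTM layers have a potentially unbounded dependency range within their receptive field, but this comes at a computational cost because each state must be computed sequentially.

The PixelCNN instead uses standard convolutional layers, which capture a bounded receptive field.

It stacks multiple convolutional layers that preserve the spatial resolution; pooling layers are not used.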
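The bounded, causal receptive field of a stack of masked convolutions can be checked numerically. The sketch below is a simplified single-channel stand-in, not the paper's architecture: nonlinearities are omitted (they do not change which inputs an output can depend on), the kernels are all ones, and the function names are hypothetical.

```python
import numpy as np

def causal_mask(k, include_center):
    """Single-channel mask: rows above, pixels to the left, optionally the center."""
    mask = np.zeros((k, k))
    center = k // 2
    mask[:center, :] = 1
    mask[center, :center] = 1
    if include_center:
        mask[center, center] = 1
    return mask

def conv2d_same(x, kernel):
    """2-D convolution with zero padding that preserves the spatial size."""
    k = kernel.shape[0]
    pad = k // 2
    rows, cols = x.shape
    padded = np.pad(x, pad)
    out = np.zeros((rows, cols))
    for dr in range(k):
        for dc in range(k):
            out += kernel[dr, dc] * padded[dr:dr + rows, dc:dc + cols]
    return out

def masked_stack(x, kernels):
    """First layer excludes the center pixel; later layers include it."""
    out = x.astype(float)
    for layer, kernel in enumerate(kernels):
        mask = causal_mask(kernel.shape[0], include_center=(layer > 0))
        out = conv2d_same(out, kernel * mask)
    return out

kernels = [np.ones((3, 3)), np.ones((3, 3))]
x = np.arange(49.0).reshape(7, 7)
x_perturbed = x.copy()
x_perturbed[3, 3] += 1.0  # change a single input pixel

changed = masked_stack(x, kernels) != masked_stack(x_perturbed, kernels)
```

Here `changed` marks the outputs that depend on input pixel $(3, 3)$: none at or before that pixel in raster-scan order, and none more than two rows or columns away for two $3 \times 3$ layers.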


#### Multi-Scale PixelRNN

The Multi-Scale PixelRNN is composed of an unconditional PixelRNN and one or more conditional PixelRNNs.

The unconditional network first generates a smaller $s \times s$ image that is subsampled from the original image.
The conditional network then takes the $s \times s$ image as an additional input and generates a larger $n \times n$ image.

<img src="img/Pixel_Recurrent_Neural_Networks_Figure2c.png" width="100" >

The conditional network is biased with an upsampled version of the small $s \times s$ image.
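The subsample, upsample and bias steps can be sketched as below. This is an illustration under simplifying assumptions: strided subsampling and nearest-neighbor upsampling are swapped in for whatever resampling the networks actually use, and the scalar `weight` stands in for a learned convolution. None of these names come from the paper's code.

```python
import numpy as np

def subsample(image, s):
    """Keep every (n // s)-th pixel to get an s x s image (assumes s divides n)."""
    stride = image.shape[0] // s
    return image[::stride, ::stride]

def upsample_nearest(small, n):
    """Repeat each pixel to get an n x n map (assumes s divides n)."""
    factor = n // small.shape[0]
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)

def biased_input_to_state(input_to_state, small, weight):
    """Add the upsampled small image, scaled by `weight`, as a bias."""
    n = input_to_state.shape[0]
    return input_to_state + weight * upsample_nearest(small, n)

image = np.arange(16.0).reshape(4, 4)
small = subsample(image, 2)
upsampled = upsample_nearest(small, 4)
biased = biased_input_to_state(np.zeros((4, 4)), small, weight=0.5)
```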


## Datasets used in this study

- MNIST
- CIFAR-10

## Implementations

- https://github.com/carpedm20/pixel-rnn-tensorflow
- https://github.com/openai/pixel-cnn (PixelCNN++)
- https://github.com/PrajitR/fast-pixel-cnn (PixelCNN++)


## Further Readings

- Conditional Image Generation with PixelCNN Decoders https://arxiv.org/abs/1606.05328
- PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications https://arxiv.org/abs/1701.05517