<a name='0'></a>

# Deep Autoregressive Generative Models

Deep autoregressive generative models are a kind of generative model that generates data one element at a time: one pixel (or one token, in language generation) after another.

***Outline:***

* [1. Introduction to autoregressive models](#1)
* [2. Deep autoregressive architectures and applications](#2)
  * [2.1 Pixel Recurrent Neural Networks - PixelRNN](#2-1)
  * [2.2 Pixel Convolutional Neural Networks - PixelCNN](#2-2)
  * [2.3 Image Transformer](#2-3)
  * [2.4 Image GPT - Generative Pretraining from Pixels](#2-4)
* [3. Advantages and disadvantages of autoregressive models](#3)
* [4. Final notes](#4)
* [5. Further Learning](#5)

<a name='1'></a>

## 1. Introduction to Autoregressive Models

Deep autoregressive generative models are neural network architectures that produce each output conditioned on the previously generated outputs. In the language of probability, autoregressive models learn the joint probability distribution of the data by factorizing it into a product of conditional distributions.

Autoregressive models are sequence models by design. Recurrent neural networks (RNNs) are a classic example. Virtually any other modern deep learning architecture, such as convolutional neural networks and transformers, can also be wired as an autoregressive model, as we will see.

Autoregressive generative models can be represented with the chain rule of probability:

$$ p(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2)\,p(x_4 \mid x_1, x_2, x_3) $$

When we say a model is autoregressive, we are merely saying that the current variable is conditioned on the previous variables. To estimate the probability of $x_2$, one has to look at $x_1$; to generate $x_3$, one has to look at $(x_1, x_2)$; and so on.
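To make the factorization concrete, here is a minimal sketch with four binary variables. The conditional `conditional_p1` is a made-up toy rule standing in for a trained network (an assumption for illustration only); the point is that the joint probability is a product of conditionals, and that sampling proceeds one variable at a time.

```python
import itertools
import random

def conditional_p1(prefix):
    """Hypothetical toy conditional p(x_i = 1 | x_1, ..., x_{i-1}) for binary x."""
    # Illustrative rule only: the more ones seen so far, the likelier another one.
    return (1 + sum(prefix)) / (2 + len(prefix))

def joint_probability(x):
    """Chain rule: multiply p(x_i | x_1, ..., x_{i-1}) over all positions."""
    prob = 1.0
    for i, value in enumerate(x):
        p1 = conditional_p1(x[:i])
        prob *= p1 if value == 1 else 1.0 - p1
    return prob

def sample(length, rng):
    """Draw x_1 first, then x_2 given x_1, and so on."""
    x = []
    for _ in range(length):
        x.append(1 if rng.random() < conditional_p1(x) else 0)
    return x

# The joint probabilities of all 16 configurations should sum to 1
# (up to floating-point error), because each conditional is normalized.
total = sum(joint_probability(s) for s in itertools.product([0, 1], repeat=4))
print(total)
print(sample(4, random.Random(0)))
```

Because every conditional is a valid distribution, the product is automatically a valid joint distribution. This is what makes the likelihood of autoregressive models easy to evaluate.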


Deep autoregressive models are used in many applications such as image generation, audio generation, and other generative tasks. In the next section, we will review some popular deep autoregressive architectures.

<a name='2'></a>

## 2. Deep Autoregressive Architectures and Applications

Now that we have a rough understanding of autoregressive generative models, let's review some popular autoregressive generative architectures.

<a name='2-1'></a>

### 2.1 Pixel Recurrent Neural Networks - PixelRNN

Generative image modelling revolves around learning the distribution of images and sampling new images from that distribution. One of the reasons this task used to be difficult is that images are high-dimensional and unstructured. To learn the distribution of images effectively, you need a model that is expressive, tractable, and scalable enough to work on natural images.

[PixelRNN](https://arxiv.org/abs/1601.06759) is a deep neural network architecture that generates images by predicting each individual pixel autoregressively. PixelRNN is one of the earliest generative models that gave hope that people might one day be able to generate photorealistic images.

![image](https://drive.google.com/uc?export=view&id=1z6iQChiV3PHNL1GEpbghQIq1paac2tON)

The PixelRNN architecture is mostly made of recurrent layers: it stacks up to 12 two-dimensional long short-term memory (LSTM) layers. PixelRNN also uses convolutions to compute all the states along one spatial dimension of the image in parallel. The authors proposed two kinds of LSTM layers: Row LSTM, which is applied along each row of the image, and Diagonal BiLSTM, which is applied along the diagonals of the image. The authors used LSTMs in PixelRNN because of their ability to handle long-range dependencies. To help convergence and the smooth propagation of gradients, they also incorporated residual connections.

![image](https://drive.google.com/uc?export=view&id=1OK2bUyIzNCa8vIxH4gidZpuA4p-lULpH)

As said above, Row LSTM is a unidirectional LSTM layer that processes the image row by row, from top to bottom. Each whole row is computed at once with an additional 1D convolution layer. In Row LSTM, the pixel currently being predicted depends on the previous pixels in the same row and in the rows above. Diagonal BiLSTM is a bidirectional LSTM layer that predicts pixels along the diagonals of the image, starting from a top corner and ending at the opposite bottom corner. As in Row LSTM, the states for all the previous pixels that the current pixel depends on (those to its left and above it) are computed at once. Experiments in the paper showed that Diagonal BiLSTM achieves better results than Row LSTM since it captures the global context of the image.


The image below shows samples from PixelRNN trained on the Cifar-10 and ImageNet datasets at 32x32 resolution. If you don't look at the images carefully, you can be tricked into thinking they are real, but on closer inspection they clearly are not. The model is merely predicting each pixel based on the previous pixels until it builds up the whole image, and that's it. It has no context or knowledge of what is being generated.

But this was the state of the art in image generation in 2016 :-).

![image](https://drive.google.com/uc?export=view&id=1Z_kn-p7u-UyDj3-ebH0ztfMGvLDHQcXW)





<a name='2-2'></a>

### 2.2 Pixel Convolutional Neural Networks - PixelCNN

PixelCNN was introduced alongside [Pixel Recurrent Neural Network (PixelRNN)](https://arxiv.org/abs/1601.06759) in the same paper. The authors proposed PixelCNN to overcome the slowness of PixelRNN caused by the sequential computation of LSTM states. PixelCNN uses standard convolution layers (and residual connections) to process all pixel positions at once. The pixel generation process is still sequential, because each generated pixel is fed back in to generate the next one, but training is significantly faster because the predictions for all positions can be computed in parallel from the known training image.

As is the case for all autoregressive networks, the pixel currently being generated in PixelCNN must be conditioned only on previously generated pixels. To ensure each pixel sees only the previous pixels (and not future pixels), PixelCNN uses masked convolutions (shown in the image below).

![image](https://drive.google.com/uc?export=view&id=1HJpLjTIpaCa_IQIlB0Dep2iXGZQrm_nN)
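A masked convolution can be sketched as an ordinary convolution whose kernel is multiplied by a binary mask. The snippet below builds such a mask for a single-channel image in raster-scan order. The function name `causal_mask` and the `include_center` flag are illustrative choices, not taken from any library. The PixelRNN paper distinguishes a mask "A" that hides the current pixel (used in the first layer) from a mask "B" that keeps it (used in later layers); the paper's masks also handle the ordering of color channels, which this sketch leaves out.

```python
import numpy as np

def causal_mask(kernel_size, include_center):
    """Binary mask for a square kernel under raster-scan (row-major) ordering.

    1 keeps a weight, 0 zeroes it out so the output cannot see that position.
    """
    mask = np.zeros((kernel_size, kernel_size), dtype=np.float32)
    center = kernel_size // 2
    mask[:center, :] = 1.0          # every row above the current pixel
    mask[center, :center] = 1.0     # pixels to the left in the current row
    if include_center:
        mask[center, center] = 1.0  # the current position itself
    return mask

mask_a = causal_mask(3, include_center=False)  # first layer: hide current pixel
mask_b = causal_mask(3, include_center=True)   # later layers: keep current position

# Applying the mask: multiply the kernel weights elementwise before convolving.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 3)).astype(np.float32)
masked_weights = weights * mask_a

print(mask_a)
print(mask_b)
```

In a training framework the same multiplication would be applied to the convolution weights on every forward pass, so the masked-out weights never contribute to the output no matter how they are updated.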

PixelCNN was only briefly discussed in the PixelRNN paper, but subsequent works explain it further. Those works are [Conditional Image Generation with PixelCNN Decoders](https://arxiv.org/abs/1606.05328) and [PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications](https://arxiv.org/abs/1701.05517). [Brian Keng](https://bjlkeng.github.io/) also wrote a great [article](https://bjlkeng.github.io/posts/pixelcnn/) about PixelCNN that explains things we didn't cover here and provides an implementation as well.

<a name='2-3'></a>

### 2.3 Image Transformer

PixelRNN and PixelCNN showed that image generation can be treated as a sequence modelling problem: we generate one pixel at a time, and each pixel is conditioned on the previously generated pixels. Both gave good generation results in their time, but they had drawbacks. PixelRNN was extremely hard to train because of the sequential operations of LSTMs and because images are high-dimensional. Unlike the comparatively low-dimensional text that LSTMs are usually applied to, images contain many values, which makes generating high-resolution images computationally expensive. It is no wonder that PixelRNN was tested on small images such as 32x32 Cifar-10. PixelCNN is easier to train since its computation can be parallelized, but it takes many layers to grow the receptive field enough to generate reasonable images (many layers means many parameters, which means more compute and a harder training problem).

Image Transformer (directly inspired by the [language transformer](https://arxiv.org/abs/1706.03762)) was introduced to overcome this drawback of PixelCNN: it grows the receptive field without requiring many layers.
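The receptive-field argument can be checked with a little arithmetic. The sketch below assumes plain stride-1 convolutions without dilation, so each layer extends the reach by `kernel_size // 2` pixels in any one direction. The helper names are hypothetical, and real PixelCNN variants differ in their details.

```python
import math

def receptive_field_width(num_layers, kernel_size):
    """Width of the receptive field of stacked stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_reach(distance, kernel_size):
    """Layers needed before a pixel `distance` steps away can influence the output."""
    reach_per_layer = kernel_size // 2
    return math.ceil(distance / reach_per_layer)

image_size = 32
farthest = image_size - 1  # e.g. from the bottom row up to the top row

for kernel_size in (3, 5, 7):
    print(kernel_size, layers_to_reach(farthest, kernel_size))
```

Under these assumptions, a stack of 3x3 convolutions needs 31 layers before the bottom row of a 32x32 image can depend on the top row, whereas a single self-attention layer connects every position to every earlier position directly.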

[Image Transformer](https://arxiv.org/abs/1802.05751) is based purely on self-attention, the principal layer in the Transformer. As you might guess, it is also an autoregressive generative model, just like PixelRNN and PixelCNN: the current pixel to be generated is conditioned on all previously generated pixels. Since images naturally have 3 color channels (red, green, blue), Image Transformer treats the intensity of each color channel as a discrete value and generates the channels of a pixel one after another: the red channel first, then green conditioned on red, then blue conditioned on red and green.


![image](https://drive.google.com/uc?export=view&id=17RCMkBX3KVioSflNxcJFpw61njwDzWoT)
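The way self-attention stays autoregressive can be sketched with a causal mask: attention scores to future positions are set to minus infinity before the softmax, so they receive zero weight. This is a simplified single-head version with random weights and full (global) attention, written for illustration only; it is not the paper's implementation, which restricts attention to local neighborhoods to keep the cost manageable. In practice the inputs are also shifted by one position so that the output at position i is used to predict element i + 1.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention where position i attends only to positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])

    # Boolean mask that is True strictly above the diagonal (future positions).
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; subtracting the row maximum keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8  # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))  # lower-triangular attention pattern
```

A quick sanity check of causality is to change the last input and confirm that the outputs at all earlier positions stay the same.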

The authors of Image Transformer ran two kinds of experiments. The first was unconditional generation, where the model learns the joint distribution of the images and samples from it. The second was class-conditional generation, where class labels are incorporated into the distribution. Class-conditional generation had been done in prior works such as [PixelCNN++](https://arxiv.org/abs/1701.05517), and it is probably what gradually inspired current text-image generative models. More on text-image generation later.


<a name='2-4'></a>

### 2.4 Image GPT - Generative Pretraining from Pixels

Another recent autoregressive generative model is [Image GPT](https://openai.com/blog/image-gpt/), introduced in the paper [Generative Pretraining from Pixels](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf). Image GPT is an autoregressive transformer that generates images by predicting one pixel after another, each pixel being conditioned on the pixels generated previously. Architecturally, Image GPT is just like a standard language transformer, except that it takes a 1D sequence of pixels (an unrolled 2D image) instead of text tokens.
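Unrolling an image into a sequence can be sketched in a few lines. The example below uses a tiny 4x4 array of stand-in values and row-major (raster) order. Both the size and the values are assumptions for illustration, and the preprocessing used in the paper is not reproduced here.

```python
import numpy as np

height, width = 4, 4  # hypothetical tiny image
image = np.arange(height * width).reshape(height, width)  # stand-in pixel values

# Unroll the 2D grid into a 1D sequence, row by row.
sequence = image.reshape(-1)

# Next-pixel prediction: the input at step t is pixel t, the target is pixel t + 1.
inputs = sequence[:-1]
targets = sequence[1:]

# A generated sequence is folded back into an image the same way.
restored = sequence.reshape(height, width)

print(sequence)
print(inputs[:5], targets[:5])
```

From the transformer's point of view, the result is the same kind of next-token prediction problem as in language modelling.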

Below are samples of images generated by Image GPT. You can see more samples in the associated [blog post](https://openai.com/blog/image-gpt/).


![image](https://drive.google.com/uc?export=view&id=1Yfkv0PhVbQtWEJZgHoOYoz0Mr2ehxD0w)

The authors of Image GPT also used the representations learned during generative pretraining for classification tasks, but that is beyond the scope of this notebook. Interested readers can learn more in the [paper](https://cdn.openai.com/papers/Generative_Pretraining_from_Pixels_V2.pdf)!



<a name='3'></a>

## 3. Advantages and Disadvantages of Autoregressive Generative Models

Autoregressive models have a number of advantages and have long been used in image generation. Even in 2022, people still get amazing [generation results](https://parti.research.google) with autoregressive models. One of their advantages is that they model the likelihood (the probability distribution of the data) explicitly, so generating new images is simply a matter of sampling from the learned distribution. Also, with advanced architectures (such as the transformer) and modern training techniques, autoregressive models are able to generate photorealistic images. Lastly, autoregressive generative models are not tied to any particular architecture, since virtually any deep learning architecture can be wired as an autoregressive one. Autoregression is an ***idea*** for performing generative tasks rather than a specific network! That means it's easy to plug in whatever architecture is working well at the time (today that architecture would be a transformer :-)).


A well-known disadvantage of autoregressive models is their slowness at test time: you have to generate one pixel at a time, taking the previously generated pixels as input. Using architectures whose computation can be parallelized can substantially speed up the training of autoregressive models, but generation remains sequential. You still have to predict one pixel after another.
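The cost of sequential generation is easy to see in a sampling loop. In the sketch below, `toy_model` is a hypothetical stand-in for a trained network that returns a distribution over the next value given the values so far. What matters is that the loop calls the model once per generated value.

```python
import random

NUM_VALUES = 4  # hypothetical tiny "palette" of pixel values

def toy_model(prefix):
    """Hypothetical stand-in for a trained network: p(next value | prefix)."""
    last = prefix[-1] if prefix else 0
    # Illustrative rule only: repeating the last value is twice as likely.
    weights = [2.0 if v == last else 1.0 for v in range(NUM_VALUES)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_pixels, rng):
    """Sample pixels one by one, counting how often the model is called."""
    pixels, forward_passes = [], 0
    for _ in range(num_pixels):
        probs = toy_model(pixels)  # depends on everything generated so far
        forward_passes += 1
        pixels.append(rng.choices(range(NUM_VALUES), weights=probs)[0])
    return pixels, forward_passes

pixels, passes = generate(32 * 32, random.Random(0))
print(passes)  # one model call per generated value
```

For a 32x32 single-channel image this is 1024 model calls, and the count grows with resolution and with the number of color channels. During training, by contrast, all positions of a known image can be scored in one parallel pass.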

<a name='4'></a>

## 4. Final Notes

In this notebook, we have been learning about deep autoregressive generative models, the kind of generative models that predict one pixel after another until the whole image is built up. Examples of autoregressive image generative models are PixelRNN, PixelCNN, Image Transformer, and Image GPT. We focused on image generation, but the same ideas apply to other modalities such as language and [speech](https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio).

<a name='5'></a>

## 5. Further Learning

* [Lecture 19 - Generative Models I by Justin Johnson](https://www.youtube.com/watch?v=Q3HU2vEhD5Y&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=20&t=2050s)

* [Autoregressive Models in Deep Learning — A Brief Survey](https://www.georgeho.org/deep-autoregressive-models/)

* [Stanford CS236 Notes on Autoregressive Models](https://deepgenerativemodels.github.io/notes/autoregressive/)

* [Lecture 15: Autoregressive and Reversible Models - Roger Grosse](https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/readings/L15%20Autoregressive%20and%20Reversible%20Models.pdf)

* [Tutorial 12: Autoregressive Image Modeling - Phillip Lippe](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial12/Autoregressive_Image_Modeling.html)

[BACK TO TOP](#0)