<a href="https://colab.research.google.com/github/96jonesa/StyleGan2-Colab-Demo/blob/master/output_small_set_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is this?

This is a demo to walk people through the results of training StyleGAN2 on various small datasets and under various configurations. Explanations of the effects of certain critical techniques are provided. Most if not all of these models were trained (at the time of writing this) for only about 6 hours on a single P100 GPU via Colab (typically with 4 models training on separate P100s simultaneously). Typically, models of this nature would be trained up to 100x as long. These will be trained further and this demo will be updated frequently in the near future. However, the results are already decent enough for the purposes of demonstration and comparison.

All of the models were trained using a notebook I wrote to allow for easy training using the free resources provided by Colab and Google Drive, linked here:

https://colab.research.google.com/drive/1prEbP9AgZnxGCXtZkP-pgqRJoHcHJPou?usp=sharing

Using the PyTorch implementation of StyleGAN2 available at:

https://github.com/lucidrains/stylegan2-pytorch

The GitHub repo I made for this project is available at:

https://github.com/96jonesa/StyleGan2-Colab-Demo

The public directory on my Google Drive containing all sample images used in this notebook:

https://drive.google.com/drive/folders/1gpZKmuvOnsuRmCo3MEcpST_WC1Laaz3W?usp=sharing

# How to use:

Log in to Google (Drive).

Image displays are full-size 8x8 grids of 128x128 images (32x32 in the case of cifar10). You will need to scroll through the cell output to see each of the (typically 4) grids displayed in each output. Simply place your cursor somewhere inside the output, then scroll.

You can either run all the cells with Cmd/Ctrl+F9 or 'Runtime > Run all', then scroll through the outputs.

Or you can step through cell-by-cell by running each cell individually with Cmd/Ctrl+Enter (runs current cell) or Shift+Enter (runs current cell and moves focus to next cell).

# Descriptions of datasets (citations at end of notebook):

metfaces: "image dataset of human faces extracted from works of art" -NVlabs

celeba: "large-scale face attributes dataset with more than 200K celebrity images" -Liu et al.

afhq_dog: "5,000 high-quality images [of dogs] at 512×512 resolution" -Choi et al.

cifar10_horse: 5000 32x32 color images of horses -Krizhevsky

# How does StyleGAN2 work (somewhat technically)?

GANs typically have a generator network and a discriminator network. The discriminator network is a glorified classification network: it seeks to determine whether a given image came from the underlying data distribution under consideration (i.e. does it look like real data from the dataset being used for training). The generator network seeks to produce images that the discriminator will classify as having come from the underlying data distribution; however, it does not get to look at the images in the dataset and must improve based only on how the discriminator reacts to the samples it produces. In order to generate diverse images, generator networks are given random noise as input. Typically during training, the discriminator network is fed some real data, then some generated data, then optimized via stochastic gradient descent. The generator network is optimized after getting the results back from the discriminator network, with fancy scheduling of how often each network is optimized (to induce stabilization) depending on the GAN architecture.
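
The alternating updates described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not StyleGAN2 and not the training code used for this demo: the 1-D data distribution, the linear generator, the logistic discriminator, the learning rate, and the function name `train_toy_gan` are all made up for the example, and the gradients are written out by hand so that it runs with only the standard library.

```python
import math
import random

def sigmoid(s):
    # Clamp the input so math.exp cannot overflow.
    s = max(-60.0, min(60.0, s))
    return 1.0 / (1.0 + math.exp(-s))

def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Toy setup (assumed for illustration): real data ~ N(3, 1),
    generator g(z) = a*z + b, discriminator D(x) = sigmoid(w*x + c).
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters
    w, c = 0.0, 0.0   # discriminator parameters
    for _ in range(steps):
        # Discriminator step: one real sample, then one generated sample.
        x_real = rng.gauss(3.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        # Gradients of -log D(x_real) - log(1 - D(x_fake)) w.r.t. w and c.
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: it never sees x_real, only the discriminator's reaction.
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b
        d_fake = sigmoid(w * x_fake + c)
        # Gradients of the non-saturating loss -log D(g(z)) w.r.t. a and b.
        grad_a = (d_fake - 1.0) * w * z
        grad_b = (d_fake - 1.0) * w
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c
```

Calling `train_toy_gan()` returns the final generator parameters `(a, b)` and discriminator parameters `(w, c)`. One update of each network per step is the simplest possible schedule; real architectures vary it.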

In StyleGAN2, the generator network is composed of a mapping network and a synthesis network. The mapping network is a sequence of fully-connected layers which transform the random noise input (latent code) into a vector in an intermediate latent space in which the factors of variation are more linear than in the original latent space (i.e. it maps the latent code into a space better topologically structured for the problem). Between the mapping and synthesis networks, affine transformations are learned to produce styles from the intermediate latent vector. These styles are fed as inputs to layers in modules of the synthesis network corresponding to different resolutions (from 4x4 up to the desired resolution). Each resolution module contains several convolutional style blocks, each with standard deviation modulation and normalization. Per-style-block scaling factors are also learned and applied to the noise before it is input to intermediate layers of the corresponding style block. After a few style blocks, upsampling occurs to bump the network up to the next resolution (4x4 to 8x8 to 16x16 and so on) until the desired resolution is attained.
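
The modulation and normalization applied in each style block can be sketched as follows. This is a rough pure-Python illustration of the weight modulation and demodulation described in the StyleGAN2 paper (Karras et al., cited at the end), not the code from stylegan2-pytorch. The function name, the nested-list weight layout, and the `eps` value are assumptions made for the example.

```python
import math

def modulate_demodulate(weight, style, eps=1e-8):
    """Scale conv weights per input channel by a style, then renormalize per output channel.

    weight[o][i][k]: output channel o, input channel i, flattened kernel position k.
    style[i]: one scale per input channel (produced by the learned affine transform).
    """
    out = []
    for w_o in weight:
        # Modulation: scale each input channel's kernel by its style.
        scaled = [[style[i] * v for v in w_oi] for i, w_oi in enumerate(w_o)]
        # Demodulation: divide by the L2 norm so the output scale stays near 1.
        norm = math.sqrt(sum(v * v for w_oi in scaled for v in w_oi) + eps)
        out.append([[v / norm for v in w_oi] for w_oi in scaled])
    return out
```

After the call, the weights feeding each output channel have (approximately) unit L2 norm regardless of how large the incoming style is, which is what keeps the style from changing the overall magnitude of the activations.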

StyleGAN2 also uses an exponential moving average of the generator (synthesis and mapping) network parameters to compensate for the oscillatory tendencies of GANs, as well as mixed regularities (occasionally using two latent space codes instead of one and simply switching from the intermediate latent vector produced by one to the intermediate latent vector produced by the other at a randomly selected point in the synthesis network, essentially mixing the styles which would be generated by the two). The difference in generated samples due to each of these techniques is demonstrated below.
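
Both techniques are small enough to sketch directly. The snippet below is a simplified illustration, not the library's implementation: parameters are flat lists of floats, each latent vector is treated as an opaque value, and the decay `beta=0.99` and the function names are assumptions chosen for the example.

```python
import random

def ema_update(ema_params, params, beta=0.99):
    """Return the new moving average: beta * old average + (1 - beta) * current value."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, params)]

def mix_styles(w1, w2, num_layers, rng):
    """Use w1 for layers before a random crossover point and w2 from that point on."""
    crossover = rng.randrange(1, num_layers)  # guarantees at least one layer per code
    return [w1 if layer < crossover else w2 for layer in range(num_layers)]

# Example: per-layer latents for a 6-layer synthesis network.
per_layer = mix_styles('w_a', 'w_b', num_layers=6, rng=random.Random(0))
```

The averaged parameters change slowly even when the raw parameters oscillate from step to step. In `per_layer`, the early layers receive one intermediate latent vector and the later layers the other.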

There has also been a very recent development in data augmentation that applies to GANs in general (differentiable augmentation), reducing the amount of data required to obtain high-quality results by (in some cases) orders of magnitude. The effect of applying this technique is demonstrated below.
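
As a rough sketch of the idea, the same kind of augmentation is applied to both real and generated images before they reach the discriminator, and only with some probability per iteration. The example below uses a horizontal flip on images stored as lists of pixel rows. The choice of flip, the function names, and the gating logic are assumptions for illustration and are not taken from the library. A flip only rearranges pixels, so in a real framework gradients pass straight through it back to the generator.

```python
import random

def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def maybe_augment(batch, prob, rng):
    """With probability `prob`, flip every image in the batch; otherwise return it unchanged."""
    if rng.random() < prob:
        return [hflip(img) for img in batch]
    return batch

def discriminator_inputs(real_batch, fake_batch, prob, rng):
    """Augment real and generated batches alike before the discriminator sees them."""
    return maybe_augment(real_batch, prob, rng), maybe_augment(fake_batch, prob, rng)
```

With `prob=0.2`, roughly one batch in five is augmented.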

# Download the sample images:

In [None]:
# Utilities for downloading publicly shared Google Drive files (from my Google Drive).

import requests

def download_file_from_google_drive(file_id, destination):
    """Download a publicly shared Google Drive file to `destination`."""
    URL = 'https://docs.google.com/uc?export=download'

    session = requests.Session()

    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    # Large files return a download warning first; retry with the confirm token.
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, 'wb') as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

In [None]:
import zipfile
from IPython.display import Image, display

file_id = '1uEuLbFWRdAbaB2p3pvQOUUMj7_FJF3bN'
destination = 'demo_samples.zip'
download_file_from_google_drive(file_id, destination)
with zipfile.ZipFile(destination, 'r') as zip_ref:
    zip_ref.extractall('demo_samples')
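
The display cells in the rest of this notebook all follow one file-naming pattern. The hypothetical helper below builds those paths. It is not part of the original notebook, and the reading of the name fields is an assumption inferred from the file names: `aug_00`/`aug_02` for augmentation probability 0.0/0.2, the number for thousands of iterations, and the `-ema`/`-mr` suffixes for EMA and mixed regularities samples.

```python
SAMPLE_DIR = '/content/demo_samples/StyleGan2_small_set_demo_samples'
SUFFIXES = {'standard': '', 'ema': '-ema', 'mr': '-mr'}

def sample_path(dataset, aug_prob, kilo_iters, variant='standard'):
    """Build the path of a sample grid, e.g. cifar10_horse_aug_00_attn_none_97-ema.jpg."""
    aug_tag = f'{round(aug_prob * 10):02d}'  # 0.0 -> '00', 0.2 -> '02'
    return f'{SAMPLE_DIR}/{dataset}_aug_{aug_tag}_attn_none_{kilo_iters}{SUFFIXES[variant]}.jpg'
```

For example, `display(Image(sample_path('metfaces', 0.2, 27, 'ema')))` would show the same grid as the corresponding hard-coded cell.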

# First things first

First, we will look at sample images generated by the pretrained models using no differentiable augmentation and no attention. We expect the results to lack diversity (due to the lack of differentiable augmentation) and be of relatively poor quality (due to the lack of attention). In order, these 8x8 image displays are from training on the following datasets:

1. cifar10_horse  (32x32 resolution, 5000 images, 97000 iterations @ 4.52 it/s on P100 GPU)
2. afhq_dog       (128x128 resolution, 5000 images, 38000 iterations @ 1.78 it/s on P100 GPU)
3. metfaces       (128x128 resolution, 1336 images, 27000 iterations @ 1.27 it/s on P100 GPU)
4. celeba         (128x128 resolution, 202599 images, 36000 iterations @ 1.64 it/s on P100 GPU)
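
As a quick sanity check on the figures in this list, dividing the iteration count by the iteration rate gives the wall-clock training time of each run. The snippet uses only the numbers listed above; the function name is made up for the example.

```python
# (iterations, iterations per second) for each run listed above.
runs = {
    'cifar10_horse': (97000, 4.52),
    'afhq_dog': (38000, 1.78),
    'metfaces': (27000, 1.27),
    'celeba': (36000, 1.64),
}

def training_hours(iterations, its_per_second):
    """Convert an iteration count and rate into hours of training."""
    return iterations / its_per_second / 3600.0

for name, (iterations, rate) in runs.items():
    print(f'{name}: {training_hours(iterations, rate):.1f} h')
```

Each run comes out at roughly 6 hours, consistent with the training budget mentioned at the top of this notebook.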

cifar10_horse, no differentiable augmentation, no attention, standard:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_00_attn_none_97.jpg'))

afhq_dog, no differentiable augmentation, no attention, standard:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_00_attn_none_38.jpg'))

metfaces, no differentiable augmentation, no attention, standard:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_00_attn_none_27.jpg'))

celeba, no differentiable augmentation, no attention, standard:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_00_attn_none_36.jpg'))

# EMA parameters

Notice the poor quality? The model also generates samples from a version of the model which uses an exponential moving average of the generator (synthesis and mapping) network parameters to compensate for the oscillatory tendencies of GANs. This improves stability and quality significantly. The result is shown below for the same models as above:

cifar10_horse, no differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_00_attn_none_97-ema.jpg'))

afhq_dog, no differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_00_attn_none_38-ema.jpg'))

metfaces, no differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_00_attn_none_27-ema.jpg'))

celeba, no differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_00_attn_none_36-ema.jpg'))

# Mixed regularities

The model also generates samples using the EMA generator (synthesis and mapping) parameters together with mixed regularities: two latent space codes are used instead of one, and the synthesis network simply switches from the intermediate latent vector produced by the first to the one produced by the second at a randomly selected point, essentially mixing the styles which would be generated by the two. This decorrelates neighboring styles and thus allows for more fine-tuned diversity. However, I have found that in under-trained models it leads to lower perceptual quality, presumably because the model no longer over-focuses on each distinct style:

cifar10_horse, no differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_00_attn_none_97-mr.jpg'))

afhq_dog, no differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_00_attn_none_38-mr.jpg'))

metfaces, no differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_00_attn_none_27-mr.jpg'))

celeba, no differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_00_attn_none_36-mr.jpg'))

# Early stopping

By now you have likely noticed the extremely low quality of the afhq_dog model. This model has drifted too far from proper convergence due to repeated failures to convince the discriminator with samples that were moving in the right direction, so it has given in to pressures away from the correct direction and is now repeatedly outputting garbage. In a soon-to-come update to this demo, the use of lower learning rates will be explored as a means of avoiding this issue. Here are the EMA generator (synthesis and mapping) parameters and mixed regularities outputs from an earlier training checkpoint of the same model (22000 iterations instead of 38000 iterations):

afhq_dog early stop, no differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_00_attn_none_22-ema.jpg'))

afhq_dog early stop, no differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_00_attn_none_22-mr.jpg'))

# Differentiable augmentation

By now you have likely noticed the extreme lack of diversity in the afhq_dog and metfaces models, as well as a more subtle lack of diversity in the cifar10_horse and celeba models. This is due to the small amount of data available in the former two datasets, and a reasonable yet still lacking amount of data in the latter two datasets. A recent innovation in data augmentation which uses differentiable augmentations of the data has led to models obtaining high-quality results with up to 70x less data. The following show the results of training the above models while applying this augmentation to the discriminator input with probability 0.2 at each iteration (the results from the EMA generator (synthesis and mapping) parameters are shown, without mixed regularities). In order, these were trained on:

1. cifar10_horse  (32x32 resolution, 5000 images, 97000 iterations @ 4.51 it/s on P100 GPU)
2. afhq_dog       (128x128 resolution, 5000 images, 38000 iterations @ 1.56 it/s on P100 GPU)
3. metfaces       (128x128 resolution, 1336 images, 27000 iterations @ 1.05 it/s on P100 GPU)
4. celeba         (128x128 resolution, 202599 images, 36000 iterations @ 1.64 it/s on P100 GPU)

cifar10_horse, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_02_attn_none_97-ema.jpg'))

afhq_dog, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_02_attn_none_38-ema.jpg'))

metfaces, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_02_attn_none_27-ema.jpg'))

celeba, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_02_attn_none_36-ema.jpg'))

# Cool, but let's see that with mixed regularities

The same augmented data models produced the following sample images using mixed regularities:

cifar10_horse, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_02_attn_none_97-mr.jpg'))

afhq_dog, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_02_attn_none_38-mr.jpg'))

metfaces, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_02_attn_none_27-mr.jpg'))

celeba, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_02_attn_none_36-mr.jpg'))

# Longer training

The following show the results of training the above models while applying differentiable augmentation to the discriminator input with probability 0.2 at each iteration (same as above), but this time trained over three times as long. The EMA results are shown. In order, these were trained on:

1. cifar10_horse (32x32 resolution, 5000 images, 314000 iterations @ 4.51 it/s on P100 GPU)

2. afhq_dog (128x128 resolution, 5000 images, 119000 iterations @ 1.56 it/s on P100 GPU)

3. metfaces (128x128 resolution, 1336 images, 88000 iterations @ 1.05 it/s on P100 GPU)

4. celeba (128x128 resolution, 202599 images, 120000 iterations @ 1.64 it/s on P100 GPU)

cifar10_horse, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_02_attn_none_314-ema.jpg'))

afhq_dog, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_02_attn_none_119-ema.jpg'))

metfaces, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_02_attn_none_88-ema.jpg'))

celeba, 0.2 differentiable augmentation, no attention, EMA parameters:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_02_attn_none_120-ema.jpg'))

# And the mixed regularities results...

cifar10_horse, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/cifar10_horse_aug_02_attn_none_314-mr.jpg'))

afhq_dog, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/afhq_dog_aug_02_attn_none_119-mr.jpg'))

metfaces, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/metfaces_aug_02_attn_none_88-mr.jpg'))

celeba, 0.2 differentiable augmentation, no attention, mixed regularities:

In [None]:
display(Image('/content/demo_samples/StyleGan2_small_set_demo_samples/celeba_aug_02_attn_none_120-mr.jpg'))

# Coming soon:

Results from training using attention (absent from StyleGAN2) on every layer, with and without augmentation.

Results from using various lower learning rates to improve model stability and diversity of generated results.

Results from training with various other differentiable augmentation probabilities (0.1 and 0.3).

Results from training with contrastive loss (absent from StyleGAN2).

Results from training for more iterations under these configurations.

Results from training with larger image sizes (higher resolutions - on the high-resolution datasets).

Results from training on additional interesting small datasets.

# Citations:

```
@inproceedings{choi2020starganv2,
  title={StarGAN v2: Diverse Image Synthesis for Multiple Domains},
  author={Yunjey Choi and Youngjung Uh and Jaejun Yoo and Jung-Woo Ha},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

@inproceedings{liu2015faceattributes,
 title = {Deep Learning Face Attributes in the Wild},
 author = {Liu, Ziwei and Luo, Ping and Wang, Xiaogang and Tang, Xiaoou},
 booktitle = {Proceedings of International Conference on Computer Vision (ICCV)},
 month = {December},
 year = {2015} 
}

@article{Karras2019stylegan2,
  title   = {Analyzing and Improving the Image Quality of {StyleGAN}},
  author  = {Tero Karras and Samuli Laine and Miika Aittala and Janne Hellsten and Jaakko Lehtinen and Timo Aila},
  journal = {CoRR},
  volume  = {abs/1912.04958},
  year    = {2019},
}

@misc{zhao2020feature,
    title   = {Feature Quantization Improves GAN Training},
    author  = {Yang Zhao and Chunyuan Li and Ping Yu and Jianfeng Gao and Changyou Chen},
    year    = {2020}
}

@misc{chen2020simple,
    title   = {A Simple Framework for Contrastive Learning of Visual Representations},
    author  = {Ting Chen and Simon Kornblith and Mohammad Norouzi and Geoffrey Hinton},
    year    = {2020}
}

@misc{nilsback2008flowers,
  title     = {Oxford 102 Flowers},
  author    = {Nilsback, M-E. and Zisserman, A.},
  year      = {2008},
  abstract  = {A 102 category dataset consisting of 102 flower categories, commonly occurring in the United Kingdom. Each class consists of 40 to 258 images. The images have large scale, pose and light variations.}
}

@article{afifi201911k,
  title   = {11K Hands: gender recognition and biometric identification using a large dataset of hand images},
  author  = {Afifi, Mahmoud},
  journal = {Multimedia Tools and Applications}
}

@misc{zhang2018selfattention,
    title   = {Self-Attention Generative Adversarial Networks},
    author  = {Han Zhang and Ian Goodfellow and Dimitris Metaxas and Augustus Odena},
    year    = {2018},
    eprint  = {1805.08318},
    archivePrefix = {arXiv}
}

@article{shen2019efficient,
  author    = {Zhuoran Shen and
               Mingyuan Zhang and
               Haiyu Zhao and
               Shuai Yi and
               Hongsheng Li},
  title     = {Efficient Attention: Attention with Linear Complexities},
  journal   = {CoRR},  
  year      = {2018},
  url       = {http://arxiv.org/abs/1812.01243},
}

@misc{zhao2020image,
    title  = {Image Augmentations for GAN Training},
    author = {Zhengli Zhao and Zizhao Zhang and Ting Chen and Sameer Singh and Han Zhang},
    year   = {2020},
    eprint = {2006.02595},
    archivePrefix = {arXiv}
}

@misc{karras2020training,
    title   = {Training Generative Adversarial Networks with Limited Data},
    author  = {Tero Karras and Miika Aittala and Janne Hellsten and Samuli Laine and Jaakko Lehtinen and Timo Aila},
    year    = {2020},
    eprint  = {2006.06676},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@techreport{krizhevsky2009learning,
  title       = {Learning Multiple Layers of Features from Tiny Images},
  author      = {Krizhevsky, Alex},
  institution = {University of Toronto},
  year        = {2009}
}
```


