---
#### Progressive Growing of GANs

By now, you have probably realized how difficult it can be to train GANs. They are fairly unstable, especially when trying to generate high-dimensional samples, such as high-resolution images!

However, researchers never lack ideas to improve them, and this 2017 paper made the training of GANs more stable: [Progressive growing of GANs for improved quality, stability and variation](https://arxiv.org/pdf/1710.10196.pdf).

The main idea behind this paper is the following: since training GANs on smaller images is easier, we can progressively grow the network and the dimensions of the generated images to make training easier. It is illustrated by the figure below:

<img src='../images/progan2.png' width=80% />


##### Layer fading 

Each level, or depth, is trained for a certain number of epochs (e.g., 10 epochs). Then a new layer is added to the discriminator and the generator, and we start training with these additional layers. However, when a new layer is added, it is faded in smoothly, as described by the following figure:

<img src='../images/layer_fading2.png' width=70% />

The `toRGB` and `fromRGB` layers are the layers projecting the feature vector to the RGB space (HxWx3) and the layer doing the opposite, respectively. 

Let's look at the example:
* **(a)** the network is currently training at 16x16 resolution, meaning that the generated images are 16x16x3
* **(b)** we are adding two new layers to train at 32x32 resolution. However, we are fading the new layers in by doing the following:
    * for the generator, we take the output of the 16x16 layer and use nearest-neighbor resizing to double its resolution to 32x32. The same output is also fed to the new 32x32 layer. The output of the network is then the weighted sum of the upsampled 16x16 image, with weight $(1 - \alpha)$, and the 32x32 layer output, with weight $\alpha$.
    * for the discriminator, we do something similar, but we use an average pooling layer to reduce the resolution
    * the network trains for $N$ epochs at each resolution. During the first $N/2$ epochs, we start with $\alpha = 0$ and increase $\alpha$ linearly to $\alpha = 1$. Then we train for the remaining $N/2$ epochs with $\alpha = 1$.
* **(c)** the network is now training at 32x32 resolution
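
The schedule described in **(b)** can be sketched as a small helper. This is a minimal sketch, assuming `alpha` is updated once per epoch (implementations often update it every iteration instead). The name `fade_alpha` and its arguments are hypothetical and not part of the exercise code.

```python
def fade_alpha(epoch: int, epochs_per_resolution: int) -> float:
    """Return alpha for a 0-indexed epoch within one resolution stage.

    Assumption: alpha ramps linearly from 0 to 1 over the first half of
    the stage and stays at 1 for the second half.
    """
    half = epochs_per_resolution / 2
    if half <= 0:
        return 1.0  # degenerate stage: nothing to fade
    return min(1.0, epoch / half)


# alpha at the start of each epoch for N = 10 epochs per resolution
alphas = [fade_alpha(epoch, epochs_per_resolution=10) for epoch in range(10)]
print(alphas)
```

With `N = 10`, `alpha` rises in equal steps during epochs 0 to 4 and is held at 1 from epoch 5 onward.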

##### Exercise

In this exercise, you will implement the Generator of the ProGAN model. To make your life easier, I have already implemented two torch modules: `GeneratorFirstBlock` and `GeneratorBlock`.
* the `GeneratorFirstBlock` module takes the latent vector as input and outputs multi-dimensional feature maps
* the `GeneratorBlock` module corresponds to the layers added each time the resolution increases

**Notes:** In the paper, the authors use a new type of normalization called pixelwise feature vector normalization (pixel normalization). I encourage you to read the paper, but for the sake of simplicity, I did not add any normalization here.

---
#### Progressive Growing of GANs (ProGAN): Building Images from Coarse to Fine

Imagine learning to draw a portrait. You wouldn't start by meticulously crafting individual eyelashes. Instead, you'd begin with basic shapes—an oval for the head, rough positions for eyes and mouth—then gradually add finer details. Progressive Growing of GANs (ProGAN) applies this same intuitive approach to neural network training, revolutionizing how we generate high-resolution images.

Before ProGAN's introduction in 2017, training GANs to generate high-resolution images presented formidable challenges. When researchers attempted to train networks to produce 1024×1024 pixel images directly, they encountered unstable training, mode collapse, and poor image quality. The fundamental problem lay in asking the network to learn everything simultaneously: global structure, local features, fine textures, and pixel-level details all at once.

##### The Core Innovation: Growing Networks Layer by Layer

ProGAN's breakthrough insight involves starting small and growing progressively larger. The training begins with both generator and discriminator networks designed to handle tiny 4×4 pixel images. Once these networks achieve stable performance at this resolution, new layers are added to both networks, doubling the resolution to 8×8 pixels. This process continues systematically: 16×16, 32×32, 64×64, 128×128, 256×256, 512×512, and finally 1024×1024 pixels.
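
The doubling sequence above is easy to generate programmatically. The following is a small illustrative sketch; the function name `resolution_schedule` is hypothetical.

```python
from typing import List


def resolution_schedule(start: int = 4, final: int = 1024) -> List[int]:
    """List the training resolutions obtained by repeated doubling."""
    resolutions = [start]
    while resolutions[-1] < final:
        resolutions.append(resolutions[-1] * 2)
    return resolutions


schedule = resolution_schedule()
print(schedule)            # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
print(len(schedule) - 1)   # 8 growth steps between 4x4 and 1024x1024
```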

This approach mirrors how we naturally learn complex skills. Just as a pianist masters simple scales before attempting complex compositions, ProGAN allows the networks to master low-resolution generation before tackling the intricacies of high-resolution synthesis.

The mathematical elegance of this approach becomes clear when we consider the learning task at each stage. At 4×4 resolution, the network learns fundamental concepts like basic shapes, color distributions, and simple textures. At 8×8, it refines these concepts while adding slightly more complex patterns. Each resolution doubling introduces new challenges but builds upon previously learned foundations.

##### Architecture Design: Symmetric Growth for Generator and Discriminator

The beauty of ProGAN lies in its symmetric approach to network growth. Both generator and discriminator expand simultaneously, maintaining balance throughout training. This symmetry proves crucial because it prevents either network from becoming too powerful relative to the other—a common cause of training instability in GANs.

Let me walk you through how this growth process works in detail. Initially, we have a generator that takes a latent vector and produces a 4×4 image, paired with a discriminator that evaluates 4×4 images. Both networks use standard convolutional layers appropriate for this resolution.

When transitioning to 8×8 resolution, new layers are added to both networks. The generator receives additional upsampling and convolution layers, while the discriminator gets corresponding downsampling and convolution layers. However, and this detail is crucial, these new layers aren't simply switched on immediately. Instead, ProGAN employs a smooth transition mechanism that gradually fades in the new layers.

```mermaid
flowchart TD
    Noise["Input Noise Vector<br>512-dimensional<br>z ~ N(0,1)"] --> FC["Fully Connected Layer<br>Reshape to 4×4×512"]
    
    FC --> Conv4x4_1["Conv Block 4×4<br>512 → 512 channels<br>Learn basic structure"]
    Conv4x4_1 --> Conv4x4_2["Conv Block 4×4<br>512 → 512 channels<br>Refine features"]
    
    Conv4x4_2 --> ToRGB_4["To RGB 4×4<br>Generate low-res image"]
    Conv4x4_2 --> Up1["Upsample 2×<br>4×4 → 8×8"]
    
    Up1 --> Conv8x8_1["Conv Block 8×8<br>512 → 512 channels<br>Add medium details"]
    Conv8x8_1 --> Conv8x8_2["Conv Block 8×8<br>512 → 512 channels<br>Refine 8×8 features"]
    
    Conv8x8_2 --> ToRGB_8["To RGB 8×8<br>Generate med-res image"]
    Conv8x8_2 --> Up2["Upsample 2×<br>8×8 → 16×16"]
    
    Up2 --> Conv16x16_1["Conv Block 16×16<br>512 → 512 channels<br>Add finer details"]
    Conv16x16_1 --> Conv16x16_2["Conv Block 16×16<br>512 → 512 channels<br>Refine 16×16 features"]
    
    Conv16x16_2 --> MoreLayers["... Progressive Blocks<br>32×32 → 64×64 → 128×128<br>→ 256×256 → 512×512"]
    MoreLayers --> FinalBlock["Final Block 1024×1024<br>Add finest details<br>Skin texture, hair strands"]
    
    FinalBlock --> Output["High-Resolution Image<br>1024×1024 RGB<br>Photorealistic quality"]
    
    ToRGB_4 --> FadeIn1["Fade-in Mechanism<br>Smooth transition<br>between resolutions"]
    ToRGB_8 --> FadeIn1
    FadeIn1 --> FadeIn2["Progressive Integration<br>of new layers"]
    
    style Noise fill:#BCFB89
    style FC fill:#9AE4F5
    style Conv4x4_1 fill:#FBF266
    style Conv4x4_2 fill:#FBF266
    style Conv8x8_1 fill:#FA756A
    style Conv8x8_2 fill:#FA756A
    style Conv16x16_1 fill:#0096D9
    style Conv16x16_2 fill:#0096D9
    style ToRGB_4 fill:#FCEB14
    style ToRGB_8 fill:#FCEB14
    style FadeIn1 fill:#FE9237
    style Output fill:#FE9237
```

##### The Fade-in Mechanism: Smooth Transitions Between Resolutions

The fade-in mechanism represents one of ProGAN's most clever innovations. When adding new layers for higher resolution, the network doesn't immediately switch to using only the new path. Instead, it maintains both the old (lower resolution) and new (higher resolution) paths, combining their outputs using a mixing weight $\alpha$ that is set by the training schedule rather than learned.

During the transition from 4×4 to 8×8 resolution, for example, the generator produces outputs through two paths. The old path upsamples the 4×4 output to 8×8 using simple interpolation. The new path processes the features through the newly added 8×8 layers. These two outputs are then combined using a weighted average, where the weight gradually shifts from favoring the old path to favoring the new path.

Mathematically, if we denote the old path output as $O_{old}$ and the new path output as $O_{new}$, the final output during transition becomes:
$$O_{final} = (1 - \alpha) \cdot O_{old} + \alpha \cdot O_{new}$$

The parameter $\alpha$ starts at 0 (using only the old path) and gradually increases to 1 (using only the new path) over the course of training. This smooth transition prevents the sudden disruption that would occur if we immediately switched to the new architecture.

Think of this like learning to ride a bicycle with training wheels. The training wheels (old path) provide stability while you're learning, but gradually become less important as your balance (new path) improves. Eventually, you can ride without training wheels, but the transition was smooth rather than abrupt.
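
The blend can be sketched on plain arrays. This is a minimal NumPy sketch, not the exercise solution: it assumes images are stored as `(channels, height, width)` arrays and that the old path is upsampled with nearest-neighbor repetition. The helper names `upsample_nearest_2x` and `fade_in` are hypothetical.

```python
import numpy as np


def upsample_nearest_2x(img: np.ndarray) -> np.ndarray:
    """Double height and width of a (C, H, W) array by repeating each pixel."""
    return img.repeat(2, axis=1).repeat(2, axis=2)


def fade_in(old_rgb: np.ndarray, new_rgb: np.ndarray, alpha: float) -> np.ndarray:
    """O_final = (1 - alpha) * upsample(O_old) + alpha * O_new."""
    return (1.0 - alpha) * upsample_nearest_2x(old_rgb) + alpha * new_rgb


old_rgb = np.zeros((3, 4, 4))   # stand-in for the 4x4 toRGB output
new_rgb = np.ones((3, 8, 8))    # stand-in for the 8x8 toRGB output
blended = fade_in(old_rgb, new_rgb, alpha=0.25)
print(blended.shape)  # (3, 8, 8)
```

With `alpha=0` the result is exactly the upsampled old image, and with `alpha=1` it is exactly the new path's output.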

##### Training Dynamics: Balancing Speed and Stability

The training schedule in ProGAN requires careful orchestration. Each resolution phase consists of two stages: the transition phase and the stabilization phase. During the transition phase, new layers are gradually faded in using the mechanism described above. During the stabilization phase, the network trains exclusively with the new layers, allowing it to fully adapt to the higher resolution.

The duration of each phase matters significantly. Too short a transition phase can lead to training instabilities as the network doesn't have sufficient time to adapt to the new architecture. Too long a transition phase wastes computational resources without providing additional benefits. The original ProGAN paper suggests spending roughly equal time in transition and stabilization phases, but this can be adjusted based on the specific dataset and desired quality.

One fascinating aspect of this training approach is how it naturally implements a form of curriculum learning. The network first masters the easiest task (generating coherent 4×4 images), then progresses to increasingly difficult tasks. This curriculum proves far more effective than attempting to learn everything simultaneously.

Consider what happens in the generator during early training phases. At 4×4 resolution, the network learns fundamental concepts about the data distribution. It discovers that faces should have certain color patterns, that objects should have coherent shapes, and that backgrounds should contrast appropriately with foregrounds. These lessons remain valuable as the network grows, providing a stable foundation for learning finer details.

##### Architectural Choices: Pixel Normalization and Equalized Learning Rate

Beyond the progressive growing mechanism itself, ProGAN introduces several other innovations that contribute to training stability. Pixel normalization replaces batch normalization in the generator, normalizing each pixel's feature vector to unit length. This technique prevents the generator from creating excessively large values that could destabilize training.

The pixel normalization operation can be expressed as:
$$\hat{x}_{i,j,k} = \frac{x_{i,j,k}}{\sqrt{\frac{1}{N}\sum_{k'=1}^{N} x_{i,j,k'}^2 + \epsilon}}$$

where $x_{i,j,k}$ represents the feature value at spatial location $(i,j)$ and feature channel $k$, and $N$ is the number of feature channels.
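
As a quick check of the formula, here is a minimal NumPy sketch. It assumes a single `(N, H, W)` feature map with channels on the first axis and uses `eps = 1e-8` as an assumed default; the function name `pixel_norm` is hypothetical.

```python
import numpy as np


def pixel_norm(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize the feature vector at each spatial location of a (N, H, W) array."""
    # mean of squares over the channel axis, one value per pixel
    mean_sq = np.mean(x ** 2, axis=0, keepdims=True)
    return x / np.sqrt(mean_sq + eps)


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4, 4)) * 10.0   # deliberately large activations
normalized = pixel_norm(features)

# after normalization the mean squared activation at every pixel is close to 1
print(np.mean(normalized ** 2, axis=0))
```

Note that the formula fixes the root mean square of each pixel's feature vector to 1, whatever the scale of the incoming activations.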

Equalized learning rate represents another crucial innovation. Instead of using standard weight initialization schemes, ProGAN initializes all weights from a normal distribution with unit variance, then scales them at runtime based on the fan-in of each layer. This approach ensures that all layers learn at similar rates, preventing some layers from dominating others during training.

The runtime scaling factor for each layer is computed as:
$$c = \sqrt{\frac{2}{n}}$$

where $n$ is the fan-in (number of input connections) of the layer. This scaling factor is applied during the forward pass, effectively normalizing the learning rate across layers of different sizes.
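
The idea can be illustrated with a toy fully connected layer. This is a sketch in NumPy under stated assumptions, not the paper's implementation: the class name `EqualizedLinear` is hypothetical, and only the forward pass is shown.

```python
import numpy as np


class EqualizedLinear:
    """Toy linear layer with runtime weight scaling."""

    def __init__(self, in_features: int, out_features: int, rng: np.random.Generator):
        # weights are stored with unit variance, N(0, 1)
        self.weight = rng.normal(0.0, 1.0, size=(out_features, in_features))
        self.bias = np.zeros(out_features)
        # c = sqrt(2 / n), with n the fan-in of the layer
        self.scale = np.sqrt(2.0 / in_features)

    def forward(self, x: np.ndarray) -> np.ndarray:
        # the scaling is applied on every forward pass, not at initialization
        return x @ (self.weight * self.scale).T + self.bias


rng = np.random.default_rng(0)
layer = EqualizedLinear(512, 256, rng)
y = layer.forward(rng.normal(size=(8, 512)))
print(y.shape, layer.scale)  # (8, 256) 0.0625
```

Because the stored weights all share the same variance, a layer with a large fan-in and a layer with a small fan-in see parameter updates of comparable relative size.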

##### Understanding the Discriminator's Progressive Growth

While much attention focuses on the generator's progressive growth, the discriminator's evolution proves equally important. The discriminator starts by learning to distinguish real images downsampled to 4×4 from generated 4×4 images. This seemingly simple task teaches the network coarse image statistics such as overall color and layout.

As resolution increases, the discriminator faces increasingly sophisticated challenges. At 8×8, it must distinguish not just basic textures but also spatial relationships between adjacent regions. At 16×16, it begins recognizing more complex patterns like facial features or object boundaries. By 1024×1024, it's evaluating fine details like skin pores, hair strands, and subtle lighting effects.

This progressive learning proves crucial for maintaining the adversarial balance. If the discriminator learned to evaluate 1024×1024 images from the beginning, it would likely overpower the generator early in training, providing gradients that are either too strong (causing instability) or too weak (causing poor learning).

The discriminator's architecture mirrors the generator's growth pattern but in reverse. Where the generator upsamples feature maps to increase resolution, the discriminator downsamples them. Where the generator combines features from different scales, the discriminator separates them. This symmetry ensures that both networks grow in capability at similar rates.
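
The downsampling step on the discriminator side can be sketched with 2×2 average pooling. This is a minimal NumPy sketch assuming `(channels, height, width)` arrays with even height and width; the helper name `avg_pool_2x` is hypothetical.

```python
import numpy as np


def avg_pool_2x(img: np.ndarray) -> np.ndarray:
    """Halve height and width of a (C, H, W) array by averaging 2x2 blocks."""
    c, h, w = img.shape
    # split each spatial axis into (blocks, 2) and average over the size-2 axes
    return img.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


img = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = avg_pool_2x(img)
print(pooled)
# [[[ 2.5  4.5]
#   [10.5 12.5]]]
```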

##### Practical Benefits: Quality, Stability, and Efficiency

The practical advantages of progressive growing extend well beyond theoretical elegance. First and foremost, it enables stable training of high-resolution GANs. Before ProGAN, generating 1024×1024 images with GANs was extremely difficult and often unsuccessful. ProGAN made such high-resolution generation routine and reliable.

Training efficiency represents another significant benefit. By starting with small images and growing progressively, the network spends most of its training time on smaller, computationally cheaper operations. Training a 4×4 generator requires far fewer computational resources than training a 1024×1024 generator. Since much of the fundamental learning happens at lower resolutions, this approach achieves better results with less computation.
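
As a rough worked example, the snippet below compares the number of pixels per image at each resolution. Pixel count is only a crude proxy for compute (it ignores channel widths and network depth), so treat the ratios as an illustration rather than a measurement.

```python
resolutions = [4 * 2 ** i for i in range(9)]          # 4, 8, ..., 1024
pixels = {res: res * res for res in resolutions}
full = pixels[1024]

for res in resolutions:
    ratio = full // pixels[res]
    print(f"{res:>4}x{res:<4} {pixels[res]:>8} pixels  1/{ratio} of 1024x1024")
```

A 4×4 image has 16 pixels against 1,048,576 at 1024×1024, a factor of 65,536.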

The progressive approach also provides natural checkpoints for evaluation and early stopping. Researchers can monitor the network's progress at each resolution, identifying problems before they compound at higher resolutions. If 64×64 generation shows serious artifacts, it's better to address these issues before proceeding to 128×128 rather than discovering them only after extensive high-resolution training.

Memory efficiency improvements deserve special mention. Training deep networks for high-resolution image generation requires substantial GPU memory. Progressive growing allows training to begin with smaller memory requirements, scaling up gradually as the networks grow. This approach makes high-resolution GAN training accessible to researchers with more modest computational resources.

##### Limitations and Considerations

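Despite its impact, progressive growing has limitations that researchers should understand. The multi-stage schedule adds moving parts: the length of each transition and stabilization phase becomes a hyperparameter, and settings such as batch size may need to change as the resolution grows. The schedule itself is not what makes training slow, though: the original paper reports reaching comparable quality roughly 2 to 6 times faster than training at full resolution from the start, which is consistent with the efficiency argument above.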

The architectural constraints imposed by progressive growing can sometimes limit network design flexibility. The requirement for symmetric growth between generator and discriminator, while beneficial for stability, constrains architectural innovation. Some network designs that work well for direct training may not adapt naturally to progressive growing.

Certain types of artifacts can become "baked in" during early training phases and persist through higher resolutions. If the network learns incorrect patterns during 4×4 training, these patterns may prove difficult to unlearn during later phases. This phenomenon suggests that careful monitoring of early training phases becomes crucial for final quality.

The fade-in mechanism, while generally beneficial, occasionally produces its own artifacts during transition phases. Images generated during transitions may show subtle blending artifacts where the old and new paths don't perfectly align. These artifacts typically resolve once the transition completes, but they can affect applications requiring consistent quality throughout training.

##### Legacy and Impact on Subsequent Research

ProGAN's influence on generative modeling extends far beyond its immediate success. The progressive growing principle has been adapted for other types of generative models, including variational autoencoders and normalizing flows. The concept of curriculum learning in neural networks, while not originated by ProGAN, gained significant attention due to its success.

StyleGAN, perhaps the most famous successor to ProGAN, initially incorporated progressive growing before later versions moved to different approaches. The stability improvements demonstrated by ProGAN convinced researchers that high-resolution GAN training was practical, leading to increased investment in GAN research and applications.

The architectural innovations introduced alongside progressive growing—pixel normalization, equalized learning rate, and minibatch standard deviation—have found applications in many subsequent GAN architectures. These techniques have become standard tools in the GAN researcher's toolkit, valued for their contributions to training stability.

##### Modern Perspectives and Evolution

Contemporary GAN research has largely moved beyond progressive growing, developing alternative approaches for stable high-resolution training. Techniques like spectral normalization, self-attention mechanisms, and improved loss functions have provided different solutions to the challenges that progressive growing addressed.

However, the fundamental insight behind progressive growing—that complex tasks benefit from structured, gradual learning—remains valuable. Modern architectures often incorporate similar principles in different forms, such as multiscale training, hierarchical generation, or attention mechanisms that focus on different levels of detail.

The success of progressive growing also highlighted the importance of training dynamics in generative models. This insight has influenced research into adaptive training schedules, dynamic architectures, and other approaches that modify the learning process based on training progress.

Progressive Growing of GANs represents a landmark achievement in making high-resolution image generation practical and reliable. Its approach of building complexity gradually, maintaining balanced adversarial training, and introducing architectural innovations for stability has influenced an entire generation of generative models. While current research has moved toward different techniques, the fundamental principles demonstrated by ProGAN continue to inform our understanding of how to train complex generative models effectively. The method's success illustrates how sometimes the most powerful innovations come not from adding complexity, but from finding smarter ways to manage it.

In [None]:
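import numpy as np
import torch
import torch.nn as nn
# `interpolate` is used in GeneratorBlock.forward below
from torch.nn.functional import interpolate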

In [None]:
class GeneratorFirstBlock(nn.Module):
    """
    This block follows the ProGan paper implementation.
    Takes the latent vector and creates feature maps.
    """
    def __init__(self, latent_dim: int):
        super(GeneratorFirstBlock, self).__init__()
        # initial block 
        self.conv0 = nn.ConvTranspose2d(latent_dim, 512, kernel_size=4)
        self.conv1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor):
        # x is a (batch_size, latent_dim) latent vector, we need to turn it into a feature map
        x = torch.unsqueeze(torch.unsqueeze(x, -1), -1)
        x = self.conv0(x)
        x = self.activation(x)
        
        x = self.conv1(x)
        x = self.activation(x)
        return x

In [None]:
class GeneratorBlock(nn.Module):
    """
    This block follows the ProGan paper implementation.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super(GeneratorBlock, self).__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor):
        x = interpolate(x, scale_factor=2)
        x = self.conv1(x)
        x = self.activation(x)

        x = self.conv2(x)
        x = self.activation(x)
        return x

Using the above two blocks, you can implement the Generator module. The end resolution that we want to reach is 512x512 and we will start at a 4x4 resolution. 


#### init
The `__init__` method should contain enough blocks to work at full resolution. We are only instantiating the generator once! So you will need to:
* create one GeneratorFirstBlock module
* create enough GeneratorBlocks modules such that the final resolution is 512x512
* create one `toRGB` layer per resolution. 

The number of filters in each layer is controlled by the `num_filters` function below.


#### forward

The forward method does the following:
* takes the latent vector, the current resolution and `alpha` as input
* runs the latent vector through the different blocks and performs `alpha` fading


**Tips:**
* you can use the torch `interpolate` function to double the resolution of an image
* you can use the `np.log2` function to map the resolution of the input image to a "depth" (or stage) level. For example, `np.log2(512) = 9` and `np.log2(4) = 2`.
* when training at 4x4 resolution, you should not perform $\alpha-$fading.
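
The resolution-to-depth mapping from the tips can be sketched as follows. This is a minimal sketch using the standard library; the helper name `resolution_to_stage` and its 0-based convention (4x4 is stage 0) are assumptions chosen for illustration.

```python
import math


def resolution_to_stage(resolution: int, min_resolution: int = 4) -> int:
    """Map a power-of-two resolution to a 0-based stage index (4 -> 0, 8 -> 1, ...)."""
    is_power_of_two = resolution > 0 and (resolution & (resolution - 1)) == 0
    if resolution < min_resolution or not is_power_of_two:
        raise ValueError(f"expected a power of two >= {min_resolution}, got {resolution}")
    return int(math.log2(resolution)) - int(math.log2(min_resolution))


for res in (4, 8, 512):
    print(res, resolution_to_stage(res))
# 4 0
# 8 1
# 512 7
```

With this convention, stage 0 needs no fading, and a stage `s > 0` blends the outputs of stages `s - 1` and `s`.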

In [None]:
import tests

from torch.nn.functional import interpolate

In [None]:
def num_filters(stage: int, 
                fmap_base: int = 8192,
                fmap_decay: float = 1.0,
                fmap_max: int = 512): 
    """
    A small helper function to compute the number of filters for conv layers based on the depth.
    From the original repo https://github.com/tkarras/progressive_growing_of_gans/blob/master/networks.py#L252
    """
    return min(int(fmap_base / (2.0 ** (stage * fmap_decay))), fmap_max)
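
To see what the helper returns, the sketch below evaluates it for stages 1 to 8 with the default arguments. The function is repeated here only so that the snippet runs on its own.

```python
def num_filters(stage: int,
                fmap_base: int = 8192,
                fmap_decay: float = 1.0,
                fmap_max: int = 512) -> int:
    # same formula as the helper above: halve the filter count per stage, capped at fmap_max
    return min(int(fmap_base / (2.0 ** (stage * fmap_decay))), fmap_max)


filters = [num_filters(stage) for stage in range(1, 9)]
print(filters)  # [512, 512, 512, 512, 256, 128, 64, 32]
```

The cap at `fmap_max` keeps the first four stages at 512 filters, after which the count halves with each stage.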

In [None]:
class Generator(nn.Module):
    """
    Generator: takes a latent vector as input and outputs an image.
    args:
    - max_resolution: max image resolution
    - latent_dim: dimension of the input latent vector
    """
    def __init__(self, max_resolution: int, latent_dim: int):
        super(Generator, self).__init__()

        # following the original implementation
        resolution_log2 = int(np.log2(max_resolution))

        # layers blocks
        self.blocks = [GeneratorFirstBlock(latent_dim)]
        for res in range(1, resolution_log2 - 1):
            self.blocks.append(GeneratorBlock(num_filters(res), num_filters(res+1)))
        self.blocks = nn.ModuleList(self.blocks)

        # to rgb blocks
        self.to_rgb = [nn.Conv2d(num_filters(res), 3, kernel_size=1) for res in range(1, resolution_log2)]
        self.to_rgb = nn.ModuleList(self.to_rgb)

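    def forward(self, x: torch.Tensor, current_res: int, alpha: float = 1.0):
        # map the resolution to a depth: 4 -> 2, 8 -> 3, ..., 512 -> 9
        resolution_log2 = int(np.log2(current_res))

        if current_res == 4:
            # lowest resolution: a single block and no alpha fading
            x = self.blocks[0](x)
            images_out = self.to_rgb[0](x)
        else:
            # run all the blocks below the current resolution
            for block in self.blocks[:resolution_log2-2]:
                x = block(x)

            # old path: project to RGB at the previous resolution, then upsample
            previous_img = self.to_rgb[resolution_log2-3](x)
            previous_img_scaled = interpolate(previous_img, scale_factor=2)

            # new path: run the newest block and project to RGB
            x = self.blocks[resolution_log2-2](x)
            new_img = self.to_rgb[resolution_log2-2](x)

            # alpha fading between the two paths
            images_out = new_img * alpha + (1 - alpha) * previous_img_scaled
        return images_out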

In [None]:
generator = Generator(max_resolution=512, latent_dim=128)

In [None]:
tests.check_progan_generator(generator)