---
### StyleGAN

Look at this: [https://thispersondoesnotexist.com/](https://thispersondoesnotexist.com/)! Would you believe this picture is not real? Well, it is not! Feel free to refresh the page to look at more examples. This picture has been generated by [StyleGAN](https://arxiv.org/pdf/1812.04948.pdf), a groundbreaking architecture in the world of GANs!

StyleGAN uses a lot of tricks that we have seen in the course (e.g., progressive growing) but also relies on a novel approach to controlled generation. The figure below describes the architecture of the network:

<p align="center">
    <img src='../images/stylegan.png' width=60% />
</p>

<br>

<p align="center">
    <img src='../images/style_gan.png' width=60% />
</p>

Wow! We have a completely new network in our GAN. This network is called the **mapping network**, because it maps the noise vector $z$ to a new latent vector $w$. Its architecture is fairly simple: it consists of only 8 fully connected layers. The authors argue that mapping the latent vector $z$ to a new latent vector $w$ facilitates **disentanglement**.

We talked about disentanglement in the course: when modifying the latent vector $z$ to control an aspect of the generated image, we often face entangled features. For example, longer hair can be correlated with a more feminine face. Using the mapping network in StyleGAN facilitates the decorrelation of such features.

What happens next? The generated $w$ latent vector is injected into a "classic" generator network. However, this network has two components you are not yet familiar with, as seen in the figure above. After each convolution layer, the authors add **noise**. The convolution output with added noise is then fed into an **adaptive instance normalization layer, or AdaIN**.

In this notebook, you will implement a **noise injection layer** and the **AdaIN layer**.

---
#### StyleGAN: Revolutionizing High-Quality Image Generation

StyleGAN represents one of the most significant breakthroughs in generative adversarial networks, changing how we think about image generation and controllable synthesis. To understand why StyleGAN matters, imagine being an artist who can not only create realistic portraits but also steer the subject's appearance, from age and gender to subtle details like hair texture and facial expression. StyleGAN brings a large part of this control to the digital realm.

The key insight behind StyleGAN lies in recognizing that traditional GANs treat all aspects of image generation equally, mixing low-level details like texture with high-level features like face shape in an entangled manner. StyleGAN introduces a novel architecture that disentangles these different aspects, allowing unprecedented control over the generation process.

##### The Architecture Revolution: From Noise to Style

Traditional GANs begin with a random noise vector that gets transformed through a series of layers until it becomes an image. This approach has a fundamental limitation: the initial noise vector must encode everything about the final image, from global structure to fine details, in a single compressed representation. StyleGAN takes a radically different approach by introducing a mapping network and style-based generation.

##### The Mapping Network: Creating the Style Space

The first major innovation in StyleGAN is the mapping network, which transforms the input noise vector $z$ into an intermediate latent code $w$. This mapping network consists of several fully connected layers that learn to map from the input latent space $\mathcal{Z}$ to what the authors call the intermediate latent space $\mathcal{W}$.

This transformation serves a crucial purpose. The input space $\mathcal{Z}$ must follow a fixed, simple distribution such as a multivariate Gaussian, while the factors of variation in real images are not distributed that way. Because $\mathcal{W}$ is learned, it is free of this constraint, and the mapping network can warp the simple input distribution into an intermediate representation whose directions tend to correspond to less entangled factors of variation.

The mapping network can be expressed mathematically as:
$$w = f(z)$$
where $f$ is an 8-layer multilayer perceptron that maps from $\mathcal{Z}$ to $\mathcal{W}$.
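
As a rough illustration of $w = f(z)$, here is a minimal sketch of an 8-layer mapping network in PyTorch. The class name `MappingNetwork`, the 512-dimensional sizes, the LeakyReLU slope of 0.2, and the way the input is normalized are assumptions made for this example, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Illustrative 8-layer MLP mapping z to w (names and defaults are assumptions)."""

    def __init__(self, z_dim: int = 512, w_dim: int = 512, num_layers: int = 8):
        super().__init__()
        layers = []
        in_dim = z_dim
        for _ in range(num_layers):
            layers.append(nn.Linear(in_dim, w_dim))
            layers.append(nn.LeakyReLU(0.2))
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Rescale each z to unit average squared magnitude before the MLP
        # (one possible form of the input normalization; an assumption here).
        z = z * torch.rsqrt(z.pow(2).mean(dim=1, keepdim=True) + 1e-8)
        return self.net(z)


mapping = MappingNetwork()
z = torch.randn(4, 512)   # batch of 4 noise vectors
w = mapping(z)            # batch of 4 intermediate latent codes
print(w.shape)            # torch.Size([4, 512])
```

The output has the same batch size as the input, and each row of `w` can then be fed to the per-layer affine transformations of the generator.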

##### Style-Based Generation: Adaptive Instance Normalization

The second revolutionary component is the style-based generator architecture. Instead of feeding the latent code directly into the generator layers, StyleGAN uses the intermediate latent code $w$ to control the generation process through Adaptive Instance Normalization (AdaIN).

Each generator layer receives its style information from a learned affine transformation of $w$. The AdaIN operation normalizes the feature maps and then applies style-specific scaling and shifting:

$$\text{AdaIN}(x_i, y) = y_{s,i} \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $x_i$ is the feature map of the $i$-th channel, $y_{s,i}$ and $y_{b,i}$ are the style-derived scaling and bias parameters, and $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation of $x_i$.
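
To make the formula concrete, the sketch below evaluates it term by term with explicit per-channel means and standard deviations. The function name `adain_by_formula` and the small `eps` added for numerical stability are assumptions for this example. The style parameters are fixed constants here, whereas in StyleGAN they come from an affine transformation of $w$.

```python
import torch


def adain_by_formula(x, y_s, y_b, eps=1e-5):
    """Apply AdaIN to x of shape [N, C, H, W] with styles y_s, y_b of shape [N, C]."""
    # Statistics are computed per sample and per channel, over H and W only.
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = torch.sqrt(x.var(dim=(2, 3), unbiased=False, keepdim=True) + eps)
    x_norm = (x - mu) / sigma
    # Broadcast the per-channel scale and bias over the spatial dimensions.
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]


torch.manual_seed(0)
x = torch.randn(2, 3, 8, 8) * 5 + 2      # features with arbitrary mean and scale
y_s = torch.full((2, 3), 2.0)            # target per-channel scale
y_b = torch.full((2, 3), -1.0)           # target per-channel bias
out = adain_by_formula(x, y_s, y_b)

print(out.mean(dim=(2, 3)))                  # every entry is close to -1
print(out.std(dim=(2, 3), unbiased=False))   # every entry is close to 2
```

Whatever the statistics of the incoming feature map, each output channel ends up with mean $y_{b,i}$ and standard deviation $y_{s,i}$, which is how the style overrides the statistics of the previous layer.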

This approach allows different layers to be controlled by different style vectors, enabling fine-grained control over different aspects of the generated image. Coarse layers control high-level features like face shape and pose, while fine layers control details like hair texture and skin pores.

##### The Complete StyleGAN Architecture

The diagram below shows, in simplified form, how these components work together in the StyleGAN architecture:

```mermaid
flowchart TD
    Z["Random Noise Vector<br>z ~ N(0,1)<br>512-dimensional"] --> MappingNet["Mapping Network<br>8 FC layers<br>z → w transformation"]
    
    MappingNet --> W["Style Vector w<br>512-dimensional<br>Intermediate latent space"]
    
    W --> A1["Affine Transform A1<br>w → (scale₁, bias₁)"]
    W --> A2["Affine Transform A2<br>w → (scale₂, bias₂)"]
    W --> A3["Affine Transform A3<br>w → (scale₃, bias₃)"]
    W --> ADots["... more transforms"]
    W --> An["Affine Transform An<br>w → (scaleₙ, biasₙ)"]
    
    Const["Learned Constant<br>4×4×512 tensor"] --> Conv1["Conv Layer 1<br>3×3 convolution"]
    Conv1 --> Noise1["Noise Injection<br>Per-pixel Gaussian noise<br>Learned per-channel scale"]
    Noise1 --> Act1["Activation<br>LeakyReLU"]
    Act1 --> AdaIN1["AdaIN Layer 1<br>Normalization + Style"]
    A1 --> AdaIN1

    AdaIN1 --> Upsample1["Upsample<br>4×4 → 8×8"]
    Upsample1 --> Conv2["Conv Layer 2<br>3×3 convolution"]
    Conv2 --> Noise2["Noise Injection<br>Stochastic fine detail"]
    Noise2 --> Act2["Activation<br>LeakyReLU"]
    Act2 --> AdaIN2["AdaIN Layer 2<br>Normalization + Style"]
    A2 --> AdaIN2

    AdaIN2 --> MoreLayers["... more conv blocks<br>Progressive upsampling<br>8×8 → 16×16 → ... → 1024×1024"]
    A3 --> MoreLayers
    ADots --> MoreLayers
    An --> MoreLayers

    MoreLayers --> FinalConv["Final Conv Layer<br>Convert to RGB"]
    FinalConv --> Output["Generated Image<br>High-resolution RGB<br>up to 1024×1024"]
    
    style Z fill:#BCFB89
    style MappingNet fill:#9AE4F5
    style W fill:#FBF266
    style A1 fill:#FA756A
    style A2 fill:#FA756A
    style A3 fill:#FA756A
    style An fill:#FA756A
    style Const fill:#0096D9
    style AdaIN1 fill:#FCEB14
    style AdaIN2 fill:#FCEB14
    style Output fill:#FE9237
```

##### Understanding Style Control at Different Scales

One of StyleGAN's most remarkable features is how different layers control different aspects of the generated image. This hierarchical control emerges naturally from the architecture and provides intuitive editing capabilities.

**Coarse Styles (Low-resolution layers, 4×4 to 8×8)**: These layers control high-level aspects such as pose, general hair style, face shape, and eyeglasses. Changes at this level affect the overall composition and major structural elements of the image.

**Middle Styles (Medium-resolution layers, 16×16 to 32×32)**: These layers govern finer facial features, hair style details, eyes open or closed, and other mid-level attributes. This is where many recognizable facial characteristics are determined.

**Fine Styles (High-resolution layers, 64×64 to 1024×1024)**: These layers control micro-features such as skin texture, hair color variations, background details, and fine-grained lighting effects. Changes here affect the surface appearance without altering the overall structure.

This hierarchical organization allows for precise control. You can modify the color scheme of a generated face without affecting its structure, or change the pose while largely preserving the finer facial features. Such edits are much harder with a traditional GAN, where a single latent vector drives every layer at once.
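
The grouping above can be written down as a small lookup. The sketch below assumes two styled layers per resolution from 4×4 up to the final resolution, which gives 18 layers for a 1024×1024 generator. The function name `style_layers` and the group labels are illustrative.

```python
import math


def style_layers(final_resolution: int = 1024):
    """Return (layer_index, resolution, group), assuming two styled layers per resolution."""
    num_resolutions = int(math.log2(final_resolution)) - 1  # 4, 8, ..., final_resolution
    layers = []
    for i in range(2 * num_resolutions):
        res = 4 * 2 ** (i // 2)
        if res <= 8:
            group = "coarse"
        elif res <= 32:
            group = "middle"
        else:
            group = "fine"
        layers.append((i, res, group))
    return layers


layers = style_layers(1024)
for idx, res, group in layers:
    print(f"layer {idx:2d}: {res}x{res} -> {group}")
```

With these assumptions, layers 0 to 3 are coarse, layers 4 to 7 are middle, and layers 8 to 17 are fine.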

##### Progressive Growing and Training Stability

StyleGAN builds upon the progressive growing technique introduced in Progressive GAN, where the network starts by learning to generate low-resolution images and gradually adds layers to increase resolution. This approach provides several advantages for training stability and final image quality.

The progressive training begins with 4×4 images and progressively adds layers to reach the final resolution. During this process, both the generator and discriminator grow in tandem, with new layers being gradually faded in rather than added abruptly. This smooth transition prevents training instabilities that often occur when suddenly changing the network architecture.

The mathematical formulation for the fade-in process involves a mixing parameter $\alpha$ that gradually transitions from the lower-resolution output to the higher-resolution output:
$$\text{output} = (1 - \alpha) \cdot \text{upsampled\_low\_res} + \alpha \cdot \text{high\_res}$$

This progressive approach allows the network to first learn the overall structure and composition at low resolution, then refine details at higher resolutions. This mirrors how human artists often work, sketching the overall composition before adding fine details.
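
The fade-in formula above can be sketched in a few lines. The function name `fade_in` and the use of nearest-neighbor upsampling are assumptions for this example. The tensors stand in for the RGB outputs of the old low-resolution branch and the newly added high-resolution branch.

```python
import torch
import torch.nn.functional as F


def fade_in(low_res_rgb: torch.Tensor, high_res_rgb: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend the upsampled low-resolution output with the new high-resolution output."""
    upsampled = F.interpolate(low_res_rgb, scale_factor=2, mode="nearest")
    return (1 - alpha) * upsampled + alpha * high_res_rgb


low = torch.zeros(1, 3, 8, 8)     # output of the existing 8x8 branch
high = torch.ones(1, 3, 16, 16)   # output of the newly added 16x16 branch
for alpha in (0.0, 0.5, 1.0):
    out = fade_in(low, high, alpha)
    print(alpha, out.mean().item())   # 0.0, then 0.5, then 1.0
```

At $\alpha = 0$ the new layers have no effect, and at $\alpha = 1$ the output comes entirely from the new high-resolution branch.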

##### The Power of Style Mixing

StyleGAN's architecture enables a fascinating capability called style mixing, where different layers of the generator receive style vectors $w$ derived from different latent codes. This creates a powerful tool for controlled image synthesis and editing.

In style mixing, you generate two different style vectors $w_1$ and $w_2$ from two random noise vectors $z_1$ and $z_2$. Then, you use $w_1$ for some layers and $w_2$ for others. For example, you might use $w_1$ for coarse layers (controlling face structure) and $w_2$ for fine layers (controlling skin texture and hair details).

This technique reveals the disentangled nature of StyleGAN's representation. When you mix styles from two different faces, you get coherent combinations rather than nonsensical blends. The face structure from one person can be seamlessly combined with the hair style from another, creating novel but realistic combinations.
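
A minimal sketch of the layer-wise selection behind style mixing is shown below. The function name `mix_styles`, the 18-layer count, and the crossover index are assumptions for this example. The two style vectors are stand-ins filled with zeros and ones so that the selection is easy to read, whereas in practice they would come from the mapping network.

```python
import torch


def mix_styles(w1: torch.Tensor, w2: torch.Tensor, num_layers: int = 18, crossover: int = 4) -> torch.Tensor:
    """Build per-layer styles: w1 for layers [0, crossover), w2 for the remaining layers."""
    # w1, w2: [N, w_dim] -> per-layer styles: [N, num_layers, w_dim]
    ws1 = w1[:, None, :].expand(-1, num_layers, -1)
    ws2 = w2[:, None, :].expand(-1, num_layers, -1)
    use_first = (torch.arange(num_layers) < crossover)[None, :, None]
    return torch.where(use_first, ws1, ws2)


w1 = torch.zeros(2, 512)   # stand-in for f(z1)
w2 = torch.ones(2, 512)    # stand-in for f(z2)
ws = mix_styles(w1, w2, crossover=4)
print(ws.shape)            # torch.Size([2, 18, 512])
print(ws[0, :, 0])         # four zeros (from w1) followed by fourteen ones (from w2)
```

Each generator layer `i` would then read its style from `ws[:, i]`, so moving `crossover` decides which source controls the coarse layers and which controls the rest.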

##### Truncation Trick and Quality Control

StyleGAN uses a technique called the truncation trick, known from earlier generative models, to balance diversity and quality in generated images, and applies it in the intermediate latent space $\mathcal{W}$ rather than in $\mathcal{Z}$. The insight is that samples from the extreme regions of the latent space often produce lower-quality images, while samples closer to the center tend to produce higher-quality but less diverse results.

The truncation trick modifies the sampling process by constraining the latent vectors:
$$w' = \bar{w} + \psi(w - \bar{w})$$

where $\bar{w}$ is the average latent vector computed over many samples, and $\psi$ is the truncation factor. When $\psi = 1$, you get the original sampling distribution. When $\psi < 1$, you get samples closer to the average, typically resulting in higher quality but reduced diversity.

This technique provides a dial for controlling the quality-diversity tradeoff, which proves invaluable for practical applications where consistent quality is more important than maximum diversity.
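
The truncation formula above is a one-line interpolation toward the average. In the sketch below, the function name `truncate` and the default $\psi = 0.7$ are assumptions, and random vectors stand in for outputs of the mapping network when estimating $\bar{w}$.

```python
import torch


def truncate(w: torch.Tensor, w_avg: torch.Tensor, psi: float = 0.7) -> torch.Tensor:
    """Pull w toward the average latent w_avg by the factor psi."""
    return w_avg + psi * (w - w_avg)


torch.manual_seed(0)
w_samples = torch.randn(10_000, 512)   # stand-in for many mapped samples f(z)
w_avg = w_samples.mean(dim=0)          # estimate of the average latent
w = w_samples[:4]

for psi in (1.0, 0.7, 0.0):
    w_t = truncate(w, w_avg, psi)
    # Average distance to w_avg shrinks proportionally to psi.
    print(psi, (w_t - w_avg).norm(dim=1).mean().item())
```

With $\psi = 0$ every sample collapses onto $\bar{w}$, which is the extreme end of the quality-diversity dial.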

##### Perceptual Path Length and Disentanglement

StyleGAN introduces a novel metric called Perceptual Path Length (PPL) to measure the quality of the latent space representation. This metric quantifies how smoothly the generated images change as you move through the latent space. A good latent space should produce smooth, meaningful transitions rather than abrupt, nonsensical changes.

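The PPL metric is computed by measuring the perceptual distance between images generated from two nearby points on an interpolation path in the latent space. When measured in $\mathcal{W}$, it reads:
$$\text{PPL} = \mathbb{E}\left[\frac{1}{\epsilon^2} d\big(\text{G}(\text{lerp}(f(z_1), f(z_2); t)),\ \text{G}(\text{lerp}(f(z_1), f(z_2); t + \epsilon))\big)\right]$$

where $z_1$ and $z_2$ are two random noise vectors, $f$ is the mapping network, $\text{lerp}$ is linear interpolation, $t$ is drawn uniformly from $[0, 1]$, $d$ is a perceptual distance function (based on VGG features), and $\epsilon$ is a small step size.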

Lower PPL values indicate smoother, more consistent interpolations between generated images, which correlates with better disentanglement and more meaningful latent representations.
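
The structure of this estimate can be sketched with toy components. Everything below is an assumption made for illustration: the function name `path_length_estimate`, the linear "generator", and the squared L2 "distance" that replaces a real perceptual distance. The sketch picks the two nearby points on a straight line between two latent codes.

```python
import torch


def path_length_estimate(generator, distance, w1, w2, eps=1e-4):
    """Finite-difference estimate of how fast the output changes along a latent path."""
    # Random position t on the straight line between w1 and w2.
    t = torch.rand(w1.shape[0], 1, dtype=w1.dtype)
    w_a = w1 + t * (w2 - w1)
    w_b = w1 + (t + eps) * (w2 - w1)
    d = distance(generator(w_a), generator(w_b))   # one value per sample
    return (d / eps**2).mean()


# Stand-ins (assumptions): a linear "generator" and a squared L2 "distance".
# float64 keeps the small finite difference numerically stable.
torch.manual_seed(0)
A = torch.randn(16, 32, dtype=torch.float64)


def toy_generator(w):
    return w @ A


def toy_distance(a, b):
    return ((a - b) ** 2).sum(dim=1)


w1 = torch.randn(8, 16, dtype=torch.float64)
w2 = torch.randn(8, 16, dtype=torch.float64)
ppl = path_length_estimate(toy_generator, toy_distance, w1, w2)
print(ppl.item())
```

For this linear toy generator the estimate does not depend on $t$, because the output changes at the same rate everywhere on the path. A real generator would be evaluated with a trained network and a perceptual distance instead.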

##### StyleGAN2: Addressing Artifacts and Improving Quality

The success of StyleGAN led to StyleGAN2, which addressed several artifacts present in the original architecture. The most notable were blob-shaped artifacts resembling water droplets, which appear in generated images and are even more visible in the generator's intermediate feature maps.

StyleGAN2 introduced several architectural changes. The authors traced the droplet artifacts to AdaIN, which normalizes the mean and variance of each feature map separately, and replaced it with a weight demodulation scheme applied to the convolution weights. Progressive growing was dropped in favor of a fixed architecture with skip connections in the generator and residual connections in the discriminator. The redesigned generator removes the characteristic artifacts while maintaining the controllability that made StyleGAN famous.

StyleGAN2 also introduced path length regularization, which encourages a smoother mapping from the latent space to images and improves the perceptual path length metric. These changes resulted in higher image quality and more reliable controllability.

##### Applications and Impact

StyleGAN has influenced applications in computer graphics, entertainment, and research. Typical use cases include rapid prototyping of character designs and concept art, style exploration, and generating synthetic datasets for training other machine learning models.

The controllability of StyleGAN has opened new possibilities in image editing. Applications like semantic face editing, where users adjust specific facial attributes with simple sliders, become possible thanks to the more disentangled latent space. This level of control was previously achievable mostly through manual photo editing by skilled artists.

Perhaps most importantly, StyleGAN has demonstrated that generative models can achieve both high quality and meaningful controllability simultaneously. This combination has influenced the design of subsequent generative models.

##### Theoretical Implications and Future Directions

StyleGAN's success has important theoretical implications for understanding generative models. It demonstrates that the choice of latent space representation significantly affects the quality and controllability of generated content. The mapping network's role in transforming a simple noise distribution into a more structured intermediate representation provides insights into how neural networks can learn meaningful data representations.

The hierarchical control mechanism in StyleGAN suggests that natural images have an inherent hierarchical structure that can be exploited for better generation and editing. Similar coarse-to-fine ideas appear in generative models for other kinds of content.

Current research directions include extending StyleGAN's principles to other domains, improving the disentanglement of latent representations, and developing more efficient training procedures. The fundamental insights from StyleGAN continue to shape the development of next-generation generative models, promising even more powerful and controllable content creation tools in the future.

StyleGAN represents more than just a technical advancement; it embodies a new philosophy of generative modeling where controllability and quality go hand in hand. By understanding the principles behind StyleGAN, we gain insights not only into current capabilities but also into the future possibilities of AI-driven content creation. The architecture's elegant solution to the disentanglement problem continues to inspire new research directions and practical applications, making it a cornerstone of modern generative artificial intelligence.

#### Noise injection

The noise injection helps with something that the authors call **stochastic variation**. They argue that many aspects of a human face are stochastic, such as hair curls or freckles. By adding random noise at different levels in the generator, they can create more variability without changing the overall image. For example, in the image below, we can see how different noise vectors impact the placement of the hair.

<br>
<img src='../images/stochastic_variation.png' width=40% />
<br>

After each convolution layer in the generator, the authors added a noise injection layer. Random Gaussian noise is scaled by a learned factor and added to the output. Let's look at an example together:
* let's say that the output shape of the convolution layer is `(1, 256, 32, 32)` where 256 is the number of channels and 32x32 the spatial dimensions.
* we create a random noise matrix of dimensions `(1, 1, 32, 32)`
* we multiply the above random noise by a learned scaling factor vector of dimensions `(1, 256, 1, 1)`. This learned scaling factor is initialized with zeros.
* we add the scaled noise, which broadcasts to `(1, 256, 32, 32)`, to the output of the convolution layer.
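
The shapes in this example can be checked directly with plain tensors. The variable names below are illustrative, and the non-zero scale of 0.5 is an arbitrary value standing in for a learned one.

```python
import torch

torch.manual_seed(0)
conv_out = torch.randn(1, 256, 32, 32)   # output of the convolution layer
noise = torch.randn(1, 1, 32, 32)        # one noise value per spatial location
scale = torch.zeros(1, 256, 1, 1)        # per-channel scale, initialized with zeros

scaled_noise = scale * noise             # broadcasts over channels and pixels
print(scaled_noise.shape)                # torch.Size([1, 256, 32, 32])

# With a zero scale, the layer leaves the convolution output untouched.
print(torch.equal(conv_out + scaled_noise, conv_out))   # True

# With a non-zero scale, every channel receives the same noise pattern,
# multiplied by its own factor.
scale = torch.full((1, 256, 1, 1), 0.5)
out = conv_out + scale * noise
print(torch.allclose(out - conv_out, 0.5 * noise.expand(1, 256, 32, 32), atol=1e-6))  # True
```

Because the scale starts at zero, the generator begins training without any injected noise and learns how much noise each channel should receive.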

In [1]:
import torch
import torch.nn as nn

import tests

class ApplyNoise(nn.Module):
    """StyleGAN noise injection layer that adds scaled Gaussian noise to feature maps.
    
    This layer implements the noise injection mechanism used in StyleGAN architectures.
    It adds per-channel scaled noise to the input tensor, where the scaling factors
    are learned parameters. The noise is sampled independently for each spatial location.

    Attributes:
        channels (int): Number of input/output channels.
        weights (nn.Parameter): Learnable scaling factors for noise injection,
            shape (1, channels, 1, 1).

    Example:
        >>> noise_layer = ApplyNoise(channels=512)
        >>> x = torch.randn(1, 512, 64, 64)
        >>> noisy_x = noise_layer(x)
        >>> noisy_x.shape
        torch.Size([1, 512, 64, 64])
    """

    def __init__(self, channels: int):
        """Initialize noise injection layer.
        
        Args:
            channels: Number of input feature map channels.
        """
        super(ApplyNoise, self).__init__()
        self.channels = channels
        self.weights = nn.Parameter(torch.zeros(1, channels, 1, 1))
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply scaled noise to input tensor.
        
        Args:
            x: Input tensor of shape (batch, channels, height, width).
            
        Returns:
            Tensor with same shape as input, with learned noise added.
            
        Note:
            - Noise is resampled for each forward pass (different each call)
            - Weights are initialized to zero (no noise initially)
        """
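        # One single-channel noise map per sample, broadcast across channels by self.weights
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device, dtype=x.dtype)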
        return x + self.weights * noise

In [2]:
apply_noise = ApplyNoise(512)

In [3]:
tests.check_apply_noise(apply_noise)

Basic noise layer structure verified


#### Adaptive instance normalization


Adaptive instance normalization (AdaIN) is a variation of the **Instance Normalization layer**. In the course, we have discussed the importance of Batch Normalization layers. However, we have also seen that in some cases (e.g., when using gradient penalties), Batch Normalization is not the preferred type of normalization layer.

This figure from the [Group Normalization](https://arxiv.org/pdf/1803.08494.pdf) paper helps to understand the differences between the normalization layers. In the figure below, $H$ and $W$ are the spatial dimensions, $C$ the channel dimension and $N$ the batch dimension.


<br>
<img src='../images/normalization_layers.png' width=80% />
<br>

In [**Batch Normalization**](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html), we normalize pixels of the same channel, across the batch and spatial dimensions.

In [**Layer Normalization**](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html), we normalize pixels of the same batch index, across the channel and spatial dimensions.

In [**Instance Normalization**](https://pytorch.org/docs/stable/generated/torch.nn.InstanceNorm2d.html), we normalize pixels of the same batch index and channel, across the spatial dimensions only.

In [**Group Normalization**](https://pytorch.org/docs/stable/generated/torch.nn.GroupNorm.html), we split the channels of each batch index into groups and normalize the pixels of each group, across the group's channels and the spatial dimensions.
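
The four layers differ only in the axes over which the mean and variance are computed. The sketch below computes each normalization by hand on an `[N, C, H, W]` tensor and compares it with the corresponding PyTorch layer in its default configuration. The helper `normalize`, the tensor sizes, and the choice of 2 groups are assumptions for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6, 5, 5)   # [N, C, H, W]
eps = 1e-5                    # default epsilon of the PyTorch layers used below


def normalize(t: torch.Tensor, dims) -> torch.Tensor:
    """Subtract the mean and divide by the standard deviation over the given axes."""
    mean = t.mean(dim=dims, keepdim=True)
    var = t.var(dim=dims, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps)


batch_norm = normalize(x, (0, 2, 3))      # per channel, across N, H, W
layer_norm = normalize(x, (1, 2, 3))      # per sample, across C, H, W
instance_norm = normalize(x, (2, 3))      # per sample and channel, across H, W
grouped = x.reshape(4, 2, 3, 5, 5)        # 2 groups of 3 channels each
group_norm = normalize(grouped, (2, 3, 4)).reshape(x.shape)

# Freshly created layers have unit scale and zero bias, so they should match.
print(torch.allclose(batch_norm, nn.BatchNorm2d(6)(x), atol=1e-4))
print(torch.allclose(layer_norm, nn.LayerNorm([6, 5, 5])(x), atol=1e-4))
print(torch.allclose(instance_norm, nn.InstanceNorm2d(6)(x), atol=1e-4))
print(torch.allclose(group_norm, nn.GroupNorm(2, 6)(x), atol=1e-4))
```

Instance normalization is the only variant here whose statistics depend on neither the other samples in the batch nor the other channels.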

The AdaIN layer is an extension of the Instance Normalization layer. It takes as input both the output of the previous convolution layer and the latent vector $w$. Then it performs the following:
* map the latent vector $w$ to style vectors $(y_{s}, y_{b})$ through learned affine transformations (fully connected layers).
* calculate the output of the layer using the following equation: $y_{s} \cdot \text{IN}(x) + y_{b}$, where $\text{IN}(x)$ is the input $x$ fed through an instance normalization layer.

In [4]:
class AdaIN(nn.Module):
    """Adaptive Instance Normalization layer for StyleGAN architectures.
    
    Implements feature map normalization conditioned on a latent vector w,
    allowing style transfer through learned affine transformations.

    Attributes:
        channels (int): Number of input/output channels
        w_dim (int): Dimension of latent style vector
        instance_norm (nn.InstanceNorm2d): Instance normalization layer
        linear_s (nn.Linear): Style network for scaling (gamma)
        linear_b (nn.Linear): Style network for bias (beta)

    Example:
        >>> adain = AdaIN(channels=512, w_dim=128)
        >>> x = torch.randn(4, 512, 64, 64)  # Input features
        >>> w = torch.randn(4, 128)          # Style vector
        >>> out = adain(x, w)
        >>> out.shape
        torch.Size([4, 512, 64, 64])

    Note:
        - Expects float32 tensors of shape [N, C, H, W] (x) and [N, W_DIM] (w)
        - Output has the same shape as the input x
        - Implements equation: output = (x_norm * gamma) + beta
    """
    def __init__(self, channels: int, w_dim: int):
        """Initialize AdaIN layer.
        
        Args:
            channels: Number of input feature channels
            w_dim: Dimension of style latent vector
        """
        super().__init__()
        self.channels = channels
        self.w_dim = w_dim
        self.instance_norm = nn.InstanceNorm2d(channels)
        self.linear_s = nn.Linear(w_dim, channels)  # Gamma network
        self.linear_b = nn.Linear(w_dim, channels)  # Beta network
        
    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        """Apply adaptive instance normalization.
        
        Args:
            x: Input tensor of shape [N, C, H, W]
            w: Style vector of shape [N, W_DIM]
            
        Returns:
            Normalized and styled tensor of same shape as input
        """
        x = self.instance_norm(x)       
        ys = self.linear_s(w)[..., None, None]  # Add spatial dims
        yb = self.linear_b(w)[..., None, None]
        return x * ys + yb  # Scale and shift

In [6]:
adain = AdaIN(512, 128)

In [7]:
tests.check_adain(adain)

Congrats, you successfully implemented the AdaIN layer!
