## **Generative Adversarial Networks for Synthesizing New Data (Part 2/2)**

### **Improving the quality of synthesized images using a convolutional and Wasserstein GAN**

- We will implement a `DCGAN`, which will enable us to improve on the performance we saw in the previous GAN example. Additionally, we will briefly discuss a further key technique, the `Wasserstein GAN (WGAN)`.

- The techniques covered in this section include:
  - Transposed convolution
  - Batch normalization (BatchNorm)
  - WGAN loss function

- `DCGAN` stands for Deep Convolutional Generative Adversarial Network. It is a type of GAN that uses convolutional neural networks (CNNs) in both the generator and discriminator architectures. The key idea behind DCGANs is to leverage the power of CNNs to capture spatial hierarchies in image data, which helps in generating more realistic images.

#### **Transposed convolution**

- `Transposed convolution`, also known as `deconvolution` or `fractionally strided convolution`, is a technique used in neural networks to upsample feature maps. It is commonly used in the generator network of GANs to increase the spatial dimensions of the generated images.

- In a standard convolution operation, we apply a filter to an input feature map to produce a smaller output feature map. In contrast, transposed convolution works in the opposite direction: it takes a smaller input feature map and produces a larger output feature map by applying learned filters in a way that "reverses" the convolution process.

- This operation allows the generator to create higher-resolution images from `lower-dimensional latent vectors`, enabling it to produce more detailed and realistic outputs.

- Transposed convolution is particularly useful in `GANs` because it helps the generator learn to create images that closely resemble the training data by progressively refining and upsampling the generated images through multiple layers of transposed convolutions.

- To understand the `transposed convolution` operation, let’s go through a simple thought experiment. Assume that we have an input feature map of size `n×n`. Then, we apply a `2D convolution` operation with certain `padding` and `stride` parameters to this `n×n` input, resulting in an output feature map of size `m×m`. Now, the question is: how can we apply another convolution operation to this `m×m` output feature map to obtain a feature map with the initial dimension `n×n`, while maintaining the connectivity patterns between the input and output? Note that only the shape of the `n×n` input matrix is recovered, not the actual matrix values.
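
The shape bookkeeping in this thought experiment can be checked with the standard output-size formulas for a single spatial axis. The following is a minimal sketch: the helper names and the example values (`n = 7`, kernel size 3, stride 2, padding 1) are assumptions chosen for illustration, and dilation is taken to be 1.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Output size of a standard convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(m, kernel, stride=1, padding=0, output_padding=0):
    """Output size of a transposed convolution along one spatial axis."""
    return (m - 1) * stride - 2 * padding + kernel + output_padding


n = 7  # illustrative input size (assumption)
m = conv_out_size(n, kernel=3, stride=2, padding=1)                 # n x n -> m x m
n_back = conv_transpose_out_size(m, kernel=3, stride=2, padding=1)  # m x m -> n x n
print(m, n_back)
```

With these settings the transposed convolution recovers the original size, which matches the point made above: the shape comes back, the values do not.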


![Transposed Convolution Operation](./figures/17_09.png)


- Upsampling feature maps using `transposed convolution` is a crucial technique in the generator network of GANs, as it allows the model to create high-resolution images that closely resemble the training data.

- It works by inserting zeros between the elements of the input feature map, effectively increasing its spatial dimensions. Then, a standard convolution operation is applied to this expanded feature map using learned filters, which helps to refine and generate the final output image.
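
The zero-insertion view described above can be compared numerically against PyTorch's built-in transposed convolution. This is a sketch under stated assumptions: a single input and output channel, no padding in the transposed convolution, and illustrative sizes (`4×4` input, `3×3` kernel, stride 2). In this formulation the regular convolution is applied with the spatially flipped kernel and a border of `k - 1` zeros.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
stride, k = 2, 3               # illustrative values (assumptions)
x = torch.randn(1, 1, 4, 4)    # [batch, channels, height, width]
w = torch.randn(1, 1, k, k)    # one input channel, one output channel

# Reference result from the built-in transposed convolution
y_ref = F.conv_transpose2d(x, w, stride=stride)

# Step 1: insert (stride - 1) zeros between neighboring input elements
n = x.shape[-1]
size = (n - 1) * stride + 1
stuffed = torch.zeros(1, 1, size, size)
stuffed[:, :, ::stride, ::stride] = x

# Step 2: pad the border with k - 1 zeros and apply a regular stride-1
# convolution using the spatially flipped kernel
y_manual = F.conv2d(stuffed, torch.flip(w, dims=[2, 3]), padding=k - 1)

print(y_ref.shape)
print(torch.allclose(y_ref, y_manual, atol=1e-5))
```

Both routes produce a `9×9` output from the `4×4` input, since `(4 - 1) * 2 + 3 = 9`.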



![Applying transposed convolution to a 4×4 input](./figures/17_10.png)



- In summary, `transposed convolution` is a powerful technique that enables the generator in GANs to produce high-quality images by progressively upsampling and refining feature maps through learned convolutional filters.

#### **Batch normalization**

- `Batch normalization (BatchNorm)` is a technique used in deep learning to stabilize and accelerate the training of neural networks. It works by normalizing the inputs of each layer to have a mean of zero and a standard deviation of one, which helps to reduce internal covariate shift.

- In the context of GANs, `BatchNorm` is commonly applied to both the generator and discriminator networks. By normalizing the activations within each mini-batch, `BatchNorm` helps to improve the convergence of the training process and allows for the use of higher learning rates.

- `BatchNorm` also helps to mitigate issues such as mode collapse in GANs, where the generator produces limited diversity in its outputs. By stabilizing the training dynamics, `BatchNorm` encourages the generator to explore a wider range of outputs, leading to more diverse and realistic synthesized images.

Assume that we have the net preactivation feature maps obtained after a convolutional layer in a four-dimensional tensor, $Z$, with the shape `[m×c×h×w]`, where `m` is the number of examples in the batch (i.e., batch size), `h×w` is the spatial dimension of the feature maps, and $c$ is the number of channels. `BatchNorm` can be summarized in three steps, as follows:

1. **Compute the mean and variance** for each channel across the mini-batch and spatial dimensions:
   $$
   \mu_c = \frac{1}{m \cdot h \cdot w} \sum_{i=1}^{m} \sum_{j=1}^{h} \sum_{k=1}^{w} Z_{i,c,j,k}
   $$
   
   $$
   \sigma_c^2 = \frac{1}{m \cdot h \cdot w} \sum_{i=1}^{m} \sum_{j=1}^{h} \sum_{k=1}^{w} (Z_{i,c,j,k} - \mu_c)^2
   $$
   where $Z_{i,c,j,k}$ represents the value of the feature map at the `i`-th example in the batch, `c`-th channel, and spatial location `(j, k)`.
   
   Here, $\mu_c$ and $\sigma_c^2$ are the mean and variance for channel `c`, respectively.

2. **Normalize the feature maps** using the computed mean and variance: 
   $$
   \hat{Z}_{i,c,j,k} = \frac{Z_{i,c,j,k} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}
   $$

   where $\epsilon$ is a small constant added for numerical stability.

3. **Scale and shift** the normalized feature maps using learnable parameters $\gamma_c$ and $\beta_c$:
   $$
   Y_{i,c,j,k} = \gamma_c \hat{Z}_{i,c,j,k} + \beta_c
   $$

![Batch Normalization Process](./figures/17_11.png)

- The PyTorch API provides a class, `nn.BatchNorm2d()` (`nn.BatchNorm1d()` for 1D input), that we can use as a layer when defining our models; it will perform all of the steps that we described for `BatchNorm`.
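- Note that the layer behaves differently depending on whether the model is in training or evaluation mode. In training mode (`model.train()`), it normalizes with the statistics of the current mini-batch and updates running estimates of the mean and variance; in evaluation mode (`model.eval()`), it normalizes with these running estimates instead. The learnable parameters, $\gamma$ and $\beta$, are updated only during training and are then used unchanged during evaluation.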
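
As a sanity check, the three `BatchNorm` steps can be written out by hand and compared with `nn.BatchNorm2d`. This is a minimal sketch: the tensor shape, the scaling and offset of the random input, and `eps = 1e-5` are assumptions chosen for illustration, and $\gamma = 1$, $\beta = 0$ correspond to the layer's initial parameter values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
Z = torch.randn(8, 3, 5, 5) * 3 + 2   # [m, c, h, w], illustrative values
eps = 1e-5

# Step 1: per-channel mean and (biased) variance over batch and spatial axes
mu = Z.mean(dim=(0, 2, 3), keepdim=True)
var = Z.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

# Step 2: normalize
Z_hat = (Z - mu) / torch.sqrt(var + eps)

# Step 3: scale and shift (gamma = 1, beta = 0 match the layer's initial values)
gamma = torch.ones(1, 3, 1, 1)
beta = torch.zeros(1, 3, 1, 1)
Y_manual = gamma * Z_hat + beta

bn = nn.BatchNorm2d(num_features=3, eps=eps)
bn.train()            # training mode: normalize with mini-batch statistics
Y_train = bn(Z)
print(torch.allclose(Y_manual, Y_train, atol=1e-5))

bn.eval()             # evaluation mode: normalize with running estimates
Y_eval = bn(Z)
print(torch.allclose(Y_train, Y_eval, atol=1e-3))
```

The first comparison agrees because the layer performs the same three steps in training mode. The second differs because, after a single training-mode forward pass, the running estimates are still far from the statistics of this particular batch.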

### **Implementing the generator and discriminator**

The architectures of the generator and discriminator networks are summarized in the following two figures.


- The generator takes a vector, $z$, of size `100` as input. Then, a series of `transposed convolutions` using `nn.ConvTranspose2d()` upsamples the feature maps until the spatial dimension of the resulting feature maps reaches `28×28`. The number of channels is reduced by half after each transposed convolutional layer, except the last one, which uses only one output filter to generate a grayscale image. Each transposed convolutional layer is followed by `BatchNorm` and a `leaky ReLU` activation function, except the last one, which uses a `tanh` activation (without BatchNorm).


**Architecture of the Generator**

![Arch. of the generator](./figures/17_12.png)
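
One way to realize the generator described above is sketched below. The function name, the number of base filters (`n_filters=32`), the leaky ReLU slope (`0.2`), and the specific kernel sizes, strides, and paddings are assumptions chosen so that the spatial size grows `1 → 4 → 7 → 14 → 28`; the figure may use different values.

```python
import torch
import torch.nn as nn


def make_generator_network(input_size=100, n_filters=32):
    """DCGAN-style generator sketch: [B, input_size, 1, 1] -> [B, 1, 28, 28]."""
    return nn.Sequential(
        # 1x1 -> 4x4
        nn.ConvTranspose2d(input_size, n_filters * 4, kernel_size=4,
                           stride=1, padding=0, bias=False),
        nn.BatchNorm2d(n_filters * 4),
        nn.LeakyReLU(0.2),
        # 4x4 -> 7x7, channels halved
        nn.ConvTranspose2d(n_filters * 4, n_filters * 2, kernel_size=3,
                           stride=2, padding=1, bias=False),
        nn.BatchNorm2d(n_filters * 2),
        nn.LeakyReLU(0.2),
        # 7x7 -> 14x14, channels halved
        nn.ConvTranspose2d(n_filters * 2, n_filters, kernel_size=4,
                           stride=2, padding=1, bias=False),
        nn.BatchNorm2d(n_filters),
        nn.LeakyReLU(0.2),
        # 14x14 -> 28x28, single output channel, tanh and no BatchNorm
        nn.ConvTranspose2d(n_filters, 1, kernel_size=4,
                           stride=2, padding=1, bias=False),
        nn.Tanh(),
    )


gen_model = make_generator_network()
z = torch.randn(8, 100, 1, 1)      # batch of 8 latent vectors
images = gen_model(z)
print(images.shape)                # torch.Size([8, 1, 28, 28])
```

The latent vector is reshaped to `[B, 100, 1, 1]` so that it can be treated as a `1×1` feature map with 100 channels.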


The discriminator receives images of size `1×28×28`, which are passed through four `convolutional layers`. The first three convolutional layers reduce the spatial dimensionality by a factor of `4` while increasing the number of channels of the feature maps. Each of these convolutional layers is also followed by `BatchNorm` and `leaky ReLU` activation. The last convolutional layer uses kernels of size `7×7` and a single filter to reduce the spatial dimensionality of the output to `1×1×1`. Finally, the convolutional output is followed by a `sigmoid function` and squeezed to one dimension:


**Architecture of the Discriminator**

![Arch. of the discriminator](./figures/17_13.png)
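
A discriminator along the lines of this description might look like the following sketch. The class layout, `n_filters=32`, the leaky ReLU slope, and the kernel sizes and strides of the first three layers are assumptions chosen so that the spatial size goes `28 → 14 → 7 → 7` before the final `7×7` kernel collapses it to `1×1`; the figure may specify different values.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """DCGAN-style discriminator sketch: [B, 1, 28, 28] -> [B] probabilities."""

    def __init__(self, n_filters=32):
        super().__init__()
        self.network = nn.Sequential(
            # 28x28 -> 14x14
            nn.Conv2d(1, n_filters, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(n_filters),
            nn.LeakyReLU(0.2),
            # 14x14 -> 7x7
            nn.Conv2d(n_filters, n_filters * 2, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(n_filters * 2),
            nn.LeakyReLU(0.2),
            # 7x7 -> 7x7 (more channels, same spatial size)
            nn.Conv2d(n_filters * 2, n_filters * 4, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(n_filters * 4),
            nn.LeakyReLU(0.2),
            # 7x7 -> 1x1 with a single filter, followed by a sigmoid
            nn.Conv2d(n_filters * 4, 1, kernel_size=7, stride=1, padding=0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # [B, 1, 1, 1] -> [B]
        return self.network(x).view(-1)


disc_model = Discriminator()
batch = torch.randn(8, 1, 28, 28)
probs = disc_model(batch)
print(probs.shape)   # torch.Size([8])
```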




---

### **Dissimilarity measures between two distributions**

- The goal of a generative model is to learn how to synthesize new samples that follow the same distribution as the training dataset.

![methods to measure the dissimilarity between distributions P and Q](./figures/17_14.png)

**Let's gain an understanding of these measures by briefly stating what they are trying to accomplish in simple words:**

- The first one, `TV` distance, measures the largest difference between the two distributions at each point.
- The `EM` distance can be interpreted as the minimal amount of work needed to transform one distribution into the other.
- The `Kullback-Leibler (KL)` and `Jensen-Shannon (JS)` divergence measures come from the field of information theory. `KL` divergence is not symmetric, that is, $KL(P||Q) \neq KL(Q||P)$ in contrast to JS divergence.
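
To make these measures concrete, the sketch below evaluates them for two small discrete distributions. The distributions `P` and `Q` are made-up illustrative values (not the example from the figure), the function names are hypothetical, the TV distance follows the description above (largest pointwise difference), and the EM distance uses the cumulative-distribution shortcut that holds for 1-D distributions on an evenly spaced grid with unit spacing.

```python
import math


def tv_distance(p, q):
    # Largest pointwise difference between the two distributions
    return max(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    # Terms with p_i = 0 contribute 0; assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    # Symmetrized KL divergence against the mixture m = (p + q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def em_distance_1d(p, q):
    # Sum of absolute differences between the cumulative distributions
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


P = [0.2, 0.5, 0.3]   # illustrative values (assumption)
Q = [0.4, 0.4, 0.2]

print("TV:", tv_distance(P, Q))
print("EM:", em_distance_1d(P, Q))
print("KL(P||Q):", kl_divergence(P, Q), "KL(Q||P):", kl_divergence(Q, P))
print("JS(P,Q):", js_divergence(P, Q), "JS(Q,P):", js_divergence(Q, P))
```

The two KL values differ while the two JS values coincide, which illustrates the asymmetry noted above.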





![Example of calculating the different dissimilarity measures](./figures/17_15.png)





---
#### **Earth Mover’s Distance (EMD)**

**1. Problem Context: Why EMD Was Introduced**

In classical GANs, the generator and discriminator are trained using a divergence measure between the **real data distribution** $`p_{\mathrm{data}}`$ and the **generated distribution** $`p_{G}`$:

* Jensen–Shannon Divergence (JSD)

However:

* When the supports of $`p_{\mathrm{data}}`$ and $`p_{G}`$ **do not overlap** (as is common early in training),

  * $`\mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_{G})`$ becomes **constant** (equal to $`\log 2`$) → gradients become **zero**.

This leads to *unstable training* and *vanishing gradients*.

A different way was needed to measure “distance” between distributions **even when they do not overlap**.

This leads us to the **Earth Mover’s Distance (EMD).**



**2. Definition of Earth Mover’s Distance**

Think of two distributions as piles of **dirt**.

* $`p_{\mathrm{data}}`$ is one pile.
* $`p_{G}`$ is another.

The **Earth Mover’s Distance** asks:

> **What is the minimum amount of work needed to move one pile into the shape of the other?**

Work = (amount of dirt) × (distance moved)

Formally:

$$\mathrm{EMD}(p_{\mathrm{data}}, p_{G})
\;\;=\;\;
\inf_{\gamma \in \Pi(p_{\mathrm{data}}, p_{G})}
\mathbb{E}_{(x,y)\sim\gamma} \left[\,\|x - y\|\,\right]$$

Where:

* $`\gamma`$ is a **transport plan** describing how mass moves from $x$ to $y$
* $`\Pi(p_{\mathrm{data}}, p_{G})`$ is the set of all valid transport plans.



**3. Why EMD is Better for GAN Training**

If two distributions **do not overlap**, EMD still gives a meaningful measure:

* It provides a **smooth gradient**
* It reflects **how far** generated samples must move to match the real ones
* Gradient does **not vanish** even when distributions are disjoint

This is crucial for learning.
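
A small numerical illustration of this point, under simplifying assumptions: the "real" and "generated" distributions are point masses on a 1-D grid with unit spacing, so their supports never overlap once the generated mass is shifted. The function names are hypothetical helpers for this sketch.

```python
import math


def js_divergence(p, q):
    def kl(a, b):
        # Terms with a_i = 0 contribute 0 by convention
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def em_distance_1d(p, q):
    # Valid for 1-D distributions on an evenly spaced grid with unit spacing
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=10):
    return [1.0 if i == position else 0.0 for i in range(size)]


real = point_mass(0)
results = []
for shift in (1, 3, 6):
    fake = point_mass(shift)
    results.append((shift, js_divergence(real, fake), em_distance_1d(real, fake)))

for shift, js, em in results:
    print(f"shift={shift}  JS={js:.4f}  EMD={em:.1f}")
```

The JS divergence stays at $\log 2 \approx 0.6931$ regardless of how far apart the two point masses are, so it gives no signal about which direction to move. The EMD equals the shift itself and therefore shrinks steadily as the generated mass approaches the real one.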



**4. From EMD to the WGAN Objective**

Directly computing EMD is expensive.
So WGAN uses a dual form known as the **Kantorovich–Rubinstein Duality**:

$$
\mathrm{EMD}(p_{\mathrm{data}}, p_G)
= \sup_{\|f\|_L \le 1}
\mathbb{E}_{x\sim p_{\mathrm{data}}}[f(x)]
- \mathbb{E}_{x\sim p_G}[f(x)]
$$


Here:

* $`f`$ is restricted to be **1-Lipschitz**
* In WGAN, $`f`$ is represented by a neural network (called the **critic**; not a discriminator)

So the WGAN objective becomes:

$$
\max_{w \in \mathrm{Lip}(1)}
\mathbb{E}_{x\sim p_{\mathrm{data}}}[D_w(x)]
- \mathbb{E}_{z\sim p_z}[D_w(G(z))]
$$

And the **generator** minimizes the same quantity. Because the first term does not depend on $`G`$, this reduces to:

$`\min_G \; -\mathbb{E}_{z\sim p_z}[D_w(G(z))]`$

Notice:

* No log terms
* No probabilities
* No sigmoid activation
* No binary cross-entropy

The critic outputs a **real number**, not a probability.



**5. Why the Lipschitz Constraint Matters**

WGAN requires $`D_w`$ to be **1-Lipschitz**, meaning:

$$|D_w(x_1) - D_w(x_2)| \le \|x_1 - x_2\|$$

This restriction ensures the output behaves like a distance function — enabling **stable gradients**.

**How to enforce it:**

| Method                 | Used In         | Notes                     |
| ---------------------- | --------------- | ------------------------- |
| Weight Clipping        | WGAN (original) | Simple but harms capacity |
| Gradient Penalty       | WGAN-GP         | Standard and stable       |
| Spectral Normalization | SN-GAN          | Common in image models    |
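
As an illustration of the simplest entry in the table, weight clipping can be sketched as an in-place clamp applied to every critic parameter after each critic update. The critic architecture, the helper name `clip_critic_weights`, and the threshold `clip_value = 0.01` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small critic; note there is no sigmoid on the output
critic = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),
    nn.LeakyReLU(0.2),
    nn.Linear(64, 1),
)
clip_value = 0.01  # assumed clipping threshold


def clip_critic_weights(model, c):
    """Clamp every parameter of `model` into [-c, c] in place."""
    with torch.no_grad():
        for param in model.parameters():
            param.clamp_(-c, c)


# In a training loop this would run right after each critic optimizer step
clip_critic_weights(critic, clip_value)
max_abs = max(p.abs().max().item() for p in critic.parameters())
print(max_abs <= clip_value)
```

Forcing all weights into such a narrow range is what limits the critic's capacity, which is the drawback noted in the table.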



**6. Key Conceptual Difference**

| Model         | Discriminator Output    | Optimization Goal                          |
| ------------- | ----------------------- | ------------------------------------------ |
| Classical GAN | Probability (0 to 1)    | Classify real vs fake                      |
| WGAN          | Real number (unbounded) | Measure *difference* between distributions |

The WGAN critic does **not** try to classify.
It tries to assign **higher scores** to real data than generated data.



**7. Intuition Summary**

* EMD measures how much *work* is needed to turn fake samples into real ones.
* WGAN approximates EMD via the critic.
* Because EMD has **smooth gradients**, WGAN avoids:

  * Vanishing gradients
  * Training collapse
  * Oscillation instability

This is why `WGAN` is significantly more stable than classical GANs.

---

### **Using EM distance in practice for GANs**

**The WGAN / WGAN-GP Training Procedure**

**1. Key Reminder**

Although the theoretical metric is *Earth Mover’s Distance*, **we do *not* compute EMD directly** in practice.

Instead, we use the **Wasserstein GAN (WGAN)** formulation, which provides a **trainable approximation** of EMD using a *critic network* (not a discriminator).



**2. Replace the Discriminator With a Critic**

| Classical GAN                                    | WGAN                                                  |
| ------------------------------------------------ | ----------------------------------------------------- |
| Discriminator outputs probability `D(x) ∈ [0,1]` | Critic outputs a real score `C(x) ∈ ℝ`                |
| Uses BCE loss                                    | Uses Wasserstein loss                                 |
| Classifies real vs fake                          | Computes distance between real and fake distributions |

No sigmoid activation is used in the critic’s final layer.



**3. Wasserstein Loss Functions (Practical Training Objective)**

**Critic loss (maximize difference)**

$$L_C = -\mathbb{E}_{x \sim p_{\mathrm{data}}}[C(x)] + \mathbb{E}_{z \sim p_z}[C(G(z))]$$

**Generator loss (minimize critic score for fake samples)**

$$L_G = -\mathbb{E}_{z \sim p_{z}}[C(G(z))]$$

These work as *minimization* losses in code:

```python
# C_real = critic(real_images) and C_fake = critic(generator(z)) are the critic scores
loss_critic = -(torch.mean(C_real) - torch.mean(C_fake))
loss_generator = -torch.mean(C_fake)
```



**4. Enforcing the 1-Lipschitz Constraint**

**This is the critical practical step.**
The critic **must** satisfy the Lipschitz condition:

$$|C(x_1)-C(x_2)| \le \|x_1-x_2\|$$

There are 3 standard enforcement mechanisms:

| Method                 | Name          | Stability         | Most Common Today      |
| ---------------------- | ------------- | ----------------- | ---------------------- |
| Weight Clipping        | Original WGAN | Unstable          | ❌ Not recommended      |
| Gradient Penalty       | **WGAN-GP**   | Very stable       | ✅ Standard approach    |
| Spectral Normalization | SN-GAN        | Stable, efficient | ✅ Often used in images |



**5. WGAN-GP Gradient Penalty Term**

Add the penalty:

$$\lambda \cdot \mathbb{E}_{\hat{x}\sim p_{\hat{x}}}
\left[
(\lVert \nabla_{\hat{x}} C(\hat{x}) \rVert_2 - 1)^2
\right]$$

where $`\hat{x}`$ is an interpolation between real and fake samples.

In code:

```python
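import torch

# Assumes `critic`, `real_images`, `fake_images`, `batch_size`, and `device`
# are defined elsewhere; both image batches have shape [batch_size, C, H, W].

# One random mixing coefficient per example, broadcast over C, H, and W
epsilon = torch.rand(batch_size, 1, 1, 1, device=device)
x_hat = epsilon * real_images + (1 - epsilon) * fake_images.detach()
x_hat.requires_grad_(True)

# Gradient of the critic scores with respect to the interpolated inputs
critic_scores = critic(x_hat)
gradients = torch.autograd.grad(
    outputs=critic_scores,
    inputs=x_hat,
    grad_outputs=torch.ones_like(critic_scores),
    create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    retain_graph=True,
)[0]

# Per-example L2 norm of the gradient, penalized for deviating from 1
grad_norm = gradients.reshape(gradients.size(0), -1).norm(2, dim=1)
gradient_penalty = ((grad_norm - 1) ** 2).mean()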

Then:

```python
# lambda_gp is the gradient penalty weight (λ), e.g. 10
loss_critic = (torch.mean(C_fake) - torch.mean(C_real)) + lambda_gp * gradient_penalty
```



**6. Training Loop Structure**

WGAN trains the critic **more often** than the generator:

```python
for step in range(num_steps):
    for _ in range(n_critic):  # e.g., 5 critic updates per generator update
        ...  # update critic (minimize loss_critic)
    ...  # update generator (minimize loss_generator)
```

Typical values:

* `n_critic = 5`
* `λ = 10` (gradient penalty weight)
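
Putting the pieces together, a complete WGAN-GP loop might look like the sketch below. Everything about the setup is an assumption chosen to keep the example small and fast: 2-D toy "real" data instead of images, tiny fully connected networks, Adam with `lr=1e-4` and `betas=(0.0, 0.9)`, a batch size of 32, and only 20 outer steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup (assumptions): 2-D data and small MLPs instead of image models
latent_dim, data_dim = 4, 2
generator = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
c_opt = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))
n_critic, lambda_gp, batch_size = 5, 10.0, 32


def sample_real(n):
    # Stand-in for a real data loader: a Gaussian blob centered at (2, 2)
    return torch.randn(n, data_dim) * 0.5 + 2.0


def gradient_penalty(critic, real, fake):
    eps = torch.rand(real.size(0), 1)                  # one coefficient per example
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores), create_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()


history = []
for step in range(20):
    # Several critic updates per generator update
    for _ in range(n_critic):
        real = sample_real(batch_size)
        fake = generator(torch.randn(batch_size, latent_dim)).detach()
        loss_critic = (critic(fake).mean() - critic(real).mean()
                       + lambda_gp * gradient_penalty(critic, real, fake))
        c_opt.zero_grad()
        loss_critic.backward()
        c_opt.step()

    # One generator update: raise the critic score of generated samples
    fake = generator(torch.randn(batch_size, latent_dim))
    loss_generator = -critic(fake).mean()
    g_opt.zero_grad()
    loss_generator.backward()
    g_opt.step()

    history.append((loss_critic.item(), loss_generator.item()))

print(history[-1])
```

The generated batch is detached during critic updates so that only the critic's parameters receive gradients there; the generator is updated separately with a fresh batch.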



**7. Practical Signs of Correct Training**

| Symptom                                | Meaning                             | Fix                         |
| -------------------------------------- | ----------------------------------- | --------------------------- |
| Critic loss → large negative values    | Critic winning too strongly         | Increase generator updates  |
| Generator loss → large negative values | Generator overpowering critic       | Increase critic updates     |
| Critic gradient norms explode          | Lipschitz violation                 | Increase gradient penalty λ |
| Mode collapse (repeated outputs)       | Generator ignoring critic gradients | Increase n_critic           |



**8. Summary**

| Classical GAN                  | WGAN / WGAN-GP                        |
| ------------------------------ | ------------------------------------- |
| Uses Jensen-Shannon            | Uses Earth Mover (Wasserstein-1)      |
| Probabilities & BCE loss       | Real-valued critic & Wasserstein loss |
| Can suffer vanishing gradients | Stable gradients throughout           |
| Often unstable training        | Much more robust optimization         |

**In practice:**

* Use **WGAN-GP**
* Use **gradient penalty**
* Train **critic multiple times per generator step**

---


### **Gradient Penalty (WGAN-GP)**


**Context**

In the Wasserstein GAN (WGAN) framework, the discriminator is replaced by a **critic** that estimates the Wasserstein-1 distance between the real and generated data distributions. For this estimate to be valid, the critic must satisfy the **1-Lipschitz constraint**, meaning that the norm of its gradient with respect to the input must be bounded:

$$\|\nabla_x f(x)\| \le 1$$

Originally, WGAN enforced this using **weight clipping**, but this caused optimization issues (capacity loss, gradient explosion, and training instability).
Thus, **WGAN-GP** introduced **Gradient Penalty**, which enforces the Lipschitz constraint *smoothly*.



**Core Idea**

During training, the critic receives real samples $`x_{\mathrm{real}}`$ and generated samples $`x_{\mathrm{fake}}`$.
A **linear interpolation** is computed:

$`\hat{x} = \epsilon x_{\mathrm{real}} + (1 - \epsilon) x_{\mathrm{fake}}, \quad \epsilon \sim \mathrm{Uniform}(0, 1)`$

The gradient of the critic w.r.t. this interpolated sample is computed:

$$g = \nabla_{\hat{x}} f(\hat{x})$$

The **gradient norm** is:

$$\|g\|_2 = \sqrt{\sum_i g_i^2}$$

The gradient penalty term is then:

$$\lambda (\|g\|_2 - 1)^2$$

This term is added to the critic loss.
The full WGAN-GP critic loss:

$$\mathcal{L}_{\mathrm{critic}} = \mathbb{E}[f(x_{\mathrm{fake}})] - \mathbb{E}[f(x_{\mathrm{real}})] + \lambda \, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2]$$



**Why Interpolate Between Real and Fake?**

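The optimal critic under the Wasserstein distance has a gradient norm of **1 almost everywhere** along the straight lines that connect coupled points from the real and generated distributions.
Enforcing the constraint over the entire input space is intractable, so WGAN-GP samples points along straight lines between real and generated samples and penalizes the gradient norm there, **where the critic must be most accurate**.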



**Role of the Coefficient $`\lambda`$**

* $`\lambda`$ controls the strength of the Lipschitz penalty.
* Typical value: $`\lambda = 10`$
* Too small → Lipschitz condition violated → unstable training.
* Too large → critic becomes too smooth → weaker gradients → slow generator updates.



**Why Gradient Penalty Works Better than Weight Clipping**

| Method           | Effect                                                                  |
| ---------------- | ----------------------------------------------------------------------- |
| Weight Clipping  | Forces critic weights into a narrow range → underfits → poor gradients. |
| Gradient Penalty | Allows full network capacity while **gently enforcing** Lipschitzness.  |

Thus, **WGAN-GP** produces **more stable training**, **better gradients**, and **higher-quality samples**.



**Key Intuition**

The gradient penalty does **not** force every gradient to be exactly 1.
Instead, it **encourages** the critic to behave like a **smooth, well-conditioned metric function**, rather than a sharp classifier.
This is exactly what is required to measure Wasserstein distance.

---


### **Mode collapse**

Due to the adversarial nature of GAN models, it is notoriously hard to train them. One common cause of failure in training GANs is when the generator gets stuck in a small subspace and learns to generate similar samples. This is called `mode collapse`.

- The synthesized examples in the following figure are not cherry-picked. They show that the generator has failed to learn the entire data distribution and has instead taken a lazy approach, focusing on a subspace:


![Example of mode collapse](./figures/17_16.png)


Besides the vanishing and exploding gradient problems that we saw previously, there are some further aspects that can also make training GAN models difficult (indeed, it is an art). Here are a few suggested tricks from GAN artists.

- One approach is called `mini-batch discrimination`, which is based on the fact that batches consisting of only real or fake examples are fed separately to the discriminator. In mini-batch discrimination, we let the discriminator compare examples across these batches to see whether a batch is real or fake. The diversity of a batch consisting of only real examples is most likely higher than the diversity of a fake batch if a model suffers from mode collapse.

- Another technique that is commonly used for stabilizing GAN training is `feature matching`. In feature matching, we make a slight modification to the objective function of the generator by adding an extra term that minimizes the difference between the original and synthesized images based on intermediate representations (feature maps) of the discriminator.


- During the training, a GAN model can also get stuck in several modes and just hop between them. To avoid this behavior, you can store some old examples and feed them to the discriminator to prevent the generator from revisiting previous modes. This technique is referred to as experience replay. Furthermore, you can train multiple GANs with different random seeds so that the combination of all of them covers a larger part of the data distribution than any single one of them.
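
The feature matching idea mentioned above can be sketched as an extra generator loss term that compares batch-averaged intermediate features of real and synthesized images. The feature extractor below stands in for the intermediate layers of a discriminator; its architecture, the function name, and the use of a mean squared difference are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical intermediate layers of a discriminator
feature_extractor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 64),
    nn.LeakyReLU(0.2),
)


def feature_matching_loss(real_images, fake_images):
    # Batch-averaged features; the real side is treated as a fixed target
    real_features = feature_extractor(real_images).mean(dim=0).detach()
    fake_features = feature_extractor(fake_images).mean(dim=0)
    return torch.mean((real_features - fake_features) ** 2)


real_batch = torch.rand(16, 1, 28, 28) * 2 - 1   # placeholder "real" images in [-1, 1]
fake_batch = torch.rand(16, 1, 28, 28) * 2 - 1   # placeholder "generated" images
fm_loss = feature_matching_loss(real_batch, fake_batch)
same_loss = feature_matching_loss(real_batch, real_batch)
print(fm_loss.item(), same_loss.item())
```

In a full training loop, this term would be added to the generator's usual loss; it becomes zero when the average features of the two batches coincide.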

---