# **Generative Adversarial Networks for Synthesizing New Data (Part 1/2)**


### **Generative Adversarial Networks (GANs)**


**1. Core Idea**

A GAN consists of **two neural networks** trained **simultaneously but with opposing objectives**:

| Component             | Role                                   | Goal                                                            |
| --------------------- | -------------------------------------- | --------------------------------------------------------------- |
| **Generator (G)**     | Produces synthetic samples             | Fool the discriminator into thinking generated samples are real |
| **Discriminator (D)** | Classifies inputs as real or generated | Correctly distinguish real samples from generated samples       |

You can think of this as a **zero-sum game**:

* The **generator** tries to *minimize* the probability of being detected.
* The **discriminator** tries to *maximize* its classification accuracy.



**2. Mathematical Formulation**

The original GAN objective (Goodfellow et al., 2014):

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log (1 - D(G(z)))]$$

Where:

* $`p_{\text{data}}`$ is the real data distribution
* $`p_z`$ is a simple noise distribution (e.g., Gaussian or Uniform)
* $`z`$ is a random noise vector fed to the generator
* $`G(z)`$ is the generated fake sample
* $`D(x)`$ outputs a probability that $`x`$ is real

The training process alternates:

1. Update $`D`$ to maximize correct classification.
2. Update $`G`$ to minimize the discriminator’s ability to detect fakes.
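
As a rough illustration of the value function, the sketch below estimates $V(D, G)$ by Monte Carlo sampling on one-dimensional toy data. The names `toy_discriminator`, `toy_generator`, and `estimate_value`, and the Gaussian stand-in for $p_{\text{data}}$, are assumptions made for this example only.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def toy_discriminator(x):
    # Hypothetical D: treats larger x as more likely to be real.
    return sigmoid(2.0 * (x - 1.0))


def toy_generator(z):
    # Hypothetical G: simply rescales the noise.
    return 0.5 * z


def estimate_value(D, G, n=10_000, seed=0):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    rng = random.Random(seed)
    real = [rng.gauss(2.0, 0.5) for _ in range(n)]   # stand-in for p_data
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # p_z
    real_term = sum(math.log(D(x)) for x in real) / n
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise) / n
    return real_term + fake_term


v_informed = estimate_value(toy_discriminator, toy_generator)
v_blind = estimate_value(lambda x: 0.5, toy_generator)
print(f"V with toy D: {v_informed:.3f}, V with D = 0.5: {v_blind:.3f}")
```

A discriminator that always outputs 0.5 yields $-\log 4$, while the toy discriminator above, which separates the two toy distributions reasonably well, scores higher. This is the quantity the discriminator step tries to push up and the generator step tries to push down.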



**3. Generator Network (G)**

**Input:** Noise vector $`z \sim p_z`$ (e.g., $`z \in \mathbb{R}^{128}`$)

**Output:** Synthetic example in data space (e.g., $`28 \times 28`$ image)

Goal: Learn a **mapping**:

$$G: \mathbb{R}^k \rightarrow \mathbb{R}^n$$

* If generating images: often uses **transposed convolution layers** to upsample noise to image resolution.
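* In the unconditional setting (no class labels or other conditioning input), the generator is trained so that the distribution of its outputs **matches the real data distribution**.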



**4. Discriminator Network (D)**

**Input:** Either real or generated sample

**Output:** A scalar $`D(x) \in [0, 1]`$ (probability real)

Goal: Learn a **decision boundary** that distinguishes:

* Real data $`x \sim p_{\text{data}}`$
* Fake data $`G(z)`$

Typically uses **standard convolutional networks** for image data.



**5. Training Instability Issues**

GANs are known to be **unstable** due to the adversarial optimization structure.

Common issues:

| Issue                   | Description                                                              |
| ----------------------- | ------------------------------------------------------------------------ |
| **Mode Collapse**       | Generator produces limited diversity (e.g., same face for all inputs)    |
| **Non-convergence**     | Loss oscillates because the networks chase each other                    |
| **Vanishing Gradients** | Discriminator becomes too strong → generator receives no learning signal |



**6. Improvements to Stabilize Training**

| Improvement                    | Idea                                                | Key Change or Effect                                                        |
| ------------------------------ | --------------------------------------------------- | --------------------------------------------------------------------------- |
| **WGAN** (Wasserstein GAN)     | Measure distance using Earth-Mover distance         | $`\min_G \max_{D \in 1\text{-Lip}} \mathbb{E}[D(x)] - \mathbb{E}[D(G(z))]`$ |
| **Gradient Penalty (WGAN-GP)** | Enforces Lipschitz constraint using gradient norm   | Stabilizes gradients and prevents explosion                                 |
| **Spectral Normalization**     | Constrain weight matrix norms                       | Keeps discriminator from being overly sharp                                 |
| **Progressive Growing**        | Start small and grow networks (for high-res images) | Improves training stability for large outputs                               |
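
To make the WGAN-GP row concrete, the critic loss from Gulrajani et al. (2017) can be written as follows. This is a sketch for orientation; the symbols $\lambda$, $\epsilon$, and $\hat{x}$ are not defined elsewhere in these notes and are introduced here only for this equation.

$$L_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

Here $\hat{x} = \epsilon x + (1 - \epsilon) G(z)$ with $\epsilon \sim \text{Uniform}(0, 1)$ is a random interpolation between a real and a generated sample, and $\lambda$ weights the penalty (the paper uses $\lambda = 10$). The first two terms are the negated WGAN objective from the table; the last term pushes the gradient norm of $D$ toward 1 instead of clipping weights.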



**7. Why GANs Work (Game-Theoretic Perspective)**

GAN training seeks a **Nash equilibrium**:

* Generator’s best strategy matches the data distribution
* Discriminator’s best strategy outputs $`1/2`$ everywhere

At equilibrium:

$$p_G = p_{\text{data}}$$

Meaning the generator has learned to **model the true data distribution**.
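
The equilibrium claim can be checked numerically on a small discrete example. The sketch below uses the optimal-discriminator formula from Goodfellow et al. (2014), $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$; the two four-outcome distributions and the function names are made up for illustration.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each discrete outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value(p_data, p_g, d):
    """V(D, G) = sum_x p_data(x) log D(x) + p_g(x) log(1 - D(x))."""
    return sum(
        pd * math.log(dx) + pg * math.log(1.0 - dx)
        for pd, pg, dx in zip(p_data, p_g, d)
    )


p_data = [0.1, 0.2, 0.3, 0.4]
p_mismatch = [0.4, 0.3, 0.2, 0.1]  # a generator that has not matched p_data

# Case 1: generator matches the data distribution.
d_equal = optimal_discriminator(p_data, p_data)
v_equal = value(p_data, p_data, d_equal)

# Case 2: generator distribution differs from the data distribution.
d_mis = optimal_discriminator(p_data, p_mismatch)
v_mis = value(p_data, p_mismatch, d_mis)

print("D* when p_G = p_data:", d_equal)
print("V at equilibrium:", v_equal, " V with mismatch:", v_mis)
```

When $p_G = p_{\text{data}}$, the best discriminator can do no better than output $1/2$ everywhere and the value equals $-\log 4$. With a mismatched generator, the optimal discriminator achieves a strictly higher value, which is the gap the generator is trained to close.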



**8. Intuition Summary**

* The **generator** learns by **seeing where it fails**, not by being told the correct answer.
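* The **discriminator** provides a **learned, sample-specific training signal**: its gradients tell the generator how to change each output to look more real, instead of comparing it against a fixed target as in ordinary supervised learning.
* No **human-annotated labels** are required: the real/fake targets follow automatically from where each sample came from.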
* GANs are powerful for **distribution learning** and **sample generation**.



**9. Where GANs Are Used**

| Application                | Explanation                                         |
| -------------------------- | --------------------------------------------------- |
| Image synthesis            | Generate realistic human faces, landscapes, objects |
| Image-to-image translation | e.g., Sketch → Photo, Day → Night                   |
| Super-resolution           | Convert low-res images to high-res                  |
| Data augmentation          | Generate synthetic training data                    |
| Audio and Music synthesis  | Generate realistic sound samples                    |



---

## **Introducing generative adversarial networks**

- The objective of a `GAN` is to synthesize new data samples that have the same distribution as its training dataset.
- `GANs` are considered **unsupervised learning** models since they do not require labeled data for training.
- `GANs` are used in various applications, including
  - image generation,
  - image-to-image translation,
  - super-resolution,
  - data augmentation, and
  - audio synthesis.




### **Starting with autoencoders**

- `Autoencoders` are neural networks that learn to compress data into a lower-dimensional representation and then reconstruct the original data from this representation.
- They consist of two main components:
  - **Encoder**: Maps input data to a latent space.
  - **Decoder**: Reconstructs the input data from the latent representation.
- The `encoder` acts as a feature extractor or data compression function, while the `decoder` serves as a generative model that can produce new data samples from the latent space.


![Autoencoder Architecture](./figures/17_01.png)


- We can add multiple hidden layers with nonlinearities (as in a multilayer NN) to create a **deep autoencoder** that can learn more complex representations of the data.
- For image data, autoencoders typically use convolutional layers to better capture spatial hierarchies and patterns.

### **Generative models for synthesizing new data**

- `Autoencoders` are deterministic models trained to reconstruct their inputs. After training, an autoencoder given an input, $x$, reconstructs it from its compressed version in a lower-dimensional space. Because the decoder only ever receives encodings of actual inputs, a standard autoencoder is not designed to generate new samples that differ from the data it is given.

- A `generative model` can generate new data samples, $\hat{x}$, from a random noise vector, $z$, sampled from a simple distribution, such as a `Gaussian` or `uniform` distribution. The goal of the generative model is to learn a mapping from the noise space to the data space, such that the generated samples resemble the real data distribution. For example, each element of $z$ can be sampled from a standard normal distribution, i.e., $z_i \sim \mathcal{N}(0, 1)$ or from a uniform distribution, i.e., $z_i \sim \text{Uniform}(-1, 1)$.


![Generative Model Mapping](./figures/17_02.png)
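
A minimal sketch of drawing the noise vector $z$ from either of the two distributions mentioned above. The function name `sample_noise` and its parameters are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def sample_noise(dim, distribution="normal", rng=None):
    """Draw a noise vector z with `dim` independent elements."""
    rng = rng or random.Random()
    if distribution == "normal":
        # z_i ~ N(0, 1)
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if distribution == "uniform":
        # z_i ~ Uniform(-1, 1)
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    raise ValueError(f"unknown distribution: {distribution!r}")


rng = random.Random(42)
z_normal = sample_noise(5, "normal", rng)
z_uniform = sample_noise(5, "uniform", rng)
print(z_normal)
print(z_uniform)
```

Each call produces a different $z$, and therefore a different generated sample $\hat{x}$ once $z$ is passed through the generative model.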


- Both the `autoencoder` and the `generative model` learn a mapping from a lower-dimensional space to the data space. However, the key difference is that the autoencoder learns to reconstruct specific input data, while the generative model learns to produce new samples that resemble the overall data distribution.

- It is possible to generalize an autoencoder into a generative model by introducing stochasticity into the latent space. This is what a `variational autoencoder (VAE)` does. In a VAE, the encoder outputs the parameters of a probability distribution (e.g., the mean and variance of a Gaussian) instead of a single point in the latent space. During training, a latent vector is sampled from this distribution and passed through the decoder to reconstruct the input. After training, new data samples can be generated by drawing latent vectors from the prior (typically a standard Gaussian) and decoding them, which lets the VAE produce diverse samples that capture the underlying data distribution, similar to other generative models.
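
The sampling step of a VAE can be sketched as follows. This assumes a diagonal Gaussian whose encoder outputs are a mean and a log-variance per latent dimension (a common parameterization, not something these notes specify); drawing the sample as mean plus scaled standard-normal noise is commonly called the reparameterization trick. `sample_latent` is a hypothetical helper name.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Sample z = mean + sigma * eps elementwise, with eps ~ N(0, 1).

    sigma is recovered from the log-variance as exp(0.5 * log_var).
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]        # pretend encoder output
log_var = [0.0, -2.0, -60.0]   # last dimension has almost zero variance

z1 = sample_latent(mean, log_var, rng)
z2 = sample_latent(mean, log_var, rng)
print(z1)
print(z2)
```

The same encoder output gives a different latent vector on every draw, which is the source of diversity in the decoded samples.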

### **Generating new samples with GANs**

To understand what `GANs` are, we first need to understand their two main components: the **generator** and the **discriminator**. 

- **Generator (G)**: The generator is a neural network that takes a random noise vector, $z$, as input and produces synthetic data samples, $\hat{x} = G(z)$. The goal of the generator is to create samples that are indistinguishable from real data.

- **Discriminator (D)**: The discriminator is another neural network that takes either real data samples or generated samples as input and outputs a probability score, $D(x)$, indicating whether the input is real (from the training data) or fake (produced by the generator). The goal of the discriminator is to correctly classify real and fake samples.

You can think of this as a **zero-sum game**:
* The **generator** tries to *minimize* the probability of being detected.
* The **discriminator** tries to *maximize* its classification accuracy.

The training process involves alternating updates to the generator and discriminator:
1. Update the discriminator to maximize its ability to distinguish real from fake samples.
2. Update the generator to minimize the discriminator's ability to detect fake samples.


![Discriminator distinguishing real vs fake samples](./figures/17_03.png)


- In a `GAN` model, the generator and discriminator are trained simultaneously in an adversarial manner. The generator aims to produce realistic samples that can fool the discriminator, while the discriminator strives to become better at distinguishing real samples from those generated by the generator. This adversarial training process continues until the generator produces samples that are indistinguishable from real data, and the discriminator can no longer reliably tell them apart.


### **Understanding the loss functions for the generator and discriminator networks in a GAN model**

- The original `GAN` objective (Goodfellow et al., 2014):

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log (1 - D(G(z)))]$$

Where:
* $`p_{\text{data}}`$ is the real data distribution
* $`p_z`$ is a simple noise distribution (e.g., Gaussian or Uniform)
* $`z`$ is a random noise vector fed to the generator
* $`G(z)`$ is the generated fake sample
* $`D(x)`$ outputs the probability that $`x`$ is real

The expression consists of two parts:

1. **Real-sample term**: $`\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]`$ - This term rewards the discriminator for assigning high probabilities to real data samples. It does not depend on the generator.

2. **Fake-sample term**: $`\mathbb{E}_{z \sim p_z}[\log (1 - D(G(z)))]`$ - The discriminator maximizes this term by assigning low probabilities to generated samples ($`D(G(z)) \to 0`$), while the generator minimizes it by producing samples that the discriminator classifies as real ($`D(G(z)) \to 1`$).


A practical way of training GANs is to alternate between updating the discriminator and the generator:

1. **Update Discriminator (D)**: Maximize the objective with respect to $`D`$ while keeping $`G`$ fixed. This involves maximizing the likelihood of correctly classifying real and fake samples.

2. **Update Generator (G)**: Minimize the objective with respect to $`G`$ while keeping $`D`$ fixed. This involves minimizing the likelihood of the discriminator correctly identifying fake samples.


### **Optimization objective of the generator network in a GAN model**


**1. The GAN Value Function**

The original GAN objective is a **minimax game**:

$$\min_G \max_D V(D, G)$$

where

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log(1 - D(G(z)))\right]$$



**2. Training Alternates Between Two Steps**

| Step                   | Network Updated | What’s Fixed | Goal                                        |
| ---------------------- | --------------- | ------------ | ------------------------------------------- |
| **Discriminator Step** | $D$             | $G$          | Maximize ability to tell real vs fake apart |
| **Generator Step**     | $G$             | $D$          | Generate samples that fool $D$              |

This alternating training keeps the learning signal flowing to both networks.


**3. Updating the Discriminator**

When **$G$ is fixed**, the discriminator receives two types of samples:

| Sample Source    | Label | Training Signal         |
| ---------------- | ----- | ----------------------- |
| Real data $x$    | 1     | Encourage $D(x)$ → 1    |
| Fake data $G(z)$ | 0     | Encourage $D(G(z))$ → 0 |

So we **maximize**:

$$\log D(x) + \log(1 - D(G(z)))$$

Equivalently, we **minimize binary cross-entropy** with targets:

* Real → label **1**
* Fake → label **0**
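
The equivalence between maximizing the objective and minimizing binary cross-entropy can be verified with a few lines of arithmetic. The discriminator outputs `d_real` and `d_fake` below are made-up numbers, and `bce` is a hand-written helper rather than a library call.

```python
import math


def bce(p, target):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


d_real = 0.9  # hypothetical D(x) for a real sample
d_fake = 0.2  # hypothetical D(G(z)) for a generated sample

# Quantity the discriminator maximizes.
objective = math.log(d_real) + math.log(1 - d_fake)

# Loss the discriminator minimizes: real -> target 1, fake -> target 0.
loss = bce(d_real, 1) + bce(d_fake, 0)

print(objective, loss)
```

The loss is exactly the negated objective, so a standard binary cross-entropy loss with these targets implements the discriminator step.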



![The steps in building a GAN model](./figures/17_04.png)




**4. Updating the Generator (Problem and Solution)**

When **$D$ is fixed**, the generator is updated using:

$$\min_G \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

But early in training:

* $G(z)$ looks nothing like real data.
* $D(G(z)) \approx 0$ with high confidence.

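Then:

$$\log(1 - D(G(z))) \approx \log 1 = 0$$

and the loss sits on a flat part of its curve. With a sigmoid output $D = \sigma(a)$, the gradient of $\log(1 - \sigma(a))$ with respect to the logit $a$ is $-\sigma(a) = -D(G(z))$, so **the gradient is almost zero** → **no learning signal**.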

This is the **saturation problem**.



**5. Solution: Replace the Loss (Non-Saturating Loss)**

Instead of minimizing:

$$\log(1 - D(G(z)))$$

we **maximize**:

$$\log D(G(z))$$

which is equivalent to training $G$ as if the fake samples were labeled **real (label = 1)**.

So the generator’s new loss is:

$$\min_G -\mathbb{E}_{z \sim p_z}\left[\log D(G(z))\right]$$

This provides a **strong gradient** even when $D(G(z))$ is small.
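
The difference between the two generator losses is easiest to see in the gradient with respect to the discriminator's pre-sigmoid output. The sketch below assumes $D = \sigma(a)$ for a logit $a$ (an assumption about the output layer) and compares the analytic gradients; the function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original generator loss: -sigmoid(a)."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating loss: sigmoid(a) - 1."""
    return sigmoid(a) - 1.0


def numeric_grad(f, a, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)


for logit in (-8.0, -4.0, 0.0):
    d = sigmoid(logit)
    print(
        f"D(G(z)) = {d:.5f}  "
        f"saturating grad = {grad_saturating(logit):+.5f}  "
        f"non-saturating grad = {grad_non_saturating(logit):+.5f}"
    )
```

When the discriminator confidently rejects a fake (very negative logit), the original loss has a gradient magnitude close to 0, while the non-saturating loss has a gradient magnitude close to 1.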



**6. Final Label Assignment Summary**

| Network           | Input              | Target Label | Why                          |
| ----------------- | ------------------ | ------------ | ---------------------------- |
| **Discriminator** | Real sample $x$    | 1            | Encourage real → real        |
| **Discriminator** | Fake sample $G(z)$ | 0            | Encourage fake → fake        |
| **Generator**     | Fake sample $G(z)$ | **1**        | Encourage fake → appear real |

So the **generator’s labels are intentionally “flipped”**; the resulting objective is called the **non-saturating generator loss**.
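
The label assignment above can be turned into one alternating training step. The following PyTorch sketch uses tiny, hypothetical networks on 2-D toy data; the layer sizes, learning rate, and the name `train_step` are assumptions for illustration and are not taken from the accompanying notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 4-D noise, 2-D "data".
z_dim, x_dim = 4, 2
G = nn.Sequential(nn.Linear(z_dim, 16), nn.LeakyReLU(0.2), nn.Linear(16, x_dim))
D = nn.Sequential(
    nn.Linear(x_dim, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)


def train_step(x_real):
    batch = x_real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: real -> 1, fake -> 0. detach() keeps G fixed.
    x_fake = G(torch.randn(batch, z_dim)).detach()
    d_loss = loss_fn(D(x_real), ones) + loss_fn(D(x_fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fake -> 1 (flipped label). Only G's optimizer steps,
    # so D's weights stay fixed even though gradients flow through D.
    g_loss = loss_fn(D(G(torch.randn(batch, z_dim))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()


x_real = torch.randn(32, x_dim) + 3.0  # stand-in for a batch of real data
d_loss, g_loss = train_step(x_real)
print(f"d_loss = {d_loss:.4f}, g_loss = {g_loss:.4f}")
```

In a full training loop, `train_step` would be called once per mini-batch, and the two losses would typically be logged to monitor whether either network is overpowering the other.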



**7. Intuition Summary**

* The **discriminator** learns to **identify fake vs. real**.
* The **generator** learns to **fool the discriminator**, not to match real images directly.
* Early in training, $D$ easily rejects fakes, so the original generator loss saturates and its gradients vanish; setting the generator's target label to 1 avoids this.
* This gives the generator a **healthy gradient** to update from.



**Where This Fits Into GAN Stabilization Theory**

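The non-saturating loss is foundational: it was suggested in the original GAN paper and is used, directly or in closely related forms, by many later models. Not every well-known GAN uses it, though:

* DCGAN: non-saturating loss (binary cross-entropy with flipped generator labels)
* StyleGAN: non-saturating logistic loss, combined with gradient regularization
* BigGAN: hinge loss
* CycleGAN: least-squares adversarial loss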

It is part of making GANs **trainable and stable**.

## **Implementing a GAN from scratch**

- The original `GAN` paper by Goodfellow et al. (2014) introduced the concept of training two neural networks, a **generator** and a **discriminator**, in an adversarial manner to synthesize new data samples that resemble the training data distribution. 


### **Implementing the generator and the discriminator networks**

- We implement our first GAN model with a `generator` and a `discriminator` as two fully connected neural networks, each with one or more hidden layers.


![GAN model with fully connected networks](./figures/17_08.png)


- For each hidden layer, we use the `Leaky ReLU` activation function, which introduces non-linearity and helps the networks learn complex patterns in the data.
  
- A plain `ReLU` outputs zero, and therefore passes back a zero gradient, for every negative input. This makes gradients sparse and can leave units permanently inactive ("dead neurons"). `Leaky ReLU` allows a small, non-zero gradient when the unit is not active, helping to keep the neurons alive and improving learning.

- In the `discriminator network`, each hidden layer is followed by a `dropout` layer. This regularization technique randomly sets a fraction of the input units to zero during training, which helps prevent overfitting and improves the model's generalization to unseen data.

- The `output layer` in the `generator` uses the hyperbolic tangent `Tanh` activation function to ensure that the generated samples are in the same range as the normalized input data (typically between `-1` and `1` for image data). This helps the generator produce outputs that are more realistic and consistent with the training data distribution.

- The output layer in the discriminator uses the `sigmoid` activation function to output a probability score between `0` and `1`, indicating whether the input sample is `real` or `fake`. This probabilistic output is essential for the binary classification task that the discriminator performs.
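
The architecture described above could be sketched in PyTorch roughly as follows. The function names and all sizes (`z_size=20`, `hidden_size=100`, one hidden layer, dropout rate `0.5`) are assumptions for illustration; the notebook may use different values.

```python
import torch
import torch.nn as nn


def make_generator(z_size=20, hidden_size=100, output_size=784):
    """Fully connected generator: noise vector -> flattened 28x28 image."""
    return nn.Sequential(
        nn.Linear(z_size, hidden_size),
        nn.LeakyReLU(),            # hidden activation
        nn.Linear(hidden_size, output_size),
        nn.Tanh(),                 # outputs in [-1, 1]
    )


def make_discriminator(input_size=784, hidden_size=100, dropout=0.5):
    """Fully connected discriminator: flattened image -> probability real."""
    return nn.Sequential(
        nn.Linear(input_size, hidden_size),
        nn.LeakyReLU(),            # hidden activation
        nn.Dropout(dropout),       # regularization after the hidden layer
        nn.Linear(hidden_size, 1),
        nn.Sigmoid(),              # probability in [0, 1]
    )


gen = make_generator()
disc = make_discriminator()

z = torch.randn(8, 20)         # batch of 8 noise vectors
fake_images = gen(z)           # shape: (8, 784)
scores = disc(fake_images)     # shape: (8, 1)
print(fake_images.shape, scores.shape)
```

Adding more hidden layers means repeating the `Linear` / `LeakyReLU` (and, for the discriminator, `Dropout`) pattern before the output layer.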


![Leaky ReLU activation function](./figures/17_17.png)

- The `Leaky ReLU` activation function is defined as:
  - $$f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases}$$
  - where `α` is a small constant (e.g., `0.01`) that determines the slope of the function for negative input values.
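
The definition translates directly into code. This plain-Python sketch (function names are illustrative) also shows the derivative, which is what keeps the gradient non-zero for negative inputs.

```python
def leaky_relu(x, alpha=0.01):
    """f(x) = x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of leaky_relu: 1 for x >= 0, alpha otherwise."""
    return 1.0 if x >= 0 else alpha


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value), leaky_relu_grad(value))
```

With `alpha=0`, the function reduces to the ordinary ReLU, whose gradient is exactly zero for all negative inputs.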

### **Defining the training dataset**

- For training our `GAN` model, we will use the `MNIST` dataset, which consists of grayscale images of handwritten digits (0-9). The dataset contains `60,000` training images and `10,000` test images, each with a resolution of `28x28` pixels.
- We will normalize the pixel values to the range `[-1, 1]` to match the output range of the generator network, which uses the `Tanh` activation function in its output layer.
- This normalization helps the generator produce outputs that are more realistic and consistent with the training data distribution.
- Use `torchvision.transforms.ToTensor` to convert images to PyTorch tensors with pixel values in `[0, 1]`, then `torchvision.transforms.Normalize` with a mean and standard deviation of `0.5` to map them to `[-1, 1]`.

**Refer to the code implementation in the notebook (chp17_part1_GPU.ipynb) for details on defining the dataset and data loader.**

---