# FIT5215: Deep Learning Exam Revision Notes

## 1. Course Overview: The Big Picture

This unit covers the following key areas of Deep Learning:
* **Fundamentals (Weeks 1, 2, 4, 5):** ML basics, Feed-forward Neural Networks (FFNs), backpropagation, and optimization.
* **Deep Computer Vision (Weeks 3, 6, 10):** Convolutional Neural Networks (CNNs), architectures (ResNet), and Vision Transformers (ViT).
* **Sequential / Time-Series (Weeks 7, 9):** Recurrent Neural Networks (RNNs), LSTMs, GRUs, Seq2Seq, and Transformers.
* **Representation Learning (Week 8):** Word2Vec (Skip-Gram, CBOW).
* **Deep Generative Models (Week 11):** Generative Adversarial Networks (GANs) and Diffusion Models.

---

## 2. Feed-Forward Neural Networks (FFNs)

### 2.1. Architecture & Forward Propagation

* **Definition:** A base model for deep learning composed of an input layer, hidden layers, and an output layer.
* **Parameters:** Weight matrices ($W^k$) and bias vectors ($b^k$) for each layer $k$.
* **Forward Propagation (Classification):**
    1.  **Input:** $h^0(x) = x$
    2.  **Hidden Layers ($k=1$ to $L-1$):**
        * **Linear Operation:** $\bar{h}^k(x) = h^{k-1}(x)W^k + b^k$
        * **Activation:** $h^k(x) = \sigma(\bar{h}^k(x))$ (introduces non-linearity)
    3.  **Output Layer (Logits):** $h^L(x) = h^{L-1}(x)W^L + b^L$
    4.  **Prediction:** $p(x) = \text{softmax}(h^L(x))$
* **Forward Propagation (Regression):**
    * Same as classification, but the output layer is linear.
    * **Prediction:** $\hat{y} = h^L(x)$
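
The classification forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not the unit's reference code: ReLU stands in for the generic activation $\sigma$, and the layer sizes, initialization scale, and the names `forward`, `weights`, and `biases` are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, weights, biases):
    """Forward pass for classification: ReLU hidden layers, softmax output."""
    h = x  # h^0(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)  # linear operation, then activation
    logits = h @ weights[-1] + biases[-1]  # output layer: no activation
    return softmax(logits)

rng = np.random.default_rng(0)
sizes = [4, 8, 3]  # hypothetical widths: 4 inputs, 8 hidden units, 3 classes
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

probs = forward(rng.normal(size=(5, 4)), weights, biases)  # batch of 5 inputs
```

Each row of `probs` is a probability vector over the 3 classes. For regression, the same function would return `logits` directly instead of applying `softmax`.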

### 2.2. Activation Functions

Activation functions give neural networks their non-linear capacity. Without them, a stack of linear layers collapses into a single linear (affine) model, no matter how deep it is.

| Function | Formula | Output Range | Gradient | Notes |
| :--- | :--- | :--- | :--- | :--- |
| **Sigmoid** | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | (0, 1) | $\sigma(z)(1 - \sigma(z))$ | S-shaped. Prone to **vanishing gradients** as gradient is near 0 for large positive/negative inputs (saturation). |
| **tanh** | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (-1, 1) | $1 - \tanh^2(z)$ | S-shaped and zero-centered, which can help convergence. Still saturates. |
| **ReLU** | $\text{ReLU}(z) = \max(0, z)$ | [0, $\infty$) | $\begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$ | Most common default. Fast to compute. Solves vanishing gradient for $z>0$. Can "die" if $z<0$ (no gradient). |
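
As a sanity check, the gradient formulas for the three activations can be compared against central finite differences. This is a small sketch; the test points in `z` and the step `eps` are arbitrary choices, and the value of the ReLU gradient at exactly $z = 0$ is a convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # The value at exactly z = 0 is a convention; here it is taken to be 1.
    return (z >= 0).astype(float)

z = np.array([-5.0, -1.0, 0.5, 5.0])  # avoids z = 0, where ReLU has a kink
eps = 1e-6

def numeric(f):
    # Central finite-difference estimate of df/dz at each point in z.
    return (f(z + eps) - f(z - eps)) / (2 * eps)

saturated = d_sigmoid(np.array([5.0]))[0]  # small gradient in the saturated region
```

At $z = 5$ the sigmoid gradient is already below 0.01, which is the saturation behaviour the table describes.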

### 2.3. Loss Function & Training

* **Loss (Classification):** Cross-Entropy (CE) Loss / Negative Log-Likelihood.
    * Measures the difference between predicted probabilities $p(x)$ and the true label $y$.
    * $CE(x, y) = -\log p_y(x)$
* **Training:** The goal is to find parameters $\theta = \{(W^k, b^k)\}$ that minimize the total loss over the training set $D$.
    * $L(D; \theta) = \frac{1}{N} \sum_{i=1}^N CE(x_i, y_i)$, where $N = |D|$ is the number of training examples.
* **Process:** This minimization is done using optimizers (like SGD) which require calculating the gradient of the loss w.r.t. all parameters.

---

## 3. Optimization

### 3.1. Calculus & Backpropagation

* **Goal:** Find $\nabla_\theta J(\theta)$, the gradient of the loss w.r.t. parameters.
* **Chain Rule:** The core of backpropagation. It allows us to compute gradients layer by layer, starting from the loss and moving backward.
    * If $z = g(y)$ and $y = f(x)$, then $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$.
* **Jacobian Matrix:** A matrix of all first-order partial derivatives of a vector-valued function. The chain rule for vectors/matrices uses the Jacobian.
* **Key Gradient (Softmax + CE Loss):** The gradient of the CE loss w.r.t. the logits $h$ (output *before* softmax) is remarkably simple:
    * $\frac{\partial l}{\partial h} = p - 1_y$ (where $p$ is the softmax probability vector and $1_y$ is the one-hot true label).
* **Backpropagation:** An algorithm that uses the chain rule to efficiently compute gradients. It propagates the error gradient backward from the output layer to the input layer.
    * **Forward Pass:** Compute layer outputs and the final loss.
    * **Backward Pass:** Compute gradients for each $W^k$ and $b^k$ using the chain rule.
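
The key gradient $\frac{\partial l}{\partial h} = p - 1_y$ can be checked numerically. The sketch below compares it with a finite-difference gradient of the cross-entropy loss; the logits `h` and label `y` are made-up values for illustration.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())  # shift by the max for numerical stability
    return e / e.sum()

def ce_loss(h, y):
    # Cross-entropy of the softmax probabilities against true class index y.
    return -np.log(softmax(h)[y])

h = np.array([2.0, -1.0, 0.5])  # example logits
y = 2                           # example true class

analytic = softmax(h) - np.eye(3)[y]  # p - 1_y

# Finite-difference gradient, one logit at a time.
numeric = np.zeros_like(h)
eps = 1e-6
for k in range(len(h)):
    step = np.zeros_like(h)
    step[k] = eps
    numeric[k] = (ce_loss(h + step, y) - ce_loss(h - step, y)) / (2 * eps)
```

The two gradients agree to numerical precision, and the entries of `analytic` sum to zero because both $p$ and $1_y$ sum to one.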

### 3.2. Optimization Algorithms

* **Problem:** The loss surface of a deep network is highly complex and non-convex. In high dimensions, **saddle points** are believed to be far more common than local minima.
* **Gradient Descent (GD):**
    * Update rule: $\theta_{t+1} = \theta_t - \eta \nabla_\theta J(\theta_t)$
    * Computes the gradient $\nabla_\theta J(\theta_t)$ using the **entire** training set of $N$ examples.
    * Impractical for large datasets ($O(N)$ cost per step).
* **Stochastic Gradient Descent (SGD):**
    * Estimates the gradient using a **mini-batch** of $b$ examples (e.g., $b=32$).
    * $\nabla_\theta \tilde{L}(\theta_t) = \frac{1}{b} \sum_{k=1}^b \nabla_\theta l(x_{i_k}, y_{i_k}; \theta_t)$
    * With uniformly sampled mini-batches this is an unbiased estimate of the full gradient, and it is much cheaper ($O(b)$ cost per step).
    * Update rule: $\theta_{t+1} = \theta_t - \eta \nabla_\theta \tilde{L}(\theta_t)$
* **SGD with Momentum:**
    * Adds a "velocity" vector $v$ that accumulates past gradients.
    * $v = \alpha v + (1-\alpha)g$ (This is a common, simplified form; slides may show $v = \alpha g_{t-1} + ...$)
    * Update: $\theta = \theta - \eta v$
    * Helps to speed up and stabilize convergence, dampening oscillations.
* **AdaGrad (Adaptive Gradient):**
    * Adaptively scales the learning rate for each parameter.
    * Accumulates the sum of squared gradients $\gamma$ for each parameter.
    * Update: $\theta = \theta - \frac{\eta}{\sqrt{\epsilon + \gamma}} \odot g$
    * Gives smaller updates to parameters with large gradients (prevents overshooting) and larger updates to parameters with small gradients.
    * Weakness: The learning rate always decreases, which can cause training to stall.
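
The update rules above can be compared on a toy problem. The sketch below minimizes a hypothetical ill-conditioned quadratic $J(\theta) = \frac{1}{2}\sum_i a_i \theta_i^2$ with exact gradients, so the "SGD" step here shows only the update rule, not mini-batch noise. The learning rates and step count are illustrative choices.

```python
import numpy as np

A = np.array([1.0, 10.0])  # curvatures of the toy quadratic (an assumption)

def grad(theta):
    return A * theta  # gradient of 0.5 * sum(A * theta**2)

def sgd(theta, g, state, lr=0.05):
    return theta - lr * g, state

def momentum(theta, g, v, lr=0.05, alpha=0.9):
    v = alpha * v + (1 - alpha) * g  # velocity: running average of gradients
    return theta - lr * v, v

def adagrad(theta, g, gamma, lr=0.5, eps=1e-8):
    gamma = gamma + g ** 2  # per-parameter sum of squared gradients
    return theta - lr / np.sqrt(eps + gamma) * g, gamma

def run(update, steps=500):
    theta, state = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        theta, state = update(theta, grad(theta), state)
    return theta

results = {f.__name__: run(f) for f in (sgd, momentum, adagrad)}
```

All three reach the minimum at the origin here. AdaGrad's per-parameter scaling makes its trajectory nearly identical along both coordinates despite the tenfold difference in curvature.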

---

## 4. Convolutional Neural Networks (CNNs)

### 4.1. CNN Architecture

A typical CNN consists of a **Feature Extractor** followed by a **Classifier**.
1.  **Feature Extractor:** `(CONV -> ReLU -> CONV -> ReLU -> POOL)` blocks, repeated several times.
2.  **Classifier:** `(FLATTEN -> FC -> ReLU -> FC -> SOFTMAX)`

### 4.2. Convolutional Layer

* **Operation:** Slides a small filter/kernel (e.g., 3x3) over the input image/feature map. At each position, it computes a dot product between the kernel and the input patch (strictly a cross-correlation, although deep learning calls it convolution).
* **Key Components:**
    * **Kernel/Filter ($f_h, f_w$):** The small matrix of learnable weights. Detects specific features (edges, corners, textures).
    * **Strides ($s_h, s_w$):** The step size the kernel moves. A stride > 1 downsamples the output.
    * **Padding ($p$):** Adds a border (usually of zeros) to the input. "Zero padding" is common. `p=1` for a 3x3 kernel is often used to preserve the input dimensions (known as 'same' padding).
* **Output Size:** The height ($H_o$) and width ($W_o$) of the output feature map are:
    * $W_o = \lfloor \frac{W_i + 2p - f_w}{s_w} \rfloor + 1$
    * $H_o = \lfloor \frac{H_i + 2p - f_h}{s_h} \rfloor + 1$
* **Feature Volume:** An input of shape `(C_in, H_i, W_i)` convolved with `C_out` filters (each of shape `(C_in, f_h, f_w)`) produces an output feature map of shape `(C_out, H_o, W_o)`.
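
The output-size formula is easy to get wrong under exam pressure, so here is a small helper that evaluates it. The function name `conv_output_size` and the example layer settings are illustrative.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Output height or width of a convolution: floor((in + 2p - f) / s) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1

same_3x3 = conv_output_size(32, kernel=3, stride=1, padding=1)   # 'same' padding
stem_7x7 = conv_output_size(224, kernel=7, stride=2, padding=3)  # strided downsampling
valid_5x5 = conv_output_size(28, kernel=5)                       # no padding
```

With these settings the three results are 32, 112, and 24. The same function applies to height and width separately when the kernel or stride is not square.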

### 4.3. Pooling Layer

* **Operation:** A non-linear downsampling layer. Operates independently on each feature map.
* **Types:**
    * **Max Pooling:** Takes the maximum value from a window (e.g., 2x2). Most common.
    * **Average Pooling:** Takes the average value from the window.
* **Purpose:**
    1.  Reduces the spatial dimensions ($H \times W$) of the feature maps.
    2.  Reduces the number of parameters and computation.
    3.  Provides basic translation invariance.
* **Hinton's Critique:** Pooling is a "big mistake" because it throws away precise spatial information, failing to capture relationships between parts (e.g., a "face" is just a bag of features: one nose, two eyes).

### 4.4. Global Pooling

* **Operation:** A pooling layer where the kernel size is equal to the entire feature map size.
* **Types:** Global Average Pooling (GAP) or Global Max Pooling (GMP).
* **Purpose:**
    * Reduces a feature map of `(C, H, W)` to a vector of `(C, 1, 1)`.
    * Often used to replace the `Flatten` and `FC` layers at the end of a CNN, drastically reducing parameters and preventing overfitting.
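
A short NumPy sketch of 2x2 max pooling and of global pooling on a small `(C, H, W)` feature volume. The input values are arbitrary, and the reshape trick assumes `H` and `W` are divisible by the window size.

```python
import numpy as np

features = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # (C, H, W)
C, H, W = features.shape

# 2x2 max pooling with stride 2: split each spatial axis into (blocks, 2)
# and take the max inside every 2x2 window, independently per channel.
max_pooled = features.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

# Global pooling: the window covers the whole feature map.
gap = features.mean(axis=(1, 2))  # Global Average Pooling, one value per channel
gmp = features.max(axis=(1, 2))   # Global Max Pooling, one value per channel
```

`max_pooled` has shape `(2, 2, 2)`, while `gap` and `gmp` each hold `C` values, which is why global pooling can replace `Flatten` before the final classifier layer.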

### 4.5. ResNet (Residual Networks)

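* **Problem:** Very deep "plain" networks suffer from *degradation*: beyond a certain depth, accuracy gets worse even on the training set, so this is an optimization difficulty (commonly linked to vanishing gradients) rather than overfitting.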
* **Solution:** The **Residual Block** with a **skip connection**.
* **Architecture:** The block learns a *residual* function $g(x)$ instead of the full mapping $f(x)$.
    * **Plain block:** $h(x) = \text{ReLU}(W_2 \cdot \text{ReLU}(W_1 x))$
    * **Residual block:** $h(x) = \text{ReLU}(\text{MainPath}(x) + x)$
    * The model learns $g(x) = \text{MainPath}(x)$, so the full output is $f(x) = g(x) + x$.
* **Benefit:** The gradient can flow directly through the identity skip connection ($\frac{\partial f}{\partial x} = \frac{\partial g}{\partial x} + 1$), preventing the gradient from vanishing.
* **1x1 Convolutions:** Used in skip connections when the input and output dimensions (channels or spatial size) do not match.
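
A minimal sketch of the plain and residual blocks using fully connected layers, following the formulas above. Real ResNets use convolutions and batch normalization; the weight shapes and names here are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def plain_block(x, W1, W2):
    return relu(W2 @ relu(W1 @ x))

def residual_block(x, W1, W2):
    g = W2 @ relu(W1 @ x)  # main path g(x)
    return relu(g + x)     # skip connection adds the input back

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))

# With all-zero weights the plain block outputs zeros,
# while the residual block still passes x through.
plain_out = plain_block(x, zeros, zeros)
residual_out = residual_block(x, zeros, zeros)
```

This shows why a residual block can easily represent the identity mapping: it only needs to drive $g(x)$ towards zero.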

---

## 5. Practical Skills & Regularization

### 5.1. Common Training Problems

* **Gradient Vanishing:** Gradients become exponentially small in deep networks, especially with saturating activations (sigmoid, tanh). Lower layers learn very slowly or not at all.
    * **Fix:** Use ReLU, use ResNets, use good initialization (He).
* **Gradient Exploding:** Gradients become exponentially large, leading to unstable training (NaNs). Common in RNNs.
    * **Fix:** Gradient Clipping (clamping gradients to a max value), use ResNets, good initialization.
* **Internal Covariate Shift (ICS):** The distribution of each layer's inputs changes during training, as the parameters of previous layers change. This forces the layer to constantly re-adapt.
    * **Fix:** Batch Normalization.
* **Overfitting:** The model "memorizes" the training data (low training error) but fails to generalize to new data (high test error).
    * **Fix:** Regularization (L1, L2, Dropout, Data Augmentation), Early Stopping.

### 5.2. Key Techniques

* **Weight Initialization:** Crucial to break symmetry and maintain good gradient flow.
    * **Xavier/Glorot Init:** Good for `sigmoid` and `tanh`. Weight variance is $\frac{2}{n_{in} + n_{out}}$.
    * **He Init:** Designed for `ReLU`. Weight variance is $\frac{2}{n_{in}}$.
* **Batch Normalization (BN):**
    * Normalizes the output of a layer (before activation) across the mini-batch to have zero mean and unit variance.
    * Learns two parameters, $\gamma$ (scale) and $\beta$ (shift), to restore representational power.
    * **Benefits:** Reduces ICS (the original motivation), allows higher learning rates, smooths the loss landscape, acts as a regularizer.
* **L1/L2 Regularization:**
    * Adds a penalty to the loss function based on the magnitude of the weights $\theta$.
    * $J(\theta) = \text{Loss} + \Omega(\theta)$
    * **L2 (Weight Decay):** $\Omega(\theta) = \lambda \sum ||W^k||^2_F$. Prefers small, diffuse weights.
    * **L1:** $\Omega(\theta) = \lambda \sum_k ||W^k||_1$ (the sum of absolute values of all weights). Promotes sparsity (many weights become exactly zero).
* **Dropout:**
    * Regularization technique. During training, randomly "drops" (sets to zero) a fraction $p$ of neurons in a layer.
    * Forces the network to learn redundant representations; prevents "co-adaptation."
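    * At test time, all neurons are used (dropout is turned off). To keep expected activations consistent, either scale the kept activations by $\frac{1}{1-p}$ during training ("inverted dropout") or scale activations by $1-p$ at test time.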
* **Early Stopping:**
    * Monitor the validation loss. Stop training when the validation loss starts to increase, even if training loss is still decreasing.
* **Data Augmentation:**
    * Create more training data by applying realistic transformations (e.g., for images: flip, rotate, crop, color shift).
    * Acts as a powerful regularizer, making the model invariant to these transformations.
* **Label Smoothing:**
    * Regularization. Changes hard one-hot labels (e.g., `[0, 1, 0]`) to soft labels (e.g., `[0.05, 0.9, 0.05]`).
    * Prevents the model from becoming overconfident.
* **Mixup / CutMix:**
    * Data augmentation. Blends two images ($x_1, x_2$) and their labels ($y_1, y_2$).
    * $\hat{x} = \lambda x_1 + (1-\lambda) x_2$
    * $\hat{y} = \lambda y_1 + (1-\lambda) y_2$
    * CutMix cuts a patch from one image and pastes it onto another.
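
Three of the techniques above fit in a few lines each. The sketch below assumes inverted dropout (rescaling at training time) and label smoothing that spreads $\epsilon$ uniformly over all $K$ classes. Other variants exist, and the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: zero a fraction p of units, rescale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return h  # test time: all units are used
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def smooth_labels(one_hot, eps):
    """Move eps of the probability mass to a uniform distribution over K classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

def mixup(x1, y1, x2, y2, lam):
    """Blend two inputs and their (one-hot or soft) labels with weight lam."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

soft = smooth_labels(np.array([0.0, 1.0, 0.0]), eps=0.15)
x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                     np.zeros(4), np.array([0.0, 1.0]), lam=0.7)
dropped = dropout(np.ones(1000), p=0.5)
```

With `eps=0.15` and three classes, `soft` reproduces the `[0.05, 0.9, 0.05]` example above. In practice the Mixup weight `lam` is usually drawn at random per batch (commonly from a Beta distribution), which this sketch leaves to the caller.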

---

## 6. Adversarial Attacks

* **Adversarial Example:** An input $x_{adv}$ created by adding a small, human-imperceptible perturbation $\delta$ to a clean image $x$, such that $x_{adv}$ fools the model.
* **How it Works:** Attacks move the input $x$ in the direction that **maximizes** the loss. This is the *opposite* of training.
* **Untargeted Attack:** Goal is to make the model predict *any* wrong class.
    * Find $x_{adv}$ that maximizes $l(f(x_{adv}), y_{\text{true}})$.
    * **Fast Gradient Sign Method (FGSM):** A one-step attack.
        * $x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x l(f(x), y))$
* **Targeted Attack:** Goal is to make the model predict a *specific* target class $y_{\text{target}}$.
    * Find $x_{adv}$ that *minimizes* $l(f(x_{adv}), y_{\text{target}})$.
    * $x_{adv} = x - \epsilon \cdot \text{sign}(\nabla_x l(f(x), y_{\text{target}}))$
* **Projected Gradient Descent (PGD):** A stronger, iterative version of FGSM. Takes multiple small steps, projecting the result back into the $\epsilon$-ball around $x$ (e.g., $||x_{adv} - x||_\infty \le \epsilon$) after each step.
* **Defense: Adversarial Training:**
    * Among the most effective empirical defenses known.
    * Find adversarial examples (e.g., using PGD).
    * Train the model on a mix of clean and adversarial examples. This makes the loss surface smoother around data points.

---

## 7. Recurrent Neural Networks (RNNs)

### 7.1. Basic RNN

* **Architecture:** A network with a "loop." The hidden state $h_t$ at time $t$ is a function of the previous hidden state $h_{t-1}$ and the current input $x_t$.
* **Equations:**
    * $h_t = \tanh(h_{t-1}W_h + x_t U_x + b_h)$
    * $\hat{y}_t = \text{softmax}(h_t V_y + b_y)$
* **Key Idea:** Parameters ($W_h, U_x, V_y$) are **shared** across all time steps.
* **Problem:** Vanishing/Exploding Gradients. The gradient signal has to flow back in time through repeated multiplication by $W_h^T$ (and the tanh derivative). If the largest singular value of $W_h$ is below 1, gradients tend to vanish; if it is well above 1, they can explode. This makes it very hard to learn **long-term dependencies**.
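
The recurrence can be unrolled in a few lines. This sketch covers only the hidden-state update (not the softmax output), and the dimensions and initialization scale are arbitrary assumptions.

```python
import numpy as np

def rnn_forward(xs, W_h, U_x, b_h):
    """Run a basic tanh RNN over a sequence xs of shape (T, d_in)."""
    h = np.zeros(W_h.shape[0])  # initial hidden state h_0
    states = []
    for x_t in xs:
        # The same W_h, U_x, b_h are reused at every time step.
        h = np.tanh(h @ W_h + x_t @ U_x + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 3, 4
states = rnn_forward(
    rng.normal(size=(T, d_in)),
    W_h=rng.normal(0, 0.5, (d_h, d_h)),
    U_x=rng.normal(0, 0.5, (d_in, d_h)),
    b_h=np.zeros(d_h),
)
```

`states` holds one hidden vector per time step. Backpropagating through this loop multiplies by $W_h^T$ once per step, which is where the vanishing or exploding behaviour comes from.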

### 7.2. LSTM (Long Short-Term Memory)

* **Solution:** Solves the long-term dependency problem with a **gating mechanism** and a separate **cell state** $c_t$ (long-term memory).
* **Core Idea:** The network learns *when* to forget, store, and output information.
* **Gates and updates (each gate is a sigmoid; the candidate values use tanh):**
    1.  **Forget Gate ($f_t$):** Decides what to throw away from $c_{t-1}$.
        * $f_t = \sigma(x_t U_f + h_{t-1} W_f + b_f)$
    2.  **Input Gate ($i_t$):** Decides what new information to store in $c_t$.
        * $i_t = \sigma(x_t U_i + h_{t-1} W_i + b_i)$
        * Candidate values: $g_t = \tanh(x_t U_g + h_{t-1} W_g + b_g)$
    3.  **Cell State Update:**
        * $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ (Forget old, add new)
    4.  **Output Gate ($o_t$):** Decides what to output as the hidden state $h_t$.
        * $o_t = \sigma(x_t U_o + h_{t-1} W_o + b_o)$
        * $h_t = o_t \odot \tanh(c_t)$
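
The four equations translate directly into a single-step function. In this sketch the parameters are stored in dictionaries keyed by gate name (`f`, `i`, `g`, `o`), which is an illustrative layout, not how any particular library stores them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    U, W, b = params["U"], params["W"], params["b"]
    f = sigmoid(x_t @ U["f"] + h_prev @ W["f"] + b["f"])  # forget gate
    i = sigmoid(x_t @ U["i"] + h_prev @ W["i"] + b["i"])  # input gate
    g = np.tanh(x_t @ U["g"] + h_prev @ W["g"] + b["g"])  # candidate values
    o = sigmoid(x_t @ U["o"] + h_prev @ W["o"] + b["o"])  # output gate
    c = f * c_prev + i * g  # forget old, add new
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
gates = "figo"
params = {
    "U": {k: rng.normal(0, 0.1, (d_in, d_h)) for k in gates},
    "W": {k: rng.normal(0, 0.1, (d_h, d_h)) for k in gates},
    "b": {k: np.zeros(d_h) for k in gates},
}
x_t, c_prev = rng.normal(size=d_in), np.ones(d_h)
h, c = lstm_step(x_t, np.zeros(d_h), c_prev, params)
```

If the forget gate saturates near 1 and the input gate near 0, the cell state is copied forward almost unchanged, which is how the LSTM carries information over long spans.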

### 7.3. GRU (Gated Recurrent Unit)

* A simplified version of LSTM with fewer parameters.
* Merges $c_t$ and $h_t$ into a single hidden state $h_t$.
* Has two gates:
    1.  **Update Gate ($z_t$):** Controls how much of $h_{t-1}$ to keep (like $f_t$ and $i_t$ combined).
    2.  **Reset Gate ($r_t$):** Controls how much of $h_{t-1}$ to use when computing the candidate state.

### 7.4. RNN Architectures

* **Many-to-One:** Input sequence $\rightarrow$ single output (e.g., Sentiment Analysis).
* **One-to-Many:** Single input $\rightarrow$ output sequence (e.g., Image Captioning).
* **Many-to-Many (Seq2Seq):** Input sequence $\rightarrow$ output sequence (e.g., Machine Translation).

---

## 8. Word2Vec

* **Goal:** Learn vector representations (embeddings) for words that capture semantic meaning (e.g., $v_{\text{King}} - v_{\text{Man}} + v_{\text{Woman}} \approx v_{\text{Queen}}$).
* **Pretext Task:** Turns an unsupervised problem (learning from text) into a supervised one.
* **Models:**
    1.  **CBOW (Continuous Bag-of-Words):**
        * **Task:** Predict a target word given its surrounding context words.
        * (e.g., `[The, quick, fox, jumps]` $\rightarrow$ `brown`)
    2.  **Skip-Gram:**
        * **Task:** Predict context words given a single target word.
        * (e.g., `brown` $\rightarrow$ `[The, quick, fox, jumps]`)
* **Negative Sampling:** An optimization to avoid the expensive `softmax` over the entire vocabulary (of size $N$).
    * Instead of an $N$-class classification, it becomes a binary classification:
    * Is this pair `(target, context)` a real pair (1) or a "negative" (randomly sampled) pair (0)?

---

## 9. Advanced Sequential Models (Seq2Seq & Attention)

### 9.1. Seq2Seq (Encoder-Decoder)

* **Architecture:** Two RNNs (often LSTMs/GRUs).
    1.  **Encoder:** Reads the entire input sequence (e.g., "I love deep learning") and compresses it into a single **context vector** $c$ (the final hidden state $h_T$).
    2.  **Decoder:** Takes $c$ as its initial hidden state and generates the output sequence token by token (e.g., `<BOS>`, `J'adore`, `le`, `deep`, `learning`, `<EOS>`).
* **Problem:** The fixed-size context vector $c$ is a **bottleneck**. It's difficult to store all information from a long sentence in one vector.

### 9.2. Attention Mechanism

* **Solution:** Solves the bottleneck by allowing the Decoder to look back at *all* Encoder hidden states ($h_1, ..., h_T$) at *each* step of decoding.
* **Process (at Decoder step $j$):**
    1.  **Query:** Take the decoder hidden state from the previous step, $q_{j-1}$.
    2.  **Keys:** Use all encoder hidden states $h_1, ..., h_T$ as keys.
    3.  **Scores:** Compute an **alignment score** $e_{ji} = \text{score}(q_{j-1}, h_i)$ for each $h_i$ (e.g., dot: $\text{score}(q, h) = q^T h$; general: $\text{score}(q, h) = q^T W_a h$).
    4.  **Weights:** Convert scores to probabilities over the input positions: $\alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{i'} \exp(e_{ji'})}$, i.e. $\alpha_j = \text{softmax}(e_j)$. The weights $\alpha_{ji}$ show "how much attention" to pay to each input word.
    5.  **Context:** Create a dynamic context vector $c_j$ as a weighted sum: $c_j = \sum_i \alpha_{ji} h_i$.
    6.  **Predict:** Use $c_j$ and $q_{j-1}$ to predict the next word $y_j$.
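
One decoding step of dot-product attention, following steps 1 to 5 above. The encoder states `H` and query `q` are toy values chosen so that the query clearly matches the second encoder state.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(q, H):
    """Dot-product attention of one query q over encoder states H of shape (T, d)."""
    scores = H @ q          # one alignment score q^T h_i per encoder state
    alpha = softmax(scores) # attention weights over input positions
    context = alpha @ H     # weighted sum of encoder states
    return alpha, context

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])  # T = 3 encoder hidden states
q = np.array([0.0, 4.0])    # decoder query aligned with the second state
alpha, context = attend(q, H)
```

Most of the attention weight lands on the second state, so the context vector is close to `H[1]`. A different query at the next decoding step would produce a different context vector, which is what removes the fixed-vector bottleneck.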

---

## 10. Transformers

* **Motto:** "Attention is All You Need." Throws away all recurrence and convolutions, relying only on attention.
* **Benefits:**
    * **Parallelizable:** Computes all steps at once, unlike RNNs. Massively faster on GPUs.
    * **Long-Range:** The path length between any two tokens is $O(1)$, which makes long-range dependencies much easier to learn than in an RNN, where the path grows with the distance between the tokens.

### 10.1. Self-Attention (Scaled Dot-Product)

* The core of the Transformer. It's attention *within* a single sequence (e.g., inputs attending to other inputs).
* **Process:**
    1.  For each input token $x_i$, create three vectors (via learnable matrices $W_Q, W_K, W_V$):
        * **Query ($q_i$):** "What I am looking for."
        * **Key ($k_i$):** "What I am."
        * **Value ($v_i$):** "What I will provide."
    2.  Compute attention scores by taking the dot product of every query with every key: $q_i \cdot k_j$.
    3.  **Matrix Form:**
        * $Q = X W_Q$, $K = X W_K$, $V = X W_V$
        * $\text{Attention}(Q, K, V) = \text{softmax}(\frac{Q K^T}{\sqrt{d_k}}) V$
        * The $\sqrt{d_k}$ scaling prevents dot products from becoming too large and killing the softmax gradient.
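
The matrix form maps almost one-to-one onto NumPy. This is a single-head sketch without masking or batching; the sequence length and dimensions are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n): row i attends over all j
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, W_Q, W_K, W_V)
```

Each row of `weights` sums to one, and the `(n, n)` shape of that matrix is the source of the quadratic cost in sequence length.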

### 10.2. Transformer Architecture

* **Multi-Head Attention:** Runs self-attention $h$ times (e.g., $h=8$) in parallel, each head with its own $W_Q, W_K, W_V$ matrices. The head outputs are concatenated and passed through a final linear projection ($W_O$).
* **Encoder Block:**
    1.  Multi-Head Self-Attention
    2.  Add & Norm (Residual Connection + Layer Norm)
    3.  Point-wise FFN
    4.  Add & Norm
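* **Decoder Block:** Like the Encoder block, but its self-attention is **masked** (causal) so that a position cannot attend to later positions, and it adds a *second* Multi-Head Attention block that performs **Cross-Attention** (Queries $Q$ from Decoder, Keys $K$ and Values $V$ from Encoder output).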
* **Positional Encoding:** Since there is no recurrence, sin/cos functions are added to the input embeddings to inject position information.
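
The sin/cos encoding can be generated as below. The sketch assumes the formulation from the original Transformer paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$, which these notes do not spell out, and it requires an even `d_model`.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings of shape (n_positions, d_model)."""
    pos = np.arange(n_positions)[:, None]       # (n, 1)
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(n_positions=50, d_model=16)
```

The result is added element-wise to the token embeddings, so `d_model` must match the embedding size.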

---

## 11. Vision Transformer (ViT)

* **Goal:** Apply the Transformer to images.
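* **Problem:** Self-attention over individual pixels is computationally infeasible: a $224 \times 224$ image has 50,176 pixels, and attention cost grows quadratically with sequence length.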
* **Solution: "An image is worth 16x16 words"**
    1.  **Patches:** Split the image (e.g., $224 \times 224$) into a grid of non-overlapping 16x16 patches ($14 \times 14 = 196$ patches).
    2.  **Flatten & Project:** Flatten each patch ($16 \times 16 \times 3$) into a vector and use a linear layer to project it to the model dimension $d_{\text{model}}$.
    3.  **[CLS] Token:** Add a learnable "class token" to the beginning of the sequence.
    4.  **Positional Encoding:** Add positional encodings to the patch embeddings.
    5.  **Transformer:** Feed this sequence of patch "tokens" into a standard Transformer Encoder.
    6.  **Classify:** Use the final output corresponding to the `[CLS]` token to feed into an MLP head for classification.
* **vs. CNNs:** ViTs lack the *inductive bias* (locality, translation invariance) of CNNs. Therefore, they require *much more* data (e.g., ImageNet-21k, JFT-300M) to learn these properties from scratch.
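
Steps 1 and 2 (splitting and flattening) come down to a reshape. The sketch below assumes a channels-last `(H, W, C)` image whose sides are divisible by the patch size, and it omits the learned linear projection, `[CLS]` token, and positional encodings.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)  # split rows and columns into blocks
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

image = np.arange(224 * 224 * 3).reshape(224, 224, 3)  # dummy image with unique values
tokens = patchify(image, patch=16)
seq_len = tokens.shape[0] + 1  # +1 for the [CLS] token
```

A $224 \times 224 \times 3$ image becomes 196 patch vectors of length 768, so the Transformer sees a sequence of 197 tokens instead of 50,176 pixels.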

---

## 12. Model Fine-Tuning (PEFT)

* **Problem:** Fine-tuning an entire large model (like ViT or BERT) on a new task is computationally expensive.
* **Solution: Parameter-Efficient Fine-Tuning (PEFT):** Freeze the pre-trained model and only train a small number of *new* parameters.
* **Methods:**
    * **Adapters:** Add small, learnable FFN "bottleneck" modules *inside* each Transformer block.
    * **LoRA (Low-Rank Adaptation):** Modifies a frozen weight matrix $W \in \mathbb{R}^{d \times k}$ by adding a low-rank update: $W \rightarrow W + B \cdot A$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$. Only $B$ and $A$ are trained.
    * **Prompt Tuning:** Prepends learnable "prompt" vectors to the input sequence.
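
A rough sketch of the LoRA update and its parameter savings for one weight matrix. The sizes (768 by 768, rank 8) are illustrative, and initializing `B` to zero so that the adapted layer starts out identical to the frozen one is a common convention assumed here, not something stated in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 768, 768, 8  # hypothetical layer size and LoRA rank

W = rng.normal(size=(d_out, d_in))        # frozen pre-trained weight
A = rng.normal(0, 0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init

def lora_forward(x, W, A, B):
    # Equivalent to x @ (W + B @ A).T, without forming the full update matrix.
    return x @ W.T + (x @ A.T) @ B.T

full_params = W.size           # parameters updated by full fine-tuning
lora_params = A.size + B.size  # parameters updated by LoRA

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)
```

Here LoRA trains 12,288 parameters instead of 589,824, roughly 2% of the layer. After training, $B \cdot A$ can be merged into $W$ so inference costs the same as the original layer.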

---

## 13. Generative Adversarial Networks (GANs)

* **Goal:** A generative model that learns to create realistic data (e.g., images).
* **Architecture (Two-Player Game):**
    1.  **Generator ($G$):** The "Counterfeiter." Tries to create fake data ($G(z)$) from random noise $z$ that looks real.
    2.  **Discriminator ($D$):** The "Police." Tries to distinguish between real data $x$ and fake data $G(z)$.
* **Minimax Objective:** $G$ and $D$ play a game. $D$ tries to *maximize* this function, while $G$ tries to *minimize* it.
    * $\min_G \max_D J(G, D) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
* **Training:**
    1.  **Train $D$:** Freeze $G$. Show $D$ a batch of real images (labels=1) and a batch of fake images (labels=0). Update $D$ via gradient *ascent*.
    2.  **Train $G$:** Freeze $D$. Generate a batch of fake images. Pass them to $D$ and try to fool it (labels=1). Update $G$ via gradient *descent*.
* **Nash Equilibrium:** The ideal convergence point where $G$ produces perfect fakes ($p_g = p_d$) and $D$ is completely confused ($D(x) = 0.5$ for all $x$).
* **Problems:**
    * **Mode Collapse:** $G$ finds one "good" fake that fools $D$ and only produces that one (or a few) images, failing to capture the diversity of the data.
    * **Hard Convergence:** The minimax training is unstable and hard to balance.
* **Evaluation:**
    * **Inception Score (IS):** Measures both quality and diversity (higher is better). Good generated images should be:
        1.  **High-Quality:** A classifier (like InceptionNet) is confident about what each image shows (low entropy of $p(y|x)$).
        2.  **Diverse:** The model produces a wide variety of classes (high entropy of the marginal $p(y)$).
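
Both criteria are combined in the usual definition $IS = \exp\big(\mathbb{E}_x[KL(p(y|x) \,\|\, p(y))]\big)$, which these notes describe only in words. The sketch below computes it from a matrix of classifier probabilities. The inputs are hand-made toy distributions, not real InceptionNet outputs, and the `1e-12` term is only there to avoid `log(0)`.

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: array (n_images, n_classes) of classifier probabilities p(y|x)."""
    p_y = p_yx.mean(axis=0)  # marginal class distribution p(y)
    # KL(p(y|x) || p(y)) for each image, then averaged and exponentiated.
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))

confident_and_diverse = np.eye(4)                      # each image: a different, certain class
blurry = np.full((5, 4), 0.25)                         # classifier is unsure about every image
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (5, 1))      # confident, but always the same class

scores = [inception_score(p) for p in (confident_and_diverse, blurry, collapsed)]
```

The first case reaches the maximum score (the number of classes, 4), while both the blurry and the mode-collapsed cases score about 1, the minimum.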