# Recap
## Quiz Questions Explained

---

### Question 1: Traditional vs. Deep Learning Approaches

* **The Question:** This question contrasts how feature extraction is handled in traditional machine learning pipelines versus end-to-end deep learning. 🧠
* **Correct Answers Explained:**
    * **C. In traditional approach, the training signal from classifier cannot be used to improve feature extractor.**  In traditional methods (e.g., using SIFT features with an SVM classifier), feature extraction and classification are separate, sequential steps. You first design and extract features using a fixed algorithm, then you train a classifier on those features. The performance of the classifier (the "training signal") does not influence or update the feature extraction process itself. 
    * **B. In deep learning approach, the training signal from classifier can be used to improve feature extractor.**  Deep learning models, like CNNs, learn features and classify in an end-to-end fashion. The convolutional layers act as the feature extractor, and the final fully connected layers act as the classifier. The loss calculated at the output is backpropagated through the *entire* network, simultaneously updating the weights of both the classifier and the feature extractor layers to improve overall performance. 

---

### Question 2: CNN Tensor Shape Calculation (with Flatten)

* **The Question:** This asks you to calculate the shape of the data tensor as it passes through various layers of a Convolutional Neural Network (CNN). 
* **Solution Breakdown:**
    * The input has a batch size of 64, 3 channels, and is 32x32 pixels. The shape is `[64, 3, 32, 32]`. 
    * **A (Filters):** The first convolutional layer has 100 filters, each with a kernel size of 5x5.  To process the 3-channel input, each filter must have a depth of 3. Thus, the filter tensor shape is `[num_filters, input_channels, height, width]` -> **`[100, 3, 5, 5]`**. 
    * **B (Conv2D Output):** We use the formula `output_size = floor((W - F + 2P) / S) + 1`.
        * Input size `W = 32`, Filter size `F = 5`, Padding `P = 0`, Stride `S = 2`. 
        * `output_size = floor((32 - 5 + 0) / 2) + 1 = floor(13.5) + 1 = 13 + 1 = 14`.
        * The output shape is `[batch, num_filters, height, width]` -> **`[64, 100, 14, 14]`**.
    * **C (Pooling Output):**
        * Input size `W = 14`, Pool size `F = 2`, Padding `P = 1`, Stride `S = 2`. 
        * `output_size = floor((14 - 2 + 2*1) / 2) + 1 = floor(7) + 1 = 8`.
        * The shape is `[batch, channels, height, width]` -> **`[64, 100, 8, 8]`**.
    * **D (Flatten Output):** The flatten operation reshapes the 3D feature maps into a 1D vector for each item in the batch.
        * The shape becomes `[batch, channels * height * width]` -> **`[64, 100 * 8 * 8]`** = **`[64, 6400]`**.
    * **E (Final Output):** The output layer has 10 neurons for 10 classes. 
        * The shape is `[batch, num_classes]` -> **`[64, 10]`**.
* **Correct Answer:** **A** matches all these calculated shapes. 
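
The arithmetic above can be checked with a short script. This is a minimal sketch in plain Python; the helper name `conv_out` is hypothetical and simply implements the `floor((W - F + 2P) / S) + 1` formula for both the convolution and the pooling layer.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer: floor((W - F + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

batch, in_channels, size = 64, 3, 32
num_filters, num_classes = 100, 10

filter_shape = (num_filters, in_channels, 5, 5)            # A
size = conv_out(size, kernel=5, stride=2, padding=0)       # 32 -> 14
conv_shape = (batch, num_filters, size, size)              # B
size = conv_out(size, kernel=2, stride=2, padding=1)       # 14 -> 8
pool_shape = (batch, num_filters, size, size)              # C
flat_shape = (batch, num_filters * size * size)            # D
out_shape = (batch, num_classes)                           # E

for name, shape in [("A", filter_shape), ("B", conv_shape), ("C", pool_shape),
                    ("D", flat_shape), ("E", out_shape)]:
    print(name, shape)
```

Integer floor division (`//`) plays the role of `floor` here, which is why the fractional `13.5` in step B rounds down to `13`.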

---

### Question 3: CNN Tensor Shape Calculation (with GAP)

* **The Question:** This is similar to Question 2, but the `flatten` layer is replaced with a **Global Average Pooling (GAP)** layer. 
* **Solution Breakdown:**
    * The shapes for **A, B, and C** are identical to Question 2.
    * **D (GAP Output):** Global Average Pooling computes the average value of each feature map (channel) across its spatial dimensions (`H x W`), reducing it to a single value.
        * Input shape: `[64, 100, 8, 8]` (from C).
        * GAP reduces the `8x8` spatial dimensions to `1x1`, resulting in a shape of `[64, 100, 1, 1]`, which is then typically squeezed to **`[64, 100]`**. 
    * **E (Final Output):** This `[64, 100]` tensor is fed into a final fully connected layer that outputs 10 values for the 10 classes. 
        * The shape is `[batch, num_classes]` -> **`[64, 10]`**.
* **Correct Answer:** **B** matches these calculated shapes. 
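
To make the GAP step concrete, here is a minimal sketch in plain Python that averages each channel of a single sample. The function name `global_average_pool` and the toy values are hypothetical; the toy sample uses 2 channels of 2x2 instead of 100 channels of 8x8 so the numbers are easy to verify by hand.

```python
def global_average_pool(feature_maps):
    """Reduce a [C][H][W] nested list to a length-C list of channel means."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy sample (assumed values): 2 channels, each 2x2.
sample = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [4.0, 8.0]],
]
pooled = global_average_pool(sample)
print(pooled)  # [2.5, 3.0]
```

Each channel collapses to one number regardless of `H` and `W`, which is why the per-sample output length is `C` (100 in the question) rather than `C * H * W`.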

---

### Question 4: The Receptive Field

* **The Question:** This question asks about the properties of a neuron's receptive field in a CNN.  The receptive field is the region of the original input image that influences the activation of a particular neuron.
* **Correct Answers Explained:**
    * **D. The value of a neuron is computationally relevant to its receptive field.** This is true by definition. A neuron's output depends *only* on the input values inside its receptive field; input pixels outside that region cannot change it.
    * **C. Receptive field of neurons on higher layers become larger.**  As we move deeper into the network (to higher layers), each neuron takes input from multiple neurons in the layer below it. This stacking effect causes the receptive field relative to the original input image to expand, allowing deeper neurons to recognize more complex and larger patterns. 
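
The growth of the receptive field can be computed layer by layer with the usual recurrence: each layer adds `(kernel - 1) * jump` input pixels, where `jump` is the product of all earlier strides. The sketch below is an illustration, not part of the quiz; the function name is hypothetical and the layer list reuses the conv and pooling layers from Question 2.

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs ordered from input to output.

    Returns the receptive field size (in input pixels) after each layer.
    """
    rf, jump = 1, 1
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # new input pixels reached by this layer
        jump *= stride              # spacing between neighbouring neurons, in input pixels
        sizes.append(rf)
    return sizes

# 5x5 conv with stride 2, followed by 2x2 pooling with stride 2.
sizes = receptive_field([(5, 2), (2, 2)])
print(sizes)  # [5, 7]
```

Padding does not appear in the recurrence: it shifts where the receptive field sits relative to the image border but not how large it is.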

---

### Question 5: The Residual Block

* **The Question:** Identify the correct diagram for a standard residual block from ResNet. 
* **Correct Answer Explained:**
    * **A.** This diagram shows the original and most common architecture of a residual block.  The key components are:
        1.  **Main Path:** The input `x` goes through a series of layers, typically `Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm`.
        2.  **Skip Connection:** The original input `x` "skips" these layers.
        3.  **Addition:** The output of the main path is added element-wise to the original input from the skip connection.
        4.  **Final Activation:** The result of the addition is passed through a final `ReLU` activation function.
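
The order of operations (main path, addition, then the final `ReLU`) can be sketched on a plain list of numbers. This is only a toy illustration: `toy_main_path` is a hypothetical stand-in for the `Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm` stack, not a real convolution.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, main_path):
    fx = main_path(x)                            # 1. main path F(x)
    summed = [f + xi for f, xi in zip(fx, x)]    # 2-3. skip connection, element-wise addition
    return relu(summed)                          # 4. final activation after the addition

def toy_main_path(x):
    # Hypothetical stand-in for the stacked conv layers.
    return [-2.0 * v for v in x]

out = residual_block([1.0, -3.0, 0.5], toy_main_path)
print(out)  # [0.0, 3.0, 0.0]
```

Note that the addition requires `F(x)` and `x` to have the same shape, which is what the later questions on `use_1x1conv` are about.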

---

### Question 6: Residual Block Code Analysis 1

* **The Question:** Predict the output tensor shape of a PyTorch `Residual` block with specific parameters. 
* **Code Analysis:**
    * Input `X` shape: `[10, 3, 32, 32]` (Batch, Channels, Height, Width).
    * Block parameters: `num_channels=3`, `use_1x1conv=False`, `strides=1`.
    * The two `conv` layers have `kernel_size=3`, `padding=1`, and `stride=1`. With these parameters, a convolution operation **preserves the spatial dimensions** (Height and Width).
    * Since `num_channels=3`, the number of output channels from the convolutions also matches the input.
    * `use_1x1conv` is `False`, so the skip connection passes `X` through unchanged.
    * The main path output and the skip connection `X` both have the shape `[10, 3, 32, 32]`, so they can be added. The final output shape remains the same.
* **Correct Answer:** **A. `[10, 3, 32, 32]`**. 

---

### Question 7: Residual Block Code Analysis 2

* **The Question:** Predict the output tensor shape again, but with different parameters that involve downsampling and changing the number of channels. 
* **Code Analysis:**
    * Input `X` shape: `[10, 3, 32, 32]`.
    * Block parameters: `num_channels=6`, `use_1x1conv=True`, `strides=2`.
    * **Main Path:**
        * `conv1` has `stride=2` and outputs `num_channels=6`. It will halve the spatial dimensions (`32x32` -> `16x16`) and increase channels (`3` -> `6`). The shape after `conv1` is `[10, 6, 16, 16]`.
        * `conv2` has `stride=1`, so it preserves the shape `[10, 6, 16, 16]`.
    * **Skip Connection Path:**
        * Since `use_1x1conv` is `True`, a 1x1 convolution (`conv3`) with `stride=2` and `num_channels=6` is applied to the original input `X`. This is necessary to match the shape of the main path's output.
        * This 1x1 convolution also changes the shape from `[10, 3, 32, 32]` to `[10, 6, 16, 16]`.
    * The addition is valid as both tensors now have the same shape.
* **Correct Answer:** **B. `[10, 6, 16, 16]`**. 
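
The shape reasoning for Questions 6 and 7 can be reproduced without PyTorch. The sketch below assumes the layer settings described in the analysis (3x3 convolutions with `padding=1`, with `strides` applied to `conv1` and to the optional 1x1 convolution); the helper names `conv_out` and `residual_shape` are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size - kernel + 2 * padding) // stride + 1

def residual_shape(in_shape, num_channels, use_1x1conv=False, strides=1):
    """Output shape of the residual block, or ValueError if the addition is impossible."""
    batch, in_channels, height, width = in_shape
    # Main path: conv1 (3x3, padding 1, stride=strides), then conv2 (3x3, padding 1, stride 1).
    h = conv_out(conv_out(height, 3, strides, 1), 3, 1, 1)
    w = conv_out(conv_out(width, 3, strides, 1), 3, 1, 1)
    main = (batch, num_channels, h, w)
    # Skip path: identity, or a 1x1 conv with the same stride and channel count.
    if use_1x1conv:
        skip = (batch, num_channels,
                conv_out(height, 1, strides), conv_out(width, 1, strides))
    else:
        skip = tuple(in_shape)
    if main != skip:
        raise ValueError(f"cannot add main path {main} to skip path {skip}")
    return main

q6 = residual_shape((10, 3, 32, 32), num_channels=3)
q7 = residual_shape((10, 3, 32, 32), num_channels=6, use_1x1conv=True, strides=2)
print(q6, q7)
```

Calling `residual_shape((10, 3, 32, 32), num_channels=6)` without the 1x1 convolution raises the `ValueError`, which mirrors why the skip path needs `conv3` whenever the main path changes the shape.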

---

### Question 8: Full ResNet Shape Analysis

* **The Question:** Trace the shape of a tensor as it flows through a complete, multi-stage ResNet model defined in code. 
* **Solution Breakdown:**
    * Input: `[32, 3, 64, 64]`
    * `Conv2d(stride=2)` -> `[32, 64, 32, 32]`
    * `MaxPool2d(stride=2)` -> `[32, 64, 16, 16]`
    * **A (ResnetBlock 1):** Channels=64, no stride. Shape preserved -> **`[32, 64, 16, 16]`**.
    * **B (ResnetBlock 2):** Channels=64, no stride. Shape preserved -> **`[32, 64, 16, 16]`**.
    * **C (ResnetBlock 3):** Channels=128, stride=2. Spatial dims halved -> **`[32, 128, 8, 8]`**.
    * **D (ResnetBlock 4):** Channels=256, stride=2. Spatial dims halved -> **`[32, 256, 4, 4]`**.
    * **E (AdaptiveAvgPool2d):** Reduces spatial dims to 1x1 -> **`[32, 256, 1, 1]`**.
    * **F (Flatten & Linear):** After flattening (`[32, 256]`) and passing through a final linear layer with 10 outputs -> **`[32, 10]`**.
* **Correct Answer:** **A** correctly lists all the intermediate shapes. 
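
The trace above can be reproduced with the same output-size formula. The breakdown only states the strides, so the kernel sizes and paddings below are assumptions (a 7x7 stem convolution with padding 3, a 3x3 max pool with padding 1, and 3x3 convolutions with padding 1 inside each stage); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size - kernel + 2 * padding) // stride + 1

def stage(shape, channels, stride=1):
    """Shape after a ResNet stage whose first 3x3 conv (padding 1) uses `stride`."""
    batch, _, h, w = shape
    return (batch, channels, conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1))

trace = {}
shape = (32, 3, 64, 64)
# Assumed stem: 7x7 conv, stride 2, padding 3, 64 output channels.
shape = (32, 64, conv_out(64, 7, 2, 3), conv_out(64, 7, 2, 3))
# Assumed pooling: 3x3 max pool, stride 2, padding 1.
shape = (32, 64, conv_out(shape[2], 3, 2, 1), conv_out(shape[3], 3, 2, 1))
trace["pool"] = shape

shape = trace["A"] = stage(shape, 64)              # shape preserved
shape = trace["B"] = stage(shape, 64)              # shape preserved
shape = trace["C"] = stage(shape, 128, stride=2)   # spatial dims halved
shape = trace["D"] = stage(shape, 256, stride=2)   # spatial dims halved
trace["E"] = (shape[0], shape[1], 1, 1)            # adaptive average pool to 1x1
trace["F"] = (shape[0], 10)                        # flatten + linear with 10 outputs

for name, s in trace.items():
    print(name, s)
```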

---

### Question 9: ResNet Architecture Properties

* **The Question:** Asks about general true statements regarding the ResNet architecture. 
* **Correct Answers Explained:**
    * **C. 1x1 Conv in skip-connection is used to change number of output channels.**  This is a crucial function. When the main path of a residual block changes the number of channels or the spatial dimensions (using a stride > 1), a 1x1 convolution is used in the skip connection to transform the original input to match the new shape for the element-wise addition. 
    * **D. A ResNet model consists of many ResNet blocks, each ResNet block consists of many residual blocks...**  This describes the hierarchical structure. For example, ResNet-50 has four main stages (ResNet blocks), and these stages are composed of 3, 4, 6, and 3 residual blocks, respectively. 

---

### Question 10: Adversarial Examples

* **The Question:** Defines the basic properties of an adversarial example `x_adv` created from a clean input `x`. 
* **Correct Answers Explained:**
    * **A. x_adv and x look very similar under human perspective.** A key characteristic of adversarial examples is that the perturbation added to the clean image is very small, so `x_adv` looks identical or nearly identical to the original image to a human observer.
    * **D. `argmax f(x_adv) != argmax f(x)`**  This is the goal of the attack. Despite being visually similar, the perturbed image `x_adv` is intentionally crafted to make the model `f` predict a different class than it would for the original image `x`. 

---

### Question 11: L-infinity Norm Constraint in Attacks

* **The Question:** Interprets the meaning of the mathematical constraint $||x' - x||_\infty \le \epsilon$ used in generating adversarial examples. 
* **Correct Answers Explained:**
    * **D. The highest absolute difference between pixels of x_adv and x is less than or equal 𝝐.**  This is the mathematical definition of the **L-infinity norm**. It finds the maximum absolute change across all pixels and constrains this single largest change to be no more than a small value $\epsilon$. 
    * **A. This constraint to make sure that x_adv and x look very similar under human perspective.**  By limiting the maximum perturbation of any single pixel, this constraint ensures that the overall modification to the image is small and not easily noticeable, thereby preserving visual similarity. 
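
A small numeric example may help. The sketch below computes the L-infinity distance between two flattened images and clips a perturbed image back into the allowed range around `x`. The pixel values and the budget `eps = 8` (on a 0 to 255 scale) are made up for illustration, and the function names are hypothetical.

```python
def linf_distance(x_adv, x):
    """Largest absolute per-pixel difference between x_adv and x."""
    return max(abs(a - b) for a, b in zip(x_adv, x))

def project_linf(x_adv, x, eps):
    """Clip every pixel of x_adv into [x - eps, x + eps]."""
    return [min(max(a, b - eps), b + eps) for a, b in zip(x_adv, x)]

x = [51, 128, 204]        # clean pixels (assumed values)
x_adv = [63, 120, 205]    # perturbed pixels (assumed values)
eps = 8

print(linf_distance(x_adv, x))            # 12, so the constraint is violated
projected = project_linf(x_adv, x, eps)
print(projected)                          # [59, 120, 205]
print(linf_distance(projected, x) <= eps) # True
```

Only the first pixel exceeded the budget, so only that pixel is changed by the projection.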

---

### Question 12: Untargeted Attacks

* **The Question:** Analyzes the objective function for generating an adversarial example: $\mathbf{x}_{adv} = \arg\max_{x'} l(f(x'; \theta), y)$, where `y` is the true label. 
* **Correct Answers Explained:**
    * **B & C:** The objective is to find an image `x'` that **maximizes the loss** with respect to the correct label `y`. Maximizing the loss means making the model's prediction as wrong as possible for the true class.  This simultaneously decreases the chance of predicting `y` and increases the chance of predicting *any other* incorrect label `y'`. 
    * **E. It is an untargeted attack.**  The goal is simply to cause a misclassification, without caring what the new, incorrect class is. This is the definition of an untargeted attack. 

---

### Question 13: Targeted Attacks

* **The Question:** Analyzes the objective function: $\mathbf{x}_{adv} = \arg\min_{x'} l(f(x'; \theta), y')$, where `y'` is a specific target label different from the true label `y`. 
* **Correct Answers Explained:**
    * **B. We maximally increase the chance to predict x_adv with label y'.** Here, the objective is to **minimize the loss** with respect to a specific *target* class `y'`. The attack optimizes the input `x'` (the model weights `θ` stay fixed) so that the model becomes as likely as possible to predict that target class.
    * **C. It is a targeted attack.**  Because the attack aims to have the model output a specific, pre-determined incorrect label, it is a targeted attack. 
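
The difference between the two objectives in Questions 12 and 13 comes down to the sign of the update. The sketch below is a toy illustration under several assumptions: the model is a hypothetical 3-class linear softmax classifier with made-up weights, the input is a 2-feature vector rather than an image, and the optimizer is a PGD-style loop (sign-gradient steps followed by projection onto the L-infinity ball), which the objectives themselves do not prescribe.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_probs(W, x):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def loss_grad_wrt_x(W, x, label):
    """Gradient of the cross-entropy loss l(f(x), label) with respect to x."""
    probs = predict_probs(W, x)
    delta = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(delta[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def attack(W, x, label, eps, targeted=False, steps=10):
    """Untargeted: ascend the loss for the true label. Targeted: descend it for the target."""
    direction = -1.0 if targeted else 1.0
    step = eps / 4
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_grad_wrt_x(W, x_adv, label)
        x_adv = [a + direction * step * sign(g) for a, g in zip(x_adv, grad)]
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = [min(max(a, b - eps), b + eps) for a, b in zip(x_adv, x)]
    return x_adv

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # assumed toy weights
x, y_true, y_target, eps = [0.1, 0.05], 0, 2, 0.2

clean_pred = argmax(predict_probs(W, x))
untargeted_pred = argmax(predict_probs(W, attack(W, x, y_true, eps)))
targeted_pred = argmax(predict_probs(W, attack(W, x, y_target, eps, targeted=True)))
print(clean_pred, untargeted_pred, targeted_pred)
```

With these toy values the clean prediction is class 0, the untargeted attack ends in some other class, and the targeted attack ends in the chosen class 2. On a real network the gradient would come from backpropagation rather than the closed-form expression used here.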

---

### Question 14: Adversarial Training

* **The Question:** Asks for correct statements describing **adversarial training**, a common defense technique. 
* **Correct Answers Explained:**
    * **B. At each iteration, we use an adversarial attack such as PGD to augment the data.**  Instead of standard augmentation (like rotation or cropping), adversarial training generates new training samples by applying an adversarial attack (like Projected Gradient Descent, PGD) to the clean images in each batch. 
    * **C. We update the model parameters to let the model predict the clean and adversarial images to their ground-truth labels.**  The model is trained to correctly classify both the original images and their adversarially perturbed versions, forcing it to learn more robust features that are not easily fooled. 
    * **E. The final loss consists the losses over clean and adversarial examples.**  The total loss function to be minimized is a combination of the loss on the original batch and the loss on the generated adversarial batch. This ensures the model learns from both types of examples. 
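
The three statements can be put together in a small training loop. The sketch below is a toy illustration under clearly stated assumptions: the model is a 2-feature logistic regression with made-up data, the attack is a single sign-gradient step (FGSM-style) standing in for a full PGD attack, and the clean and adversarial losses are simply added with equal weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce(p, y):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sign(v):
    return (v > 0) - (v < 0)

def make_adversarial(w, b, x, y, eps):
    # One sign-gradient step on the input (assumed stand-in for PGD).
    p = forward(w, b, x)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

def total_loss(w, b, data, eps):
    # Final loss = loss on clean examples + loss on adversarial examples.
    loss = 0.0
    for x, y in data:
        x_adv = make_adversarial(w, b, x, y, eps)
        loss += bce(forward(w, b, x), y) + bce(forward(w, b, x_adv), y)
    return loss / len(data)

def train_step(w, b, data, eps, lr):
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in data:
        x_adv = make_adversarial(w, b, x, y, eps)   # augment with an attack
        for sample in (x, x_adv):                   # both use the ground-truth label
            err = forward(w, b, sample) - y
            grad_w = [g + err * s for g, s in zip(grad_w, sample)]
            grad_b += err
    n = len(data)
    return [wi - lr * g / n for wi, g in zip(w, grad_w)], b - lr * grad_b / n

data = [([1.0, 1.0], 1), ([0.8, 1.2], 1), ([-1.0, -1.0], 0), ([-1.2, -0.7], 0)]
eps, lr = 0.3, 0.5
w, b = [0.0, 0.0], 0.0
initial_loss = total_loss(w, b, data, eps)
for _ in range(50):
    w, b = train_step(w, b, data, eps, lr)
final_loss = total_loss(w, b, data, eps)
print(initial_loss, final_loss)
```

The adversarial examples are regenerated at every step from the current parameters, which matches statement B: the attack runs inside the training loop rather than once up front.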

## Revision Notes: Key Takeaways

### 1. CNN Architecture & Tensor Dimensions
* **Conv Layer Output Size:** The spatial dimensions (Height/Width) of a convolutional layer's output can be calculated with: $W_{out} = \lfloor \frac{W_{in} - F + 2P}{S} \rfloor + 1$.
* **Shape Preservation:** A common setup to preserve the spatial size is using `Stride=1` and `Padding = (F-1)/2` (for an odd filter size F).
* **Flatten vs. Global Average Pooling (GAP):**
    * **Flatten:** Unrolls a `[C, H, W]` feature map into a long vector of size `C*H*W`. The result is sensitive to where features appear in the image, and the following FC layer needs `C*H*W` weights per output neuron.
    * **GAP:** Reduces a `[C, H, W]` feature map to a vector of size `C` by taking the average of each channel. It is more robust to spatial translations and dramatically reduces the number of parameters.
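
To put a number on the parameter difference, the sketch below counts the weights and biases of the final fully connected layer for the `[100, 8, 8]` feature maps and 10 classes used in Questions 2 and 3. The helper name `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

channels, height, width, num_classes = 100, 8, 8, 10

flatten_params = linear_params(channels * height * width, num_classes)
gap_params = linear_params(channels, num_classes)

print(flatten_params)  # 64010
print(gap_params)      # 1010
```

The GAP variant needs `H * W = 64` times fewer weights in this layer, since the spatial dimensions are averaged away before the FC layer sees them.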

### 2. Receptive Fields
* The **receptive field** is the area of the input image that a single neuron in a feature map "sees".
* As you go deeper into a CNN, the receptive field size **increases**, allowing the network to learn features that represent larger and more abstract concepts.

### 3. ResNet (Residual Networks)
* **Core Idea:** Addresses the degradation problem (accuracy saturating and then degrading as networks get deeper) by using **skip connections**.
* **Residual Block:** The main component, where the input `x` is added to the output of a few stacked layers, `F(x)`. The block learns the *residual* mapping `F(x)` instead of the direct mapping `H(x)`, so that `H(x) = F(x) + x`. In the standard block, a final `ReLU` is applied after the addition.
* **1x1 Convolutions:** Used in skip connections to match dimensions when the main path changes the number of channels or downsamples the spatial resolution (with `stride > 1`).
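
One common way to see why the skip connection helps optimization is to differentiate the block output. The sketch below ignores the final `ReLU` and treats the skip path as the identity (both simplifying assumptions), with $\mathcal{L}$ denoting the training loss:

$$
H(x) = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + I,
\qquad
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H}\left(\frac{\partial F}{\partial x} + I\right).
$$

The identity term $I$ gives the gradient a direct route to earlier layers even when $\frac{\partial F}{\partial x}$ is small.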

### 4. Adversarial Attacks & Defense
* **Adversarial Example:** An input modified by a small perturbation that is imperceptible (or nearly so) to humans but is designed to make a machine learning model predict incorrectly.
* **Attack Objective:**
    * **Untargeted:** Maximize the loss w.r.t. the *true* label `y`. Goal: cause any misclassification.
    * **Targeted:** Minimize the loss w.r.t. a *false target* label `y'`. Goal: cause a specific misclassification.
* **L-infinity Norm ($||\cdot||_\infty$):** A common way to constrain the perturbation. It limits the maximum change made to any single pixel, ensuring the change is not obvious.
* **Adversarial Training (Defense):** A common defense that augments each training batch with adversarial examples (generated by an attack such as PGD) and trains the model to classify both clean and adversarial inputs correctly, which makes it more robust to such perturbations.