## Quiz Questions Explained

---

### Question 1: Self-Attention

* **The Question:** This question asks for the defining characteristics of **self-attention**. 🤔
* **Correct Answers Explained:**
    * **A. With self-attention, we can compute the token embeddings in parallel.**  Unlike RNNs which process tokens one by one, the self-attention mechanism calculates the relationships between all tokens in a sequence simultaneously using matrix multiplications, allowing for massive parallelization. 
    * **C. For self-attention, the queries (Q), keys (K), and values (V) computed based on source sequences.**  In self-attention, the sequence attends to *itself*. This means the Query, Key, and Value matrices are all derived from the same input sequence (the source). 
    * **E. Self-attention can capture intra-sequence dependencies between source sequences.**  The purpose of self-attention is to model the relationships *within* a single sequence (intra-sequence), allowing each token to understand its context based on all other tokens in that same sequence. 
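
The three points above can be illustrated with a minimal NumPy sketch of single-head scaled dot-product self-attention. The dimensions, the random weights, and the helper names `softmax` and `self_attention` are assumptions chosen for illustration; the quiz does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Q, K and V are all projections of the SAME sequence X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # One matrix product scores every token against every other token,
    # so all positions are handled in parallel (no token-by-token loop).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # (n_tokens, n_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4      # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a distribution over the tokens of the same sequence, which is the intra-sequence dependency the answer refers to.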

---

### Question 2: Cross-Attention

* **The Question:** This question asks for the defining characteristics of **cross-attention**, contrasting it with self-attention.
* **Correct Answers Explained:**
    * **A. With self-attention, we can compute the token embeddings in parallel.**  *Note: The quiz's wording says "self-attention" in every option of this question, which appears to be a copy-paste error; the options are read here as statements about cross-attention.* Cross-attention is computed with the same kind of matrix multiplications as self-attention, so it is also parallelizable.
    * **D. For self-attention, the queries (Q) are computed based on target sequences, while keys (K), and values (V) are computed based on source sequences.**  This is the key difference from self-attention. In cross-attention (used in the decoder of an encoder-decoder model), the Queries come from the decoder's sequence (the target), while the Keys and Values come from the encoder's sequence (the source). This allows the target sequence to "look at" the source sequence.
    * **F. Self-attention can capture inter-sequence dependencies between source sequences and target sequences.**  Cross-attention models the relationships *between* two different sequences (inter-sequence), which is crucial for tasks like machine translation where the decoder needs to align its output with the encoder's input.
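
As a rough sketch of the Q-from-target, K/V-from-source split described above, the following NumPy example computes single-head cross-attention. The sequence lengths, dimensions, and function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(target, source, W_q, W_k, W_v):
    Q = target @ W_q   # queries come from the target (decoder) sequence
    K = source @ W_k   # keys come from the source (encoder) sequence
    V = source @ W_v   # values come from the source (encoder) sequence
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # (n_target, n_source)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_source, n_target, d_model, d_head = 7, 3, 8, 4   # illustrative sizes
source = rng.normal(size=(n_source, d_model))
target = rng.normal(size=(n_target, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = cross_attention(target, source, W_q, W_k, W_v)
```

The attention matrix is rectangular: one row per target token, one column per source token, which is what makes the dependency inter-sequence.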

---

### Question 3: Drawbacks of CNNs

* **The Question:** This question asks you to identify the inherent limitations of Convolutional Neural Networks (CNNs) that motivated the development of architectures like Vision Transformers.
* **Correct Answers Explained:**
    * **A. CNNs cannot capture the global information of images.**  CNNs build up their understanding of an image hierarchically. A filter in an early layer only sees a small local patch. Deeper layers have larger receptive fields, but global context only emerges gradually through stacking, whereas a Transformer's self-attention connects every position to every other position within a single layer.
    * **C. CNNs cannot capture the spatial relationship of local objects inside images.**  A max-pooling layer, common in CNNs, provides a degree of local translation invariance: it reports that a feature is present somewhere in a pooling region, regardless of its exact position. A side effect is that it discards precise information about the spatial relationships between features.
    * **D. CNNs are locality sensitivity.**  This means a CNN's filters are strongly biased towards learning from local pixel neighborhoods. This "inductive bias" is a strength for many vision tasks but a limitation when long-range dependencies are important. 
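
The loss of position information under max pooling (answer C) can be seen in a tiny NumPy sketch. The 4x4 inputs and the helper `max_pool_2x2` are assumptions made up for this illustration.

```python
import numpy as np

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling on a 2D map with even height and width.
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# Two feature maps with the same activation at different positions
# inside the same 2x2 pooling region.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)
same_after_pooling = np.array_equal(pooled_a, pooled_b)
```

Both inputs pool to the same output, so later layers cannot tell where inside the region the feature was.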

---

### Question 4: Vision Transformers (ViTs) - The Basics

* **The Question:** This covers the initial steps of how a Vision Transformer (ViT) processes an image.
* **Correct Answers Explained:**
    * **A. For ViTs, a token or visual word is a patch of an image.**  Instead of processing pixel by pixel, a ViT first divides an image into a grid of non-overlapping patches (e.g., 16x16 pixels each). Each patch is treated as a "token," analogous to a word in a sentence. 
    * **D. We apply a linear projection to the flattened patches to transform them to token embeddings.**  The raw pixel values of each patch are flattened into a long vector and then passed through a standard linear layer (a trainable projection) to create the patch embeddings. 
    * **F. We inject the class token to the token embeddings of the patches and learn the class token fixed during training.**  A special, learnable `[class]` token is prepended to the sequence of patch embeddings. This token's purpose is to aggregate global information from the entire image. 
    * **G. On top of the class token at the final layer, we build up the MLP head to make predictions.**  After passing through the Transformer encoder, the final output embedding corresponding to the `[class]` token is fed into a small Multi-Layer Perceptron (MLP) for the final classification. 
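
The patching, projection, and class-token steps above can be sketched in NumPy as follows. The image size, embedding width, channel-last layout, and zero-initialised class token are assumptions for illustration, and position embeddings are left out for brevity.

```python
import numpy as np

def patchify(image, patch):
    # Split an (H, W, C) image into flattened non-overlapping patches.
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)   # (num_patches, p*p*C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))        # illustrative input
patch, d_model = 16, 64                       # d_model is an assumed width

patches = patchify(image, patch)              # one row per patch "token"
W_embed = rng.normal(size=(patch * patch * 3, d_model)) * 0.02
tokens = patches @ W_embed                    # linear projection to embeddings

cls_token = np.zeros((1, d_model))            # a learnable parameter in a real model
sequence = np.concatenate([cls_token, tokens], axis=0)
```

After the encoder, the row at index 0 of the output (the class token) is what the MLP head would consume.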

---

### Question 5: How ViTs Capture Global Information

* **The Question:** This question asks which component of the ViT architecture is responsible for its ability to understand global context.
* **Correct Answers Explained:**
    * **C. This is because the Multi-head Self-attention layers of the Encoder blocks.**  Self-attention is the mechanism that allows every single patch (token) in the image to directly interact with and attend to every other patch. This direct, all-to-all comparison is how long-range dependencies and global information are captured. 
    * **D. The global information is captured in the class token at the final layer because this summarizes the token embeddings at the input layer.**  Through the layers of self-attention, the `[class]` token aggregates information from all the patch embeddings, effectively becoming a summary representation of the entire image, which is then used for classification. 

---

### Question 6: Properties of Vision Transformers

* **The Question:** This question asks for general true statements about the characteristics and requirements of ViTs.
* **Correct Answers Explained:**
    * **A. ViTs can naturally capture the global information of images.**  As established, this is a core strength of the self-attention mechanism, which has a global receptive field from the very first layer. 
    * **C. ViTs can find the long-term dependencies among image patches.**  This is another way of saying they capture global information. They can model how a patch in the top-left corner relates to a patch in the bottom-right corner directly. 
    * **D. We need massive datasets to train ViTs.**  ViTs have fewer built-in "vision-specific" biases than CNNs. To learn these patterns from scratch, they are very data-hungry and typically require pre-training on enormous datasets (like ImageNet-21k or JFT-300M) to perform well. 

---

### Question 7: Swin Transformers - Hierarchical Vision

* **The Question:** This question introduces the **Swin Transformer** and asks about its key architectural differences from the original ViT.
* **Correct Answers Explained:**
    * **A. Swin Transformers employ smaller patches of [3,4,4].**  Swin starts by dividing the image into much smaller, non-overlapping patches of 4x4 pixels across 3 color channels (hence `[3,4,4]`), compared with the 16x16 patches typical of ViT. This yields a higher-resolution initial feature map.
    * **C. Swin Transformers apply a linear projection to flattened patches to gain [C, H/4, W/4].**  After the initial patching (with 4x4 patches), a linear embedding layer projects the patches into a feature map with dimensions `(H/4, W/4)` and `C` channels. 
    * **F. For Swin Transformers, we apply patch merging to down-sample the input shape by two while doubling the depth.**  Swin introduces a hierarchical structure like a CNN. It uses a **patch merging** layer at different stages to downsample the spatial resolution by a factor of 2 (e.g., from H/4, W/4 to H/8, W/8) while doubling the number of channels (from C to 2C). 

---

### Question 8: Swin Transformers - Patch Merging

* **The Question:** This question asks for the specifics of the **Patch Merging** process in Swin Transformers.
* **Correct Answers Explained:**
    * **A. We merge 2x2 neighbourhood patches, concatenate their embeddings, and then apply a linear projection.**  This is the exact mechanism. It takes a group of 2x2 adjacent patches, concatenates their feature vectors, and then uses a linear layer to reduce the dimensionality. 
    * **E. If we input the patch merging [C, H/4, W/4], we gain [2C, H/8, W/8].**  The 2x2 merging reduces the height and width by half (`H/4 -> H/8`, `W/4 -> W/8`). Concatenating the four C-dimensional vectors creates a 4C-dimensional vector, which the linear layer then projects down to a 2C-dimensional vector, effectively doubling the channel depth. 
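
The shape arithmetic above can be checked with a short NumPy sketch of patch merging. It assumes a channel-last `(H, W, C)` layout rather than the `[C, H, W]` notation used in the quiz, uses random weights, and omits any normalisation layer; the function name `patch_merging` is hypothetical.

```python
import numpy as np

def patch_merging(x, W_reduce):
    # x: (H, W, C) feature map with even H and W.
    # Gather the four members of every 2x2 neighbourhood.
    x0 = x[0::2, 0::2, :]
    x1 = x[1::2, 0::2, :]
    x2 = x[0::2, 1::2, :]
    x3 = x[1::2, 1::2, :]
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (H/2, W/2, 4C)
    return merged @ W_reduce                             # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
H, W, C = 56, 56, 96                    # e.g. H/4 = W/4 = 56 for a 224x224 image
x = rng.normal(size=(H, W, C))
W_reduce = rng.normal(size=(4 * C, 2 * C)) * 0.02   # linear projection 4C -> 2C

y = patch_merging(x, W_reduce)
```

The spatial size halves in each direction while the channel count doubles, matching `[C, H/4, W/4] -> [2C, H/8, W/8]`.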

---

### Question 9: Swin Transformers - Window Self-Attention

* **The Question:** Asks about the core attention mechanism in Swin Transformers.
* **Correct Answers Explained:**
    * **B. We divide all token embeddings into many local windows and then apply the Self Attention to each local windows independently.**  To make self-attention computationally efficient, Swin doesn't compute it globally. Instead, it partitions the feature map into non-overlapping **windows** (e.g., of size 7x7) and computes self-attention *only within* each window. 
    * **D. The output shape of Window Self-Attention is the same as the input shape.**  Like standard self-attention, the windowed version processes the tokens within its local scope and outputs new embeddings of the same dimension and shape, just with updated, context-aware information. 
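
A minimal NumPy sketch of window partitioning and per-window attention is given below. To stay short it assumes a channel-last layout, a single head, identity Q/K/V projections, and no relative position bias; all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, M):
    # (H, W, C) -> (num_windows, M*M, C), non-overlapping M x M windows.
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    # Inverse of window_partition: (num_windows, M*M, C) -> (H, W, C).
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def window_self_attention(x, M):
    H, W, C = x.shape
    win = window_partition(x, M)
    # Batched matmul: each window attends only to its own M*M tokens.
    scores = win @ win.transpose(0, 2, 1) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ win
    return window_reverse(out, M, H, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))     # illustrative feature map
out = window_self_attention(x, M=4)
```

The output has the same shape as the input, and changing a token in one window leaves the outputs of the other windows untouched.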

---

### Question 10: Window Self-Attention - Pros and Cons

* **The Question:** Asks about the implications of using windowed self-attention.
* **Correct Answers Explained:**
    * **A. The Window Self-Attention can speed up the standard Self-Attention.**  Standard self-attention has a computational cost that is quadratic in the number of tokens. By restricting attention to small, fixed-size windows, Swin makes the cost linear in the number of tokens (for a fixed window size), which is much cheaper on high-resolution feature maps.
    * **C. The Window Self-Attention only allows a token to interact with the ones in the same local window.**  This is the main trade-off. For efficiency, information is confined within each window, and there is no direct communication between tokens in different windows. 
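
The quadratic-versus-linear claim can be made concrete with the operation-count estimates given in the Swin Transformer paper for a feature map of `h x w` tokens, `C` channels, and window size `M`. The specific sizes plugged in below are illustrative assumptions.

```python
def msa_cost(h, w, C):
    # Global multi-head self-attention: the second term is quadratic in h*w.
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_cost(h, w, C, M):
    # Window attention: both terms are linear in h*w when M is fixed.
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h, w, C, M = 56, 56, 96, 7   # illustrative sizes

global_cost = msa_cost(h, w, C)
window_cost = wmsa_cost(h, w, C, M)

# Doubling height and width gives 4x as many tokens.
window_growth = wmsa_cost(2 * h, 2 * w, C, M) / window_cost
global_growth = msa_cost(2 * h, 2 * w, C) / global_cost
```

With four times as many tokens, the windowed cost grows by exactly four, while the global cost grows by more than that because of its quadratic term.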

---

### Question 11: Swin Transformers - Shifted Window Self-Attention

* **The Question:** Asks about the **Shifted Window** mechanism, which is Swin's solution to the problem identified in the previous question.
* **Correct Answers Explained:**
    * **B. The Shifted Window Self-Attention allows a token to interact with the ones in the different local windows.** 
    * **C. The Shifted Window Self-Attention enables the interaction across local windows.** 
    * **D. The Shifted Window Self-Attention shifts a local window to right and bottom to become a new local window.**  In consecutive blocks, Swin alternates between regular windowing and a **shifted** window configuration. By shifting the grid, the new windows cross the boundaries of the old ones, allowing information to be mixed between windows from the previous layer. 
    * **F. The output shape of Shifted Window Self-Attention is the same as the input shape.**  This mechanism is still a form of self-attention, so it maintains the shape of the feature map. 
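
One way to see how shifting mixes information across windows is to label every position with its regular window index, cyclically roll the map, and inspect a window of the shifted layout. This NumPy sketch assumes a shift of half the window size implemented with `np.roll`, and it leaves out the attention mask that a full implementation would need for positions that wrap around the border.

```python
import numpy as np

H = W = 8          # illustrative feature-map size
M = 4              # window size
shift = M // 2     # assumed shift of half a window

# Index of the regular (unshifted) window that each position belongs to.
rows = np.arange(H) // M
cols = np.arange(W) // M
regular = rows[:, None] * (W // M) + cols[None, :]

# Cyclically shift the map, then look at the first M x M window of the new grid.
rolled = np.roll(regular, shift=(-shift, -shift), axis=(0, 1))
shifted_window = rolled[:M, :M]

regular_members = np.unique(regular[:M, :M]).tolist()
shifted_members = np.unique(shifted_window).tolist()
```

The first regular window contains tokens from a single window, whereas the first shifted window contains tokens from all four regular windows, so attention inside it exchanges information across the old boundaries. The roll does not change the shape of the map.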

---

### Question 12: Principle of Fine-Tuning with Additional Components

* **The Question:** This question asks about the general philosophy behind Parameter-Efficient Fine-Tuning (PEFT) methods.
* **Correct Answers Explained:**
    * **A. We insert additional components to pretrained ViTs that favour the original computation of ViTs and then fine-tune the additional components.** 
    * **C. We insert additional components to pretrained ViTs that favour the original computation of ViTs and then consider the additional components as variables to optimize in optimizers.**  The core idea of PEFT is to **freeze** the vast majority of the large pretrained model's weights and only train a small number of *newly added*, lightweight parameters (the "additional components"). This is much faster and more memory-efficient than full fine-tuning. 

---

### Question 13: Prompt-Tuning

* **The Question:** Asks specifically where **Prompt-Tuning** adds its learnable parameters.
* **Correct Answer Explained:**
    * **A. We insert learnable prompts to token embeddings of ViTs and then fine-tune these prompts.**  Prompt-tuning works by prepending a small number of new, learnable "prompt" tokens to the sequence of input patch embeddings. Only these prompt tokens are trained, while the rest of the model is frozen. 
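
A rough sketch of where the prompts sit and how few parameters they add is shown below. The number of prompts, the embedding width, and the decision to place the prompts directly in front of the patch embeddings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_model, n_prompts = 196, 64, 8   # illustrative sizes

# Output of the frozen patch-embedding layer (stand-in values).
patch_tokens = rng.normal(size=(n_patches, d_model))

# The only trainable tensor in this sketch.
prompts = rng.normal(size=(n_prompts, d_model)) * 0.02

# The frozen encoder would consume this longer sequence.
sequence = np.concatenate([prompts, patch_tokens], axis=0)

n_trainable = prompts.size
```

The sequence grows by `n_prompts` tokens, and the trainable parameter count is just `n_prompts * d_model`.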

---

### Question 14: Fine-Tuning with Adapters

* **The Question:** Asks specifically where **Adapters** are inserted for fine-tuning.
* **Correct Answer Explained:**
    * **B. We insert adapters to pointwise networks of ViTs and then fine-tune these adapters.**  Adapters are small, bottleneck-like neural network modules (e.g., two dense layers with a non-linearity) that are inserted *inside* the Transformer blocks, typically after the feed-forward networks (FFN/pointwise networks). Only the weights of these small adapter modules are trained. 
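
A bottleneck adapter of the kind described above can be sketched in a few lines of NumPy. The ReLU non-linearity, the residual connection, the bottleneck width, and the zero initialisation of the up-projection are common choices assumed here, not details taken from the quiz.

```python
import numpy as np

def adapter(h, W_down, W_up):
    # Down-project, apply a non-linearity, up-project, add the residual.
    return h + np.maximum(h @ W_down, 0.0) @ W_up

rng = np.random.default_rng(0)
n_tokens, d_model, bottleneck = 5, 64, 8    # illustrative sizes

W_down = rng.normal(size=(d_model, bottleneck)) * 0.02   # trainable
W_up = np.zeros((bottleneck, d_model))                   # trainable, zero init

h = rng.normal(size=(n_tokens, d_model))    # stand-in for the frozen FFN output
out = adapter(h, W_down, W_up)

n_trainable = W_down.size + W_up.size
```

With the up-projection initialised to zero, the adapter starts as an identity mapping, so inserting it does not disturb the pretrained computation before training begins.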

---

### Question 15: Fine-Tuning with LoRA

* **The Question:** Asks specifically where **LoRA (Low-Rank Adaptation)** modifies the model.
* **Correct Answer Explained:**
    * **C. We insert low-ranked matrices to the key, query, and value matrices of ViTs and then fine-tune these low-ranked matrices.**  LoRA rests on the hypothesis that the change in weights during fine-tuning has a low "intrinsic rank." It freezes the original weight matrices (like $W_Q$, $W_K$, and $W_V$) and learns the update as the product of two much smaller, low-rank matrices, which is added to the frozen weights. Only these small matrices are trained.
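
The low-rank update can be sketched for a single projection matrix as follows. The rank `r`, the scaling factor `alpha / r`, the initialisation, and the names `A`, `B`, and `lora_linear` are assumptions for illustration; the same pattern would apply to each of the query, key, and value projections.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha, r):
    # Frozen path plus a scaled low-rank update: x W + (alpha / r) * x A B.
    return x @ W + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8      # illustrative sizes

W = rng.normal(size=(d_in, d_out))        # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01     # trainable
B = np.zeros((r, d_out))                  # trainable, zero init

x = rng.normal(size=(5, d_in))
out = lora_linear(x, W, A, B, alpha, r)

n_trainable = A.size + B.size
n_frozen = W.size
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable values is `r * (d_in + d_out)` instead of `d_in * d_out`.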

## Revision Notes: Key Takeaways

### 1. Self-Attention vs. Cross-Attention Recap
* **Self-Attention:** A sequence attends to itself. Q, K, and V all come from the **same sequence**. It models **intra-sequence** relationships.
* **Cross-Attention:** One sequence attends to another. Q comes from the **target sequence** (e.g., decoder), while K and V come from the **source sequence** (e.g., encoder). It models **inter-sequence** relationships.

---

### 2. Vision Transformers (ViT) 🖼️
* **Motivation:** Overcomes CNN limitations like weak global context by applying the Transformer architecture directly to images.
* **Core Process:**
    1.  **Patching:** Divide an image into a grid of patches (e.g., 16x16). Each patch is a "token". 
    2.  **Embedding:** Flatten and linearly project each patch into a vector. 
    3.  **Class Token:** Prepend a learnable `[class]` token to the sequence to aggregate global information. 
    4.  **Transformer Encoder:** Process the sequence of tokens with self-attention.
    5.  **MLP Head:** Use the final `[class]` token's output for classification. 
* **Key Property:** Has a global receptive field from the start but is data-hungry and needs large-scale pre-training. 

---

### 3. Swin Transformer
* **Motivation:** Brings the strengths of CNNs (hierarchical structure, locality) to Transformers, making them more efficient and effective as a general-purpose vision backbone.
* **Key Innovations:**
    * **Hierarchical Features:** Creates feature maps at different scales using a **Patch Merging** layer, which downsamples resolution and increases channel depth. 
    * **Windowed Self-Attention (W-MSA):** Computes self-attention only within local, non-overlapping windows, making the cost linear in the number of tokens for a fixed window size.
    * **Shifted Window Self-Attention (SW-MSA):** Alternates W-MSA with a shifted window configuration to create cross-window connections, so information can propagate across the whole feature map over successive layers.

---

### 4. Parameter-Efficient Fine-Tuning (PEFT) 💡
* **Goal:** Adapt a large, pre-trained model to a new task without retraining all of its parameters.
* **Principle:** **Freeze** the original model weights and inject a small number of new, trainable parameters. This greatly reduces the computation and memory needed for fine-tuning.
* **Popular PEFT Methods:**
    * **Prompt-Tuning:** Adds learnable "prompt" tokens to the input sequence. 
    * **Adapters:** Inserts small, bottleneck-like neural network modules inside the Transformer blocks (usually after the FFN). 
    * **LoRA (Low-Rank Adaptation):** Modifies the weight matrices in the attention mechanism (Q, K, V) by adding a low-rank update composed of two small, trainable matrices. 