## Quiz Questions Explained

---

### Question 1: Encoder-Decoder Model Roles

* **The Question:** This asks about the fundamental roles of the main components in an encoder-decoder architecture. 🏗️
* **Correct Answers Explained:**
    * **C. Encoder tries to encode an input sequence to a context vector.** The **encoder's** entire job is to read the input sequence (e.g., a sentence in French) and compress all its information into a single, fixed-size vector representation, known as the **context vector**.
    * **B. Decoder tries to read from context vector to generate an output sequence.** The **decoder's** job is to take that context vector and "unfold" it, generating the output sequence one element at a time (e.g., the translated sentence in English).
    * **E. Context vector summarizes an input sequence.** The context vector is the final, condensed representation of the entire input sequence that the encoder produces.

---

### Question 2: Seq2Seq for Machine Translation

* **The Question:** This question focuses on the typical neural network types used in a sequence-to-sequence (seq2seq) model and the origin of the context vector.
* **Correct Answers Explained:**
    * **C. Encoder is a recurrent neural network and decoder is a recurrent neural network.** Because machine translation deals with ordered sequences of words, the standard and most natural choice for both the encoder and decoder is an **RNN** (or its more advanced variants like LSTM or GRU).
    * **E. Context vector could be the last hidden state of encoder.** After the encoder RNN processes the entire input sequence, its **final hidden state** can serve as the context vector, since it is assumed to summarize the whole sequence.
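
To make the "last hidden state as context" idea concrete, here is a minimal NumPy sketch. It assumes a plain tanh RNN encoder; the function name `rnn_encode`, the weight names, and all sizes are illustrative and not taken from the quiz.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over `inputs` and return every hidden state."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:                      # one input embedding per timestep
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)                 # shape: (T_x, hidden_size)

rng = np.random.default_rng(0)
T_x, emb_size, hid_size = 5, 4, 3           # illustrative sizes
inputs = rng.normal(size=(T_x, emb_size))   # stand-in for embedded source words
W_xh = rng.normal(size=(hid_size, emb_size)) * 0.1
W_hh = rng.normal(size=(hid_size, hid_size)) * 0.1
b_h = np.zeros(hid_size)

encoder_states = rnn_encode(inputs, W_xh, W_hh, b_h)
context = encoder_states[-1]                # last hidden state used as the context vector
```

The other rows of `encoder_states` are discarded in this basic setup; attention-based models keep them.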

---

### Question 3: Seq2Seq Likelihood Derivation

* **The Question:** This asks you to interpret the terms in the mathematical formula for the probability of generating a target sequence in a seq2seq model.
* **Correct Answers Explained:**
    * **A. In the derivation (1), c is viewed as a summary of the sequence $x_{1:T_x}$**. The derivation conditions the entire output sequence probability on `c`, which is the context vector generated by the encoder from the input sequence `x`.
    * **C. In the derivation (2), $q_{j-1}$ is viewed as a summary of the sequence $y_{1:j-1}$**. $q_{j-1}$ represents the hidden state of the decoder at the previous timestep. This state is a summary of the output words the decoder has generated so far ($y_1$ through $y_{j-1}$).
    * **E. $P(y_j|q_{j-1}, c, \theta)$ means that on top of $q_{j-1}$, c, we can build up some dense layers to predict $y_j$.** This conditional probability is typically modeled by taking the decoder's hidden state $q_{j-1}$ (and sometimes the context `c`) and passing it through a dense layer followed by a softmax function to get the probability distribution over the vocabulary for the next word, $y_j$.
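
The quiz's derivations (1) and (2) are not reproduced in these notes. One plausible way to write the factorization they refer to, with $T_y$ (the output length) as an assumed symbol, is:

$$
\begin{aligned}
P(y_{1:T_y} \mid x_{1:T_x}, \theta)
  &\approx P(y_{1:T_y} \mid c, \theta)
  && \text{(1): } c \text{ summarizes } x_{1:T_x} \\
  &= \prod_{j=1}^{T_y} P(y_j \mid y_{1:j-1}, c, \theta)
  && \text{chain rule} \\
  &\approx \prod_{j=1}^{T_y} P(y_j \mid q_{j-1}, c, \theta)
  && \text{(2): } q_{j-1} \text{ summarizes } y_{1:j-1}
\end{aligned}
$$

Each factor in the last line is what the dense-plus-softmax layer on top of $q_{j-1}$ and $c$ is meant to model.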

---

### Question 4: Formulating the Conditional Distribution

* **The Question:** This question presents three different ways a decoder can be structured and asks which are valid ways to model the conditional probability of the next word.
* **Correct Answers Explained:**
    * **E. Only (a) and (b) can be used to formulate the above conditional distribution.** 
        * **Diagram (a)** shows the most basic decoder: the context vector `c` initializes the first hidden state, and each subsequent word is generated based on the previous hidden state and the previously generated word. This is a valid formulation.
        * **Diagram (b)** shows a common improvement where the context vector `c` is concatenated with the word embedding at *every* timestep. This constantly reminds the decoder of the overall context and is also a valid formulation.
        * **Diagram (c)** depicts an **attention mechanism**, where a *new*, dynamic context vector is calculated at each step. This is a different and more advanced formulation than the one described by $P(y_j|q_{j-1}, c, \theta)$, which assumes a single, fixed context vector `c`.
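
The difference between (a) and (b) comes down to what is fed into the decoder at each step. The sketch below assumes a tanh RNN decoder whose hidden size matches the context size; all weight names and sizes are made up for illustration, and the diagrams themselves may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_size, hid_size = 4, 3                       # illustrative sizes
c = rng.normal(size=hid_size)                   # fixed context vector from the encoder
y_prev_emb = rng.normal(size=emb_size)          # embedding of the previously generated word
W_hh = rng.normal(size=(hid_size, hid_size)) * 0.1

# (a) c only initializes the hidden state; the step input is just the embedding.
W_in_a = rng.normal(size=(hid_size, emb_size)) * 0.1
q_a = np.tanh(W_in_a @ y_prev_emb + W_hh @ c)   # first decoder step, previous state = c

# (b) c is additionally concatenated with the embedding at every step.
W_in_b = rng.normal(size=(hid_size, emb_size + hid_size)) * 0.1
step_input_b = np.concatenate([y_prev_emb, c])  # [embedding ; context]
q_prev = rng.normal(size=hid_size)              # stand-in for some earlier decoder state
q_b = np.tanh(W_in_b @ step_input_b + W_hh @ q_prev)
```

In both cases the same fixed `c` is used throughout, which is why both fit $P(y_j|q_{j-1}, c, \theta)$.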

---

### Question 5: The Decoding Process

* **The Question:** This question covers the step-by-step process of using a trained seq2seq model for inference (decoding).
* **Correct Answers Explained:**
    * **A. In phase 1, we feed the input sequence to the encoder to evaluate the context c as the last hidden state of the encoder.** This is the encoding step, where the input is summarized.
    * **C. In phase 2, we feed the BOS symbol to the decoder and decode the output sequence from this symbol.** The decoding process is kick-started by feeding a special `<BOS>` (Begin-of-Sequence) token to the decoder.
    * **E. In phase 2, we initialize the first hidden state of decoder with the last hidden state of the encoder.** The context vector `c` is used as the initial hidden state for the decoder, giving it the summary of what it needs to translate.
    * **G. In phase 2, if we use the greedy strategy, at each timestep, we choose the next output item that maximizes the conditional distribution.** Greedy decoding is a deterministic strategy where you always pick the word with the highest probability at each step.
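
The two phases can be sketched as a short greedy decoding loop. This is a toy under stated assumptions: the weights are random rather than trained, the token ids `BOS` and `EOS` and the `max_len` cutoff are hypothetical, and phase 1 is replaced by a random vector standing in for the encoder's last hidden state.

```python
import numpy as np

BOS, EOS = 0, 1                                 # hypothetical special-token ids
VOCAB, HID = 6, 4                               # illustrative sizes

rng = np.random.default_rng(0)
E = rng.normal(size=(VOCAB, HID))               # toy embedding table
W_xh = rng.normal(size=(HID, HID))
W_hh = rng.normal(size=(HID, HID))
W_out = rng.normal(size=(VOCAB, HID))           # dense layer on top of the decoder state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(context, max_len=10):
    h = context                                 # init decoder state with the encoder's last state
    token = BOS                                 # start decoding from the BOS symbol
    output = []
    for _ in range(max_len):
        h = np.tanh(W_xh @ E[token] + W_hh @ h)
        probs = softmax(W_out @ h)              # distribution over the vocabulary
        token = int(np.argmax(probs))           # greedy: pick the most probable item
        if token == EOS:
            break
        output.append(token)
    return output

context = rng.normal(size=HID)                  # stands in for phase 1
tokens = greedy_decode(context)
```

Because `argmax` is deterministic, calling `greedy_decode(context)` again returns the same sequence.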

---

### Question 6: Time-Varied vs. Fixed Context

* **The Question:** This asks for the advantages of using a dynamic, time-varied context (i.e., from an attention mechanism) compared to a single fixed-length context vector.
* **Correct Answers Explained:**
    * **A. Fixed-length context is possibly less powerful to capture long input sequences, while timely varied context can provide dynamic and timely adapted context for input sequences.** The fixed-length vector is an information bottleneck; it's difficult to cram the meaning of a long sentence into one vector. A dynamic context overcomes this.
    * **E. Timely varied context can focus on some input items or words that are more important to generate specific output items or words, while fixed-length context cannot.** This is the core idea of attention. The decoder can "look back" at the input and decide which words are most relevant for generating the *current* output word, creating a more focused and accurate translation.

---

### Question 7: Global Attention

* **The Question:** Asks for the correct definition of **global attention**.
* **Correct Answers Explained:**
    * **D. In the global attention, the time varied context is computed based on all encoder hidden states.** Global attention lives up to its name by considering *every single hidden state* from the encoder when creating the context vector for the decoder at each step.
    * **F. In the global attention, the time varied context is a linear combination of all encoder hidden states.** The context vector is a **weighted average** (a linear combination) of all encoder hidden states, where the weights are determined by the relevance of each input word to the current output word.
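
A minimal sketch of global attention for a single decoding step follows. It assumes a plain dot-product score between the decoder state and each encoder state, which is only one of several possible scoring functions; the function and variable names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state     # one score per source position
    weights = softmax(scores)                   # attention weights, summing to 1
    context = weights @ encoder_states          # linear combination of ALL encoder states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 3))        # 4 source positions, hidden size 3
decoder_state = rng.normal(size=3)
c_t, a_t = global_attention(decoder_state, encoder_states)
```

Calling the function again with a different `decoder_state` gives a different `c_t`, which is what "time-varied context" means.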

---

### Question 8: Local Attention

* **The Question:** Asks for the correct definition of **local attention**.
* **Correct Answers Explained:**
    * **A. In the local attention, the time varied context is computed based on all encoder hidden states in a selective window.** Unlike global attention, local attention is a compromise that only considers a smaller, predefined *window* of encoder hidden states around a predicted alignment position. This makes it more computationally efficient.
    * **E. In the local attention, the time varied context is a linear combination of all encoder hidden states in a selective window.** Similar to global attention, the context vector is still a weighted average, but only of the hidden states *within that selected window*.
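
Local attention can be sketched by restricting the same computation to a window. The assumptions here: the aligned position `p_t` is simply passed in (how it is predicted is out of scope), the half-width `D` is a hypothetical parameter, scores are plain dot products, and any extra re-weighting inside the window is omitted.

```python
import numpy as np

def local_attention(decoder_state, encoder_states, p_t, D=1):
    lo = max(0, p_t - D)                            # clip the window to the sequence
    hi = min(len(encoder_states), p_t + D + 1)
    window = encoder_states[lo:hi]                  # only states inside the window
    scores = window @ decoder_state
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                           # weights over the window only
    context = weights @ window                      # linear combination within the window
    return context, (lo, hi), weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 3))            # 6 source positions, hidden size 3
decoder_state = rng.normal(size=3)
c_t, (lo, hi), a_t = local_attention(decoder_state, encoder_states, p_t=2, D=1)
```

With `p_t=2` and `D=1`, only source positions 1 to 3 (0-indexed) receive any weight.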

---

### Question 9: Attention Calculation Example

* **The Question:** This question asks you to interpret a diagram showing how an attention-based context vector is calculated.
* **Correct Answers Explained:**
    * **A. The second word is more important to the generation of the current output word.** The attention weights ($a_t$) are `[0.2, 0.5, 0.2, 0.1]`. The highest weight, **0.5**, corresponds to the second encoder hidden state ($\bar{h}_2$), indicating that the second input word ("am") has the most influence on the current decoding step.
    * **C. $c_t = 0.2\bar{h}_1 + 0.5\bar{h}_2 + 0.2\bar{h}_3 + 0.1\bar{h}_4$**. The context vector $c_t$ is the weighted sum of the encoder hidden states, where the weights are the attention weights $a_t$ (the normalized scores). This formula writes out exactly that sum.
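
The same arithmetic can be checked numerically. The attention weights below are the ones from the question; the encoder hidden states are made-up 2-dimensional vectors, since the diagram's actual values are not given here.

```python
import numpy as np

a_t = np.array([0.2, 0.5, 0.2, 0.1])        # attention weights from the question
h_bar = np.array([[1.0, 0.0],               # h_1 (made-up values)
                  [0.0, 1.0],               # h_2
                  [1.0, 1.0],               # h_3
                  [2.0, 0.0]])              # h_4

c_t = a_t @ h_bar                           # 0.2*h_1 + 0.5*h_2 + 0.2*h_3 + 0.1*h_4
most_important = int(np.argmax(a_t)) + 1    # 1-indexed position of the largest weight

# c_t is approximately [0.6, 0.7]; most_important is 2
```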

---

### Question 10: Transformer Positional Encoding

* **The Question:** Asks about the purpose of Positional Encoding in the Transformer architecture.
* **Correct Answers Explained:**
    * **B. It helps capture the position of a word/token in a sentence.** The self-attention mechanism in a Transformer doesn't have a built-in sense of sequence order. Positional encodings are added to the input to give the model information about the position of each word.
    * **D. It is added to the embeddings of words/tokens in a sentence.** The positional encoding vector has the same dimension as the word embedding and is directly **added** to it, combining word meaning with position information.
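
The quiz does not say which positional encoding is meant. As one well-known example, the sketch below uses the fixed sinusoidal scheme from the original Transformer paper and assumes an even `d_model`; the function name and sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]           # indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                                     # illustrative sizes
token_embeddings = rng.normal(size=(seq_len, d_model))      # stand-in word embeddings
pe = sinusoidal_positional_encoding(seq_len, d_model)
model_input = token_embeddings + pe                         # same shape, added element-wise
```

The addition only works because `pe` has the same shape as the embeddings, which is the point of answer D.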

---

### Question 11: Transformer Layer Norm

* **The Question:** Asks about the properties of Layer Normalization as used in Transformers.
* **Correct Answers Explained:**
    * **B. It normalizes the input tensor across the embedding size dimension...** Unlike Batch Norm, which normalizes across the batch dimension, Layer Norm normalizes across the feature (or embedding) dimension for each individual example in the batch.
    * **D. It has the scaling and shifting parameters $\gamma$ and $\beta$.** Just like Batch Norm, Layer Norm has learnable affine parameters that allow the model to scale and shift the normalized output, preserving its expressive power.
    * **E. It is more effective than Batch Norm for sequential data.** Layer Norm's statistics depend neither on the other examples in the batch nor on the sequence length, which tends to make it more stable than Batch Norm for the variable-length sequences common in NLP.
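
A minimal NumPy sketch of Layer Norm, assuming the input is laid out as `(batch, seq_len, embedding)` and using a small `eps` for numerical stability (the value `1e-5` is a common default, not something the quiz specifies):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are taken over the last (embedding) axis only,
    # separately for every token of every example.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta                 # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 8))   # (batch, seq_len, embedding)
gamma, beta = np.ones(8), np.zeros(8)                # initial values before training
y = layer_norm(x, gamma, beta)
```

With `gamma` at one and `beta` at zero, each token's vector in `y` has roughly zero mean and unit variance, regardless of batch size or sequence length.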

---

### Question 12: Self-Attention Mechanism

* **The Question:** Asks how the output of a self-attention layer relates to its input.
* **Correct Answers Explained:**
    * **B. The token embedding $z_i$ is mainly dependent on its previous token embedding $x_i$, but other $x_j (j \ne i)$ also contributes...** The output $z_i$ for a token is a weighted sum over the *entire* input sequence (more precisely, over the Value vectors computed from it), so every other token $x_j$ contributes to $z_i$ to some degree.
    * **C. More similar $x_j$ is to $x_i$, more contribution it is to the computation of $z_i$.** The attention weights come from a similarity score (a dot product between the Query of $x_i$ and the Key of $x_j$). A higher score yields a larger weight, so more "similar" or "relevant" tokens have a greater influence on the output.

---

### Question 13: Self-Attention Calculation Steps

* **The Question:** This question asks for the correct sequence of computations within a self-attention layer.
* **Correct Answers Explained:**
    * **A. We use three weight matrices $W_Q, W_K, W_V$ to compute $Q, K, V$ respectively.** The input embeddings `X` are projected into three different spaces—Query, Key, and Value—using three distinct, learnable weight matrices.
    * **C. We rely on $Q, K$ to compute the attention scores...** The attention scores, which represent the relevance of each word to every other word, are calculated using the Query and Key matrices (typically as $QK^T$, often scaled by $1/\sqrt{d_k}$, where $d_k$ is the key dimension).
    * **D. $Q, K$ can be considered as two other views of $X$.** Since Q and K are linear projections of X, they represent the input data from different perspectives, tailored for calculating similarity.
    * **E. We apply the softmax function to the attention scores...** The raw attention scores are normalized with a softmax, applied row by row, so that each token's scores become a probability distribution of attention weights.
    * **G. We multiply A and V to obtain the new token/word embeddings $Z = AV$.** The final output is a weighted sum of the Value vectors, where the weights are the attention probabilities from matrix A.
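
The steps above can be sketched in a few lines of NumPy. Assumptions: tokens are the rows of `X` (so the projections are written `X @ W_Q`), the scores are scaled by $\sqrt{d_k}$ as in the original Transformer, and the weights are random placeholders rather than trained values.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # three views of the same input X
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (seq_len, seq_len) relevance scores
    A = softmax(scores, axis=-1)                # each row is a distribution over tokens
    Z = A @ V                                   # new token embeddings
    return Z, A

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 5                 # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Z, A = self_attention(X, W_Q, W_K, W_V)
```

Row `i` of `A` holds the weights that token `i` places on every token, itself included, and row `i` of `Z` is the corresponding weighted sum of Value vectors.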

---

### Question 14: Multi-Head Self-Attention

* **The Question:** Asks about the key features of Multi-Head Self-Attention.
* **Correct Answers Explained:**
    * **A. Each head has its own $W_Q, W_K, W_V$.** The "multi-head" design runs several self-attention mechanisms in parallel. Critically, each "head" has its own independent set of Q, K, and V weight matrices, allowing it to learn different types of relationships.
    * **C. We perform each head independently.** Because each head has its own parameters, the heads' computations do not depend on one another and can run in parallel.
    * **F. We concatenate the outputs of each head and input this concatenation to one more linear layer $W_o$...** The outputs from all the parallel heads are concatenated and then passed through a final linear projection layer ($W_o$) to produce the final output of the multi-head block.
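
A compact sketch of the multi-head pattern, under the same conventions as single-head self-attention (tokens as rows, scaled dot-product scores). The head count, the per-head size `d_head = d_model // n_heads`, and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def multi_head_self_attention(X, heads, W_O):
    # Each head uses its own (W_Q, W_K, W_V) and never sees the other heads.
    outputs = [attention_head(X, *weights) for weights in heads]
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_head)
    return concat @ W_O                         # final linear projection

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2             # illustrative sizes
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
Z = multi_head_self_attention(X, heads, W_O)
```

The list comprehension runs the heads one after another for clarity; real implementations typically batch them into a single tensor operation.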

---

### Question 15: Cross-Attention

* **The Question:** Asks about the role and mechanism of Cross-Attention in the Transformer decoder.
* **Correct Answers Explained:**
    * **A. We use the Cross-Attention to inject the encoder output to the decoder layers.** Cross-attention is the bridge between the encoder and decoder. It's how the decoder "looks at" the source sentence to inform its output generation.
    * **C. For the Cross-Attention, the decoder input is used to compute $Q$, whereas the encoder output is used to compute $K, V$.** This is the key difference from self-attention. The Query comes from the decoder (representing the question "what information do I need now?"), while the Key and Value come from the encoder's output (representing the available information from the source sentence).
    * **F. The Cross-Attention is involved in the computation of decoder output.** It's a fundamental sub-layer within each block of the Transformer's decoder stack.
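
Cross-attention reuses the same arithmetic as self-attention and only changes where $Q$, $K$, and $V$ come from. The sketch below assumes tokens as rows, scaled dot-product scores, and random placeholder weights; the names `decoder_X` and `encoder_out` are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_X, encoder_out, W_Q, W_K, W_V):
    Q = decoder_X @ W_Q                         # queries come from the decoder
    K = encoder_out @ W_K                       # keys come from the encoder output
    V = encoder_out @ W_V                       # values come from the encoder output
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1])) # (target_len, source_len)
    return A @ V, A

rng = np.random.default_rng(0)
target_len, source_len, d_model = 3, 5, 8       # illustrative sizes
decoder_X = rng.normal(size=(target_len, d_model))
encoder_out = rng.normal(size=(source_len, d_model))
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))
Z, A = cross_attention(decoder_X, encoder_out, W_Q, W_K, W_V)
```

Note that `A` is not square here: each of the 3 target positions distributes its attention over the 5 source positions.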

## Revision Notes: Key Takeaways

### 1. The Encoder-Decoder Framework 🌉
* **Concept:** A powerful architecture for sequence-to-sequence (seq2seq) tasks like machine translation.
* **Encoder:** Reads and condenses an input sequence into a fixed-size **context vector**. Typically an RNN in classic models.
* **Decoder:** Takes the context vector and generates an output sequence step-by-step.
* **Bottleneck:** The single, fixed-size context vector can struggle to summarize long and complex sequences, which led to the development of attention.

### 2. The Attention Mechanism 🎯
* **Core Idea:** Allows the decoder to dynamically "look back" at the entire input sequence and focus on the most relevant parts at each step of generating the output. This overcomes the fixed-context bottleneck.
* **Context Vector ($c_t$):** With attention, the context vector is no longer fixed. A new, time-varied $c_t$ is computed at each decoding step as a **weighted sum of all encoder hidden states**.
* **Types of Attention:**
    * **Global Attention:** Considers *all* encoder hidden states to compute the context vector.
    * **Local Attention:** A more efficient version that only considers encoder hidden states within a smaller *window*.

### 3. The Transformer: Beyond Recurrence
* The Transformer architecture discards RNNs entirely and uses **attention mechanisms**, rather than recurrence, to relate the tokens of a sequence to one another.
* **Positional Encoding:** Since Transformers have no recurrence, they have no inherent sense of word order. Positional encodings are vectors that are **added to the word embeddings** to provide the model with information about a token's position in the sequence.
* **Layer Normalization:** Used instead of Batch Normalization as it's more stable for variable-length sequences. It normalizes across the feature/embedding dimension for each example independently.

### 4. Self-Attention: The Heart of the Transformer ❤️
* **Concept:** A mechanism where a sequence pays attention to itself to create context-aware representations for each of its elements.
* **The QKV Trio:** For each input token, three vectors are created via linear projections:
    * **Query (Q):** What I'm looking for.
    * **Key (K):** What I contain.
    * **Value (V):** What I will give you.
* **Calculation:**
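    1.  Compute similarity scores: $Scores = Q \cdot K^T$ (the original Transformer also divides by $\sqrt{d_k}$, where $d_k$ is the key dimension, to keep the scores in a stable range).
    2.  Normalize each row of scores into probabilities: $Attention = \text{softmax}(Scores)$.
    3.  Compute output: $Z = Attention \cdot V$. The output for each token is a weighted sum of the Value vectors of all tokens in the sequence, including its own.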
* **Multi-Head Attention:** Runs multiple self-attention "heads" in parallel, each with its own QKV matrices. This allows the model to learn different types of relationships simultaneously. The results are concatenated and passed through a final linear layer.
* **Cross-Attention:** Used in the Transformer's decoder. It's the same mechanism, but the **Q comes from the decoder** and the **K and V come from the encoder's output**. This is how the decoder attends to the input sentence.