# Quiz: Attention-Based Models Assessment
---

### Q1. What is the primary purpose of a sequence-to-sequence (seq2seq) model? 
1. Classification 
2. Sequence generation 
3. Image recognition 
4. Anomaly detection

The correct answer is:

**2. Sequence generation**

Seq2seq models are primarily used for tasks where one sequence needs to be transformed into another, such as **machine translation, text summarization, and speech recognition**.


### Q2. Which components are essential in a Seq2Seq model? 
1. Input layer and output layer 
2. Encoder and decoder 
3. Convolutional layers 
4. Dropout and batch normalization

The correct answer is:

**2. Encoder and decoder** 

A **Seq2Seq model** consists of:

* **Encoder** → processes the input sequence and compresses it into a context (hidden state).
* **Decoder** → generates the output sequence step by step using that context.

Other options (like convolutional layers, dropout, etc.) can be used additionally, but **encoder–decoder is the core architecture**.


### Q3. In a Seq2Seq model, what does the encoder output? 
1. A sequence of words 
2. A fixed-size context vector 
3. A classification label 
4. An attention score

The correct answer is:

**2. A fixed-size context vector** 

In a standard **Seq2Seq model**, the **encoder** processes the input sequence and compresses it into a **context vector (hidden state)**. This vector captures the essential information of the input sequence and is passed to the **decoder**, which generates the output sequence.

With **attention mechanisms**, the encoder instead exposes a **sequence of hidden states** (one per input token), and the decoder selectively focuses on different parts of it. In the basic Seq2Seq model, however, the decoder receives only a **fixed-size context vector**.


### Q4. What mechanism allows a model to focus on specific parts of an input sequence? 
1. Dropout 
2. Pooling 
3. Attention 
4. Activation function

The correct answer is:

**3. Attention**

The **attention mechanism** allows a model to focus on the most relevant parts of the input sequence at each step of decoding, rather than relying only on a single fixed-size context vector.

This is especially useful in tasks like **machine translation**, where different words in the input sequence may be more important at different stages of generating the output.
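As a rough illustration of the idea above, the sketch below computes attention weights for a single decoding step using dot-product scoring. The vectors are made-up toy values and the helper names `softmax` and `attend` are hypothetical; a trained model would learn its own states.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One decoding step: score every encoder state, then average them."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Hypothetical toy values: three encoder states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)  # the first and third states receive most of the weight
print(context)  # weighted average of the encoder states
```

The context vector is recomputed at every decoding step, so the decoder can weight the input differently each time.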


### Q5. Which attention mechanism is also known as dot-product attention? 
1. Additive attention 
2. Multiplicative attention 
3. Self-attention 
4. Contextual attention

The correct answer is:

**2. Multiplicative attention**

* **Additive attention** → uses a small feedforward network with learned weights to compute alignment scores.
* **Multiplicative attention (dot-product attention)** → computes alignment scores by taking the **dot product** between the encoder hidden state and the decoder hidden state.
* **Self-attention** → applies attention within the same sequence (used in Transformers).

So, **dot-product attention = multiplicative attention**.
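To make the contrast concrete, here is a minimal sketch of the two scoring functions. The weight matrices `W_q`, `W_k` and the vector `v` are hypothetical stand-ins for learned parameters (set to simple fixed values here), and the function names are chosen for illustration only.

```python
import math

def dot_score(query, key):
    """Multiplicative (dot-product) score: q . k"""
    return sum(q * k for q, k in zip(query, key))

def additive_score(query, key, W_q, W_k, v):
    """Additive score: v . tanh(W_q q + W_k k)"""
    hidden = []
    for row_q, row_k in zip(W_q, W_k):
        pre = sum(w * q for w, q in zip(row_q, query)) + sum(w * k for w, k in zip(row_k, key))
        hidden.append(math.tanh(pre))
    return sum(v_i * h_i for v_i, h_i in zip(v, hidden))

# Assumed toy inputs; identity matrices stand in for learned weights.
query = [1.0, 2.0]
key = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

mult = dot_score(query, key)
add = additive_score(query, key, identity, identity, v)
print(mult, add)
```

The dot-product form needs no extra parameters and maps directly onto matrix multiplication, which is one reason it is the cheaper of the two.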


### Q6. What is the key benefit of using attention mechanisms in Seq2Seq models? 
1. Reducing model complexity 
2. Improving long-range dependency capture 
3. Enhancing model regularization 
4. Accelerating model training

The correct answer is:

**2. Improving long-range dependency capture**

Attention mechanisms allow the decoder to **directly access relevant parts of the input sequence** instead of relying only on a single fixed-size context vector. This helps the model better handle **long sequences** and capture **long-range dependencies**, which is a major limitation of basic Seq2Seq models without attention.


### Q7. What does self-attention compute? 
1. Attention scores between different sequences 
2. Attention scores between different positions in the same sequence 
3. Loss function 
4. Gradient descent steps

The correct answer is:

**2. Attention scores between different positions in the same sequence**

**Self-attention** (used in Transformers) lets each token in a sequence look at and weigh the importance of **every token in that same sequence**, including itself.

* Example: In the sentence *“The cat sat on the mat”*, the word **“cat”** can attend to **“the”** and **“mat”** to understand its context.

This is what makes Transformers powerful for capturing **contextual relationships** within sequences.
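A small sketch of this computation follows. It assumes made-up 2-dimensional embeddings for three tokens and, for brevity, uses the embeddings directly as queries, keys, and values; a real Transformer would first apply learned projections.

```python
import math

def softmax(scores):
    peak = max(scores)  # for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Each row of X acts as query, key, and value (no learned projections)."""
    d = len(X[0])
    all_weights, outputs = [], []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        all_weights.append(w)
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, X)) for i in range(d)])
    return all_weights, outputs

# Hypothetical embeddings for the tokens "The", "cat", "sat".
X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
attn, out = self_attention(X)
for row in attn:
    print(row)  # one row of weights per token, over all tokens in the sequence
```

The result is a square matrix of weights: row *i* shows how strongly token *i* attends to each position of the same sequence.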


### Q8. Which model architecture first introduced the self-attention mechanism? 
1. RNN 
2. LSTM 
3. Transformer 
4. GRU

The correct answer is:

**3. Transformer**

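The **Transformer architecture** (introduced in the paper *“Attention Is All You Need”*, 2017) was the first sequence model to rely **entirely on self-attention**, completely removing recurrence (RNNs, LSTMs, GRUs). Self-attention (also called intra-attention) had been used earlier alongside recurrent networks, so among the listed options the Transformer is the architecture built around it.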

This innovation made Transformers highly effective for **parallelization, long-range dependency handling, and large-scale training**.


### Q9. What is the primary function of the decoder in a Seq2Seq model? 
1. To classify input data 
2. To generate an output sequence from a context vector 
3. To reduce dimensionality 
4. To preprocess input data

The correct answer is:

**2. To generate an output sequence from a context vector**

In a **Seq2Seq model**:

* The **encoder** processes the input sequence and encodes it into a **context vector (or hidden states with attention)**.
* The **decoder** takes this context and **generates the output sequence step by step**, often predicting the next token based on the context and previously generated tokens.


### Q10. In Transformer models, what is the purpose of positional encoding? 
1. To add non-linearity 
2. To encode the position of tokens in the sequence 
3. To increase the model's depth 
4. To perform data augmentation


The correct answer is:

**2. To encode the position of tokens in the sequence**

Since **Transformers** rely only on **self-attention** (which has no built-in notion of order: permuting the input tokens simply permutes the outputs), they have no inherent sense of word order.
**Positional encoding** injects information about the **position of each token** in the sequence (e.g., first word, second word, etc.), allowing the model to understand sequence structure.

Without positional encoding, a Transformer would treat a sentence like *“dog bites man”* the same as *“man bites dog”*.


### Q11. Which of the following is a characteristic feature of the LSTM architecture? 
1. Linear activation function 
2. Gating mechanisms 
3. Dropout layers 
4. Positional encoding

The correct answer is:

**2. Gating mechanisms**

LSTMs (Long Short-Term Memory networks) use **input, forget, and output gates** to regulate the flow of information.

* These gates help the model **retain long-term dependencies** and reduce the **vanishing gradient problem** common in standard RNNs.

Other options (like dropout or positional encoding) can be added, but they are **not defining features** of LSTMs.


### Q12. How does the forget gate in an LSTM cell contribute to its function? 
1. It decides which parts of the current input to ignore 
2. It decides which parts of the cell state to erase 
3. It normalizes the input sequence 
4. It adds noise to the training process

The correct answer is:

**2. It decides which parts of the cell state to erase**

In an **LSTM cell**:

* The **forget gate** controls which information from the **previous cell state** should be **discarded (erased)** and which should be kept.
* This allows the LSTM to **remove irrelevant information** and retain only useful long-term dependencies.

### Q13. What is the main advantage of using a Bidirectional LSTM (Bi-LSTM)? 
1. Faster training speed 
2. Capturing information from both past and future contexts 
3. Reducing overfitting 
4. Increasing model regularization

The correct answer is:

**2. Capturing information from both past and future contexts**

A **Bidirectional LSTM (Bi-LSTM)** processes the sequence in **both directions**:

* Forward LSTM → captures **past context** (left to right).
* Backward LSTM → captures **future context** (right to left).

This is especially useful in tasks like **speech recognition, text classification, and machine translation**, where context from both directions improves understanding.


### Q14. Which of the following is NOT a type of attention mechanism? 
1. Additive attention 
2. Multiplicative attention 
3. Recursive attention 
4. Self-attention

The correct answer is:

**3. Recursive attention**

* **Additive attention** → Uses a feedforward network to compute alignment scores.
* **Multiplicative (dot-product) attention** → Uses dot products to compute similarity scores.
* **Self-attention** → Computes attention within the same sequence (core of Transformers).

**Recursive attention** is **not a standard attention mechanism** in Seq2Seq or Transformer models.


### Q15. How does the update gate in a GRU differ from an LSTM's input gate? 
1. It does not exist 
2. It combines the functionality of the forget and input gates 
3. It only controls the cell state 
4. It reduces overfitting in GRU models

The correct answer is:

**2. It combines the functionality of the forget and input gates**

* In **LSTMs**, we have **separate forget and input gates**:

  * Forget gate → decides what to erase from the cell state.
  * Input gate → decides what new information to add.

* In **GRUs**, the **update gate** merges these two roles into one:

  * It controls both how much of the **past information** to keep and how much of the **new input** to add.

This makes **GRUs simpler** (fewer parameters) compared to LSTMs, while still capturing dependencies effectively.
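The following scalar sketch shows how a single update gate trades off old and new information. All parameter names (`wz`, `bz`, and so on) are hypothetical, and the interpolation uses the convention in which `z` weights the new candidate; some references and libraries swap the roles of `z` and `1 - z`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell (1-d input and hidden state)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])  # candidate
    # A single gate decides both how much to keep and how much to add.
    h = (1.0 - z) * h_prev + z * h_tilde
    return h, z

# Assumed toy parameters: everything zero except the candidate input weight.
base = dict.fromkeys(["wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh"], 0.0)
base["wh"] = 1.0

keep_params = dict(base, bz=-10.0)   # z near 0: keep the old state
write_params = dict(base, bz=10.0)   # z near 1: overwrite with the candidate

h_keep, _ = gru_step(1.0, -0.5, keep_params)
h_write, _ = gru_step(1.0, -0.5, write_params)
print(h_keep, h_write)
```

Because keeping and adding are tied together through `z` and `1 - z`, the GRU needs one gate where the LSTM uses two.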


### Q16. What role do residual connections play in Transformer models? 
1. They add noise to the model to prevent overfitting 
2. They improve gradient flow and model stability 
3. They reduce the model's complexity 
4. They increase the attention scores

The correct answer is:

**2. They improve gradient flow and model stability**

In **Transformer models**, **residual connections (skip connections)**:

* Allow gradients to flow more easily through deep networks → reducing the **vanishing gradient problem**.
* Help preserve the original input signal by adding it back after each sub-layer (attention or feedforward).
* Improve **training stability** and enable stacking of many layers without performance degradation.

That’s why residual connections are a **key ingredient** in deep architectures like Transformers, ResNets, etc.
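The gradient-flow argument can be checked with a toy calculation. The sketch below assumes a made-up scalar sub-layer `f(x) = 0.1 * tanh(x)` and compares the gradient through 20 stacked layers with and without the skip connection; it is only an illustration, not a Transformer layer.

```python
import math

def sublayer(x, scale=0.1):
    """Hypothetical sub-layer with a small slope, so plain stacking shrinks gradients."""
    return scale * math.tanh(x)

def sublayer_grad(x, scale=0.1):
    """Derivative of sublayer with respect to x."""
    return scale * (1.0 - math.tanh(x) ** 2)

def forward_and_grad(x, depth, residual):
    """Return the output and d(output)/d(input) via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        local = sublayer_grad(x)
        if residual:
            grad *= 1.0 + local      # derivative of x + f(x)
            x = x + sublayer(x)
        else:
            grad *= local            # derivative of f(x) alone
            x = sublayer(x)
    return x, grad

_, grad_plain = forward_and_grad(0.5, depth=20, residual=False)
_, grad_res = forward_and_grad(0.5, depth=20, residual=True)
print(grad_plain)  # vanishingly small
print(grad_res)    # stays at or above 1
```

The `1.0 +` term contributed by the skip path is what keeps the product of per-layer derivatives from collapsing toward zero.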


### Q17. Which of the following statements about beam search is correct? 
1. It is used to train Seq2Seq models 
2. It is a decoding algorithm that considers multiple sequences 
3. It accelerates the training process 
4. It reduces the number of parameters in the model

The correct answer is:

**2. It is a decoding algorithm that considers multiple sequences**

* **Beam search** is used during **decoding** (not training) in Seq2Seq and Transformer models.
* Instead of picking only the single most probable token at each step (like greedy search), it keeps track of the **top-k most likely partial sequences**, where k is the **beam width**.
* This helps generate more accurate and meaningful sequences in tasks like **machine translation, text summarization, and speech recognition**.
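The difference from greedy decoding can be seen on a tiny example. The sketch below uses a hypothetical lookup table `TOY_MODEL` in place of a real decoder and decodes a fixed number of steps (no end-of-sequence handling), so it is a simplification of what production decoders do.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, beam_width, steps):
    """Keep the beam_width best prefixes (by log-probability) at each step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, logp in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

greedy = beam_search(TOY_MODEL, beam_width=1, steps=2)[0]  # width 1 is greedy search
best = beam_search(TOY_MODEL, beam_width=2, steps=2)[0]
print(greedy[0], math.exp(greedy[1]))
print(best[0], math.exp(best[1]))
```

With a beam width of 1 the search commits to `"a"` and cannot recover; with a width of 2 it keeps `"b"` alive long enough to find the more probable full sequence.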


### Q18. What is the main reason for scaling the dot-product in Scaled Dot-product Attention? 
1. To normalize the attention weights 
2. To prevent the softmax function from producing extremely small gradients 
3. To increase the model's capacity 
4. To add non-linearity to the model

The correct answer is:

**2. To prevent the softmax function from producing extremely small gradients**

In **Scaled Dot-Product Attention**:

* The dot product between **query (Q)** and **key (K)** can grow large when the vector dimension (**dₖ**) is high.
* Large values push the **softmax** into saturated regions where its gradients are extremely small → making training slow and unstable.
* To fix this, the dot product is scaled by **1 / √dₖ**, which keeps values in a reasonable range and stabilizes learning.
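The effect of the scaling factor can be observed numerically. The sketch below draws random query and key vectors (an assumption made purely for illustration, with an arbitrary seed and `d_k = 256`) and compares how peaked the softmax is with and without dividing by √dₖ.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # arbitrary seed, only for repeatability
d_k = 256
query = [random.gauss(0.0, 1.0) for _ in range(d_k)]
keys = [[random.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(8)]

raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

peak_raw = max(softmax(raw))        # largest attention weight, unscaled
peak_scaled = max(softmax(scaled))  # largest attention weight, scaled
print(peak_raw, peak_scaled)
```

When one weight sits close to 1 and the rest close to 0, the softmax is saturated and passes back almost no gradient; scaling flattens the distribution and avoids that regime.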


### Q19. In a Transformer, what does the term "multi-head attention" refer to? 
1. Using multiple sequences for attention
2. Parallel attention mechanisms with different parameters 
3. Attention applied to multiple layers 
4. Attention applied across different models

The correct answer is:

**2. Parallel attention mechanisms with different parameters** 

In **multi-head attention**:

* The input is projected into multiple **query, key, and value** spaces (heads).
* Each head learns to focus on **different aspects of the sequence** (e.g., short-range vs. long-range dependencies).
* The outputs of all heads are concatenated and combined.

This allows the Transformer to capture **richer relationships** than a single attention mechanism.
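The split-attend-concatenate pattern can be sketched as follows. To stay short, this version slices the embedding into per-head chunks instead of applying learned query, key, and value projections, and it omits the final output projection; both simplifications are assumptions made for this example.

```python
import math

def softmax(scores):
    peak = max(scores)  # for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w_j * v[i] for w_j, v in zip(w, V)) for i in range(len(V[0]))])
    return out

def multi_head_self_attention(X, num_heads):
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        X_h = [row[lo:hi] for row in X]  # stand-in for learned per-head projections
        head_outputs.append(attention(X_h, X_h, X_h))
    # Concatenate the head outputs for each token.
    combined = []
    for t in range(len(X)):
        row = []
        for head in head_outputs:
            row.extend(head[t])
        combined.append(row)
    return combined

# Hypothetical 4-dimensional embeddings for three tokens.
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(X, num_heads=2)
print(out)
```

Each head computes its own attention weights on its own slice, so the heads can disagree about which tokens matter.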


### Q20. Which technique is commonly used in Seq2Seq models to handle variable-length output sequences? 
1. Padding 
2. Masking 
3. Beam search 
4. Regularization

The correct answer is:

**2. Masking**

In **Seq2Seq models** (especially with attention or Transformers):

* **Masking** is used so the model knows **which positions are valid** and which are just padding.
* This prevents the model from attending to **padded tokens** and keeps padded positions out of the loss, so sequences of different lengths can share a batch. Generation itself ends when the decoder emits an **end-of-sequence token**.

**Padding** is applied to inputs for batching, but **masking** is the key technique that tells the model to **ignore padded parts** during training and inference.
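One common way to apply such a mask is to set the scores of padded positions to negative infinity before the softmax. The sketch below assumes a boolean mask in which `True` marks a real token, and it assumes at least one position is valid; the function name `masked_softmax` is chosen for illustration.

```python
import math

def masked_softmax(scores, mask):
    """Softmax that gives zero weight to positions where mask is False.

    Assumes at least one position is unmasked.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a sequence of length 2 padded to length 4.
scores = [2.0, 1.0, 0.5, 3.0]
mask = [True, True, False, False]
weights = masked_softmax(scores, mask)
print(weights)  # the last two positions receive exactly zero weight
```

Note that the padded position with the highest raw score (3.0) still gets no weight, which is the point of masking.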


### Q21. What is the primary benefit of using the Transformer architecture over traditional RNNs? 
1. Faster computation and parallelization 
2. Fewer parameters 
3. Simpler implementation 
4. Better performance on short sequences

The correct answer is:

**1. Faster computation and parallelization**

The **Transformer architecture** eliminates recurrence (RNNs, LSTMs, GRUs) and instead uses **self-attention**, which:

* Processes all tokens in a sequence **in parallel** (RNNs process step by step).
* Enables much **faster training on GPUs/TPUs**.
* Handles **long-range dependencies** more effectively.

This parallelism is the **key advantage** that made Transformers dominant in NLP and beyond.

### Q22. Which gate in the LSTM cell directly controls how much of the previous memory should be retained? 
1. Input gate 
2. Output gate 
3. Forget gate 
4. Memory gate

The correct answer is:

**3. Forget gate** 

In an **LSTM cell**:

* **Forget gate** → decides how much of the **previous cell state (memory)** should be **retained or discarded**.
* **Input gate** → controls how much new information to add.
* **Output gate** → controls how much of the cell state is exposed as the hidden state.

Thus, the **forget gate** is the one that **directly manages retention of past memory**.
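A scalar sketch of one LSTM step makes the three roles visible. The parameter names (`wf`, `bf`, and so on) are hypothetical, and all weights are set to zero except the forget-gate bias so that its effect on the cell state can be isolated.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (1-d input, hidden state, and cell state)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    c = f * c_prev + i * g   # keep part of the old memory, add new information
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
base = dict.fromkeys(names, 0.0)

# Assumed toy settings: a large positive or negative forget-gate bias.
_, c_keep = lstm_step(0.0, 0.0, 2.0, dict(base, bf=10.0))    # f near 1
_, c_erase = lstm_step(0.0, 0.0, 2.0, dict(base, bf=-10.0))  # f near 0
print(c_keep, c_erase)
```

With the forget gate near 1 the previous cell state of 2.0 survives almost unchanged; with the gate near 0 it is wiped out.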


### Q23. In the Transformer model, what is the significance of using layer normalization? 
1. It normalizes the input data 
2. It accelerates model convergence by stabilizing activations 
3. It adds non-linearity to the model 
4. It reduces the model size

The correct answer is:

**2. It accelerates model convergence by stabilizing activations**

In the original **Transformer**, **layer normalization** is applied after each sub-layer (attention and feed-forward networks), following the residual addition, to:

* Stabilize activations across features.
* Prevent exploding/vanishing values.
* Make training more stable and **speed up convergence**.

Unlike batch normalization, **layer norm works better with variable sequence lengths** and is well-suited for NLP tasks.
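For reference, layer normalization over a single token's feature vector can be sketched as below. The `gamma` and `beta` arguments stand for the learned scale and shift (defaulting to 1 and 0 here), and `eps` is a small constant assumed for numerical safety.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

# Hypothetical activations for a single token.
token = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(token)
out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
print(normed, out_mean, out_var)
```

The statistics are computed per token across its features, so the result does not depend on the other sequences in the batch or on their lengths.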


### Q24. Which of the following is true about the Additive Attention mechanism? 
1. It is more computationally efficient than multiplicative attention 
2. It involves a linear transformation of the query and key vectors 
3. It is used exclusively in RNNs 
4. It uses the softmax function to compute attention weights

The correct answer is:

**4. It uses the softmax function to compute attention weights**

About **Additive Attention (Bahdanau Attention)**:

* It applies **linear transformations** on the query and key vectors, then combines them with a feedforward network to produce alignment scores.
* These scores are passed through a **softmax** to obtain attention weights.
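* Compared to multiplicative (dot-product) attention, it is **less computationally efficient**, since dot products map directly onto optimized matrix multiplication. The two perform similarly for small key dimensions, while additive attention can outperform **unscaled** dot-product attention for large ones.

So, statement **4** is the expected answer. Note that statement 2 also matches the description above; the quiz keys only statement 4.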


### Q25. How does the Transformer model capture positional information in sequences? 
1. Through recurrent connections 
2. Using positional encodings added to the input embeddings 
3. By using special tokens for start and end of sequence 
4. By using convolutional layers

The correct answer is:

**2. Using positional encodings added to the input embeddings**

Since the **Transformer** has **no recurrence or convolution**, it doesn’t inherently know the order of tokens.

To capture **sequence order**, it adds **positional encodings** (sine & cosine functions or learned embeddings) to the input embeddings.

This allows the model to understand token positions like **first, second, third**, etc., within the sequence.
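The sinusoidal variant can be sketched in a few lines. This follows the sine/cosine formulation with base 10000 from the original Transformer paper; the embedding values and the small `d_model` are made up for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=3, d_model=4)

# Hypothetical case: the same token embedding appears at every position.
embedding = [0.5, 0.5, 0.5, 0.5]
inputs = [[e + p for e, p in zip(embedding, row)] for row in pe]
for row in inputs:
    print(row)  # identical embeddings now differ by position
```

After the addition, the same word at two different positions produces two different input vectors, which is what lets attention take order into account.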


### Q26. Which problem does the GRU architecture solve that is often encountered in vanilla RNNs? 
1. Overfitting 
2. Vanishing gradient problem 
3. Insufficient model capacity 
4. High computational cost

The correct answer is:

**2. Vanishing gradient problem**

* **Vanilla RNNs** struggle with **long sequences** because gradients shrink during backpropagation, making it hard to learn long-term dependencies.
* **GRUs (Gated Recurrent Units)** use **update and reset gates** to regulate information flow, which helps preserve gradients and capture long-range dependencies.

Thus, like LSTMs, GRUs were designed to **mitigate the vanishing gradient problem**.


### Q27. What is the role of the output gate in an LSTM cell? 
1. It determines how much of the cell state should be output as the hidden state 
2. It decides how much information should be forgotten 
3. It updates the memory cell 
4. It normalizes the output sequence

The correct answer is:

**1. It determines how much of the cell state should be output as the hidden state**

In an **LSTM cell**:

* **Forget gate** → decides what to erase from memory.
* **Input gate** → decides what new information to add.
* **Output gate** → decides how much of the **cell state** is revealed as the **hidden state (hₜ)** for the next time step and for predictions.

So the **output gate controls exposure of memory to the outside**.


### Q28. Why is the Transformer architecture considered more scalable than RNNs for long sequences? 
1. It uses fewer parameters 
2. It processes sequences in parallel rather than sequentially 
3. It has a simpler implementation 
4. It is specifically designed for short sequences

The correct answer is:

**2. It processes sequences in parallel rather than sequentially**

* **RNNs/LSTMs/GRUs** process input **step by step**, which makes them **slow** and hard to scale for long sequences.
* **Transformers** use **self-attention**, allowing **all tokens to be processed simultaneously** → enabling **parallelization on GPUs/TPUs**.
* This parallelism makes Transformers **far more scalable** to train on **long sequences** than RNNs, even though the cost of self-attention itself grows quadratically with sequence length.


### Q29. Which of the following does the "query" vector represent in the context of attention mechanisms? 
1. The context vector for the entire sequence 
2. A vector that is compared against keys to compute attention scores 
3. The input to the encoder 
4. The output of the decoder

The correct answer is:

**2. A vector that is compared against keys to compute attention scores** 

In **attention mechanisms**:

* **Query (Q)** → represents what we are looking for (e.g., the current decoder state).
* **Key (K)** → represents the attributes of the input sequence elements.
* **Value (V)** → contains the actual information to be aggregated.

The **query** is matched against the **keys** to produce **attention scores**, which are then used to weight the **values**.


### Q30. In a Transformer model, how does self-attention differ from cross-attention? 
1. Self-attention only considers the query sequence itself, while cross-attention involves both the query and a separate context
2. Self-attention is used only during inference, while cross-attention is used during training
3. Self-attention computes attention between different models, while cross-attention is within a single model
4. Self-attention involves multiple sequences, while cross-attention is limited to a single sequence

The correct answer is:

**1. Self-attention only considers the query sequence itself, while cross-attention involves both the query and a separate context**

* **Self-attention** → Query (Q), Key (K), and Value (V) come from the **same sequence**. Example: encoder self-attention lets each token attend to all tokens in the input.
* **Cross-attention** → Q comes from the **decoder**, while K and V come from the **encoder output**. Example: in machine translation, the decoder attends to the encoder’s representation of the source sentence.

This is how the Transformer decoder connects input (source) and output (target) sequences.
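In code, the only difference between the two is where the arguments come from. The sketch below reuses one attention function for both cases; the encoder and decoder states are made-up toy vectors, and learned projections are omitted for brevity.

```python
import math

def softmax(scores):
    peak = max(scores)  # for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: one output vector per query."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w_j * v[i] for w_j, v in zip(w, V)) for i in range(len(V[0]))])
    return out

# Hypothetical states: three source tokens, two target tokens generated so far.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.5, 0.5], [1.0, -1.0]]

# Self-attention: Q, K, and V all come from the same sequence.
self_out = attention(encoder_states, encoder_states, encoder_states)

# Cross-attention: Q from the decoder, K and V from the encoder output.
cross_out = attention(decoder_states, encoder_states, encoder_states)
print(len(self_out), len(cross_out))
```

The output always has one vector per query, so cross-attention yields one source-aware vector for each target position.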
