
## **Steps in Self-Attention Layer**

1. **Input Representation**

   * Start with input embeddings $X \in \mathbb{R}^{n \times d_{model}}$, where:

     * $n$ = sequence length
     * $d_{model}$ = embedding dimension

2. **Linear Transformations**

   * Project the input into **Query (Q)**, **Key (K)**, and **Value (V)** matrices, one row per token, using learnable weight matrices:

     $$
     Q = XW_Q,\quad K = XW_K,\quad V = XW_V
     $$

     where $W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_k}$. In general $W_V$ may map to a separate dimension $d_v$; these steps take $d_v = d_k$.

3. **Similarity Scores (Dot Product)**

   * Compute similarity between queries and keys:

     $$
     \text{scores} = QK^T
     $$

     This results in an $n \times n$ matrix whose entry $(i, j)$ measures how relevant token $j$ is to token $i$.

4. **Scaling**

   * Divide the scores by $\sqrt{d_k}$ so their magnitude does not grow with $d_k$. Large scores saturate the softmax, which yields very small gradients:

     $$
     \text{scaled\_scores} = \frac{\text{scores}}{\sqrt{d_k}}
     $$

5. **Softmax Normalization**

   * Apply softmax to convert scaled scores into attention weights:

     $$
     \alpha_{ij} = \frac{\exp(\text{scaled\_scores}_{ij})}{\sum_{k=1}^n \exp(\text{scaled\_scores}_{ik})}
     $$

     Softmax is applied row-wise, so the weights in each row (one row per query) are non-negative and sum to 1.

6. **Weighted Sum of Values**

   * Multiply the attention weights by the value matrix, so that each output row is a weighted average of the value vectors:

     $$
     \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
     $$

7. **Output**

   * The result is an $n \times d_k$ matrix: one context-aware representation per token, capturing relationships across the sequence.
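
The seven steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `transpose`, `softmax_rows`, `self_attention`) and the toy values for $X$, $W_Q$, $W_K$, $W_V$ are assumptions chosen for readability. In practice these operations would normally be done with an array library.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, W_Q, W_K, W_V):
    # Step 2: linear projections.
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(Q[0])
    # Step 3: n x n similarity scores.
    scores = matmul(Q, transpose(K))
    # Step 4: scale by sqrt(d_k).
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 5: attention weights, one row per query.
    weights = softmax_rows(scaled)
    # Step 6: weighted sum of values, shape n x d_k.
    output = matmul(weights, V)
    return output, weights

# Toy input (hypothetical values): n = 3 tokens, d_model = 4, d_k = 2.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

output, weights = self_attention(X, W_Q, W_K, W_V)
print(len(output), len(output[0]))               # rows and columns of the output
print([round(sum(row), 6) for row in weights])   # each row of weights sums to 1
```

The function returns the attention weights alongside the output so that the $n \times n$ weight matrix from step 5 can be inspected directly.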





| Question                                                                          | Answer                                                                                                                                                                 |
| --------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Q1. What is the main purpose of the self-attention mechanism in Transformers?** | To allow each token in the input to focus on other tokens, capturing context and dependencies regardless of their position in the sequence.                            |
| **Q2. Why do we scale the dot product by √d\_k?**                                 | Dot products grow in magnitude as the dimensionality d\_k increases. Large scores saturate the softmax, which produces very small gradients and slows training; dividing by √d\_k keeps the scores in a moderate range. |
| **Q3. Difference between Self-Attention and Cross-Attention?**                    | Self-Attention uses Q, K, V from the same sequence, while Cross-Attention uses Q from one sequence (e.g., decoder) and K, V from another (e.g., encoder).              |
| **Q4. How does Multi-Head Attention relate to Self-Attention?**                   | Multi-Head Attention performs multiple self-attentions in parallel with different learned projections, allowing the model to capture different types of relationships. |
| **Q5. Computational complexity of Self-Attention?**                               | O(n²·d), where n is the sequence length and d is the embedding dimension, because QKᵀ computes a score for every pair of tokens.                                       |

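The answer to Q2 can be checked numerically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not something the table states); under that assumption the dot product of two $d_k$-dimensional vectors has variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1. The function name `dot_product_variance` and its parameters are hypothetical.

```python
import math
import random
import statistics

def dot_product_variance(d_k, trials=5000, seed=0):
    """Estimate the variance of q.k before and after scaling by sqrt(d_k).

    Assumes each component of q and k is drawn independently from N(0, 1).
    """
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([x / math.sqrt(d_k) for x in dots])
    return raw, scaled

for d_k in (4, 64):
    raw, scaled = dot_product_variance(d_k)
    # raw should be close to d_k; scaled should be close to 1.
    print(d_k, round(raw, 2), round(scaled, 2))
```

Without scaling, the spread of the scores grows with $d_k$, so the softmax is more likely to put nearly all its weight on a single key.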