# Self-attention Mechanism
- Self-attention lets a model *focus on different parts of a single input sequence* in order to capture the relationships and dependencies between elements of that sequence.

## Key Idea:
- Imagine reading "The cat sat on the mat." To understand "sat," you might pay attention to "cat" (what sat?) and "mat" (where did it sit?).
- Self-attention allows the model to automatically make these connections by dynamically focusing on the most relevant parts of the sentence.
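
The focusing described above is usually written as scaled dot-product attention. As a sketch in the notation of the code cells in this notebook, with $X$ standing for the embedding matrix (a symbol introduced here, not in the original):

$$
\begin{aligned}
Q &= X W_q, \quad K = X W_k, \quad V = X W_v \\
\text{Attention}(Q, K, V) &= \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
$$

The softmax is applied to each row separately, so row $i$ holds the weights that token $i$ assigns to every token in the sequence, and $d_k$ is the dimension of the key vectors.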

## Advantages:
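- Captures relationships between any pair of tokens directly, regardless of how far apart they are in the sequence.
- Processes all tokens in parallel (unlike RNNs, which step through the sequence one token at a time), which makes it efficient on parallel hardware. The mechanism itself is order independent, so word order has to be supplied separately (for example with positional encodings), and the score matrix grows quadratically with sequence length.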
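
One way to see the order-independence claim: with no positional information, shuffling the input tokens just shuffles the output rows in the same way. A minimal sketch, assuming random stand-in weights, a fixed seed, and a hypothetical helper named `attend` (none of which are part of the original notebook):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: seeded only so the sketch is repeatable

def attend(x, W_q, W_k, W_v):
    """Hypothetical helper: single-head scaled dot-product self-attention."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return weights @ V

x = torch.randn(4, 3)  # 4 tokens with 3-dimensional embeddings
W_q, W_k, W_v = (torch.randn(3, 3) for _ in range(3))

perm = torch.tensor([2, 0, 3, 1])  # an arbitrary reordering of the tokens
out = attend(x, W_q, W_k, W_v)
out_perm = attend(x[perm], W_q, W_k, W_v)

# Permuting the inputs permutes the outputs identically (permutation equivariance).
same = torch.allclose(out[perm], out_perm, atol=1e-4)
print(same)
```

This is why models built on self-attention add positional information to the embeddings when word order matters.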

<img title="Self-Attention Mechanism" alt="Demonstrating how Self-Attention Mechanism works" src="https://i.ibb.co/R6MGmRy/1.png" />

<hr />

<img title="Self-Attention Mechanism" alt="Demonstrating how Self-Attention Mechanism works" src="https://i.ibb.co/P5qx6WG/2.png" />

In [35]:
import numpy as np
import torch
import torch.nn.functional as F

In [36]:
# Example: Word embeddings for a sequence of 3 words
embeddings = torch.tensor([[1.0, 0.0, 1.0],  # Word 1
                           [0.0, 1.0, 1.0],  # Word 2
                           [1.0, 1.0, 0.0]]) # Word 3

In [37]:
# Weight matrices for Query, Key, and Value. In a trained model these are learned;
# here they are random stand-ins (no seed is set, so results differ from run to run)
W_q = torch.randn((3, 3))
W_k = torch.randn((3, 3))
W_v = torch.randn((3, 3))

In [38]:
# Transform input embeddings into Q, K, V
Q = embeddings @ W_q
K = embeddings @ W_k
V = embeddings @ W_v
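
In a trained model these projections are learned parameters rather than fixed random matrices. A minimal sketch, assuming bias-free `torch.nn.Linear` layers and the hypothetical names `q_proj`, `k_proj`, and `v_proj` (none of which appear in the original notebook):

```python
import torch
import torch.nn as nn

embeddings = torch.tensor([[1.0, 0.0, 1.0],
                           [0.0, 1.0, 1.0],
                           [1.0, 1.0, 0.0]])

d_model = embeddings.shape[-1]  # 3 in this toy example

# Hypothetical layer names; each layer holds a trainable (d_model x d_model) weight.
q_proj = nn.Linear(d_model, d_model, bias=False)
k_proj = nn.Linear(d_model, d_model, bias=False)
v_proj = nn.Linear(d_model, d_model, bias=False)

# nn.Linear computes x @ weight.T, so weight.T plays the role of W_q, W_k, W_v.
Q, K, V = q_proj(embeddings), k_proj(embeddings), v_proj(embeddings)
print(Q.shape, K.shape, V.shape)
```

Because the weights are registered as parameters, an optimizer can update them during training, which is what "learned" means here.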

In [39]:
# Compute dot product between Q and K (transpose K for compatibility)
scores = Q @ K.T  # Shape: (3, 3) -> Attention scores for each word pair

In [40]:
# Scale scores by sqrt(d_k) so that large dot products do not saturate the softmax
d_k = K.shape[-1]  # Dimension of the key vectors
scaled_scores = scores / np.sqrt(d_k)
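
The division by the square root of `d_k` matters because softmax puts nearly all of its weight on the largest score once the scores become large in magnitude. A small sketch in plain Python, using made-up scores, an assumed key dimension of 64, and a hypothetical `softmax` helper for illustration:

```python
import math

def softmax(xs):
    """Hypothetical helper: numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                 # assumed key dimension, for illustration only
raw = [8.0, 16.0, 24.0]  # made-up unscaled dot products
scaled = [s / math.sqrt(d_k) for s in raw]  # becomes [1.0, 2.0, 3.0]

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)

print([round(w, 4) for w in unscaled_weights])
print([round(w, 4) for w in scaled_weights])
```

With these numbers the unscaled softmax gives the largest score a weight of about 0.9997, while the scaled version gives it about 0.67, leaving meaningful weight on the other tokens.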

In [41]:
# Apply softmax to get attention weights
attention_weights = F.softmax(scaled_scores, dim=-1)

In [42]:
# Compute the final output as weighted sum of values
output = attention_weights @ V

In [43]:
# Display the outputs
print("Attention Weights:\n", attention_weights)
print("\nSelf-Attention Output:\n", output)

Attention Weights:
 tensor([[0.1324, 0.1646, 0.7029],
        [0.0565, 0.0028, 0.9407],
        [0.3616, 0.0502, 0.5882]])

Self-Attention Output:
 tensor([[ 0.7576, -1.8594,  0.2935],
        [ 1.2198, -1.8325,  0.3260],
        [ 0.6338, -1.7643,  0.4956]])
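
The steps above can be collected into a single function. A minimal sketch, assuming a hypothetical `self_attention` helper and a fixed seed (neither is in the original notebook, so its numbers will not match the output printed above):

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Hypothetical helper: single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model); each W_* has shape (d_model, d_k).
    Returns (output, attention_weights).
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k ** 0.5        # shape: (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)  # assumption: seeded only to make the sketch repeatable
embeddings = torch.tensor([[1.0, 0.0, 1.0],
                           [0.0, 1.0, 1.0],
                           [1.0, 1.0, 0.0]])
W_q, W_k, W_v = (torch.randn(3, 3) for _ in range(3))

output, attention_weights = self_attention(embeddings, W_q, W_k, W_v)
row_sums = attention_weights.sum(dim=-1)
print(row_sums)  # each entry should be close to 1.0
```

The same sanity check applies to the weights printed above: each row (for example 0.0565 + 0.0028 + 0.9407) adds up to 1 within rounding, because every output row is a weighted average of the value vectors.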
