# Chapter 9, Example 1

In [13]:
import torch
import torch.nn.functional as F

# Assume we have two words, x_1 and x_2, with the following embeddings
# Define the embeddings
x_1 = torch.tensor([1, 2, 3], dtype=torch.float32)
x_2 = torch.tensor([4, 5, 6], dtype=torch.float32)
embeddings = torch.stack([x_1, x_2])

# Define the projection matrices
W_Q = torch.tensor([[0.01, 0.03],
                    [0.02, 0.02],
                    [0.03, 0.01]], dtype=torch.float32)

W_K = torch.tensor([[0.05, 0.05],
                    [0.06, 0.05],
                    [0.07, 0.05]], dtype=torch.float32)

W_V = torch.tensor([[0.02, 0.02],
                    [0.01, 0.02],
                    [0.01, 0.01]], dtype=torch.float32)

# Compute the query, key, and value representations
Q = torch.matmul(embeddings, W_Q)
K = torch.matmul(embeddings, W_K)
V = torch.matmul(embeddings, W_V)

# Scaled Dot Product Attention

Scaled dot-product attention is a core component of the Transformer architecture. It scores each query against every key with a dot product, then divides the scores by the square root of the key dimension `d_k`. Without this scaling, large key dimensions produce large scores, the softmax saturates, and its gradients become very small.
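
In matrix form, the steps listed below correspond to the standard formulation of this mechanism, written here with the same `Q`, `K`, `V`, and `d_k` used in the code:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

One common way to motivate the $\sqrt{d_k}$ factor is a variance argument. It rests on an assumption that this example does not make, namely that the components of a query $q$ and a key $k$ are independent with mean 0 and variance 1:

$$
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k,
\qquad
\mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
$$

Under that assumption, dividing by $\sqrt{d_k}$ keeps the spread of the scores roughly constant as the key dimension grows.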

## Steps:

1. **Compute Dimension (`d_k`)**:
    - `d_k` represents the dimension of the key vectors.
    - It is read from the last dimension of the `W_K` tensor, which has shape `(3, 2)` here, so `d_k = 2`.

2. **Calculate Attention Scores (`attn_scores`)**:
    - The attention scores are the matrix product of the query matrix `Q` and the transpose of the key matrix `K`, so entry `(i, j)` is the dot product of query `i` with key `j`.
    - In general, `Q` and `K` have shape `(batch_size, sequence_length, d_k)`, where `d_k` is the dimension of the key (and query) vectors; `V` has the same leading dimensions but may use a different last dimension. In this example there is no batch dimension, and `Q`, `K`, and `V` each have shape `(2, 2)`.
    - `matmul(Q, K)` would contract the last dimension of `Q` (`d_k`) with the second-to-last dimension of `K` (`sequence_length`). That raises a shape error when the two sizes differ and silently computes the wrong quantity when they happen to match, as they do here. Transposing the last two dimensions with `K.transpose(-2, -1)` gives shape `(batch_size, d_k, sequence_length)`, so `matmul(Q, K.transpose(-2, -1))` contracts `d_k` with `d_k` and yields the desired shape `(batch_size, sequence_length, sequence_length)`.
    - *Why `-2` and `-1`?* In PyTorch, negative dimension indices count backward from the last dimension: `-1` is the last dimension and `-2` is the second-to-last. Using them keeps the code correct whether or not the tensors carry leading batch dimensions.
    - The scores are then divided by the square root of `d_k`. Dot products tend to grow in magnitude with `d_k`, and large scores push the softmax into regions where its gradients are very small; the scaling counteracts this.

3. **Compute Attention Weights (`attention_weights`)**:
    - The scaled attention scores are passed through a softmax function along the last dimension to produce the attention weights. This ensures that the weights are normalized and sum up to 1 for each query.

4. **Compute Attention Output (`attention_out`)**:
    - The attention output is the matrix product of the attention weights and the value matrix `V`. Each output row is a weighted sum of the value vectors, so values whose keys match the query more closely contribute more.
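
The shape reasoning in step 2 can be checked with a small sketch. The sizes below (`batch_size = 4`, `seq_len = 5`, `d_k = 8`) and the names `q_batch` and `k_batch` are hypothetical, chosen only so that every dimension is easy to tell apart; they are not part of the example in this chapter.

```python
import torch

# Hypothetical sizes, chosen only to make each dimension distinguishable.
batch_size, seq_len, d_k = 4, 5, 8

torch.manual_seed(0)  # reproducible random inputs
q_batch = torch.randn(batch_size, seq_len, d_k)
k_batch = torch.randn(batch_size, seq_len, d_k)

# Swap the last two dimensions of the keys: (4, 5, 8) -> (4, 8, 5).
k_t = k_batch.transpose(-2, -1)

# Batched matrix product: (4, 5, 8) @ (4, 8, 5) -> (4, 5, 5).
batch_scores = torch.matmul(q_batch, k_t) / (d_k ** 0.5)

# Softmax over the last dimension normalizes each query's row of scores.
batch_weights = torch.softmax(batch_scores, dim=-1)
row_sums = batch_weights.sum(dim=-1)

# The same negative indices also work without a batch dimension.
k_2d_t = k_batch[0].transpose(-2, -1)  # (5, 8) -> (8, 5)

print(k_t.shape)
print(batch_scores.shape)
print(k_2d_t.shape)
```

Because the transpose is written with `-2` and `-1`, the same line handles both the 3-D and the 2-D tensor without modification.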

The resulting `attention_out` tensor provides a context-aware representation of the input, emphasizing the most relevant parts of the input for each query.
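
The four steps can also be collected into one function. This is a minimal sketch, not a replacement for any library routine: the name `attention_sketch` is hypothetical, and it reads `d_k` from the last dimension of `K` rather than from `W_K` (the two are equal in this example). The inputs are the `Q`, `K`, and `V` values that this example produces from its embeddings.

```python
import torch
import torch.nn.functional as F


def attention_sketch(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)                                    # step 1: key dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # step 2
    weights = F.softmax(scores, dim=-1)                 # step 3: rows sum to 1
    return torch.matmul(weights, V), weights            # step 4: weighted sum


# Q, K, V as this example computes them from the two embeddings.
Q_ex = torch.tensor([[0.14, 0.10], [0.32, 0.28]])
K_ex = torch.tensor([[0.38, 0.30], [0.92, 0.75]])
V_ex = torch.tensor([[0.07, 0.09], [0.19, 0.24]])

out_ex, weights_ex = attention_sketch(Q_ex, K_ex, V_ex)
print(out_ex)
print(weights_ex)
```

Returning the weights alongside the output makes it easy to inspect how much each query attends to each key.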



In [15]:
# Compute the Scaled Dot Product Attention
d_k = W_K.size(-1)  # dimension of keys
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # scaling by sqrt(d_k)
attention_weights = F.softmax(attn_scores, dim=-1)
attention_out = torch.matmul(attention_weights, V)

In [18]:
print("Query:\n", Q)
print("Key:\n", K)
print("Key transposed:\n", K.transpose(-2, -1))
print("Value:\n", V)
print("Attention scores:\n", attn_scores)
print("Attention weights:\n", attention_weights)
print("Self Attention Output:\n", attention_out)

Query:
 tensor([[0.1400, 0.1000],
        [0.3200, 0.2800]])
Key:
 tensor([[0.3800, 0.3000],
        [0.9200, 0.7500]])
Key transposed:
 tensor([[0.3800, 0.9200],
        [0.3000, 0.7500]])
Value:
 tensor([[0.0700, 0.0900],
        [0.1900, 0.2400]])
Attention scores:
 tensor([[0.0588, 0.1441],
        [0.1454, 0.3567]])
Attention weights:
 tensor([[0.4787, 0.5213],
        [0.4474, 0.5526]])
Self Attention Output:
 tensor([[0.1326, 0.1682],
        [0.1363, 0.1729]])
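
As a sanity check, the first row of each printed result can be reproduced by hand. The sketch below uses only the Python standard library and the `Q`, `K`, and `V` values printed above; the variable names are illustrative.

```python
import math

# Values copied from the printed Q, K, and V tensors.
q1 = [0.14, 0.10]                       # query for the first word
keys = [[0.38, 0.30], [0.92, 0.75]]     # one key per word
values = [[0.07, 0.09], [0.19, 0.24]]   # one value per word
d_k = len(q1)

# Scaled dot product of q1 with each key.
scores = [sum(q * k for q, k in zip(q1, key)) / math.sqrt(d_k) for key in keys]

# Softmax over the two scores.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Weighted sum of the value vectors.
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(2)]

print([round(s, 4) for s in scores])
print([round(w, 4) for w in weights])
print([round(o, 4) for o in output])
```

The rounded values agree with the first rows of the attention scores, attention weights, and self-attention output printed above.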
