Q3: Scaled Dot-Product Attention

Task: Implement the scaled dot-product attention mechanism. Given matrices Q (Query), K (Key), and V (Value), your function should:

- Compute the dot product of Q and Kᵀ
- Scale the result by dividing it by √d (where d is the key dimension)
- Apply softmax to get attention weights
- Multiply the weights by V to get the output

Use the following test inputs:

Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])

K = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])

V = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

Expected Output Description:
Your output should display:

1. The attention weights matrix (after softmax)
2. The final output matrix

Short Answer Questions:

1. Why do we divide the attention score by √d in the scaled dot-product attention formula?
2. How does self-attention help the model understand relationships between words in a sentence?


In [None]:
# Importing necessary libraries
import tensorflow as tf
import numpy as np

# Define Q K V matrices
Q = tf.constant([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=tf.float32)
K = tf.constant([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=tf.float32)
V = tf.constant([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=tf.float32)

# Dot product of Q and K-transpose
matmul_qk = tf.matmul(Q, K, transpose_b=True)

# Scale by sqrt(d)
d_k = tf.cast(tf.shape(K)[-1], tf.float32)
scaled_result = matmul_qk / tf.math.sqrt(d_k)

# Softmax
attention_weights = tf.nn.softmax(scaled_result, axis=-1)

# Multiply weights with V
output = tf.matmul(attention_weights, V)

# Printing results
print(f"Attention Weights:\n {attention_weights.numpy()}")
print(f"Output:\n {output.numpy()}")


Attention Weights:
 [[0.73105854 0.26894143]
 [0.26894143 0.73105854]]
Output:
 [[2.0757656 3.0757656 4.0757656 5.0757656]
 [3.9242342 4.924234  5.924234  6.924234 ]]
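
As a cross-check, the same four steps can be written as a small NumPy function, which matches the `np.array` inputs given in the task. This is a minimal sketch rather than part of the original solution: the function name `scaled_dot_product_attention` and the max-subtraction inside the softmax are assumptions added for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (attention_weights, output) for 2-D arrays Q, K, V."""
    d_k = K.shape[-1]                    # key dimension
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot products
    # Subtract the row-wise max before exponentiating for numerical
    # stability; this does not change the softmax result.
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights, weights @ V

Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
K = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
V = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=float)

weights, output = scaled_dot_product_attention(Q, K, V)
print("Attention Weights:\n", weights)
print("Output:\n", output)
```

Up to float32 rounding, the weights and output should agree with the TensorFlow results printed above.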


Short Answer Questions:

1. Why do we divide the attention score by √d in the scaled dot-product attention formula?
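   Dot products grow in magnitude with the key dimension d: if the components of q and k are independent with zero mean and unit variance, q·k has variance d. Large scores push softmax into saturated regions where its gradients become extremely small. Dividing by √d keeps the scores at roughly unit variance, so the softmax stays in a range with useful gradients.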
2. How does self-attention help the model understand relationships between words in a sentence?
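   Self-attention lets every word compute a weight for every other word in the sentence, so each word's representation becomes a weighted mix of the words most relevant to it. Because any two positions interact directly in a single step, the model can capture long-range relationships even when the words are far away from each other.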
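
For question 1, the growth of dot products with the dimension can be checked empirically. The sketch below is an illustration only: it assumes the vector components are independent standard normal draws, and the helper name `dot_product_variance`, the seed, and the sample count are arbitrary choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (arbitrary choice)
n_samples = 10_000               # number of random (q, k) pairs per dimension

def dot_product_variance(d):
    """Empirical variance of q·k, raw and scaled by sqrt(d), for random d-dim vectors."""
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    dots = np.sum(q * k, axis=1)             # one dot product per row
    return dots.var(), (dots / np.sqrt(d)).var()

for d in (4, 64, 512):
    raw_var, scaled_var = dot_product_variance(d)
    print(f"d={d:4d}  raw variance={raw_var:8.2f}  scaled variance={scaled_var:5.2f}")
```

Under these assumptions the raw variance should come out close to d, while the scaled variance stays close to 1 for every d.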
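
For question 2, a small sketch can make the "every word attends to every other word" idea concrete. Everything here is hypothetical: the embeddings `X` and the projection matrices `W_q`, `W_k`, `W_v` are random placeholders (in a trained model they would be learned), and the sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                        # hypothetical sentence length and width
X = rng.standard_normal((seq_len, d_model))    # one row per token (made-up embeddings)

# Hypothetical projection matrices; random here, learned in a real model.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# In self-attention, Q, K, and V are all derived from the same input X.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V

print(weights.shape)    # (6, 6): one weight for every pair of tokens
print(weights[0, -1])   # how much the first token attends to the last one
```

The weight matrix has one entry per pair of positions, so the first and last tokens interact in the same single step as adjacent ones.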
