In [1]:
import numpy as np
from scipy.special import softmax

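def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q · K^T / sqrt(d_k)) · V for 2-D arrays.

    Q has shape (n_queries, d_k), K has shape (n_keys, d_k), and
    V has shape (n_keys, d_v). The attention weights and the output
    are also printed.
    """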
    # Step 1: Compute dot product Q · K^T
    scores = np.dot(Q, K.T)
    
    # Step 2: Scale scores by sqrt(d), where d = dimension of key
    d_k = K.shape[-1]
    scaled_scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply softmax to get attention weights
    attention_weights = softmax(scaled_scores, axis=-1)
    
    # Step 4: Multiply attention weights by V
    output = np.dot(attention_weights, V)

    print("Attention Weights:\n", attention_weights)
    print("\nOutput Matrix:\n", output)
    return output

# Test input
Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
K = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
V = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Run the attention function
scaled_dot_product_attention(Q, K, V)


Attention Weights:
 [[0.73105858 0.26894142]
 [0.26894142 0.73105858]]

Output Matrix:
 [[2.07576569 3.07576569 4.07576569 5.07576569]
 [3.92423431 4.92423431 5.92423431 6.92423431]]


array([[2.07576569, 3.07576569, 4.07576569, 5.07576569],
       [3.92423431, 4.92423431, 5.92423431, 6.92423431]])
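
As a hand check of the printed result (a worked calculation added for illustration, with $d_k = 4$ taken from `K.shape[-1]`), the scaled scores are

$$
\frac{QK^\top}{\sqrt{d_k}} = \frac{1}{2}\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
$$

so the first row of attention weights is

$$
\operatorname{softmax}(1, 0) = \left(\frac{e}{e+1}, \frac{1}{e+1}\right) \approx (0.7311,\ 0.2689)
$$

and the first output row is the corresponding weighted average of the rows of $V$:

$$
0.7311 \cdot (1, 2, 3, 4) + 0.2689 \cdot (5, 6, 7, 8) \approx (2.0758,\ 3.0758,\ 4.0758,\ 5.0758)
$$

The second row follows by symmetry with the two weights swapped.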

# 1. Why do we divide the attention score by √d in the scaled dot-product attention formula?

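We divide by √d because the dot product of a query and a key grows with the key dimension d. If the vector components are independent with zero mean and unit variance, the dot product has variance d, so its typical magnitude is about √d. Large scores push the softmax into a saturated regime where nearly all the weight goes to a single key and the gradients become extremely small, which slows learning.

Dividing by √d keeps the variance of the scores close to 1 regardless of d, so the softmax stays in a range where gradients flow effectively during training.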
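
The effect of the scaling can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original exercise: it draws random queries and keys with unit-variance components, and the names `score_std`, `small`, and `large` are made up for this example. It uses a small NumPy `softmax` helper so that the block runs without SciPy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def score_std(d, n=2000):
    """Empirical standard deviation of q . k over n random pairs of d-dim vectors."""
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    return (q * k).sum(axis=1).std()

# The raw standard deviation grows roughly like sqrt(d);
# after dividing by sqrt(d) it stays close to 1.
for d in (4, 64, 1024):
    raw = score_std(d)
    print(f"d={d:4d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(d))={raw / np.sqrt(d):.2f}")

# Hypothetical scores: the same pattern at a small scale and at the scale
# that unscaled dot products reach for d = 1024 (sqrt(1024) = 32).
small = np.array([1.0, 0.0, -1.0])
large = small * np.sqrt(1024)
print(softmax(small))  # weight is spread over all three keys
print(softmax(large))  # almost all weight lands on the first key
```

With the larger scores the softmax output is close to one-hot, which is the saturated regime where its gradient with respect to the scores is nearly zero.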

# 2. How does self-attention help the model understand relationships between words in a sentence?

Self-attention allows the model to:

- Compare each word with every other word in the sentence.
- Assign each pair a weight that reflects how relevant one word is to the other.

This mechanism enables the model to:

- Capture long-range dependencies (e.g., a pronoun and an antecedent that are far apart).
- Build a context-dependent representation of each word.
- Process all positions in parallel, unlike RNNs, which process tokens one at a time.

For example, in the sentence "The cat that chased the mouse was hungry," self-attention helps associate "was hungry" with "cat" rather than "mouse."
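
A minimal sketch of this idea, assuming hand-made toy embeddings rather than learned ones: the vectors in `X` are invented so that "hungry" points in a direction closer to "cat" than to "mouse", and the queries, keys, and values are all taken to be `X` itself (a real model would first apply learned projection matrices). The function name `self_attention` is also just for this illustration.

```python
import numpy as np

def self_attention(X):
    """Self-attention with Q = K = V = X (no learned projections)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])                # pairwise scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilise exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights, weights @ X

tokens = ["cat", "mouse", "hungry"]
# Hypothetical 4-dimensional embeddings, chosen by hand for illustration only.
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # cat
    [0.0, 1.0, 0.0, 1.0],   # mouse
    [1.0, 0.0, 0.5, 0.0],   # hungry
])

weights, output = self_attention(X)
for tok, row in zip(tokens, weights):
    print(f"{tok:>6}", np.round(row, 3))
```

Row i of `weights` says how much token i attends to every token in the sequence. With these toy vectors, the "hungry" row puts more weight on "cat" than on "mouse", and all rows are computed in a single matrix product rather than one step at a time.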