# Implementation: Self-Attention from Scratch

**Goal**: Implement $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
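
It may help to read the formula as three steps that mirror the numbered steps in the code. The symbols $n$ (number of words), $d_v$ (value dimension), $S$ (scaled scores), and $A$ (attention weights) are notation assumed here for illustration. With $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$:

$$
\begin{aligned}
S &= \frac{QK^T}{\sqrt{d_k}} \in \mathbb{R}^{n \times n}, \\
A_{ij} &= \frac{\exp(S_{ij})}{\sum_{j'=1}^{n} \exp(S_{ij'})}, \\
\mathrm{Attention}(Q, K, V) &= AV \in \mathbb{R}^{n \times d_v}.
\end{aligned}
$$

The softmax is applied to each row of $S$ independently, so every row of $A$ sums to 1. In the example used here, $n = 3$ and $d_k = d_v = 4$.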

In [None]:
import numpy as np

# 1. Mock Data (3 words, embedding dim 4)
x = np.array([
    [1, 0, 1, 0], # Word 1 (e.g., "I")
    [0, 1, 0, 1], # Word 2 (e.g., "love")
    [1, 1, 0, 0]  # Word 3 (e.g., "AI")
])

# In self-attention, Q, K, and V are all computed from the same input x.
# In practice each is a learned projection: Q = x @ Wq, K = x @ Wk, V = x @ Wv.
# Here we simplify by taking Wq = Wk = Wv = identity, so Q = K = V = x.
Q = x
K = x
V = x

# 2. Calculate Attention Scores (Q @ K_transpose)
scores = np.dot(Q, K.T)
print("Raw Scores (Dot Product):\n", scores)

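# 3. Softmax (normalize each row to probabilities)
def softmax(z):
    # Subtract the row-wise max before exponentiating to avoid overflow;
    # the result is mathematically unchanged.
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)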

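# Scale by sqrt(d_k) so the dot products do not grow with the key dimension
# and push the softmax toward a near one-hot distribution
d_k = K.shape[-1]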
attention_weights = softmax(scores / np.sqrt(d_k))
print("\nAttention Weights:\n", np.round(attention_weights, 2))

# 4. Weighted Sum (Weights @ Values)
output = np.dot(attention_weights, V)
print("\nOutput (Contextualized Embeddings):\n", output)
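
The code above takes the weight matrices to be identity. As a minimal sketch of the non-simplified case, the version below projects `x` with separate `W_q`, `W_k`, and `W_v` matrices. The matrices are random stand-ins for learned weights, and the function name `self_attention`, the seed, and the projection size `d_k = 2` are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the max subtracted for numerical stability
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Project the same input into queries, keys, and values
    Q = x @ W_q
    K = x @ W_k
    V = x @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product scores, then row-wise softmax
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Each output row is a weighted sum of the value rows
    return weights @ V, weights

# Same mock data as above (3 words, embedding dim 4)
x = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

# Hypothetical random weights standing in for learned parameters
rng = np.random.default_rng(0)
d_model, d_k = x.shape[1], 2
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

projected_output, projected_weights = self_attention(x, W_q, W_k, W_v)
print("Weights shape:", projected_weights.shape)  # one row of weights per word
print("Output shape:", projected_output.shape)    # one d_k-dim vector per word
```

Because the weights here are random rather than trained, the resulting attention pattern carries no meaning; the sketch only shows where the projections enter the computation and how they change the output dimension.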

## Conclusion
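See the "Attention Weights" matrix. The diagonal entry is the largest in every row (about 0.51, 0.51, and 0.45), because each word's dot product with itself is the highest score in its row. The remaining weight goes to the other words: word 3 shares one active dimension with each of words 1 and 2 and receives about 0.31 from each, and even words 1 and 2, whose embeddings are orthogonal (raw score 0), still give each other about 0.19, since softmax never assigns exactly zero weight.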
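
These observations can be checked programmatically. The snippet below is a small self-contained sketch that recomputes the weights for the same mock data under the same Q = K = V = x simplification; the variable names `rows_sum_to_one`, `diagonal_is_largest`, and `all_positive` are introduced here for illustration.

```python
import numpy as np

# Same mock data as above, with Q = K = V = x
x = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

# Scaled dot-product scores followed by a row-wise softmax
scores = x @ x.T / np.sqrt(x.shape[1])
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
attention_weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Each row is a probability distribution over the words
rows_sum_to_one = np.allclose(attention_weights.sum(axis=1), 1.0)
# The largest weight in row i sits at column i (the word itself)
diagonal_is_largest = np.all(
    np.argmax(attention_weights, axis=1) == np.arange(len(x))
)
# Softmax never produces an exact zero, so every word attends to every other
all_positive = np.all(attention_weights > 0)

print(rows_sum_to_one, diagonal_is_largest, all_positive)  # True True True
```

The diagonal dominance is a property of this particular data and the identity-weight simplification; with learned projection matrices a word need not attend most strongly to itself.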