# Attention Explained

This brief article explains Attention (famously used in the GPT architecture behind ChatGPT). It is based on the 2017 paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), the [3Blue1Brown Attention video](https://www.youtube.com/watch?v=eMlx5fFNoYc&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=7), and [Andrej Karpathy](https://en.wikipedia.org/wiki/Andrej_Karpathy)'s video [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY).

## Attention as Communication

Attention is best known from the GPT architecture, but at its heart it is a **Communication Mechanism** between the nodes of a *Directed Graph*: it describes how the nodes of a graph should pass information to each other.

If you're unfamiliar with what a [directed graph](https://en.wikipedia.org/wiki/Directed_graph) is, it's a data structure consisting of nodes connected by edges that have a direction (arrows). See the diagram below for an illustration.

![directed_graph](../diagrams/exported/directed_graph.png)

Think of Attention as a way for the nodes of the graph to communicate with each other, where the attention weights decide how much weight each connection of the graph is given.
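To make the graph picture concrete, here is a minimal sketch of message passing on a weighted directed graph in plain Python. The node names, values, and weights are made up for illustration; in attention the weights are computed from the data rather than written by hand.

```python
# Hypothetical toy graph: edges[target][source] is the weight of the message
# flowing from `source` into `target`. Weights into each node sum to 1.
edges = {
    "a": {"a": 1.0},
    "b": {"a": 0.25, "b": 0.75},
    "c": {"a": 0.5, "b": 0.25, "c": 0.25},
}

# Each node holds a single number as its "information".
values = {"a": 4.0, "b": 8.0, "c": 0.0}


def pass_messages(edges, values):
    """Return new node values: a weighted sum of the values each node listens to."""
    return {
        target: sum(weight * values[source] for source, weight in incoming.items())
        for target, incoming in edges.items()
    }


updated = pass_messages(edges, values)
print(updated)  # {'a': 4.0, 'b': 7.0, 'c': 4.0}
```

Node `b` ends up with `0.25 * 4.0 + 0.75 * 8.0 = 7.0`: it mostly keeps its own value and mixes in a little of `a`.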

## Why does communication matter?

Let's consider the classic example of a sentence composed of words, for example:

> The sky is blue

As humans we already know that *blue* here refers to the *sky* and can make the connection between the two words. To make a machine understand this connection, we can represent the sentence as a sequence of nodes, where each node represents a word (or a token), and connect these nodes with directed edges to represent the flow of meaning. Information can now be passed between nodes (or words), contextualizing and strengthening the overall meaning of each word.

So Attention in essence provides a systematic way for nodes (which are words or tokens in the LLM context) to pass information to each other.

In addition, attention has no notion of space (that is, of where each node is located relative to the others). It is purely a way for nodes to communicate information to each other. In the context of LLMs, a sentence is broken down into tokens, and the tokens need to communicate with each other in order to pass information relating to context (for example, an adjective needs to communicate with the noun it describes). In a GPT-style model this communication is directional, since only previous words can influence later words. Here is an illustration of the directed graph representing a sentence:

![sentence_directed_graph](../diagrams/exported/sentence_directed_graph.png)
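The directed structure of a sentence can also be written down as an adjacency matrix. The sketch below assumes a causal (GPT-style) setup in which every token listens to itself and to all earlier tokens; whether self-loops are drawn in the diagram above is not something this sketch tries to match. The variable names are illustrative.

```python
import numpy as np

tokens = ["The", "sky", "is", "blue"]
n = len(tokens)

# causal_adj[i, j] == 1 means token i may receive information from token j.
# Lower-triangular: each token listens to itself and to earlier tokens only.
causal_adj = np.tril(np.ones((n, n), dtype=int))
print(causal_adj)

# List who each token is allowed to listen to.
for i, token in enumerate(tokens):
    sources = [tokens[j] for j in range(n) if causal_adj[i, j]]
    print(f"{token!r} listens to {sources}")
```

With this matrix, *blue* can receive information from *sky*, but *sky* cannot receive information from *blue*.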

## Attention Formulation

The formulation of attention from the paper can be described as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

OK, let's break this down in very simple terms.

![attention](../diagrams/exported/attention.png)
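In words: each node emits a query (what it is looking for), a key (what it offers), and a value (the information it will pass along). The product $QK^T$ scores how well every query matches every key, dividing by $\sqrt{d_k}$ keeps those scores in a reasonable range, the softmax turns each row of scores into weights that sum to 1, and multiplying by $V$ takes a weighted average of the values. A minimal NumPy sketch of the formula, with arbitrarily chosen shapes and random inputs, might look like this:

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / e_x.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted average of the value rows


# Illustrative shapes only: 4 tokens, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 16)
print(weights.sum(axis=-1))   # each entry is approximately 1
```

Row `i` of `weights` tells you how much node `i` listens to every other node; row `i` of `out` is the resulting mix of values.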

## Self vs. Cross Attention

Self-attention means that the words of the same sentence pass information to each other. There is no external data source (like video or voice) that information needs to come from.

In contrast, cross-attention is where information passes from one data source to a different one. The classic example is translating a sentence from one language to another: here you'd want information from the sentence in language A to flow to the sentence in language B.
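In code, the only difference between the two is where the queries, keys, and values come from. The sketch below uses random vectors in place of real token embeddings, and for brevity it reuses one set of projection matrices for both cases, which is a simplification of this example rather than how trained models are usually set up. The helper name `attend` is made up for illustration.

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
d_model, d_k = 8, 4

source = rng.normal(size=(6, d_model))  # stand-in for 6 tokens in language A
target = rng.normal(size=(3, d_model))  # stand-in for 3 tokens in language B

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))


def attend(queries_from, keys_values_from):
    """Queries come from one sequence, keys and values from another (or the same)."""
    Q = queries_from @ W_Q
    K = keys_values_from @ W_K
    V = keys_values_from @ W_V
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


# Self-attention: the target sentence talks to itself.
self_out, self_w = attend(target, target)
# Cross-attention: the target sentence pulls information from the source sentence.
cross_out, cross_w = attend(target, source)

print(self_w.shape)   # (3, 3)
print(cross_w.shape)  # (3, 6)
```

The cross-attention weight matrix is rectangular: each of the 3 target tokens distributes its attention over the 6 source tokens.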


## Coding Attention

In this section we code attention on a small random directed graph and show how it can be used.

```python
import numpy as np

# --- Setup ---
np.random.seed(42)

num_nodes = 5
feature_dim = 4
d_k = d_v = 4

# Random features for each node
X = np.random.randn(num_nodes, feature_dim)

# Random projection matrices
W_Q = np.random.randn(feature_dim, d_k)
W_K = np.random.randn(feature_dim, d_k)
W_V = np.random.randn(feature_dim, d_v)

# Queries, Keys, Values
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# --- Step 1: Create a random directed adjacency matrix ---
# 0/1 edges, no self-loops for clarity
A = (np.random.rand(num_nodes, num_nodes) > 0.5).astype(int)
np.fill_diagonal(A, 0)  # remove self-loops

print("Adjacency matrix (directed graph):\n", A)

# --- Step 2: Compute attention scores only for existing edges ---
scores = Q @ K.T / np.sqrt(d_k)   # similarity matrix (N x N)

# Mask out non-edges (set to a large negative value before softmax,
# so that exp() of those entries underflows to 0)
mask = (A == 0)
scores_masked = np.where(mask, -1e9, scores)

# --- Step 3: Apply softmax row-wise (over neighbors only) ---
def softmax(x):
    e_x = np.exp(x)
    # Add 1e-9 to avoid division by zero for nodes with no neighbors
    # (their row of weights then comes out as all zeros)
    return e_x / (e_x.sum(axis=-1, keepdims=True) + 1e-9)

attention_weights = softmax(scores_masked)

print("\nAttention weights (edge weights):\n", attention_weights)

# --- Step 4: Message passing ---
out = attention_weights @ V   # aggregate messages from neighbors
print("\nOutput node features after attention:\n", out)
```

```text
Adjacency matrix (directed graph):
 [[0 1 1 1 1]
 [0 0 0 0 0]
 [0 1 0 1 1]
 [0 0 1 0 0]
 [0 0 1 1 0]]

Attention weights (edge weights):
 [[0.         0.58686424 0.11015243 0.29349828 0.00948504]
 [0.         0.         0.         0.         0.        ]
 [0.         0.29790782 0.         0.14780649 0.5542857 ]
 [0.         0.         1.         0.         0.        ]
 [0.         0.         0.75252504 0.24747496 0.        ]]

Output node features after attention:
 [[ 0.62674184  0.35085896 -0.7575579  -0.94982768]
 [ 0.          0.          0.          0.        ]
 [ 0.17279806 -1.11993955 -0.30863806 -1.02376616]
 [-0.29388799 -1.00053631  0.24193489  0.17904783]
 [ 0.22715068 -0.67936369  0.56923429  0.0995786 ]]
```
