# Attention Layers

## 1. Tensor-Dot Mechanism

Let’s see how attention can be implemented in code. I will use random variables here for the different quantities but I will indicate which variables should be trained with `w_` and which should be inputs with `i_`.

We’ll begin with implementing the tensor-dot attention mechanism first. As an example, we’ll use a sequence length of 11 and a keys feature length of 4 and a values feature dimension of 2. Remember the keys and query must share feature dimension size.

In [5]:
import numpy as np


def softmax(x, axis=None):
    return np.exp(x) / np.sum(np.exp(x), axis=axis)


def tensor_dot(q, k):
    """Produces the attention vector"""
    b = softmax((k @ q) / np.sqrt(q.shape[0]))
    return b


i_query = np.random.normal(size=(4,))    # One vector of length 4
i_keys = np.random.normal(size=(11, 4))  # 11 vectors of length 4

b = tensor_dot(i_query, i_keys)  # Attention vector must be shape of entering values (11) and normalized
print("b = ", b)
print("norm b =", np.sum(b))

b =  [0.1369961  0.02122721 0.01506782 0.11386682 0.03810638 0.29349904
 0.14518754 0.15744044 0.00555456 0.00395907 0.06909501]
norm b = 1.0


Now we obtain the attention

In [8]:
def attention_layer(q, k, v):
    b = tensor_dot(q, k)
    return b @ v


i_values = np.random.normal(size=(11, 2))  # We will obtain 2 outputs
attention_layer(i_query, i_keys, i_values)

array([-0.07274881,  0.5986953 ])

We get two values, one for each feature dimension.

## 2. Self-attention

The change in self-attention is that we make queries, keys, and values equal. We need to make a small change in that the queries are batched in this setting, so we should get a rank 2 output.

In [11]:
def batched_tensor_dot(q, k):
    # a will be batch x seq x feature dim
    # which is N x N x 4
    # batched dot product in einstein notation
    a = np.einsum("ij,kj->ik", q, k) / np.sqrt(q.shape[0])
    # now we softmax over sequence
    b = softmax(a, axis=1)
    return b


def self_attention(x):
    b = batched_tensor_dot(x, x)
    return b @ x


i_batched_query = np.random.normal(size=(11, 4))
print(np.sum(batched_tensor_dot(i_batched_query, i_batched_query), axis=0)) # Returns N attention vectors
self_attention(i_batched_query)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


array([[ 0.1154936 ,  0.197874  ,  0.27291667, -0.63629915],
       [-0.04773785,  0.62663405, -0.10043959, -0.20765456],
       [ 0.42516801,  0.33139845, -0.41250352, -0.29836134],
       [-0.65958821,  1.8990822 ,  0.1094502 , -1.0619072 ],
       [ 1.04191603,  1.8098573 ,  0.42828857,  1.26168888],
       [ 0.18981706,  0.40404974, -0.45543939, -0.05699234],
       [ 0.2043638 ,  0.30224483,  0.01286699, -0.14859858],
       [ 0.19673329,  0.38762582, -0.21084643, -0.19133374],
       [ 0.21713664,  0.2674368 , -1.13089823,  0.14970111],
       [-0.24275823,  0.59954767,  0.19125803, -0.75298776],
       [-0.2255954 ,  0.71718845,  0.11766536, -0.46957887]])

## 3. Adding trainable parameters

You can add trainable parameters to these steps by adding a weight matrix. Let’s do this for the self-attention. Although keys, values, and query are equal in self-attention, I can multiply them by different weights. Just to demonstrate, I’ll have the values change to feature dimension 2.

In [12]:
# weights should be input feature_dim -> desired output feature_dim
w_q = np.random.normal(size=(4, 4))
w_k = np.random.normal(size=(4, 4))
w_v = np.random.normal(size=(4, 2))


def trainable_self_attention(x, w_q, w_k, w_v):
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    b = batched_tensor_dot(q, k)
    return b @ v


trainable_self_attention(i_batched_query, w_q, w_k, w_v)

array([[-0.14826184, -0.55589221],
       [ 0.0603077 , -0.36480304],
       [ 0.16447977, -1.70439765],
       [-6.98793611,  2.27343673],
       [-1.50784049, -2.22772401],
       [ 0.35414843, -0.72988563],
       [ 0.06196388, -0.4944749 ],
       [ 0.2831109 , -0.69316157],
       [ 0.46503781, -1.09529759],
       [-3.98045315,  0.20094664],
       [-0.4171768 , -0.1328769 ]])

Since we had our values change to feature dimension 2 with the weights, we get out an 11x2 output

## 4. Multi-head

The only change for multi-head attention is that we have one set of weights for each head and we agree on how to combine after running through the heads. I’ll just use a length $H$ vector of trainable weights. Other strategies are to concatenate them or use a reduction (e.g., mean, max).

In [13]:
w_q_h1 = np.random.normal(size=(4, 4))
w_k_h1 = np.random.normal(size=(4, 4))
w_v_h1 = np.random.normal(size=(4, 2))
w_q_h2 = np.random.normal(size=(4, 4))
w_k_h2 = np.random.normal(size=(4, 4))
w_v_h2 = np.random.normal(size=(4, 2))
w_h = np.random.normal(size=2)


def multihead_attention(x, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2):
    h1_out = trainable_self_attention(x, w_q_h1, w_k_h1, w_v_h1)
    h2_out = trainable_self_attention(x, w_q_h2, w_k_h2, w_v_h2)
    # join along last axis so we can use dot.
    all_h = np.stack((h1_out, h2_out), -1)
    return all_h @ w_h


multihead_attention(i_batched_query, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2)

array([[-3.22823848e+01,  1.64024108e+00],
       [-3.62693633e-01,  6.42179290e-01],
       [-5.97527871e+01,  1.31720675e+01],
       [-4.65071209e+01, -1.88286946e+02],
       [ 2.20755991e+00,  6.51506825e+00],
       [-4.95406321e-01,  3.70776240e-01],
       [-8.16256397e+00,  1.48087827e+00],
       [-1.48730486e+00, -8.97586041e-03],
       [-3.14780494e-01,  3.87091898e-01],
       [-9.46045874e-01, -2.39139704e-01],
       [-5.04126599e-01,  2.81149184e-01]])

[Continue discussion](https://dmol.pub/dl/attention.html#attention-in-graph-neural-networks)

## 5. Summary

- Attention layers are inspired by human ideas of attention, but is fundamentally a weighted mean reduction.
- The attention layer takes in three inputs: the query, the values, and the keys. These inputs are often identical, where the query is one key and the keys and the values are equal.
- They are good at modeling sequences, such as language.
- The attention vector should be normalized, which can be achieved using a softmax activation function, but the attention mechanism equation is a hyperparameter.
- Attention layers compute an attention vector with the attention mechanism, and then reduce it by computing the attention-weighted average.
- Using hard attention (hardmax function) returns the maximum output from the attention mechanism.
- The tensor-dot followed by a softmax is the most common attention mechanism.
- Self-attention is achieved when the query, values, and the keys are equal.
- Attention layers by themselves are not trainable.
- Multi-head attention block is a group of layers that splits to multiple parallel attentions.



$$