For simplicity, you will initially calculate the attention for the first word in a sequence of four. You will then generalize the code to calculate an attention output for all four words in matrix form. 

Start by defining the embeddings of the four words for which you will calculate the attention. In practice, an encoder would generate these embeddings; for this example, you define them manually after importing NumPy.

In [3]:
import numpy as np
import random

The cell below defines the four embeddings. The step after that generates the weight matrices, which you will multiply with the word embeddings to generate the queries, keys, and values. Here, you generate these weight matrices randomly; in practice, they would be learned during training.

In [2]:

# encoder representations of four different words
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

In [13]:
word_1

array([1, 0, 0])

In [8]:
len(word_1)

3

In [9]:
type(word_1)

numpy.ndarray

In [6]:

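# generating the weight matrices
# seed NumPy's generator: random.seed(42) would only seed Python's built-in
# random module and leave np.random.randint unseeded
# note: the outputs printed in the following cells came from an unseeded draw,
# so a seeded run prints different values
np.random.seed(42) # to allow us to reproduce the same attention values

W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))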
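
Reproducibility depends on seeding the generator that actually draws the numbers. As a minimal sketch, assuming NumPy's `Generator` API is acceptable here, the matrices could also be drawn from a dedicated, seeded generator. The names `rng` and `rng_again` are illustrative, and the values drawn this way differ from those of the legacy `np.random` functions.

```python
import numpy as np

# Assumption: a dedicated Generator seeded with 42 (names are illustrative).
rng = np.random.default_rng(42)
W_Q = rng.integers(3, size=(3, 3))  # integers drawn from {0, 1, 2}
W_K = rng.integers(3, size=(3, 3))
W_V = rng.integers(3, size=(3, 3))

# Re-creating the generator with the same seed reproduces the same first draw.
rng_again = np.random.default_rng(42)
W_Q_again = rng_again.integers(3, size=(3, 3))
print(np.array_equal(W_Q, W_Q_again))  # True
```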

In [11]:
W_Q

array([[2, 1, 0],
       [0, 0, 0],
       [0, 2, 1]])

In [10]:
len(W_Q)

# len(W_Q) is the number of rows of W_Q, which equals the embedding dimension len(word_1)

3

In [12]:
type(W_Q)

numpy.ndarray

Notice how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is three) to allow us to perform the matrix multiplication.
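
The shape requirement can be checked directly. The sketch below uses placeholder matrices filled with ones (the names `W_square`, `W_narrow`, and `W_wrong` are hypothetical): only the number of rows has to match the embedding dimension, while the number of columns determines the size of the projected vector.

```python
import numpy as np

embedding = np.array([1, 0, 0])        # shape (3,): a 3-dimensional embedding
W_square = np.ones((3, 3), dtype=int)  # 3 rows, 3 columns
W_narrow = np.ones((3, 2), dtype=int)  # 3 rows, 2 columns (hypothetical)
W_wrong = np.ones((2, 3), dtype=int)   # 2 rows: does not match the embedding

print((embedding @ W_square).shape)  # (3,)
print((embedding @ W_narrow).shape)  # (2,)

# A row count that differs from the embedding dimension raises a ValueError.
try:
    embedding @ W_wrong
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```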

Subsequently, the query, key, and value vectors for each word are generated by multiplying each word embedding by each of the weight matrices. 

In [14]:
...
# generating the queries, keys and values
query_1 = word_1 @ W_Q # (3,) @ (3, 3) -> (3,)
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
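
The twelve assignments follow one pattern, so they can also be written as a loop. The sketch below covers only the queries, because `W_Q` is the only weight matrix whose values are printed in the cells above; the same comprehension would apply to `W_K` and `W_V`.

```python
import numpy as np

# The four embeddings, stacked row by row.
words = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

# W_Q as printed by the W_Q cell above.
W_Q = np.array([[2, 1, 0],
                [0, 0, 0],
                [0, 2, 1]])

# One query per word, in the same order as the embeddings.
queries = [word @ W_Q for word in words]
for i, query in enumerate(queries, start=1):
    print(f"query_{i}: {query}")
```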

In [16]:
query_1

array([2, 1, 0])

###### word_1 = matrix A
###### W_Q = matrix B
###### query_1 = resultant matrix

![Matrix multiplication of word_1 by W_Q](./matrix_multiply_word1_WQ.png)
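
The multiplication in the figure can be written out by hand. This sketch, using the `W_Q` values printed earlier, computes each component of the query as a sum of products and then checks two consequences: a one-hot embedding selects a single row of `W_Q`, and `word_3` (which has two ones) adds two rows together.

```python
import numpy as np

W_Q = np.array([[2, 1, 0],
                [0, 0, 0],
                [0, 2, 1]])
word_1 = np.array([1, 0, 0])
word_3 = np.array([1, 1, 0])

# Component j of the query is sum_i word_1[i] * W_Q[i, j].
query_1_manual = np.array(
    [sum(word_1[i] * W_Q[i, j] for i in range(3)) for j in range(3)]
)
print(query_1_manual)  # [2 1 0]

# A one-hot embedding picks out one row of the weight matrix.
print(np.array_equal(word_1 @ W_Q, W_Q[0]))  # True

# An embedding with two ones adds the two corresponding rows.
print(np.array_equal(word_3 @ W_Q, W_Q[0] + W_Q[1]))  # True
```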

In [22]:
key_1

array([0, 0, 1])

In [23]:
value_1

array([1, 2, 0])

Considering only the first word for the time being, the next step scores its query vector against all the key vectors using a dot product operation. 

In [24]:
...
# scoring the first query vector against all key vectors
scores = np.array([np.dot(query_1, key_1), np.dot(query_1, key_2), np.dot(query_1, key_3), np.dot(query_1, key_4)])

In [25]:
scores

array([0, 2, 2, 2])
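
The first entry of `scores` can be reproduced by hand from the `query_1` and `key_1` values printed above. The sketch expands the dot product into its elementwise products; the names `products` and `score_11` are illustrative.

```python
import numpy as np

query_1 = np.array([2, 1, 0])  # as printed above
key_1 = np.array([0, 0, 1])    # as printed above

# A dot product is the sum of the elementwise products: 2*0 + 1*0 + 0*1.
products = query_1 * key_1
score_11 = products.sum()
print(products, score_11)  # [0 0 0] 0
```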

The scores are then passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the scores by the square root of the dimensionality of the key vectors (in this case, three). Without this scaling, large dot products push the softmax into regions where its gradients are very small, so the division helps keep the gradients stable.
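
One way to see why the square root is the usual choice is a small simulation. The sketch below is an illustration under stated assumptions, not part of the notebook: it draws hypothetical query and key vectors with independent standard normal components and a dimension of 64, then compares the variance of the raw and scaled dot products. The raw variance should come out close to the dimension, and the scaled variance close to one.

```python
import numpy as np

# Assumptions: d_k = 64 and standard normal components, chosen for illustration.
rng = np.random.default_rng(0)
d_k = 64
n_samples = 10_000

q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = (q * k).sum(axis=1)    # one dot product per row
scaled = raw / np.sqrt(d_k)  # the scaling applied before the softmax

print(f"variance of raw scores:    {raw.var():.2f}")
print(f"variance of scaled scores: {scaled.var():.2f}")
```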

In [28]:
# python -m pip install scipy
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html
from scipy.special import softmax

In [29]:
...
# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)

In [30]:
weights

array([0.09506409, 0.3016453 , 0.3016453 , 0.3016453 ])
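
The weights above can be reproduced without SciPy. This is a minimal sketch of a one-dimensional softmax in plain NumPy (the helper name `softmax_1d` is illustrative); subtracting the maximum before exponentiating is a common guard against overflow and does not change the result. The unscaled variant is included to show that skipping the division gives a more peaked distribution.

```python
import numpy as np

def softmax_1d(x):
    """Softmax of a 1-D array, shifted by its maximum for numerical safety."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([0, 2, 2, 2])  # the scores printed above
d_k = 3                          # dimensionality of the key vectors

weights = softmax_1d(scores / d_k ** 0.5)
unscaled = softmax_1d(scores)

print(weights)   # matches the weights printed above
print(unscaled)  # same ordering, but further from uniform
```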

Finally, the attention output for the first word is calculated as a weighted sum of all four value vectors, with the softmax weights as coefficients.

In [31]:
...
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

[1.         1.39670939 1.80987182]
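
A weighted sum of vectors is the same operation as a vector-matrix product. Because only `value_1` is printed in the cells above, this sketch borrows the first row of `weights` and the matrix `V` from the seeded output printed by the matrix-form cell further down, so its numbers correspond to that run rather than to the unseeded one here.

```python
import numpy as np

# First row of `weights` and the full `V`, copied from the seeded matrix-form output.
weights_1 = np.array([2.36089863e-01, 7.38987555e-03, 7.49130386e-01, 7.38987555e-03])
V = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 2, 1],
              [0, 0, 0]])

# Explicit weighted sum, one term per value vector.
explicit = sum(w * v for w, v in zip(weights_1, V))

# The same computation as a single vector-matrix product.
as_product = weights_1 @ V

print(np.allclose(explicit, as_product))  # True
print(as_product)
```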


For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go:

In [36]:
from numpy import array
from numpy import random
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

print(f"word_1: {word_1}")

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

print(f"words: {words}")

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

print(f"W_Q: {W_Q}")
print(f"W_K: {W_K}")
print(f"W_V: {W_V}")

# generating the queries, keys and values
Q = words @ W_Q # (4, 3) @ (3, 3) -> (4, 3)
K = words @ W_K
V = words @ W_V

print(f"Q: {Q}")
print(f"K: {K}")
print(f"V: {V}")

# scoring the query vectors against all key vectors
scores = Q @ K.transpose() # (4, 3) @ (3, 4) -> (4, 4)

print(f"scores: {scores}")


# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

print(f"weights: {weights}")

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(f"attention: {attention}")

word_1: [1 0 0]
words: [[1 0 0]
 [0 1 0]
 [1 1 0]
 [0 0 1]]
W_Q: [[2 0 2]
 [2 0 0]
 [2 1 2]]
W_K: [[2 2 2]
 [0 2 1]
 [0 1 1]]
W_V: [[1 1 0]
 [0 1 1]
 [0 0 0]]
Q: [[2 0 2]
 [2 0 0]
 [4 0 2]
 [2 1 2]]
K: [[2 2 2]
 [0 2 1]
 [2 4 3]
 [0 1 1]]
V: [[1 1 0]
 [0 1 1]
 [1 2 1]
 [0 0 0]]
scores: [[ 8  2 10  2]
 [ 4  0  4  0]
 [12  2 14  2]
 [10  4 14  3]]
weights: [[2.36089863e-01 7.38987555e-03 7.49130386e-01 7.38987555e-03]
 [4.54826323e-01 4.51736775e-02 4.54826323e-01 4.51736775e-02]
 [2.39275049e-01 7.43870015e-04 7.59237211e-01 7.43870015e-04]
 [8.99501754e-02 2.81554063e-03 9.05653685e-01 1.58059922e-03]]
attention: [[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
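
The matrix-form steps can be collected into one reusable function. The sketch below is one possible packaging, not the notebook's own code: the function name `scaled_dot_product_attention` is an assumption, and the softmax is written in plain NumPy instead of `scipy.special.softmax`. The weight matrices are copied from the printed output above, so the result can be compared with the printed `attention`.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (attention, weights) for 2-D arrays Q, K, V with one row per word."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# Weight matrices as printed above.
W_Q = np.array([[2, 0, 2], [2, 0, 0], [2, 1, 2]])
W_K = np.array([[2, 2, 2], [0, 2, 1], [0, 1, 1]])
W_V = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 0]])

attention, weights = scaled_dot_product_attention(words @ W_Q, words @ W_K, words @ W_V)
print(attention)
print(weights.sum(axis=1))  # each row of weights sums to 1
```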


In [35]:
K

array([[2, 2, 2],
       [0, 2, 1],
       [2, 4, 3],
       [0, 1, 1]])

In [34]:
K.transpose()

array([[2, 0, 2, 0],
       [2, 2, 4, 1],
       [2, 1, 3, 1]])
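
Transposing `K` lines up each key as a column, so entry `(i, j)` of `Q @ K.transpose()` is the dot product of query `i` with key `j`. The sketch below checks this against an explicit double loop, using the `Q` and `K` values printed above; `K.T` is NumPy shorthand for `K.transpose()`.

```python
import numpy as np

# Q and K as printed above.
Q = np.array([[2, 0, 2], [2, 0, 0], [4, 0, 2], [2, 1, 2]])
K = np.array([[2, 2, 2], [0, 2, 1], [2, 4, 3], [0, 1, 1]])

# One dot product per (query, key) pair.
scores_loop = np.array([[np.dot(q, k) for k in K] for q in Q])

# The same 4 x 4 table in a single matrix product.
scores_matrix = Q @ K.T

print(np.array_equal(scores_loop, scores_matrix))  # True
print(scores_matrix[0])  # [ 8  2 10  2]
```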

In [None]:
# CREDITS: https://machinelearningmastery.com/the-attention-mechanism-from-scratch/