
In [16]:
import numpy as np 
from scipy.special import softmax

### 8.3 The General Attention Mechanism with NumPy and SciPy
This section shows how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.

For simplicity, you will first calculate the attention for the first word in a sequence of four, and then generalize the code to calculate the attention output for all four words in matrix form. Start by defining the word embeddings of the four words. In practice, these word embeddings would be generated by an encoder; for this example, you will define them manually.

In [4]:
word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])
word_4 = np.array([0, 0, 1])

The next step generates the weight matrices, which you will eventually multiply with the word embeddings to generate the queries, keys, and values. Here, you will generate these weight matrices randomly; in practice, they would be learned during training.

In [5]:
np.random.seed(42) # to allow us to reproduce the same attention values
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

Notice that the number of rows of each of these matrices equals the dimensionality of the word embeddings (three in this case), which is what makes the matrix multiplication possible. The query, key, and value vectors for each word are then generated by multiplying that word's embedding by each of the weight matrices.


In [9]:
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V
query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V
query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V
query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
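Because these projections are plain matrix products, two properties can be checked directly. The sketch below is only an illustration that reuses the same seed and embeddings as above. It checks that a one-hot embedding selects a single row of the weight matrix, and that the query of `word_3` is the sum of the queries of `word_1` and `word_2`, since `word_3 = word_1 + word_2`.

```python
import numpy as np

# Same seed as above, so W_Q matches the query weight matrix of this section.
np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))

word_1 = np.array([1, 0, 0])
word_2 = np.array([0, 1, 0])
word_3 = np.array([1, 1, 0])

query_1 = word_1 @ W_Q
query_2 = word_2 @ W_Q
query_3 = word_3 @ W_Q

# A one-hot embedding picks out the matching row of the weight matrix.
print(np.array_equal(query_1, W_Q[0]))

# The projection is linear: word_3 = word_1 + word_2 implies the same for the queries.
print(np.array_equal(query_3, query_1 + query_2))
```

Both checks print `True`. The same reasoning applies to the key and value projections.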

Considering only the first word for the time being, the next step scores its query vector against all the key vectors using a dot product operation.


In [20]:
scores = np.array([np.dot(query_1, key_1), np.dot(query_1, key_2), np.dot(query_1, key_3),
                   np.dot(query_1, key_4)])
scores

array([ 8,  2, 10,  2])
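The four dot products can also be written as a single matrix-vector product by stacking the key vectors row by row. The following is a minimal sketch under the same seed and embeddings. The names `keys`, `scores_loop`, and `scores_matvec` are introduced here for illustration only.

```python
import numpy as np

np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))

words = [np.array([1, 0, 0]), np.array([0, 1, 0]),
         np.array([1, 1, 0]), np.array([0, 0, 1])]

query_1 = words[0] @ W_Q

# Stack the four key vectors into a (4, 3) array, one key per row.
keys = np.stack([word @ W_K for word in words])

# One dot product per key, as in the cell above.
scores_loop = np.array([np.dot(query_1, key) for key in keys])

# The same scores as a single matrix-vector product.
scores_matvec = keys @ query_1

print(np.array_equal(scores_loop, scores_matvec))
```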

The score values are subsequently passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three). Without this scaling, large dot products can push the softmax into a region where its gradients become very small, so the scaling helps keep the gradients stable.

In [21]:
weights = softmax(scores / key_1.shape[0] ** 0.5)
weights

array([0.23608986, 0.00738988, 0.74913039, 0.00738988])
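To see what the scaling does, you can compare the softmax of the raw scores with the softmax of the scaled scores. This is an illustrative sketch that hard-codes the score values printed earlier. The names `unscaled` and `scaled` are not part of the original code.

```python
import numpy as np
from scipy.special import softmax

# Scores of the first word against all four keys, copied from the output above.
scores = np.array([8, 2, 10, 2])
d_k = 3  # dimensionality of the key vectors

unscaled = softmax(scores)
scaled = softmax(scores / np.sqrt(d_k))

print(unscaled.round(4))
print(scaled.round(4))

# Scaling flattens the distribution: the largest weight shrinks.
print(scaled.max() < unscaled.max())
```

Both weight vectors sum to one, but the unscaled version concentrates more of the weight on the largest score.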

Finally, the attention output is calculated by a weighted sum of all four value vectors.

In [24]:
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)
attention

array([0.98522025, 1.74174051, 0.75652026])
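Two properties of this step are worth verifying: the softmax weights sum to one, and the weighted sum of the value vectors is equivalent to multiplying the weight vector by the stacked value matrix. The sketch below assumes the same seed and embeddings as above. The names `keys`, `values`, `attention_sum`, and `attention_matvec` are illustrative.

```python
import numpy as np
from scipy.special import softmax

np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

words = [np.array([1, 0, 0]), np.array([0, 1, 0]),
         np.array([1, 1, 0]), np.array([0, 0, 1])]

query_1 = words[0] @ W_Q
keys = np.stack([word @ W_K for word in words])    # shape (4, 3)
values = np.stack([word @ W_V for word in words])  # shape (4, 3)

# Scaled scores of the first word, passed through the softmax.
weights = softmax(keys @ query_1 / keys.shape[1] ** 0.5)

# Explicit weighted sum, term by term.
attention_sum = sum(w * v for w, v in zip(weights, values))

# The same result as a vector-matrix product.
attention_matvec = weights @ values

print(np.isclose(weights.sum(), 1.0))
print(np.allclose(attention_sum, attention_matvec))
```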

For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go:


In [29]:
from numpy import array, random
from scipy.special import softmax
# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])
# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()
# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
# computing the attention by a weighted sum of the value vectors
attention = weights @ V
attention

array([[0.98522025, 1.74174051, 0.75652026],
       [0.90965265, 1.40965265, 0.5       ],
       [0.99851226, 1.75849334, 0.75998108],
       [0.99560386, 1.90407309, 0.90846923]])
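If you want to reuse these steps, one option is to wrap them in a function. The helper below, `general_attention`, is a hypothetical name introduced here, not part of NumPy or SciPy. It is a minimal sketch that follows the matrix-form steps of this section and additionally returns the weights so that they can be inspected.

```python
import numpy as np
from scipy.special import softmax


def general_attention(words, W_Q, W_K, W_V):
    """Hypothetical helper: attention output and weights for stacked embeddings."""
    Q = words @ W_Q
    K = words @ W_K
    V = words @ W_V
    # Score every query against every key, then scale by sqrt of the key dimension.
    scores = Q @ K.T
    weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V, weights


words = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

np.random.seed(42)
W_Q = np.random.randint(3, size=(3, 3))
W_K = np.random.randint(3, size=(3, 3))
W_V = np.random.randint(3, size=(3, 3))

attention, weights = general_attention(words, W_Q, W_K, W_V)
print(attention.shape)      # (4, 3): one output vector per word
print(weights.sum(axis=1))  # each row of weights sums to 1, up to floating-point error
```

With the same seed, the first row of `attention` should correspond to the single-word result computed earlier in this section.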