<a href="https://colab.research.google.com/github/AdamClarkStandke/TransDiffTemp/blob/main/multiHead_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Head Attention from the Original Transformer Paper

In [136]:
import numpy as np
from scipy.special import softmax

# Step 1: Represent the Input

As detailed in the book [Transformers for NLP by Denis Rothman](https://www.packtpub.com/en-us/product/transformers-for-natural-language-processing-second-edition-9781803247335?type=print&gad_source=1&gclid=Cj0KCQiA5-uuBhDzARIsAAa21T_Vro5RuNsURN8t_PUHd6aybZm0mi2BApbIxHRjw-qbjQxcrWbvRL8aApJMEALw_wcB):

> The input of the attention mechanism we are building is like that of the original paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762v1.pdf), in which $d_{model}=512$ [i.e., the positional embedding space].

In this example the prompt is the following sentence: **"I love you"**

In this case *x* contains 3 inputs (i.e., the three words), each embedded in the original 512-dimensional positional embedding space, giving a 3 × 512 matrix.




In [137]:
print("Step 1: Input=3 tokenized words, d_model=512")
x = np.random.randint(1, 10, size=(3, 512)) # random integers in [1, 9] stand in for real embeddings
print(f"I: {x[0]}")
print(f"love: {x[1]}")
print(f"you: {x[2]}")
print(x.shape)

Step 1: Input=3 tokenized words, d_model=512
I: [8 5 6 7 7 7 2 2 5 5 7 5 5 1 1 9 3 2 9 7 1 1 8 7 1 7 7 1 1 6 1 3 2 6 5 9 3
 5 2 6 2 9 4 1 3 3 5 2 1 8 7 6 5 9 5 7 3 2 9 8 5 3 5 1 8 5 2 5 8 3 2 3 6 7
 7 1 6 7 2 1 9 1 7 4 5 6 9 6 6 6 8 2 2 8 5 3 8 1 9 5 3 6 4 6 7 6 4 7 5 4 7
 7 4 6 7 8 7 3 4 1 4 9 7 6 8 9 4 1 3 4 2 9 6 6 8 8 2 1 5 3 2 1 9 3 4 2 6 9
 8 7 7 4 6 8 8 5 1 2 5 8 9 3 6 2 8 5 9 4 6 8 5 8 6 2 3 1 4 2 1 2 2 7 1 4 1
 7 4 6 6 8 1 4 3 3 6 6 8 8 2 2 4 2 7 7 1 3 5 3 5 5 1 3 8 7 8 9 5 8 9 6 3 6
 5 1 9 3 3 2 2 8 4 2 5 2 9 8 1 1 5 5 2 5 9 8 4 2 3 9 5 4 3 4 2 5 4 5 9 3 1
 2 3 1 9 9 7 6 4 1 8 2 5 6 3 7 3 9 4 9 2 8 3 2 9 3 5 4 6 5 9 1 6 8 3 7 7 6
 5 1 5 7 2 2 7 1 2 8 7 3 4 5 6 4 4 5 3 1 8 1 1 7 8 6 2 8 8 7 9 4 5 4 5 5 1
 1 6 1 7 5 5 3 4 5 3 2 4 3 3 1 8 3 8 6 1 3 6 6 2 2 2 8 3 5 7 6 6 9 8 8 1 6
 8 2 1 1 6 7 7 6 7 7 4 2 8 8 1 4 1 3 8 4 6 3 4 9 3 9 9 7 8 4 7 2 3 8 4 1 6
 7 5 3 9 5 9 6 3 4 8 5 8 1 6 1 5 3 2 2 6 4 7 1 9 1 8 4 8 9 7 3 6 5 9 9 3 7
 8 4 1 4 4 5 9 7 7 3 7 3 6 6 5 9 6 5 2 6 8 5 1 8 4 2

# Step 2: Initializing the Weight Matrices

Each attention head has 3 weight matrices:

*   $W_Q$
*   $W_K$
*   $W_V$

These 3 weight matrices are applied to all the inputs in the model to project the positional embeddings into a query, key, and value matrix (i.e., Q, K, and V). Since the model has eight heads, each head projects into 64 dimensions (i.e., 512 / 8 = 64). *Importantly, the query and key matrices must have the same number of columns ($d_k$) so that $QK^{T}$ is defined; the value matrix may have a different number of columns ($d_v$).*
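The shape constraint can be checked with a small sketch. The names, the random weights, and the choice `d_v = 32` are illustrative assumptions and not part of the notebook; `d_v` is deliberately set to differ from `d_k` to show that only the query and key widths have to match.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is reproducible
d_model, num_heads = 512, 8
d_k = d_model // num_heads                # 64, as in the text
d_v = 32                                  # hypothetical: chosen to differ from d_k

tokens = rng.normal(size=(3, d_model))    # 3 tokens, stand-in embeddings
proj_q = rng.normal(size=(d_model, d_k))
proj_k = rng.normal(size=(d_model, d_k))  # must share d_k with proj_q
proj_v = rng.normal(size=(d_model, d_v))  # free to use a different width

q, k, v = tokens @ proj_q, tokens @ proj_k, tokens @ proj_v
scores = q @ k.T                          # defined because q and k both have d_k columns
print(d_k, scores.shape, v.shape)
```

The score matrix is 3 × 3 regardless of `d_v`; only the width of the resulting context vectors depends on it.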

In [138]:
print("Step 2: Create Query, Key Weights of 64 and Value Weights of 64 dimensions")
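# Note: np.random.randint excludes the upper bound, so each call below returns a constant matrix.
w_q = np.random.randint(1, 2, size=(512, 64)) # weight matrix query (all 1s)
w_k = np.random.randint(3, 4, size=(512, 64)) # weight matrix key (all 3s)
w_v = np.random.randint(5, 6, size=(512, 64)) # weight matrix value (all 5s)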
print("W_q")
print(w_q.shape)
print("W_k")
print(w_k.shape)
print("W_v")
print(w_v.shape)

Step 2: Create Query, Key Weights of 64 and Value Weights of 64 dimensions
W_q
(512, 64)
W_k
(512, 64)
W_v
(512, 64)


# Step 3: Matrix multiplication to obtain Q, K, and V

As detailed in the book [Transformers for NLP by Denis Rothman](https://www.packtpub.com/en-us/product/transformers-for-natural-language-processing-second-edition-9781803247335?type=print&gad_source=1&gclid=Cj0KCQiA5-uuBhDzARIsAAa21T_Vro5RuNsURN8t_PUHd6aybZm0mi2BApbIxHRjw-qbjQxcrWbvRL8aApJMEALw_wcB):

> We will now multiply the input vectors by the weight matrices to obtain a query, key, and value vector for each input. In this model, *we will assume that there is one $W_Q$, $W_K$, and $W_V$ matrix for all inputs. Other approaches are possible.*



In [139]:
print("Step 3: Matrix multiplication to obtain Q,K,V")
print("Query: x * w_query")
Q=np.matmul(x, w_q)
print(Q.shape)
print("Key: x * w_key")
K=np.matmul(x, w_k)
print(K.shape)
print("Value: x * w_value")
V=np.matmul(x, w_v)
print(V.shape)

Step 3: Matrix multiplication to obtain Q,K,V
Query: x * w_query
(3, 64)
Key: x * w_key
(3, 64)
Value: x * w_value
(3, 64)
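Since one weight matrix is shared by all inputs, a single matrix product projects every word at once. The following sketch, with hypothetical stand-in arrays rather than the notebook's `x` and weights, checks that the batched product matches projecting one word at a time.

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = rng.integers(1, 10, size=(3, 512))    # stand-in for x
weights = rng.integers(1, 10, size=(512, 64))  # stand-in for one of W_Q, W_K, W_V

projected = tokens @ weights                   # all tokens at once, shape (3, 64)
row_by_row = np.stack([tok @ weights for tok in tokens])  # one token at a time

print(projected.shape)
print(np.array_equal(projected, row_by_row))
```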


# Step 4: Scaled Attention Scores

As detailed in the book [Transformers for NLP by Denis Rothman](https://www.packtpub.com/en-us/product/transformers-for-natural-language-processing-second-edition-9781803247335?type=print&gad_source=1&gclid=Cj0KCQiA5-uuBhDzARIsAAa21T_Vro5RuNsURN8t_PUHd6aybZm0mi2BApbIxHRjw-qbjQxcrWbvRL8aApJMEALw_wcB):

> The attention head now implements the original Transformer equation: $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

> *Step 4* focuses on the first portion (i.e., without softmax): $\frac{QK^{T}}{\sqrt{d_k}}$








In [140]:
print("Step 4: Scaled Attention Scores")
sqrt_dk = np.sqrt(K.shape[-1]) # square root of d_k = 64, i.e. 8
attention_scores = (Q @ K.T) / sqrt_dk # scaled dot products, shape (3, 3)
print(attention_scores.shape)
print(f"scaled attention score for word: I {attention_scores[0]}")
print(f"scaled attention score for word: love {attention_scores[1]}")
print(f"scaled attention score for word: you {attention_scores[2]}")

Step 4: Scaled Attention Scores
(3, 3)
scaled attention score for word: I [1.49520384e+08 1.49939712e+08 1.56768768e+08]
scaled attention score for word: love [1.49939712e+08 1.50360216e+08 1.57208424e+08]
scaled attention score for word: you [1.56768768e+08 1.57208424e+08 1.64368536e+08]
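The scores above are on the order of $10^8$ because the toy inputs and weights are unnormalized positive integers, which dividing by 8 cannot compensate for. The original paper motivates the $\sqrt{d_k}$ factor for a different setting: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$. A rough numerical sketch of that argument, using hypothetical normally distributed vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 64
n = 10_000
q = rng.normal(size=(n, d_k))   # unit-variance query components
k = rng.normal(size=(n, d_k))   # unit-variance key components

raw = np.sum(q * k, axis=1)     # n independent dot products q . k
scaled = raw / np.sqrt(d_k)     # the scaling used in Step 4

print(f"variance of raw dot products:    {raw.var():.1f}")
print(f"variance of scaled dot products: {scaled.var():.2f}")
```

Under these assumptions the raw variance should land near 64 and the scaled variance near 1, which keeps the softmax away from its saturated regime.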


# Step 5: Scaled Softmax Attention Scores for each Vector

*Step 5* applies the softmax function to each row of the scaled attention scores, so that the attention weights for each word sum to 1.
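For reference, a row-wise softmax can be sketched in plain NumPy. The helper name `softmax_rows` and the toy scores are assumptions for illustration; the behavior is intended to match `scipy.special.softmax(scores, axis=-1)`. Subtracting each row's maximum before exponentiating avoids overflow when scores are very large.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting each row's max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

toy_scores = np.array([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0],
                       [1e8, 1e8 + 1.0, 1e8 + 2.0]])  # hypothetical values
weights = softmax_rows(toy_scores)
print(weights)
print(weights.sum(axis=-1))   # every row sums to 1
```

The first and third rows yield the same weights, since softmax depends only on the differences between the scores within a row.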

In [141]:
print("Step 5: Scaled softmax attention_scores for each vector")
sft_attention = softmax(attention_scores, axis=-1) # row-wise softmax; the default axis=None would normalize over the whole matrix
print(sft_attention.shape)
print(f"softmax attention score for word: I {sft_attention[0]}")
print(f"softmax attention score for word: love {sft_attention[1]}")
print(f"softmax attention score for word: you {sft_attention[2]}")

Step 5: Scaled softmax attention_scores for each vector
(3, 3)
softmax attention score for word: I [0. 0. 1.]
softmax attention score for word: love [0. 0. 1.]
softmax attention score for word: you [0. 0. 1.]


# Step 6: The Context Vector for One Word

*Step 6* multiplies each value vector in V by its softmax attention weight and sums the results, producing the output/context vector (of dimension $d_v = 64$) for one word (e.g., "I") for one attention head. Here the weights for "I" are [0, 0, 1], so its context vector is simply the value vector of "you".

In [142]:
print("Step 6: Context vector:")
print(f"Value shape: {V.shape}")
print(f"Softmax attention shape: {sft_attention.shape}")
context_i = sft_attention[0][0]*V[0] + sft_attention[0][1]*V[1] + sft_attention[0][2]*V[2] # weighted sum of the value vectors
print(f"Context vector for word: I {context_i}")
print(f"Context vector shape: {context_i.shape}")

Step 6: Context vector:
Value shape: (3, 64)
Softmax attention shape: (3, 3)
Context vector for word: I [13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085. 13085.
 13085. 13085. 13085. 13085.]
Context vector shape: (64,)
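The explicit weighted sum for one word is the same computation as one row of the matrix product of the attention weights with V. A small sketch with hypothetical weights and a hypothetical 3 × 4 value matrix (not the notebook's arrays):

```python
import numpy as np

attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])                  # hypothetical weights, rows sum to 1
values = np.arange(12, dtype=float).reshape(3, 4)   # hypothetical V with d_v = 4

# explicit weighted sum for the first word
context_0 = attn[0, 0] * values[0] + attn[0, 1] * values[1] + attn[0, 2] * values[2]
# the same computation for every word at once
contexts = attn @ values

print(context_0)
print(np.allclose(contexts[0], context_0))
```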


# Step 7: Repeat steps 1 to 6 for all other words/inputs
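Rather than repeating the steps word by word, the whole head can be written as one function that returns a context vector for every input. This is a minimal sketch under assumptions: the function and helper names are hypothetical, and the stand-in embeddings and weights are small random values rather than the integers used earlier.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    """Steps 3 to 6 for all rows of x: returns one context vector per word."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scale by sqrt(d_k)
    weights = softmax_rows(scores)            # each row sums to 1
    return weights @ v

rng = np.random.default_rng(3)
x_demo = rng.normal(size=(3, 512))                     # stand-in embeddings
w_q_demo = rng.normal(size=(512, 64)) / np.sqrt(512)   # hypothetical small weights
w_k_demo = rng.normal(size=(512, 64)) / np.sqrt(512)
w_v_demo = rng.normal(size=(512, 64)) / np.sqrt(512)

contexts = single_head_attention(x_demo, w_q_demo, w_k_demo, w_v_demo)
print(contexts.shape)   # one 64-dimensional context vector per word
```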

# Step 8: Output all of the heads of the attention sublayer
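In the original paper, the outputs of the eight heads are concatenated and multiplied by an output projection $W^O$, which maps the result back to $d_{model}=512$ dimensions. The notebook does not implement this step, so the following is only a sketch under assumptions: the function names, the random weights, and the per-head loop are illustrative choices.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) tuples, one per head; w_o: output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax_rows(q @ k.T / np.sqrt(k.shape[-1]))
        outputs.append(weights @ v)               # (seq_len, d_v) per head
    concat = np.concatenate(outputs, axis=-1)     # (seq_len, num_heads * d_v)
    return concat @ w_o                           # back to (seq_len, d_model)

rng = np.random.default_rng(4)
d_model, num_heads = 512, 8
d_k = d_model // num_heads                        # 64 dimensions per head

x_demo = rng.normal(size=(3, d_model))            # stand-in embeddings
heads = [
    tuple(rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
    for _ in range(num_heads)
]
w_o = rng.normal(size=(num_heads * d_k, d_model)) / np.sqrt(d_model)

out = multi_head_attention(x_demo, heads, w_o)
print(out.shape)
```

Production implementations usually compute all heads with one batched matrix product instead of a Python loop; the loop is kept here to mirror the step-by-step walkthrough.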