In [1]:
import numpy as np

## Inputs 

$$
\begin{align*}
X & = \left[\begin{array}{c}
- - x_1 - - \\
- - x_2 - - \\
\vdots \\
- - x_n - -
\end{array}\right]_{n\times d}
= \left[\begin{array}{cccc}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}\right]_{n\times d}
\end{align*}
$$

Where - 
- $n$ - number of tokens in the input sequence
- $d$ - embedding dimension of the tokens

In [10]:
# Let there be n = 3 input tokens to the model, each with embedding dimension d = 4
# The input tokens are represented as a 3x4 matrix (one row per token)
n = 3
d = 4
input_tokens = np.random.rand(n, d)
print("Input tokens: \n", input_tokens)
print("Shape of input tokens: ", input_tokens.shape)

# each row of the input tokens is a vector of length 4 representing the embedding of the token

Input tokens: 
 [[0.47403009 0.32876477 0.20495151 0.85971434]
 [0.80437388 0.22153859 0.88344645 0.47825417]
 [0.18688316 0.1481623  0.90946148 0.42144322]]
Shape of input tokens:  (3, 4)


## Projection Matrices for Query, Key and Value 

$$\begin{align*}
W_q = W_k & = \left[\begin{array}{cccc}
wq_{11} & wq_{12} & \cdots & wq_{1d_k} \\
wq_{21} & wq_{22} & \cdots & wq_{2d_k} \\
\vdots & \vdots & \ddots & \vdots \\
wq_{d1} & wq_{d2} & \cdots & wq_{dd_k}
\end{array}\right]_{d \times d_k} \\
W_v & = \left[\begin{array}{cccc}
wv_{11} & wv_{12} & \cdots & wv_{1d_v} \\
wv_{21} & wv_{22} & \cdots & wv_{2d_v} \\
\vdots & \vdots & \ddots & \vdots \\
wv_{d1} & wv_{d2} & \cdots & wv_{dd_v}
\end{array}\right]_{d \times d_v}
\end{align*}
$$

Where - 
- $d$ - embedding dimension of the tokens
- $d_k$ - embedding dimension of the query and key vectors
- $d_v$ - embedding dimension of the value vectors

In [12]:
# Let's assume that the query and key vectors are of length 6
# The value vector is of length 5
dk = 6
dv = 5
# the weights for query and key vectors are represented as a 4x6 matrix
weights_query_key = np.random.rand(d, dk)
print("Weights for query and key projections vectors: \n", weights_query_key)
print("Shape of weights for query and key  projection vectors: ", weights_query_key.shape)

# the weights for the value vectors are represented as a 4x5 matrix
weights_value = np.random.rand(d, dv)
print("Weights for value  projection vector: \n", weights_value)
print("Shape of weights for value projection vector: ", weights_value.shape)


Weights for query and key projections vectors: 
 [[0.85457913 0.72805367 0.52885905 0.81602111 0.11114425 0.16665275]
 [0.99075152 0.43915368 0.09446376 0.81835108 0.449025   0.76979672]
 [0.45029968 0.60978598 0.99083217 0.20000659 0.37349433 0.5733803 ]
 [0.68608398 0.72666931 0.09941451 0.34274698 0.34492009 0.53264535]]
Shape of weights for query and key  projection vectors:  (4, 6)
Weights for value  projection vector: 
 [[0.36952055 0.22762371 0.03909977 0.43702857 0.80597927]
 [0.94430039 0.39320212 0.00445702 0.11603836 0.68048885]
 [0.30011805 0.76770143 0.00765135 0.02766898 0.96769012]
 [0.11024414 0.62303738 0.50907588 0.35974711 0.28597405]]
Shape of weights for value projection vector:  (4, 5)


## Create the Query, Key and Value matrices

We project the input tokens to query, key and value matrices using the projections defined in the previous step

$$\begin{align*}
Q_{n \times d_k} & = X_{n \times d}W_{q\ d \times d_k} \\
K_{n \times d_k} & = X_{n \times d}W_{k\ d \times d_k} \\
V_{n \times d_v} & = X_{n \times d}W_{v\ d \times d_v} \\ 
\end{align*}
$$

In [13]:
# To compute the query, key and value vectors for each token, we take the matrix product of the input tokens with the projection weights

# query vectors
query_vectors = np.dot(input_tokens, weights_query_key)
print("Query vectors: \n", query_vectors)
print("Shape of query vectors: ", query_vectors.shape)

# key vectors (this notebook reuses the query weights, so the keys equal the queries)
key_vectors = np.dot(input_tokens, weights_query_key)
print("Key vectors: \n", key_vectors)
print("Shape of key vectors: ", key_vectors.shape)

# value vectors
value_vectors = np.dot(input_tokens, weights_value)
print("Value vectors: \n", value_vectors)
print("Shape of value vectors: ", value_vectors.shape)


Query vectors: 
 [[1.41294625 1.23920219 0.57029209 0.99151971 0.57339029 0.90751846]
 [1.63282901 1.56916273 1.36922034 1.17829769 0.68379961 1.06588145]
 [1.00517413 1.06195371 1.05585209 0.60009606 0.57234251 0.89114652]]
Shape of query vectors:  (3, 6)
Key vectors: 
 [[1.41294625 1.23920219 0.57029209 0.99151971 0.57339029 0.90751846]
 [1.63282901 1.56916273 1.36922034 1.17829769 0.68379961 1.06588145]
 [1.00517413 1.06195371 1.05585209 0.60009606 0.57234251 0.89114652]]
Shape of key vectors:  (3, 6)
Value vectors: 
 [[0.64190468 0.93014723 0.45922777 0.56026456 1.04996474]
 [0.8242946  1.24639735 0.28266546 0.57373595 1.7907339 ]
 [0.52837434 1.06156653 0.22947264 0.27564264 1.25204546]]
Shape of value vectors:  (3, 5)


## Calculate the Attention Weights

Step 1 - Calculate the similarity scores between the Query and Key vectors
$$
S_{n \times n} = QK^T
$$
Where $s_{ij}$ = dot product similarity between query $q_i$ and key $k_j$ 


Step 2 - Scale the similarity scores by the square root of the query/key dimension $d_k$
$$
S'_{n \times n} = \frac{S}{\sqrt{d_k}}
$$

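Step 3 - Apply the softmax function to each row of $S'$ so that the attention weights in every row are positive and sum to 1
$$\begin{align*}
a_{ij} & = \frac{e^{s'_{ij}}}{\sum_{k=1}^{n} e^{s'_{ik}}} \\
A_{n \times n} & = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
\end{align*}
$$
So that $\sum_j a_{ij} = 1$ for every row $i$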
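
The three steps above can be sketched as a single helper. This is a minimal sketch rather than the notebook's own code: the function name `scaled_softmax` and the fixed seed are assumptions for illustration. The max-subtraction is a common numerical-stability trick added here. It does not change the result, because softmax is unaffected by adding a constant to every entry of a row.

```python
import numpy as np

def scaled_softmax(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d_k) (hypothetical helper name)."""
    d_k = K.shape[1]
    # Steps 1 and 2: similarity scores, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract each row's maximum before exponentiating to avoid overflow;
    # the softmax output is unchanged by this shift.
    scores = scores - scores.max(axis=1, keepdims=True)
    # Step 3: exponentiate and normalize each row
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
Q = rng.random((3, 6))
K = rng.random((3, 6))
A = scaled_softmax(Q, K)
print(A.sum(axis=1))  # every entry equals 1 up to floating-point error
```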

In [8]:
# The dot product of the query and key vectors gives us the similarity between the query and key vectors
similarity_scores = np.dot(query_vectors, key_vectors.T)
print("Similarity scores: \n", similarity_scores)
print("Shape of similarity scores: ", similarity_scores.shape)

# The similarity scores are scaled by the square root of the dimension of the key vectors (dk)
scaled_similarity_scores = similarity_scores / np.sqrt(dk)

# The scaled similarity scores are passed through a softmax function to get the attention weights
attention_weights = np.exp(scaled_similarity_scores) / np.sum(np.exp(scaled_similarity_scores), axis=1, keepdims=True)
print("Attention weights: \n", attention_weights)
print("Shape of attention weights: ", attention_weights.shape)


Similarity scores: 
 [[5.06798984 3.09132164 3.47594607]
 [3.09132164 2.35205625 2.25159346]
 [3.47594607 2.25159346 2.57544933]]
Shape of similarity scores:  (3, 3)
Attention weights: 
 [[0.50805787 0.22669918 0.26524295]
 [0.40828812 0.30192217 0.28978971]
 [0.43497103 0.26386552 0.30116346]]
Shape of attention weights:  (3, 3)
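
The printed similarity matrix is symmetric. That follows from this notebook's choice of one weight matrix for both queries and keys, which makes $Q = K$ and $S = QQ^T$. The standard attention formulation uses separate matrices $W_q$ and $W_k$, in which case the scores are generally not symmetric. The sketch below contrasts the two cases; the variable names and the seed are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n, d, dk = 3, 4, 6
X = rng.random((n, d))

# Shared projection (as in this notebook): queries and keys are identical
W_shared = rng.random((d, dk))
Q_shared = X @ W_shared
S_shared = Q_shared @ Q_shared.T

# Separate projections (hypothetical W_q and W_k)
W_q = rng.random((d, dk))
W_k = rng.random((d, dk))
S_separate = (X @ W_q) @ (X @ W_k).T

# Compare each score matrix with its transpose
print("shared weights symmetric:  ", np.allclose(S_shared, S_shared.T))
print("separate weights symmetric:", np.allclose(S_separate, S_separate.T))
```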


## Computing Contextualized Embeddings Using the Attention Weights

$$\begin{align*}
X'_{n \times d_v} & = A_{n \times n}V_{n \times d_v} \\
& = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{align*}
$$

$$\begin{align*}
X'_{n \times d_v} = \left[\begin{array}{c}
- - x'_1 - - \\
- - x'_2 - - \\
\vdots \\
- - x'_n - -
\end{array}\right]_{n \times d_v} = \left[\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{array}\right]_{n \times n} \times
\left[\begin{array}{cccc}
v_{11} & v_{12} & \cdots & v_{1d_v} \\
v_{21} & v_{22} & \cdots & v_{2d_v} \\
\vdots & \vdots & \ddots & \vdots \\
v_{n1} & v_{n2} & \cdots & v_{nd_v}
\end{array}\right]_{n \times d_v}
\end{align*}
$$

Where - 
$$
\Sigma_j{a_{ij}} = 1
$$

Each output row is a weighted sum of the value vectors, $x'_i = \sum_j a_{ij} v_j$. For instance, if $a_{12}$ is high (i.e. query 1 is highly similar to key 2), then $x'_1$ is influenced more strongly by the value vector $v_2$.
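
To make the weighted-sum reading concrete, the sketch below uses small hand-picked numbers (illustrative values, not taken from the notebook's output) in which $a_{12}$ dominates the first row of $A$. It computes $x'_1$ both through the matrix product and as an explicit weighted sum, and checks which value vector $x'_1$ lies closest to.

```python
import numpy as np

# Illustrative attention weights: row 1 puts most of its weight on token 2
A = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.2, 0.6]])

# Illustrative value vectors, one row per token
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

X_prime = A @ V

# Row 1 written out as a weighted sum of the value vectors
x1_manual = A[0, 0] * V[0] + A[0, 1] * V[1] + A[0, 2] * V[2]
print(X_prime[0], x1_manual)  # both are approximately [0.2, 0.9]

# Euclidean distance from x'_1 to each value vector; the smallest is to v_2
distances = np.linalg.norm(V - X_prime[0], axis=1)
closest = int(np.argmin(distances))  # zero-based index, so 1 means v_2
print(closest)
```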

In [9]:
# The attention weights are multiplied with the value vectors to get the context vectors
context_vectors = np.dot(attention_weights, value_vectors)
print("Context vectors: \n", context_vectors)
print("Shape of context vectors: ", context_vectors.shape)

Context vectors: 
 [[1.35183866 1.20284683 0.68933144 1.02996614 1.47134148]
 [1.30738598 1.16640691 0.69458305 1.00196277 1.37074831]
 [1.32670181 1.17789289 0.69657226 1.0192344  1.40538209]]
Shape of context vectors:  (3, 5)
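
Putting the steps together, the whole computation can be wrapped in one function. This is a minimal single-head sketch under the notebook's setup (no masking, no batching, random rather than trained weights). The function name `self_attention` and its parameters are hypothetical. It takes separate `W_q` and `W_k` arguments, so passing the same matrix twice reproduces the shared-weights case used above.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (hypothetical helper)."""
    # Project the inputs to queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Scaled similarity scores
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax, shifted by the row maximum for numerical stability
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)
    # Context vectors: attention-weighted sums of the value vectors
    return A @ V, A

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n, d, dk, dv = 3, 4, 6, 5
X = rng.random((n, d))
W_qk = rng.random((d, dk))   # shared query/key weights, as in this notebook
W_v = rng.random((d, dv))

context, A = self_attention(X, W_qk, W_qk, W_v)
print(context.shape)  # (3, 5)
print(A.shape)        # (3, 3)
```

Because each row of `A` is a set of positive weights summing to 1, every context vector is a convex combination of the value vectors and stays within their column-wise minimum and maximum.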
