<a id="toc"></a>
## Table of Contents
- [Transformers](#transformers)
  - [Self Attention](#self-attention)
    - [Self Attention Step 1](#self-attention-step-01)
      - [Exercise 1](#exercise-01)
    - [Self Attention Step 2](#self-attention-step-02)
      - [Exercise 2](#exercise-02)
    - [Self Attention Step 3](#self-attention-step-03)
      - [Exercise 3](#exercise-03)
      - [Exercise 4](#exercise-04)
    - [Self Attention Step 4](#self-attention-step-04)
      - [Exercise 5](#exercise-05)
    - [Self Attention All Steps](#self-attention-all-steps)
      - [Exercise 6](#exercise-06)
      - [Exercise 7](#exercise-07)
  - [Multihead Self Attention](#multihead-self-attention)
    - [Exercise 8](#exercise-08)

<a id="transformers"></a>
## Transformers

Since its invention in 2017, the transformer architecture has remained the dominant model for nearly all NLP tasks. The core idea behind the transformer is the self-attention mechanism, which is the focus of this notebook.

NLP transformer models first convert text data into vectors, which are then fed into the model for further processing. This is illustrated in the figure below.

![Preprocessing](../images/transformer_positional_encoding_vectors.png)

Next, the vectors are fed into the decoder layer, which consists mainly of a self-attention layer and a feed-forward layer.

![Decoder Block](../images/decoder_with_tensors_2.png)

<a id="self-attention"></a>
## Self-Attention

The _self-attention_ block within the transformer architecture takes $N$ inputs, ${\boldsymbol x_{1}},\, {\boldsymbol x_{2}}, \dots, {\boldsymbol x_{N}}$, each of dimension $D \times 1$, and returns $N$ output vectors. The operation proceeds step by step as follows:

<a id="self-attention-step-01"></a>
### Self-Attention: Step 1 
For each input word ${\boldsymbol x}$, we create a _query_ vector, ${\boldsymbol q}$, a _key_ vector, ${\boldsymbol k}$ and a _value_ vector, ${\boldsymbol v}$. We'll also need the weight _matrices_ and bias _vectors_ to compute the _queries_, _keys_, and _values_. In particular, 
- the weight and bias for computing the _queries_, denoted by ${\boldsymbol W_{q}}$ and ${\boldsymbol b_{q}}$,
- the weight and bias for computing the _keys_, denoted by ${\boldsymbol W_{k}}$ and ${\boldsymbol b_{k}}$, and 
- the weight and bias for computing the _values_, denoted by ${\boldsymbol W_{v}}$ and ${\boldsymbol b_{v}}$
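
As a minimal sketch of the affine maps listed above, the snippet below computes a single query from hand-picked numbers. The values of `x`, `W_q`, and `b_q` are illustrative assumptions, not the notebook's random parameters; keys and values follow the same pattern with their own weights and biases.

```python
import torch

# Illustrative (hypothetical) input and parameters with D = 2
x = torch.tensor([[1.0], [2.0]])        # D x 1 input vector
W_q = torch.tensor([[1.0, 0.0],
                    [0.0, 2.0]])        # D x D query weight matrix
b_q = torch.tensor([[0.5], [-1.0]])     # D x 1 query bias vector

# Query: q = b_q + W_q x
q = b_q + torch.matmul(W_q, x)

print(q.shape)  # torch.Size([2, 1])
print(q)        # entries 1.5 and 3.0
```

The output keeps the $D \times 1$ shape of the input, so queries, keys, and values can be compared with dot products in the next step.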

As an exercise, you will now implement the computation of the _queries_, _keys_, and _values_ in Python using the [torch](https://pytorch.org/) library. In the next two cells, we generate the input vectors, ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, and ${\boldsymbol x_{3}}$, each of dimension $D \times 1$, with $D=4$, and the weight matrices and bias vectors ${\boldsymbol W_{q}}$ and ${\boldsymbol b_{q}}$ for the queries, ${\boldsymbol W_{k}}$ and ${\boldsymbol b_{k}}$ for the keys, and ${\boldsymbol W_{v}}$ and ${\boldsymbol b_{v}}$ for the values. Your exercise is to use these to compute the _queries_, _keys_, and _values_.

In [1]:
import torch

# Set seed so we get the same random numbers
torch.manual_seed(3)

# Number of inputs
N = 3

# Number of dimensions of each input
D = 4

# Create an empty list
all_x = []
# Create elements x_n and append to list
for n in range(N):
    # Creates a tensor of shape (D, 1) with values drawn from the standard normal distribution, 
    # with a mean of 0 and a standard deviation of 1 
    x = torch.randn(size=(D, 1))  # D x 1 tensor
    print(f"The vector x_{n+1} is: \n {x} \n")
    all_x.append(x)  # Append x to list

The vector x_1 is: 
 tensor([[ 0.8033],
        [ 0.1748],
        [ 0.0890],
        [-0.6137]]) 

The vector x_2 is: 
 tensor([[ 0.0462],
        [-1.3683],
        [ 0.3375],
        [ 1.0111]]) 

The vector x_3 is: 
 tensor([[-1.4352],
        [ 0.9774],
        [ 0.5220],
        [ 1.2379]]) 



After executing the cell above, the list object, `all_x` now has $N$ vectors, each of dimension $D \times 1$.

In [2]:
# Set seed so we get the same random numbers
torch.manual_seed(0)

# Choose random values for the parameters

# weight matrices
W_q = torch.randn(size=(D, D))
W_k = torch.randn(size=(D, D))
W_v = torch.randn(size=(D, D))

# bias terms
b_q = torch.randn(size=(D, 1))
b_k = torch.randn(size=(D, 1))
b_v = torch.randn(size=(D, 1))

<a id="exercise-01"></a>
#### Exercise 1
Given the vectors ${\boldsymbol x}_{1}$, ${\boldsymbol x}_{2}$, and ${\boldsymbol x}_{3}$ in the list object **`all_x`**, the weight matrices ${\boldsymbol W_{q}}$, ${\boldsymbol W_{k}}$, and ${\boldsymbol W_{v}}$, and the bias vectors ${\boldsymbol b_{q}}$, ${\boldsymbol b_{k}}$, and ${\boldsymbol b_{v}}$, compute the query vectors ${\boldsymbol q_{1}}$, ${\boldsymbol q_{2}}$, ${\boldsymbol q_{3}}$, the key vectors ${\boldsymbol k_{1}}$, ${\boldsymbol k_{2}}$, ${\boldsymbol k_{3}}$, and the value vectors ${\boldsymbol v_{1}}$, ${\boldsymbol v_{2}}$, ${\boldsymbol v_{3}}$.

Hint: 
- ${\boldsymbol q} = {\boldsymbol b_{q}} + {\boldsymbol W_{q}}\, {\boldsymbol x}$
- ${\boldsymbol k} = {\boldsymbol b_{k}} + {\boldsymbol W_{k}}\, {\boldsymbol x}$
- ${\boldsymbol v} = {\boldsymbol b_{v}} + {\boldsymbol W_{v}}\, {\boldsymbol x}$

You may want to consider using the following function:
- [torch.matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html)



In [4]:
# Make three lists to store queries, keys, and values
all_queries = []
all_keys = []
all_values = []

# For every input
for x in all_x:
    # Compute the keys, queries, and values using PyTorch operations
    query =  torch.matmul(W_q, x) + b_q
    key = torch.matmul(W_k, x) + b_k
    value = torch.matmul(W_v, x) + b_v

    # Append the results to the lists
    all_queries.append(query)
    all_keys.append(key)
    all_values.append(value)

![Self Attention Mechanism Step 1](../images/self_attention_step_01.png)

<a id="self-attention-step-02"></a>
### Self Attention: Step 2

The second step in calculating self-attention is to calculate a score. Consider the figure above, with input words _"Thinking Machine"_. Suppose we want to calculate the _self-attention_ for the first word, _"Thinking"_. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the **dot product** of the _query_ vector of the word being processed with the _key_ vector of each word we are scoring against. So if we are processing the self-attention for the word in the first position, the first score is the **dot product** of ${\boldsymbol q_{1}}$ and ${\boldsymbol k_{1}}$, and the second score is the dot product of ${\boldsymbol q_{1}}$ and ${\boldsymbol k_{2}}$, as shown in the figure below:

![Self Attention Mechanism Step 2](../images/self_attention_step_02.png)
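
To make the scoring step concrete, here is a small sketch with made-up query and key vectors (the numbers are assumptions chosen so that the arithmetic is easy to follow by hand). It scores the query at position 1 against the key at every position.

```python
import torch

# Hypothetical query for position 1 and keys for positions 1..3 (each D x 1, D = 2)
q_1 = torch.tensor([[1.0], [2.0]])
keys = [
    torch.tensor([[1.0], [0.0]]),
    torch.tensor([[0.0], [1.0]]),
    torch.tensor([[1.0], [1.0]]),
]

# The score of position 1 against position m is the dot product q_1 . k_m
scores = [torch.matmul(q_1.T, k).squeeze() for k in keys]

print([s.item() for s in scores])  # [1.0, 2.0, 3.0]
```

The third key points most nearly in the same direction as the query, so it receives the largest score.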

<a id="exercise-02"> </a>
#### Exercise 2

Compute the self-attention scores of the input vectors ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, ${\boldsymbol x_{3}}$, using their query vectors ${\boldsymbol q_{1}}$, ${\boldsymbol q_{2}}$, ${\boldsymbol q_{3}}$ and key vectors: ${\boldsymbol k_{1}}$, ${\boldsymbol k_{2}}$, ${\boldsymbol k_{3}}$.

In [12]:
import math

all_attention_scores = []
for query in all_queries:
    query_keys_attention_scores = []
    for key in all_keys:
        # Compute the dot product of the query and key, then scale it by
        # dividing by the square root of D to keep the softmax inputs in a moderate range
        dot_product = torch.matmul(query.T, key).squeeze() 
        scaled_dot_product = dot_product / math.sqrt(D) 

        # Append the result to the list
        query_keys_attention_scores.append(scaled_dot_product)
    all_attention_scores.append(query_keys_attention_scores)
    
all_attention_scores

[[tensor(-2.8869), tensor(3.6531), tensor(-1.1920)],
 [tensor(0.0760), tensor(4.2115), tensor(4.1211)],
 [tensor(0.8152), tensor(-6.1015), tensor(1.6231)]]

<a id="self-attention-step-03"></a>
### Self Attention: Step 3

Next, we will normalise the scores in the list object `all_attention_scores` so that, for each query, they are all positive and add up to $1$. We can achieve this using the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), $\sigma: \mathbb{R}^{m} \to (0, 1)^{m}$, defined as follows:

Given a vector ${\boldsymbol z} = (z_{1},\, z_{2},\,\dots,\, z_{m}) \in \mathbb{R}^{m}$,

$$\sigma({\boldsymbol z})_{i} = \dfrac{e^{z_{i}}}{\sum_{j=1}^{m} e^{z_{j}}}$$
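
As a quick worked example of this definition, the sketch below evaluates the softmax of an arbitrary vector $(1, 2, 3)$ in two ways: directly from the formula, and after subtracting the maximum entry first. The shift is a common trick that leaves the result unchanged while avoiding overflow for large scores; the input vector is an assumption chosen for illustration.

```python
import torch

z = torch.tensor([1.0, 2.0, 3.0])

# Direct application of the definition
weights = torch.exp(z) / torch.sum(torch.exp(z))

# Subtracting max(z) multiplies numerator and denominator by the same factor,
# so the ratio is unchanged, but the largest exponent becomes exp(0) = 1
shifted = z - torch.max(z)
stable_weights = torch.exp(shifted) / torch.sum(torch.exp(shifted))

print(weights)        # approximately tensor([0.0900, 0.2447, 0.6652])
print(weights.sum())  # sums to 1 up to floating-point rounding
```

Both versions agree with PyTorch's built-in `torch.softmax(z, dim=0)`.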

<a id="exercise-03"></a>
#### Exercise 3
Implement the softmax function.

In [17]:
isinstance(all_queries[0], torch.Tensor)

True

In [18]:
torch.Tensor([1,2,3])

tensor([1., 2., 3.])

In [19]:
def softmax(items_in):
    if not isinstance(items_in, torch.Tensor):
        items_in = torch.Tensor(items_in)
    # Shift the input for numerical stability (optional, but recommended)
    # items_in = items_in - torch.max(items_in)

    # Compute the exponential of the input
    exp_items = torch.exp(items_in)
    
    # Compute the softmax by dividing by the sum of exponentials
    items_out = exp_items / torch.sum(exp_items)

    return items_out


You will now use your `softmax` implementation to compute the self-attention weights by applying the softmax function to the self-attention scores you computed in <a href="#exercise-02">Exercise 2</a>. Note that the self-attention scores for each input vector, ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, ${\boldsymbol x_{3}}$, are contained in the list object `all_attention_scores`.

<a id="exercise-04"></a>
#### Exercise 4
Compute the self-attention weights of the input vectors: ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, and ${\boldsymbol x_{3}}$ using their self-attention scores.

In [20]:
all_attention_weights = []
for idx, attention_scores in enumerate(all_attention_scores):
    attention_weights = softmax(attention_scores)
    print(f"The attention probabilities for input vector x_{idx+1} are: {attention_weights}")
    all_attention_weights.append(attention_weights)

The attention probabilities for input vector x_1 are: tensor([0.0014, 0.9908, 0.0078])
The attention probabilities for input vector x_2 are: tensor([0.0083, 0.5183, 0.4735])
The attention probabilities for input vector x_3 are: tensor([3.0824e-01, 3.0549e-04, 6.9145e-01])


<a id="self-attention-step-04"></a>
### Self Attention: Step 4

The fourth step is to multiply each value vector, i.e., each entry of the list object **`all_values`**, by its corresponding attention weight, i.e., the matching entry of **`all_attention_weights`**. The intuition behind this multiplication is to keep intact the values of the word(s) we want to focus on and to drown out irrelevant words by multiplying them by tiny attention weights such as $0.001$.

After multiplying the attention weights by their corresponding value vectors we sum them up. 
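
The weighted sum can be illustrated with a small sketch. The attention weights and value vectors below are invented for the example (they are not the notebook's values); almost all of the weight sits on the second value vector, so the output ends up close to it.

```python
import torch

# Hypothetical attention weights for one query (positive, summing to 1)
attention_weights = torch.tensor([0.001, 0.990, 0.009])

# Hypothetical value vectors v_1, v_2, v_3, each D x 1 with D = 2
values = [
    torch.tensor([[10.0], [10.0]]),
    torch.tensor([[1.0], [2.0]]),
    torch.tensor([[-5.0], [5.0]]),
]

# Weighted sum: output = sum_m a_m * v_m
output = sum(a * v for a, v in zip(attention_weights, values))

print(output)  # close to v_2 = (1, 2), since v_2 carries weight 0.99
```

Even though ${\boldsymbol v_{1}}$ has much larger entries than ${\boldsymbol v_{2}}$, its weight of $0.001$ makes its contribution to the output negligible.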

<a id="exercise-05"></a>
#### Exercise 5
Use the attention weights you computed in <a href="#exercise-04">Exercise 4</a>, i.e., the contents of **`all_attention_weights`**, and the value vectors, **`all_values`** to compute the weighted sum of all values ${\boldsymbol v_{1}}, {\boldsymbol v_{2}}, {\boldsymbol v_{3}}$.

In [22]:
all_attention_weighted_values = []

# Loop over each set of attention weights
for attention_weights in all_attention_weights:
    attention_weighted_values = []
    
    # For each attention weight and corresponding value
    for attention_weight, value in zip(attention_weights, all_values):
        # Compute the weighted value using element-wise multiplication
        weighted_value = attention_weight * value
        
        attention_weighted_values.append(weighted_value)
    
    # Sum the weighted values across all values
    attention_weight_across_all_values = torch.sum(torch.stack(attention_weighted_values), dim=0)
    
    # Append the final result to the list
    all_attention_weighted_values.append(attention_weight_across_all_values)

all_attention_weighted_values

[tensor([[ 0.2117],
         [ 1.0697],
         [-3.3355],
         [-4.9260]]),
 tensor([[ 0.6486],
         [ 0.9883],
         [-2.4109],
         [-3.0185]]),
 tensor([[ 0.6463],
         [ 0.8405],
         [-1.6421],
         [-0.0805]])]

<a id="self-attention-all-steps"></a>
### Self Attention All Steps
The figure below demonstrates the complete self-attention mechanism.

![Complete Attention Steps](../images/self-attention-output.png)

Now let's put everything together in <a href="#exercise-06">Exercise 6</a> below.

<a id="exercise-06"></a>
#### Exercise 6

Implement the complete self attention mechanism by completing the incomplete code snippet below:

In [26]:
# Create empty list for output
all_x_prime = []

# Assuming N is defined
for n in range(N):
    # Create list for dot products of query n with all keys
    all_km_qn = []

    # Compute the dot products
    for key in all_keys:
        # Compute the dot product of the query and the key and normalize by sqrt(D)
        dot_product = torch.matmul(key.T, all_queries[n]).squeeze() / torch.sqrt(torch.tensor(D, dtype=torch.float32))

        # Store dot product
        all_km_qn.append(dot_product)

    # Convert dot products to a tensor
    all_km_qn = torch.tensor(all_km_qn)

    # Compute softmax over dot products to get attention
    attention = softmax(all_km_qn)

    # Print result (should be positive and sum to one)
    print("Attentions for output ", n)
    print(attention)

    # Compute the weighted sum of all of the values according to the attention
    x_prime = torch.sum(torch.stack([attention[i] * all_values[i] for i in range(len(all_values))]), dim=0)

    # Append the result to the output list
    all_x_prime.append(x_prime)

# Print out true values to check you have it correct
print("\nx_prime_0_calculated:", all_x_prime[0].T)
print("x_prime_0_true:       tensor([[ 0.2117,  1.0697, -3.3355, -4.9260]])\n")
print("x_prime_1_calculated:", all_x_prime[1].T)
print("x_prime_1_true:       tensor([[ 0.6486,  0.9883, -2.4109, -3.0185]])\n")
print("x_prime_2_calculated:", all_x_prime[2].T)
print("x_prime_2_true:       tensor([[ 0.6463,  0.8405, -1.6421, -0.0805]])\n")


Attentions for output  0
tensor([0.0014, 0.9908, 0.0078])
Attentions for output  1
tensor([0.0083, 0.5183, 0.4735])
Attentions for output  2
tensor([3.0824e-01, 3.0549e-04, 6.9145e-01])

x_prime_0_calculated: tensor([[ 0.2117,  1.0697, -3.3355, -4.9260]])
x_prime_0_true:       tensor([[ 0.2117,  1.0697, -3.3355, -4.9260]])

x_prime_1_calculated: tensor([[ 0.6486,  0.9883, -2.4109, -3.0185]])
x_prime_1_true:       tensor([[ 0.6486,  0.9883, -2.4109, -3.0185]])

x_prime_2_calculated: tensor([[ 0.6463,  0.8405, -1.6421, -0.0805]])
x_prime_2_true:       tensor([[ 0.6463,  0.8405, -1.6421, -0.0805]])



As you may have observed, all the computations we have done can be carried out using matrices, as illustrated in the figure below:

![Self Attention Matrix Calculation Step 1](../images/self-attention-matrix-calculation.png)

This completes the first step; the remaining steps are illustrated in the figure below:

![Self Attention Matrix Calculation Step 2](../images/self-attention-matrix-calculation-2.png)


1. Compute the attention scores by multiplying the set of queries packed in matrix $Q$ with the keys in the matrix $K$. If the matrix $Q$ is of size $m \times d_k$, and the matrix $K$ is of size $n \times d_k$, then the resulting matrix will be of size $m \times n$:

$$
QK^\top = \begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}
$$

2. Scale each of the alignment scores by $\frac{1}{\sqrt{d_k}}$:

$$
\frac{QK^\top}{\sqrt{d_k}} = \begin{bmatrix}
\frac{e_{11}}{\sqrt{d_k}} & \frac{e_{12}}{\sqrt{d_k}} & \dots & \frac{e_{1n}}{\sqrt{d_k}} \\
\frac{e_{21}}{\sqrt{d_k}} & \frac{e_{22}}{\sqrt{d_k}} & \dots & \frac{e_{2n}}{\sqrt{d_k}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{e_{m1}}{\sqrt{d_k}} & \frac{e_{m2}}{\sqrt{d_k}} & \dots & \frac{e_{mn}}{\sqrt{d_k}}
\end{bmatrix}
$$

3. Follow the scaling with a softmax operation, applied across each row so that every row sums to $1$, in order to obtain a set of attention weights:

$$
\text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) = \begin{bmatrix}
\text{softmax}\left( \frac{e_{11}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{12}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{1n}}{\sqrt{d_k}} \right) \\
\text{softmax}\left( \frac{e_{21}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{22}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{2n}}{\sqrt{d_k}} \right) \\
\vdots & \vdots & \ddots & \vdots \\
\text{softmax}\left( \frac{e_{m1}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{m2}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{mn}}{\sqrt{d_k}} \right)
\end{bmatrix}
$$

4. Finally, apply the resulting attention weights to the values in matrix $V$, of size $n \times d_v$:

$$
\text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) \cdot V = 
\begin{bmatrix}
\text{softmax}\left( \frac{e_{11}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{12}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{1n}}{\sqrt{d_k}} \right) \\
\text{softmax}\left( \frac{e_{21}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{22}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{2n}}{\sqrt{d_k}} \right) \\
\vdots & \vdots & \ddots & \vdots \\
\text{softmax}\left( \frac{e_{m1}}{\sqrt{d_k}} \right) & \text{softmax}\left( \frac{e_{m2}}{\sqrt{d_k}} \right) & \dots & \text{softmax}\left( \frac{e_{mn}}{\sqrt{d_k}} \right)
\end{bmatrix}
\cdot
\begin{bmatrix}
v_{11} & v_{12} & \dots & v_{1d_v} \\
v_{21} & v_{22} & \dots & v_{2d_v} \\
\vdots & \vdots & \ddots & \vdots \\
v_{n1} & v_{n2} & \dots & v_{nd_v}
\end{bmatrix}
$$
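
The four steps above can be sketched in a few lines of PyTorch. This is an illustrative sketch under stated assumptions: the matrices `Q`, `K`, and `V` are random placeholders with one vector per row, as in the equations, and the sizes are arbitrary. The last lines check that storing the vectors as columns and taking the softmax over columns gives the same result, transposed.

```python
import math
import torch

torch.manual_seed(0)

m, n, d_k, d_v = 3, 3, 4, 4      # illustrative sizes
Q = torch.randn(m, d_k)          # one query per row
K = torch.randn(n, d_k)          # one key per row
V = torch.randn(n, d_v)          # one value per row

# Steps 1-2: attention scores, scaled by 1 / sqrt(d_k)
scaled_scores = torch.matmul(Q, K.T) / math.sqrt(d_k)    # m x n

# Step 3: softmax over each row, so every row of weights sums to 1
weights = torch.softmax(scaled_scores, dim=1)

# Step 4: each output row is a weighted sum of the value rows
output = torch.matmul(weights, V)                         # m x d_v

# Column convention: vectors stored as columns, softmax over each column
weights_cols = torch.softmax(torch.matmul(K, Q.T) / math.sqrt(d_k), dim=0)
output_cols = torch.matmul(V.T, weights_cols)             # d_v x m

print(torch.allclose(output, output_cols.T, atol=1e-5))   # True
```

The two conventions differ only by a transpose, so the choice between them is a matter of how the data is laid out.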


In <a href="#exercise-07">Exercise 7</a> below, you will use matrices to implement the self-attention mechanism, given the input matrix ${\boldsymbol X}$. We provide the helper function `softmax_cols`, which you will need, as well as the input matrix ${\boldsymbol X}$, whose columns are the vectors in **`all_x`**. Because the vectors are stored as columns here, the softmax is taken over each column rather than over each row.

In [27]:
# Copy data into a matrix
X = torch.zeros((D, N))  # Create a tensor of shape (D, N) filled with zeros

# Copy the data from all_x into the matrix
X[:, 0] = all_x[0].squeeze()
X[:, 1] = all_x[1].squeeze()
X[:, 2] = all_x[2].squeeze()

X

tensor([[ 0.8033,  0.0462, -1.4352],
        [ 0.1748, -1.3683,  0.9774],
        [ 0.0890,  0.3375,  0.5220],
        [-0.6137,  1.0111,  1.2379]])

In [28]:
# Define softmax operation that works independently on each column
def softmax_cols(data_in):
    # Exponentiate all of the values
    exp_values = torch.exp(data_in)
    
    # Sum over columns (dim=0 for column-wise summation)
    denom = torch.sum(exp_values, dim=0)
    
    # Replicate denominator to match the input size
    denom = denom.unsqueeze(0).expand_as(data_in)
    
    # Compute softmax
    softmax = exp_values / denom
    
    # Return the result
    return softmax
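
As a hedged cross-check, the sketch below writes out a column-wise softmax by hand, in the same spirit as the helper above, and compares it with PyTorch's built-in `torch.softmax` along `dim=0`. The input matrix is an arbitrary example.

```python
import torch

# Arbitrary example scores: 3 rows, 2 columns
scores = torch.tensor([[1.0, 0.0],
                       [2.0, 0.0],
                       [3.0, 0.0]])

# Column-wise softmax written out by hand; keepdim=True lets the
# denominator broadcast across the rows of each column
exp_scores = torch.exp(scores)
manual = exp_scores / exp_scores.sum(dim=0, keepdim=True)

# Built-in equivalent: normalise along dim=0 so that each column sums to 1
builtin = torch.softmax(scores, dim=0)

print(torch.allclose(manual, builtin))  # True
print(manual.sum(dim=0))                # each column sums to 1
```

The second column contains equal scores, so each of its three entries receives a weight of one third.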


<a id="exercise-07"></a>
#### Exercise 7
Compute the self-attention mechanism in matrix form.

In [37]:
# Define the scaled dot product self-attention function
def scaled_dot_product_self_attention(X, W_v, W_q, W_k, b_v, b_q, b_k):
    
    # 1. Compute queries, keys, and values using matrix multiplication and addition
    queries = torch.matmul(W_q, X) + b_q
    keys = torch.matmul(W_k, X) + b_k
    values = torch.matmul(W_v, X) + b_v

    # 2. Compute dot products of keys and queries (keys.T * queries)
    attention_scores = torch.matmul(keys.T, queries)

    # 3. Scale the dot products by the square root of the dimensionality of the keys
    d_k = torch.tensor(keys.shape[0], dtype=torch.float32)  # dimensionality of the keys
    scaled_attention_scores = attention_scores / torch.sqrt(d_k)

    # 4. Apply softmax to calculate attention weights (column-wise softmax)
    attention_weights = softmax_cols(scaled_attention_scores)

    # 5. Weight the values by attention weights
    X_prime = torch.matmul(values, attention_weights)

    return X_prime


In [38]:
X_prime = scaled_dot_product_self_attention(X, W_v, W_q, W_k, b_v, b_q, b_k)
X_prime

tensor([[ 0.2117,  0.6486,  0.6463],
        [ 1.0697,  0.9883,  0.8405],
        [-3.3355, -2.4109, -1.6421],
        [-4.9260, -3.0185, -0.0805]])

<a id="multihead-self-attention"></a>
## Multihead Self-Attention 

The multihead self-attention mechanism takes $N$ inputs $\mathbf{x}_{n} \in \mathbb{R}^{D}$ and returns $N$ outputs $\mathbf{x}'_{n}\in \mathbb{R}^{D}$. It repeats the self-attention computation a fixed number of times in parallel, once per "head". Each head has its own queries, keys, and values. The figure below illustrates 2-head self-attention, with the two heads shown in the cyan and orange boxes, respectively. The outputs of the heads are vertically concatenated, and another linear transformation, ${\boldsymbol \Omega_{c}}$ (called `W_c` in the code below), is used to recombine them.

![Two Head Self Attention](../images/two-head-self-attention.png)

In [61]:
# Set seed so we get the same random numbers
torch.manual_seed(3)

# Number of inputs
N = 6

# Number of dimensions of each input
D = 8

# Create a tensor with random normal values (mean=0, std=1)
mat_X = torch.randn(D, N)

# Print X
print(mat_X)


tensor([[-0.077,  0.360, -0.782,  0.072,  0.665, -0.287],
        [ 1.621, -1.597, -0.052, -0.306,  0.249, -0.223],
        [ 0.913,  0.204,  0.574,  0.416,  0.262,  0.931],
        [-0.514, -1.652,  1.046,  0.522, -0.167,  0.053],
        [ 0.564,  2.257,  1.869, -1.195,  0.998,  0.459],
        [ 2.436, -0.147, -0.476, -0.293, -0.348,  0.349],
        [ 0.037, -0.068,  0.429, -0.868, -0.271,  0.142],
        [ 0.130,  0.681, -0.958,  0.064,  0.659,  0.819]])


As shown in the figure, we will use $2$ heads. We need weight matrices and bias vectors for the keys, queries, and values of each head. We'll make the queries, keys, and values of dimension $\frac{D}{H} \times N$, as shown in the figure.
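
To see how the shapes fit together, here is a minimal sketch of multihead self-attention written as a loop over heads. It is an illustration under stated assumptions rather than the exercise solution: all parameters are freshly drawn random placeholders, the loop structure and the name `head_outputs` are choices made for this example, and `torch.softmax(..., dim=0)` stands in for a column-wise softmax helper.

```python
import math
import torch

torch.manual_seed(0)

D, N, H = 8, 6, 2      # input dimension, number of inputs, number of heads
H_D = D // H           # per-head dimension of queries, keys, and values

X = torch.randn(D, N)  # inputs stored as columns

head_outputs = []
for head in range(H):
    # Hypothetical random parameters for this head
    W_q, W_k, W_v = (torch.randn(H_D, D) for _ in range(3))
    b_q, b_k, b_v = (torch.randn(H_D, 1) for _ in range(3))

    Q = W_q @ X + b_q  # (D/H) x N
    K = W_k @ X + b_k  # (D/H) x N
    V = W_v @ X + b_v  # (D/H) x N

    # Column convention: column n holds the attention weights for output n
    weights = torch.softmax(K.T @ Q / math.sqrt(H_D), dim=0)  # N x N
    head_outputs.append(V @ weights)                          # (D/H) x N

# Stack the heads vertically, then recombine them with a D x D matrix
W_c = torch.randn(D, D)
X_prime = W_c @ torch.cat(head_outputs, dim=0)

print(X_prime.shape)  # torch.Size([8, 6])
```

Each head works in a $\frac{D}{H}$-dimensional subspace, so concatenating the $H$ head outputs restores the original $D \times N$ shape before the final linear map.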

In [69]:
H = 2
# QKV dimension
H_D = int(D / H)

# Set seed so we get the same random numbers
torch.manual_seed(0)

# Choose random values for the parameters for the first head
W_q1 = torch.randn(size=(H_D, D))
W_k1 = torch.randn(size=(H_D, D))
W_v1 = torch.randn(size=(H_D, D))
b_q1 = torch.randn(size=(H_D, 1))
b_k1 = torch.randn(size=(H_D, 1))
b_v1 = torch.randn(size=(H_D, 1))

# Choose random values for the parameters for the second head
W_q2 = torch.randn(size=(H_D, D))
W_k2 = torch.randn(size=(H_D, D))
W_v2 = torch.randn(size=(H_D, D))
b_q2 = torch.randn(size=(H_D, 1))
b_k2 = torch.randn(size=(H_D, 1))
b_v2 = torch.randn(size=(H_D, 1))

# Choose random values for the parameters
W_c = torch.randn(size=(D, D)) # Linear transformation used to combine the vertically concatenated attention heads.

As before, a helper function `multi_head_softmax_cols` is provided.

In [54]:
# Define softmax operation that works independently on each column
def multi_head_softmax_cols(data_in):
    # Exponentiate all of the values
    exp_values = torch.exp(data_in)
    
    # Sum over columns (dim=0 for column-wise summation)
    denom = torch.sum(exp_values, dim=0, keepdim=True)
    
    # Compute softmax (PyTorch broadcasts the denominator to all rows automatically)
    softmax = exp_values / denom
    
    # Return the result
    return softmax


<a id="exercise-08"></a>
#### Exercise 8

Implement the multihead self-attention mechanism by completing the code snippets below.

In [66]:
# Define the multi-head scaled self-attention mechanism
def multihead_scaled_self_attention(
        X, W_v1, W_q1, W_k1, b_v1, b_q1, b_k1, W_v2, W_q2, W_k2, b_v2, b_q2, b_k2, W_c
    ):
    
    # 1. Compute queries, key, and value for Head 1
    Q1 = torch.matmul(W_q1, X) + b_q1
    K1 = torch.matmul(W_k1, X) + b_k1
    V1 = torch.matmul(W_v1, X) + b_v1

    # 2. Compute queries, key, and value for Head 2
    Q2 = torch.matmul(W_q2, X) + b_q2
    K2 = torch.matmul(W_k2, X) + b_k2
    V2 = torch.matmul(W_v2, X) + b_v2

    # 3. Compute dot products
    dot_products1 = torch.matmul(K1.T, Q1)
    dot_products2 = torch.matmul(K2.T, Q2)

    d_k = torch.tensor(K1.shape[0], dtype=torch.float32)  # dimensionality of the keys (same for both heads)

    # 4. Scale dot products
    scaled_dot_products1 = dot_products1 / torch.sqrt(d_k)
    scaled_dot_products2 = dot_products2 / torch.sqrt(d_k)

    # 5. Apply softmax to calculate attention weights
    attentions1 = multi_head_softmax_cols(scaled_dot_products1)
    attentions2 = multi_head_softmax_cols(scaled_dot_products2)

    # 6. Weight values by attention weights
    head1_output = torch.matmul(V1, attentions1)
    head2_output = torch.matmul(V2, attentions2)

    # 7. Concatenate the outputs of the two heads
    concatenated_output = torch.cat((head1_output, head2_output), dim=0)

    # 8. Apply the final linear transformation
    X_prime = torch.matmul(W_c, concatenated_output)

    return X_prime


In [72]:
X_prime = multihead_scaled_self_attention(
    X=mat_X,
    W_v1=W_v1,
    W_q1=W_q1, 
    W_k1=W_k1,
    b_v1=b_v1,
    b_q1=b_q1,
    b_k1=b_k1,
    W_v2=W_v2,
    W_q2=W_q2,
    W_k2=W_k2,
    b_v2=b_v2,
    b_q2=b_q2,
    b_k2=b_k2,
    W_c=W_c
)

# Set precision for printing
# torch.set_printoptions(precision=3)

# Print out the results
print("Your answer:")
print(X_prime)

print("\nTrue values:")
true_values = torch.tensor([[  7.501,  15.386,  12.121,  23.458,   5.546,  -7.499],
                            [  4.221,   4.875,  -2.205,   4.050,  -4.525,   5.155],
                            [  1.891,   3.035,   3.399,   2.733,   2.958,  -0.824],
                            [  2.621,   2.177,  -4.974,  -0.925,  -1.928,   3.726],
                            [ -0.130,  -0.250,   3.700,   0.948,   9.384,   0.697],
                            [  2.524,   1.555,  -0.789,   2.667,  -0.459,   4.428],
                            [  0.056,  -1.688,  -1.537,  -1.700,   0.391,   4.648],
                            [ -1.352,  -4.136,  -8.878,  -1.003, -12.857,  -4.945]])

print(true_values)

Your answer:
tensor([[  7.501,  15.386,  12.121,  23.458,   5.546,  -7.499],
        [  4.221,   4.875,  -2.205,   4.050,  -4.525,   5.155],
        [  1.891,   3.035,   3.399,   2.733,   2.958,  -0.824],
        [  2.621,   2.177,  -4.974,  -0.925,  -1.928,   3.726],
        [ -0.130,  -0.250,   3.700,   0.948,   9.384,   0.697],
        [  2.524,   1.555,  -0.789,   2.667,  -0.459,   4.428],
        [  0.056,  -1.688,  -1.537,  -1.700,   0.391,   4.648],
        [ -1.352,  -4.136,  -8.878,  -1.003, -12.857,  -4.945]])

True values:
tensor([[  7.501,  15.386,  12.121,  23.458,   5.546,  -7.499],
        [  4.221,   4.875,  -2.205,   4.050,  -4.525,   5.155],
        [  1.891,   3.035,   3.399,   2.733,   2.958,  -0.824],
        [  2.621,   2.177,  -4.974,  -0.925,  -1.928,   3.726],
        [ -0.130,  -0.250,   3.700,   0.948,   9.384,   0.697],
        [  2.524,   1.555,  -0.789,   2.667,  -0.459,   4.428],
        [  0.056,  -1.688,  -1.537,  -1.700,   0.391,   4.648],
        [ -1