<a href="https://colab.research.google.com/github/E1250/udl-ref/blob/main/ch12/12_2_Multihead_Self_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Notebook 12.2: Multihead Self-Attention**

This notebook builds a multihead self-attention mechanism as in figure 12.6.

Work through the cells below, running each cell in turn. In various places you will see the words "TO DO". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.

Contact me at udlbookmail@gmail.com if you find any mistakes or have any suggestions.



In [None]:
import numpy as np
import matplotlib.pyplot as plt

The multihead self-attention mechanism maps $N$ inputs $\mathbf{x}_{n}\in\mathbb{R}^{D}$ to $N$ outputs $\mathbf{x}'_{n}\in \mathbb{R}^{D}$ of the same size. In the code below, the inputs are stored as the columns of a $D\times N$ matrix.



In [None]:
# Set seed so we get the same random numbers
np.random.seed(3)
# Number of inputs
N = 6
# Number of dimensions of each input
D = 8
# Draw the inputs from a standard normal distribution; each column of X is one input
X = np.random.normal(size=(D,N))
# Print X
print(X)

We'll use two heads. Each head needs its own weights and biases for the queries, keys, and values (equations 12.2 and 12.4). As in the figure, we'll make the queries, keys, and values of size $D/H$.

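Equation 12.2
$$
\mathbf{v}_m = \boldsymbol\beta_v + \boldsymbol\Omega_v \mathbf{x}_m
$$

Equation 12.4
$$
\mathbf{q}_n = \boldsymbol\beta_q + \boldsymbol\Omega_q \mathbf{x}_n
$$
$$
\mathbf{k}_m = \boldsymbol\beta_k + \boldsymbol\Omega_k \mathbf{x}_m
$$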
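The equations above act on one input at a time, while the code in this notebook works on the whole matrix of inputs at once. The following is a minimal sketch of why the two agree, assuming the inputs are the columns of `X`. The names `omega_q`, `beta_q`, and the seed are illustrative stand-ins, not the notebook's own parameters.

```python
import numpy as np

rng = np.random.default_rng(0)       # illustrative seed, not the notebook's
D, N, H_D = 8, 6, 4
X = rng.normal(size=(D, N))          # columns are the inputs x_1 ... x_N
omega_q = rng.normal(size=(H_D, D))  # hypothetical stand-in for a query weight matrix
beta_q = rng.normal(size=(H_D, 1))   # hypothetical stand-in for a query bias

# Per-input form of equation 12.4: q_n = beta_q + omega_q x_n
q_columns = [beta_q[:, 0] + omega_q @ X[:, n] for n in range(N)]
Q_loop = np.stack(q_columns, axis=1)  # H_D x N

# Matrix form: the (H_D, 1) bias broadcasts across all N columns
Q = beta_q + omega_q @ X

print(Q.shape)                 # (4, 6)
print(np.allclose(Q, Q_loop))  # True
```

The same pattern applies to the keys and values, so each head produces three matrices of size $D/H \times N$.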

In [None]:
# Number of heads
H = 2
# Dimension of the queries, keys, and values in each head
H_D = D // H

# Set seed so we get the same random numbers
np.random.seed(0)

# Choose random values for the parameters for the first head
omega_q1 = np.random.normal(size=(H_D,D))
omega_k1 = np.random.normal(size=(H_D,D))
omega_v1 = np.random.normal(size=(H_D,D))
beta_q1 = np.random.normal(size=(H_D,1))
beta_k1 = np.random.normal(size=(H_D,1))
beta_v1 = np.random.normal(size=(H_D,1))

# Choose random values for the parameters for the second head
omega_q2 = np.random.normal(size=(H_D,D))
omega_k2 = np.random.normal(size=(H_D,D))
omega_v2 = np.random.normal(size=(H_D,D))
beta_q2 = np.random.normal(size=(H_D,1))
beta_k2 = np.random.normal(size=(H_D,1))
beta_v2 = np.random.normal(size=(H_D,1))

# Choose random values for the parameters of the final linear transform that combines the heads
omega_c = np.random.normal(size=(D,D))

Now let's compute the multihead scaled self-attention.
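As a compact summary, and assuming the column-per-input layout used for `X`, let $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ be the $D/H \times N$ matrices whose columns are the queries, keys, and values of head $h$. With the softmax applied independently to each column, the computation can be written as:

$$
\mathrm{Sa}_h[\mathbf{X}] = \mathbf{V}_h \cdot \mathrm{Softmax}\!\left[\frac{\mathbf{K}_h^{T}\mathbf{Q}_h}{\sqrt{D/H}}\right]
$$
$$
\mathrm{MhSa}[\mathbf{X}] = \boldsymbol\Omega_c \begin{bmatrix}\mathrm{Sa}_1[\mathbf{X}] \\ \mathrm{Sa}_2[\mathbf{X}]\end{bmatrix}
$$

Here $\mathbf{K}_h^{T}\mathbf{Q}_h$ is $N\times N$, each $\mathrm{Sa}_h[\mathbf{X}]$ is $D/H \times N$, and the stacked result is $D\times N$ before $\boldsymbol\Omega_c$ is applied.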

In [None]:
# Define softmax operation that works independently on each column
def softmax_cols(data_in):
  # Exponentiate all of the values
  exp_values = np.exp(data_in)
  # Sum down each column (one total per column)
  denom = np.sum(exp_values, axis=0)
  # Compute softmax (numpy broadcasts denominator to all rows automatically)
  softmax = exp_values / denom
  # return the answer
  return softmax
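
With the small values in this notebook, the softmax above is fine as written, but `np.exp` overflows for large scores. A common remedy, sketched here as an optional variant rather than part of the exercise, is to subtract each column's maximum before exponentiating. The function name `softmax_cols_stable` is hypothetical.

```python
import numpy as np

def softmax_cols_stable(data_in):
    # Subtracting each column's maximum leaves the softmax unchanged
    # but keeps np.exp from overflowing
    shifted = data_in - np.max(data_in, axis=0, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=0, keepdims=True)

# The second column is the first shifted by a large constant
scores = np.array([[1.0, 1000.0],
                   [2.0, 1001.0],
                   [3.0, 1002.0]])
weights = softmax_cols_stable(scores)
print(weights.sum(axis=0))                        # each column sums to 1
print(np.allclose(weights[:, 0], weights[:, 1]))  # True
```

Both columns give the same weights because the softmax depends only on the differences between the entries in a column.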

In [43]:
X.shape, omega_q2.shape, beta_q2.shape

((8, 6), (4, 8), (4, 1))

In [78]:
def multihead_scaled_self_attention(X, omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c):

    # TODO Write the multihead scaled self-attention mechanism.

    # Step 1: Compute query, key, and value matrices for both heads.
    # Each is (H_D, N): one column per input, matching the layout of X
    Q1 = beta_q1 + np.dot(omega_q1, X)
    K1 = beta_k1 + np.dot(omega_k1, X)
    V1 = beta_v1 + np.dot(omega_v1, X)

    Q2 = beta_q2 + np.dot(omega_q2, X)
    K2 = beta_k2 + np.dot(omega_k2, X)
    V2 = beta_v2 + np.dot(omega_v2, X)

    # Step 2: Compute scaled dot-product attention for head 1
    d_k1 = Q1.shape[0]  # Dimensionality of the queries and keys for head 1
    scores1 = np.dot(K1.T, Q1) / np.sqrt(d_k1)  # (N, N); column n holds the scores for query n
    attention_weights1 = softmax_cols(scores1)  # Each column sums to one
    output1 = np.dot(V1, attention_weights1)  # (H_D, N)

    # Step 3: Compute scaled dot-product attention for head 2
    d_k2 = Q2.shape[0]  # Dimensionality of the queries and keys for head 2
    scores2 = np.dot(K2.T, Q2) / np.sqrt(d_k2)  # (N, N)
    attention_weights2 = softmax_cols(scores2)  # Each column sums to one
    output2 = np.dot(V2, attention_weights2)  # (H_D, N)

    # Step 4: Stack the outputs from both heads vertically
    X_prime = np.concatenate([output1, output2], axis=0)  # (D, N)

    # Step 5: Apply final linear transformation with omega_c
    X_prime = np.dot(omega_c, X_prime)  # (D, N)

    return X_prime

In [79]:
# Run the self attention mechanism
X_prime = multihead_scaled_self_attention(X, omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c)

# Print out the results
np.set_printoptions(precision=3)
print("Your answer:")
print(X_prime)

print("True values:")
print("[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]")
print(" [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]")
print(" [  5.479   1.115   9.244   0.453   5.656   7.089]")
print(" [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]")
print(" [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]")
print(" [  3.548  10.036  -2.244   1.604  12.113  -2.557]")
print(" [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]")
print(" [  1.248  18.894  -6.409   3.224  19.717  -5.629]]")

# If your answers don't match, then make sure that you are doing the scaling, and make sure the scaling value is correct

Your answer:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]
True values:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]
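
The two-head function takes each head's parameters as separate arguments, which becomes unwieldy for more heads. The following is a sketch of the computation in figure 12.6 for any number of heads, assuming the parameters are passed as lists with one entry per head. The function name, the list-based arguments, the helper `draw`, and the seed are all hypothetical and not part of the notebook.

```python
import numpy as np

def softmax_cols(data_in):
    # Column-wise softmax, shifted by the column maximum for numerical safety
    exp_values = np.exp(data_in - np.max(data_in, axis=0, keepdims=True))
    return exp_values / np.sum(exp_values, axis=0, keepdims=True)

def multihead_self_attention(X, omegas_q, omegas_k, omegas_v,
                             betas_q, betas_k, betas_v, omega_c):
    # Hypothetical generalisation: each list holds one array per head
    head_outputs = []
    for o_q, o_k, o_v, b_q, b_k, b_v in zip(omegas_q, omegas_k, omegas_v,
                                            betas_q, betas_k, betas_v):
        Q = b_q + o_q @ X  # (H_D, N)
        K = b_k + o_k @ X  # (H_D, N)
        V = b_v + o_v @ X  # (H_D, N)
        # (N, N) attention matrix; column n weights the values for query n
        attention = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))
        head_outputs.append(V @ attention)  # (H_D, N)
    # Stack the heads vertically, then recombine with omega_c
    return omega_c @ np.concatenate(head_outputs, axis=0)

rng = np.random.default_rng(1)  # illustrative seed
D, N, H = 8, 6, 4
H_D = D // H

def draw(shape):
    # One random array per head
    return [rng.normal(size=shape) for _ in range(H)]

X = rng.normal(size=(D, N))
# Order: query, key, value weights, then query, key, value biases
params = [draw((H_D, D)) for _ in range(3)] + [draw((H_D, 1)) for _ in range(3)]
omega_c = rng.normal(size=(D, D))

X_prime = multihead_self_attention(X, *params, omega_c)
print(X_prime.shape)  # (8, 6)
```

The output has the same $D\times N$ shape as the input whatever the number of heads, provided $H$ divides $D$.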

