<a href="https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap12/12_2_Multihead_Self_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 14: Colab Experiment on Multihead Self-Attention**

This notebook builds a multihead self-attention mechanism, as in figure 12.6.

Work through the cells below, running each cell in turn. In various places you will see the words "TO DO". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.




In [19]:
import numpy as np
import matplotlib.pyplot as plt

The multihead self-attention mechanism maps $N$ inputs $\mathbf{x}_{n}\in\mathbb{R}^{D}$ to $N$ outputs $\mathbf{x}'_{n}\in \mathbb{R}^{D}$.



In [20]:
# Set seed so we get the same random numbers
np.random.seed(3)
# Number of inputs
N = 6
# Number of dimensions of each input
D = 8
# Create the D x N input matrix (one column per input) from a standard normal distribution
X = np.random.normal(size=(D,N))
# Print X
print(X)

[[ 1.789  0.437  0.096 -1.863 -0.277 -0.355]
 [-0.083 -0.627 -0.044 -0.477 -1.314  0.885]
 [ 0.881  1.71   0.05  -0.405 -0.545 -1.546]
 [ 0.982 -1.101 -1.185 -0.206  1.486  0.237]
 [-1.024 -0.713  0.625 -0.161 -0.769 -0.23 ]
 [ 0.745  1.976 -1.244 -0.626 -0.804 -2.419]
 [-0.924 -1.024  1.124 -0.132 -1.623  0.647]
 [-0.356 -1.743 -0.597 -0.589 -0.874  0.03 ]]


We'll need the weights and biases for the queries, keys, and values (equations 12.2 and 12.4). We'll use two heads and, as in the figure, make the queries, keys, and values of size D/H.

In [21]:
# Number of heads
H = 2
# Dimension of the queries, keys, and values in each head
H_D = D // H

# Set seed so we get the same random numbers
np.random.seed(0)

# Choose random values for the parameters for the first head
omega_q1 = np.random.normal(size=(H_D,D))
omega_k1 = np.random.normal(size=(H_D,D))
omega_v1 = np.random.normal(size=(H_D,D))
beta_q1 = np.random.normal(size=(H_D,1))
beta_k1 = np.random.normal(size=(H_D,1))
beta_v1 = np.random.normal(size=(H_D,1))

# Choose random values for the parameters for the second head
omega_q2 = np.random.normal(size=(H_D,D))
omega_k2 = np.random.normal(size=(H_D,D))
omega_v2 = np.random.normal(size=(H_D,D))
beta_q2 = np.random.normal(size=(H_D,1))
beta_k2 = np.random.normal(size=(H_D,1))
beta_v2 = np.random.normal(size=(H_D,1))

# Choose random values for the final linear transformation that combines the heads
omega_c = np.random.normal(size=(D,D))
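
As a quick sanity check on these sizes, the sketch below assumes the same `D`, `N`, and `H` as above. The names `rng`, `omega_q`, `beta_q`, `Q`, and `stacked` are illustrative and not part of the notebook, and the random generator is not the notebook's seed. It shows that one head produces an `H_D` x `N` matrix and that stacking `H` such matrices restores `D` rows, which is why `omega_c` is `D` x `D`.

```python
import numpy as np

D, N, H = 8, 6, 2                 # same sizes as in the notebook
H_D = D // H                      # per-head dimension

rng = np.random.default_rng(0)    # illustrative generator, not the notebook's seed
X = rng.normal(size=(D, N))

# Hypothetical query parameters for a single head
omega_q = rng.normal(size=(H_D, D))
beta_q = rng.normal(size=(H_D, 1))

# The (H_D, 1) bias broadcasts across all N columns
Q = omega_q @ X + beta_q
print(Q.shape)                    # (4, 6)

# Stacking H per-head results along the rows restores D rows
stacked = np.concatenate([Q] * H, axis=0)
print(stacked.shape)              # (8, 6)
```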

Now let's compute the multihead scaled self-attention.

In [22]:
# Define softmax operation that works independently on each column
def softmax_cols(data_in):
  # Exponentiate all of the values
  exp_values = np.exp(data_in)
  # Sum the exponentials down each column (one denominator per column)
  denom = np.sum(exp_values, axis=0)
  # Replicate the denominator across the rows so it matches the shape of exp_values
  denom = np.matmul(np.ones((data_in.shape[0],1)), denom[np.newaxis,:])
  # Compute softmax
  softmax = exp_values / denom
  # Return the answer
  return softmax
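
The softmax above exponentiates the raw scores, so `np.exp` can overflow when the scores are large. A minimal sketch of a more robust variant follows, under the hypothetical name `softmax_cols_stable`. It subtracts each column's maximum first, which leaves the result unchanged because softmax is invariant to a per-column shift. The example scores are made up for illustration.

```python
import numpy as np

def softmax_cols_stable(data_in):
  # Subtract each column's maximum; this does not change the softmax result
  shifted = data_in - np.max(data_in, axis=0, keepdims=True)
  exp_values = np.exp(shifted)
  # keepdims lets broadcasting divide every column by its own sum
  return exp_values / np.sum(exp_values, axis=0, keepdims=True)

# Illustrative scores; the first column would overflow a naive np.exp
scores = np.array([[1000.0, 0.0],
                   [1001.0, 0.0]])
probs = softmax_cols_stable(scores)
print(probs.sum(axis=0))   # each column sums to 1
```

For the small scores in this notebook, both versions should give the same attention weights up to floating-point rounding.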

In [23]:
# Now let's compute multihead self-attention in matrix form
def multihead_scaled_self_attention(X,omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c):

  # TODO Write the multihead scaled self-attention mechanism.
  # 1. Compute queries, keys, and values
  # 2. Apply softmax to calculate attentions and weight values by attentions
  # 3. Concatenate the self-attentions and apply linear transformation to combine them

  # Compute queries, keys, and values for both heads
  Q1 = np.dot(omega_q1, X) + beta_q1
  K1 = np.dot(omega_k1, X) + beta_k1
  V1 = np.dot(omega_v1, X) + beta_v1

  Q2 = np.dot(omega_q2, X) + beta_q2
  K2 = np.dot(omega_k2, X) + beta_k2
  V2 = np.dot(omega_v2, X) + beta_v2

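  # Compute attention weights: dot products of keys and queries, scaled by the
  # square root of the per-head dimension, with softmax applied to each column
  attention_weights1 = softmax_cols(np.dot(K1.T, Q1) / np.sqrt(Q1.shape[0]))
  attention_weights2 = softmax_cols(np.dot(K2.T, Q2) / np.sqrt(Q2.shape[0]))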

  # Compute weighted values
  weighted_values1 = np.dot(V1, attention_weights1)
  weighted_values2 = np.dot(V2, attention_weights2)

  # Concatenate the weighted values from both heads
  concatenated_values = np.concatenate((weighted_values1, weighted_values2), axis=0)

  # Apply the final linear transformation
  X_prime = np.dot(omega_c, concatenated_values)

  return X_prime
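
The function above writes out the two heads by hand. As a rough sketch of how the same computation might generalize to any number of heads, the version below loops over a list of per-head parameter dictionaries. The names `multihead_attention_loop`, `make_head`, and the dictionary keys are hypothetical, the random parameters are illustrative rather than the notebook's seeded values, and the softmax subtracts the column maximum, which is mathematically equivalent to `softmax_cols`.

```python
import numpy as np

def softmax_cols(a):
  # Column-wise softmax; subtracting the column maximum does not change the result
  e = np.exp(a - a.max(axis=0, keepdims=True))
  return e / e.sum(axis=0, keepdims=True)

def make_head(rng, D, H_D):
  # Random parameters for one head (illustrative initialization only)
  return {
    "omega_q": rng.normal(size=(H_D, D)), "beta_q": rng.normal(size=(H_D, 1)),
    "omega_k": rng.normal(size=(H_D, D)), "beta_k": rng.normal(size=(H_D, 1)),
    "omega_v": rng.normal(size=(H_D, D)), "beta_v": rng.normal(size=(H_D, 1)),
  }

def multihead_attention_loop(X, heads, omega_c):
  outputs = []
  for p in heads:
    # Queries, keys, and values for this head, each of shape (H_D, N)
    Q = p["omega_q"] @ X + p["beta_q"]
    K = p["omega_k"] @ X + p["beta_k"]
    V = p["omega_v"] @ X + p["beta_v"]
    # Scaled dot-product attention; each column of the (N, N) matrix sums to 1
    attention = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))
    outputs.append(V @ attention)
  # Stack the heads along the rows, then mix them with the final linear map
  return omega_c @ np.concatenate(outputs, axis=0)

# Usage with four heads instead of two
rng = np.random.default_rng(0)
D, N, H = 8, 6, 4
heads = [make_head(rng, D, D // H) for _ in range(H)]
X = rng.normal(size=(D, N))
omega_c = rng.normal(size=(D, D))
X_prime = multihead_attention_loop(X, heads, omega_c)
print(X_prime.shape)   # (8, 6)
```

Keeping the parameters in a list avoids the long argument list and makes the number of heads a matter of data rather than code, at the cost of departing from the notebook's function signature.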
  

In [24]:
# Run the self attention mechanism
X_prime = multihead_scaled_self_attention(X,omega_v1, omega_q1, omega_k1, beta_v1, beta_q1, beta_k1, omega_v2, omega_q2, omega_k2, beta_v2, beta_q2, beta_k2, omega_c)

# Print out the results
np.set_printoptions(precision=3)
print("Your answer:")
print(X_prime)

print("True values:")
print("[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]")
print(" [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]")
print(" [  5.479   1.115   9.244   0.453   5.656   7.089]")
print(" [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]")
print(" [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]")
print(" [  3.548  10.036  -2.244   1.604  12.113  -2.557]")
print(" [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]")
print(" [  1.248  18.894  -6.409   3.224  19.717  -5.629]]")

# If your answers don't match, then make sure that you are doing the scaling, and make sure the scaling value is correct

Your answer:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]
True values:
[[-21.207  -5.373 -20.933  -9.179 -11.319 -17.812]
 [ -1.995   7.906 -10.516   3.452   9.863  -7.24 ]
 [  5.479   1.115   9.244   0.453   5.656   7.089]
 [ -7.413  -7.416   0.363  -5.573  -6.736  -0.848]
 [-11.261  -9.937  -4.848  -8.915 -13.378  -5.761]
 [  3.548  10.036  -2.244   1.604  12.113  -2.557]
 [  4.888  -5.814   2.407   3.228  -4.232   3.71 ]
 [  1.248  18.894  -6.409   3.224  19.717  -5.629]]


### Conclusion and Discussion

The purpose of this task is to implement a multihead self-attention mechanism from scratch. Each head has its own query, key, and value parameters, so different heads can attend to different relationships between the input vectors. The steps were:

1. Initialize the inputs and parameters: generate random input vectors, random weights and biases for the query, key, and value transformations of each head, and random weights for the final linear transformation.
2. Implement a column-wise softmax function `softmax_cols` to normalize attention scores across each column independently.
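3. Implement multihead scaled self-attention in matrix form, based on the equations below, where $h$ indexes the heads, $\mathbf{1}$ is an $N\times 1$ vector of ones, and $D_q=D/H$ is the size of the queries and keys in each head (`H_D` in the code).
$$
\begin{align*}
V_h&=\beta_{vh}\mathbf{1}^T+\Omega_{vh}\mathbf{X} \\
Q_h&=\beta_{qh}\mathbf{1}^T+\Omega_{qh}\mathbf{X} \\
K_h&=\beta_{kh}\mathbf{1}^T+\Omega_{kh}\mathbf{X} \\
Sa_h[\mathbf{X}]&=V_h\,\mathrm{Softmax}\left[\frac{K_h^TQ_h}{\sqrt{D_q}}\right] \\
MhSa[\mathbf{X}]&=\Omega_C[Sa_1[\mathbf{X}]^T,Sa_2[\mathbf{X}]^T,\dots,Sa_H[\mathbf{X}]^T]^T
\end{align*}
$$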
4. Compare the computed results with the expected true values to verify the correctness of the implementation.
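
The comparison in step 4 was done by eye. A minimal sketch of an automated check follows, assuming a hypothetical helper named `matches_reference`. Since the reference values are printed to three decimal places, the tolerance is set to half a unit in the last printed digit. The `expected` array is the top-left corner of the printed true values, while `actual_ok` and `actual_bad` are made-up perturbations for illustration.

```python
import numpy as np

def matches_reference(actual, expected, decimals=3):
  # Reference values are rounded to `decimals` places, so allow half a unit
  # in the last printed digit, plus a tiny slack for floating-point error
  atol = 0.5 * 10.0 ** (-decimals) + 1e-12
  return np.allclose(actual, expected, rtol=0.0, atol=atol)

# Top-left 2 x 2 corner of the printed true values
expected = np.array([[-21.207, -5.373],
                     [ -1.995,  7.906]])

actual_ok = expected + 0.0004    # hypothetical result within rounding error
actual_bad = expected + 0.01     # hypothetical result that is clearly off

print(matches_reference(actual_ok, expected))    # True
print(matches_reference(actual_bad, expected))   # False
```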