# **Transformer and Transformer-Based Models (Part 1)**

In this Python notebook, we implement the core of the **multi-head attention** sub-layer of a Transformer encoder, working through a single attention head step by step.

In [1]:
import math
import numpy as np
import torch
import torch.nn as nn
from scipy.special import softmax

print(torch.__version__)

1.13.1


***

## **1. Implement the Multi-Head Attention Sub-Layer**

### 1.1 ~ Initialize Input Data

Step 1: we generate random input data of shape $\text{n\_inputs}\times \text{d\_model}$.

We use `np.random.rand()`, which samples uniformly from $[0, 1)$.

In [38]:
np.random.seed(0)

d_model = 512
n_inputs = 3

x = np.random.rand(n_inputs, d_model)

In [39]:
print('x:', x)
print('x.shape:', x.shape)

x: [[0.5488135  0.71518937 0.60276338 ... 0.44613551 0.10462789 0.34847599]
 [0.74009753 0.68051448 0.62238443 ... 0.6204999  0.63962224 0.9485403 ]
 [0.77827617 0.84834527 0.49041991 ... 0.07382628 0.49096639 0.7175595 ]]
x.shape: (3, 512)


### 1.2 ~ Create Weight Matrices for *query*, *key*, and *value*

Step 2: we create the weight matrices with the correct dimensions.

Let's start with `W_query` and `Q`.

We first create an uninitialized tensor `W` of shape `(d_model, d_k)` with `torch.empty()`, where `d_k = d_model // n_heads` is the width of a single head.

Then we fill it in place with `nn.init.xavier_uniform_()`.

Once `W_query` is initialized, we obtain the query matrix `Q` by multiplying `x` by `W_query`.

We use `np.matmul()`.

In [40]:
torch.manual_seed(0)

n_heads = 8
d_k = d_model // n_heads

# Create an empty tensor W with the correct dimension.
W = torch.empty(d_model, d_k)

# Randomly initialize the values in the tensor.
nn.init.xavier_uniform_(W)
# Expose the tensor as a NumPy array (this shares memory with W; it is not a copy).
W_query = W.data.numpy()

Q = np.matmul(x, W_query)

In [41]:
print('W_query[0,:5]:', W_query[0,:5])
print('W_query.shape:', W_query.shape)
print('Q[0, :5]:', Q[0,:5])
print('Q.shape:', Q.shape)

W_query[0,:5]: [-0.00076412  0.05475055 -0.0840017  -0.07511146 -0.03930965]
W_query.shape: (512, 64)
Q[0, :5]: [-0.22772416  0.48167867  1.48693414 -1.00410582  0.19323682]
Q.shape: (3, 64)
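
For reference, `nn.init.xavier_uniform_()` with its default gain of 1 draws each entry uniformly from $[-a, a]$, where $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The sketch below reproduces that bound in plain NumPy for a `(d_model, d_k)` matrix. The helper name `xavier_uniform_np` is hypothetical, and because it uses NumPy's random generator rather than PyTorch's, it will not reproduce the exact values printed above.

```python
import math
import numpy as np

def xavier_uniform_np(rows, cols, rng):
    """Hypothetical NumPy stand-in for nn.init.xavier_uniform_ (gain = 1)."""
    # For a 2-D weight matrix, fan_in + fan_out is simply rows + cols.
    bound = math.sqrt(6.0 / (rows + cols))
    return rng.uniform(-bound, bound, size=(rows, cols)), bound

rng = np.random.default_rng(0)
W_demo, bound = xavier_uniform_np(512, 64, rng)  # same shape as W_query

print('bound:', bound)
print('W_demo.shape:', W_demo.shape)
print('max |W_demo|:', np.abs(W_demo).max())
```

For a `(512, 64)` matrix the bound is roughly 0.102, which is consistent with the magnitudes seen in `W_query[0,:5]`.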


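Next, we repeat the same steps for `W_key` and `K`, and for `W_value` and `V`, using a different seed for each so that the three weight matrices differ.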

In [42]:
torch.manual_seed(1)

W = torch.empty(d_model, d_k)

nn.init.xavier_uniform_(W)
W_key = W.data.numpy()

K = np.matmul(x, W_key)

In [43]:
torch.manual_seed(2)

W = torch.empty(d_model, d_k)

nn.init.xavier_uniform_(W)
W_value = W.data.numpy()

V = np.matmul(x, W_value)

In [44]:
print('K[0,:5]', K[0,:5])
print('K.shape', K.shape)
print('V[0,:5]', V[0,:5])
print('V.shape', V.shape)

K[0,:5] [ 0.22836541 -0.65482718 -0.07202062  0.49886369  0.5704503 ]
K.shape (3, 64)
V[0,:5] [-0.44997758  0.92097353 -0.76932045  0.03289758 -0.49462581]
V.shape (3, 64)


### 1.3 ~ Compute Attention Scores and Weighted Output

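Step 3: we compute the attention scores from the matrices `Q` and `K`, following the equation:

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V
\end{equation}

in which dividing by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension $d_k$, which would otherwise push the softmax toward saturation.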
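
To see why the division by $\sqrt{d_k}$ matters, the sketch below estimates the variance of a dot product between two random vectors for several values of `d_k`. It assumes entries with zero mean and unit variance, which is an illustrative assumption: the notebook's own `Q` and `K` are not standardized, so this shows the trend rather than the exact behaviour of the values above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5_000

for d_k in (4, 64, 512):
    # Illustrative assumption: i.i.d. entries with zero mean and unit variance.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))

    dots = np.sum(q * k, axis=1)      # one dot product per sample
    scaled = dots / np.sqrt(d_k)      # the scaling used in the equation

    print(f'd_k={d_k:4d}  var(q.k)={dots.var():8.2f}  '
          f'var(q.k / sqrt(d_k))={scaled.var():.2f}')
```

Under this assumption the unscaled variance grows roughly in proportion to `d_k`, while the scaled variance stays close to 1 regardless of the head width.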

We first compute `attn_scores`, the unnormalized scores. We then apply the `softmax()` function imported from `scipy` to obtain the normalized scores. Note that `softmax()` must be called with `axis=1`, so that each row (the weights one query assigns to all keys) sums to 1.

In [45]:
attn_scores = np.dot(Q, K.T) / math.sqrt(d_k)

attn_scores_norm = softmax(attn_scores, axis=1)

In [46]:
print('attn_scores.shape:', attn_scores.shape)
print('Unnormalized attn_scores:', attn_scores)
print('Normalized attn_scores:', attn_scores_norm)

attn_scores.shape: (3, 3)
Unnormalized attn_scores: [[-0.75497316 -0.97036241 -0.85112739]
 [ 0.2377701  -0.70730389 -0.37639248]
 [ 0.21608568 -0.73905382 -0.89881122]]
Normalized attn_scores: [[0.36838498 0.29700213 0.33461289]
 [0.51820328 0.20140013 0.2803966 ]
 [0.58387084 0.22464925 0.19147991]]
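
To make the role of the `axis` argument concrete, here is a minimal NumPy-only sketch of a row-wise softmax applied to the unnormalized scores printed above. The helper name `softmax_rows` is hypothetical; it is meant as an illustration of what `scipy.special.softmax(..., axis=1)` computes, not as a replacement for it.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp()."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Unnormalized scores copied from the notebook output.
attn_scores = np.array([
    [-0.75497316, -0.97036241, -0.85112739],
    [ 0.2377701,  -0.70730389, -0.37639248],
    [ 0.21608568, -0.73905382, -0.89881122],
])

attn_scores_norm = softmax_rows(attn_scores)

print(attn_scores_norm)
print('row sums:', attn_scores_norm.sum(axis=1))
```

Each row sums to 1 because the normalization runs along `axis=1`. With `axis=0` the columns would sum to 1 instead, which would mix weights across different queries.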


Step 4: finally, we compute the output as a weighted sum of the rows of the value matrix `V`, using the `attn_scores_norm` computed above as the weights.

`attn_scores_norm[0,:]` holds the weights for the first output row `weighted_output[0,:]`, \
so the computation is:\
`weighted_output[0,:] = attn_scores_norm[0,0] * V[0,:] + attn_scores_norm[0,1] * V[1,:] + attn_scores_norm[0,2] * V[2,:]`

We can compute every row at once with a single call to `np.matmul()`.

In [48]:
weighted_output = np.matmul(attn_scores_norm, V)

print('weighted_output[0,:5]:', weighted_output[0,:5])
print('weighted_output.shape:', weighted_output.shape)

weighted_output[0,:5]: [-0.37040035  0.49331394 -0.78595571  0.09711597 -0.33551546]
weighted_output.shape: (3, 64)
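
As a quick sanity check, the sketch below confirms that the explicit weighted sum for the first row agrees with the `np.matmul()` result. The weights are the normalized scores printed earlier, while `V_demo` is a random placeholder with the same shape as `V`, not the notebook's actual value matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder value matrix with the notebook's shape: 3 inputs, d_k = 64.
V_demo = rng.standard_normal((3, 64))

# Normalized attention weights copied from the notebook output.
weights = np.array([
    [0.36838498, 0.29700213, 0.33461289],
    [0.51820328, 0.20140013, 0.2803966],
    [0.58387084, 0.22464925, 0.19147991],
])

# Explicit weighted sum for the first output row.
row0 = (weights[0, 0] * V_demo[0, :]
        + weights[0, 1] * V_demo[1, :]
        + weights[0, 2] * V_demo[2, :])

# The same computation for all rows at once.
out = np.matmul(weights, V_demo)

print('rows match:', np.allclose(row0, out[0, :]))
print('out.shape:', out.shape)
```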


***

**We have finished Task 1, and now we know how to implement the self-attention module, which is the core technique of the Transformer.**
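
The steps above compute one attention head of width `d_k = d_model // n_heads`. As a rough sketch of how `n_heads` such heads might be combined, the code below runs the same computation once per head, concatenates the results, and applies an output projection `W_o`. This is not part of the notebook: the helper names, the NumPy-based initialization, and `W_o` are assumptions made for illustration, and the values will not match the outputs printed earlier.

```python
import math
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax (row max subtracted for numerical stability)."""
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def single_head(x, W_q, W_k, W_v):
    """Scaled dot-product attention for one head, following Steps 2 to 4."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = W_q.shape[1]
    weights = softmax_rows(Q @ K.T / math.sqrt(d_k))
    return weights @ V

def xavier_uniform_np(rows, cols, rng):
    """Hypothetical NumPy stand-in for nn.init.xavier_uniform_ (gain = 1)."""
    bound = math.sqrt(6.0 / (rows + cols))
    return rng.uniform(-bound, bound, size=(rows, cols))

rng = np.random.default_rng(0)
d_model, n_heads, n_inputs = 512, 8, 3
d_k = d_model // n_heads

x = rng.random((n_inputs, d_model))

# One independent (W_q, W_k, W_v) triple per head.
head_outputs = []
for _head in range(n_heads):
    W_q = xavier_uniform_np(d_model, d_k, rng)
    W_k = xavier_uniform_np(d_model, d_k, rng)
    W_v = xavier_uniform_np(d_model, d_k, rng)
    head_outputs.append(single_head(x, W_q, W_k, W_v))

# Concatenate along the feature axis: (n_inputs, n_heads * d_k).
concat = np.concatenate(head_outputs, axis=1)

# Assumed output projection that mixes information across heads.
W_o = xavier_uniform_np(d_model, d_model, rng)
multi_head_output = concat @ W_o

print('concat.shape:', concat.shape)
print('multi_head_output.shape:', multi_head_output.shape)
```

Because `n_heads * d_k` equals `d_model`, the concatenated output has the same shape as the input `x`, which is what allows the sub-layer to be stacked.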