In [11]:
import torch.nn as nn
import torch

d_model = 512

W_Q = nn.Linear(d_model, d_model)
W_K = nn.Linear(d_model, d_model)
W_V = nn.Linear(d_model, d_model)

`nn.Linear(512, 512)` holds a weight matrix $W$ of shape (512, 512) and a bias vector $b$ of shape (512,). It takes an input X of shape (batch_size, seq_len, hidden_dim) and applies an affine transformation to the last dimension: $$y = xW^T + b$$

where $W$ and $b$ are the trainable weight matrix and bias vector
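
As a quick check of the formula above, the sketch below compares the output of an `nn.Linear` layer with the manual computation $xW^T + b$. The names `layer`, `y_layer` and `y_manual` are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(512, 512)      # weight: (512, 512), bias: (512,)
x = torch.randn(2, 5, 512)       # (batch_size, seq_len, hidden_dim)

y_layer = layer(x)                           # what nn.Linear computes
y_manual = x @ layer.weight.T + layer.bias   # y = x W^T + b, written by hand

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(y_layer, y_manual, atol=1e-4))
```

The transformation only touches the last dimension, so the batch and sequence dimensions pass through unchanged.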

In [None]:
# Example: batch of 10 sentences, each 5 tokens, embedding size 512
batch_size = 10
seq_len = 5
d_model = 512

X = torch.randn(batch_size, seq_len, d_model)  # (10, 5, 512)

In [16]:
Q = W_Q(X)
K = W_K(X)
V = W_V(X)

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Q: Query matrix

K: Key matrix

V: Value matrix

$d_k$: dimension of each key vector (the size of the last dimension of K)
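
A minimal sketch of scaled dot-product attention: the scores $QK^T$ are scaled by $\sqrt{d_k}$, passed through a softmax over the key dimension, and the resulting weights are multiplied with V. The function name `scaled_dot_product_attention` and the random inputs are assumptions made for this example, not part of the notebook above.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Return (output, attention weights) for tensors shaped (..., seq_len, d_k)."""
    d_k = K.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize over the key dimension so each row sums to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.randn(10, 5, 512)
K = torch.randn(10, 5, 512)
V = torch.randn(10, 5, 512)

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (10, 5, 512)
print(weights.shape)  # (10, 5, 5)
```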

There are two common types of attention:

- Dot-product attention: takes the dot product between Query and Key to measure their similarity. It is faster and lighter in practice because it reduces to matrix multiplication

- Additive attention: computes the similarity with a feed-forward network with a single hidden layer, which is slower and more costly to compute

For large $d_k$, additive attention outperforms dot-product attention without scaling. The likely cause is that the dot products grow large in magnitude as $d_k$ increases, pushing the softmax into regions where its gradients are extremely small. That is why the scaling factor $\frac{1}{\sqrt{d_k}}$ is applied in the dot-product approach
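
The effect of the scaling factor can be illustrated numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than what a trained model produces. Under that assumption the unscaled dot product has a standard deviation of roughly $\sqrt{d_k}$, while the scaled one stays close to 1.

```python
import torch

torch.manual_seed(0)

stds = {}
for d_k in (4, 64, 512):
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    raw = (q * k).sum(dim=-1)     # unscaled dot products
    scaled = raw / d_k ** 0.5     # dot products divided by sqrt(d_k)
    stds[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:4d}  raw std={stds[d_k][0]:.2f}  scaled std={stds[d_k][1]:.2f}")
```

Large unscaled scores make the softmax output nearly one-hot, which is the region where its gradients become very small.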

Attention is used in 3 places in the Transformer architecture:

- Encoder-decoder attention: the queries come from the previous decoder layer, and the keys and values come from the output of the encoder, so every position in the decoder can attend to every position in the input sequence. This is the classic use of attention in seq2seq models

- Encoder self-attention: Q, K and V all come from the same place, the output of the previous layer in the encoder

- Decoder self-attention: works the same way, but it is stricter. To keep the auto-regressive nature of the model, future positions are masked out so the model cannot learn from tokens that come later in the sequence
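
The masking used in decoder self-attention can be sketched with a lower-triangular boolean mask: scores for future positions are set to $-\infty$ before the softmax, so they receive zero weight. The variable names and the random `scores` tensor here are placeholders for illustration.

```python
import torch

torch.manual_seed(0)
seq_len = 5

# True where attention is allowed: position i may look at positions 0..i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                        # stand-in for QK^T / sqrt(d_k)
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)                # future positions get weight 0

print(causal_mask)
print(weights)
```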

#### Position-wise FFN

Another layer of the Transformer Encoder is the Position-wise FeedForward Network, which applies the same transformation to each individual token in the sentence (by position)

$$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$$

where $W_1, W_2, b_1, b_2$ are trainable parameters

This can be read as a linear transformation of the input followed by a ReLU activation, then a second linear transformation of the result, giving a 2-layer MLP

The input and output dimension is typically hidden_dim = 512, and the inner layer (the transformation inside the ReLU) has dimension 2048

$W_1$ (512 x 2048)

$W_2$ (2048 x 512)

In other words, each token is projected into a higher-dimensional space for processing and then projected back to the original dimension

Note that the linear transformations use the same parameters across all positions within a layer (equivalent to two convolutions with kernel size 1), but the parameters differ from layer to layer
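
A minimal sketch of the position-wise FFN as a PyTorch module, assuming the dimensions quoted above (512 and 2048). The class name `PositionwiseFFN` and its attribute names are chosen for this example. The last lines check the position-wise property: feeding a single token through the network gives the same result as that token's slice of the full output.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # x W_1 + b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # (...) W_2 + b_2

    def forward(self, x):
        # max(0, x W_1 + b_1) W_2 + b_2
        return self.linear2(torch.relu(self.linear1(x)))

torch.manual_seed(0)
ffn = PositionwiseFFN()
x = torch.randn(10, 5, 512)

y = ffn(x)                      # all positions at once
y_single = ffn(x[:, 2:3, :])    # only the token at position 2

print(y.shape)
print(torch.allclose(y[:, 2:3, :], y_single, atol=1e-4))
```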

#### Positional Encoding

The Transformer architecture has no recurrence and no convolution, so to make use of the order of the sequence we need to inject information about the relative or absolute position of each token in the sequence.

The positional encodings have the same dimension $d_{model}$ as the embeddings, so the two can be summed. The original Transformer model uses sine and cosine functions of different frequencies

$$PE_{pos, 2i} = sin(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$

$$PE_{pos, 2i+1} = cos(\frac{pos}{10000^{\frac{2i}{d_{model}}}})$$

where pos is the position of the token and i is the index of the dimension pair ($0 \le i < d_{model}/2$)

The higher the dimension index, the lower the frequency of the sinusoid: the wavelengths grow geometrically with i. Combining many different frequencies makes it unlikely that two different positions end up with the same encoding

This gives positions that are close together similar encodings (the sinusoids vary smoothly with pos), and it may also allow the model to extrapolate to sequences longer than the ones it was trained on, because for any fixed offset k, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$

This is used in Transformer by adding the positional encoding to the token embedding

x = token_embedding + positional_embedding
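
The formulas above can be sketched as follows, using the base 10000 from the original Transformer paper and assuming an even `d_model`. The function name `sinusoidal_positional_encoding` and the random `token_embedding` are placeholders for this example.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sine/cosine positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

torch.manual_seed(0)
pe = sinusoidal_positional_encoding(max_len=5, d_model=512)
token_embedding = torch.randn(10, 5, 512)

# The encoding is broadcast over the batch dimension
x = token_embedding + pe
print(pe.shape, x.shape)
```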

In [14]:
num_heads = 8
head_dim = d_model // num_heads

In [None]:
# (batch, seq_len, 512) → (batch, num_heads, seq_len, head_dim)
Q = Q.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
K = K.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
V = V.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
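
Once Q, K and V are split into heads, each head runs scaled dot-product attention independently and the results are merged back into a single tensor. The sketch below shows one way to do this under the same shapes as the cells above. The random inputs, the helper `split_heads` and the output projection `W_O` are assumptions for this example and do not appear in the notebook.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, d_model, num_heads = 10, 5, 512, 8
head_dim = d_model // num_heads   # 64

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
    return t.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# Stand-ins for the projected Q, K, V
Q = split_heads(torch.randn(batch_size, seq_len, d_model))
K = split_heads(torch.randn(batch_size, seq_len, d_model))
V = split_heads(torch.randn(batch_size, seq_len, d_model))

# Attention is computed per head; d_k is now head_dim
scores = Q @ K.transpose(-2, -1) / math.sqrt(head_dim)   # (10, 8, 5, 5)
weights = torch.softmax(scores, dim=-1)
heads = weights @ V                                       # (10, 8, 5, 64)

# Merge heads: (batch, num_heads, seq_len, head_dim) -> (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

W_O = nn.Linear(d_model, d_model)   # hypothetical output projection
out = W_O(merged)
print(scores.shape, heads.shape, out.shape)
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.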

In [22]:
import torch.nn as nn

ln = nn.LayerNorm(512)
x = torch.randn(2, 5, 512)
y = ln(x)