# Attention please!

Before transformers, recurrent neural networks (RNNs) were considered the cutting edge in Natural Language Processing (NLP). An RNN is a type of neural network in which outputs from previous steps are fed as inputs to the current step. This characteristic enables an RNN to retain information from previous steps, making it well suited to sequential data like text. In the context of NLP, an RNN takes an input, such as a word or character, processes it through its network, and generates a vector known as the hidden state. If you are unfamiliar with RNNs, don't worry: you don't need to know their detailed workings to follow this discussion.

One area where RNNs played an important role was the development of machine translation systems, where the model translates text from one language to another. However, word order in one language may differ from word order in another because the source and target languages have different grammatical structures, so the translation cannot simply proceed word by word. To address this issue we can use an encoder-decoder architecture. The encoder's role is to convert the input sequence into a numerical representation, typically referred to as the final hidden state. The encoder updates its hidden state at each step, trying to capture the entire meaning of the input sentence in the final hidden state. The decoder then takes this final hidden state and starts generating the translated sentence, one word at a time. \
However, a significant challenge of this architecture is that the final hidden state of the encoder creates an information bottleneck. It has to represent the meaning of the whole input sequence because this is all the decoder has access to when generating the output. This is especially challenging for long sequences, where information at the start of the sequence might be lost in the process of compressing everything into a single, fixed-size representation.
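
To make the bottleneck concrete, here is a minimal sketch that uses PyTorch's `nn.GRU` as a stand-in encoder. The choice of a GRU, the sizes, and the random inputs are assumptions made for illustration only. Whatever the input length, the final hidden state handed to a decoder has the same fixed size.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
embed_dim, hidden_dim = 8, 16
encoder = nn.GRU(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)

short_seq = torch.randn(1, 5, embed_dim)    # batch of 1, 5 tokens
long_seq = torch.randn(1, 500, embed_dim)   # batch of 1, 500 tokens

# nn.GRU returns (hidden states for every step, final hidden state)
short_states, short_final = encoder(short_seq)
long_states, long_final = encoder(long_seq)

print(short_states.shape, short_final.shape)  # per-step states grow with length
print(long_states.shape, long_final.shape)    # final state keeps the same shape
```

The per-step hidden states scale with the sequence length, while the final state does not, which is why a decoder that only sees the final state struggles with long inputs.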

To address this challenge, an "attention mechanism" was introduced, permitting the decoder to selectively access the different hidden states of the encoder. But why selective? Using all the states at the same time would create a huge input for the decoder. Instead, the attention mechanism lets the decoder assign a different amount of weight, or "attention", to each of the encoder states at every decoding timestep. \
The paper "Attention Is All You Need" later demonstrated that RNN architectures are not required for NLP applications such as machine translation, and proposed the transformer architecture, which is built around a "self-attention mechanism".
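
The decoder-side attention described above, where each encoder state receives its own weight at every decoding timestep, can be sketched as follows. This is a simplified dot-product variant with random tensors standing in for real encoder and decoder states; both choices are assumptions for illustration, and other scoring functions are possible.

```python
import torch

torch.manual_seed(0)

hidden_dim = 4                               # hypothetical hidden size
encoder_states = torch.randn(6, hidden_dim)  # one hidden state per source token
decoder_state = torch.randn(hidden_dim)      # decoder state at the current timestep

# Score each encoder state against the current decoder state (dot product)
scores = encoder_states @ decoder_state      # shape: (6,)

# Turn the scores into weights that sum to 1
weights = torch.softmax(scores, dim=0)

# Context vector: weighted sum of encoder states, same size as a single state
context = weights @ encoder_states           # shape: (hidden_dim,)

print(weights)
print(context.shape)
```

The decoder receives a context vector of fixed size, yet its content changes at every timestep because the weights are recomputed from the current decoder state.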

The main idea behind the self-attention mechanism is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of the embeddings. Given a sequence of token embeddings $ x_{1}, ..., x_{n} $, self-attention produces a sequence of new embeddings $ x_{1}^{'}, ..., x_{n}^{'} $ where each $ x_{i}^{'} $ is a linear combination of all the $ x_{j},  j=1...n $:

$(1) \; x_{i}^{'} = \sum \limits _{j=1} ^{n} w_{ji}x_{j}$

The coefficients $ w_{ji} $ are the attention weights, normalized so that $ \sum_{j} w_{ji} = 1 $.
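
As a small worked instance of equation (1), the sketch below uses three toy embeddings and a hand-picked weight matrix. Both are hypothetical values chosen so the result is easy to check by hand; in a real model the weights come from the attention computation.

```python
import torch

# Three toy token embeddings in a 2-dimensional space (arbitrary values)
x = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# Hypothetical weights: row i holds the weights w_{ji} used to build x'_i,
# and every row sums to 1
w = torch.tensor([[0.50, 0.25, 0.25],
                  [0.00, 1.00, 0.00],
                  [0.25, 0.25, 0.50]])

# Equation (1): each new embedding is a weighted average of all embeddings
x_new = w @ x
print(x_new)
```

The second row of `w` puts all of its weight on the second token, so that token's embedding is returned unchanged, while the other two rows blend information from the whole sequence.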


There are several ways to implement a self-attention layer. The original implementation, introduced in the paper "Attention Is All You Need", is called "Scaled Dot-Product Attention":

$ (2) \; Attention(Q,K,V) =  softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V$

In [None]:
! pip install torch==2.0.1

In [None]:
import torch
from torch import nn

We implement equation (2) as a function in the Self Attention section below.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. 

$(3) \; MultiHead(Q,K,V) =  Concat(head_{1}, ..., head_{h})W^{O} $ 

where, 

$ head_{i} = Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})$ and, \
$ W_{i}^{Q} \in \mathbb{R}^{d_{model}\times d_{k}}, W_{i}^{K} \in \mathbb{R}^{d_{model} \times d_{k}}, W_{i}^{V} \in \mathbb{R}^{d_{model} \times d_{v}} $ and $ W^{O} \in \mathbb{R}^{hd_{v} \times d_{model}} $ are weight matrices.

The three per-head matrices $ W_{i}^{Q} $, $ W_{i}^{K} $ and $ W_{i}^{V} $ project the embedded input tokens $ x_{i} $ into query, key, and value vectors, which are the core components of the attention mechanism. The fourth matrix, $ W^{O} $, maps the concatenated heads back to the model dimension.

All of these matrices are trainable: as the model is exposed to more data during training, it adjusts their weights.
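
The shapes in equation (3) can be traced with a short sketch. The sizes below are hypothetical, and random matrices stand in for trained weights; the only goal is to show how the per-head projections, the concatenation, and $ W^{O} $ fit together.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration
n, d_model, h = 5, 8, 2      # tokens, model dimension, number of heads
d_k = d_v = 3                # per-head query/key and value dimensions

X = torch.randn(n, d_model)  # in self-attention, Q = K = V = X

heads = []
for _ in range(h):
    # Random stand-ins for the trainable matrices W_i^Q, W_i^K, W_i^V
    W_q = torch.randn(d_model, d_k)
    W_k = torch.randn(d_model, d_k)
    W_v = torch.randn(d_model, d_v)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (n, n)
    heads.append(weights @ V)                               # (n, d_v)

W_o = torch.randn(h * d_v, d_model)   # stand-in for W^O
concat = torch.cat(heads, dim=-1)     # (n, h * d_v)
output = concat @ W_o                 # (n, d_model)

print(concat.shape, output.shape)
```

Here the concatenated heads have 6 features while the model dimension is 8, which makes the role of $ W^{O} $ visible: it brings the result back to the size of the input embeddings.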

# What are Query, Key and Value?

In attention mechanisms, we use terms like "key," "query," and "value" which come from information retrieval and databases. They help us store, search, and get information efficiently.

Think of a "query" like a search term you put into a database. It's what the model is currently focusing on or trying to understand, like a word in a sentence. The query helps the model figure out how much attention to give to other parts of the input.

A "key" is like an index in a database used for searching. Each item in the input sequence, such as each word in a sentence, has a key. These keys are matched with the query to find relevant information.

The "value" in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values. 
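
The database analogy can be made concrete by contrasting an exact dictionary lookup with a "soft" lookup. This is a minimal sketch with hypothetical keys, values, and query; in a real model all three are learned projections of the token embeddings.

```python
import torch

# A hard lookup: the query must match a key exactly
database = {"cat": 1.0, "dog": 2.0, "car": 3.0}
print(database["dog"])

# A soft lookup: keys and queries are vectors, and every value contributes
# in proportion to how well its key matches the query
keys = torch.tensor([[1.0, 0.0],    # hypothetical key for item 0
                     [0.0, 1.0],    # hypothetical key for item 1
                     [0.7, 0.7]])   # hypothetical key for item 2
values = torch.tensor([[10.0], [20.0], [30.0]])
query = torch.tensor([0.0, 5.0])    # points strongly toward key 1

scores = keys @ query                   # similarity of the query to each key
weights = torch.softmax(scores, dim=0)  # relevance of each key, sums to 1
retrieved = weights @ values            # blend of values, weighted by relevance

print(weights)
print(retrieved)
```

The query matches the second key best, so the second value dominates the result, but the other values still contribute a little. That blending is what distinguishes attention from an exact database lookup.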

With this introduction, let's code our very first attention mechanism. Imagine we have an embedding model that generates embeddings in a 6-dimensional embedding space. Assume that our embedding model has generated the following embedding vectors for our input sentence "Write your first Attention mechanism".

Please note that the embedding values in this example are totally random and don't express any information.

In [61]:
import torch
# Write your first Attention mechanism
inputs = torch.tensor(
  [[0.172, 0.295, 0.618, 0.459, 0.818, 0.071], # Write 
   [0.265, 0.563, 0.718, 0.323, 0.126, 0.235], # your                                               
   [0.206, 0.333, 0.044, 0.862, 0.152, 0.594], # first    
   [0.300, 0.505, 0.727, 0.495, 0.898, 0.954], # Attention     
   [0.095, 0.809, 0.596, 0.110, 0.447, 0.418]] # mechanism   
)

In [88]:
print(inputs.shape)

torch.Size([5, 6])


First, we compute the attention scores, which are simply the dot products of each embedding vector with every embedding vector in the sequence (including itself). The dot product serves as a similarity function.

That is, $ Attention \; Scores \in \mathbb{R}^{d_{t}\times d_{t}} $, \
where $ d_{t} $ is the number of input tokens (i.e., words); here $ d_{t} = 5 $.

In [89]:
attn_scores = torch.matmul(inputs, inputs.T) 
print("attention scores are: ", attn_scores)

attention scores are:  tensor([[1.3834, 0.9234, 0.7230, 1.6794, 1.0691],
        [0.9234, 1.0781, 0.7108, 1.3830, 1.0987],
        [0.7230, 0.7108, 1.2742, 1.3918, 0.7262],
        [1.6794, 1.3830, 1.3918, 2.8351, 1.7250],
        [1.0691, 1.0987, 0.7262, 1.7250, 1.4054]])
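
To see what a single entry of this matrix means, the sketch below recomputes the score between "Write" and "your" by hand, using the first two rows of the `inputs` tensor defined earlier (copied here so the snippet runs on its own).

```python
import torch

# The first two rows of the `inputs` tensor defined earlier
write = torch.tensor([0.172, 0.295, 0.618, 0.459, 0.818, 0.071])
your = torch.tensor([0.265, 0.563, 0.718, 0.323, 0.126, 0.235])

# Score between "Write" and "your": multiply element-wise, then sum
score = (write * your).sum()
print(score)                   # matches attn_scores[0, 1] and attn_scores[1, 0]
print(torch.dot(write, your))  # same value via torch.dot
```

Because the dot product is symmetric, the score matrix is symmetric too, which is why the entry appears both in row 0 and in row 1 of the output above.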


Then, we normalize the attention scores using the softmax function. The goal of the normalization is to obtain attention weights that sum to 1 for each token, that is, along each row of the score matrix.

$ (4) \; \sigma(\mathbf{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}, for \; i = 1 ,...,K \; and \; \mathbf{z} \in \mathbb{R}^{K} $

In [100]:
attn_weights = torch.softmax(attn_scores, dim=1)
print("attention weights are: ", attn_weights)
print("Sums of all rows are: ",attn_weights.sum(dim=1))

attention weights are:  tensor([[0.2368, 0.1495, 0.1224, 0.3184, 0.1730],
        [0.1739, 0.2030, 0.1406, 0.2753, 0.2072],
        [0.1497, 0.1479, 0.2598, 0.2923, 0.1502],
        [0.1489, 0.1107, 0.1117, 0.4729, 0.1558],
        [0.1649, 0.1698, 0.1170, 0.3176, 0.2307]])
Sums of all rows are:  tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
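
Equation (4) can also be written out by hand and compared with `torch.softmax`. The helper `manual_softmax` below is a hypothetical name used only for this sketch; it is applied to the first row of the attention scores printed earlier.

```python
import torch

def manual_softmax(z, dim):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    z = z - z.max(dim=dim, keepdim=True).values
    exp_z = torch.exp(z)
    return exp_z / exp_z.sum(dim=dim, keepdim=True)

# First row of the attention scores printed earlier
row = torch.tensor([[1.3834, 0.9234, 0.7230, 1.6794, 1.0691]])

print(manual_softmax(row, dim=1))
print(torch.softmax(row, dim=1))
```

Both calls should agree with the first row of the attention weights shown above, up to rounding.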


In [66]:
context_vectors = torch.matmul(attn_weights, inputs)  
print(context_vectors) # 5x6

tensor([[0.1780, 0.4262, 0.4843, 0.3858, 0.4458, 0.3576],
        [0.1593, 0.4009, 0.4263, 0.3352, 0.3457, 0.3228],
        [0.1564, 0.3528, 0.3389, 0.3937, 0.3099, 0.3515],
        [0.3600, 0.8420, 0.9404, 0.7560, 0.9062, 0.8448],
        [0.1843, 0.4831, 0.5131, 0.3783, 0.4335, 0.3953]])


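Each row of `context_vectors` is a weighted sum of all the token embeddings, computed with that token's row of attention weights. As you can see, the context vectors have the same size as our inputs. In other words, we have simply modified each embedding so that it also reflects the tokens it attends to.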

# Self Attention

The paper "Attention Is All You Need" introduces <em> Scaled Dot-Product Attention </em>, in which the dot products are divided by $ \sqrt{d_{k}} $ before the softmax is applied. The scaling matters as the embedding dimension grows, which is typically greater than a thousand for GPT-like LLMs. For large values of $ d_{k} $, the dot products grow large in magnitude. As the dot products increase, the softmax function behaves more like a step function, and its gradients approach zero. These small gradients can drastically slow down learning during backpropagation. The authors of the paper suspected that this effect hurts unscaled dot-product attention for large $ d_{k} $, and they scale the dot products by $ 1/\sqrt{d_{k}} $ to counteract it.
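
The effect of the scaling factor can be observed with a small experiment. This is a sketch under stated assumptions: the key dimension of 1024 is hypothetical, and random unit-variance vectors stand in for real queries and keys.

```python
import torch
from math import sqrt

torch.manual_seed(0)

d_k = 1024                    # hypothetical key dimension
query = torch.randn(d_k)      # random stand-in for a query vector
keys = torch.randn(8, d_k)    # random stand-ins for 8 key vectors

scores = keys @ query         # raw dot products; their spread grows with d_k

unscaled = torch.softmax(scores, dim=0)
scaled = torch.softmax(scores / sqrt(d_k), dim=0)

print("largest weight without scaling:", unscaled.max().item())
print("largest weight with scaling:   ", scaled.max().item())
```

With independent, zero-mean, unit-variance entries, the dot products have a standard deviation of about $ \sqrt{d_{k}} $, so dividing by $ \sqrt{d_{k}} $ brings them back to roughly unit scale and keeps the softmax away from its saturated, step-like regime.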

<center><figure><img src="imgs/scaled-dot-product.png" alt="drawing" width="300"/><figcaption>Fig. 1: Scaled Dot-Product Attention.</figcaption></figure></center>    

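Let's turn Fig. 1 into code. First, we write a function that implements scaled dot-product attention. Then we wrap it in a `SelfAttention` module that learns the query, key, and value projections.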

In [83]:
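from math import sqrt

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
    dim_k = K.size(-1)
    # Transpose only the last two dims so batched inputs also work
    attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / sqrt(dim_k)
    # Normalize over the key axis so each query's weights sum to 1
    attn_weights = torch.softmax(attn_scores, dim=-1)
    return torch.matmul(attn_weights, V)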
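
As a sanity check, equation (2) can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which is available from PyTorch 2.0. The sketch below uses random tensors with a hypothetical `(batch, heads, seq_len, head_dim)` layout and writes the equation out inline so that it runs on its own.

```python
import torch
import torch.nn.functional as F
from math import sqrt

torch.manual_seed(0)

# Random stand-ins with shape (batch, heads, seq_len, head_dim)
Q = torch.randn(1, 1, 5, 3)
K = torch.randn(1, 1, 5, 3)
V = torch.randn(1, 1, 5, 3)

# Equation (2) written out by hand
weights = torch.softmax(Q @ K.transpose(-2, -1) / sqrt(K.size(-1)), dim=-1)
manual = weights @ V

# PyTorch's built-in version, called without a mask or dropout
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, builtin, atol=1e-5))
```

Writing the function by hand is useful for learning; in practice the built-in is usually the more convenient choice.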

In [84]:
class SelfAttention(nn.Module):
    """Single attention head with trainable query, key, and value projections."""

    def __init__(self, embed_dim, head_dim):
        super().__init__()
        # Each projection maps embed_dim -> head_dim and has its own weights
        self.W_q = nn.Linear(embed_dim, head_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, head_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, x):
        # Project the same input x into queries, keys, and values
        queries = self.W_q(x)
        keys = self.W_k(x)
        values = self.W_v(x)
        return scaled_dot_product_attention(queries, keys, values)

# Multi-head attention
As we saw, the self-attention mechanism applies three independent linear transformations to each embedding to produce the query, key, and value vectors. Each projection has its own set of trainable parameters, so that the self-attention layer can attend to various semantic features within the sequence and learn to produce "good" context vectors.

Additionally, it is advantageous to use multiple sets of linear projections, each referred to as an "attention head". But why several attention heads? The reason is that the softmax of a single head tends to focus mostly on one aspect of similarity. By employing several heads, the model can focus on multiple aspects of similarity simultaneously.

Now that we have implemented the self-attention mechanism, let's move on to the multi-head attention mechanism.

<center><figure><img src="imgs/multi-head-attention.png" alt="drawing" width="350"/><figcaption>Fig. 2: Multi-head Attention.</figcaption></figure></center>    

In [85]:
embed_dim = inputs.shape[1]
num_heads = 2
head_dim = embed_dim // num_heads
print(f"head dimension is: {head_dim} and number of heads is: {num_heads}")

head dimension is: 3 and number of heads is: 2


In [86]:
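class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, head_dim):
        super().__init__()
        self.heads = nn.ModuleList([SelfAttention(embed_dim, head_dim) for _ in range(num_heads)])
        # The concatenated heads have num_heads * head_dim features, which need
        # not equal embed_dim (e.g. when embed_dim is not divisible by num_heads)
        self.output_linear = nn.Linear(num_heads * head_dim, embed_dim)

    def forward(self, inputs):
        # Run every head on the same inputs and concatenate along the feature axis
        x = torch.cat([head(inputs) for head in self.heads], dim=-1)
        # Mix the heads and project back to embed_dim
        x = self.output_linear(x)
        return x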

Please note that the final linear layer is used to produce a tensor of the same size as our input tensor (i.e., 5x6 in our example). Let's check it out!

In [87]:
mult_head_attn = MultiHeadAttention(embed_dim,num_heads,head_dim)
attn_output = mult_head_attn(inputs)
print(attn_output.shape)

torch.Size([5, 6])


As you can see, it generates an output tensor with the expected size.