# Why we need the attention mechanism

## The problem with modeling long sequences

Before the attention mechanism, encoder-decoder architectures built on recurrent neural networks (RNNs) were the standard choice for sequence-to-sequence tasks such as machine translation, but they had some noticeable problems.
The encoder reads the input one element at a time and updates a hidden state (a memory cell) at each step. The decoder then takes the final hidden state to produce the output, so the hidden state plays the role of an embedding of the whole input.

The issue is that the decoder cannot access the encoder's earlier hidden states during decoding, so it relies solely on the **current** hidden state.

This limitation is the reason the **attention mechanism** was introduced.
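
The bottleneck can be made concrete with a small sketch. It assumes PyTorch's `torch.nn.RNN` as a stand-in encoder; the sizes, the random input, and the variable names are illustrative and not taken from any particular translation model.

```python
import torch

torch.manual_seed(0)  # reproducible random weights and inputs

# Hypothetical sizes: 6 input tokens, 3-dimensional embeddings, 4 hidden units
encoder = torch.nn.RNN(input_size=3, hidden_size=4, batch_first=True)
tokens = torch.rand(1, 6, 3)  # (batch, sequence length, embedding size)

# all_states holds the hidden state after every input token,
# final_state only the one after the last token
all_states, final_state = encoder(tokens)

print(all_states.shape)   # torch.Size([1, 6, 4])
print(final_state.shape)  # torch.Size([1, 1, 4])
```

A plain encoder-decoder hands only `final_state` to the decoder, so the five earlier rows of `all_states` are out of reach during decoding. Attention gives the decoder access to all of them.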

## Capturing data dependencies

One shortcoming of the RNN is that it must remember the entire encoded text in a single hidden state before passing it to the decoder.
*Bahdanau attention* was introduced to address this: it lets the decoder selectively access different parts of the input sequence at each decoding step.

The transformer architecture came afterwards, with a self-attention mechanism inspired by the *Bahdanau attention mechanism*.
 - Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of that sequence.

## Attending to different parts of the input with self-attention

In self-attention, the term “self” means that each element of a sequence (e.g., a word) pays attention to the other elements within the same sequence to understand context and relationships. It learns how different parts of the input relate to each other, unlike traditional attention, which models relationships between two different sequences (e.g., input and output in seq2seq models).

### Simple self-attention mechanism without trainable weights

In self-attention, our goal is to calculate a context vector z<sup>(i)</sup> for each element x<sup>(i)</sup> in the input sequence. A context vector can be interpreted as an enriched embedding vector. For example, the importance or contribution of each input element to z<sup>(2)</sup> is determined by the attention weights α<sub>21</sub> to α<sub>2T</sub>. When computing z<sup>(2)</sup>, the attention weights are calculated with respect to the input element x<sup>(2)</sup> and all inputs in the sequence, including x<sup>(2)</sup> itself.

<img src="self-attention.png" width="600">

This enhanced context vector, z<sup>(2)</sup>, is an embedding that contains information about x<sup>(2)</sup> and all other input elements x<sup>(1)</sup> to x<sup>(T)</sup>.

In self-attention, context vectors play a crucial role. Their purpose is to create enriched representations of each element in an input sequence (like a sentence) by incorporating information from all other elements in the sequence.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89], # Your     (x^1)
        [0.22, 0.58, 0.33], # journey  (x^2)
        [0.77, 0.25, 0.10], # starts   (x^3)
        [0.05, 0.80, 0.55], # with     (x^4)
        [0.57, 0.85, 0.64], # one      (x^5)
        [0.55, 0.87, 0.66], # step     (x^6)
    ]
)
```

<img src="attention-weight.png" width="650">

We calculate the intermediate attention scores between the query token and each input token. We determine these scores by computing the dot product of the query, x<sup>(2)</sup>, with every input token, including the query itself.

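```python
query = inputs[1]  # the second input token, x^(2), serves as the query

# One attention score per input token
attn_score_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_score_2[i] = torch.dot(query, x_i)

print(attn_score_2)
```

```
tensor([0.4753, 0.4937, 0.3474, 0.6565, 0.8296, 0.8434])
```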
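
The loop above computes one dot product per input token. As a sketch of an equivalent, more compact form, the same scores can be obtained with a single matrix-vector product; the name `attn_score_2_vec` is introduced here for illustration only.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89], # Your     (x^1)
        [0.22, 0.58, 0.33], # journey  (x^2)
        [0.77, 0.25, 0.10], # starts   (x^3)
        [0.05, 0.80, 0.55], # with     (x^4)
        [0.57, 0.85, 0.64], # one      (x^5)
        [0.55, 0.87, 0.66], # step     (x^6)
    ]
)
query = inputs[1]

# Each row of `inputs` is dotted with `query` in a single call
attn_score_2_vec = inputs @ query

print(attn_score_2_vec)
```

The result has one entry per input token and matches the scores produced by the explicit loop.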

After computing the attention scores ω<sub>21</sub> to ω<sub>2T</sub> with respect to the query x<sup>(2)</sup>, the next step is to obtain the attention weights α<sub>21</sub> to α<sub>2T</sub> by normalizing the attention scores.
The goal is to obtain attention weights that sum to 1.

```python
attn_weight_2_tmp = attn_score_2 / attn_score_2.sum()

print(f"Attention weights: {attn_weight_2_tmp}")
print(f"Sum: {attn_weight_2_tmp.sum()}")
```

```
Attention weights: tensor([0.1304, 0.1354, 0.0953, 0.1801, 0.2275, 0.2313])
Sum: 1.0000001192092896
```


In practice, it is more common and advisable to use the softmax function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. The softmax function also ensures that the attention weights are always positive,
which makes the output interpretable as probabilities or relative importance.
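
To illustrate the point about positivity, here is a small sketch with made-up scores (the tensor `scores` is hypothetical and not part of the example sentence). Dividing by the sum breaks down as soon as a score is negative, whereas softmax still returns positive weights.

```python
import torch

scores = torch.tensor([2.0, -1.0, 0.5])  # hypothetical scores, one of them negative

sum_normalized = scores / scores.sum()
softmax_normalized = torch.softmax(scores, dim=0)

print(sum_normalized)      # contains a negative "weight"
print(softmax_normalized)  # all entries are positive
```

Both results sum to 1, but only the softmax output can be read as a distribution over the input tokens.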

```python
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weight_2_naive = softmax_naive(attn_score_2)

print(f"Attention weights: {attn_weight_2_naive}")
print(f"Sum: {attn_weight_2_naive.sum()}")
```

```
Attention weights: tensor([0.1435, 0.1462, 0.1263, 0.1720, 0.2046, 0.2074])
Sum: 1.0
```


The naive softmax implemented above may run into numerical instability, such as overflow or underflow, when dealing with very large or very small inputs. In practice, it is advisable to use the PyTorch implementation, `torch.softmax`.
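
The overflow problem is easy to reproduce. The sketch below uses hypothetical, deliberately large scores and compares the naive version with `torch.softmax`. It also includes `softmax_stable`, an illustrative helper that applies the standard trick of subtracting the maximum before exponentiating; it is not claimed to mirror PyTorch's internals.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Shifting by the maximum does not change the result mathematically,
    # but keeps every exponent at or below zero
    exps = torch.exp(x - x.max())
    return exps / exps.sum(dim=0)

large_scores = torch.tensor([1000.0, 1001.0, 1002.0])  # hypothetical extreme scores

naive = softmax_naive(large_scores)    # exp(1000) overflows to inf, and inf / inf is nan
stable = softmax_stable(large_scores)
builtin = torch.softmax(large_scores, dim=0)

print(naive)    # tensor([nan, nan, nan])
print(stable)
print(builtin)
```

For scores as small as the ones in this example, all three variants agree; the difference only appears at extreme magnitudes.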

```python
attn_weight_2 = torch.softmax(attn_score_2, dim=0)

print(f"Attention weights: {attn_weight_2}")
print(f"Sum: {attn_weight_2.sum()}")
```

```
Attention weights: tensor([0.1435, 0.1462, 0.1263, 0.1720, 0.2046, 0.2074])
Sum: 1.0
```
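
With the attention weights in hand, the remaining step toward the context vector z<sup>(2)</sup> described earlier is to combine the inputs. The sketch below assumes the usual formulation, in which z<sup>(2)</sup> is the sum of all input vectors x<sup>(i)</sup>, each weighted by α<sub>2i</sub>; the name `context_vec_2` is introduced here for illustration.

```python
import torch

inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89], # Your     (x^1)
        [0.22, 0.58, 0.33], # journey  (x^2)
        [0.77, 0.25, 0.10], # starts   (x^3)
        [0.05, 0.80, 0.55], # with     (x^4)
        [0.57, 0.85, 0.64], # one      (x^5)
        [0.55, 0.87, 0.66], # step     (x^6)
    ]
)
query = inputs[1]

# Dot product of the query with every input, then softmax normalization
attn_weight_2 = torch.softmax(inputs @ query, dim=0)

# Assumed formulation: weighted sum of the input vectors
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weight_2[i] * x_i

print(context_vec_2)
```

Because the weights are positive and sum to 1, every component of `context_vec_2` lies between the smallest and largest value of that component across the inputs.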
