## What’s the big deal about attention?

Consider machine translation as an example. Before attention, most translation was done via an encoder-decoder network. The encoder encodes the input sentence (“I love you”) via a recurrent model and the decoder decodes it into another language (“我爱你”).

![image](https://eugeneyan.com/assets/encoder-decoder.jpeg)

- With this approach, the encoder had to cram the entire input into a fixed-size vector, which was then passed to the decoder.
    - This single vector had to convey everything about the input sentence! Naturally, this led to an information bottleneck.
- With attention, we no longer have to encode input sentences into a single vector. 
    - Instead, we let the decoder attend to different words in the input sentence at each step of output generation. 
    - This increases the informational capacity, from a single fixed-size vector to the entire sentence (of vectors).

- Furthermore, previous recurrent models had long paths between input and output words. 
    - If you had a 50-word sentence, the decoder had to recall information from 50 steps ago for the first word (and that data had to be squeezed into a single context vector)
    - As a result, recurrent models had difficulty dealing with long-range dependencies
    - Attention addressed this by letting each step of the decoder see the entire input sentence and decide what words to attend to. 

- Prior models using the above architecture had to process words sequentially. 
    - To encode a sentence, we start with the first word (w1) and process it to get the first hidden state (h1). Then, we input the second word (w2) with the previous hidden state (h1) to derive the next hidden state (h2). And so on. 
    - This was slow and inefficient. 
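    - Replacing recurrence with attention (as in the Transformer) allows all words to be processed in parallel, which speeds up training in particular; generating output is still done one token at a time.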
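
The sequential dependency described above can be sketched in a few lines of NumPy. This is only a toy illustration: the weight matrices `W_x` and `W_h`, the sizes, and the random inputs are assumptions made for the example, not part of any particular model. The point is that each hidden state needs the previous one, whereas scores between all pairs of words come from a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, hidden_dim = 5, 4, 3         # toy sizes (assumed)

words = rng.normal(size=(seq_len, embed_dim))    # w1..w5 as random vectors
W_x = rng.normal(size=(embed_dim, hidden_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

# Recurrent encoding: each hidden state depends on the previous one,
# so the loop has to run step by step.
h = np.zeros(hidden_dim)
hidden_states = []
for w in words:
    h = np.tanh(w @ W_x + h @ W_h)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)          # (seq_len, hidden_dim)

# The fixed-size context vector of a plain encoder-decoder is the last state.
context_vector = hidden_states[-1]               # (hidden_dim,)

# Attention-style scores between all pairs of words need no loop:
# one matrix product covers every pair at once.
pair_scores = words @ words.T                    # (seq_len, seq_len)

print(context_vector.shape, pair_scores.shape)   # (3,) (5, 5)
```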

## Back to Embeddings

- Embeddings represent words/tokens as high-dimensional vectors such that similar words are located close to each other.
- Words having high similarity are placed closer to each other whereas words with low similarity will be pushed far away from each other in this space.

### How are the words placed together or far apart?

- Similar words are placed together based on `context` -- the other words in the sentence decide how close two words end up to each other.
- Closeness is calculated with a word similarity score: cosine similarity, or the scaled dot product (used in self-attention).

![image](https://lakshyamalhotra.github.io/images/word_similarity.png)

As another example, consider the following two sentences:

1. A bat lives in the cave
2. A tennis racket and a baseball bat

Without any context, it would be hard for the model to place the word “bat” in the embeddings space – it would probably be placed in the middle.

![image](https://lakshyamalhotra.github.io/images/attention_weights.png)

|         | Bat | Cave | Racket |
|---------|-----|------|--------|
| **Bat**    | 1   | 0.71 | 0.71   |
| **Cave**   | 0.71 | 1   | 0      |
| **Racket** | 0.71 | 0   | 1      |

From the above similarity table, we can see the individual contributions (as weights) from other words for a given word. In the language of linear algebra, we can write each word as a linear combination (weighted sum) of itself and the other words in its sentence:

- A bat lives in the cave
    - Bat = 1 * Bat + 0.71 * Cave
    - Cave = 0.71 * Bat + 1 * Cave

- A tennis racket and a baseball bat
    - Bat = 1 * Bat + 0.71 * Racket
    - Racket = 0.71 * Bat + 1 * Racket

In machine learning applications, it’s convenient to normalize the coefficients so that they sum to 1.

- Normalization makes the coefficients easier to interpret, since they can be read as relative weights or probabilities. This is typically done by applying the softmax operation.

After softmax:

- A bat lives in the cave
    - Bat = 0.57 * Bat + 0.43 * Cave
    - Cave = 0.43 * Bat + 0.57 * Cave

- A tennis racket and a baseball bat
    - Bat = 0.57 * Bat + 0.43 * Racket
    - Racket = 0.43 * Bat + 0.57 * Racket
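
As a worked example, the softmax weights for “Bat” in the first sentence follow from its two similarity scores, 1 (with itself) and 0.71 (with “Cave”). The symbols $s_i$ (scores) and $w$ (weights) are notation introduced here for illustration:

$$
\text{softmax}(s)_i = \frac{e^{s_i}}{\sum_j e^{s_j}}, \qquad
w_{\text{Bat}} = \frac{e^{1}}{e^{1} + e^{0.71}} \approx \frac{2.718}{2.718 + 2.034} \approx 0.57, \qquad
w_{\text{Cave}} = \frac{e^{0.71}}{e^{1} + e^{0.71}} \approx \frac{2.034}{2.718 + 2.034} \approx 0.43
$$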

## So what are we doing here?

- With the help of some coordinate geometry, it is not hard to see that after adding contributions from the other words, we are basically moving each word in the embedding space.

In other words, after getting the “context” from other words in the first sentence, the first set of equations basically shifted the “Bat” and the “Cave” towards each other. The equations,

    - Bat = 0.57 * Bat + 0.43 * Cave
    - Cave = 0.43 * Bat + 0.57 * Cave

are the context vectors for the “Bat” and the “Cave” tokens in the embedding space.

> At a very high level, this is how the attention mechanism works -- we take a word token (query) and look within its own sequence (keys) to find the information that should be used from other words to create a context vector.

### To summarize, we took the following steps to calculate the context vector:



1. Calculate attention scores: The attention mechanism calculates a similarity score between each pair of tokens in the input sequence. The higher the similarity score, the more relevant the key is to the current query. (Cosine similarity is used here; in practice, the scaled dot product is used for efficiency and stability.)

```python
import numpy as np
from numpy.linalg import norm

# define coordinates of words
racket = [0, 5]  # x-coordinate: 0, y-coordinate: 5
bat = [3, 3]     # x-coordinate: 3, y-coordinate: 3
cave = [4, 0]    # x-coordinate: 4, y-coordinate: 0

# convert the list of coordinates to a NumPy array
embed_vectors = np.array([bat, cave, racket])

# compute the pairwise cosine similarity matrix for a set of vectors
def cos_sim_matrix(vectors):
    # calculate the norms of each vector
    norms = norm(vectors, axis=1)

    # calculate the dot product between each pair of vectors
    dot_products = np.dot(vectors, vectors.T)

    # calculate the outer product of the norms
    norm_products = np.outer(norms, norms)

    # calculate the cosine similarity matrix
    cosine_similarity = dot_products / norm_products

    return np.round(cosine_similarity, 2)

# find similarity
similarity_matrix = cos_sim_matrix(embed_vectors)
print(similarity_matrix)

# prints
# [[1.   0.71 0.71]
#  [0.71 1.   0.  ]
#  [0.71 0.   1.  ]]
```

2. Normalization (Softmax): The softmax function is applied to the attention scores to turn them into weights that can be read as probabilities. Softmax ensures the weights sum to 1, which helps with training stability and interpretability.

```python
# similarity scores of "bat" with the tokens of the first sentence (bat, cave)
bat_cave = similarity_matrix[:1, :2]

# apply softmax along each row (keepdims keeps the division row-wise)
exp_scores = np.exp(bat_cave)
softmax = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
softmax = np.round(softmax, 2)
print(softmax)

# prints
# [[0.57 0.43]]
```

3. Weighted Summation: Lastly, attention weights are multiplied by the corresponding values and these weighted contributions are then summed up to create a context vector.
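
The third step can be sketched with the same toy coordinates for “bat” and “cave”. As a simplifying assumption, the raw embeddings serve as queries, keys, and values at once (no learned matrices), and the unrounded cosine similarities are used as scores.

```python
import numpy as np

# 2-D toy embeddings from the example above
bat = np.array([3.0, 3.0])
cave = np.array([4.0, 0.0])
values = np.stack([bat, cave])                 # (2, 2): one row per token

# cosine similarity between every pair of tokens
unit = values / np.linalg.norm(values, axis=1, keepdims=True)
scores = unit @ unit.T                         # approx. [[1, 0.71], [0.71, 1]]

# row-wise softmax turns scores into attention weights
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# weighted sum of the values gives one context vector per token
context = weights @ values

print(np.round(weights, 2))   # rows: bat, cave
print(np.round(context, 2))   # bat -> [3.43 1.72], cave -> [3.57 1.28]
```

Both context vectors lie between the original “bat” and “cave” points, so the two tokens end up much closer to each other than they started.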

## What are query, key, and value vectors?

- A “query” is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.
- The “key” is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match with the query.
- The “value” in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.

- These vectors are actually obtained by transforming the input embedding vectors with three matrices – K, Q and V.

*	Q (Query) — “What am I looking for?”
*	K (Key) — “What do I have to offer?”
*	V (Value) — “What information should I share if chosen?”

## Intuition behind K, Q and V matrices

- From basic linear algebra, we know that matrices represent linear transformations: rules that operate on vectors and change their properties, e.g., rotate them by a certain angle or reflect them about some axis.
- The `trainable matrices` for queries, keys, and values do something similar: they stretch or shear the manifold so that the `similarity of alike words increases whereas for dissimilar words it decreases.`

- In a nutshell, transforming vectors with matrices can increase or decrease the similarity score, and hence the attention weights, between two vectors.
    - This is what K, Q and V do to the input embedding vectors. They are trainable, meaning that over the course of training their weights are optimized to reshape the manifold. This increases or decreases the similarity between tokens as the loss function is optimized.
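
A small sketch of this effect, reusing the toy coordinates for “bat”, “cave”, and “racket”. The matrix `W` is hand-picked for illustration (an assumption, not a learned weight): it stretches the x-axis by a factor of 3, which pulls “bat” towards “cave” and away from “racket” in terms of cosine similarity.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

bat = np.array([3.0, 3.0])
cave = np.array([4.0, 0.0])
racket = np.array([0.0, 5.0])

# Hand-picked matrix (illustrative only, not a trained weight):
# stretches the x-axis by 3 and leaves the y-axis unchanged.
W = np.array([[3.0, 0.0],
              [0.0, 1.0]])

before = (cosine(bat, cave), cosine(bat, racket))
after = (cosine(W @ bat, W @ cave), cosine(W @ bat, W @ racket))

print("bat-cave:   %.2f -> %.2f" % (before[0], after[0]))   # 0.71 -> 0.95
print("bat-racket: %.2f -> %.2f" % (before[1], after[1]))   # 0.71 -> 0.32
```

During training, the entries of such matrices are not chosen by hand but adjusted by gradient descent, so the direction of the shift depends on what lowers the loss.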

## Transformers and Attention Mechanism

![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*kf871smtQKXAf3dSYOlVPA.png)

![img](https://storrs.io/content/images/2021/08/Screen-Shot-2021-08-07-at-7.51.37-AM.png)

![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*7sy0KlRgyh3n0M0SH-6xqQ.gif)

1. Submit a "Query" for each token in the input sequence.
2. Take those queries and match them against the "Keys" -- the Query is essentially probing the Keys to see how relevant they are to the current token.
3. The similarity between the Query and the Keys is computed using a dot product -- the attention matrix is then formed by taking the softmax of the dot products.
4. The attention matrix is a representation of how much each query matches a given key -- the attention values indicate how much each token in the sequence is "paying attention" to other tokens in the sequence.
5. The attention values are then used to weight the values associated with each key, producing a weighted sum that represents the output for each token.

- **Self-attention**: Q, K, and V all come from the same sequence
- **Cross-attention**: Q comes from one sequence, while K and V come from another sequence
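
The five steps, and the difference between self- and cross-attention, can be sketched with a single NumPy function. This is a minimal sketch under stated assumptions: the function name `attention`, the random projection matrices `W_q`, `W_k`, `W_v`, and the toy sizes are all made up for the example, and the same matrices are reused for both calls only for brevity (real models use separate weights per layer).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(query_seq, kv_seq, W_q, W_k, W_v):
    """Scaled dot-product attention with queries from one sequence
    and keys/values from another (possibly the same) sequence."""
    Q = query_seq @ W_q                      # (n_q, d)
    K = kv_seq @ W_k                         # (n_kv, d)
    V = kv_seq @ W_v                         # (n_kv, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_q, n_kv)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
embed_dim, d = 4, 3                          # toy sizes (assumed)
W_q, W_k, W_v = (rng.normal(size=(embed_dim, d)) for _ in range(3))

source = rng.normal(size=(5, embed_dim))     # e.g. 5 input tokens
target = rng.normal(size=(2, embed_dim))     # e.g. 2 output tokens so far

# self-attention: Q, K, V all come from `source`
self_out, self_w = attention(source, source, W_q, W_k, W_v)
# cross-attention: Q from `target`, K and V from `source`
cross_out, cross_w = attention(target, source, W_q, W_k, W_v)

print(self_w.shape, self_out.shape)    # (5, 5) (5, 3)
print(cross_w.shape, cross_out.shape)  # (2, 5) (2, 3)
```

The attention matrix has one row per query and one column per key, so it is square for self-attention and generally rectangular for cross-attention.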

![img](https://storrs.io/content/images/size/w1600/2021/08/image3--8-.png)

![img](https://storrs.io/content/images/size/w1600/2021/08/image8--1-.png)

<img src="https://storrs.io/content/images/size/w1600/2021/08/image7--2-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image4--2-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image6--3-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image5--3-.png" alt="Alt Text" width="800px">


![image](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*YE0dWWP7uzWIa5zmNN2xdA.png)

```python
import torch
import torch.nn as nn

# Helper class for Self-Attention calculation
class SelfAttention(nn.Module):
	def __init__(self, input_dim, output_dim):
		super().__init__()
		self.output_dim = output_dim

		# matrices for Query, Key, and Value
		self.W_query = nn.Linear(
			in_features=input_dim, out_features=output_dim, bias=False
		)
		self.W_key = nn.Linear(
			in_features=input_dim, out_features=output_dim, bias=False
		)
		self.W_value = nn.Linear(
			in_features=input_dim, out_features=output_dim, bias=False
		)

	def forward(self, x): # x shape: (N, input_dim), N: number of tokens
		queries = self.W_query(x)
		keys = self.W_key(x)
		values = self.W_value(x)

		# attention scores
		attn_scores = queries @ keys.T # N x N

		# attention weights
		attn_weights = torch.softmax(attn_scores / self.output_dim**0.5, dim=-1)

		# compute context vector
		context_vec = attn_weights @ values # N x output_dim

		return context_vec

# Multi-head self-attention
class MultiHeadAttention(nn.Module):
	def __init__(self, input_dim, output_dim, num_heads):
		super().__init__()
		# nn.ModuleList registers each head's parameters with the module
		self.heads = nn.ModuleList([
			SelfAttention(input_dim, output_dim)
			for _ in range(num_heads)
		])

	def forward(self, x):
		return torch.cat([head(x) for head in self.heads], dim=-1)
```

## Positional Embeddings?

![image](./pos_emb.png)