# Attention

*In this exercise we are going to work with attention. If you haven't heard, it has become **massively** important, mostly because transformers use it (and it is part of what makes them the new black for everything).*

After these exercises (and the lecture), you should be able to (in rough order of importance):
- Understand what attention is
- Understand how attention is computed given inputs 
- Understand why it is often called 'self-attention'
- Understand how neural networks are used to get attention values
- Understand why attention is good to use
- Understand how attention differs from just embeddings

- Understand early (non-learnable) vs. newer learnable attention

## Quick introduction


*In its most basic form, attention is a similarity measure applied to sequences, usually of words. As you may know, the dot product is a rough measure of similarity between vectors. Let $\mathbf{y}$ be an input sequence of word embeddings, for example a sentence in English, and let $\mathbf{x}$ be the same sentence in another language. The similarity between word $i$ of $\mathbf{x}$ and word $j$ of $\mathbf{y}$ is then:*

$$e_{ij} = \mathbf{x}^{(i)^T} \mathbf{y}^{(j)}$$

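*That is, how much word $\mathbf{x}^{(i)}$ relates to word $\mathbf{y}^{(j)}$. The softmax is then usually applied over $j$ to scale these values to lie between 0 and 1 (and sum to 1 for each $i$), which gives the attention weight $W_{ij}$ between word $i$ and word $j$*:

$$W_{ij} = \text{softmax}\left(\mathbf{x}^{(i)^T} \mathbf{y}^{(j)}\right) = \frac{\exp\left(\mathbf{x}^{(i)^T} \mathbf{y}^{(j)}\right)}{\sum_{k} \exp\left(\mathbf{x}^{(i)^T} \mathbf{y}^{(k)}\right)}$$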

*After this, the attention weights $W_{ij}$ can then be used on each word of the input sentence, to get how much they relate to those in the desired output sentence:*

$$\mathbf{o}^{(i)} = \sum^T_{j = 0} W_{ij} \mathbf{y}^{(j)}$$

*The result $\mathbf{o}^{(i)}$ will then be **a vector** representing how much each word in the input sentence relates to **a word** in the output sentence. If we wanted to get an attention value **from each word** to **each word**, we would need a matrix, which would be the whole $\mathbf{o}$ (one row $\mathbf{o}^{(i)}$ per word $i$).*
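*The three steps above (dot products, softmax, weighted sum) can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the toy embeddings are made up, the function names are hypothetical, and it assumes the weighted sum runs over the words of $\mathbf{y}$, the sequence that $j$ indexes.*

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dot_product_attention(x, y):
    """Basic (non-learnable) attention between two embedded sequences.

    x: array of shape (n_x, d), one word embedding per row
    y: array of shape (n_y, d), one word embedding per row
    """
    e = x @ y.T             # e[i, j] = dot product of x[i] and y[j]
    W = softmax(e, axis=1)  # normalise over j, so each row sums to 1
    o = W @ y               # o[i] = sum_j W[i, j] * y[j]
    return W, o


# Made-up toy embeddings: 2 words in x, 3 words in y, dimension 2
x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

W, o = dot_product_attention(x, y)
print(W.shape, o.shape)  # (2, 3) (2, 2)
print(W.sum(axis=1))     # every row of attention weights sums to 1
```

*Each row of `W` holds the attention weights from one word of `x` to every word of `y`, and each row of `o` is the corresponding weighted average of the `y` embeddings.*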

## Self-attention

*Introduced primarily in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), self-attention improves on 'regular' attention by adding learnable parameters to the attention.*

*Note that above, the attention values only depended on the similarity between words, meaning that words embedded with similar values also got high attention scores. This doesn't make sense for many tasks, particularly translation, since some words in a sentence can relate strongly to others despite being different. Consider the sentence:*

***Mary saw a dog in the window, she wanted it***

*Logically, we can think of the words (Mary, saw, wanted, it), (dog, it), (in, window) as fitting together, as they describe the relationship that Mary wanted **it**, with "it" meaning "the dog" (not the window). This would not be caught by the above method, since "it" and "dog" are unlikely to have very similar embeddings...*

*The solution is to use **self-attention**. Here we maintain three new embedding values for each word: $\mathbf{Q}$ (queries), $\mathbf{K}$ (keys), and $\mathbf{V}$ (values). Do not worry about the names.*

*We train these three embeddings much the same way as we would word embeddings in, say, fastText. However, they are not embeddings themselves. **They work on embeddings**. To get $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ for one input sentence (of words!) $\mathbf{s}$, we first get a word embedding of that input sentence, and then multiply that embedding matrix by a separate learned weight matrix for each of $\mathbf{Q}, \mathbf{K}, \mathbf{V}$:*

$$\mathbf{Q} = \text{embedding}(\mathbf{s})W_Q,\quad \mathbf{K} = \text{embedding}(\mathbf{s})W_K, \quad \mathbf{V} = \text{embedding}(\mathbf{s})W_V$$
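*As a rough sketch of these projections in NumPy: the sizes below are arbitrary, the random matrix `E` is only a stand-in for $\text{embedding}(\mathbf{s})$, and the weight matrices are randomly initialised here, whereas in a real model they would be learned during training.*

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10 words, embedding dimension 16, projection dimension 8
n_words, d_model, d_k = 10, 16, 8

# Stand-in for embedding(s): one row per word of the sentence
E = rng.normal(size=(n_words, d_model))

# One weight matrix per projection (random here, learned in practice)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every word gets its own query, key and value vector
Q = E @ W_Q
K = E @ W_K
V = E @ W_V

print(Q.shape, K.shape, V.shape)  # (10, 8) (10, 8) (10, 8)
```

*Note that all three projections start from the same embedded sentence, which is where the "self" in self-attention comes from.*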

*We then calculate the full attention matrix:*

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}$$

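*The reason we scale by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys (the number of columns of $W_K$, which is either chosen freely or set from the number of 'heads' in multi-head attention), is to keep the dot products from growing large as $d_k$ grows. Large dot products push the softmax into regions where its gradients are extremely small, which makes training harder (don't worry too much about it)...*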
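*The attention formula and the effect of the scaling can be sketched as follows. This is a minimal NumPy illustration under assumed conditions: `Q`, `K` and `V` are random stand-ins rather than trained projections, and the function names are hypothetical.*

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return it with the weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) scaled similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n_words, d_k = 6, 256

# Random stand-ins for the projected queries, keys and values
Q = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (6, 256) (6, 6)

# Without the scaling, the scores are much larger in magnitude ...
raw_scores = Q @ K.T
print(raw_scores.std(), (raw_scores / np.sqrt(d_k)).std())

# ... so the softmax puts almost all of its weight on a single word
unscaled_weights = softmax(raw_scores, axis=-1)
print(unscaled_weights.max(axis=1))
print(weights.max(axis=1))
```

*Comparing the last two printed rows shows the point of the scaling: the unscaled weights are close to one-hot, which is the saturated regime where the softmax passes back very little gradient.*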

# 1 - Calculating attention

