#### **Self-Attention**

Transformers and their related architectures are all built on the "self-attention" operation.<br><br><br>
Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let's call the input vectors $\mathbf{x_1, x_2, x_3,...,x_t}$ and the corresponding output vectors $\mathbf{y_1, y_2, y_3,..., y_t}$; all vectors have dimension $k$.<br>
To produce the output vector $\mathbf{y_i}$, the self-attention operation simply takes a weighted average over all the input vectors<br><br>
$$\mathbf{y_i} = \sum_j \mathbf{w}_{ij}\mathbf{x}_j$$
where $j$ indexes over the whole sequence and the weights sum to one over all $j$. Unlike in a normal neural net, the weight $\mathbf{w_{ij}}$ is not a parameter; it is derived from a function of $\mathbf{x_i}$ and $\mathbf{x_j}$. The simplest option for this function is the dot product:
$$\mathbf{w}_{ij}' = \mathbf{x_i^Tx_j}$$
Note that $\mathbf{x_i}$ is the input vector at the same position as the current output vector $\mathbf{y_i}$. For the next output vector, we get an entirely new series of dot products, and a different weighted sum.<br><br>
The dot product gives us a value anywhere between negative and positive infinity, so we apply a softmax to map the values to $[0, 1]$ and to ensure that they sum to $1$ over the whole sequence:
$$\mathbf{w}_{ij} = \frac{\exp  \mathbf{w}_{ij}'}{\sum_j \exp \mathbf{w}_{ij}'}$$
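
The three steps above (dot products, softmax, weighted sum) can be sketched for a single output position in NumPy. This is a minimal illustration, not a reference implementation: the function name `attention_output`, the toy sizes, and the random inputs are all assumptions made for the example. The subtraction of the maximum score is a standard numerical-stability trick that leaves the softmax result unchanged.

```python
import numpy as np

def attention_output(x, i):
    """Compute y_i and its weights for a sequence x of shape (t, k)."""
    raw = x @ x[i]                    # w'_ij = x_i^T x_j for every j, shape (t,)
    raw = raw - raw.max()             # stability shift; does not change the softmax
    weights = np.exp(raw) / np.exp(raw).sum()  # w_ij: non-negative, sums to 1 over j
    y_i = weights @ x                 # weighted sum over all input vectors, shape (k,)
    return y_i, weights

# Toy input (assumed values): t = 5 vectors of dimension k = 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
y_2, w_2 = attention_output(x, 2)
```

Calling the function with a different `i` reuses the same inputs but yields a new set of dot products and therefore a different weighted sum.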

Assigning an embedding vector $\mathbf{v}_t$ to each word $t$ in our vocabulary is the job of an *embedding layer*; the values of the embedding vectors are learned. It turns the word sequence *the, cat, walks, on, the, street* into the vector sequence
$$\mathbf{v}_{the}, \mathbf{v}_{cat}, \mathbf{v}_{walks}, \mathbf{v}_{on}, \mathbf{v}_{the}, \mathbf{v}_{street}$$
When this sequence is fed into the self-attention layer, the output is another sequence of vectors<br>
$$\mathbf{y}_{the}, \mathbf{y}_{cat}, \mathbf{y}_{walks}, \mathbf{y}_{on}, \mathbf{y}_{the}, \mathbf{y}_{street}$$
where $\mathbf{y}_{cat}$ is a weighted sum over all the embedding vectors in the first sequence, weighted by their (normalized) dot product with $\mathbf{v}_{cat}$.<br><br><br>
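
As a rough sketch of this example, the embedding layer can be modelled as a lookup table with one row per vocabulary word. Everything concrete here is an assumption for illustration: the `vocab` mapping, the dimension `k = 8`, and the random matrix `embedding`, which stands in for values that would normally be learned during training.

```python
import numpy as np

# Hypothetical vocabulary; in practice it is built from the training data.
vocab = {"the": 0, "cat": 1, "walks": 2, "on": 3, "street": 4}
k = 8
rng = np.random.default_rng(0)
# Stand-in for a learned embedding matrix: row t holds the vector v_t.
embedding = rng.normal(size=(len(vocab), k))

sentence = ["the", "cat", "walks", "on", "the", "street"]
ids = np.array([vocab[word] for word in sentence])
v = embedding[ids]                    # v_the, v_cat, ..., v_street; shape (6, k)

# Basic self-attention over the embedded sentence.
scores = v @ v.T                      # all pairwise dot products
scores -= scores.max(axis=1, keepdims=True)   # stability shift per row
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True) # softmax over each row
y = weights @ v                       # y_the, y_cat, ...; shape (6, k)

y_cat = y[1]                          # weighted sum of all v, weighted against v_cat
```

Both occurrences of *the* share the same embedding row, so in this basic form they also receive the same output vector.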

Since we are *learning* what the values in $\mathbf{v}_t$ should be, how "related" two words are is entirely determined by the task. In most cases, the definite article *the* is not very relevant to the interpretation of the other words in the sentence; therefore we will likely end up with an embedding $\mathbf{v}_{the}$ that has a low or negative dot product with all other words. On the other hand, to interpret what *walks* means in this sentence, it's very helpful to work out *who* is doing the walking. This is likely expressed by a noun, so for nouns like *cat* and verbs like *walks*, we will likely learn embeddings $\mathbf{v}_{cat}$ and $\mathbf{v}_{walks}$ that have a high, positive dot product together.<br><br><br>
This is the basic intuition behind self-attention. The dot product expresses how related two vectors in the input sequence are, with "related" defined by the learning task, and the output vectors are weighted sums over the whole input sequence, with the weights determined by these dot products.<br><br><br>
Before we move on, it's worthwhile to note the following properties, which are unusual for a sequence-to-sequence operation:
*   There are no parameters (yet). What basic self-attention actually does is entirely determined by whatever mechanism creates the input sequence. Upstream mechanisms, like an embedding layer, drive the self-attention by learning representations with particular dot products (although we'll add a few parameters later).
*   Self-attention sees its input as a set, not a sequence. If we permute the input sequence, the output sequence will be exactly the same, except permuted in the same way (i.e. self-attention is permutation equivariant). We will mitigate this somewhat when we build the full transformer, but self-attention by itself ignores the sequential nature of the input.
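
The permutation-equivariance property in the list above can be checked numerically. The sketch below is only an illustration on random toy data; the helper name `basic_self_attention` and the sizes are assumptions, not part of any library.

```python
import numpy as np

def basic_self_attention(x):
    """Parameter-free self-attention for x of shape (t, k)."""
    scores = x @ x.T
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift per row
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))           # toy sequence: t = 6, k = 4
perm = rng.permutation(6)             # a random reordering of the positions

y = basic_self_attention(x)
y_perm = basic_self_attention(x[perm])

# Permuting the input and then attending matches attending and then permuting.
equivariant = np.allclose(y_perm, y[perm])
```

The check holds for any permutation because every output depends only on its own input vector and on the unordered collection of all input vectors.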



The whole self-attention equation can be written as:
$$\begin{align*}
\mathbf{y}_i = \displaystyle \sum_j \left\{ \frac{\exp \mathbf{x}_i^T \mathbf{x}_j}{\sum_k \exp \mathbf{x}_i^T \mathbf{x}_k} \right\} \mathbf{x}_j
\end{align*}$$<br><br>
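Here $j$ and $k$ both index positions in the same input sequence; $k$ is only the summation index of the softmax denominator. <br><br>In words: a sentence is given to self-attention as a sequence of input vectors, and one output vector is produced for each input position. Each output vector is a weighted sum of all the input vectors, where every input vector is scaled by a scalar weight (plain scalar multiplication, not a Hadamard or dot product). The weights for position $i$ are obtained by taking the dot product of $\mathbf{x}_i$ with every input vector, including itself, and passing the results through a softmax so that they behave like probabilities. Basic self-attention learns nothing by itself; how the input vectors are arranged is learned by the upstream layer that produces them, such as the embedding layer.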
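
To connect the formula to code, the sketch below writes it twice: once as a literal translation with explicit sums over $j$ and $k$, and once as matrix products, $Y = \mathrm{softmax}(XX^T)X$ with the softmax taken per row. Both function names and the toy input are assumptions for this example. The stability shift is left out so that the code mirrors the equation, which is fine for small inputs like these.

```python
import numpy as np

def self_attention_loops(x):
    """Literal translation of the formula with explicit sums over j and k."""
    t = x.shape[0]
    y = np.zeros_like(x)
    for i in range(t):
        # Softmax denominator: sum over k of exp(x_i^T x_k).
        denom = sum(np.exp(x[i] @ x[k]) for k in range(t))
        for j in range(t):
            y[i] += np.exp(x[i] @ x[j]) / denom * x[j]
    return y

def self_attention_matrix(x):
    """Same computation as matrix products, with a row-wise softmax."""
    w = np.exp(x @ x.T)
    return (w / w.sum(axis=1, keepdims=True)) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))           # toy sequence: 4 vectors of dimension 3
same = np.allclose(self_attention_loops(x), self_attention_matrix(x))
```

The matrix form is the one typically used in practice, since the whole sequence is processed with two matrix multiplications instead of nested loops.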

**References:**
1.   https://peterbloem.nl/blog/transformers