# Attention

[link](https://ut.philkr.net/deeplearning/transformers/attention/)

![image.png](attachment:image.png)

## Attention
For a set of key-value pairs $\{(k_i,v_i)\}_{i=1}^N \subset \mathbb{R}^{d_k}\times \mathbb{R}^{d_v}$ and a set of queries $\{q_j\}_{j=1}^M \subset \mathbb{R}^{d_k}$, attention returns the "expected" value $o_j \in \mathbb{R}^{d_v}$ for each query $q_j, \ j=1,2,\dots,M$.

## Definition

Inputs
* a set of queries $Q = [q_1|q_2|\dots|q_M]^T \in \mathbb{R}^{M \times d_k}$
* a set of keys $K = [k_1|k_2|\dots|k_N]^T \in \mathbb{R}^{N \times d_k}$
* a set of values $V = [v_1|v_2|\dots|v_N]^T \in \mathbb{R}^{N \times d_v}$

Output: $O= [o_1|o_2|\dots|o_M]^T \in\mathbb{R}^{M \times d_v}$
$$ \text{Attention}(Q,K,V) = O  = \alpha V, \quad \text{ where } \quad \alpha = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{M\times N}$$
$$ o_{i}  =  \sum_{j=1}^N\alpha_{i,j}v_j\, \quad \text{ where } \quad  \alpha_{i,j}=\frac{e^{\frac{q_i^{T}k_j}{\sqrt{d_k}}}}{\sum_{l=1}^Ne^{\frac{q_i^{T}k_l}{\sqrt{d_k}}}}$$


where $\text{softmax}$ is applied per row, so each row of $\alpha$ sums to one. If we denote by $v(q_i)$ the random variable "value of the query $q_i$", for $i=1,2,\dots,M$, then the induced probability distribution of $v(q_i)$ is
$$p(v(q_i) = v_j)=\alpha_{i,j}, \quad j=1,2,\dots,N.$$

Notice that $\sum_{j=1}^N\alpha_{i,j}=1, \ i =1,2,\dots,M$.
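This distribution makes the word "expected" precise. Treating the values $v_1,\dots,v_N$ as the possible outcomes of $v(q_i)$, the output $o_i$ is the mean of $v(q_i)$:
$$\mathbb{E}[v(q_i)] = \sum_{j=1}^N p(v(q_i) = v_j)\, v_j = \sum_{j=1}^N \alpha_{i,j}\, v_j = o_i.$$
As a small illustrative case (not from the original notes), take $N=2$ and a query with $q_i^Tk_1 = q_i^Tk_2$. Then $\alpha_{i,1}=\alpha_{i,2}=\tfrac{1}{2}$ and $o_i = \tfrac{1}{2}(v_1+v_2)$, the plain average of the two values.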

The value $p(v(q_i) = v_j)=\alpha_{i,j}$ is usually interpreted as how much attention the query $q_i$ pays to the value $v_j$. So the "attention" of $q_i$ is distributed across the values $v_j$.
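The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax_rows` and `attention`, the random inputs, and the sizes $M=3$, $N=5$, $d_k=4$, $d_v=2$ are assumptions chosen only for the example.

```python
import numpy as np

def softmax_rows(scores):
    # Subtracting the row-wise maximum leaves softmax unchanged
    # and avoids overflow in exp.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Q: (M, d_k), K: (N, d_k), V: (N, d_v) -> O: (M, d_v), alpha: (M, N)."""
    d_k = Q.shape[-1]
    alpha = softmax_rows(Q @ K.T / np.sqrt(d_k))  # one distribution per query
    return alpha @ V, alpha

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
M, N, d_k, d_v = 3, 5, 4, 2
Q = rng.normal(size=(M, d_k))
K = rng.normal(size=(N, d_k))
V = rng.normal(size=(N, d_v))

O, alpha = attention(Q, K, V)
print(O.shape, alpha.shape)                 # (3, 2) (3, 5)
print(np.allclose(alpha.sum(axis=1), 1.0))  # True: each row of alpha sums to one
```

Row `alpha[i]` holds the weights $\alpha_{i,1},\dots,\alpha_{i,N}$, and `O[i]` is the corresponding weighted sum of the rows of `V`.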

## Attention with weights
Inputs
* a set of queries $Q = [q_1|q_2|\dots|q_M]^T \in \mathbb{R}^{M \times c_q}$
* a set of keys $K = [k_1|k_2|\dots|k_N]^T \in \mathbb{R}^{N \times c_k}$
* a set of values $V = [v_1|v_2|\dots|v_N]^T \in \mathbb{R}^{N \times c_v}$

Weights
* a set of query weights $W_q \in \mathbb{R}^{c_q\times d_k}$ and bias $b_q\in \mathbb{R}^{d_k}$
* a set of key weights $W_k \in \mathbb{R}^{c_k \times d_k}$ and bias $b_k\in \mathbb{R}^{d_k}$
* a set of value weights $W_v \in \mathbb{R}^{c_v \times d_v}$ and bias $b_v\in \mathbb{R}^{d_v}$

Output: 
* Output $O= [o_1|o_2|\dots|o_M]^T \in\mathbb{R}^{M\times d_v}$
\begin{align*}
O =\text{Attention}_{\mathcal{W}}(Q,K,V)=\text{Attention}(QW_q+B_q,KW_k+B_k,VW_v+B_v)
\end{align*}
where
\begin{align*}
B_q &= [b_q|b_q|\dots|b_q]^T \in \mathbb{R}^{M \times d_k},\\ 
B_k &= [b_k|b_k|\dots|b_k]^T \in \mathbb{R}^{N \times d_k},\\
B_v &= [b_v|b_v|\dots|b_v]^T \in \mathbb{R}^{N \times d_v},\\
\mathcal{W} &= \{W_q,b_q,W_k,b_k,W_v,b_v\}.
\end{align*}

Usually we have $c_q = c_k = c_v = d_k=d_v$.
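The weighted variant can be sketched the same way: project the inputs, then apply plain attention. The names `attention_with_weights` and `params`, the random parameters, and the sizes below are assumptions for illustration. The sizes are deliberately all different, to show that only the shapes listed above have to agree.

```python
import numpy as np

def softmax_rows(scores):
    # Numerically stable row-wise softmax.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Plain scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

def attention_with_weights(Q, K, V, params):
    W_q, b_q, W_k, b_k, W_v, b_v = params
    # NumPy broadcasting adds each bias vector to every row,
    # which is exactly what the matrices B_q, B_k, B_v encode.
    return attention(Q @ W_q + b_q, K @ W_k + b_k, V @ W_v + b_v)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
M, N = 2, 5
c_q, c_k, c_v, d_k, d_v = 6, 5, 7, 4, 3
Q = rng.normal(size=(M, c_q))
K = rng.normal(size=(N, c_k))
V = rng.normal(size=(N, c_v))
params = (
    rng.normal(size=(c_q, d_k)), rng.normal(size=d_k),  # W_q, b_q
    rng.normal(size=(c_k, d_k)), rng.normal(size=d_k),  # W_k, b_k
    rng.normal(size=(c_v, d_v)), rng.normal(size=d_v),  # W_v, b_v
)

O = attention_with_weights(Q, K, V, params)
print(O.shape)  # (2, 3), i.e. (M, d_v)
```

With identity weight matrices and zero biases (which requires $c_q=c_k=d_k$ and $c_v=d_v$), this reduces to plain attention.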

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)