## Attention Mechanism
***
Time: 2020-09-16<br>
Author: dsy<br>
Notes: 《神经网络与深度学习》 (*Neural Networks and Deep Learning*)
***

Self-attention

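Suppose the input sequence is $X=[x_1,\cdots,x_N] \in \mathbb{R}^{d_1 \times N}$ and the output sequence is $H=[h_1,\cdots,h_N] \in \mathbb{R}^{d_2 \times N}$. We first apply linear transformations to obtain three sequences of vectors:
$$
\begin{aligned}
Q & = W_Q X \in \mathbb{R}^{d_3 \times N},\\
K & = W_K X \in \mathbb{R}^{d_3 \times N},\\
V & = W_V X \in \mathbb{R}^{d_2 \times N},
\end{aligned}
$$
where $Q$, $K$, and $V$ are the query, key, and value vector sequences, respectively, and $W_Q \in \mathbb{R}^{d_3 \times d_1}$, $W_K \in \mathbb{R}^{d_3 \times d_1}$, $W_V \in \mathbb{R}^{d_2 \times d_1}$ are learnable parameter matrices.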

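Applying the attention formula with $q_i$ as the query gives the output vector $h_i$:
$$
\begin{aligned}
h_i & = \mathrm{att} \Big( (K,V),q_i\Big) \\
    & = \sum_{j=1}^{N} \alpha_{ij}v_j\\
    & = \sum_{j=1}^{N} \mathrm{softmax} \Big(s(k_j,q_i) \Big)v_j
\end{aligned}
$$
Here $i,j \in [1,N]$ index positions in the output and input sequences, $\alpha_{ij}$ is the weight with which the $i$-th output attends to the $j$-th input, and the softmax is normalized over $j$.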
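The computation can be sketched in a few lines of PyTorch. This is a minimal illustration, not a reference implementation: it assumes the scaled dot-product model as the scoring function $s(k_j,q_i)$, stores vectors as columns as in the notation of this section, and uses randomly initialized matrices in place of learned parameters. The function name `self_attention` and the sizes are made up for the example.

```python
import math
import torch

def self_attention(X, W_Q, W_K, W_V):
    """Self-attention with columns as vectors.

    X: (d_1, N). Returns H: (d_2, N) and alpha: (N, N), alpha[i, j] = alpha_ij.
    Assumption: s(k_j, q_i) is the scaled dot product k_j^T q_i / sqrt(d_3).
    """
    Q = W_Q @ X  # (d_3, N)
    K = W_K @ X  # (d_3, N)
    V = W_V @ X  # (d_2, N)
    d_3 = Q.shape[0]
    scores = (Q.T @ K) / math.sqrt(d_3)    # scores[i, j] = s(k_j, q_i)
    alpha = torch.softmax(scores, dim=-1)  # normalize over j
    H = V @ alpha.T                        # h_i = sum_j alpha_ij * v_j
    return H, alpha

torch.manual_seed(0)
d_1, d_2, d_3, N = 8, 6, 4, 5  # illustrative sizes
X = torch.rand(d_1, N)
W_Q = torch.rand(d_3, d_1)
W_K = torch.rand(d_3, d_1)
W_V = torch.rand(d_2, d_1)

H, alpha = self_attention(X, W_Q, W_K, W_V)
print(H.shape)            # (d_2, N)
print(alpha.sum(dim=-1))  # each row of attention weights sums to 1
```

Writing the sum over $j$ as the matrix product `V @ alpha.T` computes all $N$ outputs at once, which is why self-attention parallelizes well across positions.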


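Commonly used attention scoring functions:
$$\begin{array}{cll}
\text{Additive model} & s(x_i,q) & = v^T \tanh(W x_i + U q) \\
\text{Dot-product model} & s(x_i,q) & = x_i^Tq \\
\text{Scaled dot-product model} & s(x_i,q) & = \frac{x_i^T q}{\sqrt{d}} \\
\text{Bilinear model} & s(x_i,q) & = x_i^T W q 
\end{array}$$
where $W$, $U$, and $v$ are learnable parameters and $d$ is the dimension of the input vectors.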
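The four scoring functions can be written directly in PyTorch for a single pair $(x_i, q)$. The sketch below is only illustrative: the function names, the hidden size of the additive model, and the random tensors standing in for learned parameters are all assumptions for the example.

```python
import math
import torch

def additive_score(x, q, W, U, v):
    # s(x, q) = v^T tanh(W x + U q)
    return v @ torch.tanh(W @ x + U @ q)

def dot_score(x, q):
    # s(x, q) = x^T q
    return x @ q

def scaled_dot_score(x, q):
    # s(x, q) = x^T q / sqrt(d), with d the dimension of x
    return (x @ q) / math.sqrt(x.shape[0])

def bilinear_score(x, q, W):
    # s(x, q) = x^T W q
    return x @ W @ q

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
q = torch.tensor([0.5, 0.0, -1.0, 2.0])

torch.manual_seed(0)
hidden = 3  # hypothetical hidden size for the additive model
W_a, U_a, v_a = torch.rand(hidden, 4), torch.rand(hidden, 4), torch.rand(hidden)

print(dot_score(x, q).item())         # 5.5
print(scaled_dot_score(x, q).item())  # 2.75
print(bilinear_score(x, q, torch.eye(4)).item())  # identity W reduces to the dot product
print(additive_score(x, q, W_a, U_a, v_a).item())
```

The scaled variant divides by $\sqrt{d}$ so that scores do not grow with the vector dimension, which keeps the softmax away from its saturated, small-gradient region.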

In [1]:
import torch
import torch.nn as nn

In [2]:
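# L: target sequence length, N: batch size, E: embedding dimension, S: source sequence length
L, N, E, S = 100, 200, 300, 400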

In [3]:
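# embed_dim is split across the heads, so each head has dimension 300 / 10 = 30
multiheadAttention = nn.MultiheadAttention(
    embed_dim=E,
    num_heads=10,
    dropout=0.0,
    bias=True,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,  # None: keys use embed_dim
    vdim=None,  # None: values use embed_dim
)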

In [4]:
# Inputs are sequence-first: (sequence length, batch size, embedding dimension)
query = torch.rand((L, N, E))

key = torch.rand((S, N, E))

value = torch.rand((S, N, E))

In [5]:
attn_output, attn_output_weights = multiheadAttention(
    query,
    key,
    value,
    key_padding_mask=None,
    need_weights=True,
    attn_mask=None,
)

In [6]:
attn_output.shape  # (L, N, E)

torch.Size([100, 200, 300])

In [7]:
attn_output_weights.shape  # (N, L, S), averaged over the heads

torch.Size([200, 100, 400])
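
As a quick sanity check on these shapes, the returned weights can be verified to form a probability distribution over the source positions for every query position. The sketch below uses deliberately small, made-up sizes and a hypothetical variable name `mha` so that it runs quickly; it relies on `dropout=0.0` so that the weights are not perturbed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, N, E, S = 3, 2, 8, 5  # small illustrative sizes, not the ones used above
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2, dropout=0.0)

query = torch.rand(L, N, E)
key = torch.rand(S, N, E)
value = torch.rand(S, N, E)

with torch.no_grad():
    out, weights = mha(query, key, value, need_weights=True)

print(out.shape)      # (L, N, E)
print(weights.shape)  # (N, L, S)

# For each batch element and query position, the weights over the S
# source positions should sum to 1.
row_sums = weights.sum(dim=-1)
print(torch.allclose(row_sums, torch.ones(N, L), atol=1e-5))
```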