## Attention Mechanism
***
Time: 2020-09-16<br>
Author: dsy<br>
Notes: *Neural Networks and Deep Learning* (《神经网络与深度学习》)
***

### Self-attention

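Suppose the input sequence is $X=[x_1,\cdots,x_N] \in \mathbb{R}^{d_1 \times N}$ and the output sequence is $H=[h_1,\cdots,h_N] \in \mathbb{R}^{d_2 \times N}$. We first apply three linear projections to obtain three sequences of vectors:
$$
\begin{aligned}
Q & = W_Q X \in \mathbb{R}^{d_3 \times N},\\
K & = W_K X \in \mathbb{R}^{d_3 \times N},\\
V & = W_V X \in \mathbb{R}^{d_2 \times N},
\end{aligned}
$$
where $Q$, $K$, and $V$ are the sequences of query, key, and value vectors, and $W_Q \in \mathbb{R}^{d_3 \times d_1}$, $W_K \in \mathbb{R}^{d_3 \times d_1}$, $W_V \in \mathbb{R}^{d_2 \times d_1}$ are learnable parameter matrices.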
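
As a quick shape check, the three projections can be sketched in PyTorch. The sizes `d1`, `d2`, `d3`, and `N` below are arbitrary illustrative values, not taken from the text, and the weights are random rather than trained.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions): input dim, value dim, query/key dim, sequence length
d1, d2, d3, N = 4, 5, 3, 6

X = torch.randn(d1, N)      # columns are the input vectors x_1..x_N

W_Q = torch.randn(d3, d1)   # query projection
W_K = torch.randn(d3, d1)   # key projection
W_V = torch.randn(d2, d1)   # value projection

Q = W_Q @ X                 # (d3, N)
K = W_K @ X                 # (d3, N)
V = W_V @ X                 # (d2, N)

print(Q.shape, K.shape, V.shape)
```

Queries and keys share the dimension `d3` because they are compared with each other, while the value dimension `d2` sets the size of each output vector.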

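For each query vector $q_i$ (the $i$-th column of $Q$), the key-value attention mechanism gives the output vector $h_i$:
$$
\begin{aligned}
h_i & = \mathrm{att} \Big( (K,V),q_i\Big) \\
    & = \sum_{j=1}^{N} \alpha_{ij}v_j\\
    & = \sum_{j=1}^{N} \mathrm{softmax} \Big(s(k_j,q_i) \Big)v_j
\end{aligned}
$$
where $\alpha_{ij}$ is the weight that the $i$-th output places on the $j$-th input, $s(\cdot,\cdot)$ is the attention scoring function, and the softmax is taken over $j = 1,\cdots,N$.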
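
If the scoring function is assumed to be the scaled dot product with $d = d_3$, the $N$ per-query sums can be collected into a single matrix expression. This is a sketch under that assumption; other scoring functions do not reduce to this form.

$$
H = V \,\mathrm{softmax}\!\left(\frac{K^{T} Q}{\sqrt{d_3}}\right)
$$

Here the softmax is applied to each column separately. Entry $(j,i)$ of $K^{T}Q$ is $k_j^T q_i$, so after normalization column $i$ holds the weights $\alpha_{i1},\cdots,\alpha_{iN}$, and column $i$ of the product is $\sum_{j=1}^{N}\alpha_{ij}v_j = h_i$.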


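Commonly used attention scoring functions:
$$\begin{array}{cll}
\text{Additive} & s(x_i,q) & = v^T \tanh(W x_i + U q) \\
\text{Dot product} & s(x_i,q) & = x_i^Tq \\
\text{Scaled dot product} & s(x_i,q) & = \frac{x_i^T q}{\sqrt{d}} \\
\text{Bilinear} & s(x_i,q) & = x_i^T W q 
\end{array}$$
Here $W$, $U$, and $v$ are learnable parameters and $d$ is the dimension of the input vectors.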
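
The four scoring functions can be written as small PyTorch functions operating on a single pair of vectors. The function names, the hidden size `d_hidden`, and the random parameters are hypothetical choices for illustration only.

```python
import torch


def additive_score(x, q, W, U, v):
    """s(x, q) = v^T tanh(W x + U q)"""
    return v @ torch.tanh(W @ x + U @ q)


def dot_score(x, q):
    """s(x, q) = x^T q"""
    return x @ q


def scaled_dot_score(x, q):
    """s(x, q) = x^T q / sqrt(d), with d the dimension of x"""
    d = x.shape[0]
    return (x @ q) / d ** 0.5


def bilinear_score(x, q, W):
    """s(x, q) = x^T W q"""
    return x @ (W @ q)


torch.manual_seed(0)
d, d_hidden = 4, 8                      # illustrative sizes (assumptions)
x, q = torch.randn(d), torch.randn(d)   # one input vector and one query

# Random stand-ins for the learnable parameters
W_add, U_add, v_add = torch.randn(d_hidden, d), torch.randn(d_hidden, d), torch.randn(d_hidden)
W_bil = torch.randn(d, d)

print(additive_score(x, q, W_add, U_add, v_add))
print(dot_score(x, q))
print(scaled_dot_score(x, q))
print(bilinear_score(x, q, W_bil))
```

The dot-product variants need `x` and `q` to have the same dimension. The additive and bilinear forms do not, since `W` and `U` can map between different sizes.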

In [1]:
import torch
import torch.nn as nn
seed = 0
# torch.random.manual_seed is the same function as torch.manual_seed, so one call is enough
torch.manual_seed(seed)

<torch._C.Generator at 0x1dad61b22d0>

In [2]:
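class AttentionFromDsy(nn.Module):
    """Single-head self-attention with scaled dot-product scoring (one query at a time)."""

    def __init__(self, d1=4, d2=10, d3=10):
        super().__init__()
        self.d2, self.d3 = d2, d3
        # Register the projections as parameters so they are created once and can be trained.
        # The default d1=4 matches the 4x4 example input used below.
        self.WQ = nn.Parameter(torch.randn(d3, d1))  # (d3, d1)
        self.WK = nn.Parameter(torch.randn(d3, d1))  # (d3, d1)
        self.WV = nn.Parameter(torch.randn(d2, d1))  # (d2, d1)

    def s(self, k, q, d):
        # Scaled dot-product score: k^T q / sqrt(d)
        return torch.dot(k, q) / d ** 0.5

    def forward(self, X):
        d1, N = X.shape  # the columns of X are the input vectors

        Q = self.WQ.mm(X)  # (d3, N)
        K = self.WK.mm(X)  # (d3, N)
        V = self.WV.mm(X)  # (d2, N)

        h = torch.zeros((self.d2, N))
        for i in range(N):
            # Scores of query i against every key; torch.stack keeps the autograd graph intact
            scores = torch.stack([self.s(K[:, j], Q[:, i], self.d3) for j in range(N)])
            # Attention weights alpha_ij, normalized over the keys j
            alpha = torch.nn.functional.softmax(scores, dim=0)
            # Weighted sum of the value vectors
            h[:, i] = V.mv(alpha)

        return h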

In [3]:
data = torch.randn((4,4))
data

tensor([[-1.1258, -1.1524, -0.2506, -0.4339],
        [ 0.8487,  0.6920, -0.3160, -2.1152],
        [ 0.3223, -1.2633,  0.3500,  0.3081],
        [ 0.1198,  1.2377,  1.1168, -0.2473]])

In [4]:
afd = AttentionFromDsy()
afd(data)



tensor([[ 0.5689,  0.6598,  1.5184,  0.9254],
        [-0.2293, -0.4523, -2.2452, -1.1773],
        [-0.7210, -0.8249, -0.7496, -1.3198],
        [-0.2952, -0.1535, -0.0277,  0.3204],
        [ 0.9266,  0.9983,  1.2646,  1.3426],
        [-0.2005, -0.2239, -0.0952, -0.3015],
        [ 1.1224,  1.0853, -0.0742,  1.0859],
        [ 0.8179,  0.6819,  0.3794, -0.0328],
        [ 0.1456,  0.1609,  1.5256,  0.2119],
        [ 0.1954, -0.0705, -0.0035, -1.1535]], grad_fn=<CopySlices>)
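
The loop-based module computes one output column at a time. Assuming the scaled dot-product score, the same quantity can be obtained with a few matrix products. The function `self_attention` and its explicit weight arguments are hypothetical names used only for this sketch, which also compares one column against a per-query loop.

```python
import torch


def self_attention(X, W_Q, W_K, W_V):
    """Vectorized scaled dot-product self-attention; the columns of X are the inputs."""
    Q = W_Q @ X                            # (d3, N)
    K = W_K @ X                            # (d3, N)
    V = W_V @ X                            # (d2, N)
    d3 = Q.shape[0]
    scores = (K.T @ Q) / d3 ** 0.5         # (N, N); entry (j, i) is s(k_j, q_i)
    alpha = torch.softmax(scores, dim=0)   # normalize over the keys j for each query i
    return V @ alpha                       # (d2, N); column i is h_i


torch.manual_seed(0)
d1, d2, d3, N = 4, 10, 10, 4
X = torch.randn(d1, N)
W_Q, W_K, W_V = torch.randn(d3, d1), torch.randn(d3, d1), torch.randn(d2, d1)

H = self_attention(X, W_Q, W_K, W_V)

# Reference computation for a single query, following the per-query formula
i = 0
Q, K, V = W_Q @ X, W_K @ X, W_V @ X
s_i = torch.stack([K[:, j] @ Q[:, i] for j in range(N)]) / d3 ** 0.5
h_i = V @ torch.softmax(s_i, dim=0)

print(torch.allclose(H[:, i], h_i, atol=1e-4))
```

The softmax is taken along `dim=0` because the inputs are stored as columns. With a row-major layout the normalization axis would be the last one instead.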

In [5]:
L, N, E, S = 3, 4, 10, 6  # target sequence length, batch size, embedding dim, source sequence length

In [6]:
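multiheadAttention = torch.nn.MultiheadAttention(
    embed_dim=E,
    num_heads=10,  # embed_dim must be divisible by num_heads; here each head has dimension E // 10 = 1
    dropout=0.0,
    bias=True,
    add_bias_kv=False,
    add_zero_attn=False,
    kdim=None,  # None: keys have embed_dim features
    vdim=None,  # None: values have embed_dim features
)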

In [7]:
query = torch.rand((L,N,E))
query.shape

torch.Size([3, 4, 10])

In [8]:
key = torch.rand((S,N,E))
key.shape

torch.Size([6, 4, 10])

In [9]:
value = torch.rand((S,N,E))
value.shape

torch.Size([6, 4, 10])

In [10]:
# Inputs use the default (sequence length, batch size, embedding dim) layout
attn_output, attn_output_weights = multiheadAttention(
    query,
    key,
    value,
    key_padding_mask=None,
    need_weights=True,
    attn_mask=None,
)

In [11]:
attn_output.shape # L,N,E

torch.Size([3, 4, 10])

In [12]:
attn_output_weights.shape  # (N, L, S), averaged over the heads

torch.Size([4, 3, 6])
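
The returned weights can be sanity-checked: for every batch element and target position they should sum to 1 over the source positions. The sketch below also passes a `key_padding_mask` instead of `None`. The mask contents and the choice of `num_heads=2` are illustrative assumptions, not part of the cells above.

```python
import torch

torch.manual_seed(0)
L, N, E, S = 3, 4, 10, 6  # target length, batch size, embedding dim, source length

# embed_dim must be divisible by num_heads
mha = torch.nn.MultiheadAttention(embed_dim=E, num_heads=2)

query = torch.rand(L, N, E)
key = torch.rand(S, N, E)
value = torch.rand(S, N, E)

# Boolean mask of shape (N, S); True marks source positions to ignore.
# Here the last two source positions of the first batch element are masked.
key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
key_padding_mask[0, -2:] = True

out, weights = mha(query, key, value, key_padding_mask=key_padding_mask)

print(out.shape)               # (L, N, E)
print(weights.shape)           # (N, L, S)
print(weights.sum(dim=-1))     # sums over the source positions
print(weights[0, :, -2:])      # weights on the masked positions
```

Masked positions are excluded before the softmax, so the remaining weights for that batch element are renormalized over the unmasked source positions.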