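## Attention Scoring Functions

The output of attention pooling is computed as a weighted sum of the values, where each weight is obtained by applying a softmax to the scores between the query and the keys.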

![attention-output.svg](https://zh.d2l.ai/_images/attention-output.svg)

$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v,$

$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}.$

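- `a`: the attention scoring function, which maps a query and a key to a scalar score. Two common choices are additive attention and scaled dot-product attention.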
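The two formulas above can be sketched directly for a single query. This is a minimal illustration, not the notebook's implementation: the names `attention_pooling` and `dot_score` are hypothetical, and a plain dot product stands in for the scoring function `a` purely as a placeholder.

```python
import torch

def attention_pooling(q, keys, values, score_fn):
    """Weighted sum of values for one query (hypothetical helper)."""
    # scores[i] = a(q, k_i), one scalar per key, shape (m,)
    scores = torch.stack([score_fn(q, k) for k in keys])
    # alpha(q, k_i): softmax over the m keys, so the weights sum to 1
    alpha = torch.softmax(scores, dim=0)
    # f(q, ...) = sum_i alpha_i * v_i, shape (v,)
    return alpha @ values, alpha

def dot_score(q, k):
    # Placeholder scoring function (assumption): unscaled dot product
    return torch.dot(q, k)

q = torch.tensor([1.0, 0.0])
keys = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # m = 3 keys
values = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # v = 2
output, alpha = attention_pooling(q, keys, values, dot_score)
```

Keys that score higher against the query receive larger weights, so their values contribute more to the output. Here the first and third keys have the same score and therefore the same weight.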

In [14]:
import sys
sys.path.append('..')
import math
import torch
from torch import nn
import d2l

Masked softmax operation

In [15]:
def masked_softmax(X, valid_lens):
    '''Perform softmax by masking elements on the last axis.'''
    # X: 3D tensor; valid_lens: 1D or 2D tensor
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # Replace masked elements on the last axis with a very large negative
        # value so that their softmax output is 0
        X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

In [16]:
masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3]))

tensor([[[0.3794, 0.6206, 0.0000, 0.0000],
         [0.4201, 0.5799, 0.0000, 0.0000]],

        [[0.2399, 0.4417, 0.3185, 0.0000],
         [0.2645, 0.4256, 0.3100, 0.0000]]])

In [17]:
masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]]))

tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.4294, 0.3181, 0.2526, 0.0000]],

        [[0.5266, 0.4734, 0.0000, 0.0000],
         [0.1558, 0.2273, 0.2203, 0.3966]]])

Additive attention