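## Attention Mechanism

### Introduction

This part is the core of the Transformer model. The sections below walk step by step through the matrix operations that the implementation relies on. You do not need to implement any of the code shown here; instead, read the code together with the following references: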

https://arxiv.org/abs/1706.03762

https://jalammar.github.io/illustrated-transformer/

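Once you understand how attention works and the mathematical tricks behind its implementation, complete the HeadAttention() and MultiHeadAttention() parts of the main file.

We make up a set of input embeddings to use throughout this walkthrough: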

In [9]:
import torch
from torch import nn
import torch.nn.functional as F
B, T, C = 1, 8, 16   ## B: batch size (sequences per training step), T: context length (number of preceding tokens), C: embedding length (hidden dimension)
inputData = torch.rand(size=(B,T,C))

for i in range(T):
    print(f"Embedding of {i}th position:\n {inputData[0,i]}")


Embedding of 0th position:
 tensor([0.1871, 0.5729, 0.9432, 0.1598, 0.2466, 0.7337, 0.4543, 0.7171, 0.2751,
        0.4127, 0.7406, 0.0676, 0.4502, 0.8963, 0.4078, 0.9129])
Embedding of 1th position:
 tensor([0.3547, 0.2431, 0.6216, 0.0833, 0.0251, 0.8722, 0.6959, 0.0732, 0.8515,
        0.4331, 0.6258, 0.0605, 0.8798, 0.2157, 0.9914, 0.7378])
Embedding of 2th position:
 tensor([0.6076, 0.6578, 0.2336, 0.9668, 0.2281, 0.7006, 0.7854, 0.2738, 0.3387,
        0.2599, 0.2020, 0.7798, 0.6362, 0.6626, 0.1425, 0.9745])
Embedding of 3th position:
 tensor([0.9335, 0.5228, 0.9507, 0.7577, 0.4946, 0.5187, 0.2221, 0.2487, 0.9560,
        0.6632, 0.3825, 0.2873, 0.9177, 0.8314, 0.6764, 0.3301])
Embedding of 4th position:
 tensor([0.6072, 0.3791, 0.9668, 0.2989, 0.7493, 0.6435, 0.0162, 0.1578, 0.1426,
        0.7176, 0.6305, 0.7508, 0.3079, 0.1527, 0.4698, 0.1419])
Embedding of 5th position:
 tensor([0.0650, 0.9096, 0.1460, 0.5958, 0.1760, 0.8341, 0.6807, 0.9811, 0.3167,
        0.8352, 0.2220, 0.3

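Intuitively, attention fuses the information held at the preceding positions to obtain what the current context needs. The simplest way to fuse is a weighted sum of the preceding embeddings, used as the information for the current position.

We compute the fused embedding for position i:

Assume that positions 0 through i (i+1 positions in total, since indexing starts at 0) all carry the same weight, 1/(i+1). The updated embedding at position i is then the average of the embeddings at all positions up to and including i: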

In [10]:
def Attention_version1(contextEmbeddings):
    B, T, C = contextEmbeddings.shape
    ## Write into a new tensor: updating the input in place would make later
    ## positions average rows that have already been averaged.
    output = torch.zeros_like(contextEmbeddings)
    for i in range(T):
        context_embeddings = contextEmbeddings[0,:i+1,:] ## shape [i+1, C]
        output[0,i] = torch.mean(context_embeddings,dim=0)
    return output

print("Embedding of Data after aggregate context embedding:\n", Attention_version1(inputData))

Embedding of Data after aggregate context embedding:
 tensor([[[0.1871, 0.5729, 0.9432, 0.1598, 0.2466, 0.7337, 0.4543, 0.7171,
           0.2751, 0.4127, 0.7406, 0.0676, 0.4502, 0.8963, 0.4078, 0.9129],
          [0.2709, 0.4080, 0.7824, 0.1216, 0.1358, 0.8030, 0.5751, 0.3951,
           0.5633, 0.4229, 0.6832, 0.0640, 0.6650, 0.5560, 0.6996, 0.8253],
          ...

We now replace the mean above with an equivalent matrix operation. Taking i=3 as an example:

new_embedding_for_3 = torch.mean(contextEmbeddings[0,:3+1],dim=0)

is equivalent to (@ is matrix multiplication; the weight vector of shape [T] goes on the left of the [T, C] embedding matrix):

new_embedding_for_3 = torch.tensor([1/4,1/4,1/4,1/4,0,0,0,0]) @ contextEmbeddings[0]

In [11]:
def Attention_version2(contextEmbeddings):
    B, T, C = contextEmbeddings.shape
    output = torch.zeros_like(contextEmbeddings)  ## keep the input untouched
    for i in range(T):
        ## weight 1/(i+1) for positions 0..i, weight 0 for the positions after i
        weight = torch.cat((torch.ones(i+1) / (i+1),torch.zeros(T-i-1,dtype=torch.float)),dim=0)
        output[0,i] = weight @ contextEmbeddings[0]
    return output

## allclose rather than ==: the two versions may differ by floating-point rounding
print("Attention_version1 equivalent to Attention_version2: ",torch.allclose(Attention_version1(inputData), Attention_version2(inputData)))

Attention_version1 equivalent to Attention_version2:  True


Next we simplify the computation further with matrix operations and remove the for loop:

Here weight = torch.tril(torch.ones(T,T)) gives:

[[1., 0., 0., 0., 0., 0., 0., 0.],

[1., 1., 0., 0., 0., 0., 0., 0.],

[1., 1., 1., 0., 0., 0., 0., 0.],

[1., 1., 1., 1., 0., 0., 0., 0.],

[1., 1., 1., 1., 1., 0., 0., 0.],

[1., 1., 1., 1., 1., 1., 0., 0.],

[1., 1., 1., 1., 1., 1., 1., 0.],

[1., 1., 1., 1., 1., 1., 1., 1.]]

Row i has a 1 for every position up to and including i, meaning all of those positions enter the sum with the same weight, 1.

weight = weight.masked_fill(weight==0,float("-inf"))

weight = F.softmax(weight,dim=1)

These two lines normalize weight so that the weights in each row sum to 1. The masked entries become -inf, and softmax maps them to exactly 0 because exp(-inf) = 0; the remaining entries in a row are all equal, so each receives the same share (see the softmax formula for details). We obtain:

[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],

[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],

[0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],

[0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],

[0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],

[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],

[0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],

[0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]]


In [7]:
def Attention_version3(contextEmbeddings):
    B, T, C = contextEmbeddings.shape
    weight = torch.tril(torch.ones(T,T))
    print("weight of context embeddings:\n",weight)
    weight = weight.masked_fill(weight==0,float("-inf"))
    weight = F.softmax(weight,dim=1)
    print("weight of context embeddings after regularization:\n",weight)
    ## [T, T] @ [B, T, C] broadcasts over the batch and returns a new [B, T, C] tensor
    return weight @ contextEmbeddings

print("Attention_version1 equivalent to Attention_version3: ",torch.allclose(Attention_version1(inputData), Attention_version3(inputData)))

weight of context embeddings:
 tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
weight of context embeddings after regularization:
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.125
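The equivalence between averaging a prefix and multiplying by the masked-softmax weight matrix can also be checked without any loop. The sketch below is an illustration rather than part of the lab code: it uses freshly generated random data (not the embeddings printed earlier), and the names `x`, `prefix_mean`, and `weighted` are chosen here for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)              # fixed seed so the check is reproducible
B, T, C = 1, 8, 16                # same illustrative shapes as the example input
x = torch.rand(B, T, C)           # stand-in embeddings, not the data printed above

# Route 1: running mean via a cumulative sum over the time dimension.
counts = torch.arange(1, T + 1, dtype=x.dtype).view(1, T, 1)
prefix_mean = torch.cumsum(x, dim=1) / counts

# Route 2: lower-triangular mask followed by softmax, as in Attention_version3.
weight = torch.tril(torch.ones(T, T))
weight = weight.masked_fill(weight == 0, float("-inf"))
weight = F.softmax(weight, dim=-1)
weighted = weight @ x             # [T, T] @ [B, T, C] broadcasts to [B, T, C]

# Every row of the weight matrix should sum to 1, and both routes should agree
# up to floating-point rounding.
row_sums_ok = torch.allclose(weight.sum(dim=-1), torch.ones(T))
routes_agree = torch.allclose(prefix_mean, weighted, atol=1e-6)
print(row_sums_ok, routes_agree)
```

The comparison uses `torch.allclose` instead of `==` because the two routes perform their floating-point operations in a different order and need not match bit for bit.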

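Finally, we settle how weight is computed. All three versions above assume that every preceding position is equally important. In a large language model we want a flexible way to compute how important each preceding position is to the current context, and for this the Transformer introduces Query, Key, and Value:

Query can be understood as what the current context needs from the preceding positions, Key as an index of the information each preceding position holds, and Value as that information itself.

Query and Key are used to compute the weight for fusing information.

How to compute Query and Key, and how to use them to compute the weight for a weighted sum over Value, is the focus of this lab. We cannot give the concrete code here; please study this part from the original Attention Is All You Need paper and the reference links at the end of the document provided by the teaching assistants.

What Query and Key produce is a relevance score between positions. We need to mask out the information from later positions (when generating the i-th token, only the information at positions 0 through i-1 may be used; in terms of the weight matrix, row i keeps only columns 0 through i), and we need to normalize the scores so that they can serve as weight. Using the result from Attention_version3(), the following shows how to mask and normalize the computed scores: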
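In formula form, the masking and normalization step can be written as follows. The symbols are introduced here for illustration only: $s_{ij}$ is the raw relevance score between position $i$ and position $j$, and $w_{ij}$ is the resulting weight.

$$
w_{ij} =
\begin{cases}
\dfrac{\exp(s_{ij})}{\sum_{k=0}^{i} \exp(s_{ik})}, & j \le i \\[2ex]
0, & j > i
\end{cases}
\qquad\text{so that}\qquad \sum_{j=0}^{T-1} w_{ij} = 1 .
$$

Setting $s_{ij} = -\infty$ for $j > i$ before the softmax produces exactly this result, because $\exp(-\infty) = 0$. In the special case where all unmasked scores in row $i$ are equal, every $w_{ij}$ with $j \le i$ equals $1/(i+1)$, which is the uniform weighting of Attention_version3().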


In [8]:
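def weight_mask_and_normalization(weight):
    tril = torch.tril(torch.ones_like(weight))              ## 1 on and below the diagonal, 0 above it
    weight = weight.masked_fill(tril == 0, float("-inf"))   ## later positions get -inf, so softmax gives them weight 0
    weight = F.softmax(weight,dim=-1)                        ## normalize each row so that it sums to 1
    return weight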

weight = torch.rand(T,T)
print("weight before mask and normalize:\n",weight)
print("weight after mask and normalize:\n",weight_mask_and_normalization(weight))

weight before mask and normalize:
 tensor([[5.5676e-01, 9.4231e-01, 5.1067e-01, 7.5773e-01, 1.6254e-01, 9.4655e-01,
         5.3268e-01, 6.7807e-01],
        [2.9409e-01, 9.6507e-01, 1.1124e-01, 2.7023e-02, 1.0419e-01, 9.1842e-01,
         8.9423e-01, 1.9520e-01],
        [5.5079e-01, 3.5385e-01, 4.4868e-01, 1.9307e-01, 8.0032e-01, 2.5918e-02,
         2.8909e-01, 7.7354e-01],
        [4.7963e-01, 2.1382e-01, 2.4471e-01, 2.3849e-02, 9.9436e-01, 5.4707e-01,
         6.0641e-01, 6.9419e-01],
        [5.2073e-01, 2.5474e-03, 5.3586e-01, 1.9880e-01, 9.4446e-01, 5.0853e-02,
         5.7236e-01, 3.0462e-01],
        [3.9553e-01, 8.0423e-01, 6.3437e-02, 8.3306e-02, 7.4547e-01, 6.9222e-01,
         1.8560e-02, 5.5100e-01],
        [2.4007e-01, 3.6904e-01, 8.6105e-01, 2.4039e-02, 2.3494e-01, 3.8838e-01,
         9.8474e-01, 7.5347e-01],
        [7.6678e-01, 4.7956e-01, 7.1612e-01, 7.5432e-01, 9.7332e-01, 3.9041e-04,
         4.2841e-02, 4.8691e-01]])
weight after mask and normalize:
 tensor([[1
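A few properties of the masked and normalized weights are worth checking directly. The sketch below repeats the function so that it runs on its own, and applies it to a hypothetical batch of random score matrices (the batch size of 2 and the name `scores` are assumptions made for this example, standing in for real Query/Key relevance scores).

```python
import torch
import torch.nn.functional as F

def weight_mask_and_normalization(weight):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    tril = torch.tril(torch.ones_like(weight))
    weight = weight.masked_fill(tril == 0, float("-inf"))
    return F.softmax(weight, dim=-1)

torch.manual_seed(0)
B, T = 2, 8                            # hypothetical batch of two T x T score matrices
scores = torch.rand(B, T, T)           # random stand-in for relevance scores
out = weight_mask_and_normalization(scores)

# True strictly above the diagonal, i.e. at the "future" positions.
future = torch.triu(torch.ones(T, T), diagonal=1).bool()

# 1) No weight is placed on future positions.
no_future_weight = out.masked_fill(~future, 0.0).abs().max().item() == 0.0
# 2) Each row is a valid set of weights that sums to 1.
rows_sum_to_one = torch.allclose(out.sum(dim=-1), torch.ones(B, T))
# 3) Position 0 can only attend to itself, so its weight on itself is 1.
first_row_is_self = torch.allclose(out[:, 0, 0], torch.ones(B))
print(no_future_weight, rows_sum_to_one, first_row_is_self)
```

Because `torch.tril` acts on the last two dimensions and the softmax is taken over `dim=-1`, the same function handles a single [T, T] matrix and a batched [B, T, T] tensor alike.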