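# Attention Mechanisms

## Scaled Dot-Product Attention

Scaled dot-product attention is an attention mechanism in which the dot products are scaled down by a factor of $\sqrt{d_k}$.

Given a query $Q$, a key $K$, and a value $V$, attention is computed as: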

```{tip}

$$
\text{Attention}(Q,K,V)=\text{softmax}\biggl(\frac{QK^T}{\sqrt{d_k}}\biggr)V
$$

```
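As a rough illustration of the formula above, the NumPy sketch below evaluates it step by step on small random matrices. The helper names `softmax` and `attention`, the shapes, and the random seed are assumptions chosen for this example and do not come from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) scaled dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output: (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))  # 2 queries, d_k = 4
K = rng.standard_normal((3, 4))  # 3 keys,    d_k = 4
V = rng.standard_normal((3, 5))  # 3 values,  d_v = 5

out, weights = attention(Q, K, V)
print(out.shape)      # (2, 5)
print(weights.shape)  # (2, 3)
```

Each row of `weights` is a probability distribution over the keys, so every output row is a weighted average of the rows of `V`.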


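Suppose $q$ and $k$ are $d_k$-dimensional vectors whose components are independent random variables with mean 0 and variance 1. Then their dot product $q\cdot k=\sum_{i=1}^{d_k}q_ik_i$ has mean 0 and variance $d_k$.

For large $d_k$ the dot products therefore grow large in magnitude, which pushes the softmax into regions where its gradients are extremely small. Since we want these values to have variance 1, we divide the result by $\sqrt{d_k}$.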
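The variance argument above can be checked empirically. The following is a minimal sketch, assuming standard normal components; the values `d_k = 64` and `n_samples = 20_000` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_samples = 20_000

# Components are i.i.d. with mean 0 and variance 1 (standard normal here).
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

dots = (q * k).sum(axis=1)       # one dot product per sample
scaled = dots / np.sqrt(d_k)     # scaled as in the attention formula

print(dots.var())    # close to d_k
print(scaled.var())  # close to 1
```

The sample variances are estimates, so they fluctuate slightly around $d_k$ and 1 from run to run.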


 ```{figure} ../images/attention/scaled-dot-product-attention.png
:width: 400px
:align: center
:name: my-fig-ref

Scaled dot-product attention.
```

## Code

> Below is a simple PyTorch implementation:

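```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, temperature, attn_dropout=0.1):
        super().__init__()
        self.temperature = temperature  # scaling factor, typically sqrt(d_k)
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, q, k, v, mask=None):
        # Compute the scaled attention scores. Transposing the last two axes
        # works for both 3-D (b x l x d) and 4-D (b x n x l x d) inputs.
        attn = torch.matmul(q / self.temperature, k.transpose(-2, -1))

        if mask is not None:
            # Fill masked positions (mask == 0) with a large negative number
            # so that their attention weights become 0 after the softmax
            attn = attn.masked_fill(mask == 0, -1e9)

        # Normalize the scores into attention weights with softmax, then apply dropout
        attn = self.dropout(F.softmax(attn, dim=-1))
        # Compute the weighted sum of the values
        output = torch.matmul(attn, v)

        return output, attn
```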


```python
# Test the module

# Create the input tensors
q = torch.randn(2, 3, 4)  # (batch_size, query_len, query_dim)
k = torch.randn(2, 3, 4)  # (batch_size, key_len, key_dim)
v = torch.randn(2, 3, 4)  # (batch_size, value_len, value_dim)
mask = torch.ones(2, 3, 3)  # (batch_size, query_len, key_len)

# Create the ScaledDotProductAttention module
attention = ScaledDotProductAttention(temperature=4 ** 0.5)  # sqrt(d_k) with d_k = 4

# Forward pass
output, attn = attention(q, k, v, mask)

# Print the shape of each result
print("Output Shape:", output.shape)
print("Attention Shape:", attn.shape)
```

```text
Output Shape: torch.Size([2, 3, 4])
Attention Shape: torch.Size([2, 3, 3])
```
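The test above uses an all-ones mask, which leaves every position visible. To see what masking actually does, the sketch below applies a lower-triangular (causal) mask using the same `masked_fill` step. It is written as standalone code rather than through the module, and the shapes and seed are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

q = torch.randn(1, 3, 4)  # (batch_size, query_len, d_k)
k = torch.randn(1, 3, 4)  # (batch_size, key_len, d_k)
v = torch.randn(1, 3, 4)  # (batch_size, value_len, d_v)

# Lower-triangular mask: query i may only attend to keys at positions <= i.
mask = torch.tril(torch.ones(3, 3)).unsqueeze(0)  # (1, query_len, key_len)

scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
scores = scores.masked_fill(mask == 0, -1e9)  # hide positions where mask == 0
attn = F.softmax(scores, dim=-1)
output = torch.matmul(attn, v)

print(attn[0])  # entries above the diagonal are 0
```

Because the first query can only see the first key, its attention weight on that key is 1 and its output equals the first value vector.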


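## Self-Attention

>Self-attention came to prominence with the 2017 paper "Attention is All You Need" by Google's machine translation team. The paper dispenses with RNN and CNN structures entirely and handles machine translation using attention mechanisms alone, with very good results.

In the Encoder-Decoder framework, the input Source and the output Target of a general attention mechanism are different. Taking English-to-Chinese machine translation as an example, the Source is the English sentence and the Target is the corresponding translated Chinese sentence. Attention takes place between each element of the Target and all elements of the Source. In this case the Query comes from the Target, while the Key and Value come from the Source.

Self-attention, as the name suggests, means that **attention is not computed between the Target and the Source, but among the elements within the Source or among the elements within the Target. In this case the Query, Key, and Value all come from the same sequence, either the Target or the Source.**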
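The difference between the two cases comes down to where the Query, Key, and Value come from. The sketch below makes this concrete; the helper `attend`, the tensor shapes, and the token counts are assumptions for illustration, and the learned projections a real model would apply to each input are omitted.

```python
import torch
import torch.nn.functional as F

def attend(query, key, value):
    # Plain scaled dot-product attention without learned projections.
    scores = query @ key.transpose(-2, -1) / query.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ value

torch.manual_seed(0)
source = torch.randn(1, 6, 8)  # e.g. 6 source (English) token vectors
target = torch.randn(1, 4, 8)  # e.g. 4 target (Chinese) token vectors

# General (cross) attention: Query from the Target, Key and Value from the Source.
cross_out = attend(target, source, source)

# Self-attention: Query, Key, and Value all come from the same sequence.
self_out = attend(source, source, source)

print(cross_out.shape)  # torch.Size([1, 4, 8])
print(self_out.shape)   # torch.Size([1, 6, 8])
```

The output always has one row per Query, so cross attention returns one vector per Target position while self-attention returns one vector per position of its own input.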



 ```{figure} ../images/attention/muti-head-attention.png
:width: 300px
:align: center
:name: aaa

Self attention.
```

[Reference](https://paperswithcode.com/method/scaled)

The following GIF illustrates this process:

 ```{figure} ../images/attention/SF.gif
:width: 600px
:align: center
:name: aaa

Animation of the self-attention process.
```

## Code

Below is an implementation of multi-head attention that follows the paper:

```python
import torch.nn as nn

# Relies on the ScaledDotProductAttention module defined earlier on this page.

class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
        super().__init__()

        self.n_head = n_head
        self.d_k = d_k
        self.d_v = d_v

        # Linear projections for queries, keys, and values (all heads at once)
        self.w_qs = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_ks = nn.Linear(d_model, n_head * d_k, bias=False)
        self.w_vs = nn.Linear(d_model, n_head * d_v, bias=False)
        # Output projection back to d_model
        self.fc = nn.Linear(n_head * d_v, d_model, bias=False)

        self.attention = ScaledDotProductAttention(temperature=d_k ** 0.5)

        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model, eps=1e-6)

    def forward(self, q, k, v, mask=None):

        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_q, len_k, len_v = q.size(0), q.size(1), k.size(1), v.size(1)

        residual = q

        # Pass through the pre-attention projection: b x lq x (n*dk)
        # Separate different heads: b x lq x n x dk
        q = self.w_qs(q).view(sz_b, len_q, n_head, d_k)
        k = self.w_ks(k).view(sz_b, len_k, n_head, d_k)
        v = self.w_vs(v).view(sz_b, len_v, n_head, d_v)

        # Transpose for attention dot product: b x n x lq x dk
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

        if mask is not None:
            mask = mask.unsqueeze(1)   # For head axis broadcasting.

        q, attn = self.attention(q, k, v, mask=mask)

        # Transpose to move the head dimension back: b x lq x n x dv
        # Combine the last two dimensions to concatenate all the heads together: b x lq x (n*dv)
        q = q.transpose(1, 2).contiguous().view(sz_b, len_q, -1)
        q = self.dropout(self.fc(q))
        # Residual connection followed by layer normalization
        q += residual

        q = self.layer_norm(q)

        return q, attn
```
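To get a feel for the input and output shapes of multi-head self-attention without copying the class above, the sketch below uses PyTorch's built-in `torch.nn.MultiheadAttention` as a stand-in. This is a different implementation from the module on this page: it uses biases by default and does not include the residual connection or layer normalization. The sizes `d_model = 16` and `n_head = 4` are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_head = 16, 4
# batch_first=True makes inputs and outputs (batch, seq_len, d_model).
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_head, batch_first=True)
mha.eval()

x = torch.randn(2, 5, d_model)  # (batch_size, seq_len, d_model)

with torch.no_grad():
    # Self-attention: the same tensor is passed as query, key, and value.
    out, weights = mha(x, x, x)

print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 5])
```

By default the built-in module returns attention weights averaged over the heads, which is why `weights` has no head axis here, whereas the class above returns per-head weights of shape `b x n x lq x lk`.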
