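## Transformer: Study Notes and Implementation

Contains an implementation of the Transformer architecture, plus usage tips for some PyTorch functions.

Refuge is not in heaven; refuge is on the other shore!

Sources: Mu Li's [Transformer paper walkthrough]() and his from-scratch implementation using the d2l library,

plus a blog post and a video walkthrough (both linked here),

and of course some help from AI.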

[transformer](https://wmathor.com/index.php/archives/1438/)

[Video walkthrough](https://www.bilibili.com/video/BV1mk4y1q7eK/?spm_id_from=333.999.0.0&vd_source=744197c073f4828379c29fa20f3ea477)
![](./img/figure1.png)


### Step1 Positional Encoding

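Because the Transformer does not use a recurrent neural network, it needs explicit information about the order of tokens in the sequence.
In the original Transformer the positional encoding is fixed (not trained); in BERT it is learned.
[Explanatory article](https://wmathor.com/index.php/archives/1453/)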

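- Each position gets a unique encoding
- Values are bounded
- Across sentences of different lengths, the difference between any two positions should be consistent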
- ![](./img/position2.png)
- ![](./img/position.png)
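
The three properties above can be checked numerically. The sketch below builds the sinusoidal table for a small `max_len` and `d_model` (both chosen arbitrarily for illustration) and tests boundedness and uniqueness. It also checks that the dot product of two encodings depends only on their offset, which is one possible reading of the "consistent difference" property.

```python
import math
import torch

max_len, d_model = 50, 16  # small illustrative sizes (assumed)

position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Bounded: every entry is a sine or a cosine
bounded = bool((pe.abs() <= 1.0).all())

# Unique: no two positions share the same vector
unique = torch.unique(pe, dim=0).size(0) == max_len

# Offset-only: pe[p] . pe[p + k] is the same for every p
gram = pe @ pe.T
dots = torch.diagonal(gram, offset=3)
offset_only = bool(torch.allclose(dots, dots[0].expand_as(dots), atol=1e-4))
```

The last check holds because sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so each sin/cos pair contributes a term that depends only on the offset.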


In [3]:
import torch
import torch.nn as nn
import math
import numpy as np

# Hyperparameters
d_model = 512
d_ff = 2048
d_k = d_v = 64
n_layers = 6  # number of encoder and decoder layers
n_heads = 8


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        # unsqueeze inserts a new dimension at the given index: (max_len,) -> (max_len, 1)
        position = torch.arange(0, max_len, dtype=torch.float32).unsqueeze(1)
        # Scaling factor 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        # 0::2 is a slice that starts at index 0 with step 2 (every other index)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # (max_len, d_model) -> (max_len, 1, d_model), so it broadcasts over the batch
        pe = pe.unsqueeze(0).transpose(0, 1)
        # Register the table as a buffer so it is computed once, not on every forward pass
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model)
        x = x + self.pe[0 : x.size(0), :]
        return self.dropout(x)

Tip1 unsqueeze
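
`unsqueeze(dim)` returns a view with a size-1 dimension inserted at index `dim`. A minimal sketch of why the positional encoding needs it: the column of positions, shape `(max_len, 1)`, broadcasts against the row of frequencies, shape `(d_model / 2,)`. The sizes and frequency values here are illustrative.

```python
import torch

position = torch.arange(0, 4, dtype=torch.float32)   # shape (4,)
div_term = torch.tensor([1.0, 0.1, 0.01])             # shape (3,), illustrative values

col = position.unsqueeze(1)    # shape (4, 1)
row = position.unsqueeze(0)    # shape (1, 4)
last = position.unsqueeze(-1)  # negative index counts from the end: (4, 1)

# (4, 1) * (3,) broadcasts to (4, 3): one row per position, one column per frequency
angles = col * div_term
```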


Tip2 transpose
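
`transpose(dim0, dim1)` swaps two dimensions and returns a view that shares storage with the original. In `PositionalEncoding` it turns `(1, max_len, d_model)` into `(max_len, 1, d_model)`, so the table broadcasts over the batch dimension of a `(seq_len, batch_size, d_model)` input. A small sketch with illustrative sizes:

```python
import torch

pe = torch.zeros(10, 8)                 # (max_len, d_model)
pe = pe.unsqueeze(0).transpose(0, 1)    # (1, 10, 8) -> (10, 1, 8)

x = torch.ones(5, 3, 8)                 # (seq_len, batch_size, d_model)
out = x + pe[: x.size(0)]               # (5, 1, 8) broadcasts over batch -> (5, 3, 8)

# A transposed view is usually non-contiguous; call .contiguous() before .view()
t = torch.arange(6).reshape(2, 3).transpose(0, 1)
flat = t.contiguous().view(-1)
```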


Tip3 register_buffer
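
`register_buffer` stores a tensor on the module as non-trainable state: it moves with `.to(device)` and is saved in `state_dict()`, but it is not returned by `parameters()`, so the optimizer never updates it. A minimal sketch with a hypothetical module `WithBuffer`:

```python
import torch
import torch.nn as nn


class WithBuffer(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(2))      # trainable
        self.register_buffer("table", torch.zeros(2))  # fixed, not trainable


m = WithBuffer()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())
```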


In [None]:
import torch
import math

x = torch.arange(0, 10)
print(x)
print(x[1::2])
print(torch.arange(0, 10, 2, dtype=torch.float32))
print(torch.arange(0, 10, dtype=torch.float32).unsqueeze(1))
print(torch.arange(0, 10, dtype=torch.float32).unsqueeze(0))
# x1 = torch.arange(0, 10, 2)
# x2 = math.log(10000.0) / 10
# print(x1)
# print(x2)
# print(torch.exp(x1)/-x2)

### Step2 Pad_Mask and Subsequence Mask

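Shorter sentences are padded to the length of the longest sentence in the mini-batch, usually with 0 (padding).
Masking keeps the invalid (padded) positions out of the computation, usually by adding a large negative bias to them.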

![](./img/padding_mask.png)
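
To make the two ideas concrete, the sketch below pads a toy batch with 0 and then writes a large negative value into the padded key positions before the softmax. The token ids are made up and the all-zero scores are a stand-in for real attention scores; `pad_sequence` comes from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy sentences of different lengths (token ids are made up)
sentences = [torch.tensor([5, 7, 9]), torch.tensor([4, 2])]
batch = pad_sequence(sentences, batch_first=True, padding_value=0)  # (2, 3)

pad_mask = batch.eq(0)                      # True where the token is padding
scores = torch.zeros(2, 3, 3)               # stand-in scores: (batch, len_q, len_k)
scores = scores.masked_fill(pad_mask.unsqueeze(1), -1e9)
attn = torch.softmax(scores, dim=-1)        # padded keys get ~0 weight
```

In the second sentence the padded third key receives essentially zero weight, and the two real tokens share the attention equally.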


In [None]:
def get_attn_pad_mask(seq_q, seq_k):
    batch_size, len_q = seq_q.size()
    batch_size, len_k = seq_k.size()
    # eq(0) compares element-wise: True marks a padding (PAD) token, False any other token
    pad_attn_mask = seq_k.data.eq(0).unsqueeze(1)  # [batch_size, 1, len_k]
    # expand broadcasts the tensor to the given shape without copying data
    return pad_attn_mask.expand(batch_size, len_q, len_k)  # [batch_size, len_q, len_k]

Tip4 eq()
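
`eq(value)` compares element-wise and returns a boolean tensor of the same shape. The sketch below shows how `get_attn_pad_mask` uses it, assuming token id 0 is the padding token (the ids are made up):

```python
import torch

seq_k = torch.tensor([[1, 2, 3, 0, 0],
                      [4, 5, 0, 0, 0]])   # (batch_size=2, len_k=5)

is_pad = seq_k.eq(0)                       # same as seq_k == 0, dtype torch.bool
# Add a query dimension, then repeat the row for each of len_q query positions
len_q = 3
mask = is_pad.unsqueeze(1).expand(2, len_q, 5)   # (2, 3, 5), no data copied
```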


In [None]:
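def get_attn_subsequence_mask(seq):
    # Used in the decoder to hide information from future time steps
    # seq: [batch_size, tgt_len]
    attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
    # Upper triangle above the main diagonal: 1 marks a future position
    subsequence_mask = np.triu(np.ones(attn_shape), k=1)
    # torch.from_numpy() converts the NumPy array to a tensor; .byte() casts it to uint8
    subsequence_mask = torch.from_numpy(subsequence_mask).byte()
    return subsequence_mask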

Tip5 np.triu()
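
`np.triu(m, k)` keeps the elements on and above the `k`-th diagonal and zeroes the rest. With `k=1` the main diagonal is zeroed too, so position `i` may attend to itself and to earlier positions, but not to later ones. A small sketch; `torch.triu` is shown as an alternative that avoids the NumPy round trip:

```python
import numpy as np
import torch

ones = np.ones((4, 4))
upper_k0 = np.triu(ones, k=0)   # keeps the main diagonal
upper_k1 = np.triu(ones, k=1)   # strictly above the diagonal: 1 marks a "future" position

# Equivalent mask built directly in PyTorch, then cast to a boolean tensor
torch_mask = torch.triu(torch.ones(4, 4), diagonal=1).bool()
```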


In [None]:
import torch

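# Original 1-D tensor
a = torch.tensor([1, 2, 3])
print(a)
print(a.size())
# Insert a new dimension at index 0: shape (3,) -> (1, 3)
b = a.unsqueeze(0)
print(b.size())
print(b)
# Insert a new dimension at index 1: shape (3,) -> (3, 1)
c = a.unsqueeze(1)
print(c)
print(c.size())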

### Step3 ScaledDotProductAttention

- ![](./img/attention.png)
- ![](./img/self_attention.png)


In [None]:
class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention: computes the attention weights between tokens."""

    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K: torch.Tensor, V, attn_mask):
        """
        Q: [batch_size, n_heads, len_q, d_k]
        K: [batch_size, n_heads, len_k, d_k]
        V: [batch_size, n_heads, len_v(=len_k), d_v]
        attn_mask: [batch_size, n_heads, len_q, len_k], True where attention is blocked
        """
        # Dot product of Q and K over the last dimension (d_k), scaled by sqrt(d_k)
        scores: torch.Tensor = torch.matmul(
            Q, K.transpose(-1, -2)) / np.sqrt(d_k)
        # Fill masked positions with a large negative number
        scores.masked_fill_(attn_mask, -1e9)
        # Softmax over the key dimension (the last one): each row of weights sums to 1
        attn = nn.Softmax(dim=-1)(scores)
        # Weight the value vectors to get the context vectors
        context = torch.matmul(attn, V)

        return context, attn

### Step4 MultiHeadAttention

Adds the learnable projection matrices W_Q, W_K, and W_V.
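
One way to see what the projections do: each of `W_Q`, `W_K`, `W_V` maps `d_model` features to `n_heads * d_k` features, which are then reshaped so that every head gets its own `d_k`-sized slice. The sketch below reuses the hyperparameter values defined earlier (`d_model = 512`, `n_heads = 8`, `d_k = 64`) with a made-up batch.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_k = 512, 8, 64
batch_size, seq_len = 2, 10              # made-up sizes

W_Q = nn.Linear(d_model, d_k * n_heads, bias=False)
x = torch.randn(batch_size, seq_len, d_model)

projected = W_Q(x)                                        # (2, 10, 512)
heads = projected.view(batch_size, -1, n_heads, d_k)      # (2, 10, 8, 64)
Q = heads.transpose(1, 2)                                 # (2, 8, 10, 64)

n_params = sum(p.numel() for p in W_Q.parameters())      # 512 * 512 weights, no bias
```

Head `h` at position `t` is simply columns `h * d_k` to `(h + 1) * d_k` of the projected vector at position `t`.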


In [None]:

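class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads, bias=False)
        self.W_K = nn.Linear(d_model, d_k * n_heads, bias=False)
        self.W_V = nn.Linear(d_model, d_v * n_heads, bias=False)
        # Project the concatenated head outputs: (..., n_heads * d_v) -> (..., d_model)
        self.fc = nn.Linear(n_heads * d_v, d_model, bias=False)
        # Assumption: layer norm applied after the residual connection
        self.layer_norm = nn.LayerNorm(d_model)

    def forward(self, input_Q, input_K, input_V, attn_mask):
        """
        input_Q: [batch_size, len_q, d_model]
        input_K: [batch_size, len_k, d_model]
        input_V: [batch_size, len_v(=len_k), d_model]
        attn_mask: [batch_size, len_q, len_k]
        """
        # Keep the input for the residual connection
        residual, batch_size = input_Q, input_Q.size(0)

        # Project, then split into heads: [batch_size, n_heads, len, d_k or d_v]
        Q = self.W_Q(input_Q).view(batch_size, -1,
                                   n_heads, d_k).transpose(1, 2)
        K = self.W_K(input_K).view(batch_size, -1,
                                   n_heads, d_k).transpose(1, 2)
        V = self.W_V(input_V).view(batch_size, -1,
                                   n_heads, d_v).transpose(1, 2)

        # Remaining steps are a sketch of the standard formulation (an assumption)
        # Repeat the mask for every head: [batch_size, n_heads, len_q, len_k]
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)
        context, attn = ScaledDotProductAttention()(Q, K, V, attn_mask)
        # Concatenate the heads: [batch_size, len_q, n_heads * d_v]
        context = context.transpose(1, 2).reshape(batch_size, -1, n_heads * d_v)
        output = self.fc(context)
        return self.layer_norm(output + residual), attn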

Tip6 torch.matmul()
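
`torch.matmul` treats the last two dimensions as matrices and broadcasts over all leading dimensions, which is why a 4-D `Q` and `K` can be multiplied per batch item and per head in a single call. A sketch with illustrative sizes:

```python
import torch

batch_size, n_heads, len_q, len_k, d_k = 2, 8, 5, 7, 64  # illustrative sizes

Q = torch.randn(batch_size, n_heads, len_q, d_k)
K = torch.randn(batch_size, n_heads, len_k, d_k)

# (2, 8, 5, 64) @ (2, 8, 64, 7) -> (2, 8, 5, 7): one score matrix per batch item and head
scores = torch.matmul(Q, K.transpose(-1, -2))

# Each entry is the dot product of one query vector with one key vector
manual = (Q[0, 0, 1] * K[0, 0, 2]).sum()
```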


Tip7 masked_fill()
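
`masked_fill(mask, value)` returns a copy with `value` written wherever `mask` is True; the trailing-underscore form `masked_fill_` does the same in place. A `torch.bool` mask is the safe choice, since byte (uint8) masks have been deprecated. A minimal sketch:

```python
import torch

scores = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
mask = torch.tensor([[False, False, True],
                     [False, True, True]])

filled = scores.masked_fill(mask, -1e9)   # out of place: scores is unchanged
probs = torch.softmax(filled, dim=-1)     # masked positions get ~0 probability

in_place = scores.clone()
in_place.masked_fill_(mask, -1e9)         # in place: modifies in_place itself
```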


### Step5 FeedForward Layer

Feed-forward neural network
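
A minimal sketch of a position-wise feed-forward block, assuming the usual form of two linear layers with a ReLU in between, applied independently at every position. The class name `FeedForwardSketch` is hypothetical, and the residual connection and layer normalization that normally wrap this block are left out.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048  # same values as the hyperparameters defined earlier


class FeedForwardSketch(nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]; nn.Linear acts on the last dimension only
        return self.net(x)


ffn = FeedForwardSketch()
x = torch.randn(2, 10, d_model)
y = ffn(x)
```

Because `nn.Linear` only touches the last dimension, running the block on a single position gives the same result as slicing that position out of the full output.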


In [None]:
class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()