# Preface

Study the official PaddlePaddle courses to keep up with trends in artificial intelligence and AIGC:

[Hands-On Deep Learning from Scratch (2nd Edition)](https://aistudio.baidu.com/aistudio/education/group/info/25302)

[An Accessible Introduction to LLM-Based AIGC Applications and Key Techniques](https://aistudio.baidu.com/aistudio/course/introduce/26723)

Contents:

01 From Attention to Transformer

02 Machine Translation with Transformer

<img src="./image/transformer.png" alt="Image source: PaddlePaddle Hands-On Deep Learning from Scratch" width="900" height="600" align="bottom" />

> Image source: PaddlePaddle course Hands-On Deep Learning from Scratch (2nd Edition), Section 7.2.3, Transformer Model


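In 2017, Vaswani et al. introduced the Transformer in the paper "[Attention Is All You Need](https://arxiv.org/abs/1706.03762)". It is a neural network model built on the Seq2Seq architecture and was first applied to natural language processing tasks such as machine translation, language modeling, and text generation. Its structure is shown in the figure above.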

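Compared with traditional NLP feature extractors (CNNs, RNNs), the Transformer differs mainly in that it:
1) uses no recurrent network and is built purely on attention mechanisms;
2) adds residual connections and layer normalization;
3) uses an encoder-decoder structure to process sequences and adds positional encoding, which embeds position information into the input vectors so the model can learn the relative positions of elements in the sequence.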

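These choices bring several advantages. Without recurrence, all positions of a sequence can be computed in parallel. Attention lets every position draw directly on context from every other position. The resulting architecture is also flexible and easy to scale.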

# Attention

In [79]:
import numpy as np
import paddle
import paddle.nn as nn

class ScaledDotProductAttention(nn.Layer):
    def __init__(self, dropout_p=0.):
        super().__init__()
        self.softmax = nn.Softmax(axis=-1)
        # nn.Dropout expects the drop probability itself, not the keep probability
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, query, key, value, attn_mask=None):
        """scaled dot product attention"""
        # Compute the scaling factor sqrt(d_k)
        embed_size = query.shape[-1]
        scaling_factor = paddle.sqrt(paddle.to_tensor(embed_size, dtype='float32'))
        # Attention scores:
        # dot product of query and the transposed key, divided by the scaling factor
        attn = paddle.matmul(query, key, transpose_y=True) / scaling_factor
        # Attention mask
        def masked_fill(x, mask, value):
            y = paddle.full(x.shape, value, x.dtype)
            return paddle.where(mask, y, x)

        if attn_mask is not None:
            attn = masked_fill(attn, attn_mask,-1e9)
        # softmax over the key axis, so each row of weights lies in [0, 1] and sums to 1
        attn = self.softmax(attn)
        # dropout
        attn = self.dropout(attn)
        # Weighted sum of value
        output = paddle.matmul(attn,value)
        return (output,attn)
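
To make the computation in the class above concrete, here is a minimal NumPy sketch of softmax(QKᵀ / sqrt(d_k))V without masking or dropout. The function names `softmax` and `scaled_dot_product_attention` are illustrative, not part of Paddle. The query and key lengths are chosen to differ on purpose, which shows why the key has to be transposed before the matrix product.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(query, key, value):
    """softmax(Q K^T / sqrt(d_k)) V for arrays shaped [..., len, d_k]."""
    d_k = query.shape[-1]
    # Scores: [..., len_q, len_k]; the key's last two axes are swapped.
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ value, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))   # [batch, len_q, d_k]
k = rng.normal(size=(1, 5, 4))   # [batch, len_k, d_k]
v = rng.normal(size=(1, 5, 4))   # [batch, len_k, d_v]

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)
print(weights.sum(axis=-1))      # every row of attention weights sums to 1
```

The output keeps the query length, while each row of `weights` is a probability distribution over the key positions.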

In [80]:
def get_attn_pad_mask(seq_q, seq_k, pad_idx):
    """Attention mask: flags the <pad> placeholders in a sequence.
    Args:
        seq_q (Tensor): query sequence, shape = [batch size, query len]
        seq_k (Tensor): key sequence, shape = [batch size, key len]
        pad_idx (int): numeric index of the <pad> placeholder in the key sequence
    """
    batch_size, len_q = seq_q.shape
    batch_size, len_k = seq_k.shape
    # Positions holding the <pad> placeholder are True in the mask
    # pad_attn_mask: [batch size, key len]
    pad_attn_mask = paddle.equal(seq_k, pad_idx)
    # Add an extra dimension
    # pad_attn_mask: [batch size, 1, key len]
    pad_attn_mask = pad_attn_mask.unsqueeze(1)
    # Broadcast the mask to [batch size, query len, key len]
    pad_attn_mask = paddle.broadcast_to(pad_attn_mask, (batch_size, len_q, len_k))

    return pad_attn_mask

In [81]:
# length = 6, e.g. "Hello World !" --> [Hello, World, !, <pad>, <pad>, <pad>]
q = k = paddle.to_tensor(np.array([[1, 1, 1, 0, 0, 0]]), dtype='float32')
pad_idx = 0
mask = get_attn_pad_mask(q, k, pad_idx)
print(mask)
print(q.shape, mask.shape)

Tensor(shape=[1, 6, 6], dtype=bool, place=Place(gpu:0), stop_gradient=True,
       [[[False, False, False, True , True , True ],
         [False, False, False, True , True , True ],
         [False, False, False, True , True , True ],
         [False, False, False, True , True , True ],
         [False, False, False, True , True , True ],
         [False, False, False, True , True , True ]]])
[1, 6] [1, 6, 6]
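
The following NumPy sketch illustrates what this mask does once it is applied to the attention scores. It rebuilds the same mask by broadcasting, fills the masked positions with -1e9 as the attention layer does, and then applies softmax. The all-zero `scores` array is an assumption made purely for illustration, so that the unmasked weights come out uniform.

```python
import numpy as np

pad_idx = 0
token_ids = np.array([[1, 1, 1, 0, 0, 0]])        # [batch, key len]

# True where the key position holds <pad>; broadcast over the query axis.
key_is_pad = (token_ids == pad_idx)[:, None, :]   # [batch, 1, key len]
mask = np.broadcast_to(key_is_pad, (1, 6, 6))     # [batch, query len, key len]

scores = np.zeros((1, 6, 6))                      # uniform raw scores, for illustration only
masked_scores = np.where(mask, -1e9, scores)      # same fill value as in the attention layer

# Softmax over the key axis.
exp = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(np.round(weights[0, 0], 3))                 # weights of the first query position
```

The three real tokens share the attention weight equally, and the `<pad>` positions receive a weight of zero, so padding cannot contribute to the output.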


# Self-Attention

# Multi-Head Attention

In [82]:
class MultiHeadAttention(nn.Layer):
    def __init__(self, d_model, d_k, n_heads, dropout_p=0.):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_k
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_k * n_heads)
        self.W_O = nn.Linear(n_heads * d_k, d_model)
        self.attention = ScaledDotProductAttention(dropout_p=dropout_p)

    def forward(self, query, key, value, attn_mask):
        """
        query: [batch_size, len_q, d_model]
        key: [batch_size, len_k, d_model]
        value: [batch_size, len_k, d_model]
        attn_mask: [batch_size, len_q, len_k]
        """

        batch_size = query.shape[0]

        # Project query, key and value with their respective weights, then split them into heads
        # q_s: [batch_size, len_q, n_heads, d_k]
        # k_s: [batch_size, len_k, n_heads, d_k]
        # v_s: [batch_size, len_k, n_heads, d_k]
        q_s = self.W_Q(query).reshape([batch_size, -1, self.n_heads, self.d_k])
        k_s = self.W_K(key).reshape([batch_size, -1, self.n_heads, self.d_k])
        v_s = self.W_V(value).reshape([batch_size, -1, self.n_heads, self.d_k])

        # Rearrange the dimensions of query, key and value
        # q_s: [batch_size, n_heads, len_q, d_k]
        # k_s: [batch_size, n_heads, len_k, d_k]
        # v_s: [batch_size, n_heads, len_k, d_k]
        q_s = q_s.transpose((0, 2, 1, 3))
        k_s = k_s.transpose((0, 2, 1, 3))
        v_s = v_s.transpose((0, 2, 1, 3))

        # The dimensions of attn_mask must match those of q_s, k_s and v_s
        # attn_mask: [batch_size, n_heads, len_q, len_k]
        attn_mask = attn_mask.unsqueeze(1)
        attn_mask = paddle.tile(attn_mask, (1, self.n_heads, 1, 1))

        # Compute attention for each head
        # context: [batch_size, n_heads, len_q, d_k]
        # attn: [batch_size, n_heads, len_q, len_k]
        context, attn = self.attention(q_s, k_s, v_s, attn_mask)

        # concatenate
        # context: [batch_size, len_q, n_heads * d_k]
        context = context.transpose((0, 2, 1, 3)).reshape((batch_size, -1, self.n_heads * self.d_k))

        # Multiply by W_O
        # output: [batch_size, len_q, d_model]
        output = self.W_O(context)

        return output, attn

In [83]:
paddle.to_tensor(np.array([False])).broadcast_to((1, 2, 2))

Tensor(shape=[1, 2, 2], dtype=bool, place=Place(gpu:0), stop_gradient=True,
       [[[False, False],
         [False, False]]])

In [84]:
dmodel, dk, nheads = 10, 2, 5
q = k = v =paddle.ones((1, 2, 10), dtype='float32')
attn_mask = paddle.to_tensor(np.array([False])).broadcast_to((1, 2, 2))
multi_head_attn = MultiHeadAttention(dmodel, dk, nheads)
output, attn = multi_head_attn(q, k, v, attn_mask)
print(output.shape, attn.shape)

[1, 2, 10] [1, 5, 2, 2]
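
The reshape and transpose steps in `MultiHeadAttention` can be hard to follow from shapes alone. As a small sketch, the NumPy code below uses the same sizes as the cell above (sequence length 2, 5 heads, d_k = 2) and fills the input with consecutive numbers so that the split and the later merge can be traced by eye. The variable names are illustrative.

```python
import numpy as np

batch_size, seq_len, n_heads, d_k = 1, 2, 5, 2
x = np.arange(batch_size * seq_len * n_heads * d_k, dtype=np.float32)
x = x.reshape(batch_size, seq_len, n_heads * d_k)     # [batch, len, n_heads * d_k]

# Split into heads: [batch, len, n_heads, d_k] -> [batch, n_heads, len, d_k]
heads = x.reshape(batch_size, seq_len, n_heads, d_k).transpose(0, 2, 1, 3)

# Merge the heads again: back to [batch, len, n_heads * d_k]
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, n_heads * d_k)

print(heads.shape, merged.shape)
print(heads[0, 0])   # head 0 holds the first d_k features of every position
```

Splitting and merging are exact inverses, so the concatenation step before `W_O` restores the original feature order.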


# Transformer

## Positional Encoding
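
The preface notes that positional encoding embeds position information into the input vectors. As a minimal sketch, the NumPy code below builds the fixed sinusoidal encoding described in "Attention Is All You Need". The function name `sinusoidal_positional_encoding` is hypothetical, an even `d_model` is assumed, and the course implementation may differ.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a [max_len, d_model] table; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]                      # [max_len, 1]
    # One wavelength per pair of dimensions: 10000^(2i / d_model)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(positions / div_term)                # even dimensions
    table[:, 1::2] = np.cos(positions / div_term)                # odd dimensions
    return table

pe = sinusoidal_positional_encoding(max_len=6, d_model=10)
print(pe.shape)
```

The table is added to the token embeddings before the first layer, so two identical tokens at different positions end up with different input vectors.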

## Encoder

## Decoder

## Transformer

# Machine Translation with Transformer

[PaddleNLP Selected Examples - Machine Translation](https://github.com/PaddlePaddle/PaddleNLP/tree/9e3bc459366faa04f70660e7881934dee1fa41b5/examples/machine_translation)

# References:

[1] [Machine Translation using Transformer](https://github.com/PaddlePaddle/PaddleNLP/blob/21714c3797149f5283c9229313cb93fcb2c2d51a/examples/machine_translation/transformer/README.md)
[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010.
[3] Ba J L, Kiros J R, Hinton G E. Layer normalization[J]. arXiv preprint arXiv:1607.06450, 2016.
[4] [PaddlePaddle documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/index_cn.html)
[5] [Hands-On Deep Learning from Scratch (2nd Edition)](https://aistudio.baidu.com/aistudio/education/group/info/25302)

Author: Armor
Email: htkstudy163.com
AI Studio profile: https://aistudio.baidu.com/aistudio/personalcenter/thirdview/392748