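# Multi-Skill Dialogue

A multi-skill dialogue system aims to provide an open-domain, multi-turn dialogue system that naturally blends several dialogue skills, such as knowledge-grounded dialogue and recommendation dialogue, so that a machine can converse with people fluently and naturally and thereby improve the user experience.

This notebook follows the baseline of the [2021 Language and Intelligence Challenge: Multi-Skill Dialogue](https://aistudio.baidu.com/aistudio/competition/detail/67).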

In [1]:
# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp -i https://pypi.org/simple

%cd multi-skill_dialogue/

Collecting paddlenlp
  Downloading paddlenlp-2.3.4-py3-none-any.whl (1.4 MB)
Collecting datasets>=2.0.0
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting paddle2onnx
  Downloading paddle2onnx-0.9.8-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.9 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_1

In [None]:
%cd multi-skill_dialogue/

/home/aistudio/multi-skill_dialogue


In [None]:
%pwd

'/home/aistudio/multi-skill_dialogue'

In [None]:
%cd ./tools/

/home/aistudio/multi-skill_dialogue/tools


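## Multi-Skill Dialogue Baseline

The multi-skill dialogue competition provides several sub-datasets, covering knowledge-grounded dialogue, recommendation dialogue, persona dialogue, and several other dialogue types. The baseline uses the UnifiedTransformer model. Besides the data tokens and special tokens such as `[CLS]` and `[SEP]`, the model input also includes special tokens that distinguish the different dialogue skills.

![Model input](https://ai-studio-static-online.cdn.bcebos.com/24d697df544c4299a679e04e2d3b1442fdf17a14981e454e8a2de5c7acea8051)

### Step 1: Data Preprocessing

Because the competition [datasets](https://aistudio.baidu.com/aistudio/competition/detail/67) are **numerous and large**, and their **formats differ**, a script is needed to preprocess them and convert the data into token ids.

**Note:** Make sure the input file paths, output file paths, and parameter settings in the script are correct. Because the data is large, the script takes a long time to run (especially on the training set). You can also process the data in batches yourself.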

In [None]:
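# Note: by default the script processes only part of each dataset as training data for the baseline model; participants should adapt the processing strategy to their needs.
# In a notebook, run the script as a shell command with a leading `!`; without it the line is parsed as Python and raises a SyntaxError.
# !python ./tools/convert_data_to_numerical.py ./tools/spm.model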

#### OOV

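In an encoder-decoder architecture, parallel corpora are represented with a fixed vocabulary whose **size** is typically kept at 30k-60k. A **smaller vocabulary** improves time and space efficiency. At the same time, we want **texts to be as short as possible**, because longer texts reduce efficiency and increase the distance over which a neural model (for example an LSTM) must propagate information, and the longer the text, the more likely information is lost. Together, these constraints produce many **out-of-vocabulary (OOV) words and rare words**.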
   
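Following the approach in the [LIC2021 multi-skill dialogue champion solution write-up](https://zhuanlan.zhihu.com/p/514079863), we observed that many characters that occur frequently in the text are missing from the model's vocab file. We therefore count every character that occurs in the text, keep those that occur more than 100 times, and replace them with characters from the vocab file that never occur in the text.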
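The counting and remapping step described above can be sketched in a few lines. This is a minimal illustration rather than the champion team's code: the function names, the toy vocabulary, and the lowered threshold (`min_count=3` instead of 100) are assumptions made for the example, and real vocab entries may be multi-character pieces rather than single characters.

```python
from collections import Counter

def build_char_replacements(texts, vocab, min_count=100):
    """Map frequent out-of-vocabulary characters to vocab entries unused in the corpus."""
    counts = Counter(ch for text in texts for ch in text)
    # Frequent characters that the vocab cannot represent, most frequent first.
    frequent_oov = [ch for ch, n in counts.most_common()
                    if n > min_count and ch not in vocab]
    # Vocab entries that never occur in the corpus are free to be reused.
    unused = [tok for tok in vocab if counts[tok] == 0]
    # Pair them up; zip stops when either list runs out.
    return dict(zip(frequent_oov, unused))

def apply_replacements(text, mapping):
    """Rewrite `text`, swapping every mapped character for its stand-in."""
    return "".join(mapping.get(ch, ch) for ch in text)

# Toy data (hypothetical): "龘" and "靐" are in the vocab but never occur in the corpus.
vocab = ["你", "好", "龘", "靐"]
texts = ["你好呀", "好呀好呀", "呀"]
mapping = build_char_replacements(texts, vocab, min_count=3)
print(mapping)
print(apply_replacements("你好呀", mapping))
```

The same mapping has to be inverted on the model output so that the stand-in characters are turned back into the originals.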
   
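Besides these frequent characters, there are also many rare characters that occur only once. There are too many of them to cover with unused vocab entries, and their presence causes the characters that follow them to be garbled at inference time. We therefore enlarge the model's vocabulary embedding and define 128 additional special characters. We iterate over all characters of a sample in order, and whenever a rare character appears, we replace it with the next special character.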
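A minimal sketch of the per-sample placeholder substitution might look as follows. The placeholder names (`[RARE_0]` ... `[RARE_127]`), the fallback when a sample has more than 128 distinct rare characters, and the decoding helper are all assumptions for illustration; the source only states that 128 special characters are assigned in order.

```python
NUM_PLACEHOLDERS = 128
# Hypothetical placeholder names; the source does not say how the 128 tokens are spelled.
PLACEHOLDERS = ["[RARE_%d]" % i for i in range(NUM_PLACEHOLDERS)]

def encode_rare_chars(text, vocab):
    """Replace each distinct out-of-vocab character with the next unused placeholder."""
    char_to_placeholder = {}
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
            continue
        if ch not in char_to_placeholder:
            if len(char_to_placeholder) >= NUM_PLACEHOLDERS:
                pieces.append(ch)  # assumption: leave the character as-is when placeholders run out
                continue
            char_to_placeholder[ch] = PLACEHOLDERS[len(char_to_placeholder)]
        pieces.append(char_to_placeholder[ch])
    return pieces, char_to_placeholder

def decode_rare_chars(pieces, char_to_placeholder):
    """Undo the substitution using the table built for the same sample."""
    inverse = {p: ch for ch, p in char_to_placeholder.items()}
    return "".join(inverse.get(p, p) for p in pieces)

vocab = set("我喜欢吃")
pieces, table = encode_rare_chars("我喜欢吃饕餮饕", vocab)
print(pieces)
print(decode_rare_chars(pieces, table))
```

Because the table is rebuilt for every sample, the same placeholder can stand for different characters in different samples, which is why 128 placeholders are enough to cover an open-ended set of rare characters.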
   

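### Step 2: Build the Model

[UnifiedTransformer](https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue) uses the Transformer encoder as its basic building block and adopts a flexible attention mechanism, which makes it well suited to text generation. Special tokens that identify the dialogue skill are added to the model input, so a single model can support chit-chat, recommendation dialogue, and knowledge-grounded dialogue.

**PaddleNLP provides Chinese pretrained UnifiedTransformer models that can be loaded in one step by model name. To simplify data processing, PaddleNLP also ships a matching Tokenizer that handles tokenization, token-to-id conversion, and id-to-token conversion.**

PaddleNLP currently provides two Chinese pretrained models for UnifiedTransformer:
- `unified_transformer-12L-cn`: pretrained on a large-scale Chinese conversation dataset
- `unified_transformer-12L-cn-luge`: obtained by fine-tuning `unified_transformer-12L-cn` on the Qianyan (LUGE) dialogue dataset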

In [3]:
from paddlenlp.transformers import UnifiedTransformerLMHeadModel, UnifiedTransformerTokenizer

# Name of the pretrained model
model_name_or_path = 'unified_transformer-12L-cn-luge'

# Load the pretrained model
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
# Load the matching tokenizer
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

[2022-07-19 21:34:09,088] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-luge.pdparams
[2022-07-19 21:34:13,755] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-vocab.txt
[2022-07-19 21:34:13,759] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-spm.model


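### Template Replacement Based on a Pretrained Model
For recommendation dialogue, we follow [the solution of team 强行跳大](https://zhuanlan.zhihu.com/p/343061563) from the recommendation dialogue task of the 2020 Language and Intelligence Challenge. It uses a RoBERTa pretrained model with an added language-model layer, and changes the bidirectional attention into a GPT-style unidirectional, left-to-right attention, so that each token can only see the tokens to its left.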
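The left-to-right restriction is usually implemented as a lower-triangular attention mask. The sketch below builds such a mask with plain Python lists. It assumes, as in GPT, that a position may attend to itself as well as to everything on its left, and it is not taken from the referenced solution.

```python
def causal_mask(seq_len):
    """Return a matrix where mask[i][j] is 1 if position i may attend to position j."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Row `i` of the mask has ones in columns `0..i` and zeros afterwards, so no token can read information from tokens that come later in the sequence.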

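The model combines goal prediction and response generation in a single model and adds a special token `[GOAL]`, which marks the preceding short span of text as the goal of the next utterance. Every user or bot turn consists of two parts: the goal and the utterance. The goal always ends with `[GOAL]`, signalling that the dialogue has a new goal; in most cases the goal is empty. The user or bot utterance ends with `[SEP]`. At prediction time, the model first predicts the new goal for the reply. If that goal is among the given expected goals, it becomes the goal of the current turn; otherwise the goal of the current turn is empty. The model then generates the reply for this turn, which still contains placeholder strings.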

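Both the dialogue input and the generated reply use template filling. Parts of the dialogue that involve external information are replaced by templates, which cover persona information, knowledge subject information, and knowledge object information. During preprocessing, rule-based matching replaces dialogue content with template strings; during postprocessing, template strings in the reply are replaced back with the external information.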
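The pre- and postprocessing steps can be sketched with simple string replacement. The placeholder names (`[user_name]`, `[kg_subject]`, `[kg_object]`) and the example sentences are hypothetical, and exact substring matching stands in for whatever matching rules the original solution uses.

```python
def to_template(text, slots):
    """Preprocessing: replace external values found in `text` with their placeholders."""
    # Replace longer values first so a short value cannot break up a longer one.
    for placeholder, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        if value:
            text = text.replace(value, placeholder)
    return text

def from_template(text, slots):
    """Postprocessing: put the external values back into a generated reply."""
    for placeholder, value in slots.items():
        text = text.replace(placeholder, value)
    return text

# Hypothetical persona and knowledge slots for one dialogue.
slots = {
    "[user_name]": "Xiaoming",
    "[kg_subject]": "Jay Chou",
    "[kg_object]": "Qi Li Xiang",
}
templated = to_template("Xiaoming, have you heard Qi Li Xiang by Jay Chou?", slots)
restored = from_template("[kg_object] by [kg_subject] is great", slots)
print(templated)
print(restored)
```

With this scheme the model only has to learn when to emit a placeholder, not how to spell every entity name, which also sidesteps many rare-character problems.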


### Step 3: Load the Data

The baseline defines an iterable dataset, `DialogueDataset`, by subclassing `paddle.io.IterableDataset`. It handles file reading, shuffling, and batching; see `data.py` for details.

In [4]:
from paddle.io import DataLoader
from data import DialogueDataset

# Training batch_size
batch_size = 8192
# pool_size used for sorting and shuffling when building batches
sort_pool_size = 65536

# Training set path; keep it consistent with the preprocessing output path
train_data_path = './datasets/train.txt'
# Initialize the Dataset
train_dataset = DialogueDataset(
    train_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    sort_pool_size,
    mode='train')
# Initialize the DataLoader
train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None)

# Validation set path; keep it consistent with the preprocessing output path
valid_data_path = './datasets/valid.txt'
valid_dataset = DialogueDataset(
    valid_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    sort_pool_size,
    mode='valid')
valid_dataloader = DataLoader(valid_dataset, return_list=True, batch_size=None)
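A `batch_size` of 8192 together with a `sort_pool_size` suggests that batches are built against a token budget rather than a fixed number of samples. The sketch below shows one plausible reading of that idea in plain Python: sort a pool of samples by length, pack them into batches whose padded size stays within the budget, then shuffle the batch order. This is an assumption for illustration only; `data.py` remains the authoritative implementation, and the function name and toy data are hypothetical.

```python
import random

def token_budget_batches(samples, max_tokens, sort_pool_size, seed=0):
    """Yield batches whose padded size (num_samples * longest_sample) is at most max_tokens."""
    rng = random.Random(seed)
    for start in range(0, len(samples), sort_pool_size):
        # Sort one pool by length so that each batch holds samples of similar length.
        pool = sorted(samples[start:start + sort_pool_size], key=len)
        batches, batch, max_len = [], [], 0
        for sample in pool:
            new_max = max(max_len, len(sample))
            # Close the current batch if adding this sample would exceed the budget.
            if batch and (len(batch) + 1) * new_max > max_tokens:
                batches.append(batch)
                batch, new_max = [], len(sample)
            batch.append(sample)
            max_len = new_max
        if batch:
            batches.append(batch)
        rng.shuffle(batches)  # shuffle the order of batches, not their contents
        yield from batches

# Toy samples represented as lists of token ids with different lengths.
samples = [[0] * n for n in (5, 2, 9, 3, 4, 8)]
batches = list(token_budget_batches(samples, max_tokens=12, sort_pool_size=6))
for b in batches:
    print([len(s) for s in b])
```

Grouping samples of similar length keeps padding low, which is why the `DataLoader` above is created with `batch_size=None`: the dataset already yields complete batches.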

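### Step 4: Training and Optimization

This baseline uses the cross-entropy loss and `paddle.optimizer.AdamW` as the optimizer.

During training, checkpoints are saved under the `checkpoints` folder in the current directory. The model is also evaluated on the validation set during training, reporting metrics such as `loss` and `PPL`.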

In [5]:
import os

# Helper that saves a training checkpoint (model and tokenizer)
def save_ckpt(model, tokenizer, save_dir, name):
    output_dir = os.path.join(save_dir, "model_{}".format(name))
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

In [6]:
import math
import time
import paddle
import paddle.nn.functional as F

# Evaluation function; called on the validation set during training
@paddle.no_grad()
def evaluation(model, data_loader):
    print('\nEval begin...')
    model.eval()
    total_tokens = 0
    total_loss = 0.0
    start_time = time.time()
    step = 0
    for inputs in data_loader:
        step += 1
        token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs

        logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos)
        loss = F.cross_entropy(logits, tgt_label, reduction='sum')

        total_loss += loss.numpy()[0]
        total_tokens += tgt_label.shape[0]

    avg_loss = total_loss / total_tokens
    ppl = math.exp(avg_loss)
    avg_speed = (time.time() - start_time) / step
    print('loss: %.4f - ppl: %.4f - %.3fs/step\n' % (avg_loss, ppl, avg_speed))
    model.train()
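The `PPL` reported by `evaluation` is the exponential of the average per-token cross-entropy: the summed loss over all batches divided by the total number of target tokens. The small worked example below uses made-up batch totals chosen so that the average loss is 2.7557, the value that appears in the evaluation log.

```python
import math

def perplexity(total_loss, total_tokens):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(total_loss / total_tokens)

# Two hypothetical batches: summed cross-entropy and number of target tokens.
batch_losses = [275.57, 551.14]
batch_tokens = [100, 200]
ppl = perplexity(sum(batch_losses), sum(batch_tokens))
print("loss: %.4f - ppl: %.4f" % (sum(batch_losses) / sum(batch_tokens), ppl))
```

Summing losses and token counts before dividing weights every token equally, whereas averaging per-batch perplexities would give short batches too much influence.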

In [7]:
import paddle.nn as nn
from paddle.optimizer.lr import NoamDecay
from paddle.optimizer import AdamW

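# Learning rate
lr = 1e-5
# Number of iterations over which the learning rate warms up to the base learning rate (the lr set above)
warmup_steps = 4000
# weight_decay coefficient used by the AdamW optimizer
weight_decay = 0.01
# Maximum gradient norm allowed by gradient clipping
max_grad_norm = 0.1

# Initialize the Noam learning-rate decay schedule
lr_scheduler = NoamDecay(1 / (warmup_steps * (lr**2)), warmup_steps)
# Do not apply weight_decay to bias and LayerNorm parameters
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
# Initialize the AdamW optimizer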
optimizer = AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params,
    grad_clip=nn.ClipGradByGlobalNorm(max_grad_norm))
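The first argument passed to `NoamDecay` looks unusual, so it helps to check the arithmetic. The Noam schedule computes `d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)`; setting `d_model = 1 / (warmup_steps * lr ** 2)` makes the peak, reached at `step == warmup_steps`, equal exactly `lr`. The pure-Python function below re-implements that formula for illustration (it is not Paddle's code, and the guard for step 0 is an assumption).

```python
def noam_lr(step, d_model, warmup_steps, base=1.0):
    """Noam schedule: linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # assumption: treat step 0 like step 1 to avoid 0 ** -0.5
    return base * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

lr = 1e-5
warmup_steps = 4000
d_model = 1 / (warmup_steps * lr ** 2)

peak = noam_lr(warmup_steps, d_model, warmup_steps)   # learning rate at the end of warmup
early = noam_lr(100, d_model, warmup_steps)           # learning rate after 100 steps
print("peak: %.3e, step 100: %.3e" % (peak, early))
```

During warmup the rate grows linearly as `lr * step / warmup_steps`, so after 100 steps it is only about 2.5e-7. That is consistent with the `lr: 0.0000002` values in the training log and is one reason the validation loss barely moves in the first few hundred steps.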

In [None]:
import time

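# Number of training epochs
epochs = 10
# Logging interval (in steps)
logging_steps = 5
# Checkpoint and evaluation interval (in steps)
save_steps = 10
# Directory where checkpoints are saved
save_dir = './checkpoints/'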

step = 0
total_time = 0.0
for epoch in range(epochs):
    print('\nEpoch %d/%d' % (epoch + 1, epochs))
    batch_start_time = time.time()
    for inputs in train_dataloader:
        step += 1
        token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs

        logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos)
        # Compute the loss with the cross-entropy loss function
        loss = F.cross_entropy(logits, tgt_label)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        total_time += (time.time() - batch_start_time)
        if step % logging_steps == 0:
            ppl = paddle.exp(loss)
            print('step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step'
                % (step, loss, ppl, optimizer.get_lr(), total_time / logging_steps))
            total_time = 0.0
        if step % save_steps == 0:
            # Evaluate the model on the validation set
            evaluation(model, valid_dataloader)
            # Save the checkpoint
            save_ckpt(model, tokenizer, save_dir, step)
        batch_start_time = time.time()
print("\n=====training complete=====")


Epoch 1/10
step 5 - loss: 4.0267 - ppl: 56.0757 - lr: 0.0000000 - 0.985s/step
step 10 - loss: 3.4819 - ppl: 32.5222 - lr: 0.0000000 - 0.416s/step

Eval begin...
loss: 2.7557 - ppl: 15.7322 - 0.161s/step



[2022-07-19 15:15:57,155] [    INFO] - tokenizer config file saved in ./checkpoints/model_10/tokenizer_config.json
[2022-07-19 15:15:57,158] [    INFO] - Special tokens file saved in ./checkpoints/model_10/special_tokens_map.json


step 15 - loss: 3.5252 - ppl: 33.9609 - lr: 0.0000000 - 0.427s/step
step 20 - loss: 3.5798 - ppl: 35.8652 - lr: 0.0000001 - 0.431s/step

Eval begin...
loss: 2.7557 - ppl: 15.7324 - 0.163s/step



[2022-07-19 15:18:50,562] [    INFO] - tokenizer config file saved in ./checkpoints/model_20/tokenizer_config.json
[2022-07-19 15:18:50,564] [    INFO] - Special tokens file saved in ./checkpoints/model_20/special_tokens_map.json


step 25 - loss: 4.1072 - ppl: 60.7754 - lr: 0.0000001 - 0.425s/step
step 30 - loss: 4.1608 - ppl: 64.1204 - lr: 0.0000001 - 0.421s/step

Eval begin...
loss: 2.7557 - ppl: 15.7327 - 0.163s/step



[2022-07-19 15:21:42,961] [    INFO] - tokenizer config file saved in ./checkpoints/model_30/tokenizer_config.json
[2022-07-19 15:21:42,964] [    INFO] - Special tokens file saved in ./checkpoints/model_30/special_tokens_map.json


step 35 - loss: 3.6854 - ppl: 39.8602 - lr: 0.0000001 - 0.432s/step
step 40 - loss: 3.5509 - ppl: 34.8451 - lr: 0.0000001 - 0.428s/step

Eval begin...
loss: 2.7558 - ppl: 15.7330 - 0.163s/step



[2022-07-19 15:24:36,156] [    INFO] - tokenizer config file saved in ./checkpoints/model_40/tokenizer_config.json
[2022-07-19 15:24:36,159] [    INFO] - Special tokens file saved in ./checkpoints/model_40/special_tokens_map.json


step 45 - loss: 3.9115 - ppl: 49.9756 - lr: 0.0000001 - 0.426s/step
step 50 - loss: 3.6398 - ppl: 38.0851 - lr: 0.0000001 - 0.430s/step

Eval begin...
loss: 2.7558 - ppl: 15.7336 - 0.164s/step



[2022-07-19 15:27:29,904] [    INFO] - tokenizer config file saved in ./checkpoints/model_50/tokenizer_config.json
[2022-07-19 15:27:29,907] [    INFO] - Special tokens file saved in ./checkpoints/model_50/special_tokens_map.json


step 55 - loss: 3.6934 - ppl: 40.1810 - lr: 0.0000001 - 0.429s/step
step 60 - loss: 3.5569 - ppl: 35.0534 - lr: 0.0000001 - 0.425s/step

Eval begin...
loss: 2.7558 - ppl: 15.7343 - 0.163s/step



[2022-07-19 15:30:22,426] [    INFO] - tokenizer config file saved in ./checkpoints/model_60/tokenizer_config.json
[2022-07-19 15:30:22,429] [    INFO] - Special tokens file saved in ./checkpoints/model_60/special_tokens_map.json


step 65 - loss: 3.5345 - ppl: 34.2789 - lr: 0.0000002 - 0.431s/step
step 70 - loss: 3.1544 - ppl: 23.4399 - lr: 0.0000002 - 0.428s/step

Eval begin...
loss: 2.7559 - ppl: 15.7351 - 0.164s/step



[2022-07-19 15:33:15,807] [    INFO] - tokenizer config file saved in ./checkpoints/model_70/tokenizer_config.json
[2022-07-19 15:33:15,810] [    INFO] - Special tokens file saved in ./checkpoints/model_70/special_tokens_map.json


step 75 - loss: 4.4154 - ppl: 82.7188 - lr: 0.0000002 - 0.435s/step
step 80 - loss: 3.5995 - ppl: 36.5785 - lr: 0.0000002 - 0.427s/step

Eval begin...
loss: 2.7560 - ppl: 15.7361 - 0.163s/step



[2022-07-19 15:36:08,772] [    INFO] - tokenizer config file saved in ./checkpoints/model_80/tokenizer_config.json
[2022-07-19 15:36:08,775] [    INFO] - Special tokens file saved in ./checkpoints/model_80/special_tokens_map.json


step 85 - loss: 3.6144 - ppl: 37.1274 - lr: 0.0000002 - 0.433s/step
step 90 - loss: 3.5916 - ppl: 36.2934 - lr: 0.0000002 - 0.429s/step

Eval begin...
loss: 2.7560 - ppl: 15.7372 - 0.162s/step



[2022-07-19 15:39:00,692] [    INFO] - tokenizer config file saved in ./checkpoints/model_90/tokenizer_config.json
[2022-07-19 15:39:00,695] [    INFO] - Special tokens file saved in ./checkpoints/model_90/special_tokens_map.json


step 95 - loss: 3.4181 - ppl: 30.5118 - lr: 0.0000002 - 0.427s/step
step 100 - loss: 3.6781 - ppl: 39.5710 - lr: 0.0000002 - 0.434s/step

Eval begin...
loss: 2.7561 - ppl: 15.7382 - 0.163s/step



[2022-07-19 15:41:52,625] [    INFO] - tokenizer config file saved in ./checkpoints/model_100/tokenizer_config.json
[2022-07-19 15:41:52,628] [    INFO] - Special tokens file saved in ./checkpoints/model_100/special_tokens_map.json


step 105 - loss: 3.5338 - ppl: 34.2541 - lr: 0.0000003 - 0.432s/step
step 110 - loss: 3.4108 - ppl: 30.2883 - lr: 0.0000003 - 0.431s/step

Eval begin...
loss: 2.7562 - ppl: 15.7392 - 0.163s/step



[2022-07-19 15:44:45,643] [    INFO] - tokenizer config file saved in ./checkpoints/model_110/tokenizer_config.json
[2022-07-19 15:44:45,645] [    INFO] - Special tokens file saved in ./checkpoints/model_110/special_tokens_map.json


step 115 - loss: 3.5370 - ppl: 34.3629 - lr: 0.0000003 - 0.431s/step
step 120 - loss: 4.2940 - ppl: 73.2576 - lr: 0.0000003 - 0.426s/step

Eval begin...
loss: 2.7563 - ppl: 15.7407 - 0.164s/step



[2022-07-19 15:47:36,064] [    INFO] - tokenizer config file saved in ./checkpoints/model_120/tokenizer_config.json
[2022-07-19 15:47:36,067] [    INFO] - Special tokens file saved in ./checkpoints/model_120/special_tokens_map.json


step 125 - loss: 3.6018 - ppl: 36.6628 - lr: 0.0000003 - 0.431s/step
step 130 - loss: 3.3020 - ppl: 27.1667 - lr: 0.0000003 - 0.432s/step

Eval begin...
loss: 2.7564 - ppl: 15.7425 - 0.163s/step



[2022-07-19 15:50:25,414] [    INFO] - tokenizer config file saved in ./checkpoints/model_130/tokenizer_config.json
[2022-07-19 15:50:25,417] [    INFO] - Special tokens file saved in ./checkpoints/model_130/special_tokens_map.json


step 135 - loss: 3.5356 - ppl: 34.3163 - lr: 0.0000003 - 0.427s/step
step 140 - loss: 4.2148 - ppl: 67.6775 - lr: 0.0000004 - 0.430s/step

Eval begin...
loss: 2.7565 - ppl: 15.7442 - 0.163s/step



[2022-07-19 15:53:14,677] [    INFO] - tokenizer config file saved in ./checkpoints/model_140/tokenizer_config.json
[2022-07-19 15:53:14,680] [    INFO] - Special tokens file saved in ./checkpoints/model_140/special_tokens_map.json


step 145 - loss: 4.0054 - ppl: 54.8915 - lr: 0.0000004 - 0.428s/step
step 150 - loss: 3.5815 - ppl: 35.9268 - lr: 0.0000004 - 0.432s/step

Eval begin...
loss: 2.7566 - ppl: 15.7456 - 0.163s/step



[2022-07-19 15:56:04,423] [    INFO] - tokenizer config file saved in ./checkpoints/model_150/tokenizer_config.json
[2022-07-19 15:56:04,425] [    INFO] - Special tokens file saved in ./checkpoints/model_150/special_tokens_map.json


step 155 - loss: 3.5113 - ppl: 33.4910 - lr: 0.0000004 - 0.428s/step
step 160 - loss: 4.2658 - ppl: 71.2197 - lr: 0.0000004 - 0.430s/step

Eval begin...
loss: 2.7566 - ppl: 15.7470 - 0.163s/step



[2022-07-19 15:58:53,744] [    INFO] - tokenizer config file saved in ./checkpoints/model_160/tokenizer_config.json
[2022-07-19 15:58:53,746] [    INFO] - Special tokens file saved in ./checkpoints/model_160/special_tokens_map.json


step 165 - loss: 3.7945 - ppl: 44.4543 - lr: 0.0000004 - 0.426s/step
step 170 - loss: 3.5206 - ppl: 33.8044 - lr: 0.0000004 - 0.426s/step

Eval begin...
loss: 2.7567 - ppl: 15.7478 - 0.163s/step



[2022-07-19 16:01:44,060] [    INFO] - tokenizer config file saved in ./checkpoints/model_170/tokenizer_config.json
[2022-07-19 16:01:44,062] [    INFO] - Special tokens file saved in ./checkpoints/model_170/special_tokens_map.json


step 175 - loss: 3.6167 - ppl: 37.2158 - lr: 0.0000004 - 0.428s/step
step 180 - loss: 3.6895 - ppl: 40.0238 - lr: 0.0000005 - 0.431s/step

Eval begin...
loss: 2.7568 - ppl: 15.7487 - 0.164s/step



[2022-07-19 16:04:34,296] [    INFO] - tokenizer config file saved in ./checkpoints/model_180/tokenizer_config.json
[2022-07-19 16:04:34,299] [    INFO] - Special tokens file saved in ./checkpoints/model_180/special_tokens_map.json


step 185 - loss: 3.6069 - ppl: 36.8512 - lr: 0.0000005 - 0.429s/step
step 190 - loss: 3.5987 - ppl: 36.5523 - lr: 0.0000005 - 0.430s/step

Eval begin...
loss: 2.7569 - ppl: 15.7503 - 0.163s/step



[2022-07-19 16:07:24,025] [    INFO] - tokenizer config file saved in ./checkpoints/model_190/tokenizer_config.json
[2022-07-19 16:07:24,028] [    INFO] - Special tokens file saved in ./checkpoints/model_190/special_tokens_map.json


step 195 - loss: 3.3243 - ppl: 27.7801 - lr: 0.0000005 - 0.433s/step
step 200 - loss: 3.6484 - ppl: 38.4151 - lr: 0.0000005 - 0.428s/step

Eval begin...
loss: 2.7569 - ppl: 15.7512 - 0.163s/step



[2022-07-19 16:10:13,819] [    INFO] - tokenizer config file saved in ./checkpoints/model_200/tokenizer_config.json
[2022-07-19 16:10:13,822] [    INFO] - Special tokens file saved in ./checkpoints/model_200/special_tokens_map.json


step 205 - loss: 4.3781 - ppl: 79.6884 - lr: 0.0000005 - 0.426s/step
step 210 - loss: 3.3038 - ppl: 27.2170 - lr: 0.0000005 - 0.428s/step

Eval begin...
loss: 2.7570 - ppl: 15.7517 - 0.163s/step



[2022-07-19 16:13:02,948] [    INFO] - tokenizer config file saved in ./checkpoints/model_210/tokenizer_config.json
[2022-07-19 16:13:02,951] [    INFO] - Special tokens file saved in ./checkpoints/model_210/special_tokens_map.json


step 215 - loss: 3.6600 - ppl: 38.8621 - lr: 0.0000005 - 0.427s/step
step 220 - loss: 3.5599 - ppl: 35.1582 - lr: 0.0000006 - 0.425s/step

Eval begin...
loss: 2.7570 - ppl: 15.7523 - 0.163s/step



[2022-07-19 16:15:52,553] [    INFO] - tokenizer config file saved in ./checkpoints/model_220/tokenizer_config.json
[2022-07-19 16:15:52,556] [    INFO] - Special tokens file saved in ./checkpoints/model_220/special_tokens_map.json


step 225 - loss: 3.8935 - ppl: 49.0835 - lr: 0.0000006 - 0.429s/step
step 230 - loss: 3.6475 - ppl: 38.3774 - lr: 0.0000006 - 0.427s/step

Eval begin...
loss: 2.7570 - ppl: 15.7531 - 0.164s/step



[2022-07-19 16:18:42,722] [    INFO] - tokenizer config file saved in ./checkpoints/model_230/tokenizer_config.json
[2022-07-19 16:18:42,724] [    INFO] - Special tokens file saved in ./checkpoints/model_230/special_tokens_map.json


step 235 - loss: 3.6399 - ppl: 38.0888 - lr: 0.0000006 - 0.427s/step
step 240 - loss: 3.5448 - ppl: 34.6333 - lr: 0.0000006 - 0.430s/step

Eval begin...
loss: 2.7570 - ppl: 15.7530 - 0.163s/step



[2022-07-19 16:21:32,511] [    INFO] - tokenizer config file saved in ./checkpoints/model_240/tokenizer_config.json
[2022-07-19 16:21:32,513] [    INFO] - Special tokens file saved in ./checkpoints/model_240/special_tokens_map.json


step 245 - loss: 3.9547 - ppl: 52.1819 - lr: 0.0000006 - 0.428s/step
step 250 - loss: 3.4478 - ppl: 31.4303 - lr: 0.0000006 - 0.428s/step

Eval begin...
loss: 2.7571 - ppl: 15.7541 - 0.164s/step



[2022-07-19 16:24:23,004] [    INFO] - tokenizer config file saved in ./checkpoints/model_250/tokenizer_config.json
[2022-07-19 16:24:23,007] [    INFO] - Special tokens file saved in ./checkpoints/model_250/special_tokens_map.json


step 255 - loss: 4.3372 - ppl: 76.4931 - lr: 0.0000006 - 0.427s/step
step 260 - loss: 4.1356 - ppl: 62.5266 - lr: 0.0000007 - 0.428s/step

Eval begin...
loss: 2.7571 - ppl: 15.7544 - 0.164s/step



[2022-07-19 16:27:13,153] [    INFO] - tokenizer config file saved in ./checkpoints/model_260/tokenizer_config.json
[2022-07-19 16:27:13,155] [    INFO] - Special tokens file saved in ./checkpoints/model_260/special_tokens_map.json


step 265 - loss: 3.6275 - ppl: 37.6183 - lr: 0.0000007 - 0.428s/step
step 270 - loss: 3.4993 - ppl: 33.0939 - lr: 0.0000007 - 0.430s/step

Eval begin...
loss: 2.7572 - ppl: 15.7552 - 0.163s/step



[2022-07-19 16:30:02,968] [    INFO] - tokenizer config file saved in ./checkpoints/model_270/tokenizer_config.json
[2022-07-19 16:30:02,970] [    INFO] - Special tokens file saved in ./checkpoints/model_270/special_tokens_map.json


step 275 - loss: 3.6878 - ppl: 39.9563 - lr: 0.0000007 - 0.426s/step
step 280 - loss: 3.5232 - ppl: 33.8920 - lr: 0.0000007 - 0.430s/step

Eval begin...
loss: 2.7572 - ppl: 15.7552 - 0.163s/step



[2022-07-19 16:32:52,385] [    INFO] - tokenizer config file saved in ./checkpoints/model_280/tokenizer_config.json
[2022-07-19 16:32:52,388] [    INFO] - Special tokens file saved in ./checkpoints/model_280/special_tokens_map.json


step 285 - loss: 3.6954 - ppl: 40.2615 - lr: 0.0000007 - 0.433s/step
step 290 - loss: 3.5729 - ppl: 35.6196 - lr: 0.0000007 - 0.432s/step

Eval begin...
loss: 2.7572 - ppl: 15.7555 - 0.164s/step



[2022-07-19 16:35:42,576] [    INFO] - tokenizer config file saved in ./checkpoints/model_290/tokenizer_config.json
[2022-07-19 16:35:42,579] [    INFO] - Special tokens file saved in ./checkpoints/model_290/special_tokens_map.json


step 295 - loss: 3.8845 - ppl: 48.6448 - lr: 0.0000007 - 0.433s/step
step 300 - loss: 3.4953 - ppl: 32.9589 - lr: 0.0000008 - 0.430s/step

Eval begin...
loss: 2.7573 - ppl: 15.7571 - 0.163s/step



[2022-07-19 16:38:32,056] [    INFO] - tokenizer config file saved in ./checkpoints/model_300/tokenizer_config.json
[2022-07-19 16:38:32,058] [    INFO] - Special tokens file saved in ./checkpoints/model_300/special_tokens_map.json


step 305 - loss: 3.6830 - ppl: 39.7666 - lr: 0.0000008 - 0.428s/step
step 310 - loss: 3.4747 - ppl: 32.2892 - lr: 0.0000008 - 0.437s/step

Eval begin...
loss: 2.7574 - ppl: 15.7585 - 0.163s/step



[2022-07-19 16:41:21,942] [    INFO] - tokenizer config file saved in ./checkpoints/model_310/tokenizer_config.json
[2022-07-19 16:41:21,945] [    INFO] - Special tokens file saved in ./checkpoints/model_310/special_tokens_map.json


step 315 - loss: 3.7047 - ppl: 40.6366 - lr: 0.0000008 - 0.442s/step
step 320 - loss: 3.3017 - ppl: 27.1601 - lr: 0.0000008 - 0.433s/step

Eval begin...
loss: 2.7574 - ppl: 15.7595 - 0.163s/step



[2022-07-19 16:44:11,912] [    INFO] - tokenizer config file saved in ./checkpoints/model_320/tokenizer_config.json
[2022-07-19 16:44:11,915] [    INFO] - Special tokens file saved in ./checkpoints/model_320/special_tokens_map.json


step 325 - loss: 3.4832 - ppl: 32.5641 - lr: 0.0000008 - 0.430s/step
step 330 - loss: 3.4689 - ppl: 32.1020 - lr: 0.0000008 - 1.415s/step

Eval begin...


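### Step 5: Prediction and Decoding

Initialize the model with the parameters saved during training, load the test set, and run prediction.

**For generation tasks, PaddleNLP provides a `generate` function that supports Greedy Search, Beam Search, and Sampling decoding strategies. You only need to specify the decoding strategy and its parameters to run decoding and obtain the token ids and probability scores of the generated sequences.**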
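To make the Sampling strategy with a `top_k` limit concrete, here is a minimal pure-Python sketch of a single top-k sampling step. It illustrates the technique only and is not PaddleNLP's implementation; the function name and the toy logits are assumptions.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to the top-k entries (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0, 1.5]  # toy scores over a 6-token vocabulary
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

Only the three highest-scoring token ids can ever be drawn, which removes the long tail of unlikely tokens while still allowing varied outputs. Generating several candidates per input and ranking them afterwards builds on this variety.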

In [8]:
%pwd

'/home/aistudio/multi-skill_dialogue'

In [9]:
# This can be the name of a pretrained model provided by paddlenlp, or the path to your own fine-tuned model
model_name_or_path = './checkpoints/model_320'
# Load the model
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)

In [12]:
import time

In [13]:
# Prediction batch_size
batch_size = 4

# Test set path; keep it consistent with the preprocessing output path
test_data_path = './datasets/test.txt'
test_dataset = DialogueDataset(
    test_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    mode='test')
test_dataloader = DataLoader(test_dataset, return_list=True, batch_size=None)

In [11]:
from data import select_response

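# Maximum length of the generated sequence
max_dec_len = 64
# Minimum length of the generated sequence
min_dec_len = 1
# Decoding strategy
decode_strategy = 'sampling'
# top_k parameter for top-k sampling
top_k = 5
# Number of output sequences returned per input sequence; the generation API replicates the input internally
num_return_sequences = 20
# Path where the generated text is saved
output_path = './predict.txt'
# Logging interval (in steps)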
logging_steps = 1

print('\nInfer begin...')
model.eval()
total_time = 0.0
start_time = time.time()
responses = []
for step, inputs in enumerate(test_dataloader, 1):
    input_ids, token_type_ids, position_ids, attention_mask = inputs
    ids, scores = model.generate(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        max_length=max_dec_len,
        min_length=min_dec_len,
        decode_strategy=decode_strategy,
        top_k=top_k,
        num_return_sequences=num_return_sequences)

    total_time += (time.time() - start_time)
    if step % logging_steps == 0:
        print('step %d - %.3fs/step' % (step, total_time / logging_steps))
        total_time = 0.0
    # Rank the generated sequences and pick the best of the num_return_sequences candidates as the result
    results = select_response(ids, scores, tokenizer, max_dec_len, num_return_sequences)
    responses.extend(results)

    start_time = time.time()

# Save the generated text
with open(output_path, 'w', encoding='utf-8') as fout:
    for response in responses:
        fout.write(response + '\n')
print('\nSave inference result into: %s' % output_path)


Infer begin...
step 1 - 2.609s/step
step 2 - 2.460s/step
step 3 - 2.469s/step
step 4 - 2.359s/step
step 5 - 2.682s/step
step 6 - 2.744s/step
step 7 - 2.373s/step
step 8 - 2.311s/step
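The loop above generates `num_return_sequences` candidates per input and relies on `select_response` from `data.py` to keep one of them. The actual ranking rules live in `data.py`; the sketch below only illustrates the general idea under stated assumptions: candidates arrive grouped per input, a higher score is better, and candidates that reach the length limit are penalized because they were probably cut off. The function name and the toy data are hypothetical.

```python
def pick_best(candidates, scores, num_return_sequences, max_len):
    """From each consecutive group of candidates, return the highest-scoring one."""
    best = []
    for start in range(0, len(candidates), num_return_sequences):
        group = []
        end = start + num_return_sequences
        for text, score in zip(candidates[start:end], scores[start:end]):
            # Assumption: a candidate that hits the length limit was likely truncated.
            if len(text) >= max_len:
                score -= 1e3
            group.append((score, text))
        best.append(max(group)[1])
    return best

# Two inputs with three candidates each (toy strings and log-probability-like scores).
candidates = ["ok", "a much longer reply", "hi there", "yes", "no", "maybe"]
scores = [-1.0, -0.2, -0.5, -0.3, -0.9, -0.4]
print(pick_best(candidates, scores, num_return_sequences=3, max_len=10))
```

Sampling many candidates and reranking trades extra decoding time (about 2.5 s per step in the log above) for a better chance of returning a complete, fluent reply.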


### Step 6: Submit Results

The predictions are saved to `output_path`. Convert them to the format required by the competition website and submit.