
Add PP-MiniLM #1403

Merged: 19 commits merged into PaddlePaddle:develop on Dec 16, 2021

Conversation

@LiuChiachi (Contributor) commented Dec 7, 2021

PR types

New features

PR changes

Models & Docs

Description

  • Add PP-MiniLM code
  • Add doc for PP-MiniLM

| Model | #Params | #FLOPs | Speedup | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | WSC | CSL | CLUE Avg |
| ----------------------- | ------- | ------ | ------- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | -------- |
| BERT<sub>base</sub> | 102.3M | 10.87B | 1.00x | 74.14 | 56.81 | 61.10 | 81.19 | 74.85 | 79.93 | 81.47 | 72.78 |
| TinyBERT<sub>6</sub> | 59.7M | 5.44B | 1.66x | 72.59 | 55.70 | 57.64 | 79.57 | 73.97 | 77.63 | 80.00 | 71.01 |
| UER-py RoBERTa L6-H768 | 59.7M | 5.44B | 1.66x | 69.74 | 66.36 | 59.95 | 77.00 | 71.39 | 71.05 | 82.83 | 71.19 |
| RBT6, Chinese | 59.7M | 5.44B | 1.66x | 73.93 | 56.63 | 59.79 | 79.28 | 73.12 | 77.30 | 80.80 | 71.55 |
| ERNIE-Tiny | 90.7M | 4.83B | 1.89x | 70.67 | 55.60 | 59.91 | 75.74 | 71.36 | 67.11 | 76.70 | 68.16 |
| PP-MiniLM 6L-768H | 59.7M | 5.44B | 1.66x | 74.14 | 57.43 | 61.75 | 81.01 | 76.17 | 86.18 | 77.47 | 73.45 |
| PP-MiniLM (pruned) | 49.1M | 4.08B | 2.00x | 73.91 | 57.44 | 61.64 | 81.10 | 75.59 | 85.86 | 77.97 | 73.36 |
| PP-MiniLM (quantized) | 49.2M | 4.08B | 4.15x | 74.00 | 57.37 | 61.33 | 81.09 | 75.56 | 85.85 | 76.53 | 73.10 |

TODO:
1. Update the README with the QA test results for UER-py; test CSL under CUDA 10.2 with Paddle 2.2.1.

@LiuChiachi LiuChiachi force-pushed the add-ppminilm branch 2 times, most recently from 6eb85a4 to ebd2be2 on December 7, 2021 13:25
@ZeyuChen ZeyuChen added this to In progress in PaddleNLP v2.2 via automation Dec 7, 2021

PP-MiniLM combines distillation, pruning, quantization, and high-performance inference techniques, featuring high accuracy, fast inference, and a small parameter count:

- High accuracy: the 6-layer, 768-hidden-size model outperforms same-size models from Huawei and Tencent;
Member:

Do not use company names; use model names.

# PP-MiniLM: A Distinctive Compact Chinese Model


PP-MiniLM is a distinctive compact Chinese model with the same architecture as ERNIE. This example currently covers task-agnostic (general) distillation of a 6-transformer-layer model, plus pruning and quantization via PaddleSlim to further improve inference speed.
Member:

Should the model introduction highlight our improvements over the MiniLMv2 strategy? @tianxin1860

As discussed offline, present it following the order 1. fast inference, 2. good model accuracy, 3. small parameter count, and mention our improvements over MiniLMv2 under the model-accuracy item.

| -------------------- | ------------- | ------ | ------- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| bert-base-chinese | 102.27M | | TODO | | | | | | | | |
| TinyBERT(6l-768d) | 59.7M | | 1.00x | 72.22 | 55.82 | 58.10 | 79.53 | 74.00 | 75.99 | 80.57 | 70.89 |
| Tencent RoBERTa 6l-768d | 59.7M | | 1.00x | 69.74 | 66.36 | 59.95 | 77.00 | 71.39 | 71.05 | 82.83 | 71.19 |
Member:

UER-py RoBERTa xxxx — drop the company name.


### Data

Baidu-internal business data, split into 64 files under the `dataset` directory.
Member:

Remove the sentence about internal business data. Consider using the CLUESmall data as the example dataset instead.

Member:

Or simply describe how the data is organized.


Introduction to PP-MiniLM's distillation method:

The 20th layer of a large-size teacher model is used to distill the inter-sample relations between the q-q, k-k, and v-v pairs of the 6th layer of the 6-layer student model. That is, q, k, and v are regrouped to a common head_num and rearranged,
Member:

What is the large-size teacher model? Should a specific model be named as an example?

Member:

This passage needs to be re-summarized.

Member:

Not particularly clear.
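
For readers who want a concrete picture of the relation distillation described above, here is a minimal sketch of MiniLMv2-style Q/K/V relation distillation with Paddle. It is illustrative only: the function names, the default of 48 relation heads, and the exact loss form are assumptions, not the PR's implementation.

```python
import paddle
import paddle.nn.functional as F

def qkv_relations(x, num_relation_heads):
    """Regroup q/k/v into `num_relation_heads` heads and compute token-to-token relations."""
    bsz, seq_len, hidden = x.shape
    head_dim = hidden // num_relation_heads
    x = paddle.reshape(x, [bsz, seq_len, num_relation_heads, head_dim])
    x = paddle.transpose(x, perm=[0, 2, 1, 3])           # [bsz, heads, seq_len, head_dim]
    return paddle.matmul(x, x, transpose_y=True) / head_dim ** 0.5

def relation_distill_loss(student_x, teacher_x, num_relation_heads=48):
    """KL divergence between the teacher's and student's relation distributions."""
    s = F.log_softmax(qkv_relations(student_x, num_relation_heads), axis=-1)
    t = F.softmax(qkv_relations(teacher_x, num_relation_heads), axis=-1)
    return F.kl_div(s, t, reduction='mean')
```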


### Performance Test

We run inference on the quantized models with the inference/infer.py script on a single NVIDIA 16G T4 GPU.
Member:

NVIDIA Tesla T4 (the T4 only comes with 16G, so there is no need to call that out).

```shell
cd inference

python infer.py --task_name ${task} --model_path ../quantization/${task}_quant_models/${algo}${bs}/int8 --int8 --use_trt --collect_shape # generate the shape range info file
```
Member:

Should the collect-shape step get a fuller explanation?
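
As a rough illustration of what that explanation could cover: Paddle Inference's dynamic-shape support for TensorRT works in two passes, one run that records tensor shape ranges and a second run that consumes them. The sketch below is an assumption-laden outline (the paths, flag handling, and model file names are placeholders), not the infer.py from this PR.

```python
import os
from paddle.inference import Config, PrecisionType, create_predictor

model_dir = "int8_model_dir"                       # placeholder directory
shape_file = os.path.join(model_dir, "shape_range_info.pbtxt")

config = Config(os.path.join(model_dir, "model.pdmodel"),
                os.path.join(model_dir, "model.pdiparams"))
config.enable_use_gpu(100, 0)

if not os.path.exists(shape_file):
    # Pass 1 (--collect_shape): run on representative inputs and record the
    # min/max/opt shape of every tensor into shape_file.
    config.collect_shape_range_info(shape_file)
else:
    # Pass 2: build TensorRT engines using the recorded dynamic-shape ranges.
    config.enable_tensorrt_engine(workspace_size=1 << 30,
                                  precision_mode=PrecisionType.Int8,
                                  use_static=True,
                                  use_calib_mode=False)
    config.enable_tuned_tensorrt_dynamic_shape(shape_file, True)

predictor = create_predictor(config)
```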


### Requirements:

This step depends on paddle 2.2.1. For a more pronounced speedup, test on a T-series GPU (this example uses a T4). V-series GPUs do not support int8 tensor core, so the speedup there will fall short of the numbers in the tables of this document.
Member:

Watch the spelling and capitalization of English terms:
Int8 Tensor Core

Member:

For a more pronounced speedup, testing on an NVIDIA Tensor Core GPU (e.g. T4, A10, A100) is recommended.

FP32 inference script:

```shell
python infer.py --task_name ${task} --model_path $MODEL_PATH --use_trt --collect_shape
```
Member:

The collect-shape step should be explained on its own; otherwise it will confuse users here.

        config.tensorrt_engine_enabled()))
    if args.collect_shape:
        config.collect_shape_range_info(
            os.path.dirname(args.model_path) + "/" + args.task_name +
Member:

Paths should be joined with the os.path.join API; hard-coding "/" breaks Windows compatibility.
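
A portable version of the call above might look like the sketch below (the file-name suffix is a placeholder, since the original line is truncated in the excerpt; `args` and `config` come from the surrounding infer.py).

```python
import os

# "_shape_range_info.pbtxt" is a hypothetical suffix; the real name is cut off above.
shape_file = os.path.join(os.path.dirname(args.model_path),
                          args.task_name + "_shape_range_info.pbtxt")
config.collect_shape_range_info(shape_file)
```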

@tianxin1860 left a comment:
Leave some comments


### Data

This experiment uses the classification datasets from CLUE; on Linux the datasets are downloaded automatically to `~/.paddlenlp/datasets/Clue/` when the script is launched.

Suggested wording: This experiment is based on the CLUE dataset; running the Fine-tune script automatically downloads it to the *** directory.


This experiment uses the classification datasets from CLUE; on Linux the datasets are downloaded automatically to `~/.paddlenlp/datasets/Clue/` when the script is launched.

Fine-tune the general model `GENERAL_MODEL_DIR` produced by the first-step general distillation over the following hyperparameter ranges

Suggested wording: Run a Grid Search over the following hyperparameter ranges on the small model GENERAL_MODEL_DIR produced by the first-step distillation.
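
One way such a grid search could be driven, purely as an illustration (the script name run_clue.py and the hyperparameter ranges below are placeholders, not the values fixed by this PR):

```python
import itertools
import subprocess

learning_rates = [1e-5, 2e-5, 3e-5]     # placeholder ranges
batch_sizes = [16, 32, 64]

for lr, bs in itertools.product(learning_rates, batch_sizes):
    # Fine-tune the general distilled model once per hyperparameter combination.
    subprocess.run([
        "python", "run_clue.py",                      # hypothetical fine-tuning script
        "--model_name_or_path", "GENERAL_MODEL_DIR",
        "--learning_rate", str(lr),
        "--batch_size", str(bs),
    ], check=True)
```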

```shell
cd ofa
```

In our experiments, compressing the model width to 3/4 of the original leaves accuracy essentially unchanged (-0.15).

Watch the wording: under the 6L768H setting, the model width is compressed to 3/4 of the original with almost no loss in accuracy.


In our experiments, compressing the model width to 3/4 of the original leaves accuracy essentially unchanged (-0.15).

### Launch Scripts for Compression and Distillation

Should the relationship between compression, pruning, and distillation, and when each applies, be spelled out? Using "compression and distillation" in this heading could be misleading.

"cmnli": Accuracy,
"cluewsc2020": Accuracy,
"csl": Accuracy,
"xnli": Accuracy,

Would it be better to remove the datasets that are not part of CLUE?
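
If the mapping is trimmed as suggested, it might look like the sketch below (the dict name METRIC_CLASSES is assumed for illustration; only the CLUE classification tasks used in this example are kept):

```python
from paddle.metric import Accuracy

# Assumed name; keeps only the CLUE classification tasks used in this example.
METRIC_CLASSES = {
    "afqmc": Accuracy,
    "tnews": Accuracy,
    "iflytek": Accuracy,
    "ocnli": Accuracy,
    "cmnli": Accuracy,
    "cluewsc2020": Accuracy,
    "csl": Accuracy,
}
```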

Comment on lines 206 to 209
#print(origin_model_new.state_dict().keys())
#print("=====================")
#for name, params in origin_model_new.named_parameters():
# print(name, params.name)

Same as above.

Comment on lines 1 to 7
export CUDA_VISIBLE_DEVICES=$6
export TASK_NAME=$1
export BATCH_SIZE=$3
export SEQ_LEN=$5
export PRE_EPOCHS=$4
export LR=$2
export STUDENT_DIR=$7

Should the variables be parsed in order, $1, $2, and so on?


do

python quant_post.py --task_name ${task} --input_dir ${MODEL_DIR}/${task}/0.75/sub_static

Is this 0.75 used directly as a directory name?

'target']['span1_text'], example['target']['span2_text'], example[
'target']['span1_index'], example['target']['span2_index']
text_list = list(text)
# print(text)

Redundant comments.

s_head_dim, t_head_dim = s.shape[3], t.shape[3]

if alpha + beta == 1.0:
    loss1 = 0.0

Could loss1, loss2, and loss3 be renamed to something meaningful?

| ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| 74.28 | 57.33 | 61.72 | 81.06 | 76.20 | 86.51 | 78.77 | 73.70 |

### You can export the model after Fine-tuning directly for deployment like this
Collaborator:

This does not work well as a heading; headings should be as concise as possible.

# PP-MiniLM: A Distinctive Compact Chinese Model


PP-MiniLM is a distinctive compact Chinese model with the same architecture as ERNIE. This example currently covers task-agnostic (general) distillation of a 6-transformer-layer model, plus pruning and quantization via PaddleSlim to further improve inference speed.
Collaborator:

Please take another look at whether it is appropriate to describe our own model as "特色" (distinctive).

"本案例" (this example) may also not fit here; since this is now our own model, "this model" or "this solution" would be better.


### How It Works

Introduction to PP-MiniLM's distillation method:
Collaborator:

Should the original MiniLM also be mentioned? That would also make clear where our name comes from.


After the run completes, the model is saved under `ofa_models/CLUEWSC2020/0.75/best_model/`

### Exporting the Pruned Model:
Collaborator:

Ending the heading with a colon does not seem right either.


### Requirements

If run on 8 NVIDIA V100 32G GPUs, this experiment takes about 2-3 days of training. If resources are limited, you can directly download the model produced by this step and skip it.
Collaborator:

This does not really count as an environment requirement.


PP-MiniLM is a distinctive compact Chinese model with the same architecture as ERNIE. This example currently covers task-agnostic (general) distillation of a 6-transformer-layer model, plus pruning and quantization via PaddleSlim to further improve inference speed.

PP-MiniLM combines distillation, pruning, quantization, and high-performance inference techniques, featuring high accuracy, fast inference, and a small parameter count:
Collaborator:

This sentence can be merged with the one above.

| Tencent RoBERTa 6l-768d | 59.7M | | 1.00x | 69.74 | 66.36 | 59.95 | 77.00 | 71.39 | 71.05 | 82.83 | 71.19 |
| PP-MiniLM 6l-768d | 59.7M | | 1.00x | 74.28 | 57.33 | 61.72 | 81.06 | 76.2 | 86.51 | 78.77 | 73.70 |
| PP-MiniLM (pruned) | 49.1M (after pruning) | | 1.15x | 73.82 | 57.33 | 61.60 | 81.38 | 76.20 | 85.52 | 79.00 | 73.55 |
| PP-MiniLM (quantized) | 49.2M (after quantization) | | 2.18x | 73.61 | 57.18 | 61.49 | 81.26 | 76.31 | 84.54 | 77.67 | 73.15 |
Collaborator:

Is the quantized model even larger than in the previous step?

- `num_relation_heads`: the number of relation heads, typically 64 for a large-size teacher model and 48 for a base-size teacher model.
- `teacher_model_type`: the teacher model type; currently only 'ernie' and 'roberta' are supported.
- `teacher_layer_index`: the layer count of the teacher model used during distillation
- `student_layer_index`: the layer count of the student model used during distillation
Collaborator:

This is meant to indicate which layer is selected, right? "Layer count" may cause some confusion.

# The first commit's message is:

update inference

# This is the 2nd commit message:

update
fix infer perf

remove useless comments
# PP-MiniLM: A Compact Chinese Model

The PP-MiniLM compact Chinese model example aims to provide a high-accuracy, high-performance small-model solution integrating training and inference.

Suggested wording: The PP-MiniLM compact Chinese model example aims to provide high-accuracy, high-performance small models and an integrated training-and-inference solution.


The current solution builds on industry-leading Task Agnostic model distillation, pruning, and quantization techniques, giving the small model three key strengths: fast inference, good model accuracy, and a small parameter count.

- Fast inference: we integrate PaddleSlim's pruning and quantization techniques to further compress the small model, bringing its inference speed to 2.18x the original;
@tianxin1860 Dec 10, 2021

Suggested wording: Fast inference: building on PaddleSlim's pruning and quantization techniques to further compress the small model, the quantized PP-MiniLM model achieves a GPU inference speedup of up to 3.56x over Bert-base;


- Fast inference: we integrate PaddleSlim's pruning and quantization techniques to further compress the small model, bringing its inference speed to 2.18x the original;

- High accuracy: building on the Multi-Head Self-Attention Relation Distillation technique proposed by MiniLMv2, we further optimize the algorithm by introducing inter-sample relation knowledge distillation. Our 6-layer, 768-hidden-size model exceeds same-size TinyBERT and UER-py RoBERTa models by 2.66% and 1.51% average accuracy on CLUE.
@tianxin1860 Dec 10, 2021

Suggested wording: High accuracy: building on MiniLMv2's Multi-Head Self-Attention Relation Distillation technique and further optimized by introducing inter-sample relation knowledge distillation, the 6-layer PP-MiniLM model is 0.23% above the 12-layer Bert-base-chinese on the CLUE datasets, and 2.66% and 1.51% above same-size TinyBERT and UER-py RoBERTa respectively;


- High accuracy: building on the Multi-Head Self-Attention Relation Distillation technique proposed by MiniLMv2, we further optimize the algorithm by introducing inter-sample relation knowledge distillation. Our 6-layer, 768-hidden-size model exceeds same-size TinyBERT and UER-py RoBERTa models by 2.66% and 1.51% average accuracy on CLUE.

- Small parameter count: with PaddleSlim pruning, the model width is compressed by 1/4 at almost no accuracy loss (-0.15).

Suggested wording: Small parameter count: with PaddleSlim pruning, the hidden width of the model is compressed by 1/4 at almost no accuracy loss (-0.15%), cutting the parameter count by 28%;

Contributor Author:

Updated, many thanks~

@tianxin1860 left a comment:
Leave some comments

update code and readme

update readme

Add serial number to readme

update readme

Added a catalog

fix a catalog bug

fix a catalog bug

| Model | #Params | #FLOPs | Speedup | AFQMC | TNEWS | IFLYTEK | CMNLI | OCNLI | WSC | CSL | CLUE Avg |
| ----------------------- | ------- | ------ | ------- | ----- | ----- | ------- | ----- | ----- | ----- | ----- | ---------- |
| Bert<sub>base</sub> | 102.3M | 10.87B | 1.00x | 74.17 | 57.17 | 61.14 | 81.14 | 75.08 | 80.26 | 81.47 | 72.92 |
Member:

BERT — as a model name, BERT should be consistently uppercase.

Contributor Author:

Thanks, fixed.


### Environment

This experiment runs on 8 NVIDIA Tesla V100 32G GPUs with a training cycle of about 2-3 days. If resources are limited, you can directly [download PP-MiniLM (6L768H)](https://bj.bcebos.com/paddlenlp/models/transformers/ppminilm/6l-768h) for fine-tuning on downstream tasks.
Member:

Does this require a manual download? Can we tell users that the from_pretrained API downloads it automatically?

Contributor Author:

No manual download is needed. An example using from_pretrained has been added, and the ppminilm configurations have been added to modeling.py and tokenizer.py.
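
The kind of from_pretrained usage that reply describes would look roughly like this (a sketch; the class names PPMiniLMModel/PPMiniLMTokenizer and the checkpoint name `ppminilm-6l-768h` follow the PR's ppminilm naming but are assumptions here):

```python
import paddle
from paddlenlp.transformers import PPMiniLMModel, PPMiniLMTokenizer

# Downloads the pretrained weights automatically on first use; no manual download.
tokenizer = PPMiniLMTokenizer.from_pretrained("ppminilm-6l-768h")
model = PPMiniLMModel.from_pretrained("ppminilm-6l-768h")

inputs = tokenizer("欢迎使用 PaddleNLP!")
inputs = {name: paddle.to_tensor([value]) for name, value in inputs.items()}
sequence_output, pooled_output = model(**inputs)
```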

update readme

update readme

update readme

update readme
ZeyuChen previously approved these changes Dec 15, 2021
for task in afqmc tnews iflytek cmnli ocnli cluewsc2020 csl
Should a copyright notice be added?

Contributor Author:

Thanks for the reminder; copyright headers have been added to all the shell scripts.

PaddleNLP v2.2 automation moved this from Reviewer approved to Review in progress Dec 15, 2021
jiweibo previously approved these changes Dec 15, 2021
@jiweibo left a comment:

LGTM for inference api


#### Requirements

This step depends on PaddlePaddle 2.2.1 built with the inference library. You can pick a Python inference library suitable for your machine from the [PaddlePaddle website](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html) and install it.
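
One small way to sanity-check the installed build before running inference (a sketch; nothing here is taken from the PR's scripts):

```python
import paddle
from paddle.inference import Config  # importable only when the inference API is available

print(paddle.__version__)                  # this step expects 2.2.1
print(paddle.is_compiled_with_cuda())      # True is required for the GPU/TensorRT path
paddle.utils.run_check()                   # runs a short self-test of the installation
```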
Contributor Author:

Thanks for the reminder; updated.


## Importing PP-MiniLM

PP-MiniLM is a compact Chinese pretrained model with 6 Transformer Encoder Layers and a Hidden Size of 768, produced by task-agnostic distillation with `roberta-wwm-ext-large` as the teacher model. On seven classification tasks of the [CLUE benchmark](https://github.com/CLUEbenchmark/CLUE), its accuracy exceeds BERT<sub>base</sub>, TinyBERT<sub>6</sub>, UER-py RoBERTa L6-H768, and RBT6.
Contributor:

roberta-wwm-ext-large is the teacher model and the 6-layer ERNIE is the student, right? Mentioning ERNIE would make this clearer.

Contributor Author:

Thanks for the suggestion; the 6-layer ERNIE is now mentioned.

tianxin1860 previously approved these changes Dec 16, 2021
@tianxin1860 left a comment:

LGTM

PaddleNLP v2.2 automation moved this from Review in progress to Reviewer approved Dec 16, 2021
PaddleNLP v2.2 automation moved this from Reviewer approved to Review in progress Dec 16, 2021
PaddleNLP v2.2 automation moved this from Review in progress to Reviewer approved Dec 16, 2021
@ZeyuChen (Member) left a comment:

LGTM

@LiuChiachi (Contributor Author):

Thank you all:)🙏

@LiuChiachi LiuChiachi merged commit 868e7a2 into PaddlePaddle:develop Dec 16, 2021
PaddleNLP v2.2 automation moved this from Reviewer approved to Done Dec 16, 2021