[TIPC]Support @to_static train for base-transformer #3277

0x45f · 2022-09-15T08:39:19Z

PR types

Others

PR changes

Others

Description

[TIPC]Support @to_static train for base-transformer

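This PR implements a monitoring mechanism for @to_static (dynamic-to-static) training based on the new Benchmark specification. It is a backward-compatible upgrade on top of the existing functionality.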

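1. Usage

To enable dynamic-to-static training on top of dynamic-graph training, pass the following configuration parameter:

Configuration parameter name: to_static (must be spelled exactly; it is case-sensitive)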

bash test_tipc/benchmark_train.sh test_tipc/configs/transformer/base/train_infer_python.txt benchmark_train to_static

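2. Verification logs

This PR verified dynamic-to-static conversion for the fp16 and fp32 base transformer models on 4 GPUs.
Whether the conversion took effect can be judged from the line "Successfully to apply @to_static with specs XX" in the log. An excerpt follows: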

Successfully to apply @to_static with specs: [InputSpec(shape=(-1, -1), dtype=paddle.int64, name=src_word), InputSpec(shape=(-1, -1), dtype=paddle.int64, name=trg_word)]
[2022-09-15 08:10:28,648] [    INFO] - step_idx: 0, epoch: 0, batch: 0, avg loss: 10.752361, normalized loss: 9.384720, ppl: 46740.265625
[2022-09-15 08:10:51,772] [    INFO] - step_idx: 100, epoch: 0, batch: 100, avg loss: 8.507105, normalized loss: 7.139464, ppl: 4949.812012, avg_speed: 4.36 step/sec, batch_cost: 0.22947 sec, reader_cost: 0.00102 sec, tokens: 357597, ips: 15583.50416 words/sec
[2022-09-15 08:11:19,721] [    INFO] - step_idx: 200, epoch: 0, batch: 200, avg loss: 7.661503, normalized loss: 6.293862, ppl: 2124.949463, avg_speed: 3.60 step/sec, batch_cost: 0.27778 sec, reader_cost: 0.00157 sec, tokens: 370828, ips: 13349.55987 words/sec
[2022-09-15 08:11:47,662] [    INFO] - step_idx: 300, epoch: 0, batch: 300, avg loss: 7.323730, normalized loss: 5.956089, ppl: 1515.848145, avg_speed: 3.60 step/sec, batch_cost: 0.27751 sec, reader_cost: 0.00128 sec, tokens: 362789, ips: 13073.14962 words/sec
[2022-09-15 08:12:15,261] [    INFO] - step_idx: 400, epoch: 0, batch: 400, avg loss: 6.923210, normalized loss: 5.555569, ppl: 1015.574890, avg_speed: 3.65 step/sec, batch_cost: 0.27376 sec, reader_cost: 0.00117 sec, tokens: 350618, ips: 12807.44412 words/sec
[2022-09-15 08:12:25,449] [    INFO] - train epoch: 0, epoch_cost: 122.22634 s
[2022-09-15 08:12:45,209] [    INFO] - step_idx: 500, epoch: 1, batch: 63, avg loss: 6.535992, normalized loss: 5.168351, ppl: 689.517212, avg_speed: 3.36 step/sec, batch_cost: 0.29750 sec, reader_cost: 0.00732 sec, tokens: 335067, ips: 11262.88438 words/sec
[2022-09-15 08:13:14,839] [    INFO] - step_idx: 600, epoch: 1, batch: 163, avg loss: 6.250062, normalized loss: 4.882421, ppl: 518.045227, avg_speed: 3.39 step/sec, batch_cost: 0.29466 sec, reader_cost: 0.00114 sec, tokens: 369991, ips: 12556.56592 words/sec
[2022-09-15 08:13:44,164] [    INFO] - step_idx: 700, epoch: 1, batch: 263, avg loss: 6.091871, normalized loss: 4.724230, ppl: 442.248230, avg_speed: 3.43 step/sec, batch_cost: 0.29144 sec, reader_cost: 0.00164 sec, tokens: 365220, ips: 12531.40888 words/sec
[2022-09-15 08:14:13,494] [    INFO] - step_idx: 800, epoch: 1, batch: 363, avg loss: 6.070483, normalized loss: 4.702842, ppl: 432.889618, avg_speed: 3.43 step/sec, batch_cost: 0.29169 sec, reader_cost: 0.00110 sec, tokens: 358782, ips: 12300.06401 words/sec
[2022-09-15 08:14:34,369] [    INFO] - train epoch: 1, epoch_cost: 128.91993 s
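
The success-marker check can be automated. The following is a minimal sketch, assuming the training log is available as plain text. The function name find_to_static_specs is hypothetical and is not part of PaddleNLP or the TIPC scripts; only the marker string is taken from the log excerpt in this PR.

```python
import re
from typing import Optional

# Marker text copied from the log excerpt in this PR description.
MARKER = re.compile(r"Successfully to apply @to_static with specs: (?P<specs>.+)")


def find_to_static_specs(log_text: str) -> Optional[str]:
    """Return the spec list from the first marker line, or None if absent.

    Hypothetical helper for illustration only.
    """
    for line in log_text.splitlines():
        match = MARKER.search(line)
        if match:
            return match.group("specs").strip()
    return None


sample_log = (
    "Successfully to apply @to_static with specs: "
    "[InputSpec(shape=(-1, -1), dtype=paddle.int64, name=src_word), "
    "InputSpec(shape=(-1, -1), dtype=paddle.int64, name=trg_word)]\n"
    "[2022-09-15 08:10:28,648] [    INFO] - step_idx: 0, epoch: 0, batch: 0\n"
)

specs = find_to_static_specs(sample_log)
print("to_static applied" if specs is not None else "to_static NOT applied")
```

A run whose log never contains the marker returns None, which a monitoring job could treat as a failed conversion.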

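3. Design overview

A REST_ARGS parameter was added to benchmark_train.sh to capture the to_static argument. In the model-specific train_infer_python.txt, the parameter on line 20 is modified to add the dynamic-to-static entry trainer to_static_train:-o to_static=True
This reuses the trainer:norm_train configuration and appends --to_static=True to it to enable dynamic-to-static training, which keeps the basic configuration parameters of dynamic-to-static training and dynamic-graph training aligned.
For a similar PR, see the PaddleClas repository: PaddlePaddle/PaddleClas#1756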
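
The idea of reusing the norm_train command and appending one flag can be illustrated in Python. This is only a sketch of the principle, not the logic of benchmark_train.sh (which is a shell script). The function name build_to_static_command and the example command line are hypothetical.

```python
import shlex


def build_to_static_command(norm_train_cmd: str, extra_flag: str) -> str:
    """Derive the to_static command from the dynamic-graph command.

    Hypothetical helper: every option of the base command is kept, and
    the extra flag is appended once, so both modes share one configuration.
    """
    tokens = shlex.split(norm_train_cmd)
    if extra_flag not in tokens:
        tokens.append(extra_flag)
    return shlex.join(tokens)


# Illustrative command line only; the real one comes from train_infer_python.txt.
base_cmd = "python train.py --config ./configs/transformer.base.yaml"
print(build_to_static_command(base_cmd, "--to_static=True"))
```

Because the derived command is built from the base command rather than written out separately, any later change to the norm_train options applies to both modes.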

ZeyuChen · 2022-09-15T08:59:43Z

examples/machine_translation/transformer/train.py

@@ -86,6 +86,11 @@ def parse_args():
                        type=str,
                        choices=['true', 'false', 'True', 'False'],
                        help="Whether to use amp to train Transformer. ")
+    parser.add_argument("--to_static",
+                        default=None,


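Would using store_true here make the code more concise? It would also remove the need for the complicated check further down.
See --benchmark above for reference.

Changed to --to_static, thanks.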
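
The reviewer's suggestion can be sketched with the standard argparse module. This is a minimal illustration, not the actual train.py: the help strings are invented, and treating --benchmark as a store_true flag is an assumption based on the review comment.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; only the two flag names come from the PR."""
    parser = argparse.ArgumentParser()
    # Assumed to be a store_true flag, as the review comment implies.
    parser.add_argument("--benchmark", action="store_true",
                        help="Hypothetical help text: enable benchmark mode.")
    # With store_true the value is already a bool (False unless the flag is
    # given), so no string comparison against 'true'/'True' is needed later.
    parser.add_argument("--to_static", action="store_true",
                        help="Hypothetical help text: train with @to_static.")
    return parser


args = build_parser().parse_args(["--to_static"])
print(args.to_static, args.benchmark)
```

One consequence of this design is that the flag is passed bare (--to_static); a form such as --to_static=True is rejected by a store_true argument.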

ZeyuChen · 2022-09-15T09:00:35Z

examples/machine_translation/transformer/train.py

@@ -383,6 +388,12 @@ def do_train(args):
    args.unk_token = ARGS.unk_token
    args.bos_token = ARGS.bos_token
    args.eos_token = ARGS.eos_token
+    if ARGS.to_static:


Once to_static is a bool, this can simply be
args.to_static = ARGS.to_static

0x45f added 2 commits September 15, 2022 08:36

[TIPC]Support @to_static train for base-transformer

e5771d9

Merge branch 'develop' into tipc-transformer

b24e5c4

ZeyuChen reviewed Sep 15, 2022

Fix to_static args

d465aaa

ZeyuChen approved these changes Sep 17, 2022

ZeyuChen merged commit 9a25764 into PaddlePaddle:develop Sep 17, 2022

0x45f deleted the tipc-transformer branch September 19, 2022 09:27

wawltor mentioned this pull request Jan 12, 2023

PaddleNLP 2.5.0 Release Note Candidate #4439

Closed
