UIE fine-tuning aborts with no obvious error log, Exit code -6 #3869

@shuiiiiiimu

Description

Windows WSL2
paddlenlp 2.4.3
paddlepaddle-gpu 2.4.0rc0

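Extraction task. I prepared 70 training examples annotated with 6 labels. The steps and parameters all follow model_zoo/uie#4-训练定制

  • Multi-GPU

Training on only the first 6 examples (taken with head) runs fine.

Training on all 70 examples aborts immediately with no obvious error in the logs. The log is as follows: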

   .......  other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 Pod failed
INFO 2022-11-23 14:02:02,328 controller.py:109] Pod failed
LAUNCH ERROR 2022-11-23 14:02:02,328 Container failed !!!
   .......  other INFO ......
ERROR 2022-11-23 14:02:02,328 controller.py:110] Container failed !!!
   .......  other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 ------------------------- ERROR LOG DETAIL -------------------------
INFO 2022-11-23 14:02:02,328 controller.py:111] ------------------------- ERROR LOG DETAIL -------------------------
01:49,570] [    INFO] - remove_unused_columns         :True
[2022-11-23 14:01:49,571] [    INFO] - report_to                     :['visualdl']
[2022-11-23 14:01:49,571] [    INFO] - resume_from_checkpoint        :None
[2022-11-23 14:01:49,571] [    INFO] - round_type                    :round
[2022-11-23 14:01:49,571] [    INFO] - run_name                      :./checkpoint/model_best
[2022-11-23 14:01:49,571] [    INFO] - save_on_each_node             :False
[2022-11-23 14:01:49,571] [    INFO] - save_steps                    :100
[2022-11-23 14:01:49,571] [    INFO] - save_strategy                 :IntervalStrategy.STEPS
[2022-11-23 14:01:49,571] [    INFO] - save_total_limit              :1
[2022-11-23 14:01:49,571] [    INFO] - scale_loss                    :32768
[2022-11-23 14:01:49,571] [    INFO] - seed                          :42
[2022-11-23 14:01:49,571] [    INFO] - sharding                      :[]
[2022-11-23 14:01:49,571] [    INFO] - sharding_degree               :-1
[2022-11-23 14:01:49,571] [    INFO] - should_log                    :True
[2022-11-23 14:01:49,571] [    INFO] - should_save                   :True
[2022-11-23 14:01:49,571] [    INFO] - strategy                      :dynabert+ptq
[2022-11-23 14:01:49,571] [    INFO] - train_batch_size              :16
[2022-11-23 14:01:49,572] [    INFO] - use_pact                      :True
[2022-11-23 14:01:49,572] [    INFO] - warmup_ratio                  :0.1
[2022-11-23 14:01:49,572] [    INFO] - warmup_steps                  :0
[2022-11-23 14:01:49,572] [    INFO] - weight_decay                  :0.0
[2022-11-23 14:01:49,572] [    INFO] - weight_quantize_type          :channel_wise_abs_max
[2022-11-23 14:01:49,572] [    INFO] - width_mult_list               :None
[2022-11-23 14:01:49,572] [    INFO] - world_size                    :2
[2022-11-23 14:01:49,572] [    INFO] -
[2022-11-23 14:01:49,598] [    INFO] - ***** Running training *****
[2022-11-23 14:01:49,598] [    INFO] -   Num examples = 330
[2022-11-23 14:01:49,598] [    INFO] -   Num Epochs = 100
[2022-11-23 14:01:49,598] [    INFO] -   Instantaneous batch size per device = 16
[2022-11-23 14:01:49,598] [    INFO] -   Total train batch size (w. parallel, distributed & accumulation) = 32
[2022-11-23 14:01:49,598] [    INFO] -   Gradient Accumulation steps = 1
[2022-11-23 14:01:49,598] [    INFO] -   Total optimization steps = 1100.0
[2022-11-23 14:01:49,598] [    INFO] -   Total num train samples = 33000.0
[2022-11-23 14:01:49,663] [    INFO] -   Number of trainable parameters = 117946370
LAUNCH INFO 2022-11-23 14:02:02,329 Exit code -6
INFO 2022-11-23 14:02:02,329 controller.py:141] Exit code -6
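
For reference, a negative exit code usually follows the Python `subprocess` convention, where `-N` means the child process was terminated by signal `N`. Assuming the launcher reports exit codes that way (an assumption, not checked against its source), the sketch below maps `-6` to a signal name. `describe_exit_code` is a hypothetical helper written for illustration, not part of PaddlePaddle.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Hypothetical helper: turn a subprocess-style return code into text."""
    if code < 0:
        # subprocess convention: -N means "terminated by signal N"
        try:
            return f"killed by {signal.Signals(-code).name}"
        except ValueError:
            return f"killed by unknown signal {-code}"
    return f"exited normally with status {code}"


print(describe_exit_code(-6))
```

On Linux, signal 6 is SIGABRT, which is typically raised by `abort()` in native code. That would be consistent with the Python-level worker logs containing no ERROR line, although it does not by itself identify the cause.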

workerlog.0 and workerlog.1 contain no ERROR entries.

Is there a way to pinpoint the problem in a case like this? Or could someone share some debugging ideas?
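
One possible debugging aid, offered as a sketch rather than a confirmed fix: Python's standard `faulthandler` module can dump the Python traceback when the process receives SIGABRT, which may show which Python call was active when native code aborted. The example below reproduces an abort in a child process (using `os.abort()` as a stand-in for a native crash) and captures what `faulthandler` writes to stderr. The child script and `run_child` are illustrative and are not taken from the UIE training code.

```python
import subprocess
import sys

# Child program: enable faulthandler, then abort the way native code might.
CHILD = """
import faulthandler, os
faulthandler.enable()   # dump the Python traceback on SIGSEGV, SIGABRT, etc.
os.abort()              # stand-in for a crash inside native code
"""


def run_child() -> subprocess.CompletedProcess:
    """Run the child script and capture its stderr as text."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )


result = run_child()
print("return code:", result.returncode)
print(result.stderr)
```

In a real training run, the equivalent would be calling `faulthandler.enable()` at the top of the training script or setting the environment variable `PYTHONFAULTHANDLER=1` before launching. Running the same 70 examples on a single GPU is another way to narrow things down, since it takes multi-card communication out of the picture.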
