Closed
Labels
question: Further information is requested
Description
Windows WSL2
paddlenlp 2.4.3
paddlepaddle-gpu 2.4.0rc0
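Extraction task. I prepared 70 training examples annotated with 6 labels. The steps and parameters all follow model_zoo/uie#4-训练定制
- Multi-GPU
Training on the first 6 examples (taken with head) works fine.
Training on all 70 examples aborts immediately with no obvious error in the logs. The log output is: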
....... other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 Pod failed
INFO 2022-11-23 14:02:02,328 controller.py:109] Pod failed
LAUNCH ERROR 2022-11-23 14:02:02,328 Container failed !!!
....... other INFO ......
ERROR 2022-11-23 14:02:02,328 controller.py:110] Container failed !!!
....... other INFO ......
LAUNCH INFO 2022-11-23 14:02:02,328 ------------------------- ERROR LOG DETAIL -------------------------
INFO 2022-11-23 14:02:02,328 controller.py:111] ------------------------- ERROR LOG DETAIL -------------------------
01:49,570] [ INFO] - remove_unused_columns :True
[2022-11-23 14:01:49,571] [ INFO] - report_to :['visualdl']
[2022-11-23 14:01:49,571] [ INFO] - resume_from_checkpoint :None
[2022-11-23 14:01:49,571] [ INFO] - round_type :round
[2022-11-23 14:01:49,571] [ INFO] - run_name :./checkpoint/model_best
[2022-11-23 14:01:49,571] [ INFO] - save_on_each_node :False
[2022-11-23 14:01:49,571] [ INFO] - save_steps :100
[2022-11-23 14:01:49,571] [ INFO] - save_strategy :IntervalStrategy.STEPS
[2022-11-23 14:01:49,571] [ INFO] - save_total_limit :1
[2022-11-23 14:01:49,571] [ INFO] - scale_loss :32768
[2022-11-23 14:01:49,571] [ INFO] - seed :42
[2022-11-23 14:01:49,571] [ INFO] - sharding :[]
[2022-11-23 14:01:49,571] [ INFO] - sharding_degree :-1
[2022-11-23 14:01:49,571] [ INFO] - should_log :True
[2022-11-23 14:01:49,571] [ INFO] - should_save :True
[2022-11-23 14:01:49,571] [ INFO] - strategy :dynabert+ptq
[2022-11-23 14:01:49,571] [ INFO] - train_batch_size :16
[2022-11-23 14:01:49,572] [ INFO] - use_pact :True
[2022-11-23 14:01:49,572] [ INFO] - warmup_ratio :0.1
[2022-11-23 14:01:49,572] [ INFO] - warmup_steps :0
[2022-11-23 14:01:49,572] [ INFO] - weight_decay :0.0
[2022-11-23 14:01:49,572] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2022-11-23 14:01:49,572] [ INFO] - width_mult_list :None
[2022-11-23 14:01:49,572] [ INFO] - world_size :2
[2022-11-23 14:01:49,572] [ INFO] -
[2022-11-23 14:01:49,598] [ INFO] - ***** Running training *****
[2022-11-23 14:01:49,598] [ INFO] - Num examples = 330
[2022-11-23 14:01:49,598] [ INFO] - Num Epochs = 100
[2022-11-23 14:01:49,598] [ INFO] - Instantaneous batch size per device = 16
[2022-11-23 14:01:49,598] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 32
[2022-11-23 14:01:49,598] [ INFO] - Gradient Accumulation steps = 1
[2022-11-23 14:01:49,598] [ INFO] - Total optimization steps = 1100.0
[2022-11-23 14:01:49,598] [ INFO] - Total num train samples = 33000.0
[2022-11-23 14:01:49,663] [ INFO] - Number of trainable parameters = 117946370
LAUNCH INFO 2022-11-23 14:02:02,329 Exit code -6
INFO 2022-11-23 14:02:02,329 controller.py:141] Exit code -6
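One possible reading of `Exit code -6`: this assumes the launcher reports the worker's exit status using the Python `subprocess` convention, where a negative value `-N` means the process was terminated by signal `N`. That is an assumption about the launcher, not something the log states. Under that assumption, signal 6 on Linux is `SIGABRT`, which typically comes from native code calling `abort()` rather than from a Python exception, and that would fit the absence of a Python ERROR line. A minimal sketch for decoding such a code (`describe_exit_code` is a hypothetical helper name):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a process exit code using the subprocess convention."""
    if code < 0:
        # Negative return codes mean "terminated by signal -code" (POSIX).
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(-6))
```

If that reading is right, Python's standard `faulthandler` module (for example, enabled with `python -X faulthandler`) can dump the Python traceback at the moment the signal arrives, which may show which call was running when the process aborted.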
workerlog.0 and workerlog.1 contain no ERROR entries.
Is there any way to pinpoint the problem in this situation? Or could someone share some debugging approaches?
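One generic starting point, not specific to PaddleNLP: a process that dies abruptly often leaves its last message as a line that is not tagged ERROR (for example a native abort message or a Python traceback). So it can help to look at the tail of each worker log, and at lines matching a few crash-related keywords, instead of filtering on ERROR. A minimal sketch, assuming the worker logs are plain-text files named `workerlog.*` inside a directory called `log`; the directory name, the keyword list, and the helper name `scan_worker_logs` are all assumptions:

```python
import re
from pathlib import Path

# Keywords that often accompany a native crash or a Python failure (assumed list).
CRASH_PATTERN = re.compile(
    r"traceback|abort|core dumped|segmentation|out of memory|fatal|killed",
    re.IGNORECASE,
)


def scan_worker_logs(log_dir: str, tail_lines: int = 20) -> dict:
    """Return, per workerlog file, the suspicious lines and the last few lines."""
    report = {}
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        lines = path.read_text(errors="replace").splitlines()
        report[path.name] = {
            "suspicious": [ln for ln in lines if CRASH_PATTERN.search(ln)],
            "tail": lines[-tail_lines:],
        }
    return report


if __name__ == "__main__":
    for name, info in scan_worker_logs("log").items():
        print(f"== {name} ==")
        # Fall back to the tail when no keyword matched.
        for line in info["suspicious"] or info["tail"]:
            print(line)
```

The tail is included because the final lines before the exit are often the most informative, even when none of the keywords match.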
stitchshaw