
RT-DETR model training: training process killed midway due to excessive memory usage #8283

Closed · 1 task done

YYLCyylc opened this issue May 24, 2023 · 3 comments
Labels: question (Further information is requested), status/close

Comments

YYLCyylc commented May 24, 2023

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

The training process keeps getting Killed partway through training; checking the system history (dmesg) shows it was killed for using too much memory.

[05/24 05:58:30] ppdet.engine INFO: Epoch: [8] [1000/2373] learning_rate: 0.000025 loss_class: 1.117397 loss_bbox: 0.252872 loss_giou: 0.447971 loss_class_aux: 3.277040 loss_bbox_aux: 1.026030 loss_giou_aux: 1.620483 loss_class_dn: 0.432132 loss_bbox_dn: 0.546347 loss_giou_dn: 0.661677 loss_class_aux_dn: 0.881084 loss_bbox_aux_dn: 1.264564 loss_giou_aux_dn: 1.543707 loss: 13.045103 eta: 22:27:16 batch_cost: 0.4796 data_cost: 0.2118 ips: 8.3410 images/s
[05/24 06:00:29] ppdet.engine INFO: Epoch: [8] [1200/2373] learning_rate: 0.000025 loss_class: 1.112336 loss_bbox: 0.260119 loss_giou: 0.458377 loss_class_aux: 3.301874 loss_bbox_aux: 1.002547 loss_giou_aux: 1.663409 loss_class_dn: 0.433255 loss_bbox_dn: 0.515790 loss_giou_dn: 0.658224 loss_class_aux_dn: 0.877041 loss_bbox_aux_dn: 1.190781 loss_giou_aux_dn: 1.539540 loss: 13.123555 eta: 22:26:01 batch_cost: 0.5571 data_cost: 0.3091 ips: 7.1795 images/s
Killed

dmesg | tail -10
[7061009.960277] [1406057] 0 1406057 10131 2570 122880 0 999 sh
[7061009.960279] [1406395] 0 1406395 14294 2592 147456 0 999 top
[7061009.960280] [1407839] 0 1407839 442590971 4671736 41353216 0 999 python
[7061009.960282] [1408100] 0 1408100 10515 2455 110592 0 999 orion_client_ex
[7061009.960284] [1410164] 0 1410164 10131 2580 126976 0 999 sh
[7061009.960285] [1410507] 0 1410507 10131 1663 110592 0 999 sh
[7061009.960287] [1410508] 0 1410508 307984 9444 290816 0 999 orion-nv-smi-na
[7061009.960289] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice/docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,task=python,pid=1407839,uid=0
[7061009.960363] Memory cgroup out of memory: Killed process 1407839 (python) total-vm:1770363884kB, anon-rss:16492188kB, file-rss:1995400kB, shmem-rss:199356kB, UID:0 pgtables:40384kB oom_score_adj:999
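For context on the log above: constraint=CONSTRAINT_MEMCG indicates the OOM kill was triggered by the container's memory cgroup limit (the Kubernetes pod / Docker limit), not by the host exhausting physical RAM, and the python process's anon-rss was roughly 16 GB when it was killed. A minimal way to check the limit and current usage from inside the container (assuming cgroup v1, which the memory.limit_in_bytes path below belongs to; under cgroup v2 the files are memory.max and memory.current):

# cgroup v1 (assumption): memory limit and current usage for this container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes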

YYLCyylc added the question (Further information is requested) label on May 24, 2023
YYLCyylc (Author) commented:

To add: virtual memory usage is already very high during training.

[attached screenshot showing memory usage]

MINGtoMING commented:

Choose a batch_size appropriate for your memory capacity, and enable AMP mode by appending --amp to the training command.
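As a minimal sketch of what that can look like with PaddleDetection (the config path and batch_size value here are illustrative assumptions; adjust them to your dataset and memory budget):

# In the reader config referenced by your RT-DETR config, lower TrainReader batch_size, e.g.:
#   TrainReader:
#     batch_size: 2    # illustrative value; pick what fits your memory
# Then launch training with automatic mixed precision enabled:
python tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml --amp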

YYLCyylc (Author) commented:

It works now after increasing the available memory.
