
RT-DETR model training: training process killed midway due to excessive memory usage #8283

Closed · 1 task done

YYLCyylc opened this issue May 24, 2023 · 3 comments
Labels: question (Further information is requested), status/close

Comments

YYLCyylc commented May 24, 2023

Search before asking

  • I have searched the question and found no related answer.

Please ask your question

The training process keeps getting Killed partway through training; checking the system history (dmesg) shows it was killed for using too much memory.

[05/24 05:58:30] ppdet.engine INFO: Epoch: [8] [1000/2373] learning_rate: 0.000025 loss_class: 1.117397 loss_bbox: 0.252872 loss_giou: 0.447971 loss_class_aux: 3.277040 loss_bbox_aux: 1.026030 loss_giou_aux: 1.620483 loss_class_dn: 0.432132 loss_bbox_dn: 0.546347 loss_giou_dn: 0.661677 loss_class_aux_dn: 0.881084 loss_bbox_aux_dn: 1.264564 loss_giou_aux_dn: 1.543707 loss: 13.045103 eta: 22:27:16 batch_cost: 0.4796 data_cost: 0.2118 ips: 8.3410 images/s
[05/24 06:00:29] ppdet.engine INFO: Epoch: [8] [1200/2373] learning_rate: 0.000025 loss_class: 1.112336 loss_bbox: 0.260119 loss_giou: 0.458377 loss_class_aux: 3.301874 loss_bbox_aux: 1.002547 loss_giou_aux: 1.663409 loss_class_dn: 0.433255 loss_bbox_dn: 0.515790 loss_giou_dn: 0.658224 loss_class_aux_dn: 0.877041 loss_bbox_aux_dn: 1.190781 loss_giou_aux_dn: 1.539540 loss: 13.123555 eta: 22:26:01 batch_cost: 0.5571 data_cost: 0.3091 ips: 7.1795 images/s
Killed

dmesg | tail -10
[7061009.960277] [1406057] 0 1406057 10131 2570 122880 0 999 sh
[7061009.960279] [1406395] 0 1406395 14294 2592 147456 0 999 top
[7061009.960280] [1407839] 0 1407839 442590971 4671736 41353216 0 999 python
[7061009.960282] [1408100] 0 1408100 10515 2455 110592 0 999 orion_client_ex
[7061009.960284] [1410164] 0 1410164 10131 2580 126976 0 999 sh
[7061009.960285] [1410507] 0 1410507 10131 1663 110592 0 999 sh
[7061009.960287] [1410508] 0 1410508 307984 9444 290816 0 999 orion-nv-smi-na
[7061009.960289] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice/docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,task=python,pid=1407839,uid=0
[7061009.960363] Memory cgroup out of memory: Killed process 1407839 (python) total-vm:1770363884kB, anon-rss:16492188kB, file-rss:1995400kB, shmem-rss:199356kB, UID:0 pgtables:40384kB oom_score_adj:999
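For context on the log above: constraint=CONSTRAINT_MEMCG indicates the OOM kill was triggered by the container's memory cgroup limit (the Kubernetes pod / Docker limit), not by the host exhausting physical RAM, and the python process's anon-rss was roughly 16 GB when it was killed. A minimal way to check the limit and current usage from inside the container (assuming cgroup v1, which the memory.limit_in_bytes path below belongs to; under cgroup v2 the files are memory.max and memory.current):

# cgroup v1 (assumption): memory limit and current usage for this container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/memory.usage_in_bytes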

YYLCyylc added the question (Further information is requested) label on May 24, 2023
YYLCyylc (Author) commented:

To add: virtual memory usage is already very high during training.

[attached screenshot showing memory usage]

MINGtoMING commented:

Choose a batch_size appropriate for your memory capacity, and enable AMP mode by appending --amp to the training command.
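As a minimal sketch of what that can look like with PaddleDetection (the config path and batch_size value here are illustrative assumptions; adjust them to your dataset and memory budget):

# In the reader config referenced by your RT-DETR config, lower TrainReader batch_size, e.g.:
#   TrainReader:
#     batch_size: 2    # illustrative value; pick what fits your memory
# Then launch training with automatic mixed precision enabled:
python tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml --amp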

YYLCyylc (Author) commented:

It works now after increasing the available memory.
