
Qwen 2.5 VL 72B: single-node A100 RM training runs out of host memory (RAM, not GPU memory). Is memory not being reclaimed somewhere? #3559

Open
aleien95 opened this issue Mar 19, 2025 · 4 comments

Comments

@aleien95

Describe the bug
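I am training a reward model with Qwen2.5-VL 72B on a single node with 8 A100 GPUs. Host memory (RAM) runs out and the process gets killed. Is memory not being reclaimed somewhere? My A100 machine has 2 TB of RAM and it still ran out. Here is my memory monitoring: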

Image

@aleien95
Author

Adding my launch command:

Image

@Jintao-Huang
Collaborator

Does it blow up partway through training?

@aleien95
Author

It blew up on the very first step.

@aleien95
Author

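Problem solved. A single A100 node with 2 TB of RAM cannot run full-parameter fine-tuning of Qwen2.5-VL 72B, even with ZeRO-3 offload. Two A100 nodes work, with peak host memory usage of roughly 600 GB per node. The cause is that DeepSpeed's partitioning depends on the data-parallel (DP) size: the state offloaded from GPU memory to host memory is what overflowed the RAM. Adding machines == increasing the DP size, which reduces the per-node host memory overhead (which is really GPU memory that has been offloaded).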
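
A rough back-of-the-envelope sketch of the arithmetic behind this explanation, not a measurement. It assumes mixed-precision Adam at about 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), that all of that state is offloaded to host RAM, and that ZeRO-3 splits it evenly across nodes. The function name, the 72e9 parameter count, and the 16-byte figure are illustrative assumptions; activations, pinned buffers, and checkpoint loading are not modeled.

```python
def zero3_offload_host_gb(num_params: float, num_nodes: int,
                          bytes_per_param: int = 16) -> float:
    """Estimate host RAM per node (in GB) for fully offloaded ZeRO-3 state.

    Assumption: the model state is partitioned evenly over all DP ranks,
    so each node holds 1/num_nodes of the total.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_nodes / 1e9


if __name__ == "__main__":
    # 72e9 is an assumed round parameter count for a "72B" model.
    for nodes in (1, 2):
        gb = zero3_offload_host_gb(72e9, nodes)
        print(f"{nodes} node(s): ~{gb:.0f} GB of offloaded state per node")
```

Under these assumptions the two-node estimate is about 576 GB per node, which is close to the roughly 600 GB peak reported above. The single-node estimate is about 1152 GB, which is below 2 TB, so this estimate is only a lower bound and does not by itself account for the single-node failure.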
