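Describe the bug

I am training a reward model with qwen2.5 vl 72B on a single machine with 8 A100 GPUs. Host memory ran out and the process was killed. Is memory not being reclaimed somewhere? My A100 machine has 2T of host memory and it still ran out. Below is my memory monitoring: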
Adding my launch command:
Does it blow up partway through training?
It blew up on the very first step.
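Problem solved. A single A100 machine with 2T of host memory cannot run full-parameter fine-tuning of qwen2.5 vl 72B, even with zero3 offload. Two A100 machines work, with peak host memory of roughly 600G per machine. The cause is that DeepSpeed's partitioning depends on the data-parallel (dp) degree, and it was the state offloaded from GPU memory to host memory that overflowed. Adding machines == increasing the dp degree, which reduces the host memory overhead on each machine (in essence, the GPU-memory state being offloaded).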
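For anyone sizing a similar run, here is a rough back-of-the-envelope sketch of why two nodes helped. It assumes the commonly cited mixed-precision Adam footprint of about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance), a nominal 72B parameter count, and that all of this state is offloaded to host memory and sharded evenly across nodes. The function name is hypothetical and not part of DeepSpeed. The result is only a lower bound: it ignores pinned buffers, activations, dataloader workers, and any extra copies made while loading the checkpoint, which are a plausible reason the single node went past 2T.

```python
# Rough lower bound on host memory per node for ZeRO-3 with full offload.
# All constants below are assumptions for illustration, not measured values.

BYTES_PER_PARAM = 16           # assumed: fp16 weights + fp16 grads + fp32 Adam states
NOMINAL_PARAMS = 72 * 10**9    # assumed: nominal "72B" parameter count


def offloaded_state_gb_per_node(num_params: int, num_nodes: int,
                                bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Model and optimizer state per node in GB (1 GB = 1e9 bytes),
    assuming the state is sharded evenly across nodes."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_nodes / 1e9


if __name__ == "__main__":
    for nodes in (1, 2):
        gb = offloaded_state_gb_per_node(NOMINAL_PARAMS, nodes)
        print(f"{nodes} node(s): about {gb:.0f} GB of offloaded state per node")
```

Under these assumptions the two-node estimate is 576 GB per node, which is in the same range as the roughly 600G peak I observed.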