Why does the last GPU frequently OOM when training pro on 8 cards?
hi~ My guess is that the 8th card was also configured to hold the LLM? The default setup places the LLM on the first 7 cards, and the 8th card holds only the reward model used for validation during training.
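The default placement the maintainer describes can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function and variable names are hypothetical, and the real repo may express the same split via a device map or launcher arguments.

```python
# Hypothetical sketch of the described default layout: the policy LLM is
# spread over the first 7 GPUs, while the reward model used for in-training
# validation lives alone on the 8th card. Names here are illustrative.

def build_device_assignment(num_gpus: int = 8):
    llm_devices = [f"cuda:{i}" for i in range(num_gpus - 1)]  # cards 0..6
    reward_device = f"cuda:{num_gpus - 1}"                    # card 7 only
    return llm_devices, reward_device

llm_devices, reward_device = build_device_assignment(8)
# If the LLM is also mapped onto the last card (e.g. a device map spanning
# all 8 GPUs), it competes with the reward model there and can OOM.
```

The point of the split is that the last card's memory is reserved entirely for the reward model; sharding the LLM across all 8 cards defeats that reservation.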
Hi, I get an OOM when running the training code. I'm using 80 GB A800s to train a 13B model, which in principle shouldn't run out of memory. Even with batch size set to 1 and block_size set to 100, it still OOMs. I can't figure out where the problem is.
You could try disabling do_validation and using bf16 with ZeRO-3. Note that using ZeRO-3 directly may result in only part of the checkpoint being saved, as described in #66. Also, data_manager.py sets self.max_length - 128 in several places to bound the prompt length (128 is the default response length, which we did not expose in args), so if block_size is changed to 100, that 128 needs to be adjusted accordingly.
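The block_size / response-length interaction above can be made concrete with a small sketch. The function name is hypothetical; the only facts taken from the thread are that data_manager.py computes the prompt budget as self.max_length - 128 and that 128 is the hard-coded default response length.

```python
# Sketch of the prompt-length budget described above. In data_manager.py
# the prompt is bounded by self.max_length - 128, where 128 is the
# hard-coded default response length (not exposed in args).

DEFAULT_RESPONSE_LEN = 128  # hard-coded in data_manager.py

def prompt_budget(block_size: int, response_len: int = DEFAULT_RESPONSE_LEN) -> int:
    """Tokens left for the prompt after reserving room for the response."""
    return block_size - response_len

print(prompt_budget(100))  # -28: block_size=100 leaves no room at all
```

With block_size=100 the budget goes negative (100 - 128 = -28), which is why the reserved response length must be lowered alongside block_size.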