
pro #69 (Open)

liumingzhu6060 opened this issue Aug 23, 2023 · 3 comments

liumingzhu6060 commented Aug 23, 2023

Why does the last GPU often run out of memory (OOM) when training PRO with 8 GPUs?

huybery added the pro label Aug 25, 2023
F2-Song commented Mar 16, 2024

Hi~ my guess is that your setup also places the LLM on the 8th GPU? By default, the first 7 GPUs hold the LLM, and the 8th GPU holds only the reward model, which is used for validation during training.
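For illustration, a minimal sketch of that layout (not the repo's actual loading code; the model paths and the per-GPU memory cap are placeholders), assuming Hugging Face transformers + accelerate style loading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

num_gpus = torch.cuda.device_count()              # e.g. 8

# Shard the policy LLM across GPUs 0..6 and keep the last GPU free of it.
max_memory = {i: "78GiB" for i in range(num_gpus - 1)}
max_memory[num_gpus - 1] = "0GiB"                 # none of the LLM on the last GPU
policy = AutoModelForCausalLM.from_pretrained(
    "path/to/13b-model",                          # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
)

# The reward model used only for validation lives alone on the last GPU.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/reward-model"                        # placeholder checkpoint
).to(f"cuda:{num_gpus - 1}")
```

If the LLM is also mapped onto the last GPU, it competes with the reward model for memory there, which would explain an OOM that always hits that one card.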

Zheng-Jay commented

> Hi~ my guess is that your setup also places the LLM on the 8th GPU? By default, the first 7 GPUs hold the LLM, and the 8th GPU holds only the reward model, which is used for validation during training.

Hello, I get OOM when running the training code. I'm using 80GB A800s to train a 13B model, so in principle it shouldn't run out of memory.
I set the batch size to 1 and block_size to 100, and it still OOMs. Where could the problem be?

F2-Song commented Mar 20, 2024

> > Hi~ my guess is that your setup also places the LLM on the 8th GPU? By default, the first 7 GPUs hold the LLM, and the 8th GPU holds only the reward model, which is used for validation during training.
>
> Hello, I get OOM when running the training code. I'm using 80GB A800s to train a 13B model, so in principle it shouldn't run out of memory. I set the batch size to 1 and block_size to 100, and it still OOMs. Where could the problem be?

You could try turning off do_validation and using bf16 with ZeRO-3. Note that using ZeRO-3 directly may result in only part of the checkpoint being saved, as described in #66.
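For reference, a minimal sketch of that setup as a DeepSpeed config (these are standard DeepSpeed keys, not copied from this repo's configs; the accumulation steps value is a placeholder):

```python
# Pass this dict, or an equivalent JSON file, to deepspeed.initialize /
# the training launcher.
ds_config = {
    "bf16": {"enabled": True},                  # bf16 halves weight/activation memory vs fp32
    "zero_optimization": {
        "stage": 3,                             # shard params, grads, and optimizer states
        "overlap_comm": True,
        # Standard DeepSpeed option to gather full weights at save time,
        # the usual mitigation for partial ZeRO-3 checkpoints (cf. #66):
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,           # placeholder; tune as needed
}
```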
Also, data_manager.py sets self.max_length - 128 in several places to bound the length of the prompt itself (128 is the default response length; we did not expose it in args). If you change block_size to 100, the 128 there needs to be adjusted accordingly.
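To illustrate why (a simplified paraphrase of the data_manager.py behavior described above; function and variable names here are assumptions, not the repo's actual code):

```python
DEFAULT_RESPONSE_LEN = 128                      # hard-coded default, not exposed in args

def truncate_prompt(prompt_ids, max_length):
    # The prompt gets whatever budget remains after reserving room
    # for the response.
    prompt_budget = max_length - DEFAULT_RESPONSE_LEN
    assert prompt_budget > 0, "block_size must exceed the response length"
    return prompt_ids[-prompt_budget:]          # keep the tail of the prompt

# With block_size = 100 the budget is 100 - 128 = -28: the 128 must be
# lowered to something below 100, or the slice above silently misbehaves.
```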
