Forced EOS token in vllm generation? #238
Comments
Because the reward model uses the EOS token to predict the reward value.
Ahhhh, got it, that makes sense. I think that's probably broken with local generation then! I just verified that local generation doesn't end with EOS if max_tokens is reached. Also, would taking the RM output on the last non-masked token instead of EOS be a better way around this? Or alternatively, forcing the EOS token only when feeding the experience to the RM?
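To make the first alternative concrete, here is a minimal sketch of reading the reward at the last non-masked token rather than at an EOS position. It is not OpenRLHF code: the function name `last_token_reward` and the plain-list inputs are hypothetical, and a real implementation would operate on tensors of per-token value-head outputs.

```python
from typing import List


def last_token_reward(
    values: List[List[float]], attention_mask: List[List[int]]
) -> List[float]:
    """Return, per sequence, the value at the last position whose mask is 1.

    Hypothetical helper for illustration only. `values` holds one scalar per
    token (e.g. a reward model's value-head output); `attention_mask` marks
    real tokens with 1 and padding with 0.
    """
    rewards = []
    for seq_values, seq_mask in zip(values, attention_mask):
        # Index of the last attended token; this works for left or right
        # padding. Raises ValueError if a sequence is fully masked.
        last = max(i for i, m in enumerate(seq_mask) if m == 1)
        rewards.append(seq_values[last])
    return rewards


if __name__ == "__main__":
    values = [[0.1, 0.4, 0.9, 0.0], [0.2, 0.3, 0.5, 0.7]]
    mask = [[1, 1, 1, 0], [1, 1, 1, 1]]
    print(last_token_reward(values, mask))
```

With this approach a truncated response is scored at whatever token it actually ended on, so nothing in the generated sequence has to be overwritten. Whether the reward model gives meaningful scores at non-EOS positions depends on how it was trained.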
I am not sure which approach to take at the moment, but our current implementation is heavily dependent on EOS tokens.
You mean specifically for the RM? Or more broadly than that?
Oh, for local generation, does OpenRLHF/openrlhf/models/actor.py Line 159 in bed10e1
If so, then doing this in
@hijkzzz Could I ask a quick related question: In
I see in `RemoteExperienceMaker._generate_vllm()`, line 375, that for generations that don't finish, i.e. don't output the EOS token within the max token limit, we manually set the last token to be the EOS token, even though that was not what the model generated.

Isn't this the wrong thing to do? E.g. if the model generated an unfinished sentence like "This is an unfinished" when it ran into the token limit, shouldn't we train on that, rather than "This is an <EOS>"? My understanding of the PPO algorithm is also that it doesn't do well with off-policy experiences, which we technically have if we manually change the last token to EOS. So I just wanted to check if there's a specific reason to do this?
It also looks to me that the Hugging Face `model.generate()` method, and by extension `RemoteExperienceMaker._generate_local()` and `NaiveExperienceMaker`, do not do this.
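For readers following along, the behavior being questioned can be sketched roughly as below. This is an illustration under stated assumptions, not the actual code at line 375: the function name `force_eos_on_truncated`, its parameters, and the token IDs in the example are all hypothetical.

```python
from typing import List


def force_eos_on_truncated(
    token_ids: List[int], eos_token_id: int, max_tokens: int
) -> List[int]:
    """Overwrite the last token with EOS when generation hit the length limit.

    Hypothetical sketch of the behavior described in this issue. A sequence
    counts as truncated here if it reached `max_tokens` without ending in EOS.
    """
    out = list(token_ids)  # copy so the caller's list is not mutated
    if out and len(out) >= max_tokens and out[-1] != eos_token_id:
        # The model's real last token is discarded at this point, so the
        # stored sequence differs from what the policy actually sampled.
        out[-1] = eos_token_id
    return out


if __name__ == "__main__":
    EOS = 2  # assumed EOS id, for illustration only
    print(force_eos_on_truncated([5, 6, 7, 8], EOS, max_tokens=4))  # truncated
    print(force_eos_on_truncated([5, 6, EOS], EOS, max_tokens=4))   # finished
```

The overwritten final token is the off-policy concern raised above: the sequence used for training contains a token the policy did not sample at that position.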