
Fix bugs related to loss mask, meta info, and response length #21

Merged
PeterGriffinJin merged 1 commit into PeterGriffinJin:main from xiaobo-yang:yxb/fix-info-mask-bugs
Mar 19, 2025
Conversation

Contributor

@xiaobo-yang commented Mar 14, 2025

  1. Construct the loss mask immediately after the observation is obtained, so that converting the transformed text back into tokens cannot leave the mask misaligned with the encoding (see the first sketch after this list).
  2. Exclude information tokens from the critic/KL computation, since they are not samples generated by the policy; including them can cause a severe negative KL explosion.
  3. Propagate the meta info so that the test batch can apply do_sample (see the second sketch below).
  4. Remove information tokens from the response-length recording.
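
A minimal sketch of fixes 1 and 2, assuming a Hugging Face-style tokenizer and per-token log-probs; the helper names (`append_observation`, `masked_kl`, `loss_mask`) are illustrative and not taken from this repository:

```python
# Illustrative sketch only; names and tensor shapes are assumptions, not the PR's code.
import torch

def append_observation(response_ids, loss_mask, obs_text, tokenizer):
    """Append retrieved-information tokens and extend the loss mask immediately,
    so the mask stays aligned with the exact token ids (fix 1)."""
    obs_ids = tokenizer(obs_text, add_special_tokens=False,
                        return_tensors="pt").input_ids[0]
    response_ids = torch.cat([response_ids, obs_ids])
    # Observation tokens are not generated by the policy, so they get mask = 0.
    loss_mask = torch.cat([loss_mask,
                           torch.zeros_like(obs_ids, dtype=loss_mask.dtype)])
    return response_ids, loss_mask

def masked_kl(log_probs, ref_log_probs, loss_mask):
    """Average the per-token KL penalty over policy-generated tokens only (fix 2).
    Including information tokens can drive the estimate strongly negative."""
    per_token_kl = log_probs - ref_log_probs      # simple per-token KL estimate
    per_token_kl = per_token_kl * loss_mask       # zero out information tokens
    return per_token_kl.sum() / loss_mask.sum().clamp(min=1)
```

The same mask can also be applied when computing critic values and advantages, so information tokens contribute no gradient to the policy update.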

After fixing these bugs, the RL training remains stable for a much longer duration:
[Image: training curve after the fix]
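
A minimal sketch of fix 3, assuming a dict-style `meta_info` on the batch and a Hugging Face-style `generate` call; the field and attribute names are assumptions:

```python
# Illustrative sketch only; the meta_info key and batch object are assumptions.
test_batch.meta_info["do_sample"] = True   # carry the sampling flag with the batch

outputs = model.generate(
    input_ids=test_batch.input_ids,
    attention_mask=test_batch.attention_mask,
    do_sample=test_batch.meta_info.get("do_sample", False),  # honored at generation time
    max_new_tokens=512,
)
```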

@PeterGriffinJin PeterGriffinJin merged commit 50cedb2 into PeterGriffinJin:main Mar 19, 2025