
Fix bugs related to loss mask, meta info, and response length #21

Merged
PeterGriffinJin merged 1 commit into PeterGriffinJin:main from xiaobo-yang:yxb/fix-info-mask-bugs
Mar 19, 2025
Conversation

Contributor

@xiaobo-yang commented Mar 14, 2025

  1. Construct the loss mask immediately after the observation is obtained, so that converting the transformed text back into tokens cannot leave the mask misaligned with the encoding (see the first sketch after this list).
  2. Exclude information tokens from the critic/KL computation, since they are not samples generated by the policy; including them can cause a severe negative KL explosion.
  3. Propagate the meta info so that the test batch can apply do_sample (see the second sketch below).
  4. Remove information tokens from the response-length recording.
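
A minimal sketch of fixes 1 and 2, assuming a Hugging Face-style tokenizer and per-token log-probs; the helper names (`append_observation`, `masked_kl`, `loss_mask`) are illustrative and not taken from this repository:

```python
# Illustrative sketch only; names and tensor shapes are assumptions, not the PR's code.
import torch

def append_observation(response_ids, loss_mask, obs_text, tokenizer):
    """Append retrieved-information tokens and extend the loss mask immediately,
    so the mask stays aligned with the exact token ids (fix 1)."""
    obs_ids = tokenizer(obs_text, add_special_tokens=False,
                        return_tensors="pt").input_ids[0]
    response_ids = torch.cat([response_ids, obs_ids])
    # Observation tokens are not generated by the policy, so they get mask = 0.
    loss_mask = torch.cat([loss_mask,
                           torch.zeros_like(obs_ids, dtype=loss_mask.dtype)])
    return response_ids, loss_mask

def masked_kl(log_probs, ref_log_probs, loss_mask):
    """Average the per-token KL penalty over policy-generated tokens only (fix 2).
    Including information tokens can drive the estimate strongly negative."""
    per_token_kl = log_probs - ref_log_probs      # simple per-token KL estimate
    per_token_kl = per_token_kl * loss_mask       # zero out information tokens
    return per_token_kl.sum() / loss_mask.sum().clamp(min=1)
```

The same mask can also be applied when computing critic values and advantages, so information tokens contribute no gradient to the policy update.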

After fixing these bugs, the RL training remains stable for a much longer duration:
[Image: training curve after the fix]
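
A minimal sketch of fix 3, assuming a dict-style `meta_info` on the batch and a Hugging Face-style `generate` call; the field and attribute names are assumptions:

```python
# Illustrative sketch only; the meta_info key and batch object are assumptions.
test_batch.meta_info["do_sample"] = True   # carry the sampling flag with the batch

outputs = model.generate(
    input_ids=test_batch.input_ids,
    attention_mask=test_batch.attention_mask,
    do_sample=test_batch.meta_info.get("do_sample", False),  # honored at generation time
    max_new_tokens=512,
)
```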

@PeterGriffinJin PeterGriffinJin merged commit 50cedb2 into PeterGriffinJin:main Mar 19, 2025