Add MaxRL mean normalization over advantages #1126

SumanthRH merged 13 commits into NovaSky-AI:main from
Conversation
```python
raise ValueError(f"no score in prompt index: {idx}")
for i in range(bsz):
    if len(id2score[index[i]]) > 1:
        scores[i] = (scores[i] - id2mean[index[i]]) / (id2mean[index[i]] + epsilon)
```
🔴 MaxRL advantage sign is flipped when the group mean reward is negative

In `compute_maxrl_advantage`, the normalization divides by `(id2mean[index[i]] + epsilon)` at line 1220. When the group's mean reward is negative, this denominator is negative, which flips the sign of the advantage.

For example, with group scores [-10, -5] (mean = -7.5): the worse response (-10) gets advantage (-10 - (-7.5)) / (-7.5 + 1e-6) ≈ +0.333 (positive!), and the better response (-5) gets advantage (-5 - (-7.5)) / (-7.5 + 1e-6) ≈ -0.333 (negative!). This causes the RL algorithm to reinforce bad responses and penalize good ones whenever the group mean is negative.

The test only uses positive rewards, so it doesn't catch this. The fix should use `abs(id2mean)` in the denominator.
This formulation is as per the original MaxRL paper.
Should I make the denominator absolute then? I don't think people use negative rewards anyway nowadays.
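The sign flip described above can be reproduced with a minimal scalar sketch. The function name, the `use_abs_mean` flag, and the `EPS` value here are illustrative assumptions, not the repo's actual `compute_maxrl_advantage` API:

```python
# Illustrative sketch of MaxRL-style mean normalization; names and the
# flag are assumptions, not the repo's actual advantage-estimator API.
EPS = 1e-6

def maxrl_advantage(scores, use_abs_mean):
    mean = sum(scores) / len(scores)
    # abs() keeps the denominator positive, so the advantage sign
    # always matches the sign of (score - mean).
    denom = (abs(mean) if use_abs_mean else mean) + EPS
    return [(s - mean) / denom for s in scores]

# Negative-mean group: the raw-mean denominator flips the advantage signs.
flipped = maxrl_advantage([-10.0, -5.0], use_abs_mean=False)  # worse response ends up positive
fixed = maxrl_advantage([-10.0, -5.0], use_abs_mean=True)     # signs are correct again
```

With positive rewards both variants agree, which is why a test using only positive scores cannot distinguish them.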
Hi @tamoghnokandar, thank you for the PR! We recently finished the migration from

Is this PR ready for review? Could you merge the latest changes from `main`?
Hi @SumanthRH, I merged the latest changes from main. Ready for review now.
why was this file modified? can you revert these changes?
I think it is already restored? I have made it the same as the current main branch.
Yeah no issues, I reverted the change in this commit: 328943a
```python
    index=index,
)

expected = torch.tensor([1.5 / 4.5, -1.5 / 4.5, -1.5 / 10.5, 1.5 / 10.5]).unsqueeze(-1) * response_mask
```
let's make sure expected includes 1e-6 in the denominator
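For concreteness, folding the epsilon into the expected values would look like this. This is a plain-Python sketch (the real test uses torch), and the `EPS` value of 1e-6 is assumed to match the epsilon in the advantage computation:

```python
# Expected MaxRL advantages for the test's two groups, with the epsilon
# included in the denominator. Plain Python; the actual test uses torch.
EPS = 1e-6  # assumed to match the epsilon used in compute_maxrl_advantage

# Group means are 4.5 and 10.5; per-sample deviations are +/-1.5.
expected = [
    1.5 / (4.5 + EPS),
    -1.5 / (4.5 + EPS),
    -1.5 / (10.5 + EPS),
    1.5 / (10.5 + EPS),
]
```

Without the epsilon in `expected`, the comparison only passes because the difference is below the default tolerance, which masks whether the implementation actually adds it.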
SumanthRH left a comment
Looking good, left a couple minor comments.
Could you add an example script in examples/train/algorithms/maxrl for training on GSM8K with maxrl adv estimator?
Made-with: Cursor

# Conflicts:
#	tests/backends/skyrl_train/utils/test_ppo_utils.py
#	tests/tx/models/test_models_common.py
Done!
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
Fixes #1030