Remove reward model codepath #374

Merged
SumanthRH merged 5 commits into NovaSky-AI:main from SumanthRH:remove-rm
Oct 2, 2025

@SumanthRH SumanthRH commented Oct 1, 2025

What does this PR do?

Should close #371

We've had an unused codepath for using an outcome reward model in the training loop for a while. This was primarily for RLHF use cases that we don't target, so it can be removed.

TODO:

- [x] Cleanup `custom_rewards` / `orm_rewards` keys in the trainer
- [x] Cleanup `RewardModel` logic
- [x] Cleanup `normalize_reward` config
- [x] E2E test with gsm8k (GRPO and PPO)

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review October 1, 2025 20:58
@SumanthRH SumanthRH requested a review from erictang000 October 1, 2025 20:58

@erictang000 erictang000 left a comment

LGTM. If you haven't already, can you try an e2e gsm8k run with PPO just to make sure, since the critic code path was more tightly coupled with reward?

@SumanthRH SumanthRH merged commit 8805e75 into NovaSky-AI:main Oct 2, 2025
3 checks passed
erictang000 pushed a commit that referenced this pull request Oct 2, 2025
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
Development

Successfully merging this pull request may close these issues.

Cleanup older reward model codepath

2 participants