feat: adding support for Bradley-Terry reward model training by jveronvialard · Pull Request #609 · NVIDIA-NeMo/RL

jveronvialard · 2025-07-03T23:26:55Z

What does this PR do?

Adds support for Bradley-Terry reward model training to NVIDIA-NeMo/RL.

Usage

The command to launch a Bradley-Terry reward model training job is as follows:

uv run examples/run_rm.py --config <PATH TO YAML CONFIG> <OVERRIDES>

An example config can be found at examples/configs/rm.yaml. Please refer to docs/guides/rm.md for more information.
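For readers unfamiliar with the objective: a Bradley-Terry reward model is trained on (chosen, rejected) response pairs by minimizing -log(sigmoid(r_chosen - r_rejected)), averaged over pairs. The snippet below is a minimal pure-Python sketch of that objective under that assumption. It is not the implementation this PR adds in nemo_rl/algorithms/loss_functions.py, and the function names `bradley_terry_loss` and `preference_accuracy` are hypothetical.

```python
import math


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood that each chosen response beats its rejected pair.

    Hypothetical helper for illustration only; inputs are plain lists of floats.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected rewards must have the same length")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    losses = []
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a form
        # that does not overflow for large positive or negative margins.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def preference_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the strictly higher reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return sum(c > r for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    chosen = [1.5, 0.2, -0.3]
    rejected = [0.5, 0.4, -1.0]
    print("loss:", bradley_terry_loss(chosen, rejected))
    print("accuracy:", preference_accuracy(chosen, rejected))
```

When the two rewards in a pair are equal, the per-pair loss is log(2); it shrinks toward zero as the chosen reward pulls ahead and grows roughly linearly when the ordering is wrong.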

Example convergence plots: (figures not reproduced here)

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau

Thanks! lgtm (I had already reviewed those changes privately). I just realized, though, that we probably need to update README.md, which contains the list of algorithms (and should link to the new rm.md). It would still be good to get this merged asap to unblock people who need this feature.

SahilJain314 · 2025-07-07T21:22:45Z

Thanks for the PR! Would you mind adding convergence plots to your PR description?

examples/run_sft.py

terrykong

Awesome work @jveronvialard !

can you also update index.md to include rm.md so it appears in our docs? @jgerh could you review the docs?
could you add unit tests? Cursor can help with most of the heavy lifting

docs/guides/rm.md

examples/run_sft.py


Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

nemo_rl/algorithms/loss_functions.py

nemo_rl/models/policy/dtensor_policy_worker.py

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau

lgtm, just one small suggestion

examples/configs/rm.yaml


odelalleau

lgtm but I'm probably biased since I contributed a few commits ;)

terrykong · 2025-07-29T16:18:11Z

@jveronvialard can you run the linter?

Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

odelalleau · 2025-07-29T19:17:01Z

> @jveronvialard can you run the linter?

Ah, my bad: my last commit introduced a minor linting issue. I just fixed it.

…NeMo#609)

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Julien Veron Vialard <50602890+jveronvialard@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>

…NeMo#609)

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Julien Veron Vialard <50602890+jveronvialard@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>

terrykong · 2025-10-02T18:49:00Z

@phtran8 can you QA?

github-actions · 2025-10-02T18:49:30Z

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7530c84 (PR #609 from jveronvialard/bt-rm-training)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/af73aa2cebf94a0bee5ea6dda2614ad989faffae/
CURRENT (PR #609 from jveronvialard/bt-rm-training): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
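As background on the message above: a submodule pointer is a fast-forward of main when the commit pinned on main is an ancestor of the commit pinned on the PR branch, and it is "behind" when the relationship is reversed. The sketch below illustrates that classification on a toy commit graph. It is an assumption about what such a check computes, not the workflow's actual code; `is_ancestor`, `submodule_status`, and the commit names are hypothetical. On a real repository, `git merge-base --is-ancestor <ancestor> <descendant>` answers the same ancestry question.

```python
def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links.

    `parents` maps each commit id to a list of its parent commit ids.
    A commit counts as an ancestor of itself.
    """
    stack = [descendant]
    seen = set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return False


def submodule_status(parents, main_commit, pr_commit):
    """Classify the PR's submodule pointer relative to the one on main."""
    if pr_commit == main_commit:
        return "up to date"
    if is_ancestor(parents, main_commit, pr_commit):
        return "fast-forward"  # PR pointer is ahead of main: safe to merge
    if is_ancestor(parents, pr_commit, main_commit):
        return "behind"  # main has moved past the PR pointer
    return "diverged"  # neither commit contains the other


if __name__ == "__main__":
    # Toy linear history a <- b <- c, plus a side branch a <- x.
    history = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
    print(submodule_status(history, main_commit="c", pr_commit="b"))
```

In the toy history, main pins `c` while the PR pins the older `b`, which is the "behind" situation the bot reports for Megatron-LM.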

github-actions · 2025-10-03T18:09:10Z

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7530c84 (PR #609 from jveronvialard/bt-rm-training)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/af73aa2cebf94a0bee5ea6dda2614ad989faffae/
CURRENT (PR #609 from jveronvialard/bt-rm-training): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

adding support for Bradley-Terry reward model training

a38c104

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard requested review from ashors1 and terrykong July 3, 2025 23:27

github-actions bot added the documentation (Improvements or additions to documentation) label Jul 3, 2025

jveronvialard requested review from YianZhang, abukharin-nv, gshennvm and odelalleau July 3, 2025 23:27

jveronvialard added the enhancement (New feature or request), training (Training related), and algorithm labels Jul 3, 2025

jveronvialard changed the title from "adding support for Bradley-Terry reward model training" to "feat: adding support for Bradley-Terry reward model training" Jul 3, 2025

odelalleau previously approved these changes Jul 4, 2025


parthchadha reviewed Jul 7, 2025


examples/run_sft.py (outdated, resolved)

terrykong requested changes Jul 11, 2025


docs/guides/rm.md (outdated, resolved)

examples/run_sft.py (outdated, resolved)

terrykong requested a review from jgerh July 11, 2025 00:46

jveronvialard added 2 commits July 15, 2025 06:33

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/bt-rm-training

ede515b

update docs

5b9e976

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard dismissed odelalleau’s stale review via 5b9e976 July 15, 2025 17:04

jveronvialard added 2 commits July 15, 2025 10:52

add separate run_rm.py and unit tests

68e96ea

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

fix small typos and nit changes

21d67a0

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

jveronvialard requested review from SahilJain314, odelalleau, parthchadha and terrykong July 15, 2025 19:17

parthchadha suggested changes Jul 15, 2025


nemo_rl/algorithms/loss_functions.py (resolved)

nemo_rl/models/policy/dtensor_policy_worker.py (resolved)

rewards tensor shape

8a28af7

Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>

odelalleau reviewed Jul 16, 2025


examples/configs/rm.yaml (outdated, resolved)

Merge branch 'main' of github.com:NVIDIA-NeMo/RL into jveronvialard/bt-rm-training

e914087

jveronvialard requested review from SahilJain314, jgerh, odelalleau, parthchadha and terrykong July 29, 2025 12:54

odelalleau previously approved these changes Jul 29, 2025


terrykong previously approved these changes Jul 29, 2025


terrykong assigned odelalleau and jveronvialard Jul 29, 2025

odelalleau dismissed stale reviews from terrykong and themself via 43f7eae July 29, 2025 19:15

Minor lint fix

7530c84

Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>

odelalleau force-pushed the jveronvialard/bt-rm-training branch from 43f7eae to 7530c84 Compare July 29, 2025 19:15

SahilJain314 approved these changes Jul 29, 2025


terrykong added this pull request to the merge queue Jul 29, 2025

Merged via the queue into main with commit d467852 Jul 29, 2025
15 checks passed

terrykong deleted the jveronvialard/bt-rm-training branch July 29, 2025 23:09

terrykong added the QA:In Progress label Oct 2, 2025

phtran8 added QA:Verified and removed QA:In Progress labels Oct 3, 2025

9 participants
