feat: adding support for Bradley-Terry reward model training #609
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
odelalleau
left a comment
Thanks! lgtm (I had already reviewed those changes privately) -- I just realized though that we probably need an update of README.md which contains the list of algorithms (and should link to the new rm.md). Would still be good to get this merged asap to unblock people who need this feature.
Thanks for the PR! Would you mind adding convergence plots to your PR description?
terrykong
left a comment
Awesome work @jveronvialard!
- Can you also update index.md to include rm.md so it appears in our docs? @jgerh could you review the docs?
- Could you add unit tests? Cursor can help with most of the heavy lifting.
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
odelalleau
left a comment
lgtm, just one small suggestion
odelalleau
left a comment
lgtm but I'm probably biased since I contributed a few commits ;)
@jveronvialard can you run the linter?
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
43f7eae to 7530c84
Ah, my bad: my last commit introduced a minor linting issue. I just fixed it.
…NeMo#609) Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com> Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Signed-off-by: Julien Veron Vialard <50602890+jveronvialard@users.noreply.github.com> Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
…NeMo#609) Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com> Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Signed-off-by: Julien Veron Vialard <50602890+jveronvialard@users.noreply.github.com> Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Signed-off-by: Qidong Su <qidongs@nvidia.com>
@phtran8 can you QA?
❌ Submodule Fast-Forward Check Failed
Check based on commit: 7530c84 (PR #609 from
❌ Submodules that need attention:
Megatron-LM: ❌ PR branch is BEHIND main branch
Please ensure all submodule commits are fast-forwards of the main branch before merging.
What does this PR do?
Adds support for Bradley-Terry reward model training in NVIDIA-NeMo/RL.
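For readers unfamiliar with the objective, here is a minimal sketch of the standard Bradley-Terry pairwise loss, assuming the reward model emits one scalar reward per response. It is not the code from this PR: the function names `bradley_terry_loss` and `pairwise_accuracy` are hypothetical, and it uses plain Python floats rather than the tensors a real training loop would use.

```python
import math


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs.

    Hypothetical helper for illustration only; not taken from NeMo RL.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected rewards must have the same length")
    if not chosen_rewards:
        raise ValueError("at least one preference pair is required")

    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
        # for large positive or negative margins.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)


def pairwise_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response gets the higher reward."""
    pairs = list(zip(chosen_rewards, rejected_rewards))
    return sum(r_chosen > r_rejected for r_chosen, r_rejected in pairs) / len(pairs)


if __name__ == "__main__":
    chosen = [1.5, 0.2, 3.0]
    rejected = [0.5, 0.4, -1.0]
    print("loss:", bradley_terry_loss(chosen, rejected))
    print("accuracy:", pairwise_accuracy(chosen, rejected))
```

The loss shrinks as the reward margin between the chosen and rejected response grows, and equals log 2 when the model cannot tell the two apart.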
Usage
The command to launch a Bradley-Terry reward model training job is as follows:
An example config can be found at examples/configs/rm.yaml. Please refer to docs/guides/rm.md for more information.
Example convergence plots:



Before your PR is "Ready for review"
Pre checks: