feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. by ffrujeri · Pull Request #679 · NVIDIA-NeMo/Gym

ffrujeri · 2026-02-12T00:42:01Z

GenRM Compare Resource Server & Cohort-Based Verify

What does this PR do?

This PR adds a production-ready Resource Server for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that rollout collection and consumer libraries (e.g. nemo-rl) stay generic.

Issues

Related to PR "add genrm rlhf" #523 (reference).
Part of feat: Reward model support #516.

Summary

In RLHF, rewards are relative to other rollouts for the same task (e.g. same prompt), not independent. This PR addresses that by:

Cohort-based verify: The genrm_compare server’s /verify endpoint buffers rollouts by prompt (and optional principle). When num_rollouts_per_prompt rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally “wait” until their cohort is complete via the async verify flow.
No RLHF hacks in Gym or NeMo RL: Rollout collection stays a simple “post each row to agent /run”. The agent calls the resources server’s /verify with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in rollout_collection.py.
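
The cohort-based verify flow described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the class name CohortBuffer, the score_cohort callback, and the length-based stand-in scorer are all hypothetical, and the real server compares responses with a GenRM model rather than by length.

```python
import asyncio
from collections import defaultdict


class CohortBuffer:
    """Hypothetical helper: holds verify calls per prompt key until a cohort is full."""

    def __init__(self, cohort_size, score_cohort):
        self.cohort_size = cohort_size
        # Async callable: list of responses -> list of rewards (same order).
        self.score_cohort = score_cohort
        # prompt_key -> [(response, future), ...]
        self.pending = defaultdict(list)

    async def verify(self, prompt_key, response):
        future = asyncio.get_running_loop().create_future()
        # No await occurs between append and pop, so this step is atomic
        # within a single asyncio event loop.
        self.pending[prompt_key].append((response, future))
        cohort = None
        if len(self.pending[prompt_key]) == self.cohort_size:
            cohort = self.pending.pop(prompt_key)

        if cohort is not None:
            # The caller that completes the cohort scores it for everyone.
            try:
                rewards = await self.score_cohort([r for r, _ in cohort])
            except Exception as exc:
                for _, fut in cohort:
                    fut.set_exception(exc)
            else:
                for (_, fut), reward in zip(cohort, rewards):
                    fut.set_result(reward)

        # Earlier callers wait here until their cohort has been resolved.
        return await future


async def length_scorer(responses):
    # Stand-in for the GenRM comparison: the reward is just the response length.
    return [float(len(r)) for r in responses]


async def demo():
    buffer = CohortBuffer(cohort_size=3, score_cohort=length_scorer)
    return await asyncio.gather(
        buffer.verify("prompt-a", "x"),
        buffer.verify("prompt-a", "xxx"),
        buffer.verify("prompt-a", "xx"),
    )


demo_rewards = asyncio.run(demo())
```

In this sketch the first two callers suspend on their futures, and the third caller triggers scoring and resolves all three, which mirrors the "callers wait until their cohort is complete" behaviour.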

Key features

Cohort-based verify: Configurable num_rollouts_per_prompt; verify buffers by prompt (and principle), runs comparison when cohort is full, distributes rewards.
Batch /compare API: Direct comparison of N response_objs (e.g. for scripts or tests).
Pairwise comparison: Circular and all-pairs strategies; tiebreaker and length-based bonuses; optional principle-based judging.
GenRM model alignment: Config aligned with genrm_model (server name genrm_model; custom roles response_1, response_2, principle).
Clean boundaries: Zero GenRM-specific code in rollout collection or config types; all RLHF logic in genrm_compare.
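
As an illustration of how the two pairing strategies might differ, the sketch below enumerates index pairs for N responses. The function names and the exact pair ordering are assumptions, not taken from the PR; the circular variant is at least consistent with the N=2 sample output under Testing, which contains the pairs (0, 1) and (1, 0).

```python
from itertools import combinations


def circular_pairs(n):
    """Compare each response i with its successor (i + 1) mod n: n comparisons."""
    if n < 2:
        return []
    return [(i, (i + 1) % n) for i in range(n)]


def all_pairs(n):
    """Compare every unordered pair once: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(n), 2))


circular_example = circular_pairs(4)
all_pairs_example = all_pairs(4)
```

Circular pairing grows linearly with the cohort size, while all-pairs grows quadratically, so the choice trades judge calls against coverage.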

Architecture

Rollout collection
    └── For each row: POST to agent /run  (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
    └── /run: generate response → POST to resources server /verify with (params, response, optional principle)

GenRM Compare Resource Server
    ├── /verify (per-rollout)
    │   ├── num_rollouts_per_prompt <= 1 → return default_score
    │   └── num_rollouts_per_prompt > 1:
    │       ├── Buffer by prompt_key (input + principle)
    │       ├── When cohort size == num_rollouts_per_prompt:
    │       │   ├── Run pairwise comparison (GenRM model)
    │       │   ├── Aggregate scores (tiebreaker, length bonuses)
    │       │   └── Resolve all N pending verify callers with their rewards
    │       └── Return this rollout’s reward
    └── /compare (batch)
        └── Compare N response_objs; return rewards + metrics (for scripts/tests)

Config: genrm_compare config includes num_rollouts_per_prompt, genrm_model_server (name genrm_model), and comparison/aggregation options. No comparison_strategy in global config for rollout.
Data: For RLHF, provide num_rollouts_per_prompt rows per prompt (e.g. via num_repeats when loading data).
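
To make the data requirement concrete, here is a small sketch of repeating rows so that each prompt yields a full cohort, together with one possible way to derive a buffering key from the input and the optional principle. The helper names expand_rows and prompt_key, the row layout, and the hashing scheme are hypothetical; the PR only states that buffering is keyed on input plus principle.

```python
import hashlib
import json


def expand_rows(rows, num_repeats):
    """Repeat each row num_repeats times so every prompt produces a full cohort."""
    return [dict(row) for row in rows for _ in range(num_repeats)]


def prompt_key(row):
    """Hypothetical key: stable hash of the input plus the optional principle."""
    payload = {"input": row["input"], "principle": row.get("principle")}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


rows = [
    {"input": [{"role": "user", "content": "What is SKILL?"}]},
    {"input": [{"role": "user", "content": "What is SKILL?"}], "principle": "Be concise."},
]
expanded = expand_rows(rows, num_repeats=4)
keys = [prompt_key(row) for row in expanded]
```

With num_repeats equal to num_rollouts_per_prompt, every repeat of a row maps to the same key, while the same prompt with a different principle lands in a separate cohort.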

Testing

curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .

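GenRM returns a response with reasoning and a final message containing JSON scores. The /compare endpoint parses and aggregates these into rewards, per-pair comparison results, and metrics, e.g.: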

{
  "rewards": [
    1.025,
    4.475
  ],
  "comparison_results": [
    {
      "response_i": 0,
      "response_j": 1,
      "judge_idx": 0,
      "score_1": 1.0,
      "score_2": 5.0,
      "ranking": 6.0
    },
    {
      "response_i": 1,
      "response_j": 0,
      "judge_idx": 0,
      "score_1": 4.0,
      "score_2": 1.0,
      "ranking": 1.0
    }
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
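
The reported metrics can be sanity-checked against the comparison results with a few lines of Python. This sketch assumes that score_1 belongs to response_i and score_2 to response_j, and that std_individual_score is a population standard deviation; neither is stated in the PR, although the numbers work out under those assumptions.

```python
import statistics

# Values copied from the sample /compare output above.
comparison_results = [
    {"response_i": 0, "response_j": 1, "score_1": 1.0, "score_2": 5.0},
    {"response_i": 1, "response_j": 0, "score_1": 4.0, "score_2": 1.0},
]
reported_rewards = [1.025, 4.475]

per_response = {}
all_scores = []
for result in comparison_results:
    # Assumption: score_1 rates response_i and score_2 rates response_j.
    per_response.setdefault(result["response_i"], []).append(result["score_1"])
    per_response.setdefault(result["response_j"], []).append(result["score_2"])
    all_scores += [result["score_1"], result["score_2"]]

mean_individual_score = statistics.fmean(all_scores)
std_individual_score = statistics.pstdev(all_scores)  # population std, about 1.785
mean_per_response = [statistics.fmean(per_response[i]) for i in sorted(per_response)]
# Difference between the reported reward and the plain per-response mean.
adjustments = [r - m for r, m in zip(reported_rewards, mean_per_response)]
```

Under these assumptions the per-response means are 1.0 and 4.5, so the reported rewards differ from them by about +0.025 and -0.025. One plausible reading is that this residual comes from the length-based bonus listed under Key features, but the PR does not confirm it.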

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).

copy-pr-bot · 2026-02-12T00:42:05Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

ffrujeri changed the title ~~feat: Adds GenRM pairwise comparison resource servert to support RLHF training workflows.~~ feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. Feb 13, 2026

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 43da6c4 to 85f39dd Compare February 14, 2026 23:48

ffrujeri force-pushed the ffrujeri/genrm-model branch from c22967f to dd172a0 Compare February 18, 2026 02:33

bxyu-nvidia linked an issue Feb 18, 2026 that may be closed by this pull request

feat: Reward model support #516

Closed

ffrujeri force-pushed the ffrujeri/genrm-model branch from dd172a0 to 7458914 Compare February 18, 2026 16:58

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 1d043b6 to 3000055 Compare February 18, 2026 16:59

ffrujeri marked this pull request as ready for review February 18, 2026 21:43

ffrujeri requested a review from bxyu-nvidia February 28, 2026 00:44

ffrujeri force-pushed the ffrujeri/genrm-model branch from 7458914 to 588a33b Compare March 3, 2026 00:08

ffrujeri requested a review from a team as a code owner March 3, 2026 00:08

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from ec2a508 to 4e28ecf Compare March 3, 2026 00:19

bxyu-nvidia approved these changes Mar 3, 2026

ffrujeri force-pushed the ffrujeri/genrm-model branch from 0f3d69b to c0b5d7c Compare March 3, 2026 07:05

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch 2 times, most recently from a3040bc to 1a33841 Compare March 3, 2026 18:35

ffrujeri force-pushed the ffrujeri/genrm-model branch from 4fc3148 to 6138812 Compare March 3, 2026 19:53

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 6caa213 to 84fb3c4 Compare March 3, 2026 22:54

ffrujeri force-pushed the ffrujeri/genrm-model branch from 90f96d3 to ca71797 Compare March 3, 2026 22:55

ffrujeri added 4 commits March 3, 2026 23:21

Add genrm_compare resource server.

b48e9d5

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Update roles strings for the _run_single_comparison.

78eb9e9

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Modify message passing to be done through metadata field.

1eb345f

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Fix unit test.

e491b0f

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 84fb3c4 to e491b0f Compare March 3, 2026 23:21

ffrujeri merged commit ae89bf7 into ffrujeri/genrm-model Mar 3, 2026
5 checks passed

ffrujeri deleted the ffrujeri/genrm-compare-rs branch March 3, 2026 23:52
