
feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows. #679

Merged
ffrujeri merged 4 commits into ffrujeri/genrm-model from ffrujeri/genrm-compare-rs
Mar 3, 2026

@ffrujeri ffrujeri commented Feb 12, 2026

GenRM Compare Resource Server & Cohort-Based Verify

What does this PR do?

This PR adds a production-ready Resource Server for comparing multiple candidate responses using GenRM models, and moves RLHF-specific reward logic (cohort buffering and comparison) into that server so that rollout collection and consumer libraries (e.g. nemo-rl) stay generic.

Issues

Summary

In RLHF, rewards are relative to other rollouts for the same task (e.g. same prompt), not independent. This PR addresses that by:

  • Cohort-based verify: The genrm_compare server’s /verify endpoint buffers rollouts by prompt (and optional principle). When num_rollouts_per_prompt rollouts have been received for a prompt, it runs pairwise comparison, aggregates scores, and returns the appropriate reward to each of the N callers. Callers naturally “wait” until their cohort is complete via the async verify flow.
  • No RLHF hacks in Gym or NeMo RL: Rollout collection stays a simple “post each row to agent /run”. The agent calls the resources server’s /verify with the response; genrm_compare owns all buffering and comparison. No comparison strategy or prompt buffering in rollout_collection.py.
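
The cohort-based verify flow described above can be sketched with plain asyncio. This is a minimal illustration, not the server's actual code: the class name `CohortBuffer`, the `compare_fn` callback, and the length-based stand-in judge are all assumptions made for the example. Each caller parks on a future, and the call that completes the cohort runs the comparison and resolves every waiting caller.

```python
import asyncio
from collections import defaultdict


class CohortBuffer:
    """Hypothetical sketch: hold verify calls until a cohort of N is complete."""

    def __init__(self, cohort_size, compare_fn):
        self.cohort_size = cohort_size
        self.compare_fn = compare_fn  # async: list of responses -> list of rewards
        self.pending = defaultdict(list)  # prompt_key -> [(response, future)]
        self.lock = asyncio.Lock()

    async def verify(self, prompt_key, response):
        future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.pending[prompt_key].append((response, future))
            cohort = None
            if len(self.pending[prompt_key]) >= self.cohort_size:
                # This call completes the cohort; take ownership of it.
                cohort = self.pending.pop(prompt_key)
        if cohort is not None:
            try:
                rewards = await self.compare_fn([r for r, _ in cohort])
            except Exception as exc:
                # Propagate the failure to every waiting caller.
                for _, fut in cohort:
                    fut.set_exception(exc)
            else:
                for (_, fut), reward in zip(cohort, rewards):
                    fut.set_result(reward)
        # Every caller, including the one that ran the comparison, waits here.
        return await future


async def demo():
    async def longer_is_better(responses):
        # Stand-in for the GenRM judge: the reward is the response length.
        return [float(len(r)) for r in responses]

    buf = CohortBuffer(cohort_size=2, compare_fn=longer_is_better)
    return await asyncio.gather(
        buf.verify("prompt-a", "short"),
        buf.verify("prompt-a", "a longer answer"),
    )
```

Running `asyncio.run(demo())` returns one reward per caller, in call order. Doing the comparison outside the lock keeps other prompts from blocking while a judge call is in flight.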

Key features

  • Cohort-based verify: Configurable num_rollouts_per_prompt; verify buffers by prompt (and principle), runs comparison when cohort is full, distributes rewards.
  • Batch /compare API: Direct comparison of N response_objs (e.g. for scripts or tests).
  • Pairwise comparison: Circular and all-pairs strategies; tiebreaker and length-based bonuses; optional principle-based judging.
  • GenRM model alignment: Config aligned with genrm_model (server name genrm_model; custom roles response_1, response_2, principle).
  • Clean boundaries: Zero GenRM-specific code in rollout collection or config types; all RLHF logic in genrm_compare.
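
The PR names the circular and all-pairs strategies without defining them. The sketch below shows one common reading, as an assumption: circular compares each response with its successor (wrapping around), and all-pairs compares every unordered pair once. The function names are hypothetical.

```python
from itertools import combinations


def circular_pairs(n):
    """Pair response i with response (i + 1) mod n: n comparisons in total."""
    if n < 2:
        return []
    return [(i, (i + 1) % n) for i in range(n)]


def all_pairs(n):
    """Pair every unordered combination once: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(n), 2))
```

Under this reading, `circular_pairs(2)` yields `(0, 1)` and `(1, 0)`, which is consistent with the two entries in the sample `comparison_results` in the Testing section. Circular grows linearly with the cohort size, while all-pairs grows quadratically.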

Architecture

Rollout collection
    └── For each row: POST to agent /run  (unchanged; no strategy or buffering)

Agent (e.g. simple_agent)
    └── /run: generate response → POST to resources server /verify with (params, response, optional principle)

GenRM Compare Resource Server
    ├── /verify (per-rollout)
    │   ├── num_rollouts_per_prompt <= 1 → return default_score
    │   └── num_rollouts_per_prompt > 1:
    │       ├── Buffer by prompt_key (input + principle)
    │       ├── When cohort size == num_rollouts_per_prompt:
    │       │   ├── Run pairwise comparison (GenRM model)
    │       │   ├── Aggregate scores (tiebreaker, length bonuses)
    │       │   └── Resolve all N pending verify callers with their rewards
    │       └── Return this rollout’s reward
    └── /compare (batch)
        └── Compare N response_objs; return rewards + metrics (for scripts/tests)
  • Config: genrm_compare config includes num_rollouts_per_prompt, genrm_model_server (name genrm_model), and comparison/aggregation options. No comparison_strategy in global config for rollout.
  • Data: For RLHF, provide num_rollouts_per_prompt rows per prompt (e.g. via num_repeats when loading data).
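
The diagram buffers rollouts by a prompt_key built from the input and the optional principle. A minimal way to derive such a key is sketched here, assuming a stable hash over a canonical JSON encoding; the function name `make_prompt_key` and the hashing scheme are illustrative and may differ from what the server's utils do.

```python
import hashlib
import json


def make_prompt_key(conversation, principle=None):
    """Hypothetical key: SHA-256 over a canonical JSON dump of input + principle."""
    payload = json.dumps(
        {"input": conversation, "principle": principle},
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Rollouts generated from repeated copies of the same row then land in the same cohort, while the same prompt judged under a different principle gets its own cohort.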

Testing

curl -s -X POST http://127.0.0.1:17795/compare \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_history": [{"role": "user", "content": "What is SKILL?"}],
    "response_objs": [
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "SKILL is a verb meaning to kill."}]}]},
      {"output": [{"type": "message", "content": [{"type": "output_text", "text": "Skill refers to the ability to perform a task well."}]}]}
    ]
  }' | jq .
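
For scripted use, the same request can be built from Python with only the standard library. This is a sketch mirroring the curl call above; the helper names are hypothetical, and the URL and port are simply the ones used in that command.

```python
import json
import urllib.request


def as_response_obj(text):
    """Wrap plain text in the response_obj shape used by the curl example."""
    return {
        "output": [
            {"type": "message", "content": [{"type": "output_text", "text": text}]}
        ]
    }


def build_compare_payload(question, answers):
    return {
        "conversation_history": [{"role": "user", "content": question}],
        "response_objs": [as_response_obj(a) for a in answers],
    }


def post_compare(payload, url="http://127.0.0.1:17795/compare"):
    """POST the payload as JSON; requires a running genrm_compare server."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```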

GenRM returns a response with reasoning and a final message containing JSON scores; the /compare endpoint aggregates these into rewards, per-pair comparison results, and metrics, e.g.:

{
  "rewards": [
    1.025,
    4.475
  ],
  "comparison_results": [
    {
      "response_i": 0,
      "response_j": 1,
      "judge_idx": 0,
      "score_1": 1.0,
      "score_2": 5.0,
      "ranking": 6.0
    },
    {
      "response_i": 1,
      "response_j": 0,
      "judge_idx": 0,
      "score_1": 4.0,
      "score_2": 1.0,
      "ranking": 1.0
    }
  ],
  "metrics": {
    "mean_individual_score": 2.75,
    "std_individual_score": 1.7853571071357126,
    "tiebreak_usage_rate": 0.0
  }
}
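
The metrics in this sample can be reproduced from the four individual scores. The check below assumes that `mean_individual_score` is the plain mean and `std_individual_score` the population standard deviation over every `score_1` and `score_2`; the sample values agree with that assumption, but the PR does not spell out the formulas.

```python
from statistics import fmean, pstdev

comparison_results = [
    {"response_i": 0, "response_j": 1, "score_1": 1.0, "score_2": 5.0},
    {"response_i": 1, "response_j": 0, "score_1": 4.0, "score_2": 1.0},
]


def individual_scores(results):
    """Flatten every score_1 / score_2 into one list."""
    scores = []
    for r in results:
        scores += [r["score_1"], r["score_2"]]
    return scores


def mean_score_per_response(results, n):
    """Average the scores each response received across its comparisons."""
    buckets = [[] for _ in range(n)]
    for r in results:
        buckets[r["response_i"]].append(r["score_1"])
        buckets[r["response_j"]].append(r["score_2"])
    return [fmean(b) for b in buckets]


scores = individual_scores(comparison_results)
mean_all = fmean(scores)   # mean over 1.0, 5.0, 4.0, 1.0
std_all = pstdev(scores)   # population standard deviation of the same values
per_response = mean_score_per_response(comparison_results, 2)
```

The per-response means come out as 1.0 and 4.5, while the reported rewards are 1.025 and 4.475. The ±0.025 difference would be consistent with a small adjustment such as the length-based bonus mentioned under key features, though the exact rule is not given here.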

Unit tests cover genrm_compare (verify stub when N≤1, cohort logic, compare), utils (prompt key, parsing, aggregation), and comparison_strategies (batch client and helpers).

copy-pr-bot bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ffrujeri ffrujeri changed the title from "feat: Adds GenRM pairwise comparison resource servert to support RLHF training workflows." to "feat: Adds GenRM pairwise comparison resource server to support RLHF training workflows." Feb 13, 2026
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 43da6c4 to 85f39dd Compare February 14, 2026 23:48
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from c22967f to dd172a0 Compare February 18, 2026 02:33
@bxyu-nvidia bxyu-nvidia linked an issue Feb 18, 2026 that may be closed by this pull request
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from dd172a0 to 7458914 Compare February 18, 2026 16:58
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 1d043b6 to 3000055 Compare February 18, 2026 16:59
@ffrujeri ffrujeri marked this pull request as ready for review February 18, 2026 21:43
@ffrujeri ffrujeri requested a review from bxyu-nvidia February 28, 2026 00:44
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 7458914 to 588a33b Compare March 3, 2026 00:08
@ffrujeri ffrujeri requested a review from a team as a code owner March 3, 2026 00:08
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from ec2a508 to 4e28ecf Compare March 3, 2026 00:19
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 0f3d69b to c0b5d7c Compare March 3, 2026 07:05
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch 2 times, most recently from a3040bc to 1a33841 Compare March 3, 2026 18:35
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 4fc3148 to 6138812 Compare March 3, 2026 19:53
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 6caa213 to 84fb3c4 Compare March 3, 2026 22:54
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-model branch from 90f96d3 to ca71797 Compare March 3, 2026 22:55
ffrujeri added 4 commits March 3, 2026 23:21
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
@ffrujeri ffrujeri force-pushed the ffrujeri/genrm-compare-rs branch from 84fb3c4 to e491b0f Compare March 3, 2026 23:21
@ffrujeri ffrujeri merged commit ae89bf7 into ffrujeri/genrm-model Mar 3, 2026
5 checks passed
@ffrujeri ffrujeri deleted the ffrujeri/genrm-compare-rs branch March 3, 2026 23:52
Development

Successfully merging this pull request may close these issues.

feat: Reward model support

2 participants