fix: GenRM lock in order to properly handle concurrent requests. #1041

Merged: ffrujeri merged 1 commit into main from ffrujeri/fix-genrm-concurrency-lock on Apr 9, 2026

ffrujeri commented Apr 9, 2026

What does this PR do?

Release the GenRM compare cohort lock before running _run_compare, so long-running comparisons no longer block other concurrent verify requests.

Issues

This PR does not close NVIDIA-NeMo/Gym#354. It is related to that work (RLHF / shared reward-model style flows): it hardens the GenRM compare resource server under concurrency, which aligns with the broader RLHF infrastructure direction described there.

Refs https://github.com/NVIDIA-NeMo/Gym/issues/354

Usage

  • No change to configuration or call sites. verify() behavior for a given cohort is unchanged; only the lock scope changes, so other prompts’ cohorts can register and complete while a comparison is in flight.
# Callers still use the GenRM compare resource server as before, e.g. verify()
# with num_rollouts_per_prompt > 1 to buffer rollouts per prompt_key until the
# cohort is full, then aggregate pairwise GenRM scores. Concurrent verifies for
# different prompt_keys (or other cohorts) can now proceed without waiting on
# an unrelated cohort’s model call.

Additional Information

  • Before: The global _cohort_lock was held across await self._run_compare(...), so each in-flight GenRM comparison blocked cohort bookkeeping for every other request and comparisons effectively ran one at a time.
  • After: Under the lock, the code only updates the buffer, snapshots a full cohort into cohort_buf, removes that key from _cohort_buffers, then releases the lock and runs _run_compare outside it, resolving each waiter’s future as before.
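
The lock-scope change described above can be illustrated with a minimal asyncio sketch. This is not the code from the PR: the names _cohort_lock, _cohort_buffers, cohort_buf, _run_compare, verify, and prompt_key come from the description, while the CohortServer class, the placeholder scoring, and the way futures are created and resolved are assumptions made for illustration only.

```python
import asyncio
import time


class CohortServer:
    """Hypothetical stand-in for the GenRM compare resource server."""

    def __init__(self, cohort_size: int, compare_seconds: float) -> None:
        self.cohort_size = cohort_size
        self.compare_seconds = compare_seconds
        self._cohort_lock = asyncio.Lock()
        # prompt_key -> (rollout, future) pairs waiting for a full cohort
        self._cohort_buffers: dict[str, list[tuple[str, asyncio.Future]]] = {}

    async def _run_compare(self, rollouts: list[str]) -> list[float]:
        # Placeholder for the slow model call; the scores are made up.
        await asyncio.sleep(self.compare_seconds)
        return [float(len(r)) for r in rollouts]

    async def verify(self, prompt_key: str, rollout: str) -> float:
        fut = asyncio.get_running_loop().create_future()
        cohort_buf = None
        async with self._cohort_lock:
            # Bookkeeping only: buffer the rollout, snapshot a full cohort.
            buf = self._cohort_buffers.setdefault(prompt_key, [])
            buf.append((rollout, fut))
            if len(buf) >= self.cohort_size:
                cohort_buf = self._cohort_buffers.pop(prompt_key)
        # The lock is released here, before the long-running comparison.
        if cohort_buf is not None:
            try:
                scores = await self._run_compare([r for r, _ in cohort_buf])
            except Exception as exc:
                # Propagate the failure to every waiter in the cohort.
                for _, waiter in cohort_buf:
                    if not waiter.done():
                        waiter.set_exception(exc)
            else:
                for (_, waiter), score in zip(cohort_buf, scores):
                    if not waiter.done():
                        waiter.set_result(score)
        return await fut


async def demo() -> tuple[list[float], float]:
    server = CohortServer(cohort_size=2, compare_seconds=0.2)
    start = time.perf_counter()
    scores = await asyncio.gather(
        server.verify("prompt-a", "x"),
        server.verify("prompt-a", "xx"),
        server.verify("prompt-b", "yyy"),
        server.verify("prompt-b", "y"),
    )
    return list(scores), time.perf_counter() - start


if __name__ == "__main__":
    scores, elapsed = asyncio.run(demo())
    print(scores, f"{elapsed:.2f}s")
```

In this sketch the two cohorts' comparisons overlap, so the total wall time is roughly one compare duration rather than two. Moving the `await self._run_compare(...)` call back inside the `async with` block would reproduce the serialized behavior described under "Before".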

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>
copy-pr-bot Bot commented Apr 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ffrujeri merged commit 711100a into main on Apr 9, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

Docs + Environment pattern: RLHF