fix: GenRM lock in order to properly handle concurrent requests. by ffrujeri · Pull Request #1041 · NVIDIA-NeMo/Gym

ffrujeri · 2026-04-09T18:46:25Z

What does this PR do?

Release the GenRM compare cohort lock before running _run_compare, so long-running comparisons no longer block other concurrent verify requests.

Issues

This PR does not close NVIDIA-NeMo/Gym#354. It is related to that work (RLHF / shared reward-model style flows): it hardens the GenRM compare resource server under concurrency, which aligns with the broader RLHF infrastructure direction described there.

Refs https://github.com/NVIDIA-NeMo/Gym/issues/354

Usage

No change to configuration or call sites. verify() behavior for a given cohort is unchanged; only lock scope changes so other prompts’ cohorts can register and complete while a comparison is in flight.

# Callers still use the GenRM compare resource server as before, e.g. verify()
# with num_rollouts_per_prompt > 1 to buffer rollouts per prompt_key until the
# cohort is full, then aggregate pairwise GenRM scores. Concurrent verifies for
# different prompt_keys (or other cohorts) can now proceed without waiting on
# an unrelated cohort’s model call.
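
To make the effect of the lock scope concrete, here is a toy sketch (not code from this PR) that counts how many slow calls are in flight at once. The names `peak_concurrency`, `slow_model_call`, and `hold_lock_during_call` are hypothetical; `asyncio.sleep` stands in for the GenRM model call.

```python
import asyncio


async def peak_concurrency(hold_lock_during_call: bool, n_requests: int = 4) -> int:
    """Return the highest number of slow calls that overlapped in time."""
    lock = asyncio.Lock()
    in_flight = 0
    peak = 0

    async def slow_model_call() -> None:
        # Stand-in for a long-running comparison (assumption, not the real call).
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1

    async def request() -> None:
        if hold_lock_during_call:
            # Old shape: the lock is held across the await.
            async with lock:
                await slow_model_call()
        else:
            # New shape: the lock covers bookkeeping only.
            async with lock:
                pass
            await slow_model_call()

    await asyncio.gather(*(request() for _ in range(n_requests)))
    return peak
```

With the lock held across the call, the peak stays at one request at a time; with the lock released first, all requests can overlap.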

Additional Information

Before: The global _cohort_lock was held across await self._run_compare(...), so each GenRM comparison blocked all cohort bookkeeping and stalled every other request until it finished.
After: Under the lock, the code only updates the buffer, snapshots a full cohort into cohort_buf, removes that key from _cohort_buffers, then releases the lock and runs _run_compare outside it, resolving each waiter’s future as before.
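
The "after" flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the class name `CohortCompareSketch`, the `verify` signature, the buffer layout, and the length-based scoring inside `_run_compare` are all hypothetical. Only the names `_cohort_lock`, `_cohort_buffers`, `cohort_buf`, `_run_compare`, `prompt_key`, and `num_rollouts_per_prompt` come from the description above.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, "asyncio.Future[float]"]


class CohortCompareSketch:
    """Hypothetical stand-in for the GenRM compare server (illustration only)."""

    def __init__(self, num_rollouts_per_prompt: int) -> None:
        self.num_rollouts_per_prompt = num_rollouts_per_prompt
        self._cohort_lock = asyncio.Lock()
        self._cohort_buffers: Dict[str, List[Entry]] = {}

    async def _run_compare(self, responses: List[str]) -> List[float]:
        # Placeholder for the slow model call; scores by length (assumption).
        await asyncio.sleep(0.01)
        return [float(len(r)) for r in responses]

    async def verify(self, prompt_key: str, response: str) -> float:
        future: "asyncio.Future[float]" = asyncio.get_running_loop().create_future()
        cohort_buf: Optional[List[Entry]] = None
        async with self._cohort_lock:
            # Under the lock: buffer the rollout, snapshot a full cohort, drop the key.
            buf = self._cohort_buffers.setdefault(prompt_key, [])
            buf.append((response, future))
            if len(buf) >= self.num_rollouts_per_prompt:
                cohort_buf = self._cohort_buffers.pop(prompt_key)
        if cohort_buf is not None:
            # Lock released: the comparison no longer blocks other cohorts.
            try:
                scores = await self._run_compare([r for r, _ in cohort_buf])
            except Exception as exc:
                for _, fut in cohort_buf:
                    if not fut.done():
                        fut.set_exception(exc)
            else:
                for (_, fut), score in zip(cohort_buf, scores):
                    if not fut.done():
                        fut.set_result(score)
        return await future


async def demo() -> Dict[str, List[float]]:
    # Created inside the running loop so the lock binds to it on older Pythons.
    server = CohortCompareSketch(num_rollouts_per_prompt=2)
    cohort_a = asyncio.gather(server.verify("a", "x"), server.verify("a", "xyz"))
    cohort_b = asyncio.gather(server.verify("b", "hello"), server.verify("b", "hi"))
    scores_a, scores_b = await asyncio.gather(cohort_a, cohort_b)
    return {"a": scores_a, "b": scores_b}
```

Popping the key while still holding the lock means a later rollout for the same prompt_key starts a fresh cohort rather than joining one whose comparison is already running.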

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

Fix GenRM lock in order to properly handle concurrent requests.

5227309

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

bxyu-nvidia approved these changes Apr 9, 2026

ffrujeri merged commit 711100a into main Apr 9, 2026
5 checks passed

ffrujeri merged 1 commit into main from ffrujeri/fix-genrm-concurrency-lock
