feat: support staleness-window in ReplayBufferNew by yuki-97 · Pull Request #2458 · NVIDIA-NeMo/RL

yuki-97 · 2026-05-11T05:01:03Z

Part of RL-727. Stacks on #2448.

Implements ReplayBufferNew, a temporary replacement for ReplayBuffer until TQReplayBuffer is ready.

Motivation: ReplayBuffer.sample() requires target_weight_version == current_weight_version, which stalls training when the exact-match trajectories haven't arrived yet (buffer starvation). ReplayBufferNew fixes this by allowing slightly older trajectories to be used, with an importance-sampling correction.

Changes:

Add max_staleness config: trajectories with trainer_version - weight_version > max_staleness are evicted at the start of each sample() call.
sample() selects from the staleness window [trainer_version - max_staleness, trainer_version], removing the strict target_weight_version == current_weight_version gate.
Add sample_freshest_first flag (default True): when True, selects the highest-version trajectories first; when False, uses FIFO (insertion order).
target_weight_versions is intentionally unused in ReplayBufferNew: it gates generation on specific trainer steps, which causes generation pauses. It will be removed during cleanup after TQReplayBuffer lands.
Add unit tests covering eviction, staleness-window sampling, freshest-first ordering, and FIFO ordering.
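
For reviewers who want the intended semantics in one place, here is a minimal sketch of the eviction and sampling rules listed above. It is not the code in this PR: the class name StalenessWindowBuffer, the single list of (version, trajectory) pairs, and returning None when too few trajectories are in the window are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StalenessWindowBuffer:
    """Hypothetical stand-in for ReplayBufferNew (illustration only)."""

    max_staleness: int
    sample_freshest_first: bool = True
    # (weight_version, trajectory) pairs kept in insertion order.
    items: list[tuple[int, Any]] = field(default_factory=list)

    def push(self, trajectory: Any, weight_version: int) -> None:
        self.items.append((weight_version, trajectory))

    def sample(self, num_items: int, trainer_version: int) -> list[Any] | None:
        # Evict first: drop entries with trainer_version - weight_version > max_staleness.
        min_valid = trainer_version - self.max_staleness
        self.items = [(v, t) for v, t in self.items if v >= min_valid]

        # Candidates lie in the window [trainer_version - max_staleness, trainer_version].
        candidates = [i for i, (v, _) in enumerate(self.items) if v <= trainer_version]
        if len(candidates) < num_items:
            return None  # assumed behaviour when the window holds too few items

        if self.sample_freshest_first:
            # Highest version first; the sort is stable, so ties keep insertion order.
            candidates.sort(key=lambda i: self.items[i][0], reverse=True)
        # Otherwise candidates are already in FIFO (insertion) order.

        chosen = candidates[:num_items]
        sampled = [self.items[i][1] for i in chosen]
        chosen_set = set(chosen)
        self.items = [item for i, item in enumerate(self.items) if i not in chosen_set]
        return sampled


buf = StalenessWindowBuffer(max_staleness=2)
for name, version in [("a", 1), ("b", 3), ("c", 4), ("d", 5)]:
    buf.push(name, version)
freshest = buf.sample(num_items=2, trainer_version=5)  # "a" is evicted (5 - 1 > 2)
```

With sample_freshest_first=False the same call would return the two oldest trajectories still inside the window instead.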

Signed-off-by: Yuki Huang <yukih@nvidia.com>

copy-pr-bot · 2026-05-11T05:01:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


mehraakash · 2026-05-11T17:11:00Z

 # limitations under the License.

 import threading as _threading
+from collections import Counter


mehraakash · 2026-05-11T18:01:42Z

+                sampled_weights
+            )
+            sampled_items = [self.trajectories[i] for i in selected]
+            for idx in sorted(selected, reverse=True):


Could this be refactored into a helper function?

    def _remove_indices(self, indices: Iterable[int]) -> None:
        for idx in sorted(indices, reverse=True):
            self.trajectory_versions.pop(idx)
            self.target_weight_versions.pop(idx)
            self.trajectories.pop(idx)

Both _evict and sample could then call it with different iterables.

mehraakash · 2026-05-11T18:02:49Z

+        """
+        min_valid = current_weight_version - self.max_staleness
+        stale = [i for i, v in enumerate(self.trajectory_versions) if v < min_valid]
+        for idx in sorted(stale, reverse=True):


See my other comment about moving this removal loop into a helper function.

mehraakash · 2026-05-11T18:05:00Z

+        stale = [i for i, v in enumerate(self.trajectory_versions) if v < min_valid]
+        for idx in sorted(stale, reverse=True):
+            self.trajectory_versions.pop(idx)
+            self.trajectories.pop(idx)


I know we eventually want to get rid of target_weight_versions, but since we inherit from ReplayBufferImpl, that list will still be created. So should we either keep that state aligned or remove it?
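
To make the alignment concern concrete, here is a small standalone sketch. The free function remove_indices and the sample data are hypothetical, not the buffer's actual method. It pops the same positions from every parallel list, highest index first, so that earlier pops do not shift the positions still to be removed.

```python
from typing import Iterable


def remove_indices(indices: Iterable[int], *parallel_lists: list) -> None:
    """Pop the same positions from every list, highest index first."""
    # set() guards against a repeated index popping two different elements.
    for idx in sorted(set(indices), reverse=True):
        for lst in parallel_lists:
            lst.pop(idx)


trajectory_versions = [1, 2, 3, 4]
target_weight_versions = [2, 3, 4, 5]
trajectories = ["a", "b", "c", "d"]

# Remove positions 0 and 2 from all three lists at once.
remove_indices([0, 2], trajectory_versions, target_weight_versions, trajectories)
```

If target_weight_versions were left out of the call, it would keep four entries while the other two lists keep two, which is exactly the misalignment raised above.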

mehraakash · 2026-05-11T18:21:50Z

 @ray.remote  # pragma: no cover
 class ReplayBufferNew(ReplayBufferImpl):
-    pass
+    """Staleness-window replay buffer.


I think we need a follow-up task here before wiring this in:

ReplayBufferNew removes exact target matching in sample() here, but the collector still enforces
target-version reservation and generation-limit pauses through last_target_weight_already_generated.

For end-to-end staleness-window sampling, the collector needs a mode that generates based on
current generation_weight_version and buffer/backpressure capacity, and not future target_weight_version slots.

We'll control generation using SingleController by:

Buffer Capacity
Inflight Semaphore
Refit pause
Any manual Pause
Dataloader availability
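
As a rough illustration of how those five conditions could combine into one gate, here is a sketch. GenerationState, its fields, and blocking_reasons are hypothetical names invented for this example. They are not part of SingleController or the collector.

```python
from dataclasses import dataclass


@dataclass
class GenerationState:
    """Hypothetical snapshot of collector state; all field names are illustrative."""

    buffer_size: int
    buffer_capacity: int
    inflight: int
    max_inflight: int
    refit_in_progress: bool
    manually_paused: bool
    dataloader_has_batch: bool


def blocking_reasons(state: GenerationState) -> list[str]:
    """Return every condition that currently blocks starting a new generation."""
    reasons = []
    if state.buffer_size >= state.buffer_capacity:
        reasons.append("buffer capacity")
    if state.inflight >= state.max_inflight:
        reasons.append("inflight semaphore")
    if state.refit_in_progress:
        reasons.append("refit pause")
    if state.manually_paused:
        reasons.append("manual pause")
    if not state.dataloader_has_batch:
        reasons.append("dataloader availability")
    return reasons


def may_generate(state: GenerationState) -> bool:
    # Generation proceeds only when nothing blocks it; no target_weight_version is consulted.
    return not blocking_reasons(state)


state = GenerationState(
    buffer_size=3, buffer_capacity=8, inflight=2, max_inflight=4,
    refit_in_progress=True, manually_paused=False, dataloader_has_batch=True,
)
reasons = blocking_reasons(state)
```

Returning the list of reasons rather than a bare boolean would also make it easier to log why generation is paused.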

add staleness-window in sample() and add evict()

8b19db5

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 mentioned this pull request May 11, 2026

refactor: refactor async utils #2448

Open

yuki-97 marked this pull request as ready for review May 11, 2026 05:06

yuki-97 requested review from a team as code owners May 11, 2026 05:06

yuki-97 requested review from mehraakash and terrykong May 11, 2026 05:06

add sample_freshest_first

19334ad

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 force-pushed the yukih/staleness-sample branch from 85e179f to 19334ad May 11, 2026 10:04

remove evict, only keep _evict

c68a0f5

Signed-off-by: Yuki Huang <yukih@nvidia.com>

mehraakash reviewed May 11, 2026

yuki-97 wants to merge 3 commits into yukih/refactor-async-utils from yukih/staleness-sample
