feat: data plane transfer queue integration by ZhiyuLi-Nvidia · Pull Request #2439 · NVIDIA-NeMo/RL

ZhiyuLi-Nvidia · 2026-05-07T17:22:38Z

What does this PR do ?

Summary

Introduces a TransferQueue-mediated data-plane for sync GRPO. Bulk per-token tensors live in TQ between rollout and train; the
driver only handles per-sample slices and KVBatchMeta.
Backends supported: simple and mooncake_cpu
Wire format:
- type-driven dispatch — tensors → maybe_pack_jagged (uint8 jagged for variable-length fields)
- non-tensors →ack_object_array (verl-style np.ndarray(dtype=object) pickled into uint8 jagged). Object fields recorded in meta.extra_info[META_OBJECT_FIELDS] for read-side decode. Backends only ever see tensors.
Core principle: 1-hop offload. Data movement stays local to the producer (rollout actor → TQ) and the consumer (worker fetch → train). The driver never holds bulk between rollout finish and train fan-out — only metadata and small per-sample fields cross Ray.

Details in https://github.com/NVIDIA-NeMo/RL/blob/zhiyul/data_plane_plan/nemo_rl/data_plane/README.md

Scope

nemo_rl/data_plane/: codec, adapters (NoOp / TQ over simple + mooncake_cpu), preshard helpers, observability hooks, driver I/O
(read_columns / write_columns).
nemo_rl/algorithms/grpo_sync.py: TQ-only sibling of grpo.grpo_train. Same algorithm/metrics; per-step lifecycle is prepare_step →
rollout 1-hop put → meta-driven logprob/train → kv_clear.
nemo_rl/experience/sync_rollout_actor.py: rollout + flatten + first put encapsulated in one Ray actor. Bulk never visits driver.
nemo_rl/models/policy/tq_policy.py: TQ-mediated Policy subclass; meta-driven workers via worker_mixin._fetch.

Test

https://wandb.ai/nvidia/nemorl-dataplane-zhiyul?nw=nwuserzhiyul

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

copy-pr-bot · 2026-05-07T17:22:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ZhiyuLi-Nvidia · 2026-05-09T01:17:13Z

/ok to test d20a6ed

ZhiyuLi-Nvidia · 2026-05-09T02:03:08Z

/ok to test e7f6a91

ZhiyuLi-Nvidia · 2026-05-09T02:14:33Z

/ok to test f8add06

ZhiyuLi-Nvidia · 2026-05-09T02:21:33Z

/ok to test c7cb642

ZhiyuLi-Nvidia · 2026-05-09T02:28:25Z

/ok to test fa121a5

ZhiyuLi-Nvidia · 2026-05-09T02:42:06Z

/ok to test 8de60a8

ZhiyuLi-Nvidia · 2026-05-09T02:53:23Z

/ok to test aeb273c

ZhiyuLi-Nvidia · 2026-05-09T03:04:56Z

/ok to test 1596562

ZhiyuLi-Nvidia · 2026-05-09T03:23:12Z

/ok to test abada7e

…<3.14) Docker build CI fails with ``uv sync --locked`` because the lockfile is stale and pyproject.toml's ``requires-python = ">=3.13.13"`` lacks the upper bound the resolver needs. The data-plane stack pins ``mooncake-transfer-engine-cuda13`` to a cp313-only wheel (upstream Mooncake doesn't ship cp314+ binaries yet). Without ``<3.14`` in ``requires-python``, uv forks the resolution into hypothetical Python 3.14 / 3.15 splits and fails because no wheel satisfies cp314/cp315 for that package. Pin ``requires-python = ">=3.13.13,<3.14"`` and regenerate the lock in-container using uv 0.11.6 (matches main's Dockerfile pin). ``uv sync --locked --no-install-project --dry-run`` verified. Drop and revert ``<3.14`` when upstream Mooncake (or kvcache-ai) ships cu13 + cp314+ wheels. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Lockfile was stale after the recent rebase shifted pyproject.toml (setuptools.packages.find restored alongside data-plane deps + the ``<3.14`` upper bound). uv 0.11.6 in-container regen, 443 packages resolved in 23.82s, ``uv sync --locked --no-install-project --dry-run`` verified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-05-19T18:14:10Z

/ok to test 4826d26

terrykong · 2026-05-19T18:54:16Z

@ZhiyuLi-Nvidia is there a reason to separate the tests like this (outside of tests/unit - where the L0*.sh scripts expect the unit tests to be run from)?

i'm not sure these are being run by the CI

Good point. Let me combine those.

Squashes 4 earlier commits into a single logical change: - Add ``data_plane`` block to the reference YAML so ``test_reference_configs_up_to_date`` passes alongside the exemplar. - Move every data-plane test out of ``tests/data_plane/{,functional}/`` into ``tests/unit/data_plane/`` so L0 actually discovers them. The L0 runner only sweeps ``tests/unit/`` (per ``L0_Unit_Tests_Other.sh``); the old location was invisible to CI. - Drop the sub-dir conftest entirely — no other ``tests/unit/`` subdir has one. Inline per-file fixtures kept the helpers explicit at the call sites. - Drop the per-test ``ray.init/shutdown`` fixture and rely on the parent autouse ``init_ray_cluster``. Matches production: NeMo-RL inits Ray once at startup; the data plane attaches on top. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

The previous defaults (512 GiB segment / 64 GiB buffer) caused ``test_seqpack_legacy_equals_tq[mooncake_cpu]`` to OOM on CI runners without prod-class memory (e.g. small CPU nodes). Seen on Slurm jobid 11920516: E real_client.cpp:734] Failed to allocate segment memory ... Mounting segment: 549755813888 bytes ... Shrink to 8 GiB segment / 1 GiB buffer everywhere these defaults appear: - examples/configs/grpo_math_1B.yaml (exemplar) - tests/unit/reference_configs/grpo_math_1B.yaml (mirror) - tests/unit/data_plane/test_seqpack_equivalence.py::_make_tq_cfg 8 GiB is enough for typical workloads and fits on a developer laptop or any CI runner with >=16 GiB RAM. Production users with terabyte- scale KV cache can override in their own YAML. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Commit 3539e83 renamed ``_apply_dynamic_sampling``'s ``dp_client: DataPlaneClient`` param to ``policy: TQPolicy`` (calling ``policy.discard_samples`` instead of ``dp_client.clear_samples``) but left the 4 unit-test call sites in ``test_sync_one_hop.py`` passing ``dp_client=client``. Slurm jobid 11920951: TypeError: _apply_dynamic_sampling() got an unexpected keyword argument 'dp_client' Add a tiny ``_fake_policy(client)`` helper that wraps the ``NoOpDataPlaneClient`` as a SimpleNamespace exposing only ``discard_samples`` (delegating to ``client.clear_samples``). Update the 4 ``_apply_dynamic_sampling`` callsites to use it. The 6 ``kv_first_write`` callsites still pass ``dp_client=client`` — different API, untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

``materialize()`` previously only applied the ``pad_to_seqlen`` target to tensors that arrived nested (vLLM jagged batches → padded via ``torch.nested.to_padded_tensor``). Wire payloads that were already rectangular at the batch's natural max (e.g. vLLM's right-padded single-microbatch output) rode the ``else`` branch and bypassed the cross-DP pad target entirely. Downstream effect: ``_stamp_pad_seqlen`` set ``GLOBAL_FORWARD_PAD_SEQLEN`` to ``max(seq_lengths)`` rounded up to ``sequence_length_round`` (e.g., 2240 at round=64), but the materialized tensor came back at the natural max (e.g., 2232 at vLLM's own alignment). The dynamic-shape microbatch iterator then called ``truncate_tensors(dim=1, truncated_len=2240)`` on a tensor of seq 2232 → ``torch.narrow`` raised ``start (0) + length (2240) exceeds dimension size (2232)``. Surfaced by job 11920250 (grpo-moonlight-16b-automodel-1n8g-ep8) on the all-grpo sweep. Fix: lift the ``pad_to_seqlen`` extension out of the nested branch so it applies to any 2D+ tensor whose seq dim is shorter than the target — nested-padded and rectangular-passthrough alike. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…s test Slurm jobid 11921251 surfaced this regression: KeyError: 'claim_meta_poll_interval_s' tests/unit/data_plane/test_tq_chaos_smoke.py in TQDataPlaneClient.__init__ (adapters/transfer_queue.py:413) The chaos test's ``_TQ_CFG`` was missing 3 required DataPlaneConfig keys (``claim_meta_poll_interval_s``, ``global_segment_size``, ``local_buffer_size``). It worked before only because the test was never being discovered by CI (was under tests/data_plane/functional/); moving it into tests/unit/data_plane/ — where L0 actually runs it — exposed the gap. Add the 3 missing keys with CI-sized values matching the ``_make_tq_cfg`` helper in test_seqpack_equivalence.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Slurm jobid 11921422 showed ``test_storage_actor_kill_raises_within_5s`` returns silently in ~10ms instead of raising within the 5s budget after ``ray.kill(SimpleStorageUnit_0)`` — the simple backend's ``get_samples`` doesn't appear to round-trip through the killed Ray actor (data probably lives in the controller / client process). The test's premise — "storage actor death surfaces as a raised exception" — doesn't hold on the simple backend we run in CI. Rather than skip with a TODO that nobody touches, remove the test outright; re-introduce when there's a chaos-test framework targeting backend=mooncake_cpu where storage is definitively out-of-process. The other two chaos tests in this file (P7-a controller kill, P7-c port-already-bound) are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

When ``calibrate_qkv_fp8_scales`` is invoked after a training step (``grpo_sync.py``:863), it reads calibration data from the data-plane via ``policy.read_from_dataplane(meta, select_fields=_calib_fields, ...)``. ``_calib_fields`` was built as ``meta.fields - DP_CALIB_EXCLUDED_FIELDS``, but the train partition's ``meta.fields`` carries the ``decompose_message_log`` wire payload (``turn_lengths``, ``turn_roles``, ``turn_contents``) alongside the model-input columns. That bulk metadata then rides into the legacy ``get_microbatch_iterator`` → ``get_and_validate_seqlen`` path, which asserts every 2D tensor's dim 1 matches the model's seq_dim (8192). ``turn_lengths`` has shape ``(B, max_turns≈3)`` → AssertionError, recipe crashes. Wire still carries these fields (the driver-side reconstruct path needs them); we just narrow what FP8 calibration asks for. Add ``MESSAGE_LOG_BULK_FIELDS`` to ``DP_CALIB_EXCLUDED_FIELDS`` so the filter at the calibration request site automatically drops them. Also adds ``tilelang`` to base deps as the workaround mamba-ssm requires on Hopper with Triton >= 3.4.0 (per upstream state-spaces/mamba#640). qwen3.5-9b megatron, qwen3.5-35ba3b megatron-ep16, and any other gated-chunk mamba recipe crash with ``RuntimeError: ... Please install tilelang`` without it. uv.lock regenerated in-container (uv 0.11.6, 443 packages, +tilelang). Surfaced by extras sweep on 7ffb1c5db: - 11920261 grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron (FP8 calib) - 11920253 grpo-qwen3.5-9b-1n8g-megatron (tilelang) - 11920255 grpo-qwen3.5-35ba3b-2n8g-megatron-ep16 (tilelang) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…ELDS Architecture-invariant test that prevents the silent regression we just hit on grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron (job 11920261): a new wire field landed in ``meta.fields`` but the FP8 calibration's blacklist (``DP_CALIB_EXCLUDED_FIELDS``) wasn't updated, so calibration silently requested the bulk-shape field and crashed in ``get_and_validate_seqlen`` (which assumes all 2D tensors are ``(B, seq_len)``). Pinning this membership: anyone who later adds another bulk-metadata field to the wire (e.g., extra decompose payload) must either match the per-token shape contract or extend ``DP_CALIB_EXCLUDED_FIELDS``, or this test fails in CI before the recipe crashes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…xtures Slurm jobid 11922031: KeyError: 'claim_meta_poll_interval_s' tests/unit/data_plane/test_tq_lifecycle.py:65 in TQDataPlaneClient.__init__ (adapters/transfer_queue.py:413) Same shape as the chaos-smoke fix in f593316bc: two fixtures in ``test_tq_lifecycle.py`` built dicts missing 3 required DataPlaneConfig keys (``claim_meta_poll_interval_s``, ``global_segment_size``, ``local_buffer_size``). Add them with the same CI-sized values used elsewhere in the suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

The legacy sync path returns FP8 scales from the calibration worker as a Python dict via Ray, then re-broadcasts to vLLM workers — keeps the driver on the critical path and doesn't compose with the async proposal (multiple in-flight calibrations need a shared transport, not driver-mediated dict passing). Add a thin TQ-backed transport for scales: - ``nemo_rl/data_plane/kv_scales.py``: - ``pack_kv_scales`` / ``unpack_kv_scales`` — dict ↔ fixed-shape tensors (``q_scales``, ``k_scales``, ``v_scales``). - ``put_kv_scales`` / ``get_kv_scales`` — register a single-sample ``"kv_scales"`` partition, write/read via the existing ``DataPlaneClient`` put_samples/get_samples primitives. - Adapter-agnostic (works on NoOp + TransferQueue alike). - ``grpo_sync.py`` (sync path, calibration site): After ``calibrate_qkv_fp8_scales`` returns the dict, immediately round-trip it through TQ via put/get. Legacy ``refit_policy_generation`` still consumes the dict — the round-trip just validates the transport works for scales. Next commit will move the read site out of the driver into the vLLM refit worker so the dict path drops out entirely. - ``tests/unit/data_plane/test_kv_scales.py``: pack/unpack identity, gap handling, empty case, full put→get round-trip via NoOp adapter parametrized on n_layers ∈ {1, 8, 24}, partition-id passthrough. Motivated by feedback on PR #2439 / data-plane async proposal: scales belong on the wire, not in driver-side Python dicts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…cut)" This reverts commit bee3a62f90bc15798b73578820da9f9c84715535. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

The existing ``DP_CALIB_EXCLUDED_FIELDS`` negative-list shape silently broke when ``MESSAGE_LOG_BULK_FIELDS`` (wire-only object arrays) were added to the train wire — ``calibrate_qkv_fp8_scales`` routes through get_microbatch_iterator which only handles seq-dim tensors, so any new non-tensor wire field crashes calib until someone augments the exclude list. Mirror the ``LP_SEED_FIELDS`` pattern instead: name the fields calibration actually needs. Changes: * schema.py: replace ``_DP_CALIB_INPUT_FIELDS`` (private) + ``DP_CALIB_EXCLUDED_FIELDS`` (derived negative) with ``DP_CALIB_INPUT_FIELDS = (INPUT_IDS, INPUT_LENGTHS)``. Same shape as ``LP_SEED_FIELDS`` — a positive tuple of what the consumer fetches. Drops the cross-layer ``llm_message_utils`` import. * grpo_sync.py: ``_calib_fields = [f for f in meta.fields if f in DP_CALIB_INPUT_FIELDS]``. Trade-off: drops the implicit multimodal-extras pass-through. Today's GRPO recipes are text-only; multimodal calibration can re-introduce extras via a meta-side marker (e.g. ``meta.extra_info["multimodal_calib_fields"]``) in a follow-up. Also remove ``tests/unit/data_plane/test_tq_chaos_smoke.py`` — was an untracked working-tree scratch that got pulled in during the test consolidation; not load-bearing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…edupe Squashes 6 earlier commits into one logical change covering all realistic-shape test infrastructure under ``tests/unit/data_plane/``. * New helper module ``_rollout_shapes.py``: - ``make_rollout_batch(n, max_seqlen, multimodal=False, *_dtype, seed)`` mints data with the shape + dtypes ``SyncRolloutActor.rollout_to_tq`` actually writes (int64 ids, int32 masks, bf16 logprobs, optional multimodal extras as flat top-level fields). - ``make_realistic_tags(n, zero_std_fraction, seed)`` mirrors GRPO driver tag-stamping (std/total_reward/prompt_id/weight_version) with a controllable zero-std fraction for dynamic-sampling tests. - ``make_multi_turn_message_log(n, turns_per_sample, seed)`` builds jagged-turn-count message logs for decompose/reconstruct round-trips. - Shared cross-file helpers ``keys_from_uids``, ``register_train_partition``, ``mooncake_available`` (deduped from three test files that each defined their own copy). * Realistic-shape coverage added to 9 test files: codec_jagged (dtype parametrize), codec_mooncake (bf16 per-token), codec_wire_stripped (NonTensorStack of varied turn roles), correctness (kv_first_write round-trip with mixed dtypes), kvbatchmeta (driver tags), message_log decompose (jagged + multi-turn round-trip), observability (mixed- dtype put_bytes), preshard_extras (VLM ``pixel_values`` round-trip), sync_one_hop (full 7-stage TQ lifecycle). * ``test_full_sync_step_lifecycle_on_realistic_batch`` walks the production ``grpo_train_sync`` per-step flow end to end on a realistic batch — register → kv_first_write → tag → worker delta-writes → driver delta-write → full read → ``finish_step`` clear — asserting every field's dtype survives the pipeline. * Cross-file dedupe: ``_keys_from_uids`` / ``_setup`` / ``_setup_partition`` / ``_mooncake_available`` (defined identically in 3+ test files) collapsed into single canonical implementations in the helper module. * Module-level imports for all helpers; module-level ``pytest``; ``_PARTITION = "train"`` const in the lifecycle test (was repeated 7×). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Regenerate via ``uv lock`` against the current ``pyproject.toml``. The diff is the resolver-marker expansion — each platform/extra marker now splits by the ``extra-7-nemo-rl-vllm`` extra dimension. Legitimate uv-driven refresh, not drift. (Note: the tilelang comment compression in the working tree was reverted by ``uv lock``'s pyproject.toml canonicalization; only the lockfile change landed here.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-05-20T09:25:01Z

/ok to test 4879e78

CI pre-commit ran ``ruff check --select I --fix`` (per ``.pre-commit-config.yaml``'s second ruff hook) and flagged four files — local ``._rollout_shapes`` imports ordered before third-party imports, and a stray blank line after ``from __future__ import annotations``. ``ruff check`` (the first hook) doesn't enable isort rules by default, which is why the earlier local lint pass missed these. Reproduced + fixed locally via the exact command CI uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-05-20T09:35:39Z

/ok to test 275d227

V1 inherited TQWorkerMixin._is_writeback_leader which returns True for every rank when CP=1, letting all TP ranks race on Mooncake upserts and crashing mooncake_cpu with -601 ILLEGAL_CLIENT. V2 already gates on (cp_local_rank, tp_local_rank) == (0, 0); V1 now mirrors the same override so TP>1 DTensor recipes (deepscaler-1.5b-16K/24K, dapo-qwen2.5-7b, gemma3-27b-actckpt-long) stop multi-writing the same prev_logprobs keys. Verified against the original failing recipe (deepscaler-1.5b-16K, TP=2): Step 1/20 + Step 2/20 completed cleanly with no -601 errors after the override was added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

CI snapshot test test_reference_configs_up_to_date flagged two stale keys in tests/unit/reference_configs/grpo_math_1B.yaml: global_segment_size: real=549755813888 (512 GiB), reference=8589934592 (8 GiB) local_buffer_size: real=68719476736 (64 GiB), reference=1073741824 (1 GiB) Bring the snapshot in line with examples/configs/grpo_math_1B.yaml; no behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-05-20T10:43:27Z

/ok to test c078d75

…ests Drop 9 source-grep tests, 1 duplicate import-smoke, 1 xfail-strict TODO, and the FP8-calib regression test (tautological under the positive-list calib filter — ``DP_CALIB_INPUT_FIELDS ∩ MESSAGE_LOG_BULK_FIELDS = ∅`` by definition, so the leak the test guarded against is impossible by construction). Keep only: * ``test_run_grpo_dispatches_both_trainers`` — behavioral: imports and calls ``_select_trainer`` directly; verifies dispatch to grpo_train (data_plane absent) and grpo_train_sync (data_plane.enabled=True). * ``test_data_plane_client_abc_method_present`` — hasattr on the live class (not a source-grep); parametrized over the 8 DataPlaneClient ABC methods that every adapter must implement. 376 → 73 lines. 9 collected (was 18). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-05-20T18:43:59Z

/ok to test f7a2916

yuki-97

feels we need to add some e2e tests to guard (functional + nightly) and one comment for env registry, others LGTM.

yuki-97 · 2026-05-21T09:54:53Z

+from torchdata.stateful_dataloader import StatefulDataLoader
+
+# Re-imports from grpo so this file is a thin trainer-only fork.
+from nemo_rl.algorithms.grpo import (


there are many things import from legacy grpo, so other changes may break it.

can we add some e2e test (functional for CI + nightly for long run)? so that other changes will get aware when breaking it. (e.g. I'm sure #2518 will break it)

yuki-97 · 2026-05-21T14:27:45Z

    # ReplayBuffer needs vLLM environment to handle trajectory data from VllmGenerationWorker
    "nemo_rl.algorithms.async_utils.ReplayBuffer": PY_EXECUTABLES.VLLM,
+    # SyncRolloutActor drives vLLM rollouts and writes flattened tensors (tensordict) to TQ
+    "nemo_rl.experience.sync_rollout_actor.SyncRolloutActor": PY_EXECUTABLES.VLLM,


I feel vLLM is just be used in SyncRolloutActor -> self.policy_generation -> vLLM worker. so just curious does SyncRolloutActor itself really need vLLM env?

ZhiyuLi-Nvidia requested review from a team as code owners May 7, 2026 17:22

ZhiyuLi-Nvidia marked this pull request as draft May 7, 2026 17:22

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from bff0471 to d20a6ed Compare May 9, 2026 01:15

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from d20a6ed to e7f6a91 Compare May 9, 2026 02:02

ZhiyuLi-Nvidia marked this pull request as ready for review May 9, 2026 02:02

ZhiyuLi-Nvidia requested a review from a team as a code owner May 9, 2026 02:02

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from c7cb642 to fa121a5 Compare May 9, 2026 02:27

ZhiyuLi-Nvidia added the CI:L1 Run doctests, unit tests, and functional tests label May 9, 2026

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from fa121a5 to 8de60a8 Compare May 9, 2026 02:41

copy-pr-bot Bot had a problem deploying to nemo-ci May 9, 2026 02:42 Error

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from 8de60a8 to aeb273c Compare May 9, 2026 02:53

copy-pr-bot Bot had a problem deploying to nemo-ci May 9, 2026 02:54 Error

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from aeb273c to 1596562 Compare May 9, 2026 03:04

copy-pr-bot Bot had a problem deploying to nemo-ci May 9, 2026 03:06 Error

ZhiyuLi-Nvidia force-pushed the zhiyul/data_plane_plan branch from 1596562 to abada7e Compare May 9, 2026 03:22

ZhiyuLi-Nvidia and others added 2 commits May 19, 2026 11:05

terrykong reviewed May 19, 2026

View reviewed changes

Comment thread examples/configs/recipes/llm/grpo-glm47-flash-4n8g-automodel.yaml

ZhiyuLi-Nvidia and others added 14 commits May 20, 2026 02:24

Revert "feat(data-plane): route FP8 KV scales through TQ (sync first …

6194fe9

…cut)" This reverts commit bee3a62f90bc15798b73578820da9f9c84715535. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia and others added 2 commits May 20, 2026 03:35

ZhiyuLi-Nvidia and others added 2 commits May 20, 2026 11:39

undo unnecessary change

f7a2916

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

yuki-97 reviewed May 21, 2026

View reviewed changes

Conversation

ZhiyuLi-Nvidia commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Test

Usage

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented May 7, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 9, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 19, 2026

Uh oh!

terrykong May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ZhiyuLi-Nvidia May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ZhiyuLi-Nvidia commented May 20, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 20, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 20, 2026

Uh oh!

ZhiyuLi-Nvidia commented May 20, 2026

Uh oh!

yuki-97 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuki-97 May 21, 2026

Choose a reason for hiding this comment

Uh oh!

yuki-97 May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ZhiyuLi-Nvidia commented May 7, 2026 •

edited

Loading

terrykong May 19, 2026 •

edited

Loading

yuki-97 left a comment •

edited

Loading

yuki-97 May 21, 2026 •

edited

Loading