feat(algorithms): SingleController streaming train_pump (split-API consumer) by mehraakash · Pull Request #2700 · NVIDIA-NeMo/RL

mehraakash · 2026-06-05T01:24:41Z

Summary

Adds the SingleController streaming _train_pump that drives the worker-side split API (begin_train_step / train_microbatch_from_meta / finish_train_step / abort_train_step) shipped in #2683 (Megatron) and #2692 (DTensor v1/v2).

Per-prompt-group streaming: the SingleController (SC) fires train_microbatch_from_meta for each ready prompt group as its rollouts commit, hiding the trainer's forward/backward passes under generation latency. begin_train_step is called lazily on the first ready group; a single finish_train_step and a single clear_samples run at end-of-step. The math matches the sync path because the loss is linear in 1/N, so the rescale is applied once at finish.
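The control flow described above can be sketched in a few lines. This is a minimal illustration, not the PR's code: `StubTrainer`, `train_pump_step`, and the `group_id` argument are hypothetical names, and the real split-API signatures may differ.

```python
from dataclasses import dataclass, field


@dataclass
class StubTrainer:
    """Hypothetical stand-in for a worker exposing the split API."""

    open_step: bool = False
    trainer_version: int = 0
    calls: list = field(default_factory=list)

    def begin_train_step(self) -> None:
        if self.open_step:
            raise RuntimeError("begin_train_step called while a step is open")
        self.open_step = True
        self.calls.append("begin")

    def train_microbatch_from_meta(self, group_id: int) -> None:
        # Assumed signature: the real call takes metadata, not a bare id.
        if not self.open_step:
            raise RuntimeError("no open train step")
        self.calls.append(f"microbatch:{group_id}")

    def finish_train_step(self) -> None:
        if not self.open_step:
            raise RuntimeError("no open train step")
        self.open_step = False
        self.trainer_version += 1  # version ticks only at finish
        self.calls.append("finish")


def train_pump_step(trainer, ready_groups, clear_samples) -> bool:
    """Run one step: lazy begin, one microbatch per group, one finish."""
    consumed = []
    for group_id in ready_groups:  # groups arrive as rollouts commit
        if not trainer.open_step:
            trainer.begin_train_step()  # lazy: only on the first ready group
        trainer.train_microbatch_from_meta(group_id)
        consumed.append(group_id)
    if not consumed:
        return False  # empty step: no begin, no finish, no clear
    trainer.finish_train_step()
    clear_samples(consumed)  # single clear at end-of-step
    return True
```

Groups are consumed in arrival order, and an empty step leaves the trainer untouched, which mirrors two of the invariants listed in the dry-run tests.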

Stack (logical, not enforced by GitHub base)

This PR is logically stacked on #2683 (Megatron split-API). The diff shown here includes the #2683 commits because GitHub requires the base to be a branch on the upstream repo; review the last 4 commits for this PR's actual scope:

feat(algorithms): add single controller metadata sampler skeleton — StalenessSampler scaffold.
feat(algorithms): add single controller advantage pump — _advantage_pump.
refactor: some refactor in single-controller (#2654) — class rename to StalenessSampler.
feat(algorithms): SC per-prompt-group streaming train_pump — the actual new work.

Co-required with #2692 (DTensor split-API) for full backend coverage; works against either backend's split API.

What's new (commit 4)

StalenessSampler.select_one_group — single-group admission per loop iter; same predicate as select_indices, sort by (lag, indices[0]).
SingleControllerConfig.target_prompt_groups_per_step — explicit per-step admission count; validated against min_prompt_groups_per_batch.
_reap_in_flight_nonblocking — drain helper, ray.wait(timeout=0).
_train_pump rewrite: lazy begin, per-group prepare_logprobs_from_meta (gated on cfg fields) → _advantage_pump → train_microbatch_from_meta queued, one finish + one clear_samples at end-of-step. Buffer capacity released per group.
DryRunTrainer split-API stub + invariants (begin-while-open raises, microbatch/finish with no-or-mismatched step raises).
7 streaming dry-run tests: arrival order, finish-time trainer_version tick, strict on-policy filter, long-tail overlap, abort idempotence, empty-step no-op, single clear_samples per step.
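As a rough illustration of single-group admission, the sketch below picks one eligible group ordered by (lag, indices[0]). The `PromptGroup` type, its `weight_version` field, the `max_lag` parameter, and the ascending sort direction are all assumptions made for this example; the PR's actual predicate lives in `select_indices` and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PromptGroup:
    """Hypothetical record for one prompt group in the buffer."""

    indices: Tuple[int, ...]  # buffer indices of the group's samples
    weight_version: int  # policy version that generated the group (assumed)


def select_one_group(
    groups: Sequence[PromptGroup], trainer_version: int, max_lag: int
) -> Optional[PromptGroup]:
    """Return one eligible group, lowest (lag, first index) first, or None."""

    def lag(group: PromptGroup) -> int:
        return trainer_version - group.weight_version

    # Assumed eligibility predicate: lag within [0, max_lag].
    eligible = [g for g in groups if 0 <= lag(g) <= max_lag]
    if not eligible:
        return None
    # Ties on lag are broken by the group's first buffer index.
    return min(eligible, key=lambda g: (lag(g), g.indices[0]))
```

With `max_lag=0` this reduces to a strict on-policy filter: only groups generated by the current trainer version are admitted.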

Math (bias-free)

masked_mean(values, mask, N) = sum(values·mask)/(N+ε) is linear in 1/N. Per microbatch we pass global_valid_seqs = global_valid_toks = tensor(1.0); loss returns un-normalized sums; backward deposits raw d(sum)/dθ into .grad. At finish_train_step we all-reduce local mask sums, pick N from loss_fn.loss_type (TOKEN_LEVEL→toks, SEQUENCE_LEVEL→seqs), rescale p.grad by 1/N. Optimizer step applies a gradient identical (within ε) to the sync path.
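The linearity argument can be checked numerically. The sketch below uses plain Python lists in place of tensors and made-up values; it compares the sync result (one masked mean over all data with the true N) against the streamed result (un-normalized per-microbatch sums with N = 1, rescaled once by 1/N at the end).

```python
import math

EPS = 1e-8  # illustrative epsilon; the real value is not stated in this PR


def masked_mean(values, mask, n):
    """sum(values * mask) / (n + eps), as in the formula above."""
    return sum(v * m for v, m in zip(values, mask)) / (n + EPS)


# Made-up (values, mask) pairs standing in for per-group microbatches.
microbatches = [
    ([0.5, 1.5, 2.0], [1, 1, 0]),
    ([3.0, 4.0], [1, 1]),
    ([2.5], [1]),
]

# Sync path: one masked mean over everything with the true N.
all_values = [v for values, _ in microbatches for v in values]
all_mask = [m for _, mask in microbatches for m in mask]
n_total = sum(all_mask)
sync_loss = masked_mean(all_values, all_mask, n_total)

# Streaming path: pass N = 1 per microbatch, accumulate, rescale at finish.
streamed_sum = sum(masked_mean(values, mask, 1.0) for values, mask in microbatches)
streamed_loss = streamed_sum / n_total

# The two agree up to an epsilon-sized relative difference.
assert math.isclose(sync_loss, streamed_loss, rel_tol=1e-6)
```

Because differentiation is linear, the same identity holds for the accumulated gradients, which is why a single 1/N rescale at finish is sufficient.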

Test plan

Dry-run streaming tests (tests/unit/algorithms/test_single_controller_dryrun.py::TestStreamingTrainPump) — 7 tests, CPU-only.
StalenessSampler unit tests (tests/unit/algorithms/test_staleness_sampler.py) — 7 tests for select_one_group.
End-to-end async GRPO recipe smoke test. Pending merge of #2683 (feat(megatron): split-API train-step state machine on MegatronPolicyWorker) and #2692 (feat(policy): split-API train-step state machine on DTensor v1/v2), plus a container test.

🤖 Generated with Claude Code

…orker

Adds begin_train_step / train_microbatch / finish_train_step / abort_train_step on MegatronPolicyWorkerImpl, mirroring the DTensor v1/v2 implementations but adapted for mcore's contiguous grad bucket + pipeline-schedule reduce path.

Mechanism:
- begin_train_step: zero_grad_buffer + optimizer.zero_grad, store loss_fn / gbs / mbs / local_valid_seqs/toks accumulators on _train_step_state, and null model.config.grad_sync_func (saved for restore) so the PP scheduler's direct reduce dispatch cannot bypass no_sync.
- train_microbatch(data): wrap one ``megatron_forward_backward`` invocation in ``with self.model.no_sync():`` so mcore DDP hooks accumulate ``param.main_grad`` locally without dispatching the cross-DP reduce. Pass ``global_valid_seqs/toks=tensor(1.0)`` so the loss returns un-normalized sums; backward deposits raw d(sum)/dθ. Accumulate local mask sums + per-mb metrics + the total pipeline-microbatch count (for finish-time MoE aux-loss scaling).
- finish_train_step: all_reduce mask sums to get true N (toks for TOKEN_LEVEL loss, seqs for SEQUENCE_LEVEL), call self.model.scale_gradients(1/N), then the one true cross-DP reduce via start_grad_sync + finish_grad_sync, optimizer.step (clips internally), restore grad_sync_func, scheduler.step(increment=gbs). Rescale per-mb metrics by 1/N (linear-in-1/N math), aggregate, surface global counts.
- abort_train_step: restore grad_sync_func, zero_grad_buffer + zero_grad, drop state. ``trainer_version`` unchanged.

Sync ``train()`` is left untouched.

Includes CPU unit tests at tests/unit/models/policy/test_megatron_split_state.py covering the lifecycle and call-order invariants (no_sync wrap, grad_sync_func save/restore, mask-sum accumulation, N selection by loss_type, abort idempotence, MoE scaling). Marked pytest.mark.mcore so they run only in mcore-enabled CI containers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Akash Mehra <akamehra@nvidia.com>
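The save/restore bookkeeping around grad_sync_func in the commit message above might look roughly like the following. This is a pure-Python sketch with no Megatron dependency: `SplitStepState` and its method names are hypothetical, a `SimpleNamespace` stands in for the model config, and only the token-count accumulator is modeled.

```python
from types import SimpleNamespace


class SplitStepState:
    """Hypothetical bookkeeping for one begin/finish/abort cycle."""

    def __init__(self, model_config):
        self.model_config = model_config
        self.is_open = False
        self.local_valid_toks = 0.0
        self._saved_grad_sync_func = None

    def begin(self):
        if self.is_open:
            raise RuntimeError("begin called while a train step is open")
        # Save the hook, then null it so nothing can trigger an early reduce.
        self._saved_grad_sync_func = self.model_config.grad_sync_func
        self.model_config.grad_sync_func = None
        self.local_valid_toks = 0.0
        self.is_open = True

    def add_microbatch(self, valid_toks):
        if not self.is_open:
            raise RuntimeError("no open train step")
        self.local_valid_toks += valid_toks  # local mask sum, reduced at finish

    def _restore(self):
        self.model_config.grad_sync_func = self._saved_grad_sync_func
        self._saved_grad_sync_func = None
        self.is_open = False

    def finish(self):
        if not self.is_open:
            raise RuntimeError("no open train step")
        n = self.local_valid_toks  # the real worker would all-reduce this
        self._restore()
        return n

    def abort(self):
        # Idempotent: a second abort (or abort with no open step) is a no-op.
        if self.is_open:
            self._restore()
        self.local_valid_toks = 0.0
```

Restoring the hook in one private helper keeps finish and abort from drifting apart, which is the property the save/restore and abort-idempotence unit tests are described as covering.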

Signed-off-by: Akash Mehra <akamehra@nvidia.com>

Signed-off-by: Yuki Huang <yukih@nvidia.com>

Drives the begin/train_microbatch/finish split API in NVIDIA-NeMo#2683 and group, per-group prepare_logprobs (when configured) -> advantage_pump -> train_microbatch_from_meta (queued), one finish_train_step + one clear_samples at end-of-step. Buffer capacity released per group, not per step.

- StalenessSampler.select_one_group: picks one eligible prompt group; same predicate as select_indices, sort by (lag, indices[0]).
- SingleControllerConfig.target_prompt_groups_per_step: explicit per-step admission count; validated against min_prompt_groups_per_batch.
- _reap_in_flight_nonblocking: ray.wait(timeout=0) drain helper.
- DryRunTrainer: split-API stub with begin/microbatch/finish/abort invariants for dry-run tests.
- 7 streaming dry-run tests: arrival order, finish-time trainer_version tick, strict on-policy filter, long-tail overlap, abort idempotence, empty-step no-op, single clear_samples per step.

Signed-off-by: Akash Mehra <akamehra@nvidia.com>
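The non-blocking drain idea can be illustrated without a Ray cluster. The sketch below deliberately swaps in `concurrent.futures` as a stand-in for Ray object refs; the function name and return shape are assumptions. With Ray, the analogous poll would be `ray.wait(refs, num_returns=len(refs), timeout=0)`, which also returns a (ready, not ready) pair.

```python
from concurrent.futures import Future, wait


def reap_in_flight_nonblocking(in_flight):
    """Split in-flight futures into (finished, still_pending) without blocking."""
    # timeout=0 makes this a poll: it returns immediately with whatever is done.
    done, _ = wait(in_flight, timeout=0)
    finished = [f for f in in_flight if f in done]  # keep submission order
    still_pending = [f for f in in_flight if f not in done]
    for future in finished:
        future.result()  # re-raise any worker-side exception here
    return finished, still_pending


# Usage: two manually completed futures and one still running.
first, second, third = Future(), Future(), Future()
first.set_result("mb-0")
third.set_result("mb-2")
finished, pending = reap_in_flight_nonblocking([first, second, third])
```

Calling the helper once per loop iteration keeps the in-flight list short without ever stalling the pump on a slow microbatch.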

copy-pr-bot · 2026-06-05T01:24:44Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mehraakash · 2026-06-05T01:26:27Z

/ok to test 9025e5a

Required by 'Check if any files with zero errors not in whitelist' guard in cicd-main.yml. Both files have zero pyrefly errors; without the whitelist entry the lint job fails. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>

mehraakash · 2026-06-05T01:43:51Z

/ok to test 7ec56e2

Pre-existing zero-error file from NVIDIA-NeMo#2078 (Eagle3) that was never added to the project-includes whitelist. Carrying the fix forward in this PR to unblock the lint job. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>

mehraakash · 2026-06-05T02:07:24Z

/ok to test cc5af8d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>

mehraakash · 2026-06-05T02:22:57Z

/ok to test 56f8d80

…erExtension

Phase D of the Megatron-MX integration plan. Wires the new modelexpress.megatron_translator path into the existing update_weights_via_mx flow on NemoRL's direct-vLLM extension.

What's new
* First-cycle Megatron detection: peek at any candidate's megatron_meta. None → existing DTensor / FSDP path. Set → latched into self._mx_megatron_mode and routed through _update_weights_via_mx_megatron forevermore on this receiver.
* _update_weights_via_mx_megatron: builds a MegatronReceiverContext once (transformer_config + Megatron→HF name map from the source sidecar; receive_specs from the per-tensor TensorDescriptorV2 registry); per refit, calls run_refit_cycle which drives the full discover → plan → assemble → translate → yield(hf_name, tensor) pipeline. The yielded HF tensors flow through the existing _load_weights + FP8-KV-cache hooks.
* Tree fan-out (publish_self_as_source) is preserved on the Megatron path identical to the DTensor path.

Backwards compat
* DTensor / FSDP receivers see _mx_megatron_mode = False on the first cycle and stay on the existing path. PR NVIDIA-NeMo#2700's prime-rl receivers and John's Dynamo-side worker extension are unaffected.
* Sources that advertise publisher_kind=megatron but are missing the transformer_config sidecar (older trainer image) trigger a one-shot warning and fall back to non-Megatron mode.

Out of scope (deferred)
* The pull callback in _update_weights_via_mx_megatron uses self._mx_receiver._receiver._nixl.pull_to, which is the conceptual API; the actual NIXL plumbing for sliced pulls (registering a sub-view of a parent tensor + completing the transfer) needs a small wrapper in modelexpress.refit_receiver. v0 of the actual NIXL sliced-pull plumbing lands in a follow-up commit; this commit gets the Phase D control flow + receive context build right so the next commit only needs to fill in the pull function.
* Wiring into Dynamo's MxRefitWorkerExtension (a separate Dynamo repo edit; same shape: import run_refit_cycle, call from the worker extension's update_weights_via_mx).

The file is introduced by NVIDIA-NeMo#2692 (DTensor PR), not by this branch. Whitelisting it here causes pyrefly to fail with 'No Python files matched pattern' since the file does not exist on this branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>

mehraakash · 2026-06-05T03:06:31Z

/ok to test 1fba99e

mehraakash and others added 6 commits June 3, 2026 22:56

style: lint + type fixes for split-API megatron worker

88a9e46

Signed-off-by: Akash Mehra <akamehra@nvidia.com>

feat(algorithms): add single controller metadata sampler skeleton

775483d

Signed-off-by: Akash Mehra <akamehra@nvidia.com>

feat(algorithms): add single controller advantage pump

530a77f

Signed-off-by: Akash Mehra <akamehra@nvidia.com>

refactor: some refactor in single-controller (NVIDIA-NeMo#2654)

66af510

Signed-off-by: Yuki Huang <yukih@nvidia.com>

mehraakash requested review from a team as code owners June 5, 2026 01:24

mehraakash added the CI:L0 Run doctests and unit tests label Jun 5, 2026


ci: whitelist policy_trainer_actor.py in pyrefly

56f8d80

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Akash Mehra <akamehra@nvidia.com>

mehraakash wants to merge 10 commits into NVIDIA-NeMo:main from mehraakash:asyncrl/split_train_sc

2 participants
