Move inference context bookkeeping to CPU with ContextGPUView #4306
Merged
lmcafee-nvidia merged 33 commits into NVIDIA:main on May 3, 2026
Conversation
wdykas
reviewed
Apr 14, 2026
Contributor
I like this PR. I can rebase some of my stuff (#4240) on top of this and then we can see what speedup we get.
This is the "baseline" commit referenced in all future timing discussions for the context-cpu work. It branches directly from main and adds NVTX ranges for the 5 inference loop stages that exist on main:
- initialize_attention_state
- forward_pass
- sampling
- active_request_mask
- update_requests
All ranges nest inside the existing "Prefill"/"Decode" range from dynamic_engine.py, enabling nsys analysis of per-stage timing. Subsequent commits on this branch will add context-cpu-specific optimizations AND extend the NVTX range set with 2 transfer stages (transfer_bookkeeping_to_gpu, transfer_samples_to_cpu) that don't exist on main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all per-request and per-token bookkeeping tensors in DynamicInferenceContext from GPU to pinned CPU memory. Introduce ContextGPUView as the single GPU interface for forward-pass code: context.foo is always CPU (source of truth), context.gpu_view.foo is always GPU (snapshot populated per-step by transfer_bookkeeping_to_gpu). This eliminates CPU-GPU device mixing by establishing a clear architectural boundary -- GPU code reads from gpu_view, CPU bookkeeping reads from context directly. The gpu_view is populated once per step with non-blocking pinned-memory copies.
Key changes:
- New gpu_view.py with ContextGPUView (6 token-level + 3 request-level GPU staging tensors)
- All request/token tensors in dynamic_context.py moved to CPU with pin_memory=True
- transfer_bookkeeping_to_gpu() populates gpu_view each step
- text_generation_controller.py reads gpu_view for GPU-phase ops (sampling, verification, log-probs)
- Post-rewind code reads CPU context directly (not stale gpu_view)
- mamba_slot_allocator.py fixed for CPU bookkeeping indexing
302 tests pass, 0 failures (90 pre-existing skips).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
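A minimal sketch of the pattern this commit describes, with a single illustrative field; the class and attribute names here (ContextGPUViewSketch, token_to_request) are assumptions for illustration, not the module's actual layout:

```python
import torch

class ContextGPUViewSketch:
    """Illustrative GPU-side snapshot of the CPU bookkeeping (hypothetical field set)."""

    def __init__(self, max_tokens: int, device: torch.device):
        # GPU staging tensor, refreshed once per step by the context.
        self.token_to_request = torch.empty(max_tokens, dtype=torch.int64, device=device)

class ContextSketch:
    def __init__(self, max_tokens: int, device: torch.device):
        # Source of truth lives in pinned CPU memory so H2D copies can be non-blocking.
        self.token_to_request = torch.empty(max_tokens, dtype=torch.int64, pin_memory=True)
        self.gpu_view = ContextGPUViewSketch(max_tokens, device)

    def transfer_bookkeeping_to_gpu(self, n_tokens: int) -> None:
        # One non-blocking H2D copy per field; forward-pass code reads gpu_view only,
        # CPU bookkeeping keeps reading the pinned tensors directly.
        self.gpu_view.token_to_request[:n_tokens].copy_(
            self.token_to_request[:n_tokens], non_blocking=True
        )
```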
Defer Mamba state zeroing and prefix cache restore from add_request() and update_requests() to transfer_bookkeeping_to_gpu(), making both methods 100% CPU for hybrid models. Mamba compute_and_store_offsets remains immediate since commit_intermediate_states depends on its CPU-side state. Add _transfer_samples_to_cpu() to make the D2H boundary explicit. 302 tests pass (each suite run separately), 0 regressions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Launch Mamba state zeroing/restore at the start of initialize_attention_state() instead of transfer_bookkeeping_to_gpu(). This allows the GPU ops to overlap with the CPU work that follows (batch dimension computation, MHA metadata, token padding). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split mamba_metadata.update() into compute_cpu_metadata() (CPU, called in initialize_attention_state) and load_from_cpu() (H2D copies, called in transfer_bookkeeping_to_gpu). This eliminates GPU kernel launches for batch indices, cu_seqlens, chunk boundaries, and conv1d metadata. The intermediate state extraction still uses GPU (_update_intermediate_metadata). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
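Roughly, the split looks like the sketch below; the method names compute_cpu_metadata() and load_from_cpu() come from the commit text, but the class body is an assumed simplification, not the real MambaMetadata implementation:

```python
import torch

class MambaMetadataSketch:
    """Two-phase metadata update: CPU-only compute first, one batch of H2D copies later."""

    def __init__(self, max_requests: int, device: torch.device):
        self.cu_seqlens_cpu = torch.zeros(max_requests + 1, dtype=torch.int32, pin_memory=True)
        self.cu_seqlens = torch.zeros(max_requests + 1, dtype=torch.int32, device=device)

    def compute_cpu_metadata(self, seq_lengths: list[int]) -> None:
        # Pure CPU: called from initialize_attention_state, launches no GPU kernels.
        self.cu_seqlens_cpu[1 : len(seq_lengths) + 1] = torch.tensor(
            seq_lengths, dtype=torch.int32
        ).cumsum(0)

    def load_from_cpu(self) -> None:
        # H2D only: called from transfer_bookkeeping_to_gpu.
        self.cu_seqlens.copy_(self.cu_seqlens_cpu, non_blocking=True)
```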
Move _intermediate_offsets_gpu, _intermediate_counts_gpu, _intermediate_block_ids_gpu, _eos_cache_block_id_gpu from GPU to CPU. Block IDs and EOS block IDs are pure CPU bookkeeping (consumed by commit_intermediate_states via .tolist()). Offsets and counts keep a GPU buffer for _update_intermediate_metadata to consume; the H2D copy is handled by that method on first use. Eliminates ~5 GPU writes per add_request and 2 .tolist() D2H syncs per commit_intermediate_states. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
df72eb6 to 0c755bb
…init split. PR NVIDIA#4225 extracted argument parsing out of initialize_megatron(); call parse_and_validate_args() separately and invoke initialize_megatron() with no arguments.
Drop the 5 per-step GPU kernel launches in MHAMetadata.reset() (query_lengths, cu_query_seq_lengths, cu_kv_seq_lengths, kv_seq_lengths, block_table). The next update() / load_from_cpu() fully overwrites the slice of each buffer that the forward pass will read (via state_data[:n]), so clearing here is redundant paranoia from the old GPU-resident design. Removes 10 vectorized_elementwise_kernel launches per step (5 buffers x Graphed + NonGraphed metadata). See lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 1.
adjust_batch_dims_for_expert_parallelism() runs on every inference step to pick a CUDA graph batch dimension consistent across EP ranks. It performed a torch.distributed.all_reduce(MAX) on a 4-int GPU tensor sandwiched between an H2D copy (tensor construction) and a D2H copy (.cpu() to read the result on the host).

Add a sync_all_reduce_max() method to AsyncZMQCommunicator that uses blocking ZMQ send/recv on the CPU. When the engine has created the EP ZMQ communicator, it is attached to the context via DynamicInferenceContext.set_ep_zmq_communicator(), which in turn is forwarded to match_graph_config() / adjust_batch_dims_for_expert_parallelism(). When present, the 4-int MAX is done on CPU with zero GPU kernels. The torch.distributed fallback path is kept for standalone / non-engine call sites.

Removes one NCCL AllReduce kernel plus one H2D and one D2H per step (~102 us/step in the 2304-step nanov3 trace). Also removes the stream-ordering barrier that the NCCL kernel introduced on the compute stream. See lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 3.1.
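For contrast, a minimal sketch of the old and new shapes of this reduction; sync_all_reduce_max() is named in the commit, but the communicator interface and the function signatures shown here are assumptions, not the actual AsyncZMQCommunicator API:

```python
import torch
import torch.distributed as dist

def max_reduce_batch_dims_gpu(dims: list[int]) -> list[int]:
    # Old path: H2D copy (tensor construction), NCCL AllReduce(MAX) kernel, D2H copy (.cpu()).
    t = torch.tensor(dims, dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return t.cpu().tolist()

def max_reduce_batch_dims_cpu(dims: list[int], ep_zmq_communicator) -> list[int]:
    # New path: blocking host-side exchange over ZMQ; zero GPU kernels, no stream barrier.
    # Method name follows the commit text; its exact signature here is an assumption.
    return ep_zmq_communicator.sync_all_reduce_max(dims)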
The 9 .copy_(non_blocking=True) calls in DynamicInferenceContext.transfer_bookkeeping_to_gpu() each incur ~15us of cudaMemcpyAsync launch overhead for ~1us of actual transfer — the NVTX range is 270us of wallclock with ~6% GPU utilization in the nanov3 trace.

Back all 9 bookkeeping fields (6 per-token + 3 per-request) with one contiguous uint8 buffer on each side (pinned CPU + device GPU), with each attribute as a dtype-correct view onto a slice of the buffer. Layout is shared between DynamicInferenceContext._cpu_bookkeeping_buf and ContextGPUView._buf; int64 fields are placed first so alignment is automatic. The per-step transfer is now a single cudaMemcpyAsync of 32*max_tokens + 12*max_requests bytes (~71 KB for the benchmark config).

Per-token fields are aliased with the source-of-truth attributes because the CPU-side bookkeeping and the GPU forward pass both use the same [:n_tok] slice. Per-request fields have an extra staging area in the coalesced buffer, refreshed each step from the persistent CPU tensors (at [paused_request_count:total_request_count]) into [:n_active], since the forward pass reads them at [:n_active] on GPU but CPU bookkeeping keeps paused requests in [0:paused_request_count).

Correctness verified: test_reset, test_update_request, test_add_request, test_initialize_dynamic_context all pass (8 tests × 4 ranks).
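A sketch of the coalescing idea under assumed field names and counts: one pinned uint8 buffer per side, each field exposed as a dtype-correct view onto a slice, and a single copy per step:

```python
import torch

def make_coalesced_bookkeeping(max_tokens: int, max_requests: int, device: torch.device):
    """Build one pinned CPU buffer and one GPU buffer sharing a single field layout."""

    def field_bytes(dt: torch.dtype, n: int) -> int:
        return n * torch.empty((), dtype=dt).element_size()

    # int64 fields first so every view starts at a naturally aligned offset
    # (hypothetical fields; the real context coalesces 6 per-token + 3 per-request).
    layout = [
        ("token_to_request", torch.int64, max_tokens),
        ("token_to_position", torch.int64, max_tokens),
        ("request_seq_lengths", torch.int32, max_requests),
    ]
    nbytes = sum(field_bytes(dt, n) for _, dt, n in layout)

    cpu_buf = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(nbytes, dtype=torch.uint8, device=device)

    def views(buf: torch.Tensor) -> dict:
        out, off = {}, 0
        for name, dt, n in layout:
            size = field_bytes(dt, n)
            # Reinterpret a byte slice as a dtype-correct view; no copy is made.
            out[name] = buf[off : off + size].view(dt)
            off += size
        return out

    return cpu_buf, gpu_buf, views(cpu_buf), views(gpu_buf)

# Per step, the nine separate copies collapse to one cudaMemcpyAsync:
#   gpu_buf.copy_(cpu_buf, non_blocking=True)
```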
MHAMetadata no longer owns private GPU buffers; GraphedMHAMetadata and NonGraphedMHAMetadata bind to shared views of ContextGPUView._buf (only one is active per step, so sharing storage is safe). initialize_attention_state writes the 5 MHA fields directly into pinned slots in _cpu_bookkeeping_buf, and transfer_bookkeeping_to_gpu's single cudaMemcpyAsync now covers them along with the existing token/request fields -- eliminating the 5 per-step mha.load_from_cpu copies. Per-step state_data is rebuilt via set_state_data using the freshly transferred GPU views plus Python-int max_seqlen scalars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _cpu_bookkeeping_buf / ContextGPUView._buf with a Mamba section (9 int32 fields, hybrid-only) that mirrors MambaMetadata's per-step varlen tensors. MambaMetadata.compute_cpu_metadata() now writes directly into the bound pinned CPU views instead of allocating ephemeral tensors, and load_from_cpu() drops all 9 .copy_() calls -- the coalesced H2D in transfer_bookkeeping_to_gpu() covers the transfer, leaving load_from_cpu() to just alias state attributes onto the freshly-transferred GPU views and run the intermediate-extraction GPU computation. The legacy MambaMetadata.update() path is preserved (it still owns the standalone *_buffer tensors for unit tests that construct MambaMetadata without a context); it's unused on the inference path, so the ~40KB of redundant GPU memory is negligible. Also wires mamba_chunk_size through _allocate_mamba_states so the MambaMetadata's internal chunk_size matches the unified-buffer sizing in ContextGPUView. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ~550 µs decode-step gap between update_requests and
initialize_attention_state is dominated by step_end_event.synchronize()
(CPU waits for GPU to drain so elapsed_time() can be read) plus the
pre/post context_state dict builds. All of that work feeds only the
print block (dynamic_engine.py:1858) and the W&B metrics block
(dynamic_engine.py:1818), both already gated on
logging_step_interval > 0 and step_count % logging_step_interval == 0.
Predict that same condition once at the top of async_forward as
`will_log_this_step` and skip the logging-only work on non-logging steps:
- step_start/end events and elapsed_time (step_time = 0.0)
- pre_step_context_state print-only fields (keep active_token_count
and step_count, used by post_process_requests' pre_fwd_* args)
- kvcache_util_stats computation
- post_step_context_state dict (and drop the two dead fields
padded_active_token_count, using_cuda_graph_this_step that no
consumer reads)
- the pre/post merge (minimal dict keeps kv_stats=None so the
metrics-block gate at 1818 stays well-typed)
In post_process_requests, gate the TPOT update on step_time > 0 so
non-logging steps don't pollute request.tpot with zeros -- the metric
becomes a sparse sample aligned with the same cadence as logging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
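A condensed sketch of that gating; the predicate mirrors the commit text, but the surrounding step function is illustrative rather than the actual async_forward code:

```python
import torch

def run_step(step_count: int, logging_step_interval: int, forward_fn) -> float:
    """Illustrative step wrapper: do timing/logging work only on logging steps."""
    will_log_this_step = (
        logging_step_interval > 0 and step_count % logging_step_interval == 0
    )
    if will_log_this_step:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()

    forward_fn()

    if not will_log_this_step:
        return 0.0  # non-logging steps skip event sync and the context-state dict builds
    end.record()
    end.synchronize()  # the CPU-waits-for-GPU drain now happens only on logging steps
    return start.elapsed_time(end)
```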
_dynamic_step_context_bookkeeping already produces sampled_tokens_cpu via _transfer_samples_to_cpu(), but the outer dict built by async_generate_output_tokens_dynamic_batch was handing out a fresh clone of the GPU _sampled_tokens_cuda buffer. The engine's post_process_requests then called sample.tolist() on that GPU tensor, forcing a D2H sync -- pure overhead inside the update_requests -> initialize_attention_state critical-path gap. Propagate the already-allocated CPU tensor instead: add "sample": sampled_tokens_cpu to the _dynamic_step_context_bookkeeping return dict, drop the outer GPU clone, and keep skip_bookkeeping=True behaving by doing a one-shot .cpu() on that path. The CPU tensor is independent storage (fresh .cpu() allocation, not a view) and isn't mutated by the step -- update_requests only touches new_sample_copy, a separate clone. Net: sample.tolist() becomes pure CPU-to-list, no sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
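In pattern form (illustrative names, not the actual dict keys or call sites), the change hands out the CPU copy produced at the explicit D2H boundary instead of the live GPU buffer:

```python
import torch

def build_step_output(sampled_tokens_cuda: torch.Tensor) -> dict:
    # D2H happens once, at the explicit boundary; .cpu() allocates independent storage.
    sampled_tokens_cpu = sampled_tokens_cuda.cpu()
    # Hand the engine the CPU tensor so a later .tolist() is pure host work, not a hidden sync.
    return {"sample": sampled_tokens_cpu}
```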
3409dcf to 3b6c3dd
Contributor
Author
/ok to test ac90293
Contributor
Author
/ok to test ee4fa04
ee4fa04 to 477d780
Contributor
Author
/ok to test 8305c95
Contributor
Author
/ok to test 8305c95
Contributor
Author
/ok to test a6f6d04
Contributor
Author
/ok to test 8c91a68
# Conflicts:
#   megatron/core/inference/contexts/dynamic_context.py
Contributor
Author
/ok to test 1dda0ac
Contributor
Author
/ok to test 2bb6d5e
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25269187340
Summary
- Move all per-request and per-token bookkeeping tensors in DynamicInferenceContext from GPU to pinned CPU memory
- Introduce ContextGPUView as the single GPU interface for forward-pass code: context.foo is always CPU (source of truth), context.gpu_view.foo is always GPU (snapshot populated per-step by transfer_bookkeeping_to_gpu())
- Defer Mamba state zeroing and prefix cache restore to transfer_bookkeeping_to_gpu(), making add_request() and update_requests() 100% CPU for hybrid models
- Add _transfer_samples_to_cpu() to make the D2H boundary explicit

Test plan
- test_dynamic_context.py — 70 passed
- test_dynamic_engine.py — 145 passed, 89 skipped (pre-existing)
- test_dynamic_events.py — 10 passed
- test_dynamic_prefix_caching.py — 18 passed
- test_dynamic_prefix_caching_coordinator.py — 35 passed, 1 skipped
- test_mamba_metadata.py — 12 passed
- test_prefix_caching_cuda_graphs.py — 7 passed
- test_mamba_prefix_caching_e2e.py — 5 passed

🤖 Generated with Claude Code