Move inference context bookkeeping to CPU with ContextGPUView #4306
Merged
lmcafee-nvidia merged 33 commits into NVIDIA:main on May 3, 2026
Conversation
wdykas
reviewed
Apr 14, 2026
Contributor
I like this PR. I can rebase some of my stuff (#4240) on top of this and then we can see what speedup we get.
This is the "baseline" commit referenced in all future timing discussions for the context-cpu work. It branches directly from main and adds NVTX ranges for the 5 inference loop stages that exist on main:
- initialize_attention_state
- forward_pass
- sampling
- active_request_mask
- update_requests
All ranges nest inside the existing "Prefill"/"Decode" range from dynamic_engine.py, enabling nsys analysis of per-stage timing. Subsequent commits on this branch will add context-cpu-specific optimizations AND extend the NVTX range set with 2 transfer stages (transfer_bookkeeping_to_gpu, transfer_samples_to_cpu) that don't exist on main.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all per-request and per-token bookkeeping tensors in DynamicInferenceContext from GPU to pinned CPU memory. Introduce ContextGPUView as the single GPU interface for forward-pass code: context.foo is always CPU (source of truth), context.gpu_view.foo is always GPU (snapshot populated per-step by transfer_bookkeeping_to_gpu). This eliminates CPU-GPU device mixing by establishing a clear architectural boundary -- GPU code reads from gpu_view, CPU bookkeeping reads from context directly. The gpu_view is populated once per step with non-blocking pinned-memory copies.
Key changes:
- New gpu_view.py with ContextGPUView (6 token-level + 3 request-level GPU staging tensors)
- All request/token tensors in dynamic_context.py moved to CPU with pin_memory=True
- transfer_bookkeeping_to_gpu() populates gpu_view each step
- text_generation_controller.py reads gpu_view for GPU-phase ops (sampling, verification, log-probs)
- Post-rewind code reads CPU context directly (not stale gpu_view)
- mamba_slot_allocator.py fixed for CPU bookkeeping indexing
302 tests pass, 0 failures (90 pre-existing skips).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
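A minimal sketch of the pattern this commit describes, with a single illustrative field; the class and attribute names here (ContextGPUViewSketch, token_to_request) are assumptions for illustration, not the module's actual layout:

```python
import torch

class ContextGPUViewSketch:
    """Illustrative GPU-side snapshot of the CPU bookkeeping (hypothetical field set)."""

    def __init__(self, max_tokens: int, device: torch.device):
        # GPU staging tensor, refreshed once per step by the context.
        self.token_to_request = torch.empty(max_tokens, dtype=torch.int64, device=device)

class ContextSketch:
    def __init__(self, max_tokens: int, device: torch.device):
        # Source of truth lives in pinned CPU memory so H2D copies can be non-blocking.
        self.token_to_request = torch.empty(max_tokens, dtype=torch.int64, pin_memory=True)
        self.gpu_view = ContextGPUViewSketch(max_tokens, device)

    def transfer_bookkeeping_to_gpu(self, n_tokens: int) -> None:
        # One non-blocking H2D copy per field; forward-pass code reads gpu_view only,
        # CPU bookkeeping keeps reading the pinned tensors directly.
        self.gpu_view.token_to_request[:n_tokens].copy_(
            self.token_to_request[:n_tokens], non_blocking=True
        )
```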
Defer Mamba state zeroing and prefix cache restore from add_request() and update_requests() to transfer_bookkeeping_to_gpu(), making both methods 100% CPU for hybrid models. Mamba compute_and_store_offsets remains immediate since commit_intermediate_states depends on its CPU-side state. Add _transfer_samples_to_cpu() to make the D2H boundary explicit. 302 tests pass (each suite run separately), 0 regressions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Launch Mamba state zeroing/restore at the start of initialize_attention_state() instead of transfer_bookkeeping_to_gpu(). This allows the GPU ops to overlap with the CPU work that follows (batch dimension computation, MHA metadata, token padding). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split mamba_metadata.update() into compute_cpu_metadata() (CPU, called in initialize_attention_state) and load_from_cpu() (H2D copies, called in transfer_bookkeeping_to_gpu). This eliminates GPU kernel launches for batch indices, cu_seqlens, chunk boundaries, and conv1d metadata. The intermediate state extraction still uses GPU (_update_intermediate_metadata). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
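Roughly, the split looks like the sketch below; the method names compute_cpu_metadata() and load_from_cpu() come from the commit text, but the class body is an assumed simplification, not the real MambaMetadata implementation:

```python
import torch

class MambaMetadataSketch:
    """Two-phase metadata update: CPU-only compute first, one batch of H2D copies later."""

    def __init__(self, max_requests: int, device: torch.device):
        self.cu_seqlens_cpu = torch.zeros(max_requests + 1, dtype=torch.int32, pin_memory=True)
        self.cu_seqlens = torch.zeros(max_requests + 1, dtype=torch.int32, device=device)

    def compute_cpu_metadata(self, seq_lengths: list[int]) -> None:
        # Pure CPU: called from initialize_attention_state, launches no GPU kernels.
        self.cu_seqlens_cpu[1 : len(seq_lengths) + 1] = torch.tensor(
            seq_lengths, dtype=torch.int32
        ).cumsum(0)

    def load_from_cpu(self) -> None:
        # H2D only: called from transfer_bookkeeping_to_gpu.
        self.cu_seqlens.copy_(self.cu_seqlens_cpu, non_blocking=True)
```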
Move _intermediate_offsets_gpu, _intermediate_counts_gpu, _intermediate_block_ids_gpu, _eos_cache_block_id_gpu from GPU to CPU. Block IDs and EOS block IDs are pure CPU bookkeeping (consumed by commit_intermediate_states via .tolist()). Offsets and counts keep a GPU buffer for _update_intermediate_metadata to consume; the H2D copy is handled by that method on first use. Eliminates ~5 GPU writes per add_request and 2 .tolist() D2H syncs per commit_intermediate_states. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
df72eb6 to 0c755bb
…init split. PR NVIDIA#4225 extracted argument parsing out of initialize_megatron(); call parse_and_validate_args() separately and invoke initialize_megatron() with no arguments.
Drop the 5 per-step GPU kernel launches in MHAMetadata.reset() (query_lengths, cu_query_seq_lengths, cu_kv_seq_lengths, kv_seq_lengths, block_table). The next update() / load_from_cpu() fully overwrites the slice of each buffer that the forward pass will read (via state_data[:n]), so clearing here is redundant paranoia from the old GPU-resident design. Removes 10 vectorized_elementwise_kernel launches per step (5 buffers x Graphed + NonGraphed metadata). See lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 1.
adjust_batch_dims_for_expert_parallelism() runs on every inference step to pick a CUDA graph batch dimension consistent across EP ranks. It performed a torch.distributed.all_reduce(MAX) on a 4-int GPU tensor sandwiched between an H2D copy (tensor construction) and a D2H copy (.cpu() to read the result on the host).

Add a sync_all_reduce_max() method to AsyncZMQCommunicator that uses blocking ZMQ send/recv on the CPU. When the engine has created the EP ZMQ communicator, it is attached to the context via DynamicInferenceContext.set_ep_zmq_communicator(), which in turn is forwarded to match_graph_config() / adjust_batch_dims_for_expert_parallelism(). When present, the 4-int MAX is done on CPU with zero GPU kernels. The torch.distributed fallback path is kept for standalone / non-engine call sites.

Removes one NCCL AllReduce kernel plus one H2D and one D2H per step (~102 us/step in the 2304-step nanov3 trace). Also removes the stream-ordering barrier that the NCCL kernel introduced on the compute stream. See lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 3.1.
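For contrast, a minimal sketch of the old and new shapes of this reduction; sync_all_reduce_max() is named in the commit, but the communicator interface and the function signatures shown here are assumptions, not the actual AsyncZMQCommunicator API:

```python
import torch
import torch.distributed as dist

def max_reduce_batch_dims_gpu(dims: list[int]) -> list[int]:
    # Old path: H2D copy (tensor construction), NCCL AllReduce(MAX) kernel, D2H copy (.cpu()).
    t = torch.tensor(dims, dtype=torch.int64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MAX)
    return t.cpu().tolist()

def max_reduce_batch_dims_cpu(dims: list[int], ep_zmq_communicator) -> list[int]:
    # New path: blocking host-side exchange over ZMQ; zero GPU kernels, no stream barrier.
    # Method name follows the commit text; its exact signature here is an assumption.
    return ep_zmq_communicator.sync_all_reduce_max(dims)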
The 9 .copy_(non_blocking=True) calls in DynamicInferenceContext.transfer_bookkeeping_to_gpu() each incur ~15us of cudaMemcpyAsync launch overhead for ~1us of actual transfer — the NVTX range is 270us of wallclock with ~6% GPU utilization in the nanov3 trace.

Back all 9 bookkeeping fields (6 per-token + 3 per-request) with one contiguous uint8 buffer on each side (pinned CPU + device GPU), with each attribute as a dtype-correct view onto a slice of the buffer. Layout is shared between DynamicInferenceContext._cpu_bookkeeping_buf and ContextGPUView._buf; int64 fields are placed first so alignment is automatic. The per-step transfer is now a single cudaMemcpyAsync of 32*max_tokens + 12*max_requests bytes (~71 KB for the benchmark config).

Per-token fields are aliased with the source-of-truth attributes because the CPU-side bookkeeping and the GPU forward pass both use the same [:n_tok] slice. Per-request fields have an extra staging area in the coalesced buffer, refreshed each step from the persistent CPU tensors (at [paused_request_count:total_request_count]) into [:n_active], since the forward pass reads them at [:n_active] on GPU but CPU bookkeeping keeps paused requests in [0:paused_request_count).

Correctness verified: test_reset, test_update_request, test_add_request, test_initialize_dynamic_context all pass (8 tests × 4 ranks).
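A sketch of the coalescing idea under assumed field names and counts: one pinned uint8 buffer per side, each field exposed as a dtype-correct view onto a slice, and a single copy per step:

```python
import torch

def make_coalesced_bookkeeping(max_tokens: int, max_requests: int, device: torch.device):
    """Build one pinned CPU buffer and one GPU buffer sharing a single field layout."""

    def field_bytes(dt: torch.dtype, n: int) -> int:
        return n * torch.empty((), dtype=dt).element_size()

    # int64 fields first so every view starts at a naturally aligned offset
    # (hypothetical fields; the real context coalesces 6 per-token + 3 per-request).
    layout = [
        ("token_to_request", torch.int64, max_tokens),
        ("token_to_position", torch.int64, max_tokens),
        ("request_seq_lengths", torch.int32, max_requests),
    ]
    nbytes = sum(field_bytes(dt, n) for _, dt, n in layout)

    cpu_buf = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(nbytes, dtype=torch.uint8, device=device)

    def views(buf: torch.Tensor) -> dict:
        out, off = {}, 0
        for name, dt, n in layout:
            size = field_bytes(dt, n)
            # Reinterpret a byte slice as a dtype-correct view; no copy is made.
            out[name] = buf[off : off + size].view(dt)
            off += size
        return out

    return cpu_buf, gpu_buf, views(cpu_buf), views(gpu_buf)

# Per step, the nine separate copies collapse to one cudaMemcpyAsync:
#   gpu_buf.copy_(cpu_buf, non_blocking=True)
```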
MHAMetadata no longer owns private GPU buffers; GraphedMHAMetadata and NonGraphedMHAMetadata bind to shared views of ContextGPUView._buf (only one is active per step, so sharing storage is safe). initialize_attention_state writes the 5 MHA fields directly into pinned slots in _cpu_bookkeeping_buf, and transfer_bookkeeping_to_gpu's single cudaMemcpyAsync now covers them along with the existing token/request fields -- eliminating the 5 per-step mha.load_from_cpu copies. Per-step state_data is rebuilt via set_state_data using the freshly transferred GPU views plus Python-int max_seqlen scalars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _cpu_bookkeeping_buf / ContextGPUView._buf with a Mamba section (9 int32 fields, hybrid-only) that mirrors MambaMetadata's per-step varlen tensors. MambaMetadata.compute_cpu_metadata() now writes directly into the bound pinned CPU views instead of allocating ephemeral tensors, and load_from_cpu() drops all 9 .copy_() calls -- the coalesced H2D in transfer_bookkeeping_to_gpu() covers the transfer, leaving load_from_cpu() to just alias state attributes onto the freshly-transferred GPU views and run the intermediate-extraction GPU computation. The legacy MambaMetadata.update() path is preserved (it still owns the standalone *_buffer tensors for unit tests that construct MambaMetadata without a context); it's unused on the inference path, so the ~40KB of redundant GPU memory is negligible. Also wires mamba_chunk_size through _allocate_mamba_states so the MambaMetadata's internal chunk_size matches the unified-buffer sizing in ContextGPUView. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ~550 µs decode-step gap between update_requests and
initialize_attention_state is dominated by step_end_event.synchronize()
(CPU waits for GPU to drain so elapsed_time() can be read) plus the
pre/post context_state dict builds. All of that work feeds only the
print block (dynamic_engine.py:1858) and the W&B metrics block
(dynamic_engine.py:1818), both already gated on
logging_step_interval > 0 and step_count % logging_step_interval == 0.
Predict that same condition once at the top of async_forward as
`will_log_this_step` and skip the logging-only work on non-logging steps:
- step_start/end events and elapsed_time (step_time = 0.0)
- pre_step_context_state print-only fields (keep active_token_count
and step_count, used by post_process_requests' pre_fwd_* args)
- kvcache_util_stats computation
- post_step_context_state dict (and drop the two dead fields
padded_active_token_count, using_cuda_graph_this_step that no
consumer reads)
- the pre/post merge (minimal dict keeps kv_stats=None so the
metrics-block gate at 1818 stays well-typed)
In post_process_requests, gate the TPOT update on step_time > 0 so
non-logging steps don't pollute request.tpot with zeros -- the metric
becomes a sparse sample aligned with the same cadence as logging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
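A condensed sketch of that gating; the predicate mirrors the commit text, but the surrounding step function is illustrative rather than the actual async_forward code:

```python
import torch

def run_step(step_count: int, logging_step_interval: int, forward_fn) -> float:
    """Illustrative step wrapper: do timing/logging work only on logging steps."""
    will_log_this_step = (
        logging_step_interval > 0 and step_count % logging_step_interval == 0
    )
    if will_log_this_step:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()

    forward_fn()

    if not will_log_this_step:
        return 0.0  # non-logging steps skip event sync and the context-state dict builds
    end.record()
    end.synchronize()  # the CPU-waits-for-GPU drain now happens only on logging steps
    return start.elapsed_time(end)
```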
_dynamic_step_context_bookkeeping already produces sampled_tokens_cpu via _transfer_samples_to_cpu(), but the outer dict built by async_generate_output_tokens_dynamic_batch was handing out a fresh clone of the GPU _sampled_tokens_cuda buffer. The engine's post_process_requests then called sample.tolist() on that GPU tensor, forcing a D2H sync -- pure overhead inside the update_requests -> initialize_attention_state critical-path gap. Propagate the already-allocated CPU tensor instead: add "sample": sampled_tokens_cpu to the _dynamic_step_context_bookkeeping return dict, drop the outer GPU clone, and keep skip_bookkeeping=True behaving by doing a one-shot .cpu() on that path. The CPU tensor is independent storage (fresh .cpu() allocation, not a view) and isn't mutated by the step -- update_requests only touches new_sample_copy, a separate clone. Net: sample.tolist() becomes pure CPU-to-list, no sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
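In pattern form (illustrative names, not the actual dict keys or call sites), the change hands out the CPU copy produced at the explicit D2H boundary instead of the live GPU buffer:

```python
import torch

def build_step_output(sampled_tokens_cuda: torch.Tensor) -> dict:
    # D2H happens once, at the explicit boundary; .cpu() allocates independent storage.
    sampled_tokens_cpu = sampled_tokens_cuda.cpu()
    # Hand the engine the CPU tensor so a later .tolist() is pure host work, not a hidden sync.
    return {"sample": sampled_tokens_cpu}
```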
3409dcf to 3b6c3dd
Contributor
Author
/ok to test ac90293
Contributor
Author
/ok to test ee4fa04
ee4fa04 to 477d780
Contributor
Author
/ok to test 8305c95
Contributor
Author
/ok to test 8305c95
Contributor
Author
/ok to test a6f6d04
Contributor
Author
/ok to test 8c91a68
# Conflicts:
#   megatron/core/inference/contexts/dynamic_context.py
Contributor
Author
/ok to test 1dda0ac
Contributor
Author
/ok to test 2bb6d5e
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25269187340
Summary
- Move all per-request and per-token bookkeeping tensors in DynamicInferenceContext from GPU to pinned CPU memory
- Introduce ContextGPUView as the single GPU interface for forward-pass code: context.foo is always CPU (source of truth), context.gpu_view.foo is always GPU (snapshot populated per-step by transfer_bookkeeping_to_gpu())
- Defer Mamba state zeroing and prefix cache restore to transfer_bookkeeping_to_gpu(), making add_request() and update_requests() 100% CPU for hybrid models
- Add _transfer_samples_to_cpu() to make the D2H boundary explicit

Test plan
- test_dynamic_context.py — 70 passed
- test_dynamic_engine.py — 145 passed, 89 skipped (pre-existing)
- test_dynamic_events.py — 10 passed
- test_dynamic_prefix_caching.py — 18 passed
- test_dynamic_prefix_caching_coordinator.py — 35 passed, 1 skipped
- test_mamba_metadata.py — 12 passed
- test_prefix_caching_cuda_graphs.py — 7 passed
- test_mamba_prefix_caching_e2e.py — 5 passed

🤖 Generated with Claude Code