Move inference context bookkeeping to CPU with ContextGPUView #4306

Merged
lmcafee-nvidia merged 33 commits into NVIDIA:main from lmcafee-nvidia:context-cpu
May 3, 2026

Conversation

@lmcafee-nvidia
Contributor

Summary

  • Move all per-request and per-token bookkeeping tensors in DynamicInferenceContext from GPU to pinned CPU memory
  • Introduce ContextGPUView as the single GPU interface for forward-pass code: context.foo is always CPU (the source of truth), context.gpu_view.foo is always GPU (a snapshot populated each step by transfer_bookkeeping_to_gpu()); see the sketch after this list
  • Defer Mamba state zeroing and prefix cache restore to transfer_bookkeeping_to_gpu(), making add_request() and update_requests() 100% CPU for hybrid models
  • Add explicit _transfer_samples_to_cpu() D2H boundary
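
A minimal sketch of the context/gpu_view contract, assuming a single hypothetical bookkeeping field (`token_positions`); the real classes carry many more fields, and later commits coalesce the per-field copies into one buffer:

```python
import torch

# Hedged sketch of the context/gpu_view split; field names are hypothetical,
# only the pattern follows the PR description.
class GPUViewSketch:
    def __init__(self, max_tokens: int) -> None:
        # GPU snapshot, refreshed once per step.
        self.token_positions = torch.empty(
            max_tokens, dtype=torch.int64, device="cuda"
        )

class ContextSketch:
    def __init__(self, max_tokens: int) -> None:
        # CPU source of truth, pinned so the H2D copy can be non-blocking.
        self.token_positions = torch.zeros(
            max_tokens, dtype=torch.int64, pin_memory=True
        )
        self.gpu_view = GPUViewSketch(max_tokens)

    def transfer_bookkeeping_to_gpu(self) -> None:
        # Per-step snapshot: async H2D copy from pinned memory.
        self.gpu_view.token_positions.copy_(
            self.token_positions, non_blocking=True
        )
```

Bookkeeping code (add_request, update_requests) mutates `context.token_positions` on the CPU; forward-pass code reads only `context.gpu_view.token_positions` after the per-step transfer.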

Test plan

  • test_dynamic_context.py — 70 passed
  • test_dynamic_engine.py — 145 passed, 89 skipped (pre-existing)
  • test_dynamic_events.py — 10 passed
  • test_dynamic_prefix_caching.py — 18 passed
  • test_dynamic_prefix_caching_coordinator.py — 35 passed, 1 skipped
  • test_mamba_metadata.py — 12 passed
  • test_prefix_caching_cuda_graphs.py — 7 passed
  • test_mamba_prefix_caching_e2e.py — 5 passed

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread on megatron/core/inference/contexts/mamba_slot_allocator.py (outdated)
@wdykas
Contributor

wdykas commented Apr 14, 2026

I like this PR. I can rebase some of my stuff (#4240) on top of this and then we can see what speedup we get

lmcafee-nvidia and others added 6 commits April 16, 2026 14:42
This is the "baseline" commit referenced in all future timing discussions
for the context-cpu work.  It branches directly from main and adds NVTX
ranges for the 5 inference loop stages that exist on main:

  - initialize_attention_state
  - forward_pass
  - sampling
  - active_request_mask
  - update_requests

All ranges nest inside the existing "Prefill"/"Decode" range from
dynamic_engine.py, enabling nsys analysis of per-stage timing.

Subsequent commits on this branch will add context-cpu-specific
optimizations AND extend the NVTX range set with 2 transfer stages
(transfer_bookkeeping_to_gpu, transfer_samples_to_cpu) that don't
exist on main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
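
In code, instrumentation like this typically reads as follows (a sketch: the stage names come from the commit message, the wrapper is an assumption; `torch.cuda.nvtx.range_push`/`range_pop` are the real API):

```python
import torch
from contextlib import contextmanager

# Illustrative NVTX wrapper for the per-stage ranges described above.
@contextmanager
def nvtx_range(name: str):
    torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()

# Inside the per-step loop, nested under the existing "Prefill"/"Decode"
# range from dynamic_engine.py:
#   with nvtx_range("initialize_attention_state"):
#       ...
#   with nvtx_range("forward_pass"):
#       ...
```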
Move all per-request and per-token bookkeeping tensors in
DynamicInferenceContext from GPU to pinned CPU memory. Introduce
ContextGPUView as the single GPU interface for forward-pass code:
context.foo is always CPU (source of truth), context.gpu_view.foo
is always GPU (snapshot populated per-step by transfer_bookkeeping_to_gpu).

This eliminates CPU-GPU device mixing by establishing a clear
architectural boundary -- GPU code reads from gpu_view, CPU bookkeeping
reads from context directly. The gpu_view is populated once per step
with non-blocking pinned-memory copies.

Key changes:
- New gpu_view.py with ContextGPUView (6 token-level + 3 request-level
  GPU staging tensors)
- All request/token tensors in dynamic_context.py moved to CPU with
  pin_memory=True
- transfer_bookkeeping_to_gpu() populates gpu_view each step
- text_generation_controller.py reads gpu_view for GPU-phase ops
  (sampling, verification, log-probs)
- Post-rewind code reads CPU context directly (not stale gpu_view)
- mamba_slot_allocator.py fixed for CPU bookkeeping indexing

302 tests pass, 0 failures (90 pre-existing skips).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defer Mamba state zeroing and prefix cache restore from add_request()
and update_requests() to transfer_bookkeeping_to_gpu(), making both
methods 100% CPU for hybrid models. Mamba compute_and_store_offsets
remains immediate since commit_intermediate_states depends on its
CPU-side state.

Add _transfer_samples_to_cpu() to make the D2H boundary explicit.

302 tests pass (each suite run separately), 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
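
A hypothetical sketch of the deferral pattern, not the PR's actual code: add_request() records the GPU work instead of launching it, and transfer_bookkeeping_to_gpu() replays it as the first GPU phase of the step.

```python
from typing import Callable

class DeferredGpuOps:
    def __init__(self) -> None:
        self._pending: list[Callable[[], None]] = []

    def defer(self, op: Callable[[], None]) -> None:
        # CPU-only: appending a closure launches no kernels.
        self._pending.append(op)

    def flush(self) -> None:
        # Replay the deferred zeroing/restore work on the GPU.
        for op in self._pending:
            op()
        self._pending.clear()

# e.g. inside add_request() (hypothetical names):
#   deferred.defer(lambda: mamba_states[slot].zero_())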
Launch Mamba state zeroing/restore at the start of
initialize_attention_state() instead of transfer_bookkeeping_to_gpu().
This allows the GPU ops to overlap with the CPU work that follows
(batch dimension computation, MHA metadata, token padding).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split mamba_metadata.update() into compute_cpu_metadata() (CPU, called
in initialize_attention_state) and load_from_cpu() (H2D copies, called
in transfer_bookkeeping_to_gpu). This eliminates GPU kernel launches
for batch indices, cu_seqlens, chunk boundaries, and conv1d metadata.
The intermediate state extraction still uses GPU (_update_intermediate_metadata).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
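
A hedged sketch of the compute_cpu_metadata()/load_from_cpu() split, assuming an illustrative single buffer (the real MambaMetadata carries batch indices, cu_seqlens, chunk boundaries, and conv1d metadata):

```python
import torch

class MambaMetadataSketch:
    def __init__(self, max_requests: int) -> None:
        # Pinned CPU staging plus a persistent GPU buffer.
        self._cu_seqlens_cpu = torch.zeros(
            max_requests + 1, dtype=torch.int32, pin_memory=True
        )
        self._cu_seqlens_gpu = torch.zeros(
            max_requests + 1, dtype=torch.int32, device="cuda"
        )
        self._count = 0

    def compute_cpu_metadata(self, seq_lengths: list[int]) -> None:
        # Pure CPU work (called in initialize_attention_state): no kernels.
        self._count = len(seq_lengths)
        total = 0
        for i, n in enumerate(seq_lengths):
            total += n
            self._cu_seqlens_cpu[i + 1] = total

    def load_from_cpu(self) -> None:
        # H2D copies only (called in transfer_bookkeeping_to_gpu).
        self._cu_seqlens_gpu[: self._count + 1].copy_(
            self._cu_seqlens_cpu[: self._count + 1], non_blocking=True
        )
```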
Move _intermediate_offsets_gpu, _intermediate_counts_gpu,
_intermediate_block_ids_gpu, _eos_cache_block_id_gpu from GPU to CPU.
Block IDs and EOS block IDs are pure CPU bookkeeping (consumed by
commit_intermediate_states via .tolist()). Offsets and counts keep a
GPU buffer for _update_intermediate_metadata to consume; the H2D copy
is handled by that method on first use.

Eliminates ~5 GPU writes per add_request and 2 .tolist() D2H syncs
per commit_intermediate_states.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
root and others added 8 commits April 17, 2026 07:49
…init split

PR NVIDIA#4225 extracted argument parsing out of initialize_megatron(); call
parse_and_validate_args() separately and invoke initialize_megatron() with
no arguments.
Drop the 5 per-step GPU kernel launches in MHAMetadata.reset()
(query_lengths, cu_query_seq_lengths, cu_kv_seq_lengths, kv_seq_lengths,
block_table). The next update() / load_from_cpu() fully overwrites the
slice of each buffer that the forward pass will read (via state_data[:n]),
so clearing here is redundant paranoia from the old GPU-resident design.

Removes 10 vectorized_elementwise_kernel launches per step
(5 buffers x Graphed + NonGraphed metadata). See
lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 1.
adjust_batch_dims_for_expert_parallelism() runs on every inference step
to pick a CUDA graph batch dimension consistent across EP ranks. It
performed a torch.distributed.all_reduce(MAX) on a 4-int GPU tensor
sandwiched between an H2D copy (tensor construction) and a D2H copy
(.cpu() to read the result on the host).

Add a sync_all_reduce_max() method to AsyncZMQCommunicator that uses
blocking ZMQ send/recv on the CPU. When the engine has created the EP
ZMQ communicator, it is attached to the context via
DynamicInferenceContext.set_ep_zmq_communicator(), which in turn is
forwarded to match_graph_config() / adjust_batch_dims_for_expert_parallelism().
When present, the 4-int MAX is done on CPU with zero GPU kernels.
The torch.distributed fallback path is kept for standalone / non-engine
call sites.

Removes one NCCL AllReduce kernel plus one H2D and one D2H per step
(~102 us/step in the 2304-step nanov3 trace). Also removes the
stream-ordering barrier that the NCCL kernel introduced on the compute
stream. See lawrence/reports/20260417-bookkeeping-gpu-ops.md, Section 3.1.
The 9 .copy_(non_blocking=True) calls in DynamicInferenceContext
.transfer_bookkeeping_to_gpu() each incur ~15us of cudaMemcpyAsync launch
overhead for ~1us of actual transfer — the NVTX range is 270us of wallclock
with ~6% GPU utilization in the nanov3 trace.

Back all 9 bookkeeping fields (6 per-token + 3 per-request) with one
contiguous uint8 buffer on each side (pinned CPU + device GPU), with each
attribute as a dtype-correct view onto a slice of the buffer. Layout is
shared between DynamicInferenceContext._cpu_bookkeeping_buf and
ContextGPUView._buf; int64 fields are placed first so alignment is
automatic. The per-step transfer is now a single cudaMemcpyAsync of
32*max_tokens + 12*max_requests bytes (~71 KB for the benchmark config).

Per-token fields are aliased with the source-of-truth attributes because
the CPU-side bookkeeping and the GPU forward pass both use the same
[:n_tok] slice. Per-request fields have an extra staging area in the
coalesced buffer, refreshed each step from the persistent CPU tensors
(at [paused_request_count:total_request_count]) into [:n_active], since
the forward pass reads them at [:n_active] on GPU but CPU bookkeeping
keeps paused requests in [0:paused_request_count).

Correctness verified: test_reset, test_update_request, test_add_request,
test_initialize_dynamic_context all pass (8 tests × 4 ranks).
MHAMetadata no longer owns private GPU buffers; GraphedMHAMetadata and
NonGraphedMHAMetadata bind to shared views of ContextGPUView._buf (only
one is active per step, so sharing storage is safe). initialize_attention_state
writes the 5 MHA fields directly into pinned slots in _cpu_bookkeeping_buf,
and transfer_bookkeeping_to_gpu's single cudaMemcpyAsync now covers them
along with the existing token/request fields -- eliminating the 5 per-step
mha.load_from_cpu copies. Per-step state_data is rebuilt via set_state_data
using the freshly transferred GPU views plus Python-int max_seqlen scalars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends _cpu_bookkeeping_buf / ContextGPUView._buf with a Mamba section
(9 int32 fields, hybrid-only) that mirrors MambaMetadata's per-step varlen
tensors. MambaMetadata.compute_cpu_metadata() now writes directly into the
bound pinned CPU views instead of allocating ephemeral tensors, and
load_from_cpu() drops all 9 .copy_() calls -- the coalesced H2D in
transfer_bookkeeping_to_gpu() covers the transfer, leaving load_from_cpu()
to just alias state attributes onto the freshly-transferred GPU views and
run the intermediate-extraction GPU computation.

The legacy MambaMetadata.update() path is preserved (it still owns the
standalone *_buffer tensors for unit tests that construct MambaMetadata
without a context); it's unused on the inference path, so the ~40KB of
redundant GPU memory is negligible.

Also wires mamba_chunk_size through _allocate_mamba_states so the
MambaMetadata's internal chunk_size matches the unified-buffer sizing in
ContextGPUView.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ~550 µs decode-step gap between update_requests and
initialize_attention_state is dominated by step_end_event.synchronize()
(CPU waits for GPU to drain so elapsed_time() can be read) plus the
pre/post context_state dict builds. All of that work feeds only the
print block (dynamic_engine.py:1858) and the W&B metrics block
(dynamic_engine.py:1818), both already gated on
  logging_step_interval > 0 and step_count % logging_step_interval == 0.

Predict that same condition once at the top of async_forward as
`will_log_this_step` and skip the logging-only work on non-logging steps:

  - step_start/end events and elapsed_time (step_time = 0.0)
  - pre_step_context_state print-only fields (keep active_token_count
    and step_count, used by post_process_requests' pre_fwd_* args)
  - kvcache_util_stats computation
  - post_step_context_state dict (and drop the two dead fields
    padded_active_token_count, using_cuda_graph_this_step that no
    consumer reads)
  - the pre/post merge (minimal dict keeps kv_stats=None so the
    metrics-block gate at 1818 stays well-typed)

In post_process_requests, gate the TPOT update on step_time > 0 so
non-logging steps don't pollute request.tpot with zeros -- the metric
becomes a sparse sample aligned with the same cadence as logging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
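
A hedged sketch of the gating: the predicate and event usage follow the commit message, while the function boundary and run_step callable are assumptions standing in for the engine code.

```python
import torch

def timed_step(run_step, step_count: int, logging_step_interval: int) -> float:
    will_log_this_step = (
        logging_step_interval > 0 and step_count % logging_step_interval == 0
    )
    if not will_log_this_step:
        run_step()
        return 0.0  # non-logging steps skip events and the GPU drain

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    run_step()
    end.record()
    end.synchronize()  # the sync that previously ran on every step
    return start.elapsed_time(end) / 1000.0  # seconds
```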
_dynamic_step_context_bookkeeping already produces sampled_tokens_cpu
via _transfer_samples_to_cpu(), but the outer dict built by
async_generate_output_tokens_dynamic_batch was handing out a fresh clone
of the GPU _sampled_tokens_cuda buffer. The engine's post_process_requests
then called sample.tolist() on that GPU tensor, forcing a D2H sync --
pure overhead inside the update_requests -> initialize_attention_state
critical-path gap.

Propagate the already-allocated CPU tensor instead: add "sample":
sampled_tokens_cpu to the _dynamic_step_context_bookkeeping return dict,
drop the outer GPU clone, and keep skip_bookkeeping=True behaving by
doing a one-shot .cpu() on that path. The CPU tensor is independent
storage (fresh .cpu() allocation, not a view) and isn't mutated by the
step -- update_requests only touches new_sample_copy, a separate clone.
Net: sample.tolist() becomes pure CPU-to-list, no sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
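
A before/after sketch of the D2H behavior; the tensor names follow the commit message, the function boundaries are assumptions:

```python
import torch

def read_samples_before(sampled_tokens_cuda: torch.Tensor) -> list[int]:
    sample = sampled_tokens_cuda.clone()  # fresh clone, but still on GPU
    return sample.tolist()                # implicit D2H sync on the hot path

def read_samples_after(sampled_tokens_cpu: torch.Tensor) -> list[int]:
    # sampled_tokens_cpu comes from _transfer_samples_to_cpu(); it is
    # independent CPU storage, so tolist() launches nothing and syncs nothing.
    return sampled_tokens_cpu.tolist()
```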
@lmcafee-nvidia
Contributor Author

/ok to test ac90293

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 27, 2026
@lmcafee-nvidia lmcafee-nvidia marked this pull request as ready for review April 27, 2026 13:33
@lmcafee-nvidia lmcafee-nvidia requested review from a team as code owners April 27, 2026 13:33
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 27, 2026 13:33
@lmcafee-nvidia
Contributor Author

/ok to test ee4fa04

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Approved (All necessary approvals have been made) label May 2, 2026
@lmcafee-nvidia
Contributor Author

/ok to test 8305c95

@lmcafee-nvidia
Contributor Author

/ok to test 8305c95

@lmcafee-nvidia
Contributor Author

/ok to test a6f6d04

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved (All necessary approvals have been made) label May 2, 2026
@lmcafee-nvidia
Contributor Author

/ok to test 8c91a68

# Conflicts:
#	megatron/core/inference/contexts/dynamic_context.py
@lmcafee-nvidia
Contributor Author

/ok to test 1dda0ac

@lmcafee-nvidia
Contributor Author

/ok to test 2bb6d5e

@lmcafee-nvidia lmcafee-nvidia added this pull request to the merge queue May 3, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25269187340

Merged via the queue into NVIDIA:main with commit 342dd59 May 3, 2026
64 of 67 checks passed
@lmcafee-nvidia lmcafee-nvidia deleted the context-cpu branch May 3, 2026 04:28

Labels

Approved (All necessary approvals have been made), complexity: high

7 participants