Optimize Deepseek V4 prepare decode #728

Merged: valarLip merged 1 commit into main from dsv4_optimize on May 9, 2026

ZhangLirong-amd commented May 9, 2026

Motivation

1. Stream-overlapped H2D transfers

This adopts the same prep_stream pattern used by AiterMLAMetadataBuilder: the 5 H2D copies are issued on model_runner.async_execute_stream so that they overlap with CPU-side plan building (_build_compress_plans).
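A minimal sketch of the overlap pattern, assuming only plain PyTorch stream APIs. The helper name `overlapped_h2d` and its arguments are hypothetical and are not the code in this PR; `copy_to_gpu` and `async_execute_stream` are stood in for by `Tensor.to(..., non_blocking=True)` and a fresh `torch.cuda.Stream`. Host tensors need to be pinned for the copies to be truly asynchronous.

```python
import torch


def overlapped_h2d(host_tensors, cpu_work):
    """Issue H2D copies on a side stream, run cpu_work() meanwhile, then join.

    Hypothetical helper for illustration only.
    """
    if not torch.cuda.is_available():
        # CPU fallback so the sketch runs anywhere; nothing overlaps here.
        result = cpu_work()
        return [t.clone() for t in host_tensors], result

    current = torch.cuda.current_stream()
    prep_stream = torch.cuda.Stream()

    # The side stream must not start before work already queued on `current`.
    prep_stream.wait_stream(current)
    with torch.cuda.stream(prep_stream):
        device_tensors = [t.to("cuda", non_blocking=True) for t in host_tensors]

    # CPU-side work runs while the copies are in flight.
    result = cpu_work()

    # Kernels launched later on `current` wait for the copies to finish.
    current.wait_stream(prep_stream)
    for t in device_tensors:
        # Tell the caching allocator that `current` also uses this memory.
        t.record_stream(current)
    return device_tensors, result


host = [torch.arange(4), torch.ones(3)]
dev, res = overlapped_h2d(host, lambda: sum(range(10)))
```

The gain depends on the CPU work taking about as long as the copies; if either side is much shorter, the overlap saves little.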

2. Replace GPU window_topk with CPU numpy (_build_window_topk_np)

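The existing _build_window_topk_batched uses ~15 elementwise PyTorch GPU ops (arange, where, clamp, mod, etc.) on small decode tensors (e.g. [128, 128]). Each op launches a separate GPU kernel, so the launch overhead adds up for tensors this small. _build_window_topk_np builds the same table on the CPU with NumPy instead.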
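A rough NumPy sketch of the idea. The exact semantics of `window_topk` are not spelled out here, so this assumes each row lists the ring-buffer slots of the most recent `window` positions for one token, oldest first and padded with -1. The function name and signature are hypothetical, not the PR's `_build_window_topk_np`.

```python
import numpy as np


def build_window_topk_np(positions, window):
    """Per-token sliding-window indices, computed on the CPU.

    Assumed layout (illustrative only): row t holds the ring-buffer slots
    (absolute position mod window) of the last `window` positions up to and
    including positions[t], oldest first, with -1 marking unused entries.
    """
    positions = np.asarray(positions, dtype=np.int64)
    offsets = np.arange(window, dtype=np.int64)              # [W]
    start = np.maximum(positions - window + 1, 0)[:, None]   # [T, 1]
    abs_idx = start + offsets[None, :]                       # [T, W]
    valid = abs_idx <= positions[:, None]                    # mask future slots
    slots = abs_idx % window                                 # ring-buffer slot
    return np.where(valid, slots, -1).astype(np.int32)


out = build_window_topk_np(np.array([0, 2, 5]), window=4)
```

The whole table is built with a handful of array operations on the host, so the GPU only sees one copy of the finished result instead of a chain of small kernels.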

3. Vectorize HCA paged-offset for-loop

Replaces the per-token Python for-loop in _attach_v4_paged_decode_meta (O(T) iterations with slice writes) with vectorized numpy using np.repeat / np.cumsum / fancy indexing. This eliminates 256 loop iterations for bs=128 MTP-1 (two tokens per request).
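The loop-to-vector rewrite can be sketched as follows. This is an illustration under assumed inputs, not the PR's code: `counts` (entries per token), `token_req` (request index of each token), `block_table`, and `block_size` are hypothetical names, and the offset formula `block * block_size + j % block_size` is a generic paged-cache layout chosen for the example.

```python
import numpy as np


def paged_offsets_loop(counts, token_req, block_table, block_size):
    """Reference version: one Python iteration and one slice write per token."""
    block_table = np.asarray(block_table, dtype=np.int64)
    cu = np.concatenate(([0], np.cumsum(counts)))
    out = np.empty(cu[-1], dtype=np.int64)
    for t in range(len(counts)):
        j = np.arange(counts[t])
        blocks = block_table[token_req[t], j // block_size]
        out[cu[t]:cu[t + 1]] = blocks * block_size + j % block_size
    return out


def paged_offsets_vec(counts, token_req, block_table, block_size):
    """Same result with np.repeat / np.cumsum / fancy indexing, no Python loop."""
    counts = np.asarray(counts, dtype=np.int64)
    block_table = np.asarray(block_table, dtype=np.int64)
    total = int(counts.sum())
    starts = np.cumsum(counts) - counts                  # exclusive prefix sum
    tok = np.repeat(np.arange(len(counts)), counts)      # owning token per entry
    j = np.arange(total) - np.repeat(starts, counts)     # index within the token
    rows = np.asarray(token_req)[tok]                    # block-table row per entry
    return block_table[rows, j // block_size] * block_size + j % block_size


counts = [3, 0, 5]
token_req = [0, 0, 1]
block_table = [[7, 2], [4, 9]]
loop_out = paged_offsets_loop(counts, token_req, block_table, block_size=4)
vec_out = paged_offsets_vec(counts, token_req, block_table, block_size=4)
```

Keeping the loop version around as a reference makes it easy to check the vectorized path against it on random inputs.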

before:
image

after:
image

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 3 | exact_match | 0.96 | ± 0.0197 |
| | | strict-match | 3 | exact_match | 0.96 | ± 0.0197 |

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings May 9, 2026 08:39

Copilot AI left a comment

Pull request overview

This PR aims to speed up Deepseek V4 decode metadata preparation by reducing small GPU kernel launch overhead and overlapping host→device copies with CPU-side plan building.

Changes:

  • Add a NumPy implementation of the sliding-window window_topk builder to avoid multiple small GPU ops for small decode batches.
  • Overlap decode-time H2D staging on a dedicated CUDA stream (prep_stream) with CPU work (_build_compress_plans).
  • Vectorize part of the CPU-side HCA paged-offset construction and adjust _attach_v4_per_fwd_meta to consume CPU positions_np.

Comment on lines +913 to +919

prep_stream.wait_stream(current_stream)
with torch.cuda.stream(prep_stream):
    positions = var["positions"].copy_to_gpu(sum_scheduled_tokens)
    cu_seqlens_q_gpu = var["cu_seqlens_q"].copy_to_gpu(scheduled_bs + 1)
    context_lens_gpu = var["context_lens"].copy_to_gpu(scheduled_bs)
    block_tables_gpu = var["block_tables"].copy_to_gpu(scheduled_bs)
    state_slot_gpu = ss_buf.copy_to_gpu(scheduled_bs)

@valarLip valarLip merged commit fc64fb6 into main May 9, 2026
18 of 32 checks passed
@valarLip valarLip deleted the dsv4_optimize branch May 9, 2026 15:13

3 participants