Optimize Deepseek V4 prepare decode by ZhangLirong-amd · Pull Request #728 · ROCm/ATOM

ZhangLirong-amd · 2026-05-09T08:39:48Z

Motivation

1. Stream-overlapped H2D transfers

Adopts the same prep_stream pattern used by AiterMLAMetadataBuilder: fires 5 H2D copies on model_runner.async_execute_stream so that they overlap with CPU-side plan building (_build_compress_plans).

2. Replace GPU window_topk with CPU numpy (`_build_window_topk_np`)

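The existing _build_window_topk_batched uses ~15 PyTorch elementwise GPU ops (arange, where, clamp, mod, etc.) on small decode tensors (e.g. [128, 128]). Each op launches a separate GPU kernel, so launch overhead dominates at this size; the NumPy version builds the same indices on the CPU instead.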
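As a rough illustration of why this maps well to NumPy, here is a minimal sketch of a sliding-window index builder done with a single broadcast on the CPU. It is not the PR's `_build_window_topk_np`: the function name `window_topk_sketch`, the use of absolute positions, and the `-1` padding for slots before the start of the sequence are all assumptions made for the example.

```python
import numpy as np


def window_topk_sketch(positions: np.ndarray, window: int) -> np.ndarray:
    """Hypothetical helper: for each token position p, return the indices
    p - window + 1 .. p, with indices below 0 replaced by -1 (assumed padding)."""
    positions = np.asarray(positions, dtype=np.int64)
    offsets = np.arange(window, dtype=np.int64)  # shape [W]
    # Broadcast [T, 1] against [1, W] to get all window indices at once.
    idx = positions[:, None] - (window - 1) + offsets[None, :]
    return np.where(idx >= 0, idx, -1).astype(np.int32)


topk = window_topk_sketch(np.array([0, 2, 5]), window=4)
print(topk)
```

The whole table is produced by a handful of CPU array operations with no kernel launches, and the result can then be staged to the device with one H2D copy.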

3. Vectorize HCA paged-offset for-loop

Replaces the per-token Python for-loop in _attach_v4_paged_decode_meta (O(T) iterations with slice writes) with vectorized numpy using np.repeat / np.cumsum / fancy indexing, which eliminates 256 loop iterations for bs=128 MTP-1.
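The general np.repeat / np.cumsum / fancy-indexing technique can be sketched as follows. This is not the code from `_attach_v4_paged_decode_meta`: the function names, the inputs (`counts`, `seq_ids`, `block_table`, `block_size`), and the offset formula are hypothetical and chosen only to show how a per-token loop with slice writes turns into a few array operations.

```python
import numpy as np


def paged_offsets_loop(counts, seq_ids, block_table, block_size):
    """Reference version: one Python iteration and one slice write per token."""
    cu = np.zeros(len(counts) + 1, dtype=np.int64)
    cu[1:] = np.cumsum(counts)
    out = np.empty(cu[-1], dtype=np.int64)
    for t in range(len(counts)):
        logical = np.arange(counts[t])
        pages = block_table[seq_ids[t], logical // block_size]
        out[cu[t]:cu[t + 1]] = pages * block_size + logical % block_size
    return out, cu


def paged_offsets_vectorized(counts, seq_ids, block_table, block_size):
    """Same result without a Python loop over tokens."""
    counts = np.asarray(counts, dtype=np.int64)
    cu = np.zeros(len(counts) + 1, dtype=np.int64)
    cu[1:] = np.cumsum(counts)
    total = int(cu[-1])
    # Logical index of each flat entry within its own token's segment.
    logical = np.arange(total, dtype=np.int64) - np.repeat(cu[:-1], counts)
    # Block-table row (sequence) that each flat entry belongs to.
    rows = np.repeat(np.asarray(seq_ids, dtype=np.int64), counts)
    # Fancy indexing gathers every physical page in one call.
    pages = block_table[rows, logical // block_size]
    return pages * block_size + logical % block_size, cu


block_table = np.array([[7, 2], [4, 9]])
counts, seq_ids = [3, 5, 0, 2], [0, 0, 1, 1]
ref, ref_cu = paged_offsets_loop(counts, seq_ids, block_table, 4)
vec, vec_cu = paged_offsets_vectorized(counts, seq_ids, block_table, 4)
print(vec, vec_cu)
```

Keeping the loop version around as a reference makes it easy to check the vectorized path for equality on small inputs, including tokens with zero entries.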

before:

after:

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 3 | exact_match | ↑ | 0.96 | ± | 0.0197 |
| | | strict-match | 3 | exact_match | ↑ | 0.96 | ± | 0.0197 |

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR aims to speed up Deepseek V4 decode metadata preparation by reducing small GPU kernel launch overhead and overlapping host→device copies with CPU-side plan building.

Changes:

- Add a NumPy implementation of the sliding-window window_topk builder to avoid multiple small GPU ops for small decode batches.
- Overlap decode-time H2D staging on a dedicated CUDA stream (prep_stream) with CPU work (_build_compress_plans).
- Vectorize part of the CPU-side HCA paged-offset construction and adjust _attach_v4_per_fwd_meta to consume CPU positions_np.


+        prep_stream.wait_stream(current_stream)
+        with torch.cuda.stream(prep_stream):
+            positions = var["positions"].copy_to_gpu(sum_scheduled_tokens)
+            cu_seqlens_q_gpu = var["cu_seqlens_q"].copy_to_gpu(scheduled_bs + 1)
+            context_lens_gpu = var["context_lens"].copy_to_gpu(scheduled_bs)
+            block_tables_gpu = var["block_tables"].copy_to_gpu(scheduled_bs)
+            state_slot_gpu = ss_buf.copy_to_gpu(scheduled_bs)


Optimize Deepseek V4 prepare decode

d01fb7a

Copilot AI review requested due to automatic review settings May 9, 2026 08:39

Copilot started reviewing on behalf of ZhangLirong-amd May 9, 2026 08:41

Copilot AI reviewed May 9, 2026

valarLip approved these changes May 9, 2026

valarLip merged commit fc64fb6 into main May 9, 2026
18 of 32 checks passed

valarLip deleted the dsv4_optimize branch May 9, 2026 15:13
