[Optimization][DeepSeekV3.2] Reduce slot_mapping compute frequency from twice per layer to a single pre-processing step #7367
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             develop    #7367   +/-   ##
==========================================
  Coverage           ?   73.89%
==========================================
  Files              ?      398
  Lines              ?    54943
  Branches           ?     8606
==========================================
  Hits               ?    40598
  Misses             ?    11632
  Partials           ?     2713
```
This reverts commit 20cfaa4.
PaddlePaddle-bot left a comment
🤖 AI Code Review | 2026-04-16 15:40 CST
📋 Review Summary
PR overview: Optimizes the slot_mapping computation in the DeepSeekV3.2 model from two redundant computations per layer into a single preprocessing step in gpu_model_runner, and cleans up a no-longer-used parameter of mask_encoder_batch.
Change scope: gpu_model_runner, input_batch, the deepseek_v3 model, dsa_attention_backend, and the CUDA operator.
Impact tags: Optimization, Models, OP, Executor
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | gpu_model_runner.py:1284 | `attn_backends[0]` is accessed without an empty-list guard |
| 🟡 Suggestion | gpu_model_runner.py:1296 | `assert` lacks a descriptive error message |
Overall Assessment
The optimization is well designed: lifting the slot_mapping and position_ids computation out of the model layer (deepseek_v3.py) and the operator layer (dsa_attention_backend.py) into a single preprocessing step in gpu_model_runner eliminates the per-layer recomputation, and the profiling data confirms the performance gain. The duplicated compute_slot_mapping function is correctly removed, and the cleanup of the mask_encoder_batch parameter is thorough. Both minor suggestions are robustness improvements and do not block merging.
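For reference, the per-token slot computation being hoisted can be sketched as below. This is a minimal NumPy sketch assuming a conventional paged-KV-cache layout; the function name `precompute_slot_mapping` and the exact argument shapes are illustrative, not the PR's actual signatures:

```python
import numpy as np

def precompute_slot_mapping(position_ids, batch_id_per_token, block_tables, block_size):
    """Map each token's logical position to a physical KV-cache slot (illustrative).

    position_ids:       [num_tokens] token position within its sequence
    batch_id_per_token: [num_tokens] owning request index for each token
    block_tables:       [num_seqs, max_blocks] physical block id per logical block
    """
    block_idx = position_ids // block_size             # logical block per token
    block_offset = position_ids % block_size           # offset within the block
    physical_block = block_tables[batch_id_per_token, block_idx]
    return physical_block * block_size + block_offset  # [num_tokens]
```

Computing this once per step in the runner and stashing the result in forward_meta lets every layer reuse the same tensor instead of recomputing it.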
```python
        Results are stored in self.forward_meta.
        """
        # NOTE(zhushengguang): Only support MLAAttentionBackend and DSAAttentionBackend currently.
        if not isinstance(self.attn_backends[0], (MLAAttentionBackend, DSAAttentionBackend)):
```
🟡 Suggestion: `self.attn_backends[0]` is accessed here without checking whether the list is empty.
By contrast, `_process_reorder` in the same file guards against this with `if self.attn_backends and getattr(...)`. Although `_initialize_attn_backend` runs in `__init__` before this method and guarantees `attn_backends` is non-empty, a consistent defensive style is preferable.
Suggested change:

```python
if not self.attn_backends or not isinstance(self.attn_backends[0], (MLAAttentionBackend, DSAAttentionBackend)):
    return
```

```python
        )
        block_size = self.cache_config.block_size
        block_idx = position_ids // block_size  # [num_tokens]
        assert self.forward_meta.batch_id_per_token.shape == block_idx.shape
```
🟡 Suggestion: This `assert` may be stripped in production (`python -O` mode); add a descriptive error message to ease debugging.

```python
assert self.forward_meta.batch_id_per_token.shape == block_idx.shape, (
    f"Shape mismatch: batch_id_per_token {self.forward_meta.batch_id_per_token.shape} "
    f"vs block_idx {block_idx.shape}"
)
```
Motivation
Profiled with a 5-layer DeepSeekV3.2-Exp-BF16 model.

-------------- Performance gain with CudaGraph disabled --------------
V0: each decode step in the trace takes 17.2ms, each layer 3.2ms; slot_mapping accounts for a noticeable share of the time inside DSA_Index and DSA_attn.
Note: the end goal is to move the preprocessing logic into gpu_model_runner so that it later becomes per-step logic shared by every model, not a trick for one particular model.
V1: CudaGraph disabled + redundant computation removed
(profiling screenshot)
DSA_Indexer and DSA_attn no longer show any slot-computation time.
Result: each decode step in the trace drops to 15.8ms, 1.4ms lower than V0 (+8.13%).
-------------- Performance gain with CudaGraph enabled --------------
(profiling screenshots)
V2: CudaGraph enabled + redundant slot_mapping computation:
each decode step in the trace takes about 7.2ms.
V3: CudaGraph enabled + redundant computation removed:
each decode step in the trace takes about 6.2ms, 1ms lower than V2 (+13.88%).
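The reported percentages follow directly from the step times above (small rounding differences aside):

```python
v0, v1 = 17.2, 15.8  # ms per decode step, CudaGraph off
v2, v3 = 7.2, 6.2    # ms per decode step, CudaGraph on
print(f"{(v0 - v1) / v0:.1%}")  # -> 8.1%
print(f"{(v2 - v3) / v2:.1%}")  # -> 13.9%
```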
Modifications
This optimization reduces the slot_mapping computation in the DeepSeekV3.2 model from twice per layer to a single preprocessing step, significantly cutting redundant work, and moves the shared slot_mapping and position_ids logic up into the gpu_model_runner layer.
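A schematic of the restructuring, with a call counter standing in for the real kernel (all names here are illustrative; the actual code lives in gpu_model_runner and the attention backends):

```python
calls = {"n": 0}

def compute_slot_mapping(meta):
    """Stand-in for the real slot_mapping kernel; just counts invocations."""
    calls["n"] += 1
    return [p % 4 for p in meta["positions"]]

def layer(hidden, indexer_slots, attn_slots):
    return hidden  # attention math elided

def forward_before(num_layers, hidden, meta):
    # V0: both the DSA indexer and DSA attention recompute per layer.
    for _ in range(num_layers):
        hidden = layer(hidden, compute_slot_mapping(meta), compute_slot_mapping(meta))
    return hidden

def forward_after(num_layers, hidden, meta):
    # V1: one preprocessing pass per step; every layer reuses the cached result.
    slots = compute_slot_mapping(meta)
    for _ in range(num_layers):
        hidden = layer(hidden, slots, slots)
    return hidden
```

With 5 layers, the before version invokes the kernel 10 times per step; the after version invokes it once.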
Core changes touch 5 files.
Minor changes: removed an unnecessary parameter from the get_position_ids_and_mask_encoder_batch operator and adjusted the unit-test file.
Usage or Command
None
Accuracy Tests
Checklist
- PR title tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.