[XPU] add verify draft tokens#6947
Thanks for your contribution!
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

| Coverage Diff | develop | #6947 |
|---|---|---|
| Coverage | ? | 73.51% |
| Files | ? | 383 |
| Lines | ? | 53614 |
| Branches | ? | 8412 |
| Hits | ? | 39417 |
| Misses | ? | 11516 |
| Partials | ? | 2681 |

Flags with carried forward coverage won't be shown.
Pull request overview
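This PR adds a plugin/operator implementation of verify_draft_tokens on the XPU (Kunlun3) side and exposes it to Python through pybind. It also adds the corresponding unit tests. The operator is used in speculative decoding to verify draft tokens and to produce the Phase 2 sampling output.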
Changes:
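- Add an XPU3 kernel, `verify_draft_tokens.xpu`, implementing the Phase 1 verify and Phase 2 output logic
- Add a plugin wrapper, a Paddle extension operator, and a pybind export, providing the external `verify_draft_tokens` entry point
- Add fairly complete Python unit tests covering different verify_strategy and verify_window values, EOS/max_dec_len, skip conditions, and more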
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| custom_ops/xpu_ops/test/test_verify_draft_tokens.py | Adds Python unit tests and a reference implementation for verify_draft_tokens |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/verify_draft_tokens.cpp | Adds the verify_draft_tokens plugin wrapper (CPU/XPU3 wrappers plus parameter validation) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/verify_draft_tokens.xpu | Adds the Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Adds the verify_draft_tokens plugin function declaration |
| custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Adds the pybind export of verify_draft_tokens |
| custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc | Adds the host-side implementation and parameter validation of the Paddle extension operator |
| custom_ops/xpu_ops/src/ops/mtp/speculate_verify.cc | Adjusts the random number generation logic of speculate_verify (persistent seed/offset) |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_verify.cpp | Removes duplicate typedef declarations |
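
To illustrate what the persistent seed/offset change in the speculate_verify.cc row may be aiming at, here is a minimal CPU-only sketch. It is not the code from this PR: the class `PersistentSeedOffset`, the function `fixed_seed_batch`, and the way the seed and offset are combined are all hypothetical, and Python's `random` module merely stands in for the device-side generator. The point is only that a fixed seed replays the same stream on every call, while a stored offset that advances after each call yields fresh draws that remain reproducible.

```python
import random
from typing import List


def fixed_seed_batch(seed: int, n: int) -> List[float]:
    """Fixed-seed behaviour: every call re-seeds and replays the same stream."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


class PersistentSeedOffset:
    """Hypothetical stand-in for a generator state that persists across calls."""

    def __init__(self, seed: int) -> None:
        self.seed = seed
        self.offset = 0  # number of random values consumed so far

    def uniform_batch(self, n: int) -> List[float]:
        # Derive the stream from (seed, offset); this combination is an
        # assumption made for illustration, not the PR's actual scheme.
        rng = random.Random(f"{self.seed}:{self.offset}")
        values = [rng.random() for _ in range(n)]
        self.offset += n  # advance so the next call does not repeat this batch
        return values
```

With this sketch, two consecutive `uniform_batch` calls return different values, while a second object built with the same seed reproduces the first batch.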
PaddlePaddle-bot
left a comment
🤖 AI Code Review | 2026-04-14
📋 Review summary
PR overview: adds a verify_draft_tokens operator for XPU hardware. It supports three verification strategies (TOPP, GREEDY, TARGET_MATCH) and is used to verify draft tokens in speculative decoding.
Scope of changes: custom_ops/xpu_ops/ (7 files added, 1 file modified)
- Add verify_draft_token.cc (host-side implementation)
- Add verify_draft_tokens.xpu (XPU kernel)
- Add verify_draft_tokens.cpp (wrapper)
- Add test_verify_draft_tokens.py (unit tests)
- Modify speculate_verify.cc (add persistent seed/offset)
Impact tags: [XPU] [OP]
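📝 PR convention check
The PR description does not follow the official template and is missing the following required sections:
Title suggestion (the current title already has a valid tag):
- Current title: `[XPU] add verify draft tokens` ✓
Description suggestion (please add the following):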
## Motivation
Add a draft token verification operator for XPU hardware to support speculative decoding. The XPU version is implemented with reference to PR #6685.
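## Modifications
1. Add `verify_draft_token.cc`: host-side operator implementation supporting three verification strategies
   - TOPP (0): checks whether the draft token is in the top-p candidate set
   - GREEDY (1): checks whether the draft token equals the argmax token
   - TARGET_MATCH (2): checks whether the draft token equals the token sampled by the target model
2. Add `verify_draft_tokens.xpu`: XPU3 kernel implementation, including the VerifyContext struct and the verification logic
3. Add `verify_draft_tokens.cpp`: CPU/XPU3 wrapper implementation
4. Add `test_verify_draft_tokens.py`: unit tests (including a Python reference implementation)
5. Modify `speculate_verify.cc`: use a persistent seed/offset instead of a fixed random seed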
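
The three strategies listed above can be sketched in plain Python as follows. This is a simplified illustration, not the operator's code: the `TargetInfo` container, the function names, and the rule that acceptance stops at the first rejected draft token are assumptions (the last one is the usual speculative decoding convention), and verify_window, EOS handling, and batching are left out.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Sequence


class VerifyStrategy(IntEnum):
    TOPP = 0          # numeric values follow the list above
    GREEDY = 1
    TARGET_MATCH = 2


@dataclass
class TargetInfo:
    """Hypothetical per-position view of the target model's output."""
    top_p_candidates: List[int]  # token ids inside the top-p candidate set
    argmax_token: int            # highest-scoring token at this position
    sampled_token: int           # token the target model actually sampled


def accepts(strategy: VerifyStrategy, draft_token: int, target: TargetInfo) -> bool:
    """Return True if a single draft token passes the chosen strategy."""
    if strategy == VerifyStrategy.TOPP:
        return draft_token in target.top_p_candidates
    if strategy == VerifyStrategy.GREEDY:
        return draft_token == target.argmax_token
    if strategy == VerifyStrategy.TARGET_MATCH:
        return draft_token == target.sampled_token
    raise ValueError(f"unknown strategy: {strategy}")


def accepted_prefix(
    strategy: VerifyStrategy,
    draft_tokens: Sequence[int],
    targets: Sequence[TargetInfo],
) -> List[int]:
    """Accept draft tokens in order and stop at the first rejection (assumed)."""
    accepted: List[int] = []
    for draft_token, target in zip(draft_tokens, targets):
        if not accepts(strategy, draft_token, target):
            break
        accepted.append(draft_token)
    return accepted
```

For the same draft sequence, TOPP is the most permissive of the three in this sketch, because a draft token only has to fall inside the candidate set rather than match one specific token.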
## Usage or Command
```python
from fastdeploy.model_executor.ops.xpu import verify_draft_tokens
verify_draft_tokens(
    step_output_ids,      # output token ids
    step_output_len,      # output length
    step_input_ids,       # input draft tokens
    target_tokens,        # target model output (GREEDY/TARGET_MATCH)
    candidate_ids,        # candidate token ids (TOPP/GREEDY)
    candidate_scores,     # candidate scores (TOPP)
    candidate_lens,       # candidate lengths (TOPP/GREEDY)
    topp,                 # top-p threshold
    stop_flags,           # stop flags
    seq_lens_encoder,     # encoder sequence lengths
    seq_lens_this_time,   # sequence lengths for the current step
    end_tokens,           # EOS tokens
    is_block_step,        # whether this is a block step
    cu_seqlens_q_output,  # cumulative sequence lengths
    reasoning_status,     # reasoning status
    max_dec_len,          # maximum decode length
    step_idx,             # current step index
    max_seq_len=4096,
    verify_window=4,
    verify_strategy=0,    # 0=TOPP, 1=GREEDY, 2=TARGET_MATCH
    reject_all=False,
    accept_all=False
)
```
## Accuracy Tests
The unit test test_verify_draft_tokens.py has been added; it compares the XPU implementation against a Python reference implementation.
## Checklist
- Add at least a tag in the PR title.
- Format your code, run `pre-commit` before commit. (CI PASSED)
- Add unit tests. Added test_verify_draft_tokens.py with reference implementation.
- Provide accuracy results. (to be added)
- If the current PR is submitting to the `release` branch... (N/A)
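### Issues
| Severity | File | Summary |
|------|------|------|
| 🟡 Suggestion | - | The PR description is incomplete: the Motivation/Modifications/Usage/Accuracy Tests sections are missing |
### Overall assessment
The implementation logic is correct and comes with a complete Python reference implementation and unit tests. The VerifyContext struct is implemented consistently in the XPU kernel and the wrapper, and the EOS/max_dec_len detection logic in emit_token/emit_final_token is correct. Using a persistent seed/offset keeps the random number quality aligned with the GPU version. The main issue is that the PR description does not follow the template; please fill in the Motivation/Modifications/Usage/Accuracy Tests sections.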
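
As a rough illustration of the kind of EOS/max_dec_len check the assessment refers to, the sketch below appends a token to a sequence and decides whether decoding should stop. It is an assumption-laden simplification, not the kernel's emit_token: the `SequenceState` fields, the return convention, and the exact stop condition (`step_idx >= max_dec_len`) are hypothetical, and the real implementation may, for example, rewrite the final token when the length limit is hit.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SequenceState:
    """Hypothetical per-sequence decoding state."""
    end_tokens: Set[int]   # EOS token ids
    max_dec_len: int       # maximum number of decoded tokens
    step_idx: int = 0      # tokens emitted so far
    stopped: bool = False  # mirrors a per-sequence stop flag
    output: List[int] = field(default_factory=list)


def emit_token(state: SequenceState, token: int) -> bool:
    """Append one token and return True once the sequence must stop."""
    if state.stopped:
        return True  # nothing is emitted after a stop
    state.output.append(token)
    state.step_idx += 1
    # Stop on an EOS token or when the decode length limit is reached (assumed rule).
    if token in state.end_tokens or state.step_idx >= state.max_dec_len:
        state.stopped = True
    return state.stopped
```

A caller would typically emit accepted draft tokens one by one and break out of the loop as soon as `emit_token` returns True, so that no tokens are written past an EOS or the length limit.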
* [XPU] cherry-pick PR-6947
* [XPU] use unified_update_model_status.
* refactor xpu_model_runner.
* refactor sampler.
* fix codestyle.
* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.
* fix codestyle.
* replace output_padding_offset with is_speculative flag in gather_next_token.
* rename hiddden_states.
* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
add verify draft tokens
#6685
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`,`[Scheduler]`,`[PD Disaggregation]`,`[Executor]`,`[Graph Optimization]`,`[Speculative Decoding]`,`[RL]`,`[Models]`,`[Quantization]`,`[Loader]`,`[OP]`,`[KVCache]`,`[DataProcessor]`,`[BugFix]`,`[Docs]`,`[CI]`,`[Optimization]`,`[Feature]`,`[Benchmark]`,`[Others]`,`[XPU]`,`[HPU]`,`[GCU]`,`[DCU]`,`[Iluvatar]`,`[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.