[XPU] add verify draft tokens #6947

Merged
Jiang-Jia-Jun merged 12 commits into PaddlePaddle:develop from cmcamdy:xpu_mtp_kernel
Apr 15, 2026

Collaborator

@cmcamdy cmcamdy commented Mar 20, 2026

add verify draft tokens
#6685

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Mar 20, 2026

Thanks for your contribution!


codecov-commenter commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@ea998dd).

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6947   +/-   ##
==========================================
  Coverage           ?   73.51%           
==========================================
  Files              ?      383           
  Lines              ?    53614           
  Branches           ?     8412           
==========================================
  Hits               ?    39417           
  Misses             ?    11516           
  Partials           ?     2681           
Flag Coverage Δ
GPU 73.51% <ø> (?)


Comment thread custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc Outdated
PaddlePaddle-bot

This comment was marked as outdated.

hong19860320
hong19860320 previously approved these changes Apr 13, 2026
Collaborator

@hong19860320 hong19860320 left a comment

LGTM

Contributor

Copilot AI left a comment

Pull request overview

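This PR adds a plugin/operator implementation of verify_draft_tokens on the XPU (Kunlun3) side, exposes it to Python through pybind, and adds the corresponding unit tests. The operator is used in speculative decoding to verify draft tokens and to produce the Phase 2 sampled output.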

Changes:

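  • Add the XPU3 kernel verify_draft_tokens.xpu, implementing the Phase 1 verify and Phase 2 output logic
  • Add the plugin wrapper, the Paddle extension operator, and the pybind export, which together provide the external verify_draft_tokens entry point
  • Add fairly thorough Python unit tests covering different verify_strategy and verify_window values, EOS/max_dec_len handling, skip conditions, and more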

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

| File | Description |
|------|------|
| custom_ops/xpu_ops/test/test_verify_draft_tokens.py | Adds Python unit tests and a reference implementation for verify_draft_tokens |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/verify_draft_tokens.cpp | Adds the verify_draft_tokens plugin wrapper (CPU/XPU3 wrapper plus argument validation) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/verify_draft_tokens.xpu | Adds the Kunlun3 XPU kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Adds the verify_draft_tokens plugin function declaration |
| custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Adds the pybind export for verify_draft_tokens |
| custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc | Adds the host-side Paddle extension operator implementation and argument validation |
| custom_ops/xpu_ops/src/ops/mtp/speculate_verify.cc | Adjusts the random number generation logic of speculate_verify (persistent seed/offset) |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_verify.cpp | Removes duplicated typedef declarations |

Comment thread custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc
Comment thread custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc Outdated
Comment thread custom_ops/xpu_ops/src/ops/mtp/speculate_verify.cc Outdated
Comment thread custom_ops/xpu_ops/test/test_verify_draft_tokens.py
Comment thread custom_ops/xpu_ops/test/test_verify_draft_tokens.py
Comment thread custom_ops/xpu_ops/test/test_verify_draft_tokens.py

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-14

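📋 Review summary

PR overview: adds the verify_draft_tokens operator for XPU hardware, supporting three verification strategies (TOPP, GREEDY, TARGET_MATCH) for draft token verification in speculative decoding.

Scope of changes: custom_ops/xpu_ops/ (7 files added, 1 file modified)

  • Added verify_draft_token.cc (host-side implementation)
  • Added verify_draft_tokens.xpu (XPU kernel)
  • Added verify_draft_tokens.cpp (wrapper)
  • Added test_verify_draft_tokens.py (unit tests)
  • Modified speculate_verify.cc (adds persistent seed/offset)

Impact tags: [XPU] [OP]

📝 PR convention check

The PR description does not follow the official template and is missing the following required sections:

Title suggestion (the current title already has a valid tag):

  • Current title: [XPU] add verify draft tokens

Description suggestion (please add the following content):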

## Motivation

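Add a draft token verification operator for XPU hardware to support speculative decoding. This is the XPU version, implemented with reference to PR #6685.

## Modifications

1. Add `verify_draft_token.cc`: host-side operator implementation, supporting three verification strategies
   - TOPP (0): checks whether the draft token is in the top-p candidate set
   - GREEDY (1): checks whether the draft token equals the argmax token
   - TARGET_MATCH (2): checks whether the draft token equals the token sampled by the target model

2. Add `verify_draft_tokens.xpu`: XPU3 kernel implementation, including the VerifyContext struct and the verification logic

3. Add `verify_draft_tokens.cpp`: CPU/XPU3 wrapper implementation

4. Add `test_verify_draft_tokens.py`: unit tests (including a Python reference implementation)

5. Modify `speculate_verify.cc`: use a persistent seed/offset instead of a fixed random seed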
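The three strategies listed in item 1 can be illustrated with a minimal Python sketch. This is not the operator's code: `VerifyStrategy`, `accepts`, `accepted_prefix_len`, and all parameter names are hypothetical, and the "accept drafts until the first rejection" loop is an assumption based on how speculative decoding usually works, not something stated in this PR.

```python
from enum import IntEnum
from typing import Sequence


class VerifyStrategy(IntEnum):
    # Numeric values follow the strategy list in the PR description.
    TOPP = 0
    GREEDY = 1
    TARGET_MATCH = 2


def accepts(
    strategy: int,
    draft_token: int,
    top_p_candidates: Sequence[int],
    argmax_token: int,
    sampled_token: int,
) -> bool:
    """Return True if one draft token passes the chosen strategy (hypothetical helper)."""
    if strategy == VerifyStrategy.TOPP:
        # Accept when the draft token is inside the top-p candidate set.
        return draft_token in top_p_candidates
    if strategy == VerifyStrategy.GREEDY:
        # Accept when the draft token equals the argmax token.
        return draft_token == argmax_token
    if strategy == VerifyStrategy.TARGET_MATCH:
        # Accept when the draft token equals the target model's sampled token.
        return draft_token == sampled_token
    raise ValueError(f"unknown verify_strategy: {strategy}")


def accepted_prefix_len(
    strategy: int,
    draft_tokens: Sequence[int],
    top_p_candidates: Sequence[Sequence[int]],
    argmax_tokens: Sequence[int],
    sampled_tokens: Sequence[int],
) -> int:
    """Count draft tokens accepted before the first rejection (assumed behaviour)."""
    accepted = 0
    for i, token in enumerate(draft_tokens):
        if not accepts(strategy, token, top_p_candidates[i],
                       argmax_tokens[i], sampled_tokens[i]):
            break
        accepted += 1
    return accepted
```

For example, with drafts `[5, 8, 1]`, a strict strategy such as GREEDY rejects as soon as a draft differs from the argmax token, while TOPP keeps accepting as long as each draft stays inside its candidate set.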

## Usage or Command

```python
from fastdeploy.model_executor.ops.xpu import verify_draft_tokens

verify_draft_tokens(
    step_output_ids,      # output token ids
    step_output_len,      # output lengths
    step_input_ids,       # input draft tokens
    target_tokens,        # target model output (GREEDY/TARGET_MATCH)
    candidate_ids,        # candidate token ids (TOPP/GREEDY)
    candidate_scores,     # candidate scores (TOPP)
    candidate_lens,       # candidate lengths (TOPP/GREEDY)
    topp,                 # top-p threshold
    stop_flags,           # stop flags
    seq_lens_encoder,     # encoder sequence lengths
    seq_lens_this_time,   # sequence lengths for this step
    end_tokens,           # EOS tokens
    is_block_step,        # whether this is a block step
    cu_seqlens_q_output,  # cumulative sequence lengths
    reasoning_status,     # reasoning status
    max_dec_len,          # maximum decode length
    step_idx,             # current step index
    max_seq_len=4096,
    verify_window=4,
    verify_strategy=0,    # 0=TOPP, 1=GREEDY, 2=TARGET_MATCH
    reject_all=False,
    accept_all=False
)
```

## Accuracy Tests

The unit test test_verify_draft_tokens.py has been added; it compares a Python reference implementation against the XPU implementation.

## Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit. (CI PASSED)
  • Add unit tests. Added test_verify_draft_tokens.py with reference implementation.
  • Provide accuracy results. (to be added)
  • If the current PR is submitting to the release branch... (N/A)

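### Issues

| Severity | File | Summary |
|------|------|------|
| 🟡 Suggestion | - | The PR description is incomplete: the Motivation/Modifications/Usage/Accuracy Tests sections are missing |

### Overall assessment

The implementation logic is correct and comes with a complete Python reference implementation and unit tests. The VerifyContext struct is implemented consistently in the XPU kernel and the wrapper, and the EOS/max_dec_len detection in emit_token/emit_final_token is correct. Using a persistent seed/offset keeps random number quality aligned with the GPU version. The main issue is that the PR description does not follow the template; please fill in the Motivation, Modifications, Usage, and Accuracy Tests sections.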
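To make the EOS/max_dec_len detection mentioned in the assessment concrete, here is a rough Python sketch of what such a check could look like. It is an illustration only: `SequenceState` and this `emit_token` are hypothetical stand-ins, and the exact stop conditions (stopping when the token is in `end_tokens` or when `step_idx` reaches `max_dec_len`) are assumptions rather than the kernel's actual logic.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SequenceState:
    # Hypothetical per-sequence state; field names mirror the op's arguments.
    end_tokens: Tuple[int, ...]
    max_dec_len: int
    step_idx: int = 0
    stopped: bool = False
    output: List[int] = field(default_factory=list)


def emit_token(state: SequenceState, token: int) -> bool:
    """Append one token and return True if the sequence should stop.

    Assumed rule: stop on an EOS token or once step_idx reaches max_dec_len.
    """
    if state.stopped:
        # Nothing is emitted after the sequence has already stopped.
        return True
    state.step_idx += 1
    state.output.append(token)
    if token in state.end_tokens or state.step_idx >= state.max_dec_len:
        state.stopped = True
    return state.stopped
```

In a verify loop, the caller would stop emitting accepted draft tokens for a sequence as soon as `emit_token` returns True.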

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 13b9fe7 into PaddlePaddle:develop Apr 15, 2026
34 of 38 checks passed
Jiang-Jia-Jun pushed a commit that referenced this pull request Apr 16, 2026
* [XPU] cherry-pick PR-6947

* [XPU] use unified_update_model_status.

* refactor xpu_model_runner.

* refactor sampler.

* fix codestyle.

* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
  WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.

* fix codestyle.

* replace output_padding_offset with is_speculative flag in gather_next_token.

* rename hiddden_states.

* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
7 participants