[XPU] Unify Spec and non-spec branch. (#6947) #7180
Jiang-Jia-Jun merged 11 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR is a cherry-pick of #6685. Its goal is to unify the execution and post-processing paths of the Spec / non-Spec branches on the XPU backend and to add draft-token verification for XPU, aligning it with the unified architecture on the GPU side.
Changes:
- XPU ModelRunner: unifies the speculative-method field and the proposer initialization/invocation paths, and hooks post-processing into unified_update_model_status.
- XPU SpeculativeSampler: splits routing into "naive sampling" and "verify + sampling", adding a verify_draft_tokens call chain.
- Adds a custom XPU operator verify_draft_tokens (C++ wrapper + XPU3 kernel) with a corresponding unit test.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Unifies spec_method naming and the proposer init/run logic; adjusts share_inputs (adds reasoning_status, etc.) and switches post-processing to unified_update_model_status |
| fastdeploy/model_executor/xpu_pre_and_post_process.py | Migrates speculative post-processing from speculate_update/speculate_set_value_by_flags_and_idx to unified_update_model_status and adds is_naive_mode/prefill_one_step_stop parameters |
| fastdeploy/model_executor/layers/sample/sampler.py | Refactors the XPU speculative sampling path, routing between naive sampling and verify_draft_tokens verification sampling |
| custom_ops/xpu_ops/test/test_verify_draft_tokens.py | Adds a test comparing the verify_draft_tokens kernel against a reference implementation |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/verify_draft_tokens.cpp | Adds the XPU plugin wrapper for verify_draft_tokens (including a CPU wrapper and the XPU3 launch) |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/verify_draft_tokens.xpu | Adds the XPU3 verify_draft_tokens kernel implementation |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Exports the verify_draft_tokens plugin API declaration |
| custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Exposes verify_draft_tokens to the Python side |
| custom_ops/xpu_ops/src/ops/mtp/verify_draft_token.cc | Adds the Paddle extension-op wrapper and argument validation for verify_draft_tokens |
WRAPPER_CHECK_PTR(ctx, float, real_bsz, curand_states);
WRAPPER_CHECK_PTR(ctx, float, real_bsz, topp);
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, stop_flags);
WRAPPER_CHECK_PTR(ctx, int, real_bsz, seq_lens_encoder);
WRAPPER_CHECK_PTR(ctx, float, real_bsz, seq_lens_this_time);
WRAPPER_CHECK_PTR(ctx, int64_t, end_length, end_tokens);
...
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, is_block_step);
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, cu_seqlens_q_output);
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, reasoning_status);
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, max_dec_len);
WRAPPER_CHECK_PTR(ctx, bool, real_bsz, step_idx);
In the new verify_draft_tokens wrapper, the type arguments passed to WRAPPER_CHECK_PTR do not match the actual pointer types: seq_lens_this_time is an int* but is checked as float, and cu_seqlens_q_output/reasoning_status/max_dec_len/step_idx are all checked as bool. This makes the wrapper's argument validation wrong and may, in the worst case, cause compile-time or runtime problems. The checks should be corrected to the actual types (int for seq_lens_this_time/cand_lens/seq_lens_encoder/…; int for cu_seqlens_q_output/reasoning_status; int64_t for max_dec_len/step_idx).
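To make the mismatch explicit, here is a small Python summary comparing the types currently passed to WRAPPER_CHECK_PTR in the diff above with the types the review proposes. This is an illustration only; the actual fix is editing the C++ macro calls.

```python
# Element types the review proposes for each wrapper argument
# (argument names are taken from the diff above).
expected = {
    "seq_lens_this_time": "int",
    "cu_seqlens_q_output": "int",
    "reasoning_status": "int",
    "max_dec_len": "int64_t",
    "step_idx": "int64_t",
}

# Types currently passed to WRAPPER_CHECK_PTR in the diff above.
current = {
    "seq_lens_this_time": "float",
    "cu_seqlens_q_output": "bool",
    "reasoning_status": "bool",
    "max_dec_len": "bool",
    "step_idx": "bool",
}

# Every one of these arguments is checked with the wrong type.
mismatches = sorted(k for k in expected if expected[k] != current[k])
print(mismatches)
```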
output_token =
    topp_sampling_kernel(candidate_ids_now + i * max_candidate_len,
                         candidate_scores_now + i * max_candidate_len,
                         curand_states + i,
In the cpu_wrapper, the TOPP Phase 2 call to topp_sampling_kernel uses curand_states + i (where i is the token position), but the host-side VerifyDraftTokens only allocates bsz random states (one per batch). For i > 0 this reads out of bounds, so the CPU fallback path hits undefined behavior or crashes. The semantics of curand_states should be made consistent (per batch vs. per step position), and the allocation length must match the access pattern (e.g. change to curand_states + bid, or allocate max_step_tokens states and offset by i in the XPU3 kernel as well).
- curand_states + i,
+ curand_states + bid,
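A toy Python model of the indexing bug described above: the host allocates one random state per batch entry, but the loop indexes by global token position, which walks past the end of the buffer. The names bsz and tokens_per_batch are illustrative, not taken from the kernel.

```python
bsz = 2
tokens_per_batch = 4                             # i ranges over 0..7
curand_states = [object() for _ in range(bsz)]   # per-batch allocation

def sample_ok(bid):
    # Correct: index by batch id, always within the bsz states allocated.
    return curand_states[bid]

oob = []
for bid in range(bsz):
    for step in range(tokens_per_batch):
        i = bid * tokens_per_batch + step
        sample_ok(bid)                  # never out of bounds
        if i >= len(curand_states):
            oob.append(i)               # buggy `curand_states + i` reads here
print(oob)
```

Every token position past the first bsz entries would read past the allocated buffer, which is why only i = 0 happens to be safe on the CPU path.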
pass
# if not paddle.is_compiled_with_cuda():
#     self.skipTest("Requires CUDA")
The unit test's setUp does not skip based on whether XPU support is compiled in (the relevant logic is commented out), yet the top of the file falls back to CPUPlace when XPU is not available and then still calls the verify_draft_tokens custom op, so the test can fail or crash outright in environments without XPU. The skip should be restored in setUp and changed to check paddle.is_compiled_with_xpu() (or at least skip on non-XPU environments).
- pass
- # if not paddle.is_compiled_with_cuda():
- #     self.skipTest("Requires CUDA")
+ if not paddle.is_compiled_with_xpu():
+     self.skipTest("Requires XPU")
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7180 +/- ##
==========================================
Coverage ? 73.83%
==========================================
Files ? 394
Lines ? 54779
Branches ? 8581
==========================================
Hits ? 40446
Misses ? 11609
Partials ? 2724
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
custom_ops/xpu_ops/src/ops/gather_next_token.cc:134
- GatherNextTokenInferShape hard-codes bsz to 0 on the non-speculative branch (and ignores the max_bsz op attribute), so static shape inference is wrong (the out dimension becomes [0, dim]). max_bsz should be passed through as an infer_shape parameter and used in the returned shape (or [-1, dim] should be returned as a dynamic dimension).
int64_t bsz = 0;
int64_t dim_embed = x_shape[1];
if (is_speculative) {
return {{-1, dim_embed}};
} else {
return {{bsz, dim_embed}};
}
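The suggested fix can be modeled in Python: return a dynamic first dimension for the speculative branch and use the max_bsz attribute (passed through to infer_shape) for the non-speculative branch, instead of the hard-coded 0. This is a sketch of the shape logic only, not the real C++ op.

```python
def gather_next_token_infer_shape(x_shape, is_speculative, max_bsz):
    # x_shape is [num_tokens, dim_embed]; only dim_embed is needed here.
    dim_embed = x_shape[1]
    if is_speculative:
        return [-1, dim_embed]   # dynamic batch dimension
    return [max_bsz, dim_embed]  # was [0, dim_embed] in the buggy version

print(gather_next_token_infer_shape([128, 4096], False, 8))
print(gather_next_token_infer_shape([128, 4096], True, 8))
```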
…utput/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.
PaddlePaddle-bot left a comment
📋 Review summary
PR overview: unifies the speculative and non-speculative branches on the XPU platform, aligning with the GPU implementation.
Scope of changes: custom_ops/xpu_ops/, fastdeploy/model_executor/, fastdeploy/worker/
Impact tags: [XPU] [Speculative Decoding]
📝 PR checklist
| Item | Status |
|---|---|
| Title tag | ✅ Contains [XPU] |
| Motivation | ✅ Filled in |
| Modifications | ✅ Filled in |
| Checklist | ❌ Not filled in |
Please complete the testing and accuracy-test items in the Checklist.
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/worker/xpu_model_runner.py:1467 | The XPU platform does not support the NGRAM/SUFFIX methods; a compatibility check is missing |
| ❓ Question | fastdeploy/worker/xpu_model_runner.py:1084 | The reasoning_status parameter is added but never updated |
Overall assessment
The refactoring is clean overall: it unifies the XPU speculative and non-speculative branches and aligns with the GPU implementation. However, there is a platform-compatibility issue: NgramProposer and SuffixProposer use CUDA-specific code, so configuring the NGRAM or SUFFIX method on XPU will fail at runtime. A platform/method compatibility check should be added.
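The compatibility check the review asks for could look roughly like the sketch below. The method names, the "xpu" platform string, and the function shape are assumptions for illustration; the actual check would live in the XPU ModelRunner's proposer setup.

```python
# CUDA-only proposer methods, per the review (NgramProposer / SuffixProposer).
CUDA_ONLY_METHODS = {"ngram", "suffix"}

def check_speculative_method(method: str, platform: str) -> None:
    """Fail fast when a CUDA-only speculative method is configured on XPU."""
    if platform == "xpu" and method in CUDA_ONLY_METHODS:
        raise NotImplementedError(
            f"Speculative method '{method}' relies on CUDA-specific code "
            f"and is not supported on {platform}."
        )

check_speculative_method("mtp", "xpu")      # supported, passes silently
try:
    check_speculative_method("ngram", "xpu")
except NotImplementedError as e:
    print("rejected:", e)
```

Failing at configuration time gives a clear error instead of the deferred runtime failure the review describes.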
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit. For a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.